<a href="https://colab.research.google.com/github/jaredmullane/LLM_Class/blob/main/TECH16_LLM_Lecture3_prepared.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install openai
!pip install sentence-transformers
!pip install langchain langchain-community pypdf langchain-openai #tiktoken chromadb

# Embeddings

In [None]:
import pandas as pd

df = pd.read_csv('http://bit.ly/dataset-sst2',
                 nrows=100, sep='\t', names=['text', 'label'])

df['label'] = df['label'].replace({0: 'negative', 1: 'positive'})

In [None]:
df.head()

,text,label
0,"a stirring , funny and finally transporting re...",positive
1,apparently reassembled from the cutting room f...,negative
2,they presume their audience wo n't sit still f...,negative
3,this is a visually stunning rumination on love...,positive
4,jonathan parker 's bartleby should have been t...,positive


In [None]:
from sentence_transformers import SentenceTransformer

sentence_bert_model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')

def get_embeddings(sentences):
    return sentence_bert_model.encode(sentences,
                                    batch_size=32,
                                    show_progress_bar=True)

In [None]:
# Pass a plain list of strings to the encoder
e = get_embeddings(df['text'].tolist())


In [None]:
# Convert the NumPy array of embeddings into a data frame
embedding_df = pd.DataFrame(e)

# Save the data frame as a TSV file without index or header
embedding_df.to_csv('output.tsv', sep='\t', index=False, header=False)

In [None]:
embedding_df.head()

,0,1,2,3,4,5,6,7,8,9,...,758,759,760,761,762,763,764,765,766,767
0,0.007687,0.220588,1.107231,-0.805425,-0.189138,0.175037,-0.095774,-1.149741,0.614675,-0.406313,...,-0.091129,-0.930856,-0.770217,0.830264,-0.214198,0.400544,0.974435,-0.599894,0.419343,-0.242024
1,0.13185,-1.051062,0.379231,-0.066255,-0.299081,-0.981428,0.95828,-0.036092,-0.081783,-0.722954,...,0.048544,-0.194993,0.429362,1.433621,0.30352,-0.21746,-0.337119,-0.513364,0.613231,0.939613
2,0.146425,0.026571,0.910159,-0.720145,-0.05732,0.184523,-0.354749,-0.819164,0.438382,-0.083151,...,-0.95381,0.045402,-0.156494,-0.422207,-0.472993,0.141547,0.079621,-0.068062,-0.347549,-0.200186
3,0.725216,0.400027,0.550635,-0.415071,0.321567,0.476579,-0.268125,0.285532,-0.096131,-0.754695,...,-0.834957,-0.458781,-0.016415,0.061812,-0.027487,0.248662,0.33331,-0.202887,-0.089137,-0.841737
4,0.38616,0.437597,-0.229873,-0.328631,0.146723,0.166341,0.614475,-0.451549,-0.415271,0.054353,...,-0.144842,-0.170682,0.482448,0.865259,0.456433,-0.029702,0.791408,0.490984,0.352877,0.135437


In [None]:
# Save dataframe without any index
df.to_csv('metadata.tsv', index=False, sep='\t')

Embeddings projector link: https://projector.tensorflow.org/
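
The Projector is useful for seeing which sentences land near each other. A common way to quantify how close two embeddings are is cosine similarity. The sketch below computes it in plain Python on made-up 3-dimensional vectors (the embeddings above have 768 dimensions); the function name `cosine_similarity` and the toy vectors are illustrative assumptions, not part of the notebook.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for sentence embeddings (hypothetical values)
v_pos_1 = [1.0, 2.0, 0.0]
v_pos_2 = [2.0, 4.0, 0.0]    # same direction as v_pos_1
v_neg = [-1.0, -2.0, 0.0]    # opposite direction to v_pos_1

print(cosine_similarity(v_pos_1, v_pos_2))  # close to 1: same direction
print(cosine_similarity(v_pos_1, v_neg))    # close to -1: opposite direction
```

Values near 1 mean the vectors point the same way, values near 0 mean they are unrelated, and values near -1 mean they point in opposite directions.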

# Standard imports

In [None]:
from openai import OpenAI
from google.colab import userdata

open_ai_key = userdata.get('OpenAI')
client = OpenAI(api_key=open_ai_key)


# LangChain & summarizing PDFs

In [None]:
# Get a PDF
!wget https://arxiv.org/pdf/2401.16212.pdf

--2024-02-14 19:48:08--  https://arxiv.org/pdf/2401.16212.pdf
Resolving arxiv.org (arxiv.org)... 151.101.67.42, 151.101.195.42, 151.101.131.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.67.42|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 512852 (501K) [application/pdf]
Saving to: ‘2401.16212.pdf’


2024-02-14 19:48:08 (9.95 MB/s) - ‘2401.16212.pdf’ saved [512852/512852]



In [None]:
!ls

article_increasingreturns.pdf  meta_test.tsv  output_test.tsv  report.pdf
metadata.tsv		       meta.tsv       output.tsv       sample_data


In [None]:
!wget https://www.morganstanley.com/im/publication/insights/articles/article_increasingreturns.pdf

--2024-02-14 19:20:32--  https://www.morganstanley.com/im/publication/insights/articles/article_increasingreturns.pdf
Resolving www.morganstanley.com (www.morganstanley.com)... 104.115.172.233
Connecting to www.morganstanley.com (www.morganstanley.com)|104.115.172.233|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/pdf]
Saving to: ‘article_increasingreturns.pdf.1’

article_increasingr     [ <=>                ] 570.73K  --.-KB/s    in 0.1s    

2024-02-14 19:20:33 (4.78 MB/s) - ‘article_increasingreturns.pdf.1’ saved [584426]



In [None]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("2401.16212.pdf")
pages = loader.load_and_split()

In [None]:
pages[0]

Document(page_content='Better Call GPT, Comparing Large Language Models Against Lawyers\nLAUREN MARTIN, NICK WHITEHOUSE, STEPHANIE YIU, LIZZIE CATTERSON, RIVINDU\nPERERA, AI Center of Excellence, Onit Inc., New Zealand\nThis paper presents a groundbreaking comparison between Large Language Models (LLMs) and traditional legal contract review-\ners—Junior Lawyers and Legal Process Outsourcers (LPOs). We dissect whether LLMs can outperform humans in accuracy, speed,\nand cost-efficiency during contract review. Our empirical analysis benchmarks LLMs against a ground truth set by Senior Lawyers,\nuncovering that advanced models match or exceed human accuracy in determining legal issues. In speed, LLMs complete reviews in\nmere seconds, eclipsing the hours required by their human counterparts. Cost-wise, LLMs operate at a fraction of the price, offering a\nstaggering 99.97 percent reduction in cost over traditional methods. These results are not just statistics—they signal a seismic shift in

In [None]:
from langchain.chains.summarize import load_summarize_chain
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(temperature=0.1, model_name="gpt-4-turbo-preview", api_key=open_ai_key)
chain = load_summarize_chain(llm, chain_type="stuff")

chain.run(pages[0:3])

'This paper, titled "Better Call GPT, Comparing Large Language Models Against Lawyers" by Lauren Martin et al., from Onit Inc., New Zealand, explores the effectiveness of Large Language Models (LLMs) in legal contract review tasks, comparing their performance with Junior Lawyers and Legal Process Outsourcers (LPOs). The study focuses on accuracy, speed, and cost-efficiency, finding that LLMs can match or surpass human accuracy in identifying legal issues, significantly outpace humans in speed by completing reviews in seconds, and offer a dramatic cost reduction of 99.97% over traditional methods. These findings suggest a potential paradigm shift in legal practice towards the adoption of LLMs for contract review, promising increased accessibility and efficiency in legal services. The research contributes to the understanding of LLMs\' capabilities and limitations in the legal domain, indicating a future where LLMs could dominate legal contract review processes.'

In [None]:
from langchain.chains.combine_documents.stuff import StuffDocumentsChain
from langchain.chains.llm import LLMChain
from langchain.prompts import PromptTemplate

# Define prompt
prompt_template = """Write a concise summary in a maximum of 3 bullets of the following text enclosed within three backticks:
```{text}```
CONCISE SUMMARY:"""
prompt = PromptTemplate.from_template(prompt_template)

# Define LLM chain
llm = ChatOpenAI(temperature=0, model_name="gpt-4-turbo-preview", api_key=open_ai_key)
llm_chain = LLMChain(llm=llm, prompt=prompt)

# Define StuffDocumentsChain
stuff_chain = StuffDocumentsChain(llm_chain=llm_chain, document_variable_name="text")

print(stuff_chain.run(pages[0:3]))

- The paper evaluates Large Language Models (LLMs) against Junior Lawyers and Legal Process Outsourcers (LPOs) in legal contract review, focusing on accuracy, speed, and cost-efficiency, finding that LLMs match or exceed human accuracy, complete reviews significantly faster, and operate at a much lower cost.
- Empirical analysis shows LLMs can provide a 99.97 percent reduction in cost over traditional methods, indicating a potential for significant disruption in the legal industry by enhancing the accessibility and efficiency of legal services.
- The research highlights a shift towards LLM dominance in legal contract review, suggesting a need for reimagined legal workflows and the practical effectiveness of LLMs in real-world legal tasks.
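
The "stuff" approach used in the two cells above is conceptually simple: the text of every document is concatenated and placed into a single prompt. A rough sketch of that assembly step, with no LLM call, follows. The helper name `build_stuff_prompt`, the shortened template, and the `"\n\n"` separator are assumptions made for illustration; LangChain's internals may differ in detail.

```python
def build_stuff_prompt(page_texts, template, separator="\n\n"):
    """Join all page texts and substitute them into one prompt string."""
    return template.format(text=separator.join(page_texts))

# Shortened, hypothetical template with the same {text} placeholder
template = "Write a concise summary of the following text:\n{text}\nCONCISE SUMMARY:"

# Hypothetical page contents standing in for pages[0:3]
page_texts = ["First page text.", "Second page text.", "Third page text."]

prompt = build_stuff_prompt(page_texts, template)
print(prompt)
```

Because everything goes into one prompt, this approach only works while the combined text fits in the model's context window.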


# RAG

In [None]:
!pip install llama-index --upgrade

In [None]:
!pip install pypdf

In [None]:
!wget https://www.goldmansachs.com/intelligence/pages/gs-research/2024-us-equity-outlook-all-you-had-to-do-was-stay/report.pdf

In [None]:
import os
os.environ["OPENAI_API_KEY"] = open_ai_key

In [None]:
# Import necessary classes from the llama_index package
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

# Load only "report.pdf". The first argument of load_data() is a progress-bar flag,
# not a file name, so the file is selected with input_files instead.
documents = SimpleDirectoryReader(input_files=["report.pdf"]).load_data()

# Build a vector index: the documents are split into chunks and each chunk is embedded
# so that it can be retrieved by semantic similarity.
index = VectorStoreIndex.from_documents(documents)

# Wrap the index in a query engine. For each question it retrieves the most relevant
# chunks and passes them to the LLM, which writes the answer.
query_engine = index.as_query_engine()

# Ask a question about the report.
response = query_engine.query("What is the 2024 outlook for US GDP?")

# Print the answer generated from the retrieved chunks.
print(response)



The economists forecast above-consensus full-year GDP growth of 2.1% in 2024. However, this view is already reflected in current equity prices. Despite many economists forecasting a recession, the performance of cyclical stocks vs. defensive stocks is consistent with a 2% real GDP growth regime.
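
Behind `query_engine.query`, a RAG pipeline typically retrieves the chunks most similar to the question and hands them to the LLM as context. The sketch below imitates only the retrieval step in plain Python, scoring chunks by shared words as a stand-in for embedding similarity. The chunk texts and the helpers `tokenize`, `score`, and `retrieve` are made up for illustration and are not LlamaIndex APIs.

```python
import re

def tokenize(text):
    """Lowercase a string and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def score(query, chunk):
    """Count words shared by query and chunk (stand-in for vector similarity)."""
    return len(tokenize(query) & tokenize(chunk))

def retrieve(query, chunks, top_k=2):
    """Return the top_k chunks with the highest overlap score."""
    return sorted(chunks, key=lambda c: score(query, c), reverse=True)[:top_k]

# Hypothetical chunks; a real index would hold passages from report.pdf
chunks = [
    "GDP growth is forecast to be above consensus in 2024.",
    "Cyclical stocks have outperformed defensive stocks.",
    "The outlook for US GDP in 2024 remains positive.",
]

question = "What is the 2024 outlook for US GDP?"
context = retrieve(question, chunks, top_k=2)

# The retrieved chunks are placed in the prompt ahead of the question
prompt = "Context:\n" + "\n".join(context) + "\n\nQuestion: " + question
print(prompt)
```

A real system swaps the word-overlap score for similarity between embedding vectors, which lets it match passages that use different wording from the question.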


# Homework

1. Create a summary using LangChain and compare the "stuff" and "map_reduce" chain types.
2. Create a simple RAG system over a knowledge base of your choice.
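
For the first exercise, it may help to see the shape of the two strategies side by side. In the sketch below, `fake_summarize` is a hypothetical stand-in for an LLM call (it simply keeps the first sentence), so the code runs without an API key. The function names and sample documents are illustrative assumptions; in LangChain the second strategy is selected with `chain_type="map_reduce"`.

```python
calls = []  # records every text sent to the stand-in "LLM"

def fake_summarize(text):
    """Hypothetical stand-in for an LLM call: keep only the first sentence."""
    calls.append(text)
    return text.split(".")[0].strip() + "."

def stuff_summary(docs, summarize):
    """Stuff: concatenate every document and summarize once."""
    return summarize(" ".join(docs))

def map_reduce_summary(docs, summarize):
    """Map-reduce: summarize each document, then summarize the summaries."""
    partial = [summarize(d) for d in docs]   # map step
    return summarize(" ".join(partial))      # reduce step

docs = [
    "LLMs match lawyers on accuracy. Details follow.",
    "LLMs finish reviews in seconds. Humans take hours.",
]

stuff_result = stuff_summary(docs, fake_summarize)
stuff_calls = len(calls)

calls.clear()
map_reduce_result = map_reduce_summary(docs, fake_summarize)
map_reduce_calls = len(calls)

print(stuff_calls, map_reduce_calls)
```

Map-reduce makes one call per document plus a final combining call, so it uses more requests but keeps each prompt small, while "stuff" uses a single call whose prompt grows with the input.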