## Familiarization with LLM packages such as `LangChain`, `OpenAI` & `tiktoken`

In [1]:
import credentials

### 1. Tiktoken

https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb


Encodings specify how text is converted into tokens. Different models use different encodings.

`tiktoken` supports three encodings used by OpenAI models:

| Encoding name           | OpenAI models                                       |
|-------------------------|-----------------------------------------------------|
| `cl100k_base`           | `gpt-4`, `gpt-3.5-turbo`, `text-embedding-ada-002`  |
| `p50k_base`             | Codex models, `text-davinci-002`, `text-davinci-003`|
| `r50k_base` (or `gpt2`) | GPT-3 models like `davinci`                         |

In [2]:
import tiktoken

In [3]:
encoding = tiktoken.get_encoding("cl100k_base")
#encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

In [4]:
text = 'Tomorrow it will rain in Budapest.'
encoding.encode(text)

[91273, 433, 690, 11422, 304, 70695, 13]

In [5]:
text = 'TomorrowitwillraininBudapest.'
encoding.encode(text)

[91273, 275, 14724, 30193, 258, 33, 664, 28724, 13]

In [6]:
[encoding.decode_single_token_bytes(token) for token in [91273, 275, 14724, 30193, 258, 33, 664, 28724, 13]]

[b'Tomorrow', b'it', b'will', b'rain', b'in', b'B', b'ud', b'apest', b'.']

In [7]:
[bytes.decode(encoding.decode_single_token_bytes(token)) for token in [91273, 275, 14724, 30193, 258, 33, 664, 28724, 13]]

['Tomorrow', 'it', 'will', 'rain', 'in', 'B', 'ud', 'apest', '.']

In [8]:
encoding.decode([433])

' it'

In [9]:
encoding.decode([275])

'it'
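
The two decodes above show that a leading space is part of the token: `' it'` and `'it'` have different IDs. The sketch below illustrates that idea with a toy greedy longest-match tokenizer. It is not `tiktoken`'s algorithm (real encodings are learned with byte-pair encoding), and the vocabulary and IDs are made up for illustration.

```python
# Toy vocabulary with hypothetical IDs; not related to cl100k_base.
TOY_VOCAB = {"Tomorrow": 0, " it": 1, "it": 2, " will": 3, "will": 4, ".": 5}


def toy_encode(text, vocab=TOY_VOCAB):
    """Greedy longest-match tokenization over a fixed vocabulary."""
    tokens = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, so " it" is matched as one piece.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in vocab:
                tokens.append(vocab[piece])
                pos = end
                break
        else:
            raise ValueError(f"no token covers {text[pos:]!r}")
    return tokens


spaced = toy_encode("Tomorrow it will.")
joined = toy_encode("Tomorrowitwill.")
print(spaced)  # [0, 1, 3, 5]
print(joined)  # [0, 2, 4, 5]
```

Removing the spaces changes which vocabulary entries match, which is the same effect seen in cells 4 and 5.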

### 2. `LangChain`, `Pinecone` & other vector databases / embedding models for Document Representation

Load PDF

In [10]:
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [11]:
loader = PyPDFLoader("../docs/investment_report.pdf")
#doc = loader.load_and_split()
doc = loader.load_and_split(text_splitter = RecursiveCharacterTextSplitter(chunk_size = 4000, chunk_overlap = 200))

In [12]:
len(doc)

7

In [13]:
for i in doc:
    print(len(i.page_content))

2991
3986
549
3977
632
3869
2733
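
The chunk lengths above stay below `chunk_size = 4000` and vary because `RecursiveCharacterTextSplitter` prefers to cut at separators such as paragraph and line breaks. The sketch below is a deliberately simplified fixed-width version, meant only to show what `chunk_size` and `chunk_overlap` mean; the function name is hypothetical and this is not LangChain's algorithm.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Fixed-width sliding-window splitter (a simplification for illustration)."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghij" * 3  # 30 characters
chunks = split_with_overlap(sample, chunk_size=12, chunk_overlap=4)
print([len(c) for c in chunks])  # [12, 12, 12, 6]
```

Each chunk repeats the last `chunk_overlap` characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk.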


In [14]:
print(doc[0].page_content[:400])

Quarterly Investment Report 
April 2023
Despite banking instability and uncertain monetary policy expectations, the U.S. economy ended the quarter on  
a positive note. The S&P 500 Index rose 3.7% in March and 7.5% in the first quarter. The U.S. markets outperformed 
most major international markets. During the quarter, bond prices moved up, pushing Treasury yields in 1-year  and longer maturities


In [15]:
doc[0].metadata

{'source': '../docs/investment_report.pdf', 'page': 0}

In [16]:
page_count = doc[-1].metadata['page'] + 1  # 'page' is zero-based, so add 1
print('There are', page_count, 'pages in this PDF')

There are 4 pages in this PDF


Embeddings to use

In [17]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.embeddings import HuggingFaceEmbeddings

In [18]:
openai_embedder = OpenAIEmbeddings(openai_api_key = credentials.openai_api)

In [19]:
openai_embed_result = openai_embedder.embed_query('Embed this text!')
len(openai_embed_result)

1536

In [21]:
openai_embed_results = openai_embedder.embed_documents(['Embed this text!', 'This one too, please.'])
print(len(openai_embed_results))
print(len(openai_embed_results[0]))

2
1536


In [22]:
model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {'device': 'cpu'}
hf_embedder = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)

In [23]:
hf_embed_result = hf_embedder.embed_query('Embed this text!')
len(hf_embed_result)

768

In [24]:
hf_embed_results = hf_embedder.embed_documents(['Embed this text!', 'This one too, please.'])
print(len(hf_embed_results))
print(len(hf_embed_results[0]))

2
768


Vector databases

https://python.langchain.com/en/latest/modules/indexes/vectorstores.html?highlight=vectorstores

Some, like `Pinecone`, require an API connection.
Others can simply be stored locally, e.g. in a pickle.
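
As a rough sketch of the local option, the snippet below pickles a small dictionary of vectors, reloads it, and runs a brute-force nearest-neighbour search. The texts, the 3-dimensional vectors, and the `nearest` helper are all made up for illustration; real embeddings in this notebook have 768 or 1536 dimensions, and a library such as FAISS does the search far more efficiently.

```python
import math
import os
import pickle
import tempfile

# Hypothetical toy vectors standing in for real embeddings.
store = {
    "bond yields fell": [0.9, 0.1, 0.0],
    "unemployment rose": [0.1, 0.9, 0.0],
    "banks failed": [0.0, 0.2, 0.9],
}

# Persist the vectors to a local file and read them back.
path = os.path.join(tempfile.mkdtemp(), "vectors.pkl")
with open(path, "wb") as fh:
    pickle.dump(store, fh)
with open(path, "rb") as fh:
    loaded = pickle.load(fh)


def nearest(query_vec, vectors):
    """Return the text whose vector has the smallest Euclidean distance to query_vec."""
    return min(vectors, key=lambda text: math.dist(query_vec, vectors[text]))


best = nearest([0.8, 0.2, 0.1], loaded)
print(best)  # bond yields fell
```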

1. FAISS

In [25]:
from langchain.vectorstores import FAISS

In [26]:
db = FAISS.from_documents(doc, hf_embedder)

In [27]:
query = 'What happened in the bond market during the past quarter?'
relevant_docs = db.similarity_search(query)

print(len(relevant_docs))

4


In [28]:
for e, i in enumerate(db.similarity_search_with_score(query)):
    print(e, ' --- ', i[1])

0  ---  0.72051513
1  ---  0.87298733
2  ---  1.0739416
3  ---  1.0902194


In [29]:
for e, i in enumerate(db.similarity_search_with_relevance_scores(query)):
    print(e, ' --- ', i[1])

0  ---  0.4905188642503674
1  ---  0.38270473909543834
2  ---  0.24060862024283913
3  ---  0.22909848450190795
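
The two score lists above rank the same four documents. The first is a distance (lower is better), the second a relevance (higher is better). The printed numbers are consistent with the mapping `relevance = 1 - distance / sqrt(2)`, which assumes unit-length embeddings. The check below reproduces the relevance scores from the distances; the helper name is hypothetical.

```python
import math


def l2_to_relevance(distance):
    """Map an L2 distance between unit-length vectors to a 0-to-1 relevance score."""
    return 1.0 - distance / math.sqrt(2)


# Distances printed by similarity_search_with_score above.
distances = [0.72051513, 0.87298733, 1.0739416, 1.0902194]
relevances = [l2_to_relevance(d) for d in distances]

for d, r in zip(distances, relevances):
    print(f"{d:.4f} -> {r:.4f}")
```

The small differences in the last digits come from the distances being printed at single precision.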


2. `Pinecone`

In [29]:
import pinecone
from langchain.vectorstores import Pinecone

pinecone.init(api_key = credentials.pinecone_api,
              environment = credentials.pinecone_loc)

In [None]:
index_name = "first-experiment-sbert"
index = Pinecone.from_documents(doc, hf_embedder, index_name=index_name)

Creating QA chain without OpenAI (will be using `flan-t5-large`)

In [30]:
from langchain.chains.question_answering import load_qa_chain
from langchain import HuggingFaceHub

In [31]:
llm = HuggingFaceHub(repo_id = "google/flan-t5-large", 
                     model_kwargs={"temperature":0, "max_length":512},
                     huggingfacehub_api_token = credentials.huggingface_api)

In [32]:
chain = load_qa_chain(llm, chain_type="stuff")

In [33]:
query = "What will happen to the bond market's yield curve?"
docs = db.similarity_search(query)
chain.run(input_documents=docs, question=query)

'The yield curve remains inverted'

In [34]:
query = "Will the bond market yield curve change?"
docs = db.similarity_search(query)
chain.run(input_documents=docs, question=query)

'The yield curve remains inverted indicating investors’ expectations that while the Fed will continue to exert upward pressure on short rates, longer maturity rates remain relatively low, reflecting the likelihood of economic slowdown and even possibly a recession.'

In [35]:
query = "What's happening to unemployment?"
docs = db.similarity_search(query)
chain.run(input_documents=docs, question=query)

'The unemployment rate increased to 3.6% and the number of unemployed people increased to 5.9 million in February.'

In [36]:
query = "Which banks experienced stress? Did any of them default?"
docs = db.similarity_search(query)
chain.run(input_documents=docs, question=query)

'Silicon Valley Bank and Signature Bank'

In [37]:
query = "Why did Silicon Valley Bank default?"
docs = db.similarity_search(query)
chain.run(input_documents=docs, question=query)

'The failure of two big U.S. banks'

In [38]:
query = "Why did Silicon Valley Bank fail?"
docs = db.similarity_search(query)
chain.run(input_documents=docs, question=query)

'weak underwriting standards and excessive leverage'

In [39]:
query = "When was Messi born?"
docs = db.similarity_search(query)
chain.run(input_documents=docs, question=query)

'2023'

#### Wrapper around loaders, vectorstores, embeddings, text-splitters: `VectorstoreIndexCreator`


In [40]:
from langchain.indexes import VectorstoreIndexCreator
from langchain.chains import RetrievalQA

In [41]:
index_retriever = VectorstoreIndexCreator(
    vectorstore_cls=FAISS,
    embedding = OpenAIEmbeddings(openai_api_key=credentials.openai_api),
    text_splitter=RecursiveCharacterTextSplitter()
)

In [42]:
loader = PyPDFLoader("../docs/investment_report.pdf")
index = index_retriever.from_loaders([loader])
db_retriever = index.vectorstore.as_retriever()

In [43]:
qa = RetrievalQA.from_chain_type(llm = llm, chain_type="stuff", retriever = db_retriever)

In [44]:
query = "What's happening to unemployment?"
qa.run(query)

'The unemployment rate increased to 3.6% and the number of unemployed people increased to 5.9 million in February.'

In [45]:
query = "Which banks experienced stress? Did any of them default?"
qa.run(query)

'Silicon Valley Bank (SVB) and Signature Bank'

`flan-t5` occasionally hallucinates or pulls its answer from unrelated context. Larger models such as `OpenAI`'s `GPT` family are less prone to this. Additionally, we can try to guide these models with `PromptTemplates`. Let's see if we can make `flan-t5` a little bit smarter.

In [46]:
query = "What is the first name of the current Singapore President?"
qa.run(query)

'Rachael'

In [50]:
query = "What is the name of the current Singapore President?"
qa.run(query)

'No'

In [47]:
query = "How do you make scrambled eggs?"
qa.run(query)

'not enough information'

In [48]:
query = "How do you go to the Moon?"
qa.run(query)

'Space'

In [49]:
query = "What is the Danube?"
qa.run(query)

"If you don't know the answer, just say that you don't know, don't try to make up an answer."

`PromptTemplate`

In [51]:
from langchain.prompts import PromptTemplate

In [52]:
prompt_template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Answer:"""

PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

chain_type_kwargs = {"prompt": PROMPT}
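
To make the roles of `{context}` and `{question}` concrete, here is a minimal plain-Python sketch of what a "stuff" chain does conceptually: it joins all retrieved chunks into one context block and fills the template. The helper `build_stuff_prompt` and the sample chunks are hypothetical, and LangChain's actual implementation may differ in details such as the separator.

```python
prompt_template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Answer:"""


def build_stuff_prompt(chunks, question):
    """Join all retrieved chunks into one context block and fill the template."""
    context = "\n\n".join(chunks)
    return prompt_template.format(context=context, question=question)


# Hypothetical chunks standing in for the retriever's results.
chunks = [
    "The yield curve remains inverted.",
    "Bond prices moved up during the quarter.",
]
filled = build_stuff_prompt(chunks, "What happened in the bond market?")
print(filled)
```

Because every chunk is pasted into a single prompt, the combined text has to fit in the model's context window.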

In [53]:
qa_custom_prompt = RetrievalQA.from_chain_type(llm = llm, chain_type="stuff", retriever = db_retriever, chain_type_kwargs=chain_type_kwargs)

In [54]:
query = "Which banks experienced stress? Did any of them default?"
qa_custom_prompt.run(query)

'Silicon Valley Bank (SVB) and Signature Bank'

In [55]:
query = "What is the first name of the current Singapore President?"
qa_custom_prompt.run(query)

'Rafik'

In [56]:
query = "How do you go to the Moon?"
qa_custom_prompt.run(query)

'Space'

In [57]:
query = "What is the Danube?"
qa_custom_prompt.run(query)

'European'

In [58]:
query = "How do you make scrambled eggs?"
qa_custom_prompt.run(query)

'a bowl of water'

The answers are different, but still bad. We need `OpenAI` models combined with prompts to avoid this.

### 3. `OpenAI`

https://github.com/hwchase17/langchain/blob/master/docs/getting_started/getting_started.md

In [73]:
from langchain.llms import OpenAI
from langchain.chat_models import ChatOpenAI
from langchain.schema import AIMessage, ChatMessage, HumanMessage, SystemMessage

In [60]:
print('InstructGPT')
llm = OpenAI(model_name='text-davinci-003', openai_api_key=credentials.openai_api)
llm('Tell me a joke')

InstructGPT


'\n\nQ: What did the fish say when it hit the wall?\nA: Dam!'

In [74]:
print('ChatGPT')
llm = ChatOpenAI(model_name='gpt-3.5-turbo', openai_api_key=credentials.openai_api)
llm(([HumanMessage(content = "Tell me a joke")]))

ChatGPT


AIMessage(content='Why did the tomato turn red? Because it saw the salad dressing!', additional_kwargs={})

In [75]:
llm.get_num_tokens('Tell me a joke.')

5

In [76]:
llm.get_num_tokens('Tellmeajoke.')

5

Add `SystemMessage` to prompt

In [86]:
# temperature set to 0 to make the output as deterministic as possible
llm = ChatOpenAI(model_name='gpt-3.5-turbo', openai_api_key=credentials.openai_api, temperature=0.0)

In [88]:
message = [
    SystemMessage(content="You are a helpful assistant to a comedian. You come up with funny material on as many topics as possible."),
    HumanMessage(content="Come up with a joke about a giraffe and a lion.")
]

In [89]:
llm(message)

AIMessage(content='Why did the giraffe invite the lion to his party? Because he wanted to have a "tall" order of protection!', additional_kwargs={})

Generate outputs in batches

In [None]:
batch_messages = [
    [
        SystemMessage(content="You are a helpful assistant that translates English to Russian."),
        HumanMessage(content="Translate this sentence from English to Russian: I love programming.")
    ],
    [
        SystemMessage(content="You are a helpful assistant that comes up with catchy company names."),
        HumanMessage(content="Create a name for a company that mass produces aluminium foil.")
    ],
]
result = llm.generate(batch_messages)

In [92]:
result.llm_output

{'token_usage': {'prompt_tokens': 74,
  'completion_tokens': 11,
  'total_tokens': 85},
 'model_name': 'gpt-3.5-turbo'}

In [93]:
result.generations

[[ChatGeneration(text='Я люблю программирование.', generation_info=None, message=AIMessage(content='Я люблю программирование.', additional_kwargs={}))],
 [ChatGeneration(text='FoilWorks', generation_info=None, message=AIMessage(content='FoilWorks', additional_kwargs={}))]]

In [97]:
for i in result.generations:
    print(i[0].text)

Я люблю программирование.
FoilWorks


Add `Memory` to ChatBot so you can refer back to previous topics

In [98]:
from langchain.prompts import ChatPromptTemplate, MessagesPlaceholder, SystemMessagePromptTemplate, HumanMessagePromptTemplate
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory

In [99]:
prompt = ChatPromptTemplate.from_messages([
    SystemMessagePromptTemplate.from_template("The following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know."),
    MessagesPlaceholder(variable_name="history"),
    HumanMessagePromptTemplate.from_template("{input}")
])

In [100]:
memory = ConversationBufferMemory(return_messages=True)
conversation = ConversationChain(memory=memory, prompt=prompt, llm=llm)

In [101]:
conversation.predict(input = 'Hello')

'Hello! How can I assist you today?'

In [102]:
conversation.predict(input = 'Can you briefly tell me the reason behind the Hunder Year War? Please keep it short and concise.')

'The Hundred Year War was a series of conflicts between England and France that lasted from 1337 to 1453. The main cause of the war was a dispute over the succession to the French throne, as the English king claimed to be the rightful heir. Other factors included economic and territorial disputes, as well as long-standing tensions between the two nations.'

In [103]:
conversation.predict(input = 'Where there any other major conflicts between these two countries?')

'Yes, there were several major conflicts between England and France throughout history. Some of the most notable ones include the Norman Conquest of England in 1066, the Battle of Hastings in 1066, the Battle of Agincourt in 1415, and the Battle of Waterloo in 1815. These conflicts were often driven by political, economic, and territorial disputes, as well as cultural and religious differences between the two nations.'

With memory, ChatGPT knows that "these two countries" refers to England and France.
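
Conceptually, a buffer memory stores every turn and replays the full history in front of each new input. The sketch below mimics that with plain Python; `BufferMemory` and `build_messages` are hypothetical stand-ins, not LangChain classes.

```python
class BufferMemory:
    """Keeps every (role, content) turn so it can be replayed on the next call."""

    def __init__(self):
        self.messages = []

    def add_turn(self, human, ai):
        self.messages.append(("human", human))
        self.messages.append(("ai", ai))


def build_messages(system, memory, user_input):
    """System message first, then the stored history, then the new input."""
    return [("system", system), *memory.messages, ("human", user_input)]


memory = BufferMemory()
memory.add_turn("Hello", "Hello! How can I assist you today?")
msgs = build_messages("You are a friendly AI.", memory, "What did I say first?")
print(len(msgs))  # 4
```

Since the whole history is resent on every call, the prompt grows with each turn of the conversation.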

In [113]:
conversation.memory.chat_memory.messages

[HumanMessage(content='Hello', additional_kwargs={}),
 AIMessage(content='Hello! How can I assist you today?', additional_kwargs={}),
 HumanMessage(content='Can you briefly tell me the reason behind the Hunder Year War? Please keep it short and concise.', additional_kwargs={}),
 AIMessage(content='The Hundred Year War was a series of conflicts between England and France that lasted from 1337 to 1453. The main cause of the war was a dispute over the succession to the French throne, as the English king claimed to be the rightful heir. Other factors included economic and territorial disputes, as well as long-standing tensions between the two nations.', additional_kwargs={}),
 HumanMessage(content='Where there any other major conflicts between these two countries?', additional_kwargs={}),
 AIMessage(content='Yes, there were several major conflicts between England and France throughout history. Some of the most notable ones include the Norman Conquest of England in 1066, the Battle of Hasti

### 4. `LangChain` agents - the most powerful tool within the framework

Agents use an LLM to determine which actions to take and in what order.

Tools: https://python.langchain.com/en/latest/modules/agents/tools.html <br>
Agents: https://python.langchain.com/en/latest/modules/agents/agents/agent_types.html

Let's say I want to find a number on the internet and then use it in a mathematical expression. I'll need one tool to search the web and another to solve math problems.
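
The loop an agent runs can be sketched without any API calls: ask the model for the next action, run the matching tool, feed the observation back, and repeat until the model returns a final answer. Everything below is a hypothetical stand-in: `scripted_llm` replaces the real LLM with a fixed script, and `lookup` returns a hard-coded fact instead of querying Wikipedia.

```python
import math


def lookup(query):
    """Hypothetical search tool backed by a hard-coded fact."""
    return "Messi won his first Ballon d'Or in 2009."


def calculator(expression):
    """Hypothetical math tool; eval with a restricted namespace, for illustration only."""
    return str(eval(expression, {"__builtins__": {}}, {"log": math.log}))


TOOLS = {"Lookup": lookup, "Calculator": calculator}


def scripted_llm(observations):
    """Stand-in for the LLM: chooses the next step from what has been observed so far."""
    if len(observations) == 0:
        return ("Lookup", "Messi first Ballon d'Or year")
    if len(observations) == 1:
        return ("Calculator", "log(2009)/log(3)")
    return ("Final Answer", observations[-1])


def run_agent(llm, tools, max_steps=5):
    """Alternate between choosing an action and running the chosen tool."""
    observations = []
    for _ in range(max_steps):
        action, action_input = llm(observations)
        if action == "Final Answer":
            return action_input
        observations.append(tools[action](action_input))
    raise RuntimeError("agent did not finish within max_steps")


answer = run_agent(scripted_llm, TOOLS)
print(answer)
```

In a real agent the model produces the action and its input as text, and the framework parses that text before calling the tool.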

Problem: What is the base-3 logarithm of the year in which Messi won his first Ballon d'Or? --> X = log3(YEAR)

We'll try to find the info in `Wikipedia`

In [114]:
from langchain.agents import load_tools
from langchain.agents import initialize_agent

In [126]:
llm = ChatOpenAI(temperature = 0.0, openai_api_key = credentials.openai_api)
tools = load_tools(["wikipedia", "llm-math"], llm=llm)
agent = initialize_agent(tools, llm, agent="zero-shot-react-description", verbose=True)

In [127]:
llm.model_name

'gpt-3.5-turbo'

In [128]:
problem = "What's the 3rd base log of the number that is the year when Messi won his first Ballon d'Or?"
agent.run(problem)



> Entering new AgentExecutor chain...
I need to find out the year when Messi won his first Ballon d'Or and then calculate the 3rd base log of that number.
Action: Wikipedia
Action Input: "Lionel Messi Ballon d'Or"
Observation: Page: 2019 Ballon d'Or
Summary: The 2019 Ballon d'Or  (French: Ballon d'Or) was the 64th annual ceremony of the Ballon d'Or, presented by France Football, and recognising the best footballers in the world for 2019. Lionel Messi won the men's award for a record sixth time in his career.

Page: 2009 Ballon d'Or
Summary: The 2009 Ballon d'Or, given to the best football player in the world as judged by an international panel of sports journalists, was awarded to Lionel Messi of Barcelona on 1 December 2009.Messi won the award by a then record margin, 240 points ahead of 2008 winner Cristiano Ronaldo. Xavi was the second Barcelona player in the top three, finishing a further 63 points behind Ronaldo. Messi's win made him the fir

Retrying langchain.chat_models.openai.ChatOpenAI.completion_with_retry.<locals>._completion_with_retry in 1.0 seconds as it raised RateLimitError: Rate limit reached for default-gpt-3.5-turbo in organization org-MlWDteCoYCC2gwjf4Z5jxjJh on requests per min. Limit: 3 / min. Please try again in 20s. Contact support@openai.com if you continue to have issues. Please add a payment method to your account to increase your rate limit. Visit https://platform.openai.com/account/billing to add a payment method..


I have found that Messi won his first Ballon d'Or in 2009, now I need to calculate the 3rd base log of that number.
Action: Calculator
Action Input: log(2009)/log(3)

Observation: Answer: 6.9227264643428
Thought:

I now know the final answer.
Final Answer: 6.9227264643428

> Finished chain.


'6.9227264643428'

The answer is correct: Messi won his first Ballon d'Or in 2009, and log3(2009) is approximately 6.923.
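
The arithmetic can be verified independently with the change-of-base formula, log3(x) = ln(x) / ln(3), using only the standard library:

```python
import math

year = 2009
x = math.log(year) / math.log(3)  # change of base: log3(year) = ln(year) / ln(3)

print(round(x, 4))  # 6.9227
print(round(3 ** x))  # 2009, raising 3 to the result recovers the year
```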