## Familiarization with LLM packages such as `LangChain`, `OpenAI` & `tiktoken`

In [1]:
import credentials

### 1. Tiktoken

https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb


Encodings specify how text is converted into tokens. Different models use different encodings.

`tiktoken` supports three encodings used by OpenAI models:

| Encoding name           | OpenAI models                                       |
|-------------------------|-----------------------------------------------------|
| `cl100k_base`           | `gpt-4`, `gpt-3.5-turbo`, `text-embedding-ada-002`  |
| `p50k_base`             | Codex models, `text-davinci-002`, `text-davinci-003`|
| `r50k_base` (or `gpt2`) | GPT-3 models like `davinci`                         |
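
The table above can be read as a simple lookup from model name to encoding name. A minimal sketch, using only rows listed in the table (the dictionary `MODEL_TO_ENCODING` and the helper `encoding_name_for` are illustrative names, not part of `tiktoken`):

```python
# Model-to-encoding lookup built only from the table above.
# `MODEL_TO_ENCODING` and `encoding_name_for` are illustrative names.
MODEL_TO_ENCODING = {
    "gpt-4": "cl100k_base",
    "gpt-3.5-turbo": "cl100k_base",
    "text-embedding-ada-002": "cl100k_base",
    "text-davinci-002": "p50k_base",
    "text-davinci-003": "p50k_base",
    "davinci": "r50k_base",
}


def encoding_name_for(model: str) -> str:
    """Return the encoding name for a model listed in the table."""
    try:
        return MODEL_TO_ENCODING[model]
    except KeyError:
        raise ValueError(f"No encoding listed for model {model!r}") from None


print(encoding_name_for("gpt-3.5-turbo"))
```

In practice, `tiktoken.encoding_for_model("gpt-3.5-turbo")` performs this kind of lookup and returns the encoding object itself.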

In [2]:
import tiktoken

In [3]:
encoding = tiktoken.get_encoding("cl100k_base")
#encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

In [4]:
text = 'Tomorrow it will rain in Budapest.'
encoding.encode(text)

[91273, 433, 690, 11422, 304, 70695, 13]

In [5]:
text = 'TomorrowitwillraininBudapest.'
encoding.encode(text)

[91273, 275, 14724, 30193, 258, 33, 664, 28724, 13]

In [6]:
[encoding.decode_single_token_bytes(token) for token in [91273, 275, 14724, 30193, 258, 33, 664, 28724, 13]]

[b'Tomorrow', b'it', b'will', b'rain', b'in', b'B', b'ud', b'apest', b'.']

In [7]:
[encoding.decode_single_token_bytes(token).decode("utf-8") for token in [91273, 275, 14724, 30193, 258, 33, 664, 28724, 13]]

['Tomorrow', 'it', 'will', 'rain', 'in', 'B', 'ud', 'apest', '.']

In [8]:
encoding.decode([433])

' it'

In [9]:
encoding.decode([275])

'it'
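
The last two cells show that `' it'` (with a leading space) and `'it'` are distinct tokens: whitespace is folded into the token that follows it, which is why removing the spaces changes the token sequence. The toy sketch below illustrates the effect with a made-up vocabulary and greedy longest-match splitting. It is not the byte-pair-encoding algorithm that `tiktoken` uses, and `TOY_VOCAB` and `toy_tokenize` are hypothetical names.

```python
# Toy illustration only: made-up vocabulary, greedy longest-match splitting.
# This is NOT tiktoken's BPE algorithm.
TOY_VOCAB = ["Tomorrow", " it", "it", " will", "will", " rain", "rain", "."]


def toy_tokenize(text: str) -> list[str]:
    """Split text by repeatedly taking the longest vocabulary entry that matches."""
    tokens = []
    pos = 0
    while pos < len(text):
        matches = [tok for tok in TOY_VOCAB if text.startswith(tok, pos)]
        if not matches:
            raise ValueError(f"No token matches at position {pos}")
        best = max(matches, key=len)
        tokens.append(best)
        pos += len(best)
    return tokens


print(toy_tokenize("Tomorrow it will rain."))
print(toy_tokenize("Tomorrowitwillrain."))
```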

### 2. `LangChain`, `Pinecone` & other vector databases / embedding models for document representation

Load PDF

In [10]:
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [11]:
loader = PyPDFLoader("../docs/investment_report.pdf")
#doc = loader.load_and_split()
doc = loader.load_and_split(text_splitter = RecursiveCharacterTextSplitter(chunk_size = 4000, chunk_overlap = 200))

In [12]:
len(doc)

7

In [13]:
for i in doc:
    print(len(i.page_content))

2991
3986
549
3977
632
3869
2733
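
All chunk lengths above stay at or below the configured `chunk_size` of 4000 characters. The sketch below shows the basic idea of fixed-size chunks with overlap using a plain sliding window. It is a simplification: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraph and line breaks, and `sliding_window_chunks` is a hypothetical helper.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into pieces of at most chunk_size characters,
    each sharing chunk_overlap characters with the previous piece."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3):
    print(len(chunk), chunk)
```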


In [14]:
print(doc[0].page_content[:400])

Quarterly Investment Report 
April 2023
Despite banking instability and uncertain monetary policy expectations, the U.S. economy ended the quarter on  
a positive note. The S&P 500 Index rose 3.7% in March and 7.5% in the first quarter. The U.S. markets outperformed 
most major international markets. During the quarter, bond prices moved up, pushing Treasury yields in 1-year  and longer maturities


In [15]:
doc[0].metadata

{'source': '../docs/investment_report.pdf', 'page': 0}

In [16]:
# 'page' in the metadata is zero-based, so add 1 to get the page count
page_count = doc[-1].metadata['page'] + 1
print('There are', page_count, 'pages in this PDF')

There are 4 pages in this PDF


Embeddings to use

In [17]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.embeddings import HuggingFaceEmbeddings

In [18]:
openai_embedder = OpenAIEmbeddings(openai_api_key = credentials.openai_api)

In [None]:
openai_embed_result = openai_embedder.embed_query('Embed this text!')
len(openai_embed_result)

In [None]:
openai_embed_results = openai_embedder.embed_documents(['Embed this text!', 'This one too, please.'])
print(len(openai_embed_results))
print(len(openai_embed_results[0]))

In [19]:
model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {'device': 'cpu'}
hf_embedder = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)



In [20]:
hf_embed_result = hf_embedder.embed_query('Embed this text!')
len(hf_embed_result)

768

In [21]:
hf_embed_results = hf_embedder.embed_documents(['Embed this text!', 'This one too, please.'])
print(len(hf_embed_results))
print(len(hf_embed_results[0]))

2
768
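
Embedding vectors are useful because texts with similar meaning map to nearby vectors. As a sketch of how two such vectors can be compared, the block below computes cosine similarity on short made-up vectors (real `all-mpnet-base-v2` vectors have 768 dimensions, as shown above). `cosine_similarity` is an illustrative helper, not a LangChain function.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional vectors standing in for real embeddings.
v_query = [1.0, 0.0, 1.0]
v_close = [2.0, 0.0, 2.0]   # same direction as v_query
v_far = [0.0, 1.0, 0.0]     # orthogonal to v_query

print(cosine_similarity(v_query, v_close))
print(cosine_similarity(v_query, v_far))
```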


Vector databases

https://python.langchain.com/en/latest/modules/indexes/vectorstores.html?highlight=vectorstores

Some vector stores, like `Pinecone`, are hosted services and need an API connection.
Others can simply be stored locally, e.g. in a pickle file.
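
As a sketch of the local-storage option, the block below pickles a small in-memory mapping from chunk names to vectors and loads it back. The vectors, the file name `toy_index.pkl` and the dictionary layout are made up for illustration; a real vector store persists its own index structure.

```python
import os
import pickle
import tempfile

# Made-up chunk names and vectors standing in for a real index.
toy_index = {
    "chunk_1": [0.1, 0.7, 0.2],
    "chunk_2": [0.9, 0.1, 0.0],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_index.pkl")

    # Write the index to disk ...
    with open(path, "wb") as f:
        pickle.dump(toy_index, f)

    # ... and read it back, e.g. in a later session.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == toy_index)
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.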

1. FAISS

In [22]:
from langchain.vectorstores import FAISS

In [23]:
db = FAISS.from_documents(doc, hf_embedder)

In [24]:
query = 'What happened in the bond market during the past quarter?'
relevant_docs = db.similarity_search(query)

print(len(relevant_docs))

4
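
`similarity_search` returned four documents because it retrieves the `k` nearest neighbours of the query embedding, and `k` appears to default to 4 here. A brute-force sketch of such a search on made-up 2-dimensional vectors follows. FAISS uses optimized index structures instead, the choice of Euclidean distance is an assumption of this sketch, and `top_k_by_l2` is a hypothetical helper.

```python
import math


def top_k_by_l2(
    query: list[float], vectors: dict[str, list[float]], k: int = 4
) -> list[tuple[str, float]]:
    """Return the k entries closest to query as (name, distance), nearest first."""
    scored = [(name, math.dist(query, vec)) for name, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Made-up vectors standing in for embedded document chunks.
toy_vectors = {
    "chunk_a": [0.0, 0.0],
    "chunk_b": [1.0, 0.0],
    "chunk_c": [0.0, 2.0],
    "chunk_d": [3.0, 0.0],
    "chunk_e": [4.0, 4.0],
}

for name, distance in top_k_by_l2([0.0, 0.0], toy_vectors):
    print(name, distance)
```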


In [25]:
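# similarity_search_with_score returns (document, score) pairs;
# with FAISS the score is a distance, so lower means more similar
for rank, (_, score) in enumerate(db.similarity_search_with_score(query)):
    print(rank, ' --- ', score)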

0  ---  0.72051513
1  ---  0.87298733
2  ---  1.0739416
3  ---  1.0902194


In [26]:
# relevance scores are rescaled so that higher means more similar
for rank, (_, score) in enumerate(db.similarity_search_with_relevance_scores(query)):
    print(rank, ' --- ', score)

0  ---  0.4905188642503674
1  ---  0.38270473909543834
2  ---  0.24060862024283913
3  ---  0.22909848450190795
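
The two score lists above rank the same four chunks on different scales: the first is a distance (lower is better), the second a relevance score (higher is better). The printed numbers are consistent with the mapping relevance = 1 - distance / √2. That this is the exact formula used internally by the installed LangChain version is an assumption inferred from the outputs, and `relevance_from_distance` is a hypothetical helper.

```python
import math

# Distance scores and relevance scores copied from the two outputs above.
distances = [0.72051513, 0.87298733, 1.0739416, 1.0902194]
relevances = [0.4905188642503674, 0.38270473909543834,
              0.24060862024283913, 0.22909848450190795]


def relevance_from_distance(distance: float) -> float:
    """Assumed mapping, inferred from the printed outputs."""
    return 1.0 - distance / math.sqrt(2)


for d, r in zip(distances, relevances):
    print(f"{d:.6f} -> {relevance_from_distance(d):.6f} (reported {r:.6f})")
```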


2. `Pinecone`

In [29]:
import pinecone
from langchain.vectorstores import Pinecone

pinecone.init(api_key = credentials.pinecone_api,
              environment = credentials.pinecone_loc)

In [None]:
index_name = "first-experiment-sbert"
index = Pinecone.from_documents(doc, hf_embedder, index_name=index_name)

Creating a QA chain without OpenAI (using `flan-t5-large` instead)

In [31]:
from langchain.chains.question_answering import load_qa_chain
from langchain import HuggingFaceHub

In [33]:
llm = HuggingFaceHub(repo_id = "google/flan-t5-large", 
                     model_kwargs={"temperature":0, "max_length":512},
                     huggingfacehub_api_token = credentials.huggingface_api)

In [35]:
chain = load_qa_chain(llm, chain_type="stuff")
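
With `chain_type="stuff"`, all retrieved documents are placed ("stuffed") into a single prompt together with the question. A minimal sketch of that idea follows. The prompt wording and the helper `build_stuff_prompt` are made up for illustration and do not reproduce LangChain's actual template.

```python
def build_stuff_prompt(documents: list[str], question: str) -> str:
    """Concatenate all documents into one context block and append the question."""
    context = "\n\n".join(documents)
    return (
        "Answer the question using only the context below.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder strings standing in for retrieved chunks.
retrieved = ["<text of chunk 1>", "<text of chunk 2>"]
prompt = build_stuff_prompt(retrieved, "What will happen to the bond market's yield curve?")
print(prompt)
```

Because every chunk ends up in one prompt, the combined length has to fit within the model's context window, which is one reason the chunk size and the number of retrieved chunks matter.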

In [62]:
query = "What will happen to the bond market's yield curve?"
docs = db.similarity_search(query)
chain.run(input_documents=docs, question=query)

'The yield curve remains inverted'

In [37]:
query = "Will the bond market yield curve change?"
docs = db.similarity_search(query)
chain.run(input_documents=docs, question=query)

'The yield curve remains inverted indicating investors’ expectations that while the Fed will continue to exert upward pressure on short rates, longer maturity rates remain relatively low, reflecting the likelihood of economic slowdown and even possibly a recession.'

In [38]:
query = "What's happening to unemployment?"
docs = db.similarity_search(query)
chain.run(input_documents=docs, question=query)

'The unemployment rate increased to 3.6% and the number of unemployed people increased to 5.9 million in February.'

In [39]:
query = "Which banks experienced stress? Did any of them default?"
docs = db.similarity_search(query)
chain.run(input_documents=docs, question=query)

'Silicon Valley Bank and Signature Bank'

In [40]:
query = "Why did Silicon Valley Bank default?"
docs = db.similarity_search(query)
chain.run(input_documents=docs, question=query)

'The failure of two big U.S. banks'

In [42]:
query = "Why did Silicon Valley Bank fail?"
docs = db.similarity_search(query)
chain.run(input_documents=docs, question=query)

'weak underwriting standards and excessive leverage'

In [41]:
query = "When was Messi born?"
docs = db.similarity_search(query)
chain.run(input_documents=docs, question=query)

'2023'

#### Wrapper around loaders, vectorstores, embeddings, text-splitters: `VectorstoreIndexCreator`


In [66]:
from langchain.indexes import VectorstoreIndexCreator
from langchain.chains import RetrievalQA

In [64]:
index_retriever = VectorstoreIndexCreator(
    vectorstore_cls=FAISS,
    embedding = HuggingFaceEmbeddings(),
    text_splitter=RecursiveCharacterTextSplitter()
)

In [67]:
loader = PyPDFLoader("../docs/investment_report.pdf")
index = index_retriever.from_loaders([loader])
db_retriever = index.vectorstore.as_retriever()

In [68]:
qa = RetrievalQA.from_chain_type(llm = llm, chain_type="stuff", retriever = db_retriever)

In [69]:
query = "What's happening to unemployment?"
qa.run(query)

'The unemployment rate increased to 3.6% and the number of unemployed people increased to 5.9 million in February.'

In [70]:
query = "Which banks experienced stress? Did any of them default?"
qa.run(query)

'Silicon Valley Bank and Signature Bank'

`flan-t5` occasionally hallucinates or answers from information outside the retrieved context. Using `OpenAI`'s `GPT` models can help with this. Additionally, we can try to guide these models with a `PromptTemplate`. Let's see if we can make `flan-t5` a little bit smarter.

In [72]:
query = "What is the first name of the current Singapore President?"
qa.run(query)

'Jerome'

In [73]:
query = "How do you make scrambled eggs?"
qa.run(query)

'no answe'

In [74]:
query = "How do you go to the Moon?"
qa.run(query)

'No one can go to the moon.'

In [75]:
query = "What is the Danube?"
qa.run(query)

'not enough information'

`PromptTemplate`

In [76]:
from langchain.prompts import PromptTemplate

In [78]:
prompt_template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Answer:"""

PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

chain_type_kwargs = {"prompt": PROMPT}
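
Conceptually, the template is a string with named placeholders that are filled in for each query. The sketch below mimics that substitution with plain `str.format`, using the template text from the cell above. The context string is a placeholder, and `fill_template` is a hypothetical helper rather than the `PromptTemplate` API.

```python
template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Answer:"""


def fill_template(template: str, **variables: str) -> str:
    """Substitute named placeholders; fail clearly if one is missing."""
    try:
        return template.format(**variables)
    except KeyError as missing:
        raise ValueError(f"Missing template variable: {missing}") from None


filled = fill_template(
    template,
    context="<retrieved chunks go here>",
    question="What is the Danube?",
)
print(filled)
```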

In [79]:
qa_custom_prompt = RetrievalQA.from_chain_type(llm = llm, chain_type="stuff", retriever = db_retriever, chain_type_kwargs=chain_type_kwargs)

In [81]:
query = "Which banks experienced stress? Did any of them default?"
qa_custom_prompt.run(query)

'Silicon Valley Bank and Signature Bank'

In [82]:
query = "What is the first name of the current Singapore President?"
qa_custom_prompt.run(query)

'Jerome'

In [83]:
query = "How do you go to the Moon?"
qa_custom_prompt.run(query)

'Space'

In [85]:
query = "What is the Danube?"
qa_custom_prompt.run(query)

'unanswerable'

In [87]:
query = "How do you make scrambled eggs?"
qa_custom_prompt.run(query)

'uncooked'

Different, but still not good enough. To avoid this, we need to combine `OpenAI` models with prompts.

### 3. `OpenAI`

In [None]:
from langchain.llms import OpenAI

In [None]:
llm = OpenAI(model_name='text-davinci-003', openai_api_key=credentials.openai_api)

In [None]:
# billing must be set up for the API to be live
#llm('Tell me a joke')

In [None]:
llm.get_num_tokens('Tell me a joke.')

5

In [None]:
llm.get_num_tokens('Tellmeajoke.')

5