# **Retrieval Augmentation**

**L**arge **L**anguage **M**odels (LLMs) have a data freshness problem. Even the most powerful LLMs in the world, like GPT-4, have no idea about recent world events.

The world of LLMs is frozen in time: it exists as a static snapshot of the world as it was captured in their training data.

An LLM has to get its knowledge from somewhere, but what kind of knowledge does it use?
* `Parametric Knowledge`:<br>
Knowledge gained by the LLM during training. After training it is fixed (frozen), i.e. the knowledge does not change and is *static*.
* `Source Knowledge`:<br>
Knowledge gained by the LLM from the prompt supplied to it.

LLMs work on a combination of these two kinds of knowledge. Retrieval augmentation targets the second kind: it fetches relevant text and places it in the prompt.
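
To make source knowledge concrete, here is a minimal sketch of how retrieved text can be placed into a prompt. The `build_prompt` helper, the template wording, and the sample chunks are illustrative assumptions and not part of any library used in this notebook.

```python
def build_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Combine retrieved text (source knowledge) with the user's question."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical chunks, standing in for what a retriever would return
chunks = [
    "Obama was born in Honolulu, Hawaii.",
    "He served as the 44th president of the United States from 2009 to 2017.",
]
prompt = build_prompt("Where was Obama born?", chunks)
print(prompt)
```

Whatever the model answers from this prompt comes from its parametric knowledge plus the context supplied here, which is why the quality of the retrieved chunks matters.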

# Installing dependencies

In [1]:
!pip install -qU langchain huggingface_hub transformers pinecone-client

In [2]:
!pip install -qU \
    datasets==2.12.0 \
    apache_beam \
    mwparserfromhell

In [3]:
# For tokenizing our content and getting the length of our content
!pip install -qU tiktoken spacy torch sentencepiece sentence-transformers


### **Phases**
* Building the knowledge base
* Tokenization (checking the length of the corpus) using tokenizers
* Creating your text chunks
* Creating your embeddings
* Upserting them to Pinecone
* Querying your database

**NOTE**:
The following is an experiment that tests a few tokenizers and tokenization methods to find the ones that yield the best results. At the end, everything is brought together into a single pipeline.

# **Building our Knowledge Base**

### Simply reading our .txt files

In [5]:
# Reading india.txt file, skipping blank lines
# (the with-block closes the file once it has been read)
with open("/content/india.txt", "r", encoding="utf8") as file_india:
  txt_content_india = [line for line in file_india if line != "\n"]

print(txt_content_india)

# Reading Barack_Obama.txt file, skipping blank lines
with open("/content/Barack_Obama.txt", "r", encoding="utf8") as file_barack:
  txt_content_barack = [line for line in file_barack if line != "\n"]

print(txt_content_barack)

['India, officially known as the Republic of India, is a vibrant and diverse country located\n', 'in South Asia. With a rich cultural heritage, ancient history, and breathtaking\n', 'landscapes, India captivates the imagination of people from all around the world. From\n', 'its bustling cities to its serene rural villages, India offers a tapestry of experiences that\n', 'leave an indelible mark on those who visit or seek to learn about this fascinating nation.\n', 'India is the seventh-largest country by land area and the second-most populous country\n', 'in the world, with over 1.3 billion people. Its capital city is New Delhi, while Mumbai,\n', 'Kolkata, and Chennai are other major metropolises. The country is characterized by its\n', 'unity in diversity, as it is home to numerous religions, languages, and cultures.\n', 'Hinduism, Islam, Sikhism, Christianity, Buddhism, and Jainism are among the major\n', 'religions practiced in India.\n', "One of India's greatest treasures is its an

# **Tokenization**

In [6]:
# Tokenizing our content to get the length of text
import tiktoken
import spacy
from transformers import GPT2Tokenizer
from transformers import AutoTokenizer
# from transformers import XLNetTokenizer # this needs SentencePiece library so install it
from transformers import BertTokenizer


# Creating the length function to count the number of tokens
def tiktoken_token_len(text):
  """
  Count the number of tiktoken tokens in the text.

  Note: The number of tokens is not equal to the number of characters or words.
  Relies on the global `tokenizer_tiktoken`, which is declared in the next cell.
  """
  tokens = tokenizer_tiktoken.encode(                       # This is very specific to tiktoken
      text,
      disallowed_special=()
    )
  return len(tokens)


# Function for using transformers tokenizers:
def transformer_tokenizer(text, tokenizer_type):
  """Count the tokens produced by a Hugging Face tokenizer instance."""
  tokens = tokenizer_type.tokenize(text)
  return len(tokens)


In [7]:
# Declaring our tokenizers from transformers and tiktoken
tokenizer_tiktoken = tiktoken.get_encoding('p50k_base')           # encoding only; the max token len depends on the model using it
tokenizer_gpt2 = GPT2Tokenizer.from_pretrained('gpt2')            # max token len is 1024
tokenizer_bert = AutoTokenizer.from_pretrained("bert-base-cased") # max token len 512

# Subword tokenizers
tokenizer_bert_uncased = BertTokenizer.from_pretrained("bert-base-uncased")
tokenizer_bert_cased = BertTokenizer.from_pretrained("bert-base-cased")
# tokenizer_xlnet = XLNetTokenizer.from_pretrained("xlnet-base-cased") This needs SentencePiece library


# Spacy tokenizer
tokenizer_spacy = spacy.load('en_core_web_sm')                    # a spaCy pipeline; it does not work with the functions declared above
# This chunk of code is only for the spacy tokenizer: it splits each line into sentences
corpus_sentences = []
for line in txt_content_barack:
  doc = tokenizer_spacy(line)                                     # Spacy Tokenizer
  for sent in doc.sents:
      corpus_sentences.append(sent.text)
print("Corpus Sentences:", corpus_sentences)
print ('Len of sentences in Barack_Obama input text using spacy tokenizer: ', len(corpus_sentences))

Corpus Sentences: ['[Barack Hussein Obama II ( (listen) bə-RAHK hoo-SAYN oh-BAH-mə; born August 4, 1961) is an American politician, lawyer, and author who served as the 44th president of the United States from 2009 to 2017.', 'A member of the Democratic Party, Obama was the first African-American  president of the United States.', 'He previously served as a U.S. senator from Illinois from 2005 to 2008 and as an Illinois state senator from 1997 to 2004.', ', Obama was born in Honolulu, Hawaii.', 'After graduating from Columbia University in 1983, he worked as a community organizer in Chicago.', 'In 1988, he enrolled in Harvard Law School, where he was the first black president of the Harvard Law Review.', 'After graduating, he became a civil rights attorney and an academic, teaching constitutional law at the University of Chicago Law School from 1992 to 2004.', 'Turning to elective politics, he represented the 13th district in the Illinois Senate from 1997 until 2004, when he ran for th

In [8]:
# Convert list to a string; this is important since our recursive text splitter takes a string as its input arg
txt_content_india_string = ' '.join(txt_content_india)
txt_content_barack_string = ' '.join(txt_content_barack)


# Call the function with the correct arguments for the tiktoken tokenizer
length_india_txt = tiktoken_token_len(txt_content_india_string)
length_barack_txt = tiktoken_token_len(txt_content_barack_string)

print("Using tiktoken, token length of india.txt is:", length_india_txt, ". Note this number is not the total number of sentences.")
print("Using tiktoken, token length of barack_obama.txt is:", length_barack_txt, ". Note this number is not the total number of sentences.")

Using tiktoken, token length of india.txt is: 752 . Note this number is not the total number of sentences.
Using tiktoken, token length of barack_obama.txt is: 15844 . Note this number is not the total number of sentences.


In [9]:
print(len(txt_content_india_string.split()))
print(len(txt_content_barack_string.split()))

523
12763


**Observation**: The number of tokens is not equal to the number of words in the content.
Since barack_obama.txt (15844 tokens) far exceeds the 1024-token limit of gpt2, it is always preferable to chunk our text before embedding it or passing it to a model.
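
As a quick worked check of the observation, the sketch below turns the counts printed above into a tokens-per-word ratio and compares each document with the 1024-token limit noted for gpt2. The counts come from tiktoken's `p50k_base` encoding, so the comparison with gpt2 is approximate; the variable names are illustrative.

```python
# Counts reported by the cells above
counts = {
    "india.txt": {"tokens": 752, "words": 523},
    "barack_obama.txt": {"tokens": 15844, "words": 12763},
}
GPT2_MAX_TOKENS = 1024  # limit noted for gpt2 earlier in the notebook

ratios = {}
for name, c in counts.items():
    ratios[name] = c["tokens"] / c["words"]
    fits = c["tokens"] <= GPT2_MAX_TOKENS
    print(f"{name}: {ratios[name]:.2f} tokens per word, fits in one window: {fits}")
```

Both documents produce more tokens than words, and only india.txt would fit in a single 1024-token window.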

# **Creating Chunks:**

In [10]:
# Chunking our text using langchain RecursiveCharacterTextSplitter
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 100,
    chunk_overlap = 5,
    length_function = tiktoken_token_len,
    separators = ["\n\n", "\n", " ", ""]
)

In [11]:
chunks = text_splitter.split_text(txt_content_barack_string)
print("Length of chunks using tiktoken tokenizer:" ,len(chunks))

Length of chunks using tiktoken tokenizer: 167
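
The chunk count can be sanity-checked with simple arithmetic. The sketch below assumes an idealized sliding window in which every chunk is exactly `chunk_size` tokens and overlaps the previous one by `chunk_overlap` tokens. That is a simplification: RecursiveCharacterTextSplitter splits on separators, so real chunks are often shorter. The helper name `estimate_chunk_count` is hypothetical.

```python
import math


def estimate_chunk_count(total_tokens: int, chunk_size: int, chunk_overlap: int) -> int:
    """Estimate how many chunks an idealized sliding window produces.

    The first chunk covers chunk_size tokens; each later chunk adds
    (chunk_size - chunk_overlap) new tokens.
    """
    if total_tokens <= chunk_size:
        return 1
    stride = chunk_size - chunk_overlap
    return math.ceil((total_tokens - chunk_overlap) / stride)


estimate = estimate_chunk_count(total_tokens=15844, chunk_size=100, chunk_overlap=5)
print(estimate)
```

For this document the estimate lands on the same figure as the splitter reported above, which suggests the splitter is filling chunks close to the 100-token limit.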


In [12]:
# Number of sentences (not tokens) found by spacy in a list of lines
def spacy_token_len(text_content):
  corpus_sentences = []
  for line in text_content:
    doc = tokenizer_spacy(line)
    for sent in doc.sents:
      corpus_sentences.append(sent.text)

  return len(corpus_sentences)


spacy_token_len(txt_content_barack)

"""
Code below is for reference only
"""
# corpus_sentences = []
# for line in txt_content_barack:
#   doc = tokenizer_spacy(line)
#   print(doc)                               # Spacy Tokenizer
#   for sent in doc.sents:
#       corpus_sentences.append(sent.text)
# print("Corpus Sentences:", corpus_sentences)
# print ('Len of sentences in Barack_Obama input text using spacy tokenizer: ', len(corpus_sentences))

'\nCode below is for reference only\n'

In [13]:
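# text splitter intended for spacy; with length_function commented out it falls back to the default len (characters)
text_splitter_spacy = RecursiveCharacterTextSplitter(
    chunk_size = 100,
    chunk_overlap = 5,
    #length_function = spacy_token_len,                # defined above; needs to be modified to take a string and return a token count
    separators = ["\n\n", "\n", " ", ""]
)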

# text splitter using tiktoken
text_splitter_tiktoken = RecursiveCharacterTextSplitter(
    chunk_size = 100,
    chunk_overlap = 5,
    length_function = tiktoken_token_len,             # defined at the top
    separators = ["\n\n", "\n", " ", ""]
)

# text splitter using len: Source- Langchain Docs
text_splitter_len = RecursiveCharacterTextSplitter(
    chunk_size = 100,
    chunk_overlap = 5,
    length_function = len,                            # This is the most basic
    separators = ["\n\n", "\n", " ", ""]
)

# Chunks from our defined text splitter above
chunks_from_spacy = text_splitter_spacy.split_text(txt_content_barack_string)
chunks_from_tiktoken = text_splitter_tiktoken.split_text(txt_content_barack_string)
chunks_from_len = text_splitter_len.split_text(txt_content_barack_string)

# Total number of chunks
print("Length of chunks using spacy tokenizer:" ,len(chunks_from_spacy))
print("Length of chunks using tiktoken tokenizer:" ,len(chunks_from_tiktoken))
print("Length of chunks using len:" ,len(chunks_from_len))


Length of chunks using spacy tokenizer: 848
Length of chunks using tiktoken tokenizer: 167
Length of chunks using len: 848
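
The identical counts for the spacy and `len` splitters follow from both measuring chunks in characters. The sketch below shows how much the choice of length function changes the result. `greedy_split` is a deliberately simplified stand-in, not LangChain's algorithm, and `word_len` is a hypothetical word counter used in place of a real tokenizer.

```python
from typing import Callable


def greedy_split(text: str, chunk_size: int,
                 length_function: Callable[[str], int] = len) -> list[str]:
    """Pack whitespace-separated words into chunks whose measured length stays within chunk_size."""
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if current and length_function(candidate) > chunk_size:
            chunks.append(current)   # current chunk is full, start a new one
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def word_len(text: str) -> int:
    """Stand-in for a token counter: counts whitespace-separated words."""
    return len(text.split())


sample = "Obama was born in Honolulu Hawaii " * 20  # 120 words

by_chars = greedy_split(sample, chunk_size=100, length_function=len)
by_words = greedy_split(sample, chunk_size=100, length_function=word_len)
print(len(by_chars), len(by_words))
```

A spaCy-based counter could follow the same shape by taking a string and returning `len(tokenizer_spacy(text))`, since a spaCy `Doc` reports its token count through `len`; that variant is not tested here.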


## Create your embeddings:

In [14]:
import torch
from langchain.embeddings import HuggingFaceEmbeddings

# Checking for the processor (note: `device` is not passed to the model below)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# Declaring our embedding model name from hugging face
model_name = 'sentence-transformers/all-roberta-large-v1'
# Assigning the declared model to a variable
embedding_model = HuggingFaceEmbeddings(model_name= model_name)

### Embedding text using langchain framework:

In [15]:
# Embedding the text
from datetime import datetime
from time import time

# Declaring our time variables
start_time = datetime.now()
t = time()

# Declaring our sample text for testing
text = ['This is new.',
        "I do not know to like this or not."]

In [16]:
# Simply Embedding our testing text
print("Start time for testing text embeddings is {}".format(start_time), "\n")
embeddings_test_text_roberta = embedding_model.embed_documents(text)                # This has to be a list
dimension_test_text = len(embeddings_test_text_roberta), len(embeddings_test_text_roberta[0])
print("Dimensions of testing text", dimension_test_text)
print('Time to embed the testing text: {} mins'.format(round((time() - t) / 60, 4)))
print('Len of the testing text embeddings: ', len(embeddings_test_text_roberta))

Start time for testing text embeddings is 2023-06-23 15:02:35.182669 

Dimensions of testing text (2, 1024)
Time to embed the testing text: 0.0082 mins
Len of the testing text embeddings:  2


In [17]:
# Embedding our input doc barack_obama, chunked using tiktoken
# Note: start_time and t were set in cell 15, so the timing below also includes the test-text embedding above
print("\nStart time for Barack Obama document embeddings is {}".format(start_time), "\n")
embeddings_docs_roberta = embedding_model.embed_documents(chunks_from_tiktoken)      # This has to be a list
dimension_input_text = len(embeddings_docs_roberta), len(embeddings_docs_roberta[0])
print(dimension_input_text)
print('Time to embed the Barack Obama document: {} mins'.format(round((time() - t) / 60, 4)))
print('Len of the document Barack Obama embeddings: ', len(embeddings_docs_roberta))


Start time for Barack Obama document embeddings is 2023-06-23 15:02:35.182669 

(167, 1024)
Time to embed the Barack Obama document: 2.6936 mins
Len of the document Barack Obama embeddings:  167


### Embedding text using sentence transformers directly

In [18]:
# Importing Sentence Transformer module
from sentence_transformers import SentenceTransformer

# Declaring our time variables
start_time_2 = datetime.now()
t_2 = time()

# loading the model
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model_name = 'sentence-transformers/all-roberta-large-v1'
model = SentenceTransformer(model_name)


print ('\n-Starting the embedding process for the input text at {}'.format(start_time_2), '\n')
# creating the embeddings for the input Barack Obama document
corpus_embedding = model.encode(chunks_from_tiktoken, show_progress_bar=True).tolist()
dimension_input_text_2 = len(corpus_embedding), len(corpus_embedding[0])
print(dimension_input_text_2)
print('Time to embed the input text: {} mins'.format(round((time() - t_2) / 60, 4)),'\n')

print('len of the corpus embeddings: ', len(corpus_embedding))


-Starting the embedding process for the input text at 2023-06-23 15:05:16.820508 



(167, 1024)
Time to embed the input text: 2.4572 mins 

len of the corpus embeddings:  167


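**Observations**:
* Embedding the text through the LangChain framework takes longer than using sentence transformers directly: significantly longer at times, almost the same at others (2.69 vs 2.46 mins in this run). Note that the LangChain figure is measured from a timer started in cell 15, so it also includes the test-text embedding.
* The same comparison has been tried with other embedding models as well.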
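
Because the LangChain measurement above reuses a timer started in an earlier cell, a fairer comparison would time each step on its own. The sketch below shows one way to do that with a small context manager. The `timed` helper and the `timings` dictionary are hypothetical names, and the `sleep` calls are stand-ins for the two embedding calls.

```python
import time
from contextlib import contextmanager

timings = {}


@contextmanager
def timed(label: str):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


# Stand-in workloads; in the notebook these would be the two embedding calls
with timed("first_step"):
    time.sleep(0.05)

with timed("second_step"):
    time.sleep(0.01)

for label, seconds in timings.items():
    print(f"{label}: {seconds / 60:.4f} mins")
```

Each block starts its own clock, so time spent in earlier cells or between cell runs no longer leaks into the measurement.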

## **Bringing it all together:**


**Initiating Pinecone**

In [39]:
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings, SentenceTransformerEmbeddings

loader = TextLoader("/content/Barack_Obama.txt")
documents = loader.load()

def split_docs(documents,chunk_size=100,chunk_overlap=5):
  """
  The function uses a text splitter called RecursiveCharacterTextSplitter to
  divide a single text string into smaller chunks.
  Chunk size and overlap are measured in tiktoken tokens, and the function
  returns the resulting chunks as a list of strings.
  """
  text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size,
                                                 chunk_overlap=chunk_overlap,
                                                 length_function = tiktoken_token_len, # Try experimenting here
                                                 separators = ["\n\n", "\n", " ", ""]
                                                 )
  docs = text_splitter.split_text(documents)
  return docs

# Calling our function on the joined string (the `documents` loaded by TextLoader above are not used)
docs = split_docs(txt_content_barack_string)

print("Length of docs after chunking:", len(docs), "\n")
print("One of the page's chunked content:\n\n", docs)

"""
Embedding using SentenceTransformer Embedding
Note: Pinecone.from_texts expects an embeddings object, so passing the
precomputed list embeddings_docs_roberta for upserting shows an Attribute error
"""
# Declaring our time variables
start_time_3 = datetime.now()
t_3 = time()

# Equivalent to HuggingFaceEmbeddings(model_name="all-roberta-large-v1")
print ('\n-Starting to load the embedding model at {}'.format(start_time_3), '\n')

# Creating our embedding model (the chunks themselves are embedded later, during upserting)
embeddings_roberta = SentenceTransformerEmbeddings(model_name = "all-roberta-large-v1")

print('Time to load the embedding model: {} mins'.format(round((time() - t_3) / 60, 4)),'\n')

Length of docs after chunking: 167 

One of the page's chunked content:

 ['[Barack Hussein Obama II ( (listen) bə-RAHK hoo-SAYN oh-BAH-mə; born August 4, 1961) is an American politician, lawyer, and author who served as the 44th president of the United States from 2009 to 2017. A member of the Democratic Party, Obama was the first African-American  president of the United States. He previously served as a U.S. senator from Illinois from 2005 to 2008 and', 'from 2005 to 2008 and as an Illinois state senator from 1997 to 2004. , Obama was born in Honolulu, Hawaii. After graduating from Columbia University in 1983, he worked as a community organizer in Chicago. In 1988, he enrolled in Harvard Law School, where he was the first black president of the Harvard Law Review. After graduating, he became a civil rights attorney and an academic, teaching constitutional law at the University of Chicago Law School from 1992 to 2004. Turning to elective politics, he represented', 'politics, he repre

In [41]:
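import os
import pinecone
from langchain.vectorstores import Pinecone

# Initializing Pinecone
# The API key is read from an environment variable (the name PINECONE_API_KEY is an assumption) rather than hard-coded
pinecone.init(
    api_key=os.environ["PINECONE_API_KEY"],
    environment="us-west4-gcp-free"
)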

active_indexes = pinecone.list_indexes()
index_description = pinecone.describe_index(active_indexes[0])
print("Index Description:", index_description)
index_name = 'roberta1024'

# Declaring our time variables
start_time_4 = datetime.now()
t_4 = time()

print ('\n-Starting the upserting process for the input text at {}'.format(start_time_4), '\n')

# Creating our pinecone Index and upserting our indexes
docsearch = Pinecone.from_texts(docs, embeddings_roberta, index_name=index_name)


print('Time to upsert the input text: {} mins'.format(round((time() - t_4) / 60, 4)),'\n')

Index Description: IndexDescription(name='roberta1024', metric='cosine', replicas=1, dimension=1024.0, shards=1, pods=1, pod_type='p1', status={'ready': True, 'state': 'Ready'}, metadata_config=None, source_collection='')

-Starting the upserting process for the input text at 2023-06-23 15:47:05.530790 

Time to embed the input text: 2.2924 mins 



In [43]:
"""
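Prevents overwriting an existing index
"""
import os
import pinecone
from langchain.vectorstores import Pinecone

# Initialize Pinecone
# The API key is read from an environment variable (the name PINECONE_API_KEY is an assumption) rather than hard-coded
pinecone.init(
    api_key=os.environ["PINECONE_API_KEY"],
    environment="us-west4-gcp-free"
)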

active_indexes = pinecone.list_indexes()
index_description = pinecone.describe_index(active_indexes[0])
print("Index Description:", index_description)
index_name = 'roberta1024'

# Create and configure the index if it doesn't already exist
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        name=index_name,
        metric="cosine",
        dimension=1024)
    # docs is a list of strings, so from_texts is used rather than from_documents
    docsearch = Pinecone.from_texts(docs, embeddings_roberta, index_name=index_name)

else:
    docsearch = Pinecone.from_existing_index(index_name, embeddings_roberta)

Index Description: IndexDescription(name='roberta1024', metric='cosine', replicas=1, dimension=1024.0, shards=1, pods=1, pod_type='p1', status={'ready': True, 'state': 'Ready'}, metadata_config=None, source_collection='')


## Similarity Search:

In [44]:
# Simplest similarity search
query = "obama was born in honolulu, hawaii"
docs = docsearch.similarity_search(query)
docs

[Document(page_content='Barack Obama Sr. (1934–1982), was a married Luo Kenyan from Nyangoma Kogelo. Obamas parents met in 1960 in a Russian language class at the University of Hawaii at Manoa, where his father was a foreign student on a scholarship. The couple married in Wailuku, Hawaii, on February 2, 1961, six months before Obama was born.In late August 1961, a few weeks after he was born, Barack and his mother moved to the University of Washington', metadata={}),
 Document(page_content='from 2005 to 2008 and as an Illinois state senator from 1997 to 2004. , Obama was born in Honolulu, Hawaii. After graduating from Columbia University in 1983, he worked as a community organizer in Chicago. In 1988, he enrolled in Harvard Law School, where he was the first black president of the Harvard Law Review. After graduating, he became a civil rights attorney and an academic, teaching constitutional law at the University of Chicago Law School from 1992 to 2004. Turning to elective politics, he

In [49]:
# Similarity search, optionally returning the similarity score
def get_similar_docs(query,k=3,score=True):
  if score:
    similar_docs = docsearch.similarity_search_with_score(query,k=k)
    print("----Search with score", score)
  else:
    similar_docs = docsearch.similarity_search(query,k=k)
    print("----Just similarity:", score)
  return similar_docs

In [50]:
query = "obama was born in honolulu, hawaii"
similar_docs = get_similar_docs(query)
similar_docs

----Search with score True


[(Document(page_content='Barack Obama Sr. (1934–1982), was a married Luo Kenyan from Nyangoma Kogelo. Obamas parents met in 1960 in a Russian language class at the University of Hawaii at Manoa, where his father was a foreign student on a scholarship. The couple married in Wailuku, Hawaii, on February 2, 1961, six months before Obama was born.In late August 1961, a few weeks after he was born, Barack and his mother moved to the University of Washington', metadata={}),
  0.683490574),
 (Document(page_content='from 2005 to 2008 and as an Illinois state senator from 1997 to 2004. , Obama was born in Honolulu, Hawaii. After graduating from Columbia University in 1983, he worked as a community organizer in Chicago. In 1988, he enrolled in Harvard Law School, where he was the first black president of the Harvard Law Review. After graduating, he became a civil rights attorney and an academic, teaching constitutional law at the University of Chicago Law School from 1992 to 2004. Turning to ele

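**Observation**:
- The similarity score is lower with tiktoken-based chunking than with the splitter labelled spacy. Note that the spacy splitter above has its `length_function` commented out, so it actually chunks by character count.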
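
The index above uses `metric='cosine'`, so the reported score can be read as the cosine similarity between the query embedding and each chunk embedding. The sketch below shows how such a score and a top-k ranking can be computed on toy vectors. `cosine_similarity` and `top_k` are illustrative helpers, not Pinecone's implementation.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: dict, k: int = 3) -> list[tuple]:
    """Return the k (doc_id, score) pairs with the highest cosine similarity."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 3-dimensional vectors; the real embeddings here have 1024 dimensions
doc_vecs = {
    "chunk-a": [1.0, 0.0, 0.0],
    "chunk-b": [1.0, 1.0, 0.0],
    "chunk-c": [0.0, 0.0, 1.0],
}
results = top_k([1.0, 0.0, 0.0], doc_vecs, k=2)
print(results)
```

A score near 1 means the vectors point in nearly the same direction, and a score near 0 means they are unrelated.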

## Initiating HuggingFace and our **google/flan-t5-large** Model

In [60]:
# Initializing our environment
import os
# Set your own Hugging Face token here (placeholder shown); never commit a real token
os.environ['HUGGINGFACEHUB_API_TOKEN'] = '<YOUR_HUGGINGFACEHUB_API_TOKEN>'

from langchain import HuggingFaceHub
# Defining our llm model
llm = HuggingFaceHub(repo_id= 'google/flan-t5-large',
                     model_kwargs={'temperature' : 0.8, "max_length": 200}
                     )


# **Define your Retrieval QnA chain**

In [62]:
from langchain.chains import RetrievalQA

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=docsearch.as_retriever()
)

query = str(input())

qa.run(query)

Where did Barack Obama live?


'Central Jakarta'

In [None]:
"""
Do not run this block of code
"""

from langchain.chains import RetrievalQAWithSourcesChain



qa_with_sources = RetrievalQAWithSourcesChain.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=docsearch.as_retriever()
)

query_2 = str(input())

qa_with_sources(query_2)

## Observations:

- Spacy has so far been the better alternative for chunking, since similarity scores reach above 0.7, though tiktoken is not far behind. (With `length_function` commented out, the spacy splitter in this notebook chunks by character count.)
- Embedding at times takes significantly longer in LangChain than when the SentenceTransformer module is used directly. For instance, over 5 mins using LangChain versus 2.5 minutes using SentenceTransformer directly.

## Improvements to be made:
- Figure out a way to use spacy tokenization in the LangChain framework as the *length_function* parameter of the RecursiveCharacterTextSplitter() method.
- Try embedding the query as well to check what similarity scores you get, and then run your RetrievalQA chain.
- Figure out a way to retrieve citations for your answer using RetrievalQAWithSourcesChain. There is an error:
  - ValueError: Document prompt requires documents to have metadata variables: ['source']. Received document with missing metadata: ['source'].
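
One possible direction for the `source` error: the chunks were upserted as plain strings, and the search results above show `metadata={}`, so no `source` key is available to the sources chain. The sketch below builds one metadata dict per chunk. `build_metadatas` and the `#chunk-i` naming are hypothetical, and the commented call assumes that `Pinecone.from_texts` accepts a `metadatas` list, which should be checked against the installed LangChain version.

```python
def build_metadatas(chunks: list[str], source_name: str) -> list[dict]:
    """Create one metadata dict per chunk, each carrying a 'source' key."""
    return [{"source": f"{source_name}#chunk-{i}"} for i in range(len(chunks))]


# Hypothetical chunks standing in for the output of split_docs(...)
chunks = ["first chunk of text", "second chunk of text"]
metadatas = build_metadatas(chunks, "Barack_Obama.txt")
print(metadatas)

# Assumed usage (not run here): pass the list alongside the texts when upserting, e.g.
# Pinecone.from_texts(chunks, embeddings_roberta, metadatas=metadatas, index_name=index_name)
```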