# PrepMate Using RAG and LangChain

## Setup

In [33]:
!pip install -qU langchain python-dotenv tiktoken langchain-pinecone langchainhub pandas langchain_community pymupdf langchain-google-genai

In [34]:
# GLOBAL
import os
import pandas as pd
import numpy as np
import tiktoken
from uuid import uuid4
# from tqdm import tqdm
from dotenv import load_dotenv
from tqdm.autonotebook import tqdm


# LANGCHAIN
import langchain
from langchain.llms import OpenAI
from langchain_community.document_loaders.csv_loader import CSVLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains.conversation.memory import ConversationBufferWindowMemory
from langchain.chains import RetrievalQA
from langchain_pinecone import PineconeVectorStore
from langchain_core.prompts import PromptTemplate

# VECTOR STORE
import pinecone
from pinecone import Pinecone, ServerlessSpec

# AGENTS
from langchain_community.tools.tavily_search import TavilySearchResults
from langchain.agents import AgentExecutor, Tool, AgentType
from langchain.agents.react.agent import create_react_agent
from langchain import hub

In [22]:
import os
os.environ['LANGCHAIN_TRACING_V2'] = 'true'
os.environ['LANGCHAIN_ENDPOINT'] = ' '
os.environ['LANGCHAIN_API_KEY'] = ' '
os.environ['GOOGLE_API_KEY'] = ' '
os.environ['PINECONE_API_KEY'] = ' '
os.environ['TAVILY_API_KEY'] = ' '


# Or use `os.getenv('GOOGLE_API_KEY')` to fetch an environment variable.
import google.generativeai as genai
GOOGLE_API_KEY= os.getenv('GOOGLE_API_KEY')
genai.configure(api_key=GOOGLE_API_KEY)
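Because the keys above are left blank, a quick check before running the rest of the notebook can save a confusing error later. The following is a minimal sketch; the helper `missing_keys` and the `REQUIRED_KEYS` list are hypothetical names introduced here, not part of any library.

```python
import os

# Keys this notebook expects (names taken from the cell above).
REQUIRED_KEYS = ["GOOGLE_API_KEY", "PINECONE_API_KEY", "TAVILY_API_KEY"]


def missing_keys(env, required=REQUIRED_KEYS):
    """Return the required keys that are unset or blank in `env`."""
    return [key for key in required if not env.get(key, "").strip()]


missing = missing_keys(os.environ)
if missing:
    print("Missing API keys:", ", ".join(missing))
```

Since `load_dotenv` is already imported in the setup cell, the keys could also be kept in a local `.env` file instead of being typed into the notebook.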

## Loading Documents
There are several Document Loaders in the LangChain library depending on the type of file to be used. The most common ones include CSV, HTML, JSON, Markdown, File Directory or Microsoft Office formats.

However, there is a more extensive [list](https://python.langchain.com/docs/integrations/document_loaders/google_drive/), where you can load directly from Google Cloud, Notion, YouTube or many other services.

We will be using a PDF file, so we will use the `PyMuPDFLoader`. Below you can find the code to load the file. As arguments we are using:

- **file path**

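`PyMuPDFLoader` returns one `Document` per page, each carrying metadata such as the source file and the page number. That metadata travels with every chunk into the vector store, so a retrieved passage can be traced back to its page.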

In [35]:
from langchain_community.document_loaders import PyMuPDFLoader
loader = PyMuPDFLoader("./ilets.pdf") # Load the document
data = loader.load()
len(data)

81

## Indexing
The **Vector Store Index** is a tool that embeds your documents into vector representations. When you want to search through these embeddings, your query is also converted into a vector embedding. Then, the Vector Store Index performs a mathematical operation to rank all the document embeddings based on how semantically similar they are to your query embedding.

The key steps are:
- Embedding your documents into vectors
- Turning your search query into a vector
- Comparing the query vector to all the document vectors
- Ranking the document vectors by their similarity to the query vector
- Returning the most relevant documents based on this ranking

This allows you to search your document collection in a semantic, meaning-based way, rather than just looking for exact keyword matches.
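The key steps above can be sketched on toy data. The three-dimensional vectors below are made up for illustration (real embeddings in this notebook have 768 dimensions), and `rank` is a hypothetical helper, not a LangChain or Pinecone function.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank(query_vec, doc_vecs, k=2):
    """Return the names of the k documents most similar to the query."""
    scored = [(cosine(query_vec, vec), name) for name, vec in doc_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


# Hand-made "embeddings" standing in for real document vectors.
docs = {
    "writing_task_1": [0.9, 0.1, 0.0],
    "writing_task_2": [0.7, 0.6, 0.1],
    "visa_rules": [0.0, 0.2, 0.9],
}
query = [1.0, 0.2, 0.0]

top = rank(query, docs)
print(top)  # ['writing_task_1', 'writing_task_2']
```

A vector store performs the same comparison, but with approximate nearest-neighbour indexes so that it does not have to score every document for every query.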

To understand the process of vector search, we will analyze the concepts of tokenization, similarity, and embedding, which are implemented by embedding models.

According to OpenAI, as a rule of thumb 1 token corresponds to roughly 4 characters of common English text, or about ¾ of a word. This means that 100 tokens correspond to approximately 75 words.
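The rule of thumb can be turned into a quick estimate. This is only a sketch of the arithmetic: the helper names are hypothetical, and the ratios describe OpenAI tokenizers on English text, so counts for other models such as Gemini may differ.

```python
def estimate_tokens(text):
    """Rough token count, assuming about 4 characters per token."""
    return round(len(text) / 4)


def estimate_words(tokens):
    """Rough word count, assuming about 0.75 words per token."""
    return round(tokens * 0.75)


print(estimate_tokens("a" * 400))  # 100
print(estimate_words(100))         # 75
```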

## Embeddings
Embeddings are a way to represent high-dimensional sparse data, such as words, in a more compact, lower-dimensional form while preserving the meaningful similarities between the original data points. The key ideas are:

- **Capturing Similarities:** Similar items, like synonymous words, will have embedding vectors that are close to each other.

- **Spatial Representation:** The embedding vectors are positioned in a multi-dimensional space such that the distance between them (e.g. cosine similarity) reflects how related the original data points are.

<img width="1000" alt="Untitled (1)" src="https://github.com/benitomartin/nlp-news-classification/assets/116911431/a7e044ab-2c40-47a2-bb05-e86962790ce0">



### Cosine Similarity
The most common metric used for similarity search is **cosine similarity**. It finds application in scenarios like semantic search and document classification, because it enables the comparison of vector directions, effectively assessing the overall content of documents. By comparing the vector representations of the query and the documents, cosine similarity can identify the most similar and relevant documents to return in the search results.

<img width="1000" alt="Screenshot 2024-05-02 123447" src="https://github.com/benitomartin/nlp-news-classification/assets/116911431/f5356422-29d6-4a4c-8e11-267ad9115b51">



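Cosine similarity is a measure of the similarity between two non-zero vectors. It calculates the cosine of the angle between the two vectors, which results in a value between 1 (same direction) and -1 (opposite direction), with 0 meaning the vectors are orthogonal. Because only the angle matters, the lengths of the vectors do not affect the score.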

<img width="1000" alt="Screenshot 2024-05-02 122629" src="https://github.com/benitomartin/nlp-news-classification/assets/116911431/4215ba02-1fb9-4a72-ad9c-e88740b5a71a">
    

In [36]:
def cosine_similarity(query_emb, document_emb):

    # Calculate the dot product of the query and document embeddings
    dot_product = np.dot(query_emb, document_emb)

    # Calculate the L2 norms (magnitudes) of the query and document embeddings
    query_norm = np.linalg.norm(query_emb)
    document_norm = np.linalg.norm(document_emb)

    # Calculate the cosine similarity
    cosine_sim = dot_product / (query_norm * document_norm)
    return cosine_sim
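As a sanity check of the formula, the boundary cases can be tried on toy vectors. The sketch below re-implements the same calculation with the standard library only, so it runs on its own; it is not the notebook's `cosine_similarity` function.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two non-zero vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


same_direction = cosine([1.0, 2.0], [2.0, 4.0])  # parallel, different length
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])      # at right angles
opposite = cosine([1.0, 2.0], [-1.0, -2.0])      # pointing the other way

# Expect values close to 1, exactly 0, and close to -1 respectively.
print(same_direction, orthogonal, opposite)
```

The first case shows that scaling a vector does not change the score, which is why cosine similarity compares content rather than text length.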

**Simple Example of Cosine Similarity**


In [37]:
from langchain_google_genai import GoogleGenerativeAIEmbeddings
embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")

question = "what is ILETS?"
document = "IELTS stands for the International English Language Testing System – an English Language proficiency test. Globally, there are more than 4 million test takers a year, making IELTS the world’s most popular English language proficiency test for higher education and global migration. IELTS is developed and run by the British Council in partnership with IDP Education and Cambridge Assessment English."

# Embed the question and the document with the Google Generative AI embeddings model

query_emb = embeddings.embed_query(question)
document_emb = embeddings.embed_query(document)
cosine_sim = cosine_similarity(query_emb, document_emb)
print(f'Query Dimensions: {len(query_emb)}')
print(f'Document Dimensions: {len(document_emb)}')
print("Cosine Similarity:", cosine_sim)


Query Dimensions: 768
Document Dimensions: 768
Cosine Similarity: 0.6663191415538716


## Text Splitting

Unfortunately, LLMs have limitations when processing text. One of them is the **context window**: the maximum number of tokens a model can process at one time as input to generate a response. We therefore need to split our documents into smaller chunks that fit into the model's context window. A complete list of OpenAI models can be found [here](https://platform.openai.com/docs/models/gpt-4-turbo-and-gpt-4). Context windows range from 4,096 tokens for `gpt-3.5-turbo-instruct` to 128,000 tokens for `gpt-4-turbo`.



Like the data loaders, LangChain offers several text splitters. The table below shows the main splitting methods and when to use each one. The `Adds Metadata` column does not indicate whether metadata from the previous loader is carried over. It indicates whether the splitter itself adds metadata: for example, the `HTMLHeaderTextSplitter` splits text at the element level and adds metadata to each chunk based on header text.

In our case the metadata is already available, so we do not need a splitter to add it.

<img width="863" alt="Screenshot 2024-04-30 132645" src="https://github.com/benitomartin/mlops-car-prices/assets/116911431/7b4dcefb-9320-4085-821f-1b21c81f4d28">


**Source:** https://js.langchain.com/v0.1/docs/modules/data_connection/document_transformers/

The `RecursiveCharacterTextSplitter` is the recommended tool for splitting general text. It segments the text based on a defined chunk size, using a list of characters as separators.

According to LangChain, the default separators include ["\n\n", "\n", " ", ""]. This means it aims to keep paragraphs together first, followed by sentences and words, as they typically exhibit the strongest semantic connections in text.
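The separator fallback can be illustrated with a simplified sketch. The function `recursive_split` below is a hypothetical stand-in written for this explanation: unlike the real `RecursiveCharacterTextSplitter`, it does not merge small neighbouring pieces back together and it ignores overlap.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most chunk_size characters.

    Tries the first separator; any piece that is still too long is split
    again with the next separator in the list.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    chunks = []
    for part in text.split(sep):
        chunks.extend(recursive_split(part, chunk_size, rest))
    return chunks


pieces = recursive_split("para one\n\npara two is longer", chunk_size=10)
print(pieces)  # ['para one', 'para', 'two', 'is', 'longer']
```

The first paragraph fits and stays whole. The second is too long, contains no newline, and therefore falls through to word-level splitting.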

The same splitter can be combined with the tiktoken library (for example through `RecursiveCharacterTextSplitter.from_tiktoken_encoder`) so that chunk sizes are measured in tokens and no split exceeds the token limit of the language model. Any split that exceeds the limit is recursively divided again. Note that tiktoken implements OpenAI's tokenizers, so for Gemini its counts are only an approximation.

The final design of our text splitter will be as follows:

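- Model: `gemini-1.5-flash`, the LLM used later in this notebook. Its context window is far larger than any single chunk, so the chunk size is chosen for retrieval quality rather than to fit the window

- Chunk Size: maximum size of one chunk. With the splitter's default length function (`len`), as used here, this is measured in characters, not tokens

- Chunk Overlap: amount of text shared between two consecutive chunks, in the same unit as the chunk size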

- Separators: the order of separators

In [38]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=20,
    separators=["\n\n", "\n", " ", ""]
)
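To see what chunk size and chunk overlap mean in isolation, here is a minimal fixed-window sketch. It is an assumption-laden simplification: `sliding_chunks` is a hypothetical helper that cuts at fixed character positions, whereas the real splitter chooses boundaries at separators.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share
    chunk_overlap characters with the previous window."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


windows = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(windows)  # ['abcd', 'defg', 'ghij']
```

Each window starts with the last character of the previous one, so a sentence that straddles a boundary is still partly visible in both chunks.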

## Vector Stores
A Vector Store is a specialized database designed to store and manage high-dimensional vector data. Vector databases store data in the form of vector embeddings, which can be retrieved and passed to an LLM as context, helping it produce better-grounded responses.

### Indexing
Pinecone is a managed vector database with a serverless option, designed for fast vector search and retrieval.

The first step to use Pinecone is to create an Index where our embeddings will be stored. There are several parameters to be considered for this:

- Index name
- Dimension: must be equal to the embedding model dimensions
- Metric: should match the metric used to train the embedding model for better results
- Serverless specifications

In [55]:
index_name = "langchain-pinecone-hammad"
PINECONE_API_KEY = os.getenv('PINECONE_API_KEY')
pc = Pinecone(api_key = PINECONE_API_KEY)

In [56]:
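# Delete the index only if it already exists, so this cell is safe on a first run
if index_name in pc.list_indexes().names():
    pc.delete_index(index_name)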

In [57]:
pc.create_index(
    name=index_name,
    dimension=768,
    metric="cosine",
    spec=ServerlessSpec(
        cloud="aws",
        region="us-east-1"))

index = pc.Index(index_name)

In [42]:
index.describe_index_stats()

{'dimension': 768,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

### Namespaces
Pinecone allows you to split the data into namespaces within an index. This lets you send queries to a specific namespace. You could, for example, split your data by content, language or any other criterion suitable for your use case.
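The idea of namespaces can be sketched with a small in-memory stand-in. `TinyIndex` is a hypothetical class written for this illustration; it uses keyword matching instead of vector search and does not reflect Pinecone's API.

```python
from collections import defaultdict


class TinyIndex:
    """In-memory stand-in for an index partitioned into namespaces."""

    def __init__(self):
        self._namespaces = defaultdict(dict)

    def upsert(self, namespace, record_id, text):
        self._namespaces[namespace][record_id] = text

    def search(self, namespace, keyword):
        """Return ids of records in one namespace that contain the keyword."""
        records = self._namespaces.get(namespace, {})
        return [rid for rid, text in records.items() if keyword.lower() in text.lower()]

    def stats(self):
        return {ns: len(records) for ns, records in self._namespaces.items()}


idx = TinyIndex()
idx.upsert("english", "e1", "Writing task overview")
idx.upsert("spanish", "s1", "Resumen de la tarea de writing")

print(idx.search("english", "writing"))  # ['e1']
print(idx.stats())                       # {'english': 1, 'spanish': 1}
```

A query sent to one namespace never sees records stored in another, which is the property the real index provides.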

In this example we upload all chunks of the document to a single namespace called `main`.

In [58]:
splits = text_splitter.split_documents(data)
embed = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
db = PineconeVectorStore.from_documents(documents=splits,
                                        embedding=embed,
                                        index_name=index_name,
                                        namespace="main"
                                        )

In [59]:
vectorstore = PineconeVectorStore(index_name=index_name,
                                  namespace="main",
                                  embedding=embed)

In [60]:
query = "What is ILETS Writting Task 1"
similarity = vectorstore.similarity_search(query, k=4)

for i, doc in enumerate(similarity):
  print(f"-------Result Nr. {i}-------")
  print(f"Page Content: {doc.page_content}")
  print(" ")

-------Result Nr. 0-------
Page Content: fluency. The last one, “Grammatical Range and Accuracy” examines your use of grammatical structures 
like the correct use of tenses. Using complex sentences, a variety of tenses, and advanced verb forms 
like passive voice will help you get a higher mark in this section. However, note the word “Accuracy” in 
the title. It is more important to be correct than to be adventurous in your language.  
Like all IELTS scores, writing task 1 is evaluated from 0 to 9. To calculate your score, you should add all
 
-------Result Nr. 1-------
Page Content: Task 2 
 
Overview 
IELTS writing task 2 is the second and most important task of the writing examination. In this task you 
are presented with a topic and you are asked to write an essay of at least 250 words responding to it. 
The topic can be an opinion, an argument, or a troublesome situation, and you are expected to respond 
to it in some way, depending on the question. You might be asked to give your

In [61]:
index.describe_index_stats()

{'dimension': 768,
 'index_fullness': 0.0,
 'namespaces': {'main': {'vector_count': 389}},
 'total_vector_count': 389}

## RAG Pipeline

In [62]:
embed = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
vectorstore = PineconeVectorStore(index_name=index_name,
                                  namespace="main",
                                  embedding=embed)

### Retrieval

In [48]:
from langchain_google_genai import ChatGoogleGenerativeAI  # Import the Google Generative AI model
llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash")
# Conversational memory
conversational_memory = ConversationBufferWindowMemory(memory_key='chat_history',k=5,return_messages=True)
# Retrieval qa chain
qa_db = RetrievalQA.from_chain_type(llm=llm,chain_type="stuff",retriever=vectorstore.as_retriever())



### Augmented
We are going to use a slightly modified prompt template. First we download the ReAct template, a common template for agents that use tools, and then we add an instruction telling the agent which tool to consult first.

A collection of templates can be found in the [LangChain Hub](https://smith.langchain.com/hub).

In [49]:
prompt = hub.pull("hwchase17/react")
print(prompt.template)

Answer the following questions as best you can. You have access to the following tools:

{tools}

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Begin!

Question: {input}
Thought:{agent_scratchpad}


Now we will replace this line:

`Action: the action to take, should be one of [{tool_names}]`

By this line:

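`Action: the action to take, should be one of [{tool_names}]. Always look first in Pinecone Document Store`

The template in the next cell words this instruction slightly more explicitly ("if not then use the tool") and also limits the Thought/Action/Observation loop to 2 repetitions instead of N.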
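Instead of retyping the whole template, the same edit could be applied with a string replacement. The sketch below works on a shortened excerpt of the template; `add_tool_hint` is a hypothetical helper introduced for this example.

```python
ORIGINAL_LINE = "Action: the action to take, should be one of [{tool_names}]"
HINT = ". Always look first in Pinecone Document Store"


def add_tool_hint(template, line=ORIGINAL_LINE, suffix=HINT):
    """Append a tool-priority hint to the Action line of a ReAct template."""
    if line not in template:
        raise ValueError("Action line not found in template")
    return template.replace(line, line + suffix, 1)


# Shortened excerpt of the ReAct template, for illustration only.
excerpt = (
    "Thought: you should always think about what to do\n"
    "Action: the action to take, should be one of [{tool_names}]\n"
    "Action Input: the input to the action\n"
)

print(add_tool_hint(excerpt))
```

In the notebook, the result of `add_tool_hint(prompt.template)` could then be passed to `PromptTemplate.from_template`. Raising an error when the line is missing guards against the hub template changing silently.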



In [None]:
template= '''
          Answer the following questions as best you can. You have access to the following tools:

          {tools}

          Use the following format:

          Question: the input question you must answer
          Thought: you should always think about what to do
          Action: the action to take, should be one of [{tool_names}]. Always look first in Pinecone Document Store if not then use the tool
          Action Input: the input to the action
          Observation: the result of the action
          ... (this Thought/Action/Action Input/Observation can repeat 2 times)
          Thought: I now know the final answer
          Final Answer: the final answer to the original input question

          Begin!

          Question: {input}
          Thought:{agent_scratchpad}
          '''

prompt = PromptTemplate.from_template(template)

### Generation With Agent

We are going to set up 2 tools for our agent:

- Tavily Search API: Tavily searches several sources, such as Bing or Google, and returns the most relevant content. At the time of writing it offered 1,000 free API calls per month.

- Vectorstore: Our vector store will be used to look for the information first.
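The intended order of the two tools can be written down as plain control flow. This is only a deterministic sketch of the behaviour we hope for: in the notebook the agent makes this choice through LLM reasoning guided by the prompt, and every function name below is a hypothetical stub.

```python
def answer_with_fallback(question, primary, fallback, is_miss):
    """Ask the primary tool first; use the fallback only if it misses."""
    answer = primary(question)
    if is_miss(answer):
        return fallback(question), "fallback"
    return answer, "primary"


# Stubs standing in for the document store and the web search tool.
def fake_store(question):
    known = {"what is task 2": "An essay task."}
    return known.get(question.lower(), "I don't know")


def fake_web(question):
    return f"web result for: {question}"


def looks_like_miss(answer):
    return "don't know" in answer.lower()


print(answer_with_fallback("What is task 2", fake_store, fake_web, looks_like_miss))
print(answer_with_fallback("What is reading", fake_store, fake_web, looks_like_miss))
```

The first call is answered by the store and the second falls through to the web stub, mirroring the two agent runs shown later in the notebook.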

In [50]:
# Set up tools and agent
import os

TAVILY_API_KEY = os.getenv("TAVILY_API_KEY")

tavily = TavilySearchResults(max_results=10, tavily_api_key=TAVILY_API_KEY)

tools = [
    Tool(
        name = "Pinecone Document Store",
        func = qa_db.run,
        description = "This agent looks up information from the Pinecone Document Store"
    ),

    Tool(
        name="Tavily",
        func=tavily.run,
        description="If the information is not found by Pinecone Agent, this lookup information from Tavily",
    )

     # email sender tool

]
agent = create_react_agent(llm,tools,prompt)

agent_executor = AgentExecutor(
                        tools=tools,
                        agent=agent,
                        handle_parsing_errors=True,
                        verbose=True,
                        memory=conversational_memory)

Now that everything is set up, let's run the agent.

In [51]:
ques = input('Enter your query about the document: ')
response = agent_executor.invoke({"input": ques})

Enter your query about the document: Who is David S. Wills


> Entering new AgentExecutor chain...
Thought: I should first try to find information about David S. Wills using the Pinecone Document Store.  If that fails, I'll use Tavily.

Action: Pinecone Document Store
Action Input: "David S. Wills"
David S. Wills is an IELTS tutor from Scotland who has been teaching IELTS since 2010, specializing in writing skills.  He has helped hundreds of students improve their IELTS scores for immigration purposes. He also runs the website www.ted-ielts.com and is the author of "A Complete Guide to IELTS Writing".
Thought: I found the information in Pinecone, so I don't need to use Tavily.

Thought: I now know the final answer.
Final Answer: David S. Wills is an IELTS tutor from Scotland who has been teaching IELTS since 2010, specializing in writing skills. He has helped hundreds of students improve their IELTS scores for immigration purposes.

Let's try a query that doesn't have the answer in the document.

In [52]:
ques = input('Enter your query about the document: ')
response = agent_executor.invoke({"input": ques})

Enter your query about the document: What is ilets reading 


> Entering new AgentExecutor chain...
Thought: I need to find information about ilets reading.  I'll first try Pinecone to see if it has any relevant documents.

Action: Pinecone Document Store
Action Input: "ilets reading"
I'm sorry, but I don't have enough information to answer your question about IELTS reading.  The provided text focuses on the writing section of the IELTS exam.
Thought: Pinecone didn't find anything. I'll try Tavily.

Action: Tavily
Action Input: "What is the IELTS reading test like?"
[{'url': 'https://www.ieltslounge.com/how-is-the-reading-test-in-ielts/', 'content': "The IELTS Reading Test is a crucial part of the IELTS exam, whether you're taking the Academic or General format. It assesses your reading skills through various tasks, helping you demonstrate your ability to understand and interpret text. ... Familiarize yourself with

Let's check the chat context.

In [53]:
conversational_memory.load_memory_variables({})

{'chat_history': [HumanMessage(content='Who is David S. Wills', additional_kwargs={}, response_metadata={}),
  AIMessage(content='David S. Wills is an IELTS tutor from Scotland who has been teaching IELTS since 2010, specializing in writing skills. He has helped hundreds of students improve their IELTS scores for immigration purposes. He also runs the website www.ted-ielts.com and is the author of "A Complete Guide to IELTS Writing".', additional_kwargs={}, response_metadata={}),
  HumanMessage(content='What is ilets reading ', additional_kwargs={}, response_metadata={}),
  AIMessage(content='The IELTS Reading test is a 60-minute section with 40 questions based on three long texts.  The texts are excerpts from books, journals, magazines, and newspapers (for the Academic version) and assess a range of reading skills, including skimming, scanning, understanding main ideas, and reading for detail.  The test format is the same for both computer-based and paper-based tests.  There are vario
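The memory was created with `k=5`, so it keeps only the five most recent exchanges and drops older ones. The behaviour of such a window can be sketched with a bounded queue; `WindowMemory` is a hypothetical class for illustration, not LangChain's implementation.

```python
from collections import deque


class WindowMemory:
    """Keep only the k most recent (user, ai) exchanges."""

    def __init__(self, k):
        self._turns = deque(maxlen=k)  # older turns fall off automatically

    def save(self, user, ai):
        self._turns.append((user, ai))

    def history(self):
        return list(self._turns)

    def clear(self):
        self._turns.clear()


memory = WindowMemory(k=2)
memory.save("q1", "a1")
memory.save("q2", "a2")
memory.save("q3", "a3")

print(memory.history())  # [('q2', 'a2'), ('q3', 'a3')]
```

A bounded window keeps the prompt size predictable at the cost of forgetting earlier parts of a long conversation.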

Now that we are done, let's clear the memory!

In [None]:
agent_executor.memory.clear()