## Retrieval Augmented Generation (RAG) Implementation With LangChain

<span style='color:red '> <b> What is RAG ? </b> </span>

RAG is a technique for augmenting the knowledge of Large Language Models (LLMs) with additional data.

LLMs can reason about wide-ranging topics, but their knowledge is limited to the public data they were trained on, up to a specific point in time.

If we want to build AI applications that can reason about private data or data introduced after a model's cutoff date, we need to augment the knowledge of the model with the specific information it needs. The process of retrieving the appropriate information and inserting it into the model prompt is known as <b> Retrieval Augmented Generation (RAG) </b>.
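The retrieve-then-insert idea can be sketched without any external service. The snippet below is a minimal illustration, not a real retriever: it ranks documents by simple word overlap with the query, and the helper names (`tokenize`, `retrieve`, `build_rag_prompt`) and sample documents are assumptions made up for this sketch.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lower-case a string and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, documents: list[str], top_n: int = 1) -> list[str]:
    """Rank documents by shared words with the query (a stand-in for a real retriever)."""
    query_tokens = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_tokens & tokenize(doc)),
        reverse=True,
    )
    return ranked[:top_n]


def build_rag_prompt(query: str, documents: list[str], top_n: int = 1) -> str:
    """Insert the retrieved text into the prompt ahead of the query."""
    context = "\n".join(retrieve(query, documents, top_n))
    return (
        "Using the contexts below, answer the query.\n\n"
        f"Contexts:\n{context}\n\n"
        f"Query: {query}"
    )


# Illustrative documents only
docs = [
    "Llama 2 is a collection of pretrained and fine-tuned large language models.",
    "Australia Day is celebrated on January 26th each year.",
]
rag_prompt = build_rag_prompt("What is Llama 2?", docs)
print(rag_prompt)
```

A production system would replace the word-overlap ranking with embedding-based search, but the overall flow (retrieve, then place the result in the prompt) stays the same.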

<span style='color:red '> <b> What is LangChain ? </b> </span>

LangChain is an open source framework that lets software developers working with artificial intelligence (AI) and its machine learning subset combine large language models with other external components to develop LLM-powered applications. LangChain makes it easy to link powerful LLMs, such as OpenAI's GPT-3.5 and GPT-4, to an array of external data sources to create and reap the benefits of natural language processing (NLP) applications.

## Build a simple Chat Bot using LangChain and OpenAI

We will rely on the LangChain library to bring together the different components needed for the chatbot.

<b>Step-1</b>

We run the following command to set the <i>OpenAI key</i> as an environment variable (re-run this cell if the kernel restarts).

In [1]:
# NOTE : You need an API Key from OpenAI to use this functionality
import getpass
import os

# Setting up the openAI key as environment variable
os.environ["OPENAI_API_KEY"] = getpass.getpass()
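To avoid re-typing the key every time the cell is re-run, one option is to prompt only when the variable is missing. This is a minimal sketch; the helper name `ensure_api_key` is hypothetical and not part of LangChain or the OpenAI SDK.

```python
import getpass
import os


def ensure_api_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Return the key from the environment, prompting only when it is missing."""
    key = os.environ.get(var_name)
    if not key:
        # getpass hides the typed value so the key is not echoed in the notebook
        key = getpass.getpass(f"Enter {var_name}: ")
        os.environ[var_name] = key
    return key
```

Calling `ensure_api_key()` at the top of the notebook would then be safe to repeat: it is a no-op whenever the key is already set.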

Initialise the Chat GPT 3.5 object to generate responses

In [2]:
import os
from langchain_openai import ChatOpenAI

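# The key must already be set (see Step-1); fail early with a clear message if not
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set; run the Step-1 cell first.")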

# Create the OpenAI object 
# The temperature value ranges from 0 to 2, with lower values indicating 
# greater determinism, and higher values indicating more randomness.
llm_chat = ChatOpenAI(
    temperature = 0.1,
    openai_api_key=os.environ["OPENAI_API_KEY"],
    model='gpt-3.5-turbo'
)
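To build intuition for the temperature setting, the sketch below applies the textbook temperature-scaled softmax to a few made-up scores. It illustrates the general idea only; how the hosted model applies temperature internally is not covered here, and the function name and logits are assumptions for this example.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum before exponentiating for numerical stability
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
low = softmax_with_temperature(logits, 0.1)
high = softmax_with_temperature(logits, 2.0)
print("temperature 0.1:", [round(p, 3) for p in low])
print("temperature 2.0:", [round(p, 3) for p in high])
```

At a low temperature almost all probability mass sits on the top-scoring token, which is why outputs become more repeatable; at a higher temperature the distribution flattens and sampling becomes more varied.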

<b> Step -2 </b>

Now we have to build a query message and send it to the OpenAI service. Before that, we need to understand how to structure the query.

<i> Chats with *OpenAI's gpt-3.5-turbo and gpt-4 chat models* are typically structured (in plain text) like this:</i>

System: You are a helpful assistant.

User: Hi AI, how are you today?

Assistant: I'm great thank you. How can I help you?

User: I'd like to understand string theory.

In the official OpenAI ChatCompletion endpoint these would be passed to the model in a format like:

[
    
    {"role": "system", "content": "You are a helpful assistant."},

    {"role": "user", "content": "Hi AI, how are you today?"},
    
    {"role": "assistant", "content": "I'm great thank you. How can I help you?"},
    
    {"role": "user", "content": "I'd like to understand string theory."}

]
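The role/content structure can be built and checked in plain Python before it is sent anywhere. This is a small sketch under the assumption that only the three roles shown above are needed; `make_message` and `ALLOWED_ROLES` are hypothetical names for this example.

```python
import json

# Only the three roles used in the example above (an assumption of this sketch)
ALLOWED_ROLES = {"system", "user", "assistant"}


def make_message(role: str, content: str) -> dict[str, str]:
    """Build one chat message, rejecting roles outside the allowed set."""
    if role not in ALLOWED_ROLES:
        raise ValueError(f"unknown role: {role!r}")
    return {"role": role, "content": content}


chat = [
    make_message("system", "You are a helpful assistant."),
    make_message("user", "Hi AI, how are you today?"),
    make_message("assistant", "I'm great thank you. How can I help you?"),
    make_message("user", "I'd like to understand string theory."),
]

# Serialise to JSON, the form in which such a list is sent over HTTP
payload = json.dumps(chat, indent=2)
print(payload)
```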


LangChain uses a slightly different format. We build the above message format as follows:

In [4]:
from langchain.schema import (
    SystemMessage,
    HumanMessage,
    AIMessage
)

# Prepare LangChain message format
messages = [
    SystemMessage(content="You are a helpful assistant."),
    HumanMessage(content="Hi AI, how are you today?"),
    HumanMessage(content="I will like to know about Australia day in few sentences.")
]

<b> Step -3 </b>

Send the formatted message to ChatGPT to get a response

In [5]:
# Invoke OpenAI service to get a response
response = llm_chat.invoke(messages)
print(response.content)

Australia Day is a national holiday in Australia celebrated on January 26th each year. It marks the anniversary of the arrival of the First Fleet at Port Jackson in New South Wales in 1788, which led to the establishment of the first European settlement in Australia. The day is a time for Australians to come together to celebrate their country, culture, and achievements. However, it is also a day that is controversial for many Indigenous Australians, who see it as a day of mourning and invasion. The date has sparked debate and calls for it to be changed to a more inclusive day that recognizes the history and culture of all Australians.


Because `response` is just another `AIMessage` object, we can append it to `messages`, add another `HumanMessage`, and generate the next response in the conversation.

In [6]:
# Add latest AI response to messages
messages.append(response)

# Now create a new user prompt
prompt = HumanMessage(
    content="Is Australian day a public holiday ?")

# Add to messages
messages.append(prompt)

# Send query to OpenAI service
response = llm_chat.invoke(messages)
print(response.content)

Yes, Australia Day is a public holiday in Australia. It is a day off for the general population, and many businesses and government offices are closed on this day. There are usually various events and activities held across the country to celebrate Australia Day, including fireworks, concerts, barbecues, and citizenship ceremonies.
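The append-and-invoke pattern generalises to any number of turns. The sketch below shows the bookkeeping with a stand-in model so it runs offline; `fake_model` and `chat_turn` are hypothetical names, and a real setup would call the chat model in place of the stub.

```python
from typing import Callable

Message = dict[str, str]


def fake_model(history: list[Message]) -> str:
    """Stand-in for a chat model: reports how many user turns it has seen."""
    user_turns = sum(1 for m in history if m["role"] == "user")
    return f"(reply to user turn {user_turns})"


def chat_turn(
    history: list[Message],
    user_text: str,
    model: Callable[[list[Message]], str],
) -> str:
    """Append the user message, get a reply, and append the reply to the history."""
    history.append({"role": "user", "content": user_text})
    reply = model(history)
    history.append({"role": "assistant", "content": reply})
    return reply


history = [{"role": "system", "content": "You are a helpful assistant."}]
first = chat_turn(history, "Tell me about Australia Day.", fake_model)
second = chat_turn(history, "Is it a public holiday?", fake_model)
print(first)
print(second)
```

Because the whole history is passed on every call, the model can resolve references such as "it" in the second question.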


**Optional Work # 1** 
We can change the persona of ChatGPT, for example by having it respond in French or another language other than English.

In [7]:
# Preparing LangChain message prompt
messages = [
    SystemMessage(content="You are a helpful assistant that translates from english to french."),
    HumanMessage(content="Hi AI, how are you today?"),    
    HumanMessage(content="How was Neploean ?")
]

response = llm_chat.invoke(messages)
print(response.content)

Bonjour AI, comment vas-tu aujourd'hui ?


**Optional Work # 2** 
We can also change the persona of ChatGPT so that the OpenAI service produces its output in JSON format.

In [8]:
# Preparing LangChain message prompt
messages = [
    SystemMessage(content="You are a helpful assistant that parses json output."),
    HumanMessage(content="How was Neploean ? Produce answer in json format")
]

# Create the OpenAI object 
# The temperature value ranges from 0 to 2, with lower values indicating 
# greater determinism, and higher values indicating more randomness.
llm_chat_json = ChatOpenAI(
    temperature = 0.1,
    openai_api_key=os.environ["OPENAI_API_KEY"],
    model='gpt-3.5-turbo'
)

response = llm_chat_json.invoke(messages)
print(response.content)

{
    "response": "I'm sorry, but I cannot provide information on how Napoleon was as it is a historical question. If you have any specific questions or need assistance with something else, feel free to ask!"
}
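Asking for JSON in the prompt does not guarantee that the reply is valid JSON, so it is worth validating before using it. The following is a minimal sketch; `parse_json_reply` is a hypothetical helper and the sample strings are made up for illustration.

```python
import json
from typing import Optional


def parse_json_reply(text: str) -> Optional[dict]:
    """Return the parsed JSON object, or None when the reply is not a JSON object."""
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return None
    # A bare string or list is valid JSON but not the object we asked for
    return parsed if isinstance(parsed, dict) else None


good = parse_json_reply('{"response": "example text"}')
bad = parse_json_reply("Sure! Here is the answer you asked for.")
print(good)
print(bad)
```

In the notebook, the same check could be applied to `response.content` before any keys are read from it.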


<b> Step -4 Dealing with Hallucinations </b>

The knowledge of LLMs can be limited because LLMs learn all they know during training. An LLM essentially compresses the "world" as seen in the training data into the internal parameters of the model. This knowledge is called the parametric knowledge of the model.

By default, LLMs have no access to the external world.

So, we can expect hallucinated or unhelpful output from an LLM if we ask about more recent information, for example the Llama 2 language model.

In [9]:
# Preparing LangChain message prompt
messages = [
    SystemMessage(content="You are a helpful assistant."),
    HumanMessage(content="Can you please tell me about llama 2 ?")
]

# Getting response from OpenAI
response = llm_chat.invoke(messages)
print(response.content)

I'm sorry, but I don't have any information about "llama 2." Could you provide more context or clarify your question so I can better assist you?


We can see that the OpenAI model could not answer the question. To tackle this issue, we feed knowledge into the LLM in another way. It is called <b>source knowledge, and it refers to any information fed into the LLM via the prompt</b>. We can try that with the Llama 2 question, using a description taken from the Llama 2 source page:

In [10]:
# Creating source knowledge
llama2_information = [
    "Code Llama is a code generation model built on Llama 2, trained on 500B tokens of code. It supports common programming languages being used today, including Python, C++, Java, PHP, Typescript (Javascript), C#, and Bash.",
    "In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closed-source models. We provide a detailed description of our approach to fine-tuning and safety improvements of Llama 2-Chat in order to enable the community to build on our work and contribute to the responsible development of LLMs."
]
source_knowledge = "\n".join(llama2_information)

# Creating a query with source knowledge
query = "Can you tell me about the llama 2 ?"
augmented_prompt = f"""Using the contexts below, answer the query.
        Contexts:
        {source_knowledge}

        Query: {query}"""
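One detail worth noting: in a triple-quoted f-string, the spaces used to indent the continuation lines become part of the prompt text. If that is not wanted, the prompt can be assembled from explicit pieces instead. This is a sketch of one way to do it; `build_augmented_prompt` is a hypothetical helper name.

```python
def build_augmented_prompt(query: str, contexts: list[str]) -> str:
    """Assemble the augmented prompt without stray leading whitespace."""
    source_knowledge = "\n".join(contexts)
    return (
        "Using the contexts below, answer the query.\n\n"
        f"Contexts:\n{source_knowledge}\n\n"
        f"Query: {query}"
    )


demo_prompt = build_augmented_prompt(
    "Can you tell me about the llama 2 ?",
    ["First context passage.", "Second context passage."],  # placeholder contexts
)
print(demo_prompt)
```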


Now we feed this additional information to the model.

In [11]:
# create a new user prompt
prompt = HumanMessage(
    content=augmented_prompt
)
# add to messages
messages.append(prompt)

# send to OpenAI
res = llm_chat.invoke(messages)
print(res.content)

Llama 2 is a collection of pretrained and fine-tuned large language models (LLMs) developed for dialogue use cases. Ranging in scale from 7 billion to 70 billion parameters, Llama 2 includes models optimized for chat applications, known as Llama 2-Chat. These models have shown superior performance compared to open-source chat models across various benchmarks. Additionally, human evaluations have indicated that Llama 2-Chat models are helpful and safe, potentially serving as viable alternatives to closed-source models. The creators of Llama 2 have provided detailed insights into their fine-tuning process and safety enhancements, encouraging the community to leverage their work and contribute to the responsible advancement of large language models.


The answer is now accurate and specific. This is possible because we augmented the query with external knowledge (source knowledge). Instead of supplying this information by hand, we can use a vector database to retrieve it automatically.

## Building RAG Chatbots with LangChain

In this example, we will build an AI chatbot from start to finish that retrieves information about Llama 2 automatically instead of relying on manually supplied text (<b>source knowledge provided using a vector database</b>). We will use LangChain, HuggingFace embeddings, OpenAI, and a vector database to build a chatbot capable of learning from the external world using Retrieval Augmented Generation (RAG).

We will use two techniques to build our chatbot:

1- Scrape a dataset sourced from the Llama 2 ArXiv paper and other related papers to help our chatbot answer questions about the latest and greatest in the world of GenAI.

2- Scrape multiple webpages to help our chatbot answer questions about the latest and greatest in the world of GenAI.

### <span style='color:red '> <b> Technique 2 </b> </span>

<b> <span style='color:blue '>1. Load data from webpages </span></b>

We will perform web scraping to read data from <i>multiple URLs or webpages</i> to help our chatbot answer questions about the latest and greatest in the world of GenAI.

In [12]:
from langchain_community.document_loaders import WebBaseLoader

# Create a loader object
html_loader = WebBaseLoader(["https://ai.meta.com/resources/models-and-libraries/llama/",\
                        "https://zapier.com/blog/llama-meta/",\
                        "https://en.wikipedia.org/wiki/LLaMA"])

# Load data from html pages
data = html_loader.load()
print(len(data[0].page_content))
print(len(data[1].page_content))
print(len(data[2].page_content))

316
11914
24415


<b> <span style='color:blue '>2. Split document into chunks </span></b>

The largest loaded document is about 24k characters long. This is too long to fit in the context window of many models. To handle this, we'll split each `Document` into chunks for embedding and vector storage. This should help us retrieve only the most relevant parts of the pages at run time.

We will split our documents into chunks of 1000 characters with 20 characters of overlap between chunks. We use the `RecursiveCharacterTextSplitter`, which recursively splits the document using common separators like new lines until each chunk is the appropriate size. This is the recommended text splitter for generic text use cases.

We set `add_start_index=True` so that the character index at which each split Document starts within the initial Document is preserved as metadata attribute "start_index".
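To make chunk size, overlap, and the start index concrete, here is a simplified fixed-window splitter. It is not the recursive algorithm used by `RecursiveCharacterTextSplitter` (it ignores separators entirely); `split_with_overlap` is a hypothetical function written only to show how the three values relate.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[dict]:
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({"start_index": start, "text": text[start:start + chunk_size]})
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=2)
for chunk in chunks:
    print(chunk["start_index"], chunk["text"])
```

Each chunk begins with the last two characters of the previous one, so text that straddles a boundary still appears whole in at least one chunk, and `start_index` records where the chunk sits in the original document.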

In [13]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=20, add_start_index=True
)
all_splits = text_splitter.split_documents(data)
print(all_splits)
print(len(all_splits))

[Document(page_content='Update Your Browser\n\n\n\n\nUpdate Your BrowserYou’re using a web browser that isn’t supported by Facebook.To get a better experience, go to one of these sites and get the latest version of your preferred browser:Google ChromeMozilla FirefoxGet Facebook on Your PhoneStay connected anytime, anywhere.', metadata={'source': 'https://ai.meta.com/resources/models-and-libraries/llama/', 'title': 'Update Your Browser', 'language': 'en', 'start_index': 2}), Document(page_content="Meta AI: What is Llama 3 and why does it matter?Skip to contentProductZapier Automation PlatformNo-code automation across 7,000+ appsPRODUCTSZapsDo-it-yourself automation for workflowsTablesDatabases designed for workflowsInterfacesCustom pages to power your workflowsCAPABILITIESApp integrationsExplore 7,000+ app integrationsAI automationCutting-edge AI to upgrade your workflowsSecurityEnterprise-grade securityWhat's newCanvasBetaPlan and map your workflows with AIAI ChatbotBetaAnswer customer

<b> <span style='color:blue '>3. Store chunks to a vector Database </span></b>

We need to index our 34 text chunks so that we can search over them at runtime. The most common way to do this is to embed the contents of each document split and insert these embeddings into a vector database (or vector store). 

At query time, we embed the search query and run a similarity search (cosine similarity is a common measure) to find the splits most similar to the query. We will use the <i>Chroma vector database</i>.
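The similarity search itself can be sketched in a few lines of plain Python. The vectors below are tiny made-up embeddings, and `cosine_similarity` and `top_k` are hypothetical helpers; a vector database performs the same ranking far more efficiently over real embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], index: dict[str, list[float]], k: int) -> list[str]:
    """Return the names of the k stored vectors most similar to the query."""
    ranked = sorted(
        index.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Made-up 3-dimensional "embeddings" for three chunks
toy_index = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.7, 0.7, 0.0],
    "chunk_c": [0.0, 0.0, 1.0],
}
best = top_k([1.0, 0.1, 0.0], toy_index, k=2)
print(best)
```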

In [14]:
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

# Create an embedding model to generate embeddings from text chunks
# Any embedding model can be used
model_id = 'sentence-transformers/all-MiniLM-L6-v2'
model_kwargs = {'device': 'cpu'}
hf_embedding = HuggingFaceEmbeddings(
    model_name=model_id,
    model_kwargs=model_kwargs)

# Generate and store embeddings to vector store
vector_store = Chroma.from_documents(documents=all_splits, embedding=hf_embedding)

To retrieve the relevant content from the vector database, we create a retriever object and then pass it a query to get the relevant output.

In [15]:
# Create a retriever object to get top 3 matching chunks
retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 3})
# Use query to get relevant output
retrieved_docs = retriever.invoke("What are the approaches to Task Decomposition?")
print(retrieved_docs)

[Document(page_content="Meta AI: What is Llama 3 and why does it matter?Skip to contentProductZapier Automation PlatformNo-code automation across 7,000+ appsPRODUCTSZapsDo-it-yourself automation for workflowsTablesDatabases designed for workflowsInterfacesCustom pages to power your workflowsCAPABILITIESApp integrationsExplore 7,000+ app integrationsAI automationCutting-edge AI to upgrade your workflowsSecurityEnterprise-grade securityWhat's newCanvasBetaPlan and map your workflows with AIAI ChatbotBetaAnswer customer questions with AI chatbotsCentralBetaCreate your own AI bots for any taskExplore templatesExplore use casesJoin Zapier Early AccessSolutionsSolutionsHow Zapier can help you automate your work across teamsBy teamRevOpsDrive revenue through automationMarketingMultiply campaign effectiveness and ROIITBetter manage systems with automationSalesClose more dealsCustomer SupportElevate customer satisfactionLeadersStreamline decision-making processesBy appSalesforceHubSpotSlackOpen

<b> <span style='color:blue '>4. Implementing RAG </span></b>

We create a function that uses the query to retrieve the top `top_n` matching chunks from the vector database. The retrieved text is used as source knowledge and passed along with the query to the OpenAI service to produce a meaningful output.

In [16]:
# Build an augmented prompt from the top_n best-matching chunks in the vector database
def augment_prompt(query: str, top_n: int):
    # Create a retriever object to get the top_n matching chunks
    retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": top_n})
    # Use query to get relevant output
    retrieved_docs = retriever.invoke(query)
    # Join the text of the retrieved chunks, separated by newlines
    source_knowledge = "\n".join(doc.page_content for doc in retrieved_docs)
    
    # Feed into an augmented prompt
    augmented_prompt = f"""Using the contexts below, answer the query.
        Contexts:
        {source_knowledge}

        Query: {query}"""
    
    return augmented_prompt

The augmented prompt is ready. We pass this prompt to the OpenAI service to get a reliable answer. 

In [17]:
from langchain_openai import ChatOpenAI
import os
from langchain.schema import (
    SystemMessage,
    HumanMessage,
    AIMessage
)

# Create the OpenAI object 
# The temperature value ranges from 0 to 2, with lower values indicating 
# greater determinism, and higher values indicating more randomness.
llm_chat_new = ChatOpenAI(
    temperature = 0.1,
    openai_api_key=os.environ["OPENAI_API_KEY"],
    model='gpt-3.5-turbo'
)

# Prepare LangChain message format
messages = [
    SystemMessage(content="You are a helpful assistant.")
]

query = "Can you please tell me about llama 2 ?"

# Without RAG 
prompt = HumanMessage(
    content=query)
messages.append(prompt)
response = llm_chat_new.invoke(messages)
print(f"Output without RAG : {response.content}")

# With RAG 
prompt = HumanMessage(
    content=augment_prompt(query,2))
messages.append(prompt)
response = llm_chat_new.invoke(messages)
print(f"Output with RAG : {response.content}")

Output without RAG : I'm sorry, but I don't have any information about "llama 2." Could you provide more context or clarify your question so I can better assist you?
Output with RAG : Llama 2, announced on July 18, 2023, is the next generation of Llama developed by Meta in partnership with Microsoft. It includes foundational models and fine-tuned models for chat applications. The model architecture remains largely unchanged from the previous LLama-1 models, but 40% more data was used to train the foundational models. Llama 2 was released in three model sizes: 7, 13, and 70 billion parameters. The models are released with weights and are free for many commercial use cases, although there are some remaining restrictions. Model weights for the first version of Llama were made available to the research community under a non-commercial license, and access was granted on a case-by-case basis. Unauthorized copies of the model were shared via BitTorrent, leading to takedown requests from Meta 

Similarly, the user can ask about other recent information and get an answer drawn from the retrieved content.

In [18]:
from langchain_openai import ChatOpenAI
import os
from langchain.schema import (
    SystemMessage,
    HumanMessage,
    AIMessage
)

# Preparing LangChain message prompt
messages = [
    SystemMessage(content="You are a helpful assistant."),
    HumanMessage(content="Can you please tell me about llama 2 ?")
]

query= "What safety measures were used in llama 2?"

# Without RAG 
prompt = HumanMessage(
    content=query)
messages.append(prompt)
response = llm_chat_new.invoke(messages)
print(f"Output without RAG : {response.content}")

# With RAG
prompt = HumanMessage(
    content=augment_prompt(query,3))
messages.append(prompt)
response = llm_chat_new.invoke(messages)
print(f"Output with RAG : {response.content}")

Output without RAG : I'm sorry, but I don't have specific information about a llama 2 project or safety measures related to it. If you can provide more context or details, I'd be happy to try to help you further. Alternatively, if you have any general questions about safety measures or best practices, feel free to ask!
Output with RAG : Llama 2 incorporated safety measures such as foundational models and fine-tuned models for chat applications. The models were released with weights and were freely available for many commercial use cases. However, there were some restrictions in place, and unauthorized copies of the model were shared via BitTorrent. In response, Meta AI issued DMCA takedown requests against repositories sharing the model links on GitHub. Subsequent versions of Llama were made accessible outside academia and released under licenses allowing for some commercial use.
