## Retrieval Augmented Generation (RAG) Implementation With LangChain

<span style='color:red '> <b> What is RAG ? </b> </span>

RAG is a technique for augmenting LLM (Large Language Models) knowledge with additional data.

LLMs can reason about wide-ranging topics, but their knowledge is limited to the public data available up to the point in time at which they were trained.

If we want to build AI applications that can reason about private data or data introduced after a model's cutoff date, we need to augment the knowledge of the model with the specific information it needs. The process of retrieving the appropriate information and inserting it into the model prompt is known as <b> Retrieval Augmented Generation (RAG) </b>.
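
As a rough illustration of the idea, the sketch below picks the most relevant snippet for a question using simple word overlap and inserts it into the prompt. The snippets, the `tokenize`, `retrieve` and `build_prompt` helpers, and the overlap scoring are all illustrative assumptions; a real pipeline would normally use embeddings and a vector database for the retrieval step.

```python
import re

# Hypothetical private knowledge that a model was never trained on
documents = [
    "The office cafeteria is closed on Fridays.",
    "Project Falcon ships its first release in March.",
]

def tokenize(text):
    # Lowercase the text and split it into a set of alphanumeric words
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, docs):
    # Return the document that shares the most words with the query
    query_words = tokenize(query)
    return max(docs, key=lambda d: len(query_words & tokenize(d)))

def build_prompt(query, docs):
    # Insert the retrieved context into the prompt sent to the model
    context = retrieve(query, docs)
    return (
        "Using the context below, answer the query.\n\n"
        f"Context:\n{context}\n\n"
        f"Query: {query}"
    )

print(build_prompt("When does Project Falcon ship?", documents))
```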

<span style='color:red '> <b> What is LangChain ? </b> </span>

LangChain is an open source framework that lets software developers working with artificial intelligence (AI) and its machine learning subset combine large language models with other external components to develop LLM-powered applications. LangChain makes it easy to link powerful LLMs, such as OpenAI's GPT-3.5 and GPT-4, to an array of external data sources to create and reap the benefits of natural language processing (NLP) applications.

## Build a simple Chat Bot using LangChain and OpenAI

We will rely on the LangChain library to bring together the different components needed for the chatbot.

<b>Step-1</b>

Run the following cell to set the <i>OpenAI key</i> as an environment variable (re-execute this cell if the kernel restarts).

In [1]:
# Setting up the openAI key
import getpass
import os
os.environ["OPENAI_API_KEY"] = getpass.getpass()
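
One possible variation, sketched here under the assumption that you would rather not retype the key on every run, is to prompt only when the variable is missing. The `ensure_api_key` helper and its parameters are hypothetical names introduced for this illustration, and the demonstration uses a plain dictionary and a fake prompt instead of real input.

```python
import getpass
import os

def ensure_api_key(env=os.environ, name="OPENAI_API_KEY", prompt=getpass.getpass):
    # Ask for the key only when it is missing, so a fresh kernel prompts
    # once and later re-runs of the cell reuse the stored value
    if not env.get(name):
        env[name] = prompt(f"{name}: ")
    return env[name]

# Demonstration with a plain dict and a fake prompt instead of real input
fake_env = {}
ensure_api_key(fake_env, prompt=lambda _: "sk-demo")
print("OPENAI_API_KEY" in fake_env)
```

In a notebook, calling `ensure_api_key()` with no arguments would read from and write to `os.environ`.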

Initialise the GPT-3.5 chat model object that will be used to generate responses.

In [2]:
# NOTE : You need an API Key from OpenAI to use this functionality
import os
from langchain.chat_models import ChatOpenAI

# Create the chat model object; OPENAI_API_KEY was set in the previous cell

llm_chat = ChatOpenAI(
    temperature = 0,
    openai_api_key=os.environ["OPENAI_API_KEY"],
    model='gpt-3.5-turbo'
)



<b>Step-2</b>

Now we have to build a query message and send it to the OpenAI service. Before that, we need to understand how to structure our query.

<i> Chats with *OpenAI's gpt-3.5-turbo and gpt-4 chat models* are typically structured (in plain text) like this:</i>

System: You are a helpful assistant.

User: Hi AI, how are you today?

Assistant: I'm great thank you. How can I help you?

User: I'd like to understand string theory.

In the official OpenAI ChatCompletion endpoint these would be passed to the model in a format like:

[
    
    {"role": "system", "content": "You are a helpful assistant."},

    {"role": "user", "content": "Hi AI, how are you today?"},
    
    {"role": "assistant", "content": "I'm great thank you. How can I help you?"},
    
    {"role": "user", "content": "I'd like to understand string theory."}

]
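
The mapping from the plain-text transcript to this list of role/content dictionaries can be sketched as follows. The `transcript_to_messages` helper and the `ROLE_BY_SPEAKER` table are hypothetical names used only for this illustration, and the sketch assumes each turn fits on a single line.

```python
# Map the speaker labels used in the plain-text transcript to API roles
ROLE_BY_SPEAKER = {"System": "system", "User": "user", "Assistant": "assistant"}

def transcript_to_messages(transcript):
    messages = []
    for line in transcript.strip().splitlines():
        if not line.strip():
            continue  # skip blank lines between turns
        # Split on the first colon only, so colons inside the text survive
        speaker, _, content = line.partition(":")
        messages.append(
            {"role": ROLE_BY_SPEAKER[speaker.strip()], "content": content.strip()}
        )
    return messages

transcript = """
System: You are a helpful assistant.

User: Hi AI, how are you today?
"""
print(transcript_to_messages(transcript))
```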

In [3]:
from langchain.schema import (
    SystemMessage,
    HumanMessage,
    AIMessage
)

messages = [
    SystemMessage(content="You are a helpful assistant."),
    HumanMessage(content="Hi AI, how are you today?"),
   
]

<b>Step-3</b>

Send the formatted message to ChatGPT to get a response

In [4]:
response = llm_chat.invoke(messages)
print(response.content)

Hello! I'm just a computer program, so I don't have feelings, but I'm here and ready to assist you. How can I help you today?


Because response is just another AIMessage object, we can append it to messages, add another HumanMessage, and generate the next response in the conversation

In [5]:
# add latest AI response to messages
messages.append(response)

# now create a new user prompt
prompt = HumanMessage(
    content="I would like to know about Australia day")

# add to messages
messages.append(prompt)

# send to chat-gpt
response = llm_chat.invoke(messages)

print(response.content)

Australia Day is a national holiday in Australia that is celebrated annually on January 26th. It marks the anniversary of the arrival of the First Fleet at Port Jackson in New South Wales in 1788, which led to the establishment of the first European settlement in Australia.

Australia Day is a day of national pride and celebration, with events and activities held across the country. It is a time for Australians to come together to reflect on the history of their nation, celebrate its diverse culture, and look towards the future.

However, Australia Day is also a controversial holiday for many Indigenous Australians, as it marks the beginning of colonization and the negative impacts it had on their communities. Some Indigenous Australians refer to it as "Invasion Day" or "Survival Day" and use the day to raise awareness about the ongoing issues faced by Indigenous communities.

Overall, Australia Day is a day of reflection, celebration, and debate in Australia, with different perspectiv

**Optional Work # 1**  
We can change the persona of ChatGPT through the system message, for example to make it respond in German or any other language instead of English.

In [8]:
messages = [
    SystemMessage(content="You are a helpful assistant that translates from english to german."),
    HumanMessage(content="Hi AI, how are you today?"),
]

response = llm_chat.invoke(messages)
print(response.content)

Hallo! Mir geht es gut, danke. Wie kann ich Ihnen heute helfen?


**Optional Work # 2**  
We can also use the system message to make the OpenAI service produce its output in JSON format.

In [23]:
# Preparing LangChain message prompt
messages = [
    SystemMessage(content="You are a helpful assistant that parses json output."),
    HumanMessage(content="How was Napoleon? Produce the answer in JSON format")
]

# Create the OpenAI object 
# The temperature value ranges from 0 to 2, with lower values indicating 
# greater determinism, and higher values indicating more randomness.
llm_chat_json = ChatOpenAI(
    temperature = 0.1,
    openai_api_key=os.environ["OPENAI_API_KEY"],
    model='gpt-3.5-turbo'
)

response = llm_chat_json.invoke(messages)
print(response.content)

{
    "response": "I'm sorry, but I cannot provide information on how Napoleon was as it is a historical question. If you have any specific information or questions about Napoleon, feel free to ask and I'll do my best to assist you."
}
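
Asking for JSON in the prompt does not by itself guarantee that the reply is valid JSON, so it can be worth checking before using it. The sketch below shows one way to do that with the standard library; `parse_json_reply` is a hypothetical helper, and both replies are hand-written strings rather than real model output.

```python
import json

def parse_json_reply(reply):
    # Return the parsed object, or None when the reply is not valid JSON
    try:
        return json.loads(reply)
    except json.JSONDecodeError:
        return None

good_reply = '{"response": "Napoleon was a French military leader."}'
bad_reply = "Sure! Here is the answer: Napoleon was a French military leader."

print(parse_json_reply(good_reply))
print(parse_json_reply(bad_reply))
```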


<b>Step-4: Dealing with Hallucinations</b>

The knowledge of LLMs can be limited because LLMs learn all they know during training. An LLM essentially compresses the "world" as seen in the training data into the internal parameters of the model. This knowledge is called the parametric knowledge of the model.

By default, LLMs have no access to the external world.

So we can expect hallucinated output from an LLM if we ask about more recent information, for example by enquiring about the Llama 2 language model.

In [10]:
from langchain.schema import (
    SystemMessage,
    HumanMessage,
    AIMessage
)

messages = [
    SystemMessage(content="You are a helpful assistant."),
    HumanMessage(content="I want to know about Llama 2"),
]

# send to chat-gpt
response = llm_chat.invoke(messages)

print(response.content)

I'm sorry, but I don't have information about a specific topic called "Llama 2." Could you provide more context or details so I can better assist you?


We can see that the OpenAI model failed to provide an answer. To tackle this issue, we feed knowledge into the LLM in another way. It is called <b>source knowledge, and it refers to any information fed into the LLM via the prompt</b>. We can try that with the Llama 2 question by taking a description of the model from the Llama 2 source page:

In [11]:
llama2_information = [
    "Code Llama is a code generation model built on Llama 2, trained on 500B tokens of code. It supports common programming languages being used today, including Python, C++, Java, PHP, Typescript (Javascript), C#, and Bash.",
    "In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closed-source models. We provide a detailed description of our approach to fine-tuning and safety improvements of Llama 2-Chat in order to enable the community to build on our work and contribute to the responsible development of LLMs."
]

source_knowledge = "\n".join(llama2_information)

Now we feed this additional information along with query to the model. 

In [27]:
query = "Can you tell me about the llama 2 ?"

augmented_prompt = f"""Using the contexts below, answer the query.

Contexts:
{source_knowledge}

Query: {query}"""

# create a new user prompt
prompt = HumanMessage(
    content=augmented_prompt
)
# add to messages
messages.append(prompt)

# send to OpenAI
res = llm_chat.invoke(messages)
print(res.content)

{
  "llama2": {
    "description": "Llama 2 is a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters.",
    "features": [
      "Supports common programming languages like Python, C++, Java, PHP, Typescript (Javascript), C#, and Bash.",
      "Includes fine-tuned LLMs optimized for dialogue use cases, known as Llama 2-Chat.",
      "Outperforms open-source chat models on most benchmarks tested.",
      "Received positive feedback in human evaluations for helpfulness and safety.",
      "Provides detailed descriptions of fine-tuning and safety improvements to encourage community contributions."
    ]
  }
}


The model can now answer the question. This is made possible by the augmentation of our query with external knowledge (source knowledge). We can use a vector database to retrieve this information automatically.

## Building RAG Chatbots with LangChain

In this example, we will build an AI chatbot from start to finish so that it can answer questions about Llama 2 automatically instead of being given the information manually (<b>source knowledge provided using a vector database</b>). We will be using LangChain, Hugging Face embeddings, OpenAI, and a vector database to build a chatbot capable of learning from the external world using Retrieval Augmented Generation (RAG).

We will use two techniques to build our chatbot:

1- Scrape a dataset sourced from the Llama 2 ArXiv paper and other related papers to help our chatbot answer questions about the latest and greatest in the world of GenAI.

2- Scrape multiple webpages to help our chatbot answer questions about the latest and greatest in the world of GenAI.

### <span style='color:red '> <b> Technique 1 </b> </span>

<b> <span style='color:blue '>Importing the Data </span></b>

We will use the Hugging Face Datasets library to load our data. Specifically, we will be using the <b>"jamescalam/llama-2-arxiv-papers-chunked"</b> dataset.

The dataset is sourced from the Llama 2 arXiv paper and related papers. It is a collection of academic papers from arXiv, a repository of electronic preprints approved for posting after moderation. Each entry in the dataset represents a "chunk" of text from these papers.

Because most Large Language Models (LLMs) only contain knowledge of the world as it was during training, they cannot answer our questions about Llama 2 — at least not without this data.

In [14]:
# Load dataset 
from datasets import load_dataset
dataset = load_dataset(
    "jamescalam/llama-2-arxiv-papers-chunked",
    split="train"
)
print(dataset[0])
print(len(dataset))

{'doi': '1102.0183', 'chunk-id': '0', 'chunk': 'High-Performance Neural Networks\nfor Visual Object Classi\x0ccation\nDan C. Cire\x18 san, Ueli Meier, Jonathan Masci,\nLuca M. Gambardella and J\x7f urgen Schmidhuber\nTechnical Report No. IDSIA-01-11\nJanuary 2011\nIDSIA / USI-SUPSI\nDalle Molle Institute for Arti\x0ccial Intelligence\nGalleria 2, 6928 Manno, Switzerland\nIDSIA is a joint institute of both University of Lugano (USI) and University of Applied Sciences of Southern Switzerland (SUPSI),\nand was founded in 1988 by the Dalle Molle Foundation which promoted quality of life.\nThis work was partially supported by the Swiss Commission for Technology and Innovation (CTI), Project n. 9688.1 IFF:\nIntelligent Fill in Form.arXiv:1102.0183v1  [cs.AI]  1 Feb 2011\nTechnical Report No. IDSIA-01-11 1\nHigh-Performance Neural Networks\nfor Visual Object Classi\x0ccation\nDan C. Cire\x18 san, Ueli Meier, Jonathan Masci,\nLuca M. Gambardella and J\x7f urgen Schmidhuber\nJanuary 2011\nAbstr

<b> <span style='color:blue '>1. Create a vector Database </span></b>

We now need to store the dataset in a vector database such as Chroma (chromadb). It can store data such as embeddings and text on the local computer. We then query it to get the relevant documents and pass them to an online LLM (such as OpenAI) to get a response.

Storing the embeddings alongside the text is a good idea, because document similarity can then be computed without re-embedding every document at query time.
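
To make "document similarity" concrete, the sketch below compares made-up 3-dimensional vectors using cosine similarity. This is only an illustration: cosine similarity is one common choice, the vector database may be configured with a different distance metric, and real embedding models produce vectors with hundreds of dimensions. The vectors and document names are invented for the example.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up "embeddings" for a query and two stored documents
query_vec = [1.0, 0.0, 1.0]
doc_vecs = {
    "doc_about_llamas": [0.9, 0.1, 0.8],
    "doc_about_chess": [0.0, 1.0, 0.1],
}

# The stored document whose vector points in the most similar direction wins
best = max(doc_vecs, key=lambda name: cosine_similarity(query_vec, doc_vecs[name]))
print(best)  # doc_about_llamas
```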

In [15]:
from chromadb.utils import embedding_functions
from langchain.vectorstores import Chroma
import chromadb

# Create a Hugging Face sentence-transformer embedding function
hugging_face_model = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

# Define the vector database location
chroma_client = chromadb.PersistentClient(path="vetor_db/")  # data is persisted in the 'vetor_db' folder

# Get the collection, creating it on the first run, so that re-running
# this cell does not fail because the collection already exists
db_collection = chroma_client.get_or_create_collection(
    name="lang_chain_1", embedding_function=hugging_face_model
)





A simple example showing how we can use the vector database to get relevant documents:

In [25]:
student_info = """
Alexandra Thompson, a 19-year-old computer science sophomore with a 3.7 GPA,
is a member of the programming and chess clubs who enjoys pizza, swimming, and hiking
in her free time in hopes of working at a tech company after graduating from the University of Washington.
"""

club_info = """
The university chess club provides an outlet for students to come together and enjoy playing
the classic strategy game of chess. Members of all skill levels are welcome, from beginners learning
the rules to experienced tournament players. The club typically meets a few times per week to play casual games,
participate in tournaments, analyze famous chess matches, and improve members' skills.
"""

university_info = """
The University of Washington, founded in 1861 in Seattle, is a public research university
with over 45,000 students across three campuses in Seattle, Tacoma, and Bothell.
As the flagship institution of the six public universities in Washington state,
UW encompasses over 500 buildings and 20 million square feet of space,
including one of the largest library systems in the world."""

# Add the documents to the collection
db_collection.add(
    documents = [student_info, club_info, university_info],
    metadatas = [{"source": "student info"},{"source": "club info"},{'source':'university info'}],
    ids = ["id1", "id2", "id3"]
)

# Query vector database
results = db_collection.query(
    query_texts=["What is the student name?"],
    n_results=1
)
print(results['documents'])





<b> <span style='color:blue '>2. Transform data and store in a vector database </span></b>

Store each chunk of the dataset, together with the desired metadata, in the vector database.

In [17]:
from tqdm.auto import tqdm  # for progress bar
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from chromadb.db.base import UniqueConstraintError

data = dataset.to_pandas()  # this makes it easier to iterate over the dataset
batch_size = 100

for i in tqdm(range(0, len(data), batch_size)):
    i_end = min(len(data), i+batch_size)
    # get batch of data
    batch = data.iloc[i:i_end]
    # generate a unique id for each chunk from the paper id and chunk number
    ids = [f"{x['doi']}-{x['chunk-id']}" for _, x in batch.iterrows()]
    # get text to embed
    texts = [x['chunk'] for _, x in batch.iterrows()]

    # get metadata to store alongside each chunk in Chroma
    metadata = [
        {'text': x['chunk'],
         'source': x['source'],
         'title': x['title']} for _, x in batch.iterrows()
    ]

    # add the batch to the collection
    db_collection.add(
        documents=texts,
        metadatas=metadata,
        ids=ids
    )

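
The loop above relies on two ideas: every chunk gets an ID that is unique within the collection, and the rows are added in fixed-size batches. The sketch below shows both with plain Python so the logic can be checked without a database. The helper names are hypothetical, and the rows are invented apart from the `doi` and `chunk-id` field names, which follow the dataset.

```python
from collections import Counter

def make_ids(rows):
    # Combine the paper identifier and the chunk number into one ID
    return [f"{row['doi']}-{row['chunk-id']}" for row in rows]

def find_duplicates(ids):
    # IDs that occur more than once would collide in the collection
    return [item for item, count in Counter(ids).items() if count > 1]

def batches(items, size):
    # Yield consecutive slices of at most `size` items
    for start in range(0, len(items), size):
        yield items[start:start + size]

rows = [
    {"doi": "1102.0183", "chunk-id": "0"},
    {"doi": "1102.0183", "chunk-id": "1"},
    {"doi": "1102.0183", "chunk-id": "1"},  # deliberate duplicate
]
ids = make_ids(rows)
print(find_duplicates(ids))                # ['1102.0183-1']
print([len(b) for b in batches(ids, 2)])   # [2, 1]
```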

To retrieve top 2 results:

In [18]:
from langchain.embeddings import HuggingFaceEmbeddings

# Create a hugging face model for embeddings
model_id = 'sentence-transformers/all-MiniLM-L6-v2'
model_kwargs = {'device': 'cpu'}
hf_embedding_model = HuggingFaceEmbeddings(
    model_name=model_id,
    model_kwargs=model_kwargs
)

# query the top 2 results
query = 'What is so special about Llama 2'

results = db_collection.query(
    query_texts=query,
    n_results=2
)

print(results['documents'])




[['Ricardo Lopez-Barquilla, Marc Shedroﬀ, Kelly Michelena, Allie Feinstein, Amit Sangani, Geeta\nChauhan,ChesterHu,CharltonGholson,AnjaKomlenovic,EissaJamil,BrandonSpence,Azadeh\nYazdan, Elisa Garcia Anzano, and Natascha Parks.\n•ChrisMarra,ChayaNayak,JacquelinePan,GeorgeOrlin,EdwardDowling,EstebanArcaute,Philomena Lobo, Eleonora Presani, and Logan Kerr, who provided helpful product and technical organization support.\n46\n•Armand Joulin, Edouard Grave, Guillaume Lample, and Timothee Lacroix, members of the original\nLlama team who helped get this work started.\n•Drew Hamlin, Chantal Mora, and Aran Mun, who gave us some design input on the ﬁgures in the\npaper.\n•Vijai Mohan for the discussions about RLHF that inspired our Figure 20, and his contribution to the\ninternal demo.\n•Earlyreviewersofthispaper,whohelpedusimproveitsquality,includingMikeLewis,JoellePineau,\nLaurens van der Maaten, Jason Weston, and Omer Levy.', 'our responsible release strategy can be found in Section 5.3.\nTh
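
The printed value is a list of lists because a query can carry several query texts at once, and the result holds one inner list per query text. The sketch below shows how such a result might be unpacked. The `results` dictionary is a hand-written stand-in rather than real output, the `top_documents` helper is hypothetical, and the `distances` key is assumed to accompany `documents` in the result.

```python
# Hand-written stand-in for a query result with one query and two hits
results = {
    "ids": [["id1", "id2"]],
    "documents": [["first matching chunk", "second matching chunk"]],
    "distances": [[0.42, 0.57]],
}

def top_documents(results, query_index=0):
    # Pair each returned document with its distance for the given query
    docs = results["documents"][query_index]
    dists = results["distances"][query_index]
    return list(zip(docs, dists))

for doc, dist in top_documents(results):
    print(f"{dist:.2f}  {doc}")
```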

<b> <span style='color:blue '>3. Implementing RAG </span></b>

Creating a function to extract source knowledge information from vector database

In [19]:
def augment_prompt(query: str):
    # get top 3 results from knowledge base
    results = db_collection.query(
        query_texts=[query],
        n_results=3
    )

    # results['documents'] holds one list of chunks per query text;
    # join all returned chunks into a single context string
    source_knowledge = "\n".join(
        doc for docs in results['documents'] for doc in docs
    )

    # feed into an augmented prompt
    augmented_prompt = f"""Using the contexts below, answer the query.

Contexts:
{source_knowledge}

Query: {query}"""

    return augmented_prompt

Now we supply this additional information to the OpenAI model:

In [20]:
# create a new user prompt
prompt = HumanMessage(
    content=augment_prompt(query)
)
# add to messages
messages.append(prompt)

res = llm_chat.invoke(messages)

print(res.content)

Llama 2 is a collection of pretrained and fine-tuned large language models (LLMs) that range in scale from 7 billion to 70 billion parameters. These models, known as Llama 2-Chat, are specifically optimized for dialogue use cases. They have demonstrated superior performance compared to open-source chat models across various benchmarks. Additionally, Llama 2-Chat has undergone human evaluations for helpfulness and safety, suggesting that it could potentially serve as a viable alternative to closed-source models. The developers of Llama 2 have provided detailed insights into their fine-tuning methodology and safety enhancements, aiming to facilitate further advancements in the responsible development of large language models.


We can continue with more Llama 2 questions. Let's try without RAG first:



In [21]:
prompt = HumanMessage(
    content="what safety measures were used in the development of llama 2?"
)

res = llm_chat.invoke(messages + [prompt])
print(res.content)

In the development of Llama 2, several safety measures were implemented to ensure responsible usage of the large language models (LLMs). These safety measures included:

1. **Human Evaluations**: Llama 2 underwent human evaluations for both helpfulness and safety. This process involved assessing the model's performance and ensuring it met certain safety standards.

2. **Model Safety Optimization**: The Llama 2 models were optimized for dialogue use cases, and safety considerations were integrated into the fine-tuning process. This likely involved techniques to mitigate potential risks associated with language models, such as bias detection and mitigation.

3. **Responsible Release Strategy**: The team behind Llama 2 had a responsible release strategy in place, as mentioned in Section 5.3 of the paper. This strategy likely involved guidelines for the ethical and safe deployment of the models.

4. **Community Engagement**: The developers aimed to enable the community to build on their wo

The chatbot is able to respond about Llama 2 thanks to its conversational history stored in messages. However, it doesn't know anything about the safety measures themselves, as we have not provided it with that information via the RAG pipeline. Let's try again, this time with RAG:

In [22]:
prompt = HumanMessage(
    content=augment_prompt(
        "what safety measures were used in the development of llama 2?"
    )
)

res = llm_chat.invoke(messages + [prompt])
print(res.content)

The safety measures used in the development of Llama 2 include the following:

1. Ethical Considerations and Limitations: The developers acknowledge that Llama 2 is a new technology that carries risks with its use. Testing conducted so far has been in English and may not cover all scenarios. Due to this, the potential outputs of Llama 2 cannot be predicted in advance, and the model may produce inaccurate or objectionable responses to user prompts.

2. Responsible Use Guide: Developers are advised to perform safety testing and tuning tailored to their specific applications of Llama 2 before deploying any applications of the model. A Responsible Use Guide is available at https://ai.meta.com/llama/responsible-user-guide for reference.

These safety measures aim to address the potential risks associated with the use of Llama 2 and emphasize the importance of responsible deployment and testing to ensure the model's outputs are accurate and appropriate for the intended applications.
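
The last two cells pass `messages + [prompt]` to the model instead of appending the prompt to `messages`. The difference can be seen with plain Python lists; the strings below are placeholders standing in for message objects.

```python
history = ["system message", "first question", "first answer"]
follow_up = "second question"

# `history + [follow_up]` builds a new list; `history` itself is unchanged
one_off = history + [follow_up]
print(len(history), len(one_off))  # 3 4

# `append` mutates the list in place, so later calls also see the new entry
history.append(follow_up)
print(len(history))  # 4
```

Using the one-off list keeps the stored conversation history free of the long augmented prompts, while `append` is the right choice when the new turn should become part of the conversation.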
