<a href="https://colab.research.google.com/github/nathan-young1/Introduction-to-Retrieval-Argumented-Generation-RAG-/blob/main/Introduction_to_Retrieval_Argumented_Generation_(RAG).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Retrieval Augmented Generation (RAG)

Large language models (LLMs) like ChatGPT are great at understanding language and generating fluent text. However, they sometimes struggle with factual accuracy or with keeping information up to date. Retrieval augmented generation (RAG) addresses this by adding a "research assistant" step:

1. **Retrieval**: When you give the LLM a prompt or question, RAG first searches through a database of texts 📜 - like having access to a giant virtual library! It retrieves relevant snippets of information that could be useful for composing its response.

2. **Augmentation**: Those retrieved context passages are then incorporated into the prompt to the LLM 📝, giving it an information source to base the answer on. This is like reading research notes from a database and integrating them into your understanding before writing on a topic.

3. **Generation**: Finally, the LLM combines the augmented context with its own language capabilities to generate a response. The resulting text is not just fluent but also more likely to be accurate, since it is grounded in relevant reference material.

In essence, RAG reduces the LLM's chance of hallucinating because now it gets to consult a knowledge base before responding. This makes responses more reliable and trustworthy, especially for topics requiring specific up-to-date facts.
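
To make the three steps concrete before bringing in real libraries, here is a minimal toy sketch in plain Python. Everything in it is an assumption made for illustration: the `KNOWLEDGE_BASE` snippets are made up, retrieval is simple word overlap rather than embeddings, and `echo_llm` is a placeholder that stands in for a real language model.

```python
import re

# Toy knowledge base standing in for a real document store (hypothetical data).
KNOWLEDGE_BASE = [
    "Chroma is a vector database used for similarity search.",
    "Sentence embeddings map text to vectors that capture meaning.",
    "Flan-T5 is an instruction-tuned language model from Google.",
]

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, k=1):
    """Retrieval: rank snippets by how many words they share with the question."""
    q_words = tokenize(question)
    ranked = sorted(
        KNOWLEDGE_BASE,
        key=lambda snippet: len(q_words & tokenize(snippet)),
        reverse=True,
    )
    return ranked[:k]

def augment(question, snippets):
    """Augmentation: place the retrieved snippets in the prompt next to the question."""
    context = "\n".join(snippets)
    return f"Context: {context}\nQuestion: {question}"

def generate(prompt, llm):
    """Generation: hand the augmented prompt to any callable language model."""
    return llm(prompt)

def echo_llm(prompt):
    """Placeholder 'model' that simply returns the context line of the prompt."""
    return prompt.splitlines()[0]

question = "What is Chroma used for?"
snippets = retrieve(question)
answer = generate(augment(question, snippets), echo_llm)
print(answer)
```

The rest of this tutorial replaces each toy piece with a real component: embeddings and a vector database for `retrieve`, a prompt template for `augment`, and a hosted LLM for `generate`.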

<img src="https://python.langchain.com/assets/images/vector_stores-125d1675d58cfb46ce9054c9019fea72.jpg" height=400 width=800/>

⭐ Photo credits: [LangChain](https://python.langchain.com/docs/modules/data_connection/vectorstores/)

### **Retrieval**

To use RAG, we need a database of documents that can provide relevant information for our queries. In this tutorial, we will create a database from the book "How to Build a Career in AI" by Andrew Ng. We will use LangChain, Chroma, and Hugging Face to perform RAG on this book.

The process of creating a database involves the following steps:

- **Chunking**: We divide the book into smaller pieces, such as paragraphs or sentences, that can be easily indexed and retrieved.

- **Embedding**: We use a pre-trained model from Hugging Face to convert each chunk into a vector representation, also known as a sentence embedding. This captures the semantic meaning of the chunk and allows us to compare it with other chunks or queries.

💡: For more information on vector embeddings check out the word embeddings section in my last lesson at [Notebook Link](https://www.kaggle.com/code/nathanyoung1/transformer-based-language-translation-in-pytorch). The word embeddings are combined to form sentence embeddings which we will refer to as vector embeddings throughout this tutorial.

- **Indexing**: We store the vector embeddings in a vector database, such as Chroma, that can efficiently perform similarity search. This means that given a query vector, we can find the most similar vectors in the database, and retrieve the corresponding chunks.

When we want to use RAG to generate a response for a query, we first embed the query using the same model as before. Then, as shown in the image above 👆, we use the vector database to find the embeddings most similar to the query embedding. These embeddings are linked to particular chunks of our document. We then feed these chunks as context to the LLM, enabling it to generate a coherent and informative answer.
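
The embed, index, and search flow described above can be sketched without any external library. This is only an illustration under stated assumptions: the `embed` function is a toy bag-of-words counter standing in for a real embedding model, the `chunks` are made-up strings, and the `index` is a plain Python list rather than a vector database.

```python
import math
from collections import Counter

def embed(text, vocabulary):
    """Toy embedding: count how often each vocabulary word appears in the text."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either vector is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Chunking: pretend these are the chunks of our document (hypothetical data).
chunks = [
    "learn foundational machine learning skills",
    "work on projects to build a portfolio",
    "find a job in machine learning",
]
vocabulary = sorted({word for chunk in chunks for word in chunk.split()})

# Indexing: store each chunk next to its embedding.
index = [(chunk, embed(chunk, vocabulary)) for chunk in chunks]

def search(query, k=2):
    """Embed the query with the same function, then return the k closest chunks."""
    query_vector = embed(query, vocabulary)
    scored = [(cosine_similarity(query_vector, vec), chunk) for chunk, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]

results = search("which machine learning skills should i learn")
print(results)
```

A real vector database does the same job, but with learned embeddings and index structures that stay fast when there are many more vectors.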

In [None]:
# install the vector database, langchain, pypdf, hugging face sentence_transformers
!pip install chromadb langchain pypdf sentence_transformers

In [None]:
# install langchain experimental features (this will likely be moved to stable in the future).
!pip install --quiet langchain_experimental

In [None]:
from langchain_community.document_loaders import PyPDFLoader

# Load the PDF file. PyPDFLoader reads one document per page, and
# load_and_split() then applies LangChain's default text splitter.
loader = PyPDFLoader("How to Build a Career in AI.pdf")
pages = loader.load_and_split()

In [None]:
from langchain_community.embeddings import HuggingFaceBgeEmbeddings

# Load an embedding model from Hugging Face.
model_name = "BAAI/bge-large-en-v1.5"
model_kwargs = {'device': 'cuda'}  # requires a GPU runtime; use 'cpu' otherwise
encode_kwargs = {'normalize_embeddings': True}  # scale embeddings to unit length

embed_model = HuggingFaceBgeEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)
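
Setting `normalize_embeddings` to `True` scales every embedding to unit length. The short sketch below, using made-up two-dimensional vectors, shows why that is convenient: once vectors are normalized, a plain dot product gives the same value as cosine similarity, so vector length no longer distorts the comparison.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vector, vector))
    return [v / norm for v in vector]

def cosine(a, b):
    """Cosine similarity computed from the raw (unnormalized) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0]
b = [6.0, 8.0]  # same direction as a, twice the length
c = [4.0, 3.0]  # different direction

# Raw dot products grow with vector length, not just with similarity.
print(dot(a, b), dot(a, c))

# After normalization, the dot product matches the cosine similarity.
print(dot(l2_normalize(a), l2_normalize(b)), cosine(a, b))
print(dot(l2_normalize(a), l2_normalize(c)), cosine(a, c))
```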

In [None]:
from langchain_experimental.text_splitter import SemanticChunker

# Create a semantic text splitter for the document. At a high level,
# this splits the text into sentences, groups them into groups of 3 sentences,
# and then merges groups that are similar in the embedding space.
text_splitter = SemanticChunker(embed_model)

# split the pages using Semantic Chunker.
documents = text_splitter.split_documents(pages)
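
To build some intuition for semantic chunking, here is a deliberately simplified sketch. It is not the `SemanticChunker` algorithm itself: it assumes a toy bag-of-words `embed` function and a hypothetical fixed `threshold`, and it simply starts a new chunk whenever two neighbouring sentences stop looking similar.

```python
import math
import re
from collections import Counter

def embed(sentence):
    """Toy embedding: bag-of-words counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def semantic_chunks(text, threshold=0.2):
    """Start a new chunk whenever adjacent sentences are less similar than `threshold`."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for previous, current in zip(sentences, sentences[1:]):
        if cosine(embed(previous), embed(current)) >= threshold:
            chunks[-1].append(current)   # similar topic: keep in the same chunk
        else:
            chunks.append([current])     # topic shift: open a new chunk
    return [" ".join(chunk) for chunk in chunks]

text = (
    "Neural networks learn from data. "
    "Deep neural networks learn from large data. "
    "Networking events help you find a job. "
    "A job search benefits from networking events."
)

for chunk in semantic_chunks(text):
    print("-", chunk)
```

With this sample text, the two sentences about neural networks end up in one chunk and the two about job hunting in another, which is the behaviour we want from a meaning-aware splitter.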

In [None]:
from langchain_community.vectorstores import Chroma

# embed and insert all chunks of the documents into the vector database
vector_db = Chroma.from_documents(
    documents,
    embed_model, # model to use for embedding the document chunks before storing.
    persist_directory='vector_db', # directory on disk where the database is persisted.
    collection_name='ai_career' # name of the collection to store the chunks in.
)

In [None]:
# perform a vector similarity search on a query.
query = "how do i start a career in ai?"

# return the chunks for the five embeddings most similar to the query
docs = vector_db.similarity_search(query, k=5)

# print the fifth result (the least similar of the five)
print(docs[4].page_content)

PAGE 9In the previous chapter, I introduced three key steps for building a career in AI: learning 
foundational technical skills, working on projects, and finding a job, all of which is supported 
by being part of a community. In this chapter, I’d like to dive more deeply into the first step: 
learning foundational skills. More research papers have been published on AI than anyone can read in a lifetime. So, when 
learning, it’s critical to prioritize topic selection. I believe the most important topics for a technical 
career in machine learning are:
Foundational machine learning skills: For example, it’s important to understand models such 
as linear regression, logistic regression, neural networks, decision trees, clustering, and anomaly 
detection. Beyond specific models, it’s even more important to understand the core concepts 
behind how and why machine learning works, such as bias/variance, cost functions, regularization, 
optimization algorithms, and error analysis. Deep learni

### **Augmentation** ➕

Now that we have set up a vector database and can retrieve chunks similar to our query, we are going to combine these chunks to form a context. This context is then passed together with our query as the prompt to our LLM.

In [None]:
# utility function that joins all retrieved document chunks into one context string.
def join_retrieved_docs(docs):
    return "\n\n".join([doc.page_content for doc in docs])
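
To see what this helper produces without running the retrieval step, the sketch below calls it on two made-up chunks. `FakeDoc` is a hypothetical stand-in for a LangChain document; the only thing it shares with the real class is the `page_content` attribute that the helper reads.

```python
from dataclasses import dataclass

@dataclass
class FakeDoc:
    """Hypothetical stand-in for a retrieved document (only `page_content` is used)."""
    page_content: str

def join_retrieved_docs(docs):
    # Same helper as above: separate chunks with a blank line.
    return "\n\n".join([doc.page_content for doc in docs])

docs = [FakeDoc("First chunk."), FakeDoc("Second chunk.")]
context = join_retrieved_docs(docs)
print(context)
```

The blank line between chunks gives the LLM a visible boundary between passages that came from different parts of the book.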

In [None]:
from langchain_core.prompts import ChatPromptTemplate

# Template so we can attach our context and query as prompt to the LLM on the fly.
template = """Answer the question using vital information from the following context,
if the context is relevant, if context given is not relevant,
you should reply >>> 'Sorry, But the context provided doesn't contain enough or relevant
information to answer your question'

>>> 'Context : {context}'

>>> 'Question: {question}'
"""

prompt = ChatPromptTemplate.from_template(template)
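
It can help to look at what the LLM actually receives once the placeholders are filled in. The sketch below uses plain `str.format` on a shortened, hypothetical template and made-up chunk text. As far as I know, `ChatPromptTemplate.from_template` uses the same curly-brace placeholder style by default, so treat this as an approximation of the final prompt rather than the exact object LangChain builds.

```python
# Shortened, hypothetical version of the tutorial's template.
template = """Answer the question using the context below.

Context: {context}

Question: {question}
"""

# Made-up values standing in for the joined chunks and the user's query.
filled = template.format(
    context="Chunk one text.\n\nChunk two text.",
    question="how do i start a career in ai?",
)
print(filled)
```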

## **Generation** ✍️

In [None]:
from langchain_community.llms import HuggingFaceHub
import os

os.environ["HUGGINGFACEHUB_API_TOKEN"] = "<Your Hugging Face API Token Here>"
# Note: To get an API token, sign up to Hugging Face -> go to settings -> access tokens
# -> create new token -> copy the token and paste it above 👆.

In [None]:
repo_id = "google/flan-t5-xxl" # we will use Google's Flan-T5 XXL model from Hugging Face.

llm = HuggingFaceHub(
    repo_id=repo_id,
    # params for LLM text generation
    model_kwargs={
        "temperature": 0.1 # low values make generation more deterministic.
    }
)
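
The effect of `temperature` is easier to see with numbers. The sketch below is a general illustration of temperature-scaled softmax, not a description of this particular model's internals, and the `logits` are made-up scores for three candidate tokens.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens

for temperature in (1.0, 0.1):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

At a temperature of 0.1, almost all of the probability lands on the highest-scoring token, which is why low temperatures give more repeatable answers.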

In [None]:
# Import the LLMChain class from langchain.chains module
from langchain.chains import LLMChain

# Define a function that takes a question, a language model, and a prompt template as arguments
def query_llm_with_context(question, llm, prompt_template):

    # Use the vector database to find the most similar document chunks to the question
    # The parameter k specifies the number of document chunks to retrieve
    retrieved_docs = vector_db.similarity_search(question, k=5)

    # Create an instance of the LLMChain class
    # The prompt parameter specifies the format of the input for the language model
    # The llm parameter specifies the language model instance to use
    llm_chain = LLMChain(
        prompt=prompt_template,
        llm=llm
    )

    # Invoke the LLMChain instance with the input dictionary
    # The context key holds the concatenated document chunks retrieved from the vector database
    # The question key holds the original question for the LLM
    llm_response = llm_chain.invoke(
        input={
            'context': join_retrieved_docs(retrieved_docs),
            'question': question
        }
    )

    # Return the text of the response generated by the language model
    return llm_response['text']


### **Testing**

In [None]:
query = 'what is recommended for new ai startups to do?'

query_llm_with_context(question=query, llm=llm, prompt_template=prompt)

'Identify a business problem (not an AI problem). I like to find a domain expert and ask, “What are the top three things that you wish worked better? Why aren’t they working yet?” For example, if you want to apply AI to climate change, you might discover that power-grid operators can’t accurately predict how much power intermittent sources like wind and solar might generate in the future. Brainstorm AI solutions.'

In [None]:
query2 = 'What is Russian Roulette'

query_llm_with_context(question=query2, llm=llm, prompt_template=prompt)

"Sorry, But the context provided doesn't contain enough or relevant information to answer your question"

👆👆 As you can see, in the first example the LLM used our context as the source to generate an answer. In the second example, instead of hallucinating, the LLM simply replied that there isn't enough relevant context to answer the question.

### **Final Words**
This tutorial introduced a basic RAG technique and its implementation. In production, more advanced RAG techniques such as sentence-window retrieval and auto-merging retrieval are used to improve context relevance. We also use tools 🏹 like TruEra for LLM response evaluation.

**Congratulations** 🎉🎉
You can now use fundamental RAG techniques. 😊

Follow me on:

* **[LinkedIn Profile](https://www.linkedin.com/in/jonathan-okorie-843126216/)** for questions, deep learning projects, chat, etc.

* **[Twitter Profile](https://twitter.com/Nathan_Young_1)** for bite-sized knowledge & (questionable) puns.

* **[Kaggle Profile](https://www.kaggle.com/nathanyoung1)** to be notified when I create a new detailed notebook explanation.