<a href="https://colab.research.google.com/github/jchen8000/DemystifyingLLMs/blob/main/6_Deployment/RAG_LangChain_Groq.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 6. Deployment of LLMs

## 6.11 Retrieval Augmented Generation (RAG) Application

Retrieval Augmented Generation (RAG) is one of the most powerful applications enabled by LLMs. It is a technique for augmenting an LLM's knowledge with additional data.

Large Language Models (LLMs) are powerful tools for natural language processing (NLP). However, their knowledge is confined to the public data they were trained on. This means they are not aware of private data, domain-specific data, or any new information introduced after their training cutoff date.

To build AI applications that can reason about such private, domain-specific or up-to-date data, it’s important to augment the model’s knowledge with the specific information it needs. This is the goal of RAG: it retrieves the relevant information and incorporates it into the model’s prompt, so that the LLM can generate responses based on the most current and specific data available.


**How it works:**

1. **Retrieval**:
The application takes the user's input and retrieves relevant information from a knowledge base.
This can include documents, conversations, or other textual sources.

2. **Augmentation**:
The retrieved information is added to the prompt as context.
The prompt then contains the user's question together with the domain-specific knowledge needed to answer it.

3. **Generation**:
The LLM uses the augmented prompt to generate a response, combining the retrieved content with the user's question to create a coherent and relevant output.
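
The three steps can be sketched in plain Python. This is only a toy illustration under stated assumptions, not the approach used in the rest of this notebook: the `KNOWLEDGE_BASE` list, the word-overlap score, and the `fake_llm` stub are hypothetical stand-ins for a vector database, embedding similarity, and a real LLM call.

```python
import re

# Hypothetical mini knowledge base; a real application indexes many document chunks.
KNOWLEDGE_BASE = [
    "The Transformer relies entirely on attention mechanisms.",
    "FAISS is a library for efficient similarity search.",
    "Groq provides hosted inference for open models.",
]

def tokenize(text):
    """Lowercase the text and return its set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, docs, k=1):
    """Retrieval: rank documents by the number of words shared with the query."""
    query_words = tokenize(query)
    ranked = sorted(docs, key=lambda d: len(query_words & tokenize(d)), reverse=True)
    return ranked[:k]

def augment(query, context_docs):
    """Augmentation: place the retrieved text in the prompt ahead of the question."""
    context = "\n".join(context_docs)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

def fake_llm(prompt):
    """Generation: stand-in for a real LLM call; it only reports the prompt size."""
    return f"(an LLM would answer here using a {len(prompt)}-character prompt)"

question = "What does the Transformer rely on?"
top_docs = retrieve(question, KNOWLEDGE_BASE)
prompt = augment(question, top_docs)
print(prompt)
print(fake_llm(prompt))
```

Word overlap is used here only because it needs no model download; the sections below replace it with embeddings and a vector database.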

### 0. Install the packages and import them

In [None]:
!pip install langchain langchain_community langchain_chroma
!pip install pypdf
!pip install -U langchain-huggingface
!pip install FAISS-gpu
!pip install -qU langchain-groq

In [2]:
import requests
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS  # vector stores live in langchain_community

### 1. Indexing

Indexing is an important process for a RAG application, ensuring the efficient retrieval of the most relevant information.
It includes the following steps:

* **Load Documents**: The first step is to load the documents or data you want to use. This can be private, domain-specific or up-to-date data, including text files, PDFs, web pages, or any other relevant sources.
* **Split into Small Chunks**: Once the documents are loaded, they are split into smaller, manageable chunks. This is important because smaller chunks are easier to process and retrieve. The splitting can be based on sentences, paragraphs, or fixed token lengths.
* **Embedding**: Each chunk of text is then converted into a vector representation using an embedding model. These embeddings capture the semantic meaning of the text in a numerical format, making it easier for the system to understand and compare different pieces of information.
* **Store in a Vector Database**: The final step is to store these vector embeddings in a vector database. This specialized database is optimized for storing and retrieving high-dimensional vectors, allowing for efficient similarity searches.




#### Load Document

This example loads a PDF file from a URL. First, we download the PDF to a local file.

```PyPDFLoader``` is a class that loads PDF documents and extracts their text. It comes with the ```langchain_community.document_loaders``` module.

```loader.load()``` loads the PDF document and extracts its text content. The result, ```documents```, is a list of ```Document``` objects, one per page, each holding the page text in ```page_content``` along with metadata such as the source file and page number.


In [3]:
# Download a PDF file
url = "https://arxiv.org/pdf/1706.03762"
doc_name = "Attention_Is_All_You_Need.pdf"

response = requests.get(url, timeout=60)
response.raise_for_status()  # stop early if the download failed
with open(doc_name, "wb") as f:
    f.write(response.content)

# Load the PDF file
loader = PyPDFLoader(doc_name)
documents = loader.load()

#### Split into small chunks

```RecursiveCharacterTextSplitter``` is a general-purpose text splitter that is commonly used in RAG applications. It tries to break text on natural boundaries (paragraphs, then lines, then words) before falling back to individual characters. ```chunk_size``` specifies the maximum size of each chunk (or split) in characters, and ```chunk_overlap``` specifies the amount of overlap between consecutive chunks. In this case, each chunk is at most 1000 characters long and overlaps with the previous chunk by up to 200 characters.

The purpose of splitting the documents is to create a set of smaller, more manageable pieces of text that can be embedded and indexed in the vector database.
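
To make the interaction between ```chunk_size``` and ```chunk_overlap``` concrete, here is a minimal sketch of a fixed-window splitter. It is a deliberate simplification: the helper `split_with_overlap` is hypothetical, and unlike ```RecursiveCharacterTextSplitter``` it cuts at exact character positions instead of looking for separators.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the last three characters of one chunk reappear at the start of the next. The overlap reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary.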

In [4]:
# Split the document into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(documents)

Optionally, check how many splits there are in total and inspect the content of any given split.

In [5]:
# print('splits:', len(splits))
# print('splits[20] metadata:', splits[20].metadata)
# print('splits[20] content:', splits[20].page_content)

#### Embedding

Load an embedding model from Hugging Face.

```HuggingFaceEmbeddings``` is a class that provides an interface for generating embeddings from a pre-trained language model, which in this example is the *bert-base-uncased* model. Setting ```normalize_embeddings``` to ```True``` scales each embedding to unit length.

See Section 2.8 and Section 3.2 of the book [***Demystifying Large Language Models***](https://github.com/jchen8000/DemystifyingLLMs/) for more details about embedding.
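
The effect of normalization can be checked with a small worked example: for unit-length vectors, the plain dot product equals the cosine similarity. The three-dimensional vectors below are made up for illustration; real embeddings come from the embedding model and have far more dimensions.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def cosine(a, b):
    """Cosine similarity computed from the raw (unnormalized) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Made-up "embeddings" (an assumption for this sketch, not model output).
query = [1.0, 2.0, 0.0]
similar = [2.0, 4.1, 0.2]
unrelated = [0.0, -0.1, 3.0]

for name, vec in [("similar", similar), ("unrelated", unrelated)]:
    raw = cosine(query, vec)
    unit = dot(normalize(query), normalize(vec))
    print(f"{name}: cosine={raw:.4f}, dot of unit vectors={unit:.4f}")
```

The two printed numbers agree for each pair, which is why normalized embeddings let a similarity search rely on simple vector arithmetic.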


In [6]:
# Load the embedding model
embeddings = HuggingFaceEmbeddings(model_name="bert-base-uncased", encode_kwargs={"normalize_embeddings": True})





#### Store in a Vector Database

This example uses FAISS (Facebook AI Similarity Search) as the vector database. ```FAISS.from_documents(splits, embeddings)``` embeds each split with the given embedding model and creates a FAISS index from the split texts and their corresponding embeddings. When the code below is executed, FAISS creates an index that maps each split to its embedding. This index allows for efficient similarity search, clustering, and other operations on the embeddings.

Optionally, you can save the vector database to a local folder.

See Section 6.9 of the book [***Demystifying Large Language Models***](https://github.com/jchen8000/DemystifyingLLMs/) for more details on vector databases, and https://github.com/facebookresearch/faiss for FAISS.
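
Conceptually, a vector store keeps each chunk next to its vector and answers a query by returning the chunks whose vectors are closest. The sketch below illustrates that idea with an exhaustive scan and Euclidean distance. `TinyVectorStore` and its two-dimensional vectors are hypothetical and are not the FAISS API; FAISS provides optimized index structures for the same task.

```python
import math

class TinyVectorStore:
    """Hypothetical in-memory store: keeps (text, vector) pairs and scans them all."""

    def __init__(self):
        self.entries = []

    def add(self, text, vector):
        self.entries.append((text, vector))

    def search(self, query_vector, k=2):
        """Return the k texts whose vectors are closest to the query (Euclidean)."""
        scored = [(math.dist(query_vector, vec), text) for text, vec in self.entries]
        scored.sort(key=lambda pair: pair[0])  # smallest distance first
        return [text for _, text in scored[:k]]

store = TinyVectorStore()
store.add("chunk about attention", [0.9, 0.1])
store.add("chunk about training", [0.1, 0.9])
store.add("chunk about multi-head attention", [0.8, 0.3])

results = store.search([1.0, 0.0], k=2)
print(results)
```

A full scan like this costs time proportional to the number of stored vectors, which is acceptable for one paper but is the reason dedicated libraries exist for large collections.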

In [7]:
# Create a vector store
vectorstore = FAISS.from_documents(splits, embeddings)

# Save the documents and embeddings
vectorstore.save_local("vectorstore.db")

In [8]:
# Check the size of the vectorstore.db
!du -sh vectorstore.db


216K	vectorstore.db


### 2. Retrieval and Generation

Retrieval and Generation are essential processes in a RAG application, enabling it to deliver more precise and informed responses by leveraging the power of LLMs along with the private, domain-specific, and up-to-date data stored in the vector database.


* **Retrieve Related Information**: When a user submits a query, the system first searches the vector database to find the most relevant chunks of information. These chunks are retrieved based on their semantic similarity to the query.
* **Augment the Prompt**: The retrieved information is then used to augment the original query. This is often done using a framework like LangChain, which helps in seamlessly integrating the additional context into the prompt. This step ensures that the language model has all the necessary information to generate a more accurate and contextually relevant response.
* **Invoke the LLM**: Finally, the augmented prompt is passed to the Large Language Model (LLM). The LLM processes the combined input and generates a response that leverages both the original query and the retrieved information.

#### Retrieve Related Information

```vectorstore.as_retriever()``` returns a retriever that fetches the chunks most semantically similar to a query from the vector database. Here it is configured to return the 8 closest chunks (```k``` is 8).

```format_docs(docs)``` joins the text of the retrieved chunks into a single string, separated by blank lines.



In [9]:
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 8})

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
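
To see what ```format_docs``` produces without running the retriever, it can be applied to stand-in objects. `FakeDoc` is a hypothetical substitute for LangChain's `Document` that exposes only the `page_content` attribute the function reads.

```python
from dataclasses import dataclass

@dataclass
class FakeDoc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str

def format_docs(docs):
    # Same logic as the notebook cell: join chunk texts with a blank line.
    return "\n\n".join(doc.page_content for doc in docs)

docs = [FakeDoc("First retrieved chunk."), FakeDoc("Second retrieved chunk.")]
context = format_docs(docs)
print(context)
```

The resulting string is what later fills the `{context}` placeholder of the prompt.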

Optionally, you can issue a user query and call ```retriever.invoke(user_query)``` to check what information the retriever returns.


In [10]:
# user_query = "What is Attention"
# retrieved_docs = retriever.invoke(user_query)
# print(len(retrieved_docs))
# print(format_docs(retrieved_docs))

#### Augment the Prompt

Create a custom prompt template using ```PromptTemplate.from_template(template)``` to format the context and user query. The prompt explicitly tells the LLM to answer the question based on the provided context.

Then build a chain with LangChain that combines:
* LLM (Groq with model="llama3-70b-8192" in this example)
* Prompt template
* Retrieved context from the vector database
* User's query

The chain outputs a concise and informative answer to the user's query, based on the provided context.

Please note that an API key from Groq Cloud is required for this example. The API key ***GROQ_API_KEY*** is stored in the Secrets of Google Colab. Please see the other code examples in this chapter for how to set it up.


In [11]:
from google.colab import userdata
from langchain_groq import ChatGroq
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

llm = ChatGroq(groq_api_key=userdata.get('GROQ_API_KEY'), model="llama3-70b-8192")

template = \
"""Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Always say "thanks for asking!" at the end of the answer.

{context}

Question: {question}

Answer:
"""

custom_rag_prompt = PromptTemplate.from_template(template)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | custom_rag_prompt
    | llm
    | StrOutputParser()
)
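
The ```|``` pipeline above can be read as ordinary function composition: the question is sent to the retriever to build `context`, passed through unchanged as `question`, and the filled prompt is then handed to the model. A rough plain-Python equivalent follows. `fake_retrieve` and `fake_llm` are hypothetical stubs replacing the vector store and the Groq model, and `TEMPLATE` is an abbreviated copy of the prompt.

```python
TEMPLATE = """Use the following pieces of context to answer the question at the end.

{context}

Question: {question}

Answer:
"""

def fake_retrieve(question):
    """Hypothetical stub for `retriever | format_docs`."""
    return "Self-attention relates different positions of a single sequence."

def fake_llm(prompt):
    """Hypothetical stub for the chat model; it only reports the prompt size."""
    return f"[model reply to a {len(prompt)}-character prompt]"

def run_chain(question):
    # Step 1: the dict step builds both prompt variables from the one input.
    inputs = {"context": fake_retrieve(question), "question": question}
    # Step 2: the prompt template fills its placeholders.
    prompt = TEMPLATE.format(**inputs)
    # Step 3: the model answers; the output parser would return its text.
    return fake_llm(prompt)

filled = TEMPLATE.format(context=fake_retrieve("What is attention?"),
                         question="What is attention?")
print(filled)
print(run_chain("What is attention?"))
```

Printing the filled template is a useful debugging habit: it shows exactly what the model sees, including which retrieved chunks made it into the context.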


#### Invoke the LLM

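Build a simple chatbot loop. Each question is answered independently, without memory of earlier turns, and typing ```quit``` ends the loop.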

In [12]:
while True:
    user_question = input("You: ")

    if user_question.lower() == "quit":
        break

    response = rag_chain.invoke(user_question)
    print("Chatbot:", response)
    print("\n\n")


You: What is Transformer?
Chatbot: The Transformer is a model architecture that eschews recurrence and instead relies entirely on an attention mechanism to draw global dependencies between input and output. It is a neural network architecture that allows for significantly more parallelization and has achieved state-of-the-art results in translation quality. Thanks for asking!



You: What is Attention?
Chatbot: According to the provided context, attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. It's sometimes called intra-attention.

Thanks for asking!



You: What is Scaled Dot-Product Attention?
Chatbot: I don't know. The text does not explicitly define what Scaled Dot-Product Attention is. It only mentions that Noam proposed scaled dot-product attention, but it does not provide a definition or explanation of what it is. Thanks for asking!



You: What are Encoder and Decoder?
Chatbot: According