# 1. RAG pipeline introduction

## 1.1. What is a RAG pipeline?

Let's assume that your company plans to build an efficient chatbot that can answer not only general questions but also specific questions about the content of your company's documents. It is hard for an LLM to answer specific questions about material it has never seen before.

The RAG (Retrieval-Augmented Generation) pipeline is an approach in natural language processing (NLP) that retrieves useful information and uses it to produce a more precise answer than plain LLM generation. It combines information retrieval with language generation, improving the performance of generative models by adding a retriever component.

## 1.2. Why use a RAG pipeline?

A RAG pipeline excels at retrieving relevant context from diverse data sources. It is a powerful tool that extends the capability of a plain LLM. In a nutshell, there are three principal advantages of using RAG:

- **Empowering LLM with real-time data access**:
Business context changes constantly, so enterprise data is dynamic. AI solutions built on LLMs must therefore stay up to date, and RAG makes this possible by giving the model direct access to additional data resources. Ideally, these resources should comprise real-time and personalized data.

- **Preserving data privacy**:
Much enterprise data is sensitive and confidential. That is why commercial LLMs like GPT-4, GPT-3.5, Claude, and Bard are banned in several corporations, especially where data is considered the new gold. Ensuring data privacy is therefore crucial for enterprises. To this end, with a self-hosted LLM (demonstrated in the RAG workflow), sensitive data can be retained on-premises just like locally stored data.

- **Mitigating LLM hallucinations**:
Because many LLMs lack access to factual, real-time information, they often generate responses that are inaccurate yet sound convincing. This phenomenon, known as hallucination, is mitigated by RAG, which provides the LLM with relevant, factual information.

## 1.3. Where is the RAG pipeline used?

Large language models (LLMs) have astounded the world with their unprecedented ability to understand and generate human-like responses. Their chat interface offers swift, natural interaction between humans and immense amounts of data. For instance, they are remarkably good at summarizing data and extracting its highlights, or at replacing functional queries such as SQL with natural language commands.

It is tempting to assume that these models generate business value without additional effort, but this is usually not the case. Fortunately, all that users need to do to extract value from an LLM is to supply it with their own data. This can be accomplished with retrieval-augmented generation (RAG), which is showcased throughout this tutorial.

By reinforcing an LLM with their business data, enterprises can make their AI applications agile and responsive to new developments. For instance:

- **Chatbots**: Many companies already use AI chatbots in their customer service so that customers can interact live on their websites day and night. With a RAG pipeline, companies can deploy a tailored chatbot that is grounded in their own products and policies. Specifically, questions about product specifications can be handled conveniently.

- **Customer service**: Companies can enable live service agents to easily answer customer questions with precise, up-to-date information.

- **Enterprise search**: Each enterprise has a wealth of knowledge across its departments, including company terms, sales policies, IT support articles, and code repositories. An internal search engine lets employees find this information faster and more precisely.

In short, this section has explained the benefits of using the RAG technique when implementing an LLM application; the next section describes the components of a RAG pipeline.



## 1.4. How does RAG pipeline work?

![](https://imgur.com/K9eh8oL.png)

The RAG pipeline typically consists of two main components:

**1.4.1. Retriever:** The retriever is responsible for selecting relevant passages or document chunks from a large corpus of text. It uses information retrieval techniques to identify the most informative and contextually relevant pieces of information. This step helps in reducing the search space and focusing on the most suitable content. In general, it includes these main steps:

- **Document ingestion**:
First, raw data from diverse sources, such as databases, PDFs, text files, images, or live streaming feeds, is collected into a data lake and then ingested into the RAG system. The most challenging aspect of this step is the diversity of data types, each of which requires its own pre-processing. To this end, LangChain offers a variety of document loaders that load data in many forms from diverse sources. The term document loader is used loosely: source documents do not need to be what you might think of as standard documents (PDFs, text files, and so on). That is why LangChain supports loading data from Confluence, CSV files, Outlook emails, and [more](https://python.langchain.com/docs/integrations/document_loaders). LlamaIndex also provides a variety of loaders, which can be viewed in [LlamaHub](https://llamahub.ai/).

- **Document pre-processing**:
Documents are often transformed after they have been loaded in the document ingestion step. One common transformation is text splitting, which breaks a long document into many smaller, contiguous segments. This is essential because embedding models have a limited input length; for example, `e5-large-v2` and `BAAI/bge-large-en` both accept at most 512 tokens per input. One caveat is that splitting may separate a statement from the context it needs. Therefore, retrieving larger segments or splitting the text with overlap between neighbouring chunks can be very useful in improving the relevance of the extracted context.

- **Generating embeddings**:
Data must be transformed into a format that the system can process efficiently. Generating embeddings involves converting data into high-dimensional vectors, which represent text in a numerical format.

- **Storing embeddings in vector databases**:
The processed data and generated embeddings are stored and indexed in specialized databases known as vector databases. These databases are optimized to store and search vectorized data, enabling fast search and retrieval operations. Storing the data in accelerated vector databases like Chroma, Pinecone, and Milvus ensures that information remains accessible and can be rapidly retrieved during real-time querying.
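
The overlapping split mentioned under document pre-processing can be sketched in a few lines of plain Python. This is only a minimal illustration under simple assumptions (fixed-size character windows); it is not LangChain's splitter, and the function name `split_with_overlap` and the sample text are made up for the example.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into character windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
print(chunks)
```

Each chunk repeats the last four characters of the previous one, so a statement cut at a chunk boundary still appears whole in at least one chunk as long as it is shorter than the overlap.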

**1.4.2. Generator:** The generator is a language model that takes the retrieved passages as input and generates coherent and contextually appropriate responses. It can be based on transformer architectures like GPT (Generative Pre-trained Transformer) or similar models that excel in natural language understanding and generation tasks. This phase involves two main elements:

- **LLMs**:
LLMs are the foundational generative component of the RAG pipeline. These large language models are trained on vast datasets, enabling them to comprehend and generate human-like text. In a RAG pipeline, LLMs are used to generate fully formed responses based on the user query and the contextual information extracted from the vector databases during real-time interactions.

- **Querying**:
When a user sends a query, the RAG system embeds it and compares the resulting vector with the pre-indexed chunk vectors, performing an efficient search based on similarity scoring. The system identifies relevant information by comparing the query vector with the stored vectors in the vector database. The LLM then uses the retrieved data to shape a human-like response.
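
To make the querying step concrete, here is a toy end-to-end sketch: embed the chunks, embed the query, rank by similarity, and paste the best chunk into a prompt. Everything in it is an assumption made for illustration: the bag-of-words `embed` function stands in for a real embedding model, the in-memory `index` list stands in for a vector database, and the sample chunks are invented.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: a word-count vector (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


chunks = [
    "refunds are processed within five working days",
    "the office is closed on public holidays",
    "premium support is available day and night",
]
# "Indexing": store every chunk next to its vector.
index = [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(query, k=1):
    """Return the k chunks whose vectors are most similar to the query vector."""
    query_vector = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


question = "how long until refunds are processed"
context = retrieve(question, k=1)
# The retrieved context is placed in the prompt that would be sent to the LLM.
prompt = "Context:\n" + "\n".join(context) + "\n\nQuestion: " + question
print(prompt)
```

In a real pipeline the final `prompt` would be passed to the generator LLM rather than printed.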


In conclusion, by combining retrieval and generation, the RAG pipeline aims to leverage the benefits of both approaches. Retrieval helps in extracting relevant information from a large dataset, while generation allows for the creation of diverse and contextually appropriate responses.

# 2. Build RAG pipeline using OpenAI

## 2.1. Build RAG pipeline on unstructured data

## 2.1.1. Setup

We’ll use an OpenAI chat model and embeddings and a Chroma vector store in this walkthrough, but everything shown here works with any ChatModel or LLM, Embeddings, and VectorStore or Retriever.

First, we need to install the packages required for building a RAG pipeline, such as `langchain`, `chromadb`, `openai`, and `tiktoken`.

In [None]:
!pip install langchain==0.0.352 unstructured[all-docs] pydantic==1.10.13 lxml==4.9.3
!pip install openai==1.6.1 chromadb==0.4.21 tiktoken==0.5.2 langchainhub==0.1.14

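We need to set the `OPENAI_API_KEY` environment variable, either directly or by loading it from a `.env` file:

In [2]:
import getpass
import os

# Prompt for the key so that it is never hard-coded in the notebook.
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API key: ")

# Alternatively, load the key from a .env file:
# import dotenv
# dotenv.load_dotenv()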

## 2.1.2. Build RAG pipeline

Sometimes an LLM cannot answer very specific questions about material it has never seen, such as an enterprise's internal documents, new papers, or books. It is therefore helpful to build an application with LangChain that can answer questions about such material. As a first simple case, in this guide we’ll build a QA app over [Dive into Deep Learning - chapter 3.1](https://d2l.ai/chapter_linear-regression/linear-regression.html), which allows us to ask questions about the contents of that chapter. We can create a simple indexing pipeline and RAG chain to do this in ~20 lines of code:


In [3]:
import bs4
from langchain import hub
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.chat_models import ChatOpenAI
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

In [4]:
# Load, chunk and index the contents of the blog.
loader = WebBaseLoader(
    web_paths=("https://d2l.ai/chapter_linear-regression/linear-regression.html",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("page-content")
        )
    ),
)
docs = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)
vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())

# Retrieve and generate using the relevant snippets of the blog.
retriever = vectorstore.as_retriever()
prompt = hub.pull("rlm/rag-prompt")
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

In [5]:
rag_chain.invoke("What is the loss function of Linear Regression?")

'The loss function of Linear Regression is the squared error. It quantifies the distance between the real and predicted values of the target, and it is given by the formula: \\(\\frac{1}{2}\\left(\\mathbf{w}^\\top \\mathbf{x}^{(i)} + b - y^{(i)}\\right)^2\\). The goal of training the model is to find the parameters (\\(\\mathbf{w}^*, b^*\\)) that minimize the total loss across all training examples.'

## 2.1.3. Code explanation

#### DataLoader

We need to first load the blog post contents. We can use [DocumentLoaders](https://python.langchain.com/docs/modules/data_connection/document_loaders/) for this, which are objects that load in data from a source and return a list of [Documents](https://api.python.langchain.com/en/latest/documents/langchain_core.documents.base.Document.html). A Document is an object with some page_content (str) and metadata (dict).

In this case we’ll use the [WebBaseLoader](https://python.langchain.com/docs/integrations/document_loaders/web_base), which uses urllib to load HTML from web URLs and BeautifulSoup to parse it to text. We can customize the HTML -> text parsing by passing in parameters to the BeautifulSoup parser via bs_kwargs (see [BeautifulSoup docs](https://beautiful-soup-4.readthedocs.io/en/latest/#beautifulsoup)). In this case only HTML tags with class “page-content” are relevant, so we’ll remove all others.

In [6]:
# Load, chunk and index the contents of the blog.
loader = WebBaseLoader(
    web_paths=("https://d2l.ai/chapter_linear-regression/linear-regression.html",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("page-content")
        )
    ),
)
docs = loader.load()

In [7]:
len(docs[0].page_content)

32916

#### Indexing split

Our loaded document is almost 33k characters long. This is too long to fit in the context window of many models. Even for those models that could fit the full post in their context window, models can struggle to find information in very long inputs.

To handle this we’ll split the Document into chunks for embedding and vector storage. This should help us retrieve only the most relevant bits of the blog post at run time.

In this case we’ll split our documents into chunks of 1000 characters with 200 characters of overlap between chunks. The overlap helps mitigate the possibility of separating a statement from important context related to it. We use the [RecursiveCharacterTextSplitter](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter), which will recursively split the document using common separators like new lines until each chunk is the appropriate size. This is the recommended text splitter for generic text use cases.

We set add_start_index=True so that the character index at which each split Document starts within the initial Document is preserved as metadata attribute “start_index”.



In [8]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)
all_splits = text_splitter.split_documents(docs)

In [9]:
len(all_splits)

45

In [10]:
len(all_splits[0].page_content)

995

In [11]:
all_splits[10].metadata

{'source': 'https://d2l.ai/chapter_linear-regression/linear-regression.html',
 'start_index': 6331}
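
The recursive strategy described above can be pictured with a simplified sketch: try the coarsest separator first, fall back to finer ones only for pieces that are still too long, then merge neighbouring pieces while they fit. This is an illustration under simplifying assumptions, not LangChain's implementation (it ignores overlap and drops separators at chunk boundaries); the function name `recursive_split` and the sample text are hypothetical.

```python
from typing import List


def recursive_split(text: str, chunk_size: int, separators: List[str]) -> List[str]:
    """Split on the first separator; recurse with finer separators on oversized parts."""
    if len(text) <= chunk_size:
        return [text]
    if not separators:
        # No separator left: fall back to hard cuts of chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    pieces = []
    for part in text.split(sep):
        pieces.extend(recursive_split(part, chunk_size, finer))
    # Greedily merge neighbouring pieces back together while the result still fits.
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


text = (
    "Linear regression is simple.\n\n"
    "The loss is the squared error. We minimize it with gradient descent."
)
chunks = recursive_split(text, chunk_size=40, separators=["\n\n", ". ", " "])
for chunk in chunks:
    print(repr(chunk))
```

The first paragraph fits in one chunk, so it is never cut; only the longer second paragraph is split further, at the sentence boundary.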

#### Indexing store

Now we need to index our 45 text chunks so that we can search over them at runtime. The most common way to do this is to embed the contents of each document split and insert these embeddings into a vector database (or vector store). When we want to search over our splits, we take a text search query, embed it, and perform some sort of “similarity” search to identify the stored splits with the most similar embeddings to our query embedding. The simplest similarity measure is cosine similarity: we measure the cosine of the angle between each pair of embeddings (which are high-dimensional vectors).
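
As a small worked example of the cosine measure, the sketch below computes it for hand-made three-dimensional vectors. The vectors and the function name `cosine_similarity` are assumptions chosen for illustration; real embeddings have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (|u| * |v|)"""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


query = [1.0, 0.0, 1.0]
same_direction = [2.0, 0.0, 2.0]  # a scaled copy of the query
orthogonal = [0.0, 3.0, 0.0]      # shares no component with the query

score_same = cosine_similarity(query, same_direction)
score_orthogonal = cosine_similarity(query, orthogonal)
print(score_same, score_orthogonal)
```

Because only the angle matters, the scaled copy scores (up to rounding) 1 even though its length differs, while the orthogonal vector scores 0.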

We can embed and store all of our document splits in a single command using the Chroma vector store and OpenAIEmbeddings model.

In [None]:
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

vectorstore = Chroma.from_documents(documents=all_splits, embedding=OpenAIEmbeddings())

#### Retrieval and Generation: Retrieve

Now let’s write the actual application logic. We want to create a simple application that takes a user question, searches for documents relevant to that question, passes the retrieved documents and initial question to a model, and returns an answer.

First we need to define our logic for searching over documents. LangChain defines a [Retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/) interface which wraps an index that can return relevant Documents given a string query.

The most common type of Retriever is the [VectorStoreRetriever](https://python.langchain.com/docs/modules/data_connection/retrievers/vectorstore), which uses the similarity search capabilities of a vector store to facilitate retrieval. Any VectorStore can easily be turned into a Retriever with VectorStore.as_retriever():

In [None]:
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 6})

In [None]:
retrieved_docs = retriever.invoke("What are the loss function of Linear Regression?")

In [None]:
len(retrieved_docs)

6

In [None]:
print(retrieved_docs[0].page_content)

3.1.1.2. Loss Function¶
Naturally, fitting our model to the data requires that we agree on some
measure of fitness (or, equivalently, of unfitness). Loss
functions quantify the distance between the real and predicted
values of the target. The loss will usually be a nonnegative number
where smaller values are better and perfect predictions incur a loss of
0. For regression problems, the most common loss function is the squared
error. When our prediction for an example \(i\) is
\(\hat{y}^{(i)}\) and the corresponding true label is
\(y^{(i)}\), the squared error is given by:


#### Retrieval and Generation: Generate

Let’s put it all together into a chain that takes a question, retrieves relevant documents, constructs a prompt, passes that to a model, and parses the output.

We’ll use the gpt-3.5-turbo OpenAI chat model, but any LangChain LLM or ChatModel could be substituted in.

In [None]:
from langchain_community.chat_models import ChatOpenAI

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

We’ll use a prompt for RAG that is checked into the [LangChain prompt hub](https://smith.langchain.com/hub/rlm/rag-prompt).

In [None]:
from langchain import hub

prompt = hub.pull("rlm/rag-prompt")

In [None]:
example_messages = prompt.invoke(
    {"context": "filler context", "question": "filler question"}
).to_messages()
example_messages

[HumanMessage(content="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: filler question \nContext: filler context \nAnswer:")]

In [None]:
print(example_messages[0].content)

You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: filler question 
Context: filler context 
Answer:


We’ll use the [LCEL Runnable protocol](https://python.langchain.com/docs/expression_language/) to define the chain, allowing us to:

- pipe together components and functions in a transparent way
- automatically trace our chain in LangSmith
- get streaming, async, and batched calling out of the box

In [None]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

In [None]:
for chunk in rag_chain.stream("What is the loss function of Linear Regression?"):
    print(chunk, end="", flush=True)
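
The `|` composition used in `rag_chain` can be pictured with a tiny stand-in class. This is a conceptual sketch only, not how LangChain implements runnables: the `Step` class and the four toy steps are hypothetical, and the "LLM" here merely upper-cases its input.

```python
class Step:
    """Hypothetical stand-in for a runnable: wraps a function and supports `|`."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # `a | b` builds a new step that feeds a's output into b.
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value):
        return self.func(value)


# Toy stand-ins for the retriever, prompt, model, and output parser.
retrieve = Step(lambda question: {"context": "squared error", "question": question})
build_prompt = Step(lambda d: f"Context: {d['context']}\nQuestion: {d['question']}")
fake_llm = Step(lambda prompt: {"content": prompt.upper()})
parse = Step(lambda message: message["content"])

chain = retrieve | build_prompt | fake_llm | parse
result = chain.invoke("Which loss?")
print(result)
```

Each step only needs to agree with its neighbours on the shape of the value passed along, which is why components can be swapped without touching the rest of the chain.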

# 3. Build RAG for Vietnamese language using VinaLlama

The procedure is the same as for the OpenAI pipeline. However, the free T4 GPU on Google Colab may not have enough memory to run [VinaLlama-7b-chat](https://huggingface.co/vilm/vinallama-7b-chat). You may consider upgrading to Colab Pro, which gives access to A100 or V100 GPUs for inference.

However, the `transformers` library lets users run inference with large language models of up to 7B parameters on a T4 GPU (with 16 GB of VRAM) by using quantization techniques. These reduce memory and computational costs by representing weights and activations with lower-precision data types such as 8-bit integers (int8). This makes it possible to load larger models that would not normally fit into memory and to speed up inference. Transformers supports the AWQ and GPTQ quantization algorithms, as well as 8-bit and 4-bit quantization with `bitsandbytes`.
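
A rough back-of-the-envelope calculation shows why lower precision matters. The sketch below counts the memory taken by the weights alone, assuming exactly 7 billion parameters; activations, the KV cache, and quantization metadata add overhead on top, so treat the numbers as lower bounds. The helper name `weight_memory_gb` is made up for this example.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


params_7b = 7e9  # assumption: a model with exactly 7 billion parameters
for label, bits in [("float16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gb(params_7b, bits):.1f} GB")
```

This gives roughly 14 GB in float16, 7 GB in int8, and 3.5 GB in 4-bit, which is why a 4-bit 7B model leaves headroom on a 16 GB GPU while a float16 one barely fits.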



In [None]:
!pip install transformers==4.34.0
!pip install sentence-transformers==2.2.2
!pip install bitsandbytes==0.41.3
!pip install llama-cpp-python==0.2.26
!pip install accelerate==0.25.0

In [5]:
import bs4
from langchain import hub
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.chat_models import ChatOpenAI
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain.llms.huggingface_pipeline import HuggingFacePipeline
from langchain.embeddings import HuggingFaceEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from transformers import AutoTokenizer, AutoModel, AutoModelForCausalLM, AutoModelForQuestionAnswering
from transformers import pipeline
from transformers import BitsAndBytesConfig
import torch

In [9]:
# Config model to load under 4-bit quantization
nf4_config = BitsAndBytesConfig(
   load_in_4bit=True,
   bnb_4bit_quant_type="nf4",
   bnb_4bit_use_double_quant=True,
   bnb_4bit_compute_dtype=torch.bfloat16
)

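# Load the model used for text generation.
# The smaller 2.7B variant is used here; switch to "vilm/vinallama-7b-chat" if memory allows.
model_name = "vilm/vinallama-2.7b-chat"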
tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=nf4_config
)

# Create a text-generation pipeline from the causal language model.
# Each call generates at most 512 new tokens.
question_answerer = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512
)

# Create a huggingface pipeline for question and answering task
llm = HuggingFacePipeline(
    pipeline=question_answerer,
    model_kwargs={"temperature": 0.7},
)

With a standard question-answering pipeline in place, we can test the model on a specific question by wrapping it in a prompt template suited to the question-answering task.

Each model has its own standard chat-style prompt format. For the `VinaLlama-7b-chat` model, the prompt includes three roles (system, user, and assistant) in the following format:

```
<|im_start|>system
Bạn là một trợ lí AI hữu ích. Hãy trả lời người dùng một cách chính xác.
<|im_end|>
<|im_start|>user
{query}<|im_end|>
<|im_start|>assistant
```

Each message is wrapped between the `<|im_start|>` and `<|im_end|>` tags.

`system` is a message from the system that stays fixed throughout the chat application. It is always prepended to any message sent to the chat application.

`user` is the message from the human, usually a question or a task you want the model to perform.

`assistant` is the cue that prompts the model to start answering.
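
The layout above can be assembled by a small helper so that the tags are not retyped for every question. This is a minimal sketch that mirrors the single-turn format shown above; the function name `build_chatml_prompt` is hypothetical and not part of any library.

```python
def build_chatml_prompt(system: str, user: str) -> str:
    """Assemble a single-turn prompt in the <|im_start|> / <|im_end|> layout."""
    return (
        f"<|im_start|>system\n{system}\n<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        "<|im_start|>assistant\n"  # left open so the model continues from here
    )


system_message = "Bạn là một trợ lí AI hữu ích. Hãy trả lời người dùng một cách chính xác."
chat_prompt = build_chatml_prompt(system_message, "Ai là thủ tướng của Việt Nam?")
print(chat_prompt)
```

The prompt deliberately ends right after the `assistant` tag: the model's completion is then read as the assistant's reply.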

In [23]:
query = "Ai là thủ tướng của Việt Nam?"

prompt=f"""
<|im_start|>system
Bạn là một trợ lí AI hữu ích. Hãy trả lời người dùng một cách chính xác.
<|im_end|>
<|im_start|>user
{query}<|im_end|>
<|im_start|>assistant
"""

print(prompt)

response = llm.predict(prompt)
print(response)


<|im_start|>system
Bạn là một trợ lí AI hữu ích. Hãy trả lời người dùng một cách chính xác.
<|im_end|>
<|im_start|>user
Ai là thủ tướng của Việt Nam?<|im_end|>
<|im_start|>assistant

Thủ tướng hiện tại của Việt Nam là Phạm Minh Chính. Ông đã được bổ nhiệm làm Thủ tướng Chính phủ vào ngày 1 tháng 1 năm 2021, sau khi kế nhiệm của người tiền nhiệm là Nguyễn Xuân Phúc. 

The result shows that the model can produce a plausible answer to the input question: it names the correct prime minister, although the appointment date it gives is inaccurate, which is the kind of error that retrieved context is meant to reduce. A RAG chain can therefore be designed to extract relevant context, format it properly, and feed the whole template to the large language model. Let's go through the steps of building this RAG pipeline.

Step 1: Load the data from the website and split the text into chunks with a chunk size of 1000 and an overlap of 200.

In [7]:
# Load, chunk and index the contents of the blog.
loader = WebBaseLoader(
    web_paths=("https://phamdinhkhanh.github.io/deepai-book/intro.html",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("tex2jax_ignore mathjax_ignore section")
        )
    ),
)
docs = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)

Step 2: Build a question-answering pipeline based on an embedding model and an LLM, which are responsible for retrieving relevant contexts and for generating the answer through in-context learning, respectively. For Vietnamese, we can use [VoVanPhuc/sup-SimCSE-VietNamese-phobert-base](https://huggingface.co/VoVanPhuc/sup-SimCSE-VietNamese-phobert-base) as an open-source Hugging Face embedding model; it was trained with supervised learning. To generate general-purpose answers, we can load the pretrained model [vilm/vinallama-7b-chat](https://huggingface.co/vilm/vinallama-2.7b) from Hugging Face. This model is trained in chat style, so it is particularly well suited to the question-answering task.

In [4]:
from langchain.llms.huggingface_pipeline import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from transformers import BitsAndBytesConfig
import torch

# Define model name of question and answering
model_name = "vilm/vinallama-7b-chat" # or "vilm/vinallama-2.7b-chat" for a lower version

nf4_config = BitsAndBytesConfig(
   load_in_4bit=True,
   bnb_4bit_quant_type="nf4",
   bnb_4bit_use_double_quant=True,
   bnb_4bit_compute_dtype=torch.bfloat16
)

# Load model under 4bit quantization
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=nf4_config
)

# Create a tokenizer object by loading the pretrained "vilm/vinallama-7b-chat" tokenizer.
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Build a text-generation pipeline that generates at most 512 new tokens per call.
question_answerer = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512
)

# Wrap the pipeline in LangChain's HuggingFacePipeline so that it can be used in a chain.
llm = HuggingFacePipeline(
    pipeline=question_answerer,
    model_kwargs={"temperature": 0.7},
)


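Step 3: Build the vector index that stores the embedding vectors together with their text chunks.

In [14]:
# Embed the chunks with the Vietnamese embedding model introduced in Step 2
# (OpenAIEmbeddings() also works if you prefer a hosted embedding model).
embeddings = HuggingFaceEmbeddings(model_name="VoVanPhuc/sup-SimCSE-VietNamese-phobert-base")
vectorstore = Chroma.from_documents(documents=splits, embedding=embeddings)

# Expose the vector store as a retriever for the RAG chain.
retriever = vectorstore.as_retriever()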

Step 4: Build the rag_chain with the LCEL Runnable protocol.

In [26]:
prompt = ChatPromptTemplate.from_messages([
    ("system", "<|im_start|>system\n Bạn là một trợ lý ảo cho tác vụ hỏi đáp. Sử dụng những mẩu văn bản được trích xuất để trả lời câu hỏi. Nếu bạn không biết, hãy trả lời tôi không biết.<|im_end|>"),
    ("human", "<|im_start|>user\n Sử dụng ba câu tối đa và giữ câu trả lời nhất quán.\nCâu hỏi: {question} \nBối cảnh: {context} \nCâu trả lời:<|im_end|>"),
    ("assistant", "<|im_start|>assistant")
])

# prompt = ChatPromptTemplate.from_messages([
#     ("human", "Bạn là một trợ lý ảo cho tác vụ hỏi đáp. Sử dụng những mẩu văn bản được trích xuất để trả lời câu hỏi. Nếu bạn không biết, hãy trả lời tôi không biết. Sử dụng ba câu tối đa và giữ câu trả lời nhất quán.\nCâu hỏi: {question} \nBối cảnh: {context} \nCâu trả lời:")
# ])


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

In [None]:
%%timeit
rag_chain.invoke("Nội dung chính của quyển sách này là gì?")

# 4. Reference


1. [Semi Structured RAG - Langchain tutorial](https://github.com/langchain-ai/langchain/blob/master/cookbook/Semi_Structured_RAG.ipynb)

2. [Multi-modal RAG - Langchain tutorial](https://github.com/langchain-ai/langchain/blob/master/cookbook/Multi_modal_RAG.ipynb)

3. [RAG pipeline huggingface](https://huggingface.co/docs/transformers/model_doc/rag)

4. [Demystifying retrieval augmented generation pipelines - Nvidia](https://developer.nvidia.com/blog/rag-101-demystifying-retrieval-augmented-generation-pipelines/)

5. [Implementing RAG with langchain and huggingface](https://medium.com/international-school-of-ai-data-science/implementing-rag-with-langchain-and-hugging-face-28e3ea66c5f7)

6. [Implement huggingface models using langchain - analyticsvidhya](https://www.analyticsvidhya.com/blog/2023/12/implement-huggingface-models-using-langchain/)