Retrieval-Augmented Generation (RAG) is the practice of providing large language models (LLMs) with additional information from an external knowledge source. This allows them to generate more accurate and contextual answers while reducing hallucinations. In this article, we provide a step-by-step guide to building a complete RAG application using Gemma 7B, an open LLM from Google, and Faiss, an open-source vector store.

With RAG, a question is not sent to the LLM directly. First, a retrieval step fetches relevant documents from a vector database in which those documents were indexed ahead of time.
To do this, we generate an embedding (a numerical vector representation) for the query and use it to look up the most similar entries in our knowledge library.
The retrieved text is then passed to the LLM as context, and the LLM uses that context to generate the answer that is returned to the user.
This greatly reduces the likelihood of hallucinations and lets us update what the model knows without retraining it, which is a costly process.
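
The retrieval step described above can be sketched in a few lines of plain Python. This is only a toy illustration: the `embed` function below is a hypothetical stand-in that counts words, whereas a real pipeline would use a trained embedding model, and the three sample documents are made up for the example.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, documents, k=1):
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

# Hypothetical knowledge library with three short documents.
documents = [
    "Faiss is a library for similarity search over dense vectors.",
    "Gemma is a family of open models released by Google.",
    "Text splitters break long documents into smaller chunks.",
]

top = retrieve("Who released the Gemma models?", documents, k=1)
print(top[0])
```

The query shares the words "released" and "gemma" with the second document only, so that document is ranked first and would be handed to the LLM as context.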

## Definitions

* LLM - Large Language Model
* Llama 2.0 - LLM from Meta
* Gemma - family of open LLMs from Google
* LangChain - a framework designed to simplify the creation of applications using LLMs
* Vector database - a database that organizes data through high-dimensional vectors
* ChromaDB - vector database
* Faiss - open-source library for vector similarity search, used here as the vector store
* RAG - Retrieval-Augmented Generation (described in the introduction above)

## Model details

* **Model**: Llama 2  
* **Variation**: 7b-chat-hf (7b: 7 billion parameters; hf: Hugging Face build)
* **Version**: V1  
* **Framework**: PyTorch  

The Llama 2 models are pretrained on 2 trillion tokens and range from 7 to 70 billion parameters, which makes Llama 2 one of the more powerful open models. It is a significant improvement over Llama 1.

In [None]:
%pip install -q -U langchain torch transformers sentence-transformers datasets faiss-cpu langchain_community

In [None]:
import torch
from datasets import load_dataset
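# Integrations live in langchain_community; the top-level `langchain` shortcuts are deprecated.
from langchain_community.llms import HuggingFacePipeline
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA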

from langchain_community.document_loaders.csv_loader import CSVLoader

from transformers import AutoTokenizer, pipeline

In [None]:
# data = load_dataset("HuggingFaceTB/cosmopedia", "stanford", split="train")
# data.to_csv("stanford_dataset.csv")
# print(data.select(range(5)).to_pandas())  # a Dataset has no .head(); preview the first 5 rows
# loader = CSVLoader(file_path='/kaggle/working/stanford_dataset.csv')
# data = loader.load()

When working with long pieces of text, it is necessary to split them into chunks. As simple as this sounds, there is a lot of potential complexity here. Ideally, you want to keep semantically related pieces of text together.

LangChain has many built-in document transformers, making it easy to split, combine, filter, and otherwise manipulate documents. We will use the RecursiveCharacterTextSplitter, which recursively tries to split on different separator characters until it finds one that produces small enough chunks. We set the chunk size to 1000 characters and the chunk overlap to 150 characters.
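
To make the effect of chunk size and overlap concrete, here is a simplified sketch. It is not the recursive algorithm LangChain uses: it is a plain fixed-size sliding window over characters, and the function name `split_with_overlap` is a hypothetical helper for illustration only. It does show how each chunk repeats the last 150 characters of the previous one.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=150):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks

# Synthetic 2500-character text so the chunk boundaries are easy to check.
text = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(text)
print([len(c) for c in chunks])  # [1000, 1000, 800]
```

The overlap means that a sentence cut at a chunk boundary still appears whole in at least one chunk, as long as it is shorter than the overlap.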

In [None]:
# text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
# docs = text_splitter.split_documents(data)

In [None]:
# modelPath = "sentence-transformers/all-MiniLM-l6-v2"
# model_kwargs = {'device':'cpu'}
# encode_kwargs = {'normalize_embeddings': False}

# embeddings = HuggingFaceEmbeddings(
#  model_name=modelPath, 
#  model_kwargs=model_kwargs, 
#  encode_kwargs=encode_kwargs 
# )

In [None]:
# db = FAISS.from_documents(docs, embeddings)
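
One detail worth illustrating is the `normalize_embeddings` flag set above. The sketch below assumes the index ranks results by Euclidean (L2) distance, which we believe is the default for LangChain's Faiss wrapper. With unnormalized vectors, L2 distance is affected by vector length as well as direction, so it can disagree with cosine similarity. After normalizing to unit length, the two rankings agree, because the squared L2 distance equals 2 - 2 * cosine. The vectors here are made-up 2D examples, not real embeddings.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def l2_squared(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity, computed as the dot product of unit vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def nearest_by_l2(query, candidates):
    """Name of the candidate with the smallest squared L2 distance to the query."""
    return min(candidates, key=lambda name: l2_squared(query, candidates[name]))

query = [1.0, 0.0]
candidates = {
    "aligned": [3.0, 0.0],   # same direction as the query, but much longer
    "off_axis": [1.0, 1.0],  # different direction, similar length
}

raw_winner = nearest_by_l2(query, candidates)
unit_candidates = {name: normalize(v) for name, v in candidates.items()}
unit_winner = nearest_by_l2(normalize(query), unit_candidates)
cosine_winner = max(candidates, key=lambda name: cosine(query, candidates[name]))
print(raw_winner, unit_winner, cosine_winner)  # off_axis aligned aligned
```

If cosine-style ranking is what you want, setting `normalize_embeddings` to `True` is one way to get it from an L2 index.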

Gemma is a family of four open LLMs from Google, based on Gemini. It comes in two sizes, 2B and 7B parameters, each with base (pretrained) and instruction-tuned versions. All the variants can be run on various types of consumer hardware, even without quantization, and have a context length of 8K tokens. To download the weights from the Hugging Face Hub, we first log in:

In [None]:
# from huggingface_hub import notebook_login
# notebook_login()

In [None]:
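# from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the weights in bfloat16 to halve memory use compared with float32.
# model = AutoModelForCausalLM.from_pretrained("google/gemma-7b", torch_dtype=torch.bfloat16)
# Padding and truncation are applied when the tokenizer is called, not when it is loaded.
# tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b")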
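
Before loading a model of this size, it helps to estimate how much memory the weights alone will need. The sketch below is a rough back-of-the-envelope calculation that assumes a nominal 7 billion parameters; the actual parameter count of a given checkpoint may differ, and activations and the generation cache need additional memory on top. The helper name `weight_memory_gib` is hypothetical.

```python
# Bytes needed to store one parameter in each numeric format.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1}

def weight_memory_gib(num_params, dtype):
    """Approximate memory for the weights only, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3

NOMINAL_PARAMS = 7e9  # assumption: nominal size suggested by the "7B" name

for dtype in BYTES_PER_PARAM:
    print(dtype, round(weight_memory_gib(NOMINAL_PARAMS, dtype), 1))
```

Under these assumptions, float32 weights need about 26 GiB while bfloat16 weights need about 13 GiB, which is why the data type matters on a single GPU.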

Create a text generation pipeline.

In [None]:
# pipe = pipeline(
#     "text-generation",
#     model=model,
#     tokenizer=tokenizer,
#     max_new_tokens=512,   # limit the answer length only; the prompt with retrieved context can be long
#     do_sample=True,       # sampling must be enabled for temperature to have an effect
#     temperature=0.7,
#     device="cuda"
# )

# Generation settings are configured on the pipeline above, so the wrapper needs no extra kwargs.
# llm = HuggingFacePipeline(pipeline=pipe)

The final step is to generate answers using both the vector store and the LLM. The chain generates an embedding for the input question, retrieves the relevant context from the vector store, and feeds that context to the LLM to generate the answer:

In [None]:
# qa = RetrievalQA.from_chain_type(
#     llm=llm,
#     chain_type="stuff",
#     retriever=db.as_retriever()
# )

# qa.invoke("Write an educational story for young children.")
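
The `"stuff"` chain type used above simply places all retrieved chunks into a single prompt. The sketch below shows the idea in plain Python. The template wording and the helper `build_stuff_prompt` are illustrative assumptions, not LangChain's exact default prompt, and the two chunks are made-up examples.

```python
def build_stuff_prompt(question, chunks):
    """Concatenate retrieved chunks into one context block and append the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following pieces of context to answer the question at the end.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Hypothetical chunks that a retriever might return for the question.
retrieved_chunks = [
    "Once upon a time, a curious fox wanted to learn how plants grow.",
    "Plants need sunlight, water, and soil to grow tall and strong.",
]

prompt = build_stuff_prompt("What do plants need to grow?", retrieved_chunks)
print(prompt)
```

Because every chunk ends up in the same prompt, the number of retrieved chunks multiplied by the chunk size has to fit within the model's context length, together with the question and the generated answer.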