# Hello, Llama: Simple RAG  

This notebook is a step-by-step guide to implementing a simple Retrieval-Augmented Generation (RAG) system using a fantasy animals FAQ dataset.

**RAG (Retrieval-Augmented Generation)** is an architecture that combines two key components:
- Retrieval: searching a corpus of documents (or knowledge base) for information relevant to a given query.
- Generation: a language model, such as Llama, takes the retrieved information and generates a coherent, contextually appropriate response.

### A RAG system in general works by:
1. Storing documents in a vector database
2. Retrieving relevant information when a query is received
3. Using this retrieved context to generate accurate responses
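
The three steps above can be sketched in plain Python before bringing in any library. This is only a toy illustration: the two documents are made up for the example, word overlap stands in for vector similarity, and the "generation" step just builds the prompt that a real system would send to an LLM.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return its alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

# 1. Store: keep each document next to a simple bag-of-words "vector".
docs = [
    "Drakelions are carnivores that hunt mountain deer.",
    "A Drakelion is a lion-like creature with dragon scales.",
]
store = [(doc, Counter(tokenize(doc))) for doc in docs]

# 2. Retrieve: rank documents by the number of words shared with the query.
def retrieve(query, k=1):
    query_words = Counter(tokenize(query))
    ranked = sorted(
        store,
        key=lambda item: sum((item[1] & query_words).values()),
        reverse=True,
    )
    return [doc for doc, _ in ranked[:k]]

# 3. Generate: a real system would send this prompt to an LLM.
def build_prompt(query):
    context = "\n".join(retrieve(query))
    return f"Context: {context}\n\nQuestion: {query}\nAnswer:"

prompt = build_prompt("What do carnivores hunt?")
print(prompt)
```

The rest of the notebook replaces the word-overlap ranking with embeddings and a vector store, and the prompt stub with a real model call.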


#### Install required libraries

Uncomment and run the following cell to install the required packages:

In [1]:
# uncomment to install required libraries

# !pip install langchain==0.3.4
# !pip install langchain_community==0.3.3
# !pip install langchain_huggingface==0.1.1

# Install FAISS. For CUDA supported GPU: `pip install faiss-gpu` | For CPU `pip install faiss-cpu`
# !pip install faiss-cpu==1.9.0

#### Import required libraries

In [2]:
import json
import os
import time

import torch
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM
from langchain_core.documents import Document
from langchain_core.prompts import PromptTemplate
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings, HuggingFacePipeline
from langchain_text_splitters import CharacterTextSplitter

#### Load dataset in memory

We'll load our fantasy animals FAQ from a JSON file.

In [3]:
fantasy_animals_file = "fantasy-animals-faqs.json"

with open(fantasy_animals_file, 'r') as file:
    data = json.load(file)  # Load JSON data into a Python dictionary


print(f"Questions loaded: {len(data['questions'])}\n\n")

Questions loaded: 33
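
The code in this notebook reads a top-level `questions` list whose entries have `title` and `answer` keys. A minimal sketch of that shape follows, using one entry that appears in the outputs of this notebook. The excerpt is an assumption about the layout: the real file may contain additional fields.

```python
import json

# Hypothetical excerpt: only the keys accessed in this notebook are shown.
sample = """
{
  "questions": [
    {
      "title": "What is a Drakelion?",
      "answer": "A Drakelion is a large, lion-like creature with dragon-like scales covering its body. It is known for its fearsome roar and powerful leaps."
    }
  ]
}
"""

sample_data = json.loads(sample)
for entry in sample_data["questions"]:
    print(entry["title"], "->", entry["answer"][:40])
```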




#### Dataset preparation

To load documents, you can use one of the prebuilt document loaders listed here: https://python.langchain.com/docs/integrations/document_loaders/.
In this example, I build the documents manually, but any of these loaders would work as well.


In [4]:
documents = []

for question in data['questions']:
    documents.append(
        Document(
            page_content = question['answer'],
            metadata = {'title': question['title']}
        )
    )

# Display basic information about loaded documents
print(f"Number of documents loaded: {len(documents)}")
print(f"Preview of first document:\n{documents[0].page_content}")

Number of documents loaded: 33
Preview of first document:
A Drakelion is a large, lion-like creature with dragon-like scales covering its body. It is known for its fearsome roar and powerful leaps.


### Splitting documents to optimize retrieval

When building a RAG (Retrieval-Augmented Generation) system, it's crucial to break the content into smaller pieces to ensure the context provided to the LLM (Large Language Model) isn't too large. This is important because many paid LLM services charge based on the number of tokens in each request. If the content isn't properly split, you may incur higher costs and, in some cases, the context size may exceed the LLM's processing limit.
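
To get a feel for the context size, a rough estimate can be enough. The sketch below assumes about 4 characters per token, which is only a common rule of thumb for English text. An exact count requires the tokenizer of the model actually used, and the helper names are hypothetical.

```python
import math

CHARS_PER_TOKEN = 4  # assumption: rough average for English text

def estimate_tokens(text):
    """Return a rough token estimate based on character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)

def fits_budget(chunks, max_context_tokens):
    """Check whether all chunks together stay within a token budget."""
    return sum(estimate_tokens(c) for c in chunks) <= max_context_tokens

chunks = ["a" * 200, "b" * 200]    # two 200-character chunks
print(estimate_tokens(chunks[0]))  # 50
print(fits_budget(chunks, 128))    # True  (100 <= 128)
print(fits_budget(chunks, 64))     # False (100 > 64)
```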

#### Document Splitting Strategies

##### Chunk-based Splitting
Chunk-based splitting is a straightforward method but comes with notable drawbacks. It often splits text arbitrarily, potentially breaking sentences or words, leading to a loss of context and reduced comprehensibility. Furthermore, if the chosen separator is absent or inconsistently applied, chunk sizes can vary significantly, resulting in processing challenges. These chunks may also begin or end abruptly, diminishing readability and overall coherence.
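
A minimal sketch of the most naive variant, cutting every `chunk_size` characters, makes the drawback visible. The function name is hypothetical, and this is not how any particular library implements it.

```python
def fixed_size_chunks(text, chunk_size):
    """Cut text every chunk_size characters, ignoring word boundaries."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

text = "A Drakelion is a large, lion-like creature with dragon-like scales."
for chunk in fixed_size_chunks(text, 25):
    print(repr(chunk))
```

With a chunk size of 25, the first chunk ends in the middle of the word "lion-like", which is exactly the kind of abrupt boundary described above.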
  
##### Semantic Text Splitting
A more advanced technique is semantic text splitting, which includes methods like sentence-based and paragraph-based splitting:
- **Sentence-based splitting** uses natural language processing (NLP) to divide the text into complete sentences, preserving context and improving readability.
- **Paragraph-based splitting**, on the other hand, divides the text into paragraphs, which is especially useful for longer documents, as it maintains the overall structure and coherence of the text.
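
As a rough illustration of sentence-based splitting, the sketch below packs whole sentences into chunks. It assumes a naive regular expression for sentence boundaries. Real NLP sentence tokenizers handle abbreviations and other edge cases that this does not.

```python
import re

def sentence_chunks(text, chunk_size):
    """Pack whole sentences into chunks of at most chunk_size characters.

    A single sentence longer than chunk_size is kept intact in its own chunk.
    """
    # Naive boundary: split after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > chunk_size:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

text = (
    "A Drakelion is a large, lion-like creature. "
    "It is known for its fearsome roar. "
    "It also makes powerful leaps."
)
chunks = sentence_chunks(text, 80)
for chunk in chunks:
    print(repr(chunk))
```

Every chunk starts and ends on a sentence boundary, at the cost of chunk sizes that vary with sentence length.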

#### CharacterTextSplitter

This splitter takes a basic approach: it splits the text at a separator and then merges the pieces into chunks of up to `chunk_size` characters.

Parameters:
- `separator`: the character or sequence of characters at which the text is split.
- `chunk_size`: the maximum size, in characters, of each chunk.
- `chunk_overlap`: the number of characters shared between consecutive chunks, which helps maintain context across chunk boundaries.
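
The idea of overlap is easiest to see with a sliding character window. This is a simplified sketch with a hypothetical helper. `CharacterTextSplitter` itself builds its overlap from separator-delimited pieces rather than raw character windows.

```python
def chunks_with_overlap(text, chunk_size, chunk_overlap):
    """Slide a window of chunk_size characters, stepping by size minus overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

print(chunks_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

Each chunk repeats the last two characters of the previous one, so text near a boundary appears in both chunks.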

**Important:**   
The `CharacterTextSplitter` only splits at the specified separator (`\n\n` in this case). If a stretch of text longer than `chunk_size` contains no separator, it is kept as a single chunk, so `chunk_size` is not strictly enforced.
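
The behavior can be reproduced with a simplified model of separator-based splitting. The function below is a hypothetical sketch written for this explanation, not LangChain's actual implementation.

```python
def split_on_separator(text, separator, chunk_size):
    """Split at separator, then merge neighbours while they fit in chunk_size.

    Pieces are never cut, so a piece without the separator can exceed chunk_size.
    """
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = f"{current}{separator}{piece}" if current else piece
        if current and len(candidate) > chunk_size:
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

with_separator = "first paragraph\n\nsecond paragraph"
without_separator = "x" * 50  # 50 characters, no double newline

print(split_on_separator(with_separator, "\n\n", chunk_size=20))
# ['first paragraph', 'second paragraph']
print([len(c) for c in split_on_separator(without_separator, "\n\n", chunk_size=20)])
# [50]
```

The second call returns one 50-character chunk even though `chunk_size` is 20, because there is no separator at which to split.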


In [5]:
text_splitter = CharacterTextSplitter(
    separator='\n\n', # split at every double newline
    chunk_size=200, # each chunk can contain up to 200 characters
    chunk_overlap=0 # no overlap between consecutive chunks
)

splitted_documents = text_splitter.split_documents(documents)

In our specific case, the documents will not be split because the text does not contain double line breaks (`\n\n`). The `CharacterTextSplitter` can only divide text at the specified separator, so each document stays a single chunk even when it exceeds the intended `chunk_size`.

### Loading Documents into a Vector Store

In this example, we use FAISS (Facebook AI Similarity Search) as the vector store. A vector store is a specialized data structure designed to store and efficiently manage high-dimensional vectors. FAISS, developed by Facebook AI, is a library optimized for similarity search and clustering of dense vectors. You can find more details about FAISS [here](https://python.langchain.com/docs/integrations/vectorstores/faiss/).

To store documents in a vector store, the text must first be converted into vector representations. This is done with a **SentenceTransformer** model from the [sentence-transformers](https://huggingface.co/sentence-transformers) library, loaded through LangChain's `HuggingFaceEmbeddings` wrapper.


#### What we will do:
- **Generate Embeddings**: Using SentenceTransformer, we generate dense vector embeddings for sentences or text snippets. These embeddings capture the semantic meaning of the text, which is critical for applications like semantic search or sentence similarity.
- **Store in Vector Store**: The generated embeddings are stored in FAISS for efficient retrieval based on similarity to a given query.

This combination of SentenceTransformer and FAISS enables semantic search, allowing us to retrieve and compare text based on its meaning rather than relying solely on keyword matching.
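
To make "comparing text by meaning" concrete, here is a small sketch of cosine similarity, a common way to compare embeddings. The 3-dimensional vectors and their labels are invented for the example. The all-MiniLM-L6-v2 model used below produces 384-dimensional vectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings of two documents, labeled by topic.
embeddings = {
    "diet": [0.9, 0.1, 0.2],
    "appearance": [0.2, 0.8, 0.1],
}
query_vector = [0.8, 0.2, 0.1]  # hypothetical embedding of a diet question

best = max(embeddings, key=lambda name: cosine_similarity(query_vector, embeddings[name]))
print(best)  # diet
```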

In [6]:
base_folder = "FILL_WITH_BASE_FOLDER" # Example: "C:/Users/username/Documents/HuggingFace"

# we will use the model https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
model_name = "all-MiniLM-L6-v2"

# build the local path to the downloaded model
model_id = os.path.join(base_folder, model_name)

### Initialize embedding model

In [7]:
# Load the HuggingFace embedding model
embedding_model = HuggingFaceEmbeddings(model_name=model_id)

The `HuggingFaceEmbeddings` class loads a pre-trained embedding model, here from the local folder given by `model_id`, and uses it to generate embeddings for the documents.

### Initialize Vector Store

In [8]:
vector_store = FAISS.from_documents(splitted_documents, embedding_model)

Here, the FAISS vector store is initialized by converting the pre-processed `splitted_documents` into vector representations using the specified `embedding_model`. These vectors are then stored in the FAISS database, enabling efficient similarity searches.
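
Conceptually, an exact ("flat") index compares the query vector with every stored vector and returns the closest ones. The class below is a toy sketch of that idea using Euclidean distance. The class and its methods are hypothetical, and FAISS itself is far more optimized and offers other index types.

```python
import math

class FlatIndex:
    """Toy exact nearest-neighbour index using Euclidean (L2) distance."""

    def __init__(self):
        self.vectors = []
        self.payloads = []

    def add(self, vector, payload):
        """Store a vector together with the item it represents."""
        self.vectors.append(vector)
        self.payloads.append(payload)

    def search(self, query_vector, k=1):
        """Return the k (distance, payload) pairs closest to query_vector."""
        scored = [
            (math.dist(query_vector, vector), payload)
            for vector, payload in zip(self.vectors, self.payloads)
        ]
        return sorted(scored, key=lambda pair: pair[0])[:k]

index = FlatIndex()
index.add([0.0, 0.0], "doc A")
index.add([3.0, 4.0], "doc B")
print(index.search([0.0, 1.0], k=1))
# [(1.0, 'doc A')]
```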

### Execute similarity search and enrich prompt

Once the vector store is ready, we can perform a similarity search.

#### Basic Workflow:
- **Input Query**: Provide a query or question.
- **Retrieve Similar Documents**: The similarity search identifies documents most similar to the query, based on their embeddings in the vector space.
- **Enhance the Prompt**: Add the retrieved documents to the prompt to provide additional context for the LLM (Large Language Model).

In [9]:
query = "Which is the food of the Drakelion?"

number_of_similar_documents_to_find = 2

results = vector_store.similarity_search(query, k=number_of_similar_documents_to_find)

for i, doc in enumerate(results):
    print(f"Document similar with order {i}")
    print(f"- Title: {doc.metadata['title']}")
    print(f"- Page Content: {doc.page_content}")
    print("--------------------------------------\n")

Document similar with order 0
- Title: What do Drakelions eat?
- Page Content: Drakelions are carnivores, feeding primarily on large mammals such as mountain deer, wild boars, and sometimes even smaller dragons.
--------------------------------------

Document similar with order 1
- Title: What is a Drakelion?
- Page Content: A Drakelion is a large, lion-like creature with dragon-like scales covering its body. It is known for its fearsome roar and powerful leaps.
--------------------------------------
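
A similarity search always returns `k` documents, even when none of them is really relevant. One possible safeguard, sketched below, is to also look at the scores. The sketch assumes the store exposes `similarity_search_with_score`, as LangChain's FAISS wrapper does, and that a lower score means a closer match. The helper name and the `max_distance` value are hypothetical and would need tuning on real data.

```python
def search_with_threshold(vector_store, query, k=2, max_distance=1.0):
    """Return (document, score) pairs whose score does not exceed max_distance.

    Assumes vector_store exposes similarity_search_with_score(query, k=...)
    and that a lower score means a closer match.
    """
    scored = vector_store.similarity_search_with_score(query, k=k)
    return [(doc, score) for doc, score in scored if score <= max_distance]

# Usage with the objects from this notebook:
# relevant = search_with_threshold(vector_store, query, k=2, max_distance=1.0)
```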



#### Retrieve the most similar document

Several documents can generally be passed to an LLM, but small models such as the Llama-3.2-3B-Instruct model used here benefit from a more focused context, which reduces the risk of hallucinations. Therefore, only the most relevant document is added to the prompt in this Retrieval-Augmented Generation (RAG) process.
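
With a larger model, several retrieved documents could be joined into one context instead. A minimal sketch of that alternative follows, assuming the documents are already ranked by relevance. The helper name and the character budget are hypothetical.

```python
def build_context(documents, max_chars=500, separator="\n\n"):
    """Join page_content of ranked documents until max_chars would be exceeded.

    The first document is always included so the context is never empty
    when at least one document was retrieved.
    """
    parts, used = [], 0
    for doc in documents:
        text = doc.page_content
        extra = len(text) + (len(separator) if parts else 0)
        if parts and used + extra > max_chars:
            break
        parts.append(text)
        used += extra
    return separator.join(parts)

# Usage with the search results from this notebook:
# context = build_context(results, max_chars=500)
```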

In [10]:
context = results[0].page_content

context

'Drakelions are carnivores, feeding primarily on large mammals such as mountain deer, wild boars, and sometimes even smaller dragons.'

#### Build HuggingFace pipeline to be used in langchain

In [11]:
base_folder = "FILL_WITH_BASE_FOLDER" # Example: "C:/Users/username/Documents/HuggingFace"

model_name = "Llama-3.2-3B-Instruct"

# build the local path to the downloaded model
model_id = os.path.join(base_folder, model_name)



tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id
)

pipe = transformers.pipeline(
    "text-generation", 
    model=model, 
    tokenizer=tokenizer, 
    max_new_tokens=128, 
    top_k=50, 
    temperature=0.1
)


hf_pipeline = HuggingFacePipeline(
    pipeline=pipe
)


#### Define prompt to be used in langchain

In [12]:
template = """
System: You are an expert in fantasy animals. To answer consider the 'Extra Information' provided. If you don't know the answer respond with "I don't know the answer." without giving any explanation. 

Extra information: {context}\n\n

Query: {query}
Response:
"""

prompt_template = PromptTemplate.from_template(template)

# chaining prompt_template and pipeline
chain = prompt_template | hf_pipeline.bind(skip_prompt=True)
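
The `|` operator composes the two steps so that the formatted prompt becomes the input of the model. A rough plain-Python equivalent of that idea is sketched below. All names are hypothetical, the template is shortened, and `fake_llm` is a stand-in that does not generate text.

```python
TEMPLATE = "Extra information: {context}\n\nQuery: {query}\nResponse:"

def format_prompt(inputs):
    """Fill the template placeholders from a dict, like a prompt template does."""
    return TEMPLATE.format(**inputs)

def fake_llm(prompt):
    """Stand-in for the model: reports the prompt length instead of generating."""
    return f"<model output for a {len(prompt)}-character prompt>"

def pipe(*steps):
    """Compose steps left to right: the output of one is the input of the next."""
    def run(value):
        for step in steps:
            value = step(value)
        return value
    return run

toy_chain = pipe(format_prompt, fake_llm)
print(toy_chain({"query": "What do Drakelions eat?", "context": "Drakelions are carnivores."}))
```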

#### Execute the call to the RAG

In [13]:
# invoke the chain with the query and the retrieved context
input_dict = {"query": query, "context": context}

result = chain.invoke(input_dict)

print(f"\nThe query is: '{query}'")
print(f"The context is: '{context}'")
print(f"\nThe Response is: \n'{result}'")

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)



The query is: 'Which is the food of the Drakelion?'
The context is: 'Drakelions are carnivores, feeding primarily on large mammals such as mountain deer, wild boars, and sometimes even smaller dragons.'

The Response is: 
'Drakelions feed primarily on large mammals such as mountain deer, wild boars, and sometimes even smaller dragons.'
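
The retrieval and generation steps shown in this notebook can be wrapped in a single helper. The sketch below assumes a `vector_store` with `similarity_search` and a `chain` with `invoke`, as used above. The function name and the fallback for an empty result are additions made for this example.

```python
def rag_answer(query, vector_store, chain, k=1):
    """Retrieve the top-k documents and pass the best one to the chain."""
    results = vector_store.similarity_search(query, k=k)
    if not results:
        # Assumption: reuse the fallback sentence requested in the prompt.
        return "I don't know the answer."
    context = results[0].page_content
    return chain.invoke({"query": query, "context": context})

# Usage with the objects from this notebook:
# print(rag_answer("Which is the food of the Drakelion?", vector_store, chain))
```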


### Conclusions

In this example, we demonstrated how to create a simple Retrieval-Augmented Generation (RAG) system using an in-memory vector store. By integrating a HuggingFace model with LangChain, we were able to efficiently retrieve relevant context and generate accurate responses based on that information.