# Retreival Augmented Generation - RAG
RAG is a method used to increase accuracy, mitigate problems related to privacy and knowledge cuttofs, and reduce potential hallucinations when using LLMs. The following code presents both a simple RAG and a hybrid RAG for cat facts. 
Credits go to: https://huggingface.co/blog/ngxson/make-your-own-rag

## Simple RAG
The following simple RAG uses the dataset and simply searches for the most similar matches. We begin by loading the dataset and some examples.

In [78]:
dataset = []
with open('cat-facts.txt', 'r') as file:
    dataset = file.readlines()
    print(f'Loaded {len(dataset)} entries')
print(dataset)

Loaded 150 entries
['On average, cats spend 2/3 of every day sleeping. That means a nine-year-old cat has been awake for only three years of its life.\n', 'Unlike dogs, cats do not have a sweet tooth. Scientists believe this is due to a mutation in a key taste receptor.\n', 'When a cat chases its prey, it keeps its head level. Dogs and humans bob their heads up and down.\n', 'The technical term for a cat’s hairball is a “bezoar.”\n', 'A group of cats is called a “clowder.”\n', 'Female cats tend to be right pawed, while male cats are more often left pawed. Interestingly, while 90% of humans are right handed, the remaining 10% of lefties also tend to be male.\n', 'A cat can’t climb head first down a tree because every claw on a cat’s paw points the same way. To get down from a tree, a cat must back down.\n', 'Cats make about 100 different sounds. Dogs make only about 10.\n', 'A cat’s brain is biologically more similar to a human brain than it is to a dog’s. Both humans and cats have iden

Next, we embed the datasets as vectors and create a database. We use simple chunking by dividing each fact into a chunk.

In [79]:
import ollama
EMBEDDING_MODEL = 'hf.co/CompendiumLabs/bge-base-en-v1.5-gguf'
LANGUAGE_MODEL = 'hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF'
vector_db = []
def add_chunk_to_database(chunk):
    embedding = ollama.embed(model = EMBEDDING_MODEL, input = chunk)['embeddings'][0]
    vector_db.append((chunk, embedding))

In [80]:
for i, chunk in enumerate(dataset):
    add_chunk_to_database(chunk)
    print(f'Added chunk {i+1}/{len(dataset)} to the database')

Added chunk 1/150 to the database
Added chunk 2/150 to the database
Added chunk 3/150 to the database
Added chunk 4/150 to the database
Added chunk 5/150 to the database
Added chunk 6/150 to the database
Added chunk 7/150 to the database
Added chunk 8/150 to the database
Added chunk 9/150 to the database
Added chunk 10/150 to the database
Added chunk 11/150 to the database
Added chunk 12/150 to the database
Added chunk 13/150 to the database
Added chunk 14/150 to the database
Added chunk 15/150 to the database
Added chunk 16/150 to the database
Added chunk 17/150 to the database
Added chunk 18/150 to the database
Added chunk 19/150 to the database
Added chunk 20/150 to the database
Added chunk 21/150 to the database
Added chunk 22/150 to the database
Added chunk 23/150 to the database
Added chunk 24/150 to the database
Added chunk 25/150 to the database
Added chunk 26/150 to the database
Added chunk 27/150 to the database
Added chunk 28/150 to the database
Added chunk 29/150 to the dat

The next step is the retreival step. We begin by defining a function to work ot the cosine similarity of two vectors.

In [81]:
def cosine_similarity(a, b):
    dot_product = sum([x * y for x, y in zip(a, b)])
    norm_a = sum([ x ** 2 for x in a]) ** 0.5
    norm_b = sum([x ** 2 for x in b]) ** 0.5
    return dot_product/(norm_a * norm_b)

The query is then embededed using the same embedding model. Each chunk is then compared in simialrity to the embedding and the top n chunks are returned. 

In [82]:
def retrieve(query, top_n = 3):
    query_embedding = ollama.embed(model = EMBEDDING_MODEL, input = query)['embeddings'][0]
    similarities = []
    for chunk, embedding in vector_db:
        similarity = cosine_similarity(query_embedding, embedding)
        similarities.append((chunk, similarity))
    similarities.sort(key = lambda x: x[1], reverse = True)
    return similarities[:top_n]

The next step is augmentation, which begins by taking the input query. We retreive the knowledge then create a new prompt with our most similar knowledge as context for the LLM. 

In [83]:
input_query = input("Ask me a Question: ")
retreived_knowledge = retrieve(input_query)

print('Retrieved knowledge: ')
for chunk, similarity in retreived_knowledge:
    print(f' - (siumilarity: {similarity:.2f}) {chunk}')
instruction_prompt = f''' You are a helpful chatbot. 
Use only the following pieces of context to answer the question. Don't make up any new information:
{'\n'.join([f' - {chunk}' for chunk, _ in retreived_knowledge])}
'''

Retrieved knowledge: 
 - (siumilarity: 0.80) Cats are North America’s most popular pets: there are 73 million cats compared to 63 million dogs. Over 30% of households in North America own a cat.

 - (siumilarity: 0.74) Approximately 40,000 people are bitten by cats in the U.S. annually.

 - (siumilarity: 0.74) There are up to 60 million feral cats in the United States alone.



Finally, the augmented system prompt is passed to the LLM and the response from the LLM is returned.

In [84]:
stream = ollama.chat(
  model=LANGUAGE_MODEL,
  messages=[
    {'role': 'system', 'content': instruction_prompt},
    {'role': 'user', 'content': input_query},
  ],
  stream=True,
)

print('Chatbot response: ')
for chunk in stream:
    print(chunk['message']['content'], end = '', flush = True)

Chatbot response: 
Based on the information provided, it is estimated that there are approximately 60 million feral cats in the United States.

To show a comparison between RAG and non-RAG responses, the following 5 questions are fed to the model using both RAG and just the model's knowledge base. The RAG answers are more accurate and answer the questions more concisely, while the non-RAG answers incldue more hallucinations. 

In [40]:
print("RAG vs non-RAG")
questions = ["How fast can cats travel?", "How much do cats sleep?", "How can cats get tapeworms?", "How many breeds of cats are there?", "How do cats smell?"]
non_RAG_prompt = "You are a helpful chatbot. Use your knowledge base to answer the question. Please answer the question as simply as you can and do not stray or add additional information that isn't in the question."
for q in questions:
    print("\n")
    print(f"Question: {q}")
    print("RAG:")

    retreived_knowledge = retrieve(q)

    instruction_prompt = f''' You are a helpful chatbot. 
    Use only the following pieces of context to answer the question. Don't make up any new information. Please answer the question as simply as you can and do not stray or add additional information that isn't in the question.:
    {'\n'.join([f' - {chunk}' for chunk, _ in retreived_knowledge])}
    '''

    stream = ollama.chat(
    model=LANGUAGE_MODEL,
    messages=[
        {'role': 'system', 'content': instruction_prompt},
        {'role': 'user', 'content': q},
    ],
    stream=True,
    )
    for chunk in stream:
        print(chunk['message']['content'], end = '', flush = True)
    print("")
    print("Non-RAG: ")
    stream = ollama.chat(
    model=LANGUAGE_MODEL,
    messages=[
        {'role': 'system', 'content': non_RAG_prompt},
        {'role': 'user', 'content': q},
    ],
    stream=True,
    )
    for chunk in stream:
        print(chunk['message']['content'], end = '', flush = True)

    

RAG vs non-RAG


Question: How fast can cats travel?
RAG:
A cat can travel at approximately 31 mph (49 km) over a short distance.
Non-RAG: 
Mice, which are small rodents, can travel 9 miles per hour.

Question: How much do cats sleep?
RAG:
Cats typically sleep 16 to 18 hours per day.
Non-RAG: 
Cats typically sleep for 16-18 hours per day, with a short period of wakefulness to eat, drink, and use the bathroom.

Question: How can cats get tapeworms?
RAG:
Cats can get tapeworms from eating fleas or mice that have ingested infected tapeworm eggs.
Non-RAG: 
Cats can get tapeworms from fleas, which are often found in the same environments as cats. The parasite is usually transmitted to the cat through the flea's feces or its saliva when it bites the cat while feeding on blood.

Question: How many breeds of cats are there?
RAG:
According to the information given, there are:

* Over 100 distinct breeds of domestic cats.
* More than 500 million domestic cats in the world, with at least 40 recog

## Hybrid RAG
The next step is to implement a hybrid RAG, which uses both a dense index and a sparse index to allow the model to return results even when a different phrasing/synonyms are used. 

Firstly, I created the dense index using FAISS to embed the dataset

In [23]:
#create dense index
import faiss
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
vectors = model.encode(dataset)
index_d = faiss.IndexFlatL2(len(vectors[0]))
index_d.add(vectors)

I then used BM25 via HuggingFace to create the sparse index. 

In [24]:
from huggingface_hub import login
login(token="YOUR_HUGGING_FACE_TOKEN")

In [25]:
#create sparse index
import bm25s
from bm25s.hf import BM25HF

retreiver = BM25HF(corpus=dataset)
retreiver.index(bm25s.tokenize(dataset))

user = "mariam-elantable"
retreiver.save_to_hub(f"{user}/bm25s-cats")

Finding newlines for mmindex: 100%|██████████| 26.3k/26.3k [00:00<00:00, 16.8MB/s]
Processing Files (0 / 0)                : |          |  0.00B /  0.00B            
[A

[A[A


[A[A[A
[A

[A[A


Processing Files (3 / 3)                : 100%|██████████| 25.2kB / 25.2kB,   ???B/s  
[A

[A[A


[A[A[A
[A

[A[A


[A[A[A
[A

[A[A


[A[A[A
[A

[A[A


[A[A[A
[A

[A[A


Processing Files (3 / 3)                : 100%|██████████| 25.2kB / 25.2kB,  0.00B/s  
New Data Upload                         : |          |  0.00B /  0.00B,  0.00B/s  
  ...gn/T/tmpyux0r645/data.csc.index.npy: 100%|██████████| 9.97kB / 9.97kB            
  ...T/tmpyux0r645/indices.csc.index.npy: 100%|██████████| 9.97kB / 9.97kB            
  .../T/tmpyux0r645/indptr.csc.index.npy: 100%|██████████| 5.29kB / 5.29kB            
No files have been modified since last commit. Skipping to prevent empty commit.


RepoUrl('https://huggingface.co/mariam-elantable/bm25s-cats', endpoint='https://huggingface.co', repo_type='model', repo_id='mariam-elantable/bm25s-cats')

The next step is to implement the retreival process, which begins by first finding the dense similarities and the sparse similarities, taking the ranked lists and combining them using reciprocal rank fusion. The top n matches are then returned. 

In [62]:
def retrieve_hybrid(query, top_n = 3, k = 50, K = 60):
    dense_embedding = model.encode([query])
    dist_d, indices_d = index_d.search(dense_embedding, k)
    retreiver = BM25HF.load_from_hub(f"{user}/bm25s-cats", load_corpus = True)
    docs_s, scores_s = retreiver.retrieve(bm25s.tokenize(query), k = k)
    docs_d = [(dataset[indices_d[0][i]], i) for i in range(k)]
    docs_s = [(docs_s[0][i], i) for i in range(k)]
    sorted_docs = []
    for doc, rank in docs_d:
        score = 1.0/(K + rank)
        s = 0
        i = 0
        while ((i < k)):
            if (docs_s[i][0] != doc):
                i+= 1
            else:
                break

        if (i < len(docs_s)): score += 1.0/(K + i)
        sorted_docs.append((doc, score))
    sorted_docs.sort(key = lambda x: x[1], reverse = True)
    return sorted_docs[:top_n]   

Finally, the augmentation and generation steps are run, which are the same as those in the simple RAG model, to generate a response based on the given information.

In [85]:
input_query = input("Ask me a Question: ")
retreived_knowledge = retrieve_hybrid(input_query)

print('Retrieved knowledge: ')
for chunk, similarity in retreived_knowledge:
    print(f' - (siumilarity: {similarity:.2f}) {chunk}')
instruction_prompt = f''' You are a helpful chatbot. 
Use only the following pieces of context to answer the question. Don't make up any new information:
{'\n'.join([f' - {chunk}' for chunk, _ in retreived_knowledge])}
'''

Fetching 9 files: 100%|██████████| 9/9 [00:00<00:00, 52067.22it/s]
                                                     

Retrieved knowledge: 
 - (siumilarity: 0.02) Cats sleep 16 to 18 hours per day. When cats are asleep, they are still alert to incoming stimuli. If you poke the tail of a sleeping cat, it will respond accordingly.

 - (siumilarity: 0.02) On average, cats spend 2/3 of every day sleeping. That means a nine-year-old cat has been awake for only three years of its life.

 - (siumilarity: 0.02) Cats spend nearly 1/3 of their waking hours cleaning themselves.





In [86]:
LANGUAGE_MODEL = 'hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF'
stream = ollama.chat(
  model=LANGUAGE_MODEL,
  messages=[
    {'role': 'system', 'content': instruction_prompt},
    {'role': 'user', 'content': input_query},
  ],
  stream=True,
)

print('Chatbot response: ')
for chunk in stream:
    print(chunk['message']['content'], end = '', flush = True)

Chatbot response: 
Cats are notorious for being sleepy, and they actually spend around 16 to 18 hours per day snoozing! That's why it can be a challenge to wake them up if you need something. Even when they're not sleeping, they're still highly alert and may respond quickly to their surroundings.