<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CC_BY.png"><br />

Created by [Nathan Kelber](http://nkelber.com) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />
For questions/comments/improvements, email nathan.kelber@ithaka.org.<br />
___

# Language Models 3: 🤗 Hugging Face with RAG

**Description:** 

Learners will use the 🤗 Hugging Face Inference Client with LlamaIndex to create a basic Retrieval Augmented Generation (RAG) system.

**Use Case:** For Learners (Detailed explanation, not ideal for researchers)

**Difficulty:** Intermediate

**Completion Time:** 75 minutes

**Knowledge Required:** 
* Python Basics
* Pandas Basics

**Knowledge Recommended:** 
* Python Intermediate
* Pandas Intermediate

**Data Format:** None

**Libraries Used:** 
* [🤗 Transformers](https://huggingface.co/docs/transformers/index) - provides APIs and tools to easily download and train pretrained models
* [PyTorch](https://pytorch.org/) - a popular machine learning framework
* [LlamaIndex](https://docs.llamaindex.ai/en/stable/) - indexes our documents for retrieval

**Research Pipeline:** None
___

# Introduction to Retrieval Augmented Generation

Large Language Models (LLMs) are trained on an enormous variety of content, including books, Wikipedia, and social media. They can often answer basic questions across a wide range of contexts. Researchers, on the other hand, tend to specialize in their research area, going deep rather than wide. Researchers also tend to be interested in the latest articles and research in their field, while a language model's “knowledge” is frozen in time once training ends. (There are some ways to update the knowledge in a language model, but they can be impractical.) Finally, researchers are concerned with citation and reference. In brief, *LLMs often lack knowledge that is specialized, current, and citable*: the type of knowledge researchers want most.

Retrieval Augmented Generation (RAG), formalized in “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” ([Lewis et al. 2020](https://arxiv.org/abs/2005.11401)), has emerged as a solution for these problems. While other methods have focused on adapting an existing model, RAG introduces a new step into the process: retrieval.

In the retrieval step, the user’s query is matched with a vector database of reference documents (called a “knowledge base”) in order to find document chunks that are likely to contain the answer. Once the document chunks have been retrieved, they can be submitted as context with the user’s query to the LLM. 

![The steps of RAG described below in visual form.](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/rag-process.png)

While RAG systems can be quite sophisticated, the basic steps remain the same:

1. User submits a query
2. Relevant document chunks are returned from the vector database
3. A prompt containing the chunks is submitted to the LLM with the user’s query
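
To make the three steps concrete, here is a minimal sketch in plain Python. It is only an illustration: the word-overlap score stands in for a real embedding model and vector database, and the function names (`retrieve`, `build_prompt`) and the tiny knowledge base are hypothetical.

```python
# Toy illustration of the three RAG steps. The scoring function is a
# stand-in (word overlap), not a real embedding model or vector database.

def tokenize(text):
    """Lowercase the text and split it into a set of words."""
    return set(text.lower().replace("?", " ").replace(".", " ").split())

def overlap_score(query, chunk):
    """Jaccard similarity: shared words divided by total distinct words."""
    q, c = tokenize(query), tokenize(chunk)
    return len(q & c) / len(q | c) if q | c else 0.0

def retrieve(query, chunks, top_k=2):
    """Step 2: return the top_k chunks ranked by similarity to the query."""
    ranked = sorted(chunks, key=lambda chunk: overlap_score(query, chunk), reverse=True)
    return ranked[:top_k]

def build_prompt(query, retrieved_chunks):
    """Step 3: combine the retrieved chunks and the query into one prompt."""
    context = "\n\n".join(retrieved_chunks)
    return f"Context:\n{context}\n\nQuestion: {query}"

# Hypothetical knowledge base of three short chunks
knowledge_base = [
    "Jupyter AI supports several model providers.",
    "Mistral Large was released in July 2024.",
    "Chunk overlap repeats text between neighboring chunks.",
]

query = "Which providers does Jupyter AI support?"  # Step 1: the user's query
top_chunks = retrieve(query, knowledge_base, top_k=1)
prompt = build_prompt(query, top_chunks)
print(prompt)
```

The resulting `prompt` is what would be sent to the LLM. A real system replaces `overlap_score` with embedding similarity, which can match text by meaning rather than by exact words.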


## What about transfer learning, fine-tuning, parameter-efficient fine-tuning, etc.?

RAG can be combined with fine-tuning and other techniques, and an ideal solution may use several together. At the current time, some research suggests RAG has a larger effect than fine-tuning: RAG may improve LLM benchmark scores by a greater degree than other techniques, but the highest scores usually come from a combination of techniques.

![Table showing RAG has a greater effect than fine-tuning](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/ragvsfinetune.png)

From "Fine Tuning vs. Retrieval Augmented Generation for Less Popular Knowledge" ([Soudani et al. 2024](https://arxiv.org/abs/2403.01432))


# Building a basic RAG system

By combining what we know about working with Hugging Face with a vector database, we can create a basic RAG system. First, we will need to create a knowledge base, which involves the following steps:

1. Curate a body of relevant documents
2. Extract the texts and chunk them
3. Embed the chunks
4. Create vector database from embeddings

## Installations

In [None]:
# Install transformers and llama-index libraries
!pip install transformers
!pip install llama-index
!pip install llama-index-embeddings-huggingface

## Import Libraries

In [None]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.postprocessor import SimilarityPostprocessor
from transformers import pipeline
from huggingface_hub import login
from huggingface_hub import InferenceClient
import urllib.request
from pathlib import Path

## Gather documents for knowledge base

Let's create a knowledge base that relies on recent, specialized knowledge. Our LLM for this system will be [Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct), released July 23, 2024. We can find the "freshness" of the model from the model card:

>Data Freshness: The pretraining data has a cutoff of December 2023.

Let's include some information that the model would not have access to:

1. **Jupyter AI documentation**- The Jupyter AI project launched in August of 2023. It is possible Meta Llama 3.1 was trained on a very early version of the documentation. When we ask the model about it, however, it hallucinates, giving information that is wrong. So, either the information was not in the training data or it was too specific to be retained.
2. **Llama 3.1 Model Card**- Most likely, the model was not trained on its own model card. The model card probably did not exist yet.
3. **Mistral Large Instruct 2407 Model Card**- This model was released the day after Llama 3.1 (July 24). There is no way Llama 3.1 would know about this model.

In [None]:
# Download the documents and put them in a directory called "documents"
dir_path = Path.cwd() / "documents"
dir_path.mkdir(exist_ok=True)

files = {
    "jupyter-ai-documentation.txt": 'https://jupyter-ai.readthedocs.io/en/latest/_sources/users/index.md.txt',
    "llama-3.1b-405.txt": 'https://raw.githubusercontent.com/meta-llama/llama-models/main/models/llama3_1/MODEL_CARD.md',
    "mistral-large-instruct-2407.txt": 'https://huggingface.co/mistralai/Mistral-Large-Instruct-2407/resolve/main/README.md'
}

# Save each file into the "documents" directory created above
for file_name, url in files.items():
    urllib.request.urlretrieve(url, dir_path / file_name)

## Simple Directory Reader

The simple directory reader will gather up all the files in a directory and turn them into a list of document objects. It can parse many kinds of files, including PDFs, text files, and markdown files. It selects a reader based on each file's type, so different kinds of files are processed differently. For example, a text file is treated as a single document whereas a markdown file is broken down by headings.

### Using other files for the knowledge base

You don't need to use our example documents. Our code creates the knowledge base from the files in a directory called `documents`. You can create this directory and put any kind of files you would like in there for your own knowledge base. We recommend using text or markdown files for this example, but you can consult [the documentation](https://docs.llamaindex.ai/en/stable/examples/data_connectors/simple_directory_reader/) if you're curious about how `SimpleDirectoryReader()` interacts with other kinds of files.


In [None]:
# Collect documents into a list
docs = SimpleDirectoryReader("documents").load_data()

All of our files are saved as text files (.txt), so each one becomes a single document object. They are also valid markdown files (.md), however, so we could have saved them with the `.md` extension. By default, `SimpleDirectoryReader()` will split markdown files into smaller documents based on their structure. We will do some basic chunking ourselves, but this kind of intelligent chunking may give better results. For our example, we get 3 documents. How many documents would the markdown versions generate?

In [None]:
print(len(docs))
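
To get a feel for why markdown files would yield more documents, here is a rough sketch of heading-based splitting. This is not LlamaIndex's actual markdown reader, only a simplified stand-in: the function `split_by_headings` and the sample text are hypothetical, and the sketch would wrongly split on `#` comment lines inside code blocks.

```python
import re

def split_by_headings(markdown_text):
    """Split markdown text into sections, starting a new one at each heading line."""
    sections = []
    current = []
    for line in markdown_text.splitlines():
        # A heading is one to six '#' characters followed by a space
        if re.match(r"#{1,6} ", line) and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    # Drop sections that are empty after stripping whitespace
    return [section for section in sections if section]

sample = """# Model Card
Intro text.

## Training Data
Details about data.

## Evaluation
Benchmark notes.
"""

sections = split_by_headings(sample)
print(len(sections))
```

One file with three headings becomes three sections here, so a long model card saved as `.md` could plausibly produce many more documents than its `.txt` version.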

## Embedding Settings

We will use [LlamaIndex](https://docs.llamaindex.ai/en/stable/) to create our vector database. We will select an embedding model from Hugging Face. We are free to choose any embedding model, since the embedding model *does not* have to match our LLM. We have chosen a popular embedding model from Hugging Face, but feel free to update or change it.

In [None]:
# Choose the Embedding Model from Hugging Face
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
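
An embedding model turns each chunk into a list of numbers (a vector), and retrieval works by comparing the query's vector with each chunk's vector. Cosine similarity is a common way to make that comparison. The sketch below uses made-up three-dimensional vectors purely for illustration; the helper `cosine_similarity` is our own, not a LlamaIndex function.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional "embeddings"; real models use hundreds of dimensions
query_vec = [1.0, 0.0, 1.0]
similar_chunk = [2.0, 0.0, 2.0]    # points in the same direction as the query
unrelated_chunk = [0.0, 3.0, 0.0]  # perpendicular to the query

print(cosine_similarity(query_vec, similar_chunk))
print(cosine_similarity(query_vec, unrelated_chunk))
```

Vectors pointing in the same direction score close to 1 regardless of their length, while perpendicular vectors score 0.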

We will also adjust a few additional settings:
1. Not specifying an LLM (we will call the model separately later)
2. Choosing a chunk size
3. Choosing a chunk overlap

## Chunking documents
The chunk size is important for the performance of the vector database and the LLM. There are many ways to chunk, including fixed sizes, random chunk sizes, sliding windows, and context-aware chunking. The right size and method will take into consideration the documents, the LLM's context window, and other factors.
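
As a rough illustration of fixed-size chunking with overlap, the sketch below splits a list of words. It is a simplification under stated assumptions: LlamaIndex's default splitter counts tokens and tries to respect sentence boundaries, whereas this hypothetical `chunk_words` function simply counts words.

```python
def chunk_words(words, chunk_size, chunk_overlap):
    """Split a list of words into fixed-size chunks that share overlapping words."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

words = "one two three four five six seven eight nine ten".split()
for chunk in chunk_words(words, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

Each chunk repeats the last word of the previous one. The overlap reduces the chance that a sentence carrying the answer is cut in half at a chunk boundary.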

In [None]:
# Disable the built-in LLM and set the chunk size and overlap
Settings.llm = None
Settings.chunk_size = 256
Settings.chunk_overlap = 25

In [None]:
# Create a vector database from docs object
index = VectorStoreIndex.from_documents(docs)

## Search function
Now we can set up our retrieval system. The most significant thing we can adjust here is how many chunks to retrieve, set with the variable `top_k`. We can also change the similarity cutoff using `similarity_cutoff`, which sets how similar a chunk must be to the query in order to be included. Both of these are worth experimenting with. Keep in mind that there is a limit on the context that can be supplied to the model. More is not always better.


In [None]:
# Number of chunks to retrieve
top_k = 3

# Retriever configuration
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=top_k
)

In [None]:
# Query Engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.5)],
)
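
To see how `top_k` and the similarity cutoff interact, here is a small sketch with made-up scores. The function `select_chunks` is hypothetical and only mimics the general idea of keeping the best matches and then filtering by score; it is not LlamaIndex's implementation.

```python
def select_chunks(scored_chunks, top_k, similarity_cutoff):
    """Keep the top_k highest-scoring chunks, then drop any below the cutoff."""
    ranked = sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)
    top = ranked[:top_k]
    return [(text, score) for text, score in top if score >= similarity_cutoff]

# Made-up (chunk, similarity score) pairs for illustration
scored = [
    ("chunk A", 0.82),
    ("chunk B", 0.41),
    ("chunk C", 0.67),
    ("chunk D", 0.55),
]

print(select_chunks(scored, top_k=3, similarity_cutoff=0.5))
print(select_chunks(scored, top_k=3, similarity_cutoff=0.7))
```

With a cutoff of 0.5, all three of the top chunks survive. Raising the cutoff to 0.7 leaves only chunk A, so the result can contain fewer than `top_k` chunks.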

## Query
Here we craft our query and receive a response from the vector database.

In [None]:
# Query
query = 'What providers does Jupyter AI support?'
response = query_engine.query(query)

In [None]:
# Print the responses
print(response)

## Create LLM prompt (without RAG context)

First, let's create a prompt to pass to the LLM. We'll automatically insert the query.

In [None]:
# Create some instructions for the model

ragless_prompt = f"""
[INST] ResearchBuddy, a virtual consultant for research tasks, communicates in clear, accessible language, helping answer technical questions on documentation.

Please respond to the following comment.
{query}

[/INST]
"""

## Add RAG context to our LLM Prompt
Now let's create a context string from the response we received above.

In [None]:
# Create a context string from response
context = "Context:\n"
# Loop over the returned nodes; the similarity cutoff may leave fewer than top_k
for node in response.source_nodes:
    context = context + node.text + "\n\n"

print(context)

In [None]:
# Create a RAG prompt with the context
ragful_prompt = ragless_prompt + context

Now we have two versions of LLM prompt:

* `ragless_prompt`- Has basic instructions with our query
* `ragful_prompt`- Has basic instructions, our query, and the context from our vector database

We are ready to pass these prompts to the LLM.

## Pass the prompts to the LLM
We can pass these prompts to the LLM of our choice. In this case, we are using Llama-3.1-8B-Instruct, but we could easily choose another model by passing a different model name to `InferenceClient()`.

In [None]:
# Log in using an access token
login()

In [None]:
# Choose the model
client = InferenceClient("meta-llama/Meta-Llama-3.1-8B-Instruct")

In [None]:
# Ask the model without context

for message in client.chat_completion(
    messages=[{"role": "user", "content": ragless_prompt}],
    max_tokens=500,
    stream=True,
):
    # Some streamed chunks carry no text, so fall back to an empty string
    print(message.choices[0].delta.content or "", end="")

In [None]:
# Ask the model with RAG context

for message in client.chat_completion(
    messages=[{"role": "user", "content": ragful_prompt}],
    max_tokens=500,
    stream=True,
):
    # Some streamed chunks carry no text, so fall back to an empty string
    print(message.choices[0].delta.content or "", end="")
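
Chat-style APIs accept a list of messages with roles, so one possible variation is to put the instructions in a `system` message and the context plus question in the `user` message. The sketch below only builds that list. The helper `build_messages`, the instruction wording, and the placeholder chunks are hypothetical, and whether a model honors a system message depends on its chat template.

```python
def build_messages(instructions, query, context_chunks=None):
    """Return a chat-style message list; context is included only if provided."""
    user_content = query
    if context_chunks:
        context = "\n\n".join(context_chunks)
        user_content = f"Context:\n{context}\n\nQuestion: {query}"
    return [
        {"role": "system", "content": instructions},
        {"role": "user", "content": user_content},
    ]

instructions = (
    "You are a research assistant who answers questions about "
    "documentation in clear, accessible language."
)
messages = build_messages(
    instructions,
    "What providers does Jupyter AI support?",
    context_chunks=["(retrieved chunk 1)", "(retrieved chunk 2)"],
)
for message in messages:
    print(message["role"], "->", message["content"][:40])
```

A list like this could be passed as the `messages` argument of `client.chat_completion()`. Placing the context before the question keeps the question as the last thing the model reads.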