
<div align="center">
<h1><a href="https://github.com/peremartra/Large-Language-Model-Notebooks-Course">Learn by Doing LLM Projects</a></h1>
    <h3>Understand And Apply Large Language Models</h3>
    <h2>IMPLEMENTING SEMANTIC CACHE TO IMPROVE A RAG SYSTEM</h2>
    by <b>Pere Martra</b>
</div>

<br>

<div align="center">
    &nbsp;
    <a target="_blank" href="https://www.linkedin.com/in/pere-martra/"><img src="https://img.shields.io/badge/style--5eba00.svg?label=LinkedIn&logo=linkedin&style=social"></a>
    
</div>

<br>
<hr>


### This notebook is part of a comprehensive course on Large Language Models available on GitHub: https://github.com/peremartra/Large-Language-Model-Notebooks-Course. If you want to stay informed about new lessons or updates, simply follow or star the repository.

In this notebook, we will explore a typical RAG system where we will utilize an open-source model and the vector database Chroma DB. However, we will integrate a semantic cache system that will store various user queries and decide whether to generate the prompt enriched with information from the vector database or the cache.

The semantic comparison will be performed using the Euclidean distance of question embeddings. This is because, semantically, "What is the capital of France?" is essentially the same as "Tell me the name of the capital of France?"
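To make the idea of "close in embedding space" concrete, here is a minimal sketch using tiny hand-made vectors. The three-dimensional vectors are hypothetical stand-ins for real embeddings, and the helper names are illustrative only. It also shows that, for unit-length vectors, the squared Euclidean distance equals `2 - 2 * cosine similarity`, so thresholding one is equivalent to thresholding the other.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity of two unit-length vectors (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical 3-dimensional "embeddings"; real ones have hundreds of dimensions.
q_capital = l2_normalize([0.9, 0.1, 0.2])         # "What is the capital of France?"
q_capital_alt = l2_normalize([0.85, 0.15, 0.25])  # "Tell me the name of the capital of France?"
q_other = l2_normalize([0.1, 0.9, 0.3])           # an unrelated question

near = squared_l2(q_capital, q_capital_alt)  # small: same meaning
far = squared_l2(q_capital, q_other)         # large: different meaning
print(f"near={near:.4f} far={far:.4f}")
```

A distance threshold placed between `near` and `far` would treat the two France questions as the same query and the third as a new one.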

Therefore, even though the model's response may vary due to the request for a short answer in the second question, the information to retrieve from the vector database will be the same. This places the cache system between the user and the database, not between the user and the Large Language Model. 

<img src="https://github.com/peremartra/Large-Language-Model-Notebooks-Course/blob/main/img/semantic_cache.jpg?raw=true">

### Feel free to fork or edit the notebook for your own convenience. Please consider ***UPVOTING IT***. It helps others discover the notebook, and it encourages me to continue publishing.

# Import and load the libraries. 
To start, we need to install the necessary Python packages. 
* **[sentence transformers](https://www.sbert.net/)**. This library transforms sentences into fixed-length vectors, also known as embeddings. 
* **[xformers](https://github.com/facebookresearch/xformers)**. A package that provides libraries and utilities that make it easier to work with transformer models. We install it to avoid an error when working with the model and the embeddings.  
* **[chromadb](https://www.trychroma.com/)**. This is our vector database. ChromaDB is easy to use and open source, and it is one of the most widely used vector databases for storing embeddings. 

In [1]:
!pip install -q transformers==4.38.1
!pip install -q accelerate

In [None]:
!pip install -q sentence-transformers==2.2.2
!pip install -q xformers==0.0.23
!pip install -q chromadb==0.4.20

I'm sure that you know the next two packages, NumPy and pandas, which are among the most used Python libraries.

NumPy is a powerful library for numerical computing. 

pandas is a library for data manipulation.

In [3]:
import numpy as np 
import pandas as pd

# Load the Dataset
As you can see, the notebook is ready to work with three different datasets. Just uncomment the lines of the dataset you want to use. 

I selected datasets with news. Two of them have just a brief description of each news item, but the other contains the full text. 

As we are working in a free and limited environment, with just 30 GB of memory available, I limited the number of news items with the variable MAX_NEWS. 

The name of the field containing the text of each news item is stored in the variable *DOCUMENT*, and the name of the field used as metadata is stored in *TOPIC*.

In [4]:
#news = pd.read_csv('/kaggle/input/topic-labeled-news-dataset/labelled_newscatcher_dataset.csv', sep=';')
#MAX_NEWS = 1000
#DOCUMENT="title"
#TOPIC="topic"

#news = pd.read_csv('/kaggle/input/bbc-news/bbc_news.csv')
#MAX_NEWS = 1000
#DOCUMENT="description"
#TOPIC="title"

news = pd.read_csv('/kaggle/input/mit-ai-news-published-till-2023/articles.csv')
MAX_NEWS = 1000
DOCUMENT="Article Body"
TOPIC="Article Header"


ChromaDB requires each record to have a unique identifier. We can create one with this statement, which adds a new column called **id**.


In [5]:
news["id"] = news.index
news.head()

Unnamed: 0.1,Unnamed: 0,Published Date,Author,Source,Article Header,Sub_Headings,Article Body,Url,id
0,0,"July 7, 2023",Adam Zewe,MIT News Office,Learning the language of molecules to predict ...,This AI system only needs a small amount of da...,['Discovering new materials and drugs typicall...,https://news.mit.edu/2023/learning-language-mo...,0
1,1,"July 6, 2023",Alex Ouyang,Abdul Latif Jameel Clinic for Machine Learning...,MIT scientists build a system that can generat...,"BioAutoMATED, an open-source, automated machin...",['Is it possible to build machine-learning mod...,https://news.mit.edu/2023/bioautomated-open-so...,1
2,2,"June 30, 2023",Jennifer Michalowski,McGovern Institute for Brain Research,"When computer vision works more like a brain, ...",Training artificial neural networks with data ...,"['From cameras to self-driving cars, many of t...",https://news.mit.edu/2023/when-computer-vision...,2
3,3,"June 30, 2023",Mary Beth Gallagher,School of Engineering,Educating national security leaders on artific...,"Experts from MIT’s School of Engineering, Schw...",['Understanding artificial intelligence and ho...,https://news.mit.edu/2023/educating-national-s...,3
4,4,"June 30, 2023",Adam Zewe,MIT News Office,Researchers teach an AI to write better chart ...,A new dataset can help scientists develop auto...,['Chart captions that explain complex trends a...,https://news.mit.edu/2023/researchers-chart-ca...,4


In [6]:
#Because it is just a course we select a small portion of News.
subset_news = news.head(MAX_NEWS)

# Import and configure the Vector Database
I'm going to use ChromaDB, one of the most popular open-source vector databases. 

First we need to import ChromaDB, and after that import the **Settings** class from **chromadb.config** module. This class allows us to change the setting for the ChromaDB system, and customize its behavior. 

In [7]:
import chromadb
from chromadb.config import Settings

Now we need to create the client. Older ChromaDB releases did this through a **Settings** object with two parameters: ***chroma_db_impl*** (the database implementation and storage format, for example ***duckdb*** with ***parquet***) and ***persist_directory***. With the version installed here (0.4.20), we call **PersistentClient** instead and store the result in the variable **chroma_client**.

We only need to provide one parameter: 
* path: the directory where the data will be stored. It is possible to work without a directory, keeping the data in memory without persistence, but Kaggle doesn't support that. 

In [8]:
chroma_client = chromadb.PersistentClient(path="/path/to/persist/directory")

# Filling and Querying the ChromaDB Database
The data in ChromaDB is stored in collections. If the collection already exists, we need to delete it first. 

In the next lines, we create the collection by calling the ***create_collection*** function of the ***chroma_client*** created above.

In [9]:
collection_name = "news_collection"
# Delete the collection if it already exists, checking every collection name
# (not only the first one returned by list_collections)
if collection_name in [c.name for c in chroma_client.list_collections()]:
    chroma_client.delete_collection(name=collection_name)

collection = chroma_client.create_collection(name=collection_name)
    

It's time to add the data to the collection. When calling the function ***add***, we need to provide at least ***documents***, ***metadatas*** and ***ids***. 
* In **documents** we store the full text; it's a different column in each dataset. 
* In **metadatas** we can provide a list of topics. 
* In **ids** we need to provide a unique identifier for each row. It MUST be unique! I'm creating the IDs from a numeric range, one per row. 
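The three arguments are parallel lists: position `i` in each one describes the same record. The sketch below builds them from a few hypothetical rows (plain dictionaries standing in for the news DataFrame) and derives the ids from the number of rows actually present. `build_add_arguments` is an illustrative helper, not part of ChromaDB.

```python
# Hypothetical rows standing in for the news DataFrame.
rows = [
    {"title": "Model learns molecules", "body": "Text of the first article."},
    {"title": "Better chart captions", "body": "Text of the second article."},
    {"title": "Vision like a brain", "body": "Text of the third article."},
]

def build_add_arguments(rows, document_field, topic_field):
    """Build the parallel lists expected by collection.add()."""
    documents = [row[document_field] for row in rows]
    metadatas = [{topic_field: row[topic_field]} for row in rows]
    # Derive ids from the number of rows actually present, so the three
    # lists always have the same length.
    ids = [f"id{i}" for i in range(len(rows))]
    if len(set(ids)) != len(ids):
        raise ValueError("ids must be unique")
    return documents, metadatas, ids

documents, metadatas, ids = build_add_arguments(rows, "body", "title")
print(ids)
```

Deriving the ids from `len(rows)` rather than from a fixed constant keeps the lists aligned even when the dataset has fewer rows than the configured maximum.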


In [10]:
collection.add(
    documents=subset_news[DOCUMENT].tolist(),
    metadatas=[{TOPIC: topic} for topic in subset_news[TOPIC].tolist()],
    ids=[f"id{x}" for x in range(len(subset_news))],  # one id per row, even if the dataset has fewer than MAX_NEWS rows
)

In [11]:
def query_database(query_text, n_results=10):
    results = collection.query(query_texts=query_text, n_results=n_results )
    return results

## Creating the semantic cache system
To implement the cache system, we will use Faiss, a library that allows storing embeddings in memory. It's quite similar to what Chroma does, but without its persistence.

For this purpose, we will create a class called semantic_cache that will work with its own encoder and provide the necessary functions for the user to perform queries.

In this class, we first query Faiss (the cache). If the closest cached question is within the specified distance threshold, the class returns the result stored in the cache. Otherwise, it fetches the result from the Chroma database and adds it to the cache.
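The lookup flow described above can be sketched without Faiss or an encoder. This is a simplified stand-in, not the class used in this notebook: it does a brute-force nearest-neighbor search over plain Python lists, takes embeddings as arguments, and uses a hypothetical `fake_database` function in place of the Chroma query. The threshold and cache size are arbitrary illustrative values.

```python
import math

class TinySemanticCache:
    """Brute-force stand-in for a Faiss-backed cache (illustrative only)."""

    def __init__(self, fetch, threshold=0.35, max_size=2):
        self.fetch = fetch          # called on a miss, e.g. a vector database query
        self.threshold = threshold  # maximum distance that still counts as a hit
        self.max_size = max_size
        self.embeddings = []
        self.answers = []

    def ask(self, embedding, question):
        # Find the closest cached embedding, if any.
        best_row, best_dist = -1, math.inf
        for row, cached in enumerate(self.embeddings):
            dist = math.dist(embedding, cached)
            if dist < best_dist:
                best_row, best_dist = row, dist
        if best_row != -1 and best_dist <= self.threshold:
            return self.answers[best_row], True  # cache hit
        # Cache miss: fetch the answer and store it, evicting the oldest entry if full.
        answer = self.fetch(question)
        if len(self.embeddings) >= self.max_size:
            self.embeddings.pop(0)
            self.answers.pop(0)
        self.embeddings.append(list(embedding))
        self.answers.append(answer)
        return answer, False

calls = []

def fake_database(question):
    """Hypothetical replacement for the real vector database query."""
    calls.append(question)
    return f"documents for: {question}"

tiny_cache = TinySemanticCache(fake_database)
first = tiny_cache.ask([1.0, 0.0], "capital of France?")
second = tiny_cache.ask([0.98, 0.05], "name of France's capital?")
third = tiny_cache.ask([0.0, 1.0], "tallest mountain?")
print(first[1], second[1], third[1])
```

Only the first and third questions reach `fake_database`; the second is close enough to the first to be served from the cache.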

In [12]:
!pip install -q faiss-cpu

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [38]:
import faiss
from sentence_transformers import SentenceTransformer
import time
import json

class semantic_cache:
    def __init__(self, json_file='cache.json'):
        # Initialize Faiss index with Euclidean distance
        self.index = faiss.IndexFlatL2(768)  # Use IndexFlatL2 with Euclidean distance
        if self.index.is_trained:
            print('Index trained')

        # Initialize Sentence Transformer model
        self.encoder = SentenceTransformer('all-mpnet-base-v2')
        self.MAX_SIZE_CACHE = 100


        # Uncomment the following lines to use DialoGPT for question generation
        # self.tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-large")
        # self.model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-large")

        # Set Euclidean distance threshold
        self.euclidean_threshold = 0.7
        self.json_file = json_file
        self.load_cache()
    
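    def load_cache(self):
        # Load cache from JSON file, creating an empty cache if the file is not found
        try:
            with open(self.json_file, 'r') as file:
                self.cache = json.load(file)
        except FileNotFoundError:
            self.cache = {'questions': [], 'embeddings': [], 'answers': [], 'response_text': []}
        # Rebuild the Faiss index from the stored embeddings so that index rows
        # stay aligned with the cache lists after a reload
        if self.cache['embeddings']:
            self.index.add(np.array(self.cache['embeddings'], dtype='float32'))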

    def save_cache(self):
        # Save the cache to the JSON file
        with open(self.json_file, 'w') as file:
            json.dump(self.cache, file)
            
    def generate_answer(self, question: str) -> tuple:
        # Query the vector database (query_database) and return both the raw result and the best document
        try:
            #result = make_prediction([question])
            result = query_database([question], 1)
            response_text = result['documents'][0][0]

            return result, response_text
        except Exception as e:
            raise RuntimeError(f"Error during 'generate_answer' method: {e}")
    
    def ask(self, question: str) -> str:
        # Method to retrieve an answer from the cache or generate a new one
        start_time = time.time()
        try:
            embedding = self.encoder.encode([question])

            # Search for the nearest neighbor in the index.
            # IndexFlatL2 returns squared L2 distances; I is -1 when the index is empty.
            D, I = self.index.search(embedding, 1)

            if I[0][0] != -1 and D[0][0] <= self.euclidean_threshold:
                row_id = int(I[0][0])
                print(f'Found cache in row: {row_id} with score {1 - D[0][0]}')
                elapsed_time = time.time() - start_time
                print(f"Time taken: {elapsed_time} seconds")
                return self.cache['response_text'][row_id]

            # Cache miss: fetch the answer from the vector database
            answer, response_text = self.generate_answer(question)

            if len(self.cache["questions"]) >= self.MAX_SIZE_CACHE:
                # Evict the oldest entry from the lists and from the index,
                # so that index rows keep matching list positions
                for key in ("questions", "embeddings", "answers", "response_text"):
                    self.cache[key].pop(0)
                self.index.remove_ids(np.array([0], dtype='int64'))

            self.cache['questions'].append(question)
            self.cache['embeddings'].append(embedding[0].tolist())
            self.cache['answers'].append(answer)
            self.cache['response_text'].append(response_text)

            self.index.add(embedding)
            self.save_cache()
            elapsed_time = time.time() - start_time
            print(f"Time taken: {elapsed_time} seconds")

            return response_text
        except Exception as e:
            raise RuntimeError(f"Error during 'ask' method: {e}")
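
The class above keeps its cache in a JSON file. The sketch below shows that round trip in isolation, assuming a hypothetical two-dimensional embedding and a temporary directory instead of `cache.json` in the working directory. Embeddings are stored as plain lists because NumPy arrays are not JSON serializable, which is why the class calls `.tolist()` before appending them.

```python
import json
import os
import tempfile

def save_cache(cache, path):
    """Write the cache dictionary to a JSON file."""
    with open(path, "w") as file:
        json.dump(cache, file)

def load_cache(path):
    """Read the cache from disk, or return an empty cache if the file is missing."""
    try:
        with open(path, "r") as file:
            return json.load(file)
    except FileNotFoundError:
        return {"questions": [], "embeddings": [], "response_text": []}

# Hypothetical entry with a two-dimensional embedding.
entry = {
    "questions": ["capital of France?"],
    "embeddings": [[0.25, 0.75]],
    "response_text": ["Paris is the capital of France."],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cache.json")
    empty = load_cache(path)      # file does not exist yet
    save_cache(entry, path)
    restored = load_cache(path)   # same content read back from disk

# These are the vectors a reloaded cache would need to add back to its index.
vectors_to_reindex = restored["embeddings"]
print(vectors_to_reindex)
```

Because the Faiss index lives only in memory, the vectors stored in the JSON file are the only thing a reloaded cache has available to rebuild it.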
            


In [39]:
cache = semantic_cache()

Index trained


In [40]:
question1 = "recent investigations about LLMs"
results = cache.ask(question1)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Time taken: 0.20497632026672363 seconds


In [16]:
print(results)

['Companies today are incorporating artificial intelligence into every corner of their business. The trend is expected to continue until machine-learning models are incorporated into most of the products and services we interact with every day.', 'As those models become a bigger part of our lives, ensuring their integrity becomes more important. That’s the mission of Verta, a startup that spun out of MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL).', 'Verta’s platform helps companies deploy, monitor, and manage machine-learning models safely and at scale. Data scientists and engineers can use Verta’s tools to track different versions of models, audit them for bias, test them before deployment, and monitor their performance in the real world.', '“Everything we do is to enable more products to be built with AI, and to do that safely,” Verta founder and CEO Manasi Vartak SM ’14, PhD ’18 says. “We’re already seeing with ChatGPT how AI can be used to generate data, art

In [41]:
question1 = "What is Monica Agrawal currently working on?"
results = cache.ask(question1)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Time taken: 0.0984659194946289 seconds


In [42]:
print(results)

['The School of Engineering is welcoming\xa011 new faculty members to its departments, institutes, labs, and centers. With research and teaching activities ranging from the development of novel microscopy techniques to intelligent systems and mixed-autonomy mobility, they are poised to make significant\xa0contributions in new directions across the school and to a wide range of research efforts around\xa0the Institute.', '“I am pleased to welcome our outstanding new faculty,” says Anantha Chandrakasan, dean of the School of Engineering. “Their contributions as educators, researchers, and collaborators will enhance the engineering community and strengthen our global impact.”', 'Pulkit Agrawal\xa0will join the Department of Electrical Engineering and Computer Science as an assistant professor in July. Agrawal earned a BS in electrical engineering from the Indian Institute of Technology, Kanpur, and was awarded the Director’s Gold Medal. He earned a PhD in computer science from the Univers

In [43]:
question1 = "Can you tell me about Monica Agrawal's current projects?"
results = cache.ask(question1)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Found cache in row: 1 with score 0.7254692316055298
Time taken: 0.03159523010253906 seconds


In [20]:
print(results)

['The School of Engineering is welcoming\xa011 new faculty members to its departments, institutes, labs, and centers. With research and teaching activities ranging from the development of novel microscopy techniques to intelligent systems and mixed-autonomy mobility, they are poised to make significant\xa0contributions in new directions across the school and to a wide range of research efforts around\xa0the Institute.', '“I am pleased to welcome our outstanding new faculty,” says Anantha Chandrakasan, dean of the School of Engineering. “Their contributions as educators, researchers, and collaborators will enhance the engineering community and strengthen our global impact.”', 'Pulkit Agrawal\xa0will join the Department of Electrical Engineering and Computer Science as an assistant professor in July. Agrawal earned a BS in electrical engineering from the Indian Institute of Technology, Kanpur, and was awarded the Director’s Gold Medal. He earned a PhD in computer science from the Univers

Once we have our information inside the database, we can query it and ask for data that matches our needs. The search is done on the content of the documents, and it doesn't look for an exact word or phrase. The results are based on the similarity between the search terms and the content of the documents. 

The metadata is not used in the search, but it can be used to filter or refine the results after the initial search. 
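As a sketch of that post-search filtering, the helper below keeps only the hits whose metadata contains a given word. The sample dictionary is hypothetical but shaped like the result of `collection.query()`, where each key holds one inner list per query. `filter_by_metadata` is an illustrative name, not a ChromaDB function.

```python
def filter_by_metadata(results, key, wanted):
    """Keep only the hits of the first query whose metadata[key] contains `wanted`."""
    kept = {"ids": [[]], "documents": [[]], "metadatas": [[]]}
    rows = zip(results["ids"][0], results["documents"][0], results["metadatas"][0])
    for doc_id, document, metadata in rows:
        if wanted.lower() in str(metadata.get(key, "")).lower():
            kept["ids"][0].append(doc_id)
            kept["documents"][0].append(document)
            kept["metadatas"][0].append(metadata)
    return kept

# Hypothetical result, shaped like the dictionary returned by collection.query().
sample_results = {
    "ids": [["id3", "id7", "id9"]],
    "documents": [["Text about robots.", "Text about proteins.", "Text about robot arms."]],
    "metadatas": [[
        {"Article Header": "Robots learn to walk"},
        {"Article Header": "Predicting protein shapes"},
        {"Article Header": "A gentler robot arm"},
    ]],
}

robot_hits = filter_by_metadata(sample_results, "Article Header", "robot")
print(robot_hits["ids"][0])
```

Chroma can also apply exact-match metadata filters itself through the `where` argument of `query`, which avoids retrieving hits that would be discarded afterwards.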


# Loading the model and creating the prompt
TRANSFORMERS!!
Time to use the library **transformers**, the most famous library from [Hugging Face](https://huggingface.co/) for working with language models. 

We are importing: 
* **AutoTokenizer**: a utility class for tokenizing text inputs so that they are compatible with various pre-trained language models.
* **AutoModelForCausalLM**: provides an interface to pre-trained language models designed for language generation tasks using causal language modeling (e.g., GPT models), such as the models used in this notebook.
* **pipeline**: provides a simple interface for performing various natural language processing (NLP) tasks, such as text generation (our case) or text classification. 

The code below loads [gemma-2b](https://huggingface.co/google/gemma-2b), a small model from Google with about 2 billion parameters, more than enough for our sample. Access to it requires a Hugging Face token, which is why we log in first. [dolly-v2-3b](https://huggingface.co/databricks/dolly-v2-3b), the smallest Dolly model, is left commented out as an alternative. It has 3 billion parameters and works much better than GPT2. 

Please feel free to test [different models](https://huggingface.co/models?pipeline_tag=text-generation&sort=trending); you need to search for NLP models trained for text generation. My recommendation is to choose "small" models, or we will run out of memory on Kaggle.  


In [21]:
from getpass import getpass
hf_key = getpass("Hugging Face Key: ")

Hugging Face Key:  ·····································


In [22]:
!huggingface-cli login --token $hf_key



Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [25]:
import torch

# On Apple Silicon Macs the device must be 'mps'
# device = torch.device('mps')  # to use with Apple Silicon
device = f'cuda:{torch.cuda.current_device()}' if torch.cuda.is_available() else 'cpu'

In [34]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

#model_id = "databricks/dolly-v2-3b"
model_id = "google/gemma-2b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
lm_model = AutoModelForCausalLM.from_pretrained(model_id)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

The next step is to initialize the pipeline using the objects created above. 

The model's response is limited to 256 tokens. For this project I'm not interested in a longer response, but it can easily be extended to whatever length you want.

By setting ***device_map*** to ***auto***, we instruct the pipeline to automatically select the most appropriate device, CPU or GPU, for text generation.  

In [35]:
pipe = pipeline(
    "text-generation",
    model=lm_model,
    tokenizer=tokenizer,
    max_new_tokens=256,
    device_map="auto",
)

## Creating the extended prompt
To create the prompt, we use the result of querying the vector database and the sentence entered by the user. 

The prompt has two parts: the **relevant context**, which is the information retrieved from the database, and the **user's question**. 

We only need to join the two parts together to create the prompt that we are going to send to the model. 

You can limit the length of the context passed to the model, because we can run into memory problems with one of the datasets, which contains very long texts in the document field. 
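A minimal sketch of that truncation, assuming the context arrives as a single string. `build_prompt` and `max_context_chars` are illustrative names; the default of 5120 characters mirrors the slice that is commented out in the next cell.

```python
def build_prompt(context, question, max_context_chars=5120):
    """Join the retrieved context and the user's question, truncating the context."""
    trimmed = context[:max_context_chars]
    return f"Relevant context: {trimmed}\n\n The user's question: {question}"

# A deliberately oversized context: 25,000 characters.
long_context = "news " * 5000
prompt = build_prompt(long_context, "What is new?", max_context_chars=100)
print(len(prompt))
```

Truncating by characters is a rough control. A limit expressed in tokens, counted with the model's tokenizer, would track the model's real input budget more closely.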

In [36]:
question = "Can you tell me about Monica Agrawal's current projects?"
#context = " ".join([f"#{str(i)}" for i in results["documents"][0]])
#context = context[0:5120]
prompt_template = f"Relevant context: {results}\n\n The user's question: {question}"
prompt_template

'Relevant context: [\'The School of Engineering is welcoming\\xa011 new faculty members to its departments, institutes, labs, and centers. With research and teaching activities ranging from the development of novel microscopy techniques to intelligent systems and mixed-autonomy mobility, they are poised to make significant\\xa0contributions in new directions across the school and to a wide range of research efforts around\\xa0the Institute.\', \'“I am pleased to welcome our outstanding new faculty,” says Anantha Chandrakasan, dean of the School of Engineering. “Their contributions as educators, researchers, and collaborators will enhance the engineering community and strengthen our global impact.”\', \'Pulkit Agrawal\\xa0will join the Department of Electrical Engineering and Computer Science as an assistant professor in July. Agrawal earned a BS in electrical engineering from the Indian Institute of Technology, Kanpur, and was awarded the Director’s Gold Medal. He earned a PhD in compu

Now all that remains is to send the prompt to the model and wait for its response!


In [37]:
lm_response = pipe(prompt_template)
print(lm_response[0]["generated_text"])

Relevant context: ['The School of Engineering is welcoming\xa011 new faculty members to its departments, institutes, labs, and centers. With research and teaching activities ranging from the development of novel microscopy techniques to intelligent systems and mixed-autonomy mobility, they are poised to make significant\xa0contributions in new directions across the school and to a wide range of research efforts around\xa0the Institute.', '“I am pleased to welcome our outstanding new faculty,” says Anantha Chandrakasan, dean of the School of Engineering. “Their contributions as educators, researchers, and collaborators will enhance the engineering community and strengthen our global impact.”', 'Pulkit Agrawal\xa0will join the Department of Electrical Engineering and Computer Science as an assistant professor in July. Agrawal earned a BS in electrical engineering from the Indian Institute of Technology, Kanpur, and was awarded the Director’s Gold Medal. He earned a PhD in computer scienc

# Conclusions, Fork and Improve
A very short notebook, but with a lot of content.

We have used a vector database to store information. Then we retrieved it and used it to create an extended prompt, which we sent to one of the newer large language models available on Hugging Face.

The model has returned a response to us taking into account the context that we have passed to it in the prompt.

This way of working with language models is very powerful.

We can make the model use our information without the need for fine-tuning. This technique has some very big advantages over fine-tuning.

Please don't stop here.

* The notebook is prepared to use two more datasets. Run some tests with them.

* Find another model on Hugging Face and compare it.

* Modify the way to create the prompt.

## Continue learning
This notebook is part of a [course on large language models](https://github.com/peremartra/Large-Language-Model-Notebooks-Course) I'm working on and it's available on [GitHub](https://github.com/peremartra/Large-Language-Model-Notebooks-Course). You can see the other lessons and if you like it, don't forget to subscribe to receive notifications of new lessons.

Other notebooks in the Large Language Models series: 
https://www.kaggle.com/code/peremartramanonellas/ask-your-documents-with-langchain-vectordb-hf

### If you liked the notebook, please consider ***UPVOTING IT***. It helps others discover it, and it encourages me to continue publishing.