<a href="https://colab.research.google.com/github/mrhamedani/LLM-Agents/blob/main/5_SemanticCache_ChromaDB%26Faiss.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


✅ Purpose of Semantic Cache:

Identifying similar or identical user requests.
If a similar request is found, the information is retrieved from the cache instead of fetching it again from the original source.

✅ Why is the cache placed between the user and the vector database, not between the user and the model?

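If the cache were applied to the model's responses, a stored answer could be returned for a question that is only similar to the original one, so the accuracy of the answers may decrease. Caching at the retrieval stage only reuses the retrieved context; the model still generates a fresh response for each question.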

✅ Why is this cache necessary?

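A RAG system spends most of its time in a few slow stages. One way to optimize it is to use a semantic cache that checks whether a similar request has been made before, so the retrieval step can be skipped.

✅ Two time-consuming stages in a RAG system: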

Retrieving information to construct an enriched prompt.
Requesting a response from the language model.
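
The idea above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used later in this notebook: `ToySemanticCache`, `slow_retriever`, and the two-dimensional vectors are hypothetical stand-ins for the real cache, the vector database, and the sentence embeddings.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

class ToySemanticCache:
    """Hypothetical in-memory cache sitting in front of a slow retriever."""

    def __init__(self, retriever, threshold=0.35):
        self.retriever = retriever   # slow lookup (stand-in for the vector DB)
        self.threshold = threshold   # maximum distance counted as "similar"
        self.vectors = []            # embeddings of cached questions
        self.answers = []            # answers aligned with self.vectors
        self.hits = 0
        self.misses = 0

    def ask(self, vector):
        # Find the closest cached question, if there is one.
        best = min(range(len(self.vectors)),
                   key=lambda i: squared_l2(vector, self.vectors[i]),
                   default=None)
        if best is not None and squared_l2(vector, self.vectors[best]) <= self.threshold:
            self.hits += 1
            return self.answers[best]
        # Cache miss: call the slow retriever and remember its result.
        self.misses += 1
        answer = self.retriever(vector)
        self.vectors.append(list(vector))
        self.answers.append(answer)
        return answer

calls = []

def slow_retriever(vector):
    """Made-up retriever that records how often it is called."""
    calls.append(vector)
    return f"document for {vector}"

toy_cache = ToySemanticCache(slow_retriever)
first = toy_cache.ask([0.0, 1.0])     # miss: goes to the retriever
second = toy_cache.ask([0.05, 1.0])   # hit: close to the first question
third = toy_cache.ask([1.0, 0.0])     # miss: far from anything cached
print(toy_cache.hits, toy_cache.misses, len(calls))
```

Only the two misses reach the retriever; the near-duplicate question is answered from memory.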








<img src="https://github.com/peremartra/Large-Language-Model-Notebooks-Course/blob/main/img/semantic_cache.jpg?raw=true">



# Import and load the libraries.
To start, we need to install the necessary Python packages.
* **[sentence transformers](https://www.sbert.net/)**. This library is necessary to transform the sentences into fixed-length vectors, also known as embeddings.
* **[xformers](https://github.com/facebookresearch/xformers)**. A package that provides libraries and utilities to facilitate working with transformer models. We install it to avoid an error when working with the model and embeddings.  
* **[chromadb](https://www.trychroma.com/)**. This is our vector database. ChromaDB is open source and easy to use, and it is one of the most widely used vector databases for storing embeddings.
* **[accelerate](https://github.com/huggingface/accelerate)**. Necessary to run the model on a GPU.  

In [None]:
!pip install -q transformers==4.38.1
!pip install -q accelerate==0.27.2
!pip install -q sentence-transformers==2.5.1
!pip install -q xformers==0.0.24
!pip install -q chromadb==0.4.24
!pip install -q datasets==2.17.1

In [3]:
import numpy as np
import pandas as pd

In [None]:
# Log in to Hugging Face. This is mandatory to use the Gemma model,
# and recommended for accessing public models and datasets.
from getpass import getpass
if 'hf_key' not in locals():
  hf_key = getpass("Your Hugging Face API Key: ")
!huggingface-cli login --token $hf_key

In [None]:
from datasets import load_dataset
data = load_dataset("keivalya/MedQuad-MedicalQnADataset", split='train')     # download the dataset from Hugging Face related to MedicalQnA

In [42]:
data = pd.DataFrame(data)
data

| | qtype | Question | Answer | id |
|---|---|---|---|---|
| 0 | susceptibility | Who is at risk for Lymphocytic Choriomeningiti... | LCMV infections can occur after exposure to fr... | 0 |
| 1 | symptoms | What are the symptoms of Lymphocytic Choriomen... | LCMV is most commonly recognized as causing ne... | 1 |
| 2 | susceptibility | Who is at risk for Lymphocytic Choriomeningiti... | Individuals of all ages who come into contact ... | 2 |
| 3 | exams and tests | How to diagnose Lymphocytic Choriomeningitis (... | During the first phase of the disease, the mos... | 3 |
| 4 | treatment | What are the treatments for Lymphocytic Chorio... | Aseptic meningitis, encephalitis, or meningoen... | 4 |
| ... | ... | ... | ... | ... |
| 16402 | symptoms | What are the symptoms of Familial visceral myo... | What are the signs and symptoms of Familial vi... | 16402 |
| 16403 | information | What is (are) Pseudopelade of Brocq ? | Pseudopelade of Brocq (PBB) is a slowly progre... | 16403 |
| 16404 | symptoms | What are the symptoms of Pseudopelade of Brocq ? | What are the signs and symptoms of Pseudopelad... | 16404 |
| 16405 | treatment | What are the treatments for Pseudopelade of Br... | Is there treatment or a cure for pseudopelade ... | 16405 |


In [13]:
MAX_ROWS = 15000
DOCUMENT="Answer"
TOPIC="qtype"

ChromaDB requires every record to have a unique identifier. In this notebook the identifiers (`id0`, `id1`, ...) are generated when the data is added to the collection. First, we select the subset of rows to work with.


In [23]:
subset_data = data.head(MAX_ROWS)  # Because this is just a sample, we select a subset of the dataset.


I'm going to use ChromaDB, one of the most popular open-source vector databases.

First we need to import ChromaDB, and after that import the **Settings** class from the **chromadb.config** module. This class allows us to change the settings of the ChromaDB system and customize its behavior.

In [19]:
import chromadb
from chromadb.config import Settings
import sentence_transformers
from sentence_transformers import SentenceTransformer

In [16]:
chroma_client = chromadb.PersistentClient(path="/path/to/persist/directory")  #directory to persist chromaDB

In [20]:
collection_name = "news_collection"
# Delete the collection if it already exists, so the notebook can be re-run cleanly.
# Every existing collection is checked, not only the first one.
if collection_name in [c.name for c in chroma_client.list_collections()]:
    chroma_client.delete_collection(name=collection_name)

collection = chroma_client.create_collection(name=collection_name)


The data is added to the collection with the `add` function. Here we specify four parts:

**documents** → the full text of each record (stored in the `Answer` column of the dataset)

**metadatas** → meta information about each record, here its question type (`qtype`)

**ids** → a unique identifier for each data row

**embeddings** → the vector representation of each document; to store and search the information in ChromaDB, the text must first be converted into vectors
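
As a small illustration of how these parts line up, the sketch below builds the parallel lists from a couple of made-up rows. `build_add_arguments` is a hypothetical helper, not part of ChromaDB, and the embeddings are left out to keep the example dependency-free.

```python
def build_add_arguments(rows, document_key="Answer", topic_key="qtype"):
    """Build aligned documents / metadatas / ids lists from a list of dict rows."""
    documents = [row[document_key] for row in rows]
    metadatas = [{topic_key: row[topic_key]} for row in rows]
    ids = [f"id{i}" for i in range(len(rows))]
    return {"documents": documents, "metadatas": metadatas, "ids": ids}

# Made-up rows standing in for the dataset.
rows = [
    {"qtype": "symptoms", "Answer": "Fever and headache are common."},
    {"qtype": "treatment", "Answer": "Rest and fluids are recommended."},
]

add_arguments = build_add_arguments(rows)

# Entry i of every list describes the same record, and ids are unique.
same_length = len({len(values) for values in add_arguments.values()}) == 1
unique_ids = len(set(add_arguments["ids"])) == len(add_arguments["ids"])
print(add_arguments["ids"], same_length, unique_ids)
```

Deriving the ids from the number of rows actually present keeps the lists aligned even if the subset is smaller than expected.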

In [24]:
embedding_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")     # choose a sentence transformer model
embeddings = embedding_model.encode(subset_data[DOCUMENT].tolist(), convert_to_numpy=True)   # create embeddings

# add the documents and their embeddings to the ChromaDB collection
collection.add(documents=subset_data[DOCUMENT].tolist(),
    metadatas=[{TOPIC: topic} for topic in subset_data[TOPIC].tolist()],
    ids=[f"id{x}" for x in range(len(subset_data))],   # one id per row actually present
    embeddings=embeddings.tolist(),)


Once the information is inside the database, we can query it and ask for data that matches our needs. The search is done on the content of the documents, and it doesn't look for an exact word or phrase. The results are based on the similarity between the search terms and the content of the documents.

The metadata is not used in the search, but it can be used to filter or refine the results after the initial search.
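
To make the distinction concrete, here is a toy sketch in plain Python: records are ranked purely by vector distance, and the metadata only comes into play as an optional filter afterwards. `toy_query`, the `where` argument of this helper, and the two-dimensional vectors are illustrative assumptions; this is not how ChromaDB is implemented internally.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def toy_query(query_vec, records, n_results=2, where=None):
    # 1) Rank every record by the distance between vectors only.
    ranked = sorted(records, key=lambda r: squared_l2(query_vec, r["vector"]))
    # 2) Optionally refine the ranked list using the metadata.
    if where is not None:
        ranked = [r for r in ranked
                  if all(r["metadata"].get(k) == v for k, v in where.items())]
    return [r["id"] for r in ranked[:n_results]]

# Made-up records: an id, a tiny "embedding", and a metadata dictionary.
records = [
    {"id": "id0", "vector": [1.0, 0.0], "metadata": {"qtype": "symptoms"}},
    {"id": "id1", "vector": [0.9, 0.1], "metadata": {"qtype": "treatment"}},
    {"id": "id2", "vector": [0.0, 1.0], "metadata": {"qtype": "symptoms"}},
]

nearest = toy_query([1.0, 0.0], records)
only_symptoms = toy_query([1.0, 0.0], records, where={"qtype": "symptoms"})
print(nearest, only_symptoms)
```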

Let's define a function to query the ChromaDB Database.

In [25]:
def query_database(query_text, n_results=10):
    results = collection.query(query_texts=query_text, n_results=n_results )
    return results

## Creating the semantic cache system
To implement the cache system, we will use Faiss, a library that allows storing embeddings in memory. It's quite similar to what Chroma does, but without its persistence.

For this purpose, we will create a class called semantic_cache that will work with its own encoder and provide the necessary functions for the user to perform queries.

In this class, we first query Faiss (the cache). If the distance to the nearest cached question is at or below the specified **threshold**, the result is returned from the cache. Otherwise, it is fetched from the Chroma database.

The cache is stored in a .json file.

In [None]:
!pip install -q faiss-cpu==1.8.0

In [27]:
import faiss
import time
import json

This function initializes the semantic cache.

It employs the `IndexFlatL2` index, a flat (exhaustive) index that might not be the fastest but is well suited to small datasets. Depending on the characteristics of the data intended for the cache and the expected dataset size, another index such as HNSW or IVF could be used.
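
What a flat index does can be sketched in plain Python: it compares the query with every stored vector and keeps the closest ones. `flat_l2_search` is a hypothetical helper written for illustration and does not use Faiss; the cost grows linearly with the number of stored vectors, which is why approximate indexes become attractive for larger caches.

```python
def flat_l2_search(vectors, query, k=1):
    """Exhaustive search: compare the query with every stored vector."""
    scored = [
        (sum((x - y) ** 2 for x, y in zip(vec, query)), idx)
        for idx, vec in enumerate(vectors)
    ]
    scored.sort()       # smallest squared distance first
    return scored[:k]   # list of (squared_distance, row_index) pairs

stored = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
results = flat_l2_search(stored, [0.9, 1.0], k=2)
nearest_ids = [idx for _, idx in results]
print(nearest_ids)
```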

In [28]:
def init_cache():
  # 768 matches the embedding size of the 'all-mpnet-base-v2' encoder below
  index = faiss.IndexFlatL2(768)
  if index.is_trained:
      print('Index trained')

  # Initialize Sentence Transformer model
  encoder = SentenceTransformer('all-mpnet-base-v2')

  return index, encoder

In the `retrieve_cache` function, the .json file is retrieved from disk in case there is a need to reuse the cache across sessions.

In [29]:
def retrieve_cache(json_file):
      try:
          with open(json_file, 'r') as file:
              cache = json.load(file)
      except FileNotFoundError:
          cache = {'questions': [], 'embeddings': [], 'answers': [], 'response_text': []}

      return cache

The `store_cache` function saves the file containing the cache data to disk.

In [30]:
def store_cache(json_file, cache):
  with open(json_file, 'w') as file:
        json.dump(cache, file)
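
The round trip through disk can be checked with a small standalone sketch. `load_cache` and `save_cache` below are hypothetical names that mirror the two functions above, and the stored question and vector are made up. Embeddings are kept as plain lists because NumPy arrays are not JSON serializable.

```python
import json
import os
import tempfile

def load_cache(path):
    """Return the cache stored at path, or an empty cache if the file is missing."""
    try:
        with open(path, "r") as file:
            return json.load(file)
    except FileNotFoundError:
        return {"questions": [], "embeddings": [], "answers": [], "response_text": []}

def save_cache(path, cache):
    """Write the whole cache dictionary to disk as JSON."""
    with open(path, "w") as file:
        json.dump(cache, file)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_cache.json")

    toy = load_cache(path)                      # file does not exist yet
    was_empty = toy["questions"] == []

    toy["questions"].append("What is a vaccine?")
    toy["embeddings"].append([0.1, 0.2, 0.3])   # plain list, not a NumPy array
    toy["response_text"].append("Vaccines train the immune system.")
    save_cache(path, toy)

    reloaded = load_cache(path)                 # same content after the round trip

print(was_empty, reloaded["questions"])
```

Because each list is appended to in the same order, position `i` in `questions`, `embeddings`, and `response_text` always refers to the same cached entry.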

These functions will be used within the `SemanticCache` class, which includes the search function and its initialization function.

Even though the `ask` function has a substantial amount of code, its purpose is quite straightforward. It looks in the cache for the closest question to the one just made by the user.

It then checks whether the distance is within the specified threshold. If so, it directly returns the response from the cache; otherwise, it calls the `query_database` function to retrieve the data from ChromaDB.

I've used Euclidean distance instead of cosine similarity, which is widely employed in vector comparisons. This choice is based on the fact that Euclidean (L2) distance is the default metric used by Faiss. Note that `IndexFlatL2` reports squared L2 distances, so the threshold is compared against squared values. Although cosine similarity can also be calculated, doing so adds complexity that may not significantly contribute to the final result.
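
For unit-length vectors the two measures are directly related: the squared Euclidean distance equals `2 - 2 * cosine_similarity`. The sketch below verifies this numerically on made-up vectors and converts a squared-distance threshold into the equivalent cosine similarity. It assumes the embeddings are normalized to length 1, which should be checked for the encoder in use; the helper names are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors, normalized to unit length.
a = normalize([1.0, 2.0, 3.0])
b = normalize([2.0, 1.0, 3.5])

lhs = squared_l2(a, b)
rhs = 2 - 2 * cosine_similarity(a, b)
identity_holds = math.isclose(lhs, rhs, abs_tol=1e-9)

def cosine_for_squared_l2(distance):
    """Cosine similarity equivalent to a squared L2 distance (unit vectors only)."""
    return 1 - distance / 2

equivalent_cosine = cosine_for_squared_l2(0.35)
print(identity_holds, equivalent_cosine)
```

Under that assumption, a squared-distance threshold of 0.35 corresponds to a cosine similarity of roughly 0.825, so both metrics would rank cached questions in the same order.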


In [31]:
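class semantic_cache:
  def __init__(self, json_file="cache_file.json", threshold=0.35):
      # Initialize Faiss index with Euclidean distance
      self.index, self.encoder = init_cache()

      # Set the Euclidean distance threshold.
      # A distance of 0 means identical sentences.
      # We only return cached answers whose distance is at or below this threshold.
      self.euclidean_threshold = threshold

      self.json_file = json_file
      self.cache = retrieve_cache(self.json_file)

      # If a cache file from a previous session was loaded, add its embeddings
      # to the Faiss index so that index rows line up with the cache lists.
      if self.cache['embeddings']:
          self.index.add(np.array(self.cache['embeddings'], dtype='float32'))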

  def ask(self, question: str) -> str:
      # Method to retrieve an answer from the cache or generate a new one
      start_time = time.time()
      try:
          #First we obtain the embeddings corresponding to the user question
          embedding = self.encoder.encode([question])

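          # Search for the nearest neighbor in the index.
          # IndexFlatL2 is exhaustive, so no nprobe setting is needed
          # (nprobe only applies to IVF-style indexes).
          D, I = self.index.search(embedding, 1)

          if D[0][0] >= 0:
              # I[0][0] is -1 when the index has no entries yet
              if I[0][0] >= 0 and D[0][0] <= self.euclidean_threshold: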
                  row_id = int(I[0][0])

                  print('Answer recovered from Cache. ')
                  print(f'{D[0][0]:.3f} smaller than {self.euclidean_threshold}')
                  print(f'Found cache in row: {row_id} with score {D[0][0]:.3f}')
                  print(f'response_text: ' + self.cache['response_text'][row_id])

                  end_time = time.time()
                  elapsed_time = end_time - start_time
                  print(f"Time taken: {elapsed_time:.3f} seconds")
                  return self.cache['response_text'][row_id]

          # Handle the case when there are not enough results
          # or Euclidean distance is not met, asking to chromaDB.
          answer  = query_database([question], 1)
          response_text = answer['documents'][0][0]

          self.cache['questions'].append(question)
          self.cache['embeddings'].append(embedding[0].tolist())
          self.cache['answers'].append(answer)
          self.cache['response_text'].append(response_text)

          print('Answer recovered from ChromaDB. ')
          print(f'response_text: {response_text}')

          self.index.add(embedding)
          store_cache(self.json_file, self.cache)
          end_time = time.time()
          elapsed_time = end_time - start_time
          print(f"Time taken: {elapsed_time:.3f} seconds")

          return response_text
      except Exception as e:
          raise RuntimeError(f"Error during 'ask' method: {e}") from e

### Testing the semantic_cache class.

In [None]:
# Initialize the cache.
cache = semantic_cache('4cache_file.json')

In [33]:
results = cache.ask("How work a vaccine?")

/root/.cache/chroma/onnx_models/all-MiniLM-L6-v2/onnx.tar.gz: 100%|██████████| 79.3M/79.3M [00:08<00:00, 10.1MiB/s]


Answer recovered from ChromaDB. 
response_text: Summary : Shots may hurt a little, but the diseases they can prevent are a lot worse. Some are even life-threatening. Immunization shots, or vaccinations, are essential. They protect against things like measles, mumps, rubella, hepatitis B, polio, tetanus, diphtheria, and pertussis (whooping cough). Immunizations are important for adults as well as children.    Your immune system helps your body fight germs by producing substances to combat them. Once it does, the immune system "remembers" the germ and can fight it again. Vaccines contain germs that have been killed or weakened. When given to a healthy person, the vaccine triggers the immune system to respond and thus build immunity.     Before vaccines, people became immune only by actually getting a disease and surviving it. Immunizations are an easier and less risky way to become immune.     NIH: National Institute of Allergy and Infectious Diseases
Time taken: 10.427 seconds


As expected, this response has been obtained from ChromaDB. The class then stores it in the cache.

Now, if we send a second question that is quite different, the response should also be retrieved from ChromaDB. This is because the question stored previously is so dissimilar that it would surpass the specified threshold in terms of Euclidean distance.

In [35]:
results = cache.ask("Explain briefly what is a Periodic Paralyses")

Answer recovered from Cache. 
0.000 smaller than 0.35
Found cache in row: 1 with score 0.000
response_text: Familial periodic paralyses are a group of inherited neurological disorders caused by mutations in genes that regulate sodium and calcium channels in nerve cells. They are characterized by episodes in which the affected muscles become slack, weak, and unable to contract. Between attacks, the affected muscles usually work as normal.
                
The two most common types of periodic paralyses are: Hypokalemic periodic paralysis is characterized by a fall in potassium levels in the blood. In individuals with this mutation attacks often begin in adolescence and are triggered by strenuous exercise, high carbohydrate meals, or by injection of insulin, glucose, or epinephrine. Weakness may be mild and limited to certain muscle groups, or more severe and affect the arms and legs. Attacks may last for a few hours or persist for several days. Some patients may develop chronic muscle w

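On a first run, this answer is retrieved from ChromaDB and stored in the cache, as expected. The output shown above reports a cache hit with a distance of 0.000, which indicates that it was captured after the same question had already been asked once in the session.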

Let's proceed to test it with a question very similar to the one we just asked.

In this case, the response should come directly from the cache without the need to access the ChromaDB database.

In [36]:
results = cache.ask("Briefly explain me what is a periodic paralyses")

Answer recovered from Cache. 
0.018 smaller than 0.35
Found cache in row: 1 with score 0.018
response_text: Familial periodic paralyses are a group of inherited neurological disorders caused by mutations in genes that regulate sodium and calcium channels in nerve cells. They are characterized by episodes in which the affected muscles become slack, weak, and unable to contract. Between attacks, the affected muscles usually work as normal.
                
The two most common types of periodic paralyses are: Hypokalemic periodic paralysis is characterized by a fall in potassium levels in the blood. In individuals with this mutation attacks often begin in adolescence and are triggered by strenuous exercise, high carbohydrate meals, or by injection of insulin, glucose, or epinephrine. Weakness may be mild and limited to certain muscle groups, or more severe and affect the arms and legs. Attacks may last for a few hours or persist for several days. Some patients may develop chronic muscle w

The two questions are so similar that their Euclidean distance is truly minimal, almost as if they were identical.

Now, let's try another question, this time a bit more distinct, and observe how the system behaves.

In [49]:
question_def = "Write in 20 words what is a periodic paralyses"
results = cache.ask(question_def)

Answer recovered from Cache. 
0.220 smaller than 0.35
Found cache in row: 1 with score 0.220
response_text: Familial periodic paralyses are a group of inherited neurological disorders caused by mutations in genes that regulate sodium and calcium channels in nerve cells. They are characterized by episodes in which the affected muscles become slack, weak, and unable to contract. Between attacks, the affected muscles usually work as normal.
                
The two most common types of periodic paralyses are: Hypokalemic periodic paralysis is characterized by a fall in potassium levels in the blood. In individuals with this mutation attacks often begin in adolescence and are triggered by strenuous exercise, high carbohydrate meals, or by injection of insulin, glucose, or epinephrine. Weakness may be mild and limited to certain muscle groups, or more severe and affect the arms and legs. Attacks may last for a few hours or persist for several days. Some patients may develop chronic muscle w

We observe that the Euclidean distance has increased, but it still remains within the specified threshold. Therefore, it continues to return the response directly from the cache.


# Using Transformers to work with language models (LLMs)
The three main tools used here are:

1️⃣ AutoTokenizer → An automatic tokenizer that converts text into tokens suitable for the model.

2️⃣ AutoModelForCausalLM → Language models based on Causal Language Modeling (like GPT) for text generation.

3️⃣ pipeline → A simple interface for performing NLP tasks such as text generation or text classification.  

In [57]:
import torch
from torch import cuda
from transformers import AutoTokenizer, AutoModelForCausalLM

# Use the current GPU if one is available, otherwise fall back to the CPU
device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

In [None]:
model_id = "google/gemma-2b-it"    #model_id = "databricks/dolly-v2-3b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id,
                                             device_map="cuda",
                                            torch_dtype=torch.bfloat16)

# To use or not to use the pipeline
📌 When should we use pipeline?

For quick and simple tasks, such as testing the model with short inputs.
When we don't want to manually handle tokenization, input processing, and model execution.
For tasks like summarization, translation, and text generation without needing complex configurations.

📌 When should we use AutoModelForCausalLM and AutoTokenizer?

When we need precise control over the model and data processing.
In advanced scenarios such as RAG, batch processing, GPU optimization, and custom model settings.
When we want to fine-tune the model and manage the output more accurately.

## Creating the extended prompt
To create the prompt, we use the result of querying the `semantic_cache` class and the question entered by the user.

The prompt has two parts: the **relevant context**, which is the information recovered from the database, and the **user's question**.

We only need to put the two parts together to create the prompt and then send it to the model.
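
Wrapping this step in a small function makes it reusable for any question. `build_prompt` is a hypothetical helper, and the context and question in the usage lines are made up; the wording of the template follows the one used in this notebook.

```python
def build_prompt(context: str, question: str) -> str:
    """Combine retrieved context and the user's question into one prompt."""
    return (
        "\n"
        "You are an AI assistant that provides clear and concise answers "
        "based on the given context.\n"
        "\n"
        f"Relevant context: {context}\n"
        "\n"
        f"The user's question: {question}\n"
        "\n"
        "Please provide a detailed yet concise answer.\n"
    )

# Made-up inputs for illustration.
example_prompt = build_prompt(
    context="Vaccines contain germs that have been killed or weakened.",
    question="How does a vaccine work?",
)
print(example_prompt)
```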

In [62]:
prompt_template = f"""
You are an AI assistant that provides clear and concise answers based on the given context.

Relevant context: {results}

The user's question: {question_def}

Please provide a detailed yet concise answer.
"""
print(prompt_template)


You are an AI assistant that provides clear and concise answers based on the given context.

Relevant context: Familial periodic paralyses are a group of inherited neurological disorders caused by mutations in genes that regulate sodium and calcium channels in nerve cells. They are characterized by episodes in which the affected muscles become slack, weak, and unable to contract. Between attacks, the affected muscles usually work as normal.
                
The two most common types of periodic paralyses are: Hypokalemic periodic paralysis is characterized by a fall in potassium levels in the blood. In individuals with this mutation attacks often begin in adolescence and are triggered by strenuous exercise, high carbohydrate meals, or by injection of insulin, glucose, or epinephrine. Weakness may be mild and limited to certain muscle groups, or more severe and affect the arms and legs. Attacks may last for a few hours or persist for several days. Some patients may develop chronic musc

In [60]:
input_ids = tokenizer(prompt_template, return_tensors="pt").to("cuda")   # pt = pytorch tensor

In [61]:
outputs = model.generate(**input_ids,max_new_tokens=256)
print(tokenizer.decode(outputs[0]))

<bos>
You are an AI assistant that provides clear and concise answers based on the given context.

Relevant context: Familial periodic paralyses are a group of inherited neurological disorders caused by mutations in genes that regulate sodium and calcium channels in nerve cells. They are characterized by episodes in which the affected muscles become slack, weak, and unable to contract. Between attacks, the affected muscles usually work as normal.
                
The two most common types of periodic paralyses are: Hypokalemic periodic paralysis is characterized by a fall in potassium levels in the blood. In individuals with this mutation attacks often begin in adolescence and are triggered by strenuous exercise, high carbohydrate meals, or by injection of insulin, glucose, or epinephrine. Weakness may be mild and limited to certain muscle groups, or more severe and affect the arms and legs. Attacks may last for a few hours or persist for several days. Some patients may develop chronic