<a href="https://colab.research.google.com/github/peremartra/Large-Language-Model-Notebooks-Course/blob/main/2-Vector%20Databases%20with%20LLMs/semantic_cache_chroma_vector_database.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:

# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES
# TO THE CORRECT LOCATION (/kaggle/input) IN YOUR NOTEBOOK,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.

import os
import sys
from tempfile import NamedTemporaryFile
from urllib.request import urlopen
from urllib.parse import unquote, urlparse
from urllib.error import HTTPError
from zipfile import ZipFile
import tarfile
import shutil

CHUNK_SIZE = 40960
DATA_SOURCE_MAPPING = 'mit-ai-news-published-till-2023:https%3A%2F%2Fstorage.googleapis.com%2Fkaggle-data-sets%2F3496946%2F6104553%2Fbundle%2Farchive.zip%3FX-Goog-Algorithm%3DGOOG4-RSA-SHA256%26X-Goog-Credential%3Dgcp-kaggle-com%2540kaggle-161607.iam.gserviceaccount.com%252F20240229%252Fauto%252Fstorage%252Fgoog4_request%26X-Goog-Date%3D20240229T174141Z%26X-Goog-Expires%3D259200%26X-Goog-SignedHeaders%3Dhost%26X-Goog-Signature%3D243d8efb8ec005af98a76410e9ee2f91cf16a2a596bd89921e77454d4b457eeb48bf94531577c56d97df1ca879a6a567ea2c1a72d1c3ae5b55e67cf4bc562290b39f96bff315d1b453498cbdf496ebd09316c26d770df964c16a3ff2c0dad1192139de496156f721f65079d6f3a1db05d7fa93850decaf98f2460d1adc4c4e03ee335ecfde8b7e9aaa0552ed61c58311098b20a39e11b39a1a01d9aebbfcfcddf2cb7f4939146548e1146e26aa817ffb5e97a115b2f635c9a97f217e09baacd74fdbed1883ebe53898f7a08078a9a9ceb1fb4181080a5c98c81d0a8279584e8798e3a6df481bd9a6b780b8a17bb1a82b052d9c6c247a175703622b3f570ad7e9'

KAGGLE_INPUT_PATH='/kaggle/input'
KAGGLE_WORKING_PATH='/kaggle/working'
KAGGLE_SYMLINK='kaggle'

!umount /kaggle/input/ 2> /dev/null
shutil.rmtree('/kaggle/input', ignore_errors=True)
os.makedirs(KAGGLE_INPUT_PATH, 0o777, exist_ok=True)
os.makedirs(KAGGLE_WORKING_PATH, 0o777, exist_ok=True)

try:
  os.symlink(KAGGLE_INPUT_PATH, os.path.join("..", 'input'), target_is_directory=True)
except FileExistsError:
  pass
try:
  os.symlink(KAGGLE_WORKING_PATH, os.path.join("..", 'working'), target_is_directory=True)
except FileExistsError:
  pass

for data_source_mapping in DATA_SOURCE_MAPPING.split(','):
    directory, download_url_encoded = data_source_mapping.split(':')
    download_url = unquote(download_url_encoded)
    filename = urlparse(download_url).path
    destination_path = os.path.join(KAGGLE_INPUT_PATH, directory)
    try:
        with urlopen(download_url) as fileres, NamedTemporaryFile() as tfile:
            total_length = fileres.headers['content-length']
            print(f'Downloading {directory}, {total_length} bytes compressed')
            dl = 0
            data = fileres.read(CHUNK_SIZE)
            while len(data) > 0:
                dl += len(data)
                tfile.write(data)
                done = int(50 * dl / int(total_length))
                sys.stdout.write(f"\r[{'=' * done}{' ' * (50-done)}] {dl} bytes downloaded")
                sys.stdout.flush()
                data = fileres.read(CHUNK_SIZE)
            if filename.endswith('.zip'):
              with ZipFile(tfile) as zfile:
                zfile.extractall(destination_path)
            else:
              with tarfile.open(tfile.name) as tar_archive:
                tar_archive.extractall(destination_path)
            print(f'\nDownloaded and uncompressed: {directory}')
    except HTTPError as e:
        print(f'Failed to load (likely expired) {download_url} to path {destination_path}')
        continue
    except OSError as e:
        print(f'Failed to load {download_url} to path {destination_path}')
        continue

print('Data source import complete.')


Downloading mit-ai-news-published-till-2023, 1988389 bytes compressed
Downloaded and uncompressed: mit-ai-news-published-till-2023
Data source import complete.



<div align="center">
<h1><a href="https://github.com/peremartra/Large-Language-Model-Notebooks-Course">Learn by Doing LLM Projects</a></h1>
    <h3>Understand And Apply Large Language Models</h3>
    <h2>IMPLEMENTING SEMANTIC CACHE TO IMPROVE A RAG SYSTEM</h2>
    by <b>Pere Martra</b>
</div>

<br>

<div align="center">
    &nbsp;
    <a target="_blank" href="https://www.linkedin.com/in/pere-martra/"><img src="https://img.shields.io/badge/style--5eba00.svg?label=LinkedIn&logo=linkedin&style=social"></a>
    
</div>

<br>
<hr>


### This notebook is part of a comprehensive course on Large Language Models available on GitHub: https://github.com/peremartra/Large-Language-Model-Notebooks-Course. If you want to stay informed about new lessons or updates, simply follow or star the repository.

In this notebook, we build a typical RAG system with an open-source model and the Chroma vector database, and add a semantic cache in front of the database. The cache stores the user queries it has seen and decides whether the context used to enrich the prompt comes from the vector database or from the cache.

The cache compares questions by the Euclidean distance between their embeddings rather than by exact text match, because "What is the capital of France?" is semantically almost the same as "Tell me the name of the capital of France?"

Therefore, even though the model's response may vary between the two phrasings, the information to retrieve from the vector database is the same. This places the cache between the user and the database, not between the user and the Large Language Model.
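As a minimal sketch of that comparison, the snippet below measures the squared Euclidean distance between two embeddings and applies a threshold. The three-dimensional vectors and the `is_cache_hit` helper are made up for illustration; real sentence embeddings have hundreds of dimensions, and 0.35 is simply the threshold value used later in this notebook.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def is_cache_hit(query_vec, cached_vec, threshold=0.35):
    """Treat the cached entry as a match when the distance is small enough."""
    return squared_l2(query_vec, cached_vec) <= threshold


# Hypothetical toy embeddings, not produced by a real model.
capital_question = [0.9, 0.1, 0.0]
capital_paraphrase = [0.8, 0.2, 0.0]
unrelated_question = [0.0, 0.1, 0.9]

print(is_cache_hit(capital_paraphrase, capital_question))  # True
print(is_cache_hit(unrelated_question, capital_question))  # False
```

Faiss's `IndexFlatL2` reports squared L2 distances, which is why the sketch skips the square root.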

<img src="https://github.com/peremartra/Large-Language-Model-Notebooks-Course/blob/main/img/semantic_cache.jpg?raw=true">

### Feel free to fork or edit the notebook for your own convenience. Please consider ***UPVOTING IT***. It helps others discover the notebook, and it encourages me to continue publishing.

# Import and load the libraries.
To start, we need to install the necessary Python packages.
* **[sentence transformers](https://www.sbert.net/)**. This library transforms sentences into fixed-length vectors, also known as embeddings.
* **[xformers](https://github.com/facebookresearch/xformers)**. A package that provides libraries and utilities for working with transformer models. We install it to avoid an error when working with the model and embeddings.  
* **[chromadb](https://www.trychroma.com/)**. This is our vector database. ChromaDB is open source, easy to use, and widely used for storing embeddings.
* **[accelerate](https://github.com/huggingface/accelerate)**. Necessary to run the model on a GPU.  

!pip install -q transformers==4.38.1
!pip install -qqq accelerate==0.20.3
!pip install -q sentence-transformers==2.4.0
!pip install -q xformers==0.0.24
!pip install -q chromadb==0.4.24

In [2]:
!pip install -U qqq

Collecting qqq
  Downloading qqq-0.0.1-py3-none-any.whl (1.3 kB)
Installing collected packages: qqq
Successfully installed qqq-0.0.1


In [3]:
!pip install -q transformers
!pip install -qqq accelerate
!pip install -q sentence-transformers
!pip install -q xformers
!pip install -q chromadb


I'm sure you know the next two packages, NumPy and pandas, two of the most widely used Python libraries.

NumPy is a powerful library for numerical computing.

Pandas is a library for data manipulation.

In [4]:
import numpy as np
import pandas as pd

# Load the Dataset
Because we are working in a free, limited environment with only about 30 GB of memory, I limit the number of news articles with the variable MAX_NEWS.

The name of the column containing the text of each news article is stored in the variable *DOCUMENT*, and the column used as metadata in *TOPIC*. This information will be needed when we load the data into the Chroma vector database.

In [5]:
news = pd.read_csv('/kaggle/input/mit-ai-news-published-till-2023/articles.csv')
MAX_NEWS = 1000
DOCUMENT="Article Body"
TOPIC="Article Header"


ChromaDB requires each record to have a unique identifier. The following statement creates a new column called **id** from the DataFrame index.


In [6]:
news["id"] = news.index
news.head()

Unnamed: 0.1,Unnamed: 0,Published Date,Author,Source,Article Header,Sub_Headings,Article Body,Url,id
0,0,"July 7, 2023",Adam Zewe,MIT News Office,Learning the language of molecules to predict ...,This AI system only needs a small amount of da...,['Discovering new materials and drugs typicall...,https://news.mit.edu/2023/learning-language-mo...,0
1,1,"July 6, 2023",Alex Ouyang,Abdul Latif Jameel Clinic for Machine Learning...,MIT scientists build a system that can generat...,"BioAutoMATED, an open-source, automated machin...",['Is it possible to build machine-learning mod...,https://news.mit.edu/2023/bioautomated-open-so...,1
2,2,"June 30, 2023",Jennifer Michalowski,McGovern Institute for Brain Research,"When computer vision works more like a brain, ...",Training artificial neural networks with data ...,"['From cameras to self-driving cars, many of t...",https://news.mit.edu/2023/when-computer-vision...,2
3,3,"June 30, 2023",Mary Beth Gallagher,School of Engineering,Educating national security leaders on artific...,"Experts from MIT’s School of Engineering, Schw...",['Understanding artificial intelligence and ho...,https://news.mit.edu/2023/educating-national-s...,3
4,4,"June 30, 2023",Adam Zewe,MIT News Office,Researchers teach an AI to write better chart ...,A new dataset can help scientists develop auto...,['Chart captions that explain complex trends a...,https://news.mit.edu/2023/researchers-chart-ca...,4


In [7]:
#Because it is just a sample we select a small portion of News.
subset_news = news.head(MAX_NEWS)

# Import and configure the Vector Database
I'm going to use ChromaDB, a popular open-source vector database.

First we import ChromaDB, and then the **Settings** class from the **chromadb.config** module. This class lets us change the settings of the ChromaDB system and customize its behavior.

In [8]:
import chromadb
from chromadb.config import Settings

Older Chroma releases were configured by creating a **Settings** object with two parameters:
* chroma_db_impl: the database implementation and the storage format. The usual choice was ***duckdb+parquet***: DuckDB is a high-performance, SQL-compatible engine that operates primarily in memory, and Parquet is a columnar format with good compression and performance for tabular data.

* persist_directory: the directory where the data is stored. It is possible to work without a directory, keeping the data in memory without persistence, but Kaggle doesn't support that.

From Chroma 0.4 onward, the persistent store is created with **chromadb.PersistentClient**, which only needs the ***path*** of the directory where the data will be saved. That is the call used in the next cell, so the **Settings** import is not needed by the rest of the notebook.

In [9]:
chroma_client = chromadb.PersistentClient(path="/path/to/persist/directory")

# Filling and Querying the ChromaDB Database
The data in ChromaDB is stored in collections. If the collection already exists, we need to delete it first.

In the next lines, we create the collection by calling the ***create_collection*** function of the ***chroma_client*** created above.

In [10]:
collection_name = "news_collection"
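# Check every existing collection, not only the first; getattr handles both objects and plain names.
existing_names = [getattr(c, "name", c) for c in chroma_client.list_collections()]
if collection_name in existing_names:
    chroma_client.delete_collection(name=collection_name)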

collection = chroma_client.create_collection(name=collection_name)


It's time to add the data to the collection. When calling ***add*** we pass ***documents***, ***metadatas*** and ***ids***.
* In **documents** we store the long text; it is a different column in each dataset.
* In **metadatas** we can provide a list of topics.
* In **ids** we need to provide a unique identifier for each row. It MUST be unique! I'm creating the IDs from the row positions, up to MAX_NEWS.
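The three arguments are parallel lists: position *i* of each one describes the same record. Here is a small sketch of how they can be built, using a made-up list of rows in place of the DataFrame. The `build_add_arguments` helper is hypothetical and not part of ChromaDB.

```python
def build_add_arguments(rows, document_key, topic_key):
    """Build parallel documents / metadatas / ids lists from a list of dicts."""
    documents = [row[document_key] for row in rows]
    metadatas = [{topic_key: row[topic_key]} for row in rows]
    # One id per row position, so ids are unique by construction.
    ids = [f"id{position}" for position in range(len(rows))]
    return documents, metadatas, ids


# Made-up rows standing in for the news DataFrame.
sample_rows = [
    {"Article Header": "Header A", "Article Body": "Body of article A"},
    {"Article Header": "Header B", "Article Body": "Body of article B"},
]
documents, metadatas, ids = build_add_arguments(
    sample_rows, "Article Body", "Article Header"
)
print(ids)        # ['id0', 'id1']
print(metadatas)  # [{'Article Header': 'Header A'}, {'Article Header': 'Header B'}]
```

Deriving the ids from the number of rows, rather than from a fixed constant, keeps the three lists the same length even when the dataset is shorter than expected.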


In [11]:
collection.add(
    documents=subset_news[DOCUMENT].tolist(),
    metadatas=[{TOPIC: topic} for topic in subset_news[TOPIC].tolist()],
    ids=[f"id{x}" for x in range(len(subset_news))],  # one id per row, even if fewer than MAX_NEWS rows exist
)

/root/.cache/chroma/onnx_models/all-MiniLM-L6-v2/onnx.tar.gz: 100%|██████████| 79.3M/79.3M [00:03<00:00, 21.9MiB/s]


In [12]:
def query_database(query_text, n_results=10):
    results = collection.query(query_texts=query_text, n_results=n_results )
    return results

## Creating the semantic cache system
To implement the cache system, we will use Faiss, a library that allows storing embeddings in memory. It's quite similar to what Chroma does, but without its persistence.

For this purpose, we will create a class called semantic_cache that will work with its own encoder and provide the necessary functions for the user to perform queries.

In this class, we first query Faiss (the cache). If the distance to the closest cached question is at or below the specified threshold, the class returns the result from the cache. Otherwise, it fetches the result from the Chroma database.
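The lookup flow can be sketched without Faiss or Chroma. The class below is only an illustration under simplifying assumptions: `TinySemanticCache`, the `embed` and `fetch` callables, and the two-dimensional toy vectors are all made up, and a plain Python loop replaces the Faiss index.

```python
class TinySemanticCache:
    """Illustrative cache: the nearest cached question within a threshold wins."""

    def __init__(self, embed, fetch, threshold=0.35):
        self.embed = embed  # function: question -> list of floats
        self.fetch = fetch  # function: question -> answer (the slow path)
        self.threshold = threshold
        self.embeddings = []
        self.answers = []

    @staticmethod
    def _squared_l2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def ask(self, question):
        vector = self.embed(question)
        # Find the closest cached question, if any.
        best_row, best_distance = -1, float("inf")
        for row, cached in enumerate(self.embeddings):
            distance = self._squared_l2(vector, cached)
            if distance < best_distance:
                best_row, best_distance = row, distance
        if best_row >= 0 and best_distance <= self.threshold:
            return self.answers[best_row], "cache"
        # Cache miss: take the slow path and remember the result.
        answer = self.fetch(question)
        self.embeddings.append(vector)
        self.answers.append(answer)
        return answer, "database"


# Hypothetical stand-ins for the sentence encoder and the Chroma query.
toy_vectors = {
    "Are LLMs ready to drive cars?": [1.0, 0.0],
    "Can LLMs drive cars?": [0.9, 0.1],
    "What is a vector database?": [0.0, 1.0],
}
cache_demo = TinySemanticCache(
    embed=toy_vectors.get, fetch=lambda q: f"context for: {q}"
)

print(cache_demo.ask("Are LLMs ready to drive cars?")[1])  # database
print(cache_demo.ask("Can LLMs drive cars?")[1])           # cache
print(cache_demo.ask("What is a vector database?")[1])     # database
```

Note that row 0 is a valid cache entry here, so the hit test uses `best_row >= 0`; only a negative row means that nothing was found.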

In [13]:
!pip install -q faiss-cpu


In [14]:
import faiss
from sentence_transformers import SentenceTransformer
import time
import json

class semantic_cache:
    def __init__(self, json_file='cache.json'):
        # Initialize Faiss index with Euclidean distance
        self.index = faiss.IndexFlatL2(768)  # Use IndexFlatL2 with Euclidean distance
        if self.index.is_trained:
            print('Index trained')

        # Initialize Sentence Transformer model
        self.encoder = SentenceTransformer('all-mpnet-base-v2')
        #self.encoder = SentenceTransformer('all-MiniLM-L6-v2')
        self.MAX_SIZE_CACHE = 100

        # Set Euclidean distance threshold
        # a distance of 0 means identical sentences
        self.euclidean_threshold = 0.35 # Only return cached results with a distance of 0.35 or less
        self.json_file = json_file
        self.load_cache()

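    def load_cache(self):
        # Load cache from JSON file, creating an empty cache if the file is not found
        try:
            with open(self.json_file, 'r') as file:
                self.cache = json.load(file)
            # Rebuild the Faiss index so its row ids line up with the cached lists
            # (np is the NumPy import from the earlier cell)
            if self.cache['embeddings']:
                self.index.add(np.array(self.cache['embeddings'], dtype='float32'))
        except FileNotFoundError:
            self.cache = {'questions': [], 'embeddings': [], 'answers': [], 'response_text': []}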

    def save_cache(self):
        # Save the cache to the JSON file
        with open(self.json_file, 'w') as file:
            json.dump(self.cache, file)

    def generate_answer(self, question: str):
        # Query the Chroma database (through query_database) and return
        # both the raw result and the text of the closest document
        try:
            result = query_database([question], 1)
            response_text = result['documents'][0][0]

            return result, response_text
        except Exception as e:
            raise RuntimeError(f"Error during 'generate_answer' method: {e}")

    def ask(self, question: str) -> str:
        # Method to retrieve an answer from the cache or generate a new one
        start_time = time.time()
        try:
            l = [question]
            embedding = self.encoder.encode(l)

            if self.index.is_trained:
                print('Index trained')
            else:
                self.index.train(embedding)
            # Search for the nearest neighbor in the index
            self.index.nprobe = 8
            D, I = self.index.search(embedding, 1)

            if D[0][0] >= 0:
                print(self.euclidean_threshold)
                print(f'D[0][0]: {D[0][0]}')
                print(f'I[0][0]: {I[0][0]}')
                # Faiss returns -1 when nothing is found; row 0 is a valid cache entry
                if I[0][0] >= 0 and D[0][0] <= self.euclidean_threshold:
                    row_id = int(I[0][0])
                    print(f'{D[0][0]} smaller than {self.euclidean_threshold}')
                    print(f'Found cache in row: {row_id} with score {D[0][0]}')
                    end_time = time.time()
                    elapsed_time = end_time - start_time
                    print(f"Time taken: {elapsed_time} seconds")
                    return self.cache['response_text'][row_id]

            # Handle the case when there are not enough results or Euclidean distance is not met
            answer, response_text = self.generate_answer(question)

            self.cache['questions'].append(question)
            self.cache['embeddings'].append(embedding[0].tolist())
            self.cache['answers'].append(answer)
            self.cache['response_text'].append(response_text)

            print(f'questions: {question}')
            print(f'answers: {answer}')
            print(f'response_text: {response_text}')

            #self.index.train(embedding)
            self.index.add(embedding)
            self.save_cache()
            end_time = time.time()
            elapsed_time = end_time - start_time
            print(f"Time taken: {elapsed_time} seconds")

            return response_text
        except Exception as e:
            raise RuntimeError(f"Error during 'ask' method: {e}")



In [15]:
cache = semantic_cache('nuevo3.json')

Index trained


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [16]:
question_def = "I need a briefing around 20 words about the study of Learning the language of molecules to predict their properties is?"
results = cache.ask(question_def)

Index trained
0.35
D[0][0]: inf
I[0][0]: 0
questions: I need a briefing around 20 words about the study of Learning the language of molecules to predict their properties is?
answers: {'ids': [['id0']], 'distances': [[0.8585882186889648]], 'metadatas': [[{'Article Header': 'Learning the language of molecules to predict their properties'}]], 'embeddings': None, 'documents': [["['Discovering new materials and drugs typically involves a manual, trial-and-error process that can take decades and cost millions of dollars. To streamline this process, scientists often use machine learning to predict molecular properties and narrow down the molecules they need to synthesize and test in the lab.', 'Researchers from MIT and the MIT-Watson AI Lab have developed a new, unified framework that can simultaneously predict molecular properties and generate new molecules much more efficiently than these popular deep-learning approaches.', 'To teach a machine-learning model to predict a molecule’s biologic

In [17]:
print(results)

['Discovering new materials and drugs typically involves a manual, trial-and-error process that can take decades and cost millions of dollars. To streamline this process, scientists often use machine learning to predict molecular properties and narrow down the molecules they need to synthesize and test in the lab.', 'Researchers from MIT and the MIT-Watson AI Lab have developed a new, unified framework that can simultaneously predict molecular properties and generate new molecules much more efficiently than these popular deep-learning approaches.', 'To teach a machine-learning model to predict a molecule’s biological or mechanical properties, researchers must show it millions of labeled molecular structures — a process known as training. Due to the expense of discovering molecules and the challenges of hand-labeling millions of structures, large training datasets are often hard to come by, which limits the effectiveness of machine-learning approaches.', 'By contrast, the system created

In [18]:
question1 = "Are LLMs ready to drive cars?"
results = cache.ask(question1)

Index trained
0.35
D[0][0]: 1.7402358055114746
I[0][0]: 0
questions: Are LLMs ready to drive cars?
answers: {'ids': [['id787']], 'distances': [[1.0734435319900513]], 'metadatas': [[{'Article Header': 'Car talk'}]], 'embeddings': None, 'documents': [["['Discussions of self-driving vehicles are often accompanied by highly confident predictions: Visions of the future include whole networks of automated cars seamlessly zipping around metropolitan areas, safely and efficiently, with every person inside them a passive, hands-off passenger.', 'On Tuesday at MIT, the U.S. government’s chief auto safety official offered a more restrained view, suggesting that technology could provide important new safeguards for cars, while observing that it is too soon to say precisely what form vehicular automation will eventually take.', '“Right now, we really don’t know what the future is,” said Mark Rosekind, administrator of the National Highway Traffic Safety Administration (NHTSA), during a public forum

In [19]:
print(results)

['Discussions of self-driving vehicles are often accompanied by highly confident predictions: Visions of the future include whole networks of automated cars seamlessly zipping around metropolitan areas, safely and efficiently, with every person inside them a passive, hands-off passenger.', 'On Tuesday at MIT, the U.S. government’s chief auto safety official offered a more restrained view, suggesting that technology could provide important new safeguards for cars, while observing that it is too soon to say precisely what form vehicular automation will eventually take.', '“Right now, we really don’t know what the future is,” said Mark Rosekind, administrator of the National Highway Traffic Safety Administration (NHTSA), during a public forum at the Institute.', '“There’s this image we’ll be taking naps and doing crossword puzzles” while in cars, Rosekind noted, adding that the more immediate question is what it would take to make such a scenario possible. “Can we get there?” he asked.', 

In [20]:
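question_def = "Can LLMs drive cars?"
# Note: this call passes question1 again, so the cache returns an exact match (distance 0.0).
# The paraphrased question_def is sent to the cache later, before building the prompt.
results = cache.ask(question1)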

Index trained
0.35
D[0][0]: 0.0
I[0][0]: 1
0.0 smaller than 0.35
Found cache in row: 1 with score 0.0
Time taken: 0.01731729507446289 seconds


In [21]:
print(results)

['Discussions of self-driving vehicles are often accompanied by highly confident predictions: Visions of the future include whole networks of automated cars seamlessly zipping around metropolitan areas, safely and efficiently, with every person inside them a passive, hands-off passenger.', 'On Tuesday at MIT, the U.S. government’s chief auto safety official offered a more restrained view, suggesting that technology could provide important new safeguards for cars, while observing that it is too soon to say precisely what form vehicular automation will eventually take.', '“Right now, we really don’t know what the future is,” said Mark Rosekind, administrator of the National Highway Traffic Safety Administration (NHTSA), during a public forum at the Institute.', '“There’s this image we’ll be taking naps and doing crossword puzzles” while in cars, Rosekind noted, adding that the more immediate question is what it would take to make such a scenario possible. “Can we get there?” he asked.', 

Once the information is inside the database we can query it and ask for data that matches our needs. The search runs over the content of the documents and doesn't look for an exact word or phrase; the results are based on the similarity between the search terms and the content of the documents.

The metadata is not used in the search, but it can be used to filter or refine the results after the initial search.
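As an illustration of that post-search filtering, the sketch below keeps only the hits whose metadata matches a given value. The `filter_by_metadata` helper is hypothetical, and `fake_results` is made-up data that copies the nested-list shape of the query output printed in the earlier cells.

```python
def filter_by_metadata(results, key, value):
    """Keep only the hits of the first query whose metadata[key] equals value."""
    kept = []
    hits = zip(results["ids"][0], results["documents"][0], results["metadatas"][0])
    for doc_id, document, metadata in hits:
        if metadata.get(key) == value:
            kept.append({"id": doc_id, "document": document, "metadata": metadata})
    return kept


# Made-up result with the same structure as a Chroma query result for one query.
fake_results = {
    "ids": [["id0", "id787"]],
    "documents": [["text about molecules", "text about cars"]],
    "metadatas": [[
        {"Article Header": "Learning the language of molecules to predict their properties"},
        {"Article Header": "Car talk"},
    ]],
}
car_hits = filter_by_metadata(fake_results, "Article Header", "Car talk")
print([hit["id"] for hit in car_hits])  # ['id787']
```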


# Loading the model and creating the prompt
TRANSFORMERS!!
Time to use **transformers**, the best-known library from [Hugging Face](https://huggingface.co/) for working with language models.

We are importing:
* **AutoTokenizer**: a utility class for tokenizing text inputs in a way that is compatible with various pre-trained language models.
* **AutoModelForCausalLM**: provides an interface to pre-trained language models designed for text generation with causal language modeling, such as GPT models, ***databricks/dolly-v2-3b***, or ***google/gemma-2b-it***, the model loaded in this notebook.
* **pipeline**: provides a simple interface for performing various natural language processing (NLP) tasks, such as text generation (our case) or text classification.

The model first selected for this notebook was [dolly-v2-3b](https://huggingface.co/databricks/dolly-v2-3b), the smallest Dolly model. It has 3 billion parameters, more than enough for our sample, and works much better than GPT-2. The code below loads [gemma-2b-it](https://huggingface.co/google/gemma-2b-it) instead and keeps the Dolly model id as a commented-out alternative.

Please feel free to test [different models](https://huggingface.co/models?pipeline_tag=text-generation&sort=trending); search for NLP models trained for text generation. My recommendation is to choose "small" models, or we will run out of memory in Kaggle.  
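A back-of-the-envelope way to judge whether a model is "small" enough is to multiply its parameter count by the bytes used per parameter. The sketch below counts only the weights, so it is a lower bound: activations and the text being generated need extra memory. The helper name is made up for illustration.

```python
def estimate_weight_memory_gb(n_parameters, bytes_per_parameter):
    """Rough size of the model weights alone, in gigabytes (10**9 bytes)."""
    return n_parameters * bytes_per_parameter / 1e9


three_billion = 3_000_000_000  # e.g. a 3-billion-parameter model
print(estimate_weight_memory_gb(three_billion, 4))  # 12.0 (float32)
print(estimate_weight_memory_gb(three_billion, 2))  # 6.0 (bfloat16 or float16)
```

Loading in a 16-bit type such as `bfloat16` halves the memory needed for the weights compared with 32-bit floats.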


In [22]:
from getpass import getpass
hf_key = getpass("Hugging Face Key: ")

Hugging Face Key: ··········


In [23]:
!huggingface-cli login --token $hf_key

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [24]:
!pip install torch



In [25]:
import torch
from torch import cuda
#In a MAC Silicon the device must be 'mps'
# device = torch.device('mps') #to use with MAC Silicon
device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

In [26]:
from transformers import AutoTokenizer, AutoModelForCausalLM

#model_id = "databricks/dolly-v2-3b"
model_id = "google/gemma-2b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id,
                                             device_map="cuda",
                                            torch_dtype=torch.bfloat16)



In [27]:
results = cache.ask(question_def)

Index trained
0.35
D[0][0]: 0.19357122480869293
I[0][0]: 1
0.19357122480869293 smaller than 0.35
Found cache in row: 1 with score 0.19357122480869293
Time taken: 0.017775297164916992 seconds


In [28]:
prompt_template = f"Relevant context: {results}\n\n The user's question: {question_def}"
prompt_template

"Relevant context: ['Discussions of self-driving vehicles are often accompanied by highly confident predictions: Visions of the future include whole networks of automated cars seamlessly zipping around metropolitan areas, safely and efficiently, with every person inside them a passive, hands-off passenger.', 'On Tuesday at MIT, the U.S. government’s chief auto safety official offered a more restrained view, suggesting that technology could provide important new safeguards for cars, while observing that it is too soon to say precisely what form vehicular automation will eventually take.', '“Right now, we really don’t know what the future is,” said Mark Rosekind, administrator of the National Highway Traffic Safety Administration (NHTSA), during a public forum at the Institute.', '“There’s this image we’ll be taking naps and doing crossword puzzles” while in cars, Rosekind noted, adding that the more immediate question is what it would take to make such a scenario possible. “Can we get t

In [29]:
input_ids = tokenizer(prompt_template, return_tensors="pt").to("cuda")

In [30]:
outputs = model.generate(**input_ids,
                         max_new_tokens=256)
print(tokenizer.decode(outputs[0]))

<bos>Relevant context: ['Discussions of self-driving vehicles are often accompanied by highly confident predictions: Visions of the future include whole networks of automated cars seamlessly zipping around metropolitan areas, safely and efficiently, with every person inside them a passive, hands-off passenger.', 'On Tuesday at MIT, the U.S. government’s chief auto safety official offered a more restrained view, suggesting that technology could provide important new safeguards for cars, while observing that it is too soon to say precisely what form vehicular automation will eventually take.', '“Right now, we really don’t know what the future is,” said Mark Rosekind, administrator of the National Highway Traffic Safety Administration (NHTSA), during a public forum at the Institute.', '“There’s this image we’ll be taking naps and doing crossword puzzles” while in cars, Rosekind noted, adding that the more immediate question is what it would take to make such a scenario possible. “Can we g

The next step is to initialize the pipeline using the objects created above.

The model's response is limited to 256 new tokens. For this project I'm not interested in a longer response, but the limit can easily be raised.

By setting ***device_map*** to ***auto*** we instruct the model to automatically select the most appropriate device, CPU or GPU, for text generation.  

In [None]:
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    max_new_tokens=256,
    trust_remote_code=True,
    device_map="auto",
)




## Creating the extended prompt
To create the prompt, we use the result of querying the vector database and the sentence entered by the user.

The prompt has two parts: the **relevant context**, which is the information retrieved from the database, and the **user's question**.

We only need to join the two parts together to create the prompt that we are going to send to the model.

You can limit the length of the context passed to the model. This is useful because one of the datasets contains a really long text in the document field, which can cause memory problems.
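The context limit described above can be sketched as a small helper. This is a minimal sketch under assumptions, not the notebook's exact code: `build_prompt` and `max_context_chars` are hypothetical names, and the retrieved documents are assumed to arrive as a plain list of strings. The 5120-character default mirrors the commented-out slice in the next cell.

```python
def build_prompt(documents, question, max_context_chars=5120):
    """Join retrieved passages, cap the context length, and append the question.

    `documents` is assumed to be a list of strings returned by the vector
    database query; `max_context_chars` is a hypothetical parameter name.
    """
    # Prefix each passage with '#' so the passages can be told apart.
    context = " ".join(f"#{doc}" for doc in documents)
    # Truncate by characters, a cheap proxy for limiting the number of tokens.
    context = context[:max_context_chars]
    return f"Relevant context: {context}\n\n The user's question: {question}"


example_docs = ["Self-driving cars are being tested.", "Safety is the main goal."]
example_prompt = build_prompt(example_docs, "Can LLMs drive cars?", max_context_chars=40)
print(example_prompt)
```

Truncating by characters is crude, since it can cut a passage mid-sentence, but it keeps the prompt size predictable without needing the tokenizer.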

In [None]:
question = question_def
#context = " ".join([f"#{str(i)}" for i in results["documents"][0]])
#context = context[0:5120]
prompt_template = f"Relevant context: {results}\n\n The user's question: {question}"
prompt_template

"Relevant context: ['Discussions of self-driving vehicles are often accompanied by highly confident predictions: Visions of the future include whole networks of automated cars seamlessly zipping around metropolitan areas, safely and efficiently, with every person inside them a passive, hands-off passenger.', 'On Tuesday at MIT, the U.S. government’s chief auto safety official offered a more restrained view, suggesting that technology could provide important new safeguards for cars, while observing that it is too soon to say precisely what form vehicular automation will eventually take.', '“Right now, we really don’t know what the future is,” said Mark Rosekind, administrator of the National Highway Traffic Safety Administration (NHTSA), during a public forum at the Institute.', '“There’s this image we’ll be taking naps and doing crossword puzzles” while in cars, Rosekind noted, adding that the more immediate question is what it would take to make such a scenario possible. “Can we get t

Now all that remains is to send the prompt to the model and wait for its response!


In [None]:
# Call the pipeline object created above (`pipe`), not the `pipeline` factory
# function, which expects a task name as its first argument and otherwise
# raises `KeyError: "Unknown task ..."`.
lm_response = pipe(prompt_template)
print(lm_response[0]["generated_text"])
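By default, a `text-generation` pipeline returns a list with one dictionary per generated sequence, and its `generated_text` field contains the prompt followed by the model's continuation. Here is a minimal sketch of a helper that keeps only the continuation. `extract_answer` is a hypothetical name, and the `fake_response` value is made up purely to show the shape of the data; it is not real model output.

```python
def extract_answer(lm_response, prompt):
    """Return only the text the model added after the prompt.

    Assumes `lm_response` has the shape a text-generation pipeline returns:
    a list of dicts, each with a "generated_text" key.
    """
    generated = lm_response[0]["generated_text"]
    # With default settings the prompt is echoed at the start of the output.
    if generated.startswith(prompt):
        generated = generated[len(prompt):]
    return generated.strip()


# Made-up data with the same structure as a pipeline result (not real output).
fake_prompt = "Relevant context: ...\n\n The user's question: Can LLMs drive cars?"
fake_response = [{"generated_text": fake_prompt + "\n\nPlaceholder answer."}]
print(extract_answer(fake_response, fake_prompt))
```

Passing `return_full_text=False` when calling the pipeline should have a similar effect without any post-processing.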

# Conclusions, Fork and Improve
A very short notebook, but with a lot of content.

We have used a vector database to store information. Then we retrieved it and used it to create an extended prompt, which we sent to one of the newer large language models available on Hugging Face.

The model returns a response that takes into account the context we passed to it in the prompt.

This way of working with language models is very powerful.

We can make the model use our information without any fine-tuning. This technique has significant advantages over fine-tuning: no training run is needed, and updating what the model knows only requires updating the database.

Please don't stop here.

* The notebook is prepared to use two more datasets. Try them out.

* Find another model on Hugging Face and compare it.

* Modify the way the prompt is created.
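
As a starting point for the last suggestion, here is one possible variation of the prompt, sketched under assumptions: the function name `build_instruction_prompt` and the wording of the instruction are illustrative choices, not something this notebook prescribes. It numbers the retrieved passages and asks the model to rely only on them.

```python
def build_instruction_prompt(documents, question):
    """Build a prompt with an explicit instruction and numbered passages."""
    # Number each passage so the model (and the reader) can refer to it.
    numbered = "\n".join(f"[{i}] {doc}" for i, doc in enumerate(documents, start=1))
    return (
        "Answer the question using only the context below. "
        "If the context is not enough, say so.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}\nAnswer:"
    )


demo_prompt = build_instruction_prompt(
    ["First passage.", "Second passage."], "Can LLMs drive cars?"
)
print(demo_prompt)
```

Comparing the answers produced by this template with those from the original prompt is a simple way to see how much the prompt wording matters.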

## Continue learning
This notebook is part of a [course on large language models](https://github.com/peremartra/Large-Language-Model-Notebooks-Course) I'm working on, available on [GitHub](https://github.com/peremartra/Large-Language-Model-Notebooks-Course). You can find the other lessons there and, if you like the course, don't forget to subscribe to receive notifications of new lessons.

Other notebooks in the Large Language Models series:
https://www.kaggle.com/code/peremartramanonellas/ask-your-documents-with-langchain-vectordb-hf

### If you liked the notebook, please consider ***UPVOTING IT***. It helps others discover it and encourages me to continue publishing.