# Using a vector / embedding database to provide context to an LLM
Inspired by: [Pere Martra's LLM Notebooks Course](https://github.com/peremartra/Large-Language-Model-Notebooks-Course)

In [1]:
# hugging face's transformers library
%pip install -q transformers==4.41.2

# hugging face datasets
%pip install datasets

Note: you may need to restart the kernel to use updated packages.


In [2]:
# This framework provides an easy method to compute dense vector representations for sentences, paragraphs, and images
!pip install -q sentence-transformers==2.2.2
# Chroma - the open-source embedding database.
!pip install -q chromadb==0.4.20

In [3]:
import numpy as np
import pandas as pd

In [4]:
from datasets import load_dataset

# Getting our data


In [5]:
news = pd.read_csv("./datasets/labelled_newscatcher_dataset.csv", sep=";")
news["id"] = news.index
news.head()

Unnamed: 0,topic,link,domain,published_date,title,lang,id
0,SCIENCE,https://www.eurekalert.org/pub_releases/2020-0...,eurekalert.org,2020-08-06 13:59:45,A closer look at water-splitting's solar fuel ...,en,0
1,SCIENCE,https://www.pulse.ng/news/world/an-irresistibl...,pulse.ng,2020-08-12 15:14:19,"An irresistible scent makes locusts swarm, stu...",en,1
2,SCIENCE,https://www.express.co.uk/news/science/1322607...,express.co.uk,2020-08-13 21:01:00,Artificial intelligence warning: AI will know ...,en,2
3,SCIENCE,https://www.ndtv.com/world-news/glaciers-could...,ndtv.com,2020-08-03 22:18:26,Glaciers Could Have Sculpted Mars Valleys: Study,en,3
4,SCIENCE,https://www.thesun.ie/tech/5742187/perseid-met...,thesun.ie,2020-08-12 19:54:36,Perseid meteor shower 2020: What time and how ...,en,4


In [6]:
# some constants
MAX_NEWS = 1000
DOCUMENT="title"
TOPIC="topic"

In [7]:
# create a smaller subset of the data so that embedding and indexing stay fast
subset_news = news.head(MAX_NEWS)

In [8]:
import chromadb
from chromadb.config import Settings
# Settings lets you customize the behavior of the ChromaDB system; it is imported here but not used below.

In [9]:
chroma_client = chromadb.PersistentClient(path="./")

# Filling and querying the ChromaDB Database

The data in ChromaDB is stored as a collection. We need to delete the collection if it already exists.

In [10]:
from datetime import datetime

In [11]:
# unique collection name (Unix timestamp suffix; strftime("%s") is not portable across platforms)
collection_name = "news_collection" + str(int(datetime.now().timestamp()))

# delete the collection if it already exists (check every existing collection, not just the first)
existing_names = [c.name for c in chroma_client.list_collections()]
if collection_name in existing_names:
    chroma_client.delete_collection(name=collection_name)

# finally, we get our collection
collection = chroma_client.create_collection(name=collection_name)

It's time to add the data to the collection. When calling the `add` function we pass at least `documents`, `metadatas` and `ids`.

- In **documents** we store the main text; which column holds it differs from dataset to dataset (here it is the title).
- In **metadatas** we can pass a list of topics.
  - The metadata is not used in the search, but it can be used to filter or refine the results after the initial search.
- In **ids** we need to provide a unique identifier for each row. It MUST be unique! I'm creating the IDs from a running index over the MAX_NEWS rows.
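As a minimal sketch of the three arguments described above, the helper below assembles them as parallel lists and checks that the ids are unique. `build_add_payload` is a hypothetical name of mine, not part of ChromaDB, and the two toy rows stand in for the news DataFrame so the snippet runs without a database.

```python
def build_add_payload(rows, document_key="title", topic_key="topic", id_key="id"):
    """Return documents, metadatas and ids as three parallel lists.

    `rows` is assumed to be a list of dicts, for example the output of
    subset_news.to_dict("records"). Illustrative helper, not a ChromaDB API.
    """
    documents = [row[document_key] for row in rows]
    metadatas = [{topic_key: row[topic_key]} for row in rows]
    # Build each id from the row's own "id" field so it follows the data.
    ids = [f"id{row[id_key]}" for row in rows]
    if len(set(ids)) != len(ids):
        raise ValueError("ids must be unique")
    return {"documents": documents, "metadatas": metadatas, "ids": ids}


rows = [
    {"id": 3, "title": "Glaciers Could Have Sculpted Mars Valleys: Study", "topic": "SCIENCE"},
    {"id": 8, "title": "Toshiba shuts the lid on laptops after 35 years", "topic": "TECHNOLOGY"},
]
payload = build_add_payload(rows)
print(payload["ids"])
```

With a real collection, the same dictionary could then be passed on as `collection.add(**payload)`.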

In [12]:
collection.add(
    documents=subset_news[DOCUMENT].tolist(),
    metadatas=[{TOPIC: topic} for topic in subset_news[TOPIC].tolist()],
    ids=[f"id{x}" for x in range(len(subset_news))],  # one id per row actually added
)

2024-07-12 15:59:25.080207 [W:onnxruntime:, helper.cc:82 IsInputSupported] CoreML does not support input dim > 16384. Input:embeddings.word_embeddings.weight, shape: {30522,384}
2024-07-12 15:59:25.080614 [W:onnxruntime:, coreml_execution_provider.cc:104 GetCapability] CoreMLExecutionProvider::GetCapability, number of partitions supported by CoreML: 49 number of nodes in the graph: 323 number of nodes supported by CoreML: 231


In [13]:
type(collection)

chromadb.api.models.Collection.Collection

# Querying based on distance in the vector space 

In [14]:
results = collection.query(query_texts=["game"], n_results=1 )

print(results)

{'ids': [['id825']], 'distances': [[1.303723692893982]], 'metadatas': [[{'topic': 'TECHNOLOGY'}]], 'embeddings': None, 'documents': [['A new ‘Call Of Duty’ alternate reality game has launched']], 'uris': None, 'data': None}
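The `distances` value is what ranks the results: smaller means closer in the vector space. My understanding is that Chroma's default metric is squared L2, but treat that as an assumption and check it for your version. The toy sketch below uses made-up 3-dimensional vectors instead of real embeddings, and `squared_l2` and `nearest` are illustrative names.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(query_vec, stored, n_results=1):
    """Return the n_results closest (id, distance) pairs, closest first."""
    scored = [(doc_id, squared_l2(query_vec, vec)) for doc_id, vec in stored.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:n_results]


# Made-up vectors standing in for stored embeddings.
stored = {
    "id0": [1.0, 0.0, 0.0],
    "id1": [0.0, 1.0, 0.0],
    "id2": [0.5, 0.5, 0.0],
}
query_vec = [0.0, 0.75, 0.0]
top = nearest(query_vec, stored, n_results=2)
print(top)  # [('id1', 0.0625), ('id2', 0.3125)]
```

A real vector database uses an approximate index instead of scanning every vector, but the ranking idea is the same.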


# Vector MAP

In [15]:
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

In [16]:
getado = collection.get(ids="id815", 
                       include=["documents", "embeddings"])

In [17]:
print(getado)

{'ids': ['id815'], 'embeddings': [[0.01394363772124052, 0.05963527038693428, 0.02171013318002224, 0.01002002228051424, -0.007083873730152845, -0.007371362764388323, 0.0019744352903217077, 0.100026935338974, -0.008951264433562756, 0.00966266542673111, -0.02148592844605446, 0.0073929340578615665, -0.037767454981803894, 0.029468564316630363, -0.0341857373714447, 0.010592324659228325, 0.04066145420074463, -0.08658261597156525, -0.026019027456641197, -0.040663450956344604, -0.08636076003313065, 0.08372718095779419, 0.02209404483437538, 0.05978499352931976, 0.018113739788532257, -0.017061375081539154, -0.11369919776916504, -0.030749034136533737, 0.02938021346926689, 0.13694056868553162, -0.03374531865119934, -0.004400501027703285, -0.05974799767136574, 0.03632986173033714, -0.04297184944152832, -0.0338977612555027, -0.019826829433441162, -0.014812154695391655, -0.04478452727198601, 0.03193549066781998, 0.0506109781563282, -0.009234302677214146, -0.0023277224972844124, 0.04476313292980194, 0.

In [18]:
# This is the entire vector representation of the collection entry
word_vectors = getado["embeddings"]
word_list = getado["documents"]
word_vectors

[[0.01394363772124052,
  0.05963527038693428,
  0.02171013318002224,
  0.01002002228051424,
  -0.007083873730152845,
  -0.007371362764388323,
  0.0019744352903217077,
  0.100026935338974,
  -0.008951264433562756,
  0.00966266542673111,
  -0.02148592844605446,
  0.0073929340578615665,
  -0.037767454981803894,
  0.029468564316630363,
  -0.0341857373714447,
  0.010592324659228325,
  0.04066145420074463,
  -0.08658261597156525,
  -0.026019027456641197,
  -0.040663450956344604,
  -0.08636076003313065,
  0.08372718095779419,
  0.02209404483437538,
  0.05978499352931976,
  0.018113739788532257,
  -0.017061375081539154,
  -0.11369919776916504,
  -0.030749034136533737,
  0.02938021346926689,
  0.13694056868553162,
  -0.03374531865119934,
  -0.004400501027703285,
  -0.05974799767136574,
  0.03632986173033714,
  -0.04297184944152832,
  -0.0338977612555027,
  -0.019826829433441162,
  -0.014812154695391655,
  -0.04478452727198601,
  0.03193549066781998,
  0.0506109781563282,
  -0.009234302677214146

# Loading the model and creating the prompt

We will use the transformers library from Hugging Face.

We are importing:

- **AutoTokenizer**: a utility class that loads the tokenizer matching a given pre-trained language model.
- **AutoModelForCausalLM**: an interface to pre-trained language models designed for text generation through causal language modeling (e.g., GPT models), such as TinyLlama, the model used in this notebook, or databricks/dolly-v2-3b.
- **pipeline**: a simple interface for performing various natural language processing (NLP) tasks, such as text generation (our case) or text classification.

The model that we will use is a Small Language Model called TinyLlama.

In [19]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
#model_id = "databricks/dolly-v2-3b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
lm_model = AutoModelForCausalLM.from_pretrained(model_id)

The next step is to initialize the pipeline using the objects created above.

The model's response is limited to 256 new tokens (`max_new_tokens=256`). For this project I'm not interested in a longer response, but the limit can easily be raised.

Setting `device_map` to `"auto"` asks the library to automatically select the most appropriate device, CPU or GPU, for text generation.

In [20]:
pipe = pipeline(
    "text-generation",
    model=lm_model,
    tokenizer=tokenizer,
    max_new_tokens=256,
    device_map="auto",
)

In [21]:
results

{'ids': [['id825']],
 'distances': [[1.303723692893982]],
 'metadatas': [[{'topic': 'TECHNOLOGY'}]],
 'embeddings': None,
 'documents': [['A new ‘Call Of Duty’ alternate reality game has launched']],
 'uris': None,
 'data': None}
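Every populated field in the result is a list of lists, with one inner list per query text. A small sketch for turning the hits of the first query into one row per hit, using the result shown above as literal input (`first_query_hits` and `example_results` are hypothetical names):

```python
# Copy of the query result printed above, kept under a separate name.
example_results = {
    "ids": [["id825"]],
    "distances": [[1.303723692893982]],
    "metadatas": [[{"topic": "TECHNOLOGY"}]],
    "embeddings": None,
    "documents": [["A new ‘Call Of Duty’ alternate reality game has launched"]],
    "uris": None,
    "data": None,
}


def first_query_hits(results):
    """Zip the parallel lists for query 0 into one dict per hit."""
    return [
        {"id": doc_id, "distance": dist, "topic": meta["topic"], "document": doc}
        for doc_id, dist, meta, doc in zip(
            results["ids"][0],
            results["distances"][0],
            results["metadatas"][0],
            results["documents"][0],
        )
    ]


hits = first_query_hits(example_results)
for hit in hits:
    print(f'{hit["id"]} ({hit["topic"]}, {hit["distance"]:.3f}): {hit["document"]}')
```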

In [22]:
question = "Can I buy a new Toshiba laptop?"
# get 20 results for laptop
results = collection.query(query_texts=["laptop"], n_results=20 )
context = " ".join([f"#{str(i)}" for i in results["documents"][0]])
#context = context[0:5120]
prompt_template_with_context = f"""
Relevant context: {context}
Considering the relevant context, answer the question.
Question: {question}
Answer: """

prompt_template_without_context = f"""
Question: {question}
Answer: """
print(f'{prompt_template_with_context=}\n')
print(f'{prompt_template_without_context=}\n')

prompt_template_with_context="\nRelevant context: #The Legendary Toshiba is Officially Done With Making Laptops #3 gaming laptop deals you can’t afford to miss today #Lenovo and HP control half of the global laptop market #Asus ROG Zephyrus G14 gaming laptop announced in India #Acer Swift 3 featuring a 10th-generation Intel Ice Lake CPU, 2K screen, and more launched in India for INR 64999 (US$865) #Apple's Next MacBook Could Be the Cheapest in Company's History #Features of Huawei's Desktop Computer Revealed #Redmi to launch its first gaming laptop on August 14: Here are all the details #Toshiba shuts the lid on laptops after 35 years #This is the cheapest Windows PC by a mile and it even has a spare SSD slot #Apple to Reportedly Launch Its Cheapest MacBook Ever #MacBook Air (2020) review: Nobody does it better #Dell announces the premium Latitude 7410 Chromebook Enterprise: available now #Surface Reveals Microsoft’s Turbocharged Android #Dell have slashed over $1100 off their RTX 2080
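The commented-out `context[0:5120]` line hints at capping the context length. A minimal sketch of one way to do that, assuming we prefer to drop whole titles once a character budget is reached instead of cutting a title in half. `build_context` and `build_prompt` are hypothetical helpers, and the 5120 default simply mirrors the commented-out line.

```python
def build_context(documents, max_chars=5120):
    """Join '#'-prefixed documents with spaces, stopping before max_chars is exceeded."""
    parts, used = [], 0
    for doc in documents:
        piece = f"#{doc}"
        extra = len(piece) + (1 if parts else 0)  # +1 for the joining space
        if used + extra > max_chars:
            break
        parts.append(piece)
        used += extra
    return " ".join(parts)


def build_prompt(question, documents, max_chars=5120):
    """Reproduce the with-context prompt layout used in this notebook."""
    context = build_context(documents, max_chars)
    return (
        f"\nRelevant context: {context}\n"
        "Considering the relevant context, answer the question.\n"
        f"Question: {question}\n"
        "Answer: "
    )


docs = [
    "Toshiba shuts the lid on laptops after 35 years",
    "MacBook Air (2020) review: Nobody does it better",
]
prompt = build_prompt("Can I buy a new Toshiba laptop?", docs)
print(prompt)
```

A character budget is only a rough proxy for the model's token limit; counting tokens with the tokenizer would be more precise.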

In [23]:
# pipe already has a model and a tokenizer
lm_response_with_context = pipe(prompt_template_with_context)
print(lm_response_with_context[0]["generated_text"])


Relevant context: #The Legendary Toshiba is Officially Done With Making Laptops #3 gaming laptop deals you can’t afford to miss today #Lenovo and HP control half of the global laptop market #Asus ROG Zephyrus G14 gaming laptop announced in India #Acer Swift 3 featuring a 10th-generation Intel Ice Lake CPU, 2K screen, and more launched in India for INR 64999 (US$865) #Apple's Next MacBook Could Be the Cheapest in Company's History #Features of Huawei's Desktop Computer Revealed #Redmi to launch its first gaming laptop on August 14: Here are all the details #Toshiba shuts the lid on laptops after 35 years #This is the cheapest Windows PC by a mile and it even has a spare SSD slot #Apple to Reportedly Launch Its Cheapest MacBook Ever #MacBook Air (2020) review: Nobody does it better #Dell announces the premium Latitude 7410 Chromebook Enterprise: available now #Surface Reveals Microsoft’s Turbocharged Android #Dell have slashed over $1100 off their RTX 2080 gaming laptops today #Realme C

In [24]:
# Compare this to without context
lm_response_without_context = pipe(prompt_template_without_context)
print(lm_response_without_context[0]["generated_text"])


Question: Can I buy a new Toshiba laptop?
Answer: 
Yes, you can buy a new Toshiba laptop. Toshiba is a well-known brand in the laptop industry, and their laptops are known for their durability, performance, and reliability. Toshiba has a wide range of laptops, from budget-friendly models to high-end premium models. You can find a Toshiba laptop that suits your needs and budget at a variety of online and offline retailers.
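As the outputs show, `generated_text` repeats the prompt before the model's answer. A small sketch for keeping only the answer, which makes the two responses easier to compare side by side. `extract_answer` is a hypothetical helper, and the sample string is shortened from the output above. I believe the text-generation pipeline also accepts `return_full_text=False` for the same purpose, but check that against your transformers version.

```python
def extract_answer(prompt, generated_text):
    """Strip the echoed prompt from the front of the generated text, if present."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    return generated_text.strip()


prompt = "\nQuestion: Can I buy a new Toshiba laptop?\nAnswer: "
# Shortened stand-in for lm_response_without_context[0]["generated_text"].
generated = prompt + "\nYes, you can buy a new Toshiba laptop."
answer = extract_answer(prompt, generated)
print(answer)
```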


# Summary
- We used a news dataset and stored its titles as vectors in ChromaDB.
- Then we retrieved the relevant titles and used them to build an extended prompt, which we passed to one of the language models available on Hugging Face.
- The model returned a response that takes into account the context we passed in the prompt.
- Comparing the results with and without the context shows that it changes the answer completely.