<a href="https://colab.research.google.com/github/jaewon078/RAG-with-vector-db/blob/main/RAG_with_Vector_Database.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1> RAG w/ Vector Database (Using ChromaDB) </h1>

<h2> General Information: </h2>

* Model: TinyLlama
* Colab Environment: CPU

Relevant Keywords:
- Vector Database
- ChromaDB
- RAG
- Embeddings


<h2> Import Libraries </h2>

- Sentence transformers: used to transform sentences into fixed-length vectors (embeddings).
- ChromaDB: used as our vector database. It is open-source and commonly used to store embeddings.
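To make the idea of an embedding concrete, here is a minimal sketch using hand-made 3-dimensional vectors. The toy values and the helper names (`cosine_similarity`, `rank`) are assumptions for illustration only; real all-MiniLM-L6-v2 embeddings have 384 dimensions. It shows how a similarity score can rank stored vectors against a query vector, which is roughly the comparison a vector database performs.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" (made up for illustration, not model output).
toy_store = {
    "laptop review": [0.9, 0.1, 0.0],
    "gaming notebook deal": [0.8, 0.3, 0.1],
    "locust swarm study": [0.0, 0.2, 0.9],
}
toy_query = [1.0, 0.2, 0.0]

def rank(query, store):
    """Return the stored keys sorted from most to least similar to the query."""
    return sorted(store, key=lambda k: cosine_similarity(query, store[k]), reverse=True)

ranking = rank(toy_query, toy_store)
print(ranking)
```

The two laptop-related entries rank ahead of the unrelated one because their vectors point in nearly the same direction as the query.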

In [1]:
!pip install -q transformers==4.41.2
!pip install -q sentence-transformers==2.2.2
!pip install -q chromadb==0.4.20


In [2]:
import numpy as np
import pandas as pd

<h2> Copying a Kaggle Dataset </h2>

<p> I'll be using the <a href="https://www.kaggle.com/datasets/kotartemiy/topic-labeled-news-dataset">Topic Labeled News Dataset</a> from Kaggle and copying it into Google Drive so that it can be used in Colab. </p>

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
!pip install kaggle



In [5]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = '/content/drive/MyDrive/kaggle'

In [6]:
!kaggle datasets download -d kotartemiy/topic-labeled-news-dataset

Dataset URL: https://www.kaggle.com/datasets/kotartemiy/topic-labeled-news-dataset
License(s): CC0-1.0
Downloading topic-labeled-news-dataset.zip to /content
 53% 5.00M/9.45M [00:00<00:00, 27.6MB/s]
100% 9.45M/9.45M [00:00<00:00, 40.7MB/s]


In [7]:
import zipfile

# This should work out of the box, but feel free to define the path to your zip file
file_path = '/content/topic-labeled-news-dataset.zip'

In [8]:
with zipfile.ZipFile(file_path, 'r') as zip_ref:
    zip_ref.extractall('/content/drive/MyDrive/kaggle')
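Before extracting, it can help to check which CSV files the archive actually contains, since the loading step relies on a specific file name. The sketch below is one way to do that with the standard library. `list_csv_members` is a hypothetical helper, and the in-memory archive with placeholder member names exists only so the example runs without the Kaggle download.

```python
import io
import zipfile

def list_csv_members(zip_source):
    """Return the names of all .csv files inside a zip archive."""
    with zipfile.ZipFile(zip_source, "r") as zip_ref:
        return [name for name in zip_ref.namelist() if name.lower().endswith(".csv")]

# Build a tiny in-memory archive; these member names are placeholders,
# not the real contents of the Kaggle zip file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo_zip:
    demo_zip.writestr("example_news.csv", "topic;title\nSCIENCE;Example headline\n")
    demo_zip.writestr("README.txt", "placeholder")
buffer.seek(0)

csv_names = list_csv_members(buffer)
print(csv_names)
```

In the notebook, the same helper could be called as `list_csv_members(file_path)`.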

<h2> Loading the Dataset </h2>

In [9]:
# Loads our CSV file into a pandas DataFrame called news
news = pd.read_csv('/content/drive/MyDrive/kaggle/labelled_newscatcher_dataset.csv', sep=';')

# Setting Constants
MAX_NEWS = 1000 # Limits the number of news items to use (for free Colab)
DOCUMENT = "title" # Specifies which column contains the main text of the news
TOPIC = "topic" # Specifies which column contains the topic of the news

In [10]:
# ChromaDB requires that our data have a unique identifier, which we achieve below
news["id"] = news.index

In [11]:
# We select a small portion of news, determined by MAX_NEWS
subset_news = news.head(MAX_NEWS)

<h2> An Example of How Our DataFrame Looks Now </h2>

In [12]:
subset_news.head(3)

Unnamed: 0,topic,link,domain,published_date,title,lang,id
0,SCIENCE,https://www.eurekalert.org/pub_releases/2020-0...,eurekalert.org,2020-08-06 13:59:45,A closer look at water-splitting's solar fuel ...,en,0
1,SCIENCE,https://www.pulse.ng/news/world/an-irresistibl...,pulse.ng,2020-08-12 15:14:19,"An irresistible scent makes locusts swarm, stu...",en,1
2,SCIENCE,https://www.express.co.uk/news/science/1322607...,express.co.uk,2020-08-13 21:01:00,Artificial intelligence warning: AI will know ...,en,2


<h2> Importing and Configuring the Vector Database (ChromaDB) </h2>

In [13]:
import chromadb
from chromadb.config import Settings

In [14]:
chroma_client = chromadb.PersistentClient(path="/content/drive/MyDrive/chromadb")

<h2> Filling and Querying the ChromaDB Database </h2>

In [15]:
from datetime import datetime

In [16]:
# Create a unique collection name using the current Unix timestamp
# (strftime("%s") is platform-dependent, so timestamp() is used instead)
collection_name = "news_collection" + str(int(datetime.now().timestamp()))

# Delete an existing collection w/ the same name if it exists
# (checks every collection, not only the first one)
if collection_name in [c.name for c in chroma_client.list_collections()]:
    chroma_client.delete_collection(name=collection_name)

collection = chroma_client.create_collection(name=collection_name)


In [17]:
# Adds data to the collection

collection.add(
    documents=subset_news[DOCUMENT].tolist(), # main text of each news item
    metadatas=[{TOPIC: topic} for topic in subset_news[TOPIC].tolist()], # list of dicts containing topic of news item
    ids=[f"id{x}" for x in subset_news["id"].tolist()], # unique identifiers, built from the "id" column created earlier
)

/root/.cache/chroma/onnx_models/all-MiniLM-L6-v2/onnx.tar.gz: 100%|██████████| 79.3M/79.3M [00:04<00:00, 19.6MiB/s]
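`collection.add` takes parallel lists: the i-th document, metadata dict, and id all describe the same item, and the ids need to be unique. A small check before calling it can catch a mismatch early, for example when the DataFrame has fewer rows than expected. This is a minimal sketch; `check_add_inputs` is a hypothetical helper, not part of ChromaDB, and the demo rows are placeholders.

```python
def check_add_inputs(documents, metadatas, ids):
    """Raise ValueError if the three lists are misaligned or the ids repeat."""
    if not (len(documents) == len(metadatas) == len(ids)):
        raise ValueError(
            f"length mismatch: {len(documents)} documents, "
            f"{len(metadatas)} metadatas, {len(ids)} ids"
        )
    if len(set(ids)) != len(ids):
        raise ValueError("ids must be unique")

# Placeholder rows standing in for subset_news.
demo_documents = ["Example headline A", "Example headline B"]
demo_metadatas = [{"topic": "SCIENCE"}, {"topic": "TECHNOLOGY"}]
demo_ids = ["id0", "id1"]

check_add_inputs(demo_documents, demo_metadatas, demo_ids)  # passes silently
print("inputs look consistent")
```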


In [18]:
# Queries the collection

results = collection.query(query_texts=["laptop"], n_results=10)

# Prints out our results
print(results)

{'ids': [['id173', 'id829', 'id117', 'id535', 'id141', 'id218', 'id390', 'id273', 'id56', 'id900']], 'distances': [[0.8593594431877136, 1.0294400453567505, 1.0793331861495972, 1.093001127243042, 1.1329681873321533, 1.2130440473556519, 1.214331865310669, 1.2164140939712524, 1.2220635414123535, 1.2754170894622803]], 'metadatas': [[{'topic': 'TECHNOLOGY'}, {'topic': 'TECHNOLOGY'}, {'topic': 'TECHNOLOGY'}, {'topic': 'TECHNOLOGY'}, {'topic': 'TECHNOLOGY'}, {'topic': 'TECHNOLOGY'}, {'topic': 'TECHNOLOGY'}, {'topic': 'TECHNOLOGY'}, {'topic': 'TECHNOLOGY'}, {'topic': 'TECHNOLOGY'}]], 'embeddings': None, 'documents': [['The Legendary Toshiba is Officially Done With Making Laptops', '3 gaming laptop deals you can’t afford to miss today', 'Lenovo and HP control half of the global laptop market', 'Asus ROG Zephyrus G14 gaming laptop announced in India', 'Acer Swift 3 featuring a 10th-generation Intel Ice Lake CPU, 2K screen, and more launched in India for INR 64999 (US$865)', "Apple's Next MacBook
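A note on reading the `distances` values. As far as I know, a Chroma collection created without extra settings uses squared L2 distance, and the default all-MiniLM-L6-v2 embedding function returns unit-length vectors. Both points are assumptions here; this notebook does not verify them. Under those assumptions, squared L2 distance d and cosine similarity s are related by d = 2 - 2s, so a smaller distance means a closer match. A minimal sketch with toy vectors:

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_from_squared_l2(distance):
    """Valid only for unit-length vectors, where d = 2 - 2 * cos."""
    return 1.0 - distance / 2.0

# Toy unit vectors (illustrative values, not real embeddings).
u = normalize([3.0, 4.0])
v = normalize([4.0, 3.0])
d = squared_l2(u, v)
cos_direct = sum(x * y for x, y in zip(u, v))
print(d, cos_direct, cosine_from_squared_l2(d))

# Applied to the top hit above, assuming the same metric and normalization.
top_distance = 0.8593594431877136
print(round(cosine_from_squared_l2(top_distance), 2))
```

If the assumptions hold, the top hit's distance of about 0.86 corresponds to a cosine similarity of roughly 0.57.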

<h2> Vector Map (for demonstration) </h2>

In [19]:
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

In [20]:
# Retrieves an item with the id "id141" from our ChromaDB collection

document_141_data = collection.get(ids="id141", include=["documents", "embeddings"])

In [22]:
# Extracts embeddings and documents

word_vectors = document_141_data["embeddings"]
word_list = document_141_data["documents"]

In [23]:
# Displays documents
word_list

['Acer Swift 3 featuring a 10th-generation Intel Ice Lake CPU, 2K screen, and more launched in India for INR 64999 (US$865)']

In [24]:
# Displays embeddings
word_vectors

[[-0.0808560848236084,
  -0.049963705241680145,
  -0.023777484893798828,
  -0.011053602211177349,
  0.02665771171450615,
  -0.04479333013296127,
  -0.02889663353562355,
  0.026656104251742363,
  0.0014397227205336094,
  -0.016407841816544533,
  0.0653492733836174,
  -0.06901992857456207,
  -0.05748078227043152,
  0.010111615061759949,
  0.05043035000562668,
  -0.002057764446362853,
  0.07256408035755157,
  -0.12437368929386139,
  0.010659442283213139,
  -0.10942046344280243,
  -0.01143240462988615,
  -0.010376011952757835,
  -0.020610831677913666,
  -0.024394094944000244,
  0.07828476279973984,
  0.005820558872073889,
  0.023317726328969002,
  -0.08243829756975174,
  -0.02726505883038044,
  0.0046674772165715694,
  0.004340188577771187,
  0.03252805024385452,
  -0.026030974462628365,
  0.07963905483484268,
  0.042182061821222305,
  -0.12119994312524796,
  0.04907083883881569,
  -0.07625846564769745,
  0.04331624507904053,
  -0.08360457420349121,
  -0.07140401750802994,
  -0.01879251375

<h2> Loading the Model and Creating the Prompt </h2>


From transformers (by Hugging Face), we will be importing:
- AutoTokenizer: tokenizes text inputs, compatible w/ various pre-trained language models
- AutoModelForCausalLM: provides an interface to pre-trained language models specifically designed for language generation tasks
- pipeline: provides a simple interface for various NLP tasks (like text generation)

In [25]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
lm_model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [26]:
# Setting up the pipeline, using the above

pipe = pipeline(
    "text-generation",
    model=lm_model,
    tokenizer=tokenizer,
    max_new_tokens=256, # limits the response to 256 newly generated tokens
    device_map="auto", # requests automatic device placement (CPU in this Colab environment)
)

<h2> Creating the Extended Prompt (RAG) </h2>


In [27]:
# Our prompt has two parts:
# - The user's question (question)
# - The relevant context retrieved from a query to our vector database (context)

question = "Is it possible to buy a new Toshiba laptop?"
context = " ".join([f"#{str(i)}" for i in results["documents"][0]])
prompt_template = f"""
Relevant context: {context}
Considering the relevant context, answer the question.
Question: {question}
Answer: """
prompt_template

"\nRelevant context: #The Legendary Toshiba is Officially Done With Making Laptops #3 gaming laptop deals you can’t afford to miss today #Lenovo and HP control half of the global laptop market #Asus ROG Zephyrus G14 gaming laptop announced in India #Acer Swift 3 featuring a 10th-generation Intel Ice Lake CPU, 2K screen, and more launched in India for INR 64999 (US$865) #Apple's Next MacBook Could Be the Cheapest in Company's History #Features of Huawei's Desktop Computer Revealed #Redmi to launch its first gaming laptop on August 14: Here are all the details #Toshiba shuts the lid on laptops after 35 years #This is the cheapest Windows PC by a mile and it even has a spare SSD slot\nConsidering the relevant context, answer the question.\nQuestion: Is it possible to buy a new Toshiba laptop?\nAnswer: "

In [28]:
# Generating the response

lm_response = pipe(prompt_template)
print(lm_response[0]["generated_text"])


Relevant context: #The Legendary Toshiba is Officially Done With Making Laptops #3 gaming laptop deals you can’t afford to miss today #Lenovo and HP control half of the global laptop market #Asus ROG Zephyrus G14 gaming laptop announced in India #Acer Swift 3 featuring a 10th-generation Intel Ice Lake CPU, 2K screen, and more launched in India for INR 64999 (US$865) #Apple's Next MacBook Could Be the Cheapest in Company's History #Features of Huawei's Desktop Computer Revealed #Redmi to launch its first gaming laptop on August 14: Here are all the details #Toshiba shuts the lid on laptops after 35 years #This is the cheapest Windows PC by a mile and it even has a spare SSD slot
Considering the relevant context, answer the question.
Question: Is it possible to buy a new Toshiba laptop?
Answer: 
Based on the given material, it is not possible to buy a new Toshiba laptop. The article mentions that Toshiba is officially done with making laptops, and the company has announced that it will 
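The printed output repeats the whole prompt before the answer, since the text-generation pipeline returns the prompt plus its continuation by default. Here is a minimal sketch of building the prompt in a reusable way and keeping only the newly generated text. `build_prompt` and `extract_answer` are hypothetical helper names, not part of transformers, and the generated string is a stand-in so the example runs without loading TinyLlama.

```python
def build_prompt(question, documents):
    """Assemble the extended prompt from a question and retrieved titles."""
    context = " ".join(f"#{doc}" for doc in documents)
    return (
        f"\nRelevant context: {context}\n"
        "Considering the relevant context, answer the question.\n"
        f"Question: {question}\n"
        "Answer: "
    )

def extract_answer(generated_text, prompt):
    """Drop the echoed prompt, if present, and return only the new text."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    return generated_text.strip()

# Two titles from the query results; the generated text below is a stand-in
# for real model output, used only to exercise the helpers.
demo_docs = [
    "The Legendary Toshiba is Officially Done With Making Laptops",
    "Toshiba shuts the lid on laptops after 35 years",
]
demo_prompt = build_prompt("Is it possible to buy a new Toshiba laptop?", demo_docs)
demo_generated = demo_prompt + "\nBased on the given material, it is not possible to buy a new Toshiba laptop."
demo_answer = extract_answer(demo_generated, demo_prompt)
print(demo_answer)
```

The pipeline call also accepts `return_full_text=False`, which should have a similar effect without the string handling.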



---



<h2> Connecting to an existing ChromaDB Collection </h2>

<p> This part is tangential to my demonstration of RAG w/ a vector database, but I thought it would be useful to show how to connect to an existing collection for future reference. </p>

In [29]:
!pip install -q chromadb==0.4.20



In [30]:
import chromadb
chroma_client_2 = chromadb.PersistentClient(path="/content/drive/MyDrive/chromadb")

In [31]:
collection2 = chroma_client_2.get_collection(name=collection_name)
# Query the reconnected collection (collection2), not the original one
results2 = collection2.query(query_texts=["laptop"], n_results=10)

In [32]:
print(results2)

{'ids': [['id173', 'id829', 'id117', 'id535', 'id141', 'id218', 'id390', 'id273', 'id56', 'id900']], 'distances': [[0.8593594431877136, 1.0294400453567505, 1.0793331861495972, 1.093001127243042, 1.1329681873321533, 1.2130440473556519, 1.214331865310669, 1.2164140939712524, 1.2220635414123535, 1.2754170894622803]], 'metadatas': [[{'topic': 'TECHNOLOGY'}, {'topic': 'TECHNOLOGY'}, {'topic': 'TECHNOLOGY'}, {'topic': 'TECHNOLOGY'}, {'topic': 'TECHNOLOGY'}, {'topic': 'TECHNOLOGY'}, {'topic': 'TECHNOLOGY'}, {'topic': 'TECHNOLOGY'}, {'topic': 'TECHNOLOGY'}, {'topic': 'TECHNOLOGY'}]], 'embeddings': None, 'documents': [['The Legendary Toshiba is Officially Done With Making Laptops', '3 gaming laptop deals you can’t afford to miss today', 'Lenovo and HP control half of the global laptop market', 'Asus ROG Zephyrus G14 gaming laptop announced in India', 'Acer Swift 3 featuring a 10th-generation Intel Ice Lake CPU, 2K screen, and more launched in India for INR 64999 (US$865)', "Apple's Next MacBook
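To confirm that the reconnected collection returns the same hits as the original query, the two result dicts can be compared directly. This is a minimal sketch; `same_hits` is a hypothetical helper, and the toy dicts below only mimic the shape of the printed output, using its first three ids.

```python
def same_hits(first, second):
    """True if two query results contain the same ids in the same order."""
    return first["ids"] == second["ids"]

# Toy result dicts shaped like the printed output (first three ids only).
demo_results = {"ids": [["id173", "id829", "id117"]]}
demo_results2 = {"ids": [["id173", "id829", "id117"]]}

print(same_hits(demo_results, demo_results2))
```

In the notebook this would be called as `same_hits(results, results2)`.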