# Semantic Search Using Chroma DB

In [1]:
# !pip install chromadb --user
import pandas as pd
import chromadb

In [3]:
# data source: https://www.kaggle.com/datasets/nulldata/medium-post-titles
df = pd.read_csv("medium_post_titles.csv")

df = df.dropna()
df = df[~df["subtitle_truncated_flag"]]

topics_of_interest = ['artificial-intelligence', 'data-science', 'machine-learning']
# topics_of_interest = ['data-science']

df = df[df['category'].isin(topics_of_interest)]

# title and subtitle are concatenated with no separator, so they run together in the stored text
df['text'] = df['title'] + df['subtitle']

df['meta'] = df.apply( lambda x: {
    'text': x['text'],
    'category': x['category']  
}, axis=1)
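
Because `df['title'] + df['subtitle']` joins the two strings with nothing between them, the last word of the title runs straight into the first word of the subtitle (the query outputs further down contain text such as `ResourcesA summary`). Here is a minimal sketch of one way to avoid that, assuming a period-and-space separator is acceptable. `join_title_subtitle` is a hypothetical helper, not part of the original notebook.

```python
def join_title_subtitle(title: str, subtitle: str, sep: str = ". ") -> str:
    """Join a title and a subtitle so that the two do not run together."""
    title = title.strip()
    subtitle = subtitle.strip()
    if not subtitle:
        return title
    # If the title already ends with sentence punctuation, a single space is enough.
    if title.endswith((".", "?", "!", ":")):
        return f"{title} {subtitle}"
    return f"{title}{sep}{subtitle}"


print(join_title_subtitle("A Road Map for Data Science", "What is Data Science?"))
```

With pandas this could be applied as `df['text'] = [join_title_subtitle(t, s) for t, s in zip(df['title'], df['subtitle'])]`. Re-running the notebook that way would change the stored documents, so the outputs shown later would differ.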

In [4]:
df.head(2)

| Unnamed: 0 | category | title | subtitle | subtitle_truncated_flag | text | meta |
|---|---|---|---|---|---|---|
| 4 | artificial-intelligence | "Can I Train my Model on Your Computer?" | How we waste computational resources and how t... | False | "Can I Train my Model on Your Computer?"How we... | {'text': '"Can I Train my Model on Your Comput... |
| 289 | data-science | (Robot) data scientists as a service | Automating data science with symbolic regressi... | False | (Robot) data scientists as a serviceAutomating... | {'text': '(Robot) data scientists as a service... |


## Chroma DB Setup 

In [7]:
from chromadb.config import Settings

***Note1:*** To insert data, each document has to be turned into a vector embedding. If we do not supply our own embeddings or embedding function, Chroma uses its default one.

***Note2:*** Before inserting data into the vector database, we need a collection. A collection groups the vectors together with their IDs, documents, and metadata, and works somewhat like an (abstract) index.

In [9]:
# Chroma DB Setup
# chroma_client = chromadb.Client() # default: in memory
chroma_client = chromadb.PersistentClient(path="medium-chroma-db") # persisted to disk

# collection creation
article_collection = chroma_client.create_collection(name="medium-article")

## Data Insertion

In [10]:
# inserting data

article_collection.upsert(
    ids=[f"{x}" for x in df.index.tolist()],
    documents=df['text'].tolist(),
    metadatas=df['meta'].tolist()    
)

C:\Users\molak\.cache\chroma\onnx_models\all-MiniLM-L6-v2\onnx.tar.gz: 100%|██████████| 79.3M/79.3M [00:14<00:00, 5.61MiB/s]


In [None]:
# article_collection.add(
#     ids=[f"{x}" for x in df.index.tolist()],
#     documents=df['text'].tolist(),
#     metadatas=df['meta'].tolist()    
# )

### Note: You can use *.add()* instead of *.upsert()*, so what is the difference between them?

- *.add()* inserts new vectors into a collection, whereas *.upsert()* updates or inserts.

- If *.upsert()* finds a vector ID that already exists in the database, it updates that record instead of adding a second one. If the ID is not present, it inserts the record. Hence the name: update + insert.
- Re-running *.upsert()* with IDs that are all already in the database therefore simply overwrites those records.

- If you use *.add()* with a vector ID that already exists in the database, the existing record is not overwritten: depending on the Chroma version, the call either raises an error saying the ID is already there or skips that record with a warning.

***Summary:*** If you have vectors that need updating, use *.upsert()*. If you are inserting them into the database for the first time, *.add()* is enough. *.upsert()* works in both cases.
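
The difference can be illustrated with a toy, dictionary-backed stand-in for a collection. This is only a sketch of the semantics described above, not Chroma's implementation. `TinyCollection` is a hypothetical class, and it models the case where a duplicate `add` is rejected with an error.

```python
class TinyCollection:
    """Toy stand-in for a collection: maps record IDs to documents."""

    def __init__(self):
        self._records = {}

    def add(self, ids, documents):
        # Reject the whole call if any ID is already stored.
        duplicates = [i for i in ids if i in self._records]
        if duplicates:
            raise ValueError(f"IDs already present: {duplicates}")
        for record_id, doc in zip(ids, documents):
            self._records[record_id] = doc

    def upsert(self, ids, documents):
        # Insert new IDs and overwrite existing ones.
        for record_id, doc in zip(ids, documents):
            self._records[record_id] = doc

    def get(self, record_id):
        return self._records[record_id]

    def count(self):
        return len(self._records)


collection = TinyCollection()
collection.add(ids=["1"], documents=["first version"])
# ID "1" is overwritten, ID "2" is inserted.
collection.upsert(ids=["1", "2"], documents=["second version", "new record"])

try:
    collection.add(ids=["2"], documents=["duplicate"])
    add_failed = False
except ValueError:
    add_failed = True

print(collection.get("1"), collection.count(), add_failed)
```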

## Vector Query

In [15]:
qry_str1 = "best data science library?"
qry_str2 = "best data ai library?"

In [16]:
article_collection.query(query_texts=qry_str1, n_results=2)
# n_results: number of results to return for each query text

{'ids': [['65427', '2586']],
 'distances': [[0.6087137460708618, 0.6778910160064697]],
 'metadatas': [[{'category': 'data-science',
    'text': 'My Favorite Data Science/Machine Learning ResourcesA summary of sources to get into Data Science'},
   {'category': 'artificial-intelligence',
    'text': '5 Resources to Inspire Your Next Data Science ProjectDon’t worry — getting started is the hardest part'}]],
 'embeddings': None,
 'documents': [['My Favorite Data Science/Machine Learning ResourcesA summary of sources to get into Data Science',
   '5 Resources to Inspire Your Next Data Science ProjectDon’t worry — getting started is the hardest part']],
 'uris': None,
 'data': None}
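
The `distances` list in the result is aligned with `ids`, and a smaller value means a closer match. As a rough illustration of what a nearest-neighbour query does, the sketch below ranks hand-made 3-dimensional vectors by squared Euclidean distance. It rests on several assumptions: real embeddings have several hundred dimensions, Chroma searches an approximate index rather than scanning every vector, and squared L2 is assumed as the distance metric. `toy_vectors` and `nearest` are hypothetical names.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(query_vector, vectors, n_results=2):
    """Return the n_results closest (id, distance) pairs, closest first."""
    scored = [(vec_id, squared_l2(query_vector, vec)) for vec_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:n_results]


# Hand-made vectors standing in for document embeddings.
toy_vectors = {
    "doc-a": [1.0, 0.0, 0.0],
    "doc-b": [0.0, 1.0, 0.0],
    "doc-c": [0.5, 0.5, 0.0],
}

results = nearest([1.0, 0.0, 0.0], toy_vectors, n_results=2)
print(results)
```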

In [17]:
article_collection.query(query_texts=qry_str2, n_results=2)

{'ids': [['103719', '24137']],
 'distances': [[0.6297021508216858, 0.7120205760002136]],
 'metadatas': [[{'category': 'machine-learning',
    'text': 'Top 7 libraries and packages of the year for Data Science and AI: Python & RThis is a list of the best libraries and packages that changed our lives this year, compiled from my weekly digests'},
   {'category': 'artificial-intelligence',
    'text': 'Data Commons Version 1.0: A Framework to Build Toward AI for GoodA roadmap for data from the 2018 AI for Good Summit'}]],
 'embeddings': None,
 'documents': [['Top 7 libraries and packages of the year for Data Science and AI: Python & RThis is a list of the best libraries and packages that changed our lives this year, compiled from my weekly digests',
   'Data Commons Version 1.0: A Framework to Build Toward AI for GoodA roadmap for data from the 2018 AI for Good Summit']],
 'uris': None,
 'data': None}
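
Each value in the result dictionary is a list of lists: the outer list has one entry per query text, and the inner list holds the `n_results` matches for that query. Here is a minimal sketch of flattening such a result into one row per match. The sample data is a trimmed copy of the output above, with the `text` metadata and the documents left out for brevity. `flatten_query_result` is a hypothetical helper.

```python
def flatten_query_result(result, query_index=0):
    """Turn the nested lists for one query into a list of per-match dicts."""
    rows = []
    for record_id, distance, metadata in zip(
        result["ids"][query_index],
        result["distances"][query_index],
        result["metadatas"][query_index],
    ):
        rows.append({"id": record_id, "distance": distance, "category": metadata["category"]})
    return rows


# Trimmed copy of a query result: one query text, two matches.
sample_result = {
    "ids": [["103719", "24137"]],
    "distances": [[0.6297021508216858, 0.7120205760002136]],
    "metadatas": [[{"category": "machine-learning"}, {"category": "artificial-intelligence"}]],
}

rows = flatten_query_result(sample_result)
for row in rows:
    print(row["id"], round(row["distance"], 3), row["category"])
```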

In [22]:
qry_str3 = "what is the best data science library?"
qry_str4 = "what is the best ai library?"

In [20]:
article_collection.query(query_texts=qry_str3, n_results=2)

{'ids': [['65427', '6380']],
 'distances': [[0.651831328868866, 0.7275521755218506]],
 'metadatas': [[{'category': 'data-science',
    'text': 'My Favorite Data Science/Machine Learning ResourcesA summary of sources to get into Data Science'},
   {'category': 'data-science',
    'text': 'A Road Map for Data ScienceWhat is Data Science?'}]],
 'embeddings': None,
 'documents': [['My Favorite Data Science/Machine Learning ResourcesA summary of sources to get into Data Science',
   'A Road Map for Data ScienceWhat is Data Science?']],
 'uris': None,
 'data': None}

In [23]:
article_collection.query(query_texts=qry_str4, n_results=2)

{'ids': [['103719', '112075']],
 'distances': [[0.670006275177002, 0.7815368175506592]],
 'metadatas': [[{'category': 'machine-learning',
    'text': 'Top 7 libraries and packages of the year for Data Science and AI: Python & RThis is a list of the best libraries and packages that changed our lives this year, compiled from my weekly digests'},
   {'category': 'machine-learning',
    'text': 'What are Some ‘Advanced ‘ AI and Machine Learning Online Courses?Where can you find advanced AI and machine learning courses? A comprehensive review based on my personal experience with these courses.'}]],
 'embeddings': None,
 'documents': [['Top 7 libraries and packages of the year for Data Science and AI: Python & RThis is a list of the best libraries and packages that changed our lives this year, compiled from my weekly digests',
   'What are Some ‘Advanced ‘ AI and Machine Learning Online Courses?Where can you find advanced AI and machine learning courses? A comprehensive review based on my personal experience with these courses.']],
 'uris': None,
 'data': None}

In [None]:
# article_collection.delete()