# Setting Chroma DB

[Chroma DB](https://www.trychroma.com/) is an open-source embedding database.

In [52]:
import os
import numpy as np
import pandas as pd
import chromadb
from chromadb.config import Settings
from chromadb.utils import embedding_functions


# Initiating a persistent Chroma client


In [126]:
db_local_persistent_folder = "./../db/"
client = chromadb.PersistentClient(path=db_local_persistent_folder)

To check that the client is connected to the database, we can use the `Client.heartbeat()` method:

In [127]:
client.heartbeat() # returns a nanosecond heartbeat. Useful for making sure the client remains connected.

1700487856111201000
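
To make the heartbeat easier to read, it can be converted to a timestamp. The sketch below assumes the value is the number of nanoseconds since the Unix epoch, which its magnitude suggests; the value itself is copied from the output above.

```python
from datetime import datetime, timezone

# Value copied from the heartbeat output above.
# Assumption: it counts nanoseconds since the Unix epoch.
heartbeat_ns = 1700487856111201000

# Integer-divide down to whole seconds first, so the 19-digit integer
# is never pushed through a lossy float conversion.
heartbeat_utc = datetime.fromtimestamp(heartbeat_ns // 10**9, tz=timezone.utc)

print(heartbeat_utc.isoformat())  # 2023-11-20T13:44:16+00:00
```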

# Loading Catalog

Before uploading data to the database, we need to load the document catalog. We'll use the preprocessed version of the MIND dataset.

In [154]:
df_items = pd.read_parquet("./../data/dev/df_items_preprocessed.parquet")
df_items.tail()

Unnamed: 0,item_id,category,subcategory,title,abstract,url,title_entities,abstract_entities,publish_time,publish_timestamp,sentence
42411,N63550,lifestyle,lifestyleroyals,Why Kate & Meghan Were on Different Balconies ...,There's no scandal here. It's all about the or...,https://assets.msn.com/labs/mind/BBWyynu.html,"[{'Confidence': 1.0, 'Label': 'Meghan, Duchess...",[],2019-11-15 18:29:36,1573843000.0,why kate & meghan were on different balconies ...
42412,N30345,entertainment,entertainment-celebrity,See the stars at the 2019 Baby2Baby gala,Stars like Chrissy Teigen and Kate Hudson supp...,https://assets.msn.com/labs/mind/BBWyz7N.html,[],"[{'Confidence': 1.0, 'Label': 'Kate Hudson', '...",2019-11-15 09:16:10,1573809000.0,see the stars at the 2019 baby2baby gala star...
42413,N30135,news,newsgoodnews,Tennessee judge holds lawyer's baby as he swea...,Tennessee Court of Appeals Judge Richard Dinki...,https://assets.msn.com/labs/mind/BBWyzI8.html,"[{'Confidence': 0.994, 'Label': 'Tennessee', '...","[{'Confidence': 1.0, 'Label': 'Tennessee Court...",2019-11-15 01:00:46,1573780000.0,tennessee judge holds lawyer's baby as he swea...
42414,N44276,autos,autossports,Best Sports Car Deals for October,,https://assets.msn.com/labs/mind/BBy5rVe.html,"[{'Confidence': 1.0, 'Label': 'Peugeot RCZ', '...",[],NaT,,best sports car deals for october peugeot rcz
42415,N39563,sports,more_sports,Shall we dance: Sports stars shake their leg,,https://assets.msn.com/labs/mind/BBzMpnG.html,[],[],NaT,,shall we dance: sports stars shake their leg


To store the embeddings, we can either compute them before uploading the documents (through a separate process) or use one of the built-in Chroma DB [embedding functions](https://docs.trychroma.com/embeddings).
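
For the first option, here is a minimal sketch of preparing precomputed vectors for upload. The helper `build_upsert_payload`, the ids, and the three-dimensional vectors are hypothetical placeholders and not part of the original notebook. The only Chroma behaviour relied on is that `collection.upsert` accepts an `embeddings` argument alongside `ids` and `documents`.

```python
def build_upsert_payload(ids, documents, embeddings):
    """Validate that ids, documents and vectors line up, then bundle them.

    Hypothetical helper for illustration; it does not call Chroma itself.
    """
    if not (len(ids) == len(documents) == len(embeddings)):
        raise ValueError("ids, documents and embeddings must have the same length")

    # Every vector must have the same dimensionality.
    dimensions = {len(vector) for vector in embeddings}
    if len(dimensions) > 1:
        raise ValueError(f"inconsistent embedding dimensions: {sorted(dimensions)}")

    return {
        "ids": list(ids),
        "documents": list(documents),
        "embeddings": [[float(x) for x in vector] for vector in embeddings],
    }


payload = build_upsert_payload(
    ids=["N1", "N2"],                                # placeholder ids
    documents=["first headline", "second headline"],  # placeholder sentences
    embeddings=[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],    # placeholder vectors
)

# With a Chroma collection at hand, the payload could then be stored as-is:
# collection.upsert(**payload)
```

When vectors are passed explicitly like this, the collection stores them as given instead of embedding the documents itself.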

The built-in embedding functions are applied to the documents' textual data, so we need to define the sentences to be used. For the MIND dataset, we can concatenate the following attributes to form a sentence:

- Title
- Title Entities
- Abstract

In [139]:
def generate_item_sentence(item: pd.Series) -> str:
    # Concatenate the title, the title entity labels and the abstract (when present)
    sentence = ' '.join([
        item["title"],
        ' '.join([entity['Label'] for entity in item["title_entities"]]),
        item["abstract"] if isinstance(item["abstract"], str) else '',
    ])

    return sentence.lower()

df_items.head().apply(generate_item_sentence, axis=1)

0    the brands queen elizabeth, prince charles, an...
1    dispose of unwanted prescription drugs during ...
2    the cost of trump's aid freeze in the trenches...
3    i was an nba wife. here's how it affected my m...
4    how to get rid of skin tags, according to a de...
dtype: object

In [140]:
df_items['sentence'] = df_items.apply(generate_item_sentence, axis=1)
df_items.tail()

Unnamed: 0,item_id,category,subcategory,title,abstract,url,title_entities,abstract_entities,publish_time,publish_timestamp,sentence
42411,N63550,lifestyle,lifestyleroyals,Why Kate & Meghan Were on Different Balconies ...,There's no scandal here. It's all about the or...,https://assets.msn.com/labs/mind/BBWyynu.html,"[{'Confidence': 1.0, 'Label': 'Meghan, Duchess...",[],2019-11-15 18:29:36,1573843000.0,why kate & meghan were on different balconies ...
42412,N30345,entertainment,entertainment-celebrity,See the stars at the 2019 Baby2Baby gala,Stars like Chrissy Teigen and Kate Hudson supp...,https://assets.msn.com/labs/mind/BBWyz7N.html,[],"[{'Confidence': 1.0, 'Label': 'Kate Hudson', '...",2019-11-15 09:16:10,1573809000.0,see the stars at the 2019 baby2baby gala star...
42413,N30135,news,newsgoodnews,Tennessee judge holds lawyer's baby as he swea...,Tennessee Court of Appeals Judge Richard Dinki...,https://assets.msn.com/labs/mind/BBWyzI8.html,"[{'Confidence': 0.994, 'Label': 'Tennessee', '...","[{'Confidence': 1.0, 'Label': 'Tennessee Court...",2019-11-15 01:00:46,1573780000.0,tennessee judge holds lawyer's baby as he swea...
42414,N44276,autos,autossports,Best Sports Car Deals for October,,https://assets.msn.com/labs/mind/BBy5rVe.html,"[{'Confidence': 1.0, 'Label': 'Peugeot RCZ', '...",[],NaT,,best sports car deals for october peugeot rcz
42415,N39563,sports,more_sports,Shall we dance: Sports stars shake their leg,,https://assets.msn.com/labs/mind/BBzMpnG.html,[],[],NaT,,shall we dance: sports stars shake their leg


Once the textual data is set, we can test the embedding function. For this notebook, we'll use Sentence Transformers with the `all-MiniLM-L6-v2` model, which maps each sentence to a 384-dimensional vector.

In [None]:
embedding_function = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")
sentences = df_items["sentence"][9:11].tolist()  # the embedding function expects a list of strings
print(sentences)
embeddings = np.array(embedding_function(sentences))
print(embeddings.shape)


## Setting Collections

A collection is similar to a bucket in cloud storage: it holds a user-defined group of documents together with their embeddings and metadata.

To create a collection (or get an existing one), we can define the following parameters:

- `name`: the collection's name
- `embedding_function`: the function used to embed documents and query texts when no embeddings are supplied
- `metadata`: collection-level settings. For example, the `hnsw:space` attribute sets how distances between embeddings are computed. The current options are:
    - `l2`: squared L2 norm (default)
    - `ip`: inner product
    - `cosine`: cosine distance (1 minus the cosine similarity)
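
To make the three options concrete, the sketch below implements them in plain Python. The formulas follow the Chroma documentation as I understand it (both `ip` and `cosine` are reported as one minus the respective similarity), so treat them as an illustration rather than the library's actual implementation. The function names and the toy vectors are made up for this example.

```python
import math


def squared_l2(a, b):
    """`l2`: sum of squared coordinate differences (no square root)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def inner_product_distance(a, b):
    """`ip`: one minus the dot product (assumed formula, see lead-in)."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    """`cosine`: one minus the cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


a = [1.0, 0.0]
b = [0.0, 1.0]  # orthogonal to a
c = [2.0, 0.0]  # same direction as a, twice the length

print(squared_l2(a, b))              # 2.0
print(cosine_distance(a, b))         # 1.0
print(cosine_distance(a, c))         # 0.0
print(inner_product_distance(a, c))  # -1.0
```

Note that `cosine` ignores vector length (`a` and `c` are at distance 0), while `l2` and `ip` do not.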


In [142]:
collection = client.get_or_create_collection(
    name="mind-news",
    metadata={"hnsw:space": "cosine"}, # distance method. l2 is the default. Options: "l2", "ip", "cosine"
    embedding_function=embedding_function,
)

In [100]:
# client.delete_collection("mind-news")

In [149]:
collection.count()

20

## Adding Data to Collection

In [144]:
query_fields = ["category", "subcategory", "publish_timestamp"]
df_items.apply(lambda x: {field: x[field] for field in query_fields}, axis=1).values

array([{'category': 'lifestyle', 'subcategory': 'lifestyleroyals', 'publish_timestamp': nan},
       {'category': 'health', 'subcategory': 'medical', 'publish_timestamp': nan},
       {'category': 'news', 'subcategory': 'newsworld', 'publish_timestamp': nan},
       ...,
       {'category': 'news', 'subcategory': 'newsgoodnews', 'publish_timestamp': 1573779646.0},
       {'category': 'autos', 'subcategory': 'autossports', 'publish_timestamp': nan},
       {'category': 'sports', 'subcategory': 'more_sports', 'publish_timestamp': nan}],
      dtype=object)
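
Several of the metadata dictionaries above carry `nan` as `publish_timestamp`. One option, which is not part of the original notebook, is to drop missing fields before uploading so that each record only stores values that are actually known. The helper name `clean_metadata` is hypothetical, and the sample row is modelled on the output above.

```python
import math


def clean_metadata(record, fields):
    """Keep only the requested fields whose values are present (not None or NaN)."""
    cleaned = {}
    for field in fields:
        value = record.get(field)
        if value is None:
            continue
        if isinstance(value, float) and math.isnan(value):
            continue
        cleaned[field] = value
    return cleaned


query_fields = ["category", "subcategory", "publish_timestamp"]

# Sample row modelled on the output above (missing timestamp).
row = {"category": "autos", "subcategory": "autossports", "publish_timestamp": float("nan")}

print(clean_metadata(row, query_fields))
# {'category': 'autos', 'subcategory': 'autossports'}
```

The helper works on plain dictionaries; a pandas row could be converted first with `row.to_dict()`.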

In [161]:
CHROMA_DB_MAX_CHUNK_SIZE = 41666  # maximum number of records sent per upsert call
n_chunks = int(np.ceil(df_items.shape[0] / CHROMA_DB_MAX_CHUNK_SIZE))
n_chunks

2
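
The chunk count above is a ceiling division of the catalog size by the chunk limit. The sketch below reproduces that arithmetic in plain Python and shows a simple slicing-based batcher. The name `iter_batches` is hypothetical, and the catalog size of 42416 is inferred from the last row index (42415) shown earlier.

```python
import math

CHROMA_DB_MAX_CHUNK_SIZE = 41666  # same limit as in the notebook
n_items = 42416                   # catalog size (last row index shown earlier is 42415)

n_chunks = math.ceil(n_items / CHROMA_DB_MAX_CHUNK_SIZE)


def iter_batches(items, max_size):
    """Yield consecutive slices of `items`, each holding at most `max_size` elements."""
    for start in range(0, len(items), max_size):
        yield items[start:start + max_size]


batch_sizes = [len(batch) for batch in iter_batches(list(range(n_items)), CHROMA_DB_MAX_CHUNK_SIZE)]
print(n_chunks, batch_sizes)  # 2 [41666, 750]
```

Unlike `np.array_split`, which produces chunks of nearly equal size, this batcher fills each chunk up to the limit and leaves the remainder for the last one. Both approaches keep every chunk within the limit.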

In [162]:
%%time
for i, df_chunk in enumerate(np.array_split(df_items, n_chunks)):
    print(f"Processing chunk {i+1}/{int(n_chunks)}")
    collection.upsert(
        documents=df_chunk["sentence"].values.tolist(),
        metadatas=df_chunk.apply(lambda x: {field: x[field] for field in query_fields}, axis=1).values.tolist(),
        ids=df_chunk["item_id"].values.tolist()
    )

Processing chunk 0/1.0
Processing chunk 1/1.0
CPU times: user 23min 56s, sys: 2min 27s, total: 26min 23s
Wall time: 2h 1min 11s


In [163]:
collection.count()

42416

# Querying the Collection

In [164]:
# Get first N items from a collection
collection.peek(5)["ids"]

['N38146', 'N50966', 'N20106', 'N5860', 'N23965']

In [165]:
# Get items by ID and choose which fields to return
collection.get(
    ids=df_items["item_id"].sample(2).values.tolist(),
    include=["metadatas"]
)

{'ids': ['N22428', 'N15764'],
 'embeddings': None,
 'metadatas': [{'category': 'lifestyle', 'subcategory': 'lifestylefamily'},
  {'category': 'sports', 'subcategory': 'basketball_nba'}],
 'documents': None,
 'uris': None,
 'data': None}

In [167]:
# Querying with text input
collection.query(
    query_texts=["Washington D.C."],
    n_results=5,
    include=["distances", "documents"],
)

{'ids': [['N1364', 'N30199', 'N55433', 'N5241', 'N59911']],
 'distances': [[0.40464282035827637,
   0.5020017027854919,
   0.510242223739624,
   0.5167964696884155,
   0.5257139205932617]],
 'metadatas': None,
 'embeddings': None,
 'documents': [['union station in d.c.: the ultimate guide washington, d.c. where to eat, where to shop, and what to know about the history behind it all',
   "10 best museums in washington dc washington, d.c. if there's one thing washington knows how to do, it's a museum.",
   'what $3,100 rents in d.c. right now washington, d.c. these newer listings are in dupont circle, glover park, columbia heights, and elsewhere',
   'lonely planet names washington, d.c., the second-best city in the world to visit in 2020 lonely planet washington, d.c. lonely planet has compiled its annual ranking of 10 best countries, cities, regions and best-value destinations. what made the lists?',
   'where to hunt for deal on a d.c.-area apartment washington metropolitan area looki
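
The query result above is a dictionary of nested lists, with one inner list per entry in `query_texts`. A minimal sketch of flattening it into one row per hit is shown below. The ids and distances are copied from the output above, the documents are left out for brevity, and the helper name `flatten_query_result` is hypothetical.

```python
# Ids and distances copied from the query output above.
result = {
    "ids": [["N1364", "N30199", "N55433", "N5241", "N59911"]],
    "distances": [[0.40464282035827637, 0.5020017027854919, 0.510242223739624,
                   0.5167964696884155, 0.5257139205932617]],
}


def flatten_query_result(result):
    """Return one (query_index, item_id, distance) tuple per hit."""
    rows = []
    for query_index, (ids, distances) in enumerate(zip(result["ids"], result["distances"])):
        for item_id, distance in zip(ids, distances):
            rows.append((query_index, item_id, distance))
    return rows


rows = flatten_query_result(result)
print(rows[0])  # (0, 'N1364', 0.40464282035827637)
```

Rows in this shape are easy to load into a DataFrame and join back to `df_items` on `item_id`.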

In [168]:
# Defining querying filters
collection.query(
    query_texts=[""],
    n_results=5,
    where={"category": "autos"}
)

{'ids': [['N61182', 'N43436', 'N10344', 'N24897', 'N13363']],
 'distances': [[0.7420820593833923,
   0.7643195390701294,
   0.7644224166870117,
   0.7785135507583618,
   0.7803128957748413]],
 'metadatas': [[{'category': 'autos', 'subcategory': 'autosenthusiasts'},
   {'category': 'autos', 'subcategory': 'autosclassics'},
   {'category': 'autos',
    'publish_timestamp': 1573787140.0,
    'subcategory': 'autosnews'},
   {'category': 'autos', 'subcategory': 'autosnews'},
   {'category': 'autos', 'subcategory': 'autosenthusiasts'}]],
 'embeddings': None,
 'documents': [["2019 mustang week meet 'n greet  massive mustang week photo gallery.",
   'greatest ebay listing of 2019? or in all of human history? you decide!  this clever ebay ad made our day!',
   'bommarito kicks off `spirit of st. louis` campaign with $50,000 donation spirit of st. louis n the spirit of the holidays, fox2, kplr 11 and bommarito automotive group invites you to enter to win a car, truck or suva 2019 vw passat or a 