## References

https://docs.trychroma.com/guides#using-collections

https://docs.trychroma.com/reference/py-collection

https://cookbook.chromadb.dev/core/collections/

https://docs.trychroma.com/guides/embeddings

## Prepare Text Data

### Download Text Data

In [6]:
!wget https://raw.githubusercontent.com/johnnycode8/chromadb_quickstart/main/menu_items.csv

--2024-08-03 19:39:47--  https://raw.githubusercontent.com/johnnycode8/chromadb_quickstart/main/menu_items.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7171 (7,0K) [text/plain]
Saving to: ‘menu_items.csv’


2024-08-03 19:39:47 (9,90 MB/s) - ‘menu_items.csv’ saved [7171/7171]



In [7]:
# !pip install pandas

Collecting pandas
  Downloading pandas-2.2.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (19 kB)
Collecting pytz>=2020.1 (from pandas)
  Downloading pytz-2024.1-py2.py3-none-any.whl.metadata (22 kB)
Collecting tzdata>=2022.7 (from pandas)
  Downloading tzdata-2024.1-py2.py3-none-any.whl.metadata (1.4 kB)
Downloading pandas-2.2.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.0 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13.0/13.0 MB 5.9 MB/s eta 0:00:00
Downloading pytz-2024.1-py2.py3-none-any.whl (505 kB)
Downloading tzdata-2024.1-py2.py3-none-any.whl (345 kB)
Installing collected packages: pytz, tzdata, pandas
Successfully installed pandas-2.2.2 pytz-2024.1 tzdata-2024.1


### Load text data

In [1]:
import pandas as pd
text_data = pd.read_csv('menu_items.csv')

In [2]:
text_data.head()

Unnamed: 0,item_id,item_name
0,A1,Vegan Chicken Salad
1,A2,Spring Rolls (4 pieces)
2,A3,Pot Stickers (6 pieces)
3,A4,Fried Wonton (10 pieces)
4,A5,Fried Tofu with Soy Sauce


### Format text data for chromadb

In [3]:
ids_data = text_data['item_id'].to_list()
documents_data = text_data['item_name'].to_list()

In [4]:
ids_data[0:5]

['A1', 'A2', 'A3', 'A4', 'A5']

### Prepare metadata

In [5]:
len(ids_data)

201

In [6]:
versions = [
    {'ver': idx_ % 4} for idx_ in range(len(ids_data)) 
]

In [7]:
versions[0:5]

[{'ver': 0}, {'ver': 1}, {'ver': 2}, {'ver': 3}, {'ver': 0}]
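As a quick check of how this modulo rule spreads the items across versions, the sketch below rebuilds the list for 201 ids (the count reported above) and tallies each `ver` value. The names `n_items` and `ver_counts` are illustrative and not part of the notebook.

```python
from collections import Counter

n_items = 201  # number of ids reported by len(ids_data) above

# Same rule as the notebook: cycle through versions 0..3 by position
versions = [{'ver': idx_ % 4} for idx_ in range(n_items)]

# Tally how many items fall into each version
ver_counts = Counter(meta['ver'] for meta in versions)
print(sorted(ver_counts.items()))  # [(0, 51), (1, 50), (2, 50), (3, 50)]
```

Version 0 gets one extra item because 201 is not a multiple of 4.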

## In-memory Vector Database

### Create Client 

In [8]:
import chromadb
client = chromadb.Client() 

### Create Collection

#### Embedding Function for Collection

Embedding functions can be linked to a collection and are used whenever you call `add`, `update`, `upsert` or `query`. You can also call them directly, which can be handy for debugging.

By default, Chroma uses the Sentence Transformers `all-MiniLM-L6-v2` model to create embeddings.


In [9]:
from chromadb.utils import embedding_functions
default_ef = embedding_functions.DefaultEmbeddingFunction()
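
Conceptually, an embedding function is a callable that takes a list of documents and returns one vector per document. The sketch below is a toy stand-in written in plain Python: the hypothetical `ToyEmbeddingFunction` uses letter frequencies, is not the MiniLM model, and is not part of Chroma. It only illustrates the input and output shape.

```python
import math
from typing import List


class ToyEmbeddingFunction:
    """Hypothetical stand-in: one 26-dim letter-frequency vector per document."""

    def __call__(self, input: List[str]) -> List[List[float]]:
        vectors = []
        for doc in input:
            counts = [0.0] * 26
            for ch in doc.lower():
                if 'a' <= ch <= 'z':
                    counts[ord(ch) - ord('a')] += 1.0
            # Normalize to unit length so vector length does not depend on text length
            norm = math.sqrt(sum(c * c for c in counts)) or 1.0
            vectors.append([c / norm for c in counts])
        return vectors


toy_ef = ToyEmbeddingFunction()
vectors = toy_ef(["Spring Rolls", "Fried Tofu"])
print(len(vectors), len(vectors[0]))  # 2 26
```

Calling the function directly like this is the debugging pattern mentioned above: you can inspect how many vectors come back and what dimension they have before adding anything to a collection.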


https://github.com/chroma-core/chroma/blob/main/chromadb/api/client.py#L106

```python
def create_collection(
        self,
        name: str,
        configuration: Optional[CollectionConfiguration] = None,
        metadata: Optional[CollectionMetadata] = None,
        embedding_function: Optional[
            EmbeddingFunction[Embeddable]
        ] = ef.DefaultEmbeddingFunction(),  # type: ignore
        data_loader: Optional[DataLoader[Loadable]] = None,
        get_or_create: bool = False,
    ) -> Collection:
        model = self._server.create_collection(
            name=name,
            metadata=metadata,
            tenant=self.tenant,
            database=self.database,
            get_or_create=get_or_create,
            configuration=configuration,
        )
        return Collection(
            client=self._server,
            model=model,
            embedding_function=embedding_function,
            data_loader=data_loader,
        )
```

Create a new collection with the given name and metadata.

**Arguments:**

`name` - The name of the collection to create.

`metadata` - Optional metadata to associate with the collection.

`embedding_function` - Optional function to use to embed documents. Uses the default embedding function if not provided.

`get_or_create` - If True, return the existing collection if it exists.

**Returns:**

`Collection` - The newly created collection.

**Raises:**

`ValueError` - If the collection already exists and get_or_create is False.

`ValueError` - If the collection name is invalid.
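
The interaction between an existing name and `get_or_create` can be sketched in plain Python. `ToyClient` below is a hypothetical stand-in that mimics only the name-collision rule described above; it is not how Chroma is implemented.

```python
class ToyClient:
    """Hypothetical stand-in that mimics only the name-collision rule."""

    def __init__(self):
        self._collections = {}

    def create_collection(self, name, metadata=None, get_or_create=False):
        if name in self._collections:
            if get_or_create:
                return self._collections[name]  # reuse the existing collection
            raise ValueError(f"Collection {name} already exists")
        self._collections[name] = {"name": name, "metadata": metadata}
        return self._collections[name]


toy_client = ToyClient()
first = toy_client.create_collection("text_collection")
same = toy_client.create_collection("text_collection", get_or_create=True)

try:
    toy_client.create_collection("text_collection")  # second plain create
    duplicate_error = None
except ValueError as err:
    duplicate_error = str(err)

print(same is first, duplicate_error)
```

In a notebook that is re-run often, passing `get_or_create=True` avoids the error when the cell that creates the collection is executed twice.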

In [10]:
text_collection = client.create_collection(name='text_collection', 
                                           embedding_function=default_ef,
                                           metadata={"hnsw:space": "cosine"}) # l2 is the default
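
The `hnsw:space` setting selects the distance used to compare embeddings. Assuming the definitions given in the Chroma documentation (`l2` is the squared Euclidean distance, `cosine` is one minus the cosine similarity), the plain-Python sketch below shows how the two differ on vectors that point the same way but have different lengths. The helper names are illustrative.

```python
import math


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_distance(a, b):
    """One minus the cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


short = [1.0, 0.0]
long_ = [3.0, 0.0]   # same direction as `short`, three times the length
other = [0.0, 1.0]   # orthogonal to `short`

print(squared_l2(short, long_), cosine_distance(short, long_))  # 4.0 0.0
print(squared_l2(short, other), cosine_distance(short, other))  # 2.0 1.0
```

Cosine distance ignores vector length, so `short` and `long_` count as identical, while squared L2 ranks the orthogonal vector as closer.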

In [11]:
methods_list = [method for method in dir(text_collection) if callable(getattr(text_collection, method)) and not method.startswith('_')]

print("Methods in text_collection:", methods_list)

Methods in text_collection: ['add', 'count', 'delete', 'get', 'get_model', 'modify', 'peek', 'query', 'update', 'upsert']


In [12]:
text_collection.get_model()

Collection(id=UUID('11d0a1ad-6e2f-4e09-8a18-eac4b76f6b92'), name='text_collection', configuration_json={'hnsw_configuration': {'space': 'l2', 'ef_construction': 100, 'ef_search': 10, 'num_threads': 24, 'M': 16, 'resize_factor': 1.2, 'batch_size': 100, 'sync_threshold': 1000, '_type': 'HNSWConfigurationInternal'}, '_type': 'CollectionConfigurationInternal'}, metadata={'hnsw:space': 'cosine'}, dimension=None, tenant='default_tenant', database='default_database', version=0)

### Add text data to text collection 

In [13]:
import inspect

args = inspect.getfullargspec(text_collection.add).args
print(args)

['self', 'ids', 'embeddings', 'metadatas', 'documents', 'images', 'uris']


In [14]:
text_collection.add(
    ids=ids_data,
    documents=documents_data,
    metadatas=versions
)

#### Count in Text Collection

In [15]:
text_collection.count()

201

#### Delete in the text collection

```python
def delete(ids: Optional[IDs] = None,
           where: Optional[Where] = None,
           where_document: Optional[WhereDocument] = None) -> None

```

Delete the embeddings based on ids and/or a where filter

**Arguments:**

`ids` - The ids of the embeddings to delete

`where` - A Where type dict used to filter the deletion by. E.g. `{"color": "red", "price": 4.20}`. Optional.

`where_document` - A WhereDocument type dict used to filter the deletion by the document content. E.g. `{"$contains": "hello"}`. Optional.

**Returns:**

None




In [18]:
ids_data[-1]

'SP32'

In [19]:
text_collection.delete(ids=['SP32'])

In [20]:
text_collection.count()

200
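
The "ids and/or a where filter" rule can be sketched with a plain dictionary. `toy_delete` below is a hypothetical helper that supports only equality filters and assumes a record must match every condition given; it illustrates the idea and is not Chroma's implementation.

```python
def toy_delete(store, ids=None, where=None):
    """Remove records matching the given ids and/or an equality metadata filter."""
    if ids is None and where is None:
        raise ValueError("Provide ids and/or where")  # assumption: an empty delete is rejected

    def matches(item_id, record):
        if ids is not None and item_id not in ids:
            return False
        if where is not None and any(
            record['metadata'].get(key) != value for key, value in where.items()
        ):
            return False
        return True

    # Collect the keys first so the dict is not mutated while iterating
    for item_id in [i for i, rec in store.items() if matches(i, rec)]:
        del store[item_id]


store = {
    'A1': {'document': 'Vegan Chicken Salad', 'metadata': {'ver': 0}},
    'A2': {'document': 'Spring Rolls (4 pieces)', 'metadata': {'ver': 1}},
    'A3': {'document': 'Pot Stickers (6 pieces)', 'metadata': {'ver': 2}},
    'A4': {'document': 'Fried Wonton (10 pieces)', 'metadata': {'ver': 3}},
    'A5': {'document': 'Fried Tofu with Soy Sauce', 'metadata': {'ver': 0}},
}

toy_delete(store, ids=['A5'])
count_after_ids = len(store)      # 4 records left

toy_delete(store, where={'ver': 1})
count_after_where = len(store)    # 3 records left
print(count_after_ids, count_after_where, sorted(store))
```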

#### Get in Text Collection

```python
def get(ids: Optional[OneOrMany[ID]] = None,
        where: Optional[Where] = None,
        limit: Optional[int] = None,
        offset: Optional[int] = None,
        where_document: Optional[WhereDocument] = None,
        include: Include = ["metadatas", "documents"]) -> GetResult
```

**Arguments:**

`ids` - The ids of the embeddings to get. Optional.

`where` - A Where type dict used to filter results by. E.g. {"color" : "red", "price": 4.20}. Optional.

`limit` - The number of documents to return. Optional.

`offset` - The offset to start returning results from. Useful for paging results with limit. Optional.

`where_document` - A WhereDocument type dict used to filter by the documents. E.g. `{"$contains": "hello"}`. Optional.

`include` - A list of what to include in the results. Can contain "embeddings", "metadatas", "documents". Ids are always included. Defaults to ["metadatas", "documents"]. Optional.

**Returns:**

`GetResult` - A GetResult object containing the results

Get embeddings and their associated data from the data store. If no ids or where filter is provided, returns all embeddings up to limit, starting at offset.
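
Paging with `limit` and `offset` follows the usual pattern: request a fixed-size page, then advance the offset by the number of records received until a page comes back empty. The sketch below uses a hypothetical `get_page` helper over a plain list to stand in for a call such as `text_collection.get(limit=..., offset=...)`.

```python
def get_page(all_ids, limit, offset):
    """Hypothetical stand-in for a paged get: a slice of the full id list."""
    return all_ids[offset:offset + limit]


all_ids = [f"A{i}" for i in range(1, 12)]  # 11 illustrative ids: A1..A11

pages = []
offset = 0
while True:
    page = get_page(all_ids, limit=5, offset=offset)
    if not page:          # an empty page means everything has been read
        break
    pages.append(page)
    offset += len(page)   # move past the records just received

print([len(p) for p in pages])  # [5, 5, 1]
```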



In [22]:
ids_data[0:5]

['A1', 'A2', 'A3', 'A4', 'A5']

In [23]:
text_collection.get(ids=['A1', 'A2', 'A3', 'A4', 'A5'])

{'ids': ['A1', 'A2', 'A3', 'A4', 'A5'],
 'embeddings': None,
 'metadatas': [{'ver': 0}, {'ver': 1}, {'ver': 2}, {'ver': 3}, {'ver': 0}],
 'documents': ['Vegan Chicken Salad',
  'Spring Rolls (4 pieces)',
  'Pot Stickers (6 pieces)',
  'Fried Wonton (10 pieces)',
  'Fried Tofu with Soy Sauce'],
 'uris': None,
 'data': None,
 'included': ['metadatas', 'documents']}

In [24]:
text_collection.get(ids=['A1', 'A2', 'A3', 'A4', 'A5'], 
                    where={"ver": 0})

{'ids': ['A1', 'A5'],
 'embeddings': None,
 'metadatas': [{'ver': 0}, {'ver': 0}],
 'documents': ['Vegan Chicken Salad', 'Fried Tofu with Soy Sauce'],
 'uris': None,
 'data': None,
 'included': ['metadatas', 'documents']}

In [30]:
text_collection.get(ids=['A1', 'A2', 'A3', 'A4', 'A5'], 
                    where_document={"$contains":"search_string"})

{'ids': [],
 'embeddings': None,
 'metadatas': [],
 'documents': [],
 'uris': None,
 'data': None,
 'included': ['metadatas', 'documents']}

In [32]:
documents_data[0:5]

['Vegan Chicken Salad',
 'Spring Rolls (4 pieces)',
 'Pot Stickers (6 pieces)',
 'Fried Wonton (10 pieces)',
 'Fried Tofu with Soy Sauce']

In [35]:
text_collection.get(ids=['A1', 'A2', 'A3', 'A4', 'A5'], 
                    where_document={"$contains":"Chicken"})

{'ids': ['A1'],
 'embeddings': None,
 'metadatas': [{'ver': 0}],
 'documents': ['Vegan Chicken Salad'],
 'uris': None,
 'data': None,
 'included': ['metadatas', 'documents']}

In [36]:
text_collection.get(ids=['A1', 'A2', 'A3', 'A4', 'A5'], 
                    where_document={"$contains":"chicken"})

{'ids': [],
 'embeddings': None,
 'metadatas': [],
 'documents': [],
 'uris': None,
 'data': None,
 'included': ['metadatas', 'documents']}
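
The last two cells show that `$contains` behaves like a case-sensitive substring test: `"Chicken"` matches and `"chicken"` does not. The plain-Python sketch below reproduces that behaviour on the same five documents and shows one possible client-side workaround, lowercasing both sides. The helpers `contains` and `contains_ignore_case` are hypothetical and not part of Chroma.

```python
documents = [
    'Vegan Chicken Salad',
    'Spring Rolls (4 pieces)',
    'Pot Stickers (6 pieces)',
    'Fried Wonton (10 pieces)',
    'Fried Tofu with Soy Sauce',
]


def contains(docs, needle):
    """Case-sensitive substring filter, mirroring the behaviour seen above."""
    return [doc for doc in docs if needle in doc]


def contains_ignore_case(docs, needle):
    """Client-side workaround: compare lowercase copies of both strings."""
    needle = needle.lower()
    return [doc for doc in docs if needle in doc.lower()]


print(contains(documents, "Chicken"))              # ['Vegan Chicken Salad']
print(contains(documents, "chicken"))              # []
print(contains_ignore_case(documents, "chicken"))  # ['Vegan Chicken Salad']
```

Another option under the same assumption is to store a lowercased copy of each document and lowercase the search string before querying.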

#### get_model in Text Collection

In [41]:
import inspect

args = inspect.getfullargspec(text_collection.get_model).args
print(args)

['self']


In [40]:
text_collection.get_model()

Collection(id=UUID('11d0a1ad-6e2f-4e09-8a18-eac4b76f6b92'), name='text_collection', configuration_json={'hnsw_configuration': {'space': 'l2', 'ef_construction': 100, 'ef_search': 10, 'num_threads': 24, 'M': 16, 'resize_factor': 1.2, 'batch_size': 100, 'sync_threshold': 1000, '_type': 'HNSWConfigurationInternal'}, '_type': 'CollectionConfigurationInternal'}, metadata={'hnsw:space': 'cosine'}, dimension=None, tenant='default_tenant', database='default_database', version=0)

#### Modify in Text Collection

```python
def modify(name: Optional[str] = None,
           metadata: Optional[CollectionMetadata] = None) -> None
```

**Arguments:**

`name` - The updated name for the collection. Optional.

`metadata` - The updated metadata for the collection. Optional.

**Returns:**

None

Modify the collection name or metadata



In [42]:
text_collection.get(ids=['A1', 'A2', 'A3', 'A4', 'A5'])

{'ids': ['A1', 'A2', 'A3', 'A4', 'A5'],
 'embeddings': None,
 'metadatas': [{'ver': 0}, {'ver': 1}, {'ver': 2}, {'ver': 3}, {'ver': 0}],
 'documents': ['Vegan Chicken Salad',
  'Spring Rolls (4 pieces)',
  'Pot Stickers (6 pieces)',
  'Fried Wonton (10 pieces)',
  'Fried Tofu with Soy Sauce'],
 'uris': None,
 'data': None,
 'included': ['metadatas', 'documents']}

#### Peek in Text Collection

```python
def peek(limit: int = 10) -> GetResult
```


**Arguments:**

`limit` - The number of results to return.

**Returns:**

`GetResult` - A GetResult object containing the results.


Get the first few results in the database up to limit


In [None]:
text_collection.peek(limit=5) # peek also returns the embeddings

#### Update in Text Collection

```python
def update(ids: OneOrMany[ID],
           embeddings: Optional[OneOrMany[Embedding]] = None,
           metadatas: Optional[OneOrMany[Metadata]] = None,
           documents: Optional[OneOrMany[Document]] = None) -> None

```


**Arguments:**

`ids` - The ids of the embeddings to update

`embeddings` - The embeddings to add. If None, embeddings will be computed based on the documents using the embedding_function set for the Collection. Optional.

`metadatas` - The metadata to associate with the embeddings. When querying, you can filter on this metadata. Optional.

`documents` - The documents to associate with the embeddings. Optional.

**Returns:**

None

Update the embeddings, metadatas or documents for provided ids.


In [45]:
text_collection.get(ids=['A1', 'A2', 'A3', 'A4', 'A5'])

{'ids': ['A1', 'A2', 'A3', 'A4', 'A5'],
 'embeddings': None,
 'metadatas': [{'ver': 0}, {'ver': 1}, {'ver': 2}, {'ver': 3}, {'ver': 0}],
 'documents': ['Vegan Chicken Salad',
  'Spring Rolls (4 pieces)',
  'Pot Stickers (6 pieces)',
  'Fried Wonton (10 pieces)',
  'Fried Tofu with Soy Sauce'],
 'uris': None,
 'data': None,
 'included': ['metadatas', 'documents']}

In [46]:
text_collection.update(ids=['A1'], 
                       documents=['Chicken Burger'], 
                       metadatas=[{'ver': -1}])

In [47]:
text_collection.get(ids=['A1', 'A2', 'A3', 'A4', 'A5'])

{'ids': ['A1', 'A2', 'A3', 'A4', 'A5'],
 'embeddings': None,
 'metadatas': [{'ver': -1}, {'ver': 1}, {'ver': 2}, {'ver': 3}, {'ver': 0}],
 'documents': ['Chicken Burger',
  'Spring Rolls (4 pieces)',
  'Pot Stickers (6 pieces)',
  'Fried Wonton (10 pieces)',
  'Fried Tofu with Soy Sauce'],
 'uris': None,
 'data': None,
 'included': ['metadatas', 'documents']}

In [48]:
text_collection.update(ids=['A1BLABLA'], 
                       documents=['Fish Burger'], 
                       metadatas=[{'ver': -1}])

Update of nonexisting embedding ID: A1BLABLA
Update of nonexisting embedding ID: A1BLABLA


In [49]:
text_collection.get(ids=['A1BLABLA'])

{'ids': [],
 'embeddings': None,
 'metadatas': [],
 'documents': [],
 'uris': None,
 'data': None,
 'included': ['metadatas', 'documents']}

#### Upsert in Text Collection

```python
def upsert(ids: OneOrMany[ID],
           embeddings: Optional[OneOrMany[Embedding]] = None,
           metadatas: Optional[OneOrMany[Metadata]] = None,
           documents: Optional[OneOrMany[Document]] = None) -> None

```


**Arguments:**

`ids` - The ids of the embeddings to update

`embeddings` - The embeddings to add. If None, embeddings will be computed based on the documents using the embedding_function set for the Collection. Optional.

`metadatas` - The metadata to associate with the embeddings. When querying, you can filter on this metadata. Optional.

`documents` - The documents to associate with the embeddings. Optional.

**Returns:**

None

Update the embeddings, metadatas or documents for provided ids, or create them if they don't exist.





In [57]:
text_collection.upsert(ids=['A1BLABLA'], 
                       documents=['Fish Burger'], 
                       metadatas=[{'ver': -1}])

In [58]:
text_collection.get(ids=['A1BLABLA'])

{'ids': ['A1BLABLA'],
 'embeddings': None,
 'metadatas': [{'ver': -1}],
 'documents': ['Fish Burger'],
 'uris': None,
 'data': None,
 'included': ['metadatas', 'documents']}
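
The difference observed above, where `update` skips an unknown id with a warning while `upsert` creates it, can be summarised with a plain dictionary. `toy_update` and `toy_upsert` are hypothetical helpers that only illustrate the semantics; they do not compute embeddings.

```python
def toy_update(store, item_id, document, metadata):
    """Change an existing record; unknown ids are reported and skipped."""
    if item_id not in store:
        print(f"Update of nonexisting embedding ID: {item_id}")
        return
    store[item_id] = {'document': document, 'metadata': metadata}


def toy_upsert(store, item_id, document, metadata):
    """Change the record if it exists, otherwise create it."""
    store[item_id] = {'document': document, 'metadata': metadata}


store = {'A1': {'document': 'Vegan Chicken Salad', 'metadata': {'ver': 0}}}

toy_update(store, 'A1', 'Chicken Burger', {'ver': -1})     # existing id: changed
toy_update(store, 'A1BLABLA', 'Fish Burger', {'ver': -1})  # unknown id: skipped
ids_after_update = sorted(store)

toy_upsert(store, 'A1BLABLA', 'Fish Burger', {'ver': -1})  # unknown id: created
ids_after_upsert = sorted(store)

print(ids_after_update, ids_after_upsert)
```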

### Query Text

https://github.com/chroma-core/chroma/blob/main/chromadb/api/models/Collection.py#L141
```python
    def query(
        self,
        query_embeddings: Optional[  # type: ignore[type-arg]
            Union[
                OneOrMany[Embedding],
                OneOrMany[np.ndarray],
            ]
        ] = None,
        query_texts: Optional[OneOrMany[Document]] = None,
        query_images: Optional[OneOrMany[Image]] = None,
        query_uris: Optional[OneOrMany[URI]] = None,
        n_results: int = 10,
        where: Optional[Where] = None,
        where_document: Optional[WhereDocument] = None,
        include: Include = ["metadatas", "documents", "distances"],
    ) -> QueryResult:
        """Get the n_results nearest neighbor embeddings for provided query_embeddings or query_texts.

        Args:
            query_embeddings: The embeddings to get the closes neighbors of. Optional.
            query_texts: The document texts to get the closes neighbors of. Optional.
            query_images: The images to get the closes neighbors of. Optional.
            n_results: The number of neighbors to return for each query_embedding or query_texts. Optional.
            where: A Where type dict used to filter results by. E.g. `{"$and": [{"color" : "red"}, {"price": {"$gte": 4.20}}]}`. Optional.
            where_document: A WhereDocument type dict used to filter by the documents. E.g. `{$contains: {"text": "hello"}}`. Optional.
            include: A list of what to include in the results. Can contain `"embeddings"`, `"metadatas"`, `"documents"`, `"distances"`. Ids are always included. Defaults to `["metadatas", "documents", "distances"]`. Optional.

        Returns:
            QueryResult: A QueryResult object containing the results.

        Raises:
            ValueError: If you don't provide either query_embeddings, query_texts, or query_images
            ValueError: If you provide both query_embeddings and query_texts
            ValueError: If you provide both query_embeddings and query_images
            ValueError: If you provide both query_texts and query_images

        """
```
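
To make "the `n_results` nearest neighbor embeddings" concrete, here is a brute-force sketch in plain Python: compute the cosine distance from the query vector to every stored vector, sort, and keep the first `n_results`. The two-dimensional vectors and the `toy_query` helper are made up for illustration; Chroma itself relies on an HNSW index rather than a full scan.

```python
import math


def cosine_distance(a, b):
    """One minus the cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def toy_query(query_vec, vectors, n_results=2):
    """Return the n_results (distance, id) pairs closest to query_vec."""
    scored = sorted(
        (cosine_distance(query_vec, vec), item_id) for item_id, vec in vectors.items()
    )
    return scored[:n_results]


# Made-up 2-D "embeddings" keyed by illustrative ids
vectors = {
    'burger': [1.0, 0.1],
    'fries': [0.6, 0.8],
    'salad': [0.0, 1.0],
}

nearest = toy_query([1.0, 0.0], vectors, n_results=2)
print([item_id for _, item_id in nearest])  # ['burger', 'fries']
```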

In [59]:
text_collection.get(ids=['A1', 'A2', 'A3', 'A4', 'A5'])

{'ids': ['A1', 'A2', 'A3', 'A4', 'A5'],
 'embeddings': None,
 'metadatas': [{'ver': -1}, {'ver': 1}, {'ver': 2}, {'ver': 3}, {'ver': 0}],
 'documents': ['Chicken Burger',
  'Spring Rolls (4 pieces)',
  'Pot Stickers (6 pieces)',
  'Fried Wonton (10 pieces)',
  'Fried Tofu with Soy Sauce'],
 'uris': None,
 'data': None,
 'included': ['metadatas', 'documents']}

In [60]:
text_query = "Meet Burger"
text_collection.query(
    query_texts=text_query,
    n_results=5
)

{'ids': [['A1', 'A1BLABLA', 'A18', 'E37', 'E81']],
 'distances': [[0.34587299823760986,
   0.3687518835067749,
   0.6099321842193604,
   0.6772676706314087,
   0.685279369354248]],
 'metadatas': [[{'ver': -1}, {'ver': -1}, {'ver': 1}, {'ver': 3}, {'ver': 2}]],
 'embeddings': None,
 'documents': [['Chicken Burger',
   'Fish Burger',
   'French Fries',
   'Sizzling Vegan Beef and Mushroom',
   'Vegan Beef with Broccoli']],
 'uris': None,
 'data': None,
 'included': ['metadatas', 'documents', 'distances']}
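
Each field in the result is a list of lists, with one inner list per query text, so the matches for a single query sit at index `0`. The sketch below copies the relevant fields from the output above into a plain dict and pairs each id with its document and distance; the variable names `result` and `rows` are illustrative.

```python
# Fields copied from the query output above (one inner list per query text)
result = {
    'ids': [['A1', 'A1BLABLA', 'A18', 'E37', 'E81']],
    'distances': [[0.34587299823760986,
                   0.3687518835067749,
                   0.6099321842193604,
                   0.6772676706314087,
                   0.685279369354248]],
    'documents': [['Chicken Burger',
                   'Fish Burger',
                   'French Fries',
                   'Sizzling Vegan Beef and Mushroom',
                   'Vegan Beef with Broccoli']],
}

# Index 0 selects the results for the first (and only) query text
rows = list(zip(result['ids'][0], result['documents'][0], result['distances'][0]))

for item_id, document, distance in rows:
    print(f"{item_id:<10} {distance:.3f}  {document}")
```

Smaller distances mean closer matches, so the rows are already ordered from most to least similar.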