# Vector Databases for Embeddings Systems

## Limitations of the Current Approach

* Loading all the embeddings into memory (1536 floats ≈ 13 kB per embedding)
* Recalculating the embeddings for each new query
* Calculating cosine distances for every embedding and sorting them is **slow and scales linearly**
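
The linear scan described in the last bullet can be sketched in plain Python. This is a minimal illustration, not the course's original code: the function names and the toy 3-dimensional vectors are assumptions made for the example.

```python
import math

def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def brute_force_search(query, embeddings, n_results=3):
    """Compare the query against every stored embedding, then sort."""
    scored = [
        (cosine_distance(query, emb), doc_id)
        for doc_id, emb in embeddings.items()
    ]
    scored.sort()  # smallest distance first
    return scored[:n_results]

# Toy 3-dimensional embeddings (hypothetical; real ones have 1536 dimensions).
embeddings = {
    "doc-a": [1.0, 0.0, 0.0],
    "doc-b": [0.0, 1.0, 0.0],
    "doc-c": [0.9, 0.1, 0.0],
}

top = brute_force_search([1.0, 0.0, 0.0], embeddings, n_results=2)
print(top)
```

Every query touches every embedding, so the work grows in step with the size of the collection.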

![](images/current_approach_vector_db.png)

## Vector Database

* Embedded documents are *stored in* and *queried from* the **vector database**

![](images/current_approach_vector_db_1.png)



## SQL vs NoSQL DBs

### NoSQL Database

* More flexible structure that allows for faster querying

### SQL/Relational DBs

* Structure data into tables, rows, and columns

## Components to Store

* Embeddings
* Source texts
    * Some vector databases don't support storing text; in those cases we need a SQL DB for the text and keep a reference to it by ID
* Metadata
    * IDs and references
    * Additional data useful for filtering results

❌ Don't store the source text as metadata
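
For a vector database that cannot store text, the reference-by-ID pattern might look like the sketch below. It uses an in-memory SQLite database as the text store. The table name, the `fetch_texts` helper, and the `matched_ids` list (standing in for IDs returned by a vector search) are all assumptions made for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for the separate text store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id TEXT PRIMARY KEY, text TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO documents (id, text) VALUES (?, ?)",
    [("doc-1", "This is document 1"), ("doc-2", "This is document 2")],
)

def fetch_texts(ids):
    """Look up the source texts for IDs returned by a vector search."""
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        f"SELECT id, text FROM documents WHERE id IN ({placeholders})", ids
    ).fetchall()
    by_id = dict(rows)
    # Preserve the ranking order produced by the vector search.
    return [by_id[item_id] for item_id in ids]

# Pretend the vector database returned these IDs, best match first.
matched_ids = ["doc-2", "doc-1"]
texts = fetch_texts(matched_ids)
print(texts)
```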

## Options

![](images/vector_databases_options.png)

### Which solution is best?

* **Database Management**
    * Managed -> more expensive but lowers workload
    * Self-Managed -> cheaper but requires time and expertise
* **Open source or commercial**
    * Open source -> flexible and cost effective
    * Commercial -> better support, more advanced features, and compliance
* **Data models**
    * Does the type of data lend itself to a particular database type?
* **Specific features**
    * Does your use case depend on specific functionality, such as multi-modal storage?





# ChromaDB

## Installing ChromaDB

* *ChromaDB* is a simple yet powerful vector database
* Two flavors
    * **Local**: great for development and prototyping
    * **Client/Server**: made for production

![](images/chroma_installation_types.png)


## Connecting to the Database

* Data will be persisted to disk

```{python}
import chromadb

client = chromadb.PersistentClient(path="/path/to/save/to")

```

## Creating a Collection

* Collections are analogous to tables
* Collections are able to create embeddings automatically

```{python}
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

collection = client.create_collection(
    name="my_collection",
    embedding_function=OpenAIEmbeddingFunction(
        model_name="text-embedding-3-small",
        api_key="..."
    )
)
```

## Inspecting Collections

```{python}
client.list_collections()
```
```bash
[Collection(name=my_collection)]
```

## Inserting Embeddings

**Single document**

```{python}
collection.add(ids=["my-doc"], documents=["This is the source text"])
```

* IDs must be provided

**Multiple documents**

```{python}
collection.add(
    ids=["my-doc-1","my-doc-2"],
    documents=["This is document 1","This is document 2"]
)
```
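
When there are many documents, it can help to add them in batches, since embedding APIs and databases commonly cap the size of a single request. The sketch below only shows the slicing logic. The `batched` helper and the batch size of 2 are hypothetical choices, and the `collection.add` call is left as a comment so the example runs without a database.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

ids = [f"my-doc-{i}" for i in range(1, 6)]
documents = [f"This is document {i}" for i in range(1, 6)]

# Slice both lists the same way so IDs and documents stay aligned.
batches = list(zip(batched(ids, 2), batched(documents, 2)))

for id_batch, doc_batch in batches:
    # With a real collection this would be:
    # collection.add(ids=id_batch, documents=doc_batch)
    print(id_batch, doc_batch)
```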

## Inspecting a Collection

**Counting documents in a collection**

```{python}
collection.count()
```
```bash
3
```

**Peeking at the first 10 items**

```{python}
collection.peek()
```
```bash
{
    'ids':['my-doc','my-doc-1','my-doc-2'],
    'embeddings': [...],
    'documents':["This is the source text","This is document 1","This is document 2"],
    'metadatas':[None, None, None]
}
```


## Retrieving Items

```{python}
collection.get(ids=["s59"])
```
```bash
{'ids': ['s59'],
 'embeddings': None,
 'metadatas': [None],
 'documents': ['Title: Naruto Shippūden the Movie: The Will of Fire (Movie)\nDescription: When ...'],
 'uris': None,
 'data': None}
```

## Netflix Dataset

```text
Title: Kota Factory (TV Show)
Description: In a city of coaching centers known to train India's finest...
Categories: International TV Shows, Romantic TV Shows, TV Comedies

Title: The Last Letter From Your Lover (Movie)
Description: After finding a trove of love letters from 1965, a reporter sets...
Categories: Dramas, Romantic Movies

```

### Estimating Embedding Cost

* Embedding model (text-embedding-3-small) costs $0.00002/1k tokens

```{python}
cost = 0.00002 * len(tokens) / 1000
```

* Count tokens with the `tiktoken` library
  * `pip install tiktoken`
  
```{python}
import tiktoken

enc = tiktoken.encoding_for_model("text-embedding-3-small")
total_tokens = sum(len(enc.encode(text)) for text in documents)

cost_per_1k_tokens = 0.00002

print('Total tokens:',total_tokens)
print('Cost:', cost_per_1k_tokens * total_tokens/1000)
```
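
As a quick sanity check on the arithmetic, the formula can be wrapped in a small function and tried on a round number. The `embedding_cost` name and the one-million-token figure are hypothetical and are not measurements of the Netflix dataset.

```python
def embedding_cost(total_tokens, cost_per_1k_tokens=0.00002):
    """Estimated cost in dollars for embedding `total_tokens` tokens."""
    return cost_per_1k_tokens * total_tokens / 1000

# Hypothetical workload of one million tokens: roughly two cents.
cost = embedding_cost(1_000_000)
print(f"${cost:.4f}")
```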


## Querying the Database

* We want to find similar items in the collection
  
**Original approach**

![](images/querying_database.png)

**ChromaDB approach**

![](images/querying_database_with_chromadb.png)

### Retrieve the Collection

* You must specify the same embedding function that was used when adding data to the collection

```{python}
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

collection = client.get_collection(
    name="netflix_titles",
    embedding_function=OpenAIEmbeddingFunction(
        model_name="text-embedding-3-small",  # same model used when the data was added
        api_key="..."
    )
)
```

```{python}
result = collection.query(
    query_texts=["movies where people sing a lot"],
    n_results=3
)

print(result)
```
----
```bash
{
    'ids': [['s4068', 's293', 's2213']],
    'embeddings': None,
    'documents': [[
        'Title: Quién te cantará (Movie)\nDescription: When a near-...',
        'Title: Quartet (Movie)\nDescription: To save their posh retirement home, ...',
        'Title: Sing On! Spain (TV Show)\nDescription: In this fast-paced, high-...'
    ]],
    'metadatas': [[None, None, None]],
    'distances': [[0.350419282913208, 0.36049118638038635, 0.37080681324005127]]
}
```

`query()` returns a dict with multiple keys:

* **ids** : The ids of the returned items
* **embeddings** : The embeddings of the returned items
  * Not returned by default, so this is `None`
* **documents** : The source texts of the returned items
* **metadatas** : The metadatas of the returned items
* **distances** : The distances of the returned items from the query text

![](images/chroma_query_result.png)
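
Because each key holds one inner list per query text, the parallel lists can be zipped into rows for easier reading. The sketch below reuses the result printed above, with documents truncated exactly as shown there. The `rows_for_query` helper is a hypothetical name, not part of ChromaDB.

```python
# Result copied from the query output above (documents truncated as shown).
result = {
    'ids': [['s4068', 's293', 's2213']],
    'embeddings': None,
    'documents': [[
        'Title: Quién te cantará (Movie)\nDescription: When a near-...',
        'Title: Quartet (Movie)\nDescription: To save their posh retirement home, ...',
        'Title: Sing On! Spain (TV Show)\nDescription: In this fast-paced, high-...',
    ]],
    'metadatas': [[None, None, None]],
    'distances': [[0.350419282913208, 0.36049118638038635, 0.37080681324005127]],
}

def rows_for_query(result, query_index=0):
    """Zip the parallel lists for one query into (id, distance, document) tuples."""
    return list(zip(
        result['ids'][query_index],
        result['distances'][query_index],
        result['documents'][query_index],
    ))

rows = rows_for_query(result)
for item_id, distance, document in rows:
    title = document.split('\n')[0]  # first line holds the title
    print(f"{item_id}  {distance:.3f}  {title}")
```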


## Updating a Collection

* Include only the fields to update; other fields will be unchanged
* **The collection will automatically create the new embeddings**

```{python}
collection.update(
    ids=["id-1","id-2"],
    documents=["New document 1","New document 2"]
)
```

## Upserting a Collection

* Useful when you're not sure whether the IDs already exist in the collection
* If IDs are missing -> add them
* If IDs are present -> update them

```{python}
collection.upsert(
    ids=["id-1","id-2"],
    documents=["New document 1","New document 2"]
)
```
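
The add-or-update rule can be illustrated with a plain dictionary. This is a toy model of the behaviour described in the bullets above, not ChromaDB's implementation; the `store` dict and the `upsert` function are hypothetical.

```python
# Toy store mapping ID -> document; "id-1" already exists.
store = {"id-1": "Old document 1"}

def upsert(store, ids, documents):
    """Add missing IDs and overwrite existing ones; report which was which."""
    added, updated = [], []
    for item_id, document in zip(ids, documents):
        if item_id in store:
            updated.append(item_id)
        else:
            added.append(item_id)
        store[item_id] = document
    return added, updated

added, updated = upsert(
    store,
    ids=["id-1", "id-2"],
    documents=["New document 1", "New document 2"],
)
print("added:", added)
print("updated:", updated)
```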

## Deleting from a Collection

**Delete items from a collection**

```{python}
collection.delete(ids=["id-1","id-2"])
```

### Delete all collections

```{python}
# Requires allow_reset=True in the client settings
client.reset()
```

⚠️ **Warning**: this will delete **everything** in the database!


## Multiple Queries and Filtering

### Movie recommendations based on multiple datapoints

* Terrifier (id:'s8170')
* Strawberry Shortcake: Berry Bitty Adventures (id:'s8103')

```{python}
reference_ids = ['s8170', 's8103']

# Fetch the stored source texts of the reference items
reference_texts = collection.get(ids=reference_ids)["documents"]

# One query per reference text
result = collection.query(
    query_texts=reference_texts,
    n_results=3
)
```
```bash
{
  'ids': [['s8170', 's6939', 's7000'], ['s8103', 's2968', 's3085']],
  'embeddings': None,
  'documents': [
    [
      'Title: Terrifier (Movie)...',
      'Title: Haunters: The Art of the Scare (Movie)...',
      'Title: Horror Story (Movie)...'
    ],
    [
      'Title: Strawberry Shortcake: Berry Bitty Adventures (TV Show)...',
      'Title: Shopkins (TV Show)...',
      'Title: Rainbow Ruby (TV Show)...'
    ]
  ],
  'metadatas': [[None, None, None], [None, None, None]],
  'distances': [[0.00, 0.25, 0.26], [0.00, 0.25, 0.28]]
}

```

* We got two lists of results
  * one for each text in `query_texts`
  * the first match in each list is the reference item itself (distance 0.00)
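
Since the reference items are already in the collection, they come back as their own closest matches. One possible way to turn the two lists into a single recommendation list is to drop the reference IDs and sort what remains by distance. The sketch below uses only the `ids` and `distances` copied from the output above; the merging strategy is an assumption, not something ChromaDB does for you.

```python
reference_ids = ['s8170', 's8103']

# IDs and distances copied from the query output above.
result = {
    'ids': [['s8170', 's6939', 's7000'], ['s8103', 's2968', 's3085']],
    'distances': [[0.00, 0.25, 0.26], [0.00, 0.25, 0.28]],
}

recommendations = []
for ids, distances in zip(result['ids'], result['distances']):
    for item_id, distance in zip(ids, distances):
        # Skip the reference items, which match themselves at distance 0.
        if item_id not in reference_ids:
            recommendations.append((distance, item_id))

recommendations.sort()  # closest matches first
print(recommendations)
```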


## Adding Metadata to Filter

* Create a list of dicts for the metadatas
* Create a list of IDs to add them to the existing items


```{python}
import csv

ids = []
metadatas = []

with open('netflix_titles.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        ids.append(row['show_id'])
        metadatas.append({
            "type": row['type'],
            "release_year": int(row['release_year'])
        })

```

```{python}
collection.update(ids=ids, metadatas=metadatas)
```
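
The `int(...)` conversion in the loop matters because `csv` yields every field as a string, while numeric filters such as `$gt` compare numbers. The sketch below runs the same loop on two made-up rows held in memory. The rows and their IDs are hypothetical and are not taken from `netflix_titles.csv`.

```python
import csv
import io

# Made-up rows in the same column layout; not taken from the real dataset.
sample_csv = io.StringIO(
    "show_id,type,release_year\n"
    "x1,Movie,2021\n"
    "x2,TV Show,2019\n"
)

ids = []
metadatas = []
for row in csv.DictReader(sample_csv):
    ids.append(row["show_id"])
    metadatas.append({
        "type": row["type"],
        # csv yields strings; convert so numeric operators can compare the value.
        "release_year": int(row["release_year"]),
    })

print(ids)
print(metadatas)
```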

**Querying with metadata**

```{python}
result = collection.query(
    query_texts=reference_texts,
    n_results=3,
    where={
        "type": "Movie"
    }
)
```

### Where Operators

The following two filters are equivalent:

```{python}
where={
    "type":"Movie"
}
```

```{python}
where={
    "type":{
        "$eq":"Movie"
    }
}
```

**List of Operators**

* `$eq` - equal to (string, int, float)
* `$ne` - not equal to (string, int, float)
* `$gt` - greater than (int, float)
* `$gte` - greater than or equal (int, float)
* `$lt` - less than (int, float)
* `$lte` - less than or equal (int, float)


### Multiple Where Filters

* Filters can be combined with `$and` and `$or`

```{python}
where={
    "$and":[
        {"type":
            {"$eq": "Movie"}    
        },
        {"release_year":
            {"$gt": 2020}
        }
    ]
}
```
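
To make the filter semantics concrete, here is a toy evaluator that checks a metadata dict against a `where` filter. It is an illustration of how the operators read, not ChromaDB's implementation. The `matches` function and the sample metadata are hypothetical, and treating a missing key as a non-match is a design choice of this sketch.

```python
import operator

# Map each filter operator to the matching Python comparison.
OPERATORS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}

def matches(metadata, where):
    """Return True if `metadata` satisfies the `where` filter."""
    for key, condition in where.items():
        if key == "$and":
            if not all(matches(metadata, clause) for clause in condition):
                return False
        elif key == "$or":
            if not any(matches(metadata, clause) for clause in condition):
                return False
        else:
            # Shorthand {"type": "Movie"} means {"type": {"$eq": "Movie"}}.
            if not isinstance(condition, dict):
                condition = {"$eq": condition}
            if key not in metadata:
                return False
            for op, expected in condition.items():
                if not OPERATORS[op](metadata[key], expected):
                    return False
    return True

where = {
    "$and": [
        {"type": {"$eq": "Movie"}},
        {"release_year": {"$gt": 2020}},
    ]
}

print(matches({"type": "Movie", "release_year": 2021}, where))
print(matches({"type": "Movie", "release_year": 2019}, where))
print(matches({"type": "TV Show", "release_year": 2021}, where))
```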
