# Introduction to Embeddings with the OpenAI API - Part 3

## Vector Databases

To enable embedding applications in production, you'll need an efficient vector storage and querying solution: enter vector databases! You'll learn how vector databases can help scale embedding applications and begin creating and adding to your very own vector databases using Chroma.

In [2]:
import os

# Set your OpenAI API key
openai_api_key = os.environ['OPENAI_API_KEY']

### Getting started with ChromaDB
In the following exercises, you'll use a vector database to embed and query 1000 films and TV shows from the Netflix dataset introduced in the video. The goal will be to use this data to generate recommendations based on a search query. To get started, you'll create the database and collection to store the data.

chromadb is available for you to use, and OpenAIEmbeddingFunction() is imported from chromadb.utils.embedding_functions. The code below passes the API key loaded from the OPENAI_API_KEY environment variable to the embedding function.

In [3]:
import chromadb
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

# Create a persistent client
client = chromadb.PersistentClient()

# Create a netflix_titles collection using the OpenAI embedding function
collection = client.create_collection(name="netflix_titles",
                                      embedding_function=OpenAIEmbeddingFunction(model_name="text-embedding-3-small", 
                                                                                 api_key=openai_api_key)
)

# List the collections
print(client.list_collections())

[Collection(name=netflix_titles)]


### Estimating embedding costs with tiktoken
Now that we've created a database and collection to store the Netflix films and TV shows, we can begin embedding data.

Before embedding a large dataset, it's important to estimate the cost so you don't exceed any budget constraints. Because OpenAI embedding models are priced by the number of input tokens, we'll use OpenAI's tiktoken library to count the tokens and convert the total into a dollar cost.

You've been provided with documents, which is a list containing all of the data to embed. You'll iterate over the list, encode each document, and count the total number of tokens. Finally, you'll use the model's pricing to convert this into a cost.

In [5]:
import csv
import tiktoken


ids = []
documents = []

# Build one ID and one text document per row of the CSV
with open('resources/netflix_titles.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        ids.append(row['show_id'])
        text = f"Title: {row['title']} ({row['type']})\nDescription: {row['description']}\nCategories: {row['listed_in']}"
        documents.append(text)

# Load the encoder for the OpenAI text-embedding-3-small model
enc = tiktoken.encoding_for_model("text-embedding-3-small")

# Encode each text in documents and calculate the total tokens
total_tokens = sum(len(enc.encode(document)) for document in documents)

# Price in USD per 1,000 tokens used for this estimate
cost_per_1k_tokens = 0.00002

# Display number of tokens and cost
print('Total tokens:', total_tokens)
print('Cost:', cost_per_1k_tokens * total_tokens / 1000)

Total tokens: 444463
Cost: 0.00888926
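
The conversion from tokens to dollars is simple enough to wrap in a helper. The sketch below is a minimal illustration that reuses the token count and per-1K price from the cell above; the function name `embedding_cost` and the `budget_usd` value are hypothetical and only there to show a budget check.

```python
def embedding_cost(total_tokens: int, price_per_1k_tokens: float) -> float:
    """Return the estimated dollar cost of embedding total_tokens tokens."""
    return total_tokens / 1000 * price_per_1k_tokens


# Token count and price are the values used in the cell above.
cost = embedding_cost(444_463, 0.00002)

# Hypothetical budget, chosen only for illustration.
budget_usd = 0.05
within_budget = cost <= budget_usd

print(f"Estimated cost: ${cost:.4f}")
print("Within budget:", within_budget)
```

Running a check like this before calling the API lets you stop early if the dataset turns out to be larger than expected.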


### Adding data to the collection
Time to add those Netflix films and TV shows to your collection! You've been provided with a list of document IDs and texts, stored in ids and documents, respectively, which have been extracted from netflix_titles.csv using the following code:
```
ids = []
documents = []

with open('netflix_titles.csv') as csvfile:
  reader = csv.DictReader(csvfile)
  for i, row in enumerate(reader):
    ids.append(row['show_id'])
    text = f"Title: {row['title']} ({row['type']})\nDescription: {row['description']}\nCategories: {row['listed_in']}"
    documents.append(text)
```
As an example of what information will be embedded, here's the first document from documents:
```
Title: Dick Johnson Is Dead (Movie)
Description: As her father nears the end of his life, filmmaker Kirsten Johnson stages his death in inventive and comical ways to help them both face the inevitable.
Categories: Documentaries
```
All of the necessary functions and packages have been imported, and a persistent client has been created and assigned to client.

In [None]:
# Add the documents and IDs to the collection
openai_api_array_limit = 2048
for i in range(0, len(documents), openai_api_array_limit):
    collection.add(ids=ids[i : i+openai_api_array_limit], 
                   documents=documents[i : i+openai_api_array_limit])

# Print the collection size and first ten items
print(f"No. of documents: {collection.count()}")
print(f"First ten documents: {collection.peek()}")

No. of documents: 8807
First ten documents: {'ids': ['s1', 's2', 's3', 's4', 's5', 's6', 's7', 's8', 's9', 's10'], 'embeddings': array([[ 0.02259043,  0.05178833, -0.02430926, ...,  0.02265739,
        -0.00401527, -0.02413068],
       [-0.00256908,  0.09567138, -0.04806154, ...,  0.01961534,
         0.03606874, -0.04447048],
       [-0.015072  ,  0.05057291, -0.04685031, ..., -0.00316648,
         0.00111224, -0.04591966],
       ...,
       [-0.02682706,  0.05365412, -0.02775045, ...,  0.03700871,
        -0.02228298, -0.02446997],
       [ 0.01401989,  0.0206609 , -0.0120415 , ...,  0.01189178,
         0.01038392, -0.04914984],
       [ 0.00980071,  0.07244343, -0.03348716, ...,  0.01767378,
         0.02326618, -0.0031016 ]]), 'documents': ['Title: Dick Johnson Is Dead (Movie)\nDescription: As her father nears the end of his life, filmmaker Kirsten Johnson stages his death in inventive and comical ways to help them both face the inevitable.\nCategories: Documentaries', 'Title: Bl
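
To see how the loop above splits the work, the sketch below computes the slice boundaries without calling any API. It assumes the same batch size of 2048 used in `openai_api_array_limit` and the document count printed above; the helper name `batch_ranges` is hypothetical.

```python
from typing import Iterator, Tuple


def batch_ranges(n_items: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, stop) index pairs covering n_items in chunks of batch_size."""
    for start in range(0, n_items, batch_size):
        yield start, min(start + batch_size, n_items)


# 8807 documents, at most 2048 per request (values from the cell above).
ranges = list(batch_ranges(8807, 2048))
sizes = [stop - start for start, stop in ranges]

print(ranges)
print(sizes)
```

Every batch except the last is full, and the sizes add up to the total number of documents, so no item is skipped or sent twice.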

### Querying the Netflix collection
Now that you've created and populated the netflix_titles collection, it's time to query it!

As a first test, you'll use it to recommend films and TV shows about dogs to a dog-loving colleague.

The netflix_titles collection is still available to use, and OpenAIEmbeddingFunction() has been imported.

In [21]:
# Retrieve the netflix_titles collection
collection = client.get_collection(name="netflix_titles",
                                   embedding_function=OpenAIEmbeddingFunction(model_name="text-embedding-3-small", 
                                                                              api_key=openai_api_key)
)

# Query the collection for "films about dogs"
result = collection.query(query_texts=["films about dogs"], 
                          n_results=3)

print(result)

{'ids': [['s95', 's2057', 's830']], 'embeddings': None, 'documents': [['Title: Show Dogs (Movie)\nDescription: A rough and tough police dog must go undercover with an FBI agent as a prim and proper pet at a dog show to save a baby panda from an illegal sale.\nCategories: Children & Family Movies, Comedies', "Title: Hotel for Dogs (Movie)\nDescription: Placed in a foster home that doesn't allow pets, 16-year-old Andi and her younger brother Bruce turn an abandoned hotel into a home for their dog.\nCategories: Children & Family Movies, Comedies", 'Title: Dog Gone Trouble (Movie)\nDescription: The privileged life of a pampered dog named Trouble is turned upside-down when he gets lost and must learn to survive on the big-city streets.\nCategories: Children & Family Movies, Comedies']], 'uris': None, 'data': None, 'metadatas': [[None, None, None]], 'distances': [[0.8597354888916016, 0.8886237144470215, 0.896007776260376]], 'included': [<IncludeEnum.distances: 'distances'>, <IncludeEnum.docu
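
The 'distances' in the result are sorted in ascending order: a smaller distance means a closer match. The sketch below illustrates the idea with a brute-force nearest-neighbor search in plain Python. It is not how Chroma is implemented; it assumes squared Euclidean (L2) distance, and the three-dimensional vectors and IDs are made up for illustration.

```python
from typing import Dict, List, Tuple


def squared_l2(a: List[float], b: List[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(query: List[float],
            items: Dict[str, List[float]],
            n_results: int) -> List[Tuple[str, float]]:
    """Return the n_results closest (id, distance) pairs, closest first."""
    scored = [(item_id, squared_l2(query, vec)) for item_id, vec in items.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:n_results]


# Made-up toy embeddings; real embeddings have many more dimensions.
toy_embeddings = {
    "dog_film": [0.9, 0.1, 0.0],
    "cat_film": [0.6, 0.6, 0.0],
    "space_film": [0.0, 0.1, 0.9],
}
toy_query = [1.0, 0.0, 0.0]

top = nearest(toy_query, toy_embeddings, n_results=2)
print([item_id for item_id, _ in top])
```

A vector database avoids comparing the query against every stored vector by using an index, but the ranking principle is the same.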

### Updating and deleting items from a collection
Storing documents in a vector database doesn't lock them in place: you can still add new items, update existing ones, and delete items you no longer need.

In this exercise, you've been provided with two new Netflix titles stored in new_data:
```
[{"id": "s1001", "document": "Title: Cats & Dogs (Movie)\nDescription: A look at the top-secret, high-tech espionage war going on between cats and dogs, of which their human owners are blissfully unaware."},
 {"id": "s6884", "document": 'Title: Goosebumps 2: Haunted Halloween (Movie)\nDescription: Three teens spend their Halloween trying to stop a magical book, which brings characters from the "Goosebumps" novels to life.\nCategories: Children & Family Movies, Comedies'}]
```
You'll either add or update these IDs in the database depending on whether they're already present in the collection.

In [22]:
new_data = [
  {
    'id': 's1001',
    'document': 'Title: Cats & Dogs (Movie)\nDescription: A look at the top-secret, high-tech espionage war going on between cats and dogs, of which their human owners are blissfully unaware.'
  },
  {
    'id': 's6884',
    'document': 'Title: Goosebumps 2: Haunted Halloween (Movie)\nDescription: Three teens spend their Halloween trying to stop a magical book, which brings characters from the "Goosebumps" novels to life.\nCategories: Children & Family Movies, Comedies'
  }
]

collection = client.get_collection(name="netflix_titles",
                                   embedding_function=OpenAIEmbeddingFunction(model_name="text-embedding-3-small", 
                                                                              api_key=openai_api_key)
)

# Update or add the new documents
collection.upsert(ids=[d["id"] for d in new_data],
                  documents=[d["document"] for d in new_data])

# Delete the item with ID "s95"
collection.delete(ids=["s95"])

result = collection.query(query_texts=["films about dogs"],
                          n_results=3)
print(result)

{'ids': [['s2057', 's830', 's6117']], 'embeddings': None, 'documents': [["Title: Hotel for Dogs (Movie)\nDescription: Placed in a foster home that doesn't allow pets, 16-year-old Andi and her younger brother Bruce turn an abandoned hotel into a home for their dog.\nCategories: Children & Family Movies, Comedies", 'Title: Dog Gone Trouble (Movie)\nDescription: The privileged life of a pampered dog named Trouble is turned upside-down when he gets lost and must learn to survive on the big-city streets.\nCategories: Children & Family Movies, Comedies', 'Title: All Dogs Go to Heaven (Movie)\nDescription: When a canine con artist becomes an angel, he sneaks back to Earth and crosses paths with an orphan girl who can speak to animals.\nCategories: Children & Family Movies']], 'uris': None, 'data': None, 'metadatas': [[None, None, None]], 'distances': [[0.8886237144470215, 0.896007776260376, 0.9038515090942383]], 'included': [<IncludeEnum.distances: 'distances'>, <IncludeEnum.documents: 'docum
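
The add-or-update behavior can be pictured with a plain dictionary standing in for the collection. This is only a mental model, not Chroma's implementation: the `upsert` and `delete` helpers, the starting contents of `store`, and the shortened document texts are all hypothetical.

```python
from typing import Dict, List


def upsert(store: Dict[str, str], ids: List[str], documents: List[str]) -> None:
    """Insert new IDs and overwrite the documents of existing ones."""
    for item_id, document in zip(ids, documents):
        store[item_id] = document


def delete(store: Dict[str, str], ids: List[str]) -> None:
    """Remove the given IDs; in this sketch, absent IDs are ignored."""
    for item_id in ids:
        store.pop(item_id, None)


# Hypothetical starting state: s6884 already exists, s1001 does not.
store = {
    "s95": "Title: Show Dogs (Movie)",
    "s6884": "Title: Goosebumps 2: Haunted Halloween (Movie) [old text]",
}

upsert(store,
       ids=["s1001", "s6884"],
       documents=["Title: Cats & Dogs (Movie)",
                  "Title: Goosebumps 2: Haunted Halloween (Movie) [new text]"])
delete(store, ids=["s95"])

print(sorted(store))
```

After these calls, s1001 has been added, s6884 has been overwritten, and s95 is gone, which mirrors why 's95' no longer appears in the query result above.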

### Querying with multiple texts
In many cases, you'll want to query the vector database using multiple query texts. Recall that these query texts are embedded using the same embedding function that was used when the documents were added.

In this exercise, you'll use the documents from two IDs in the netflix_titles collection to query the rest of the collection, returning the most similar results as recommendations.

The netflix_titles collection is still available to use, and OpenAIEmbeddingFunction() has been imported.

In [23]:
collection = client.get_collection(name="netflix_titles",
                                   embedding_function=OpenAIEmbeddingFunction(model_name="text-embedding-3-small", 
                                                                              api_key=openai_api_key)
)

reference_ids = ['s999', 's1000']

# Retrieve the documents for the reference_ids
reference_texts = collection.get(ids=reference_ids)["documents"]

# Query using reference_texts
result = collection.query(query_texts=reference_texts,
                          n_results=3)

print(result['documents'])

[['Title: Shuddhi (Movie)\nDescription: An American woman on a revenge mission travels to India and befriends two journalists seeking justice for violent crimes against women.\nCategories: Dramas, International Movies, Thrillers', "Title: Shakti: The Power (Movie)\nDescription: A young mother must break free of the clutches of her husband's feudal family and bring herself and her young child to safety.\nCategories: Dramas, International Movies, Thrillers", "Title: Saavat (Movie)\nDescription: In rural India, a detective's investigation of seven confounding suicides uncovers revelations about the myths and mindsets of the rest of the village.\nCategories: Dramas, Independent Movies, International Movies"], ['Title: Stowaway (Movie)\nDescription: A three-person crew on a mission to Mars faces an impossible choice when an unplanned passenger jeopardizes the lives of everyone on board.\nCategories: Dramas, International Movies, Thrillers', "Title: Kidnapping Stella (Movie)\nDescription: Sn
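
The printed result is a list of lists: the first inner list holds the matches for the first query text, the second inner list holds the matches for the second, and so on. The sketch below pairs each query text with the titles of its matches. The helper `titles_by_query`, the placeholder query texts, and the abbreviated documents are hypothetical stand-ins shaped like the output above.

```python
from typing import Dict, List


def titles_by_query(query_texts: List[str],
                    documents: List[List[str]]) -> Dict[str, List[str]]:
    """Map each query text to the titles of its matching documents."""
    prefix = "Title: "
    grouped: Dict[str, List[str]] = {}
    for query_text, hits in zip(query_texts, documents):
        titles = []
        for hit in hits:
            first_line = hit.split("\n", 1)[0]
            # Strip the "Title: " prefix used when the documents were built.
            if first_line.startswith(prefix):
                first_line = first_line[len(prefix):]
            titles.append(first_line)
        grouped[query_text] = titles
    return grouped


# Placeholder queries and abbreviated documents, for illustration only.
mock_queries = ["first reference text", "second reference text"]
mock_documents = [
    ["Title: Shuddhi (Movie)\nDescription: (abbreviated)",
     "Title: Shakti: The Power (Movie)\nDescription: (abbreviated)",
     "Title: Saavat (Movie)\nDescription: (abbreviated)"],
    ["Title: Stowaway (Movie)\nDescription: (abbreviated)"],
]

grouped = titles_by_query(mock_queries, mock_documents)
for query_text, titles in grouped.items():
    print(query_text, "->", titles)
```

With a real result, you would pass the list used for query_texts and result['documents'] instead of the mock values.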

### Filtering using metadata
Having metadata available in the database makes it easier to filter results on additional conditions. Imagine that the film recommendations you've been creating could access the user's saved preferences and use them to further filter the results.

In this exercise, you'll be using additional metadata to filter your Netflix film recommendations. The netflix_titles collection has been updated to add metadatas to each title, including the 'rating', the age rating given to the title, and 'release_year', the year the title was initially released.

Here's a preview of an updated item:
```
{'ids': ['s999'],
 'embeddings': None,
 'metadatas': [{'rating': 'TV-14', 'release_year': 2021}],
 'documents': ['Title: Searching For Sheela (Movie)\nDescription: Journalists and fans await Ma Anand Sheela as the infamous former Rajneesh commune’s spokesperson returns to India after decades for an interview tour.\nCategories: Documentaries, International Movies'],
 'uris': None,
 'data': None}
```

In [34]:
ids = []
metadatas = []

with open('resources/netflix_titles.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        ids.append(row['show_id'])
        metadatas.append({
            "rating":row['rating'],
            "release_year": int(row['release_year'])
        })

collection = client.get_collection(name="netflix_titles",
                                   embedding_function=OpenAIEmbeddingFunction(model_name="text-embedding-3-small", 
                                                                              api_key=openai_api_key)
)

collection.update(ids=ids, metadatas=metadatas)

Update of nonexisting embedding ID: s95


In [35]:
reference_texts = ["children's story about a car", "lions"]

# Query two results using reference_texts
result = collection.query(query_texts=reference_texts,
                          n_results=2,
                          # Filter for titles with a G rating released before 2019
                          where={
                              "$and": [
                                  {"rating": {"$eq": "G"}},
                                  {"release_year": {"$lt": 2019}}
                              ]
                          }
)

print(result['documents'])

[['Title: Chitty Chitty Bang Bang (Movie)\nDescription: Quirky inventor Caractacus Potts and his family travel in their magical flying car to Vulgaria, a kingdom strangely devoid of children.\nCategories: Children & Family Movies, Classic Movies, Comedies', "Title: Charlotte's Web (Movie)\nDescription: Follow the adventures of Wilbur the pig, Templeton the rat and Charlotte the spider in this animated musical version of E.B. White's timeless story.\nCategories: Children & Family Movies, Classic Movies"], ['Title: Ghost of the Mountains (Movie)\nDescription: An international group of filmmakers sets out on a mission to get up close and personal with a family of elusive snow leopards.\nCategories: Children & Family Movies, Documentaries', 'Title: Balto 2: Wolf Quest (Movie)\nDescription: Half-dog, half-wolf Balto (voiced by Maurice LaMarche) and his wife proudly put their pups up for adoption to humans, but when nobody takes daughter Aleu because she looks too much like a wolf, she runs 
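
To make the filter's logic concrete, the sketch below evaluates the same where dictionary against plain metadata dictionaries. It is an illustration of the semantics only, not Chroma's implementation: the `matches` helper handles just the three operators used above, and the item IDs and metadata values are made up.

```python
from typing import Any, Dict


def matches(metadata: Dict[str, Any], where: Dict[str, Any]) -> bool:
    """Return True if metadata satisfies a where clause ($and, $eq, $lt only)."""
    if "$and" in where:
        return all(matches(metadata, clause) for clause in where["$and"])

    # A leaf clause looks like {"field": {"$op": expected}}.
    (field, condition), = where.items()
    (operator, expected), = condition.items()

    value = metadata.get(field)
    if value is None:
        return False
    if operator == "$eq":
        return value == expected
    if operator == "$lt":
        return value < expected
    raise ValueError(f"Unsupported operator in this sketch: {operator}")


where = {
    "$and": [
        {"rating": {"$eq": "G"}},
        {"release_year": {"$lt": 2019}},
    ]
}

# Made-up metadata for three hypothetical items.
toy_metadatas = {
    "a": {"rating": "G", "release_year": 1968},
    "b": {"rating": "G", "release_year": 2020},
    "c": {"rating": "TV-14", "release_year": 2010},
}

kept = [item_id for item_id, md in toy_metadatas.items() if matches(md, where)]
print(kept)
```

Only items that satisfy every clause under "$and" remain candidates, and the similarity ranking is then applied to that reduced set.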