# 1. What are Embeddings?
- Texts (words, phrases, entire documents) are represented in **numerical form**.
- Embedding models **map text into a multi-dimensional vector space**. The numbers output by the model are the text's coordinates in that space.
    - similar words are **mapped** closer together in that space, and dissimilar words are further away.
- Embeddings allow **semantic meaning** to be captured from the text.
    - Semantic meaning means: **context and intent behind text**
- Most powerful use cases:
    - **Search engines**
        - Traditional search engines rely on **keyword pattern matching**. They can miss the true intent behind a query and miss word variations (e.g., "soft" instead of "comfortable", or "sneakers" instead of "shoes").
        - Semantic search engines use embeddings to capture the intent and context behind a search query and return more relevant results. The query is passed to an embedding model, the resulting vector is placed in the vector space, and the embedded documents closest to it are returned.
    - **Recommendation systems**
        - Embeddings enable more sophisticated recommender systems. For example, a job board can recommend postings that are semantically similar to the job descriptions a user has already viewed.
    - **Classification**
        - cluster observations, classify sentiment, and perform categorization based on the semantic similarity between texts.
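
The idea that similar texts sit close together can be sketched with a toy example. The 2-D vectors below are hand-picked, hypothetical values rather than real model output (real embeddings have far more dimensions), and `nearest` is a helper name introduced only for this illustration.

```python
import math

# Hypothetical 2-D "embeddings", hand-picked for illustration only.
toy_vectors = {
    "shoes": (0.90, 0.10),
    "sneakers": (0.85, 0.20),
    "banana": (0.10, 0.95),
}

def euclidean(a, b):
    """Straight-line distance between two points in the toy space."""
    return math.dist(a, b)

def nearest(word, vectors):
    """Return the other word whose vector lies closest to `word`."""
    others = (w for w in vectors if w != word)
    return min(others, key=lambda w: euclidean(vectors[word], vectors[w]))

print(nearest("shoes", toy_vectors))
```

With these made-up coordinates, "sneakers" is the nearest neighbour of "shoes", which mirrors how a semantic search would treat the two words as related.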

## How to use an OpenAI Embedding endpoint

In [6]:
# !pip install openai
from openai import OpenAI


# instantiate an OpenAI client and pass it our API key
client = OpenAI(api_key="????")

# to create a request, we call the create method on client.embeddings
# input can be a single string or a list of strings
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Embeddings are a numerical representation of text that can be used to measure the relatedness between two pieces of text."
)

# we call .model_dump() on the response to convert it into a dictionary
response_dict = response.model_dump()
print(response_dict)

# The response from the API is long. By default, text-embedding-3-small outputs 1536 numbers to represent this input string.
# Having it in a dictionary makes it easier to dig into.
# this will print the embedding of the input string
print(response_dict['data'][0]['embedding'])

print(response_dict['usage']['total_tokens'])
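
To make the dictionary layout easier to picture without calling the API, here is a small sketch using a mock dictionary. The values are made up, the embedding is truncated to three numbers, and `extract_embeddings` is a hypothetical helper; the exact set of keys in a real response may differ slightly from this mock.

```python
# Hypothetical stand-in for response.model_dump(); values are made up and the
# embedding is truncated to 3 numbers instead of 1536.
mock_response_dict = {
    "object": "list",
    "data": [
        {"object": "embedding", "index": 0, "embedding": [0.01, -0.02, 0.03]},
    ],
    "model": "text-embedding-3-small",
    "usage": {"prompt_tokens": 5, "total_tokens": 5},
}

def extract_embeddings(response_dict):
    """Collect one embedding per input, ordered by each item's 'index'."""
    ordered = sorted(response_dict["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in ordered]

vectors = extract_embeddings(mock_response_dict)
print(len(vectors), len(vectors[0]))
print(mock_response_dict["usage"]["total_tokens"])
```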

## Investigating the vector space

In [14]:
articles = [
    {"headline": "The Blues get promoted on the final day of the season!", "topic": "Sport"},
    {"headline": "1.5 Billion Tune-in to the World Cup Final", "topic": "Sport"},
    {"headline": "New Particle Discovered in CERN", "topic": "Science"},
    {"headline": "Scientists make breakthrough discovery in renewable energy", "topic": "Science"}
]

# The goal
# We will embed each headline's text and add it back into the articles list, stored under the "embedding" key.
# Each article then looks like {"headline": "...", "topic": "...", "embedding": [?, ?, ?, ...]}

headline_text = [article['headline'] for article in articles]

# We can pass a list of strings to the input
# create embedding for each article
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=headline_text
)

# The 'data' list will contain one dictionary per input instead of a single dictionary
response_dict = response.model_dump()

# extract the embedding from response_dict and store it in each article
for i, article in enumerate(articles):
    article['embedding'] = response_dict['data'][i]['embedding']

# print the first two articles
print(articles[:2])

# print the first article's keys and values (headline, topic, embedding)
print(articles[0].items())

# no matter the length of the input, text-embedding-3-small returns 1536 numbers by default
# they represent the semantic meaning of the headline, i.e. its position (vector) in the vector space
print(len(articles[0]['embedding']))
print(len(articles[3]['embedding']))


## Dimensionality Reduction and t-SNE

- We'll use t-SNE (t-distributed Stochastic Neighbor Embedding) to reduce the number of dimensions so that the embeddings can be plotted.


In [None]:
from sklearn.manifold import TSNE
import numpy as np

# extract the embeddings
embeddings = [article['embedding'] for article in articles]

# create an instance of TSNE
    # n_components: the number of output dimensions
    # perplexity: must be less than the number of samples. The default is 30, which suits larger datasets.
tsne = TSNE(n_components=2, perplexity=3)

# this will return the transformed embeddings in a NumPy array with n_components dimensions, which we can now visualize.
embeddings_2d = tsne.fit_transform(np.array(embeddings))

# Reducing the dimensions discards information, so use the result with caution.

## Visualizing the embeddings

In [None]:
import matplotlib.pyplot as plt

# first and second columns of the embeddings_2d array.
plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1])

topics = [article['topic'] for article in articles]
for i, topic in enumerate(topics):
    plt.annotate(topic, (embeddings_2d[i, 0], embeddings_2d[i, 1]))

plt.show()
# similar articles with similar topics are closer together.
# The model captured the semantic meaning of the headlines and mapped them based on it.

## Text Similarity
- Calculate the similarity between texts using embeddings.
- Recall that semantically similar texts are embedded more closely in the vector space.
- This means that we can measure how semantically similar two pieces of text are by computing the **distance between vectors in the vector space**
- Cosine distance is 1 - cos(θ), where θ is the angle between the two vectors and cos(θ) lies in [-1, 1].
    - So the distance lies between 0 and 2 (1 - (-1)).
    - Smaller numbers show higher similarity.
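
As a rough illustration of the formula, cosine distance can be written out by hand. This is a minimal sketch for intuition only; `cosine_distance` is a name introduced here, and in practice a library routine such as `scipy.spatial.distance.cosine` would normally be used.

```python
import math

def cosine_distance(u, v):
    """1 - cos(theta), where theta is the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

print(cosine_distance([1, 0], [2, 0]))   # same direction
print(cosine_distance([0, 1], [1, 0]))   # perpendicular
print(cosine_distance([1, 0], [-1, 0]))  # opposite direction
```

The three calls cover the ends and the middle of the range: vectors pointing the same way give 0, perpendicular vectors give 1, and opposite vectors give 2. Note that only direction matters, not length.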

In [19]:
from scipy.spatial import distance

distance.cosine([0, 1], [1, 0])

1.0

In [None]:
def create_embeddings(texts):
    response = client.embeddings.create(
        model = "text-embedding-3-small",
        input=texts
    )
    response_dict = response.model_dump()
    return [data['embedding'] for data in response_dict['data']]

print(create_embeddings(["Python is the best", "R is the best"]))

# for a single string, make sure to zero-index the result.
print(create_embeddings("This is awesome")[0])

In [None]:
from scipy.spatial import distance
import numpy as np

search_text = "computer"
search_embedding = create_embeddings(search_text)[0] # make sure to zero-index it when there is a single input

# calculate the cosine distance between the embedding of each headline and the embedding of the query.
distances = []
for article in articles:
    dist = distance.cosine(search_embedding, article['embedding'])
    distances.append(dist)

# NumPy's argmin function returns the index of the smallest value in the distances list
min_dist_ind = np.argmin(distances)

# return its headline
print(articles[min_dist_ind]['headline'])

# 2. Embeddings for AI Applications

## Semantic Search and enriched embeddings
Semantic search engines use embeddings to return the **most semantically similar** results to a search query. \
There are three steps to semantic search:
1. Embed the search query and other texts to compare against
2. Compute the cosine distances between the embedded search query and other embedded texts
3. Extract the text with the smallest cosine distance

In [None]:
articles = [
    {
        "headline": "AI Breakthrough Enables Real-Time Language Translation Without Internet",
        "topic": "Artificial Intelligence",
        "keywords": ["AI", "machine learning", "language translation", "offline AI", "technology innovation"]
    },
    {
        "headline": "Olympic Sprinter Sets New World Record in 100m Final",
        "topic": "Sports",
        "keywords": ["Olympics", "sprinting", "track and field", "world record", "athletics"]
    },
    {
    "headline": "Climate Scientists Warn of Accelerating Ice Melt in Antarctica",
    "topic": "Environment",
    "keywords": ["climate change", "Antarctica", "ice melt", "global warming", "scientific research"]
    }
]


# Combine the information of each article into a single string that reflects what is stored in each dictionary.
# We use an f-string (formatted string) to build the desired string structure.
# F-strings let us insert variables into strings without converting and concatenating them manually.
# Note that we use """ around the string so that it can span multiple lines.
# The keywords are joined with a comma and a space.

def create_article_text(article):
    return f"""
            Headline: {article['headline']}
            Topic: {article['topic']}
            Keywords: {', '.join(article['keywords'])}
            """

# calling the function on the final article
print(create_article_text(articles[-1]))

# Creating enriched embeddings for each article
article_texts = [create_article_text(article) for article in articles]

# create embeddings
article_embeddings = create_embeddings(article_texts)
print(article_embeddings)


In [None]:
# compute cosine distances
from scipy.spatial import distance

def find_n_closest(query_vector, embeddings, n=3):
    distances = []
    for index, embedding in enumerate(embeddings):
        dist = distance.cosine(query_vector, embedding)
        distances.append({'distance':dist, "index":index})
    
    distances_sorted = sorted(distances, key=lambda x: x['distance']) # it accesses the 'distance' key for each dictionary

    # return the closest n results
    return distances_sorted[0:n]

In [None]:
query_text = "AI"
query_vector = create_embeddings(query_text)[0] # extract the embedding by zero-indexing

# find the 3 closest articles to this query
hits = find_n_closest(query_vector, article_embeddings)

for hit in hits:
    # get the article based on the index
    article = articles[hit['index']]
    # get the headline now that we know the index
    print(article['headline'])


## Recommendation Systems
1. Embed the potential recommendations and the data point to compare against
2. Calculate the cosine distance
3. Recommend the closest items

In [None]:
# assume that we have the previous list of article dictionaries
articles = [
    {
        "headline": "AI Breakthrough Enables Real-Time Language Translation Without Internet",
        "topic": "Artificial Intelligence",
        "keywords": ["AI", "machine learning", "language translation", "offline AI", "technology innovation"]
    },
    {
        "headline": "Olympic Sprinter Sets New World Record in 100m Final",
        "topic": "Sports",
        "keywords": ["Olympics", "sprinting", "track and field", "world record", "athletics"]
    },
    {
    "headline": "Climate Scientists Warn of Accelerating Ice Melt in Antarctica",
    "topic": "Environment",
    "keywords": ["climate change", "Antarctica", "ice melt", "global warming", "scientific research"]
    }
]

# and a current article
current_article = {
    "headline": "Cutting-Edge Hardware Accelerates the Next Wave of AI Innovation",
    "topic": "Artificial Intelligence",
    "keywords": ["AI", "computer hardware", "GPUs", "TPUs", "AI acceleration", "machine learning infrastructure"]
                }

# combine the text together as in the previous code, using create_article_text
article_texts = [create_article_text(article) for article in articles]
current_article_text =  create_article_text(current_article)
print(current_article_text)

# create embeddings 
article_embeddings = create_embeddings(article_texts)
current_article_embeddings = create_embeddings(current_article_text)[0] # zero indexing

# find the nearest distances between the current article embedding and all the article embeddings, using the find_n_closest
hits = find_n_closest(current_article_embeddings, article_embeddings)

for hit in hits:
    article = articles[hit['index']]
    print(article['headline'])


## Recommendations on two data points
Suppose a user has visited two articles. To find the article most similar to both of their vectors:
- We'll combine the two vectors into 1 by **taking the mean**.
- compute cosine distance
- recommend the closest vector
- recommend the nearest **unseen** article
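
The averaging step can be sketched without any API calls. The 2-D vectors and article IDs below are hypothetical values chosen for illustration, and `mean_vector` and `cosine_distance` are helper names introduced here.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def cosine_distance(u, v):
    """1 - cos(theta) between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (math.hypot(*u) * math.hypot(*v))

# Hypothetical 2-D embeddings keyed by article id.
catalog = {"a1": [1.0, 0.0], "a2": [0.8, 0.6], "a3": [0.0, 1.0], "a4": [0.9, 0.3]}
seen_ids = ["a1", "a2"]

# Combine the viewing history into a single profile vector.
profile = mean_vector([catalog[i] for i in seen_ids])

# Only unseen articles are candidates for recommendation.
unseen = {i: v for i, v in catalog.items() if i not in seen_ids}
recommended = min(unseen, key=lambda i: cosine_distance(profile, unseen[i]))
print(recommended)
```

Here the profile points between the two seen articles, so the unseen article lying in that direction is recommended over the unrelated one.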

In [None]:
# user_history is assumed to be a list of article dictionaries the user has already viewed
history_texts = [create_article_text(article) for article in user_history]
history_embeddings = create_embeddings(history_texts) # we can give it a list of texts
mean_history_embeddings = np.mean(history_embeddings, axis=0)

# filter the articles to make sure the user hasn't seen them
articles_filtered = [article for article in articles if article not in user_history]
article_texts = [create_article_text(article) for article in articles_filtered]
article_embeddings = create_embeddings(article_texts)

hits = find_n_closest(mean_history_embeddings, article_embeddings)

for hit in hits:
    article = articles_filtered[hit['index']]
    print(article['headline'])

## Embeddings for classification

Assigning labels to items
- categorization
    - example: headline into topics (e.g. sports, tech, business, science, etc.)
- sentiment analysis
    - example: classifying reviews as positive or negative.

We will use **zero-shot** classification here. The classification won't be based on labeled examples.

Process:
1. Embed the class descriptions (embed words such as science, tech, etc.).
    - **Limitation**: if a class description lacks detail, items may be misclassified.
2. Embed the item to classify.
3. Compute cosine distances.
4. Assign the item the most similar label.

In [38]:
topics = [
    {"label": "tech"},
    {"label": "science"},
    {"label": "sport"},
    {"label": "business"}
]

class_descriptions = [topic['label'] for topic in topics]
class_embeddings = create_embeddings(class_descriptions)

# here is the article we want to classify
article = {
    "headline": "Cutting-Edge Hardware Accelerates the Next Wave of AI Innovation",
    "topic": "Artificial Intelligence",
    "keywords": ["AI", "computer hardware", "GPUs", "TPUs", "AI acceleration", "machine learning infrastructure"]
                }

article_text = create_article_text(article)
article_embedding = create_embeddings(article_text)[0] # zero-index the single result

# to find a single label, we modify the previous function to return only the closest match
def find_closest(query_vector, embeddings):
    distances = []
    for index, embedding in enumerate(embeddings):
        dist = distance.cosine(query_vector, embedding)
        distances.append({'distance':dist, 'index':index})
    
    # find the min distance
    return min(distances, key=lambda x:x['distance'])


# find the label
# Find the closest class embedding and its index using find_closest()
closest = find_closest(article_embedding, class_embeddings)

# Subset labels/topics using the index from closest
label = topics[closest['index']]['label']
print(f'"{article["headline"]}" was classified as {label}')



- To overcome the limitation of the previous method, we add a description to each class label. Instead of embedding the bare labels, we embed the description of each label this time.

In [45]:
topics = [
    {
        "label": "tech",
        "description": "Covers the latest developments in technology, including software, hardware, and innovations in AI."
    },
    {
        "label": "science",
        "description": "Focuses on discoveries, research, and advancements in fields such as biology, physics, and environmental science."
    },
    {
        "label": "sport",
        "description": "Includes news and updates about sporting events, athletes, competitions, and records."
    },
    {
        "label": "business",
        "description": "Reports on the economy, markets, companies, startups, and financial trends."
    }
]

class_descriptions = [topic['description'] for topic in topics]
class_embeddings = create_embeddings(class_descriptions)


# the rest is the same as before
# here is the article we want to classify
article = {
    "headline": "Cutting-Edge Hardware Accelerates the Next Wave of AI Innovation",
    "topic": "Artificial Intelligence",
    "keywords": ["AI", "computer hardware", "GPUs", "TPUs", "AI acceleration", "machine learning infrastructure"]
                }

article_text = create_article_text(article)
article_embedding = create_embeddings(article_text)[0] # zero-index the single result

# to find a single label, we reuse the function that returns only the closest match
def find_closest(query_vector, embeddings):
    distances = []
    for index, embedding in enumerate(embeddings):
        dist = distance.cosine(embedding, query_vector)
        distances.append({'distance':dist, 'index':index})
    
    # find the min distance
    return min(distances, key=lambda x:x['distance'])


# find the label
closest = find_closest(article_embedding, class_embeddings)
label = topics[closest['index']]['label']
print(f'"{article["headline"]}" was classified as {label}')


# 3. Vector Databases for embedding systems

**limitation** of the current approach:
- Loading all the embeddings **in memory** can be impractical due to their size (1536 floats, around 12 kB per embedding when stored as 64-bit values).
- Recalculating the embeddings for each new query instead of storing them for later use is not efficient.
- Calculating cosine distance for every embedding and sorting is **slow and scales linearly**.
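
The size concern can be checked with quick arithmetic. The sketch below counts only the raw numbers and ignores any container overhead; the function names and the one-million-document corpus are assumptions made for illustration.

```python
def embedding_bytes(dimensions, bytes_per_float):
    """Raw storage for one embedding, ignoring container overhead."""
    return dimensions * bytes_per_float

def corpus_gigabytes(n_documents, dimensions=1536, bytes_per_float=8):
    """Raw storage for a corpus of embeddings, in GB (10**9 bytes)."""
    return n_documents * embedding_bytes(dimensions, bytes_per_float) / 1e9

print(embedding_bytes(1536, 8))     # 64-bit floats
print(embedding_bytes(1536, 4))     # 32-bit floats
print(corpus_gigabytes(1_000_000))  # one million documents at 64 bits
```

At 8 bytes per number a single 1536-dimensional embedding takes 12,288 bytes, so a million documents need roughly 12 GB before any indexing structures are added.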

**Vector databases** are a better solution for working with embeddings on larger datasets in production.
- The embedded documents are stored in, and queried from, the vector database.
- A query is sent from the application interface, embedded, and used to query the embeddings in the database. The query could be a semantic search query or a data point to compare against.
- The results are returned to the user via the application interface.
- Because the embedded documents are stored in the vector database, they don't have to be recreated for each query or held in memory.
- Due to the architecture of the database, similarity is computed much more efficiently.
- Most vector databases are **NoSQL** databases. NoSQL databases don't use tables like conventional SQL (relational) databases; their data can be structured in several ways to enable faster querying. Three common NoSQL architectures are:
    - key, value
    - documents
    - graph databases
- Vector databases aren't only used for **storing the embeddings**; the **source texts** are also commonly stored. For vector databases that don't allow storing the source texts, source texts must be stored in a separate database and referenced with an ID. **Metadata** is also stored in the database, including IDs and external references, and additional data that could be useful for filtering the query results.
    - **Avoid** storing the source texts as metadata. Metadata must be small to be practically useful. So, adding a large amount of text data will greatly degrade performance.

## Creating vector databases with ChromaDB
- ChromaDB is a simple yet powerful vector database.
- Two flavors:
    - Local: great for development and prototyping. Everything happens inside Python.
    - Client/Server mode: made for production. A ChromaDB server runs in a separate process.

### connecting to the database

In [61]:
# !pip install chromadb

In [65]:
import chromadb

# data will be persisted to disk
client = chromadb.PersistentClient(path="./")

### Add embeddings to the database
- first create a collection.
    - A collection is analogous to a table in a relational database.
- We need to pass the name of the collection and a function to create the embeddings.
    - In Chroma, a default embedding function is used automatically if one isn't specified.

In [79]:
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

In [None]:
collection = client.create_collection(
    name="my_collection",
    embedding_function=OpenAIEmbeddingFunction(
        model_name="text-embedding-3-small",
        api_key="Your_API_KEY"
    )
)

### Listing all the collections in the database

In [72]:
client.list_collections()

[]

### Inserting Embeddings to the collections

#### Single Document
- Chroma doesn't automatically generate **IDs** for these documents, so they **must** be specified.
- Embeddings will be created automatically by the collection, since it's aware of the embedding function.

In [83]:
collection.add(ids=["my-doc"], documents=["This is the source text"])

#### Multiple Documents
- Pass multiple IDs and documents to insert several items at once.

In [89]:
# the embeddings will be created automatically when the texts are inserted.
collection.add(ids=["my-doc-1", "my-doc-2"], documents=["This is the document 1", "This is document 2"])

### Inspecting a collection

- **collection.count()** will return the total number of documents in the collection.
- **collection.peek()** will return the first 10 items in the collection.

### Retrieving items
- **collection.get(ids=["s59"])** retrieves particular items by ID.

### Calculating Embedding Cost with Tiktoken
- We can find out about the cost. As an example, the embedding model (text-embedding-3-small) costs $0.00002/1k tokens.
- We can count the tokens with the OpenAI **tiktoken** library (pip install tiktoken). tiktoken turns any text into tokens.

In [None]:
# We can find out about the cost using the following formula.
# cost = 0.00002 * len(tokens)/1000

# We can use tiktoken to count the tokens
import tiktoken

# get a token encoder for the embedding model we're using.
enc = tiktoken.encoding_for_model("text-embedding-3-small")

# for each document, encode it using the encoder, count the tokens by taking the length of the encoded docs, and sum the results.
total_tokens = sum(len(enc.encode(text)) for text in documents)

cost_per_1k_tokens = 0.00002

print("Total tokens: ", total_tokens)
print("Cost: ", cost_per_1k_tokens * total_tokens/1000)

## Querying and updating the database

### Retrieve our collection
- We must **specify the same embedding function** that was used when adding data to the collection. In this way Chroma will use the *same function* to create the **query vector**

In [None]:
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

collection = client.get_collection(
    name = "netflix_titles",
    embedding_function=OpenAIEmbeddingFunction(api_key="...")
)

### Querying the collection

In [None]:
result = collection.query(
    # this parameter is plural, so even if it's one item, we need to pass a list.
    query_texts=["movies where people sing a lot"],
    # number of items to retrieve
    n_results=3
)

print(result)

### result

- query() returns a dictionary with multiple keys:
    - ids: the IDs of the returned items
        - A list of lists (one inner list of IDs per query), e.g. 'ids': [['s4068', 's2213']]
    - embeddings: the embeddings of the returned items
    - documents: the source texts of the returned items
    - metadatas: the metadata of the returned items
    - distances: the distances of the returned items with the query text 
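
Because each key holds a list of lists, it can help to flatten the result into one record per hit. The sketch below works on a mock dictionary with made-up documents, metadata, and distances; `flatten_hits` is a hypothetical helper, and the mock only imitates the nested shape described above.

```python
# Hypothetical result in the nested shape described above (one query, two hits).
mock_result = {
    "ids": [["s4068", "s2213"]],
    "documents": [["A musical about a choir", "A concert documentary"]],
    "metadatas": [[{"type": "Movie"}, {"type": "TV Show"}]],
    "distances": [[0.31, 0.42]],
}

def flatten_hits(result):
    """Yield (query_index, id, document, distance) for every returned item."""
    for q, ids in enumerate(result["ids"]):
        for rank, item_id in enumerate(ids):
            yield q, item_id, result["documents"][q][rank], result["distances"][q][rank]

for q, item_id, doc, dist in flatten_hits(mock_result):
    print(q, item_id, dist, doc)
```

The outer index identifies the query and the inner index identifies the rank of the hit, so the same helper also works when several query texts are passed at once.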

### Update a collection
- Only include the fields that need to be updated; the rest of the fields are left unchanged.
- The collection automatically creates new embeddings for the updated documents.

In [None]:
collection.update(
    ids=["id-1", "id-2"],
    documents=["New document 1", "New document 2"]
)

- If we are not sure whether the IDs are present in the collection, use the **collection.upsert()** method.
- upsert adds the items if their IDs are not present and updates them if they are.
    - a combination of **add and update** methods. 
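
The difference between add, update, and upsert can be illustrated with a plain dictionary. `TinyStore` is a hypothetical stand-in written for this note, not Chroma's API; in particular, raising an error on duplicate or missing IDs is an assumption of the sketch, and Chroma's own handling of those cases may differ.

```python
class TinyStore:
    """Dict-backed stand-in for a collection; illustrative only."""

    def __init__(self):
        self.docs = {}

    def add(self, doc_id, text):
        # Assumed behaviour for this sketch: refuse to overwrite.
        if doc_id in self.docs:
            raise KeyError(f"{doc_id} already exists")
        self.docs[doc_id] = text

    def update(self, doc_id, text):
        # Assumed behaviour for this sketch: the id must already exist.
        if doc_id not in self.docs:
            raise KeyError(f"{doc_id} not found")
        self.docs[doc_id] = text

    def upsert(self, doc_id, text):
        # Insert when missing, overwrite when present.
        self.docs[doc_id] = text

store = TinyStore()
store.upsert("id-1", "New document 1")    # id is missing, so this acts like add
store.upsert("id-1", "Newer document 1")  # id exists, so this acts like update
print(store.docs)
```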

In [None]:
collection.upsert(
    ids=["id-1", "id-2"],
    documents=["New document 1", "New document 2"]
)

### Deleting items from a collection

In [None]:
collection.delete(
    ids=["id-1", "id-2"]
)

### Deleting the whole database!! (careful)

In [None]:
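# reset() is called on the client, not on a collection; it deletes all collections and items.
# Depending on the Chroma version, resetting may need to be explicitly enabled in the client settings.
client.reset()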

## Multiple queries and filtering

In [None]:
reference_ids = ['s8170', 's8103']

# retrieve the texts of our reference ids
reference_texts = collection.get(ids=reference_ids)["documents"]

# we pass our reference texts as two queries
result = collection.query(
    query_texts = reference_texts,
    # return 3 results per query, so here we get two lists of three items
    n_results=3
)

### Adding metadata

In [158]:
# we can use metadata to filter our query
collection.update(ids=ids, metadatas=metadatas)

result = collection.query(
    query_texts = reference_texts,
    n_results = 3,
    # add a filter condition here
    # we only want to retrieve items whose metadata field 'type' is 'Movie'
    where={
        'type':'Movie'
    }
)

### Where filters (single or multiple)

- Either of these forms can be used for a where condition:

In [None]:
where = {
    'type':'Movie'
}

In [None]:
where = {
    "type" : {
        "$eq": "Movie"
    }
}

- Different operators support different comparisons, including:
    - \$eq : equal to (string, int, float) 
    - \$ne : not equal to (string, int, float) 
    - \$gt : greater than (int, float) 
    - \$gte : greater than or equal to (int, float) 
    - \$lt : less than (int, float) 
    - \$lte : less than or equal to (int, float)
      
- Where filters can be combined with **logical operators**:
    - \$and: filter based on all conditions
    - \$or: filter based on at least one condition

In [None]:
where = {
    "$and" : [
        {"type":
            {"$eq":"Movie"}
        },
        {"release_year":
            {"$gt": 2020}
        }
    ]
}
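
To make the semantics of these filters concrete, here is a sketch of how a where-style dictionary could be evaluated against a metadata dictionary in plain Python. This is not how Chroma implements filtering; `matches` and `COMPARISONS` are names introduced for illustration, and treating a missing field as a non-match is an assumption of the sketch.

```python
import operator

# Map each comparison operator to the equivalent Python function.
COMPARISONS = {
    "$eq": operator.eq, "$ne": operator.ne,
    "$gt": operator.gt, "$gte": operator.ge,
    "$lt": operator.lt, "$lte": operator.le,
}

def matches(metadata, where):
    """Return True if a metadata dict satisfies a where-style filter."""
    if "$and" in where:
        return all(matches(metadata, clause) for clause in where["$and"])
    if "$or" in where:
        return any(matches(metadata, clause) for clause in where["$or"])
    for field, condition in where.items():
        if not isinstance(condition, dict):
            condition = {"$eq": condition}  # shorthand form: {'type': 'Movie'}
        if field not in metadata:
            return False  # assumption: a missing field never matches
        for op, expected in condition.items():
            if not COMPARISONS[op](metadata[field], expected):
                return False
    return True

where = {"$and": [{"type": {"$eq": "Movie"}}, {"release_year": {"$gt": 2020}}]}
print(matches({"type": "Movie", "release_year": 2021}, where))
print(matches({"type": "Movie", "release_year": 2019}, where))
```

The first item satisfies both conditions and the second fails the year condition, so only the first would pass an `$and` filter; with `$or` in place of `$and`, both would pass.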