# Embeddings via API

In this notebook, we demonstrate how to obtain embeddings through OpenAI's API, which offers an embedding service.

In [1]:
# Load secrets
%load_ext dotenv 
%dotenv ../../05_src/.secrets

In [3]:
# Phrases to be embedded, ordered from most to least concentrated on the topic
# Goal: obtain a representation that reflects the semantic similarity between these phrases, so that similarity search reflects meaning
documents = [
    "The machine learning model predicts customer behavior based on historical data.",
    "The machine learning model predicts user behavior using historical data.",
    "A machine learning model predicts customer behavior from past data.",
    "The predictive model uses historical customer data to forecast behavior.",
    "Customer behavior is predicted by a data-driven machine learning system.",
    "Historical data is analyzed to understand how customers behave.",
    "A data science model analyzes past information to make predictions.",
    "Business analysts study customer trends to support decision making.",
    "Statistical techniques are used to interpret large datasets.",
    "The weather forecast was inaccurate due to missing satellite data.",
    "A novel explores human relationships in a small coastal town."
]

OpenAI's text embeddings are available through the embeddings API. A key reference is the [Embeddings API documentation](https://platform.openai.com/docs/guides/embeddings).

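There are three models that we can choose from, depending on [the size of the hidden representation, latency, and cost](https://platform.openai.com/docs/guides/embeddings#embedding-models). Depending on the context, some models are more appropriate than others:

+ `text-embedding-3-small`: the smaller, cheaper third-generation model (1536 dimensions by default); a reasonable choice for short texts (e.g., emails)
+ `text-embedding-3-large`: the larger third-generation model (3072 dimensions by default); may suit large documents
+ `text-embedding-ada-002`: the original, older-generation model (1536 dimensions)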

A simple implementation would call the embeddings API for each phrase.

In [None]:

from openai import OpenAI
import os
client = OpenAI(base_url='https://k7uffyg03f.execute-api.us-east-1.amazonaws.com/prod/openai/v1', 
                api_key='any value',
                default_headers={"x-api-key": os.getenv('API_GATEWAY_KEY')})

# Call the embeddings API for a single text, replacing line breaks with spaces first
def get_embedding(text, model="text-embedding-3-small"):
    text = text.replace("\n", " ")
    return client.embeddings.create(input=text, model=model).data[0].embedding

embeddings = [get_embedding(doc) for doc in documents]
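The per-phrase loop above sends one request per document. The embeddings endpoint also accepts a list of strings as `input`, so a batched variant is possible. The sketch below is illustrative only: `embed_batch` is a hypothetical helper name, and it assumes a client object shaped like the `client` created above.

```python
def embed_batch(client, texts, model="text-embedding-3-small"):
    """Embed several texts with a single embeddings request (hypothetical helper)."""
    # Same cleanup as get_embedding: replace line breaks with spaces
    cleaned = [text.replace("\n", " ") for text in texts]
    response = client.embeddings.create(input=cleaned, model=model)
    # Each returned item carries an index; sort on it so the output order matches the input order
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]
```

Under these assumptions, `embed_batch(client, documents)` should produce the same list of vectors as the list comprehension, with fewer round trips.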

In [None]:
# `embeddings` is a Python list (anything inside square brackets is a list) holding the output of get_embedding for each document in `documents`
# Printed out, it is a list of vectors: each vector is a list of floats that gives the position of one document in the embedding space

# Check that the lengths match (each document should have a corresponding embedding)
print(len(embeddings), len(documents))

# With arrays of floats in hand, we can start thinking about similarity metrics
import numpy as np

embeddings_array = np.array(embeddings)
embeddings_array
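One common similarity metric for embedding vectors is cosine similarity. The sketch below computes all pairwise cosine similarities for a small array; the toy vectors are made-up stand-ins for real embeddings, and `cosine_similarity_matrix` is a hypothetical helper name.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Return the matrix of pairwise cosine similarities between rows."""
    vectors = np.asarray(vectors, dtype=float)
    # Scale every row to unit length; dot products of unit vectors are cosines
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / norms
    return unit @ unit.T

# Toy 3-dimensional vectors standing in for real embeddings (illustrative values only)
toy = np.array([[1.0, 0.0, 0.0],
                [2.0, 0.0, 0.0],
                [0.0, 3.0, 0.0]])
sims = cosine_similarity_matrix(toy)
print(np.round(sims, 2))
```

The first two toy vectors point in the same direction, so their similarity is 1 even though their lengths differ; the third is orthogonal to both, so its similarity to them is 0. Passing `embeddings_array` instead of `toy` would give an 11 x 11 matrix for the phrases above.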

## A Note on Similarity

One important characteristic of embeddings is that they can be used to measure the relatedness of text strings. To see this, we can plot a reduced form of the embeddings using Principal Component Analysis (PCA).

Similarity between two texts can be understood in two ways:

+ Lexical similarity refers to similarity of the choice of words. For example, "cats are fun" and "cats are furry" are similar in that they have two words in common.
+ Semantic similarity refers to similarity in the words' meaning. For example, "the bottle is empty" and "there is nothing in the bottle" are similar in meaning, but the phrases do not have many words in common.

Using count or tf-idf vectorization, we can calculate lexical similarity; using embeddings, we can compute (model-dependent) semantic similarity.
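To make the lexical side concrete, here is a minimal sketch of a count-based similarity: cosine similarity between word-count vectors, using only the standard library. The whitespace tokenization and the helper name `lexical_similarity` are simplifying assumptions, not a substitute for a proper count or tf-idf vectorizer.

```python
import math
from collections import Counter

def lexical_similarity(a, b):
    """Cosine similarity between the word-count vectors of two phrases."""
    counts_a = Counter(a.lower().split())
    counts_b = Counter(b.lower().split())
    # Only words present in both phrases contribute to the dot product
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b)

score_cats = lexical_similarity("cats are fun", "cats are furry")
score_bottle = lexical_similarity("the bottle is empty", "there is nothing in the bottle")
print(round(score_cats, 3), round(score_bottle, 3))
```

With this measure, the two "cats" phrases score slightly higher than the two "bottle" phrases, even though only the latter pair means the same thing. A word-overlap score does not track meaning, which is the gap that embeddings are meant to close.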

In [None]:
# Reduce the embeddings to 2 dimensions for visualization purposes (via PCA)
# PCA extracts the linear correlation structure of a dataset
from sklearn.decomposition import PCA

pca = PCA(n_components = 2)
reduced_embeddings = pca.fit_transform(embeddings_array)
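As a rough illustration of what "linear correlation structure" means, PCA can be sketched as centering the data and projecting it onto the leading directions of a singular value decomposition. This is a simplified stand-in, not a replacement for `sklearn.decomposition.PCA`; `pca_project` is a hypothetical helper and the random data is only a placeholder for `embeddings_array`.

```python
import numpy as np

def pca_project(data, n_components=2):
    """Project rows of data onto their first principal components."""
    data = np.asarray(data, dtype=float)
    # Center each column so the components describe variance around the mean
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing singular value
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:n_components].T
    # Share of total variance captured by each direction
    explained = singular_values**2 / np.sum(singular_values**2)
    return projected, explained[:n_components]

rng = np.random.default_rng(0)
toy = rng.normal(size=(11, 8))  # placeholder for embeddings_array (hypothetical data)
points, ratio = pca_project(toy)
print(points.shape, ratio)
```

The explained-variance shares indicate how much of the spread the 2D picture retains; when they are small, distances in the plot should be read with caution.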

In [None]:
# Plot the reduced embeddings in 2D; each point represents a document, and the distance between points approximately reflects their semantic similarity
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from adjustText import adjust_text

# Build a DataFrame of the 2D coordinates, labeled with the original phrases
df = pd.DataFrame(reduced_embeddings, columns=["x", "y"]).assign(label=documents)

# Create the scatter plot
fig, ax = plt.subplots()
sns.scatterplot(x='x', y='y', data=df, ax=ax)

# Add labels
texts = []
for i, row in df.iterrows():
    texts.append(ax.text(row['x'], row['y'], row['label'], fontsize=6))

# Adjust text positions to avoid overlap
adjust_text(texts, arrowprops=dict(arrowstyle='-', color='black', lw=0.5))

plt.show()

# Embeddings give us a representation that keeps similar documents close together in the embedding space; we can use it for similarity search, clustering, or visualization. The PCA step is only for visualization: it reduces the high-dimensional embedding space to 2D while trying to preserve the structure of the data as much as possible.
# This mechanism can drive a search engine: embed the query, then find the closest embeddings in the document collection and return those documents as results. This is a common technique in information retrieval and natural language processing applications.

# Additional note:
# There is no guarantee that the embeddings have linear structure, so a linear method such as PCA may miss it; the standard alternative is t-SNE: https://www.jmlr.org/papers/v9/vandermaaten08a.html
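The search idea described above (embed a query, then return the closest document embeddings) can be sketched as a cosine-similarity ranking. The vectors below are made-up 2D stand-ins, and `top_k_similar` is a hypothetical helper name; in this notebook the query vector would come from something like `get_embedding(query)` and the document vectors from `embeddings_array`.

```python
import numpy as np

def top_k_similar(query_vector, document_vectors, k=3):
    """Return (index, cosine similarity) pairs for the k closest documents."""
    query = np.asarray(query_vector, dtype=float)
    docs = np.asarray(document_vectors, dtype=float)
    # Cosine similarity between the query and every document row
    scores = (docs @ query) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
    # Indices of the k highest scores, best first
    best = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in best]

# Toy vectors standing in for document and query embeddings (illustrative values only)
toy_docs = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])
toy_query = np.array([1.0, 0.1])
results = top_k_similar(toy_query, toy_docs, k=2)
print(results)
```

The returned indices can be used to look up the matching entries in `documents`. For large collections, a dedicated vector index would likely replace this brute-force scan.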