In this notebook, we build an intuition for text embeddings, look at the use cases they are good for, and see how we can customize them via finetuning.

Read the accompanying [blog post here](https://txt.cohere.ai/text-embeddings/).

In [None]:
! pip install cohere altair > /dev/null

In [None]:
import cohere
import pandas as pd
import numpy as np
import altair as alt

api_key = '{apikey}' # Paste your API key here. Remember not to share it publicly
co = cohere.Client(api_key)

In [None]:
# Load the dataset to a dataframe
df_orig = pd.read_csv('https://raw.githubusercontent.com/cohere-ai/notebooks/main/notebooks/data/atis_intents_train.csv',names=['intent','query'])

# Take a small sample for illustration purposes
sample_classes = ['atis_airfare', 'atis_airline', 'atis_ground_service']
df = df_orig.sample(frac=0.12, random_state=30)
df = df[df.intent.isin(sample_classes)]
df.reset_index(drop=True,inplace=True)

# Save the intent labels for later use, then drop the column
intents = df['intent']
df.drop(columns=['intent'], inplace=True)
df.head()

Unnamed: 0,query
0,which airlines fly from boston to washington ...
1,show me the airlines that fly between toronto...
2,show me round trip first class tickets from n...
3,i'd like the lowest fare from denver to pitts...
4,show me a list of ground transportation at bo...


# 1. Intuition

When you hear about large language models (LLMs), probably the first thing that comes to mind is their text generation capability, such as writing an essay or creating marketing copy.

But another thing you can get from them is a text representation: a set of numbers that represents what the text means and captures its semantics. These numbers are called text embeddings.

![Comparing text generation and text representation](https://github.com/cohere-ai/notebooks/raw/main/notebooks/images/vis-embeds/1-text-gen-rep.png)

## 1.1 - Turn text into embeddings

In [None]:
# Get text embeddings
def get_embeddings(text,model='medium'):
  output = co.embed(
                model=model,
                texts=[text])
  return output.embeddings[0]

In [None]:
# Embed the dataset
df['query_embeds'] = df['query'].apply(get_embeddings)
df.head()

Unnamed: 0,query,query_embeds
0,which airlines fly from boston to washington ...,"[-0.5157496, -1.7158719, -1.9576564, -0.000320..."
1,show me the airlines that fly between toronto...,"[-0.56443447, 0.24712197, -2.4817867, -0.96786..."
2,show me round trip first class tickets from n...,"[-1.8028516, 0.16407005, -1.5529815, -2.070842..."
3,i'd like the lowest fare from denver to pitts...,"[-1.9541955, -0.2440019, -2.1427286, -0.895589..."
4,show me a list of ground transportation at bo...,"[-0.25850168, -0.24117625, -0.7334503, -0.0590..."


## 1.2 - Visualize embeddings on a heatmap

In [None]:
# Reduce dimensionality using PCA
from sklearn.decomposition import PCA

# Function to return the principal components
def get_pc(arr,n):
  pca = PCA(n_components=n)
  embeds_transform = pca.fit_transform(arr)
  return embeds_transform

In [None]:
# Reduce embeddings to 10 principal components to aid visualization
embeds = np.array(df['query_embeds'].tolist())
embeds_pc = get_pc(embeds,10)

In [None]:
# Set sample size to visualize
sample = 9

# Reshape the data for visualization purposes
source = pd.DataFrame(embeds_pc)[:sample]
source = pd.concat([source, df['query'].iloc[:sample]], axis=1)  # keep only the sampled queries
source = source.melt(id_vars=['query'])

# Configure the plot
chart = alt.Chart(source).mark_rect().encode(
    x=alt.X('variable:N', title="Embedding"),
    y=alt.Y('query:N', title='',axis=alt.Axis(labelLimit=500)),
    color=alt.Color('value:Q', title="Value", scale=alt.Scale(
                range=["#917EF3", "#000000"]))
)

result = chart.configure(background='#ffffff'
        ).properties(
        width=700,
        height=400,
        title='Embeddings with 10 dimensions'
       ).configure_axis(
      labelFontSize=15,
      titleFontSize=12)

# Show the plot
result

Notice the three inquiries about ground transportation in Boston: their embedding patterns are very similar to one another and, at the same time, distinct from the rest.
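The heatmap relies on PCA to compress each embedding into a handful of numbers. As a rough illustration of what such a projection does, here is a minimal sketch that finds only the first principal component using power iteration on the covariance matrix. This is not how scikit-learn's `PCA` is implemented, and the helper names (`first_principal_component`, `project`) and the toy data are assumptions made up for this example.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (mean, direction) of the top principal component of `rows`."""
    n, d = len(rows), len(rows[0])
    # Center the data: PCA works on deviations from the column means
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]

    # Sample covariance matrix (d x d)
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]

    # Power iteration: repeatedly multiply by the covariance and renormalize
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mean, v

def project(rows, mean, direction):
    """Project each centered row onto the given direction."""
    return [sum((r[j] - mean[j]) * direction[j] for j in range(len(mean)))
            for r in rows]

# Made-up 2D points lying roughly along the line y = 2x
toy = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8], [5.0, 10.0]]
mean, direction = first_principal_component(toy)
scores = project(toy, mean, direction)
print(direction)  # points roughly along (1, 2), normalized to unit length
print(scores)     # one number per point instead of two
```

Each point is now described by a single score along the direction of greatest variance, which is the same idea as reducing the embeddings to 10 components above.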

## 1.3 - Visualize embeddings on a 2D plot

In [None]:
# Function to generate the 2D plot
def generate_chart(df,xcol,ycol,lbl='on',color='basic',title=''):
  chart = alt.Chart(df).mark_circle(size=500).encode(
    x=
    alt.X(xcol,
        scale=alt.Scale(zero=False),
        axis=alt.Axis(labels=False, ticks=False, domain=False)
    ),

    y=
    alt.Y(ycol,
        scale=alt.Scale(zero=False),
        axis=alt.Axis(labels=False, ticks=False, domain=False)
    ),
    
    color= alt.value('#333293') if color == 'basic' else color,
    tooltip=['query']
    )

  if lbl == 'on':
    text = chart.mark_text(align='left', baseline='middle',dx=15, size=13,color='black').encode(text='query', color= alt.value('black'))
  else:
    text = chart.mark_text(align='left', baseline='middle',dx=10).encode()

  result = (chart + text).configure(background="#FDF7F0"
        ).properties(
        width=800,
        height=500,
        title=title
       ).configure_legend(
  orient='bottom', titleFontSize=18,labelFontSize=18)
        
  return result

In [None]:
# Reduce embeddings to 2 principal components to aid visualization
embeds_pc2 = get_pc(embeds,2)

# Add the principal components to dataframe
df_pc2 = pd.concat([df, pd.DataFrame(embeds_pc2)], axis=1)

# Plot the 2D embeddings on a chart
df_pc2.columns = df_pc2.columns.astype(str)
generate_chart(df_pc2.iloc[:sample],'0','1',title='2D Embeddings')

Here texts of similar meaning are located close together. We see inquiries about tickets on the left, inquiries about airlines somewhere around the middle, and inquiries about ground transportation on the top right.

# 2. Use Cases

## 2.1 - Semantic Search




Semantic search, or similarity search, surfaces results based on the context or semantic meaning of a query instead of relying purely on keyword matching.

In [None]:
# Calculate cosine similarity between the search query and existing queries

from sklearn.metrics.pairwise import cosine_similarity

def get_similarity(target,candidates):
  # Turn list into array
  candidates = np.array(candidates)
  target = np.expand_dims(np.array(target),axis=0)

  # Calculate cosine similarity
  sim = cosine_similarity(target,candidates)
  sim = np.squeeze(sim).tolist()
  sort_index = np.argsort(sim)[::-1]
  sort_score = [sim[i] for i in sort_index]
  similarity_scores = zip(sort_index,sort_score)

  # Return similarity scores
  return similarity_scores


In [None]:
# Add new query
new_query = "show business fares"

# Get embeddings of the new query
new_query_embeds = get_embeddings(new_query)

In [None]:
# Get the similarity between the search query and existing queries
similarity = get_similarity(new_query_embeds,embeds[:sample])

# View the existing queries, ranked by similarity to the new query
print('Query:')
print(new_query,'\n')

print('Similar queries:')
for idx,sim in similarity:
  print(f'Similarity: {sim:.2f};',df.iloc[idx]['query'])

Query:
show business fares 

Similar queries:
Similarity: 0.52;  show me round trip first class tickets from new york to miami
Similarity: 0.43;  i'd like the lowest fare from denver to pittsburgh
Similarity: 0.39;  show me a list of ground transportation at boston airport
Similarity: 0.38;  show me boston ground transportation
Similarity: 0.36;  show me the airlines that fly between toronto and denver
Similarity: 0.34;  which airlines fly from boston to washington dc via other cities
Similarity: 0.32;  what ground transportation is available in boston
Similarity: 0.31;  i would like your rates between atlanta and boston on september third
Similarity: 0.25;  of all airlines which airline has the most arrivals in atlanta


The top-ranked query we get is an inquiry about first-class tickets, which is very relevant considering the other options. Notice that it doesn’t contain the keyword “business”, nor does the search query contain the keyword “class”. But their meanings turn out to be the most similar compared to the rest, and this is captured in their embeddings.
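To make the ranking step concrete, here is a minimal, dependency-free sketch of cosine similarity. The three-dimensional vectors below are made up for illustration and are not real model outputs; the `cosine` helper is a hypothetical stand-in for scikit-learn's `cosine_similarity`.

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" (real embeddings have far more dimensions)
toy_embeds = {
    "first class tickets": [0.9, 0.1, 0.0],
    "lowest fare": [0.7, 0.3, 0.1],
    "ground transportation": [0.0, 0.2, 0.9],
}
toy_query = [0.8, 0.2, 0.0]

# Rank the candidates from most to least similar to the query
ranked = sorted(toy_embeds,
                key=lambda name: cosine(toy_query, toy_embeds[name]),
                reverse=True)
for name in ranked:
    print(f"{cosine(toy_query, toy_embeds[name]):.2f}", name)
```

Because cosine similarity only compares the direction of two vectors, texts that share no keywords can still score highly as long as their embeddings point the same way.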

### Plot the new query and existing queries on a chart

In [None]:
# Create new dataframe and append new query
df_sem = df.copy()
df_sem.loc[len(df_sem.index)] = [new_query, new_query_embeds]

# Reduce embeddings dimension to 2
embeds_sem = np.array(df_sem['query_embeds'].tolist())
embeds_sem_pc2 = get_pc(embeds_sem,2)

# Add the principal components to dataframe
df_sem_pc2 = pd.concat([df_sem, pd.DataFrame(embeds_sem_pc2)], axis=1)

In [None]:
# Create column for representing chart legend
df_sem_pc2['Source'] = 'Existing'
df_sem_pc2.at[len(df_sem_pc2)-1, 'Source'] = "New"

# Plot on a chart
df_sem_pc2.columns = df_sem_pc2.columns.astype(str)
selection = list(range(sample)) + [-1]
generate_chart(df_sem_pc2.iloc[selection],'0','1',color='Source',title='Semantic Search')

On the plot, we see that the new query is located closest to the query about first-class tickets.

## 2.2 - Clustering


Clustering is a process of grouping similar documents into clusters. It is used to organize a large number of documents into a smaller number of groups and lets us discover emerging patterns in the documents.

In [None]:
from sklearn.cluster import KMeans

# Pick the number of clusters
df_clust = df_pc2.copy()
n_clusters=2

# Cluster the embeddings
kmeans_model = KMeans(n_clusters=n_clusters, random_state=0)
classes = kmeans_model.fit_predict(embeds).tolist()
df_clust['cluster'] = (list(map(str,classes)))

# Plot on a chart
df_clust.columns = df_clust.columns.astype(str)
generate_chart(df_clust.iloc[:sample],'0','1',lbl='on',color='cluster',title='Clustering with 2 Clusters')

With the number of clusters set to 2, the algorithm looks to be spot on: it generates one cluster related to airline information and one cluster related to ground service information.
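For intuition about what happens inside the clustering step, here is a minimal k-means sketch in plain Python. It assumes a naive initialization (the first `k` points become the starting centroids), whereas scikit-learn's `KMeans` defaults to the more careful k-means++ initialization. The `kmeans` function and the 2D points are made up for illustration.

```python
import math

def kmeans(points, k, iterations=10):
    """Very small k-means: alternate assignment and centroid update steps."""
    # Naive initialization (an assumption of this sketch): first k points
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Made-up 2D points forming two visibly separate groups
toy_points = [[0.0, 0.0], [5.0, 5.0], [0.5, 0.2],
              [5.2, 4.8], [0.1, 0.4], [4.9, 5.3]]
toy_labels, toy_centroids = kmeans(toy_points, k=2)
print(toy_labels)
```

Note that the algorithm never sees any intent labels; it only groups points that are close together, which is why the clusters need to be interpreted after the fact.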

## 2.3 - Classification 

While clustering is an unsupervised learning task where we don’t know the number of classes or what they are, classification is a supervised learning task where we do know them.

In [None]:
# Bring back the 'intent' column so we can build the classifier
df_class = df_pc2.copy()
df_class['intent'] = intents

# Hold out the first `sample` rows for testing and use the remaining rows as training data
df_test = df_class[:sample]
df_train = df_class[sample:]

In [None]:
# Train the classifier with Support Vector Machine (SVM) algorithm

# import SVM classifier code
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


# Initialize the classifier
svm_classifier = make_pipeline(StandardScaler(), SVC())

# Prepare the training features and label
features = df_train['query_embeds'].tolist()
label = df_train['intent']

# Fit the support vector machine
svm_classifier.fit(features, label)

Pipeline(steps=[('standardscaler', StandardScaler()), ('svc', SVC())])

In [None]:
# Predict with test data

# Prepare the test inputs
df_test = df_test.copy()
inputs = df_test['query_embeds'].tolist()

# Predict the labels
df_test['intent_pred'] = svm_classifier.predict(inputs)

# Compute the score
score = svm_classifier.score(inputs, df_test['intent'])
print(f"Prediction accuracy is {100*score}%")

Prediction accuracy is 100.0%


In [None]:
# Plot the predicted classes
df_test.columns = df_test.columns.astype(str)
generate_chart(df_test,'0','1',lbl='off',color='intent_pred',title='Classification - Prediction')

In [None]:
# Plot the actual classes
generate_chart(df_test,'0','1',lbl='off',color='intent',title='Classification - Actual')

The two plots above show that all predictions (each class is represented by one color) match the actual classes.
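The train, predict, and score workflow above can be sketched without any dependencies. The example below swaps the SVM for a much simpler nearest-centroid classifier, purely to keep the sketch short; the function names, features, and labels are all made up for illustration.

```python
import math
from collections import defaultdict

def fit_centroids(features, labels):
    """Training: compute the mean feature vector of each class."""
    groups = defaultdict(list)
    for x, y in zip(features, labels):
        groups[y].append(x)
    return {y: [sum(col) / len(rows) for col in zip(*rows)]
            for y, rows in groups.items()}

def predict(centroids, x):
    """Prediction: return the class whose centroid is nearest to x."""
    return min(centroids, key=lambda y: math.dist(x, centroids[y]))

# Made-up 2D features standing in for embeddings
train_x = [[0.0, 0.1], [0.2, 0.0], [1.0, 1.1], [0.9, 1.0]]
train_y = ["airfare", "airfare", "ground_service", "ground_service"]
test_x = [[0.1, 0.2], [1.1, 0.9]]
test_y = ["airfare", "ground_service"]

centroids = fit_centroids(train_x, train_y)
preds = [predict(centroids, x) for x in test_x]

# Accuracy: the fraction of held-out examples predicted correctly
accuracy = sum(p == t for p, t in zip(preds, test_y)) / len(test_y)
print(f"Prediction accuracy is {100 * accuracy}%")
```

The key point is that accuracy is measured on examples the classifier did not see during training, just as the first `sample` rows were held out above.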

# 3. Finetuning

In practical applications, you will likely need to customize the model to your task, and in particular, the kind of data you are dealing with.

This is where finetuning comes in. A baseline model already comes pre-trained on a huge amount of text data. But finetuning can build further on that by taking in and adapting to your own data.

The result is a custom model that produces outputs that are more attuned to the task you have at hand.

In [None]:
# The finetuned model ID
atis_ft_v1 = "79937a96-8bc6-494a-9432-b6c0cd1f6e51-ft"

In [None]:
# Embed the dataset - use the finetuned model this time
df_ft = df.copy()
df_ft['intent'] = intents

df_ft['query_embeds'] = df_ft['query'].apply(get_embeddings,model=atis_ft_v1)

# Reduce embeddings to 2 dimensions
embeds_ft = np.array(df_ft['query_embeds'].tolist())
embeds_ft_pc2 = get_pc(embeds_ft,2)

# Plot the 2D embeddings from a finetuned model
df_ft_pc2 = pd.concat([df_ft, pd.DataFrame(embeds_ft_pc2)], axis=1)
df_ft_pc2.columns = df_ft_pc2.columns.astype(str)
generate_chart(df_ft_pc2.iloc[:sample],'0','1',lbl='off',color='intent', title='Finetuned model')

In [None]:
# Plot the 2D embeddings from a non-finetuned model
generate_chart(df_test,'0','1',lbl='off',color='intent',title='Non-finetuned model')

Referring to the two plots above:
- With a baseline (non-finetuned) model, which is what we’ve been using so far (second plot), we can already get a good separation between classes, which shows that it can perform well in this task.
- But with a finetuned model (first plot), the separation becomes even more apparent. Similar data points are now pushed even closer together and further apart from the rest. This indicates that the model has adapted to the additional data it received during finetuning, and hence is more likely to perform even better in this task.
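One way to put a number on "closer together and further apart" is to compare distances between classes with distances within classes. The sketch below uses a simple ratio for this; the metric, the `separation_ratio` name, and the points are assumptions chosen for illustration, not results from the actual baseline or finetuned models.

```python
import math
from itertools import combinations

def separation_ratio(points, labels):
    """Mean between-class distance divided by mean within-class distance."""
    within, between = [], []
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        (within if lp == lq else between).append(math.dist(p, q))
    return (sum(between) / len(between)) / (sum(within) / len(within))

toy_classes = ["a", "a", "b", "b"]

# Made-up layouts: the same classes, loosely vs. tightly grouped
loose = [[0.0, 0.0], [2.0, 0.0], [4.0, 0.0], [6.0, 0.0]]
tight = [[0.0, 0.0], [0.5, 0.0], [6.0, 0.0], [6.5, 0.0]]

loose_ratio = separation_ratio(loose, toy_classes)
tight_ratio = separation_ratio(tight, toy_classes)
print(loose_ratio, tight_ratio)
```

A higher ratio means classes are more cleanly separated. To apply this to the notebook, one could pass the 2D principal components and the `intent` labels for each model, keeping in mind that distances in a 2D projection only approximate distances in the full embedding space.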