<a href="https://colab.research.google.com/github/norachams/LLM-University/blob/main/Module2_Chapter1_2_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chapter 1: Introduction to Text Embeddings

**Setup:**
First we will install and import the needed dependencies.

In [None]:
! pip install cohere altair -q


In [None]:
import pandas as pd
import numpy as np
import altair as alt
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans

In [None]:
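import os
import cohere
# Read the key from an environment variable (COHERE_API_KEY is an assumed name) instead of hard-coding it
co = cohere.ClientV2(api_key=os.environ["COHERE_API_KEY"])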

## Step 1: Prepare the Dataset
We are using a subset of the [ATIS](https://www.kaggle.com/datasets/hassanamin/atis-airlinetravelinformationsystem?select=atis_intents_train.csv&ref=cohere-ai.ghost.io&_gl=1*1naj1c5*_gcl_au*MTU4MzMwNTUyNy4xNzU1ODA1MjU4LjExOTc5ODI2OC4xNzYwNTYxNTQyLjE3NjA1NjE1NTQ.) intent classification dataset, which we load into a pandas DataFrame.

In [None]:
# Load the dataset to a dataframe
df_orig = pd.read_csv('https://raw.githubusercontent.com/cohere-ai/notebooks/main/notebooks/data/atis_intents_train.csv', names=['intent','query'])

# Take a small sample for illustration purposes
sample_classes = ['atis_airfare', 'atis_airline', 'atis_ground_service']
df = df_orig.sample(frac=0.1, random_state=30)
df = df[df.intent.isin(sample_classes)]
df_orig = df_orig.drop(df.index)
df.reset_index(drop=True,inplace=True)

# Remove unnecessary column
intents = df['intent']  # keep the intent labels for later use
df.drop(columns=['intent'], inplace=True)
df.head()

| | query |
|---|---|
| 0 | which airlines fly from boston to washington ... |
| 1 | show me the airlines that fly between toronto... |
| 2 | show me round trip first class tickets from n... |
| 3 | i'd like the lowest fare from denver to pitts... |
| 4 | show me a list of ground transportation at bo... |


In [None]:
for i in df.head(10)["query"]:
    print(i)

 which airlines fly from boston to washington dc via other cities
 show me the airlines that fly between toronto and denver
 show me round trip first class tickets from new york to miami
 i'd like the lowest fare from denver to pittsburgh
 show me a list of ground transportation at boston airport
 show me boston ground transportation
 of all airlines which airline has the most arrivals in atlanta
 what ground transportation is available in boston
 i would like your rates between atlanta and boston on september third
 which airlines fly between boston and pittsburgh


## Step 2: Turn Text into Embeddings
Next, we use Cohere's Embed endpoint, which takes text as input and returns embeddings.

In [None]:
def get_embeddings(texts, model="embed-v4.0", input_type="search_document"):
    output = co.embed(
        texts=texts,
        model=model,
        input_type=input_type,
        embedding_types=["float"]
    )
    return output.embeddings.float

In [None]:
# Embed the dataset
df['query_embeds'] = get_embeddings(df['query'].tolist())
df.head()

| | query | query_embeds |
|---|---|---|
| 0 | which airlines fly from boston to washington ... | [0.054129835, -0.022328556, -0.0012158069, -0.... |
| 1 | show me the airlines that fly between toronto... | [0.021492628, 0.012399592, -0.0142182, -0.0172... |
| 2 | show me round trip first class tickets from n... | [-0.053697467, 0.028789606, -0.0049937028, 0.0... |
| 3 | i'd like the lowest fare from denver to pitts... | [0.048570547, 0.017809201, -0.019994875, -0.01... |
| 4 | show me a list of ground transportation at bo... | [0.046666104, -0.0040052338, 0.009201768, -0.0... |


For every text passed to the Embed endpoint, a sequence of 1024 numbers is generated (the exact length depends on the model and its output-dimension setting; `len(df['query_embeds'][0])` confirms it). Together, these numbers capture information about the meaning of the text.

## Step 3: Visualize Embeddings with a Heatmap
Now we are going to plot the embeddings on a heatmap. To keep the plot readable, the `get_pc` function uses a technique called principal component analysis (PCA), which reduces the number of dimensions in an embedding while keeping as much information as possible. We'll set it to 10 dimensions.

In [None]:
# Function to return the principal components
def get_pc(arr, n):
    pca = PCA(n_components=n)
    embeds_transform = pca.fit_transform(arr)
    return embeds_transform

# Reduce embeddings to 10 principal components to aid visualization
embeds = np.array(df['query_embeds'].tolist())
embeds_pc = get_pc(embeds, 10)
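
To make the reduction step less of a black box, here is a minimal sketch of PCA written with NumPy's SVD instead of scikit-learn's `PCA` class. The function name `pca_reduce` and the toy data are assumptions for illustration; the notebook itself keeps using `get_pc`. The toy "embeddings" are built so that almost all of their variation lies along two hidden directions, which is why two components retain nearly everything.

```python
import numpy as np

def pca_reduce(vectors, n_components):
    """Project row vectors onto their top principal components using SVD."""
    X = np.asarray(vectors, dtype=float)
    X_centered = X - X.mean(axis=0)  # PCA works on mean-centered data
    # Rows of Vt are the principal directions, ordered by decreasing variance
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    reduced = X_centered @ Vt[:n_components].T
    # Share of the total variance carried by each component
    explained = (S ** 2) / np.sum(S ** 2)
    return reduced, explained[:n_components]

rng = np.random.default_rng(0)
# Toy stand-in for embeddings: 50 points in 8 dimensions that mostly vary along 2 hidden directions
latent = rng.normal(size=(50, 2))
mixing = rng.normal(size=(2, 8))
toy_embeds = latent @ mixing + 0.01 * rng.normal(size=(50, 8))

reduced, explained = pca_reduce(toy_embeds, 2)
print(reduced.shape)     # (50, 2)
print(explained.sum())   # fraction of variance kept by the two components
```

Real embeddings rarely compress this cleanly, so the fraction of variance kept by 10 or 2 components is worth checking before reading too much into a plot.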

In [None]:
# Set sample size to visualize
sample = 9

# Reshape the data for visualization purposes
source = pd.DataFrame(embeds_pc)[:sample]
source = pd.concat([source,df['query']], axis=1)
source = source.melt(id_vars=['query'])

# Configure the plot
chart = alt.Chart(source).mark_rect().encode(
    x=alt.X('variable:N', title="Embedding"),
    y=alt.Y('query:N', title='',axis=alt.Axis(labelLimit=500)),
    color=alt.Color('value:Q', title="Value", scale=alt.Scale(
                range=["#917EF3", "#000000"]))
)

result = chart.configure(background='#ffffff'
        ).properties(
        width=700,
        height=400,
        title='Embeddings with 10 dimensions'
       ).configure_axis(
      labelFontSize=15,
      titleFontSize=12)

# Show the plot
result

Notice that similar texts have similar embedding patterns, such as the three queries about ground transportation in Boston.

## Step 4: Visualize Embeddings on a 2D Plot


In [None]:
# Function to generate the 2D plot
def generate_chart(df,xcol,ycol,lbl='on',color='basic',title=''):
    chart = alt.Chart(df).mark_circle(size=500).encode(
        x=
        alt.X(xcol,
              scale=alt.Scale(zero=False),
              axis=alt.Axis(labels=False, ticks=False, domain=False)
             ),
        y=
        alt.Y(ycol,
              scale=alt.Scale(zero=False),
              axis=alt.Axis(labels=False, ticks=False, domain=False)
             ),
        color= alt.value('#333293') if color == 'basic' else color,
        tooltip=['query']
    )

    if lbl == 'on':
        text = chart.mark_text(align='left', baseline='middle',dx=15, size=13,color='black').encode(text='query', color= alt.value('black'))
    else:
        text = chart.mark_text(align='left', baseline='middle',dx=10).encode()

    result = (chart + text).configure(background="#FDF7F0").properties(
        width=800,
        height=500,
        title=title
    ).configure_legend(orient='bottom', titleFontSize=18,labelFontSize=18)

    return result

In [None]:
# Reduce embeddings to 2 principal components to aid visualization
embeds_pc2 = get_pc(embeds, 2)

# Add the principal components to dataframe
df_pc2 = pd.concat([df, pd.DataFrame(embeds_pc2)], axis=1)

# Plot the 2D embeddings on a chart
df_pc2.columns = df_pc2.columns.astype(str)
generate_chart(df_pc2.iloc[:sample],'0','1',title='2D Embeddings')

Here we can see that texts with similar meanings sit close to each other and cluster together.

# Chapter 2: Semantic Search

We are going to use text embeddings to build a search capability that retrieves relevant information based on semantic similarity rather than keyword matching alone. This is one of the most useful properties of text embeddings: even with zero keyword matches, they can surface the documents most relevant to a question.

## Step 1: Embed the Search Query


In [None]:
# Define new query
new_query = "How can I find a taxi or a bus when the plane lands?"

We use the same `get_embeddings()` function as before, but change `input_type` to `"search_query"`.

In [None]:
# Get embeddings of the new query
new_query_embeds = get_embeddings([new_query], input_type="search_query")[0]

## Step 2: Compare to Embedded Documents


In [None]:
# Calculate cosine similarity between the search query and existing queries
def get_similarity(target, candidates):
    # Turn list into array
    candidates = np.array(candidates)
    target = np.expand_dims(np.array(target),axis=0)

    # Calculate cosine similarity
    sim = cosine_similarity(target, candidates)
    sim = np.squeeze(sim).tolist()
    sort_index = np.argsort(sim)[::-1]
    sort_score = [sim[i] for i in sort_index]
    similarity_scores = zip(sort_index,sort_score)

    # Return similarity scores
    return similarity_scores

# Get the similarity between the search query and existing queries
similarity = get_similarity(new_query_embeds, embeds[:sample])

We'll then view the documents in decreasing order of similarity.
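
The `get_similarity` function relies on cosine similarity, which measures the angle between two vectors and ignores their lengths. As a rough sketch of what happens under the hood, the pure-Python version below computes the score and ranks candidates by it. The names `cosine` and `rank_by_similarity` and the 3-dimensional vectors are made up for illustration; they are not part of the notebook.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_similarity(target, candidates):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine(target, c)) for i, c in enumerate(candidates)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Made-up 3-dimensional vectors standing in for real embeddings
target = [1.0, 0.0, 1.0]
candidates = [
    [0.0, 1.0, 0.0],   # orthogonal to the target
    [2.0, 0.0, 2.0],   # same direction, different length
    [1.0, 1.0, 0.0],   # partially aligned
]

ranking = rank_by_similarity(target, candidates)
for idx, score in ranking:
    print(idx, round(score, 2))
```

One design difference from `get_similarity`: this sketch returns a list, so the ranking can be iterated more than once, whereas a `zip` object is exhausted after a single pass.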



In [None]:
# View the documents ranked by similarity
print('Query:')
print(new_query,'\n')

print('Most Similar Documents:')
for idx, sim in similarity:
    print(f'Similarity: {sim:.2f};', df.iloc[idx]['query'])

Query:
How can I find a taxi or a bus when the plane lands? 

Most Similar Documents:
Similarity: 0.34;  what ground transportation is available in boston
Similarity: 0.32;  show me boston ground transportation
Similarity: 0.31;  show me a list of ground transportation at boston airport
Similarity: 0.25;  i would like your rates between atlanta and boston on september third
Similarity: 0.21;  of all airlines which airline has the most arrivals in atlanta
Similarity: 0.15;  show me round trip first class tickets from new york to miami
Similarity: 0.15;  which airlines fly from boston to washington dc via other cities
Similarity: 0.13;  i'd like the lowest fare from denver to pittsburgh
Similarity: 0.11;  show me the airlines that fly between toronto and denver


The three most similar documents are all about ground transportation. They share no keywords with the query (none of them mentions a taxi or a bus), yet they are semantically similar to it, which is why text embeddings give them the highest similarity scores.
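
To see why plain keyword matching would miss these results, the sketch below counts the keywords shared between the query and the three top-ranked documents. The `keywords` helper and the tiny stop-word list are hypothetical and exist only for this illustration; a real keyword search engine would tokenize and filter differently.

```python
import re

# Hypothetical minimal stop-word list, for illustration only
STOP_WORDS = {"a", "the", "or", "i", "me", "of", "in", "at", "is",
              "can", "how", "what", "when"}

def keywords(text):
    """Lowercase the text, split it into words, and drop stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return {w for w in words if w not in STOP_WORDS}

query = "How can I find a taxi or a bus when the plane lands?"
documents = [
    "what ground transportation is available in boston",
    "show me boston ground transportation",
    "show me a list of ground transportation at boston airport",
]

overlaps = [keywords(query) & keywords(doc) for doc in documents]
for doc, shared in zip(documents, overlaps):
    print(f"{len(shared)} shared keywords: {doc}")
```

Every document comes back with zero shared keywords, so a search that only counted word overlap would have no basis for ranking them first.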

## Step 3: Visualize the Results in a 2D Plot


In [None]:
# Create new dataframe and append new query
df_sem = df.copy()
df_sem.loc[len(df_sem.index)] = [new_query, new_query_embeds]

# Reduce embeddings dimension to 2
embeds_sem = np.array(df_sem['query_embeds'].tolist())
embeds_sem_pc2 = get_pc(embeds_sem, 2)

# Add the principal components to dataframe
df_sem_pc2 = pd.concat([df_sem, pd.DataFrame(embeds_sem_pc2)], axis=1)

In [None]:
# Create column for representing chart legend
df_sem_pc2['Source'] = 'Existing'
df_sem_pc2.at[len(df_sem_pc2)-1, 'Source'] = "New"

# Plot on a chart
df_sem_pc2.columns = df_sem_pc2.columns.astype(str)
selection = list(range(sample)) + [-1]
generate_chart(df_sem_pc2.iloc[selection],'0','1',color='Source',title='Semantic Search')

We can see that our new query is located closest to the sentences about ground transportation.

# Chapter 3: Text Clustering

## Step 1: Embed the Text for Clustering


In [None]:
# Embed the text for clustering
df['clustering_embeds'] = get_embeddings(df['query'].tolist(), input_type="clustering")
embeds = np.array(df['clustering_embeds'].tolist())

We will use the k-means clustering algorithm to group the embeddings.

## Step 2: Cluster the Embeddings
Since we have a small dataset, we are going to use two clusters. Larger datasets usually call for many more.
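
For intuition about what k-means does, here is a minimal NumPy sketch of its two alternating steps: assign each point to the nearest centroid, then move each centroid to the mean of its points. This is an illustrative simplification, not scikit-learn's `KMeans` implementation (which adds smarter initialization and convergence checks), and the function name `kmeans` and the 2D toy points are assumptions.

```python
import numpy as np

def kmeans(points, k, n_iter=10, seed=0):
    """Basic k-means: alternate between assigning points and updating centroids."""
    rng = np.random.default_rng(seed)
    X = np.asarray(points, dtype=float)
    # Start from k distinct points chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its points
        # (a centroid with no points stays where it is)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Two well-separated toy groups standing in for embeddings
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

Which group ends up as cluster 0 or cluster 1 depends on the random starting points, so cluster numbers carry no meaning of their own.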


In [None]:
# Pick the number of clusters
n_clusters = 2

# Cluster the embeddings
kmeans_model = KMeans(n_clusters=n_clusters, random_state=0)
classes = kmeans_model.fit_predict(embeds).tolist()

# Store the cluster assignments
df_clust = df_pc2.copy()
df_clust['cluster'] = (list(map(str,classes)))

# Preview the cluster assignments
df_clust.head()

| | query | query_embeds | 0 | 1 | cluster |
|---|---|---|---|---|---|
| 0 | which airlines fly from boston to washington ... | [0.054129835, -0.022328556, -0.0012158069, -0.... | -0.083819 | 0.430707 | 0 |
| 1 | show me the airlines that fly between toronto... | [0.021492628, 0.012399592, -0.0142182, -0.0172... | -0.084121 | 0.523354 | 0 |
| 2 | show me round trip first class tickets from n... | [-0.053697467, 0.028789606, -0.0049937028, 0.0... | -0.232563 | -0.035637 | 0 |
| 3 | i'd like the lowest fare from denver to pitts... | [0.048570547, 0.017809201, -0.019994875, -0.01... | -0.288963 | -0.101116 | 0 |
| 4 | show me a list of ground transportation at bo... | [0.046666104, -0.0040052338, 0.009201768, -0.0... | 0.508615 | 0.131637 | 1 |


## Step 3: Visualize the Results in a 2D Plot


In [None]:

# Plot on a chart
df_clust.columns = df_clust.columns.astype(str)
generate_chart(df_clust.iloc[:sample],'0','1',lbl='on',color='cluster',title='Clustering with 2 Clusters')

As you can see in the plot, the algorithm produces 2 clusters. One is related to air transportation and the other is related to ground transportation.
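
A plot of nine points is only a spot check. One way to test the claim more broadly is to count how the intent labels are distributed across the clusters. The sketch below does this with a hypothetical helper, `cluster_composition`, run on made-up toy values; the toy lists merely stand in for the notebook's `classes` and `intents` and are not real results.

```python
from collections import Counter

def cluster_composition(cluster_labels, true_labels):
    """Count how many items of each true label fall into each cluster."""
    table = {}
    for cluster, label in zip(cluster_labels, true_labels):
        table.setdefault(cluster, Counter())[label] += 1
    return table

# Made-up toy values standing in for `classes` and `intents`
toy_clusters = [0, 0, 0, 0, 1, 1]
toy_intents = [
    "atis_airline", "atis_airline",
    "atis_airfare", "atis_airfare",
    "atis_ground_service", "atis_ground_service",
]

composition = cluster_composition(toy_clusters, toy_intents)
for cluster, counts in sorted(composition.items()):
    print(cluster, dict(counts))
```

With the notebook's data, the equivalent call would be `cluster_composition(classes, intents)`. A cluster dominated by `atis_ground_service` would support the reading that the algorithm separated ground transportation from the flight-related queries.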