<a href="https://colab.research.google.com/github/norachams/LLM-University/blob/main/Module2_Chapter1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chapter 1: Introduction to Text Embeddings

**Setup:**
First, we install and import the required dependencies.

In [3]:
! pip install cohere altair -q

In [1]:
import pandas as pd
import numpy as np
import altair as alt
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans

In [4]:
import os
import cohere
# Assumes the API key is stored in the COHERE_API_KEY environment variable
co = cohere.ClientV2(os.environ["COHERE_API_KEY"])

## Step 1: Prepare the Dataset
We use a subset of the [ATIS](https://www.kaggle.com/datasets/hassanamin/atis-airlinetravelinformationsystem?select=atis_intents_train.csv&ref=cohere-ai.ghost.io&_gl=1*1naj1c5*_gcl_au*MTU4MzMwNTUyNy4xNzU1ODA1MjU4LjExOTc5ODI2OC4xNzYwNTYxNTQyLjE3NjA1NjE1NTQ.) (Airline Travel Information System) classification data, which we load into a pandas DataFrame.

In [8]:
# Load the dataset to a dataframe
df_orig = pd.read_csv('https://raw.githubusercontent.com/cohere-ai/notebooks/main/notebooks/data/atis_intents_train.csv', names=['intent','query'])

# Take a small sample for illustration purposes
sample_classes = ['atis_airfare', 'atis_airline', 'atis_ground_service']
df = df_orig.sample(frac=0.1, random_state=30)
df = df[df.intent.isin(sample_classes)]
df_orig = df_orig.drop(df.index)
df.reset_index(drop=True,inplace=True)

# Save the intent labels for later use, then drop the column
intents = df['intent']
df.drop(columns=['intent'], inplace=True)
df.head()

Unnamed: 0,query
0,which airlines fly from boston to washington ...
1,show me the airlines that fly between toronto...
2,show me round trip first class tickets from n...
3,i'd like the lowest fare from denver to pitts...
4,show me a list of ground transportation at bo...


In [9]:
for i in df.head(10)["query"]:
    print(i)

 which airlines fly from boston to washington dc via other cities
 show me the airlines that fly between toronto and denver
 show me round trip first class tickets from new york to miami
 i'd like the lowest fare from denver to pittsburgh
 show me a list of ground transportation at boston airport
 show me boston ground transportation
 of all airlines which airline has the most arrivals in atlanta
 what ground transportation is available in boston
 i would like your rates between atlanta and boston on september third
 which airlines fly between boston and pittsburgh


## Step 2: Turn Text into Embeddings
Next, we use Cohere's Embed endpoint, which takes text as input and returns embeddings.

In [10]:
def get_embeddings(texts, model="embed-v4.0", input_type="search_document"):
    output = co.embed(
        texts=texts,
        model=model,
        input_type=input_type,
        embedding_types=["float"]
    )
    return output.embeddings.float
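
With a larger dataset, a single embed call may not accept every text at once. Below is a minimal sketch of batching, assuming a per-request limit of 96 texts (an assumption to verify against the current Cohere documentation). `embed_in_batches` and `fake_embed` are hypothetical names introduced here; `fake_embed` stands in for `get_embeddings` so the example runs without an API key.

```python
from typing import Callable, List, Sequence


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int = 96,  # assumed per-request limit; check the API docs
) -> List[List[float]]:
    """Call embed_fn on consecutive slices of texts and concatenate the results."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    embeddings: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        embeddings.extend(embed_fn(batch))
    return embeddings


def fake_embed(batch: List[str]) -> List[List[float]]:
    """Stand-in for get_embeddings: returns one 2-number vector per text."""
    return [[float(len(text)), float(text.count(" "))] for text in batch]


queries = [f"show me flights number {i}" for i in range(200)]
vectors = embed_in_batches(queries, fake_embed)
print(len(vectors), len(vectors[0]))
```

In the notebook, `get_embeddings` could be passed as `embed_fn` in place of the stand-in.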

In [11]:
# Embed the dataset
df['query_embeds'] = get_embeddings(df['query'].tolist())
df.head()

Unnamed: 0,query,query_embeds
0,which airlines fly from boston to washington ...,"[0.054129835, -0.022328556, -0.0012158069, -0...."
1,show me the airlines that fly between toronto...,"[0.021492628, 0.012399592, -0.0142182, -0.0172..."
2,show me round trip first class tickets from n...,"[-0.053697467, 0.028789606, -0.0049937028, 0.0..."
3,i'd like the lowest fare from denver to pitts...,"[0.048570547, 0.017809201, -0.019994875, -0.01..."
4,show me a list of ground transportation at bo...,"[0.046666104, -0.0040052338, 0.009201768, -0.0..."


For every text passed to the Embed endpoint, a sequence of 1024 numbers is generated (the exact length depends on the model and its output settings). Each of these numbers represents a piece of information about the meaning of the text.
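One way to confirm the embedding length is to stack the vectors into a NumPy array and inspect its shape. The sketch below uses random stand-in vectors (not real embeddings) so that it runs without calling the API. The 1024 is taken from the text above and would be replaced by whatever length the model actually returns.

```python
import numpy as np

# Stand-in for df['query_embeds'].tolist(): 3 texts, 1024 numbers each.
# These are random values for illustration, not real embeddings.
rng = np.random.default_rng(seed=0)
stand_in_embeds = rng.normal(size=(3, 1024)).tolist()

# One row per text, one column per embedding dimension
embeds_matrix = np.array(stand_in_embeds)
n_texts, n_dims = embeds_matrix.shape
print(f"{n_texts} texts, {n_dims} numbers per text")
```

On the real data, the same check would be `np.array(df['query_embeds'].tolist()).shape`.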

## Step 3: Visualize Embeddings with a Heatmap
Now we plot the numbers on a heatmap. The function below does this using a technique called principal component analysis (PCA), which reduces the number of dimensions in an embedding while keeping as much information as possible. We'll set it to 10 dimensions.

In [7]:
# Function to return the principal components
def get_pc(arr, n):
    pca = PCA(n_components=n)
    embeds_transform = pca.fit_transform(arr)
    return embeds_transform

# Reduce embeddings to 10 principal components to aid visualization
embeds = np.array(df['query_embeds'].tolist())
embeds_pc = get_pc(embeds, 10)
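
To make the reduction less of a black box, here is a minimal NumPy sketch of what PCA does: center the data, take a singular value decomposition, and project onto the top components. `pca_project` is a hypothetical helper written for illustration, not the notebook's `get_pc`, and the input is random stand-in data. Its output should agree with `PCA.fit_transform` up to the sign of each component and small numerical differences.

```python
import numpy as np


def pca_project(arr: np.ndarray, n: int):
    """Project rows of arr onto the top-n principal components.

    Returns the projected data and the fraction of total variance
    captured by each kept component.
    """
    centered = arr - arr.mean(axis=0)
    # Rows of vt are the principal directions, ordered by singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:n].T
    variance = singular_values ** 2
    explained_ratio = variance[:n] / variance.sum()
    return projected, explained_ratio


# Stand-in data: 50 "embeddings" of 64 numbers each (random, for illustration).
rng = np.random.default_rng(seed=0)
stand_in = rng.normal(size=(50, 64))

projected, explained_ratio = pca_project(stand_in, 10)
print(projected.shape)
print(f"variance kept: {explained_ratio.sum():.1%}")
```

The second printed value shows how much information survives the reduction, which is the trade-off behind choosing 10 components.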

In [12]:
# Set sample size to visualize
sample = 9

# Reshape the data for visualization purposes
source = pd.DataFrame(embeds_pc)[:sample]
source = pd.concat([source, df['query'][:sample]], axis=1)
source = source.melt(id_vars=['query'])

# Configure the plot
chart = alt.Chart(source).mark_rect().encode(
    x=alt.X('variable:N', title="Embedding"),
    y=alt.Y('query:N', title='',axis=alt.Axis(labelLimit=500)),
    color=alt.Color('value:Q', title="Value", scale=alt.Scale(
                range=["#917EF3", "#000000"]))
)

result = chart.configure(background='#ffffff'
        ).properties(
        width=700,
        height=400,
        title='Embeddings with 10 dimensions'
       ).configure_axis(
      labelFontSize=15,
      titleFontSize=12)

# Show the plot
result

Notice that similar texts have similar embedding patterns, for example the queries about ground transportation in Boston.
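The visual impression can be checked numerically with cosine similarity, which the notebook already imports from scikit-learn. The sketch below computes it with plain NumPy on three made-up 4-number vectors (stand-ins, not real embeddings): two that point in nearly the same direction and one that does not. `cosine_sim` and the variable names are hypothetical.

```python
import numpy as np


def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Made-up vectors for illustration only.
boston_transport_1 = np.array([0.9, 0.1, -0.3, 0.2])
boston_transport_2 = np.array([0.8, 0.2, -0.2, 0.1])
lowest_fare = np.array([-0.1, 0.7, 0.5, -0.4])

sim_similar = cosine_sim(boston_transport_1, boston_transport_2)
sim_different = cosine_sim(boston_transport_1, lowest_fare)
print(f"similar pair:   {sim_similar:.3f}")
print(f"different pair: {sim_different:.3f}")
```

On the real data, `cosine_similarity(embeds[:sample])` would give the full pairwise similarity matrix for the sampled queries.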

## Step 4: Visualize Embeddings on a 2D Plot


In [13]:
# Function to generate the 2D plot
def generate_chart(df,xcol,ycol,lbl='on',color='basic',title=''):
    chart = alt.Chart(df).mark_circle(size=500).encode(
        x=
        alt.X(xcol,
              scale=alt.Scale(zero=False),
              axis=alt.Axis(labels=False, ticks=False, domain=False)
             ),
        y=
        alt.Y(ycol,
              scale=alt.Scale(zero=False),
              axis=alt.Axis(labels=False, ticks=False, domain=False)
             ),
        color= alt.value('#333293') if color == 'basic' else color,
        tooltip=['query']
    )

    if lbl == 'on':
        text = chart.mark_text(align='left', baseline='middle',dx=15, size=13,color='black').encode(text='query', color= alt.value('black'))
    else:
        text = chart.mark_text(align='left', baseline='middle',dx=10).encode()

    result = (chart + text).configure(background="#FDF7F0").properties(
        width=800,
        height=500,
        title=title
    ).configure_legend(orient='bottom', titleFontSize=18,labelFontSize=18)

    return result

In [14]:
# Reduce embeddings to 2 principal components to aid visualization
embeds_pc2 = get_pc(embeds, 2)

# Add the principal components to dataframe
df_pc2 = pd.concat([df, pd.DataFrame(embeds_pc2)], axis=1)

# Plot the 2D embeddings on a chart
df_pc2.columns = df_pc2.columns.astype(str)
generate_chart(df_pc2.iloc[:sample],'0','1',title='2D Embeddings')

Here we can see that texts with similar meanings are close to each other and cluster together.
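A rough way to quantify this clustering is to compare each point with the centroid of every intent class and count how often the nearest centroid belongs to the point's own class. The sketch below uses made-up 2D points and labels as stand-ins for `embeds_pc2` and `intents`; `nearest_centroid_accuracy` is a hypothetical helper, not part of the notebook.

```python
import numpy as np


def nearest_centroid_accuracy(points: np.ndarray, labels: np.ndarray) -> float:
    """Fraction of points whose nearest class centroid is their own class."""
    classes = np.unique(labels)
    centroids = np.stack([points[labels == c].mean(axis=0) for c in classes])
    # distances[i, j] = Euclidean distance from point i to centroid j
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    predicted = classes[distances.argmin(axis=1)]
    return float((predicted == labels).mean())


# Made-up 2D coordinates and intent labels, for illustration only.
points = np.array([
    [0.9, 0.8], [1.0, 0.7], [0.8, 0.9],        # ground service
    [-0.9, 0.1], [-1.0, 0.2], [-0.8, 0.0],     # airline
    [0.0, -0.9], [0.1, -1.0], [-0.1, -0.8],    # airfare
])
labels = np.array(
    ["atis_ground_service"] * 3 + ["atis_airline"] * 3 + ["atis_airfare"] * 3
)

accuracy = nearest_centroid_accuracy(points, labels)
print(f"nearest-centroid accuracy: {accuracy:.2f}")
```

A value near 1 would mean the classes are well separated in the 2D projection; a value near chance would mean the projection mixes them.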