# Hands-on session: embeddings

In this session, we’ll dive right into some hands-on data work. We’ve prepared a (messy) dataset for you, consisting of a transcription of a yet-to-be-revealed Tweede Kamer debate. The transcription was generated using Azure's AI Speech-to-Text service, which is highly accurate and capable of distinguishing between different speakers, but still makes mistakes.

The goal of this session is to gain practical experience with text embeddings, dimensionality reduction, and clustering, while exploring how to combine different techniques. By applying these methods to our dataset, we’ll aim to uncover the main topics discussed in the debate.



In [None]:
!pip install -r https://raw.githubusercontent.com/kcambrek/masterclass/refs/heads/main/data/requirements.txt

In [None]:
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.manifold import TSNE
import plotly.express as px
import numpy as np
from typing import Optional, Callable


## Load data

In [4]:
df = pd.read_csv("https://raw.githubusercontent.com/kcambrek/masterclass/refs/heads/main/data/transcriptie.csv", sep = ";").fillna(" ")
df.head()


Unnamed: 0,speaker,text
0,Guest-1,Als eerste het woord aan de heer Krul.
1,Guest-1,Van de Fractie van het CDA die een klemmende v...
2,Guest-1,Aan minister Madlener.
3,Guest-2,Die heeft nog eens maken het vragenuurtje in d...
4,Guest-1,Te mogen open.


In [5]:
# How many rows do we have?
print(f"Number of rows : {len(df)}")

# How many speakers do we have?
print(df["speaker"].value_counts())

# Calculate the number of words in each row of the specified column
df['word_count'] = df['text'].apply(lambda x: len(x.split()))

# Create a histogram using Plotly Express
fig = px.histogram(df, x="word_count", nbins=50, title="Distribution of Number of Words")
fig.show()

Number of rows : 295
speaker
Guest-1     62
Guest-3     45
Guest-4     36
Guest-7     35
Guest-13    35
Unknown     25
Guest-5     14
Guest-16    11
Guest-14     7
Guest-6      5
Guest-8      5
Guest-11     4
Guest-12     4
Guest-9      3
Guest-2      2
Guest-10     1
Guest-15     1
Name: count, dtype: int64


## Create embedding

We use a transformer model to generate embeddings of our texts.
It maps each text to a high-dimensional vector that should reflect the semantics of the input: similar texts should end up close to each other in the embedding space.

![image2.png](https://www.wavelabs.ai/wp-content/uploads/2021/03/Word-Embeddings-Image-2-1024x358.png)
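
To make "close to each other" concrete, here is a minimal sketch of cosine similarity, a common way to compare embeddings. The three vectors are invented 3-dimensional toy values, not output of the model used in this notebook, and the variable names are hypothetical.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings", made up purely for illustration
vec_swimming = np.array([0.9, 0.1, 0.0])
vec_pool = np.array([0.8, 0.2, 0.1])
vec_army = np.array([0.0, 0.2, 0.9])

sim_related = cosine_similarity(vec_swimming, vec_pool)
sim_unrelated = cosine_similarity(vec_swimming, vec_army)

print(f"swimming vs pool : {sim_related:.2f}")
print(f"swimming vs army : {sim_unrelated:.2f}")
```

With real embeddings the same function can be applied to any two rows of the embedding matrix.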

In [None]:
# Choose embedding model. Check out https://huggingface.co/sentence-transformers for more models
model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
# Generate embeddings (pass a plain list of strings rather than a pandas Series)
embeddings = model.encode(df["text"].tolist(), show_progress_bar=True)

## Reduce and plot embeddings

The embeddings are high-dimensional: 384 dimensions, to be exact. We cannot visualize such a space directly. In addition, clustering techniques can struggle with high-dimensional vectors. Dimensionality reduction can help by removing noise and mitigating the curse of dimensionality. However, reducing dimensions can also result in information loss.
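
The information loss mentioned above can be made measurable. The sketch below uses PCA instead of t-SNE, because PCA reports how much variance each component retains and t-SNE does not. The data is synthetic (random vectors with 384 dimensions and a hidden low-dimensional structure), so the numbers say nothing about the debate dataset; all names are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in for embeddings: 200 vectors of 384 dimensions
# that really only vary along 5 hidden directions, plus a little noise.
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 384))
fake_embeddings = latent @ mixing + 0.1 * rng.normal(size=(200, 384))

pca = PCA(n_components=10, random_state=0)
reduced = pca.fit_transform(fake_embeddings)

# Fraction of the total variance that survives the reduction
retained = float(pca.explained_variance_ratio_.sum())

print(f"Shape before : {fake_embeddings.shape}")
print(f"Shape after  : {reduced.shape}")
print(f"Variance retained by 10 components : {retained:.3f}")
```

On real embeddings the retained variance is usually lower, which is one way to decide how many components to keep before clustering.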

In [None]:
# Apply t-SNE to reduce the dimensionality to 2. https://scikit-learn.org/dev/modules/generated/sklearn.manifold.TSNE.html
tsne = TSNE(n_components=2, random_state=42)
reduced_embeddings = tsne.fit_transform(embeddings)

print(f"Shape of embeddings : {embeddings.shape}")
print(f"Shape of reduced embeddings : {reduced_embeddings.shape}")

In [None]:
def plot_reduced_embeddings(reduced_embeddings: np.ndarray, labels : Optional[list] = None) -> None:
  '''Creates plotly scatter plot of 2-dimensional embeddings.'''
  temp_df = pd.DataFrame(reduced_embeddings, columns=['x', 'y'])
  if labels:
      temp_df["label"] = labels
      fig = px.scatter(temp_df, x='x', y='y', color=temp_df['label'],
                title="Reduced Embeddings", labels={'color': 'label'})
  else:
      fig = px.scatter(temp_df, x = "x", y='y', title="Reduced Embeddings")

  fig.show()


In [None]:
# Plot reduced embeddings
plot_reduced_embeddings(reduced_embeddings)

In [None]:
# Be careful with t-SNE! It is stochastic: the layout can change with the random seed, so do not over-interpret a single plot
tsne_1 = TSNE(n_components=2, random_state=43)
reduced_embeddings_1 = tsne_1.fit_transform(embeddings)

plot_reduced_embeddings(reduced_embeddings_1)

In [None]:
# We do have extra information on the speakers. Are the texts from the same speaker clustered close together? / Do they speak about more topics?
plot_reduced_embeddings(reduced_embeddings, labels=list(df["speaker"]))

## Splitting texts

The dataset is messy. The texts produced by Azure Speech-to-Text come in big chunks of one or more sentences. Embedding models generate an embedding for each individual token, so a sentence or paragraph yields multiple embeddings. To get one embedding for an arbitrarily long piece of text, we pool the token embeddings. Often we simply take their mean.

![image.png](https://huggingface.co/blog/assets/32_1b_sentence_embeddings/model.png)

Since every token has its own embedding reflecting its semantics, we water down the semantics by pooling.

Splitting the big chunks into smaller pieces gives us more embeddings, each of which should capture the semantics of its piece of text better.
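
Mean pooling itself is only a few lines. The sketch below is an illustration with made-up 2-dimensional token embeddings, not the internals of the model used above; `mean_pool` is a hypothetical helper. The attention mask marks which positions are real tokens (1) and which are padding (0), so that padding does not distort the average.

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average the embeddings of real tokens, ignoring padded positions."""
    mask = attention_mask[:, None].astype(float)      # shape (n_tokens, 1)
    summed = (token_embeddings * mask).sum(axis=0)    # sum over real tokens only
    count = max(float(mask.sum()), 1.0)               # avoid division by zero
    return summed / count

# Four token positions with toy 2-dimensional embeddings; the last one is padding
token_embeddings = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [9.0, 9.0],
])
attention_mask = np.array([1, 1, 1, 0])

sentence_embedding = mean_pool(token_embeddings, attention_mask)
print(sentence_embedding)
```

The more tokens go into the average, the less any single token contributes, which is the watering-down effect described above.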

In [8]:
#Example of a multi-sentence chunk of text
df.loc[263]["text"]

'Ja voorzitter dank en Ik ben blij komend uit een waterrijke provincie dat de staatssecretaris hard achteraan gaat. Maar Er zijn Natuurlijk al dingen die nu wel al bekend zijn. Daar hoef je geen analyse op los te Laten. Daar is een jeugdsportfonds. Er zijn gewoon andere regelingen. En ja, ik vraag me af, kan daar dan nog niet begonnen worden met te kijken van om die weer een beetje in het voetlicht te brengen, zodat je daar al vooruitloopt? Op een analyse van dingen die er al zijn? Ik denk altijd laaghangend fruit, pak dat eerst.'

In [None]:
def split_df(df: pd.DataFrame, splitter: Callable) -> pd.DataFrame:
  """Splits the text column of a DataFrame into smaller chunks using a provided splitter function.

  Args:
      df (pd.DataFrame): The input DataFrame containing 'speaker' and 'text' columns.
      splitter (Callable): A function that takes a string (text) as input and returns a list of strings (split text chunks).

  Returns:
      pd.DataFrame: A new DataFrame with only the 'speaker' and 'text' columns,
                    and potentially more rows than the input due to the splitting of text.
                    Each row holds one text chunk and the speaker of the row it came from.
  """
  data = []
  print(f"Length of old data : {len(df)}")
  for _, row in df.iterrows():
      splitted_text = splitter(row["text"])

      for text in splitted_text:
          data.append({"speaker" : row["speaker"], "text" : text})
  print(f"Length of new data : {len(data)}")
  return pd.DataFrame(data)

def simple_splitter(text: str) -> list[str]:
  """Splits text in chunks of max 20 words"""
  text_splitted = text.split()

  return [" ".join(text_splitted[i:i+20]) for i in range(0, len(text_splitted), 20)]


df_splitted = split_df(df, simple_splitter)
df_splitted.head()
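
The simple_splitter above cuts every 20 words, even in the middle of a sentence. As one possible alternative, the sketch below splits on sentence-ending punctuation with a regular expression and then groups whole sentences into chunks of at most `max_words` words. The function name and the regex are assumptions for illustration; a dedicated sentence tokenizer will handle abbreviations and transcription errors better.

```python
import re

def sentence_splitter(text: str, max_words: int = 20) -> list[str]:
    """Groups whole sentences into chunks of at most max_words words.

    A single sentence longer than max_words is kept whole rather than cut.
    """
    # Split after ., ! or ? when followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    chunks, current = [], []
    for sentence in sentences:
        words = sentence.split()
        # Start a new chunk if adding this sentence would exceed the limit
        if current and len(current) + len(words) > max_words:
            chunks.append(" ".join(current))
            current = []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks

example_chunks = sentence_splitter("A b c. D e f. G h i.", max_words=6)
print(example_chunks)
```

Because it has the same signature as simple_splitter (a string in, a list of strings out), it can be passed to split_df in the same way.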


### Plotting texts again

In [None]:
# With more refined texts we might get better results.
embeddings_splitted = model.encode(df_splitted["text"].tolist(), show_progress_bar=True)

# Fit the new t-SNE instance on the embeddings of the split texts
tsne_splitted = TSNE(n_components=2, random_state=42)
reduced_embeddings_splitted = tsne_splitted.fit_transform(embeddings_splitted)

plot_reduced_embeddings(reduced_embeddings_splitted, labels=list(df_splitted["speaker"]))

# Challenge

In this part you are going to get your hands dirty! At this point we do not know much about the content of the text we analysed. How can you use methods such as embedding, dimensionality reduction and clustering to make sense of large quantities of messy texts?

Below are some problems/open questions to work on:

- The simple_splitter does not pay attention to the beginning and end of sentences. Would a more sentence-aware splitter give better results? Take a look at the [NLTK sentence tokenizer](https://www.nltk.org/api/nltk.tokenize.sent_tokenize.html).
- Can we use the (reduced) embeddings to create clusters?
  - What topics does the text cover? The first person who identifies the three "kamervragen" (parliamentary questions) and shows this through clustering wins something...
  - Explore different clustering methods (such as k-means, DBSCAN and agglomerative clustering). A good start is [Scikit Learn Clustering](https://scikit-learn.org/stable/modules/clustering.html).
  - Explore different dimensionality reduction methods (such as PCA and UMAP) and settings.
  - How do we get a feel for the semantics of a cluster? Can we get a set of words per cluster that reflect the semantics? Take a look at [Scikit Learn Feature Extraction from Text](https://scikit-learn.org/1.5/api/sklearn.feature_extraction.html#module-sklearn.feature_extraction.text).
  - What happens if you use the speakers as clusters?

- How to score cluster results?
  - Qualitative vs quantitative?
  - Take a look at different metrics at [Sklearn clustering metrics](https://scikit-learn.org/1.5/api/sklearn.metrics.html#module-sklearn.metrics.cluster).
  
- How do different embedding models influence results?
  - How do you even choose an embedding model? Popularity vs generic benchmarks vs domain specific benchmarks? For available models that work with Sentence Transformers, see [Huggingface Dutch Sentence similarity models](https://huggingface.co/models?pipeline_tag=sentence-similarity&language=nl&sort=trending).
  - Can we train our own embedding model from scratch? Yes we can! This can sometimes lead to better results on very domain specific texts, even with very simple models. A great starting place is [Gensim doc2vec](https://radimrehurek.com/gensim/auto_examples/tutorials/run_doc2vec_lee.html#sphx-glr-auto-examples-tutorials-run-doc2vec-lee-py)
