[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/jeljov/NAP2025/blob/main/BBC_News_Clustering_SBERT.ipynb)

# Sentence Embeddings for Document (News) Clustering

This notebook provides an example of text clustering using:
* [SentenceTransformers](https://www.sbert.net/) python library,
* a pre-trained sentence embeddings model, available from the [HuggingFace](https://huggingface.co/models) repository, and
* a hierarchical clustering algorithm.

In addition, the notebook exemplifies the use of the [t-SNE](https://lvdmaaten.github.io/tsne/) dimensionality reduction technique to plot and visually examine documents after transforming them into vectors.

Finally, it shows how a keyword extraction library, in this case [KeyBERT](https://github.com/MaartenGr/KeyBERT), can be used to extract keywords for each cluster, so that we can better understand the documents in individual clusters.

The data used in the example originate from Kaggle's [BBC News](https://www.kaggle.com/datasets/gpreda/bbc-news) dataset.

## Install and import the required libraries

If running the notebook in Colab, uncomment the `pip` commands in the next two cells; if running it locally (e.g., in DataSpell), first install the packages `sentence-transformers`, `keybert`, and `pypalettes`.

In [None]:
# !pip -q install sentence-transformers keybert

In [None]:
# a collection of 2500+ palettes curated by experts; https://github.com/y-sunflower/pypalettes

# !pip -q install pypalettes

In [None]:
import pandas as pd
import numpy as np

# module for dimensionality reduction
from sklearn.manifold import TSNE

# modules for clustering
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score
import scipy.cluster.hierarchy as shc

# modules for plotting
import matplotlib.pyplot as plt
import seaborn as sb
from pypalettes import load_cmap

import warnings

In [None]:
warnings.filterwarnings('ignore', category=DeprecationWarning)

# for building a sentence embeddings model
from sentence_transformers import SentenceTransformer

In [None]:
# seed value for random processes
RAND_STATE = 1

# A pretrained sentence embeddings model to use
PRETRAINED_LM = "all-MiniLM-L12-v2"

For an overview of the pretrained sentence embedding models available through SentenceTransformers (and hosted on HuggingFace), see: [https://www.sbert.net/docs/pretrained_models.html](https://www.sbert.net/docs/pretrained_models.html)

## Load the required resources

Start by loading the data from the 'bbc_news.csv' file:

In [None]:
# from google.colab import files
#
# data_file = files.upload()
# file_name = list(data_file.keys())[0]

In [None]:
# in case of running the notebook locally (e.g., in DataSpell)

from pathlib import Path

file_name = Path.cwd() / 'data' / 'bbc_news.csv'

In [None]:
data = pd.read_csv(file_name)
data.head()

In [None]:
data.info()

Since the dataset is quite large and may slow down the data processing steps, we will use only a subset of the news in the analysis: items published in the same year.

So, we need to extract the year from the `pubDate` column and use it for data filtering:

In [None]:
try:
  data['pub_date'] = pd.to_datetime(data.pubDate, format="%a, %d %b %Y %H:%M:%S GMT", errors='raise')
except ValueError as err:
  print(err)

In [None]:
data['pub_year'] = data.pub_date.dt.year
data.pub_year.value_counts()
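To make the date format string easier to follow, here is a minimal sketch that parses two made-up timestamps. The sample strings are assumptions written to match the format used above; they are not taken from the dataset. The sketch also shows `errors='coerce'`, an alternative to `errors='raise'` that turns unparseable values into `NaT` instead of raising an exception.

```python
import pandas as pd

# Hypothetical timestamps in the same layout as the pubDate column
toy_dates = pd.Series([
    "Mon, 07 Mar 2022 08:01:56 GMT",
    "Fri, 05 Jan 2024 17:30:00 GMT",
])

date_format = "%a, %d %b %Y %H:%M:%S GMT"

# errors='raise' stops at the first value that does not match the format
parsed = pd.to_datetime(toy_dates, format=date_format, errors="raise")
years = parsed.dt.year
print(years.tolist())

# errors='coerce' replaces values that cannot be parsed with NaT
coerced = pd.to_datetime(pd.Series(["not a date"]), format=date_format, errors="coerce")
print(coerced.isna().tolist())
```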

Let's focus on the most recent news: those from 2024. Since this subset is still too large to process efficiently, we will take a random sample of N=5000 entries.

In [None]:
data2024 = data.loc[data.pub_year == 2024,].sample(n=5000, random_state=RAND_STATE).copy()

# drop columns that are no longer needed
data2024.drop(columns=['pub_year', 'pubDate'], inplace=True)

data2024.reset_index(drop=True, inplace=True)

We will merge the `title` and `description` fields to obtain the overall textual content of each news item, which we will then use for processing and clustering.

In [None]:
data2024['content'] = data2024.apply(lambda row: f"{row['title']}. {row['description']}", axis=1)

In [None]:
data2024.head(10)

Check the length of the newly created textual column:

In [None]:
data2024['content_len'] = data2024.content.apply(lambda c: len(c))
data2024.content_len.describe()

In [None]:
plt.figure(figsize=(9,5))
sb.kdeplot(data=data2024, x='content_len')
plt.title("Distribution of news content length")
plt.xlabel('Content length (in number of characters)')
plt.show()

This indicates that we have rather short texts, which are suitable for [SBERT's original pretrained models](https://www.sbert.net/docs/pretrained_models.html).

### Getting access to HuggingFace models

Next, we instantiate a sentence embedding model using a pre-trained embeddings model from HuggingFace.

Note: Public models, such as the one used here, can typically be downloaded without authentication. However, a HuggingFace access token avoids the authentication warnings shown in Colab and is required for gated or private models. To obtain one, you would first need to set up an account at [HuggingFace.co](https://huggingface.co/). After logging in, click on the profile in the top-right corner, then follow these steps: click *Settings* > click *Access Tokens* > click *New Token* > set *Role* to *read* (sufficient for downloading models) > *Generate*.

If running this notebook in Google colab, the generated token can be stored in **Colab Secrets**, which is a recommended way of securely storing access tokens and API keys. To learn how to do that and how then to access API tokens / keys stored as Secrets, see, for example, [this short article](https://labs.thinktecture.com/secrets-in-google-colab-the-new-way-to-protect-api-keys/).

If running the notebook locally, you may want to store the token in the `.env` file (`HF_TOKEN=token_value`) and load it as shown below:

In [None]:
# from dotenv import load_dotenv

# load_dotenv()

In [None]:
warnings.filterwarnings('ignore')

model = SentenceTransformer(PRETRAINED_LM)

According to its [model card](https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2), the pretrained model that we've loaded maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for tasks like clustering or semantic search. It was trained on a large and diverse dataset of over 1 billion training pairs.

The embeddings are normalised, meaning that each vector is scaled to unit length (its L2 norm equals 1). Individual vector elements can still be negative; they lie in the [-1, 1] range.
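As a quick illustration of what unit-length normalisation means, the following sketch normalises a few random vectors and checks their norms. The random vectors are a hypothetical stand-in for real embeddings, so that the example runs without downloading a model.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in for 5 embeddings with 384 dimensions each
raw_vectors = rng.normal(size=(5, 384))

# L2 normalisation: divide each row by its Euclidean length
unit_vectors = raw_vectors / np.linalg.norm(raw_vectors, axis=1, keepdims=True)

norms = np.linalg.norm(unit_vectors, axis=1)
print(np.allclose(norms, 1.0))        # every row now has length 1
print(bool(unit_vectors.min() < 0))   # elements can still be negative
```

Once the model is loaded, the same check could be applied to its output, for example with `np.linalg.norm(news_embeddings, axis=1)`.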

## Clustering of news items

To be able to cluster documents (news), we need a way of estimating how similar or close they are to one another. To that end, we will:
1) Transform each document (news item) into its vector representation (embedding)
2) Use cosine similarity to compute the similarity of news vectors

### Step 1. Create news embeddings

In [None]:
warnings.filterwarnings('ignore')

news_embeddings = model.encode(data2024['content'])

In [None]:
news_embeddings.shape

In [None]:
news_embeddings[:5,]

### Step 2. Compute similarity of news based on their embeddings

The similarity of documents expressed as vectors is typically estimated using the [cosine similarity measure](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html).

In [None]:
news_similarities = cosine_similarity(news_embeddings, news_embeddings)

In [None]:
news_similarities.shape
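For intuition, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on the angle between them. The helper below (`cosine_sim` is a name introduced here for illustration) is a minimal sketch of that formula on toy vectors, not a replacement for the scikit-learn function used above.

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine of the angle between vectors u and v."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

same_direction = cosine_sim([1, 1], [2, 2])   # same direction, different length
orthogonal = cosine_sim([1, 0], [0, 1])       # perpendicular vectors
opposite = cosine_sim([1, 0], [-1, 0])        # opposite directions

print(same_direction, orthogonal, opposite)
```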

Take one news item and explore those that are the most similar to it:

In [None]:
sample_news_item = data2024.content.sample(random_state=RAND_STATE)

sample_index = sample_news_item.index.tolist()[0]
sample = sample_news_item.iloc[0]

print(sample)
print(sample_index)

In [None]:
n_news = news_similarities.shape[0]
pairs_scores = [{'pair':i, 'score':news_similarities[sample_index,i]} for i in range(n_news) if i != sample_index]

sorted_pair_scores = sorted(pairs_scores, key=lambda pair: pair['score'], reverse=True)

print("Five most similar news:")
for pair in sorted_pair_scores[:5]:
  paired_news = data2024.content[pair['pair']]
  print(f"NEWS: {paired_news}\nSimilarity score: {pair['score']:.4f}\n")
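The same lookup can also be written with NumPy sorting instead of a list of dictionaries. The sketch below uses a small made-up similarity matrix; the function name `top_k_similar` and the toy values are assumptions for illustration only.

```python
import numpy as np

def top_k_similar(similarities, query_index, k=5):
    """Return (index, score) pairs of the k items most similar to the query item."""
    scores = similarities[query_index].astype(float).copy()
    scores[query_index] = -np.inf            # exclude the query item itself
    top = np.argsort(scores)[::-1][:k]       # indices sorted by descending score
    return [(int(i), float(scores[i])) for i in top]

# Hypothetical 4 x 4 similarity matrix
toy_sims = np.array([
    [1.0, 0.2, 0.9, 0.5],
    [0.2, 1.0, 0.1, 0.3],
    [0.9, 0.1, 1.0, 0.4],
    [0.5, 0.3, 0.4, 1.0],
])

top_two = top_k_similar(toy_sims, query_index=0, k=2)
print(top_two)
```

With the notebook's data, a call such as `top_k_similar(news_similarities, sample_index)` would be expected to return the same five items as the loop above.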

### Hierarchical agglomerative clustering of news items

We will now use agglomerative hierarchical clustering to group (cluster) news based on their (cosine) similarity.

Agglomerative clustering works in a “bottom-up” manner:
* each instance is initially considered as a single-element cluster (leaf);
* then, at each step of the algorithm, two clusters that are the most similar are combined into a new bigger cluster (node). This procedure is repeated until all items are members of just one single big cluster (root).

The similarity of any two clusters is computed from the similarity of the instances that form them. This computation can be done in different ways, known as *linkage* methods. Here, we will use one of the most frequently used linkage methods, Ward's method, which aims at minimizing the total within-cluster variance.

[This blog post](https://dataaspirant.com/hierarchical-clustering-algorithm/) provides a nice visual introduction to agglomerative hierarchical clustering.
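The bottom-up procedure described above can be traced on a tiny example. The sketch below clusters four made-up one-dimensional points with SciPy and prints each merge recorded in the linkage matrix; the points are assumptions chosen so that the two natural groups are obvious.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Four toy observations: two near 0 and two near 5
toy_points = np.array([[0.0], [1.0], [5.0], [6.0]])

# Each row of the linkage matrix describes one merge:
# (cluster a, cluster b, distance between them, size of the new cluster)
toy_linkage = linkage(toy_points, method="ward")
for left, right, dist, size in toy_linkage:
    print(f"merge {int(left)} + {int(right)} at distance {dist:.2f} -> size {int(size)}")

# Cut the tree so that at most two clusters remain
toy_labels = fcluster(toy_linkage, t=2, criterion="maxclust")
print(toy_labels)
```

With n observations there are always n - 1 merges, and the last one produces the root cluster that contains all items.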

Since we have already computed similarities of news items, with a minor additional transformation, these similarities can be used for hierarchical agglomerative clustering. In particular, since linkage methods work with distances, not similarities, we need to compute cosine distances, which is simple:<br>
`cosine_distance = 1 - cosine_similarity`
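As a side note, and assuming the embeddings are unit-length vectors, cosine distance is directly related to Euclidean distance, which is the metric that Ward's method is defined for. For unit vectors $u$ and $v$:

$$
\lVert u - v \rVert^2 = \lVert u \rVert^2 + \lVert v \rVert^2 - 2\,u \cdot v = 2\,(1 - \cos(u, v)) = 2 \cdot \text{cosine\_distance}(u, v)
$$

So, under that assumption, ranking pairs of news items by cosine distance gives the same order as ranking them by Euclidean distance.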

In [None]:
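news_dist = 1 - news_similarities
# Note: SciPy's linkage() treats a 2-D array as a matrix of observation vectors,
# so each row of news_dist (an item's distances to all other items) serves here as that item's feature vector.
# To cluster on the cosine distances themselves, linkage() would need the condensed form
# (see scipy.spatial.distance.squareform), which may produce a different dendrogram.
ward_linkage = shc.linkage(news_dist, method='ward')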

plt.figure(figsize=(10, 7))
shc.dendrogram(ward_linkage)
plt.title('Dendrogram from clustering of news items')
plt.show()
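SciPy's `linkage` accepts either a matrix of observation vectors or a *condensed* distance vector (the upper triangle of the square distance matrix, flattened). The sketch below, on made-up random vectors, shows how a square cosine distance matrix can be converted to the condensed form with `squareform`, and checks that this gives the same tree as letting `linkage` compute cosine distances itself. Average linkage is used here only for illustration, since it has no Euclidean requirement.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(1)
toy_vectors = rng.normal(size=(8, 4))   # hypothetical stand-in for embeddings

# Square (n x n) cosine distance matrix
toy_dist = squareform(pdist(toy_vectors, metric="cosine"))

# Condensed form: n * (n - 1) / 2 values from the upper triangle
toy_condensed = squareform(toy_dist, checks=False)
print(toy_dist.shape, toy_condensed.shape)

links_from_condensed = linkage(toy_condensed, method="average")
links_from_vectors = linkage(toy_vectors, method="average", metric="cosine")
print(np.allclose(links_from_condensed, links_from_vectors))
```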

The dendrogram suggests any number between 3 and 6 clusters as a potential solution, so we need additional indicators to make the decision.

### Use silhouette to compare alternative clustering options

There is a variety of measures used for estimating the quality of a clustering solution and comparing alternative solutions. A frequently used one is *silhouette score*, which indicates how similar an object is to its own cluster compared to other clusters. The silhouette score ranges from −1 to +1, with a high value indicating that an instance is well matched to its own cluster and poorly matched to neighboring clusters; negative values suggest the opposite.

To determine the optimal number of clusters, we will identify cluster assignments for different numbers of clusters (3 to 6) and, for each one, compute the average silhouette score across all observations; the solution with the highest score will be chosen.
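To make the measure concrete: for each item, let *a* be its mean distance to the other members of its own cluster and *b* its mean distance to the members of the nearest other cluster; its silhouette is then (b - a) / max(a, b). The function below is a minimal NumPy sketch of that definition on a precomputed distance matrix (the name `mean_silhouette` and the toy data are assumptions); in the notebook itself we rely on scikit-learn's `silhouette_score`.

```python
import numpy as np

def mean_silhouette(dist, labels):
    """Average silhouette over all items, given a square distance matrix."""
    dist = np.asarray(dist, dtype=float)
    labels = np.asarray(labels)
    scores = []
    for i in range(len(labels)):
        same = labels == labels[i]
        same[i] = False                      # exclude the item itself
        if not same.any():
            scores.append(0.0)               # convention for single-item clusters
            continue
        a = dist[i, same].mean()             # mean distance within own cluster
        b = min(dist[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Four toy points on a line and their pairwise distances
toy_points = np.array([0.0, 1.0, 10.0, 11.0])
toy_dist = np.abs(toy_points[:, None] - toy_points[None, :])

good_split = mean_silhouette(toy_dist, [0, 0, 1, 1])   # groups match the gaps
bad_split = mean_silhouette(toy_dist, [0, 1, 0, 1])    # groups ignore the gaps
print(round(good_split, 3), round(bad_split, 3))
```

The natural grouping scores close to +1, while the grouping that mixes near and far points gets a negative score.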

In [None]:
silhouette_scores = []

# Ensure no distance value is less than 0.0;
# added to fix an issue with very tiny negative values that prevented further use of the distance matrix
# (computed once, outside the loop, since it does not depend on k)
news_dist_clipped = np.clip(news_dist, a_min=0.0, a_max=None)

for k in range(3, 7):
  # fcluster identifies a specific clustering from the linkage matrix and the given number of clusters
  agg_clust = shc.fcluster(ward_linkage, t=k, criterion='maxclust')
  sil_score = silhouette_score(news_dist_clipped, agg_clust, metric='precomputed')
  silhouette_scores.append(sil_score)

In [None]:
# Plotting a bar graph to compare the results
plt.bar(range(3, 7), silhouette_scores)
plt.xlabel('Number of clusters', fontsize = 15)
plt.ylabel('Mean silhouette score', fontsize = 15)
plt.show()

Silhouette scores suggest 6 clusters as the best solution. Let's examine how large and balanced the clusters are:

In [None]:
agg_6_clust = shc.fcluster(ward_linkage, t=6, criterion='maxclust')
pd.Series(agg_6_clust).value_counts()

This is not perfect - there is one very small cluster, but from the dendrogram, it would be present regardless of the (meaningful) number of clusters we choose.

### Optional: Use t-SNE to reduce the dimensionality of news representation and visualise cluster assignments

We will use a dimensionality reduction method to reduce the dimensionality of the news representation, so that we can visually explore distinct clustering solutions.

t-SNE is a dimensionality reduction technique that is often used for exploration of high-dimensional data. The main advantage of t-SNE is its ability to preserve local structure in the data: points that are close to one another in a high-dimensional dataset will still be close to one another in the dimensionally reduced dataset. If the reduced dataset is 2-dimensional and plotted, such points will appear close to one another in the plot.

We will use t-SNE here to reduce the *news_embeddings* data from 384 dimensions to 2 dimensions and plot news items across distinct cluster assignments.

In [None]:
tsne = TSNE(n_components=2,
            perplexity=20, # key argument for fine-tuning, see below
            max_iter=1500,
            metric='cosine',
            verbose=1,
            random_state=RAND_STATE)

z = tsne.fit_transform(news_embeddings)

Note: *perplexity* is one of the key parameters for fine-tuning the dimensionality reduction done by t-SNE, since it determines whether we care more about the local structure of the data or the 'big picture': a low perplexity means we care about the local scale and focus on the closest points, whereas a high perplexity takes more of a 'big picture' approach. Recommended values are in the 5-50 range, and the default is 30. We use a value of 20 here; you may want to experiment with higher / lower values to explore how the visualisations and patterns in the data change.
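One way to experiment with perplexity is to loop over a few values and compare the resulting layouts. The following is a minimal sketch on made-up data (three synthetic groups of points, used as a small stand-in for the news embeddings); the values 5, 20 and 40 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(1)

# Hypothetical data: 3 groups of 30 points in 10 dimensions
centers = rng.normal(scale=5.0, size=(3, 10))
toy_data = np.vstack([c + rng.normal(size=(30, 10)) for c in centers])

layouts = {}
for perplexity in (5, 20, 40):
    # perplexity must be smaller than the number of samples (90 here)
    tsne_toy = TSNE(n_components=2, perplexity=perplexity, random_state=1)
    layouts[perplexity] = tsne_toy.fit_transform(toy_data)
    # kl_divergence_ is the final value of the objective that t-SNE minimises
    print(perplexity, layouts[perplexity].shape, round(float(tsne_toy.kl_divergence_), 3))
```

Each entry of `layouts` could then be plotted with `sb.scatterplot` in the same way as the news data, to see how the layout changes with perplexity.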

In [None]:
z.shape

Now that we have the news content as 2-dimensional vectors, we can add cluster assignments and plot them.

In [None]:
reduced_embeddings_df = pd.DataFrame()
reduced_embeddings_df["comp-1"] = z[:,0]
reduced_embeddings_df["comp-2"] = z[:,1]
reduced_embeddings_df['clust'] = agg_6_clust

In [None]:
plt.figure(figsize=(10,10))
sb.scatterplot(x='comp-1', y='comp-2', data=reduced_embeddings_df, hue='clust', palette='Dark2')
plt.title("News data presented with t-SNE")
plt.show()

If interested in learning how to fine tune t-SNE to get an optimal 2D representation of data, [this blog post](https://danielmuellerkomorowska.com/2021/01/05/introduction-to-t-sne-in-python-with-scikit-learn/) might be a good point to start from.

We will try to better understand the clusters by associating them with keywords.

### Characterise clusters by their keywords

To get an idea what each cluster is about, we will extract keywords from each cluster and use them to characterise the clusters.

In particular, we will use [KeyBERT](https://github.com/MaartenGr/KeyBERT) for keyword extraction. For an overview of other options, see, for example, [this article](https://www.analyticsvidhya.com/blog/2022/01/four-of-the-easiest-and-most-effective-methods-of-keyword-extraction-from-a-single-text-using-python/).

In brief, KeyBERT works as follows (see it illustrated on [this page](https://maartengr.github.io/KeyBERT/guides/quickstart.html)):
* First, documents are split into tokens and the tokens are filtered to keep those that are solid candidates for keywords; the filtering typically consists of excluding stop-words and keeping words with some minimal TF and/or DF values.
* Then, document embeddings are created using a pretrained sentence embeddings model (typically, a BERT-based model from the HuggingFace repo), to get document-level vectors; likewise, the same embedding model is used to create embeddings of the candidate keywords.
* Finally, cosine similarity is used to find the words/phrases that are the most similar to each document. The most similar words/phrases are identified as those that best describe the entire document.



In [None]:
from keybert import KeyBERT

kw_model = KeyBERT(model=PRETRAINED_LM)

In [None]:
# combine the news content with the cluster assignments (both share the same 0..n-1 index)
df = pd.concat([data2024['content'], reduced_embeddings_df['clust']], axis=1)
df.head()

Merge the documents within each cluster and then determine keywords for each cluster:

In [None]:
n_clust = df.clust.nunique()
clusters_text = []
for i in range(1, n_clust+1):
  clust_txt = " ".join(df.loc[df.clust == i,'content'].tolist())
  clusters_text.append(clust_txt)

clusters_keywords = []
for i in range(n_clust):
  clust_keywords = kw_model.extract_keywords(clusters_text[i],
                                             keyphrase_ngram_range=(1, 2),
                                             stop_words='english',
                                             top_n=10,
                                             use_mmr=True, diversity=0.5) # use Maximal Marginal Relevance (MMR) to diversify the results
  for clust_kw in clust_keywords:
    kw, score = clust_kw
    clusters_keywords.append({'cluster':i+1, 'keyword':kw, 'score':score})
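To illustrate what the `use_mmr` and `diversity` arguments do, here is a minimal NumPy sketch of Maximal Marginal Relevance selection. It is a simplified illustration, not KeyBERT's actual code: the function name `mmr_select`, the candidate words, and the 2-dimensional toy vectors are all assumptions. At each step, MMR picks the candidate that is similar to the document but not too similar to the keywords already selected.

```python
import numpy as np

def mmr_select(doc_vec, cand_vecs, cand_names, top_n=2, diversity=0.5):
    """Select top_n candidates, trading off relevance against redundancy."""
    def unit(x):
        x = np.asarray(x, dtype=float)
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    doc, cands = unit(doc_vec), unit(cand_vecs)
    doc_sim = cands @ doc          # relevance of each candidate to the document
    cand_sim = cands @ cands.T     # similarity between candidates

    selected = [int(np.argmax(doc_sim))]   # start with the most relevant candidate
    remaining = [i for i in range(len(cand_names)) if i != selected[0]]
    while len(selected) < min(top_n, len(cand_names)):
        # redundancy: highest similarity to any already selected keyword
        redundancy = cand_sim[np.ix_(remaining, selected)].max(axis=1)
        mmr = (1 - diversity) * doc_sim[remaining] - diversity * redundancy
        best = remaining[int(np.argmax(mmr))]
        selected.append(best)
        remaining.remove(best)
    return [cand_names[i] for i in selected]

# Toy vectors: "election" and "elections" are near-duplicates
toy_doc = [1.0, 0.0]
toy_cands = [[1.0, 0.10], [1.0, 0.12], [0.6, 0.8]]
toy_names = ["election", "elections", "economy"]

no_diversity = mmr_select(toy_doc, toy_cands, toy_names, top_n=2, diversity=0.0)
high_diversity = mmr_select(toy_doc, toy_cands, toy_names, top_n=2, diversity=0.7)
print(no_diversity, high_diversity)
```

With `diversity=0.0` the two near-duplicates are returned, while a higher diversity value swaps the second one for a less redundant keyword.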

In [None]:
cl_keywords_df = pd.DataFrame(clusters_keywords)
cl_keywords_df.head()

In [None]:
cl_keywords_df.sort_values(by=['cluster','score'], inplace=True)
# cl_keywords_df.head(10)

Present the keywords and their scores visually:

In [None]:
fig, ax_grid = plt.subplots(nrows=2, ncols=3, figsize=(16, 10), constrained_layout=True)
axes = ax_grid.flatten()

cmap = load_cmap("excel_Median")

for i in range(n_clust):

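    # keywords of the current cluster (a separate name avoids overwriting the earlier `df`)
    clust_kw_df = cl_keywords_df.loc[cl_keywords_df.cluster == (i+1),]
    scores = clust_kw_df.score.tolist()
    keywords = clust_kw_df.keyword.tolist()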

    axes[i].barh(keywords, scores, color=cmap(i))
    axes[i].set_title(f"Cluster {i+1}", color=cmap(i))
    axes[i].grid(visible=True, axis='y', color='gray', alpha=0.35)

fig.suptitle("Keywords across the news clusters")

plt.show()
