# Text Clustering with Sentence-BERT

In [None]:
!pip3 install sentence-transformers

In [None]:
!pip install datasets

In [None]:
import pandas as pd, numpy as np
import torch, os
from datasets import load_dataset

In [None]:
dataset = load_dataset("amazon_polarity", split="train")

In [None]:
dataset

In [None]:
corpus = dataset.shuffle(seed=42)[:10000]["content"]

In [None]:
pd.Series([len(e.split()) for e in corpus]).hist()

## Model Selection
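According to the Sentence-Transformers pretrained models page (https://www.sbert.net/docs/pretrained_models.html), the best-performing models for semantic textual similarity (STS) at the time of writing are: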

* stsb-mpnet-base-v2
* stsb-roberta-base-v2
* stsb-distilroberta-base-v2
* nli-mpnet-base-v2
* nli-roberta-base-v2
* nli-distilroberta-base-v2

Paraphrase identification models:
* paraphrase-distilroberta-base-v1 - Trained on large-scale paraphrase data.
* paraphrase-xlm-r-multilingual-v1 - Multilingual version of paraphrase-distilroberta-base-v1, trained on parallel data for 50+ languages. (Teacher: paraphrase-distilroberta-base-v1, Student: xlm-r-base)
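
Whichever model is chosen, its sentence embeddings are usually compared with cosine similarity. The following is a minimal sketch of that measure using hand-made toy vectors in place of real embeddings; the helper name `cosine_similarity` and the vector values are illustrative assumptions, not part of the library.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hand-made 3-d vectors standing in for sentence embeddings (illustrative only).
emb_a = np.array([1.0, 2.0, 0.0])
emb_b = np.array([2.0, 4.0, 0.0])  # same direction as emb_a
emb_c = np.array([0.0, 0.0, 3.0])  # orthogonal to emb_a

sim_ab = cosine_similarity(emb_a, emb_b)  # close to 1: same direction
sim_ac = cosine_similarity(emb_a, emb_c)  # 0: no shared direction
print(sim_ab, sim_ac)
```

With real models, the same comparison would be applied to rows of the array returned by `model.encode(...)`.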

In [None]:
from sentence_transformers import SentenceTransformer

model_path = "paraphrase-distilroberta-base-v1"
# paraphrase-distilroberta-base-v1 - Trained on large scale paraphrase data.
model = SentenceTransformer(model_path)

In [None]:
corpus_embeddings = model.encode(corpus)
corpus_embeddings.shape

In [None]:
from sklearn.cluster import KMeans

K = 5  # number of clusters
kmeans = KMeans(n_clusters=K, random_state=0).fit(corpus_embeddings)

In [None]:
import pandas as pd

cls_dist = pd.Series(kmeans.labels_).value_counts()
cls_dist

In [None]:
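from scipy.spatial.distance import cdist

# Euclidean distance from every centroid to every document:
# shape (n_clusters, n_documents).
distances = cdist(kmeans.cluster_centers_, corpus_embeddings)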

In [None]:
centers = {}  # cluster id -> index of the document closest to its centroid
print("Cluster", "Size", "Center-idx", "Center-Example", sep="\t\t")
for i, d in enumerate(distances):
    ind = int(np.argmin(d))  # nearest document to centroid i
    centers[i] = ind
    print(i, cls_dist[i], ind, corpus[ind], sep="\t\t")
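
The loop above picks, for each centroid, the document whose embedding lies closest to it. The same idea can be sketched on toy 2-D points with plain NumPy; the points and centroids below are made-up values chosen only so the result is easy to check by hand, and Euclidean distance is assumed.

```python
import numpy as np

# Toy 2-D "embeddings": two loose groups of points (made-up values).
points = np.array([
    [0.0, 0.0], [0.2, 0.1], [1.0, 1.0],  # group near the origin
    [5.0, 5.0], [5.5, 4.5], [4.0, 4.0],  # group near (5, 5)
])
centroids = np.array([[0.4, 0.4], [4.8, 4.5]])  # hypothetical cluster centers

# Pairwise Euclidean distances via broadcasting: shape (n_centroids, n_points).
dists = np.linalg.norm(centroids[:, None, :] - points[None, :, :], axis=2)

# For each centroid, the index of its closest point.
nearest = dists.argmin(axis=1)
print(nearest)
```

Here the first centroid selects point 1 and the second selects point 3, which then serve as representative examples of their clusters.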

## Visualization of the cluster points

In [None]:
!pip install umap-learn

In [None]:
import matplotlib.pyplot as plt
import umap

X = umap.UMAP(n_components=2, min_dist=0.0).fit_transform(corpus_embeddings)
labels = kmeans.labels_

fig, ax = plt.subplots(figsize=(12, 8))
plt.scatter(X[:, 0], X[:, 1], c=labels, s=1, cmap="Paired")
for c in centers:
    plt.text(X[centers[c], 0], X[centers[c], 1], "CLS-" + str(c), fontsize=24)
plt.colorbar()

## Topic Modeling with BERT

Note from the BERTopic documentation: BERTopic is stochastic, which means that the topics might differ across runs. This is mostly due to the stochastic nature of UMAP.

In [None]:
!pip install bertopic

Official Note: Restart the Notebook
Installing BERTopic updates some packages that were already loaded. To use the updated versions correctly, restart the notebook now. A restart clears all variables, so re-run the cells that load the dataset and build `corpus` before continuing.

From the Menu:

Runtime → Restart Runtime

In [None]:
len(corpus)

In [None]:
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

sentence_model = SentenceTransformer("paraphrase-distilroberta-base-v1")
topic_model = BERTopic(embedding_model=sentence_model)
topics, _ = topic_model.fit_transform(corpus)
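
Because UMAP is stochastic, the topics from the fit above can change between runs. One way to make them repeatable is to hand BERTopic a UMAP instance with a fixed `random_state`. The sketch below assumes the `umap_model` argument of `BERTopic` and UMAP settings that are meant to mirror BERTopic's defaults; the helper name `make_seeded_topic_model` is hypothetical, and the defaults should be checked against the installed version.

```python
def make_seeded_topic_model(embedding_model, seed: int = 42):
    """Build a BERTopic model whose UMAP step uses a fixed random seed."""
    # Imported inside the function so that defining it does not
    # require bertopic or umap-learn to be loaded yet.
    from bertopic import BERTopic
    from umap import UMAP

    # Settings assumed to match BERTopic's default UMAP configuration;
    # random_state is the only intended change.
    umap_model = UMAP(
        n_neighbors=15,
        n_components=5,
        min_dist=0.0,
        metric="cosine",
        random_state=seed,
    )
    return BERTopic(embedding_model=embedding_model, umap_model=umap_model)
```

It would be used as `topic_model = make_seeded_topic_model(sentence_model)` followed by the same `fit_transform(corpus)` call. Fixing the seed may slow UMAP down, since it limits parallelism.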

In [None]:
topic_model.get_topic_info()[:6]

In [None]:
topic_model.get_topic(2)