<a href="https://colab.research.google.com/github/lorenzobalzani/sensitive-text-clustering/blob/master/sensitive_text_clustering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sensitive Text Clustering
Author: Lorenzo Balzani

`balzanilo[at]icloud[dot]com`

This code was originally intended to be the outcome of a challenge for an NLP-focused internship.

## Problem statement
Given a dataset where each sentence (by human judgment) provides some sensitive personal information about a person, write an ML algorithm that clusters data into different types of sensitive personal information and automatically assigns a label to each cluster.

Use Jupyter notebook and Python 3.X.

Hints:
* focus on SOTA approaches, try to avoid ancient technology if there is a better one.
* write clean code, it will be evaluated.
* write a conclusion to your findings.

## Notes
### GPU usage
I chose Google Colab because it offers a free GPU, so the rest of the notebook assumes that one is available. Without a GPU, some packages will either not work (cuML) or run more slowly (BERTopic). In that case, use the *standard* CPU package versions (e.g., `umap-learn` and `hdbscan`) instead.
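
The fallback to the standard packages can be sketched as follows. This is only an illustration: the notebook itself decides based on `torch.cuda.is_available()`, whereas this sketch assumes that checking whether a GPU package is importable is a good enough proxy. The helper name `pick_backend` is hypothetical.

```python
import importlib.util


def pick_backend(gpu_package: str = "cuml") -> str:
    """Return "gpu" if `gpu_package` is importable, otherwise "cpu".

    Hypothetical helper, not part of the notebook.
    """
    found = importlib.util.find_spec(gpu_package) is not None
    return "gpu" if found else "cpu"


if pick_backend() == "gpu":
    print("cuML found: use cuml.manifold.UMAP and cuml.cluster.HDBSCAN")
else:
    print("cuML not found: use umap.UMAP and hdbscan.HDBSCAN")
```

An importable package does not guarantee a working GPU, so the `torch`-based check used later in the notebook remains the safer test.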

### Visualizations
The visualizations (both interactive and static) are not rendered properly in the saved notebook. Therefore, you might want to re-run the notebook to see them.

### Fine-tuned model
If you are not keen to re-train the model, download the `sensitive-text-clustering` model from the [HuggingFace Hub](https://huggingface.co/balzanilo/sensitive-text-clustering).

# Setup

In [3]:
%%capture
%pip install sentence_transformers umap-learn hdbscan bertopic keybert nltk torch

Let's install the packages needed to run cuML. [1]

[1] https://maartengr.github.io/BERTopic/getting_started/clustering/clustering.html#cuml-hdbscan

In [None]:
%pip install -q cudf-cu11 dask-cudf-cu11 --extra-index-url=https://pypi.ngc.nvidia.com
%pip install -q cuml-cu11 --extra-index-url=https://pypi.ngc.nvidia.com
%pip install -q cugraph-cu11 --extra-index-url=https://pypi.ngc.nvidia.com
%pip uninstall cupy-cuda115 -y
%pip uninstall cupy-cuda11x -y
%pip install cupy-cuda11x -f https://pip.cupy.dev/aarch64

Found existing installation: cupy-cuda11x 11.4.0
Uninstalling cupy-cuda11x-11.4.0:
  Successfully uninstalled cupy-cuda11x-11.4.0
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in links: https://pip.cupy.dev/aarch64
Collecting cupy-cuda11x
  Using cached cupy_cuda11x-11.4.0-cp38-cp38-manylinux1_x86_64.whl (93.7 MB)
Installing collected packages: cupy-cuda11x
Successfully installed cupy-cuda11x-11.4.0


## Imports

In [4]:
# General purpose
import json, re, gdown
from tqdm import tqdm

# ML
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"

if device == "cpu":
  from umap import UMAP
  from hdbscan import HDBSCAN
else:
  # Cu(da)ML
  from cuml.manifold import UMAP
  from cuml.cluster import HDBSCAN

# NLP
from bertopic import BERTopic
from keybert import KeyBERT
from sentence_transformers import SentenceTransformer
from nltk import download
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

## Utilities

In [None]:
download("stopwords")
download("punkt")
random_state: int = 42
model_name_hf: str = "balzanilo/sensitive-text-clustering"
words_per_label: int = 2

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


# Data importing
In NLP, text is usually pre-processed (e.g., by removing stopwords or lemmatizing). Nevertheless, these techniques are less useful, if not harmful, when used with transformer-based approaches, which were designed to leverage every part of the text to get the context. Pre-processing remains worthwhile in particular cases, e.g., removing HTML tags [1].

Furthermore, I removed everything up to and including the last `|` (i.e., vertical bar) in each sentence, because that prefix likely does not contain any valuable information.

[1] https://maartengr.github.io/BERTopic/faq.html#should-i-preprocess-the-data

In [None]:
with open("personal_data.json") as json_file:
    sentences = json.load(json_file)

sentences = [re.sub(r'^.*\|', '', sentence) for sentence in sentences]
#sentences = [' '.join([token for token in word_tokenize(sentence) if token.isalpha() and not token.lower() in set(stopwords.words('english'))]) for sentence in tqdm(sentences)]
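
To make the effect of the regular expression concrete, here is a small sketch on made-up sentences (the real ones come from `personal_data.json`; the sample strings below are assumptions). Because `.*` is greedy, the pattern removes everything up to and including the last vertical bar, and sentences without a bar are left untouched.

```python
import re

# Hypothetical sample sentences, for illustration only.
raw_sentences = [
    "id-001|She was diagnosed with asthma.",
    "a|b|He votes for the Green Party.",
    "No separator in this sentence.",
]

# Greedy `^.*\|` matches from the start up to and including the LAST "|".
cleaned = [re.sub(r"^.*\|", "", sentence) for sentence in raw_sentences]

for sentence in cleaned:
    print(sentence)
# She was diagnosed with asthma.
# He votes for the Green Party.
# No separator in this sentence.
```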

# Topic modeling
There are many ways to tackle (short) text clustering. Since the problem involves automatically assigning a label to each cluster, topic modeling can be used as a proxy to achieve that goal. Indeed, **topic modeling** is commonly viewed as a pipeline in which (i) clusters are generated and (ii) a label is then inferred for each of them.
## BERTopic
The following approach is based on *BERTopic* [1], a recent technique based on a model released earlier this year. I chose it because (i) I was looking for a model that is usable out of the box and (ii) it builds on SOTA technologies, such as transformers, rather than on ancient ones like text-based approaches.
BERTopic's default embedding model for creating latent representations of the sentences is `all-MiniLM-L6-v2`, which maps them to a 384-dimensional space. For better results, I used `all-mpnet-base-v2` instead, as the official documentation suggests [2]; it creates vectors in a 768-dimensional space. I have yet to investigate whether the trade-off is worth it, i.e., whether the advantage of the more granular model justifies its cost. The architecture [3] is composed of six components:
1. Sentence embeddings
2. Dimensionality reduction
3. Sentence clustering
4. Bag-of-words
5. Topic representation
6. (Optional) Maximal Marginal Relevance
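
Steps 4 and 5 can be illustrated with a small, dependency-free sketch of class-based TF-IDF as described in the BERTopic paper [1]: each word in a topic is scored as `tf * log(1 + A / f)`, where `tf` is its count within the topic, `f` its count across all topics, and `A` the average number of words per topic. This is a simplified illustration on made-up data, not BERTopic's actual implementation (which, among other things, works on a `CountVectorizer` matrix); the function name `c_tf_idf` and the toy sentences are assumptions.

```python
import math
from collections import Counter
from typing import Dict, List


def c_tf_idf(docs_per_topic: Dict[str, List[str]]) -> Dict[str, Dict[str, float]]:
    """Score words per topic as tf * log(1 + A / f)."""
    # Bag-of-words: all documents of a topic are treated as one long document.
    counts = {
        topic: Counter(word for doc in docs for word in doc.lower().split())
        for topic, docs in docs_per_topic.items()
    }
    # f: frequency of each word across all topics.
    overall = Counter()
    for counter in counts.values():
        overall.update(counter)
    # A: average number of words per topic.
    avg_words = sum(overall.values()) / len(counts)
    return {
        topic: {
            word: tf * math.log(1 + avg_words / overall[word])
            for word, tf in counter.items()
        }
        for topic, counter in counts.items()
    }


# Hypothetical toy clusters.
toy = {
    "health": ["asthma and diabetes", "asthma treatment"],
    "politics": ["votes and elections", "votes left"],
}
scores = c_tf_idf(toy)
for topic, word_scores in scores.items():
    print(topic, max(word_scores, key=word_scores.get))
```

In this toy example the words that are frequent in one topic only ("asthma", "votes") rank first, while the word shared by both topics ("and") gets the lowest score in each.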

### Embeddings pre-computations
For performance purposes, I have pre-computed the sentence embeddings, so I do not need to calculate them each time I want to tune the hyper-parameters.

[1] https://arxiv.org/abs/2203.05794

[2] https://maartengr.github.io/BERTopic/index.html

[3] https://maartengr.github.io/BERTopic/algorithm/algorithm.html

In [None]:
%%time
embedding_model = SentenceTransformer("all-mpnet-base-v2", device = device)
embeddings = embedding_model.encode(sentences, show_progress_bar = True)

Batches:   0%|          | 0/157 [00:00<?, ?it/s]

CPU times: user 12.7 s, sys: 1.72 s, total: 14.5 s
Wall time: 17.4 s


## KeyBERT
From the BERTopic documentation [1]:
> Although BERTopic focuses on topic extraction methods that does not assume specific structures for the generated clusters, it is possible to do this on a more local level. More specifically, we can use KeyBERT to generate a number of keywords for each document and then build a vocabulary on top of that as the input for BERTopic. This way, we can select words that we know have meaning to a topic, without focusing on the centroid of that cluster. This also allows more frequent words to pop-up regardless of the structure and density of a cluster.
To do this, we first need to run KeyBERT on our data and create our vocabulary.

[1] https://maartengr.github.io/BERTopic/getting_started/tips_and_tricks/tips_and_tricks.html#keybert-bertopic

In [None]:
# Extract keywords
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(sentences)

# Create our vocabulary
vocabulary = [k[0] for keyword in keywords for k in keyword]
vocabulary = list(set(vocabulary))
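
The vocabulary construction can be illustrated without running KeyBERT. The sketch below assumes the output shape that `extract_keywords` returns for a list of documents, i.e., one list of `(keyword, score)` pairs per document; the keywords and scores are made up. Unlike the cell above, it sorts the result so that the order is reproducible.

```python
# Hypothetical KeyBERT output for two documents:
# one list of (keyword, score) pairs per document.
keywords = [
    [("asthma", 0.71), ("diagnosed", 0.48)],
    [("votes", 0.66), ("asthma", 0.12)],
]

# Flatten, drop the scores, de-duplicate, and sort for a stable order.
vocabulary = sorted({word for doc_keywords in keywords for word, _score in doc_keywords})
print(vocabulary)  # ['asthma', 'diagnosed', 'votes']
```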

## Text processing
As described above, we should not apply pre-processing techniques before the transformer computes the embeddings. In the following code, these techniques are applied **after** the embeddings have been computed, and only for the topic representation step(s).

## UMAP
The more neighbors UMAP takes into account, the larger the resulting clusters tend to be (i.e., micro-clusters are avoided). Thus, I increased `n_neighbors` to 250.

### Non-determinism
From the cuML UMAP documentation:
> Note: Unfortunately, achieving a high amount of parallelism during
the optimization stage often comes at the expense of determinism,
since many floating-point additions are being made in parallel
without deterministic ordering. This causes slightly different
results across training sessions, even when the same seed is used
for random number generation. Setting a random_state will enable
consistency of trained embeddings, allowing for reproducible results
to 3 digits of precision, but will do so at the expense of potentially
slower training and increased memory usage.

Therefore, different runs can lead to slightly different results. Empirically, I have found that topics converge easily on their top word, whereas the second word is more likely to change.


## HDBSCAN
HDBSCAN is a clustering algorithm that can find clusters of different shapes. Furthermore, it identifies outliers where appropriate (i.e., it does not force items into clusters they might not belong to).
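
To check how many sentences end up as outliers, one can count the noise label. A minimal sketch, assuming the clustering step returns one integer label per sentence and marks outliers with `-1` (the convention of the `hdbscan` package, mirrored by BERTopic's topic `-1`); the labels below are made up.

```python
from collections import Counter

# Hypothetical cluster labels, one per sentence; -1 is the noise label.
labels = [0, 0, 1, -1, 1, 0, -1, 2]

sizes = Counter(labels)
outlier_share = sizes[-1] / len(labels)
print(f"outliers: {sizes[-1]} of {len(labels)} ({outlier_share:.0%})")
# outliers: 2 of 8 (25%)
```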

## Topic diversity
Topic diversity is the proportion of unique words among the top-n words of all topics [1]. It falls in the range [0, 1], where values close to 0 indicate redundant topics and values close to 1 indicate varied topics.

[1] https://arxiv.org/abs/1907.04907
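
The metric is simple enough to compute by hand. Below is a minimal sketch, assuming each topic is given as a list of words ordered by decreasing relevance; the function name `topic_diversity` and the toy topics are illustrative and not part of BERTopic.

```python
from typing import List


def topic_diversity(topics: List[List[str]], top_n: int = 25) -> float:
    """Share of unique words among the top-n words of all topics."""
    top_words = [word for topic in topics for word in topic[:top_n]]
    return len(set(top_words)) / len(top_words)


# Hypothetical topics: "hospital" appears in both, so 5 of 6 words are unique.
toy_topics = [
    ["asthma", "diabetes", "hospital"],
    ["votes", "party", "hospital"],
]
print(round(topic_diversity(toy_topics, top_n=3), 3))  # 0.833
```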

In [None]:
umap_model = UMAP(n_neighbors = 250, n_components = 15, min_dist = 0.0, metric = "cosine", random_state = random_state)

## ⚠️ Warning ⚠️
If you want to reproduce my results faithfully (the `fit_transform()` method is non-deterministic), load the saved model. Otherwise, run the following cell, but keep in mind that the saved model will be overwritten.

In [None]:
%%time

# This could be written as a lambda, but then the model could no longer be pickled.
def text_processor(text):
    return re.sub(r"\d+", "", text.lower())

hdbscan_model = HDBSCAN(min_samples = 20, gen_min_span_tree = True)
topics_model = BERTopic(embedding_model = embedding_model,
                        nr_topics = 5,
                        calculate_probabilities = True,
                        n_gram_range = (1,2),
                        top_n_words = 15,
                        umap_model = umap_model,
                        hdbscan_model = hdbscan_model,
                        vectorizer_model = CountVectorizer(stop_words = "english",
                                                           preprocessor = text_processor,
                                                           vocabulary = vocabulary),
                        diversity = 0.5,
                        verbose = True)
topics, probs = topics_model.fit_transform(sentences, embeddings)

# Push to HuggingFace Hub
topics_model.push_to_hf_hub(
    repo_id=model_name_hf,
    save_ctfidf=True
)

## Model loading
Besides loading the model, I label each topic with the two words that most probably describe it.

In [None]:
topics_model = BERTopic.load(model_name_hf)
topic_labels = topics_model.generate_topic_labels(nr_words = words_per_label, separator = ", ")
topics_model.set_topic_labels(topic_labels)
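
The idea behind the labels can be sketched in plain Python. This only mirrors the concept on made-up data: the labels actually produced by `generate_topic_labels` may differ in format (for example, they can include the topic number as a prefix).

```python
# Hypothetical topic representations: topic id -> words by decreasing score.
topic_words = {
    0: ["votes", "party", "election"],
    1: ["asthma", "diabetes", "hospital"],
}
words_per_label = 2

# Keep the first `words_per_label` words of each topic and join them.
labels = {
    topic_id: ", ".join(words[:words_per_label])
    for topic_id, words in topic_words.items()
}
print(labels)  # {0: 'votes, party', 1: 'asthma, diabetes'}
```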

# Conclusions
After trying different combinations of hyperparameters, I settled on this final model. The most notable choice is the number of topics: I converged on five because setting `nr_topics = 'auto'` (i.e., automatic generation) led to much more noise, and five seemed to be a good trade-off. The topics are related to the following macro areas:
1. Politics
2. Crime
3. Religion
4. Education
5. Health

The following chart shows the five most probable words for each topic (y-axis) with their probabilities (x-axis).

In [None]:
topics_model.visualize_barchart(custom_labels = True)

## Visualizations

Next, I plot the topic distribution. Once again, the results can vary due to the non-determinism present in UMAP [1].

[1] https://maartengr.github.io/BERTopic/getting_started/visualization/visualization.html#visualize-topics

In [None]:
topics_model.visualize_topics()

Finally, I visualize the clusters along with their labels. Here, the `transform()` method reduces the embeddings from `768` to `15` dimensions. However, a non-deterministic dimensionality reduction has to be performed to obtain a 2D space. As can be seen, HDBSCAN does not prevent outliers from appearing here.

In [None]:
topics_model.visualize_documents(sentences, reduced_embeddings = umap_model.transform(embeddings), custom_labels = True)

# What's next?
What would I do if I had more time and resources? In increasing order of (time) complexity:

1. Try different experimental setups, such as looking for alternatives to HDBSCAN. A recent work [1] has shown that HDBSCAN:
  > Classifies a majority of the documents as outliers. This crucial, yet overseen problem excludes too many documents from further analysis. When we replace HDBSCAN with K-Means, we achieve similar performance, but without outliers.
2. Use t-SNE 2D projections rather than UMAP 2D projections, which might be better for a qualitative evaluation.
3. Once some topics are known to be present in the corpus, and are thus set up as custom labels, I can apply zero-shot models to further fine-tune the label classification. From the documentation:
>The great advantage of passing custom labels to BERTopic is that when more accurate zero-shot are released, we can simply use those on top of BERTopic to further fine-tune the labeling. For example, let's say you have a set of potential topic labels that you want to use instead of the ones generated by BERTopic. You could use the `bart-large-mnli` model to find which user-defined labels best represent the BERTopic-generated labels:
4. Semi-supervised topic modeling, in which I have a partially labeled dataset [2].
5. Fully supervised topic modeling, in which the dataset is fully labeled [3]. From the documentation:
>The topic model will be much more attuned to the categories that were defined previously. However, this does not mean that only topics for these categories will be found. BERTopic is likely to find more specific topics in those you have already defined. This allows you to discover previously unknown topics!
6. The main problem of BERTopic, as discussed in this GitHub discussion [4], is that it relies heavily on the selected clustering scheme, which is not optimized to select the best hyperparameters (e.g., the number of clusters). A possible solution is TopicTuner [5].
7. Try TopClus [6]. Its main difference from BERTopic, as described in the paper:
>A straightforward way is to first apply a dimensionality reduction technique to the original embedding space $H$ to obtain the aforementioned latent space $Z$, and subsequently apply clustering algorithms to $Z$ for obtaining the latent space clusters representing topics. However, such a naive approach cannot guarantee that the reduced-dimension embeddings will be naturally suited for clustering, given that no clustering promoting objective is incorporated in the dimensionality reduction step. Therefore, we propose to jointly learn the latent space projection and cluster in the latent space instead of conducting them one after another, so that the latent representation learning is guided by the clustering objective, and the cluster quality benefits from the well-separated structure of the latent space, achieving a mutually-enhanced effect. Such joint learning is realized by training a generative model that connects the latent topic structure with the original space representations.

[1] https://arxiv.org/abs/2212.08459

[2] https://maartengr.github.io/BERTopic/getting_started/supervised/supervised.html#semi-supervised-topic-modeling

[3] https://maartengr.github.io/BERTopic/getting_started/supervised/supervised.html#supervised-topic-modeling

[4] https://github.com/MaartenGr/BERTopic/discussions/788

[5] https://github.com/drob-xx/TopicTuner

[6] https://arxiv.org/pdf/2202.04582.pdf