# Topic Modelling via Clustering

This Python notebook translates Tweets into vectors using vaccineBERT, then clusters those vectors so that you can extract topics. Just follow the instructions below!

## Installing packages

Make sure you've installed the following packages:

* pip install pandas
* pip install numpy
* pip install umap-learn
* pip install hdbscan
* pip install sentence-transformers
* pip install matplotlib

## Import BERT Model

First, we import the vaccineBERT model. This takes a while, so there is no need to rerun this cell once it has finished.

In [None]:
from sentence_transformers import SentenceTransformer, models

# Load the vaccineBERT transformer; inputs longer than 128 tokens are truncated
word_embedding_model = models.Transformer('../model/vaccineBert_SA/', max_seq_length=128)

# Collapse the per-token vectors into one vector per Tweet using max pooling
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               pooling_mode_mean_tokens=False,
                               pooling_mode_cls_token=False,
                               pooling_mode_max_tokens=True)

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
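The pooling step above turns the per-token vectors produced by the transformer into a single vector per Tweet by taking the element-wise maximum over tokens. The sketch below illustrates that idea with NumPy on made-up numbers. The function name `max_pool_tokens` and the toy arrays are assumptions for illustration only; this is a conceptual sketch, not the library's internal code.

```python
import numpy as np

def max_pool_tokens(token_embeddings, attention_mask):
    """Element-wise max over the real (non-padding) tokens of one sentence.

    token_embeddings: array of shape (num_tokens, dim)
    attention_mask:   array of shape (num_tokens,), 1 for real tokens, 0 for padding
    """
    token_embeddings = np.asarray(token_embeddings, dtype=float)
    mask = np.asarray(attention_mask, dtype=bool)
    # Push padding rows to -inf so they can never win the max
    masked = np.where(mask[:, None], token_embeddings, -np.inf)
    return masked.max(axis=0)

# Toy example: 3 token slots (the last one is padding), 2 dimensions
tokens = [[0.1, 0.9],
          [0.5, 0.2],
          [9.0, 9.0]]  # padding row, should be ignored
mask = [1, 1, 0]
sentence_vector = max_pool_tokens(tokens, mask)
print(sentence_vector)
```

Each output dimension keeps the largest value seen across the real tokens, so the padding row never influences the result.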

## Get Text Data

Next, we load the set of Tweets that we'll be clustering. Change `"TWEETSETNAME.csv"` to the name of your CSV file, and change `"TWEETSCOLUMN"` to the name of the column in your dataset that contains the Tweets.

In [None]:
import pandas as pd
from TweetNormalizer import normalizeTweet

dataset = pd.read_csv("../TWEETSETNAME.csv") # Change TWEETSETNAME.csv to the name of your CSV, but keep the ../ before it
tweets_column = "TWEETSCOLUMN" # Change "TWEETSCOLUMN" to the name of the column in your dataset that has tweets
text_data = dataset[tweets_column].apply(normalizeTweet)

## Embedding Texts 

Now, we embed these Tweets into vectors that we can then cluster. Since this step takes a while to run, the cell below checks whether embeddings were saved by a previous run and reuses them if so. If you want to calculate new embeddings for a new dataset, delete the "embeddings" folder in this directory.

In [None]:
import numpy as np
import os

if os.path.isfile("embeddings/embeddings.npy"):
    # Reuse the embeddings saved by a previous run
    embeddings = np.load('embeddings/embeddings.npy')
else:
    # encode() expects a list of strings, so convert the pandas Series first
    embeddings = model.encode(text_data.tolist(), show_progress_bar=True)
    os.makedirs("embeddings", exist_ok=True)  # does not fail if the folder already exists
    np.save('embeddings/embeddings.npy', embeddings)
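The check-then-compute pattern in this cell can be wrapped in a small helper if you want to reuse it for other slow steps. The following is a minimal sketch; the helper name `load_or_compute` and its arguments are hypothetical and not part of the notebook.

```python
import os
import numpy as np

def load_or_compute(path, compute_fn):
    """Return the array cached at `path`, computing and saving it if absent.

    `path` should end in ".npy", because np.save appends that extension
    when it is missing and the cache check would then never find the file.
    """
    if os.path.isfile(path):
        return np.load(path)
    result = np.asarray(compute_fn())
    folder = os.path.dirname(path)
    if folder:
        # Create the cache folder without failing if it already exists
        os.makedirs(folder, exist_ok=True)
    np.save(path, result)
    return result
```

In this notebook it could be called as `load_or_compute("embeddings/embeddings.npy", lambda: model.encode(text_data.tolist(), show_progress_bar=True))`. Deleting the cached file still forces a fresh computation.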

## UMAP algorithm

From here, we use the UMAP algorithm to project our embeddings into a lower-dimensional space (20 dimensions in the cell below). The following resources explain the algorithm in more detail.

* [Understanding UMAP](https://pair-code.github.io/understanding-umap/)
* [UMAP Doc](https://umap-learn.readthedocs.io/en/latest/)

Also, feel free to play with the parameters!

In [None]:
import umap
umap_embeddings = umap.UMAP(n_neighbors=5,
                            n_components=20,
                            min_dist=0.1,
                            metric='cosine',
                            random_state=123).fit_transform(embeddings)

## HDBSCAN algorithm

Now, we use the HDBSCAN algorithm to cluster our Tweets. The following link provides some information about the algorithm.

* [HDBSCAN Doc](https://hdbscan.readthedocs.io/en/latest/parameter_selection.html)

Again, feel free to play with the parameters!

In [None]:
import hdbscan
cluster = hdbscan.HDBSCAN(min_cluster_size=10,
                          min_samples=2,
                          metric='euclidean',
                          cluster_selection_method='eom')

cluster = cluster.fit(umap_embeddings)

The following output will show the labels of the different clusters. -1 is the label of Tweets that were not assigned to a cluster by HDBSCAN. We also save the original dataset with labels for the clusters output by HDBSCAN as the file `clustered_data.csv`. With this CSV, you can examine which Tweets belong to which clusters and try to extract topics.

In [None]:
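import numpy as np

# Show the distinct cluster labels (-1 marks Tweets left unclustered)
print(np.unique(cluster.labels_))

# Attach the labels to the original dataset and save it
dataset["cluster"] = cluster.labels_
dataset.to_csv("clustered_data.csv", index=False)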
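Before reading through individual Tweets, it can help to see how large each cluster is and how many Tweets were left unassigned. The sketch below does this with NumPy on a toy label array; the function name `summarize_clusters` and the example labels are assumptions for illustration, and in the notebook you would pass `cluster.labels_` instead.

```python
import numpy as np

def summarize_clusters(labels):
    """Return ({label: size}, fraction of points labelled -1)."""
    labels = np.asarray(labels)
    values, counts = np.unique(labels, return_counts=True)
    sizes = {int(v): int(c) for v, c in zip(values, counts)}
    # -1 is the label for points not assigned to any cluster
    noise_fraction = sizes.get(-1, 0) / len(labels) if len(labels) else 0.0
    return sizes, noise_fraction

# Toy labels standing in for cluster.labels_
toy_labels = [-1, 0, 0, 1, 1, 1, -1, 0]
sizes, noise_fraction = summarize_clusters(toy_labels)
print(sizes)
print(noise_fraction)
```

A large share of unassigned Tweets is one signal that `min_cluster_size` or `min_samples` may be worth adjusting.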

## 2D Plot - Clustered from HDBSCAN

We now project our embeddings into two-dimensional space so that they can be plotted. We color the points based on the clusters assigned above; Tweets labelled -1 are left out of the plot.

In [None]:
import matplotlib.pyplot as plt

# Reduce dimension to 2 for plotting
umap_data = umap.UMAP(n_neighbors=10, n_components=2, min_dist=0.1,
                      random_state=123, metric='cosine').fit_transform(embeddings)

# Prepare data, dropping Tweets that HDBSCAN left unclustered (label -1)
clustered = cluster.labels_ != -1
result = pd.DataFrame(umap_data[clustered], columns=['x', 'y'])
result['labels'] = cluster.labels_[clustered]

# Visualize clusters
fig, ax = plt.subplots(figsize=(10, 10))
points = ax.scatter(result.x, result.y, c=result.labels, s=5, cmap='plasma')
fig.colorbar(points, ax=ax)
plt.show()