

# Introduction
<hr style = "border:2px solid black" ></hr>


**What?** Topic modelling with a pre-trained BERT model



# Imports
<hr style = "border:2px solid black" ></hr>

In [4]:
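from sklearn.datasets import fetch_20newsgroups
import numpy as np   # used later to wrap the encoded documents
import pandas as pd  # used later to build the plotting DataFrame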

In [1]:
from transformers import pipeline
from transformers import AutoTokenizer

In [2]:
import transformers
transformers.__version__

'4.10.0'

# Import dataset
<hr style = "border:2px solid black" ></hr>

- The 20 Newsgroups dataset contains roughly 18,000 newsgroup posts on 20 topics.

In [5]:
data = fetch_20newsgroups(subset='all')['data']

In [6]:
len(data)

18846

In [7]:
# Keep only the first 2,000 posts to speed up the experiment
data = data[:2000]
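
- With the default arguments, `fetch_20newsgroups` returns each post with its header block (`From:`, `Subject:`, and so on), which can dominate the signal for topic modelling. scikit-learn offers a `remove=('headers', 'footers', 'quotes')` argument for this. The sketch below is only a rough, hypothetical stand-in (`strip_header` is not part of any library) that assumes the header ends at the first blank line.

```python
def strip_header(post: str) -> str:
    """Drop everything up to the first blank line, assumed to end the header block."""
    header, separator, body = post.partition("\n\n")
    # If there is no blank line at all, treat the whole post as body text.
    return body.strip() if separator else post.strip()


# Toy post in the same shape as a newsgroup message (made-up sender address).
sample = (
    "From: someone@example.com\n"
    "Subject: Pens fans reactions\n"
    "Lines: 12\n"
    "\n\n\n"
    "PENS RULE!!!\n\n"
)
print(strip_header(sample))  # PENS RULE!!!
```

In the notebook this could be applied as `data = [strip_header(d) for d in data]`, although the built-in `remove` argument is likely the more robust choice.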

# Create embeddings with BERT
<hr style = "border:2px solid black" ></hr>

- The first step is to convert the documents to numerical data.
- We use BERT for this because it produces context-dependent embeddings: the same word gets a different vector depending on the words around it.
- Many pre-trained models are also available and ready to use.
- We'll use **DistilBERT**, as it offers a good balance between speed and performance.
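
A BERT-style model produces one vector per token, so a document embedding needs a pooling step. The sketch below shows mean pooling over the non-padding tokens on toy data. It is an assumption about how the pooling could be done, not the method used in this notebook; `mean_pool` and the toy vectors are hypothetical, and in practice the token vectors would come from the model's last hidden state.

```python
from typing import List, Sequence


def mean_pool(token_vectors: Sequence[Sequence[float]],
              attention_mask: Sequence[int]) -> List[float]:
    """Average the vectors of real tokens (mask == 1), ignoring padding (mask == 0)."""
    if len(token_vectors) != len(attention_mask):
        raise ValueError("one mask entry is required per token vector")
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy example: three 2-dimensional token vectors, the last one is padding.
toy_tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
toy_mask = [1, 1, 0]
doc_vector = mean_pool(toy_tokens, toy_mask)
print(doc_vector)  # [2.0, 3.0]
```

Every document then maps to a vector of the same length, which is what dimensionality reduction and clustering expect as input.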

In [15]:
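# Load the tokeniser for the DistilBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Create a Hugging Face transformers pipeline.
# Note: a "sentiment-analysis" pipeline returns class labels, not embeddings,
# and this `model` object is not used in the rest of the notebook.
model = pipeline("sentiment-analysis",
                 model="distilbert-base-uncased-finetuned-sst-2-english",
                 tokenizer=tokenizer)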

Some layers from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english and are newly initialized: ['dropout_59']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [35]:
# Encode a sample sentence directly into token IDs
# (note: this replaces the DistilBERT tokeniser with the DistilRoBERTa one)
tokenizer = AutoTokenizer.from_pretrained('distilroberta-base')
encoded_seq = tokenizer.encode("i am sentence")
print(encoded_seq)

[0, 118, 524, 3645, 2]
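
In the output above, the first and last IDs (0 and 2) are the RoBERTa start and end markers (`<s>` and `</s>`), and only the middle IDs correspond to the sentence itself. The sketch below separates the two. The `SPECIAL_IDS` set is hard-coded here as an assumption for RoBERTa-style vocabularies, and `strip_special` is a hypothetical helper; with a real tokeniser the IDs would more safely be read from `tokenizer.all_special_ids`.

```python
# Assumption: RoBERTa-style special token IDs (<s> = 0, <pad> = 1, </s> = 2).
SPECIAL_IDS = {0, 1, 2}


def strip_special(token_ids):
    """Return only the IDs that correspond to actual text tokens."""
    return [t for t in token_ids if t not in SPECIAL_IDS]


encoded_seq = [0, 118, 524, 3645, 2]  # output of tokenizer.encode("i am sentence")
content_ids = strip_special(encoded_seq)
print(content_ids)  # [118, 524, 3645]
```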


In [25]:
data[0]

"From: Mamatha Devineni Ratnam <mr47+@andrew.cmu.edu>\nSubject: Pens fans reactions\nOrganization: Post Office, Carnegie Mellon, Pittsburgh, PA\nLines: 12\nNNTP-Posting-Host: po4.andrew.cmu.edu\n\n\n\nI am sure some bashers of Pens fans are pretty confused about the lack\nof any kind of posts about the recent Pens massacre of the Devils. Actually,\nI am  bit puzzled too and a bit relieved. However, I am going to put an end\nto non-PIttsburghers' relief with a bit of praise for the Pens. Man, they\nare killing those Devils worse than I thought. Jagr just showed you why\nhe is much better than his regular season stats. He is also a lot\nfo fun to watch in the playoffs. Bowman should let JAgr have a lot of\nfun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final\nregular season game.          PENS RULE!!!\n\n"

In [27]:
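# Note: tokenizer.encode returns token IDs (integers), not embedding vectors,
# and it does not truncate by default, hence the length warning below.
embeddings = []
for temp in data:
    embeddings.append(tokenizer.encode(temp))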

Token indices sequence length is longer than the specified maximum sequence length for this model (1220 > 512). Running this sequence through the model will result in indexing errors
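
The warning above means that at least one post is longer than the 512 tokens the model accepts. One common way around this is to split long sequences into windows of at most 512 IDs. The sketch below is a minimal, hypothetical helper (`chunk_token_ids` is not a library function), and it ignores the special tokens a real model input would also need, which count towards the limit.

```python
def chunk_token_ids(token_ids, max_len=512, overlap=0):
    """Split a list of token IDs into windows of at most `max_len` IDs.

    Consecutive windows share `overlap` IDs so that context at the
    boundaries is not lost entirely.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    if not token_ids:
        return []
    step = max_len - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # the current window already reaches the end
        start += step
    return chunks


# A sequence of 1220 IDs, the length reported in the warning.
long_sequence = list(range(1220))
print([len(c) for c in chunk_token_ids(long_sequence)])  # [512, 512, 196]
```

A simpler alternative is to truncate each post to its first 512 tokens, at the cost of discarding the rest of the text.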


In [30]:
# The token-ID lists have different lengths, so build a ragged object array explicitly
a = np.array(embeddings, dtype=object)
a.shape

(2000,)

In [38]:
len(a[1])

326

# Clustering
<hr style = "border:2px solid black" ></hr>

- We want documents on similar topics to end up in the same cluster, so that we can extract the topics from those clusters.
- Before clustering, we first reduce the dimensionality of the embeddings, because many clustering algorithms handle high-dimensional data poorly.
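
One reason high dimensionality hurts clustering is that distances tend to concentrate: the nearest and the farthest neighbour end up almost equally far away. The sketch below illustrates this on uniformly random points, which is an assumption made purely for the demonstration (real embeddings are not uniform). `distance_contrast` is a hypothetical helper.

```python
import math
import random


def distance_contrast(n_points, dim, seed=0):
    """Return min/max of the Euclidean distances from the first point to all others.

    Values close to 1 mean that near and far neighbours are hard to tell apart.
    """
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    query = points[0]
    dists = [math.dist(query, p) for p in points[1:]]
    return min(dists) / max(dists)


low = distance_contrast(200, 2)
high = distance_contrast(200, 500)
print(f"dim=2: {low:.3f}  dim=500: {high:.3f}")
```

The ratio for 500 dimensions comes out much closer to 1 than the ratio for 2 dimensions, which is why density-based clustering benefits from a reduction step first.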

In [39]:
import umap
umap_embeddings = umap.UMAP(n_neighbors=15,
                            n_components=5,
                            metric='cosine').fit_transform(embeddings)

ImportError: Numba needs NumPy 1.21 or less
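
The error above says that the installed Numba (a dependency of `umap`) does not support the installed NumPy version. The sketch below checks a version string against that limit before importing `umap`. The `(1, 21)` bound is taken from the error message and applies only to that Numba release, and both helper names are hypothetical.

```python
def version_tuple(version: str):
    """Parse the leading 'major.minor' of a version string into a tuple of ints."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def numpy_ok_for_numba(numpy_version: str, max_supported=(1, 21)) -> bool:
    """Return True if the NumPy version is at or below the supported bound."""
    return version_tuple(numpy_version) <= max_supported


print(numpy_ok_for_numba("1.21.6"))  # True
print(numpy_ok_for_numba("1.22.0"))  # False
```

In the notebook the check would be called as `numpy_ok_for_numba(np.__version__)`. If it returns `False`, either NumPy needs to be downgraded or Numba upgraded.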

- Once the dimensionality has been reduced, we can cluster the results.

In [None]:
import hdbscan
cluster = hdbscan.HDBSCAN(min_cluster_size=15,
                          metric='euclidean',
                          cluster_selection_method='eom').fit(umap_embeddings)
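
HDBSCAN assigns the label `-1` to points it treats as noise and a non-negative integer to each cluster. The sketch below summarises such a label array. It uses a toy list instead of real output, and `summarize_labels` is a hypothetical helper.

```python
from collections import Counter


def summarize_labels(labels):
    """Count clusters and noise points in a list of HDBSCAN-style labels."""
    counts = Counter(labels)
    n_noise = counts.pop(-1, 0)  # -1 marks noise, not a cluster
    return {
        "n_clusters": len(counts),
        "n_noise": n_noise,
        "largest": counts.most_common(3),  # (label, size) pairs
    }


toy_labels = [0, 0, 1, -1, 1, 1, -1, 2]
summary = summarize_labels(toy_labels)
print(summary["n_clusters"], summary["n_noise"])  # 3 2
```

With a fitted model this would be called as `summarize_labels(cluster.labels_.tolist())`. A very large noise count usually suggests revisiting `min_cluster_size` or the UMAP settings.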

In [None]:
import matplotlib.pyplot as plt

# Prepare data
umap_data = umap.UMAP(n_neighbors=15, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)
result = pd.DataFrame(umap_data, columns=['x', 'y'])
result['labels'] = cluster.labels_

# Visualize clusters
fig, ax = plt.subplots(figsize=(20, 10))
outliers = result.loc[result.labels == -1, :]
clustered = result.loc[result.labels != -1, :]
plt.scatter(outliers.x, outliers.y, color='#BDBDBD', s=0.05)
plt.scatter(clustered.x, clustered.y, c=clustered.labels, s=0.05, cmap='hsv_r')
plt.colorbar()

# Reference
<hr style = "border:2px solid black" ></hr>



- https://towardsdatascience.com/topic-modeling-with-bert-779f7db187e6
- https://huggingface.co/models
- https://stackoverflow.com/questions/64685243/getting-sentence-embedding-from-huggingface-feature-extraction-pipeline
    
