# BERTopic Short Demo

If you are working on Colab:
- The following cell installs all the packages you will need.
- You may want to make use of the (free) GPU resources: click the down arrow in the upper right of the page, next to the RAM and Disk usage graphic, then choose "Change runtime type" and select "T4 GPU". This will dramatically speed up this code.
- Please be sure to save a copy of the file to your own account. (If you opened it from the link on our GitHub repo, your changes are not saved automatically.)

If you are working locally on your computer, please see the [README.md](https://github.com/nuitrcs/AI_Week_Topic_Model/blob/main/README.md) file on our GitHub repo for a command to create a conda environment that has the necessary packages.

In [None]:
try:
    import google.colab
    print("You are working in Google Colab.  We will install necessary packages...")
    !pip install scikit-learn sentence-transformers umap-learn hdbscan bertopic pandas matplotlib datashader bokeh holoviews scikit-image colorcet keybert
except ImportError:
    print("You are not working in Google Colab.")
    print("Please be sure that the necessary packages are installed and available, ideally within a conda env (e.g., see here: https://github.com/nuitrcs/AI_Week_Topic_Model/blob/main/README.md).")


You are not working in Google Colab.
Please be sure that the necessary packages are installed and available, ideally within a conda env (e.g., see here: https://github.com/nuitrcs/AI_Week_Topic_Model/blob/main/README.md).


### Read and Preprocess Data

In [None]:
from sklearn.datasets import fetch_20newsgroups
import pandas as pd

# this is a function from sklearn that fetches the 20 newsgroups text dataset
# it is a collection of approximately 20,000 newsgroup documents, partitioned across 20 different newsgroups
# this returns a bunch object, which is very similar to a dictionary
bunch = fetch_20newsgroups(
    categories=["comp.graphics", "rec.autos", "rec.motorcycles", 
                "rec.sport.baseball", "rec.sport.hockey", 
                "sci.electronics", "sci.med", "sci.space"], # only extract select topics
    remove=("headers","footers","quotes")) # don't extract unnecessary metadata

# get the text data and labels
docs = bunch["data"]
doc_labels = bunch["target"]

print("Documents: ")
print(docs[:5])

# create a data frame with the text and labels
df = pd.DataFrame({
    "text": docs,
    "labels": doc_labels
})

# create a label with text info
df["labels_text"] = df["labels"].astype("category").cat.rename_categories({i:j for i,j in enumerate(bunch["target_names"])})

print()
print("Data Frame: ")
print(df.head())

Before applying topic modeling to the text, we do some basic preprocessing: stripping leading and trailing whitespace (including newlines) and removing texts that are empty after stripping.
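To see what these two steps do in isolation, here is a minimal sketch on a made-up data frame. The `toy` data and the variable names `is_empty` and `toy_clean` are hypothetical and are not used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for the newsgroup posts
toy = pd.DataFrame({"text": ["  Hello world\n", "\n\n", "", "UMAP then HDBSCAN  "]})

# Strip leading/trailing whitespace (spaces, tabs, newlines)
toy["text_processed"] = toy["text"].str.strip()

# Flag rows that are empty after stripping, then keep the rest
is_empty = toy["text_processed"].str.len() == 0
toy_clean = toy[~is_empty].reset_index(drop=True)

print(f"Dropped {is_empty.sum()} of {len(toy)} rows")
print(toy_clean["text_processed"].tolist())
```

Note that a text containing only newlines counts as empty here, because the length check runs on the stripped column rather than on the raw text.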

In [None]:
# strip leading and trailing whitespace (including newlines)
df["text_processed"] = df["text"].str.strip()

# flag texts that are empty after stripping
empty_text_bool = df["text_processed"].str.len() == 0

print(f"Number of empty texts: {empty_text_bool.sum()}")

# remove empty text from df
df = df[~empty_text_bool]

print("Final Data Frame:")
print(f"Dimension: {df.shape[0]}, {df.shape[1]}")
df.head()

In [None]:
# store the texts into docs variable
docs = df["text_processed"].values.tolist()

In [None]:
print(docs[:5])

### Simplest case

BERTopic can be run out of the box without any tuning. However, the default settings do not guarantee a sensible number of topics or a useful representation for each topic.

In [None]:
from bertopic import BERTopic

topic_model = BERTopic() # initialize the model
topic_model.fit(docs) # fit the model to the data

topic_model.get_topic_info() # get the topic information

### Embeddings

This step uses a pretrained language model to convert each text into a numeric vector (an embedding).
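The point of the embeddings is that texts with similar meaning end up with vectors that point in similar directions. As a rough illustration, the sketch below computes pairwise cosine similarity on made-up 3-dimensional vectors. The values in `toy_embeddings` are hypothetical; real embeddings come from `embedding_model.encode` and have many more dimensions.

```python
import numpy as np

# Hypothetical 3-dimensional "embeddings" for three texts
toy_embeddings = np.array([
    [0.9, 0.1, 0.0],   # e.g., a post about engines
    [0.8, 0.2, 0.1],   # e.g., a post about car repair
    [0.0, 0.1, 0.9],   # e.g., a post about rocket launches
])

# Normalize each row to unit length so that a dot product equals the cosine similarity
unit = toy_embeddings / np.linalg.norm(toy_embeddings, axis=1, keepdims=True)
cosine_sim = unit @ unit.T

# Rows and columns follow the order of the texts; the diagonal is each text with itself
print(np.round(cosine_sim, 2))
```

In this toy case the first two vectors are much more similar to each other than either is to the third, which is the kind of structure the later clustering step relies on.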

In [None]:
from sentence_transformers import SentenceTransformer

# initialize model
embedding_model = SentenceTransformer("all-MiniLM-L6-v2") # all-MiniLM-L6-v2 is name of pretrained model
embeddings = embedding_model.encode(docs) # encode the texts into embeddings

In [None]:
print("Dimension of embeddings: ")
print(embeddings.shape)
print()
print(embeddings)

### Dimension Reduction

This step uses the [UMAP](https://umap-learn.readthedocs.io/en/latest/index.html) library to reduce the embeddings to 2 dimensions, which makes the data easier to cluster in the next step and lets us plot it directly.

In [16]:
from umap import UMAP
import umap.plot

# set random seed for reproducibility
seed = 54382
# initialize UMAP model
umap_model = UMAP(n_components=2, n_neighbors = 15, metric="cosine", random_state=seed)
# fit the UMAP model to find the best 2D representation of the embeddings
umap_model.fit(embeddings)

| Parameter | Value |
| --- | --- |
| n_neighbors | 15 |
| n_components | 2 |
| metric | 'cosine' |
| metric_kwds | |
| output_metric | 'euclidean' |
| output_metric_kwds | |
| n_epochs | |
| learning_rate | 1.0 |
| init | 'spectral' |
| min_dist | 0.1 |


In [None]:
print("Dimension of UMAP output: ")
print(umap_model.embedding_.shape)

In [None]:
# Plot the UMAP representation
umap.plot.points(umap_model)

### Unsupervised Clustering

Here we use the [HDBSCAN](https://hdbscan.readthedocs.io/en/latest/#) library to identify clusters in the data.
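After fitting, HDBSCAN exposes one integer label per document in `labels_`, and it uses the label `-1` for points it treats as noise rather than as members of any cluster. The sketch below shows one way to summarize such labels. The `toy_labels` array is hypothetical and only mimics the format of `hdbscan_model.labels_`.

```python
import numpy as np

# Hypothetical labels in the format of hdbscan_model.labels_ (one integer per document)
toy_labels = np.array([0, 0, 1, -1, 1, 1, 2, -1, 0, 2])

# HDBSCAN assigns the label -1 to points it treats as noise (not in any cluster)
cluster_ids, counts = np.unique(toy_labels[toy_labels != -1], return_counts=True)
n_noise = int((toy_labels == -1).sum())

for cid, n in zip(cluster_ids, counts):
    print(f"cluster {cid}: {n} documents")
print(f"noise: {n_noise} documents")
```

Checking the share of noise points is a quick sanity check when adjusting parameters such as `min_cluster_size` or `min_samples`.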

In [None]:
from hdbscan import HDBSCAN
import matplotlib.pyplot as plt

# initialize HDBSCAN model
hdbscan_model = HDBSCAN(min_cluster_size=15, min_samples=1, cluster_selection_epsilon=0.165)

# identify clusters on the 2-d representation of embeddings generated by UMAP
hdbscan_model.fit(umap_model.embedding_)
umap.plot.points(umap_model, labels=hdbscan_model.labels_, theme="blue")

### Labeling

Here we label each cluster using another language model with [KeyBERT](https://maartengr.github.io/KeyBERT/api/keybert.html). Note that this is similar, though not identical, to what BERTopic uses (e.g., see the BERTopic documentation [here](https://maartengr.github.io/BERTopic/getting_started/representation/representation.html#keybertinspired)).
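KeyBERT's `extract_keywords` returns a list of `(keyword, score)` tuples for a single document. As a small sketch of how such a result could be turned into a short cluster label, the example below filters a made-up result by score. The keywords, the scores, and the `min_score` threshold are all hypothetical choices for illustration, not values produced by this notebook.

```python
# Hypothetical result in the shape that KeyBERT's extract_keywords returns:
# a list of (keyword, score) tuples; these words and scores are made up
toy_keywords = [("hockey", 0.52), ("nhl", 0.47), ("playoffs", 0.41), ("game", 0.28), ("season", 0.25)]

# Keep only keywords above an (arbitrary) relevance threshold and build a short label
min_score = 0.4
label_words = [word for word, score in toy_keywords if score >= min_score]
cluster_label = "_".join(label_words)

print(cluster_label)
```

Filtering by score is only one option; simply keeping the top few keywords, as the loop in the next cell does, is often enough for a quick look at the clusters.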

In [None]:
from keybert import KeyBERT
import numpy as np

# initialize the model; can use the same LM as we used for embeddings
rep_model = KeyBERT(model='all-MiniLM-L6-v2')

# loop through the clusters and get the labels (as BERTopic would do)
for label in np.unique(hdbscan_model.labels_):
    # Get docs in this cluster
    cluster_docs = [doc for doc, c in zip(docs, hdbscan_model.labels_) if c == label]
    # Combine documents into a single string
    combined_text = ' '.join(cluster_docs)
    # Extract keywords
    keywords = rep_model.extract_keywords(combined_text, top_n=5)
    # print the results
    # Note: KeyBERT returns a list of (keyword, score) tuples, where the score is
    #   the relevance, i.e., the cosine similarity between the embedding of the keyword and that of the original doc
    print(label, [kw[0] for kw in keywords])

## Combine All Steps with BERTopic

In [None]:
from bertopic.representation import KeyBERTInspired

# set random seed for reproducibility
seed = 54382

# embedding model
embedding_model = SentenceTransformer("all-MiniLM-L6-v2") # all-MiniLM-L6-v2 is name of pretrained model

# umap model
umap_model = UMAP(n_components=2, n_neighbors = 15, metric="cosine", random_state=seed)

# initialize HDBSCAN model
hdbscan_model = HDBSCAN(min_cluster_size=15, min_samples=1, cluster_selection_epsilon=0.165)

# representation model
representation_model = KeyBERTInspired()

# define the BERTopic model using the models above
topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    representation_model=representation_model,
    verbose=True
)

# fit the model to the data
topic_model.fit(docs) 

# get the topic information
topic_model.get_topic_info() 