# BERTopic Short Demo

If you are working on Colab:
- The following cell installs all the packages you will need.
- You may want to make use of the (free) GPU resources: click the down arrow in the upper right of the page, next to the RAM and Disk usage graphic, then choose "Change runtime type" and select "T4 GPU".  This will dramatically speed up this code.
- Please be sure to save a copy of the notebook to your own account. (If you opened it from the link on our GitHub repo, your changes are not saved automatically.)

If you are working locally on your computer, please see the [README.md](https://github.com/nuitrcs/AI_Week_Topic_Model/blob/main/README.md) file on our GitHub repo for a command to create a conda environment that has the necessary packages.

In [None]:
try:
    import google.colab
    print("You are working in Google Colab.  We will install necessary packages...")
    !pip install scikit-learn sentence-transformers umap-learn hdbscan bertopic pandas matplotlib datashader bokeh holoviews scikit-image colorcet keybert
except ImportError:  # google.colab is only importable inside Colab
    print("You are not working in Google Colab.")
    print("Please be sure that the necessary packages are installed and available, ideally within a conda env (e.g., see here: https://github.com/nuitrcs/AI_Week_Topic_Model/blob/main/README.md).")


### Read in the Preprocessed Data

For this demo, we will use the [`20newsgroups` dataset from scikit-learn](https://scikit-learn.org/stable/datasets/real_world.html#newsgroups-dataset).  The raw data needs to be fetched and then cleaned and reformatted so that it is easier for BERTopic to work with.  I wrote code to do this in the `exercises/supplementary_code.ipynb` notebook; that code also saves the output to a .csv file.  Here we can simply read in the resulting cleaned data.

In [None]:
import pandas as pd
# The line below will read the file directly from GitHub and will work on Colab.  
df = pd.read_csv('https://raw.githubusercontent.com/nuitrcs/AI_Week_Topic_Modeling/refs/heads/main/exercises/data/sklearn_20newsgroups_cleaned.csv')
# If you're working on your local computer and prefer to simply read the file from your disk, you can instead use the line below.
# df = pd.read_csv('exercises/data/sklearn_20newsgroups_cleaned.csv')
df.head()

In [None]:
# store the texts into docs variable as a list for use in BERTopic
docs = df["cleaned_text"].values.tolist()
print(docs[:5])

### Simplest case

BERTopic can be run out of the box without any tuning. However, this doesn't guarantee the best number of topics and representation for each topic.

In [None]:
from bertopic import BERTopic

topic_model = BERTopic() # initialize the model
topic_model.fit(docs) # fit the model to the data

topic_model.get_topic_info() # get the topic information

## Now let's break this down into the component steps we discussed in the presentation:

![graphical representation of topic modeling pipeline](exercises/images/topic_modeling_pipeline.png)

1. Embeddings
2. Dimension reduction
3. Clustering
4. Labeling

Each of these steps has parameters we can tune.  Setting them ourselves gives us more fine-grained control, so that we can improve the topics that are returned.

### 1. Embeddings

![graphical representation of embedding step](exercises/images/embeddings.png)


This step uses a language model to convert the text from the documents into vectors.
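
To build some intuition for why vectors are useful, here is a minimal sketch of how the similarity between two documents can be measured once they are embedded.  The three short vectors below are made up for illustration (they are not output from `all-MiniLM-L6-v2`, whose embeddings have hundreds of dimensions), and `cosine_similarity` is a helper defined here, not a library function.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-d "embeddings", invented for this example
doc_hockey = [0.9, 0.1, 0.0]
doc_baseball = [0.8, 0.2, 0.1]
doc_encryption = [0.0, 0.2, 0.9]

sim_sports = cosine_similarity(doc_hockey, doc_baseball)
sim_mixed = cosine_similarity(doc_hockey, doc_encryption)

# Documents about similar subjects should point in similar directions
print(f"hockey vs. baseball:   {sim_sports:.3f}")
print(f"hockey vs. encryption: {sim_mixed:.3f}")
```

Vectors that point in nearly the same direction have a cosine similarity close to 1, while unrelated ones are closer to 0.  This is the same notion of distance we select later with `metric="cosine"` in UMAP.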

In [None]:
from sentence_transformers import SentenceTransformer

# initialize model
embedding_model = SentenceTransformer("all-MiniLM-L6-v2") # all-MiniLM-L6-v2 is the name of a pretrained model
embeddings = embedding_model.encode(docs) # encode the texts into embeddings

In [None]:
print("Dimension of embeddings: ")
print(embeddings.shape)
print()
print(embeddings)

### 2. Dimension reduction

![graphical representation of dimension reduction step](exercises/images/dimension_reduction.png)


This step uses the [UMAP](https://umap-learn.readthedocs.io/en/latest/index.html) library to reduce the embeddings to 2 dimensions, which makes the data easier to cluster in the next step and easy to plot.

In [None]:
from umap import UMAP
import umap.plot

# set random seed for reproducibility
seed = 42
# initialize UMAP model
umap_model = UMAP(n_components=2, n_neighbors=15, metric="cosine", random_state=seed)
# fit the UMAP model to find the best 2D representation of the embeddings
umap_model.fit(embeddings)

In [None]:
print("Dimension of UMAP output: ")
print(umap_model.embedding_.shape)

In [None]:
# Plot the UMAP representation
umap.plot.points(umap_model)

### 3. Clustering

![graphical representation of clustering step](exercises/images/clustering.png)


Here we use the [HDBSCAN](https://hdbscan.readthedocs.io/en/latest/#) library, a suite of tools that uses unsupervised machine learning, to identify clusters in the data.  

Note that you may see different clusters than other participants working on different computers because of the way each computer handles randomization.  But (hopefully!) your notebook will be internally consistent if you rerun it with the same random seed.
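
When reading the clustering output, it helps to know that HDBSCAN assigns the label `-1` to points it considers noise rather than members of any cluster.  The sketch below summarizes a list of labels in that style; the `labels` list is invented for illustration and stands in for `hdbscan_model.labels_`.

```python
from collections import Counter

# Hypothetical labels, one integer per document (stand-in for hdbscan_model.labels_)
labels = [0, 0, 1, -1, 2, 1, 0, -1, 2, 0]

# Count documents per cluster, leaving out the noise label (-1)
cluster_sizes = Counter(label for label in labels if label != -1)
n_noise = sum(1 for label in labels if label == -1)

print("Number of clusters:", len(cluster_sizes))
print("Cluster sizes (largest first):", cluster_sizes.most_common())
print("Documents labeled as noise:", n_noise)
```

A large share of noise points can be a hint that parameters such as `min_cluster_size` or `min_samples` are worth adjusting.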

In [None]:
from hdbscan import HDBSCAN
import matplotlib.pyplot as plt

# initialize HDBSCAN model
hdbscan_model = HDBSCAN(min_cluster_size=12, min_samples=5, cluster_selection_epsilon=0.2)

# identify clusters on the 2-d representation of embeddings generated by UMAP
hdbscan_model.fit(umap_model.embedding_)
umap.plot.points(umap_model, labels=hdbscan_model.labels_, theme="blue")

### 4. Labeling

![graphical representation of labeling step](exercises/images/labeling.png)

Here we label each cluster using another language model with [KeyBERT](https://maartengr.github.io/KeyBERT/api/keybert.html).  Note that this is similar, though not identical, to what BERTopic uses (e.g., see the BERTopic documentation [here](https://maartengr.github.io/BERTopic/getting_started/representation/representation.html#keybertinspired)).
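
The core idea behind this kind of labeling can be sketched in a few lines: score each candidate word by how close its embedding is to the embedding of the text, then keep the top few.  This is only a toy illustration under stated assumptions, not KeyBERT's actual implementation: the vectors are made up, and `cosine_similarity` and `rank_keywords` are hypothetical helpers defined here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def rank_keywords(doc_vector, candidate_vectors, top_n=2):
    """Return the top_n (word, score) pairs, highest similarity first."""
    scored = [(word, cosine_similarity(vec, doc_vector))
              for word, vec in candidate_vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Hypothetical embedding of the combined text of one cluster
doc_vector = [0.9, 0.1, 0.1]

# Hypothetical embeddings of candidate words taken from that text
candidates = {
    "hockey": [0.8, 0.2, 0.0],
    "team": [0.7, 0.1, 0.3],
    "the": [0.1, 0.9, 0.2],
    "encryption": [0.0, 0.1, 0.9],
}

keywords = rank_keywords(doc_vector, candidates, top_n=2)
for word, score in keywords:
    print(f"{word}: {score:.3f}")
```

The result is a list of `(word, score)` pairs, which is the same shape as the output we unpack from `extract_keywords` in the cell below.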

In [None]:
from keybert import KeyBERT
import numpy as np

# initialize the model; can use the same LM as we used for embeddings
rep_model = KeyBERT(model='all-MiniLM-L6-v2')

hlabels = []
counts = []
words = []
# loop through the clusters and get the labels (as BERTopic would do)
for label in np.unique(hdbscan_model.labels_):
    # Get docs in this cluster
    cluster_docs = [doc for doc, c in zip(docs, hdbscan_model.labels_) if c == label]
    # Combine documents into a single string
    combined_text = ' '.join(cluster_docs)
    # Extract keywords
    keywords = rep_model.extract_keywords(combined_text, top_n=5)
    # save the results
    # Note: KeyBERT returns a list of (word, number) tuples, where the number is the relevance score,
    #   i.e., the cosine similarity between the embedding of the keyword and that of the original doc
    hlabels.append(label)
    counts.append(len(cluster_docs))
    words.append([kw[0] for kw in keywords])

# save this in a dataframe so it is prettier to look at (and easier to sort)
output_df = pd.DataFrame({'hdbscan_label':hlabels, 'count':counts, 'keywords':words})
# sort by counts (easier to compare to BERTopic output)
output_df.sort_values(by="count", ascending=False)

## Combine All Steps with BERTopic

In [None]:
from bertopic.representation import KeyBERTInspired

# set random seed for reproducibility
seed = 42

# embedding model
embedding_model = SentenceTransformer("all-MiniLM-L6-v2") # all-MiniLM-L6-v2 is the name of a pretrained model

# umap model
umap_model = UMAP(n_components=2, n_neighbors=15, metric="cosine", random_state=seed)

# initialize HDBSCAN model
hdbscan_model = HDBSCAN(min_cluster_size=12, min_samples=5, cluster_selection_epsilon=0.2)

# representation model
representation_model = KeyBERTInspired()

# define the BERTopic model using the models above
topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    representation_model=representation_model,
    verbose=True
)

# fit the model to the data
topic_model.fit(docs) 

# get the topic information
topic_model.get_topic_info() 

### Additional visualization options from BERTopic

BERTopic provides many different ways to visualize the results.  Here are some links to get you started.

- [Visualize topics](https://maartengr.github.io/BERTopic/getting_started/visualization/visualize_topics.html)
- [Visualize documents](https://maartengr.github.io/BERTopic/getting_started/visualization/visualize_documents.html)
- [Additional info on best practices](https://maartengr.github.io/BERTopic/getting_started/best_practices/best_practices.html)

We provide some examples below. Try running these for yourself and see what you get! You can use these methods to understand how your choices of model & parameters can affect the output.

In [None]:
topic_model.visualize_topics()

In [None]:
# note that this uses the embeddings we calculated above
topic_model.visualize_documents(docs, embeddings=embeddings)