# BERTopic Exercise Part 1

If you are working on Colab:
- The following cell installs all the packages you will need. 
- You may want to make use of the (free) GPU resources: click on the down arrow in the upper-right of the page next to the RAM and Disk usage graphic.  Then "Change runtime type" and select "T4 GPU".  This will dramatically speed up your runtime for this code.
- Please be sure to save your file on your own account. (If you clicked on the link on our GitHub repo, your changes are not saved automatically.)

If you are working locally on your computer, please see the [README.md](https://github.com/nuitrcs/AI_Week_Topic_Model/blob/main/README.md) file on our GitHub repo for a command to create a conda environment that has the necessary packages.

In [None]:
try:
    import google.colab
    print("You are working in Google Colab.  We will install necessary packages...")
    !pip install scikit-learn sentence-transformers umap-learn hdbscan bertopic pandas matplotlib datashader bokeh holoviews scikit-image colorcet keybert
except ImportError:
    print("You are not working in Google Colab.")
    print("Please be sure that the necessary packages are installed and available, ideally within a conda env (e.g., see here: https://github.com/nuitrcs/AI_Week_Topic_Model/blob/main/README.md).")
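
If you are working locally, a quick check like the one below may help confirm that the packages are visible to your Python environment before you start. This is a minimal sketch: the helper `missing_modules` is a hypothetical name introduced here, and the list uses the import names that correspond to the packages in the `pip install` line above (for example, `sklearn` for scikit-learn and `skimage` for scikit-image).

```python
import importlib.util

# Import names (not pip names) for the packages listed in the install cell.
REQUIRED_MODULES = [
    "sklearn", "sentence_transformers", "umap", "hdbscan", "bertopic",
    "pandas", "matplotlib", "datashader", "bokeh", "holoviews",
    "skimage", "colorcet", "keybert",
]


def missing_modules(module_names):
    """Return the names in module_names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing packages (import names):", ", ".join(missing))
else:
    print("All required packages were found.")
```

The check only looks the modules up; it does not import them, so it runs quickly even for large packages.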


In this exercise, you will join one of four groups: 
- Group 1: Embedding
- Group 2: Dimension Reduction
- Group 3: Clustering
- Group 4: Labeling
 
Each group will focus on the parameters in their own section (e.g., Group 1 will focus on changing parameters for the Embedding step), but will also run the entire topic modeling pipeline from start to finish, leaving parameters for the other steps on their default values. Each time you change a parameter in your section, please run the entire pipeline to see the resulting topics that are identified and labeled. Save the results of each run (e.g. save the result dataframe to your computer as a csv, or take a screen grab) so that you can remember the differences and share those with us in your presentation.
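
One possible way to keep track of your runs is to write each result table to its own CSV file, named after the parameter you changed. The sketch below assumes a hypothetical helper `save_run` and a folder name `bertopic_runs`, neither of which is part of the exercise code. The small table is a stand-in with placeholder values; in the notebook you would pass `topic_model.get_topic_info()` after fitting the model.

```python
from pathlib import Path

import pandas as pd


def save_run(topic_info, run_name, out_dir="bertopic_runs"):
    """Write one run's topic table to <out_dir>/<run_name>.csv and return the path."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    csv_path = out_path / f"{run_name}.csv"
    topic_info.to_csv(csv_path, index=False)
    return csv_path


# Stand-in table with placeholder values, for illustration only.
example_info = pd.DataFrame({
    "Topic": [-1, 0, 1],
    "Count": [120, 300, 250],
    "Name": ["placeholder_a", "placeholder_b", "placeholder_c"],
})
saved_to = save_run(example_info, "umap_n_neighbors_15")
print("Saved:", saved_to)
```

On Colab, files written this way live in the session storage, so remember to download them before the session ends.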

### Read and Preprocess Data

In [None]:
from sklearn.datasets import fetch_20newsgroups
import pandas as pd

# this is a function from sklearn that fetches the 20 newsgroups text dataset
# it is a collection of approximately 20,000 newsgroup documents, partitioned across 20 different newsgroups
# this returns a bunch object, which is very similar to a dictionary
bunch = fetch_20newsgroups(
    categories=["comp.graphics", "rec.autos", "rec.motorcycles", 
                "rec.sport.baseball", "rec.sport.hockey", 
                "sci.electronics", "sci.med", "sci.space"], # only extract select topics
    remove=("headers","footers","quotes")) # don't extract unnecessary metadata

# get the text data and labels (in case you want to compare back to the original labels after you run through BERTopic)
docs = bunch["data"]
doc_labels = bunch["target"]
# create a data frame with the text and labels
df = pd.DataFrame({
    "text": docs,
    "labels": doc_labels
})

# create a label with text info
df["labels_text"] = df["labels"].astype("category").cat.rename_categories({i:j for i,j in enumerate(bunch["target_names"])})

# also remove documents that are empty
df["text_processed"] = df["text"].str.strip()
df = df[df["text_processed"] != ""]

print()
print("Data Frame: ")
print(df.head())

In [None]:
# store the processed texts into docs variable
docs = df["text_processed"].values.tolist()
print(docs[:5])

## BERTopic Exploration

### Default Model Definition

The cell below defines default parameters and packages for each step of topic modeling. You can refer back to it when you start modifying your specific step below.  Everyone should run this cell.

In [None]:
from bertopic import BERTopic

from sentence_transformers import SentenceTransformer
from umap import UMAP
import umap.plot
from hdbscan import HDBSCAN
from bertopic.representation import KeyBERTInspired

# define default models and parameters
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
dimension_reduction_model = UMAP(n_components=2, n_neighbors=15, metric="cosine", random_state=54382)
clustering_model = HDBSCAN(min_cluster_size=15, min_samples=1, cluster_selection_epsilon=0.165)
representation_model = KeyBERTInspired()

# we will also run these models here so that you can have everything ready to go below
# these steps aren't strictly necessary if you want to run everything directly through BERTopic but will help with this exercise
embeddings = embedding_model.encode(docs) # encode the texts into embeddings
dimension_reduction_model.fit(embeddings) # fit the UMAP model on the embeddings
clustering_model.fit(dimension_reduction_model.embedding_) # fit the clustering model on the UMAP output

Once you customize the step in the pipeline that corresponds to your group, you should run the full topic model through the `BERTopic()` function shown below. You can copy this cell to your section below and use it to check how changes to your parameters affect the resulting topics. 

In [None]:
# define topic model pipeline with BERTopic
topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=dimension_reduction_model,
    hdbscan_model=clustering_model,
    representation_model=representation_model
)

# begin topic modeling using the processed documents
topic_model.fit(docs)

### Visualize and Observe the Output of BERTopic

Before you dive into your specific group's section, please familiarize yourself with these methods to visualize and explore your results.  A BERTopic model comes with a set of methods that you can use to inspect the output of your pipeline.
For more options, you can explore different chapters in the BERTopic documentation.

- [Visualize topics](https://maartengr.github.io/BERTopic/getting_started/visualization/visualize_topics.html)
- [Visualize documents](https://maartengr.github.io/BERTopic/getting_started/visualization/visualize_documents.html)
- [Additional info on best practices](https://maartengr.github.io/BERTopic/getting_started/best_practices/best_practices.html)

We provide some examples below. Try running these for yourself and see what you get! You can use these methods to understand how your choices of model & parameters can affect the output.

In [None]:
topic_model.get_topic_info()

In [None]:
topic_model.visualize_topics()

In [None]:
# Run the visualization with the original embeddings
topic_model.visualize_documents(docs, embeddings=embeddings)
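
If you want to compare the topics back to the original newsgroup labels, a cross-tabulation is one simple option. The sketch below uses a hypothetical helper `label_topic_table` and toy values. In the notebook you could pass `df["labels_text"].tolist()` and `topic_model.topics_` instead; converting to a list avoids index misalignment, since rows were dropped from `df` when empty documents were removed.

```python
import pandas as pd


def label_topic_table(true_labels, topics):
    """Cross-tabulate original labels (rows) against assigned topic ids (columns)."""
    return pd.crosstab(
        pd.Series(list(true_labels), name="newsgroup"),
        pd.Series(list(topics), name="topic"),
    )


# Toy values for illustration only.
toy_labels = ["sci.space", "sci.space", "rec.autos", "rec.autos", "sci.med"]
toy_topics = [0, 0, 1, -1, 2]
table = label_topic_table(toy_labels, toy_topics)
print(table)
```

A newsgroup whose documents fall mostly in a single column maps cleanly onto one topic, while a row spread across many columns suggests the topic model split or mixed that newsgroup.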

## Group 1: Embeddings

[Sentence Transformers (also called SBERT)](https://sbert.net/index.html) is a widely used Python package for accessing embedding models. You can find a selection of available models that can be used with SBERT [on this page of SBERT documentation](https://sbert.net/docs/sentence_transformer/pretrained_models.html#original-models). These models have been evaluated for their ability to produce high quality sentence embeddings. Choose a pretrained model and see how it affects the output.

In [None]:
# initialize model
# all-MiniLM-L6-v2 is name of pretrained model
# after you try this model, you should try others from the documentation linked above
embedding_model = SentenceTransformer("all-MiniLM-L6-v2") 
embeddings = embedding_model.encode(docs) # encode the texts into embeddings

print("Dimension of embeddings: ",embeddings.shape)
print(embeddings)
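
One way to get a feel for what an embedding model captures is to compare a few documents by cosine similarity. The sketch below is illustrative only: `cosine_similarity_matrix` is a hypothetical helper and the vectors are toy values. In the notebook you could pass a slice such as `embeddings[:5]`.

```python
import numpy as np


def cosine_similarity_matrix(vectors):
    """Return the pairwise cosine similarity of the rows of a 2-D array."""
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)  # guard against zero-length rows
    return unit @ unit.T


# Toy 3-D vectors for illustration only.
toy = np.array([
    [1.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],  # same direction as row 0, different length
    [0.0, 1.0, 0.0],  # orthogonal to row 0
])
similarities = cosine_similarity_matrix(toy)
print(np.round(similarities, 3))
```

Cosine similarity depends only on direction, not length, so rows 0 and 1 count as identical here. Documents on similar subjects should score closer to 1 than unrelated ones, and how well that holds can differ between embedding models.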

Once you select a pretrained embedding model, run the topic modeling pipeline using the embedding model of your choice by copying the `BERTopic()` command provided above into the cell below and running it.  And don't forget to explore and visualize your results (e.g., using the example code from above)!

In [None]:
# copy code from above to run BERTopic using your embedding model and then explore and visualize the results


## Group 2: Dimension Reduction

The output of the embedding model is a high-dimensional array (384 dimensions for `all-MiniLM-L6-v2`).  Clustering directly on those data would likely be difficult because points in a high-dimensional space are sparse and the distances between them become less informative. We can use dimension reduction to project the embeddings down to 2 dimensions, which aims to preserve most of the useful structure while making the clustering step easier.
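
To illustrate why clustering in many dimensions is hard, the sketch below measures how different the nearest and farthest neighbors of a point are as the number of dimensions grows. It uses random Gaussian points as an assumption for illustration; real embeddings have more structure, so treat this as intuition rather than a measurement of your data. The helper `distance_contrast` is a hypothetical name.

```python
import numpy as np


def distance_contrast(n_points, n_dims, seed=0):
    """Return (farthest - nearest) / nearest distance from one point to all the others."""
    rng = np.random.default_rng(seed)
    points = rng.standard_normal((n_points, n_dims))
    dists = np.linalg.norm(points[1:] - points[0], axis=1)
    return (dists.max() - dists.min()) / dists.min()


for dims in (2, 10, 384):
    contrast = distance_contrast(500, dims)
    print(f"{dims:>3} dimensions: relative contrast = {contrast:.2f}")
```

The contrast shrinks as dimensions are added, meaning "near" and "far" points become harder to tell apart. Density-based clustering relies on exactly that distinction.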

[UMAP](https://umap-learn.readthedocs.io/en/latest/basic_usage.html) is a popular method for dimensionality reduction 
often used in text analysis. 
It works by preserving both local and global structures of the data in a lower-dimensional space. 
Several parameters control UMAP’s behavior and output:

- `n_neighbors`: Controls the balance between **local** and **global** structure.
    - Lower values focus more on local structure, emphasizing relationships between nearby points, but may miss the global context.
    - Higher values capture broader structures and global patterns, but may smooth out local details.
- `n_components`: Specifies the target number of dimensions in the reduced space. This is typically set to 2 or 3.
- `metric`: Defines the distance metric used to compute similarity between points. For text embeddings, `"cosine"` distance is commonly used.
- `random_state`: Sets the seed for reproducibility. 
    Since UMAP involves stochastic processes (e.g., initialization and optimization), fixing the random_state ensures consistent results across runs.
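
If you want to be systematic about exploring these parameters, one option is to generate a list of configurations that each differ from the defaults in a single parameter. The sketch below uses only the standard library. The helper `one_at_a_time` is a hypothetical name, and the candidate values are illustrative choices rather than recommendations.

```python
DEFAULT_UMAP_PARAMS = {
    "n_components": 2,
    "n_neighbors": 15,
    "metric": "cosine",
    "random_state": 54382,
}


def one_at_a_time(defaults, alternatives):
    """Yield (description, params) pairs that change one parameter at a time."""
    yield "defaults", dict(defaults)
    for name, values in alternatives.items():
        for value in values:
            if value == defaults[name]:
                continue  # the default configuration is already covered
            params = dict(defaults)
            params[name] = value
            yield f"{name}={value}", params


# Candidate values are illustrative only.
candidates = {"n_neighbors": [5, 15, 50], "metric": ["cosine", "euclidean"]}
for description, params in one_at_a_time(DEFAULT_UMAP_PARAMS, candidates):
    print(description, params)
    # In the notebook: dimension_reduction_model = UMAP(**params)
```

Changing one parameter per run makes it easier to attribute any difference in the resulting topics to that parameter.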

The cell below runs UMAP on the embeddings calculated in the previous step. (Recall that we already defined the `embedding_model` above.) Try out different parameters and plot the result until you find a configuration that groups the embeddings in a reasonable manner.

For more detailed explanation of the parameters with examples, consult [this page](https://umap-learn.readthedocs.io/en/latest/parameters.html).

In [None]:
# initialize UMAP model
# try changing the parameters one at a time
dimension_reduction_model = UMAP(
    n_components=2, 
    n_neighbors=15, 
    metric="cosine", 
    random_state=54382
)

# fit the UMAP model to find the best 2D representation of the embeddings
dimension_reduction_model.fit(embeddings)

Once you fit the dimension reduction model, you can check its output with the following code. When you decide on a set of parameters, try running the entire BERTopic model with your chosen dimension reduction model and see what you get!

In [None]:
# Plot the UMAP representation
umap.plot.points(dimension_reduction_model)

Once you select the parameters for your UMAP model, run the topic modeling pipeline using your UMAP model by copying the `BERTopic()` command provided above into the cell below and running it.  And don't forget to explore and visualize your results (e.g., using the example code from above)!

In [None]:
# copy code from above to run BERTopic using your UMAP model and then explore and visualize the results


## Group 3: Clustering

This group assumes that you have already selected embedding and dimension reduction methods and now need to cluster the 2-D representation of the data. [HDBSCAN](https://hdbscan.readthedocs.io/en/latest/index.html) is a popular unsupervised, density-based clustering algorithm.

HDBSCAN provides many parameters that you can use to control the clustering behavior. You can read about the parameters in-depth [here](https://hdbscan.readthedocs.io/en/latest/parameter_selection.html). Important parameters in HDBSCAN that you will most commonly interact with are the following:

- `min_cluster_size`: Specifies the smallest size grouping to be considered as a cluster. Increasing this value results in a smaller number of clusters.
- `min_samples`: Provides a measure of how conservative you want your clustering to be. The larger the value, the more conservative the clustering, and therefore more points will be declared as noise, and clustering is restricted to more densely populated areas. 
- `cluster_selection_epsilon`: Helps merge micro-clusters in regions where they are abundant by ensuring that clusters below the given distance threshold are not split up any further.

Remember that we already defined the embeddings and reduced dimension output of those embeddings above.

Your job is to modify the code below to cluster the data in the way you think is best. Try playing with different parameters. If you aren't able to get the output that you would like, try a different [clustering algorithm](https://scikit-learn.org/stable/modules/clustering.html). While HDBSCAN certainly provides many advantages, you are not limited to that algorithm, and for a particular dataset another algorithm may perform better.
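
As one example of an alternative, the sketch below fits scikit-learn's `KMeans` to toy 2-D points that stand in for `dimension_reduction_model.embedding_`. The data and the choice of `n_clusters=3` are assumptions made for illustration. For the exercise data you would pick your own value.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D points: three well-separated blobs of 50 points each.
rng = np.random.default_rng(0)
toy_points = np.vstack([
    rng.normal(loc=center, scale=0.3, size=(50, 2))
    for center in ([0, 0], [5, 5], [0, 5])
])

# n_clusters=3 matches the toy data above.
kmeans_model = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans_model.fit(toy_points)

print("Cluster sizes:", np.bincount(kmeans_model.labels_))
# In the notebook: define clustering_model = KMeans(n_clusters=...) and pass it
# to BERTopic through the hdbscan_model argument, as in the pipeline cell.
```

Unlike HDBSCAN, k-means assigns every point to a cluster, so there is no noise label of -1 and you must choose the number of clusters up front.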

In [None]:
# initialize HDBSCAN model
# try changing the parameters one at a time
clustering_model = HDBSCAN(
    min_cluster_size=15, 
    min_samples=1, 
    cluster_selection_epsilon=0.165
)

# fit the clustering model to the reduced embeddings calculated in the previous step
clustering_model.fit(dimension_reduction_model.embedding_)

Once you train the clustering model and label the datapoints, you can visually check the result using the `umap.plot.points` function.

In [None]:
umap.plot.points(dimension_reduction_model, labels=clustering_model.labels_, theme="blue")
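
Alongside the plot, a numeric summary can make it easier to compare runs. The sketch below counts clusters and the share of points labeled as noise, relying on the HDBSCAN convention that noise points receive the label -1. The helper `summarize_labels` is a hypothetical name and the labels are toy values. In the notebook you could pass `clustering_model.labels_`.

```python
from collections import Counter


def summarize_labels(labels):
    """Summarize cluster labels, treating -1 as noise (the HDBSCAN convention)."""
    counts = Counter(int(label) for label in labels)
    n_noise = counts.pop(-1, 0)
    total = n_noise + sum(counts.values())
    return {
        "n_clusters": len(counts),
        "noise_fraction": n_noise / total if total else 0.0,
        "cluster_sizes": dict(sorted(counts.items())),
    }


# Toy labels for illustration only.
summary = summarize_labels([0, 0, 1, -1, 1, 1, -1, 2])
print(summary)
```

Recording these numbers for each parameter setting gives you a compact way to describe how `min_cluster_size`, `min_samples`, and `cluster_selection_epsilon` changed the clustering.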

Once you select the parameters for your HDBSCAN model, run the topic modeling pipeline using your HDBSCAN model by copying the `BERTopic()` command provided above into the cell below and running it.  And don't forget to explore and visualize your results (e.g., using the example code from above)!

In [None]:
# copy code from above to run BERTopic using your HDBSCAN model and then explore and visualize the results


## Group 4: Labeling

In this group, you will control how each cluster is labeled. By default, BERTopic labels topics using a class-based form of TF-IDF (c-TF-IDF), but this doesn't always return optimal topic labels. We can improve on this by using one of the representation models implemented in BERTopic. The example below demonstrates how to use a model inspired by KeyBERT, available in the BERTopic package, but you are free to try different methods as well.
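
To make the TF-IDF idea concrete, here is a simplified sketch of a class-based weighting: all documents in a cluster are pooled, and each word is scored as `tf * log(1 + A / f)`, where `tf` is the word's count in the cluster, `f` is its count across all clusters, and `A` is the average number of words per cluster. This is not BERTopic's implementation (which also handles tokenization and normalization). The helper `class_tfidf` and the toy sentences are assumptions for illustration.

```python
import math
from collections import Counter


def class_tfidf(docs_by_cluster, top_n=3):
    """Return the top_n highest-scoring words per cluster using tf * log(1 + A / f)."""
    # Pool each cluster's documents and count words (naive whitespace tokenization).
    cluster_counts = {
        cluster: Counter(" ".join(texts).lower().split())
        for cluster, texts in docs_by_cluster.items()
    }
    total_counts = Counter()
    for counts in cluster_counts.values():
        total_counts.update(counts)
    avg_words = sum(total_counts.values()) / len(cluster_counts)

    top_words = {}
    for cluster, counts in cluster_counts.items():
        scores = {
            word: tf * math.log(1 + avg_words / total_counts[word])
            for word, tf in counts.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda word: (-scores[word], word))
        top_words[cluster] = ranked[:top_n]
    return top_words


# Toy clusters for illustration only.
toy_clusters = {
    0: ["the rocket reached orbit", "the rocket launch was delayed"],
    1: ["the goalie stopped the puck", "the puck hit the post"],
}
toy_top_words = class_tfidf(toy_clusters, top_n=2)
print(toy_top_words)
```

Note that a common function word such as "the" can still rank highly when a cluster uses it often, which is one reason frequency-based labels may look less informative than you would like.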

Recall that we already defined the default embedding, dimension reduction and clustering steps above.

First, try running BERTopic without specifying a representation model. What kind of result do you get? Is this what we want? How can this be fixed? BERTopic explains how you can make better representations [here](https://maartengr.github.io/BERTopic/getting_started/best_practices/best_practices.html#improving-default-representation) and [here](https://maartengr.github.io/BERTopic/getting_started/representation/representation.html). Try different approaches for yourself! You are not limited to what's provided in the BERTopic package, however. As we saw in the demo, you can utilize models outside of BERTopic to try to extract topics and keywords.

In [None]:
# define topic model pipeline with BERTopic
topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=dimension_reduction_model,
    hdbscan_model=clustering_model,
)

# begin topic modeling using the processed documents
topic_model.fit(docs)

Explore and visualize your results (e.g., using the example code from above)!

In [None]:
# copy code from above to explore and visualize the results


Now let's compare with the `KeyBERTInspired` representation using the code below.  Afterwards, you can try a different representation model from the links provided above.

In [None]:
# choose a representation model, e.g., KeyBERTInspired()
# after you run with KeyBERTInspired, you can choose something else
representation_model = KeyBERTInspired()

# define topic model pipeline with BERTopic
topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=dimension_reduction_model,
    hdbscan_model=clustering_model,
    representation_model=representation_model
)

# begin topic modeling using the processed documents
topic_model.fit(docs)

Don't forget to explore and visualize your results (e.g., using the example code from above)!

In [None]:
# copy code from above to explore and visualize the results
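
When comparing two representation models, it can help to quantify how much the keyword lists for the same topic differ. The sketch below computes the Jaccard overlap between two lists. The helper `keyword_overlap` is a hypothetical name and the keyword lists are toy values. In the notebook, each list could be built from `topic_model.get_topic(topic_id)`, which returns `(word, score)` pairs, assuming the topic ids refer to the same clusters in both runs.

```python
def keyword_overlap(words_a, words_b):
    """Jaccard overlap between two keyword lists: shared words / all distinct words."""
    set_a, set_b = set(words_a), set(words_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


# Toy keyword lists for illustration only.
default_words = ["the", "game", "team", "hockey"]
keybert_words = ["hockey", "nhl", "team", "playoffs"]
overlap = keyword_overlap(default_words, keybert_words)
print(f"overlap = {overlap:.2f}")
```

A low overlap means the two representations describe the same cluster with largely different words, which is worth a closer look when you present your results.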
