# BERTopic Exercise Part 1

In this exercise, you will be assigned to one of three groups: 
- Embedding + Dimension Reduction
- Clustering
- Representation
 
Based on your group, you will navigate to the corresponding section in this notebook to try various options and explore how different settings might affect the topic modeling pipeline.

### Read and Preprocess Data

In [None]:
from sklearn.datasets import fetch_20newsgroups
import pandas as pd

# this is a function from sklearn that fetches the 20 newsgroups text dataset
# it is a collection of approximately 20,000 newsgroup documents, partitioned across 20 different newsgroups
# this returns a bunch object, which is very similar to a dictionary
bunch = fetch_20newsgroups(
    categories=["comp.graphics", "rec.autos", "rec.motorcycles", 
                "rec.sport.baseball", "rec.sport.hockey", 
                "sci.electronics", "sci.med", "sci.space"], # only extract select topics
    remove=("headers","footers","quotes")) # don't extract unnecessary metadata

# get the text data and labels
docs = bunch["data"]
doc_labels = bunch["target"]
# create a data frame with the text and labels
df = pd.DataFrame({
    "text": docs,
    "labels": doc_labels
})

# create a label with text info
df["labels_text"] = df["labels"].astype("category").cat.rename_categories({i:j for i,j in enumerate(bunch["target_names"])})

# also remove documents that are empty
df["text_processed"] = df["text"].str.strip()
df = df[df["text_processed"] != ""]

print()
print("Data Frame: ")
print(df.head())

In [None]:
# store the processed texts into docs variable
docs = df["text_processed"].values.tolist()
print(docs[:5])

## BERTopic Exploration

### Default Model Definition

The cell below defines default models for each step of topic modeling. Everybody should run this cell.

In [None]:
from bertopic import BERTopic

from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic.representation import KeyBERTInspired

# define default models and parameters
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
dimension_reduction_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine")
clustering_model = HDBSCAN(min_cluster_size=15, min_samples=1, cluster_selection_epsilon=0.165)
representation_model = KeyBERTInspired()

Once you have defined the custom model for your group, run the topic model by calling `BERTopic()` as shown below. You can copy this cell into your group's section and rerun it to see how your choices affect the output.

In [None]:
# define topic model pipeline with BERTopic
topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=dimension_reduction_model,
    hdbscan_model=clustering_model,
    representation_model=representation_model
)

# begin topic modeling using the processed documents
topic_model.fit(docs)

### Visualize and Observe the Output of BERTopic

A fitted BERTopic model comes with a set of methods that you can use to inspect the output of your pipeline.
For more options, you can explore different chapters in the BERTopic documentation.

- [Visualize topics](https://maartengr.github.io/BERTopic/getting_started/visualization/visualize_topics.html)
- [Visualize documents](https://maartengr.github.io/BERTopic/getting_started/visualization/visualize_documents.html)
- [Additional info on best practices](https://maartengr.github.io/BERTopic/getting_started/best_practices/best_practices.html)

We have provided some examples below. Try running these for yourself and see what you get! You can use these methods to understand how your choices of model & parameters can affect the output.

In [None]:
topic_model.get_topic_info()

In [None]:
topic_model.visualize_topics()

In [None]:
# manually extract the embeddings using the model used for topic modeling
embeddings = embedding_model.encode(docs)
# Run the visualization with the original embeddings
topic_model.visualize_documents(docs, embeddings=embeddings)

## GROUP 1: Embeddings + Dimension Reduction

[Sentence Transformers (also called SBERT)](https://sbert.net/index.html) is a widely used Python package for accessing embedding models. You can find a selection of available models [on this page of the SBERT documentation](https://sbert.net/docs/sentence_transformer/pretrained_models.html#original-models). These models have been evaluated for their ability to produce high-quality sentence embeddings. Choose a pretrained model and see how it affects the output.

In [None]:
from sentence_transformers import SentenceTransformer

# initialize model
embedding_model = SentenceTransformer("all-MiniLM-L6-v2") # all-MiniLM-L6-v2 is name of pretrained model
embeddings = embedding_model.encode(docs) # encode the texts into embeddings

print("Dimension of embeddings: ")
print(embeddings.shape)
print()
print(embeddings)

Once you have selected a pretrained embedding model, assign it to `embedding_model` and rerun the BERTopic pipeline cell from the Default Model Definition section to see how your choice affects the topics.

## GROUP 2: Dimension Reduction

Since the output is a high-dimensional array, it is not easy to see what differentiates the output of one pretrained model from another. Furthermore, in high dimensions the data points become sparse, which can cause problems in the clustering step. To address both issues, we can use dimension reduction to project the embeddings down to 2 dimensions, which should preserve most of the useful information and lets us visualize the data.
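
To get a feel for why high-dimensional points are hard to cluster, the sketch below compares how much pairwise distances vary in 2 dimensions versus 384 dimensions (the output size of `all-MiniLM-L6-v2`). It is only a toy illustration: the points are uniform random numbers, not real embeddings, and the helper name `relative_spread` is made up for this example.

```python
import math
import random
import statistics

def relative_spread(n_points, n_dims, seed=0):
    """Return std/mean of all pairwise Euclidean distances between uniform random points."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    distances = [
        math.dist(points[i], points[j])
        for i in range(n_points)
        for j in range(i + 1, n_points)
    ]
    return statistics.pstdev(distances) / statistics.fmean(distances)

# assumption: random points stand in for embeddings purely for illustration
spread_2d = relative_spread(100, 2)
spread_384d = relative_spread(100, 384)

print(f"relative spread in 2 dimensions:   {spread_2d:.3f}")
print(f"relative spread in 384 dimensions: {spread_384d:.3f}")
```

The relative spread is much smaller in 384 dimensions: all points end up at roughly the same distance from each other, so a density-based clustering method has little contrast to work with.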

[UMAP](https://umap-learn.readthedocs.io/en/latest/basic_usage.html) is a popular method for dimensionality reduction 
often used in text analysis to visualize high-dimensional embeddings. 
It works by preserving both local and global structures of the data in a lower-dimensional space. 
Several parameters control UMAP’s behavior and output:

- `n_neighbors`: Controls the balance between **local** and **global** structure.
    - Lower values focus more on local structure, emphasizing relationships between nearby points, but may miss the global context.
    - Higher values capture broader structures and global patterns, but may smooth out local details.
- `n_components`: Specifies the target number of dimensions in the reduced space. This is typically set to 2 or 3.
- `metric`: Defines the distance metric used to compute similarity between points. For text embeddings, `"cosine"` distance is commonly used.
- `random_state`: Sets the seed for reproducibility. 
    Since UMAP involves stochastic processes (e.g., initialization and optimization), fixing the random_state ensures consistent results across runs.
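
To make the `metric` choice above more concrete, here is a minimal sketch in plain Python of how cosine distance differs from Euclidean distance. The three toy vectors and the helper `cosine_distance` are made up for illustration and are not real embeddings or part of UMAP.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity, which depends only on the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# hypothetical toy vectors standing in for document embeddings
short_doc = [1.0, 2.0, 0.5]
long_doc = [10.0, 20.0, 5.0]   # same direction as short_doc, ten times the magnitude
other_doc = [2.0, -1.0, 0.0]   # orthogonal to short_doc

print("cosine, same direction:   ", cosine_distance(short_doc, long_doc))
print("euclidean, same direction:", math.dist(short_doc, long_doc))
print("cosine, orthogonal:       ", cosine_distance(short_doc, other_doc))
```

Two vectors that point in the same direction have a cosine distance close to 0 even when their magnitudes differ a lot, while their Euclidean distance is large. This is one reason cosine distance is a common choice for text embeddings.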

The cell below runs UMAP on the embeddings calculated in the previous step. Try out different parameters and plot the result until you find a configuration that groups the embeddings in a reasonable manner.

For more detailed explanation of the parameters with examples, consult [this page](https://umap-learn.readthedocs.io/en/latest/parameters.html).

In [None]:
from umap import UMAP
# umap.plot needs extra plotting dependencies (install with: pip install "umap-learn[plot]")
import umap.plot

# set random seed for reproducibility
seed = 54382
# initialize UMAP model
dimension_reduction_model = UMAP(n_components=2, n_neighbors=15, metric="cosine", random_state=seed)
# fit the UMAP model to find the best 2D representation of the embeddings
dimension_reduction_model.fit(embeddings)

Once you have fit the dimension reduction model, you can check its output with the following code. When you have decided on a set of parameters, try running the entire BERTopic pipeline with your chosen dimension reduction model and see what you get!

In [None]:
# Plot the UMAP representation
umap.plot.points(dimension_reduction_model)

## GROUP 3: Unsupervised Clustering

This group assumes that you have already selected embedding and dimension reduction methods and now need to cluster the 2-D representation of the data. Usually you will be working with a large collection of texts, so labeling them by hand would be tedious. Instead, we can use an unsupervised clustering algorithm to group similar points together. [HDBSCAN](https://hdbscan.readthedocs.io/en/latest/index.html) is a popular method with a handful of useful features. Its main differentiating features are that it chooses the number of clusters automatically and labels ambiguous points as outliers.

HDBSCAN provides many parameters that you can use to control the clustering behavior. You can check out the parameters in-depth [here](https://hdbscan.readthedocs.io/en/latest/parameter_selection.html). Important parameters in HDBSCAN that you will most commonly interact with are the following:

- `min_cluster_size`: Specifies the smallest grouping that is considered a cluster. Increasing this value generally results in fewer clusters.
- `min_samples`: Provides a measure of how conservative you want your clustering to be. The larger the value, the more conservative the clustering: more points are declared as noise, and clusters are restricted to more densely populated areas.
- `cluster_selection_epsilon`: Helps merge micro-clusters in regions where they are abundant. It ensures that clusters below the given distance threshold are not split up any further.

First, let's obtain the embeddings and reduced dimension output of those embeddings to begin clustering.

In [None]:
from sentence_transformers import SentenceTransformer
from umap import UMAP
import umap.plot

# initialize model
embedding_model = SentenceTransformer("all-MiniLM-L6-v2") # all-MiniLM-L6-v2 is name of pretrained model
embeddings = embedding_model.encode(docs) # encode the texts into embeddings

# set random seed for reproducibility
seed = 54382
# initialize UMAP model
dimension_reduction_model = UMAP(n_components=2, n_neighbors=15, metric="cosine", random_state=seed)
# fit the UMAP model to find the best 2D representation of the embeddings
dimension_reduction_model.fit(embeddings)

# Plot the UMAP representation
umap.plot.points(dimension_reduction_model)

Your job is to modify the code below to cluster the data in the way you think is best. Try playing with different parameters. If you aren't able to get the output that you would like, try a different [clustering algorithm](https://scikit-learn.org/stable/modules/clustering.html). While HDBSCAN certainly provides many advantages, you are not limited to that algorithm, and for a particular dataset another algorithm may perform better.

In [None]:
from hdbscan import HDBSCAN
import matplotlib.pyplot as plt

# initialize HDBSCAN model
clustering_model = HDBSCAN()

# fit the clustering model to the reduced embeddings calculated in the previous step
clustering_model.fit(dimension_reduction_model.embedding_)

Once you have fit the model and labeled the data points, you can visually check the result using the `umap.plot.points` function.

In [None]:
umap.plot.points(dimension_reduction_model, labels=clustering_model.labels_, theme="blue")
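
Besides the plot, a quick numeric summary can help when comparing parameter settings. The sketch below counts clusters and the share of points marked as noise, relying on the HDBSCAN convention that outliers receive the label `-1`. The helper `summarize_labels` and the example labels are hypothetical and only meant as a starting point.

```python
from collections import Counter

def summarize_labels(labels):
    """Count clusters and noise points in a sequence of cluster labels (-1 means noise)."""
    counts = Counter(int(label) for label in labels)
    n_noise = counts.pop(-1, 0)  # remove the noise label so only real clusters remain
    return {
        "n_clusters": len(counts),
        "noise_fraction": n_noise / len(labels) if len(labels) else 0.0,
        "cluster_sizes": dict(sorted(counts.items())),
    }

# made-up labels for illustration; in the notebook you would pass clustering_model.labels_
example_labels = [0, 0, 1, -1, 1, 1, -1, 2, 0, 2]
summary = summarize_labels(example_labels)
print(summary)
```

With the fitted model from the cell above, you could call `summarize_labels(clustering_model.labels_)` after each parameter change and watch how the number of clusters and the noise fraction move.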

Once you settle on parameters that you like, try running the entire BERTopic pipeline and see what you get!

## GROUP 4: Labeling & Representation

In this group, you will control how each cluster is represented. As we already saw, the default representation approach didn't return optimal topics. We can improve on it by using the representation models implemented in BERTopic, which are not used by default. Different kinds of models are available, ranging from GPT-like models to keyword extraction methods such as KeyBERT. The example below demonstrates how to use a model inspired by KeyBERT that is available in the BERTopic package.

First, try running BERTopic without specifying a representation model. What kind of result do you get? Is this what we want? How can this be fixed? BERTopic explains how you can make better representations [here](https://maartengr.github.io/BERTopic/getting_started/best_practices/best_practices.html#improving-default-representation) and [here](https://maartengr.github.io/BERTopic/getting_started/representation/representation.html). Try different approaches for yourself! You are not limited to what's provided in the BERTopic package, however. As we saw in the demo, you can utilize models outside of BERTopic to try to extract topics and keywords.

In [None]:
from bertopic.representation import KeyBERTInspired

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
dimension_reduction_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine")
clustering_model = HDBSCAN(min_cluster_size=15, min_samples=1, cluster_selection_epsilon=0.165)
# None keeps BERTopic's default representation; replace it with a model such as KeyBERTInspired()
representation_model = None

# define topic model pipeline with BERTopic
topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=dimension_reduction_model,
    hdbscan_model=clustering_model,
    representation_model=representation_model
)

# begin topic modeling using the processed documents
topic_model.fit(docs)