# BERTopic years' trend

This script aims to generate latent topics in each year from input data sets.

Input file: csv files for the sentiment and emotion analysis.
Output file:

1.   topic information in each year, including representative documents and keywords
2.   corresponding models
3.   the distribution probabilities of each topic for every text
4.   the distribution probabilities of each topic every year



**NOTE:**
The script is adapted from https://colab.research.google.com/drive/1BoQ_vakEVtojsd2x_U6-_x52OOuqruj2?usp=sharing#scrollTo=Fo-Oig4Yib5K



# Enabling the GPU

First, enable GPUs for the notebook:

- Navigate to Edit→Notebook Settings
- Select GPU from the Hardware Accelerator drop-down

[Reference](https://colab.research.google.com/notebooks/gpu.ipynb)

# **Installing BERTopic**

We start by installing BERTopic from PyPI:

In [None]:
%%capture
!pip install bertopic
!pip install sentence-transformers
!pip install umap-learn

import pandas as pd
import matplotlib.pyplot as plt

## Restart the Notebook
Installing BERTopic updates some packages that were already loaded, so the notebook has to be restarted before they can be used correctly. A restart also clears all variables and imports, so re-run the `pandas` and `matplotlib` imports afterwards.

From the Menu:

Runtime → Restart Runtime

# Training

## **Data**



In [None]:
# Parameters
corpus = "red" # 🟡Only change this one
min_cluster_size = 300 # 🟡Only change this one
size = str(min_cluster_size)

# input data
from google.colab import drive
drive.mount("/content/drive")
input_path = f"/content/drive/MyDrive/Colab Notebooks/Masters_Thesis/0_corpus/preprocessed_for_sentiment_analysis/{corpus}_sentiment_df.csv"
dataset = pd.read_csv(input_path)

# output paths
output_csv_path = f"/content/drive/MyDrive/Colab Notebooks/Masters_Thesis/5_results/{corpus}_{size}_yearly_trend_topic_info.csv"
output_model_path = f"/content/drive/MyDrive/Colab Notebooks/Masters_Thesis/5_results/"
output_full_csv_path = f"/content/drive/MyDrive/Colab Notebooks/Masters_Thesis/5_results/{corpus}_{size}_text_topic_label_prob.csv"
output_yearly_trend_per_topic_path = f"/content/drive/MyDrive/Colab Notebooks/Masters_Thesis/5_results/{corpus}_{size}_yearly_trend_per_topic.csv"


# Extract the texts to train on (kept in a variable named `abstracts`)
abstracts = dataset["text"]
abstracts = abstracts.fillna("")
abstracts = abstracts.astype(str)

# # Load the model, if you have already got a trained model
# from sentence_transformers import SentenceTransformer
# from bertopic import BERTopic

# path = "/content/drive/MyDrive/Colab Notebooks/Masters_Thesis/5_results/model"
# # Define embedding model
# embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# # Load model and add embedding model
# topic_model = BERTopic.load(path, embedding_model=embedding_model)

In [None]:
abstracts[0]

'Discussing climate change with a skeptic on another site and they pull this link some kind of trump card. Reading in between the lines, I believe it doesn\'t mean what they think it means but I must confess I don\'t know what "forcings" or "polynomial cointegration" mean. Also, I am not familiar enough with Earth System Dynamics to know if they are a reputable organization. If anyone can shed light on this, I would be grateful.'

## **Pre-calculate Embeddings**
After having created our data, namely `abstracts`, we can dive into the very first best practice, **pre-calculating embeddings**.

BERTopic works by converting documents into numerical values, called embeddings. This process can be very costly, especially if we want to iterate over parameters. Instead, we can calculate those embeddings once and feed them to BERTopic to skip calculating embeddings each time.

In [None]:
from sentence_transformers import SentenceTransformer

# Pre-calculate embeddings
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(abstracts, show_progress_bar=True)

Batches:   0%|          | 0/4808 [00:00<?, ?it/s]
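Since the embeddings only depend on the input texts and the embedding model, they can also be stored on disk so that a restarted notebook does not have to recompute them. The following is a minimal sketch using only the standard library; `cached_embeddings`, `encode`, and `cache_dir` are hypothetical names that are not part of BERTopic or sentence-transformers.

```python
import hashlib
import pickle
from pathlib import Path


def cached_embeddings(texts, encode, cache_dir):
    """Return encode(texts), reusing a pickle on disk if these texts were encoded before.

    Assumption: the cache key depends only on the texts, so use a separate
    cache_dir for every embedding model.
    """
    digest = hashlib.sha256("\x00".join(texts).encode("utf-8")).hexdigest()[:16]
    cache_file = Path(cache_dir) / f"embeddings_{digest}.pkl"

    # Reuse the stored result when the same texts were seen before
    if cache_file.exists():
        with cache_file.open("rb") as fh:
            return pickle.load(fh)

    # Otherwise compute once and store the result for later runs
    result = encode(texts)
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    with cache_file.open("wb") as fh:
        pickle.dump(result, fh)
    return result
```

In this notebook, `encode` could be a small wrapper such as `lambda t: embedding_model.encode(t, show_progress_bar=True)` and `texts` could be `abstracts.tolist()`.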

## **Preventing Stochastic Behavior**
In BERTopic, we generally use a dimensionality reduction algorithm to reduce the size of the embeddings. This mitigates the [curse of dimensionality](https://en.wikipedia.org/wiki/Curse_of_dimensionality) to a certain degree.

As a default, this is done with [UMAP](https://github.com/lmcinnes/umap) which is an incredible algorithm for reducing dimensional space. However, by default, it shows stochastic behavior which creates different results each time you run it. To prevent that, we will need to set a `random_state` of the model before passing it to BERTopic.

As a result, we can now fully reproduce the results each time we run the model.

In [None]:
from umap import UMAP

umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine", random_state=42)
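The effect of a fixed seed can be illustrated without UMAP. The sketch below uses Python's `random` module as a stand-in for any stochastic algorithm; `sample_layout` is a hypothetical helper and does not reproduce what UMAP computes.

```python
import random


def sample_layout(n_points, seed=None):
    """Draw n_points random 2D coordinates; a fixed seed makes the draw repeatable."""
    rng = random.Random(seed)
    return [(rng.random(), rng.random()) for _ in range(n_points)]


# Same seed, same result on every run
same_a = sample_layout(3, seed=42)
same_b = sample_layout(3, seed=42)
print(same_a == same_b)

# A different seed gives a different layout
other = sample_layout(3, seed=43)
print(same_a == other)
```

Passing `random_state=42` to `UMAP` plays the same role as `seed=42` here.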

## **Controlling Number of Topics**
There is a parameter to control the number of topics, namely `nr_topics`. This parameter, however, merges topics **after** they have been created. It is a parameter that supports creating a fixed number of topics.

However, it is advised to control the number of topics through the cluster model, which is HDBSCAN by default. HDBSCAN has a parameter, namely `min_cluster_size`, that indirectly controls the number of topics that will be created.

A higher `min_cluster_size` will generate fewer topics and a lower `min_cluster_size` will generate more topics.

Here, we will go with the `min_cluster_size` set in the parameters cell above (300 in this run) to get around XXX topics.

In [None]:
from hdbscan import HDBSCAN

hdbscan_model = HDBSCAN(min_cluster_size=min_cluster_size, metric="euclidean", cluster_selection_method="eom", prediction_data=True)

## **Improving Default Representation**
The default representation of topics is calculated through [c-TF-IDF](https://maartengr.github.io/BERTopic/algorithm/algorithm.html#5-topic-representation). However, c-TF-IDF is powered by the [CountVectorizer](https://maartengr.github.io/BERTopic/getting_started/vectorizers/vectorizers.html) which converts text into tokens. Using the CountVectorizer, we can do a number of things:

* Remove stopwords
* Ignore infrequent words
* Increase the n-gram range

In other words, we can preprocess the topic representations **after** documents are assigned to topics. This will not influence the clustering process in any way.

Here, we will ignore English stopwords and infrequent words. Moreover, by increasing the n-gram range we will consider topic representations that are made up of one or two words.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction import text

stop_words = text.ENGLISH_STOP_WORDS.union(["ve", "ha","don","did","ll",
                                            "climate", "change", "just",
                                            "like","think","really","going",
                                            "thank","thanks","welcome",
                                            "lol", "ok", "okay","lmao",
                                            "sorry","sure","isn",'yes',
                                            'oh', 'yeah', 'shit', 'duh',
                                            'fuck', 'checks', 'boe', 'huh',
                                            'people'])
stop_words = list(stop_words)

vectorizer_model = CountVectorizer(stop_words=stop_words, min_df=0.01, ngram_range=(1, 2))
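To make the combined effect of stop-word removal and `ngram_range=(1, 2)` concrete, here is a simplified, standard-library sketch. `ngram_terms` is a hypothetical helper, not scikit-learn's implementation; it only mirrors two behaviours of `CountVectorizer`: tokens are runs of two or more word characters, and stop words are removed before n-grams are built. The `min_df` filter is not modelled.

```python
import re


def ngram_terms(text, stop_words, ngram_range=(1, 2)):
    """List the candidate terms of one document: unigrams first, then longer n-grams."""
    # Lowercase, keep tokens of at least two word characters, drop stop words
    tokens = [t for t in re.findall(r"\b\w\w+\b", text.lower()) if t not in stop_words]

    lowest, highest = ngram_range
    terms = []
    for n in range(lowest, highest + 1):
        # Slide a window of n tokens over the filtered token list
        terms.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return terms


print(ngram_terms("Sea ice in the Arctic is melting", {"in", "the", "is"}))
# ['sea', 'ice', 'arctic', 'melting', 'sea ice', 'ice arctic', 'arctic melting']
```

Because stop words are dropped first, a bigram such as `ice arctic` can join two words that were not adjacent in the original sentence.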

In [None]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
print(ENGLISH_STOP_WORDS)

frozenset({'whereas', 'sixty', 'cry', 'again', 'their', 'very', 'those', 'neither', 'thru', 're', 'whither', 'whose', 'another', 'amount', 'we', 'anyone', 'into', 'un', 'being', 'ever', 'twelve', 'name', 'both', 'none', 'across', 'ours', 'well', 'before', 'front', 'detail', 'always', 'after', 'but', 'get', 'please', 'also', 'besides', 'yourself', 'latterly', 'everyone', 'due', 'me', 'no', 'whereby', 'give', 'formerly', 'either', 'would', 'hers', 'somehow', 'together', 'first', 'such', 'out', 'if', 'other', 'call', 'take', 'though', 'fill', 'themselves', 'made', 'thereby', 'wherein', 'us', 'own', 'yourselves', 'yours', 'there', 'top', 'someone', 'in', 'keep', 'de', 'until', 'however', 'because', 'afterwards', 'amoungst', 'anything', 'sincere', 'were', 'bill', 'towards', 'should', 'had', 'ten', 'up', 'of', 'when', 'once', 'wherever', 'never', 'thereafter', 'your', 'etc', 'nevertheless', 'myself', 'becoming', 'whence', 'from', 'less', 'mine', 'below', 'during', 'move', 'upon', 'anywhere',

## **Additional Representations**
Previously, we have tuned the default representation but there are quite a number of [other topic representations](https://maartengr.github.io/BERTopic/getting_started/representation/representation.html) in BERTopic that we can choose from. From [KeyBERTInspired](https://maartengr.github.io/BERTopic/getting_started/representation/representation.html#keybertinspired) and [PartOfSpeech](https://maartengr.github.io/BERTopic/getting_started/representation/representation.html#partofspeech), to [OpenAI's ChatGPT](https://maartengr.github.io/BERTopic/getting_started/representation/llm.html#chatgpt) and [open-source](https://maartengr.github.io/BERTopic/getting_started/representation/llm.html#langchain) alternatives, many representations are possible.

In BERTopic, you can model many different topic representations simultaneously to test them out and get different perspectives of topic descriptions. This is called [multi-aspect](https://maartengr.github.io/BERTopic/getting_started/multiaspect/multiaspect.html) topic modeling.

Here, we will demonstrate a number of interesting and useful representations in BERTopic:

* KeyBERTInspired
  * A method that derives inspiration from how KeyBERT works
* PartOfSpeech
  * Using spaCy's POS tagging to extract words
* MaximalMarginalRelevance
  * Diversify the topic words
* OpenAI
  * Use ChatGPT to label our topics (not configured in the code cell below)


In [None]:
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance, PartOfSpeech

# KeyBERT
keybert_model = KeyBERTInspired()

# Part-of-Speech
pos_model = PartOfSpeech("en_core_web_sm")

# MMR
mmr_model = MaximalMarginalRelevance(diversity=0.3)

# All representation models
representation_model = {
    "KeyBERT": keybert_model,
    "MMR": mmr_model,
    "POS": pos_model
}
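The idea behind Maximal Marginal Relevance can be sketched in plain Python: after picking the most relevant word, each further pick trades relevance to the topic against similarity to the words already chosen. This is an illustrative sketch, not BERTopic's code; `mmr_select` and the similarity values are made up for the example.

```python
def mmr_select(topic_sim, word_sim, top_n, diversity=0.3):
    """Return indices of top_n words chosen by Maximal Marginal Relevance.

    topic_sim[i]   : similarity of word i to the topic
    word_sim[i][j] : similarity between word i and word j
    """
    # Start with the word most similar to the topic
    selected = [max(range(len(topic_sim)), key=lambda i: topic_sim[i])]
    candidates = [i for i in range(len(topic_sim)) if i != selected[0]]

    while candidates and len(selected) < top_n:
        def score(i):
            # Penalise words that resemble something already selected
            redundancy = max(word_sim[i][j] for j in selected)
            return (1 - diversity) * topic_sim[i] - diversity * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


words = ["nuclear", "nuclear power", "solar"]
topic_sim = [0.90, 0.85, 0.60]          # hypothetical values
word_sim = [[1.00, 0.95, 0.20],
            [0.95, 1.00, 0.25],
            [0.20, 0.25, 1.00]]

print([words[i] for i in mmr_select(topic_sim, word_sim, top_n=2, diversity=0.3)])
# ['nuclear', 'solar']
print([words[i] for i in mmr_select(topic_sim, word_sim, top_n=2, diversity=0.0)])
# ['nuclear', 'nuclear power']
```

With `diversity=0.3`, the near-duplicate `nuclear power` loses to `solar`; with `diversity=0.0`, the selection falls back to pure relevance.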

## **Training**
Now that we have a set of best practices, we can use them in our training loop. Here, several different representations, keywords and labels for our topics will be created. If you want to iterate over the topic model it is advised to use the pre-calculated embeddings as that significantly speeds up training.

In [None]:
from bertopic import BERTopic

# If the results do not make sense, then change this parameter and run it again.
# min_cluster_size = 200
# size = str(min_cluster_size)
# hdbscan_model = HDBSCAN(min_cluster_size=min_cluster_size, metric="euclidean", cluster_selection_method="eom", prediction_data=True)
# output_csv_path = f"/content/drive/MyDrive/Colab Notebooks/Masters_Thesis/5_results/{corpus}_{size}_yearly_trend_topic_info.csv"
# output_model_path = f"/content/drive/MyDrive/Colab Notebooks/Masters_Thesis/5_results/"
# output_full_csv_path = f"/content/drive/MyDrive/Colab Notebooks/Masters_Thesis/5_results/{corpus}_{size}_text_topic_label_prob.csv"
# output_yearly_trend_per_topic_path = f"/content/drive/MyDrive/Colab Notebooks/Masters_Thesis/5_results/{corpus}_{size}_yearly_trend_per_topic.csv"

topic_model = BERTopic(

  # Pipeline models
  embedding_model=embedding_model,
  umap_model=umap_model,
  hdbscan_model=hdbscan_model,
  vectorizer_model=vectorizer_model,
  representation_model=representation_model,

  # Hyperparameters
  top_n_words=10,
  verbose=True,
  calculate_probabilities=True # show probs of all topics for each text
)

topics, probs = topic_model.fit_transform(abstracts, embeddings)

2024-02-08 16:42:48,994 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-02-08 16:46:01,319 - BERTopic - Dimensionality - Completed ✓
2024-02-08 16:46:01,324 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-02-08 16:47:11,411 - BERTopic - Cluster - Completed ✓
2024-02-08 16:47:11,445 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-02-08 16:48:22,107 - BERTopic - Representation - Completed ✓


In [None]:
topic_model.get_topic_info()

| | Topic | Count | Name | Representation | KeyBERT | MMR | POS | Representative_Docs |
|---|---|---|---|---|---|---|---|---|
| 0 | -1 | 77534 | -1_carbon_warming_dioxide_carbon dioxide | [carbon, warming, dioxide, carbon dioxide, yea... | [global warming, warming, emissions, greenhous... | [carbon, warming, dioxide, carbon dioxide, yea... | [carbon, warming, dioxide, years, global, time... | [> \-Lol - it sounds like you believe I am jus... |
| 1 | 0 | 16823 | 0_nuclear_energy_power_carbon | [nuclear, energy, power, carbon, emissions, so... | [nuclear power, nuclear, renewable energy, rea... | [nuclear, energy, power, carbon, emissions, so... | [nuclear, energy, power, carbon, emissions, so... | [> I'm guessing that you could be the author o... |
| 2 | 1 | 15655 | 1_science_scientific_read_post | [science, scientific, read, post, comment, art... | [debate, scientific, argument, discussion, con... | [science, scientific, read, post, comment, art... | [science, scientific, post, comment, article, ... | [> Maybe you don't like holes being poked, but... |
| 3 | 2 | 4752 | 2_science_scientists_scientific_believe | [science, scientists, scientific, believe, evi... | [global warming, warming, skeptics, scientists... | [science, scientists, scientific, believe, evi... | [science, scientists, scientific, evidence, wa... | [I knew you were not serious about a serious d... |
| 4 | 3 | 3129 | 3_data_temperature_years_warming | [data, temperature, years, warming, period, te... | [global temperature, temperature data, warming... | [data, temperature, years, warming, period, te... | [data, temperature, years, warming, period, te... | [You are wrong again. Let me explain, in detai... |
| 5 | 4 | 2786 | 4_extinction_species_humans_life | [extinction, species, humans, life, planet, hu... | [mass extinction, extinction, extinctions, ext... | [extinction, species, humans, life, planet, hu... | [extinction, species, humans, life, planet, hu... | [It depends on how fast it happens. If it happ... |
| 6 | 5 | 2427 | 5_food_meat_agriculture_crops | [food, meat, agriculture, crops, crop, vegan, ... | [animal agriculture, livestock, veganism, meat... | [food, meat, agriculture, crops, crop, vegan, ... | [food, meat, agriculture, crops, crop, animal,... | [The problem is that when reports come out tha... |
| 7 | 6 | 2249 | 6_trees_carbon_tree_forest | [trees, carbon, tree, forest, forests, plantin... | [plant trees, trees, planting trees, tree plan... | [trees, carbon, tree, forest, forests, plantin... | [trees, carbon, tree, forest, forests, wood, d... | [PLANT TREES, Less cry, More trees, Planting t... |
| 8 | 7 | 2222 | 7_radiation_atmosphere_carbon dioxide_dioxide | [radiation, atmosphere, carbon dioxide, dioxid... | [greenhouse gases, greenhouse effect, carbon d... | [radiation, atmosphere, carbon dioxide, dioxid... | [radiation, atmosphere, dioxide, carbon, heat,... | [Global warming and scaremongering !! Abstract... |
| 9 | 8 | 1877 | 8_ice_sea_sea ice_arctic | [ice, sea, sea ice, arctic, antarctic, antarct... | [antarctic ice, greenland ice, arctic ice, ice... | [ice, sea, sea ice, arctic, antarctic, antarct... | [ice, sea, arctic, antarctic, melting, sheet, ... | [Good question. The Arctic 'north pole' is all... |
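In the output above, topic -1 (the outlier topic) holds 77,534 documents. When tuning `min_cluster_size`, it can help to track how many topics were found and what share of documents ended up as outliers. A minimal sketch, assuming `topics` is the list of topic IDs returned by `fit_transform`; `summarize_topics` is a hypothetical helper.

```python
from collections import Counter


def summarize_topics(topics):
    """Count the non-outlier topics and the share of documents labelled -1."""
    counts = Counter(topics)
    total = len(topics)
    n_topics = sum(1 for topic_id in counts if topic_id != -1)
    outlier_share = counts.get(-1, 0) / total if total else 0.0
    return {"n_topics": n_topics, "outlier_share": outlier_share}


# Toy example: 8 documents, 3 topics, 2 outliers
print(summarize_topics([-1, 0, 0, 1, -1, 2, 0, 1]))
# {'n_topics': 3, 'outlier_share': 0.25}
```

Comparing this summary across runs with different `min_cluster_size` values gives a quick way to see whether a change produced more topics or simply more outliers.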


Save the results

In [None]:
topic_info = topic_model.get_topic_info()
topic_info_df = pd.DataFrame(topic_info)
topic_info_df.to_csv(output_csv_path, index=False)

To get all representations for a single topic, we simply run the following:

In [None]:
topic_model.get_topic(0, full=True)

{'Main': [('nuclear', 0.01680408279838093),
  ('energy', 0.01620995806113373),
  ('power', 0.013211238189541948),
  ('carbon', 0.013069083410763512),
  ('emissions', 0.011420672313628638),
  ('solar', 0.009852358041863105),
  ('fossil', 0.008992102891011256),
  ('wind', 0.008339558307905077),
  ('fuel', 0.0075833315450387915),
  ('coal', 0.0075702056171917205)],
 'KeyBERT': [('nuclear power', 0.6262368),
  ('nuclear', 0.54513156),
  ('renewable energy', 0.5355706),
  ('reactors', 0.5101001),
  ('renewables', 0.46571666),
  ('renewable', 0.45368916),
  ('fossil fuels', 0.42826384),
  ('fossil fuel', 0.41791654),
  ('fuels', 0.33647805),
  ('emissions', 0.33093527)],
 'MMR': [('nuclear', 0.01680408279838093),
  ('energy', 0.01620995806113373),
  ('power', 0.013211238189541948),
  ('carbon', 0.013069083410763512),
  ('emissions', 0.011420672313628638),
  ('solar', 0.009852358041863105),
  ('fossil', 0.008992102891011256),
  ('wind', 0.008339558307905077),
  ('fuel', 0.0075833315450387915)

**NOTE**: The labels generated by OpenAI's **ChatGPT** are especially interesting to use throughout your model. Below, we will go into more detail on how to set that as a custom label.

# Presenting

## **(Custom) Labels**
The default label of each topic is the topic ID followed by the top 4 words of that topic, joined by underscores (for example, `0_nuclear_energy_power_carbon`).

This, of course, might not be the best label that you can think of for a certain topic. Instead, we can use `.set_topic_labels` to manually label all or certain topics.

We can also use `.set_topic_labels` to use one of the other topic representations that we had before, like `KeyBERTInspired` or even `OpenAI`.

**However, I labeled the topics manually.**

In [None]:
# label dictionary
topic_to_label = {

}

# Label the topics yourself
topic_model.set_topic_labels(topic_to_label)

# Update the dataset and map the labels to topics
dataset["topic"] = topics
dataset["probs"] = probs.tolist()
# dataset["label"] = dataset["topic"].map(topic_to_label)

# Check the results
print(dataset.head())

# Save the csv
dataset.to_csv(output_full_csv_path, index=False)
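As an alternative to filling in `topic_to_label` by hand, the dictionary can be derived from one of the additional representations. The sketch below assumes an input shaped like the output of `topic_model.get_topic(topic, full=True)` shown earlier, collected per topic; `labels_from_aspect` is a hypothetical helper and the toy scores are rounded from that output.

```python
def labels_from_aspect(topic_aspects, aspect="KeyBERT", n_words=3):
    """Build {topic_id: label} from the top words of one representation."""
    labels = {}
    for topic_id, aspects in topic_aspects.items():
        if topic_id == -1:
            continue  # leave the outlier topic with its default label
        words = [word for word, _score in aspects[aspect][:n_words]]
        labels[topic_id] = ", ".join(words)
    return labels


# Toy input mirroring the structure of get_topic(0, full=True)
toy_aspects = {
    0: {"KeyBERT": [("nuclear power", 0.63), ("nuclear", 0.55),
                    ("renewable energy", 0.54), ("reactors", 0.51)]},
}
print(labels_from_aspect(toy_aspects))
# {0: 'nuclear power, nuclear, renewable energy'}
```

In the notebook, the input could be collected with `{t: topic_model.get_topic(t, full=True) for t in topic_model.get_topics()}` and the result passed to `topic_model.set_topic_labels`.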

**🔥 Tip - Parameters 🔥**
***
If you would like to return the topic-document probability matrix, then it is advised to use `calculate_probabilities=True`. Do note that this can significantly slow down training. To speed it up, use [cuML's HDBSCAN](https://maartengr.github.io/BERTopic/getting_started/clustering/clustering.html#cuml-hdbscan) instead. You could also approximate the topic-document probability matrix with `.approximate_distribution`.
***

## Topical Trend Analysis

### Calculate yearly average probs per topic and define functions

In [None]:
# Create dicts holding, per year, the running sum of topic probabilities and the number of texts
# Texts assigned to Topic -1 (noise) still contribute their probability distribution (vector)
yearly_topic_probs_sum = {}
yearly_texts_count = {}

for index, row in dataset.iterrows():
    year = row["year"]
    topic_probs = row["probs"]

    # make sure that every year's topic weight distribution is initialized
    if year not in yearly_topic_probs_sum:
        yearly_topic_probs_sum[year] = [0] * len(topic_probs)
        yearly_texts_count[year] = 0

    # sum up the prob of each topic respectively
    yearly_topic_probs_sum[year] = [sum(x) for x in zip(yearly_topic_probs_sum[year], topic_probs)]
    # save the number of texts of each year
    yearly_texts_count[year] += 1

# calculate the avg prob of each topic in every year
yearly_avg_topic_probs = {year: [prob / yearly_texts_count[year] for prob in probs]
                          for year, probs in yearly_topic_probs_sum.items()}


# Save the yearly average topic probs
label_probs_df = pd.DataFrame(list(yearly_avg_topic_probs.items()), columns=["year", "probs"])
label_probs_df.to_csv(output_yearly_trend_per_topic_path, index=False)
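The CSV written above stores each year's averages as one list in a single `probs` column, which is awkward to read back or plot. A minimal sketch of reshaping the same dictionary into one column per topic follows; `to_wide_rows` is a hypothetical helper, and it assumes that position `i` in each probability vector corresponds to topic `i` (the outlier topic -1 has no column).

```python
def to_wide_rows(yearly_avg_topic_probs):
    """Turn {year: [p0, p1, ...]} into rows like {'year': y, 'topic_0': p0, ...}."""
    rows = []
    for year in sorted(yearly_avg_topic_probs):
        row = {"year": year}
        for topic_id, prob in enumerate(yearly_avg_topic_probs[year]):
            row[f"topic_{topic_id}"] = prob
        rows.append(row)
    return rows


# Toy example with two years and two topics
toy = {2020: [0.2, 0.1], 2019: [0.3, 0.05]}
for row in to_wide_rows(toy):
    print(row)
# {'year': 2019, 'topic_0': 0.3, 'topic_1': 0.05}
# {'year': 2020, 'topic_0': 0.2, 'topic_1': 0.1}
```

The rows can be passed directly to `pd.DataFrame(...)`, which yields one line per year and one column per topic.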

## **Serialization**

When saving a BERTopic model, there are several ways to do so. You can save the entire model with `pickle`, or save a lighter version with `pytorch` or `safetensors`.

Personally, I would advise going with `safetensors` whenever possible. The reason for this is that the format allows for a very small topic model to be saved and shared.

When saving a model with `safetensors`, it skips over saving the dimensionality reduction and clustering models. The `.transform` function will still work without these models but instead assign topics based on the similarity between document embeddings and the topic embeddings.

As a result, the `.transform` step might give different results but it is generally worth it considering the smaller and significantly faster model.

In [None]:
# Save the model
embedding_model = "sentence-transformers/all-MiniLM-L6-v2"
topic_model.save(output_model_path, serialization="safetensors", save_ctfidf=True, save_embedding_model=embedding_model)