<a href="https://colab.research.google.com/github/michaelachmann/social-media-lab/blob/main/notebooks/2024_11_18_BERTopic_V2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BERTopic [![DOI](https://zenodo.org/badge/660157642.svg)](https://zenodo.org/badge/latestdoi/660157642)
![Notes on (Computational) Social Media Research Banner](https://raw.githubusercontent.com/michaelachmann/social-media-lab/main/images/banner.png)

## Overview

This Jupyter notebook is a part of the social-media-lab.net project, which is a work-in-progress textbook on computational social media analysis. The notebook is intended for use in my classes.

The **BERTopic** notebook uses the `bertopic` package for transformer-based topic modeling. The package has excellent documentation: https://maartengr.github.io/BERTopic/index.html and a tutorial notebook: https://github.com/MaartenGr/BERTopic/blob/master/notebooks/BERTopic.ipynb

**2024 Updates:**
* Updated the sample data to the US 2024 Elections
* Replaced German with English stopword list
* Added a cell to combine Topics with Documents

### Project Information

- Project Website: [social-media-lab.net](https://social-media-lab.net/)
- GitHub Repository: [https://github.com/michaelachmann/social-media-lab](https://github.com/michaelachmann/social-media-lab)

## License Information

This notebook, along with all other notebooks in the project, is licensed under the following terms:

- License: [GNU General Public License version 3.0 (GPL-3.0)](https://www.gnu.org/licenses/gpl-3.0.de.html)
- License File: [LICENSE.md](https://github.com/michaelachmann/social-media-lab/blob/main/LICENSE.md)


## Citation

If you use or reference this notebook in your work, please cite it appropriately. Here is an example of the citation:

```
Michael Achmann. (2024). michaelachmann/social-media-lab: 2024-11-18 (v0.0.14). Zenodo. https://doi.org/10.5281/zenodo.8199901
```

## Import the Data Frame
### Meta Content Library

In [1]:
#@markdown Read the `csv` file. Select the correct `text_column` below. We called this column `Text` in our **Text Master** table (see [Preprocessing Notebook](https://github.com/michaelachmann/social-media-lab/blob/main/notebooks/2024_11_11_Preprocessing.ipynb) and [OCR & Whisper Documentation](https://social-media-lab.net/processing/preprocessing.html#ocr-whisper)). When using raw data from, e.g., **4CAT / Zeeschuimer**, the Instagram caption column is called `body`; in **Meta Content Library Exports** it is `text`.

import pandas as pd

csv_file = "/content/drive/MyDrive/2024-11-18-US-Sample.csv" #@param {type:"string"}
text_column = "text" #@param {type:"string"}


df = pd.read_csv(csv_file)

*Note:* The data used in this tutorial has been retrieved through the Meta Content Library [1]. Snippets shown are for demonstration purposes only, while the datasets are part of my ongoing research.

[1] Meta Platforms, Inc. (n.d.). Meta Content Library API version v5.0. https://doi.org/10.48680/meta.metacontentlibraryapi.5.0
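
Because the caption column is named differently depending on the export (`Text`, `body`, or `text`, as described in the import cell), a quick check before building the corpus can prevent a confusing `KeyError` later on. The following is only a sketch: `find_text_column` is a hypothetical helper (not part of pandas), and the tiny DataFrame stands in for the real export.

```python
import pandas as pd

# Column names mentioned in the import cell:
# Text Master table, 4CAT / Zeeschuimer, Meta Content Library.
CANDIDATE_COLUMNS = ("Text", "body", "text")


def find_text_column(frame: pd.DataFrame, candidates=CANDIDATE_COLUMNS) -> str:
    """Return the first candidate column present in `frame` (hypothetical helper)."""
    for name in candidates:
        if name in frame.columns:
            return name
    raise KeyError(f"None of {candidates} found in columns {list(frame.columns)}")


# Toy frame standing in for a Meta Content Library export.
sample = pd.DataFrame({"id": [1, 2], "text": ["first post", None]})
detected = find_text_column(sample)
print(detected, "| missing captions:", sample[detected].isna().sum())
```

If the helper raises, inspect `df.columns` and set `text_column` by hand.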


## BERTopic

In [2]:
!pip install -q bertopic

In the following cells we download a stopword list for English (the language of the sample data; see the code comment for German texts) and apply it according to [the documentation](https://maartengr.github.io/BERTopic/faq.html#how-do-i-remove-stop-words).

In [3]:
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

# Replace with "german" for German social media texts.
STOPWORDS = stopwords.words('english')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [4]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words=STOPWORDS)

Now we're ready to create our corpus in `docs`, a list of text documents to pass to `BERTopic`.

In [16]:
# We create our corpus

# We filter empty captions out
filtered_df = df[~pd.isna(df[text_column])]

# And create a list of documents
docs = filtered_df[text_column].tolist()
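
Note that the `pd.isna` filter only drops missing values; captions that are empty strings or consist solely of whitespace stay in the corpus. Whether that matters depends on your data. The sketch below uses a toy DataFrame and a stricter filter, which is an assumption about what should count as "empty", not part of the original pipeline.

```python
import pandas as pd

toy = pd.DataFrame({"id": [1, 2, 3, 4], "text": ["A real caption", None, "", "   "]})

# Same filter as in the cell above: only missing values are dropped.
not_missing = toy[~pd.isna(toy["text"])]

# Stricter variant (an assumption): also drop empty and whitespace-only strings.
has_text = toy["text"].fillna("").str.strip() != ""
non_blank = toy[has_text]

print(len(not_missing), len(non_blank))
```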

For this tutorial I chose the *simplest* version of BERTopic. [The Documentation](https://maartengr.github.io/BERTopic/index.html#installation) offers **a lot** of ideas to improve your topic models.

In [17]:
from bertopic import BERTopic

# When dealing with German texts choose 'multilingual'.
# When dealing with English texts exclusively, choose 'english'
topic_model = BERTopic(language="english", calculate_probabilities=True, verbose=True, vectorizer_model=vectorizer_model)
topics, probs = topic_model.fit_transform(docs)

2024-11-18 09:44:49,940 - BERTopic - Embedding - Transforming documents to embeddings.

2024-11-18 09:45:09,624 - BERTopic - Embedding - Completed ✓
2024-11-18 09:45:09,625 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-11-18 09:45:18,238 - BERTopic - Dimensionality - Completed ✓
2024-11-18 09:45:18,241 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-11-18 09:46:00,385 - BERTopic - Cluster - Completed ✓
2024-11-18 09:46:00,393 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-11-18 09:46:02,473 - BERTopic - Representation - Completed ✓
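
With `calculate_probabilities=True`, `probs` should, as far as I understand the BERTopic documentation, be a two-dimensional array with one row per document and one column per topic. The sketch below shows how such a matrix could be summarized per document. The numbers are made up, not model output, and the threshold is a hypothetical cut-off chosen purely for illustration.

```python
import numpy as np

# Made-up stand-in for `probs`: 3 documents x 4 topics (not real model output).
toy_probs = np.array([
    [0.70, 0.10, 0.10, 0.05],
    [0.05, 0.05, 0.05, 0.05],
    [0.10, 0.20, 0.60, 0.10],
])

best_topic = toy_probs.argmax(axis=1)  # column index of the most probable topic
best_prob = toy_probs.max(axis=1)      # probability of that topic

# Hypothetical cut-off: flag documents whose best topic is still unlikely.
THRESHOLD = 0.5
confident = best_prob >= THRESHOLD
print(best_topic.tolist(), confident.tolist())
```

Documents with a flat probability row, like the second one here, are candidates for manual inspection.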


**The following cells have been copied from the [BERTopic Tutorial](https://github.com/MaartenGr/BERTopic/blob/master/notebooks/BERTopic.ipynb).** Please check the linked notebook for more functions and the documentation for more background information.

## Extracting Topics
After fitting our model, we can start by looking at the results. Typically, we look at the most frequent topics first as they best represent the collection of documents.

In [18]:
freq = topic_model.get_topic_info(); freq.head(5)

| | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | -1 | 3503 | `-1_trump_kamalaharris_election_vote` | `[trump, kamalaharris, election, vote, kamala, ...` | `[GARBAGE GATE: The White House press office al...` |
| 1 | 0 | 183 | `0_redactedmention_redactedmention redactedment...` | `[redactedmention, redactedmention redactedment...` | `[Follow <redacted_mention>.\n.\n.\n.\n.\n.\n.\...` |
| 2 | 1 | 135 | `1_redactedmention_redactedmention redactedment...` | `[redactedmention, redactedmention redactedment...` | `[What’s The Reason Men Don’t Know How To Act A...` |
| 3 | 2 | 132 | `2_garden_rally_square_square garden` | `[garden, rally, square, square garden, madison...` | `[Media across the spectrum framed headlines ar...` |
| 4 | 3 | 130 | `3_vote_voting_trump kamala_kamalaharris` | `[vote, voting, trump kamala, kamalaharris, vot...` | `[Happy voting day America! Your 2 main candida...` |


Topic `-1` collects all outliers and should typically be ignored. Next, let's check how many topics were generated:

In [19]:
len(freq)

247

The table has 247 rows: 246 topics plus the outlier topic `-1`. Let's inspect the most frequent topic:
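
Keep in mind that `len(freq)` counts the outlier row as well. A minimal sketch of separating the two, using only the five rows printed above rebuilt by hand (the real table is much longer):

```python
import pandas as pd

# The five rows shown above, rebuilt by hand (the real table has many more rows).
toy_info = pd.DataFrame({
    "Topic": [-1, 0, 1, 2, 3],
    "Count": [3503, 183, 135, 132, 130],
})

is_outlier = toy_info["Topic"] == -1
n_topics = int((~is_outlier).sum())  # real topics, outlier row excluded
outlier_docs = int(toy_info.loc[is_outlier, "Count"].sum())

print(n_topics, outlier_docs)
```

Applied to the full `freq` table, the same two lines give the number of actual topics and the number of documents that were not assigned to any of them.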

In [20]:
topic_model.get_topic(0)  # Select the most frequent topic

[('redactedmention', 0.034268711110545434),
 ('redactedmention redactedmention', 0.03101189735711171),
 ('redactedmention trump', 0.015278780092564016),
 ('trump redactedmention', 0.011812356869575508),
 ('trump2024', 0.008812790024905162),
 ('repost', 0.007661381022314542),
 ('maga', 0.007292623754943119),
 ('repost redactedmention', 0.0069858916403627794),
 ('follow', 0.0069641741565703765),
 ('trump', 0.006952301504586322)]

## Combine Topics with Documents

In [21]:
# Create a DataFrame that pairs each post ID and caption with its assigned topic

post_ids = filtered_df['id'].tolist()

docs_w_topics_df = pd.DataFrame({
    'post_id': post_ids,
    'caption': docs,
    'topic': topics,
})

In [22]:
docs_w_topics_df.head()

| | post_id | caption | topic |
|---|---|---|---|
| 0 | 1246922449869249 | `🚨Did I hear that right!? \n\n𝔀𝔀𝔀.𝓢𝓮𝓪𝓞𝓯𝓜𝓾𝓭.𝓬𝓸𝓶\...` | 0 |
| 1 | 380723045031443 | `New ‘Kamalexa’ Amazon Echo Rambles And Never A...` | 6 |
| 2 | 1511705799536753 | `🩸🇺🇸🦅 BLOOD OF JESUS CLEAN THE STREETS OF OUR N...` | 5 |
| 3 | 488940517503443 | `#Trump #Republicans #MAGA #TrumpSupporters #GO...` | 4 |
| 4 | 3908844146026596 | `Make America $CHILL Again 🫡\n\n#LumiChill #Tru...` | 21 |
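
A bare topic number is hard to read, so it can help to attach the topic name from `get_topic_info()` to each document. The sketch below shows one way to do this with a pandas merge. Both frames are toy stand-ins with invented captions and shortened names; only the column names `Topic` and `Name` are taken from the output shown earlier.

```python
import pandas as pd

# Toy stand-ins for `docs_w_topics_df` and `topic_model.get_topic_info()`.
toy_docs = pd.DataFrame({
    "post_id": [101, 102, 103],
    "caption": ["rally tonight", "go vote", "follow us"],
    "topic": [2, 3, -1],
})
toy_info = pd.DataFrame({
    "Topic": [-1, 2, 3],
    "Name": ["-1_outliers", "2_garden_rally", "3_vote_voting"],
})

# A left join keeps every document, even if a topic were missing from the info table.
labeled = toy_docs.merge(
    toy_info.rename(columns={"Topic": "topic", "Name": "topic_name"}),
    on="topic",
    how="left",
)
print(labeled[["post_id", "topic", "topic_name"]])
```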


## Visualize Topics
After having trained our `BERTopic` model, we can iteratively go through perhaps a hundred topics to get a good
understanding of the topics that were extracted. However, that takes quite some time and lacks a global representation.
Instead, we can visualize the generated topics in a way very similar to
[LDAvis](https://github.com/cpsievert/LDAvis):

In [23]:
topic_model.visualize_topics()

## Visualize Terms

We can visualize the selected terms for a few topics by creating bar charts out of the c-TF-IDF scores for each topic representation. Insights can be gained from the relative c-TF-IDF scores between and within topics. Moreover, you can easily compare topic representations to each other.

In [24]:
topic_model.visualize_barchart(top_n_topics=15)
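
For reference, the c-TF-IDF score behind these bar charts is, as described in the BERTopic documentation, a class-based variant of TF-IDF in which all documents of a topic are treated as one class. Sketched for a term $t$ in class $c$ (the symbol names are my choice):

```latex
W_{t,c} = \mathrm{tf}_{t,c} \cdot \log\left(1 + \frac{A}{f_t}\right)
```

Here $\mathrm{tf}_{t,c}$ is the frequency of term $t$ in class $c$, $f_t$ is the frequency of $t$ across all classes, and $A$ is the average number of words per class. A term scores high when it is frequent within one topic but rare overall.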

## Topic Reduction
We can also reduce the number of topics after having trained a BERTopic model. The advantage of doing so
is that you can decide on the number of topics after knowing how many were actually created. It is difficult to
predict before training how many topics are in your documents and how many will be extracted.
Instead, we can decide afterwards how many topics seem realistic:





In [25]:
topic_model.reduce_topics(docs, nr_topics=15)

2024-11-18 09:48:07,664 - BERTopic - Topic reduction - Reducing number of topics
2024-11-18 09:48:10,493 - BERTopic - Topic reduction - Reduced number of topics from 247 to 15


<bertopic._bertopic.BERTopic at 0x7bf124ff98d0>
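
`reduce_topics` updates the model in place. To my understanding, the new assignments are then available as `topic_model.topics_`, while the `topics` list returned earlier by `fit_transform` (and any DataFrame built from it) still holds the pre-reduction labels. The sketch below refreshes such a DataFrame. `refresh_topics` is a hypothetical helper, and the stub object only mimics the `topics_` attribute so that the example runs without a fitted model.

```python
from types import SimpleNamespace

import pandas as pd


def refresh_topics(frame: pd.DataFrame, model) -> pd.DataFrame:
    """Return a copy of `frame` whose `topic` column mirrors `model.topics_`."""
    if len(model.topics_) != len(frame):
        raise ValueError("Model and DataFrame cover a different number of documents")
    updated = frame.copy()
    updated["topic"] = list(model.topics_)
    return updated


# Stub standing in for a fitted, reduced BERTopic model (only `topics_` is mimicked).
stub_model = SimpleNamespace(topics_=[0, 3, -1])
stale = pd.DataFrame({"post_id": [101, 102, 103], "topic": [12, 87, 40]})
print(refresh_topics(stale, stub_model))
```

In the notebook this would be called as `refresh_topics(docs_w_topics_df, topic_model)`.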

## Visualize Terms Again


In [26]:
topic_model.visualize_barchart(top_n_topics=15)

In [27]:
topic_model.visualize_topics()

## Model Serialization
The model and its internal settings can easily be saved. Note that the documents and embeddings will not be saved. However, UMAP and HDBSCAN will be saved.

In [14]:
# Save model
topic_model.save("/content/drive/MyDrive/2024-11-18-US24-ContentLibrary-Posts-model")
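
To reuse the model in a later session, it can be restored with `BERTopic.load`. The wrapper below is a small sketch: `load_topic_model` is a hypothetical helper that checks the path first, because a mistyped Drive path is a common source of errors in Colab.

```python
from pathlib import Path


def load_topic_model(path: str):
    """Load a saved BERTopic model, failing early if the path is wrong."""
    if not Path(path).exists():
        raise FileNotFoundError(f"No saved model at {path}")
    # Imported lazily so the path check works even before bertopic is installed.
    from bertopic import BERTopic
    return BERTopic.load(path)


# Usage (same path as in the save cell above):
# topic_model = load_topic_model("/content/drive/MyDrive/2024-11-18-US24-ContentLibrary-Posts-model")
```

Since documents are not stored with the model, keep the CSV file alongside it if you want to inspect captions later.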

