# Topic Modeling on Short Text with BERTopic and BERTweet

### Team members: Emily Altland, Terryl Dodson, Maheep Mahat, Daniel Manesh, Kiet Nguyen, Tianjiao Yu

Semester: Spring 2022

Instructor: Dr. Dawei Zhou

First, we import the necessary packages to run our code:

In [13]:
import pandas as pd
from datetime import datetime
import matplotlib.pyplot as plt
import numpy as np
import os
import sys
import argparse
import torch

# BERTopic (our modification of the source code)
from bertopic._bertopic import BERTopic

# Dimension reduction
from umap import UMAP
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Embeddings
from flair.embeddings import TransformerDocumentEmbeddings
from sentence_transformers import SentenceTransformer
from transformers import AutoModel, AutoTokenizer
from sklearn.feature_extraction.text import CountVectorizer

# Evaluation
import gensim.corpora as corpora
from gensim.models import CoherenceModel

The function below trains a BERTopic model. It takes the data as a pandas Series, an embedding model (in our case, either a Sentence Transformers model or BERTweet), and a dimension reduction model (UMAP, PCA, or t-SNE).

In [14]:
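def train_bertopic(data, embedding_model, dimension_reduction_model):
    """Fit BERTopic on a Series of documents; return the model and per-document topics."""
    vectorizer_model = CountVectorizer(ngram_range=(1, 1), min_df=1)
    # Caution: this check inspects the embedding model, which is never a TSNE
    # instance in main(), so nr_topics is "auto" for every run logged below.
    if isinstance(embedding_model, TSNE):
        nr_topics = 5
    else:
        nr_topics = "auto"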
    topic_model = BERTopic(
        embedding_model=embedding_model,
        nr_topics=nr_topics,
        top_n_words=20,
        min_topic_size=30,
        verbose=True,
        low_memory=True,
        vectorizer_model=vectorizer_model,
        umap_model=dimension_reduction_model,
    )
    topics, _ = topic_model.fit_transform(data.tolist())
    return topic_model, topics

In [15]:
def get_coherence_score(data, topic_model, topics, coherence):
    """Score the fitted topics with a gensim coherence measure (e.g. "u_mass")."""
    # Reuse the tokenizer of the CountVectorizer that BERTopic was trained with
    vectorizer = topic_model.vectorizer_model
    tokenizer = vectorizer.build_tokenizer()

    # Tokenize every document, then build the gensim dictionary and bag-of-words corpus
    tokens = [tokenizer(doc) for doc in data]

    dictionary = corpora.Dictionary(tokens)
    corpus = [dictionary.doc2bow(token) for token in tokens]
    # Top words of topics 0..k-1; the "- 1" skips the outlier topic -1,
    # which assumes at least one document was assigned to it
    topic_words = [
        [words for words, _ in topic_model.get_topic(topic) if words != ""]
        for topic in range(len(set(topics)) - 1)
    ]

    # Evaluate
    coherence_model = CoherenceModel(
        topics=topic_words,
        texts=tokens,
        corpus=corpus,
        dictionary=dictionary,
        coherence=coherence,
    )
    return coherence_model.get_coherence()
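
To make the `"u_mass"` option more concrete, here is a minimal pure-Python sketch of a UMass-style coherence for a single topic. It is an illustration, not gensim's implementation: the function name `umass_coherence`, the toy documents, and the smoothing constant `eps` are assumptions made for this example, and gensim's aggregation details may differ. For each pair of topic words, it takes the log of the smoothed co-document count divided by the document count of the higher-ranked word, then averages over all pairs.

```python
import math
from itertools import combinations


def umass_coherence(topic_words, tokenized_docs, eps=1e-12):
    """Average UMass-style pair score for one topic (illustrative sketch).

    Assumes every topic word occurs in at least one document.
    """
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_freq(*words):
        # Number of documents containing all of the given words
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    pair_scores = []
    # combinations() yields index pairs (j, i) with j < i, so w_j outranks w_i
    for j, i in combinations(range(len(topic_words)), 2):
        w_j, w_i = topic_words[j], topic_words[i]
        pair_scores.append(math.log((doc_freq(w_i, w_j) + eps) / doc_freq(w_j)))
    return sum(pair_scores) / len(pair_scores)


# Hypothetical toy corpus, not taken from the tweet data set
toy_docs = [
    ["women", "sports", "transgender"],
    ["women", "sports"],
    ["utah", "veto", "ban"],
    ["utah", "ban"],
]
coherent = umass_coherence(["women", "sports"], toy_docs)
mixed = umass_coherence(["women", "utah"], toy_docs)
print(coherent, mixed)
```

Words that always co-occur score close to zero, while words that never share a document score strongly negative. This is why the UMass values reported in this notebook are all below zero, and why values closer to zero are better.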

In [19]:
def main():
    data_path = os.path.join(
        "data",
        "preprocessed_tweets",
        "all_tweets.csv",
    )
    df = pd.read_csv(data_path, header=0)
    df = df.text.dropna()[:10000]  # Train on a sample of the first 10,000 tweets
    
    models_path = "models"
    os.makedirs(models_path, exist_ok=True)  # No error if the directory already exists

    # Train BERTopic with BERTweet base vs. a Sentence Transformers model
    # (all-MiniLM-L6-v2, labeled "BERT-base" in the outputs) as the embedding model
    bertweet = TransformerDocumentEmbeddings("vinai/bertweet-base")
    sentence_model = SentenceTransformer("all-MiniLM-L6-v2")

    embedding_models = [bertweet, sentence_model]
    embedding_model_names = ["BERTweet", "BERT-base"]
    
    # Dimension reduction models
    tsne = TSNE(n_components=5, init="pca", method="exact", random_state=2022)
    pca = PCA(n_components=5, random_state=2022)
    umap = UMAP(n_neighbors=35, n_components=5, min_dist=0.0, metric="euclidean", random_state=2022)

    dimension_reduction_models = [tsne, pca, umap]
    dimension_reduction_model_names = ["tSNE", "PCA", "UMAP"]

    for embedding_model, ename in zip(embedding_models, embedding_model_names):
        for dimension_reduction_model, dname in zip(dimension_reduction_models, dimension_reduction_model_names):
            print(f"{datetime.now().strftime('%H:%M:%S')}: Begin: {ename} with {dname}.")
            topic_model, topics = train_bertopic(df, embedding_model, dimension_reduction_model)
            topic_model.save(os.path.join(models_path, f"{ename}_{dname}"))
            
            score = get_coherence_score(df, topic_model, topics, "u_mass")
            print(
                f"{datetime.now().strftime('%H:%M:%S')}: {ename} with {dname} UMass score: {score}."
            )

In [20]:
main()

14:42:29: Begin: BERTweet with tSNE.


10000it [06:16, 26.58it/s]
2022-05-09 14:48:45,910 - BERTopic - Transformed documents to Embeddings
2022-05-09 15:49:42,609 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2022-05-09 15:50:12,822 - BERTopic - Reduced number of topics from 29 to 24


15:50:13: BERTweet with tSNE UMass score: -5.957196610296199.
15:50:13: Begin: BERTweet with PCA.


10000it [05:50, 28.57it/s]
2022-05-09 15:56:03,338 - BERTopic - Transformed documents to Embeddings
2022-05-09 15:56:03,465 - BERTopic - Reduced dimensionality with UMAP
2022-05-09 15:56:04,525 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2022-05-09 15:56:21,401 - BERTopic - Reduced number of topics from 14 to 14


15:56:22: BERTweet with PCA UMass score: -5.076505012801119.
15:56:22: Begin: BERTweet with UMAP.


10000it [06:04, 27.44it/s]
2022-05-09 16:02:26,668 - BERTopic - Transformed documents to Embeddings
2022-05-09 16:03:08,377 - BERTopic - Reduced dimensionality with UMAP
2022-05-09 16:03:08,835 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2022-05-09 16:03:50,159 - BERTopic - Reduced number of topics from 46 to 26


16:04:02: BERTweet with UMAP UMass score: -6.762316077999512.
16:04:02: Begin: BERT-base with tSNE.



2022-05-09 16:04:38,527 - BERTopic - Transformed documents to Embeddings
2022-05-09 17:13:01,652 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2022-05-09 17:13:02,531 - BERTopic - Reduced number of topics from 29 to 20


17:13:03: BERT-base with tSNE UMass score: -5.1543494975940005.
17:13:03: Begin: BERT-base with PCA.



2022-05-09 17:13:37,762 - BERTopic - Transformed documents to Embeddings
2022-05-09 17:13:37,824 - BERTopic - Reduced dimensionality with UMAP
2022-05-09 17:13:38,519 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2022-05-09 17:13:39,094 - BERTopic - Reduced number of topics from 15 to 6


17:13:39: BERT-base with PCA UMass score: -4.1602324936516.
17:13:39: Begin: BERT-base with UMAP.



2022-05-09 17:14:14,803 - BERTopic - Transformed documents to Embeddings
2022-05-09 17:14:41,909 - BERTopic - Reduced dimensionality with UMAP
2022-05-09 17:14:42,326 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2022-05-09 17:14:43,591 - BERTopic - Reduced number of topics from 52 to 30


17:14:45: BERT-base with UMAP UMass score: -6.064310563206254.
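
The log timestamps above also show how long each BERTweet run spent between finishing the embeddings and finishing clustering, an interval that covers dimension reduction plus HDBSCAN. A small sketch for computing these intervals follows. The timestamps are copied from the log, while the dictionary layout and the helper name `elapsed_seconds` are assumptions made for this example.

```python
from datetime import datetime

LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"

# (embeddings finished, clustering finished) for the three BERTweet runs above
bertweet_runs = {
    "tSNE": ("2022-05-09 14:48:45,910", "2022-05-09 15:49:42,609"),
    "PCA": ("2022-05-09 15:56:03,338", "2022-05-09 15:56:04,525"),
    "UMAP": ("2022-05-09 16:02:26,668", "2022-05-09 16:03:08,835"),
}


def elapsed_seconds(start, end):
    """Seconds between two BERTopic log timestamps."""
    start_time = datetime.strptime(start, LOG_TIME_FORMAT)
    end_time = datetime.strptime(end, LOG_TIME_FORMAT)
    return (end_time - start_time).total_seconds()


durations = {name: elapsed_seconds(*stamps) for name, stamps in bertweet_runs.items()}
for name, seconds in durations.items():
    print(f"{name}: {seconds:.1f} s")
```

In this run, the t-SNE configuration spent roughly an hour in this stage, against about a second for PCA and under a minute for UMAP.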


## Coherence scores

| Embedding model | Dimension reduction | UMass score |
|---|---|---|
| BERTweet | tSNE | -5.957196610296199 |
| BERTweet | PCA | -5.076505012801119 |
| BERTweet | UMAP | -6.762316077999512 |
| BERT-base | tSNE | -5.1543494975940005 |
| BERT-base | PCA | -4.1602324936516 |
| BERT-base | UMAP | -6.064310563206254 |

Here "BERT-base" labels the all-MiniLM-L6-v2 Sentence Transformers model. UMass scores are negative, and values closer to zero indicate more coherent topics, so the best configuration is BERT-base with PCA (-4.1602324936516).
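
The best configuration can also be selected programmatically. The sketch below copies the six UMass scores reported above into a dictionary (the variable names are assumptions made for this example) and picks the entry with the highest, that is, least negative, score.

```python
# UMass scores copied from the runs above, keyed by (embedding, dimension reduction)
umass_scores = {
    ("BERTweet", "tSNE"): -5.957196610296199,
    ("BERTweet", "PCA"): -5.076505012801119,
    ("BERTweet", "UMAP"): -6.762316077999512,
    ("BERT-base", "tSNE"): -5.1543494975940005,
    ("BERT-base", "PCA"): -4.1602324936516,
    ("BERT-base", "UMAP"): -6.064310563206254,
}

# Higher (closer to zero) UMass means more coherent topics
best_config = max(umass_scores, key=umass_scores.get)
print(best_config, umass_scores[best_config])

# Full ranking from most to least coherent
ranking = sorted(umass_scores, key=umass_scores.get, reverse=True)
```

Within each embedding model, the same ordering holds in these results: PCA scores best, then t-SNE, then UMAP.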
 

In [21]:
my_model = BERTopic.load("./models/BERT-base_PCA")

In [23]:
freq = my_model.get_topic_info()
freq.head()

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 7276 | `-1_transgender_sports_athletes_women` |
| 1 | 0 | 1690 | `0_women_sports_transgender_not` |
| 2 | 1 | 806 | `1_veto_transgender_utah_ban` |
| 3 | 2 | 116 | `2_cyclist_bridges_emily_event` |
| 4 | 3 | 81 | `3_tweeted_legit_stolen_widely` |
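
The counts in this output show how the 10,000 sampled tweets are distributed across topics. As a quick check, the sketch below computes each listed topic's share of the sample. The counts are copied from the output above, the sample size comes from the slice in `main()`, and the variable names are assumptions made for this example.

```python
# Document counts for the five topics shown by freq.head()
topic_counts = {-1: 7276, 0: 1690, 1: 806, 2: 116, 3: 81}
n_documents = 10000  # size of the sample used for training

shares = {topic: count / n_documents for topic, count in topic_counts.items()}
for topic, share in shares.items():
    print(f"Topic {topic}: {share:.2%}")
```

Topic -1 is BERTopic's outlier topic, so about 73% of the tweets were not assigned to any topic by this model.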


## Term Rank
The term rank plot shows how strongly each term is weighted in each topic, ordered by rank. <br>
As the figure below shows, the top 10 terms are sufficient to represent every topic except topic 3.

In [25]:
fig5 = my_model.visualize_term_rank()
fig5.show()

In [26]:
fig = my_model.visualize_topics()
fig.show()

In [27]:
fig4 = my_model.visualize_barchart(top_n_topics=20, n_words=10)
fig4.show()

In [30]:
my_model.get_representative_docs(0)

['men only care womens sports theres transgender woman involved',
 'good answer simple transgender normal thing must hundred thousands why dont compete against rather trying steal womans sports',
 'hard question transgender sports debating anatomically men women different gender transition more mental emotional transition not complete anatomical transformation male symmetry playing female sports fair',
 'good',
 'well nice three days',
 'good',
 "changing history too renee richards sports female transgender reassignment surgery forced get court order competed few yrs later martina navratilova's coach against fts competing bio women reason fts want erase history put lia instead",
 "changing history too renee richards sports female transgender reassignment surgery forced get court order competed few yrs later martina navratilova's coach against fts competing bio women reason fts want erase history put lia instead",
 "changing history too renee richards sports female transgender reassignm