# Dynamic Topic Modeling Baseline Exercise

Goal: Create a clustering algorithm that shows the anatomy of each topic and how it has evolved over time.

Library: BERTopic. BERTopic is a topic modeling technique that leverages BERT embeddings and c-TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions.

### Install new libraries

In [None]:
! pip install bertopic

In [None]:
! pip install bertopic[visualization]

### Load Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import string
from bertopic import BERTopic
import nltk




Taken from this tutorial: [**Dynamic Topic Modeling with BERTopic**](https://towardsdatascience.com/dynamic-topic-modeling-with-bertopic-e5857e29f872)

We will drop the business, sports, and movies sections and start with technology and healthcare. BERTopic has three main algorithm components:
1. Embed Documents: Extract document embeddings with *Sentence Transformers*.
2. Cluster Documents: Reduce the dimensionality of the embeddings with *UMAP*, then group similar documents with *HDBSCAN*.
3. Create Topic Representation: Extract and reduce topics with c-TF-IDF (class-based term frequency, inverse document frequency).
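
To make step 3 more concrete, here is a minimal pure-Python sketch of c-TF-IDF. It assumes the weighting described in the BERTopic documentation: the term frequency within a topic (normalized by topic length) multiplied by `log(1 + A / f_t)`, where `A` is the average number of words per topic and `f_t` is the frequency of the term across all topics. The function name `c_tf_idf` and the two toy clusters are made up for illustration; this is not the library's implementation.

```python
import math
from collections import Counter


def c_tf_idf(docs_per_topic):
    """Toy class-based TF-IDF: W[t, c] = tf[t, c] * log(1 + A / f[t])."""
    # Treat every topic as one long document and count its terms.
    term_counts = {
        topic: Counter(" ".join(docs).split())
        for topic, docs in docs_per_topic.items()
    }
    # f[t]: frequency of each term across all topics.
    total = Counter()
    for counts in term_counts.values():
        total.update(counts)
    # A: average number of words per topic.
    avg_words = sum(total.values()) / len(term_counts)
    return {
        topic: {
            term: (count / sum(counts.values()))
            * math.log(1 + avg_words / total[term])
            for term, count in counts.items()
        }
        for topic, counts in term_counts.items()
    }


# Hypothetical clusters, standing in for the output of the clustering step.
clusters = {
    0: ["drone flight test", "drone camera"],
    1: ["opioid overdose", "opioid drug test"],
}
weights = c_tf_idf(clusters)
top_terms = {topic: max(w, key=w.get) for topic, w in weights.items()}
print(top_terms)
```

The word "test" appears in both clusters, so it is weighted lower than a word such as "flight" that appears in only one. This is what keeps topic descriptions specific.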

### Load data

In [2]:
# Load the all-the-news dataset (25k news articles) into a pandas DataFrame
df = pd.read_csv('./data/all-the-news-25k.csv')

In [3]:
df.head()

Unnamed: 0,date,year,month,day,title,article,section,publication
0,2018-05-02 17:09:00,2018,5.0,2,You Can Trick Your Brain Into Being More Focused,If only every day could be like this. You can’...,healthcare,Vice
1,2019-06-23 00:00:00,2019,6.0,23,Hudson's Bay's chairman's buyout bid pits reta...,(Reuters) - The success of Hudson’s Bay Co Exe...,business,Reuters
2,2018-12-28 00:00:00,2018,12.0,28,Wells Fargo to pay $575 million in settlement ...,NEW YORK (Reuters) - Wells Fargo & Co (WFC.N) ...,business,Reuters
3,2019-05-21 00:00:00,2019,5.0,21,Factbox: Investments by automakers in the U.S....,(Reuters) - Major automakers have announced a ...,business,Reuters
4,2019-02-05 00:00:00,2019,2.0,5,Exclusive: Britain's financial heartland unbow...,LONDON (Reuters) - Britain’s financial service...,business,Reuters


In [4]:
df.shape

(25000, 8)

In [7]:
# Drop business, movies, and sports sections
df = df.drop(df[df.section.isin(['business', 'movies', 'sports'])].index)
df.shape

(10000, 8)

In [10]:
# Reset the index
df = df.reset_index(drop=True)
df.head()

Unnamed: 0,date,year,month,day,title,article,section,publication
0,2018-05-02 17:09:00,2018,5.0,2,You Can Trick Your Brain Into Being More Focused,If only every day could be like this. You can’...,healthcare,Vice
1,2019-06-20 00:00:00,2019,6.0,20,"Hungary has no evidence of Huawei threat, plan...",BUDAPEST (Reuters) - Hungary has no evidence t...,technology,Reuters
2,2019-06-20 00:00:00,2019,6.0,20,Philippines' Globe Telecoms launches 5G servic...,MANILA (Reuters) - Philippines’ Globe Telecom ...,technology,Reuters
3,2018-10-08 00:00:00,2018,10.0,8,Google welcomes UK court block on claim over d...,(Reuters) - Google welcomed a decision on Mond...,technology,Reuters
4,2019-06-28 00:00:00,2019,6.0,28,"Google announces new subsea cable 'Equiano', c...",(Reuters) - Alphabet Inc’s Google on Friday an...,technology,Reuters


### Preprocess data

In [11]:
# Prepare to remove stopwords
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/curtispond/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [13]:
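# Remove stop words, reusing the stop_words list loaded in the previous cell.
# Matching is case-sensitive and runs before lowercasing and punctuation removal,
# so tokens such as "If" or "this." are kept.
stop_set = set(stop_words)  # set membership is faster than scanning a list

df['article'] = df['article'].map(lambda x: ' '.join(word for word in x.split() if word not in stop_set))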

In [14]:
# Strip punctuation, drop tokens containing digits, trim, and lowercase the article text
def clean_text(text):
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text) # remove ASCII punctuation
    text = re.sub(r'\w*\d\w*', '', text) # remove any token that contains a digit
    text = text.strip() # trim leading and trailing whitespace
    text = text.lower() # lowercase
    return text # return cleaned text

In [15]:
# Apply the cleaning function to the article column
df['article'] = df['article'].apply(clean_text)

In [16]:
# Inspect one of the cleaned articles
df['article'][0]

'if every day could like this you can’t put finger why maybe right amount sleep maybe stars somehow aligned favor whatever reason you’re cooking gas hours fly like minutes you’re feeling great know it’s  pm todo list done this feeling ‘flow’ ‘in zone’ something us experienced point other—although often might like it’s mental state elite athletes seem beck call for us mere mortals though hardly ever shows need it since psychologist mihály csíkszentmihályi first described zone which called ‘flow’  neuroscientists trying figure make show demand yet close secrets zone another truth emerged what think zone actually one many mental states person in works particular kind thinking here’s master one them several the flow zoneto understand states might work better makes sense first consider know this original ‘zone one thing definitely know feels great csíkszentmihályi describes ‘optimal experience’ achieve true happiness one explanation happens—and feels good—is represents perfect match activit
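
The cleaned article above still starts with "if" and contains "this" and "the", because the stop word filter ran before lowercasing and punctuation removal. Here is a small sketch of the alternative order (clean first, then filter). The function name `clean_then_filter` and the tiny `STOP_WORDS` set are hypothetical stand-ins; the notebook itself uses NLTK's English list.

```python
import re
import string

# Hypothetical mini stop word list; the notebook uses NLTK's English stop words.
STOP_WORDS = {"if", "the", "this", "be", "like"}


def clean_then_filter(text):
    """Lowercase and strip punctuation/digit tokens first, then drop stop words."""
    text = text.lower()
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)
    text = re.sub(r"\w*\d\w*", "", text)  # drop tokens that contain a digit
    # split() also collapses the extra spaces left behind by the removals
    return " ".join(word for word in text.split() if word not in STOP_WORDS)


sample = "If only every day could be like this. The 2nd try!"
print(clean_then_filter(sample))
```

One trade-off to keep in mind: NLTK's list includes contractions such as "don't", which no longer match once the apostrophe has been stripped.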

### Prepare data

In [17]:
articles = df.article.to_list()
dates = df.date.to_list()

### Create the model

In [18]:
# Create a BERTopic model
topic_model = BERTopic(verbose=True)
topics, probs = topic_model.fit_transform(articles)

Batches: 100%|██████████| 313/313 [02:37<00:00,  1.99it/s]
2022-12-23 16:17:56,235 - BERTopic - Transformed documents to Embeddings
2022-12-23 16:18:07,169 - BERTopic - Reduced dimensionality
2022-12-23 16:18:07,384 - BERTopic - Clustered reduced embeddings


#### Extract the 10 largest topics based on the number of documents assigned to each topic

In [19]:
freq = topic_model.get_topic_info()
freq.head(10)

Unnamed: 0,Topic,Count,Name
0,-1,3289,-1_the_people_like_says
1,0,208,0_opioid_drug_addiction_overdose
2,1,173,1_headphones_earbuds_wireless_audio
3,2,166,2_drone_drones_dji_flight
4,3,161,3_sex_sexual_porn_men
5,4,150,4_selfdriving_cars_autonomous_vehicles
6,5,139,5_insurance_bill_healthcare_medicaid
7,6,130,6_vr_oculus_virtual_headset
8,7,127,7_co_ltd_further_text
9,8,116,8_bacteria_food_water_allergies


Topic -1 collects the outlier documents that were not assigned to any cluster, and it is typically ignored. Here it holds 3,289 of the 10,000 articles.
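
A quick way to keep an eye on the outlier topic is to compute its share of all documents. The sketch below uses a made-up `toy_topics` list shaped like the `topics` list returned by `fit_transform`; the helper name `outlier_share` is hypothetical.

```python
from collections import Counter


def outlier_share(topics):
    """Fraction of documents assigned to the outlier topic -1."""
    counts = Counter(topics)
    return counts[-1] / len(topics)


# Hypothetical assignments; in the notebook this would be the `topics` list.
toy_topics = [-1, 0, 0, 1, -1, 2, -1, 0]
print(outlier_share(toy_topics))  # 3 of the 8 toy documents are outliers

# The same ratio using the counts from the table above.
print(3289 / 10000)
```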

In [21]:
# Look at the terms that contribute to a topic
topic_nr = freq.iloc[5]["Topic"] # select a frequent topic
topic_model.get_topic(topic_nr)

[('selfdriving', 0.04456743412953301),
 ('cars', 0.033316605168688763),
 ('autonomous', 0.030755073824332405),
 ('vehicles', 0.029451596927189846),
 ('car', 0.022643058158491974),
 ('uber', 0.018829979624936162),
 ('vehicle', 0.018260921479120944),
 ('lyft', 0.0163402631485499),
 ('ford', 0.014150327853925972),
 ('technology', 0.01240024976085427)]

In [22]:
# Visualize the topics
topic_model.visualize_topics()

Each circle represents a topic, and its size reflects the topic's frequency across all documents. Moving the slider shows where topics overlap. For example, Topic 6 contains VR, Oculus, virtual, headset, and Vive. Topic 160, which appears as a red circle above Topic 6 when you move the slider to the right, contains spectacles, glasses, snapchat, lenses, and sunglasses.

In [23]:
# Visualize the topics in a bar chart
topic_model.visualize_barchart()

In [25]:
hierarchical_topics = topic_model.hierarchical_topics(articles)
topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)

100%|██████████| 187/187 [00:00<00:00, 289.47it/s]


I find this visualization a little challenging to digest, but it appears to show how all 180 topics relate to one another in a hierarchy.

In [27]:
# Compute the topic representations over time (visualized in the next cell)
topics_over_time = topic_model.topics_over_time(articles, dates)

4938it [04:49, 17.04it/s]
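
The `4938it` in the progress bar suggests the model worked through roughly 4,938 distinct timestamps, which is slow and makes for a noisy chart. One option is to coarsen the dates before passing them in. The sketch below maps each timestamp to the first day of its month. It assumes every value follows the `YYYY-MM-DD HH:MM:SS` pattern seen in `df.head()`, and the helper name `to_month_start` is made up. `topics_over_time` also accepts an `nr_bins` argument that serves a similar purpose.

```python
from datetime import datetime


def to_month_start(timestamp):
    """Map 'YYYY-MM-DD HH:MM:SS' to the first day of its month ('YYYY-MM-01')."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return parsed.strftime("%Y-%m-01")


# Sample values copied from the date column shown earlier.
sample_dates = ["2018-05-02 17:09:00", "2018-05-21 00:00:00", "2019-06-20 00:00:00"]
monthly = [to_month_start(d) for d in sample_dates]
print(sorted(set(monthly)))  # far fewer distinct timestamps than the raw dates
```

In the notebook, the binned list would take the place of `dates` in the `topics_over_time` call.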


In [28]:
topic_model.visualize_topics_over_time(topics_over_time, topics=[0, 1, 2, 3, 4, 5])

I can see value in the Topics over Time output, but only if we're working with recent data. For example, a spike in certain topics could signal to GLG that client requests on similar topics are likely to follow. This can help GLG line up enough SME resources to fulfill those requests in a timely manner.

In [36]:
# Visualize topic similarity
topic_model.visualize_heatmap(n_clusters=10)

The Similarity Matrix indicates how similar certain topics are to each other. I set the number of clusters to 10 to make the heatmap a little more readable.
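
For intuition, here is a small sketch of how a similarity matrix like this can be built, assuming each topic is represented by a vector and similarity is measured with cosine similarity. The three topic names and their 3-dimensional vectors are invented for illustration; real topic vectors have many more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical topic vectors.
topic_vectors = {
    "vr": [0.9, 0.1, 0.0],
    "glasses": [0.8, 0.3, 0.1],
    "opioid": [0.0, 0.1, 0.9],
}
names = list(topic_vectors)
matrix = [
    [cosine_similarity(topic_vectors[row], topic_vectors[col]) for col in names]
    for row in names
]
for name, row in zip(names, matrix):
    print(name, [round(value, 2) for value in row])
```

Each topic has a similarity of 1 with itself (the diagonal of the heatmap), and related topics such as the toy "vr" and "glasses" vectors score higher with each other than with an unrelated topic.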