# Dynamic Topic Modeling Baseline Exercise

Goal: Build a topic model that shows the anatomy of each topic and how it has evolved over time.

Library: BERTopic. BERTopic is a topic modeling technique that leverages BERT embeddings and c-TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions.

### Install new libraries

In [None]:
! pip install bertopic

In [None]:
! pip install "bertopic[visualization]"

### Load Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import string
from bertopic import BERTopic
import nltk



Taken from this tutorial: [**Dynamic Topic Modeling with BERTopic**](https://towardsdatascience.com/dynamic-topic-modeling-with-bertopic-e5857e29f872)

We will start with the article titles. BERTopic has three main algorithm components:

1. Embed Documents: Extract document embeddings with *Sentence Transformers*.
2. Cluster Documents: Reduce the dimensionality of the embeddings with *UMAP*, then group similar documents with *HDBSCAN*.
3. Create Topic Representation: Extract and reduce topics with c-TF-IDF (class-based term frequency-inverse document frequency).
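
To make step 3 more concrete, here is a minimal pure-Python sketch of c-TF-IDF. It assumes the weighting described in the BERTopic documentation: the term frequency within a cluster multiplied by log(1 + A / f_t), where A is the average number of words per cluster and f_t is the frequency of term t across all clusters. The toy `clusters` dictionary and the function name `c_tf_idf` are hypothetical, and BERTopic's own implementation works on sparse matrices and differs in detail.

```python
import math
from collections import Counter

def c_tf_idf(clusters):
    """Score terms per cluster as tf(t, c) * log(1 + A / f(t))."""
    # Treat every cluster as one long document and count its terms
    counts = {c: Counter(" ".join(docs).split()) for c, docs in clusters.items()}
    # f(t): frequency of each term across all clusters
    total = Counter()
    for counter in counts.values():
        total.update(counter)
    # A: average number of words per cluster
    avg_words = sum(total.values()) / len(counts)
    scores = {}
    for c, counter in counts.items():
        n_words = sum(counter.values())
        scores[c] = {
            term: (freq / n_words) * math.log(1 + avg_words / total[term])
            for term, freq in counter.items()
        }
    return scores

# Hypothetical toy clusters of already-cleaned titles
clusters = {
    0: ["bank pays settlement", "bank shares fall"],
    1: ["team wins final", "team signs striker"],
}
scores = c_tf_idf(clusters)
top_terms = {c: max(s, key=s.get) for c, s in scores.items()}
print(top_terms)  # {0: 'bank', 1: 'team'}
```

Terms that are frequent inside one cluster but rare elsewhere get the highest scores, which is why they make readable topic labels.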

### Load data

In [2]:
# Load the all-the-news dataset (25k news articles) into a pandas dataframe
df = pd.read_csv('./data/all-the-news-25k.csv')

In [3]:
df.head()

| Unnamed: 0 | date | year | month | day | title | article | section | publication |
|---|---|---|---|---|---|---|---|---|
| 0 | 2018-05-02 17:09:00 | 2018 | 5.0 | 2 | You Can Trick Your Brain Into Being More Focused | If only every day could be like this. You can’... | healthcare | Vice |
| 1 | 2019-06-23 00:00:00 | 2019 | 6.0 | 23 | Hudson's Bay's chairman's buyout bid pits reta... | (Reuters) - The success of Hudson’s Bay Co Exe... | business | Reuters |
| 2 | 2018-12-28 00:00:00 | 2018 | 12.0 | 28 | Wells Fargo to pay $575 million in settlement ... | NEW YORK (Reuters) - Wells Fargo & Co (WFC.N) ... | business | Reuters |
| 3 | 2019-05-21 00:00:00 | 2019 | 5.0 | 21 | Factbox: Investments by automakers in the U.S.... | (Reuters) - Major automakers have announced a ... | business | Reuters |
| 4 | 2019-02-05 00:00:00 | 2019 | 2.0 | 5 | Exclusive: Britain's financial heartland unbow... | LONDON (Reuters) - Britain’s financial service... | business | Reuters |


In [4]:
df.shape

(25000, 8)

### Preprocess data

In [5]:
# Prepare to remove stopwords
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/curtispond/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
# Helper: drop English stop words (case-insensitive) from a string
stop = set(stop_words)

def remove_stopwords(text):
    return ' '.join(word for word in text.split() if word.lower() not in stop)

In [6]:
# Remove stopwords, punctuation, numbers, and extra whitespace, and lowercase all titles
def clean_text(text):
    text = remove_stopwords(text) # remove stopwords
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text) # remove punctuation
    text = re.sub(r'\w*\d\w*', '', text) # remove tokens that contain digits
    text = ' '.join(text.split()) # collapse extra whitespace
    text = text.lower() # lowercase
    return text # return cleaned text

In [7]:
# Apply the cleaning function to the title column (missing titles are left as NaN)
df['title'] = df['title'].apply(lambda x: clean_text(x) if isinstance(x, str) else x)
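
The cleaning steps can be checked on a single title before running them on the whole column. The sketch below is self-contained and uses a tiny, hypothetical stop-word set (`STOP`) and helper name (`clean_title`) in place of the NLTK list, so results on the real data will differ wherever the lists differ.

```python
import re
import string

# Hypothetical, tiny stop-word set; the notebook uses NLTK's English list
STOP = {"to", "in", "by", "the", "your", "you", "can", "into", "being", "more"}

def clean_title(text):
    # Drop stop words first, so contractions still match before punctuation is removed
    words = [w for w in text.split() if w.lower() not in STOP]
    text = " ".join(words)
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)  # remove punctuation
    text = re.sub(r"\w*\d\w*", "", text)  # remove tokens that contain digits
    return " ".join(text.split()).lower()  # collapse whitespace and lowercase

cleaned = clean_title("Wells Fargo to pay $575 million in settlement")
print(cleaned)  # wells fargo pay million settlement
```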

In [None]:
# How many articles without a title?
df['title'].isnull().sum()

In [None]:
df.head(10)

### Prepare data

In [None]:
titles = df.title.to_list()

# Collect the publication dates (as strings), one per title
dates = df.date.to_list()

### Create the model

In [None]:
# Create a BERTopic model
topic_model = BERTopic(verbose=True)
topics, probs = topic_model.fit_transform(titles)

#### Extract the 10 largest topics based on the number of documents assigned to each topic

In [None]:
freq = topic_model.get_topic_info()
freq.head(10)

Topic -1 collects the outlier documents that HDBSCAN did not assign to any cluster. It is typically ignored because its documents are spread across the whole corpus rather than belonging to any particular topic.
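
The frequency table is essentially a count of documents per assigned topic. As a rough illustration, assuming a hypothetical `topics` list shaped like the one `fit_transform` returns (one topic id per document, with -1 marking outliers), the ranking and the outlier share can be reproduced with the standard library:

```python
from collections import Counter

# Hypothetical topic assignments, one per document; -1 marks an outlier
topics = [0, 1, -1, 0, 2, -1, 0, 1, -1, -1]

counts = Counter(topics)
outlier_share = counts[-1] / len(topics)

# Rank the real topics by size, leaving the outlier topic out
ranked = [(t, n) for t, n in counts.most_common() if t != -1]

print(ranked)         # [(0, 3), (1, 2), (2, 1)]
print(outlier_share)  # 0.4
```

A large outlier share is worth checking before interpreting the remaining topics, since those documents do not contribute to any of them.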

In [None]:
# Look at the terms that contribute to a topic
topic_nr = freq.iloc[6]["Topic"] # select a frequent topic
topic_model.get_topic(topic_nr)

In [None]:
# Visualize the topics
topic_model.visualize_topics()

In [None]:
# Visualize the topics in a bar chart
topic_model.visualize_barchart()

In [None]:
# Build the topic hierarchy and visualize it
hierarchical_topics = topic_model.hierarchical_topics(titles)
topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)


In [None]:
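# Compute the topic representations over time
# nr_bins groups the many unique timestamps into bins (20 is an arbitrary choice)
topics_over_time = topic_model.topics_over_time(titles, dates, nr_bins=20)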
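
With one timestamp per article, most timestamps are unique, so topic representations are usually computed per time bin rather than per timestamp. The sketch below illustrates equal-width binning using only the standard library. The sample `dates` and the helper `bin_timestamps` are hypothetical, and BERTopic's own binning (through the `nr_bins` argument of `topics_over_time`) may place the bin edges differently.

```python
from datetime import datetime

def bin_timestamps(dates, nr_bins, fmt="%Y-%m-%d %H:%M:%S"):
    """Assign each date string to one of nr_bins equal-width time bins."""
    parsed = [datetime.strptime(d, fmt) for d in dates]
    start = min(parsed)
    span = (max(parsed) - start).total_seconds()
    width = span / nr_bins or 1.0  # fall back to 1.0 when all dates are identical
    # The latest date would fall into bin nr_bins, so clamp it into the last bin
    return [
        min(int((p - start).total_seconds() / width), nr_bins - 1)
        for p in parsed
    ]

# Hypothetical sample in the same format as the dataset's date column
dates = [
    "2018-01-01 00:00:00",
    "2018-06-30 00:00:00",
    "2019-01-01 00:00:00",
    "2019-12-31 00:00:00",
]
bins = bin_timestamps(dates, nr_bins=2)
print(bins)  # [0, 0, 1, 1]
```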

In [None]:
topic_model.visualize_topics_over_time(topics_over_time, topics=[0, 1, 2, 3, 4, 5])