# Topic model evolution over time

This notebook demonstrates the evolution of topic models over time. The following steps are performed on the publication abstracts:

- Preprocess the text: stop-word removal, stemming, lemmatization, and n-gram phrase detection.
- Encode documents into embeddings using BERT.
- Reduce the dimensionality of the embeddings using UMAP.
- Cluster the embeddings using hierarchical DBSCAN (HDBSCAN).
- Assign scores to words with regard to the clusters.
- Extract the most coherent topics (c-TF-IDF algorithm).
- Fine-tune clusters and topics at different publication dates to model topic evolution.

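**c-TF-IDF**: $\text{c-TF-IDF}_i = \frac{t_i}{w_i} \times \log \frac{m}{\sum_{j}^{n} t_j}$

Here $t_i$ is the frequency of word $t$ in class (cluster) $i$, $w_i$ is the total number of words in class $i$, $m$ is the number of documents, and the sum runs over the frequency of $t$ in all $n$ classes.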
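
To make the class-based TF-IDF score defined above concrete, here is a minimal sketch that applies the formula literally to a made-up toy corpus. The helper name `c_tf_idf` and the toy tokens are assumptions for illustration only; BERTopic's own implementation may differ in details such as smoothing and normalization.

```python
import math
from collections import Counter


def c_tf_idf(class_tokens, n_docs):
    """Score each word per class as (t_i / w_i) * log(m / sum_j t_j).

    class_tokens: dict mapping a class label to the list of tokens of all
                  documents in that class (documents joined per class).
    n_docs:       m, the number of original documents.
    Hypothetical helper, written directly from the formula above.
    """
    # t_i: frequency of each word within each class
    counts = {label: Counter(tokens) for label, tokens in class_tokens.items()}

    # sum_j t_j: frequency of each word across all classes
    total = Counter()
    for counter in counts.values():
        total.update(counter)

    scores = {}
    for label, counter in counts.items():
        w_i = sum(counter.values())  # total number of words in class i
        scores[label] = {
            word: (t_i / w_i) * math.log(n_docs / total[word])
            for word, t_i in counter.items()
        }
    return scores


# made-up toy data: two clusters built from six documents
toy_classes = {
    0: ["stroop", "interference", "stroop", "color"],
    1: ["memory", "recall", "color", "memory"],
}
toy_scores = c_tf_idf(toy_classes, n_docs=6)

# the highest-scoring word of each class is its most characteristic term
top_words = {label: max(s, key=s.get) for label, s in toy_scores.items()}
print(top_words)
```

In this toy example the word `color` appears in both classes, so it scores lower than words that are frequent in only one class.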


## Setup

In [1]:
# REMOVE: MOVED TO "2 Preprocessing.ipynb"

%reload_ext autoreload
%autoreload 2

import sys
from pathlib import Path
import matplotlib.pyplot as plt

import pandas as pd

from bertopic import BERTopic

In [None]:
# parameters
INPUT_FILE = Path('data/pubmed_abstracts_preprocessed.csv.gz')

# load the dataset
df = pd.read_csv(INPUT_FILE, compression='gzip')

Now load the PubMed abstracts and preprocess them. This takes a few minutes to run, and the preprocessed abstracts are stored in `data/pubmed/tests_preprocessed.csv`.

As a visual check, a word cloud can quickly visualize the whole preprocessed corpus.

In [None]:
from wordcloud import WordCloud, STOPWORDS
combined_text = df['abstract'].str.cat(sep=' ')

# drop generic reporting words as whole tokens; chained str.replace would also cut them out of longer words
stopwords = STOPWORDS | {'result', 'find', 'suggest'}

cloud = WordCloud(width=500, height=500, background_color='white', stopwords=stopwords).generate(combined_text)

plt.figure(figsize=(10,10))
plt.imshow(cloud)
plt.axis('off')
plt.show()

## Overall topics (time-independent model)

In this section, we fit a single topic model to all PubMed abstracts of one cognitive task (here, the Stroop task), selected by the `corpus_name` column and without regard to publication year.

In [10]:
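def fit_topic_model(docs, corpus_name):
    """Fit a BERTopic model to the docs.

    Returns a (topic_model, topics) tuple, or (None, None) if fitting fails.
    """

    print(f'fitting topics for {corpus_name}...', file=sys.stderr)

    try:
        topic_model = BERTopic(verbose=True)
        topics, _ = topic_model.fit_transform(docs)
        return topic_model, topics
    except Exception as err:
        # report the failure instead of silently swallowing it
        print(f'failed to fit topics for {corpus_name}: {err}', file=sys.stderr)
        return None, None

# note: this is a (model, topics) tuple, indexed as [0] and [1] below
stroop_topic_model = fit_topic_model(df.query('corpus_name == "Stroop Task"')['preprocessed_abstract'].to_list(), "Stroop Task")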

# Fit all corpora
# models = df.groupby('corpus_name').apply(
#     lambda grp: fit_topic_model(grp['preprocessed_abstract'].to_list(), grp.name)
# )

# topic_model.get_topic_info()

# DEBUG
# topic_model.get_topics()
# topic_model.get_topic(0)

fitting topics for Stroop Task...
Batches: 100%|██████████| 248/248 [02:32<00:00,  1.62it/s]
2021-07-14 09:50:14,907 - BERTopic - Transformed documents to Embeddings
2021-07-14 09:50:39,723 - BERTopic - Reduced dimensionality with UMAP
2021-07-14 09:50:39,990 - BERTopic - Clustered UMAP embeddings with HDBSCAN


In [15]:
stroop_topic_model[0].get_topic_info()
stroop_topic_model[0].visualize_barchart()

## Topic evolution over the years (time-dependent model)

In [17]:
docs = df.query('corpus_name == "Stroop Task"')['preprocessed_abstract'].to_list()
years = df.query('corpus_name == "Stroop Task"')['year'].to_list()
topics = stroop_topic_model[1]
topics_over_time = stroop_topic_model[0].topics_over_time(docs, topics, years)#, datetime_format='%b')
stroop_topic_model[0].visualize_topics_over_time(topics_over_time, top_n_topics=20)

60it [02:18,  2.31s/it]
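
To give a rough picture of the time-dependent step, the sketch below counts how many documents fall into each topic per year. This is only the frequency part of the idea, shown on made-up data; it is not BERTopic's implementation, which also derives topic words per time step. The helper `topic_frequency_by_year` is hypothetical.

```python
from collections import Counter, defaultdict


def topic_frequency_by_year(topics, years):
    """Count documents per (year, topic). Hypothetical helper for illustration."""
    freq = defaultdict(Counter)
    for topic, year in zip(topics, years):
        freq[year][topic] += 1
    # return plain dicts, ordered by year
    return {year: dict(counter) for year, counter in sorted(freq.items())}


# made-up topic assignments (-1 marks outlier documents) and publication years
toy_topics = [0, 0, 1, -1, 1, 1]
toy_years = [2019, 2019, 2019, 2020, 2020, 2020]

freq_by_year = topic_frequency_by_year(toy_topics, toy_years)
print(freq_by_year)
```

With the real data, `topics` and `years` from the cell above would take the place of the toy lists.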


In [None]:
# TODO: store models and plot Stroop topics

# topic_model.get_topic_info()

# DEBUG
# topic_model.get_topics()
# topic_model.get_topic(0)