# Topic model evolution over time

This notebook demonstrates how topic models evolve over time. The following steps are performed on the publication abstracts:

- Preprocess the text: remove stop words, stem or lemmatize words, and detect n-gram phrases.
- Encode documents into embeddings using BERT.
- Reduce the dimensionality of the embeddings using UMAP.
- Cluster the embeddings using hierarchical DBSCAN (HDBSCAN).
- Assign scores to words with regard to the clusters.
- Extract the most coherent topics (c-TF-IDF algorithm).
- Fine-tune clusters and topics at different publication dates to model topic evolution.

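**c-TF-IDF**: $\text{c-TF-IDF}_i = \frac{t_i}{w_i} \times \log \frac{m}{\sum_j^n t_j}$, where $t_i$ is the frequency of a word in cluster $i$, $w_i$ is the total number of words in cluster $i$, $m$ is the number of documents, and $n$ is the number of clusters.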
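
To make the scoring formula concrete, here is a minimal pure-Python sketch that applies it to two toy clusters. It reads the formula literally: the frequency of a word in a cluster, divided by the total number of words in that cluster, times the log of the number of documents over the word's total frequency across all clusters. The function name `c_tf_idf` and the toy data are illustrative assumptions, and BERTopic's own implementation differs in its details.

```python
import math
from collections import Counter


def c_tf_idf(clusters: dict[str, list[list[str]]]) -> dict[str, dict[str, float]]:
    """Score each word per cluster as (t_i / w_i) * log(m / sum_j t_j)."""
    # t[name][word]: frequency of the word within one cluster
    t = {name: Counter(w for doc in docs for w in doc) for name, docs in clusters.items()}
    # m: total number of documents across all clusters
    m = sum(len(docs) for docs in clusters.values())
    # total frequency of each word across all clusters
    total = Counter()
    for counts in t.values():
        total.update(counts)

    scores = {}
    for name, counts in t.items():
        w = sum(counts.values())  # total number of words in this cluster
        scores[name] = {
            word: (freq / w) * math.log(m / total[word]) for word, freq in counts.items()
        }
    return scores


# Hypothetical toy data: each cluster is a list of tokenized documents.
toy_clusters = {
    'memory': [['working', 'memory'], ['memory', 'span'], ['memory', 'load']],
    'attention': [['attention', 'load'], ['attention', 'network'], ['selective', 'attention']],
}
scores = c_tf_idf(toy_clusters)
top_words = {name: max(s, key=s.get) for name, s in scores.items()}
print(top_words)
```

In this toy example the word `load` appears in both clusters, so it scores lower within the `memory` cluster than `working`, which is equally frequent there but appears in that cluster only.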


## Setup

In [1]:
import sys
from pathlib import Path

from tqdm import tqdm
import pandas as pd
import matplotlib.pyplot as plt

import spacy
import gensim
from gensim.models.phrases import ENGLISH_CONNECTOR_WORDS
from bertopic import BERTopic

In [5]:
# parameters

CUSTOM_STOP_WORDS = ['study', 'task', 'test']
PREPROCESSED_ABSTRACTS_FILE = Path('data/pubmed/tests_preprocessed.csv')

## Preprocessing

In [6]:

# Note: run this to download the SpaCy model: `python -m spacy download en_core_web_sm`
nlp = spacy.load('en_core_web_sm')


def preprocess(texts: list[str], corpus_name: str) -> list[str]:
  """Opinionated preprocessing pipeline.

  Args:
      texts (list[str]): list of texts, each item is one text document.
      corpus_name (str): Name of the corpus

  Returns:
      list[str]: preprocessed documents
  """
  # DEBUG standard preprocessing pipeline
  # docs = \
  #   texts['abstract'].progress_apply(lambda abstract: gensim.parsing.preprocess_string(abstract)).to_list()

  print(f'Preprocessing {corpus_name}...', file=sys.stderr)

  # additional stop words
  for stop_word in CUSTOM_STOP_WORDS:
    lexeme = nlp.vocab[stop_word]
    lexeme.is_stop = True

  # flake8: noqa: W503
  def _clean(doc):
    cleaned = []
    for token in doc:
      if (not token.is_punct
          and token.is_alpha
          and not token.is_stop
          and not token.like_num
          and not token.is_space):
        cleaned.append(token.lemma_.lower().strip())
    return cleaned

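  # wrap the spaCy pipeline (not the finished list) so the progress bar tracks the actual work
  docs = [_clean(doc) for doc in tqdm(nlp.pipe(texts), total=len(texts), desc='Cleaning abstracts')]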

  # Detect bigrams first, then re-run phrase detection on the merged output
  # so that longer n-grams can form.
  # Some test or construct names contain 4 terms; as a heuristic, the number
  # of passes grows with the number of spaces in the corpus_name.
  n_passes = 1 + max(1, 2 + corpus_name.count(' '))
  for _ in range(n_passes):
    ngram_phrases = gensim.models.Phrases(docs, connector_words=ENGLISH_CONNECTOR_WORDS)  # , scoring='npmi')
    docs = [ngram_phrases[doc] for doc in docs]

  docs = [' '.join(doc) for doc in docs]
  # FIXME filter ngram stop words: docs = [[w for w in doc if w not in my_stop_words] for doc in docs]

  return docs
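
The phrase detection step inside `preprocess` can be illustrated without gensim. The sketch below scores adjacent word pairs with a simplified collocation score, loosely modeled on the default scorer of `gensim.models.Phrases`, and merges pairs that exceed a threshold. The helper names `score_bigrams` and `merge_bigrams`, the use of the unigram vocabulary size as the scaling factor, and the toy documents are all assumptions made for illustration. Connector words are ignored here.

```python
from collections import Counter


def score_bigrams(docs, min_count=1):
    """Score adjacent pairs as (count(ab) - min_count) * vocab_size / (count(a) * count(b))."""
    unigrams = Counter(w for doc in docs for w in doc)
    bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    vocab_size = len(unigrams)
    return {
        (a, b): (n - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        for (a, b), n in bigrams.items()
    }


def merge_bigrams(doc, scores, threshold):
    """Join adjacent pairs that score above the threshold with an underscore."""
    merged, i = [], 0
    while i < len(doc):
        if i + 1 < len(doc) and scores.get((doc[i], doc[i + 1]), 0.0) > threshold:
            merged.append(f'{doc[i]}_{doc[i + 1]}')
            i += 2  # skip the second word of the merged pair
        else:
            merged.append(doc[i])
            i += 1
    return merged


# Hypothetical tokenized abstracts.
toy_docs = [
    ['working', 'memory', 'capacity', 'predict', 'performance'],
    ['working', 'memory', 'load', 'impair', 'performance'],
    ['training', 'improve', 'working', 'memory'],
]
bigram_scores = score_bigrams(toy_docs)
merged_docs = [merge_bigrams(doc, bigram_scores, threshold=1.0) for doc in toy_docs]
print(merged_docs[0])
```

Running the same two functions again on `merged_docs` would let an already merged token such as `working_memory` combine with a neighbor, which is how repeated passes can produce longer n-grams.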


Now load the PubMed abstracts and preprocess them. This takes a few minutes to run, and the preprocessed abstracts are stored in `data/pubmed/tests_preprocessed.csv`.

In [62]:

if PREPROCESSED_ABSTRACTS_FILE.exists():
    # load from cached csv
    df = pd.read_csv(PREPROCESSED_ABSTRACTS_FILE)
else:
    # if preprocessed abstracts are not already available
    csv_files = Path('data/pubmed/tests').glob('*.csv')

    corpora = []

    for csv_file in tqdm(csv_files, desc='Reading CSV files', unit=' files'):
        df = pd.read_csv(csv_file)
        df['corpus_name'] = csv_file.stem
        corpora.append(df)

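    # reset the index so rows from different files do not share index labels
    df = pd.concat(corpora, axis=0, ignore_index=True)
    # fall back to the title when the abstract is missing
    df['abstract'] = df['abstract'].fillna(df['title'])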

    df['preprocessed_abstract'] = df.groupby('corpus_name')['abstract'].transform(
        lambda grp: preprocess(grp.to_list(), grp.name)
    )

    # store the preprocessed abstracts as a csv file.
    df.to_csv(PREPROCESSED_ABSTRACTS_FILE, index=False)

print('Done! Feel free to move on to the next cell.')

As a visual check, a word cloud can quickly visualize the whole preprocessed corpus.

In [70]:
from wordcloud import WordCloud
combined_text = df['preprocessed_abstract'].str.cat(sep=' ')

cloud = WordCloud(width=500, height=500, background_color='white').generate(combined_text)

plt.figure(figsize=(10, 10))
plt.imshow(cloud)
plt.axis('off')
plt.show()

## Overall topics (time-independent model)

In this section, we fit a separate topic model to the PubMed abstracts of each cognitive task (one model per corpus), ignoring publication dates.

In [86]:
def fit_topic_model(docs, corpus_name):
    """Fit a topic model to the docs and return the fitted model, and the topics.
    """

    print(f'fitting topics for {corpus_name}...', file=sys.stderr)

    topic_model = BERTopic(verbose=True)
    topics, _ = topic_model.fit_transform(docs)
    return topic_model, topics

models = df.groupby('corpus_name').apply(
    lambda grp: fit_topic_model(grp['preprocessed_abstract'].to_list(), grp.name)
)

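# Inspect the fitted model of a single corpus, e.g. the first one:
# topic_model, topics = models.iloc[0]
# topic_model.get_topic_info()

# DEBUG
# topic_model.get_topics()
# topic_model.get_topic(0)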

## Topic evolution over the years (time-dependent model)

In [7]:
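# `models` holds one (topic_model, topics) pair per corpus; inspect the first corpus.
topic_model, topics = models.iloc[0]
corpus_df = df[df['corpus_name'] == models.index[0]]
docs, years = corpus_df['preprocessed_abstract'].to_list(), corpus_df['year'].to_list()
# keyword arguments, since the positional order differs between BERTopic releases; `year` is assumed numeric
topics_over_time = topic_model.topics_over_time(docs=docs, topics=topics, timestamps=years)
topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=20)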
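
As a rough intuition for what a time-dependent topic representation contains, the sketch below groups documents by year and topic and lists the most frequent terms in each group. This is a deliberate simplification: it uses raw term counts rather than c-TF-IDF scores and does no fine-tuning across time steps. The function name `top_terms_by_year` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def top_terms_by_year(docs, topics, years, top_n=1):
    """Return {(year, topic): [most frequent terms]} for each year/topic pair."""
    counts = defaultdict(Counter)
    for doc, topic, year in zip(docs, topics, years):
        # documents are whitespace-joined tokens, as returned by `preprocess`
        counts[(year, topic)].update(doc.split())
    return {
        key: [term for term, _ in counter.most_common(top_n)]
        for key, counter in sorted(counts.items())
    }


# Hypothetical preprocessed documents with their topic labels and publication years.
toy_docs = [
    'memory span memory',
    'memory training training',
    'training transfer training',
    'attention network attention',
]
toy_topics = [0, 0, 0, 1]
toy_years = [2000, 2010, 2010, 2010]

evolution = top_terms_by_year(toy_docs, toy_topics, toy_years)
for (year, topic), terms in evolution.items():
    print(year, topic, terms)
```

Here topic 0 is dominated by `memory` in 2000 and by `training` in 2010, the kind of shift the time-dependent model is meant to surface.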