<a href="https://colab.research.google.com/github/jameswrbrookes/cc-nlp-tutorials/blob/main/topic_modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

In this notebook, we give some basic code for topic modeling with the [Latent Dirichlet Allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) (LDA) algorithm. LDA surfaces $K$ topics in a collection of documents, where $K$ is a number chosen by the user.

# Package Installation

First let's install packages that we will need and that are not natively installed in a Colab environment.

In [None]:
!pip install datasets pyLDAvis



# Data

As data, we'll use the [Reuters newswire classification dataset](https://huggingface.co/datasets/ucirvine/reuters21578), which we can grab from Hugging Face.

In [None]:
from datasets import load_dataset

reuters = load_dataset("ucirvine/reuters21578", 'ModHayes', split = 'train', trust_remote_code=True)

In [None]:
reuters

Dataset({
    features: ['text', 'text_type', 'topics', 'lewis_split', 'cgis_split', 'old_id', 'new_id', 'places', 'people', 'orgs', 'exchanges', 'date', 'title'],
    num_rows: 20856
})

In [None]:
# example observation
reuters[0]

{'text': 'Showers continued throughout the week in\nthe Bahia cocoa zone, alleviating the drought since early\nJanuary and improving prospects for the coming temporao,\nalthough normal humidity levels have not been restored,\nComissaria Smith said in its weekly review.\n    The dry period means the temporao will be late this year.\n    Arrivals for the week ended February 22 were 155,221 bags\nof 60 kilos making a cumulative total for the season of 5.93\nmln against 5.81 at the same stage last year. Again it seems\nthat cocoa delivered earlier on consignment was included in the\narrivals figures.\n    Comissaria Smith said there is still some doubt as to how\nmuch old crop cocoa is still available as harvesting has\npractically come to an end. With total Bahia crop estimates\naround 6.4 mln bags and sales standing at almost 6.2 mln there\nare a few hundred thousand bags still in the hands of farmers,\nmiddlemen, exporters and processors.\n    There are doubts as to how much of this coc

We only want the text, and, to make things a bit quicker, we'll take a sample of 2000 docs.

In [None]:
reuters_text = reuters.shuffle(seed=123).select(range(2000))['text']

In [None]:
del reuters

In [None]:
reuters_text[0]

'MP Co, a New York investment\npartnership, told the Securities and Exchange Commission it\nbought a 6.8 pct stake in IPCO Corp common stock.\n    The partnership said it acquired 346,600 IPCO shares,\npaying 4.9 mln dlrs, because it believed the securities to be\n"an attractive investment opportunity."\n    It said it planned to regularly review its investment and\nmay in the future recommend business strategies or an\nextraordinary corporate transaction such as a merger,\nreorganization, liquidation or asset sale.\n    The partnership is controlled by Marcus Schloss and Co Inc,\na New York brokerage firm, and Prime Medical Products Inc, a\nGreenwood, S.C., medical supplies firm.\n Reuter\n'

# Data Preparation

We'll now do some simple text preprocessing.  Specifically, we'll:
- casefold everything;
- remove stopwords (generally I don't advise this, but for this particular task it can be useful);
- remove tokens that are typically function (rather than content) words;
- remove tokens with a spurious part-of-speech tag;
- remove punctuation;
- replace anything that is a digit/number with a `[DIGIT]`  tag;
- remove tokens whose lemma occurs fewer than 4 times;
- use a token's lemma and (broad) part-of-speech to represent the word (useful for disambiguation).
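
As a rough illustration of the last point, the sketch below counts a handful of hypothetical tokens (`toy_tokens` is made up for this example) with and without the part-of-speech suffix. It only shows the effect on the vocabulary and does not use spaCy.

```python
from collections import Counter

# Hypothetical toy tokens in the lemma/POS format used in this notebook.
toy_tokens = ["bank/NOUN", "bank/PROPN", "rise/VERB", "rise/NOUN", "bank/NOUN"]

# With the POS suffix, the noun and proper-noun uses of "bank" stay separate.
with_pos = Counter(toy_tokens)

# Stripping the suffix merges them into a single word type.
without_pos = Counter(tok.split("/")[0] for tok in toy_tokens)

print(with_pos)
print(without_pos)
```

With the suffix there are four word types; without it there are only two, so the model can no longer tell the different uses of the same lemma apart.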

Which words occur fewer than 4 times? We won't know until we've done the rest of the preprocessing, so let's do that first.

In [None]:
# package for doing linguistic analysis
import spacy

# setup the pipeline with the language model
nlp = spacy.load('en_core_web_sm')


# function implementing the above tokenization design
def preprocess_text_and_tokenize(doc):

  preprocessed_tokens = []

  for token in doc:
    if token.is_stop or token.is_punct or token.pos_ in ['SPACE', 'X', 'AUX',
                                                         'ADP', 'DET', 'CCONJ',
                                                         'INTJ', 'SCONJ',
                                                         'PRON', 'PUNCT' ]:
      continue
    elif token.is_digit or token.pos_ == 'NUM':
      preprocessed_tokens.append('[DIGIT]')
    else:
      preprocessed_tokens.append(token.lemma_.lower() + '/' + token.pos_)

  return preprocessed_tokens

# collect the tokenized docs
tokenized_docs = []
all_lemmas = [] # need this for the counts

for doc in nlp.pipe(reuters_text, batch_size = 8):
  preprocessed_tokens = preprocess_text_and_tokenize(doc)
  tokenized_docs.append(preprocessed_tokens)
  all_lemmas.extend(preprocessed_tokens)

In [None]:
# the set of tags after the final '/'; '[DIGIT]' has no '/', so it shows up as its own entry
pos = {p.split('/')[-1] for p in all_lemmas}
pos

{'ADJ', 'ADV', 'NOUN', 'PROPN', 'SYM', 'VERB', '[DIGIT]'}

In [None]:
# now get a count of all the words
from collections import Counter

all_lemmas_counts = Counter(all_lemmas)

In [None]:
# now get tokenized docs without rare (where rare means < 4) words
new_tokenized_docs = []

for doc in tokenized_docs:
  new_tokens = []
  for t in doc:
    if all_lemmas_counts[t] >= 4:
      new_tokens.append(t)
  new_tokenized_docs.append(new_tokens)


In [None]:
from gensim.corpora.dictionary import Dictionary

In [None]:
reuters_dictionary = Dictionary(new_tokenized_docs)

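> N.B. You could also filter out rare words using `reuters_dictionary.filter_extremes(no_below=4)`. Note, though, that `no_below` is a document-frequency threshold (it keeps tokens that appear in at least 4 documents) rather than a total-count threshold, and that `filter_extremes` also applies its defaults `no_above=0.5` and `keep_n=100000` unless you override them.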
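
To see how counting occurrences differs from counting documents, here is a minimal sketch on a hypothetical toy corpus (`toy_docs` and `min_count` are made up for illustration, and it uses plain Python rather than gensim):

```python
from collections import Counter

# Hypothetical toy corpus: three already-tokenized documents.
toy_docs = [
    ["oil", "oil", "oil", "oil", "price"],
    ["price", "bank"],
    ["price", "bank", "rate"],
]

# Total (collection) frequency: every occurrence counts.
total_counts = Counter(tok for doc in toy_docs for tok in doc)

# Document frequency: each document counts at most once per token.
doc_counts = Counter(tok for doc in toy_docs for tok in set(doc))

min_count = 3
kept_by_total = {t for t, c in total_counts.items() if c >= min_count}
kept_by_doc_freq = {t for t, c in doc_counts.items() if c >= min_count}

print(sorted(kept_by_total))
print(sorted(kept_by_doc_freq))
```

Here `oil` survives the total-count filter because it occurs four times, but not the document-frequency filter because all four occurrences are in one document.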

In [None]:
reuters_dictionary.most_common(100)

[('[DIGIT]', 15604),
 ('say/VERB', 4807),
 ('mln/NOUN', 2687),
 ('dlrs/NOUN', 2046),
 ('reuter/PROPN', 1774),
 ('pct/NOUN', 1622),
 ('year/NOUN', 1475),
 ('company/NOUN', 1071),
 ('ct/NOUN', 948),
 ('share/NOUN', 916),
 ('u.s./PROPN', 730),
 ('market/NOUN', 621),
 ('bank/NOUN', 606),
 ('loss/NOUN', 574),
 ('inc/PROPN', 533),
 ('corp/PROPN', 522),
 ('stock/NOUN', 518),
 ('price/NOUN', 506),
 ('sale/NOUN', 482),
 ('month/NOUN', 445),
 ('new/ADJ', 430),
 ('expect/VERB', 394),
 ('rate/NOUN', 386),
 ('profit/NOUN', 367),
 ('april/PROPN', 367),
 ('co/PROPN', 358),
 ('bank/PROPN', 350),
 ('net/ADJ', 332),
 ('government/NOUN', 332),
 ('debt/NOUN', 319),
 ('rise/VERB', 318),
 ('week/NOUN', 317),
 ('march/PROPN', 316),
 ('include/VERB', 306),
 ('trade/NOUN', 304),
 ('oil/NOUN', 298),
 ('agreement/NOUN', 294),
 ('dlr/NOUN', 292),
 ('february/PROPN', 284),
 ('tell/VERB', 282),
 ('quarter/NOUN', 280),
 ('japan/PROPN', 280),
 ('tonne/NOUN', 269),
 ('interest/NOUN', 265),
 ('shr/PROPN', 262),
 ('coun

Now we convert the tokens into a (sparse) bag-of-words representation: each document is converted into a list of tuples `(word_type_index, count_of_token_in_doc)`, one tuple per distinct word type in the document (so a document with $t$ tokens yields at most $t$ tuples).
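
The shape of that representation can be sketched in plain Python. This is only an imitation of the output format, not gensim's implementation: `toy_doc` is a hypothetical document, and the ids here are assigned in order of first appearance, which is an assumption made for the example.

```python
from collections import Counter

# Hypothetical toy document, already preprocessed into lemma/POS tokens.
toy_doc = ["mln/NOUN", "[DIGIT]", "mln/NOUN", "share/NOUN", "[DIGIT]", "[DIGIT]"]

# Give each distinct word type an integer id (order of first appearance here;
# a real dictionary object assigns ids by its own rules).
token2id = {}
for tok in toy_doc:
    token2id.setdefault(tok, len(token2id))

# One (id, count) tuple per distinct word type, sorted by id.
counts = Counter(toy_doc)
toy_bow = sorted((token2id[tok], n) for tok, n in counts.items())

print(toy_bow)
```

The six tokens collapse into three tuples, one for each distinct word type.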

In [None]:
reuters_corpus = [reuters_dictionary.doc2bow(text) for text in new_tokenized_docs ]

In [None]:
reuters_corpus[1000][:4] # (word_type_index, count_of_token_in_doc)

[(0, 17), (6, 1), (18, 1), (25, 4)]

Running the following code allows us to confirm this:

In [None]:
for t in reuters_corpus[1000][:4]:
  print(t)
  print(reuters_dictionary.get(t[0]), Counter(new_tokenized_docs[1000])[reuters_dictionary.get(t[0])])
  print('--------------')

(0, 17)
[DIGIT] 17
--------------
(6, 1)
business/NOUN 1
--------------
(18, 1)
future/ADJ 1
--------------
(25, 4)
mln/NOUN 4
--------------



# Training

In [None]:
from gensim.models import LdaModel

lda_model = LdaModel(reuters_corpus, num_topics=70, id2word = reuters_dictionary, passes = 10)

# Inspection/Visualization

What topics has the model learned?

In [None]:
lda_model.print_topics()

[(36,
  '0.175*"[DIGIT]" + 0.045*"pct/NOUN" + 0.044*"year/NOUN" + 0.044*"february/PROPN" + 0.031*"say/VERB" + 0.024*"sale/NOUN" + 0.020*"car/NOUN" + 0.017*"reuter/PROPN" + 0.016*"january/PROPN" + 0.015*"month/NOUN"'),
 (6,
  '0.040*"say/VERB" + 0.026*"trade/VERB" + 0.022*"[DIGIT]" + 0.020*"reuter/PROPN" + 0.018*"office/NOUN" + 0.017*"product/NOUN" + 0.013*"retaliation/NOUN" + 0.012*"exchange/PROPN" + 0.011*"philadelphia/PROPN" + 0.011*"trade/NOUN"'),
 (49,
  '0.184*"[DIGIT]" + 0.121*"pct/NOUN" + 0.035*"price/NOUN" + 0.025*"rise/VERB" + 0.023*"january/PROPN" + 0.017*"february/PROPN" + 0.017*"rise/NOUN" + 0.014*"say/VERB" + 0.014*"month/NOUN" + 0.012*"year/NOUN"'),
 (44,
  '0.047*"[DIGIT]" + 0.030*"say/VERB" + 0.018*"sugar/PROPN" + 0.016*"reuter/PROPN" + 0.014*"factory/NOUN" + 0.012*"mln/NOUN" + 0.011*"beet/NOUN" + 0.010*"sugar/NOUN" + 0.010*"salomon/PROPN" + 0.010*"dlrs/NOUN"'),
 (45,
  '0.051*"say/VERB" + 0.038*"[DIGIT]" + 0.025*"merrill/PROPN" + 0.025*"lynch/PROPN" + 0.020*"trader/NOU

In [None]:
import pyLDAvis
import pyLDAvis.gensim_models as gensimvisualize

In [None]:
reuters_lda_vis = gensimvisualize.prepare(lda_model, reuters_corpus, reuters_dictionary, mds='mmds')
pyLDAvis.display(reuters_lda_vis)


You can use these visualizations to:

- add words to the stopword list that are not useful for topic disambiguation (for example, `[DIGIT]` seems to be in most topics, as do `reuter/PROPN` and `say/VERB`), and then re-run the algorithm;
- work out what each topic relates to and interpret the topics for actionable insight (this can be very tricky to do!).
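
The first suggestion could look roughly like the sketch below. The helper `drop_extra_stopwords` and the input `toy_docs` are hypothetical names introduced for illustration; the three stopwords are the ones mentioned above.

```python
# Tokens that the visualization suggests are uninformative (taken from the list above).
extra_stopwords = {"[DIGIT]", "reuter/PROPN", "say/VERB"}


def drop_extra_stopwords(tokenized_docs, stopwords):
    """Return a copy of the tokenized docs with the given tokens removed."""
    return [[tok for tok in doc if tok not in stopwords] for doc in tokenized_docs]


# Hypothetical toy input in the same lemma/POS format as `new_tokenized_docs`.
toy_docs = [
    ["[DIGIT]", "say/VERB", "oil/NOUN", "price/NOUN"],
    ["reuter/PROPN", "bank/NOUN", "[DIGIT]"],
]

filtered_docs = drop_extra_stopwords(toy_docs, extra_stopwords)
print(filtered_docs)
```

In the notebook itself, you would apply this to `new_tokenized_docs`, rebuild the `Dictionary` and the bag-of-words corpus from the result, and then train `LdaModel` again.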