# **Topic Modeling**

## **Exercise 1.1: Processing the raw text**

You will use the `gensim` Python library to perform LDA on the Brown corpus.

We will start with the file `brown.txt`, which contains 500 documents from the Brown corpus, one document per line, with all punctuation removed.

Read these documents into memory:


In [1]:
with open('brown.txt', 'r') as f:    #Close the file once it has been read
    docs = [ line.strip() for line in f ]
len(docs)

500

Split each document into its word tokens and lowercase them, to get a list of lists of words:

In [2]:
texts = [ [ w for w in d.lower().split() ] for d in docs ]

We could do more preprocessing, such as stoplisting or removing numbers, but as you will see, this is not needed for this lab’s general-purpose task.
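For reference, the optional preprocessing mentioned above could look roughly like the sketch below. The tiny stoplist `STOPWORDS` and the helper `preprocess` are illustrative assumptions, not part of the lab materials; a real stoplist would be much longer.

```python
# Illustrative sketch only: STOPWORDS and preprocess() are hypothetical names,
# not part of the lab code.
STOPWORDS = {"a", "an", "and", "of", "the", "to"}

def preprocess(doc):
    """Lowercase, split on whitespace, then drop stopwords and pure numbers."""
    tokens = doc.lower().split()
    return [t for t in tokens if t not in STOPWORDS and not t.isdigit()]

sample = "The jury said 2 of the 12 members praised the election"
print(preprocess(sample))
```

Only tokens made up entirely of digits are dropped here; mixed tokens such as `12month` would be kept.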

## **Exercise 1.2a: Raw text to bag-of-words**

Many models, including LDA, rely on the bag-of-words representation (i.e. representing a document by its unigram frequencies).

This is done by creating a unique index of all words in the corpus, then transforming each document into a list of those indexes along with a count of how many times they appear.
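The indexing step can be sketched by hand on a toy corpus. The names `toy_texts`, `token2id` and `id2token` are made up for this example, and this toy version assigns IDs in order of first appearance, which need not match the IDs that `gensim` assigns.

```python
# Illustrative sketch: build a word index by hand on a toy corpus.
# toy_texts, token2id and id2token are hypothetical names for this example.
toy_texts = [
    ["the", "jury", "praised", "the", "election"],
    ["the", "election", "was", "conducted", "well"],
]

# Give each distinct word type the next free integer ID.
token2id = {}
for text in toy_texts:
    for word in text:
        if word not in token2id:
            token2id[word] = len(token2id)

# Reverse mapping, for looking up a word from its ID.
id2token = {word_id: word for word, word_id in token2id.items()}

print(len(token2id))          # number of word types
print(token2id["election"])   # word -> ID
print(id2token[3])            # ID -> word
```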

`gensim` provides a `Dictionary` class for this purpose:

In [3]:
from gensim.corpora import Dictionary
dictionary = Dictionary(texts)   #Index all words in the corpus
len(dictionary)                  #Number of word types

48017

In [4]:
dictionary[1729]                 #ID lookup

'rights'

In [5]:
dictionary.token2id['rights']    #Word type lookup

1729

In [6]:
dictionary.token2id['flake']

22909

In [7]:
allwords = [ dictionary[i] for i in dictionary ]
print(allwords[100:150])

['bar', 'barber', 'battle', 'be', 'became', 'become', 'been', 'before', 'begin', 'being', 'believes', 'bellwood', 'berry', 'best', 'birth', 'bit', 'blue', 'board', 'bodys', 'bond', 'bonds', 'both', 'bowden', 'brief', 'brought', 'burden', 'bush', 'but', 'by', 'byrd', 'byrds', 'caldwell', 'caldwells', 'callan', 'calls', 'calmest', 'campaign', 'can', 'candidate', 'candidates', 'career', 'carey', 'carry', 'cent', 'chairman', 'chambers', 'charge', 'charged', 'cheshire', 'child']


## **Exercise 1.2b: Filtering raw text to bag-of-words**

A common pre-processing task is to remove highly frequent words (e.g. *a, the, of*), as they appear in nearly every document, and thus are uninformative for topics.

For topic modeling, it can also be useful to eliminate words that appear in too few documents, as they are not likely associated with any particular topic.

Both of these tasks can be accomplished using the `.filter_extremes()` function.

Let’s **remove** a word if it appears in fewer than 5 documents (`no_below=5`) or in more than 50%
of the documents (`no_above=0.5`).

In [8]:
dictionary.filter_extremes(no_below = 5, no_above = 0.5)
len(dictionary) 

11137
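To make the two thresholds concrete, here is a minimal pure-Python sketch of this kind of document-frequency filter on a toy corpus. The helper `filter_by_doc_freq` is a hypothetical stand-in written for this example, not gensim's implementation.

```python
# Illustrative sketch of document-frequency filtering on a toy corpus.
# filter_by_doc_freq is a hypothetical helper, not gensim's implementation.
from collections import Counter

def filter_by_doc_freq(texts, no_below, no_above):
    """Keep words found in at least `no_below` documents and in at most
    a fraction `no_above` of all documents."""
    # Count each word once per document (document frequency, not token frequency).
    doc_freq = Counter(word for text in texts for word in set(text))
    max_docs = no_above * len(texts)
    return {w for w, df in doc_freq.items() if no_below <= df <= max_docs}

toy_texts = [
    ["the", "court", "ruled"],
    ["the", "court", "adjourned"],
    ["the", "jury", "ruled"],
    ["the", "election", "ended"],
]
kept = filter_by_doc_freq(toy_texts, no_below=2, no_above=0.5)
print(sorted(kept))
```

In this toy corpus, `the` is dropped for appearing in every document, and the words that occur in a single document are dropped for being too rare.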

In [9]:
filteredwords = [ dictionary[i] for i in dictionary ]
print(filteredwords[100:150])

['cent', 'chairman', 'chambers', 'charge', 'charged', 'child', 'church', 'citizens', 'city', 'clerical', 'combined', 'commented', 'commerce', 'commissioners', 'committee', 'compensation', 'conducted', 'congress', 'congressional', 'congressmen', 'considering', 'consistently', 'constitution', 'constitutional', 'construction', 'consulted', 'continue', 'contracts', 'controversy', 'cost', 'costs', 'council', 'counties', 'county', 'couple', 'courses', 'court', 'courts', 'criticisms', 'crowd', 'cruelty', 'd', 'date', 'daughter', 'davis', 'days', 'death', 'defeated', 'delegation', 'democratic']


Q1. Compare this to the unfiltered vocabulary above. What proportion of the vocabulary was removed? What kinds of words do you think were removed?

In [10]:
dictionary.token2id['rights']        #Indexes can change after filtering 

1327

In [11]:
#dictionary.token2id['flake']          #Raises KeyError if uncommented: removed by filtering

In [12]:
#dictionary.token2id['the']            #Raises KeyError if uncommented: removed by filtering

Q2. Check whether you are right about some of the words that were removed.

In [13]:
set(allwords) - set(filteredwords)

{'xylophones',
 'toppers',
 'wisconsins',
 'splotches',
 'fraternities',
 'moffett',
 'beadles',
 '1093',
 'bootlegging',
 'lateral',
 'measles',
 'individualcontributor',
 'voltaires',
 '12month',
 'metaphorical',
 'stavropoulos',
 'perennially',
 'seekers',
 'subdue',
 'mirrored',
 'coauthor',
 'yknow',
 'murkland',
 'hirsch',
 'yellowgreen',
 'beheading',
 'undercurrent',
 'holloway',
 'interviewee',
 'dialectics',
 'skinnin',
 'onestep',
 'tooke',
 'britishborn',
 'buckley',
 'defecated',
 'theorizing',
 'puissant',
 'cityscapes',
 'monies',
 'transcription',
 'reformatory',
 'courcy',
 'dandy',
 'bouffe',
 'amalgamation',
 'unattached',
 'ferber',
 'villainous',
 'redeclared',
 'resistors',
 'hatters',
 'borland',
 'admonition',
 'saamis',
 'filibuster',
 'equipping',
 'forma',
 'unchanging',
 'cholelithiasis',
 'ethyl',
 'anouilh',
 'hyphenated',
 'ballots',
 'quint',
 'activation',
 'tenements',
 'peanuts',
 'ps',
 'tug',
 'preflight',
 'throbbing',
 'recentlypassed',
 'little',

Q3. How would you remove only words that occur frequently across the corpus? Why would you (not) want to do this? (Tip: Review cell 52.)

You can save a dictionary with `.save(filename)`:

In [14]:
dictionary.save('LDALab.bow.dict')

## **Exercise 1.2c: Processing and serializing the corpus**

The `.doc2bow()` method converts a document into a list of tuples, each pairing a word index with the number of times that word appears in the document, i.e. `[(word_index, count), (word_index, count), ...]`.
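The conversion can be sketched with a `Counter` over a fixed index. Both `toy_token2id` and `toy_doc2bow` are hypothetical names for this illustration and are not how gensim implements it; the sketch assumes that words missing from the index are simply skipped.

```python
# Illustrative sketch: toy_doc2bow and toy_token2id are hypothetical names,
# not gensim's implementation.
from collections import Counter

toy_token2id = {"document": 0, "unseen": 1, "words": 2}

def toy_doc2bow(tokens, token2id):
    """Return sorted (word_id, count) pairs, skipping words missing from the index."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

print(toy_doc2bow("words words document zebra".split(), toy_token2id))
```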

Do this for the entire collection of texts:

In [15]:
bow_corpus = [ dictionary.doc2bow(t) for t in texts ]
print(bow_corpus[0])

[(0, 4), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 3), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 2), (23, 3), (24, 1), (25, 3), (26, 2), (27, 1), (28, 2), (29, 1), (30, 1), (31, 2), (32, 1), (33, 1), (34, 1), (35, 2), (36, 1), (37, 2), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 3), (48, 2), (49, 1), (50, 1), (51, 1), (52, 2), (53, 1), (54, 1), (55, 1), (56, 2), (57, 2), (58, 1), (59, 1), (60, 1), (61, 1), (62, 5), (63, 1), (64, 2), (65, 1), (66, 2), (67, 1), (68, 1), (69, 2), (70, 1), (71, 1), (72, 2), (73, 1), (74, 1), (75, 3), (76, 1), (77, 1), (78, 1), (79, 1), (80, 1), (81, 1), (82, 1), (83, 1), (84, 2), (85, 2), (86, 2), (87, 3), (88, 6), (89, 1), (90, 1), (91, 1), (92, 3), (93, 1), (94, 3), (95, 5), (96, 4), (97, 2), (98, 1), (99, 1), (100, 1), (101, 3), (102, 1), (103, 2), (104, 2), (105, 1), (106, 1), (107, 2), (108, 9), (109, 1), (110, 1),

This can also be done on unseen documents (unseen words will be ignored):

In [16]:
dictionary.doc2bow('This document has unseen words like Trafalmadorian'.split())

[(4022, 1), (7509, 1), (10999, 1)]

In [17]:
dictionary[4022], dictionary[7509], dictionary[10999]

('words', 'document', 'unseen')

Q4. Which words are not indexed? Why?

Serializing the processed corpus lets us store it on disk and read it back in later:

In [18]:
from gensim.corpora import MmCorpus 
MmCorpus.serialize('LDALab.bow.mm', bow_corpus)

## **Exercise 1.3: Bag-of-words to LDA**

Let’s train a **30 topic LDA model** on the first 80% of the documents and evaluate it on the remaining 20%.

Split the bag-of-words corpus into 80% for training and 20% for testing.

In [19]:
size = int(len(bow_corpus) * 4 / 5)
training = bow_corpus[:size]
testing = bow_corpus[size:]

Now we can train an `LdaModel` object with the following parameters:
  * the processed corpus of interest 
  * `id2word` = the dictionary so that we can print the words instead of the indexes
  * `num_topics` = the number of topics (an important parameter to experiment with)
  * `passes` = the number of passes through the corpus during training (we only have 500 documents; larger corpora may need a different number of passes for reasonable results)
  
**This might take up to 2 minutes or so to run. Please be patient. You can look ahead at instructions below while waiting.**

In [20]:
from gensim.models import LdaModel
lda30 = LdaModel(training, id2word=dictionary, num_topics=30, passes=10)

Q5. How do you know if you have chosen a good number of topics?

## **Exercise 1.4a: Evaluating LDA - interpreting**

The `LdaModel` object has a built-in method, `.show_topics()`, for printing topics or returning the raw data. Its output is hard to read, so a more readable function, `pretty_topics()`, has been provided for you.

Import the `pretty_topics()` function from `demo` (in the lab directory). It will print the top 10 words and their probabilities per topic. Use this to examine the topics. (Remember: The word distribution is over the full vocabulary, not just frequent words. For a given topic, it sums to 1.)
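As a reminder of what "top words" and "sums to 1" mean here, consider this sketch on a made-up topic over a six-word vocabulary. The names `toy_topic` and `top_words` are hypothetical and are not taken from `demo.py`.

```python
# Illustrative sketch: a made-up topic over a six-word vocabulary.
# toy_topic and top_words are hypothetical names, not part of demo.py.
toy_topic = {
    "wage": 0.30, "price": 0.25, "industry": 0.20,
    "she": 0.10, "music": 0.10, "radiation": 0.05,
}

def top_words(topic, n):
    """Return the n most probable (word, probability) pairs of a topic."""
    return sorted(topic.items(), key=lambda kv: kv[1], reverse=True)[:n]

top3 = top_words(toy_topic, 3)
for word, prob in top3:
    print(f"\t{prob:.5f} {word}")

# The full distribution sums to 1; the top words cover only part of that mass.
print(round(sum(toy_topic.values()), 10))
print(round(sum(p for _, p in top3), 10))
```

With a vocabulary of thousands of words, the top 10 words of a real topic typically account for a much smaller share of the probability mass than in this toy case.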

In [21]:
from demo import pretty_topics
pretty_topics(lda30)

Topic 1:
	0.01414 states                
	0.01292 wage                  
	0.01051 state                 
	0.01030 industry              
	0.01025 price                 
	0.00932 increase              
	0.00809 rate                  
	0.00685 above                 
	0.00655 basic                 
	0.00646 increases             
Topic 2:
	0.00628 she                   
	0.00376 house                 
	0.00293 mrs                   
	0.00289 family                
	0.00278 school                
	0.00272 trial                 
	0.00244 american              
	0.00226 mr                    
	0.00225 music                 
	0.00217 home                  
Topic 3:
	0.02389 af                    
	0.00885 1                     
	0.00649 2                     
	0.00553 used                  
	0.00513 temperature           
	0.00506 data                  
	0.00440 system                
	0.00385 radiation             
	0.00365 fiscal                
	0.00350 3                     
Topic 4:
	0.0

To see more than 2 topics, update the variable `num_topics` in `pretty_topics()` in `demo.py`: open the file in a text editor, change the value, and save the file. Then restart this notebook with `Kernel` > `Restart & Run All` and wait for it to finish running.

Topics will not be exactly the same because of randomness in the model. 

Q6. How difficult/straightforward is it to interpret the topics? Do some make more sense than others? 

Q7. How would you name the topics that do make sense? 


## **Exercise 1.4b: Evaluating LDA - perplexity**

Now we will compute the **perplexity** of the model on the test data.

Remember that LDA is a **generative model**. After training the parameters of the model, let’s test
the likelihood of this model randomly generating the held-out test set.
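The arithmetic behind perplexity can be sketched on a toy unigram model. The data in `toy_model` and `held_out` are made up, and for LDA the per-word value is a bound estimated by the library rather than an exact likelihood, so this only illustrates how a per-word log-likelihood turns into a perplexity.

```python
# Illustrative sketch: perplexity of a toy unigram model on held-out words.
# toy_model and held_out are made-up data for this example.
import math

toy_model = {"court": 0.5, "jury": 0.25, "election": 0.25}
held_out = ["court", "jury", "court", "election"]

# Average log2-probability per held-out word (a per-word log-likelihood).
per_word_ll = sum(math.log2(toy_model[w]) for w in held_out) / len(held_out)

# Perplexity is 2 raised to the negated per-word log2-likelihood.
perplexity = 2 ** (-per_word_ll)

print(per_word_ll)
print(perplexity)
```

A model that assigns higher probability to the held-out words has a per-word log-likelihood closer to zero and therefore a lower perplexity.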

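The `.log_perplexity()` method of `LdaModel` returns a per-word likelihood bound on the given documents (a negative number). Perplexity is then computed as 2 raised to the negative of this bound, so lower perplexity is better.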

In [22]:
lg_perp = lda30.log_perplexity(testing)
lg_perp

-11.618870438686745

In [23]:
perp = 2 ** -(lg_perp) # Calculate perplexity 
perp

3145.056977639745

Your results may not be exactly the same, but should be reasonably close.

For comparison, evaluating a version of this trained model on an external dataset (not provided in this lab) yielded a perplexity of 42,224, far worse than on our test set drawn from the same corpus.

Q8. For LDA, do you think perplexity is linked to model interpretability?