# An Introduction to Topic Modeling using TED Transcripts

Topic modeling is an unsupervised machine learning technique in which documents (spans of text of arbitrary length) are grouped together based on the co-occurrence patterns of their words. Initially developed by computer scientists and statisticians to further information organization and retrieval goals, it has since been adopted by digital humanists for historical and literary research. Despite the name applied to these clusters, it has been noted that the semantic coherence between the terms that constitute a 'topic', when it occurs, is a coincidental function of use. That is, humans often write thematically, and there happens to be a statistical pattern capable of explaining their word choices. Latent Dirichlet Allocation (LDA), the topic modeling algorithm favored by humanists, is not capable of grouping documents based on semantic similarities beyond detecting this non-essential, though common, attribute of writing. Blei et al. acknowledge as much in a footnote in their seminal paper [1]. A more accurate alternative descriptor might be 'discourse,' since topic modeling algorithms pick up on all types of patterns in word use, not just semantic patterns [2].

LDA is very complicated in abstract form but fairly easy to grasp intuitively. The following example, using transcripts of TED talks, will help us gain an understanding of the concept. First, let's load the libraries that we will need for this example.

In [1]:
from __future__ import division

import gensim
import pyLDAvis.gensim
import spacy
import pandas as pd


from gensim import corpora, models, similarities

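# Keep the tagger: token.pos_ is needed later to select nouns. The parser and NER are not used.
# Newer spaCy releases no longer accept the 'en' shortcut and need a full model name such as 'en_core_web_sm'.
nlp = spacy.load('en', disable=['parser', 'ner'])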
pyLDAvis.enable_notebook()

Next, let's read the transcripts into a pandas dataframe and print out the first five transcripts.

In [2]:
df = pd.read_csv('datasets/transcripts.csv')
df['transcript'][0:5]

0    Good morning. How are you?(Laughter)It's been ...
1    Thank you so much, Chris. And it's truly a gre...
2    (Music: "The Sound of Silence," Simon & Garfun...
3    If you're here today — and I'm very happy that...
4    About 10 years ago, I took on the task to teac...
Name: transcript, dtype: object

## 1. Preprocessing

LDA assumes that the documents, in this case the TED transcripts, were generated by a fixed number of 'topics' that existed beforehand. The number is arbitrary in the sense that you choose it yourself, by guesswork or at best by intuition. In reality, there is no such thing as a topic; there are documents, words, and relationships (document to document, word to document, and word to word). LDA exploits these relationships to assign documents to the pre-existent topics. It can detect patterns of co-occurrence and assign words that commonly occur together to the same topic. These words might occur together because human writers associate them with the same topic, or for other reasons.

The LDA process can be summarized as follows:

1. Choose a number of topics, k.
2. Randomly assign each word in each document to one of the k topics. This will result in a very poor representation of topics.
3. Improve the topics by going through each word w in each document d and, for each topic t, computing the probability of topic t given document d (i.e. the proportion of words in document d that are currently assigned to topic t) and the probability of word w given topic t (i.e. the proportion of assignments to topic t, over all documents, that come from word w) [3]. The method assumes that all topic assignments except the one for the current word are correct, and reassigns that word based on the others.

To put it simply, after the initial random assignments, the algorithm reassigns words to topics based on two questions: How often does the topic occur in the document? How often is the word assigned to the topic? If the word in question is rarely assigned to the topic it was randomly associated with, it is likely to be reassigned; if the topic it was initially assigned to has a small influence on the document, it is likely to be reassigned. Repeating this process a large number of times produces fairly stable "topics".
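The reassignment procedure described above can be sketched in a few lines of plain Python. This is only a toy sampler written for illustration: the function name `toy_lda`, the smoothing constants `alpha` and `beta`, and the four miniature documents are assumptions, not part of the notebook. It is also not what Gensim runs, since Gensim's `LdaModel` uses a different inference scheme (online variational Bayes).

```python
import random
from collections import Counter

def toy_lda(docs, k, passes=50, alpha=0.1, beta=0.01, seed=0):
    """Toy sampler: returns per-word topic assignments and per-topic word counts."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    # Step 2: randomly assign each word in each document to one of k topics.
    assignments = [[rng.randrange(k) for _ in doc] for doc in docs]

    # Count tables behind the two proportions described in the text.
    doc_topic = [Counter(z) for z in assignments]  # words in doc d assigned to topic t
    topic_word = [Counter() for _ in range(k)]     # times word w is assigned to topic t
    topic_total = [0] * k                          # all assignments to topic t
    for doc, z in zip(docs, assignments):
        for w, t in zip(doc, z):
            topic_word[t][w] += 1
            topic_total[t] += 1

    # Step 3: reassign each word in turn, treating all other assignments as correct.
    for _ in range(passes):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                old = assignments[d][i]
                # Take the current word out of the counts.
                doc_topic[d][old] -= 1
                topic_word[old][w] -= 1
                topic_total[old] -= 1

                # Weight of topic t is proportional to
                # p(topic t | document d) * p(word w | topic t), with smoothing.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta) / (topic_total[t] + beta * vocab_size)
                    for t in range(k)
                ]
                new = rng.choices(range(k), weights=weights)[0]

                # Put the word back under its (possibly new) topic.
                assignments[d][i] = new
                doc_topic[d][new] += 1
                topic_word[new][w] += 1
                topic_total[new] += 1
    return assignments, topic_word

# Four invented miniature documents, two per theme.
docs = [
    ["cell", "cancer", "gene", "cell"],
    ["energy", "fuel", "oil", "energy"],
    ["gene", "cell", "cancer"],
    ["oil", "fuel", "energy"],
]
assignments, topic_word = toy_lda(docs, k=2)
for t, counts in enumerate(topic_word):
    print(t, [w for w, c in counts.most_common(3) if c > 0])
```

Because the sampler is random, the printed word lists depend on the seed; on a corpus this small the two themes will often, but not always, separate cleanly.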

In order to get LDA to produce something that resembles topics, we have to think through some preprocessing steps, such as removing stop words and reducing the documents to lists of nouns. We will use spaCy to preprocess the transcripts. Let's create a new column in our dataframe and parse the transcripts using spaCy's English language model.

In [3]:
df['parsed_transcript'] = df['transcript'].apply(nlp)

spaCy has a built-in stop word list that does well enough on modern English text, but there are often corpus-specific words that can produce misleading results. Because the TED corpus is transcribed, there are narrative tokens that are not spoken words but audience actions, tokens such as "(laughter)" and "(applause)." If these tokens are not removed, one might conclude that TED talks are often about laughter. There are also words that seem to appear in most topics; these words are so general to the corpus that they should probably be filtered out. Let's add these terms to the stop words list.

Now let's remove the stop words and reduce the transcripts to lists of nouns, since nouns tend to convey the subject of a sentence. We will write a function that discards the stop words and everything but the nouns, apply that function to each transcript in the dataframe, and store the results in a new column, "df['list_of_words']". We can then print out the first five transcripts to see what they have become.

In [4]:
my_stop_words = ["thing", "people", "way", "year", "time", "lot", "day"]

In [5]:
for stopword in my_stop_words:
    lexeme = nlp.vocab[stopword]
    lexeme.is_stop = True

In [6]:
def preprocess_texts(texts_as_csv_column):
    """Take a dataframe column of parsed spaCy docs and return one list of noun lemmas per doc."""
    lemmas = []
    for doc in texts_as_csv_column:
        # Keep the lemmas of all nouns that are not stop words.
        # Note: is_stop is checked on the token as written, so an inflected form
        # such as "years" passes the filter and is then stored as the lemma "year".
        lemma = [token.lemma_ for token in doc if token.pos_ == 'NOUN' and not token.is_stop]
        lemmas.append(lemma)
    return lemmas

In [7]:
df['list_of_words'] = preprocess_texts(df['parsed_transcript'])
df['list_of_words'][0:5]

0    [morning, fact, theme, conference, evidence, c...
1    [honor, opportunity, stage, conference, commen...
2    [music, voice, mail, friend.(laughter)i've, te...
3    [today, development, sustainability, policy, a...
4    [year, task, development, student, year, insti...
Name: list_of_words, dtype: object
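The third row of the output above contains the token "friend.(laughter)i've", which suggests that the parenthesised stage directions are fused to the surrounding words and survive the stop word filter. One possible remedy, sketched here as an assumption rather than as part of the original workflow, is to strip parenthesised runs from the raw text before it is handed to spaCy. The helper name `strip_stage_directions` is hypothetical, and the pattern is blunt: it also removes any genuine parenthetical remark.

```python
import re

# Assumption: stage directions appear as parenthesised runs such as "(Laughter)".
STAGE_DIRECTION = re.compile(r"\([^()]*\)")

def strip_stage_directions(text):
    """Replace parenthesised runs with a space, then collapse repeated whitespace."""
    without = STAGE_DIRECTION.sub(" ", text)
    return re.sub(r"\s+", " ", without).strip()

sample = "Good morning. How are you?(Laughter)It's been great."
cleaned = strip_stage_directions(sample)
print(cleaned)  # Good morning. How are you? It's been great.
```

In the notebook this could be applied to the raw column, for example with `df['transcript'].apply(strip_stage_directions)`, before the parsing step.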

## 2. LDA with Gensim

We will use Gensim for topic modeling. Gensim uses a dictionary to map each word to an integer id, and a function called "doc2bow" to represent each document as a bag of words, retaining word counts at the expense of word order. Each document becomes a list of pairs that store a word's id together with the number of times the word occurs in the document. An entry might look like (536, 3), but in reality it is just an efficient way of storing information like: "How many times does the word *system* appear in the document? Three times." Let's build a dictionary, save it for later, and print it out.
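To make the (id, count) representation concrete, here is a plain-Python stand-in for the idea. It does not use Gensim; `build_vocab` and `to_bow` are hypothetical helpers, and the ids they assign (order of first appearance) need not match the ids Gensim would choose.

```python
from collections import Counter

def build_vocab(docs):
    """Map each distinct word to an integer id, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for word in doc:
            vocab.setdefault(word, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (word_id, count) pairs for the words of `doc` found in `vocab`."""
    counts = Counter(vocab[word] for word in doc if word in vocab)
    return sorted(counts.items())

# Two invented documents, already reduced to lists of words.
docs = [
    ["system", "user", "system", "interface", "system"],
    ["user", "survey", "time"],
]
vocab = build_vocab(docs)
bow = to_bow(docs[0], vocab)
print(vocab["system"], bow)  # 0 [(0, 3), (1, 1), (2, 1)]
```

The pair (0, 3) answers the question from the text: the word with id 0, *system*, appears three times in the first document. Word order is gone, which is exactly the trade-off a bag of words makes.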

In [8]:
dictionary = gensim.corpora.Dictionary(df['list_of_words'])
dictionary.save('/tmp/TED.dict')
print(dictionary)

Dictionary(28050 unique tokens: ['pup', 'physique', 'forbearance', 'antler', 'stopwatch']...)


Now we'll apply the "doc2bow" transformation to every transcript, save the resulting corpus for later, and print out the first two documents.

In [13]:
corpus = [dictionary.doc2bow(transcript) for transcript in df['list_of_words']]
corpora.MmCorpus.serialize('/tmp/TED.mm', corpus)
print(corpus[0:2])

[[(0, 1), (1, 1), (2, 2), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 4), (12, 3), (13, 1), (14, 1), (15, 1), (16, 2), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 3), (23, 1), (24, 1), (25, 5), (26, 3), (27, 1), (28, 1), (29, 6), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 9), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 2), (47, 1), (48, 2), (49, 1), (50, 2), (51, 1), (52, 1), (53, 1), (54, 1), (55, 5), (56, 1), (57, 4), (58, 2), (59, 1), (60, 1), (61, 1), (62, 3), (63, 1), (64, 1), (65, 2), (66, 1), (67, 1), (68, 1), (69, 1), (70, 1), (71, 1), (72, 1), (73, 3), (74, 1), (75, 2), (76, 22), (77, 1), (78, 1), (79, 2), (80, 1), (81, 2), (82, 1), (83, 1), (84, 2), (85, 2), (86, 1), (87, 1), (88, 1), (89, 3), (90, 2), (91, 2), (92, 2), (93, 4), (94, 1), (95, 1), (96, 6), (97, 1), (98, 3), (99, 5), (100, 1), (101, 2), (102, 1), (103, 1), (104, 1), (105, 4), (106, 1), (107, 3), (108, 2), (109, 1), (110, 1

Now let's run the LDA model and see what "topics" it generates. We'll use 20 topics, make 60 passes over the corpus, and update the model after each chunk of documents (here, chunks of up to 10,000).

In [12]:
lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=20, update_every=1, chunksize=10000, passes=60)

Let's print out the topics and save the model for later.

In [14]:
lda.print_topics(20)

[(0,
  '0.025*"building" + 0.023*"design" + 0.016*"space" + 0.014*"idea" + 0.013*"place" + 0.013*"project" + 0.012*"city" + 0.012*"kind" + 0.011*"world" + 0.010*"art"'),
 (1,
  '0.029*"virus" + 0.022*"vaccine" + 0.022*"story" + 0.020*"disease" + 0.013*"flu" + 0.011*"epidemic" + 0.010*"case" + 0.009*"worker" + 0.009*"bat" + 0.009*"trust"'),
 (2,
  '0.079*"cell" + 0.029*"cancer" + 0.019*"body" + 0.015*"molecule" + 0.013*"dna" + 0.013*"tissue" + 0.012*"gene" + 0.011*"♫♫" + 0.011*"protein" + 0.011*"technology"'),
 (3,
  '0.023*"energy" + 0.015*"power" + 0.011*"year" + 0.009*"thing" + 0.009*"car" + 0.008*"fuel" + 0.008*"oil" + 0.007*"electricity" + 0.007*"bit" + 0.007*"wind"'),
 (4,
  '0.018*"technology" + 0.015*"world" + 0.013*"dollar" + 0.012*"problem" + 0.012*"thing" + 0.012*"money" + 0.011*"life" + 0.009*"year" + 0.007*"cost" + 0.006*"self"'),
 (5,
  '0.046*"food" + 0.020*"plant" + 0.013*"animal" + 0.011*"farmer" + 0.011*"bee" + 0.009*"world" + 0.006*"farm" + 0.006*"crop" + 0.006*"natur

In [15]:
lda.save('lda_20-60.model')

Now let's explore our topics with a visualization tool called pyLDAvis. 

In [20]:
dictionary = corpora.Dictionary.load('/tmp/TED.dict')
corpus = corpora.MmCorpus('/tmp/TED.mm')
lda = gensim.models.ldamodel.LdaModel.load('lda_20-60.model')
pyLDAvis.gensim.prepare(lda, corpus, dictionary)


Interpreting topic models has been compared to reading tea leaves; it is more art than science [4]. Chang et al. designed the "word intrusion" and "topic intrusion" tests to measure the quality of topics. The former tests the ability of humans to identify a random word added to a topic. If the topic were sensible, they reasoned, it would be easy to identify a word that did not belong. The latter tests the human ability to recognize an intruder topic with respect to a document: given four topics, can the human identify which one is the intruder? High performance on this test indicates that the three accurately assigned topics are a relatively good representation of the contents of the document in question. While these tests are useful, many applications will not warrant such an outlay of human labor. In cases like these, we need a more intuitive method of evaluation.
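A word-intrusion question is simple to assemble once the top words of each topic are available. The sketch below is a simplified assumption of how such a question could be built, not the exact protocol of Chang et al.: the helper `word_intrusion_question` is hypothetical, and the intruder is drawn at random from the top words of the other topics. The word lists are copied from the first three topics printed earlier.

```python
import random

def word_intrusion_question(topics, topic_id, n_words=5, seed=0):
    """Return (shuffled_words, intruder) for one word-intrusion question."""
    rng = random.Random(seed)
    own = topics[topic_id][:n_words]
    # Candidate intruders: top words of other topics that this topic does not contain.
    candidates = [
        w
        for t, words in topics.items() if t != topic_id
        for w in words[:n_words] if w not in topics[topic_id]
    ]
    intruder = rng.choice(candidates)
    question = own + [intruder]
    rng.shuffle(question)
    return question, intruder

topics = {
    0: ["building", "design", "space", "idea", "place", "project"],
    1: ["virus", "vaccine", "story", "disease", "flu", "epidemic"],
    2: ["cell", "cancer", "body", "molecule", "dna", "tissue"],
}
question, intruder = word_intrusion_question(topics, topic_id=0)
print(question, "-> intruder:", intruder)
```

If readers reliably pick out the intruder, the five remaining words hang together well enough to be recognized as a group.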

In a talk on training and evaluating topic models, David Mimno mentions several qualitative aspects of topics that might be useful for us to keep in mind: topics with very few words are likely to be bad; general topics tend to constitute a low proportion of many documents, whereas specific topics might have a stronger association with fewer documents; topics with a high similarity to the whole corpus are likely to be uninteresting; and identifying words that share a topic but never co-occur (co-document coherence) can be useful for recognizing chimera topics, that is, two topics that have been merged together by the algorithm [5].
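The last of these checks lends itself to a short sketch: list the pairs of top words in a topic that never appear in the same document. The helper `never_cooccurring_pairs` and the four toy documents are invented for illustration; on real data the documents would be the preprocessed word lists.

```python
from itertools import combinations

def never_cooccurring_pairs(top_words, docs):
    """Return pairs of topic words that never appear together in any document."""
    doc_sets = [set(doc) for doc in docs]
    return [
        (a, b)
        for a, b in combinations(top_words, 2)
        if not any(a in s and b in s for s in doc_sets)
    ]

# Invented documents: "water" and "silk" share one, "refugee" never meets either.
docs = [
    ["water", "river", "plastic", "waste"],
    ["silk", "material", "water"],
    ["refugee", "choice", "word"],
    ["refugee", "word", "river"],
]
pairs = never_cooccurring_pairs(["water", "silk", "refugee"], docs)
print(pairs)  # [('water', 'refugee'), ('silk', 'refugee')]
```

A topic whose top words split into groups that never meet in any document is a candidate chimera and may separate if the number of topics is increased.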

From an intuitive look at the topics above, it is clear that topics 10, 16, and 18 share an affinity. Topic 12 seems to be about biology, and topic 11 might relate to astronomy and physics. Broadly speaking, they relate to the natural environment. What will happen if we reduce the number of topics to 10?

In [20]:
lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=10, update_every=1, chunksize=10000, passes=60)

In [21]:
lda.save('lda_10-60.model')

In [22]:
lda.print_topics(10)

[(0,
  '0.014*"thing" + 0.012*"design" + 0.011*"world" + 0.011*"building" + 0.010*"idea" + 0.009*"technology" + 0.009*"year" + 0.008*"kind" + 0.008*"project" + 0.007*"space"'),
 (1,
  '0.026*"brain" + 0.014*"cell" + 0.013*"patient" + 0.013*"cancer" + 0.013*"disease" + 0.010*"body" + 0.009*"health" + 0.009*"drug" + 0.008*"year" + 0.007*"thing"'),
 (2,
  '0.023*"woman" + 0.019*"life" + 0.018*"child" + 0.015*"school" + 0.015*"kid" + 0.014*"year" + 0.014*"man" + 0.011*"family" + 0.011*"world" + 0.009*"girl"'),
 (3,
  '0.043*"water" + 0.010*"word" + 0.008*"material" + 0.008*"plastic" + 0.007*"waste" + 0.006*"silk" + 0.006*"choice" + 0.005*"refugee" + 0.005*"body" + 0.005*"river"'),
 (4,
  '0.023*"city" + 0.017*"percent" + 0.016*"world" + 0.015*"country" + 0.013*"car" + 0.013*"year" + 0.010*"dollar" + 0.009*"problem" + 0.009*"market" + 0.008*"economy"'),
 (5,
  '0.019*"life" + 0.016*"story" + 0.012*"music" + 0.010*"love" + 0.008*"experience" + 0.008*"word" + 0.007*"mind" + 0.007*"moment" + 0

In [23]:
dictionary = corpora.Dictionary.load('/tmp/TED.dict')
corpus = corpora.MmCorpus('/tmp/TED.mm')
lda = gensim.models.ldamodel.LdaModel.load('lda_10-60.model')
pyLDAvis.gensim.prepare(lda, corpus, dictionary)


Topic 9 now seems to be a mixture of biology, astronomy, and the natural environment. Now let's see what happens if we change the number of topics to 30. Before running the model, think about what you would expect to happen.

In [24]:
lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=30, update_every=1, chunksize=10000, passes=60)

In [25]:
lda.save('lda_30-60.model')

In [26]:
lda.print_topics(30)

[(0,
  '0.027*"student" + 0.014*"problem" + 0.013*"question" + 0.012*"science" + 0.011*"guy" + 0.009*"math" + 0.009*"course" + 0.008*"word" + 0.008*"fact" + 0.008*"thing"'),
 (1,
  '0.021*"world" + 0.015*"government" + 0.015*"country" + 0.014*"power" + 0.010*"group" + 0.010*"society" + 0.008*"right" + 0.008*"democracy" + 0.008*"question" + 0.007*"problem"'),
 (2,
  '0.076*"child" + 0.034*"family" + 0.022*"sound" + 0.018*"refugee" + 0.013*"voice" + 0.013*"life" + 0.011*"parent" + 0.011*"kid" + 0.010*"mother" + 0.010*"world"'),
 (3,
  '0.016*"cloud" + 0.010*"color" + 0.010*"light" + 0.010*"eye" + 0.009*"image" + 0.009*"camera" + 0.008*"thing" + 0.008*"hand" + 0.008*"air" + 0.008*"material"'),
 (4,
  '0.037*"baby" + 0.017*"light" + 0.016*"monkey" + 0.015*"image" + 0.014*"mother" + 0.010*"male" + 0.010*"design" + 0.009*"puzzle" + 0.009*"thing" + 0.008*"female"'),
 (5,
  '0.021*"thing" + 0.017*"idea" + 0.015*"book" + 0.012*"art" + 0.011*"world" + 0.010*"work" + 0.010*"kind" + 0.010*"life" +

In [27]:
dictionary = corpora.Dictionary.load('/tmp/TED.dict')
corpus = corpora.MmCorpus('/tmp/TED.mm')
lda = gensim.models.ldamodel.LdaModel.load('lda_30-60.model')
pyLDAvis.gensim.prepare(lda, corpus, dictionary)


There is no ideal number of topics. The right number will change with the size of the dataset and the granularity of description desired. It will likely prove useful to try different numbers until things start to make sense.

## References:
[1] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet Allocation,” Journal of Machine Learning Research, vol. 3, no. Jan, pp. 993–1022, 2003.

[2] T. Underwood, “What can topic models of PMLA teach us about the history of literary scholarship?,” The Stone and the Shell, 14-Dec-2012. 

[3] D. M. Blei, “Probabilistic Topic Models,” Communications of the ACM, vol. 55, no. 4, pp. 77–84, 2012.

[4] J. Chang, S. Gerrish, C. Wang, J. L. Boyd-Graber, and D. M. Blei, “Reading Tea Leaves: How Humans Interpret Topic Models,” in Advances in Neural Information Processing Systems 22, Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta, Eds. Curran Associates, Inc., 2009, pp. 288–296.

[5] D. Mimno, “The Details: Training and Validating Big Models on Big Data,” Journal of Digital Humanities, vol. 2, no. 1, Winter 2012.