# An Introduction to Topic Modeling using TED Transcripts

Topic modeling is an unsupervised machine learning technique in which
documents (spans of text of arbitrary length) are grouped together based on
the co-occurrence patterns of their words. Initially developed by computer scientists and statisticians to further information organization and retrieval goals, it has since been
co-opted by digital humanists for historical and literary research. Despite
the name applied to these clusters, the semantic
coherence between the terms that constitute a 'topic', when it occurs, is a coincidental function of use.
That is, humans often write thematically, and there happens to be a statistical pattern capable of explaining their word choices. Latent Dirichlet Allocation (LDA), the topic modeling algorithm favored by humanists, cannot group documents by semantic
similarity beyond detecting this non-essential, though common, attribute
of writing. Blei et al. acknowledge as much in a footnote in their seminal
paper [1]. A more accurate alternative descriptor might be 'discourse,' since topic modeling algorithms pick up
on all types of patterns in word use, not just semantic patterns [2].

LDA is very complicated in abstract form but fairly easy to grasp intuitively. The following example, using transcripts of TED talks, will help us gain an understanding of the concept. First, let's load the libraries that we will need for this example.

In [2]:
from __future__ import division

import gensim
import pyLDAvis.gensim
import spacy
import pandas as pd


from gensim import corpora, models, similarities

# Keep the tagger enabled: the noun filter used later relies on token.pos_.
nlp = spacy.load('en', disable=['ner'])
pyLDAvis.enable_notebook()

Next, let's read the transcripts into a pandas dataframe and print out the first five transcripts.

In [2]:
df = pd.read_csv('datasets/transcripts.csv')
df['transcript'][0:5]

0    Good morning. How are you?(Laughter)It's been ...
1    Thank you so much, Chris. And it's truly a gre...
2    (Music: "The Sound of Silence," Simon & Garfun...
3    If you're here today — and I'm very happy that...
4    About 10 years ago, I took on the task to teac...
Name: transcript, dtype: object

## 1. Preprocessing

LDA assumes that the documents (in this case the TED transcripts) were generated by a fixed number of 'topics' that existed beforehand. That number is arbitrary: you choose it yourself, intuitively at best. In reality, there is no such thing as a topic; there are documents, words, and relationships (document to document, word to document, and word to word). LDA exploits these relationships to describe each document in terms of the pre-existing topics. It can detect patterns of co-occurrence and assign words that commonly occur together to the same topic. These words might occur together because human writers associate them with the same topic, or for other reasons.

The LDA process can be summarized as follows. First, choose a number of topics, k. Second, randomly assign each word in each document to one of the k topics; this results in a very poor representation of topics. Third, improve the topics by going through each word w in each document d and, for each topic t, computing two quantities: the probability of topic t given document d (i.e. the proportion of words in document d that are currently assigned to topic t), and the probability of word w given topic t (i.e. the proportion of assignments to topic t, over all documents, that come from word w) [3]. The method assumes that all topic assignments except that of the current word are correct, and reassigns the current word based on the others.

To put it simply, after the initial random assignments, the algorithm reassigns words to topics based on two questions: How often does the topic occur in the document? How often is the word assigned to the topic? If the word in question is rarely assigned to the topic it was randomly associated with, it is likely to be reassigned; if the topic it was initially assigned to has a small influence on the document, it is likewise likely to be reassigned. Repeating this process a large number of times produces fairly stable "topics".
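The reassignment loop described above can be sketched in a few lines of plain Python. This is a toy illustration only, not what Gensim runs: Gensim's `LdaModel` uses online variational Bayes rather than this sampler. The function name `gibbs_lda` and the smoothing constants `alpha` and `beta` are assumptions introduced here so that no probability is ever exactly zero.

```python
import random


def gibbs_lda(docs, num_topics, num_iters=50, alpha=0.1, beta=0.01, seed=0):
    """Toy sampler: repeatedly reassign each word to a topic (illustrative only)."""
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    vocab_size = len(vocab)

    # Count tables that summarise the current assignments.
    doc_topic = [[0] * num_topics for _ in docs]                       # words in doc d on topic t
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]  # word w on topic t
    topic_total = [0] * num_topics                                     # all words on topic t

    # Randomly assign every word in every document to a topic.
    assignments = []
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            t = rng.randrange(num_topics)
            doc_assignments.append(t)
            doc_topic[d][t] += 1
            topic_word[t][word] += 1
            topic_total[t] += 1
        assignments.append(doc_assignments)

    # Revisit each word, treating all other assignments as correct.
    for _ in range(num_iters):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Take the current word out of the counts.
                old = assignments[d][i]
                doc_topic[d][old] -= 1
                topic_word[old][word] -= 1
                topic_total[old] -= 1

                # Weight of topic t = (how common t is in this document)
                #                   * (how often this word is assigned to t).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][word] + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(num_topics)
                ]
                new = rng.choices(range(num_topics), weights=weights)[0]

                # Put the word back under its newly sampled topic.
                assignments[d][i] = new
                doc_topic[d][new] += 1
                topic_word[new][word] += 1
                topic_total[new] += 1

    return assignments, doc_topic, topic_word


toy_docs = [["fish", "ocean", "fish"], ["school", "teacher", "school", "kid"]]
toy_assignments, toy_doc_topic, toy_topic_word = gibbs_lda(toy_docs, num_topics=2)
```

On a corpus this small the resulting assignments are not meaningful; the point is only to show how the two questions (topic in document, word in topic) combine into a single weight for each candidate topic.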

In order to get LDA to produce something that resembles topics, we have to think through some preprocessing steps, such as removing stop words and reducing the documents to lists of nouns. We will use spaCy to preprocess the transcripts. Let's create a new column in our dataframe and parse the transcripts using spaCy's English language model.

In [3]:
df['parsed_transcript'] = df['transcript'].apply(nlp)

spaCy has a built-in stop word list that does well enough on modern English text, but there are often corpus-specific words that can produce misleading results. Because the TED corpus is transcribed, there are narrative tokens that are not spoken words but audience actions, such as "(laughter)" and "(applause)." If these tokens are not removed, one might conclude that TED talks are often about laughter. There are also words that appear in most topics; these words are so general to the corpus that they should probably be filtered out. Let's add these terms to the stop word list.

In [8]:
add_to_stop = ["(laughter)", "(applause)", "thing", "people", "way", "year", "♫♫", "♫", "time", "lot", "day"]
for token in add_to_stop:
    nlp.vocab[token].is_stop = True
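The noun lists printed further down contain a token like `friend.(laughter)i've`, which suggests that a cue written without surrounding spaces can be fused to its neighbours and slip past the stop word list. One complementary option, sketched here as an assumption rather than as part of the original notebook, is to strip the cues from the raw text before parsing. The helper name `strip_audience_cues` is hypothetical, and it only handles the two cues named above.

```python
import re

# Matches "(Laughter)" and "(Applause)" in any capitalisation.
CUE_PATTERN = re.compile(r"\((?:laughter|applause)\)", flags=re.IGNORECASE)


def strip_audience_cues(text):
    """Replace audience cues with a space, then collapse repeated whitespace."""
    without_cues = CUE_PATTERN.sub(" ", text)
    return re.sub(r"\s+", " ", without_cues).strip()


print(strip_audience_cues("Good morning. How are you?(Laughter)It's been great."))
# Good morning. How are you? It's been great.
```

If adopted, this would be applied to `df['transcript']` before the spaCy parsing step, for example with `df['transcript'].apply(strip_audience_cues)`.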

Now let's remove the stop words and reduce each transcript to a list of nouns, since nouns tend to convey the subject of a sentence. We will write a function that keeps only the lemmas of nouns that are not stop words, apply that function to the parsed transcripts, and store the results in a new column, df['spacy_transcript']. We can then print out the first five transcripts to see what they have become.

In [5]:
def preprocess_texts(parsed_docs):
    """Convert a pandas column of spaCy Docs into a list of noun-lemma lists."""
    lemmas = []
    for doc in parsed_docs:
        # Keep the lemmas of all nouns that are not stop words. Note that
        # is_stop is checked on the token as written, so an inflected form
        # such as "years" can still contribute the lemma "year".
        lemma = [token.lemma_ for token in doc
                 if token.pos_ == 'NOUN' and not token.is_stop]
        lemmas.append(lemma)
    return lemmas

In [9]:
df['spacy_transcript'] = preprocess_texts(df['parsed_transcript'])

In [10]:
df['spacy_transcript'][0:5]

0    [morning, theme, conference, evidence, creativ...
1    [honor, opportunity, stage, conference, commen...
2    [music, voice, mail, friend.(laughter)i've, te...
3    [today, development, sustainability, policy, a...
4    [year, task, development, student, year, insti...
Name: spacy_transcript, dtype: object

## 2. LDA with Gensim

We will use Gensim for topic modeling. Gensim uses a dictionary to map each word to an integer id, and a function called "doc2bow" to represent each document as a bag of words, retaining word counts at the expense of word order: each id is stored with the number of times the word occurs in the document. An entry in the resulting bag of words might look like (536, 3), but it is just an efficient way of storing information like: "How many times does the word *system* appear in the document? Three times." Let's build a dictionary, save it for later, and print it out.

In [4]:
dictionary = gensim.corpora.Dictionary(df['spacy_transcript'])
dictionary.save('/tmp/TED.dict')
print(dictionary)
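To make the id-and-count representation concrete, here is a minimal pure-Python sketch of the same idea. It is not Gensim's implementation; the helper names `build_token2id` and `to_bag_of_words` are made up for illustration, and the ids Gensim assigns to a given corpus may differ.

```python
from collections import Counter


def build_token2id(documents):
    """Assign each distinct token an integer id, in order of first appearance."""
    token2id = {}
    for doc in documents:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def to_bag_of_words(doc, token2id):
    """Return sorted (token_id, count) pairs; word order is discarded."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())


documents = [["system", "design", "system", "energy", "system"],
             ["energy", "water"]]
token2id = build_token2id(documents)
print(to_bag_of_words(documents[0], token2id))
# [(0, 3), (1, 1), (2, 1)]
```

Here the pair (0, 3) records that the word with id 0, *system*, appears three times in the first document.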

Now we'll apply the "doc2bow" transformation to every transcript and save the resulting corpus for later.

In [11]:
class MyCorpus(object):
    # Stream one bag-of-words vector at a time instead of holding them all in memory
    def __iter__(self):
        for transcript in df['spacy_transcript']:
            # Each transcript is already a list of noun lemmas
            yield dictionary.doc2bow(transcript)

In [13]:
corpus = MyCorpus()
# Save the streamed corpus to disk in Matrix Market format for later use
corpora.MmCorpus.serialize('/tmp/TED.mm', corpus)

Now let's run the LDA model and see what "topics" it generates. We'll use 20 topics, make 60 passes over the corpus, and update the model after every chunk of documents.

In [33]:
lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=20, update_every=1, chunksize=10000, passes=60)

Let's print out the topics and save the model for later.

In [36]:
lda.print_topics(20)

[(0,
  '0.030*"sound" + 0.014*"sleep" + 0.010*"noise" + 0.007*"dolphin" + 0.007*"stress" + 0.006*"area" + 0.006*"body" + 0.006*"brain" + 0.006*"kind" + 0.005*"language"'),
 (1,
  '0.031*"animal" + 0.019*"specie" + 0.019*"ocean" + 0.018*"fish" + 0.013*"water" + 0.010*"sea" + 0.010*"plant" + 0.009*"world" + 0.008*"lot" + 0.008*"shark"'),
 (2,
  '0.026*"book" + 0.023*"story" + 0.014*"love" + 0.013*"word" + 0.013*"thing" + 0.012*"life" + 0.009*"man" + 0.009*"world" + 0.008*"compassion" + 0.008*"idea"'),
 (3,
  '0.018*"car" + 0.013*"energy" + 0.012*"world" + 0.011*"water" + 0.011*"city" + 0.010*"design" + 0.010*"problem" + 0.009*"lot" + 0.009*"technology" + 0.009*"year"'),
 (4,
  '0.021*"building" + 0.017*"art" + 0.013*"space" + 0.012*"work" + 0.012*"project" + 0.012*"kind" + 0.012*"place" + 0.011*"thing" + 0.011*"design" + 0.011*"image"'),
 (5,
  '0.046*"child" + 0.040*"kid" + 0.039*"school" + 0.024*"student" + 0.018*"teacher" + 0.017*"education" + 0.012*"parent" + 0.012*"family" + 0.011*"

In [37]:
lda.save('lda.model')

Now let's explore our topics with a visualization tool called pyLDAvis. 

In [10]:
dictionary = corpora.Dictionary.load('/tmp/TED.dict')
corpus = corpora.MmCorpus('/tmp/TED.mm')
lda = gensim.models.ldamodel.LdaModel.load('lda.model')
pyLDAvis.gensim.prepare(lda, corpus, dictionary)



## References:
[1] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet Allocation,” Journal of Machine Learning Research, vol. 3, no. Jan, pp. 993–1022, 2003.

[2] T. Underwood, “What can topic models of PMLA teach us about the history of literary scholarship?,” The Stone and the Shell, 14-Dec-2012.

[3] D. M. Blei, “Probabilistic Topic Models,” Communications of the ACM, vol. 55, no. 4, pp. 77–84, 2012.