<h1 align='center'>It Starts with a Research Question...</h1>
<img src='Nelson, Mining the Dispatch, NYTimes Opinionator.jpg' width="66%" height="66%">

# Topic Modeling in Python
<ul><li>Review/Preview</li>
<ul><li>String Methods</li>
<li>List Comprehensions</li>
<li>Concatenation</li>
<li>Group DataFrame Rows</li>
</ul>
<li>Pre-Process</li>
<ul><li>Import Corpus</li>
<li>Tokenize</li>
<li>Feature Selection</li>
</ul>
<li>Topic Model</li>
<li>Interpreting the Model</li>
<ul><li>Visualization</li></ul>
<li>Revising Model Inputs</li>
</ul>

### Corpus Description
English-language subset of Andrew Piper's novel corpus, totaling 150 novels by British and American authors spanning the years 1771-1930. These texts reside on disk, each in a separate plaintext file. Metadata is contained in a spreadsheet distributed with the novel files.

The image at the top of this notebook comes from Robert K. Nelson's project <i>Mining the Dispatch</i>. The study uses topic modeling to explore the <i>Richmond Dispatch</i>, the Confederacy's paper of record during the American Civil War. It demonstrates the interpretive power of laying out topic distributions over a chronological axis.

### Metadata Columns
<ol><li>Filename: Name of file on disk</li>
<li>ID: Unique ID in Piper corpus</li>
<li>Language: Language of novel</li>
<li>Date: Initial publication date</li>
<li>Title: Title of novel</li>
<li>Gender: Authorial gender</li>
<li>Person: Textual perspective</li>
<li>Length: Number of tokens in novel</li></ol>

# 0. Review/Preview

In [None]:
%pylab inline
style.use('ggplot')

import pandas
import nltk

modules = ["words", "stopwords"]
nltk.download(modules)

### String Methods

In [None]:
# Assign a string to a variable

a_token = 'Spam'

In [None]:
# Make it lower case

a_token.lower()

In [None]:
# Test whether the original word was lower case

a_token.islower()

In [None]:
# Test whether the original word is title case

a_token.istitle()

In [None]:
# Is it alphabetical?

a_token.isalpha()

In [None]:
# New token with punctuation

excited_token = 'Spam!'

In [None]:
# Still counts as title case?

excited_token.istitle()

In [None]:
# Still counts as alphabetical?

excited_token.isalpha()

In [None]:
# How long is it?

len(excited_token)

In [None]:
# Longer than that first token, right?

len(excited_token) > len(a_token)

### List Comprehensions

In [None]:
# Let's use a longer string

script = "Man: Well, what've you got? Waitress: Well, there's egg and bacon; egg sausage and bacon; \
egg and spam; egg bacon and spam; egg bacon sausage and spam; spam bacon sausage and spam; \
spam egg spam spam bacon and spam; spam sausage spam spam bacon spam tomato and spam; \
spam spam spam egg and spam; spam spam spam spam spam spam baked beans spam spam spam; \
...or Lobster Thermidor au Crevette with a Mornay sauce served in a Provencale manner with shallots \
and aubergines garnished with truffle pate, brandy and with a fried egg on top and spam."

In [None]:
# Split it into tokens

script.split()

In [None]:
# Assign to a variable

script_tokens = script.split()

In [None]:
# Get the title case tokens

[token for token in script_tokens if token.istitle()]

In [None]:
# Get the NOT title case tokens

[token for token in script_tokens if not token.istitle()]

In [None]:
# Multiple conditions!

[token for token in script_tokens if token.isalpha() and not token.istitle()]

In [None]:
# List in which each entry has two elements

paired_entries = [['yes','no'],['yes','no'],['yes','no']]

In [None]:
# Inspect

paired_entries[0]

In [None]:
# Basic list comprehension format

[pair for pair in paired_entries]

In [None]:
# Call up entry items individually

[first for first, second in paired_entries]

In [None]:
[second for first, second in paired_entries]

In [None]:
## EX. For the list below, return the first number in each pair.

## EX. For the list below, return the first number in each pair if it is greater than the second number.

In [None]:
new_pairs = [ [1,12], [15,6], [18,17], [3,9], [21,16] ]

### Concatenation

In [None]:
# Two lists in which each element is also a list

list_1 = [['bacon','eggs','sausage'],['tomato','beans','lobster']]
list_2 = [['spam','spam','spam'],['spam','spam','shallots']]

In [None]:
# A list of lists

list_1

In [None]:
# Concatenate the lists

list_1 + list_2

In [None]:
# Make a dataframe with the concatenated lists
# Number of rows matches the length of the concatenated list

pandas.DataFrame(list_1 + list_2)

In [None]:
# Convert each list to a dataframe individually

df_1 = pandas.DataFrame(list_1)
df_2 = pandas.DataFrame(list_2)

In [None]:
# It's a dataframe!

df_1

In [None]:
# Concatenate the lists as columns rather than rows

pandas.concat([df_1,df_2], axis=1)

### Group DataFrame Rows

In [None]:
# Make a new dataframe

columns = ['Menu Item','With Spam?', 'Price']
values = [['Lobster','Yes',12],
          ['Eggs','Yes',6],
          ['Beans','No',5],
          ['Bacon','No',2],
          ['Bacon','Yes',3]]
menu_df = pandas.DataFrame(values, columns = columns)

In [None]:
# It's a menu!

menu_df

In [None]:
# Get basic statistics for numeric columns

menu_df.describe()

In [None]:
# But what if I want to divvy up those stats based on whether there is spam in the dish?

menu_df.groupby('With Spam?').describe()

In [None]:
# What if I just want one of the stats
# Count is handy to see how many items are in each category

menu_df.groupby('With Spam?').count()

In [None]:
# Just one column of interest

menu_df.groupby('With Spam?').count()['Menu Item']

In [None]:
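# Of course, the average is very handy
# numeric_only=True skips the text column 'Menu Item', which recent pandas versions refuse to average

menu_df.groupby('With Spam?').mean(numeric_only=True)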

In [None]:
# And why not graph it

menu_df.groupby('With Spam?').mean(numeric_only=True).plot(kind='bar')

In [None]:
## EX. Import the corpus metadata in the cell below. Get a list of the number of books
##     published in each year.

## EX. Find the average text length by year. Graph this.

In [None]:
metadata_df = pandas.read_csv('txtlab_Novel150_English.csv')

# 1. Pre-Process

Typically, pre-processing means importing a corpus and then converting it into a Document-Term Matrix. However, gensim (our Topic Modeling package) prefers to receive texts as token lists. It then converts the vocabulary of the corpus into a <i>dictionary</i> (not to be confused with the Python datatype) that maps words to unique IDs.

We perform feature selection by subtracting words from that dictionary. Topic Modeling is especially sensitive to stopwords, proper names, and errors introduced by digitization, so we will make a point of removing those tokens.
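
To make the idea concrete, here is a minimal pure-Python sketch of what such a dictionary and its bag-of-words conversion do. The names `toy_docs`, `token2id`, and `to_bow` are invented for illustration; gensim's own dictionary is more capable and may assign different IDs.

```python
# Toy corpus: each document is already a list of tokens
toy_docs = [['pride', 'and', 'prejudice'],
            ['sense', 'and', 'sensibility'],
            ['pride', 'pride', 'sense']]

# Map each distinct token to a unique integer ID (in order of first appearance)
token2id = {}
for doc in toy_docs:
    for token in doc:
        if token not in token2id:
            token2id[token] = len(token2id)

def to_bow(tokens, mapping):
    """Return sorted (token_id, count) pairs, skipping tokens not in the mapping."""
    counts = {}
    for token in tokens:
        if token in mapping:
            counts[mapping[token]] = counts.get(mapping[token], 0) + 1
    return sorted(counts.items())

# Feature selection: drop a stopword from the mapping, then convert a document
del token2id['and']
bow = to_bow(['pride', 'and', 'prejudice', 'pride'], token2id)
print(bow)
```

Because 'and' was removed from the mapping, it simply disappears from the bag-of-words representation; this is the sense in which we "subtract" words from the dictionary.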

### Import Corpus

In [None]:
# Read metadata

metadata_df = pandas.read_csv('txtlab_Novel150_English.csv')

In [None]:
# Inspect

metadata_df

In [None]:
# Set location of corpus folder

fiction_path = 'txtlab_Novel150_English/'

In [None]:
# Import Corpus

novel_list = [open(fiction_path+file_name).read() for file_name in metadata_df['filename']]

In [None]:
# Inspect

novel_list[0]

### Tokenize

In [None]:
# Split each novel into a list of tokens

novel_tokens_list = [novel.lower().split() for novel in novel_list]

In [None]:
# Inspect tokens from first novel

novel_tokens_list[0]

### Feature Selection (Gensim Dictionary)

In [None]:
# Import Topic Model package

import gensim

In [None]:
# Create dictionary based on corpus tokens
# Each token is mapped to its own unique ID

dictionary = gensim.corpora.dictionary.Dictionary(novel_tokens_list)

In [None]:
# Map lists of tokens to the dictionary IDs

dictionary.doc2bow(['pride','prejudice', 'pride'])

In [None]:
# Remove stopwords & (some!) proper names from dictionary

from nltk.corpus import stopwords, words

In [None]:
# Our trusty list of stop words

stopwords.words('english')

In [None]:
# List of common English-language words, typically used for autocorrect

words.words()

In [None]:
# Proper name test

'Ishmael' in words.words()

In [None]:
# Find proper names by looking for title-case words, then make lower case

proper_names = [word.lower() for word in words.words() if word.istitle()]

In [None]:
# The list of all words in the dictionary

list(dictionary.values())

In [None]:
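# Noise tokens: anything non-alphabetic (digits, punctuation, digitization debris) or two characters or fewer

noise_tokens = [word for word in dictionary.values() if not word.isalpha() or len(word) <= 2]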

In [None]:
# Collect stop words, proper names, and noise tokens together

bad_words = stopwords.words('english') + proper_names + noise_tokens

In [None]:
# Rather than passing a list of stopwords to gensim, we pass in their dictionary ids

dictionary.doc2bow(bad_words)

In [None]:
# Map stopwords, proper names to dictionary IDs

stop_ids = [_id for _id, count in dictionary.doc2bow(bad_words)]

In [None]:
# Inspect

stop_ids

In [None]:
# Remove stopwords from dictionary mappings

dictionary.filter_tokens(bad_ids = stop_ids)

In [None]:
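# Remove terms by document frequency: drop any term found in fewer than 40 documents
# (about a quarter of the corpus). By default, filter_extremes also drops terms
# found in more than half of all documents.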

dictionary.filter_extremes(no_below = 40)
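
The document-frequency filter can be sketched in plain Python. This is an illustration only, not gensim's implementation: `toy_docs`, `min_docs`, and `max_fraction` are hypothetical names, with the last two standing in for the `no_below` and `no_above` arguments.

```python
from collections import Counter

toy_docs = [['whale', 'sea', 'ship', 'said'],
            ['sea', 'ship', 'said', 'captain'],
            ['ball', 'dance', 'said', 'sea'],
            ['ball', 'said', 'letter']]

# Count the number of documents each token appears in (not total occurrences)
doc_freq = Counter(token for doc in toy_docs for token in set(doc))

min_docs = 2        # keep tokens found in at least this many documents
max_fraction = 0.5  # ...and in no more than this fraction of documents

kept = sorted(token for token, df in doc_freq.items()
              if df >= min_docs and df / len(toy_docs) <= max_fraction)
print(kept)
```

Rare tokens (often names or digitization errors) fall below the lower threshold, while tokens that appear nearly everywhere fall above the upper one and tell us little about any particular topic.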

## Bag-of-Words

In [None]:
# Create list of dictionary mappings by novel
# This is gensim's version of a document-term matrix

corpus = [dictionary.doc2bow(text) for text in novel_tokens_list]

In [None]:
# Inspect first text's representation

corpus[0]

# 2. Topic Model

### Latent Dirichlet Allocation (LDA) Models
LDA reflects an intuition that words in a text are not merely chosen at random but are drawn from underlying concepts (the so-called "latent variables"). The goal of LDA is to look across many texts in order to reverse engineer these concepts by finding words that tend to cluster with one another. For this reason, LDA has been referred to as "the mother of all word collocation techniques."
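
The intuition can be sketched as a toy generative process: to write each word, a document first picks a topic according to its own topic mixture, then picks a word from that topic. The topics, words, and probabilities below are invented for illustration. LDA works in the opposite direction, estimating topics and mixtures from the observed words, and that inference step is not shown here.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Two hypothetical topics, each a probability distribution over words
topics = {
    'seafaring': {'whale': 0.5, 'ship': 0.3, 'sea': 0.2},
    'courtship': {'ball': 0.4, 'letter': 0.4, 'marriage': 0.2},
}

# A hypothetical document that is mostly about seafaring
doc_topic_mix = {'seafaring': 0.8, 'courtship': 0.2}

def draw(distribution):
    """Pick one key at random, weighted by its probability."""
    keys = list(distribution)
    weights = [distribution[key] for key in keys]
    return random.choices(keys, weights=weights)[0]

# Generate a document: for each word slot, pick a topic, then a word from it
generated = [draw(topics[draw(doc_topic_mix)]) for _ in range(12)]
print(generated)
```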

### Topic Model Features
<ul><li>Corpus: Pre-processed textual corpus</li>
<li>Number of Topics: Choosing this is the art of Topic Modeling </li>
<li>Alpha (Hyperparameter): Prior, reflecting expected distribution of topics over documents</li>
<li>Iterations: TM initially uses random distribution, iteratively tweaks model</li>
<li>Passes: Number of complete passes through the corpus during training; primarily seen in Gensim implementation</li></ul>
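
To get a feel for the alpha hyperparameter, the sketch below draws topic mixtures from a symmetric Dirichlet prior at three alpha values. The function name `sample_dirichlet` and the chosen values are assumptions made for illustration; note also that `alpha='auto'` in gensim learns an asymmetric prior from the data rather than fixing a single value like this.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

def sample_dirichlet(alpha, num_topics):
    """Draw one topic mixture from a symmetric Dirichlet prior."""
    draws = [random.gammavariate(alpha, 1.0) for _ in range(num_topics)]
    total = sum(draws)
    return [d / total for d in draws]

# One sampled document-topic mixture per alpha value
mixtures = {alpha: sample_dirichlet(alpha, num_topics=5) for alpha in (0.1, 1.0, 10.0)}

for alpha, mixture in mixtures.items():
    print(alpha, [round(p, 2) for p in mixture])
```

Small alpha values tend to produce documents dominated by one or two topics, while large values tend to spread probability more evenly across all topics.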

### Training

In [None]:
# Train Topic Model
lda_model = gensim.models.LdaModel(corpus, num_topics=25, alpha='auto', \
                                   id2word=dictionary, iterations=2500, passes = 4)

In [None]:
# If you have more than two cores at your disposal, then perhaps try:

#lda_model = gensim.models.ldamulticore.LdaMulticore(corpus, num_topics=25, \
#                                                    id2word=dictionary, iterations=2500, passes = 4)

### Topics

In [None]:
# Quick look at n topics among those inferred

lda_model.show_topics(10)

In [None]:
# Deeper look at particular topic

lda_model.show_topic(8, topn=10)

In [None]:
## EX. Return a list of the top 20 words for topic 0.
##     Return a list of all words for topic 0.

## EX. Using the 'show_topics' method, try to find a topic in your model that is similar to
##     one in the model of the person sitting next to you. How closely related do the topics seem?

## CHALLENGE: Create a table that contains all topic-term distributions.
##            Make each row a certain topic and label each column by the word it represents.

### Documents

In [None]:
# Most prominent topics in a given document

lda_model.get_document_topics(corpus[0])

In [None]:
# Distribution of all topics over a document

lda_model.get_document_topics(corpus[0], minimum_probability=0)

In [None]:
## EX. Return a list of the most prominent topics in document 10.
##     What terms are most prominent in those topics?

## EX. Compare your answers to the previous exercise with a classmate.
##     Do similar topics come up? Different ones?

### Corpus

In [None]:
# Measure of model's "fit" to corpus data
# Related to the probability of seeing texts like the ones in our corpus given inferred model

lda_model.log_perplexity(corpus)
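
The number returned above is a per-word likelihood bound on a log (base 2) scale, so it is negative and hard to read directly. A common convenience is to convert it into a perplexity estimate, where lower values indicate a better fit. The value of `per_word_bound` below is a made-up stand-in, not output from our model.

```python
# Hypothetical value standing in for the output of lda_model.log_perplexity(corpus)
per_word_bound = -8.5

# Perplexity estimate: two raised to the negated per-word bound
perplexity = 2 ** (-per_word_bound)
print(round(perplexity, 1))
```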

In [None]:
# Topics ranked by coherence score, with the most coherent topics first

lda_model.top_topics(corpus)

# 3. Interpreting the Model

### Metadata
There are many strategies that can be used to interpret the output of a topic model. In this case, we will visualize topics over time in order to look for patterns.
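
The core of that strategy is an average per year: collect each novel's probability for a given topic, group by publication date, and take the mean. Here is a minimal sketch in plain Python with invented numbers; `doc_years_probs` and `annual_means` are hypothetical names, and the cells that follow do the same thing with pandas.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (year, probability of one topic) pairs, one per novel
doc_years_probs = [(1813, 0.10), (1813, 0.30), (1851, 0.60), (1851, 0.40), (1900, 0.05)]

# Group the probabilities by publication year
by_year = defaultdict(list)
for year, prob in doc_years_probs:
    by_year[year].append(prob)

# Average within each year, in chronological order
annual_means = {year: mean(probs) for year, probs in sorted(by_year.items())}
print(annual_means)
```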

In [None]:
# Create list of all document-topic distributions

list_of_doctopics = [lda_model.get_document_topics(text, minimum_probability=0) for text in corpus]

In [None]:
# Inspect

list_of_doctopics[0]

In [None]:
# In the list above, each topic got represented as a tuple containing
# the label of the topic and its probability within the given document

# Create list containing only the probabilities (remains ordered by topic label)
list_of_probabilities = [[probability for label,probability in distribution] for distribution in list_of_doctopics]

In [None]:
# Labels removed!

list_of_probabilities[0]

In [None]:
# Reformat as a DataFrame
# Each row is a given text; each column holds one topic's probability in that text

pandas.DataFrame(list_of_probabilities)

In [None]:
# Assign to variable

proba_distro_df = pandas.DataFrame(list_of_probabilities)

In [None]:
# Concatenate our dataframe of metadata with the new one of document-topic distributions

pandas.concat([metadata_df, proba_distro_df], axis=1)

In [None]:
# Reassign concatenated dataframe to 'metadata_df'

metadata_df = pandas.concat([metadata_df, pandas.DataFrame(list_of_probabilities)], axis=1)

In [None]:
# Group the rows of our dataframe by the date of each book's publication
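# Get the average of each numerical value listed for that year
# numeric_only=True skips text columns such as the title, which recent pandas versions refuse to average

metadata_df.groupby('date').mean(numeric_only=True)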

In [None]:
# Assign this to a new variable so we can play around with it easily

annual_means_df = metadata_df.groupby('date').mean(numeric_only=True)

In [None]:
# Inspect mean topic distribution by year

annual_means_df[8]

In [None]:
# Plot mean topic distribution by year

annual_means_df[8].plot(kind='bar', figsize=(8,8))

In [None]:
# And let's glance back at the most prominent terms in that topic

lda_model.show_topic(8)

# 4. Alternate Model Inputs

In [None]:
## Q.  Some proper names and titles still came through our filter.
##     How might you remove names in a more targeted way?

## EX. For Matt Jockers's study of literary theme in 'Macroanalysis',
##     he included only nouns for topic modeling. Use a POS tagger
##     to remove all words from the corpus that are not common nouns.

## EX. Jockers also found it useful to split texts into 1000-noun chunks
##     after the POS filter. Run the topic model over these smaller chunks.
##     Do the topics appear different?