# The Joy of Topic Modeling

## Matt Burton (@mcburton)

- Visiting Assistant Professor 
- SIS & ULS
- PhD from Michigan
- Studied Digital Humanities Blogs

# The Plan

- What you do before you start topic modeling
- An intro to generative topic models
- What comes out the other end
- Then we'll do a thing!

# Data Centric Research Workflow
![Idealized Workflow](Data-science-workflow.png)

# Reality
![Real workflow](real-workflow.png)

# Data Work

![Data work](data-work.png)


# What **is** topic modeling?

### A method for finding *latent* patterns of co-occurrence within large amounts of data (most often, but not always, text)

### "Distant Reading" - Franco Moretti

### "Macroanalysis" - Matthew Jockers

### "non-consumptive reading" - some lawyer

# Topic modeling is *not* magic

# Text Pre-processing

# What do I mean when I say "words?"

# Words are transformed into numbers

## Documents and sentences are broken down and chopped up into little units called tokens, or unigrams (or n-grams).

##  “digital humanities” could be a bigram or two individual unigrams, “digital” and “humanities.”
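Here is a minimal sketch of that chopping-up step, assuming a very simple rule (lowercase everything, keep runs of letters). The function names `tokenize` and `ngrams` are made up for this example; real tools make more careful choices.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (unigrams)."""
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens, n):
    """Slide a window of size n over the tokens to build n-grams."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

unigrams = tokenize("Digital humanities")
bigrams = ngrams(unigrams, 2)
print(unigrams)  # ['digital', 'humanities']
print(bigrams)   # ['digital humanities']
```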

# Stopwords are removed

## "and", "but", or "or"

# "to be or not to be"
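A small sketch of why the stoplist matters. The first stoplist is the three words from the slide; the longer one is an assumption for illustration (real stoplists run to hundreds of words). With the longer list, Hamlet's line disappears entirely.

```python
# The three stopwords named above.
SHORT_STOPLIST = {"and", "but", "or"}
# A slightly longer, hypothetical stoplist for comparison.
LONGER_STOPLIST = SHORT_STOPLIST | {"to", "be", "not"}

def remove_stopwords(tokens, stopwords):
    """Keep only the tokens that are not in the stoplist."""
    return [t for t in tokens if t not in stopwords]

tokens = "to be or not to be".split()
short_result = remove_stopwords(tokens, SHORT_STOPLIST)
longer_result = remove_stopwords(tokens, LONGER_STOPLIST)
print(short_result)   # ['to', 'be', 'not', 'to', 'be']
print(longer_result)  # []
```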

# Stemming

## Are “model” and “models”  separate tokens or do we want to treat them as one token?
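To make the question concrete, here is a toy stemmer that only strips a trailing plural "s". It is a deliberately naive rule written for this example, not the Porter stemmer or any library's algorithm.

```python
def naive_stem(token):
    """Strip one trailing 's' from longer words, leaving 'ss' endings alone."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

tokens = ["model", "models", "class"]
stems = [naive_stem(t) for t in tokens]
print(stems)  # ['model', 'model', 'class']
```

After stemming, "model" and "models" count as the same token.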

# We typically don't use stemming when topic modeling

# What do I mean when I say "document?"

# A document is a bag of words

In [None]:
doc_one = "John likes to watch movies. Mary likes movies too."
doc_two = "John also likes to watch football games."

In [None]:
# the corpus dictionary
{
    "John": 1,
    "likes": 2,
    "to": 3,
    "watch": 4,
    "movies": 5,
    "also": 6,
    "football": 7,
    "games": 8,
    "Mary": 9,
    "too": 10
}

In [None]:
# the vector representation
doc_one_as_vec = [1, 2, 1, 1, 2, 0, 0, 0, 1, 1]
doc_two_as_vec = [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]

# The vector representation is what the computer chews on when topic modeling
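A sketch of how those vectors could be computed from the two example documents. The `vocabulary` list simply restates the corpus dictionary in ID order, and `to_vector` is a name made up for this example.

```python
import re
from collections import Counter

doc_one = "John likes to watch movies. Mary likes movies too."
doc_two = "John also likes to watch football games."

# Word order matches the corpus dictionary (IDs 1 through 10).
vocabulary = ["John", "likes", "to", "watch", "movies",
              "also", "football", "games", "Mary", "too"]

def to_vector(text, vocabulary):
    """Count how often each vocabulary word occurs, ignoring word order."""
    counts = Counter(re.findall(r"\w+", text))
    return [counts[word] for word in vocabulary]

doc_one_as_vec = to_vector(doc_one, vocabulary)
doc_two_as_vec = to_vector(doc_two, vocabulary)
print(doc_one_as_vec)  # [1, 2, 1, 1, 2, 0, 0, 0, 1, 1]
print(doc_two_as_vec)  # [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
```

Word order is thrown away: only the counts survive, which is what "bag of words" means.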

# In Topic Modeling, documents are composed from a *mixture* of "topics"

# OK, so what is a "topic?"

# A "topic" is a probability distribution over words

![A "topic" is a distribution over words](files/word-distribution2.png)

# Basically a topic is a bag that contains all of the words, but more of some words than others
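A minimal sketch of that bag, assuming a tiny four-word vocabulary. The words and probabilities are invented for illustration; a real topic covers every word in the corpus.

```python
import random

# Hypothetical topic: one probability per word, summing to 1.
topic = {"digital": 0.4, "humanities": 0.3, "scholarship": 0.2, "football": 0.1}

def draw_words(topic, n, seed=0):
    """Pull n words out of the bag, favoring high-probability words."""
    rng = random.Random(seed)
    words = list(topic)
    weights = [topic[w] for w in words]
    return rng.choices(words, weights=weights, k=n)

sample = draw_words(topic, 1000)
# "digital" should turn up far more often than "football".
print(sample.count("digital"), sample.count("football"))
```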

# OK, now we can talk about models!

# The most widely used topic model is Latent Dirichlet Allocation, or LDA

#  LDA models a *generative* process

-  Select a set of K topics from a probability distribution 
-  For each document
  -  Determine a topic mix
  -  For each word
    -  Select a topic from the mix
    -  Select an observed word from the topic
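The steps above can be sketched in a few lines of Python. This is only an illustration of the generative story, using the standard library; the function names, the `alpha` and `beta` values, and the toy vocabulary are all assumptions made for this example.

```python
import random

def dirichlet(rng, alphas):
    """Draw one probability vector from a Dirichlet distribution."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_corpus(vocab, num_topics, num_docs, doc_length,
                    alpha=0.5, beta=0.5, seed=0):
    rng = random.Random(seed)
    # Select K topics: each one is a distribution over the vocabulary.
    topics = [dirichlet(rng, [beta] * len(vocab)) for _ in range(num_topics)]
    corpus = []
    for _ in range(num_docs):
        # Determine this document's topic mix.
        mix = dirichlet(rng, [alpha] * num_topics)
        doc = []
        for _ in range(doc_length):
            # Select a topic from the mix, then a word from that topic.
            k = rng.choices(range(num_topics), weights=mix)[0]
            doc.append(rng.choices(vocab, weights=topics[k])[0])
        corpus.append(doc)
    return topics, corpus

vocab = ["john", "mary", "movies", "football", "games", "likes"]
topics, corpus = generate_corpus(vocab, num_topics=2, num_docs=3, doc_length=8)
for doc in corpus:
    print(" ".join(doc))
```

The output is word salad, which is the point: the model only cares about which words show up together, not their order.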


# Basically you "author" a document by repeatedly selecting a word token from a series of different bags-of-words

# What I've just described is *not* what actually happens when you *do* topic modeling (it is merely the process being modeled)

# Computering, or more technically "estimating" these numbers with MATH is what happens when you *do* topic modeling

# The parameters (the various bags-of-words) must be *estimated*
![Estimating the model](files/estimating.png)

# The "parameters" of the model are the *K* topic distributions and each document's topic mixture

# There are multiple ways to estimate the parameters using *fancy math.* 
## "Variational Inference" and "Gibbs Sampling" are the most common.
## MALLET uses Gibbs Sampling. [This is a great video of David Mimno explaining how it works](http://journalofdigitalhumanities.org/2-1/the-details-by-david-mimno/)
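For the curious, here is a bare-bones sketch of collapsed Gibbs sampling for LDA in plain Python. It is a teaching toy written for this example, not MALLET's code: the function name `gibbs_lda`, the `alpha` and `beta` values, and the two tiny documents are all assumptions, and a corpus this small will not produce meaningful topics.

```python
import random

def gibbs_lda(docs, num_topics, iterations=50, alpha=0.5, beta=0.1, seed=0):
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    V = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]        # topic counts per document
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]
    topic_total = [0] * num_topics                      # tokens assigned to each topic
    assignments = []
    # Start by assigning every token to a random topic.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(zs)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Take the token out of the counts...
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # ...weigh each topic by its fit to the document and the word...
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta) / (topic_total[t] + V * beta)
                    for t in range(num_topics)
                ]
                # ...and put it back under a freshly sampled topic.
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    # Turn the final counts into smoothed probability estimates.
    topics = [
        {w: (topic_word[t][w] + beta) / (topic_total[t] + V * beta) for w in vocab}
        for t in range(num_topics)
    ]
    mixtures = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    return topics, mixtures

docs = [
    "john likes movies mary likes movies".split(),
    "john likes football games".split(),
]
topics, mixtures = gibbs_lda(docs, num_topics=2)
```

`topics` holds one word distribution per topic and `mixtures` holds one topic mix per document, which are the two kinds of parameters being estimated.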


# The big pile of numbers you get after you run MALLET, those are the estimated parameters, the distributions the model would use to generate the corpus you have. 

# Analysis

# Reading these parameters is how scholars "distantly read" a corpus.

# Many scholars focus on the list of top words for each of the *K* topic distributions
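A list of top words is just a topic's distribution sorted by probability and cut off after the first few entries. A minimal sketch, with invented probabilities and a made-up helper name `top_words`:

```python
def top_words(topic, n=5):
    """Return the n most probable words in a topic, most probable first."""
    ranked = sorted(topic.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:n]]

# Hypothetical topic distribution (probabilities are made up).
topic = {"topic": 0.30, "topics": 0.25, "documents": 0.20,
         "modeling": 0.15, "network": 0.10}
print(top_words(topic, n=3))  # ['topic', 'topics', 'documents']
```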

In [None]:
0	2.5	science network networks scientific analysis research history social data study publication juggling statistics published information natural time ideas sorts 
1	2.5	students learning education student experience teaching model college class building courses ve free thinking general teach graduate approach business 
2	2.5	words word figure collection results literary common language capitalism good interesting corpus lot appears love case period similar women 
3	2.5	archives day archival viagra archivists society cancer conference information studies heart web american archive treatment prostate october mid blood 
4	2.5	tr report link hcil html shneiderman cs version published umiacs acm plaisant information computer human conference car video proc 
5	2.5	history project music american research virginia war center university workshop civil part america asian freedom events archive summer september 
6	2.5	digital humanities work dh scholarship scholars projects scholarly review tools ways project critical criticism technology field process discourse peer 
7	2.5	topic topics documents modeling models model http april crowdsourcing analysis mallet words accessed document text antiquities lda network set 
8	2.5	history book human historians historical technology big century 

# Interpreting these clusters of top words involves a tension: are they meaningful clusters, artifacts of the model, or the result of bad data?

In [None]:
3	2.5	archives day archival viagra archivists society cancer conference information studies heart web american archive treatment prostate october mid blood 

# This is why it is really important to go back to the original documents and, if possible, check your validity against some external source 

## Ryan Heuser and Long Le-Khac did a good job of this in the Stanford Literary Lab Pamphlet #4 - *"A Quantitative Literary History of 2,958 Nineteenth-Century British Novels: The Semantic Cohort Method"*
## Also see Alan Liu's discussion of this article in his PMLA article *"The Meaning of the Digital Humanities"* 
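Going back to the original documents usually starts with finding which documents lean hardest on a suspicious topic. A minimal sketch, assuming the per-document topic mixtures are already loaded into a dictionary; the file names, numbers, and the `top_documents` helper are all hypothetical.

```python
def top_documents(mixtures, topic_id, n=3):
    """Rank documents by how much of their mix comes from one topic."""
    ranked = sorted(mixtures.items(),
                    key=lambda item: item[1][topic_id], reverse=True)
    return [(name, mix[topic_id]) for name, mix in ranked[:n]]

# Hypothetical topic mixtures for three documents and three topics.
mixtures = {
    "post-a.txt": [0.7, 0.2, 0.1],
    "post-b.txt": [0.1, 0.1, 0.8],
    "post-c.txt": [0.3, 0.3, 0.4],
}
to_read = top_documents(mixtures, topic_id=2, n=2)
print(to_read)  # [('post-b.txt', 0.8), ('post-c.txt', 0.4)]
```

Those are the documents to open and read closely before trusting the topic.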

![My network of topics visualization](dh-blog-map-small.jpeg)

## Nice Topic Modeling Links
- Andrew Goldstone & Ted Underwood's [dfr-browser](http://agoldst.github.io/dfr-browser/demo/#)

- Robert K. Nelson's [Mining the Dispatch](http://dsl.richmond.edu/dispatch/)

- Lisa Rhody's [Revising Ekphrasis](http://www.lisarhody.com/revising-ekphrasis/)

- Cameron Blevins's [Topic Modeling Martha Ballard's Diary](http://www.cameronblevins.org/posts/topic-modeling-martha-ballards-diary/)

## Great Tutorials

- [The Joy of Topic Modeling](http://mcburton.net/blog/joy-of-tm/) - *my post that covers this presentation & more*
- [Getting Started with Topic Modeling and MALLET from the Programming Historian](http://programminghistorian.org/lessons/topic-modeling-and-mallet) - *a great set of tutorials for historians*
- [Journal of Digital Humanities issue on Topic Modeling](http://journalofdigitalhumanities.org/2-1/) - *This issue is dedicated to discussing topic modeling*
- [The Historian's Macroscope - a DH methods book](http://www.themacroscope.org/) - *Another great set of tutorials for historians*
- [Topic Modeling a Guided Tour](http://www.scottbot.net/HIAL/?p=19113) - *yet another blog post about topic modeling...Scott has written a lot about topic modeling*

In [None]:
# ignore me, command to launch slides
!ipython nbconvert intro-to-topic-modeling.ipynb --to slides --post serve

