In [None]:
from datascience import *
import numpy as np

# Topic Modeling in Python

In her article "Topic Modeling and Figurative Language", Lisa Rhody uses LDA topic modeling to look at ekphrastic poetry. She argues that ekphrastic poetry is particularly well suited to an LDA analysis because of the assumption of a previously existing set of topics. She's able to extract a number of topics, each consisting of a set of words and their probabilities. While we don't have Rhody's corpus, we can use this technique on any large text corpus. We'll use a corpus of novels curated by Andrew Piper.

## Corpus Description
We'll look at an English-language subset of Andrew Piper's novel corpus, totaling 150 novels by British and American authors spanning the years 1771-1930. These texts reside on our volume, each in a separate plaintext file. Metadata is contained in a spreadsheet distributed with the novel files by the [txtLAB](https://txtlab.org/) at McGill.

The metadata provided describes the corpus that exists as `.txt` files. So let's first read in the metadata:

In [None]:
metadata_tb = Table.read_table('txtlab_Novel150_English.csv')
metadata_tb.show(5)

We can see the column variables we have with the `.labels` attribute:

In [None]:
metadata_tb.labels

To clarify:
<ol><li>Filename: Name of file on disk</li>
<li>ID: Unique ID in Piper corpus</li>
<li>Language: Language of novel</li>
<li>Date: Initial publication date</li>
<li>Author: Author of novel</li>
<li>Title: Title of novel</li>
<li>Gender: Authorial gender</li>
<li>Person: Textual perspective</li>
<li>Length: Number of tokens in novel</li></ol>

The table has a `filename` column with one entry per novel. Each entry maps to a file in a folder called `txtlab_Novel150_English`:

In [None]:
!ls txtlab_Novel150_English/

We can then read in the full text for each novel by iterating through the column, reading each file and appending the string to our `novel_list`:

In [None]:
# create empty list; each entry will be the cleaned text of one novel (a single string)
novel_list = []

# iterate through filenames in metadata table
for filename in metadata_tb['filename']:
    
    # read in novel text as a single string (CountVectorizer lowercases it later)
    with open('txtlab_Novel150_English/'+filename, 'r') as f:
        novel = f.read()
    
    # clean up for TM analysis: drop title-case and all-caps tokens,
    # a rough filter for proper names and headings
    toks = novel.split()
    toks = [t for t in toks if not t.istitle() and not t.isupper()]
    novel = ' '.join(toks)
    
    # add cleaned novel string to master list
    novel_list.append(novel)
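
To see what the cleanup step above keeps and drops, here is a minimal sketch that applies the same `istitle`/`isupper` filter to one made-up sentence (the sentence is an illustration, not text from the corpus). Notice that ordinary words at the start of a sentence are dropped along with the proper names:

```python
# Hypothetical example sentence, not taken from the corpus
sentence = "The captain told Mr. Darcy that I AM leaving London tomorrow."

# same filter as in the loop above: drop title-case and all-caps tokens
toks = sentence.split()
kept = [t for t in toks if not t.istitle() and not t.isupper()]
cleaned = ' '.join(kept)

print(cleaned)
```

This is a blunt filter: it removes "The" as well as "Darcy" and "London", which is usually an acceptable trade-off for topic modeling because common sentence-initial words tend to be stop words anyway.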

Let's double check they all came through:

In [None]:
len(novel_list)

And look at the first 200 characters of the fourth novel:

In [None]:
novel_list[3][:200]

---

## Document Term Matrix

Now we need to make a document term matrix, just as we have in the past two classes. We can then pull in our `CountVectorizer` from `sklearn` again to create our dtm:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

While you may not have seen the importance of `max_features`, `max_df` and `min_df` before, today you'll see just how much this can affect your results.

Let's start out with this:

- `max_features` = 5000  (i.e. only include 5000 tokens in our dtm)
- `max_df` = .8  (i.e. don't keep any tokens that appear in > 80% of the documents)
- `min_df` = 5  (i.e. only keep the token if it appears in at least 5 documents)
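
To make these three settings concrete, here is a rough plain-Python sketch of the filtering rules on a toy corpus. The function `filter_vocabulary` and the toy documents are hypothetical and only illustrate the idea; the real `CountVectorizer` tokenizes differently and may break ties differently when applying `max_features`:

```python
from collections import Counter

def filter_vocabulary(docs, min_df=1, max_df=1.0, max_features=None):
    """Toy illustration of document-frequency filtering (not sklearn's code)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: in how many documents does each token appear?
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    # term frequency: how often does each token appear overall?
    term_freq = Counter(tok for toks in tokenized for tok in toks)
    # min_df is an absolute count here, max_df a proportion of documents
    kept = [t for t, df in doc_freq.items()
            if df >= min_df and df <= max_df * n_docs]
    # max_features keeps only the most frequent of the surviving tokens
    kept.sort(key=lambda t: (-term_freq[t], t))
    if max_features is not None:
        kept = kept[:max_features]
    return sorted(kept)

# Hypothetical toy documents
toy_docs = ["the whale sank the ship",
            "the captain saw the whale",
            "the ship left port",
            "the captain left"]

vocab = filter_vocabulary(toy_docs, min_df=2, max_df=0.8)
print(vocab)
```

Here "the" is dropped because it appears in all four documents (more than 80%), and "sank", "saw" and "port" are dropped because each appears in only one document (fewer than 2).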

We'll add in a `stop_words='english'` too:

In [None]:
cv = CountVectorizer(max_features=5000, stop_words='english', max_df=0.80, min_df=5)

Now we can use our `cv` to `fit_transform` our list of novels (strings!):

In [None]:
dtm = cv.fit_transform(novel_list)

To get our words back out we'll call `.get_feature_names_out()` (in older versions of `sklearn` this method was called `.get_feature_names()`):

In [None]:
dtm_feature_names = list(cv.get_feature_names_out())

We can double check that our feature limit was enforced by calling `len` on the `dtm_feature_names`:

In [None]:
len(dtm_feature_names)

We can throw this into a `Table` like we have before too:

In [None]:
dtm_tb = Table(dtm_feature_names).with_rows(dtm.toarray())
dtm_tb.show(5)

---

## Topic Modeling

### [Latent Dirichlet Allocation (LDA)](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) Models
LDA reflects an intuition that words in a text are not merely chosen at random but are drawn from underlying concepts (the so-called "latent variables"). The goal of LDA is to look across many texts in order to reverse engineer these concepts by finding words that tend to cluster with one another. For this reason, LDA has been referred to as "the mother of all word collocation techniques."
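
One way to picture this intuition is to run the story forwards: pick a topic for each word slot, then pick a word from that topic. The sketch below does this with two made-up topics and a made-up topic mixture (all names and probabilities are hypothetical). It is a toy, not LDA itself: in the full model the mixtures and topics are drawn from Dirichlet priors, and fitting the model means inferring them from the words we observe.

```python
import random

# Hypothetical topics: each maps words to probabilities (made up for illustration)
topics = {
    "sea":  {"ship": 0.5, "captain": 0.3, "storm": 0.2},
    "home": {"house": 0.4, "mother": 0.4, "garden": 0.2},
}

# Hypothetical topic mixture for a single document
doc_mixture = {"sea": 0.7, "home": 0.3}

def generate_document(n_words, mixture, topics, seed=0):
    rng = random.Random(seed)  # local generator so the result is repeatable
    words = []
    for _ in range(n_words):
        # first pick a topic according to the document's mixture ...
        topic = rng.choices(list(mixture), weights=list(mixture.values()))[0]
        # ... then pick a word according to that topic's word distribution
        word_probs = topics[topic]
        word = rng.choices(list(word_probs), weights=list(word_probs.values()))[0]
        words.append(word)
    return words

toy_doc = generate_document(10, doc_mixture, topics)
print(' '.join(toy_doc))
```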

Rather than writing out the complicated math ourselves, we can use `sklearn`'s `LatentDirichletAllocation` class:

In [None]:
from sklearn.decomposition import LatentDirichletAllocation

Let's check the docstring:

In [None]:
LatentDirichletAllocation?

Importantly, we'll note:
- `n_components`: the number of topics. Choosing this is the art of topic modeling.
- `max_iter`: the maximum number of passes over the data. The model starts from a random initialization and is tweaked iteratively.
    
### Training

That's all the preprocessing out of the way. Here is where we'll see something new: the `LatentDirichletAllocation` class. This is where the algorithm described in the video is implemented. Because it's a probabilistic algorithm, there's some randomness to the exact results we'll get each time we use it. To make sure you and I get the exact same results, we'll also have to set the random seed. We'll look for 20 topics across these novels, but you can change this to whatever you want. We tell `sklearn` to fit 20 topics by passing `n_components=20` when we create the `lda` variable. We also set `max_iter`, which caps the number of training passes, but that's not important for now.

In [None]:
lda = LatentDirichletAllocation(n_components=20, max_iter=50)

Before we `fit` the model, we need to remember that many of these probabilistic models use random number generators to start the algorithm. If we want our results to be reproducible, we need to set the random seed of the math library we use, in this case `numpy`:

In [None]:
np.random.seed(0) # sets the random seed to ensure reproducible results
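
As a quick illustration of what seeding does, the sketch below draws random numbers twice from the same seed and checks that the draws match (the variable names are just for this example):

```python
import numpy as np

np.random.seed(0)               # set the seed ...
first_draw = np.random.rand(3)  # ... and draw three random numbers

np.random.seed(0)               # reset to the same seed
second_draw = np.random.rand(3)

# identical seeds give identical "random" numbers
print(np.array_equal(first_draw, second_draw))
```

`LatentDirichletAllocation` also accepts a `random_state` argument, which is an alternative way to pin down the randomness for that one model rather than for all of `numpy`.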

Now we just `fit` the model, as we've done with all `sklearn` models! This may take a while because a lot is going on:

In [None]:
lda_model = lda.fit(dtm)

### Topics

To print the topics, we'll need to write a function. That function will print the most probable words to show up in each topic.

In [None]:
def display_topics(model, feature_names, num_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print(topic_idx, " ".join([feature_names[i] for i in topic.argsort()[:-num_top_words - 1:-1]]))
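
The slice `topic.argsort()[:-num_top_words - 1:-1]` can look cryptic. The small sketch below uses made-up weights and words to show what it does: `argsort` returns the indices that would sort the weights in ascending order, and the slice walks backwards from the end to pick the indices of the largest values.

```python
import numpy as np

# Hypothetical topic weights and vocabulary, for illustration only
weights = np.array([0.1, 2.5, 0.7, 1.9])
words = ["ship", "love", "house", "letter"]
num_top_words = 2

order = weights.argsort()                   # indices from smallest to largest weight
top_idx = order[:-num_top_words - 1:-1]     # last two indices, largest first
top_words = [words[i] for i in top_idx]

print(order, top_idx, top_words)
```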

Now let's print the top 10 words of the 20 topics for the model we trained, using our `display_topics` function. Have a look through the output and see what topics you can spot:

In [None]:
display_topics(lda, dtm_feature_names, 10)

We can `print` which topic each novel is closest to by taking its row of topic probabilities and using the `argmax` function:

In [None]:
doc_topic = lda.transform(dtm)

for n in range(doc_topic.shape[0]):
    topic_most_pr = doc_topic[n].argmax()
    print(metadata_tb['author'][n], metadata_tb['title'][n])
    print("doc: {} topic: {}\n".format(n,topic_most_pr))

## Challenge

Add these topic assignments back to our `Table` `metadata_tb`.

### Evaluation

One measure of the model's fit is perplexity:

In [None]:
lda_model.perplexity(dtm)

We can also look at the log likelihood:

In [None]:
lda_model.score(dtm)
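
The two numbers are closely related. The `sklearn` docstring defines perplexity as the exponential of the negative log likelihood per word, so a lower perplexity corresponds to a higher (less negative) log likelihood. The arithmetic can be sketched as follows; the values below are invented for illustration and are not results from this corpus:

```python
import math

# Hypothetical values for illustration only (not results from this corpus)
log_likelihood = -1_200_000.0   # the kind of value lda_model.score(dtm) returns
n_tokens = 150_000              # total word count in the dtm, e.g. dtm.sum()

per_word_ll = log_likelihood / n_tokens   # average log likelihood per word
perplexity = math.exp(-per_word_ll)       # perplexity = exp(-per-word log likelihood)

print(per_word_ll, perplexity)
```

Because both numbers are approximations computed on the training data, they are most useful for comparing models fit to the same dtm, for example with different numbers of topics.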

# Homework: import a new corpus and run a topic model

# 4. Interpreting the Model

### Metadata
There are many strategies that can be used to interpret the output of a topic model. In this case, we will look for any correlations between the topic distributions and metadata.

In [None]:
# Create list of all document-topic distributions:
# one list of (topic label, probability) tuples per novel
list_of_doctopics = [list(enumerate(row)) for row in lda_model.transform(dtm)]

In [None]:
list_of_doctopics[0]

In [None]:
# In the list above, each topic got represented as a tuple containing
# the label of the topic and its probability within the given document

# Create list containing only the probabilities (remains ordered by topic label)
list_of_probabilities = [[probability for label,probability in distribution] for distribution in list_of_doctopics]

In [None]:
list_of_probabilities[0]

In [None]:
# We'll put these into a labeled column format so that we can add
# document-topic distributions to our original metadata table

# Note that this means a cumbersome switch from lists that represent rows
# to lists that represent columns

labeled_columns = [['Topic '+str(i),[document[i] for document in list_of_probabilities]] for i in range(lda_model.n_components)]

In [None]:
labeled_columns[0]
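
The switch from rows to columns described above can also be seen on a tiny example. This sketch uses two made-up documents with three topics each (the numbers are hypothetical) and transposes them with `zip`, which is one common alternative to the nested list comprehension:

```python
# Hypothetical document-topic probabilities: one row per document
rows = [[0.7, 0.2, 0.1],
        [0.1, 0.8, 0.1]]

# zip(*rows) pairs up the first entries, the second entries, and so on,
# turning rows (documents) into columns (topics)
columns = [list(col) for col in zip(*rows)]

# attach a label to each topic column
labeled = [['Topic ' + str(i), col] for i, col in enumerate(columns)]

print(labeled)
```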

In [None]:
# Add these as new columns to the metadata table
metatopic_tb = metadata_tb.with_columns(labeled_columns)

In [None]:
# Quick and dirty correlation function

def correlator(tb, col_1, col_2):
    # convert each column to standard units (table columns are numpy arrays)
    col_1_in_su = (tb[col_1] - np.mean(tb[col_1])) / np.std(tb[col_1])
    col_2_in_su = (tb[col_2] - np.mean(tb[col_2])) / np.std(tb[col_2])
    # r is the mean of the products of the standardized values
    r = np.mean(col_1_in_su * col_2_in_su)
    return r

In [None]:
correlator(metatopic_tb, 'date', 'Topic 0')
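
To get a feel for what the correlation coefficient reports, here is a plain-Python sketch of the same calculation on made-up numbers. The function name `pearson_r`, the dates and the topic weights are all hypothetical; a weight that rises steadily with date gives an r near 1, and one that falls steadily gives an r near -1:

```python
import math

def pearson_r(xs, ys):
    """Mean of the products of two variables in standard units."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # population standard deviations (dividing by n, as np.std does by default)
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    products = [((x - mean_x) / sd_x) * ((y - mean_y) / sd_y)
                for x, y in zip(xs, ys)]
    return sum(products) / n

# Hypothetical publication dates and topic weights
dates = [1800, 1850, 1900, 1950]
rising_topic = [0.1, 0.2, 0.3, 0.4]
falling_topic = [0.4, 0.3, 0.2, 0.1]

r_rising = pearson_r(dates, rising_topic)
r_falling = pearson_r(dates, falling_topic)
print(r_rising, r_falling)
```

Squaring r gives the r^2 value, which measures the strength of the linear relationship regardless of its direction.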

In [None]:
## EX. Find any topics that have an r^2 value greater than 0.1.
##     Return the top terms for those topics. Are the correlations
##     positive or negative?

## EX. Try running the topic model without removing any words from
##     the dictionary. How do the topics change?
##     Try changing the minimum document frequency.

# 5. Revising Model Inputs

In [None]:
## EX. Some proper names and titles still came through our filter.
##     Use nltk's NER function to remove names in a more targeted way.

## EX. In Matt Jockers's study of literary theme, he included only
##     nouns for topic modeling. Use nltk's POS tagger to remove all
##     words from the corpus that are not common nouns.

## EX. Jockers also found it useful to split texts into 1000-noun chunks
##     after the POS filter. Run the topic model over these smaller chunks.