In [None]:
%%capture
!rm -rf data/*
!unzip data.zip -d data/
from datascience import *
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline

# Topic Modeling in Python

In Lisa Rhody's article, "Topic Modeling and Figurative Language", she uses LDA topic modeling to look at ekphrasis poetry. She argues that ekphrasis poetry is particularly well-suited to an LDA analysis because of the assumption of a previously existing set of topics. She's able to extract a number of topics, each constituted of a set of words and probabilities. While we don't have Rhody's corpus, we can use this technique on any large text corpus. We'll use a corpus of novels curated by Andrew Piper.

## Corpus Description
We'll look at an English-language subset of Andrew Piper's novel corpus, totaling 150 novels by British and American authors spanning the years 1771-1930. These texts reside on our volume, each in a separate plaintext file. Metadata is contained in a spreadsheet distributed with the novel files by the [txtLAB](https://txtlab.org/) at McGill.

The metadata provided describes the corpus that exists as `.txt` files. So let's first read in the metadata:

In [None]:
metadata_tb = Table.read_table('data/txtlab_Novel150_English.csv')
metadata_tb.show(5)

Before we go anywhere, let's randomly shuffle the rows so that we don't have them ordered by dates or anything else:

In [None]:
np.random.seed(0)
metadata_tb = Table.from_df(metadata_tb.to_df().sample(frac=1))
metadata_tb.show(5)

We can see the column variables we have with the `.labels` attribute:

In [None]:
metadata_tb.labels

To clarify:
<ol><li>Filename: Name of file on disk</li>
<li>ID: Unique ID in Piper corpus</li>
<li>Language: Language of novel</li>
<li>Date: Initial publication date</li>
<li>Title: Title of novel</li>
<li>Gender: Authorial gender</li>
<li>Person: Textual perspective</li>
<li>Length: Number of tokens in novel</li></ol>

We see a list of `filename`s in the table. These map to files in a folder we have called `txtlab_Novel150_English`:

In [None]:
!ls data/txtlab_Novel150_English/

We can then read in the full text for each novel by iterating through the column, reading each file and appending the string to our `novel_list`:

In [None]:
# create empty list, each entry will be the cleaned text of one novel (a single string)
novel_list = []

# iterate through filenames in metadata table
for filename in metadata_tb['filename']:
    
    # read in novel text as single string
    with open('data/txtlab_Novel150_English/'+filename, 'r') as f:
        novel = f.read()
    
    # clean up for TM analysis
    toks = novel.split()
    toks = [t for t in toks if not t.istitle() and not t.isupper()]  # quick & dirty no titles/proper nouns
    novel = ' '.join(toks)
    
    # add cleaned novel string to master list
    novel_list.append(novel)
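
To see what the quick and dirty filter actually removes, here is a small sketch on a made-up sentence (the sentence is an assumption for illustration, not text from the corpus). Note that it also drops ordinary words that happen to be capitalized at the start of a sentence:

```python
# hypothetical sample sentence, not taken from the corpus
sample = "The captain met Mr. Darcy at the OLD house in London."

# same filter as above: drop title-cased and all-uppercase tokens
sample_toks = [t for t in sample.split() if not t.istitle() and not t.isupper()]
cleaned = ' '.join(sample_toks)
print(cleaned)
```

The filter is crude on purpose: it trades a few lost common words for far fewer character and place names in the topics.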

Let's double check they all came through:

In [None]:
len(novel_list)

And look at the first 200 characters of the fourth novel:

In [None]:
metadata_tb['author'][3], metadata_tb['title'][3], novel_list[3][:200]

---

## Document Term Matrix

Now we need to make a document term matrix, just as we have in the past two classes. We can pull in our `CountVectorizer` from `sklearn` again to create our dtm:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

While you may not have seen the importance of `max_features`, `max_df` and `min_df` before, today you'll see just how much this can affect your results.

Let's start out with this:

- `max_features` = 5000  (i.e. only include the 5000 most frequent tokens in our dtm)
- `max_df` = .8  (i.e. don't keep any tokens that appear in > 80% of the documents)
- `min_df` = 5  (i.e. only keep the token if it appears in at least 5 documents)

We'll add in a `stop_words='english'` too:

In [None]:
cv = CountVectorizer(max_features=5000, stop_words='english', max_df=0.80, min_df=5)
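
To get a feel for what `max_df` and `min_df` do, here is a minimal sketch on five made-up "documents". The toy corpus and the name `toy_cv` are assumptions for illustration only, and the thresholds are scaled down to fit such a tiny example:

```python
from sklearn.feature_extraction.text import CountVectorizer

# five tiny made-up "documents"
toy_docs = [
    "cat dog bird",
    "cat dog fish",
    "cat bird horse",
    "cat dog horse",
    "cat fish zebra",
]

# drop tokens found in more than 80% of documents,
# and keep only tokens found in at least 2 documents
toy_cv = CountVectorizer(max_df=0.8, min_df=2)
toy_cv.fit(toy_docs)

# 'cat' (in all 5 documents) and 'zebra' (in only 1) are filtered out
kept = sorted(toy_cv.vocabulary_)
print(kept)
```

Words that appear everywhere tell us little about what distinguishes one document from another, and words that appear almost nowhere give the model too little evidence to place them in a topic.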

As with most machine learning approaches, to validate your model you need training and testing partitions. Since we don't have any labels, we just need to do this for the novel strings:

In [None]:
train = novel_list[:120]
test = novel_list[120:]

Now we can use our `cv` to `fit_transform` our training list of novels (strings!):

In [None]:
dtm = cv.fit_transform(train)

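To get our words back out we'll call `.get_feature_names_out()` (in `sklearn` versions before 1.0 this method was called `.get_feature_names()`):

In [None]:
dtm_feature_names = list(cv.get_feature_names_out())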

We can double check that our feature limit was enforced by calling `len` on the `dtm_feature_names`:

In [None]:
len(dtm_feature_names)

We can throw this into a `Table` like we have before too:

In [None]:
dtm_tb = Table(dtm_feature_names).with_rows(dtm.toarray())
dtm_tb.show(5)

---

## Topic Modeling

### [Latent Dirichlet Allocation (LDA)](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) Models
LDA reflects an intuition that words in a text are not merely chosen at random but are drawn from underlying concepts (the so-called "latent variables"). The goal of LDA is to look across many texts in order to reverse engineer these concepts by finding words that tend to cluster with one another. For this reason, LDA has been referred to as "the mother of all word collocation techniques."
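
One way to make this intuition concrete is to run the assumed process forwards: pick a mix of topics for a document, then pick each word by first choosing a topic and then choosing a word from that topic. The sketch below does this with `numpy`. The vocabulary, the two topics, and the `generate_document` helper are all made up for illustration, and this is the story LDA assumes about how texts arise, not the fitting algorithm itself:

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical vocabulary and two made-up topics (each row sums to 1)
vocab = ["ship", "sea", "captain", "love", "heart", "marriage"]
topic_word = np.array([
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # a "seafaring" topic
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],  # a "romance" topic
])

def generate_document(n_words, alpha=0.5):
    # draw this document's mixture over topics from a Dirichlet prior
    theta = rng.dirichlet([alpha] * len(topic_word))
    words = []
    for _ in range(n_words):
        k = rng.choice(len(topic_word), p=theta)     # pick a topic for this word
        w = rng.choice(len(vocab), p=topic_word[k])  # pick a word from that topic
        words.append(vocab[w])
    return theta, words

theta, words = generate_document(10)
print(theta, words)
```

Fitting an LDA model runs this story in reverse: given only the words, it estimates the topics and each document's mixture.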

Instead of writing out the complicated math, `sklearn` has the `LatentDirichletAllocation` class:

In [None]:
from sklearn.decomposition import LatentDirichletAllocation

Let's check the doc string:

In [None]:
LatentDirichletAllocation?

Importantly, we'll note:
- `n_components`: the number of topics. Choosing this is the art of topic modeling.
- `max_iter`: the maximum number of passes over the training data. The model starts from a random initialization and is tweaked iteratively.
    
### Training

That's all the preprocessing out of the way. Here is where we'll see something new: the `LatentDirichletAllocation` class. This is where the algorithm described in the video is implemented. Because it's a probabilistic algorithm, there's some randomness to the exact results we'll get each time we use it. To make sure you and I get the exact same results, we'll also have to set the random seed again. We'll look for 30 topics across these novels, but you can change this to whatever you want. We tell `sklearn` how many topics to find with `n_components` when we create the `lda` variable. We've also set `max_iter` to cap the number of training passes; the other optional arguments aren't important for now.

In [None]:
lda = LatentDirichletAllocation(n_components=30, max_iter=10)

Before we `fit` the model, we need to remember that with a lot of these probabilistic models random number generators are used to start the algorithm. If we want our results to be reproducible, we need to set the random seed of the math library we use, in this case `numpy`:

In [None]:
np.random.seed(0) # sets the random seed to ensure reproducible results
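
As an alternative to seeding `numpy` globally, `LatentDirichletAllocation` also accepts a `random_state` argument that fixes the randomness for that one model. The sketch below uses a small made-up count matrix (`toy_counts` is an assumption for illustration, not our dtm) to check that two models built with the same `random_state` end up with the same topics:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# small made-up count matrix: 20 "documents" by 15 "words"
toy_counts = np.random.default_rng(0).integers(0, 5, size=(20, 15))

# passing random_state fixes the initialization for this model only
lda_a = LatentDirichletAllocation(n_components=3, max_iter=5, random_state=0).fit(toy_counts)
lda_b = LatentDirichletAllocation(n_components=3, max_iter=5, random_state=0).fit(toy_counts)

# identical seeds should give identical topic-word weights
same = np.allclose(lda_a.components_, lda_b.components_)
print(same)
```

Either approach works for this notebook. A per-model `random_state` has the advantage that it doesn't depend on what other random calls ran earlier in the session.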

Now we just `fit` the model, as we've done with all `sklearn` models! This may take a while, since a lot is going on:

In [None]:
lda_model = lda.fit(dtm)

### Evaluation

One measure of the model's fit is [perplexity](https://en.wikipedia.org/wiki/Perplexity#Perplexity_of_a_probability_model), which tells us how well the model predicts a set of documents. Here we compute it on the held-out test novels:

In [None]:
lda_model.perplexity(cv.transform(test))

The lower the perplexity, the better the fit of the model. We can also look at the related (approximate) [log-likelihood](https://en.wikipedia.org/wiki/Likelihood_function#Log-likelihood). The higher the log-likelihood, the better the model. Here we compute it on the training dtm:

In [None]:
lda_model.score(dtm)
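
To see how the two measures are related, the sketch below fits a tiny model on made-up counts (`toy_counts` and `toy_lda` are assumptions for illustration) and compares `perplexity` against the exponential of the negative per-word log-likelihood. This relationship is my reading of how `sklearn` computes the two values, so treat it as a check rather than a guarantee:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# small made-up count matrix: 20 "documents" by 15 "words"
toy_counts = np.random.default_rng(0).integers(0, 5, size=(20, 15))
toy_lda = LatentDirichletAllocation(n_components=3, max_iter=5, random_state=0).fit(toy_counts)

log_lik = toy_lda.score(toy_counts)    # approximate log-likelihood of the data
perp = toy_lda.perplexity(toy_counts)  # perplexity of the same data

# perplexity as exp(-log-likelihood per word)
perp_from_score = np.exp(-log_lik / toy_counts.sum())
print(perp, perp_from_score)
```

Because perplexity is normalized by the number of words, it is easier to compare across document sets of different sizes than the raw log-likelihood.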

### Choosing the best model

Given our perplexity and likelihood scores, we'd like to choose a number of topics that minimizes perplexity. Unfortunately, the best way to do this is to build a model for each value in a range of *k* topics. This is extremely computationally intensive, so I've run these commands on a remote server for you:

```python
import multiprocessing
import pickle

from joblib import Parallel, delayed

# candidate numbers of topics: 5, 10, ..., 195
try_topic_n = list(range(5, 200, 5))


def try_topic_number(i):
    """Fit an LDA model with i topics, return test perplexity and log-likelihood."""
    lda = LatentDirichletAllocation(n_components=i, max_iter=1000)
    lda_model = lda.fit(dtm)
    test_dtm = cv.transform(test)
    p = lda_model.perplexity(test_dtm)
    ll = lda_model.score(test_dtm)
    return p, ll


if __name__ == '__main__':

    num_cores = multiprocessing.cpu_count()

    # fit one model per candidate k, in parallel
    results = Parallel(n_jobs=num_cores)(delayed(try_topic_number)(i)
                                         for i in try_topic_n)

    # refit with the k that gave the lowest test perplexity
    results_p = [x[0] for x in results]
    best_k = try_topic_n[np.argmin(results_p)]
    lda = LatentDirichletAllocation(n_components=best_k, max_iter=1000)
    lda_model = lda.fit(dtm)

    # save the fitted model and the (perplexity, log-likelihood) scores
    with open('model.pkl', 'wb') as f:
        pickle.dump(lda, f)
    with open('scores.pkl', 'wb') as f:
        pickle.dump(results, f)
```

You can see above I've dumped the scores into a binary `pickle` file, as well as the model. We can load these in too:

In [None]:
import pickle

scores = pickle.load(open('scores.pkl', 'rb'))
lda = pickle.load(open('model.pkl', 'rb'))

### Topics

To `print` the topics, we can write a function. `display_topics` will print the most probable words to show up in each topic.

In [None]:
def display_topics(model, feature_names, num_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print(topic_idx, " ".join([feature_names[i] for i in topic.argsort()[:-num_top_words - 1:-1]]))
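
The slice `topic.argsort()[:-num_top_words - 1:-1]` is compact but a little cryptic. Here is a small sketch with made-up weights (`toy_words` and `toy_topic` are assumptions for illustration) showing how it picks out the indices of the largest values, biggest first:

```python
import numpy as np

# hypothetical words and their weights in one topic
toy_words = ["sea", "ship", "love", "heart"]
toy_topic = np.array([0.1, 0.5, 0.2, 0.9])

order = toy_topic.argsort()  # indices from smallest to largest weight: [0, 2, 1, 3]
top_two = order[:-2 - 1:-1]  # walk backwards from the end, taking two: [3, 1]
top_words = [toy_words[i] for i in top_two]
print(top_words)
```

`argsort` returns positions rather than values, which is why we can use the result to index into the list of feature names.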

Now let's print the top 10 words of each topic for the model we just loaded, using our `display_topics` function. Have a look through the output and see what topics you can spot:

In [None]:
display_topics(lda, dtm_feature_names, 10)

To get the probabilities for each topic for a given book we can `transform` the dtm into a document-topic matrix and look at one of its rows:

In [None]:
doc_topic = lda.transform(dtm)
metadata_tb['author'][25], metadata_tb['title'][25], doc_topic[25]

We can `print` which topic each novel is closest to by indexing the topic probabilities and using the `argmax` function:

In [None]:
doc_topic = lda.transform(dtm)

for n in range(doc_topic.shape[0]):
    topic_most_pr = doc_topic[n].argmax()
    print(metadata_tb['author'][n], metadata_tb['title'][n])
    print("doc: {} topic: {}\n".format(n,topic_most_pr))
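
The loop above handles one novel at a time. As a sketch of the same idea without a loop, `argmax` can work across a whole matrix at once when given an `axis`. The small matrix here is made up for illustration:

```python
import numpy as np

# hypothetical document-topic matrix: 3 documents (rows) by 3 topics (columns)
toy_doc_topic = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.3, 0.4, 0.3],
])

# axis=1 finds the most probable topic within each row (document)
closest = toy_doc_topic.argmax(axis=1)
print(closest)
```

The result has one entry per document, in the same order as the rows of the matrix.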

### Challenge

Add these topic assignments back to our `Table` `metadata_tb`

In [None]:
# YOUR CODE HERE

### Interpreting the Model

There are many strategies that can be used to interpret the output of a topic model. In this case, we will look for any correlations between the topic distributions and metadata.

We'll first grab all the topic distributions similar to what we did above. Remember, the order is still the same!

In [None]:
list_of_doctopics = [doc_topic[n] for n in range(len(doc_topic))]
list_of_doctopics[0]

We'll make a `DataFrame`, which is similar to a `Table`, with the probabilities for the topics (columns) and documents (rows):

In [None]:
df = pd.DataFrame(list_of_doctopics)
df.head()

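We can add these columns to our metadata by first converting the `metadata_tb` `Table` to a `DataFrame`. Since `doc_topic` only covers the 120 training novels, the remaining 30 rows will have missing values in the topic columns: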

In [None]:
meta = metadata_tb.to_df()
meta[df.columns] = df
meta.head()

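The `corr()` method will give us a correlation matrix. Correlations only make sense for numbers, so we first select the numeric columns:

In [None]:
meta.select_dtypes('number').corr()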
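
To make the correlation matrix easier to read, here is a toy sketch with made-up values (`toy_meta` and its column names are assumptions for illustration). It uses `select_dtypes('number')` to keep only the numeric columns before calling `corr()`. One topic's weight rises steadily with the date and the other's falls:

```python
import pandas as pd

# hypothetical metadata with two made-up topic columns
toy_meta = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "date": [1800, 1850, 1900, 1950],
    "topic_up": [0.1, 0.2, 0.3, 0.4],    # grows with date
    "topic_down": [0.4, 0.3, 0.2, 0.1],  # shrinks with date
})

# keep numeric columns only, then compute pairwise correlations
toy_corr = toy_meta.select_dtypes('number').corr()
print(toy_corr)
```

A value near 1 means the topic becomes more prominent in later novels, a value near -1 means it fades, and a value near 0 means there is no linear trend.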

We see some strong correlations of topics with `date`. Recall the topics:

In [None]:
display_topics(lda, dtm_feature_names, 10)

In [None]:
meta.plot.scatter(x='date', y=13)

In [None]:
meta.plot.scatter(x='date', y=1)

Why do you think we see this?

# Homework

We're going to download the [20 Newsgroups](http://qwone.com/~jason/20Newsgroups/) dataset, a widely used corpus for demos on general texts:

> The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of my knowledge, it was originally collected by Ken Lang, probably for his Newsweeder: Learning to filter netnews paper, though he does not explicitly mention this collection. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.

First we'll import the data from `sklearn`:

In [None]:
from sklearn.datasets import fetch_20newsgroups

We'll ask for the training data:

In [None]:
train_subset = fetch_20newsgroups(subset="train")

Here are the predetermined categories:

In [None]:
train_subset.target_names

Since we're topic modeling, we don't care about what they've been labeled, but it'll be interesting to see how our topics line up with these!

How many documents are there?

In [None]:
len(train_subset.data)

Let's get a list of documents as strings just like we did with the novels, and then we'll randomly shuffle them in case they're ordered by category already:

In [None]:
documents_train = train_subset.data
np.random.shuffle(documents_train)

In [None]:
print(documents_train[0])

Now we'll do the same for the test set:

In [None]:
test_subset = fetch_20newsgroups(subset="test")
documents_test = test_subset.data
np.random.shuffle(documents_test)
print(documents_test[0])

## TASK:

You now have two lists of strings: `documents_train` and `documents_test`. Create a `dtm` and then a topic model with `k` topics. Just choose one value of `k` and a low `max_iter` value for the training so it doesn't take too long. See how the topics match up to the annotated categories, and play with different ways of preprocessing the data. What did you have to do to get decent results?