# Latent Dirichlet Allocation (LDA)
## A Literary Introduction: *Jane Austen v. Charlotte Brontë*
Although Jane Austen and Charlotte Brontë were born about forty years apart, modern fans often pit them against one another in a battle for literary supremacy. The battle centers on the topics of education for women, courting, and marriage. The authors' similar backgrounds naturally draw comparisons, but the modern fascination is probably due to the novelty of British women publishing novels during the early-to-mid 19th century.

Can we help close this literary battle for supremacy by simply acknowledging that the authors addressed different topics and that each deserves recognition as an excellent author in her own right?

We're going to apply Latent Dirichlet Allocation, a machine learning algorithm for topic modeling, to each author's novels and compare the distribution of topics between them.
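
For readers new to LDA, the sketch below walks through the generative story the model assumes: each document draws its own mixture over topics, and each word is drawn from one of those topics. The vocabulary, the two topics, and their word probabilities are made up for illustration, and this is not how gensim fits the model; it only shows the assumption that fitting works backwards from.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, vocab, alpha, length, rng):
    """Generate one toy document under LDA's generative story."""
    # 1. Draw this document's mixture over topics.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        # 2. Pick a topic for this word position.
        z = rng.choices(range(len(theta)), weights=theta)[0]
        # 3. Pick a word from that topic's word distribution.
        words.append(rng.choices(vocab, weights=topic_word[z])[0])
    return theta, words

rng = random.Random(0)
vocab = ["marriage", "sister", "ball", "night", "room", "door"]  # made-up vocabulary
topic_word = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # hypothetical "courtship" topic
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],  # hypothetical "house at night" topic
]
theta, words = generate_document(topic_word, vocab, alpha=[0.5, 0.5], length=8, rng=rng)
print(theta)
print(words)
```

Fitting LDA runs this story in reverse: given only the words, it estimates the topic-word distributions and each document's mixture.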

In [40]:
import numpy as np
import gensim
import os
import re

from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from gensim import corpora

import pandas as pd

## Novel Data
I grabbed the novel data pre-split into a bunch of smaller chunks.

In [14]:
path = './data/austen-brontë-split'

## Text Preprocessing

In [17]:
titles = [t[:-4] for t in os.listdir(path)]

In [5]:
STOPWORDS = set(STOPWORDS).union(set(['said', 'mr', 'mrs']))

def tokenize(text):
    return [token for token in simple_preprocess(text) if token not in STOPWORDS]

In [6]:
tokenize("Hello World! This a test of the tokenization method")

['hello', 'world', 'test', 'tokenization', 'method']

## Python Generators
Here we use a Python feature that may be new to you: the `yield` statement in our function. It turns the function into a generator, which lets us iterate over a bunch of documents one at a time without reading them all into memory at once. You can see how we use this function later on.

In [7]:
def doc_stream(path):
    for f in os.listdir(path):
        with open(os.path.join(path,f)) as t:
            text = t.read().strip('\n')
            tokens = tokenize(str(text))
            yield tokens
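
To make the lazy behavior concrete, here is a small standalone sketch (the `numbers` function and the `produced` list are made up for illustration). It shows that a generator does no work until asked, and that it can only be consumed once, which is why `doc_stream(path)` is called afresh each time it is needed later on.

```python
produced = []  # records which items the generator has actually computed

def numbers(n):
    """Yield 0..n-1 one at a time, logging each item as it is computed."""
    for i in range(n):
        produced.append(i)
        yield i

stream = numbers(3)
print(produced)         # [] : calling the function ran none of its body

first = next(stream)
print(first, produced)  # 0 [0] : only the first item has been computed

rest = list(stream)
print(rest, produced)   # [1, 2] [0, 1, 2]

again = list(stream)
print(again)            # [] : the generator is exhausted after one pass
```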

## Gensim LDA Topic Modeling

In [18]:
# A Dictionary Representation of all the words in our corpus
id2word = corpora.Dictionary(doc_stream(path))

In [20]:
# Remove extreme values: drop tokens that appear in fewer than 10 documents
# or in more than 75% of all documents
id2word.filter_extremes(no_below=10, no_above=0.75)
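
The rule behind `no_below` and `no_above` can be approximated in plain Python as a document-frequency filter. This is a sketch of the idea only, not gensim's implementation: the helper name and toy documents are made up, and gensim additionally caps the vocabulary size (its `keep_n` parameter) and applies its own rounding.

```python
from collections import Counter

def filter_by_document_frequency(docs, no_below, no_above):
    """Return the tokens kept under a filter_extremes-style rule."""
    docs = list(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each token once per document
    max_docs = no_above * len(docs)   # no_above is a fraction of the corpus
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

toy_docs = [
    ["emma", "harriet", "ball"],
    ["emma", "knightley", "ball"],
    ["emma", "rochester", "room"],
    ["emma", "rochester", "door"],
]
kept = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.75)
print(sorted(kept))  # ['ball', 'rochester']
```

Here `emma` is dropped for appearing in every document, and the words that appear only once are dropped for being too rare.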

In [41]:
# A bag-of-words (BoW) representation of our corpus
# Note: the raw text is streamed one file at a time, but the resulting
# list of BoW vectors is held in memory
corpus = [id2word.doc2bow(text) for text in doc_stream(path)]
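
If even the bag-of-words list were too large for memory, one option is a small re-iterable wrapper that restarts the document stream on every pass. The `StreamingCorpus` class below is a hypothetical helper, not part of gensim, and it is demonstrated on toy data rather than on the novels. In this notebook it would presumably be built as `StreamingCorpus(lambda: doc_stream(path), id2word.doc2bow)`, but that has not been run against the data here.

```python
from collections import Counter

class StreamingCorpus:
    """Re-iterable wrapper: every `for` loop restarts the underlying stream."""

    def __init__(self, make_stream, to_bow):
        self.make_stream = make_stream  # zero-argument callable returning a fresh iterator of token lists
        self.to_bow = to_bow            # callable mapping a token list to a bag of words

    def __iter__(self):
        for tokens in self.make_stream():
            yield self.to_bow(tokens)

# Toy stand-ins for doc_stream and id2word.doc2bow
def toy_stream():
    yield ["emma", "ball", "ball"]
    yield ["room", "door"]

def toy_bow(tokens):
    return sorted(Counter(tokens).items())

corpus_stream = StreamingCorpus(toy_stream, toy_bow)
first_pass = list(corpus_stream)
second_pass = list(corpus_stream)  # works again, unlike a bare generator
print(first_pass)
```

A bare generator would be exhausted after the first pass, which matters because the model below is trained with `passes=10`.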

In [23]:
lda = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                     id2word=id2word,
                                     random_state=723812,
                                     num_topics = 15,
                                     passes=10)

In [47]:
words = [re.findall(r'"([^"]*)"',t[1]) for t in lda.print_topics()]

In [58]:
topics = [' '.join(t[2:5]) for t in words]

In [59]:
topics

['years rochester adele',
 'wife living sir',
 'think know thing',
 'think colonel wickham',
 'bingley miss bennet',
 'madame thought know',
 'mother john elinor',
 'like ingram rochester',
 'harriet weston knightley',
 'know think sister',
 'bennet know lydia',
 'heard eliza day',
 'thought long room',
 'elinor shall come',
 'house like day']

## Comparison

In [62]:
distro = [lda[d] for d in corpus]

In [63]:
distro[0]

[(2, 0.22452806), (4, 0.24244052), (8, 0.4533502), (9, 0.077866085)]

In [64]:
def update(doc):
    """Expand a sparse list of (topic_id, weight) pairs into a dict covering all 15 topics."""
    d_dist = {k: 0 for k in range(0, 15)}
    for t in doc:
        d_dist[t[0]] = t[1]
    return d_dist

new_distro = [update(d) for d in distro]

In [65]:
new_distro[0]

{0: 0,
 1: 0,
 2: 0.22452806,
 3: 0,
 4: 0.24244052,
 5: 0,
 6: 0,
 7: 0,
 8: 0.4533502,
 9: 0.077866085,
 10: 0,
 11: 0,
 12: 0,
 13: 0,
 14: 0}

In [72]:
df = pd.DataFrame.from_records(new_distro, index=titles)
df.columns = topics

In [73]:
df.head()

| | years rochester adele | wife living sir | think know thing | think colonel wickham | bingley miss bennet | madame thought know | mother john elinor | like ingram rochester | harriet weston knightley | know think sister | bennet know lydia | heard eliza day | thought long room | elinor shall come | house like day |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Austen_Emma0000 | 0.0 | 0.0 | 0.224528 | 0.0 | 0.242441 | 0.0 | 0.0 | 0.0 | 0.45335 | 0.077866 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Austen_Emma0001 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.997247 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Austen_Emma0002 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.997422 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Austen_Emma0003 | 0.0 | 0.0 | 0.177979 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.477138 | 0.342727 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Austen_Emma0004 | 0.0 | 0.0 | 0.0 | 0.0 | 0.125666 | 0.0 | 0.0 | 0.0 | 0.872183 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [74]:
df['author'] = df.reset_index()['index'].apply(lambda x: x.split('_')[0]).tolist()
df['book'] = df.reset_index()['index'].apply(lambda x: x.split('_')[1][:-4]).tolist()
df['section'] = df.reset_index()['index'].apply(lambda x: x[-4:]).tolist()
df.head()

| | years rochester adele | wife living sir | think know thing | think colonel wickham | bingley miss bennet | madame thought know | mother john elinor | like ingram rochester | harriet weston knightley | know think sister | bennet know lydia | heard eliza day | thought long room | elinor shall come | house like day | author | book | section |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Austen_Emma0000 | 0.0 | 0.0 | 0.224528 | 0.0 | 0.242441 | 0.0 | 0.0 | 0.0 | 0.45335 | 0.077866 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Austen | Emma | 0 |
| Austen_Emma0001 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.997247 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Austen | Emma | 1 |
| Austen_Emma0002 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.997422 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Austen | Emma | 2 |
| Austen_Emma0003 | 0.0 | 0.0 | 0.177979 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.477138 | 0.342727 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Austen | Emma | 3 |
| Austen_Emma0004 | 0.0 | 0.0 | 0.0 | 0.0 | 0.125666 | 0.0 | 0.0 | 0.0 | 0.872183 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Austen | Emma | 4 |


In [75]:
df.describe()

| | years rochester adele | wife living sir | think know thing | think colonel wickham | bingley miss bennet | madame thought know | mother john elinor | like ingram rochester | harriet weston knightley | know think sister | bennet know lydia | heard eliza day | thought long room | elinor shall come | house like day |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 813.0 | 813.0 | 813.0 | 813.0 | 813.0 | 813.0 | 813.0 | 813.0 | 813.0 | 813.0 | 813.0 | 813.0 | 813.0 | 813.0 | 813.0 |
| mean | 0.003044 | 0.001283 | 0.014421 | 0.01087 | 0.065071 | 0.155098 | 0.005907 | 0.037986 | 0.145708 | 0.188003 | 0.070472 | 0.007806 | 0.27825 | 0.00781 | 0.005507 |
| std | 0.052194 | 0.035014 | 0.096115 | 0.091922 | 0.208182 | 0.281066 | 0.07076 | 0.145822 | 0.324443 | 0.319548 | 0.214361 | 0.079167 | 0.355732 | 0.077163 | 0.071168 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.01248 | 0.0 | 0.0 |
| 75% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.19496 | 0.0 | 0.0 | 0.0 | 0.217176 | 0.0 | 0.0 | 0.583615 | 0.0 | 0.0 |
| max | 0.997498 | 0.997378 | 0.997471 | 0.997295 | 0.997531 | 0.997667 | 0.997655 | 0.997436 | 0.997672 | 0.997762 | 0.997531 | 0.997386 | 0.997767 | 0.997625 | 0.997279 |


In [76]:
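# numeric_only=True skips the string columns ('book', 'section'),
# which recent pandas versions refuse to average
author_mean = df.groupby(by=['author']).mean(numeric_only=True)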

In [77]:
author_mean

| author | years rochester adele | wife living sir | think know thing | think colonel wickham | bingley miss bennet | madame thought know | mother john elinor | like ingram rochester | harriet weston knightley | know think sister | bennet know lydia | heard eliza day | thought long room | elinor shall come | house like day |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Austen | 0.0 | 0.0 | 0.029317 | 0.023416 | 0.137172 | 0.001817 | 0.007148 | 0.002057 | 0.313899 | 0.316756 | 0.143855 | 0.0 | 0.010486 | 0.010214 | 0.001125 |
| CBronte | 0.005611 | 0.002365 | 0.001856 | 0.000287 | 0.004251 | 0.284396 | 0.00486 | 0.068294 | 0.003832 | 0.079395 | 0.008571 | 0.01439 | 0.504119 | 0.005783 | 0.009204 |
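
One way to boil the two rows above down to a single number is a distance between the authors' average topic distributions. The sketch below uses the Hellinger distance, which is my choice for illustration rather than something the analysis above relies on; it ranges from 0 (identical distributions) to 1 (no overlap). The values are copied from the `author_mean` output, and the helper name is made up.

```python
import math

# Per-author mean topic weights, copied from the author_mean output above
austen = [0.0, 0.0, 0.029317, 0.023416, 0.137172, 0.001817, 0.007148, 0.002057,
          0.313899, 0.316756, 0.143855, 0.0, 0.010486, 0.010214, 0.001125]
cbronte = [0.005611, 0.002365, 0.001856, 0.000287, 0.004251, 0.284396, 0.00486,
           0.068294, 0.003832, 0.079395, 0.008571, 0.01439, 0.504119, 0.005783, 0.009204]

def hellinger(p, q):
    """Hellinger distance between two weight vectors (each normalized to sum to 1)."""
    p_total, q_total = sum(p), sum(q)
    squared = sum(
        (math.sqrt(a / p_total) - math.sqrt(b / q_total)) ** 2
        for a, b in zip(p, q)
    )
    return math.sqrt(0.5 * squared)

distance = hellinger(austen, cbronte)
print(round(distance, 3))
```

The rows are normalized first because the means do not sum to exactly 1: gensim leaves very small topic weights out of each document's result.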


In [35]:
# CBronte's dominant topic (index 12, labeled 'thought long room' above)
lda.show_topic(12)

[('like', 0.0072644744),
 ('night', 0.0049664956),
 ('thought', 0.0048304982),
 ('long', 0.0046967305),
 ('room', 0.004392841),
 ('come', 0.004066149),
 ('day', 0.0038081058),
 ('door', 0.0037465198),
 ('eyes', 0.003669961),
 ('saw', 0.003612934)]

## Making a Prediction on an unseen document

In [36]:
tokens = "like night though long ed the world room come day door eyes saw".split()
bow = id2word.doc2bow(tokens)

In [39]:
lda[bow]

[(12, 0.91515136)]
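
To turn a sparse result like this into a readable label, we can look up the highest-weighted topic in the `topics` list built earlier. The `top_topic` helper is a hypothetical convenience function, not part of gensim; the labels and the prediction are copied from the outputs above so the sketch runs on its own.

```python
# Topic labels copied from the `topics` output earlier in the notebook
topics = ['years rochester adele', 'wife living sir', 'think know thing',
          'think colonel wickham', 'bingley miss bennet', 'madame thought know',
          'mother john elinor', 'like ingram rochester', 'harriet weston knightley',
          'know think sister', 'bennet know lydia', 'heard eliza day',
          'thought long room', 'elinor shall come', 'house like day']

def top_topic(doc_topics, labels):
    """Return (label, weight) for the highest-weighted topic in a sparse result."""
    topic_id, weight = max(doc_topics, key=lambda pair: pair[1])
    return labels[topic_id], weight

prediction = [(12, 0.91515136)]  # the lda[bow] output above
print(top_topic(prediction, topics))  # ('thought long room', 0.91515136)
```

The same helper applied to `distro[0]` from the Comparison section picks topic 8, 'harriet weston knightley', which fits a chunk of *Emma*.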

## Resources

* [Gensim](https://radimrehurek.com/gensim/): Python package for topic modeling, NLP, word vectorization, and a few other things. Well maintained and well documented.
* [Topic Modeling with Gensim](http://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/#11createthedictionaryandcorpusneededfortopicmodeling): A kind of cookbook for LDA with gensim. Excellent overview, but you need to be aware of missing import statements and assumed prior knowledge.
* [Chinese Restaurant Process](https://en.wikipedia.org/wiki/Chinese_restaurant_process): That really obscure stats thing I mentioned...
* [PyLDAvis](https://github.com/bmabey/pyLDAvis): Library for visualizing the topic model and performing some exploratory work. Works well. Has a direct parallel implementation in R as well.
* [Rare Technologies](https://rare-technologies.com/): The people that made & maintain gensim and a few other libraries.
* [Jane Austen v. Charlotte Bronte](https://www.literaryladiesguide.com/literary-musings/jane-austen-charlotte-bronte-different-alike/)