# Modelling Genre

Adapting author-topic modelling to genre-topic modelling, based on the gensim author-topic tutorial:

https://nbviewer.jupyter.org/github/rare-technologies/gensim/blob/develop/docs/notebooks/atmodel_tutorial.ipynb

## Latent Dirichlet Allocation

In [71]:
### Training a topic model on the Books corpus

In [72]:
import glob
from nltk.tokenize import word_tokenize

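def chunker(text, n):
    """Clean and tokenize `text`, then yield successive n-token chunks."""
    # Keep only letters and whitespace; this also drops apostrophes and digits.
    text = ''.join(c for c in text if c.isalpha() or c.isspace())
    tokens = [t.lower() for t in word_tokenize(text)]
    for i in range(0, len(tokens), n):
        yield tokens[i:i + n]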

class ParagraphIterator(object):
    def __init__(self, path, max_per_book=None,
                 chunk_size=300, max_books=None):
        self.max_books = max_books
        self.max_per_book = max_per_book
        self.chunk_size = chunk_size
        
        self.filenames = list(glob.glob(path))
        if self.max_books:
            self.filenames = self.filenames[:self.max_books]

    def __iter__(self):
        for filename in self.filenames:
            # Not used yet: genre and book id taken from the file path.
            # genre, idx = filename.split('/')[-2:]
            # idx = idx.replace('.txt', '')
            with open(filename, 'r') as f:
                try:
                    if self.max_per_book:
                        text = f.read(self.max_per_book)
                    else:
                        text = f.read()
                except (OSError, UnicodeDecodeError):
                    # Skip books that cannot be read or decoded.
                    continue
            for ch in chunker(text, self.chunk_size):
                yield ch
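
The cleaning and chunking step can be tried on a single sentence. The sketch below is a simplified stand-in, not the notebook's function: it assumes plain whitespace splitting instead of `word_tokenize`, and the name `simple_chunker` and the sample sentence are made up for illustration.

```python
def simple_chunker(text, n):
    """Clean `text`, split on whitespace, and yield lists of at most n tokens."""
    # Same character filter as chunker(): letters and whitespace only.
    cleaned = ''.join(c for c in text if c.isalpha() or c.isspace())
    # Assumption: str.split() stands in for nltk's word_tokenize.
    tokens = cleaned.lower().split()
    for i in range(0, len(tokens), n):
        yield tokens[i:i + n]


sample = "The wizard's owl flew north; it didn't return until dawn."
chunks = list(simple_chunker(sample, 4))
print(chunks)
```

Because apostrophes are removed before tokenizing, contractions collapse into single tokens such as `didnt`, which is why forms like `theyll`, `oclock` and `heres` show up in the topics further down.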

In [73]:
path = '/Users/mike/GitRepos/potter/data/other/books_txt_full/*/*.txt'
n_features = 3000
n_topics = 50
n_top_words = 60

paragraphs = ParagraphIterator(path, max_books=100)

In [74]:
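from gensim import corpora

max_freq = 0.5
min_wordcount = 20

# First pass over the corpus: build the vocabulary.
dictionary = corpora.Dictionary(paragraphs)
dictionary.filter_extremes(no_below=min_wordcount,
                           no_above=max_freq,
                           keep_n=n_features)
# Drop the 500 most frequent of the remaining tokens.
dictionary.filter_n_most_frequent(500)

# Second pass: convert each chunk to a bag-of-words vector.
bow = [dictionary.doc2bow(doc) for doc in paragraphs]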
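
To make the effect of `no_below` and `no_above` concrete, here is a minimal pure-Python sketch of document-frequency filtering on a toy corpus. It only mimics the idea behind `filter_extremes` and is not gensim's implementation; the function name `filter_by_doc_freq` and the toy documents are assumptions, and the `keep_n` and `filter_n_most_frequent` steps are left out.

```python
from collections import Counter


def filter_by_doc_freq(docs, no_below, no_above):
    """Return the set of tokens kept after document-frequency filtering."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    return {tok for tok, df in doc_freq.items()
            if df >= no_below and df / n_docs <= no_above}


# Toy corpus (hypothetical): four tokenized "chunks".
docs = [
    ['the', 'sea', 'boat'],
    ['the', 'sea', 'wind'],
    ['the', 'wizard', 'magic'],
    ['the', 'wizard', 'sea'],
]
kept = filter_by_doc_freq(docs, no_below=2, no_above=0.75)
print(sorted(kept))
```

Here `the` is dropped for appearing in every document, while `boat`, `wind` and `magic` are dropped for appearing in only one.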

2018-02-01 08:58:29,488 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-02-01 08:58:44,672 : INFO : adding document #10000 to Dictionary(51244 unique tokens: ['the', 'halfling', 'book', 'one', 'in']...)


KeyboardInterrupt: 

In [67]:
len(bow)

23580

In [68]:
import gensim
from gensim import corpora
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [70]:
lda = gensim.models.ldamodel.LdaModel(corpus=bow,
                                      id2word=dictionary,
                                      num_topics=n_topics,
                                      update_every=1,
                                      chunksize=1000,
                                      passes=1)

2018-02-01 08:56:38,086 : INFO : using symmetric alpha at 0.02
2018-02-01 08:56:38,086 : INFO : using symmetric eta at 0.0004
2018-02-01 08:56:38,087 : INFO : using serial LDA version on this node
2018-02-01 08:56:38,830 : INFO : running online (single-pass) LDA training, 50 topics, 1 passes over the supplied corpus of 23580 documents, updating model once every 1000 documents, evaluating perplexity every 10000 documents, iterating 50x with a convergence threshold of 0.001000
2018-02-01 08:56:38,831 : INFO : PROGRESS: pass 0, at document #1000/23580
2018-02-01 08:56:41,669 : INFO : merging changes from 1000 documents into a model of 23580 documents
2018-02-01 08:56:41,719 : INFO : topic #47 (0.020): 0.010*"guy" + 0.010*"victor" + 0.009*"college" + 0.008*"el" + 0.008*"job" + 0.007*"eventually" + 0.006*"guards" + 0.006*"free" + 0.005*"jason" + 0.005*"problem"
2018-02-01 08:56:41,720 : INFO : topic #4 (0.020): 0.008*"tommy" + 0.007*"sword" + 0.007*"guy" + 0.005*"stupid" + 0.005*"prince" + 

2018-02-01 08:56:51,918 : INFO : topic diff=0.497766, rho=0.408248
2018-02-01 08:56:51,919 : INFO : PROGRESS: pass 0, at document #7000/23580
2018-02-01 08:56:53,790 : INFO : merging changes from 1000 documents into a model of 23580 documents
2018-02-01 08:56:53,879 : INFO : topic #10 (0.020): 0.025*"observed" + 0.017*"beach" + 0.015*"crowd" + 0.013*"moon" + 0.012*"grasp" + 0.010*"wandered" + 0.009*"questioned" + 0.009*"determined" + 0.009*"laughter" + 0.008*"shore"
2018-02-01 08:56:53,880 : INFO : topic #2 (0.020): 0.022*"prisoner" + 0.019*"corridor" + 0.018*"falls" + 0.017*"tears" + 0.015*"bathroom" + 0.013*"dad" + 0.011*"branch" + 0.011*"cat" + 0.010*"cheeks" + 0.010*"mom"
2018-02-01 08:56:53,881 : INFO : topic #21 (0.020): 0.022*"france" + 0.017*"flag" + 0.011*"chief" + 0.010*"plane" + 0.009*"cargo" + 0.009*"scott" + 0.008*"hunt" + 0.007*"gon" + 0.007*"na" + 0.007*"money"
2018-02-01 08:56:53,881 : INFO : topic #1 (0.020): 0.034*"sea" + 0.030*"boat" + 0.024*"land" + 0.019*"ship" + 0

2018-02-01 08:57:09,858 : INFO : topic #24 (0.020): 0.030*"tree" + 0.021*"doorway" + 0.014*"tent" + 0.013*"beside" + 0.013*"steps" + 0.011*"trail" + 0.010*"forest" + 0.010*"path" + 0.009*"trees" + 0.008*"opposite"
2018-02-01 08:57:09,859 : INFO : topic #23 (0.020): 0.060*"jungle" + 0.031*"horse" + 0.022*"wizard" + 0.018*"magic" + 0.013*"horses" + 0.013*"crystal" + 0.013*"rain" + 0.011*"tarzyn" + 0.011*"star" + 0.010*"spell"
2018-02-01 08:57:09,860 : INFO : topic #49 (0.020): 0.015*"french" + 0.014*"shield" + 0.014*"dad" + 0.013*"clothes" + 0.013*"shirt" + 0.011*"rick" + 0.010*"kitchen" + 0.010*"shoes" + 0.010*"eat" + 0.010*"pair"
2018-02-01 08:57:09,860 : INFO : topic #12 (0.020): 0.054*"allison" + 0.024*"bill" + 0.013*"theyll" + 0.012*"file" + 0.012*"mirror" + 0.009*"knowing" + 0.009*"certain" + 0.009*"material" + 0.009*"whoever" + 0.009*"mile"
2018-02-01 08:57:09,861 : INFO : topic #1 (0.020): 0.032*"sea" + 0.029*"boat" + 0.025*"east" + 0.022*"cabin" + 0.020*"wind" + 0.019*"ships" + 

2018-02-01 08:57:21,397 : INFO : topic #19 (0.020): 0.057*"jacob" + 0.031*"general" + 0.022*"message" + 0.018*"figured" + 0.017*"boss" + 0.014*"hunting" + 0.014*"alert" + 0.014*"tomorrow" + 0.013*"planet" + 0.012*"today"
2018-02-01 08:57:21,397 : INFO : topic #12 (0.020): 0.018*"theyll" + 0.016*"mirror" + 0.014*"enemy" + 0.013*"risk" + 0.013*"allison" + 0.013*"file" + 0.012*"feared" + 0.011*"consider" + 0.011*"plate" + 0.010*"certain"
2018-02-01 08:57:21,398 : INFO : topic #30 (0.020): 0.043*"chamber" + 0.035*"company" + 0.025*"drawing" + 0.020*"oclock" + 0.020*"grace" + 0.016*"cook" + 0.015*"afternoon" + 0.014*"host" + 0.014*"dartagnyn" + 0.013*"nation"
2018-02-01 08:57:21,398 : INFO : topic diff=0.371920, rho=0.229416
2018-02-01 08:57:25,021 : INFO : -9.276 per-word bound, 619.8 perplexity estimate based on a held-out corpus of 1000 documents with 50948 words
2018-02-01 08:57:25,022 : INFO : PROGRESS: pass 0, at document #20000/23580
2018-02-01 08:57:27,011 : INFO : merging changes f
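
Several numbers in the training log can be checked by hand. The sketch below assumes gensim's default `decay=0.5` and `offset=1.0` for the online step size and that the reported perplexity is 2 raised to the negative per-word bound; the helper name `rho` is ours.

```python
num_topics = 50

# Symmetric alpha: one equal share per topic (log: "symmetric alpha at 0.02").
alpha = 1.0 / num_topics


def rho(k, offset=1.0, decay=0.5):
    """Step size for the k-th chunk (1-indexed) of the first pass."""
    return (offset + (k - 1)) ** -decay


# Per-word bound from the log; perplexity = 2 ** (-bound).
per_word_bound = -9.276
perplexity = 2 ** (-per_word_bound)

print(alpha)
print(round(rho(6), 6))    # update logged just before document #7000
print(round(rho(19), 6))   # update logged just before document #20000
print(round(perplexity, 1))
```

The two step sizes reproduce the logged `rho=0.408248` and `rho=0.229416`, and the perplexity comes out at roughly 620, which agrees with the logged 619.8 up to the rounding of the bound.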

In [75]:
print(lda.print_topics(num_topics=10, num_words=20))

2018-02-01 08:58:49,774 : INFO : topic #6 (0.020): 0.095*"president" + 0.058*"billy" + 0.030*"tommy" + 0.030*"aircraft" + 0.029*"anderson" + 0.027*"rifle" + 0.026*"ball" + 0.024*"mark" + 0.016*"enemy" + 0.014*"carriage" + 0.014*"yard" + 0.013*"fired" + 0.013*"highway" + 0.010*"six" + 0.009*"boxes" + 0.008*"jumped" + 0.008*"device" + 0.008*"radio" + 0.008*"loaded" + 0.008*"heres"
2018-02-01 08:58:49,775 : INFO : topic #1 (0.020): 0.039*"sea" + 0.036*"boat" + 0.035*"ships" + 0.026*"wind" + 0.022*"east" + 0.020*"north" + 0.017*"land" + 0.016*"ship" + 0.016*"pilot" + 0.014*"port" + 0.013*"bridge" + 0.013*"speed" + 0.012*"waves" + 0.012*"miles" + 0.011*"bird" + 0.011*"sail" + 0.011*"vessel" + 0.010*"dock" + 0.010*"bay" + 0.009*"surface"
2018-02-01 08:58:49,776 : INFO : topic #19 (0.020): 0.036*"message" + 0.034*"fbi" + 0.026*"general" + 0.025*"uh" + 0.020*"boss" + 0.018*"alert" + 0.018*"tomorrow" + 0.016*"thanks" + 0.015*"jacob" + 0.015*"jump" + 0.014*"today" + 0.014*"planet" + 0.013*"figur

[(6, '0.095*"president" + 0.058*"billy" + 0.030*"tommy" + 0.030*"aircraft" + 0.029*"anderson" + 0.027*"rifle" + 0.026*"ball" + 0.024*"mark" + 0.016*"enemy" + 0.014*"carriage" + 0.014*"yard" + 0.013*"fired" + 0.013*"highway" + 0.010*"six" + 0.009*"boxes" + 0.008*"jumped" + 0.008*"device" + 0.008*"radio" + 0.008*"loaded" + 0.008*"heres"'), (1, '0.039*"sea" + 0.036*"boat" + 0.035*"ships" + 0.026*"wind" + 0.022*"east" + 0.020*"north" + 0.017*"land" + 0.016*"ship" + 0.016*"pilot" + 0.014*"port" + 0.013*"bridge" + 0.013*"speed" + 0.012*"waves" + 0.012*"miles" + 0.011*"bird" + 0.011*"sail" + 0.011*"vessel" + 0.010*"dock" + 0.010*"bay" + 0.009*"surface"'), (19, '0.036*"message" + 0.034*"fbi" + 0.026*"general" + 0.025*"uh" + 0.020*"boss" + 0.018*"alert" + 0.018*"tomorrow" + 0.016*"thanks" + 0.015*"jacob" + 0.015*"jump" + 0.014*"today" + 0.014*"planet" + 0.013*"figured" + 0.013*"plan" + 0.013*"hood" + 0.012*"sounding" + 0.012*"indicating" + 0.010*"guys" + 0.009*"meeting" + 0.009*"responded"'), (
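
The topic strings above are easier to work with as (word, weight) pairs, for example when preparing labels. The following is a small sketch that parses the printed format with a regular expression; the helper `parse_topic` is hypothetical, and gensim's own `show_topic` method returns such pairs directly from a trained model.

```python
import re

# Matches one term of the form 0.039*"sea".
TERM = re.compile(r'([0-9.]+)\*"([^"]+)"')


def parse_topic(topic_string):
    """Turn '0.039*"sea" + 0.036*"boat"' into [('sea', 0.039), ('boat', 0.036)]."""
    return [(word, float(weight)) for weight, word in TERM.findall(topic_string)]


# First four terms of topic #1 as printed above.
topic_1 = '0.039*"sea" + 0.036*"boat" + 0.035*"ships" + 0.026*"wind"'
pairs = parse_topic(topic_1)
print([word for word, _ in pairs])
```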

### Reading tea leaves: add your own label to the topics

In [76]:
!pip install pyldavis


In [80]:
import pyLDAvis
import pyLDAvis.gensim

v = pyLDAvis.gensim.prepare(lda, bow, dictionary)
pyLDAvis.display(v)



- Student assignment: provide a short interpretative label for each topic.

In [None]:
### Infer topics on Harry Potter (HP) + diachronic plot

## A Genre-Topic Model of the Books corpus
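
An author-topic model needs a mapping from each author to the documents they wrote; for a genre-topic model the genre would take the author's place. The sketch below builds such a mapping under the assumption that each book's parent directory names its genre, as the `*/*.txt` glob pattern suggests. The paths, genre names and chunk counts are hypothetical, and the resulting dictionary is only meant as a candidate for the `author2doc` argument of gensim's `AuthorTopicModel`.

```python
import os
from collections import defaultdict


def genre_of(path):
    """Return the parent directory name, assumed here to be the genre."""
    return os.path.basename(os.path.dirname(path))


def build_genre2doc(chunks_per_book):
    """Map each genre to the indices of the chunks drawn from its books.

    `chunks_per_book` is a list of (path, n_chunks) pairs in corpus order.
    """
    genre2doc = defaultdict(list)
    next_index = 0
    for path, n_chunks in chunks_per_book:
        indices = range(next_index, next_index + n_chunks)
        genre2doc[genre_of(path)].extend(indices)
        next_index += n_chunks
    return dict(genre2doc)


# Hypothetical books: (path, number of chunks the book was split into).
books = [
    ('books_txt_full/Fantasy/1.txt', 3),
    ('books_txt_full/Romance/2.txt', 2),
    ('books_txt_full/Fantasy/3.txt', 1),
]
genre2doc = build_genre2doc(books)
print(genre2doc)
```

The chunk indices must follow the same order in which the chunks enter the bag-of-words corpus, otherwise genres and documents will be misaligned.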

## word2vec: modelling the muggles and other non-words