## Getting Started

In this workbook we loosely follow the example from "Towards Data Science" on
[Topic Modeling with spaCy and gensim](https://towardsdatascience.com/building-a-topic-modeling-pipeline-with-spacy-and-gensim-c5dc03ffc619). First, we need to install gensim, so open a command window (I had to run mine in "administrator"
mode) and run `pip install gensim`. We're also going to do some data viz, so run `pip install pyLDAvis` as well.
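
Before going further, it can help to confirm which versions ended up installed. The code in this workbook appears to target spaCy 2.x (it passes plain functions to `nlp.add_pipe` and filters out the `-PRON-` lemma), so other versions may behave differently. The helper below is only a sketch using the standard library; the name `installed_versions` is made up for this example.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package name: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Report on the packages this workbook imports.
for name, version in installed_versions(["gensim", "pyLDAvis", "spacy", "nltk"]).items():
    print(f"{name}: {version or 'not installed'}")
```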


In [1]:
from nltk.corpus import brown
from nltk.corpus import stopwords

import numpy as np
import pandas as pd

import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

import pyLDAvis
import pyLDAvis.gensim

import spacy
from spacy.lemmatizer import Lemmatizer
from spacy.lang.en.stop_words import STOP_WORDS

from pprint import pprint
from collections import Counter, defaultdict

nlp = spacy.load('en_core_web_sm')

In [2]:
# Some functions we'll use later in the topic modeling.

def lemmatizer(doc):
    # This takes in a doc of tokens from the NER and lemmatizes them.
    # Pronouns (like "I" and "you") get lemmatized to '-PRON-', so I'm removing those.
    doc = [token.lemma_ for token in doc if token.lemma_ != '-PRON-']
    doc = u' '.join(doc)
    return nlp.make_doc(doc)
    
def remove_stopwords(doc):
    # This will remove stopwords and punctuation.
    # Use token.text to return strings, which we'll need for Gensim.
    doc = [token.text for token in doc if not token.is_stop and not token.is_punct]
    return doc

## Getting to Know the Brown Corpus

Let's spend a bit of time getting to know what's in the Brown corpus, our NLTK example of an "overlapping" corpus.

In [3]:
# categories of articles in Brown corpus
print(brown.categories())

for category in brown.categories() :
    print(f"For {category} we have {len(brown.fileids(categories=category))} articles.")


['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
For adventure we have 29 articles.
For belles_lettres we have 75 articles.
For editorial we have 27 articles.
For fiction we have 29 articles.
For government we have 30 articles.
For hobbies we have 36 articles.
For humor we have 9 articles.
For learned we have 80 articles.
For lore we have 48 articles.
For mystery we have 24 articles.
For news we have 44 articles.
For religion we have 17 articles.
For reviews we have 17 articles.
For romance we have 29 articles.
For science_fiction we have 6 articles.


Let's create a list of the articles in the editorial, government, news, and romance categories.

In [4]:
for_modeling = []

for category in ['editorial','government','news','romance'] :
    for file_id in brown.fileids(categories=category) :
        text = brown.words(fileids=file_id)
        for_modeling.append(" ".join(text))
        
print(f"We have {len(for_modeling)} documents.")

We have 130 documents.


In [5]:
# Updates spaCy's default stop words list with my additional words. 
stop_list = ['`',"Mr.","Mrs.","Ms."]
nlp.Defaults.stop_words.update(stop_list)

# Iterates over the words in the stop words list and resets the "is_stop" flag.
for word in STOP_WORDS:
    lexeme = nlp.vocab[word]
    lexeme.is_stop = True

In [6]:
# The add_pipe function appends our functions to the default pipeline.
nlp.add_pipe(lemmatizer,name='lemmatizer',after='ner')
nlp.add_pipe(remove_stopwords, name="stopwords", last=True)
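
A pipeline hands the output of each component to the next one, so whatever the last component returns is what `nlp(doc)` gives back. Since `remove_stopwords` is last and returns a list of strings, the result is a plain list rather than a spaCy doc. The toy sketch below illustrates that chaining with made-up stand-in functions (`toy_tokenize`, `toy_lemmatize`, `toy_remove_stopwords`); it is not spaCy's implementation.

```python
def run_pipeline(text, components):
    """Feed `text` through each component in order, passing results along."""
    result = text
    for component in components:
        result = component(result)
    return result


# Hypothetical stand-ins for the tokenizer and our two custom components.
def toy_tokenize(text):
    return text.split()


def toy_lemmatize(tokens):
    # Toy rule: map a couple of inflected forms to a base form.
    lemmas = {"cats": "cat", "ran": "run"}
    return [lemmas.get(tok, tok) for tok in tokens]


def toy_remove_stopwords(tokens):
    stop = {"the", "a", "and"}
    return [tok for tok in tokens if tok not in stop]


tokens = run_pipeline("the cats ran", [toy_tokenize, toy_lemmatize, toy_remove_stopwords])
print(tokens)  # a plain list of strings: ['cat', 'run']
```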

In [7]:
doc_list = []

# Iterates through each article in the corpus.
for doc in for_modeling :
    # Passes that article through the pipeline and adds to a new list.
    pr = nlp(doc)
    doc_list.append([t.lower() for t in pr if t.isalpha()])

In [8]:
# Create a mapping of word IDs to words.
words = corpora.Dictionary(doc_list)

# Turns each document into a bag of words.
corpus = [words.doc2bow(doc) for doc in doc_list]
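
To make the bag-of-words step concrete, here is a small pure-Python sketch of the idea: give each distinct token an integer id, then represent each document as `(token_id, count)` pairs. The helpers `build_vocab` and `to_bow` are made up for illustration, and gensim may assign ids in a different order than this sketch does.

```python
from collections import Counter


def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs, skipping tokens not in the vocabulary."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())


toy_docs = [["tax", "state", "tax"], ["love", "home", "state"]]
toy_vocab = build_vocab(toy_docs)
toy_corpus = [to_bow(doc, toy_vocab) for doc in toy_docs]
print(toy_corpus)  # [[(0, 2), (1, 1)], [(1, 1), (2, 1), (3, 1)]]
```

Word order is discarded at this point; only the counts per document are passed on to the model.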

Now we fit the LDA model, asking for four topics to match the four categories we selected.

In [9]:
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=words,
                                           num_topics=4, 
                                           random_state=2,
                                           update_every=1,
                                           passes=15,
                                           alpha='auto',
                                           per_word_topics=True)

In [10]:
pprint(lda_model.print_topics(num_words=15))

[(0,
  '0.006*"like" + 0.005*"come" + 0.005*"know" + 0.005*"look" + 0.005*"think" + '
  '0.005*"man" + 0.004*"time" + 0.004*"day" + 0.003*"old" + 0.003*"little" + '
  '0.003*"feel" + 0.003*"tell" + 0.003*"good" + 0.003*"home" + 0.003*"way"'),
 (1,
  '0.006*"year" + 0.005*"state" + 0.004*"man" + 0.003*"new" + 0.003*"time" + '
  '0.003*"board" + 0.003*"president" + 0.003*"country" + 0.003*"people" + '
  '0.002*"program" + 0.002*"city" + 0.002*"need" + 0.002*"good" + '
  '0.002*"american" + 0.002*"government"'),
 (2,
  '0.007*"year" + 0.006*"new" + 0.004*"state" + 0.003*"john" + '
  '0.003*"president" + 0.003*"member" + 0.003*"student" + 0.003*"time" + '
  '0.003*"high" + 0.003*"program" + 0.002*"city" + 0.002*"day" + '
  '0.002*"increase" + 0.002*"university" + 0.002*"man"'),
 (3,
  '0.008*"state" + 0.007*"year" + 0.006*"government" + 0.006*"united" + '
  '0.005*"states" + 0.005*"new" + 0.005*"tax" + 0.004*"service" + 0.004*"time" '
  '+ 0.003*"shall" + 0.003*"use" + 0.003*"general" + 0.

In [11]:
pyLDAvis.enable_notebook()
pyLDAvis.gensim.prepare(lda_model, corpus, words)

Let's take a look at our topic classifications by document and see how good a job LDA is doing at recovering our original categories. We'll take each document one at a time, parse it (as a joined string), and do basically the same processing as we did before.

You can pass the processed document to the LDA model using square brackets (this is a bit odd) and receive a tuple back. Because we fit the model with `per_word_topics=True`, the first element of the tuple holds the topics and their associated probabilities. The topic with the highest probability is the assigned topic.
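
As a small illustration of that last step, the sketch below picks the most probable topic from a list of `(topic_id, probability)` pairs. The numbers are made up for the example, and `most_likely_topic` is a hypothetical helper name. Note that low-probability topics may be left out of the list, so it will not always have one entry per topic.

```python
def most_likely_topic(topic_probs):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(topic_probs, key=lambda pair: pair[1])


# Made-up values, shaped like the first element of the tuple described above.
example_topic_probs = [(0, 0.91), (1, 0.06), (3, 0.03)]

topic_id, prob = most_likely_topic(example_topic_probs)
print(f"Assigned topic {topic_id} with probability {prob}")
```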

In [12]:
topic_assignments = []

for file_id in brown.fileids(categories="romance") :
    doc = brown.words(fileids=file_id)
    pr = nlp(" ".join(doc))
    doc = [t.lower() for t in pr if t.isalpha()]
    doc_new = words.doc2bow(doc)
    
    topic_probs = lda_model[doc_new][0]
    topic = max(topic_probs,key=lambda x: x[1])
    topic_assignments.append(topic[0])
    

Now let's look at those topic assignments:

In [13]:
Counter(topic_assignments)

Counter({0: 26, 2: 1, 1: 2})

Looks like romance lands overwhelmingly in topic zero (26 of 29 articles). Let's do this for every category we worked with.

In [14]:
topic_assignments = defaultdict(list)

for category in ['editorial','government','news','romance'] :
    for file_id in brown.fileids(categories=category) :

        doc = brown.words(fileids=file_id)
        pr = nlp(" ".join(doc))
        doc = [t.lower() for t in pr if t.isalpha()]
        doc_new = words.doc2bow(doc)

        topic_probs = lda_model[doc_new][0]
        topic = max(topic_probs,key=lambda x: x[1])
        topic_assignments[category].append(topic[0])


In [15]:
for cat, topic_list in topic_assignments.items() :
    print(f"In {cat} we had the following:")
    topic_count = Counter(topic_list).most_common()
    
    for topic, count in topic_count : 
        print(f"    {count} articles were classified as topic {topic}.")
    
    

In editorial we had the following:
    12 articles were classified as topic 1.
    7 articles were classified as topic 3.
    6 articles were classified as topic 2.
    2 articles were classified as topic 0.
In government we had the following:
    13 articles were classified as topic 3.
    8 articles were classified as topic 2.
    6 articles were classified as topic 1.
    3 articles were classified as topic 0.
In news we had the following:
    17 articles were classified as topic 2.
    11 articles were classified as topic 1.
    8 articles were classified as topic 3.
    8 articles were classified as topic 0.
In romance we had the following:
    26 articles were classified as topic 0.
    2 articles were classified as topic 1.
    1 articles were classified as topic 2.


As we can see, the assignment is far from perfect, but the categories overlap heavily, particularly the first three (editorial, government, and news). Romance is the exception: 26 of its 29 articles land in topic 0.
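
One rough way to put a number on this is the share of each category that falls into its most common topic. The sketch below hard-codes the counts printed above; `dominant_share` is a made-up helper, and this is only a quick summary rather than a formal evaluation metric.

```python
# Topic counts per category, copied from the printed output above.
assignments = {
    "editorial": {1: 12, 3: 7, 2: 6, 0: 2},
    "government": {3: 13, 2: 8, 1: 6, 0: 3},
    "news": {2: 17, 1: 11, 3: 8, 0: 8},
    "romance": {0: 26, 1: 2, 2: 1},
}


def dominant_share(topic_counts):
    """Return (most common topic, fraction of articles assigned to it)."""
    top_topic, top_count = max(topic_counts.items(), key=lambda item: item[1])
    return top_topic, top_count / sum(topic_counts.values())


for category, counts in assignments.items():
    topic, share = dominant_share(counts)
    print(f"{category}: topic {topic} holds {share:.0%} of articles")
```

With these counts, romance comes out near 90% while the other three categories sit between roughly 39% and 44%, which matches the impression that only romance is cleanly separated.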