# Latent Dirichlet Allocation Practice

To practice topic modeling, I will use several genres from the Brown Corpus. I will run LDA models with different numbers of topics and passes to find the combination that yields the most closely related words within each topic.

In [1]:
import nltk
from nltk.corpus import brown
from nltk.corpus import stopwords

import numpy as np
import pandas as pd

import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

import pyLDAvis
import pyLDAvis.gensim

import spacy
from spacy.lemmatizer import Lemmatizer
from spacy.lang.en.stop_words import STOP_WORDS

from pprint import pprint
from collections import Counter, defaultdict

nlp = spacy.load('en_core_web_sm')
sw = stopwords.words("english")

In [2]:
# Some functions we'll use later in the topic modeling.

def lemmatizer(doc):
    # Takes the Doc produced by the NER step and lemmatizes its tokens.
    # Pronouns (like "I" and "you") get lemmatized to '-PRON-', so I'm removing those.
    doc = [token.lemma_ for token in doc if token.lemma_ != '-PRON-']
    doc = ' '.join(doc)
    return nlp.make_doc(doc)

def remove_stopwords(doc):
    # Removes stopwords and punctuation.
    # Uses token.text to return strings, which we'll need for Gensim.
    return [token.text for token in doc if not token.is_stop and not token.is_punct]

I am choosing the categories of articles that make up my corpus from the Brown corpus. I am interested in 'science_fiction', 'humor', 'mystery', and 'lore'. I might expect four distinct groupings, one per genre, or words from 'mystery' and 'lore' could end up together because the two genres cover similar topics.

In [3]:
# All categories of articles in Brown corpus. I will use humor, science fiction, mystery, and lore
print(brown.categories())

['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']


In [4]:
# Pulling just the documents I want and storing them in a new list

for_modeling = []

for category in ['science_fiction','mystery','lore','humor'] :
    for file_id in brown.fileids(categories=category) :
        text = brown.words(fileids=file_id)
        for_modeling.append(" ".join(text))
        
print(f"We have {len(for_modeling)} documents.")

We have 87 documents.


In [5]:
# Iterates over the words in the stop words list and sets the "is_stop" flag on each lexeme.
for word in STOP_WORDS:
    lexeme = nlp.vocab[word]
    lexeme.is_stop = True

In [6]:
# The add_pipe function appends our functions to the default pipeline.
# (Passing plain functions like this is the spaCy 2.x API.)
nlp.add_pipe(lemmatizer,name='lemmatizer',after='ner')
nlp.add_pipe(remove_stopwords, name="stopwords", last=True)

In [7]:
doc_list = []

# Iterates through each article in the corpus.
for doc in for_modeling :
    # Passes that article through the pipeline and adds to a new list.
    pr = nlp(doc)
    doc_list.append([t.lower() for t in pr if t.isalpha()])

In [8]:
# Create a mapping of word IDs to words.
words = corpora.Dictionary(doc_list)

# Turns each document into a bag of words.
corpus = [words.doc2bow(doc) for doc in doc_list]
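
For intuition, the two steps above can be mimicked with the standard library alone. This is a minimal sketch, not Gensim's implementation: `build_vocab` and `to_bow` are hypothetical helpers, the toy documents are made up, and the integer IDs Gensim assigns will generally differ.

```python
from collections import Counter

def build_vocab(docs):
    """Assign each distinct token an integer ID in first-seen order."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())

# Made-up documents standing in for entries of doc_list.
toy_docs = [["wine", "red", "wine"], ["car", "man", "car", "red"]]
vocab = build_vocab(toy_docs)
bows = [to_bow(doc, vocab) for doc in toy_docs]

print(vocab)
print(bows)
```

Each document becomes a list of (ID, count) pairs, so word order is discarded, which is the "bag of words" assumption LDA works under.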

Now I am going to start fitting the model with different numbers of topics and passes to see which combination results in groupings of words that seem like they have things in common.

# Attempt 1: 4 topics, 15 passes

In [9]:
##Fitting the model

lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=words,
                                           num_topics=4, #First I want to see if LDA can group by genre 
                                           random_state=2,
                                           update_every=1,
                                           passes=15, #Start out seeing if 15 passes will get me to where I want
                                           alpha='auto',
                                           per_word_topics=True)

In [10]:
pprint(lda_model.print_topics(num_words=15))

[(0,
  '0.005*"use" + 0.004*"time" + 0.004*"seed" + 0.004*"wife" + 0.003*"husband" '
  '+ 0.003*"school" + 0.003*"oil" + 0.003*"college" + 0.003*"roberts" + '
  '0.003*"come" + 0.003*"work" + 0.002*"helva" + 0.002*"jewish" + 0.002*"long" '
  '+ 0.002*"good"'),
 (1,
  '0.006*"man" + 0.006*"know" + 0.005*"come" + 0.005*"time" + 0.005*"like" + '
  '0.004*"year" + 0.004*"tell" + 0.004*"look" + 0.003*"think" + 0.003*"school" '
  '+ 0.003*"car" + 0.003*"work" + 0.003*"day" + 0.003*"old" + 0.003*"want"'),
 (2,
  '0.004*"good" + 0.004*"man" + 0.004*"time" + 0.004*"new" + 0.004*"wine" + '
  '0.003*"know" + 0.003*"world" + 0.003*"like" + 0.002*"church" + 0.002*"year" '
  '+ 0.002*"way" + 0.002*"trade" + 0.002*"red" + 0.002*"people" + '
  '0.002*"come"'),
 (3,
  '0.005*"time" + 0.005*"know" + 0.004*"man" + 0.004*"like" + 0.004*"people" + '
  '0.003*"come" + 0.003*"think" + 0.003*"day" + 0.003*"tell" + 0.003*"find" + '
  '0.003*"use" + 0.002*"long" + 0.002*"way" + 0.002*"year" + 0.002*"little"')]


On the first try, with 4 topics and 15 passes, the word lists for the topics look very similar. Words like 'time' and 'man' appear in more than one topic. I am therefore re-running the model with a much larger number of passes to try to increase the relatedness of the words within each topic.
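
The overlap can be counted rather than eyeballed. Below is a minimal sketch that parses topic strings in the format `print_topics` returns and counts how many topics each word appears in. To keep it short, only the five highest-weighted words of each Attempt 1 topic are copied from the output above; `top_words` is a hypothetical helper.

```python
import re
from collections import Counter

# Top five terms per topic, copied from the Attempt 1 output.
attempt_1 = [
    (0, '0.005*"use" + 0.004*"time" + 0.004*"seed" + 0.004*"wife" + 0.003*"husband"'),
    (1, '0.006*"man" + 0.006*"know" + 0.005*"come" + 0.005*"time" + 0.005*"like"'),
    (2, '0.004*"good" + 0.004*"man" + 0.004*"time" + 0.004*"new" + 0.004*"wine"'),
    (3, '0.005*"time" + 0.005*"know" + 0.004*"man" + 0.004*"like" + 0.004*"people"'),
]

def top_words(topic_string):
    """Extract the quoted words from a print_topics-style string."""
    return re.findall(r'"([^"]+)"', topic_string)

# Count the number of topics in which each word occurs.
topic_counts = Counter()
for _, topic_string in attempt_1:
    topic_counts.update(set(top_words(topic_string)))

shared = {word: n for word, n in topic_counts.items() if n > 1}
print(sorted(shared.items(), key=lambda kv: (-kv[1], kv[0])))
```

Even among the top five words, 'time' shows up in all four topics and 'man' in three, which matches the impression that the topics are hard to tell apart.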

# Attempt 2: 4 topics, 50 passes

In [11]:
##Fitting the model

lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=words,
                                           num_topics=4, #Still checking whether LDA can group by genre
                                           random_state=2,
                                           update_every=1,
                                           passes=50, #Increasing the passes from 15 to 50
                                           alpha='auto',
                                           per_word_topics=True)

In [12]:
pprint(lda_model.print_topics(num_words=15))

[(0,
  '0.005*"use" + 0.004*"seed" + 0.004*"wife" + 0.004*"time" + 0.003*"husband" '
  '+ 0.003*"school" + 0.003*"oil" + 0.003*"college" + 0.003*"roberts" + '
  '0.003*"helva" + 0.003*"jewish" + 0.002*"sexual" + 0.002*"girl" + '
  '0.002*"work" + 0.002*"good"'),
 (1,
  '0.007*"man" + 0.006*"know" + 0.006*"come" + 0.005*"time" + 0.005*"like" + '
  '0.004*"look" + 0.004*"tell" + 0.004*"year" + 0.004*"think" + 0.003*"work" + '
  '0.003*"want" + 0.003*"day" + 0.003*"find" + 0.003*"car" + 0.003*"way"'),
 (2,
  '0.004*"good" + 0.004*"man" + 0.004*"wine" + 0.004*"new" + 0.004*"time" + '
  '0.003*"church" + 0.003*"world" + 0.003*"know" + 0.002*"year" + 0.002*"way" '
  '+ 0.002*"people" + 0.002*"trade" + 0.002*"red" + 0.002*"company" + '
  '0.002*"state"'),
 (3,
  '0.005*"time" + 0.005*"know" + 0.004*"man" + 0.004*"like" + 0.004*"people" + '
  '0.003*"come" + 0.003*"day" + 0.003*"use" + 0.003*"find" + 0.003*"tell" + '
  '0.003*"think" + 0.002*"long" + 0.002*"year" + 0.002*"great" + 0.002*"new"'

It's still hard to identify what kinds of categories the four topics might be based on. Some words still appear in several topics, which makes the topics hard to tell apart. I will therefore keep the number of passes at 50 and try adjusting the number of topics.

# Attempt 3: 3 topics, 50 passes

In [13]:
##Fitting the model

lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=words,
                                           num_topics=3, #Decreasing the number of topics to 3
                                           random_state=2,
                                           update_every=1,
                                           passes=50, #Keeping 50 passes from Attempt 2
                                           alpha='auto',
                                           per_word_topics=True)

In [14]:
pprint(lda_model.print_topics(num_words=15))

[(0,
  '0.004*"use" + 0.004*"wine" + 0.003*"come" + 0.003*"seed" + 0.003*"man" + '
  '0.003*"good" + 0.003*"know" + 0.003*"time" + 0.003*"look" + 0.002*"head" + '
  '0.002*"red" + 0.002*"like" + 0.002*"turn" + 0.002*"oil" + 0.002*"old"'),
 (1,
  '0.005*"time" + 0.005*"man" + 0.005*"know" + 0.005*"come" + 0.004*"school" + '
  '0.004*"like" + 0.004*"year" + 0.003*"work" + 0.003*"people" + 0.003*"find" '
  '+ 0.003*"day" + 0.003*"good" + 0.003*"new" + 0.003*"use" + 0.003*"way"'),
 (2,
  '0.005*"time" + 0.005*"man" + 0.005*"know" + 0.004*"like" + 0.003*"new" + '
  '0.003*"think" + 0.003*"year" + 0.003*"come" + 0.003*"good" + 0.003*"tell" + '
  '0.002*"find" + 0.002*"way" + 0.002*"long" + 0.002*"day" + 0.002*"people"')]


# Attempt 4: 6 topics, 50 passes

In [15]:
##Fitting the model

lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=words,
                                           num_topics=6, #Setting topics to 6
                                           random_state=2,
                                           update_every=1,
                                           passes=50, #Keeping 50 passes
                                           alpha='auto',
                                           per_word_topics=True)

In [16]:
pprint(lda_model.print_topics(num_words=15))

[(0,
  '0.008*"wine" + 0.006*"seed" + 0.005*"wife" + 0.005*"use" + 0.004*"red" + '
  '0.004*"oil" + 0.004*"husband" + 0.004*"good" + 0.003*"time" + 0.003*"man" + '
  '0.003*"fort" + 0.003*"machine" + 0.002*"sexual" + 0.002*"madden" + '
  '0.002*"marriage"'),
 (1,
  '0.006*"man" + 0.006*"come" + 0.006*"know" + 0.005*"time" + 0.005*"year" + '
  '0.004*"like" + 0.004*"school" + 0.004*"car" + 0.004*"tell" + 0.003*"think" '
  '+ 0.003*"work" + 0.003*"old" + 0.003*"tooth" + 0.003*"day" + 0.003*"new"'),
 (2,
  '0.005*"know" + 0.004*"time" + 0.004*"man" + 0.004*"like" + 0.004*"think" + '
  '0.003*"new" + 0.003*"way" + 0.003*"little" + 0.003*"look" + 0.003*"thing" + '
  '0.003*"good" + 0.003*"year" + 0.002*"come" + 0.002*"long" + 0.002*"find"'),
 (3,
  '0.005*"know" + 0.004*"man" + 0.004*"like" + 0.004*"people" + 0.004*"church" '
  '+ 0.004*"time" + 0.003*"come" + 0.003*"cattle" + 0.003*"use" + 0.003*"find" '
  '+ 0.003*"write" + 0.003*"mean" + 0.003*"think" + 0.002*"long" + '
  '0.002*"poet"')

Adjusting the number of topics while keeping the passes the same hasn't produced many topics whose words seem related. I am going to double the number of passes and see if that changes anything.

# Attempt 5: 6 topics, 100 passes

In [17]:
##Fitting the model

lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=words,
                                           num_topics=6, 
                                           random_state=2,
                                           update_every=1,
                                           passes=100, 
                                           alpha='auto',
                                           per_word_topics=True)

In [18]:
pprint(lda_model.print_topics(num_words=15))

[(0,
  '0.008*"wine" + 0.006*"seed" + 0.005*"wife" + 0.005*"use" + 0.004*"red" + '
  '0.004*"oil" + 0.004*"husband" + 0.004*"good" + 0.003*"time" + 0.003*"man" + '
  '0.003*"fort" + 0.003*"machine" + 0.002*"sexual" + 0.002*"madden" + '
  '0.002*"marriage"'),
 (1,
  '0.007*"man" + 0.006*"come" + 0.006*"know" + 0.005*"time" + 0.005*"year" + '
  '0.004*"like" + 0.004*"school" + 0.004*"car" + 0.004*"tell" + 0.004*"think" '
  '+ 0.003*"work" + 0.003*"old" + 0.003*"tooth" + 0.003*"day" + 0.003*"look"'),
 (2,
  '0.005*"know" + 0.004*"time" + 0.004*"man" + 0.004*"like" + 0.004*"think" + '
  '0.003*"new" + 0.003*"way" + 0.003*"little" + 0.003*"look" + 0.003*"thing" + '
  '0.003*"good" + 0.002*"year" + 0.002*"find" + 0.002*"long" + 0.002*"people"'),
 (3,
  '0.005*"know" + 0.004*"man" + 0.004*"like" + 0.004*"people" + 0.004*"church" '
  '+ 0.004*"time" + 0.003*"cattle" + 0.003*"come" + 0.003*"use" + 0.003*"find" '
  '+ 0.003*"write" + 0.003*"mean" + 0.003*"think" + 0.002*"long" + '
  '0.002*"poet

The topics seem to be getting better, with two topics looking pretty distinct: topics 0 and 5. Topic 0 includes words like 'wine', 'wife', 'husband', 'sexual', and 'marriage'. Topic 5 includes words like 'wave', 'water', 'tsunami', and 'trader'.

# Attempt 6: 3 topics, 100 passes

In [19]:
##Fitting the model

lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=words,
                                           num_topics=3, 
                                           random_state=2,
                                           update_every=1,
                                           passes=100, 
                                           alpha='auto',
                                           per_word_topics=True)

In [20]:
pprint(lda_model.print_topics(num_words=15))

[(0,
  '0.004*"use" + 0.004*"wine" + 0.003*"come" + 0.003*"seed" + 0.003*"man" + '
  '0.003*"good" + 0.003*"know" + 0.003*"time" + 0.003*"look" + 0.002*"head" + '
  '0.002*"red" + 0.002*"like" + 0.002*"turn" + 0.002*"oil" + 0.002*"tell"'),
 (1,
  '0.005*"time" + 0.005*"man" + 0.005*"know" + 0.005*"come" + 0.004*"school" + '
  '0.004*"like" + 0.004*"year" + 0.003*"work" + 0.003*"people" + 0.003*"find" '
  '+ 0.003*"day" + 0.003*"good" + 0.003*"new" + 0.003*"use" + 0.003*"way"'),
 (2,
  '0.005*"time" + 0.005*"man" + 0.005*"know" + 0.004*"like" + 0.003*"new" + '
  '0.003*"think" + 0.003*"year" + 0.003*"come" + 0.003*"good" + 0.002*"tell" + '
  '0.002*"find" + 0.002*"way" + 0.002*"long" + 0.002*"day" + 0.002*"people"')]


In Attempt 5, I used 6 topics and 100 passes and saw that two of the topics had words that seemed closely related. I hoped that dropping to three topics while keeping 100 passes would preserve those word groupings, but the groupings changed.
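
To separate the effect of passes from the effect of the topic count, the two three-topic runs can be compared directly. The sketch below computes the Jaccard similarity between the top-15 word lists of Attempt 3 (50 passes) and Attempt 6 (100 passes), with the words copied from the printed output; `jaccard` is a hypothetical helper, and Jaccard similarity is only one possible way to compare the lists.

```python
# Top-15 words per topic, copied from the Attempt 3 and Attempt 6 output.
attempt_3 = [
    "use wine come seed man good know time look head red like turn oil old",
    "time man know come school like year work people find day good new use way",
    "time man know like new think year come good tell find way long day people",
]
attempt_6 = [
    "use wine come seed man good know time look head red like turn oil tell",
    "time man know come school like year work people find day good new use way",
    "time man know like new think year come good tell find way long day people",
]

def jaccard(words_a, words_b):
    """Size of the intersection divided by size of the union of two word sets."""
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b)

similarities = [
    jaccard(old.split(), new.split())
    for old, new in zip(attempt_3, attempt_6)
]
for topic_id, similarity in enumerate(similarities):
    print(f"Topic {topic_id}: Jaccard similarity = {similarity:.3f}")
```

Two of the three topics keep exactly the same top-15 words and the third swaps a single word, which suggests that going from 50 to 100 passes changed the three-topic model very little, and that the change in groupings comes from the number of topics.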

# Attempt 7: 5 topics, 200 passes

In [21]:
##Fitting the model

lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=words,
                                           num_topics=5, 
                                           random_state=2,
                                           update_every=1,
                                           passes=200, 
                                           alpha='auto',
                                           per_word_topics=True)

In [22]:
pprint(lda_model.print_topics(num_words=15))

[(0,
  '0.009*"wine" + 0.006*"wife" + 0.005*"helva" + 0.005*"husband" + '
  '0.004*"good" + 0.003*"time" + 0.003*"machine" + 0.003*"sexual" + '
  '0.003*"marriage" + 0.003*"man" + 0.003*"jones" + 0.003*"old" + '
  '0.002*"people" + 0.002*"shell" + 0.002*"quack"'),
 (1,
  '0.005*"year" + 0.005*"school" + 0.004*"time" + 0.004*"tooth" + 0.004*"man" '
  '+ 0.004*"work" + 0.003*"come" + 0.003*"know" + 0.003*"new" + 0.003*"like" + '
  '0.003*"day" + 0.003*"tell" + 0.002*"think" + 0.002*"way" + 0.002*"high"'),
 (2,
  '0.005*"time" + 0.004*"man" + 0.004*"new" + 0.004*"good" + 0.003*"year" + '
  '0.002*"company" + 0.002*"film" + 0.002*"trade" + 0.002*"know" + '
  '0.002*"general" + 0.002*"anti" + 0.002*"use" + 0.002*"state" + 0.002*"way" '
  '+ 0.002*"river"'),
 (3,
  '0.007*"know" + 0.006*"man" + 0.005*"like" + 0.005*"come" + 0.005*"time" + '
  '0.004*"car" + 0.004*"think" + 0.004*"look" + 0.003*"tell" + 0.003*"want" + '
  '0.003*"people" + 0.003*"turn" + 0.003*"leave" + 0.003*"right" + '
  '0

# Attempt 8: 6 topics, 200 passes

I decided to return to 6 topics because that setting had given me the most related-seeming words so far. I increased the passes to 200 to make the words in each topic as related as possible.

In [23]:
##Fitting the model

lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=words,
                                           num_topics=6, 
                                           random_state=2,
                                           update_every=1,
                                           passes=200, 
                                           alpha='auto',
                                           per_word_topics=True)

In [24]:
pprint(lda_model.print_topics(num_words=15))

[(0,
  '0.008*"wine" + 0.006*"seed" + 0.005*"wife" + 0.005*"use" + 0.004*"red" + '
  '0.004*"oil" + 0.004*"husband" + 0.004*"good" + 0.003*"time" + 0.003*"man" + '
  '0.003*"fort" + 0.003*"machine" + 0.002*"sexual" + 0.002*"madden" + '
  '0.002*"marriage"'),
 (1,
  '0.007*"man" + 0.006*"come" + 0.006*"know" + 0.005*"time" + 0.005*"year" + '
  '0.004*"like" + 0.004*"school" + 0.004*"car" + 0.004*"tell" + 0.004*"think" '
  '+ 0.003*"work" + 0.003*"old" + 0.003*"day" + 0.003*"look" + 0.003*"tooth"'),
 (2,
  '0.005*"know" + 0.004*"time" + 0.004*"man" + 0.004*"like" + 0.004*"think" + '
  '0.003*"new" + 0.003*"way" + 0.003*"little" + 0.003*"thing" + 0.003*"look" + '
  '0.003*"good" + 0.002*"year" + 0.002*"find" + 0.002*"long" + 0.002*"people"'),
 (3,
  '0.005*"know" + 0.004*"man" + 0.004*"like" + 0.004*"people" + 0.004*"church" '
  '+ 0.004*"time" + 0.003*"cattle" + 0.003*"come" + 0.003*"use" + 0.003*"find" '
  '+ 0.003*"write" + 0.003*"mean" + 0.003*"think" + 0.002*"long" + '
  '0.002*"poet

__Visualization of Attempt 8__

In [25]:
pyLDAvis.enable_notebook()
pyLDAvis.gensim.prepare(lda_model, corpus, words)

You can see that topics 5 and 6 in the visualization are very distinct. Topics 1-4 overlap heavily, but I know from Attempt 6 that reducing the number of topics to 3 does not actually improve the LDA results. Overall, the most distinct word groups came from using 6 topics, so I will base the rest of my analysis on six topics.

Topic 0 (5 in the visualization) uses words like 'wine', 'husband', 'wife', 'sexual', and 'marriage'. It could be centered on marriage or relationships in general.

Topics 1-4 are very similar. Common words include 'time', 'man', 'like', 'know', and 'year'.

Topic 5 (6 in the visualization) uses words like 'water', 'waves', 'tsunami', 'earthquakes', 'mickey', and 'wart'. This topic could be focused on nature, natural disasters, or, in part, water.



__Looking at the probability of each article belonging to each of the original categories__

In [26]:
topic_assignments = defaultdict(list)

for category in ['science_fiction','humor','lore','mystery'] :
    for file_id in brown.fileids(categories=category) :

        doc = brown.words(fileids=file_id)
        pr = nlp(" ".join(doc))
        doc = [t.lower() for t in pr if t.isalpha()]
        doc_new = words.doc2bow(doc)

        topic_probs = lda_model[doc_new][0]
        topic = max(topic_probs,key=lambda x: x[1])
        topic_assignments[category].append(topic[0])

In [27]:
for cat, topic_list in topic_assignments.items() :
    print(f"In {cat} we had the following:")
    topic_count = Counter(topic_list).most_common()
    
    for topic, count in topic_count : 
        print(f"    {count} articles were classified as topic {topic}.")
    
    

In science_fiction we had the following:
    2 articles were classified as topic 1.
    2 articles were classified as topic 2.
    1 articles were classified as topic 4.
    1 articles were classified as topic 5.
In humor we had the following:
    4 articles were classified as topic 4.
    4 articles were classified as topic 2.
    1 articles were classified as topic 1.
In lore we had the following:
    11 articles were classified as topic 4.
    10 articles were classified as topic 1.
    10 articles were classified as topic 3.
    8 articles were classified as topic 2.
    6 articles were classified as topic 0.
    3 articles were classified as topic 5.
In mystery we had the following:
    11 articles were classified as topic 1.
    5 articles were classified as topic 3.
    4 articles were classified as topic 4.
    2 articles were classified as topic 2.
    1 articles were classified as topic 5.
    1 articles were classified as topic 0.


Even though the most distinct word relationships came from specifying six topics, few genres map onto a distinct topic of their own.
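
Because the genres differ a lot in size (6 science fiction documents versus 48 for lore), raw counts are hard to compare. A minimal sketch that converts the counts printed above into the share of each genre held by its most common topic; `top_share` is a hypothetical helper, and ties are resolved in favor of whichever topic was printed first.

```python
# Topic counts per genre, copied from the output above.
counts = {
    "science_fiction": {1: 2, 2: 2, 4: 1, 5: 1},
    "humor": {4: 4, 2: 4, 1: 1},
    "lore": {4: 11, 1: 10, 3: 10, 2: 8, 0: 6, 5: 3},
    "mystery": {1: 11, 3: 5, 4: 4, 2: 2, 5: 1, 0: 1},
}

def top_share(topic_counts):
    """Return (topic, share) for the most common topic; first listed wins ties."""
    total = sum(topic_counts.values())
    topic, n = max(topic_counts.items(), key=lambda kv: kv[1])
    return topic, n / total

shares = {genre: top_share(topic_counts) for genre, topic_counts in counts.items()}
for genre, (topic, share) in shares.items():
    total = sum(counts[genre].values())
    print(f"{genre}: topic {topic} covers {share:.0%} of {total} documents")
```

No genre has a majority of its documents in a single topic. Mystery comes closest, with just under half of its documents in topic 1.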

Under mystery, 11 articles were classified into topic 1, the highest count for that category. Topic 1 includes words like 'man', 'time', 'school', 'car', and 'tooth'. The words seem hard to connect, but they still represent mystery better than the other topics do.

Lore seems to encompass topics 1, 3, and 4, which were nearly tied in the number of articles assigned to them.

Humor seems to encompass topics 2 and 4. Topic 4 contains most of the instances of the words 'cattle' and 'poet'. Topic 2 contains most of the instances of 'film', 'hudson', and 'farm'.

Science fiction seems to encompass every topic except 3 and 0, the latter being the topic I previously took to be centered on marriage and/or relationships.


Overall, selecting six topics for the LDA model led to two topics that appeared distinct from all the others. While the other four topics overlapped heavily, reducing the number of topics didn't actually improve the relatedness of words within the topics.

Mystery seems to be well represented by topic 1. Topic 1 also helps represent lore, but topics 3 and 4 represent lore about equally well. I can also rule out topics 0, 2, and 5 because so few articles fit those topics. Humor is best represented by topics 2 and 4 and is not well represented by any other topic. Science fiction is spread fairly evenly across many topics, although it does not involve topic 3 or 0.

Since topic 3 only appears in mystery and lore, it could be capturing something these two fairly related genres share. Topic 0 is absent only from humor and science fiction, and it is one of the fairly distinct topics (presumed to cover relationships/marriage). The omission of these topics from humor and science fiction was surprising.
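
The claims about which topics are missing from which genres can be checked by inverting the assignment counts. A minimal sketch, assuming the counts printed by cell 27 are copied into a plain dictionary; `genres_by_topic` is a name introduced here for illustration.

```python
from collections import defaultdict

# Topic counts per genre, copied from the cell 27 output.
counts = {
    "science_fiction": {1: 2, 2: 2, 4: 1, 5: 1},
    "humor": {4: 4, 2: 4, 1: 1},
    "lore": {4: 11, 1: 10, 3: 10, 2: 8, 0: 6, 5: 3},
    "mystery": {1: 11, 3: 5, 4: 4, 2: 2, 5: 1, 0: 1},
}

# Map each topic to the set of genres with at least one document assigned to it.
genres_by_topic = defaultdict(set)
for genre, topic_counts in counts.items():
    for topic in topic_counts:
        genres_by_topic[topic].add(genre)

for topic in sorted(genres_by_topic):
    print(f"Topic {topic}: {sorted(genres_by_topic[topic])}")
```

Topics 0 and 3 each appear only in lore and mystery, while topics 1, 2, and 4 appear in all four genres.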