Source: This tutorial was created using the code found on Machine Learning Plus, created by Selva Prabhakaran. While much of the written text is my own, most of what is presented (the headers, the Python code, and the technique) is credited to Selva. His work can be found at https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/ and the data is pulled from his GitHub at https://raw.githubusercontent.com/selva86/datasets/master/newsgroups.json

<img src="img/lda_pic.PNG" width="1000" height="1000" align="center"/>

# Introduction

Today we will be learning about an incredibly useful and hot topic in NLP: topic modelling. The premise is that documents contain topics, and it is our job to extract those topics algorithmically. It is easy to pull random words out of a document, but that gives us little semantic meaning. The challenge is creating diverse and distinct groupings of words that relate to each other. Best stated:

"...how to extract good quality of topics that are clear, segregated and meaningful. This depends heavily on the quality of text preprocessing and the strategy of finding the optimal number of topics."

Selva Prabhakaran
 
https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/

# What is LDA? --> Latent Dirichlet Allocation

There is a fantastic blog post by Thushan Ganegedara that gives us an intuitive understanding of LDA. The link to the blog post is below, and I highly recommend reading it!

https://towardsdatascience.com/light-on-math-machine-learning-intuitive-guide-to-latent-dirichlet-allocation-437c81220158

The above post explains it like this:

"Say you have a set of 1000 words (i.e. most common 1000 words found in all the documents) and you have 1000 documents. Assume that each document on average has 500 of these words appearing in each. How can you understand what category each document belongs to? One way is to connect each document to each word by a thread based on their appearance in the document. Something like below"

<img src="img/lda_pic_3.PNG" width="1000" height="1000" align="center"/>

"And then when you see that **some documents are connected to same set of words**. You know they discuss the same topic. Then you can read one of those documents and know what all these documents talk about. But to do this you don’t have enough thread. You’re going to need around 500*1000=500,000 threads for that. But we are living in 2100 and we have exhausted all the resources for manufacturing threads, so they are so expensive and you can only afford 10,000 threads. How can you solve this problem?

We can solve this problem, by introducing a latent (i.e. hidden) layer. Say we know 10 topics/themes that occur throughout the documents. But these topics are not observed, we only observe words and documents, thus topics are latent. And we want to utilise this information to cut down on the number of threads. Then what you can do is, connect the words to the topics depending on how well that word fall in that topic and then connect the topics to the documents based on what topics each document touch upon.

Now say you got each document having around 5 topics and each topic relating to 500 words. That is we need 1000*5 threads to connect documents to topics and 10*500 threads to connect topics to words, adding up to 10000.

**Note**: The topics I use here (“Animals”, “Sports”, “Tech”) are imaginary. In the real solution, you won’t have such topics but something like (0.3*Cats,0.4*Dogs,0.2*Loyal, 0.1*Evil) representing the topic “Animals”. That is, as mentioned before, each document is a distribution of words."

Thanks Thushan, that was a terrific explanation!
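To make the thread arithmetic in the quote concrete, here is a minimal sketch that simply recomputes the two totals. The variable names are my own and only the numbers come from the quoted post.

```python
# Numbers taken from the quoted explanation; names are illustrative only
n_docs = 1000
words_per_doc = 500

# Connecting every document directly to each of its words
direct_threads = n_docs * words_per_doc

# Going through a latent topic layer instead
n_topics = 10
topics_per_doc = 5
words_per_topic = 500
doc_topic_threads = n_docs * topics_per_doc      # documents -> topics
topic_word_threads = n_topics * words_per_topic  # topics -> words
latent_threads = doc_topic_threads + topic_word_threads

print(direct_threads, latent_threads)
```

The latent layer needs 50 times fewer connections, which is the whole point of the analogy.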

<img src="img/lda_pic_4.PNG" width="1000" height="1000" align="center"/>

# Another way of thinking about it

Imagine that we have a bunch of documents and we want to find the main topics. We give the LDA algorithm a set number of topics, say 10, then pre-process the data, run the algorithm, and get our word associations. LDA generates each topic as a set of words with weights (see "Creation of topics" below). The higher the weight, the more important the word is to that topic. This is known as the distribution of words within a topic. We also get the distribution of topics within the documents (see "Topics allocation to documents" on the far right below).
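The two distributions can be combined to see which words a document is likely to contain. The sketch below uses two made-up topics and one made-up document mixture. None of these numbers come from a trained model; they only illustrate how the weights multiply.

```python
# Hypothetical topics: each maps words to weights that sum to 1
toy_topics = {
    "animals": {"cat": 0.5, "dog": 0.4, "ball": 0.1},
    "sports": {"ball": 0.6, "team": 0.3, "dog": 0.1},
}

# Hypothetical document: 75% "animals", 25% "sports"
toy_doc_mix = {"animals": 0.75, "sports": 0.25}

# Probability of each word in the document:
# sum over topics of (topic share in document) * (word weight in topic)
word_probs = {}
for topic, topic_weight in toy_doc_mix.items():
    for word, word_weight in toy_topics[topic].items():
        word_probs[word] = word_probs.get(word, 0.0) + topic_weight * word_weight

for word, prob in sorted(word_probs.items(), key=lambda item: -item[1]):
    print(f"{word}: {prob:.3f}")
```

LDA works in the opposite direction: it only sees the words and has to infer both sets of weights.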

<img src="img/lda_pic_2.PNG" width="1000" height="1000" align="center"/>

# Time to build!

<img src="img/bob.PNG" width="300" height="300" align="center"/>

# Data 

The data for this lesson is the "20 Newsgroups" dataset. The link can be found here:

http://qwone.com/~jason/20Newsgroups/

We will be using a GitHub content link from the blog post tutorial later on, so don't worry about downloading anything.

# Installs

We will be using the nltk package, which has phenomenal documentation. In fact, the authors have made the book free to read at the link below:

https://www.nltk.org/book/

<img src="img/nltk_book.PNG" width="300" height="300" align="center"/>

In [1]:
# Run in python console
import nltk
#nltk.download('stopwords')

In [2]:
# Run in terminal or command prompt (install spacy, its English model, and pyLDAvis)
# !pip install spacy
# !python3 -m spacy download en
# !pip install pyLDAvis

In [3]:
# Import packages
import re
import numpy as np
import pandas as pd
from pprint import pprint

# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

# spacy for lemmatization
import spacy

# Plotting tools
import pyLDAvis
import pyLDAvis.gensim  # don't skip this (pyLDAvis >= 3.3 renamed this module to pyLDAvis.gensim_models)
import matplotlib.pyplot as plt
%matplotlib inline

# Enable logging for gensim - optional
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)

import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)

# What is LDA?

 - Each document is a collection of topics in a certain proportion
 - Each topic is a collection of keywords in a certain proportion
 - Given the number of topics, the algorithm rearranges the topic distribution within the documents and the keyword distribution within the topics until it reaches a good topic-keyword composition
 - A topic is a collection of dominant keywords

According to Machine Learning Plus, the following are key factors to a good topic:
1. The quality of text processing
2. The variety of topics the text talks about
3. The choice of topic modeling algorithm
4. The number of topics fed to the algorithm
5. The algorithm's tuning parameters

# Stopwords

In [6]:
# NLTK Stop words
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])

# Import News Groups Data

The dataset contains about 11,000 posts from 20 different newsgroups.

In [7]:
# Import Dataset
df = pd.read_json('https://raw.githubusercontent.com/selva86/datasets/master/newsgroups.json')
print(df.target_names.unique())
df.head()

['rec.autos' 'comp.sys.mac.hardware' 'rec.motorcycles' 'misc.forsale'
 'comp.os.ms-windows.misc' 'alt.atheism' 'comp.graphics'
 'rec.sport.baseball' 'rec.sport.hockey' 'sci.electronics' 'sci.space'
 'talk.politics.misc' 'sci.med' 'talk.politics.mideast'
 'soc.religion.christian' 'comp.windows.x' 'comp.sys.ibm.pc.hardware'
 'talk.politics.guns' 'talk.religion.misc' 'sci.crypt']


| | content | target | target_names |
|---|---|---|---|
| 0 | From: lerxst@wam.umd.edu (where's my thing)\nS... | 7 | rec.autos |
| 1 | From: guykuo@carson.u.washington.edu (Guy Kuo)... | 4 | comp.sys.mac.hardware |
| 10 | From: irwin@cmptrc.lonestar.org (Irwin Arnstei... | 8 | rec.motorcycles |
| 100 | From: tchen@magnus.acs.ohio-state.edu (Tsung-K... | 6 | misc.forsale |
| 1000 | From: dabl2@nlm.nih.gov (Don A.B. Lindbergh)\n... | 2 | comp.os.ms-windows.misc |


# Remove emails and newline characters

In [8]:
# Convert to list
data = df.content.values.tolist()

# Remove Emails (raw strings avoid invalid-escape warnings)
data = [re.sub(r'\S*@\S*\s?', '', sent) for sent in data]

# Remove new line characters (collapse any whitespace run into one space)
data = [re.sub(r'\s+', ' ', sent) for sent in data]

# Remove distracting single quotes
data = [re.sub("'", "", sent) for sent in data]

pprint(data[:1])

['From: (wheres my thing) Subject: WHAT car is this!? Nntp-Posting-Host: '
 'rac3.wam.umd.edu Organization: University of Maryland, College Park Lines: '
 '15 I was wondering if anyone out there could enlighten me on this car I saw '
 'the other day. It was a 2-door sports car, looked to be from the late 60s/ '
 'early 70s. It was called a Bricklin. The doors were really small. In '
 'addition, the front bumper was separate from the rest of the body. This is '
 'all I know. If anyone can tellme a model name, engine specs, years of '
 'production, where this car is made, history, or whatever info you have on '
 'this funky looking car, please e-mail. Thanks, - IL ---- brought to you by '
 'your neighborhood Lerxst ---- ']
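To see what each of the three substitutions does on its own, here is a small sketch on a made-up message. The sample text and the address in it are invented for illustration.

```python
import re

# Made-up message in the style of a newsgroup post
raw = "From: jane@example.com (Jane)\nSubject: What's this car?\n\nIt's a 2-door."

# 1. Drop anything that looks like an email address (plus one trailing space)
no_emails = re.sub(r"\S*@\S*\s?", "", raw)

# 2. Collapse newlines and other whitespace runs into a single space
one_line = re.sub(r"\s+", " ", no_emails)

# 3. Drop single quotes so that "What's" becomes "Whats"
cleaned = one_line.replace("'", "")

print(cleaned)
# From: (Jane) Subject: Whats this car? Its a 2-door.
```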


# Tokenization

Gensim to the rescue! We have already learned about tokenization when doing NLP work with sklearn. Getting the words into their most basic format is instrumental to NLP pre-processing. Gensim has some great utility functions that take some of the work out of creating these tokens. In the following cell, you will see that it lowercases the text and removes all the punctuation.

In [9]:
def sent_to_words(sentences):
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))  # tokenizing drops punctuation; deacc=True also strips accent marks

data_words = list(sent_to_words(data))

print(data_words[:1])

[['from', 'wheres', 'my', 'thing', 'subject', 'what', 'car', 'is', 'this', 'nntp', 'posting', 'host', 'rac', 'wam', 'umd', 'edu', 'organization', 'university', 'of', 'maryland', 'college', 'park', 'lines', 'was', 'wondering', 'if', 'anyone', 'out', 'there', 'could', 'enlighten', 'me', 'on', 'this', 'car', 'saw', 'the', 'other', 'day', 'it', 'was', 'door', 'sports', 'car', 'looked', 'to', 'be', 'from', 'the', 'late', 'early', 'it', 'was', 'called', 'bricklin', 'the', 'doors', 'were', 'really', 'small', 'in', 'addition', 'the', 'front', 'bumper', 'was', 'separate', 'from', 'the', 'rest', 'of', 'the', 'body', 'this', 'is', 'all', 'know', 'if', 'anyone', 'can', 'tellme', 'model', 'name', 'engine', 'specs', 'years', 'of', 'production', 'where', 'this', 'car', 'is', 'made', 'history', 'or', 'whatever', 'info', 'you', 'have', 'on', 'this', 'funky', 'looking', 'car', 'please', 'mail', 'thanks', 'il', 'brought', 'to', 'you', 'by', 'your', 'neighborhood', 'lerxst']]
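Notice that numbers and one-letter words such as "I" are gone from the output above. The following is a rough standard-library approximation of that behaviour, not gensim's actual implementation. The function name `rough_preprocess` is hypothetical, and I am assuming the default token length limits are 2 and 15 characters.

```python
import re

def rough_preprocess(text, min_len=2, max_len=15):
    """Approximate tokenizer: lowercase, keep runs of letters, filter by length."""
    # [^\W\d]+ matches runs of word characters that are not digits
    tokens = re.findall(r"[^\W\d]+", text.lower())
    return [tok for tok in tokens if min_len <= len(tok) <= max_len]

toy_tokens = rough_preprocess("I saw a 2-door car at rac3.wam.umd.edu!")
print(toy_tokens)
# ['saw', 'door', 'car', 'at', 'rac', 'wam', 'umd', 'edu']
```

This sketch skips accent removal, so stick with `simple_preprocess` for real work.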


# Creating Bigram and Trigram Models

We know that some words, when combined, create new meanings. Think of "hot" and "dog": each has its own meaning, but combined they give us "hot dog," which means something else entirely. This is an example of a bigram. A trigram extends the concept to three words, such as "New York City."

In [10]:
# Build the bigram and trigram models
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100) # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)  

# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

# See trigram example
print(trigram_mod[bigram_mod[data_words[0]]])

['from', 'wheres', 'my', 'thing', 'subject', 'what', 'car', 'is', 'this', 'nntp_posting_host', 'rac_wam_umd_edu', 'organization', 'university', 'of', 'maryland_college_park', 'lines', 'was', 'wondering', 'if', 'anyone', 'out', 'there', 'could', 'enlighten', 'me', 'on', 'this', 'car', 'saw', 'the', 'other', 'day', 'it', 'was', 'door', 'sports', 'car', 'looked', 'to', 'be', 'from', 'the', 'late', 'early', 'it', 'was', 'called', 'bricklin', 'the', 'doors', 'were', 'really', 'small', 'in', 'addition', 'the', 'front_bumper', 'was', 'separate', 'from', 'the', 'rest', 'of', 'the', 'body', 'this', 'is', 'all', 'know', 'if', 'anyone', 'can', 'tellme', 'model', 'name', 'engine', 'specs', 'years', 'of', 'production', 'where', 'this', 'car', 'is', 'made', 'history', 'or', 'whatever', 'info', 'you', 'have', 'on', 'this', 'funky', 'looking', 'car', 'please', 'mail', 'thanks', 'il', 'brought', 'to', 'you', 'by', 'your', 'neighborhood', 'lerxst']
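The idea behind phrase detection can be sketched with plain counting: find adjacent word pairs that show up often and glue them together with an underscore. This is a simplified illustration only. gensim's `Phrases` scores pairs against a threshold rather than using raw counts, and the helper `join_frequent_bigrams` below is a hypothetical name of my own.

```python
from collections import Counter

def join_frequent_bigrams(docs, min_count=2):
    """Merge adjacent word pairs that occur at least min_count times overall."""
    pair_counts = Counter(
        (first, second) for doc in docs for first, second in zip(doc, doc[1:])
    )
    frequent = {pair for pair, count in pair_counts.items() if count >= min_count}

    merged_docs = []
    for doc in docs:
        merged, i = [], 0
        while i < len(doc):
            if i + 1 < len(doc) and (doc[i], doc[i + 1]) in frequent:
                merged.append(doc[i] + "_" + doc[i + 1])
                i += 2  # skip the second word of the merged pair
            else:
                merged.append(doc[i])
                i += 1
        merged_docs.append(merged)
    return merged_docs

toy_docs = [["hot", "dog", "stand"], ["eat", "hot", "dog"], ["hot", "day"]]
toy_bigrams = join_frequent_bigrams(toy_docs, min_count=2)
print(toy_bigrams)
# [['hot_dog', 'stand'], ['eat', 'hot_dog'], ['hot', 'day']]
```

Raw counts would happily merge common pairs like "of the", which is why a scoring function with a threshold is the better tool in practice.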


# Remove Stopwords, Make Bigrams and Lemmatize

Lemmatization is about reducing a word to its lemma, or dictionary form. Think of taking the verb "swam" and bringing it back to "swim."

In [None]:
# Define functions for stopwords, bigrams, trigrams and lemmatization
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

In [None]:
# Remove Stop Words
data_words_nostops = remove_stopwords(data_words)

# Form Bigrams
data_words_bigrams = make_bigrams(data_words_nostops)

# Initialize spacy 'en' model, keeping only tagger component (for efficiency)
# python3 -m spacy download en
# Note: spaCy 3.x removed the 'en' shortcut; there, load 'en_core_web_sm' instead
nlp = spacy.load('en', disable=['parser', 'ner'])

# Do lemmatization keeping only noun, adj, vb, adv
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

print(data_lemmatized[:1])

# Create the Dictionary and Corpus needed for Topic Modeling

With LDA, we need a dictionary and a corpus. The dictionary assigns a unique ID to each word in the documents. The corpus maps each word ID in a document to the number of times it occurs in that document (its frequency).

In [None]:
# Create Dictionary
id2word = corpora.Dictionary(data_lemmatized)

# Create Corpus
texts = data_lemmatized

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]

# View
print(corpus[:1])

Wow, what the heck do all those tuples mean?!!

Each tuple is a (word id, count) pair. For example, (0, 1) means that word id 0 occurs once in the document. Similarly, (1, 2) would mean that word id 1 occurs two times. This bag-of-words representation is the input format that gensim's LDA model expects.
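If the tuples still feel abstract, the same bag-of-words structure can be built by hand in a few lines. This is a sketch of the idea only. The toy documents and the `toy_` names are made up, and the IDs gensim assigns to words may be ordered differently.

```python
from collections import Counter

toy_texts = [["car", "engine", "car"], ["engine", "model"]]

# Assign each new word the next free integer ID, in order of first appearance
toy_word2id = {}
for text in toy_texts:
    for word in text:
        if word not in toy_word2id:
            toy_word2id[word] = len(toy_word2id)
toy_id2word = {word_id: word for word, word_id in toy_word2id.items()}

def to_bow(text):
    """Return sorted (word_id, count) pairs for one tokenized document."""
    counts = Counter(toy_word2id[word] for word in text)
    return sorted(counts.items())

toy_corpus = [to_bow(text) for text in toy_texts]
print(toy_corpus)
# [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Here "car" got ID 0 and appears twice in the first document, hence the tuple (0, 2).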

In [None]:
# What is the first word?
id2word[0]

In [None]:
# Human readable format of corpus (term-frequency)
[[(id2word[id], freq) for id, freq in cp] for cp in corpus[:1]]

# Building the Topic Model

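Hyperparameters
 - corpus: the document collection, in bag-of-words format
 - id2word: mapping from word IDs to words
 - num_topics: number of topics to extract
 - random_state: seed for reproducibility
 - update_every: how often the model parameters are updated
 - chunksize: how many documents are passed in per training chunk
 - passes: number of full passes over the corpus during training
 - alpha: prior on document-topic density ('auto' learns it from the data)
 - eta (often called beta): prior on topic-word density; not set below, so the default is used
 - per_word_topics: also compute the most likely topics for each word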

In [None]:
# Build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=20, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)

# Let's take a look at the results

In [None]:
# Print the keywords for each of the 20 topics
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]
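The `doc_lda` object holds the topic mixture for each document. A common next step is picking the dominant topic for a document, sketched below on made-up values. As far as I understand gensim's behaviour with `per_word_topics=True`, indexing the model with a single document returns a tuple whose first element is a list of (topic id, probability) pairs, so treat that shape as an assumption.

```python
# Hypothetical (topic_id, probability) pairs for one document
toy_doc_topics = [(3, 0.12), (7, 0.61), (15, 0.27)]

# The dominant topic is the pair with the highest probability
dominant_topic, dominant_prob = max(toy_doc_topics, key=lambda pair: pair[1])

print(f"Dominant topic: {dominant_topic} ({dominant_prob:.0%} of the document)")
# Dominant topic: 7 (61% of the document)
```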

# Compute Model Perplexity and Coherence Score

Here is a fantastic blog post that details how to measure LDA models: https://towardsdatascience.com/evaluate-topic-model-in-python-latent-dirichlet-allocation-lda-7d57484bb5d0. You can find more about the math behind perplexity and coherence scores there.

Aka, how well did our unsupervised model do?

_Perplexity:_ How surprised the model is by data it has not seen before. Lower perplexity means a better fit.

_Topic Coherence:_ Take the high-scoring words in each topic and measure how similar they are to each other. Higher is better.

In [None]:
# Compute Perplexity
print('\nPerplexity: ', lda_model.log_perplexity(corpus))  # per-word likelihood bound; perplexity is 2**(-bound), so a higher (less negative) bound is better

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)
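One thing that can be confusing: as far as I understand gensim's API, `log_perplexity` returns a per-word likelihood bound (a negative number) rather than the perplexity itself, and gensim's own log messages report perplexity as 2 raised to the negative of that bound. The sketch below does the conversion on a made-up bound value. It is not output from this model.

```python
# Hypothetical per-word bound, standing in for lda_model.log_perplexity(corpus)
per_word_bound = -8.0

# Convert the bound to perplexity; a lower perplexity is better
perplexity = 2 ** (-per_word_bound)

print(perplexity)
# 256.0
```

So a bound closer to zero corresponds to a lower, better perplexity.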

# Hold my beer, here comes the cool part

We will be using a really cool package called pyLDAvis. Seriously, this thing is wicked!

In [None]:
# Visualize the topics
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)
vis