# A Brief Intro to Natural Language Processing & Topic Modeling

#### AAWG Dev Day 6/14/2019 

-----------

Natural language processing (NLP) and latent Dirichlet allocation (LDA) are techniques that help you organize, summarize, and understand large amounts of text data. 

We're only going to scratch the surface with a simplified view of both concepts, but this has all of the major steps in a topic modeling workflow. There are also some common problems embedded in the workflow below, which we can recognize and fix along the way. 

First we need the normal data science packages `numpy` and `pandas`. 

In [None]:
import pandas as pd
import numpy as np 

Next we can use a big text dataset of dubious provenance (`fetch_20newsgroups`). The best description is in the [`sklearn` source code](https://github.com/scikit-learn/scikit-learn/blob/7813f7efb/sklearn/datasets/twenty_newsgroups.py).

We begin by splitting into train and test sets, just to illustrate that `sklearn` has already given us a clean way to do so! 

In [None]:
from sklearn.datasets import fetch_20newsgroups
news_train = fetch_20newsgroups(subset='train', shuffle = True)
# news_test = fetch_20newsgroups(subset='test', shuffle = True)

Check out the topics available to us in this dataset. It makes a really nice test bed for tuning topic model parameters: some topics are clearly distinct (autos vs. space, for example) and others are quite similar (hockey vs. baseball). 

In [None]:
for name in news_train.target_names: print(name)

Each "news" entry is a newsgroup post, formatted like an email, with lots of weird characters and issues. Perfect! The next step is to clean it up using a few very standard techniques. 

In [None]:
news_train.data[0]

In [None]:
categories = ['rec.sport.baseball', 'comp.graphics']
news_train = fetch_20newsgroups(subset='train', 
                                  categories=categories, 
                                  shuffle=True, 
                                  random_state=42)

First make a nice copy of the data. We can come back to this step every time something goes wrong. 

In [None]:
import unicodedata 
import sys 
text = news_train.data[:1000]
text[0]

### Clean punctuation

All of the symbols and punctuation can go. For our purposes, they are not very helpful. 

**What else is happening in here? It is also crucial for NLP.**

In [None]:
## Dictionary of all punctuation
punctuation = dict.fromkeys(i for i in range(sys.maxunicode)
                        if unicodedata.category(chr(i)).startswith('P'))

## now we can remove punctuation. 
text = [string.translate(punctuation).lower() for string in text]
# text = text.lower()
text[0]

In [None]:
set(text[0])
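To see what the translation table does on a small input, here is a minimal sketch. The sample string is made up for illustration and is not taken from the dataset. `dict.fromkeys` maps every punctuation code point to `None`, and `str.translate` deletes any character mapped to `None`.

```python
import sys
import unicodedata

# Map every Unicode punctuation code point (category starting with 'P') to None.
punctuation = dict.fromkeys(
    i for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)).startswith('P')
)

# Hypothetical sample string, not taken from the dataset.
sample = "Re: Hello, World! (cost: $5 + tax) <user@example.com>"

# Delete punctuation, then lowercase, as in the cell above.
cleaned = sample.translate(punctuation).lower()
print(cleaned)
```

Running this shows that `$`, `+`, `<`, and `>` survive, because Unicode classifies them as symbols (categories starting with `S`) rather than punctuation.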

### Tokenize & Remove Stopwords 

Split up the content using white spaces. Then remove all of the words that just get in the way. 

**Which stopwords are we removing below?**

In [None]:
import nltk 

## You'll need to download these resources once before using them below. 
## (Newer NLTK releases may ask for 'punkt_tab' instead of 'punkt'.)
# nltk.download('punkt')
# nltk.download('stopwords')

from nltk.tokenize import word_tokenize 
from nltk.corpus import stopwords

In [None]:
tok = [word_tokenize(t) for t in text]
tok[0][:20]

In [None]:
stop_words = stopwords.words('english')
stop_words[:20]

In [None]:
stp = [[word for word in tok_i if word not in stop_words] for tok_i in tok]

And now the same entry without any stopwords. 

In [None]:
stp[0][:20]
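The filtering step can be sketched without any downloads. This is a minimal illustration under stated assumptions: the tiny stopword set and the sample sentence are hypothetical, and a plain whitespace split stands in for `word_tokenize`.

```python
# Hypothetical mini stopword set, standing in for stopwords.words('english').
stop_words = {"i", "the", "a", "is", "to", "and", "don't"}

# Hypothetical text that has already been lowercased and stripped of punctuation.
text = "i dont know the answer to the question"

# A whitespace split stands in for word_tokenize here.
tokens = text.split()

# Keep only the tokens that are not stopwords.
kept = [word for word in tokens if word not in stop_words]
print(kept)
```

Two things are worth noticing. A `set` makes each membership check fast, which helps on large corpora. And because the apostrophe was removed earlier, `dont` no longer matches the stopword `don't`, so it slips through the filter.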

### Stemming

'Cook', 'Cooked', 'Cooking', 'Cooker' all have low statistical power separately, but share a root (or stem) meaning, and thus should be considered together for a better model. 

Stemming is a sometimes off-putting way of getting closer to the root meaning of a word: we simply chop off the last few letters. It gets more elegant than that, but we're really just pruning most of our words to force them to group better. 

In [None]:
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()

In [None]:
stem = [[porter.stem(word) for word in entry] for entry in stp]
stem[0][:20]
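To make the "chop off the last few letters" idea concrete, here is a toy suffix stripper. It is not the Porter algorithm, which applies ordered rules with conditions on the shape of the remaining stem; the function name `toy_stem` and its suffix list are made up for illustration.

```python
def toy_stem(word, suffixes=("ing", "ed", "er", "s")):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# The word family from the text above, lowercased.
words = ["cook", "cooked", "cooking", "cooker", "cooks"]
stems = [toy_stem(w) for w in words]
print(stems)

# Crude rules also over-trim unrelated words.
print(toy_stem("butter"))
```

All five forms collapse to `cook`, which is the grouping we want. The last line shows the off-putting side: `butter` becomes `butt`.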

Keep a clean copy as a backup. 

In [None]:
processed_text = stem
# processed_text

### LDA Modeling 

And now to modeling! It's pretty simple, actually, but it takes a minute to run. We'll use the popular `gensim` package, but there are certainly other options. 

In [None]:
from gensim import corpora, models
from gensim.models.coherencemodel import CoherenceModel
import re
import warnings

Sanity check: how big is our dataset?

In [None]:
len(processed_text)

Create a dictionary that maps each word in our dataset to an integer id, then convert each entry into a bag of words: a list of (word id, count) pairs. 

In [None]:
stem_dictionary = corpora.Dictionary(processed_text)
stem_corpus = [stem_dictionary.doc2bow(stem) for stem in processed_text]
stem_corpus[0][:20]
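The idea behind the dictionary and bag-of-words steps can be sketched in plain Python. This is an illustration only, not `gensim`'s implementation: the documents are hypothetical, and the ids `gensim` assigns may differ from the ones here.

```python
from collections import Counter

# Hypothetical pre-processed documents (lists of stemmed tokens).
docs = [["cook", "bake", "cook"], ["bake", "pitch", "pitch", "pitch"]]

# Assign each distinct token an integer id in order of first appearance.
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return sorted (token_id, count) pairs for one document."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

bows = [to_bow(doc) for doc in docs]
print(token2id)
print(bows)
```

Each document is reduced to word counts, so word order is discarded. LDA works from these counts alone.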

Finally, build an LDA model and harvest some evaluation information. (Coherence is an important metric!) 

In [None]:
with warnings.catch_warnings():
  warnings.simplefilter("ignore")

  stem_model = models.ldamodel.LdaModel(corpus=stem_corpus, 
                                        id2word=stem_dictionary, 
                                        num_topics=2, 
                                        passes=10, 
                                        random_state = 1)
  stem_cm = CoherenceModel(model=stem_model, 
                           texts=processed_text, 
                           dictionary=stem_dictionary, 
                           coherence='c_v')
  stem_coherence = stem_cm.get_coherence()
stem_coherence

Check out this fabulous visualization for assessing and exploring your model results! 

In [None]:
# import sys
# !{sys.executable} -m pip install pyLDAvis

import pyLDAvis
import pyLDAvis.gensim as gensimvis
## In newer pyLDAvis releases this module may be named `pyLDAvis.gensim_models`.

In [None]:
## This visualization might not be supported yet in Colab. 
## Try it in a jupyter notebook! 
vis_data = gensimvis.prepare(stem_model, stem_corpus, stem_dictionary)

In [None]:
pyLDAvis.display(vis_data)

### Followup challenges: 

* Remove the remaining special characters 
* Package up the NLP workflow into functions for efficient iteration 
* Use the `test` set to evaluate model fit. We only built a model with the training set -- we didn't actually evaluate it! 
* Test implementing lemmatization, and compare the results to those of stemming used in this example 
* Use other subsets of the data (or the full set) to discover patterns of similarity among topics 
* How many clusters should we have used? Iterate through model parameters and plot the coherence scores to detect the optimal model fit
