# A Brief Intro to Natural Language Processing & Topic Modeling

#### AAWG Dev Day 6/14/2019 

-----------

We're only going to scratch the surface with a simplified view of both concepts (NLP and LDA), but this has all of the major steps in a topic modeling workflow. 

First we need the normal data science packages `numpy` and `pandas`. 

In [1]:
import pandas as pd
import numpy as np 

Next we can use a big text dataset of dubious provenance. The best description is in the [`sklearn` source code.](https://github.com/scikit-learn/scikit-learn/blob/7813f7efb/sklearn/datasets/twenty_newsgroups.py)

We begin by splitting into train and test sets, just to illustrate that `sklearn` has already given us a clean way to do so! 

In [2]:
from sklearn.datasets import fetch_20newsgroups
news_train = fetch_20newsgroups(subset='train', shuffle = True)
news_test = fetch_20newsgroups(subset='test', shuffle = True)

Check out the topics available to us in this dataset. They give us a really nice way to pick topics that are clearly distinct (autos vs. space, for example) or quite similar (hockey vs. baseball) when tuning topic model parameters. We'll just take them all for now. 

In [3]:
for name in news_train.target_names: print(name)

alt.atheism
comp.graphics
comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
comp.windows.x
misc.forsale
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey
sci.crypt
sci.electronics
sci.med
sci.space
soc.religion.christian
talk.politics.guns
talk.politics.mideast
talk.politics.misc
talk.religion.misc


Each "news" entry is a newsgroup post, formatted much like an email, with lots of weird characters and issues. Perfect! The next step is to clean it up using a few very standard techniques. 

In [4]:
news_train.data[1]

"From: guykuo@carson.u.washington.edu (Guy Kuo)\nSubject: SI Clock Poll - Final Call\nSummary: Final call for SI clock reports\nKeywords: SI,acceleration,clock,upgrade\nArticle-I.D.: shelley.1qvfo9INNc3s\nOrganization: University of Washington\nLines: 11\nNNTP-Posting-Host: carson.u.washington.edu\n\nA fair number of brave souls who upgraded their SI clock oscillator have\nshared their experiences for this poll. Please send a brief message detailing\nyour experiences with the procedure. Top speed attained, CPU rated speed,\nadd on cards and adapters, heat sinks, hour of usage per day, floppy disk\nfunctionality with 800 and 1.4 m floppies are especially requested.\n\nI will be summarizing in the next two days, so please add to the network\nknowledge base if you have done the clock upgrade and haven't answered this\npoll. Thanks.\n\nGuy Kuo <guykuo@u.washington.edu>\n"

First make a working copy of the data (here, just the first 1,000 entries). We can come back to this step every time something goes wrong. 

In [5]:
import unicodedata 
import sys 
text = news_train.data[:1000]
text[1]

"From: guykuo@carson.u.washington.edu (Guy Kuo)\nSubject: SI Clock Poll - Final Call\nSummary: Final call for SI clock reports\nKeywords: SI,acceleration,clock,upgrade\nArticle-I.D.: shelley.1qvfo9INNc3s\nOrganization: University of Washington\nLines: 11\nNNTP-Posting-Host: carson.u.washington.edu\n\nA fair number of brave souls who upgraded their SI clock oscillator have\nshared their experiences for this poll. Please send a brief message detailing\nyour experiences with the procedure. Top speed attained, CPU rated speed,\nadd on cards and adapters, heat sinks, hour of usage per day, floppy disk\nfunctionality with 800 and 1.4 m floppies are especially requested.\n\nI will be summarizing in the next two days, so please add to the network\nknowledge base if you have done the clock upgrade and haven't answered this\npoll. Thanks.\n\nGuy Kuo <guykuo@u.washington.edu>\n"

### Clean punctuation

All of the punctuation can go; for our purposes it is not very helpful. Symbols that Unicode does not class as punctuation, such as `<` and `>`, survive this step. 

**What else is happening in here? It is also crucial for NLP.**

In [7]:
## Dictionary of all punctuation
punctuation = dict.fromkeys(i for i in range(sys.maxunicode)
                        if unicodedata.category(chr(i)).startswith('P'))

## now we can remove punctuation. What else is happening here? 
text = [string.translate(punctuation).lower() for string in text]
# text = text.lower()
text[1]

'from guykuocarsonuwashingtonedu guy kuo\nsubject si clock poll  final call\nsummary final call for si clock reports\nkeywords siaccelerationclockupgrade\narticleid shelley1qvfo9innc3s\norganization university of washington\nlines 11\nnntppostinghost carsonuwashingtonedu\n\na fair number of brave souls who upgraded their si clock oscillator have\nshared their experiences for this poll please send a brief message detailing\nyour experiences with the procedure top speed attained cpu rated speed\nadd on cards and adapters heat sinks hour of usage per day floppy disk\nfunctionality with 800 and 14 m floppies are especially requested\n\ni will be summarizing in the next two days so please add to the network\nknowledge base if you have done the clock upgrade and havent answered this\npoll thanks\n\nguy kuo <guykuouwashingtonedu>\n'
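As a smaller illustration of the cell above, the sketch below applies the same `str.translate` idea to one short, made-up string. The sample string and the names `punct_table` and `sample` are assumptions for illustration only. It also suggests why `<` and `>` survive the cleaning step: Unicode files them under math symbols (category `Sm`) rather than punctuation (`P*`).

```python
import sys
import unicodedata

# Map every code point in a Unicode punctuation category (P*) to None,
# which tells str.translate to delete it.
punct_table = dict.fromkeys(
    i for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)).startswith('P')
)

# Hypothetical sample string, not taken from the dataset.
sample = "Hello, World! <test@example.com> haven't"
cleaned = sample.translate(punct_table).lower()
print(cleaned)

# '<' is a math symbol (Sm) and ',' is punctuation (Po),
# so the table above removes only the comma.
print(unicodedata.category('<'), unicodedata.category(','))
```

Note that lowercasing is a separate normalization step from punctuation removal; it simply happens to share a line with it here.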

### Tokenize & Remove Stopwords 

Split up the content into tokens, essentially on white space now that the punctuation is gone. Then remove all of the words that just get in the way. 

**Which stopwords are we removing below?**

In [0]:
import nltk 
from nltk.tokenize import word_tokenize 
from nltk.corpus import stopwords

## You'll need to download them once before using below. 
# nltk.download('punkt')
# nltk.download('stopwords')

In [129]:
tok = [word_tokenize(t) for t in text]
tok[0][:20]

['from',
 'lerxstwamumdedu',
 'wheres',
 'my',
 'thing',
 'subject',
 'what',
 'car',
 'is',
 'this',
 'nntppostinghost',
 'rac3wamumdedu',
 'organization',
 'university',
 'of',
 'maryland',
 'college',
 'park',
 'lines',
 '15']

In [130]:
stop_words = stopwords.words('english')
stop_words[:20]

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his']
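One way to approach the question above is to check which stop words can still match once punctuation is gone. The sketch below uses a small hand-copied subset of the list printed above rather than the full NLTK list, so it runs without any downloads. The names `stop_subset` and `tokens`, and the example tokens themselves, are hypothetical.

```python
# A hand-copied subset of the NLTK English stop words printed above.
stop_subset = ['i', 'me', 'my', 'you', "you're", "you've", 'your']

# Hypothetical tokens as they would look after punctuation removal:
# "you're" has already become "youre".
tokens = ['youre', 'sure', 'my', 'car', 'youve', 'seen']

# Keep only the tokens that are not in the stop word subset.
kept = [word for word in tokens if word not in stop_subset]
print(kept)

# Stop words containing an apostrophe can never match cleaned tokens.
unmatched = [word for word in stop_subset if "'" in word]
print(unmatched)
```

If those contractions matter for a given corpus, one option is to strip punctuation from the stop word list as well, so that both sides are cleaned the same way.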

In [0]:
## tok is a list of token lists, so filter the words inside each entry
stp = [[word for word in entry if word not in stop_words] for entry in tok]

And now the same entry without any stopwords. Six of its first 20 tokens were stopwords, so 14 of them remain. 

In [101]:
stp[0][:14]

['lerxstwamumdedu',
 'wheres',
 'thing',
 'subject',
 'car',
 'nntppostinghost',
 'rac3wamumdedu',
 'organization',
 'university',
 'maryland',
 'college',
 'park',
 'lines',
 '15']

### Stemming

'Cook', 'Cooked', 'Cooking', 'Cooker' all have low statistical power separately, but share a root (or stem) meaning, and thus should be considered together for a better model. 

Stemming is a blunt, sometimes off-putting way of getting closer to the root meaning of a word: we simply chop off the last few letters. It gets more elegant than that, but we're really just pruning most of our words to force them to group better. 

In [8]:
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()

In [133]:
stem = [[porter.stem(word) for word in entry] for entry in stp]
stem[0][:14]

['lerxstwamumdedu',
 'where',
 'thing',
 'subject',
 'car',
 'nntppostinghost',
 'rac3wamumdedu',
 'organ',
 'univers',
 'maryland',
 'colleg',
 'park',
 'line',
 '15']

Keep a clean copy as a backup. 

In [0]:
processed_text = stem
# processed_text
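
To make the suffix-chopping idea from the stemming step concrete, here is a deliberately naive sketch. It is not the Porter algorithm that `PorterStemmer` implements; the function `naive_stem` and its suffix list are made up for illustration, and real stemmers apply many more rules.

```python
def naive_stem(word, suffixes=('ing', 'ed', 'er', 's')):
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# The four example words from the stemming section.
words = ['Cook', 'Cooked', 'Cooking', 'Cooker']
stems = [naive_stem(w) for w in words]
print(stems)
```

All four words collapse to the same stem, which is the grouping effect we want. A rule this crude would also mangle plenty of ordinary words, which is one reason to reach for an established stemmer instead.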

### LDA Modeling 

And now to modeling! It is pretty simple, actually, but it takes a minute to run. We'll use the popular `gensim` package, but there are certainly other options. 

In [0]:
from gensim import corpora, models
from gensim.models.coherencemodel import CoherenceModel
import re
import warnings

Sanity check: how big is our dataset?

In [138]:
len(processed_text)

1000

Create a dictionary that maps each word in our dataset to an integer id, then convert every entry into a bag of words: a list of (word id, count) pairs. 

In [9]:
stem_dictionary = corpora.Dictionary(processed_text)
stem_corpus = [stem_dictionary.doc2bow(stem) for stem in processed_text]
stem_corpus[0][:20]
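
Conceptually, the dictionary assigns each distinct token an integer id, and the bag-of-words step counts how often each id occurs in an entry. The sketch below imitates that idea in plain Python on two made-up documents. It is not how `gensim` implements `Dictionary` or `doc2bow`, and the names `docs`, `token2id`, and `to_bow` are hypothetical.

```python
from collections import Counter

# Two tiny, made-up documents that are already tokenized and stemmed.
docs = [['car', 'engin', 'car'], ['space', 'engin']]

# Assign integer ids in order of first appearance.
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return sorted (token id, count) pairs for one document."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

bows = [to_bow(doc) for doc in docs]
print(token2id)
print(bows)
```

Word order is thrown away in this representation; only the counts per entry remain.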


Finally, build an LDA model and compute its coherence score, an important metric for judging how interpretable the topics are. 

In [0]:
with warnings.catch_warnings():
  warnings.simplefilter("ignore")

  stem_model = models.ldamodel.LdaModel(corpus=stem_corpus, 
                                        id2word=stem_dictionary, 
                                        num_topics=2, 
                                        passes=10, 
                                        random_state = 1)
  stem_cm = CoherenceModel(model=stem_model, 
                           texts=processed_text, 
                           dictionary=stem_dictionary, 
                           coherence='c_v')
  stem_coherence = stem_cm.get_coherence()
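
For intuition about what the model above is fitting, LDA assumes each document is a mixture of topics and each topic is a distribution over words. The sketch below runs that assumed generative story forward on a toy vocabulary using only the standard library. It is not what `gensim` does during training, which works in the opposite direction and infers the mixtures from the text. The vocabulary, the helper `dirichlet`, and the parameter values are all made up for illustration.

```python
import random

random.seed(1)

def dirichlet(alpha, k):
    """Draw one sample from a symmetric Dirichlet via normalized gammas."""
    draws = [random.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

vocab = ['car', 'engin', 'space', 'orbit']  # hypothetical toy vocabulary
num_topics = 2

# Each topic is a probability distribution over the vocabulary.
topic_word = [dirichlet(1.0, len(vocab)) for _ in range(num_topics)]

# Each document has its own mixture over topics.
doc_topic = dirichlet(1.0, num_topics)

# Generate an 8-word document: pick a topic, then a word from that topic.
doc = []
for _ in range(8):
    z = random.choices(range(num_topics), weights=doc_topic)[0]
    doc.append(random.choices(vocab, weights=topic_word[z])[0])

print(doc_topic)
print(doc)
```

Fitting the model amounts to asking which topic mixtures and word distributions would make the observed entries most plausible under this story.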

Check out this fabulous visualization for assessing and exploring your model results! 

In [0]:
# import sys
# !{sys.executable} -m pip install pyLDAvis

import pyLDAvis
## In pyLDAvis 3.3 and later this module is named pyLDAvis.gensim_models
import pyLDAvis.gensim as gensimvis

In [122]:
vis_data = gensimvis.prepare(stem_model, stem_corpus, stem_dictionary)



In [123]:
pyLDAvis.display(vis_data)