# A Brief Intro to Natural Language Processing & Topic Modeling

#### AAWG Dev Day 6/14/2019 

-----------

Natural language processing (NLP) and latent Dirichlet allocation (LDA), a popular topic modeling technique, help you organize, summarize, and understand large amounts of text data. 

We're only going to scratch the surface with a simplified view of both concepts, but this has all of the major steps in a topic modeling workflow. There are also some common problems embedded in the workflow below, which we can recognize and fix along the way. 

For this exercise, we can use a big text dataset of dubious provenance (`fetch_20newsgroups`). The best description is in the [`sklearn` source code.](https://github.com/scikit-learn/scikit-learn/blob/7813f7efb/sklearn/datasets/twenty_newsgroups.py)

We begin by splitting into train and test sets, just to illustrate that `sklearn` has already given us a clean way to do so! 

In [1]:
from sklearn.datasets import fetch_20newsgroups
news_train = fetch_20newsgroups(subset='train', shuffle = True)
# news_test = fetch_20newsgroups(subset='test', shuffle = True)

Check out the topics available to us in this dataset. It makes a really nice test bed: some topics are clearly distinct (autos vs. space, for example) and others are quite similar (hockey vs. baseball), which helps when tuning topic model parameters. 

In [2]:
for name in news_train.target_names: print(name)

alt.atheism
comp.graphics
comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
comp.windows.x
misc.forsale
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey
sci.crypt
sci.electronics
sci.med
sci.space
soc.religion.christian
talk.politics.guns
talk.politics.mideast
talk.politics.misc
talk.religion.misc


Each "news" entry is an email-style post with lots of weird characters and issues. Perfect! The next step is to clean it up using a few very standard techniques. 

In [3]:
news_train.data[0]

"From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n"
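
Before cleaning, it can help to see how a post is laid out. The sketch below is not part of the original workflow: it assumes the header block ends at the first blank line (as it does in the entry above), and `split_post` and the shortened sample post are made up for illustration. `fetch_20newsgroups` also accepts a `remove` argument (for example `remove=('headers', 'footers', 'quotes')`) if you would rather have `sklearn` strip this metadata for you.

```python
def split_post(raw_post):
    """Split a raw newsgroup post into (headers, body) at the first blank line."""
    head, _, body = raw_post.partition("\n\n")
    headers = {}
    for line in head.split("\n"):
        key, sep, value = line.partition(": ")
        if sep:  # skip lines that are not "Key: value" pairs
            headers[key] = value
    return headers, body.strip()

# A shortened, made-up post with the same shape as news_train.data[0]
sample = (
    "From: someone@example.com (A Poster)\n"
    "Subject: WHAT car is this!?\n"
    "Lines: 2\n"
    "\n"
    "I saw a 2-door sports car.\n"
    "Any ideas?\n"
)
headers, body = split_post(sample)
print(headers["Subject"])
print(body)
```

Header fields such as `Organization` and `Lines` would otherwise end up as ordinary words in the topic model, so it is worth deciding early whether to keep them.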

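Narrow the training data down to two categories, baseball and computer graphics, for the rest of the exercise. 

In [4]: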
categories = ['rec.sport.baseball', 'comp.graphics']
news_train = fetch_20newsgroups(subset='train', 
                                  categories=categories, 
                                  shuffle=True, 
                                  random_state=42)

First, make a working copy of the data (the first 1,000 posts). We can come back to this step every time something goes wrong. 

In [60]:
import unicodedata 
import sys 
import re
text_train = news_train.data[:1000]
text_train[0]

"From: ger@cv.ruu.nl (Ger Timmens)\nSubject: Re: Postscript drawing prog\nNntp-Posting-Host: triton.cv.ruu.nl\nOrganization: University of Utrecht, 3D Computer Vision Research Group\nLines: 30\n\nIn <0010580B.vma7o9@diablo.UUCP> diablo.UUCP!cboesel (Charles Boesel) writes:\n\n\n>In article <1993Apr19.171704.2147@Informatik.TU-Muenchen.DE> (comp.graphics.gnuplot,comp.graphics), rdd@uts.ipp-garching.mpg.de (Reinhard Drube) writes:\n>>In article <C5ECnn.7qo@mentor.cc.purdue.edu>, nish@cv4.chem.purdue.edu (Nishantha I.) writes:\n>>|> \tCould somebody let me know of a drawing utility that can be\n>>|> used to manipulate postscript files.I am specifically interested in\n>>|> drawing lines, boxes and the sort on Postscript contour plots.\n>>|> \tI have tried xfig and I am impressed by it's features. However\n>>|> it is of no use since I cannot use postscript files as input for the\n>>|> programme.Is there a utility that converts postscript to xfig format?\n>>|> \tAny help would be greatly app

### Clean punctuation

All of the symbols and punctuation can go. For our purposes, they are not very helpful. 

**What else is happening in here? It is also crucial for NLP.**

In [63]:
## Dictionary of all punctuation
punctuation = dict.fromkeys(i for i in range(sys.maxunicode)
                        if unicodedata.category(chr(i)).startswith('P'))

## now we can remove punctuation. 
text = [re.sub("[^a-z0-9 ]+", "", string.translate(punctuation).lower()) for string in text_train]

# text = text.lower()
text[0]

'from gercvruunl ger timmenssubject re postscript drawing prognntppostinghost tritoncvruunlorganization university of utrecht 3d computer vision research grouplines 30in 0010580bvma7o9diablouucp diablouucpcboesel charles boesel writesin article 1993apr191717042147informatiktumuenchende compgraphicsgnuplotcompgraphics rddutsippgarchingmpgde reinhard drube writesin article c5ecnn7qomentorccpurdueedu nishcv4chempurdueedu nishantha i writes could somebody let me know of a drawing utility that can be used to manipulate postscript filesi am specifically interested in drawing lines boxes and the sort on postscript contour plots i have tried xfig and i am impressed by its features however it is of no use since i cannot use postscript files as input for the programmeis there a utility that converts postscript to xfig format any help would be greatly appreciated nishanthahave you checked out adobe illustrator there are a few unix versionsfor it available depending on your platform i know of two 
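
Notice the merged words in the output above (`timmenssubject`, `grouplines`): newlines and punctuation are deleted outright, so neighbouring words fuse together. One possible fix, sketched here as an assumption rather than as the notebook's method, is to replace each run of unwanted characters with a single space. The helper name `clean_text` and the sample string are hypothetical.

```python
import re

def clean_text(raw):
    """Lowercase, turn every run of non-alphanumeric characters into one space, and trim."""
    lowered = raw.lower()
    # Replacing (rather than deleting) separators keeps neighbouring words apart
    spaced = re.sub(r"[^a-z0-9]+", " ", lowered)
    return spaced.strip()

sample = "From: ger@cv.ruu.nl (Ger Timmens)\nSubject: Re: Postscript drawing prog\nLines: 30"
print(clean_text(sample))
# from ger cv ruu nl ger timmens subject re postscript drawing prog lines 30
```

The trade-off is that email addresses and hostnames are broken into short fragments (`ger cv ruu nl`) instead of one long token, so you may still want to drop the headers or filter very short tokens.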

In [64]:
set(list(text[0]))

{' ',
 '0',
 '1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9',
 'a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'j',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'q',
 'r',
 's',
 't',
 'u',
 'v',
 'w',
 'x',
 'y',
 'z'}

### Tokenize & Remove Stopwords 

Split up the content on white space. Then remove all of the words that just get in the way. 

**Which stopwords are we removing below?**

In [65]:
import nltk 

## You'll need to download them once before using below. 
# nltk.download('punkt')
# nltk.download('stopwords')

from nltk.tokenize import word_tokenize 
from nltk.corpus import stopwords

In [66]:
tok = [word_tokenize(t) for t in text]
tok[0][:20]

['from',
 'gercvruunl',
 'ger',
 'timmenssubject',
 're',
 'postscript',
 'drawing',
 'prognntppostinghost',
 'tritoncvruunlorganization',
 'university',
 'of',
 'utrecht',
 '3d',
 'computer',
 'vision',
 'research',
 'grouplines',
 '30in',
 '0010580bvma7o9diablouucp',
 'diablouucpcboesel']

In [67]:
stop_words = stopwords.words('english')
stop_words[:20]

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his']

In [69]:
stp = [[word for word in tok_i if word not in stop_words] for tok_i in tok]

In [70]:
stp

[['gercvruunl',
  'ger',
  'timmenssubject',
  'postscript',
  'drawing',
  'prognntppostinghost',
  'tritoncvruunlorganization',
  'university',
  'utrecht',
  '3d',
  'computer',
  'vision',
  'research',
  'grouplines',
  '30in',
  '0010580bvma7o9diablouucp',
  'diablouucpcboesel',
  'charles',
  'boesel',
  'writesin',
  'article',
  '1993apr191717042147informatiktumuenchende',
  'compgraphicsgnuplotcompgraphics',
  'rddutsippgarchingmpgde',
  'reinhard',
  'drube',
  'writesin',
  'article',
  'c5ecnn7qomentorccpurdueedu',
  'nishcv4chempurdueedu',
  'nishantha',
  'writes',
  'could',
  'somebody',
  'let',
  'know',
  'drawing',
  'utility',
  'used',
  'manipulate',
  'postscript',
  'filesi',
  'specifically',
  'interested',
  'drawing',
  'lines',
  'boxes',
  'sort',
  'postscript',
  'contour',
  'plots',
  'tried',
  'xfig',
  'impressed',
  'features',
  'however',
  'use',
  'since',
  'use',
  'postscript',
  'files',
  'input',
  'programmeis',
  'utility',
  'convert

In [71]:
char_set = set()  # start empty; seeding with {0} leaves a stray integer in the result
for s in stp: 
    char_set.update(''.join(s))  # add every character used in this entry
char_set
# set(list(''.join(stp[0])))

{'0',
 '1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9',
 'a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'j',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'q',
 'r',
 's',
 't',
 'u',
 'v',
 'w',
 'x',
 'y',
 'z'}

And now the same entry without any stopwords. 

In [72]:
stp[0][:20]

['gercvruunl',
 'ger',
 'timmenssubject',
 'postscript',
 'drawing',
 'prognntppostinghost',
 'tritoncvruunlorganization',
 'university',
 'utrecht',
 '3d',
 'computer',
 'vision',
 'research',
 'grouplines',
 '30in',
 '0010580bvma7o9diablouucp',
 'diablouucpcboesel',
 'charles',
 'boesel',
 'writesin']
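
The filtering step can be sketched on a toy sentence without downloading anything. The stop list below is a tiny hand-picked set for illustration only (the `nltk` English list is much longer), and `remove_stopwords` is a hypothetical helper name. Using a `set` rather than a list makes each membership check fast; wrapping the `nltk` list as `set(stop_words)` would give the same benefit.

```python
# A tiny hand-picked stop list for illustration; nltk's English list is much longer
STOP_WORDS = {"i", "am", "of", "a", "the", "to", "that", "can", "be", "me"}

def remove_stopwords(text, stop_words=STOP_WORDS):
    """Split on whitespace and drop any token found in stop_words."""
    return [tok for tok in text.split() if tok not in stop_words]

sentence = "could somebody let me know of a drawing utility that can be used"
print(remove_stopwords(sentence))
# ['could', 'somebody', 'let', 'know', 'drawing', 'utility', 'used']
```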

### Stemming

'Cook', 'Cooked', 'Cooking', 'Cooker' all have low statistical power separately, but share a root (or stem) meaning, and thus should be considered together for a better model. 

Stemming is a sometimes off-putting way of getting closer to the root meaning of a word: we simply chop off the last few letters. It gets more elegant than that, but we're really just pruning most of our words to force them to group better. 
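
To make "chop off the last few letters" concrete, here is a deliberately naive suffix stripper. It is a toy, not the Porter algorithm: the suffix list, the minimum stem length, and the name `naive_stem` are all assumptions made for illustration. Porter applies a longer, ordered set of rules with extra conditions, so its output can differ from this sketch.

```python
SUFFIXES = ("ing", "ed", "er", "s")  # hypothetical, deliberately tiny rule set

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, as long as enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

words = ["cook", "cooked", "cooking", "cooker", "cooks"]
print([naive_stem(w) for w in words])
# ['cook', 'cook', 'cook', 'cook', 'cook']
```

The minimum stem length keeps short words such as `red` from being cut down to a single letter, which hints at why real stemmers need more careful rules.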

In [73]:
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()

In [74]:
stem = [[porter.stem(word) for word in entry] for entry in stp]
stem[0][:20]

['gercvruunl',
 'ger',
 'timmenssubject',
 'postscript',
 'draw',
 'prognntppostinghost',
 'tritoncvruunlorgan',
 'univers',
 'utrecht',
 '3d',
 'comput',
 'vision',
 'research',
 'grouplin',
 '30in',
 '0010580bvma7o9diablouucp',
 'diablouucpcboesel',
 'charl',
 'boesel',
 'writesin']

Keep a clean copy as a backup. 

In [75]:
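# Copy each token list so later changes to `stem` cannot alter the backup
processed_text = [list(entry) for entry in stem]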
# processed_text

### LDA Modeling 

And now to modeling! It's pretty simple, actually, but it takes a minute to run. We'll use the popular `gensim` package, but there are certainly other options. 

In [76]:
from gensim import corpora, models
from gensim.models.coherencemodel import CoherenceModel
import re
import warnings

Sanity check: how big is our dataset?

In [77]:
len(processed_text)

1000

Create a dictionary that maps each word in our dataset to an integer id, then convert every document to a bag of words: a list of `(word id, count)` pairs. 

In [78]:
stem_dictionary = corpora.Dictionary(processed_text)
stem_corpus = [stem_dictionary.doc2bow(stem) for stem in processed_text]
stem_corpus[0][:20]

[(0, 1),
 (1, 1),
 (2, 1),
 (3, 2),
 (4, 1),
 (5, 2),
 (6, 1),
 (7, 1),
 (8, 1),
 (9, 1),
 (10, 1),
 (11, 1),
 (12, 1),
 (13, 1),
 (14, 1),
 (15, 1),
 (16, 1),
 (17, 1),
 (18, 2),
 (19, 1)]
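
The `(id, count)` pairs above are easier to read once you have built them by hand. The pure-Python sketch below mimics the idea only, under the assumption that ids are handed out in order of first appearance; `gensim` may number words differently, and `build_vocab` and `to_bow` are hypothetical helper names.

```python
from collections import Counter

def build_vocab(docs):
    """Assign each distinct token an integer id in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())

# Two made-up, already-stemmed documents
docs = [["postscript", "draw", "util", "postscript"], ["pitch", "hit", "draw"]]
vocab = build_vocab(docs)
print(to_bow(docs[0], vocab))
# [(0, 2), (1, 1), (2, 1)]
```

Word order is thrown away in this representation; only which words occur, and how often, is passed on to the topic model.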

Finally, build an LDA model and harvest some evaluation information. Coherence is an important metric: it scores how well the top words in each topic hang together, and higher is generally better. 

In [80]:
# iterations = [i for i in range(2,11)]
# coherence = []
# iterations

In [81]:
with warnings.catch_warnings():
  warnings.simplefilter("ignore")

  stem_model = models.ldamodel.LdaModel(corpus=stem_corpus, 
                                        id2word=stem_dictionary, 
                                        num_topics=2, 
                                        passes=10, 
                                        random_state = 1)
  stem_cm = CoherenceModel(model=stem_model, 
                           texts=processed_text, 
                           dictionary=stem_dictionary, 
                           coherence='c_v')
stem_coherence = stem_cm.get_coherence()
stem_coherence

0.48282777167218005

Check out this fabulous visualization for assessing and exploring your model results! 

In [82]:
# import sys
# !{sys.executable} -m pip install pyLDAvis

import pyLDAvis
# Note: pyLDAvis 3.3+ renamed this module to pyLDAvis.gensim_models
import pyLDAvis.gensim as gensimvis

In [83]:
## This visualization might not be supported yet in Colab. 
## Try it in a jupyter notebook! 
vis_data = gensimvis.prepare(stem_model, stem_corpus, stem_dictionary)


In [84]:
pyLDAvis.display(vis_data)

### Followup challenges: 

* Remove the remaining special characters 
* Package up the NLP workflow into functions for efficient iteration 
* Use the `test` set to evaluate model fit. We only built a model with the training set; we didn't actually evaluate it! 
* Test implementing lemmatization, and compare the results to those of stemming used in this example 
* Use other subsets of the data (or the full set) to discover patterns of similarity among topics 
* How many topics should we have used? Iterate through model parameters and plot the coherence scores to find the best model fit.
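
As a starting point for the last challenge, here is one way the sweep over topic counts might be organized. It is a sketch under assumptions, not a definitive recipe: `coherence_for_k` and `sweep` are hypothetical helpers, the `gensim` calls simply reuse the arguments from the modeling cell above, and the demo scores at the bottom are made-up placeholders so that the selection logic can run on its own.

```python
def coherence_for_k(k, corpus, dictionary, texts):
    """Fit an LDA model with k topics and return its c_v coherence."""
    # Imported here so the selection helper below works without gensim installed
    from gensim import models
    from gensim.models.coherencemodel import CoherenceModel

    model = models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary,
                                     num_topics=k, passes=10, random_state=1)
    cm = CoherenceModel(model=model, texts=texts,
                        dictionary=dictionary, coherence="c_v")
    return cm.get_coherence()

def sweep(candidates, score_fn):
    """Score every candidate and return (all_scores, best_candidate)."""
    scores = {k: score_fn(k) for k in candidates}
    best = max(scores, key=scores.get)
    return scores, best

# Placeholder scores (NOT real results) so the sketch runs without fitting models.
# For real use, something like:
#   sweep(range(2, 11),
#         lambda k: coherence_for_k(k, stem_corpus, stem_dictionary, processed_text))
placeholder_scores = {2: 0.1, 3: 0.3, 4: 0.2}
scores, best_k = sweep(placeholder_scores, placeholder_scores.get)
print(best_k)
# 3
```

Picking the highest coherence is a heuristic rather than a guarantee, so it is still worth plotting the scores and reading the top words of each topic before settling on a value.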
