<table align="left">
<tr>

<th style="background-color:white">
<img src="https://github.com/mlgill/ODSC_East_2017_PythonNLP/blob/master/assets/logo.png?raw=true" width="140" height="100">
</th>

<th style="background-color:white">
<div align="left">
<h1>Learning from Text: <br> Introduction to Natural Language Processing with Python</h1>  
<h2>Michelle L. Gill, Ph.D.</h2>     
Senior Data Scientist, Metis  
ODSC East  
May 3, 2017 
</div>
</th>

</tr>
</table>  

## Text Preprocessing Walkthrough

In [1]:
import re, nltk
from accessory_functions import nltk_path

# Setup nltk corpora path
nltk.data.path.insert(0, nltk_path)
from nltk.corpus import reuters

## 1. A Simple Corpus

Create a simple corpus of three short documents.

In [2]:
corpus_orig = ['This is document one. I went running.',
               'This is document two. She was a writer.',
               'This document has a numerical entry: 4,000dollars.']

## 2. Normalization

Text normalization usually means converting all text to lower case. It sometimes also involves removing words that contain digits. The function below lowercases each document with `str.lower` and then removes digit-containing words with a regular expression.

In [3]:
def lower_alpha_num(corpus):
    # convert to lower case
    corpus = map(str.lower, corpus)
    
    # remove any word that contains at least one digit
    corpus = map(lambda x: re.sub(r"""\w*\d\w*""", '', x), corpus)
    return list(corpus)

corpus = lower_alpha_num(corpus_orig)

corpus

['this is document one. i went running.',
 'this is document two. she was a writer.',
 'this document has a numerical entry: ,.']
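The stray `,.` left in the third document follows from how the pattern matches. A minimal sketch of the same regular expression applied to a fragment of that document is shown here; the variable names are illustrative and not part of the notebook.

```python
import re

# Same pattern as in lower_alpha_num: optional word characters, one digit,
# then optional word characters, i.e. any "word" containing a digit.
digit_word_re = re.compile(r"\w*\d\w*")

text = "entry: 4,000dollars."

# The comma is not a word character, so "4" and "000dollars" match separately.
matches = digit_word_re.findall(text)
print(matches)   # ['4', '000dollars']

# Removing the matches leaves the surrounding punctuation behind.
cleaned = digit_word_re.sub("", text)
print(cleaned)   # 'entry: ,.'
```

The leftover punctuation is harmless here because it is removed in the next step.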

## 3. Punctuation

Punctuation can also be removed with a regular expression. The `string` module provides `string.punctuation`, a string containing the ASCII punctuation characters.

In [4]:
import string

# the punctuation list
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [5]:
def remove_punct(corpus):
    # regular expression to remove punctuation
    punc_re = re.compile('[%s]' % re.escape(string.punctuation))

    corpus = map(lambda x: punc_re.sub(' ', x), corpus)
    return list(corpus)

# don't store the results of the punctuation removal just yet
remove_punct(corpus)

['this is document one  i went running ',
 'this is document two  she was a writer ',
 'this document has a numerical entry    ']
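As an alternative to a regular expression, the same replacement can be sketched with `str.translate`. The helper name `remove_punct_translate` is hypothetical and only used for this illustration.

```python
import string

# Translation table mapping every ASCII punctuation character to a space.
punct_to_space = str.maketrans(string.punctuation,
                               " " * len(string.punctuation))

def remove_punct_translate(corpus):
    # hypothetical helper, not part of the notebook
    return [doc.translate(punct_to_space) for doc in corpus]

docs = ['this is document one. i went running.',
        'this document has a numerical entry: ,.']

result = remove_punct_translate(docs)
print(result)
# ['this is document one  i went running ',
#  'this document has a numerical entry    ']
```

Both approaches replace each punctuation character with a single space, so neighboring words are never merged.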

## 4. Tokenization

Documents can be tokenized by sentence or by word using `nltk`.

Sentence tokenization is less commonly used, but will be demonstrated first. Note that sentence punctuation is required for correct tokenization, so punctuation must not be removed beforehand.

In [6]:
from nltk.tokenize import sent_tokenize

corpus_sent = map(sent_tokenize, corpus)

list(corpus_sent)

[['this is document one.', 'i went running.'],
 ['this is document two.', 'she was a writer.'],
 ['this document has a numerical entry: ,.']]

Word tokenization is more common. The punctuation removal described above will now be added in.

In [7]:
from nltk.tokenize import word_tokenize

def word_tokens(corpus):
    return list(map(word_tokenize, corpus))

corpus = word_tokens(remove_punct(corpus))

corpus

[['this', 'is', 'document', 'one', 'i', 'went', 'running'],
 ['this', 'is', 'document', 'two', 'she', 'was', 'a', 'writer'],
 ['this', 'document', 'has', 'a', 'numerical', 'entry']]
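Because punctuation has already been replaced by spaces, plain whitespace splitting gives the same tokens for this small corpus. The sketch below uses only built-in string methods and is not a substitute for `word_tokenize`, which also separates punctuation and contractions when they are still present in the text.

```python
# Output of remove_punct(corpus) from the cell above.
docs = ['this is document one  i went running ',
        'this is document two  she was a writer ',
        'this document has a numerical entry    ']

# str.split() with no arguments splits on runs of whitespace
# and ignores leading and trailing spaces.
tokens = [doc.split() for doc in docs]
print(tokens[2])   # ['this', 'document', 'has', 'a', 'numerical', 'entry']
```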

## 5. Stopword Removal

Commonly used words, called "stop words", can be removed using `nltk`.

In [8]:
from nltk.corpus import stopwords

stopwords.words('english')[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your']

The following function generalizes a text preprocessing method so that it works on either a single list (one tokenized document) or a list of lists (multiple tokenized documents).

In [9]:
def generalize_fun(corpus, lambda_fun):
    # must handle a list of lists (tokenized docs) and also a simple list
    
    if isinstance(corpus[0], list):
        # list of lists
        corpus = map(lambda_fun, corpus)
    else:
        # single list
        corpus = lambda_fun(corpus)
        
    return list(corpus)
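A short usage sketch of this helper follows. The function is repeated so the example runs on its own, and `upper` is a made-up stand-in for a real preprocessing step.

```python
def generalize_fun(corpus, lambda_fun):
    # same logic as the helper above, repeated for a self-contained example
    if isinstance(corpus[0], list):
        # list of lists: apply the function to each document
        return list(map(lambda_fun, corpus))
    # single list: apply the function to the one document
    return list(lambda_fun(corpus))

# illustrative per-document function (an assumption, not from the notebook)
upper = lambda tokens: [tok.upper() for tok in tokens]

single_doc = ['went', 'running']
many_docs = [['went', 'running'], ['writer']]

print(generalize_fun(single_doc, upper))   # ['WENT', 'RUNNING']
print(generalize_fun(many_docs, upper))    # [['WENT', 'RUNNING'], ['WRITER']]
```

Note that the check on `corpus[0]` means an empty corpus raises an `IndexError`, so callers may want to guard against empty input.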

Now the stop words can be removed.

In [10]:
def remove_sws(corpus):
    # stop word removal; a set makes each membership test fast
    stop_words = set(stopwords.words('english'))
    filter_fun = lambda doc: [tok for tok in doc if tok not in stop_words]

    corpus = generalize_fun(corpus, filter_fun)
    return list(corpus)

corpus = remove_sws(corpus)
corpus

[['document', 'one', 'went', 'running'],
 ['document', 'two', 'writer'],
 ['document', 'numerical', 'entry']]
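The filtering step itself does not depend on `nltk`. The sketch below uses a small hand-picked stop word set, which is an assumption for illustration and far shorter than `nltk`'s English list, to show the same logic on the tokenized corpus.

```python
# Illustrative stop words only, NOT nltk's full English list.
stop_words = {'this', 'is', 'i', 'she', 'was', 'a', 'has'}

def remove_sws_set(docs, stop_words):
    # hypothetical helper: keep tokens that are not in the stop word set
    return [[tok for tok in doc if tok not in stop_words] for doc in docs]

docs = [['this', 'is', 'document', 'one', 'i', 'went', 'running'],
        ['this', 'is', 'document', 'two', 'she', 'was', 'a', 'writer'],
        ['this', 'document', 'has', 'a', 'numerical', 'entry']]

filtered = remove_sws_set(docs, stop_words)
print(filtered)
# [['document', 'one', 'went', 'running'],
#  ['document', 'two', 'writer'],
#  ['document', 'numerical', 'entry']]
```

Storing the stop words in a `set` keeps each lookup roughly constant time, which matters more as the corpus grows.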

## 6. Parts of Speech Tagging

Parts-of-speech (POS) tagging refers to the process of assigning tags, such as “noun” or “verb”, to words in documents.

Example POS tags: 

WP: wh-pronoun ("who", "what")  
VBZ: verb, 3rd person sing. present ("takes")  
VBG: verb, gerund/present participle ("taking")  
TO: to ("to go", "to him")   
DT: determiner ("the", "this")  
NN: noun, singular or mass ("door")  
.: Punctuation (".", "?")  

In [11]:
def pos_tag(corpus):        
    return list(map(nltk.pos_tag, corpus))

corpus_tagged = pos_tag(corpus)
corpus_tagged

[[('document', 'NN'), ('one', 'CD'), ('went', 'VBD'), ('running', 'VBG')],
 [('document', 'NN'), ('two', 'CD'), ('writer', 'NN')],
 [('document', 'NN'), ('numerical', 'JJ'), ('entry', 'NN')]]
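The output also contains Penn Treebank tags not listed above: CD (cardinal number), VBD (verb, past tense), and JJ (adjective). One common use of the tags is to keep only certain word classes. The sketch below works directly on the tagged output shown above; `keep_tags` is a hypothetical helper name.

```python
# Tagged output copied from the cell above.
corpus_tagged = [
    [('document', 'NN'), ('one', 'CD'), ('went', 'VBD'), ('running', 'VBG')],
    [('document', 'NN'), ('two', 'CD'), ('writer', 'NN')],
    [('document', 'NN'), ('numerical', 'JJ'), ('entry', 'NN')],
]

def keep_tags(tagged_docs, prefixes):
    # hypothetical helper: keep words whose tag starts with any given prefix
    return [[word for word, tag in doc if tag.startswith(prefixes)]
            for doc in tagged_docs]

nouns = keep_tags(corpus_tagged, ('NN',))
verbs = keep_tags(corpus_tagged, ('VB',))

print(nouns)   # [['document'], ['document', 'writer'], ['document', 'entry']]
print(verbs)   # [['went', 'running'], [], []]
```

Matching on a prefix such as `'VB'` groups related tags (VBD, VBG, VBZ) without listing each one.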

## 7. Stemming

Stemming removes alternative word endings. It produces similar, but not identical, results to lemmatization, which is a related technique. We will cover stemming here and lemmatization in the next notebook.

In [12]:
from nltk.stem import SnowballStemmer

def stem(corpus):
    # perform stemming
    stemmer = SnowballStemmer('english')
    stemmer_fun = lambda x: list(map(stemmer.stem, x))

    corpus = generalize_fun(corpus, stemmer_fun)
    return list(corpus)

stem(corpus)

[['document', 'one', 'went', 'run'],
 ['document', 'two', 'writer'],
 ['document', 'numer', 'entri']]

## 8. N-Grams

N-grams are tokens made of N adjacent words. Including them as features lets a model capture some word-order information.

In [13]:
from nltk.util import ngrams

def bigrams(corpus, ngram_size=2):
    # create n-grams
    ngram_fun = lambda x: list(map(' '.join, ngrams(x, ngram_size)))

    corpus = generalize_fun(corpus, ngram_fun)
    return list(corpus)

bigrams(corpus)

[['document one', 'one went', 'went running'],
 ['document two', 'two writer'],
 ['document numerical', 'numerical entry']]
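To make the sliding window explicit, here is a minimal pure-Python sketch of n-gram construction. It is an illustration of the idea rather than how `nltk.util.ngrams` is implemented, and `make_ngrams` is a hypothetical name.

```python
def make_ngrams(tokens, n=2):
    # Zip n staggered views of the token list; zip stops at the shortest,
    # so only complete windows of n tokens are produced.
    windows = zip(*(tokens[i:] for i in range(n)))
    return [' '.join(window) for window in windows]

tokens = ['document', 'one', 'went', 'running']

print(make_ngrams(tokens, 2))      # ['document one', 'one went', 'went running']
print(make_ngrams(tokens, 3))      # ['document one went', 'one went running']
print(make_ngrams(['writer'], 2))  # []
```

A document with fewer than `n` tokens yields no n-grams at all.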

## 9. Count Vectorizer: Converting Text to Numbers

Scikit-learn's [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) converts a collection of text documents to a sparse matrix of token counts. By default it also lowercases and tokenizes the text, so it can be applied directly to the original corpus.

In [14]:
corpus_orig

['This is document one. I went running.',
 'This is document two. She was a writer.',
 'This document has a numerical entry: 4,000dollars.']

In [15]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()

X = cv.fit_transform(corpus_orig)
X = X.toarray()

X

array([[0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0],
       [0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1],
       [1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0]], dtype=int64)

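This is known as a document-term matrix. Each row corresponds to a document and each column to a feature (here, the count of one token). It's easiest to see the features by converting to a dataframe.

Note that count vectorizer doesn't remove tokens that contain digits, so `000dollars` appears as a feature. Its default tokenizer does drop single-character tokens such as `a`, `i`, and the leading `4`.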
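The matrix above can be reproduced by hand, which helps show what the vectorizer is counting. The sketch below is a simplified stand-in, not scikit-learn's implementation: it assumes the default settings of lowercasing, the default token pattern `(?u)\b\w\w+\b`, and an alphabetically sorted vocabulary.

```python
import re
from collections import Counter

corpus_orig = ['This is document one. I went running.',
               'This is document two. She was a writer.',
               'This document has a numerical entry: 4,000dollars.']

# Tokens are runs of two or more word characters.
token_re = re.compile(r"(?u)\b\w\w+\b")

def tokenize(doc):
    return token_re.findall(doc.lower())

# Count tokens per document, then build a sorted vocabulary.
counts = [Counter(tokenize(doc)) for doc in corpus_orig]
vocab = sorted(set().union(*counts))

# One row per document, one column per vocabulary term.
matrix = [[doc_counts[term] for term in vocab] for doc_counts in counts]

print(vocab[:3])   # ['000dollars', 'document', 'entry']
print(matrix[2])   # [1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0]
```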

In [16]:
import pandas as pd
pd.DataFrame(X,
             columns=cv.get_feature_names_out())

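|   | 000dollars | document | entry | has | is | numerical | one | running | she | this | two | was | went | writer |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 1 |
| 2 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |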


However, a custom preprocessor can be used to remove words that contain digits. Because a custom preprocessor replaces the default one, it must also handle lowercasing.

In [17]:
remove_nums = lambda x: re.sub(r"""\w*\d\w*""", ' ', x.lower())

cv = CountVectorizer(preprocessor=remove_nums)

X = cv.fit_transform(corpus_orig).toarray()
pd.DataFrame(X,
             columns=cv.get_feature_names_out())

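|   | document | entry | has | is | numerical | one | running | she | this | two | was | went | writer |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 1 |
| 2 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |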


In [18]:
# avoid having to normalize in code below
corpus_norm = lower_alpha_num(corpus_orig)

Count vectorizer can also record only whether each token is present (1) or absent (0) in a document by setting `binary=True`.

In [19]:
cv = CountVectorizer(binary=True)

X = cv.fit_transform(corpus_norm).toarray()
pd.DataFrame(X,
             columns=cv.get_feature_names_out())

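|   | document | entry | has | is | numerical | one | running | she | this | two | was | went | writer |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 1 |
| 2 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |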
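In this corpus no token occurs more than once within a document, so the binary output is identical to the plain counts. The difference only appears when a word repeats, as in this small sketch with a made-up document.

```python
from collections import Counter

# Made-up document with a repeated word (not part of the corpus).
doc = "run run run walk"

counts = Counter(doc.split())
vocab = sorted(counts)

# Raw counts versus presence/absence indicators.
count_row = [counts[term] for term in vocab]
binary_row = [int(counts[term] > 0) for term in vocab]

print(vocab)        # ['run', 'walk']
print(count_row)    # [3, 1]
print(binary_row)   # [1, 1]
```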


Ranges of n-grams can be added as additional features. Here `ngram_range=(1,2)` keeps both single tokens and bigrams. The matrix is shown transposed for easier viewing.

In [20]:
cv = CountVectorizer(ngram_range=(1,2))

X = cv.fit_transform(corpus_norm).toarray()
pd.DataFrame(X,
             columns=cv.get_feature_names_out()).T

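|   | 0 | 1 | 2 |
|---|---|---|---|
| document | 1 | 1 | 1 |
| document has | 0 | 0 | 1 |
| document one | 1 | 0 | 0 |
| document two | 0 | 1 | 0 |
| entry | 0 | 0 | 1 |
| has | 0 | 0 | 1 |
| has numerical | 0 | 0 | 1 |
| is | 1 | 1 | 0 |
| is document | 1 | 1 | 0 |
| numerical | 0 | 0 | 1 |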


Finally, count vectorizer can remove stop words automatically when a list is passed through the `stop_words` parameter.

In [21]:
cv = CountVectorizer(stop_words=stopwords.words('english'))

X = cv.fit_transform(corpus_norm).toarray()
pd.DataFrame(X,
             columns=cv.get_feature_names_out())

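|   | document | entry | numerical | one | running | two | went | writer |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 |
| 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |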
