
<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Introduction to Natural Language Processing (NLP)


---


![](https://snag.gy/uvESGH.jpg)

## Learning Objectives


### Core
- Extract features from unstructured text using Scikit Learn
    - Count vectorizer
    - TFIDF vectorizer
- Describe downsides of bag-of-words approaches
- Remove stop words


### Target
- Identify parts of speech using NLTK
    - Stemming 
    - Segmentation
    - Parts of speech tagging


### Stretch

- Describe how TFIDF works and calculate scores by hand


### Lesson Guide
- [Introduction to text feature extraction](#intro)
- [An NLP project: rapstats.io](#rapstats)
- [Common NLP problems](#common)
- [Common NLP models](#models)
- [A simple example](#simple)
- [Bag-of-words / word counting](#bow)
    - [Sklearn `CountVectorizer`](#countvectorizer)
- [Term frequency - inverse document frequency](#tfidf)
    - [Sklearn `TfidfVectorizer`](#tfidf-vec)
- [Downsides to bag-of-words](#downsides-bow)
- [Segmentation](#segmentation)
    - [NLTK sentencer](#nltk-sentencer)
- [Stemming with NLTK](#stem-nltk)
    - [Stemming approaches](#group)
- [Stop words](#stopwords)
- [Part of speech tagging](#pos)
- [Unicode: a common pitfall](#unicode)
- [Conclusion](#conclusion)
- [Additional resources](#resources)

<a name="intro"></a>
## Introduction to text feature extraction

---

The models we have been using so far accept a 2D matrix of real numbers as input `X` and a target vector of classes or numbers `y`. What if our starting point data is not given in the form of a table of numbers, but rather is unstructured? This is the case when working with text documents.

We need a way to go from unstructured data to our numeric `X` matrix in order to use the same models. This is called _feature extraction_ and this lesson is dedicated to it.

The applications of using text data in statistical modeling are practically infinite. Some examples include:
- Sentiment analysis of Yelp reviews
- Identifying topics of news articles
- Classification of political authors

<a id='rapstats'></a>
## An example NLP Project:  rapstats.io

---

<a href="http://rapstats.io"><img src="https://snag.gy/8GSVqf.jpg"></a>

<img src="https://snag.gy/8eJNFv.jpg" style="width: 300px; float: left;">
<img src="https://snag.gy/2Hz0o7.jpg" style="width: 300px;">

**See Also:**

- [Largest Vocabulary in Hip Hop](http://poly-graph.co/vocabulary.html)
- [Rap Genius: Rapstats](http://genius.com/rapstats)
- [Rap Lyric Generator, Hieu Nguyen, Brian Sa](http://nlp.stanford.edu/courses/cs224n/2009/fp/5.pdf)

<a id='common'></a>
## Common NLP problems

---

The table below details some of the most common problems and tasks in the vast field of natural language processing (NLP).

|Topic|Description|
|-|-|
| **Sentiment Analysis** | Is what is written positive or negative? | 
| **Named Entity Recognition** | Classify names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc. |
| **Summarization** | Boiling down large bodies of text to paraphrased versions |
| **Topic Modeling** | What topics does a body of text belong to? (e.g., auto-tagging of news articles) |
| **Question answering** | Given a human-language question, determine its answer. |
| **Word disambiguation** | Many words have more than one meaning; we have to select the meaning which makes the most sense in context. For this problem, we are typically given a list of words and associated word senses, e.g. from a dictionary or from an online resource such as WordNet. |
| **Machine dialog systems** | Building response systems that react contextually to human input (e.g., me: Siri, cook me some bacon. Siri: How do you like your bacon?) | 


See Also:

- [News Headline Analysis](http://nbviewer.jupyter.org/github/AYLIEN/headline_analysis/blob/06f1223012d285412a650c201a19a1c95859dca1/main-chunks.ipynb?utm_content=buffer5d40c&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer)
- [Sentiment + Robot Classification in Movies](http://nbviewer.jupyter.org/github/cojette/ClusteringRobotsinMovie/blob/master/Classification%20of%20Robots%20in%20Movies.ipynb)
- [Text Summarization / Gensim](http://nbviewer.jupyter.org/github/piskvorky/gensim/blob/develop/docs/notebooks/summarization_tutorial.ipynb)
- [Sentiment Analysis Intro](http://nbviewer.jupyter.org/github/sgsinclair/alta/blob/master/ipynb/SentimentAnalysis.ipynb)

<a id='models'></a>
## Common NLP models and terms

---

- LSI (Latent Semantic Indexing)
- LDA (Latent Dirichlet Allocation)
- HDP (Hierarchical Dirichlet Process)
- Word2Vec
- LogisticRegression
- Naive Bayes
- SVM
- CountVectorizer
- TfIdF (term frequency inverse document frequency)
- DTM (document term matrix)

> **Note:** This is not an exhaustive list, nor will we be covering all of these models in class. NLP is a very deep and very broad area of data science that could warrant its own immersive entirely!

<a id='simple'></a>
## A Simple Example
---

Suppose we are building a spam classifier. Inputs are emails and the output is a binary label.

Here's an example of an input email from each class:

In [1]:
spam = """
Hello,\nI saw your contact information on LinkedIn. I have carefully read through your profile and you seem to have an outstanding personality. This is one major reason why I am in contact with you. My name is Mr. Valery Grayfer Chairman of the Board of Directors of PJSC "LUKOIL". I am 86 years old and I was diagnosed with cancer 2 years ago. I will be going in for an operation later this week. I decided to WILL/Donate the sum of 8,750,000.00 Euros(Eight Million Seven Hundred And Fifty Thousand Euros Only etc. etc.
"""

ham = """
Hello, \nI am writing in regards to your application to the position of Data Scientist at Hooli X. We are pleased to inform you that you passed the first round of interviews and we would like to invite you for an on-site interview with our Senior Data Scientist Mr. John Smith. You will find attached to this message further information on date, time and location of the interview. Please let me know if I can be of any further assistance. Best Regards.
"""
print(spam)
print()
print(ham)


Hello,
I saw your contact information on LinkedIn. I have carefully read through your profile and you seem to have an outstanding personality. This is one major reason why I am in contact with you. My name is Mr. Valery Grayfer Chairman of the Board of Directors of PJSC "LUKOIL". I am 86 years old and I was diagnosed with cancer 2 years ago. I will be going in for an operation later this week. I decided to WILL/Donate the sum of 8,750,000.00 Euros(Eight Million Seven Hundred And Fifty Thousand Euros Only etc. etc.



Hello, 
I am writing in regards to your application to the position of Data Scientist at Hooli X. We are pleased to inform you that you passed the first round of interviews and we would like to invite you for an on-site interview with our Senior Data Scientist Mr. John Smith. You will find attached to this message further information on date, time and location of the interview. Please let me know if I can be of any further assistance. Best Regards.



- Can you think of a simple heuristic rule to catch email like this?
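
One possible answer, as a minimal sketch: flag a message when it contains several suspicious phrases. The phrase list and the threshold below are hypothetical choices made by eye from the spam example, not a tested rule.

```python
# Hypothetical trigger phrases, chosen by eye from the spam example above.
SUSPICIOUS_TERMS = ("donate", "euros", "diagnosed with cancer", "sum of")


def looks_like_spam(text, terms=SUSPICIOUS_TERMS, threshold=2):
    """Flag a message when at least `threshold` suspicious terms occur in it."""
    lowered = text.lower()
    hits = [term for term in terms if term in lowered]
    return len(hits) >= threshold


spam_snippet = "I decided to WILL/Donate the sum of 8,750,000.00 Euros"
ham_snippet = "We would like to invite you for an on-site interview"

print(looks_like_spam(spam_snippet))  # True: 'donate', 'euros' and 'sum of' all match
print(looks_like_spam(ham_snippet))   # False: no suspicious term matches
```

A rule like this is easy to write but brittle: a spammer only has to change the wording. That is the motivation for learning the features from data instead.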

<a id='bow'></a>
## Bag-of-words / word counting
---

The bag-of-words model is a simplified representation of the raw data. In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words.

Bag-of-words representations discard grammar, order, and structure in the text, but track occurrences.

In [2]:
from collections import Counter
print(Counter(spam.lower().split()))
print()
print(Counter(ham.lower().split()))

Counter({'i': 7, 'of': 4, 'and': 3, 'your': 2, 'contact': 2, 'have': 2, 'to': 2, 'an': 2, 'this': 2, 'is': 2, 'am': 2, 'in': 2, 'with': 2, 'the': 2, 'years': 2, 'etc.': 2, 'hello,': 1, 'saw': 1, 'information': 1, 'on': 1, 'linkedin.': 1, 'carefully': 1, 'read': 1, 'through': 1, 'profile': 1, 'you': 1, 'seem': 1, 'outstanding': 1, 'personality.': 1, 'one': 1, 'major': 1, 'reason': 1, 'why': 1, 'you.': 1, 'my': 1, 'name': 1, 'mr.': 1, 'valery': 1, 'grayfer': 1, 'chairman': 1, 'board': 1, 'directors': 1, 'pjsc': 1, '"lukoil".': 1, '86': 1, 'old': 1, 'was': 1, 'diagnosed': 1, 'cancer': 1, '2': 1, 'ago.': 1, 'will': 1, 'be': 1, 'going': 1, 'for': 1, 'operation': 1, 'later': 1, 'week.': 1, 'decided': 1, 'will/donate': 1, 'sum': 1, '8,750,000.00': 1, 'euros(eight': 1, 'million': 1, 'seven': 1, 'hundred': 1, 'fifty': 1, 'thousand': 1, 'euros': 1, 'only': 1})

Counter({'to': 5, 'of': 4, 'you': 4, 'the': 3, 'i': 2, 'data': 2, 'scientist': 2, 'we': 2, 'and': 2, 'further': 2, 'hello,': 1, 'am': 1,

> In the above example we counted the number of times each word appeared in the text. Note that since we included all the words in the text, we created a dictionary that contains many words with only one appearance.
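
Because `str.split()` only splits on whitespace, the counts above keep punctuation attached to words (`'hello,'`, `'you.'`). A minimal sketch of a slightly cleaner tokenizer, assuming the simple pattern `\w+` (runs of word characters) is good enough; the sample sentence is a shortened, hypothetical variant of the spam text.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, keep runs of word characters, and count them."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens)


sample = "Hello, I am in contact with you. You seem to have an outstanding personality."
counts = bag_of_words(sample)
print(counts["you"])    # 2: 'you.' and 'You' now collapse into one token
print(counts["hello"])  # 1: the trailing comma is gone
```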

<a name="countvectorizer"></a>
## Sklearn `CountVectorizer`
---

Sklearn offers a `CountVectorizer` class that performs the same word counting, with many configurable options:

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

# raw string so that \w is passed to the regex engine unchanged
cvec = CountVectorizer(token_pattern=r'\w+', stop_words={"the"},
                       ngram_range=(1, 2))
cvec.fit([spam,ham])

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 2), preprocessor=None, stop_words={'the'},
        strip_accents=None, token_pattern='\\w+', tokenizer=None,
        vocabulary=None)
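
The `ngram_range=(1, 2)` argument asks for single words (unigrams) and pairs of adjacent words (bigrams). As a rough sketch of what that means, here is a plain-Python n-gram generator; the helper name `ngrams` is hypothetical and this is not how sklearn implements it internally.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as space-joined strings) in a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "i am writing in regards to your application".split()
features = ngrams(tokens, 1) + ngrams(tokens, 2)
print(ngrams(tokens, 2)[:3])  # ['i am', 'am writing', 'writing in']
print(len(features))          # 15: 8 unigrams plus 7 bigrams
```

Note that `CountVectorizer` drops stop words before it builds the n-grams, which is why a feature such as `'donate sum'` shows up in the vocabulary even though the email reads "Donate the sum".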

### The count vectorizer returns a sparse matrix (see [scipy](https://docs.scipy.org/doc/scipy/reference/sparse.html))

In a sparse matrix, only the nonzero entries are stored. 
Conceptually, each one is stored as a triple of numbers: its row index, its column index, and its value.

This is particularly useful for NLP, where each document contains only a small fraction of all the possible words.
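
To make the triple idea concrete, here is a minimal pure-Python sketch that converts a small dense count matrix into (row, column, value) triples. The matrix is a hypothetical example, not the spam/ham data, and scipy's compressed formats pack the same information more tightly than this.

```python
def to_triples(dense_rows):
    """Convert a dense list-of-lists into (row, column, value) triples,
    keeping only the nonzero entries."""
    return [
        (i, j, value)
        for i, row in enumerate(dense_rows)
        for j, value in enumerate(row)
        if value != 0
    ]


# Hypothetical 2-document, 6-term count matrix.
dense = [
    [0, 2, 0, 0, 1, 0],
    [1, 0, 0, 0, 0, 3],
]
triples = to_triples(dense)
print(triples)  # [(0, 1, 2), (0, 4, 1), (1, 0, 1), (1, 5, 3)]
print(len(triples), "stored values instead of", 2 * 6)
```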


In [4]:
document_matrix = cvec.transform([spam, ham])
document_matrix

<2x283 sparse matrix of type '<class 'numpy.int64'>'
	with 305 stored elements in Compressed Sparse Row format>

In [5]:
cvec.get_feature_names()

['00',
 '00 euros',
 '000',
 '000 00',
 '2',
 '2 years',
 '750',
 '750 000',
 '8',
 '8 750',
 '86',
 '86 years',
 'ago',
 'ago i',
 'am',
 'am 86',
 'am in',
 'am writing',
 'an',
 'an on',
 'an operation',
 'an outstanding',
 'and',
 'and fifty',
 'and i',
 'and location',
 'and we',
 'and you',
 'any',
 'any further',
 'application',
 'application to',
 'are',
 'are pleased',
 'assistance',
 'assistance best',
 'at',
 'at hooli',
 'attached',
 'attached to',
 'be',
 'be going',
 'be of',
 'best',
 'best regards',
 'board',
 'board of',
 'can',
 'can be',
 'cancer',
 'cancer 2',
 'carefully',
 'carefully read',
 'chairman',
 'chairman of',
 'contact',
 'contact information',
 'contact with',
 'data',
 'data scientist',
 'date',
 'date time',
 'decided',
 'decided to',
 'diagnosed',
 'diagnosed with',
 'directors',
 'directors of',
 'donate',
 'donate sum',
 'eight',
 'eight million',
 'etc',
 'etc etc',
 'euros',
 'euros eight',
 'euros only',
 'fifty',
 'fifty thousand',
 'find',
 'f

In [6]:
len(cvec.get_feature_names())

283

#### Handling sparse matrices

In [7]:
print("Number of nonzero entries:")
print(document_matrix.nnz)
print("Highest count:")
print(document_matrix.max())
print("Row means:")
print(document_matrix.mean(axis=1))
print("Transform to numpy array format:")
print(document_matrix.toarray())

Number of nonzero entries:
305
Highest count:
7
Row means:
[[0.6819788 ]
 [0.56183746]]
Transform to numpy array format:
[[1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 0 2 0 1 1 3 1 1 0 0 1 0 0 0 0 0 0 0 0
  0 0 0 0 1 1 0 0 0 1 1 0 0 1 1 1 1 1 1 2 1 1 0 0 0 0 1 1 1 1 1 1 1 1 1 1
  2 1 2 1 1 1 1 0 0 0 0 1 1 0 0 0 1 1 1 1 2 1 1 1 1 0 0 1 1 7 2 0 1 1 1 1
  1 0 0 2 1 1 0 0 0 1 1 0 0 0 0 0 0 0 2 1 1 0 0 0 0 1 1 0 0 0 0 1 1 0 0 1
  1 1 1 0 0 0 0 1 1 1 0 1 1 1 1 1 4 1 0 1 0 1 0 0 1 1 1 1 0 1 0 1 1 1 1 1
  1 0 0 1 1 0 0 1 1 1 1 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 1 1 0 0 0 1 1 0 0
  1 1 0 0 0 0 1 1 0 0 2 1 0 1 1 1 1 1 0 0 2 1 0 0 0 0 1 0 1 1 1 1 0 0 0 1
  1 1 1 2 1 1 0 2 1 0 1 0 0 0 0 0 0 2 1 1 2 0 1 0 1 0 0 2 0 1 1]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 1 0 0 2 0 0 1 1 0 1 1 1 1 1 1 1 1
  1 1 1 1 1 0 1 1 1 0 0 1 1 0 0 0 0 0 0 0 0 0 2 2 1 1 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 1 1 1 1 1 1 2 1 1 0 0 0 0 0 0 0 1 1 1 1 0 0 2 1 1 0 0 0 0
  0 1 1 1 0 0 1 1 1 1 1 2 1 1 1 1 1 1 0 0 0 1 1 1 1 0 0 1 1 1 1 0 0 1 1 0


#### Inserting the result into a dataframe

> **Note:** For huge text bodies continue working with the sparse matrix format. Many sklearn models can digest sparse matrices.

In [8]:
import pandas as pd
df = pd.DataFrame(cvec.transform([spam, ham]).toarray(),
                  columns=cvec.get_feature_names())

df.transpose().sort_values(0, ascending=False).transpose()

Unnamed: 0,i,of,and,to,euros,etc,have,an,this,will,...,date,date time,on site,on date,of interviews,of interview,of data,of any,find,location
0,7,4,3,2,2,2,2,2,2,2,...,0,0,0,0,0,0,0,0,0,0
1,2,4,2,5,0,0,0,1,1,1,...,1,1,1,1,1,1,1,1,1,1


In [9]:
df.sum(axis=0)

00                  1
00 euros            1
000                 1
000 00              1
2                   1
2 years             1
750                 1
750 000             1
8                   1
8 750               1
86                  1
86 years            1
ago                 1
ago i               1
am                  3
am 86               1
am in               1
am writing          1
an                  3
an on               1
an operation        1
an outstanding      1
and                 5
and fifty           1
and i               1
and location        1
and we              1
and you             1
any                 1
any further         1
                   ..
why                 1
why i               1
will                3
will be             1
will donate         1
will find           1
with                3
with cancer         1
with our            1
with you            1
would               1
would like          1
writing             1
writing in          1
x         

- ### Spend a couple of minutes scanning the documentation to figure out what those parameters do. 

Share a few takeaways from the documentation. What arguments and capabilities stand out to you? How do the arguments affect the parsing behavior?

[Count Vectorizer Documentation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

<a name="tfidf"></a>
## Term frequency - inverse document frequency (tf-idf)

---

A tf-idf score tells us which words are most discriminating between documents. Words that occur a lot in one document but don't occur in many documents contain a great deal of discriminating power.

- This weight is a statistical measure used to evaluate how important a word is to a document in a collection (aka corpus).
- The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.

Variations of the tf-idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query.

The inverse document frequency measures how much information a word provides, that is, whether the term is common or rare across all documents. It is obtained by dividing the total number of documents plus one by the number of documents containing the term plus one, taking the logarithm of that quotient, and adding one.

This enhances terms that are highly specific of a particular document, while suppressing terms that are common to most documents.

#### Let's see how it is calculated:

Term frequency `tf` is the number of times a certain term $t$ occurs in a document $d$:

$$
\mathrm{tf}(t,d) = N_\text{term}
$$

Inverse document frequency `idf` is based on the inverse of the fraction of documents in the corpus that contain the term (logarithmically scaled, and smoothed so that the result is always greater than or equal to one):

$$
\mathrm{idf}(t, D) = 1+\log\left(\frac{1+N_\text{Documents}}{1+N_\text{Documents that contain term}}\right)
$$

Term frequency - Inverse Document Frequency (`tf-idf`) is calculated as:

$$
\text{tf-idf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t, D)
$$

Usually the obtained numbers are then rescaled in such a way that the tf-idf vector of each document has Euclidean length one.
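
The three formulas and the final rescaling can be sketched in a few lines of plain Python. The function names and the toy corpus below are hypothetical and chosen only for illustration; the natural logarithm is assumed, which is what sklearn uses.

```python
from math import log, sqrt


def idf(term, docs):
    """Smoothed inverse document frequency: 1 + log((1 + N) / (1 + n_t))."""
    n_containing = sum(1 for doc in docs if term in doc)
    return 1 + log((1 + len(docs)) / (1 + n_containing))


def tfidf_vector(doc, docs, vocabulary):
    """Raw tf-idf scores for one tokenized document, rescaled to unit length."""
    raw = [doc.count(term) * idf(term, docs) for term in vocabulary]
    norm = sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]


# Hypothetical toy corpus, already tokenized.
docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "bird", "flew"]]
vocabulary = sorted({term for doc in docs for term in doc})

print(round(idf("the", docs), 3))  # 1.0: appears in every document
print(round(idf("cat", docs), 3))  # 1.693: appears in one document of three
print([round(x, 3) for x in tfidf_vector(docs[0], docs, vocabulary)])
```

A word that appears in every document gets the minimum idf of 1, so it is not removed, only down-weighted relative to rarer words.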

#### An example

Consider the three sentences:

    a) I have a cat.

    b) I have a puppy.
    
    c) I have a dog, I have a kitten, and I have a pen.
    
- Calculate the tf-idf of the words `cat`, `have` and `puppy`.
- On paper, sketch the values obtained in a 3-dimensional coordinate system.
- Follow the default sklearn settings, which discard single-letter words.

#### Tf-idf calculation

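In [10]:
from math import log

# 'cat': occurs once in sentence a) and in 1 of the 3 sentences
cat_tf = 1
cat_idf = 1 + log(4 / 2)       # (1 + 3 documents) / (1 + 1 containing 'cat')
cat_tfidf = cat_tf * cat_idf   # about 1.69 (natural logarithm, as sklearn uses)

In [None]:
# 'have': occurs in all 3 sentences; once in a) and b), three times in c)
have_tf = 1                      # use 3 for sentence c)
have_idf = 1 + log(4 / 4)        # equals 1
have_tfidf = have_tf * have_idf  # 1 in a) and b), 3 in c)

In [None]:
# 'puppy': occurs once in sentence b) and in 1 of the 3 sentences
puppy_tf = 1
puppy_idf = 1 + log(4 / 2)
puppy_tfidf = puppy_tf * puppy_idf  # about 1.69, before length normalization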

<a id='tfidf-vec'></a>
## Sklearn `TfidfVectorizer`

---

### Why Use TFIDF?

- Common words are penalized
- Rare words have more influence

Sklearn provides a tf-idf vectorizer that works similarly to the other vectorizers we've covered. Notice that we can also eliminate stop words to improve our analysis.

#### Use the `TfidfVectorizer` to fit the spam and ham data.

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer

tvec = TfidfVectorizer(stop_words='english', norm='l2')
tvec.fit([spam, ham])

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words='english', strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [12]:
df = pd.DataFrame(tvec.transform([spam, ham]).todense(),
                  columns=tvec.get_feature_names(),
                  index=['spam', 'ham'])

df.transpose().sort_values('spam', ascending=False).head(10).transpose()

Unnamed: 0,years,euros,contact,personality,linkedin,lukoil,major,million,old,operation
spam,0.290133,0.290133,0.290133,0.145067,0.145067,0.145067,0.145067,0.145067,0.145067,0.145067
ham,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


These are the stopwords used by sklearn:

In [13]:
stopwords = tvec.get_stop_words()
sorted(list(stopwords))[:10]

['a',
 'about',
 'above',
 'across',
 'after',
 'afterwards',
 'again',
 'against',
 'all',
 'almost']

#### Test the above example

```python
test = ['I have a cat', 'I have a puppy', 'I have a dog, I have a kitten, and I have a pen']
```

In [14]:
cvec = CountVectorizer(token_pattern=r'\w+', stop_words={"the"},
                       ngram_range=(1, 2))

In [15]:
test = ['I have a cat', 'I have a puppy', 'I have a dog, I have a kitten, and I have a pen']

In [16]:
tvec = TfidfVectorizer( norm="l2")
tvec.fit(test)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [17]:
# tvec was fit on `test`, so transform the same three sentences
# (on scikit-learn >= 1.0, use tvec.get_feature_names_out() instead)
df = pd.DataFrame(tvec.transform(test).todense(),
                  columns=tvec.get_feature_names(),
                  index=['a', 'b', 'c'])
df

<a name="downsides-bow"></a>
## Downsides to bag-of-words

---

Bag-of-words approaches like the one outlined above completely ignore the structure of a sentence. They merely assess the presence of specific words or word combinations.

The same word can have multiple meanings in different contexts. Consider for example the following two sentences:

- There's wood floating in the **sea**.
- Mike's in a **sea** of trouble with the move.

How do we teach a computer to disambiguate? Later we will cover some other techniques that may help.
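
Word order is lost as well. A small sketch with two hypothetical sentences that mean opposite things but produce exactly the same bag of words:

```python
from collections import Counter

a = "the dog bit the man"
b = "the man bit the dog"

# Both sentences contain the same words the same number of times,
# so their bag-of-words representations are indistinguishable.
print(Counter(a.split()) == Counter(b.split()))  # True
```

Adding bigrams (as with `ngram_range=(1, 2)`) recovers some local order, at the cost of a much larger vocabulary.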


<a id='segmentation'></a>
## Segmentation

---

_Segmentation_ is a technique to **identify sentences** within a body of text. Language is not a continuous uninterrupted stream of words: punctuation serves as a guide to group together words that convey meaning when contiguous.


In [18]:
easy_text = "I went to the zoo today. What do you think of that? I bet you hate it! Or maybe you don't"

easy_split_text = ["I went to the zoo today.",
                   "What do you think of that?",
                   "I bet you hate it!",
                   "Or maybe you don't"]


def simple_sentencer(text):
    '''take a string called `text` and return
    a list of strings, each containing a sentence'''

    sentences = []
    substring = ''
    for c in text:
        if c in ('.', '!', '?'):
            sentences.append(substring + c)
            substring = ''
        else:
            substring += c
    return sentences


simple_sentencer(easy_text)

['I went to the zoo today.',
 ' What do you think of that?',
 ' I bet you hate it!']

The function above doesn't work perfectly: it leaves leading spaces on the sentences and drops the final sentence, which has no closing punctuation. The Python NLTK library offers a more robust and easy-to-use sentencer.
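
Both flaws visible in the output (leading spaces, and the last sentence lost because it has no terminator) can be patched with a regular expression. This is only a sketch with a hypothetical helper name, and it still breaks on abbreviations such as "Mr.".

```python
import re


def regex_sentencer(text):
    """Split after '.', '!' or '?' followed by whitespace; keep the
    final fragment even when it has no closing punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


text = "I went to the zoo today. What do you think of that? I bet you hate it! Or maybe you don't"
print(regex_sentencer(text))
# ['I went to the zoo today.', 'What do you think of that?', 'I bet you hate it!', "Or maybe you don't"]

print(regex_sentencer("Mr. Smith left."))
# ['Mr.', 'Smith left.']  <- the abbreviation is wrongly treated as a sentence end
```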


<a id='nltk-sentencer'></a>
- ### There's an easier way to do the same thing!

In [19]:
from nltk.tokenize import PunktSentenceTokenizer

sent_detector = PunktSentenceTokenizer()
sent_detector.sentences_from_text(easy_text)

['I went to the zoo today.',
 'What do you think of that?',
 'I bet you hate it!',
 "Or maybe you don't"]

<a name="install"></a>
### Install NLTK packages

First, in your terminal, run 

```bash
pip install nltk
```

Then within python, run the following:

```python
import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')
```

- ### Use `nltk.download()` to explore the available packages.

In [20]:
import nltk
#nltk.download()

In [21]:
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/paxton615/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/paxton615/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

<a name="stem-nltk"></a>
## Stemming with NLTK

---

**Text normalization** is the process of converting slightly different versions of words with essentially equivalent meaning into the same features.

For example: LinkedIn sees 6000+ variations of the title "Software Engineer" and 8000+ variations of the word "IBM".

- ### What are other common cases of text that could need normalization?


    - Person titles (Mr, MR, Dr etc.)
    - Dates (10/03, March 10 etc.)
    - Numbers
    - Plurals
    - Verb conjugations
    - Slang
    - SMS abbreviations

It would be wrong to treat the words "MR." and "mr" as different features, so we need a technique to normalize words to a common root. This technique is called _stemming_.

- Science, Scientist => Scien
- Swimming, Swimmer, Swim => Swim

As we did above for sentence splitting, we could define a stemmer based on rules:

In [22]:
def stem(tokens):
    '''rule-based stemming of a bunch of tokens'''

    new_bag = []
    for token in tokens:
        # define rules here
        if token.endswith('s'):
            new_bag.append(token[:-1])
        elif token.endswith('er'):
            new_bag.append(token[:-2])
        elif token.endswith('ce'):
            new_bag.append(token[:-2])
        elif token.endswith('tion'):
            new_bag.append(token[:-4])
        elif token.endswith('tist'):
            new_bag.append(token[:-4])
        elif token.endswith('ing'):
            new_bag.append(token[:-3])
        else:
            new_bag.append(token)

    return new_bag


list_science = ['Science', 'Scientist', 'Scientists']
print(list_science)
list_stem = stem(stem(list_science))
print(list_stem)
list_play = ['player', 'plays', 'playing']
print(list_play)
print(stem(list_play))

['Science', 'Scientist', 'Scientists']
['Scien', 'Scien', 'Scien']
['player', 'plays', 'playing']
['play', 'play', 'play']


Luckily for us, NLTK contains several robust stemmers.

In [23]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

print(stemmer.stem('walks'))
print(stemmer.stem('walked'))
print(stemmer.stem('Walking'))

walk
walk
walk


<a id='group'></a>
### Stemming approaches


> There are other stemmers available in NLTK. Look at [this article](https://www.elastic.co/guide/en/elasticsearch/guide/current/choosing-a-stemmer.html) to find out about different stemmers, and have a look at how stemming works in different languages.

<a id='stopwords'></a>
## Stop words

---

Some words are very common and provide little useful information about the content of the text.

- ### Can you give some examples?

> We should remove these _stop words_. Note that each language has different stop words.

In [24]:
from nltk.corpus import stopwords

stop = stopwords.words('english')
sentence = "this is a foo bar sentence"

print([i for i in sentence.split() if i not in stop])

['foo', 'bar', 'sentence']
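
The example above works because the sentence is already lowercase and free of punctuation. A minimal sketch for messier input, assuming a tiny hypothetical stop list in place of NLTK's (so that it runs without any downloads):

```python
import re

# Hypothetical mini stop list for illustration; NLTK's English list is much longer.
STOP = {"this", "is", "a", "the", "of", "to", "and"}


def remove_stop_words(text, stop=STOP):
    """Lowercase, tokenize on word characters, and drop stop words."""
    return [tok for tok in re.findall(r"\w+", text.lower()) if tok not in stop]


print(remove_stop_words("This is a Foo bar sentence."))  # ['foo', 'bar', 'sentence']
```

Without the lowercasing step, `"This"` would not match the lowercase entry `"this"` and would slip through the filter.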


In [25]:
len(stop)

179

In [26]:
sorted(stop)[:20]

['a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been']

<a id='pos'></a>
## Part of speech tagging

---

Each word has a specific role in a sentence (verb, noun, etc.). Part-of-speech (POS) tagging is a feature extraction technique that attaches a tag to each word in the sentence in order to provide a more precise context for further analysis. This is often a resource-intensive process, but having the grammatical features can sometimes improve the accuracy of our models.

In [27]:
from nltk.tag import pos_tag
from nltk.tokenize import WordPunctTokenizer
tok = WordPunctTokenizer()
tags = dict(pos_tag(tok.tokenize("today is a great day to learn nlp")))
tags

{'today': 'NN',
 'is': 'VBZ',
 'a': 'DT',
 'great': 'JJ',
 'day': 'NN',
 'to': 'TO',
 'learn': 'VB',
 'nlp': 'NN'}
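
Once words are tagged, the tags can be used to select features, for example keeping only nouns. The sketch below hard-codes the (word, tag) pairs from the output above so that it runs without NLTK. It keeps them as a list rather than a `dict`, because a dictionary would collapse a word that occurs twice with different tags.

```python
# (word, tag) pairs copied from the tagger output above.
tagged = [("today", "NN"), ("is", "VBZ"), ("a", "DT"), ("great", "JJ"),
          ("day", "NN"), ("to", "TO"), ("learn", "VB"), ("nlp", "NN")]

# Penn Treebank noun tags all start with NN, verb tags with VB.
nouns = [word for word, tag in tagged if tag.startswith("NN")]
verbs = [word for word, tag in tagged if tag.startswith("VB")]

print(nouns)  # ['today', 'day', 'nlp']
print(verbs)  # ['is', 'learn']
```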

Here is the explanation for the abbreviations:

In [29]:
import nltk

nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

In [30]:
# only for the ones met in our example
# upenn_tagset() prints its description and returns None, so call it without print()
for val in set(tags.values()):
    nltk.help.upenn_tagset(val)

VBZ: verb, present tense, 3rd person singular
    bases reconstructs marks mixes displeases seals carps weaves snatches
    slumps stretches authorizes smolders pictures emerges stockpiles
    seduces fizzes uses bolsters slaps speaks pleads ...
NN: noun, common, singular or mass
    common-carrier cabbage knuckle-duster Casino afghan shed thermostat
    investment slide humour falloff slick wind hyena override subhumanity
    machinist ...
JJ: adjective or numeral, ordinal
    third ill-mannered pre-war regrettable oiled calamitous first separable
    ectoplasmic battery-powered participatory fourth still-to-be-named
    multilingual multi-disciplinary ...
VB: verb, base form
    ask assemble assess assign assume atone attention avoid bake balkanize
    bank begin behold believe bend benefit bevel beware bless boil bomb
    boost brace break bring broil brush build ...
DT: determiner
    all an another any both del each either every half la many much nary
    neith

<a id='unicode'></a>
## Unicode: a common pitfall

What happens when we get a character outside the ASCII range, for instance a German umlaut **&ouml;** or Japanese characters such as **片仮名 / カタカナ**?


- Python 3 strings are Unicode, so such characters are handled natively once the text has been decoded correctly
- Problems arise when raw bytes are decoded with the wrong encoding: the result is either a `UnicodeDecodeError` or silently garbled text
- This problem can be very frustrating to deal with

Luckily, sklearn has robust classes for text feature extraction. Use sklearn's built-in text preprocessing when possible (the vectorizers accept `encoding` and `decode_error` arguments). Always save/encode your text as UTF-8 when there are options available to do so.
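
In Python 3, a `str` holds Unicode code points, and most trouble comes from converting between bytes and text with a mismatched encoding. A short sketch of what goes right and what goes wrong, using a hypothetical sample string:

```python
text = "Köln カタカナ"

encoded = text.encode("utf-8")          # str -> bytes
print(len(text), len(encoded))          # 9 characters, 18 bytes

print(encoded.decode("utf-8") == text)  # True: round trip with the right codec

# Decoding with the wrong codec does not fail here, it silently garbles the text.
garbled = encoded.decode("latin-1")
print(garbled == text)                  # False

# ASCII cannot represent these bytes at all and raises an error.
try:
    encoded.decode("ascii")
except UnicodeDecodeError as err:
    print("decode failed:", err.reason)
```

The silent case is the more dangerous one: the garbled tokens end up in the vocabulary as if they were real words.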

<a name="conclusion"></a>
## Conclusion

---

In this lesson we obtained an overview of Natural Language Processing (NLP) and learned about two very powerful toolkits:
- scikit-learn's text feature extraction module (`sklearn.feature_extraction.text`)
- The Natural Language Toolkit (NLTK)

#### Some real world applications of these techniques:

- Spam Detection
- Preprocessing for larger NLP problems
- Job market analysis
- Crude topic analysis
- Building a keyword extraction heuristic and piping it into a marketing analysis 

<a id='resources'></a>
## Additional resources

---

- Check out this [Yelp blog post](http://engineeringblog.yelp.com/2015/09/automatically-categorizing-yelp-businesses.html) on how they completed a classification task (with over 1000 response variables!) using restaurant review text
- A list of all stop-words is available [here](https://github.com/ga-students/DSI-DC-2/blob/master/curriculum/Week-05/5.04-nlp/stop-words.txt) h/t sleevillanueva
- This lesson made use of Charlie Greenbacher's [Intro to NLP](http://spark-public.s3.amazonaws.com/nlp/slides/intro.pdf), which he delivered at the [DC-NLP Meetup](http://www.meetup.com/DC-NLP/) 
- Wikipedia includes a [walkthrough](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) of TF-IDF
- Play with Google's [ngram tool](https://books.google.com/ngrams/graph?content=data+science&year_start=1800&year_end=2000&corpus=15&smoothing=3&share=&direct_url=t1%3B%2Cdata%20science%3B%2Cc0)
- A hilarious data scientist gone rogue used NLP and Eigenfaces (Eigenvalues for face recognition) [for Tinder](http://dataconomy.com/hacking-tinder-with-facial-recognition-nlp/)
- Check out KPCB's 2016 internet trends report, [this massive, insightful deck](http://www.kpcb.com/internet-trends)
- [Choosing a Stemmer](https://www.elastic.co/guide/en/elasticsearch/guide/current/choosing-a-stemmer.html)
- Check documentation: [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html), [TF-IDF](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)