# Stop Words and Bag of N-grams

### Introduction

In this lesson, we'll see two features that we can add to our bag of words model: n-grams and stop words.

1. N-grams

One new feature is the use of n-grams.  As we know, so far we have encoded our documents simply by looking at the *counts* of the words in a document.  However, it's probably not too much of a stretch to consider that groupings or phrases of words may also be worth encoding.  This is the thought behind n-grams, where we encode not only the occurrence of a single token, but also a sequence of tokens.  

2. Stop Words

Stop words are "low value" words, like "the" and "a", that are removed from each document because they are believed to contribute little to a model's classifications.

In this lesson we'll learn both techniques with our newsgroups dataset.

### Eliminating Low Value Words

Let's begin by loading up our newsgroups dataset.

In [4]:
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')

documents = pd.Series(newsgroups_train['data'])
y = newsgroups_train['target']

Now let's take a look at our first document.

In [2]:
documents.iloc[0]

"From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n"

Look at the subject line of the document above:

`WHAT car is this!?`

We can see that some of the words, like `car`, are important for classifying this document.  But other words, like `is` and `this`, occur fairly frequently yet provide little information about the text.  One technique is simply to remove these words, which are called stop words.

> **Stop words** are commonly occurring words believed to contain little information that we remove from our document vectors.
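
To make the idea concrete, here is a minimal sketch in plain Python that filters the subject line by hand. The stop word set `TOY_STOP_WORDS` and the helper `remove_stop_words` are hypothetical names made up for this illustration, and the tokenization is deliberately simple; real libraries ship much longer lists.

```python
import re

# Hypothetical, hand-picked stop word list for illustration only.
TOY_STOP_WORDS = {"what", "is", "this", "the", "a"}

def remove_stop_words(text, stop_words=TOY_STOP_WORDS):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return [token for token in tokens if token not in stop_words]

kept = remove_stop_words("WHAT car is this!?")
print(kept)  # ['car']
```

Only `car` survives, which is the word we judged to be informative above.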

Natural language processing libraries often come with a list of predefined stop words.  For example, let's look at some of the stop words that sklearn has predefined for us in the `feature_extraction.text` module.

In [5]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

stop_words = list(ENGLISH_STOP_WORDS)
len(stop_words)
# 318

stop_words[:10]

['why',
 'show',
 'sometime',
 'i',
 'before',
 'which',
 'find',
 'herein',
 'together',
 'became']

If we want our CountVectorizer to ignore stop words, we can do so with the following:

In [6]:
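from sklearn.feature_extraction.text import CountVectorizer
# 'english' selects sklearn's built-in list (ENGLISH_STOP_WORDS);
# recent sklearn versions expect 'english', a list, or None here, not a frozenset
vectorizor = CountVectorizer(stop_words='english')
X = vectorizor.fit_transform(documents)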

In [7]:
X[:3]

<3x129796 sparse matrix of type '<class 'numpy.int64'>'
	with 263 stored elements in Compressed Sparse Row format>

We can get a sense of how the CountVectorizer reduced the text by calling `inverse_transform`.

> Remember that the original text contained the phrase "looked to be from the late 60s"; many of those words are now gone.

In [17]:
vectorizor.inverse_transform(X[0].toarray())

[array(['15', '60s', '70s', 'addition', 'body', 'bricklin', 'brought',
        'bumper', 'called', 'car', 'college', 'day', 'door', 'doors',
        'early', 'edu', 'engine', 'enlighten', 'funky', 'history', 'host',
        'il', 'info', 'know', 'late', 'lerxst', 'lines', 'looked',
        'looking', 'mail', 'maryland', 'model', 'neighborhood', 'nntp',
        'organization', 'park', 'posting', 'production', 'rac3', 'really',
        'rest', 'saw', 'separate', 'small', 'specs', 'sports', 'subject',
        'tellme', 'thanks', 'thing', 'umd', 'university', 'wam',
        'wondering', 'years'], dtype='<U180')]

### Bag of N-grams

Another feature that we can incorporate in our bag of words model is the use of n-grams, which gives us a bag of n-grams.  With n-grams we count not only how often a specific word occurs, but also how often a *sequence* of words occurs.  For example, in the code below we'll first break our text into sequences of length one and two, and then count how often each sequence occurs.

In [5]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizor = CountVectorizer(ngram_range = (1, 2))
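
To see what a range of `(1, 2)` means, here is a hand-rolled sketch in plain Python that builds the sequences of length one and two and counts them. The function `word_ngrams` is a hypothetical helper written for this illustration; it approximates, but does not reproduce, what CountVectorizer does internally (for instance, CountVectorizer's default tokenizer also drops single-character tokens).

```python
import re
from collections import Counter

def word_ngrams(text, min_n=1, max_n=2):
    """Return every run of min_n..max_n consecutive word tokens in the text."""
    tokens = re.findall(r"\w+", text.lower())
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the token list.
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams

ngram_counts = Counter(word_ngrams("what car is this"))
print(word_ngrams("what car is this"))
# ['what', 'car', 'is', 'this', 'what car', 'car is', 'is this']
```

Four tokens yield four sequences of length one and three of length two.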

If we look towards the end of each document vector, we now see pairs of words.

In [6]:
n_gram_vectors = vectorizor.fit_transform(documents[:20])

In [7]:
vectorizor.inverse_transform(n_gram_vectors[0])[0][-10:]

array(['please mail', 'mail thanks', 'thanks il', 'il brought',
       'brought to', 'to you', 'you by', 'by your', 'your neighborhood',
       'neighborhood lerxst'], dtype='<U55')

And when we feed this to a logistic regression model, it will now consider pairs of words in classifying the document.

> And if we look at the beginning of the document vector, we still see individual words, just as in a plain bag of words.

In [8]:
vectorizor.inverse_transform(n_gram_vectors[0])[0][:10]

array(['from', 'lerxst', 'wam', 'umd', 'edu', 'where', 'my', 'thing',
       'subject', 'what'], dtype='<U55')

So when we use `CountVectorizer(ngram_range = (1, 2))`, we represent each document in terms of counts of individual words and counts of two-word sequences.

And when we train a model, it will consider these sequences of words in making its classifications.

### Tradeoffs

With both stop words and n-grams, there are costs and benefits to employing them.  

* Stop words

With stop words, the idea is to remove features that are effectively just noise, and thus reduce variance in our model.  But how do we know that these features do not help a model discover an underlying signal?  We'll see that sometimes models perform better when we include the stop words.

* Bag of N-grams

With n-grams we can capture sequences of words, but the downside is that any specific sequence occurs rarely, so most of the new features appear in very few documents.  In practice, this means that we generally will not employ n-grams of length greater than 2.
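
Both tradeoffs can be seen on toy data. The sketch below uses made-up sentences (the lists `reviews` and `docs` are assumptions for illustration): first it shows two sentences with opposite meanings becoming identical once stop words are removed, since sklearn's English list includes `not`; then it counts how many two-word sequences occur only once.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Stop word tradeoff: removing 'not' erases the difference between these sentences.
reviews = ["the car is reliable", "the car is not reliable"]
with_stops = CountVectorizer().fit_transform(reviews).toarray()
without_stops = CountVectorizer(stop_words="english").fit_transform(reviews).toarray()

rows_differ_with_stops = bool((with_stops[0] != with_stops[1]).any())
rows_differ_without_stops = bool((without_stops[0] != without_stops[1]).any())
print(rows_differ_with_stops)     # True
print(rows_differ_without_stops)  # False

# N-gram tradeoff: most two-word sequences occur only once in the corpus.
docs = ["what car is this", "this car is my sports car"]
bigram_counts = CountVectorizer(ngram_range=(2, 2)).fit_transform(docs)
totals = np.asarray(bigram_counts.sum(axis=0)).ravel()
singletons = int((totals == 1).sum())
print(singletons, "of", len(totals), "bigrams occur once")  # 6 of 7 bigrams occur once
```

On real data the effect on model performance has to be measured; this only illustrates why neither technique is a guaranteed improvement.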

### Summary

In this lesson, we learned about two features that we can incorporate in our bag of words model.  The first is stop words, where we remove words that occur often but do not add much information about a document.  The second is n-grams, where we encode sequences of words as well as individual words.  We can use both features in our CountVectorizer with the following.

In [17]:
from sklearn.feature_extraction.text import CountVectorizer

# 'english' selects sklearn's built-in stop word list
vectorizor = CountVectorizer(ngram_range = (1, 2), stop_words = 'english')