## Stop Words

Text documents often contain many occurrences of the same word. For example, in a document written in _English_, words such as _a_, _the_, _of_, and _it_ likely occur very frequently. When classifying a document based on the number of times specific words occur in it, these words can bias the result, especially since they are common in **all** text documents you might want to classify. The concept of [_stop words_](https://en.wikipedia.org/wiki/Stop_words) addresses this problem: stop words are the most commonly occurring words, and they are removed during tokenization in order to improve subsequent classification.

We can easily specify that the __English__ stop words should be excluded during tokenization by using the `stop_words` parameter. Note that _stop word_ dictionaries for other languages, or even specific domains, exist and can be used instead. We demonstrate the removal of stop words by using a `CountVectorizer` in the following simple example.

-----

In [1]:
# Import the vectorizer and a pretty printer for the output
import pprint
from sklearn.feature_extraction.text import CountVectorizer

# Sample sentence to tokenize
my_text = 'This module introduced many concepts in text analysis.'

# cv1 keeps every token; cv2 drops the built-in English stop words
cv1 = CountVectorizer(lowercase=True)
cv2 = CountVectorizer(stop_words='english', lowercase=True)

# build_analyzer() returns the callable that preprocesses and tokenizes text
tk_func1 = cv1.build_analyzer()
tk_func2 = cv2.build_analyzer()

pp = pprint.PrettyPrinter(indent=2, depth=1, width=80, compact=True)

print('Tokenization:')
pp.pprint(tk_func1(my_text))

print()

print('Tokenization (with Stop words):')
pp.pprint(tk_func2(my_text))

Tokenization:
['this', 'module', 'introduced', 'many', 'concepts', 'in', 'text', 'analysis']

Tokenization (with Stop words):
['module', 'introduced', 'concepts', 'text', 'analysis']
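
To see what the `'english'` setting removes, the following sketch checks a few words against scikit-learn's built-in list, which is exposed as `ENGLISH_STOP_WORDS`, and then passes a small hand-written list to `stop_words` instead. The list `my_stop_words` is made up for illustration and is not a recommended dictionary.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

my_text = 'This module introduced many concepts in text analysis.'

# The built-in English list is a frozenset of lowercase words
candidates = ['this', 'many', 'in', 'module']
removed = sorted(w for w in candidates if w in ENGLISH_STOP_WORDS)
print(removed)

# A custom (hypothetical) stop word list can be passed as a plain list
my_stop_words = ['this', 'in', 'module']
custom_analyzer = CountVectorizer(stop_words=my_stop_words).build_analyzer()
custom_tokens = custom_analyzer(my_text)
print(custom_tokens)
```

With the custom list, _many_ survives tokenization while _module_ is dropped, which shows that the choice of stop words is a modeling decision rather than a fixed property of the language.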


## Stemming

So far, we have looked at several techniques to remove redundant or unimportant features. For example, we converted all text to lowercase and removed stop words. However, there is still the issue of different forms of the same word, for example _compute_, _computer_, _computed_, and _computing_. [Stemming](https://en.wikipedia.org/wiki/Stemming) is the process of reducing words to their root, or basic form (by removing prefixes and suffixes), so that token frequencies reflect the use of the root token rather than being spread across multiple similar tokens.

The most widely used stemmer, or program/method that performs stemming, is the _Porter Stemmer_, which was originally published in 1980 by Martin Porter. An improved version was released in 2000, which fixed a number of errors. NLTK includes the Porter Stemmer, which can be used with scikit-learn by creating a special function that tokenizes text documents and passing this function to the `CountVectorizer` via the `tokenizer` parameter. By performing stemming inside this tokenize function, we can return a set of tokens for a document that have been stemmed. In the following code cells, we load the NLTK movie review corpus, then apply the Porter stemmer first to a list of example words and then to the tokens that `nltk.word_tokenize` produces for a sample sentence.
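
A minimal sketch of the tokenizer approach described above follows. It assumes a simplified regular-expression tokenizer in place of `nltk.word_tokenize` (so that no NLTK data download is needed); the function name `tokenize`, the variable `stem_cv`, and the two sample sentences are chosen for illustration only.

```python
import re
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

stemmer = PorterStemmer()

def tokenize(text):
    # Simplified tokenization (an assumption for this sketch):
    # lowercase the text and keep runs of two or more letters.
    tokens = re.findall(r'[a-z]{2,}', text.lower())
    # Map the Porter stemmer over the list of tokens
    return [stemmer.stem(token) for token in tokens]

docs = ['She computed the result.',
        'Computing results is what computers do.']

# Passing the function via `tokenizer` replaces the default tokenization
stem_cv = CountVectorizer(tokenizer=tokenize)
counts = stem_cv.fit_transform(docs)

print(sorted(stem_cv.vocabulary_))
print(counts.toarray())
```

Because _computed_, _computing_, and _computers_ all reduce to the same stem, their counts land in a single column instead of three. Recent scikit-learn versions may warn that `token_pattern` is unused when a custom tokenizer is supplied.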

-----


In [2]:
from sklearn.datasets import load_files

# Location of the NLTK movie review corpus on this system
data_dir = '/usr/share/nltk_data/corpora/movie_reviews'

# Load the reviews; each subdirectory name becomes a class label
mvr = load_files(data_dir, shuffle = False)
print('Number of Reviews: {0}'.format(len(mvr.data)))

Number of Reviews: 2000


In [21]:
# A Bunch is dictionary-like, so len() counts its keys, not the reviews
print(len(mvr))

5


In [4]:
print(mvr.data[:2])

[b'plot : two teen couples go to a church party , drink and then drive . \nthey get into an accident . \none of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . \nwhat\'s the deal ? \nwatch the movie and " sorta " find out . . . \ncritique : a mind-fuck movie for the teen generation that touches on a very cool idea , but presents it in a very bad package . \nwhich is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such ( lost highway & memento ) , but there are good and bad ways of making all types of films , and these folks just didn\'t snag this one correctly . \nthey seem to have taken this pretty neat concept , but executed it terribly . \nso what are the problems with the movie ? \nwell , its main problem is that it\'s simply too jumbled . \nit starts off " normal " but then downshifts into this " fantasy " world in which you , as an audience mem

In [23]:
print(mvr.target[:500])

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 

In [24]:
from sklearn.model_selection import train_test_split

mvr_train, mvr_test, y_train, y_test = train_test_split(
    mvr.data, mvr.target, test_size=0.25, random_state=23)

print('Number of Reviews in train: {0}'.format(len(mvr_train)))
print('Number of Reviews in test: {0}'.format(len(mvr_test)))
print(y_train)

Number of Reviews in train: 1500
Number of Reviews in test: 500
[1 1 0 ..., 1 0 0]


In [26]:
#print(mvr_test)



In [7]:
import string
import nltk
from nltk.stem.porter import PorterStemmer

example_words = ["python","pythoner","pythoning","pythoned","pythonly"]
stemmer = PorterStemmer()

for w in example_words:
    print(stemmer.stem(w))

python
python
python
python
pythonli


In [8]:
new_text = "It is important to be very pythonly while you are pythoning with python. \
All pythoners have pythoned poorly at least once."

tokens = nltk.word_tokenize(new_text)

print(tokens)

print("===============")

tokens = [token for token in tokens if token not in string.punctuation]

for w in tokens:
    print(stemmer.stem(w))

['It', 'is', 'important', 'to', 'be', 'very', 'pythonly', 'while', 'you', 'are', 'pythoning', 'with', 'python', '.', 'All', 'pythoners', 'have', 'pythoned', 'poorly', 'at', 'least', 'once', '.']
It
is
import
to
be
veri
pythonli
while
you
are
python
with
python
all
python
have
python
poorli
at
least
onc


-----

## Classification

We have identified the features (the tokens in the training documents) that we can use to classify the documents. Before we apply a classification technique to the newsgroups data, be aware that many issues might affect the classification process. In the context of this notebook, the data is similar to email messages. You may therefore want to exclude email address information (such as _com_ and _edu_), proper names, and items such as dates and monetary amounts. The content in some categories will also clearly overlap, such as _alt.atheism_ and _soc.religion.christian_.

Issues like this demonstrate the **need** for manual intervention and introspection during the machine learning process. You would want to continually analyze classification results to ensure you understand what is occurring and why it is occurring.
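
One way to reduce this kind of noise before tokenization is a small cleanup pass. The sketch below is illustrative only: `clean_message` and its regular expressions are hypothetical helpers, not part of scikit-learn or NLTK, and real messages will need more careful patterns.

```python
import re

# Hypothetical, deliberately simple patterns for illustration only
EMAIL_RE = re.compile(r'\S+@\S+')
MONEY_RE = re.compile(r'\$\d+(?:\.\d+)?')
DATE_RE = re.compile(r'\b\d{1,2}/\d{1,2}/\d{2,4}\b')

def clean_message(text):
    """Strip email addresses, dollar amounts, and numeric dates from text."""
    for pattern in (EMAIL_RE, MONEY_RE, DATE_RE):
        text = pattern.sub(' ', text)
    # Collapse the whitespace left behind by the substitutions
    return ' '.join(text.split())

message = 'Contact jane@example.edu by 3/14/1995 about the $25.50 fee.'
cleaned = clean_message(message)
print(cleaned)
```

For the newsgroups data specifically, `fetch_20newsgroups` also accepts a `remove` argument (for example `remove=('headers', 'footers', 'quotes')`) that strips much of this metadata at load time.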

-----

### Naive Bayes Classifier

One of the simplest techniques for performing text classification is the [Naive Bayes classifier](https://en.wikipedia.org/wiki/Naive_Bayes_classifier). Fundamentally, this method applies Bayes' theorem while (naively) assuming independence between the features. In scikit-learn, we will use a [Multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html) model, where we treat each feature independently. Thus we calculate the likelihood of each feature given each training label, and the accumulation of these likelihoods provides our overall classification. By working with log-likelihoods, this accumulation becomes a simple sum.
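
The sum of log-likelihoods can be made concrete with a small hand-rolled version. This is a sketch on a made-up toy corpus, not scikit-learn's implementation; it assumes additive (Laplace) smoothing with `alpha=1.0`, the same default value that `MultinomialNB` uses, and the names `log_score` and `predict` are hypothetical.

```python
import math
from collections import Counter

# Toy training data (made up for this sketch): tokenized documents and labels
train_docs = [['goal', 'team', 'win'], ['team', 'coach', 'goal'],
              ['vote', 'law', 'senate'], ['law', 'vote', 'vote']]
train_labels = ['sports', 'sports', 'politics', 'politics']

vocab = {tok for doc in train_docs for tok in doc}
labels = sorted(set(train_labels))

# Per-label token counts and per-label document counts
token_counts = {label: Counter() for label in labels}
doc_counts = Counter(train_labels)
for doc, label in zip(train_docs, train_labels):
    token_counts[label].update(doc)

def log_score(doc, label, alpha=1.0):
    """Log prior plus the sum of smoothed per-token log-likelihoods."""
    total = sum(token_counts[label].values())
    score = math.log(doc_counts[label] / len(train_docs))
    for tok in doc:
        if tok in vocab:  # tokens unseen in training are ignored
            likelihood = (token_counts[label][tok] + alpha) / (total + alpha * len(vocab))
            score += math.log(likelihood)
    return score

def predict(doc):
    # The label with the largest accumulated log score wins
    return max(labels, key=lambda label: log_score(doc, label))

prediction = predict(['team', 'goal', 'vote'])
print(prediction)
```

Working in log space also avoids numerical underflow, since multiplying thousands of small probabilities directly would quickly round to zero.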

-----

In [9]:
# Load the predefined training and testing subsets of the newsgroups data
from sklearn.datasets import fetch_20newsgroups

train = fetch_20newsgroups(data_home='/dsa/data/DSA-8630/newsgroups/', subset='train',
                           shuffle=True, random_state=23)

test = fetch_20newsgroups(data_home='/dsa/data/DSA-8630/newsgroups/', subset='test',
                          shuffle=True, random_state=23)

In [27]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

cv = CountVectorizer()

train_counts = cv.fit_transform(train['data']) # returns a Document-Term Matrix
train_counts

<11314x130107 sparse matrix of type '<class 'numpy.int64'>'
	with 1787565 stored elements in Compressed Sparse Row format>

In [28]:
print(train['target'])

[19 18  9 ...,  9  3  7]


In [11]:
print(train_counts[:2])

  (0, 69782)	1
  (0, 114422)	1
  (0, 107541)	1
  (0, 89074)	1
  (0, 26076)	1
  (0, 41235)	1
  (0, 26056)	1
  (0, 33254)	1
  (0, 28626)	1
  (0, 41228)	1
  (0, 91138)	1
  (0, 67583)	1
  (0, 119701)	2
  (0, 42876)	1
  (0, 96452)	1
  (0, 125265)	2
  (0, 37565)	1
  (0, 62821)	1
  (0, 95156)	1
  (0, 55011)	2
  (0, 107539)	1
  (0, 90686)	1
  (0, 76902)	1
  (0, 84681)	1
  (0, 99381)	1
  :	:
  (1, 28601)	8
  (1, 114418)	1
  (1, 56283)	15
  (1, 32311)	8
  (1, 115475)	48
  (1, 68532)	3
  (1, 111533)	5
  (1, 62221)	2
  (1, 68766)	1
  (1, 29620)	15
  (1, 87949)	5
  (1, 124616)	9
  (1, 124055)	3
  (1, 114455)	90
  (1, 28146)	66
  (1, 125271)	1
  (1, 50527)	3
  (1, 29573)	6
  (1, 66608)	21
  (1, 76032)	1
  (1, 90379)	1
  (1, 89362)	66
  (1, 111322)	2
  (1, 41105)	3
  (1, 56979)	5
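
Each line of the sparse output above has the form `(document row, vocabulary column)  count`. The sketch below shows, on a made-up two-document corpus, how a column index can be mapped back to its token through the fitted `vocabulary_` attribute; the names `toy_docs`, `toy_cv`, and `index_to_token` are chosen for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ['the cat sat', 'the cat saw the dog']

toy_cv = CountVectorizer()
toy_counts = toy_cv.fit_transform(toy_docs)

# vocabulary_ maps each token to its column index in the matrix
index_to_token = {index: token for token, index in toy_cv.vocabulary_.items()}

# toarray() is fine for a toy matrix; avoid it on the full newsgroups matrix
dense_row = toy_counts.toarray()[1]
doc_counts = {index_to_token[col]: int(count)
              for col, count in enumerate(dense_row) if count > 0}
print(doc_counts)
```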


In [12]:
test_data = cv.transform(test['data'])
test_data

<7532x130107 sparse matrix of type '<class 'numpy.int64'>'
	with 1107956 stored elements in Compressed Sparse Row format>

In [13]:
nb = MultinomialNB()

clf = nb.fit(train_counts, train['target'])
predicted = clf.predict(test_data)
print(predicted)

[15  0 12 ...,  8  1  1]


In [14]:
print("NB prediction accuracy = {0:5.1f}%".format(100.0 * clf.score(test_data, test['target'])))

NB prediction accuracy =  77.3%


The steps above can also be implemented with the `Pipeline` class in scikit-learn. [Pipelines](http://zacstewart.com/2014/08/05/pipelines-of-featureunions-of-pipelines.html) allow you to chain transformers and estimators together so that you can use them as a single unit. Here the vectorizer => classifier sequence becomes easier to work with. Calling `fit()` on the pipeline below first fits the `CountVectorizer`, which learns the vocabulary dictionary of all tokens in the input data `train['data']`, and then fits the classifier on the resulting counts.

In [15]:
from sklearn.pipeline import Pipeline

tools = [('cv', CountVectorizer()), ('nb', MultinomialNB())]
clf = Pipeline(tools)

clf = clf.fit(train['data'], train['target'])
predicted = clf.predict(test['data'])

print("NB prediction accuracy = {0:5.1f}%".format(100.0 * clf.score(test['data'], test['target'])))

NB prediction accuracy =  77.3%
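
The identical accuracy is expected, since the pipeline performs the same two steps internally. The following sketch checks this equivalence on a made-up toy corpus (the documents, labels, and variable names are assumptions for illustration) and shows that the fitted steps stay accessible by name.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

toy_docs = ['goal team win', 'team coach goal', 'vote law senate', 'law vote vote']
toy_labels = [0, 0, 1, 1]
new_docs = ['team goal', 'senate vote']

# Manual two-step version: vectorize, then classify
manual_cv = CountVectorizer()
manual_nb = MultinomialNB().fit(manual_cv.fit_transform(toy_docs), toy_labels)
manual_pred = manual_nb.predict(manual_cv.transform(new_docs))

# Pipeline version: one object handles both steps
toy_clf = Pipeline([('cv', CountVectorizer()), ('nb', MultinomialNB())])
toy_clf.fit(toy_docs, toy_labels)
pipe_pred = toy_clf.predict(new_docs)

# Fitted steps remain accessible by the names given in the step list
vocab_size = len(toy_clf.named_steps['cv'].vocabulary_)
print(manual_pred, pipe_pred, vocab_size)
```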


## TF-IDF

Previously, we have simply used the number of times a token (i.e., word, or more generally an n-gram) occurs in a document to classify the document. Even with the removal of stop words, however, this can still overemphasize tokens that might generally occur across many documents (e.g., names or general concepts). An alternative technique that often provides robust improvements in classification accuracy is to employ the frequency of token occurrence, normalized over the frequency with which the token occurs in all documents. In this manner, we give higher weight in the classification process to tokens that are more strongly tied to a particular label. 

Formally, this concept is known as [term frequency–inverse document frequency](https://en.wikipedia.org/wiki/Tf–idf) (or tf-idf). scikit-learn provides this functionality via the [TfidfTransformer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html), which can either follow a count-based vectorizer such as `CountVectorizer`, or be combined with it into a single transformer, the [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer).
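
The weighting can be made concrete with a small hand computation. The sketch below assumes the smoothed form idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization of each document, which is what the scikit-learn documentation describes as the `TfidfTransformer` default; the toy corpus and the helper names `idf` and `tfidf` are made up for illustration.

```python
import math
from collections import Counter

# Toy tokenized corpus (made up for this sketch)
docs = [['team', 'goal', 'team'], ['team', 'vote'], ['vote', 'law']]
n_docs = len(docs)

# Document frequency: number of documents that contain each token
df = Counter(tok for doc in docs for tok in set(doc))

def idf(tok):
    # Smoothed inverse document frequency (assumed formula, see above)
    return math.log((1 + n_docs) / (1 + df[tok])) + 1

def tfidf(doc):
    """Return L2-normalized tf-idf weights for one tokenized document."""
    tf = Counter(doc)
    weights = {tok: count * idf(tok) for tok, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {tok: w / norm for tok, w in weights.items()}

weights = tfidf(docs[0])
print(weights)
print(idf('goal'), idf('team'))
```

Here _goal_ appears in one document and _team_ in two, so _goal_ receives the larger idf: a token that is rare across the corpus counts for more per occurrence than one that is common.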

-----

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer

tools = [('tf', TfidfVectorizer()), ('nb', MultinomialNB())]
clf = Pipeline(tools)

# Pipeline.set_params() sets parameters on the steps of the pipeline.
# The method works on simple estimators as well as on nested objects
# (such as pipelines). Nested parameters have names of the form
# <component>__<parameter>, so tf__stop_words sets the stop_words
# parameter of the step named 'tf' (the TfidfVectorizer).
clf.set_params(tf__stop_words = 'english')

clf = clf.fit(train['data'], train['target'])
predicted = clf.predict(test['data'])

print("NB (TF-IDF with Stop Words) prediction accuracy = {0:5.1f}%"\
      .format(100.0 * clf.score(test['data'], test['target'])))

NB (TF-IDF with Stop Words) prediction accuracy =  81.7%


----

### Logistic Regression

[Logistic Regression](https://en.wikipedia.org/wiki/Logistic_regression) is typically employed to predict categorical variables, such as yes/no decisions or win/loss likelihoods. In the case of many labels, we can use the fact that logistic regression quantifies the likelihood that a vector is in or out of a particular category. Thus, by computing this likelihood for every category, we can determine the best label for each test vector. [scikit-learn](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression) provides an implementation that can be easily used for our classification problem.
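
The in-or-out idea can be sketched in a few lines. The weights and biases below are hypothetical values chosen for illustration (in practice they are learned from the training data), and how `LogisticRegression` handles multiple classes internally depends on its settings and version, so this illustrates the idea in the text rather than the library's exact procedure.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical per-category weights and biases over three features
models = {
    'sports':   ([2.0, -1.0, 0.0], -0.5),
    'politics': ([-1.5, 2.5, 0.0], -0.5),
    'science':  ([0.0, 0.0, 1.0], -1.0),
}

def category_scores(x):
    """Likelihood that x is 'in' each category, scored independently."""
    return {label: sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            for label, (weights, bias) in models.items()}

def classify(x):
    # The category with the highest in-category likelihood is the label
    scores = category_scores(x)
    return max(scores, key=scores.get)

x = [0.2, 0.9, 0.1]
best = classify(x)
print(category_scores(x))
print(best)
```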

-----

In [17]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfTransformer

clf = Pipeline([('vect', CountVectorizer(stop_words = 'english')),
                ('tfidf', TfidfTransformer()),
                ('lr', LogisticRegression())])


clf = clf.fit(train['data'], train['target'])
predicted = clf.predict(test['data'])

print("LR (TF-IDF with Stop Words) prediction accuracy = {0:5.1f}%".\
      format(100.0 * clf.score(test['data'], test['target'])))


LR (TF-IDF with Stop Words) prediction accuracy =  83.0%
