In [None]:
# Run this every time you open the notebook
%load_ext autoreload
%autoreload 2
from collections import Counter
import lib

In [None]:
# Load the data.
# This function returns tweets and test_tweets, both lists of tweets
tweets, test_tweets = lib.read_data()

In previous notebooks, we implemented a Naive Bayes classifier on the data. Let's remind ourselves how well it performs:

In [None]:
categories = ['Energy', 'Food', 'Medical', 'Water', 'None']

probs = {}
for category in categories:
    prior_prob, token_prob = lib.calc_probs_single(tweets, category)
    probs[category] = (prior_prob, token_prob)

# Get average F1 score for the test set
predictions = [(tweet, lib.classify_nb_single(tweet, probs)) for tweet in test_tweets] # maps each test tweet to its predicted label
lib.evaluate(predictions)

Pretty good, right? :) We would still like to further improve the performance, though. Some questions you may have:
1. Are all words equally informative?
2. Words such as "*generator*" and "*generators*" seem to convey the same meaning. Can we merge them?

Next, we are going to play with three pre-processing steps (stop word removal, stemming, and lemmatization) to address these two questions.

## Stop word removal
Stop words, or function words (as opposed to *content words*), refer to commonly used words that are usually non-informative, such as "*the*", "*a*", or "*can*".

It is usually advantageous for the classifier to ignore these stop words, since they may add noise or cause numerical issues (e.g. underflow).
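
To make the underflow concern concrete, here is a minimal sketch that is independent of `lib`. The per-token probability and the token count are made-up values chosen only for illustration. It shows that a product of many small probabilities can collapse to zero in floating point, while the sum of their logarithms stays finite. Whether `lib` works with products or log-probabilities is not shown in this notebook.

```python
import math

# Hypothetical per-token likelihood and token count, for illustration only.
token_prob = 1e-5
num_tokens = 100

# Naive product of probabilities: the true value (1e-500) is too small
# for a double, so the result underflows to exactly 0.0.
product = 1.0
for _ in range(num_tokens):
    product *= token_prob

# Sum of log-probabilities: the same quantity on a log scale.
log_sum = sum(math.log(token_prob) for _ in range(num_tokens))

print(product)            # 0.0
print(round(log_sum, 2))  # -1151.29
```

Every token we drop removes one factor from such a product, which is one reason why filtering uninformative words can help numerically.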

The `nltk` package provides a list of English stop words, and we can remove them from our data with a simple membership test. You can think of this test as a *rule-based classifier* that decides whether a word is a stop word by looking it up in a blacklist (i.e. the list of stop words).

Let's first look at some examples of stop words:

In [None]:
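import nltk
from nltk.corpus import stopwords

# Uncomment and run the following line when you run this cell for the first time:
# nltk.download('stopwords')

eng_stopwords = set(stopwords.words('english'))
# Look at some stop words (a set is unordered, so the sample may differ between runs)
print("Here are some example stopwords:")
for i, word in enumerate(eng_stopwords):
    if i > 10:
        break
    print(word)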

Here is an example of filtering a tweet using the stop word list:

In [None]:
tweet = tweets[0]
tokens = tweet.tokenSet
print('all tokens:\n', tokens, '\n')

filtered_tokens = set()
deleted_tokens = set()

for token in tokens:
    if token in eng_stopwords:
        deleted_tokens.add(token)
    else:
        filtered_tokens.add(token)

print('filtered_tokens:\n', filtered_tokens, '\n')
print('deleted_tokens:\n', deleted_tokens)

And now let's see if removing stop words actually helps with the classification performance:

In [None]:
categories = ['Energy', 'Food', 'Medical', 'Water', 'None']

probs = {}
for category in categories:
    prior_prob, token_prob = lib.calc_probs_single(tweets, category, stop_words=eng_stopwords)
    probs[category] = (prior_prob, token_prob)

# Get average F1 score for the test set
predictions = [(tweet, lib.classify_nb_single(tweet, probs, stop_words=eng_stopwords)) for tweet in test_tweets] # maps each test tweet to its predicted label
lib.evaluate(predictions)

Compare these results with the previous ones. Does stop word removal help?

## Stemming and Lemmatization

Remember that the goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.

The main difference is that stemming looks at the word in isolation, while lemmatization also takes context (such as the word's part of speech) into account. Either way, implementing this pre-processing step by hand would be tedious. Luckily, `nltk` provides tools for both.
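
As a rough illustration of what "taking context into account" can mean, here is a toy lemmatizer that looks up a word together with its part of speech. The table `LEMMAS` and the function `toy_lemmatize` are hypothetical names invented for this sketch; real lemmatizers rely on a full dictionary rather than a handful of entries.

```python
# Toy lemma table keyed by (word, part of speech); entries are illustrative only.
LEMMAS = {
    ("saw", "v"): "see",    # "I saw a generator"
    ("saw", "n"): "saw",    # "I need a saw"
    ("feet", "n"): "foot",
    ("went", "v"): "go",
}

def toy_lemmatize(word, pos):
    """Look up the word with its part of speech; fall back to the word itself."""
    return LEMMAS.get((word, pos), word)

print(toy_lemmatize("saw", "v"))     # see
print(toy_lemmatize("saw", "n"))     # saw
print(toy_lemmatize("feet", "n"))    # foot
print(toy_lemmatize("tweets", "n"))  # tweets (not in the toy table)
```

The same surface form "saw" maps to two different lemmas depending on how it is used, which is something a stemmer that only sees the characters of the word cannot do.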

### Stemming using the Porter stemmer
*Porter's algorithm*, published in 1980, is a classic stemmer that is still used today.

In [None]:
from nltk.stem.porter import PorterStemmer

# Get the Porter stemmer
stemmer = PorterStemmer()

# Let's try stemming some plurals
plurals = ['apples', 'batteries', 'generators', 'medicines', 'tests', 'feet']
print('plurals:')
for plural in plurals:
    print('{:s} --> {:s}'.format(plural, stemmer.stem(plural)))
print()

# and some inflected verb forms
verbs = ['studies', 'thinks', 'goes', 'played', 'bought', 'went', 'ran', 'drew']
print('verbs:')
for verb in verbs:
    print('{:s} --> {:s}'.format(verb, stemmer.stem(verb)))

You can add more words to `plurals` and see what the stemming results look like.  
You may find that the results look a bit mechanical. This is because Porter's algorithm is essentially a sequential application of a set of rules. To get better-looking results, let's try out a lemmatizer.
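
To get a feel for why rule-based output looks mechanical, here is a toy suffix-stripping stemmer. Its four rules are modelled on the first step of Porter's algorithm, but this is only a sketch of a single step, not the full algorithm that `nltk` implements. The names `RULES` and `toy_stem` are invented for this example.

```python
# Ordered (suffix, replacement) rules; a toy subset, NOT the full Porter rule set.
RULES = [
    ("sses", "ss"),  # caresses -> caress
    ("ies", "i"),    # batteries -> batteri
    ("ss", "ss"),    # glass -> glass (protects a double s from the next rule)
    ("s", ""),       # generators -> generator
]

def toy_stem(word):
    """Apply the first rule whose suffix matches the word, then stop."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word  # no rule matched

for w in ["batteries", "generators", "glass", "feet"]:
    print(w, "-->", toy_stem(w))
# batteries --> batteri
# generators --> generator
# glass --> glass
# feet --> feet
```

The rules know nothing about the vocabulary, so "batteries" becomes the non-word "batteri" and the irregular plural "feet" is left alone. The real algorithm chains several such steps one after another.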

In [None]:
# Uncomment and run the following line when you run this cell for the first time:
# nltk.download('wordnet')

from nltk.stem.wordnet import WordNetLemmatizer

# Get the lemmatizer
lmtzr = WordNetLemmatizer()

# Lemmatize the plurals
print('plurals:')
for plural in plurals:
    print('{:s} --> {:s}'.format(plural, lmtzr.lemmatize(plural)))
print()

# Lemmatize the verbs
print('verbs:')
for verb in verbs:
    print('{:s} --> {:s}'.format(verb, lmtzr.lemmatize(verb)))

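Not yet perfect, but much better, especially for the plurals. Hooray! :) Irregular and past-tense verb forms may come back unchanged because `lemmatize` treats every word as a noun by default; passing a part of speech (for example, `pos='v'`) tells it otherwise.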

As before, let's check whether stemming or lemmatization can help with our classification task.

In [None]:
# Stemming
probs = {}
for category in categories:
    prior_prob, token_prob = lib.calc_probs_single(tweets, category, stemmer=stemmer)
    probs[category] = (prior_prob, token_prob)

# Get average F1 score for the test set
predictions = [(tweet, lib.classify_nb_single(tweet, probs, stemmer=stemmer)) for tweet in test_tweets] # maps each test tweet to its predicted label
lib.evaluate(predictions)

In [None]:
# Lemmatization
probs = {}
for category in categories:
    prior_prob, token_prob = lib.calc_probs_single(tweets, category, lmtzr=lmtzr)
    probs[category] = (prior_prob, token_prob)

# Get average F1 score for the test set
predictions = [(tweet, lib.classify_nb_single(tweet, probs, lmtzr=lmtzr)) for tweet in test_tweets] # maps each test tweet to its predicted label
lib.evaluate(predictions)

There's some improvement, not bad! 

Now let's try using these tricks together, i.e. combining stop word removal with stemming or lemmatization. We don't need both stemming and lemmatization, since they are alternatives that serve the same purpose.
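
Before running the combined experiments, it may help to see one plausible way of chaining the two steps. The sketch below is not how `lib` is implemented (its internals are not shown in this notebook); `preprocess`, `strip_plural`, and the tiny stop list are hypothetical stand-ins for `eng_stopwords` and `stemmer.stem` or `lmtzr.lemmatize`.

```python
def preprocess(tokens, stop_words=None, normalize=None):
    """Drop stop words first, then map each remaining token to a base form."""
    result = set()
    for token in tokens:
        if stop_words is not None and token in stop_words:
            continue  # filter on the original form, before it is altered
        result.add(normalize(token) if normalize is not None else token)
    return result

def strip_plural(word):
    """Hypothetical stand-in for a stemmer or lemmatizer: drop a final 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

# Hypothetical mini stop list; the notebook uses nltk's English list.
stop_words = {"the", "are", "and"}

tokens = {"the", "generators", "and", "generator", "are", "needed"}
print(sorted(preprocess(tokens, stop_words, strip_plural)))
# ['generator', 'needed']
```

Filtering before normalizing is a deliberate choice in this sketch: a stemmer can alter a stop word so that it no longer matches the list, so the lookup is done on the original form.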

In [None]:
# Stop word removal + stemming
probs = {}
for category in categories:
    prior_prob, token_prob = lib.calc_probs_single(tweets, category, stop_words=eng_stopwords, stemmer=stemmer)
    probs[category] = (prior_prob, token_prob)

# Get average F1 score for the test set
predictions = [(tweet, lib.classify_nb_single(tweet, probs, stop_words=eng_stopwords, stemmer=stemmer)) for tweet in test_tweets] # maps each test tweet to its predicted label
lib.evaluate(predictions)

In [None]:
# Stop word removal + Lemmatization
probs = {}
for category in categories:
    prior_prob, token_prob = lib.calc_probs_single(tweets, category, stop_words=eng_stopwords, lmtzr=lmtzr)
    probs[category] = (prior_prob, token_prob)

# Get average F1 score for the test set
predictions = [(tweet, lib.classify_nb_single(tweet, probs, stop_words=eng_stopwords, lmtzr=lmtzr)) for tweet in test_tweets] # maps each test tweet to its predicted label
lib.evaluate(predictions)

Does using several tricks together always work better than using one of them alone? Why do you think that is the case?