In [1]:
# Run this every time you open the notebook
%load_ext autoreload
%autoreload 2
from collections import Counter
import lib

In [2]:
# Load the data.
# This function returns tweets and test_tweets, both lists of tweets
tweets, test_tweets = lib.read_data()

In previous notebooks, we implemented a Naive Bayes classifier on the data. Let's remind ourselves how well it performs:

In [3]:
categories = ['Energy', 'Food', 'Medical', 'Water', 'None']

probs = {}
for category in categories:
    prior_prob, token_prob = lib.calc_probs_single(tweets, category)
    probs[category] = (prior_prob, token_prob)

# Get average F1 score for the test set
predictions = [(tweet, lib.classify_nb_single(tweet, probs)) for tweet in test_tweets] # maps each test tweet to its predicted label
lib.evaluate(predictions)

Energy
Precision:  54.054054054054056
Recall:  50.0
F1:  51.94805194805195

Food
Precision:  76.58227848101266
Recall:  93.7984496124031
F1:  84.3205574912892

Medical
Precision:  100.0
Recall:  38.46153846153846
F1:  55.55555555555556

None
Precision:  79.45205479452055
Recall:  73.41772151898734
F1:  76.3157894736842

Water
Precision:  75.0
Recall:  30.0
F1:  42.857142857142854

Average F1:  62.19941946514475
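
As a sanity check on how these numbers relate to each other, here is a minimal sketch that recomputes F1 as the harmonic mean of precision and recall, and the average as the unweighted mean of the per-category F1 scores. The helper names `f1_score` and `macro_average` are made up for illustration and are not part of `lib`; that `lib.evaluate` computes its scores this way is an assumption, although the figures printed above are consistent with it.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_average(scores):
    """Unweighted mean of the per-category scores."""
    return sum(scores) / len(scores)


# (precision, recall) pairs copied from the output above
results = {
    'Energy': (54.054054054054056, 50.0),
    'Food': (76.58227848101266, 93.7984496124031),
    'Medical': (100.0, 38.46153846153846),
    'None': (79.45205479452055, 73.41772151898734),
    'Water': (75.0, 30.0),
}

f1_by_category = {c: f1_score(p, r) for c, (p, r) in results.items()}
for category, f1 in f1_by_category.items():
    print('{:8s} F1: {:.2f}'.format(category, f1))
print('Average F1: {:.2f}'.format(macro_average(list(f1_by_category.values()))))
```

Because every category counts equally in this average, a weak score on a small category such as Water pulls the overall figure down as much as a weak score on a large one.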


Pretty good, right? :) We would like to further improve the performance, though. Some questions you may have:
1. Are all words equally informative?
2. Words such as "*generator*" and "*generators*" seem to convey the same meaning. Can we merge them?

Next, we are going to play with three pre-processing steps to address these two questions.

## Stop word removal
Stop words, or function words (as opposed to *content words*), refer to commonly used words that are usually non-informative, such as "*the*", "*a*", or "*can*".

It is usually advantageous for the classifier to ignore these stop words, since they may add noise or cause numerical issues (e.g. underflow).
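
To make the underflow concern concrete, the sketch below multiplies many small probabilities together, which is what a Naive Bayes score amounts to for a long token list, and compares the result with the sum of their logarithms. The probability value and the token count are made-up numbers for illustration, and whether `lib` works in log space is not something this notebook shows.

```python
import math

# Hypothetical per-token probabilities: 100 tokens, each with probability 1e-5
token_probs = [1e-5] * 100

# Multiplying them directly underflows to 0.0 in double precision
product = 1.0
for p in token_probs:
    product *= p

# Summing logarithms keeps the score representable
log_score = sum(math.log(p) for p in token_probs)

print('product of probabilities:', product)
print('sum of log probabilities:', log_score)
```

Every token that is dropped removes one factor from the product, so filtering out stop words also shortens it.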

The `nltk` package provides a list of English stop words. We can remove them from our data with simple membership tests, which can be viewed as a *rule-based classifier*: it decides whether a word is a stop word by looking it up in a blacklist (i.e. the list of stop words).

Let's first look at some examples of stop words:

In [6]:
import nltk
from nltk.corpus import stopwords

eng_stopwords = set(stopwords.words('english'))
# look at some stopwords
print("Here are some example stopwords:")
for i,word in enumerate(eng_stopwords):
    if i>10:
        break
    print(word)

Here are some example stopwords:
more
during
can
at
isn
m
shan
s
in
an
our


Here is an example of filtering a tweet using the stop word list:

In [9]:
# No need to add code to this cell, just run it! :)

tweet = tweets[0]
tokens = tweet.tokenSet
print('all tokens:\n', tokens, '\n')

filtered_tokens = set()
deleted_tokens = set()

for token in tweet.tokenSet:
    if token in eng_stopwords:
        deleted_tokens.add(token)
    else:
        filtered_tokens.add(token)

print('filtered_tokens:\n', filtered_tokens, '\n')
print('deleted_tokens:\n', deleted_tokens)

# Write your solutions to the following questions:
# Q1: What is tokens?
#

# Q2: What is filtered_tokens?
#

# Q3: What is deleted_tokens?
#

all tokens:
 {'anymore', 'the', 'monday..', 'structural', 'finally', 'before', 'got', 'huge', 'this', 'storm..still', 'house', 'new', 'done..', 'damage', 'tree', 'we', 'is', 'off'} 

filtered_tokens:
 {'anymore', 'monday..', 'structural', 'finally', 'got', 'huge', 'storm..still', 'house', 'new', 'done..', 'damage', 'tree'} 

deleted_tokens:
 {'the', 'before', 'this', 'we', 'is', 'off'}
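
Because the tokens are stored in a set, the same split can also be written with set operations instead of a loop. A minimal sketch, using the token set printed above and a small hand-written stop list (`toy_stopwords` is a stand-in for `eng_stopwords`, so that the snippet runs without `nltk`):

```python
# Token set copied from the output above
tokens = {'anymore', 'the', 'monday..', 'structural', 'finally', 'before',
          'got', 'huge', 'this', 'storm..still', 'house', 'new', 'done..',
          'damage', 'tree', 'we', 'is', 'off'}

# Hand-written stand-in for eng_stopwords (an assumption, not the nltk list)
toy_stopwords = {'the', 'before', 'this', 'we', 'is', 'off', 'a', 'an', 'can'}

filtered_tokens = tokens - toy_stopwords  # set difference
deleted_tokens = tokens & toy_stopwords   # set intersection

print('filtered_tokens:', sorted(filtered_tokens))
print('deleted_tokens:', sorted(deleted_tokens))
```

Note that tokens such as `storm..still` survive the filter: the lookup is an exact match, so it only works as well as the tokenization that produced the tokens.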


And now let's see if removing stop words actually helps with the classification performance:

In [10]:
categories = ['Energy', 'Food', 'Medical', 'Water', 'None']

probs = {}
for category in categories:
    prior_prob, token_prob = lib.calc_probs_single(tweets, category, stop_words=eng_stopwords)
    probs[category] = (prior_prob, token_prob)

# Get average F1 score for the test set
predictions = [(tweet, lib.classify_nb_single(tweet, probs, stop_words=eng_stopwords)) for tweet in test_tweets] # maps each test tweet to its predicted label
lib.evaluate(predictions)

Energy
Precision:  60.0
Recall:  45.0
F1:  51.42857142857143

Food
Precision:  78.57142857142857
Recall:  93.7984496124031
F1:  85.51236749116609

Medical
Precision:  100.0
Recall:  46.15384615384615
F1:  63.1578947368421

None
Precision:  77.5
Recall:  78.48101265822785
F1:  77.9874213836478

Water
Precision:  72.72727272727273
Recall:  40.0
F1:  51.612903225806456

Average F1:  65.93983165320677


Compare these results with the previous ones. Does stop word removal help?

## Stemming and Lemmatization

Remember that the goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.

A difference between stemming and lemmatization is that stemming applies suffix-stripping rules to the current word only, while lemmatization relies on a vocabulary and can also take context, such as the part of speech, into consideration. Either way, this pre-processing step can be somewhat tedious to implement by hand. Luckily, `nltk` provides tools for both.

### Stemming using the Porter stemmer
*Porter's algorithm*, published in 1980, is one of the most commonly used stemmers.

In [11]:
# No need to add code to this cell, just run it! :)

from nltk.stem.porter import PorterStemmer

# Get the Porter stemmer
stemmer = PorterStemmer()

# Let's try stemming on plurals
plurals = ['apples', 'batteries', 'generators', 'medicines', 'tests', 'feet']
print('plurals:')
for plural in plurals:
    print('{:s} --> {:s}'.format(plural, stemmer.stem(plural)))
print()
    
# and variations of verbs
verbs = ['studies', 'thinks', 'goes', 'played', 'bought', 'went', 'ran', 'drew']
print('verbs:')
for verb in verbs:
    print('{:s} --> {:s}'.format(verb, stemmer.stem(verb)))

plurals:
apples --> appl
batteries --> batteri
generators --> gener
medicines --> medicin
tests --> test
feet --> feet

verbs:
studies --> studi
thinks --> think
goes --> goe
played --> play
bought --> bought
went --> went
ran --> ran
drew --> drew


You can add more words to `plurals` and see what the stemming results look like.  
The results may look a bit mechanical. This is because Porter's algorithm is essentially a sequential application of a set of rules. To get better-looking results, let's try out a lemmatizer.

In [14]:
from nltk.stem.wordnet import WordNetLemmatizer

# Get the lemmatizer
lmtzr = WordNetLemmatizer()

# Lemmatize the plurals
print('plurals:')
for plural in plurals:
    print('{:s} --> {:s}'.format(plural, lmtzr.lemmatize(plural)))
print()

# Lemmatize the verbs
print('verbs:')
for verb in verbs:
    print('{:s} --> {:s}'.format(verb, lmtzr.lemmatize(verb)))

plurals:
apples --> apple
batteries --> battery
generators --> generator
medicines --> medicine
tests --> test
feet --> foot

verbs:
studies --> study
thinks --> think
goes --> go
played --> played
bought --> bought
went --> went
ran --> ran
drew --> drew


Not yet perfect, but much better, especially for the plurals. Hooray! :)
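
One reason the past-tense verbs are left untouched is that `WordNetLemmatizer.lemmatize` accepts an optional `pos` argument that defaults to noun, so `played` or `went` is looked up as if it were a noun. The toy sketch below mimics that behavior with a tiny hand-written lookup table; `TOY_LEMMAS` and `toy_lemmatize` are made up for illustration and say nothing about how WordNet actually stores its data.

```python
# Hand-written (word, part of speech) -> lemma table; an illustrative assumption
TOY_LEMMAS = {
    ('feet', 'n'): 'foot',
    ('batteries', 'n'): 'battery',
    ('played', 'v'): 'play',
    ('went', 'v'): 'go',
    ('bought', 'v'): 'buy',
}


def toy_lemmatize(word, pos='n'):
    """Look the word up under the given part of speech; fall back to the word itself."""
    return TOY_LEMMAS.get((word, pos), word)


for word in ['feet', 'played', 'went']:
    print('{:s}: as noun --> {:s}, as verb --> {:s}'.format(
        word, toy_lemmatize(word), toy_lemmatize(word, pos='v')))
```

With the real lemmatizer, the corresponding call would be `lmtzr.lemmatize(verb, pos='v')`.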

As before, let's check whether stemming or lemmatization can help with our classification task.

In [15]:
# Stemming
probs = {}
for category in categories:
    prior_prob, token_prob = lib.calc_probs_single(tweets, category, stemmer=stemmer)
    probs[category] = (prior_prob, token_prob)

# Get average F1 score for the test set
predictions = [(tweet, lib.classify_nb_single(tweet, probs, stemmer=stemmer)) for tweet in test_tweets] # maps each test tweet to its predicted label
lib.evaluate(predictions)

Energy
Precision:  52.77777777777778
Recall:  47.5
F1:  50.0

Food
Precision:  76.92307692307692
Recall:  93.02325581395348
F1:  84.21052631578948

Medical
Precision:  100.0
Recall:  46.15384615384615
F1:  63.1578947368421

None
Precision:  79.45205479452055
Recall:  73.41772151898734
F1:  76.3157894736842

Water
Precision:  60.0
Recall:  30.0
F1:  40.0

Average F1:  62.73684210526316


In [16]:
# Lemmatization
probs = {}
for category in categories:
    prior_prob, token_prob = lib.calc_probs_single(tweets, category, lmtzr=lmtzr)
    probs[category] = (prior_prob, token_prob)

# Get average F1 score for the test set
predictions = [(tweet, lib.classify_nb_single(tweet, probs, lmtzr=lmtzr)) for tweet in test_tweets] # maps each test tweet to its predicted label
lib.evaluate(predictions)

Energy
Precision:  54.285714285714285
Recall:  47.5
F1:  50.66666666666667

Food
Precision:  76.43312101910828
Recall:  93.02325581395348
F1:  83.91608391608392

Medical
Precision:  100.0
Recall:  46.15384615384615
F1:  63.1578947368421

None
Precision:  79.72972972972973
Recall:  74.68354430379746
F1:  77.12418300653594

Water
Precision:  66.66666666666667
Recall:  30.0
F1:  41.37931034482759

Average F1:  63.24882773419124


There's some improvement in average F1 (about 62.7 with stemming and 63.2 with lemmatization, up from 62.2 without either), not bad!

Now let's try using these tricks together, i.e. combining stop word removal with stemming or lemmatization. We don't need both stemming and lemmatization, since they are alternatives serving the same purpose.
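
How `lib` combines these options internally is not shown in this notebook. As one plausible arrangement, the sketch below first drops stop words and then normalizes the remaining tokens. The names `preprocess`, `toy_stopwords`, and `strip_plural_s` are hypothetical, and the crude plural stripper merely stands in for a real stemmer or lemmatizer.

```python
def preprocess(tokens, stop_words=None, normalize=None):
    """Return a new token set with stop words removed, then each token normalized."""
    result = set(tokens)
    if stop_words is not None:
        result = {token for token in result if token not in stop_words}
    if normalize is not None:
        result = {normalize(token) for token in result}
    return result


def strip_plural_s(token):
    """Crude stand-in for a stemmer: drop one trailing 's' (but keep 'ss')."""
    if token.endswith('s') and not token.endswith('ss'):
        return token[:-1]
    return token


toy_stopwords = {'the', 'we', 'is'}
tokens = {'the', 'generators', 'generator', 'we', 'need', 'is'}
print(sorted(preprocess(tokens, toy_stopwords, strip_plural_s)))
```

The order matters in this sketch: `strip_plural_s` would turn `is` into `i`, which is no longer in the stop list, so filtering after normalizing would let it through.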

In [17]:
# Stop word removal + stemming
probs = {}
for category in categories:
    prior_prob, token_prob = lib.calc_probs_single(tweets, category, stop_words=eng_stopwords, stemmer=stemmer)
    probs[category] = (prior_prob, token_prob)

# Get average F1 score for the test set
predictions = [(tweet, lib.classify_nb_single(tweet, probs, stop_words=eng_stopwords, stemmer=stemmer)) for tweet in test_tweets] # maps each test tweet to its predicted label
lib.evaluate(predictions)

Energy
Precision:  51.61290322580645
Recall:  40.0
F1:  45.07042253521127

Food
Precision:  77.41935483870968
Recall:  93.02325581395348
F1:  84.50704225352112

Medical
Precision:  100.0
Recall:  46.15384615384615
F1:  63.1578947368421

None
Precision:  76.62337662337663
Recall:  74.68354430379746
F1:  75.64102564102565

Water
Precision:  66.66666666666667
Recall:  40.0
F1:  50.0

Average F1:  63.67527703332003


In [18]:
# Stop word removal + Lemmatization
probs = {}
for category in categories:
    prior_prob, token_prob = lib.calc_probs_single(tweets, category, stop_words=eng_stopwords, lmtzr=lmtzr)
    probs[category] = (prior_prob, token_prob)

# Get average F1 score for the test set
predictions = [(tweet, lib.classify_nb_single(tweet, probs, stop_words=eng_stopwords, lmtzr=lmtzr)) for tweet in test_tweets] # maps each test tweet to its predicted label
lib.evaluate(predictions)

Energy
Precision:  55.88235294117647
Recall:  47.5
F1:  51.351351351351354

Food
Precision:  78.28947368421052
Recall:  92.24806201550388
F1:  84.69750889679715

Medical
Precision:  100.0
Recall:  53.84615384615385
F1:  70.0

None
Precision:  78.94736842105263
Recall:  75.9493670886076
F1:  77.41935483870968

Water
Precision:  66.66666666666667
Recall:  40.0
F1:  50.0

Average F1:  66.69364301737164


Does using several tricks together always work better than using one of them alone? Why do you think that is the case?
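
To compare the runs side by side, the average F1 scores printed above can be collected into one table. This is only a bookkeeping sketch: the numbers are copied from the cell outputs in this notebook, and the variable names are made up.

```python
# Average F1 per configuration, copied from the outputs above
average_f1 = {
    'baseline': 62.19941946514475,
    'stop words': 65.93983165320677,
    'stemming': 62.73684210526316,
    'lemmatization': 63.24882773419124,
    'stop words + stemming': 63.67527703332003,
    'stop words + lemmatization': 66.69364301737164,
}

baseline = average_f1['baseline']

# Sort from best to worst and show the change relative to the baseline
ranked = sorted(average_f1.items(), key=lambda item: item[1], reverse=True)
for name, score in ranked:
    print('{:28s} {:6.2f} ({:+.2f})'.format(name, score, score - baseline))
```

Keep in mind that the test set is small, so differences of a point or so may not be meaningful.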