This notebook evaluates different methods for tokenization and stemming/lemmatization
and assesses their impact on binary sentiment classification, using a train/dev sample of 1000 reviews from the [Large Movie Review Dataset](http://ai.stanford.edu/~amaas/data/sentiment/). Every method is evaluated with the same learning algorithm ($\ell_2$-regularized logistic regression), so the only thing that varies between runs is the tokenization process. For more, see: http://sentiment.christopherpotts.net/tokenizing.html
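
To make the setup concrete, here is a minimal sketch of how a tokenizer can be treated as a pluggable function in front of a fixed feature representation. The `featurize` helper and the binary bag-of-words representation are assumptions for illustration; the features actually built by `TokenizationTest` are not shown in this notebook, and the classifier step is omitted here.

```python
from typing import Callable, Dict, List


def featurize(text: str, tokenizer: Callable[[str], List[str]]) -> Dict[str, int]:
    """Hypothetical helper: map each distinct token to 1 (binary bag of words)."""
    return {token: 1 for token in tokenizer(text)}


# Any function from a string to a list of tokens can be swapped in here.
review = "Great film. Great cast!"
features = featurize(review, str.split)
print(features)  # {'Great': 1, 'film.': 1, 'cast!': 1}
```

Because the downstream learner only ever sees the feature dictionary, any change in accuracy between runs can be attributed to the function passed as `tokenizer`.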

In [1]:
import nltk
import spacy
from nltk.stem.porter import PorterStemmer
from TokenizationTest import TokenizationTest
from happyfuntokenizing import Tokenizer as potts

In [2]:
# spaCy lemmatization needs the tagger, so disable only the NER and parser components.
# Note: `disable` takes a list of component names, one string per component.
nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser'])

# load NLTK porter stemmer
stemmer = PorterStemmer()

# load Potts sentiment tokenizer
potts_tokenizer=potts()

In [3]:
def spacy_tokenizer(data):
    # Return the surface form of each spaCy token
    spacy_tokens = nlp(data)
    return [token.text for token in spacy_tokens]

def spacy_lemmatizer(data):
    # Return the lemma of each spaCy token
    spacy_tokens = nlp(data)
    return [token.lemma_ for token in spacy_tokens]

In [9]:
tester=TokenizationTest("../../Data/sentiment.1000.train.txt", "../../Data/sentiment.1000.dev.txt")

In [10]:
tester.evaluate(str.split)

Function 'split' Accuracy: 0.858
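
Whitespace splitting leaves punctuation attached to the neighbouring word, so `it,` and `it` become different features. The sketch below contrasts `str.split` with a simple punctuation-aware tokenizer. `regex_tokenizer` is a hypothetical helper written for this illustration; it is not the rule set used by NLTK, spaCy, or the Potts tokenizer.

```python
import re


def regex_tokenizer(text):
    """Hypothetical tokenizer: words (keeping internal apostrophes) or single punctuation marks."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)


sample = "Loved it, didn't you?"

whitespace_tokens = sample.split()
regex_tokens = regex_tokenizer(sample)

print(whitespace_tokens)  # ['Loved', 'it,', "didn't", 'you?']
print(regex_tokens)       # ['Loved', 'it', ',', "didn't", 'you', '?']
```

Separating punctuation lets `it` and `you` share a feature across contexts, which is one plausible reason the dedicated tokenizers evaluated in this notebook behave differently from plain whitespace splitting.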


In [11]:
tester.evaluate(stemmer.stem)

Function 'stem' Accuracy: 0.825


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [12]:
tester.evaluate(nltk.word_tokenize)

Function 'word_tokenize' Accuracy: 0.876


In [13]:
tester.evaluate(spacy_tokenizer)

Function 'spacy_tokenizer' Accuracy: 0.871


In [14]:
tester.evaluate(spacy_lemmatizer)

Function 'spacy_lemmatizer' Accuracy: 0.872


In [16]:
tester.evaluate(potts_tokenizer.tokenize)

Function 'tokenize' Accuracy: 0.885
