This notebook evaluates different methods for tokenization and stemming/lemmatization
and assesses their impact on binary sentiment classification, using a train/dev sample of 1000 reviews from the [Large Movie Review Dataset](http://ai.stanford.edu/~amaas/data/sentiment/).  Each tokenization method is evaluated with the same learning algorithm ($\ell_2$-regularized logistic regression); the only difference between runs is the tokenization process. For more, see: http://sentiment.christopherpotts.net/tokenizing.html
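
As a rough illustration of the setup described above, the sketch below swaps tokenizers into an otherwise fixed pipeline: binary bag-of-words features fed to an $\ell_2$-regularized logistic regression trained by plain gradient descent. `TokenizationTest` itself is not shown in this notebook, so everything here is an assumption rather than its actual implementation: the helper names (`featurize`, `train_logreg`, `predict`, `evaluate`), the hyperparameters, and the toy reviews are all hypothetical.

```python
import math
from collections import defaultdict


def featurize(text, tokenizer):
    """Binary bag-of-words: the set of distinct tokens in the text."""
    return set(tokenizer(text))


def train_logreg(examples, l2=0.1, lr=0.5, epochs=200):
    """Fit L2-regularized logistic regression by batch gradient descent.

    examples: list of (feature_set, label) pairs with label in {0, 1}.
    The bias term is left unregularized.
    """
    weights = defaultdict(float)
    bias = 0.0
    n = len(examples)
    for _ in range(epochs):
        grad = defaultdict(float)
        grad_bias = 0.0
        for feats, label in examples:
            score = bias + sum(weights[f] for f in feats)
            error = 1.0 / (1.0 + math.exp(-score)) - label
            for f in feats:
                grad[f] += error / n
            grad_bias += error / n
        # Gradient step; the l2 * weight term is the regularization penalty.
        for f in list(weights.keys() | grad.keys()):
            weights[f] -= lr * (grad[f] + l2 * weights[f])
        bias -= lr * grad_bias
    return weights, bias


def predict(weights, bias, feats):
    """Predict 1 when the linear score is positive, otherwise 0."""
    score = bias + sum(weights.get(f, 0.0) for f in feats)
    return 1 if score > 0 else 0


def evaluate(tokenizer, train, dev):
    """Train on `train` with the given tokenizer; return accuracy on `dev`."""
    examples = [(featurize(text, tokenizer), label) for text, label in train]
    weights, bias = train_logreg(examples)
    correct = sum(
        predict(weights, bias, featurize(text, tokenizer)) == label
        for text, label in dev
    )
    return correct / len(dev)


# Toy data (hypothetical), only to show the tokenizer being the one moving part.
train = [
    ("a truly great movie", 1),
    ("great acting and a nice plot", 1),
    ("a boring movie", 0),
    ("boring plot and awful acting", 0),
]
dev = [("truly nice and great", 1), ("awful and boring", 0)]

for name, tokenizer in [("split", str.split),
                        ("lower+split", lambda t: t.lower().split())]:
    print(name, evaluate(tokenizer, train, dev))
```

Because the learner and the data are held fixed, any change in the returned accuracy can be attributed to the tokenizer passed in.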

In [1]:
import nltk
import spacy
from nltk.stem.porter import PorterStemmer
from TokenizationTest import TokenizationTest
from happyfuntokenizing import Tokenizer as potts

In [2]:
# spaCy lemmatization needs the tagger, so disable only the NER and parser components
nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser'])

# load NLTK porter stemmer
stemmer = PorterStemmer()

# load Potts sentiment tokenizer
potts_tokenizer = potts()

In [3]:
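def spacy_tokenizer(data):
    """Tokenize text with spaCy and return the surface form of each token."""
    spacy_tokens = nlp(data)
    return [token.text for token in spacy_tokens]

def spacy_lemmatizer(data):
    """Tokenize text with spaCy and return the lemma of each token."""
    spacy_tokens = nlp(data)
    return [token.lemma_ for token in spacy_tokens]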

In [4]:
tester=TokenizationTest("../../../Data/sentiment.1000.train.txt", "../../../Data/sentiment.1000.dev.txt")

In [5]:
tester.evaluate(str.split)

Tokenized text:  It was simple and yet so nice. I think the whole sense of sex segregation in society, which can be bitter, was shown very delicately. It had a bitter kind of hummer in it. The fact that most of the actors were not professionals, made the movie more tangible and more realistic. There was a "documentary" side to the movie too. The best scenes were those that all the girls, banned from watching, were listening passionately to the soldier, who is supposed to keep an eye on them, broadcasting the game. If you are an Iranian, the familiar cheering and dancing in the streets after a game won, fills you up with National pride!! If you are not Iranian, you'll still love it all the same!
Function 'split' Accuracy: 0.856
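
Whitespace splitting leaves punctuation attached to neighboring words, so `nice.` and `nice` end up as distinct features. The sketch below contrasts `str.split` with a simple regular-expression tokenizer. `simple_regex_tokenize` and its pattern are illustrative assumptions, and they are much cruder than the NLTK, spaCy, and Potts tokenizers evaluated in this notebook.

```python
import re

# Either a run of word characters (optionally with internal apostrophes),
# or any single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)*|[^\w\s]")


def simple_regex_tokenize(text):
    """Split text into word tokens and standalone punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


sentence = "It was simple and yet so nice."
print(sentence.split())                 # last token keeps the period: 'nice.'
print(simple_regex_tokenize(sentence))  # period becomes its own token
```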


In [6]:
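# Note: PorterStemmer.stem expects a single word, not a whole review. Passed directly,
# it returns one lowercased string, which appears to be iterated character by character
# (see the output below).
tester.evaluate(stemmer.stem)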

Tokenized text:  i t   w a s   s i m p l e   a n d   y e t   s o   n i c e .   i   t h i n k   t h e   w h o l e   s e n s e   o f   s e x   s e g r e g a t i o n   i n   s o c i e t y ,   w h i c h   c a n   b e   b i t t e r ,   w a s   s h o w n   v e r y   d e l i c a t e l y .   i t   h a d   a   b i t t e r   k i n d   o f   h u m m e r   i n   i t .   t h e   f a c t   t h a t   m o s t   o f   t h e   a c t o r s   w e r e   n o t   p r o f e s s i o n a l s ,   m a d e   t h e   m o v i e   m o r e   t a n g i b l e   a n d   m o r e   r e a l i s t i c .   t h e r e   w a s   a   " d o c u m e n t a r y "   s i d e   t o   t h e   m o v i e   t o o .   t h e   b e s t   s c e n e s   w e r e   t h o s e   t h a t   a l l   t h e   g i r l s ,   b a n n e d   f r o m   w a t c h i n g ,   w e r e   l i s t e n i n g   p a s s i o n a t e l y   t o   t h e   s o l d i e r ,   w h o   i s   s u p p o s e d   t o   k e e p   a n   e y e   o n   t h e m ,   b r o a d c a s t i n g
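
The character-by-character output above suggests that `stemmer.stem` received each whole review as a single string, whereas a stemmer is meant to be applied word by word. One way to obtain a list of stemmed tokens, sketched here as an assumption rather than the notebook's original intent, is to compose a tokenizer with a per-word stemming function. `make_stemming_tokenizer` is a hypothetical helper, and `toy_stem` is a deliberately naive stand-in for a real stemmer so that the block runs without NLTK.

```python
def make_stemming_tokenizer(tokenize, stem):
    """Return a function that tokenizes text, then stems each token."""
    def stemming_tokenizer(text):
        return [stem(token) for token in tokenize(text)]
    return stemming_tokenizer


def toy_stem(word):
    """Naive stand-in for a real stemmer: lowercase and strip a final 's'."""
    word = word.lower()
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


stem_tokens = make_stemming_tokenizer(str.split, toy_stem)
print(stem_tokens("The actors were not professionals"))
```

In this notebook the analogous call would presumably be `tester.evaluate(make_stemming_tokenizer(nltk.word_tokenize, stemmer.stem))`. No accuracy for that configuration appears in the notebook.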

In [10]:
tester.evaluate(nltk.word_tokenize)

Tokenized text:  It was simple and yet so nice . I think the whole sense of sex segregation in society , which can be bitter , was shown very delicately . It had a bitter kind of hummer in it . The fact that most of the actors were not professionals , made the movie more tangible and more realistic . There was a `` documentary '' side to the movie too . The best scenes were those that all the girls , banned from watching , were listening passionately to the soldier , who is supposed to keep an eye on them , broadcasting the game . If you are an Iranian , the familiar cheering and dancing in the streets after a game won , fills you up with National pride ! ! If you are not Iranian , you 'll still love it all the same !
Function 'word_tokenize' Accuracy: 0.876


In [11]:
tester.evaluate(spacy_tokenizer)

Tokenized text:  It was simple and yet so nice . I think the whole sense of sex segregation in society , which can be bitter , was shown very delicately . It had a bitter kind of hummer in it . The fact that most of the actors were not professionals , made the movie more tangible and more realistic . There was a " documentary " side to the movie too . The best scenes were those that all the girls , banned from watching , were listening passionately to the soldier , who is supposed to keep an eye on them , broadcasting the game . If you are an Iranian , the familiar cheering and dancing in the streets after a game won , fills you up with National pride ! ! If you are not Iranian , you 'll still love it all the same !
Function 'spacy_tokenizer' Accuracy: 0.871


In [12]:
tester.evaluate(spacy_lemmatizer)

Tokenized text:  it be simple and yet so nice . I think the whole sense of sex segregation in society , which can be bitter , be show very delicately . it have a bitter kind of hummer in it . the fact that most of the actor be not professional , make the movie more tangible and more realistic . there be a " documentary " side to the movie too . the good scene be those that all the girl , ban from watch , be listen passionately to the soldier , who be suppose to keep an eye on they , broadcast the game . if you be an Iranian , the familiar cheering and dance in the street after a game win , fill you up with National pride ! ! if you be not iranian , you will still love it all the same !
Function 'spacy_lemmatizer' Accuracy: 0.869


In [13]:
tester.evaluate(potts_tokenizer.tokenize)



Tokenized text:  it was simple and yet so nice . i think the whole sense of sex segregation in society , which can be bitter , was shown very delicately . it had a bitter kind of hummer in it . the fact that most of the actors were not professionals , made the movie more tangible and more realistic . there was a " documentary " side to the movie too . the best scenes were those that all the girls , banned from watching , were listening passionately to the soldier , who is supposed to keep an eye on them , broadcasting the game . if you are an iranian , the familiar cheering and dancing in the streets after a game won , fills you up with national pride ! ! if you are not iranian , you'll still love it all the same !
Function 'tokenize' Accuracy: 0.883
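
The reported accuracies differ by at most 0.027 (0.856 for `split` versus 0.883 for the Potts tokenizer), so it is worth asking how much of that gap could be sampling noise. The sketch below computes a normal-approximation 95% confidence interval for each accuracy reported in this notebook. The dev-set size `N_DEV = 1000` is an assumption inferred from the file name, not a figure stated in the notebook, and `accuracy_interval` is a hypothetical helper.

```python
import math

N_DEV = 1000  # assumed dev-set size; not stated in the notebook

# Accuracies copied from the outputs in this notebook.
accuracies = {
    "split": 0.856,
    "word_tokenize": 0.876,
    "spacy_tokenizer": 0.871,
    "spacy_lemmatizer": 0.869,
    "tokenize (Potts)": 0.883,
}


def accuracy_interval(acc, n, z=1.96):
    """Normal-approximation confidence interval for a proportion."""
    half_width = z * math.sqrt(acc * (1.0 - acc) / n)
    return acc - half_width, acc + half_width


for name, acc in accuracies.items():
    low, high = accuracy_interval(acc, N_DEV)
    print(f"{name:18s} {acc:.3f}  [{low:.3f}, {high:.3f}]")
```

Under that assumption each interval is roughly ±0.02 wide, so the intervals overlap and the ranking of tokenizers is best read as suggestive. Since every method is scored on the same dev set, a paired comparison would likely be a sharper check than these independent intervals.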
