# Text Classification

The first step of this project is to develop a **classification model to predict the positive/negative labels** of movie reviews. This prediction will be **based solely on the text content** of the reviews.

The data used in this project is the polarity dataset v2.0 from Cornell University (http://www.cs.cornell.edu/people/pabo/movie-review-data/).

### Step 1

#### 1. Perform initial imports

In [1]:
import numpy as np
import pandas as pd

#### 2. Load data

In [2]:
df = pd.read_csv("data/moviereviews.tsv", sep='\t')

#### 3. Check the dataframe

In [3]:
df.head()

   label  review
0  neg    how do films like mouse hunt get into theatres...
1  neg    some talented actresses are blessed with a dem...
2  pos    this has been an extraordinary year for austra...
3  pos    according to hollywood movies made in last few...
4  neg    my first press screening of 1998 and already i...


In [4]:
len(df)

2000

In [5]:
# check number of both labels

df['label'].value_counts()

pos    1000
neg    1000
Name: label, dtype: int64

In [6]:
# check first negative review

print(df['review'][0])

how do films like mouse hunt get into theatres ? 
isn't there a law or something ? 
this diabolical load of claptrap from steven speilberg's dreamworks studio is hollywood family fare at its deadly worst . 
mouse hunt takes the bare threads of a plot and tries to prop it up with overacting and flat-out stupid slapstick that makes comedies like jingle all the way look decent by comparison . 
writer adam rifkin and director gore verbinski are the names chiefly responsible for this swill . 
the plot , for what its worth , concerns two brothers ( nathan lane and an appalling lee evens ) who inherit a poorly run string factory and a seemingly worthless house from their eccentric father . 
deciding to check out the long-abandoned house , they soon learn that it's worth a fortune and set about selling it in auction to the highest bidder . 
but battling them at every turn is a very smart mouse , happy with his run-down little abode and wanting it to stay that way . 
the story alternate

In [7]:
# check first positive review

print(df['review'][2])

this has been an extraordinary year for australian films . 
 " shine " has just scooped the pool at the australian film institute awards , picking up best film , best actor , best director etc . to that we can add the gritty " life " ( the anguish , courage and friendship of a group of male prisoners in the hiv-positive section of a jail ) and " love and other catastrophes " ( a low budget gem about straight and gay love on and near a university campus ) . 
i can't recall a year in which such a rich and varied celluloid library was unleashed from australia . 
 " shine " was one bookend . 
stand by for the other one : " dead heart " . 
>from the opening credits the theme of division is established . 
the cast credits have clear and distinct lines separating their first and last names . 
bryan | brown . 
in a desert settlement , hundreds of kilometres from the nearest town , there is an uneasy calm between the local aboriginals and the handful of white settlers who live nearby . 

#### 4. Check missing values

In [8]:
df.isnull().sum()

label      0
review    35
dtype: int64

There are 35 missing reviews. We should delete these rows.

In [9]:
# remove rows with missing reviews

df.dropna(inplace = True)

In [10]:
# check missing values

df.isnull().sum()

label     0
review    0
dtype: int64

#### 5. Check empty strings

In [11]:
# using the isspace() method

empty_strings = []

for i, lb, rv in df.itertuples():
    if rv.isspace():
        empty_strings.append(i)

In [12]:
print(empty_strings)
print(len(empty_strings))

[57, 71, 147, 151, 283, 307, 313, 323, 343, 351, 427, 501, 633, 675, 815, 851, 977, 1079, 1299, 1455, 1493, 1525, 1531, 1763, 1851, 1905, 1993]
27


There are 27 reviews that correspond to empty strings. These reviews are identified by the indices in the empty_strings list. We should remove them.

In [13]:
# remove rows with empty strings

df.drop(empty_strings, inplace = True)

In [14]:
# check length

len(df)

1938

In [15]:
# check number of both labels

df['label'].value_counts()

pos    969
neg    969
Name: label, dtype: int64

We now have 1938 movie reviews (969 are positive and 969 are negative).
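As a side note, the two cleaning steps above (dropping missing reviews and dropping whitespace-only reviews) can also be written without an explicit loop. The following is a minimal sketch on a small made-up dataframe (`toy` and its contents are hypothetical, not taken from the dataset). Unlike `isspace()`, the `str.strip()` comparison also catches completely empty strings.

```python
import pandas as pd

# toy frame standing in for the reviews dataframe (hypothetical data)
toy = pd.DataFrame({
    'label': ['pos', 'neg', 'pos', 'neg'],
    'review': ['great film', None, '   ', 'dull plot'],
})

# drop rows with a missing review
cleaned = toy.dropna(subset=['review'])

# drop rows whose review is empty or contains only whitespace
cleaned = cleaned[cleaned['review'].str.strip() != '']

print(cleaned.index.tolist())
print(cleaned['label'].value_counts())
```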

#### 6. Split the data into train and test sets

In [16]:
from sklearn.model_selection import train_test_split

X = df['review']
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)
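A plain random split does not guarantee that both labels are equally represented in the test set. If that matters, `train_test_split` accepts a `stratify` argument. The sketch below uses made-up data (`X_toy`, `y_toy` are hypothetical) just to show the effect; it is an optional variation, not the split used in this project.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# hypothetical balanced toy data: 10 positive and 10 negative "reviews"
X_toy = [f"review {i}" for i in range(20)]
y_toy = ['pos'] * 10 + ['neg'] * 10

# stratify=y_toy keeps the label proportions the same in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy)

print(Counter(y_tr))
print(Counter(y_te))
```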

#### 7. Build pipeline to vectorize the data and train/fit the model

In [17]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

text_clf = Pipeline([('tfidf', TfidfVectorizer()),
                    ('clf', LinearSVC())])

text_clf.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('clf',
                 LinearSVC(C=1.0, class_weight=None, dual=True,
                           fit_intercept=True, intercept_scaling=1,
               

#### 8. Make predictions with the test set

In [18]:
predictions = text_clf.predict(X_test)

#### 9. Evaluate the predictions

In [19]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

In [20]:
# confusion matrix

print(confusion_matrix(y_test, predictions))

[[235  47]
 [ 41 259]]


In [21]:
# classification report

print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

         neg       0.85      0.83      0.84       282
         pos       0.85      0.86      0.85       300

    accuracy                           0.85       582
   macro avg       0.85      0.85      0.85       582
weighted avg       0.85      0.85      0.85       582



In [22]:
# accuracy score

print(accuracy_score(y_test, predictions))

0.8487972508591065


Based solely on the text content of the reviews, we've managed to correctly classify **84.9%** of them as positive or negative.
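To make the link between the confusion matrix and the reported metrics explicit, here is a small worked example that recomputes them by hand. It assumes scikit-learn's convention: rows are true labels, columns are predicted labels, in sorted label order (`neg`, `pos`).

```python
# confusion matrix from above: rows = true labels, columns = predicted labels
cm = [[235, 47],    # true neg: 235 predicted neg, 47 predicted pos
      [41, 259]]    # true pos: 41 predicted neg, 259 predicted pos

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

# precision: correct predictions of a class / all predictions of that class
precision_neg = cm[0][0] / (cm[0][0] + cm[1][0])
precision_pos = cm[1][1] / (cm[0][1] + cm[1][1])

# recall: correct predictions of a class / all true members of that class
recall_neg = cm[0][0] / sum(cm[0])
recall_pos = cm[1][1] / sum(cm[1])

print(round(accuracy, 4))
print(round(precision_neg, 2), round(recall_neg, 2))
print(round(precision_pos, 2), round(recall_pos, 2))
```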

#### 10. Test the fitted model on new data

In [23]:
# make up some reviews and test the model

my_review = "The movie was great! The main actors were superb and the storyline was convincing."
my_review_2 = "Terrible movie! A complete waste of money."

reviews = [my_review, my_review_2]

print(text_clf.predict(reviews))

['pos' 'neg']


Even though this is a very simple model, everything seems to be working just fine!

We've used **sklearn's TfidfVectorizer** - our text vectorization method - and the **Linear Support Vector Classification algorithm** to build our model. 

Let's consider other alternatives in terms of **text normalization and vectorization methods** and learning **algorithms**, that can be useful in different **text classification** scenarios.

In step 2 we'll focus on the different ways we can **normalize** our documents (reviews).

### Step 2

Text normalization allows us to **reduce our vocabulary size** and hence the number of dimensions of our feature space, which results in efficiency gains in terms of storage and processing.

However, text normalization comes with the cost of **loss of information**.

It is up to us to find the right balance between those efficiency gains and our model's accuracy, considering both recall and precision.
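A tiny illustration of this trade-off, using a made-up list of tokens (hypothetical, not from the dataset): lowercasing shrinks the vocabulary, but it also merges tokens that may carry different meanings, such as `US` and `us`.

```python
# hypothetical tokens, for illustration only
tokens = ['The', 'movie', 'was', 'great', 'the', 'Movie', 'was', 'GREAT', 'US', 'us']

raw_vocabulary = set(tokens)
lowercased_vocabulary = {token.lower() for token in tokens}

print(len(raw_vocabulary), len(lowercased_vocabulary))

# 'US' (the country) and 'us' (the pronoun) are no longer distinguishable
print('US' in raw_vocabulary, 'US' in lowercased_vocabulary)
```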

Our ultimate goal is to **improve the overall performance of our NLP pipeline**, given the specific objectives of our project.

Let's see how the text vectorization method we applied in step 1 deals with text normalization.

#### 1. Check parameters of TfidfVectorizer used in our pipeline

In [24]:
text_clf.named_steps['tfidf'].get_params()

{'analyzer': 'word',
 'binary': False,
 'decode_error': 'strict',
 'dtype': numpy.float64,
 'encoding': 'utf-8',
 'input': 'content',
 'lowercase': True,
 'max_df': 1.0,
 'max_features': None,
 'min_df': 1,
 'ngram_range': (1, 1),
 'norm': 'l2',
 'preprocessor': None,
 'smooth_idf': True,
 'stop_words': None,
 'strip_accents': None,
 'sublinear_tf': False,
 'token_pattern': '(?u)\\b\\w\\w+\\b',
 'tokenizer': None,
 'use_idf': True,
 'vocabulary': None}

Regarding text normalization and having a look at the default parameters that were used before, we can see that:
* **tokens of 2 or more alphanumeric characters** were selected (**punctuation was ignored**)
* tokens were **lowercased**

And that:
* **no stopwords** were removed
* apart from that, **no stemmer and no lemmatizer** were used
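The effect of the default `token_pattern` can be reproduced with Python's `re` module. This is only a sketch of the tokenization step (lowercasing followed by the regular expression shown in the parameters above), applied to a sentence from the first review; it is not the vectorizer's full internal logic.

```python
import re

# default token_pattern of TfidfVectorizer, as listed in get_params() above
token_pattern = r'(?u)\b\w\w+\b'

sentence = "isn't there a law or something ?"

# lowercase first (lowercase=True), then keep tokens of 2+ word characters
tokens = re.findall(token_pattern, sentence.lower())
print(tokens)
```

Note how the apostrophe splits `isn't` into `isn` and `t`, and how single-character tokens (`t`, `a`) and punctuation are dropped.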

#### 2. Check dataset size and vocabulary size

In [25]:
len(df)

1938

In [26]:
len(text_clf.named_steps['tfidf'].vocabulary_)

33855

We have a reasonably small dataset and vocabulary, and thus we can choose to keep most of the information from our reviews without hindering our NLP pipeline performance.

Supposing this was not the case, what could we do? In order to maximize the range of possibilities, we could make some changes to our pipeline by **creating custom transformers for both text normalization and text vectorization** and **building some different models** using several distinct algorithms.

Let's start by creating a **custom text normalization transformer** that allows us to:
* lowercase words
* remove punctuation
* remove stopwords
* stem words
* lemmatize words
* lemmatize and stem words (in this order)

This is our step 3.

### Step 3

#### 1. Perform necessary imports

In [27]:
import unicodedata
import nltk
from nltk import pos_tag, sent_tokenize, wordpunct_tokenize
from nltk.corpus import wordnet as wn
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from sklearn.base import BaseEstimator, TransformerMixin

#### 2. Create a custom text normalization transformer

In [28]:
class TextNormalizer(BaseEstimator, TransformerMixin):

    def __init__(self, language='english', lemmatizer=WordNetLemmatizer(), stemmer=None, vectorizer='gensim'):
        '''
        The default language is English.
        The default lemmatizer is WordNetLemmatizer. If you don't want to use a lemmatizer,
        set it to None.
        To use a stemmer, set stemmer to 'porter' (PorterStemmer) or to 'snowball' (SnowballStemmer),
        otherwise, it defaults to None.
        The default vectorizer is Gensim - this way you get a list of tokens as an output.
        If you need a string instead, set vectorizer to 'other'.
        '''
        self.language = language
        self.stopwords = set(nltk.corpus.stopwords.words(self.language))
        
        self.lemmatizer = lemmatizer
        self.stemmer = stemmer
        
        self.vectorizer = vectorizer
        
    def is_punct(self, token):
        # returns True if all characters of a token are punctuation signs
        return all(unicodedata.category(char).startswith('P') for char in token)
    
    def is_stopword(self, token):
        # returns True if the lowercased token is a stopword
        return token.lower() in self.stopwords
    
    def lemmatize(self, token, tag):
        '''
        Converts Penn Treebank part-of-speech tags (the default tag set in nltk.pos_tag)
        to WordNet tags - defaults to wn.NOUN if the first letter of the Penn Treebank pos tag
        is not 'N', 'V', 'R' or 'J'.
        Returns lemmatized token.
        '''        
        wordnet_tag = {
            'N': wn.NOUN, 
            'V': wn.VERB,
            'R': wn.ADV,
            'J': wn.ADJ
        }.get(tag[0], wn.NOUN)
        
        return self.lemmatizer.lemmatize(token, wordnet_tag)
    
    def stem(self, token):
        if self.stemmer == 'porter':
            self.stemmer = PorterStemmer()
        elif self.stemmer == 'snowball':
            self.stemmer = SnowballStemmer(self.language)
        return self.stemmer.stem(token)
    
    def normalize(self, review):
        '''
        Normalizes review by removing punctuation and stopwords, lowercasing tokens,
        and by lemmatizing and/or stemming tokens (lemmatization comes first).
        '''
        # tokenize and POS-tag each sentence, dropping punctuation and stopwords
        tagged_tokens = [(token, tag)
                         for sentence in sent_tokenize(review)
                         for (token, tag) in pos_tag(wordpunct_tokenize(sentence))
                         if not self.is_punct(token) and not self.is_stopword(token)]

        # lemmatize (if a lemmatizer is set) and lowercase
        if self.lemmatizer is not None:
            tokens = [self.lemmatize(token, tag).lower() for (token, tag) in tagged_tokens]
        else:
            tokens = [token.lower() for (token, tag) in tagged_tokens]

        # stem (if a stemmer is set)
        if self.stemmer is not None:
            tokens = [self.stem(token) for token in tokens]

        return tokens
    
    def fit(self, reviews, labels=None):
        return self
    
    def transform(self, reviews):
        # returns one list of tokens per review if vectorizer is set to 'gensim';
        # otherwise, returns one string per review
        if self.vectorizer == 'gensim':
            return [self.normalize(review) for review in reviews]
        else:
            return [' '.join(self.normalize(review)) for review in reviews]
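Two helpers of the class can be checked in isolation, without NLTK. The sketch below re-implements `is_punct` as a plain function and mirrors the tag mapping of `lemmatize` with the string values behind WordNet's constants (`wn.NOUN` is `'n'`, `wn.VERB` is `'v'`, `wn.ADV` is `'r'`, `wn.ADJ` is `'a'`). The names `PENN_TO_WORDNET` and `to_wordnet_tag` are made up for this example.

```python
import unicodedata

def is_punct(token):
    # True if every character belongs to a Unicode punctuation category (P*)
    return all(unicodedata.category(char).startswith('P') for char in token)

# first letter of the Penn Treebank tag -> WordNet POS constant
PENN_TO_WORDNET = {'N': 'n', 'V': 'v', 'R': 'r', 'J': 'a'}

def to_wordnet_tag(penn_tag):
    # falls back to noun ('n') for any other tag
    return PENN_TO_WORDNET.get(penn_tag[0], 'n')

for token in ['?', '...', "it's", '$', '']:
    print(repr(token), is_punct(token))

for tag in ['NNS', 'VBD', 'RB', 'JJ', 'IN']:
    print(tag, to_wordnet_tag(tag))
```

Two edge cases are worth knowing: symbols such as `$` are not in a punctuation category, so they are kept, and `all()` over an empty token returns `True`.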

#### 3. Test the custom text normalization transformer

In [29]:
# Instantiate object with default values

normalizer = TextNormalizer()

Our class, TextNormalizer, inherits from BaseEstimator. Its objects are then estimators with **methods get_params() and set_params()**.

Let's check our object's parameters.

In [30]:
normalizer.get_params()

{'language': 'english',
 'lemmatizer': <WordNetLemmatizer>,
 'stemmer': None,
 'vectorizer': 'gensim'}

The language is **English**, the lemmatizer we'll be using is **WordNetLemmatizer**, **no stemming** will be performed, and the output will be a **list of tokens**, ready to use as input to Gensim's text vectorization methods.

As a side note, both stemming and lemmatization allow us to reduce the size of our vocabulary, but they accomplish it in different ways:
* **stemming** uses a series of rules to remove word affixes and the resulting token **might not be a valid word**
* **lemmatization** uses a knowledge base and may also use the word's POS tag to return the word's lemma, which **is always a valid word**

This is why **it makes sense to use a stemmer right after a lemmatizer, but not the other way around**, if your goal is to reduce dimensionality and improve recall.
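A quick way to see that stems are not necessarily valid words is to run NLTK's `PorterStemmer` on a few tokens. The words below are arbitrary examples chosen for this sketch.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# arbitrary example words
words = ['movies', 'studies', 'acting']
stems = [stemmer.stem(word) for word in words]

for word, stem in zip(words, stems):
    print(word, '->', stem)
```

Here `movies` and `studies` are reduced to `movi` and `studi`, which are not dictionary words, so a lemmatizer applied afterwards would have nothing to look up.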

TextNormalizer also inherits from TransformerMixin, so it has a **fit_transform method**.
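To see what the two base classes contribute on their own, here is a minimal sketch with a made-up transformer (`Lowercaser` and its `strip` parameter are hypothetical and not part of this project). `BaseEstimator` derives `get_params()` and `set_params()` from the `__init__` signature, and `TransformerMixin` adds `fit_transform()` on top of `fit()` and `transform()`.

```python
from sklearn.base import BaseEstimator, TransformerMixin

class Lowercaser(BaseEstimator, TransformerMixin):
    # hypothetical toy transformer, for illustration only

    def __init__(self, strip=True):
        self.strip = strip

    def fit(self, docs, labels=None):
        # nothing to learn
        return self

    def transform(self, docs):
        if self.strip:
            return [doc.strip().lower() for doc in docs]
        return [doc.lower() for doc in docs]

toy = Lowercaser()
params_before = toy.get_params()
stripped = toy.fit_transform(['  Great MOVIE '])

toy.set_params(strip=False)
unstripped = toy.fit_transform(['  Great MOVIE '])

print(params_before)
print(stripped, unstripped)
```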

Since our transform method returns a list (one entry per review), we can assign its output directly to a new dataframe column. Let's test it!

In [33]:
df['review_normalized'] = normalizer.fit_transform(df['review'])

In [34]:
df.head()

   label  review                                             review_normalized
0  neg    how do films like mouse hunt get into theatres...  [film, like, mouse, hunt, get, theatre, law, s...
1  neg    some talented actresses are blessed with a dem...  [talented, actress, bless, demonstrated, wide,...
2  pos    this has been an extraordinary year for austra...  [extraordinary, year, australian, film, shine,...
3  pos    according to hollywood movies made in last few...  [accord, hollywood, movie, make, last, decade,...
4  neg    my first press screening of 1998 and already i...  [first, press, screening, 1998, already, get, ...


Everything seems to be working as expected. Our next step is to create a **custom Gensim vectorization transformer**.

### Step 4

#### 1. Perform necessary imports

In [35]:
import os
from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.matutils import sparse2full



In [35]:
# note: os.path.exists(None) raises an error:
# "stat: path should be string, bytes, os.PathLike or integer, not NoneType",
# so the vectorizer below compares its paths against None instead

#### 2. Create a custom Gensim vectorization transformer

In [36]:
class GensimVectorizer(BaseEstimator, TransformerMixin):
    
    def __init__(self, dict_path=None, tfidf_path=None):
        '''
        Allows setting a path for both the dictionary and the tfidf model
        to be used by the Gensim vectorizer.
        '''
        self.dict_path = dict_path
        self.tfidf_path = tfidf_path
        self.id2word = None
        self.tfidf = None
        self.load()
        
    def load(self):
        # loads existing dictionary and tfidf model, if their paths were provided
        if self.dict_path is not None:
            self.id2word = Dictionary.load_from_text(self.dict_path)

        if self.tfidf_path is not None:
            self.tfidf = TfidfModel.load(self.tfidf_path)

    def save(self):
        # saves dictionary as a tab-delimited text file
        self.id2word.save_as_text('./reviews_dictionary.txt')
        # saves (pickles) the tfidf model
        self.tfidf.save('./reviews_tfidf.pkl')

    def fit(self, reviews, labels=None):
        # creates and saves dictionary and tfidf model, unless both paths were provided
        if self.dict_path is None or self.tfidf_path is None:
            self.id2word = Dictionary(reviews)
            self.tfidf = TfidfModel(dictionary=self.id2word, normalize=True)
            self.save()

        return self

    def transform(self, reviews):
        # returns a list of dense tfidf vectors (numpy arrays), one per review
        return [sparse2full(self.tfidf[self.id2word.doc2bow(review)],
                            len(self.id2word)) for review in reviews]


#### 3. Test the custom Gensim vectorization transformer

In [37]:
# Instantiate object with default values

gensim_vectorizer = GensimVectorizer()

Let's check our object's parameters.

In [38]:
gensim_vectorizer.get_params()

{'dict_path': None, 'tfidf_path': None}

We have a `dict_path` and a `tfidf_path` that define where our dictionary and TF-IDF model should be loaded from. Gensim allows us to write dictionaries and models to disk, enabling us to load them later every time they're needed.

Let's see if everything is working properly.

In [39]:
gensim_tfidf = gensim_vectorizer.fit_transform(df['review_normalized'])

  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL


In [40]:
_, tokens = zip(*sorted(zip(gensim_vectorizer.id2word.token2id.values(), gensim_vectorizer.id2word.token2id.keys())))
tokens[:10]

('abandon',
 'abode',
 'action',
 'adam',
 'alone',
 'alternate',
 'another',
 'appalling',
 'arrive',
 'arse')

In [41]:
df_gensim_tfidf = pd.DataFrame(gensim_tfidf, columns=tokens)

In [42]:
df_gensim_tfidf.head()

    abandon     abode    action      adam     alone  alternate   another  appalling    arrive      arse  ...  castor  compardre  converted  dietrich  hassler  megabyte  potrayal  recline  staggeringly  swain
0  0.052881  0.196983  0.019049  0.052065  0.078179   0.065869  0.015404   0.087037  0.045431  0.125088  ...     0.0        0.0        0.0       0.0      0.0       0.0       0.0      0.0           0.0    0.0
1  0.041786       0.0       0.0       0.0       0.0        0.0       0.0        0.0       0.0       0.0  ...     0.0        0.0        0.0       0.0      0.0       0.0       0.0      0.0           0.0    0.0
2       0.0       0.0       0.0       0.0       0.0        0.0       0.0        0.0       0.0       0.0  ...     0.0        0.0        0.0       0.0      0.0       0.0       0.0      0.0           0.0    0.0
3       0.0       0.0       0.0       0.0    0.0164        0.0       0.0        0.0       0.0       0.0  ...     0.0        0.0        0.0       0.0      0.0       0.0       0.0      0.0           0.0    0.0
4       0.0       0.0  0.014172       0.0       0.0        0.0  0.011461        0.0       0.0       0.0  ...     0.0        0.0        0.0       0.0      0.0       0.0       0.0      0.0           0.0    0.0


In [43]:
len(df_gensim_tfidf)

1938

So far so good! We have our **1938 movie reviews** and a new **vocabulary size of 31836 tokens**.

We are now ready to choose a model and **build our new pipeline**. That's our step 5.

### Step 5

#### 1. Build pipeline to normalize text, vectorize it and train/fit the model

In [44]:
text_clf = Pipeline([
    ('normalizer', TextNormalizer()),
    ('vectorizer', GensimVectorizer()),
    ('clf_model', LinearSVC())
])

text_clf.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('normalizer',
                 TextNormalizer(language='english',
                                lemmatizer=<WordNetLemmatizer>, stemmer=None,
                                vectorizer='gensim')),
                ('vectorizer',
                 GensimVectorizer(dict_path=None, tfidf_path=None)),
                ('clf_model',
                 LinearSVC(C=1.0, class_weight=None, dual=True,
                           fit_intercept=True, intercept_scaling=1,
                           loss='squared_hinge', max_iter=1000,
                           multi_class='ovr', penalty='l2', random_state=None,
                           tol=0.0001, verbose=0))],
         verbose=False)

#### 2. Make predictions with the test set

In [45]:
predictions = text_clf.predict(X_test)

#### 3. Evaluate the predictions

In [46]:
# confusion matrix

print(confusion_matrix(y_test, predictions))

[[226  56]
 [ 43 257]]


In [47]:
# classification report

print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

         neg       0.84      0.80      0.82       282
         pos       0.82      0.86      0.84       300

    accuracy                           0.83       582
   macro avg       0.83      0.83      0.83       582
weighted avg       0.83      0.83      0.83       582



In [48]:
# accuracy score

print(accuracy_score(y_test, predictions))

0.8298969072164949


In our particular case, using the same algorithm as before (LinearSVC), we get a slightly lower accuracy: 83.0%, down from 84.9%.

Even though we've tried to minimize the loss of information by choosing lemmatization over stemming, it seems that some information was lost.
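To put the difference in perspective, the two confusion matrices (step 1 and step 5) can be compared directly. The helper `summarize` below is a made-up name for this sketch; the numbers are the ones printed above.

```python
# confusion matrices printed in step 1 (baseline) and step 5 (custom pipeline)
baseline = [[235, 47], [41, 259]]
normalized = [[226, 56], [43, 257]]

def summarize(cm):
    # returns (correct predictions, errors, accuracy) for a 2x2 confusion matrix
    total = sum(sum(row) for row in cm)
    correct = cm[0][0] + cm[1][1]
    return correct, total - correct, correct / total

base_correct, base_errors, base_acc = summarize(baseline)
norm_correct, norm_errors, norm_acc = summarize(normalized)

print(base_errors, norm_errors, norm_errors - base_errors)
print(round(base_acc, 4), round(norm_acc, 4))
```

The gap amounts to 11 additional misclassified reviews out of 582, most of them negative reviews predicted as positive (56 versus 47).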

Let's try to improve our result by **tuning some of our hyperparameters with GridSearchCV**.

### Step 6

#### 1. Perform necessary imports

In [49]:
from sklearn.model_selection import GridSearchCV
import tabulate

#### 2. Check parameters of our pipeline

In [50]:
#all parameters

text_clf.get_params()

{'memory': None,
 'steps': [('normalizer',
   TextNormalizer(language='english', lemmatizer=<WordNetLemmatizer>, stemmer=None,
                  vectorizer='gensim')),
  ('vectorizer', GensimVectorizer(dict_path=None, tfidf_path=None)),
  ('clf_model',
   LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
             intercept_scaling=1, loss='squared_hinge', max_iter=1000,
             multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
             verbose=0))],
 'verbose': False,
 'normalizer': TextNormalizer(language='english', lemmatizer=<WordNetLemmatizer>, stemmer=None,
                vectorizer='gensim'),
 'vectorizer': GensimVectorizer(dict_path=None, tfidf_path=None),
 'clf_model': LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
           intercept_scaling=1, loss='squared_hinge', max_iter=1000,
           multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
           verbose=0),
 'normalizer__language': 'english',

Since our loss of information should be related to our text normalization step, let's focus on this particular aspect of our pipeline.

In [51]:
text_clf.named_steps['normalizer'].get_params()

{'language': 'english',
 'lemmatizer': <WordNetLemmatizer>,
 'stemmer': None,
 'vectorizer': 'gensim'}

Let's see how **choosing lemmatization and/or stemming** affects our result.

#### 3. Tuning parameters with GridSearchCV

In [52]:
%%time

param_grid = [{
    'normalizer__lemmatizer': [WordNetLemmatizer(), None], 
    'normalizer__stemmer': [None, 'porter', 'snowball']
}]

grid_search = GridSearchCV(text_clf, param_grid, cv=3, scoring = 'accuracy', verbose=2)

grid_search.fit(X_train, y_train)

Fitting 3 folds for each of 6 candidates, totalling 18 fits
[CV] normalizer__lemmatizer=<WordNetLemmatizer>, normalizer__stemmer=None
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[CV]  normalizer__lemmatizer=<WordNetLemmatizer>, normalizer__stemmer=None, total= 1.2min
[CV] normalizer__lemmatizer=<WordNetLemmatizer>, normalizer__stemmer=None
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  1.2min remaining:    0.0s
[CV]  normalizer__lemmatizer=<WordNetLemmatizer>, normalizer__stemmer=None, total= 1.1min
[CV] normalizer__lemmatizer=<WordNetLemmatizer>, normalizer__stemmer=None
[CV]  normalizer__lemmatizer=<WordNetLemmatizer>, normalizer__stemmer=None, total= 1.1min
[CV] normalizer__lemmatizer=<WordNetLemmatizer>, normalizer__stemmer=porter
[CV]  normalizer__lemmatizer=<WordNetLemmatizer>, normalizer__stemmer=porter, total= 1.2min
[CV] normalizer__lemmatizer=<WordNetLemmatizer>, normalizer__stemmer=porter
[CV]  normalizer__lemmatizer=<WordNetLemmatizer>, normalizer__stemmer=porter, total= 1.3min
[CV] normalizer__lemmatizer=<WordNetLemmatizer>, normalizer__stemmer=porter
[CV]  normalizer__lemmatizer=<WordNetLemmatizer>, normalizer__stemmer=porter, total= 1.3min
[CV] normalizer__lemmatizer=<WordNetLemmatizer>, normalizer__stemmer=snowball
[CV]  normalizer__lemmatizer=<WordNetLemmatizer>, normalizer__stemmer=snowball, total= 2.6min
[CV] normalizer__lemmatizer=<WordNetLemmatizer>, normalizer__stemmer=snowball
[CV]  normalizer__lemmatizer=<WordNetLemmatizer>, normalizer__stemmer=snowball, total= 2.8min
[CV] normalizer__lemmatizer=<WordNetLemmatizer>, normalizer__stemmer=snowball
[CV]  normalizer__lemmatizer=<WordNetLemmatizer>, normalizer__stemmer=snowball, total= 2.8min
[CV] normalizer__lemmatizer=None, normalizer__stemmer=None ...........
[CV]  normalizer__lemmatizer=None, normalizer__stemmer=None, total= 3.5min
[CV] normalizer__lemmatizer=None, normalizer__stemmer=None ...........
[CV]  normalizer__lemmatizer=None, normalizer__stemmer=None, total= 1.1min
[CV] normalizer__lemmatizer=None, normalizer__stemmer=None ...........
[CV]  normalizer__lemmatizer=None, normalizer__stemmer=None, total= 1.1min
[CV] normalizer__lemmatizer=None, normalizer__stemmer=porter .........
[CV]  normalizer__lemmatizer=None, normalizer__stemmer=porter, total= 1.2min
[CV] normalizer__lemmatizer=None, normalizer__stemmer=porter .........
[CV]  normalizer__lemmatizer=None, normalizer__stemmer=porter, total= 1.2min
[CV] normalizer__lemmatizer=None, normalizer__stemmer=porter .........
[CV]  normalizer__lemmatizer=None, normalizer__stemmer=porter, total= 1.3min
[CV] normalizer__lemmatizer=None, normalizer__stemmer=snowball .......
[CV]  normalizer__lemmatizer=None, normalizer__stemmer=snowball, total= 1.2min
[CV] normalizer__lemmatizer=None, normalizer__stemmer=snowball .......
[CV]  normalizer__lemmatizer=None, normalizer__stemmer=snowball, total= 1.2min
[CV] normalizer__lemmatizer=None, normalizer__stemmer=snowball .......
[CV]  normalizer__lemmatizer=None, normalizer__stemmer=snowball, total= 1.2min
[Parallel(n_jobs=1)]: Done  18 out of  18 | elapsed: 28.2min finished

Wall time: 29min 14s


GridSearchCV(cv=3, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('normalizer',
                                        TextNormalizer(language='english',
                                                       lemmatizer=<WordNetLemmatizer>,
                                                       stemmer=None,
                                                       vectorizer='gensim')),
                                       ('vectorizer',
                                        GensimVectorizer(dict_path=None,
                                                         tfidf_path=None)),
                                       ('clf_model',
                                        LinearSVC(C=1.0, class_weight=None,
                                                  dual=True, fit_intercept=True,
                                                  intercep...
                                                  loss='squared_h

In [53]:
grid_search.best_params_

{'normalizer__lemmatizer': None, 'normalizer__stemmer': None}

As expected, the **best result is obtained without lemmatizing or stemming** the reviews.
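To make the size of this search concrete, here is a minimal sketch of how an exhaustive grid expands into individual fits. It uses only the standard library rather than scikit-learn itself. The parameter names and values mirror the log above (each name is the pipeline step name, a double underscore, then the parameter name), and the string `'wordnet'` is only a placeholder assumed here to stand in for the `WordNetLemmatizer` instance.

```python
from itertools import product

# Parameter values as they appear in the grid-search log.
# 'wordnet' is a placeholder string standing in for a WordNetLemmatizer instance.
param_grid = {
    'normalizer__lemmatizer': [None, 'wordnet'],
    'normalizer__stemmer': [None, 'porter', 'snowball'],
}
cv_folds = 3  # cv=3 in the GridSearchCV call

# Cartesian product of all parameter values: one dict per candidate setting
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

# Every candidate is trained and scored once per fold
total_fits = len(candidates) * cv_folds

for candidate in candidates:
    print(candidate)
print(f"{len(candidates)} candidates x {cv_folds} folds = {total_fits} fits")
```

The total of 18 fits agrees with the `Done  18 out of  18` line in the log, which is why the search takes close to half an hour at roughly a minute per fit.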

Let's check the results in more detail.

In [56]:
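import tabulate  # harmless if already imported earlier in the notebook

columns = ['lemmatizer', 'stemmer', 'accuracy']
table = []

cvres = grid_search.cv_results_

# sort on the score only, so tied scores never fall back to comparing the params dicts
scored_params = sorted(zip(cvres['mean_test_score'], cvres['params']),
                       key=lambda pair: pair[0], reverse=True)

# readable names for the stemmer settings used in the grid
stemmer_names = {'porter': 'PorterStemmer', 'snowball': 'SnowballStemmer'}

for mean_score, params in scored_params:

    # any non-None lemmatizer in this grid is the WordNetLemmatizer
    lem = 'WordNetLemmatizer' if params['normalizer__lemmatizer'] is not None else 'None'

    # anything other than 'porter' or 'snowball' means no stemming
    stem = stemmer_names.get(params['normalizer__stemmer'], 'None')

    table.append([lem, stem, mean_score])

print(tabulate.tabulate(table, headers=columns))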

lemmatizer         stemmer            accuracy
-----------------  ---------------  ----------
None               None               0.820059
WordNetLemmatizer  SnowballStemmer    0.814159
None               SnowballStemmer    0.811947
WordNetLemmatizer  None               0.811209
WordNetLemmatizer  PorterStemmer      0.808997
None               PorterStemmer      0.807522


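These results confirm what we said earlier: **no lemmatization and no stemming give us the best results**, because we **avoid the loss of information** associated with these techniques. The margin is small, though: the best and worst configurations differ by only about 1.3 percentage points (0.820 vs. 0.808).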

Interestingly enough, the **second best result** is obtained by **combining lemmatization with stemming** (SnowballStemmer).

The **worst** result is obtained by **only stemming** the reviews with PorterStemmer.

This means that, although they can be useful techniques in other scenarios, **for our particular case we should neither lemmatize nor stem our reviews**.

In other situations, such as large-scale information retrieval applications where maximizing recall matters, some combination of both might be useful.
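The trade-off between recall and information loss can be sketched with a toy example. The `crude_stem` function below is a hypothetical suffix stripper written only for this illustration; it is not the Porter or Snowball algorithm, and the three short documents are made up.

```python
def crude_stem(token):
    """Strip one common suffix. A toy stand-in, NOT the Porter or Snowball algorithm."""
    for suffix in ("ing", "ed", "s"):
        # keep at least three characters so short words are left alone
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[:-len(suffix)]
    return token


# Made-up documents for illustration only
documents = {
    "doc_a": "the acting was amazing",
    "doc_b": "he acted well",
    "doc_c": "a boring act",
}


def search(query, normalize=lambda token: token):
    """Return the ids of documents that contain the (normalized) query term."""
    target = normalize(query)
    return sorted(
        doc_id
        for doc_id, text in documents.items()
        if target in {normalize(token) for token in text.split()}
    )


exact = search("acting")                          # exact token match only
stemmed = search("acting", normalize=crude_stem)  # match on the shared stem

print("without stemming:", exact)
print("with stemming:   ", stemmed)
```

Without stemming, the query only finds the document containing the exact token; with stemming, "acting", "acted" and "act" collapse into one term and all three documents match. That gain in recall is what a search application wants, but for a classifier it also means the three word forms become a single, less specific feature.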

Based on these results, let's update our pipeline.

#### 4. Recreate pipeline (no lemmatization and no stemming)

In [57]:
text_clf = Pipeline([
    ('normalizer', TextNormalizer(lemmatizer=None)),
    ('vectorizer', GensimVectorizer()),
    ('clf_model', LinearSVC())
])

text_clf.fit(X_train, y_train)


Pipeline(memory=None,
         steps=[('normalizer',
                 TextNormalizer(language='english', lemmatizer=None,
                                stemmer=None, vectorizer='gensim')),
                ('vectorizer',
                 GensimVectorizer(dict_path=None, tfidf_path=None)),
                ('clf_model',
                 LinearSVC(C=1.0, class_weight=None, dual=True,
                           fit_intercept=True, intercept_scaling=1,
                           loss='squared_hinge', max_iter=1000,
                           multi_class='ovr', penalty='l2', random_state=None,
                           tol=0.0001, verbose=0))],
         verbose=False)

#### 5. Make predictions with the test set

In [58]:
predictions = text_clf.predict(X_test)

#### 6. Evaluate the predictions

In [59]:
# confusion matrix

print(confusion_matrix(y_test, predictions))

[[228  54]
 [ 42 258]]


In [60]:
# classification report

print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

         neg       0.84      0.81      0.83       282
         pos       0.83      0.86      0.84       300

    accuracy                           0.84       582
   macro avg       0.84      0.83      0.83       582
weighted avg       0.84      0.84      0.83       582



In [61]:
# accuracy score

print(accuracy_score(y_test, predictions))

0.8350515463917526
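The three outputs above are consistent with each other, which can be checked by hand. The sketch below recomputes the main metrics from the confusion matrix alone, assuming scikit-learn's convention that rows are true labels and columns are predicted labels, with the labels in sorted order (`neg`, then `pos`). The variable names are chosen here for illustration.

```python
# Confusion matrix copied from the output above.
# Rows are true labels, columns are predicted labels, in the order ('neg', 'pos').
matrix = [[228, 54],
          [42, 258]]

(true_neg, false_pos), (false_neg, true_pos) = matrix
total = true_neg + false_pos + false_neg + true_pos

# Share of all reviews that were classified correctly
accuracy = (true_neg + true_pos) / total

# Precision: of the reviews predicted as a class, how many really belong to it
precision_pos = true_pos / (true_pos + false_pos)
precision_neg = true_neg / (true_neg + false_neg)

# Recall: of the reviews that really belong to a class, how many were found
recall_pos = true_pos / (true_pos + false_neg)
recall_neg = true_neg / (true_neg + false_pos)

print(f"accuracy: {accuracy:.4f}")
print(f"neg  precision: {precision_neg:.2f}  recall: {recall_neg:.2f}")
print(f"pos  precision: {precision_pos:.2f}  recall: {recall_pos:.2f}")
```

The values match the classification report (0.84/0.81 for `neg`, 0.83/0.86 for `pos`), and the accuracy of 486/582 ≈ 0.835 is the figure the report rounds to 0.84.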


We can now try different classifiers to see whether we can improve on this result. This is our step 7.

### Step 7