## Objective

    
We are going to implement an algorithm to classify reviews depending on the opinion polarity
This processing is also called “sentiment analysis”. The main algorithm used here is the **Naive Bayes**.

**We will implement 2 versions of Naive Bayes : A scratch version and another from with scikit-learn and NLTK**

For this exercice, we use 2 set of files. One represents negative opinion and the other one represents positive opinion. 

## Data Source : KAGGLE
(search demonetization-tweets dataset)

---------------------------------
## From Scratch Implementation


In [127]:
import os
import os.path as op
import numpy as np
import math
import re
from tqdm import tqdm
import pandas as pd
import string
from collections import Counter
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from time import time
import matplotlib.pyplot as plt
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
%matplotlib inline

### Load the data

In [2]:
enc='latin-1'

# load the tweets
df_tweets = pd.read_csv('../sentimentdata/demonetization-tweets.csv', encoding=enc)

# load the words weight 
f = open('../sentimentdata/AFINN-111.txt', 'r')
word, weight = [], []
for line in f.readlines():
    word.append(line.split()[0])
    weight.append(line.split()[1])

df_words_weight = pd.DataFrame.from_dict({'word':word, 'weight':weight})

# load the reviews with trailing whitespace removed 
positive_reviews = [line.rstrip('\n') for line in open('../sentimentdata/rt-polarity.pos', encoding=enc)]
negative_reviews = [line.rstrip('\n') for line in open('../sentimentdata/rt-polarity.neg', encoding=enc)]

print('Number of positive reviews:', len(positive_reviews))
print('\nA Sample of positive review:\n', positive_reviews[0])
print('\nNumber of negative reviews:', len(negative_reviews))
print('\nA Sample of negative review:\n', negative_reviews[0])

Number of positive reviews: 5331

A Sample of positive review:
 the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal . 

Number of negative reviews: 5331

A Sample of negative review:
 simplistic , silly and tedious . 


### After loading data set, we are going to clean them by removing punctuation and stop words

In [3]:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

def clean_text(input_reviews):
    """Clean a list of reviews: return text cleaned, without punctuation, 
    special caracters and a set of stop words in english
    
    Parameters
    ----------
    input_reviews: list. 
        list of text
     
    Returns
    -------
    cleaned_reviews: list. 
        List of cleaned text
    """
    cleaned_reviews = []
    # remove punctuation
    regex_no_punct = re.compile('[%s]' % re.escape(string.punctuation))
    reviews = [regex_no_punct.sub('', each_review) for each_review in input_reviews]
   
    # load and process stop words
    stopWords = set(stopwords.words('english'))
    for each_review in reviews:
        new_text=[]
        for word in each_review.split():
            if word not in stopWords:
                new_text.append(word)
        cleaned_reviews.append(new_text)
            
    return cleaned_reviews

positive_reviews_c = clean_text(positive_reviews)
negative_reviews_c = clean_text(negative_reviews)

### Now, we would like to predict for the reviews which one is positive or negative.

As we do not have a test set, we will prepare a training and a test set.

First, we produce the classification target for these documents
The negative opinion will be represented by class 0 and the positive opinion will be represented by class 1.

In [4]:
all_reviews = negative_reviews_c + positive_reviews_c
print(len(all_reviews))
y = np.ones(len(all_reviews), dtype=np.int)
y[:len(negative_reviews_c)] = 0

10662


Then, we count the occurences number of each word in the reviews and we build a vocabulary.  For that, i define a function count_words as follow. It takes few minutes, don't worry.

In [5]:
%%time
def count_words(input_reviews):
    """Vectorize text : return count of each word in the text snippets

    Parameters
    ----------
    input_reviews : list of str
        The texts

    Returns
    -------
    dict_words_and_count : dict
        A dictionary that points to an index in counts for each word.
    counts : ndarray, shape (n_samples, n_features)
        The counts of each word in each text.
        n_samples == number of documents.
        n_features == number of words in vocabulary.
    """
    
    # get the number of occurences of each word in the whole reviews 
    dict_words_and_count = Counter(all_reviews[0])
    for each_review in all_reviews[1:]:
        dict_words_and_count += Counter(each_review)

    # get all the words that build the vocabulary
    vocabulary = dict_words_and_count.keys()

    # the words are the features and the reviews are the samples
    # get their size
    n_samples, n_features = len(all_reviews), len(vocabulary)
    
    counts = np.zeros(shape=(n_samples, n_features), dtype=int)

    for j, each_review in enumerate(all_reviews):
        for i, word in enumerate(vocabulary):
            counts[j, i] = each_review.count(word)
    print("n_samples = %d, n_features = %d"%( n_samples, n_features))
    return dict_words_and_count, counts


# Count words in text
dict_words_and_count , X = count_words(all_reviews) 
VOCABULARY=list(dict_words_and_count.keys())

n_samples = 10662, n_features = 20360
CPU times: user 1min 55s, sys: 2.19 s, total: 1min 58s
Wall time: 2min


 We split our dataset in training and testing set and we verify that they are quite balanced


In [6]:
np.random.seed(29)
rand_index = np.random.permutation(X.shape[0])

X_test, X_train = X[rand_index[:2000]], X[rand_index[2000:]]
y_test, y_train = y[rand_index[:2000]], y[rand_index[2000:]]

print(len(y_train[y_train==0]), len(y_train[y_train==1]))

4344 4318


### We build our class NB in order to implement the Naive Bayes classifier from scratch

In [7]:
class NB(BaseEstimator, ClassifierMixin):

    def __init__(self):
        # init train data vocabulary
        self.V = VOCABULARY
        
        # init length of the train data vocabulary
        self.nDistinctWordsInV = len(self.V)
        
        # init 2 arrays of conditional probability of each word knowing its class 
        # 2 arrays as we know there are positive and negative classes
        self.array_condprob_c_neg = np.ndarray(shape=(self.nDistinctWordsInV), dtype=float)
        self.array_condprob_c_pos = np.ndarray(shape=(self.nDistinctWordsInV), dtype=float)
        
        # init prior probability of each class
        self.prior_c_neg = 0
        self.prior_c_pos = 0
        
        # init the highest prior probabilities of the different classes
        self.prior_c = 0 
        
    def fit(self, X, y):
        
        # set negative and positive texts
        self.X = X
        self.y = y 
        
        # get the number of the documents of test data : n_samples
        # get the number of unique words of the whole test data : n_features
        n_samples, n_features = self.X.shape   
        
        # Compute prior probability of the existing classes
        # Here, class 0 for negative critics and class 1 for positive critics
        self.prior_c = int(n_samples / 2)           
        
        X_neg = X[y == 0] #X_[0:self.prior_c]
        X_pos = X[y == 1] #X_[-self.prior_c:]
        
        # Compute total number of words in negative reviews and in positive reviews
        nTotal_words_in_class_neg = np.sum(X_neg)
        nTotal_words_in_class_pos = np.sum(X_pos)

        for word_id in tqdm(range(0, n_features)):
            
            # compute the conditional probability of each word knowing its class
            condprob_word_in_c_neg = (np.sum(X_neg[:, word_id]) + 1) / (nTotal_words_in_class_neg + self.nDistinctWordsInV)
            condprob_word_in_c_pos = (np.sum(X_pos[:, word_id]) + 1) / (nTotal_words_in_class_pos + self.nDistinctWordsInV)
            
            # save the results in arrays
            self.array_condprob_c_neg[word_id] = condprob_word_in_c_neg
            self.array_condprob_c_pos[word_id] = condprob_word_in_c_pos     
                

    # Function predict
    def predict(self, X):
        # set negative and positive texts
        n_sample_test , n_features_test = X.shape
        
        # For each document in the test data
        for doc_id in tqdm(range(n_sample_test)):
        
            # compute the log prior propability of the class knowing all the classes
            self.prior_c_neg = math.log(self.prior_c)
            self.prior_c_pos = math.log(self.prior_c)
            
            # get the count of unique words in a given document to predict
            Xdoc_test = X[doc_id, :]
            
            # compute the log prior probability of the classes of the document knowing its vocabulary
            for word_id in range(n_features_test):

                self.prior_c_neg +=  Xdoc_test[word_id] * math.log(self.array_condprob_c_neg[word_id])
                self.prior_c_pos +=  Xdoc_test[word_id] * math.log(self.array_condprob_c_pos[word_id])
              
                #self.prior_c_neg +=  Xdoc_test[word_id] * math.log(self.vocabulary_condprob_c_neg[word_id])
                #self.prior_c_pos +=  Xdoc_test[word_id] * math.log(self.vocabulary_condprob_c_pos[word_id])
              
            # take the highest log prior probability and assign its corresponding class to the document to predict
            if self.prior_c_neg > self.prior_c_pos:
                self.y_hat[doc_id] = 0
            else : 
                self.y_hat[doc_id] = 1
                
    
    def score(self, X, y):
        # set dimension of predictions variable
        self.y_hat = np.ones(len(y), dtype=np.int)
        
        # do predictions
        self.predict(X)
        
        y = y.astype(int)
        score = np.mean(np.isclose(y, self.y_hat))
        return score


### Evaluate the performance of our model with a 5-folds cross-validation and see the accuracy of our classifier 

In [9]:
# Cross-validation with training data:
#   - X : matrix of words occurrences per review
#   - y : labels (0 for negative  , 1 for positive)

nb = NB()
n_fold_5 = StratifiedKFold(n_splits=5, shuffle=True)
scores = cross_val_score(nb, X_train, y_train, cv=n_fold_5)
print('Without stop words, the scores with cross validation in 5 splits are : ' + str(scores))
print("==========================================================================================")
mean_scores_cv_ = np.mean(scores)
print("Mean score without stop words = ", mean_scores_cv_)


100%|██████████| 20360/20360 [00:02<00:00, 8603.07it/s]
100%|██████████| 1733/1733 [04:22<00:00,  7.64it/s]
100%|██████████| 20360/20360 [00:02<00:00, 10058.98it/s]
100%|██████████| 1733/1733 [03:51<00:00,  7.74it/s]
100%|██████████| 20360/20360 [00:02<00:00, 8145.12it/s]
100%|██████████| 1733/1733 [03:53<00:00,  7.64it/s]
100%|██████████| 20360/20360 [00:01<00:00, 10391.41it/s]
100%|██████████| 1732/1732 [03:51<00:00,  7.88it/s]
100%|██████████| 20360/20360 [00:02<00:00, 10017.34it/s]
100%|██████████| 1731/1731 [04:06<00:00,  6.97it/s]


Without stop words, the scores with cross validation in 5 splits are : [ 0.76457011  0.76283901  0.76399308  0.76847575  0.76487579]
Mean score without stop words =  0.764950747529


###  Let's fit on the whole trainining set and score on test data.

In [11]:
nb = NB()
nb.fit(X_train, y_train)
score_test = nb.score(X_test, y_test)*100
#print('Without Stop words, the score for the test is %.4f%%'%()

100%|██████████| 20360/20360 [00:03<00:00, 6343.59it/s]
100%|██████████| 2000/2000 [04:52<00:00,  7.48it/s]


In [12]:
score_test

77.549999999999997

---------------------------------
## Scikit-learn and NLTK  implementation


### We perform a cross validation, taking into account the stop words and punctuation

In [132]:
%%time
# Prepare the data with stop words and punctuation
all_reviews_skl = negative_reviews + positive_reviews
train_in_skl = list(pd.Series(all_reviews_skl).loc[rand_index[2000:]].values)
test_in_skl =  list(pd.Series(all_reviews_skl).loc[rand_index[:2000]].values)

# Vectorize the data
counts_vecto = CountVectorizer()
counts_vecto.fit(train_in_skl)
X_train_skl = counts_vecto.transform(train_in_skl)
X_test_skl = counts_vecto.transform(test_in_skl)

# Score and predict
nb_skl = MultinomialNB()
scores_skl = cross_val_score(nb_skl, X_train_skl, y_train, cv=n_fold_5)
skl_scores_cv_ =  np.mean(scores_skl)

nb_skl2 = MultinomialNB()
nb_skl2.fit(X_train_skl, y_train)


CPU times: user 482 ms, sys: 37.6 ms, total: 519 ms
Wall time: 553 ms


In [133]:
# get vocabulary keys root
print('Size of vocabulary with punctuation and stop words: ', len(counts_vecto.vocabulary_.keys()))
print('Mean CV score in %.3f%% ,Test Score %.3f%%'%(skl_scores_cv_*100, nb_skl2.score(X_test_skl, y_test)*100))
# get vocabulary keys root
roots_ = counts_vecto.vocabulary_.keys()
print('Size of initial vocabulary : ', len(roots_))

Size of vocabulary with punctuation and stop words:  16641
Mean CV score in 76.703% ,Test Score 78.650%
Size of initial vocabulary :  16641


**scikit-learn** classifier goes much faster than the manual ones as it is optimized
### Repeat the tasks without stop words and punctuation 

In [131]:

# Vectorize the data
counts_vecto = CountVectorizer(strip_accents='ascii', stop_words='english', analyzer='word')
counts_vecto.fit(train_in_skl)
X_train_skl = counts_vecto.transform(train_in_skl)
X_test_skl = counts_vecto.transform(test_in_skl)

# Score and predict
nb_skl = MultinomialNB()
scores_skl = cross_val_score(nb_skl, X_train_skl, y_train, cv=n_fold_5)
skl_scores_cv_ =  np.mean(scores_skl)

nb_skl2 = MultinomialNB()
nb_skl2.fit(X_train_skl, y_train)

print('Mean CV score %.3f%% , Test Score %.3f%%'%( skl_scores_cv_*100, nb_skl2.score(X_test_skl, y_test)*100))
# get vocabulary keys root
roots_ = counts_vecto.vocabulary_.keys()
print('Size of vocabulary without stop words: ', len(roots_))

Mean CV score 75.987% , Test Score 77.050%
Size of vocabulary without stop words:  16325


### We use NLTK to do a stemming with the class *SnowballStemmer*.
we keep the stop words

In [129]:
import nltk
from nltk import SnowballStemmer

def GetWordsRoots(text):
    
    wordsList = text.split()
    for word in wordsList:
        yield(SnowballStemmer("english", ignore_stopwords=False).stem(word))
    
counts_vecto_stem = CountVectorizer(analyzer = GetWordsRoots).fit(train_in_skl)
X_train_skl = counts_vecto_stem.transform(train_in_skl)
X_test_skl = counts_vecto_stem.transform(test_in_skl)

# Score and predict
nb_skl = MultinomialNB()
scores_skl = cross_val_score(nb_skl, X_train_skl, y_train, cv=n_fold_5)
skl_scores_cv_ =  np.mean(scores_skl)

nb_skl2 = MultinomialNB()
nb_skl2.fit(X_train_skl, y_train)

print('Mean CV score %.3f%% , Test Score %.3f%%'%( skl_scores_cv_*100, nb_skl2.score(X_test_skl, y_test)*100))
# Vectorize input data and fit

# get vocabulary keys root
roots_ = counts_vecto_stem.vocabulary_.keys()
print('Size of vocabulary with stemming and without stop words: ', len(roots_))

Mean CV score 76.518% , Test Score 78.800%
Size of vocabulary with stemming and without stop words:  13297


In [139]:
from sklearn.linear_model import LogisticRegression



**Stemming reduces the number of words . We have 13297 instead of 16325 and the score has improved. **

### We filter by grammatical category (POS: Part Of Speech) and keep only the names, verbs, adverbs and adjectives for the classification

#### http://www.nltk.org/book/ch05.html

In [140]:
from nltk import pos_tag
#import nltk
#nltk.download()

def GetSignificantWords(texts):
    words_and_tags = nltk.pos_tag(nltk.word_tokenize(texts))
    for word in words_and_tags:
        if word[1] in ['JJ', 'JJR', 'JJS','NN', 'NNS', 'NNP', 'NNPS','RB', 'RBR','RBS',\
                        'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']:
            yield word[0]

# Vectorize input data and fit
counts_vecto_pos = CountVectorizer(analyzer = GetSignificantWords).fit(train_in_skl)
X_train_skl = counts_vecto_pos.transform(train_in_skl)
X_test_skl = counts_vecto_pos.transform(test_in_skl)

# Score and predict
nb_skl = MultinomialNB()
scores_skl = cross_val_score(nb_skl, X_train_skl, y_train, cv=n_fold_5)
skl_scores_cv_ =  np.mean(scores_skl)

nb_skl2 = MultinomialNB()
nb_skl2.fit(X_train_skl, y_train)

print('NB: Mean CV score %.3f%% , Test Score %.3f%%'%( skl_scores_cv_*100, nb_skl2.score(X_test_skl, y_test)*100))

# Vectorize input data and fit

# get vocabulary keys root
significant_words = counts_vecto.vocabulary_.keys()
print('Size of vocabulary with POS: ', len(significant_words))

NB: Mean CV score 76.114% , Test Score 77.500%
LR: Mean CV score 76.114% , Test Score 75.950%
Size of vocabulary with POS:  16641


** POS reduces the number of tokens but less than Stemming . We have 38733 instead of 39443. The score is lower than stemming score and than manual score.**