# **US Airlines Tweets Sentiment Classification**

### **1. Problem Statement**
In this problem, a sentiment analysis is performed on US Airlines from twitter data
available @ https://www.kaggle.com/crowdflower/twitter-airline-sentiment with Naïve Bayes classifier. The tasks involved in this project are as follows:

* Build a dictionary based on your training corpus. Calculate conditional probability of each token for each class (this is also called unigram probability). Then evaluate on test data and report accuracy.
* Try to improve your algorithm. Some suggestions: <br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;i. Remove STOP words from the vocabulary that appear vary frequently but not related to the attitude or opinion of the writer. <br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;ii. Reduce the size of your vocabulary further by taking only top-k frequent word types that appear in the training dataset. Vary k and compare performance.

### **2. Importing Libraries & Loading Data**

In [None]:
#importing the required libraries & packages
import pandas as pd
import numpy as np
import time
import argparse
import string
from sklearn.model_selection import train_test_split
from nltk.tokenize import regexp_tokenize
from datetime import datetime
import pytz
from sklearn import metrics

In [None]:
#loading the US Airline Sentiment data
data_frame = pd.read_csv('Tweets.csv')
data_frame.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


In [None]:
data_frame.shape()

(14640, 15)

### **3. Data Preprocessing**

The US Airline Sentiment dataset has so many features. Among them, in our project we will be working with the 'text' & 'airline_sentiment' features. Here, the 'text' is considered as a feature (X) and the 'airline_sentiment' as a target label (y) for the classifier.

The values in the target data are categorical and are 'neutral', 'positive' & 'negative'. For the easy computation, we are replacing the 'neutral', 'positive' & 'negative' with 0, 1 & 2 values respectively.

In [None]:
#repalcing the categorical values of 'airline_sentiment' to numeric values
data_frame['airline_sentiment'].replace(('neutral', 'positive', 'negative'), (0, 1, 2), inplace=True)
data_frame['airline_sentiment'].value_counts()
#most of the data convergence to negative

2    9178
0    3099
1    2363
Name: airline_sentiment, dtype: int64

In [None]:
#forming the feature & label variables
data = data_frame['text'].values.tolist() #as x border
labels = data_frame['airline_sentiment'].values.tolist() #as y border
#because we are working with pandas library we need to make a list and import our data into a list
#also we did feature extraction and chose more important features like text and sentiment

In [None]:
#First five samples text
#take output to see whether we did it right or not
data[:5]

['@VirginAmerica What @dhepburn said.',
 "@VirginAmerica plus you've added commercials to the experience... tacky.",
 "@VirginAmerica I didn't today... Must mean I need to take another trip!",
 '@VirginAmerica it\'s really aggressive to blast obnoxious "entertainment" in your guests\' faces &amp; they have little recourse',
 "@VirginAmerica and it's a really big bad thing about it"]

In [None]:
#first 5 samples label
#considering labels 
labels[:5]

[0, 1, 0, 2, 2]

### **4. Splitting the data for Classification**

The data splitting is done in 80-20 split using trian_test_split method of sklearn.

In [None]:
#splitting the data into 80 and 20 split
train_X, test_X, y_train, y_test = train_test_split(data, labels, test_size=0.2, 
                                                    random_state=42, shuffle=True)

print(f'Number of training examples: {len(train_X)}')
print(f'Number of testing examples: {len(test_X)}')

Number of training examples: 11712
Number of testing examples: 2928


### **5. Text Preprocessing**

A process of transforming text into something an algorithm can digest is text processing. This includes:
* &nbsp;tokenizing the data
* &nbsp;removing the punctuation
* &nbsp;removing the stopwords
* &nbsp;stemming 
* &nbsp;lemmatization

As of now, we are only going to tokenize the data and work with it without removing the punctuation or stop words and apply any other text processing methods.

In [None]:
# Here is a default pattern for tokenization
default_pattern =  r"""(?x)                  
                        (?:[A-Z]\.)+          
                        |\$?\d+(?:\.\d+)?%?    
                        |\w+(?:[-']\w+)*      
                        |\.\.\.               
                        |(?:[.,;"'?():-_`])    
                    """

In [None]:
#funtion for tokenizing the data
""" Tokenize sentence with specific pattern
Arguments: text {str} -- sentence to be tokenized, such as "I love NLP"
Keyword Arguments: pattern {str} -- reg-expression pattern for tokenizer (default: {default_pattern})
Returns: list -- list of tokenized words, such as ['I', 'love', 'nlp'] """
def tokenize(text, pattern = default_pattern):

  text = text.lower()
  return regexp_tokenize(text, pattern)

In [None]:
# Tokenize training text into tokens
tokenized_text = []
for i in range(0, len(train_X)):
    tokenized_text.append(tokenize(train_X[i]))

X_train = tokenized_text

# Tokenize testing text into tokens
tokenized_text = []
for i in range(0, len(test_X)):
    tokenized_text.append(tokenize(test_X[i]))

X_test = tokenized_text

In [None]:
#tokenized train & test data
print(X_train[0], X_train[1])
print(X_test[0])

['@', 'united', 'you', 'are', 'offering', 'us', '8', 'rooms', 'for', '32', 'people', 'fail'] ['@', 'jetblue', 'jfk', 'nyc', 'staff', 'is', 'amazing', '.', 'the', 'lax', 'jetblue', '...', 'sending', 'an', 'email', 'with', 'details', 'but', 'it', 'was', 'a', 'disappointing', 'experience', '@', 'jetbluecheeps']
['@', 'southwestair', "you're", 'my', 'early', 'frontrunner', 'for', 'best', 'airline', 'oscars2016']


### **6. Building Dictionary**

Building dictionary of the training data.

In [None]:
#building dictionary
def createDictionary(data):
  """ Function: To create a dictionary of tokens from the data
  Arguments: data in the type - list
  Returns: Sorted dictionary of the tokens and their count in the data """

  dictionary = dict()
  for sample in  data:
    for token in sample:
      dictionary[token] = dictionary.get(token, 0) + 1
  #sorting the dictionary based on the values
  sorted_dict = sorted(dictionary.items(), key=lambda x: x[1], reverse=True)
  return dict(sorted_dict)

In [None]:
bog = createDictionary(X_train)
#top 10 items in the dictionary
print("Top 10 tokens in the training dictionary:\n")
list(bog.items())[:10]

Top 10 tokens in the training dictionary:



[('@', 13290),
 ('.', 12534),
 ('to', 6858),
 ('the', 4856),
 ('i', 4385),
 ('?', 3729),
 ('a', 3619),
 (',', 3354),
 ('united', 3338),
 ('you', 3284)]

### **7. Building the Navie Bayes Classifier**

Now, we define text classifier class called NBClassifier, which comprises of three functions:
* createDictionary()
* fit()
* predict()
* score()

**createDictionary():** This function takes in the tokenized text data and gives out the dictionary or the bag of words of the data.

**fit():** This function has all the word counts required to calculate the Navie Bayes Classifier probabilities and then fits the classifier on our training data.

**predict():** The test data is inputed to this function which determines the sentiment label based of each tweet by using the word counts computed during the training process (from fit function). In this step, Laplace smoothing is applied while computing Naïve Bayes probabilities for the test data.

**score():** Determine how many tweets are classified correctly and measures the performance of the model in terms of accuracy.

In [None]:
#Navie Bayes Classifier 
class NBClassifier:

    def __init__(self, X_train, y_train, size):
      tz_TH = pytz.timezone('Asia/Tehran') 
      print("Model Start Time:", datetime.now(tz_TH).strftime("%H:%M:%S"))
      self.X_train = X_train
      self.y_train = y_train
      self.size = size

    def createDictionary(self):
      """ Function: To create a dictionary of tokens from the data
      Arguments: data in the type - list
      Returns: Sorted dictionary of the tokens and their count in the data """
      dictionary = dict()
      for sample in  X_train:
        for token in sample:
          dictionary[token] = dictionary.get(token, 0) + 1
      #sorting the dictionary based on the values
      sorted_dict = sorted(dictionary.items(), key=lambda x: x[1], reverse=True)
      return dict(sorted_dict)
    
    def fit(self):
      """ Function: To compute the count of words in training data dictionary
        Arguments: Trianing data & Size of dictionary
        Returns: dictionary of tokens with their class wise probabilities """
      
      X_train_dict = self.createDictionary()
      if self.size == 'full':
        self.words_list = list(X_train_dict.keys())
        self.words_count = dict.fromkeys(self.words_list, None)
      else:
        self.words_list = list(X_train_dict.keys())[:int(self.size)]
        self.words_count = dict.fromkeys(self.words_list, None)
            
      #DataFrame of training data
      train = pd.DataFrame(columns = ['X_train', 'y_train'])
      train['X_train'] = X_train
      train['y_train'] = y_train

      train_0 = train.copy()[train['y_train'] == 0]
      train_1 = train.copy()[train['y_train'] == 1]
      train_2 = train.copy()[train['y_train'] == 2]

      #computing the prior of each class
      Pr0 = train_0.shape[0]/train.shape[0]
      Pr1 = train_1.shape[0]/train.shape[0]
      Pr2 = train_2.shape[0]/train.shape[0]
      
      self.Prior = np.array([Pr0, Pr1, Pr2])
        
      #converting list of lists into a list
      def flatList(listOfList):
        flatten = []
        for elem in listOfList:
          flatten.extend(elem)
        return flatten
  
      #Creating the data list for each class - tokens of each class
      X_train_0 = flatList(train[train['y_train'] == 0]['X_train'].tolist())
      X_train_1 = flatList(train[train['y_train'] == 1]['X_train'].tolist())
      X_train_2 = flatList(train[train['y_train'] == 2]['X_train'].tolist())
    
      self.X_train_len = np.array([len(X_train_0), len(X_train_1), len(X_train_2)])

      for token in self.words_list:
        #list to store three word counts of a token
        res = []

        #inserting count of token in class 0: Neutral
        res.insert(0, X_train_0.count(token))

        #inserting count of token in class 1: Positive
        res.insert(1, X_train_1.count(token))

          #inserting count of token in class 2: Negative
        res.insert(2, X_train_2.count(token))

        #assigning the count list to its token in the dictionary 
        self.words_count[token] = res
      return self

    def predict(self, X_test):
      """ Function: Predicts the label of the data
        Arguments: self and the test data
        Returns: List of predicted labels for the test data """     
      pred = []
      for sample in X_test:
        mul = np.array([1,1,1])
        for tokens in sample:
          vocab_count = len(self.words_list)
          if tokens in self.words_list:
            prob = ((np.array(self.words_count[tokens])+1) / (self.X_train_len + vocab_count))
          #except:
            #prob = ((np.array([0,0,0])+1) / (self.X_train_len + vocab_count))
          mul = mul * prob
        val = mul * self.Prior
        pred.append(np.argmax(val))
      tz_TH = pytz.timezone('Asia/Tehran') 
      print("Model End Time:", datetime.now(tz_TH).strftime("%H:%M:%S"))
      return pred
    
    def score(self, pred, labels):
      """ Function: To compute the perfoemance of the model
        Arguments: self, predicted labels and actual labels of the test data
        Returns: Number of lables correctly predicted and the accuracy of the model """
      correct = (np.array(pred) == np.array(labels)).sum()
      accuracy = correct/len(pred)
      return correct, accuracy

### **8. Navie Bayes Classifier Training and Evaluation**

The Navie Bayes Classifier, NBClassifier takes three arguments:
* X_train: Features of training dataset
* y_train: Labels of training dataset
* size: Size of vacabulary to be used in the model

All three arguments are needed for the model to work.

In [None]:
# Creating holders to store the model performance results
attributes = []
corr = []
acc = []

#function to call for storing the results
def storeResults(attr, cor,ac):
  attributes.append(attr)
  corr.append(round(cor, 3))
  acc.append(round(ac, 3))

In [None]:
#training the classifier     
nb = NBClassifier(X_train, y_train, 'full')  
nb.fit()

#predicting the labels for test samples
y_pred = nb.predict(X_test)

#Checking
print("NBClassifier Model miss any prediction???", len(X_test) != len(y_pred))

Model Start Time: 22:05:03
Model End Time: 22:05:42
NBClassifier Model miss any prediction??? False


In [None]:
#Performance of the classifier
cor1, acc1 = nb.score(y_pred, y_test)
print(metrics.classification_report(y_test, y_pred,zero_division=1))
print("Count of Correct Predictions:", cor1)
print("Accuracy of the model: %i / %i = %.4f " %(cor1, len(y_pred), acc1))

              precision    recall  f1-score   support

           0       0.66      0.40      0.50       580
           1       0.86      0.53      0.66       459
           2       0.79      0.96      0.87      1889

    accuracy                           0.78      2928
   macro avg       0.77      0.63      0.67      2928
weighted avg       0.78      0.78      0.76      2928

Count of Correct Predictions: 2291
Accuracy of the model: 2291 / 2928 = 0.7824 


In [None]:
#storing the results. The below mentioned order of parameter passing is important.
#Caution: Execute only once to avoid duplications.
storeResults('Unprocessed Data', cor1, acc1)

The Navie Bayers Classifier that we trainined on the data predicts 78.24% of samples correctly. Now, to improve this number few more text processing methods are appilied on the training data and then the classifier is trained on this  modified data to predict the sentiment of the test samples.



### **9. Trying to improve the NBClassifier**

To improve the performance of the NBClassifier, 
* apply other text processing methods
* reduce the size of dictonary


#### **9.1. Further Processing Text Data**

In this step, we are going to apply two text processing methods on the previously tokenized data:
* remove the punctuation 
* remove stop words

##### **9.1.1. Remove Puntuation**

In [None]:
#string of punctiations
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [None]:
#Removing the punctuation
'''Function: Removes the punctuation from the tokens
   Arguments: list of text data samples
   Returns: list of tokens of each sample without punctuation '''
def removePunctuation(data):
    update = []
    for sample in data:
        #removing punctuation from the tokens
        re_punct = [''.join(char for char in word if char not in string.punctuation) for word in sample]
        #removes the empty strings
        re_punct = [word for word in re_punct if word]
       
        update.append(re_punct)
    return update

In [None]:
#Removing punctuation from training data text tokens  
X_train_P = removePunctuation(X_train)

#Removing punctuation from testing data text tokens
X_test_P = removePunctuation(X_test)

#train & test data after removing punctuation
print(X_train_P[0])
print(X_test_P[0])

['united', 'you', 'are', 'offering', 'us', '8', 'rooms', 'for', '32', 'people', 'fail']
['southwestair', 'youre', 'my', 'early', 'frontrunner', 'for', 'best', 'airline', 'oscars2016']


In [None]:
#training the classifier     
nb_punct = NBClassifier(X_train_P, y_train, 'full')
nb_punct.fit()

#predicting the labels for test samples
y_pred_P = nb_punct.predict(X_test_P)

#Checking
print("NBClassifier Model miss any prediction???", len(X_test) != len(y_pred_P))

Model Start Time: 22:05:43
Model End Time: 22:06:21
NBClassifier Model miss any prediction??? False


In [None]:
#Performance of the classifier
cor2, acc2 = nb_punct.score(y_pred_P, y_test)
print(metrics.classification_report(y_test, y_pred_P,zero_division=1))
print("Count of Correct Predictions:", cor2)
print("Accuracy of the model: %i / %i = %.4f " %(cor2, len(y_pred_P), acc2))

              precision    recall  f1-score   support

           0       0.66      0.39      0.49       580
           1       0.84      0.54      0.66       459
           2       0.79      0.96      0.87      1889

    accuracy                           0.78      2928
   macro avg       0.76      0.63      0.67      2928
weighted avg       0.77      0.78      0.76      2928

Count of Correct Predictions: 2285
Accuracy of the model: 2285 / 2928 = 0.7804 


In [None]:
#storing the results. The below mentioned order of parameter passing is important.
#Caution: Execute only once to avoid duplications.
storeResults('No Punctuation Data', cor2, acc2)

##### **9.1.2. Remove Stopwords**

In [None]:
'''Function: Removes the stopwords from the tokens
   Arguments: list of text data samples
   Returns: list of tokens of each sample without punctuation '''
def removeStopWords(data):
    update = []
    stopwords = ['the', 'at','i', 'of', 'us', 'have', 'a', 'you','ours', 'themselves', 
                 'that', 'this', 'be', 'is', 'for']
    for sample in data:
        #removing stopwords from tokenized data
        re_stop = [word for word in sample if word not in stopwords]
        
        update.append(re_stop)
    return update

In [None]:
#Removing stopwords from training data text tokens  
X_train_S = removeStopWords(X_train)

#Removing stopwords from testing data text tokens
X_test_S = removeStopWords(X_test)

#train & test data after removing stopwords
print(X_train_S[0])
print(X_test_S[0])

['@', 'united', 'are', 'offering', '8', 'rooms', '32', 'people', 'fail']
['@', 'southwestair', "you're", 'my', 'early', 'frontrunner', 'best', 'airline', 'oscars2016']


In [None]:
#training the classifier     
nb_stop = NBClassifier(X_train_S, y_train, 'full')
nb_stop.fit()

#predicting the labels for test samples
y_pred_S = nb_stop.predict(X_test_S)

#Checking
print("NBClassifier Model miss any prediction???", len(X_test) != len(y_pred_S))

Model Start Time: 22:06:21
Model End Time: 22:07:00
NBClassifier Model miss any prediction??? False


In [None]:
#Performance of the classifier
cor3, acc3 = nb_stop.score(y_pred_S, y_test)
print(metrics.classification_report(y_test, y_pred_S,zero_division=1))
print("Count of Correct Predictions:", cor3)
print("Accuracy of the model: %i / %i = %.4f " %(cor3, len(y_pred_S), acc3))

              precision    recall  f1-score   support

           0       0.65      0.45      0.53       580
           1       0.84      0.54      0.66       459
           2       0.80      0.95      0.87      1889

    accuracy                           0.79      2928
   macro avg       0.76      0.65      0.69      2928
weighted avg       0.78      0.79      0.77      2928

Count of Correct Predictions: 2300
Accuracy of the model: 2300 / 2928 = 0.7855 


In [None]:
#storing the results. The below mentioned order of parameter passing is important.
#Caution: Execute only once to avoid duplications.
storeResults('Removed few Stopwords', cor3, acc3)

##### **9.1.3. Removing both Punctuation & Few Stopwords**

In [None]:
#Removing stopwords from training data text tokens  
X_train_PS = removeStopWords(X_train_P)

#Removing stopwords from testing data text tokens
X_test_PS = removeStopWords(X_test_P)

#train & test data after removing stopwords
print(X_train_PS[0])
print(X_test_PS[0])

['united', 'are', 'offering', '8', 'rooms', '32', 'people', 'fail']
['southwestair', 'youre', 'my', 'early', 'frontrunner', 'best', 'airline', 'oscars2016']


In [None]:
#training the classifier     
nb_PS = NBClassifier(X_train_PS, y_train, 'full')
nb_PS.fit()

#predicting the labels for test samples
y_pred_PS = nb_PS.predict(X_test_PS)

#Checking
print("NBClassifier Model miss any prediction???", len(X_test) != len(y_pred_PS))

Model Start Time: 22:07:00
Model End Time: 22:07:39
NBClassifier Model miss any prediction??? False


In [None]:
#Performance of the classifier
cor4, acc4 = nb_PS.score(y_pred_PS, y_test)
print(metrics.classification_report(y_test, y_pred_PS,zero_division=1))
print("Count of Correct Predictions:", cor4)
print("Accuracy of the model: %i / %i = %.4f " %(cor4, len(y_pred_PS), acc4))

              precision    recall  f1-score   support

           0       0.63      0.43      0.51       580
           1       0.82      0.56      0.67       459
           2       0.80      0.94      0.87      1889

    accuracy                           0.78      2928
   macro avg       0.75      0.64      0.68      2928
weighted avg       0.77      0.78      0.76      2928

Count of Correct Predictions: 2283
Accuracy of the model: 2283 / 2928 = 0.7797 


In [None]:
#storing the results. The below mentioned order of parameter passing is important.
#Caution: Execute only once to avoid duplications.
storeResults('Removed both Punctuation & Few Stopwords', cor4, acc4)

#### **9.2. Reducing the Dictionary Size**

To improve the model performance, we reduce the size of training dictionary further by taking only top-k frequent word types that appear in it. Here, we vary the value of k and compare the model performance.



In [None]:
#total tokens in training dictionary
print('Total tokens in the dictionary:', len(bog))

Total tokens in the dictionary: 13606


##### **9.2.1. Considering Top 5k Tokens**

**5k Tokens of Vocabulary - Unprocessed data**

In [None]:
#training the classifier - 5000 tokens 
nb_5k = NBClassifier(X_train, y_train, '5000')
nb_5k.fit()

#predicting the labels for test samples
y_pred_5k = nb_5k.predict(X_test)

#Checking
print("NBClassifier Model miss any prediction???", len(X_test) != len(y_pred_5k))

Model Start Time: 22:07:39
Model End Time: 22:07:54
NBClassifier Model miss any prediction??? False


In [None]:
#Performance of the classifier
cor5, acc5 = nb_5k.score(y_pred_5k, y_test)
print(metrics.classification_report(y_test, y_pred_5k,zero_division=1))
print("Count of Correct Predictions:", cor5)
print("Accuracy of the model: %i / %i = %.4f " %(cor5, len(y_pred), acc5))

              precision    recall  f1-score   support

           0       0.63      0.52      0.57       580
           1       0.76      0.69      0.72       459
           2       0.84      0.91      0.87      1889

    accuracy                           0.80      2928
   macro avg       0.75      0.71      0.72      2928
weighted avg       0.79      0.80      0.79      2928

Count of Correct Predictions: 2332
Accuracy of the model: 2332 / 2928 = 0.7964 


In [None]:
#storing the results. The below mentioned order of parameter passing is important.
#Caution: Execute only once to avoid duplications.
storeResults('5k Tokens of Voab - Unprocessed Data', cor5, acc5)

**5k Tokens of Vocabulary - No Punctuation Data**

In [None]:
#training the classifier - 5000 tokens 
nb_5k_P = NBClassifier(X_train_P, y_train, '5000')
nb_5k_P.fit()

#predicting the labels for test samples
y_pred_5k_P = nb_5k_P.predict(X_test_P)

#Checking
print("NBClassifier Model miss any prediction???", len(X_test) != len(y_pred_5k_P))

Model Start Time: 22:07:54
Model End Time: 22:08:08
NBClassifier Model miss any prediction??? False


In [None]:
#Performance of the classifier
cor6, acc6 = nb_5k.score(y_pred_5k_P, y_test)
print(metrics.classification_report(y_test, y_pred_5k_P,zero_division=1))
print("Count of Correct Predictions:", cor6)
print("Accuracy of the model: %i / %i = %.4f " %(cor6, len(y_pred), acc6))

              precision    recall  f1-score   support

           0       0.64      0.48      0.55       580
           1       0.76      0.65      0.70       459
           2       0.83      0.92      0.87      1889

    accuracy                           0.79      2928
   macro avg       0.74      0.68      0.71      2928
weighted avg       0.78      0.79      0.78      2928

Count of Correct Predictions: 2309
Accuracy of the model: 2309 / 2928 = 0.7886 


In [None]:
#storing the results. The below mentioned order of parameter passing is important.
#Caution: Execute only once to avoid duplications.
storeResults('5k Tokens of Voab - No Punctuation Data', cor6, acc6)

**5k Tokens of Vocabulary - Removed few Stopwords**




In [None]:
#training the classifier - 5000 tokens 
nb_5k_S = NBClassifier(X_train_S, y_train, '5000')
nb_5k_S.fit()

#predicting the labels for test samples
y_pred_5k_S = nb_5k_S.predict(X_test_S)

#Checking
print("NBClassifier Model miss any prediction???", len(X_test) != len(y_pred_5k_S))

Model Start Time: 22:08:09
Model End Time: 22:08:23
NBClassifier Model miss any prediction??? False


In [None]:
#Performance of the classifier
cor7, acc7 = nb_5k_S.score(y_pred_5k_S, y_test)
print(metrics.classification_report(y_test, y_pred_5k_S,zero_division=1))
print("Count of Correct Predictions:", cor7)
print("Accuracy of the model: %i / %i = %.4f " %(cor7, len(y_pred), acc7))

              precision    recall  f1-score   support

           0       0.61      0.55      0.58       580
           1       0.74      0.68      0.71       459
           2       0.85      0.89      0.87      1889

    accuracy                           0.79      2928
   macro avg       0.73      0.71      0.72      2928
weighted avg       0.79      0.79      0.79      2928

Count of Correct Predictions: 2321
Accuracy of the model: 2321 / 2928 = 0.7927 


In [None]:
#storing the results. The below mentioned order of parameter passing is important.
#Caution: Execute only once to avoid duplications.
storeResults('5k Tokens of Voab - Removed few Stopwords', cor7, acc7)

**5k Tokens of Vocabulary - Removed both Punctuation & Few Stopwords**




In [None]:
#training the classifier - 5000 tokens 
nb_5k_PS = NBClassifier(X_train_PS, y_train, '5000')
nb_5k_PS.fit()

#predicting the labels for test samples
y_pred_5k_PS = nb_5k_PS.predict(X_test_PS)

#Checking
print("NBClassifier Model miss any prediction???", len(X_test) != len(y_pred_5k_PS))

Model Start Time: 22:08:23
Model End Time: 22:08:38
NBClassifier Model miss any prediction??? False


In [None]:
#Performance of the classifier
cor8, acc8 = nb_5k_PS.score(y_pred_5k_PS, y_test)
print(metrics.classification_report(y_test, y_pred_5k_PS,zero_division=1))
print("Count of Correct Predictions:", cor8)
print("Accuracy of the model: %i / %i = %.4f " %(cor8, len(y_pred), acc8))

              precision    recall  f1-score   support

           0       0.61      0.52      0.56       580
           1       0.72      0.65      0.69       459
           2       0.84      0.90      0.87      1889

    accuracy                           0.78      2928
   macro avg       0.72      0.69      0.70      2928
weighted avg       0.78      0.78      0.78      2928

Count of Correct Predictions: 2296
Accuracy of the model: 2296 / 2928 = 0.7842 


In [None]:
#storing the results. The below mentioned order of parameter passing is important.
#Caution: Execute only once to avoid duplications.
storeResults('5k Tokens of Voab - Removed both Punctuation & Few Stopwords', cor8, acc8)

##### **9.2.2. Considering Top 10k Tokens**
**10k Tokens of Vocabulary - Unprocessed data**

In [None]:
#training the classifier - 10000 tokens 
nb_10k = NBClassifier(X_train, y_train, '10000')
nb_10k.fit()

#predicting the labels for test samples
y_pred_10k = nb_10k.predict(X_test)

#Checking
print("NBClassifier Model miss any prediction???", len(X_test) != len(y_pred_10k))

Model Start Time: 22:08:38
Model End Time: 22:08:52
NBClassifier Model miss any prediction??? False


In [None]:
#Performance of the classifier
cor9, acc9 = nb_10k.score(y_pred_10k, y_test)
print(metrics.classification_report(y_test, y_pred_10k,zero_division=1))
print("Count of Correct Predictions:", cor9)
print("Accuracy of the model: %i / %i = %.4f " %(cor9, len(y_pred), acc9))

              precision    recall  f1-score   support

           0       0.63      0.52      0.57       580
           1       0.76      0.69      0.72       459
           2       0.84      0.91      0.87      1889

    accuracy                           0.80      2928
   macro avg       0.75      0.71      0.72      2928
weighted avg       0.79      0.80      0.79      2928

Count of Correct Predictions: 2332
Accuracy of the model: 2332 / 2928 = 0.7964 


In [None]:
#storing the results. The below mentioned order of parameter passing is important.
#Caution: Execute only once to avoid duplications.
storeResults('10k Tokens of Voab - Unprocessed Data', cor9, acc9)

**10k Tokens of Vocabulary - No Punctuation Data**

In [None]:
#training the classifier - 10000 tokens 
nb_10k_P = NBClassifier(X_train_P, y_train, '10000')
nb_10k_P.fit()

#predicting the labels for test samples
y_pred_10k_P = nb_10k_P.predict(X_test_P)
  
#Checking
print("NBClassifier Model miss any prediction???", len(X_test) != len(y_pred_10k_P))

Model Start Time: 22:08:53
Model End Time: 22:09:21
NBClassifier Model miss any prediction??? False


In [None]:
#Performance of the classifier
cor10, acc10 = nb_10k_P.score(y_pred_10k_P, y_test)
print(metrics.classification_report(y_test, y_pred_10k_P,zero_division=1))
print("Count of Correct Predictions:", cor10)
print("Accuracy of the model: %i / %i = %.4f " %(cor10, len(y_pred), acc10))

              precision    recall  f1-score   support

           0       0.64      0.41      0.50       580
           1       0.80      0.59      0.68       459
           2       0.80      0.94      0.87      1889

    accuracy                           0.78      2928
   macro avg       0.75      0.65      0.68      2928
weighted avg       0.77      0.78      0.76      2928

Count of Correct Predictions: 2287
Accuracy of the model: 2287 / 2928 = 0.7811 


In [None]:
#storing the results. The below mentioned order of parameter passing is important.
#Caution: Execute only once to avoid duplications.
storeResults('10k Tokens of Voab - No Punctuation Data', cor10, acc10)

**10k Tokens of Vocabulary - Removed few Stopwords**




In [None]:
#training the classifier - 10000 tokens 
nb_10k_S = NBClassifier(X_train_S, y_train, '10000')
nb_10k_S.fit()

#Sredicting the labels for test samSles
y_pred_10k_S = nb_10k_S.predict(X_test_S)
  
#Checking
print("NBClassifier Model miss any Srediction???", len(X_test) != len(y_pred_10k_S))

Model Start Time: 22:09:21
Model End Time: 22:09:50
NBClassifier Model miss any Srediction??? False


In [None]:
#Performance of the classifier
cor11, acc11 = nb_10k_S.score(y_pred_10k_S, y_test)
print(metrics.classification_report(y_test, y_pred_10k_S,zero_division=1))
print("Count of Correct Predictions:", cor11)
print("Accuracy of the model: %i / %i = %.4f " %(cor11, len(y_pred), acc11))

              precision    recall  f1-score   support

           0       0.63      0.49      0.55       580
           1       0.81      0.61      0.70       459
           2       0.82      0.93      0.87      1889

    accuracy                           0.79      2928
   macro avg       0.76      0.68      0.71      2928
weighted avg       0.78      0.79      0.78      2928

Count of Correct Predictions: 2321
Accuracy of the model: 2321 / 2928 = 0.7927 


In [None]:
#storing the results. The below mentioned order of parameter passing is important.
#Caution: Execute only once to avoid duplications.
storeResults('10k Tokens of Voab - Removed few Stopwords', cor11, acc11)

**10k Tokens of Vocabulary - Removed both Punctuation & Few Stopwords**




In [None]:
#training the claPSPSifier - 10000 tokenPS 
nb_10k_PS = NBClassifier(X_train_PS, y_train, '10000')
nb_10k_PS.fit()

#PSredicting the labelPS for tePSt PSamPSlePS
y_pred_10k_PS = nb_10k_PS.predict(X_test_PS)
  
#Checking
print("NBClaPSPSifier Model miSS any PSrediction???", len(X_test) != len(y_pred_10k_PS))

Model Start Time: 22:09:50
Model End Time: 22:10:18
NBClaPSPSifier Model miSS any PSrediction??? False


In [None]:
#Performance of the classifier
cor12, acc12 = nb_10k_PS.score(y_pred_10k_PS, y_test)
print(metrics.classification_report(y_test, y_pred_10k_PS,zero_division=1))
print("Count of Correct Predictions:", cor12)
print("Accuracy of the model: %i / %i = %.4f " %(cor12, len(y_pred), acc12))

              precision    recall  f1-score   support

           0       0.62      0.45      0.52       580
           1       0.79      0.61      0.69       459
           2       0.81      0.93      0.87      1889

    accuracy                           0.78      2928
   macro avg       0.74      0.66      0.69      2928
weighted avg       0.77      0.78      0.77      2928

Count of Correct Predictions: 2293
Accuracy of the model: 2293 / 2928 = 0.7831 


In [None]:
#storing the results. The below mentioned order of parameter passing is important.
#Caution: Execute only once to avoid duplications.
storeResults('10k Tokens of Voab - Removed both Punctuation & Few Stopwords', cor12, acc12)

### **10. Comparing the Results**

In [None]:
#creating dataframe
results = pd.DataFrame({ 'Data Modification': attributes,    
    'Correct Predictions': corr,
    'Model Accuracy': acc})

In [None]:
results.sort_values(by=['Model Accuracy', 'Correct Predictions'], ascending=False)

Unnamed: 0,Data Modification,Correct Predictions,Model Accuracy
4,5k Tokens of Voab - Unprocessed Data,2332,0.796
8,10k Tokens of Voab - Unprocessed Data,2332,0.796
6,5k Tokens of Voab - Removed few Stopwords,2321,0.793
10,10k Tokens of Voab - Removed few Stopwords,2321,0.793
5,5k Tokens of Voab - No Punctuation Data,2309,0.789
2,Removed few Stopwords,2300,0.786
7,5k Tokens of Voab - Removed both Punctuation &...,2296,0.784
11,10k Tokens of Voab - Removed both Punctuation ...,2293,0.783
0,Unprocessed Data,2291,0.782
9,10k Tokens of Voab - No Punctuation Data,2287,0.781


**NOTE: Detailed description & analysis of each step are mentioned in the report.**