<a href="https://colab.research.google.com/github/pramendra/univ_lab3/blob/master/distribute_Lab3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab 3: IMDB Sentiment Classifier

**Univ.AI** <br>
**DS-1 Cohort 1** <br>

---

![movie_reviews](https://github.com/pramendra/univ_lab3/blob/master/images/movie_reviews.jpg?raw=1)

## Table of Contents 
* [Lab 3: IMDB Sentiment Classifier](#Lab-3:-IMDB-Sentiment-Classifier)
  * [Overview](#Overview)
  * [Part 1: Building a Sentiment Classifier using Bag of Words (BOW)](#Part-1-:-Building-a-Sentiment-Classifier-using-Bag-of-Words-(BOW))
    * [1.1 Loading and Splitting data](###1.1-Loading-and-Splitting-the-data)
    * [1.2 Preprocessing](###1.2-Preprocessing)
    * [1.3 Tokenization](###1.3-Tokenization)
    * [1.4 Classification Model](###1.4-Classification-Model)
    * [1.5 Prediction and Accuracy](###1.5-Prediction-and-Accuracy)
    * [1.6 Confusion Matrix](###1.6-Confusion-Matrix)
    * [1.7 Viewing and decoding predictions](###1.7-Viewing-and-decoding-predictions)

## Overview

In this lab, we will work on building a sentiment classifier. Sentiment classification helps in identifying opinions in text and labelling them (usually positive, negative, neutral) based on the emotions people express within them. We shall work with the **IMDB dataset** where we will try to classify different movie reviews into either a **'positive'** or a **'negative'** category. 

**Data**:

The data for this lab is a small part of the `Large Movie Review Dataset v1.0`. It is split evenly, with 4k reviews intended for training and 4k for testing your classifier. Each set has 2k positive and 2k negative reviews.

IMDb lets users rate movies on a scale from 1 to 10. To label these reviews, the curator of the data labeled anything with ≤ 4 stars as negative and anything with ≥ 7 stars as positive. Reviews with 5 or 6 stars were left out.
The original data was compiled by *Andrew Maas* and can be found here: [IMDb Reviews](https://ai.stanford.edu/~amaas/data/sentiment/).

In [4]:
# Install the gensim & nltk libraries if not already installed
!pip install gensim
!pip install nltk



In [6]:
!git init; git pull https://github.com/pramendra/univ_lab3

Reinitialized existing Git repository in /content/.git/
remote: Enumerating objects: 8004, done.
remote: Counting objects: 100% (8004/8004), done.
remote: Compressing objects: 100% (7998/7998), done.
remote: Total 8004 (delta 3), reused 8004 (delta 3), pack-reused 0
Receiving objects: 100% (8004/8004), 9.92 MiB | 23.26 MiB/s, done.
Resolving deltas: 100% (3/3), done.
From https://github.com/pramendra/univ_lab3
 * branch            HEAD       -> FETCH_HEAD


In [7]:
# Run this cell to import the libraries required

import imdb
import numpy as np
import pandas as pd
import gensim
from keras.preprocessing import text
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm
%matplotlib inline


import re
from bs4 import BeautifulSoup

import nltk

In [8]:
!tar -xzf ./aclImdb.tar.gz

### Part 1 : Building a Sentiment Classifier using Bag of Words (BOW)

In this part, we will build a **Logistic Regression** model using the **Bag of Words (BOW)** approach. In BOW, every text or document is represented as the bag (multiset) of its words, and the frequency of occurrence of each word is used as a feature for training a classifier. See this [Wikipedia article](https://en.wikipedia.org/wiki/Bag-of-words_model#:~:text=The%20bag%2Dof%2Dwords%20model,word%20order%20but%20keeping%20multiplicity.) for more information.

Load the IMDB dataset using the helper Python script **imdb.py**. This has already been imported at the beginning.
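
As a rough illustration of the bag-of-words idea, the sketch below builds count vectors for two made-up reviews using only the standard library. The sentences and variable names are assumptions for illustration; the lab itself uses `CountVectorizer` for this step.

```python
from collections import Counter

# Two toy "reviews" (hypothetical examples, not taken from the IMDB data)
docs = ["good movie good acting", "bad movie"]

# A bag of words keeps how often each word occurs but discards word order
bags = [Counter(doc.split()) for doc in docs]

# Shared vocabulary, sorted so that every document uses the same column order
vocab = sorted(set(word for bag in bags for word in bag))

# One count vector per document; a Counter returns 0 for absent words
vectors = [[bag[word] for word in vocab] for bag in bags]

print(vocab)
print(vectors)
```

Each row of `vectors` is the feature vector for one review, with one column per vocabulary word.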

**1.1 Loading and Splitting data**

Load the dataset and split it into train and test sets. Print out the shape of your train and test features and labels (`X_train`, `X_test`, `y_train`, `y_test`).

In [20]:
# load and split the data using the imdb.py helper Python script. This has already been imported for you
# your code here
(X_train, y_train), (X_test, y_test) = imdb.load_imdb()

train positive reviews read
train negative reviews read
test positive reviews read
test negative reviews read
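
One way to report the size and class balance of a split is sketched below. It runs on stand-in data and assumes that each split is a sequence of review strings with a parallel sequence of 0/1 labels (1 = positive); `describe_split` is a hypothetical helper, not part of `imdb.py`.

```python
from collections import Counter

# Stand-in data with the layout assumed for the real splits:
# a sequence of review strings and a parallel sequence of 0/1 labels
X_demo = ["loved it", "hated it", "great fun", "dull plot"]
y_demo = [1, 0, 1, 0]

def describe_split(name, reviews, labels):
    """Return a one-line size and class-balance summary (hypothetical helper)."""
    counts = Counter(int(label) for label in labels)
    return f"{name}: {len(reviews)} reviews, {counts[1]} positive, {counts[0]} negative"

summary = describe_split("demo", X_demo, y_demo)
print(summary)
```

If the assumption about the layout holds, the same helper could be called with `X_train, y_train` and `X_test, y_test`.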


**1.2 Preprocessing**

Clean the reviews by completing the following function. Follow the instructions as mentioned in the comments. Call this function to obtain a list of cleaned train and test reviews.

*Hint* - Use [**re.sub**](https://docs.python.org/3/library/re.html) to replace regular expressions or regex.

In [21]:
# helper function to perform stemming

def stemming(text):
    
    # stems each word in the review to its root word
    stemmer = nltk.stem.PorterStemmer()
    text = ' '.join([stemmer.stem(word) for word in text.split()])
    return text

In [45]:
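def cleanhtml(raw_html):
  # drop anything that looks like an HTML tag, e.g. <br />
  cleanr = re.compile(r'<.*?>')
  cleantext = re.sub(cleanr, '', raw_html)
  return cleantext

def remove_nonalpha(raw_text):
  # delete every character that is not a lowercase letter or whitespace;
  # matches are deleted rather than replaced by a space, so words separated
  # only by punctuation end up joined
  cleanr = re.compile(r'[^a-z\s]+')
  cleantext = re.sub(cleanr, '', raw_text)
  # collapse runs of whitespace into a single space
  cleantext = re.sub(r'\s+', ' ', cleantext).strip()
  return cleantext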

def preprocess_review(text):
    
    r"""
    Function
    --------
    preprocess_review

    Inputs
    ------
    text: the review which needs to be pre-processed

    Returns
    -------
    clean_review: the pre-processed review obtained after performing the instructions in the Notes below

    Notes
    -----
    
    1. Remove html tags from the text
    2. Change all reviews to lowercase
    3. Remove special characters - remove everything that is not a letter or a space (use re.sub to replace regex pattern by single 
       space. The regex pattern for removal of special characters is '[^a-z\s]+')
    4. Remove extra (multiple) spaces (use re.sub to replace regex pattern by single space. The regex pattern for removal of multiple 
       spaces is '\s+')
    5. Perform stemming, which converts (stems) each word to its base form (e.g. running, runs -> run)
    """
    
    # your code here
    nohtml = cleanhtml(text)                 # step 1
    lower_text = nohtml.lower()              # step 2
    non_alpha = remove_nonalpha(lower_text)  # steps 3 and 4
    return stemming(non_alpha)               # step 5
    
    

In [47]:
# Calling the preprocess_review function to obtain list of the cleaned reviews

X_train_cleaned = [preprocess_review(review) for review in X_train]
print('Training data cleaned')
X_test_cleaned = [preprocess_review(review) for review in X_test]
print('Test data cleaned')


Training data cleaned
Test data cleaned
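
The effect of the regex steps can be checked on a single string. The sketch below uses the patterns from the Notes on a made-up review and compares two replacement strings: a single space, as the Notes suggest, and the empty string. Stemming is left out so that the example needs only the standard library; `clean` and `raw` are hypothetical names.

```python
import re

# Made-up review text containing tags, punctuation and digits
raw = "Great film!<br /><br />Don't miss it... 10/10"

def clean(text, repl):
    """Apply steps 1-4 of the Notes, substituting `repl` for every match."""
    text = re.sub(r'<.*?>', repl, text)        # 1. strip HTML tags
    text = text.lower()                        # 2. lowercase
    text = re.sub(r'[^a-z\s]+', repl, text)    # 3. drop non-letters
    return re.sub(r'\s+', ' ', text).strip()   # 4. collapse whitespace

spaced = clean(raw, ' ')   # matches become a space
fused = clean(raw, '')     # matches are deleted outright

print(spaced)
print(fused)
```

With a space as the replacement, words on either side of a tag or punctuation mark stay separate; with the empty string they can fuse into a single token (here `film` and `dont`).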


In [70]:
len(X_train_cleaned), len(X_test_cleaned)

(4000, 4000)


Let's look at an example of a cleaned review:

In [48]:
print(X_train_cleaned[5])

i alway tell peopl that enchant april is an adult movi with no cuss no sex and no violenc one might think of it as the ultim chick flick but i bet there are one or two enlighten men out there who love it too dont invit the kid though thi movi is veri lowkeyse enchant april is a veri heal experi the sound track and gorgeou sceneri along with the ladi gentl manner bring to mind the peac and beauti of a preraphaelit paintinglest anyon think your truli onli watch one kind of movi i will paraphras a line i heard onc on saturday night live and say that my two favorit movi are the deer hunter and enchant april


**1.3 Tokenization**

Encode the cleaned reviews as integer counts that can be used as features in our classification model. This step is also called **tokenization**. We will use the [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) class from the *sklearn* library with a maximum vocabulary size of 1000 and with stop words removed. *Stopwords* are commonly occurring English words such as *and, i, for, us, the*.

In [49]:
tokenizer = CountVectorizer(max_features = 1000, stop_words = 'english')
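
To see what these two options do, here is a small sketch on made-up sentences, with `max_features=3` standing in for 1000. The documents and the names `demo_vec`, `train_docs` and `test_docs` are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["the film was great great",
              "a great film with a bad plot",
              "the film was bad"]
test_docs = ["great plot twist"]

# Keep only the 3 most frequent words that are not English stop words
demo_vec = CountVectorizer(max_features=3, stop_words='english')

# The vocabulary is learned from the training documents only ...
train_counts = demo_vec.fit_transform(train_docs)
# ... and reused unchanged for the test documents
test_counts = demo_vec.transform(test_docs)

# vocabulary_ maps word -> column index; sort the words by that index
columns = sorted(demo_vec.vocabulary_, key=demo_vec.vocabulary_.get)

print(columns)
print(train_counts.toarray())
print(test_counts.toarray())
```

Words outside the learned vocabulary (`plot`, which falls below the `max_features` cut, and the unseen `twist`) are ignored when the test document is transformed.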

Let's fit the tokenizer to our cleaned training set and store the train & test tokenized data to these variables respectively: **X_train_tokenized** and **X_test_tokenized**

In [68]:
# your code here
# Fit the vocabulary on the training reviews only, then reuse the fitted
# tokenizer (max_features=1000, English stop words removed) for the test reviews
X_train_tokenized = tokenizer.fit_transform(X_train_cleaned)
X_test_tokenized = tokenizer.transform(X_test_cleaned)

# get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names()
names = tokenizer.get_feature_names_out()

# Dense views of the count matrices, one column per vocabulary word
x_train_df = pd.DataFrame(X_train_tokenized.toarray(), columns=names)
x_test_df = pd.DataFrame(X_test_tokenized.toarray(), columns=names)


In [69]:
# shape of the tokenized features for train & test data
print(X_train_tokenized.shape,X_test_tokenized.shape)

(4000, 1000) (4000, 1000)


Why do you think removing stop-words is important?

*your answer here*

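*   Stop words such as *the*, *is* and *and* appear in nearly every review regardless of sentiment, so they carry little signal for the classifier. Removing them leaves room in the limited vocabulary for words that actually distinguish positive from negative reviews.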




There is a slight fault in the way we have pre-processed and then tokenized data. Can you spot the fault?

*Hint: The fault lies in the sequence of operations done* 

*your answer here*


Print out the 10 most frequently occurring words in the training set along with their frequencies. Note that `tokenizer.vocabulary_` only gives you the index of each word in the vocab, not the count itself.

In [None]:
# your code here
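
One possible approach, sketched on a made-up corpus: sum each column of the count matrix to get per-word totals, then use `vocabulary_` to map words to their columns. The documents and the names `vec`, `totals` and `top_k` are hypothetical, and the top 3 are taken instead of the top 10 to keep the example small.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["great film great cast", "bad film", "great fun"]

vec = CountVectorizer()
counts = vec.fit_transform(docs)

# Column sums give each word's total count over all documents
totals = np.asarray(counts.sum(axis=0)).ravel()

# vocabulary_ maps word -> column index, so use it to look up each total
word_totals = [(word, int(totals[idx])) for word, idx in vec.vocabulary_.items()]

# Sort by count (descending), break ties alphabetically, keep the top 3
top_k = sorted(word_totals, key=lambda pair: (-pair[1], pair[0]))[:3]

print(top_k)
```

The same pattern should carry over to `X_train_tokenized` and `tokenizer` once the tokenizer has been fitted.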


**1.4 Classification model**

Use a *Logistic Regression* model for classification with the tokenized reviews as features. Tune your model to find the best value of the hyper-parameter 'C' from the following values: *'{0.001,0.01,0.1,1,10,100}'*. Store the final model in the variable *'bow_model'*.

In [None]:
# your code here
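
A minimal sketch of one way to tune `C`, using 5-fold cross-validated grid search on synthetic data. The use of `GridSearchCV`, the synthetic features and every name below are assumptions for illustration; the lab does not prescribe a particular search method.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the tokenized reviews and their labels
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)

C_values = [0.001, 0.01, 0.1, 1, 10, 100]

# Try every C with 5-fold cross-validation and score each by accuracy
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      {'C': C_values}, cv=5, scoring='accuracy')
search.fit(X_demo, y_demo)

best_C = search.best_params_['C']
# With the default refit=True, best_estimator_ is retrained on all of X_demo
demo_model = search.best_estimator_

print(best_C, round(search.best_score_, 3))
```

Cross-validating on the training data keeps the test set untouched until the final accuracy check.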



**1.5 Prediction and Accuracy**

Print the accuracy of your model on both the train and test datasets.

In [None]:
# your code here
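
Accuracy is the fraction of predictions that match the true labels. The sketch below checks `accuracy_score` against a manual computation on made-up labels; the label lists are hypothetical.

```python
from sklearn.metrics import accuracy_score

# Made-up labels: 1 = positive, 0 = negative
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]

# Library computation
acc = accuracy_score(y_true_demo, y_pred_demo)

# The same quantity by hand: matching predictions / total predictions
manual = sum(t == p for t, p in zip(y_true_demo, y_pred_demo)) / len(y_true_demo)

print(acc, manual)
```

For the lab, the same call would take `y_train` or `y_test` together with the model's predictions on the matching tokenized features. A train accuracy far above the test accuracy usually points to overfitting.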




**1.6 Confusion Matrix**

Plot the confusion matrix (2x2 matrix of your actual vs predicted values) for both train and test data. Complete the function below which will help you with this. You can use the [confusion matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) from `sklearn` and the [heatmap](https://seaborn.pydata.org/generated/seaborn.heatmap.html) from the `seaborn` library to help with this. They have already been imported at the start of this notebook

In [None]:
# function to plot confusion matrix

def plot_confusion_matrix(model, validation_features, validation_labels):
     
    """
    Function
    --------
    plot_confusion_matrix

    Inputs
    ------
    model: the classification model
    validation_features: features/variables used in the model (X)
    validation_labels: response variable (good/bad sentiment) (y)

    Notes
    -----
    Calling this function should plot a confusion matrix. Use heatmap from the seaborn library
    """
    
    # Predict the values from the validation dataset
    y_pred = _______
    
    # Convert validation observations to one hot vectors
    y_true = _______
    
    # compute the confusion matrix
    confusion_mtx = _______ 

    df_cm = pd.DataFrame(confusion_mtx, range(2),
                      range(2))
    sns.heatmap(df_cm, annot=True, annot_kws = {'size':15}, cmap = 'Blues',fmt = 'd',
                norm=LogNorm(df_cm.values.min(),df_cm.values.max()),
                cbar_kws={"ticks":[0,1,10,1e2,1e3,1e4]},vmin=0.001, vmax=10000)
    plt.tight_layout()
    plt.title('Confusion matrix')
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.ylim(2,0)
    plt.show()
    

In [None]:
# generate the confusion matrix plots for train data

plot_confusion_matrix(bow_model, X_train_tokenized, y_train)

In [None]:
# generate the confusion matrix plots for test data

plot_confusion_matrix(bow_model, X_test_tokenized, y_test)
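
The counts behind such a heatmap can be sketched on made-up labels. In scikit-learn's `confusion_matrix`, rows are true labels and columns are predicted labels; the label lists below are hypothetical.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 1 = positive, 0 = negative
y_true_demo = [0, 0, 1, 1, 1, 0]
y_pred_demo = [0, 1, 1, 1, 0, 0]

# Rows are true labels, columns are predicted labels
cm = confusion_matrix(y_true_demo, y_pred_demo)

# For binary labels the flattened order is TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()

print(cm)
print(tn, fp, fn, tp)
```

The off-diagonal cells (`fp` and `fn`) are the errors examined in the next section.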

**1.7 Viewing and decoding predictions**

Let's look at a few False Positives and False Negatives in the test data to see where our predictions go wrong.

In [None]:
y_pred = bow_model.predict(X_test_tokenized)
df_test = pd.DataFrame(list(zip(X_test, y_pred, y_test)), columns = ['X_test', 'y_pred', 'y_test'])

Create 2 subset dataframes from **df_test** : **df_fp** for all False Positives and **df_fn** for all False Negatives 

In [None]:
# False Positives
df_fp = df_test[(df_test['y_pred']==1) & (df_test['y_test']==0)]

# False Negatives
df_fn = df_test[(df_test['y_pred']==0) & (df_test['y_test']==1)]
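
The same boolean masks are shown below on a four-row toy frame, as a sketch with made-up reviews and hypothetical names. Note that filtering keeps the original row labels, so a row of the subset is addressed by its label from the full frame, not by its position in the subset.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'X_test': ["fine", "awful", "superb", "boring"],
    'y_pred': [1, 1, 0, 0],
    'y_test': [1, 0, 1, 0],
})

# Predicted positive but actually negative
fp_demo = df_demo[(df_demo['y_pred'] == 1) & (df_demo['y_test'] == 0)]
# Predicted negative but actually positive
fn_demo = df_demo[(df_demo['y_pred'] == 0) & (df_demo['y_test'] == 1)]

# Row labels are carried over from df_demo
print(fp_demo.index.tolist(), fn_demo.index.tolist())
print(fp_demo.loc[1, 'X_test'])
```

Listing `df_fp.index` first is a simple way to find labels that are valid for lookups in the subset.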

Let's create a function to find out the words which contribute most to a particular review

In [None]:
# This function takes a tokenized review (a dense row of counts) and returns the vocabulary words
# occurring in it, each paired with its total frequency in the training set, most frequent first

def imp_words(tokenized_review):
    # total count of every vocabulary word across the training set
    sum_words = X_train_tokenized.sum(axis=0)
    words_freq = {word: sum_words[0, idx] for word, idx in tokenizer.vocabulary_.items()}

    # reverse lookup: column index -> word
    index_to_word = {idx: word for word, idx in tokenizer.vocabulary_.items()}

    # keep the words that occur at least once in this review
    present = [(index_to_word[i], words_freq[index_to_word[i]])
               for i, val in enumerate(tokenized_review) if val >= 1]

    return sorted(present, key = lambda x: x[1], reverse=True)

Let's pick one of the False Positive cases from the `df_fp` dataframe.

In [None]:
# Run this cell to see an example of a FP review
df_fp['X_test'][2001]

In [None]:
# Call the function 'imp_words' to find the important words occurring in the review
imp_words(X_test_tokenized.toarray()[2001])

What do you observe? Why do you think this was classified as a False Positive?

*your answer here*


Let's do the same analysis for one of the False Negatives from the `df_fn` dataframe

In [None]:
# Run this cell to see an example of a FN review
df_fn['X_test'][51]

Find the *important words* in this review which contribute to the classification. Why do you think this was classified as a False Negative?

In [None]:
# your code here

