# **Assignment-2 for CS60075: Natural Language Processing**

#### Instructor : Prof. Sudeshna Sarkar

#### Teaching Assistants : Alapan Kuila, Aniruddha Roy, Prithwish Jana, Udit Dharmin Desai

#### Date of Announcement: 15th Sept, 2021
#### Deadline for Submission: 11.59pm on Wednesday, 22nd Sept, 2021 
#### Submit this .ipynb file, named as `<Your_Roll_Number>_Assn2_NLP_A21.ipynb`

The central idea of this assignment is to use Naive Bayes classifier and LSTM based classifier and compare the models by accuracy on IMDB dataset.  This dataset consists of 50k movie reviews (25k positive, 25k negative). You can download the dataset from https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews



Please submit with outputs. 

In [1]:
import re
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import keras
from sklearn.metrics import classification_report , accuracy_score
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize 
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
from collections import Counter, defaultdict
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('stopwords')
stopwords = stopwords.words('english')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
#Load the IMDB dataset. You can load it using pandas as dataframe
dataset = pd.read_csv('/content/IMDB Dataset.csv.zip')
print(dataset)

                                                  review sentiment
0      One of the other reviewers has mentioned that ...  positive
1      A wonderful little production. <br /><br />The...  positive
2      I thought this was a wonderful way to spend ti...  positive
3      Basically there's a family where a little boy ...  negative
4      Petter Mattei's "Love in the Time of Money" is...  positive
...                                                  ...       ...
49995  I thought this movie did a down right good job...  positive
49996  Bad plot, bad dialogue, bad acting, idiotic di...  negative
49997  I am a Catholic taught in parochial elementary...  negative
49998  I'm going to have to disagree with the previou...  negative
49999  No one expects the Star Trek movies to be high...  negative

[50000 rows x 2 columns]


# Preprocessing
PrePrecessing that needs to be done on lower cased corpus

1. Remove html tags
2. Remove URLS
3. Remove non alphanumeric character
4. Remove Stopwords
5. Perform stemming and lemmatization

You can use regex from re. 

In [None]:
#lower case the corpus; remove- html tags, urls, non alpha numerics, stopwords ;stemming&lemmatization
def preprocess(raw_text):   
  cleantext = raw_text.lower()
  cleaner = re.compile('<.*?>')
  cleantext = re.sub(cleaner, ' ', cleantext)    
  cleantext = re.sub(r'http\S+', '', cleantext)  
  cleantext = re.sub(r'[^\w\s]', '', cleantext)  
  lemmatizer = WordNetLemmatizer()  
  cleanwords = [lemmatizer.lemmatize(word) for word in word_tokenize(cleantext) if word not in stopwords]  
  cleantext = ' '.join(cleanwords)  
  return cleantext
# preprocess the entire dataset
  def ppDataset(dataset):   
   for i in range(len(dataset)):
     dataset.iloc[i][0] = preprocess(dataset.iloc[i][0])
   return dataset

  dataset = ppDataset(dataset)
print("Preprocessed dataset:\n", dataset)

Preprocessed dataset:
                                                   review sentiment
0      One of the other reviewers has mentioned that ...  positive
1      A wonderful little production. <br /><br />The...  positive
2      I thought this was a wonderful way to spend ti...  positive
3      Basically there's a family where a little boy ...  negative
4      Petter Mattei's "Love in the Time of Money" is...  positive
...                                                  ...       ...
49995  I thought this movie did a down right good job...  positive
49996  Bad plot, bad dialogue, bad acting, idiotic di...  negative
49997  I am a Catholic taught in parochial elementary...  negative
49998  I'm going to have to disagree with the previou...  negative
49999  No one expects the Star Trek movies to be high...  negative

[50000 rows x 2 columns]


In [None]:
# Print Statistics of Data like avg length of sentence , proposition of data w.r.t class labels
positive = 0
sumlen = 0
for i in range(len(dataset)):
  if dataset.iloc[i][1] == 'positive':
    positive += 1
  sumlen += len(word_tokenize(dataset.iloc[i][0]))
print("Average length of sentence = {: .4f} words".format(sumlen/len(dataset)))
print("Proposition of data w.r.t class labels:")
print("Positive reviews:{:4f}".format(positive))
print("Negative reviews:{:.4f}".format( len(dataset)- positive))
print("Proportion of postitive to negative reviews:{:.4}".format(positive/ len(dataset)*100))


Average length of sentence =  279.1375 words
Proposition of data w.r.t class labels
Positive reviews:25000.000000
Negative reviews:25000.0000
Proportion of postitive to negative reviews:50.0


# Naive Bayes classifier

In [None]:
# get reviews column from df
reviews = dataset['review'].values

# get labels column from df
labels = dataset['sentiment'].values

In [None]:
# Use label encoder to encode labels. Convert to 0/1
encoder = LabelEncoder()
encoded_labels = encoder.fit_transform(labels)
dataset['encoded'] = encoded_labels
encoder_mapping = dict(zip(encoder.classes_, encoder.transform(encoder.classes_)))
labels = dataset['encoded']

# print(enc.classes_)

In [None]:
# Split the data into train and test (80% - 20%). 
# Use stratify in train_test_split so that both train and test have similar ratio of positive and negative samples.
train_sentences, test_sentences, train_labels, test_labels = train_test_split(reviews, labels, test_size = 0.2, stratify = labels)
# train_sentences, test_sentences, train_labels, test_labels

Here there are two approaches possible for building vocabulary for the naive Bayes.
1. Take the whole data (train + test) to build the vocab. In this way while testing there is no word which will be out of vocabulary.
2. Take the train data to build vocab. In this case, some words from the test set may not be in vocab and hence one needs to perform smoothing so that one the probability term is not zero.
 
You are supposed to go by the 2nd approach.
 
Also building vocab by taking all words in the train set is memory intensive, hence you are required to build vocab by choosing the top 2000 - 3000 frequent words in the training corpus.

> $ P(x_i | w_j) = \frac{ N_{x_i,w_j}\, +\, \alpha }{ N_{w_j}\, +\, \alpha*d} $


$N_{x_i,w_j}$ : Number of times feature $x_i$ appears in samples of class $w_j$

$N_{w_j}$ : Total count of features in class $w_j$

$\alpha$ : Parameter for additive smoothing. Here consider $\alpha$ = 1

$d$ : Dimentionality of the feature vector  $x = [x_1,x_2,...,x_d]$. In our case its the vocab size.






In [None]:
from sklearn.feature_extraction.text import CountVectorizer
# Use Count vectorizer to get frequency of the words
'''
max_features parameter : If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus.
vec = CountVectorizer(max_features = 3000)
X = vec.fit_transform(Sentence_list)
'''


vec = CountVectorizer(max_features = 3000)
X = vec.fit_transform(train_sentences)
counts = X.sum(axis = 0).A1
vocab = list(vec.get_feature_names())

freq = Counter(dict(zip(vocab, counts)))

print("The 100 most common words in the reviews are: ", freq.most_common(100), sep = '\n')


The 100 most common words in the reviews are: 
[('the', 535305), ('and', 259443), ('of', 231641), ('to', 214359), ('is', 169280), ('br', 161843), ('it', 153038), ('in', 149611), ('this', 120626), ('that', 115169), ('was', 75998), ('as', 73453), ('movie', 70481), ('for', 69996), ('with', 69739), ('but', 66986), ('film', 63755), ('you', 55386), ('on', 54688), ('not', 48686), ('he', 46944), ('are', 46916), ('his', 45975), ('have', 44193), ('one', 42898), ('be', 42779), ('all', 37540), ('at', 37344), ('they', 36204), ('by', 35737), ('an', 34322), ('who', 33804), ('so', 32753), ('from', 32354), ('like', 32296), ('there', 30079), ('or', 28653), ('just', 28241), ('her', 27634), ('out', 27475), ('about', 27327), ('if', 27246), ('has', 26466), ('what', 25819), ('some', 25065), ('good', 23797), ('can', 23419), ('more', 22484), ('when', 22366), ('very', 22319), ('she', 21521), ('up', 21152), ('no', 20304), ('time', 19981), ('even', 19892), ('my', 19863), ('would', 19624), ('which', 18682), ('only

In [None]:
# Use laplace smoothing for words in test set not present in vocab of train set

class Naive_Bayes:
    def __init__(self, classes):
      self.classes = classes

    def smoothing(self, word, tclass):          
      num = self.wcounts[tclass][word] + 1
      den = self.n_items[tclass] + len(self.vocab)
      return math.log(num / den)

    def fit(self, X, y):
        self.vocab = vocab
        self.wcounts = {}
        self.n_items = {}
        self.log_p = {}
        n = len(X)
        grouped = self.group(X, y)
        for c, data in grouped.items():
          self.n_items[c] = len(data)
          self.log_p[c] = math.log(self.n_items[c] / n) 
          self.wcounts[c] = defaultdict(lambda: 0)
          for txt in data:
            counts = Counter(nltk.word_tokenize(txt))
            for word, count in counts.items():
                self.wcounts[c][word] += count
        return self

    def predict(self, X):
        result = []
        for text in X:
          scores = {c: self.log_p[c] for c in self.classes}
          words = set(nltk.word_tokenize(text))
          for word in words:
              if word not in self.vocab: 
                continue
              for c in self.classes:
                log_wgc = self.smoothing(word, c)
                scores[c] += log_wgc
          result.append(max(scores, key = scores.get))
        return result

    def group(self, X, y):
      data = {}
      for c in self.classes:                          
        data[c] = X[np.where(y == c)]
      return data

In [None]:
# Build the model. Don't use the model from sklearn



In [None]:
# Test the model on test set and report Accuracy
import math
nb = Naive_Bayes(classes = np.unique(labels)).fit(train_sentences, train_labels)

# Test the model on test set and report Accuracy
predicted_labels = nb.predict(test_sentences)
print("The accuracy of the Naive Bayes classifier is: \n{:.4f}%\n\n".format(accuracy_score(test_labels, predicted_labels) * 100))
print(" ")
print("The classification report is as follows: \n\n", classification_report(test_labels, predicted_labels))


The accuracy of the Naive Bayes classifier is: 
83.5000%


 
The classification report is as follows: 

               precision    recall  f1-score   support

           0       0.79      0.90      0.85      5000
           1       0.89      0.77      0.82      5000

    accuracy                           0.83     10000
   macro avg       0.84      0.83      0.83     10000
weighted avg       0.84      0.83      0.83     10000



# *LSTM* based Classifier

Use the above train and test splits.

In [None]:
# Hyperparameters of the model
vocab_size = len(tokenizer.word_index) + 1  # choose based on statistics
oov_tok = '<OOK>'
embedding_dim = 100
max_length = 150 # choose based on statistics, for example 150 to 200
padding_type='post'
trunc_type='post'

In [None]:
# tokenize sentences
tokenizer = Tokenizer(num_words = vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(train_sentences)
word_index = tokenizer.word_index

# convert train dataset to sequence and pad sequences
train_sequences = tokenizer.texts_to_sequences(train_sentences)
train_padded = pad_sequences(train_sequences, padding='post', maxlen=max_length)

# convert Test dataset to sequence and pad sequences
test_sequences = tokenizer.texts_to_sequences(test_sentences)
test_padded = pad_sequences(test_sequences, padding='post', maxlen=max_length)

In [None]:
# model initialization
model = keras.Sequential([
    keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    keras.layers.Bidirectional(keras.layers.LSTM(64)),
    keras.layers.Dense(24, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])

# compile model
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

# model summary
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 150, 100)          11213200  
_________________________________________________________________
bidirectional (Bidirectional (None, 128)               84480     
_________________________________________________________________
dense (Dense)                (None, 24)                3096      
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 25        
Total params: 11,300,801
Trainable params: 11,300,801
Non-trainable params: 0
_________________________________________________________________


In [None]:
num_epochs = 5   #training the model
history = model.fit(train_padded, train_labels, 
                    epochs=num_epochs, verbose=1, 
                    validation_split=0.1)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [None]:
# Calculate accuracy on Test data
'''
prediction = model.predict(test_padded)

'''
prediction = model.predict(test_padded)
# Get probabilities
print("Probabilities: ", prediction, sep='\n')

# Get labels based on probability 1 if p>= 0.5 else 0
for p in prediction:
  if p[0] >= 0.5:
    p[0] = 1
  else:
    p[0] = 0
prediction = prediction.astype('int32') 
print("\nLabels:", prediction, sep='\n')

# Accuracy : one can use classification_report from sklearn

print("\nAccuracy of the model: {:.4f}%\n".format(accuracy_score(test_labels, prediction) * 100))
print("Classification report: \n", classification_report(test_labels, prediction, labels = [0, 1]), sep='\n')


Probabilities: 
[[0.99276364]
 [0.01569769]
 [0.01924348]
 ...
 [0.5059785 ]
 [0.12448171]
 [0.99615866]]

Labels:
[[1]
 [0]
 [0]
 ...
 [1]
 [0]
 [1]]

Accuracy of the model: 86.3100%

Classification report: 

              precision    recall  f1-score   support

           0       0.85      0.87      0.86      5000
           1       0.87      0.85      0.86      5000

    accuracy                           0.86     10000
   macro avg       0.86      0.86      0.86     10000
weighted avg       0.86      0.86      0.86     10000



## Get predictions for random examples

In [None]:
# reviews on which we need to predict
sentence = ["The movie was very touching and heart whelming", 
            "I have never seen a terrible movie like this", 
            "the movie plot is terrible but it had good acting"]

# convert to a sequence
sequences = tokenizer.texts_to_sequences(sentence)

# pad the sequence
padded = pad_sequences(sequences, padding='post', maxlen=max_length)

# Get probabilities
print("Probablities : ")
print(model.predict(padded))

# Get labels based on probability 1 if p>= 0.5 else 0

for p in prediction:
    if p[0] >=0.5:
        p[0] = 1
    else:
        p[0] = 0
prediction = prediction.astype('int32') 
print("\nLabels:", prediction, sep='\n')


Probablities : 
[[0.9916369 ]
 [0.03113177]
 [0.06507811]]

Labels:
[[1]
 [0]
 [0]
 ...
 [1]
 [0]
 [1]]
