# Assignment 3 on Natural Language Processing

## Date : 30th Sept, 2020

### Instructor : Prof. Sudeshna Sarkar

### Teaching Assistants : Alapan Kuila, Aniruddha Roy, Anusha Potnuru, Uppada Vishnu

The central idea of this assignment is to build a Naive Bayes classifier and an LSTM-based classifier and to compare their accuracy on the IMDB dataset.



Please submit with outputs. 

In [48]:
import re
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import keras
from sklearn.metrics import classification_report

In [49]:
#Load the IMDB dataset. You can load it using pandas as dataframe
df = pd.read_csv("IMDB Dataset.csv")

# Preprocessing
The following preprocessing needs to be done on the lowercased corpus:

1. Remove HTML tags
2. Remove URLs
3. Remove non-alphanumeric characters
4. Remove stopwords
5. Perform stemming and lemmatization

You can use regex from re. 
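
As a quick illustration of steps 1 to 4 on a single string, here is a minimal sketch that uses only `re`. The function name `clean_text` and the tiny `STOPWORDS` set are assumptions made for this example; the full NLTK stopword list and step 5 are not reproduced here.

```python
import re

# Tiny illustrative stopword set (hypothetical); NLTK's English list is much larger.
STOPWORDS = {"the", "a", "is", "at", "this", "and"}

def clean_text(text):
    """Lowercase, strip HTML tags and URLs, drop non-alphanumerics and stopwords."""
    text = text.lower()
    text = re.sub(r"<.*?>", " ", text)                   # 1. HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # 2. URLs
    text = re.sub(r"[^a-z0-9\s]", "", text)              # 3. non-alphanumeric characters
    words = [w for w in text.split() if w not in STOPWORDS]  # 4. stopwords
    return " ".join(words)

cleaned = clean_text("This movie is GREAT!<br />See http://example.com for more.")
print(cleaned)  # movie great see for more
```

The order matters: tags and URLs are removed before the non-alphanumeric filter, because that filter would otherwise delete the `<`, `>`, `:` and `/` characters the first two patterns rely on.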

In [50]:
import nltk
import nltk.tokenize
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
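# alias the corpus reader so that re-running the cell does not break on the rebound name
from nltk.corpus import stopwords as nltk_stopwords
from nltk import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
stopwords = set(nltk_stopwords.words('english'))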
reviews = df['review'].values
labels = df['sentiment'].values
# lowercase
reviews = [review.lower() for review in reviews]
# remove html tags
reviews = [re.sub(r'<.*?>', ' ', review) for review in reviews]
# remove url
reviews = [re.sub(r'http[s]?://\S+', ' ', review) for review in reviews]
reviews = [re.sub(r'www\.\S+', ' ', review) for review in reviews]
reviews = [review.split(' ') for review in reviews]
# remove non alphanumeric characters
reviews = [[re.sub(r'[^a-zA-Z0-9]','', word) for word in review] for review in reviews]
# remove stopwords
reviews = [[word for word in review if word!='' and word not in stopwords] for review in reviews]
# lemmatization
reviews = [[lemmatizer.lemmatize(word,pos='v') for word in review] for review in reviews]
reviews = [[word for word in review if word!=''] for review in reviews]
reviews = [' '.join(words) for words in reviews]
df['review'] = reviews

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [51]:
# Print statistics of the data, such as average sentence length and proportion of data per class label
# find total number of words in corpus
total_words = sum([len(review.split(' ')) for review in reviews])
# find distinct words in corpus (set.update works in place, so this stays fast)
distinct_words = set()
for review in reviews:
  distinct_words.update(review.split(' '))
num_distinct_words = len(distinct_words)
# find max_length of sentence in corpus
max_length = max([len(review.split(' ')) for review in reviews])
num_sentences = len(reviews)
print("Number of distinct words in corpus is:", num_distinct_words)
print("Average length of sentence is:", total_words/num_sentences)
print("Maximum length of sentence is:", max_length)
print("Proportion of data with positive labels is:",np.sum(labels=='positive')/len(labels))
print("Proportion of data with negative labels is:",np.sum(labels=='negative')/len(labels))

Number of distinct words in corpus is: 153417
Average length of sentence is: 120.23092
Maximum length of sentence is: 1437
Proportion of data with positive labels is: 0.5
Proportion of data with negative labels is: 0.5


# Naive Bayes classifier

In [52]:
# get reviews column from df
reviews = df['review'].values

# get labels column from df
labels = df['sentiment'].values

In [53]:
# Use label encoder to encode labels. Convert to 0/1
encoder = LabelEncoder()
encoded_labels = encoder.fit_transform(labels)

print(encoder.classes_)

['negative' 'positive']


In [54]:
# Split the data into train and test (80% - 20%). 
# Use stratify in train_test_split so that both train and test have similar ratio of positive and negative samples.
train_sentences, test_sentences, train_labels, test_labels = train_test_split(reviews, encoded_labels, test_size=0.20, stratify=encoded_labels, random_state=0)

# train_sentences, test_sentences, train_labels, test_labels

There are two possible approaches to building the vocabulary for Naive Bayes.
1. Build the vocabulary from the whole data (train + test). This way no word is out of vocabulary at test time.
2. Build the vocabulary from the train data only. In this case some words from the test set may not be in the vocabulary, so smoothing is needed to keep the probability term from being zero.
 
You are supposed to follow the 2nd approach.
 
Also, building the vocabulary from all words in the train set is memory intensive, so you are required to build it from the 2000 to 3000 most frequent words in the training corpus.
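
A minimal sketch of the second approach with a frequency cut-off, using only the standard library. The helper name `build_vocab` and the toy sentences are assumptions for illustration; `CountVectorizer(max_features=...)` does the equivalent job on the real data, possibly with different tie-breaking between equally frequent words.

```python
from collections import Counter

def build_vocab(sentences, max_size):
    """Return the max_size most frequent whitespace-separated tokens."""
    counts = Counter(word for s in sentences for word in s.split())
    return [word for word, _ in counts.most_common(max_size)]

# toy training sentences (assumed)
train = ["good movie good plot", "bad movie", "good act"]
vocab = build_vocab(train, max_size=2)
print(vocab)  # ['good', 'movie']

# words of a test sentence that fall outside the train vocabulary
vocab_set = set(vocab)
oov = [w for w in "great movie".split() if w not in vocab_set]
print(oov)  # ['great']
```

The out-of-vocabulary word `great` is exactly the case that smoothing has to handle.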

> $P(x_i \mid w_j) = \frac{N_{x_i,w_j} + \alpha}{N_{w_j} + \alpha d}$


$N_{x_i,w_j}$ : Number of times feature $x_i$ appears in samples of class $w_j$

$N_{w_j}$ : Total count of features in class $w_j$

$\alpha$ : Parameter for additive smoothing. Here, use $\alpha = 1$

$d$ : Dimensionality of the feature vector $x = [x_1,x_2,...,x_d]$. In our case it is the vocabulary size.
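
To make the formula concrete, here is a small worked example with toy counts. The function name `smoothed_prob` and all the numbers are assumptions chosen for illustration, not values from the IMDB data.

```python
def smoothed_prob(word_count, class_total, vocab_size, alpha=1.0):
    """(N_{x_i,w_j} + alpha) / (N_{w_j} + alpha * d)"""
    return (word_count + alpha) / (class_total + alpha * vocab_size)

# toy counts of each vocabulary word inside one class (assumed)
counts = {"good": 5, "bad": 1, "plot": 2, "act": 0}
class_total = sum(counts.values())  # N_{w_j} = 8
d = len(counts)                     # vocabulary size = 4

probs = {w: smoothed_prob(c, class_total, d) for w, c in counts.items()}
print(probs["good"])  # (5 + 1) / (8 + 4) = 0.5
print(probs["act"])   # (0 + 1) / (8 + 4), small but not zero
```

With $\alpha = 1$ a word that never occurs in the class still gets probability $1/12$ here, and the smoothed probabilities over the four vocabulary words still sum to one.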






In [55]:
from sklearn.feature_extraction.text import CountVectorizer
# Use Count vectorizer to get frequency of the words
'''
max_features parameter : If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus.
vec = CountVectorizer(max_features = 3000)
X = vec.fit_transform(Sentence_list)
'''
vectorizer = CountVectorizer(max_features = 3000)
# fitting on train_sentences and converting  to train_features array
train_features = vectorizer.fit_transform(train_sentences).toarray()
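# get the vocabulary (scikit-learn >= 1.0; get_feature_names() was removed in 1.2, older versions need get_feature_names())
vocab = vectorizer.get_feature_names_out()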

In [56]:
# Use Laplace smoothing for words in the test set that are not in the train vocabulary

# Solution: see the conditional_probability function in the next cell, which looks words up in the vocab_map dictionary

In [57]:
# Build the model. Don't use the model from sklearn
num_classes = 2
d = len(vocab)

# probability of pos and neg classes
neg_class_prob = np.sum(train_labels==0)/len(train_labels)
pos_class_prob = np.sum(train_labels==1)/len(train_labels)

# class_word_counts stores total count of a particular word in a particular class
class_word_counts = np.zeros((num_classes,d))
# class_counts stores total count of words in a particular class
class_counts = np.zeros(num_classes)

for i in range(num_classes):
  class_mask = (train_labels==i) # find examples belonging to ith class
  class_word_counts[i] = np.sum(train_features[class_mask],axis=0) # find total count of particular word in a class

class_counts = np.sum(class_word_counts,axis=1) # find total num words for each class

# dictionary storing indices of words in vocabulary
vocab_map = {word:i for i, word in enumerate(vocab)}

# Use laplace smoothing for words in test set not present in vocab of train set
# Finds the smoothed conditional probability of a word given its class
def conditional_probability(word, label):
  n=0 # if word not in vocabulary
  if word in vocab_map:
    n=class_word_counts[label,vocab_map[word]] # if word in vocabulary, count of word in class
  return (n+1)/(class_counts[label]+d)

# predicts label for sentences based on Naive Bayes model constructed above
def predict(sentence):
  words = sentence.split(' ')
  # log probabilities are summed instead of multiplying probabilities, so that the product of many small numbers does not underflow
  log_neg_prob = np.sum([np.log(conditional_probability(word,0)) for word in words]) # adding log(p(word|class)) for neg class
  log_pos_prob = np.sum([np.log(conditional_probability(word,1)) for word in words]) # adding log(p(word|class)) for pos class
  log_neg_prob += np.log(neg_class_prob) # adding log(p(class)) for neg class
  log_pos_prob += np.log(pos_class_prob) # adding log(p(class)) for pos class
  # find the more probable class
  if log_pos_prob>=log_neg_prob:
    return 1
  else:
    return 0
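
The reason for working in log space can be seen with a short sketch. The probabilities below are toy values (assumed), chosen only to show the floating-point effect.

```python
import math

# 400 words, each with conditional probability 1e-4 (toy values, assumed)
word_probs = [1e-4] * 400

# direct product: the true value 1e-1600 is far below the smallest float
product = 1.0
for p in word_probs:
    product *= p
print(product)  # 0.0

# sum of logs: stays a perfectly ordinary number
log_score = sum(math.log(p) for p in word_probs)
print(log_score)
```

Once both class scores collapse to `0.0` they can no longer be compared, whereas the log scores keep their ordering because the logarithm is monotonic.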

In [58]:
# Test the model on test set and report Accuracy
predicted_test_labels = [predict(sentence) for sentence in test_sentences]
# finding accuracy on test set
test_accuracy = np.mean(predicted_test_labels==test_labels)
print("Accuracy on test set is:",test_accuracy)

Accuracy on test set is: 0.8444


# *LSTM* based Classifier

Use the above train and test splits.

In [74]:
# Hyperparameters of the model
# as the vocabulary size is 153417 which is too large, we take the 3000 most frequent words as vocab
vocab_size = 3000 # choose based on statistics
oov_tok = '<OOV>'
embedding_dim = 100
# the average review length is about 120 tokens, so max_length is set to 150 and longer reviews are truncated to 150 tokens
max_length = 150 # choose based on statistics, for example 150 to 200
padding_type='post'
trunc_type='post'

In [75]:
# tokenize sentences
tokenizer = Tokenizer(num_words = vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(train_sentences)
word_index = tokenizer.word_index

# convert train dataset to sequence and pad sequences
# note: truncating is not passed, so reviews longer than max_length are cut with the Keras default ('pre'); trunc_type is unused
train_sequences = tokenizer.texts_to_sequences(train_sentences)
train_padded = pad_sequences(train_sequences, padding=padding_type, maxlen=max_length)

# convert Test dataset to sequence and pad sequences
test_sequences = tokenizer.texts_to_sequences(test_sentences)
test_padded = pad_sequences(test_sequences, padding=padding_type, maxlen=max_length)
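
The calls above set the padding side but do not pass `truncating`, so Keras falls back to its default, `'pre'`. The pure-Python sketch below imitates the two options on toy sequences so the difference is visible; `pad_or_truncate` is a hypothetical helper written for this illustration, not part of Keras.

```python
def pad_or_truncate(seq, maxlen, padding="pre", truncating="pre", value=0):
    """Imitate the padding/truncating options of pad_sequences for one sequence."""
    if len(seq) > maxlen:
        # 'pre' drops tokens from the start, 'post' drops them from the end
        seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
    pad = [value] * (maxlen - len(seq))
    return pad + seq if padding == "pre" else seq + pad

long_seq = [1, 2, 3, 4, 5, 6]
short_seq = [7, 8]

print(pad_or_truncate(long_seq, 4, padding="post"))                     # [3, 4, 5, 6]
print(pad_or_truncate(long_seq, 4, padding="post", truncating="post"))  # [1, 2, 3, 4]
print(pad_or_truncate(short_seq, 4, padding="post"))                    # [7, 8, 0, 0]
```

So with the default, a review longer than `max_length` keeps its last tokens; passing `truncating='post'` would keep the first ones instead.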

In [76]:
# model initialization
model = keras.Sequential([
    keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    keras.layers.Bidirectional(keras.layers.LSTM(64)),
    keras.layers.Dense(24, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])

# compile model
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

# model summary
model.summary()

Model: "sequential_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_6 (Embedding)      (None, 150, 100)          300000    
_________________________________________________________________
bidirectional_6 (Bidirection (None, 128)               84480     
_________________________________________________________________
dense_12 (Dense)             (None, 24)                3096      
_________________________________________________________________
dense_13 (Dense)             (None, 1)                 25        
Total params: 387,601
Trainable params: 387,601
Non-trainable params: 0
_________________________________________________________________
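
The parameter counts in the summary can be checked by hand. The sketch below redoes the arithmetic from the hyperparameters used above; the variable names are chosen for this illustration.

```python
vocab_size, embedding_dim, lstm_units, dense_units = 3000, 100, 64, 24

# Embedding: one vector of size embedding_dim per vocabulary index
embedding = vocab_size * embedding_dim

# One LSTM direction: 4 gates, each with input weights, recurrent weights and a bias
lstm_one_direction = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)
bilstm = 2 * lstm_one_direction  # forward + backward

# Dense layers: weights + biases; the BiLSTM output has 2 * lstm_units features
dense_hidden = (2 * lstm_units) * dense_units + dense_units
dense_out = dense_units * 1 + 1

total = embedding + bilstm + dense_hidden + dense_out
print(embedding, bilstm, dense_hidden, dense_out)  # 300000 84480 3096 25
print(total)                                       # 387601
```

Most of the parameters sit in the embedding table, which is one reason the vocabulary is capped at 3000 words.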


In [77]:
num_epochs = 5
history = model.fit(train_padded, train_labels, 
                    epochs=num_epochs, verbose=1, 
                    validation_split=0.1)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [78]:
# Calculate accuracy on Test data
'''
prediction = model.predict(test_padded)

'''
# Get probabilities
prediction_prob = model.predict(test_padded)

# Get labels based on probability 1 if p>= 0.5 else 0
predicted_test_labels = [1 if prob>=0.5 else 0 for prob in prediction_prob]

# Accuracy : one can use classification_report from sklearn
test_accuracy = np.mean(predicted_test_labels==test_labels)
print("Accuracy on test set is", test_accuracy)
print("--------------------------")
print(classification_report(test_labels,predicted_test_labels))

Accuracy on test set is 0.8703
--------------------------
              precision    recall  f1-score   support

           0       0.90      0.83      0.87      5000
           1       0.84      0.91      0.87      5000

    accuracy                           0.87     10000
   macro avg       0.87      0.87      0.87     10000
weighted avg       0.87      0.87      0.87     10000
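
For reference, the per-class columns of the report follow directly from counts of true positives, false positives and false negatives. A minimal sketch on toy labels (assumed, not the model's predictions); `binary_report` is a hypothetical helper, not the sklearn function.

```python
def binary_report(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# toy labels: one negative review is wrongly predicted as positive
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0]

print(binary_report(y_true, y_pred, positive=1))  # precision 0.75, recall 1.0
print(binary_report(y_true, y_pred, positive=0))  # precision 1.0, recall 2/3
```

The toy case shows the same pattern as the report above: a model that leans towards predicting class 1 gets higher recall but lower precision on class 1, and the reverse on class 0.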



## Get predictions for random examples

In [79]:
# reviews on which we need to predict
sentence = ["The movie was very touching and heart whelming", 
            "I have never seen a terrible movie like this", 
            "the movie plot is terrible but it had good acting"]

# convert to a sequence
sequences = tokenizer.texts_to_sequences(sentence)

# pad the sequence
padded = pad_sequences(sequences, padding='post', maxlen=max_length)

# Get probabilities
prediction_prob = model.predict(padded)
print(prediction_prob)

# Get labels based on probability 1 if p>= 0.5 else 0
predicted_labels = [1 if prob>=0.5 else 0 for prob in prediction_prob]
print("Predicted labels:",predicted_labels)


[[0.69801337]
 [0.1681131 ]
 [0.12915443]]
Predicted labels: [1, 0, 0]
