![alt text](https://drive.google.com/uc?export=view&id=1UXScsVx_Wni_JuDdB8LeTnM6jsPfIwkW)

Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.

# Sentiment Classification

### Dataset
- Dataset of 50,000 movie reviews from IMDB, labeled by sentiment: positive (1) or negative (0).
- Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers).
- For convenience, words are indexed by overall frequency in the dataset, so that for instance the integer "3" encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as: "only consider the top 10,000 most common words, but eliminate the top 20 most common words".
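- As a convention, "0" does not stand for a specific word. With the default `imdb.load_data()` settings it is reserved for padding, while "1" marks the start of a sequence and "2" encodes any unknown (out-of-vocabulary) word.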
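
Because words are indexed by frequency rank, the filtering described above reduces to a range check on each integer. The following is a minimal sketch in plain Python on toy data. The helper name `filter_by_rank` is hypothetical, the sketch ignores the offset that `imdb.load_data()` adds for its reserved indexes, and using 2 as the replacement marker is an assumption made for this illustration. In the real loader, the `num_words` and `skip_top` arguments serve this purpose.

```python
def filter_by_rank(sequences, num_words=10000, skip_top=20, oov_index=2):
    """Keep indexes with skip_top <= index < num_words; replace all others."""
    return [
        [w if skip_top <= w < num_words else oov_index for w in seq]
        for seq in sequences
    ]


# Toy reviews (hypothetical): very frequent and very rare words get replaced
toy_reviews = [[3, 25, 9999, 10000], [19, 20, 4500]]
filtered = filter_by_rank(toy_reviews)
print(filtered)  # [[2, 25, 9999, 2], [2, 20, 4500]]
```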

Command to import data
- `from tensorflow.keras.datasets import imdb`

In [25]:
#### Add your code here ####

from tensorflow.keras.datasets import imdb

### Import the data (2 Marks)
- Use `imdb.load_data()` method
- Get train and test set
- Take 10000 most frequent words

In [26]:
#Take 10000 most frequent words
vocab_size= 10000 

#load dataset
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)

In [27]:
# After loading the dataset, let's review it
print("No of reviews in the dataset :",x_train.shape)
print("word index in the first review item :",x_train[0])
print("No of words in the first review item :",len(x_train[0]))

No of reviews in the dataset : (25000,)
word index in the first review item : [1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 53

### Pad each sentence to be of same length (2 Marks)
- Take maximum sequence length as 300

In [28]:
#### Add your code here ####

# Make all sequences the same length using Keras pad_sequences

from tensorflow.keras.preprocessing.sequence import pad_sequences

# Maximum number of words kept per review
wordlen = 300

# padding='post' appends the padding index 0 after the end of each review;
# reviews longer than wordlen are truncated (from the start, by default)
x_train = pad_sequences(x_train, maxlen=wordlen, padding='post')
x_test = pad_sequences(x_test, maxlen=wordlen, padding='post')

In [29]:
# After padding the dataset, let's review it
print("\nNo of reviews in the dataset :",x_train.shape)
print("\nNo of words in the first review item :",len(x_train[0]))
print("\nword indexes in the first review item :",x_train[0])


No of reviews in the dataset : (25000, 300)

No of words in the first review item : 300

word indexes in the first review item : [   1   14   22   16   43  530  973 1622 1385   65  458 4468   66 3941
    4  173   36  256    5   25  100   43  838  112   50  670    2    9
   35  480  284    5  150    4  172  112  167    2  336  385   39    4
  172 4536 1111   17  546   38   13  447    4  192   50   16    6  147
 2025   19   14   22    4 1920 4613  469    4   22   71   87   12   16
   43  530   38   76   15   13 1247    4   22   17  515   17   12   16
  626   18    2    5   62  386   12    8  316    8  106    5    4 2223
 5244   16  480   66 3785   33    4  130   12   16   38  619    5   25
  124   51   36  135   48   25 1415   33    6   22   12  215   28   77
   52    5   14  407   16   82    2    8    4  107  117 5952   15  256
    4    2    7 3766    5  723   36   71   43  530  476   26  400  317
   46    7    4    2 1029   13  104   88    4  381   15  297   98   32
 2071   56   26  141

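**After padding, every review vector has a length of 300. Shorter reviews are filled with the index 0, and because I used `padding="post"` the zeros appear after the original review. Reviews longer than 300 words are truncated; with the default `truncating="pre"`, the words at the start of the review are dropped.**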
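
To make the padding and truncation behaviour concrete, here is a small plain-Python sketch that imitates what `pad_sequences(..., padding='post')` does to a single sequence under its default `truncating='pre'` setting. The helper `pad_post` is hypothetical and only mirrors the behaviour for illustration; it is not the Keras implementation.

```python
def pad_post(seq, maxlen, value=0):
    """Imitate padding='post' with truncating='pre' for one sequence."""
    if len(seq) > maxlen:
        # 'pre' truncation drops tokens from the start, keeping the last maxlen
        return seq[-maxlen:]
    # 'post' padding appends the padding value after the real tokens
    return seq + [value] * (maxlen - len(seq))


short_review = pad_post([1, 14, 22], maxlen=5)
long_review = pad_post([1, 14, 22, 16, 43, 530, 973], maxlen=5)
print(short_review)  # [1, 14, 22, 0, 0]
print(long_review)   # [22, 16, 43, 530, 973]
```

Passing `truncating='post'` to `pad_sequences` would instead keep the beginning of long reviews.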

### Print shape of features & labels (2 Marks)

In [30]:
#### Add your code here ####

print ("\nShape of Features(Review) in training set: ", x_train.shape)
print ("\nShape of Labels in training set: ", y_train.shape)

print ("\nShape of Features(Review) in test set: ", x_test.shape)
print ("\nShape of Labels in test set: ", y_test.shape)


Shape of Features(Review) in training set:  (25000, 300)

Shape of Labels in training set:  (25000,)

Shape of Features(Review) in test set:  (25000, 300)

Shape of Labels in test set:  (25000,)


**Out of a total of 50,000 reviews, 25,000 are used for training and the remaining 25,000 for testing. After padding, every review has a fixed length of 300.**

**Number of reviews, number of words in each review**

In [31]:
print ("Number of reviews - in Training Set: ", (x_train.shape)[0])
print ("Number of reviews - in Test Set    : ", (x_test.shape)[0])


Number of reviews - in Training Set:  25000
Number of reviews - in Test Set    :  25000


In [32]:
#Number of words in each review (Considered here the Training Set Only)

#Excluding 0 during count as we have already done padding with 0 index
list_review_wordscount=([sequence[sequence!=0].size for sequence in x_train])

#no of review
print("Total No of Review item in the List: ",len(list_review_wordscount),'\n')

print("Printing the first 10 review word count from the entire Training Set of 25000\n")
for i in range(0,10): 
    print(i, list_review_wordscount[i])

Total No of Review item in the List:  25000 

Printing the first 10 review word count from the entire Training Set of 25000

0 218
1 189
2 141
3 300
4 147
5 43
6 123
7 300
8 233
9 130


In [33]:
#Number of words in each review (Considered here the Test Set Only)

#Excluding 0 during count as we have already done padding with 0 index
list_review_wordscount=([sequence[sequence!=0].size for sequence in x_test])

#no of review
print("Total No of Review item in the List: ",len(list_review_wordscount),'\n')

print("Printing the first 10 review word count from the entire Test Set of 25000\n")
for i in range(0,10):
    print(i, list_review_wordscount[i])

Total No of Review item in the List:  25000

Printing the first 10 review word count from the entire Test Set of 25000

0 68
1 260
2 300
3 181
4 108
5 132
6 300
7 180
8 134
9 300


I have **printed the word counts of only the first 10 reviews from both the Training and Test sets** to avoid a very long listing. To print every count, loop over the entire list instead of the first 10 items. A count of 300 means the review fills the whole padded length and may have been truncated.

**Number of labels**

In [34]:
#### Add your code here ####

import numpy as np

print("Unique labels in Training Set ", np.unique(y_train))
print("Unique labels in Test Set     ", np.unique(y_test))

Unique labels in Training Set  [0 1]
Unique labels in Test Set      [0 1]


Label **0 is Negative** and label **1 is Positive**. Both labels occur in the training data and in the test data.

### Print value of any one feature and its label (2 Marks)

**Feature value**

In [35]:
#### Add your code here ####

rev_no=11 # Pick an arbitrary review, here index 11
print("Feature:", x_test[rev_no])

Feature: [   1   54   13   86  219   14   20   11    4  750   13   16   38 1612
   12  340 4280   11   61  652   13  161   67   12   18    6 2068   95
  872   51    4  609  903   67  146  149   32    2  102  150    8   67
  121   12  435  355   61  482    9   12   16   19  755  457   15   16
    4   86    8    2    4  226   13  244   11    6  925    2   13   67
  916  538  449    2   51    9 1448  449   94    6  925  449   94   24
    6  925  449  858   13   67  142 3642  449  115  330    2  769  148
 2289   92   60 5276    4  953    8   30 3057   42 1231    8    4 2269
   39    4   86  470  102   15   48   25  219    2   25   26  184   76
 5822    5  351    4   86  342    2    8   14  769   63   93   12   38
  629   11    4   86  273  164  939  164  916    4  953  188 3057 4572
   36  385   16    4   64   31   15  100 4866   41   96   46    7   12
   86   88    7 1697 1265   95   88   59   69 1618   44    4   14   20
   33  222 1027    8 1231    8   32   15   60  151   12   16    6   

**Label value**

In [36]:
#### Add your code here ####

print("Label:", y_test[[rev_no]])

Label: [0]


In [37]:
# Let's look at the first 15 test labels
y_test[0:15]

array([0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1])

### Decode the feature value to get original sentence (2 Marks)

***First, retrieve a dictionary that contains mapping of words to their index in the IMDB dataset***

In [38]:
#### Add your code here ####

# load the dictionary mappings from word to integer index
word_index_dict = imdb.get_word_index()

***Now use the dictionary to get the original words from the encodings, for a particular sentence***

In [39]:
#### Add your code here ####

# reverse word index to map integer indexes to their respective words
reverse_word_index = dict([(value, key) for (key, value) in word_index_dict.items()])

# decode the review, mapping integer indices to words
# indices are offset by 3 because 0, 1, and 2 are reserved for "padding", "start of sequence" and "unknown"

orig_review = ' '.join([reverse_word_index.get(i-3, '?') for i in x_test[rev_no]])
print("Original Review Content:\n")

orig_review

Original Review Content:



"? when i first saw this movie in the theater i was so angry it completely blew in my opinion i didn't see it for a decade then decided what the hell let's see i'm watching all ? movies now to see where it went wrong my guess is it was with sequel 5 that was the first to ? the whole i am in a dream ? i see weird stuff oh ? what is happening oh its a dream oh its not a dream oh wait i see something spooky oh never mind ? storyline those sequels don't even require the box to be opened or stick to the rules from the first 4 movies that if you saw ? you are pretty much screwed and dead the first 3 ? to this storyline which made it so scary in the first place nothing fantasy nothing weird the box got opened boom they came was the only one that could bargain her way out of it first because of uncle frank then because she had information about the this movie at least attempts to stick to all that even though it was a bad story it was still somewhat ? no i'm pretty sure part 5 was the first pa
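
In the decoded text, every reserved index shows up as `?`, so padding, the start marker and unknown words cannot be told apart. The sketch below is one possible variant that names the reserved indexes explicitly. The helper `decode_review`, the token strings and the three-word dictionary are all hypothetical; the offset of 3 is the default `index_from` value of `imdb.load_data()`.

```python
RESERVED = {0: "<pad>", 1: "<start>", 2: "<unk>"}  # token names are illustrative
INDEX_OFFSET = 3  # default index_from used by imdb.load_data()


def decode_review(encoded, word_index):
    """Map encoded integers back to words, naming the reserved indexes."""
    reverse = {rank: word for word, rank in word_index.items()}
    tokens = []
    for i in encoded:
        if i in RESERVED:
            tokens.append(RESERVED[i])
        else:
            tokens.append(reverse.get(i - INDEX_OFFSET, "<unk>"))
    return " ".join(tokens)


# Hypothetical toy mapping, not the real IMDB word index
toy_word_index = {"the": 1, "movie": 2, "great": 3}
decoded = decode_review([1, 4, 5, 2, 6, 0, 0], toy_word_index)
print(decoded)  # <start> the movie <unk> great <pad> <pad>
```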

***Get the sentiment for the above sentence***
- positive (1)
- negative (0)

## Defining a Baseline Model:
When we work with machine learning, one important step is to define a baseline model. This is usually **a simple model**, ***which then serves as a point of comparison for the more advanced models that you want to test.*** In this case, I have used **a LogisticRegression model as the baseline** to compare against the more advanced methods involving (deep) neural networks.

In [40]:
#### Add your code here ####

# The baseline classifier is logistic regression, a simple yet powerful linear model
# that maps a weighted sum of the input features to a probability between 0 and 1.

from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression()
classifier.fit(x_train, y_train)
score = classifier.score(x_test, y_test)
print(score)
y_pred = classifier.predict(x_test[[rev_no]])
print(y_pred)
if(y_pred[0] == 1):
  print("Positive: ", y_pred[0]) 
else:
  print("Negative: ", y_pred[0]) 

0.50376
[0]
Negative:  0


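**The review above (rev_no 11)** is mixed but leans negative, with phrases such as:

*   though it was a bad story
*   nothing fantasy nothing weird

so it is plausibly a negative review, which matches its label of 0.

With the **LogisticRegression baseline** the **prediction for this review is correct**, but the **test accuracy is only 50.38%**, which is about what random guessing between two classes achieves. The single correct prediction therefore says little: across the test set the baseline is wrong about half the time. This is expected, because the model treats the padded word indexes as numeric features even though they are arbitrary identifiers.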
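
One reason a linear model struggles here is that a word index is an identifier, not a magnitude: index 4000 is not "more" than index 40. A common alternative input for a linear baseline is a bag-of-words (multi-hot) vector that records which words occur in a review. The sketch below shows that encoding in plain Python on toy data. The helper `multi_hot` is hypothetical and was not part of the experiment above, so no accuracy figure is claimed for it.

```python
def multi_hot(sequences, vocab_size):
    """Encode each review as a 0/1 vector marking which word indexes occur."""
    encoded = []
    for seq in sequences:
        row = [0] * vocab_size
        for idx in seq:
            if 0 < idx < vocab_size:  # skip the padding index 0
                row[idx] = 1
        encoded.append(row)
    return encoded


# Toy padded reviews (hypothetical) over a vocabulary of 8 indexes
toy_reviews = [[1, 4, 4, 7, 0, 0], [1, 5, 0, 0, 0, 0]]
features = multi_hot(toy_reviews, vocab_size=8)
print(features[0])  # [0, 1, 0, 0, 1, 0, 0, 1]
print(features[1])  # [0, 1, 0, 0, 0, 1, 0, 0]
```

Rows like these could be passed to `LogisticRegression.fit` in place of the padded index sequences. For 25,000 reviews and a 10,000-word vocabulary, a NumPy or sparse array would be a more practical container than nested lists.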




## Define model (10 Marks)
- Define a Sequential Model
- Add Embedding layer
  - Embedding layer turns positive integers into dense vectors of fixed size
  - `tensorflow.keras` embedding layer doesn't require us to onehot encode our words, instead we have to give each word a unique integer number as an id. For the imdb dataset we've loaded this has already been done, but if this wasn't the case we could use sklearn LabelEncoder.
  - Size of the vocabulary will be 10000
  - Give dimension of the dense embedding as 100
  - Length of input sequences should be 300
- Add LSTM layer
  - Pass value in `return_sequences` as True
- Add a `TimeDistributed` layer with 100 Dense neurons
- Add Flatten layer
- Add Dense layer
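
Conceptually, an embedding layer is a lookup table with one row per vocabulary index. The sketch below illustrates that idea in plain Python with toy sizes. The helpers `make_embedding_table` and `embed` are hypothetical, and the random values only stand in for weights that Keras would learn during training.

```python
import random


def make_embedding_table(vocab_size, embed_dim, seed=0):
    """Create one small random vector per vocabulary index."""
    rng = random.Random(seed)
    return [
        [rng.uniform(-0.05, 0.05) for _ in range(embed_dim)]
        for _ in range(vocab_size)
    ]


def embed(sequence, table):
    """Replace every integer in the sequence by its row in the table."""
    return [table[idx] for idx in sequence]


# Toy sizes; the notebook uses a vocabulary of 10000 and dimension 100
table = make_embedding_table(vocab_size=50, embed_dim=4)
padded_review = [1, 14, 22, 0, 0]
vectors = embed(padded_review, table)
print(len(vectors), len(vectors[0]))  # 5 4
```

With the notebook's settings, a review of 300 indexes becomes a 300 x 100 matrix, which is the shape the LSTM layer receives for each sample.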

In [41]:
#### Add your code here ####

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, LSTM, Dropout, Embedding, TimeDistributed, Bidirectional

#We already Have the information Given as below

#Size of the vocabulary will be 10000
vocab_size= 10000

#Give dimension of the dense embedding as 100
embedDim=100

#Length of input sequences should be 300
max_review_len = 300

# Define a Sequential Model
model = Sequential()

#Add Embedding layer
model.add(Embedding(input_dim=vocab_size, output_dim=embedDim, input_length=max_review_len))

#Add a Bidirectional LSTM layer with dropout=0.2
model.add(Bidirectional(LSTM(units=100, return_sequences=True, dropout=0.2), merge_mode="concat")) # the default "tanh" activation is used here

#Add a TimeDistributed layer with 100 Dense neurons(Already mentioned as required)
model.add(TimeDistributed(Dense(100, activation="relu")))

#Add Flatten layer
model.add(Flatten())

#Add Dense layer
model.add(Dense(50, activation='relu'))
model.add(Dropout(0.5))

#Output Layer
model.add(Dense(1, activation='sigmoid'))

### Compile the model (2 Marks)
- Use Optimizer as Adam
- Use Binary Crossentropy as loss
- Use Accuracy as metrics
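
For reference, the loss and the metric named above can be written out in a few lines. This is a plain-Python sketch on made-up probabilities, not the Keras implementation; the function names, the clipping constant `eps` and the 0.5 threshold are assumptions chosen for the illustration.

```python
import math


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded probability equals the label."""
    hits = sum(int(p > threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)


labels = [1, 0, 1, 0]            # hypothetical labels
probs = [0.9, 0.2, 0.4, 0.6]     # hypothetical sigmoid outputs
print(round(binary_crossentropy(labels, probs), 4))
print(accuracy(labels, probs))   # 0.5
```

The loss penalises confident wrong probabilities more heavily than accuracy does, which is why it is the quantity being optimised while accuracy is only reported.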

In [42]:
#### Add your code here ####

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

### Print model summary (2 Marks)

In [43]:
#### Add your code here ####

model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 300, 100)          1000000   
_________________________________________________________________
bidirectional_1 (Bidirection (None, 300, 200)          160800    
_________________________________________________________________
time_distributed_1 (TimeDist (None, 300, 100)          20100     
_________________________________________________________________
flatten_1 (Flatten)          (None, 30000)             0         
_________________________________________________________________
dense_4 (Dense)              (None, 50)                1500050   
_________________________________________________________________
dropout_1 (Dropout)          (None, 50)                0         
_________________________________________________________________
dense_5 (Dense)              (None, 1)                
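
The parameter counts listed in the summary can be reproduced by hand from the layer sizes. The sketch below does the arithmetic in plain Python; the helper functions are hypothetical names introduced for this illustration, and the formulas are the standard ones for embedding, LSTM and dense layers.

```python
def embedding_params(vocab_size, embed_dim):
    # One vector of length embed_dim per vocabulary entry
    return vocab_size * embed_dim


def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit
    return input_dim * units + units


embedding = embedding_params(10000, 100)
bilstm = 2 * lstm_params(100, 100)       # forward and backward LSTM
time_dist = dense_params(200, 100)       # applied to each of the 300 time steps
dense_50 = dense_params(300 * 100, 50)   # input is the flattened 300 x 100 output
print(embedding, bilstm, time_dist, dense_50)  # 1000000 160800 20100 1500050
```

Most of the trainable weights sit in the dense layer that follows `Flatten`, since it connects all 30,000 flattened values to 50 units.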

### Fit the model (2 Marks)

In [44]:
#### Add your code here ####

from tensorflow.keras.callbacks import EarlyStopping

# Stop when val_loss fails to improve by at least min_delta (patience defaults to 0)
earlystop = EarlyStopping(monitor='val_loss', min_delta=0.001, verbose=1, mode='min')
callbacks_list = [earlystop]

# Use 30% of the training data as validation data
model.fit(x_train, y_train, batch_size=32, epochs=5, validation_split=0.3, callbacks=callbacks_list)

Epoch 1/5
Epoch 2/5
Epoch 00002: early stopping


<tensorflow.python.keras.callbacks.History at 0x7fb19b69c6a0>
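
Training stopped after the second of the five requested epochs because of the early-stopping callback. The sketch below shows the general idea of patience-based early stopping in plain Python. It is a simplified illustration rather than the exact logic of the Keras callback, which differs in detail between versions, and the validation losses are made-up values, not the ones from this run.

```python
def stopping_epoch(val_losses, min_delta=0.001, patience=0):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            # Improvement of more than min_delta: remember it, reset the counter
            best = loss
            wait = 0
        else:
            wait += 1
            if wait > patience:
                return epoch
    return None


hypothetical_val_losses = [0.31, 0.33, 0.30, 0.29]  # made-up values
print(stopping_epoch(hypothetical_val_losses))              # 2
print(stopping_epoch(hypothetical_val_losses, patience=1))  # None
```

In this sketch, a larger `patience` tolerates a temporary rise in validation loss and lets training continue through later epochs.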

### Evaluate model (2 Marks)

In [45]:
#### Add your code here ####

scores = model.evaluate(x_test, y_test, batch_size = 32, verbose= 1)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 88.05%


### Predict on one sample (2 Marks)

In [51]:
#### Add your code here ####

y_pred = model.predict(x_test[[rev_no]]) # Predict the same review used with the baseline model (rev_no=11)

# The sigmoid output is a probability, so threshold it at 0.5;
# int() alone would truncate every value below 1.0 to 0
pred=int(y_pred[0][0] > 0.5)

print("Predicting for the Movie Review Number: ",rev_no,'\n')

if(pred == 1):
  print("Movie Sentiment is : Positive (", pred,')')
else:
  print("Movie Sentiment is : Negative (", pred,')')

Predicting for the Movie Review Number:  11 

Movie Sentiment is : Negative ( 0 )


**The prediction is again Negative** for the same review (rev_no=11) that we evaluated with the baseline model, which matches its label. **However, the test accuracy of this model is much higher: 88.05% compared to 50.38% for the baseline.**
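
A sigmoid output is a probability between 0 and 1, so it has to be compared with a threshold to obtain a class label. Converting it with `int()` truncates toward zero and labels almost every review as negative. The sketch below contrasts the two conversions on made-up probabilities. The helper `to_label` and the 0.5 threshold are assumptions for illustration.

```python
def to_label(probability, threshold=0.5):
    """Turn a sigmoid probability into a 0/1 class label."""
    return int(probability >= threshold)


for p in (0.07, 0.49, 0.93):  # hypothetical sigmoid outputs
    print(p, "truncated:", int(p), "thresholded:", to_label(p))
```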



---

