# Sentiment Classification


## Problem Description:

Generate word embeddings and retrieve the outputs of each layer with Keras, based on a classification task.
Word embeddings are a type of word representation that allows words with similar meanings to have similar representations.
They are a distributed representation for text and are perhaps one of the key breakthroughs behind the impressive 
performance of deep learning methods on challenging natural language processing problems.
We will use the IMDb dataset and learn word embeddings as we train our model. 
This dataset contains movie reviews from IMDb, each labeled with a sentiment (positive or negative); the training and test sets hold 25,000 reviews each.


## Data Description:
The dataset consists of movie reviews from IMDb, labeled by sentiment (positive/negative), with 25,000 reviews for training and 25,000 for testing. 
Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers). 
For convenience, the words are indexed by their frequency in the dataset, 
meaning the word that has index 1 is the most frequent word. To speed up training, we keep only 30 words from each review (`maxlen = 30` in the code) 
and use a maximum vocabulary size of 10,000. As a convention, "0" does not stand for a specific word; in this notebook it is the value used to 
pad short reviews, while words outside the 10,000-word vocabulary are replaced by a reserved out-of-vocabulary index.
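
To make the frequency-based indexing concrete, here is a minimal sketch on a made-up corpus. The names `toy_reviews` and `toy_word_index` are hypothetical, and the real dataset ships already encoded, so this only illustrates the idea rather than reproducing the actual preprocessing.

```python
from collections import Counter

# Hypothetical toy corpus; the real IMDb reviews are already encoded for us.
toy_reviews = [
    "the movie was good",
    "the movie was bad",
    "the plot was thin",
]

# Count every word, then rank by descending frequency (ties broken alphabetically).
counts = Counter(word for review in toy_reviews for word in review.split())
ranked = sorted(counts, key=lambda w: (-counts[w], w))

# Index 1 goes to the most frequent word; 0 is never assigned to a word.
toy_word_index = {word: rank for rank, word in enumerate(ranked, start=1)}

# Each review becomes a list of integer ids.
encoded = [[toy_word_index[w] for w in review.split()] for review in toy_reviews]
print(toy_word_index)
print(encoded)
```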


## Loading the dataset

In [0]:
from keras.datasets import imdb
vocab_size = 10000  # vocabulary size
# num_words keeps only the vocab_size most frequent words in the dataset
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)

In [0]:
from keras.preprocessing.sequence import pad_sequences
vocab_size = 10000  # vocabulary size
maxlen = 30  # number of words used from each review

## Train test split

In [74]:
y_train

array([1, 0, 0, ..., 0, 1, 0])

In [0]:
# Make all sequences the same length; short reviews are padded with zeros at the end
x_train = pad_sequences(x_train, maxlen=maxlen, padding='post')
x_test = pad_sequences(x_test, maxlen=maxlen, padding='post')

In [76]:
print(x_train.shape,y_train.shape)
print(x_test.shape,y_test.shape)

(25000, 30) (25000,)
(25000, 30) (25000,)
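
The following pure-Python sketch mimics what `pad_sequences(..., padding='post')` does with its default `truncating='pre'`: reviews longer than `maxlen` keep their last `maxlen` tokens, and shorter ones get zeros appended. The helper `pad_post_truncate_pre` is a hypothetical stand-in written for illustration, not the Keras implementation.

```python
def pad_post_truncate_pre(sequences, maxlen, value=0):
    """Illustrative stand-in for pad_sequences(padding='post', truncating='pre').

    Assumes maxlen > 0. Not the Keras implementation.
    """
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]               # long input: keep the last maxlen tokens
        kept += [value] * (maxlen - len(kept))   # short input: append zeros at the end
        padded.append(kept)
    return padded

toy = [[5, 6, 7, 8, 9, 10], [3, 4]]
print(pad_post_truncate_pre(toy, maxlen=4))
```

This would explain why the decoded samples in this notebook read like the closing words of a review rather than its opening.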


In [77]:
x_train[0]

array([  18,   51,   36,   28,  224,   92,   25,  104,    4,  226,   65,
         16,   38, 1334,   88,   12,   16,  283,    5,   16, 4472,  113,
        103,   32,   15,   16, 5345,   19,  178,   32], dtype=int32)

In [78]:
y_train[1]

0

In [79]:
# Get the word index, a dictionary that maps each word to its integer id
word_index = imdb.get_word_index()
len(word_index)

88584

In [80]:
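# Invert the mapping: integer id -> word
reverse_word_index = dict(
    [(value, key) for (key, value) in word_index.items()])
# Ids in the loaded data are offset by 3 because 0, 1 and 2 are reserved
decoded_review = ' '.join([reverse_word_index.get(i - 3, '?') for i in x_train[1]])
print(decoded_review)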

truly bad film whose only charm is to look back on the disaster that was the 80's and have a good old laugh at how bad everything was back then
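
The `i - 3` in the decoding step can be illustrated on a miniature vocabulary. This sketch assumes the default offset of `imdb.load_data` (`index_from=3`, with ids 0, 1 and 2 reserved for padding, start of sequence and unknown words); `toy_word_index` and `decode` are hypothetical names.

```python
# Hypothetical miniature of imdb.get_word_index(): word -> frequency rank.
toy_word_index = {"the": 1, "film": 2, "bad": 3}
reverse_toy_index = {rank: word for word, rank in toy_word_index.items()}

INDEX_FROM = 3  # assumed default offset applied by imdb.load_data

def decode(encoded_review):
    # Reserved ids (0, 1, 2) map to non-positive ranks after the shift,
    # so they fall through to '?'.
    return " ".join(reverse_toy_index.get(i - INDEX_FROM, "?") for i in encoded_review)

print(decode([1, 4, 6, 5, 2, 0]))
```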


## Build Keras Embedding Layer Model
We can think of the Embedding layer as a dictionary that maps the index assigned to a word to a word vector. This layer is very flexible and can be used in a few ways:

* The embedding layer can be used at the start of a larger deep learning model. 
* We could also load pre-trained word embeddings into the embedding layer when we create our model.
* We can use the embedding layer to train our own word2vec-style embeddings.

The Keras embedding layer doesn't require us to one-hot encode our words; instead, we have to give each word a unique integer id. For the IMDb dataset we've loaded, this has already been done, but if it weren't, we could use scikit-learn's [LabelEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html).
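
To show why no one-hot encoding is needed, the sketch below compares a plain row lookup with the equivalent one-hot multiplication. The table values are made up, and `embed` and `embed_via_one_hot` are hypothetical helpers; a real embedding layer learns its table during training.

```python
# Hypothetical 4-word vocabulary with 3-dimensional vectors (values are made up).
embedding_table = [
    [0.0, 0.0, 0.0],    # id 0
    [0.1, -0.2, 0.3],   # id 1
    [0.5, 0.4, -0.1],   # id 2
    [-0.3, 0.2, 0.6],   # id 3
]

def embed(word_ids):
    """Row lookup: conceptually what an embedding layer does for each integer id."""
    return [embedding_table[i] for i in word_ids]

def embed_via_one_hot(word_ids):
    """Same result through an explicit one-hot vector multiplied by the table."""
    vocab = len(embedding_table)
    dim = len(embedding_table[0])
    out = []
    for i in word_ids:
        one_hot = [1.0 if j == i else 0.0 for j in range(vocab)]
        out.append([sum(one_hot[j] * embedding_table[j][d] for j in range(vocab))
                    for d in range(dim)])
    return out

sentence = [2, 1, 3]
print(embed(sentence))
print(embed(sentence) == embed_via_one_hot(sentence))
```

The lookup avoids building a vector of vocabulary size for every token, which matters with a 10,000-word vocabulary.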

In [82]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM, Dropout, Bidirectional
model = Sequential()
model.add(Embedding(vocab_size, output_dim=20))  # one 20-dimensional vector per word id
model.add(Dropout(0.1))
model.add(Bidirectional(LSTM(units=300, dropout=0.1, recurrent_dropout=0.2)))
model.add(Dense(1, activation='sigmoid'))  # probability that the review is positive
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

Model: "sequential_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_5 (Embedding)      (None, None, 20)          200000    
_________________________________________________________________
dropout (Dropout)            (None, None, 20)          0         
_________________________________________________________________
bidirectional_4 (Bidirection (None, 600)               770400    
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 601       
Total params: 971,001
Trainable params: 971,001
Non-trainable params: 0
_________________________________________________________________
None
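
The parameter counts in the summary can be checked by hand. The sketch below uses the standard LSTM count of four gates, each with input weights, recurrent weights and a bias; the variable names are illustrative only.

```python
vocab_size, embed_dim, lstm_units = 10000, 20, 300

# Embedding: one embed_dim-sized vector per vocabulary entry.
embedding_params = vocab_size * embed_dim

# One LSTM direction: 4 gates, each with input weights, recurrent weights and a bias.
lstm_params = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)
bidirectional_params = 2 * lstm_params  # forward and backward copies

# Dense: one weight per concatenated LSTM output (2 * units) plus a single bias.
dense_params = 2 * lstm_units * 1 + 1

total = embedding_params + bidirectional_params + dense_params
print(embedding_params, bidirectional_params, dense_params, total)
```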


In [0]:
from tensorflow.keras.callbacks import EarlyStopping
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=5)

In [84]:
batch_size = 200
model.fit(x_train, y_train, batch_size=batch_size, epochs=10, validation_data=(x_test, y_test), verbose=1, callbacks=[es])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 00006: early stopping


<tensorflow.python.keras.callbacks.History at 0x7fe41cebb6d8>
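
Training stopped at epoch 6 with `patience=5`, which would be consistent with the validation loss never improving on its epoch 1 value (the log above does not show the losses, so this is an inference). The sketch below is a simplified model of patience-based early stopping, not the Keras code; `early_stopping_epoch` and the loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None.

    A simplified model of patience-based early stopping, not the Keras code.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0   # improvement: remember it and reset the counter
        else:
            wait += 1              # no improvement this epoch
            if wait >= patience:
                return epoch
    return None

# Made-up validation losses whose best value occurs in epoch 1.
hypothetical_val_losses = [0.52, 0.55, 0.61, 0.70, 0.78, 0.85, 0.90]
print(early_stopping_epoch(hypothetical_val_losses, patience=5))
```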

## Retrieve the output of each layer in Keras for a single test sample from the trained model

In [0]:
y_pred = model.predict(x_test)

In [86]:
y_pred[0]

array([0.8692282], dtype=float32)

In [0]:
# Round the sigmoid outputs to hard 0/1 labels
import numpy as np
y_predr = np.round(y_pred).astype(int).ravel()

In [88]:
from sklearn.metrics import confusion_matrix
import pandas as pd
from sklearn import metrics
print("Confusion Matrix:\n")
print(pd.DataFrame(confusion_matrix(y_test, y_predr, labels=[0, 1]), index=['true:negative', 'true:positive'], columns=['pred:negative', 'pred:positive']))
print(metrics.classification_report(y_test, y_predr))
confusion_matrix(y_test, y_predr)

Confusion Matrix:

               pred:negative  pred:positive
true:negative           9935           2565
true:positive           3347           9153
              precision    recall  f1-score   support

           0       0.75      0.79      0.77     12500
           1       0.78      0.73      0.76     12500

    accuracy                           0.76     25000
   macro avg       0.76      0.76      0.76     25000
weighted avg       0.76      0.76      0.76     25000



array([[9935, 2565],
       [3347, 9153]])
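
The figures in the classification report follow directly from the confusion matrix. As a quick check, this sketch recomputes them from the four counts above, using the usual definitions of precision, recall and F1; the variable names are illustrative.

```python
# Confusion matrix from the run above: rows are true labels, columns are predictions.
tn, fp = 9935, 2565   # truly negative reviews
fn, tp = 3347, 9153   # truly positive reviews

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Class 0 (negative) and class 1 (positive) metrics.
precision_neg, recall_neg = tn / (tn + fn), tn / (tn + fp)
precision_pos, recall_pos = tp / (tp + fp), tp / (tp + fn)
accuracy = (tn + tp) / (tn + fp + fn + tp)

print(f"negative: precision={precision_neg:.2f} recall={recall_neg:.2f} "
      f"f1={f1(precision_neg, recall_neg):.2f}")
print(f"positive: precision={precision_pos:.2f} recall={recall_pos:.2f} "
      f"f1={f1(precision_pos, recall_pos):.2f}")
print(f"accuracy={accuracy:.2f}")
```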

In [0]:
# Retrieve the output of each layer in Keras for a single test sample from the trained model

In [0]:
# Fetch the test record at index 20, wrapped in a batch of size 1
testRecord = np.array([x_test[20]])

In [96]:
decoded_review = ' '.join([reverse_word_index.get(i - 3, '?') for i in x_test[20]])
print(decoded_review)

turns this film will leave you ? i found the cast of this movie to be outstanding and is not a movie to be ignored excellent go rent it today


In [97]:
from keras import backend as K  # with a Sequential model
# Layers 1 to 3 are Dropout, Bidirectional LSTM and Dense; layer 0 (Embedding) supplies the input
for i in range(1, 4):
    get_layer_output = K.function([model.layers[0].input], [model.layers[i].output])
    layer_output = get_layer_output([testRecord])[0]
    print(f'Output of layer{i}:', layer_output)

Output of layer1: [[[-4.65203896e-02  1.86735447e-02 -1.95319410e-02  1.00971468e-01
   -1.04768788e-02 -2.18674708e-02 -9.91381798e-03 -4.53054197e-02
    7.10829999e-03 -7.40332603e-02 -5.32241091e-02  3.22945183e-04
   -3.88734159e-03  3.85249257e-02  1.92133486e-02 -1.00648247e-01
    4.48713899e-02  2.97793304e-06 -2.11031139e-02  4.11805399e-02]
  [ 1.42734321e-02 -2.33644564e-02  2.24801265e-02  4.42969054e-02
    3.45924161e-02 -3.09708472e-02 -3.48970816e-02 -2.38658953e-02
    1.54063879e-02 -7.47941155e-03 -1.20499684e-02 -3.06876730e-02
   -1.69527121e-02 -4.00346741e-02 -1.02210455e-02 -2.55237650e-02
    7.13019213e-03  3.09951752e-02  1.23121487e-02  1.80029664e-02]
  [ 8.84205196e-03  2.17464529e-02  3.91567498e-03  4.95414436e-03
   -3.97935919e-02 -9.97586641e-03 -2.23667664e-03  1.63233969e-02
   -8.65611713e-03 -1.11917891e-02  2.75000166e-02 -5.25040813e-02
    1.55943343e-02  1.16423226e-03  1.82386767e-02  3.78081612e-02
    1.89039838e-02  8.80999397e-03 -2.0854

In [98]:
y_test[28]  # actual label at index 28; the record inspected above is x_test[20], whose label is y_test[20]

1

In [0]:
!pip install -q wordcloud
import nltk

In [100]:
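# Let's submit our own sample review to the model and see the outcome
from nltk import word_tokenize
from keras.preprocessing import sequence
test = []

# Every word must exist in word_index, otherwise the lookup raises a KeyError.
# Raw word_index ids are used here, without the offset of 3 applied by imdb.load_data.
for word in word_tokenize("movie is trash and wasted my time"):
    test.append(word_index[word])
test = sequence.pad_sequences([test], maxlen=maxlen, padding='post')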
prediction = model.predict(test)
print(prediction)

[[0.02353219]]


In [101]:
if np.round(prediction) == 1:
    print("Review is positive")
else:
    print("Review is negative")

Review is negative
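
When encoding new text, the ids should match those the model saw during training. The sketch below is one possible helper, assuming the default `imdb.load_data` conventions (ids shifted by 3, out-of-vocabulary words mapped to 2) and the same tail-keeping, post-padding behaviour used earlier. `encode_review`, `INDEX_FROM`, `OOV_ID` and `toy_word_index` are hypothetical names, and whitespace splitting stands in for a real tokenizer.

```python
INDEX_FROM = 3   # assumed default offset used by imdb.load_data
OOV_ID = 2       # assumed default id for out-of-vocabulary words

def encode_review(text, word_index, vocab_size, maxlen):
    """Map raw text to a fixed-length list of ids (hypothetical helper)."""
    ids = []
    for word in text.lower().split():
        rank = word_index.get(word)            # None when the word is unseen
        if rank is None or rank + INDEX_FROM >= vocab_size:
            ids.append(OOV_ID)                 # unseen, or too rare for the vocabulary
        else:
            ids.append(rank + INDEX_FROM)      # shift to match the training ids
    ids = ids[-maxlen:]                        # keep the tail, as truncating='pre' does
    return ids + [0] * (maxlen - len(ids))     # pad at the end, as padding='post' does

toy_word_index = {"movie": 1, "is": 2, "trash": 3}
print(encode_review("movie is trash zzz", toy_word_index, vocab_size=10, maxlen=6))
```

With the real data, one would pass `imdb.get_word_index()`, `vocab_size` and `maxlen` from this notebook and wrap the result in a batch before calling `model.predict`.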
