# Sentiment Classification


### Generate word embeddings and retrieve the outputs of each layer with Keras on a classification task

Word embeddings are a type of word representation that allows words with similar meaning to have a similar representation.

It is a distributed representation for text that is perhaps one of the key breakthroughs for the impressive performance of deep learning methods on challenging natural language processing problems.

We will use the imdb dataset to learn word embeddings as we train our model. This dataset contains 25,000 training and 25,000 test movie reviews from IMDB, labeled with sentiment (positive or negative).



### Dataset

`from keras.datasets import imdb`

Dataset of 25,000 training and 25,000 test movie reviews from IMDB, labeled by sentiment (positive/negative). Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers). For convenience, the words are indexed by their frequency in the dataset, meaning the word that has index 1 is the most frequent word. To speed up training, the code below uses at most 300 words from each review (`maxlen = 300`) and a maximum vocabulary size of 10,000.

As a convention, "0" does not stand for a specific word. In the arrays loaded below it is the padding value, "1" marks the start of a review, and "2" encodes any unknown (out-of-vocabulary) word.
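As a rough illustration of this encoding, the sketch below ranks the words of a tiny made-up corpus by frequency and encodes a sentence with them. The corpus, the helper names (`build_word_index`, `encode`) and the reserved ids are assumptions chosen to mirror the defaults of `imdb.load_data` (start marker 1, unknown word 2, offset 3); it is not the Keras implementation.

```python
from collections import Counter

START, UNKNOWN, INDEX_FROM = 1, 2, 3  # assumed reserved ids and offset

def build_word_index(corpus):
    """Rank words by frequency: the most frequent word gets rank 1."""
    counts = Counter(word for review in corpus for word in review.split())
    # Sort by descending count, then alphabetically so ties are deterministic
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ranked, start=1)}

def encode(review, word_index, num_words):
    """Encode a review as ids, replacing rare or unseen words with UNKNOWN."""
    ids = [START]
    for word in review.split():
        rank = word_index.get(word)
        idx = rank + INDEX_FROM if rank is not None else None
        ids.append(idx if idx is not None and idx < num_words else UNKNOWN)
    return ids

corpus = ["the film was great", "the film was dull", "the plot was thin"]
word_index = build_word_index(corpus)
encoded = encode("the film was superb", word_index, num_words=6)
print(word_index["the"], encoded)
```

With `num_words=6` only the two most frequent words keep their own id, so both the rarer word "film" and the unseen word "superb" collapse to the unknown id.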


### Aim

1. Import test and train data  
2. Import the labels (train and test)
3. Get the word index and then create key-value pairs for word and word_id. (12.5 points)
4. Build a Sequential Model using Keras for Sentiment Classification task. (10 points)
5. Report the Accuracy of the model. (5 points)  
6. Retrieve the output of each layer in Keras for a given single test sample from the trained model you built. (2.5 points)


#### Usage:
1. Import test and train data 
2. Import the labels (train and test)

In [1]:
from keras.datasets import imdb

Using TensorFlow backend.


In [2]:
from keras.preprocessing.sequence import pad_sequences

vocab_size = 10000  # vocabulary size
maxlen = 300  # number of words used from each review

In [3]:
#load dataset as a list of ints
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)

#make all sequences of the same length
x_train = pad_sequences(x_train, maxlen=maxlen)
x_test =  pad_sequences(x_test, maxlen=maxlen)

In [4]:
print ("x_train shape: ", x_train.shape)
print ("y_train shape: ", y_train.shape)
print ("x_test shape: ", x_test.shape)
print ("y_test shape: ", y_test.shape)

x_train shape:  (25000, 300)
y_train shape:  (25000,)
x_test shape:  (25000, 300)
y_test shape:  (25000,)
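The padding step can be pictured with a small pure-Python sketch. It assumes the default behaviour of `pad_sequences` (zeros added at the front, and over-long sequences cut from the front), which would explain why short reviews start with a run of zeros. `pad_pre` is a hypothetical helper written for illustration, not the Keras function.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences; keep only the last `maxlen` items of long ones."""
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]  # truncate from the front
        padded.append([value] * (maxlen - len(kept)) + kept)  # pad at the front
    return padded

toy = [[5, 8], [1, 2, 3, 4, 5, 6]]
print(pad_pre(toy, maxlen=4))
```

Every row ends up with exactly `maxlen` entries, which is why both `x_train` and `x_test` have 300 columns.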


In [5]:
print("Maximum value of a word index ")
print(max([max(sequence) for sequence in x_train]))
print("Maximum length num words of review in train ")
print(max([len(sequence) for sequence in x_train]))

Maximum value of a word index 
9999
Maximum length num words of review in train 
300


#### 3. Get the word index and then create key-value pairs for word and word_id. (12.5 points)

In [44]:
# Make Word to ID dictionary
INDEX_FROM=3   # word index offset
word_to_id = imdb.get_word_index() #Get the word index
word_to_id = {k:(v+INDEX_FROM) for k,v in word_to_id.items()}
word_to_id["[PAD]"] = 0
#word_to_id[""] = 0
word_to_id["[🏃]"] = 1 # START
word_to_id["[❓]"] = 2 # UNKNOWN

# Make ID to Word dictionary
id_to_word = {value:key for key,value in word_to_id.items()}

def restore_original_text(imdb_x_array):
    # Join the word for each id; `word_id` avoids shadowing the built-in `id`
    return ' '.join(id_to_word[word_id] for word_id in imdb_x_array)
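The offset of 3 is there because the ranks in the word index start at 1, while the encoded reviews reserve ids 0, 1 and 2 for padding, start and unknown. The sketch below replays the same construction on a three-word index. The `raw_index` dictionary is a made-up stand-in for `imdb.get_word_index()`, and the `.get` fallback in `decode` is an optional safety net that the cell above does not use.

```python
INDEX_FROM = 3  # same offset as in the cell above

# Hypothetical raw index, standing in for imdb.get_word_index()
raw_index = {"the": 1, "film": 2, "great": 3}

# Shift every rank by the offset, then add the three reserved tokens
word_to_id = {word: rank + INDEX_FROM for word, rank in raw_index.items()}
word_to_id.update({"[PAD]": 0, "[START]": 1, "[UNK]": 2})
id_to_word = {idx: word for word, idx in word_to_id.items()}

def decode(ids):
    """Map ids back to words, falling back to [UNK] for ids missing from the table."""
    return " ".join(id_to_word.get(i, "[UNK]") for i in ids)

print(decode([0, 0, 1, 4, 5, 2, 6]))
```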

In [45]:
id_to_word

{34704: 'fawn',
 52009: 'tsukino',
 52010: 'nunnery',
 16819: 'sonja',
 63954: 'vani',
 1411: 'woods',
 16118: 'spiders',
 2348: 'hanging',
 2292: 'woody',
 52011: 'trawling',
 52012: "hold's",
 11310: 'comically',
 40833: 'localized',
 30571: 'disobeying',
 52013: "'royale",
 40834: "harpo's",
 52014: 'canet',
 19316: 'aileen',
 52015: 'acurately',
 52016: "diplomat's",
 25245: 'rickman',
 6749: 'arranged',
 52017: 'rumbustious',
 52018: 'familiarness',
 52019: "spider'",
 68807: 'hahahah',
 52020: "wood'",
 40836: 'transvestism',
 34705: "hangin'",
 2341: 'bringing',
 40837: 'seamier',
 34706: 'wooded',
 52021: 'bravora',
 16820: 'grueling',
 1639: 'wooden',
 16821: 'wednesday',
 52022: "'prix",
 34707: 'altagracia',
 52023: 'circuitry',
 11588: 'crotch',
 57769: 'busybody',
 52024: "tart'n'tangy",
 14132: 'burgade',
 52026: 'thrace',
 11041: "tom's",
 52028: 'snuggles',
 29117: 'francesco',
 52030: 'complainers',
 52128: 'templarios',
 40838: '272',
 52031: '273',
 52133: 'zaniacs',

In [46]:
# Let's decode and check some train values.
x_train[10]

array([   6,  346,  137,   11,    4, 2768,  295,   36, 7740,  725,    6,
       3208,  273,   11,    4, 1513,   15, 1367,   35,  154,    2,  103,
          2,  173,    7,   12,   36,  515, 3547,   94, 2547, 1722,    5,
       3547,   36,  203,   30,  502,    8,  361,   12,    8,  989,  143,
          4, 1172, 3404,   10,   10,  328, 1236,    9,    6,   55,  221,
       2989,    5,  146,  165,  179,  770,   15,   50,  713,   53,  108,
        448,   23,   12,   17,  225,   38,   76, 4397,   18,  183,    8,
         81,   19,   12,   45, 1257,    8,  135,   15,    2,  166,    4,
        118,    7,   45,    2,   17,  466,   45,    2,    4,   22,  115,
        165,  764, 6075,    5, 1030,    8, 2973,   73,  469,  167, 2127,
          2, 1568,    6,   87,  841,   18,    4,   22,    4,  192,   15,
         91,    7,   12,  304,  273, 1004,    4, 1375, 1172, 2768,    2,
         15,    4,   22,  764,   55, 5773,    5,   14, 4233, 7444,    4,
       1375,  326,    7,    4, 4760, 1786,    8,  3

In [47]:
restore_original_text(x_train[10])

"a short while in the cell together they stumble upon a hiding place in the wall that contains an old [❓] after [❓] part of it they soon realise its magical powers and realise they may be able to use it to break through the prison walls br br black magic is a very interesting topic and i'm actually quite surprised that there aren't more films based on it as there's so much scope for things to do with it it's fair to say that [❓] makes the best of it's [❓] as despite it's [❓] the film never actually feels restrained and manages to flow well throughout director eric [❓] provides a great atmosphere for the film the fact that most of it takes place inside the central prison cell [❓] that the film feels very claustrophobic and this immensely benefits the central idea of the prisoners wanting to use magic to break out of the cell it's very easy to get behind them it's often said that the unknown is the thing that really [❓] people and this film proves that as the director [❓] that we can nev

In [55]:
# Let's decode and check some test values.
x_test[0]

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,   

In [54]:
restore_original_text(x_test[0])

"[PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PA

## Build Keras Embedding Layer Model
We can think of the Embedding layer as a dictionary that maps the index assigned to a word to a word vector. This layer is very flexible and can be used in a few ways:

* The embedding layer can be used at the start of a larger deep learning model.
* We could also load pre-trained word embeddings into the embedding layer when we create our model.
* We can use the embedding layer to train our own word2vec-style embeddings.

The Keras embedding layer doesn't require us to one-hot encode our words; instead we have to give each word a unique integer id. For the imdb dataset we've loaded this has already been done, but if this wasn't the case we could use sklearn [LabelEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html).
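To see why integer ids are enough, the following toy sketch compares a direct row lookup with the equivalent one-hot multiplication. The 4 x 3 table and its values are made up for illustration; a real embedding layer learns its table during training.

```python
# Toy embedding table: 4 word ids, 3 dimensions each (made-up values)
embedding = [
    [0.0, 0.0, 0.0],  # id 0
    [0.1, 0.2, 0.3],  # id 1
    [0.4, 0.5, 0.6],  # id 2
    [0.7, 0.8, 0.9],  # id 3
]

def lookup(word_id):
    """What an embedding layer does conceptually: pick the row for this id."""
    return embedding[word_id]

def one_hot_matmul(word_id):
    """Equivalent but wasteful: one-hot vector times the embedding matrix."""
    one_hot = [1.0 if i == word_id else 0.0 for i in range(len(embedding))]
    dims = len(embedding[0])
    return [sum(one_hot[i] * embedding[i][d] for i in range(len(embedding)))
            for d in range(dims)]

sentence = [2, 3, 1]
vectors = [lookup(i) for i in sentence]
print(vectors)
```

Both functions return the same vector for every id, but the lookup never builds a vocabulary-sized one-hot vector, which matters with 10,000 words.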

### 4. Build a Sequential Model using Keras for Sentiment Classification task. (10 points)

In [20]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import backend as K  # same Keras as the model built below
import numpy as np
#from keras.layers import LSTM, Dense, Dropout, Embedding, Masking, Bidirectional, GlobalAveragePooling1D

tf.set_random_seed(100)

model = keras.Sequential()
model.add(keras.layers.Embedding(vocab_size, 16))
model.add(keras.layers.GlobalAveragePooling1D())
model.add(keras.layers.Dense(64, activation='relu'))

# Dropout for regularization
model.add(keras.layers.Dropout(0.5))
model.add(keras.layers.Dense(32, activation="relu"))
model.add(keras.layers.Dense(16, activation="relu"))
model.add(keras.layers.Dense(1, activation="sigmoid"))

model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, None, 16)          160000    
_________________________________________________________________
global_average_pooling1d_1 ( (None, 16)                0         
_________________________________________________________________
dense_3 (Dense)              (None, 64)                1088      
_________________________________________________________________
dropout_1 (Dropout)          (None, 64)                0         
_________________________________________________________________
dense_4 (Dense)              (None, 32)                2080      
_________________________________________________________________
dense_5 (Dense)              (None, 16)                528       
_________________________________________________________________
dense_6 (Dense)              (None, 1)                 17        
Total para
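The per-layer parameter counts in the summary can be checked by hand. The sketch below recomputes them from the layer sizes used in the model; the helper names are illustrative, and the total is simply the sum of the per-layer counts listed above.

```python
def embedding_params(vocab_size, dim):
    return vocab_size * dim  # one vector of length `dim` per word id

def dense_params(n_in, n_out):
    return n_in * n_out + n_out  # weights plus one bias per unit

counts = {
    "embedding": embedding_params(10000, 16),
    "dense_64": dense_params(16, 64),
    "dense_32": dense_params(64, 32),
    "dense_16": dense_params(32, 16),
    "dense_1": dense_params(16, 1),
}
# Pooling and dropout layers have no weights, so they contribute nothing
total = sum(counts.values())
print(counts, total)
```

Almost all of the parameters sit in the embedding table, so the vocabulary size and embedding dimension dominate the model size.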

In [21]:
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['acc'])

In [22]:
history = model.fit(x_train,
                    y_train,
                    epochs=20,
                    batch_size=512,
                    validation_data=(x_test, y_test),
                    verbose=1)



Train on 25000 samples, validate on 25000 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


### 5. Report the Accuracy of the model. (5 points) 

In [29]:
print('\nModel Performance: Log Loss and Accuracy on train data')
print(model.evaluate(x_train, y_train, batch_size = 20))
print('\nModel Performance: Log Loss and Accuracy on validation data')
print(model.evaluate(x_test, y_test, batch_size = 20))


Model Performance: Log Loss and Accuracy on train data
[0.1387550181351602, 0.9538399930953979]

Model Performance: Log Loss and Accuracy on validation data
[0.3233553539603949, 0.8763999971866607]
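Each list printed above holds two numbers: the loss the model was compiled with (`binary_crossentropy`) followed by the accuracy metric. As a minimal sketch of what those two quantities measure, the functions below compute them in plain Python on four made-up predictions. The clipping constant `eps` and the 0.5 threshold are assumptions for this illustration.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded probability matches the label."""
    hits = sum(int(p > threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)

y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.6]
print(binary_crossentropy(y_true, y_prob), accuracy(y_true, y_prob))
```

The loss penalises confident mistakes more heavily than hesitant ones, while accuracy only counts which side of the threshold each prediction falls on.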


In [53]:
# Predict on 100 randomly sampled test reviews and tabulate the results
import random
import pandas as pd

original_x_test = x_test

def restore_original_text(imdb_x_array):
    return (' '.join(id_to_word[id] for id in imdb_x_array ))

def imdb_class_to_str(imdb_class):
    if imdb_class == 0:
        return 'negative'
    return 'positive'

right = 0
mistake = 0

index_list = []
original_text_list = []
pred_prob_list = []
pred_class_list = []
y_test_list = []
fail_str_list = []

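for i in range(100):
    # randrange excludes the upper bound, so the index is always valid
    index = random.randrange(len(x_test))
    
    # Predicted probability of the positive class for a single review
    pred_prob = model.predict(x_test[index:(index+1)])[0][0] 
    # Threshold the sigmoid output at 0.5 to get the predicted class
    pred_class = int(pred_prob > 0.5)
    
    '''
    print('pred_prob:', pred_prob)
    print('pred_class:', pred_class)
    print('y_test[index] :', y_test[index])
    '''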
    fail_str = '' 
    
    if y_test[index] == pred_class:
        right += 1
    else:
        mistake += 1
        fail_str = 'Fail'
        
    original_text = restore_original_text(original_x_test[index])

    index_list.append(index)
    original_text_list.append(original_text)
    pred_prob_list.append(pred_prob)
    pred_class_list.append(imdb_class_to_str(pred_class))
    y_test_list.append(imdb_class_to_str(y_test[index]))
    fail_str_list.append(fail_str)

print("right : ", right)
print("mistake : ", mistake)
print("accuracy:", right/(right+mistake))

df = pd.DataFrame({'index': index_list, 
                   'x_test_original_text': original_text_list, 
                   'probability': pred_prob_list, 
                   'pred_class': pred_class_list,
                   'y_test': y_test_list,
                   'is_fail': fail_str_list
                  })

df[['index', 'x_test_original_text','probability','pred_class','y_test','is_fail']]

right :  95
mistake :  5
accuracy: 0.95


| | index | x_test_original_text | probability | pred_class | y_test | is_fail |
|---|---|---|---|---|---|---|
| 0 | 16001 | [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD... | 0.950951 | positive | positive | |
| 1 | 2593 | [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD... | 0.008440 | negative | negative | |
| 2 | 13786 | [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD... | 0.014839 | negative | negative | |
| 3 | 3038 | [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD... | 0.987743 | positive | positive | |
| 4 | 15828 | [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD... | 0.020465 | negative | negative | |
| 5 | 14472 | [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD... | 0.823340 | positive | positive | |
| 6 | 2479 | [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD... | 0.050895 | negative | negative | |
| 7 | 12321 | [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD... | 0.596800 | positive | negative | Fail |
| 8 | 14374 | [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD... | 0.020176 | negative | negative | |
| 9 | 334 | several town members she [❓] as though she had... | 0.028556 | negative | negative | |


### 6. Retrieve the output of each layer in Keras for a given single test sample from the trained model you built. (2.5 points)

In [25]:
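#Define function to get output of all the layers in the model for specific test input
# K.function runs in the session Keras already manages, where the trained weights live.
# Wrapping this in a fresh tf.Session and running tf.global_variables_initializer()
# risks evaluating re-initialized (untrained) weights, so it is avoided here.
def getLayerOutput(layer):
    # Build a backend function from the model input to this layer's output
    get_Layer_Output = K.function([model.layers[0].input], [layer.output])
    return get_Layer_Output([x_test[0:1,]])[0]

layer_output = []

for layer in model.layers:
    layer_output.append(getLayerOutput(layer))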

In [26]:
#Get the count of outputs. It should be equal to the number of layers
len(layer_output)

7

In [27]:
#Check all the outputs
layer_output

[array([[[0.5553787 , 0.18011284, 0.6420567 , ..., 0.55895984,
          0.64191055, 0.5879252 ],
         [0.5553787 , 0.18011284, 0.6420567 , ..., 0.55895984,
          0.64191055, 0.5879252 ],
         [0.5553787 , 0.18011284, 0.6420567 , ..., 0.55895984,
          0.64191055, 0.5879252 ],
         ...,
         [0.38840163, 0.52240217, 0.18018687, ..., 0.7324097 ,
          0.08540547, 0.76900613],
         [0.43940055, 0.29121077, 0.47865772, ..., 0.3135028 ,
          0.57172513, 0.8048034 ],
         [0.6175225 , 0.5789665 , 0.5236012 , ..., 0.6587695 ,
          0.22243512, 0.33174098]]], dtype=float32),
 array([[0.54188764, 0.25476873, 0.60314345, 0.5315438 , 0.81225485,
         0.28845206, 0.14934397, 0.41312903, 0.15780509, 0.30540037,
         0.11417412, 0.3650676 , 0.14723535, 0.5278066 , 0.5997043 ,
         0.5833638 ]], dtype=float32),
 array([[0.12634443, 0.        , 0.        , 0.        , 0.684017  ,
         0.26375854, 0.06051523, 0.07846066, 0.19827874, 0.      

In [28]:
#Get the specific layer output
layer_output[3]

array([[0.12634443, 0.        , 0.        , 0.        , 0.684017  ,
        0.26375854, 0.06051523, 0.07846066, 0.19827874, 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.3097297 ,
        0.19487377, 0.        , 0.        , 0.70889115, 0.        ,
        0.09870214, 0.5269151 , 0.        , 0.        , 0.        ,
        0.04202384, 0.        , 0.29322898, 0.09015664, 0.04474584,
        0.        , 0.42819998, 0.07366087, 0.39347255, 0.        ,
        0.        , 0.        , 0.39152402, 0.        , 0.        ,
        0.41031033, 0.23007463, 0.        , 0.23913945, 0.        ,
        0.6136151 , 0.        , 0.        , 0.        , 0.23703943,
        0.        , 0.        , 0.0378878 , 0.18176587, 0.03147261,
        0.        , 0.32030234, 0.        , 0.        , 0.1992329 ,
        0.        , 0.1060954 , 0.        , 0.04277664]], dtype=float32)
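To make the idea of "one output per layer" concrete, here is a toy forward pass in plain Python that keeps every intermediate result. The embedding table, weights and bias are made-up values, and the three stand-in layers are far smaller than the model above. Unlike the `K.function` approach, which evaluates each layer from the model input, this sketch simply chains the layers and records what flows between them.

```python
def average_pool(vectors):
    """Average the word vectors position by position (global average pooling)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def dense_relu(x, weights, bias):
    """Fully connected layer followed by ReLU; weights[j] holds the inputs to unit j."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]

# Made-up parameters: three word ids and two hidden units
embedding = {0: [0.0, 0.0], 1: [1.0, 3.0], 2: [3.0, 1.0]}
weights, bias = [[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0]

layers = [
    ("embedding", lambda ids: [embedding[i] for i in ids]),
    ("pooling", average_pool),
    ("dense", lambda x: dense_relu(x, weights, bias)),
]

outputs, x = [], [1, 2]  # x starts as the encoded review
for name, layer in layers:
    x = layer(x)  # feed the previous output forward
    outputs.append((name, x))  # keep every intermediate result

for name, value in outputs:
    print(name, value)
```

The ReLU clamps negative pre-activations to zero, which is consistent with the many exact zeros in the dense-layer outputs shown above.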