# Project: Sentiment Classification (imdb reviews)

# By: Krishna Kant Kaushal

##### Python version used: Python 3.7.3

### 🥏Problem Description:
Generate word embeddings and retrieve the outputs of each layer with Keras for a classification task. Word embeddings are a type of word representation that allows words with similar meanings to have similar representations.

They are a distributed representation for text and are perhaps one of the key breakthroughs behind the impressive performance of deep learning methods on challenging natural language processing problems.

We will use the IMDb dataset to learn word embeddings as we train our model. As loaded through Keras, this dataset contains 25,000 training and 25,000 test movie reviews from IMDb, each labeled with a sentiment (positive or negative).

### 🥏Data Description:
The dataset consists of movie reviews from IMDb (25,000 for training and 25,000 for testing), labeled by sentiment (positive/negative). Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers). For convenience, the words are indexed by their frequency in the dataset, meaning the word that has index 1 is the most frequent word. We use the last 20 words from each review to speed up training, with a maximum vocabulary size of 10,000.

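As a convention, "0" does not stand for a specific word. In the encoded reviews returned by `imdb.load_data` with its default settings, 0 is the padding value, 1 marks the start of a review, 2 encodes any unknown (out-of-vocabulary) word, and the frequency ranks of actual words are shifted up by 3.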

## 🥏Import the required libraries

In [None]:
# for array manipulation
import numpy as np

# for keras models
from keras import models

# for individual layers in the sequential model
from keras import layers

# for loading imdb data
from keras.datasets import imdb

# to make all sequences in a list have same length
from keras.preprocessing.sequence import pad_sequences

# for retrieving the output of each layers
from keras import backend as K

Using TensorFlow backend.


## 🥏1. Import test and train data & 
## 🥏2. Import the labels (train and test) 

In [None]:
vocab_size = 10000 #vocab size
#number of word used from each review
maxlen = 20  # Using last 20 words from each review to speed up training.

# load dataset as a list of ints
(training_data, training_targets), (testing_data, testing_targets) = imdb.load_data(num_words=vocab_size)

In [None]:
# make all sequences of the same length
training_data = pad_sequences(training_data, maxlen=maxlen)
testing_data =  pad_sequences(testing_data, maxlen=maxlen)
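
With its default settings, `pad_sequences` truncates and pads at the front of each sequence, which is why the last `maxlen` words of a review are the ones that survive. The pure-Python sketch below mimics that default behaviour on toy data; `pad_pre` is a hypothetical helper written for illustration, not part of Keras.

```python
def pad_pre(sequence, maxlen, value=0):
    """Keep the last `maxlen` items; left-pad shorter sequences with `value`."""
    trimmed = list(sequence)[-maxlen:]            # drop items from the front
    padding = [value] * (maxlen - len(trimmed))   # padding goes in front
    return padding + trimmed

long_review = [11, 12, 13, 14, 15, 16]   # longer than maxlen
short_review = [21, 22]                  # shorter than maxlen

padded_long = pad_pre(long_review, maxlen=4)
padded_short = pad_pre(short_review, maxlen=4)

print(padded_long)   # [13, 14, 15, 16]
print(padded_short)  # [0, 0, 21, 22]
```

Every review therefore ends up with exactly `maxlen` entries, which matches the average length of 20.0 and standard deviation of 0.0 reported in the exploration step.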

In [None]:
data = np.concatenate((training_data, testing_data), axis=0)
targets = np.concatenate((training_targets, testing_targets), axis=0)

#### Exploring the Data:

In [None]:
print("Categories:", np.unique(targets))
print("Number of unique words:", len(np.unique(np.hstack(data))))


length = [len(i) for i in data]
print("Average Review length:", np.mean(length))
print("Standard Deviation:", round(np.std(length)))

Categories: [0 1]
Number of unique words: 9858
Average Review length: 20.0
Standard Deviation: 0.0


Single training example - first review of the dataset, which is labeled as positive (1).

In [None]:
print("Label:", targets[0])
print("\n",data[0])

Label: 1

 [  65   16   38 1334   88   12   16  283    5   16 4472  113  103   32
   15   16 5345   19  178   32]


The code below retrieves the word index, builds the reverse mapping from word indices back to the original words, and uses it to decode the first review.

In [None]:
index = imdb.get_word_index()
reverse_index = dict([(value, key) for (key, value) in index.items()]) 
decoded = " ".join( [reverse_index.get(i, "#") for i in data[0]] )
print(decoded) 

their with her nobody most that with wasn't to with armed acting watch an for with heartfelt film want an
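
The decoded text above reads oddly partly because `imdb.load_data` shifts every word index by 3 by default (`index_from=3`), reserving 0, 1 and 2 for padding, start-of-review and unknown words, while `get_word_index()` returns the unshifted ranks. A minimal sketch of decoding with that offset follows; `toy_word_index` is a made-up three-word stand-in for the real dictionary, and `decode` is a hypothetical helper.

```python
# Toy stand-in for imdb.get_word_index(); the real dictionary is far larger.
toy_word_index = {"the": 1, "and": 2, "film": 3}

INDEX_FROM = 3  # default offset applied by imdb.load_data

# Map rank -> word, the same inversion the notebook performs.
reverse_toy_index = {idx: word for word, idx in toy_word_index.items()}

def decode(encoded_review):
    # Ids 0, 1 and 2 are reserved, so after subtracting the offset
    # they fall outside the dictionary and are shown as "#".
    return " ".join(reverse_toy_index.get(i - INDEX_FROM, "#") for i in encoded_review)

decoded_toy = decode([1, 4, 6, 2, 5])
print(decoded_toy)  # # the film # and
```

Applied to the real data, the same idea would mean looking up `i - 3` instead of `i` in `reverse_index`.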


Data preparation:

In [None]:
def vectorize(sequences, dimension = 10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1
    return results
 
data = vectorize(data)
targets = np.array(targets).astype("float32")
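
To see what `vectorize` produces, the sketch below runs the same function on two tiny made-up sequences with a vocabulary of only 5 ids. Each row becomes a multi-hot vector: a 1 at every index that occurs in the sequence, regardless of order or repetition.

```python
import numpy as np

def vectorize(sequences, dimension):
    # One row per sequence, one column per possible word id.
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1  # fancy indexing sets all listed columns at once
    return results

# Two toy "reviews" of three word ids each (illustrative values only).
toy_sequences = np.array([[0, 2, 2], [1, 3, 4]])
encoded = vectorize(toy_sequences, dimension=5)

print(encoded.tolist())
# [[1.0, 0.0, 1.0, 0.0, 0.0], [0.0, 1.0, 0.0, 1.0, 1.0]]
```

Note that the repeated id 2 in the first sequence still yields a single 1, so word counts and word order are discarded by this encoding.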

Split our data into training and testing sets.

In [None]:
test_x = data[:10000]
test_y = targets[:10000]
train_x = data[10000:]
train_y = targets[10000:]

In [None]:
test_x

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 1., 0., 0.]])

In [None]:
print(test_x.shape, train_x.shape)
print(test_y.shape, train_y.shape)

(10000, 10000) (40000, 10000)
(10000,) (40000,)


## 🥏3. Get the word index and create a key-value pair for word and word_id

In [None]:
# Get the word index
index = imdb.get_word_index()
# To print word index, remove comment (#) from below line
# print(index)

In [None]:
# create a key-value pair for word and word_id
reverse_index = dict([(value, key) for (key, value) in index.items()])

# To print 'key-value' pair for 'word' and 'word_id', remove comment (#) from below line
# print(reverse_index)

## 🥏4. Build a Sequential Model using Keras for the Sentiment Classification task 

#### Build Keras Embedding Layer Model
We can think of the Embedding layer as a dictionary that maps the index assigned to a word to a word vector. This layer is very flexible and can be used in a few ways:

* The embedding layer can be used at the start of a larger deep learning model.
* We can also load pre-trained word embeddings into the embedding layer when we create our model.
* We can use the embedding layer to train our own word2vec-style models.

The Keras embedding layer doesn't require us to one-hot encode our words; instead, we have to give each word a unique integer as an id. For the IMDb dataset we've loaded this has already been done, but if this weren't the case we could use sklearn's [LabelEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html). Note that the model built below feeds the multi-hot vectors prepared earlier into Dense layers rather than into an Embedding layer.
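
The "dictionary" view of an embedding layer can be illustrated with plain NumPy: the layer's weights form a matrix with one row per word id, and looking up a word is just row indexing. This is a conceptual sketch with random weights and made-up sizes, not the Keras implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

toy_vocab_size = 6   # number of distinct word ids (illustrative)
embedding_dim = 3    # length of each word vector (illustrative)

# One row per word id; in a real model these values are learned.
embedding_matrix = rng.normal(size=(toy_vocab_size, embedding_dim))

# A "review" of three word ids; id 4 appears twice.
word_ids = np.array([4, 1, 4])

# Lookup is integer indexing into the matrix.
word_vectors = embedding_matrix[word_ids]

print(word_vectors.shape)  # (3, 3)
```

Because the lookup depends only on the id, both occurrences of id 4 receive the same vector.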

In [None]:
model = models.Sequential()

In [None]:
# Input - Layer
model.add(layers.Dense(50, activation = "relu", input_shape=(10000, )))

# Hidden - Layers
model.add(layers.Dropout(0.3, noise_shape=None, seed=None))
model.add(layers.Dense(50, activation = "relu"))
          
model.add(layers.Dropout(0.2, noise_shape=None, seed=None))
model.add(layers.Dense(50, activation = "relu"))

# Output- Layer
model.add(layers.Dense(1, activation = "sigmoid"))
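
The Dropout layers above only alter activations during training; at inference time they pass their input through unchanged. The NumPy sketch below illustrates the usual "inverted dropout" scheme (zero a fraction of units, scale the rest by `1 / (1 - rate)`). The `dropout` function is a hypothetical illustration, not Keras code.

```python
import numpy as np

def dropout(x, rate, training, rng):
    if not training:
        return x  # inference: identity
    # Keep each unit with probability (1 - rate).
    keep_mask = rng.random(x.shape) >= rate
    # Scale surviving units so the expected activation is unchanged.
    return x * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
activations = np.ones(10)

train_out = dropout(activations, rate=0.3, training=True, rng=rng)
infer_out = dropout(activations, rate=0.3, training=False, rng=rng)

print(train_out)  # a mix of 0.0 and about 1.4286
print(infer_out)  # all ones
```

This is consistent with the layer outputs retrieved in step 6, where the output of the first Dropout layer repeats the output of the preceding Dense layer.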

In [None]:
# model summary
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_1 (Dense)              (None, 50)                500050    
_________________________________________________________________
dropout_1 (Dropout)          (None, 50)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 50)                2550      
_________________________________________________________________
dropout_2 (Dropout)          (None, 50)                0         
_________________________________________________________________
dense_3 (Dense)              (None, 50)                2550      
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 51        
=================================================================
Total params: 505,201
Trainable params: 505,201
Non-trainable params: 0
_________________________________________________________________
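
The parameter counts in the summary can be checked by hand: a Dense layer has one weight per input-unit pair plus one bias per unit, and Dropout layers have no parameters. A short sketch of that arithmetic, where `dense_params` is a helper name chosen for illustration:

```python
def dense_params(n_inputs, n_units):
    # weights (n_inputs * n_units) plus one bias per unit
    return n_inputs * n_units + n_units

# (inputs, units) for dense_1 .. dense_4 as listed in the summary
layer_sizes = [(10000, 50), (50, 50), (50, 50), (50, 1)]

per_layer = [dense_params(n_in, n_out) for n_in, n_out in layer_sizes]
total_params = sum(per_layer)

print(per_layer)     # [500050, 2550, 2550, 51]
print(total_params)  # 505201
```

Almost all of the parameters sit in the first layer, because it connects all 10,000 input columns to 50 units.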

In [None]:
# Compiling the model
model.compile(
 optimizer = "adam",
 loss = "binary_crossentropy",
 metrics = ["accuracy"]
)
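
The `binary_crossentropy` loss chosen above penalises confident wrong predictions much more heavily than confident correct ones. The pure-Python sketch below computes the standard formula, the mean of `-(y*log(p) + (1-y)*log(1-p))`, on made-up predictions; the clipping constant `eps` is an assumption added here to avoid `log(0)`.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # keep p strictly inside (0, 1)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

# Same labels, predictions on the right side vs. the wrong side.
confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])

print(round(confident_right, 4))  # 0.1054
print(round(confident_wrong, 4))  # 2.3026
```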

In [None]:
# Fitting the model
results = model.fit(
 train_x, train_y,
 epochs= 2,
 batch_size = 500,
 validation_data = (test_x, test_y)
)

Train on 40000 samples, validate on 10000 samples
Epoch 1/2
Epoch 2/2


## 🥏5. Report the Accuracy of the model

In [None]:
print("Accuracy of the model is :", 100*np.mean(results.history["accuracy"]),"%")

Accuracy of the model is : 73.67249727249146 %


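<b>Note</b>: 
> 1. The figure printed above is the mean training accuracy over the two epochs (`results.history["accuracy"]`), not the test accuracy; the accuracy on the held-out data is stored under `results.history["val_accuracy"]`.
> 2. The accuracy is fairly low because only 20 words of each review are used.
> 3. Increasing the number of words per review may well increase the accuracy.
> 4. Increasing the number of epochs may also increase the accuracy.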
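
The print statement above averages `results.history["accuracy"]`, which holds the training accuracy of each epoch. Since `validation_data` was passed to `fit`, the history should also contain a `"val_accuracy"` list (an assumption based on the `"accuracy"` key used in this notebook). The sketch below shows the difference using a hypothetical history dictionary; the numbers are placeholders, not results from this run.

```python
# Hypothetical stand-in for results.history; values are placeholders only.
toy_history = {
    "accuracy": [0.5, 0.75],       # training accuracy per epoch
    "val_accuracy": [0.5, 0.625],  # accuracy on validation_data per epoch
}

# What the notebook reports: mean training accuracy across epochs.
mean_train_acc = sum(toy_history["accuracy"]) / len(toy_history["accuracy"])

# A more usual figure to report: validation accuracy after the last epoch.
final_val_acc = toy_history["val_accuracy"][-1]

print("Mean training accuracy:", 100 * mean_train_acc, "%")
print("Final validation accuracy:", 100 * final_val_acc, "%")
```

With the real `results` object, the equivalent lookup would be `results.history["val_accuracy"][-1]`.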

#### Let's check the accuracy with 200 words per review

In [None]:
maxlen = 200  # Using the last 200 words from each review

# load dataset as a list of ints
(training_data, training_targets), (testing_data, testing_targets) = imdb.load_data(num_words=vocab_size)

# make all sequences of the same length
training_data = pad_sequences(training_data, maxlen=maxlen)
testing_data =  pad_sequences(testing_data, maxlen=maxlen)

data = np.concatenate((training_data, testing_data), axis=0)
targets = np.concatenate((training_targets, testing_targets), axis=0)

# Vectorize every review into a vector of exactly 10,000 numbers (1 where a word id occurs, 0 elsewhere).
def vectorize(sequences, dimension = 10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1
    return results
 
data = vectorize(data)
targets = np.array(targets).astype("float32")

# Splitting the dataset in test and train sets
test_x = data[:10000]
test_y = targets[:10000]
train_x = data[10000:]
train_y = targets[10000:]

# Fitting the model
results = model.fit(
 train_x, train_y,
 epochs= 2,
 batch_size = 500,
 validation_data = (test_x, test_y)
)

# Accuracy of the model
print("\n\nAccuracy of the model is :", 100*np.mean(results.history["accuracy"]),"%")

Train on 40000 samples, validate on 10000 samples
Epoch 1/2
Epoch 2/2


Accuracy of the model is : 90.53249955177307 %


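Note:
> The training accuracy improves after increasing the number of words per review. Because `model.fit` is called again on the same `model` object, training continues from the weights learned in the previous run, so part of the gain may come from the additional epochs.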

#### Now let's check the accuracy with epochs = 5

In [None]:
# Fitting the model with epoch = 5
results = model.fit(
 train_x, train_y,
 epochs= 5,
 batch_size = 500,
 validation_data = (test_x, test_y)
)

# Accuracy of the model
print("\n\nAccuracy of the model is :", 100*np.mean(results.history["accuracy"]),"%")

Train on 40000 samples, validate on 10000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


Accuracy of the model is : 97.84649610519409 %


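Note:
> The training accuracy improves further after increasing the number of epochs from 2 to 5. This call also resumes training from the existing weights, and the figure printed is the mean training accuracy rather than the accuracy on the test set.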

## 🥏6. Retrieve the output of each layer in Keras for a single test sample from the trained model

Retrieve the output of each layer for the first test sample, i.e. test_x[:1,:].

In [None]:
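# enumerate from 1 so the printed numbers follow the order in model.summary()
for i, layer in enumerate(model.layers, start=1):
    # backend function mapping the model input to this layer's output
    keras_function = K.function([model.input], [layer.output])
    # pass exactly one array, matching the single input placeholder
    print("\nOutput of layer", i, "is:\n\n", keras_function([test_x[:1, :]]))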


Output of layer 1 is:

 [array([[0.6408298 , 0.53438866, 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.214016  , 0.        ,
        0.48909628, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.20184799, 0.        , 0.        , 0.        ,
        0.        , 0.00532168, 0.        , 0.        , 0.07950675,
        0.        , 0.07530031, 0.        , 0.84624875, 0.        ,
        0.        , 0.6949614 , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.22178644,
        0.        , 0.        , 0.        , 0.70583624, 0.27183208,
        0.        , 0.        , 0.        , 0.        , 0.        ]],
      dtype=float32)]

Output of layer 2 is:

 [array([[0.6408298 , 0.53438866, 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.214016  , 0.        ,
        0.48909628, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.20184799, 0.      

> <b>Note</b>: The output layer uses a sigmoid, so its value for input <b>test_x[:1,:]</b> is a probability between 0 and 1. Thresholded at 0.5, it should match the actual label, i.e. <b>test_y[:1]</b>.

Let's check that:

In [None]:
test_y[:1]

array([1.], dtype=float32)

> So for input <b>test_x[:1,:]</b>, the predicted label matches the actual label.
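
Comparing a sigmoid output with a 0/1 label requires a threshold, since the network emits a probability rather than an exact 0 or 1. A minimal sketch follows, assuming the conventional 0.5 cut-off; `to_label` is a hypothetical helper and `toy_probability` is a placeholder value, not the model's actual output.

```python
def to_label(probability, threshold=0.5):
    # Probabilities at or above the threshold count as positive (1.0).
    return 1.0 if probability >= threshold else 0.0

toy_probability = 0.93   # placeholder for the final layer's output
actual_label = 1.0       # test_y[:1] is array([1.]) in the cell above

predicted_label = to_label(toy_probability)
print(predicted_label == actual_label)  # True
```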