# Sentiment Classification

In [0]:
# Importing required libraries
import numpy as np
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Dropout, Embedding, LSTM, Bidirectional
from keras.preprocessing.sequence import pad_sequences

## Loading the dataset

##IMDB Movie reviews sentiment classification
Dataset of 25,000 movies reviews from IMDB, labeled by sentiment (positive/negative). Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers). For convenience, words are indexed by overall frequency in the dataset, so that for instance the integer "3" encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as: "only consider the top 10,000 most common words, but eliminate the top 20 most common words".

In [0]:
from keras.datasets import imdb

In [0]:
vocab_size = 10000 #vocab size
maxlen = 300  #number of word used from each review

In [6]:
#load dataset as a list of ints
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)

Downloading data from https://s3.amazonaws.com/text-datasets/imdb.npz


In [0]:
#make all sequences of the same length
x_train = pad_sequences(x_train, maxlen=maxlen)
x_test =  pad_sequences(x_test, maxlen=maxlen)

In [0]:
y_train = np.array(y_train)
y_test = np.array(y_test)

## Understanding the dataset

In [10]:
#check the train shape
x_train.shape

(25000, 300)

In [11]:
x_train[0]

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    1,   14,   22,   16,   43,  530,
        973, 1622, 1385,   65,  458, 4468,   66, 3941,    4,  173,   36,
        256,    5,   25,  100,   43,  838,  112,   50,  670,    2,    9,
         35,  480,  284,    5,  150,    4,  172,  112,  167,    2,  336,
        385,   39,    4,  172, 4536, 1111,   17,  546,   38,   13,  447,
          4,  192,   50,   16,    6,  147, 2025,   19,   14,   22,    4,
       1920, 4613,  469,    4,   22,   71,   87,   

In [12]:
y_train.shape

(25000,)

In [13]:
y_train[0:10]

array([1, 0, 0, 1, 0, 0, 1, 0, 1, 0])

In [14]:
x_test.shape 

(25000, 300)

In [15]:
y_test.shape

(25000,)

There are 25000 data for training and 25000 data records for testing 

## Building the model along with the embedding

The keras embedding layer doesn't require us to onehot encode our words, instead we have to give each word a unqiue intger number as an id. For the imdb dataset we've loaded this has already been done, but if this wasn't the case we could use sklearn LabelEncoder.

## Building an LSTM model

In [17]:
print('Build model...')
model = Sequential()
model.add(Embedding(vocab_size, 128))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))

# try using different optimizers and different optimizer configs
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

Build model...



Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.


Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where


In [23]:
model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, None, 128)         1280000   
_________________________________________________________________
lstm_1 (LSTM)                (None, 128)               131584    
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 129       
Total params: 1,411,713
Trainable params: 1,411,713
Non-trainable params: 0
_________________________________________________________________


In [0]:
batch_size = 32

In [20]:
print('Train...')
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=15,
          validation_data=(x_test, y_test))

Train...



Train on 25000 samples, validate on 25000 samples
Epoch 1/15





Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<keras.callbacks.History at 0x7f8d61f0e048>

The model has got good accuracy on the training. Let us see how it works on testing

In [21]:
score, acc = model.evaluate(x_test, y_test,
                            batch_size=batch_size)



In [22]:
print('Test score:', score)
print('Test accuracy:', acc)

Test score: 0.7138513833588361
Test accuracy: 0.86008


The testing accuracy is very less compared to training accuracy. This may be due to the fact that large number of training parameters or it may also be due to the fact that we chose smaller vocabulary.


## Retrive the output of each layer in keras for a given single test sample from the trained model you built

In [36]:
for idx, layer in enumerate(model.layers):
  print(model.layers[idx].output)

Tensor("embedding_1/embedding_lookup/Identity:0", shape=(?, ?, 128), dtype=float32)
Tensor("lstm_1/TensorArrayReadV3:0", shape=(?, 128), dtype=float32)
Tensor("dense_1/Sigmoid:0", shape=(?, 1), dtype=float32)


we are seeing the shape of the three tensors formed in our model. 

Let us build small models and predict the output of each layer.

In [33]:
from keras.models import Model

#model for layer1
intermediate_layer_model = Model(inputs=model.input,
                                 outputs=model.get_layer('embedding_1').output)
intermediate_output = intermediate_layer_model.predict(x_test[0])
print(intermediate_output.shape)
print(intermediate_output)

(300, 1, 128)
[[[ 0.10009428  0.06586564 -0.1270075  ...  0.05326356  0.01376145
   -0.07130247]]

 [[ 0.10009428  0.06586564 -0.1270075  ...  0.05326356  0.01376145
   -0.07130247]]

 [[ 0.10009428  0.06586564 -0.1270075  ...  0.05326356  0.01376145
   -0.07130247]]

 ...

 [[ 0.07268848  0.06812154 -0.10390531 ...  0.04641328 -0.00473515
    0.03089936]]

 [[ 0.11306102 -0.00706581 -0.11204792 ... -0.02783196 -0.02873449
   -0.11377687]]

 [[ 0.04546463  0.09044972  0.11785187 ... -0.10651184  0.11764654
   -0.09185338]]]


In [34]:
#model for layer2
intermediate_layer_model = Model(inputs=model.input,
                                 outputs=model.get_layer('lstm_1').output)
intermediate_output = intermediate_layer_model.predict(x_test[0])
print(intermediate_output.shape)
print(intermediate_output)

(300, 128)
[[ 0.00654792 -0.01969376 -0.06102571 ... -0.14492716 -0.01894907
  -0.02714133]
 [ 0.00654792 -0.01969376 -0.06102571 ... -0.14492716 -0.01894907
  -0.02714133]
 [ 0.00654792 -0.01969376 -0.06102571 ... -0.14492716 -0.01894907
  -0.02714133]
 ...
 [-0.01300291  0.0036605  -0.05254975 ... -0.09516702  0.03453996
  -0.01613788]
 [ 0.00852975 -0.00619132 -0.00311589 ...  0.05427508 -0.04054945
   0.00287724]
 [-0.04903843 -0.08219673 -0.23726662 ... -0.06947719 -0.13758294
  -0.08548969]]


In [35]:
#model for layer3
intermediate_layer_model = Model(inputs=model.input,
                                 outputs=model.get_layer('dense_1').output)
intermediate_output = intermediate_layer_model.predict(x_test[0])
print(intermediate_output.shape)
print(intermediate_output)

(300, 1)
[[4.0692908e-01]
 [4.0692908e-01]
 [4.0692908e-01]
 [4.0692908e-01]
 [4.0692908e-01]
 [4.0692908e-01]
 [4.0692908e-01]
 [4.0692908e-01]
 [4.0692908e-01]
 [4.0692908e-01]
 [4.0692908e-01]
 [4.0692908e-01]
 [4.0692908e-01]
 [4.0692908e-01]
 [4.0692908e-01]
 [4.0692908e-01]
 [4.0692908e-01]
 [4.0692908e-01]
 [4.0692908e-01]
 [4.0692908e-01]
 [4.0692908e-01]
 [4.0692908e-01]
 [4.0692908e-01]
 [4.0692908e-01]
 [4.0692908e-01]
 [4.0692908e-01]
 [4.0692908e-01]
 [4.0692908e-01]
 [4.0692908e-01]
 [4.0692908e-01]
 [4.0692905e-01]
 [4.0692905e-01]
 [4.0692908e-01]
 [4.0692908e-01]
 [4.0692908e-01]
 [4.0692908e-01]
 [4.0692908e-01]
 [4.0692908e-01]
 [4.0692908e-01]
 [4.0692908e-01]
 [4.0692908e-01]
 [4.0692908e-01]
 [4.0692908e-01]
 [4.0692908e-01]
 [4.0692908e-01]
 [4.0692908e-01]
 [4.0692908e-01]
 [4.0692908e-01]
 [4.0692908e-01]
 [4.0692908e-01]
 [4.0692908e-01]
 [4.0692908e-01]
 [4.0692908e-01]
 [4.0692908e-01]
 [4.0692908e-01]
 [4.0692908e-01]
 [4.0692908e-01]
 [4.0692908e-01]
 [4.0

In [87]:
#Let us predict the output for a test sample
p = model.predict(x_test[5000])
p = np.argmax(p, axis=-1)[0]
print(p)

0


The output comes to be 0 which means that the review is classified as negative. Let us see how the input was.

In [49]:
x_test[0]

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,   

##Understanding the data

In [44]:
# download the key,value pair formed by the imdb dataset
word_index = imdb.get_word_index()

Downloading data from https://s3.amazonaws.com/text-datasets/imdb_word_index.json


In [0]:
# forming a reverse key index to get the word for each designated index
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

In [55]:
# checkin one test data see how the transformation is done
review_text = ' '.join([reverse_word_index.get(i, '?') for i in x_test[0]])
print(review_text)

? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? the wonder own as by is sequence i i and and to of hollywood br of down shouting getting boring of ever it sadly sadly sadly i i was then does don't close faint after one carry as by are be favourites all family turn in does as three part in another some to be probably with world and her an have faint beginning own as is sequence


Let us create a custom function to do all these steps

In [0]:
def print_test_result(model,test_data_index):
  data = x_test[test_data_index]
  p = model.predict_classes(data)
  p = np.argmax(p, axis=-1)[0]
  label = "Positive" if p else "Negative"   
  print("The sentiment of the review is: " +label)
  given_label = "Positive" if y_test[test_data_index] else "Negative"   
  print("The sentiment value  for the review provided in the data is: " +given_label)
  review_text = ' '.join([reverse_word_index.get(i, '?') for i in data])
  print("Review_text:")
  print(review_text)

In [66]:
print_test_result(model,0)

The sentiment of the review is: Negative
The sentiment value  for the review provided in the data is: Negative
Review_text:
? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? the wonder own as by is sequence i i and and to of hollywood br of down shouting getting boring of ever it sadly sadly sadly i i was then does don't close faint after one carry as by are be favourites all family turn in does as three part in another some to be probably with world and her an have faint beginning own as is sequence


The review text has many words like "sadly","down","boring", "faint beginning".
our model has correctly classified this review as negative.

Let us check few more examples.

In [67]:
print_test_result(model,1267)

The sentiment of the review is: Negative
The sentiment value  for the review provided in the data is: Positive
Review_text:
br another all there bit or is heartbreaking this foul in is psychotic bargain this called calls and to her plot and all it by naval was had saying what all me good up female this of how lot br of on movie much of versions this of on it who and meredith start and to and anywhere would different had version to myers of almost br is killer br am production film now leg would lines have is franchise br sing expectations found like it disappointed this fellow not these possessed no that trying in about altman execution race i i of younger br awful there will secret who and would to about and young and of br italian hospital and would there had unique each but of being not more he gets no would it his turns in practically film of night of and plane br about acting game in kathryn this twisted to that there will see stumbles violent i i of on game in tomorrow world is a

even this text has many negative words like "disappointed","stumbles","awful"
our model has predicted it as negative.



In [69]:
print_test_result(model,8000)

The sentiment of the review is: Negative
The sentiment value  for the review provided in the data is: Positive
Review_text:
scene that story at definitely mainly go all whereas up been and are of quit to of degree then totally movie emotional scene i'm that an and simple best era in and of seemed edison br of their movie was takes all well hugo in think gives this and all and of and and gives br and this reliable and all satisfying of equally past br levels making first of bother adaption better of and lived badly gives 7 to and specific is guess br said making scene take all end doc this minutes be simple limp hours or and 7 in wal and to and in footage and in gielgud ripped gives and to worst these of ever comedies edward whose which old that and up older been think yourself must this that these it is stayed rarely are of and br finds popular scene it know as on find is themes this witness of and april this really all high some br terrifying but be appearances comedy that of fat it o

# Building bidirectional LSTM

In [0]:
bidirectional_model = Sequential()
bidirectional_model.add(Embedding(vocab_size, 128, input_length=maxlen))
bidirectional_model.add(Bidirectional(LSTM(64)))
bidirectional_model.add(Dropout(0.5))
bidirectional_model.add(Dense(1, activation='sigmoid'))

# try using different optimizers and different optimizer configs
bidirectional_model.compile('adam', 'binary_crossentropy', metrics=['accuracy'])


In [77]:
print(bidirectional_model.summary())

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 300, 128)          1280000   
_________________________________________________________________
bidirectional_1 (Bidirection (None, 128)               98816     
_________________________________________________________________
dropout_1 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 129       
Total params: 1,378,945
Trainable params: 1,378,945
Non-trainable params: 0
_________________________________________________________________
None


In [81]:

print('Train...')
bidirectional_model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=10,
          validation_data=[x_test, y_test])

Train...
Train on 25000 samples, validate on 25000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f8d5a57b1d0>

In [90]:
score, acc = bidirectional_model.evaluate(x_test, y_test,
                            batch_size=batch_size)



In [91]:
print('Test score:', score)
print('Test accuracy:', acc)

Test score: 0.720015089430213
Test accuracy: 0.85816


Though the training accuracy is very high , the test accuracy is poor than normal LSTM.

##Conclusion


*   we have downloaded the IMDB dataset and performed sentiment analysis
*   the vectorization of the model is done by keras itself
*   Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers)
* Two models are built , a LSTM model and a Bidirectional LSTM model
* An embedding layer is added to each of the model to convert the sequence into feature vectors
* Both the models performed well on training compared to testing
* LSTM ( training accuracy - 0.9948     ; testing accuracy -   0.86)    
* Bidirectional LSTM ( training accuracy - 0.9883   ; testing accuracy - 0.85816)   

