## IMDB Movie reviews sentiment classification

The IMDB movie review sentiment dataset is available as part of keras.datasets.

Dataset of 25,000 movie reviews from IMDB, labeled by sentiment (positive/negative). Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers).

The model below uses a word embedding and an LSTM layer to predict the sentiment of a movie review.

In [None]:
from __future__ import print_function

import keras
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Embedding
from keras.layers import LSTM
from keras.datasets import imdb
import numpy as np
import string

In [None]:
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())

In [None]:
global word_to_id 
INDEX_FROM=3
word_to_id = keras.datasets.imdb.get_word_index()
word_to_id = {k:(v+INDEX_FROM) for k,v in word_to_id.items()}
word_to_id["<PAD>"] = 0
word_to_id["<START>"] = 1
word_to_id["<UNK>"] = 2
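
The offset logic above can be illustrated on a toy vocabulary. This is a minimal sketch: `toy_raw_index` and its values are made up for illustration and stand in for the dictionary returned by `get_word_index()`, so nothing is downloaded.

```python
# Toy stand-in for the real word index (hypothetical words and ranks)
toy_raw_index = {"the": 1, "and": 2, "movie": 3}
TOY_INDEX_FROM = 3  # same offset as INDEX_FROM in the notebook

# Shift every rank by the offset so that ids 0, 1, 2 are free for special tokens
toy_word_to_id = {word: rank + TOY_INDEX_FROM for word, rank in toy_raw_index.items()}
toy_word_to_id["<PAD>"] = 0
toy_word_to_id["<START>"] = 1
toy_word_to_id["<UNK>"] = 2

# Reverse mapping, used to turn ids back into words
toy_id_to_word = {idx: word for word, idx in toy_word_to_id.items()}

encoded = [toy_word_to_id["<START>"], toy_word_to_id["the"], toy_word_to_id["movie"]]
decoded = " ".join(toy_id_to_word[i] for i in encoded)
print(encoded)
print(decoded)
```

Without the offset, the most frequent words would collide with the ids reserved for padding, start-of-sequence, and unknown tokens.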

In [None]:
def str_cleanup(text):
    # Punctuation characters to strip from the user's input
    punctuation_to_remove = [".", ",", ":", ";", "!", "?", "&"]
    cleaned = ""
    for c in text:                              # for each character in the input
        if c not in punctuation_to_remove:      # keep everything except the listed punctuation
            cleaned = cleaned + c
    # str.lower() works on Python 2 and 3; string.lower() no longer exists in Python 3
    return cleaned.lower()

In [None]:
def str_to_data(text):
    # Convert raw text into a list of word indexes using word_to_id
    words = str_cleanup(text).split()
    # Words missing from the index map to <UNK> (2) rather than <PAD> (0)
    return [word_to_id.get(word, word_to_id["<UNK>"]) for word in words]

In [None]:
def data_to_str (data):
    id_to_word = {value:key for key,value in word_to_id.items()}
    return (' '.join(id_to_word[id] for id in data ))

In [None]:
max_features = 20000
maxlen = 80  # cut texts after this number of words (among top max_features most common words)
batch_size = 32

### Load Data (Train and Test)

In [None]:
print('Loading data...')
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

### Exercise 1:  Check data

- word index
  - what is the index for "it"?
  - what is the index for "happy"?
  - Hint: use word_to_id["xxxxx"]
- comments (x_train)
  - print word indexes for comments in x_train[10]
  - print words in x_train[10]
      - Hint: use data_to_str(x_train[10]) 
- sentiment (y_train)
  - What is the sentiment for y_train[10]?
      - Hint: 1 means positive, 0 means negative

In [None]:
print(x_train[10])

In [None]:
print(data_to_str(x_train[10]))

### Pad inputs to a constant length

In [None]:
print('Pad sequences (samples x time)')
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)
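
To see what padding does to individual reviews, here is a small NumPy-only sketch. `pad_pre` is a hypothetical helper written for illustration, not part of Keras; it assumes the default behaviour of `pad_sequences`, which pads with zeros on the left and, for sequences longer than `maxlen`, keeps the last `maxlen` items.

```python
import numpy as np

def pad_pre(seqs, maxlen, value=0):
    """Hypothetical helper: left-pad with `value`, keep the last `maxlen` items."""
    out = np.full((len(seqs), maxlen), value, dtype="int32")
    for row, seq in enumerate(seqs):
        if len(seq) == 0:
            continue                      # an empty sequence stays all padding
        kept = seq[-maxlen:]              # drop items from the front if too long
        out[row, -len(kept):] = kept      # right-align, padding stays on the left
    return out

toy_reviews = [[1, 14, 22], [1, 5, 6, 7, 8, 9]]
toy_padded = pad_pre(toy_reviews, maxlen=4)
print(toy_padded)
```

The short review gains a leading zero and the long review loses its first two word indexes, so every row ends up with exactly `maxlen` entries.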

### Build NeuralNet Model

In [None]:
# Embedding layer: maps each word index to a dense vector of size 128.
# Each word is one of 20000 possible indexes (max_features).
# The layer holds a (20000, 128) weight matrix, so each word is represented
# by one 128-dimensional row of that matrix.

In [None]:
max_features

In [None]:
### Word embedding as a matrix product:
### one-hot word [1, 20000] x weights [20000, 128] = word vector [1, 128]
### Trainable parameters: 20000 * 128 = 2,560,000
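
The matrix-product view of an embedding can be checked numerically. The sketch below uses toy sizes (10 words, 4 dimensions) and a random matrix `W` as a stand-in for trained weights; both are assumptions made for illustration.

```python
import numpy as np

vocab_size, embed_dim = 10, 4          # toy sizes; the notebook uses 20000 and 128
rng = np.random.default_rng(0)
W = rng.normal(size=(vocab_size, embed_dim))   # stand-in for embedding weights

word_id = 7
one_hot = np.zeros((1, vocab_size))
one_hot[0, word_id] = 1.0              # one-hot row vector for the chosen word

via_matmul = one_hot @ W               # [1, vocab] x [vocab, dim] -> [1, dim]
via_lookup = W[word_id]                # plain row lookup

print(via_matmul.shape)
print(np.allclose(via_matmul[0], via_lookup))
```

Multiplying by a one-hot vector only selects one row of the weight matrix, which is why an embedding layer can be implemented as an index lookup instead of a full matrix product.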

In [None]:
print('Build model...')
model = Sequential()
model.add(Embedding(max_features, 128))
model.add(LSTM(80, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))

In [None]:
model.summary()
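
The parameter counts reported by the summary can be reproduced by hand. This is a rough check that assumes the standard Keras layer definitions with biases enabled: an LSTM has four gates, each with input weights, recurrent weights, and a bias.

```python
# Hand count of trainable parameters for Embedding -> LSTM(80) -> Dense(1)
vocab_size, embed_dim, lstm_units = 20000, 128, 80

# Embedding: one embed_dim-sized vector per word index
embedding_params = vocab_size * embed_dim

# LSTM: 4 gates x (input weights + recurrent weights + bias)
lstm_params = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)

# Dense: one weight per LSTM unit plus one bias
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

Almost all of the parameters sit in the embedding matrix, so changing the vocabulary size or embedding dimension affects model size far more than changing the number of LSTM units.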

In [None]:
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot

SVG(model_to_dot(model, show_shapes=True).create(prog='dot', format='svg'))

In [None]:
# try using different optimizers and different optimizer configs
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

### Train Model

In [None]:
print('Train...')
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=5,
          validation_data=(x_test, y_test))

### Test Model

In [None]:
score, acc = model.evaluate(x_test, y_test,
                            batch_size=batch_size)

In [None]:
print('Test score:', score)
print('Test accuracy:', acc)

### Confusion Matrix

In [None]:
result = model.predict(x_test.reshape(len(x_test),x_test.shape[1]),
                       batch_size=1000,verbose = 2)

In [None]:
# Threshold the sigmoid outputs at 0.5 to get class labels (0 or 1)
y_pred = (result.ravel() >= 0.5).astype(int)

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
tbl = confusion_matrix(y_test, y_pred)

In [None]:
tbl

In [None]:
print ("Negative Accuracy = ", tbl[0,0]*100./sum(tbl[0,]), "%")

In [None]:
print ("Positive Accuracy = ", tbl[1,1]*100./sum(tbl[1,]), "%")
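
Per-class accuracy can be worked through on a small example. The labels and probabilities below are made up for illustration, and the matrix is built with plain NumPy using the same layout as `confusion_matrix` (rows are true labels, columns are predicted labels).

```python
import numpy as np

# Made-up labels and predicted probabilities for eight reviews
toy_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
toy_prob = np.array([0.1, 0.7, 0.2, 0.9, 0.4, 0.8, 0.6, 0.3])
toy_pred = (toy_prob >= 0.5).astype(int)   # threshold at 0.5

# Rows: true label, columns: predicted label
toy_tbl = np.zeros((2, 2), dtype=int)
for t, p in zip(toy_true, toy_pred):
    toy_tbl[t, p] += 1

# Each class accuracy divides the diagonal entry by its own row total
neg_acc = toy_tbl[0, 0] * 100.0 / toy_tbl[0].sum()
pos_acc = toy_tbl[1, 1] * 100.0 / toy_tbl[1].sum()
print(toy_tbl)
print(neg_acc, pos_acc)
```

Row 0 holds all truly negative reviews and row 1 all truly positive ones, so the denominator for each class must come from that class's own row.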

### Exercise 2:  Change model and check accuracy

- Increase the number of LSTM units to 128, train the model for 5 epochs, and check the accuracy.