In [1]:
import json

from keras.datasets import imdb
from keras.preprocessing import sequence
from keras import Sequential
from keras.layers import Embedding, LSTM, Dense

Using TensorFlow backend.


In [2]:
VOCABULARY_SIZE = 20000
MAX_WORDS = 500

### Training data set

In [3]:
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words = VOCABULARY_SIZE)

Mappings of **Word -> ID** and **ID -> Word**

In [4]:
word2id = imdb.get_word_index()
id2word = {i: word for word, i in word2id.items()}
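
One detail worth checking when using these mappings: with its default arguments, `imdb.load_data` reserves 0 for padding, 1 for the start-of-review marker, and 2 for out-of-vocabulary words, and shifts every word index by 3 (`index_from=3`), whereas `get_word_index()` returns the unshifted frequency ranks. Below is a minimal sketch of decoding under that assumption. `toy_word2id` and `decode_review` are hypothetical names, and the tiny dictionary stands in for the real word index.

```python
# Toy stand-in for imdb.get_word_index(): word -> frequency rank (1 = most frequent).
toy_word2id = {"the": 1, "movie": 2, "great": 3}

INDEX_FROM = 3  # default shift applied by imdb.load_data
SPECIAL_TOKENS = {0: "<PAD>", 1: "<START>", 2: "<UNK>"}


def decode_review(ids, word2id, index_from=INDEX_FROM):
    """Map encoded ids back to words, undoing the index shift."""
    id2word = {rank + index_from: word for word, rank in word2id.items()}
    id2word.update(SPECIAL_TOKENS)
    return " ".join(id2word.get(i, "<UNK>") for i in ids)


encoded = [1, 4, 5, 2, 6]  # <START> the movie <UNK> great
decoded = decode_review(encoded, toy_word2id)
print(decoded)
```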

All input documents must have the same length. We limit each review to MAX_WORDS tokens (500, as set above) by truncating longer reviews and padding shorter ones with a null value (0), using the pad_sequences() function in Keras.

In [5]:
max_words = MAX_WORDS  # reuse the constant defined above (500)
X_train = sequence.pad_sequences(X_train, maxlen=max_words)
X_test = sequence.pad_sequences(X_test, maxlen=max_words)
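
To make the truncation and padding concrete, here is a pure-Python sketch of what `pad_sequences` does with its default settings (`padding='pre'`, `truncating='pre'`): zeros are added at the front, and overly long sequences lose their earliest tokens. `pad_pre` is a hypothetical helper written for illustration, not the Keras implementation.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad or left-truncate each sequence to exactly maxlen items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                # keep the last maxlen tokens
        padding = [value] * (maxlen - len(seq))  # zeros go in front
        padded.append(padding + seq)
    return padded


reviews = [[5, 8], [1, 2, 3, 4, 5, 6]]
result = pad_pre(reviews, maxlen=4)
print(result)  # [[0, 0, 5, 8], [3, 4, 5, 6]]
```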

### Design Model

**Input**: Sequence of words (integer ids) of length MAX_WORDS.

**Output**: Binary label (0 means *Negative* and 1 means *Positive*)

In [6]:
model=Sequential()
model.add(Embedding(VOCABULARY_SIZE, 32, input_length=MAX_WORDS))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))

model.summary()  # summary() prints the table itself and returns None

Instructions for updating:
Colocations handled automatically by placer.
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 500, 32)           640000    
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 101       
=================================================================
Total params: 693,301
Trainable params: 693,301
Non-trainable params: 0
_________________________________________________________________
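
The parameter counts in the summary can be reproduced by hand, which is a useful sanity check on the architecture. The sketch below assumes the standard LSTM layout of four gates, each with input weights, recurrent weights, and a bias. The variable names are illustrative.

```python
vocabulary_size = 20000
embedding_dim = 32
lstm_units = 100

# Embedding: one 32-dimensional vector per vocabulary entry.
embedding_params = vocabulary_size * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias.
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)

# Dense: one weight per LSTM unit plus a single bias.
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```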


### Compile and train our model

We first need to compile our model by specifying the loss function and optimizer we want to use while training, as well as any evaluation metrics we'd like to measure.

In [7]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
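
For reference, binary cross-entropy averages `-(y*log(p) + (1-y)*log(1-p))` over the batch, where `p` is the sigmoid output and `y` the 0/1 label. The following pure-Python sketch mirrors that formula. The clipping constant `eps` is an assumption added here to avoid `log(0)`, and the function is illustrative rather than the Keras implementation.

```python
import math


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy between labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


confident = binary_crossentropy([1, 0], [0.9, 0.1])  # correct and confident
unsure = binary_crossentropy([1, 0], [0.5, 0.5])     # no better than chance
print(round(confident, 4), round(unsure, 4))
```

Confident correct predictions give a loss near zero, while predictions of 0.5 give a loss of log(2).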

Once compiled, we can run the training process.

In [8]:
batch_size = 64
num_epochs = 5

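# Hold out the first batch_size (64) reviews as a small validation set
X_valid, y_valid = X_train[:batch_size], y_train[:batch_size]
# Train on the remaining reviews
X_train2, y_train2 = X_train[batch_size:], y_train[batch_size:]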

model.fit(X_train2, y_train2, validation_data=(X_valid, y_valid), batch_size=batch_size, epochs=num_epochs);

Instructions for updating:
Use tf.cast instead.
Train on 24936 samples, validate on 64 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
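
The hold-out split above keeps only the first 64 reviews for validation, which explains the "Train on 24936 samples, validate on 64 samples" line in the log. Here is a small sketch of the same slicing on index ranges, together with the resulting number of batches per epoch. `holdout_split` is a hypothetical helper, and 25000 is the total implied by the log (24936 + 64).

```python
import math


def holdout_split(n_samples, n_valid):
    """Return (validation indices, training indices) for a head hold-out."""
    valid_idx = range(0, n_valid)
    train_idx = range(n_valid, n_samples)
    return valid_idx, train_idx


valid_idx, train_idx = holdout_split(n_samples=25000, n_valid=64)
steps_per_epoch = math.ceil(len(train_idx) / 64)  # batches of 64 reviews

print(len(train_idx), len(valid_idx), steps_per_epoch)
```

A validation set this small gives a noisy accuracy estimate, so a larger hold-out may be worth considering.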


### Test model to measure accuracy

In [9]:
scores = model.evaluate(X_test, y_test, verbose=0)
print('Test accuracy:', scores[1])

Test accuracy: 0.86792
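
The accuracy reported by `evaluate` is the fraction of reviews whose thresholded prediction matches the label. Below is a minimal sketch of that computation on made-up probabilities, assuming the usual 0.5 threshold. `binary_accuracy` here is an illustrative function, not the Keras metric itself.

```python
def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of predictions that match the labels after thresholding."""
    hits = sum(int(p > threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)


labels = [1, 0, 1, 1]
probabilities = [0.91, 0.20, 0.40, 0.75]  # made-up model outputs
accuracy = binary_accuracy(labels, probabilities)
print(accuracy)  # 0.75
```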


### Save the model and word mapping

In [10]:
model.save("model.h5")

with open("words.json", "w") as f:
    json.dump(word2id, f, indent=4)
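
When the mapping is needed later, for example in a separate inference script, it can be read back with `json.load`. The sketch below round-trips a toy dictionary through a temporary file instead of the real `words.json`. The toy data and file location are assumptions for illustration.

```python
import json
import os
import tempfile

toy_word2id = {"the": 1, "movie": 2, "great": 3}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "words.json")

    # Write the mapping the same way the notebook does.
    with open(path, "w") as f:
        json.dump(toy_word2id, f, indent=4)

    # Read it back; string keys and integer values survive unchanged.
    with open(path) as f:
        restored = json.load(f)

print(restored == toy_word2id)
```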

### Prediction example

In [11]:
x = "A girl is happy while playing with her toys"
x = x.lower().split()
x = [word2id.get(i, 0) for i in x]
x = sequence.pad_sequences([x], maxlen=max_words)
score = model.predict(x)[0][0]
# Threshold the sigmoid output directly; predict_classes() is not available in newer Keras releases
sentiment = "Positive" if score > 0.5 else "Negative"

print(f"The result is '{sentiment}' with a score of {score:.2f}")

The result is 'Positive' with a score of 0.66
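
One caveat for this kind of lookup: the training data from `imdb.load_data` uses shifted indices (0 for padding, 1 for the start marker, 2 for unknown words, and every rank moved up by 3), and only ids below `VOCABULARY_SIZE` exist in the embedding layer. Below is a hedged sketch of an encoder that respects both constraints. `encode_text` and the toy dictionary are hypothetical, and whether this changes the prediction for this particular model has not been verified here.

```python
START_ID, UNKNOWN_ID, INDEX_FROM = 1, 2, 3  # defaults of imdb.load_data


def encode_text(text, word2id, vocabulary_size):
    """Encode raw text the same way imdb.load_data encodes reviews."""
    ids = [START_ID]
    for word in text.lower().split():
        rank = word2id.get(word)
        if rank is None or rank + INDEX_FROM >= vocabulary_size:
            ids.append(UNKNOWN_ID)  # unseen word, or too rare for the vocabulary
        else:
            ids.append(rank + INDEX_FROM)
    return ids


toy_word2id = {"a": 1, "girl": 2, "happy": 3, "toys": 50}
encoded = encode_text("A happy girl with toys", toy_word2id, vocabulary_size=20)
print(encoded)  # [1, 4, 6, 5, 2, 2]
```

The resulting list can then be passed through `pad_sequences` exactly as in the cell above.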
