Results so far on the validation set (as of notebooks 1_classify_movie_reviews and 9_embeddings_movie_reviews):
    - Standard feed-forward NN with one-hot encoding of words [full length reviews]: 88%
    - Standard feed-forward NN with learned word embeddings [500 word review length]: 87%
    - Standard feed-forward NN with GloVe word embeddings [500 word review length]: 56%
    

In [15]:
MAX_FEATURES = 10000
MAX_REVIEW_LENGTH = 500

We can reuse some of the following methods from previous notebooks (9_embeddings_movie_reviews.ipynb in general)...

In [21]:
def prepare_data():
    """
    Loads imdb data and splits it into train / val / test sets.
    """
    
    from keras import preprocessing
    from keras.datasets import imdb
    
    (x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=MAX_FEATURES)
    print("Number of training samples:", len(x_train))

    x_train = preprocessing.sequence.pad_sequences(x_train, maxlen=MAX_REVIEW_LENGTH)
    x_test = preprocessing.sequence.pad_sequences(x_test, maxlen=MAX_REVIEW_LENGTH)

    # Split a part of the training dataset for validation
    x_val = x_train[:10000]
    y_val = y_train[:10000]
    x_train = x_train[10000:]
    y_train = y_train[10000:]

    print("x_train.shape:", x_train.shape)
    print("x_val.shape:", x_val.shape)
    print("x_test.shape:", x_test.shape)
    
    return ((x_train, y_train), (x_val, y_val), (x_test, y_test))


def plot_history(history, review_length = 100):
    """
    Plots the history of a model training - its loss and accuracy.
    """
    
    import matplotlib.pyplot as plt
    acc = history.history['acc']
    val_acc = history.history['val_acc']
    loss = history.history['loss']
    val_loss = history.history['val_loss']
    epochs = range(1, len(acc) + 1)
    plt.plot(epochs, acc, 'bo', label='Training acc')
    plt.plot(epochs, val_acc, 'b', label='Validation acc')
    plt.title('Training and validation accuracy [review length: %d]' % review_length)
    plt.legend()
    plt.figure()
    plt.plot(epochs, loss, 'bo', label='Training loss')
    plt.plot(epochs, val_loss, 'b', label='Validation loss')
    plt.title('Training and validation loss [review length: %d]' % review_length)
    plt.legend()
    plt.show()

Let's prepare data

In [22]:
(x_train, y_train), (x_val, y_val), (x_test, y_test) = prepare_data()

Number of training samples: 25000
x_train.shape: (15000, 500)
x_val.shape: (10000, 500)
x_test.shape: (25000, 500)


In [25]:
def create_train_simplernn_model():
    from keras.models import Sequential
    from keras.layers import Dense, Embedding, SimpleRNN

    model = Sequential()
    model.add(Embedding(max_features, 32))
    model.add(SimpleRNN(32))
    model.add(Dense(1, activation='sigmoid'))

    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
    history = model.fit(x_train, y_train,
                       epochs=10,
                       batch_size=128,
                       validation_data=(x_val, y_val))
    return model, history

We can reuse the same plotting method as of earlier notebooks..

In [None]:
model, history = create_train_simplernn_model()

Train on 15000 samples, validate on 10000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
 2304/15000 [===>..........................] - ETA: 14s - loss: 0.0344 - acc: 0.9913

In [None]:
plot_history(history, review_length=500)

Tops 86%, but we had more with feed-forward!

In [13]:
def create_train_lstm_model():
    from keras.models import Sequential
    from keras.layers import Dense, Embedding, LSTM
    
    model = Sequential()
    model.add(Embedding(max_features, 32))
    model.add(LSTM(32))
    model.add(Dense(1, activation='sigmoid'))
    
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
    history = model.fit(input_train, y_train, epochs=10, batch_size=128, validation_split=0.2)

    return model, history

In [14]:
model, history = create_train_lstm_model()

Train on 20000 samples, validate on 5000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10

KeyboardInterrupt: 