First, set the environment variables and initialize the Spark context:

In [None]:
%env SPARK_DRIVER_MEMORY=8g
%env PYSPARK_PYTHON=/usr/bin/python3.5
%env PYSPARK_DRIVER_PYTHON=/usr/bin/python3.5

from zoo.common.nncontext import *
sc = init_nncontext(init_spark_conf().setMaster("local[4]"))

# Binary classification
We'll be working with the IMDB dataset, a set of 50,000 highly polarized reviews from the Internet Movie Database. They are split into 25,000 reviews for training and 25,000 reviews for testing, each set consisting of 50% negative and 50% positive reviews.

Just like the MNIST dataset, the IMDB dataset comes packaged with Keras. It has already been preprocessed: the reviews (sequences of words) have been turned into sequences of integers, where each integer stands for a specific word in a dictionary.
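
To make the integer encoding concrete, here is a minimal sketch using a made-up five-word dictionary. The names `word_index`, `encode` and `decode` are hypothetical and only illustrate the idea; the real IMDB dictionary is far larger and is not reproduced here.

```python
# Hypothetical toy dictionary mapping each word to an integer index.
# This is NOT the real IMDB word index, only an illustration.
word_index = {"the": 1, "movie": 2, "was": 3, "great": 4, "awful": 5}

# Reverse mapping from integer index back to word.
reverse_index = {index: word for word, index in word_index.items()}


def encode(review):
    """Turn a whitespace-separated review into a list of integers."""
    return [word_index[word] for word in review.split()]


def decode(sequence):
    """Turn a list of integers back into a readable review."""
    return " ".join(reverse_index[index] for index in sequence)


encoded = encode("the movie was great")
decoded = decode(encoded)
print(encoded)
print(decoded)
```

Each review in the dataset is stored in the encoded form, which is why it has to be converted into a fixed-size numeric array before it can be fed to the model.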

The following code will load the dataset (when you run it for the first time, about 80MB of data will be downloaded to your machine).

Then we vectorize the sequences to prepare the data we are going to feed to the model. We also separate part of the data for validation.

In [None]:
from keras.datasets import imdb
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(nb_words=10000)

import numpy as np
def vectorize_sequences(sequences, dimension=10000):
    # Create an all-zero matrix of shape (len(sequences), dimension)
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.  # set specific indices of results[i] to 1s
    return results

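# Vectorize the integer sequences into multi-hot matrices
x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)

# Convert the labels to float32 arrays
y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')

# Hold out the first 10,000 training samples for validation
x_val = x_train[:10000]
partial_x_train = x_train[10000:]
y_val = y_train[:10000]
partial_y_train = y_train[10000:]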
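
As a small worked example of the vectorization step, the sketch below applies the same `vectorize_sequences` function to two made-up sequences with a vocabulary of only 8 indices. The toy sequences and the `dimension=8` value are assumptions chosen so that the output is easy to read.

```python
import numpy as np


def vectorize_sequences(sequences, dimension=10000):
    # Create an all-zero matrix of shape (len(sequences), dimension)
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        # Integer-array indexing sets every listed column of row i to 1
        results[i, sequence] = 1.
    return results


# Two made-up "reviews"; index 3 appears twice in the first one.
toy_sequences = [[0, 3, 3], [1, 2, 7]]
toy = vectorize_sequences(toy_sequences, dimension=8)
print(toy)
```

Note that this encoding only records whether a word index is present: repeated indices are set once, and the word order is lost.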

Build the network, then compile:

In [None]:
from zoo.pipeline.api.keras import models
from zoo.pipeline.api.keras import layers

model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])
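
To see what this stack of layers computes, here is a rough NumPy sketch of the forward pass and of the binary cross-entropy loss. It is not how analytics-zoo implements the model: the weights are random placeholders, and the names `forward`, `binary_crossentropy`, `W1`..`W3` and `b1`..`b3` are hypothetical.

```python
import numpy as np

rng = np.random.RandomState(0)
input_dim = 10000


def relu(z):
    return np.maximum(z, 0.0)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# Random placeholder weights and zero biases (illustrative only).
W1, b1 = rng.normal(scale=0.01, size=(input_dim, 16)), np.zeros(16)
W2, b2 = rng.normal(scale=0.1, size=(16, 16)), np.zeros(16)
W3, b3 = rng.normal(scale=0.1, size=(16, 1)), np.zeros(1)

# Same number of trainable parameters as the three Dense layers above.
n_params = sum(p.size for p in (W1, b1, W2, b2, W3, b3))


def forward(x):
    h1 = relu(x @ W1 + b1)         # Dense(16, activation='relu')
    h2 = relu(h1 @ W2 + b2)        # Dense(16, activation='relu')
    return sigmoid(h2 @ W3 + b3)   # Dense(1, activation='sigmoid')


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    # Clip to avoid log(0), then average the per-sample losses.
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    losses = -(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))
    return float(np.mean(losses))


# Three made-up multi-hot inputs; the last one is an all-zero row.
x = np.zeros((3, input_dim))
x[0, [1, 5, 42]] = 1.
x[1, [7, 99]] = 1.

probs = forward(x)
loss = binary_crossentropy(np.array([[1.], [0.], [1.]]), probs)
print(n_params, probs.ravel(), loss)
```

The sigmoid output is a probability between 0 and 1, which is why `binary_crossentropy` is a natural loss for this two-class problem.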

#### Accuracy checkout
To check the behavior of this model in Keras, the original code uses the `matplotlib` library to plot the `history` object returned by `fit`:
    
    history = model.fit(partial_x_train,
                        partial_y_train,
                        nb_epoch=5,
                        batch_size=512,
                        validation_data=(x_val, y_val))

After the `fit` method finishes, the results are stored in `history` and can then be visualized.

Currently in analytics-zoo, the `fit` method does not return anything, so the results can only be inspected through TensorBoard. The code above needs to be replaced with the following:

In [None]:
model.set_tensorboard('./', '3-5_summary')
model.fit(partial_x_train,
          partial_y_train,
          nb_epoch=5,
          batch_size=512,
          validation_data=(x_val, y_val)
          )

You can then view the results in TensorBoard by running `tensorboard --logdir ./` in a terminal.

Check the results on the test data:

In [None]:
results = model.evaluate(x_test, y_test)
print('test_acc:', results[0].result)

#### Predict result
To predict on the test data in Keras, you only need to call:

    model.predict(x_test)

In analytics-zoo, `predict` returns an RDD, so you need to call its `collect` method to get the results:

In [None]:
prediction = model.predict(x_test)
result = prediction.collect()
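
Once the predictions are collected, they can be turned into class labels by thresholding the probabilities. The sketch below uses made-up stand-in values instead of the real `result` and `y_test`, and it assumes that each collected element is a small NumPy array holding one probability; check the actual shape of your `result` before reusing it.

```python
import numpy as np

# Stand-ins for `prediction.collect()` and `y_test` (made-up values).
result = [np.array([0.91]), np.array([0.08]), np.array([0.55]), np.array([0.30])]
y_true = np.array([1., 0., 0., 0.], dtype='float32')

# Flatten the collected arrays into a single vector of probabilities.
probs = np.concatenate([np.ravel(p) for p in result])

# A probability above 0.5 is read as the positive class.
pred_labels = (probs > 0.5).astype('float32')

# Fraction of predictions that match the labels.
accuracy = float(np.mean(pred_labels == y_true))
print(pred_labels, accuracy)
```

The 0.5 threshold is the usual default for a sigmoid output, but it is a choice and not something fixed by the model.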