# IMDB Movie Reviews: LSTM and Binary Classification


Two-class classification, or binary classification, may be the most widely applied kind of machine learning problem. In this example, we will learn to classify movie reviews as "positive" or "negative" based only on the text of the reviews, using an **LSTM**.

## The IMDB dataset
We'll be working with the "IMDB dataset", a set of 50,000 highly polarized reviews from the Internet Movie Database. They are split into 25,000 reviews for training and 25,000 reviews for testing, each set consisting of 50% negative and 50% positive reviews.
Just like the MNIST dataset, the IMDB dataset comes packaged with Keras. It has already been preprocessed: the reviews (sequences of words) have been turned into sequences of integers, where each integer stands for a specific word in a dictionary.
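
To make the integer encoding concrete, here is a minimal sketch on a toy corpus. It is not the code Keras used to build the dataset; the reserved indices (0 for padding, 1 for start-of-sequence, 2 for unknown words) and the offset of 3 are assumed to mirror the defaults of `imdb.load_data`, and `encode` and `decode` are hypothetical helpers written for this illustration.

```python
from collections import Counter

# Toy corpus standing in for the real reviews.
reviews = [
    "this movie was great",
    "this movie was terrible",
    "great acting great story",
]

# Reserved indices, assumed here to mirror the Keras defaults.
PAD, START, OOV = 0, 1, 2
INDEX_FROM = 3  # offset added to every word rank

# Rank words by frequency: rank 1 is the most common word.
counts = Counter(word for review in reviews for word in review.split())
ranked = sorted(counts, key=lambda w: (-counts[w], w))
word_index = {word: rank for rank, word in enumerate(ranked, start=1)}


def encode(text, num_words=None):
    """Turn a review into a list of integers, starting with the START marker."""
    encoded = [START]
    for word in text.split():
        rank = word_index.get(word)
        idx = OOV if rank is None else rank + INDEX_FROM
        # Keep only the most frequent words when num_words is set.
        if num_words is not None and idx >= num_words:
            idx = OOV
        encoded.append(idx)
    return encoded


def decode(encoded):
    """Map integers back to words; unknown indices become '?'."""
    reverse = {rank + INDEX_FROM: word for word, rank in word_index.items()}
    return " ".join(reverse.get(i, "?") for i in encoded if i not in (PAD, START))


sample = encode("this movie was great")
print(sample)          # [1, 6, 5, 7, 4]
print(decode(sample))  # this movie was great
```

The `num_words` argument plays the same role as `num_words=max_features` in the training code: rarer words are collapsed into the single unknown-word index, which bounds the vocabulary size the embedding layer has to cover.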



In [None]:
def fun():
    '''Trains an LSTM model on the IMDB sentiment classification task.
    The dataset is actually too small for LSTM to be of any advantage
    compared to simpler, much faster methods such as TF-IDF + LogReg.
    Notes:
    - RNNs are tricky. Choice of batch size is important,
    choice of loss and optimizer is critical, etc.
    Some configurations won't converge.
    - LSTM loss decrease patterns during training can be quite different
    from what you see with CNNs/MLPs/etc.
    
    Source: https://github.com/fchollet/keras/blob/master/examples/imdb_lstm.py and
            https://github.com/fchollet/deep-learning-with-python-notebooks/blob/master/3.5-classifying-movie-reviews.ipynb
    '''

    from keras.preprocessing import sequence
    from keras.models import Sequential
    from keras.layers import Dense, Embedding
    from keras.callbacks import TensorBoard
    from keras.callbacks import ModelCheckpoint
    from keras.layers import LSTM
    from keras.datasets import imdb
    
    from hops import tensorboard, hdfs

    max_features = 20000
    maxlen = 80  # keep at most this many words per review (among top max_features most common words); pad_sequences truncates from the front by default
    batch_size = 256

    hdfs.log('Loading data...')
    (x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
    hdfs.log(str(len(x_train)) + ' train sequences')
    hdfs.log(str(len(x_test)) + ' test sequences')

    hdfs.log('Pad sequences (samples x time)')
    x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
    x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
    hdfs.log('x_train shape: ' + str(x_train.shape))
    hdfs.log('x_test shape: ' + str(x_test.shape))

    hdfs.log('Build model...')
    model = Sequential()
    model.add(Embedding(max_features, 128))
    model.add(LSTM(128, dropout=0.5, recurrent_dropout=0.5))
    model.add(Dense(1, activation='sigmoid'))

    # try using different optimizers and different optimizer configs
    model.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    model_checkpoint = ModelCheckpoint(tensorboard.logdir() + '/imdb.h5py', 
                                       monitor='val_loss', 
                                       verbose=0, 
                                       save_best_only=False, 
                                       save_weights_only=False, 
                                       mode='auto', 
                                       period=1)
    tensorboard_callback = TensorBoard(log_dir=tensorboard.logdir(),
                                       histogram_freq=0,
                                       batch_size=batch_size,
                                       write_graph=True,
                                       write_grads=False,
                                       write_images=True,
                                       embeddings_freq=0,
                                       embeddings_layer_names=None,
                                       embeddings_metadata=None)
    hdfs.log('Train...')
    model.fit(x_train, y_train,
              batch_size=batch_size,
              epochs=10,
              validation_data=(x_test, y_test),
              callbacks=[tensorboard_callback, model_checkpoint])
    score, acc = model.evaluate(x_test, y_test,
                                batch_size=batch_size)
    hdfs.log('Test score: ' + str(score))
    hdfs.log('Test accuracy: ' + str(acc))
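
The "Pad sequences" step in the function above turns reviews of different lengths into rows of equal length. The sketch below imitates what we understand to be the default behavior of `sequence.pad_sequences` (zeros added on the left, overly long sequences cut from the front) using plain Python lists. `pad_pre` is a hypothetical helper written for this illustration, and the real function returns a NumPy array rather than nested lists.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences and keep only the last maxlen items of long ones."""
    padded = []
    for seq in sequences:
        trunc = list(seq)[-maxlen:]                # drop items from the front if too long
        fill = [value] * (maxlen - len(trunc))     # padding goes on the left
        padded.append(fill + trunc)
    return padded


toy = [[1, 6, 5], [1, 4, 9, 8, 7, 6, 5]]
padded = pad_pre(toy, maxlen=5)
print(padded)  # [[0, 0, 1, 6, 5], [9, 8, 7, 6, 5]]
```

With `maxlen = 80`, each review is therefore represented by at most its last 80 word indices, and the resulting matrix has shape (samples, 80).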

In [None]:
from hops import tflauncher
tflauncher.launch(spark, fun)
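
For reference, the `sigmoid` output and `binary_crossentropy` loss used in the training function can be worked through by hand on a tiny batch. This is a minimal sketch in plain Python, not the Keras implementation; the logits and labels are made-up values, and the clipping constant `eps` is an assumption chosen for numerical safety.

```python
import math


def sigmoid(z):
    """Squash a real-valued score into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over the batch."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


logits = [2.0, -1.0, 0.0]  # hypothetical scores before the sigmoid
labels = [1, 0, 1]         # 1 = positive review, 0 = negative review

probs = [sigmoid(z) for z in logits]
loss = binary_crossentropy(labels, probs)

# A review counts as predicted positive when its probability exceeds 0.5.
accuracy = sum((p > 0.5) == bool(y) for p, y in zip(probs, labels)) / len(labels)

print(round(loss, 4), round(accuracy, 4))
```

The third example sits exactly at probability 0.5, so it is counted as a negative prediction and lowers the accuracy, while still contributing a moderate loss of log 2.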