## Sentiment Classification using Char-level Embedding on Large Movie Reviews

[Sentiment analysis](https://en.wikipedia.org/wiki/Sentiment_analysis) is a classic natural language processing problem. This example uses a large movie review dataset from IMDB to perform sentiment classification with deep learning. The labeled dataset consists of 50,000 [IMDB](http://www.imdb.com/) movie reviews (positive or negative): 25,000 highly polar reviews for training and 25,000 for testing. The dataset was originally collected by Stanford researchers and used in a [2011 paper](http://ai.stanford.edu/~amaas/papers/wvSent_acl2011.pdf), which reported a best accuracy of 88.33% without using the unlabeled data. This example illustrates some deep learning approaches to sentiment classification with the [BigDL](https://github.com/intel-analytics/BigDL) Python API.

### Load the IMDB Dataset
The IMDB dataset needs to be loaded into BigDL. The dataset has already been pre-processed, and each review is encoded as a sequence of integers. Each integer is the rank of a word by its overall frequency in the dataset; for instance, '5' denotes the 5th most frequent word in the data. This makes it convenient to filter words by frequency, for example keeping only the top 5,000 most common words and/or eliminating the top 30 most common words. Let's define functions to load the pre-processed data.
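
To make the encoding concrete, here is a minimal sketch on a toy corpus (not the IMDB data). The helper names `build_rank_index` and `encode`, and the parameters `num_words`, `skip_top`, and `oov`, are hypothetical and chosen only for illustration. The real `imdb.npz` file is already encoded, so nothing here is required to run the example.

```python
from collections import Counter

def build_rank_index(corpus):
    """Map each word to its frequency rank (1 = most frequent)."""
    counts = Counter(word for review in corpus for word in review.split())
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ranked, start=1)}

def encode(review, rank_index, num_words=None, skip_top=0, oov=0):
    """Encode a review as frequency ranks, replacing filtered words with `oov`."""
    encoded = []
    for word in review.split():
        rank = rank_index[word]
        too_rare = num_words is not None and rank > num_words
        too_common = rank <= skip_top
        encoded.append(oov if too_rare or too_common else rank)
    return encoded

toy_corpus = ["the movie was great", "the movie was bad", "the plot was thin"]
rank_index = build_rank_index(toy_corpus)

# Unfiltered encoding, then keep only ranks 2..3 (drop the top 1, keep the top 3).
full = encode("the movie was great", rank_index)
filtered = encode("the movie was great", rank_index, num_words=3, skip_top=1)
print(full)      # [1, 3, 2, 5]
print(filtered)  # [0, 3, 2, 0]
```

Because the integers are ranks, both filters reduce to simple comparisons against the rank, with no need to look at the words themselves.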

In [1]:
from bigdl.dataset import base
import numpy as np

def download_imdb(dest_dir):
    """Download pre-processed IMDB movie review data

    :argument
        dest_dir: destination directory to store the data

    :return
        The absolute path of the stored data
    """
    file_name = "imdb.npz"
    file_abs_path = base.maybe_download(file_name,
                                        dest_dir,
                                        'https://s3.amazonaws.com/text-datasets/imdb.npz')
    return file_abs_path

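def load_imdb(dest_dir='~/'):
    """Load IMDB dataset.

    :argument
        dest_dir: directory to download the data into. Only used if the
            `download_imdb` call below is re-enabled.

    :return
        the train, test separated IMDB dataset.
    """
    # Uncomment to download the file instead of reading a local copy:
    # path = download_imdb(dest_dir)
    path = "imdb.npz"  # expects the file in the current working directory
    # The reviews are stored as object arrays (lists of varying length),
    # which recent NumPy versions only load when allow_pickle=True.
    with np.load(path, allow_pickle=True) as f:
        x_train = f['x_train']
        y_train = f['y_train']
        x_test = f['x_test']
        y_test = f['y_test']

    return (x_train, y_train), (x_test, y_test)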

print('Processing text dataset')
(x_train, y_train), (x_test, y_test) = load_imdb()
print('finished processing text')

import json

def get_word_index(dest_dir='/tmp/.bigdl/dataset'):
    """Retrieves the dictionary mapping words to their indices.

    :argument
        dest_dir: directory in which to cache the downloaded file.

    :return
        The word index dictionary (word -> index).
    """
    file_name = "imdb_word_index.json"
    path = base.maybe_download(file_name,
                               dest_dir,
                               source_url='https://s3.amazonaws.com/text-datasets/imdb_word_index.json')
    # Use a context manager so the file is closed even if parsing fails.
    with open(path) as f:
        data = json.load(f)
    return data

print('Processing vocabulary')
word_idx = get_word_index()
idx_word = {v:k for k,v in word_idx.items()}
print('finished processing vocabulary')

Processing text dataset
finished processing text
Processing vocabulary
finished processing vocabulary
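
The inverted dictionary `idx_word` makes it possible to turn an encoded review back into text, which is a handy sanity check. The sketch below uses a tiny stand-in vocabulary rather than the real `word_idx`; the function `decode_review` and its `unknown` placeholder are hypothetical names introduced for illustration. The `index_from` argument mirrors the offset that the later pre-processing step adds to every index.

```python
def decode_review(encoded, idx_word, index_from=0, unknown="?"):
    """Turn a sequence of word indices back into a readable string."""
    words = [idx_word.get(i - index_from, unknown) for i in encoded]
    return " ".join(words)

# Toy vocabulary standing in for the real `word_idx` dictionary.
toy_word_idx = {"the": 1, "and": 2, "movie": 3, "brilliant": 4}
toy_idx_word = {v: k for k, v in toy_word_idx.items()}

# Raw indices, as stored in the dataset.
print(decode_review([1, 3, 4, 99], toy_idx_word))             # the movie brilliant ?
# Indices that were shifted by 3, as done before training.
print(decode_review([4, 6, 7], toy_idx_word, index_from=3))   # the movie brilliant
```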


### Text pre-processing

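Before training the network, some pre-processing steps need to be applied to the dataset. Each review is prefixed with a start marker, its word indices are shifted by `index_from`, indices outside the `max_words` vocabulary are replaced by an out-of-vocabulary marker, and the sequence is padded or truncated to a fixed length. Each word is also expanded into a fixed-length sequence of character indices for the char-level embedding.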
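
The word-level steps can be traced by hand on a toy review. This sketch repeats the two helpers from the next cell so that it runs on its own. The small values of `max_words` and `sequence_len` are assumptions chosen to keep the output readable; the notebook itself uses 5000 and 500.

```python
def replace_oov(x, oov_char, max_words):
    """Replace every index >= max_words with the out-of-vocabulary marker."""
    return [oov_char if w >= max_words else w for w in x]

def pad_sequence(x, fill_value, length):
    """Left-pad short sequences; keep only the last `length` items of long ones."""
    if len(x) >= length:
        return x[len(x) - length:]
    return [fill_value] * (length - len(x)) + x

padding_value, start_char, oov_char, index_from = 1, 2, 3, 3
max_words, sequence_len = 10, 8  # toy values for illustration only

review = [4, 1, 25, 7]  # frequency ranks as stored in the dataset

shifted = [start_char] + [w + index_from for w in review]
clipped = replace_oov(shifted, oov_char, max_words)
padded = pad_sequence(clipped, padding_value, sequence_len)

print(shifted)  # [2, 7, 4, 28, 10]
print(clipped)  # [2, 7, 4, 3, 3]
print(padded)   # [1, 1, 1, 2, 7, 4, 3, 3]

# A review longer than `sequence_len` keeps only its tail.
print(pad_sequence(list(range(12)), padding_value, sequence_len))
# [4, 5, 6, 7, 8, 9, 10, 11]
```

Shifting by `index_from` frees the small indices 1, 2, and 3 for the padding, start, and out-of-vocabulary markers, so they cannot collide with real words.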


In [2]:
def replace_oov(x, oov_char, max_words):
    """
    Replace the words out of vocabulary with `oov_char`
    :param x: a sequence
    :param max_words: the max number of words to include
    :param oov_char: words out of vocabulary because of exceeding the `max_words`
        limit will be replaced by this character

    :return: The replaced sequence
    """
    return [oov_char if w >= max_words else w for w in x]

def pad_sequence(x, fill_value, length):
    """
    Pads each sequence to the same length
    :param x: a sequence
    :param fill_value: pad the sequence with this value
    :param length: pad sequence to the length

    :return: the padded sequence
    """
    if len(x) >= length:
        return x[(len(x) - length):]
    else:
        return [fill_value] * (length - len(x)) + x

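def transform_char(w, idx_word, char_padding_value, word_len, UNK_char_value):
    """
    Transform a word index into a fixed-length sequence of character indices
    :param w: a word index (already shifted by `index_from`)
    :param idx_word: dictionary mapping word indices to words
    :param char_padding_value: pad the character sequence with this value
    :param word_len: length of the returned character sequence
    :param UNK_char_value: index used for characters missing from `char_idx`

    :return: a list of `word_len` character indices

    Note: relies on the globals `index_from` and `char_idx` defined in the next cell.
    """
    word = idx_word.get(w - index_from)
    if word:
        # Known word: map each character to its index, then pad or truncate.
        l = [char_idx.get(c, UNK_char_value) for c in word]
        return pad_sequence(l, char_padding_value, word_len)
    else:
        # Padding, start, OOV, or unknown index: return an all-padding row.
        return [char_padding_value] * word_len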

In [3]:
print('start transformation')

from zoo.common.nncontext import *
sc = init_nncontext("Sentiment Analysis Example")

padding_value = 1
start_char = 2
oov_char = 3
index_from = 3
max_words = 5000
sequence_len = 500
char_input_length = 16
char_dim = 100
char_input_dim = 30
char_output_dim = 100
char_padding_value = 0
UNK_char_value = 27

char_idx = {}
alphabet = "abcdefghijklmnopqrstuvwxyz"
for idx, char in enumerate(alphabet):
    char_idx[char] = idx + 1
    
merged_train_rdd = sc.parallelize(zip(x_train, y_train), 2) \
    .map(lambda record: ([start_char] + [w + index_from for w in record[0]], record[1])) \
    .map(lambda record: (replace_oov(record[0], oov_char, max_words), record[1])) \
    .map(lambda record: (pad_sequence(record[0], padding_value, sequence_len), record[1])) \
    .map(lambda record: Sample.from_ndarray([np.array(record[0]), np.array([transform_char(w, idx_word, 0, char_input_length, 27) for w in record[0]])], np.array(record[1])))

merged_test_rdd = sc.parallelize(zip(x_test, y_test), 2) \
    .map(lambda record: ([start_char] + [w + index_from for w in record[0]], record[1])) \
    .map(lambda record: (replace_oov(record[0], oov_char, max_words), record[1])) \
    .map(lambda record: (pad_sequence(record[0], padding_value, sequence_len), record[1])) \
    .map(lambda record: Sample.from_ndarray([np.array(record[0]), np.array([transform_char(w, idx_word, 0, char_input_length, 27) for w in record[0]])], np.array(record[1])))

print('finish transformation')

start transformation
finish transformation
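
After this transformation every review carries two features: a vector of `sequence_len` word indices and a `sequence_len` x `char_input_length` matrix of character indices. The following standalone sketch illustrates the character side on a few words. The helper `word_to_chars` is a hypothetical simplification that takes the word string directly instead of a word index. It assumes the intended behaviour is that known words are spelled out character by character, while missing words become an all-padding row.

```python
alphabet = "abcdefghijklmnopqrstuvwxyz"
char_idx = {c: i + 1 for i, c in enumerate(alphabet)}  # 'a' -> 1 ... 'z' -> 26

def word_to_chars(word, word_len, pad_value=0, unk_value=27):
    """Encode a word as exactly `word_len` character indices."""
    if not word:
        # No word available (padding, start marker, OOV): all padding.
        return [pad_value] * word_len
    ids = [char_idx.get(c, unk_value) for c in word]
    if len(ids) >= word_len:
        return ids[len(ids) - word_len:]  # keep the last `word_len` characters
    return [pad_value] * (word_len - len(ids)) + ids

print(word_to_chars("bad", 6))           # [0, 0, 0, 2, 1, 4]
print(word_to_chars("it's", 6))          # [0, 0, 9, 20, 27, 19]
print(word_to_chars("unbelievable", 6))  # [5, 22, 1, 2, 12, 5]
print(word_to_chars(None, 6))            # [0, 0, 0, 0, 0, 0]

# One row per word position gives a (number of words) x (word_len) matrix.
review_chars = [word_to_chars(w, 6) for w in ["a", "bad", None]]
print(len(review_chars), len(review_chars[0]))  # 3 6
```

The apostrophe in "it's" is not in the 26-letter alphabet, so it maps to the unknown-character index 27.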


### Word Embedding

[Word embedding](https://en.wikipedia.org/wiki/Word_embedding) is a widely used technique in natural language processing. The key idea is to encode words and phrases as distributed representations, so that each word is represented as a dense vector. Two widely used algorithms for training word vectors are [word2vec](https://arxiv.org/abs/1310.4546), published by Google, and [GloVe](https://nlp.stanford.edu/projects/glove/), published by Stanford. In this example, pre-trained GloVe vectors are loaded into the word embedding layer; note that the model code below sets `word_emb.trainable = False`, so the vectors are intended to stay fixed rather than be fine-tuned during training. BigDL provides a method to download and load GloVe in the `news20` package.
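
The lookup table is simply a matrix with one row per vocabulary entry. The sketch below shows the idea with a made-up three-dimensional "pre-trained" dictionary: words found in it reuse their vector, and words missing from it receive a small random vector drawn from the same range as in the notebook, [-0.05, 0.05]. The function `build_embedding_matrix`, the toy vectors, and the `seed` parameter are assumptions for illustration and are not part of BigDL.

```python
import random

def build_embedding_matrix(vocab, pretrained, dim, seed=0):
    """Return one row per word; words without a vector get small random values."""
    rng = random.Random(seed)
    matrix = []
    for word in vocab:
        vector = pretrained.get(word)
        if vector is None:
            vector = [rng.uniform(-0.05, 0.05) for _ in range(dim)]
        matrix.append(list(vector))
    return matrix

# Made-up vectors standing in for real GloVe entries.
toy_pretrained = {"good": [0.9, 0.1, 0.0], "bad": [-0.8, 0.2, 0.1]}
toy_vocab = ["good", "bad", "meh"]

matrix = build_embedding_matrix(toy_vocab, toy_pretrained, dim=3)
for word, row in zip(toy_vocab, matrix):
    print(word, row)
```

Row `i` of the matrix is the vector returned when the embedding layer looks up index `i`, which is why the row order has to match the word indices fed to the model.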

In [4]:
from bigdl.dataset import news20
import itertools

word_embedding_dim = 100

print('loading glove')
glove = news20.get_glove_w2v(source_dir='/tmp/.bigdl/dataset', dim=word_embedding_dim)
print('finish loading glove')

print('processing glove')
w2v = [glove.get(idx_word.get(i - index_from), np.random.uniform(-0.05, 0.05, word_embedding_dim))
        for i in range(1, max_words + 1)]
w2v = np.array(list(itertools.chain(*np.array(w2v, dtype='float'))), dtype='float') \
        .reshape([max_words, word_embedding_dim])
print('finish processing glove')

loading glove
finish loading glove
processing glove
finish processing glove


### Build models

Next, let's build the GRU model with word and char-level embedding for the sentiment classification. 


In [5]:
from zoo.pipeline.api.keras.layers import *
from zoo.pipeline.api.keras.models import *

word_input = Input(shape=(500,))
char_input = Input(shape=(500, char_input_length))

char_emb = TimeDistributed(CharEmbedding(input_dim=char_input_dim, output_dim=char_output_dim, \
                                         char_embed_dim=char_dim, input_length=char_input_length))(char_input)
# word_emb = Embedding(embedding_file="/tmp/.bigdl/dataset/glove.6B.100d.txt", word_index=word_idx , trainable=False)(word_input)
word_emb = Embedding(input_dim=max_words, output_dim=word_embedding_dim, input_length=sequence_len, )
emb = word_emb(word_input)
word_emb.set_weights(np.array([w2v]))
word_emb.trainable = False

x1 = TimeDistributed(Dense(100))(char_emb)
x2 = TimeDistributed(Dense(100))(emb)
x = merge(inputs=[x1, x2], mode = "sum" )
x = ELU()(x)
# x = merge(inputs=[char_emb, emb], mode = "concat")

x = GRU(128)(x)
x = Dense(128, activation='relu')(x)
x = Dropout(0.2)(x)

output = Dense(1, activation='sigmoid')(x)

creating: createZooKerasInput
creating: createZooKerasInput
creating: createZooKerasCharEmbedding
creating: createZooKerasTimeDistributed
creating: createZooKerasEmbedding
creating: createZooKerasDense
creating: createZooKerasTimeDistributed
creating: createZooKerasDense
creating: createZooKerasTimeDistributed
creating: createZooKerasMerge
creating: createZooKerasELU
creating: createZooKerasGRU
creating: createZooKerasDense
creating: createZooKerasDropout
creating: createZooKerasDense


In [6]:
inputs = [word_input, char_input]
model = Model(input=inputs, output=output)

creating: createZooKerasModel


In [7]:
model.summary()
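
As a rough cross-check of the architecture, the tensor shapes can be traced by hand (batch dimension omitted). This sketch only does shape arithmetic and does not call Analytics Zoo. It assumes Keras-style semantics: the char embedding yields one vector of size `char_output_dim` per word, a time-distributed `Dense` changes only the last dimension, and `GRU(128)` returns just its final state. The helper functions are hypothetical.

```python
sequence_len = 500
char_input_length = 16
word_embedding_dim = 100
char_output_dim = 100

def embedding(shape, output_dim):
    """(steps,) -> (steps, output_dim): one vector per word index."""
    return shape + (output_dim,)

def char_embedding_per_word(shape, output_dim):
    """(steps, chars) -> (steps, output_dim): one vector per word."""
    return shape[:-1] + (output_dim,)

def dense(shape, units):
    """Dense acts on the last dimension only."""
    return shape[:-1] + (units,)

def gru_last_state(shape, units):
    """(steps, features) -> (units,): only the final state is returned."""
    return (units,)

word_input = (sequence_len,)
char_input = (sequence_len, char_input_length)

emb = embedding(word_input, word_embedding_dim)
char_emb = char_embedding_per_word(char_input, char_output_dim)
x1 = dense(char_emb, 100)
x2 = dense(emb, 100)
assert x1 == x2  # an element-wise "sum" merge needs identical shapes
merged = x1      # sum and ELU both preserve the shape
hidden = dense(gru_last_state(merged, 128), 128)
output = dense(hidden, 1)

for name, shape in [("emb", emb), ("char_emb", char_emb), ("merged", merged),
                    ("hidden", hidden), ("output", output)]:
    print(name, shape)
```

The two `TimeDistributed(Dense(100))` projections are what make the sum merge possible: they bring the word and character branches to the same per-word dimension.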

### Training and Evaluation

In [8]:
from bigdl.optim.optimizer import *
from bigdl.nn.criterion import *

max_epoch = 4
batch_size = 56
model_type = 'gru'  # informational only; this variable is not used below


optimizer = Optimizer(
        model=model,
        training_rdd=merged_train_rdd,
        criterion=BCECriterion(),
        end_trigger=MaxEpoch(max_epoch),
        batch_size=batch_size,
        optim_method=Adam())

optimizer.set_validation(
        batch_size=batch_size,
        val_rdd=merged_test_rdd,
        trigger=EveryEpoch(),
        val_method=Top1Accuracy())

import datetime as dt

logdir = '/tmp/.bigdl/'
app_name = 'adam-' + dt.datetime.now().strftime("%Y%m%d-%H%M%S")

train_summary = TrainSummary(log_dir=logdir, app_name=app_name)
train_summary.set_summary_trigger("Parameters", SeveralIteration(50))
val_summary = ValidationSummary(log_dir=logdir, app_name=app_name)
optimizer.set_train_summary(train_summary)
optimizer.set_val_summary(val_summary)

creating: createBCECriterion
creating: createMaxEpoch
creating: createAdam
creating: createDistriOptimizer
creating: createEveryEpoch
creating: createTop1Accuracy
creating: createTrainSummary
creating: createSeveralIteration
creating: createValidationSummary


<bigdl.optim.optimizer.Optimizer at 0x7f1cb0f755f8>

In [9]:
%%time
train_model = optimizer.optimize()
print ("Optimization Done.")
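
The loss minimized by the optimizer is the binary cross-entropy configured through `BCECriterion`. The following plain-Python sketch illustrates that formula, assuming the loss is averaged over the batch. It is not BigDL's implementation, and the `eps` clipping is an assumption added here to avoid taking `log(0)`.

```python
import math

def binary_cross_entropy(probabilities, labels, eps=1e-12):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    total = 0.0
    for p, y in zip(probabilities, labels):
        p = min(max(p, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)

labels = [1, 0, 1, 0]
confident_right = binary_cross_entropy([0.9, 0.1, 0.8, 0.2], labels)
undecided = binary_cross_entropy([0.5, 0.5, 0.5, 0.5], labels)
confident_wrong = binary_cross_entropy([0.1, 0.9, 0.2, 0.8], labels)

print(confident_right < undecided < confident_wrong)  # True
```

A model that always outputs 0.5 scores ln 2 (about 0.693), so a training loss well below that value indicates the network has learned something beyond guessing.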

In [10]:
predictions = train_model.predict(merged_test_rdd)

def map_predict_label(l):
    if l > 0.5:
        return 1
    else:
        return 0
def map_groundtruth_label(l):
    return l.to_ndarray()[0]

y_pred = np.array([map_predict_label(s) for s in predictions.collect()])

y_true = np.array([map_groundtruth_label(s.label) for s in merged_test_rdd.collect()])

In [30]:
correct = 0
for i in range(0, y_pred.size):
    if (y_pred[i] == y_true[i]):
        correct += 1

accuracy = float(correct) / y_pred.size
print ('Prediction accuracy on validation set is: ', accuracy)

Prediction accuracy on validation set is:  0.8756
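
Accuracy alone does not show which kind of mistake the model makes. As a possible follow-up, the sketch below thresholds probabilities at 0.5 (as `map_predict_label` does) and derives precision and recall alongside accuracy. It runs on made-up toy values rather than the model's predictions; the helpers `threshold` and `binary_metrics` are hypothetical names.

```python
def threshold(probabilities, cutoff=0.5):
    """Map each probability to a hard 0/1 label."""
    return [1 if p > cutoff else 0 for p in probabilities]

def binary_metrics(y_pred, y_true):
    """Return accuracy, precision, and recall for 0/1 labels."""
    pairs = list(zip(y_pred, y_true))
    tp = sum(1 for p, t in pairs if p == 1 and t == 1)
    tn = sum(1 for p, t in pairs if p == 0 and t == 0)
    fp = sum(1 for p, t in pairs if p == 1 and t == 0)
    fn = sum(1 for p, t in pairs if p == 0 and t == 1)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

toy_probabilities = [0.9, 0.2, 0.6, 0.4, 0.8]  # made-up model outputs
toy_truth = [1, 0, 0, 1, 1]

toy_pred = threshold(toy_probabilities)
accuracy, precision, recall = binary_metrics(toy_pred, toy_truth)
print(toy_pred)  # [1, 0, 1, 0, 1]
print(accuracy)  # 0.6
```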
