# NLP Data Pipeline: IMDb Movie Reviews

In this notebook, we demonstrate how to develop a data pipeline for sentiment analysis on the IMDb Movie Review dataset. By the end of the notebook, you will understand:

- how to develop a data pipeline that numericalizes text and generates input for neural networks.
- how to sample variable-length data so that it can be batched efficiently.
- how to put everything together in a modular way with the help of the abstractions in Gluon.

You will learn the following concepts for a basic data pipeline:

- Dataset
- Transform functions
- Vocabulary and numericalization

And for efficient batched data loading:

- Batchify functions
- Bucketing samplers
- Data loader

We use the IMDb dataset for sentiment analysis and treat the task as binary classification.
- The dataset has a training part and a test part, each containing 25,000 movie reviews downloaded from IMDb.
- In each part, the number of reviews labeled "positive" equals the number labeled "negative".

In [None]:
from mxnet import gluon
import gluonnlp as nlp

## Data Pipeline in Gluon

### Dataset

``` python
class Dataset(object):
    def __getitem__(self, idx):
        ...
    
    def __len__(self):
        ...

    def transform(self, fn, lazy=True):
        # Returns a new dataset with each sample
        # transformed by the function `fn`.
        ...
```
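To make the interface above concrete, here is a minimal sketch of a list-backed dataset with a lazy `transform`. The names `ListDataset`, `LazyTransformDataset`, and `_apply` are hypothetical and are not part of Gluon. The tuple unpacking mirrors how later cells in this notebook pass two-argument functions such as `get_first(first, second)` to `transform`; the real Gluon implementation may differ in its details.

```python
def _apply(fn, sample):
    # Assumption: tuple samples are unpacked into positional arguments,
    # matching how two-argument functions are used with `transform` here.
    return fn(*sample) if isinstance(sample, tuple) else fn(sample)


class ListDataset:
    """Hypothetical in-memory dataset backed by a Python list."""

    def __init__(self, samples):
        self._samples = list(samples)

    def __getitem__(self, idx):
        return self._samples[idx]

    def __len__(self):
        return len(self._samples)

    def transform(self, fn, lazy=True):
        if lazy:
            # Defer the work until a sample is actually accessed.
            return LazyTransformDataset(self, fn)
        # Eager mode: transform every sample once, up front.
        return ListDataset(_apply(fn, self[i]) for i in range(len(self)))


class LazyTransformDataset(ListDataset):
    """Applies `fn` to a sample each time it is accessed."""

    def __init__(self, base, fn):
        self._base = base
        self._fn = fn

    def __getitem__(self, idx):
        return _apply(self._fn, self._base[idx])

    def __len__(self):
        return len(self._base)


reviews = ListDataset([("a fine film", 8), ("dull and long", 3)])
tokenized = reviews.transform(lambda text, score: (text.split(), score))
first_tokens, first_score = tokenized[0]
```

A lazy transform recomputes `fn` on every access, which saves memory but costs time when samples are read repeatedly; passing `lazy=False` trades memory for a single up-front pass.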

In [None]:
imdb_train = nlp.data.IMDB('train')
imdb_test = nlp.data.IMDB('test')

In [None]:
text, score = imdb_train[0] # (text, score)
print('text: "{}"'.format(text))
print('score: {}'.format(score))

### Transform functions

In [None]:
def tokenize_while_preserving_score(sample):
    sentence, score = sample
    return sentence.split(), score

imdb_train_tokens_score = imdb_train.transform(tokenize_while_preserving_score)
imdb_test_tokens_score = imdb_test.transform(tokenize_while_preserving_score)

In [None]:
tokens, score = imdb_train_tokens_score[0] # (tokens, score)
print('tokens: "{}"'.format(tokens[:20]))
print('score: {}'.format(score))

In [None]:
length_clip_20 = nlp.data.ClipSequence(20)
print('Original length: {}'.format(len(tokens)))
print('Clipped length: {}'.format(len(length_clip_20(tokens))))
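The two transforms shown so far, whitespace tokenization and length clipping, can be combined into a single step. The following is a plain-Python sketch under the assumption that clipping simply keeps the first `max_length` tokens; `tokenize_and_clip` is a hypothetical helper, not a GluonNLP function.

```python
def tokenize_and_clip(sample, max_length):
    """Split the text on whitespace and keep at most `max_length` tokens."""
    text, score = sample
    tokens = text.split()  # whitespace split: punctuation stays attached to words
    return tokens[:max_length], score


example = ("A slow start, but the ending is great!", 9)
tokens_out, score_out = tokenize_and_clip(example, 5)
print(tokens_out, score_out)
```

Note that `str.split()` leaves punctuation attached to the neighboring word (for example `start,`), so `start` and `start,` would count as different tokens when building a vocabulary.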

### Vocabulary and Numericalization

In [None]:
import itertools

def get_first(first, second):
    return first

# Keep only the token lists, then flatten them into one stream of tokens.
imdb_train_tokens = imdb_train_tokens_score.transform(get_first)
tokens_iter = itertools.chain.from_iterable(imdb_train_tokens)

token_counts = nlp.data.count_tokens(tokens_iter)
print('# the: {}'.format(token_counts['the']))

imdb_vocab = nlp.Vocab(token_counts, min_freq=10)
print(imdb_vocab)
print(imdb_vocab.idx_to_token[:10] + ["..."])

In [None]:
indices = imdb_vocab[tokens]
print(list(zip(tokens, indices))[:20])

In [None]:
print('Unknown token {} with index {}'.format(imdb_vocab.unknown_token,
                                              imdb_vocab[imdb_vocab.unknown_token]))
print('Padding token {} with index {}'.format(imdb_vocab.padding_token,
                                              imdb_vocab[imdb_vocab.padding_token]))
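The idea behind counting tokens, filtering by `min_freq`, and mapping tokens to indices can be sketched without any library. In this sketch, `build_vocab` and `numericalize` are hypothetical helpers, and the special-token strings and their positions (`<unk>` at index 0, `<pad>` at index 1) are assumptions made for illustration; check the printed output of the cells above for what `nlp.Vocab` actually uses.

```python
from collections import Counter
import itertools

UNKNOWN_TOKEN = "<unk>"  # assumed spelling, for illustration only
PADDING_TOKEN = "<pad>"  # assumed spelling, for illustration only


def build_vocab(token_lists, min_freq=1):
    """Count tokens and keep those that occur at least `min_freq` times."""
    counts = Counter(itertools.chain.from_iterable(token_lists))
    # Order kept tokens by descending count, breaking ties alphabetically.
    kept = sorted((tok for tok, c in counts.items() if c >= min_freq),
                  key=lambda tok: (-counts[tok], tok))
    idx_to_token = [UNKNOWN_TOKEN, PADDING_TOKEN] + kept
    token_to_idx = {tok: i for i, tok in enumerate(idx_to_token)}
    return idx_to_token, token_to_idx


def numericalize(tokens, token_to_idx):
    """Map tokens to indices; out-of-vocabulary tokens map to the unknown index."""
    unk = token_to_idx[UNKNOWN_TOKEN]
    return [token_to_idx.get(tok, unk) for tok in tokens]


corpus = [["the", "film", "was", "good"],
          ["the", "plot", "was", "thin"],
          ["the", "end"]]
idx_to_token, token_to_idx = build_vocab(corpus, min_freq=2)
ids = numericalize(["the", "film", "was", "odd"], token_to_idx)
print(idx_to_token)  # ['<unk>', '<pad>', 'the', 'was']
print(ids)           # [2, 0, 3, 0]
```

Raising `min_freq` shrinks the vocabulary and sends more rare tokens to the unknown index, which is the trade-off behind the `min_freq=10` setting used above.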

### API Docs

- [gluonnlp.data.IMDB](https://gluon-nlp.mxnet.io/v0.9.x/api/modules/data.html) dataset.
- [gluonnlp.data built-in transform](https://gluon-nlp.mxnet.io/v0.9.x/api/modules/data.html#transforms) functions.
- [gluonnlp.Vocab](https://gluon-nlp.mxnet.io/v0.9.x/api/modules/vocab.html#vocabulary) class and [gluonnlp.data.count_tokens](https://gluon-nlp.mxnet.io/v0.9.x/api/modules/data.html#gluonnlp.data.count_tokens) function.
- [Vocabulary and Embedding API](http://gluon-nlp.mxnet.io/v0.9.x/api/notes/vocab_emb.html) notes.
- [Data Loading API](https://gluon-nlp.mxnet.io/v0.9.x/api/notes/data_api.html) notes.

### Exercise 1: Preprocess and numericalize the IMDb dataset

- Complete the `preprocess` function.

In [None]:
length_clip_500 = nlp.data.ClipSequence(500)

def preprocess(tokens, score):
    # Implement the following preprocessing logic:
    # 1. Convert scores to binary labels:
    #   - 1 for scores higher than 5
    #   - 0 otherwise
    # 2. Cap the sample length at 500 using the `length_clip_500` function.
    # 3. Numericalize the clipped tokens with `imdb_vocab`.
    raise NotImplementedError
    return indices, label

In [None]:
preprocess(tokens, score)

In [None]:
train_dataset = imdb_train_tokens_score.transform(preprocess, lazy=False)
test_dataset = imdb_test_tokens_score.transform(preprocess, lazy=False)
print(train_dataset[0])
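If you want to check your approach to Exercise 1, the three steps can be sketched on a toy vocabulary in plain Python. This is only an illustration of the logic, not the reference solution: `toy_vocab` and `preprocess_sketch` are hypothetical, and in the notebook you would use `length_clip_500` and `imdb_vocab` instead of the slice and the dictionary lookup.

```python
MAX_LENGTH = 500

# Hypothetical stand-in for `imdb_vocab`; indices are made up for illustration.
toy_vocab = {"<unk>": 0, "<pad>": 1, "great": 2, "movie": 3}


def preprocess_sketch(tokens, score, token_to_idx=toy_vocab, max_length=MAX_LENGTH):
    # 1. Binary label: 1 for scores higher than 5, 0 otherwise.
    label = int(score > 5)
    # 2. Cap the sample length.
    clipped = tokens[:max_length]
    # 3. Numericalize; unseen tokens fall back to the unknown index.
    unk = token_to_idx["<unk>"]
    indices = [token_to_idx.get(tok, unk) for tok in clipped]
    return indices, label


indices_out, label_out = preprocess_sketch(["great", "movie", "indeed"], 8)
print(indices_out, label_out)  # [2, 3, 0] 1
```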

## Efficient Data Loading and Sampling

- Convert the numericalized text into array-like batches for efficient processing.
- Use a sampling strategy that reduces the computation wasted on padding.

### Batchify Indices into Arrays

In [None]:
import numpy as np

# Length of every numericalized review in the training set.
sample_lengths = train_dataset.transform(lambda x, y: len(x))
print('Length min/max/stdev: {}/{}/{:.2f}'.format(np.min(sample_lengths),
                                                  np.max(sample_lengths),
                                                  np.std(sample_lengths)))

In [None]:
padding_val = imdb_vocab[imdb_vocab.padding_token]
pad_tokens = nlp.data.batchify.Pad(axis=0, pad_val=padding_val)

train_token_indices = train_dataset.transform(get_first)

padded_tokens = pad_tokens([train_token_indices[i] for i in range(10)])
padded_tokens.shape

In [None]:
stack_labels = nlp.data.batchify.Stack(dtype='float32')

def get_second(first, second):
    return second

train_labels = train_dataset.transform(get_second)

stacked_labels = stack_labels([train_labels[i] for i in range(10)])
stacked_labels.shape

In [None]:
batchify_fn = nlp.data.batchify.Tuple(pad_tokens, stack_labels)
batchify_fn([train_dataset[i] for i in range(10)])
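The padding and stacking steps above can be sketched with plain Python lists. The helpers `pad_batch` and `batchify` below are hypothetical and return nested lists rather than MXNet arrays; they only illustrate what the batchify functions are doing conceptually.

```python
def pad_batch(sequences, pad_val):
    """Right-pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_val] * (max_len - len(seq)) for seq in sequences]


def batchify(samples, pad_val):
    """Turn a list of (indices, label) samples into (padded indices, labels)."""
    sequences, labels = zip(*samples)
    return pad_batch(sequences, pad_val), [float(label) for label in labels]


samples = [([5, 7, 9], 1), ([4], 0), ([8, 2], 1)]
padded, labels = batchify(samples, pad_val=1)
print(padded)  # [[5, 7, 9], [4, 1, 1], [8, 2, 1]]
print(labels)  # [1.0, 0.0, 1.0]
```

Every sample in a batch is padded to the longest sample in that batch, so one long review forces padding on all of its batch mates. That cost is what the sampling strategy in the next section tries to reduce.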

### Sampling for Efficient Batching

In [None]:
batch_size = 64
data_loader = gluon.data.DataLoader(train_dataset, batchify_fn=batchify_fn,
                                    batch_size=batch_size)
print('Average length of batches is {:.2f}'.format(np.mean([x.shape[1] for x, y in data_loader])))

<img src="img/no_bucket_strategy.png" style="width: 100%;"/>

<img src="img/fixed_bucket_strategy_ratio0.7.png" style="width: 100%;"/>

In [None]:
bucket_sampler = nlp.data.FixedBucketSampler(sample_lengths, batch_size=64, shuffle=True)
print(bucket_sampler.stats())

In [None]:
bucket_sampler_iter = iter(bucket_sampler)
batch_indices = next(bucket_sampler_iter)
batch_sample_lengths = [len(train_dataset[i][0]) for i in batch_indices]
print('Batch length min/max/stdev: {}/{}/{:.2f}'.format(np.min(batch_sample_lengths),
                                                        np.max(batch_sample_lengths),
                                                        np.std(batch_sample_lengths)))
print('Samples in first batch: ', batch_indices[:10] + ['...'])
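To see why bucketing helps, the sketch below compares the total padded size of batches taken in dataset order with batches drawn from equal-width length buckets. It is a simplified illustration, not the `FixedBucketSampler` algorithm: the helpers are hypothetical, there is no shuffling, and the real sampler chooses its bucket boundaries differently (see `ratio` and `bucket_scheme`).

```python
def padded_size(lengths, batches):
    """Total positions after padding each batch to its longest sample."""
    return sum(len(batch) * max(lengths[i] for i in batch) for batch in batches)


def sequential_batches(num_samples, batch_size):
    """Batches of consecutive indices, ignoring sample lengths."""
    indices = list(range(num_samples))
    return [indices[i:i + batch_size] for i in range(0, num_samples, batch_size)]


def fixed_bucket_batches(lengths, batch_size, num_buckets):
    """Group indices into equal-width length buckets, then batch within each bucket."""
    lo, hi = min(lengths), max(lengths)
    width = max(1, -(-(hi - lo + 1) // num_buckets))  # ceiling division
    buckets = [[] for _ in range(num_buckets)]
    for idx, length in enumerate(lengths):
        buckets[min((length - lo) // width, num_buckets - 1)].append(idx)
    batches = []
    for bucket in buckets:
        for i in range(0, len(bucket), batch_size):
            batches.append(bucket[i:i + batch_size])
    return batches


lengths = [12, 480, 35, 500, 20, 450, 40, 470]
plain = sequential_batches(len(lengths), batch_size=2)
bucketed = fixed_bucket_batches(lengths, batch_size=2, num_buckets=2)

print(sum(lengths))                    # 2007 real tokens
print(padded_size(lengths, plain))     # 3800 positions without bucketing
print(padded_size(lengths, bucketed))  # 2090 positions with bucketing
```

In this toy example, batching in dataset order pairs each short review with a long one, so nearly half of the padded positions are wasted, while bucketing keeps samples of similar length together.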

### API Docs

- [gluonnlp.data.batchify](https://gluon-nlp.mxnet.io/v0.9.x/api/modules/data.batchify.html) functions.
- [gluonnlp.data.FixedBucketSampler](https://gluon-nlp.mxnet.io/v0.9.x/api/modules/data.html#gluonnlp.data.FixedBucketSampler) and [other sampler](https://gluon-nlp.mxnet.io/v0.9.x/api/modules/data.html#samplers) classes.
- [gluon.data.DataLoader](https://mxnet.apache.org/api/python/docs/api/gluon/data/index.html#mxnet.gluon.data.DataLoader) class.
- [Data Loading API](https://gluon-nlp.mxnet.io/v0.9.x/api/notes/data_api.html) notes.

### Exercise 2: Load the IMDb dataset

- Create fixed bucket samplers for the IMDb training and test datasets.
- Put `batchify_fn` and the fixed bucket samplers together and create dataloaders for training and test datasets.
- Examine the stats from the samplers. Play with `ratio` and `bucket_scheme` (see [bucketing schemes](https://gluon-nlp.mxnet.io/v0.9.x/api/modules/data.html#samplers)), and see how they affect the result.

In [None]:
train_sampler = nlp.data.FixedBucketSampler(...)
test_sampler = nlp.data.FixedBucketSampler(...)

In [None]:
train_dataloader = gluon.data.DataLoader(...)
test_dataloader = gluon.data.DataLoader(...)

In [None]:
print('Average length of batches is {:.2f}'.format(np.mean([x.shape[1] for x, y in train_dataloader])))

In [None]:
next(iter(train_dataloader))