# IMDB data set: opinions and recurrent neural networks

## Required imports

In [None]:
from collections import Counter
from keras.datasets import imdb
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
from pathlib import Path
import pickle

## Loading the data set

Load the data set; it is downloaded on first use and cached afterwards.

In [None]:
(x_train, y_train), (x_test, y_test) = imdb.load_data()

## Exploring the data set

### Data shape and types

Shape and type of the input and output.

In [None]:
x_train.shape, x_train.dtype, y_train.shape, y_train.dtype

In [None]:
x_test.shape, x_test.dtype, y_test.shape, y_test.dtype

The training and test sets contain 25,000 examples each.  Each input is a list of integers, each output either 0 or 1.

In [None]:
type(x_train[0]), len(x_train[0]), type(x_train[0][0])

In [None]:
set(y_train)

Each training input consists of a list of integers.  Each integer uniquely represents a word, the list represents a text as an ordered sequence of words. The corresponding output is an integer, either 0 or 1, representing the opinion expressed in the review text.

### Review lengths

We can visualize the distribution of the review lengths in two histograms, one for the training input and one for the test input.

In [None]:
figure, axes = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
for i, reviews in enumerate((x_train, x_test)):
    review_lengths = map(len, reviews)
    axes[i].hist(list(review_lengths), bins=50);
figure.tight_layout()
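
A few summary statistics can complement the histograms.  The following is a minimal sketch using only the standard library; the helper `length_summary` and the toy data are illustrative assumptions, not part of the original analysis.  In the notebook, `x_train` or `x_test` could be passed instead of `toy_reviews`.

```python
import statistics


def length_summary(reviews):
    """Return basic statistics on the number of word indices per review."""
    lengths = [len(review) for review in reviews]
    return {
        'min': min(lengths),
        'median': statistics.median(lengths),
        'mean': statistics.mean(lengths),
        'max': max(lengths),
    }


# Toy stand-in for x_train: three "reviews" of lengths 3, 6 and 2.
toy_reviews = [[1, 14, 22], [1, 5, 6, 7, 8, 9], [1, 4]]
summary = length_summary(toy_reviews)
print(summary)
```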

### Word distribution

The distribution of the words, or features, can also be visualized.

The following computation is rather time consuming, so its results are pickled.  They can then be reused for demonstration purposes without redoing the computation.

In [None]:
pickle_path = Path('feature_count.pkl')
if not pickle_path.exists():
    # Count how often each word index occurs in the training reviews.
    feature_counter = Counter()
    for review in x_train:
        feature_counter.update(review)
    with pickle_path.open('wb') as pickle_file:
        pickle.dump(feature_counter, pickle_file)
else:
    # Reuse the counts cached by a previous run.
    with pickle_path.open('rb') as pickle_file:
        feature_counter = pickle.load(pickle_file)

In [None]:
feature_counter.most_common(10)

Note that the most common word has index 4 rather than 0 or 1, which may be unexpected.

In [None]:
feature_counter[0], feature_counter[1], feature_counter[2], feature_counter[3]

Index 0 serves as padding, 1 as start of a review (note that it occurs as many times as there are reviews in the training set).  For more details, see the section on texts below.

In [None]:
len(feature_counter)

In [None]:
plt.semilogy(list(feature_counter.keys()), list(feature_counter.values()),
             '.', alpha=0.3, markersize=1)
plt.xlabel('feature')
plt.ylabel('frequency');

The features, i.e., the words, follow a Zipf-like distribution, which doesn't come as a surprise.  Since this computation is time consuming, we assume similar results for the test set.
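
One way to make "Zipf-like" more concrete is to fit a straight line to log(frequency) versus log(rank); a slope close to -1 corresponds to the classic Zipf law.  The sketch below is an illustration under assumptions: the helper `rank_frequency_slope` is hypothetical, and the toy counter is constructed so that its frequencies are exactly proportional to 1/rank.  In the notebook, `feature_counter` could be passed instead, although the fitted slope for real data will only be approximately linear.

```python
from collections import Counter
import math


def rank_frequency_slope(counter):
    """Least-squares slope of log(frequency) against log(rank)."""
    frequencies = sorted(counter.values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(frequency) for frequency in frequencies]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    covariance = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    variance = sum((x - x_mean) ** 2 for x in xs)
    return covariance / variance


# Toy counter: frequencies 1200, 600, 400, 300, 240, 200, i.e., 1200/rank.
toy_counter = Counter({word: 1200 // rank
                       for rank, word in enumerate('abcdef', start=1)})
slope = rank_frequency_slope(toy_counter)
print(slope)
```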

Note that the minimum index is 1, the maximum 88586.

In [None]:
min(feature_counter.keys()), max(feature_counter.keys())

### Sentiment distribution

The distribution of the opinions, 0 or 1, can be visualized in bar plots, one for the training output and one for the test output.

In [None]:
figure, axes = plt.subplots(1, 2)
for i, opinions in enumerate((y_train, y_test)):
    counter = [0, 0]
    for opinion in opinions:
        counter[opinion] += 1
    axes[i].bar(['0', '1'], counter, 0.5);
figure.tight_layout()

Positive and negative opinions are evenly balanced in both the training set and the test set.
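
The balance can also be checked numerically rather than visually.  The following sketch computes the fraction of each label; the helper `class_fractions` and the toy labels are assumptions for illustration.  In the notebook, `y_train` or `y_test` could be passed instead.

```python
from collections import Counter


def class_fractions(labels):
    """Return the fraction of examples carrying each label."""
    counts = Counter(int(label) for label in labels)
    total = sum(counts.values())
    return {label: count / total for label, count in sorted(counts.items())}


# Toy stand-in for y_train with three negative and three positive opinions.
toy_labels = [1, 0, 0, 1, 1, 0]
fractions = class_fractions(toy_labels)
print(fractions)
```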

## Texts

The texts are represented as lists of integers, each integer representing a specific word.  The word index, i.e., a dictionary that has the words as keys and the integers as values, is also available in the IMDB data set.

In [None]:
word_index = imdb.get_word_index()

In [None]:
word_index['the']

However, in order to translate the lists of integers into the original reviews, the way the data was loaded has to be taken into account.  The `load_data` function has some optional arguments that determine the encoding.  Index 0 is usually reserved for padding, i.e., to ensure that short sequences can be extended to the required length.  Index 1 indicates the start of a review (`start_char`), while index 2 (`oov_char`) represents words that have not been indexed, either because they were not part of the data set, because they were used too infrequently, or, if the top words are left out, because they were too common to be considered informative.  Hence, the actual word index starts at 4.
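
To illustrate the role of the padding index, here is a minimal sketch of bringing reviews to a fixed length.  The helper `pad_review` is hypothetical, and padding or truncating on the left is a choice made for this example; it is not taken from the original notebook.

```python
def pad_review(review, length, padding_idx=0):
    """Left-pad or left-truncate a review to exactly `length` indices."""
    if len(review) >= length:
        # Keep the last `length` indices of a review that is too long.
        return review[len(review) - length:]
    return [padding_idx] * (length - len(review)) + review


short_review = pad_review([1, 14, 22], 5)
long_review = pad_review([1, 14, 22, 16, 43, 530], 5)
print(short_review)
print(long_review)
```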

The word index has to be shifted by `index_from`, and the strings representing padding, start and unknown added.  The following function will do this, compute the reverse dictionary, and return both.

In [None]:
def compute_indices(word_index=None, index_from=3, padding_idx=0, start_idx=1, unknown_idx=2):
    """Return the shifted word-to-index dictionary and its reverse."""
    if word_index is None:
        word_index = imdb.get_word_index()
    # Shift all indices to make room for the special tokens.
    word_to_idx = {k: v + index_from for k, v in word_index.items()}
    word_to_idx['<pad>'] = padding_idx
    word_to_idx['<start>'] = start_idx
    word_to_idx['<unknown>'] = unknown_idx
    return word_to_idx, {v: k for k, v in word_to_idx.items()}

In [None]:
word_to_idx, idx_to_word = compute_indices(word_index)
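
The effect of the shift is easiest to see on a tiny vocabulary.  The sketch below encodes and decodes a sentence with a four-word toy index; the helpers `shift_word_index`, `encode` and `decode`, as well as all `toy_` names, are assumptions introduced for illustration so that the notebook's own `word_to_idx` and `idx_to_word` are left untouched.

```python
def shift_word_index(word_index, index_from=3):
    """Shift a word index and add the three special tokens."""
    word_to_idx = {word: idx + index_from for word, idx in word_index.items()}
    word_to_idx.update({'<pad>': 0, '<start>': 1, '<unknown>': 2})
    return word_to_idx, {idx: word for word, idx in word_to_idx.items()}


def encode(text, word_to_idx):
    """Turn a text into a list of indices, starting with the start marker."""
    unknown_idx = word_to_idx['<unknown>']
    words = text.lower().split()
    return ([word_to_idx['<start>']] +
            [word_to_idx.get(word, unknown_idx) for word in words])


def decode(indices, idx_to_word):
    """Turn a list of indices back into a text."""
    return ' '.join(idx_to_word[idx] for idx in indices)


# Toy word index: 'the' is the most frequent word and hence has index 1.
toy_word_index = {'the': 1, 'movie': 2, 'was': 3, 'great': 4}
toy_word_to_idx, toy_idx_to_word = shift_word_index(toy_word_index)
encoded = encode('The movie was great fun', toy_word_to_idx)
decoded = decode(encoded, toy_idx_to_word)
print(encoded)
print(decoded)
```

The word "fun" is not in the toy index, so it is mapped to index 2 and decoded as `<unknown>`.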

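The first review in the training set can now be "translated" back to English.  Note that there is no punctuation.  Since the data was loaded above without restricting the vocabulary, no `<unknown>` should crop up in the text; if only the most common words were loaded, e.g., the 1,000 most frequent ones, quite a number of words would be replaced by `<unknown>`.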

In [None]:
print(' '.join(idx_to_word[idx] for idx in x_train[0]))

The sentiment expressed in this review is positive, so the output should be 1.

In [None]:
y_train[0]

Find the first review expressing a negative sentiment.

In [None]:
neg_idx = list(y_train).index(0)

In [None]:
print(' '.join(idx_to_word[idx] for idx in x_train[neg_idx]))

Clearly, the reviewer was not taken by the movie.

## Stop words

When you display the top-25 words, it is quite clear that most will not be very informative, except "but" at index 20 and "not" at index 23.

In [None]:
for i in range(4, 4 + 26):
    print(i, idx_to_word[i])

It would most likely be fine to exclude all of the most frequent words up to "but", and to limit the vocabulary to the 5,000 most frequent words.  This reduces the size of the data set, and hence the amount of computation required to train the network.
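
The effect of such a restriction can be sketched in plain Python.  The helper `restrict_vocabulary` is hypothetical; as far as we understand it, it mirrors what the `num_words` and `skip_top` arguments of `load_data` do, but that correspondence is an assumption and should be checked against the Keras documentation.

```python
def restrict_vocabulary(review, num_words, skip_top=0, unknown_idx=2):
    """Replace indices outside [skip_top, num_words) by `unknown_idx`."""
    return [idx if skip_top <= idx < num_words else unknown_idx
            for idx in review]


# Toy review containing very frequent, moderately frequent and rare indices.
toy_review = [1, 4, 21, 7, 4999, 5000, 88586]
restricted = restrict_vocabulary(toy_review, num_words=5000, skip_top=5)
print(restricted)
```

Note that in this sketch the start marker (index 1) is below `skip_top` and is therefore replaced by the unknown index as well.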