## RNN
We'll be using a recurrent neural network (RNN), as RNNs are commonly used for analysing sequences. An RNN takes in a sequence of words one at a time, carrying a hidden state forward from each word to the next.
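
To make the idea of processing a sequence step by step concrete, here is a minimal sketch of a plain (vanilla) RNN recurrence in NumPy. The sizes, the random weights, and the names `W_xh`, `W_hh`, and `b_h` are assumptions for illustration only; this is not the model built in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, seq_len = 4, 3, 5  # hypothetical sizes

# Randomly initialised weights; a real network would learn these
W_xh = rng.normal(scale=0.1, size=(input_dim, hidden_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

# Stand-in for a sequence of word vectors, one row per word
inputs = rng.normal(size=(seq_len, input_dim))

h = np.zeros(hidden_dim)  # initial hidden state
for x_t in inputs:
    # The new state depends on the current word and the previous state
    h = np.tanh(x_t @ W_xh + h @ W_hh + b_h)

print(h.shape)
```

The final `h` summarises the whole sequence, which is why a classifier can be attached to it.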

In [42]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

## Read dataset

In [43]:
imdb_data = pd.read_pickle('pickles/cleaned_data2.pkl')

reviews = list(imdb_data.review)
sentiments = list(imdb_data.sentiment)

## Data Pre-Processing
The first step when building a neural network model is getting your data into the proper form to feed into the network. Since we're using embedding layers, we'll need to encode each word with an integer. We'll also want to clean it up a bit.

In [44]:
# replace newlines inside each review so that '\n' can act as the review separator
reviews = [review.replace('\n', ' ') for review in reviews]
reviews = '\n'.join(reviews)

In [45]:
from string import punctuation

# lowercase to standardize, then get rid of punctuation
reviews = reviews.lower()
all_text = ''.join([c for c in reviews if c not in punctuation])

# split by new lines and spaces
reviews_split = all_text.split('\n')
all_text = ' '.join(reviews_split)

# create a list of words
words = all_text.split()

## Encoding the words
The embedding lookup requires that we pass in integers to our network. The easiest way to do this is to create dictionaries that map the words in the vocabulary to integers. Then we can convert each of our reviews into integers so they can be passed into the network.
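
The reason integers are needed can be sketched as follows: an embedding layer is essentially a table with one row of weights per token, and the integer picks the row. The sizes and the random `embedding_matrix` below are hypothetical stand-ins, not the layer used later in the network.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 5      # hypothetical: 4 words plus index 0 reserved for padding
embedding_dim = 3   # hypothetical embedding width

# One row of weights per integer token; a real layer would learn these
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

tokens = np.array([1, 2, 1, 3])       # an already-encoded toy review
embedded = embedding_matrix[tokens]   # integer indexing selects rows

print(embedded.shape)
```

Repeated tokens (here the two `1`s) map to identical rows, so the same word always gets the same vector.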

In [48]:
# Counter tallies how often each word occurs
from collections import Counter


## Build a dictionary that maps words to integers
counts = Counter(words)
vocab = sorted(counts, key=counts.get, reverse=True)
vocab_to_int = {word: ii for ii, word in enumerate(vocab, 1)}

## use the dict to tokenize each review in reviews_split
## store the tokenized reviews in reviews_ints
reviews_ints = []
for review in reviews_split:
    reviews_ints.append([vocab_to_int[word] for word in review.split()])
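
To see what this mapping produces, here is the same approach on a tiny made-up corpus (the two toy reviews are an assumption for illustration, not IMDB data). Note that `enumerate(vocab, 1)` starts at 1, which leaves 0 free to act as the padding value later on.

```python
from collections import Counter

# Hypothetical toy corpus
toy_reviews = ["good movie good cast", "bad movie"]
toy_words = " ".join(toy_reviews).split()

# Most frequent words get the smallest integers; ties keep first-seen order
counts = Counter(toy_words)
vocab = sorted(counts, key=counts.get, reverse=True)
vocab_to_int = {word: ii for ii, word in enumerate(vocab, 1)}

# Encode each review as a list of integers
toy_ints = [[vocab_to_int[w] for w in r.split()] for r in toy_reviews]

print(vocab_to_int)  # {'good': 1, 'movie': 2, 'cast': 3, 'bad': 4}
print(toy_ints)      # [[1, 2, 1, 3], [4, 2]]
```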

### Test your code

As a test that you've implemented the dictionary correctly, print out the number of unique words in your vocabulary and the contents of the first tokenized review.

In [49]:
# stats about vocabulary
print('Unique words: ', len(vocab_to_int))
print()

# print tokens in first review
print('Tokenized review: \n', reviews_ints[:1])

Unique words:  108519

Tokenized review: 
 [[5, 234, 355, 13, 408, 2827, 192, 162, 1600, 117, 556, 107, 32, 36, 2833, 2827, 1083, 11283, 18, 519, 94, 117, 298, 29, 1338, 20, 5028, 400, 8445, 20, 581, 1702, 1060, 648, 326, 519, 2954, 237, 65, 298, 170, 2827, 6247, 303, 11080, 5455, 1810, 517, 38409, 1116, 1279, 16848, 440, 3392, 1920, 764, 1912, 1785, 872, 236, 17465, 12374, 268, 3662, 3064, 440, 282, 50259, 3468, 9565, 5904, 1073, 917, 2189, 5085, 14921, 239, 2036, 5940, 415, 6818, 6445, 50, 156, 171, 14, 39, 208, 791, 20, 603, 103, 201, 20, 14, 4, 1482, 662, 105, 280, 1038, 2216, 183, 662, 614, 662, 748, 2827, 4, 714, 111, 32, 192, 56, 132, 2833, 1383, 1735, 24, 4, 39, 1432, 13, 368, 920, 2827, 104, 7742, 268, 472, 1186, 519, 519, 5154, 2907, 1723, 162, 2604, 17466, 4346, 162, 102, 470, 12, 171, 19, 989, 692, 646, 4346, 98, 764, 4068, 603, 296, 609, 982, 764, 345, 13, 2827, 123, 138, 1792, 2609, 233, 1262, 12, 445, 3403, 389]]


### Removing Outliers

As an additional pre-processing step, we want to make sure that our reviews are in good shape for standard processing. That is, our network will expect a standard input text size, and so, we'll want to shape our reviews into a specific length. We'll approach this task in two main steps:

1. Getting rid of extremely long or short reviews; the outliers
1. Padding/truncating the remaining data so that we have reviews of the same length.

Before we pad our review text, we should check for reviews of extremely short or long lengths; outliers that may mess with our training.



In [51]:
# outlier review stats
review_lens = Counter([len(x) for x in reviews_ints])
print("Zero-length reviews: {}".format(review_lens[0]))
print("Maximum review length: {}".format(max(review_lens)))

Zero-length reviews: 0
Maximum review length: 1458
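
Here there are no zero-length reviews, so nothing has to be removed. If there were, one way to drop outliers while keeping reviews and labels aligned could look like the sketch below. The toy data and the `min_len` and `max_len` cutoffs are assumptions for illustration; in this notebook, long reviews are handled by truncation instead.

```python
# Hypothetical toy data: encoded reviews and their labels
toy_ints = [[4, 2], [], [1, 2, 1, 3], [5] * 10]
toy_labels = [0, 1, 1, 0]

min_len, max_len = 1, 8  # assumed cutoffs

# Indices of the reviews whose length falls inside the allowed range
keep = [i for i, r in enumerate(toy_ints) if min_len <= len(r) <= max_len]

# Filter reviews and labels with the same indices so they stay paired
kept_ints = [toy_ints[i] for i in keep]
kept_labels = [toy_labels[i] for i in keep]

print(keep)         # [0, 2]
print(kept_labels)  # [0, 1]
```

Filtering by index rather than filtering the two lists separately is what keeps each label attached to its review.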


## Padding sequences
To deal with both short and very long reviews, we'll pad or truncate all our reviews to a specific length. For reviews shorter than some seq_length, we'll pad with 0s on the left, so the review's words sit at the end of the row. For reviews longer than seq_length, we truncate them to the first seq_length words. A good seq_length, in this case, is 200.

In [52]:
def pad_features(reviews_ints, seq_length):
    ''' Return features of reviews_ints, where each review is left-padded with 0's 
        or truncated to the input seq_length.
    '''
    
    # getting the correct rows x cols shape
    features = np.zeros((len(reviews_ints), seq_length), dtype=int)

    # for each review, keep at most the first seq_length tokens and write
    # them at the right-hand end of the row, leaving zeros on the left
    for i, row in enumerate(reviews_ints):
        row = row[:seq_length]
        if len(row) > 0:  # an empty review stays as an all-zero row
            features[i, -len(row):] = row
    
    return features

In [54]:
# Test your implementation!

seq_length = 200

features = pad_features(reviews_ints, seq_length=seq_length)

## test statements - do not change - ##
assert len(features)==len(reviews_ints), "Your features should have as many rows as reviews."
assert len(features[0])==seq_length, "Each feature row should contain seq_length values."
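
The assertions above only check shapes. To see the padding and truncation behaviour itself, here is a small self-contained sketch on toy data. `left_pad` is a hypothetical helper that mirrors the idea of `pad_features`; the toy reviews are made up.

```python
import numpy as np

def left_pad(rows, seq_length):
    """Left-pad each row with 0s, or truncate it to its first seq_length items."""
    out = np.zeros((len(rows), seq_length), dtype=int)
    for i, row in enumerate(rows):
        trimmed = row[:seq_length]
        if len(trimmed) > 0:  # empty rows stay all zeros
            out[i, -len(trimmed):] = trimmed
    return out

# One short review, one very short review, one review longer than seq_length
toy = [[1, 2, 1, 3], [4, 2], [7, 8, 9, 10, 11, 12]]
padded = left_pad(toy, seq_length=5)
print(padded)
# [[ 0  1  2  1  3]
#  [ 0  0  0  4  2]
#  [ 7  8  9 10 11]]
```

Short reviews gain leading zeros, while the long one loses its last token.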

### Training, Validation, Test
With our data in nice shape, we'll split it into training, validation, and test sets.

- We'll need to create sets for the features and the labels, train_x and train_y, for example.
- Define a split fraction, split_frac as the fraction of data to keep in the training set. Usually this is set to 0.8 or 0.9.
- Whatever data is left will be split in half to create the validation and testing data.

In [56]:
split_frac = 0.8

## split data into training, validation, and test data (features and labels, x and y)

split_idx = int(len(features)*split_frac)
train_x, remaining_x = features[:split_idx], features[split_idx:]
train_y, remaining_y = sentiments[:split_idx], sentiments[split_idx:]

test_idx = int(len(remaining_x)*0.5)
val_x, test_x = remaining_x[:test_idx], remaining_x[test_idx:]
val_y, test_y = remaining_y[:test_idx], remaining_y[test_idx:]

## print out the shapes of your resultant feature data
print("\t\t\tFeature Shapes:")
print("Train set: \t\t{}".format(train_x.shape), 
      "\nValidation set: \t{}".format(val_x.shape),
      "\nTest set: \t\t{}".format(test_x.shape))

			Feature Shapes:
Train set: 		(40000, 200) 
Validation set: 	(5000, 200) 
Test set: 		(5000, 200)
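
The split above slices the data in its stored order. If the rows happened to be sorted (for example by sentiment), shuffling first would be safer; whether that applies to this dataset is not shown here. A minimal sketch of a shuffled 80/10/10 split, using made-up toy arrays and a fixed seed as assumptions:

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the split is repeatable

# Hypothetical toy data: 20 samples with 3 features each, binary labels
n = 20
toy_features = np.arange(n * 3).reshape(n, 3)
toy_labels = np.arange(n) % 2

# Shuffle the indices once, then slice them into three groups
order = rng.permutation(n)
split_idx = int(n * 0.8)
val_end = split_idx + (n - split_idx) // 2
train_idx = order[:split_idx]
val_idx = order[split_idx:val_end]
test_idx = order[val_end:]

# Index features and labels with the same indices to keep them paired
train_x, train_y = toy_features[train_idx], toy_labels[train_idx]
val_x, val_y = toy_features[val_idx], toy_labels[val_idx]
test_x, test_y = toy_features[test_idx], toy_labels[test_idx]

print(train_x.shape, val_x.shape, test_x.shape)  # (16, 3) (2, 3) (2, 3)
```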
