## Recurrent neural networks for sequence classification
----

In this tutorial, we are going to build a recurrent neural network (RNN) for sequence classification. 

The goal of the RNN will be to classify the sentiment of a movie review. We will be training this model using the IMDB sentiment analysis dataset ([dataset source](http://ai.stanford.edu/~amaas/data/sentiment/)).

To download the dataset, run the **get_datasets.sh** script in the **datasets** folder from your terminal. The script downloads the archive and extracts it automatically. First make the script executable for your user, then run it:

> **chmod u+x datasets/get_datasets.sh**

> **datasets/get_datasets.sh**
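
Once the script has finished, it can help to confirm that the extracted folders are where the rest of the tutorial expects them. The following is a minimal sketch and not part of the original script: `missing_dataset_dirs` is a hypothetical helper, and the `test/pos` and `test/neg` folder names are an assumption that mirrors the training folders read later in this notebook.

```python
from pathlib import Path

# Sub-folders the tutorial reads from. The train folders appear in the
# notebook; the test folders are ASSUMED to mirror them.
EXPECTED_SUBDIRS = ("train/pos", "train/neg", "test/pos", "test/neg")


def missing_dataset_dirs(root="datasets/aclImdb"):
    """Return the expected sub-folders that do not exist under `root`."""
    root = Path(root)
    return [sub for sub in EXPECTED_SUBDIRS if not (root / sub).is_dir()]


if __name__ == "__main__":
    missing = missing_dataset_dirs()
    if missing:
        print("Missing folders:", ", ".join(missing))
    else:
        print("Dataset layout looks complete.")
```

An empty list means every expected folder was found; anything else suggests the download or extraction did not complete.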

This dataset comes with 25,000 movie reviews for training and 25,000 reviews for testing. Let's jump into the data analysis :)

### Import useful libraries
----

In [1]:
import pandas as pd
import glob

### Exploratory data analysis on the IMDB dataset

In [2]:
# Get the filenames of the positive/negative reviews we will use for training the RNN
train_pos_reviews = glob.glob('datasets/aclImdb/train/pos/*')
train_neg_reviews = glob.glob('datasets/aclImdb/train/neg/*')

# Concatenate both positive and negative reviews
train_reviews = train_pos_reviews + train_neg_reviews

# Create target vector. Target 1 is for positive reviews and target 0 is for negative reviews
train_targets = [1]*len(train_pos_reviews) + [0]*len(train_neg_reviews)

In [3]:
print('Number of positive reviews used for training:', len(train_pos_reviews))
print('Number of negative reviews used for training:', len(train_neg_reviews))

Number of positive reviews used for training: 12500
Number of negative reviews used for training: 12500


In [5]:
# Initialize an empty DataFrame to store each movie review, its sentiment, and its rating
train_df = pd.DataFrame([], index=range(len(train_reviews)), columns=['review', 'target', 'score'])

# Read each review in turn and store it in the DataFrame
for i, (review_filename, target) in enumerate(zip(train_reviews, train_targets)):
    # Use a context manager so each file is closed as soon as it has been read
    with open(review_filename, 'r', encoding='utf-8') as review_file:
        review = review_file.read()
    train_df.loc[i, 'review'] = review
    train_df.loc[i, 'target'] = target
    # The rating is the part of the filename between the last '_' and the extension
    train_df.loc[i, 'score'] = review_filename.split('.')[0].split('_')[-1]
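
The score extraction in the loop splits the full path on `.` and `_`, which can misfire if a parent folder name happens to contain either character. As a hedged alternative, the sketch below parses only the file's base name, assuming files are named `<id>_<rating>.txt` (which is what the `score` column suggests). `parse_review_filename` and `load_reviews` are hypothetical helpers, not part of the original notebook.

```python
from pathlib import Path


def parse_review_filename(path):
    """Split a name like '123_8.txt' into (review_id, rating) integers.

    ASSUMPTION: the base name has the form '<id>_<rating>.txt'.
    """
    review_id, rating = Path(path).stem.split("_")
    return int(review_id), int(rating)


def load_reviews(filenames, targets):
    """Read each review file and return one record (dict) per review."""
    records = []
    for filename, target in zip(filenames, targets):
        _, rating = parse_review_filename(filename)
        text = Path(filename).read_text(encoding="utf-8")
        records.append({"review": text, "target": target, "score": rating})
    return records
```

Passing the result to `pd.DataFrame(records)` should give the same three columns as `train_df`, built in one step rather than cell by cell, with `score` stored as an integer instead of a string.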

In [9]:
# Print a few lines
train_df.head(5)

| | review | target | score |
|---|---|---|---|
| 0 | This has to be, hands down, hats off, one of t... | 1 | 10 |
| 1 | I admit creating great expectations before wat... | 1 | 7 |
| 2 | I love this film and it is such a wonderful ex... | 1 | 9 |
| 3 | In this 1943 film, Judy Garland is deemed not ... | 1 | 8 |
| 4 | In celebration of Earth Day Disney has release... | 1 | 8 |
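
With both `target` and `score` in the table, one quick consistency check is whether the two agree. The sketch below assumes the labelling convention described in the dataset's documentation (negative reviews rated 4 or lower out of 10, positive reviews rated 7 or higher); treat those thresholds as an assumption to verify against the README shipped with the data. `find_label_mismatches` is a hypothetical helper.

```python
def find_label_mismatches(targets, scores, neg_max=4, pos_min=7):
    """Return positions where the rating contradicts the sentiment label.

    targets: iterable of 0 (negative) / 1 (positive)
    scores:  iterable of ratings, as ints or numeric strings
    ASSUMPTION: ratings <= neg_max are negative, ratings >= pos_min are positive.
    """
    mismatches = []
    for i, (target, score) in enumerate(zip(targets, scores)):
        score = int(score)  # the notebook stores scores as strings
        expected = 1 if score >= pos_min else 0 if score <= neg_max else None
        if expected != target:
            mismatches.append(i)
    return mismatches


# The five rows shown above are all positive with ratings of 7 or more.
print(find_label_mismatches([1, 1, 1, 1, 1], ["10", "7", "9", "8", "8"]))  # → []
```

On the full frame this could be called as `find_label_mismatches(train_df['target'], train_df['score'])`; an empty list would mean every rating matches its label.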


In [13]:
# Read a particular review
idx_review = 2 # change here to see a different review
train_df.loc[idx_review,'review']

'I love this film and it is such a wonderful example of a family jeopardy, a romantic love story, and a very sad story plot. Everything was just so perfect and excellent about this film. It was such a great mixture of actors and actresses and with some laughs and a lot of cries this film deserves to get plenty of awards. With the mention of beautiful scenario, and although I would relate this film to The Notebook and The Family Stone, it was sort of much more cunning, sad, and brilliant than those films. The Evening tells of a love story between an old woman dreaming back to her younger years, and her two daughters stay by her side while she is not well. The story dating back is so strongly told and wonderful I was sitting on the edge of my seat. You really get to know all the characters and by the end, I was wanting to watch it all over again. This is a amazingly sad and vividly acted and plotted movie that is really one of a kind and should be seen by all for how wonderful it really 