<a href="https://colab.research.google.com/github/mayujie/MSC_text_mining/blob/master/Sentiment_RNN_Exercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!gdown https://drive.google.com/uc?id=1_1TllbB1pdowmZ2LE1PL3moh1x3trCp6

Downloading...
From: https://drive.google.com/uc?id=1_1TllbB1pdowmZ2LE1PL3moh1x3trCp6
To: /content/data.zip
0.00B [00:00, ?B/s]12.0MB [00:00, 188MB/s]


In [None]:
!unzip data.zip

Archive:  data.zip
   creating: data/
  inflating: data/labels.txt         
  inflating: data/reviews.txt        


# Sentiment Analysis with an RNN

In this notebook, you'll implement a recurrent neural network that performs sentiment analysis. 
>Using an RNN rather than a strictly feedforward network is more accurate since we can include information about the *sequence* of words. 

Here we'll use a dataset of movie reviews, accompanied by sentiment labels: positive or negative.

![alt text](https://drive.google.com/uc?id=1TkHLK9_aME3VmlZ5vB2wZC4gjTTSOB4w)
<img src="assets/reviews_ex.png" width=40%>

### Network Architecture

The architecture for this network is shown below.

![alt text](https://drive.google.com/uc?id=1B4fdKBByMFTjJ2LJv_oqmeTV4n62jn7S)
<img src="assets/network_diagram.png" width=40%>

>**First, we'll pass in words to an embedding layer.** We need an embedding layer because we have tens of thousands of words, so we'll need a more efficient representation for our input data than one-hot encoded vectors. You should have seen this before from the Word2Vec lesson. You can actually train an embedding with the Skip-gram Word2Vec model and use those embeddings as input, here. However, it's good enough to just have an embedding layer and let the network learn a different embedding table on its own. *In this case, the embedding layer is for dimensionality reduction, rather than for learning semantic representations.*

>**After input words are passed to an embedding layer, the new embeddings will be passed to LSTM cells.** The LSTM cells will add *recurrent* connections to the network and give us the ability to include information about the *sequence* of words in the movie review data. 

>**Finally, the LSTM outputs will go to a sigmoid output layer.** We're using a sigmoid function because positive and negative reviews are labeled 1 and 0, respectively, and a sigmoid outputs predicted sentiment values between 0 and 1. 

We don't care about the sigmoid outputs except for the **very last one**; we can ignore the rest. We'll calculate the loss by comparing the output at the last time step and the training label (pos or neg).
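
To make the data flow concrete, here is a minimal PyTorch sketch of the embedding → LSTM → sigmoid pipeline described above. The class name `TinySentimentRNN` and all layer sizes are hypothetical choices for illustration, not necessarily the model built later in this notebook.

```python
import torch
import torch.nn as nn

class TinySentimentRNN(nn.Module):
    """Illustrative embedding -> LSTM -> sigmoid model (hypothetical sizes)."""

    def __init__(self, vocab_size, embedding_dim=8, hidden_dim=16):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        embeds = self.embedding(x)        # (batch, seq_len, embedding_dim)
        lstm_out, _ = self.lstm(embeds)   # (batch, seq_len, hidden_dim)
        # one sigmoid score per time step: (batch, seq_len)
        scores = torch.sigmoid(self.fc(lstm_out)).squeeze(-1)
        return scores[:, -1]              # keep only the last time step

torch.manual_seed(0)
model = TinySentimentRNN(vocab_size=100)
tokens = torch.randint(0, 100, (4, 12))  # 4 fake reviews, 12 tokens each
preds = model(tokens)
print(preds.shape)  # one prediction per review
```

Only `scores[:, -1]` would be compared with the training label; the sigmoid outputs at earlier time steps are discarded.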

---
### Load in and visualize the data

In [None]:
import numpy as np

# read data from text files
with open('data/reviews.txt', 'r') as f:
    reviews = f.read()
with open('data/labels.txt', 'r') as f:
    labels = f.read()

In [None]:
print(reviews[:2000])
print()
print(labels[:20])

bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   
story of a man who has unnatural feelings for a pig . starts out with a opening scene that is a terrific example of absurd comedy . a formal orchestra audience is turn

**We need to pre-process this data and tokenize (organize) all of the words in our vocabulary so that we have numerical data to feed to our model later. Since we're using an embedding layer, we'll need to encode each word as an integer, and we'll also want to clean up our data a bit.**

## Data pre-processing

The first step when building a neural network model is getting your data into the proper form to feed into the network. Since we're using embedding layers, we'll need to encode each word with an integer. We'll also want to clean it up a bit.

You can see an example of the reviews data above. Here are the processing steps we'll want to take:
>* We'll want to get rid of periods and extraneous punctuation.
* Also, you might notice that the reviews are delimited with newline characters `\n`. To deal with those, I'm going to split the text into individual reviews using `\n` as the delimiter. 
* Then I can combine all the reviews back together into one big string.
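
On a tiny made-up string, these steps look roughly like the sketch below. The `raw` text and the `toy_` names are invented for illustration and are kept separate from the notebook's own variables.

```python
from string import punctuation

# Made-up text: two reviews, each terminated by a newline
raw = "Great movie!\nNot good, not bad.\n"

text = raw.lower()
# keep every character that is not punctuation
no_punct = ''.join(c for c in text if c not in punctuation)

# one list entry per review
toy_reviews = no_punct.split('\n')

# rejoin with spaces, then split into individual words
toy_words = ' '.join(toy_reviews).split()

print(toy_reviews)
print(toy_words)
```

Because the text ends with a newline, the split leaves an empty string as the last "review". The same kind of empty entry shows up later in the real data as a zero-length review.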

First, let's remove all punctuation. Then get all the text without the newlines and split it into individual words.

In [None]:
from string import punctuation

print(punctuation)

# get rid of punctuation
reviews = reviews.lower() # turn our text to lowercase, standardize
all_text = ''.join([c for c in reviews if c not in punctuation])
# if it's not in the punctuation list, we keep it 

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [None]:
# this gives me a version of the review text that is all lowercase text with no punctuation
print(all_text[:2000])
# In this case, punctuation will not really have any bearing on whether a review is classified as positive or negative.

bromwell high is a cartoon comedy  it ran at the same time as some other programs about school life  such as  teachers   my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers   the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students  when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled          at           high  a classic line inspector i  m here to sack one of your teachers  student welcome to bromwell high  i expect that many adults of my age think that bromwell high is far fetched  what a pity that it isn  t   
story of a man who has unnatural feelings for a pig  starts out with a opening scene that is a terrific example of absurd comedy  a formal orchestra audience is turned into an insane  violent mo

**Next, I know that my reviews are separated by newline characters `\n`. So to separate out our reviews, I'm going to split the text into individual reviews using `\n` as the delimiter. Then I can combine all the reviews back together into one big string.**

Finally, split that text into individual words. 

In [None]:
# split by new lines and spaces
reviews_split = all_text.split('\n')
all_text = ' '.join(reviews_split)

# create a list of words
words = all_text.split()

In [None]:
words[:30]
# Essentially, this is the original text that I printed out, only with all the punctuation removed
# and everything separated into individual words.

['bromwell',
 'high',
 'is',
 'a',
 'cartoon',
 'comedy',
 'it',
 'ran',
 'at',
 'the',
 'same',
 'time',
 'as',
 'some',
 'other',
 'programs',
 'about',
 'school',
 'life',
 'such',
 'as',
 'teachers',
 'my',
 'years',
 'in',
 'the',
 'teaching',
 'profession',
 'lead',
 'me']

**Next, we have to take our word data and our label text data and convert them into numerical data.**

The first task of the exercise is to create a dictionary that can convert any unique word into an integer token. Then, using this dictionary, you need to create a list of tokenized words: all the words in our data, converted into their integer values. 

I'd also like our dictionary to map more frequent words to lower integer tokens. One important thing to note here is that later, we're going to pad our input vectors with zeros. So I don't want 0 as a word token; I want the tokenized values to start at 1. 

And so, the most common word in our vocabulary should be mapped to the integer value 1. So create that dictionary, use it to tokenize our words, and then store those tokens in a list.

### Encoding the words

The embedding lookup requires that we pass in integers to our network. The easiest way to do this is to create dictionaries that map the words in the vocabulary to integers. Then we can convert each of our reviews into integers so they can be passed into the network.

> **Exercise:** Now you're going to encode the words with integers. Build a dictionary that maps words to integers. Later we're going to pad our input vectors with zeros, so make sure the integers **start at 1, not 0**.
> Also, convert the reviews to integers and store the reviews in a new list called `reviews_ints`. 
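
As a small illustration of the idea, here is a sketch on a made-up six-word corpus. The `toy_` names are hypothetical and chosen so they don't overwrite the notebook's variables.

```python
from collections import Counter

# Made-up corpus with no frequency ties: good x3, movie x2, film x1
toy_words = "good good good movie movie film".split()

toy_counts = Counter(toy_words)

# most frequent word first
toy_vocab = sorted(toy_counts, key=toy_counts.get, reverse=True)

# start numbering at 1 so that 0 stays free for padding
toy_vocab_to_int = {word: i for i, word in enumerate(toy_vocab, 1)}

# tokenize a short "review" with the dictionary
toy_ints = [toy_vocab_to_int[w] for w in "good film".split()]

print(toy_vocab_to_int)
print(toy_ints)
```

The most frequent word receives token 1, and no word is ever mapped to 0.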

In [None]:
# feel free to use this import 
from collections import Counter

## Build a dictionary that maps words to integers
# counting how many times each word appeared
counts = Counter(words)
print(counts.most_common(5))

[('the', 336713), ('and', 164107), ('a', 163009), ('of', 145864), ('to', 135720)]


In [None]:
# sort the unique words by frequency of occurrence, most frequent first;
# vocab holds every unique word in our word data, without repeats
vocab = sorted(counts, key=counts.get, reverse=True) # sort descending
# show the ten most frequent words
print(vocab[:10])
# encode words as integers, starting from 1 so that 0 stays free for padding
vocab_to_int = {word: ii for ii, word in enumerate(vocab, 1)} # start=1 (enumerate defaults to 0)
vocab_to_int

['the', 'and', 'a', 'of', 'to', 'is', 'br', 'it', 'in', 'i']


{'the': 1,
 'and': 2,
 'a': 3,
 'of': 4,
 'to': 5,
 'is': 6,
 'br': 7,
 'it': 8,
 'in': 9,
 'i': 10,
 'this': 11,
 'that': 12,
 's': 13,
 'was': 14,
 'as': 15,
 'for': 16,
 'with': 17,
 'movie': 18,
 'but': 19,
 'film': 20,
 'you': 21,
 'on': 22,
 't': 23,
 'not': 24,
 'he': 25,
 'are': 26,
 'his': 27,
 'have': 28,
 'be': 29,
 'one': 30,
 'all': 31,
 'at': 32,
 'they': 33,
 'by': 34,
 'an': 35,
 'who': 36,
 'so': 37,
 'from': 38,
 'like': 39,
 'there': 40,
 'her': 41,
 'or': 42,
 'just': 43,
 'about': 44,
 'out': 45,
 'if': 46,
 'has': 47,
 'what': 48,
 'some': 49,
 'good': 50,
 'can': 51,
 'more': 52,
 'she': 53,
 'when': 54,
 'very': 55,
 'up': 56,
 'time': 57,
 'no': 58,
 'even': 59,
 'my': 60,
 'would': 61,
 'which': 62,
 'story': 63,
 'only': 64,
 'really': 65,
 'see': 66,
 'their': 67,
 'had': 68,
 'we': 69,
 'were': 70,
 'me': 71,
 'well': 72,
 'than': 73,
 'much': 74,
 'get': 75,
 'bad': 76,
 'been': 77,
 'people': 78,
 'will': 79,
 'do': 80,
 'other': 81,
 'also': 82,
 'into':

Next, use this dictionary to tokenize all of our word data. Here we look at each individual review; each one is an item in `reviews_split`, which I created earlier by splitting the text on `\n`.

Then, for each word in a review, I use my dictionary to convert that word into its integer value, and append the tokenized review to `reviews_ints`. So the end result will be a list of tokenized reviews.

In [None]:
## use the dict to tokenize each review in reviews_split
## store the tokenized reviews in reviews_ints
reviews_ints = []
for review in reviews_split:
    reviews_ints.append([vocab_to_int[word] for word in review.split()])

**Test your code**

As a test that you've implemented the dictionary correctly, print out the number of unique words in your vocabulary and the contents of the first tokenized review.

In [None]:
# stats about vocabulary: print the length of the dictionary
print('Unique words: ', len((vocab_to_int)))  # should be ~74000+
# this is the number of unique words that make up our reviews data
print()

# print the number of reviews, then the tokens in the first review of the tokenized list
print('Number of reviews: ', len(reviews_split))
print('Number of reviews: ', len(reviews_ints))
print('Tokenized review: \n', reviews_ints[:1])
# this tokenized review contains no 0 values, which is good, and the encoded values look as expected

Unique words:  74072

Number of reviews:  25001
Number of reviews:  25001
Tokenized review: 
 [[21025, 308, 6, 3, 1050, 207, 8, 2138, 32, 1, 171, 57, 15, 49, 81, 5785, 44, 382, 110, 140, 15, 5194, 60, 154, 9, 1, 4975, 5852, 475, 71, 5, 260, 12, 21025, 308, 13, 1978, 6, 74, 2395, 5, 613, 73, 6, 5194, 1, 24103, 5, 1983, 10166, 1, 5786, 1499, 36, 51, 66, 204, 145, 67, 1199, 5194, 19869, 1, 37442, 4, 1, 221, 883, 31, 2988, 71, 4, 1, 5787, 10, 686, 2, 67, 1499, 54, 10, 216, 1, 383, 9, 62, 3, 1406, 3686, 783, 5, 3483, 180, 1, 382, 10, 1212, 13583, 32, 308, 3, 349, 341, 2913, 10, 143, 127, 5, 7690, 30, 4, 129, 5194, 1406, 2326, 5, 21025, 308, 10, 528, 12, 109, 1448, 4, 60, 543, 102, 12, 21025, 308, 6, 227, 4146, 48, 3, 2211, 12, 8, 215, 23]]


**Your next, similar task is to encode our label text into numerical values. We saw that this text was just lines of "positive" or "negative", and we need to create an array of encoded labels that converts the word positive to 1 and negative to 0.**

### Encoding the labels

Our labels are "positive" or "negative". To use these labels in our network, we need to convert them to 0 and 1.

> **Exercise:** Convert labels from `positive` and `negative` to 1 and 0, respectively, and place those in a new list, `encoded_labels`.

Like the reviews text, the labels file has one label per line. So I can get a list of labels by splitting our loaded data using the newline character as a delimiter. 

For every label in the split list, I add a 1 to my array if it reads as positive, and a 0 otherwise.
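
One detail worth seeing on a toy input: if the text ends with a newline, the split produces a trailing empty string, and the "otherwise" branch encodes it as 0. The sketch below uses a made-up `toy_labels_text`; assuming the real labels file also ends with a newline, this would explain why the label count comes out one higher than the number of actual reviews.

```python
# Made-up labels text that ends with a newline
toy_labels_text = "positive\nnegative\npositive\n"

toy_labels_split = toy_labels_text.split('\n')

# 1 for 'positive', 0 for anything else (including the empty string)
toy_encoded = [1 if label == 'positive' else 0 for label in toy_labels_split]

print(toy_labels_split)
print(toy_encoded)
```

The extra 0 lines up with the empty entry at the end of the split reviews, so removing a zero-length review together with its label keeps the two lists aligned.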

In [None]:
# 1=positive, 0=negative label conversion
labels_split = labels.split("\n")
labels_split[:5]

['positive', 'negative', 'positive', 'negative', 'positive']

In [None]:

encoded_labels = np.array([1 if label=='positive' else 0 for label in labels_split])
print(len(encoded_labels))
encoded_labels

25001


array([1, 0, 1, ..., 1, 0, 0])

**After encoding all our word and label data, as an additional pre-processing step, we want to make sure that our reviews are in good shape for standard processing. That is, our network will expect a standard input text size, and so we'll want to shape our reviews into a consistent, specific length.**

There are two things we'll need to do to approach this task.

- First, take a look at the review data and see whether we have any especially short or long reviews that might mess with our training process. I'll especially look to see if we have any reviews of length 0, which will not provide any text information and will just act as **noisy data**. If I find any of those 0-length reviews, I'll want to remove them from our data entirely.

- Second, I'll look at the remaining reviews, and for really long reviews, I'll truncate them at a specific length. I'll do something similar for the shortest reviews, padding them, to make sure that I'm creating a set of reviews that are all the same length.

This will be our **padding** and **truncation** step, where we basically pad our data with columns of zeros or remove columns until we get our desired input shape. 

### Removing Outliers

As an additional pre-processing step, we want to make sure that our reviews are in good shape for standard processing. That is, our network will expect a standard input text size, and so, we'll want to shape our reviews into a specific length. We'll approach this task in two main steps:

1. Getting rid of extremely long or short reviews; the outliers
2. Padding/truncating the remaining data so that we have reviews of the same length.

![alt text](https://drive.google.com/uc?id=177smc0HLAOX8qFCKIRv9Fr8hVHyaXopn)
<img src="assets/outliers_padding_ex.png" width=40%>

Before we pad our review text, we should check for reviews of extremely short or long lengths; outliers that may mess with our training.

In particular, we should check for reviews of length 0. The way I'm going to do this is to use a `Counter`. For each review length that's currently in our data, whether that's a length of 0 or thousands of words, I'll look at how many reviews are of that length. So this returns a dictionary of review lengths and a count of how many of our reviews fall into each length. 

So here I'm looking at how many of our reviews have length 0, and I'll also print out the longest review length just to see.
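
Here is a small sketch of that length count on made-up data; `toy_reviews_ints` and `toy_lens` are hypothetical names used only for this illustration.

```python
from collections import Counter

# Made-up tokenized reviews, including one empty review
toy_reviews_ints = [[5, 2, 9], [7], [], [3, 3, 1], [4, 8]]

# keys are review lengths, values are how many reviews have that length
toy_lens = Counter(len(r) for r in toy_reviews_ints)

print("Zero-length reviews:", toy_lens[0])
print("Maximum review length:", max(toy_lens))  # max over the keys (lengths)
print("Reviews of length 10:", toy_lens[10])    # a Counter returns 0 for missing keys
```

Note that `max(toy_lens)` iterates over the keys of the counter, so it returns the longest length rather than the largest count.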

In [None]:
reviews_ints[:1]

[[21025,
  308,
  6,
  3,
  1050,
  207,
  8,
  2138,
  32,
  1,
  171,
  57,
  15,
  49,
  81,
  5785,
  44,
  382,
  110,
  140,
  15,
  5194,
  60,
  154,
  9,
  1,
  4975,
  5852,
  475,
  71,
  5,
  260,
  12,
  21025,
  308,
  13,
  1978,
  6,
  74,
  2395,
  5,
  613,
  73,
  6,
  5194,
  1,
  24103,
  5,
  1983,
  10166,
  1,
  5786,
  1499,
  36,
  51,
  66,
  204,
  145,
  67,
  1199,
  5194,
  19869,
  1,
  37442,
  4,
  1,
  221,
  883,
  31,
  2988,
  71,
  4,
  1,
  5787,
  10,
  686,
  2,
  67,
  1499,
  54,
  10,
  216,
  1,
  383,
  9,
  62,
  3,
  1406,
  3686,
  783,
  5,
  3483,
  180,
  1,
  382,
  10,
  1212,
  13583,
  32,
  308,
  3,
  349,
  341,
  2913,
  10,
  143,
  127,
  5,
  7690,
  30,
  4,
  129,
  5194,
  1406,
  2326,
  5,
  21025,
  308,
  10,
  528,
  12,
  109,
  1448,
  4,
  60,
  543,
  102,
  12,
  21025,
  308,
  6,
  227,
  4146,
  48,
  3,
  2211,
  12,
  8,
  215,
  23]]

In [None]:
# outlier review stats
review_lens = Counter([len(x) for x in reviews_ints])
print(type(review_lens), '   ', review_lens[123])
# this 0 length review is just going to add noise into our dataset
print("Zero-length reviews: {}".format(review_lens[0]))
# longest review has over 2000 words in it.
print("Maximum review length: {}".format(max(review_lens)))

<class 'collections.Counter'>     136
Zero-length reviews: 1
Maximum review length: 2514


Okay, a couple issues here. We seem to have one review with zero length. And, the maximum review length is way too many steps for our RNN. We'll have to remove any super short reviews and truncate super long reviews. This removes outliers and should allow our model to train more efficiently.

> **Exercise:** First, remove *any* reviews with zero length from the `reviews_ints` list and their corresponding label in `encoded_labels`.

In [None]:
print('Number of reviews before removing outliers: ', len(reviews_ints))
print('Number of labels before removing outliers: ', len(encoded_labels))
## remove any reviews/labels with zero length from the reviews_ints list.

# get indices of any reviews with length 0
non_zero_idx = [ii for ii, review in enumerate(reviews_ints) if len(review) != 0]

# remove 0-length reviews and their labels
reviews_ints = [reviews_ints[ii] for ii in non_zero_idx]
encoded_labels = np.array([encoded_labels[ii] for ii in non_zero_idx])

print('Number of reviews after removing outliers: ', len(reviews_ints))
print('Number of labels after removing outliers: ', len(encoded_labels))

Number of reviews before removing outliers:  25001
Number of labels before removing outliers:  25001
Number of reviews after removing outliers:  25000
Number of labels after removing outliers:  25000


Now the next thing I want to deal with is very long reviews, and standardizing the length of our reviews in general. We saw that the maximum review length was about 2500 words, and that's going to be too many steps for our RNN. In cases like this, I want to truncate the data to a reasonable size and number of steps.  

---
## Padding sequences

To deal with both short and very long reviews, we'll pad or truncate all our reviews to a specific length. For reviews shorter than some `seq_length`, we'll pad with 0s. For reviews longer than `seq_length`, we can truncate them to the first `seq_length` words. A good `seq_length`, in this case, is 200.

> **Exercise:** Define a function that returns an array `features` that contains the padded data, of a standard size, that we'll pass to the network. 
* The data should come from `reviews_ints`, since we want to feed integers to the network. 
* Each row should be `seq_length` elements long. 
* For reviews shorter than `seq_length` words, **left pad** with 0s. That is, if the review is `['best', 'movie', 'ever']`, `[117, 18, 128]` as integers, the row will look like `[0, 0, 0, ..., 0, 117, 18, 128]`. 
* For reviews longer than `seq_length`, use only the first `seq_length` words as the feature vector.

As a small example, if the `seq_length=10` and an input review is: 
```
[117, 18, 128]
```
The resultant, padded sequence should be: 

```
[0, 0, 0, 0, 0, 0, 0, 117, 18, 128]
```

**Your final `features` array should be a 2D array, with as many rows as there are reviews, and as many columns as the specified `seq_length`.**

This isn't trivial and there are a bunch of ways to do this. But, if you're going to be building your own deep learning networks, you're going to have to get used to preparing your data.

In [None]:
print(len(reviews_ints[0]))
print(len(reviews_ints[2]))

140
447


In [None]:
print(type(reviews_ints))
k = np.zeros((4, 15), dtype=int)
print(k,'\n')
m = [[21025,   308 ,    6 ,    3 , 1050   ,207 ,    8 , 2138,    32 ,    1, 34],
     [  63 ,   4   , 3 , 125  , 36  , 47, 7472 ,1395],
     [22382,    42, 46418   , 15,   706, 17139 , 3389 ,  22, 33, 11, 55, 47,    77  ,  35, 3, 1, 55, 29],
     [4505,  505  , 15 ,   3, 3342 , 162, 8312, 1652,    6 ,4819]]
for i, row in enumerate(m):
    if i < 4:
        print(len(row))
        print(i ,np.array(row)[:15], len(np.array(row)[:15]))
        print(i ,np.array(row)[-5:], '\n')
        k[i, -len(row):] = np.array(row)[:15]
        
print(len(k[0]), '\n')
print(k)

<class 'list'>
[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]] 

11
0 [21025   308     6     3  1050   207     8  2138    32     1    34] 11
0 [   8 2138   32    1   34] 

8
1 [  63    4    3  125   36   47 7472 1395] 8
1 [ 125   36   47 7472 1395] 

18
2 [22382    42 46418    15   706 17139  3389    22    33    11    55    47
    77    35     3] 15
2 [35  3  1 55 29] 

10
3 [4505  505   15    3 3342  162 8312 1652    6 4819] 10
3 [ 162 8312 1652    6 4819] 

15 

[[    0     0     0     0 21025   308     6     3  1050   207     8  2138
     32     1    34]
 [    0     0     0     0     0     0     0    63     4     3   125    36
     47  7472  1395]
 [22382    42 46418    15   706 17139  3389    22    33    11    55    47
     77    35     3]
 [    0     0     0     0     0  4505   505    15     3  3342   162  8312
   1652     6  4819]]


Here is a solution for creating an array of features: reviews that have either been padded on the left with 0s up to the sequence length, or truncated at that length. 

First, I create an array of 0s in the final shape that I know I want. That is, it should have as many rows as there are reviews in the input `reviews_ints` data, and as many columns as the specified sequence length. This just holds all 0 integers for now.

Then each review in my list goes into a row of my features array. The first review goes in the first row, the second in the second row, and so on. I started out thinking of my short review case. I want to keep a left padding of 0s up until I reach the point where that review can fill the remaining values.

So I fill my features starting at the index given by the end of the features row minus the length of the input review. If a review is short, this means our features keep the zeros as padding on the left, and the review tokens sit on the right side.

It turns out that I only have to add one more piece to this line to make it work for long reviews too. For any review, including those longer than the given sequence length, I truncate it at that sequence length, and this fills the corresponding features row. So this loop does this for every review in `reviews_ints`, and then the function returns these features. 
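
The reason a single slice assignment covers both cases is that NumPy clamps a negative slice start that reaches past the beginning of the row. A minimal sketch on one row, with made-up values and a hypothetical `toy_seq_length` of 5:

```python
import numpy as np

toy_seq_length = 5
short_review = [117, 18, 128]
long_review = [1, 2, 3, 4, 5, 6, 7, 8]

# short review: the slice [-3:] selects the last 3 slots, zeros stay on the left
short_row = np.zeros(toy_seq_length, dtype=int)
short_row[-len(short_review):] = np.array(short_review)[:toy_seq_length]

# long review: the slice [-8:] is clamped to the whole 5-slot row,
# and the right-hand side is truncated to the first 5 tokens
long_row = np.zeros(toy_seq_length, dtype=int)
clamped_width = long_row[-len(long_review):].shape[0]
long_row[-len(long_review):] = np.array(long_review)[:toy_seq_length]

print(short_row)
print(clamped_width)
print(long_row)
```

This trick relies on every review having at least one token: for an empty review, `-0` is just `0`, so the slice would select the whole row while the right-hand side is empty. Removing zero-length reviews first avoids that case.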

In [None]:
def pad_features(reviews_ints, seq_length):
    ''' Return features of review_ints, where each review is padded with 0's 
        or truncated to the input seq_length.
    '''
    ## implement function
    # getting the correct rows x cols shape     
    features=np.zeros((len(reviews_ints), seq_length), dtype=int)
    
    for i, row in enumerate(reviews_ints):
        features[i, -len(row):] = np.array(row)[:seq_length]
    
    return features

In [None]:
# Test your implementation!
# set sequence length to 200
seq_length = 200

features = pad_features(reviews_ints, seq_length=seq_length)

# none of these assertions trigger an error message, so the dimensions are correct
## test statements - do not change - ##
assert len(features)==len(reviews_ints), "Your features should have as many rows as reviews."
assert len(features[0])==seq_length, "Each feature row should contain seq_length values."

# print first 10 values of the first 30 batches(rows) 
print(features[:30,:10])

[[    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [22382    42 46418    15   706 17139  3389    47    77    35]
 [ 4505   505    15     3  3342   162  8312  1652     6  4819]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [   54    10    14   116    60   798   552    71   364     5]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    1   330   578    34     3   162   748  2731     9   325]
 [    9    11 10171  5305  1946   689   444    22   280   673]
 [    0     0     0     0     0     0     0     0     0

As shown above, a lot of these rows start with 0s, which is what I expect for left padding, and others are filled with various token values. I'll also add that, in this step, we've actually introduced a new token into our review features. 

Remember that before, every word in our vocabulary had an associated integer value, and we started numbering at 1. So in our `vocab_to_int` dictionary, we had integers from 1 up to about 74,000. And here, by adding 0 as padding, I've effectively inserted the 0 token into our vocabulary. 
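
A practical consequence is that any lookup table indexed by token needs one more row than there are words. The sketch below illustrates this with a random NumPy matrix standing in for an embedding table; the `toy_` names, the three-word vocabulary, and the embedding width of 4 are all assumptions made for the example.

```python
import numpy as np

toy_vocab_to_int = {'good': 1, 'movie': 2, 'film': 3}

# tokens run from 0 (padding) to len(vocab), so the table needs len(vocab) + 1 rows
toy_vocab_size = len(toy_vocab_to_int) + 1

rng = np.random.default_rng(0)
toy_embedding = rng.normal(size=(toy_vocab_size, 4))  # stand-in embedding table

padded_review = np.array([0, 0, 1, 3])    # left-padded "good film"
vectors = toy_embedding[padded_review]    # one row of the table per token

print(toy_vocab_size)
print(vectors.shape)
```

With only `len(toy_vocab_to_int)` rows, the highest token (3 here) would fall outside the table.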

**Next, we need to split the features and encoded labels into three different datasets: Training, Validation and test sets. We need to create datasets for grouping our features and labels like train_x and train_y, for example. And we'll use these different sets to train and test our model.**

## Training, Validation, Test

With our data in nice shape, we'll split it into training, validation, and test sets.

> **Exercise:** Create the training, validation, and test sets. 
* You'll need to create sets for the features and the labels, `train_x` and `train_y`, for example. 
* Define a split fraction, `split_frac` as the fraction of data to **keep** in the training set. Usually this is set to 0.8 or 0.9. 
* Whatever data is left (the remaining 20% when `split_frac` is 0.8) will be split in half to create the validation and *testing* data respectively.

Afterwards, we can use PyTorch resources to effectively batch and iterate through these different datasets.
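
The index arithmetic behind such a split can be sketched with plain integers. The `toy_` names are hypothetical; the review count of 25000 and the 0.8 fraction come from this notebook.

```python
toy_n = 25000          # number of reviews after removing the zero-length one
toy_split_frac = 0.8   # fraction kept for training

# everything before this index is training data
toy_split_idx = int(toy_n * toy_split_frac)

# the rest is split in half for validation and test
toy_remaining = toy_n - toy_split_idx
toy_val_size = int(toy_remaining * 0.5)
toy_test_size = toy_remaining - toy_val_size

print(toy_split_idx, toy_val_size, toy_test_size)
```

These sizes correspond to the 0.8 / 0.1 / 0.1 split of the feature rows.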

In [None]:
split_frac = 0.8

## split data into training, validation, and test data (features and labels, x and y)
# 0.8 --> take 80% of data and label as training data
split_idx = int(len(features)* split_frac)

train_x, remaining_x = features[:split_idx], features[split_idx:]
train_y, remaining_y = encoded_labels[:split_idx], encoded_labels[split_idx:]

# split rest 20% data and labels to half as validation and testing data
test_idx = int(len(remaining_x)* 0.5)
val_x, test_x = remaining_x[:test_idx], remaining_x[test_idx:]
val_y, test_y = remaining_y[:test_idx], remaining_y[test_idx:]

## print out the shapes of your resultant feature data
print("\t\t\tFeature Shapes: (number of reviews, sequence length)")
print("Train set: \t\t{}".format(train_x.shape),
      "\nValidation set: \t{}".format(val_x.shape),
      "\nTest set: \t\t{}".format(test_x.shape))
print()
print("Train label: \t\t{}".format(train_y.shape),
      "\nValidation label: \t{}".format(val_y.shape),
      "\nTest label: \t\t{}".format(test_y.shape))

			Feature Shapes: (number of reviews, sequence length)
Train set: 		(20000, 200) 
Validation set: 	(2500, 200) 
Test set: 		(2500, 200)

Train label: 		(20000,) 
Validation label: 	(2500,) 
Test label: 		(2500,)


**Check your work**

With train, validation, and test fractions equal to 0.8, 0.1, 0.1, respectively, the final, feature data shapes should look like:
```
                    Feature Shapes:
Train set: 		 (20000, 200) 
Validation set: 	(2500, 200) 
Test set: 		  (2500, 200)
```

---
## DataLoaders and Batching

After creating training, test, and validation data, we can create DataLoaders for this data by following two steps:
1. Create a known format for accessing our data, using [TensorDataset](https://pytorch.org/docs/stable/data.html#) which takes in an input set of data and a target set of data with the same first dimension, and creates a dataset.
2. Create DataLoaders and batch our training, validation, and test Tensor datasets.

```
train_data = TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y))
train_loader = DataLoader(train_data, batch_size=batch_size)
```

This is an alternative to creating a generator function for batching our data into full batches.

We can create data loaders for our data by following a couple of steps.

- First, we can use PyTorch's `TensorDataset` to wrap tensor data in a known format. This dataset takes any number of tensors with the same first dimension (the same number of rows); in our case these are our input features and the label tensors. It creates a dataset that can be processed and batched by PyTorch's `DataLoader` class. So once we wrap our data in a `TensorDataset`, we can pass that to a data loader as usual. 

- `DataLoader` takes in some data and a batch size, and it returns a data loader that batches our data as we typically might. This is a great alternative to creating a generator function for batching our data into full batches. The `DataLoader` class takes care of a lot of behind-the-scenes work for us, and here's what this looks like in code.

First, I create tensor datasets by passing in tensor versions of the x and y arrays that I created above; `torch.from_numpy` takes in NumPy arrays and converts them into tensors. I do that for the training, validation, and test data.

**In fact, we can do these steps the other way around: create a tensor dataset for all the data and then split it into different sets. Both approaches work.**
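
One possible way to do it in that other order is `torch.utils.data.random_split`, sketched below on made-up arrays. The `toy_` names, the 10-row data, and the 8/1/1 sizes are assumptions for illustration. Unlike slicing by index, `random_split` assigns rows to the subsets at random.

```python
import numpy as np
import torch
from torch.utils.data import TensorDataset, random_split

# Made-up data: 10 "reviews" of 4 tokens each, with alternating labels
toy_x = np.arange(40).reshape(10, 4)
toy_y = np.array([0, 1] * 5)

# wrap everything in one dataset first
toy_full = TensorDataset(torch.from_numpy(toy_x), torch.from_numpy(toy_y))

# then split it into 8 training, 1 validation and 1 test example
toy_gen = torch.Generator().manual_seed(0)  # fixed seed for a repeatable split
toy_train, toy_valid, toy_test = random_split(toy_full, [8, 1, 1], generator=toy_gen)

print(len(toy_train), len(toy_valid), len(toy_test))
```

Each resulting subset can be passed to a `DataLoader` in the same way as a `TensorDataset`.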

Then I pass each tensor dataset that I just created into PyTorch's `DataLoader`, where I can specify a batch size parameter, 50 in this case. This defines training, validation, and test data loaders that I can use in my training loop to batch data into the size I want. 

So this gives me three different iterators, and I just want to show what a sample of data from one of these data loaders looks like. 

In [None]:
import torch
from torch.utils.data import TensorDataset, DataLoader

# create Tensor datasets
train_data = TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y))
valid_data = TensorDataset(torch.from_numpy(val_x), torch.from_numpy(val_y))
test_data = TensorDataset(torch.from_numpy(test_x), torch.from_numpy(test_y))

# dataloaders
batch_size = 50

# make sure to SHUFFLE your data
train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size)
valid_loader = DataLoader(valid_data, shuffle=True, batch_size=batch_size)
test_loader = DataLoader(test_data, shuffle=True, batch_size=batch_size)

In [None]:
print(len(train_loader), len(valid_loader), len(test_loader) )

400 50 50
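
The batch counts printed above follow from simple arithmetic. The sketch below assumes split sizes of 20000, 2500, and 2500 reviews, which is consistent with the printed counts but is not shown in this section; `num_batches` is a hypothetical helper, not part of PyTorch.

```python
import math

def num_batches(num_samples, batch_size, drop_last=False):
    """Number of batches a loader yields for a dataset of num_samples items."""
    if drop_last:
        # Incomplete final batch is discarded
        return num_samples // batch_size
    # Incomplete final batch is kept as a smaller batch
    return math.ceil(num_samples / batch_size)

print(num_batches(20000, 50), num_batches(2500, 50), num_batches(2500, 50))  # 400 50 50
print(num_batches(2510, 50))                  # 51: the last batch holds only 10 items
print(num_batches(2510, 50, drop_last=True))  # 50: the partial batch is discarded
```

This matters later because the hidden state is created for a fixed `batch_size`, so a smaller final batch would not match it.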


**Omission: shuffling data**

Make sure to shuffle your data, so that your model doesn't learn anything about the ordering of the data, and instead can focus on the content. We can do this with a DataLoader by setting shuffle=True. You'll find this updated code in the exercise and solution notebooks.

**TensorDataset**

Take a look at the source code for [**the TensorDataset class**](https://github.com/pytorch/tnt/blob/master/torchnet/dataset/tensordataset.py); you can see that its purpose is to provide an easy way to create a dataset out of standard data structures.

Below, I take the train_loader, get an iterator from it, and grab one batch of data using a call to `next`. This should return some sample input features and some sample labels.

In [None]:
# obtain one batch of training data
dataiter = iter(train_loader)
sample_x, sample_y = next(dataiter)

print('Sample input size: ', sample_x.size()) # batch_size, seq_length
print('Sample input: \n', sample_x)
print()
print('Sample label size: ', sample_y.size()) # batch_size
print('Sample label: \n', sample_y)

Sample input size:  torch.Size([50, 200])
Sample input: 
 tensor([[    0,     0,     0,  ...,  5276, 30229,  2409],
        [    0,     0,     0,  ...,    15,     3,  1607],
        [    0,     0,     0,  ...,   136,    21,    51],
        ...,
        [    0,     0,     0,  ...,   944,     8,     3],
        [    0,     0,     0,  ...,  3330,    95,    56],
        [    0,     0,     0,  ...,    13,   651,   141]])

Sample label size:  torch.Size([50])
Sample label: 
 tensor([0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0,
        1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,
        1, 0])


The printout shows the size of my input, which is the batch size (50) by the sequence length (200), and the label size, which is just 50: one label for each review in the input batch. I can see my tokens and encoded labels as well.

---
By now, you've had a lot of practice with data processing and with defining RNNs. Below is what it should look like generally. The model should be able to take in our word tokens, and the first thing that these go through will be an **embedding layer**.

We have about 74,000 different words, so this layer is responsible for converting our word tokens (integers) into embeddings of a specific size. Now, you could train a Word2Vec model separately and use the learned word embeddings as input to an LSTM. **But it turns out that these embedding layers are still useful even if they haven't been trained to learn the semantic relationships between words.** 

So in this case, what we're mainly using this **embedding layer for is dimensionality reduction**. It will learn to look at our large vocabulary and map each word into a vector of specified embedding dimension. Then after our embedding layer, we have an LSTM layer. This is defined by a hidden state size and number of layers as you know.

At each step, these LSTM cells will produce an output and a new hidden state. The hidden state will be passed to the next cell as input, and this is how we represent a memory in this model. The output is going to be fed into a sigmoid-activated, fully connected output layer. This layer will be responsible for mapping the LSTM outputs to a desired output size. In this case, that is a single value per review that scores it as positive or negative.

Then the sigmoid activation function is responsible for turning each of those outputs into a value between 0 and 1. This is the range we expect for our encoded sentiment labels: 0 is a negative review and 1 is a positive review. So this model is going to look at a sequence of words that make up a review. **Here, we're interested in only the last sigmoid output because this will produce the one label we're looking for at the end of processing a sequence of words in a review.** 
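
To make the squashing step concrete, here is a small pure-Python sketch of the sigmoid function and of how a value between 0 and 1 can be turned into a label. The input values and the 0.5 threshold are illustrative assumptions, not numbers taken from the model.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

for z in (-4.0, 0.0, 4.0):
    p = sigmoid(z)
    label = 1 if p > 0.5 else 0   # 1 = positive review, 0 = negative review
    print(f"z={z:+.1f}  sigmoid={p:.4f}  label={label}")
```

Large negative inputs land near 0, large positive inputs land near 1, and an input of 0 sits exactly at 0.5.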

---
# Sentiment Network with PyTorch

Below is where you'll define the network.

![alt text](https://drive.google.com/uc?id=1B4fdKBByMFTjJ2LJv_oqmeTV4n62jn7S)
<img src="assets/network_diagram.png" width=40%>

The layers are as follows:
1. An [embedding layer](https://pytorch.org/docs/stable/nn.html#embedding) that converts our word tokens (integers) into embeddings of a specific size.
2. An [LSTM layer](https://pytorch.org/docs/stable/nn.html#lstm) defined by a hidden_state size and number of layers
3. A fully-connected output layer that maps the LSTM layer outputs to a desired output_size
4. A sigmoid activation layer which turns all outputs into a value 0-1; return **only the last sigmoid output** as the output of this network.

### The Embedding Layer

We need to add an [embedding layer](https://pytorch.org/docs/stable/nn.html#embedding) because there are 74000+ words in our vocabulary. It is massively inefficient to one-hot encode that many classes. So, instead of one-hot encoding, we can have an embedding layer and use that layer as a lookup table. You could train an embedding layer using Word2Vec, then load it here. But, it's fine to just make a new layer, using it for only dimensionality reduction, and let the network learn the weights.
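
Here is a minimal sketch of the lookup-table idea, assuming a toy vocabulary of 10 tokens and 4-dimensional embeddings (the notebook's real sizes are much larger). It checks that an embedding lookup gives the same result as multiplying one-hot vectors by the weight table, just without building the one-hot vectors.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
toy_vocab_size, toy_embedding_dim = 10, 4      # toy sizes, not the notebook's
embedding = nn.Embedding(toy_vocab_size, toy_embedding_dim)

tokens = torch.tensor([[0, 3, 7]])             # one "review" of three tokens

# Lookup: picks rows 0, 3 and 7 of the weight table
looked_up = embedding(tokens)                  # shape (1, 3, 4)

# Equivalent one-hot route: (1, 3, 10) @ (10, 4) -> (1, 3, 4)
one_hot = F.one_hot(tokens, num_classes=toy_vocab_size).float()
via_matmul = one_hot @ embedding.weight

print(looked_up.shape)
print(torch.allclose(looked_up, via_matmul))
```

With tens of thousands of words, the one-hot route would multiply by a mostly-zero matrix for every token, which is why the lookup is preferred.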


### The LSTM Layer(s)

We'll create an [LSTM](https://pytorch.org/docs/stable/nn.html#lstm) to use in our recurrent network, which takes in an input_size, a hidden_dim, a number of layers, a dropout probability (for dropout between multiple layers), and a batch_first parameter.
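
As a quick shape check, the sketch below builds a small `nn.LSTM` with `batch_first=True` and prints the shapes it returns. All sizes here are toy values chosen for illustration.

```python
import torch
import torch.nn as nn

toy_batch, toy_seq_len = 5, 7                  # toy sizes
input_size, hidden_dim, n_layers = 8, 16, 2

lstm = nn.LSTM(input_size, hidden_dim, n_layers, dropout=0.5, batch_first=True)

x = torch.randn(toy_batch, toy_seq_len, input_size)   # (batch, seq, feature)
h0 = torch.zeros(n_layers, toy_batch, hidden_dim)     # initial hidden state
c0 = torch.zeros(n_layers, toy_batch, hidden_dim)     # initial cell state

out, (hn, cn) = lstm(x, (h0, c0))

print(out.shape)  # (batch, seq, hidden_dim): one output per time step
print(hn.shape)   # (n_layers, batch, hidden_dim): final hidden state per layer
print(cn.shape)   # (n_layers, batch, hidden_dim): final cell state per layer
```

Note that `batch_first=True` only changes the layout of the input and output tensors; the hidden and cell states keep the layer dimension first.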

Most of the time, your network will have better performance with more layers, typically 2-3. Adding more layers allows the network to learn really complex relationships. 

> **Exercise:** Complete the `__init__`, `forward`, and `init_hidden` functions for the SentimentRNN model class.

Note: `init_hidden` should initialize the hidden and cell states of an LSTM layer to all zeros, and move those states to the GPU, if available.

In [None]:
# First checking if GPU is available
train_on_gpu=torch.cuda.is_available()

if(train_on_gpu):
    print('Training on GPU.')
else:
    print('No GPU available, training on CPU.')

Training on GPU.


**`__init__` explanation**

First I have an **embedding layer**, which should take in the size of our vocabulary (our number of integer tokens) and produce an embedding of `embedding_dim` size. So, as this model trains, this is going to create an embedding lookup table that has as many rows as we have word integers, and as many columns as the embedding dimension.

Then, I have an **LSTM layer**, which takes in inputs of `embedding_dim` size. So, it's accepting embeddings as inputs, and producing an output and hidden state of a hidden size. I am also specifying a number of layers, and a dropout value, and finally, I’m setting `batch_first` to True because we are using DataLoaders to batch our data like that!

Then, the LSTM outputs are passed to a dropout layer and then a fully-connected, linear layer that will produce `output_size` number of outputs. And finally, I’ve defined a sigmoid layer to convert the output to a value between 0-1.

**Feedforward behavior**

Moving on to the `forward` function, which takes in an input `x` and a `hidden` state, I am going to pass an input through these layers in sequence.

In [None]:
import torch.nn as nn

class SentimentRNN(nn.Module):
    """
    The RNN model that will be used to perform Sentiment analysis.
    """

    def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, n_layers, drop_prob=0.5):
        """
        Initialize the model by setting up the layers.
        """
        super(SentimentRNN, self).__init__()

        self.output_size = output_size
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim
        
        # define all layers
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, dropout=drop_prob, batch_first=True)
        
        # dropout layer
        self.dropout = nn.Dropout(0.5)
        
        # linear layer
        self.fc = nn.Linear(hidden_dim, output_size)
        self.sig = nn.Sigmoid()
        

    def forward(self, x, hidden):
        """
        Perform a forward pass of our model on some input and hidden state.
        """
        # getting the batch_size of my input x
        batch_size = x.size(0)
        
        # embeddings and lstm_out
        embeds = self.embedding(x)
        lstm_out, hidden = self.lstm(embeds, hidden)
        
        # stack up lstm outputs
        lstm_out = lstm_out.contiguous().view(-1, self.hidden_dim)
        
        # dropout and fully-connected layer
        out = self.dropout(lstm_out)
        out = self.fc(out)
        # sigmoid function
        sig_out = self.sig(out)
        
        # reshape to be batch_size first
        sig_out = sig_out.view(batch_size, -1)
        sig_out = sig_out[:, -1] # get last batch of labels
                
        # return last sigmoid output and hidden state
        return sig_out, hidden
    
    
    def init_hidden(self, batch_size):
        ''' Initializes hidden state '''
        # Create two new tensors with sizes n_layers x batch_size x hidden_dim,
        # initialized to zero, for hidden state and cell state of LSTM
        weight = next(self.parameters()).data

        if (train_on_gpu):
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda(), 
                     weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda())
        else:
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_(), 
                     weight.new(self.n_layers, batch_size, self.hidden_dim).zero_())
        
        return hidden


---
**`forward` explanation**

So, first, I'm getting the `batch_size` of my input x, which I’ll use for shaping my data. Then, I'm passing x through the embedding layer to get my embeddings as output.

These embeddings are passed to my LSTM layer, alongside a hidden state, and this returns `lstm_out` and a new `hidden` state! Then I'm going to stack up the outputs of my LSTM to pass to my last linear layer.

Then I keep going, passing the reshaped `lstm_out` to a dropout layer and my linear layer, which should return a specified number of outputs that I will pass to my sigmoid activation function.

Now, I want to make sure that I’m returning only the **last** of these sigmoid outputs for a batch of input data, so I’m going to reshape these outputs so that batch_size is the first dimension. Then I'm getting the last output of each sequence by calling `sig_out[:, -1]`, and that’s going to give me the batch of last labels that I want!

Finally, I am returning that output and the hidden state produced by the LSTM layer.
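
To see what the reshape and `[:, -1]` indexing select, here is a small sketch with random stand-in data. It leaves dropout out and assumes a single output unit, and it compares the flatten-then-reshape route used in `forward` with simply picking the last time step first. The sizes and variable names are toy assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
toy_batch, toy_seq_len, hidden_dim = 3, 5, 4   # toy sizes
lstm_out = torch.randn(toy_batch, toy_seq_len, hidden_dim)  # stand-in for LSTM output
fc = nn.Linear(hidden_dim, 1)

# Route used in forward: flatten, apply fc + sigmoid, reshape, take last column
flat = lstm_out.contiguous().view(-1, hidden_dim)          # (batch * seq, hidden)
sig_all = torch.sigmoid(fc(flat)).view(toy_batch, -1)      # (batch, seq)
last_a = sig_all[:, -1]                                    # (batch,)

# Shortcut: select the last time step first, then apply fc + sigmoid
last_b = torch.sigmoid(fc(lstm_out[:, -1, :])).squeeze(1)  # (batch,)

print(last_a.shape)
print(torch.allclose(last_a, last_b, atol=1e-6))
```

Both routes pick out one value per review: the sigmoid output for the final word of each sequence.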

**`init_hidden`**

That completes my forward function and then I have one more: `init_hidden` and this is just the same as you’ve seen before. The hidden and cell states of an LSTM are a tuple of values and each of these is size (n_layers by batch_size, by hidden_dim). I’m initializing these hidden weights to all zeros, and moving to a gpu if available.

---

## Instantiate the network

Here, we'll instantiate the network. First up, defining the hyperparameters.

* `vocab_size`: Size of our vocabulary or the range of values for our input, word tokens.
* `output_size`: Size of our desired output; the number of class scores we want to output (pos/neg).
* `embedding_dim`: Number of columns in the embedding lookup table; size of our embeddings.
* `hidden_dim`: Number of units in the hidden layers of our LSTM cells. Usually larger is better performance wise. Common values are 128, 256, 512, etc.
* `n_layers`: Number of LSTM layers in the network. Typically between 1-3

> **Exercise:** Define the model  hyperparameters.


In [None]:
# Instantiate the model w/ hyperparams
vocab_size = len(vocab_to_int)+1 # +1 for the padding + our word tokens
output_size = 1
embedding_dim = 400 
hidden_dim = 256
n_layers = 2

net = SentimentRNN(vocab_size, output_size, embedding_dim, hidden_dim, n_layers)

print(net)

SentimentRNN(
  (embedding): Embedding(74073, 400)
  (lstm): LSTM(400, 256, num_layers=2, batch_first=True, dropout=0.5)
  (dropout): Dropout(p=0.5, inplace=False)
  (fc): Linear(in_features=256, out_features=1, bias=True)
  (sig): Sigmoid()
)


This should look familiar, but the main thing to note here is our vocab_size.

This is actually the length of our `vocab_to_int` dictionary (all our unique words) **plus one** to account for the `0`-token that we added, when we padded our input features. So, if you do data pre-processing, you may end up with one or two extra, special tokens that you’ll need to account for, in this parameter!
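
A tiny sketch of why the extra row is needed, using a hypothetical four-word vocabulary whose ids start at 1 (with 0 reserved for padding):

```python
# Toy vocabulary: real word ids start at 1, and 0 is reserved for padding
toy_vocab_to_int = {"the": 1, "movie": 2, "was": 3, "great": 4}

padded_review = [0, 0, 1, 2, 3, 4]          # left-padded with the 0 token

num_words = len(toy_vocab_to_int)           # 4
largest_id = max(padded_review)             # 4
toy_vocab_size = num_words + 1              # 5 rows: ids 0..4

# An embedding table with N rows accepts ids 0..N-1 only
print(largest_id < num_words)       # False: a 4-row table would reject id 4
print(largest_id < toy_vocab_size)  # True: 5 rows cover padding plus every word
```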

Then, I want my `output_size` to be 1; this will be a sigmoid value between 0 and 1, indicating whether a review is positive or negative.

Then I have my embedding and hidden dimension. The embedding dimension is just a smaller representation of my vocabulary of 70k words and I think any value between like 200 and 500 or so would work, here. I’ve chosen 400. Similarly, for our hidden dimension, I think 256 hidden features should be enough to distinguish between positive and negative reviews.

I’m also choosing to make a 2 layer LSTM. Finally, I’m instantiating my model and printing it out to make sure everything looks good.

---
**Training and Optimization**

The training code should look pretty familiar. One new detail is that we'll be using a new kind of cross entropy loss that is designed to work with a single sigmoid output.

- [BCELoss](https://pytorch.org/docs/stable/nn.html#bceloss), or Binary Cross Entropy Loss, applies cross entropy loss to a single value between 0 and 1.

We'll define an Adam optimizer, as usual.

---
## Training

Below is the typical training code. If you want to do this yourself, feel free to delete all this code and implement it yourself. You can also add code to save a model by name.

>We'll also be using a new kind of cross entropy loss, which is designed to work with a single Sigmoid output. [BCELoss](https://pytorch.org/docs/stable/nn.html#bceloss), or **Binary Cross Entropy Loss**, applies cross entropy loss to a single value between 0 and 1.
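
For intuition, here is a pure-Python sketch of the binary cross entropy formula, averaged over a batch. It is a simplified stand-in rather than the library implementation (for example, it would fail on a prediction of exactly 0 or 1), and the probabilities are made up.

```python
import math

def binary_cross_entropy(probs, targets):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over paired predictions and labels."""
    losses = [-(y * math.log(p) + (1 - y) * math.log(1 - p))
              for p, y in zip(probs, targets)]
    return sum(losses) / len(losses)

confident_right = binary_cross_entropy([0.9, 0.1], [1, 0])
unsure = binary_cross_entropy([0.5, 0.5], [1, 0])
confident_wrong = binary_cross_entropy([0.1, 0.9], [1, 0])

print(f"{confident_right:.4f} {unsure:.4f} {confident_wrong:.4f}")  # 0.1054 0.6931 2.3026
```

A loss of about 0.69 corresponds to always predicting 0.5, so that is a handy reference point when reading the training log.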

We also have some data and training hyperparameters:

* `lr`: Learning rate for our optimizer.
* `epochs`: Number of times to iterate through the training dataset.
* `clip`: The maximum gradient norm to clip at (to prevent exploding gradients).
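
The idea behind clipping by norm can be sketched in a few lines of plain Python. This is a simplified illustration over a flat list of gradient values; `clip_by_total_norm` is a hypothetical helper, and the library function used in the training loop differs in details such as numerical safeguards.

```python
import math

def clip_by_total_norm(grads, max_norm):
    """Rescale a list of gradient values so their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        # Already small enough: leave the gradients untouched
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm

clipped, norm_before = clip_by_total_norm([6.0, 8.0], max_norm=5)
print(norm_before)  # 10.0
print(clipped)      # [3.0, 4.0]
```

The direction of the gradient is preserved; only its length is capped.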

In [None]:
# loss and optimization functions
lr=0.001

criterion = nn.BCELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=lr)


**Output, target format**

You should also notice that, in the training loop, we are making sure that our outputs are squeezed so that they do not have an empty dimension **output.squeeze()** and the labels are float tensors, **labels.float()**. Then we perform backpropagation as usual.

**Train and eval mode**

Below, you can also see that we switch between train and evaluation mode when the model is training versus when it is being evaluated on validation data!

**Training Loop**

Below, you’ll see a usual training loop.

I’m actually only going to do four epochs of training because that's about when I noticed the validation loss stop decreasing.

- You can see that I am initializing my hidden state before entering the batch loop then have my usual detachment from history for the hidden state and backpropagation steps.
- I’m getting my input and label data from my `train_loader`, then applying my model to the inputs and comparing the outputs with the true labels.
- I also have some code that checks performance on my validation set, which, if you want, may be a great thing to use to decide when to stop training or which best model to save!
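
As one possible way to use the validation loss for choosing which model to save, the sketch below tracks when the loss reaches a new minimum. The loss values are hypothetical, and `best_checkpoint_steps` is an illustrative helper; in a real loop the marked line is where a call such as `torch.save(net.state_dict(), ...)` could go.

```python
def best_checkpoint_steps(val_losses):
    """Return the indices at which the validation loss reaches a new minimum."""
    best = float("inf")
    improved_at = []
    for step, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            improved_at.append(step)   # a real loop could save the model here
    return improved_at, best

# Hypothetical validation losses, one per evaluation
steps, best = best_checkpoint_steps([0.66, 0.64, 0.58, 0.63, 0.54, 0.59])
print(steps)  # [0, 1, 2, 4]
print(best)   # 0.54
```

If the list of improvements stops growing for several evaluations in a row, that is a reasonable signal to stop training.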

In [None]:
# training params

epochs = 4 # 3-4 is approx where I noticed the validation loss stop decreasing

counter = 0
print_every = 100
clip=5 # gradient clipping

# move model to GPU, if available
if(train_on_gpu):
    net.cuda()

net.train()
# train for some number of epochs
for e in range(epochs):
    # initialize hidden state
    h = net.init_hidden(batch_size)

    # batch loop
    for inputs, labels in train_loader:
        counter += 1

        if(train_on_gpu):
            inputs, labels = inputs.cuda(), labels.cuda()

        # Creating new variables for the hidden state, otherwise
        # we'd backprop through the entire training history
        h = tuple([each.data for each in h])

        # zero accumulated gradients
        net.zero_grad()

        # get the output from the model
        output, h = net(inputs, h)

        # calculate the loss and perform backprop
        loss = criterion(output.squeeze(), labels.float())
        loss.backward()
        # `clip_grad_norm` helps prevent the exploding gradient problem in RNNs / LSTMs.
        nn.utils.clip_grad_norm_(net.parameters(), clip)
        optimizer.step()

        # loss stats
        if counter % print_every == 0:
            # Get validation loss
            val_h = net.init_hidden(batch_size)
            val_losses = []
            net.eval()
            for inputs, labels in valid_loader:

                # Creating new variables for the hidden state, otherwise
                # we'd backprop through the entire training history
                val_h = tuple([each.data for each in val_h])

                if(train_on_gpu):
                    inputs, labels = inputs.cuda(), labels.cuda()

                output, val_h = net(inputs, val_h)
                val_loss = criterion(output.squeeze(), labels.float())

                val_losses.append(val_loss.item())

            net.train()
            print("Epoch: {}/{}...".format(e+1, epochs),
                  "Step: {}...".format(counter),
                  "Loss: {:.6f}...".format(loss.item()),
                  "Val Loss: {:.6f}".format(np.mean(val_losses)))

Epoch: 1/4... Step: 100... Loss: 0.655416... Val Loss: 0.658124
Epoch: 1/4... Step: 200... Loss: 0.670902... Val Loss: 0.639982
Epoch: 1/4... Step: 300... Loss: 0.635361... Val Loss: 0.584437
Epoch: 1/4... Step: 400... Loss: 0.690364... Val Loss: 0.636807
Epoch: 2/4... Step: 500... Loss: 0.454073... Val Loss: 0.594098
Epoch: 2/4... Step: 600... Loss: 0.702295... Val Loss: 0.586064
Epoch: 2/4... Step: 700... Loss: 0.441579... Val Loss: 0.540988
Epoch: 2/4... Step: 800... Loss: 0.609320... Val Loss: 0.594099
Epoch: 3/4... Step: 900... Loss: 0.361510... Val Loss: 0.504463
Epoch: 3/4... Step: 1000... Loss: 0.329989... Val Loss: 0.558757
Epoch: 3/4... Step: 1100... Loss: 0.167394... Val Loss: 0.484343
Epoch: 3/4... Step: 1200... Loss: 0.333570... Val Loss: 0.478562
Epoch: 4/4... Step: 1300... Loss: 0.229571... Val Loss: 0.517741
Epoch: 4/4... Step: 1400... Loss: 0.375394... Val Loss: 0.542310
Epoch: 4/4... Step: 1500... Loss: 0.201725... Val Loss: 0.497360
Epoch: 4/4... Step: 1600... Loss: 

Make sure to take a look at how training **and** validation loss decrease during training! Then, once you're satisfied with your trained model, you can test it out in a couple ways to see how it behaves on new data!

---
## Testing

There are a few ways to test your network.

* **Test data performance:** First, we'll see how our trained model performs on all of our defined test_data, above. We'll calculate the average loss and accuracy over the test data.

* **Inference on user-generated data:** Second, we'll see if we can input just one example review at a time (without a label), and see what the trained model predicts. Looking at new, user input data like this, and predicting an output label, is called **inference**.

**Testing the Trained Model**

I want to show you two great ways to test: using test data and using inference. The first is similar to what you’ve seen in our CNN lessons. I am iterating through the test data in the **`test_loader`**, recording the test loss and calculating the accuracy based on how many labels this model got correct!

I’m doing this by looking at the **rounded value** of our output. Recall that this is a sigmoid output between 0-1 and so rounding this value will give us an integer that is the most likely label: 0 or 1. Then I’m comparing that predicted label to the true label; if it matches, I record that as a correctly-labeled test review.

In [None]:
# Get test data loss and accuracy

test_losses = [] # track loss
num_correct = 0

# init hidden state
h = net.init_hidden(batch_size)

net.eval()
# iterate over test data
for inputs, labels in test_loader:

    # Creating new variables for the hidden state, otherwise
    # we'd backprop through the entire training history
    h = tuple([each.data for each in h])

    if(train_on_gpu):
        inputs, labels = inputs.cuda(), labels.cuda()
    
    # get predicted outputs
    output, h = net(inputs, h)
    
    # calculate loss
    test_loss = criterion(output.squeeze(), labels.float())
    test_losses.append(test_loss.item())
    
    # convert output probabilities to predicted class (0 or 1)
    pred = torch.round(output.squeeze())  # rounds to the nearest integer
    
    # compare predictions to true label
    correct_tensor = pred.eq(labels.float().view_as(pred))
    correct = np.squeeze(correct_tensor.numpy()) if not train_on_gpu else np.squeeze(correct_tensor.cpu().numpy())
    num_correct += np.sum(correct)


# -- stats! -- ##
# avg test loss
print("Test loss: {:.3f}".format(np.mean(test_losses)))

# accuracy over all test data
test_acc = num_correct/len(test_loader.dataset)
print("Test accuracy: {:.3f}".format(test_acc))

Test loss: 0.487
Test accuracy: 0.810


Above, I’m printing out the average test loss and the accuracy, which is just the number of correctly classified items divided by the total number of pieces of test data.

We can see that the test loss is **`0.487`** and the accuracy is about **81.0%**!

Next, you're ready for your last task! Which is to define a **`predict`** function to perform inference on any given text review!

### Inference on a test review

You can change this **`test_review`** to any text that you want. Read it and think: is it pos or neg? Then see if your model predicts correctly!
    
> **Exercise:** Write a `predict` function that takes in a trained net, a plain text_review, and a sequence length, and prints out a custom statement for a positive or negative review!
* You can use any functions that you've already defined or define any helper functions you want to complete `predict`, but it should just take in a trained net, a text review, and a sequence length.


**Inference**

Let's put all these pieces together! One of the coolest ways to test a model like this is to give it user-generated data, without any true label, and see what happens. So, in this case, that data will just be a single string: a review that you can write. Here’s just one `test_review` as an example:

In [None]:
# negative test review
test_review_neg = 'The worst movie I have seen; acting was terrible and I want my money back. This movie had bad acting and the dialogue was slow.'


We can see that this review is a negative one, but let's see if our model can identify its sentiment correctly!

Our task is to write a `predict` function that takes in a trained model, a `test_review` like this one that is just normal text and punctuation, and a `sequence_length` for padding.

The process by which you make predictions based on user data, is called **`inference`**.

**Pre-process the `test_review`**

The first thing we'll have to do is process the `test_review` so that it is converted into a tensor that our model can take as input. In fact, this involves quite a lot of pre-processing, but nothing that you haven't seen before!

I broke this down into a series of steps.

I have a helper function `tokenize_review` that is responsible for doing some data processing on my test_review.

It takes in my `test_review`, and then does a couple of things:

1. First, I convert my test_review to lowercase, and remove any punctuation, so I’m left with all text.
2. Then I break it into individual words with split(), and I’m left with a list of words in the review.
3. I encode those words using the `vocab_to_int` dictionary that we already defined, near the start of this lesson.

Now, I am assuming a few things here: that this is one review, not a batch, and that the review only includes words already in our dictionary. In this case that will be true, but you can add code to handle unknown words; I just didn’t do that in my model.
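
If you do want to handle unknown words, one simple option is to skip any word that is missing from the dictionary. The sketch below is an assumption about how that could look, not the approach used in this notebook; `tokenize_review_safe` and the toy vocabulary are hypothetical.

```python
from string import punctuation

def tokenize_review_safe(review, vocab):
    """Lowercase, strip punctuation, split, and keep only words found in vocab."""
    text = ''.join(c for c in review.lower() if c not in punctuation)
    # Words missing from vocab are dropped instead of raising a KeyError
    return [[vocab[word] for word in text.split() if word in vocab]]

toy_vocab = {"the": 1, "movie": 2, "was": 3, "bad": 4}   # hypothetical ids
print(tokenize_review_safe("The movie was, frankly, BAD!", toy_vocab))  # [[1, 2, 3, 4]]
```

Another common choice is to map every unknown word to a dedicated token, but that token would need to exist in the vocabulary during training.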

In [None]:
from string import punctuation

def tokenize_review(test_review):
    test_review = test_review.lower() # Lowercase
    # get rid of punctuation
    test_text = ''.join([c for c in test_review if c not in punctuation])
    
    # splitting by spaces
    text_words = test_text.split()
    
    # tokens
    test_ints = []
    test_ints.append([vocab_to_int[word] for word in text_words])
    
    return test_ints


In [None]:
# test code and generate tokenized review
test_ints = tokenize_review(test_review_neg)
print(test_ints)

[[1, 247, 18, 10, 28, 108, 113, 14, 388, 2, 10, 181, 60, 273, 144, 11, 18, 68, 76, 113, 2, 1, 410, 14, 539]]


In [None]:
# test sequence padding
seq_length=200
features = pad_features(test_ints, seq_length)

print(features)

[[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   1 247  18  10  28
  108 113  14 388   2  10 181  60 273 144  11  18  68  76 113   2   1 410
   14 539]]


In [None]:
# test conversion to tensor and pass into your model
feature_tensor = torch.from_numpy(features)
print(feature_tensor.size())

torch.Size([1, 200])


Okay, so this tokenize function returns a list of integers; my tokenized review!

**Padding and converting into a Tensor**

For my next couple of steps, I’m going to pad the ints, returned by the `tokenize_review` function and shape them into our `sequence_length` size; since our model was trained on sequence lengths of 200, I’m going to use the same length, here. I'll pad it using the `pad_features` function that we defined earlier.
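
As a reminder of what the padding step does, here is a simplified single-review sketch. The left-padding with zeros matches the printed output above; keeping only the first `seq_length` tokens of a long review is an assumption here, and `left_pad` is an illustrative stand-in for `pad_features`, not its actual definition.

```python
def left_pad(tokens, seq_length, pad_id=0):
    """Left-pad with pad_id, or keep only the first seq_length tokens."""
    tokens = list(tokens)[:seq_length]
    return [pad_id] * (seq_length - len(tokens)) + tokens

print(left_pad([1, 247, 18], 6))              # [0, 0, 0, 1, 247, 18]
print(left_pad([5, 6, 7, 8, 9, 10, 11], 6))   # [5, 6, 7, 8, 9, 10]
```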

Finally, I’m going to convert the padded result into a Tensor. So, these are all the steps, and I’m going to wrap this all up in my predict function.

In [None]:
def predict(net, test_review, sequence_length=200):
    ''' Prints out whether a given review is predicted to be 
        positive or negative in sentiment, using a trained model.
        
        params:
        net - A trained net 
        test_review - a review made of normal text and punctuation
        sequence_length - the padded length of a review
        '''
    net.eval()
    
    # tokenize review
    test_ints = tokenize_review(test_review)
    
    # pad tokenized sequence
    seq_length=sequence_length
    features = pad_features(test_ints, seq_length)
    
    # convert to tensor to pass into your model
    feature_tensor = torch.from_numpy(features)
    
    batch_size = feature_tensor.size(0)
    
    # initialize hidden state
    h = net.init_hidden(batch_size)
    
    if(train_on_gpu):
        feature_tensor = feature_tensor.cuda()
    
    # get the output from the model
    output, h = net(feature_tensor, h)
    
    # convert output probabilities to predicted class (0 or 1)
    pred = torch.round(output.squeeze())
    
    # print output value, before rounding
    print('Prediction value, pre-rounding: {:.6f}'.format(output.item()))
    
    
    # print custom response based on whether test_review is pos/neg
    if(pred.item()==1):
        print("Positive review detected!")
    else:
        print("Negative review detected!")
        

So, using the passed in arguments, I’m tokenizing my review using my helper function, then padding it using my pad function, and converting it into a Tensor that can be seen by my model.

Then, I’m passing this tensor into my trained net which will return an output of length one. With this output, I can grab the most likely class, which will be the rounded value 0 or 1; this is my prediction!

Lastly, I want to print out a custom message for a positive or negative detected review, and I’m doing that at the bottom of the above function!

**You can test this out on sample positive and negative text reviews to see how this trained model behaves!** Below, you can see how it identifies our negative test review correctly.

In [None]:
# call function
# try negative and positive reviews!
seq_length=200 # good to use the length that was trained on
predict(net, test_review_neg, seq_length)

Prediction value, pre-rounding: 0.003804
Negative review detected!


**Conclusion**

Now that you have a trained model and a predict function, you can pass in any kind of text and this model will predict whether the text has a positive or negative sentiment. You can use this to try to find what words it associates with positive or negative sentiment.

Later, you'll learn how to deploy a model like this to a production environment so that it can respond to any kind of user data put into a web app!

### Try out test_reviews of your own!

Now that you have a trained model and a predict function, you can pass in _any_ kind of text and this model will predict whether the text has a positive or negative sentiment. Push this model to its limits and try to find what words it associates with positive or negative.


In [None]:
# positive test review
test_review_pos = 'This movie had the best acting and the dialogue was so good. I loved it.'


In [None]:
# test code and generate tokenized review
# (this call passes test_review_neg, so the tokens printed below match the negative review above)
test_ints2 = tokenize_review(test_review_neg)
print(test_ints2)

[[1, 247, 18, 10, 28, 108, 113, 14, 388, 2, 10, 181, 60, 273, 144, 11, 18, 68, 76, 113, 2, 1, 410, 14, 539]]


In [None]:
# test sequence padding
seq_length=200
features2 = pad_features(test_ints2, seq_length)

print(features2)

[[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   1 247  18  10  28
  108 113  14 388   2  10 181  60 273 144  11  18  68  76 113   2   1 410
   14 539]]


In [None]:
# test conversion to tensor and pass into your model
feature_tensor2 = torch.from_numpy(features2)
print(feature_tensor2.size())

torch.Size([1, 200])


In [None]:
# call function
# try positive reviews!
seq_length=200 # good to use the length that was trained on
predict(net, test_review_pos, seq_length)

Prediction value, pre-rounding: 0.978377
Positive review detected!
