# Exercise 5 (NLP): Very Deep Learning

**Natural language processing (NLP)** is the ability of a computer program to understand human language as it is spoken or written. It involves a pipeline of steps, and by the end of this exercise we will be able to classify the sentiment of a given review as POSITIVE or NEGATIVE.


Before starting, it is important to understand why RNNs are needed. Watch this Stanford lecture before beginning the exercise:

https://www.youtube.com/watch?v=iX5V1WpxxkY

When done, let's begin. 

In [64]:
# In this exercise, we import libraries where they are first needed so that their purpose is clear.
# In general this is bad practice, so don't get used to it.
import numpy as np

# read data from reviews and labels file.
with open('data/reviews.txt', 'r') as f:
    reviews_ = f.readlines()
with open('data/labels.txt', 'r') as f:
    labels = f.readlines()

In [65]:
# One of the most important steps before starting any ML task is to look at the data.
for i in range(5):
    print(labels[i] + "\t: " + reviews_[i][:100] + "...")

positive
	: bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life...
negative
	: story of a man who has unnatural feelings for a pig . starts out with a opening scene that is a terr...
positive
	: homelessness  or houselessness as george carlin stated  has been an issue for years but never a plan...
negative
	: airport    starts as a brand new luxury    plane is loaded up with valuable paintings  such belongin...
positive
	: brilliant over  acting by lesley ann warren . best dramatic hobo lady i have ever seen  and love sce...




We can see that there are a lot of punctuation marks such as the full stop (.), comma (,), newline (\n), and so on, and we need to remove them.

Here is a list of all the punctuation marks that need to be removed:
```
(!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~)
```


## Task 1: Remove all the punctuation marks from the reviews.
There are many ways of doing it: regex, spaCy, or `punctuation` from the `string` module.
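
As an illustration of the `string` option mentioned above, here is a minimal sketch based on `str.translate`. The function name `strip_punctuation` and the sample sentence are made up for this example, and the approach only removes the ASCII punctuation characters listed above.

```python
import string


def strip_punctuation(text):
    """Delete every character that appears in string.punctuation."""
    # str.maketrans with a third argument maps those characters to None
    table = str.maketrans('', '', string.punctuation)
    return text.translate(table)


sample = "bromwell high is a cartoon comedy . it isn't bad !"
print(strip_punctuation(sample))
```

Note that removing a punctuation mark does not remove the spaces around it, so runs of several spaces can remain in the result.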

In [106]:
# Make everything lower case so that the whole dataset is uniform.
# For testing, keep only the first 20 reviews to make the dataset a bit smaller.
reviews_ = reviews_[:20]
reviews = ''.join(reviews_).lower()

In [107]:
# Complete the function below to remove punctuation and save the result in no_punct_text
import re


def text_without_punct(reviews):
    """Use regex substitution to remove punctuation."""
    # Drop every character that is neither a word character nor whitespace
    reviews = re.sub(r'[^\w\s]', '', reviews)
    return reviews

no_punct_text = text_without_punct(reviews)
print(no_punct_text)

bromwell high is a cartoon comedy  it ran at the same time as some other programs about school life  such as  teachers   my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers   the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students  when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled          at           high  a classic line inspector i  m here to sack one of your teachers  student welcome to bromwell high  i expect that many adults of my age think that bromwell high is far fetched  what a pity that it isn  t   
story of a man who has unnatural feelings for a pig  starts out with a opening scene that is a terrific example of absurd comedy  a formal orchestra audience is turned into an insane  violent mo

In [108]:
# Split the formatted no_punct_text into words
def split_in_words(no_punct_text):
    """Return every run of word characters (letters, digits, underscore) or apostrophes as one word."""
    word_array = re.findall(r"[\w']+", no_punct_text)
    return word_array

words = split_in_words(no_punct_text)
print(words)

['bromwell', 'high', 'is', 'a', 'cartoon', 'comedy', 'it', 'ran', 'at', 'the', 'same', 'time', 'as', 'some', 'other', 'programs', 'about', 'school', 'life', 'such', 'as', 'teachers', 'my', 'years', 'in', 'the', 'teaching', 'profession', 'lead', 'me', 'to', 'believe', 'that', 'bromwell', 'high', 's', 'satire', 'is', 'much', 'closer', 'to', 'reality', 'than', 'is', 'teachers', 'the', 'scramble', 'to', 'survive', 'financially', 'the', 'insightful', 'students', 'who', 'can', 'see', 'right', 'through', 'their', 'pathetic', 'teachers', 'pomp', 'the', 'pettiness', 'of', 'the', 'whole', 'situation', 'all', 'remind', 'me', 'of', 'the', 'schools', 'i', 'knew', 'and', 'their', 'students', 'when', 'i', 'saw', 'the', 'episode', 'in', 'which', 'a', 'student', 'repeatedly', 'tried', 'to', 'burn', 'down', 'the', 'school', 'i', 'immediately', 'recalled', 'at', 'high', 'a', 'classic', 'line', 'inspector', 'i', 'm', 'here', 'to', 'sack', 'one', 'of', 'your', 'teachers', 'student', 'welcome', 'to', 'bromw

In [109]:
# Once you are done, print the first ten words; they should match the following output
words[:10]

['bromwell', 'high', 'is', 'a', 'cartoon', 'comedy', 'it', 'ran', 'at', 'the']

In [110]:
# Print the total number of words
len(words)

4962

In [111]:
# Total number of unique words
len(set(words))

1605


The next step is to create a vocabulary, so that every word is mapped to an integer.
```
Example: 1: hello, 2: I, 3: am, 4: Robo and so on...
```


In [112]:
# Let's create a vocabulary out of the words.

# Feel free to use this import.
from collections import Counter

# Count how often each word occurs.
def word_count(words):
    return Counter(words)

counts=word_count(words)

In [113]:
# If you did everything correctly, this is what you should get as output.
print(counts['wonderful'])

print(counts['bad'])

0
2


## Task 2: Word to Integer and Integer to word
The task is to map every word to an integer value and then vice-versa. 
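
A minimal sketch of both directions of the mapping is shown here. The names `build_mappings` and `int_to_vocab` are hypothetical, and assigning indices by descending frequency is just one common choice, not a requirement of the task. Index 0 is left unused so that it stays free for padding.

```python
from collections import Counter


def build_mappings(words):
    """Return (vocab_to_int, int_to_vocab), with indices starting at 1."""
    counts = Counter(words)
    # most_common() yields (word, count) pairs, most frequent first
    vocab_to_int = {
        word: idx
        for idx, (word, _) in enumerate(counts.most_common(), start=1)
    }
    # Invert the dictionary to get the integer-to-word direction
    int_to_vocab = {idx: word for word, idx in vocab_to_int.items()}
    return vocab_to_int, int_to_vocab


toy_words = ['the', 'cat', 'saw', 'the', 'dog']
vocab_to_int, int_to_vocab = build_mappings(toy_words)
print(vocab_to_int['the'])   # the most frequent word gets index 1
print(int_to_vocab[1])
```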


In [114]:
# Define a vocabulary for the words
def vocabulary(counts):
    # Return a list of (word, count) pairs
    return list(counts.items())

vocab = vocabulary(counts)
print(len(vocab))
vocab[1]

1605


('high', 6)

In [115]:
# Map each vocab word to an integer. Start the indexing at 1, because we will use
# 0 for padding and we don't want to mix the two.
def vocabulary_to_integer(vocab):
    vocab_to_int = dict()
    # Each vocab item is a (word, count) pair; only the word is used as the key
    for i, item in enumerate(vocab, start=1):
        vocab_to_int[item[0]] = i
    return vocab_to_int

vocab_to_int = vocabulary_to_integer(vocab)
print(vocab_to_int)

{'bromwell': 1, 'high': 2, 'is': 3, 'a': 4, 'cartoon': 5, 'comedy': 6, 'it': 7, 'ran': 8, 'at': 9, 'the': 10, 'same': 11, 'time': 12, 'as': 13, 'some': 14, 'other': 15, 'programs': 16, 'about': 17, 'school': 18, 'life': 19, 'such': 20, 'teachers': 21, 'my': 22, 'years': 23, 'in': 24, 'teaching': 25, 'profession': 26, 'lead': 27, 'me': 28, 'to': 29, 'believe': 30, 'that': 31, 's': 32, 'satire': 33, 'much': 34, 'closer': 35, 'reality': 36, 'than': 37, 'scramble': 38, 'survive': 39, 'financially': 40, 'insightful': 41, 'students': 42, 'who': 43, 'can': 44, 'see': 45, 'right': 46, 'through': 47, 'their': 48, 'pathetic': 49, 'pomp': 50, 'pettiness': 51, 'of': 52, 'whole': 53, 'situation': 54, 'all': 55, 'remind': 56, 'schools': 57, 'i': 58, 'knew': 59, 'and': 60, 'when': 61, 'saw': 62, 'episode': 63, 'which': 64, 'student': 65, 'repeatedly': 66, 'tried': 67, 'burn': 68, 'down': 69, 'immediately': 70, 'recalled': 71, 'classic': 72, 'line': 73, 'inspector': 74, 'm': 75, 'here': 76, 'sack': 77

In [116]:
# Verify that the length is the same and that 'and' is mapped to the correct integer value.
print(len(vocab_to_int))
vocab_to_int['and']

1605


60
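
The indexing starts at 1 because 0 is reserved for padding. As a rough sketch of what that padding could look like later on, assuming reviews are left-padded or truncated to a fixed length (the helper name `pad_left` and the length 5 are made up for illustration):

```python
def pad_left(encoded_review, seq_length):
    """Left-pad with 0 (or truncate) a list of word indices to seq_length entries."""
    trimmed = encoded_review[:seq_length]
    # Since no real word is mapped to 0, padding cannot be confused with a word
    return [0] * (seq_length - len(trimmed)) + trimmed


print(pad_left([1, 2, 3], 5))           # [0, 0, 1, 2, 3]
print(pad_left([1, 2, 3, 4, 5, 6], 5))  # [1, 2, 3, 4, 5]
```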

Let's see which words are most common in the positive reviews and which in the negative reviews.

In [117]:
positive_counts = Counter()
negative_counts = Counter()

In [118]:
for i in range(len(reviews_)):
    # Labels come straight from readlines(), so each one still ends with '\n'
    if labels[i] == 'positive\n':
        for word in reviews_[i].split(" "):
            positive_counts[word] += 1
    else:
        for word in reviews_[i].split(" "):
            negative_counts[word] += 1

In [119]:
labels

['positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive

In [120]:
positive_counts.most_common()

[('', 504),
 ('.', 134),
 ('the', 115),
 ('a', 84),
 ('to', 63),
 ('br', 54),
 ('and', 49),
 ('is', 43),
 ('of', 43),
 ('it', 33),
 ('that', 32),
 ('in', 31),
 ('s', 26),
 ('this', 23),
 ('as', 22),
 ('i', 21),
 ('his', 18),
 ('one', 17),
 ('on', 17),
 ('you', 17),
 ('he', 17),
 ('who', 16),
 ('williams', 16),
 ('with', 14),
 ('or', 13),
 ('has', 13),
 ('but', 13),
 ('be', 13),
 ('film', 13),
 ('t', 12),
 ('for', 12),
 ('from', 12),
 ('by', 12),
 ('was', 12),
 ('like', 11),
 ('not', 11),
 ('collette', 11),
 ('all', 10),
 ('\n', 10),
 ('an', 10),
 ('have', 10),
 ('good', 10),
 ('some', 9),
 ('about', 9),
 ('when', 9),
 ('what', 9),
 ('her', 9),
 ('story', 9),
 ('robin', 9),
 ('bolt', 8),
 ('no', 8),
 ('can', 7),
 ('their', 7),
 ('where', 7),
 ('out', 7),
 ('movie', 7),
 ('high', 6),
 ('been', 6),
 ('brooks', 6),
 ('are', 6),
 ('being', 6),
 ('best', 6),
 ('stettner', 6),
 ('at', 5),
 ('time', 5),
 ('other', 5),
 ('much', 5),
 ('see', 5),
 ('if', 5),
 ('they', 5),
 ('do', 5),
 ('more', 5

In [121]:
negative_counts.most_common()

[('', 454),
 ('the', 139),
 ('.', 133),
 ('of', 75),
 ('a', 63),
 ('to', 53),
 ('i', 48),
 ('and', 43),
 ('it', 40),
 ('is', 39),
 ('s', 33),
 ('in', 33),
 ('this', 33),
 ('was', 31),
 ('for', 26),
 ('with', 25),
 ('br', 24),
 ('t', 24),
 ('as', 20),
 ('but', 18),
 ('out', 17),
 ('or', 17),
 ('that', 16),
 ('just', 16),
 ('film', 16),
 ('his', 15),
 ('not', 14),
 ('have', 13),
 ('he', 13),
 ('on', 12),
 ('airport', 12),
 ('one', 12),
 ('an', 11),
 ('from', 10),
 ('\n', 10),
 ('don', 10),
 ('only', 10),
 ('what', 10),
 ('allen', 10),
 ('who', 9),
 ('by', 9),
 ('no', 9),
 ('you', 9),
 ('they', 9),
 ('were', 9),
 ('are', 9),
 ('much', 9),
 ('even', 8),
 ('be', 8),
 ('so', 8),
 ('if', 8),
 ('could', 8),
 ('there', 8),
 ('at', 8),
 ('woody', 8),
 ('time', 7),
 ('think', 7),
 ('plane', 7),
 ('like', 7),
 ('my', 7),
 ('little', 7),
 ('about', 7),
 ('really', 7),
 ('bergman', 7),
 ('has', 6),
 ('scene', 6),
 ('too', 6),
 ('than', 6),
 ('great', 6),
 ('can', 6),
 ('two', 6),
 ('while', 6),
 ('f

The above just shows the most common words in the positive and negative reviews. However, there are a lot of uninformative words like `the`, `a`, `was`, and so on. Can you find a way to show the relevant words instead of these?

```
Hint: Stop Words removal or normalizing each term.
```
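
One possible reading of "normalizing each term" is to compare how often a word occurs in positive versus negative reviews instead of looking at raw counts. The sketch below is an assumption about what the hint means, not the notebook's own solution; the function name `pos_neg_log_ratios`, the add-one smoothing, and the toy counts are all made up for illustration.

```python
import math
from collections import Counter


def pos_neg_log_ratios(positive_counts, negative_counts):
    """Return log((pos + 1) / (neg + 1)) for every word seen in either Counter."""
    ratios = {}
    for word in set(positive_counts) | set(negative_counts):
        # A Counter returns 0 for missing words; the +1 avoids division by zero
        pos = positive_counts[word]
        neg = negative_counts[word]
        ratios[word] = math.log((pos + 1) / (neg + 1))
    return ratios


toy_positive = Counter({'the': 10, 'good': 5})
toy_negative = Counter({'the': 10, 'bad': 5})
ratios = pos_neg_log_ratios(toy_positive, toy_negative)
for word in sorted(ratios, key=ratios.get, reverse=True):
    print(word, round(ratios[word], 2))
```

Words that are equally common in both classes, such as `the` in the toy data, end up near 0, while words tied to one sentiment move towards the positive or negative end.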

In [122]:
from nltk.corpus import stopwords
# You might need to run this first: import nltk; nltk.download("stopwords")

# Build the stop-word set once instead of reloading the list for every word
stop_words = set(stopwords.words('english'))

# Remove the stop words, and the 'br' token as well
filtered_words = [word for word in words if word not in stop_words and word != 'br']
print(filtered_words)

# stemming the terms (not implemented here)


['bromwell', 'high', 'cartoon', 'comedy', 'ran', 'time', 'programs', 'school', 'life', 'teachers', 'years', 'teaching', 'profession', 'lead', 'believe', 'bromwell', 'high', 'satire', 'much', 'closer', 'reality', 'teachers', 'scramble', 'survive', 'financially', 'insightful', 'students', 'see', 'right', 'pathetic', 'teachers', 'pomp', 'pettiness', 'whole', 'situation', 'remind', 'schools', 'knew', 'students', 'saw', 'episode', 'student', 'repeatedly', 'tried', 'burn', 'school', 'immediately', 'recalled', 'high', 'classic', 'line', 'inspector', 'sack', 'one', 'teachers', 'student', 'welcome', 'bromwell', 'high', 'expect', 'many', 'adults', 'age', 'think', 'bromwell', 'high', 'far', 'fetched', 'pity', 'story', 'man', 'unnatural', 'feelings', 'pig', 'starts', 'opening', 'scene', 'terrific', 'example', 'absurd', 'comedy', 'formal', 'orchestra', 'audience', 'turned', 'insane', 'violent', 'mob', 'crazy', 'chantings', 'singers', 'unfortunately', 'stays', 'absurd', 'whole', 'time', 'general', '

In [123]:
[vocab_to_int[word] for word in words[:30]]

[1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 15,
 16,
 17,
 18,
 19,
 20,
 13,
 21,
 22,
 23,
 24,
 10,
 25,
 26,
 27,
 28]

In [124]:
vocab_to_int['bromwell']

1

## One hot encoding

We need one-hot encoding for the labels. Can you think of a reason why we need one-hot encoded labels for classes?


The labels are categorical variables. ML algorithms work on numbers rather than strings, and one-hot encoding gives the categories a numeric form that does not imply any order between them, which helps to achieve better predictions/performance.
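
For illustration, here is a minimal sketch of the general form of one-hot encoding, with one vector per label and one position per class. The function name `to_one_hot` is hypothetical. With only two classes, the task below simplifies this to a single value of 1 or 0 per label.

```python
def to_one_hot(int_labels, num_classes):
    """Turn integer class labels into lists with a single 1 at the class index."""
    return [
        [1 if position == label else 0 for position in range(num_classes)]
        for label in int_labels
    ]


# Class 1 = positive, class 0 = negative
print(to_one_hot([1, 0, 1], num_classes=2))  # [[0, 1], [1, 0], [0, 1]]
```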


## Task 3: Create one hot encoding for the labels. 

* Write the one hot encoding logic in the `one_hot` function.
* Use 1 for positive label and 0 for negative label.
* Save all the values in the `encoded_labels` variable.

In [125]:
# 1 for positive label and 0 for negative label
def one_hot(labels):
    one_hot_labels = list()
    for label in labels:
        # Labels still carry a trailing newline, so test for the substring
        if "positive" in label:
            one_hot_labels.append(1)
        elif "negative" in label:
            one_hot_labels.append(0)
    return one_hot_labels

encoded_labels = one_hot(labels)
print(len(encoded_labels))

In [126]:
# Print the length of your labels and keep the next line only if len(encoded_labels) is 25001.
# If you don't get the intuition behind this step, print encoded_labels to see it.
encoded_labels = encoded_labels[:25000]
print(encoded_labels)

[1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 

In [127]:
len(encoded_labels)

25000

In [128]:
reviews_ints = []

reviews_split = reviews.split('.')
for review in reviews_split:
    # map every word of the review to its integer id
    reviews_ints.append([vocab_to_int[word] for word in review.split()])
print(reviews_ints)
    


[[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 13, 21], [22, 23, 24, 10, 25, 26, 27, 28, 29, 30, 31, 1, 2, 32, 33, 3, 34, 35, 29, 36, 37, 3, 21], [10, 38, 29, 39, 40, 10, 41, 42, 43, 44, 45, 46, 47, 48, 49, 21, 50, 10, 51, 52, 10, 53, 54, 55, 56, 28, 52, 10, 57, 58, 59, 60, 48, 42], [61, 58, 62, 10, 63, 24, 64, 4, 65, 66, 67, 29, 68, 69, 10, 18, 58, 70, 71], [], [], [], [], [], [], [], [], [9], [], [], [], [], [], [], [], [], [], [2], [4, 72, 73, 74, 58, 75, 76, 29, 77, 78, 52, 79, 21], [65, 80, 29, 1, 2], [58, 81, 31, 82, 83, 52, 22, 84, 85, 31, 1, 2, 3, 86, 87], [88, 4, 89, 31, 7, 90, 91, 92, 52, 4, 93, 43, 94, 95, 96, 97, 4, 98], [99, 100, 101, 4, 102, 103, 31, 3, 4, 104, 105, 52, 106, 6], [4, 107, 108, 109, 3, 110, 111, 112, 113, 114, 115, 116, 10, 117, 118, 52, 7, 32, 119], [120, 7, 121, 106, 10, 53, 12, 101, 122, 123, 124, 125, 126, 7, 127, 128, 129, 130], [131, 132, 133, 10, 134, 135, 136, 110, 129], [10, 137, 138, 139, 140, 141, 142, 143, 29, 4, 144,

In [129]:
# Count the review lengths so that empty reviews can be found and removed;
# an empty review would otherwise be padded into an all-zero input row.
from collections import Counter

review_lens = Counter([len(x) for x in reviews_ints])
print("Zero-length reviews: {}".format(review_lens[0]))
print("Maximum review length: {}".format(max(review_lens)))


Zero-length reviews: 30
Maximum review length: 125


In [130]:
print('Number of reviews before removing outliers: ', len(reviews_ints))

## remove any reviews/labels with zero length from the reviews_ints list.

# get indices of any reviews with length 0
non_zero_idx = [ii for ii, review in enumerate(reviews_ints) if len(review) != 0]

# remove 0-length reviews and their labels
reviews_ints = [reviews_ints[ii] for ii in non_zero_idx]
encoded_labels = np.array([encoded_labels[ii] for ii in non_zero_idx])

print('Number of reviews after removing outliers: ', len(reviews_ints))

Number of reviews before removing outliers:  268
Number of reviews after removing outliers:  238


In [101]:
len(encoded_labels)
print(reviews_ints)

[[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 13, 21], [22, 23, 24, 10, 25, 26, 27, 28, 29, 30, 31, 1, 2, 32, 33, 3, 34, 35, 29, 36, 37, 3, 21], [10, 38, 29, 39, 40, 10, 41, 42, 43, 44, 45, 46, 47, 48, 49, 21, 50, 10, 51, 52, 10, 53, 54, 55, 56, 28, 52, 10, 57, 58, 59, 60, 48, 42], [61, 58, 62, 10, 63, 24, 64, 4, 65, 66, 67, 29, 68, 69, 10, 18, 58, 70, 71], [], [], [], [], [], [], [], [], [9], [], [], [], [], [], [], [], [], [], [2], [4, 72, 73, 74, 58, 75, 76, 29, 77, 78, 52, 79, 21], [65, 80, 29, 1, 2], [58, 81, 31, 82, 83, 52, 22, 84, 85, 31, 1, 2, 3, 86, 87], [88, 4, 89, 31, 7, 90, 91, 92, 52, 4, 93, 43, 94, 95, 96, 97, 4, 98], [99, 100, 101, 4, 102, 103, 31, 3, 4, 104, 105, 52, 106, 6], [4, 107, 108, 109, 3, 110, 111, 112, 113, 114, 115, 116, 10, 117, 118, 52, 7, 32, 119], [120, 7, 121, 106, 10, 53, 12, 101, 122, 123, 124, 125, 126, 7, 127, 128, 129, 130], [131, 132, 133, 10, 134, 135, 136, 110, 129], [10, 137, 138, 139, 140, 141, 142, 143, 29, 4, 144,

## Task 4: Padding the data

> Define a function that returns an array `features` that contains the padded data, of a standard size, that we'll pass to the network. 
* The data should come from `reviews_ints`, since we want to feed integers to the network. 
* Each row should be `seq_length` elements long. 
* For reviews shorter than `seq_length` words, **left pad** with 0s. That is, if the review is `['best', 'movie', 'ever']`, `[117, 18, 128]` as integers, the row will look like `[0, 0, 0, ..., 0, 117, 18, 128]`. 
* For reviews longer than `seq_length`, use only the first `seq_length` words as the feature vector.

As a small example, if `seq_length=10` and an input review is: 
```
[117, 18, 128]
```
The resultant, padded sequence should be: 

```
[0, 0, 0, 0, 0, 0, 0, 117, 18, 128]
```

**Your final `features` array should be a 2D array, with as many rows as there are reviews, and as many columns as the specified `seq_length`.**
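
The rules above can be checked on a single row before writing the full function. This is a minimal sketch in plain Python; `pad_row` and `pad_value` are hypothetical names used only for illustration, not part of the exercise.

```python
def pad_row(tokens, seq_length, pad_value=0):
    """Left-pad (or truncate) one list of token ids to exactly seq_length items."""
    kept = tokens[:seq_length]                        # long reviews: keep the first seq_length ids
    padding = [pad_value] * (seq_length - len(kept))  # short reviews: zeros go on the left
    return padding + kept


print(pad_row([117, 18, 128], 10))        # short review
print(pad_row(list(range(1, 13)), 10))    # long review, truncated to its first 10 ids
```

Applying the same logic to every review and stacking the rows gives the 2D `features` array described above.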

In [131]:
# Write the logic for padding the data
import numpy

seq_length = 10

def pad_features(reviews_ints, seq_length):
    # one row per review and seq_length columns, pre-filled with the padding value 0
    features = numpy.zeros((len(reviews_ints), seq_length))
    for i, review in enumerate(reviews_ints):
        # long reviews: keep only the first seq_length tokens
        truncated = review[:seq_length]
        # short reviews: write the tokens at the right-hand end, so the zeros stay on the left
        if truncated:
            features[i, -len(truncated):] = truncated
    return features
print(pad_features(reviews_ints, seq_length))

[[   0.    0.    0. ...    4.    5.    6.]
 [   7.    8.    9. ...   14.   15.   16.]
 [  22.   23.   24. ...   28.   29.   30.]
 ...
 [   0.    0.  212. ... 1603.   60. 1604.]
 [   0.    0.    0. ...    0.    0.  648.]
 [ 212.  212.  334. ...    3.  799. 1605.]]


In [138]:
# Verify if everything till now is correct. 

seq_length = 20

features = pad_features(reviews_ints, seq_length=seq_length)

#

## test statements - do not change - ##
assert len(features)==len(reviews_ints), "Your features should have as many rows as reviews."
assert len(features[0])==seq_length, "Each feature row should contain seq_length values."

# print first 10 values of the first 30 rows 
print(features[:30,:10])

[[  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.]
 [  0.   0.   0.   0.   7.   8.   9.  10.  11.  12.]
 [ 22.  23.  24.  10.  25.  26.  27.  28.  29.  30.]
 [ 10.  38.  29.  39.  40.  10.  41.  42.  43.  44.]
 [  0.  61.  58.  62.  10.  63.  24.  64.   4.  65.]
 [  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.]
 [  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.]
 [  0.   0.   0.   0.   0.   0.   0.   4.  72.  73.]
 [  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.]
 [  0.   0.   0.   0.   0.  58.  81.  31.  82.  83.]
 [  0.   0.  88.   4.  89.  31.   7.  90.  91.  92.]
 [  0.   0.   0.   0.   0.   0.  99. 100. 101.   4.]
 [  0.   4. 107. 108. 109.   3. 110. 111. 112. 113.]
 [  0.   0. 120.   7. 121. 106.  10.  53.  12. 101.]
 [  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.]
 [  0.   0.   0.   0.   0.   0.   0.   0.  10. 137.]
 [146.   4. 147. 148.   7.  32. 149.  37. 150. 151.]
 [  0.   0.   0.   0.   0.   0.   0.   0.   0. 154.]
 [165. 166. 167.  13. 168. 169. 170.  94. 171.

Now we have everything ready. It's time to split our dataset into training, validation and test sets. 

Read more about the train-test-split here : https://cs230-stanford.github.io/train-dev-test-split.html

## Task 5: Let's create train, validation and test splits in the ratio of 8:1:1.  

Hint: Either use shuffling and slicing in Python, or apply `train_test_split` from scikit-learn twice. 

In [139]:
train_frac = 0.8
val_frac = 0.1
test_frac = 0.1

import pandas as pd

Y = pd.DataFrame(encoded_labels)
def train_test_val_split(features):
    df = pd.DataFrame(features)
    return np.split(df.sample(frac=train_frac), [int(val_frac*len(df)), int(test_frac*len(df))])
    #return train_test_split(X, test_size=test_frac, train_size=train_frac, )
    
    #pass

def train_test_val_labels(encoded_labels):
    df = pd.DataFrame(encoded_labels)
    return np.split(df.sample(frac=train_frac), [int(val_frac*len(df)), int(test_frac*len(df))])

train_x, val_x, test_x = train_test_val_split(features)
train_y, val_y, test_y = train_test_val_labels(encoded_labels)

In [140]:
## print out the shapes of your resultant feature data
print("\t\t\tFeature Shapes:")
print("Train set: \t\t{}".format(train_x.shape), 
      "\nValidation set: \t{}".format(val_x.shape),
      "\nTest set: \t\t{}".format(test_x.shape))

			Feature Shapes:
Train set: 		(23, 20) 
Validation set: 	(0, 20) 
Test set: 		(167, 20)
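
The shapes printed above (23, 0 and 167 rows) are not an 8:1:1 split. `df.sample(frac=train_frac)` keeps only 80% of the 238 rows, and the two split points `int(val_frac*len(df))` and `int(test_frac*len(df))` are both 23, so the middle piece is empty. Features and labels are also sampled independently, which breaks the pairing between a review and its label. A minimal sketch of one alternative, using only the standard library, is to shuffle the row indices once and cut them at 80% and 90%. `split_indices` and its `seed` parameter are hypothetical names, not part of the exercise.

```python
import random


def split_indices(n_rows, train_frac=0.8, val_frac=0.1, seed=0):
    """Return three disjoint lists of row indices: train, validation, test."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)     # one shuffle, shared by features and labels
    train_end = int(train_frac * n_rows)
    val_end = train_end + int(val_frac * n_rows)
    # whatever is left after train and validation becomes the test set
    return indices[:train_end], indices[train_end:val_end], indices[val_end:]


train_idx, val_idx, test_idx = split_indices(238)
print(len(train_idx), len(val_idx), len(test_idx))
```

With NumPy arrays, `features[train_idx]` and `encoded_labels[train_idx]` then select matching rows, so each review stays next to its own label.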


## DataLoaders and Batching

After creating training, test, and validation data, we can create DataLoaders for this data by following two steps:
1. Create a known format for accessing our data, using [TensorDataset](https://pytorch.org/docs/stable/data.html#) which takes in an input set of data and a target set of data with the same first dimension, and creates a dataset.
2. Create DataLoaders and batch our training, validation, and test Tensor datasets.

```
train_data = TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y))
train_loader = DataLoader(train_data, batch_size=batch_size)
```

This is an alternative to creating a generator function for batching our data into full batches.

### Task 6: Create a generator function for the dataset. 
See the above link for more info.
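
A generator for full batches, as mentioned above, could look roughly like the following sketch. It is written in plain Python so it works on lists as well as arrays; `batch_generator` is a hypothetical name, and dropping the incomplete last batch is an assumption based on the phrase "full batches".

```python
def batch_generator(features, labels, batch_size):
    """Yield (features, labels) batches of exactly batch_size rows, dropping any remainder."""
    n_full = len(features) // batch_size
    for b in range(n_full):
        start = b * batch_size
        yield features[start:start + batch_size], labels[start:start + batch_size]


xs = [[i, i + 1] for i in range(7)]   # 7 toy feature rows
ys = [i % 2 for i in range(7)]        # 7 toy labels
for batch_x, batch_y in batch_generator(xs, ys, batch_size=3):
    print(batch_x, batch_y)
```

Unlike `DataLoader`, this sketch does not shuffle; the data would need to be shuffled before it is passed in.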

In [141]:
import torch
from torch.utils.data import TensorDataset, DataLoader

# create Tensor datasets for train, val and test
train_data = TensorDataset(torch.tensor(train_x.values), torch.tensor(train_y.values))
print(train_data)
valid_data = TensorDataset(torch.tensor(val_x.values), torch.tensor(val_y.values))
test_data = TensorDataset(torch.tensor(test_x.values), torch.tensor(test_y.values))

# dataloaders
batch_size = 50 

# make sure to SHUFFLE your training data. Keep shuffle=True for the training loader.
# Each loader must wrap its own dataset; validation and test data need no shuffling.
train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True)
valid_loader = DataLoader(valid_data, batch_size=batch_size, shuffle=False)
test_loader = DataLoader(test_data, batch_size=batch_size, shuffle=False)

<torch.utils.data.dataset.TensorDataset object at 0x10c035748>


In [142]:
# obtain one batch of training data and label. 
dataiter = iter(train_loader)
sample_x, sample_y = next(dataiter)

print('Sample input size: ', sample_x.size()) # batch_size, seq_length
print('Sample input: \n', sample_x)
print()
print('Sample label size: ', sample_y.size()) # batch_size
print('Sample label: \n', sample_y)

Sample input size:  torch.Size([23, 20])
Sample input: 
 tensor([[   0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,
            0.,    0.,    0.,    0.,    0.,    0.,   10.,  643.,    3.,  152.],
        [ 342.,  499.,  464.,  171.,  721.,  459.,  173.,  524.,  449.,    4.,
         1328.,   52., 1329.,  131., 1330.,   31., 1331.,  429.,  697.,  103.],
        [   0.,    0.,    0.,    0.,    0.,  284., 1473., 1474.,  133., 1475.,
           60., 1476., 1477.,  429.,   24.,    4., 1478., 1106.,    9.,    6.],
        [   0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,  244.,
          174.,  740.,   28.,  244.,  449.,   24.,  306.,  101.,   10.,  741.],
        [   0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,
            0.,    0.,    0.,  873.,   32., 1202.,  500.,    4., 1203.,  696.],
        [ 524.,  464.,  171., 1288.,   23.,   61., 1289., 1290.,  617., 1291.,
          729.,   29., 1292.,   97., 1293., 1294.,   29.,    4., 1295

In [143]:
# Check if GPU is available.
train_on_gpu=torch.cuda.is_available()

if(train_on_gpu):
    print('Training on GPU.')
else:
    print('No GPU available, training on CPU.')

No GPU available, training on CPU.


## Creating the Model 

Here we create a simple RNN in PyTorch and pass its output through a Linear layer and a Sigmoid at the end to get the probability score and the prediction as POSITIVE or NEGATIVE. 

The network is very similar to the CNN network created in Exercise 2. 

More info available at: https://pytorch.org/docs/0.3.1/nn.html?highlight=rnn#torch.nn.RNN

Read about the parameters that the RNN takes and see what will happen when `batch_first` is set as `True`.
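
To see the effect of `batch_first` in practice, the shapes can be printed for a tiny network. This is only a sketch; the sizes below are hypothetical and chosen to be small and mutually distinct so that each dimension is easy to recognise.

```python
import torch
import torch.nn as nn

# Hypothetical small sizes, chosen only for illustration.
batch, seq_len, input_size, hidden_size, num_layers = 4, 7, 3, 5, 2

rnn = nn.RNN(input_size, hidden_size, num_layers, batch_first=True)
x = torch.zeros(batch, seq_len, input_size)        # input: the batch dimension comes first
h0 = torch.zeros(num_layers, batch, hidden_size)   # hidden state: layers first, even with batch_first=True

out, hn = rnn(x, h0)
print(out.shape)   # one hidden_size vector per time step and per sequence
print(hn.shape)    # final hidden state of every layer
```

Note that `batch_first=True` changes the layout of the input and output tensors only; the hidden state keeps the number of layers as its first dimension.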

In [144]:
import torch.nn as nn

class SentimentRNN(nn.Module):
    """
    The RNN model that will be used to perform Sentiment analysis.
    """

    def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, n_layers, drop_prob=0.5):
        """
        Initialize the model by setting up the layers.
        Note: embedding_dim is accepted but not used, since this model has no embedding layer.
        """
        super(SentimentRNN, self).__init__()

        self.output_size = output_size
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim
        
        # RNN layer; with batch_first=True the input has shape (batch, seq_len, vocab_size)
        self.rnn = nn.RNN(vocab_size, hidden_dim, n_layers, 
                            dropout=drop_prob, batch_first=True)
        
        # dropout layer, applied to the RNN output in forward()
        self.dropout = nn.Dropout(drop_prob)
        
        # linear and sigmoid layers
        self.fc = nn.Linear(hidden_dim, output_size)
        self.sig = nn.Sigmoid()
        

    def forward(self, x, hidden):
        """
        Perform a forward pass of our model on some input and hidden state.
        """
        batch_size = x.size(0)

        # RNN out layer
        rnn_out, hidden = self.rnn(x, hidden)
    
        # stack up RNN outputs: (batch_size * seq_len, hidden_dim)
        rnn_out = rnn_out.contiguous().view(-1, self.hidden_dim)
        
        # dropout and fully-connected layer
        out = self.dropout(rnn_out)
        out = self.fc(out)
        # sigmoid function
        sig_out = self.sig(out)
        
        # reshape to be batch_size first
        sig_out = sig_out.view(batch_size, -1)
        sig_out = sig_out[:, -1] # keep only the output of the last time step
        
        # return last sigmoid output and hidden state
        return sig_out, hidden
    
    def init_hidden(self, batch_size):
        ''' Initializes hidden state '''
        # Create one new tensor of size n_layers x batch_size x hidden_dim,
        # initialized to zero. Unlike an LSTM, a vanilla RNN has no cell state,
        # so nn.RNN expects a single tensor rather than a tuple.
        weight = next(self.parameters()).data
        hidden = weight.new(self.n_layers, batch_size, self.hidden_dim).zero_()
        
        if (train_on_gpu):
            hidden = hidden.cuda()
        
        return hidden

    


## Task 7 : Know the shape

Given a batch size of 64, an input size of 1 and a sequence length of 200, fed to an RNN with 2 stacked layers and 512 hidden units, find the shape of the input data (`x`) and of the hidden state (`hidden`) used in the forward pass of the network. Note that `batch_first` is set to `True`. 



In [50]:
# Instantiate the model w/ hyperparams
vocab_size = len(vocab_to_int)+1 # +1 for the 0 padding + our word tokens
output_size = 1
embedding_dim = 300 # not used by SentimentRNN, but required by its constructor
hidden_dim = 256
n_layers = 1

net = SentimentRNN(vocab_size, output_size, embedding_dim, hidden_dim, n_layers)

print(net)


## Task 8: LSTM 

Before we start creating the LSTM, it is important to understand LSTM and to know why we prefer LSTM over a Vanilla RNN for this task. 
> Here are some good links to know about LSTM:
* [Colah Blog](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)
* [Understanding LSTM](http://blog.echen.me/2017/05/30/exploring-lstms/)
* [RNN effectiveness](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)


Now create a class named `SentimentLSTM` with `n_layers=2` and all other hyperparameters the same as before. Also create an embedding layer and feed its output as input to the LSTM. Don't forget to add a regularizer (dropout) layer with p=0.4 after the LSTM layer to prevent overfitting. 
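
Before writing the class, it can help to check how the tensor shapes change from the embedding layer to the LSTM and the dropout layer. The following is a sketch of the layer chain only, not a solution to the task; all sizes are hypothetical and kept small for readability.

```python
import torch
import torch.nn as nn

# Hypothetical small sizes, chosen only for illustration.
vocab_size, embedding_dim, hidden_dim, n_layers = 50, 8, 16, 2

embedding = nn.Embedding(vocab_size, embedding_dim)
lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, dropout=0.4, batch_first=True)
dropout = nn.Dropout(0.4)

tokens = torch.randint(0, vocab_size, (4, 10))   # (batch, seq_len) of integer token ids
embeds = embedding(tokens)                       # (batch, seq_len, embedding_dim)
lstm_out, (h_n, c_n) = lstm(embeds)              # hidden and cell state default to zeros
dropped = dropout(lstm_out)                      # dropout keeps the shape of lstm_out

print(embeds.shape, lstm_out.shape, h_n.shape, c_n.shape)
```

`nn.Embedding` expects integer ids, so a float array such as the padded `features` would first need converting, for example with `.long()` on the tensor.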

In [None]:
import torch.nn as nn

class SentimentLSTM(nn.Module):
    """
    The LSTM model that will be used to perform Sentiment analysis.
    """

    def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, n_layers, drop_prob=0.5):
        """
        Initialize the model by setting up the layers.
        """
        super(SentimentLSTM, self).__init__()

        self.output_size = output_size
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim
        
        # define embedding, LSTM, dropout and Linear layers here
        
        

    def forward(self, x, hidden):
        """
        Perform a forward pass of our model on some input and hidden state.
        """
        pass
    
    
    def init_hidden(self, batch_size):
        ''' Initializes hidden state '''
        # Create two new tensors with sizes n_layers x batch_size x hidden_dim,
        # initialized to zero, for hidden state and cell state of LSTM
        weight = next(self.parameters()).data
        
        if (train_on_gpu):
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda(),
                  weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda())
        else:
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_(),
                      weight.new(self.n_layers, batch_size, self.hidden_dim).zero_())
        
        return hidden
        

## Instantiate the network

Here, we'll instantiate the network. First up, defining the hyperparameters.

* `vocab_size`: Size of our vocabulary or the range of values for our input, word tokens.
* `output_size`: Size of our desired output; the number of class scores we want to output (pos/neg).
* `embedding_dim`: Number of columns in the embedding lookup table; size of our embeddings.
* `hidden_dim`: Number of units in the hidden layers of our LSTM cells. Larger values usually perform better, at the cost of more computation. Common values are 128, 256, 512, etc.
* `n_layers`: Number of LSTM layers in the network. Typically between 1 and 3.

In [None]:
# Instantiate the model with these hyperparameters
vocab_size = len(vocab_to_int)+1 # +1 for the 0 padding + our word tokens
output_size = 1
embedding_dim = 300
hidden_dim = 256
n_layers = 2

net = SentimentLSTM(vocab_size, output_size, embedding_dim, hidden_dim, n_layers)

print(net)

In [None]:
# loss and optimization functions
lr=0.001

criterion = nn.BCELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=lr)


### Task 9: Loss Functions
We are using `BCELoss (Binary Cross Entropy Loss)` since we have two output classes. 

Can Cross Entropy Loss be used instead of BCELoss? 

If no, why not? If yes, how?

Is using `NLLLoss()` with `LogSoftmax()` as the last layer the same as using `CrossEntropyLoss()` with a Softmax final layer? Can you work out the mathematical intuition behind it?
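
One way to build intuition for these questions is to compute both losses by hand for a single example. PyTorch's `CrossEntropyLoss` is documented as combining log-softmax and negative log-likelihood, so it expects raw scores rather than probabilities. The sketch below uses only the standard library; the helper names and the score `z` are hypothetical. It relies on the fact that a two-class softmax over the scores `(0, z)` assigns the positive class the probability `sigmoid(z)`.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_sum for s in scores]


def bce(prob, target):
    """Binary cross entropy for one example, with target in {0, 1}."""
    return -(target * math.log(prob) + (1 - target) * math.log(1 - prob))


def cross_entropy(scores, target):
    """Negative log-likelihood of the log-softmax of the raw scores."""
    return -log_softmax(scores)[target]


z = 1.7  # arbitrary raw score for the POSITIVE class
for target in (0, 1):
    print(target, bce(sigmoid(z), target), cross_entropy([0.0, z], target))
```

Both columns agree for either target, which suggests that binary cross entropy on a sigmoid output is the two-class special case of cross entropy on raw scores.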

In [None]:
#Training and Validation

epochs = 4 # 3-4 is approx where I noticed the validation loss stop decreasing

counter = 0
print_every = 100
clip=5 # gradient clipping

# move model to GPU, if available
if(train_on_gpu):
    net.cuda()

net.train()
# train for some number of epochs
for e in range(epochs):
    # initialize hidden state
    h = net.init_hidden(batch_size)

    # batch loop
    for inputs, labels in train_loader:
        counter += 1

        if(train_on_gpu):
            inputs, labels = inputs.cuda(), labels.cuda()

        # Creating new variables for the hidden state, otherwise
        # we'd backprop through the entire training history
        h = tuple([each.data for each in h])

        # zero accumulated gradients
        net.zero_grad()

        # get the output from the model
        output, h = net(inputs, h)

        # calculate the loss and perform backprop
        loss = criterion(output.squeeze(), labels.float())
        loss.backward()
        # `clip_grad_norm_` helps prevent the exploding gradient problem in RNNs / LSTMs.
        nn.utils.clip_grad_norm_(net.parameters(), clip)
        optimizer.step()

        # loss stats
        if counter % print_every == 0:
            # Get validation loss
            val_h = net.init_hidden(batch_size)
            val_losses = []
            net.eval()
            for inputs, labels in valid_loader:

                # Creating new variables for the hidden state, otherwise
                # we'd backprop through the entire training history
                val_h = tuple([each.data for each in val_h])

                if(train_on_gpu):
                    inputs, labels = inputs.cuda(), labels.cuda()

                output, val_h = net(inputs, val_h)
                val_loss = criterion(output.squeeze(), labels.float())

                val_losses.append(val_loss.item())

            net.train()
            print("Epoch: {}/{}...".format(e+1, epochs),
                  "Step: {}...".format(counter),
                  "Loss: {:.6f}...".format(loss.item()),
                  "Val Loss: {:.6f}".format(np.mean(val_losses)))

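The gradient clipping step in the loop above rescales all gradients together whenever their combined norm exceeds `clip`. The idea can be sketched in plain Python as follows; this is an illustration of the principle under simplifying assumptions (a flat list of numbers), not PyTorch's actual implementation, and `clip_by_global_norm` is a hypothetical name.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale a flat list of gradient values so that their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)                # small gradients are left unchanged
    scale = max_norm / total_norm
    return [g * scale for g in grads]     # the direction is kept, only the length shrinks


print(clip_by_global_norm([3.0, 4.0], max_norm=5))     # norm is 5: unchanged
print(clip_by_global_norm([30.0, 40.0], max_norm=5))   # norm is 50: scaled down by a factor of 10
```

Because every value is multiplied by the same factor, the update still points in the same direction; only its size is bounded.
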
## Inference
Once we are done with training and validating, we can try to improve the training loss and validation loss by tuning the hyperparameters. Can you find a better set of hyperparameters? Play around with them. 

### Task 10: Prediction Function
Now write a prediction function to predict the output for the test set created. Save the results in a CSV file with one column as the reviews and the prediction in the next column. Calculate the accuracy of the test set.
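
The bookkeeping part of this task (thresholding, CSV output and accuracy) can be sketched independently of the model. The example below assumes the model's sigmoid outputs are already available as a list of probabilities and that label 1 means positive; `write_predictions`, the 0.5 threshold and the toy data are hypothetical.

```python
import csv
import io


def write_predictions(reviews, probs, labels, out_file, threshold=0.5):
    """Write one CSV row per review and return the accuracy against the labels."""
    writer = csv.writer(out_file)
    writer.writerow(["review", "prediction"])
    correct = 0
    for review, prob, label in zip(reviews, probs, labels):
        pred = 1 if prob >= threshold else 0   # sigmoid output -> class id
        writer.writerow([review, "positive" if pred == 1 else "negative"])
        correct += int(pred == label)
    return correct / len(labels)


buffer = io.StringIO()   # stands in for open("predictions.csv", "w", newline="")
accuracy = write_predictions(["great film", "dull plot", "loved it"],
                             [0.91, 0.20, 0.35], [1, 0, 1], buffer)
print(accuracy)
print(buffer.getvalue())
```

In the real `predict` function, the probabilities would come from running the trained network in evaluation mode over `test_loader`.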

In [None]:
def predict():
    pass

## Bonus Question: Create an app using Flask

> Extra bonus points if someone attempts this question:
* Save the trained model checkpoints.
* Create a Flask app and load the model. A similar work in the field of CNN has been done here : https://github.com/kumar-shridhar/Business-Card-Detector (Check `app.py`)
* You can use a hosting service like Heroku, with or without Docker, to host your app and show it to everyone. 
Example here: https://github.com/selimrbd/sentiment_analysis/blob/master/Dockerfile
