<a href="https://colab.research.google.com/github/purvasingh96/Deep-learning-with-neural-networks/blob/master/Deep-learning-with-pytorch/3.%20Recurrent%20Neural%20Networks/Sentiment_analysis_via_RNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis with an RNN
In this notebook I implement an RNN that performs sentiment analysis. <br>
An RNN is used instead of a strictly feedforward network because it can also take the *sequence* of words into account.
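
As a small illustration of why word order matters, the sketch below assumes a feedforward baseline that only sees word counts (a bag of words). The helper `bag_of_words` and the two toy reviews are made up for this example and are not part of the dataset.

```python
from collections import Counter

def bag_of_words(text):
    """Count words while ignoring their order (hypothetical helper)."""
    return Counter(text.split())

# Two toy reviews with opposite sentiment but identical words
review_a = "not good it was bad"
review_b = "not bad it was good"

same_bag = bag_of_words(review_a) == bag_of_words(review_b)
same_sequence = review_a.split() == review_b.split()

print(same_bag)       # the count vectors cannot tell the reviews apart
print(same_sequence)  # the word sequences can
```

A model that reads the words in order can, in principle, separate these two reviews, while a pure word-count input cannot.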

### Network Architecture
The architecture of the sentiment analysis model is shown below: <br>
<img src="https://github.com/purvasingh96/Deep-learning-with-neural-networks/blob/master/Deep-learning-with-pytorch/3.%20Recurrent%20Neural%20Networks/images/network_diagram.png?raw=1"></img>

**Notes -**
1. Sentiment analysis needs a more efficient representation of words than one-hot encoded vectors, so the model uses an *embedding layer for dimensionality reduction.*
2. The embeddings are passed to LSTM cells, which add recurrent connections and with them the ability to *include information about the sequence of words.*
3. The final LSTM outputs go to a *sigmoid output layer.*
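
To make note 1 concrete, here is a minimal NumPy sketch of what an embedding layer computes. The sizes, the random weights, and the names `embedding` and `word_ids` are assumptions chosen for illustration; in the real model the weights would be learned.

```python
import numpy as np

vocab_size, embed_dim = 6, 3   # toy sizes, assumed for illustration
rng = np.random.default_rng(0)
# Stand-in for learned embedding weights: one row per word id
embedding = rng.normal(size=(vocab_size, embed_dim))

word_ids = np.array([1, 4, 1])  # a tiny integer-encoded "review"

# One-hot route: build a (3, vocab_size) matrix and multiply by the weights
one_hot = np.eye(vocab_size)[word_ids]
via_matmul = one_hot @ embedding

# Embedding route: pick the matching rows directly
via_lookup = embedding[word_ids]

print(via_lookup.shape)
print(np.allclose(via_matmul, via_lookup))
```

Both routes give the same vectors, but the lookup never materialises the wide one-hot matrix, and each word ends up as a vector of length `embed_dim` instead of length `vocab_size`.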

### Load in and visualize the data

In [0]:
import numpy as np

In [0]:
with open('reviews.txt', 'r') as f:
  reviews = f.read()
with open('labels.txt', 'r') as f:
  labels = f.read()

In [5]:
print(reviews[:100])
print(labels[:100])

bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life
positive
negative
positive
negative
positive
negative
positive
negative
positive
negative
positive
n


## Data pre-processing
### Getting rid of punctuation
1. Lowercase the text and remove punctuation marks.
2. Reviews are delimited by `\n`, so split the text on `\n` to get the individual reviews.
3. Join the reviews from step 2 into one big string and split it into words.

In [0]:
from string import punctuation

# lowercase, then get rid of punctuation
reviews = reviews.lower()
all_text = ''.join([c for c in reviews if c not in punctuation])

# split into individual reviews on new lines, then re-join with spaces
reviews_split = all_text.split('\n')
all_text = ' '.join(reviews_split)

# create a list of words
words = all_text.split()

In [7]:
words[:20]

['bromwell',
 'high',
 'is',
 'a',
 'cartoon',
 'comedy',
 'it',
 'ran',
 'at',
 'the',
 'same',
 'time',
 'as',
 'some',
 'other',
 'programs',
 'about',
 'school',
 'life',
 'such']
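
The same three steps can be traced on a toy input. The string `raw` below is made up for illustration, and the `toy_` names are hypothetical so they do not clash with the notebook's variables.

```python
from string import punctuation

# Two made-up reviews, each terminated by a newline
raw = "Great movie, loved it!\nTerrible... Not worth it.\n"

# Step 1: lowercase and drop punctuation characters
text = raw.lower()
text = ''.join(c for c in text if c not in punctuation)

# Step 2: one entry per review
toy_reviews = text.split('\n')

# Step 3: one flat list of words across all reviews
toy_words = ' '.join(toy_reviews).split()

print(toy_reviews)
print(toy_words)
```

Note that a trailing newline leaves an empty string at the end of `toy_reviews`, which is one way a zero-length review can appear later in the pipeline.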

### Encoding reviews
Create a list that contains the integer-encoded version of the words in each review. The most frequent word should get the smallest integer value. For example, if *the* appears most often in the reviews, then assign *'the' : 1*.

In [8]:
from collections import Counter

counts = Counter(words)
'''
counts = Counter({'bromwell': 5,
                  'high': 742,
                  'is': 39879,
                  'a': 60733
                  })

vocabulary_to_int = {'the': 1,
                      'and': 2,
                      'a': 3,
                      'of': 4,
                      'to': 5,
                      'is': 6
                    }
'''
vocabulary = sorted(counts, key=counts.get, reverse=True)
vocabulary_to_int = {word:ii for ii, word in enumerate(vocabulary, 1)}
reviews_int = []
for review in reviews_split:
  reviews_int.append([vocabulary_to_int[word] for word in review.split()])
print(reviews_int[:1])


[[10402, 311, 6, 3, 1183, 201, 8, 2207, 33, 1, 168, 56, 15, 49, 85, 8223, 43, 425, 122, 142, 15, 3124, 59, 144, 9, 1, 5189, 5902, 455, 72, 5, 261, 12, 10402, 311, 13, 2002, 6, 73, 2756, 5, 696, 76, 6, 3124, 1, 19466, 5, 1691, 6862, 1, 5903, 1692, 36, 52, 68, 210, 143, 63, 1382, 3124, 14748, 1, 19467, 4, 1, 221, 759, 31, 2689, 72, 4, 1, 5904, 10, 723, 2, 63, 1692, 54, 10, 208, 1, 330, 9, 64, 3, 1565, 3893, 728, 5, 2824, 185, 1, 425, 10, 1252, 9175, 33, 311, 3, 380, 320, 5905, 10, 135, 136, 5, 9176, 30, 4, 133, 3124, 1565, 2467, 5, 10402, 311, 10, 535, 12, 113, 1828, 4, 59, 669, 103, 12, 10402, 311, 6, 226, 4074, 48, 3, 2152, 12, 8, 229, 21]]
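
For a smaller view of the same frequency-ranked encoding, here is a sketch on two made-up reviews. All `toy_` names are hypothetical, and the reverse mapping `toy_int_to_vocab` is an addition for illustration only.

```python
from collections import Counter

toy_reviews = ["the movie was good", "the plot was the problem"]
toy_words = ' '.join(toy_reviews).split()

# Rank words by frequency; the most frequent word gets index 1
toy_counts = Counter(toy_words)
toy_vocab = sorted(toy_counts, key=toy_counts.get, reverse=True)
toy_vocab_to_int = {word: i for i, word in enumerate(toy_vocab, 1)}

# Encode each review as a list of integers
toy_encoded = [[toy_vocab_to_int[w] for w in r.split()] for r in toy_reviews]

# Reverse mapping, to check that the encoding loses no information
toy_int_to_vocab = {i: word for word, i in toy_vocab_to_int.items()}
toy_decoded = [' '.join(toy_int_to_vocab[i] for i in r) for r in toy_encoded]

print(toy_vocab_to_int)
print(toy_encoded)
```

Here *the* occurs three times and *was* twice, so they receive 1 and 2; decoding `toy_encoded` gives back the original reviews.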


### Encoding labels
If a review is positive, the corresponding label is 1, otherwise it is 0.

In [0]:
labels_split = labels.split('\n')
encoded_labels = np.array([1 if label == 'positive' else 0 for label in labels_split])
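
A quick sketch of this encoding on a made-up label string follows. The input `toy_labels` is an assumption for illustration, including the trailing newline.

```python
# Three made-up labels, each terminated by a newline
toy_labels = "positive\nnegative\npositive\n"

toy_labels_split = toy_labels.split('\n')
toy_encoded = [1 if label == 'positive' else 0 for label in toy_labels_split]

print(toy_labels_split)
print(toy_encoded)
```

If the text ends with a newline, `split('\n')` produces a final empty string, and that empty string is encoded as 0 like any other non-positive label. Whenever entries are dropped from the reviews, the labels therefore need to be filtered with the same indices.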


### Removing Outliers
This step involves:
1. Getting rid of extremely long or short reviews.
2. Padding or truncating the remaining data to maintain a constant review length.

<img src="https://github.com/purvasingh96/Deep-learning-with-neural-networks/blob/master/Deep-learning-with-pytorch/3.%20Recurrent%20Neural%20Networks/images/outliers_padding_ex.png?raw=1"></img>

In [16]:
# removing reviews of length 0
print('Number of reviews before removing outliers: ', len(reviews_int))
non_zero_idx = [ii for ii, review in enumerate(reviews_int) if len(review) != 0]

reviews_int = [reviews_int[ii] for ii in non_zero_idx]
# index the labels with the same indices so reviews and labels stay aligned
encoded_labels = np.array([encoded_labels[ii] for ii in non_zero_idx])


print('Number of reviews after removing outliers: ', len(reviews_int))

Number of reviews before removing outliers:  3956
Number of reviews after removing outliers:  3956
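
The key point when dropping empty reviews is that the labels must be filtered with the same indices. A minimal sketch on made-up data, with hypothetical `toy_` names:

```python
import numpy as np

toy_reviews_int = [[4, 2, 7], [], [1, 5]]   # the middle review is empty
toy_labels = np.array([1, 0, 0])

# Positions of the reviews that contain at least one word
keep_idx = [i for i, review in enumerate(toy_reviews_int) if len(review) != 0]

# Apply the same positions to both reviews and labels
toy_reviews_int = [toy_reviews_int[i] for i in keep_idx]
toy_labels = toy_labels[keep_idx]

print(keep_idx)
print(toy_reviews_int)
print(toy_labels)
```

After filtering, `toy_reviews_int` and `toy_labels` have the same length, and each label still belongs to the review at the same position.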


In [0]:
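def pad_features(reviews_int, seq_length):
  # assumed: rows start as zeros, so short reviews are left-padded with 0
  features = np.zeros((len(reviews_int), seq_length), dtype=int)
  for i, row in enumerate(reviews_int):
    features[i, -len(row):] = np.array(row)[:seq_length]
  return features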
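
To see padding and truncation side by side, here is a self-contained sketch on toy data. The function name `pad_features_sketch`, the `pad_value` parameter, and the choice of 0 as padding value are assumptions for illustration, not necessarily what the notebook uses.

```python
import numpy as np

def pad_features_sketch(reviews_int, seq_length, pad_value=0):
    """Left-pad or truncate each review to seq_length (pad_value is assumed)."""
    features = np.full((len(reviews_int), seq_length), pad_value, dtype=int)
    for i, row in enumerate(reviews_int):
        if len(row) == 0:
            continue  # nothing to copy; the row stays all padding
        trimmed = row[:seq_length]             # keep the first seq_length tokens
        features[i, -len(trimmed):] = trimmed  # right-align, padding on the left
    return features

toy = [[5, 8], [1, 2, 3, 4, 5, 6], [9, 9, 9, 9]]
padded = pad_features_sketch(toy, seq_length=4)
print(padded)
# [[0 0 5 8]
#  [1 2 3 4]
#  [9 9 9 9]]
```

The short review is padded on the left, the long one keeps only its first four tokens, and the review that already has the target length is copied unchanged.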

# Training, Testing and Validating 