# RNNs & Sentiment Analysis (NOTES)


- Recurrent NN (RNN): a variation of the feedforward NN (FF-NN)
- used when the task can be represented as a sequence
- a sentence is a sequence of words
- an RNN takes a whole sequence of vectors as input (a feedforward NN takes a single vector)
- if each word in a document is an embedding vector, then the whole document is a matrix (order-2 tensor), and a batch of documents is an order-3 tensor
- LSTM (a more sophisticated RNN)
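
To make the shapes concrete, here is a minimal sketch with NumPy. The vocabulary size, embedding size and word indices are made-up values for illustration, not taken from the dataset used later in these notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 6 words in the vocabulary, 3 numbers per word vector
vocab_size, embed_dim = 6, 3
embedding_table = rng.normal(size=(vocab_size, embed_dim))

# One sentence as a sequence of word indices (4 words)
sentence = np.array([2, 5, 1, 3])
sentence_matrix = embedding_table[sentence]   # one row per word

# A batch of 2 sentences of equal length
batch = np.array([[2, 5, 1, 3],
                  [0, 4, 4, 1]])
batch_tensor = embedding_table[batch]

print(sentence_matrix.shape)  # (4, 3): words x embedding size
print(batch_tensor.shape)     # (2, 4, 3): sentences x words x embedding size
```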

### Theory of Recurrent Neural Networks (RNNs)

- RNNs are structured with recurrent layers, similar to standard feedforward neural networks (NNs).
- They incorporate a hidden recurrent state that gets updated at each step during sequence processing.
- At the start of processing any sequence, the model is initialized with a one-dimensional vector representing the hidden state.
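
One common way to write the hidden state update is shown below. The symbols $W_{xh}$, $W_{hh}$ and $b_h$ are assumed names for the learned weights and bias, and starting from a zero vector is an assumption too, since the notes do not fix a notation or an initial value.

$$
h_0 = \mathbf{0}, \qquad h_t = \tanh\left(W_{xh}\, x_t + W_{hh}\, h_{t-1} + b_h\right)
$$

Here $x_t$ is the vector for the word at step $t$ and $h_t$ is the hidden state after that word has been processed.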

### Training Process

- Each word in the sequence is fed into the model, leading to an update in the hidden state (HS).
- This process continues until all words have been processed, generating a final hidden state vector.
- This final HS vector is then fed into a fully connected layer to yield the final class prediction.
- Activation functions, such as tanh, can be employed to constrain the hidden state values between -1 and 1.
- During learning, the loss is computed from the prediction after the forward pass; backpropagation (through the time steps) then gives the gradients used to update the weights, which are shared across all steps.
- For tasks such as sequence-to-sequence translation in NLP, the hidden state values from each time step can be utilized instead of only the final HS.
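
The forward pass described above can be sketched in a few lines of NumPy. This is a simplified illustration with random, untrained weights; the sizes and the parameter names (`W_xh`, `W_hh`, `W_fc`) are assumptions, not part of the notebook code further down.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes chosen for illustration
embed_dim, hidden_dim, seq_len = 4, 3, 5

# Parameters are random here, since nothing is trained in this sketch
W_xh = rng.normal(scale=0.1, size=(hidden_dim, embed_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)
W_fc = rng.normal(scale=0.1, size=(1, hidden_dim))   # fully connected layer
b_fc = np.zeros(1)

def rnn_forward(sequence):
    """Return every hidden state and the class probability for one sequence."""
    h = np.zeros(hidden_dim)                 # initial hidden state
    hidden_states = []
    for x_t in sequence:                     # one word vector per step
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        hidden_states.append(h)
    logit = W_fc @ h + b_fc                  # final HS -> fully connected layer
    prob = 1.0 / (1.0 + np.exp(-logit))      # sigmoid gives a value in (0, 1)
    return np.stack(hidden_states), prob.item()

sequence = rng.normal(size=(seq_len, embed_dim))   # stand-in for embedded words
hidden_states, prob = rnn_forward(sequence)
print(hidden_states.shape)   # (5, 3): one hidden state per step
```

`hidden_states[-1]` is the final HS used for classification, while a sequence-to-sequence model would use all rows of `hidden_states`.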

### RNN for Sentiment Analysis

- With an RNN, sentiment analysis can be framed as a binary classification task (i.e., positive or negative): the final hidden state is mapped to a single prediction.

### Potential Challenges and Solutions

- One of the challenges with RNNs is the occurrence of exploding or vanishing gradients, caused by repeatedly applying the same recurrent layer across many steps, which can make training unstable.
- Solutions include:
  - Implementing gradient clipping to prevent the gradients from becoming excessively large. This involves introducing a hyperparameter C to establish an upper limit.
  - Reducing the input sequence length. Since a shorter sequence means fewer iterations, the maximum sequence length can be chosen as a hyperparameter.
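
As a rough sketch of what clipping by norm does with the hyperparameter C, the function below rescales a set of gradients whenever their combined L2 norm exceeds C. The function name `clip_by_norm` and the example values are made up for illustration; the training loop later in these notes uses PyTorch's `nn.utils.clip_grad_norm_`, which applies the same idea.

```python
import numpy as np

def clip_by_norm(gradients, C):
    """Rescale a list of gradient arrays so their combined L2 norm is at most C."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in gradients))
    if total_norm <= C:
        return gradients                      # small enough, leave untouched
    scale = C / total_norm
    return [g * scale for g in gradients]     # same direction, smaller length

grads = [np.array([3.0, 4.0]), np.array([12.0])]   # combined norm is 13
clipped = clip_by_norm(grads, C=5.0)
clipped_norm = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(round(float(clipped_norm), 6))   # 5.0
```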


## LSTM

Flaws of RNN
- hard to retain information long term (can't capture long-term dependencies within a sentence)
- poor at capturing the context of a word within a sentence (lacking context dependency, due to how it is trained)
- weak at handling things that came early on in the sentence, because the sequence is processed in one direction only

LSTM can combat these issues
- an LSTM is a more sophisticated RNN and contains 2 extra properties (an update gate and a forget gate)
- these 2 additions make it easier to learn long-term dependencies
- in the context of sentiment analysis, an LSTM will remember important information and leave out irrelevant information, preventing irrelevant features from diluting the important ones and so maintaining long-term dependencies

- LSTM has a similar structure to RNN with the recurrent hidden state, but the LSTM cell is more complex
- it has a series of gates that allow for the additional calculations. let's break it down

#### forget gate
- learns which elements of the sequence to forget
- the previous hidden state (h_{t-1}) and the latest input step (x_t) are concatenated together and passed through a matrix of learned weights on the forget gate
- a sigmoid function then bounds the values between 0 and 1
- this resulting matrix, f_t, is multiplied pointwise by the cell state from the previous step, c_{t-1}
- this effectively applies a mask to the previous cell state so that only the relevant information from the previous cell state is brought forward
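
Written as equations, the forget gate steps above look as follows. $W_f$ and $b_f$ are assumed names for the gate's learned weights and bias, $\sigma$ is the sigmoid function and $\odot$ is pointwise multiplication.

$$
f_t = \sigma\left(W_f\,[h_{t-1},\, x_t] + b_f\right), \qquad \text{masked cell state} = f_t \odot c_{t-1}
$$

An entry of $f_t$ close to 0 wipes the matching entry of $c_{t-1}$, while an entry close to 1 keeps it.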

#### input gate
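- takes in the concatenated input (h_{t-1} and x_t) and passes it through a sigmoid function to bound the values between 0 and 1
- these values control how much new information is written into the cell state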

#### output gate
- calculates final output out of the LSTM cell
- learned parameters on the output gate control which elements of the previous HS and the current input are combined and then carried forward to the next step

- in one forward pass, we iterate through the model, initialise the hidden state and cell state, and update them at each step
- backprop is used to calculate gradients relative to the loss, to know in which direction to update our parameters
- LSTM has more computations than RNN, so a more complex computation graph
- and the backprop calculation for the gradients will also take longer
- but despite the longer time, LSTM offers a significant improvement in performance compared to RNN
- this is because the 3 gates give the model the ability to determine which elements of the input should be used to update the hidden state etc
- which means the model is better at forming long-term dependencies and retaining information from previous steps
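
Putting the three gates together, one step of an LSTM cell can be sketched with NumPy as below. This follows the standard textbook formulation; the parameter names (`W_f`, `W_i`, `W_g`, `W_o` and matching biases) and the sizes are assumptions for illustration, and the weights are random rather than trained.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step: returns the new hidden state and the new cell state."""
    concat = np.concatenate([h_prev, x_t])                   # [h_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ concat + params["b_f"])    # forget gate
    i_t = sigmoid(params["W_i"] @ concat + params["b_i"])    # input (update) gate
    g_t = np.tanh(params["W_g"] @ concat + params["b_g"])    # candidate values
    o_t = sigmoid(params["W_o"] @ concat + params["b_o"])    # output gate
    c_t = f_t * c_prev + i_t * g_t       # forget old info, then add new info
    h_t = o_t * np.tanh(c_t)             # expose part of the cell state
    return h_t, c_t

rng = np.random.default_rng(0)
embed_dim, hidden_dim = 4, 3             # hypothetical sizes
params = {}
for gate in "figo":
    params[f"W_{gate}"] = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim + embed_dim))
    params[f"b_{gate}"] = np.zeros(hidden_dim)

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)   # initial hidden and cell state
for x_t in rng.normal(size=(6, embed_dim)):         # a sequence of 6 word vectors
    h, c = lstm_step(x_t, h, c, params)
print(h.shape, c.shape)   # (3,) (3,)
```

The four matrix products per step, compared with one for the plain RNN, are where the extra computation comes from.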

### Bidirectional LSTMs
- modified LSTM that considers both the words before and after it at each step within the sequence
- bidirectional LSTMs process the sequence in regular order and reverse order simultaneously, maintaining 2 hidden states
- this allows the context of any given word within the sequence to be better captured
- bidirectional LSTMs can offer improved performance over a one-directional LSTM
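
In PyTorch, a bidirectional LSTM is available through the `bidirectional=True` argument of `nn.LSTM`. The sketch below only checks the output shapes on random input; the sizes are hypothetical and the model built later in these notes is not bidirectional.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical sizes for illustration
batch_size, seq_len, n_embed, n_hidden = 2, 5, 4, 3

lstm = nn.LSTM(n_embed, n_hidden, batch_first=True, bidirectional=True)
embedded = torch.randn(batch_size, seq_len, n_embed)   # stand-in for embedded words

output, (h_n, c_n) = lstm(embedded)

# Each step carries the forward and backward hidden states side by side
print(output.shape)   # torch.Size([2, 5, 6])
# One final hidden state per direction
print(h_n.shape)      # torch.Size([2, 2, 3])

# A classifier on top would take 2 * n_hidden features
final = torch.cat([h_n[0], h_n[1]], dim=1)
print(final.shape)    # torch.Size([2, 6])
```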

# CODE

## Building A Sentiment Analyzer using LSTMs

In [1]:
import pandas as pd
from string import punctuation
import numpy as np
import torch
from nltk.tokenize import word_tokenize
from torch.utils.data import TensorDataset, DataLoader
from torch import nn
from torch import optim
import json

In [15]:
'''
n = 3000
Data comes from 3 diff sources - film, product and location reviews
label = 0,1 at 50/50 split

'''


In [2]:
with open("sentiment.txt") as f:
    reviews = f.read()
    
data = pd.DataFrame([review.split('\t') for review in reviews.split('\n')])
data.columns = ['Review','Sentiment']
data = data.sample(frac=1)

In [4]:
data.head(15)

| | Review | Sentiment |
|---|---|---|
| 816 | The warmth it generates is in contrast to its ... | 1 |
| 1720 | Cute, quaint, simple, honest. | 1 |
| 2026 | I've owned this phone for 7 months now and can... | 1 |
| 952 | It presents a idyllic yet serious portrayal of... | 1 |
| 1016 | Highly recommended. | 1 |
| 1755 | When I'm on this side of town, this will defin... | 1 |
| 85 | Give this one a look. | 1 |
| 1454 | The last 3 times I had lunch here has been bad. | 0 |
| 56 | Excellent cast, story line, performances. | 1 |
| 2297 | This one works and was priced right. | 1 |


In [5]:
# Preprocessing step

def split_words_reviews(data):
    text = list(data['Review'].values)
    clean_text = []
    for t in text:
        # strip punctuation, lowercase, and drop trailing whitespace
        clean_text.append(t.translate(str.maketrans('', '', punctuation)).lower().rstrip())
    tokenized = [word_tokenize(x) for x in clean_text] # tokenise
    # flatten all reviews into one list of words
    all_text = [t for tokens in tokenized for t in tokens]
    return tokenized, set(all_text) # set of unique words (the vocabulary)

reviews, vocab = split_words_reviews(data)

reviews[0]

['the',
 'warmth',
 'it',
 'generates',
 'is',
 'in',
 'contrast',
 'to',
 'its',
 'austere',
 'backdrop']

In [6]:
#converting words to numbers (one integer index per word in the vocabulary)
def create_dictionaries(words):
    word_to_int_dict = {w:i+1 for i, w in enumerate(words)}
    int_to_word_dict = {i:w for w, i in word_to_int_dict.items()}
    return word_to_int_dict, int_to_word_dict

word_to_int_dict, int_to_word_dict = create_dictionaries(vocab)

int_to_word_dict

{1: 'abound',
 2: 'chalkboard',
 3: 'tying',
 4: 'chains',
 5: 'unacceptible',
 6: 'mortified',
 7: 'shallow',
 8: 'london',
 9: 'sensitivities',
 10: 'rpg',
 11: 'gristle',
 12: 'gels',
 13: 'sidelined',
 14: 'burgers',
 15: 'rushed',
 16: 'candle',
 17: '110',
 18: 'bathrooms',
 19: 'subtle',
 20: 'primal',
 21: 'moved',
 22: 'delete',
 23: 'mouse',
 24: 'window',
 25: 'most',
 26: 'improvisation',
 27: 'croutons',
 28: 'engaging',
 29: 'theory',
 30: 'steamboat',
 31: 'movieit',
 32: 'surrounding',
 33: 'six',
 34: 'watsons',
 35: 'union',
 36: 'pissd',
 37: 'chilly',
 38: 'finish',
 39: 'utter',
 40: 'filmmaking',
 41: 'recommending',
 42: 'livingworking',
 43: 'z',
 44: 'upper',
 45: 'beep',
 46: 'portrayed',
 47: 'jean',
 48: 'muddled',
 49: 'good',
 50: 'grimes',
 51: 'admins',
 52: 'card',
 53: 'burger',
 54: 'mile',
 55: 'dangerous',
 56: 'reactions',
 57: 'itdefinitely',
 58: 'ears',
 59: 'cutouts',
 60: 'babysitting',
 61: 'improved',
 62: 'flops',
 63: 'save',
 64: 'reminde

In [8]:
with open('word_to_int_dict.json', 'w') as fp:
    json.dump(word_to_int_dict, fp)

In [16]:
print(np.max([len(x) for x in reviews]))
print(np.mean([len(x) for x in reviews]))

'''
- Note that NNs take inputs of a fixed length
- but our reviews are all of different lengths
- therefore, we have to add padding (adding empty tokens)

HOWEVER
- longer sequences mean the LSTM is unrolled over more steps, which in turn makes backprop training longer
- a large % of our input would be sparse, empty tokens
- so make the input size 50 (a value between 20 and 70)

Logic
- for reviews longer than 50, drop the remaining tokens
- for reviews shorter than 50, add empty padding tokens

'''

70
11.783666666666667
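
The trade-off behind the choice of sequence length can be checked with a small helper like the one below. It is a sketch only: `padding_stats` is a made-up name and the list of lengths is hypothetical rather than the real review lengths.

```python
def padding_stats(lengths, seq_length):
    """Share of reviews that get truncated and share of positions that are padding."""
    truncated = sum(1 for n in lengths if n > seq_length)
    padded_positions = sum(max(seq_length - n, 0) for n in lengths)
    return truncated / len(lengths), padded_positions / (seq_length * len(lengths))

# Hypothetical review lengths in tokens, not the real dataset
lengths = [4, 7, 9, 11, 12, 15, 20, 33, 48, 70]

for seq_length in (20, 50, 70):
    cut, pad = padding_stats(lengths, seq_length)
    print(seq_length, f"truncated={cut:.0%}", f"padding={pad:.0%}")
```

A shorter length truncates more reviews, while a longer one fills more of the input with empty tokens.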



In [9]:
def pad_text(tokenized_reviews, seq_length):
    
    reviews = []
    
    for review in tokenized_reviews:
        if len(review) >= seq_length:
            reviews.append(review[:seq_length])
        else:
            reviews.append(['']*(seq_length-len(review)) + review)
        
    return np.array(reviews)

padded_sentences = pad_text(reviews, seq_length = 50)

padded_sentences[0]

array(['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '',
       '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '',
       '', '', '', '', '', 'the', 'warmth', 'it', 'generates', 'is', 'in',
       'contrast', 'to', 'its', 'austere', 'backdrop'], dtype='<U33')

In [13]:
'''
need to assign what empty token means in our model, so use 0
'''
int_to_word_dict[0] = ''
word_to_int_dict[''] = 0

In [14]:
'''
encoding our padded sentences into numeric vectors for feeding into the model:
each word is replaced by its integer index (word order is kept, unlike bag of words)
'''
encoded_sentences = np.array([[word_to_int_dict[word] for word in review] for review in padded_sentences])

encoded_sentences[0]

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0, 2271, 1732, 3564, 2325,  335,
       4336, 4034, 4945, 4127, 4628,  225])


### Model Architecture

   - Input layer
   - Embedding layer:
        - model learns the vector representation of the words that it is being trained on
        - could use precomputed embeddings (like GloVe)
        - though here we train our own embedding layer
        - our input sequences are fed through the layer and come out as a sequence of vectors
        - this vector sequence is then fed into our LSTM layer
   - LSTM layer:
        - the LSTM layer learns sequentially from the sequence of embeddings and outputs a single vector representation of the final hidden state of the LSTM

   - Final HS layer:
       - the HS vector gets fed here
       - will follow standard NN architecture from here on
   - FC layer:
       - complexity of this part of the architecture depends on the data
   - Classification/Output layer:
       - since this is binary classification, it has one node whose output (between 0 and 1) is the prediction
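
Before the full model class, the following sketch traces tensor shapes through the layers listed above. The sizes are hypothetical and much smaller than the real model. It takes the final hidden state directly from the LSTM, whereas the class below applies the FC layer at every step and keeps the last one; for a one-directional LSTM both read the same final state.

```python
import torch
from torch import nn

# Hypothetical sizes for illustration
n_vocab, n_embed, n_hidden, seq_len, batch_size = 20, 8, 5, 6, 3

embedding = nn.Embedding(n_vocab, n_embed)
lstm = nn.LSTM(n_embed, n_hidden, batch_first=True)
fc = nn.Linear(n_hidden, 1)

tokens = torch.randint(0, n_vocab, (batch_size, seq_len))  # padded, encoded reviews
embedded = embedding(tokens)              # (3, 6, 8): a vector per word
lstm_out, (h_n, c_n) = lstm(embedded)     # lstm_out: (3, 6, 5), h_n: (1, 3, 5)
final_hs = h_n[-1]                        # (3, 5): final hidden state per review
prob = torch.sigmoid(fc(final_hs))        # (3, 1): one value in (0, 1) per review

for name, t in [("embedded", embedded), ("lstm_out", lstm_out),
                ("final_hs", final_hs), ("prob", prob)]:
    print(name, tuple(t.shape))
```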

In [38]:
class SentimentLSTM(nn.Module):
    
    #init as size of vocab, # of LSTM layers, size of models HS
    def __init__(self, n_vocab, n_embed, n_hidden, n_output, n_layers, drop_p = 0.8):
        super().__init__()
        
        self.n_vocab = n_vocab  
        self.n_layers = n_layers 
        self.n_hidden = n_hidden 
        
        #embedding layers have the length of # of words in vocab and size of embed vect
        self.embedding = nn.Embedding(n_vocab, n_embed)
        self.lstm = nn.LSTM(n_embed, n_hidden, n_layers, batch_first = True, dropout = drop_p)
        self.dropout = nn.Dropout(drop_p)
        self.fc = nn.Linear(n_hidden, n_output)
        self.sigmoid = nn.Sigmoid()
        
        
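    def forward (self, input_words):
        # input_words: (batch, seq_len) tensor of word indices
        batch_size = input_words.size(0)  # read from the input instead of relying on a global

        embedded_words = self.embedding(input_words)    # (batch, seq_len, n_embed)
        lstm_out, h = self.lstm(embedded_words)         # (batch, seq_len, n_hidden)
        lstm_out = self.dropout(lstm_out)
        lstm_out = lstm_out.contiguous().view(-1, self.n_hidden) #using view to reshape tensor
        fc_out = self.fc(lstm_out)
        sigmoid_out = self.sigmoid(fc_out)
        sigmoid_out = sigmoid_out.view(batch_size, -1)  # (batch, seq_len) when n_output is 1

        # keep the prediction at the last step, shape (batch,);
        # no squeeze, so a batch of 1 still matches the shape of its label
        sigmoid_last = sigmoid_out[:, -1]

        return sigmoid_last, h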
    
    #init hidden and cell states w dim. of batch size
    #allows model to train/pred on many sentences at once than training sequentially
    def init_hidden (self, batch_size):

        # create the states on whichever device the model's parameters live on,
        # instead of hard-coding "cuda" (which fails on a CPU-only machine)
        weights = next(self.parameters())
        h = (weights.new_zeros((self.n_layers, batch_size, self.n_hidden)),
             weights.new_zeros((self.n_layers, batch_size, self.n_hidden)))

        return h

In [30]:
#initialising our model

n_vocab = len(word_to_int_dict)
n_embed = 50
n_hidden = 100
n_output = 1
n_layers = 2

net = SentimentLSTM(n_vocab, n_embed, n_hidden, n_output, n_layers)

In [31]:
#Model Training
# Train/Valid/Test set split 80/10/10

labels = np.array([int(x) for x in data['Sentiment'].values])

train_ratio = 0.8
valid_ratio = (1 - train_ratio)/2

#ratio slicing
total = len(encoded_sentences)
train_cutoff = int(total * train_ratio)
valid_cutoff = int(total * (1 - valid_ratio))

train_x, train_y = torch.Tensor(encoded_sentences[:train_cutoff]).long(), torch.Tensor(labels[:train_cutoff]).long()
valid_x, valid_y = torch.Tensor(encoded_sentences[train_cutoff : valid_cutoff]).long(), torch.Tensor(labels[train_cutoff : valid_cutoff]).long()
test_x, test_y = torch.Tensor(encoded_sentences[valid_cutoff:]).long(), torch.Tensor(labels[valid_cutoff:])

train_data = TensorDataset(train_x, train_y)
valid_data = TensorDataset(valid_x, valid_y)
test_data = TensorDataset(test_x, test_y)


#use the split dataset to create PyTorch DataLoader object
# allows us to batch process
#randomly shuffled (removes bias from training order)

batch_size = 1

train_loader = DataLoader(train_data, batch_size = batch_size, shuffle = True)
valid_loader = DataLoader(valid_data, batch_size = batch_size, shuffle = True)
test_loader = DataLoader(test_data, batch_size = batch_size, shuffle = True)

In [32]:
print_every = 2400
step = 0
n_epochs = 3
clip = 5  #grad clipping
criterion = nn.BCELoss() # Binary Cross Entropy, as we are dealing with pred. single bin. class
optimizer = optim.Adam(net.parameters(), lr = 0.001)

In [40]:
for epoch in range(n_epochs):
    # note: h is not passed to net() below, so the LSTM starts from its default zero state
    h = net.init_hidden(batch_size)
    
    for inputs, labels in train_loader:
        step += 1  
        net.zero_grad()
        output, h = net(inputs)
        loss = criterion(output, labels.float())
        loss.backward()
        nn.utils.clip_grad_norm_(net.parameters(), clip)  # gradient clipping
        optimizer.step()
        
        if (step % print_every) == 0:            
            net.eval()  # switch off dropout for validation
            valid_losses = []

            with torch.no_grad():  # no gradients needed for validation
                for v_inputs, v_labels in valid_loader:
                    v_output, v_h = net(v_inputs)
                    v_loss = criterion(v_output, v_labels.float())
                    valid_losses.append(v_loss.item())

            print("Epoch: {}/{}".format((epoch+1), n_epochs),
                  "Step: {}".format(step),
                  "Training Loss: {:.4f}".format(loss.item()),
                  "Validation Loss: {:.4f}".format(np.mean(valid_losses)))
            net.train()

Epoch: 1/3 Step: 4800 Training Loss: 2.1769 Validation Loss: 0.6860
Epoch: 2/3 Step: 7200 Training Loss: 0.0487 Validation Loss: 0.8035
Epoch: 3/3 Step: 9600 Training Loss: 0.0001 Validation Loss: 1.0089


### Observations

- the model is heavily overfitting the training data (training loss falls towards 0 while validation loss rises)
- a result of the small data set
- and, for the embedding layer, some words occur only once in the training set and never in the validation set
- ideally need a much bigger data set for better generalisation
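
One way to check the point about rare words is to count training words that occur only once and validation words that never appear in training. The sketch below uses a made-up helper `vocab_overlap` and a few hypothetical tokenized reviews instead of the real split.

```python
from collections import Counter

def vocab_overlap(train_reviews, valid_reviews):
    """Count rare training words and validation words never seen in training."""
    train_counts = Counter(w for review in train_reviews for w in review)
    valid_words = {w for review in valid_reviews for w in review}
    singletons = {w for w, n in train_counts.items() if n == 1}
    unseen = valid_words - set(train_counts)
    return len(singletons), len(unseen)

# Hypothetical tokenized reviews, standing in for the real train/valid split
train_reviews = [["great", "phone"], ["bad", "food"], ["great", "film"]]
valid_reviews = [["great", "service"], ["awful", "food"]]

singletons, unseen = vocab_overlap(train_reviews, valid_reviews)
print(singletons, unseen)   # 4 2
```

Embeddings for words seen once get very little training signal, and embeddings for words seen only in validation stay at their random initial values.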

In [41]:
torch.save(net.state_dict(), 'model.pkl')

In [42]:
net = SentimentLSTM(n_vocab, n_embed, n_hidden, n_output, n_layers)
net.load_state_dict(torch.load('model.pkl'))

<All keys matched successfully>

In [48]:
'''net.eval()
test_losses = []
num_correct = 0

for inputs, labels in test_loader:

    test_output, test_h = net(inputs)
    loss = criterion(test_output, labels.float())
    test_losses.append(loss.item())
    
    preds = torch.round(test_output.squeeze())
    correct_tensor = preds.eq(labels.float().view_as(preds))
    correct = np.squeeze(correct_tensor.numpy())
    num_correct += np.sum(correct)
    
print("Test Loss: {:.4f}".format(np.mean(test_losses)))
print("Test Accuracy: {:.2f}".format(num_correct/len(test_loader.dataset)))    '''


In [49]:
def preprocess_review(review):
    review = review.translate(str.maketrans('', '', punctuation)).lower().rstrip()
    tokenized = word_tokenize(review)
    if len(tokenized) >= 50:
        review = tokenized[:50]
    else:
        # pad with the empty token '' (index 0), matching pad_text
        review = ['']*(50-len(tokenized)) + tokenized
    
    # words not seen in training fall back to the empty token (index 0)
    final = [word_to_int_dict.get(token, word_to_int_dict['']) for token in review]
        
    return final

In [None]:
'''def predict(review):
    net.eval()
    words = np.array([preprocess_review(review)])
    padded_words = torch.from_numpy(words)
    pred_loader = DataLoader(padded_words, batch_size = 1, shuffle = True)
    for x in pred_loader:
        output = net(x)[0].item()
    
    msg = "This is a positive review." if output >= 0.5 else "This is a negative review."
    print(msg)
    print('Prediction = ' + str(output))'''

In [None]:
'''predict("The film was good")
predict("It was not good")
'''

## Deploying app on Heroku