# RNNs & Sentiment Analysis

- a recurrent neural network (RNN) is a variation of the feedforward neural network (FF-NN)
- used when the task can be represented as a sequence
- e.g. a sentence is a sequence of words
- an RNN takes a whole sequence of vectors as input (a feedforward network takes a single vector)
- if each word in a document is an embedding vector, then a document is a matrix (order-2 tensor) and a batch of documents is an order-3 tensor
- LSTM (a more sophisticated RNN)
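
A minimal sketch of the tensor shapes involved, assuming a hypothetical vocabulary of 100 tokens and 8-dimensional embeddings (both sizes are made up for illustration). One document becomes a matrix of word vectors, and a batch of documents becomes an order-3 tensor:

```python
import torch
from torch import nn

torch.manual_seed(0)

vocab_size, embed_dim = 100, 8  # hypothetical sizes, chosen for illustration
embedding = nn.Embedding(vocab_size, embed_dim)

# One document of 5 token ids -> matrix of shape (seq_len, embed_dim)
doc = torch.tensor([4, 17, 52, 3, 9])
doc_vectors = embedding(doc)

# A batch of 2 documents -> tensor of shape (batch, seq_len, embed_dim)
batch = torch.tensor([[4, 17, 52, 3, 9],
                      [8, 1, 0, 0, 0]])
batch_vectors = embedding(batch)

print(doc_vectors.shape, batch_vectors.shape)
```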

```python
import pandas as pd
from string import punctuation
import numpy as np
import torch
from nltk.tokenize import word_tokenize
from torch.utils.data import TensorDataset, DataLoader
from torch import nn
from torch import optim
import json
```

### Theory of Recurrent Neural Networks (RNNs)

- RNNs are built from recurrent layers. Each layer resembles a layer in a standard feedforward neural network (NN), but it also feeds its own hidden state back in at the next step.
- This hidden recurrent state is updated at each step during sequence processing.
- At the start of processing any sequence, the hidden state is initialized as a one-dimensional vector (commonly all zeros).
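
The hidden-state update described above can be sketched by hand. This is an illustrative single-layer recurrence, not the notebook's actual model. The weight names `W_x`, `W_h`, `b`, the layer sizes, the zero initial state, and the tanh activation are all assumptions:

```python
import torch

torch.manual_seed(0)

input_dim, hidden_dim, seq_len = 8, 16, 5  # hypothetical sizes

# Randomly initialized parameters; in a real model these would be learned.
W_x = torch.randn(hidden_dim, input_dim) * 0.1   # input-to-hidden weights
W_h = torch.randn(hidden_dim, hidden_dim) * 0.1  # hidden-to-hidden weights
b = torch.zeros(hidden_dim)

sequence = torch.randn(seq_len, input_dim)  # stand-in for 5 word vectors
h = torch.zeros(hidden_dim)                 # initial hidden state

# The same weights are applied at every step; only h changes.
for x_t in sequence:
    h = torch.tanh(W_x @ x_t + W_h @ h + b)

print(h.shape)
```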

### Training Process

- Each word in the sequence is fed into the model, leading to an update in the hidden state (HS).
- This process continues until all words have been processed, generating a final hidden state vector.
- This final HS vector is then fed into a fully connected layer to yield the final class prediction.
- Activation functions, such as tanh, can be employed to constrain the hidden state values between -1 and 1.
- During learning, the same weights are reused at every time step. The loss is computed from the final prediction, gradients are obtained by backpropagating through all time steps, and the weights are then updated.
- For tasks such as sequence-to-sequence translation in NLP, the hidden state from each time step can be utilized instead of only the final HS.

### RNN for Sentiment Analysis

- Here, sentiment analysis is treated as a binary classification task: the RNN reads a piece of text and predicts whether it is positive or negative.
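
Putting the training steps together, a binary sentiment classifier might look roughly like the sketch below. The class name `SentimentRNN`, the layer sizes, and the random token ids and labels are hypothetical, and the use of `BCEWithLogitsLoss` is an assumption rather than something stated in these notes:

```python
import torch
from torch import nn

class SentimentRNN(nn.Module):
    """Hypothetical classifier: embedding -> RNN -> fully connected layer."""

    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # nn.RNN uses tanh as its default nonlinearity.
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)  # (batch, seq_len, embed_dim)
        outputs, h_n = self.rnn(embedded)     # h_n: (1, batch, hidden_dim)
        # Feed the final hidden state into the fully connected layer.
        return self.fc(h_n[-1]).squeeze(1)    # one logit per document

torch.manual_seed(0)
model = SentimentRNN(vocab_size=100, embed_dim=8, hidden_dim=16)

tokens = torch.randint(0, 100, (4, 10))       # 4 fake documents, 10 tokens each
labels = torch.tensor([1.0, 0.0, 1.0, 0.0])   # 1 = positive, 0 = negative

logits = model(tokens)
loss = nn.BCEWithLogitsLoss()(logits, labels)
loss.backward()  # backpropagation through all time steps

print(logits.shape, loss.item())
```

For sequence-to-sequence tasks, `outputs` (the hidden state at every time step) would be used instead of `h_n`.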

### Potential Challenges and Solutions

- One of the challenges with RNNs is exploding or vanishing gradients. Because the same recurrent weights are applied at every step, gradients are multiplied repeatedly during backpropagation and can grow or shrink rapidly, which makes the network unstable.
- Solutions include:
  - Implementing gradient clipping to prevent the gradients from becoming excessively large. This involves introducing a hyperparameter C to establish an upper limit.
  - Reducing the input sequence length. Since a shorter sequence means fewer iterations, the maximum sequence length can be chosen as a hyperparameter.
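
Both mitigations can be sketched in a single training step. This example assumes clipping by gradient norm via `torch.nn.utils.clip_grad_norm_` (clipping by value is another option), and the values of `clip_value` (playing the role of C) and `max_seq_len` are hypothetical:

```python
import torch
from torch import nn

torch.manual_seed(0)

max_seq_len = 20  # hypothetical maximum sequence length
clip_value = 1.0  # hypothetical upper limit C on the gradient norm

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
fc = nn.Linear(16, 1)
params = list(rnn.parameters()) + list(fc.parameters())
optimizer = torch.optim.SGD(params, lr=0.1)

inputs = torch.randn(4, 50, 8)               # 4 fake sequences of 50 steps
labels = torch.tensor([1.0, 0.0, 1.0, 0.0])

# Mitigation 2: truncate each sequence to at most max_seq_len steps.
inputs = inputs[:, :max_seq_len, :]

optimizer.zero_grad()
_, h_n = rnn(inputs)
loss = nn.BCEWithLogitsLoss()(fc(h_n[-1]).squeeze(1), labels)
loss.backward()

# Mitigation 1: rescale gradients so their overall norm is at most clip_value.
norm_before = nn.utils.clip_grad_norm_(params, max_norm=clip_value)
norm_after = torch.norm(torch.stack([p.grad.norm() for p in params]))

optimizer.step()
print(float(norm_before), float(norm_after))
```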


### LSTM

Flaws of RNNs
- hard to retain information long term (can't capture long-term dependencies within a sentence)
- poor at capturing the context of a word within a sentence (lacking context dependency, due to how it is trained)
- handles words from early in the sentence poorly, because the sequence is processed in one direction only

LSTMs can combat these issues
- an LSTM is a more sophisticated RNN with extra gating components, including an update (input) gate and a forget gate
- these additions make it easier to learn long-term dependencies
- in the context of sentiment analysis, an LSTM learns to retain important information and leave out irrelevant information, which prevents irrelevant features from diluting the important ones and helps maintain long-term dependencies
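
As a rough illustration, PyTorch's `nn.LSTM` can be used where `nn.RNN` would be. The main visible difference is that it returns a cell state alongside the hidden state. The sizes and random inputs below are hypothetical:

```python
import torch
from torch import nn

torch.manual_seed(0)

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
inputs = torch.randn(4, 10, 8)  # (batch, seq_len, embed_dim), fake data

# outputs: hidden state at every time step
# h_n: final hidden state, c_n: final cell state
outputs, (h_n, c_n) = lstm(inputs)

print(outputs.shape, h_n.shape, c_n.shape)
```

For a single-layer, one-directional LSTM, the last time step of `outputs` is the same tensor as the final hidden state `h_n[-1]`.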

- an LSTM has a similar structure to an RNN, with a recurrent hidden state, but the LSTM cell is more complex
- it has a series of gates that perform the additional calculations. Let's break it down.

#### forget gate
- learns which elements of the sequence to forget
- the previous hidden state ($h_{t-1}$) and the latest input step ($x_t$) are concatenated together and passed through the forget gate's matrix of learned weights
- a sigmoid function then bounds each value between 0 and 1
- the resulting vector, $f_t$, is multiplied pointwise by the cell state from the previous step, $c_{t-1}$
- This effectively applies a mask to the previous cell state so that only the relevant information from the previous cell state is brought forward.
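
The forget-gate computation above can be written out directly. This is a hand-rolled sketch for a single step, with hypothetical names (`W_f`, `b_f`) and sizes, and random values standing in for learned weights:

```python
import torch

torch.manual_seed(0)

input_dim, hidden_dim = 8, 16  # hypothetical sizes

# Forget-gate parameters; in a real LSTM these would be learned.
W_f = torch.randn(hidden_dim, hidden_dim + input_dim) * 0.1
b_f = torch.zeros(hidden_dim)

h_prev = torch.randn(hidden_dim)  # previous hidden state h_{t-1}
c_prev = torch.randn(hidden_dim)  # previous cell state c_{t-1}
x_t = torch.randn(input_dim)      # current input

# Concatenate h_{t-1} and x_t, apply the weights, squash with a sigmoid.
f_t = torch.sigmoid(W_f @ torch.cat([h_prev, x_t]) + b_f)

# Pointwise mask: values of f_t near 0 forget, values near 1 keep.
c_masked = f_t * c_prev

print(f_t.min().item(), f_t.max().item())
```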

#### input gate