<a href="https://colab.research.google.com/github/gupta24789/sentiment-analysis/blob/main/sentiment_rnn_lighting.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## RNN

**Recurrent neural network** (RNN) are commonly used in analysing sequences. An RNN takes in sequence of words, $X=\{x_1, ..., x_T\}$, one at a time, and produces a _hidden state_, $h$, for each word. We use the RNN _recurrently_ by feeding in the current word $x_t$ as well as the hidden state from the previous word, $h_{t-1}$, to produce the next hidden state, $h_t$.

$$h_t = \text{RNN}(x_t, h_{t-1})$$

Once we have our final hidden state, $h_T$, (from feeding in the last word in the sequence, $x_T$) we feed it through a linear layer, $f$, (also known as a fully connected layer), to receive our predicted sentiment, $\hat{y} = f(h_T)$.

Below shows an example sentence, with the RNN predicting output. Note that we use the same RNN for every word, i.e. it has the same parameters. The initial hidden state, $h_0$, is a tensor initialized to all zeros.

![](https://github.com/gupta24789/sentiment-analysis/blob/main/imgaes/rnn.jpg?raw=1)


In [2]:
!pip install -q lightning

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.0/2.0 MB[0m [31m20.1 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m840.2/840.2 kB[0m [31m36.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m777.7/777.7 kB[0m [31m43.7 MB/s[0m eta [36m0:00:00[0m
[?25h

In [3]:
import pandas as pd
import numpy as np
import itertools
import warnings

import re
import string
import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords

nltk.download('stopwords')

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import optim
from torch.nn.utils.rnn import  pad_sequence
from torch.utils.data import Dataset, DataLoader, TensorDataset

from torchmetrics import Accuracy
import lightning as L
from lightning.pytorch.loggers import CSVLogger
from lightning.pytorch.callbacks import ModelCheckpoint

warnings.filterwarnings('ignore')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Utilities

In [4]:
def convert_token_to_number(tweet, verbose = False):
  unk_token_id = token2idx['__UNK__']
  encoded_tweet = []

  if verbose:
    print(f"UNK TOKEN ID : {unk_token_id}")
    print(f"RAW TWEET : {tweet}")

  for w in tweet.split(" "):
    if w in token2idx:
      encoded_tweet.append(token2idx[w])
    else:
      encoded_tweet.append(unk_token_id)

  return encoded_tweet


def process_tweet(tweet):
    """Process tweet function.
    Input:
        tweet: a string containing a tweet
    Output:
        tweets_clean: a list of words containing the processed tweet

    """
    stemmer = PorterStemmer()
    stopwords_english = stopwords.words('english')
    # remove stock market tickers like $GE
    tweet = re.sub(r'$\w*', '', tweet)
    # remove old style retweet text "RT"
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    # remove hyperlinks
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)
    # remove hashtags
    # only removing the hash # sign from the word
    tweet = re.sub(r'#', '', tweet)
    # tokenize tweets
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,reduce_len=True)
    tweet_tokens = tokenizer.tokenize(tweet)

    tweets_clean = []
    for word in tweet_tokens:
        if (word not in stopwords_english and  # remove stopwords
                word not in string.punctuation):  # remove punctuation
            # tweets_clean.append(word)
            stem_word = stemmer.stem(word)  # stemming word
            tweets_clean.append(stem_word)

    return tweets_clean

### Load Data

In [12]:
train_df = pd.read_csv("https://raw.githubusercontent.com/gupta24789/sentiment-analysis/main/data/train.csv")
val_df = pd.read_csv("https://raw.githubusercontent.com/gupta24789/sentiment-analysis/main/data/val.csv")
train_df.processed_tweet = train_df.processed_tweet.fillna('[]').apply(lambda x: eval(x) if x is not None else [])
val_df.processed_tweet = val_df.processed_tweet.fillna('[]').apply(lambda x: eval(x) if x is not None else [])
train_df = train_df[['raw_tweet', 'processed_tweet', 'label']].dropna().reset_index(drop = True)
val_df = val_df[['raw_tweet', 'processed_tweet', 'label']].dropna().reset_index(drop = True)



In [13]:
## Clean the tweet
train_df['processed_text'] = train_df.processed_tweet.apply(lambda x: " ".join(x))
val_df['processed_text'] = val_df.processed_tweet.apply(lambda x: " ".join(x))

X_train = train_df.processed_text
y_train = train_df.label
X_val = val_df.processed_text
y_val = val_df.label

In [14]:
train_df.head(4)

Unnamed: 0,raw_tweet,processed_tweet,label,processed_text
0,Want to say a huge thanks to @WarriorAssaultS ...,"[want, say, huge, thank, ff, thank, support, :)]",1.0,want say huge thank ff thank support :)
1,@jaynehh_ you just need a job and get a letter...,"[need, job, get, letter, work, place, say, wor...",1.0,need job get letter work place say work letter...
2,"@knhillrocks HA yes, make it quick tho :D","[ha, ye, make, quick, tho, :d]",1.0,ha ye make quick tho :d
3,@shartyboy Thanks for texting me back :)) I'm ...,"[thank, text, back, :), i'm, text, tomorrow, :)]",1.0,thank text back :) i'm text tomorrow :)


## Create Vocab

In [15]:
special_tokens = ['__PAD__','__UNK__']
words = list(set(itertools.chain.from_iterable(train_df.processed_text.apply(lambda x: x.split(" ")))))
words = special_tokens +  words
token2idx = {w:i for i,w in enumerate(words)}
idx2tokens = {i:w for i,w in enumerate(words)}
print(f"vocab size : {len(token2idx)}")

vocab size : 9093


## Convert tweet to number

In [16]:
X_train_encoded = X_train.apply(lambda x: convert_token_to_number(x))
X_val_encoded = X_val.apply(lambda x: convert_token_to_number(x))

In [17]:
X_train_encoded[:2]

0        [6330, 25, 7473, 1451, 2554, 1451, 8575, 807]
1    [7945, 4105, 3599, 5591, 1561, 741, 25, 1561, ...
Name: processed_text, dtype: object

## Data Loaders

In [18]:
def data_collator(batch):
  features = [torch.tensor(item[0]) for item in batch]
  labels = torch.tensor([item[1] for item in batch], dtype = torch.float32)
  features = pad_sequence(features, batch_first=True, padding_value= token2idx['__PAD__'])
  return (features, labels)

batch_size = 32
train_dl = DataLoader(list(zip(X_train_encoded, y_train)), batch_size = batch_size, shuffle = True, collate_fn = data_collator)
val_dl = DataLoader(list(zip(X_val_encoded, y_val)), batch_size = batch_size, shuffle = False, collate_fn = data_collator)

In [19]:
example = next(iter(train_dl))
features, labels = example[0], example[1]
features.shape, labels.shape

(torch.Size([32, 16]), torch.Size([32]))

## Build Model

The embedding layer is used to transform our sparse one-hot vector (sparse as most of the elements are 0) into a dense embedding vector (dense as the dimensionality is a lot smaller and all the elements are real numbers). This embedding layer is simply a single fully connected layer. As well as reducing the dimensionality of the input to the RNN

The RNN layer is our RNN which takes in our dense vector and the previous hidden state $h_{t-1}$, which it uses to calculate the next hidden state, $h_t$.

![](https://github.com/gupta24789/sentiment-analysis/blob/main/imgaes/rnn_emb.jpg?raw=1)

Finally, the linear layer takes the final hidden state and feeds it through a fully connected layer, $f(h_T)$, transforming it to the correct output dimension.

---

The RNN returns 2 tensors
  - output : [batch size, sentence length, hidden dim]
  - hidden : [num_layers, batch size, hidden dim]


output is the concatenation of the hidden state from every time step, whereas hidden is simply the final hidden state from very layers.

In [20]:
class RNNModel(L.LightningModule):

  def __init__(self, vocab_size , emb_dim, hidden_dim, num_classes, learning_rate, dropout, num_layers = 1):
    super().__init__()
    ## define variable
    self.learning_rate = learning_rate

    ## define layers
    self.embedding = nn.Embedding(num_embeddings= vocab_size, embedding_dim= emb_dim)
    self.rnn = nn.RNN(emb_dim, hidden_dim, batch_first=True, dropout = dropout, num_layers = num_layers, bidirectional= False)
    self.linear = nn.Linear(hidden_dim, num_classes)
    self.sigmoid = nn.Sigmoid()


    ## loss & metrics
    self.loss_fn = nn.BCELoss()
    self.train_accuracy = Accuracy(task = 'binary', num_classes= 2, threshold = 0.5)
    self.val_accuracy = Accuracy(task = 'binary', num_classes = 2, threshold= 0.5)



  def forward(self, features, verbose = False):
    """
    """

    # features : [batch size, batch max sent len]
    embedded = self.embedding(features)

    # embedded : [batch size, batch max sent len, emb dim]
    output, hidden = self.rnn(embedded)

    #  output : [batch size, batch max sent len,  hidden dim]
    #  hidden : [num_layers, batch size, hidden dim]

    hidden_squeezed = hidden[-1,:,:].squeeze(0)

    ## hidden_squeezed : [batch size, hidden dim]
    linear_out = self.linear(hidden_squeezed)
    sigmoid_out = self.sigmoid(linear_out)

    if verbose:
      print(f"Input shape : {features.shape}")
      print(f"EMB shape : {embedded.shape}")
      print(f"RNN output shape : {output.shape}")
      print(f"RNN hidden shape : {hidden.shape}")
      print(f"RNN hidden_squeezed shape : {hidden_squeezed.shape}")
      print(f"linear_out shape : {linear_out.shape}")
      print(f"sigmoid_out shape : {sigmoid_out.shape}")

    return sigmoid_out


  def training_step(self, batch, batch_idx):
    features, labels = batch[0], batch[1]
    logits = self(features)
    logits = logits.squeeze(dim=1)
    loss = self.loss_fn(logits, labels)
    self.train_accuracy(logits, labels)
    self.log_dict({"train_loss": loss, "train_acc": self.train_accuracy}, on_step = False, on_epoch = True, prog_bar = True)

  def validation_step(self,batch, batch_idx):
    features, labels = batch[0], batch[1]
    logits = self(features)
    logits = logits.squeeze(dim=1)
    loss = self.loss_fn(logits, labels)
    self.val_accuracy(logits, labels)
    self.log_dict({"val_loss": loss, "val_acc":  self.val_accuracy}, on_step = False, on_epoch = True, prog_bar = True)

  def on_train_epoch_end(self):
    self.train_accuracy.reset()

  def on_validation_epoch_end(self):
    if self.current_epoch!=0:
      print(f"Epoch : {self.current_epoch} val accuracy : {self.val_accuracy.compute()}")
    self.val_accuracy.reset()

  def configure_optimizers(self):
    optimizer = optim.Adam(self.parameters(), lr = self.learning_rate)
    return optimizer

In [21]:
## Test the model
model = RNNModel(vocab_size=len(token2idx),emb_dim =100, hidden_dim = 64,
                 num_classes = 1, learning_rate = .001, dropout = 0.2,
                 num_layers = 1)

logits = model(features, verbose = True)
logits = logits.squeeze(dim=1)
loss = model.loss_fn(logits, labels)
print(f"\nLoss: {loss}")

Input shape : torch.Size([32, 16])
EMB shape : torch.Size([32, 16, 100])
RNN output shape : torch.Size([32, 16, 64])
RNN hidden shape : torch.Size([1, 32, 64])
RNN hidden_squeezed shape : torch.Size([32, 64])
linear_out shape : torch.Size([32, 1])
sigmoid_out shape : torch.Size([32, 1])

Loss: 0.7041961550712585


## Single Layer RNN

In [24]:
## Build Trainer
model = RNNModel(vocab_size=len(token2idx),emb_dim =300, hidden_dim = 512,
                 num_classes = 1, learning_rate = 1e-3, dropout = 0.5, num_layers = 1)

## logger
logger =  CSVLogger("logs", name="sentiment_analysis")

## checkpoints
checkpoint_callback  = ModelCheckpoint(filename='{epoch}-{val_loss:.2f}-{val_accuracy:.2f}',
                                        every_n_epochs = 10,
                                        save_top_k = -1,
                                        monitor='val_loss',
                                      )


trainer = L.Trainer(accelerator="cpu",
                     max_epochs = 5,
                     check_val_every_n_epoch=1,
                     callbacks=[checkpoint_callback],
                     logger=logger

                    )

## Train the Model
trainer.fit(model, train_dl, val_dl)

INFO: GPU available: False, used: False
INFO:lightning.pytorch.utilities.rank_zero:GPU available: False, used: False
INFO: TPU available: False, using: 0 TPU cores
INFO:lightning.pytorch.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO: IPU available: False, using: 0 IPUs
INFO:lightning.pytorch.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO: HPU available: False, using: 0 HPUs
INFO:lightning.pytorch.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO: 
  | Name           | Type           | Params
--------------------------------------------------
0 | embedding      | Embedding      | 2.7 M 
1 | rnn            | RNN            | 416 K 
2 | linear         | Linear         | 513   
3 | sigmoid        | Sigmoid        | 0     
4 | loss_fn        | BCELoss        | 0     
5 | train_accuracy | BinaryAccuracy | 0     
6 | val_accuracy   | BinaryAccuracy | 0     
--------------------------------------------------
3.1 M     Trainable params
0         Non-

Sanity Checking: |          | 0/? [00:00<?, ?it/s]

Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Epoch : 1 val accuracy : 0.4959999918937683


Validation: |          | 0/? [00:00<?, ?it/s]

Epoch : 2 val accuracy : 0.4959999918937683


Validation: |          | 0/? [00:00<?, ?it/s]

Epoch : 3 val accuracy : 0.4959999918937683


Validation: |          | 0/? [00:00<?, ?it/s]

INFO: `Trainer.fit` stopped: `max_epochs=5` reached.
INFO:lightning.pytorch.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=5` reached.


Epoch : 4 val accuracy : 0.4959999918937683


## Multi Layer RNN

In [25]:
## Build Trainer
model = RNNModel(vocab_size=len(token2idx),emb_dim =300, hidden_dim = 512,
                 num_classes = 1, learning_rate = 1e-3, dropout = 0.5, num_layers = 3)

## logger
logger =  CSVLogger("logs", name="sentiment_analysis")

## checkpoints
checkpoint_callback  = ModelCheckpoint(filename='{epoch}-{val_loss:.2f}-{val_accuracy:.2f}',
                                        every_n_epochs = 10,
                                        save_top_k = -1,
                                        monitor='val_loss',
                                      )


trainer = L.Trainer(accelerator="cpu",
                     max_epochs = 5,
                     check_val_every_n_epoch=1,
                     callbacks=[checkpoint_callback],
                     logger=logger

                    )

## Train the Model
trainer.fit(model, train_dl, val_dl)

INFO: GPU available: False, used: False
INFO:lightning.pytorch.utilities.rank_zero:GPU available: False, used: False
INFO: TPU available: False, using: 0 TPU cores
INFO:lightning.pytorch.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO: IPU available: False, using: 0 IPUs
INFO:lightning.pytorch.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO: HPU available: False, using: 0 HPUs
INFO:lightning.pytorch.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO: 
  | Name           | Type           | Params
--------------------------------------------------
0 | embedding      | Embedding      | 2.7 M 
1 | rnn            | RNN            | 1.5 M 
2 | linear         | Linear         | 513   
3 | sigmoid        | Sigmoid        | 0     
4 | loss_fn        | BCELoss        | 0     
5 | train_accuracy | BinaryAccuracy | 0     
6 | val_accuracy   | BinaryAccuracy | 0     
--------------------------------------------------
4.2 M     Trainable params
0         Non-

Sanity Checking: |          | 0/? [00:00<?, ?it/s]

Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Epoch : 1 val accuracy : 0.4945000112056732


Validation: |          | 0/? [00:00<?, ?it/s]

Epoch : 2 val accuracy : 0.4945000112056732


Validation: |          | 0/? [00:00<?, ?it/s]

Epoch : 3 val accuracy : 0.4945000112056732


Validation: |          | 0/? [00:00<?, ?it/s]

INFO: `Trainer.fit` stopped: `max_epochs=5` reached.
INFO:lightning.pytorch.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=5` reached.


Epoch : 4 val accuracy : 0.4945000112056732


You may have noticed the loss is not really decreasing and the accuracy is poor. This is due to several issues with the model which we'll improve in the next notebook.