<a href="https://colab.research.google.com/github/gupta24789/sentiment-analysis/blob/main/sentiment_rnn_lighting.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## RNN

**Recurrent neural networks** (RNNs) are commonly used to analyse sequences. An RNN takes in a sequence of words, $X=\{x_1, ..., x_T\}$, one at a time, and produces a _hidden state_, $h$, for each word. We use the RNN _recurrently_ by feeding in the current word $x_t$ together with the hidden state from the previous word, $h_{t-1}$, to produce the next hidden state, $h_t$.

$$h_t = \text{RNN}(x_t, h_{t-1})$$

Once we have our final hidden state, $h_T$ (from feeding in the last word in the sequence, $x_T$), we feed it through a linear layer, $f$ (also known as a fully connected layer), to obtain our predicted sentiment, $\hat{y} = f(h_T)$.

The figure below shows an example sentence, with the RNN predicting the output. Note that we use the same RNN for every word, i.e. it has the same parameters at every step. The initial hidden state, $h_0$, is a tensor initialized to all zeros.

![](https://github.com/gupta24789/sentiment-analysis/blob/main/imgaes/rnn.jpg?raw=1)
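
To make the recurrence concrete, here is a minimal pure-Python sketch of an Elman-style update, $h_t = \tanh(W_{ih} x_t + W_{hh} h_{t-1} + b)$, which is the form `nn.RNN` uses by default (with its two bias vectors merged into a single $b$ here). The weights, the sizes, and the helper name `rnn_step` are made up for illustration. Only the structure follows the description above: the same parameters at every step, $h_0$ all zeros, and a linear layer on $h_T$. The sigmoid at the end is an assumption borrowed from the model defined later in this notebook.

```python
import math

def rnn_step(x_t, h_prev, W_ih, W_hh, b):
    """One recurrent update: h_t = tanh(W_ih @ x_t + W_hh @ h_prev + b)."""
    h_t = []
    for i in range(len(b)):
        total = b[i]
        total += sum(w * x for w, x in zip(W_ih[i], x_t))
        total += sum(w * h for w, h in zip(W_hh[i], h_prev))
        h_t.append(math.tanh(total))
    return h_t

# Hypothetical toy parameters: 2-dim word vectors, 2-dim hidden state.
W_ih = [[0.5, -0.3], [0.1, 0.8]]
W_hh = [[0.2, 0.0], [0.0, 0.2]]
b = [0.0, 0.1]

# A "sentence" of three word vectors x_1, x_2, x_3.
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

h = [0.0, 0.0]  # h_0 is initialised to all zeros
for x_t in sentence:
    # The same parameters are reused at every time step.
    h = rnn_step(x_t, h, W_ih, W_hh, b)

# Linear layer f applied to the final hidden state h_T, then a sigmoid.
f_weights, f_bias = [1.0, -1.0], 0.0
score = sum(w * v for w, v in zip(f_weights, h)) + f_bias
y_hat = 1.0 / (1.0 + math.exp(-score))
print(h, y_hat)
```

Because every hidden value passes through `tanh`, each entry of `h` stays strictly between -1 and 1 regardless of sentence length.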


In [1]:
!pip install -q pytorch-lightning

In [2]:
import pandas as pd
import numpy as np
import itertools
import warnings

import re
import string
import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords

nltk.download('stopwords')

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import optim
from torch.nn.utils.rnn import  pad_sequence
from torch.utils.data import Dataset, DataLoader

from torchmetrics import Accuracy
import pytorch_lightning as pl

warnings.filterwarnings('ignore')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Set Seed

In [3]:
## set seed
np.random.seed(121)
torch.manual_seed(121)
pl.seed_everything(121)

INFO:lightning_fabric.utilities.seed:Seed set to 121


121

## Utilities

In [4]:
def process_tweet(tweet):
    """Process tweet function.
    Input:
        tweet: a string containing a tweet
    Output:
        tweets_clean: a list of words containing the processed tweet

    """
    stemmer = PorterStemmer()
    stopwords_english = stopwords.words('english')
    # remove stock market tickers like $GE
    tweet = re.sub(r'\$\w*', '', tweet)
    # remove old style retweet text "RT"
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    # remove hyperlinks
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)
    # remove hashtags
    # only removing the hash # sign from the word
    tweet = re.sub(r'#', '', tweet)
    # tokenize tweets
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,reduce_len=True)
    tweet_tokens = tokenizer.tokenize(tweet)

    tweets_clean = []
    for word in tweet_tokens:
        if (word not in stopwords_english and  # remove stopwords
                word not in string.punctuation):  # remove punctuation
            # tweets_clean.append(word)
            stem_word = stemmer.stem(word)  # stemming word
            tweets_clean.append(stem_word)

    return tweets_clean
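
One detail of the ticker pattern is worth checking by hand: in a regular expression an unescaped `$` is an end-of-string anchor, so the dollar sign has to be escaped to match a literal ticker such as `$GE`. The short check below uses a made-up tweet to compare the two patterns.

```python
import re

tweet = "Bought $GE today, great price"

# Unescaped: `$` anchors at the end of the string, so the ticker is left untouched.
unescaped = re.sub(r'$\w*', '', tweet)

# Escaped: `\$` matches a literal dollar sign followed by word characters.
escaped = re.sub(r'\$\w*', '', tweet)

print(unescaped)
print(escaped)
```

The escaped pattern removes `$GE` but leaves the surrounding spaces, which the tokenizer later splits on anyway.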

## Load Data

In [5]:
train_df = pd.read_csv("https://raw.githubusercontent.com/gupta24789/sentiment-analysis/main/data/train.csv")
val_df = pd.read_csv("https://raw.githubusercontent.com/gupta24789/sentiment-analysis/main/data/val.csv")

import ast

## parse the stringified token lists safely; missing rows become empty lists
train_df.processed_tweet = train_df.processed_tweet.fillna('[]').apply(ast.literal_eval)
val_df.processed_tweet = val_df.processed_tweet.fillna('[]').apply(ast.literal_eval)

## remove blank
train_df = train_df[train_df.processed_tweet.str.len()!=0]
val_df = val_df[val_df.processed_tweet.str.len()!=0]

train_df = train_df.dropna()
val_df = val_df.dropna()

## reset index
train_df = train_df.reset_index(drop = True)
val_df = val_df.reset_index(drop = True)

In [6]:
train_df.label.value_counts()

0.0    3999
1.0    3987
Name: label, dtype: int64

In [7]:
val_df.label.value_counts()

0    1000
1     999
Name: label, dtype: int64

In [8]:
train_df.head(3)

Unnamed: 0,raw_tweet,processed_tweet,label
0,Want to say a huge thanks to @WarriorAssaultS ...,"[want, say, huge, thank, ff, thank, support, :)]",1.0
1,@jaynehh_ you just need a job and get a letter...,"[need, job, get, letter, work, place, say, wor...",1.0
2,"@knhillrocks HA yes, make it quick tho :D","[ha, ye, make, quick, tho, :d]",1.0


## Build Vocab

In [9]:
special_words = ['__PAD__','__UNK__','</e>']
unique_words = list(set(itertools.chain.from_iterable(train_df.processed_tweet.tolist())))
vocab = special_words + unique_words
vocab = {w:i for i,w in enumerate(vocab)}
print(f"Number of words in vocab : {len(vocab)}")

Number of words in vocab : 9092


## Convert tweet to number

In [10]:
def tweet_to_tensor(processed_tweet_list, unk_token = '__UNK__'):
  to_tensor_list = []
  unk_token_id = vocab[unk_token]

  for w in processed_tweet_list:
    to_tensor_list.append(vocab.get(w,unk_token_id))

  to_tensor = torch.tensor(to_tensor_list)
  return to_tensor
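
The vocabulary and lookup steps can be traced on a toy example. This is a sketch with made-up tweets, and `toy_vocab` and `tweet_to_ids` are hypothetical names. It uses plain Python lists instead of tensors, and `sorted()` is added only so the ids are reproducible between runs (the notebook's `set` ordering may differ).

```python
import itertools

# Toy stand-in for train_df.processed_tweet (hypothetical data).
toy_tweets = [["want", "say", "thank"], ["need", "job", "say"]]

special_words = ['__PAD__', '__UNK__', '</e>']
# sorted() is an assumption made here so that ids are stable across runs.
unique_words = sorted(set(itertools.chain.from_iterable(toy_tweets)))
toy_vocab = {w: i for i, w in enumerate(special_words + unique_words)}

def tweet_to_ids(tokens, vocab, unk_token='__UNK__'):
    """Map tokens to ids, falling back to the unknown-word id."""
    unk_id = vocab[unk_token]
    return [vocab.get(w, unk_id) for w in tokens]

ids = tweet_to_ids(["say", "thank", "unseen"], toy_vocab)
print(toy_vocab)
print(ids)
```

Words that never appeared in the training tweets, such as `"unseen"` here, all share the `__UNK__` id, which is why the vocabulary is built from `train_df` only.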

In [11]:
train_df['tensor_tweet'] = [tweet_to_tensor(tweet) for tweet in train_df.processed_tweet]
val_df['tensor_tweet'] = [tweet_to_tensor(tweet) for tweet in val_df.processed_tweet]

In [12]:
train_df.head(3)

Unnamed: 0,raw_tweet,processed_tweet,label,tensor_tweet
0,Want to say a huge thanks to @WarriorAssaultS ...,"[want, say, huge, thank, ff, thank, support, :)]",1.0,"[tensor(344), tensor(3466), tensor(290), tenso..."
1,@jaynehh_ you just need a job and get a letter...,"[need, job, get, letter, work, place, say, wor...",1.0,"[tensor(7849), tensor(5791), tensor(5811), ten..."
2,"@knhillrocks HA yes, make it quick tho :D","[ha, ye, make, quick, tho, :d]",1.0,"[tensor(6944), tensor(7290), tensor(2124), ten..."


In [13]:
train_data = train_df[['tensor_tweet','label']].reset_index(drop = True).to_dict("records")
val_data = val_df[['tensor_tweet','label']].reset_index(drop = True).to_dict("records")

## Data Loaders

In [14]:
def custom_collate(data):
  features = [d['tensor_tweet'] for d in data]
  labels = [d['label'] for d in data]
  padded_features = pad_sequence(features, batch_first=True, padding_value= vocab['__PAD__'])
  labels = torch.tensor(labels, dtype = torch.float32)
  batch = {"features": padded_features,"labels": labels}
  return batch

In [15]:
train_dl = DataLoader(train_data, batch_size = 2, collate_fn = custom_collate, shuffle = True)
example = next(iter(train_dl))
feature, label = example['features'], example['labels']

In [16]:
feature

tensor([[7311, 7763, 8251, 6895, 6349, 5179, 4317, 2106, 1910,  584],
        [6791, 1236, 8373, 3708, 1120, 4369, 3466, 2324, 3708,    0]])

In [17]:
label

tensor([0., 1.])
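
The padding behaviour seen in the example batch can be reproduced in isolation. The sketch below uses two made-up id sequences and assumes the padding id is 0, which matches `vocab['__PAD__']` because `__PAD__` is the first entry in the vocabulary.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_ID = 0  # assumption: same value as vocab['__PAD__'] in this notebook

# Two hypothetical tweets of different lengths, already mapped to ids.
features = [torch.tensor([5, 6, 7]), torch.tensor([8, 9])]

# Shorter sequences are right-padded up to the longest one in the batch.
padded = pad_sequence(features, batch_first=True, padding_value=PAD_ID)
lengths = torch.tensor([len(f) for f in features])

print(padded)   # shape: [batch size, longest tweet in the batch]
print(lengths)  # true lengths before padding
```

Padding is done per batch, so the sequence length can differ from one batch to the next.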

In [18]:
### dataloader
BATCH_SIZE = 64
train_dl = DataLoader(train_data, batch_size = BATCH_SIZE, collate_fn = custom_collate, shuffle = True)
val_dl = DataLoader(val_data, batch_size = BATCH_SIZE, collate_fn = custom_collate, shuffle = False)

## Build Model

The embedding layer transforms our sparse one-hot vector (sparse as most of the elements are 0) into a dense embedding vector (dense as the dimensionality is a lot smaller and all the elements are real numbers). It is equivalent to a single fully connected layer (without a bias) applied to the one-hot vector, and it also reduces the dimensionality of the input to the RNN.
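
The link between an embedding lookup and a fully connected layer can be checked directly. This is a small sketch with toy sizes (not the notebook's vocabulary size or embedding dimension): looking up ids in `nn.Embedding` gives the same result as multiplying explicit one-hot vectors by the embedding weight matrix.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, emb_dim = 6, 3  # toy sizes chosen for illustration
embed = nn.Embedding(vocab_size, emb_dim)

ids = torch.tensor([2, 5, 2])

# Direct lookup: one row of the weight matrix per id.
looked_up = embed(ids)

# Same result via explicit one-hot vectors times the weight matrix.
one_hot = F.one_hot(ids, num_classes=vocab_size).float()
via_matmul = one_hot @ embed.weight

print(torch.allclose(looked_up, via_matmul))
```

In practice the lookup is used because it avoids materialising the large, mostly zero one-hot matrix.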

The RNN layer takes in our dense vector and the previous hidden state $h_{t-1}$, and uses them to calculate the next hidden state, $h_t$.

![](https://github.com/gupta24789/sentiment-analysis/blob/main/imgaes/rnn_emb.jpg?raw=1)

Finally, the linear layer takes the final hidden state and feeds it through a fully connected layer, $f(h_T)$, transforming it to the correct output dimension.

---

The RNN returns two tensors:
  - output : [batch size, sentence length, hidden dim * num directions]
  - hidden : [num layers * num directions, batch size, hidden dim]

Here num directions is 2 for a bidirectional RNN and 1 otherwise.

`output` holds the hidden state of the last layer at every time step, whereas `hidden` is the final hidden state of every layer.
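
These shapes are easy to confirm on random input. The sketch below uses arbitrary toy sizes and checks two things: for a unidirectional RNN the last time step of `output` equals the top layer's entry in `hidden`, and for a bidirectional RNN the last two entries of `hidden` are the top layer's forward and backward final states.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, sent_len, emb_dim, hid_dim = 4, 7, 10, 32  # toy sizes
x = torch.randn(batch, sent_len, emb_dim)

# Unidirectional, 2 layers.
rnn = nn.RNN(emb_dim, hid_dim, num_layers=2, batch_first=True)
output, hidden = rnn(x)
print(output.shape)  # [batch, sent_len, hid_dim]
print(hidden.shape)  # [num_layers, batch, hid_dim]
# The last time step of `output` is the top layer's final hidden state.
same_uni = torch.allclose(output[:, -1, :], hidden[-1])

# Bidirectional, 2 layers.
birnn = nn.RNN(emb_dim, hid_dim, num_layers=2, batch_first=True, bidirectional=True)
bi_output, bi_hidden = birnn(x)
print(bi_output.shape)  # [batch, sent_len, 2 * hid_dim]
print(bi_hidden.shape)  # [2 * num_layers, batch, hid_dim]
# hidden[-2] is the top layer's forward state (after the last word);
# hidden[-1] is the top layer's backward state (after the first word).
same_fwd = torch.allclose(bi_output[:, -1, :hid_dim], bi_hidden[-2])
same_bwd = torch.allclose(bi_output[:, 0, hid_dim:], bi_hidden[-1])
final = torch.cat((bi_hidden[-2], bi_hidden[-1]), dim=1)
print(final.shape)  # [batch, 2 * hid_dim]
```

The concatenated `final` tensor is what a linear layer with `hidden_dim * 2` input features expects.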

In [19]:
## Model
class RNNModel(pl.LightningModule):

  def __init__(self, num_embeddings, embedding_dim, hidden_dim, learning_rate, num_layers, bidirectional):
    super().__init__()
    self.learning_rate = learning_rate
    self.bidirectional = bidirectional

    ## define loss & metrics
    self.loss_fn = nn.BCELoss()
    self.train_accuracy = Accuracy(task = "binary", num_classes = 2, threshold= 0.5)
    self.val_accuracy = Accuracy(task = "binary", num_classes = 2, threshold= 0.5)

    ## Define Model
    self.embed_layer = nn.Embedding(num_embeddings= num_embeddings, embedding_dim=embedding_dim)
    self.rnn = nn.RNN(embedding_dim, hidden_dim, batch_first = True, num_layers = num_layers, bidirectional=bidirectional)
    self.relu = nn.ReLU()
    self.linear = nn.Linear(in_features= hidden_dim * 2 if bidirectional else hidden_dim, out_features= 1)
    self.sigmoid = nn.Sigmoid()


  def forward(self, feature, verbose = False):

    # feature : [batch size, sent len]

    embedded = self.embed_layer(feature)

    # embedded : [batch size, sent len, emb dim]
    rnn_output, hidden = self.rnn(embedded)

    # rnn_output : [batch size, sent len, hid dim * num directions]
    # hidden     : [num layers * num directions, batch size, hid dim]

    if self.bidirectional:
      ## concatenate the last layer's final forward & backward hidden states
      hidden_squeezed = torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1)
    else:
      ## indexing already drops the layer dim; squeeze(0) would also drop a batch of size 1
      hidden_squeezed = hidden[-1,:,:]

    out = self.relu(hidden_squeezed)
    out_linear = self.linear(out)
    out_sigmoid = self.sigmoid(out_linear)
    ## output : [batch size], probabilities in (0, 1)
    output = torch.squeeze(out_sigmoid, dim = 1)

    if verbose:
      print(f"features : {feature.shape}")
      print(f"embedded : {embedded.shape}")
      print(f"rnn output : {rnn_output.shape}")
      print(f"hidden : {hidden.shape}")
      print(f"hidden_squeezed : {hidden_squeezed.shape}")
      print(f"linear : {out_linear.shape}")
      print(f"output : {output.shape}")

    return output

  def _shared_step(self, batch):
    feature, label = batch['features'],batch['labels']
    logits = self(feature)
    loss = self.loss_fn(logits, label)
    return logits, loss, label

  def training_step(self, batch, batch_idx):
    logits, loss, label = self._shared_step(batch)
    self.train_accuracy(logits,label)
    self.log_dict({"train_loss": loss, "train_accuracy": self.train_accuracy}, on_step = False, on_epoch = True, prog_bar=True)
    return loss

  def validation_step(self, batch, batch_idx):
    logits, loss, label = self._shared_step(batch)
    self.val_accuracy(logits,label)
    self.log_dict({"val_loss": loss,  "val_accuracy": self.val_accuracy}, on_step = False, on_epoch = True, prog_bar = True)
    return loss

  def on_train_epoch_end(self):
    self.train_accuracy.reset()

  def on_validation_epoch_end(self):
    if self.current_epoch!=0:
      print(f"Epoch : {self.current_epoch} Validation Accuracy : {self.val_accuracy.compute()}")
    self.val_accuracy.reset()

  def configure_optimizers(self):
    optimizer = optim.Adam(self.parameters(), lr =self.learning_rate)
    return optimizer

In [20]:
# ## test model
# model = RNNModel(num_embeddings= len(vocab), embedding_dim=100, hidden_dim= 32, learning_rate=0.001, num_layers=2, bidirectional=True)
# logits = model(feature,verbose = True)
# print(f"Logits : {logits}")
# print(f"Loss : {model.loss_fn(logits, label)}")
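
The model pairs a sigmoid output with `nn.BCELoss` and a 0.5 accuracy threshold. As a worked illustration of what those two quantities measure, here is a pure-Python sketch on three made-up predictions. The helper name `binary_cross_entropy` and the numbers are hypothetical, and the mean reduction is assumed to match `nn.BCELoss`'s default.

```python
import math

def binary_cross_entropy(probs, labels):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over the batch."""
    losses = [
        -(y * math.log(p) + (1 - y) * math.log(1 - p))
        for p, y in zip(probs, labels)
    ]
    return sum(losses) / len(losses)

probs = [0.9, 0.2, 0.6]   # hypothetical sigmoid outputs
labels = [1.0, 0.0, 0.0]  # true sentiment labels

loss = binary_cross_entropy(probs, labels)

# Threshold the probabilities at 0.5 to get hard predictions.
preds = [1.0 if p > 0.5 else 0.0 for p in probs]
accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)

print(round(loss, 4), round(accuracy, 4))
```

The third example (probability 0.6 for a negative tweet) is the only misclassification and also contributes most of the loss.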

## Single Layer RNN

In [21]:
## logger
logger = pl.loggers.CSVLogger("logs", name="sentiment_analysis")

## checkpoints
checkpoint_callback  = pl.callbacks.ModelCheckpoint(
                                                filename='{epoch}-{val_loss:.2f}-{val_accuracy:.2f}',
                                                every_n_epochs = 2,
                                                save_top_k = -1,
                                                monitor='val_loss',
                                                )


model = RNNModel(num_embeddings= len(vocab), embedding_dim=100, hidden_dim= 32, learning_rate=0.0001,num_layers=1, bidirectional=False)

trainer = pl.Trainer(accelerator="cpu",
                     max_epochs = 5,
                     check_val_every_n_epoch=1,
                     callbacks=[checkpoint_callback],
                     logger=logger

                    )

## Train the Model
trainer.fit(model, train_dl, val_dl)

INFO:pytorch_lightning.utilities.rank_zero:GPU available: False, used: False
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.callbacks.model_summary:
  | Name           | Type           | Params
--------------------------------------------------
0 | loss_fn        | BCELoss        | 0     
1 | train_accuracy | BinaryAccuracy | 0     
2 | val_accuracy   | BinaryAccuracy | 0     
3 | embed_layer    | Embedding      | 909 K 
4 | rnn            | RNN            | 4.3 K 
5 | relu           | ReLU           | 0     
6 | linear         | Linear         | 33    
7 | sigmoid        | Sigmoid        | 0     
--------------------------------------------------
913 K     Trainable params
0         Non-trainable params
913 K     Total params
3.654     Total estimated model params size (

Epoch : 1 Validation Accuracy : 0.502751350402832

Epoch : 2 Validation Accuracy : 0.5057528614997864

Epoch : 3 Validation Accuracy : 0.5042521357536316

INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=5` reached.


Epoch : 4 Validation Accuracy : 0.5192596316337585


## Multi Layer RNN

In [22]:
## logger
logger = pl.loggers.CSVLogger("logs", name="sentiment_analysis")

## checkpoints
checkpoint_callback  = pl.callbacks.ModelCheckpoint(
                                                filename='{epoch}-{val_loss:.2f}-{val_accuracy:.2f}',
                                                every_n_epochs = 2,
                                                save_top_k = -1,
                                                monitor='val_loss',
                                                )


model = RNNModel(num_embeddings= len(vocab), embedding_dim=100, hidden_dim= 32, learning_rate=0.0001,num_layers=2, bidirectional=False)

trainer = pl.Trainer(accelerator="cpu",
                     max_epochs = 5,
                     check_val_every_n_epoch=1,
                     callbacks=[checkpoint_callback],
                     logger=logger

                    )

## Train the Model
trainer.fit(model, train_dl, val_dl)

INFO:pytorch_lightning.utilities.rank_zero:GPU available: False, used: False
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.callbacks.model_summary:
  | Name           | Type           | Params
--------------------------------------------------
0 | loss_fn        | BCELoss        | 0     
1 | train_accuracy | BinaryAccuracy | 0     
2 | val_accuracy   | BinaryAccuracy | 0     
3 | embed_layer    | Embedding      | 909 K 
4 | rnn            | RNN            | 6.4 K 
5 | relu           | ReLU           | 0     
6 | linear         | Linear         | 33    
7 | sigmoid        | Sigmoid        | 0     
--------------------------------------------------
915 K     Trainable params
0         Non-trainable params
915 K     Total params
3.663     Total estimated model params size (

Epoch : 1 Validation Accuracy : 0.6718358993530273

Epoch : 2 Validation Accuracy : 0.8904452323913574

Epoch : 3 Validation Accuracy : 0.9354677200317383

INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=5` reached.


Epoch : 4 Validation Accuracy : 0.9509754776954651


## Multi Layer Bidirectional RNN

In [23]:
## logger
logger = pl.loggers.CSVLogger("logs", name="sentiment_analysis")

## checkpoints
checkpoint_callback  = pl.callbacks.ModelCheckpoint(
                                                filename='{epoch}-{val_loss:.2f}-{val_accuracy:.2f}',
                                                every_n_epochs = 2,
                                                save_top_k = -1,
                                                monitor='val_loss',
                                                )


model = RNNModel(num_embeddings= len(vocab), embedding_dim=100, hidden_dim= 32, learning_rate=0.0001,num_layers=2, bidirectional=True)

trainer = pl.Trainer(accelerator="cpu",
                     max_epochs = 5,
                     check_val_every_n_epoch=1,
                     callbacks=[checkpoint_callback],
                     logger=logger

                    )

## Train the Model
trainer.fit(model, train_dl, val_dl)

INFO:pytorch_lightning.utilities.rank_zero:GPU available: False, used: False
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.callbacks.model_summary:
  | Name           | Type           | Params
--------------------------------------------------
0 | loss_fn        | BCELoss        | 0     
1 | train_accuracy | BinaryAccuracy | 0     
2 | val_accuracy   | BinaryAccuracy | 0     
3 | embed_layer    | Embedding      | 909 K 
4 | rnn            | RNN            | 14.8 K
5 | relu           | ReLU           | 0     
6 | linear         | Linear         | 65    
7 | sigmoid        | Sigmoid        | 0     
--------------------------------------------------
924 K     Trainable params
0         Non-trainable params
924 K     Total params
3.696     Total estimated model params size (

Epoch : 1 Validation Accuracy : 0.682841420173645

Epoch : 2 Validation Accuracy : 0.8294147253036499

Epoch : 3 Validation Accuracy : 0.9089545011520386

INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=5` reached.


Epoch : 4 Validation Accuracy : 0.9534767270088196
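
The parameter counts in the three model summaries can be reproduced by hand. The sketch below assumes the standard `nn.RNN` layout: per layer and direction, an input-to-hidden matrix, a hidden-to-hidden matrix, and two bias vectors. The helper name `rnn_param_count` is hypothetical. The sizes are the ones used in this notebook (vocabulary of 9092, embedding dim 100, hidden dim 32).

```python
def rnn_param_count(input_dim, hidden_dim, num_layers=1, bidirectional=False):
    """Parameters of a stacked tanh RNN with two bias vectors per layer and direction."""
    directions = 2 if bidirectional else 1
    total = 0
    layer_input = input_dim
    for _ in range(num_layers):
        per_direction = (
            hidden_dim * layer_input    # input-to-hidden weights
            + hidden_dim * hidden_dim   # hidden-to-hidden weights
            + 2 * hidden_dim            # two bias vectors
        )
        total += directions * per_direction
        # The next layer receives the concatenated outputs of all directions.
        layer_input = hidden_dim * directions
    return total

embedding = 9092 * 100  # vocab size x embedding dim
single = rnn_param_count(100, 32, num_layers=1)
multi = rnn_param_count(100, 32, num_layers=2)
bidir = rnn_param_count(100, 32, num_layers=2, bidirectional=True)
linear_uni = 32 + 1  # weights + bias for one output unit
linear_bi = 64 + 1

print(embedding, single, multi, bidir, linear_uni, linear_bi)
```

These come to 909,200 for the embedding and 4,288, 6,400 and 14,848 for the three RNN variants, in line with the 909 K, 4.3 K, 6.4 K and 14.8 K reported in the summaries. The embedding table holds nearly all of the parameters in every configuration.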
