# Recurrent Neural Networks for NLP

We will work with recurrent neural networks (RNNs), a kind of neural network designed for sequential or time-varying data, where the data points have a temporal relationship to one another. <br>
In a recurrent neural network, connections between neurons form a directed graph along a temporal sequence, which lets the network exhibit temporal dynamic behavior. <br>
A traditional feed-forward network has no memory of the previous input; however, an RNN uses a memory unit to remember the previous input and therefore processes the current input based on the sequence of inputs so far. <br>

An RNN can be pictured as multiple copies of the same network, one per time step, with information from each step fed into the next; the recurrent loop encapsulates all of this. A recurrent neural network accepts an input and gives an output, but this output depends not just on the input at the given instant, but on the entire history of inputs given to the network, which the network summarizes in its hidden state.
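
The recurrence described above can be sketched with a single scalar hidden state. This is a minimal illustration rather than a trained model: the weights `w_x`, `w_h`, and `b` are made-up values, and the update rule `h_t = tanh(w_x * x_t + w_h * h_prev + b)` is the textbook simple-RNN step.

```python
import math

def rnn_step(x_t, h_prev, w_x=0.5, w_h=0.8, b=0.0):
    """One recurrence step: h_t = tanh(w_x * x_t + w_h * h_prev + b)."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)

def run_rnn(inputs, h0=0.0):
    """Apply the same step (same weights) at every time step; return the final state."""
    h = h0
    for x_t in inputs:
        h = rnn_step(x_t, h)
    return h

# Both sequences end with the same input (0.5), but their histories differ.
h_a = run_rnn([1.0, 1.0, 0.5])
h_b = run_rnn([-1.0, -1.0, 0.5])
print(h_a, h_b)
```

The two final states differ even though the last input is identical, which is the sense in which the output depends on the whole input history.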

In [1]:
!pip install torchtext



In [1]:
import torchtext

## Tokenization

When dealing with a natural language processing task, we take a text corpus and break it down into smaller units called tokens; this process is called tokenization. <br>
A computer can only work with numbers, so each distinct token is then assigned a unique integer value that represents it.

In [2]:
tokenizer = lambda words : words.split()

In [3]:
tokenizer("This is a test for tokenizer")

['This', 'is', 'a', 'test', 'for', 'tokenizer']
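
To make the token-to-integer step concrete, here is a small pure-Python sketch. The helper names `build_vocab` and `numericalize` and the special tokens are illustrative assumptions, not torchtext functions; torchtext performs the equivalent work inside its `Field` and vocabulary objects.

```python
def tokenize(text):
    """Lowercase and split on whitespace, like the lambda tokenizer above."""
    return text.lower().split()

def build_vocab(sentences, specials=("<unk>", "<pad>")):
    """Assign each distinct token the next free integer, after the special tokens."""
    vocab = {tok: i for i, tok in enumerate(specials)}
    for sentence in sentences:
        for tok in tokenize(sentence):
            if tok not in vocab:
                vocab[tok] = len(vocab)
    return vocab

def numericalize(text, vocab):
    """Map tokens to integers; unseen tokens fall back to the <unk> index."""
    unk = vocab["<unk>"]
    return [vocab.get(tok, unk) for tok in tokenize(text)]

corpus = ["This is a test", "This is another test"]
vocab = build_vocab(corpus)
ids = numericalize("this is a new test", vocab)
print(ids)  # "new" is not in the corpus, so it maps to the <unk> index 0
```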

## Creating fields

Fields make it easy to process natural-language data. Fields let us define the datatype and help us create
tensors out of textual data by specifying the set of operations to be performed on the data.
The Field class lets us perform common text processing tasks and holds the vocabulary of the data at hand.

In [29]:
!pip install -U torchtext==0.9.0


In [4]:
torchtext.__version__

'0.9.0'

In [5]:
from torchtext.legacy.data import Field

We define a Field object for reviews and the field for labels:

In [6]:
Review = Field(sequential=True, tokenize=tokenizer, lower=True)

In [7]:
Label = Field(sequential=False, use_vocab=False)

We can: <br>
- add a token at the beginning and end of an input string.<br>
- set the sequence to a fixed length.<br>
- set an unknown token.<br>
- set the batch dimension as the first dimension.<br>

In [8]:
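# add start-of-sequence and end-of-sequence tokens around every example
SequenceField = Field(tokenize=tokenizer, init_token='<sos>', eos_token='<eos>', lower=True)
# additionally pad or truncate every example to a fixed length
SequenceField = Field(tokenize=tokenizer, init_token='<sos>', eos_token='<eos>', lower=True, fix_length=50)
# set the token used for out-of-vocabulary words
SequenceField = Field(tokenize=tokenizer, init_token='<sos>', eos_token='<eos>', unk_token='<unk>')
# produce tensors with the batch dimension first
SequenceField = Field(tokenize=tokenizer, init_token='<sos>', eos_token='<eos>', unk_token='<unk>', batch_first=True)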
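
The effect of `init_token`, `eos_token`, and `fix_length` on a single example can be approximated in plain Python. The sketch below is an assumption-laden imitation, not torchtext's code: `pad_example` is a hypothetical helper, and it assumes the fixed length counts the two special tokens and that padding is appended on the right.

```python
def pad_example(tokens, fix_length, init_token="<sos>", eos_token="<eos>", pad_token="<pad>"):
    """Wrap tokens in start/end markers, then truncate or right-pad to fix_length."""
    max_len = fix_length - 2            # leave room for the two special tokens
    body = tokens[:max_len]             # truncate long examples
    padded = [init_token] + body + [eos_token]
    padded += [pad_token] * (fix_length - len(padded))  # pad short examples
    return padded

short = pad_example(["a", "b"], fix_length=6)
long = pad_example(list("abcdef"), fix_length=5)
print(short)
print(long)
```

Every example comes out with exactly `fix_length` tokens, which is what allows a batch to be stacked into one tensor.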

## Developing a dataset

In [10]:
from torchtext.legacy.data import TabularDataset

Select the training columns:

In [11]:
train_datafields = [("id", None),
                    ("content", Review), ("Business", Label),
                    ("SciTech", Label), ("Sports", Label),
                    ("World", Label)]

Select the testing columns:

In [12]:
test_datafields = [("id", None),
                    ("content", Review)]

Read the training, validation, and test .csv files:

In [19]:
train = TabularDataset(path='NewsClassification/train.csv', 
                        format='csv',
                        skip_header=True,
                        fields=train_datafields)

In [18]:
valid = TabularDataset(path='NewsClassification/valid.csv', 
                        format='csv',
                        skip_header=True,
                        fields=train_datafields)

In [16]:
test = TabularDataset(path="NewsClassification/test.csv",
                    format='csv',
                    skip_header=True,
                    fields=test_datafields)

Build the vocabulary:

In [20]:
Review.build_vocab(train, min_freq=2)
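
What `min_freq=2` does can be illustrated with a frequency count: tokens seen fewer than two times are left out of the vocabulary and would later map to the unknown token. The function `build_vocab_min_freq` below is a hypothetical stand-in, and its ordering (most frequent first, ties broken alphabetically) is an assumption made for determinism.

```python
from collections import Counter

def build_vocab_min_freq(token_lists, min_freq=2, specials=("<unk>", "<pad>")):
    """Keep only tokens that occur at least min_freq times across the corpus."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    kept = sorted((t for t, c in counts.items() if c >= min_freq),
                  key=lambda t: (-counts[t], t))
    itos = list(specials) + kept                 # index -> token
    stoi = {tok: i for i, tok in enumerate(itos)}  # token -> index
    return itos, stoi

docs = [["stocks", "rise", "today"], ["stocks", "fall", "today"], ["team", "wins"]]
itos, stoi = build_vocab_min_freq(docs)
print(itos)
```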

## Developing iterators

Iterators are used to load batches of data from the dataset. They provide methods to make loading data and moving data to the appropriate device easier. We could use these iterator objects to iterate over the data while running through the epochs.

In [21]:
from torchtext.legacy.data import BucketIterator
import torch

Define the batch size:

In [22]:
BATCH_SIZE = 128

Identify the device that's available:

In [23]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

Use BucketIterator to create buckets of datasets: 

In [24]:
train_iter, valid_iter, test_iter = BucketIterator.splits(
                                    (train, valid, test),
                                    batch_size=BATCH_SIZE,
                                    device=device,
                                    sort_key=lambda x: len(x.content),  # "content" is the text field defined above
                                    sort_within_batch=False
)
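
The motivation for bucketing is that batching examples of similar length reduces padding. The sketch below is a simplified illustration of that idea only; the real `BucketIterator` also shuffles, and the helper names here are hypothetical.

```python
def make_batches(examples, batch_size):
    """Cut a list of examples into consecutive batches."""
    return [examples[i:i + batch_size] for i in range(0, len(examples), batch_size)]

def padding_needed(batches):
    """Pad tokens required to bring each example up to the longest one in its batch."""
    total = 0
    for batch in batches:
        longest = max(len(example) for example in batch)
        total += sum(longest - len(example) for example in batch)
    return total

lengths = [5, 1, 4, 2, 6, 1]
examples = [["tok"] * n for n in lengths]

naive = make_batches(examples, batch_size=2)                      # original order
bucketed = make_batches(sorted(examples, key=len), batch_size=2)  # similar lengths together

naive_pad = padding_needed(naive)
bucketed_pad = padding_needed(bucketed)
print(naive_pad, bucketed_pad)
```

With these toy lengths, grouping by length cuts the number of pad tokens from 11 to 3.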

## Exploring word embeddings

Word embeddings are learned, dense representations of words: each word is assigned a real-valued vector in a predefined vector space rather than just a numerical identifier.

In word embedding, words with similar meanings have a similar representation, and we can perform vector arithmetic on these word vectors.
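
As a toy illustration of similarity and vector arithmetic, the sketch below uses hand-made 3-dimensional vectors. These values are invented for the example and are not GloVe vectors; real embeddings have many more dimensions and only approximately satisfy such analogies.

```python
import math

# Made-up vectors; the dimensions loosely stand for (royalty, male, female).
toy = {
    "king":  [1.0, 1.0, 0.0],
    "man":   [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "queen": [1.0, 0.0, 1.0],
}

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# king - man + woman, computed element-wise
analogy = [k - m + w for k, m, w in zip(toy["king"], toy["man"], toy["woman"])]
nearest = max(toy, key=lambda word: cosine(analogy, toy[word]))
print(nearest)
```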

In [25]:
from torchtext import vocab

Move on to loading the embedding vectors:

In [27]:
vec = vocab.Vectors('glove.6B.100d.txt',
    cache='./vec/glove_embedding/',
    url='http://nlp.stanford.edu/data/glove.6B.zip')



We can build the vocabulary with the pretrained vectors attached by passing them to the field object:

In [28]:
Review.build_vocab(train, min_freq=2, vectors=vec)

We have loaded the pretrained word embedding.
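
One common way to use pretrained vectors in a model is to initialize an embedding layer from them. The sketch below uses a small made-up matrix in place of the GloVe weights; in the legacy torchtext API the real matrix is typically read from `Review.vocab.vectors`, which is an assumption about this setup rather than something the cells above demonstrate.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a pretrained matrix: 5 vocabulary entries, 4 dimensions each.
pretrained = torch.arange(20, dtype=torch.float32).reshape(5, 4)

# freeze=False keeps the vectors trainable so they can be fine-tuned.
embedding = nn.Embedding.from_pretrained(pretrained, freeze=False)

token_ids = torch.tensor([[0, 3], [4, 1]])  # shape (seq_len=2, batch=2)
vectors = embedding(token_ids)              # shape (2, 2, 4)
print(vectors.shape)
```

Each integer id simply selects the matching row of the matrix, so the lookup for id 3 returns `pretrained[3]`.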

## Building an LSTM network

Long short-term memory (LSTM) networks are a type of recurrent neural network with internal gates that help information persist over longer spans. These gates are tiny neural networks that control when information needs to be saved and when it can be erased or forgotten.<br>
Plain RNNs suffer from vanishing and exploding gradients, which makes it difficult to learn long-term dependencies. LSTMs are more resistant to these problems, although both are still mathematically possible.
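
The gating idea can be sketched with a scalar LSTM step using the standard equations (forget, input, and output gates plus a candidate value). The parameter names and values below are invented for illustration; `nn.LSTM` applies the same equations to vectors with learned weight matrices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One scalar LSTM step; p holds made-up weights for each gate."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate: how much old memory to keep
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate: how much new info to write
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate: how much memory to expose
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate value
    c = f * c_prev + i * g                                   # updated cell state
    h = o * math.tanh(c)                                     # updated hidden state
    return h, c

# Biases chosen so the forget gate is almost fully open and the input gate almost closed.
params = dict(wf=0.1, uf=0.1, bf=20.0, wi=0.1, ui=0.1, bi=-20.0,
              wo=0.5, uo=0.5, bo=0.0, wg=1.0, ug=1.0, bg=0.0)

h, c = lstm_step(x=0.3, h_prev=0.2, c_prev=0.7, p=params)
print(c)  # stays very close to the previous cell state 0.7
```

When the forget gate is near 1 and the input gate near 0, the cell state passes through almost unchanged, which is how an LSTM can carry information across many steps.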

In [29]:
import torch.nn as nn

Name the class LSTMClassifier:

In [37]:
class LSTMClassifier(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, output_dim, dropout):
        """Constructor of the class"""
        super(LSTMClassifier, self).__init__()
        self.embedding = nn.Embedding(len(Review.vocab), embedding_dim) # add the embedding layer
        self.rnn = nn.LSTM(embedding_dim, hidden_dim) # add the LSTM layer
        self.fc = nn.Linear(hidden_dim, output_dim) # add a fully connected layer
        self.dropout = nn.Dropout(dropout) # define the dropout layer
    
    def forward(self, x):
        x = self.embedding(x)
        output, (hidden, cell) = self.rnn(x)
        hidden = self.dropout(hidden[-1])  # final hidden state, shape (batch, hidden_dim)
        return self.fc(hidden)

Define the hyperparameters as follows:

In [31]:
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = 1
DROPOUT = 0.5

Create a model object:

In [38]:
model = LSTMClassifier(EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM, DROPOUT)

## Multilayer LSTMs

In [39]:
class MultiLSTMClassifier(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, output_dim, dropout, num_layers):
        """Constructor of the class"""
        super(MultiLSTMClassifier, self).__init__()
        self.embedding = nn.Embedding(len(Review.vocab), embedding_dim)
        self.rnn = nn.LSTM(embedding_dim, hidden_dim, num_layers=num_layers)
        self.fc = nn.Linear(hidden_dim, output_dim)
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x):
        x = self.embedding(x)
        output, (hidden, cell) = self.rnn(x)
        hidden = self.dropout(hidden)
        return self.fc(hidden[-1])

In [40]:
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = 1
DROPOUT = 0.5
NUM_LAYERS = 2

Create the model object:

In [41]:
model = MultiLSTMClassifier(EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM, DROPOUT, NUM_LAYERS)
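
To see why the classifier reads `hidden[-1]`, it can help to inspect the tensor shapes that `nn.LSTM` returns for a stacked network. The sizes below are small arbitrary values chosen for the sketch, and the input is random noise rather than real embeddings.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
SEQ_LEN, BATCH, EMB, HID, LAYERS = 7, 3, 8, 16, 2

lstm = nn.LSTM(EMB, HID, num_layers=LAYERS)
x = torch.randn(SEQ_LEN, BATCH, EMB)   # (seq_len, batch, embedding_dim)
output, (hidden, cell) = lstm(x)

print(output.shape)  # top layer's hidden state at every time step
print(hidden.shape)  # final hidden state of every layer
```

Here `output` has shape `(seq_len, batch, hidden_dim)` and `hidden` has shape `(num_layers, batch, hidden_dim)`, so `hidden[-1]` is the top layer's final state and matches `output[-1]`.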

## Bidirectional LSTMs

In a normal LSTM, the network reads the input sequence from first to last; in a bidirectional LSTM, there is a second LSTM, a backward RNN, that reads the sequence from last to first.<br>
This type of LSTM improves model performance when the prediction at the current time step depends on inputs that appear later in the sequence. Consider the examples "I read comics" and "I read comics yesterday". The same token, read, is interpreted differently (present versus past tense) depending on a token that appears later in the sentence. We will explore its implementation in this recipe.

In [42]:
class BiLSTMClassifier(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, output_dim, dropout, num_layers):
        """Constructor of the class"""
        super(BiLSTMClassifier, self).__init__()
        self.embedding = nn.Embedding(len(Review.vocab), embedding_dim)
        self.rnn = nn.LSTM(embedding_dim, hidden_dim, num_layers=num_layers, bidirectional=True)
        self.fc = nn.Linear(2*hidden_dim, output_dim)
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x):
        x = self.embedding(x)
        output, (hidden, cell) = self.rnn(x)
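        # last layer's final forward state (hidden[-2]) and final backward state (hidden[-1])
        hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim=1))
        return self.fc(hidden)  # hidden is already (batch, 2*hidden_dim); no squeeze needed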

In [43]:
model = BiLSTMClassifier(EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM, DROPOUT, NUM_LAYERS)

We concatenated the hidden states of the forward and backward LSTMs and passed them into the fully connected layer. Because of this, the input dimension of the fully connected layer was doubled to accommodate the forward and backward hidden state tensors.
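
The indexing used in `forward` can be checked on a small bidirectional LSTM with random input. The sizes are arbitrary values chosen for this sketch; the point is only to show where the forward and backward final states sit in the returned tensors.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
SEQ_LEN, BATCH, EMB, HID, LAYERS = 7, 3, 8, 16, 2

bilstm = nn.LSTM(EMB, HID, num_layers=LAYERS, bidirectional=True)
x = torch.randn(SEQ_LEN, BATCH, EMB)
output, (hidden, cell) = bilstm(x)

# hidden stacks (layer, direction) pairs: the last two entries belong to the top layer.
forward_final = hidden[-2]    # top layer, forward direction, after the last token
backward_final = hidden[-1]   # top layer, backward direction, after the first token
features = torch.cat((forward_final, backward_final), dim=1)

print(output.shape, hidden.shape, features.shape)
```

The concatenated `features` tensor has `2 * hidden_dim` columns, which is why the fully connected layer is declared with `2*hidden_dim` inputs.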