# NLP. Lesson 10. RNN. ELMo

## RNN


A recurrent neural network (RNN) is a type of artificial neural network designed to recognize patterns in **sequences of data**, such as _time series data, sentences, or video frames_. Unlike traditional feedforward neural networks, which process inputs in a single direction (from input to output), RNNs have connections that form directed cycles, allowing them to maintain a 'memory' of previous inputs. This makes RNNs powerful for tasks where context or sequence order is important, such as:
- Language Modeling: predicting the next word in a sentence.
- Sequence Prediction: tasks like time-series forecasting.
- Text Generation: generating text sequences based on learned patterns.

RNN types:
>Simple (Vanilla) RNN: the basic form, limited by the vanishing gradient problem.

>[Long Short-Term Memory (LSTM)](https://medium.com/@ottaviocalzone/an-intuitive-explanation-of-lstm-a035eb6ab42c): designed to mitigate the vanishing gradient problem so that long-term dependencies can be preserved. You can explore the details of LSTM models [here](https://colah.github.io/posts/2015-08-Understanding-LSTMs/).

>Gated Recurrent Unit (GRU): a simplified version of LSTM with similar performance.

An advantage of RNNs is that they can handle input sequences of varying length.

### Architecture

RNNs contain weights, biases, layers, activation functions, and **feedback loops**.
At each time step, an RNN takes an input vector and its internal state (also known as hidden state) from the previous time step, producing an output vector and updating its internal state. This internal state serves as a memory, enabling RNNs to capture information about previous inputs in the sequence. This recurrent structure makes RNNs suitable for tasks involving sequential data, such as natural language processing (NLP), speech recognition, and time series prediction.

<img src="https://raw.githubusercontent.com/Dnau15/LabImages/main/images/lab10/RNNarch.png" alt="RNN architecture" width="1000"/>

<img src="https://raw.githubusercontent.com/Dnau15/LabImages/main/images/lab10/RNNtypes.png" alt="RNN types architecture" width="1000"/>

### Working process in formulas
At each time step $t$, an RNN takes an input $x_t$ and updates its hidden state $h_t$. The hidden state is influenced by both the current input and the previous hidden state $h_{t-1}$. It is computed as $h_t = \sigma(W_hh_{t-1} + W_xx_t + b_h)$, where $W_h$ is the weight matrix for the hidden state, $W_x$ is the weight matrix for the input, $b_h$ is the bias vector, and $\sigma$ is the activation function (commonly tanh or ReLU).

The output $y_t$ at each time step can be computed using the hidden state: $y_t = \phi(W_yh_t+b_y)$, where $W_y$ is the weight matrix for the output, $b_y$ is the bias vector, $\phi$ is the output activation function.

`Unrolling` refers to the process of expanding the RNN across time steps. For example, an RNN over three time steps can be visualized as:

$h_1=\sigma(W_hh_0+W_xx_1+b_h)$

$h_2=\sigma(W_hh_1+W_xx_2+b_h)$

$h_3=\sigma(W_hh_2+W_xx_3+b_h)$

Each $h_t$ depends on the previous hidden state $h_{t-1}$, and the same weights $W_h$, $W_x$, and $b_h$ are reused at every step.
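The recurrence above can be sketched in a few lines of plain Python. This is a minimal illustration, not the lesson's model: the scalar weights `w_h`, `w_x`, `b_h` and the input values are hypothetical, and tanh is assumed as the activation $\sigma$.

```python
import math


def rnn_forward(inputs, w_h, w_x, b_h, h0=0.0):
    """Unroll a scalar RNN: h_t = tanh(w_h * h_{t-1} + w_x * x_t + b_h)."""
    hidden_states = []
    h = h0
    for x in inputs:
        # the same weights are reused at every time step
        h = math.tanh(w_h * h + w_x * x + b_h)
        hidden_states.append(h)
    return hidden_states


# hypothetical weights and a three-step input sequence
states = rnn_forward([1.0, 0.5, -1.0], w_h=0.5, w_x=1.0, b_h=0.0)
print(states)  # one hidden state per time step
```

Because the loop simply runs once per element, the same function works for sequences of any length.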

Consider the RNN example:

<img src="https://raw.githubusercontent.com/Dnau15/LabImages/main/images/lab10/RNNexample.png" alt="RNN example" width="800"/>

The network contains 1 loop, with w1=1.8, b1=0, w2=-0.5, w3=1.1, and b2=0. Suppose we have a sequence of 3 elements (measurements from the day before yesterday, yesterday, and today) and want to predict tomorrow's value. The unrolled network then consists of 3 iterations:

<img src="https://raw.githubusercontent.com/Dnau15/LabImages/main/images/lab10/RNNunrolled.png" alt="RNN unrolled" width="800"/>

Instead of evaluating the output for the first 2 measurements, we pass their intermediate values on to the next iteration, so previous measurements affect the later ones.

Check [this video](https://www.youtube.com/watch?v=AsNTP8Kwu80) with this example.
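The unrolled computation for this example can be sketched as follows. The weights are the ones listed above; the wiring (input weight `w1`, feedback weight `w2`, output weight `w3`), the ReLU activation, and the three input values are assumptions made for illustration, since the text does not spell them out.

```python
def relu(z):
    return max(0.0, z)


def predict_next(values, w1=1.8, b1=0.0, w2=-0.5, w3=1.1, b2=0.0):
    """Run the one-unit RNN over `values` and return the final prediction.

    Assumed wiring: h_t = relu(w1 * x_t + w2 * h_{t-1} + b1), y = w3 * h_T + b2.
    """
    h = 0.0  # no feedback before the first measurement
    for x in values:
        h = relu(w1 * x + w2 * h + b1)
    # only the last hidden value is turned into an output
    return w3 * h + b2


# hypothetical measurements: day before yesterday, yesterday, today
prediction = predict_next([0.0, 0.5, 1.0])
print(round(prediction, 3))  # 1.485
```

Under these assumptions the hidden value goes 0 → 0.9 → 1.35, and the prediction is 1.1 · 1.35 = 1.485.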

### Training

RNNs are typically trained using backpropagation through time (BPTT), an extension of backpropagation that considers the unfolding of the network over time. However, traditional RNNs suffer from the vanishing gradient problem, where gradients diminish exponentially as they propagate backward through time, limiting their ability to capture long-range dependencies in sequences.

### Vanishing/Exploding gradient problem
The more we unroll an RNN, the harder it is to train. Suppose $w_2 = 2$ and the input has 3 values (as in the picture above). After $w_2$ is applied to the 1st input $x_1$, the value becomes $2x_1$. This $2x_1$ is passed to the 2nd iteration (to the sum function), and $4x_1$ is passed to the 3rd iteration. The contribution of $x_1$ to the final output is then $8x_1$.

But what if the input has 50 elements? The value becomes $x_1 \cdot 2^{50}$, which leads to the exploding gradient problem: backpropagation relies on gradients, and a gradient containing such a huge number causes overly large steps in gradient descent.

The opposite situation appears when $w_2 = 0.5$ or an even smaller number. The value becomes $x_1 \cdot 0.5^{50}$, and the steps are too small to reach the optimal solution. This is the vanishing gradient problem.
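The arithmetic above is easy to check numerically. The sketch below only computes how a value scales after repeated multiplication by the feedback weight; the helper name `scale_after` is hypothetical.

```python
def scale_after(weight, steps):
    """Factor by which an input is scaled after `steps` multiplications by `weight`."""
    return weight ** steps


for w in (2.0, 0.5):
    # compare a short sequence (3 steps) with a long one (50 steps)
    print(w, scale_after(w, 3), scale_after(w, 50))
```

For 50 steps the factor is roughly $1.1 \cdot 10^{15}$ with a weight of 2 and roughly $8.9 \cdot 10^{-16}$ with a weight of 0.5.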

### Text classification with RNN


In [1]:
# dataset downloading

!wget https://raw.githubusercontent.com/Dnau15/LabImages/main/data/txt_data/1429_1.csv.zip
!unzip 1429_1.csv.zip
!rm 1429_1.csv.zip

--2024-07-12 10:18:53--  https://raw.githubusercontent.com/Dnau15/LabImages/main/data/txt_data/1429_1.csv.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3697835 (3.5M) [application/zip]
Saving to: ‘1429_1.csv.zip’


2024-07-12 10:18:53 (72.5 MB/s) - ‘1429_1.csv.zip’ saved [3697835/3697835]

Archive:  1429_1.csv.zip
  inflating: 1429_1.csv              


In [2]:
import pandas as pd
df = pd.read_csv('1429_1.csv')
df.head()



Unnamed: 0,id,name,asins,brand,categories,keys,manufacturer,reviews.date,reviews.dateAdded,reviews.dateSeen,...,reviews.doRecommend,reviews.id,reviews.numHelpful,reviews.rating,reviews.sourceURLs,reviews.text,reviews.title,reviews.userCity,reviews.userProvince,reviews.username
0,AVqkIhwDv8e3D1O-lebb,"All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...",B01AHB9CN2,Amazon,"Electronics,iPad & Tablets,All Tablets,Fire Ta...","841667104676,amazon/53004484,amazon/b01ahb9cn2...",Amazon,2017-01-13T00:00:00.000Z,2017-07-03T23:33:15Z,"2017-06-07T09:04:00.000Z,2017-04-30T00:45:00.000Z",...,True,,0.0,5.0,http://reviews.bestbuy.com/3545/5620406/review...,This product so far has not disappointed. My c...,Kindle,,,Adapter
1,AVqkIhwDv8e3D1O-lebb,"All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...",B01AHB9CN2,Amazon,"Electronics,iPad & Tablets,All Tablets,Fire Ta...","841667104676,amazon/53004484,amazon/b01ahb9cn2...",Amazon,2017-01-13T00:00:00.000Z,2017-07-03T23:33:15Z,"2017-06-07T09:04:00.000Z,2017-04-30T00:45:00.000Z",...,True,,0.0,5.0,http://reviews.bestbuy.com/3545/5620406/review...,great for beginner or experienced person. Boug...,very fast,,,truman
2,AVqkIhwDv8e3D1O-lebb,"All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...",B01AHB9CN2,Amazon,"Electronics,iPad & Tablets,All Tablets,Fire Ta...","841667104676,amazon/53004484,amazon/b01ahb9cn2...",Amazon,2017-01-13T00:00:00.000Z,2017-07-03T23:33:15Z,"2017-06-07T09:04:00.000Z,2017-04-30T00:45:00.000Z",...,True,,0.0,5.0,http://reviews.bestbuy.com/3545/5620406/review...,Inexpensive tablet for him to use and learn on...,Beginner tablet for our 9 year old son.,,,DaveZ
3,AVqkIhwDv8e3D1O-lebb,"All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...",B01AHB9CN2,Amazon,"Electronics,iPad & Tablets,All Tablets,Fire Ta...","841667104676,amazon/53004484,amazon/b01ahb9cn2...",Amazon,2017-01-13T00:00:00.000Z,2017-07-03T23:33:15Z,"2017-06-07T09:04:00.000Z,2017-04-30T00:45:00.000Z",...,True,,0.0,4.0,http://reviews.bestbuy.com/3545/5620406/review...,I've had my Fire HD 8 two weeks now and I love...,Good!!!,,,Shacks
4,AVqkIhwDv8e3D1O-lebb,"All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...",B01AHB9CN2,Amazon,"Electronics,iPad & Tablets,All Tablets,Fire Ta...","841667104676,amazon/53004484,amazon/b01ahb9cn2...",Amazon,2017-01-12T00:00:00.000Z,2017-07-03T23:33:15Z,"2017-06-07T09:04:00.000Z,2017-04-30T00:45:00.000Z",...,True,,0.0,5.0,http://reviews.bestbuy.com/3545/5620406/review...,I bought this for my grand daughter when she c...,Fantastic Tablet for kids,,,explore42


In [3]:
# set the 0-1-2 system of rating

rating2sentiment = {0.0: 0, 1.0: 0, 2.0: 0, 3.0: 1, 4.0: 2, 5.0: 2}

# .copy() makes df an independent DataFrame, which avoids SettingWithCopyWarning below
df = df[["reviews.text", "reviews.rating"]].dropna().copy()

df["sentiment"] = df["reviews.rating"].apply(lambda x: rating2sentiment[x])
df.head()


Unnamed: 0,reviews.text,reviews.rating,sentiment
0,This product so far has not disappointed. My c...,5.0,2
1,great for beginner or experienced person. Boug...,5.0,2
2,Inexpensive tablet for him to use and learn on...,5.0,2
3,I've had my Fire HD 8 two weeks now and I love...,4.0,2
4,I bought this for my grand daughter when she c...,5.0,2


In [4]:
df.loc[df['sentiment'] <= 1].iloc[0]['reviews.text']

"Didn't have some of the features I was looking for. Returned it the next day. May be good for others"

In [5]:
# Text preprocessing and tokenization

import re
import numpy
from tqdm import tqdm
from torchtext.data import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

tokenizer = get_tokenizer("basic_english")  # basic english tokenizer

def preprocess_text(s: str) -> str:
    """ Function for text preprocessing.
    s.strip(): for removing any leading or trailing whitespaces
    s.lower(): for lowercasing
    re.sub(r"[^a-zA-Z.,!?]+", " ", s): for replacing any symbols except letters, .,!? with a whitespace
    re.sub(r"\s{2,}", " ", s): for replacing two or more consecutive whitespace characters with a single space.
    s.strip(): for removing any new/appeared leading or trailing whitespaces

    Args:
        s (str): string that should be preprocessed

    Returns:
        str: the cleaned string, ready for tokenization
    """
    s = s.strip()
    s = s.lower()
    s = re.sub(r"[^a-zA-Z.,!?]+", " ", s)
    s = re.sub(r"\s{2,}", " ", s)
    s = s.strip()
    return s


def build_vocab(dataset: numpy.ndarray):
    """ Generator that yields the tokens of each text in the dataset.
    A generator is used to save memory.
    Args:
        dataset (numpy.ndarray): array of texts

    Yields:
        list[str]: tokens of one preprocessed text
    """
    for text in tqdm(dataset, desc="Building vocabulary"):
        yield tokenizer(preprocess_text(str(text)))


vocab = build_vocab_from_iterator(
    build_vocab(df["reviews.text"].values),  # iterator
    max_tokens=25000,
    # special tokens: "<UNK>" for unknown words and "<PAD>" for padding sequences
    specials=["<UNK>", "<PAD>"],
    special_first=True,
)
vocab.set_default_index(vocab["<UNK>"])

VOCAB_SIZE = len(vocab)
print("Vocabulary size: ", VOCAB_SIZE)

Building vocabulary: 100%|██████████| 34626/34626 [00:02<00:00, 11772.16it/s]


Vocabulary size:  13457
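To see what the preprocessing does to a raw review, the cleaning steps can be tried in isolation. The sketch below is a standalone copy of the steps in `preprocess_text` (renamed `clean` here), applied to a made-up sample string; the whitespace split is only a rough stand-in for the torchtext tokenizer.

```python
import re


def clean(s: str) -> str:
    """Standalone copy of the steps in preprocess_text, for experimenting."""
    s = s.strip().lower()
    s = re.sub(r"[^a-zA-Z.,!?]+", " ", s)  # keep letters and . , ! ?
    s = re.sub(r"\s{2,}", " ", s)          # collapse repeated whitespace
    return s.strip()


sample = "  Hello, World!! It's 100% GREAT...  "
print(clean(sample))          # hello, world!! it s great...
print(clean(sample).split())  # rough stand-in for the tokenizer
```

Note that digits and apostrophes are replaced by spaces, so "It's" becomes the two tokens "it" and "s".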


In [6]:
# Preparing the data for training. DataLoader creation

import torch
import numpy as np
from torch.utils.data import DataLoader

# defines the number of samples in each batch fed to the model
BATCH_SIZE = 16

# defines the maximum length (in tokens) for each text sequence
SEQUENCE_LENGTH = 15


def text_pipeline(text: str) -> list:
    """ Convert text to a sequence of token IDs using the vocabulary.

    Args:
        text (str): text to be preprocessed

    Returns:
        list: tokens (words) in text are converted to their corresponding integer IDs using the vocabulary (vocab)
        example: [5, 266, 49, 200, 22, 64, 379, 616, 6, 5, 159, 67, 59, 335, 178, 1663]
    """
    return vocab(tokenizer(preprocess_text(text)))


def collate_fn(batch: list) -> tuple[torch.Tensor, torch.Tensor]:
    """Function for the dataloader.
    Pads or truncates sequences to a fixed length and converts them into tensors along with labels.
    Args:
        batch (list): up to BATCH_SIZE elements, each a pair of
        a string (reviews.text) and an integer (corresponding sentiment)

    Returns:
        tuple[torch.Tensor, torch.Tensor]: token IDs of shape (batch, SEQUENCE_LENGTH)
        and the corresponding labels of shape (batch,)
    """
    texts, labels = [], []

    for text, label in batch:
        # text - sentence (reviews.text)
        # label - 'sentiment' column from the dataframe
        text_tokens_ids = text_pipeline(text)
        if len(text_tokens_ids) > SEQUENCE_LENGTH:
            text_tokens_ids = text_tokens_ids[:SEQUENCE_LENGTH]
        elif len(text_tokens_ids) < SEQUENCE_LENGTH:
            text_tokens_ids.extend(
                # "<PAD>" is a special for fulfilling the sequence
                vocab(["<PAD>" for _ in range(SEQUENCE_LENGTH - len(text_tokens_ids))])
            )

        texts.append(text_tokens_ids)
        labels.append(label)

    texts = torch.tensor(texts, dtype=torch.int)
    labels = torch.tensor(labels, dtype=torch.float)
    return texts, labels


data = np.column_stack((df["reviews.text"].values, df["sentiment"].values))
print(data.shape)
dataloader = DataLoader(
    data, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_fn
)

(34626, 2)


In [7]:
# Demonstration of how the collate_fn works

SEQUENCE_LENGTH = 5
example_batch = [['phrase 1', 1], ['hello world', 3], ['the rose is fragnant', 2]]
example_text, example_labels = collate_fn(example_batch)
print(example_text)
print(example_labels)

# collate_fn reads this global at call time, so training uses 100 tokens per text from here on
SEQUENCE_LENGTH = 100

tensor([[ 3004,     1,     1,     1,     1],
        [ 3286,   967,     1,     1,     1],
        [    3, 11932,    11,     0,     1]], dtype=torch.int32)
tensor([1., 3., 2.])
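The same pad-or-truncate logic can be written without torchtext, which makes the roles of `<UNK>` and `<PAD>` easy to see. This is a minimal sketch with a hypothetical toy vocabulary and a hypothetical helper `encode`; the real vocabulary is the one built by `build_vocab_from_iterator`.

```python
def encode(tokens, vocab, length, unk="<UNK>", pad="<PAD>"):
    """Map tokens to IDs, then truncate or pad to exactly `length` IDs."""
    ids = [vocab.get(tok, vocab[unk]) for tok in tokens]  # unknown words -> <UNK>
    ids = ids[:length]                                    # truncate long sequences
    ids += [vocab[pad]] * (length - len(ids))             # pad short sequences
    return ids


# hypothetical toy vocabulary
toy_vocab = {"<UNK>": 0, "<PAD>": 1, "the": 2, "rose": 3, "is": 4}

print(encode(["the", "rose", "is", "fragnant"], toy_vocab, length=5))  # [2, 3, 4, 0, 1]
print(encode(["the"] * 7, toy_vocab, length=5))                        # [2, 2, 2, 2, 2]
```

As in the output above, the misspelled word is not in the vocabulary and maps to ID 0, while the remaining positions are filled with ID 1.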


In [8]:
# Build the RNN

import torch.nn as nn


class RNN(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim, n_layers=1):
        super().__init__()

        # embedding layer
        # transforms token IDs into dense, trainable vectors of fixed size
        # that encode information about each token
        self.embedding = nn.Embedding(input_dim, embedding_dim)

        # LSTM (long short-term memory) layers
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, batch_first=True)

        # output layer producing one raw score (logit) per sentiment class
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, text, h0, c0):
        embedded = self.embedding(text)

        output, (hidden, cell) = self.lstm(embedded, (h0, c0))
        return self.fc(hidden[-1, :, :])

In [9]:
# Variables and model assignment

input_dim = VOCAB_SIZE
embedding_dim = 64
hidden_dim = 32
output_dim = 3
n_layers = 5
model = RNN(input_dim, embedding_dim, hidden_dim, output_dim, n_layers=n_layers)
model

RNN(
  (embedding): Embedding(13457, 64)
  (lstm): LSTM(64, 32, num_layers=5, batch_first=True)
  (fc): Linear(in_features=32, out_features=3, bias=True)
)

In [10]:
# Adding the optimizer, loss criterion, and GPU if available

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()
model = model.to(device)
criterion = criterion.to(device)

In [11]:
# Training function

from tqdm import tqdm


def train(model, dataloader, optimizer, criterion, device):
    # classic training function implementation:
    # model.train(), loop, optimizer.zero_grad(), prediction, loss evaluation, loss.backward(), optimizer.step()
    # h0 and c0 states are defined for the LSTM and created with random values initially.

    epoch_loss = 0

    model.train()

    for text, labels in tqdm(dataloader):
        text = text.to(device)
        labels = labels.to(device).long()

        optimizer.zero_grad()

        h0 = torch.randn(n_layers, text.shape[0], hidden_dim, device=device)
        c0 = torch.randn(n_layers, text.shape[0], hidden_dim, device=device)
        predictions = model(text, h0, c0)
        loss = criterion(predictions, labels)

        epoch_loss += loss.item()

        loss.backward()
        optimizer.step()

    # average loss over all batches
    return epoch_loss / len(dataloader)

In [12]:
epochs = 4

for epoch in range(epochs):
    train_loss = train(model, dataloader, optimizer, criterion, device)
    print(f"Epoch: {epoch}, Train Loss:  {train_loss} ")

100%|██████████| 2165/2165 [00:13<00:00, 157.17it/s]


Epoch: 0, Train Loss:  0.32587369788463894 


100%|██████████| 2165/2165 [00:09<00:00, 218.98it/s]


Epoch: 1, Train Loss:  0.2840579899316159 


100%|██████████| 2165/2165 [00:09<00:00, 216.67it/s]


Epoch: 2, Train Loss:  0.2825955834506428 


100%|██████████| 2165/2165 [00:09<00:00, 216.56it/s]

Epoch: 3, Train Loss:  0.2813812624344787 





In [13]:
text = "This product is so cool"
tokens = text_pipeline(text)
tokens.extend(vocab(["<PAD>" for _ in range(SEQUENCE_LENGTH - len(tokens))]))
tokens = torch.tensor([tokens]).to(device)
h0 = torch.randn(n_layers, tokens.shape[0], hidden_dim, device=device)
c0 = torch.randn(n_layers, tokens.shape[0], hidden_dim, device=device)
predictions = model(tokens, h0, c0)
predictions.argmax(axis=1).item()

2

In [14]:
predictions

tensor([[-1.7203, -0.9273,  2.3646]], device='cuda:0',
       grad_fn=<AddmmBackward0>)
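The values printed above are raw scores (logits), not probabilities. A softmax turns them into probabilities; the sketch below does this in plain Python for the printed values. The class names are assumed from the `rating2sentiment` mapping (0 for low ratings, 1 for a rating of 3, 2 for high ratings).

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [-1.7203, -0.9273, 2.3646]  # values printed by the cell above
probs = softmax(logits)
labels = ["negative", "neutral", "positive"]  # assumed names for classes 0, 1, 2
print({label: round(p, 3) for label, p in zip(labels, probs)})
```

The largest probability belongs to class 2, which matches the `argmax` result of the previous cell.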

## ELMo (Embeddings from Language Models)

### Definition

ELMo is a word embedding technique developed by researchers at the Allen Institute for Artificial Intelligence. Unlike traditional word embeddings, which assign a fixed vector representation to each word in a vocabulary, ELMo generates contextualized word embeddings that capture the meaning of a word based on its surrounding context in a sentence.

[ELMo and biLM. Follow this link for a better understanding](https://sh-tsang.medium.com/review-elmo-deep-contextualized-word-representations-8eb1e58cd25c)

### Architecture

ELMo is based on a bidirectional language model (biLM), typically implemented using a deep neural network such as a bi-directional LSTM. The model is trained on a large corpus of text data: the forward direction predicts the next word given the preceding words, and the backward direction predicts the previous word given the following words. During training, the parameters of the model are adjusted to minimize the prediction error.
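In the notation of the original ELMo paper, for a sentence of $N$ tokens $t_1, \dots, t_N$, the two directions are trained jointly by maximizing the sum of the forward and backward log-likelihoods. The symbols below follow that paper and are not used elsewhere in this lesson:

$$\sum_{k=1}^{N} \Big( \log p(t_k \mid t_1, \dots, t_{k-1}; \Theta_x, \overrightarrow{\Theta}_{LSTM}, \Theta_s) + \log p(t_k \mid t_{k+1}, \dots, t_N; \Theta_x, \overleftarrow{\Theta}_{LSTM}, \Theta_s) \Big)$$

Here $\Theta_x$ (token representation) and $\Theta_s$ (softmax layer) are shared between the two directions, while each direction has its own LSTM parameters.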

<img src="https://raw.githubusercontent.com/Dnau15/LabImages/main/images/lab10/ELMo.png" alt="ELMo architecture" width="1000"/>

### Contextual Embeddings

After training, the hidden states of the bi-directional LSTM are used to generate word embeddings for downstream NLP tasks. These embeddings are contextualized because they capture information about the entire sentence, allowing them to represent the meaning of a word differently depending on its context.
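In the ELMo paper, the embedding of a token for a downstream task is a learned weighted sum of the biLM layer representations, scaled by a factor $\gamma$. A minimal sketch of that combination in plain Python, with hypothetical toy vectors and weights (a real model would use high-dimensional hidden states):

```python
import math


def scalar_mix(layer_vectors, raw_weights, gamma=1.0):
    """Combine per-layer vectors of one token: gamma * sum_j softmax(w)_j * h_j."""
    exps = [math.exp(w) for w in raw_weights]
    total = sum(exps)
    s = [e / total for e in exps]  # softmax-normalized layer weights
    dim = len(layer_vectors[0])
    return [
        gamma * sum(s[j] * layer_vectors[j][d] for j in range(len(s)))
        for d in range(dim)
    ]


# toy representations of one token from three layers (hypothetical numbers)
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = scalar_mix(layers, raw_weights=[0.0, 0.0, 0.0])
print(mixed)  # equal raw weights give the plain average of the layers
```

Learning different weights per task lets a downstream model emphasize whichever layers are most useful for it.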

### Fine-tuning

ELMo embeddings can be fine-tuned for specific tasks by incorporating them into larger neural network architectures and updating their parameters during training on task-specific data. This allows ELMo to adapt to the particular nuances of different tasks and improve performance.

### Implementations and weights

- [Tensorflow1 Hub](https://www.kaggle.com/models/google/elmo/frameworks/tensorFlow1/variations/elmo/versions/1?tfhub-redirect=true) - weights
- [AllenNLP Implementation](https://github.com/allenai/allennlp/blob/main/allennlp/modules/elmo.py) - no longer maintained


## RNN Implementation

> The goal of the following tasks is to build an RNN-based model to classify descriptions from the dataset into categories.

You can try 3 levels of complexity: 9 (l1), 70 (l2), and 219 (l3) categories.

In [15]:
SEED = 42

torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

In [16]:
!wget https://raw.githubusercontent.com/Dnau15/LabImages/main/data/txt_data/DBPEDIA_train.zip
!unzip DBPEDIA_train.zip
!rm DBPEDIA_train.zip

--2024-07-12 10:19:55--  https://raw.githubusercontent.com/Dnau15/LabImages/main/data/txt_data/DBPEDIA_train.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 24145631 (23M) [application/zip]
Saving to: ‘DBPEDIA_train.zip’


2024-07-12 10:19:56 (259 MB/s) - ‘DBPEDIA_train.zip’ saved [24145631/24145631]

Archive:  DBPEDIA_train.zip
  inflating: DBPEDIA_train.csv       


In [17]:
df = pd.read_csv("DBPEDIA_train.csv")
df.head()

Unnamed: 0,text,l1,l2,l3
0,"William Alexander Massey (October 7, 1856 – Ma...",Agent,Politician,Senator
1,Lions is the sixth studio album by American ro...,Work,MusicalWork,Album
2,"Pirqa (Aymara and Quechua for wall, hispaniciz...",Place,NaturalPlace,Mountain
3,Cancer Prevention Research is a biweekly peer-...,Work,PeriodicalLiterature,AcademicJournal
4,The Princeton University Chapel is located on ...,Place,Building,HistoricBuilding


### Task 1. Data Preprocessing

#### Task 1.1
First of all, you have to choose the level of complexity. Assign 'l1', 'l2', or 'l3' to the 'complexity' variable.

Then drop the unneeded columns and encode the text values of the remaining label column as integers. Create a dictionary with the mapping. Optimize the process using the `df[complexity].unique()` method.

In [18]:
complexity = 'l1'

# drop the label columns that are not used at the chosen complexity level
df.drop([c for c in ('l1', 'l2', 'l3') if c != complexity], inplace=True, axis=1)

# find all unique categories and form a dictionary with mapping (starting with 0)
text_categories = df[complexity].unique()
categories = {category: i for i, category in enumerate(text_categories)}
categories

{'Agent': 0,
 'Work': 1,
 'Place': 2,
 'Species': 3,
 'UnitOfWork': 4,
 'Event': 5,
 'SportsSeason': 6,
 'Device': 7,
 'TopicalConcept': 8}
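When inspecting predictions later, it helps to have the inverse mapping from integer IDs back to category names. A small sketch, with the dictionary copied from the output above and hypothetical predicted IDs:

```python
# mapping copied from the output above
categories = {'Agent': 0, 'Work': 1, 'Place': 2, 'Species': 3, 'UnitOfWork': 4,
              'Event': 5, 'SportsSeason': 6, 'Device': 7, 'TopicalConcept': 8}

# inverse mapping, useful for turning predicted class IDs back into names
id2category = {i: name for name, i in categories.items()}

predicted_ids = [2, 0, 1]  # hypothetical model outputs
print([id2category[i] for i in predicted_ids])  # ['Place', 'Agent', 'Work']
```

Because `unique()` returns categories in order of first appearance, the same `categories` dictionary has to be reused for the test set rather than rebuilt from it.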

In [19]:
# map text representations of categories to integers using 'categories' variable
df[complexity] = df[complexity].map(categories)
df.head()

Unnamed: 0,text,l1
0,"William Alexander Massey (October 7, 1856 – Ma...",0
1,Lions is the sixth studio album by American ro...,1
2,"Pirqa (Aymara and Quechua for wall, hispaniciz...",2
3,Cancer Prevention Research is a biweekly peer-...,1
4,The Princeton University Chapel is located on ...,2


In [20]:
assert len(text_categories) == len(df[complexity].unique())
assert len(df.columns) == 2
assert type(df[complexity][0]) == numpy.int64

#### Task 1.2.
Build the vocabulary

In [21]:
# build the vocabulary, assign max_tokens to 60000

vocab = build_vocab_from_iterator(
    build_vocab(df["text"].values),
    max_tokens=60000,
    specials=["<UNK>", "<PAD>"],
    special_first=True,
)
vocab.set_default_index(vocab["<UNK>"])

VOCAB_SIZE = len(vocab)
VOCAB_SIZE

Building vocabulary: 100%|██████████| 90000/90000 [00:12<00:00, 7320.96it/s]


60000

In [22]:
assert VOCAB_SIZE == 60000
assert 0 <= vocab["<UNK>"] <= 1
assert 0 <= vocab["<PAD>"] <= 1
assert vocab['the'] == 2

#### Task 1.3.
Split the data

In [23]:
# use train_test_split from sklearn

from sklearn.model_selection import train_test_split

# fix the random state so the split is reproducible
train_data, val_data = train_test_split(df, random_state=SEED)

In [24]:
assert len(train_data) == 67500

#### Task 1.4.
Build DataLoaders

Assign batch_size to 64 and use the same function for collate_fn parameter

In [25]:
batch_size = 64

train_data = np.column_stack((train_data["text"].values, train_data[complexity].values))
train_dataloader = DataLoader(
    train_data, batch_size=batch_size, shuffle=True, collate_fn=collate_fn
)

val_data = np.column_stack((val_data["text"].values, val_data[complexity].values))
val_dataloader = DataLoader(
    val_data, batch_size=batch_size, shuffle=False, collate_fn=collate_fn
)

In [26]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

In [None]:
assert len(train_dataloader.dataset) == 67500
assert len(val_dataloader.dataset) == 22500

### Task 2. RNN Network
Build a complete RNN model. Use nn.Embedding, a bidirectional nn.LSTM with dropout, a fully connected layer as the last one, and a Sigmoid at the end.

In [27]:
class RNN(nn.Module):
    def __init__(
        self,
        vocab_size,
        embedding_dim,
        hidden_dim,
        output_dim,
        n_layers,
        bidirectional,
        dropout,
    ):

        super(RNN, self).__init__()

        self.embedding = nn.Embedding(vocab_size, embedding_dim)

        self.lstm = nn.LSTM(
            embedding_dim,
            hidden_dim,
            num_layers=n_layers,
            bidirectional=bidirectional,
            dropout=dropout,
            batch_first=True,
        )

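        # the classifier receives the final hidden state of each direction,
        # so its input size depends on the number of directions, not layers
        self.bidirectional = bidirectional
        self.fc = nn.Linear(hidden_dim * (2 if bidirectional else 1), output_dim)
        self.sigmoid = nn.Sigmoid()

    def forward(self, text):
        embedded = self.embedding(text)

        # inputs are padded to a fixed length, so `output` is a regular tensor
        output, (hidden_state, cell_state) = self.lstm(embedded)

        if self.bidirectional:
            # top layer: hidden_state[-2] is forward, hidden_state[-1] is backward
            hidden = torch.cat((hidden_state[-2, :, :], hidden_state[-1, :, :]), dim=1)
        else:
            hidden = hidden_state[-1, :, :]

        dense_outputs = self.fc(hidden)

        # Sigmoid is required by the task statement; note that nn.CrossEntropyLoss
        # expects raw logits, so squashing them into (0, 1) limits how low the loss can go
        outputs = self.sigmoid(dense_outputs)
        return outputs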
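The indexing `hidden_state[-2]` and `hidden_state[-1]` relies on how `nn.LSTM` orders its final hidden states: the first dimension has `num_layers * num_directions` rows, ordered by layer first and direction second. The sketch below only enumerates that layout in plain Python; the helper name is hypothetical.

```python
def hidden_state_layout(num_layers, bidirectional):
    """Describe which (layer, direction) each row of h_n corresponds to."""
    directions = ["forward", "backward"] if bidirectional else ["forward"]
    return [(layer, direction) for layer in range(num_layers) for direction in directions]


layout = hidden_state_layout(num_layers=2, bidirectional=True)
for index, (layer, direction) in enumerate(layout):
    print(index, layer, direction)

# the last two rows belong to the top layer: forward at [-2], backward at [-1]
print(layout[-2], layout[-1])  # (1, 'forward') (1, 'backward')
```

Concatenating those two rows therefore always gives a vector of size `2 * hidden_dim`, regardless of the number of layers.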

### Task 3. Training and Evaluation

#### Task 3.1
Assign constants suitable for your model. Don't forget to make the model bidirectional and apply dropout. Also, add an optimizer and a cross-entropy loss, and move all the components to your device.

In [28]:
EMBEDDING_DIM = 100
HIDDEN_DIM = 64
OUTPUT_DIM = len(text_categories)
NUM_LAYERS = 2
BIDIRECTION = True
DROPOUT = 0.2
model = RNN(
    VOCAB_SIZE, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM, NUM_LAYERS, BIDIRECTION, DROPOUT
)

In [29]:
model = model.to(device)
optimizer = torch.optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()
criterion = criterion.to(device)

#### Task 3.2 
Fill in the train and evaluate functions, following the usual training-loop conventions. You can also use tqdm to visualize the progress.

In [30]:
def train(model, dataloader, optimizer, criterion):

    epoch_loss = 0.0
    model.train()

    for texts, labels in tqdm(dataloader, desc="Train"):
        texts = texts.to(device)
        labels = labels.to(device).long()
        optimizer.zero_grad()
        predictions = model(texts)
        loss = criterion(predictions, labels)
        loss.backward()

        optimizer.step()
        epoch_loss += loss.item()

    return epoch_loss / len(dataloader)

In [31]:
def evaluate(model, dataloader, criterion):

    epoch_loss = 0.0

    model.eval()
    with torch.no_grad():
        for texts, labels in tqdm(dataloader, desc="Valid"):
            texts = texts.to(device)
            labels = labels.to(device).long()
            predictions = model(texts)
            loss = criterion(predictions, labels)
            epoch_loss += loss.item()

    return epoch_loss / len(dataloader)

In [32]:
EPOCHS = 4
for epoch in range(EPOCHS):
    train_loss = train(model, train_dataloader, optimizer, criterion)
    valid_loss = evaluate(model, val_dataloader, criterion)
    print(f"train loss: {train_loss}, valid loss: {valid_loss}")

Train: 100%|██████████| 1055/1055 [00:19<00:00, 53.79it/s]
Valid: 100%|██████████| 352/352 [00:04<00:00, 81.04it/s]


train loss: 1.5880454019347638, valid loss: 1.469332528385249


Train: 100%|██████████| 1055/1055 [00:17<00:00, 60.58it/s]
Valid: 100%|██████████| 352/352 [00:04<00:00, 73.18it/s]


train loss: 1.4557195562886966, valid loss: 1.4543645026331598


Train: 100%|██████████| 1055/1055 [00:17<00:00, 59.18it/s]
Valid: 100%|██████████| 352/352 [00:03<00:00, 91.91it/s]


train loss: 1.437542036359344, valid loss: 1.4472287493673237


Train: 100%|██████████| 1055/1055 [00:20<00:00, 52.50it/s]
Valid: 100%|██████████| 352/352 [00:03<00:00, 90.80it/s]

train loss: 1.4330522197117739, valid loss: 1.4405848946083675
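One way to read these numbers: `nn.CrossEntropyLoss` applies a softmax to its inputs, and the model's Sigmoid keeps every input between 0 and 1. Under that constraint the loss cannot approach zero. The sketch below computes the best achievable value for a given number of classes; the helper name is hypothetical.

```python
import math


def min_cross_entropy(num_classes, low=0.0, high=1.0):
    """Smallest cross-entropy when every score is confined to [low, high].

    Best case: the true class scores `high`, all other classes score `low`.
    """
    true_term = math.exp(high)
    other_terms = (num_classes - 1) * math.exp(low)
    return -math.log(true_term / (true_term + other_terms))


print(round(min_cross_entropy(9), 3))  # 1.372 for the 9 categories of level l1
```

A validation loss of about 1.44 is therefore already close to the floor for this setup, so the small changes between epochs do not necessarily mean the model has stopped improving.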





### Task 4.  Prediction

Apply the data preprocessing to the test set, create a DataLoader for it, build the predict function, generate predictions, and compute the accuracy.

In [33]:
!wget https://raw.githubusercontent.com/Dnau15/LabImages/main/data/txt_data/DBPEDIA_test.zip
!unzip DBPEDIA_test.zip
!rm DBPEDIA_test.zip

--2024-07-12 10:21:45--  https://raw.githubusercontent.com/Dnau15/LabImages/main/data/txt_data/DBPEDIA_test.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10651409 (10M) [application/zip]
Saving to: ‘DBPEDIA_test.zip’


2024-07-12 10:21:45 (159 MB/s) - ‘DBPEDIA_test.zip’ saved [10651409/10651409]

Archive:  DBPEDIA_test.zip
  inflating: DBPEDIA_test.csv        


In [34]:
df = pd.read_csv("DBPEDIA_test.csv")
df.head()

Unnamed: 0,text,l1,l2,l3
0,Dubai Media Incorporated (DMI) is the official...,Agent,Broadcaster,BroadcastNetwork
1,Filippo Maria Galletti (1636–1714) was an Ital...,Agent,Artist,Painter
2,Anime Ganbare Goemon (アニメがんばれゴエモン Anime Ganbar...,Work,Cartoon,Anime
3,Morong 43 is a group of 43 health workers in t...,Agent,Scientist,Medician
4,Aaron Snyder is a resident of Oakdale on the A...,Agent,FictionalCharacter,SoapCharacter


In [35]:
df.drop([c for c in ('l1', 'l2', 'l3') if c != complexity], inplace=True, axis=1)
df[complexity] = df[complexity].map(categories)
df.head()

Unnamed: 0,text,l1
0,Dubai Media Incorporated (DMI) is the official...,0
1,Filippo Maria Galletti (1636–1714) was an Ital...,0
2,Anime Ganbare Goemon (アニメがんばれゴエモン Anime Ganbar...,1
3,Morong 43 is a group of 43 health workers in t...,0
4,Aaron Snyder is a resident of Oakdale on the A...,0


In [None]:
assert len(df.columns) == 2

In [36]:
# labels are not needed for prediction, so zeros are used as placeholders
test_data = np.column_stack((df["text"].values, np.zeros(len(df["text"]))))
# test_data = np.column_stack((df["text"].values, df[complexity].values))

test_dataloader = DataLoader(
    test_data, batch_size=batch_size, shuffle=False, collate_fn=collate_fn
)

In [None]:
assert len(test_dataloader.dataset) == 40000
assert len(test_dataloader.dataset[0]) == 2

In [37]:
def predict(model, dataloader):
    model.eval()
    predictions = []
    with torch.no_grad():
        for texts, _ in tqdm(dataloader):
            texts = texts.to(device)
            preds = model(texts)
            predictions.extend(preds.argmax(axis=1).cpu().tolist())
    return predictions

In [38]:
predictions = predict(model, test_dataloader)

100%|██████████| 625/625 [00:08<00:00, 73.93it/s]


In [None]:
assert len(predictions) == 40000
assert type(predictions[0]) == int

In [39]:
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(df[complexity], predictions)  # (y_true, y_pred)
accuracy

0.859425
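A single accuracy number can hide large differences between categories. The sketch below computes accuracy by hand and adds a per-class breakdown; the label lists are hypothetical toy data, not the lesson's predictions.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Share of positions where the prediction equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_recall(y_true, y_pred):
    """For each true class, the share of its examples that were predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in sorted(totals)}


# hypothetical labels for six examples
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 0]
print(accuracy(y_true, y_pred))          # 4 of 6 correct
print(per_class_recall(y_true, y_pred))  # {0: 1.0, 1: 0.5, 2: 0.5}
```

Applied to `df[complexity]` and `predictions`, the same breakdown would show which categories the model confuses most often.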

In [None]:
if complexity == 'l1':
    assert accuracy > 0.8
if complexity == 'l2':
    assert accuracy > 0.5
if complexity == 'l3':
    assert accuracy > 0.3

# Conclusion

In this lesson on Natural Language Processing (NLP), we explored several key themes and techniques essential for text analysis and processing.
- We began with an overview of the Recurrent Neural Network (RNN) architecture, highlighting its ability to process sequential data by maintaining a hidden state that captures information from previous time steps. This was exemplified through an RNN model application, demonstrating its utility in various text-based tasks.
- We delved into sentiment analysis, a popular NLP task where RNNs can be effectively used to determine the sentiment expressed in a text. This example illustrated the abilities of RNNs in understanding context and nuances in human language, enabling applications such as customer feedback analysis and social media monitoring.
- We touched on ELMo (Embeddings from Language Models), a significant advancement in generating contextualized word embeddings. ELMo improves upon traditional embeddings by capturing the meaning of words in different contexts, thus enhancing the performance of NLP models on tasks like named entity recognition and sentiment analysis.
- To reinforce the versatility of RNNs, you were given a task involving text classification. Here, an RNN model was trained to classify text descriptions into predefined categories such as "Agent," "Work," and "Place," showcasing how RNNs can be applied to various domains and text classification tasks.

Overall, the lesson highlighted the importance of RNNs and advanced embedding techniques like ELMo in modern NLP, providing a robust foundation for building sophisticated models capable of understanding and processing human language effectively.