<a href="https://colab.research.google.com/github/nosportugal/faast-data-science/blob/main/courses/deep_learning/unit7/solution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Unit 7: Recurrent Neural Networks

By now, you should have the files `labeledTrainData.tsv` and `testData.tsv` in a folder named `ldsa-dl-course-data` in your Google Drive. If you don't, please check the README file of Unit 2 for instructions.

We recommend that you use [Weights & Biases](https://wandb.ai/site) (W&B) to track your experiments. Sign up on W&B with your Google account so that the connection with the Google Colab environment is seamless. Follow the [documentation](https://docs.wandb.ai/guides/integrations/lightning) to integrate W&B with PyTorch Lightning.


## 1) Setup & Installs

In [None]:
! pip install lightning==2.0.1 wandb --quiet

In [None]:
from google.colab import drive

drive.mount("/content/drive")

In [None]:
import wandb

# This will open a window so you can log in to W&B on Google Colab.
# If that doesn't work, set your W&B API key below.
# If you do, remove your key before publishing to GitHub.

# %env WANDB_API_KEY=YOUR_WANDB_API_KEY
wandb.login()
run = wandb.init(project="imdb_sentiment")

## 2) Load the train dataset

Load the train dataset from the tsv files stored in your Google Drive. Split it into train and validation datasets.

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv(
    "/content/drive/My Drive/ldsa-dl-course-data/labeledTrainData.tsv",
    header=0,
    delimiter="\t",
    quoting=3,
)

df_shuffled = df.sample(frac=1, random_state=1).reset_index(drop=True)

df_train = df_shuffled.iloc[:20000]
df_val = df_shuffled.iloc[20000:25000]

## 3) Tokenization

The goal of this section is to transform the data so that each review is split into words and each word is mapped to an integer.
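
To make the idea concrete, here is a minimal pure-Python sketch of a word-to-integer mapping. It is not how the Keras tokenizer is implemented; the helper names `build_vocab` and `encode` and the toy texts are made up for illustration, and the sketch assumes that index 0 is kept free for padding and that unknown words are simply dropped.

```python
from collections import Counter


def build_vocab(texts, num_words):
    """Map the `num_words` most frequent words to the integers 1..num_words."""
    counts = Counter(word for text in texts for word in text.lower().split())
    most_common = [word for word, _ in counts.most_common(num_words)]
    # Start at 1 so that 0 stays available as a padding index (assumption).
    return {word: index for index, word in enumerate(most_common, start=1)}


def encode(text, vocab):
    """Replace each known word by its integer; unknown words are dropped."""
    return [vocab[word] for word in text.lower().split() if word in vocab]


toy_texts = ["great great movie", "great movie dull"]
toy_vocab = build_vocab(toy_texts, num_words=2)
toy_encoded = encode("great dull movie", toy_vocab)

print(toy_vocab)
print(toy_encoded)
```

Because the vocabulary is capped at two words, the rare word "dull" has no integer and disappears from the encoded review.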

In [None]:
import tensorflow as tf
import numpy as np

In [None]:
# We use the Keras text Tokenizer, as it is quite simple to use for this purpose.
# Note that this is a pre-processing step, which we can decouple from the model.
# Keras is being used in a way similar to how sklearn was used in previous
# units, simply as a means of doing data processing. The model side still uses
# PyTorch only. Other alternatives for the processing are spaCy and torchtext.

tokenizer = tf.keras.preprocessing.text.Tokenizer(
    num_words=10000, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True
)

# The tokenizer looks at all the words that exist in the training dataset, and
# assigns an integer to each unique word (up to the 10000 most common)
tokenizer.fit_on_texts(df_train["review"])

In [None]:
# This function first transforms each review into a list of integers. It then
# pads every sequence with NaN up to `max_seq_len` so that all rows have the
# same length. The model uses the NaN entries to recover the true lengths.
def tokenize_to_array(texts, max_seq_len):
    tokenized_texts = tokenizer.texts_to_sequences(texts)

    X = np.full((len(texts), max_seq_len), np.nan)

    for i, tokenized_text in enumerate(tokenized_texts):
        # Truncate reviews longer than max_seq_len (possible on unseen data).
        tokenized_text = tokenized_text[:max_seq_len]
        X[i, : len(tokenized_text)] = tokenized_text

    return X

In [None]:
# Discover the length (in tokens) of the longest review in the full labeled
# dataset, i.e. the train and validation splits together.
all_texts = df["review"]
max_len = max(len(seq) for seq in tokenizer.texts_to_sequences(all_texts))
print(f"Max. sequence length on labeled dataset: {max_len}")

In [None]:
max_seq_len = 2200  # Add an extra margin.

X_train = tokenize_to_array(df_train["review"], max_seq_len)
X_val = tokenize_to_array(df_val["review"], max_seq_len)

In [None]:
# Shape and contents of the tokenized training array (one row per review).
print(X_train.shape)
print(X_train)

## 4) Data loader

Create a PyTorch `Dataset` and the corresponding `DataLoader` for the train and validation datasets.

In [None]:
import lightning as L
import torch

In [None]:
from torch.utils.data import Dataset, DataLoader


class TextDataset(Dataset):
    def __init__(self, X, y):
        self.features = torch.tensor(X, dtype=torch.float32)
        self.labels = torch.tensor(y, dtype=torch.int64)

    def __getitem__(self, index):
        x = self.features[index]
        y = self.labels[index]
        return x, y

    def __len__(self):
        return self.labels.shape[0]

In [None]:
train_ds = TextDataset(X_train, df_train["sentiment"].values)

train_loader = DataLoader(
    dataset=train_ds,
    batch_size=32,
    shuffle=True,
)

In [None]:
val_ds = TextDataset(X_val, df_val["sentiment"].values)

val_loader = DataLoader(
    dataset=val_ds,
    batch_size=32,
    shuffle=False,  # No need to shuffle validation data.
)

In [None]:
# Fetch a single batch to inspect its shape.
features, class_labels = next(iter(train_loader))

In [None]:
features.shape
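
As a quick illustration of what a `DataLoader` yields, the sketch below batches a toy dataset. It uses `TensorDataset` instead of the `TextDataset` class above to stay self-contained, and all `toy_*` names and values are made up for the example.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Five toy "reviews", each already tokenized and padded to length 4.
toy_features = torch.arange(20, dtype=torch.float32).reshape(5, 4)
toy_labels = torch.tensor([0, 1, 0, 1, 1])

toy_ds = TensorDataset(toy_features, toy_labels)
toy_loader = DataLoader(toy_ds, batch_size=2, shuffle=False)

# Collect the shape of every (features, labels) batch.
batch_shapes = [(tuple(f.shape), tuple(l.shape)) for f, l in toy_loader]
print(batch_shapes)
```

With five samples and a batch size of 2, the last batch holds a single sample unless `drop_last=True` is passed to the `DataLoader`.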

## 5) Model definition

Define a PyTorch model and the corresponding PyTorch Lightning module.

In [None]:
class RNNWithEmbeddings(torch.nn.Module):
    def __init__(self, num_words, embed_dim, rnn_hidden_size, rnn_layers, num_classes):
        super().__init__()

        self.embedding_layer = torch.nn.Embedding(num_words, embed_dim)
        self.lstm = torch.nn.LSTM(
            embed_dim,
            rnn_hidden_size,
            num_layers=rnn_layers,
            bidirectional=True,
            batch_first=True,
        )
        self.linear = torch.nn.Linear(2 * rnn_hidden_size, num_classes)

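    def forward(self, x):
        # Padded positions are marked with NaN; count the real tokens per row.
        # `pack_padded_sequence` expects the lengths on the CPU.
        mask = torch.isnan(x)
        lengths = torch.sum(~mask, dim=1).to(device="cpu")

        # Replace NaN by index 0 before casting to the integer ids that the
        # embedding layer expects.
        x = torch.nan_to_num(x, nan=0.0).to(torch.int32)
        embeddings = self.embedding_layer(x)

        # Pack the batch so that the LSTM skips the padded positions.
        packed_embeddings = torch.nn.utils.rnn.pack_padded_sequence(
            embeddings,
            lengths,
            batch_first=True,
            enforce_sorted=False,
        )
        rnn_packed_embeddings, _ = self.lstm(packed_embeddings)

        # Unpack, marking the padded positions with NaN again.
        rnn_unpacked_embeddings, _ = torch.nn.utils.rnn.pad_packed_sequence(
            rnn_packed_embeddings,
            batch_first=True,
            padding_value=float("nan"),
        )

        # Average the LSTM outputs over the real time steps only.
        avg_embeddings = torch.nanmean(rnn_unpacked_embeddings, dim=1)
        logits = self.linear(avg_embeddings)
        return logits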


pytorch_model = RNNWithEmbeddings(
    num_words=10_000,
    embed_dim=128,
    rnn_hidden_size=64,
    rnn_layers=1,
    num_classes=2,
)
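
The `forward` method above relies on NaN entries to tell padding from real tokens. The following sketch walks through the same steps on a toy batch, assuming NaN-padded inputs; the tensor values and layer sizes are arbitrary and chosen only for illustration.

```python
import torch

# Toy batch: two sequences of token ids, NaN-padded to length 4.
nan = float("nan")
x = torch.tensor([[5.0, 2.0, 7.0, nan], [3.0, nan, nan, nan]])

mask = torch.isnan(x)  # True where the input is padding
lengths = torch.sum(~mask, dim=1)  # number of real tokens per row

# Replace NaN by 0 before casting to integer ids for the embedding layer.
ids = torch.nan_to_num(x, nan=0.0).to(torch.int64)

embedding = torch.nn.Embedding(10, 4)
lstm = torch.nn.LSTM(4, 3, batch_first=True, bidirectional=True)

packed = torch.nn.utils.rnn.pack_padded_sequence(
    embedding(ids), lengths, batch_first=True, enforce_sorted=False
)
packed_out, _ = lstm(packed)
out, out_lengths = torch.nn.utils.rnn.pad_packed_sequence(
    packed_out, batch_first=True, padding_value=nan
)

# Average over time while ignoring the NaN-marked padded steps.
pooled = torch.nanmean(out, dim=1)

print(lengths)
print(out.shape, pooled.shape)
```

Note that `pad_packed_sequence` pads only up to the longest sequence in the batch (3 here, not 4), and that the second row's pooled vector equals its single real time step because `torch.nanmean` skips the NaN positions.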

In [None]:
import torch.nn.functional as F
import torchmetrics

In [None]:
class LightningModel(L.LightningModule):
    def __init__(self, model, learning_rate):
        super().__init__()
        self.save_hyperparameters()

        self.learning_rate = learning_rate
        self.model = model

        self.train_acc = torchmetrics.Accuracy(task="multiclass", num_classes=2)
        self.val_acc = torchmetrics.Accuracy(task="multiclass", num_classes=2)
        self.test_acc = torchmetrics.Accuracy(task="multiclass", num_classes=2)

    def forward(self, x):
        return self.model(x)

    def _shared_step(self, batch):
        features, true_labels = batch
        logits = self(features)

        loss = F.cross_entropy(logits, true_labels)
        predicted_labels = torch.argmax(logits, dim=1)
        return loss, true_labels, predicted_labels

    def training_step(self, batch, batch_idx):
        loss, true_labels, predicted_labels = self._shared_step(batch)

        self.log("train_loss", loss)
        self.train_acc(predicted_labels, true_labels)
        self.log(
            "train_acc", self.train_acc, prog_bar=True, on_epoch=True, on_step=False
        )
        return loss

    def validation_step(self, batch, batch_idx):
        loss, true_labels, predicted_labels = self._shared_step(batch)

        self.log("val_loss", loss, prog_bar=True)
        self.val_acc(predicted_labels, true_labels)
        self.log("val_acc", self.val_acc, prog_bar=True)

    def test_step(self, batch, batch_idx):
        loss, true_labels, predicted_labels = self._shared_step(batch)
        self.test_acc(predicted_labels, true_labels)
        self.log("test_acc", self.test_acc)

    def configure_optimizers(self):
        optimizer = torch.optim.AdamW(self.parameters(), lr=self.learning_rate)
        return optimizer
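
To see what `_shared_step` computes, here is a small worked example of cross-entropy loss and `argmax` predictions on hand-picked logits. The numbers are arbitrary, and the manual loss is included only to show that `F.cross_entropy` averages the negative log-softmax of the true class.

```python
import torch
import torch.nn.functional as F

# Three toy samples with two class scores (logits) each.
logits = torch.tensor([[2.0, 0.0], [0.0, 1.0], [1.5, 0.5]])
true_labels = torch.tensor([0, 1, 1])

loss = F.cross_entropy(logits, true_labels)

# Same quantity by hand: mean negative log-probability of the true class.
log_probs = torch.log_softmax(logits, dim=1)
manual_loss = -log_probs[torch.arange(3), true_labels].mean()

# The predicted class is the one with the highest logit.
predicted_labels = torch.argmax(logits, dim=1)
accuracy = (predicted_labels == true_labels).float().mean()

print(loss.item(), manual_loss.item())
print(predicted_labels.tolist(), accuracy.item())
```

The third sample is misclassified (class 0 has the larger logit, but the true label is 1), so the accuracy on this toy batch is two out of three.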

## 6) Model training

Train your model using a Lightning trainer.

In [None]:
from lightning import Trainer
from lightning.pytorch.callbacks import ModelCheckpoint
from lightning.pytorch.loggers import WandbLogger

In [None]:
lightning_model = LightningModel(model=pytorch_model, learning_rate=0.0006)

callbacks = [
    ModelCheckpoint(save_top_k=1, mode="max", monitor="val_acc", save_last=True)
]

wandb_logger = WandbLogger(
    project="imdb_sentiment",
    log_model="all",
    group="unit7",
)

trainer = Trainer(
    callbacks=callbacks,
    max_epochs=30,
    accelerator="auto",
    logger=wandb_logger,
    deterministic=True,
)

trainer.fit(
    model=lightning_model,
    train_dataloaders=train_loader,
    val_dataloaders=val_loader,
)

wandb.finish()

## 7) Inference

Load the test dataset from the tsv file stored in your Google Drive and the model from the checkpoints you created on W&B. Finally, perform inference with the model on the test dataset.

In [None]:
df_test = pd.read_csv(
    "/content/drive/My Drive/ldsa-dl-course-data/testData.tsv",
    header=0,
    delimiter="\t",
    quoting=3,
)

X_test = tokenize_to_array(df_test["review"], max_seq_len)

In [None]:
class InferenceTextDataset(Dataset):
    def __init__(self, X):
        self.features = torch.tensor(X, dtype=torch.float32)

    def __getitem__(self, index):
        return self.features[index]

    def __len__(self):
        return self.features.shape[0]

In [None]:
test_ds = InferenceTextDataset(X_test)

test_loader = DataLoader(
    dataset=test_ds,
    batch_size=32,
    shuffle=False,
)

In [None]:
# Define checkpoint reference.
checkpoint_reference = "[USERNAME]/imdb_sentiment/model-[MODEL_ID]:best"

# The training run was closed by `wandb.finish()`, so start a new run to
# fetch the artifact.
run = wandb.init(project="imdb_sentiment")

# Download checkpoint locally (if not already cached).
artifact = run.use_artifact(checkpoint_reference, type="model")
artifact_dir = artifact.download()

# Load checkpoint.
model = LightningModel.load_from_checkpoint(str(artifact_dir) + "/model.ckpt")

In [None]:
batch_outputs = trainer.predict(model=model, dataloaders=test_loader)
logits = torch.cat(batch_outputs)
predicted_labels = torch.argmax(logits, dim=1)

In [None]:
wandb.finish()

## 8) Post-process for Kaggle submission

Assuming the predicted class labels are stored in `predicted_labels` (as a Torch tensor), create a csv file ready for submission on Kaggle.

In [None]:
# Convert the Torch tensor to a NumPy array so pandas stores plain integers.
output = pd.DataFrame(
    data={"id": df_test["id"], "sentiment": predicted_labels.cpu().numpy()}
)

In [None]:
output.to_csv("output.csv", index=False, quoting=3)