# Task definition
Implement an LSTM sentiment tagger for the IMDb reviews dataset.

1. (5pt) Fill in the missing code below.
    * 1pt: implement vectorization
    * 2pt: implement the \_\_init\_\_ and forward methods of the models
    * 2pt: implement the collate function
2. (4pt) Implement the training loop and choose a proper loss function. Use ClearML for max points.
    * 2pt is the baseline for well-written, working code
    * 2pt if ClearML is used properly
3. (3pt) Train the models (find proper hyperparameters). Make sure you are neither overfitting nor underfitting. Visualize the training of your best model (plot training and test loss/accuracy over time). Your model should reach at least 87% accuracy; for max points it should exceed 89%.
    * 1pt for accuracy above 89%
    * 1pt for accuracy above 87%
    * 1pt for visualizations

Remarks:
* Use embeddings of size 50.
* Use a 0.5 threshold when computing accuracy.
* Use the supplied dataset for training and evaluation.
* You do not have to use a validation set.
* You should monitor overfitting during training.
* For max points, use ClearML to store and manage logs from your experiments.
* We encourage you to use the PyTorch Lightning library (an additional point is awarded for using it; however, the total must not exceed 12).

[Clear ML documentation](https://clear.ml/docs/latest/docs/)

[Clear ML notebook exercise from bootcamp](https://colab.research.google.com/drive/1wtLb4gg8beLS7smcyJlOZppn6_rQvSxL?usp=sharing)
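
The 0.5 threshold from the remarks can be illustrated with a small, framework-free sketch. It assumes the model outputs raw logits, so a sigmoid is applied before thresholding. The helper names `sigmoid` and `thresholded_accuracy` and the sample values are made up for illustration and are not part of the assignment code.

```python
import math


def sigmoid(z):
    """Logistic function mapping a raw score (logit) to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def thresholded_accuracy(logits, labels, threshold=0.5):
    """Fraction of examples whose thresholded probability matches the 0/1 label."""
    preds = [1 if sigmoid(z) > threshold else 0 for z in logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return correct / len(labels)


# Hypothetical logits and labels, for illustration only.
logits = [2.0, -1.0, 0.3, -0.2]
labels = [1, 0, 0, 0]
print(thresholded_accuracy(logits, labels))  # 0.75 (the third example is misclassified)
```

Because sigmoid(z) > 0.5 exactly when z > 0, thresholding probabilities at 0.5 is equivalent to checking the sign of the logit.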

In [20]:
!pip install clearml
!pip install pytorch-lightning

import nltk

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [21]:
import os
from collections import defaultdict

import numpy as np
import pandas as pd
from tqdm.auto import tqdm
import torchtext
from clearml import Task

import torch
from torch import nn
from torch import optim

from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

from nltk.corpus import stopwords

from pytorch_lightning import Trainer, LightningModule
from pytorch_lightning.callbacks.early_stopping import EarlyStopping
from pytorch_lightning.callbacks.model_checkpoint import ModelCheckpoint
from pytorch_lightning.callbacks import LearningRateMonitor
from torch.nn import functional as F
from torchmetrics.functional import accuracy
import torchmetrics

import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

In [22]:

web_server = 'https://app.community.clear.ml'
api_server = 'https://api.community.clear.ml'
files_server = 'https://files.community.clear.ml'
access_key = '<YOUR_CLEARML_ACCESS_KEY>'  #@param {type:"string"}
secret_key = '<YOUR_CLEARML_SECRET_KEY>'  #@param {type:"string"}

Task.set_credentials(web_host=web_server,
                     api_host=api_server,
                     files_host=files_server,
                     key=access_key,
                     secret=secret_key)
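
One way to keep keys out of the notebook is to read them from environment variables. The sketch below is only an illustration: `load_clearml_credentials` is a hypothetical helper, and the variable names `CLEARML_API_ACCESS_KEY` and `CLEARML_API_SECRET_KEY` are assumed to match the names ClearML itself uses (check the ClearML documentation before relying on them).

```python
import os


def load_clearml_credentials(env=None):
    """Collect ClearML credentials from environment variables.

    Raises a KeyError naming the first missing variable.
    """
    env = os.environ if env is None else env
    return {
        "key": env["CLEARML_API_ACCESS_KEY"],
        "secret": env["CLEARML_API_SECRET_KEY"],
    }


# Demo with a fake environment; real values would come from os.environ.
fake_env = {"CLEARML_API_ACCESS_KEY": "my-access-key",
            "CLEARML_API_SECRET_KEY": "my-secret-key"}
creds = load_clearml_credentials(fake_env)
print(sorted(creds))
# The result could then be passed on, e.g.:
# Task.set_credentials(web_host=web_server, api_host=api_server,
#                      files_host=files_server, **creds)
```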

In [23]:
!pip install gdown
!gdown https://drive.google.com/uc?id=1hK-3iiRPlbePb99Fe-34LJNZ5yB-nduq
!tar -xvzf imdb_dataset.gz
data = pd.read_csv("imdb_dataset.csv")

Downloading...
From: https://drive.google.com/uc?id=1hK-3iiRPlbePb99Fe-34LJNZ5yB-nduq
To: /content/imdb_dataset.gz
100% 77.0M/77.0M [00:01<00:00, 62.8MB/s]
imdb_dataset.csv


In [24]:
def preprocess(data):
    # remove URLs
    data['tokenized'] = data['tokenized'].replace({r'http\S+': '', r'www\S+': ''},
                                                  regex=True)

    # remove non-English characters
    # data['tokenized'] = data['tokenized'].replace({'[^a-zA-Z]': ' '}, regex=True)

    # remove stop words
    english_stop_words = set(stopwords.words('english'))
    data['tokenized'] = data['tokenized'].apply(
        lambda rev: [word for word in rev.split() if word not in english_stop_words])

    # rejoin words
    data['tokenized'] = data['tokenized'].apply(lambda rev: ' '.join(rev))


# preprocess(data)
data['tokenized'].iloc[0]

"gary cooper , ( michael brandon ) played the role as an american millionaire who had seven bad marriages , but always divorced his wife 's with plenty of money to live on . michael is in paris on business and goes into a french department store to buy a pair of pajama tops and the sales people refuse to sell him just the tops , he has to buy the bottoms or there is no sale . nicole deloiselle , ( claudette colbert ) listens to this conversation and offers to buy the bottom of these pajama 's . michael becomes very interested in nicole and they have occasion to meet and go on dates . it is not too long before michael proposes marriage to nicole and she is very taken back with his request for marriage since she really does not know him very well . however , once she finds out she is going to become the eighth wife of michael she begins to change her mind and this story becomes quite entertaining and funny . do n't miss this film , it is great entertainment by great veteran actors . enjo

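
To make the effect of `preprocess` concrete, here is a minimal sketch of the same two steps (URL removal, then stop-word filtering) applied to a single string. It uses `re` and a tiny hand-written stop-word set so that it runs without pandas or nltk; `clean_review`, `STOP_WORDS`, and the sample sentence are illustrative assumptions.

```python
import re

# Tiny illustrative stop-word list; the notebook uses nltk's English list instead.
STOP_WORDS = {"the", "is", "a", "at", "and"}


def clean_review(text, stop_words=STOP_WORDS):
    """Strip URLs, then drop stop words from a whitespace-tokenized review."""
    text = re.sub(r"http\S+|www\S+", "", text)
    return " ".join(word for word in text.split() if word not in stop_words)


print(clean_review("the film is great , see http://example.com and www.example.org"))
# film great , see
```

Note that the call to `preprocess(data)` is commented out in the cell above, so the models in this notebook are trained on the unfiltered text.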
In [25]:
PADDING_VALUE = 0


class NaiveVectorizer:
    def __init__(self, tokenized_data, **kwargs):
        """Converts data from strings to vectors of ints that represent words.
        Prepares a lookup dict (self.wv) that maps each token to an int. Index 0 is reserved for padding.
        """
        tokenized_data = [seq.split() for seq in tokenized_data]
        ### Your code goes here ###
        flatten_data = [w for seq in tokenized_data for w in seq]
        # Sort the vocabulary so that the token -> index mapping is reproducible across runs.
        self.wv = {w: i for i, w in enumerate(['<pad>'] + sorted(set(flatten_data)))}
        ##################################

    def vectorize(self, tokenized_seq):
        """Converts a sequence of tokens into a sequence of indices.
        If a token does not appear in the vocabulary (self.wv), it is omitted.
        Returns a torch tensor of shape (seq_len,) and type long."""
        ### Your code goes here ###
        return torch.LongTensor([self.wv[key] for key in tokenized_seq if key in self.wv])
        ##################################

    def get_vocab_size(self):
        return len(self.wv)


class ImdbDataset(Dataset):
    SPLIT_TYPES = ["train", "test", "unsup"]

    def __init__(self, data, preprocess_fn, split="train"):
        super(ImdbDataset, self).__init__()
        if split not in self.SPLIT_TYPES:
            raise ValueError(f"No such split type: {split}")

        self.split = split
        self.label = [i for i, c in enumerate(data.columns) if c == "sentiment"][0]
        self.data_col = [i for i, c in enumerate(data.columns) if c == "tokenized"][0]
        self.data = data[data["split"] == self.split]
        self.preprocess_fn = preprocess_fn

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        seq = self.preprocess_fn(self.data.iloc[idx, self.data_col].split())
        label = self.data.iloc[idx, self.label]
        return (seq, label)


naive_vectorizer = NaiveVectorizer(data.loc[data["split"] == "train", "tokenized"])


def get_datasets():
    train_dataset = ImdbDataset(data, naive_vectorizer.vectorize)
    test_dataset = ImdbDataset(data, naive_vectorizer.vectorize, split="test")

    return train_dataset, test_dataset


def custom_collate_fn(pairs):
    """This function is supposed to be used by the DataLoader to prepare batches.
    Input: list of tuples (sequence, label)
    Output: sequences_padded_to_the_same_length, original_lengths_of_sequences, labels.
    torch.nn.utils.rnn.pad_sequence might be useful here.
    """
    ### Your code goes here ###
    sequence, labels = list(zip(*pairs))
    # Keep the batch on the CPU; Lightning moves it to the training device.
    labels = torch.tensor(labels).float()
    # pack_padded_sequence expects integer lengths.
    lengths = torch.tensor([len(seq) for seq in sequence], dtype=torch.long)
    seqcs = pad_sequence(sequence, batch_first=True, padding_value=PADDING_VALUE)
    #################################
    return seqcs, lengths, labels
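
The vectorize-then-pad pipeline above can be traced by hand on a toy corpus. The following sketch mirrors the logic with plain Python lists instead of tensors; `build_vocab`, `vectorize`, and `pad_batch` are illustrative stand-ins, not the notebook's classes, and the two sentences are made up.

```python
PAD = 0  # index reserved for padding, as in the notebook


def build_vocab(texts):
    """Map every token seen in `texts` to an index >= 1 (0 is kept for padding)."""
    tokens = sorted({tok for text in texts for tok in text.split()})
    return {tok: idx for idx, tok in enumerate(tokens, start=1)}


def vectorize(vocab, text):
    """Convert tokens to indices, silently skipping out-of-vocabulary tokens."""
    return [vocab[tok] for tok in text.split() if tok in vocab]


def pad_batch(sequences, pad_value=PAD):
    """Right-pad sequences to the longest one; also return the original lengths."""
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths)
    padded = [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]
    return padded, lengths


vocab = build_vocab(["good movie", "bad movie indeed"])
batch = [vectorize(vocab, t) for t in ["good movie", "bad unknown movie indeed"]]
padded, lengths = pad_batch(batch)
print(vocab)    # {'bad': 1, 'good': 2, 'indeed': 3, 'movie': 4}
print(padded)   # [[2, 4, 0], [1, 4, 3]]
print(lengths)  # [2, 3]
```

The token `unknown` never appeared when the vocabulary was built, so it is dropped, and the recorded length reflects the sequence after that omission.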

In [26]:
"""Implement LSTMSentimentTagger.
The model should use an LSTM module.
Use torch.nn.utils.rnn.pack_padded_sequence to optimize processing of sequences.
When computing the vocab_size of the embedding layer, remember that the padding symbol counts toward the vocab.
Use a sigmoid activation function.
"""


class LSTMSentimentTagger(LightningModule):
    def __init__(self, embedding_dim, hidden_dim, vocab_size, train_set, test_set):
        super(LSTMSentimentTagger, self).__init__()
        self.hidden_dim = hidden_dim

        ### Your code goes here ###
        self.embedding_dim = embedding_dim
        self.layers = 2
        self.vocab_size = vocab_size
        self.threshold = 0.5
        self.train_set = train_set
        self.test_set = test_set
        self.learning_rate = hyperparams['lr']

        self.embedding = nn.Embedding(self.vocab_size, self.embedding_dim,
                                      padding_idx=PADDING_VALUE)
        self.lstm = nn.LSTM(input_size=self.embedding_dim,
                            hidden_size=self.hidden_dim,
                            num_layers=self.layers,
                            dropout=0.5,
                            batch_first=True,
                            bidirectional=True)
        self.dropout = nn.Dropout(p=0.5)
        self.fc = nn.Linear(2 * self.hidden_dim, 1)
        self.activation = nn.Sigmoid()

        self.loss = nn.BCEWithLogitsLoss()
        #################################

    def forward(self, sentence, lengths):
        ### Your code goes here ###
        out = self.embedding(sentence)
        pack_padded = pack_padded_sequence(out, lengths.cpu(), batch_first=True,
                                           enforce_sorted=False)

        # Zero initial states (the LSTM default), created on the same device as the input.
        state_shape = (self.layers * 2, sentence.shape[0], self.hidden_dim)
        hidden = torch.zeros(state_shape, device=sentence.device)
        cell = torch.zeros(state_shape, device=sentence.device)

        _, (hidden, _) = self.lstm(pack_padded, (hidden, cell))
        out = torch.cat([hidden[-2, :, :], hidden[-1, :, :]], dim=1)

        out = self.dropout(out)
        scores = self.fc(F.relu(out))
        #################################
        return scores

    def configure_optimizers(self):
        optimizer = optim.Adam(self.parameters(), lr=self.learning_rate)
        scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer,
                                                         mode='min',
                                                         factor=0.2,
                                                         patience=2,
                                                         min_lr=1e-6,
                                                         verbose=True)
        return {'optimizer': optimizer,
                'lr_scheduler': {'scheduler': scheduler, 'monitor': 'val_loss'}}

    def training_step(self, batch, batch_idx):
        loss, acc = self._shared_eval_step(batch, batch_idx)
        metrics = {'loss': loss, 'train_acc': acc}
        self.log_dict(metrics)
        return metrics

    def validation_step(self, batch, batch_idx):
        loss, acc = self._shared_eval_step(batch, batch_idx)
        metrics = {'val_loss': loss, 'val_acc': acc}
        self.log_dict(metrics)
        return metrics

    def test_step(self, batch, batch_idx):
        loss, acc = self._shared_eval_step(batch, batch_idx)
        metrics = {'test_loss': loss, 'test_acc': acc}
        self.log_dict(metrics)
        return metrics

    def _shared_eval_step(self, batch, batch_idx):
        x, lengths, y = batch
        y_hat = self.forward(x, lengths).squeeze(1)
        loss = self.loss(y_hat, y)
        y_hat = (self.activation(y_hat) > self.threshold).long()
        acc = accuracy(y_hat, y.long())
        return loss, acc

    def train_dataloader(self):
        return DataLoader(self.train_set, batch_size=hyperparams['batch_size'],
                          collate_fn=custom_collate_fn, shuffle=True)

    def test_dataloader(self):
        return DataLoader(self.test_set, batch_size=hyperparams['batch_size'],
                          collate_fn=custom_collate_fn)

    def val_dataloader(self):
        return DataLoader(self.test_set, batch_size=hyperparams['batch_size'],
                          collate_fn=custom_collate_fn)
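
The shape bookkeeping in `forward` is easy to get wrong, so here is a small standalone sketch of the same idea: pack a padded batch, run it through a two-layer bidirectional LSTM, and concatenate the top layer's final forward and backward states. All sizes and the random input are arbitrary values chosen for illustration.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)

batch_size, max_len, emb_dim, hidden_dim, num_layers = 3, 5, 4, 6, 2
embedded = torch.randn(batch_size, max_len, emb_dim)  # stand-in for embedding output
lengths = torch.tensor([5, 2, 3])                     # true lengths, kept on the CPU

lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=num_layers,
               batch_first=True, bidirectional=True)

packed = pack_padded_sequence(embedded, lengths, batch_first=True,
                              enforce_sorted=False)
_, (hidden, _) = lstm(packed)

# hidden has shape (num_layers * 2, batch, hidden_dim); the last two entries
# are the top layer's forward and backward final states.
features = torch.cat([hidden[-2], hidden[-1]], dim=1)
print(hidden.shape)    # torch.Size([4, 3, 6])
print(features.shape)  # torch.Size([3, 12])
```

Because the sequences are packed, the values stored at padded positions never reach the LSTM, which is why the final states depend only on the real tokens of each review.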

# Training loop and visualizations


In [27]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

In [28]:
hyperparams = {
    'batch_size': 128,
    'lr': 1e-2,
    'epochs': 20,
    'embedding_dim': 50,
    'hidden_dim': 100,
    'vocab_size': naive_vectorizer.get_vocab_size()
}
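
With `embedding_dim=50` and `hidden_dim=100`, the size of the recurrent part of the model can be checked by hand. The sketch below counts parameters the way PyTorch lays out an LSTM (four gates, each with input weights, recurrent weights, and two bias vectors); `lstm_param_count` is a hypothetical helper written for this check. The results agree with the `363 K` (LSTM) and `201` (Linear) entries in the model summary printed during training.

```python
def lstm_param_count(input_size, hidden_size, num_layers=1, bidirectional=False):
    """Count trainable parameters of a PyTorch-style LSTM (two bias vectors per gate set)."""
    directions = 2 if bidirectional else 1
    total = 0
    for layer in range(num_layers):
        # Layers above the first receive the concatenated outputs of all directions.
        layer_input = input_size if layer == 0 else hidden_size * directions
        per_direction = 4 * hidden_size * (layer_input + hidden_size + 2)
        total += directions * per_direction
    return total


lstm_params = lstm_param_count(50, 100, num_layers=2, bidirectional=True)
fc_params = 2 * 100 + 1  # Linear(2 * hidden_dim, 1): weights plus one bias
print(lstm_params)  # 363200
print(fc_params)    # 201
```

The embedding table dominates the total, since it holds 50 values for every entry of the vocabulary.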

In [29]:
train_set, test_set = get_datasets()

In [30]:
task = Task.init(project_name='GSN - Homework RNN', task_name='experiment_tracking')

model = LSTMSentimentTagger(hyperparams['embedding_dim'], hyperparams['hidden_dim'],
                            hyperparams['vocab_size'], train_set, test_set).to(device)

trainer = Trainer(
    max_epochs=hyperparams['epochs'],
    auto_lr_find=False,
    devices='auto',
    accelerator='auto',
    callbacks=[
        EarlyStopping(
            monitor='val_loss',
            patience=5
        ),
        ModelCheckpoint(
            monitor='val_acc',
            save_top_k=1,
            verbose=True,
            mode='max'
        ),
        LearningRateMonitor()
    ])

task.connect(hyperparams)

trainer.fit(model)

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name       | Type              | Params
-------------------------------------------------
0 | embedding  | Embedding         | 4.4 M 
1 | lstm       | LSTM              | 363 K 
2 | dropout    | Dropout           | 0     
3 | fc         | Linear            | 201   
4 | activation | Sigmoid           | 0     
5 | loss       | BCEWithLogitsLoss | 0     
-------------------------------------------------
4.7 M     Trainable params
0         Non-trainable params
4.7 M     Total params
18.859    Total estimated model params size (MB)


Epoch 0, global step 195: val_acc reached 0.77416 (best 0.77416), saving model to "/content/lightning_logs/version_2/checkpoints/epoch=0-step=195.ckpt" as top 1

2022-01-15 02:40:50,754 - clearml.frameworks - INFO - Found existing registered model id=136f65286d3b4acaa9241b748ab43949 [/content/lightning_logs/version_2/checkpoints/epoch=0-step=195.ckpt] reusing it.

Epoch 1, global step 391: val_acc reached 0.86680 (best 0.86680), saving model to "/content/lightning_logs/version_2/checkpoints/epoch=1-step=391.ckpt" as top 1

2022-01-15 02:42:20,634 - clearml.frameworks - INFO - Found existing registered model id=e7393c999ad94e7a8be9a6b6311ce63c [/content/lightning_logs/version_2/checkpoints/epoch=1-step=391.ckpt] reusing it.

Epoch 2, global step 587: val_acc was not in top 1

Epoch 3, global step 783: val_acc was not in top 1

Epoch 4, global step 979: val_acc reached 0.86864 (best 0.86864), saving model to "/content/lightning_logs/version_2/checkpoints/epoch=4-step=979.ckpt" as top 1

2022-01-15 02:46:32,246 - clearml.frameworks - INFO - Found existing registered model id=46dd9b5447b14c379482c23c6ee41d4c [/content/lightning_logs/version_2/checkpoints/epoch=4-step=979.ckpt] reusing it.

Epoch 5, global step 1175: val_acc was not in top 1

Epoch     6: reducing learning rate of group 0 to 2.0000e-03.

Epoch 6, global step 1371: val_acc reached 0.87356 (best 0.87356), saving model to "/content/lightning_logs/version_2/checkpoints/epoch=6-step=1371.ckpt" as top 1

Epoch 7, global step 1567: val_acc was not in top 1

In [31]:
# lr_finder = trainer.tuner.lr_find(model)
# lr_finder.suggestion()
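
The log line `reducing learning rate of group 0 to 2.0000e-03` follows directly from the scheduler settings (`factor=0.2`, `min_lr=1e-6`) and the initial rate of `1e-2`. As a rough sketch of the arithmetic only, the hypothetical helper below applies the cut a given number of times; it does not model when a plateau is detected (the `patience` logic).

```python
def lr_after_reductions(initial_lr, factor, num_reductions, min_lr=0.0):
    """Learning rate after `num_reductions` plateau-triggered cuts, clamped at `min_lr`."""
    lr = initial_lr
    for _ in range(num_reductions):
        lr = max(lr * factor, min_lr)
    return lr


# First few values of the schedule used in this notebook.
for k in range(4):
    print(k, lr_after_reductions(1e-2, 0.2, k, min_lr=1e-6))
```

One reduction gives roughly 2e-3, matching the log, and repeated reductions eventually stop at `min_lr`.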

In [44]:
plots_titles = ['loss', 'train_acc', 'val_loss', 'val_acc', 'lr-Adam']
scalars = task.get_reported_scalars()

In [45]:
fig = make_subplots(rows=3, cols=2, subplot_titles=plots_titles)

for idx, title in enumerate(plots_titles):
    x = scalars[title]['version_2: ' + title]['x']
    y = scalars[title]['version_2: ' + title]['y']
    fig.add_trace(go.Scatter(x=x, y=y), row=idx // 2 + 1, col=idx % 2 + 1)
fig.show()
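
The plotting loop hard-codes the `version_2: ` prefix, which changes whenever Lightning creates a new log directory. A possible workaround, sketched here under the assumption that the scalars dictionary always has the nested shape indexed above (title, then series name, then `x`/`y` lists), is to look the series up by suffix. `find_series` and the sample data are hypothetical.

```python
def find_series(scalars, title):
    """Return (x, y) for the series under `title` whose name ends with `title`."""
    for series_name, series in scalars[title].items():
        if series_name.endswith(title):
            return series["x"], series["y"]
    raise KeyError(f"no series ending with {title!r}")


# Hypothetical data shaped like the dictionary indexed in the plotting loop above.
example_scalars = {
    "val_acc": {"version_2: val_acc": {"x": [195, 391], "y": [0.5, 0.75]}},
}
x, y = find_series(example_scalars, "val_acc")
print(x, y, max(y))
```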

In [48]:
print(f'max test acc: {max(scalars["val_acc"]["version_2: val_acc"]["y"]):.3}')

max test acc: 0.874


In [34]:
task.close()