### Salary prediction, episode II: make it actually work (4 points)

Your main task is to use some of the tricks you've learned on the network and analyze if you can improve __validation MAE__. Try __at least 3 options__ from the list below for a passing grade. Write a short report about what you have tried. More ideas = more bonus points. 

__Please be serious:__ plot learning curves in MAE/epoch, compare models based on optimal performance, test one change at a time. You know the drill :)

You can use either __pytorch__ or __tensorflow__ or any other framework (e.g. pure __keras__). Feel free to adapt the seminar code for your needs. For tensorflow version, consider `seminar_tf2.ipynb` as a starting point.


## Load data

In [None]:
import pandas as pd

data = pd.read_csv('comments.tsv', sep='\t')
data.head()

In [None]:
print('Total comments: {}'.format(len(data)))
texts = data['comment_text'].values
target = data['should_ban'].values

In [None]:
from sklearn.model_selection import train_test_split

texts_train, texts_test, target_train, target_test = train_test_split(texts, target, test_size=0.5, random_state=42)
print('Train size: {}'.format(len(texts_train)))
print('Test size: {}'.format(len(texts_test)))

## Tokenization & Embeddings

In [None]:
from nltk.tokenize import TweetTokenizer
import numpy as np

tokenizer = TweetTokenizer()
tokenize_func = lambda x: ' '.join(tokenizer.tokenize(x.lower()))
tokenize_func('Hello, world!')

In [None]:
tokenize_func_vect = np.vectorize(tokenize_func)
print('Before tokenization:')
print(texts_train[:3])
print()
texts_train = tokenize_func_vect(texts_train)
texts_test = tokenize_func_vect(texts_test)
print('After tokenization:')
print(texts_train[:3])
print()

In [None]:
import gensim.downloader 
embeddings = gensim.downloader.load("fasttext-wiki-news-subwords-300")

In [None]:
def embedding_sum(text, print_missing_words=False):
    tokens = text.split()
    embeddings_sum = np.zeros(embeddings.vectors.shape[1])
    for token in tokens:
        try:
            embeddings_sum += embeddings.get_vector(token)
        except KeyError:
            if print_missing_words:
                print(f'Word "{token}" not found in vocabulary')
            pass
    return embeddings_sum

embedding_sum(tokenize_func('Hello, world!'), True).shape

def embedding_stack(text, sentence_len=32, print_missing_words=False):
    tokens = text.split()
    embeddings_stack = np.zeros((embeddings.vectors.shape[1], sentence_len))
    
    if len(tokens) < sentence_len:
        diff = sentence_len - len(tokens)
        tokens = tokens + ['PAD'] * diff
    
    for i, token in enumerate(tokens[:sentence_len]):
        try:
            embeddings_stack[:, i] = embeddings.get_vector(token)
        except KeyError:
            embeddings_stack[:, i] = embeddings.get_vector('UNK')
            if print_missing_words:
                print(f'Word "{token}" not found in vocabulary')
            pass
    return embeddings_stack

embedding_stack(tokenize_func('Hello, world!'), sentence_len=5, print_missing_words=True).shape

In [None]:
avg_sentence_len = int(np.mean([len(text.split()) for text in texts_train]))

print('Before embedding:')
print(texts_train[:3])
print()
texts_train = np.array([embedding_stack(text, sentence_len=avg_sentence_len) for text in texts_train])
texts_test = np.array([embedding_stack(text, sentence_len=avg_sentence_len) for text in texts_test])
print('After embedding:')
print(texts_train[:3].shape)
print()

## CNNs Architecture

We will test the following architectures:

1. Simple CNN with one convolutional layer and one fully-connected layer
2. CNN with two convolutional layers each with a max-pooling layer and one fully-connected layer
3. Parallel CNN with two convolutional layers each with a max-pooling layer, one concatenation and one fully-connected layer

*Sources:*
- https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html
- https://www.kaggle.com/abhishek/approaching-almost-any-nlp-problem-on-kaggle

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.metrics import roc_auc_score, roc_curve

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def to_tensors(data):
    return torch.tensor(data, dtype=torch.float32, device=device)

def make_batches(data, batch_size=None):
    if batch_size is None:
        batch_size = len(data)
    for i in range(0, len(data), batch_size):
        yield to_tensors(data[i:i+batch_size])

In [None]:
def accuracy(model, X, y):
    """
    Calculates the accuracy for a batch of X and y, using model.
    Since the model outputs a single value with sigmoid output, we need to round it to get the predicted class.
    @param model: torch.nn.Module
    @param X: torch.Tensor or numpy.ndarray
    @param y: torch.Tensor or numpy.ndarray
    @return: float
    """
    if isinstance(X, np.ndarray):
        X = to_tensors(X)
    if isinstance(y, np.ndarray):
        y = to_tensors(y)
        
    with torch.no_grad():
        output = model(X)
        if len(output.shape) > 1:
            output = output.squeeze()
        y_pred = torch.round(output)
        return (y_pred == y).float().mean().item()

def train(model, optimizer, criterion, data, batch_size=32, epochs=10):
    """
    Train a model using the given optimizer and criterion,
    using the given data, for the given number of epochs in batches of batch_size.
    @param model: torch.nn.Module
    @param optimizer: torch.optim.Optimizer
    @param criterion: torch.nn.Module
    @param data: tuple - should be [X_train, X_test, y_train, y_test]
    @param batch_size: int
    @param epochs: int
    @return: dict - with keys "loss", "accuracy", "test_loss", "test_accuracy"
    """
    model.train()
    
    X_train, X_test, y_train, y_test = data
    assert X_train.shape[0] == y_train.shape[0], "X_train and y_train must have the same number of rows, keep in mind data should be [X_train, X_test, y_train, y_test]"
    assert X_test.shape[0] == y_test.shape[0], "X_test and y_test must have the same number of rows, keep in mind data should be [X_train, X_test, y_train, y_test]"
    assert X_train.shape[1] == X_test.shape[1], "X_train and X_test must have the same number of columns, keep in mind data should be [X_train, X_test, y_train, y_test]"

    metrics = {
        "loss": [],
        "accuracy": [],
        "test_loss": [],
        "test_accuracy": []
    }

    for epoch in range(epochs):
        epoch_loss = 0
        
        # Train
        for X_batch, y_batch in zip(make_batches(X_train, batch_size), make_batches(y_train, batch_size)):
            optimizer.zero_grad()
            output = model(X_batch)
            loss = criterion(output, y_batch)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
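        # Average the summed loss over the number of training batches
        # (len(data) is the length of the 4-element tuple, not the dataset size)
        epoch_loss /= int(np.ceil(len(X_train) / batch_size))
        metrics["loss"].append(epoch_loss)

        # Switch to evaluation mode (affects batch norm and anything that respects model.training)
        model.eval()
        metrics["accuracy"].append(accuracy(model, X_train, y_train))

        # Test: make_batches without batch_size yields the whole test set as one batch,
        # so the summed loss is already the mean test loss
        epoch_loss = 0
        with torch.no_grad():
            for X_batch, y_batch in zip(make_batches(X_test), make_batches(y_test)):
                output = model(X_batch)
                loss = criterion(output, y_batch)
                epoch_loss += loss.item()

        metrics["test_loss"].append(epoch_loss)
        metrics["test_accuracy"].append(accuracy(model, X_test, y_test))
        model.train()  # back to training mode for the next epoch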

    return metrics

def plot_metrics(metrics):
    """
    Generates plots for a metrics dictionary with the results of training.
    @param metrics: dict - should have keys "loss", "accuracy", "test_loss", "test_accuracy"
    """
    plt.figure(figsize=(10, 5))

    plt.subplot(1, 2, 1)
    plt.title('Loss')
    l = metrics.get('loss', [])
    tl = metrics.get('test_loss', [])
    if len(l) > 0 or len(tl) > 0: #at least one of them has values
        plt.plot(l, label='Train loss')
        plt.plot(tl, label='Test loss')
        plt.legend()

    plt.subplot(1, 2, 2)
    plt.title('Accuracy')
    a = metrics.get('accuracy', [])
    ta = metrics.get('test_accuracy', [])
    if len(a) > 0 or len(ta) > 0: #at least one of them has values
        plt.plot(a, label='Train accuracy')
        plt.plot(ta, label='Test accuracy')
        plt.legend()

def plot_auc(model, X, y, model_name='Model'):
    """
    Calculates the AUC for a batch of X and y, using model.
    @param model: torch.nn.Module
    @param X: torch.Tensor or numpy.ndarray
    @param y: torch.Tensor or numpy.ndarray
    """
    if isinstance(X, np.ndarray):
        X = to_tensors(X)
    if isinstance(y, np.ndarray):
        y = to_tensors(y)
        
    with torch.no_grad():
        output = model(X)
        if len(output.shape) > 1:
            output = output.squeeze()
        y_pred = output.cpu().numpy()
        y_true = y.cpu().numpy()
        fpr, tpr, _ = roc_curve(y_true, y_pred)
        roc_auc = roc_auc_score(y_true, y_pred)

        plt.plot(fpr, tpr, label=f'{model_name} (AUC = {roc_auc:.2f})')
        plt.plot([0, 1], [0, 1], 'k--')
        plt.xlim([-0.05, 1.0])
        plt.ylim([0.0, 1.05])
        plt.xlabel('False Positive Rate')
        plt.ylabel('True Positive Rate')
        plt.legend(loc="lower right")

Simple CNN with one convolutional layer and one fully-connected layer

In [None]:
class SimpleCNN(nn.Module):
    def __init__(self, input_size, sentence_len, hidden_size, filter_size=3):
        super().__init__()
        self.conv1 = nn.Conv1d(input_size, hidden_size, filter_size, padding='same')
        self.fc1 = nn.Linear(hidden_size*sentence_len, 1)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.sigmoid(x)
        # Squeeze only the output dimension so a batch of one keeps shape [1]
        return x.squeeze(-1)

simple_cnn = SimpleCNN(texts_train.shape[1], texts_train.shape[2], 100, 9).to(device)
optimizer = torch.optim.Adam(simple_cnn.parameters(), lr=0.001)
criterion = nn.BCELoss()

metrics = train(simple_cnn, optimizer, criterion, [texts_train, texts_test, target_train, target_test], batch_size=32, epochs=10)


In [None]:
plot_metrics(metrics)

CNN with two convolutional layers each with a max-pooling layer and one fully-connected layer

In [None]:
class CNN(nn.Module):
    def __init__(self, input_size, sentence_len, hidden_size, filter_size=3):
        super().__init__()
        self.batch_norm1 = nn.BatchNorm1d(input_size)
        self.conv1 = nn.Conv1d(input_size, hidden_size, filter_size, padding='same')
        self.pool1 = nn.MaxPool1d(2)
        sentence_len = sentence_len // 2 # because of maxpool
        filter_size = max(3, filter_size // 2) # decreasing filter size
        filter_size = filter_size + 1 if filter_size % 2 == 0 else filter_size # making filter size odd (better for padding='same')
        self.batch_norm2 = nn.BatchNorm1d(hidden_size)
        self.conv2 = nn.Conv1d(hidden_size, hidden_size, filter_size, padding='same')
        self.pool2 = nn.MaxPool1d(2)
        sentence_len = sentence_len // 2 # because of maxpool
        self.fc1 = nn.Linear(hidden_size*sentence_len, 1)

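    def forward(self, x):
        # Functional dropout is active by default, so tie it to the module's train/eval mode
        x = F.dropout1d(x, training=self.training)
        x = self.batch_norm1(x)
        x = self.pool1(F.relu(self.conv1(x)))
        
        x = F.dropout1d(x, training=self.training)
        x = self.batch_norm2(x)
        x = self.pool2(F.relu(self.conv2(x)))
        
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.sigmoid(x)
        # Squeeze only the output dimension so a batch of one keeps shape [1]
        return x.squeeze(-1)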

cnn = CNN(texts_train.shape[1], texts_train.shape[2], 100, 9).to(device)
optimizer = torch.optim.Adam(cnn.parameters(), lr=0.001)
criterion = nn.BCELoss()

metrics = train(cnn, optimizer, criterion, [texts_train, texts_test, target_train, target_test], batch_size=32, epochs=50)

In [None]:
plot_metrics(metrics)

Parallel CNN with two convolutional layers each with a max-pooling layer, one concatenation and one fully-connected layer

In [None]:
class parallellCNN(nn.Module):
    def __init__(self, input_size, sentence_len, hidden_size, filter_size=3):
        super().__init__()
        # First layers of both branches
        self.batch_norm_1 = nn.BatchNorm1d(input_size)
        self.batch_norm_2 = nn.BatchNorm1d(input_size)
        self.conv1_1 = nn.Conv1d(input_size, hidden_size, filter_size, padding='same')
        self.pool1_1 = nn.MaxPool1d(2)
        self.conv1_2 = nn.Conv1d(input_size, hidden_size, filter_size, padding='same')
        self.pool1_2 = nn.MaxPool1d(2)
        sentence_len = sentence_len // 2 # because of maxpool
        filter_size = max(3, filter_size // 2) # decreasing filter size
        filter_size = filter_size + 1 if filter_size % 2 == 0 else filter_size # making filter size odd (better for padding='same')

        # Second layers of both branches
        self.conv2_1 = nn.Conv1d(hidden_size, hidden_size, filter_size, padding='same')
        self.pool2_1 = nn.MaxPool1d(2)
        self.conv2_2 = nn.Conv1d(hidden_size, hidden_size, filter_size, padding='same')
        self.pool2_2 = nn.MaxPool1d(2)
        sentence_len = sentence_len // 2 # because of maxpool

        # Fully connected layer
        self.fc1 = nn.Linear(hidden_size*sentence_len*2, 1)

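    def forward(self, x):
        # Functional dropout is active by default, so tie it to the module's train/eval mode
        x = F.dropout1d(x, training=self.training)
        x1 = self.batch_norm_1(x)
        x2 = self.batch_norm_2(x)
        x1 = F.dropout(x1, training=self.training)
        x2 = F.dropout(x2, training=self.training)
        x1 = self.pool1_1(F.relu(self.conv1_1(x1)))
        x2 = self.pool1_2(F.relu(self.conv1_2(x2)))
        x1 = F.dropout(x1, training=self.training)
        x2 = F.dropout(x2, training=self.training)
        x1 = self.pool2_1(F.relu(self.conv2_1(x1)))
        x2 = self.pool2_2(F.relu(self.conv2_2(x2)))
        x1 = torch.flatten(x1, 1)
        x2 = torch.flatten(x2, 1)
        x = torch.cat((x1, x2), dim=1)
        x = self.fc1(x)
        x = F.sigmoid(x)
        # Squeeze only the output dimension so a batch of one keeps shape [1]
        return x.squeeze(-1)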

par_cnn = parallellCNN(texts_train.shape[1], texts_train.shape[2], 100, 9).to(device)
optimizer = torch.optim.Adam(par_cnn.parameters(), lr=0.001)
criterion = nn.BCELoss()

metrics = train(par_cnn, optimizer, criterion, [texts_train, texts_test, target_train, target_test], batch_size=32, epochs=50)

In [None]:
plot_metrics(metrics)

### ROC curves

In [None]:
plt.figure(figsize=(10, 5))
plot_auc(simple_cnn, texts_test, target_test, 'Simple CNN')
plot_auc(cnn, texts_test, target_test, 'CNN')
plot_auc(par_cnn, texts_test, target_test, 'Parallel CNN')

## RNN Architecture

We will test the following architectures:

1. Simple RNN with one fully-connected layer
2. RNN with two fully-connected layers
3. Parallel RNNs, one left-to-right and one right-to-left (bidirectional RNN), with a final concatenation and one fully-connected layer
4. Parallel LSTMs, one left-to-right and one right-to-left (bidirectional LSTM), with a final concatenation and one fully-connected layer
5. Parallel GRUs, one left-to-right and one right-to-left (bidirectional GRU), with a final concatenation and one fully-connected layer

*Sources:*
- https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html
- https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html

In [1]:
# <CODE-HERE>

## A short report

Please tell us what you did and how it worked.

`<YOUR_TEXT_HERE>`, i guess...

## Recommended options

#### A) CNN architecture

All the tricks you know about dense and convolutional neural networks apply here as well.
* Dropout. Nuff said.
* Batch Norm. This time it's `nn.BatchNorm*`/`L.BatchNormalization`
* Parallel convolution layers. The idea is that you apply several nn.Conv1d to the same embeddings and concatenate output channels.
* More layers, more neurons, ya know...
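
As a minimal sketch of the parallel-convolution idea, assuming PyTorch and the channels-first layout `[batch, emb_dim, time]` used in this notebook: the hypothetical `MultiKernelConv` below applies several `nn.Conv1d` layers with different kernel sizes to the same input and concatenates their output channels. The class name, kernel sizes, and channel counts are illustrative choices, not part of the assignment.

```python
import torch
import torch.nn as nn


class MultiKernelConv(nn.Module):
    """Hypothetical block: parallel Conv1d branches over the same input."""

    def __init__(self, in_channels, out_channels, kernel_sizes=(3, 5, 7)):
        super().__init__()
        # Odd kernel sizes keep padding='same' symmetric.
        self.branches = nn.ModuleList(
            [nn.Conv1d(in_channels, out_channels, k, padding='same') for k in kernel_sizes]
        )

    def forward(self, x):
        # x: [batch, in_channels, time]
        outputs = [torch.relu(branch(x)) for branch in self.branches]
        # Concatenate along the channel axis: [batch, out_channels * n_branches, time]
        return torch.cat(outputs, dim=1)


block = MultiKernelConv(in_channels=300, out_channels=16)
features = block(torch.randn(4, 300, 20))
print(features.shape)
```

Each branch sees a different n-gram width, so the concatenated features mix short and long context. Giving every branch the same kernel size would mostly just widen the layer.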


#### B) Play with pooling

There's more than one way to perform pooling:
* Max over time (independently for each feature)
* Average over time (excluding PAD)
* Softmax-pooling:
$$ out_{i} = \sum_t {h_{i,t} \cdot {{e ^ {h_{i, t}}} \over \sum_\tau e ^ {h_{i, \tau}} } }$$

* Attentive pooling
$$ out_{i} = \sum_t {h_{i,t} \cdot Attn(h_t)}$$

, where $$ Attn(h_t) = {{e ^ {NN_{attn}(h_t)}} \over \sum_\tau e ^ {NN_{attn}(h_\tau)}}  $$
and $NN_{attn}$ is a dense layer.

The optimal score is usually achieved by concatenating several different poolings, including several attentive pooling with different $NN_{attn}$ (aka multi-headed attention).

The catch is that keras layers do not include those toys. You will have to [write your own keras layer](https://keras.io/layers/writing-your-own-keras-layers/). Or use pure tensorflow, it might even be easier :)
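
In PyTorch these poolings take only a few lines each. The sketch below is one possible reading of the formulas above, assuming features shaped `[batch, channels, time]`. The names `softmax_pool` and `AttentivePooling` and the optional boolean `mask` (True for real tokens, False for PAD) are assumptions made for illustration.

```python
import torch
import torch.nn as nn


def softmax_pool(h):
    """Softmax pooling over time. h: [batch, channels, time] -> [batch, channels]."""
    weights = torch.softmax(h, dim=-1)  # one distribution over time per channel
    return (h * weights).sum(dim=-1)


class AttentivePooling(nn.Module):
    """Attentive pooling where NN_attn is a single dense layer."""

    def __init__(self, channels):
        super().__init__()
        self.score = nn.Linear(channels, 1)

    def forward(self, h, mask=None):
        # h: [batch, channels, time]; mask: [batch, time], True for real tokens
        logits = self.score(h.transpose(1, 2)).squeeze(-1)  # [batch, time]
        if mask is not None:
            # Padded positions get zero attention weight after the softmax
            logits = logits.masked_fill(~mask, float('-inf'))
        attn = torch.softmax(logits, dim=-1)                # [batch, time]
        return (h * attn.unsqueeze(1)).sum(dim=-1)          # [batch, channels]


h = torch.randn(4, 100, 20)
mask = torch.ones(4, 20, dtype=torch.bool)
mask[:, 15:] = False  # pretend the last 5 steps are padding

# Concatenate several poolings into one fixed-size vector
pooled = torch.cat(
    [h.max(dim=-1).values, softmax_pool(h), AttentivePooling(100)(h, mask)],
    dim=1,
)
print(pooled.shape)
```

Adding more `AttentivePooling` instances to the concatenation gives the multi-headed variant mentioned above, since each instance learns its own scoring layer.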

#### C) Fun with words

It's not always a good idea to train embeddings from scratch. Here's a few tricks:

* Use pre-trained embeddings from `gensim.downloader.load`. See last lecture.
* Start with pre-trained embeddings, then fine-tune them with gradient descent. You may or may not download pre-trained embeddings from [here](http://nlp.stanford.edu/data/glove.6B.zip) and follow this [manual](https://keras.io/examples/nlp/pretrained_word_embeddings/) to initialize your Keras embedding layer with downloaded weights.
* Use the same embedding matrix in title and desc vectorizer
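
For the fine-tuning option in PyTorch, one possible approach is `nn.Embedding.from_pretrained` with `freeze=False`. The sketch below uses a random matrix as a stand-in for the real vectors (with gensim that would be `embeddings.vectors`). Reserving row 0 for padding and the example token ids are assumptions of this sketch.

```python
import numpy as np
import torch
import torch.nn as nn

# Stand-in for a pre-trained matrix of shape [vocab_size, emb_dim]
rng = np.random.default_rng(0)
pretrained = rng.normal(size=(1000, 300)).astype(np.float32)

# Assumption: prepend an all-zero row so that index 0 can serve as PAD
pad_row = np.zeros((1, 300), dtype=np.float32)
weights = torch.from_numpy(np.vstack([pad_row, pretrained]))

# freeze=False keeps the weights trainable, so the optimizer fine-tunes them
emb = nn.Embedding.from_pretrained(weights, freeze=False, padding_idx=0)

token_ids = torch.tensor([[5, 17, 42, 0, 0]])  # hypothetical ids, padded with 0
vectors = emb(token_ids)                       # [batch, time, emb_dim]
conv_input = vectors.transpose(1, 2)           # [batch, emb_dim, time] for nn.Conv1d
print(conv_input.shape, emb.weight.requires_grad)
```

With this layer inside the model, the inputs become integer token ids rather than pre-computed embedding stacks, and the same `emb` instance can be shared between several text fields.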


#### D) Going recurrent

We've already learned that recurrent networks can do cool stuff in sequence modelling. Turns out, they're not useless for classification as well. With some tricks, of course.

* Like convolutional layers, LSTM should be pooled into a fixed-size vector with some of the poolings.
* Since you know all the text in advance, use bidirectional RNN
  * Run one LSTM from left to right
  * Run another in parallel from right to left 
  * Concatenate their output sequences along unit axis (dim=-1)

* It might be a good idea to mix convolutions and recurrent layers differently for title and description
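
The bidirectional recipe can be sketched with `nn.LSTM(bidirectional=True)`, which runs both directions and concatenates their outputs along the last axis. The hypothetical `BiLSTMClassifier` below assumes the `[batch, emb_dim, time]` input layout used earlier in this notebook and uses max-over-time pooling. The hidden size and the pooling choice are illustrative only.

```python
import torch
import torch.nn as nn


class BiLSTMClassifier(nn.Module):
    """Hypothetical bidirectional LSTM with max-over-time pooling."""

    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True, bidirectional=True)
        # Forward and backward states are concatenated, hence 2 * hidden_size
        self.fc = nn.Linear(2 * hidden_size, 1)

    def forward(self, x):
        # x: [batch, emb_dim, time] -> [batch, time, emb_dim] expected by batch_first LSTM
        x = x.transpose(1, 2)
        out, _ = self.lstm(x)           # [batch, time, 2 * hidden_size]
        pooled = out.max(dim=1).values  # max over time -> [batch, 2 * hidden_size]
        return torch.sigmoid(self.fc(pooled)).squeeze(-1)


model = BiLSTMClassifier(input_size=300, hidden_size=64)
probs = model(torch.randn(8, 300, 20))
print(probs.shape)
```

Swapping `nn.LSTM` for `nn.GRU` or `nn.RNN` needs no other changes here, because all three return the output sequence first and accept the same constructor arguments used above.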


#### E) Optimizing seriously

* You don't necessarily need 100 epochs. Use early stopping. If you've never done this before, take a look at [early stopping callback(keras)](https://keras.io/callbacks/#earlystopping) or in [pytorch(lightning)](https://pytorch-lightning.readthedocs.io/en/latest/common/early_stopping.html).
  * In short, train until you notice that validation performance has stopped improving
  * Maintain the best-on-validation snapshot via `model.save(file_name)`
  * Plotting learning curves is usually a good idea
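
A minimal early-stopping loop in plain PyTorch might look like the sketch below. The function name `fit_with_early_stopping`, the `patience` parameter, the full-batch training step, and the toy data are all assumptions chosen to keep the example short. The best snapshot is kept in memory here, and `torch.save(model.state_dict(), path)` would be the on-disk equivalent.

```python
import copy
import torch
import torch.nn as nn


def fit_with_early_stopping(model, optimizer, criterion, train_data, val_data,
                            max_epochs=100, patience=5):
    """Train until validation loss has not improved for `patience` epochs."""
    X_train, y_train = train_data
    X_val, y_val = val_data
    best_loss, best_state, bad_epochs = float('inf'), None, 0
    history = []

    for epoch in range(max_epochs):
        model.train()
        optimizer.zero_grad()
        loss = criterion(model(X_train), y_train)  # one full-batch step per epoch
        loss.backward()
        optimizer.step()

        model.eval()
        with torch.no_grad():
            val_loss = criterion(model(X_val), y_val).item()
        history.append(val_loss)

        if val_loss < best_loss:
            # New best-on-validation snapshot
            best_loss, bad_epochs = val_loss, 0
            best_state = copy.deepcopy(model.state_dict())
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break

    if best_state is not None:
        model.load_state_dict(best_state)  # restore the best snapshot
    return history, best_loss


torch.manual_seed(0)
X, y = torch.randn(200, 10), torch.randint(0, 2, (200,)).float()
toy = nn.Sequential(nn.Linear(10, 1), nn.Flatten(0))  # logits of shape [batch]
opt = torch.optim.Adam(toy.parameters(), lr=0.01)
history, best = fit_with_early_stopping(
    toy, opt, nn.BCEWithLogitsLoss(), (X[:150], y[:150]), (X[150:], y[150:])
)
print(len(history), best)
```

The returned `history` is the validation curve, so it can be plotted directly to check where training stopped.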
  
Good luck! And may the force be with you!