### Salary prediction, episode II: make it actually work (4 points)

Your main task is to use some of the tricks you've learned on the network and analyze if you can improve __validation MAE__. Try __at least 3 options__ from the list below for a passing grade. Write a short report about what you have tried. More ideas = more bonus points. 

__Please be serious:__ plot learning curves in MAE/epoch, compare models based on optimal performance, test one change at a time. You know the drill :)

You can use either __pytorch__ or __tensorflow__ or any other framework (e.g. pure __keras__). Feel free to adapt the seminar code for your needs. For tensorflow version, consider `seminar_tf2.ipynb` as a starting point.


In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline

## Part II: Experimenting

In [2]:
data = pd.read_csv('data/Train_rev1.csv', index_col = None)
data['Log1pSalary'] = np.log1p(data['SalaryNormalized']).astype('float32')

In [3]:
text_columns = ["Title", "FullDescription"]
categorical_columns = ["Category", "Company", "LocationNormalized", "ContractType", "ContractTime"]
TARGET_COLUMN = "Log1pSalary"

data[categorical_columns] = data[categorical_columns].fillna('NaN') # cast missing values to string "NaN"

data.sample(3)

Unnamed: 0,Id,Title,FullDescription,LocationRaw,LocationNormalized,ContractType,ContractTime,Company,Category,SalaryRaw,SalaryNormalized,SourceName,Log1pSalary
53711,68674868,Web Developer,Web Developer A Junior or MW Web / Digital Dev...,Milton Keynes Buckinghamshire South East,Milton Keynes,,permanent,Beyond The Book,"PR, Advertising & Marketing Jobs",20-30k,25000,totaljobs.com,10.126671
93808,69192359,Assistant Manager,An award winning swimming school is looking to...,"COLCHESTER, ESSEX",Colchester,full_time,permanent,,Customer Services Jobs,15000 - 16000 per annum,15500,fish4.co.uk,9.64866
239657,72629644,Home Improvement Agency Caseworker,Based within Caseworker Services of Environmen...,UK Warwickshire,Warwickshire,,contract,European Solution Ltd,Property Jobs,10.17 per hour,19526,careers4a.com,9.879554


### Train-test split

To be completely rigorous, let's first separate data into train and validation parts before proceeding to tokenization.

In [4]:
from sklearn.model_selection import train_test_split

data_train, data_val = train_test_split(data, test_size=0.2, random_state=42)
data_train.index = range(len(data_train))
data_val.index = range(len(data_val))

data_train = data_train.copy()
data_val = data_val.copy()

print("Train size = ", len(data_train))
print("Validation size = ", len(data_val))

Train size =  195814
Validation size =  48954


### Preprocessing text data

Just like last week, applying NLP to a problem begins with tokenization: splitting raw text into sequences of tokens (words, punctuation, etc).

__Your task__ is to lowercase and tokenize all texts under `Title` and `FullDescription` columns. Store the tokenized data as a __space-separated__ string of tokens for performance reasons.

It's okay to use nltk tokenizers. Assertions were designed for WordPunctTokenizer, slight deviations are okay.
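
For intuition about what this tokenization produces, here is a minimal standard-library sketch. It assumes that `WordPunctTokenizer` behaves like the regular expression `\w+|[^\w\s]+` (runs of word characters, or runs of punctuation). The helper name `simple_preprocess` and the sample string are hypothetical; the notebook itself uses the nltk-based `preprocess` defined below.

```python
import re

# Assumed equivalent of nltk's WordPunctTokenizer pattern:
# either a run of word characters or a run of punctuation.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+")

def simple_preprocess(text):
    """Lowercase the text and return its tokens as one space-separated string."""
    return " ".join(TOKEN_PATTERN.findall(text.lower()))

example = simple_preprocess("Web Developer: HTML/CSS, 20-30k!")
print(example)  # web developer : html / css , 20 - 30k !
```

Storing tokens as a single space-separated string means a plain `str.split()` recovers the token list later.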

In [5]:
print("Raw text:")
print(data["FullDescription"][2::100000])

Raw text:
2         Mathematical Modeller / Simulation Analyst / O...
100002    A successful and high achieving specialist sch...
200002    Web Designer  HTML, CSS, JavaScript, Photoshop...
Name: FullDescription, dtype: object


In [6]:
import nltk

tokenizer = nltk.tokenize.WordPunctTokenizer()
preprocess = lambda text: ' '.join(tokenizer.tokenize(text.lower()))

In [7]:
data_train['Title'] = data_train['Title'].astype(str).apply(preprocess)
data_val['Title'] = data_val['Title'].astype(str).apply(preprocess)

data_train['FullDescription'] = data_train['FullDescription'].astype(str).apply(preprocess)
data_val['FullDescription'] = data_val['FullDescription'].astype(str).apply(preprocess)

In [8]:
print("Tokenized:")
print(data_train["FullDescription"][2::100000])

Tokenized:
2         the opportunity my client is currently seeking...
100002    a principal railways systems engineer is requi...
Name: FullDescription, dtype: object


Not all words are equally useful. Some of them are typos or rare words that are only present a few times. 

Let's count how many times each word is present in the data so that we can build a "white list" of known words.

In [9]:
from collections import Counter
token_counts = Counter()

for text in data_train['Title'].values:
    token_counts.update(text.split())

for text in data_train['FullDescription'].values:
    token_counts.update(text.split())

In [10]:
min_count = 10

# tokens from token_counts keys that had at least min_count occurrences throughout the dataset
tokens = sorted(t for t, c in token_counts.items() if c >= min_count)

# Add special tokens for unknown and empty words
UNK, PAD = "UNK", "PAD"
tokens = [UNK, PAD] + tokens

In [11]:
print("Vocabulary size:", len(tokens))
assert type(tokens) == list
assert 'me' in tokens
assert UNK in tokens
print("Correct!")

Vocabulary size: 30715
Correct!


__Task 1.2__ Build an inverse token index: a dictionary from token (string) to its index in `tokens` (int)

In [12]:
token_to_id = {t: i for i, t in enumerate(tokens)}

In [13]:
assert isinstance(token_to_id, dict)
assert len(token_to_id) == len(tokens)
for tok in tokens:
    assert tokens[token_to_id[tok]] == tok

print("Correct!")

Correct!


And finally, let's use the vocabulary you've built to map text lines into neural network-digestible matrices.

In [14]:
UNK_IX, PAD_IX = map(token_to_id.get, [UNK, PAD])

def as_matrix(sequences, max_len=None):
    """ Convert a list of tokens into a matrix with padding """
    if isinstance(sequences[0], str):
        sequences = list(map(str.split, sequences))
        
    max_len = min(max(map(len, sequences)), max_len or float('inf'))
    
    matrix = np.full((len(sequences), max_len), np.int32(PAD_IX))
    for i,seq in enumerate(sequences):
        row_ix = [token_to_id.get(word, UNK_IX) for word in seq[:max_len]]
        matrix[i, :len(row_ix)] = row_ix
    
    return matrix

Now let's encode the categorical data we have.

As usual, we shall use one-hot encoding for simplicity. Kudos if you implement more advanced encodings: tf-idf, pseudo-time-series, etc.
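
To make the encoding concrete, here is a minimal pure-Python sketch of one-hot encoding for rows given as dicts, in the spirit of what `DictVectorizer` does for string values (one column per `column=value` pair, unseen pairs ignored at transform time). The helper names `fit_one_hot` and `transform_one_hot` and the toy rows are hypothetical.

```python
def fit_one_hot(rows):
    """Collect every 'column=value' pair seen in the training rows and assign it a column index."""
    feature_names = sorted({f"{col}={val}" for row in rows for col, val in row.items()})
    return {name: i for i, name in enumerate(feature_names)}

def transform_one_hot(rows, vocabulary):
    """Encode each row as a 0/1 vector; pairs not seen during fitting are silently skipped."""
    matrix = []
    for row in rows:
        vec = [0.0] * len(vocabulary)
        for col, val in row.items():
            ix = vocabulary.get(f"{col}={val}")
            if ix is not None:
                vec[ix] = 1.0
        matrix.append(vec)
    return matrix

train_rows = [
    {"ContractTime": "permanent", "ContractType": "NaN"},
    {"ContractTime": "contract", "ContractType": "full_time"},
]
vocabulary = fit_one_hot(train_rows)

# "part_time" was never seen during fitting, so it contributes no column.
encoded = transform_one_hot([{"ContractTime": "permanent", "ContractType": "part_time"}], vocabulary)
print(vocabulary)
print(encoded)  # [[0.0, 1.0, 0.0, 0.0]]
```

This is also why rare companies are mapped to `"Other"` below: every distinct value adds one column to the matrix.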

In [15]:
from sklearn.feature_extraction import DictVectorizer

# we only consider top-1k most frequent companies to minimize memory usage
top_companies, top_counts = zip(*Counter(data_train['Company']).most_common(1000))
recognized_companies = set(top_companies)
data_train["Company"] = data_train["Company"].apply(lambda comp: comp if comp in recognized_companies else "Other")

categorical_vectorizer = DictVectorizer(dtype=np.float32, sparse=False)
categorical_vectorizer.fit(data_train[categorical_columns].apply(dict, axis=1))

DictVectorizer(dtype=<class 'numpy.float32'>, sparse=False)

### The deep learning part

Once we've learned to tokenize the data, let's design a machine learning experiment.


In [16]:
import torch

def to_tensors(batch, device):
    batch_tensors = dict()
    for key, arr in batch.items():
        if key in ["FullDescription", "Title"]:
            batch_tensors[key] = torch.tensor(arr, device=device, dtype=torch.int64)
        else:
            batch_tensors[key] = torch.tensor(arr, device=device)
    return batch_tensors

def make_batch(data, max_len=None, word_dropout=0, device=torch.device('cpu')):
    """
    Creates a dict of torch tensors from the batch data.
    :param word_dropout: replaces token index with UNK_IX with this probability
    :returns: a dict with {'Title': int64[batch, title_max_len], ...}
    """
    batch = {}
    batch["Title"] = as_matrix(data["Title"].values, max_len)
    batch["FullDescription"] = as_matrix(data["FullDescription"].values, max_len)
    batch['Categorical'] = categorical_vectorizer.transform(data[categorical_columns].apply(dict, axis=1))
    
    if word_dropout != 0:
        batch["FullDescription"] = apply_word_dropout(batch["FullDescription"], 1. - word_dropout)
    
    if TARGET_COLUMN in data.columns:
        batch[TARGET_COLUMN] = data[TARGET_COLUMN].values
    
    return to_tensors(batch, device)

def apply_word_dropout(matrix, keep_prop, replace_with=UNK_IX, pad_ix=PAD_IX,):
    dropout_mask = np.random.choice(2, np.shape(matrix), p=[keep_prop, 1 - keep_prop])
    dropout_mask &= matrix != pad_ix
    return np.choose(dropout_mask, [matrix, np.full_like(matrix, replace_with)])

In [17]:
make_batch(data_train[:3], max_len=10)

{'Title': tensor([[24819, 26849, 30286,     1,     1,     1,     1],
         [26256,   178, 17212, 17987, 13981, 20789,  3625],
         [ 9523, 27317, 15930,    30,  7802, 26179,    58]]),
 'FullDescription': tensor([[24819, 26849, 30286, 29617,   859, 24819, 26849, 30286, 14791, 29617],
         [26256,   178, 17212, 17987, 13981, 20789,  3625, 22887,   792,    74],
         [27623, 19706, 18493,  5736, 14791,  7327, 24684,   859, 27317, 15930]]),
 'Categorical': tensor([[0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.]]),
 'Log1pSalary': tensor([ 9.7115, 10.4631, 10.7144])}

In [18]:
def iterate_minibatches(data, batch_size=256, shuffle=True, cycle=False, device=torch.device('cpu'), **kwargs):
    """ iterates minibatches of data in random order """
    while True:
        indices = np.arange(len(data))
        if shuffle:
            indices = np.random.permutation(indices)

        for start in range(0, len(indices), batch_size):
            batch = make_batch(data.iloc[indices[start : start + batch_size]], device=device, **kwargs)
            yield batch
        
        if not cycle: break

# Experiments

## What's the benchmark?

Before we proceed to experimenting, let's define the benchmark: the architecture and the results obtained in `homework_part2.ipynb`. That is, after 5 epochs:
- Mean square error = 4.32724
- Mean absolute error = 2.04903

We'll see whether it will be possible to improve these results or not.
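
For reference, here is how the two metrics are defined, assuming they are computed on the log-transformed target (as `TARGET_COLUMN = "Log1pSalary"` suggests). The symbols $s_i$ (raw salary), $y_i$ (target), $\hat y_i$ (prediction) and $N$ (number of validation samples) are notation introduced here for illustration:

$$
y_i = \log(1 + s_i), \qquad
\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N} (\hat y_i - y_i)^2, \qquad
\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} \lvert \hat y_i - y_i \rvert
$$

Under this assumption, an absolute error of $\delta$ in log space corresponds to being off by a factor of roughly $e^{\delta}$ in $1 + s_i$, so differences in MAE are multiplicative rather than additive in salary terms.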

### A) CNN architecture

It is close to what we've done in the `homework_part2.ipynb`, but we will try some more stuff as suggested:

All the tricks you know about dense and convolutional neural networks apply here as well.
* Dropout. Nuff said.
* Batch Norm. This time it's `nn.BatchNorm*`/`L.BatchNormalization`
* Parallel convolution layers. The idea is that you apply several nn.Conv1d to the same embeddings and concatenate output channels.
* More layers, more neurons, ya know...
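
The "parallel convolution layers" idea from the list above can be sketched as follows. This is a minimal illustration, not the architecture trained in this notebook: the class name `ParallelConvEncoder`, the kernel sizes, the channel counts and the use of zero padding are all assumptions.

```python
import torch
import torch.nn as nn

class ParallelConvEncoder(nn.Module):
    """Applies several Conv1d layers with different kernel sizes to the same embeddings."""

    def __init__(self, emb_size=64, out_channels=32, kernel_sizes=(2, 3, 4)):
        super().__init__()
        # padding=k-1 keeps every branch usable even for sequences shorter than k
        self.convs = nn.ModuleList([
            nn.Conv1d(emb_size, out_channels, kernel_size=k, padding=k - 1)
            for k in kernel_sizes
        ])

    def forward(self, embeddings):
        # embeddings: [batch, emb_size, seq_len]
        # each branch: convolution -> ReLU -> max over time -> [batch, out_channels]
        pooled = [torch.relu(conv(embeddings)).max(dim=-1).values for conv in self.convs]
        # concatenate the branches along the channel dimension
        return torch.cat(pooled, dim=1)

encoder = ParallelConvEncoder()
features = encoder(torch.randn(8, 64, 10))
print(features.shape)  # torch.Size([8, 96])
```

Such a module could stand in for the single `nn.Conv1d` + pooling stage of an encoder, provided the following `nn.Linear` expects `out_channels * len(kernel_sizes)` input features.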


In [74]:
import torch
import torch.nn as nn
import torch.nn.functional as F

In [75]:
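# We'll use a special helper module to squeeze the trailing (pooled) dimension
class Squeezener(nn.Module):
    def forward(self, x):
        # [batch, channels, 1] -> [batch, channels]; the batch axis survives even for batch size 1
        return x.squeeze(-1)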

In [None]:
class SalaryPredictor(nn.Module):
    def __init__(self, n_tokens=len(tokens), n_cat_features=len(categorical_vectorizer.vocabulary_), hid_size=64):
        super().__init__()

        self.embedder = nn.Embedding(n_tokens, embedding_dim=hid_size)

        self.title_encoder = nn.Sequential(
            nn.Conv1d(hid_size, hid_size, kernel_size=2),
            nn.Dropout(p=0.25),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(output_size=1),
            Squeezener(),
            nn.Linear(hid_size, hid_size),
            nn.ReLU()
        )

        self.description_encoder = nn.Sequential(
            nn.Conv1d(hid_size, hid_size, kernel_size=2),
            nn.Dropout(p=0.25),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(output_size=1),
            Squeezener(),
            nn.Linear(hid_size, hid_size),
            nn.ReLU()
        )

        self.categorical_encoder = nn.Sequential(
            nn.Linear(n_cat_features, hid_size * 2),
            nn.ReLU(),
            nn.Linear(hid_size * 2, hid_size * 2),
            nn.ReLU()
        )

        self.final_predictor = nn.Sequential(
            nn.Linear(hid_size * 4, hid_size),
            nn.ReLU(),
            nn.Linear(hid_size, 1)
        )

    def forward(self, batch):

        title_embeddings = self.embedder(batch['Title']).permute(0,2,1)
#         print(title_embeddings.shape)
        title_features = self.title_encoder(title_embeddings)
#         print(title_features.shape)

        description_embeddings = self.embedder(batch['FullDescription']).permute(0,2,1)
#         print(description_embeddings.shape)
        description_features = self.description_encoder(description_embeddings)
#         print(description_features.shape)

        categorical_features = self.categorical_encoder(batch['Categorical'])
        
        features = torch.cat([title_features, description_features, categorical_features], 1)
        # squeeze only the last axis so that a batch of size 1 still yields shape [1]
        return self.final_predictor(features).squeeze(-1)


In [None]:
import tqdm

BATCH_SIZE = 512
EPOCHS = 5
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')  # fall back to CPU when no GPU is available

In [None]:
def print_metrics(model, data, batch_size=BATCH_SIZE, name="", device=torch.device('cpu'), **kw):
    squared_error = abs_error = num_samples = 0.0
    model.eval()
    with torch.no_grad():
        for batch in iterate_minibatches(data, batch_size=batch_size, device = device, shuffle=False, **kw):
            batch_pred = model(batch)
            squared_error += torch.sum(torch.square(batch_pred - batch[TARGET_COLUMN]))
            abs_error += torch.sum(torch.abs(batch_pred - batch[TARGET_COLUMN]))
            num_samples += len(batch_pred)
    mse = squared_error.detach().cpu().numpy() / num_samples
    mae = abs_error.detach().cpu().numpy() / num_samples
    print("%s results:" % (name or ""))
    print("Mean square error: %.5f" % mse)
    print("Mean absolute error: %.5f" % mae)
    return mse, mae


In [None]:
model = SalaryPredictor().to(DEVICE)
criterion = nn.MSELoss(reduction='sum')
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for epoch in range(EPOCHS):
    print(f"epoch: {epoch}")
    model.train()
    for i, batch in tqdm.tqdm_notebook(enumerate(
            iterate_minibatches(data_train, batch_size=BATCH_SIZE, device=DEVICE)),
            total=len(data_train) // BATCH_SIZE
        ):
        pred = model(batch)
        loss = criterion(pred, batch[TARGET_COLUMN])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
    print_metrics(model, data_val, device=DEVICE)


### C) Fun with words

It's not always a good idea to train embeddings from scratch. Here are a few tricks:

* Use pre-trained embeddings from `gensim.downloader.load`. See last lecture.
* Start with pre-trained embeddings, then fine-tune them with gradient descent. You may or may not download pre-trained embeddings from [here](http://nlp.stanford.edu/data/glove.6B.zip) and follow this [manual](https://keras.io/examples/nlp/pretrained_word_embeddings/) to initialize your Keras embedding layer with downloaded weights.
* Use the same embedding matrix in the title and description vectorizers

As suggested above, I will use the GloVe vectors trained on `Wikipedia 2014 + Gigaword 5`. They were trained on a corpus of 6 billion tokens and cover a vocabulary of 400,000 tokens.

In [None]:
model = SalaryPredictor()
batch = make_batch(data_train[:100])
criterion = nn.MSELoss()

dummy_pred = model(batch)
dummy_loss = criterion(dummy_pred, batch[TARGET_COLUMN])
assert dummy_pred.shape == torch.Size([100])
assert len(np.unique(dummy_pred.cpu().detach().numpy())) > 20, "model returns suspiciously few unique outputs. Check your initialization"
assert dummy_loss.ndim == 0 and 0. <= dummy_loss <= 250., "make sure you minimize MSE"

In [19]:
import torch
import torch.nn as nn
import torch.nn.functional as F

To make things comparable, let's make changes in steps:

1. Set the embedding dimension to 50 (it was 64 in the non-pretrained setting)
2. Look at the results without fine-tuning the embeddings
3. Look at the results with fine-tuning
4. Use 300-dimensional embeddings

In [None]:
# Code below is taken from
# https://medium.com/@martinpella/how-to-use-pre-trained-word-embeddings-in-pytorch-71ca59249f76
words = []
idx = 0
word2idx = {}
vectors = []

with open('glovedata/glove.6B/glove.6B.50d.txt', 'rb') as f:
    for l in f:
        line = l.decode().split()
        word = line[0]
        words.append(word)
        word2idx[word] = idx
        idx += 1
        vect = np.array(line[1:]).astype(np.float64)  # the np.float alias was removed in NumPy 1.24
        vectors.append(vect)

In [None]:
glovedct =  {w : vectors[word2idx[w]] for w in words}

We must build a matrix of weights that will be loaded into the PyTorch embedding layer. Its shape will be `(dataset vocabulary length, word vector dimension)`.

For each word in the dataset's vocabulary, we check whether it is in GloVe's vocabulary. If it is, we load its pre-trained word vector. Otherwise, we initialize a random vector.

In [None]:
matrix_len = len(tokens)
weights_matrix = np.zeros((matrix_len, 50))
words_found = 0

for i, word in enumerate(tokens):
    try:
        weights_matrix[i] = glovedct[word]
        words_found += 1
    except KeyError:
        weights_matrix[i] = np.random.normal(scale=0.6, size = (50,))

In [None]:
print(f'{100*words_found/len(tokens):.2f}% of tokens were found in Glove dataset')

The architecture is the same as in the baseline; the only difference is the use of pretrained word vectors with a different dimension.

In [None]:
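# We'll use a special helper module to squeeze the trailing (pooled) dimension
class Squeezener(nn.Module):
    def forward(self, x):
        # [batch, channels, 1] -> [batch, channels]; the batch axis survives even for batch size 1
        return x.squeeze(-1)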

In [None]:
class SalaryPredictor(nn.Module):
    def __init__(self,
                 weights_matrix,
                 n_cat_features=len(categorical_vectorizer.vocabulary_),
                 hid_size=50,
                 freeze=True):
        
        super().__init__()
        
        self.embedder = nn.Embedding.from_pretrained(
            torch.FloatTensor(weights_matrix),
            freeze=freeze)

        self.title_encoder = nn.Sequential(
            nn.Conv1d(hid_size, hid_size, kernel_size=2),
            nn.Dropout(p=0.25),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(output_size=1),
            Squeezener(),
            nn.Linear(hid_size, hid_size),
            nn.ReLU()
        )

        self.description_encoder = nn.Sequential(
            nn.Conv1d(hid_size, hid_size, kernel_size=2),
            nn.Dropout(p=0.25),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(output_size=1),
            Squeezener(),
            nn.Linear(hid_size, hid_size),
            nn.ReLU()
        )

        self.categorical_encoder = nn.Sequential(
            nn.Linear(n_cat_features, hid_size * 2),
            nn.ReLU(),
            nn.Linear(hid_size * 2, hid_size * 2),
            nn.ReLU()
        )

        self.final_predictor = nn.Sequential(
            nn.Linear(hid_size * 4, hid_size),
            nn.ReLU(),
            nn.Linear(hid_size, 1)
        )

    def forward(self, batch):

        title_embeddings = self.embedder(batch['Title']).permute(0,2,1)
#         print(title_embeddings.shape)
        title_features = self.title_encoder(title_embeddings)
#         print(title_features.shape)

        description_embeddings = self.embedder(batch['FullDescription']).permute(0,2,1)
#         print(description_embeddings.shape)
        description_features = self.description_encoder(description_embeddings)
#         print(description_features.shape)

        categorical_features = self.categorical_encoder(batch['Categorical'])
        
        features = torch.cat([title_features, description_features, categorical_features], 1)
        # squeeze only the last axis so that a batch of size 1 still yields shape [1]
        return self.final_predictor(features).squeeze(-1)


In [None]:
model = SalaryPredictor(weights_matrix)
batch = make_batch(data_train[:100])
criterion = nn.MSELoss()

dummy_pred = model(batch)
dummy_loss = criterion(dummy_pred, batch[TARGET_COLUMN])
assert dummy_pred.shape == torch.Size([100])
assert len(np.unique(dummy_pred.cpu().detach().numpy())) > 20, "model returns suspiciously few unique outputs. Check your initialization"
assert dummy_loss.ndim == 0 and 0. <= dummy_loss <= 250., "make sure you minimize MSE"

In [None]:
import tqdm

BATCH_SIZE = 512
EPOCHS = 5
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')  # fall back to CPU when no GPU is available

In [None]:
def print_metrics(model, data, batch_size=BATCH_SIZE, name="", device=torch.device('cpu'), **kw):
    squared_error = abs_error = num_samples = 0.0
    model.eval()
    with torch.no_grad():
        for batch in iterate_minibatches(data, batch_size=batch_size, device = device, shuffle=False, **kw):
            batch_pred = model(batch)
            squared_error += torch.sum(torch.square(batch_pred - batch[TARGET_COLUMN]))
            abs_error += torch.sum(torch.abs(batch_pred - batch[TARGET_COLUMN]))
            num_samples += len(batch_pred)
    mse = squared_error.detach().cpu().numpy() / num_samples
    mae = abs_error.detach().cpu().numpy() / num_samples
    print("%s results:" % (name or ""))
    print("Mean square error: %.5f" % mse)
    print("Mean absolute error: %.5f" % mae)
    return mse, mae


In [None]:
model = SalaryPredictor(weights_matrix, freeze=True).to(DEVICE)
criterion = nn.MSELoss(reduction='sum')
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for epoch in range(EPOCHS):
    print(f"epoch: {epoch}")
    model.train()
    for i, batch in tqdm.tqdm_notebook(enumerate(
            iterate_minibatches(data_train, batch_size=BATCH_SIZE, device=DEVICE)),
            total=len(data_train) // BATCH_SIZE
        ):
        pred = model(batch)
        loss = criterion(pred, batch[TARGET_COLUMN])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
    print_metrics(model, data_val, device=DEVICE)


Interesting! Both the mean squared error and the mean absolute error are lower than in the benchmark (`4.32724` and `2.04903`, respectively). Note that the `nn.Embedding()` layer in the benchmark is trainable by default, while `nn.Embedding.from_pretrained()` is frozen by default. So pretrained vectors give better results (at least when training for 5 epochs) even when the embeddings are not updated during training.

Let's see what happens when we allow retraining of the embeddings (setting `freeze=False`).
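
As a quick illustration of what the `freeze` flag controls, here is a minimal sketch on a tiny random matrix (a stand-in for the real GloVe weights; the variable names are hypothetical):

```python
import torch
import torch.nn as nn

pretrained = torch.randn(5, 3)  # stand-in for a real [vocab_size, emb_dim] weight matrix

frozen = nn.Embedding.from_pretrained(pretrained, freeze=True)
trainable = nn.Embedding.from_pretrained(pretrained, freeze=False)

print(frozen.weight.requires_grad)     # False: the optimizer will not update these weights
print(trainable.weight.requires_grad)  # True: the weights are fine-tuned during training

# Only parameters with requires_grad=True receive gradient updates.
n_trainable = sum(p.numel() for p in trainable.parameters() if p.requires_grad)
print(n_trainable)  # 15
```

Both layers start from the same vectors; they differ only in whether gradient descent is allowed to move them.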

In [None]:
model.embedder.weight.requires_grad

In [None]:
model = SalaryPredictor(weights_matrix, freeze=False).to(DEVICE)
criterion = nn.MSELoss(reduction='sum')
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for epoch in range(EPOCHS):
    print(f"epoch: {epoch}")
    model.train()
    for i, batch in tqdm.tqdm_notebook(enumerate(
            iterate_minibatches(data_train, batch_size=BATCH_SIZE, device=DEVICE)),
            total=len(data_train) // BATCH_SIZE
        ):
        pred = model(batch)
        loss = criterion(pred, batch[TARGET_COLUMN])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
    print_metrics(model, data_val, device=DEVICE)


In [None]:
model.embedder.weight.requires_grad

We get even better results when the embeddings are fine-tuned. This should be taken with caution, though: a few runs of only 5 epochs each are not enough to judge. Still, we can conclude that using pretrained vectors improves the results.

I am curious what would happen if we used 300-dimensional embeddings instead of 50-dimensional ones. Let's repeat the previous steps and see:

In [None]:
# Code below is taken from
# https://medium.com/@martinpella/how-to-use-pre-trained-word-embeddings-in-pytorch-71ca59249f76
words = []
idx = 0
word2idx = {}
vectors = []

with open('glovedata/glove.6B/glove.6B.300d.txt', 'rb') as f:
    for l in f:
        line = l.decode().split()
        word = line[0]
        words.append(word)
        word2idx[word] = idx
        idx += 1
        vect = np.array(line[1:]).astype(np.float64)  # the np.float alias was removed in NumPy 1.24
        vectors.append(vect)
        
glovedct =  {w : vectors[word2idx[w]] for w in words}
matrix_len = len(tokens)
weights_matrix = np.zeros((matrix_len, 300))
words_found = 0

for i, word in enumerate(tokens):
    try:
        weights_matrix[i] = glovedct[word]
        words_found += 1
    except KeyError:
        weights_matrix[i] = np.random.normal(scale=0.6, size = (300,))

In [None]:
print(f'{100*words_found/len(tokens):.2f}% of tokens were found in Glove dataset')

In [None]:
model = SalaryPredictor(weights_matrix, hid_size=300, freeze=False).to(DEVICE)
criterion = nn.MSELoss(reduction='sum')
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for epoch in range(EPOCHS):
    print(f"epoch: {epoch}")
    model.train()
    for i, batch in tqdm.tqdm_notebook(enumerate(
            iterate_minibatches(data_train, batch_size=BATCH_SIZE, device=DEVICE)),
            total=len(data_train) // BATCH_SIZE
        ):
        pred = model(batch)
        loss = criterion(pred, batch[TARGET_COLUMN])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
    print_metrics(model, data_val, device=DEVICE)


Quite unexpectedly, the results are worse. This might be because a larger network needs more epochs of training and some hyperparameter tuning.

### E) Optimizing seriously

* You don't necessarily need 100 epochs. Use early stopping. If you've never done this before, take a look at [early stopping callback(keras)](https://keras.io/callbacks/#earlystopping) or in [pytorch(lightning)](https://pytorch-lightning.readthedocs.io/en/latest/early_stopping.html).
  * In short, train until you notice that validation error stops improving
  * Maintain the best-on-validation snapshot via `model.save(file_name)`
  * Plotting learning curves is usually a good idea
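
The early-stopping recipe above can be sketched without any framework. This is a minimal illustration under stated assumptions: `train_with_early_stopping`, `train_one_epoch`, `evaluate` and the `patience` parameter are hypothetical names, and the "model" is a toy dict whose validation MAE follows a scripted curve.

```python
import copy

def train_with_early_stopping(train_one_epoch, evaluate, state, max_epochs=100, patience=3):
    """
    Hypothetical helper: `train_one_epoch(state)` updates the model state in place,
    `evaluate(state)` returns validation MAE (lower is better).
    Stops after `patience` consecutive epochs without improvement.
    """
    best_mae, best_state, history = float("inf"), None, []
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch(state)
        val_mae = evaluate(state)
        history.append(val_mae)  # keep the curve for plotting later
        if val_mae < best_mae:
            # keep the best-on-validation snapshot
            best_mae, best_state = val_mae, copy.deepcopy(state)
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_state, best_mae, history

# Toy run: the scripted validation MAE improves, then stalls for three epochs.
scripted_mae = [0.9, 0.6, 0.5, 0.55, 0.52, 0.58, 0.4]
state = {"epoch": 0}

def fake_train(s):
    s["epoch"] += 1

def fake_eval(s):
    return scripted_mae[s["epoch"] - 1]

best_state, best_mae, history = train_with_early_stopping(fake_train, fake_eval, state, max_epochs=7)
print(best_state, best_mae, len(history))  # {'epoch': 3} 0.5 6
```

With a PyTorch model, the snapshot would typically be a deep copy of `model.state_dict()` rather than of a plain dict. Note that the toy run stops before the late improvement at epoch 7, which is the usual trade-off when choosing `patience`.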
  
Good luck! And may the force be with you!

OK, everyone talks about PyTorch Lightning, so this is a good opportunity for a first dive.

In [20]:
def iterate_minibatches(data, batch_size=256, shuffle=True, cycle=False, device=torch.device('cpu'), **kwargs):
    """ iterates minibatches of data in random order """
    while True:
        indices = np.arange(len(data))
        if shuffle:
            indices = np.random.permutation(indices)

        for start in range(0, len(indices), batch_size):
            batch = make_batch(data.iloc[indices[start : start + batch_size]], device=device, **kwargs)
            yield batch
        
        if not cycle: break

In [21]:
BATCH_SIZE = 512
EPOCHS = 5
DEVICE = torch.device('cuda')

In [42]:
import torch
from torch.nn import functional as F
from torch import nn
from pytorch_lightning import LightningModule, Trainer
from pytorch_lightning.callbacks.early_stopping import EarlyStopping
# Note: in newer releases the Metric base class lives in the separate `torchmetrics` package
from pytorch_lightning.metrics import Metric

In [23]:
# Code below is taken from
# https://medium.com/@martinpella/how-to-use-pre-trained-word-embeddings-in-pytorch-71ca59249f76
words = []
idx = 0
word2idx = {}
vectors = []

with open('glovedata/glove.6B/glove.6B.50d.txt', 'rb') as f:
    for l in f:
        line = l.decode().split()
        word = line[0]
        words.append(word)
        word2idx[word] = idx
        idx += 1
        vect = np.array(line[1:]).astype(np.float64)  # the np.float alias was removed in NumPy 1.24
        vectors.append(vect)
        
glovedct =  {w : vectors[word2idx[w]] for w in words}
matrix_len = len(tokens)
weights_matrix = np.zeros((matrix_len, 50))
words_found = 0

for i, word in enumerate(tokens):
    try:
        weights_matrix[i] = glovedct[word]
        words_found += 1
    except KeyError:
        weights_matrix[i] = np.random.normal(scale=0.6, size = (50,))

In [36]:
class MyMSE(Metric):
    def __init__(self, compute_on_step=False, dist_sync_on_step=False):
        super().__init__(compute_on_step=compute_on_step, dist_sync_on_step=dist_sync_on_step)

        self.add_state("squared_error", default=torch.tensor(0.0), dist_reduce_fx="sum")
        self.add_state("total", default=torch.tensor(0), dist_reduce_fx="sum")

    def update(self, preds: torch.Tensor, target: torch.Tensor):
        self.squared_error += torch.sum(torch.square(preds - target))
        self.total += target.numel()

    def compute(self):
        return self.squared_error.float()/ self.total

class MyMAE(Metric):
    def __init__(self, compute_on_step=False, dist_sync_on_step=False):
        super().__init__(compute_on_step=compute_on_step, dist_sync_on_step=dist_sync_on_step)

        self.add_state("abs_error", default=torch.tensor(0.0), dist_reduce_fx="sum")
        self.add_state("total", default=torch.tensor(0), dist_reduce_fx="sum")

    def update(self, preds: torch.Tensor, target: torch.Tensor):
        self.abs_error += torch.sum(torch.abs(preds - target))
        self.total += target.numel()

    def compute(self):
        return self.abs_error.float() / self.total

In [37]:
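# We'll use a special helper module to squeeze the trailing (pooled) dimension
class Squeezener(nn.Module):
    def forward(self, x):
        # [batch, channels, 1] -> [batch, channels]; the batch axis survives even for batch size 1
        return x.squeeze(-1)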

In [65]:
class SalaryPredictor(LightningModule):

    def __init__(self,
                 weights_matrix,
                 n_cat_features=len(categorical_vectorizer.vocabulary_),
                 hid_size=50,
                 freeze=True):
        
        super().__init__()

        self.val_mse = MyMSE()
        self.val_mae = MyMAE()
        
        self.embedder = nn.Embedding.from_pretrained(
            torch.FloatTensor(weights_matrix),
            freeze=freeze)

        self.title_encoder = nn.Sequential(
            nn.Conv1d(hid_size, hid_size, kernel_size=2),
            nn.Dropout(p=0.25),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(output_size=1),
            Squeezener(),
            nn.Linear(hid_size, hid_size),
            nn.ReLU()
        )

        self.description_encoder = nn.Sequential(
            nn.Conv1d(hid_size, hid_size, kernel_size=2),
            nn.Dropout(p=0.25),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(output_size=1),
            Squeezener(),
            nn.Linear(hid_size, hid_size),
            nn.ReLU()
        )

        self.categorical_encoder = nn.Sequential(
            nn.Linear(n_cat_features, hid_size * 2),
            nn.ReLU(),
            nn.Linear(hid_size * 2, hid_size * 2),
            nn.ReLU()
        )

        self.final_predictor = nn.Sequential(
            nn.Linear(hid_size * 4, hid_size),
            nn.ReLU(),
            nn.Linear(hid_size, 1)
        )

    def forward(self, batch):

        title_embeddings = self.embedder(batch['Title']).permute(0,2,1)
        title_features = self.title_encoder(title_embeddings)
        description_embeddings = self.embedder(batch['FullDescription']).permute(0,2,1)
        description_features = self.description_encoder(description_embeddings)

        categorical_features = self.categorical_encoder(batch['Categorical'])
        
        features = torch.cat([title_features, description_features, categorical_features], 1)
        # squeeze only the last axis so that a batch of size 1 still yields shape [1]
        return self.final_predictor(features).squeeze(-1)

    def training_step(self, batch, batch_idx):

        pred = self(batch)
        loss = criterion(pred, batch[TARGET_COLUMN])
        self.log('loss', loss, prog_bar=False)
        return {'loss': loss}

    def validation_step(self, batch, batch_idx):
        pred = self(batch)
        loss = criterion(pred, batch[TARGET_COLUMN])
        self.log('val_loss', loss, prog_bar=False)
        
        self.val_mse(pred, batch[TARGET_COLUMN])
        self.val_mae(pred, batch[TARGET_COLUMN])
        
        self.log('valid_mse', self.val_mse, on_step=False, on_epoch=True,  prog_bar=True)
        self.log('valid_mae', self.val_mae, on_step=False, on_epoch=True,  prog_bar=True)
        return {
            'val_loss': loss,
            'valid_mse': self.val_mse,
            'valid_mae': self.val_mae}

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-4)
        return optimizer

    def train_dataloader(self):
        iterate_minibatches = IterateMiniBatches(data_train)
        return iterate_minibatches
    
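    def val_dataloader(self):
        # Epoch-level validation metrics do not depend on batch order, so no shuffling is needed
        iterate_minibatches = IterateMiniBatches(data_val, shuffle=False)
        return iterate_minibatches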


In [66]:
class IterateMiniBatches:
    
    def __init__(self, data, batch_size=512, shuffle=True):
        self.data = data
        self.batch_size = batch_size
        self.shuffle = shuffle
    
    def __iter__(self):
        
        data = self.data
        batch_size = self.batch_size
        shuffle = self.shuffle
        indices = np.arange(len(data))
        if shuffle:
            indices = np.random.permutation(indices)
        for start in range(0, len(indices), batch_size):
            batch = make_batch(data.iloc[indices[start : start + batch_size]])
            yield batch

In [67]:
model = SalaryPredictor(weights_matrix)
criterion = nn.MSELoss(reduction='sum')

In [70]:
trainer = Trainer(gpus=1, max_epochs=20, callbacks=[EarlyStopping(monitor='valid_mae')])
trainer.fit(model)

GPU available: True, used: True
TPU available: None, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2]

  | Name                | Type       | Params
---------------------------------------------------
0 | val_mse             | MyMSE      | 0     
1 | val_mae             | MyMAE      | 0     
2 | embedder            | Embedding  | 1.5 M 
3 | title_encoder       | Sequential | 7.6 K 
4 | description_encoder | Sequential | 7.6 K 
5 | categorical_encoder | Sequential | 368 K 
6 | final_predictor     | Sequential | 10.1 K
---------------------------------------------------
393 K     Trainable params
1.5 M     Non-trainable params
1.9 M     Total params



In [73]:
%load_ext tensorboard
%tensorboard --logdir lightning_logs/

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


### A) CNN architecture

This is close to what we did in `homework_part2.ipynb`, but we will try a few more of the suggested tricks:

All the tricks you know about dense and convolutional neural networks apply here as well.
* Dropout. Nuff said.
* Batch Norm. This time it's `nn.BatchNorm*`/`L.BatchNormalization`
* Parallel convolution layers. The idea is that you apply several `nn.Conv1d` layers to the same embeddings and concatenate their output channels.
* More layers, more neurons, ya know...


#### Architecture

Our basic model consists of three branches:
* Title encoder
* Description encoder
* Categorical features encoder

We will then feed all 3 branches into one common network that predicts salary.

![scheme](https://github.com/yandexdataschool/nlp_course/raw/master/resources/w2_conv_arch.png)

### Part II: Experiments

In [None]:
# < A whole lot of your code > - models, charts, analysis

### A short report

Please tell us what you did and how it worked.

`<YOUR_TEXT_HERE>`, I guess...

## Recommended options

#### A) CNN architecture

All the tricks you know about dense and convolutional neural networks apply here as well.
* Dropout. Nuff said.
* Batch Norm. This time it's `nn.BatchNorm*`/`L.BatchNormalization`
* Parallel convolution layers. The idea is that you apply several `nn.Conv1d` layers to the same embeddings and concatenate their output channels.
* More layers, more neurons, ya know...
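
The tricks above can be combined in one text encoder. The following is only a minimal PyTorch sketch, not the seminar's model: the class name `ParallelConvEncoder`, the kernel sizes, the filter count and the dropout rate are all illustrative assumptions. It expects embeddings that are already permuted to `[batch, emb_size, time]`.

```python
import torch
import torch.nn as nn


class ParallelConvEncoder(nn.Module):
    """Hypothetical encoder: parallel Conv1d branches with BatchNorm, then Dropout."""

    def __init__(self, emb_size=32, n_filters=16, kernel_sizes=(2, 3, 5), dropout=0.3):
        super().__init__()
        # One branch per kernel size; every branch sees the same embeddings
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(emb_size, n_filters, kernel_size=k, padding=k // 2),
                nn.BatchNorm1d(n_filters),
                nn.ReLU(),
            )
            for k in kernel_sizes
        ])
        self.dropout = nn.Dropout(dropout)
        self.out_size = n_filters * len(kernel_sizes)

    def forward(self, embeddings):
        # embeddings: [batch, emb_size, time]
        # Max over time per branch, then concatenate the output channels
        pooled = [branch(embeddings).max(dim=-1).values for branch in self.branches]
        return self.dropout(torch.cat(pooled, dim=1))


encoder = ParallelConvEncoder()
features = encoder(torch.randn(4, 32, 10))  # 4 texts, 10 tokens each
print(features.shape)
```

With three branches of 16 filters each, the encoder returns 48 features per text, so the first layer of the downstream predictor has to be sized accordingly.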


#### B) Play with pooling

There's more than one way to perform pooling:
* Max over time (independently for each feature)
* Average over time (excluding PAD)
* Softmax-pooling:
$$ out_{i} = \sum_t {h_{i,t} \cdot {{e ^ {h_{i, t}}} \over \sum_\tau e ^ {h_{i, \tau}} } }$$

* Attentive pooling
$$ out_{i} = \sum_t {h_{i,t} \cdot Attn(h_t)}$$

, where $$ Attn(h_t) = {{e ^ {NN_{attn}(h_t)}} \over \sum_\tau e ^ {NN_{attn}(h_\tau)}}  $$
and $NN_{attn}$ is a dense layer that maps $h_t$ to a scalar score.

The optimal score is usually achieved by concatenating several different poolings, including several attentive poolings with different $NN_{attn}$ (aka multi-headed attention).

The catch is that keras layers do not include those toys. You will have to [write your own keras layer](https://keras.io/layers/writing-your-own-keras-layers/). Or use pure tensorflow, it might even be easier :)
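
In PyTorch these poolings take only a few lines each. The sketch below is one possible reading of the formulas above, under these assumptions: `h` has shape `[batch, channels, time]`, `mask` is 1 for real tokens and 0 for PAD, every sequence has at least one real token, and the names `masked_mean_pool`, `softmax_pool` and `AttentivePooling` are made up for illustration.

```python
import torch
import torch.nn as nn


def masked_mean_pool(h, mask):
    # Average over time, excluding PAD positions
    mask = mask.unsqueeze(1).to(h.dtype)                # [batch, 1, time]
    return (h * mask).sum(dim=-1) / mask.sum(dim=-1).clamp(min=1.0)


def softmax_pool(h):
    # Each channel weights its own timesteps with a softmax over time
    return (h * torch.softmax(h, dim=-1)).sum(dim=-1)


class AttentivePooling(nn.Module):
    def __init__(self, n_channels):
        super().__init__()
        # NN_attn: a dense layer producing one scalar score per timestep
        self.scorer = nn.Linear(n_channels, 1)

    def forward(self, h, mask=None):
        scores = self.scorer(h.transpose(1, 2)).squeeze(-1)   # [batch, time]
        if mask is not None:
            # PAD positions get zero attention weight
            scores = scores.masked_fill(mask == 0, float('-inf'))
        weights = torch.softmax(scores, dim=-1)               # Attn(h_t)
        return (h * weights.unsqueeze(1)).sum(dim=-1)         # [batch, channels]


h = torch.randn(2, 8, 5)
mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
attn_pool = AttentivePooling(8)
pooled = torch.cat([
    h.max(dim=-1).values,          # max over time (PAD not excluded in this sketch)
    masked_mean_pool(h, mask),
    softmax_pool(h),
    attn_pool(h, mask),
], dim=1)
print(pooled.shape)
```

Several `AttentivePooling` modules with separate weights, concatenated in the same way, would give the multi-headed variant mentioned above.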

#### C) Fun with words

It's not always a good idea to train embeddings from scratch. Here are a few tricks:

* Use pre-trained embeddings from `gensim.downloader.load`. See last lecture.
* Start with pre-trained embeddings, then fine-tune them with gradient descent. You may or may not download pre-trained embeddings from [here](http://nlp.stanford.edu/data/glove.6B.zip) and follow this [manual](https://keras.io/examples/nlp/pretrained_word_embeddings/) to initialize your Keras embedding layer with downloaded weights.
* Use the same embedding matrix for the title and description vectorizers
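
For the PyTorch route, the second and third tricks might look roughly like this. It is a sketch under assumptions: the random `pretrained` matrix only stands in for real vectors (for example, rows looked up from a gensim model), and the vocabulary size and dimensions are arbitrary.

```python
import numpy as np
import torch
import torch.nn as nn

# Stand-in for a real pre-trained matrix with one row per vocabulary token;
# random numbers are used only so that the sketch runs without any download.
vocab_size, emb_dim = 100, 50
pretrained = np.random.randn(vocab_size, emb_dim).astype('float32')

# freeze=False lets gradient descent fine-tune the vectors;
# freeze=True would keep them fixed (non-trainable).
shared_embedder = nn.Embedding.from_pretrained(torch.from_numpy(pretrained), freeze=False)

# The same embedding layer serves both text inputs
title_ix = torch.randint(0, vocab_size, (4, 7))    # 4 titles, 7 tokens each
desc_ix = torch.randint(0, vocab_size, (4, 30))    # 4 descriptions, 30 tokens each
title_emb = shared_embedder(title_ix)
desc_emb = shared_embedder(desc_ix)
print(title_emb.shape, desc_emb.shape)
```

Sharing one layer means that gradients from both the title and the description branch update the same vectors.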


#### D) Going recurrent

We've already learned that recurrent networks can do cool stuff in sequence modelling. Turns out, they're not useless for classification either. With some tricks, of course.

* Like convolutional layers, LSTM should be pooled into a fixed-size vector with some of the poolings.
* Since you know all the text in advance, use a bidirectional RNN
  * Run one LSTM from left to right
  * Run another in parallel from right to left 
  * Concatenate their output sequences along unit axis (dim=-1)

* It might be a good idea to mix convolutions and recurrent layers differently for title and description
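
A bidirectional LSTM branch can be sketched as follows. The class name `BiLSTMEncoder` and the sizes are illustrative assumptions; `bidirectional=True` runs both directions and concatenates their outputs along the last axis, and max over time is just one of the possible poolings.

```python
import torch
import torch.nn as nn


class BiLSTMEncoder(nn.Module):
    """Hypothetical recurrent text encoder: BiLSTM followed by max-over-time pooling."""

    def __init__(self, emb_size=32, hid_size=16):
        super().__init__()
        self.lstm = nn.LSTM(emb_size, hid_size, batch_first=True, bidirectional=True)

    def forward(self, embeddings):
        # embeddings: [batch, time, emb_size] (no permute needed, unlike Conv1d)
        outputs, _ = self.lstm(embeddings)    # [batch, time, 2 * hid_size]
        # Pool the output sequence into a fixed-size vector
        return outputs.max(dim=1).values      # [batch, 2 * hid_size]


encoder = BiLSTMEncoder()
features = encoder(torch.randn(4, 10, 32))   # 4 texts, 10 tokens each
print(features.shape)
```

Note that the output has `2 * hid_size` features per text, one half from each direction.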


#### E) Optimizing seriously

* You don't necessarily need 100 epochs. Use early stopping. If you've never done this before, take a look at [early stopping callback(keras)](https://keras.io/callbacks/#earlystopping) or in [pytorch(lightning)](https://pytorch-lightning.readthedocs.io/en/latest/early_stopping.html).
  * In short, train until you notice that the validation error has stopped improving
  * Maintain the best-on-validation snapshot via `model.save(file_name)`
  * Plotting learning curves is usually a good idea
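
If you write the training loop by hand, the early stopping logic itself is small. Below is a framework-agnostic sketch: `EarlyStopper` is a hypothetical helper, the validation MAE values are made up for illustration, and the `state` dict stands in for whatever snapshot you keep (in PyTorch that would typically be `model.state_dict()`).

```python
import copy


class EarlyStopper:
    """Hypothetical helper: stop after `patience` epochs without improvement."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best_score = float('inf')
        self.best_state = None
        self.bad_epochs = 0

    def step(self, val_mae, state):
        """Record one epoch; return True when training should stop."""
        if val_mae < self.best_score:
            # New best on validation: keep a snapshot
            self.best_score = val_mae
            self.best_state = copy.deepcopy(state)
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation MAE values, one per epoch, purely for illustration
fake_val_mae = [0.50, 0.40, 0.35, 0.36, 0.37, 0.38, 0.30]
stopper = EarlyStopper(patience=3)
for epoch, val_mae in enumerate(fake_val_mae):
    state = {'epoch': epoch}   # placeholder for a real model snapshot
    if stopper.step(val_mae, state):
        break

print(epoch, stopper.best_score, stopper.best_state)
```

In this toy run training stops before the last value is ever seen, which illustrates the trade-off: a larger `patience` costs more epochs but is less likely to stop during a temporary plateau.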
  
Good luck! And may the force be with you!