<a href="https://colab.research.google.com/github/karfly/learning-deep-learning/blob/master/07_emb/seminar_salary_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Salary prediction from job description

Today we're gonna apply the newly learned DL tools for sequence processing to the task of predicting job salary.

Special thanks to [Oleg Vasilev](https://github.com/Omrigan/) for the core of the assignment.

In [0]:
import numpy as np
import pandas as pd
from tqdm import trange

from IPython.display import clear_output
import matplotlib.pyplot as plt
%matplotlib inline

## About the challenge



Our task is to predict one number, __SalaryNormalized__, in the sense of minimizing __Mean Absolute Error__.

<img src="https://raw.githubusercontent.com/karfly/learning-deep-learning/master/07_emb/static/salary-prediction.png" width=400px>



To do so, our model will access a number of features:
* Natural language text: __`Title`__ and  __`FullDescription`__
* Categorical features: __`Category`__, __`Company`__, __`LocationNormalized`__, __`ContractType`__, and __`ContractTime`__.


You can read more [in the official description](https://www.kaggle.com/c/job-salary-prediction#description).

## Download data

If downloading the data fails for any reason (the Dropbox quota may be exceeded):
- you can manually download it from [here](https://www.dropbox.com/s/uni730jpy90eh6f/Train_rev1.csv.tar?dl=0)
- get it from the original [Kaggle competition page](https://www.kaggle.com/c/job-salary-prediction/data) (download `Train_rev1.*`)

In [0]:
!curl -L "https://www.dropbox.com/s/uni730jpy90eh6f/Train_rev1.csv.tar?dl=1" -o Train_rev1.csv.tar
!tar -xvf ./Train_rev1.csv.tar

## Data preprocessing

Neural networks don't know anything about *words*, but they do know about *tensors*. In this section we'll preprocess our data, build an **inverse token index** (token -> some unique index) and convert the data into a form that can be fed to the neural network.

We'll do the following steps:
1. Load data
2. Tokenize text features
3. Wipe out rarely occurring tokens
4. Add special tokens to the vocabulary (`PAD`, `UNK`)
5. Build an inverse token index
6. Implement function for converting tokens to tensors
7. Preprocess categorical features
8. Split data in train/test 

### 1. Load data

As a target we'll use not pure $\text{salary}$, but $\log(1 + \text{salary})$. It's a common trick in ML, because it makes a skewed target variable more normal (more [here](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/discussion/103975)).
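
To make the transform concrete, here is a minimal sketch on made-up salaries (the numbers are illustrative and do not come from the dataset). It shows that `np.log1p` compresses the long right tail and that `np.expm1` inverts it:

```python
import numpy as np

# Made-up, right-skewed salaries (illustrative values only)
salaries = np.array([12_000, 18_000, 25_000, 30_000, 45_000, 250_000], dtype=np.float64)

log_target = np.log1p(salaries)   # log(1 + salary), what the model will predict
restored = np.expm1(log_target)   # exp(x) - 1, the inverse transform

# The largest value is roughly 20x the smallest before the transform,
# but only roughly 1.3x after it
print(salaries.max() / salaries.min())
print(log_target.max() / log_target.min())
```

Predictions made in log space have to go through `np.expm1` (or the torch equivalent) before they can be read as salaries.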

In [0]:
data = pd.read_csv("./Train_rev1.csv", index_col=None)
data['Log1pSalary'] = np.log1p(data['SalaryNormalized']).astype('float32')

text_columns = ['Title', 'FullDescription']
categorical_columns = ['Category', 'Company', 'LocationNormalized', 'ContractType', 'ContractTime']

target_column = 'Log1pSalary'
data[categorical_columns] = data[categorical_columns].fillna('NaN') # cast nan to string

data.sample(3)

### 2. Tokenize text features

To even begin training our neural network, we'll need to preprocess the text features. Since this is not an NLP course, we're gonna use simple built-in NLTK tokenization.

In [0]:
# !pip install nltk

In [0]:
print("Before tokenization")
print(data['Title'][:10])

Here we tokenize text columns of our data:

In [0]:
import nltk
tokenizer = nltk.tokenize.WordPunctTokenizer()

for col in text_columns:
    data[col] = data[col].apply(lambda l: ' '.join(tokenizer.tokenize(str(l).lower())))

Now we can assume that our text is a space-separated list of tokens:

In [0]:
print("After tokenization")
print(data['Title'][:10])

### 3. Wipe out rarely occurring tokens

Not all words are equally useful. Some of them are typos or rare words that are only present a few times. Let's see how many times each word is present in the data so that we can build a "white list" of known words.

In [0]:
from collections import Counter
token_counts = Counter()

# count how many times each token occurs in the text columns
for col in text_columns:
    data[col].apply(lambda x: token_counts.update(x.split()))

In [0]:
print("Total unique tokens :", len(token_counts))
print("Most common:", token_counts.most_common(n=5))
print("Least common:", token_counts.most_common()[-3:])

assert token_counts.most_common(1)[0][1] in  range(2600000, 2700000)
assert len(token_counts) in range(200000, 210000)
print("Correct!")

Let's see how many words there are for each count:

In [0]:
plt.hist(list(token_counts.values()), range=[0, 10 ** 4], bins=50, log=True)
plt.xlabel("Word counts")

**Fun fact**: such distributions are very common in the real world and there is a term for it – [Zipf's law](https://en.wikipedia.org/wiki/Zipf%27s_law) (here's a nice [Vsauce video](https://www.youtube.com/watch?v=fCn8zs912OE) about it).
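
A rough way to check for Zipf-like behaviour is to fit a line to frequency versus rank on a log-log scale. The sketch below uses synthetic counts that follow the law exactly (an assumption made for illustration), so the fitted slope comes out at about -1:

```python
import numpy as np

# Synthetic counts that follow Zipf's law exactly: count(rank) = 10000 / rank
ranks = np.arange(1, 1001)
counts = 10_000 / ranks

# On a log-log scale Zipf's law is a straight line with slope close to -1
slope, intercept = np.polyfit(np.log(ranks), np.log(counts), deg=1)
print(f"log-log slope: {slope:.2f}")
```

To try it on the real data, one option is `counts = sorted(token_counts.values(), reverse=True)` with `ranks = np.arange(1, len(counts) + 1)`; real text usually deviates from a perfect line at both ends.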

Get a list of all tokens that occur at least 10 times.

In [0]:
min_count = 10

# keep only tokens that occurred at least min_count times
tokens = ## your code here

### 4. Add special tokens to the vocabulary (`PAD`, `UNK`)

Add special tokens for unknown and empty words. The purpose of adding these tokens will become clearer later.
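
As a preview of what the two tokens are for, here is a small sketch on a toy vocabulary (the words and the `encode` helper are made up for illustration). `UNK` stands in for any word outside the vocabulary, and `PAD` fills short lines up to a common length:

```python
# Toy vocabulary; the words are made up for illustration
PAD, UNK = 'PAD', 'UNK'
toy_vocab = [PAD, UNK, 'senior', 'python', 'developer']
toy_index = {token: i for i, token in enumerate(toy_vocab)}

def encode(line, length):
    """Map a space-separated line to exactly `length` ids."""
    ids = [toy_index.get(tok, toy_index[UNK]) for tok in line.split()[:length]]
    return ids + [toy_index[PAD]] * (length - len(ids))

print(encode('senior python developer', 4))   # [2, 3, 4, 0]
print(encode('junior developer', 4))          # [1, 4, 0, 0]
```

The word `junior` is not in the toy vocabulary, so it maps to the `UNK` id, and both lines end up with the same length thanks to `PAD`.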

In [0]:
PAD, UNK = 'PAD', 'UNK'
tokens = [PAD, UNK] + tokens

In [0]:
print("Total number of tokens:", len(tokens))

assert type(tokens) == list
assert len(tokens) in range(32000, 35000)
assert 'me' in tokens
assert UNK in tokens
print("Correct!")

### 5. Build an inverse token index

Build an inverse token index: a dictionary from token (string) to its index (int) in the `tokens` list.

In [0]:
token_to_id = ## your code here

In [0]:
assert isinstance(token_to_id, dict)
assert len(token_to_id) == len(tokens)
for token in tokens:
    assert tokens[token_to_id[token]] == token

print("Correct!")

### 6. Implement function for converting tokens to tensors

Let's use the vocabulary you've built to map text lines into torch-digestible matrices. The sentences can be of different lengths, so we pad them with `PAD` token to the length of the longest sentence. We do it to be able to process batches of sentences, because tensors' rows **must** be of the same size.

In [0]:
PAD_IX, UNK_IX = token_to_id[PAD], token_to_id[UNK]

def as_matrix(sequences, max_len=None):
    """Convert a list of tokens into a matrix with padding"""
    
    if isinstance(sequences[0], str):
        sequences = list(map(str.split, sequences))
        
    max_len = min(max(map(len, sequences)), max_len or float('inf'))
    
    matrix = np.full((len(sequences), max_len), np.int32(PAD_IX))
    for i, seq in enumerate(sequences):
        row_ix = [token_to_id.get(word, UNK_IX) for word in seq[:max_len]]
        matrix[i, :len(row_ix)] = row_ix
    
    return matrix

In [0]:
print("Lines:")
print('\n'.join(data['Title'][:3].values))

print()

print("Matrix:")
print(as_matrix(data['Title'][:3]))

### 7. Preprocess categorical features

Now let's encode the categorical data we have. As usual, we'll use one-hot encoding for simplicity.
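
Here is a minimal sketch of how `DictVectorizer` one-hot encodes string-valued features, using two made-up rows rather than the real dataset:

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer

# Two made-up rows with string-valued categorical features
rows = [
    {'ContractType': 'full_time', 'Category': 'IT Jobs'},
    {'ContractType': 'part_time', 'Category': 'IT Jobs'},
]

vectorizer = DictVectorizer(dtype=np.float32, sparse=False)
one_hot = vectorizer.fit_transform(rows)

# Each (column, value) pair becomes its own 0/1 feature
print(sorted(vectorizer.vocabulary_))
print(one_hot)
```

The number of output columns equals the number of distinct (column, value) pairs seen during `fit`, which is why limiting `Company` to the most frequent values keeps the matrix small.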

In [0]:
from sklearn.feature_extraction import DictVectorizer

# we only consider top-1k most frequent companies to minimize memory usage
top_companies, top_counts = zip(*Counter(data['Company']).most_common(1000))
recognized_companies = set(top_companies)
data['Company'] = data['Company'].apply(lambda comp: comp if comp in recognized_companies else 'Other')

categorical_vectorizer = DictVectorizer(dtype=np.float32, sparse=False)
categorical_vectorizer.fit(data[categorical_columns].apply(dict, axis=1))

### 8. Split data in train/test 

Once we've learned to tokenize the data, let's design a machine learning experiment. As before, we won't focus too much on validation, opting for a simple train-test split.

**To be completely rigorous,** we've committed a small crime here: we used the whole dataset for tokenization and vocabulary building. A stricter way would be to do that part on the training set only. You may want to do that and measure the magnitude of the changes.

In [0]:
from sklearn.model_selection import train_test_split

data_train, data_val = train_test_split(data, test_size=0.1, random_state=42)

print("Train size =", len(data_train))
print("Validation size =", len(data_val))

## Building models

In [0]:
import torch
from torch import nn
import torch.nn.functional as F

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

Let's first write a helper function for sampling batches:

In [0]:
def sample_batch(data, batch_size, device='cpu', max_len=None, replace=True):
    """Creates a pytorch-friendly dict from the batch data"""
    
    if batch_size is not None:
        data = data.sample(batch_size, replace=replace)
    
    batch = {}
    for col in text_columns:
        batch[col] = torch.tensor(as_matrix(data[col].values, max_len), dtype=torch.long).to(device)
    
    batch['Categorical'] = torch.tensor(
        categorical_vectorizer.transform(data[categorical_columns].apply(dict, axis=1)), dtype=torch.float32
    ).to(device)
    
    if target_column in data.columns:
        batch[target_column] = torch.tensor(data[target_column].values, dtype=torch.float32).to(device)
    
    return batch

In [0]:
sample_batch(data_train, 2, max_len=5)

Our model consists of three branches:
* Title encoder
* Description encoder
* Categorical features encoder

We will then feed all 3 branches into one common network that predicts salary.

![scheme](https://raw.githubusercontent.com/karfly/learning-deep-learning/master/07_emb/static/salary-prediction-architecture.png)

By default, both text vectorizers will use 1d convolutions, followed by global pooling over time.
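
To see what "1d convolution followed by global pooling over time" does to tensor shapes, here is a small sketch with random token ids. All sizes are arbitrary illustration values, not the ones used in the assignment:

```python
import torch
from torch import nn

batch_size, seq_len, vocab_size, emb_dim, channels = 2, 5, 10, 4, 6
token_ids = torch.randint(0, vocab_size, (batch_size, seq_len))

emb = nn.Embedding(vocab_size, emb_dim)
conv = nn.Conv1d(emb_dim, channels, kernel_size=3, padding=1)

x = emb(token_ids)           # [batch, time, emb_dim]
x = x.transpose(1, 2)        # [batch, emb_dim, time], the order Conv1d expects
x = conv(x)                  # [batch, channels, time]; padding=1 keeps the length
pooled = x.max(dim=-1)[0]    # [batch, channels]; the time axis is gone

print(pooled.shape)          # torch.Size([2, 6])
```

Because the pooling removes the time axis, the output size does not depend on the sequence length, so batches with different `max_len` can go through the same network.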

### GlobalMaxPooling

In [0]:
class GlobalMaxPooling(nn.Module):
    def __init__(self, dim=-1):
        super().__init__()
        self.dim = dim
        
    def forward(self, x):
        return x.max(dim=self.dim)[0]

### TitleEncoder

In [0]:
class TitleEncoder(nn.Module):
    def __init__(self, n_tokens=len(tokens), out_size=64):
        """ 
        A simple sequential encoder for titles.
        x -> emb -> conv -> global_max -> relu -> dense
        """
        super().__init__()
        
        # `padding_idx` zero-initializes the PAD embedding and excludes it from gradient updates
        self.emb = nn.Embedding(n_tokens, 64, padding_idx=PAD_IX)  
        self.conv = nn.Conv1d(64, out_size, kernel_size=3, padding=1)
        self.pool = GlobalMaxPooling()
        self.dense = nn.Linear(out_size, out_size)

    def forward(self, text_ix):
        """
        :param text_ix: int64 tensor of shape [batch_size, max_len]
        :returns: float32 tensor of shape [batch_size, out_size]
        """
        x = self.emb(text_ix)

        # we transpose from [batch, time, units] to [batch, units, time] to fit Conv1d dim order
        x = torch.transpose(x, 1, 2)
        
        # apply the layers as defined above (don't forget about nonlinearity)
        
        ## your code here
        
        return x

In [0]:
title_encoder = TitleEncoder(out_size=64).to(device)

batch = sample_batch(data_train, 4, device=device)
prediction = title_encoder(batch['Title'])

assert tuple(prediction.shape) == (batch['Title'].shape[0], 64)

del title_encoder
print("Seems fine")

### DescriptionEncoder

It has pretty much the same architecture as `TitleEncoder`, but intuitively the `description` of a job is a more complex text than its `title`. So let's make it deeper (more conv and dense layers) and wider (bigger embedding space and more intermediate channels).

In [0]:
class DescriptionEncoder(nn.Module):
    def __init__(self, n_tokens=len(tokens), out_size=64):
        """ 
        A simple sequential encoder for descriptions.
        x -> emb -> conv -> relu -> conv -> global_max -> relu -> dense -> relu -> dense
        """
        super().__init__()
        
        ## your code here

    def forward(self, text_ix):
        """
        :param text_ix: int64 tensor of shape [batch_size, max_len]
        :returns: float32 tensor of shape [batch_size, out_size]
        """
        
        ## your code here
        
        return x

In [0]:
description_encoder = DescriptionEncoder(out_size=64).to(device)

batch = sample_batch(data_train, 4, device=device)
prediction = description_encoder(batch['FullDescription'])

assert tuple(prediction.shape) == (batch['FullDescription'].shape[0], 64)

del description_encoder
print("Seems fine too")

### CategoricalEncoder

Let's build a simple dense network for encoding categorical features:

In [0]:
class CategoricalEncoder(nn.Module):
    def __init__(self, n_categorical_features=len(categorical_vectorizer.vocabulary_), out_size=64):
        """ 
        A simple dense encoder for categorical features.
        x -> dense -> relu -> dense
        """
        super().__init__()  
        
        self.dense_1 = nn.Linear(n_categorical_features, 64)
        self.dense_2 = nn.Linear(64, out_size)

    def forward(self, x):
        """
        :param x: float32 tensor of shape [batch_size, n_categorical_features]
        :returns: float32 tensor of shape [batch_size, out_size]
        """
        x = self.dense_1(x)
        x = F.relu(x)
        x = self.dense_2(x)
        
        return x

In [0]:
categorical_encoder = CategoricalEncoder(out_size=64).to(device)

batch = sample_batch(data_train, 4, device=device)
prediction = categorical_encoder(batch['Categorical'])

assert tuple(prediction.shape) == (batch['Categorical'].shape[0], 64)

del categorical_encoder
print("And this is fine... wow")

### One network ~~to rule them all~~

The network takes `title_ix`, `description_ix`, and `categorical_features` as input, encodes each with the corresponding encoder, concatenates the encoded vectors, and feeds the result to a simple dense network that predicts `Log1pSalary` (one scalar per sample).

In [0]:
class SalaryPredictionNetwork(nn.Module):
    """
    This class does all the steps from (title, desc, categorical) features -> predicted target.
    It unites the title & description encoders you defined above, as well as some layers for the head and the categorical branch.
    """
    
    def __init__(self, n_tokens=len(tokens), n_categorical_features=len(categorical_vectorizer.vocabulary_)):
        super().__init__()
        
        self.title_encoder = TitleEncoder(out_size=64)
        self.description_encoder = DescriptionEncoder(out_size=64)
        self.categorical_encoder = CategoricalEncoder(n_categorical_features=n_categorical_features, out_size=128)
        
        # define "output" layers that combines encoded features and predicts the answer
        self.dense_1 = nn.Linear(64 + 64 + 128, 256)
        self.dense_2 = nn.Linear(256, 64)
        self.dense_3 = nn.Linear(64, 1)
        
        
    def forward(self, title_ix, description_ix, categorical_features):
        """
        :param title_ix: int64 tensor of shape [batch, title_len], job titles encoded by as_matrix
        :param description_ix: int64 tensor of shape [batch, desc_len], job descriptions encoded by as_matrix
        :param categorical_features: float32 tensor of shape [batch, n_cat_features]
        :returns: float32 1d tensor [batch], predicted log1p-salary
        """
        
        # process each data source with its respective encoder
        title_x = self.title_encoder(title_ix)
        description_x = self.description_encoder(description_ix)
        categorical_x = self.categorical_encoder(categorical_features)
        
        # concatenate all vectors together...
        joint_x = torch.cat([title_x, description_x, categorical_x], dim=1)
        
        # apply layers for processing the concatenated encoded vectors
        
        ## your code here
        
        # Note 1: do not forget to select the first column, [:, 0], to get 1d outputs
        # Note 2: please do not use output nonlinearities.
        return joint_x[:, 0]

In [0]:
model = SalaryPredictionNetwork().to(device)

batch = sample_batch(data_train, 32, device=device)
prediction = model(batch['Title'], batch['FullDescription'], batch['Categorical'])
assert len(prediction.shape) == 1 and prediction.shape[0] == batch['Title'].shape[0]

# check grads
loss = prediction.norm()
loss.backward()

for name, p in model.named_parameters():
    grad = p.grad
    assert grad is not None and not (grad == 0).all(), \
    f"Some model parameters received zero grads ({name}). Double-check that your model uses all its layers."

print("The whole model seems fine. You're awesome!")

### Losses

For loss we'll use `MSE`. As a metric we'll use `MAE`.
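
The loss and the metric live on different scales: MSE is computed on log1p-salaries, while MAE is reported in actual salary units. The sketch below uses made-up numbers to show both, and also checks that using `exp` instead of `expm1` inside the MAE gives the same result, because the "-1" terms cancel in the difference:

```python
import torch

# Made-up salaries for three vacancies, converted to log1p space
reference = torch.log1p(torch.tensor([20_000., 35_000., 50_000.], dtype=torch.float64))
prediction = torch.log1p(torch.tensor([22_000., 30_000., 50_000.], dtype=torch.float64))

# Loss: mean squared error in log space
mse_log = ((reference - prediction) ** 2).mean()

# Metric: mean absolute error in salary units
mae_salary = (torch.expm1(reference) - torch.expm1(prediction)).abs().mean()

# exp(a) - exp(b) equals expm1(a) - expm1(b): the "-1" terms cancel
mae_via_exp = (torch.exp(reference) - torch.exp(prediction)).abs().mean()

print(mse_log.item(), mae_salary.item(), mae_via_exp.item())
```

Here the absolute errors are 2000, 5000 and 0, so the MAE is about 2333 salary units.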

In [0]:
def compute_loss(reference, prediction):
    """
    Computes objective for minimization.
    By default we minimize MSE, but you are encouraged to try mixing MSE, MAE, Huber loss, etc.
    """
    loss = ## your MSE code here
    return loss

def compute_mae(reference, prediction):
    """Computes MAE on actual salary, assuming your model outputs log1p(salary)"""
    return torch.abs(torch.exp(reference) - torch.exp(prediction)).mean()

## Train loop

In [0]:
def iterate_minibatches(data, batch_size, device='cpu', max_len=None, max_batches=None, shuffle=True, verbose=False):
    indices = np.arange(len(data))
    
    if shuffle:
        indices = np.random.permutation(indices)
    if max_batches is not None:
        indices = indices[:batch_size * max_batches]
    
    irange = trange if verbose else range
    for start in irange(0, len(indices), batch_size):
        # pass batch_size=None: the rows are already selected, so sample_batch must not resample them
        yield sample_batch(data.iloc[indices[start : start + batch_size]], None, device=device, max_len=max_len)

Training parameters:

In [0]:
num_epochs = 100
max_len = 100
batch_size = 32
batches_per_epoch = 100
loss_visualization_window = 500

Setup model:

In [0]:
model = SalaryPredictionNetwork().to(device)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

Finally, let's train our model:

In [0]:
train_losses = []
val_losses = []

for epoch_i in range(num_epochs):    
    # train
    train_loss = train_mae = train_batches = 0    
    model.train()
    
    for batch in iterate_minibatches(data_train, batch_size, device=device, max_batches=batches_per_epoch):
        prediction = ## make a forward pass for the model
        reference = batch[target_column]

        loss = compute_loss(reference, prediction)
        
        ## your optimization step here (remember the mantra? zero-backward-step, zero-backward-step, ...)
        
        train_loss += loss.item()
        train_losses.append(loss.item())
        
        train_mae += compute_mae(reference, prediction).item()
        train_batches += 1
    
    # val
    val_loss = val_mae = val_batches = 0
    model.eval()
    
    for batch in iterate_minibatches(data_val, batch_size, shuffle=False, device=device):
        prediction = ## make a forward pass for the model
        reference = batch[target_column]
        
        loss = compute_loss(reference, prediction)

        val_loss += loss.item()
        val_losses.append(loss.item())
        
        val_mae += compute_mae(reference, prediction).item()
        val_batches += 1
    
    # visualization
    clear_output(True)
    plt.plot(
        np.arange(len(train_losses))[-loss_visualization_window:],
        train_losses[-loss_visualization_window:]
    )
    plt.show()
    
    print(f"Epoch {epoch_i}/{num_epochs}")
    print("train: loss={:.5f}, MAE={:.1f}".format(train_loss / train_batches, train_mae / train_batches))
    print("val: loss={:.5f}, MAE={:.1f}".format(val_loss / val_batches, val_mae / val_batches))
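
The zero-backward-step pattern mentioned in the training loop can be sketched on a toy regression problem. Everything here (data, `toy_model`, `toy_opt`, learning rate) is made up for illustration and is separate from the salary model:

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy regression problem: y = 3x + 1 plus a little noise
x = torch.randn(256, 1)
y = 3 * x + 1 + 0.01 * torch.randn(256, 1)

toy_model = nn.Linear(1, 1)
toy_opt = torch.optim.Adam(toy_model.parameters(), lr=1e-1)

initial_loss = ((toy_model(x) - y) ** 2).mean().item()
for step in range(200):
    loss = ((toy_model(x) - y) ** 2).mean()
    toy_opt.zero_grad()   # zero: clear gradients left over from the previous step
    loss.backward()       # backward: compute fresh gradients
    toy_opt.step()        # step: update the parameters

final_loss = loss.item()
print(initial_loss, final_loss)
```

Skipping `zero_grad()` would make gradients accumulate across steps, which is a common source of silently bad training.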

After training let's make a final validation:

In [0]:
model.eval()

val_loss = val_mae = val_batches = 0
for batch in iterate_minibatches(data_val, batch_size, shuffle=False, device=device):
    prediction = ## make a forward pass for the model
    reference = batch[target_column]

    loss = compute_loss(reference, prediction)

    val_loss += loss.item()
    val_mae += compute_mae(reference, prediction).item()
    val_batches += 1

print("Final validation:")
print("val: loss={:.5f}, MAE={:.1f}".format(val_loss / val_batches, val_mae / val_batches))

You should get **MAE ~= 7000**.

## Bonus: explaining model's predictions

It's usually a good idea to understand what your model does before you let it make actual decisions. It's simple for linear models: just see which words learned positive or negative weights. However, it's much harder for neural networks that learn complex nonlinear dependencies.

There are, however, some ways to look inside the black box:
* Seeing how the model responds to input perturbations
* Finding inputs that maximize/minimize activation of some chosen neurons (read more [on distill.pub](https://distill.pub/2018/building-blocks/))
* Building local linear approximations to your neural network: [article](https://arxiv.org/abs/1602.04938), [eli5 library](https://github.com/TeamHG-Memex/eli5/tree/master/eli5/formatters)

Today we're gonna try the first method just because it's the simplest one.

__Your task__ is to measure how the model prediction changes if you replace certain tokens with UNKs. The core idea is that if dropping a word from the text causes the model to predict a lower log-salary, then this word probably contributes positively to the salary (and vice versa).
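
The idea can be sketched without any neural network. In the toy example below, a bag-of-words scorer with made-up weights stands in for the trained model (`toy_predict` and `toy_explain` are hypothetical helpers); each token is replaced by `UNK` in turn and the drop in the prediction is recorded:

```python
UNK = 'UNK'

# Hypothetical stand-in for a trained model: a bag-of-words scorer with made-up weights
toy_weights = {'senior': 0.4, 'intern': -0.5, 'developer': 0.1}

def toy_predict(text):
    return sum(toy_weights.get(tok, 0.0) for tok in text.split())

def toy_explain(text):
    toks = text.split()
    baseline = toy_predict(text)
    effects = []
    for drop_i, tok in enumerate(toks):
        perturbed = ' '.join(UNK if i == drop_i else t for i, t in enumerate(toks))
        # positive effect: replacing the token lowers the prediction
        effects.append((tok, baseline - toy_predict(perturbed)))
    return effects

print(toy_explain('senior developer'))
print(toy_explain('intern developer'))
```

For this linear stand-in the recovered effects equal the word weights; for a real network they are only a local estimate that depends on the rest of the text.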

In [0]:
def explain(model, sample, col_name='Title'):
    """ Computes the effect each word had on model predictions """
    sample = dict(sample)
    
    sample_col_tokens = [tokens[token_to_id.get(tok, UNK_IX)] for tok in sample[col_name].split()]
    data_drop_one_token = pd.DataFrame([sample] * (len(sample_col_tokens) + 1))

    for drop_i in range(len(sample_col_tokens)):
        data_drop_one_token.loc[drop_i, col_name] = ' '.join(
            UNK if i == drop_i else tok for i, tok in enumerate(sample_col_tokens)
        ) 
    batch = sample_batch(data_drop_one_token, None, device=device)

    *predictions_drop_one_token, baseline_pred = model(
        batch['Title'], batch['FullDescription'], batch['Categorical']
    ).detach().to('cpu').numpy()
    diffs = baseline_pred - predictions_drop_one_token
    
    return list(zip(sample_col_tokens, diffs))

Let's look at the token weights for a random sample:

In [0]:
sample = data.loc[np.random.randint(len(data))]
print("Input:", sample)

tokens_and_weights = explain(model, sample, 'Title')
print(tokens_and_weights)

In [0]:
from IPython.display import HTML, display_html

def draw_html(
    tokens_and_weights, cmap=plt.get_cmap("bwr"), display=True,
    token_template="""<span style="background-color: {color_hex}">{token}</span>""",
    font_style="font-size:14px;"
):
    
    def get_color_hex(weight):
        rgba = cmap(1.0 / (1 + np.exp(weight)), bytes=True)
        return '#{:02X}{:02X}{:02X}'.format(*rgba[:3])
    
    tokens_html = [
        token_template.format(token=token, color_hex=get_color_hex(weight))
        for token, weight in tokens_and_weights
    ]
    
    
    raw_html = """<p style="{}">{}</p>""".format(font_style, ' '.join(tokens_html))
    if display:
        display_html(HTML(raw_html))
        
    return raw_html

Here `blue` marks tokens that push the predicted salary up, `red` marks tokens that push it down, and near-white tokens have little effect:

In [0]:
i = np.random.randint(len(data))
sample = data.loc[i]
print("Index:", i)

# predict salary on sample
sample_dict = dict(sample)
sample_frame = pd.DataFrame([sample_dict])
batch = sample_batch(sample_frame, None, device=device)

prediction = model(batch['Title'], batch['FullDescription'], batch['Categorical'])
predicted_salary = (torch.exp(prediction) - 1).item()
print("Salary:", predicted_salary)

tokens_and_weights = explain(model, sample, 'Title')
draw_html([(tok, weight * 5) for tok, weight in tokens_and_weights], font_style='font-size:20px;');

tokens_and_weights = explain(model, sample, 'FullDescription')
draw_html([(tok, weight * 10) for tok, weight in tokens_and_weights]);

## Beast mode on!

Here are some ideas on how to improve your validation MAE:

#### A) CNN architecture

All the tricks you know about dense and convolutional neural networks apply here as well.
* Dropout
* Batch Norm. This time it's `nn.BatchNorm1d`
* Parallel convolution layers. The idea is that you apply several nn.Conv1d to the same embeddings and concatenate output channels
* More layers, more neurons, you know...
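
The parallel-convolutions idea from the list above could look roughly like this. `ParallelConvs` is a hypothetical helper, and the kernel sizes and channel counts are arbitrary choices for illustration:

```python
import torch
from torch import nn

class ParallelConvs(nn.Module):
    """Hypothetical helper: several Conv1d branches over the same input, concatenated by channel."""

    def __init__(self, in_channels=64, channels_per_branch=32, kernel_sizes=(1, 3, 5)):
        super().__init__()
        # odd kernel sizes with padding = k // 2 keep the time dimension unchanged
        self.branches = nn.ModuleList([
            nn.Conv1d(in_channels, channels_per_branch, kernel_size=k, padding=k // 2)
            for k in kernel_sizes
        ])

    def forward(self, x):
        # x: [batch, in_channels, time] -> [batch, channels_per_branch * n_branches, time]
        return torch.cat([branch(x) for branch in self.branches], dim=1)

features = torch.randn(4, 64, 20)
out = ParallelConvs()(features)
print(out.shape)  # torch.Size([4, 96, 20])
```

Each branch sees a different context width, so the concatenated channels mix unigram-, trigram- and 5-gram-like features.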


#### B) Play with pooling

There's more than one way to pool over time:
* Max over time is our `GlobalMaxPooling`
* Average over time (excluding PAD)
* Softmax-pooling:
$$ out_{i} = \sum_t {h_{i,t} \cdot {{e ^ {h_{i, t}}} \over \sum_\tau e ^ {h_{i, \tau}} } }$$

* Attentive pooling:
$$ out_{i} = \sum_t {h_{i,t} \cdot Attn(h_t)}$$

where $$ Attn(h_t) = {{e ^ {NN_{attn}(h_t)}} \over \sum_\tau e ^ {NN_{attn}(h_\tau)}}  $$
and $NN_{attn}$ is a small neural network.


The best score is usually achieved by concatenating several different poolings, including several attentive poolings with different $NN_{attn}$.
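
Here is one possible sketch of these poolings for inputs of shape `[batch, units, time]`. The function and class names are made up, and `NN_attn` is assumed to be a single linear layer for simplicity:

```python
import torch
from torch import nn

def masked_mean_pool(h, mask):
    """h: [batch, units, time]; mask: [batch, time], 1.0 for real tokens, 0.0 for PAD."""
    mask = mask.unsqueeze(1)                           # [batch, 1, time]
    return (h * mask).sum(dim=-1) / mask.sum(dim=-1).clamp(min=1.0)

def softmax_pool(h):
    """Weights each time step by a softmax over time, separately for every unit."""
    return (h * torch.softmax(h, dim=-1)).sum(dim=-1)

class AttentivePooling(nn.Module):
    """Hypothetical module: NN_attn is a single linear layer scoring each time step."""

    def __init__(self, units):
        super().__init__()
        self.score = nn.Linear(units, 1)

    def forward(self, h):
        logits = self.score(h.transpose(1, 2))         # [batch, time, 1]
        attn = torch.softmax(logits, dim=1)            # normalize over time
        return (h * attn.transpose(1, 2)).sum(dim=-1)  # [batch, units]

h = torch.randn(2, 8, 5)
mask = torch.tensor([[1., 1., 1., 0., 0.], [1., 1., 1., 1., 1.]])
pooled = torch.cat([masked_mean_pool(h, mask), softmax_pool(h), AttentivePooling(8)(h)], dim=1)
print(pooled.shape)  # torch.Size([2, 24])
```

Note that only `masked_mean_pool` ignores `PAD` positions in this sketch; the softmax and attentive variants would need the mask applied to their logits to do the same.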

#### C) Fun with embeddings

It's not always a good idea to train embeddings from scratch. Here are a few tricks:

* Use a pre-trained word2vec from [here](http://ahogrammer.com/2017/01/20/the-list-of-pretrained-word-embeddings/) or [here](http://mccormickml.com/2016/04/12/googles-pretrained-word2vec-model-in-python/).
* Start with pre-trained embeddings, then fine-tune them with gradient descent
* Use the same embedding matrix in the title and description encoders
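
A minimal sketch of the last two tricks, assuming the pre-trained vectors have already been loaded into a tensor aligned with the vocabulary (random values are used here as a stand-in, and `TinyEncoder` is a hypothetical class):

```python
import torch
from torch import nn

vocab_size, emb_dim = 10, 4

# Stand-in for vectors loaded from a pre-trained model (random here, for illustration only)
pretrained_vectors = torch.randn(vocab_size, emb_dim)

# freeze=False keeps the vectors trainable, so they get fine-tuned with the rest of the model
shared_emb = nn.Embedding.from_pretrained(pretrained_vectors, freeze=False)

class TinyEncoder(nn.Module):
    """Hypothetical encoder that receives its embedding layer from outside."""

    def __init__(self, emb):
        super().__init__()
        self.emb = emb

    def forward(self, ix):
        return self.emb(ix).max(dim=1)[0]   # max over time -> [batch, emb_dim]

# Both encoders hold the very same parameter tensor
title_enc, desc_enc = TinyEncoder(shared_emb), TinyEncoder(shared_emb)
print(title_enc.emb.weight is desc_enc.emb.weight)
```

Sharing the layer means gradients from both the title and the description branch update one embedding matrix, which also reduces the number of parameters.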

#### D) Going recurrent

We've already learned that recurrent networks can do cool stuff in sequence modelling. Turns out, they're not useless for classification and regression either. With some tricks, of course.

* Like convolutional layers, LSTM should be pooled into a fixed-size vector with some of the poolings.
  * Please bear in mind that while convolution uses [batch, units, time] dim order, 
    recurrent units are built for [batch, time, unit]. You may need to `torch.transpose`.

* Since you know all the text in advance, use bidirectional RNN
  * Run one LSTM from left to right
  * Run another in parallel from right to left 
  * Concatenate their output sequences along unit axis (dim=-1)

* It might be a good idea to mix convolutions and recurrent layers differently for title and description
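
The recurrent variant could be sketched as follows, with a random tensor standing in for the embedded text and arbitrary sizes chosen for illustration:

```python
import torch
from torch import nn

batch_size, seq_len, emb_dim, hidden = 3, 7, 16, 32

# [batch, time, units], the layout produced by nn.Embedding
embedded = torch.randn(batch_size, seq_len, emb_dim)

# batch_first=True makes the LSTM accept [batch, time, units]; bidirectional=True runs
# a left-to-right and a right-to-left LSTM and concatenates their outputs along the last axis
lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
outputs, _ = lstm(embedded)                        # [batch, time, 2 * hidden]

# Transpose to [batch, units, time] so a pooling over the last axis applies, then pool over time
pooled = outputs.transpose(1, 2).max(dim=-1)[0]    # [batch, 2 * hidden]
print(outputs.shape, pooled.shape)
```

With `bidirectional=True` the two directions are handled by one module, so there is no need to create and concatenate two separate LSTMs by hand.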


#### E) Optimizing seriously

* You don't necessarily need 100 epochs. Use early stopping.
  * In short, train until you notice that the validation loss has stopped improving
  * Maintain the best-on-validation snapshot via `model.state_dict()`
  * Plotting learning curves is usually a good idea
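
Early stopping with a best-on-validation snapshot can be sketched like this. The validation losses are made up, `patience` is a hypothetical setting, and `toy_model` stands in for the real network:

```python
import copy
import torch
from torch import nn

toy_model = nn.Linear(4, 1)

# Made-up validation losses per epoch, standing in for a real validation pass
fake_val_losses = [0.90, 0.70, 0.65, 0.66, 0.68, 0.67, 0.50]

patience = 3   # hypothetical setting: epochs to wait without improvement
best_loss, best_state, best_epoch = float('inf'), None, -1

for epoch, val_loss in enumerate(fake_val_losses):
    # ... one epoch of training would go here ...
    if val_loss < best_loss:
        best_loss, best_epoch = val_loss, epoch
        # deepcopy: state_dict() returns references to the live tensors
        best_state = copy.deepcopy(toy_model.state_dict())
    elif epoch - best_epoch >= patience:
        print(f"Stopping at epoch {epoch}, best was epoch {best_epoch}")
        break

toy_model.load_state_dict(best_state)   # restore the best-on-validation snapshot
```

With these numbers training stops at epoch 5 and never sees the lower loss at epoch 6, which illustrates the trade-off: a small `patience` saves time but can stop too early.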