# Multilayer Perceptron (MLP)

## Course outline:

1. Recall of linear classifier

2. MLP with scikit-learn

3. MLP with pytorch

4. Test several MLP architectures

5. Limits of MLP

In [None]:
%matplotlib inline

Sources:

Deep learning

- [cs231n.stanford.edu](http://cs231n.stanford.edu/)


Pytorch

- [WWW tutorials](https://pytorch.org/tutorials/)
- [github tutorials](https://github.com/pytorch/tutorials)
- [github examples](https://github.com/pytorch/examples)

MNIST and pytorch:

- [MNIST nextjournal.com/gkoehler/pytorch-mnist](https://nextjournal.com/gkoehler/pytorch-mnist)
- [MNIST github/pytorch/examples](https://github.com/pytorch/examples/tree/master/mnist)
- [MNIST kaggle](https://www.kaggle.com/sdelecourt/cnn-with-pytorch-for-mnist)



In [None]:
import os
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms

Set working directory

In [None]:
from pathlib import Path
WD = os.path.join(Path.home(), "data", "pystatml", "dl_mnist_pytorch")
os.makedirs(WD, exist_ok=True)
os.chdir(WD)
print("Working dir is:", os.getcwd())
os.makedirs("data", exist_ok=True)
os.makedirs("models", exist_ok=True)

Hyperparameters

In [None]:
n_epochs = 5
batch_size_train = 64
batch_size_test = 1000
learning_rate = 0.01
momentum = 0.5
log_interval = 10
random_seed = 1
no_cuda = True

use_cuda = not no_cuda and torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

## Dataset: MNIST Handwritten Digit Recognition

In [None]:
def load_mnist(batch_size_train, batch_size_test):
    
    train_loader = torch.utils.data.DataLoader(
        datasets.MNIST('data', train=True, download=True,
                       transform=transforms.Compose([
                           transforms.ToTensor(),
                           transforms.Normalize((0.1307,), (0.3081,)) # Mean and Std of the MNIST dataset
                       ])),
        batch_size=batch_size_train, shuffle=True)
    
    test_loader = torch.utils.data.DataLoader(
        datasets.MNIST('data', train=False, transform=transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize((0.1307,), (0.3081,)) # Mean and Std of the MNIST dataset
        ])),
        batch_size=batch_size_test, shuffle=True)
    return train_loader, test_loader

train_loader, test_loader = load_mnist(batch_size_train, batch_size_test)
data_shape = train_loader.dataset.data.shape[1:]
D_in = np.prod(data_shape)
D_out = len(train_loader.dataset.targets.unique())

In [None]:
print("Train dataset:", train_loader.dataset.data.shape, train_loader.dataset.targets.shape)
print("Test dataset:", test_loader.dataset.data.shape, test_loader.dataset.targets.shape)

Now let's take a look at some mini-batch examples.


In [None]:
batch_idx, (example_data, example_targets) = next(enumerate(train_loader))
print("Train batch:", example_data.shape, example_targets.shape)
batch_idx, (example_data, example_targets) = next(enumerate(test_loader))
print("Test batch:", example_data.shape, example_targets.shape)

So one test data batch is a tensor of shape `torch.Size([1000, 1, 28, 28])`. This means we have 1000 examples of 28x28 pixels in grayscale
(i.e. a single channel instead of three RGB channels, hence the 1). We can plot some of them using matplotlib.



In [None]:
def show_data_label_prediction(data, y_true, y_pred=None, shape=(2, 3)):
    y_pred = [None] * len(y_true) if y_pred is None else y_pred
    fig = plt.figure()
    for i in range(np.prod(shape)):
        plt.subplot(*shape, i+1)
        plt.tight_layout()
        plt.imshow(data[i][0], cmap='gray', interpolation='none')
        plt.title("True: {} Pred: {}".format(y_true[i], y_pred[i]))
        plt.xticks([])
        plt.yticks([])

show_data_label_prediction(data=example_data, y_true=example_targets, y_pred=None, shape=(2, 3))
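
An MLP works on flat vectors, so each 1x28x28 image is reshaped into 28 * 28 = 784 values before the first layer. A minimal sketch of that reshaping, using a random tensor as a stand-in for a real test batch (the variable names `batch` and `flat` are illustrative only):

```python
import torch

# Random stand-in for one test batch: (N, C, H, W) = (1000, 1, 28, 28)
batch = torch.randn(1000, 1, 28, 28)

# Keep the batch axis and flatten channel, height and width into one axis
flat = batch.view(batch.size(0), -1)

print(tuple(batch.shape), "->", tuple(flat.shape))
```

The second dimension of `flat` is the value stored in `D_in` above.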

## Recall of linear classifier

### Binary logistic regression

<img src="figures/logistic.png" width="300">

1 neuron as output layer
$$
f(x) = \sigma(x^{T} w)
$$
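
As a small illustration of the formula above, the following sketch evaluates a single sigmoid neuron with NumPy. The input and weight values are made up for the example and are not taken from the course data.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative values only
x = np.array([1.0, 2.0, -1.0])    # input features
w = np.array([0.5, -0.25, 0.0])   # weights of the single output neuron

p = sigmoid(x @ w)                # f(x) = sigma(x^T w)
print(p)                          # x^T w = 0, so the probability is 0.5
```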


### Softmax Classifier (Multinomial Logistic Regression)

<img src="figures/logistic_multinominal.png" width="300">

- Input $x$: a vector of dimension $d^{(0)}$ (layer 0).
- Output $f(x)$: a vector of dimension $d^{(1)}$ (layer 1), one entry per possible label.

The model has $d^{(1)}$ neurons in the output layer:

$$
f(x) = \text{softmax}(x^{T} W + b)
$$

where $W$ is a $d^{(0)} \times d^{(1)}$ matrix of coefficients and $b$ is a $d^{(1)}$-dimensional bias vector.
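
A minimal NumPy sketch of this forward pass, assuming the MNIST sizes used in this notebook (784 inputs, 10 classes). The names `d_in` and `d_out` and the random values are illustrative, and the softmax subtracts the maximum score for numerical stability.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax along the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_in, d_out = 784, 10                        # 28*28 pixels, 10 digit classes
x = rng.normal(size=d_in)                    # one flattened input image
W = rng.normal(scale=0.01, size=(d_in, d_out))
b = np.zeros(d_out)

probs = softmax(x @ W + b)                   # f(x) = softmax(x^T W + b)
print(probs.shape, probs.sum())              # one probability per class, summing to 1
```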

MNIST classification using multinomial logistic regression

<img src="figures/logistic_multinominal_MNIST.png" width="800">

[source: Logistic regression MNIST](https://notebooks.azure.com/cntk/projects/edxdle/html/Lab2_LogisticRegression.ipynb)

Here we fit a multinomial logistic regression with L2 penalty on a subset of
the MNIST digits classification task.

[source: scikit-learn.org](https://scikit-learn.org/stable/auto_examples/linear_model/plot_sparse_logistic_regression_mnist.html)

In [None]:
X_train = train_loader.dataset.data.numpy()
#print(X_train.shape)
X_train = X_train.reshape((X_train.shape[0], -1))
y_train = train_loader.dataset.targets.numpy()

X_test = test_loader.dataset.data.numpy()
X_test = X_test.reshape((X_test.shape[0], -1))
y_test = test_loader.dataset.targets.numpy()

print(X_train.shape, y_train.shape)

In [None]:
import matplotlib.pyplot as plt
import numpy as np

#from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression
#from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.utils import check_random_state

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Turn up tolerance for faster convergence
clf = LogisticRegression(C=50., multi_class='multinomial', solver='sag', tol=0.1)
clf.fit(X_train, y_train)
#sparsity = np.mean(clf.coef_ == 0) * 100
score = clf.score(X_test, y_test)
# print('Best C % .4f' % clf.C_)
#print("Sparsity with L1 penalty: %.2f%%" % sparsity)
print("Test score with penalty: %.4f" % score)

In [None]:
coef = clf.coef_.copy()
plt.figure(figsize=(10, 5))
scale = np.abs(coef).max()
for i in range(10):
    l1_plot = plt.subplot(2, 5, i + 1)
    l1_plot.imshow(coef[i].reshape(28, 28), interpolation='nearest',
                   cmap=plt.cm.RdBu, vmin=-scale, vmax=scale)
    l1_plot.set_xticks(())
    l1_plot.set_yticks(())
    l1_plot.set_xlabel('Class %i' % i)
plt.suptitle('Classification vector for...')

plt.show()

## Model Two Layer MLP

### MLP with Scikit-learn

In [None]:
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(hidden_layer_sizes=(100, ), max_iter=n_epochs, alpha=1e-4,
                    solver='sgd', verbose=10, tol=1e-4, random_state=1,
                    learning_rate_init=learning_rate, batch_size=batch_size_train)

mlp.fit(X_train, y_train)
print("Training set score: %f" % mlp.score(X_train, y_train))
print("Test set score: %f" % mlp.score(X_test, y_test))

print("Number of weight matrices =", len(mlp.coefs_))

fig, axes = plt.subplots(4, 4)
# use global min / max to ensure all weights are shown on the same scale
vmin, vmax = mlp.coefs_[0].min(), mlp.coefs_[0].max()
for coef, ax in zip(mlp.coefs_[0].T, axes.ravel()):
    ax.matshow(coef.reshape(28, 28), cmap=plt.cm.gray, vmin=.5 * vmin,
               vmax=.5 * vmax)
    ax.set_xticks(())
    ax.set_yticks(())

plt.show()

### MLP with pytorch

In [None]:
class TwoLayerMLP(nn.Module):

    def __init__(self, d_in, d_hidden, d_out):
        super(TwoLayerMLP, self).__init__()
        self.d_in = d_in
        
        self.linear1 = nn.Linear(d_in, d_hidden)
        self.linear2 = nn.Linear(d_hidden, d_out)

    def forward(self, X):
        X = X.view(-1, self.d_in)
        X = self.linear1(X)
        return F.log_softmax(self.linear2(X), dim=1)

model = TwoLayerMLP(D_in, 50, D_out)

Explore the model and compute the number of parameters

In [None]:
for parameter in model.parameters():
    print(parameter.shape)

print("Total number of parameters =", np.sum([np.prod(parameter.shape) for parameter in model.parameters()]))
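
The total can be checked by hand: each `nn.Linear(n_in, n_out)` layer holds `n_in * n_out` weights plus `n_out` biases. A small sketch of that arithmetic (the helper name `count_mlp_parameters` is hypothetical, not part of PyTorch):

```python
def count_mlp_parameters(layer_sizes):
    """Number of weights plus biases of a fully connected network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

# TwoLayerMLP(784, 50, 10): 784*50 + 50 + 50*10 + 10
print(count_mlp_parameters([784, 50, 10]))  # 39760
```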

#### Train the Model

- First we make sure our network is in training mode with `model.train()`.

- Then we iterate over all training data once per epoch. Loading the individual batches is handled by the DataLoader.

- For each batch, we first need to manually set the gradients to zero using `optimizer.zero_grad()`, since PyTorch accumulates gradients by default.

- Forward pass: we produce the output of our network and compute a negative log-likelihood loss between the output and the ground truth label.

- Backward pass: the `backward()` call computes a new set of gradients for each of the network's parameters, which `optimizer.step()` then uses to update them.

- We'll also keep track of the progress with some printouts. In order to create a nice training curve later on we also create two lists for saving training and testing losses. On the x-axis we want to display the number of training examples the network has seen during training.

- Save model state: Neural network modules as well as optimizers have the ability to save and load their internal state using `.state_dict()`. With this we can continue training from previously saved state dicts if needed - we'd just need to call `.load_state_dict(state_dict)`.
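
The steps above can be sketched on a tiny synthetic problem. This is only an illustration under assumed sizes (64 samples, 20 features, 3 classes); the names `net`, `opt` and `losses` are placeholders and the single batch is reused at every step.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(0)
# Synthetic stand-in for one mini-batch (hypothetical sizes)
data = torch.randn(64, 20)
target = torch.randint(0, 3, (64,))

net = nn.Linear(20, 3)
opt = optim.SGD(net.parameters(), lr=0.1)

net.train()                                    # training mode
losses = []
for step in range(20):
    opt.zero_grad()                            # reset accumulated gradients
    output = F.log_softmax(net(data), dim=1)   # forward pass
    loss = F.nll_loss(output, target)          # negative log-likelihood loss
    loss.backward()                            # backward pass: compute gradients
    opt.step()                                 # update the parameters
    losses.append(loss.item())                 # track the loss

print(losses[0], "->", losses[-1])
```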

In [None]:
def train(model, train_loader, optimizer, epoch, device, log_interval=10, batch_max=np.inf, save_model=True):
    train_losses, train_counter = list(), list()
    # epoch = 1; log_interval=10; train_losses=[]; train_counter=[]

    model.train()

    # Iterate over minibatch
    for batch_idx, (data, target) in enumerate(train_loader):
        if batch_idx >= batch_max:  # stop after batch_max mini-batches
            break
        # batch_idx, (data, target) = next(enumerate(train_loader))
        # print(data.shape)
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
    
        # Forward
        output = model(data)
        loss = F.nll_loss(output, target)
    
        # Backward
        loss.backward()

        # Update params
        optimizer.step()
        
        # Track losses
        train_losses.append(loss.item())
        train_counter.append(data.shape[0]) # (batch_idx * data.shape[0]) + ((epoch-1)*len(train_loader.dataset)))

        # Log progress and save a checkpoint
        if batch_idx % log_interval == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))
            
            if save_model:
                torch.save(model.state_dict(), 'models/mod-%s.pth' % model.__class__.__name__)
                torch.save(optimizer.state_dict(), 
                           'models/mod-%s_opt-%s.pth' % (model.__class__.__name__,
                                                         optimizer.__class__.__name__))

    return model, train_losses, train_counter
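
The models in this notebook return `F.log_softmax` outputs and `train` applies `F.nll_loss` to them. This pairing gives the same value as `F.cross_entropy` applied directly to the raw scores, which the following sketch checks on random data (the tensors are illustrative only):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(8, 10)              # raw scores: 8 samples, 10 classes
target = torch.randint(0, 10, (8,))

loss_a = F.nll_loss(F.log_softmax(logits, dim=1), target)
loss_b = F.cross_entropy(logits, target)

print(torch.allclose(loss_a, loss_b))
```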

#### Evaluate/test the Model

- First we want to make sure our network is in evaluation mode `model.eval()`.

- Then we iterate over all test data once per epoch. Loading the individual batches is handled by the DataLoader.

- Using the context manager `torch.no_grad()` we can avoid storing the computations done producing the output of our network in the computation graph.

Test loop. Here we sum up the test loss and keep track of correctly classified digits to compute the accuracy of
the network.
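
As a small worked example of the accuracy computation, the sketch below uses hand-written log-probabilities for 4 samples and 3 classes. The values are hypothetical and chosen so that exactly one prediction is wrong.

```python
import torch

# Hypothetical log-probabilities: 4 samples, 3 classes
log_probs = torch.log(torch.tensor([[0.7, 0.2, 0.1],
                                    [0.1, 0.8, 0.1],
                                    [0.3, 0.3, 0.4],
                                    [0.6, 0.3, 0.1]]))
target = torch.tensor([0, 1, 2, 1])

pred = log_probs.argmax(dim=1)            # index of the max log-probability
correct = pred.eq(target).sum().item()    # number of correct predictions
accuracy = 100.0 * correct / len(target)

print(pred.tolist(), correct, accuracy)   # [0, 1, 2, 0] 3 75.0
```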


In [None]:
def test(model, test_loader, device, batch_max=np.inf):

    model.eval()

    test_loss = 0
    correct = 0
    output, pred, target = list(), list(), list()

    # Iterate over mini-batches
    with torch.no_grad():
        for batch_idx, (data, target_) in enumerate(test_loader):
            if batch_idx >= batch_max:  # stop after batch_max mini-batches
                break
            # batch_idx, (data, target) = next(enumerate(test_loader))
            # print(target_.shape)
            data, target_ = data.to(device), target_.to(device) # target.shape == 1000
            output_ = model(data) # output.shape == (1000, 10)
            
            # Compute loss
            test_loss += F.nll_loss(output_, target_, reduction='sum').item() # sum up batch loss
            pred_ = output_.argmax(dim=1) # get the index of the max log-probability
            
            # Count correct classifications
            correct += pred_.eq(target_.view_as(pred_)).sum().item() # view_as(other): View this tensor as the same size as other

            # Track output, class-prediction and true target
            output.append(output_)
            pred.append(pred_)
            target.append(target_)

    output = torch.cat(output)
    pred = torch.cat(pred)
    target = torch.cat(target)
    assert pred.eq(target.view_as(pred)).sum().item() == correct

    test_loss /= len(target)
    print('Average loss: {:.4f}, Accuracy: {}/{} ({:.1f}%)'.format(
        test_loss, correct, len(target),
        100. * correct / len(target)))
    return pred, output, target, test_loss

#### Initialize the network and the optimizer.

In [None]:
# When training on a GPU, the network parameters must also be sent to it with model.to(device)
model = TwoLayerMLP(D_in, 50, D_out)
optimizer = optim.SGD(model.parameters(), lr=learning_rate, momentum=momentum)
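
To make the role of `momentum` concrete, the sketch below replays the update applied by `optim.SGD` with its default settings (no dampening, no weight decay, no Nesterov): a velocity `v <- momentum * v + grad` followed by `p <- p - lr * v`. The scalar parameter and the constant gradient of 3 are illustrative assumptions.

```python
import torch
import torch.optim as optim

lr, mu = 0.01, 0.5
p = torch.tensor([1.0], requires_grad=True)
opt = optim.SGD([p], lr=lr, momentum=mu)

# Manual replica of the update rule, starting from zero velocity
p_manual, v = 1.0, 0.0
for step in range(3):
    opt.zero_grad()
    loss = (3.0 * p).sum()        # gradient with respect to p is always 3
    loss.backward()
    opt.step()

    g = 3.0
    v = mu * v + g                # velocity accumulates past gradients
    p_manual = p_manual - lr * v  # parameter moves along the velocity

print(p.item(), p_manual)
```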

Time to run the training! We'll manually add a test() call before we loop over n_epochs to evaluate our model with
randomly initialized parameters.



In [None]:
pred, output, target, test_loss = test(model, test_loader, device)
print("Test accuracy = {:.1f}%".format((target == pred).sum().item() * 100. / len(target)))

Train one epoch

In [None]:
model, train_losses, train_counter = train(model, train_loader, optimizer, 1, device, log_interval=100)
pred, output, target, test_loss = test(model, test_loader, device)

Evaluating the Model's Performance



In [None]:
print("Test accuracy = {:.1f}%".format((target == pred).sum().item() * 100. / len(target)))
test_counter, test_losses = [len(train_loader.dataset)], [test_loss]

fig = plt.figure()
plt.plot(np.cumsum(train_counter), train_losses, '-b',
         np.cumsum(test_counter), test_losses, "or")
plt.legend(['Train Loss', 'Test Loss'], loc='upper right')
plt.xlabel('number of training examples seen')
plt.ylabel('negative log likelihood loss')

Let's again look at a few examples, as we did earlier, and compare them with the model's output.



In [None]:
with torch.no_grad():
    output = model(example_data)
y_pred = output.argmax(dim=1)

show_data_label_prediction(data=example_data, y_true=example_targets, y_pred=y_pred, shape=(3, 4))

Look at some misclassified images



In [None]:
errors = example_targets != y_pred
print("Nb errors = {}, (rate = {:.2f}%)".format(errors.sum().item(), 100 * errors.sum().item() / len(errors)))
err_idx = np.where(errors)
show_data_label_prediction(data=example_data[err_idx], y_true=example_targets[err_idx], y_pred=y_pred[err_idx], shape=(3, 4))

#### Reload model

In [None]:
model = TwoLayerMLP(D_in, 50, D_out)
optimizer = optim.SGD(model.parameters(), lr=learning_rate, momentum=momentum)

model.load_state_dict(torch.load('models/mod-%s.pth' % model.__class__.__name__))
optimizer.load_state_dict(torch.load('models/mod-%s_opt-%s.pth' % (model.__class__.__name__, optimizer.__class__.__name__)))
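
A self-contained sketch of the same save and reload round trip, using an in-memory buffer instead of a file under `models/`. The small `nn.Linear` model and the names `net` and `restored` are placeholders for illustration.

```python
import io
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Linear(4, 2)

buffer = io.BytesIO()                  # in-memory stand-in for a .pth file
torch.save(net.state_dict(), buffer)
buffer.seek(0)

restored = nn.Linear(4, 2)             # same architecture, different random init
restored.load_state_dict(torch.load(buffer))

x = torch.randn(3, 4)
print(torch.equal(net(x), restored(x)))  # both models now give identical outputs
```

Note that the state dict only stores tensors, so the model object must be rebuilt with the same architecture before loading.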

#### Continue training from checkpoints

In [None]:
for epoch in range(2, n_epochs + 1):
    # Train
    model, train_losses_, train_counter_ = train(model, train_loader, optimizer, epoch, device,
                                                 log_interval=50)
    train_losses += train_losses_
    train_counter += train_counter_
    
    # Test
    pred, output, target, test_loss = test(model, test_loader, device)
    test_counter.append(len(train_loader.dataset))
    test_losses.append(test_loss)
    print("Test accuracy = {:.1f}%".format((target == pred).sum().item() * 100. / len(target)))

fig = plt.figure()
plt.plot(np.cumsum(train_counter), train_losses, color='blue')
plt.plot(np.cumsum(test_counter), test_losses, "or")
plt.legend(['Train Loss', 'Test Loss'], loc='upper right')
plt.xlabel('number of training examples seen')
plt.ylabel('negative log likelihood loss')

## Test several MLP architectures

- Define an `MLP` class whose constructor takes the list of layer sizes as its parameter, e.g. `MLP([784, 512, 256, 128, 10])`.
- Add some non-linearity with the ReLU activation function.

In [None]:
class MLP(nn.Module):

    def __init__(self, d_layer):
        super(MLP, self).__init__()
        self.d_layer = d_layer
        layer_list = [nn.Linear(d_layer[l], d_layer[l+1]) for l in range(len(d_layer) - 1)]
        self.linears = nn.ModuleList(layer_list)

    def forward(self, X):
        X = X.view(-1, self.d_layer[0])
        # relu(Wl x) for all hidden layer
        for layer in self.linears[:-1]:
            X = F.relu(layer(X))
        # softmax(Wl x) for output layer
        return F.log_softmax(self.linears[-1](X), dim=1)
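
The reason for adding an activation can be checked numerically: without one, two stacked linear layers collapse into a single linear map with weight `W2 W1` and bias `W2 b1 + b2`. A minimal sketch with small, arbitrary layer sizes (the names `lin1`, `lin2`, `W` and `b` are illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lin1, lin2 = nn.Linear(6, 4), nn.Linear(4, 3)
x = torch.randn(5, 6)

with torch.no_grad():
    stacked = lin2(lin1(x))                    # two linear layers, no activation
    W = lin2.weight @ lin1.weight              # equivalent single weight matrix (3 x 6)
    b = lin2.weight @ lin1.bias + lin2.bias    # equivalent single bias (3,)
    single = x @ W.T + b
    with_relu = lin2(torch.relu(lin1(x)))      # ReLU between the layers

print(torch.allclose(stacked, single, atol=1e-6))    # the stack is one linear map
print(torch.allclose(with_relu, single, atol=1e-6))  # generally no longer the case with ReLU
```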

In [None]:
#model = MLP([D_in, 50, D_out])
#model = MLP([D_in, 512, 256, 128, D_out])
model = MLP([D_in, 512, 256, 128, 64, D_out]) # 96.6% (5 epochs)
#model = MLP([D_in, 512, 256, 256, 128, 128, 64, 64, D_out]) # 98.0% 661514 parameters
optimizer = optim.SGD(model.parameters(), lr=learning_rate, momentum=momentum)

# Explore the model
for parameter in model.parameters():
    print(parameter.shape)

print("Total number of parameters =", np.sum([np.prod(parameter.shape) for parameter in model.parameters()]))
    
train_losses, train_counter, test_losses, test_counter = [], [], [], []
for epoch in range(1, n_epochs + 1):
    # Train
    model, train_losses_, train_counter_ = train(model, train_loader, optimizer, epoch, device,
                                                 log_interval=100)
    train_losses += train_losses_
    train_counter += train_counter_
    
    print("Test : ", end = '')
    pred, output, target, test_loss = test(model, test_loader, device)
    test_counter.append(np.sum(train_counter))
    test_losses.append(test_loss)
    #print("Test accuracy = {:.1f}%".format((target == pred).sum().item() * 100. / len(target)))

fig = plt.figure()
plt.plot(np.cumsum(train_counter), train_losses, color='blue')
plt.plot(test_counter, test_losses, "or")
plt.legend(['Train Loss', 'Test Loss'], loc='upper right')
plt.xlabel('number of training examples seen')
plt.ylabel('negative log likelihood loss')

## Reduce the size of training dataset

Reduce the size of the training dataset by considering only a subset of batches.
Reduce the batch size to `16` and consider `8` mini-batches for training, i.e. 128 training samples.

In [None]:
train_loader, test_loader = load_mnist(16, batch_size_test)
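
Because the training loader shuffles, stopping each epoch after a fixed number of batches draws a different set of samples every time. An alternative, shown here only as a sketch on synthetic data, is to pin the training set with `torch.utils.data.Subset`. The `TensorDataset` below is a random stand-in for MNIST, and this is not the approach used in the cells that follow.

```python
import torch
from torch.utils.data import DataLoader, Subset, TensorDataset

# Synthetic stand-in for MNIST: 1000 random "images" with integer labels
full = TensorDataset(torch.randn(1000, 1, 28, 28),
                     torch.randint(0, 10, (1000,)))

small = Subset(full, range(128))                  # always the same 128 samples
loader = DataLoader(small, batch_size=16, shuffle=True)

n_batches = len(loader)
n_samples = sum(len(y) for _, y in loader)
print(n_batches, n_samples)                       # 8 128
```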

In [None]:
n_epochs = 50
n_batch = 8

model = MLP([D_in, 512, 256, 128, 64, D_out]) # 11.5% (50 epochs)
optimizer = optim.SGD(model.parameters(), lr=learning_rate, momentum=momentum)

# Explore the model
for parameter in model.parameters():
    print(parameter.shape)

print("Total number of parameters =", np.sum([np.prod(parameter.shape) for parameter in model.parameters()]))
    
train_losses, train_counter, test_losses, test_counter = [], [], [], []
for epoch in range(1, n_epochs + 1):
    print()
    model, train_losses_, train_counter_ = train(model, train_loader, optimizer, epoch, device,
                                                 log_interval=8,
                                                 batch_max=n_batch, save_model=False)
    train_losses += train_losses_
    train_counter += train_counter_
    
    print("Test : ", end = '')
    pred_test, output_test, target_test, loss_test = test(model, test_loader, device)
    test_counter.append(np.sum(train_counter))
    test_losses.append(loss_test)
    
    # Train accuracy
    print("Train: ", end = '')
    pred_train, output_train, target_train, loss_train = test(model, train_loader, device, batch_max=n_batch)

    #print("Train accuracy = {:.1f}%".format((target_train == pred_train).sum().item() * 100. / len(target_train)))
    #print("Test accuracy = {:.1f}%".format((target_test == pred_test).sum().item() * 100. / len(target_test)))
    
fig = plt.figure()
plt.plot(np.cumsum(train_counter), train_losses, color='blue')
plt.plot(test_counter, test_losses, "or")
plt.legend(['Train Loss', 'Test Loss'], loc='upper right')
plt.xlabel('number of training examples seen')
plt.ylabel('negative log likelihood loss')

## Run MLP on CIFAR-10 dataset

The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.

The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.

Here are the classes in the dataset:
- airplane
- automobile
- bird
- cat
- deer
- dog
- frog
- horse
- ship
- truck

In [None]:
from pathlib import Path
WD = os.path.join(Path.home(), "data", "pystatml", "dl_cifar10_pytorch")
os.makedirs(WD, exist_ok=True)
os.chdir(WD)
print("Working dir is:", os.getcwd())
os.makedirs("data", exist_ok=True)
os.makedirs("models", exist_ok=True)

Load CIFAR-10 dataset

In [None]:
# ---------------------------------------------------------------------------- #
# An implementation of https://arxiv.org/pdf/1512.03385.pdf                    #
# See section 4.2 for the model architecture on CIFAR-10                       #
# Some part of the code was referenced from below                              #
# https://github.com/pytorch/vision/blob/master/torchvision/models/resnet.py   #
# ---------------------------------------------------------------------------- #

import numpy as np
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms


# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Hyper-parameters
num_epochs = 5
learning_rate = 0.001

# Image preprocessing modules
transform = transforms.Compose([
    transforms.Pad(4),
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32),
    transforms.ToTensor()])

# CIFAR-10 dataset
train_dataset = torchvision.datasets.CIFAR10(root='data/',
                                             train=True, 
                                             transform=transform,
                                             download=True)

test_dataset = torchvision.datasets.CIFAR10(root='data/',
                                            train=False, 
                                            transform=transforms.ToTensor())

# Data loader
train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                           batch_size=100, 
                                           shuffle=True)

test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
                                          batch_size=100, 
                                          shuffle=False)

Input layer size: how many input features per image?

In [None]:
data_shape = train_loader.dataset.data.shape[1:]
D_in = np.prod(data_shape)
print(train_loader.dataset.data.shape, D_in, data_shape)

Output layer size: How many classes to predict?

In [None]:
D_out = len(set(train_loader.dataset.targets))
print(D_out)
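
A quick sketch of the arithmetic behind these sizes: a 32x32 RGB image flattens to 3072 inputs, and for the layer sizes used in the next cell most of the parameters sit in the first layer. The variable names are illustrative.

```python
image_shape = (32, 32, 3)                       # height, width, RGB channels
d_in = image_shape[0] * image_shape[1] * image_shape[2]
layer_sizes = [d_in, 512, 256, 128, 64, 10]

# Each fully connected layer has n_in * n_out weights plus n_out biases
n_params = sum(a * b + b for a, b in zip(layer_sizes[:-1], layer_sizes[1:]))
first_layer = layer_sizes[0] * layer_sizes[1] + layer_sizes[1]

print(d_in, n_params, round(first_layer / n_params, 2))
```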

In [None]:
n_epochs = 10

model = MLP([D_in, 512, 256, 128, 64, D_out]).to(device) # 13.6% (10 epochs); model must sit on the same device as the data
optimizer = optim.SGD(model.parameters(), lr=learning_rate, momentum=momentum)

print("Total number of parameters =", np.sum([np.prod(parameter.shape) for parameter in model.parameters()]))

train_losses, train_counter, test_losses, test_counter = [], [], [], []
for epoch in range(1, n_epochs + 1):
    print()
    model, train_losses_, train_counter_ = train(model, train_loader, optimizer, epoch, device,
                                                 log_interval=100)
    train_losses += train_losses_
    train_counter += train_counter_
    
    print("Test : ", end = '')
    pred_test, output_test, target_test, loss_test = test(model, test_loader, device)
    test_counter.append(np.sum(train_counter))
    test_losses.append(loss_test)
    
    # Train accuracy
    print("Train: ", end = '')
    pred_train, output_train, target_train, loss_train = test(model, train_loader, device, batch_max=n_batch)

    #print("Train accuracy = {:.1f}%".format((target_train == pred_train).sum().item() * 100. / len(target_train)))
    #print("Test accuracy = {:.1f}%".format((target_test == pred_test).sum().item() * 100. / len(target_test)))
    
fig = plt.figure()
plt.plot(np.cumsum(train_counter), train_losses, color='blue')
plt.plot(test_counter, test_losses, "or")
plt.legend(['Train Loss', 'Test Loss'], loc='upper right')
plt.xlabel('number of training examples seen')
plt.ylabel('negative log likelihood loss')

## Does dropout regularization improve the situation?

In [None]:
class MLPDropOut(nn.Module):

    def __init__(self, D_in, D_out):
        super(MLPDropOut, self).__init__()
        self.fc1 = nn.Linear(D_in, 512)
        self.fc1_drop = nn.Dropout(0.2)

        self.fc2 = nn.Linear(512, 256)
        self.fc2_drop = nn.Dropout(0.2)

        self.fc3 = nn.Linear(256, 128)
        self.fc3_drop = nn.Dropout(0.2)

        self.fc4 = nn.Linear(128, 64)
        self.fc4_drop = nn.Dropout(0.2)
        
        self.fc5 = nn.Linear(64, D_out)
        

    def forward(self, x):
        x = x.view(x.size(0), -1)  # flatten each image without relying on the global D_in

        x = F.relu(self.fc1(x))
        x = self.fc1_drop(x)
        
        x = F.relu(self.fc2(x))
        x = self.fc2_drop(x)

        x = F.relu(self.fc3(x))
        x = self.fc3_drop(x)
        
        x = F.relu(self.fc4(x))
        x = self.fc4_drop(x)

        return F.log_softmax(self.fc5(x), dim=1)
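
To see what the `nn.Dropout(0.2)` layers do, the sketch below applies one to a vector of ones. In training mode roughly 20% of the entries are zeroed and the survivors are scaled by `1 / (1 - 0.2) = 1.25`; in evaluation mode the layer is the identity. The input vector is illustrative only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.2)
x = torch.ones(10000)

drop.train()
y_train = drop(x)       # random entries zeroed, the rest scaled by 1.25

drop.eval()
y_eval = drop(x)        # identity at evaluation time

zero_fraction = (y_train == 0).float().mean().item()
print(sorted(y_train.unique().tolist()), zero_fraction)
print(torch.equal(y_eval, x))
```

This is why `model.train()` and `model.eval()` must be called before training and testing respectively.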

In [None]:
model = MLPDropOut(D_in, D_out).to(device)  # model must sit on the same device as the data
optimizer = optim.SGD(model.parameters(), lr=learning_rate, momentum=momentum)

# Explore the model
for parameter in model.parameters():
    print(parameter.shape)

print("Total number of parameters =", np.sum([np.prod(parameter.shape) for parameter in model.parameters()]))
    
train_losses, train_counter, test_losses, test_counter = [], [], [], []
for epoch in range(1, n_epochs + 1):
    print()
    # Train
    model, train_losses_, train_counter_ = train(model, train_loader, optimizer, epoch, device, log_interval,
                                                 batch_max=10)
    train_losses += train_losses_
    train_counter += train_counter_
    
    # Test
    pred, output, target, test_loss = test(model, test_loader, device)
    test_counter.append(np.sum(train_counter))
    test_losses.append(test_loss)
    
    # Train accuracy
    pred_train, output_train, target_train, loss_train = test(model, train_loader, device)
    #print("Train accuracy = {:.1f}%".format((target_train == pred_train).sum().item() * 100. / len(target_train)))
    #print("Test accuracy = {:.1f}%".format((target == pred).sum().item() * 100. / len(target)))

fig = plt.figure()
plt.plot(np.cumsum(train_counter), train_losses, color='blue')
plt.plot(test_counter, test_losses, "or")
plt.legend(['Train Loss', 'Test Loss'], loc='upper right')
plt.xlabel('number of training examples seen')
plt.ylabel('negative log likelihood loss')