## trixi PyTorch Experiment

This notebook shows how to use the trixi `PytorchExperiment` with the PyTorch MNIST example to classify MNIST digits.

Before running this notebook, call:  
`python -m visdom.server -p 8080`  
This starts a visdom server which is used to visualize the training's progress in real-time.
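
Before starting a long run it can help to check that something is actually listening on that port. The following is a minimal sketch using only the standard library; the helper name `server_is_up` is made up for this example, and it only tests that the TCP port accepts connections, not that the process behind it is really visdom.

```python
import socket


def server_is_up(host="localhost", port=8080, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host not resolvable.
        return False


print("visdom port reachable:", server_is_up())
```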

Imports

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable
import torch.optim as optim
from torchvision import datasets, transforms

from trixi.util import Config
from trixi.experiment import PytorchExperiment

Let's prepare the experiment dir

In [None]:
!mkdir -p experiment_dir/

In [None]:
!ls experiment_dir/

In [None]:
!rm -rf experiment_dir/20*

In [None]:
!du -sh experiment_dir/

Now let's create a Config.
A Config is basically a dict whose entries can also be accessed with the "." operator.
All objects in the dict will be initialized when the experiment starts.
Additionally, all config keywords/elements can be set on the command line (e.g. --batch_size=128).
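
To make the two ideas above concrete (attribute-style access and command-line overrides), here is a small sketch built only on the standard library. `AttrDict` and `override_from_argv` are hypothetical names for illustration; this is not how trixi's `Config` is implemented, just one way such behaviour can be obtained.

```python
import argparse


class AttrDict(dict):
    """Hypothetical stand-in for a config: a dict whose keys are also attributes."""

    def __getattr__(self, key):
        try:
            return self[key]
        except KeyError:
            raise AttributeError(key)

    def __setattr__(self, key, value):
        self[key] = value


def override_from_argv(config, argv):
    """Overwrite existing config entries with --key=value pairs from argv."""
    parser = argparse.ArgumentParser()
    for key, value in config.items():
        # Reuse the type of the default so "128" is parsed as an int.
        # (This simple approach is only meant for int, float and str values.)
        parser.add_argument("--" + key, type=type(value), default=value)
    config.update(vars(parser.parse_args(argv)))
    return config


cfg = AttrDict()
cfg.batch_size = 64
cfg.learning_rate = 0.01
override_from_argv(cfg, ["--batch_size=128"])
print(cfg.batch_size, cfg["learning_rate"])
```

The explicit `argv` list stands in for `sys.argv[1:]`, so the sketch can be run inside a notebook.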

In [None]:
c = Config()

c.batch_size = 64
c.batch_size_test = 1000
c.n_epochs = 10
c.learning_rate = 0.01
c.momentum = 0.9
c.use_cuda = torch.cuda.is_available()
c.rnd_seed = 1
c.log_interval = 200


Now we define the model we use for classification.

In [None]:
# build a simple cnn model
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = x.view(-1, 320)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)
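
The `320` in `fc1` and in `x.view(-1, 320)` follows from the layer shapes. The sketch below redoes that arithmetic in plain Python, assuming 28x28 input images, no padding, stride 1 for the convolutions, and a stride equal to the window size for the pooling. The helper `conv_out` is introduced only for this example.

```python
def conv_out(size, kernel_size, stride=1, padding=0):
    """Output side length of a convolution or pooling window."""
    return (size + 2 * padding - kernel_size) // stride + 1


side = 28                        # MNIST images are 28x28
side = conv_out(side, 5)         # conv1, kernel_size=5
side = conv_out(side, 2, 2)      # max_pool2d(..., 2)
side = conv_out(side, 5)         # conv2, kernel_size=5
side = conv_out(side, 2, 2)      # max_pool2d(..., 2)
flat_features = 20 * side * side # conv2 has 20 output channels
print(side, flat_features)
```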

Let's now set up a PytorchExperiment. To do so, we create a class that inherits from the PytorchExperiment class.
We then override the setup, train and validate methods.
When we finally call the experiment's run() method, it calls setup once and then, for the defined number of n_epochs, calls the train and validate methods in alternation.
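
The call order described above can be sketched without trixi at all. `MiniExperiment` is a hypothetical, heavily simplified stand-in: it only records which method is called when, and it assumes epochs are numbered from 0. It leaves out everything the real class adds (logging, checkpoints, error handling).

```python
class MiniExperiment:
    """Hypothetical, simplified stand-in for the run loop described above."""

    def __init__(self, n_epochs):
        self.n_epochs = n_epochs
        self.calls = []

    def setup(self):
        self.calls.append("setup")

    def train(self, epoch):
        self.calls.append("train %d" % epoch)

    def validate(self, epoch):
        self.calls.append("validate %d" % epoch)

    def run(self):
        # setup once, then train and validate once per epoch
        self.setup()
        for epoch in range(self.n_epochs):
            self.train(epoch)
            self.validate(epoch)


demo = MiniExperiment(n_epochs=2)
demo.run()
print(demo.calls)
```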

The PytorchExperiment has several benefits:
* It automatically creates an experimentLogger (elog), which creates a defined folder structure and can be used to store all the results
* It automatically creates a visdomLogger (vlog), which can be used to show the results live on a visdom server
* It automatically creates a combinedLogger (clog), which has the same interface as the experimentLogger and visdomLogger and logs to both at defined intervals (e.g. you can see each result in visdom, while only saving every 10th to your hard disk)
* After each epoch, at the end, and if an error occurs, it automatically stores a checkpoint of your experiment
* You can simply resume an ended experiment by providing its experiment folder when creating a new one
* You can use the add_result method to compare your experiments in the trixi browser, backtrace all your result updates, and see all the results on your visdom server (at the same time)
* It saves your config and your code (if given globs=globals()) for full reproducibility
* Many more ;-)  
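
The idea behind the combined logger (forward every call to one target, but only every n-th call to another) can be sketched in a few lines. `ListLogger` and `CombinedSketch` are hypothetical classes written for this illustration; they are not trixi's loggers and only mimic a single `show_text` method.

```python
class ListLogger:
    """Hypothetical logger that just remembers what it was asked to show."""

    def __init__(self):
        self.messages = []

    def show_text(self, text):
        self.messages.append(text)


class CombinedSketch:
    """Forwards each call to every target whose interval divides the call count."""

    def __init__(self, *targets):
        # Each target is a (logger, every_n) pair.
        self.targets = targets
        self.n_calls = 0

    def show_text(self, text):
        for logger, every_n in self.targets:
            if self.n_calls % every_n == 0:
                logger.show_text(text)
        self.n_calls += 1


live, disk = ListLogger(), ListLogger()
combined = CombinedSketch((live, 1), (disk, 10))
for step in range(25):
    combined.show_text("step %d" % step)
print(len(live.messages), disk.messages)
```

Here `live` plays the role of the visdom side and `disk` the role of the file side: the first sees all 25 messages, the second only every 10th.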

In [None]:
class MNIST_experiment(PytorchExperiment):
    def setup(self):
        
        self.elog.print("Config:")
        self.elog.print(self.config)
        
        ### Get Dataset
        transf = transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize((0.1307,), (0.3081,))
        ])
        self.dataset_train = datasets.MNIST(root="experiment_dir/data/", download=True,
                                                        transform=transf, train=True)
        self.dataset_test = datasets.MNIST(root="experiment_dir/data/", download=True,
                                                       transform=transf, train=False)

        data_loader_kwargs = {'num_workers': 1, 'pin_memory': True} if self.config.use_cuda else {}
        
        self.train_data_loader = torch.utils.data.DataLoader(self.dataset_train, batch_size=self.config.batch_size,
                                                        shuffle=True, **data_loader_kwargs)
        self.test_data_loader = torch.utils.data.DataLoader(self.dataset_test, batch_size=self.config.batch_size_test,
                                                       shuffle=True, **data_loader_kwargs)

        
        self.device = torch.device("cuda" if self.config.use_cuda else "cpu")
        
        self.model = Net()
        self.model.to(self.device)

        self.optimizer = optim.SGD(self.model.parameters(), lr=self.config.learning_rate,
                                               momentum=self.config.momentum)
        
        self.save_checkpoint(name="checkpoint_start")
        self.vlog.plot_model_structure(self.model,
                                       [self.config.batch_size, 1, 28, 28], 
                                       name='Model Structure')
        
        self.batch_counter = 0        
        self.elog.print('Experiment set up.')
        
    
    def train(self, epoch):
        
        self.model.train()
        
        for batch_idx, (data, target) in enumerate(self.train_data_loader):
            
            self.batch_counter += 1
            
            if self.config.use_cuda:
                data, target = data.cuda(), target.cuda()
                
            self.optimizer.zero_grad()
            
            output = self.model(data)
            self.loss = F.nll_loss(output, target)
            self.loss.backward()
            
            self.optimizer.step()
            
            if batch_idx % self.config.log_interval == 0:
                # plot train loss of the current batch (mathematically not 100% correct)
                self.add_result(value=self.loss.item(), name='Train_Loss',
                                     counter=epoch + batch_idx / len(self.train_data_loader), label='Loss')
                # log train batch loss and progress
                self.clog.show_text(
                    'Train Epoch: {} [{}/{} samples ({:.0f}%)]\t Batch Loss: {:.6f}'
                    .format(epoch, batch_idx * len(data),
                            len(self.train_data_loader.dataset),
                            100. * batch_idx / len(self.train_data_loader),
                            self.loss.item()), name="log")
                
                self.clog.show_image_grid(data, name="mnist_training", n_iter=epoch + batch_idx / len(self.train_data_loader), iter_format="{:0.02f}")
                
                self.save_checkpoint(name="checkpoint", n_iter=batch_idx)
                
    def validate(self, epoch):
        self.model.eval()
        
        validation_loss = 0
        correct = 0
        
        # no gradients are needed for evaluation
        with torch.no_grad():
            for data, target in self.test_data_loader:
                if self.config.use_cuda:
                    data, target = data.cuda(), target.cuda()
                output = self.model(data)
                # sum up the per-sample losses of this batch
                validation_loss += F.nll_loss(output, target, reduction='sum').item()
                # the index of the largest log-probability is the predicted class
                pred = output.max(1, keepdim=True)[1]
                correct += pred.eq(target.view_as(pred)).sum().item()
        validation_loss /= len(self.test_data_loader.dataset)
        # plot the test loss
        self.add_result(value=validation_loss, name='Validation_Loss',
                             counter=epoch + 1, label='Loss')
        # plot the test accuracy
        acc = 100. * correct / len(self.test_data_loader.dataset)
        self.add_result(value=acc, name='Validation_Accuracy',
                             counter=epoch + 1, label='Accuracy')
        
        # log validation loss and accuracy
        self.elog.print(
            '\nValidation set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'
            .format(validation_loss, correct, len(self.test_data_loader.dataset),
                    100. * correct / len(self.test_data_loader.dataset)))
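
To see which x-axis values the `Train_Loss` curve receives, the arithmetic behind `counter=epoch + batch_idx / len(self.train_data_loader)` can be worked through with the values from the config. This is a small sketch that uses the 60000 samples of the MNIST training set and assumes the first epoch is numbered 0; the epoch numbering is an assumption and is not stated in this notebook.

```python
import math

n_samples = 60000    # size of the MNIST training set
batch_size = 64      # c.batch_size
log_interval = 200   # c.log_interval

# Number of batches per epoch (the last batch is smaller than the others).
n_batches = math.ceil(n_samples / batch_size)

# Batches for which "batch_idx % log_interval == 0" holds.
logged = [i for i in range(n_batches) if i % log_interval == 0]

epoch = 0  # assumption: epochs are counted from 0
counters = [epoch + i / n_batches for i in logged]

print(n_batches, logged)
print([round(x, 3) for x in counters])
```

Each logged point therefore lies at a fractional epoch in `[epoch, epoch + 1)`, while the validation results are placed at `epoch + 1`.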

In [None]:
exp = MNIST_experiment(config=c, name='experiment', n_epochs=c.n_epochs, 
                       seed=42, base_dir='./experiment_dir', loggers={"visdom":"visdom"})

In [None]:
exp.run()

In [None]:
import os
last_experiment = 'experiment_dir/' + sorted([d for d in os.listdir('experiment_dir/') if d.startswith('20')], reverse=True)[0]
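
Picking the newest run this way relies on the run folders starting with a timestamp, so that sorting the names alphabetically also sorts them by time. The sketch below shows the same idea as a reusable helper and tries it on a temporary directory. The helper name `latest_run_dir` and the example folder names are made up for this illustration; the exact naming scheme trixi uses is not shown in this notebook.

```python
import os
import tempfile


def latest_run_dir(base_dir, prefix="20"):
    """Return the path of the alphabetically last sub-folder starting with prefix."""
    runs = sorted(
        d for d in os.listdir(base_dir)
        if d.startswith(prefix) and os.path.isdir(os.path.join(base_dir, d))
    )
    if not runs:
        raise FileNotFoundError("no run folders found in %r" % base_dir)
    return os.path.join(base_dir, runs[-1])


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical folder names, used only to exercise the helper.
    for name in ("20180101-120000_experiment", "20180102-090000_experiment", "data"):
        os.mkdir(os.path.join(tmp, name))
    newest = os.path.basename(latest_run_dir(tmp))

print(newest)
```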

In [None]:
!ls experiment_dir/

Let's now resume the last experiment.

In [None]:
last_experiment

In [None]:
from trixi.experiment import PytorchExperiment
exp_resume = MNIST_experiment(config=c, name='resume_experiment', 
                              n_epochs=5, seed=42, base_dir='./experiment_dir', 
                              resume=last_experiment, resume_save_types=('model',
                                                                         'simple',
                                                                         'th_vars',
                                                                         'results'), loggers={"visdom":"visdom"})

In [None]:
exp_resume.run()

In [None]:
!ls experiment_dir

You can also change a parameter in your experiment and simply run the same experiment again (this can of course also be done automatically).

In [None]:
c.learning_rate = 0.0001
exp2 = MNIST_experiment(config=c, name='experiment2', n_epochs=c.n_epochs, 
                       seed=42, base_dir='./experiment_dir', loggers={"visdom":"visdom"})
exp2.run()
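
Running such variations automatically mostly comes down to generating one set of parameters per run. The following sketch builds a small parameter grid with plain dicts; `config_grid` is a hypothetical helper, and the momentum values are chosen only for illustration.

```python
import itertools


def config_grid(base, **choices):
    """Yield one copy of base per combination of the given parameter choices."""
    keys = sorted(choices)
    for values in itertools.product(*(choices[k] for k in keys)):
        cfg = dict(base)
        cfg.update(zip(keys, values))
        yield cfg


base = {"batch_size": 64, "learning_rate": 0.01, "momentum": 0.9}
grid = list(config_grid(base, learning_rate=[0.01, 0.0001], momentum=[0.5, 0.9]))

for cfg in grid:
    print(cfg["learning_rate"], cfg["momentum"])
```

Each resulting dict could then be copied onto a `Config` and passed to a new `MNIST_experiment`, in the same way as the manual example in the cell above.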

Now let's compare all our experiments. To do so, we simply start the trixi browser:

In [None]:
!python -m trixi.browser $PWD/experiment_dir