In [2]:
import tensorflow as tf
from tensorflow import keras
import numpy as np
import matplotlib.pyplot as plt

# Intermediate Q&A 1
***

Agenda:
* Tips and tricks for the challenge
    * Layers, optimizers, losses, hyperparameters
    * My solution
* Q&A

# Useful layers
***
`Dense` layers are fully-connected
* $784$ input pixels for $n$ `Dense` neurons give $784n$ weights!
* Each output depends on each input

`Conv2D` layers set up small *kernels* that are applied across the input
* Only $k_w \cdot k_h$ weights per filter and input channel!
<img src="https://anhreynolds.com/img/cnn.png" style="height:300px;">
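
The weight counts above can be checked with a few lines of arithmetic. This is a minimal sketch in plain Python rather than `keras`; the helper names `dense_params` and `conv2d_params` are made up for illustration, and both assume one bias per neuron or filter.

```python
def dense_params(n_inputs, n_units, use_bias=True):
    """Trainable parameters of a fully-connected layer."""
    weights = n_inputs * n_units          # every input connects to every unit
    biases = n_units if use_bias else 0   # one bias per unit
    return weights + biases


def conv2d_params(kernel_w, kernel_h, in_channels, filters, use_bias=True):
    """Trainable parameters of a 2D convolution layer."""
    weights = kernel_w * kernel_h * in_channels * filters
    biases = filters if use_bias else 0   # one bias per filter
    return weights + biases


# 28x28 = 784 input pixels fed into 32 Dense neurons
print(dense_params(784, 32))        # 25120
# 3x3 kernels, 1 input channel, 32 filters
print(conv2d_params(3, 3, 1, 32))   # 320
# 3x3 kernels, 32 input channels, 32 filters
print(conv2d_params(3, 3, 32, 32))  # 9248
```

The two convolution counts agree with the `Param #` column of the model summary further down in this notebook.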

# Useful layers 2
***
`Pooling(w,h)` layers reduce the dimensionality of the input
* `AvgPool2D`: Take the average value of each patch
* `MaxPool2D`: Take the maximum value of each patch
* No weights and no learning, but helps filter out unnecessary input
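
To make the two pooling variants concrete, here is a small sketch on a hand-written $4 \times 4$ image. It uses plain Python lists instead of `keras` layers; `pool2d` and `mean` are illustrative helpers, and the patches are assumed to be non-overlapping (stride equal to the patch size).

```python
def pool2d(image, size, reduce):
    """Apply `reduce` to each non-overlapping size x size patch of a 2D list."""
    rows, cols = len(image), len(image[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        pooled_row = []
        for c in range(0, cols - size + 1, size):
            # Collect all values of the current patch
            patch = [image[r + i][c + j] for i in range(size) for j in range(size)]
            pooled_row.append(reduce(patch))
        pooled.append(pooled_row)
    return pooled


def mean(values):
    return sum(values) / len(values)


image = [
    [1, 2, 0, 0],
    [3, 4, 0, 8],
    [5, 5, 1, 1],
    [5, 5, 1, 1],
]

max_pooled = pool2d(image, 2, max)
avg_pooled = pool2d(image, 2, mean)
print(max_pooled)  # [[4, 8], [5, 1]]
print(avg_pooled)  # [[2.5, 2.0], [5.0, 1.0]]
```

Note how the single bright pixel (8) survives max pooling but is diluted to 2.0 by average pooling.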

In [10]:
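# 28x28 grayscale images; Conv2D expects an explicit channel axis
input_layer = keras.layers.Input((28,28,))
l = keras.layers.Reshape((28,28,1))(input_layer)

# First block: 32 filters of size 3x3, then halve the resolution (28 -> 14)
l = keras.layers.Conv2D(filters=32, kernel_size=(3,3),
                        activation="relu", padding="same")(l)
l = keras.layers.AvgPool2D((2,2))(l)

# Second block: same layout, resolution 14 -> 7
l = keras.layers.Conv2D(filters=32, kernel_size=(3,3),
                        activation="relu", padding="same")(l)
l = keras.layers.AvgPool2D((2,2))(l)

model = keras.models.Model(input_layer, l)
model.summary(50)  # print the summary with a line length of 50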

Model: "functional_7"
__________________________________________________
Layer (type)          Output Shape        Param # 
input_7 (InputLayer)  [(None, 28, 28)]    0       
__________________________________________________
reshape_6 (Reshape)   (None, 28, 28, 1)   0       
__________________________________________________
conv2d_6 (Conv2D)     (None, 28, 28, 32)  320     
__________________________________________________
average_pooling2d_6 ( (None, 14, 14, 32)  0       
__________________________________________________
conv2d_7 (Conv2D)     (None, 14, 14, 32)  9248    
__________________________________________________
average_pooling2d_7 ( (None, 7, 7, 32)    0       
Total params: 9,568
Trainable params: 9,568
Non-trainable params: 0
__________________________________________________


# Useful optimizers
***
All optimizers in `keras` are based on gradient descent
* Idea: The negative gradient points in the direction in which the loss decreases fastest
* The `learning_rate` determines how far to move in that direction
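
As a minimal sketch of this idea, the loop below runs plain gradient descent on the toy function $f(x) = x^2$ (an assumption chosen for illustration, not the challenge loss). The helper `gradient_descent` is hypothetical and not part of `keras`.

```python
def gradient_descent(grad, x0, learning_rate, steps):
    """Repeatedly step against the gradient, scaled by the learning rate."""
    x = x0
    for _ in range(steps):
        x = x - learning_rate * grad(x)
    return x


def grad_f(x):
    """Gradient of f(x) = x^2, which has its minimum at x = 0."""
    return 2 * x


# A moderate learning rate shrinks x towards the minimum
x_small = gradient_descent(grad_f, x0=10.0, learning_rate=0.1, steps=50)
# A learning rate that is too large overshoots further on every step
x_large = gradient_descent(grad_f, x0=10.0, learning_rate=1.1, steps=50)

print(abs(x_small) < 1e-3)  # True: converged close to 0
print(abs(x_large) > 1e4)   # True: diverged
```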

`SGD` is the vanilla gradient descent method
* Can get stuck in local optima quickly

Others, like `Adam`, `Adamax`, `Adadelta`, adapt the learning rate dynamically
* E.g. based on gradient norms, dimensionality
* In practice, they often converge faster than `SGD`
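
What "dynamically set learning rate" can mean is easiest to see in the Adam update rule. The following is a sketch of the commonly stated scalar Adam update in plain Python, not the `keras` implementation; `adam_minimize` and the toy gradients are assumptions for illustration.

```python
import math


def adam_minimize(grad, x0, learning_rate=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=1):
    """Scalar Adam: the step is scaled by running gradient statistics."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g        # running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g    # running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        x = x - learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return x


def grad_f(x):
    return 2 * x       # gradient of x^2


def grad_f_scaled(x):
    return 2000 * x    # gradient of 1000 * x^2


# Size of the first step for both functions, starting from x = 10
step_plain = 10.0 - adam_minimize(grad_f, 10.0)
step_scaled = 10.0 - adam_minimize(grad_f_scaled, 10.0)
print(round(step_plain, 6), round(step_scaled, 6))  # 0.1 0.1
```

Although the second gradient is 1000 times larger, the first step has the same size: the gradient is normalized by its own magnitude. Plain gradient descent with the same `learning_rate` would take a step 1000 times larger.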

# Which to choose?
***
The effectiveness of an optimizer depends on the network and the training data

Use case:
* Proof of concept, i.e. "Wonder if this would learn anything"
    * Start with standard methods, e.g. `SGD`
* Scientific research
    * Try any *reasonable* methods
* Trying to beat Niklas
    * Try out everything

# Useful losses
***
Our problem is a classification task. In almost all such cases, the loss below works very well
* `categorical_crossentropy` is a measure of closeness between discrete distributions
    * I.e. set the parameters such that the predicted class probabilities approximate the label distribution

For classification it might also be interesting to play around with
* `MSE`: Usually worse, but has its perks for some optimizers
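
The difference between the two losses shows up on a single one-hot label. This sketch computes both by hand in plain Python; the function names mirror the `keras` loss names, but the implementations and the example predictions are assumptions for illustration.

```python
import math


def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Cross-entropy between a target and a predicted distribution."""
    # Clip predictions away from 0 to avoid log(0)
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


def mean_squared_error(y_true, y_pred):
    """Mean of the squared differences per class."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [0.0, 1.0, 0.0]        # one-hot label: class 1
good_pred = [0.1, 0.8, 0.1]     # mostly correct prediction
bad_pred = [0.8, 0.1, 0.1]      # confident but wrong prediction

for name, pred in [("good", good_pred), ("bad", bad_pred)]:
    ce = categorical_crossentropy(y_true, pred)
    mse = mean_squared_error(y_true, pred)
    print(f"{name}: crossentropy={ce:.4f}, mse={mse:.4f}")
# good: crossentropy=0.2231, mse=0.0200
# bad: crossentropy=2.3026, mse=0.4867
```

Cross-entropy only looks at the probability assigned to the true class and grows without bound as that probability goes to zero, whereas `MSE` is bounded for probabilities in $[0, 1]$.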
 

# Hyperparameters
***
Getting the most out of your model usually involves tuning hyperparameters:
* `learning_rate`
    * Large steps might skip over more local minima, but may also not converge
    * Small steps might converge to local minima and never leave them again
        * Decreasing the learning rate during training can help a lot!
* `batch_size`
    * Theoretically, `batch_size=1` needs the fewest epochs to converge
        * But that is per epoch, not per second!
    * Start with a `batch_size` that allows for rapid prototype training
        * Depends on your hardware
    * For promising candidates: Lower it until you see diminishing returns
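
Two of the points above can be illustrated with a few lines of arithmetic: a decaying learning rate and the cost of small batches. This is a sketch in plain Python; the helper names, the decay settings, and the training set size of 60,000 samples are assumptions for illustration. `keras` ships comparable schedules under `keras.optimizers.schedules`, e.g. `ExponentialDecay`.

```python
import math


def exponential_decay(initial_lr, decay_rate, decay_steps, step):
    """Learning rate after `step` updates; multiplied by decay_rate every decay_steps."""
    return initial_lr * decay_rate ** (step / decay_steps)


def steps_per_epoch(n_samples, batch_size):
    """Number of weight updates needed to see every sample once."""
    return math.ceil(n_samples / batch_size)


# Halve the learning rate every 1000 updates
for step in (0, 1000, 2000):
    print(step, exponential_decay(0.01, 0.5, 1000, step))
# 0 0.01
# 1000 0.005
# 2000 0.0025

# Smaller batches mean many more (sequential) updates per epoch
for batch_size in (1, 32, 128):
    print(batch_size, steps_per_epoch(60000, batch_size))
# 1 60000
# 32 1875
# 128 469
```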

# Try it out
***
All of the design choices available are interconnected:
* Network architecture
    * Type of layers
    * Activation functions
    * Weights per layer
* Optimizer
    * Learning rate, momentum
    * Batch size
* Losses

There are no perfect solutions
* It is a lot of trial and error
* Very good solutions for scenario A might be far from good for scenario B
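
One way to organize the trial and error is to write down the candidate choices and enumerate every combination. The sketch below only builds the list of configurations; the search space values and the helper `enumerate_configs` are hypothetical, and training a model for each configuration is left out.

```python
from itertools import product

# Hypothetical search space: adapt the values to your own experiments
search_space = {
    "optimizer": ["SGD", "Adam"],
    "learning_rate": [0.1, 0.01, 0.001],
    "batch_size": [32, 128],
}


def enumerate_configs(space):
    """Return one dict per combination of the listed choices."""
    names = list(space)
    return [dict(zip(names, values)) for values in product(*space.values())]


configs = enumerate_configs(search_space)
print(len(configs))  # 12
print(configs[0])    # {'optimizer': 'SGD', 'learning_rate': 0.1, 'batch_size': 32}
```

Because the choices interact, the number of combinations grows multiplicatively with every added option, which is why a full grid quickly becomes expensive.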