- If the inputs to an activation do not have zero mean and unit variance, small perturbations in that layer's weight matrix can cause large perturbations in the layer's output. This makes learning more difficult.
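
As a minimal sketch of what zero mean and unit variance mean in practice, the snippet below standardizes each feature of a batch with NumPy. The array `X`, its shape, and the small epsilon are illustrative assumptions, not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical batch: 256 samples, 3 features with very different offsets and scales.
X = rng.normal(loc=[5.0, -2.0, 100.0], scale=[1.0, 10.0, 0.1], size=(256, 3))

mean = X.mean(axis=0)  # per-feature mean
std = X.std(axis=0)    # per-feature standard deviation
X_norm = (X - mean) / (std + 1e-8)  # epsilon guards against division by zero

print(X_norm.mean(axis=0))  # close to 0 for every feature
print(X_norm.std(axis=0))   # close to 1 for every feature
```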

- Regularization is used to reduce the gap between train and test error.

- If the loss changes quickly in one direction and slowly in another, SGD makes very slow progress along the shallow dimension and jitters along the steep one.

- Saddle point: the loss goes up in one direction and down in another. These points are much more common in high dimensions. Local minima are very rare in high dimensions, because a step in every single direction would have to increase the value of the function.

- SGD+Momentum: we maintain a velocity over time and add our gradient estimates to it. Then we step in the direction of the velocity rather than in the direction of the gradient. Even if the gradient is 0 at a saddle point, we still have velocity and can keep moving downwards. Accumulating gradients this way also smooths out the noise in each gradient estimate.

In [None]:
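vx = 0  # Initialize once, before the training loop.
# At each iteration, with dx the current gradient estimate:
vx = rho * vx + dx  # rho gives "friction"; typically 0.9.
x -= learning_rate * vx  # Step against the velocity, since vx accumulates gradients.
# vx is an exponentially decaying running sum of your recent gradients.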
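
To make the effect concrete, here is a small sketch that runs the velocity update as a descent step on a hypothetical ill-conditioned quadratic loss (steep in one direction, shallow in the other). The `curvature` vector, the helper names `grad` and `run`, the starting point, and the step count are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical loss: f(x) = 0.5 * (1 * x[0]**2 + 25 * x[1]**2), minimum at the origin.
curvature = np.array([1.0, 25.0])

def grad(x):
    return curvature * x

def run(rho, learning_rate=0.03, steps=100):
    x = np.array([1.0, 1.0])
    vx = np.zeros_like(x)
    for _ in range(steps):
        dx = grad(x)
        vx = rho * vx + dx       # accumulate gradients into the velocity
        x -= learning_rate * vx  # step against the velocity
    return np.linalg.norm(x)     # distance to the minimum

dist_sgd = run(rho=0.0)       # rho = 0 reduces to plain SGD
dist_momentum = run(rho=0.9)
print(dist_sgd, dist_momentum)
```

With these settings the momentum run ends closer to the minimum, because the velocity keeps building up along the shallow direction.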

- Nesterov momentum: step in the direction of the velocity, evaluate the gradient at that point, go back and mix velocity and gradient to make the step.  

In [None]:
old_v = v
v = rho * v - learning_rate * dx
# The update adds a correction term based on the difference between the current and previous velocity, which reduces overshooting.
x += -rho * old_v + (1 + rho) * v
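
The rearranged update can be checked against the "look ahead, then evaluate the gradient" description. The sketch below runs both forms side by side on a hypothetical quadratic loss and compares them through the change of variables `x_tilde = x + rho * v`. The loss and the names `x_tilde` and `v_tilde` are assumptions for illustration.

```python
import numpy as np

curvature = np.array([1.0, 25.0])  # hypothetical quadratic loss

def grad(x):
    return curvature * x

rho, learning_rate = 0.9, 0.03

# Lookahead form: the gradient is evaluated at x + rho * v.
x = np.array([1.0, 1.0])
v = np.zeros_like(x)

# Rearranged form: tracks x_tilde = x + rho * v and evaluates the gradient there directly.
x_tilde = x.copy()
v_tilde = np.zeros_like(x)

for _ in range(50):
    v = rho * v - learning_rate * grad(x + rho * v)
    x = x + v

    old_v = v_tilde
    v_tilde = rho * v_tilde - learning_rate * grad(x_tilde)
    x_tilde = x_tilde - rho * old_v + (1 + rho) * v_tilde

forms_agree = np.allclose(x_tilde, x + rho * v)
print(forms_agree)
```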

- AdaGrad: keep a running sum of all the squared gradients during training.

In [None]:
grad_squared += dx * dx  
x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)
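
A small sketch of this update on a hypothetical quadratic loss shows two consequences of dividing by the accumulated squared gradients: the first step has roughly the same size in every coordinate regardless of curvature, and the steps shrink as the sum grows. The loss, learning rate, and iteration count are illustrative assumptions.

```python
import numpy as np

curvature = np.array([1.0, 25.0])  # hypothetical quadratic loss

def grad(x):
    return curvature * x

x = np.array([1.0, 1.0])
grad_squared = np.zeros_like(x)
learning_rate = 0.1
first_step = None

for t in range(100):
    dx = grad(x)
    grad_squared += dx * dx  # running sum of squared gradients
    step = learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)
    if first_step is None:
        first_step = np.abs(step).copy()
    x -= step

last_step = np.abs(step)
print(first_step, last_step)
```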

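- RMSProp: like AdaGrad, but with a decaying running average of the squared gradients instead of a running sum (momentum over the squared gradient).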

In [None]:
grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx
x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)
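
The difference from AdaGrad is easiest to see with a constant gradient. In the sketch below (the gradient value and hyperparameters are assumptions), AdaGrad's step keeps shrinking as its sum grows, while RMSProp's step settles near `learning_rate` because its average stops growing.

```python
import numpy as np

dx = 2.0  # hypothetical constant gradient
learning_rate, decay_rate = 0.01, 0.9

adagrad_sq = 0.0
rmsprop_sq = 0.0
for t in range(200):
    adagrad_sq += dx * dx                                              # running sum
    rmsprop_sq = decay_rate * rmsprop_sq + (1 - decay_rate) * dx * dx  # decaying average
    adagrad_step = learning_rate * dx / (np.sqrt(adagrad_sq) + 1e-7)
    rmsprop_step = learning_rate * dx / (np.sqrt(rmsprop_sq) + 1e-7)

print(adagrad_step, rmsprop_step)
```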

- Adam: RMSProp with momentum.

In [None]:
first_moment = 0
second_moment = 0
for t in range(1, num_iterations + 1):  # Start at t = 1; at t = 0 the bias correction would divide by zero.
    first_moment = beta_1 * first_moment + (1 - beta_1) * dx  # Momentum.
    second_moment = beta_2 * second_moment + (1 - beta_2) * dx * dx  # RMSProp.
    # Bias correction for the fact that first and second moment estimates start at zero.
    first_unbias = first_moment / (1 - beta_1 ** t)
    second_unbias = second_moment / (1 - beta_2 ** t)
    x -= learning_rate * first_unbias / (np.sqrt(second_unbias) + 1e-7)  # RMSProp.
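
To illustrate why the bias correction matters, the following sketch wraps the update in a hypothetical `adam` helper (the `correct_bias` flag, the quadratic loss, and the starting point are assumptions) and compares the size of the very first step with and without the correction.

```python
import numpy as np

curvature = np.array([1.0, 25.0])  # hypothetical quadratic loss

def grad(x):
    return curvature * x

def adam(num_iterations, learning_rate=1e-3, beta_1=0.9, beta_2=0.999, correct_bias=True):
    x = np.array([1.0, 1.0])
    first_moment = np.zeros_like(x)
    second_moment = np.zeros_like(x)
    for t in range(1, num_iterations + 1):  # t starts at 1 so 1 - beta ** t is nonzero
        dx = grad(x)
        first_moment = beta_1 * first_moment + (1 - beta_1) * dx
        second_moment = beta_2 * second_moment + (1 - beta_2) * dx * dx
        if correct_bias:
            m = first_moment / (1 - beta_1 ** t)
            s = second_moment / (1 - beta_2 ** t)
        else:
            m, s = first_moment, second_moment
        x -= learning_rate * m / (np.sqrt(s) + 1e-7)
    return x

start = np.array([1.0, 1.0])
step_corrected = np.abs(adam(1) - start)
step_uncorrected = np.abs(adam(1, correct_bias=False) - start)
print(step_corrected, step_uncorrected)
```

With the correction, the first step is about `learning_rate` in each coordinate. Without it, the second moment is underestimated more strongly than the first, so the first step comes out roughly three times larger with these betas.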

- Good starting point: Adam with beta_1 = 0.9, beta_2 = 0.999 and learning_rate = 1e-3 or 5e-4.

- Learning rate decay over time: step decay, exponential decay or 1/t decay. It is useful when the loss gets stuck on a plateau.
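
The three schedules can be sketched as plain functions of the epoch. The function names, the decay constant `k`, and the step settings `drop` and `every` are illustrative assumptions rather than values from the notes.

```python
import math

def step_decay(lr0, epoch, drop=0.5, every=10):
    # Multiply the rate by `drop` once every `every` epochs.
    return lr0 * drop ** (epoch // every)

def exponential_decay(lr0, epoch, k=0.1):
    return lr0 * math.exp(-k * epoch)

def inverse_time_decay(lr0, epoch, k=0.1):
    # The "1/t" schedule.
    return lr0 / (1 + k * epoch)

for epoch in (0, 10, 20):
    print(epoch,
          step_decay(1e-3, epoch),
          exponential_decay(1e-3, epoch),
          inverse_time_decay(1e-3, epoch))
```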

- Dropout: in each forward pass, randomly set the output of some neurons to zero at every layer. The probability of dropping a neuron is a hyperparameter; 0.5 is the usual default. It forces the network to have a redundant representation and prevents co-adaptation of features.

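- At test time: keep all neurons and scale the activations by the probability of keeping a neuron (equal to the dropout probability when it is 0.5), so that the expected activations match those seen during training.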
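
A minimal sketch of the train and test behaviour described above, assuming NumPy. The names `p_keep`, `dropout_train`, and `dropout_test` and the all-ones activations are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
p_keep = 0.5  # probability of keeping a unit (assumed value)

def dropout_train(h):
    mask = rng.random(h.shape) < p_keep  # keep each unit with probability p_keep
    return h * mask

def dropout_test(h):
    return h * p_keep  # match the expected activation seen during training

h = np.ones(100_000)  # hypothetical layer activations
train_mean = dropout_train(h).mean()
test_mean = dropout_test(h).mean()
print(train_mean, test_mean)
```

Many implementations instead divide by `p_keep` at training time (inverted dropout), which leaves the test-time forward pass unchanged.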

- Dropout regularizes in a similar way to batch normalization: both add noise at train time and average it out (marginalize) at test time. Dropout may be the better of the two because its dropout_rate hyperparameter lets you tune the strength of the regularization.

- Data Augmentation: a random mix of translation, rotation, stretching, distortion, etc., used as a regularizer.
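
As a rough sketch, the hypothetical `augment` helper below covers only two transformations: a random translation (pad, then crop) and a random horizontal flip, which the list above does not name explicitly. The image size and `max_shift` are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image, max_shift=2):
    """Random horizontal flip plus a random translation of up to max_shift pixels."""
    if rng.random() < 0.5:
        image = image[:, ::-1]  # horizontal flip
    padded = np.pad(image, max_shift, mode="constant")  # zero-pad every side
    top = rng.integers(0, 2 * max_shift + 1)
    left = rng.integers(0, 2 * max_shift + 1)
    h, w = image.shape
    return padded[top:top + h, left:left + w]  # crop back to the original size

image = np.arange(64, dtype=np.float32).reshape(8, 8)  # hypothetical grayscale image
augmented = augment(image)
print(augmented.shape)
```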

- Transfer learning: train on ImageNet, then freeze all the weights up to the final FC layer, randomly reinitialize the weight matrix going into the final FC layer, and train only those parameters until they converge on your small dataset. The original parameters that converged on ImageNet generally work pretty well, so if you fine-tune them they should change only by a very small amount.
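
The freeze-and-retrain recipe can be sketched in NumPy without any real pretrained network. Here `W_frozen` is a random stand-in for the pretrained layers, and the dataset, layer sizes, and learning rate are all assumptions. Only the reinitialized final layer `W_fc` is updated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for pretrained layers; in practice these weights would come from ImageNet training.
W_frozen = rng.normal(size=(20, 16))

def features(X):
    return np.maximum(0.0, X @ W_frozen)  # frozen ReLU feature extractor

# Hypothetical small dataset: 64 samples, 3 classes.
X = rng.normal(size=(64, 20))
y = rng.integers(0, 3, size=64)

# Reinitialize only the final fully connected layer.
W_fc = 0.01 * rng.normal(size=(16, 3))

def loss_and_grad(W):
    feats = features(X)
    scores = feats @ W
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(len(y)), y]).mean()  # softmax cross-entropy
    dscores = probs.copy()
    dscores[np.arange(len(y)), y] -= 1.0
    return loss, feats.T @ dscores / len(y)

W_before = W_frozen.copy()
initial_loss, _ = loss_and_grad(W_fc)
for _ in range(200):
    loss, dW = loss_and_grad(W_fc)
    W_fc -= 0.01 * dW  # only the final layer is updated
final_loss, _ = loss_and_grad(W_fc)
print(initial_loss, final_loss)
```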

- "Model Zoo" in deep learning frameworks: pretrained models for different tasks, so you don't need to train your own from scratch.