# Tricks to improve performance
While vanilla neural networks perform well on many tasks, a vast amount of research is currently going into squeezing out those extra percentage points of accuracy. Many techniques are praised as amazing if they can consistently offer even 1% of extra accuracy. That extra accuracy could be what separates a model from human-level or better performance, so make sure you add these techniques once you have a basic model working.

## Vanishing and Exploding gradients
Two key problems that arise when training neural networks are vanishing and exploding gradients. These problems appear in deep networks because the gradients of the first layers are calculated using the chain rule, which multiplies together terms from every later layer. Consider the case where we use a sigmoid activation function with the network shown below. Each time we calculate the gradients for nodes closer to the input, we multiply by another $\sigma'(z_i)$ term. Given that the maximum value the derivative of the sigmoid can take is 0.25, as shown in the graph below, the activation term alone shrinks the gradient by at least a factor of 4 each time we add a layer. The gradient vanishes, which makes learning very slow in the earlier layers. This is especially a problem if the neuron is saturated, meaning its z-value is very high or very low, because the derivative of the sigmoid for such inputs is practically 0. <br> 
exploding gradient = very large weights multiplied together layer after layer --> gradients going to infinity <br> 
vanishing gradient = small activation derivatives multiplied together layer after layer --> gradients go to 0 <br> 
z-score: 
- The standard score (more commonly referred to as a z-score) is a very useful statistic because it (a) allows us to calculate the probability of a score occurring within our normal distribution and (b) enables us to compare two scores that are from different normal distributions.


![](nngrad.png)

$J = \frac{1}{2}(h-y)^2$

$h = \sigma(z_3)$&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$z_3 = w_3a_2$

$a_2 = \sigma(z_2)$&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$z_2 = w_2a_1$

$a_1 = \sigma(z_1)$&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$z_1 = w_1x$

$\frac{\partial J}{\partial w_3} = (h-y)\ \sigma'(z_3)\ a_2$

$\frac{\partial J}{\partial w_2} = (h-y)\ \sigma'(z_3)\ w_3\ \sigma'(z_2)\ a_1$

$\frac{\partial J}{\partial w_1} = (h-y)\ \sigma'(z_3)\ w_3\ \sigma'(z_2)\ w_2\ \sigma'(z_1)\ x$

![](sigderiv.png)

In certain cases, the $w$ terms can become very large and, after being multiplied all the way through the network, produce an extremely large gradient that makes us jump all over the cost surface without ever converging. This is called exploding gradients, and it occurs less frequently than vanishing gradients.
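
To make the shrinking factor concrete, here is a minimal sketch in plain Python that evaluates the three gradient formulas above for the three-layer chain. The input, target, and weight values are arbitrary assumptions chosen only for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # peaks at 0.25 when z = 0

# Arbitrary illustrative values (assumptions, not taken from a real dataset)
x, y = 1.0, 0.0
w1, w2, w3 = 0.5, 0.5, 0.5

# Forward pass through the chain
z1 = w1 * x
a1 = sigmoid(z1)
z2 = w2 * a1
a2 = sigmoid(z2)
z3 = w3 * a2
h = sigmoid(z3)

# Backward pass: each earlier layer picks up one more w * sigma'(z) factor
dJ_dw3 = (h - y) * sigmoid_prime(z3) * a2
dJ_dw2 = (h - y) * sigmoid_prime(z3) * w3 * sigmoid_prime(z2) * a1
dJ_dw1 = (h - y) * sigmoid_prime(z3) * w3 * sigmoid_prime(z2) * w2 * sigmoid_prime(z1) * x

print(dJ_dw3, dJ_dw2, dJ_dw1)
```

With these values, the gradient magnitude drops at every step towards the input, which is the vanishing effect described above.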

## Weights Initialization
Due to the problems explained above, we have to be careful in how we initialize our weights. If all of our weights initially have very large positive or negative values, our neurons will saturate and we will get very small gradients, leading to slow learning or no learning at all. This is another reason to normalize your inputs, as very high and very low inputs will quickly lead to neuron saturation.

We can smartly initialize our weights using Xavier Initialization. This means that each weight in our network is randomly sampled from a Normal distribution with mean 0 and variance $\frac{1}{N_{avg}}$ where $N_{avg}$ is the average of the number of input neurons and output neurons to the layer the weights are being initialized for.

This usually isn't the first thing to add when improving performance as the other techniques described lead to larger gains but this can be good to give a small boost.
- keep gradients at a stable scale: set the variance of the weights to 1/neurons

#### Implementation
We can implement this in PyTorch by creating a new class that inherits from the default layer type we want to initialize and overrides its default reset_parameters function to initialize the weights differently.
<br> The weights are sampled with mean 0 and a variance equal to the inverse of the average number of neurons.


In [0]:
import numpy as np
import torch.nn as nn

class Linear(nn.Linear): #override reset_parameters of nn.Linear
    def reset_parameters(self):
        #Xavier: variance = 1 / N_avg = 2 / (inputs + outputs)
        var = 2 / (self.in_features + self.out_features)
        self.weight.data.normal_(0, float(np.sqrt(var))) #normal_ takes the standard deviation, so we square-root the variance
        if self.bias is not None:
            self.bias.data.zero_()

class Conv2d(nn.Conv2d):
    def reset_parameters(self):
        #same idea, with the channel counts scaled by the kernel area
        var = 2 / ((self.in_channels + self.out_channels) * np.prod(self.kernel_size))
        self.weight.data.normal_(0, float(np.sqrt(var)))
        if self.bias is not None:
            self.bias.data.zero_()
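
As a sanity check on the variance formula, here is a minimal sketch that draws weights with PyTorch's built-in `torch.nn.init.xavier_normal_`, which uses the same standard deviation $\sqrt{2/(N_{in}+N_{out})}$, and compares the empirical variance with the target. The layer sizes are arbitrary assumptions.

```python
import torch

torch.manual_seed(0)

# Hypothetical layer sizes, chosen only so there are enough weights to estimate a variance
layer = torch.nn.Linear(400, 600)
torch.nn.init.xavier_normal_(layer.weight)  # built-in Xavier (Glorot) normal initialization
torch.nn.init.zeros_(layer.bias)

# Variance the formula asks for: 1 / N_avg = 2 / (inputs + outputs)
target_var = 2 / (layer.in_features + layer.out_features)
empirical_var = layer.weight.var().item()

print(target_var, empirical_var)
```

The two numbers should agree closely, which suggests the built-in initializer can be used in place of the subclassing approach when overriding reset_parameters is not needed.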

## Batch Normalization
Usually, we normalize our input features to help our network train faster. Batch-norm extends this trick to the hidden layers by normalizing the z-values output by each layer with respect to the current batch. We then multiply by a learnable parameter $\gamma$ and add a learnable parameter $\beta$, which lets the network learn the best mean and variance for the z-values.
A more in-depth explanation can be found here: https://youtu.be/tNIpEZLv_eg
- without normalization we take much larger steps than necessary in some directions --> normalize the data so that the gradients are on the same scale 
- feature scaling: derivatives on the same scale, so a single learning rate works well for all parameters 
- scale the z-values --> batch normalization 
- take the mean of the z-values, find their standard deviation, and scale the z-values by it --> normalized z-values 
- gamma stretches and beta shifts --> normalization at every layer helps to train deep networks 
- batch normalization = normalize the inputs to each layer by adjusting and scaling the activations

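$z_{norm} = \frac{z-\mu}{\sqrt{\sigma^2+\epsilon}}$

$\tilde{z} = \gamma z_{norm} + \beta$

Here $\mu$ and $\sigma^2$ are the mean and variance of $z$ over the current batch, and $\epsilon$ is a small constant that avoids division by zero.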

This is one of the best tricks to give a big boost to performance especially in deeper architectures.
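
The normalization can be sketched by hand and compared against PyTorch's `torch.nn.BatchNorm1d`. This is a minimal illustration, assuming a made-up batch of 32 examples with 4 neurons, dividing by the batch standard deviation (with a small `eps` for stability), and leaving $\gamma$ and $\beta$ at their initial values of 1 and 0.

```python
import torch

torch.manual_seed(0)

# Hypothetical batch of z-values: 32 examples, 4 neurons, deliberately off-centre
z = torch.randn(32, 4) * 3 + 5

eps = 1e-5                           # same default as torch.nn.BatchNorm1d
mu = z.mean(dim=0)                   # per-neuron mean over the batch
var = z.var(dim=0, unbiased=False)   # per-neuron (biased) variance over the batch
z_norm = (z - mu) / torch.sqrt(var + eps)

gamma = torch.ones(4)                # learnable stretch, initial value
beta = torch.zeros(4)                # learnable shift, initial value
z_tilde = gamma * z_norm + beta

bn = torch.nn.BatchNorm1d(4)         # modules start in training mode, so batch statistics are used
reference = bn(z)

print(torch.allclose(z_tilde, reference, atol=1e-5))
```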

#### Implementation
We implement this by defining batch-norm operations, which we apply after each layer except the output layer. The batch-norm operation takes the number of filters as its argument for a 2d batch-norm (after a convolutional layer), or the number of neurons for a 1d batch-norm (after a dense layer).

In [1]:
import torch
import torch.nn.functional as F

class Convnet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = torch.nn.Conv2d(1, 64, kernel_size=4, stride=2, padding=1) #assuming 1x28x28 inputs, this outputs 64x14x14
        self.dense1 = torch.nn.Linear(64*14*14, 256)
        self.dense2 = torch.nn.Linear(256, 1)
        #batch norm adds learnable parameters (gamma, beta) to the model, like an extra "layer"
        self.bn1 = torch.nn.BatchNorm2d(64) #normalizes each of the 64 filters
        self.bn2 = torch.nn.BatchNorm1d(256) #normalizes each of the 256 neurons

    def forward(self, x):
        x = F.relu(self.bn1(self.conv1(x))).view(-1, 64*14*14) #flatten for the dense layers
        x = F.relu(self.bn2(self.dense1(x)))
        x = torch.sigmoid(self.dense2(x)) #no batch norm on the output layer
        return x


## Activation functions
We can sometimes significantly improve performance by experimenting with different activation functions. ReLU has been in the spotlight in recent years, often achieving the best performance, but new ones are always being researched. Take, for example, the SELU activation, which, according to the paper I read, shattered some benchmarks and is easy to implement. It works because the function gives the layers of the network self-normalizing properties.<br>
selu = scaled exponential linear unit 
- looks like a relu whose negative side is a smooth curve 
- difference between relu and selu: relu outputs 0 for negative inputs, while selu outputs a scaled exponential that levels off at $-\lambda\alpha$ 

$selu(x) = \lambda \begin{cases}x & x > 0 \\ \alpha e^x - \alpha & x \leq 0 \end{cases}$

#### Implementation
We simply define the new activation function and apply it on our forward pass.<br> 
scale and alpha: fixed constants taken from the SELU paper, chosen so that the activations keep zero mean and unit variance 


In [0]:
import torch

class Net(torch.nn.Module):
    def __init__(self):
        super().__init__() #call parent class initializer
        self.h1 = torch.nn.Linear(30, 10) #input layer to size 10 hidden layer
        self.out = torch.nn.Linear(10, 1) #hidden layer to single output

    #define the forward propagation/prediction equation of our model
    def forward(self, x):
        x = self.h1(x) #linear combination
        x = self.selu(x) #activation
        x = self.out(x) #linear combination
        x = torch.sigmoid(x) #activation
        return x

    def selu(self, x):
        scale = 1.0507009873554804934193349852946
        alpha = 1.6732632423543772848170429916717
        #build a new tensor rather than assigning through masks in place, which fails on shapes and breaks autograd
        negative = alpha * (torch.clamp(x, max=0).exp() - 1) #clamp so exp never overflows for large positive x
        return scale * torch.where(x > 0, x, negative)
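
As a quick check, a standalone version of the activation can be compared with PyTorch's built-in `torch.nn.functional.selu`. This is a minimal sketch; the test inputs are arbitrary, and the standalone function here is an illustrative rewrite rather than the method from the cell above.

```python
import torch
import torch.nn.functional as F

def selu(x):
    # Constants as given in the equation above (lambda = scale)
    scale = 1.0507009873554804934193349852946
    alpha = 1.6732632423543772848170429916717
    # Negative branch: alpha * e^x - alpha, evaluated only on clamped (non-positive) values
    negative = alpha * torch.expm1(torch.clamp(x, max=0))
    return scale * torch.where(x > 0, x, negative)

x = torch.linspace(-3, 3, 7)  # arbitrary inputs covering both branches
print(selu(x))
print(F.selu(x))
```

If the two printed tensors match, the built-in function can be used directly in the forward pass instead of a hand-written one.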

## Regularization
Regularization techniques are used to prevent over-fitting. This is when the model fits the training data too closely and so does not generalize well to new examples, giving significantly worse performance on the test set. The two main regularization methods in common use are drop-out and L1/L2.

In drop-out, we set the activation of each neuron in a given layer to 0 with probability q. This random process of zeroing the output of certain neurons leads the network to learn a robust representation of the data. The way I think of it is that the network learns to find alternative paths to the same result and does not depend too much on any single neuron.

In L1/L2, we add the magnitude (L1) or squared magnitude (L2) of our parameters to the cost function. If you imagine the case where we are regressing on higher-order powers of our features, we are asking the network to keep the parameter values as small as possible while maximizing accuracy. This means that, for example, the $x^3$ coefficient for data generated by an $x^2$ function will, given the necessary training time, shrink to approximately 0, as it is not contributing much to accuracy. We are essentially asking the network to find the simplest function that fits the data rather than just any function.
- over-fitting of the function, e.g. a higher-order polynomial 
- model over-fits the data --> puts a lot of weight on high-power polynomial terms 
- too much capacity in the model --> it just fits the training data
- need to penalize parameter values that are too high 
- symptom: high accuracy on the training set and low accuracy on the test set 
<br>
cost: $J = \frac{1}{2M}\sum (h-y)^2 + \lambda\|W\|^2$, where $\lambda$ is the coefficient that scales the L2 penalty <br> 
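
The regularized cost can be written out in a few lines of plain Python. This is a minimal sketch, assuming a mean-squared-error data term and a coefficient named `lam` (a hypothetical name) that scales the squared-weight penalty. The predictions, labels, and weights are made-up numbers.

```python
def l2_regularized_cost(h, y, weights, lam):
    """Squared-error cost over M examples plus lam times the sum of squared weights."""
    m = len(y)
    data_term = sum((hi - yi) ** 2 for hi, yi in zip(h, y)) / (2 * m)
    penalty = lam * sum(w ** 2 for w in weights)
    return data_term + penalty

# Made-up predictions, labels, and weights for illustration
h = [0.9, 0.2, 0.4]
y = [1.0, 0.0, 0.0]
weights = [3.0, -4.0]

print(l2_regularized_cost(h, y, weights, lam=0.0))   # data term only
print(l2_regularized_cost(h, y, weights, lam=0.01))  # data term plus 0.01 * (9 + 16)
```

Because the penalty grows with the square of each weight, large weights are only worth keeping if they reduce the data term by more than they add to the penalty.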



#### Implementation
Implementation of both these techniques is very simple. 

For L2, you simply pass the extra keyword argument weight_decay to the optimizer, which specifies the coefficient that the L2 term is multiplied by before being added to the cost.

For drop-out, you simply add a dropout layer to the network and apply it on the forward pass, where it randomly zeros some of its inputs. The dropout layer takes as its argument the probability of zeroing any individual neuron. We must be careful to put our model in evaluation mode before testing, as we only want to randomly zero values during training, not testing.

In [0]:
import torch
import torch.nn.functional as F

# Drop-out
class Net(torch.nn.Module):
    def __init__(self):
        super().__init__() #call parent class initializer
        self.h1 = torch.nn.Linear(30, 10) #input layer to size 10 hidden layer
        self.dropout = torch.nn.Dropout(p=0.5) #probability of any input being zeroed
        self.out = torch.nn.Linear(10, 1) #hidden layer to single output

    #define the forward propagation/prediction equation of our model
    def forward(self, x):
        x = self.h1(x) #linear combination
        x = F.relu(x) #activation
        x = self.dropout(x) #randomly zero activations (training mode only)
        x = self.out(x) #linear combination
        x = torch.sigmoid(x) #activation
        return x

model = Net() #the model must exist before its parameters are handed to the optimizer

# L2 regularization
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=0.00001)

model.train() #dropout is active while training
model.eval() #dropout is disabled for testing
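
The difference between training and evaluation mode can be seen directly on a standalone dropout layer. This is a minimal sketch with an arbitrary input of ones. In training mode PyTorch also scales the surviving values by $1/(1-p)$ so that the expected activation stays the same.

```python
import torch

torch.manual_seed(0)

dropout = torch.nn.Dropout(p=0.5)
x = torch.ones(1000)  # arbitrary input, all ones so the effect is easy to read

dropout.train()            # training mode: values are zeroed at random
train_out = dropout(x)     # roughly half zeros, survivors scaled to 1 / (1 - 0.5) = 2

dropout.eval()             # evaluation mode: dropout does nothing
eval_out = dropout(x)

zero_fraction = (train_out == 0).float().mean().item()
print(zero_fraction)
print(torch.equal(eval_out, x))
```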

## Normalization
Normalization increases how fast your model learns, as it ensures that the gradients w.r.t. each parameter are of a similar order of magnitude. It is not really a trick but more of a standard practice that you should follow every time. I have decided to include it here as it is a very simple technique that offers huge gains but can be forgotten at times.

#### Implementation
Implementation is as simple as including one extra line.

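In [0]:
import numpy as np
import torch

#df is assumed to be a pandas DataFrame loaded earlier, with the label in the last column
X = torch.Tensor(np.array(df[df.columns[0:-1]])) #pick our features from our dataset
Y = torch.Tensor(np.array(df[df.columns[-1]])) #select our label

X = (X-X.mean(0)) / X.std(0) #normalize each feature to mean 0 and standard deviation 1 along the 0th axis 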
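
When there is a separate test set, a common practice is to compute the mean and standard deviation on the training features only and reuse them for the test features. The sketch below illustrates this under assumed, randomly generated data. The names `X_train` and `X_test` and the feature scales are hypothetical.

```python
import torch

torch.manual_seed(0)

# Hypothetical features on very different scales (generated data, for illustration only)
scales = torch.tensor([1.0, 50.0, 0.1])
offsets = torch.tensor([0.0, 100.0, 5.0])
X_train = torch.randn(200, 3) * scales + offsets
X_test = torch.randn(50, 3) * scales + offsets

# Statistics come from the training set only
mean = X_train.mean(0)
std = X_train.std(0)

X_train_norm = (X_train - mean) / std
X_test_norm = (X_test - mean) / std  # reuse the training statistics on the test set

print(X_train_norm.mean(0), X_train_norm.std(0))
```

After this step every training feature has mean close to 0 and standard deviation close to 1, regardless of its original scale.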

## Loss function
Last but not least, ensure that you are working with the appropriate loss function. This one is not a technique but more of a sanity check. Even though many different loss functions may work for a single problem, there is often one that is more suitable for the problem and causes the model to converge more quickly.
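
One way to see why the choice matters is to compare gradients for a binary classifier with a sigmoid output. The minimal sketch below assumes a single example with label 1 whose pre-activation is a confidently wrong -5, and uses autograd to compare mean squared error with binary cross-entropy. The specific numbers are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

y = torch.tensor([1.0])  # true label

# Mean squared error on a saturated, wrong prediction
z_mse = torch.tensor([-5.0], requires_grad=True)
F.mse_loss(torch.sigmoid(z_mse), y).backward()

# Binary cross-entropy on the same prediction
z_bce = torch.tensor([-5.0], requires_grad=True)
F.binary_cross_entropy(torch.sigmoid(z_bce), y).backward()

print(z_mse.grad, z_bce.grad)
```

With mean squared error the gradient still contains the tiny $\sigma'(z)$ factor of the saturated neuron, while with cross-entropy that factor cancels, so the badly wrong prediction gets a much larger corrective gradient.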