# Tuning Hyper-Parameters
There are a great number of hyperparameters that have to be tuned for deep neural networks:

- alpha (learning_rate)
- beta (momentum)
- beta1, beta2, epsilon (Adam optimization)
- number of iterations
- number of layers
- number of hidden units
- learning rate decay
- mini-batch size

The importance of these hyperparameters varies based on the specific network application. Generally, they can be grouped into different tiers of importance:

- Group A (most important): alpha
- Group B (secondary importance): beta (momentum), number of hidden units (in each hidden layer), mini-batch size
- Group C (once A and B have been addressed): number of layers, learning rate decay 
- Group D (rarely tuned): beta1, beta2, epsilon (Adam optimization)

## Random Sampling
When testing different hyperparameters, avoid a strict grid of evenly spaced values. Instead, sample random values within the same ranges. This helps because it is hard to predict in advance which hyperparameters will matter most. On a grid, each hyperparameter only takes a handful of distinct values, so if one of them has minimal impact, many of the trained networks are effectively duplicates. Random sampling gives every trial a different value for each hyperparameter, so the important ones are explored much more densely.
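As a rough illustration of the point above, the sketch below compares 25 grid trials with 25 random trials over two hyperparameters. The ranges and the choice of alpha and hidden units are assumptions made for the example, not recommendations.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Grid search: 5 x 5 = 25 trials, but only 5 distinct values of alpha.
grid_alpha = np.logspace(-4, 0, 5)
grid_hidden = np.linspace(50, 250, 5).astype(int)
grid_trials = [(a, h) for a in grid_alpha for h in grid_hidden]

# Random search: 25 trials, each with its own freshly sampled alpha.
random_trials = [
    (10 ** rng.uniform(-4, 0), int(rng.integers(50, 251)))
    for _ in range(25)
]

distinct_alphas_grid = len({a for a, _ in grid_trials})
distinct_alphas_random = len({a for a, _ in random_trials})
print(f"distinct alpha values, grid:   {distinct_alphas_grid}")
print(f"distinct alpha values, random: {distinct_alphas_random}")
```

If alpha turns out to be the hyperparameter that matters, the grid has only explored 5 settings of it, while the random search has explored 25 for the same training budget.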

## Coarse to Fine
After initial exploration with a wide range of hyperparameters, focus on the areas with more optimal values. "Zoom in" on that section and perform another round of hyperparameter testing around the identified region. This method allows you to find a more suitable configuration for the network.
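A minimal sketch of the coarse-to-fine idea, assuming a single hyperparameter (alpha) and a made-up `toy_validation_loss` function that stands in for actually training and evaluating a network:

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_validation_loss(alpha):
    """Hypothetical stand-in for training a model; lowest loss near alpha = 0.01."""
    return (np.log10(alpha) + 2.0) ** 2

def search(low_exp, high_exp, num_trials):
    """Sample alpha between 10**low_exp and 10**high_exp; return the best one."""
    alphas = 10 ** rng.uniform(low_exp, high_exp, size=num_trials)
    losses = toy_validation_loss(alphas)
    return alphas[np.argmin(losses)]

# Coarse pass over four orders of magnitude.
coarse_best = search(-4.0, 0.0, num_trials=20)

# Fine pass: zoom in to half an order of magnitude either side of the coarse winner.
center = np.log10(coarse_best)
fine_best = search(center - 0.5, center + 0.5, num_trials=20)

print(f"coarse best alpha: {coarse_best:.5f}")
print(f"fine best alpha:   {fine_best:.5f}")
```

The width of the zoomed-in window (half an order of magnitude here) is an arbitrary choice for the sketch; in practice it depends on how sharply the results vary around the best coarse value.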

## Choosing the Right Hyperparameter Scale for Sampling
When sampling possible hyperparameters, it is essential to pick an appropriate scale for each one. Sampling uniformly across the expected range works well for the number of layers or the number of units within each layer. However, this approach is not suitable for many other hyperparameters.

For instance, if searching for an alpha value between 0.0001 and 1, sampling uniformly within this range would put about 90% of the values between 0.1 and 1.0 and fewer than 0.1% between 0.0001 and 0.001. The search would barely test the small values. When searching for alpha, sample on a logarithmic scale instead, so that values are spread evenly across 0.0001-0.001, 0.001-0.01, 0.01-0.1, and 0.1-1.0.

In Python, this can be achieved using the following method:

```python
import numpy as np

def create_random_exp_dist(min_value=0.0001, max_value=1.0, num_samples=10, print_values=True):
    """
    Create an array of values sampled uniformly on a log10 scale (log-uniform).

    Args:
        min_value (float): The minimum value in the distribution (must be > 0).
        max_value (float): The maximum value in the distribution.
        num_samples (int): The number of samples to generate.
        print_values (bool): Whether to print the sampled values.

    Returns:
        np.ndarray: An array of size (num_samples) containing the sampled values.
    """
    # Determine the minimum and maximum exponents (base 10)
    min_exponent = np.log10(min_value)
    max_exponent = np.log10(max_value)

    # Generate random exponents uniformly in the specified range
    random_exponents = min_exponent + (max_exponent - min_exponent) * np.random.rand(num_samples)

    # Convert the exponents back into values
    exp_dist_values = 10 ** random_exponents

    # Print values if specified
    if print_values:
        print(f"For max value of {max_value}, our exponent is {max_exponent}")
        print(f"For min value of {min_value}, our exponent is {min_exponent}")
        print(f"Our first ten exponential distribution values are: {exp_dist_values[:10]}")

    return exp_dist_values


create_random_exp_dist()
print("-" * 40)
create_random_exp_dist(min_value=0.000043, max_value=12.5, num_samples=10000)
print("-" * 40)
```

```text
For max value of 1.0, our exponent is 0.0
For min value of 0.0001, our exponent is -4.0
Our first ten exponential distribution values are: [1.20002394e-01 2.52741958e-02 2.27140645e-04 1.40011736e-04
 9.65885082e-03 3.79918801e-03 2.28542382e-04 1.18148364e-04
 2.67373721e-02 9.02807722e-02]
----------------------------------------
For max value of 12.5, our exponent is 1.0969100130080565
For min value of 4.3e-05, our exponent is -4.366531544420414
Our first ten exponential distribution values are: [0.01053281 0.00078034 0.04804891 0.00726155 0.00387039 0.00044357
 0.08124986 0.09002199 0.00011992 0.0001027 ]
----------------------------------------
```


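## Hyperparameters for Exponentially Weighted Averages
Let's say you want beta (the averaging weight) to be between 0.9 and 0.999. This roughly equates to averaging over the last 10 values (with 0.9) and the last 1000 values (with 0.999). This is another case where sampling on a linear scale is a poor fit, because results become far more sensitive as beta approaches 1. The quantity to spread across orders of magnitude is 1 - beta (from 0.1 down to 0.001): sample 1 - beta on a log scale and subtract the result from 1. Note that calling the function above directly on the range 0.9 to 0.999, as in the cell below, yields values that are almost uniformly spread, because that range covers far less than one order of magnitude.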

```python
create_random_exp_dist(min_value=0.9, max_value=0.999)
```

```text
For max value of 0.999, our exponent is -0.0004345117740176917
For min value of 0.9, our exponent is -0.045757490560675115
Our first ten exponential distribution values are: [0.9300954  0.902647   0.99821746 0.90510451 0.97527567 0.91832005
 0.98641899 0.94431407 0.96164416 0.96337401]

array([0.9300954 , 0.902647  , 0.99821746, 0.90510451, 0.97527567,
       0.91832005, 0.98641899, 0.94431407, 0.96164416, 0.96337401])
```
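The values above are spread almost evenly between 0.9 and 0.999, because that range spans much less than one order of magnitude. A common alternative is to sample 1 - beta on a log scale (between 0.001 and 0.1) and convert back. The sketch below shows one way to do this; `sample_beta` and its `seed` parameter are hypothetical names introduced only for this example.

```python
import numpy as np

def sample_beta(min_beta=0.9, max_beta=0.999, num_samples=10, seed=None):
    """Sample beta so that 1 - beta is uniform on a log10 scale."""
    rng = np.random.default_rng(seed)
    # 1 - beta runs from 1 - max_beta (about 0.001) to 1 - min_beta (about 0.1)
    low_exponent = np.log10(1.0 - max_beta)
    high_exponent = np.log10(1.0 - min_beta)
    one_minus_beta = 10 ** rng.uniform(low_exponent, high_exponent, size=num_samples)
    return 1.0 - one_minus_beta

betas = sample_beta(num_samples=10000, seed=0)
fraction_above_099 = np.mean(betas > 0.99)
print(f"fraction of samples with beta > 0.99: {fraction_above_099:.2f}")
```

With this scheme roughly half of the samples land between 0.99 and 0.999, whereas sampling beta itself (linearly or with the function above) would put only about a tenth of them there.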

## Organizing the Hyperparameter search process
**IMPORTANT**: hyperparameters can become stale (less effective over time), so it is a good idea to re-test them every few months to make sure the model still performs well with its current settings.

- One approach is to babysit a single model: monitor its training and adjust the hyperparameters as it goes
  - this is the Panda approach (one baby at a time)
  - it can be the better option when you are limited by computational power
- Another approach is to train multiple models in parallel, each with different hyperparameters
  - this is the Caviar approach (spawn thousands of eggs at a time)
  - it works well when you have huge amounts of processing power

## Batch Normalization
Batch normalization (BN) is a technique introduced by Sergey Ioffe and Christian Szegedy in 2015 to improve the training of deep neural networks. The main idea behind batch normalization is to normalize the activations of each layer to have zero mean and unit variance. This helps address the issue of internal covariate shift, which occurs when the distribution of inputs to a given layer changes during training. By reducing internal covariate shift, batch normalization allows the network to be trained faster and with larger learning rates, improving the overall performance.

Batch normalization makes the hyperparameter search easier, makes the network more robust to the choice of hyperparameters, and makes deep networks easier to train.

This is how normalizing the input features (which speeds up gradient descent) works:
- `mean = 1/m * np.sum(X)`
- `X = X - mean`
- `variance = 1/m * np.sum(X ** 2)` (element-wise square)
- `X = X / np.sqrt(variance)`
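The steps above can be sketched in NumPy as follows. The layout of `X` (one row per feature, one column per example) and the synthetic data are assumptions made for the example; note that the final step divides by the standard deviation, that is, the square root of the variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed layout: X has shape (n_features, m), one column per training example.
X = rng.normal(loc=5.0, scale=3.0, size=(2, 1000))

# Subtract the per-feature mean.
mean = X.mean(axis=1, keepdims=True)
X_centered = X - mean

# Divide by the per-feature standard deviation.
variance = (X_centered ** 2).mean(axis=1, keepdims=True)
X_norm = X_centered / np.sqrt(variance)

print("means after normalization:    ", X_norm.mean(axis=1))
print("variances after normalization:", X_norm.var(axis=1))
```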

Now, what if we normalized the input values to other hidden layers? (e.g. normalizing a[2] for the third layer of the network)

We CAN!!!!! And it's called batch normalization.

There are two options: you can normalize either A[2] (after the activation function) or Z[2] (before it). Normalizing Z is more common and should be the default choice. For a layer l, compute the mean and variance over the mini-batch:
- `M = 1/m * np.sum(Z[l](i))`
- `G2 = 1/m * np.sum((Z[l](i) - M) ** 2)`

Rather than using the normalized value directly:
- `Znorm[l](i) = (Z[l](i) - M) / np.sqrt(G2 + epsilon)`

we scale and shift it:
- `Z~[l](i) = gamma * Znorm[l](i) + beta`
- gamma and beta are learnable parameters of the model (trained by gradient descent like the weights), not hyperparameters

If gamma is equal to `np.sqrt(G2 + epsilon)` and beta is equal to M, this transformation reduces to the identity:
- `Z~[l](i) = Z[l](i)`

Batch norm is typically applied to each mini-batch independently. This has some specific consequences:
- b[l] (as in the equation `Z[l] = W[l]*A[l-1] + b[l]`) can be removed, because any constant added to Z[l] is cancelled out when the mean is subtracted; beta[l] takes over its role.
- After computing Znorm[l], you compute the scaled and shifted value with this formula:
- `Z~[l] = gamma[l] * Znorm[l] + beta[l]`

In short, you replace Z[l] with Z~[l] and also compute dBeta and dGamma during backprop.
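A minimal NumPy sketch of the forward pass described above, for one layer and one mini-batch. The function name `batch_norm_forward`, the array layout (one row per hidden unit, one column per example), and the synthetic `Z` are assumptions for illustration; the backward pass is not shown.

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, epsilon=1e-8):
    """Z has shape (n_units, m); gamma and beta have shape (n_units, 1)."""
    M = Z.mean(axis=1, keepdims=True)                 # per-unit mini-batch mean
    G2 = ((Z - M) ** 2).mean(axis=1, keepdims=True)   # per-unit mini-batch variance
    Z_norm = (Z - M) / np.sqrt(G2 + epsilon)
    Z_tilde = gamma * Z_norm + beta                   # learnable scale and shift
    return Z_tilde, M, G2

rng = np.random.default_rng(0)
Z = rng.normal(loc=3.0, scale=2.0, size=(4, 64))
ones, zeros = np.ones((4, 1)), np.zeros((4, 1))

# gamma = 1 and beta = 0 give plain zero-mean, unit-variance outputs.
Z_tilde, M, G2 = batch_norm_forward(Z, ones, zeros)

# gamma = sqrt(G2 + epsilon) and beta = M recover the original Z.
Z_identity, _, _ = batch_norm_forward(Z, np.sqrt(G2 + 1e-8), M)

# Adding a constant (like the bias b) to Z does not change the output.
Z_with_bias, _, _ = batch_norm_forward(Z + 10.0, ones, zeros)

print(np.allclose(Z_identity, Z), np.allclose(Z_with_bias, Z_tilde))
```

The last two checks illustrate two claims from the text: the scale and shift can undo the normalization, and the bias term is redundant once the mean is subtracted.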

### But how does it work?
It makes weights later in the network more robust to changes from layers earlier in the network.

By normalizing the inputs to the batch-normalized layer, you ensure that the mean and variance of its input data remain the same, even though the data itself is shifting around. This limits how much changes in earlier layers can affect the normalized layer.

Here are the key conceptual frameworks behind batch normalization:

**Normalize activations**: For each layer, BN normalizes the activations before they are fed into the activation function. This is done by computing the mean and variance of the activations within a mini-batch and then normalizing the activations using these statistics.

**Scale and shift**: After normalization, BN introduces two learnable parameters, gamma (𝛾) and beta (𝛽), for each activation. These parameters allow the model to learn the optimal scale and shift for the normalized activations. The scaling and shifting step helps preserve the representational power of the network since it allows the model to learn different activation distributions if required.

**During inference**: At test time, the batch normalization layer uses the running average of the mean and variance computed during training instead of the mini-batch statistics. This ensures a more stable and accurate estimation of the normalization parameters for the input data.

The benefits of batch normalization include:

- **Faster training**: By normalizing the activations, BN helps gradients flow better through the network, allowing for faster convergence and enabling the use of larger learning rates.
- **Regularization effect**: BN adds a small amount of noise to the network during training, which can have a slight regularization effect, similar to dropout.
- **Reduced sensitivity to weight initialization**: With BN, the network becomes less sensitive to the initial choice of weights, making it easier to train deep networks from scratch.

Overall, batch normalization has become a standard technique in deep learning models due to its ability to improve training speed and stability, allowing for deeper and more complex architectures.

The later layers are less forced to adapt to changes in earlier layers, which decouples them from the earlier layers and frees them up to "figure out more and different things".

Each layer is then able to learn more independently of the others, which speeds up the learning process =)

## Batch Norm as regularization
- Each batch is scaled by the mean/variance computed just on a single mini-batch
- This adds some noise to the values z[l] within that minibatch. So similar to dropout, it adds some noise to each hidden layer's activations
- This has a slight regularization effect (like dropout) which prevents the overfitting of data.

Don't use batch norm as a regularization method; the slight regularization is just an unintended side effect.

## Moving from Training with Batch Norm to testing
During the training phase, BatchNorm calculates the mean and variance of the input data within each mini-batch for each layer. It then normalizes the input data by subtracting the mini-batch mean and dividing by the square root of the mini-batch variance plus a small constant (epsilon) for numerical stability. After normalization, the input data is rescaled and shifted using learnable parameters called gamma (scale) and beta (shift). These parameters are learned during training, along with the other model parameters.

During the test phase, you may need to process examples one at a time, so a mini-batch mean and variance may not be available or meaningful. Instead, the goal is to use an estimate of the population statistics. To achieve this, the mean and variance computed on each mini-batch during training are tracked with a running (exponentially weighted) average across mini-batches. These running averages of mean and variance are then used for normalization during the test phase. The learnable parameters gamma and beta are applied in the same way as in the training phase.
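A rough sketch of how the running statistics might be tracked during training and reused at test time. The `momentum` value of 0.9, the synthetic mini-batches, and the fixed `gamma` and `beta` are assumptions made for the example; frameworks differ in the exact update rule they use.

```python
import numpy as np

rng = np.random.default_rng(0)
momentum = 0.9   # assumed decay rate for the running averages
epsilon = 1e-8
n_units = 3

running_mean = np.zeros((n_units, 1))
running_var = np.ones((n_units, 1))

# Training: update the running statistics after every mini-batch.
for _ in range(200):
    Z_batch = rng.normal(loc=2.0, scale=3.0, size=(n_units, 64))
    batch_mean = Z_batch.mean(axis=1, keepdims=True)
    batch_var = Z_batch.var(axis=1, keepdims=True)
    running_mean = momentum * running_mean + (1 - momentum) * batch_mean
    running_var = momentum * running_var + (1 - momentum) * batch_var

# Test time: normalize a single example with the running statistics.
gamma = np.ones((n_units, 1))   # would be learned during training
beta = np.zeros((n_units, 1))   # would be learned during training
z_single = rng.normal(loc=2.0, scale=3.0, size=(n_units, 1))
z_norm = (z_single - running_mean) / np.sqrt(running_var + epsilon)
z_tilde = gamma * z_norm + beta

print("running mean:", running_mean.ravel())
print("running var: ", running_var.ravel())
```

Because the statistics come from many mini-batches, a single test example can be normalized without needing a batch of its own.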

## Random Notes about "e"
In mathematics, "e" is a fundamental constant known as Euler's number or the base of the natural logarithm. It is an irrational number, meaning that it cannot be represented as a simple fraction, and its decimal representation never terminates or repeats. The approximate value of e is 2.718281828459045.

Euler's number is important in various mathematical contexts, including calculus, complex analysis, number theory, and even probability. One of the most significant appearances of e is in the exponential function, f(x) = e^x, which has the unique property of being its own derivative and integral.

e also appears in the natural logarithm function, denoted as ln(x), which is the inverse of the exponential function with base e. The natural logarithm has many useful properties and is widely used in calculus, physics, engineering, and other fields.

Additionally, e is involved in the famous Euler's formula, which connects the exponential function with trigonometric functions:

e^(ix) = cos(x) + i*sin(x),

where i is the imaginary unit, and x is a real number. This formula helps bridge the gap between real and complex numbers and plays a vital role in complex analysis.
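Euler's formula can be checked numerically with Python's standard library. The value `x = 0.75` is an arbitrary choice for this small sketch.

```python
import cmath
import math

x = 0.75  # any real number works here

left = cmath.exp(1j * x)                     # e^(ix)
right = complex(math.cos(x), math.sin(x))    # cos(x) + i*sin(x)
difference = abs(left - right)

# Special case x = pi: e^(i*pi) + 1 is zero up to floating-point rounding.
identity_error = abs(cmath.exp(1j * math.pi) + 1)

print(f"|e^(ix) - (cos x + i sin x)| = {difference:.2e}")
print(f"|e^(i*pi) + 1| = {identity_error:.2e}")
```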

## Random notes about derivatives
A derivative is a fundamental concept in calculus that represents the rate of change of a function with respect to its independent variable. In other words, it measures how a function's output value changes as its input value changes. Derivatives are used to analyze various aspects of functions, such as slopes of tangent lines, local extrema (maximum and minimum values), inflection points, and rates of change in various applied fields like physics and economics.

The derivative of a function f(x) with respect to x is often denoted as f'(x) or df/dx. To find the derivative of a function, you can use various rules and techniques, including the power rule, product rule, quotient rule, and chain rule, among others.

For example, let's consider the function f(x) = x^2. The derivative of this function with respect to x is:

f'(x) = 2x

This means that the rate of change of the function x^2 is 2x at any point x. The derivative also represents the slope of the tangent line to the curve defined by the function at a given point. In this case, the slope of the tangent line to the curve y = x^2 at any point x is 2x.
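The result f'(x) = 2x can be sanity-checked with a central-difference approximation. The helper `numerical_derivative` and the step size `h` are illustrative choices, not part of the original notes.

```python
def f(x):
    return x ** 2

def numerical_derivative(func, x, h=1e-5):
    """Central-difference estimate of the slope of func at x."""
    return (func(x + h) - func(x - h)) / (2 * h)

for x in (-2.0, 0.0, 3.0):
    estimate = numerical_derivative(f, x)
    print(f"x = {x:5.1f}   numerical slope = {estimate:.6f}   2x = {2 * x:.6f}")
```

The same kind of comparison is often used as a "gradient check" when debugging a backpropagation implementation.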

Derivatives have numerous applications, such as finding the velocity and acceleration of objects in physics, determining the profit or cost optimization in economics, and analyzing the behavior of functions in mathematics.

## Random notes about Integrals
An integral is a fundamental concept in calculus that represents the area under a curve, accumulation of a quantity, or the signed area between a curve and the horizontal axis over a given interval. Integrals can be used to solve a wide range of problems, including finding the area of irregular shapes, determining the length of curves, and calculating the total accumulated change in physical quantities such as velocity or displacement in physics.

There are two main types of integrals: definite integrals and indefinite integrals.

Definite Integral: A definite integral represents the signed area under a curve within a specified interval [a, b] on the x-axis. It is denoted as:
∫[a, b] f(x) dx

where f(x) is the function being integrated, and 'a' and 'b' are the limits of integration. The result of a definite integral is a numerical value.

Indefinite Integral: An indefinite integral, also known as an antiderivative, represents a family of functions whose derivative is the given function. It is denoted as:
∫f(x) dx = F(x) + C

where f(x) is the function being integrated, F(x) is the antiderivative of f(x), and C is the constant of integration. The result of an indefinite integral is a function or a family of functions.

Integration can be thought of as the reverse process of differentiation. While differentiation breaks down a function into its instantaneous rates of change, integration accumulates these rates of change to reconstruct the original function (up to a constant term in the case of indefinite integrals).
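As a small worked example of that relationship, the sketch below integrates f(x) = 2x over [0, 3] with the trapezoid rule and compares the result with F(3) - F(0), where F(x) = x^2 is an antiderivative of f. The function names and the number of steps are choices made for the example.

```python
def f(x):
    return 2 * x

def F(x):
    return x ** 2  # an antiderivative of f

def trapezoid_integral(func, a, b, num_steps=1000):
    """Approximate the definite integral of func over [a, b]."""
    width = (b - a) / num_steps
    total = 0.5 * (func(a) + func(b))
    for k in range(1, num_steps):
        total += func(a + k * width)
    return total * width

area = trapezoid_integral(f, 0.0, 3.0)
exact = F(3.0) - F(0.0)
print(f"numerical area = {area:.6f}, F(3) - F(0) = {exact:.6f}")
```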