# init

> The `init` module in the `minima` (mi) library provides a suite of tensor initialization functions to create and initialize tensors in various ways. Each function in this module represents a different strategy for initializing the values of a tensor, such as uniform or normal random values, constant values, or specialized initializations like Xavier or Kaiming methods.

> These initialization methods serve as the starting point for the optimization process in neural networks, setting the stage for gradient descent and other optimization methods to fine-tune the model's parameters during training. Carefully chosen initial values can significantly influence the training dynamics and the final performance of a model.
The `init` module therefore supplies the first step of training a neural network: it bridges model definition and the iterative process of learning from data by giving the model's parameters concrete starting values.

In [None]:
#| default_exp init

In [None]:
#| export
import math
import minima as mi

1. **`rand`**: This function generates a tensor filled with random numbers drawn from a uniform distribution between `low` and `high` (defaulting to 0 and 1). It does this by creating an array of random values on the specified device (defaulting to CPU), then scales and shifts these values to the correct range. The result is wrapped in a `mi.Tensor` object, which supports automatic differentiation if `requires_grad` is True.

In [None]:
#| export
def rand(
    *shape, # The shape of the output tensor. Variable length argument list. 
    low=0.0, # Lower bound of the uniform distribution. Default is 0.0.
    high=1.0, # Upper bound of the uniform distribution. Default is 1.0.
    device=None, # The device where the tensor will be allocated. Default is CPU.
    dtype='float32', # The data type of the tensor. Default is 'float32'.
    requires_grad=False # If True, the tensor is created with gradient tracking. Default is False.
):
    """
    Generates a tensor with random numbers uniformly distributed between `low` and `high`.

    Parameters
    ----------
    *shape : int
    low : float, optional
    high : float, optional
    device : Device, optional
    dtype : str, optional
    requires_grad : bool, optional
    
    Returns
    -------
    mi.Tensor
        A tensor of shape `shape`, filled with random numbers from the uniform distribution between `low` and `high`.

    """
    device = mi.cpu() if device is None else device
    array = device.rand(*shape) * (high - low) + low
    return mi.Tensor(array, device=device, dtype=dtype, requires_grad=requires_grad)


In [None]:
rand(10,5)

minima.Tensor([[7.83526659e-01 3.13296258e-01 5.32577574e-01 2.27785841e-01
  6.73412919e-01]
 [6.40811443e-01 5.99687219e-01 5.97387493e-01 7.08121240e-01
  9.55171525e-01]
 [1.16244555e-01 1.26897478e-02 3.50459933e-01 6.54483795e-01
  3.78269851e-01]
 [6.05518281e-01 8.89193788e-02 6.42256975e-01 6.28048480e-02
  5.76855242e-01]
 [8.49515557e-01 1.44303188e-01 9.66744244e-01 4.36203182e-01
  2.72257447e-01]
 [6.02679014e-01 1.71971247e-01 6.67142749e-01 5.52026671e-04
  5.23647010e-01]
 [1.83930516e-01 2.19278708e-01 2.65353024e-01 3.98990422e-01
  5.83426416e-01]
 [5.73141694e-01 4.54402059e-01 9.81765151e-01 4.20937479e-01
  8.24222863e-01]
 [1.53790116e-01 5.16592562e-01 4.47600335e-01 3.45737524e-02
  6.62555993e-01]
 [9.63034093e-01 9.91501689e-01 3.56508255e-01 1.41618416e-01
  2.77995765e-01]])

In [None]:
t = rand(10,5)

In [None]:
t.dtype, t.device, t.requires_grad

(dtype('float32'), minima.cpu(), False)
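
The scale-and-shift step in `rand` can be illustrated without `minima`. The sketch below is a standard-library stand-in, not the library's implementation: `uniform_sample` is a hypothetical helper that maps draws from `random.random()` (uniform on [0, 1)) onto the interval from `low` to `high` in the same way `rand` does.

```python
import random

def uniform_sample(n, low=0.0, high=1.0, seed=0):
    """Hypothetical helper: n draws from U[low, high) via scale and shift."""
    rng = random.Random(seed)  # local generator so the sketch is reproducible
    # rng.random() is uniform on [0, 1); multiplying by (high - low) and
    # adding low moves that interval to [low, high).
    return [rng.random() * (high - low) + low for _ in range(n)]

samples = uniform_sample(10_000, low=-2.0, high=3.0)
sample_mean = sum(samples) / len(samples)
print(min(samples), max(samples))  # both lie between -2 and 3
print(sample_mean)                 # close to the midpoint 0.5
```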

2. **`randn`**: Similar to `rand`, but generates numbers from a normal distribution with the specified mean and standard deviation (defaulting to 0 and 1). This is done by creating an array of normally-distributed random values, then scaling and shifting them to match the requested parameters.

In [None]:
#| export
def randn(
    *shape, # The shape of the output tensor. Variable length argument list.
    mean=0.0,# Mean of the normal distribution. Default is 0.0.
    std=1.0, # Standard deviation of the normal distribution. Default is 1.0.
    device=None,# The device where the tensor will be allocated. Default is CPU.
    dtype="float32",# The data type of the tensor. Default is 'float32'.
    requires_grad=False # If True, the tensor is created with gradient tracking. Default is False.
):
    """
    Generates a tensor with random numbers normally distributed with specified mean and standard deviation.

    Parameters
    ----------
    *shape : int
    mean : float, optional
    std : float, optional
    device : Device, optional
    dtype : str, optional
    requires_grad : bool, optional
    
    Returns
    -------
    mi.Tensor
        A tensor of shape `shape`, filled with random numbers from the normal distribution with the specified mean and standard deviation.
    """
    device = mi.cpu() if device is None else device
    array = device.randn(*shape) * std + mean
    return mi.Tensor(array, device=device, dtype=dtype, requires_grad=requires_grad)

In [None]:
t = randn(5,5, requires_grad=True)

In [None]:
t

minima.Tensor([[-0.32201555  0.65765953  0.10364752  1.8708901   0.5211276 ]
 [-2.6434643   0.34590507  0.13994128 -0.56156456  0.5111668 ]
 [-0.629234    1.8087889  -1.7081019   0.17440249 -0.56732374]
 [-0.32246572  0.978294    0.44196278  0.5731406   1.6570238 ]
 [ 1.2459402  -0.6183812   0.00332103  0.5251044  -0.9210202 ]])

In [None]:
t.shape, t.dtype, t.device, t.requires_grad

((5, 5), dtype('float32'), minima.cpu(), True)
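
The same scale-and-shift idea applies to `randn`. As a rough check that multiplying a standard normal draw by `std` and adding `mean` yields the requested distribution, here is a standard-library sketch; `normal_sample` is a hypothetical helper and is not part of `minima`.

```python
import random
import statistics

def normal_sample(n, mean=0.0, std=1.0, seed=0):
    """Hypothetical helper: n draws from N(mean, std**2) via scale and shift."""
    rng = random.Random(seed)
    # rng.gauss(0, 1) draws from the standard normal; scaling by std and
    # shifting by mean gives a normal with the requested parameters.
    return [rng.gauss(0.0, 1.0) * std + mean for _ in range(n)]

samples = normal_sample(20_000, mean=3.0, std=0.5)
empirical_mean = statistics.fmean(samples)
empirical_std = statistics.pstdev(samples)
print(empirical_mean)  # close to 3.0
print(empirical_std)   # close to 0.5
```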

3. **`constant`**: This function creates a tensor filled with a constant value `c` (defaulting to 1). It does this by creating an array of ones on the specified device and then scaling these ones by the constant value.

In [None]:
#| export
def constant(
    *shape, # The shape of the output tensor. Variable length argument list.
    c=1.0, # The constant value to fill the tensor with. Default is 1.0.
    device=None, # The device where the tensor will be allocated. Default is CPU.
    dtype="float32", # The data type of the tensor. Default is 'float32'.
    requires_grad=False # If True, the tensor is created with gradient tracking. Default is False.
):
    """
    Generates a tensor filled with a constant value.

    Parameters
    ----------
    *shape : int
    c : float, optional
    device : Device, optional
    dtype : str, optional
    requires_grad : bool, optional
    
    Returns
    -------
    mi.Tensor
        A tensor of shape `shape`, filled with the constant value `c`.
    """
    device = mi.cpu() if device is None else device
    array = device.ones(*shape, dtype=dtype) * c # note: can change dtype
    return mi.Tensor(array, device=device, dtype=dtype, requires_grad=requires_grad)

4. **`ones` and `zeros`**: These functions are simply shortcuts for creating tensors filled with ones or zeros, respectively. They're implemented by calling the `constant` function with `c` set to 1 or 0.

In [None]:
#| export
def ones(
    *shape, # The shape of the output tensor. Variable length argument list.
    device=None, # The device where the tensor will be allocated. Default is CPU.
    dtype="float32", # The data type of the tensor. Default is 'float32'.
    requires_grad=False # If True, the tensor is created with gradient tracking. Default is False.
):
    """
    Generates a tensor filled with ones.

    Parameters
    ----------
    *shape : int
    device : Device, optional
    dtype : str, optional
    requires_grad : bool, optional
    
    Returns
    -------
    mi.Tensor
        A tensor of shape `shape`, filled with ones.
    """
    return constant(*shape, c=1.0, device=device, dtype=dtype, requires_grad=requires_grad)

In [None]:
#| export
def zeros(
    *shape, # The shape of the output tensor. Variable length argument list.
    device=None, # The device where the tensor will be allocated. Default is CPU.
    dtype="float32", # The data type of the tensor. Default is 'float32'.
    requires_grad=False # If True, the tensor is created with gradient tracking. Default is False.
):
    """
    Generates a tensor filled with zeros.

    Parameters
    ----------
    *shape : int
    device : Device, optional
    dtype : str, optional
    requires_grad : bool, optional
    
    Returns
    -------
    mi.Tensor
        A tensor of shape `shape`, filled with zeros.
    """
    return constant(*shape, c=0.0, device=device, dtype=dtype, requires_grad=requires_grad)

5. **`randb`**: This function creates a binary tensor, with each element independently being True with probability `p` (defaulting to 0.5). This is done by generating uniformly-distributed random numbers and checking whether they're less than or equal to `p`.

In [None]:
#| export
def randb(
    *shape, # The shape of the output tensor. Variable length argument list.
    p=0.5, # The probability of generating a `True` (1) in the binary tensor. Default is 0.5.
    device=None, # The device where the tensor will be allocated. Default is CPU.
    dtype="bool", # The data type of the tensor. Default is 'bool'.
    requires_grad=False # If True, the tensor is created with gradient tracking. Default is False.
):
    """
    Generates a binary tensor with random values of `True` or `False`.

    Parameters
    ----------
    *shape : int
    p : float, optional
    device : Device, optional
    dtype : str, optional
    requires_grad : bool, optional
    
    Returns
    -------
    mi.Tensor
        A binary tensor of shape `shape`, filled with random boolean values, where the probability of `True` is `p`.
    """
    device = mi.cpu() if device is None else device
    array = device.rand(*shape) <= p
    return mi.Tensor(array, device=device, dtype=dtype, requires_grad=requires_grad)
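
The thresholding trick behind `randb` is easy to verify in isolation. The following sketch uses only the standard library and a hypothetical helper `bernoulli_mask`; it is meant to show why comparing a uniform draw against `p` gives `True` with probability `p`, not to mirror the `minima` device API.

```python
import random

def bernoulli_mask(n, p=0.5, seed=0):
    """Hypothetical helper: n booleans, each True with probability p."""
    rng = random.Random(seed)
    # A uniform draw u in [0, 1) satisfies u <= p with probability p.
    return [rng.random() <= p for _ in range(n)]

mask = bernoulli_mask(10_000, p=0.2)
fraction_true = sum(mask) / len(mask)
print(fraction_true)  # close to 0.2
```

Masks of this kind are commonly used for dropout, where each activation is kept or dropped independently.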

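6. **`one_hot`**: This function creates a one-hot encoding tensor. Given a size `n` and an index `i`, it creates a tensor of size `n` with a 1 at the `i`-th position and 0s elsewhere. Note that the implementation calls `i.numpy()`, so `i` must be an object that provides a `.numpy()` method (presumably a `mi.Tensor` holding the index) rather than a plain Python `int`.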

In [None]:
#| export
def one_hot(
    n, # The size of the one-hot vector.
    i, # The index to be set to `1` in the one-hot vector.
    device=None, # The device where the tensor will be allocated. Default is CPU.
    dtype="float32", # The data type of the tensor. Default is 'float32'.
    requires_grad=False # If True, the tensor is created with gradient tracking. Default is False.
):
    """
    Generates a one-hot encoding tensor.

    Parameters
    ----------
    n : int
    i : int
    device : Device, optional
    dtype : str, optional
    requires_grad : bool, optional
    
    Returns
    -------
    mi.Tensor
        A one-hot tensor of size `n`, with the `i`th element set to `1` and all others set to `0`.
    """
    device = mi.cpu() if device is None else device
    return mi.Tensor(device.one_hot(n,i.numpy(), dtype=dtype), device=device, requires_grad=requires_grad)
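
For reference, the encoding itself can be written in a few lines of plain Python. `one_hot_list` is a hypothetical helper for illustration only; unlike the `minima` function above it takes a plain integer index and returns a list.

```python
def one_hot_list(n, i):
    """Hypothetical helper: length-n list with 1.0 at index i and 0.0 elsewhere."""
    if not 0 <= i < n:
        raise IndexError(f"index {i} out of range for size {n}")
    return [1.0 if j == i else 0.0 for j in range(n)]

print(one_hot_list(5, 2))  # [0.0, 0.0, 1.0, 0.0, 0.0]
```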

### Glorot/Xavier Initialization

Xavier initialization, also known as Glorot initialization, is a technique for initializing the weights of artificial neural networks to improve the stability and speed of training. In the paper "Understanding the difficulty of training deep feedforward neural networks" (Glorot and Bengio, 2010), the authors identified a value for the variance of the weights that works well to mitigate vanishing and exploding gradients.

Here's a high-level idea of how it works:

Neural networks are trained using a method called backpropagation, which involves iteratively adjusting the weights of the network based on the difference between the network's current output and its desired output.

One challenge with this process is that the scale of the initial weights can have a large impact on the network's learning dynamics. If the weights are too large or too small, the network might learn very slowly, or not at all. This is particularly an issue in deep networks where there are many layers of weights to learn.

Xavier initialization seeks to address this issue by scaling the initial weights according to the number of inputs and outputs of the neuron. Specifically, the weights are drawn from a distribution with a mean of 0 and a variance defined as:

$$
\text{var}(w)=\frac{2}{n_{in}+n_{out}}
$$

where $n_{in}$ is the number of inputs to the neuron and $n_{out}$ is the number of outputs. A variance of $\frac{2}{n_{in}+n_{out}}$ corresponds to a standard deviation of $\sqrt{\frac{2}{n_{in}+n_{out}}}$. To obtain it, the weights are first drawn from a standard normal distribution (mean 0, standard deviation 1).

Every weight is then multiplied by $\sqrt{\frac{2}{n_{in}+n_{out}}}$, which rescales the standard deviation of the distribution to $\sqrt{\frac{2}{n_{in}+n_{out}}}$ while leaving the mean at 0.

![Xavier initialization from a normal distribution](../assets/10.xav-init-normal.svg)
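
The two-step recipe (draw from a standard normal, then rescale) can be checked empirically. The sketch below uses only the standard library; `xavier_normal_sample` is a hypothetical stand-in that returns nested lists rather than a `mi.Tensor`, and the layer sizes are arbitrary.

```python
import math
import random
import statistics

def xavier_normal_sample(fan_in, fan_out, gain=1.0, seed=0):
    """Hypothetical helper: fan_in x fan_out nested list of Xavier-normal weights."""
    rng = random.Random(seed)
    std = gain * math.sqrt(2.0 / (fan_in + fan_out))
    # Draw from N(0, 1), then rescale each weight by the Xavier standard deviation.
    return [[rng.gauss(0.0, 1.0) * std for _ in range(fan_out)]
            for _ in range(fan_in)]

fan_in, fan_out = 200, 100
W = xavier_normal_sample(fan_in, fan_out)
flat = [w for row in W for w in row]

target_std = math.sqrt(2.0 / (fan_in + fan_out))
empirical_std = statistics.pstdev(flat)
print(target_std)     # sqrt(2/300), about 0.0816
print(empirical_std)  # close to the target
```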

In [None]:
#| export
def xavier_normal(
    fan_in, # The number of input units in the weight tensor.
    fan_out, # The number of output units in the weight tensor.
    gain=1.0, # Scaling factor for the standard deviation of the normal distribution. Default is 1.0.
    **kwargs # Additional arguments.
):
    """
    Initializes a tensor using Xavier (Glorot) Normal initialization.

    This initializer is designed to keep the scale of the gradients roughly the same
    in all layers. It samples weights from a normal distribution centered around 0 with 
    standard deviation `gain * sqrt(2 / (fan_in + fan_out))`

    Parameters
    ----------
    fan_in : int
        The number of input units in the weight tensor.
    fan_out : int
        The number of output units in the weight tensor.
    gain : float, optional
        Scaling factor for the standard deviation of the normal distribution. Default is 1.0.
    **kwargs
        Additional arguments.
    
    Returns
    -------
    mi.Tensor
        A tensor initialized using Xavier Normal initialization.
    """
    std = gain * math.sqrt(2 / (fan_in + fan_out))
    # Forward kwargs (e.g. device, dtype, requires_grad) to randn.
    return randn(fan_in, fan_out, std=std, **kwargs)

There is also a Xavier variant that samples from a uniform rather than a normal distribution. The resulting weight matrix contains values sampled uniformly from the range $(-a, a)$, with $a = \sqrt{\frac{6}{n_{in}+n_{out}}}$ (multiplied by `gain` in the implementation below).

![Xavier initialization from a uniform distribution](../assets/11.xav-uniform.svg)
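
The bound $a$ follows from matching the variance of the uniform distribution to the Xavier target. A short derivation, using the standard result that a uniform distribution on $(-a, a)$ has variance $\frac{(2a)^2}{12}$:

$$
\text{var}(w)=\frac{(2a)^2}{12}=\frac{a^2}{3}=\frac{2}{n_{in}+n_{out}}
\quad\Longrightarrow\quad
a=\sqrt{\frac{6}{n_{in}+n_{out}}}
$$

So the uniform and normal variants share the same variance and differ only in the shape of the distribution.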

In [None]:
#| export
def xavier_uniform(
    fan_in, # The number of input units in the weight tensor.
    fan_out, # The number of output units in the weight tensor.
    gain=1.0, # Scaling factor for the range of the uniform distribution. Default is 1.0.
    **kwargs # Additional arguments.
):
    """
    Initializes a tensor using Xavier (Glorot) Uniform initialization.

    This initializer is designed to keep the scale of the gradients roughly the same
    in all layers. It samples weights from a uniform distribution within the range 
    `[-gain * sqrt(6 / (fan_in + fan_out)), gain * sqrt(6 / (fan_in + fan_out))]`

    Parameters
    ----------
    fan_in : int
        The number of input units in the weight tensor.
    fan_out : int
        The number of output units in the weight tensor.
    gain : float, optional
        Scaling factor for the range of the uniform distribution. Default is 1.0.
    **kwargs
        Additional arguments.
    
    Returns
    -------
    mi.Tensor
        A tensor initialized using Xavier Uniform initialization.
    """
    a = gain * math.sqrt(6 / (fan_in + fan_out))
    # Forward kwargs (e.g. device, dtype, requires_grad) to rand.
    return rand(fan_in, fan_out, low=-a, high=a, **kwargs)

Both the normal and the uniform variant work well in practice, and the choice between them is up to the network designer. Xavier initialization is widely used to promote more stable training and to avoid problems caused by unstable gradients, such as vanishing and exploding gradients.

In [None]:
# Initialize weights with Xavier/Glorot initialization
W = xavier_uniform(fan_in=10, fan_out=5)

In [None]:
W

minima.Tensor([[ 0.18204309 -0.12172355  0.42933008 -0.30406302  0.22789986]
 [-0.4650699   0.12824431  0.28893474 -0.4557486  -0.43527567]
 [ 0.38837773  0.36388302 -0.27686062  0.5698096   0.5395923 ]
 [-0.31807607  0.20749329  0.4179927  -0.28587523  0.03159472]
 [-0.22673258  0.28144655  0.30295125 -0.26983428  0.26549587]
 [-0.5303138   0.10772958 -0.05070698 -0.30012584  0.19123213]
 [-0.4650802   0.6139801  -0.63092977  0.47466028 -0.5542998 ]
 [-0.11492721  0.06349465  0.32470813 -0.21819438  0.0323478 ]
 [-0.10181683 -0.5484834  -0.5687075  -0.5260988  -0.40132982]
 [ 0.08179916 -0.2423988   0.3873124  -0.23807983 -0.5552056 ]])

In [None]:
W = xavier_normal(fan_in=10, fan_out=5)

In [None]:
W

minima.Tensor([[-0.12149321 -0.06503619 -0.18251379  0.39912507 -0.00423999]
 [ 0.13038522 -0.68032825  0.21075922 -0.0962192  -0.5333168 ]
 [ 0.14011304 -0.17632282  0.01991802  0.4106401   0.369387  ]
 [ 0.21930689 -0.02168333  0.21634428  0.19984812 -0.45925918]
 [-0.14728941  0.05819568 -0.03692521  0.27890548 -0.20965143]
 [ 0.46666375 -0.75786346 -0.23499827  0.38242584  0.3118511 ]
 [-0.16439071  0.38929185  0.61245376 -0.01932754  0.15984343]
 [-0.20579684 -0.7682969   0.47243273 -0.50518984  0.34700346]
 [ 0.19490859 -0.06111953 -0.9346084  -0.46014336  0.23300567]
 [-0.42322642  0.176557   -0.4687168  -0.06649911  0.54504395]])

The original Xavier analysis assumes an activation function that is symmetric around zero and approximately linear near the origin, such as tanh. If you're using a different activation function, like ReLU, you might need a different initialization scheme, like He initialization, which adapts the Xavier idea to ReLU and other non-symmetric activation functions.

### He Initialization

Kaiming Initialization, also known as He Initialization, is a method used in initializing the weights of Neural Networks. This initialization method is designed specifically for neural networks with Rectified Linear Unit (ReLU) activation functions. It was proposed by Kaiming He et al. in their 2015 paper "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification".

**Principles of Kaiming Initialization:**

The basic idea of Kaiming Initialization is to keep the variance of each layer's inputs and outputs as consistent as possible during forward and backward propagation. This counters the vanishing or exploding gradients that appear as a network gets deeper, which helps the model learn effectively.

Kaiming initialization initializes a weight matrix $w$ with random values sampled from a normal distribution with mean of $0$ and variance

$$\text{var}(w)=\frac{2}{n_{i}}$$

Here, $n_i$ is the number of inputs to the neuron and $w$ is the weight matrix.

Just as with Xavier initialization, to force the weight distribution to take on this variance, the weights are first randomly generated from a normal distribution centered around 0 with a standard deviation of 1. Then, each weight is multiplied by

$$\sqrt{\frac{2}{n_{i}}}$$

where $n_i$ is the number of inputs coming into a neuron (also known as the "fan-in").

![Kaiming initialization from a normal distribution](../assets/12.kaiming-normal.svg)
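
The variance-preservation idea can be checked numerically. The sketch below is a standard-library toy and not part of `minima`: `mean_square` and `forward_relu_stack` are hypothetical helpers, the layers are square (fan-in equals fan-out), biases are omitted, and fresh Gaussian weights are drawn on the fly. It pushes one random input through a small stack of ReLU layers, once with weight standard deviation $\sqrt{2/n}$ and once with the Xavier value $\sqrt{2/(n+n)}$.

```python
import math
import random

def mean_square(v):
    """Average of the squared entries of v."""
    return sum(x * x for x in v) / len(v)

def forward_relu_stack(x, depth, weight_std, rng):
    """Pass x through `depth` square ReLU layers with N(0, weight_std**2) weights."""
    n = len(x)
    for _ in range(depth):
        # Each output unit sums n randomly weighted inputs, then applies ReLU.
        x = [max(0.0, sum(rng.gauss(0.0, weight_std) * xi for xi in x))
             for _ in range(n)]
    return x

n, depth = 300, 4
rng = random.Random(0)
x0 = [rng.gauss(0.0, 1.0) for _ in range(n)]

kaiming_out = forward_relu_stack(x0, depth, math.sqrt(2.0 / n), rng)
xavier_out = forward_relu_stack(x0, depth, math.sqrt(2.0 / (n + n)), rng)

kaiming_ratio = mean_square(kaiming_out) / mean_square(x0)
xavier_ratio = mean_square(xavier_out) / mean_square(x0)
print(kaiming_ratio)  # stays on the order of 1
print(xavier_ratio)   # shrinks by roughly a factor of 2 per layer
```

With the $\sqrt{2/n}$ scaling the signal magnitude is roughly preserved, whereas the Xavier scaling loses about half of it at every ReLU layer, which is the effect the factor of 2 is meant to cancel.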

In [None]:
#| export
def kaiming_normal(
    fan_in,  # Number of input units in the weight tensor.
    fan_out, # Number of output units in the weight tensor.
    nonlinearity="relu", # The non-linear function (`nn.functional` name), recommended to use only with 'relu' or 'leaky_relu'. Default is 'relu'.
    **kwargs # Additional keyword arguments
):
    """
    Fills the input Tensor with values according to the method described in
    "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification" - He, K. et al. (2015), using a normal distribution.
    The resulting tensor will have values sampled from normal distribution with mean=0 and std=sqrt(2 / fan_in).

    Parameters
    ----------
    fan_in : int
        Number of input units in the weight tensor.
    fan_out : int
        Number of output units in the weight tensor.
    nonlinearity : str, optional
        The non-linear function (`nn.functional` name), recommended to use only with 'relu' or 'leaky_relu'. Default is 'relu'.
    **kwargs : optional
        Additional keyword arguments.
    
    Returns
    -------
    mi.Tensor
        A tensor of shape (fan_in, fan_out), filled with random numbers from the normal distribution according to the Kaiming initialization.
    """
    assert nonlinearity == "relu", "Only relu supported currently"
    std = math.sqrt(2 / fan_in)  # use math: numpy is not imported in this module
    # Forward kwargs (e.g. device, dtype, requires_grad) to randn.
    return randn(fan_in, fan_out, std=std, **kwargs)


There is also a version of Kaiming initialization to use for uniform distributions rather than normal distributions. The resulting weight matrix will have values sampled from a uniform distribution within the range $(-a, a)$, where 

$$a = \sqrt{\frac{6}{n_{i}}}$$

![Kaiming initialization from a uniform distribution](../assets/13.kaiming-uniform.svg)
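
As a quick sanity check on the bound, the sketch below samples from a uniform distribution on $(-a, a)$ with $a = \sqrt{6/n_i}$ and compares the empirical variance with the Kaiming target $2/n_i$. It uses only the standard library, and the fan-in of 128 is an arbitrary choice for illustration.

```python
import math
import random
import statistics

fan_in = 128
bound = math.sqrt(6.0 / fan_in)
# The same bound written as gain * sqrt(3 / fan_in) with the ReLU gain sqrt(2).
assert math.isclose(bound, math.sqrt(2.0) * math.sqrt(3.0 / fan_in))

rng = random.Random(0)
samples = [rng.uniform(-bound, bound) for _ in range(50_000)]

target_var = 2.0 / fan_in
empirical_var = statistics.pvariance(samples)
print(target_var)     # 0.015625
print(empirical_var)  # close to the target
```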

In [None]:
#| export
def kaiming_uniform(
    fan_in,  # Number of input units in the weight tensor.
    fan_out, # Number of output units in the weight tensor.
    nonlinearity="relu", # The non-linear function (`nn.functional` name), recommended to use only with 'relu' or 'leaky_relu'. Default is 'relu'.
    **kwargs # Additional keyword arguments
):
    """
    Fills the input Tensor with values according to the method described in
    "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification" - He, K. et al. (2015), using a uniform distribution.
    The resulting tensor will have values sampled from uniform distribution in the range [-bound, bound] where bound = sqrt(6 / fan_in).

    Parameters
    ----------
    fan_in : int
        Number of input units in the weight tensor.
    fan_out : int
        Number of output units in the weight tensor.
    nonlinearity : str, optional
        The non-linear function (`nn.functional` name), recommended to use only with 'relu' or 'leaky_relu'. Default is 'relu'.
    **kwargs : optional
        Additional keyword arguments.
    
    Returns
    -------
    mi.Tensor
        A tensor of shape (fan_in, fan_out), filled with random numbers from the uniform distribution according to the Kaiming initialization.
    """
    assert nonlinearity == "relu", "Only relu supported currently"
    gain = math.sqrt(2)  # recommended gain for ReLU
    bound = gain * math.sqrt(3 / fan_in)  # equals sqrt(6 / fan_in)
    return rand(fan_in, fan_out, low=-bound, high=bound, **kwargs)

**Advantages of Kaiming Initialization:**

1. It helps to keep the variance of the activations and gradients roughly the same across all layers. This lets all layers in the network learn at about the same speed instead of having signals vanish or explode with depth, and it can also speed up convergence.
2. It performs better with ReLU and its variants because it accounts for the fact that ReLU zeroes out the negative half of its input: for a zero-mean, symmetric pre-activation, the mean square of the ReLU output is half the variance of its input, and the factor of 2 in $\frac{2}{n_i}$ compensates for that.
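
The factor of 2 can be traced in a few lines, following the forward-pass argument of He et al. (2015). The notation here is introduced only for this sketch: $y_l$ is the pre-activation of layer $l$, $x_l = \max(0, y_{l-1})$ is its input, the weights are zero-mean, independent of the inputs, and identically distributed, biases are zero, and $y_{l-1}$ is assumed symmetric around zero.

$$
\text{var}(y_l) = n_i \,\text{var}(w)\, \mathbb{E}\left[x_l^2\right],
\qquad
\mathbb{E}\left[x_l^2\right] = \frac{1}{2}\,\text{var}(y_{l-1})
$$

$$
\text{var}(y_l) = \frac{1}{2}\, n_i \,\text{var}(w)\, \text{var}(y_{l-1})
\quad\Longrightarrow\quad
\text{var}(w) = \frac{2}{n_i} \;\text{ keeps }\; \text{var}(y_l) = \text{var}(y_{l-1})
$$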

## Export

In [None]:
import nbdev; nbdev.nbdev_export()