<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->

1. **[`rand`](https://m0saan.github.io/minima/init.html#rand)**: This function generates a tensor filled with random numbers drawn from a uniform distribution between `low` and `high` (defaulting to 0 and 1). It does this by creating an array of random values on the specified device (defaulting to CPU), then scaling and shifting these values to the requested range. The result is wrapped in a `mi.Tensor` object, which supports automatic differentiation if `requires_grad` is True.

In [1]:
#| echo: false
#| output: asis
show_doc(rand)

---

[source](https://github.com/m0saan/minima/blob/main/minima/init.py#L12){target="_blank" style="float:right; font-size:smaller"}

### rand

>      rand (*shape, low=0.0, high=1.0, device=None, dtype='float32',
>            requires_grad=False)

Generates a tensor with random numbers uniformly distributed between `low` and `high`.

|    | **Type** | **Default** | **Details** |
| -- | -------- | ----------- | ----------- |
| shape |  |  |  |
| low | float | 0.0 |  |
| high | float | 1.0 |  |
| device | NoneType | None |  |
| dtype | str | float32 |  |
| requires_grad | bool | False |  |
| **Returns** | **mi.Tensor** |  | **A tensor of shape `shape`, filled with random numbers from the uniform distribution between `low` and `high`.** |

In [None]:
rand(10,5)

minima.Tensor([[0.8361942  0.7891384  0.10970007 0.5923745  0.22967269]
 [0.61836874 0.12242746 0.89674276 0.26497158 0.25622988]
 [0.8375568  0.04936132 0.21718413 0.15642066 0.10232401]
 [0.1440296  0.13674147 0.40588015 0.33155832 0.28403464]
 [0.58986247 0.20638846 0.24636365 0.75810486 0.94382447]
 [0.74609196 0.00459267 0.48561355 0.20537768 0.17416522]
 [0.24115583 0.06162176 0.3904394  0.9618843  0.8685511 ]
 [0.42657614 0.42485094 0.19993785 0.9789261  0.9477727 ]
 [0.02524497 0.48020166 0.7375612  0.7842982  0.92582405]
 [0.12680373 0.41048595 0.00874551 0.16642605 0.39158627]])

In [None]:
t = rand(10,5)

In [None]:
t.dtype, t.device, t.requires_grad

(dtype('float32'), minima.cpu(), False)
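The scale-and-shift step described above can be sketched in plain Python. This is only an illustration under stated assumptions: `rand_sketch` is a hypothetical helper, it returns nested lists rather than a `mi.Tensor`, and it ignores `device`, `dtype`, and `requires_grad`.

```python
import random

def rand_sketch(rows, cols, low=0.0, high=1.0, seed=None):
    """Hypothetical stand-in for `rand` that returns nested lists."""
    rng = random.Random(seed)
    # rng.random() draws from [0, 1); multiply by the width of the target
    # interval and add `low` to move the values into [low, high).
    return [[low + (high - low) * rng.random() for _ in range(cols)]
            for _ in range(rows)]

sample = rand_sketch(3, 4, low=-2.0, high=2.0, seed=0)
print(sample)
```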

2. **[`randn`](https://m0saan.github.io/minima/init.html#randn)**: Similar to [`rand`](https://m0saan.github.io/minima/init.html#rand), but generates numbers from a normal distribution with the specified mean and standard deviation (defaulting to 0 and 1). This is done by creating an array of normally-distributed random values, then scaling and shifting them to match the requested parameters.

In [2]:
#| echo: false
#| output: asis
show_doc(randn)

---

[source](https://github.com/m0saan/minima/blob/main/minima/init.py#L44){target="_blank" style="float:right; font-size:smaller"}

### randn

>      randn (*shape, mean=0.0, std=1.0, device=None, dtype='float32',
>             requires_grad=False)

Generates a tensor with random numbers normally distributed with specified mean and standard deviation.

|    | **Type** | **Default** | **Details** |
| -- | -------- | ----------- | ----------- |
| shape |  |  |  |
| mean | float | 0.0 |  |
| std | float | 1.0 |  |
| device | NoneType | None |  |
| dtype | str | float32 |  |
| requires_grad | bool | False |  |
| **Returns** | **mi.Tensor** |  | **A tensor of shape `shape`, filled with random numbers from the normal distribution with the specified mean and standard deviation.** |

In [None]:
t = randn(5,5, requires_grad=True)

In [None]:
t

minima.Tensor([[ 1.3397974   0.64010125 -0.9311074   0.6728155  -0.1192577 ]
 [ 0.7008655  -0.7104067  -0.89565736 -0.8261754   0.72841895]
 [-0.8426411  -0.8788722  -0.661193   -1.4981922   0.15918176]
 [ 0.9665735  -1.2228402   0.7100398   0.4944528   0.34494334]
 [-0.22832021  0.5712975   1.866018   -0.6395092   0.90164375]])

In [None]:
t.shape, t.dtype, t.device, t.requires_grad

((5, 5), dtype('float32'), minima.cpu(), True)
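As a rough sketch of the same idea for the normal case, a standard normal draw `z` becomes `mean + std * z`. The helper name `randn_sketch` is hypothetical and the result is a nested list, not a `mi.Tensor`.

```python
import random

def randn_sketch(rows, cols, mean=0.0, std=1.0, seed=None):
    """Hypothetical stand-in for `randn` that returns nested lists."""
    rng = random.Random(seed)
    # Draw z ~ N(0, 1), scale it by `std`, then shift it by `mean`.
    return [[mean + std * rng.gauss(0.0, 1.0) for _ in range(cols)]
            for _ in range(rows)]

values = [v for row in randn_sketch(200, 50, mean=5.0, std=2.0, seed=0) for v in row]
sample_mean = sum(values) / len(values)
print(f"sample mean: {sample_mean:.3f}")  # should land near 5.0
```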

3. **[`constant`](https://m0saan.github.io/minima/init.html#constant)**: This function creates a tensor filled with a constant value `c` (defaulting to 1). It does this by creating an array of ones on the specified device and then scaling these ones by the constant value.

In [3]:
#| echo: false
#| output: asis
show_doc(constant)

---

[source](https://github.com/m0saan/minima/blob/main/minima/init.py#L74){target="_blank" style="float:right; font-size:smaller"}

### constant

>      constant (*shape, c=1.0, device=None, dtype='float32',
>                requires_grad=False)

Generates a tensor filled with a constant value.

|    | **Type** | **Default** | **Details** |
| -- | -------- | ----------- | ----------- |
| shape |  |  |  |
| c | float | 1.0 |  |
| device | NoneType | None |  |
| dtype | str | float32 |  |
| requires_grad | bool | False |  |
| **Returns** | **mi.Tensor** |  | **A tensor of shape `shape`, filled with the constant value `c`.** |

4. **[`ones`](https://m0saan.github.io/minima/init.html#ones) and [`zeros`](https://m0saan.github.io/minima/init.html#zeros)**: These functions are simply shortcuts for creating tensors filled with ones or zeros, respectively. They're implemented by calling the [`constant`](https://m0saan.github.io/minima/init.html#constant) function with `c` set to 1 or 0.

In [4]:
#| echo: false
#| output: asis
show_doc(ones)

---

[source](https://github.com/m0saan/minima/blob/main/minima/init.py#L102){target="_blank" style="float:right; font-size:smaller"}

### ones

>      ones (*shape, device=None, dtype='float32', requires_grad=False)

Generates a tensor filled with ones.

|    | **Type** | **Default** | **Details** |
| -- | -------- | ----------- | ----------- |
| shape |  |  |  |
| device | NoneType | None |  |
| dtype | str | float32 |  |
| requires_grad | bool | False |  |
| **Returns** | **mi.Tensor** |  | **A tensor of shape `shape`, filled with ones.** |

In [5]:
#| echo: false
#| output: asis
show_doc(zeros)

---

[source](https://github.com/m0saan/minima/blob/main/minima/init.py#L126){target="_blank" style="float:right; font-size:smaller"}

### zeros

>      zeros (*shape, device=None, dtype='float32', requires_grad=False)

Generates a tensor filled with zeros.

|    | **Type** | **Default** | **Details** |
| -- | -------- | ----------- | ----------- |
| shape |  |  |  |
| device | NoneType | None |  |
| dtype | str | float32 |  |
| requires_grad | bool | False |  |
| **Returns** | **mi.Tensor** |  | **A tensor of shape `shape`, filled with zeros.** |
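The relationship between `constant`, `ones`, and `zeros` described above might look like the following in plain Python. The `*_sketch` names are hypothetical, and nested lists stand in for `mi.Tensor`.

```python
def constant_sketch(rows, cols, c=1.0):
    """Hypothetical stand-in for `constant`: build ones, then scale by `c`."""
    ones = [[1.0] * cols for _ in range(rows)]
    return [[c * v for v in row] for row in ones]

def ones_sketch(rows, cols):
    # Shortcut: a constant tensor with c = 1.
    return constant_sketch(rows, cols, c=1.0)

def zeros_sketch(rows, cols):
    # Shortcut: a constant tensor with c = 0.
    return constant_sketch(rows, cols, c=0.0)

print(constant_sketch(2, 3, c=2.5))  # [[2.5, 2.5, 2.5], [2.5, 2.5, 2.5]]
```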

5. **[`randb`](https://m0saan.github.io/minima/init.html#randb)**: This function creates a binary tensor, with each element independently being True with probability `p` (defaulting to 0.5). This is done by generating uniformly-distributed random numbers and checking whether they're less than or equal to `p`.

In [6]:
#| echo: false
#| output: asis
show_doc(randb)

---

[source](https://github.com/m0saan/minima/blob/main/minima/init.py#L150){target="_blank" style="float:right; font-size:smaller"}

### randb

>      randb (*shape, p=0.5, device=None, dtype='bool', requires_grad=False)

Generates a binary tensor with random values of `True` or `False`.

|    | **Type** | **Default** | **Details** |
| -- | -------- | ----------- | ----------- |
| shape |  |  |  |
| p | float | 0.5 |  |
| device | NoneType | None |  |
| dtype | str | bool |  |
| requires_grad | bool | False |  |
| **Returns** | **mi.Tensor** |  | **A binary tensor of shape `shape`, filled with random boolean values, where the probability of `True` is `p`.** |
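The thresholding step can be illustrated with a small sketch. `randb_sketch` is a hypothetical name, and the comparison `<= p` simply mirrors the description above rather than the library's actual code.

```python
import random

def randb_sketch(rows, cols, p=0.5, seed=None):
    """Hypothetical stand-in for `randb` that returns nested lists of bools."""
    rng = random.Random(seed)
    # A uniform draw in [0, 1) falls at or below p with probability p.
    return [[rng.random() <= p for _ in range(cols)] for _ in range(rows)]

mask = randb_sketch(100, 100, p=0.3, seed=0)
fraction_true = sum(v for row in mask for v in row) / (100 * 100)
print(f"fraction of True values: {fraction_true:.3f}")  # should land near 0.3
```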

6. **[`one_hot`](https://m0saan.github.io/minima/init.html#one_hot)**: This function creates a one-hot encoding tensor. Given a size `n` and an index `i`, it creates a tensor of size `n` with a 1 at the `i`-th position and 0s elsewhere.

In [7]:
#| echo: false
#| output: asis
show_doc(one_hot)

---

[source](https://github.com/m0saan/minima/blob/main/minima/init.py#L178){target="_blank" style="float:right; font-size:smaller"}

### one_hot

>      one_hot (n, i, device=None, dtype='float32', requires_grad=False)

Generates a one-hot encoding tensor.

|    | **Type** | **Default** | **Details** |
| -- | -------- | ----------- | ----------- |
| n | int |  | The size of the one-hot vector. |
| i | int |  | The index to be set to `1` in the one-hot vector. |
| device | NoneType | None | The device where the tensor will be allocated. Default is CPU. |
| dtype | str | float32 | The data type of the tensor. Default is 'float32'. |
| requires_grad | bool | False | If True, the tensor is created with gradient tracking. Default is False. |
| **Returns** | **mi.Tensor** |  | **A one-hot tensor of size `n`, with the `i`th element set to `1` and all others set to `0`.** |
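For a single integer index, the encoding reduces to the sketch below. `one_hot_sketch` is a hypothetical helper, and the bounds check is an assumption added for the illustration, not documented behavior of `one_hot`.

```python
def one_hot_sketch(n, i):
    """Hypothetical stand-in for `one_hot` that returns a plain list."""
    if not 0 <= i < n:
        # Assumption for this sketch: reject indices outside [0, n).
        raise IndexError(f"index {i} is out of range for size {n}")
    return [1.0 if j == i else 0.0 for j in range(n)]

print(one_hot_sketch(5, 2))  # [0.0, 0.0, 1.0, 0.0, 0.0]
```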

### Glorot/Xavier Initialization

Xavier initialization, also known as Glorot initialization, is a technique for initializing the weights of a neural network to improve the stability and speed of training. In the paper *Understanding the difficulty of training deep feedforward neural networks* (Glorot and Bengio, 2010), the authors identified a value for the variance of the weights that helps mitigate vanishing and exploding gradients.

Here's a high-level idea of how it works:

Neural networks are trained using a method called backpropagation, which involves iteratively adjusting the weights of the network based on the difference between the network's current output and its desired output.

One challenge with this process is that the scale of the initial weights can have a large impact on the network's learning dynamics. If the weights are too large or too small, the network might learn very slowly, or not at all. This is particularly an issue in deep networks where there are many layers of weights to learn.
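To make the effect of scale concrete, here is a small experiment, offered as a sketch rather than a rigorous benchmark. It pushes one random vector through a stack of purely linear layers (no activation function, an assumption made to keep the example short) and reports the root-mean-square of the final output for three weight scales. The width, depth, and scale values are arbitrary choices for the illustration.

```python
import math
import random

def final_rms(std, width=64, depth=10, seed=0):
    """Push one random vector through `depth` linear layers and return its RMS."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    for _ in range(depth):
        # Fresh weight matrix with entries drawn from N(0, std^2).
        W = [[std * rng.gauss(0.0, 1.0) for _ in range(width)]
             for _ in range(width)]
        x = [sum(w * v for w, v in zip(row, x)) for row in W]
    return math.sqrt(sum(v * v for v in x) / width)

scales = [("too small", 0.05), ("too large", 0.5), ("1/sqrt(width)", 1 / math.sqrt(64))]
for label, std in scales:
    print(f"{label:>14}: final RMS = {final_rms(std):.3e}")
```

Each linear layer multiplies the variance of the signal by roughly `width * std**2`, so the first scale shrinks the signal toward zero, the second blows it up, and the third keeps it near its starting magnitude.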

Xavier initialization seeks to address this issue by scaling the initial weights according to the number of inputs and outputs of the neuron. Specifically, in Xavier initialization, the weights are drawn from a distribution with a mean of 0 and a variance defined as:

$$
\text{var}(w)=\frac{2}{n_{in}+n_{out}}
$$

where $n_{in}$ is the number of inputs to the neuron and $n_{out}$ is the number of outputs. A variance of $\frac{2}{n_{in}+n_{out}}$ corresponds to a standard deviation of $\sqrt{\frac{2}{n_{in}+n_{out}}}$. To obtain it, the weights are first drawn from a normal distribution with a mean of 0 and a standard deviation of 1.

Every weight is then multiplied by $\sqrt{\frac{2}{n_{in}+n_{out}}}$, which rescales the standard deviation of the distribution to $\sqrt{\frac{2}{n_{in}+n_{out}}}$.

![Xavier initialization from a normal distribution](../assets/10.xav-init-normal.svg)
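The two steps (draw from a standard normal, then rescale) can be sketched as follows. `xavier_normal_sketch` is a hypothetical helper that returns nested lists of shape `(fan_in, fan_out)`; the `gain` factor follows the `xavier_normal` signature documented below, and the variance check at the end is only an empirical sanity check.

```python
import math
import random

def xavier_normal_sketch(fan_in, fan_out, gain=1.0, seed=None):
    """Hypothetical stand-in for `xavier_normal` that returns nested lists."""
    rng = random.Random(seed)
    std = gain * math.sqrt(2.0 / (fan_in + fan_out))
    # Draw z ~ N(0, 1) and multiply by the target standard deviation.
    return [[std * rng.gauss(0.0, 1.0) for _ in range(fan_out)]
            for _ in range(fan_in)]

W_xn = xavier_normal_sketch(200, 100, seed=0)
flat_xn = [w for row in W_xn for w in row]
# The mean is 0 by construction, so the mean of squares estimates the variance.
var_xn = sum(w * w for w in flat_xn) / len(flat_xn)
print(f"empirical variance: {var_xn:.5f}, target: {2 / (200 + 100):.5f}")
```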

In [8]:
#| echo: false
#| output: asis
show_doc(xavier_normal)

---

[source](https://github.com/m0saan/minima/blob/main/minima/init.py#L205){target="_blank" style="float:right; font-size:smaller"}

### xavier_normal

>      xavier_normal (fan_in, fan_out, gain=1.0, **kwargs)

Initializes a tensor using Xavier (Glorot) Normal initialization.

This initializer is designed to keep the scale of the gradients roughly the same
in all layers. It samples weights from a normal distribution centered around 0 with 
standard deviation `gain * sqrt(2 / (fan_in + fan_out))`

|    | **Type** | **Default** | **Details** |
| -- | -------- | ----------- | ----------- |
| fan_in | int |  | The number of input units in the weight tensor. |
| fan_out | int |  | The number of output units in the weight tensor. |
| gain | float | 1.0 | Scaling factor for the standard deviation of the normal distribution. Default is 1.0. |
| kwargs |  |  |  |
| **Returns** | **mi.Tensor** |  | **A tensor initialized using Xavier Normal initialization.** |

There is also a variant of Xavier initialization that samples from a uniform distribution instead of a normal distribution. The resulting weight matrix contains values sampled uniformly from the range $(-a, a)$, where $a = \sqrt{\frac{6}{n_{in}+n_{out}}}$.

![Xavier initialization from a uniform distribution](../assets/11.xav-uniform.svg)
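The constant 6 comes from matching the variance of the uniform distribution to the Xavier target. As a short derivation, using only the standard fact that a uniform distribution on $(-a, a)$ has variance $\frac{a^2}{3}$:

$$
\text{var}(w)=\frac{(a-(-a))^2}{12}=\frac{a^2}{3}=\frac{2}{n_{in}+n_{out}}
\quad\Longrightarrow\quad
a=\sqrt{\frac{6}{n_{in}+n_{out}}}
$$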

In [9]:
#| echo: false
#| output: asis
show_doc(xavier_uniform)

---

[source](https://github.com/m0saan/minima/blob/main/minima/init.py#L238){target="_blank" style="float:right; font-size:smaller"}

### xavier_uniform

>      xavier_uniform (fan_in, fan_out, gain=1.0, **kwargs)

Initializes a tensor using Xavier (Glorot) Uniform initialization.

This initializer is designed to keep the scale of the gradients roughly the same
in all layers. It samples weights from a uniform distribution within the range 
`[-gain * sqrt(6 / (fan_in + fan_out)), gain * sqrt(6 / (fan_in + fan_out))]`

|    | **Type** | **Default** | **Details** |
| -- | -------- | ----------- | ----------- |
| fan_in | int |  | The number of input units in the weight tensor. |
| fan_out | int |  | The number of output units in the weight tensor. |
| gain | float | 1.0 | Scaling factor for the range of the uniform distribution. Default is 1.0. |
| kwargs |  |  |  |
| **Returns** | **mi.Tensor** |  | **A tensor initialized using Xavier Uniform initialization.** |

Both the normal and the uniform variant work well in practice, and the choice between them is up to the network designer. Xavier initialization is widely used to promote more stable training and to avoid problems caused by unstable gradients, such as vanishing and exploding gradients.

In [None]:
# Initialize weights with Xavier/Glorot initialization
W = xavier_uniform(fan_in=10, fan_out=5)

In [None]:
W

minima.Tensor([[-0.1622775   0.4536291   0.61989725 -0.02014413 -0.05646893]
 [ 0.5858742   0.46239555  0.20997366  0.46705866  0.23186018]
 [-0.24604936  0.06220854  0.6072554  -0.05731776  0.5148139 ]
 [-0.0666193  -0.3852586   0.03337919  0.5635869   0.5360037 ]
 [-0.36881718  0.4481966  -0.5299952  -0.16656235 -0.63166136]
 [ 0.06658348 -0.02263997  0.3014384  -0.15522511  0.3325003 ]
 [-0.12717669 -0.02845087 -0.36774656  0.41525584 -0.46239212]
 [-0.5931512  -0.541313   -0.5024823   0.2106733   0.14501753]
 [-0.38586646 -0.3023581   0.2594077   0.27719358 -0.1911264 ]
 [-0.3176814   0.59976697  0.60364455  0.07043133  0.21091662]])

In [None]:
W = xavier_normal(fan_in=10, fan_out=5)

In [None]:
W

minima.Tensor([[ 0.03059636  0.40731347  0.04486648 -0.20211084 -0.2908123 ]
 [-0.07282545 -0.03428365  0.31833392  0.10940555  0.05456669]
 [-0.35267887  0.58239627  0.20920038 -0.05054335  0.06172116]
 [ 0.1331309   0.284902    0.15670004  0.22623208 -0.6965369 ]
 [ 0.43259475  0.42572162 -0.40264252 -0.43965283  0.46393195]
 [ 0.710218   -0.02606277 -0.06617628 -0.9257728   0.3177419 ]
 [-0.03474366 -0.42733535  0.5783244   0.29713896 -0.16121665]
 [ 0.7878572  -0.01783044  0.23402494  0.20502235 -0.6642037 ]
 [-0.08082991 -0.18710302  0.13123396  0.42042506  0.17879266]
 [ 0.15647691  0.3683187  -0.15457386 -0.51149946 -0.7011396 ]])

The original Xavier initialization was derived for activation functions that are symmetric around zero and roughly linear near the origin, such as tanh. If you're using a different activation function, like ReLU, you might need a different initialization scheme, like He initialization, which is a modification of Xavier initialization designed for ReLU and other non-symmetric activation functions.

### He Initialization

Kaiming initialization, also known as He initialization, is a method for initializing the weights of neural networks. It is designed specifically for networks with Rectified Linear Unit (ReLU) activation functions, and was proposed by Kaiming He et al. in their 2015 paper "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification".

**Principles of Kaiming Initialization:**

The basic idea of Kaiming initialization is to keep the variance of each layer's input and output as consistent as possible during forward and backward propagation. This addresses the vanishing or exploding gradients that appear as the network gets deeper, which helps the model learn effectively.

Kaiming initialization fills a weight matrix $w$ with random values sampled from a normal distribution with a mean of $0$ and variance

$$\text{var}(w)=\frac{2}{n_{i}}$$

Here, $n_{i}$ is the number of inputs coming into the neuron (also known as the "fan-in") and $w$ is its weight vector.

Just as with Xavier initialization, to give the weights this variance, they are first drawn from a normal distribution centered around 0 with a standard deviation of 1. Then, each weight is multiplied by

$$\sqrt{\frac{2}{n_{i}}}$$

![Kaiming initialization from a normal distribution](../assets/12.kaiming-normal.svg)
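A minimal sketch of this recipe, assuming the ReLU case only: `kaiming_normal_sketch` is a hypothetical helper that returns nested lists of shape `(fan_in, fan_out)` and does not handle the `nonlinearity` argument of the real function.

```python
import math
import random

def kaiming_normal_sketch(fan_in, fan_out, seed=None):
    """Hypothetical stand-in for `kaiming_normal` (ReLU case only)."""
    rng = random.Random(seed)
    std = math.sqrt(2.0 / fan_in)
    # Draw z ~ N(0, 1) and multiply by sqrt(2 / fan_in).
    return [[std * rng.gauss(0.0, 1.0) for _ in range(fan_out)]
            for _ in range(fan_in)]

W_kn = kaiming_normal_sketch(200, 100, seed=0)
flat_kn = [w for row in W_kn for w in row]
# The mean is 0 by construction, so the mean of squares estimates the variance.
var_kn = sum(w * w for w in flat_kn) / len(flat_kn)
print(f"empirical variance: {var_kn:.5f}, target: {2 / 200:.5f}")
```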

In [10]:
#| echo: false
#| output: asis
show_doc(kaiming_normal)

---

[source](https://github.com/m0saan/minima/blob/main/minima/init.py#L271){target="_blank" style="float:right; font-size:smaller"}

### kaiming_normal

>      kaiming_normal (fan_in, fan_out, nonlinearity='relu', **kwargs)

Fills the input Tensor with values according to the method described in
"Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification" - He, K. et al. (2015), using a normal distribution.
The resulting tensor will have values sampled from normal distribution with mean=0 and std=sqrt(2 / fan_in).

|    | **Type** | **Default** | **Details** |
| -- | -------- | ----------- | ----------- |
| fan_in | int |  | Number of input units in the weight tensor. |
| fan_out | int |  | Number of output units in the weight tensor. |
| nonlinearity | str | relu | The non-linear function (`nn.functional` name), recommended to use only with 'relu' or 'leaky_relu'. Default is 'relu'. |
| kwargs |  |  |  |
| **Returns** | **mi.Tensor** |  | **A tensor of shape (fan_in, fan_out), filled with random numbers from the normal distribution according to the Kaiming initialization.** |

There is also a version of Kaiming initialization to use for uniform distributions rather than normal distributions. The resulting weight matrix will have values sampled from a uniform distribution within the range $(-a, a)$, where 

$$a = \sqrt{\frac{6}{n_{i}}}$$

![Kaiming initialization from a uniform distribution](../assets/13.kaiming-uniform.svg)
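The bound $a$ is chosen so that the uniform distribution has the same variance as the normal variant: a uniform distribution on $(-a, a)$ has variance $\frac{a^2}{3}$, which equals $\frac{2}{n_{i}}$ when $a = \sqrt{\frac{6}{n_{i}}}$. The sketch below checks this empirically. `kaiming_uniform_sketch` is a hypothetical helper for the ReLU case that returns nested lists.

```python
import math
import random

def kaiming_uniform_sketch(fan_in, fan_out, seed=None):
    """Hypothetical stand-in for `kaiming_uniform` (ReLU case only)."""
    rng = random.Random(seed)
    bound = math.sqrt(6.0 / fan_in)
    # Sample each weight uniformly between -bound and bound.
    return [[rng.uniform(-bound, bound) for _ in range(fan_out)]
            for _ in range(fan_in)]

W_ku = kaiming_uniform_sketch(200, 100, seed=0)
flat_ku = [w for row in W_ku for w in row]
# The distribution is centered at 0, so the mean of squares estimates the variance.
var_ku = sum(w * w for w in flat_ku) / len(flat_ku)
print(f"empirical variance: {var_ku:.5f}, target: {2 / 200:.5f}")
```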

In [11]:
#| echo: false
#| output: asis
show_doc(kaiming_uniform)

---

[source](https://github.com/m0saan/minima/blob/main/minima/init.py#L303){target="_blank" style="float:right; font-size:smaller"}

### kaiming_uniform

>      kaiming_uniform (fan_in, fan_out, nonlinearity='relu', **kwargs)

Fills the input Tensor with values according to the method described in
"Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification" - He, K. et al. (2015), using a uniform distribution.
The resulting tensor will have values sampled from a uniform distribution in the range [-bound, bound], where bound = sqrt(6 / fan_in), which gives the weights a variance of 2 / fan_in.

|    | **Type** | **Default** | **Details** |
| -- | -------- | ----------- | ----------- |
| fan_in | int |  | Number of input units in the weight tensor. |
| fan_out | int |  | Number of output units in the weight tensor. |
| nonlinearity | str | relu | The non-linear function (`nn.functional` name), recommended to use only with 'relu' or 'leaky_relu'. Default is 'relu'. |
| kwargs |  |  |  |
| **Returns** | **mi.Tensor** |  | **A tensor of shape (fan_in, fan_out), filled with random numbers from the uniform distribution according to the Kaiming initialization.** |

**Advantages of Kaiming Initialization:**

1. It helps to keep the variance of the gradients roughly the same across all layers. This ensures that all layers in the network learn at about the same speed, avoiding the saturation of activation functions, and it can also help speed up the convergence of the network.
2. It performs better with ReLU and its variants because it accounts for the fact that ReLU zeroes out the negative half of a zero-centered input, so the mean square of a neuron's output is half the variance of its input. The factor of 2 in the weight variance compensates for this.
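The halving effect in the second point can be checked numerically. The sketch below is a simulation under the assumption that the pre-activations are zero-mean Gaussian; the value of `sigma` and the sample size are arbitrary.

```python
import random

rng = random.Random(0)
sigma = 1.5
pre_activations = [rng.gauss(0.0, sigma) for _ in range(100_000)]
post_activations = [max(0.0, z) for z in pre_activations]  # ReLU

mean_sq_pre = sum(z * z for z in pre_activations) / len(pre_activations)
mean_sq_post = sum(a * a for a in post_activations) / len(post_activations)
relu_ratio = mean_sq_post / mean_sq_pre
print(f"mean square after ReLU / before ReLU: {relu_ratio:.3f}")  # should land near 0.5
```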

## Export

In [None]:
import nbdev; nbdev.nbdev_export()