- [blog_link](https://eisenjulian.github.io/deep-learning-in-100-lines/)
- [PyTorch nn module tutorial](https://pytorch.org/tutorials/beginner/nn_tutorial.html)

In [3]:
import numpy as np

In [10]:
a = np.random.randint(low=0, high=10,size=(4,3))
c = np.random.randint(low=0, high=10,size=(3,4))

In [11]:
np.dot(a, c).shape

(4, 4)

In [12]:
np.matmul(a, c).shape # (n, k) @ (k, m) -> (n, m); here n is 4, k is 3, m is 4

(4, 4)

In [18]:
a @ c

array([[ 75,  17,  29,  96],
       [ 30,   8,  12,  33],
       [135,  52,  66, 174],
       [ 93,  51,  55, 117]])

## Param Class

We can start with a class that encapsulates
- a tensor
- its gradient

The tensor can be any array type, such as a NumPy array or a PyTorch tensor.

In [28]:
class Parameter():
    def __init__(self, tensor):
        self.tensor = tensor
        self.gradient = np.zeros_like(self.tensor)
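
As a quick illustration, the sketch below wraps a small array and inspects the gradient buffer. The 2x3 shape and the name `weights` are assumptions chosen only for this example.

```python
import numpy as np

class Parameter:
    def __init__(self, tensor):
        self.tensor = tensor
        self.gradient = np.zeros_like(self.tensor)

# Hypothetical 2x3 weight matrix, used only for illustration.
weights = Parameter(np.ones((2, 3)))

print(weights.gradient.shape)  # same shape as the tensor
print(weights.gradient.dtype)  # same dtype as the tensor
```

Because `np.zeros_like` copies the dtype, an integer tensor would get an integer gradient buffer, so floating-point arrays are the safer choice here.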

## Layer Class

Now we can create the layer class. The key idea is that during a forward pass (`forward()`) we return both

- layer output 
- `function`

The `function` will have the following signature:

```py
def function(input):
    """
    input (tensor): gradient of the loss with respect to the outputs
    return (tensor): the gradient with respect to the inputs, 
                    updating the weight gradients in the process
    """
    output = do_operation(input)
    return output

```

This is because, while evaluating the model layer by layer, there is no way to calculate the gradients before the final loss is known. The best a layer can do is return a function that can calculate the gradient later. That function is only called after the forward evaluation is complete, when the loss is known and all the information needed to compute that layer's gradients is available.
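
A minimal sketch of this idea, using a hypothetical element-wise squaring layer (`square_forward` is not part of the notebook; it is only here to show the pattern):

```python
import numpy as np

def square_forward(X):
    """Hypothetical layer computing X**2 element-wise."""
    def backward(D):
        # D is dLoss/dOutput; the chain rule gives dLoss/dX = D * 2X
        return D * 2 * X
    return X ** 2, backward

X = np.array([1.0, 2.0, 3.0])
output, backward = square_forward(X)   # forward pass: the loss is not known yet

# Later, once the loss is known, we feed its gradient back in.
# Assume loss = output.sum(), so dLoss/dOutput is all ones.
D = np.ones_like(output)
grad_X = backward(D)
print(grad_X)  # [2. 4. 6.]
```

The returned `backward` remembers `X` from the forward call, which is exactly the information it needs when the gradient finally arrives.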


The training process then has three steps: run the forward pass, run the backward pass to accumulate the gradients, and finally update the weights. It's important to do the update at the end, since weights can be reused in multiple layers and we don't want to mutate them before every gradient has been computed.
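
The three steps can be sketched with a single weight that is reused twice. Everything here (`scale`, `w`, `w_grad`, the learning rate) is a hypothetical stand-in for illustration, not the notebook's classes:

```python
import numpy as np

w = np.array([2.0])          # one weight, reused by two "layers"
w_grad = np.zeros_like(w)    # accumulated gradient for w

def scale(x):
    """Hypothetical layer: multiply the input by the shared weight w."""
    def backward(d):
        w_grad[...] += d * x   # accumulate, do not overwrite
        return d * w           # gradient with respect to the input
    return x * w, backward

# 1) forward pass through both uses of w
x = np.array([3.0])
h, back1 = scale(x)
y, back2 = scale(h)          # y = x * w**2

# 2) backward pass in reverse order, starting from dy/dy = 1
d = np.ones_like(y)
d = back2(d)
d = back1(d)                 # d is now dy/dx = w**2
total_grad = w_grad.copy()   # dy/dw = 2 * x * w, summed over both uses

# 3) update only after all gradients have been accumulated
lr = 0.1
w -= lr * w_grad
w_grad.fill(0)
```

If `w` were updated between the two `backward` calls, `back1` would multiply by a different weight than the one used in the forward pass, and the gradient would no longer be correct.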

In [44]:
class Layer:
    def __init__(self):
        self.parameters = []

    def forward(self, X):
        return X, lambda D: D

    def build_param(self, tensor):
        """
        tensor (numpy matrix)
        return:
            param (Parameter)
        """
        param = Parameter(tensor)
        self.parameters.append(param)
        return param
    
    def update(self, optimizer):
        for param in self.parameters: optimizer.update(param)

It's standard to delegate the job of updating the parameters to an optimizer, which receives each parameter after every batch. The simplest and best-known optimization method is mini-batch stochastic gradient descent.

### Code Analysis
Also look at line 6, where `forward()` returns a lambda. The lambda is a function that can be called later with the appropriate argument.

How can a returned lambda function work?

Let's see an example:

```py
def fun(n):
    """
    return a lambda function that adds n to its argument
    """
    return lambda x: x + n

x = 3
inc3 = fun(3)
inc3(x)  # returns x + 3 (x itself is unchanged)
# ans: 6

inc5 = fun(5)
inc5(x)  # returns x + 5
# ans: 8
```

## SGD Optimizer Class

In [40]:
class SGDOptimizer():
    def __init__(self, lr=0.1):
        self.lr = lr

    def update(self, param):
        """
        Apply one SGD step to the tensor, then reset the gradient to 0.
        Why reset? backward() in class Linear accumulates gradients with `+=`,
        so without the reset, gradients from the previous batch would leak into the next one.
        """
        param.tensor -= self.lr * param.gradient
        param.gradient.fill(0) 
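
A small usage sketch, assuming the `Parameter` and `SGDOptimizer` classes exactly as defined in this notebook; the tensor values and the hand-written gradients are made up for illustration:

```python
import numpy as np

class Parameter:
    def __init__(self, tensor):
        self.tensor = tensor
        self.gradient = np.zeros_like(self.tensor)

class SGDOptimizer:
    def __init__(self, lr=0.1):
        self.lr = lr

    def update(self, param):
        param.tensor -= self.lr * param.gradient
        param.gradient.fill(0)

param = Parameter(np.array([1.0, 2.0]))
optimizer = SGDOptimizer(lr=0.1)

# Two backward passes accumulate into the same buffer, as `+=` does in a layer.
param.gradient += np.array([1.0, 1.0])
param.gradient += np.array([1.0, 1.0])

optimizer.update(param)
print(param.tensor)    # [0.8 1.8]
print(param.gradient)  # [0. 0.]
```

The step uses the accumulated gradient of 2 per entry, and the buffer is cleared so the next batch starts from zero.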

Next, we build our `Linear` layer by extending the `Layer` class.

## Linear Class

For reference, here is the `Layer` class that the `Linear` class extends:
```py
class Layer:
    def __init__(self):
        self.parameters = []

    def forward(self, X):
        return X, lambda D: D

    def build_param(self, tensor):
        """
        tensor (numpy matrix)
        return:
            param (Parameter)
        """
        param = Parameter(tensor)
        self.parameters.append(param)
        return param
    
    def update(self, optimizer):
        for param in self.parameters: optimizer.update(param)
```

In [42]:
class Linear(Layer):
    def __init__(self, inputs, outputs):
        """
        inputs (int): input dimension
        outputs (int): output dimension
        """
        super().__init__()
        tensor = np.random.randn(inputs, outputs) * np.sqrt(1 / inputs)
        self.weights = self.build_param(tensor)
        self.bias = self.build_param(np.zeros(outputs))

    def forward(self, X):
        def backward(D):
            self.weights.gradient += X.T @ D
            self.bias.gradient += D.sum(axis=0)
            return D @ self.weights.tensor.T
        
        return X @ self.weights.tensor +  self.bias.tensor, backward

### Code Analysis

**Line 8**

The weights are initialized with a Xavier-style initialisation: standard normal samples are scaled by `sqrt(1/n)`, where `n` is the input dimension.
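
To see what the scaling buys us, the sketch below compares the spread of the layer output with and without the `sqrt(1/n)` factor. The sizes, the seeded generator, and the unit-variance input are assumptions made for this illustration:

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator, assumed here for reproducibility
inputs, outputs = 512, 64

X = rng.standard_normal((1000, inputs))   # unit-variance input activations
W_scaled = rng.standard_normal((inputs, outputs)) * np.sqrt(1 / inputs)
W_plain = rng.standard_normal((inputs, outputs))  # same weights without scaling

std_scaled = (X @ W_scaled).std()
std_plain = (X @ W_plain).std()

print(std_scaled)  # close to 1
print(std_plain)   # close to sqrt(512), roughly 22.6
```

With the scaling, the output keeps roughly the same spread as the input, whereas the unscaled weights grow it by a factor of about `sqrt(n)` per layer.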

**Line 9, 10**

`build_param()` returns a `Parameter` object, which holds a tensor and its gradient. The tensor is set to the input tensor, while the gradient is initially filled with `0`.

**Line 14-16**

The `backward()` definition is similar to the following:

```py
def outfun(x):
    def infun(n):
        print('process infun() running ...')
        return x*n
    return 5*x, infun

x = 10
z, y = outfun(x)   # z == 50, y is infun with x bound to 10

y(2)
# Ans:
#    process infun() running ...
#    20
```

Here `outfun` returns `5*x` together with a **partially specified function**, `infun()`. Why **partial**? When `infun(n)` is defined it does not yet know `n`, but it has already captured `x` from `outfun(x)`. Such a function is called a closure. Later, calling `y(2)` runs `infun` with `n == 2` and returns `20`. So the definition of `infun()` is not static but dynamic, because part of it is bound to the argument of `outfun()`. Different calls such as `outfun(10)` and `outfun(20)` therefore return different `infun`s: `infun(2)` gives `20` for the first and `40` for the second.
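
The same closure mechanism is what lets `backward()` in `Linear` remember `X`. As a sanity check on lines 14-16, the sketch below compares the weight gradient from `backward()` against central finite differences. The `Linear` here is a condensed copy without the `Layer` base class, and the loss, seed, and shapes are assumptions chosen for the test:

```python
import numpy as np

class Parameter:
    def __init__(self, tensor):
        self.tensor = tensor
        self.gradient = np.zeros_like(self.tensor)

class Linear:
    """Condensed copy of the Linear layer above (Layer base class omitted)."""
    def __init__(self, inputs, outputs):
        self.weights = Parameter(np.random.randn(inputs, outputs) * np.sqrt(1 / inputs))
        self.bias = Parameter(np.zeros(outputs))

    def forward(self, X):
        def backward(D):
            self.weights.gradient += X.T @ D
            self.bias.gradient += D.sum(axis=0)
            return D @ self.weights.tensor.T
        return X @ self.weights.tensor + self.bias.tensor, backward

np.random.seed(0)
layer = Linear(3, 2)
X = np.random.randn(4, 3)
G = np.random.randn(4, 2)   # hypothetical upstream gradient dLoss/dOutput

def loss():
    out, _ = layer.forward(X)
    return (out * G).sum()   # linear loss, so dLoss/dOutput == G

_, backward = layer.forward(X)
grad_X = backward(G)         # fills weights.gradient and bias.gradient

# Central finite differences on every weight entry
eps = 1e-6
numeric = np.zeros_like(layer.weights.tensor)
for idx in np.ndindex(*numeric.shape):
    original = layer.weights.tensor[idx]
    layer.weights.tensor[idx] = original + eps
    plus = loss()
    layer.weights.tensor[idx] = original - eps
    minus = loss()
    layer.weights.tensor[idx] = original
    numeric[idx] = (plus - minus) / (2 * eps)

matches = np.allclose(numeric, layer.weights.gradient, atol=1e-5)
print(matches)  # True
```

The agreement suggests that `X.T @ D` is the gradient of the loss with respect to the weights, while `D.sum(axis=0)` and `D @ W.T` play the same role for the bias and the input.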


In [45]:
def outfun(x):
    def infun(n):
        print('process infun() running ...')
        return x*n
    return 5*x, infun

In [46]:
x = 10
z, y = outfun(x)

y(2)