# Teacher's Assignment No. 14 - Q1

***Author:*** *Ofir Paz* $\qquad$ ***Version:*** *12.05.2024* $\qquad$ ***Course:*** *22961 - Deep Learning*

Welcome to question 1 of the fourth assignment of the course *Deep Learning*. \
In this question, we will implement the *SplitLinear* network layer, and make various gradient calculations related to it.

## Imports

First, we will import the required packages for this assignment.
- [pytorch](https://pytorch.org/) - One of the most fundamental and widely used tensor-handling libraries.

In [1]:
import torch  # pytorch.
import torch.nn as nn  # neural network module.
import torch.nn.functional as F  # functional module.

## SplitLinear Implementation

We will start with the implementation of the *SplitLinear* layer, using pytorch.

In [2]:
class SplitLinear(nn.Module):
    '''SplitLinear layer.
    
    The SplitLinear layer splits the input tensor in half, applies the same
    (shared) linear transformation to each half, concatenates the results,
    and applies ReLU.
    '''
    def __init__(self, layer_size: int) -> None:
        '''
        Constructor for the SplitLinear layer.

        Args:
            layer_size (int) - Number of features. Assumed to be even.
        '''
        super(SplitLinear, self).__init__()
        self.linear = nn.Linear(layer_size // 2, layer_size // 2)

        # Use Xavier initialization for the weights.
        # Reasoning for use in the video.
        nn.init.xavier_uniform_(self.linear.weight)
        nn.init.zeros_(self.linear.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        '''
        Forward pass of the layer.

        Args:
            x (torch.Tensor) - Input tensor.
                Assumes shape (batch_size, #features), where #features is even.

        Returns:
            torch.Tensor - Output tensor.
        '''

        # Split the input tensor in half.
        x1, x2 = torch.chunk(x, 2, dim=1)

        # Apply linear transformation to each half.
        x1, x2 = self.linear(x1), self.linear(x2)

        # Concatenate the results and apply ReLU.
        x = F.relu(torch.cat([x1, x2], dim=1))

        return x

In [3]:
# Example of a single pass through the `SplitLinear` layer.
split_linear = SplitLinear(6)

# Random input tensor.
X = torch.randn(2, 6)
print(f"Input:\n{X = }")
print(f"{X.shape = }\n")

# Forward pass (not using `.forward` for printing each stage).
with torch.no_grad():
    X1, X2 = torch.chunk(X, 2, dim=1)
    print(f"Split:\n{X1 = }\n{X2 = }")
    print(f"{X1.shape = }\n{X2.shape = }\n")

    Z1, Z2 = split_linear.linear(X1), split_linear.linear(X2)
    print(f"Linear:\n{Z1 = }\n{Z2 = }")
    print(f"{Z1.shape = }\n{Z2.shape = }\n")

    Y = F.relu(torch.cat([Z1, Z2], dim=1))
    print(f"Output:\n{Y = }")
    print(f"{Y.shape = }")

Input:
X = tensor([[-0.3942,  0.4170, -1.4738, -0.9345, -0.6397, -1.0555],
        [ 0.7832,  1.7541,  1.3419,  2.2032, -0.8232,  1.1084]])
X.shape = torch.Size([2, 6])

Split:
X1 = tensor([[-0.3942,  0.4170, -1.4738],
        [ 0.7832,  1.7541,  1.3419]])
X2 = tensor([[-0.9345, -0.6397, -1.0555],
        [ 2.2032, -0.8232,  1.1084]])
X1.shape = torch.Size([2, 3])
X2.shape = torch.Size([2, 3])

Linear:
Z1 = tensor([[ 0.6701,  1.4141,  0.8321],
        [-1.1550, -3.5769,  0.1444]])
Z2 = tensor([[ 0.2946,  2.4598, -0.2546],
        [ 1.4591, -2.4159,  1.1164]])
Z1.shape = torch.Size([2, 3])
Z2.shape = torch.Size([2, 3])

Output:
Y = tensor([[0.6701, 1.4141, 0.8321, 0.2946, 2.4598, 0.0000],
        [0.0000, 0.0000, 0.1444, 1.4591, 0.0000, 1.1164]])
Y.shape = torch.Size([2, 6])


## Block diagram

The following block diagram summarizes the SplitLinear layer.

<img src="block_diagram_q1.png"></img>

$\def\M2{\frac{m}{2}}$
## Analysis of SplitLinear vs. Standard Linear Layer
### Parameters in SplitLinear Layer
- Input size: $m$ (even)
- Output size: $m$
- Weight matrix: $(\M2, \M2)$
- Bias vector: $(\M2)$ (shared by both halves)
- Total Parameters: $(\M2)^2 + \M2$

### Parameters in Standard Linear Layer
- Weight matrix: $(m, m)$
- Bias vector: $(m)$
- Total Parameters: $m^2 + m$

### Ratio of Parameters
$$
\frac{\#SplitLinear}{\#Linear}
    = \frac{(\M2)^2 + \M2}{m^2 + m} 
    = \frac{\frac{m}{4} + \frac{1}{2}}{m + 1} 
    = \frac{1}{4} \cdot \frac{m + 2}{m + 1}
    \xrightarrow[m \rightarrow \infty]{} \frac{1}{4} 
$$
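The ratio can also be checked numerically. The following is a minimal sketch, assuming that the only parameterized part of `SplitLinear` is a single `nn.Linear(m // 2, m // 2)` as in the implementation above; the helper names `count_params` and `param_ratio` are introduced here for illustration only.

```python
import torch.nn as nn


def count_params(module: nn.Module) -> int:
    """Total number of trainable scalars in a module."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)


def param_ratio(m: int) -> float:
    """Ratio #SplitLinear / #Linear for an even feature count m."""
    split = nn.Linear(m // 2, m // 2)  # shared weights and bias of SplitLinear
    full = nn.Linear(m, m)             # standard linear layer of the same width
    return count_params(split) / count_params(full)


ratios = {m: param_ratio(m) for m in (2, 8, 64, 1024)}
for m, r in ratios.items():
    # Closed form taken directly from the parameter counts listed above.
    closed_form = ((m / 2) ** 2 + m / 2) / (m ** 2 + m)
    print(f"m = {m:5d}: counted = {r:.4f}, closed form = {closed_form:.4f}")
```

As $m$ grows, the printed ratio approaches $\frac{1}{4}$ from above.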

## Gradient Calculation

$ \def\d{\delta} \def\M2{\frac{M}{2}} \providecommand{\:}[2]{[#1 \space : \space #2]}$
To calculate the gradients of the cost $C$ with respect to the parameters of the `SplitLinear` layer, we can use the *chain rule*.

We will start with $ \frac{\d C}{\d W} $. Recall that $W$ and $b$ have dimensions $ (\M2, \M2) $ and $ \M2 $
respectively.

Assuming we have $\frac{\d C}{\d Y}$ already calculated, we get

$$
\frac{\d C}{\d w_{p, q}} = \frac{\d C}{\d Y_p} \cdot \frac{\d Y_p}{\d Z_p} \cdot \frac{\d Z_p}{\d w_{p, q}} 
                         + \frac{\d C}{\d Y_{p + \M2}} 
                           \cdot \frac{\d Y_{p + \M2}}{\d Z_{p + \M2}} 
                           \cdot \frac{\d Z_{p + \M2}}{\d w_{p, q}}
$$

We can write $Z$ in block-matrix form as

$$
Z = \begin{bmatrix} W & 0 \\ 0 & W \\ \end{bmatrix} \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}
  + \begin{bmatrix} b \\ b \end{bmatrix}
$$

Where $\begin{bmatrix} {X_1}_{(\M2)} & {X_2}_{(\M2)} \end{bmatrix} = X_{(M)}^T $. With this we can calculate

$$
\frac{\d Z_p}{\d w_{p, q}} = {X_1}_q = X_q \qquad \text{and} \qquad 
\frac{\d Z_{p + \M2}}{\d w_{p, q}} = {X_2}_q = X_{q + \M2}
$$
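The block-matrix representation of $Z$ can be sanity-checked in code. This is a small sketch for a single input vector, assuming an illustrative size `M = 6` and a fixed random seed; it compares the "split, apply, concatenate" view with the block-diagonal view built with `torch.block_diag`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
M = 6                                # illustrative feature count (even)
linear = nn.Linear(M // 2, M // 2)   # shared W and b
x = torch.randn(M)                   # a single input vector

with torch.no_grad():
    # Layer view: apply the shared linear map to each half.
    x1, x2 = torch.chunk(x, 2)
    z_split = torch.cat([linear(x1), linear(x2)])

    # Block-matrix view: diag(W, W) @ x + [b; b].
    W_block = torch.block_diag(linear.weight, linear.weight)
    b_block = torch.cat([linear.bias, linear.bias])
    z_block = W_block @ x + b_block

print(torch.allclose(z_split, z_block, atol=1e-6))
```

Both views should agree up to floating-point error, which is why each entry of $W$ receives gradient contributions from both halves of the input.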

The relation between $Y$ and $Z$ is that $ Y = \text{ReLU}(Z) $, so

$$
\frac{\d Y_m}{\d Z_m} = 
    \begin{cases}
        1, & \text{if } Z_m \geq 0 \\
        0, & \text{if } Z_m < 0
    \end{cases}
$$
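As a quick illustration, autograd can be asked for this derivative directly. The sketch below uses hand-picked values of $Z$ that avoid the point $Z_m = 0$, where the value of the derivative is a matter of convention and a library may choose differently from the formula above.

```python
import torch

# Hand-picked pre-activations, none of them exactly zero.
z = torch.tensor([-2.0, -0.5, 0.5, 3.0], requires_grad=True)
y = torch.relu(z)

# Summing y makes dC/dY = 1 for every entry, so z.grad equals dY/dZ.
y.sum().backward()

indicator = (z.detach() >= 0).float()  # 1{Z >= 0} from the formula above
print(z.grad)       # tensor([0., 0., 1., 1.])
print(indicator)    # tensor([0., 0., 1., 1.])
```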

Putting everything together, we get

$$
\boxed
{
\frac{\d C}{\d w_{p, q}} = \frac{\d C}{\d Y_p} \cdot 1\{Z_p \geq 0\} \cdot X_q
                         + \frac{\d C}{\d Y_{p + \M2}} \cdot 1\{Z_{p + \M2} \geq 0\} \cdot X_{q + \M2}
}
$$

We can also write this in matrix form as

$$
\frac{\d C}{\d W} = \frac{\d C}{\d Y}_{\:{0}{\M2}} \otimes 1\{Z_{\:{0}{\M2}} \geq 0\} \cdot X_{\:{0}{\M2}}^T
                  + \frac{\d C}{\d Y}_{\:{\M2}{M}} \otimes 1\{Z_{\:{\M2}{M}} \geq 0\} \cdot X_{\:{\M2}{M}}^T
$$

Where $\otimes$ multiplies each element of the vector on its left by the corresponding row of the term on its right (an element-to-row product).
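The matrix form of $ \frac{\d C}{\d W} $ can be compared against autograd. The following is a minimal sketch for a single input vector, assuming an illustrative size `M = 8`, a fixed seed, and a random vector `g` standing in for an already computed $\frac{\d C}{\d Y}$; none of these values come from the assignment.

```python
import torch

torch.manual_seed(0)
M = 8                       # illustrative feature count (even)
h = M // 2
W = torch.randn(h, h, requires_grad=True)
b = torch.randn(h)
x = torch.randn(M)          # single input vector
g = torch.randn(M)          # stands in for dC/dY (assumed given)

# Forward pass: shared W and b on each half, then ReLU.
z = torch.cat([W @ x[:h] + b, W @ x[h:] + b])
y = torch.relu(z)
y.backward(g)               # autograd result is stored in W.grad

# Manual gradient: (dC/dY * 1{Z >= 0}) outer X, summed over both halves.
# Random Z is never exactly 0 here, so the convention at Z = 0 does not matter.
d = g * (z.detach() >= 0).to(g.dtype)
dW_manual = torch.outer(d[:h], x[:h]) + torch.outer(d[h:], x[h:])

print(torch.allclose(W.grad, dW_manual, atol=1e-5))
```

Entry `dW_manual[p, q]` is exactly the two-term sum in the boxed element-wise formula.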

Continuing with $ \frac{\d C}{\d b} $, we can use the chain rule again to obtain

$$
\frac{\d C}{\d b_m} = \frac{\d C}{\d Y_m} \cdot \frac{\d Y_m}{\d Z_m} \cdot \frac{\d Z_m}{\d b_m} 
                     +  \frac{\d C}{\d Y_{m + \M2}} 
                        \cdot \frac{\d Y_{m + \M2}}{\d Z_{m + \M2}} 
                        \cdot \frac{\d Z_{m + \M2}}{\d b_m}
$$

and by the representation of $Z$ we get

$$
\frac{\d Z_m}{\d b_m} = 1 \qquad \text{and} \qquad \frac{\d Z_{m + \M2}}{\d b_m} = 1
$$

$ \frac{\d Y_m}{\d Z_m} $ was already calculated so we finally get

$$
\boxed
{
\frac{\d C}{\d b_m} = \frac{\d C}{\d Y_m} \cdot 1\{Z_m \geq 0\}
                    + \frac{\d C}{\d Y_{m + \M2}} \cdot 1\{Z_{m + \M2} \geq 0\}
}
$$

This can also be written in matrix form as

$$
\frac{\d C}{\d b} = \frac{\d C}{\d Y}_{\:{0}{\M2}} \otimes 1\{Z_{\:{0}{\M2}} \geq 0\}
                  + \frac{\d C}{\d Y}_{\:{\M2}{M}} \otimes 1\{Z_{\:{\M2}{M}} \geq 0\}
$$

Where now $\otimes$ represents the element-wise product.
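The bias gradient can be checked in the same spirit. This sketch uses `torch.autograd.grad` on a single input vector; the size `M = 8`, the seed, and the random `g` standing in for $\frac{\d C}{\d Y}$ are assumptions made only for illustration.

```python
import torch

torch.manual_seed(1)
M = 8                       # illustrative feature count (even)
h = M // 2
W = torch.randn(h, h)
b = torch.randn(h, requires_grad=True)
x = torch.randn(M)          # single input vector
g = torch.randn(M)          # stands in for dC/dY (assumed given)

# Forward pass with the shared bias added to both halves.
z = torch.cat([W @ x[:h] + b, W @ x[h:] + b])
y = torch.relu(z)

# Vector-Jacobian product g^T (dY/db), computed by autograd.
(db_autograd,) = torch.autograd.grad(y, b, grad_outputs=g)

# Manual gradient: element-wise product with the ReLU mask, summed over halves.
mask = (z.detach() >= 0).to(g.dtype)
db_manual = g[:h] * mask[:h] + g[h:] * mask[h:]

print(torch.allclose(db_autograd, db_manual, atol=1e-6))
```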

$\def\M4{\frac{M}{4}}$
If we were to change this layer so that the input is split into four equal parts, each gradient would be a sum of four terms:

$$
\frac{\d C}{\d w_{p, q}} = \sum_{i = 0}^3
    \frac{\d C}{\d Y_{p + i \cdot \M4}} \cdot 1\{Z_{p + i \cdot \M4} \geq 0\} \cdot X_{q + i \cdot \M4}
$$

$$
\frac{\d C}{\d b_m} = \sum_{i = 0}^3 \frac{\d C}{\d Y_{m + i \cdot \M4}} \cdot 1\{Z_{m + i \cdot \M4} \geq 0\}
$$

Where now $W$ and $b$ have dimensions $ (\M4, \M4) $ and $ \M4 $ respectively.
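The four-way case can be verified the same way. Below is a minimal sketch, assuming a hypothetical helper `split_forward` that generalizes the layer to `k` equal parts (it is not part of the assignment's `SplitLinear` class), with illustrative sizes `M = 12`, `k = 4` and a random `g` standing in for $\frac{\d C}{\d Y}$.

```python
import torch


def split_forward(x, W, b, k):
    """Hypothetical k-way variant: shared W and b on each of k equal parts."""
    chunks = torch.chunk(x, k)
    z = torch.cat([W @ c + b for c in chunks])
    return z, torch.relu(z)


torch.manual_seed(2)
M, k = 12, 4                # illustrative sizes; M must be divisible by k
h = M // k
W = torch.randn(h, h, requires_grad=True)
b = torch.randn(h, requires_grad=True)
x = torch.randn(M)          # single input vector
g = torch.randn(M)          # stands in for dC/dY (assumed given)

z, y = split_forward(x, W, b, k)
y.backward(g)               # autograd results are stored in W.grad and b.grad

# Row i of d and xs holds part i of (dC/dY * 1{Z >= 0}) and of X respectively.
d = (g * (z.detach() >= 0).to(g.dtype)).view(k, h)
xs = x.view(k, h)
dW_manual = sum(torch.outer(d[i], xs[i]) for i in range(k))
db_manual = d.sum(dim=0)

print(torch.allclose(W.grad, dW_manual, atol=1e-5))
print(torch.allclose(b.grad, db_manual, atol=1e-5))
```

Setting `k = 2` recovers the two-term formulas derived earlier.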