# Deep Learning - Lab Exercise 5


**WARNING:** you must have finished the previous exercise before this one as you will re-use parts of the code.

In the first lab exercise, we built a simple linear classifier.
Although it gives reasonable results on the MNIST dataset (~92.5% accuracy), deeper neural networks can achieve more than 99% accuracy.
However, explicitly coding the forward and backward passes quickly becomes impractical.
Hence, it is useful to rely on an auto-diff library where we specify the forward pass once, and the backward pass is automatically deduced from the computational graph structure.

In this lab exercise, we will build a small and simple auto-diff library that mimics the autograd mechanism of PyTorch (of course, we will simplify a lot!).


In [80]:
# import libs that we will use
import os
import numpy as np
import matplotlib.pyplot as plt
import math

# To load the data we will use the script of Gaetan Marceau Caron
# You can download it from the course website and move it to the same directory as this ipynb file
import dataset_loader

%matplotlib inline

In [81]:
if("mnist.pkl.gz" not in os.listdir(".")):
    # this link doesn't work any more:
    # search online for the file "mnist.pkl.gz"
    # and download it manually
    !wget https://github.com/mnielsen/neural-networks-and-deep-learning/raw/master/data/mnist.pkl.gz


# if you have it somewhere else, you can comment the lines above
# and overwrite the path below
mnist_path = "./mnist.pkl.gz"

In [82]:
# load the 3 splits
train_data, dev_data, test_data = dataset_loader.load_mnist(mnist_path)

## Computation Graph

Instead of directly manipulating numpy arrays, we will manipulate an abstraction that contains:
- a value (i.e. a numpy array)
- a bool indicating if we wish to compute the gradient with respect to the value
- the gradient with respect to the value
- the operation to call during backpropagation

There will be two kinds of nodes:
- ComputationGraphNode: a generic computation node
- Parameter: a computation node that is used to store parameters of the network. Parameters are always leaf nodes, i.e. they cannot be built from other computation nodes.

Our implementation of the backward pass will be very simple and incorrect in the general case (i.e. it won't work with computation graphs that contain loops).
We will just apply the derivative function of a given node and then recursively call those of its input nodes.
This simple algorithm is good enough for this exercise.

Note that a real implementation of backprop will store temporary values during forward that can be used during backward to improve computation speed. We do not do that here.


In [83]:
class ComputationGraphNode(object):
    
    def __init__(self, data, require_grad=False):
        # we initialise the value of the node and the grad
        if(not isinstance(data, np.ndarray)):
            data = np.array(data)
        self.value = data
        self.grad = None
        
        self.require_grad = require_grad
        self.func = None
        self.input_nodes = None
        self.func_parameters = []
    
    def set_input_nodes(self, *nodes):
        self.input_nodes = list(nodes)

    def set_func_parameters(self, *func_parameters):
        self.func_parameters = list(func_parameters)
    
    def set_func(self, func):
        self.func = func

    def zero_grad(self):
        if self.grad is not None:
            self.grad.fill(0)

    def set_gradient(self, gradient):
        """
        Accumulate gradient for this tensor
        """
        if gradient.shape != self.value.shape:
            raise RuntimeError(
                f"Invalid gradient dimension: got {gradient.shape}, expected {self.value.shape}"
            )
        if self.grad is None:
            # copy so that later in-place accumulation never mutates the caller's array
            self.grad = gradient.copy()
        else:
            self.grad += gradient
    
    def backward(self, g=None):
        if g is None:
            g = self.value.copy()
            g.fill(1.)
        self.set_gradient(g)
        if self.func is not None:
            grad_list = self.func.backward(*(self.input_nodes + self.func_parameters + [g]))
            for input_node, ngrad in zip(self.input_nodes, grad_list):
                input_node.backward(ngrad)
    
    def __add__(self, y):
        if not isinstance(y, ComputationGraphNode):
            y = ComputationGraphNode(y)
        return Addition()(self, y)

    def __getitem__(self, slice):
        return Selection()(self, slice)

    def __str__(self):
        return self.value.__str__()

    def __repr__(self):
        return self.value.__str__()

class Parameter(ComputationGraphNode):
    def __init__(self, data, name="default"):
        super().__init__(data, require_grad=True)
        self.name  = name

    def backward(self, g=None):
        if g is not None:
            self.set_gradient(g)

The `Operation` class has three methods; you should reimplement only the `forward` and `backward` methods.
* The `forward` method computes the function from its inputs and returns a new node that must contain the information needed for the backward pass.
* The `backward` method computes the gradient of the function with respect to its inputs, given the gradient of the output and other information (forward pass inputs, parameters of the function, ...). **It should return a tuple.**

To make this concrete, two operations are implemented below: the selection and the addition (note that they can only be used once the node class is defined).
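One way to sanity-check a `backward` method is to compare the gradient it returns with a finite-difference estimate. The sketch below is independent of the classes in this notebook: `numerical_gradient` is a hypothetical helper name, and the function being differentiated is assumed to return a scalar.

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Central finite-difference estimate of the gradient of a scalar-valued f at x."""
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        x_plus = x.copy()
        x_minus = x.copy()
        x_plus[idx] += eps
        x_minus[idx] -= eps
        grad[idx] = (f(x_plus) - f(x_minus)) / (2 * eps)
    return grad

# Toy check: for f(x) = sum(x ** 2) the analytic gradient is 2 * x.
x0 = np.array([[1.0, -2.0], [0.5, 3.0]])
approx = numerical_gradient(lambda v: np.sum(v ** 2), x0)
print(np.allclose(approx, 2 * x0, atol=1e-5))
```

The same comparison could be applied to any operation by wrapping its forward pass in a function that sums the output, at the cost of two forward evaluations per input element.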

In [116]:
class Operation(object):
    @staticmethod
    def forward(*args):
        raise NotImplementedError("It is an abstract method")
    
    def __call__(self, *args):
        output_node = self.forward(*args)
        output_node.set_func(self)
        return output_node
        
    @staticmethod
    def backward(*args):
        pass

class Addition(Operation):
    @staticmethod
    def forward(x, y):
        output_array = x.value + y.value
        output_node = ComputationGraphNode(output_array)
        output_node.set_input_nodes(x, y)
        return output_node

    @staticmethod
    def backward(x, y, gradient):
        return (gradient, gradient)

class Selection(Operation):
    @staticmethod
    def forward(x, slice):
        np_x = x.value

        output_array = np_x.__getitem__(slice)
        
        output_node = ComputationGraphNode(output_array)
        output_node.set_input_nodes(x)
        output_node.set_func_parameters(slice)

        return output_node
        
    @staticmethod
    def backward(x, slice, gradient):
        np_x = x.value

        cgrad = np_x.copy()
        cgrad.fill(0)
        cgrad.__setitem__(slice, gradient)
        
        return cgrad,

**Question 1** Complete the following class 

In [85]:
class ReLU(Operation):
    @staticmethod
    def forward(x):
        # we copy the value of the input node
        np_x = x.value.copy()

        # set negative elements to zero
        np_x[np_x < 0] = 0 # notice we consider strictly < 0

        # we create the output node needing only the node x
        output_node = ComputationGraphNode(np_x)
        output_node.set_input_nodes(x)

        return output_node

    @staticmethod
    def backward(x, gradient):
        """
        Computes the gradient of the loss with respect to the input of ReLU.

        Args:
            x (ComputationGraphNode): The input node from the forward pass.
            gradient (np.ndarray): The gradient flowing back from the next layer (dL/d(output)).

        Returns:
            tuple: A tuple containing the gradient with respect to the input x (dL/dx).
        """
        # Get the value of the input node
        np_x = x.value

        # Create a mask where input was positive (derivative is 1)
        # Elsewhere, the derivative is 0.
        relu_derivative = (np_x > 0).astype(gradient.dtype) # Use same dtype as gradient

        # Apply chain rule: dL/dx = dL/d(output) * d(output)/dx
        grad_input = gradient * relu_derivative

        # Return as a tuple (as required by the framework's backward structure)
        return (grad_input,)

We recall that: $$\tanh(z)= \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$$ 

However, we can run into stability issues if $|z|$ is large, e.g. $e^{10000}$ overflows to infinity. Indeed, in Python using numpy:


>np.exp(10000)


leads to:

>/tmp/ipykernel_7784/2473798304.py:1: RuntimeWarning: overflow encountered in exp
>np.exp(10000)
>
>inf

We can use the same trick as in the softmax computation, based on the following simple observation: 
$$
\begin{aligned}
 \tanh(z) &= \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}} \\
 &= \left(\frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}\right)\frac{e^{-a}}{e^{-a}} \\
 &= \frac{e^{z}e^{-a} - e^{-z}e^{-a}}{e^{z}e^{-a} + e^{-z}e^{-a}} \\
&= \frac{e^{z-a} - e^{-z-a}}{e^{z-a} + e^{-z-a}}
\end{aligned}
$$
We therefore want both $z-a$ and $-z-a$ to be small, in our case no greater than $0$. Taking $a = |z|$, the absolute value of $z$, gives
$z-a\leq 0$ and $-z-a\leq 0$.


For the backward pass, note that $\tanh'(z) = 1-\tanh(z)^2$.
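The shift by $|z|$ can be tried out in isolation. The following is a minimal standalone sketch (the name `stable_tanh` is chosen here for illustration) that evaluates the shifted formula element-wise and compares it with numpy's built-in `np.tanh` on inputs that would overflow a naive implementation.

```python
import numpy as np

def stable_tanh(z):
    """tanh computed with both exponents shifted by a = |z| (element-wise)."""
    z = np.asarray(z, dtype=float)
    a = np.abs(z)
    e_pos = np.exp(z - a)    # exponent <= 0, so no overflow
    e_neg = np.exp(-z - a)   # exponent <= 0, so no overflow
    # One of the two terms is exactly 1, so the denominator is at least 1.
    return (e_pos - e_neg) / (e_pos + e_neg)

z = np.array([-10000.0, -1.0, 0.0, 1.0, 10000.0])
out = stable_tanh(z)
print(np.all(np.isfinite(out)), np.allclose(out, np.tanh(z)))

# Local derivative used in the backward pass: 1 - tanh(z)^2
local_derivative = 1.0 - out ** 2
```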


In [86]:
class TanH(Operation):
    @staticmethod
    def TanHCompute(z):
        """Computes tanh(z) element-wise using a numerically stable approach."""
        if not isinstance(z, np.ndarray):
             z = np.array(z) # Ensure input is a numpy array

        # Numerical stability trick: subtract a = |z| (element-wise) inside the exponents,
        # so that both exponents are <= 0 and np.exp cannot overflow.
        abs_z = np.abs(z)

        exp_pos = np.exp(z - abs_z)   # Corresponds to exp(z-a)
        exp_neg = np.exp(-z - abs_z)  # Corresponds to exp(-z-a)

        # One of the two exponents is always exactly 0, so the denominator
        # lies in [1, 2] and the division is safe without any epsilon.
        tanh_val = (exp_pos - exp_neg) / (exp_pos + exp_neg)

        return tanh_val

    @staticmethod
    def forward(x):
        """Computes the forward pass for TanH."""
        output_array = TanH.TanHCompute(x.value)

        output_node = ComputationGraphNode(output_array)
        output_node.set_input_nodes(x)

        return output_node

    @staticmethod
    def backward(x, gradient):
        """
        Computes the gradient of the loss with respect to the input of TanH.

        Args:
            x (ComputationGraphNode): The input node from the forward pass.
            gradient (np.ndarray): The gradient flowing back from the next layer (dL/d(output)).

        Returns:
            tuple: A tuple containing the gradient with respect to the input x (dL/dx).
        """
        # Recompute tanh(x) - necessary as the forward output isn't directly available here.
        # Use the stable computation method.
        tanh_x_value = TanH.TanHCompute(x.value)

        # Compute the local derivative: d(tanh(x))/dx = 1 - tanh(x)^2
        local_derivative = 1.0 - tanh_x_value**2

        # Apply chain rule: dL/dx = dL/d(output) * d(output)/dx
        grad_input = gradient * local_derivative

        # Return as a tuple
        return (grad_input,)

**Question 2:** Next, we implement the affine transform operation.
You can reuse the code from the third lab exercise, with one major difference: you have to compute the gradient with respect to x too!
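As a reminder of what has to be computed, here is a short derivation under the convention $y = xW + b$ with $x$ of shape $(N, D_{in})$ and $W$ of shape $(D_{in}, D_{out})$. The symbol $G$ is notation introduced here for the incoming gradient $\partial L / \partial y$, of shape $(N, D_{out})$.

$$
\begin{aligned}
y_{nk} &= \sum_{j} x_{nj} W_{jk} + b_k, \qquad G_{nk} = \frac{\partial L}{\partial y_{nk}} \\
\frac{\partial L}{\partial W_{jk}} &= \sum_{n} x_{nj} G_{nk} \quad\Longrightarrow\quad \frac{\partial L}{\partial W} = x^\top G \\
\frac{\partial L}{\partial b_{k}} &= \sum_{n} G_{nk} \\
\frac{\partial L}{\partial x_{nj}} &= \sum_{k} G_{nk} W_{jk} \quad\Longrightarrow\quad \frac{\partial L}{\partial x} = G W^\top
\end{aligned}
$$

Each gradient has the same shape as the quantity it is taken with respect to, which is a quick way to catch a misplaced transpose.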

In [87]:
class affine_transform(Operation):
    @staticmethod
    def forward(W, b, x):
        """
        Computes the forward pass of an affine transform: y = x @ W + b.

        Args:
            W (Parameter): Weight matrix node, shape (D_in, D_out).
            b (Parameter): Bias vector node, shape (1, D_out) or (D_out,).
            x (ComputationGraphNode): Input node, shape (N, D_in).

        Returns:
            ComputationGraphNode: Output node y, shape (N, D_out).
        """
        # Get numpy values from input nodes
        np_W = W.value
        np_b = b.value
        np_x = x.value

        # Validate shapes for matrix multiplication
        if np_x.shape[-1] != np_W.shape[0]:
             raise ValueError(f"Shape mismatch for x @ W: x shape {np_x.shape}, W shape {np_W.shape}")

        # Perform the affine transformation
        # Using @ operator for matrix multiplication (preferred in Python 3.5+)
        try:
            output_array = np_x @ np_W + np_b
        except ValueError as e:
             # Catch potential broadcasting issues with bias
             raise ValueError(f"Error during affine transform x({np_x.shape}) @ W({np_W.shape}) + b({np_b.shape}): {e}")


        # Create the output node
        output_node = ComputationGraphNode(output_array)
        # IMPORTANT: Store input nodes in the order W, b, x for backward pass
        output_node.set_input_nodes(W, b, x)
        # The Operation.__call__ method will set output_node.set_func(self)

        return output_node

    @staticmethod
    def backward(W, b, x, gradient):
        """
        Computes the backward pass for the affine transform.

        Args:
            W (Parameter): Weight matrix node from forward pass.
            b (Parameter): Bias vector node from forward pass.
            x (ComputationGraphNode): Input node from forward pass.
            gradient (np.ndarray): Gradient flowing back from the next layer (dL/dy), shape (N, D_out).

        Returns:
            tuple: Gradients (dL/dW, dL/db, dL/dx).
        """
        # Get numpy values needed for gradient calculations
        np_W = W.value # Shape (D_in, D_out)
        np_x = x.value # Shape (N, D_in)
        # np_b = b.value # Bias value itself not needed for derivative calculations

        # --- Calculate dL/dW ---
        # dL/dW = x.T @ (dL/dy)
        try:
            grad_W = np_x.T @ gradient # Shape: (D_in, N) @ (N, D_out) -> (D_in, D_out)
        except ValueError as e:
            raise ValueError(f"Shape mismatch calculating grad_W: x.T({np_x.T.shape}) @ grad({gradient.shape}): {e}")


        # --- Calculate dL/db ---
        # dL/db = sum(dL/dy over batch dimension)
        grad_b = np.sum(gradient, axis=0) # Shape: (D_out,)
        # Ensure grad_b shape matches b.value's shape for accumulation
        # If b.value was (1, D_out), reshape grad_b
        if b.value.ndim == 2 and b.value.shape[0] == 1:
             if grad_b.ndim == 1: # Check if sum reduced dimension
                 grad_b = grad_b.reshape(1, -1) # Reshape to (1, D_out)
        # If b.value was (D_out,), grad_b is likely already (D_out,)


        # --- Calculate dL/dx ---
        # dL/dx = (dL/dy) @ W.T
        try:
            grad_x = gradient @ np_W.T # Shape: (N, D_out) @ (D_out, D_in) -> (N, D_in)
        except ValueError as e:
            raise ValueError(f"Shape mismatch calculating grad_x: grad({gradient.shape}) @ W.T({np_W.T.shape}): {e}")


        # Return gradients in the order corresponding to the inputs (W, b, x)
        return grad_W, grad_b, grad_x

**Question 3:** Define the NLL operation

We recall that 
$$nll(x, y)= -\log\left(\frac{e^{x_{y}}}{ \sum\limits_{j=1}^n e^{x_{j}} }\right) = -x_{y} + \log\left(\sum\limits_{j=1}^n e^{x_{j}}\right)$$

$$
    \begin{align*}
        \frac{\partial nll(x, y)}{\partial x_i} &= - \mathbb{1}_{y = i} + \frac{\partial \log\left(\sum\limits_{j=1}^n e^{x_{j}}\right)}{\partial\sum\limits_{j=1}^n e^{x_{j}}}\frac{\partial \sum\limits_{j=1}^n e^{x_{j}}}{\partial x_i} \\
        &= - \mathbb{1}_{y = i} + \frac{e^{x_i}}{\sum\limits_{j=1}^n e^{x_{j}}} 
    \end{align*}
$$
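The gradient formula can be checked numerically on a single example. The sketch below is standalone: `nll_single` and `nll_grad_single` are hypothetical helper names, and both shift the logits by their maximum before exponentiating, which is the usual softmax stabilisation and is an addition to the formula above.

```python
import numpy as np

def nll_single(x, y):
    """-x_y + log(sum_j exp(x_j)) for one logit vector x and one label y."""
    m = np.max(x)
    return -x[y] + m + np.log(np.sum(np.exp(x - m)))

def nll_grad_single(x, y):
    """softmax(x) minus the one-hot encoding of y."""
    e = np.exp(x - np.max(x))
    grad = e / e.sum()
    grad[y] -= 1.0
    return grad

x = np.array([2.0, -1.0, 0.5, 3.0])
y = 2
analytic = nll_grad_single(x, y)

# Central finite differences, one coordinate at a time
eps = 1e-5
numeric = np.zeros_like(x)
for i in range(len(x)):
    step = np.zeros_like(x)
    step[i] = eps
    numeric[i] = (nll_single(x + step, y) - nll_single(x - step, y)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))

# Large logits stay finite thanks to the max shift
big_loss = nll_single(np.array([1000.0, 0.0]), 0)
```

Since the softmax sums to one and exactly one entry of the indicator is one, the entries of the gradient sum to zero, which gives another cheap check.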

In [88]:
class nll(Operation):
    @staticmethod
    def forward(x, y):
        np_x = x.value # Shape (N, C)
        np_y = y.value.flatten().astype(int) # Ensure (N,) and integer type for indexing

        N = np_x.shape[0]
        if N == 0: # Handle empty batch
            return ComputationGraphNode(np.array([]))
        if np_y.shape[0] != N:
            raise ValueError(f"Batch size mismatch between x ({N}) and y ({np_y.shape[0]})")

        # --- Numerically Stable Log-Sum-Exp ---
        # Find max logit for each sample for stability
        max_logits = np.max(np_x, axis=1, keepdims=True) # Shape (N, 1)
        # Subtract max, exponentiate, sum, log, add max back
        stable_x = np_x - max_logits # Shape (N, C)
        log_sum_exp = np.log(np.sum(np.exp(stable_x), axis=1, keepdims=True)) + max_logits # Shape (N, 1)

        # --- Select the logit corresponding to the true class y ---
        # Use advanced indexing to get x[n, y[n]] for each n in N
        logits_of_true_class = np_x[np.arange(N), np_y].reshape(N, 1) # Shape (N, 1)

        # --- Compute NLL ---
        # nll = -x_y + log(sum(exp(x_j)))
        loss_array = -logits_of_true_class + log_sum_exp # Shape (N, 1)

        # Create the output node holding the per-sample loss, shape (N,)
        output_node = ComputationGraphNode(loss_array.flatten())
        output_node.set_input_nodes(x, y)
        # No extra func_parameters are needed: backward only uses x and y.

        return output_node

    @staticmethod
    def backward(x, y, gradient):
        np_x = x.value # Shape (N, C)
        np_y = y.value.flatten().astype(int) # Shape (N,)

        N, C = np_x.shape
        if N == 0: # Handle empty batch
             return (np.zeros_like(np_x), None)

        # --- Numerically Stable Softmax ---
        max_logits = np.max(np_x, axis=1, keepdims=True) # Shape (N, 1)
        stable_x = np_x - max_logits # Shape (N, C)
        exp_x = np.exp(stable_x) # Shape (N, C)
        sum_exp_x = np.sum(exp_x, axis=1, keepdims=True) # Shape (N, 1)
        softmax_x = exp_x / sum_exp_x # Shape (N, C)

        # --- Gradient Calculation (d(nll)/dx = softmax(x) - indicator(y=i)) ---
        # Create the indicator matrix (one-hot encoding of y)
        indicator = np.zeros_like(np_x) # Shape (N, C)
        indicator[np.arange(N), np_y] = 1.0

        # Calculate the local gradient d(nll)/dx
        grad_x_local = softmax_x - indicator # Shape (N, C)

        # --- Apply Chain Rule: dL/dx = dL/d(nll_output) * d(nll)/dx ---
        # The incoming gradient is dL/d(nll_output).
        # If nll.forward returned (N,), gradient should be (N,) or broadcastable (scalar)
        if gradient.ndim == 0: # Scalar gradient (e.g., 1.0 or 1.0/N)
             grad_x = gradient * grad_x_local
        elif gradient.ndim == 1 and gradient.shape[0] == N:
             # gradient shape (N,), grad_x_local shape (N, C)
             # Multiply each sample's gradient with its local gradient vector
             grad_x = gradient[:, np.newaxis] * grad_x_local # (N, 1) * (N, C) -> (N, C)
        else:
             raise ValueError(f"Gradient shape {gradient.shape} incompatible with NLL output ({N},)")


        # Gradient w.r.t y is not defined/needed
        grad_y = None

        return (grad_x, grad_y)

# Module

Neural networks or parts of neural networks will be stored in Modules.
They implement a method to retrieve all parameters of the network and its subnetworks.

In [89]:
class Module:
    def __init__(self):
        pass
        
    def parameters(self):
        ret = []
        for name in dir(self):
            o = self.__getattribute__(name)

            if type(o) is Parameter:
                ret.append(o)
            if isinstance(o, Module) or isinstance(o, ModuleList):
                ret.extend(o.parameters())
        return ret

# if you want to store a list of Parameters or Module,
# you must store them in a ModuleList instead of a python list,
# in order to collect the parameters correctly
class ModuleList(list):
    def parameters(self):
        ret = []
        for m in self:
            if type(m) is Parameter:
                ret.append(m)
            elif isinstance(m, Module) or isinstance(m, ModuleList):
                ret.extend(m.parameters())
        return ret

# Initialization and optimization

**Question 1:** Implement the different initialisation methods
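For reference, the two weight initialisation schemes usually meant by these names can be written as follows, assuming the uniform variant of Glorot and the normal variant of Kaiming for a weight matrix $W$ with $n_{in}$ rows and $n_{out}$ columns (notation introduced here). Biases are simply set to zero.

$$
\begin{aligned}
\text{Glorot (uniform):}\quad & W_{ij} \sim \mathcal{U}\left(-\sqrt{\frac{6}{n_{in} + n_{out}}},\ \sqrt{\frac{6}{n_{in} + n_{out}}}\right) \\
\text{Kaiming (normal):}\quad & W_{ij} \sim \mathcal{N}\left(0,\ \frac{2}{n_{in}}\right)
\end{aligned}
$$

Here $\mathcal{N}(0, \sigma^2)$ is parameterised by its variance, so the standard deviation passed to a sampler would be $\sqrt{2 / n_{in}}$.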

In [90]:
def zero_init(b):
    """Initializes a NumPy array (typically a bias) with zeros in-place."""
    b.fill(0)  # works in-place for any numeric dtype

def glorot_init(W):
    """
    In-place initialization of a weight matrix W using Glorot (Xavier) uniform method.
    Suitable for layers followed by symmetric activations like tanh.
    """
    if W.ndim != 2:
        raise ValueError("Glorot initialization expects a 2D weight matrix.")

    fan_in, fan_out = W.shape[0], W.shape[1]
    if fan_in + fan_out == 0: # Avoid division by zero for empty layers
        W[:, :] = 0.0 # Or handle as appropriate
        return

    # Calculate the limit for the uniform distribution
    limit = np.sqrt(6.0 / (fan_in + fan_out))

    # Generate random values from Uniform(-limit, limit) and assign in-place
    W[:, :] = np.random.uniform(low=-limit, high=limit, size=W.shape)

def kaiming_init(W):
    """
    In-place initialization of a weight matrix W using Kaiming (He) normal method.
    Suitable for layers followed by ReLU activations. Assumes fan_in mode.
    """
    if W.ndim != 2:
        raise ValueError("Kaiming initialization expects a 2D weight matrix.")

    fan_in = W.shape[0]
    if fan_in == 0: # Avoid division by zero for empty layers
        W[:, :] = 0.0 # Or handle as appropriate
        return

    # Calculate the standard deviation for the normal distribution
    # Assumes nonlinearity is ReLU (gain=sqrt(2))
    stddev = np.sqrt(2.0 / fan_in)

    # Generate random values from Normal(0, stddev) and assign in-place
    W[:, :] = np.random.normal(loc=0.0, scale=stddev, size=W.shape)


We will implement stochastic gradient descent as an object. In the init function, this object stores the parameters to optimize (as a list). The step function updates the parameters (see slides); note that the gradient is stored in the nodes (`grad` attribute). Finally, the gradients must be reset to zero after each update (in the method `zero_grad`), because we do not want to accumulate the gradients of all previous steps.
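The reset, accumulate, update cycle can be illustrated on a toy problem without any of the classes from this notebook. In this sketch, the objective $\sum_i (w_i - t_i)^2$, the array `grad_buffer`, and the chosen learning rate are all assumptions made for illustration.

```python
import numpy as np

target = np.array([1.0, -2.0, 3.0])
w = np.zeros(3)                  # plays the role of a parameter's value
grad_buffer = np.zeros_like(w)   # plays the role of the parameter's grad attribute
lr = 0.1

for _ in range(100):
    grad_buffer.fill(0)                  # reset, as zero_grad would do
    grad_buffer += 2.0 * (w - target)    # accumulate, as the backward pass would do
    w -= lr * grad_buffer                # in-place update, as step would do

print(np.allclose(w, target, atol=1e-6))
```

Dropping the `fill(0)` line would make `grad_buffer` grow with the sum of all past gradients, which is exactly the accumulation the reset is meant to avoid.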

**Question 2:** Implement the SGD 

In [91]:
class SGD:
    """
    Implements plain stochastic gradient descent (without momentum).
    """
    def __init__(self, params, lr=0.1):
        """
        Initializes the SGD optimizer.

        Args:
            params (list): A list of Parameter objects to optimize.
            lr (float): Learning rate.
        """
        if not isinstance(params, list):
            raise TypeError("params must be a list of Parameter objects.")
        if lr <= 0.0:
            raise ValueError("Invalid learning rate: {}".format(lr))

        self.params = params
        self.lr = lr

    def step(self):
        """
        Performs a single optimization step (parameter update).
        """
        for p in self.params:
            # Check if the parameter requires gradient and if gradient exists
            if p.require_grad and p.grad is not None:
                # Ensure grad and value are numpy arrays for the operation
                if not isinstance(p.value, np.ndarray) or not isinstance(p.grad, np.ndarray):
                     print(f"Warning: Parameter {getattr(p, 'name', 'unnamed')} value or grad is not a numpy array. Skipping update.")
                     continue
                try:
                    # Perform the SGD update: param = param - learning_rate * gradient
                    # Use -= for potential in-place update if possible with numpy arrays
                    p.value -= self.lr * p.grad
                except (TypeError, ValueError) as e:
                     # Catch potential issues like shape mismatch or dtype problems
                     print(f"Warning: Error updating parameter {getattr(p, 'name', 'unnamed')}. "
                           f"Value shape: {p.value.shape}, dtype: {p.value.dtype}. "
                           f"Grad shape: {p.grad.shape}, dtype: {p.grad.dtype}. Error: {e}")
            # else:
                # Parameter might not require grad, or grad might be None (e.g., if not part of graph)
                # No update needed in these cases.
                # Optionally add a check/warning if require_grad=True but grad is None after backward.


    def zero_grad(self):
        """
        Resets the gradients of all parameters managed by the optimizer to zero.
        """
        for p in self.params:
            if p.grad is not None:
                # Check if grad is a numpy array before calling fill
                if hasattr(p.grad, 'fill'):
                    try:
                        # Use fill(0) for efficient in-place zeroing of numpy arrays
                        p.grad.fill(0)
                    except Exception as e:
                        print(f"Warning: Could not zero gradient for param {getattr(p, 'name', 'unnamed')} using fill(). Error: {e}. Trying assignment.")
                        # Fallback: Assign a new zero array
                        try:
                            p.grad = np.zeros_like(p.grad)
                        except Exception as e_assign:
                             print(f"Error: Could not assign zeros to gradient for param {getattr(p, 'name', 'unnamed')}. Error: {e_assign}")

                else:
                    # If grad is not a numpy array (shouldn't happen with Parameter), try assigning 0.0
                    try:
                         p.grad = 0.0 # Or appropriate zero value based on expected type
                    except Exception as e_assign_scalar:
                        print(f"Warning: Gradient for param {getattr(p, 'name', 'unnamed')} is not a numpy array and could not be zeroed. Type: {type(p.grad)}. Error: {e_assign_scalar}")
            # If p.grad is None, nothing to zero.

# Networks and training loop

We first create a simple linear classifier, similar to the first lab exercise.

In [118]:
class LinearNetwork(Module):
    """
    A simple linear layer module (affine transformation).
    Applies the transformation y = x @ W + b.
    """
    def __init__(self, dim_input, dim_output):
        """
        Initializes the linear layer.

        Args:
            dim_input (int): Dimensionality of the input features.
            dim_output (int): Dimensionality of the output features.
        """
        super().__init__() # Initialize base Module class
        self.dim_input = dim_input
        self.dim_output = dim_output

        # Build the parameters W and b
        # Create numpy arrays first (can be empty, init_parameters will fill them)
        # Using float for typical network operations
        W_data = np.empty((dim_input, dim_output), dtype=float)
        b_data = np.empty((1, dim_output), dtype=float) # Bias shape (1, D_out) for broadcasting

        # Wrap numpy arrays in Parameter nodes
        self.W = Parameter(W_data, name=f'Linear_{dim_input}x{dim_output}_W')
        self.b = Parameter(b_data, name=f'Linear_{dim_input}x{dim_output}_b')

        # Note: Actual value initialization happens in init_parameters

    def init_parameters(self):
        """
        Initializes the weight (W) and bias (b) parameters of the network.
        Uses Glorot initialization for weights and zero initialization for biases.
        """
        # Use Glorot initialization for the weight matrix W
        # It modifies the self.W.value array in-place
        glorot_init(self.W.value)

        # Use zero initialization for the bias vector b
        # It modifies the self.b.value array in-place
        zero_init(self.b.value)

    def forward(self, x):
        """
        Performs the forward pass: computes x @ W + b.

        Args:
            x (ComputationGraphNode): Input node, expected shape (N, dim_input).

        Returns:
            ComputationGraphNode: Output node, shape (N, dim_output).
        """
        # Ensure x is a ComputationGraphNode
        if not isinstance(x, ComputationGraphNode):
            # Depending on strictness, either raise error or wrap it
            # Let's wrap it for flexibility, assuming x is a numpy array here
            x = ComputationGraphNode(x)
            # Note: If x doesn't require grad, this might be slightly inefficient
            # but necessary for the graph structure.

        # Apply the affine_transform operation
        # Pass the Parameter nodes (self.W, self.b) and the input node (x)
        # The operation returns the resulting ComputationGraphNode
        output_node = affine_transform()(self.W, self.b, x)
        return output_node

    def __repr__(self):
        """Provides a string representation of the module."""
        return f"LinearNetwork(dim_input={self.dim_input}, dim_output={self.dim_output})"


In [119]:
np.random.seed(42)

In [120]:
# these lines should execute without error
lin1 = LinearNetwork(784, 10)
lin2 = LinearNetwork(10, 5)

lin1.init_parameters()
lin2.init_parameters()

input_image = train_data[0][0]
if input_image.ndim == 1:
    input_image = input_image.reshape(1, -1)

x = ComputationGraphNode(input_image, require_grad=True)

a = lin1.forward(Addition()(x, x))
b = TanH()(a)
c = lin2.forward(b)
c.backward()

print("Gradient of input x (first 10 elements):")
if x.grad is not None:
    print(x.grad.flatten()[:10])
else:
    print("x.grad is None (check require_grad and backward pass)")

Gradient of input x (first 10 elements):
[-0.02910252  0.26027206  0.0721708  -0.49305239 -0.07406743 -0.34937228
  0.10215195 -0.08538571  0.29589588 -0.07565649]


We will train several neural networks.
Therefore, we encapsulate the training loop in a function.

**Warning:** you have to call `optimizer.zero_grad()` before each backward pass to reset the gradients of the parameters!

In [123]:
def training_loop(network, optimizer, train_data, dev_data, n_epochs=10):
    X_train, y_train = train_data  # Training data: (inputs, labels)
    X_dev, y_dev = dev_data        # Dev data: (inputs, labels)
    
    for epoch in range(n_epochs):
        optimizer.zero_grad()
        train_predictions = network.forward(X_train)
        y_train_node = ComputationGraphNode(y_train)
        
        # Use the __call__ interface of nll to attach the operation to the node.
        loss_node = nll()(train_predictions, y_train_node)
        mean_loss = np.mean(loss_node.value)
        
        # Use a gradient with the same shape as loss_node.value.
        N = len(loss_node.value)
        loss_node.backward(np.ones_like(loss_node.value) / N)
        
        optimizer.step()
        
        dev_predictions = network.forward(X_dev)
        predicted_labels = np.argmax(dev_predictions.value, axis=1)
        accuracy = np.mean(predicted_labels == y_dev)
        
        print(f"mean loss -> {mean_loss} validation accuracy -> {accuracy:.4f}")


In [125]:
dim_input = 28*28
dim_output = 10

network = LinearNetwork(dim_input, dim_output)
network.init_parameters()
optimizer = SGD(network.parameters(), 0.01)

training_loop(network, optimizer, train_data, dev_data, n_epochs=5)

mean loss -> 2.4296296662456505 validation accuracy -> 0.0721
mean loss -> 2.4156190269698405 validation accuracy -> 0.0755
mean loss -> 2.401992674947218 validation accuracy -> 0.0811
mean loss -> 2.388717143607291 validation accuracy -> 0.0852
mean loss -> 2.3757632291864654 validation accuracy -> 0.0891


Once you have finished the linear network, you can move on to a deep network!

In [113]:
class DeepNetwork(Module):
    def __init__(self, dim_input, dim_output, hidden_dim, n_layers, tanh=False):
        """
        dim_input  : number of input features
        dim_output : number of output units/classes
        hidden_dim : dimension of each hidden layer
        n_layers   : number of hidden layers
        tanh       : if True, use tanh activation; else use ReLU
        """
        super().__init__()
        # We keep a list of layers in a ModuleList so the parameters can be collected:
        self.layers = ModuleList()
        self.use_tanh = tanh

        # 1) First hidden layer: input -> hidden_dim
        self.layers.append(LinearNetwork(dim_input, hidden_dim))

        # 2) Next hidden layers: hidden_dim -> hidden_dim
        for _ in range(n_layers - 1):
            self.layers.append(LinearNetwork(hidden_dim, hidden_dim))

        # 3) Final layer: hidden_dim -> dim_output
        self.layers.append(LinearNetwork(hidden_dim, dim_output))

        # Initialize all parameters right after construction
        self.init_parameters()

    def init_parameters(self):
        """
        Initialize parameters of each sub-layer
        """
        for layer in self.layers:
            layer.init_parameters()

    def forward(self, x):
        """
        Forward pass: 
          - For all hidden layers, apply linear => activation
          - For the final layer, apply linear only (no activation)
        """
        # Pass through the first N hidden layers with activation
        for layer in self.layers[:-1]:
            x = layer.forward(x)
            if self.use_tanh:
                x = TanH()(x)
            else:
                x = ReLU()(x)

        # Final layer: produce raw outputs (e.g. logits)
        x = self.layers[-1].forward(x)
        return x

In [126]:
dim_input = 28*28
dim_output = 10

network = DeepNetwork(dim_input, dim_output, 100, 2)
network.init_parameters()
optimizer = SGD(network.parameters(), 0.01)

training_loop(network, optimizer, train_data, dev_data, n_epochs=5)

mean loss -> 2.3261267912328973 validation accuracy -> 0.1751
mean loss -> 2.3189201776771773 validation accuracy -> 0.1804
mean loss -> 2.311857649368916 validation accuracy -> 0.1869
mean loss -> 2.3049334367056944 validation accuracy -> 0.1914
mean loss -> 2.2981409542554316 validation accuracy -> 0.1954


## Better Optimizer
Implement SGD with momentum. Note that you will need to store the accumulated gradient (the velocity) of each parameter.
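As a minimal standalone sketch of one common form of the momentum update, velocity = momentum * velocity - lr * grad followed by param = param + velocity, the example below applies it to a toy quadratic. The helper name `momentum_step`, the objective, and the hyperparameter values are illustrative assumptions.

```python
import numpy as np

def momentum_step(w, velocity, grad, lr=0.1, momentum=0.5):
    """One momentum update; modifies w and velocity in place."""
    velocity *= momentum
    velocity -= lr * grad
    w += velocity

target = np.array([1.0, -2.0, 3.0])
w = np.zeros(3)
velocity = np.zeros_like(w)   # one velocity array per parameter

for _ in range(100):
    grad = 2.0 * (w - target)   # gradient of sum((w - target) ** 2)
    momentum_step(w, velocity, grad)

print(np.allclose(w, target, atol=1e-6))
```

Keeping the velocity as a separate array that persists between steps is what distinguishes this from plain SGD, where nothing is carried over from one update to the next.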


In [127]:
class SGDWithMomentum:
    def __init__(self, params, lr=0.1, momentum=0.5):
        """
        params: a list of Parameter objects
        lr: learning rate (float)
        momentum: momentum factor (float)
        """
        self.params = params
        self.lr = lr
        self.momentum = momentum
        # We store a velocity array for each parameter,
        # initialized to zeros of the same shape.
        self.velocities = []
        for p in params:
            self.velocities.append(np.zeros_like(p.value))

    def step(self):
        """
        Apply one update step to each parameter.
        velocity = momentum * velocity - lr * grad
        param = param + velocity
        """
        for i, p in enumerate(self.params):
            if p.grad is not None:
                # Update velocity
                self.velocities[i] = self.momentum * self.velocities[i] - self.lr * p.grad
                # Update parameter
                p.value += self.velocities[i]

    def zero_grad(self):
        """
        Reset the accumulated gradients on each parameter to zero.
        """
        for p in self.params:
            if p.grad is not None:
                p.grad.fill(0.0)

## Bonus: Batch SGD
Propose a method to process the input in batches.
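One possible approach, sketched here under the assumption that the inputs and labels are numpy arrays with the same first dimension, is to shuffle the example indices once per epoch and iterate over consecutive slices. The generator name `iterate_minibatches` and the toy data are hypothetical.

```python
import numpy as np

def iterate_minibatches(X, y, batch_size, rng=None):
    """Yield (inputs, labels) pairs covering the data once, in random order."""
    rng = np.random.default_rng() if rng is None else rng
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]

# Toy data: row i of X_toy is [2 * i, 2 * i + 1], its label is i
X_toy = np.arange(20, dtype=float).reshape(10, 2)
y_toy = np.arange(10)

batches = list(iterate_minibatches(X_toy, y_toy, batch_size=4,
                                   rng=np.random.default_rng(0)))
print([len(yb) for _, yb in batches])
```

Inside a training loop, each yielded pair would go through the same steps as the full-batch version: reset the gradients, run the forward pass, compute the loss, call backward with a gradient scaled by the batch size, then take an optimizer step.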