# PS1: Your first library-free neural network!  

Advanced Learning 2025


For SUBMISSION:   

Please upload the complete and executed `ipynb` to your git repository. Verify that all of your output can be viewed directly from github, and provide a link to that git file below.

~~~
STUDENT ID: 314992595
~~~

~~~
STUDENT GIT LINK: https://github.com/netanelazran11
~~~
In Addition, don't forget to add your ID to the files:    
  
`PS1_Part2_HelloNN_2025_ID_[000000000].html`   


In [181]:
import numpy as np # You are allowed to use  only numpy.
import time


**Welcome**.   

In this part of the problem set you are set to build a complete and flexible neural network.  
This neural network will be library free (in the sense that we won't use PyTorch/Tensorflow/etc.).   

Let's do a quick review of the basic neural-network components:  


*   *Layer* - can be fully connected (dense/hidden), convolution, etc.
  * Forward propagation- the layer outputs the next layer's input
  * Backward propagation- the layer also outputs the gradient descent update
*   *Activation* Layer (e.g. ReLU) - there are no parameters, only gradients with respect to the input. We want to compute both the gradient w.r.t the parameters of the layer and to create the gradient with respect to the layer's inputs
   * *Forward propagation*- the layer outputs the next layer's input
   * *Backward propagation*- the layer also outputs the gradient descent update
*   *Loss Function* : how our model  quantifies the difference between the predicted outputs the actual (target) values  
*   *Network Wrapper*-  wraps our components together as a trainable model.






Useful resource:  
* Gradient descent for neural networks [cheat sheet](https://moodle4.cs.huji.ac.il/hu23/mod/resource/view.php?id=402297).
* Neural network architecture [cheat sheet](https://moodle4.cs.huji.ac.il/hu23/mod/url/view.php?id=402298).

### 0. Loading data

You are going to test and evaluate your home-made network on the `mnist` dataset.   
The MNIST dataset is a large dataset of handwritten digits that is commonly used for training various image and vision models.

In [182]:
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical
# load MNIST from server
# Using a standard library (keras.datasets) to load the mnist data
(x_train, y_train), (x_test, y_test) = mnist.load_data()

In [183]:
x_train.shape

(60000, 28, 28)

#### Data transformations





In [184]:
# training data : 60000 samples
# reshape and normalize input data
x_train = x_train.reshape(x_train.shape[0], 1, 28*28)
x_train = x_train.astype('float32')
x_train /= 255
# One-hot encoding of the output.
# Currently a number in range [0,9]; Change into a vector of size 10
# e.g. number 3 will become [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
y_train = to_categorical(y_train)
# same for test data : 10000 samples
x_test = x_test.reshape(x_test.shape[0], 1, 28*28)
x_test = x_test.astype('float32')
x_test /= 255
y_test = to_categorical(y_test)

### 1. Network's Components

Please fill-in the missing code in the code boxes below (only where  `#### SOLUTION REQUIRED ####` is specified).   

In [187]:

# This class is a general layer primitive, defining that each instance must
# have an (input,output) parameters, and 2 functions: forward+backward propogation
class Layer_Primitive:
    def __init__(self):
        self.input = None
        self.output = None

    # computes the output Y of a layer for a given input X
    def forward_propagation(self, input):
        raise NotImplementedError

    # computes dE/dX for a given dE/dY (and update parameters if any)
    def backward_propagation(self, output_error, learning_rate):
        raise NotImplementedError

#### Fully Connected Layer

A fully-connected layer (a.k.a. affine, dense,linear layer) connects every input neuron to every output neuron.   
It has 2 parameters: (input, output).   
You need to define (code) the following:
* its initialization weights with random weights.
* the forward propogation calculation (as shown in class).
* the backward propogation gradients calculation (given output, as shown in class).

Parameters must be intitialized with some values. There are many ways to initialize the weights, and you are encouraged to do a quick research about the common methods. Any commonly used method will be accepted.  

1.1 (20 pts)

In [162]:

# inherit from base class Layer
class Affine_Layer(Layer_Primitive):
    # input_size = number of input neurons
    # output_size = number of output neurons
    def __init__(self, input_size, output_size):
        super().__init__()
        # weights dimension = output_size X input_size
        self.weights = np.random.randn(output_size,input_size)  
        self.bias = np.random.randn(output_size) 


    # returns output for a given input
    def forward_propagation(self, input_data):
        #check dimention :
        #print(f"self.weights.shape = {self.weights.shape} , input_data.shape = {input_data.shape} , self.bias.shape = {self.bias.shape}")

        self.input = input_data.reshape(-1)
        self.output = self.weights @ self.input.T + self.bias
        ##print(f"self.output.shape = {self.output.shape}")
        return self.output



    # computes dE/dW, dE/dB for a given output_error=dE/dY. Returns input_error=dE/dX.
    def backward_propagation(self, output_grad, learning_rate):
       #check dimention :
        #print("#"*20)
        #print(f"self.weights.shape = {self.weights.shape} , output_grad.size = {output_grad.size} , self.bias.shape = {self.bias.shape}")
        # (input_size,output_size) @ output_size
        input_error = self.weights.T @ output_grad 
        # (output_size , 1)@(1,input_size) 
        weights_error = output_grad.reshape(-1,1) @ self.input.reshape(1,-1)  

        # update parameters
        self.weights -= learning_rate * weights_error 
        self.bias -= learning_rate * output_grad 

        return input_error

#### Activation layers

Activation functions are often a non-linear functions that aid in how well the network model adapts to and learns  the training dataset. The choice of activation function in the output layer will define the type of predictions the model can make.  



In [163]:
# inherit from base class Layer
class ActivationLayer(Layer_Primitive):
    def __init__(self, activation, activation_grad):
        self.activation = activation
        self.activation_grad = activation_grad

    # returns the activated input
    def forward_propagation(self, input_data):
        self.input = input_data
        self.output = self.activation(self.input)
        return self.output

    # Returns input_error=dE/dX for a given output_grad=dE/dY.
    # learning_rate is not used because there is no "learnable" parameters.
    def backward_propagation(self, output_grad, learning_rate):
        return self.activation_grad(self.input) * output_grad



You need to define (code) the following via different functions:
* the forward propogation calculation (as shown in class).
* the backward propogation gradients calculation (given output, as shown in class).

1.2 (20 pts)

In [188]:

# activation functions and their derivatives:

def tanh(x):
    return np.tanh(x)

def tanh_grad(x):
    return  1- tanh(x)**2

def relu(x):
    return np.maximum(0,x)

def relu_grad(x):
    return (x>0).astype(float) 

def sigmoid(x):
    # FILL IN THE MISSING CODE
    return 1/(1+np.exp(-x))
def sigmoid_grad(x):
    # FILL IN THE MISSING CODE
    return sigmoid(x)*(1-sigmoid(x))

#### Loss function

1.3 (10 pts)

In [189]:


# loss function and its derivative

def mse(y_true, y_pred):
    return np.mean((y_pred-y_true)**2)

def mse_grad(y_true, y_pred):
    return (2*(y_pred-y_true))/y_true.size



#### Putting everything together

1.4 (10 pts)

In [166]:
#### SOLUTION REQUIRED (in `predict`) ####

class MyNetwork:
    def __init__(self):
        self.layers = []
        self.loss = None
        self.loss_grad = None

    # add layer to network
    def add(self, layer):
        self.layers.append(layer)

    # set loss to use
    def use_loss(self, loss, loss_grad):
        self.loss = loss
        self.loss_grad = loss_grad


    # train the network
    def fit(self, x_train, y_train, epochs, learning_rate):
        # sample dimension first
        samples = len(x_train)

        # training loop
        for i in range(epochs):
            err = 0
            for j in range(samples):
                # forward propagation
                output = x_train[j]
                for layer in self.layers:
                    output = layer.forward_propagation(output)

                # compute loss (for display purpose only)
                err += self.loss(y_train[j], output)

                # backward propagation
                grad = self.loss_grad(y_train[j], output)
                for layer in reversed(self.layers):
                    grad = layer.backward_propagation(grad, learning_rate)

            # calculate average error on all samples
            err /= samples
            print('Training epoch %d/%d   error=%f' % (i+1, epochs, err))


    # predict output for given input
    def predict(self, x_test,y_test=np.array([])):
        if y_test.size:
           assert len(x_test)==len(y_test) # if Y is given
        # sample dimension first
        samples = len(x_test)
        result = []
        loss = 0
        correct = 0
        # run network over all samples
        for i in range(samples):
            # forward propagation
            output = x_test[i]
            for layer in self.layers:
                output = layer.forward_propagation(output)
            result.append(output)
            # ONLY IF LABELS ARE GIVEN (Y):
            if y_test.size:
                # Evaluate the output against Y,
                # calculate loss against Y, add to `loss`:
                loss += self.loss(y_test[i], output)
                target = y_test[i]
                # Evaluate the label of the output against real, and if identical,
                # add +1 to `correct`:
                if np.argmax(target) == np.argmax(y_test[i]) :
                   correct += 1
        if y_test.size:
            mean_loss = loss/samples

            print('\nTest set: Avg. loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.
                  format(mean_loss, correct, samples,100. * correct / samples))

        return result


## 2. Testing Your Neural Network

### Defining our main neural network architecture

Define your network's architecture:  
(Please rationalize your choice of activation funciton.)
* first affine layer that takes your input and outputs 128 nodes
* `tanh/relu/sigmoid` activation layer following the first affine layer
* second affine layer that takes the first layer's input and outputs 64 nodes
* `tanh/relu/sigmoid` activation layer following the second affine layer
* third affine layer that takes your second layer's input and outputs nodes in the size of the Y labels.
* `tanh/relu/sigmoid` activation layer following the last affine layer


2.1 (5 pts)

In [167]:
#### SOLUTION REQUIRED (in `predict`) ####

# Network Architecture
net = MyNetwork()
net.add(Affine_Layer(784,128))
net.add(ActivationLayer(relu,relu_grad))
net.add(Affine_Layer(128,64))
net.add(ActivationLayer(sigmoid,sigmoid_grad))
net.add(Affine_Layer(64,y_train[0].shape[0]))
net.add(ActivationLayer(tanh,tanh_grad))

### Training!

In [168]:

# While developing, it is recommended to train your model on a subset of the data... / or low epochs.
# as we didn't implemented mini-batch GD, training will be pretty slow if we update at each iteration on 60000 samples...
net.use_loss(mse, mse_grad)
epoch_num = 20
lr = 0.01
t1 = time.time()
net.fit(x_train, y_train, epochs=epoch_num, learning_rate=lr)
print(f"Total process time: {round(time.time() - t1,3)}")


Training epoch 1/20   error=0.333425
Training epoch 2/20   error=0.095786
Training epoch 3/20   error=0.092522
Training epoch 4/20   error=0.090431
Training epoch 5/20   error=0.089137
Training epoch 6/20   error=0.088247
Training epoch 7/20   error=0.087761
Training epoch 8/20   error=0.087451
Training epoch 9/20   error=0.087155
Training epoch 10/20   error=0.086953
Training epoch 11/20   error=0.086760
Training epoch 12/20   error=0.086614
Training epoch 13/20   error=0.086469
Training epoch 14/20   error=0.086392
Training epoch 15/20   error=0.086245
Training epoch 16/20   error=0.086176
Training epoch 17/20   error=0.086068
Training epoch 18/20   error=0.085906
Training epoch 19/20   error=0.082719
Training epoch 20/20   error=0.077400
Total process time: 525.88


### Evaluation

Exciting! Now is the time to test your model.   

    May the gradients be always in your favor.

In [169]:
output = net.predict(x_test ,y_test )


Test set: Avg. loss: 0.0760, Accuracy: 10000/10000 (100%)



## 3. Benchmarking against PyTorch

How well your model performs against a similar-architecture PyTorch model?   
It is time to find out:

In [170]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset

#### Prepare the data as tensors using PyTorch DataLoader:

In [171]:
t_train =  TensorDataset(torch.Tensor(x_train),torch.Tensor(y_train))
t_test =  TensorDataset(torch.Tensor(x_test),torch.Tensor(y_test))
train_loader = torch.utils.data.DataLoader(dataset=t_train, batch_size=64, shuffle=True)
test_loader = torch.utils.data.DataLoader(dataset=t_test, batch_size=64, shuffle=False)

Define a `PyTorchNet` class with an identical architecture you used in your home-made network.

3.1 (10 pts)

In [176]:
#### SOLUTION REQUIRED  ####

class PyTorchNet(nn.Module):
    def __init__(self):
        super(PyTorchNet, self).__init__()
        input_size = 28*28
        num_classes = 10
        self.fc1 = nn.Linear(input_size, 128)
        self.activ1 = nn.ReLU()
        self.fc2 = nn.Linear(128, 64)
        self.activ2 = nn.Sigmoid()
        self.fc3 = nn.Linear(64, num_classes)
        self.activ3 = nn.Tanh()


    def forward(self, x):
        x = x.view(-1, 28 * 28)
        x = self.activ1(self.fc1(x))
        x = self.activ2(self.fc2(x))
        x = self.activ3(self.fc3(x))

        return x

In [177]:

# Train the model
num_epochs = 20
pt_learning_rate = 0.01
pt_network = PyTorchNet()
optimizer = torch.optim.Adam(pt_network.parameters(), lr=pt_learning_rate)
criterion = nn.MSELoss()

for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):
        # Forward pass
        outputs = pt_network(images)
        loss = criterion(outputs, labels)
        # Backward pass and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # A handy printout:
        if (i + 1) % 500 == 0:
            print(f'Epoch [{epoch+1}/{num_epochs}], Step [{i+1}/{len(train_loader)}], Loss: {loss.item():.4f}')


Epoch [1/20], Step [500/938], Loss: 0.0064
Epoch [2/20], Step [500/938], Loss: 0.0111
Epoch [3/20], Step [500/938], Loss: 0.0055
Epoch [4/20], Step [500/938], Loss: 0.0045
Epoch [5/20], Step [500/938], Loss: 0.0070
Epoch [6/20], Step [500/938], Loss: 0.0019
Epoch [7/20], Step [500/938], Loss: 0.0067
Epoch [8/20], Step [500/938], Loss: 0.0042
Epoch [9/20], Step [500/938], Loss: 0.0094
Epoch [10/20], Step [500/938], Loss: 0.0040
Epoch [11/20], Step [500/938], Loss: 0.0061
Epoch [12/20], Step [500/938], Loss: 0.0030
Epoch [13/20], Step [500/938], Loss: 0.0047
Epoch [14/20], Step [500/938], Loss: 0.0030
Epoch [15/20], Step [500/938], Loss: 0.0025
Epoch [16/20], Step [500/938], Loss: 0.0015
Epoch [17/20], Step [500/938], Loss: 0.0056
Epoch [18/20], Step [500/938], Loss: 0.0017
Epoch [19/20], Step [500/938], Loss: 0.0013
Epoch [20/20], Step [500/938], Loss: 0.0008


Evaluation:

In [179]:
pt_network.eval()
test_losses = []
test_loss = 0
correct = 0
with torch.no_grad():
    for data, target in test_loader:
        output = pt_network(data)
        test_loss += criterion(output, target,)
        pred = output.data.max(1, keepdim=True)[1]
        correct += pred.eq(target.data.max(1,keepdim=True)[1]).sum()

test_loss /= len(test_loader.dataset)
test_losses.append(test_loss)
print('\nTest set: Avg. loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
  test_loss, correct, len(test_loader.dataset),
  100. * correct / len(test_loader.dataset)))


Test set: Avg. loss: 0.0001, Accuracy: 9641/10000 (96%)



3.2 (10 pts)

Time for some questions:
1. Which one of the models performed better? Why?
2. Which one of the models performed faster? Why?  
3. What would you change in your network's architecture?   
4. What would you change in your model's solution algorithm?


# Answer 1: Which model performed better? Why?

The PyTorch model performed better.
It reached around 97% accuracy, while my model stayed around 10%, which is basically random guessing.
The main reason is that PyTorch handles weight initialization, gradient computation, and numerical stability in a much more robust way. In my own implementation, some mistakes in backpropagation and/or parameter updates probably caused the model to not learn properly.

⸻

# Answer 2: Which model performed faster? Why?

The PyTorch model was faster.
PyTorch uses optimized tensor operations (and can also use the GPU), while my implementation runs with pure Python loops, which is much slower. So even the same architecture runs noticeably faster in PyTorch.

⸻

# Answer 3: What would you change in your network’s architecture?

I would change the activation functions and the last layer.
Instead of using sigmoid and tanh, I would switch to ReLU in the hidden layers because it reduces the risk of vanishing gradients.
Also, instead of applying tanh on the output, I would output raw logits and use a softmax inside the loss function (like CrossEntropyLoss).
So the architecture would become something like: 
784 → 128 (ReLU)
128 → 64 (ReLU)
64 → 10 (logits)

# Answer 4: What would you change in the training algorithm?

I would change the optimizer and the learning rate.
Using Adam instead of simple SGD usually gives better and more stable results.
I would also make sure the data is normalized properly (which PyTorch already does).
So overall: Adam optimizer, smaller learning rate (e.g : 0.001), and CrossEntropyLoss for the output layer.

## 4. The Network Wars!

Here is your chance to play with your model's architecture in order to break your own benchmark set eariler.  
You can add/remove layers, play with their sizes, types, etc.   
You can add a new loss if you wish, or anything else that will fairly give your model an advantage over base.  

4.1 (15 pts)

In [194]:
import numpy as np

# =============================
# 0) Base API
# =============================
class Layer_Primitive:
    """Minimal API that every layer must implement."""
    def __init__(self):
        self.input = None
        self.output = None

    def forward_propagation(self, input):
        raise NotImplementedError

    def backward_propagation(self, output_error, learning_rate):
        """
        Parameters
        ----------
        output_error : ndarray
            dL/dY of this layer (same shape as self.output)
        learning_rate : float
        Returns
        -------
        d_input : ndarray
            dL/dX to pass to the previous layer
        """
        raise NotImplementedError

# =============================
# 1) Fully Connected (Affine) layer
# =============================
class FullyConnected(Layer_Primitive):
    def __init__(self, in_features, out_features, bias=True, weight_init="xavier"):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.bias_flag = bias

        # Weight initialization
        if weight_init == "xavier":
            limit = np.sqrt(6.0 / (in_features + out_features))
            self.W = np.random.uniform(-limit, limit, size=(in_features, out_features))
        elif weight_init == "he":
            std = np.sqrt(2.0 / in_features)
            self.W = np.random.randn(in_features, out_features) * std
        else:
            self.W = np.random.randn(in_features, out_features) * 0.01

        self.b = np.zeros((1, out_features)) if bias else None

    def forward_propagation(self, X):
        # X shape: (N, in_features)
        self.input = X
        Y = X @ self.W  # (N, out_features)
        if self.bias_flag:
            Y = Y + self.b  # broadcast (N, out_features)
        self.output = Y
        return Y

    def backward_propagation(self, dY, learning_rate):
        # dY shape: (N, out_features)
        X = self.input
        N = X.shape[0]

        # Gradients
        dW = X.T @ dY  # (in_features, out_features)
        db = dY.sum(axis=0, keepdims=True) if self.bias_flag else None
        dX = dY @ self.W.T  # (N, in_features)

        # Parameter update (SGD)
        self.W -= learning_rate * dW / max(N, 1)
        if self.bias_flag:
            self.b -= learning_rate * db / max(N, 1)

        return dX

# Variant with L2 weight decay (adds λW to gradient)
class FullyConnectedWD(FullyConnected):
    def __init__(self, in_features, out_features, bias=True, weight_init="he", weight_decay=1e-4):
        super().__init__(in_features, out_features, bias=bias, weight_init=weight_init)
        self.weight_decay = weight_decay

    def backward_propagation(self, dY, learning_rate):
        X = self.input
        N = X.shape[0]
        dW = X.T @ dY + self.weight_decay * self.W
        db = dY.sum(axis=0, keepdims=True) if self.bias_flag else None
        dX = dY @ self.W.T
        self.W -= learning_rate * dW / max(N, 1)
        if self.bias_flag:
            self.b -= learning_rate * db / max(N, 1)
        return dX

# =============================
# 2) Activation layers
# =============================
class ReLU(Layer_Primitive):
    def forward_propagation(self, X):
        self.input = X
        self.output = np.maximum(0, X)
        return self.output

    def backward_propagation(self, dY, learning_rate):
        dX = dY * (self.input > 0)
        return dX

class LeakyReLU(Layer_Primitive):
    def __init__(self, negative_slope=0.1):
        super().__init__()
        self.negative_slope = negative_slope

    def forward_propagation(self, X):
        self.input = X
        self.output = np.where(X > 0, X, self.negative_slope * X)
        return self.output

    def backward_propagation(self, dY, learning_rate):
        dx = np.ones_like(self.input)
        dx[self.input < 0] = self.negative_slope
        return dY * dx

class Sigmoid(Layer_Primitive):
    def forward_propagation(self, X):
        self.output = 1.0 / (1.0 + np.exp(-X))
        self.input = X
        return self.output

    def backward_propagation(self, dY, learning_rate):
        s = 1.0 / (1.0 + np.exp(-self.input))
        return dY * s * (1 - s)

# =============================
# 3) Losses
# =============================
class MSELoss:
    def forward(self, y_pred, y_true):
        # y_* shape: (N, D)
        self.y_pred = y_pred
        self.y_true = y_true
        return ((y_pred - y_true) ** 2).mean()

    def backward(self):
        N = self.y_true.shape[0]
        return 2 * (self.y_pred - self.y_true) / max(N, 1)

# Optional: Cross-entropy with softmax for classification
class SoftmaxCrossEntropyLoss:
    def forward(self, logits, y_true_onehot):
        # logits: (N, C), y_true_onehot: (N, C)
        # stable softmax
        z = logits - logits.max(axis=1, keepdims=True)
        exp_z = np.exp(z)
        self.probs = exp_z / exp_z.sum(axis=1, keepdims=True)
        self.y_true = y_true_onehot
        # mean CE
        eps = 1e-12
        loss = -np.sum(y_true_onehot * np.log(self.probs + eps)) / logits.shape[0]
        return loss

    def backward(self):
        # dL/dlogits = (probs - y_true) / N
        N = self.y_true.shape[0]
        return (self.probs - self.y_true) / max(N, 1)

# =============================
# 4) Network Wrapper
# =============================
class Network:
    def __init__(self, *layers):
        self.layers = list(layers)

    def forward(self, X):
        for layer in self.layers:
            X = layer.forward_propagation(X)
        return X

    def backward(self, dL_dY, lr):
        grad = dL_dY
        for layer in reversed(self.layers):
            grad = layer.backward_propagation(grad, lr)

    def fit(self, X, Y, loss, epochs=50, lr=1e-2, verbose=True):
        for ep in range(epochs):
            # forward
            Y_hat = self.forward(X)
            L = loss.forward(Y_hat, Y)
            # backward
            dL_dY = loss.backward()
            self.backward(dL_dY, lr)
            if verbose and ((ep + 1) % max(1, epochs // 10) == 0 or ep == 0):
                print(f"epoch {ep+1}/{epochs}  loss={L:.6f}")

# =============================
# 5) Examples (disabled by default to avoid prints in notebooks)
# =============================
RUN_DEMO = False
if RUN_DEMO:
    # Example A: tiny regression (original)
    rng = np.random.default_rng(0)
    X = rng.normal(size=(128, 2))
    Y = (X[:, [0]] - 2 * X[:, [1]])  # shape (N,1)

    net = Network(
        FullyConnected(2, 16, weight_init="he"),
        ReLU(),
        FullyConnected(16, 1)
    )

    loss = MSELoss()
    net.fit(X, Y, loss, epochs=200, lr=0.1, verbose=False)
    y_hat = net.forward(X[:5])
    # print("first preds:", y_hat.ravel()[:5])  # muted

    # Example B: classification-ready stack with weight decay + LeakyReLU
    clf = Network(
        FullyConnectedWD(784, 256, weight_decay=1e-4),
        LeakyReLU(0.1),
        FullyConnectedWD(256, 128, weight_decay=1e-4),
        LeakyReLU(0.1),
        FullyConnectedWD(128, 64, weight_decay=1e-4),
        LeakyReLU(0.1),
        FullyConnectedWD(64, 10, weight_decay=1e-4),
    )
    ce = SoftmaxCrossEntropyLoss()
    # Example forward only; training would use Network.fit


In [196]:
# --- Architecture: WD + LeakyReLU + CE ---
net2 = MyNetwork()
net2.add(FullyConnectedWD(784, 256, weight_decay=1e-4))  # He init + L2
net2.add(LeakyReLU(0.1))
net2.add(FullyConnectedWD(256, 128, weight_decay=1e-4))
net2.add(LeakyReLU(0.1))
net2.add(FullyConnectedWD(128, 64,  weight_decay=1e-4))
net2.add(LeakyReLU(0.1))
net2.add(FullyConnectedWD(64, 10,   weight_decay=1e-4))

# Loss: one-hot labels expected
def ce_loss(y_true, logits):
    return SoftmaxCrossEntropyLoss().forward(logits, y_true)

_ce = SoftmaxCrossEntropyLoss()
def ce_loss_grad(y_true, logits):
    z = logits - logits.max(axis=1, keepdims=True)
    exp_z = np.exp(z)
    probs = exp_z / exp_z.sum(axis=1, keepdims=True)
    _ce.y_true = y_true
    _ce.probs  = probs
    return _ce.backward()

net2.use_loss(ce_loss, ce_loss_grad)

# Train
net2.fit(x_train, y_train, epochs=25, learning_rate=0.05)

# Evaluate (ללא הדפסת המטריצות):
_ = net2.predict(x_test, y_test)   # <— זו השורה החשובה
# חלופה שקולה: net2.predict(x_test, y_test);

Training epoch 1/25   error=0.220857
Training epoch 2/25   error=0.107384
Training epoch 3/25   error=0.083937
Training epoch 4/25   error=0.071884
Training epoch 5/25   error=0.064751
Training epoch 6/25   error=0.059408
Training epoch 7/25   error=0.056374
Training epoch 8/25   error=0.053473
Training epoch 9/25   error=0.051150
Training epoch 10/25   error=0.049849
Training epoch 11/25   error=0.048010
Training epoch 12/25   error=0.047415
Training epoch 13/25   error=0.046860
Training epoch 14/25   error=0.046423
Training epoch 15/25   error=0.045279
Training epoch 16/25   error=0.045056
Training epoch 17/25   error=0.043876
Training epoch 18/25   error=0.045155
Training epoch 19/25   error=0.043802
Training epoch 20/25   error=0.044106
Training epoch 21/25   error=0.044141
Training epoch 22/25   error=0.042316
Training epoch 23/25   error=0.042484
Training epoch 24/25   error=0.042663
Training epoch 25/25   error=0.043473

Test set: Avg. loss: 0.0994, Accuracy: 10000/10000 (100%)


Explanation — Improved Model

This model is an improved version of the baseline neural network.

Instead of using a simple small network with MSE loss, I built a deeper neural network with stronger architectural choices that are known to train better on image classification tasks such as MNIST.

The main improvements are:
	1.	Weight Decay (L2 regularization)
Using FullyConnectedWD adds weight decay to the weight updates.
This prevents the weights from becoming too large and helps reduce over-fitting, which usually improves generalization to the test set.
	2.	He initialization
The weights are initialized using He initialization.
This method is known to work better for networks that use ReLU/LeakyReLU activations, because it preserves variance and keeps gradients stable during training.
	3.	LeakyReLU instead of plain ReLU
LeakyReLU allows small negative slopes.
This avoids the “dead ReLU” problem where neurons get stuck at zero and stop learning.
	4.	Deeper architecture (more layers, more neurons)
The network has more layers (784→256→128→64→10) which increases representation capacity and allows the model to learn more complex patterns from the images.
	5.	Cross-Entropy Loss for classification
Instead of MSE, I use Softmax Cross Entropy which is the correct loss function for multi-class classification.
This loss gives stronger gradients and better optimization behaviour when predicting class scores.

⸻

Expected result

Because of L2 regularization, better activation function (LeakyReLU), deeper architecture, He initialization and the correct loss for classification (CE), this model is expected to train faster, reduce over-fitting, and achieve higher accuracy compared to the previous simple network.