## Exercise 1: Manual Calculation of MLP Steps

Consider a simple MLP with 2 input features, 1 hidden layer containing 2 neurons, and 1 output neuron. Use the hyperbolic tangent (tanh) function as the activation for both the hidden layer and the output layer. The loss function is mean squared error (MSE): $ L = \frac{1}{N} (y - \hat{y})^2 $, where $ \hat{y} $ is the network's output.

For this exercise, use the following specific values:

- Input and output vectors:

    $ \mathbf{x} = [0.5, -0.2] $

    $ y = 1.0 $

- Hidden layer weights:

    $ \mathbf{W}^{(1)} = \begin{bmatrix} 0.3 & -0.1 \\ 0.2 & 0.4 \end{bmatrix} $

- Hidden layer biases:

    $ \mathbf{b}^{(1)} = [0.1, -0.2] $

- Output layer weights:

    $ \mathbf{W}^{(2)} = [0.5, -0.3] $

- Output layer bias:

    $ b^{(2)} = 0.2 $

- Learning rate: $ \eta = 0.1 $

- Activation function: $ \tanh $



The values are defined in code as follows:

```py

import numpy as np

x  = np.array([0.5, -0.2])                 # input
y  = 1.0                                   # target


W1 = np.array([[0.3, -0.1],                # hidden layer weights (2x2)
               [0.2,  0.4]])


b1 = np.array([0.1, -0.2])                 # hidden biases

W2 = np.array([0.5, -0.3])                 # output layer weights
b2 = 0.2                                   # output bias

eta = 0.1                                  # learning rate, used for the update step
tanh = np.tanh
tanhp = lambda z: 1.0 - np.tanh(z)**2      # derivative of tanh

```



Perform the following steps explicitly, showing all mathematical derivations and calculations with the provided values:

1. **Forward Pass**:

    - Compute the hidden layer pre-activations: $ \mathbf{z}^{(1)} = \mathbf{W}^{(1)} \mathbf{x} + \mathbf{b}^{(1)} $.
    - Apply tanh to get hidden activations: $ \mathbf{a}^{(1)} = \tanh(\mathbf{z}^{(1)}) $.
    - Compute the output pre-activation: $ z^{(2)} = \mathbf{W}^{(2)} \mathbf{a}^{(1)} + b^{(2)} $.
    - Compute the final output: $ \hat{y} = \tanh(z^{(2)}) $.

```py

# ----- 1) Forward pass -----
z1 = W1 @ x + b1            # hidden-layer pre-activations
a1 = tanh(z1)               # hidden activations (2,)
z2 = W2 @ a1 + b2           # pre-activation output 
y_hat = tanh(z2)            # final output


```


2. **Loss Calculation**:

    - Compute the MSE loss:

        $ L = \frac{1}{N} (y - \hat{y})^2 $.

```py

# ----- 2) Loss (MSE, N=1) -----
L = (y - y_hat)**2

```

3. **Backward Pass (Backpropagation)**: Compute the gradients of the loss with respect to all weights and biases. Start with $ \frac{\partial L}{\partial \hat{y}} $, then compute:

    - $ \frac{\partial L}{\partial z^{(2)}} $ (using the tanh derivative: $ \frac{d}{dz} \tanh(z) = 1 - \tanh^2(z) $).
    - Gradients for output layer: $ \frac{\partial L}{\partial \mathbf{W}^{(2)}} $, $ \frac{\partial L}{\partial b^{(2)}} $.
    - Propagate to hidden layer: $ \frac{\partial L}{\partial \mathbf{a}^{(1)}} $, $ \frac{\partial L}{\partial \mathbf{z}^{(1)}} $.
    - Gradients for hidden layer: $ \frac{\partial L}{\partial \mathbf{W}^{(1)}} $, $ \frac{\partial L}{\partial \mathbf{b}^{(1)}} $.
    
    Show all intermediate steps and calculations.

```py

# ----- 3) Backward pass -----
# dL/dy_hat
dL_dyhat = 2.0 * (y_hat - y)              # since N=1
# dL/dz2
dL_dz2 = dL_dyhat * tanhp(z2)

# Output layer grads
# dL/dW2 = dL/dz2 * a1
dL_dW2 = dL_dz2 * a1
# dL/db2 = dL/dz2
dL_db2 = dL_dz2

# Backprop to hidden
# dL/da1 = dL/dz2 * W2
dL_da1 = dL_dz2 * W2
# dL/dz1 = dL/da1 ⊙ tanh'(z1)
dL_dz1 = dL_da1 * tanhp(z1)

# Hidden layer grads
# dL/dW1 = (dL/dz1)[:, None] @ x[None, :]
dL_dW1 = np.outer(dL_dz1, x)
# dL/db1 = dL/dz1
dL_db1 = dL_dz1

```
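One way to sanity-check the analytic gradients from the backward pass above is a central finite-difference comparison. The sketch below is an addition, not part of the original solution; the helper names `loss_fn` and `numerical_grad` and the step size `eps = 1e-6` are assumptions chosen for illustration.

```python
import numpy as np

# Values from the exercise statement
x = np.array([0.5, -0.2])
y = 1.0
W1 = np.array([[0.3, -0.1], [0.2, 0.4]])
b1 = np.array([0.1, -0.2])
W2 = np.array([0.5, -0.3])
b2 = 0.2


def loss_fn(W1, b1, W2, b2):
    """Forward pass followed by the squared error (MSE with N = 1)."""
    a1 = np.tanh(W1 @ x + b1)
    y_hat = np.tanh(W2 @ a1 + b2)
    return (y - y_hat) ** 2


def numerical_grad(f, param, eps=1e-6):
    """Central finite differences of f with respect to each entry of param."""
    grad = np.zeros_like(param, dtype=float)
    for idx in np.ndindex(param.shape):
        shifted_up = param.copy()
        shifted_down = param.copy()
        shifted_up[idx] += eps
        shifted_down[idx] -= eps
        grad[idx] = (f(shifted_up) - f(shifted_down)) / (2.0 * eps)
    return grad


# Analytic gradients, using the same formulas as the backward pass
z1 = W1 @ x + b1
a1 = np.tanh(z1)
z2 = W2 @ a1 + b2
y_hat = np.tanh(z2)
dL_dz2 = 2.0 * (y_hat - y) * (1.0 - y_hat ** 2)
dL_dW2 = dL_dz2 * a1
dL_dz1 = dL_dz2 * W2 * (1.0 - a1 ** 2)
dL_dW1 = np.outer(dL_dz1, x)

# Numerical gradients for the two weight arrays
num_dW2 = numerical_grad(lambda W: loss_fn(W1, b1, W, b2), W2)
num_dW1 = numerical_grad(lambda W: loss_fn(W, b1, W2, b2), W1)

print("max |analytic - numeric| for W2:", np.max(np.abs(dL_dW2 - num_dW2)))
print("max |analytic - numeric| for W1:", np.max(np.abs(dL_dW1 - num_dW1)))
```

If the two printed differences are not close to zero, the usual suspects are a sign error in $ \frac{\partial L}{\partial \hat{y}} $ or a missing tanh derivative factor.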


4. **Parameter Update**: Using the learning rate $ \eta = 0.1 $, update all weights and biases via gradient descent:

    - $ \mathbf{W}^{(2)} \leftarrow \mathbf{W}^{(2)} - \eta \frac{\partial L}{\partial \mathbf{W}^{(2)}} $
    - $ b^{(2)} \leftarrow b^{(2)} - \eta \frac{\partial L}{\partial b^{(2)}} $
    - $ \mathbf{W}^{(1)} \leftarrow \mathbf{W}^{(1)} - \eta \frac{\partial L}{\partial \mathbf{W}^{(1)}} $
    - $ \mathbf{b}^{(1)} \leftarrow \mathbf{b}^{(1)} - \eta \frac{\partial L}{\partial \mathbf{b}^{(1)}} $

    Provide the numerical values for all updated parameters.


```py
# ----- 4) Parameter update (gradient descent, eta = 0.1) -----

W2_new = W2 - eta * dL_dW2
b2_new = b2 - eta * dL_db2
W1_new = W1 - eta * dL_dW1
b1_new = b1 - eta * dL_db1

```
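A quick way to confirm that the update moves in the right direction is to recompute the loss with the updated parameters and compare it with the loss before the step. The sketch below is an addition under the exercise's values; the function names `forward_loss` and `sgd_step` are hypothetical helpers, not part of the original solution.

```python
import numpy as np

x = np.array([0.5, -0.2])
y = 1.0
eta = 0.1


def forward_loss(W1, b1, W2, b2):
    """Squared error of the network output for the single training example."""
    a1 = np.tanh(W1 @ x + b1)
    y_hat = np.tanh(W2 @ a1 + b2)
    return (y - y_hat) ** 2


def sgd_step(W1, b1, W2, b2):
    """One gradient-descent update; returns new parameters without mutating the inputs."""
    a1 = np.tanh(W1 @ x + b1)
    y_hat = np.tanh(W2 @ a1 + b2)
    dz2 = 2.0 * (y_hat - y) * (1.0 - y_hat ** 2)   # dL/dz2
    dz1 = dz2 * W2 * (1.0 - a1 ** 2)               # dL/dz1
    return (
        W1 - eta * np.outer(dz1, x),
        b1 - eta * dz1,
        W2 - eta * dz2 * a1,
        b2 - eta * dz2,
    )


params = (
    np.array([[0.3, -0.1], [0.2, 0.4]]),
    np.array([0.1, -0.2]),
    np.array([0.5, -0.3]),
    0.2,
)
loss_before = forward_loss(*params)
params_new = sgd_step(*params)
loss_after = forward_loss(*params_new)

print("loss before update:", loss_before)
print("loss after update: ", loss_after)
```

For a sufficiently small learning rate the second value should be lower than the first; if it is not, the gradients or the sign of the update are worth rechecking.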

**Submission Requirements**: Show all mathematical steps explicitly, including intermediate calculations (e.g., matrix multiplications, tanh applications, gradient derivations). Carry full-precision values through the calculation and report results to at least 4 decimal places.


In [1]:
import numpy as np

np.set_printoptions(precision=8, suppress=False)

# ----- Given values -----
x  = np.array([0.5, -0.2])                # input
y  = 1.0                                   # target

W1 = np.array([[0.3, -0.1],                # hidden layer weights (2x2)
               [0.2,  0.4]])
b1 = np.array([0.1, -0.2])                 # hidden biases (2,)

W2 = np.array([0.5, -0.3])                 # output layer weights (2,)
b2 = 0.2                                   # output bias (scalar)

eta = 0.1                                   # learning rate for the update step
tanh = np.tanh
tanhp = lambda z: 1.0 - np.tanh(z)**2       # derivative of tanh

# ----- 1) Forward pass -----
z1 = W1 @ x + b1            # pre-activations hidden (2,)
a1 = tanh(z1)               # activations hidden (2,)
z2 = W2 @ a1 + b2           # pre-activation output (scalar)
y_hat = tanh(z2)            # final output (scalar)

print("z^(1) =", z1)
print("a^(1) =", a1)
print("z^(2) =", z2)
print("y_hat =", y_hat)


# ----- 2) Loss (MSE, N=1) -----
L = (y - y_hat)**2
print("Loss L =", L)


# ----- 3) Backward pass -----
# dL/dy_hat
dL_dyhat = 2.0 * (y_hat - y)              # since N=1
# dL/dz2
dL_dz2 = dL_dyhat * tanhp(z2)

# Output layer grads
# dL/dW2 = dL/dz2 * a1
dL_dW2 = dL_dz2 * a1
# dL/db2 = dL/dz2
dL_db2 = dL_dz2

# Backprop to hidden
# dL/da1 = dL/dz2 * W2
dL_da1 = dL_dz2 * W2
# dL/dz1 = dL/da1 ⊙ tanh'(z1)
dL_dz1 = dL_da1 * tanhp(z1)

# Hidden layer grads
# dL/dW1 = (dL/dz1)[:, None] @ x[None, :]
dL_dW1 = np.outer(dL_dz1, x)
# dL/db1 = dL/dz1
dL_db1 = dL_dz1

print("\ndL/dy_hat =", dL_dyhat)
print("dL/dz^(2) =", dL_dz2)
print("dL/dW^(2) =", dL_dW2)
print("dL/db^(2) =", dL_db2)
print("dL/da^(1) =", dL_da1)
print("dL/dz^(1) =", dL_dz1)
print("dL/dW^(1) =\n", dL_dW1)
print("dL/db^(1) =", dL_db1)


# ----- 4) Parameter update (gradient descent, eta = 0.1) -----
W2_new = W2 - eta * dL_dW2
b2_new = b2 - eta * dL_db2
W1_new = W1 - eta * dL_dW1
b1_new = b1 - eta * dL_db1

print("\nUpdated parameters:")
print("W^(2)_new =", W2_new)
print("b^(2)_new =", b2_new)
print("W^(1)_new =\n", W1_new)
print("b^(1)_new =", b1_new)


z^(1) = [ 0.27 -0.18]
a^(1) = [ 0.26362484 -0.17808087]
z^(2) = 0.38523667817130075
y_hat = 0.36724656264510797
Loss L = 0.4003769124844312

dL/dy_hat = -1.265506874709784
dL/dz^(2) = -1.0948279147135995
dL/dW^(2) = [-0.28862383  0.19496791]
dL/db^(2) = -1.0948279147135995
dL/da^(1) = [-0.54741396  0.32844837]
dL/dz^(1) = [-0.50936975  0.31803236]
dL/dW^(1) =
 [[-0.25468488  0.10187395]
 [ 0.15901618 -0.06360647]]
dL/db^(1) = [-0.50936975  0.31803236]

Updated parameters:
W^(2)_new = [ 0.52886238 -0.31949679]
b^(2)_new = 0.30948279147136
W^(1)_new =
 [[ 0.32546849 -0.1101874 ]
 [ 0.18409838  0.40636065]]
b^(1)_new = [ 0.15093698 -0.23180324]


***

## Exercise 2: Binary Classification with Synthetic Data and Scratch MLP

Using the `make_classification` function from scikit-learn ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html)), generate a synthetic dataset with the following specifications:

- Number of samples: 1000
- Number of classes: 2
- Number of clusters per class: Use the `n_clusters_per_class` parameter creatively to achieve 1 cluster for one class and 2 for the other (hint: you may need to generate subsets separately and combine them, as the function applies the same number of clusters to all classes by default).
- Other parameters: Set `n_features=2` for easy visualization, `n_informative=2`, `n_redundant=0`, `random_state=42` for reproducibility, and adjust `class_sep` or `flip_y` as needed for a challenging but separable dataset.

Implement an MLP from scratch (without using libraries like TensorFlow or PyTorch for the model itself; you may use NumPy for array operations) to classify this data. You have full freedom to choose the architecture, including:

- Number of hidden layers (at least 1)
- Number of neurons per layer
- Activation functions (e.g., sigmoid, ReLU, tanh)
- Loss function (e.g., binary cross-entropy)
- Optimizer (e.g., gradient descent, with a chosen learning rate)
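Among the options listed above, a sigmoid output paired with binary cross-entropy is a common choice for labels in {0, 1}. The sketch below is an illustrative addition (the solution further down uses a tanh output with squared error instead); the function names are hypothetical, and the loss is computed from the logit `z` to avoid `log(0)`.

```python
import numpy as np


def sigmoid(z):
    """Logistic function, written via tanh so large |z| does not overflow."""
    return 0.5 * (1.0 + np.tanh(0.5 * np.asarray(z, dtype=float)))


def bce_from_logit(z, y):
    """Binary cross-entropy for labels y in {0, 1}, computed from the logit z."""
    z = np.asarray(z, dtype=float)
    return np.maximum(z, 0.0) - z * y + np.log1p(np.exp(-np.abs(z)))


def bce_grad_logit(z, y):
    """Derivative of the loss with respect to the logit: sigmoid(z) - y."""
    return sigmoid(z) - y


z = np.array([-2.0, 0.0, 3.0])   # example logits
y = np.array([0.0, 1.0, 1.0])    # example labels

print("loss per sample:", bce_from_logit(z, y))
print("dL/dz per sample:", bce_grad_logit(z, y))
```

With this pairing the output-layer delta reduces to `sigmoid(z) - y`, so no explicit activation derivative is needed at the output.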

Steps to follow:

1. Generate and split the data into training (80%) and testing (20%) sets.
2. Implement the forward pass, loss computation, backward pass, and parameter updates in code.
3. Train the model for a reasonable number of epochs (e.g., 100-500), tracking training loss.
4. Evaluate on the test set: Report accuracy, and optionally plot decision boundaries or confusion matrix.
5. Submit your code and results, including any visualizations.
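Step 1 asks for an 80/20 split, while the training cell further down fits and scores on the full dataset. A minimal sketch of a NumPy-only split is given here as an addition; `train_test_split_np` is a hypothetical helper, and the random stand-in arrays only mimic the shape of the exercise data.

```python
import numpy as np


def train_test_split_np(X, y, test_fraction=0.2, seed=42):
    """Shuffle the row indices once and cut them into train and test parts."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(y))
    n_test = int(round(test_fraction * len(y)))
    test_idx, train_idx = perm[:n_test], perm[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]


# Stand-in data with the same shape as the exercise (1000 samples, 2 features)
demo_rng = np.random.default_rng(0)
X_demo = demo_rng.normal(size=(1000, 2))
y_demo = demo_rng.integers(0, 2, size=1000)

X_train, X_test, y_train, y_test = train_test_split_np(X_demo, y_demo)
print(X_train.shape, X_test.shape)   # (800, 2) (200, 2)
```

Training would then loop over `X_train` only, and the reported accuracy would come from `X_test`.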



In [21]:
import numpy as np
from sklearn.datasets import make_classification

# Reproducibility
rng = np.random.RandomState(42)

def biased_weights_for(target_class, K=2, high=0.8):
    """
    Return a weight vector of length K that heavily favors target_class.
    Helps ensure enough samples of the desired class per call.
    """
    low = (1.0 - high) / (K - 1)
    w = np.full(K, low, dtype=float)
    w[target_class] = high
    return w

def sample_class_subset(
    n_needed: int,
    target_class: int,
    n_clusters_per_class: int,
    seed: int,
    *,
    n_features: int = 2,
    n_informative: int = 2,
    n_redundant: int = 0,
    class_sep: float = 1.5,
    flip_y: float = 0.0,
    max_tries: int = 20,
    K: int = 2
):
    """
    Generate samples with make_classification and keep only rows of 'target_class'.
    We over-generate with biased 'weights' so we can downsample exactly n_needed.
    """
    tries = 0
    local_seed = seed
    # Over-generate to boost the chance of hitting n_needed for target_class
    n_generate = max(4 * n_needed, 2000)

    while tries < max_tries:
        X_tmp, y_tmp = make_classification(
            n_samples=n_generate,
            n_features=n_features,
            n_informative=n_informative,
            n_redundant=n_redundant,
            n_repeated=0,
            n_classes=K,
            n_clusters_per_class=n_clusters_per_class,
            class_sep=class_sep,
            flip_y=flip_y,
            weights=biased_weights_for(target_class, K=K, high=0.8),
            random_state=local_seed,
        )

        idx = np.flatnonzero(y_tmp == target_class)
        if idx.size >= n_needed:
            chosen = rng.choice(idx, size=n_needed, replace=False)
            return X_tmp[chosen], np.full(n_needed, target_class, dtype=int)

        # Try again with a different seed
        tries += 1
        local_seed += 1

    raise RuntimeError(
        f"Could not obtain {n_needed} samples for class={target_class} "
        f"with n_clusters_per_class={n_clusters_per_class} after {max_tries} tries."
    )

# ---------- Build the asymmetric dataset ----------
N = 1000
K = 2
n_per_class = [N // K] * K
n_per_class[0] += N - sum(n_per_class)  # handle remainder if any (keeps total = N)

# Assign distinct cluster counts per class
clusters_per_class = {
    0: 1,  # class 0 -> 1 cluster
    1: 2,  # class 1 -> 2 clusters
}

# Different seeds per class for variety
base_seeds = {0: 42, 1: 1337}

Xs = []
ys = []
for c in range(K):
    Xi, yi = sample_class_subset(
        n_needed=n_per_class[c],
        target_class=c,
        n_clusters_per_class=clusters_per_class[c],
        seed=base_seeds[c],
        n_features=2,
        n_informative=2,
        n_redundant=0,
        class_sep=1.6,   # tweak for difficulty vs. separability
        flip_y=0.0,
        K=K,
    )
    Xs.append(Xi)
    ys.append(yi)

# Combine and shuffle
X = np.vstack(Xs)
y = np.concatenate(ys)
perm = rng.permutation(len(y))
X = X[perm]
y = y[perm]

print("X shape:", X.shape, "y shape:", y.shape)
print("Class counts:", np.bincount(y))
# Expect: (1000, 2) and balanced counts (exactly 500 each by construction)


X shape: (1000, 2) y shape: (1000,)
Class counts: [500 500]


In [22]:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score


class HiddenLayer():
    
    def __init__(self, w_size, b_size):
        self.W = np.random.uniform(-1, 1, size=w_size)
        self.b = np.random.uniform(-1, 1, size=b_size)
        self.z = 0                # pre-activation value
        self.a = 0                # activation value 
        self.prev_a = 0

    def set_z(self, z):
        self.z = z
        
    def set_a(self, a):
        self.a = a
    
    def set_W(self, W):
        self.W = W    
        
    def set_b(self, b):
        self.b = b
        
    def set_prev_a(self, value):
        self.prev_a = value
        
        
    def update_W(self, eta, dW):
        self.W -= eta * dW
        
    def update_b(self, eta, db):
        self.b -= eta * db

np.set_printoptions(precision=8, suppress=False)

# ----- Given values -----

# Map labels {0, 1} to {-1, +1} to match the tanh output range
# (run this cell once; rerunning it would remap y again)
y = 2*y - 1

HiddenLayers = []

NLayers = 2

for _ in range(NLayers):
    HiddenLayers.append(HiddenLayer(w_size=(2,2), b_size=(2,)))


W2 = np.array([0.5, -0.3])                 # output layer weights (2,)
b2 = 0.2                                   # output bias

eta = 0.1                                   # learning rate for the update step
tanh = np.tanh
tanhp = lambda z: 1.0 - np.tanh(z)**2       # derivative of tanh

epochs = 100

for epoch in range(epochs):
    
    wrong = False
    
    for index in range(len(X)):
        
        x_i = X[index]
        y_i = y[index]

        prev_a = x_i
        for layer in HiddenLayers:
            layer.set_prev_a(prev_a)
            layer.set_z(layer.W @ prev_a + layer.b)            # pre-activations hidden (2,)
            a = (tanh(layer.z))                                # activations hidden (2,)
            layer.set_a(a)
            prev_a = a
            
            
        z2 = W2 @ prev_a + b2           # pre-activation output
        y_hat = tanh(z2)            # final output


        L = ((y_i - y_hat)**2)

        if L > 1e-12:
            
            wrong = True

            # N is the dataset size from the previous cell, so each per-sample gradient is scaled by 1/N
            delta2 = (2.0 / N) * (y_hat - y_i) * (1.0 - y_hat**2)   # scalar

            # Gradients for output layer
            dW2 = delta2 * prev_a   
            db2 = delta2

            next_delta = delta2                      
            next_W = W2                              

            # Updating hidden layers (last to first)
            first_back = True
            for layer in reversed(HiddenLayers):

                if first_back:
                    # W2 is a vector and next_delta a scalar (single output unit), so scale instead of W.T @ delta
                    g = next_W * next_delta
                    first_back = False
                else:
                    g = next_W.T @ next_delta                  

                delta = g * (1.0 - layer.a**2)   

                dW = np.outer(delta, layer.prev_a)     
                db = delta                     

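                # Keep the pre-update weights: the previous layer's gradient must use them
                W_before = layer.W.copy()

                # Update hidden layer
                layer.update_W(eta, dW)
                layer.update_b(eta, db)

                # Prepare for next layer
                next_delta = delta
                next_W = W_before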

            # Output Layer Update
            W2 -= eta * dW2
            b2 -= eta * db2
                    
    if not wrong:            
        break



def forward(x, HiddenLayers, W2, b2):
    x_ = x
    for layer in HiddenLayers:
        z = layer.W @ x_ + layer.b
        a = np.tanh(z)
        x_ = a
    z2 = W2 @ x_ + b2
    y_hat = np.tanh(z2)
    return y_hat



y_pred = np.array([forward(x_i, HiddenLayers, W2, b2) for x_i in X])

y_pred_labels = np.where(y_pred >= 0.0, 1, -1)

print("Final W2:", W2, "Final b2:", b2)
for i, layer in enumerate(HiddenLayers, 1):
    print(f"Layer {i} weights:\n{layer.W}\nLayer {i} bias:\n{layer.b}")

print("Accuracy:", accuracy_score(y, y_pred_labels))

Final W2: [ 0.63446301 -0.83237545] Final b2: 0.45908843162023505
Layer 1 weights:
[[-0.3443021   0.88894019]
 [-0.09855254 -0.86185418]]
Layer 1 bias:
[-0.6156105   0.16952364]
Layer 2 weights:
[[ 0.72297534 -0.12167491]
 [-0.45737613  1.00072866]]
Layer 2 bias:
[0.72930515 0.43186679]
Accuracy: 0.737



***

## Exercise 3: Multi-Class Classification with Synthetic Data and Reusable MLP

Similar to Exercise 2, but with increased complexity.

Use `make_classification` to generate a synthetic dataset with:

- Number of samples: 1500
- Number of classes: 3
- Number of features: 4
- Number of clusters per class: Achieve 2 clusters for one class, 3 for another, and 4 for the last (again, you may need to generate subsets separately and combine them, as the function doesn't directly support varying clusters per class).
- Other parameters: `n_features=4`, `n_informative=4`, `n_redundant=0`, `random_state=42`.

Implement an MLP from scratch to classify this data. You may choose the architecture freely, but for an extra point (bringing this exercise to 4 points), reuse the exact same MLP implementation code from Exercise 2, modifying only hyperparameters (e.g., output layer size for 3 classes, loss function to categorical cross-entropy if needed) without changing the core structure.
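Switching to a softmax output with categorical cross-entropy gives the output-layer delta the simple form $ \mathbf{p} - \mathbf{y}_{\text{one-hot}} $. The sketch below is an added illustration that checks this identity numerically; the example logits, the class index `k`, and the step size `eps` are arbitrary assumptions.

```python
import numpy as np


def softmax(z):
    """Softmax with the maximum subtracted for numerical stability."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / np.sum(e)


def cross_entropy(z, k):
    """Categorical cross-entropy of logits z for the true class index k."""
    return -np.log(softmax(z)[k])


z = np.array([1.0, -0.5, 0.3])   # example logits for 3 classes
k = 2                            # example true class

# Analytic gradient with respect to the logits: p - one_hot(k)
analytic = softmax(z)
analytic[k] -= 1.0

# Central finite differences, one logit at a time
eps = 1e-6
numeric = np.zeros_like(z)
for i in range(z.size):
    step = np.zeros_like(z)
    step[i] = eps
    numeric[i] = (cross_entropy(z + step, k) - cross_entropy(z - step, k)) / (2.0 * eps)

print("analytic:", analytic)
print("numeric: ", numeric)
```

Because the delta has the same shape logic as in the binary case, the hidden-layer backpropagation code can stay unchanged when the output layer is swapped.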

Steps:

1. Generate and split the data (80/20 train/test).
2. Train the model, tracking loss.
3. Evaluate on test set: Report accuracy, and optionally visualize (e.g., scatter plot of data with predicted labels).
4. Submit code and results.
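For the evaluation step, a confusion matrix shows which classes are mixed up, which a single accuracy number hides. The following NumPy-only sketch is an addition; `confusion_matrix_np` is a hypothetical helper and the label arrays are made-up examples rather than results from this exercise.

```python
import numpy as np


def confusion_matrix_np(y_true, y_pred, n_classes):
    """Rows index the true class, columns the predicted class."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)   # unbuffered add handles repeated index pairs
    return cm


# Made-up labels for illustration only
y_true = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0, 2])

cm = confusion_matrix_np(y_true, y_pred, n_classes=3)
accuracy = np.trace(cm) / cm.sum()
per_class_recall = np.diag(cm) / cm.sum(axis=1)

print(cm)
print("accuracy:", accuracy)
print("recall per class:", per_class_recall)
```

In the exercise, `y_true` would be the test labels and `y_pred` the `argmax` of the predicted probabilities.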



In [30]:
import numpy as np
from sklearn.datasets import make_classification

# Reproducibility
rng = np.random.RandomState(42)

def biased_weights_for(target_class, K=3, high=0.8):
    """
    Return a weight vector of length K that heavily favors target_class.
    Helps ensure enough samples of the desired class per call.
    """
    low = (1.0 - high) / (K - 1)
    w = np.full(K, low, dtype=float)
    w[target_class] = high
    return w

def sample_class_subset(
    n_needed: int,
    target_class: int,
    n_clusters_per_class: int,
    seed: int,
    *,
    n_features: int = 4,
    n_informative: int = 4,
    n_redundant: int = 0,
    class_sep: float = 1.5,
    flip_y: float = 0.0,
    max_tries: int = 20,
    K: int = 3
):
    """
    Generate samples with make_classification and keep only rows of 'target_class'.
    We over-generate with biased 'weights' so we can downsample exactly n_needed.
    """
    tries = 0
    local_seed = seed
    # Over-generate to boost the chance of hitting n_needed for target_class
    n_generate = max(4 * n_needed, 2000)

    while tries < max_tries:
        X_tmp, y_tmp = make_classification(
            n_samples=n_generate,
            n_features=n_features,
            n_informative=n_informative,
            n_redundant=n_redundant,
            n_repeated=0,
            n_classes=K,
            n_clusters_per_class=n_clusters_per_class,
            class_sep=class_sep,
            flip_y=flip_y,
            weights=biased_weights_for(target_class, K=K, high=0.8),
            random_state=local_seed,
        )

        idx = np.flatnonzero(y_tmp == target_class)
        if idx.size >= n_needed:
            chosen = rng.choice(idx, size=n_needed, replace=False)
            return X_tmp[chosen], np.full(n_needed, target_class, dtype=int)

        # Try again with a different seed
        tries += 1
        local_seed += 1

    raise RuntimeError(
        f"Could not obtain {n_needed} samples for class={target_class} "
        f"with n_clusters_per_class={n_clusters_per_class} after {max_tries} tries."
    )

# ---------- Build the asymmetric dataset ----------
N = 1500
K = 3
n_per_class = [N // K] * K
n_per_class[0] += N - sum(n_per_class)  # handle remainder if any (keeps total = N)

# Assign distinct cluster counts per class
clusters_per_class = {
    0: 2,  # class 0 -> 2 clusters
    1: 3,  # class 1 -> 3 clusters
    2: 4,  # class 2 -> 4 clusters
}

# Different seeds per class for variety
base_seeds = {0: 42, 1: 1337, 2: 2027}

Xs = []
ys = []
for c in range(K):
    Xi, yi = sample_class_subset(
        n_needed=n_per_class[c],
        target_class=c,
        n_clusters_per_class=clusters_per_class[c],
        seed=base_seeds[c],
        n_features=4,
        n_informative=4,
        n_redundant=0,
        class_sep=1.6,   # tweak for difficulty vs. separability
        flip_y=0.0,
        K=K,
    )
    Xs.append(Xi)
    ys.append(yi)

# Combine and shuffle
X = np.vstack(Xs)
y = np.concatenate(ys)
perm = rng.permutation(len(y))
X = X[perm]
y = y[perm]

print("X shape:", X.shape, "y shape:", y.shape)
print("Class counts:", np.bincount(y))
# Expect: (1500, 4) and roughly balanced counts (exactly 500 each by construction)


X shape: (1500, 4) y shape: (1500,)
Class counts: [500 500 500]


In [None]:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score


def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / np.sum(e)

def one_hot(y_i, K):
    v = np.zeros(K, dtype=float)
    v[y_i] = 1.0
    return v




class HiddenLayer():
    
    def __init__(self, w_size, b_size):
        self.W = np.random.uniform(-1, 1, size=w_size)
        self.b = np.random.uniform(-1, 1, size=b_size)
        self.z = 0                # pre-activation value
        self.a = 0                # activation value 
        self.prev_a = 0

    def set_z(self, z):
        self.z = z
        
    def set_a(self, a):
        self.a = a
    
    def set_W(self, W):
        self.W = W    
        
    def set_b(self, b):
        self.b = b
        
    def set_prev_a(self, value):
        self.prev_a = value
        
        
    def update_W(self, eta, dW):
        self.W -= eta * dW
        
    def update_b(self, eta, db):
        self.b -= eta * db






def forward_probs(x, HiddenLayers, W2, b2):
    prev_a = x
    for layer in HiddenLayers:
        layer.set_prev_a(prev_a)
        layer.set_z(layer.W @ prev_a + layer.b)            # pre-activations hidden (H,)
        a = (tanh(layer.z))                                # activations hidden (H,)
        layer.set_a(a)
        prev_a = a
    z2 = W2 @ prev_a + b2          # logits, shape (K,)
    p = softmax(z2)           # probs, shape (K,)
    return p, prev_a               # also return last hidden activation




K = len(np.unique(y))                 # number of classes
input_dim = X.shape[1]                # = 4
H = 3                                 # hidden width (a free design choice)
NLayers = 2

HiddenLayers = []
# First hidden layer: (H, input_dim)
HiddenLayers.append(HiddenLayer(w_size=(H, input_dim), b_size=(H,)))

# Remaining hidden layers: (H, H)
for _ in range(NLayers - 1):
    HiddenLayers.append(HiddenLayer(w_size=(H, H), b_size=(H,)))

# Output layer: (K, H) and (K,)
W2 = np.random.uniform(-1, 1, size=(K, H))
b2 = np.zeros(K)                



eta = 0.1                                   # learning rate for the update step
tanh = np.tanh
tanhp = lambda z: 1.0 - np.tanh(z)**2       # derivative of tanh

epochs = 100

for epoch in range(epochs):
    
    wrong = False
    
    for index in range(len(X)):
        
        x_i = X[index]
        y_i = y[index]

        p, a_last = forward_probs(x_i, HiddenLayers, W2, b2)  # p: (K,), a_last: (H,)
        L = -np.log(p[y_i])   

        if L > 1e-12:
            
            wrong = True

            # output layer calculations
            y_one = one_hot(y_i, K)               # (K,)
            delta_out = p - y_one                 # (K,)
            dW2 = np.outer(delta_out, a_last)     # (K, H)
            db2 = delta_out                       # (K,)

            # save for hidden backprop (use current W2, not yet updated)
            next_delta = delta_out                # (K,)
            next_W = W2                           # (K, H)

            # updating hidden layers from reversed order
            for layer in reversed(HiddenLayers):
                g = next_W.T @ next_delta         # (H_prev,) where H_prev = layer.a.size
                delta = g * (1.0 - layer.a**2)    # tanh'(z) = 1 - a^2

                dW = np.outer(delta, layer.prev_a)
                db = delta

                W_before = layer.W.copy()         # pre-update weights, needed for the previous layer's gradient
                layer.update_W(eta, dW)
                layer.update_b(eta, db)

                next_delta = delta                # (H_prev,)
                next_W = W_before                 # (H_prev, input_dim_prev)

            # Update output layer
            W2 -= eta * dW2
            b2 -= eta * db2
                    
    if not wrong:            
        break



def forward(x, HiddenLayers, W2, b2):
    x_ = x
    for layer in HiddenLayers:
        z = layer.W @ x_ + layer.b
        a = np.tanh(z)
        x_ = a
    z2 = W2 @ x_ + b2
    return softmax(z2)          # class probabilities (K,), matching the softmax output layer



print("Final W2:", W2, "Final b2:", b2)
for i, layer in enumerate(HiddenLayers, 1):
    print(f"Layer {i} weights:\n{layer.W}\nLayer {i} bias:\n{layer.b}")

probs = np.array([forward_probs(x_i, HiddenLayers, W2, b2)[0] for x_i in X])  # (N, K)
y_pred = np.argmax(probs, axis=1)
print("Accuracy:", accuracy_score(y, y_pred))


Final W2: [[-1.30150104  0.74545721  0.40928816]
 [ 1.46128138  0.86141196 -0.66732008]
 [ 1.12006045 -1.26491417  1.06276974]] Final b2: [-1.23020757 -0.18742653  1.4176341 ]
Layer 1 weights:
[[ 14.87125433   4.96105463  -6.61067946   9.02410079]
 [ -1.54703905   0.30016101  19.98290756   3.37650402]
 [  6.28185814 -12.02192067   8.7366937   -3.54556444]]
Layer 1 bias:
[-0.44119326 -6.82410816 -7.93830413]
Layer 2 weights:
[[-0.75370642  2.52967955  1.27164864]
 [-4.13644548 -2.32339937 -1.9046856 ]
 [-1.02515081 -1.44959544  2.82048605]]
Layer 2 bias:
[-0.94398869  2.25227241  1.17658551]
Accuracy: 0.7086666666666667



***

## Exercise 4: Multi-Class Classification with Deeper MLP

Repeat Exercise 3 exactly, but now ensure your MLP has **at least 2 hidden layers**. You may adjust the number of neurons per layer as needed for better performance. Reuse code from Exercise 3 where possible, but the focus is on demonstrating the deeper architecture. Submit updated code, training results, and test evaluation.
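Adding depth with tanh activations comes with a known risk: each layer multiplies the backpropagated signal by a weight and a tanh derivative that is at most 1, so gradients can shrink quickly. The sketch below illustrates only the mechanism on a scalar chain; the weight `w = 0.5`, the input `a0 = 1.0`, and the function name `chain_gradient` are assumptions, and a real network of width 3 with random weights will behave differently in detail.

```python
import math


def chain_gradient(depth, w=0.5, a0=1.0):
    """Return d a_depth / d a_0 for the scalar chain a_l = tanh(w * a_{l-1})."""
    a = a0
    grad = 1.0
    for _ in range(depth):
        a = math.tanh(w * a)
        grad *= w * (1.0 - a * a)   # chain rule factor: w * tanh'(w * a_prev)
    return grad


for depth in (2, 8, 28):
    print(f"depth={depth:2d}  gradient={chain_gradient(depth):.3e}")
```

When experimenting with the number of hidden layers, printing the norm of `delta` per layer during training is a cheap way to see whether the early layers still receive a usable signal.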




In [35]:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score


def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / np.sum(e)

def one_hot(y_i, K):
    v = np.zeros(K, dtype=float)
    v[y_i] = 1.0
    return v




class HiddenLayer():
    
    def __init__(self, w_size, b_size):
        self.W = np.random.uniform(-1, 1, size=w_size)
        self.b = np.random.uniform(-1, 1, size=b_size)
        self.z = 0                # pre-activation value
        self.a = 0                # activation value 
        self.prev_a = 0

    def set_z(self, z):
        self.z = z
        
    def set_a(self, a):
        self.a = a
    
    def set_W(self, W):
        self.W = W    
        
    def set_b(self, b):
        self.b = b
        
    def set_prev_a(self, value):
        self.prev_a = value
        
        
    def update_W(self, eta, dW):
        self.W -= eta * dW
        
    def update_b(self, eta, db):
        self.b -= eta * db






def forward_probs(x, HiddenLayers, W2, b2):
    prev_a = x
    for layer in HiddenLayers:
        layer.set_prev_a(prev_a)
        layer.set_z(layer.W @ prev_a + layer.b)            # pre-activations hidden (H,)
        a = (tanh(layer.z))                                # activations hidden (H,)
        layer.set_a(a)
        prev_a = a
    z2 = W2 @ prev_a + b2          # logits, shape (K,)
    p = softmax(z2)           # probs, shape (K,)
    return p, prev_a               # also return last hidden activation




K = len(np.unique(y))                 # number of classes
input_dim = X.shape[1]                # = 4
H = 3                                 # hidden width (a free design choice)
NLayers = 28

HiddenLayers = []
# First hidden layer: (H, input_dim)
HiddenLayers.append(HiddenLayer(w_size=(H, input_dim), b_size=(H,)))

# Remaining hidden layers: (H, H)
for _ in range(NLayers - 1):
    HiddenLayers.append(HiddenLayer(w_size=(H, H), b_size=(H,)))

# Output layer: (K, H) and (K,)
W2 = np.random.uniform(-1, 1, size=(K, H))
b2 = np.zeros(K)                



eta = 0.1                                   # learning rate for the update step
tanh = np.tanh
tanhp = lambda z: 1.0 - np.tanh(z)**2       # derivative of tanh

epochs = 100

for epoch in range(epochs):
    
    wrong = False
    
    for index in range(len(X)):
        
        x_i = X[index]
        y_i = y[index]

        p, a_last = forward_probs(x_i, HiddenLayers, W2, b2)  # p: (K,), a_last: (H,)
        L = -np.log(p[y_i])   

        if L > 1e-12:
            
            wrong = True

            # output layer calculations
            y_one = one_hot(y_i, K)               # (K,)
            delta_out = p - y_one                 # (K,)
            dW2 = np.outer(delta_out, a_last)     # (K, H)
            db2 = delta_out                       # (K,)

            # save for hidden backprop (use current W2, not yet updated)
            next_delta = delta_out                # (K,)
            next_W = W2                           # (K, H)

            # updating hidden layers from reversed order
            for layer in reversed(HiddenLayers):
                g = next_W.T @ next_delta         # (H_prev,) where H_prev = layer.a.size
                delta = g * (1.0 - layer.a**2)    # tanh'(z) = 1 - a^2

                dW = np.outer(delta, layer.prev_a)
                db = delta

                W_before = layer.W.copy()         # pre-update weights, needed for the previous layer's gradient
                layer.update_W(eta, dW)
                layer.update_b(eta, db)

                next_delta = delta                # (H_prev,)
                next_W = W_before                 # (H_prev, input_dim_prev)

            # Update output layer
            W2 -= eta * dW2
            b2 -= eta * db2
                    
    if not wrong:            
        break



def forward(x, HiddenLayers, W2, b2):
    x_ = x
    for layer in HiddenLayers:
        z = layer.W @ x_ + layer.b
        a = np.tanh(z)
        x_ = a
    z2 = W2 @ x_ + b2
    return softmax(z2)          # class probabilities (K,), matching the softmax output layer



print("Final W2:", W2, "Final b2:", b2)
for i, layer in enumerate(HiddenLayers, 1):
    print(f"Layer {i} weights:\n{layer.W}\nLayer {i} bias:\n{layer.b}")

probs = np.array([forward_probs(x_i, HiddenLayers, W2, b2)[0] for x_i in X])  # (N, K)
y_pred = np.argmax(probs, axis=1)
print("Accuracy:", accuracy_score(y, y_pred))


Final W2: [[-0.84814546  0.53060569 -0.02411378]
 [-0.84814546  0.53060569 -0.02411379]
 [-0.84814546  0.53060569 -0.02411378]] Final b2: [-3.47440006e-05 -2.60865489e-01  2.60900233e-01]
Layer 1 weights:
[[-0.84338989  0.61003512  0.19936643  0.24624894]
 [-0.3643678  -0.15606804  0.51052957 -0.82817205]
 [ 0.78685002 -0.21631564  0.33439671  0.32546589]]
Layer 1 bias:
[0.71062903 0.57992    0.99769994]
Layer 2 weights:
[[ 0.92091437  0.1590592  -0.92093574]
 [ 0.26167634 -0.37245274 -0.51969802]
 [ 0.52611436  0.27042889  0.56091939]]
Layer 2 bias:
[-0.94674719 -0.50250292 -0.88996456]
Layer 3 weights:
[[-0.48557048 -0.89524249  0.23999672]
 [ 0.11486781  0.22516431  0.4767062 ]
 [ 0.72083311  0.25566272 -0.04362071]]
Layer 3 bias:
[-0.87372412  0.31243484 -0.1821496 ]
Layer 4 weights:
[[-0.86679364 -0.74903253 -0.59972829]
 [-0.41471671 -0.93718506  0.99308781]
 [-0.90901745 -0.38878322 -0.66119217]]
Layer 4 bias:
[ 0.84725279 -0.47609917 -0.78251266]
Layer 5 weights:
[[ 0.9030442  