# LAB 4: NEURAL NETWORK

In this lab, we will explore Neural Networks, which can be trained using the Gradient Descent algorithm.

Specifically, we will cover the following topics:

- A brief introduction to the PyTorch library for implementing Neural Networks.

- Implementation of a neural network using PyTorch.

- Discussion on overfitting.


**A neuron is like a function: it takes a few inputs and calculates an output. It adjusts its parameters during training, using the gradient descent method to minimise a loss function.**


The circle below illustrates an artificial neuron.

<figure>
  <img style="float: left;" src="../../fig/NN1.png" width="800"/>
</figure>



At the far left we see two input values plus a bias value. The input values are 1 and 0 (the green numbers), while the bias holds a value of -2 (the brown number).

The two inputs are then multiplied by their weights, which are 7 and 3 (the blue numbers).

Finally, we add the weighted inputs and the bias and end up with a single number, in this case $1 \times 7 + 0 \times 3 - 2 = 5$ (the red number). This is the input for our artificial neuron.

The neuron then applies an activation function to this number (in our case the Sigmoid function) and returns the result as its output. Here the output is approximately 1, since $\sigma(5) \approx 0.993$ (more about activation functions below).
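As a quick check of the arithmetic above, the following minimal sketch recomputes the neuron's output from the values in the figure. The variable names are chosen here for illustration only.

```python
import math

inputs = [1, 0]    # the two input values (green numbers)
weights = [7, 3]   # one weight per input (blue numbers)
bias = -2          # bias value (brown number)

# Weighted sum of the inputs plus the bias
z = sum(x * w for x, w in zip(inputs, weights)) + bias

# Sigmoid activation squashes z into the interval (0, 1)
output = 1.0 / (1.0 + math.exp(-z))

print(z)                  # 5
print(round(output, 4))   # 0.9933
```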

You can find a more detailed explanation of the structure [here](https://www.leewayhertz.com/what-are-neural-networks/#What-are-neural-networks).



# Single-layer network: Perceptron

The perceptron algorithm was one of the first algorithms used to implement a simple neural network.

The perceptron is a supervised learning algorithm and a type of artificial neural network (ANN).

Let's recall the components of a perceptron:
    
- Inputs: The perceptron takes several inputs $(x_1, x_2, \dots, x_n)$.

- Weights: Each input is associated with a weight $(w_1, w_2, \dots, w_n)$.

- Bias: A bias term $b$ (or $w_0$) is added to shift the decision boundary.

- Activation Function: The perceptron uses a step function (a simple thresholding function) to determine whether the weighted sum of inputs plus the bias is above or below a certain threshold.

- Binary Output: The output of a perceptron is binary (-1 or 1, which can easily be mapped to 0 or 1), making it suitable for linearly separable classification problems.

### Limitations of Perceptron:

- Linear separability: The perceptron can only solve problems that are linearly separable (i.e., it can only classify data that can be separated by a straight line or hyperplane). It cannot solve more complex problems like XOR.

- Single-layer model: The original perceptron is a single-layer model and does not have hidden layers, limiting its expressiveness.
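The components and the linear-separability limitation described above can be illustrated with a small sketch. It assumes 0/1 labels and the classic perceptron learning rule; the helper names (`step`, `predict`, `train_perceptron`) and the hyperparameter values are hypothetical choices made for this example.

```python
def step(z):
    """Step activation: 1 if the weighted sum is positive, else 0."""
    return 1 if z > 0 else 0

def predict(w, b, x):
    """Output of the perceptron for one input vector x."""
    return step(sum(wi * xi for wi, xi in zip(w, x)) + b)

def train_perceptron(X, y, lr=1.0, epochs=20):
    """Perceptron learning rule: weights change only on misclassified samples."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, target in zip(X, y):
            error = target - predict(w, b, xi)   # -1, 0 or +1
            w = [wj + lr * error * xj for wj, xj in zip(w, xi)]
            b += lr * error
    return w, b

X = [[0, 0], [0, 1], [1, 0], [1, 1]]

# AND is linearly separable, so the perceptron learns it
w_and, b_and = train_perceptron(X, [0, 0, 0, 1])
and_preds = [predict(w_and, b_and, x) for x in X]
print(and_preds)   # [0, 0, 0, 1]

# XOR is not linearly separable: at least one point stays misclassified
w_xor, b_xor = train_perceptron(X, [0, 1, 1, 0])
xor_preds = [predict(w_xor, b_xor, x) for x in X]
print(xor_preds != [0, 1, 1, 0])   # True
```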


<figure>
  <img style="float: left;" src="../../fig/perceptron.png" width="800"/>
</figure>

# Activation Functions

Activation functions play an integral role in neural networks by introducing non-linearity.
This non-linearity allows neural networks to learn complex representations and functions of the inputs that would not be possible with linear layers alone.

You can find an overview of the types of activation functions [here](https://www.shiksha.com/online-courses/articles/activation-functions-with-real-life-analogy-and-python-code/).

Unlike the perceptron, which uses a simple step function for activation, neurons in modern neural networks can use a variety of activation functions, such as:

<figure>
  <img style="float: left;" src="../../fig/activation_function.png" width="800"/>
</figure>


# Multi-layer Networks

Neurons can be combined into more sophisticated architectures called multi-layer perceptrons (MLPs) or deep neural networks, in which neurons are organized into layers (input layer, hidden layers, and output layer). Each layer performs computations, and the output of one layer is fed as input to the next.

<figure>
  <img style="float: left;" src="../../fig/MLP.png" width="800"/>
</figure>

Each hidden/output node is now connected to all the nodes of the previous layer: we say that the network is fully connected.


###  Inputs to the 2nd Hidden Layer
The inputs to a neuron in the 2nd hidden layer come from the outputs of the neurons in the 1st hidden layer.

Here we have:
- $ h_1^{1}, h_2^{1}, h_3^{1}, h_4^{1} $ represent the activations (outputs) of the neurons in the **1st hidden layer**. These are the inputs to the neurons in the 2nd hidden layer.

- Let's assume that $ w_{ij}^{2} $ represents the weight connecting neuron $ i $ in the 1st hidden layer to neuron $ j $ in the 2nd hidden layer.

- $ b_j^{2} $ is the bias for neuron $ j $ in the 2nd hidden layer.



### Weighted Sum for a Neuron in the 2nd Hidden Layer
The value of the weighted sum for neuron $ j $ in the 2nd hidden layer is:

$$
a_j^{2} = \sum_{i=1}^{4} w_{ij}^{2} h_i^{1} + b_j^{2}
$$


After computing the weighted sum $a_j^{2}$, we apply an activation function $ f(a_j^{2}) $ to compute the final output (activation) of the neuron. 


So the output (activation) of neuron $ j $ in the 2nd hidden layer is:


$$
h_j^{2} = f\left(\sum_{i=1}^{4} w_{ij}^{2} h_i^{1} + b_j^{2} \right)
$$



If we are using an activation function like ReLU, sigmoid, or tanh, the output of neuron $ j $ in the 2nd hidden layer would be:

- **Sigmoid**:
  $$
  h_j^{2} = \frac{1}{1 + e^{-a_j^{2}}}
  $$
- **ReLU**:
  $$
 h_j^{2} = \max(0, a_j^{2})
  $$
- **Tanh**:
  $$
 h_j^{2} = \tanh(a_j^{2}) = \frac{e^{a_j^{2}} - e^{-a_j^{2}}}{e^{a_j^{2}} + e^{-a_j^{2}}}
  $$
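To make the formulas above concrete, here is a small sketch that computes the weighted sum and the three activations for one neuron of the 2nd hidden layer. The numeric values of the activations, weights and bias are made up for illustration.

```python
import math

h1 = [0.5, -1.0, 2.0, 0.0]    # h_i^1: outputs of the 4 neurons in the 1st hidden layer (made-up)
w_j = [0.2, 0.4, -0.1, 0.7]   # w_ij^2: weights into neuron j of the 2nd hidden layer (made-up)
b_j = 0.1                     # b_j^2: bias of neuron j (made-up)

# Weighted sum a_j^2 = sum_i w_ij^2 * h_i^1 + b_j^2
a_j = sum(w * h for w, h in zip(w_j, h1)) + b_j

sigmoid_out = 1.0 / (1.0 + math.exp(-a_j))   # always in (0, 1)
relu_out = max(0.0, a_j)                     # 0 for negative inputs
tanh_out = math.tanh(a_j)                    # always in (-1, 1)

print(f"a_j     = {a_j:.2f}")
print(f"sigmoid = {sigmoid_out:.4f}")
print(f"ReLU    = {relu_out:.4f}")
print(f"tanh    = {tanh_out:.4f}")
```

Because the weighted sum is negative here, ReLU returns 0, the sigmoid output falls below 0.5, and tanh returns a negative value.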




#  Training the Neural Network

To train a NN, we follow these steps:

- Forward pass: Compute the outputs.

- Compute loss: Compare outputs with actual labels.

- Backward pass: Compute gradients of loss with respect to weights and bias.

- Optimize: Update weights using the optimizer (e.g., Adam, SGD).

- Repeat for many epochs, using batches of data.
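The steps above can be sketched for the simplest possible case, a single sigmoid neuron trained with plain gradient descent in NumPy. This is only an illustration under stated assumptions: the toy dataset (logical OR), the learning rate and the number of epochs are arbitrary choices, and the whole dataset is used as one batch.

```python
import numpy as np

# Toy dataset: the logical OR function (4 samples, 2 features)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [1.]])

w = np.zeros((2, 1))   # weights
b = 0.0                # bias
lr = 0.5               # learning rate (illustrative value)
losses = []

for epoch in range(2000):
    # 1. Forward pass: weighted sum followed by a sigmoid
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))

    # 2. Compute loss: binary cross-entropy between predictions and labels
    loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    losses.append(loss)

    # 3. Backward pass: gradients of the loss w.r.t. weights and bias
    grad_z = (p - y) / len(X)
    grad_w = X.T @ grad_z
    grad_b = grad_z.sum()

    # 4. Optimize: gradient descent update
    w -= lr * grad_w
    b -= lr * grad_b

predictions = (1.0 / (1.0 + np.exp(-(X @ w + b)))).round()
print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
print(predictions.ravel())
```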



### Forward pass 

The action of obtaining the output y from the input x is called a forward pass.

In the forward pass, the input values are multiplied by the weights, biases are added, and the result is passed through an activation function. This process is repeated for every layer, eventually yielding the output of the network.

We can implement the forward pass of our multi-layer perceptron in a few lines. We use random weights and bias for the example:

In [1]:
import numpy as np

sizes = [3, 4, 3, 1]  # number of neurons in each layer
num_layers = len(sizes)  # one input layer, 2 hidden layers and one output layer

# the input layer has no biases, so we start from the first hidden layer
biases = [np.random.randn(sizes[i], 1) for i in range(1, num_layers)]

# each weight matrix has shape (neurons in layer i, neurons in layer i-1),
# so that np.dot(w, h) has the same shape as the bias vector
weights = [np.random.randn(sizes[i], sizes[i-1]) for i in range(1, num_layers)]


In [2]:
weights

[array([[ 1.07763987,  1.3355706 , -1.43261954],
        [-0.09855782,  1.2955058 ,  0.72416075],
        [ 0.10502936, -1.67961677,  1.40173685],
        [-0.04502352,  1.08057925, -1.03064786]]),
 array([[ 0.04128528,  0.06089792, -0.6319817 ,  1.06822963],
        [ 0.15103529, -0.29569492, -1.13376815,  1.00328582],
        [ 0.1006686 , -1.52056104,  1.06255129,  1.87065629]]),
 array([[0.59044125, 2.16752136, 0.10428971]])]

Once we have initialized the network size and its parameters, we are ready to implement the forward pass. We can then feed the network with any arbitrary input.

In [3]:
# Define an activation function
def g(z):
    """Sigmoid function."""
    return 1.0/(1.0 + np.exp(-z)) 

# define output/input relation for a MLP
def network_forward(h): 
    """Return the output of the network given input h."""
    for b, w in zip(biases, weights):
        z = np.dot(w, h) + b # mat-vect product + bias
        h = g(z) # apply activation function
    return h

# Example input (named x to avoid shadowing the built-in input())
x = np.array([[1.],[2.],[3.]]) # a column vector (1,2,3)^T
output = network_forward(x)
print(output)

[[0.60611854]]


### Compute the Loss/Cost function:

The loss function measures how far off the network’s predictions are from the true labels.
Some common loss functions are:

- Mean Squared Error (MSE) for regression tasks.

- Cross-Entropy Loss for classification tasks (either binary or multi-class)

In [4]:
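def binary_cross_entropy_loss(y_true, y_pred, eps=1e-12):
    # clip predictions away from 0 and 1 to avoid log(0)
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))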

def mean_squared_error(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)


### Backward Pass (Backpropagation):

The backward pass involves calculating the gradients of the loss with respect to the weights and biases in each layer.

The gradients are then used to update the weights and biases to minimize the loss (i.e., make the model’s predictions more accurate).

This process is called backpropagation because it uses the chain rule from calculus to compute the gradients layer by layer, starting from the output layer and moving backward through the network.

<figure>
  <img style="float: left;" src="../../fig/bp.png" width="800"/>
</figure>
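As an illustration of the chain rule at work, the sketch below runs a manual backward pass through a tiny network and compares one analytic gradient with a finite-difference estimate. The architecture (3 inputs, 4 hidden neurons, 1 output), the sigmoid activations, the squared-error loss and all variable names are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tiny network: 3 inputs -> 4 hidden neurons -> 1 output, sigmoid everywhere
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((1, 4)), rng.standard_normal((1, 1))
x = np.array([[1.0], [2.0], [3.0]])   # one input sample (column vector)
y = np.array([[1.0]])                 # its target value

def forward(W1, b1, W2, b2):
    """Forward pass; returns hidden activations, prediction and squared-error loss."""
    h1 = sigmoid(W1 @ x + b1)
    y_hat = sigmoid(W2 @ h1 + b2)
    loss = 0.5 * float(np.sum((y_hat - y) ** 2))
    return h1, y_hat, loss

h1, y_hat, loss = forward(W1, b1, W2, b2)

# Backward pass: apply the chain rule from the output layer back to the input
delta2 = (y_hat - y) * y_hat * (1 - y_hat)   # dL/dz2 (output layer)
grad_W2 = delta2 @ h1.T
grad_b2 = delta2
delta1 = (W2.T @ delta2) * h1 * (1 - h1)     # dL/dz1 (hidden layer)
grad_W1 = delta1 @ x.T
grad_b1 = delta1

# Sanity check: compare one analytic gradient with a finite-difference estimate
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (forward(W1_plus, b1, W2, b2)[2] - forward(W1_minus, b1, W2, b2)[2]) / (2 * eps)
print(grad_W1[0, 0], numeric)   # the two values should agree closely
```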

# Implementing a NN with PyTorch

PyTorch is one of the most widely used libraries for developing deep learning models. Tensors are one of the fundamental components of PyTorch. You can think of tensors as similar to NumPy arrays.

Using tensors, PyTorch can create computational graphs and calculate gradients.
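A minimal sketch of this mechanism, using a made-up scalar function, might look as follows. Setting `requires_grad=True` asks PyTorch to record the operations applied to the tensor, and `backward()` then computes the derivative.

```python
import torch

# A scalar tensor that PyTorch should track for gradient computation
x = torch.tensor(2.0, requires_grad=True)

# Each operation on x is recorded in the computational graph
y = x ** 2 + 3 * x

# Backpropagate through the graph to get dy/dx = 2x + 3
y.backward()

print(y.item())       # 10.0
print(x.grad.item())  # 7.0
```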

We will cover the basic concepts of PyTorch and provide further explanations as we progress through the implementation.

You can find the [PyTorch tutorials here](https://pytorch.org/tutorials/beginner/basics/intro.html).

In [7]:
#!pip install torch


In [8]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from torch.utils.data import DataLoader, TensorDataset

In [9]:
## Transform array to tensor with pytorch

tensor_1 = torch.tensor([[1., 2.], 
                         [3., 3.]])

tensor_2 = torch.tensor([[1., 0], 
                         [-1., 0]])

print("tensor_1 shape: ", tensor_1.shape)
print("tensor_2 shape: ", tensor_2.shape)

tensor_1 shape:  torch.Size([2, 2])
tensor_2 shape:  torch.Size([2, 2])


In [None]:
## Element Multiplication

elem_mul = tensor_1 * tensor_2
print("Element wise multiplication")
print("Result shape: ", elem_mul.size())
elem_mul

In [None]:
## Matrix Multiplication

matrix_mul = torch.matmul(tensor_1,tensor_2)
print("Matrix multiplication")
print("Result shape: ", matrix_mul.size())
matrix_mul

In [None]:
# get tensor as numpy array (the array shares memory with the tensor)
tensor_1.numpy()

In [None]:
# Define the layers manually
input_size = 3
hidden_size1 = 4
hidden_size2 = 3
output_size = 2

# Create the layers
fc1 = nn.Linear(input_size, hidden_size1)  # Input to hidden layer
fc2 = nn.Linear(hidden_size1, hidden_size2)  # Hidden layer 1 to hidden layer 2
fc3 = nn.Linear(hidden_size2, output_size)  # Hidden layer 2 to output layer

# Define activation functions
relu = nn.ReLU()
sigmoid = nn.Sigmoid()

# Loss function and optimizer
criterion = nn.BCELoss()  # Binary Cross-Entropy Loss
optimizer = optim.SGD([fc1.weight, fc1.bias, fc2.weight, fc2.bias, fc3.weight, fc3.bias], lr=0.01)

# Example input (2 samples, 3 features each)
inputs = torch.tensor([[0.1, 0.2, 0.3],
                       [0.4, 0.5, 0.6]], dtype=torch.float32)

#  true output labels
labels = torch.tensor([[0, 1],
                       [1, 0]], dtype=torch.float32)

# Number of epochs
epochs = 200
loss_values = []  # List to store loss at each epoch

for epoch in range(epochs):
    # Reset the gradients accumulated in the previous iteration
    optimizer.zero_grad()

    # Forward pass
    z1 = fc1(inputs)  # Linear combination in hidden layer 1
    a1 = relu(z1)     # Apply ReLU activation
    z2 = fc2(a1)      # Linear combination in hidden layer 2
    a2 = relu(z2)     # Apply ReLU activation again
    z3 = fc3(a2)      # Linear combination in output layer
    outputs = sigmoid(z3)  # Apply Sigmoid activation in the output layer

    # Convert probabilities to binary predictions (0 or 1)
    binary_outputs = torch.round(outputs)

    # Compute loss
    loss = criterion(outputs, labels)

    # Backward pass
    loss.backward()

    # Update weights
    optimizer.step()

    # Store the loss value for this epoch
    loss_values.append(loss.item())

    # Print loss every 50 epochs for monitoring
    if (epoch + 1) % 50 == 0:
        print(f"Epoch [{epoch+1}/{epochs}], Loss: {loss.item():.4f}")

In [None]:
# Print results
print("Outputs (Probabilities):\n", outputs)
print()
print("Binary Outputs (0 or 1):\n", binary_outputs)

In [None]:
# Plot the loss over epochs
plt.plot(loss_values)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Loss over Epochs')
plt.grid(True)
plt.show()

### Task:

- Modify the neural network to include more hidden layers (e.g., add another hidden layer with different size).

- Experiment with different activation functions (e.g., Tanh, LeakyReLU, etc.) in the hidden layers and observe the effect on output probabilities.

- Experiment with different optimizers such as Adam, RMSprop, and Adagrad in place of SGD, and observe their effect on model performance.

### Questions:

- How does changing the number of hidden layers affect the model's ability to classify the data?

- What happens if you use a different activation function in the hidden layers instead of ReLU (e.g., Tanh or LeakyReLU)? Show the effect on the binary outputs.

- How does changing the optimizer affect the convergence of the model?

- Which optimizer performed best for this dataset, and why do you think that is?


# Let's try with a real dataset

### Breast Cancer Dataset

The **Breast Cancer Dataset** is a widely used dataset for binary classification tasks in machine learning, available through the `sklearn` library. It contains measurements of breast masses, with the goal of classifying tumors as **malignant** (cancerous) or **benign** (non-cancerous).

- **Number of samples:** 569

- **Number of features:** 30 numeric features

- **Target classes:** 
  - 0 = malignant (212 samples)
  - 1 = benign (357 samples)
  
- **Features:** Each feature represents a characteristic of cell nuclei (e.g., radius, texture, smoothness), calculated from digitized images of breast masses. The features include mean, standard error, and worst (largest) values of each characteristic.

You can find more about this dataset [here](https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic).

In [None]:
# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

In [None]:
# Create a DataFrame with the feature data
df = pd.DataFrame(data=data.data, columns=data.feature_names)

# Add the target labels to the DataFrame
df['target'] = data.target

df

In [None]:
# Split the dataset into train, validation, and test sets (60% train, 20% val, 20% test)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)


In [None]:
# Standardize the features (mean=0, variance=1)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)


In [None]:
# Convert data to PyTorch tensors
X_train = torch.tensor(X_train, dtype=torch.float32)
X_val = torch.tensor(X_val, dtype=torch.float32)
X_test = torch.tensor(X_test, dtype=torch.float32)

y_train = torch.tensor(y_train, dtype=torch.float32).unsqueeze(1)  # Shape (n_samples, 1)
y_val = torch.tensor(y_val, dtype=torch.float32).unsqueeze(1)
y_test = torch.tensor(y_test, dtype=torch.float32).unsqueeze(1)

# unsqueeze(1) turns the targets into shape (n_samples, 1), so that they match
# the shape of the model output when the loss is computed

### Batches

When we train a neural network, we usually feed the training examples to the model in batches.


Batching is used for several reasons. First, if we have a huge dataset, we cannot fit all the data in memory at once. Moreover, batches usually make training faster, because the weights are updated after every batch, so the model receives many updates in a single pass over the data.

So batch size is an additional hyperparameter of our training algorithm.

PyTorch provides two main data classes: DataLoader and Dataset to handle your data.

Dataset stores the samples and their corresponding labels, and DataLoader wraps an iterable around the Dataset to enable easy access to the batches.

You can read more at: https://pytorch.org/tutorials/beginner/basics/data_tutorial.html
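As a small illustration of how the two classes work together, the sketch below wraps a made-up toy dataset in a `TensorDataset` and iterates over it in batches. The tensor names, sizes and the batch size are arbitrary choices for this example.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy data: 10 samples with 3 features each, and one label per sample
features = torch.arange(30, dtype=torch.float32).reshape(10, 3)
targets = torch.arange(10, dtype=torch.float32).unsqueeze(1)

dataset = TensorDataset(features, targets)                  # pairs each sample with its label
loader = DataLoader(dataset, batch_size=4, shuffle=False)   # yields batches of up to 4 samples

batch_shapes = [tuple(batch_X.shape) for batch_X, batch_y in loader]
print(len(dataset))    # 10
print(batch_shapes)    # [(4, 3), (4, 3), (2, 3)]
```

The last batch is smaller because 10 is not a multiple of the batch size.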

Let's continue😅

In [None]:
# Create TensorDatasets
train_dataset = TensorDataset(X_train, y_train)
val_dataset = TensorDataset(X_val, y_val)
test_dataset = TensorDataset(X_test, y_test)

# Create DataLoaders for mini-batch training
batch_size = 16
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)

In [None]:
#  Define the neural network structure
input_size = X_train.shape[1]  # Number of features in the dataset
hidden_size1 = 16
hidden_size2 = 8
output_size = 1  # Binary classification (0 or 1)

# Create the model
model = nn.Sequential(
    nn.Linear(input_size, hidden_size1),
    nn.ReLU(),
    nn.Linear(hidden_size1, hidden_size2),
    nn.ReLU(),
    nn.Linear(hidden_size2, output_size),
    nn.Sigmoid()
)

# Loss function and optimizer
criterion = nn.BCELoss()
optimizer = optim.SGD(model.parameters(), lr=0.001)

In [None]:
#  Train the model
epochs = 1000
loss_values, train_accuracies, val_accuracies, val_losses = [], [], [], []

for epoch in range(epochs):
    model.train()
    epoch_loss = 0.0
    for batch_X, batch_y in train_loader:

        # Forward pass
        outputs = model(batch_X)
        loss = criterion(outputs, batch_y)

        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()

    # Average training loss over all batches of this epoch
    epoch_loss /= len(train_loader)
    loss_values.append(epoch_loss)

    # Validation loss calculation
    model.eval()
    with torch.no_grad():
        val_loss = 0
        for val_X, val_y in val_loader:
            val_outputs = model(val_X)
            val_loss += criterion(val_outputs, val_y).item()
        val_loss /= len(val_loader)  # Average loss for validation
        val_losses.append(val_loss)

        # Accuracy calculations
        train_accuracy = (model(X_train).round().eq(y_train).sum() / y_train.size(0)).item()
        val_accuracy = (model(X_val).round().eq(y_val).sum() / y_val.size(0)).item()
        train_accuracies.append(train_accuracy)
        val_accuracies.append(val_accuracy)

    if (epoch + 1) % 50 == 0:
        print(f'Epoch [{epoch+1}/{epochs}], Train Loss: {epoch_loss:.4f}, Val Loss: {val_loss:.4f}, Train Acc: {train_accuracy:.2f}, Val Acc: {val_accuracy:.2f}')

In [None]:
# Plot loss and accuracy
plt.figure(figsize=(14, 5))

# Plot Loss
plt.subplot(1, 2, 1)
plt.plot(loss_values, label='Train Loss')
plt.plot(val_losses, label='Val Loss', linestyle='--')  # Validation loss in dashed line
plt.title('Loss over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.grid()
plt.legend()

# Plot Accuracy
plt.subplot(1, 2, 2)
plt.plot(train_accuracies, label='Train Accuracy')
plt.plot(val_accuracies, label='Val Accuracy')
plt.title('Accuracy over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.ylim(0, 1)
plt.grid()
plt.legend()

plt.tight_layout()
plt.show()


In [None]:
#  Evaluate the model on test data
model.eval()  # make sure the model is in evaluation mode
with torch.no_grad():
    test_accuracy = (model(X_test).round().eq(y_test).sum() / y_test.size(0)).item()

print(f'Test Accuracy: {test_accuracy * 100:.2f}%')

# TASK 1

### Model Architecture

- Modify the existing model architecture (e.g., add more layers, change the number of neurons).

- Experiment with different activation functions (e.g., Tanh, Leaky ReLU).

RETRAIN THE MODEL AND COMMENT ON THE DIFFERENCE, IF ANY.

### Hyperparameter Tuning:

- Change the learning rate, batch size, and the number of epochs.

- Compare the effects of different optimizers (e.g., Adam, RMSprop) on model performance.

### Regularization Techniques:

- Implement L2 regularization (weight decay) in the optimizer.

- Introduce dropout layers to prevent overfitting and discuss their impact on training and validation accuracy.

To add L2 regularization or dropout, you can modify the optimizer or the model structure: dropout is added as a layer after the activation function in the model architecture, and weight decay is set through the `weight_decay` argument of the optimizer (e.g., SGD or Adam).
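As a syntax reference only, a sketch of both techniques could look like the following. The layer sizes, dropout probability, learning rate and weight decay value are placeholder assumptions, and the names `model_reg` and `optimizer_reg` are hypothetical; adapt them to your own architecture.

```python
import torch
import torch.nn as nn
import torch.optim as optim

model_reg = nn.Sequential(
    nn.Linear(30, 16),
    nn.ReLU(),
    nn.Dropout(p=0.3),   # randomly zeroes 30% of the activations during training
    nn.Linear(16, 1),
    nn.Sigmoid(),
)

# weight_decay adds an L2 penalty on the parameters inside the optimizer update
optimizer_reg = optim.Adam(model_reg.parameters(), lr=0.001, weight_decay=1e-4)

x = torch.randn(5, 30)
model_reg.eval()         # dropout is disabled in evaluation mode
with torch.no_grad():
    out_a, out_b = model_reg(x), model_reg(x)
print(torch.equal(out_a, out_b))   # True
```

Remember to call `model.train()` before training and `model.eval()` before evaluation, since dropout behaves differently in the two modes.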

SAVE THE MODEL ARCHITECTURE YOU BUILT BEFORE, RETRAIN THE MODEL AND COMMENT ON THE DIFFERENCE, IF ANY.

### Early Stopping:
Early stopping is a regularization technique used to reduce overfitting without compromising model accuracy. The main idea is to stop training before the model starts to overfit.

- Implement early stopping based on validation loss to avoid overfitting.

- Discuss the importance of early stopping in training deep learning models.
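The stopping rule itself can be sketched independently of any framework. The function below is a hypothetical helper (its name, the `patience` and `min_delta` parameters, and the loss values are all assumptions for illustration); in a real training loop the same check would run once per epoch on the current validation loss.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the index of the epoch at which training would stop.

    Training stops once the validation loss has failed to improve on its
    best value by more than min_delta for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, val_loss in enumerate(val_losses):
        if val_loss < best_loss - min_delta:
            best_loss = val_loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1   # never triggered: train for all epochs

# Made-up validation losses: improving at first, then rising (overfitting)
history = [0.90, 0.70, 0.55, 0.50, 0.52, 0.53, 0.56, 0.60]
stop_epoch = early_stopping_epoch(history, patience=3)
print(stop_epoch)   # 6
```

In practice one usually also keeps a copy of the model weights from the epoch with the best validation loss and restores them after stopping.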

SAVE ALL YOUR CHANGES AND ADD EARLY STOPPING, RETRAIN THE MODEL AND COMMENT ON THE DIFFERENCE, IF ANY.

# TASK 2: Build and Evaluate a Regression Model



**Objective**: Build a regression model using the class dataset to predict the weight. Evaluate the model’s performance using various metrics.

### Steps to Follow

1. **Data Loading**: 

2. **Data Preprocessing**:
   - Encode the categorical data
   
   - Apply oversampling on minority class for data balance
   
   - Split the dataset into training, validation, and test sets (e.g., 70% training, 15% validation, 15% test).
   
   - Normalize the features using standard scaling or min-max scaling.
   
   - Convert the datasets into PyTorch tensors.

3. **Model Development**:

   - Define a neural network architecture for regression. 
   
   - Use appropriate activation functions and output layers.
   
 
4. **Loss Function and Optimizer**:

   - Use Mean Squared Error (MSE) loss for regression.
   
   - Choose an optimizer (e.g., Adam or SGD).
   
   - Example:
     ```python
     criterion = nn.MSELoss()
     optimizer = optim.Adam(model.parameters(), lr=0.001)
     ```

5. **Training the Model**:

   - There is no need to use batches, because the dataset is small.

   - Train the model over a specified number of epochs.
   
   - Store loss values for visualization.
   
   - Print the training and validation loss at the end of each epoch.

6. **Model Evaluation**:

   - Evaluate the model on the test set and calculate the following metrics:
   
     - Mean Absolute Error (MAE)
     - Mean Squared Error (MSE)
     
   - Example of calculating metrics:
     ```python
     from sklearn.metrics import mean_absolute_error, mean_squared_error

     model.eval()  # switch to evaluation mode before predicting
     with torch.no_grad():
         y_pred = model(torch.tensor(X_test, dtype=torch.float32)).numpy()

     mae = mean_absolute_error(y_test, y_pred)
     mse = mean_squared_error(y_test, y_pred)
     ```

7. **Visualization**:

   - Plot the training and validation loss over epochs to visualize training progress.
   
   - Create scatter plots of predicted vs. actual values to assess model performance visually.