## Please Read

Suggestions for this assignment:


1.   Start early (some cells take slightly longer to run)
2.   Answer written questions thoroughly using the concepts learned.  The analysis of the graphs is important.
3.   The diabetes dataset is EXTREMELY NOISY.  Please be aware, and answer questions based on the general trajectory.  Plotting loss curves may help.



## Prepare Python environment


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

%matplotlib inline

In [None]:
random_state = 5 # use this to control randomness across runs e.g., dataset partitioning

## Preparing the Diabetes dataset (2 points)

In [None]:
# These are the names of the columns in the dataset: all features of the data, followed by the label.
col_names = ['pregnancies', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']
# load dataset
import os
if not os.path.exists("diabetes.csv"):
    !wget https://raw.githubusercontent.com/JHA-LAB/ece364_2025/master/data/diabetes.csv
diabetes_data = pd.read_csv("diabetes.csv", header=1, names=col_names)

# Display the first five instances in the dataset
diabetes_data.head(5)

### Extract target and descriptive features (1 point)

In [None]:
# Store all the features from the data in X
X = # TODO
# Store all the labels in y
y = # TODO

In [None]:
# Convert data to numpy array
X = # TODO
y = # TODO

### Create training and validation datasets (0.5 point)

Split the data into training and validation sets using `train_test_split`.  See [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) for details. To get consistent results when splitting, set `random_state` to the value defined earlier. We use 80% of the data for training and 20% of the data for validation.
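As a rough illustration of the call described above, here is a minimal sketch on a toy array. The names `X_toy`, `y_toy`, `X_tr`, `X_te`, `y_tr`, and `y_te` are hypothetical stand-ins, not the assignment's variables, and the toy data has nothing to do with the diabetes dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in data (hypothetical): 10 samples, 2 features, binary labels.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

# test_size=0.2 holds out 20% of the rows; a fixed random_state makes the
# shuffle reproducible across runs.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=5
)

print(X_tr.shape, X_te.shape)
```

Re-running the cell with the same `random_state` should place the same rows in each split.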

In [None]:
X_train,X_val,y_train,y_val = # TODO

### Preprocess the dataset (0.5 point)

Preprocess the data by normalizing each feature to have zero mean and unit standard deviation. This can be done using the `StandardScaler` class. See [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) for more details.
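The fit-on-train, transform-on-validation pattern can be sketched on toy arrays as follows. The names `train_toy`, `val_toy`, and `toy_scaler` are hypothetical, and the numbers are made up for illustration only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy stand-in arrays (hypothetical), not the diabetes data.
train_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
val_toy = np.array([[2.5, 25.0], [5.0, 50.0]])

toy_scaler = StandardScaler()
# fit_transform learns the per-feature mean and standard deviation from the
# training split and applies them in one step.
train_scaled = toy_scaler.fit_transform(train_toy)
# transform reuses the training statistics, so no information from the
# validation split leaks into preprocessing.
val_scaled = toy_scaler.transform(val_toy)

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(val_scaled)
```

Only the training split ends up with exactly zero mean and unit standard deviation; the validation split is merely expressed in the training split's units.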


In [None]:
# Define the scaler for scaling the data
scaler = # TODO

# Normalize the training data
X_train = # TODO

# Use the scaler defined above to standardize the validation data by applying the same transformation to the validation data.
X_val = # TODO


## Training a Multi-Layer Perceptron (18 points)


We will use `sklearn`'s neural network library to train a multi-layer perceptron for classification. The model is trained to minimize the cross-entropy loss using stochastic gradient descent. Review ch. 8 and see [here](https://scikit-learn.org/stable/modules/neural_networks_supervised.html) for more details.

NOTE: Training each network takes several seconds to minutes.

In [None]:
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

In [None]:
"""
For info on the arguments and attributes, see here:
(https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier)
"""

def get_mlp(hidden_layer_sizes=(100,),
            activation='relu',
            learning_rate_init=0.1,
            early_stopping=False,
            validation_fraction=0.15):

    # use stochastic gradient descent
    parameters = {'solver': 'sgd',
                  'alpha': 0,
                  'momentum': 0,
                  'max_iter': 20000,
                  'n_iter_no_change': 100,
                  'tol': 1e-5,
                  'random_state': random_state
                  }
    parameters['hidden_layer_sizes']=hidden_layer_sizes
    parameters['activation']=activation
    parameters['learning_rate_init']=learning_rate_init
    parameters['early_stopping']=early_stopping
    parameters['validation_fraction']=validation_fraction

    return MLPClassifier(**parameters)

### Exercise 1: Warm up (2 points)

1. Use `get_mlp` defined above to create a multi-layer perceptron with 1 hidden layer consisting of 100 units and train the classifier on the training dataset. Keep all other parameters at their default values.


In [None]:
# TODO

2. Visualize the evolution of the training loss over epochs. Hint: use the `loss_curve_` attribute of the classifier and matplotlib (https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.plot.html).
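To make the hint concrete, the sketch below fits a small classifier on synthetic data and plots its loss curve. Everything here (`X_demo`, `y_demo`, `demo_clf`, the layer size, and `max_iter`) is an assumption chosen for a quick demo, not the configuration the exercise asks for.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data (hypothetical), not the diabetes dataset.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

demo_clf = MLPClassifier(hidden_layer_sizes=(10,), solver='sgd',
                         learning_rate_init=0.1, max_iter=200, random_state=0)
demo_clf.fit(X_demo, y_demo)

# loss_curve_ holds one training-loss value per completed epoch.
fig, ax = plt.subplots()
ax.plot(range(1, len(demo_clf.loss_curve_) + 1), demo_clf.loss_curve_)
ax.set_xlabel('Epoch')
ax.set_ylabel('Training loss')
ax.set_title('Training loss over epochs')
```

With `%matplotlib inline` active, the figure should render below the cell without an explicit `plt.show()`.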





In [None]:
# TODO

3. Report the classifier's accuracies on the training and validation datasets. Hint: use `accuracy_score`.

In [None]:
# TODO

#### Explain any performance difference observed between the training and validation datasets.

Insert answer

#### We will next explore several strategies to improve the model's validation performance.

### Exercise 2: Width vs Depth (9 points)

#### Exercise 2a (3 points)

Next, we will experiment with the width of the hidden layer, defined by the number of units in the hidden layer.

Do this by using `get_mlp` to create a multi-layer perceptron with 1 hidden layer. Vary the number of hidden units among 1, 10, 50, 100, by setting `hidden_layer_sizes`. Keep all other parameters at their default values.

Fit each classifier on the training dataset and report its training and validation accuracies.

In [None]:
# TODO

#### Provide a possible explanation for any effect observed upon increasing the number of hidden units on classifier performance.

Insert answer

#### Exercise 2b (3 points)

Next, we will experiment with the depth of the MLP, by varying the number of hidden layers.

Do this by using `get_mlp` to create a multi-layer perceptron with 15 units per hidden layer. Vary the number of hidden layers from 1 through 4 by setting `hidden_layer_sizes`. Keep all other parameters at their default values.

Fit each classifier on the training dataset and report its training and validation accuracies.
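As a hedged aside on how `hidden_layer_sizes` encodes depth: each entry of the tuple is the width of one hidden layer, so repeating an entry adds layers. The sketch below checks this on synthetic data by inspecting the shapes in `coefs_`. The data, the two depths tried, and the small `max_iter` are assumptions chosen to keep the demo fast.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data (hypothetical) with 8 input features.
X_demo, y_demo = make_classification(n_samples=100, n_features=8, random_state=0)

for depth in (1, 3):
    sizes = (15,) * depth  # e.g. depth=3 gives (15, 15, 15)
    clf = MLPClassifier(hidden_layer_sizes=sizes, max_iter=50, random_state=0)
    clf.fit(X_demo, y_demo)
    # coefs_ has one weight matrix per pair of adjacent layers:
    # input->hidden, hidden->hidden (depth - 1 times), hidden->output.
    print(sizes, [w.shape for w in clf.coefs_])
```

Note that `(15,)` needs the trailing comma to be a tuple; `(15)` is just the integer 15.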


In [None]:
# TODO

#### Provide a possible explanation for any change in performance upon increasing the model depth.

Insert answer

#### Exercise 2c (3 points)

Next, we'll explore the role of the hidden activation function when training a deeper network.

Do this by using `get_mlp` to create a multi-layer perceptron with 5 hidden layers, each with 15 hidden units. Vary the activation functions among identity, logistic, tanh, and relu. Keep all other parameters at their default values.

Fit each classifier on the training dataset and report its training accuracy.

Also, plot the training loss curves for each classifier on a single plot.


In [None]:
# TODO

#### Explain any effect observed on the training loss trajectories and accuracies when varying the hidden activation function.

Insert answer

### Exercise 3: Early stopping (4 points)

As we've seen from the above exercises, neural networks are prone to overfitting. To mitigate this, we can use a regularization method called early stopping.

In this part, we will compare a model trained with early stopping against one trained without it. For a fair comparison, we treat the validation dataset built earlier (20% of the data) as a test dataset and keep it unavailable to both models until the final evaluation. While training both models, we assume that only the training dataset built earlier (80% of the data) is available.

In early stopping, one monitors the performance of the model on a validation dataset (which is separated from the training dataset) throughout training. Then, the model with the lowest loss on the validation dataset, typically found in the earlier iterations of training, is selected, rather than the model with the lowest training loss.




Do this by calling `get_mlp` and setting `early_stopping=True`, `validation_fraction=0.3`. Keep all other parameters at their default values. This will create a classifier that automatically splits the original training set into nonoverlapping training and validation splits, where the validation split is 30% of the original training set.    

- Compare this classifier against the same model trained without early stopping.

- Fit each classifier on the training dataset and report its training and test accuracies.

- Also, plot the training loss and validation accuracy curves separately for the classifier trained with early stopping. Hint: use the `validation_scores_` attribute (analogous to `loss_curve_`) to plot the validation accuracy.
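The attributes mentioned in the hint can be inspected with a sketch like the one below, run on synthetic data. The names `X_demo`, `y_demo`, and `es_clf`, as well as the layer size, `n_iter_no_change=10`, and `max_iter=500`, are assumptions for a quick demo and differ from the `get_mlp` defaults. As far as I know, scikit-learn's early stopping tracks the validation score (accuracy for classifiers) rather than the validation loss.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data (hypothetical), not the diabetes dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# early_stopping=True makes fit() set aside validation_fraction of the
# data it receives and monitor the score on that held-out part.
es_clf = MLPClassifier(hidden_layer_sizes=(20,), solver='sgd',
                       early_stopping=True, validation_fraction=0.3,
                       n_iter_no_change=10, max_iter=500, random_state=0)
es_clf.fit(X_demo, y_demo)

# validation_scores_ records the held-out accuracy after each epoch, and
# best_validation_score_ is the largest value in that list.
print(len(es_clf.loss_curve_), len(es_clf.validation_scores_))
print(es_clf.best_validation_score_)
```

Both lists have one entry per epoch, so they can be plotted against the same x-axis.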

In [None]:
# TODO

#### Explain the plot and any change in the train and test performance compared to the classifier trained without early stopping.

Insert answer

### Exercise 4: L2 Regularization in Neural Networks (3 points)

As discussed, some of the most commonly used methods to avoid overfitting in neural networks are early stopping and dropout. The approach of avoiding overfitting by modifying the learning algorithm so that it generates models that are stable with respect to changes in the input is generally known as **regularization**.

**L2 regularization**, also known as weight decay, is a technique used to prevent overfitting in neural networks. It works by adding a penalty term to the loss function that discourages large weights. The `alpha` parameter in scikit-learn's MLPClassifier controls the strength of this L2 regularization.
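One way to see the "discourages large weights" effect is to compare the overall weight norm of two otherwise identical networks trained with very different `alpha` values. The sketch below does this on synthetic data. The helper `weight_norm`, the data, and the extreme value `alpha=50.0` are assumptions picked to make the contrast obvious, not values suggested by the exercise.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data (hypothetical), not the diabetes dataset.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

def weight_norm(alpha):
    """Train a small MLP and return the L2 norm of all its weight matrices."""
    clf = MLPClassifier(hidden_layer_sizes=(20,), alpha=alpha,
                        max_iter=500, random_state=0)
    clf.fit(X_demo, y_demo)
    return np.sqrt(sum(np.sum(w ** 2) for w in clf.coefs_))

norm_weak = weight_norm(0.0)     # no L2 penalty
norm_strong = weight_norm(50.0)  # deliberately heavy L2 penalty
print(norm_weak, norm_strong)
```

The heavily penalized network should end up with a noticeably smaller weight norm, usually at some cost in training accuracy.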

For this exercise, we'll use `sklearn.neural_network.MLPClassifier` to explore the effects of L2 regularization.

In [None]:
def train_model_with_regularization(hidden_layer_sizes, alpha, max_iter=1000):
    model = MLPClassifier(hidden_layer_sizes=hidden_layer_sizes,
                          alpha=alpha,  # alpha parameter controls the amount of L2 regularization
                          max_iter=max_iter,
                          random_state=random_state)
    model.fit(X_train, y_train)

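    # score() returns the mean accuracy on the given data and labels
    train_acc = model.score(X_train, y_train)
    val_acc = model.score(X_val, y_val)
    # The original validation set doubles as the test set here, so test_acc equals val_acc
    test_acc = model.score(X_val, y_val)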

    return train_acc, val_acc, test_acc, model.loss_curve_

#### Train models with different alpha values $\alpha \in \{0.001, 0.01, 0.1, 0.2, 0.5\}$ and plot the training loss curves for the different alpha values.  `hidden_layer_sizes` can be set to (50, 5).

In [None]:
# TODO

#### Explain the effect of different alpha values on the model's performance and generalization ability. How does increasing `alpha` (which increases L2 regularization) affect the accuracies?

Insert answer