## **HW3 Problem 2 (15 points): Artificial Neural Networks [TA: Sogol Mansouri]**


### 1) Neural Network Playground

First, go to TensorFlow's [Neural Network Playground](https://playground.tensorflow.org/). This website is an interactive, exploratory visualization of how the features, number of layers, training time, etc. influence the classification boundaries of an ANN. For now, we'll only concern ourselves with *classification* problems.

Play with the visualization, and then answer the questions below.

#### Scenarios

1. Using the default network topology, try training the network with the different activation functions (ReLU, Tanh, Sigmoid, Linear). What effect does the activation function have on the training time? What effect does the activation function have on the shape of the classification boundaries?
2. Take a look at [this setup](https://playground.tensorflow.org/#activation=tanh&batchSize=10&dataset=xor&regDataset=reg-plane&learningRate=0.03&regularizationRate=0&noise=0&networkShape=2,2&seed=0.21855&showTestData=false&discretize=false&percTrainData=50&x=true&y=true&xTimesY=false&xSquared=false&ySquared=false&cosX=false&sinX=false&cosY=false&sinY=false&collectStats=false&problem=classification&initZero=false&hideText=false). Train until the classification boundary converges. This is one of the rare cases where the nodes in an ANN can be (semi) interpreted. What do the nodes in the first hidden layer represent? What about the second hidden layer? How do you think the ANN uses these learned "features" to make a decision?

#### Exploration
For each of the following questions:
* Make a prediction before you begin exploring and testing.
* Include a link to your scenario.
* Explain why you think this scenario has this property.

**Questions**

3. Find a scenario where a simple model (fewer neurons) outperforms a complex model. (Think about overfitting.)
4. Find a scenario where a model with no hidden layers performs well.
5. Find a scenario where a model with no hidden layers performs poorly no matter the features.
6. Find a scenario where it takes a lot of training time to get a correct solution.

**1. Effect of Activation Functions**

Training time:

* ReLU: Generally trains fastest, because it is cheap to compute and does not saturate for positive inputs. However, it can suffer from the "dying ReLU" problem, where a neuron whose input is always negative outputs zero and stops learning.
* Tanh: Training time is moderate. It saturates for inputs of large magnitude, which slows training because the gradients become very small.
* Sigmoid: Typically trains slowest. It saturates for both high and low inputs, and its gradient is at most 0.25 even at its steepest point, so gradients tend to vanish.
* Linear: Introduces no non-linearity, so the whole network collapses to a single linear model. It trains quickly on linearly separable data but never reaches a good solution on non-linear datasets.

Shape of classification boundaries:

* ReLU: Produces piecewise linear boundaries, which can form complex polygonal shapes when several neurons and layers are combined.
* Tanh: Produces smooth, curved boundaries that can follow more complex patterns.
* Sigmoid: Similar to Tanh, with smooth, rounded boundaries, but it often needs more training before the boundary takes shape.
* Linear: Always produces a single straight-line boundary, which is sufficient only for linearly separable data.
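The saturation argument above can be made concrete by evaluating the derivative of each activation function at a few inputs. The following is a minimal sketch in plain Python; the helper names (`d_sigmoid`, `d_tanh`, `d_relu`, `d_linear`) are illustrative and not part of the Playground or the assignment.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Derivative of each activation with respect to its input z.
def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # peaks at 0.25 when z = 0

def d_tanh(z):
    return 1.0 - math.tanh(z) ** 2  # peaks at 1.0 when z = 0

def d_relu(z):
    return 1.0 if z > 0 else 0.0    # constant 1 for positive inputs, 0 otherwise

def d_linear(z):
    return 1.0                      # never saturates, but adds no non-linearity

derivatives = {"sigmoid": d_sigmoid, "tanh": d_tanh, "relu": d_relu, "linear": d_linear}

# Print the gradient each activation passes back at a few input values.
for z in (-5.0, 0.0, 5.0):
    row = ", ".join(f"{name}={fn(z):.4f}" for name, fn in derivatives.items())
    print(f"z={z:+.1f}: {row}")
```

For inputs of large magnitude the sigmoid and tanh derivatives are close to zero, which is the vanishing-gradient effect described above, while ReLU keeps a derivative of 1 for any positive input and 0 for any negative one.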
**2. Interpretation of Nodes in the XOR Setup**

In the XOR setup, each node in the first hidden layer computes a tanh of a weighted sum of the two inputs, so it represents a soft linear split of the plane: the node is positive on one side of a straight line and negative on the other. For example, one node might separate the upper-right corner from the rest of the plane, while the other separates the lower-left corner.

The second hidden layer combines these half-plane features into more complex regions. Essentially, the first layer learns simple linear patterns, and the second layer combines them (for example, "beyond either of the two lines") to create the final decision boundary that classifies the XOR problem correctly.
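To illustrate how half-plane features can be combined to solve XOR, here is a small hand-built network. It is a simplified sketch, not the trained Playground model: it uses a single hidden layer of two tanh nodes rather than the two hidden layers in the linked setup, and the weights are hand-picked for illustration.

```python
import math

# Four representative points of an XOR-style dataset:
# the label is +1 when x1 and x2 have the same sign, -1 otherwise.
points = [((1, 1), 1), ((-1, -1), 1), ((1, -1), -1), ((-1, 1), -1)]

def hidden(x1, x2):
    # Each hidden node is a soft half-plane detector (hand-picked weights).
    h1 = math.tanh(5 * (x1 + x2 - 1))    # positive only toward the upper-right corner
    h2 = math.tanh(5 * (-x1 - x2 - 1))   # positive only toward the lower-left corner
    return h1, h2

def predict(x1, x2):
    h1, h2 = hidden(x1, x2)
    score = h1 + h2 + 1  # positive if either corner detector is active
    return 1 if score > 0 else -1

for (x1, x2), label in points:
    print((x1, x2), "hidden:", hidden(x1, x2), "predicted:", predict(x1, x2), "true:", label)
```

Neither hidden node separates the classes on its own, but the output node only has to check whether at least one of them is active, which is a linear decision in the hidden-feature space.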

Link to the Setup: XOR Neural Network

3. Scenario with Simple Model Outperforming Complex Model
Prediction: A simpler model will likely perform better when the dataset is small or has a low complexity.

Link to the Scenario: Simple vs. Complex

Explanation: In this scenario, a simpler model with fewer neurons can generalize better than a complex model, which may overfit the data. Overfitting occurs when the model learns the noise in the training data instead of the underlying pattern, leading to poor performance on unseen data.

4. Scenario Where a Model with No Hidden Layers Performs Well
Prediction: A model with no hidden layers will likely perform well on a linearly separable dataset.

Link to the Scenario: Linear Dataset

Explanation: In this case, the data is linearly separable, so a model without hidden layers (essentially a linear model) can effectively classify the data without needing complex decision boundaries.

5. Scenario Where a Model with No Hidden Layers Performs Poorly
Prediction: A model with no hidden layers will likely perform poorly on a non-linear dataset, no matter which of the raw input features it uses.

Explanation: In this scenario, the XOR dataset is non-linear, and a model without hidden layers cannot capture the complex decision boundary required to classify the data correctly.
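The contrast between the last two scenarios can be sketched with a single linear unit. The example below uses the classic perceptron update rule as a stand-in for a network with no hidden layers; this is an assumption made for brevity (the Playground trains with gradient descent instead), and the two tiny datasets are made up for illustration.

```python
def train_perceptron(data, epochs=50, lr=1.0):
    """Train a single linear unit (no hidden layers) with the perceptron rule."""
    w1 = w2 = b = 0.0
    for _ in range(epochs):
        for (x1, x2), label in data:
            pred = 1 if w1 * x1 + w2 * x2 + b > 0 else -1
            if pred != label:  # update the weights only on mistakes
                w1 += lr * label * x1
                w2 += lr * label * x2
                b += lr * label
    return w1, w2, b

def accuracy(params, data):
    w1, w2, b = params
    correct = sum(
        (1 if w1 * x1 + w2 * x2 + b > 0 else -1) == label
        for (x1, x2), label in data
    )
    return correct / len(data)

# Linearly separable: only the upper-right point is positive.
separable = [((1, 1), 1), ((-1, -1), -1), ((1, -1), -1), ((-1, 1), -1)]
# XOR-style: points whose coordinates share a sign are positive.
xor = [((1, 1), 1), ((-1, -1), 1), ((1, -1), -1), ((-1, 1), -1)]

acc_separable = accuracy(train_perceptron(separable), separable)
acc_xor = accuracy(train_perceptron(xor), xor)
print("separable accuracy:", acc_separable)
print("xor accuracy:", acc_xor)
```

A straight line can classify at most three of the four XOR points correctly, so no amount of training lets the linear unit exceed 75% accuracy there, while it fits the separable data perfectly.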

6. Scenario Where Training Takes a Lot of Time
Prediction: A complex dataset with a large number of features will likely require a lot of training time.

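Explanation: In this scenario, the decision boundary required to classify the data correctly is highly non-linear, so the network needs many epochs of small gradient updates before it converges. Note that every Playground dataset has only two raw inputs (x1 and x2), so the long training time comes from the complexity of the boundary rather than from a large number of features.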

## 2) Training and Testing a Neural Network (Group)

For this problem, you'll be looking at a subset of the [UCI ML hand-written digits datasets](https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits), which contains images of hand-written digits: 10 classes where each class refers to a digit.

Each data entry is an 8x8 input matrix in which each element is an integer in the range 0..16. The matrix is flattened into a vector of 64 values in the dataset.


For this question, **you have enough experience to do the entire model pipeline yourself**. That means *loading the data, creating splits, scaling the data, training and tuning the model, and evaluating the model.*

In [None]:
#Import necessary libraries
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

random_state = 42

### Step 1: Load the data. Use `np.unique()` to check the class balance.

In [None]:
from sklearn.datasets import load_digits
df = load_digits()

In [None]:
df.data

array([[ 0.,  0.,  5., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ..., 10.,  0.,  0.],
       [ 0.,  0.,  0., ..., 16.,  9.,  0.],
       ...,
       [ 0.,  0.,  1., ...,  6.,  0.,  0.],
       [ 0.,  0.,  2., ..., 12.,  0.,  0.],
       [ 0.,  0., 10., ..., 12.,  1.,  0.]])

In [None]:
# Get a distribution of the class label (target)
np.unique(df.target, return_counts=True)

(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
 array([178, 182, 177, 183, 181, 182, 181, 179, 174, 180]))

In [None]:
# Check class balance
unique_classes, class_counts = np.unique(df.target, return_counts=True)
class_balance = dict(zip(unique_classes, class_counts))
print("Class balance:", class_balance)

Class balance: {0: 178, 1: 182, 2: 177, 3: 183, 4: 181, 5: 182, 6: 181, 7: 179, 8: 174, 9: 180}


### Step 2: Split the data into X (features) and Y (class)

Assign the variables below to split the dataset into X (features) and Y (target).

In [None]:
# Assign features (X) and target (Y)
X = df.data    # Features
Y = df.target  # Target

# Work with a subset: keep only the first 200 samples
X = X[:200]
Y = Y[:200]

### Step 3: Create your train/test split. Use the provided random_state.

**Note**: You should use a `train_size` of 0.8, or 80%, and make sure to use the `random_state` to ensure test cases work.

In [None]:
from sklearn.model_selection import train_test_split

# Create an 80/20 train/test split using the provided random_state
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.8, random_state=random_state)

In [None]:
assert X_train.shape == (160, 64)
assert Y_train.shape == (160,)
assert X_test.shape == (40, 64)
assert Y_test.shape == (40,)

### Step 4: Use a [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) to normalize the image data.

Pixel data, like other data we've encountered, should often be scaled before classification. While in practice scaling image data can be more complex, in this exercise we'll continue to use the [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html).

Fit the scaler only on the training X features, and then apply it to both the training and test X features. We do this because in practice we wouldn't be able to see the test data, so it shouldn't affect the feature transformation. We therefore use only X_train to fit the transformation.

In [None]:
from sklearn.preprocessing import StandardScaler
# Initialize the StandardScaler
scaler = StandardScaler()

# Fit the scaler on the training data
scaler.fit(X_train)

# Transform both the training and test data
X_stand_train = scaler.transform(X_train)
X_stand_test = scaler.transform(X_test)

In [None]:
X_stand_train.shape

(160, 64)

In [None]:
# Go through each attribute
for i in range(X_stand_train.shape[1]):
    # Calculate the mean of that attribute: it should be 0
    np.testing.assert_almost_equal(np.mean(X_stand_train[:, i]), 0)
    # Calculate the standard deviation: it should be 1
    std = np.std(X_stand_train[:, i])
    # However, if the std was already 0, standardization won't change that,
    # so skip this case
    if abs(std) < 0.01:
        continue
    np.testing.assert_almost_equal(std, 1)
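To make the "fit on train, apply to both" rule concrete, the following sketch standardizes a single pixel column by hand in plain Python. It assumes the scaler uses the population standard deviation and leaves zero-variance columns unscaled, which is consistent with the check above; the helper names and the toy pixel values are made up for illustration.

```python
from statistics import fmean, pstdev

def fit_scaler(train_column):
    """Learn the mean and standard deviation from the training data only."""
    mean = fmean(train_column)
    std = pstdev(train_column)             # population std (divide by n)
    return mean, (std if std > 0 else 1.0)  # constant columns: divide by 1

def transform(column, mean, std):
    """Apply previously learned statistics to any data (train or test)."""
    return [(value - mean) / std for value in column]

# Toy values for one pixel position (hypothetical, in the 0..16 range).
train_pixel = [0.0, 4.0, 8.0, 12.0, 16.0]
test_pixel = [2.0, 16.0]

mean, std = fit_scaler(train_pixel)              # statistics come from train only
train_scaled = transform(train_pixel, mean, std)
test_scaled = transform(test_pixel, mean, std)   # test reuses the train statistics

print("train mean/std after scaling:", fmean(train_scaled), pstdev(train_scaled))
print("test values after scaling:", test_scaled)
```

The scaled training column has mean 0 and standard deviation 1 by construction, while the scaled test column generally does not, because it was transformed with statistics it did not contribute to.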

### Step 5:  Train an MLP with default hyperparameters.

For the following, you'll be using sklearn's built in Multi-layer Perceptron classifier [MLPClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html).

Use the default hyperparameters aside from `max_iter`. `max_iter` is the maximum number of training iterations the ANN goes through before training is stopped, even if it has not converged. The default `max_iter=200` takes too long for our purposes.

**Use `random_state` as the random state and `max_iter=20`**. The default parameters use a single hidden layer.



In [None]:
from sklearn.neural_network import MLPClassifier

In [None]:
# Initialize the MLPClassifier with max_iter=20 and the provided random_state
clf = MLPClassifier(max_iter=20, random_state=random_state, verbose=True)

# Fit the model on the standardized training data
clf.fit(X_stand_train, Y_train)

Iteration 1, loss = 2.72340110
Iteration 2, loss = 2.61722937
Iteration 3, loss = 2.51331744
Iteration 4, loss = 2.41173375
Iteration 5, loss = 2.31250102
Iteration 6, loss = 2.21554103
Iteration 7, loss = 2.12100080
Iteration 8, loss = 2.02899668
Iteration 9, loss = 1.93964810
Iteration 10, loss = 1.85290300
Iteration 11, loss = 1.76901269
Iteration 12, loss = 1.68775698
Iteration 13, loss = 1.60922366
Iteration 14, loss = 1.53361925
Iteration 15, loss = 1.46082174
Iteration 16, loss = 1.39076614
Iteration 17, loss = 1.32358090
Iteration 18, loss = 1.25908195
Iteration 19, loss = 1.19745453
Iteration 20, loss = 1.13851469


### Step 6:  Evaluate the model on the test dataset using a confusion matrix and a classification report

Like all classifiers, the MLP has a [`predict`](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier.predict) function that is used to make predictions on training or test data.

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

In [None]:
# Make predictions on the standardized test data
Y_pred = clf.predict(X_stand_test)

# Confusion matrix of the evaluation: rows are true digits, columns are predicted digits
mlp_cm = confusion_matrix(Y_test, Y_pred)
mlp_cm

array([[4, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 4, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 4, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 2, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 3, 0, 0, 0, 0],
       [2, 0, 2, 0, 0, 1, 1, 0, 0, 2],
       [0, 0, 0, 1, 0, 0, 0, 2, 0, 0],
       [0, 0, 0, 0, 1, 2, 0, 0, 0, 2],
       [0, 0, 1, 0, 0, 1, 0, 0, 1, 1]])

In [None]:
np.testing.assert_almost_equal(mlp_cm, [[4, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 4, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 4, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 2, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 3, 0, 0, 0, 0],
       [2, 0, 2, 0, 0, 1, 1, 0, 0, 2],
       [0, 0, 0, 1, 0, 0, 0, 2, 0, 0],
       [0, 0, 0, 0, 1, 2, 0, 0, 0, 2],
       [0, 0, 1, 0, 0, 1, 0, 0, 1, 1]])

In [None]:
# Similarly generate a classification report for the test dataset
mlp_clf_report = classification_report(Y_test, Y_pred)
#TODO: add your code here
print(mlp_clf_report)

              precision    recall  f1-score   support

           0       0.67      1.00      0.80         4
           1       1.00      1.00      1.00         4
           2       0.57      0.80      0.67         5
           3       0.50      0.50      0.50         2
           4       0.67      1.00      0.80         2
           5       0.43      1.00      0.60         3
           6       1.00      0.12      0.22         8
           7       1.00      0.67      0.80         3
           8       0.00      0.00      0.00         5
           9       0.17      0.25      0.20         4

    accuracy                           0.55        40
   macro avg       0.60      0.63      0.56        40
weighted avg       0.62      0.55      0.50        40



In [None]:
assert mlp_clf_report == classification_report(Y_test, clf.predict(X_stand_test))
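The numbers in the test classification report can be recomputed directly from the confusion matrix, which helps when interpreting them. The sketch below copies the matrix printed above into a plain list and derives accuracy, per-digit precision and recall, and the most frequent confusions; the helper name `per_class_metrics` is illustrative.

```python
# Test confusion matrix from above: rows are true digits, columns are predicted digits.
mlp_cm = [
    [4, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 4, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 4, 0, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 2, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 3, 0, 0, 0, 0],
    [2, 0, 2, 0, 0, 1, 1, 0, 0, 2],
    [0, 0, 0, 1, 0, 0, 0, 2, 0, 0],
    [0, 0, 0, 0, 1, 2, 0, 0, 0, 2],
    [0, 0, 1, 0, 0, 1, 0, 0, 1, 1],
]

def per_class_metrics(cm, k):
    """Return (precision, recall) for class k of a square confusion matrix."""
    true_positives = cm[k][k]
    predicted_k = sum(row[k] for row in cm)  # column sum: everything predicted as k
    actual_k = sum(cm[k])                    # row sum: everything that truly is k
    precision = true_positives / predicted_k if predicted_k else 0.0
    recall = true_positives / actual_k if actual_k else 0.0
    return precision, recall

n_classes = len(mlp_cm)
total = sum(sum(row) for row in mlp_cm)
accuracy = sum(mlp_cm[k][k] for k in range(n_classes)) / total
print(f"accuracy = {accuracy:.2f}")

for k in range(n_classes):
    precision, recall = per_class_metrics(mlp_cm, k)
    print(f"digit {k}: precision = {precision:.2f}, recall = {recall:.2f}")

# Off-diagonal cells with a count of 2 or more: (true digit, predicted digit).
confusions = [
    (t, p) for t in range(n_classes) for p in range(n_classes)
    if t != p and mlp_cm[t][p] >= 2
]
print("most frequent confusions:", confusions)
```

Reading the matrix this way shows, for example, that digit 6 has perfect precision but very low recall: the single test image predicted as 6 really is a 6, but the other seven 6s were assigned to other digits.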

In [None]:
# Make predictions on the standardized training data
Y_train_pred = clf.predict(X_stand_train)

# Generate the classification report for the training dataset
mlp_clf_report = classification_report(Y_train, Y_train_pred)

#TODO: add your code here
print(mlp_clf_report)

              precision    recall  f1-score   support

           0       0.89      0.94      0.91        17
           1       0.93      0.87      0.90        15
           2       0.67      0.67      0.67        15
           3       1.00      0.95      0.97        19
           4       0.82      0.82      0.82        17
           5       0.89      0.94      0.91        17
           6       0.92      0.92      0.92        13
           7       0.94      1.00      0.97        17
           8       0.80      0.29      0.42        14
           9       0.67      1.00      0.80        16

    accuracy                           0.85       160
   macro avg       0.85      0.84      0.83       160
weighted avg       0.86      0.85      0.84       160



In [None]:
assert mlp_clf_report == '              precision    recall  f1-score   support\n\n           0       0.89      0.94      0.91        17\n           1       0.93      0.87      0.90        15\n           2       0.67      0.67      0.67        15\n           3       1.00      0.95      0.97        19\n           4       0.82      0.82      0.82        17\n           5       0.89      0.94      0.91        17\n           6       0.92      0.92      0.92        13\n           7       0.94      1.00      0.97        17\n           8       0.80      0.29      0.42        14\n           9       0.67      1.00      0.80        16\n\n    accuracy                           0.85       160\n   macro avg       0.85      0.84      0.83       160\nweighted avg       0.86      0.85      0.84       160\n'


How well did the classifier do? What digit did it do best on? Which digits did it confuse the most? Do you think the classifier is likely over-fitting, underfitting or neither?

Based on the confusion matrix and the two classification reports above:

The classifier does only moderately well on unseen data: test accuracy is 55% (22 of 40 digits correct), compared with 85% on the training data.

On the test set it does best on digit 1 (precision and recall both 1.00). Digits 0, 4, and 5 also have perfect recall, but lower precision because other digits are misclassified as them. It does worst on digit 8 (none of the 5 test examples are recognized) and digit 6 (recall 0.12, only 1 of 8 correct).

The most frequent confusions are true 6s predicted as 0, 2, or 9 (two each) and true 8s predicted as 5 or 9 (two each). Collecting more training data for these digits, or using techniques like data augmentation, may help.

To judge overfitting or underfitting, compare the training metrics with the test metrics. Here the 30-point gap between training accuracy (85%) and test accuracy (55%) suggests the model is overfitting the 160 training examples, although with only 40 test examples the per-digit numbers are noisy. The loss was also still decreasing steadily when training stopped at `max_iter=20`, so the network had not yet converged.

## 3) Hyperparameters

**Hyperparams**:

ANNs have *a lot* of hyperparams. This can include simple things such as the number of layers and nodes, up to tuning the learning rate and the gradient descent algorithm used.

This process can require a lot of experimentation and intuition gained through experience, but it can be automated to some extent using hyperparameter tuning. When we have multiple hyperparameters, we use an approach called grid search, where we try all combinations of various hyperparameter values to find the combination that works best.

For the following, you will practice hyperparameter tuning for the [MLPClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html) with sklearn's [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) class. You should explore different combinations of the following parameters:

* `activation`: The activation function of the ANN. Defaults to ReLU.
* `max_iter`: The ANN will keep training until either the loss stops improving by a specified threshold or `max_iter` iterations are reached. Warning: the more you increase this, the longer training will take! Patience is a virtue.
* `hidden_layer_sizes`: A tuple representing the structure of the hidden layers. For example, giving the tuple `(100,50)` means that there's two hidden layers: the first being of size 100, and the second being of size 50. The tuple (100,) would mean a single hidden layer of size 100.
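To get a feel for how `hidden_layer_sizes` affects model size, the sketch below counts the trainable weights and biases of a fully connected network. It assumes 64 inputs (the flattened 8x8 image) and 10 outputs (one per digit); the function name `count_parameters` is illustrative and not part of sklearn.

```python
def count_parameters(n_inputs, hidden_layer_sizes, n_outputs):
    """Count weights and biases in a fully connected network."""
    sizes = [n_inputs, *hidden_layer_sizes, n_outputs]
    # Each layer has fan_in * fan_out weights plus one bias per output unit.
    return sum((fan_in + 1) * fan_out for fan_in, fan_out in zip(sizes, sizes[1:]))

for shape in [(100, 50), (100,), (50,), (20,)]:
    print(shape, "->", count_parameters(64, shape, 10), "parameters")
```

Even the smallest network here has more trainable parameters than the 160 training images, which is worth keeping in mind when thinking about overfitting.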

Normally we would try many more possible combinations (and larger networks), but we've kept the list short to reduce computation time.

**Try different combinations of these hyperparameters and see how they affect the classification scores of your model.**

In [None]:
# import the library
from sklearn.model_selection import GridSearchCV

In [None]:
# The parameter list you will explore
parameters = {'activation':['logistic', 'relu'], 'max_iter':[5, 10], 'hidden_layer_sizes':[(50,),(20,)]}

Now it's your turn. First initialize an MLPClassifier, making sure to **use "random_state" as the random state**. Then feed the parameter list defined above, together with the training data (**use "X_stand_train"**), to GridSearchCV to create a classifier with the best combination of the parameters. To do so, it uses cross-validation within the training dataset, so you never have to peek at your test dataset. It then fits the final classifier to the whole standardized training dataset.

**Note**: You should use cv=2 in your grid search, to reduce the number of folds tested.
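To see how much work the grid search does, the sketch below enumerates every combination in the parameter list using only the standard library. It assumes candidates are enumerated with the parameter names in sorted order, which is my understanding of how sklearn orders them, and that the default `refit=True` adds one final fit; treat both as assumptions rather than guarantees.

```python
from itertools import product

# Same parameter list as above.
parameters = {
    'activation': ['logistic', 'relu'],
    'max_iter': [5, 10],
    'hidden_layer_sizes': [(50,), (20,)],
}

# Enumerate every combination, with parameter names in sorted order.
names = sorted(parameters)
combinations = [
    dict(zip(names, values))
    for values in product(*(parameters[name] for name in names))
]

cv = 2
n_fits = len(combinations) * cv + 1  # one fit per fold, plus the final refit

for index, combo in enumerate(combinations):
    print(index, combo)
print("candidates:", len(combinations), "total fits:", n_fits)
```

The 8 candidates match the 8 entries in `rank_test_score` checked below, and adding one more value to any parameter multiplies the number of candidates, which is why grids are kept small when training is slow.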

In [26]:
# Assign clf to the optimized (with grid search) MLP model
# Initialize the MLPClassifier with the provided random_state
mlp = MLPClassifier(random_state=random_state, verbose=True)

# Set up the grid search over the parameter list, with 2-fold cross-validation
grid_search = GridSearchCV(estimator=mlp, param_grid=parameters, cv=2, scoring='accuracy')

# Fit the grid search on the standardized training data; with the default
# refit=True, the best combination is then refit on the whole training set
grid_search.fit(X_stand_train, Y_train)

# Assign the fitted grid search to clf; clf.predict delegates to clf.best_estimator_
clf = grid_search

Iteration 1, loss = 2.40983495
Iteration 2, loss = 2.38814830
Iteration 3, loss = 2.36706680
Iteration 4, loss = 2.34659747
Iteration 5, loss = 2.32674102
Iteration 1, loss = 2.40170037
Iteration 2, loss = 2.37934387
Iteration 3, loss = 2.35759411
Iteration 4, loss = 2.33646062
Iteration 5, loss = 2.31594420
Iteration 1, loss = 2.40983495
Iteration 2, loss = 2.38814830
Iteration 3, loss = 2.36706680
Iteration 4, loss = 2.34659747
Iteration 5, loss = 2.32674102
Iteration 6, loss = 2.30748910
Iteration 7, loss = 2.28882764
Iteration 8, loss = 2.27073754
Iteration 9, loss = 2.25319408
Iteration 10, loss = 2.23616722
Iteration 1, loss = 2.40170037
Iteration 2, loss = 2.37934387
Iteration 3, loss = 2.35759411
Iteration 4, loss = 2.33646062
Iteration 5, loss = 2.31594420
Iteration 6, loss = 2.29603676
Iteration 7, loss = 2.27672250
Iteration 8, loss = 2.25797952
Iteration 9, loss = 2.23978102
Iteration 10, loss = 2.22209640
Iteration 1, loss = 2.38036297
Iteration 2, loss = 2.37023213
[output truncated]

In [27]:
# Now let's see the parameters of the winning model of our grid search
# This model is the one clf actually uses when you call clf.predict
clf.best_estimator_

In [28]:
assert list(clf.cv_results_['rank_test_score']) == [3, 2, 8, 6, 4, 1, 6, 5]
np.testing.assert_almost_equal(round(clf.best_score_,4), 0.2312)
assert clf.best_params_['hidden_layer_sizes'] == (50,)
assert clf.best_index_ == 5


In [29]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

Now you will use the estimator with the best parameters found to generate predictions (stored as "Y_pred") on the test dataset. **Remember to use "X_stand_test"**.

In [34]:
Y_pred = clf.predict(X_stand_test)
#TODO: add your code here


In [35]:
assert list(confusion_matrix(Y_test,Y_pred)[0]) == [0, 0, 0, 0, 1, 0, 1, 0, 0, 2]

In [36]:
print(classification_report(Y_test,Y_pred))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00         4
           1       0.67      1.00      0.80         4
           2       0.20      0.20      0.20         5
           3       0.00      0.00      0.00         2
           4       0.00      0.00      0.00         2
           5       1.00      0.67      0.80         3
           6       0.33      0.12      0.18         8
           7       0.00      0.00      0.00         3
           8       0.20      0.20      0.20         5
           9       0.08      0.25      0.12         4

    accuracy                           0.25        40
   macro avg       0.25      0.24      0.23        40
weighted avg       0.27      0.25      0.24        40



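Note that in this toy example, we used a very limited set of hyperparameters to reduce training time (in particular, `max_iter` values of only 5 and 10, fewer than the 20 iterations used in Step 5), and so our tuned model actually does worse than our original: 25% test accuracy versus 55%. However, in practice, with a richer grid, the tuned model will generally generalize better to the test dataset.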