*   Student name: **[Fill your name here, double-click to edit]**
*   Student Panther ID: **[Fill your Panther ID here]** 
*   Collaborator(s): **[Fill your collaborator(s)' name here]**
*   **Notice on Academic Misconduct**: Sharing your code with other students is also academic misconduct. If your submission is found to be unusually similar to that of another student, you will be reported to the SCAI as a potential academic misconduct case, regardless of your reasons. Violations may lead to suspension or expulsion from the university.

# CAP5602 Homework 4 (15% total grade)

## **Deadline: 11/5/2022 11:59 PM**

In this homework, we will train, test, and visualize logistic regression and multi-layer perceptron models on a toy non-linearly separable classification dataset.

## 1. Generate and visualize the data (2% total grade)
In this question, you will write code to generate and visualize the dataset. First, study the following API for the function `make_circles` to generate the dataset: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_circles.html.

Use this function to generate a dataset `(X, Y)` with **250 samples**, **noise=0.06**, and **factor=0.5**. Then plot the dataset using a scatter plot. As a sanity check, your plot should show two noisy concentric circles (one for each class).
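
To build intuition for the `factor` parameter before adding noise, the following minimal sketch generates a noise-free version of the dataset and measures the radius of each class. The names `X_clean`, `Y_clean`, `outer_radius`, and `inner_radius` are illustrative and not part of the assignment.

```python
import numpy as np
from sklearn.datasets import make_circles

# Noise-free circles: every point lies exactly on one of the two circles
X_clean, Y_clean = make_circles(n_samples=250, noise=None, factor=0.5, random_state=0)

# Distance of each point from the origin
radii = np.linalg.norm(X_clean, axis=1)

# Label 0 is the outer circle, label 1 is the inner circle
outer_radius = radii[Y_clean == 0].mean()
inner_radius = radii[Y_clean == 1].mean()
print(outer_radius, inner_radius)
```

With `factor=0.5`, the inner circle has half the radius of the outer one. Setting `noise=0.06` then adds Gaussian noise with that standard deviation to every point.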

In [None]:
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles

# Generate the dataset
X, Y = make_circles(n_samples=250, noise=.06, factor=.5, random_state=10)

# Plot the dataset
plt.scatter(X[:, 0], X[:, 1], c=Y, cmap=plt.cm.coolwarm, s=50, edgecolors='k')
plt.show()

## 2. Split dataset (1% total grade)

Write code to randomly split your dataset above into a train set and a test set. Your train set should contain 150 examples and your test set should contain 100 examples.
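
As a small illustration of how the split sizes work, the sketch below uses placeholder arrays (`X_demo` and `Y_demo` are hypothetical stand-ins, not the homework data). Passing an integer `test_size` to `train_test_split` requests an absolute number of test examples. The `stratify` argument is optional and not required by this question; it is shown here only as one way to keep the class balance equal in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 250 two-dimensional points with balanced binary labels
X_demo = np.arange(500).reshape(250, 2)
Y_demo = np.arange(250) % 2

# An integer test_size is an absolute count, so 250 - 100 = 150 examples remain for training
Xtr, Xte, Ytr, Yte = train_test_split(
    X_demo, Y_demo, test_size=100, random_state=0, stratify=Y_demo
)
print(Xtr.shape, Xte.shape)
print(np.bincount(Yte))  # class counts in the test set
```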

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=100, random_state=42)

print(X_train.shape, Y_train.shape, X_test.shape, Y_test.shape)

## 3. Train and evaluate a logistic regression model (2% total grade)

Write code to train a [logistic regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) model with default parameters using your train set. Then compute and print out the accuracy of the model on your test set.
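
For reference, accuracy is the fraction of test examples whose predicted label matches the true label. The sketch below checks this on made-up labels (`y_true` and `y_pred` are hypothetical values, not results from this homework).

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels: 4 of the 5 predictions are correct
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

manual_acc = float(np.mean(y_true == y_pred))  # fraction of matching labels
library_acc = float(accuracy_score(y_true, y_pred))
print(manual_acc, library_acc)  # both are 0.8
```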

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Train the model
model = LogisticRegression()
model.fit(X_train, Y_train)

# Compute and print test accuracy
Y_pred = model.predict(X_test)
acc = accuracy_score(Y_test, Y_pred)
print('Accuracy on test set:', acc)

## 4. Visualize your model (3% total grade)

Write code to visualize your logistic regression model. You must:
1.   Create a meshgrid on the 2d space covering all your input data.
2.   Make predictions on the meshgrid, reshape it appropriately, and plot the contours of the predictions together with all the data points. Since logistic regression can return label probability distributions, you must use [predict_proba](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict_proba) to make predictions and visualize the probability contours for one of the labels.

**Important**: Since you will need to visualize several models later, you should write the above steps into a function `visualize(model, X, Y)` that can be applied to any model `model` and dataset `(X, Y)`. Then call this function to visualize your logistic regression model.
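
The reshaping in step 2 can be easy to get wrong, so here is a minimal sketch of the shapes involved on a tiny grid. The `scores` array is a stand-in for `model.predict_proba(...)[:, 1]`; no model is trained here.

```python
import numpy as np

xs = np.linspace(0.0, 1.0, 3)  # 3 points along the first axis
ys = np.linspace(0.0, 2.0, 4)  # 4 points along the second axis
xx, yy = np.meshgrid(xs, ys)   # both arrays have shape (4, 3)

# Flatten the grid into one row per point, the input layout a classifier expects
grid_points = np.c_[xx.ravel(), yy.ravel()]  # shape (12, 2)

# Stand-in for model.predict_proba(grid_points)[:, 1]
scores = grid_points.sum(axis=1)

# Reshape back to the grid so that Z[i, j] belongs to the point (xs[j], ys[i])
Z = scores.reshape(xx.shape)
print(xx.shape, grid_points.shape, Z.shape)
```

Because `Z` has the same shape as `xx` and `yy`, it can be passed directly to `plt.contourf(xx, yy, Z)`.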

 


In [None]:
import numpy as np
import matplotlib.pyplot as plt

def visualize(model, X, Y):
    X0, X1 = X[:, 0], X[:, 1]

    # Find the range of the 2 dimensions that we will plot
    X0_min, X0_max = X0.min() - 0.1, X0.max() + 0.1
    X1_min, X1_max = X1.min() - 0.1, X1.max() + 0.1
    n_steps = 100 # Number of steps on each axis

    # Create a meshgrid with n_steps evenly spaced points on each axis
    # (np.linspace avoids the length surprises of np.arange with a float step)
    xx, yy = np.meshgrid(np.linspace(X0_min, X0_max, n_steps),
                         np.linspace(X1_min, X1_max, n_steps))

    # Make predictions on the meshgrid
    Z = model.predict_proba(np.c_[xx.ravel(), yy.ravel()])
    Z1 = Z[:, 1] # Here we use the second column of the predictions, which corresponds to the label 1.
    Z1 = Z1.reshape(xx.shape)

    # Plot the contours of model predictions
    plt.contourf(xx, yy, Z1, cmap=plt.cm.coolwarm, alpha=0.8)
    plt.scatter(X0, X1, c=Y, cmap=plt.cm.coolwarm, s=50, edgecolors='k')
    plt.xlabel('x0')
    plt.ylabel('x1')
    plt.colorbar()
    plt.show()

visualize(model, X, Y)

## 5. Train, evaluate, and visualize an MLP model on your dataset (3% total grade)

Write code to:
1. Train an MLP classifier with one hidden layer containing 20 hidden units. You can use the default value for other parameters of your model.
2. Compute and print out the accuracy of your classifier on the test set.
3. Visualize your MLP model similarly to Question 4 above, using the function `visualize(...)` that you already defined.

**Important**: Again, to reuse your code later, combine the above steps into a single function `investigate_mlp(hidden_layer_sizes)`, where `hidden_layer_sizes` is the parameter specifying the number of hidden units in each layer. See the [MLPClassifier API](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html) for details on this parameter. After writing this function, execute it with an appropriate `hidden_layer_sizes` value to train, evaluate, and visualize your model.
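
One Python detail matters when writing `hidden_layer_sizes` as a tuple: a single-element tuple needs a trailing comma. The short sketch below illustrates this; the variable names are only for illustration.

```python
one_layer = (20,)    # a tuple with one element: one hidden layer of 20 units
not_a_tuple = (20)   # just the integer 20; the parentheses have no effect
deeper = (20,) * 3   # tuple repetition: three hidden layers of 20 units each

print(type(one_layer).__name__, type(not_a_tuple).__name__, deeper)
# tuple int (20, 20, 20)
```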

In [None]:
from sklearn.neural_network import MLPClassifier

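def investigate_mlp(hidden_layer_sizes):
    # Train an MLP with the given hidden layer sizes (other parameters keep their defaults)
    model = MLPClassifier(hidden_layer_sizes=hidden_layer_sizes)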
    model.fit(X_train, Y_train)

    Y_pred = model.predict(X_test)
    acc = accuracy_score(Y_test, Y_pred)
    print('Accuracy on test set:', acc)

    visualize(model, X, Y)

# Note the trailing comma: (20,) is a one-element tuple, whereas (20) is just the integer 20
investigate_mlp(hidden_layer_sizes=(20,))

## 6. Investigate Deeper MLPs (2% total grade)

Repeat Question 5 with **4 progressively deeper** MLP models that contain 2, 3, 4 and 5 hidden layers respectively. All layers in these models must also contain 20 hidden nodes. You can use your `investigate_mlp(...)` function for each model.

In [None]:
investigate_mlp(hidden_layer_sizes=(20, 20))
investigate_mlp(hidden_layer_sizes=(20, 20, 20))
investigate_mlp(hidden_layer_sizes=(20, 20, 20, 20))
investigate_mlp(hidden_layer_sizes=(20, 20, 20, 20, 20))

## 7. Comparing the models (2% total grade)

What are your observations comparing all the models in this homework?

**Your answer:** 
* Logistic regression has very low accuracy on this dataset because its decision boundary is linear, and no straight line can separate the inner circle from the outer one.
* MLPs can achieve better accuracy because their hidden layers let them model the non-linear (roughly circular) decision boundary.
* Deeper MLPs tend to have equal or higher accuracy, but their predicted probabilities change more abruptly across the decision boundary, so they can become overconfident about points close to it.
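
One possible way to put numbers on these observations is sketched below. It is a self-contained illustration, not part of the required solution: `mean_confidence` is a hypothetical helper that averages the largest predicted class probability per test point, and the `random_state` values passed to the MLPs are an added assumption for reproducibility. The exact numbers will vary with the random seed and library version, and the MLPs may emit convergence warnings with the default iteration limit.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

def mean_confidence(model, X):
    """Average of the largest predicted class probability per sample (hypothetical helper)."""
    return float(model.predict_proba(X).max(axis=1).mean())

# Regenerate the data and split it as in Questions 1 and 2
X_cmp, Y_cmp = make_circles(n_samples=250, noise=0.06, factor=0.5, random_state=10)
Xc_train, Xc_test, Yc_train, Yc_test = train_test_split(
    X_cmp, Y_cmp, test_size=100, random_state=42
)

candidates = [
    ("logistic regression", LogisticRegression()),
    ("MLP (20,)", MLPClassifier(hidden_layer_sizes=(20,), random_state=0)),
    ("MLP (20, 20, 20)", MLPClassifier(hidden_layer_sizes=(20, 20, 20), random_state=0)),
]

results = {}
for name, clf in candidates:
    clf.fit(Xc_train, Yc_train)
    # Store (test accuracy, mean confidence on the test set) for each model
    results[name] = (float(clf.score(Xc_test, Yc_test)), mean_confidence(clf, Xc_test))
    print(name, results[name])
```

A mean confidence close to 1.0 together with some misclassified test points would be one sign of overconfidence. The probability contours from `visualize(...)` remain the more direct evidence.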