# ASSIGNMENT 3 - Tree-Based Models and Neural Networks

**Datasets:** `Dataset2.csv`, MNIST


#### Follow these steps before submitting your assignment

1. Complete the notebook.

2. All figures should have an x- and y-axis label and an appropriate title.

3. Once the notebook is complete, restart your kernel and rerun every cell by clicking 'Kernel' > 'Restart & Run All'.

4. Fix any errors until your notebook runs without any problems.

5. Make sure to reference all external code and documentation used.

6. Add sufficient comments to your code.

7. If you have trouble running neural network models on your laptop, you can use online platforms like [Google Colab](https://colab.research.google.com/).

## Q1 — Random Forests and Multi-Layer Perceptrons

1. Load `Dataset2.csv`.
2. Display basic statistics and inspect for missing data.
3. Encode the categorical features (one-hot encoding).
4. Use 'Loan_Status' as the target.
5. Split the data into training, validation, and test sets, holding out 20% of the observations for validation and 10% for testing. Pass `random_state=42`.
6. Normalize the data (Z-normalization).
7. Last time, we implemented Logistic Regression, SVM, and KNN using this dataset. This time, fit the following models to the training samples:
   1. A Random Forest (RF) consisting of 5 base decision trees, each with a maximum depth of 5.
   2. A Single-Layer Neural Network (Perceptron) with a stochastic gradient descent (SGD) optimizer and a learning rate of 0.1. Run the model for 10 iterations/epochs.
   3. A Multi-Layer Perceptron (MLP) implemented as follows:
      1. Two hidden layers (H1, H2), with 50 and 100 neurons/units in H1 and H2, respectively.
      2. Use the tanh function as the activation function for the hidden layers.
      3. Use a proper activation function for the output layer.
      4. Use the stochastic gradient descent (SGD) optimizer with a learning rate of 0.1.
      5. Run the model for 10 iterations/epochs.
      6. Record the validation and training loss for each iteration.
8. Report the training time in milliseconds for all models.
9. Use the Random Forest model you built to generate feature importance scores, and plot the scores of all features in descending order as a horizontal bar chart.
10. Select features from most to least important until the cumulative relative importance score reaches 90% (0.9), and print the selected features with their importance scores.
11. Discussion Question: Compare the training time of your single-layer neural network with that of the MLP model, and discuss the reasons for any difference.
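
The data preparation, Random Forest, and feature selection steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the data is a synthetic stand-in generated with `make_classification` (not `Dataset2.csv`), the column names `feature_0` ... `feature_7` are hypothetical, and stratified splitting is a choice the assignment does not require. The split sizes are passed as absolute counts so that 20% and 10% refer to the full dataset.

```python
import time

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic stand-in for the one-hot-encoded Dataset2 features.
X_arr, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=42
)
X = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(X_arr.shape[1])])

# Hold out 10% for testing and 20% for validation, both as shares of the full data.
n_test = round(0.10 * len(X))
n_val = round(0.20 * len(X))
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=n_test, random_state=42, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=n_val, random_state=42, stratify=y_rest
)

# Z-normalization: fit the scaler on the training split only, then reuse it.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_val_s = scaler.transform(X_val)
X_test_s = scaler.transform(X_test)

# Random Forest with 5 trees of maximum depth 5; time the fit in milliseconds.
rf = RandomForestClassifier(n_estimators=5, max_depth=5, random_state=42)
start = time.perf_counter()
rf.fit(X_train_s, y_train)
rf_train_ms = (time.perf_counter() - start) * 1000.0
print(f"RF training time: {rf_train_ms:.2f} ms")

# Rank features, then keep the top ones until cumulative importance reaches 0.9.
importances = pd.Series(rf.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
cumulative = importances.cumsum()
n_keep = min(int(np.searchsorted(cumulative.to_numpy(), 0.9)) + 1, len(importances))
selected = importances.iloc[:n_keep]
print(selected)
```

For the bar chart, one option (assuming matplotlib is installed) is `importances.sort_values().plot.barh()`, which places the most important feature at the top; axis labels and a title still need to be added.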

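One possible way to train the two neural models and record per-epoch losses is sketched below with scikit-learn. This is an assumption about tooling, not the required approach (a Keras implementation would be equally valid), and the data is again a synthetic stand-in. Calling `partial_fit` once per epoch makes it possible to compute both training and validation loss after every pass; for a binary target, `MLPClassifier` selects a logistic (sigmoid) output activation on its own.

```python
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic binary data standing in for the prepared Dataset2 splits.
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler().fit(X_train)
X_train, X_val = scaler.transform(X_train), scaler.transform(X_val)

# Single-layer perceptron: learning rate 0.1, exactly 10 passes over the data
# (tol=None disables the early-stopping tolerance check).
perceptron = Perceptron(eta0=0.1, max_iter=10, tol=None, random_state=42)
start = time.perf_counter()
perceptron.fit(X_train, y_train)
perceptron_ms = (time.perf_counter() - start) * 1000.0

# MLP: hidden layers of 50 and 100 tanh units, SGD with learning rate 0.1.
mlp = MLPClassifier(
    hidden_layer_sizes=(50, 100),
    activation="tanh",
    solver="sgd",
    learning_rate_init=0.1,
    random_state=42,
)
classes = np.unique(y_train)
train_losses, val_losses = [], []
mlp_ms = 0.0
for epoch in range(10):
    start = time.perf_counter()
    mlp.partial_fit(X_train, y_train, classes=classes)  # one pass over the training data
    mlp_ms += (time.perf_counter() - start) * 1000.0  # time only the training step
    train_losses.append(log_loss(y_train, mlp.predict_proba(X_train)))
    val_losses.append(log_loss(y_val, mlp.predict_proba(X_val)))

print(f"Perceptron: {perceptron_ms:.2f} ms, MLP: {mlp_ms:.2f} ms")
```

The two loss lists can later be plotted against the epoch index to obtain the learning curves. Note that `log_loss` here excludes the L2 penalty term, so the values differ slightly from the model's internal `loss_` attribute.
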
## Q2 — Model Performance
1. Report the prediction results of all models from Q1 on the test set of Dataset2, using these evaluation metrics: confusion matrix, F1-score, recall, precision, and accuracy.
2. Plot the ROC curve and report the AUC of the predictions on the test set.
3. Report the test time (in milliseconds) for all models.
4. For the MLP, plot the learning curves (training and validation loss vs. iterations/epochs).
5. Discussion Question: Based on the MLP learning curve, do you see any signs of overfitting? Why? If it overfits, how would you fix this issue?
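
The metrics listed above could be collected with a small helper like the one sketched here. The function name `evaluate` and the synthetic data are illustrative assumptions, and the helper assumes the target is already encoded as 0/1. ROC curves and AUC need continuous scores, so the helper falls back to `decision_function` for models such as `Perceptron` that do not provide `predict_proba`.

```python
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Perceptron
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
    roc_curve,
)
from sklearn.model_selection import train_test_split


def evaluate(model, X_test, y_test):
    """Return test metrics, ROC points, and prediction time (ms) for a fitted binary classifier."""
    start = time.perf_counter()
    y_pred = model.predict(X_test)
    test_ms = (time.perf_counter() - start) * 1000.0

    # Use class-1 probabilities when available, otherwise signed decision scores.
    if hasattr(model, "predict_proba"):
        y_score = model.predict_proba(X_test)[:, 1]
    else:
        y_score = model.decision_function(X_test)

    fpr, tpr, _ = roc_curve(y_test, y_score)
    return {
        "confusion_matrix": confusion_matrix(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "accuracy": accuracy_score(y_test, y_pred),
        "auc": roc_auc_score(y_test, y_score),
        "roc_points": (fpr, tpr),
        "test_ms": test_ms,
    }


# Assumption: synthetic data in place of the Dataset2 test split.
X, y = make_classification(n_samples=400, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
models = {
    "Random Forest": RandomForestClassifier(
        n_estimators=5, max_depth=5, random_state=42
    ).fit(X_train, y_train),
    "Perceptron": Perceptron(random_state=42).fit(X_train, y_train),
}
results = {name: evaluate(model, X_test, y_test) for name, model in models.items()}
for name, res in results.items():
    print(f"{name}: accuracy={res['accuracy']:.3f}, AUC={res['auc']:.3f}, "
          f"test time={res['test_ms']:.2f} ms")
```

The `roc_points` entry holds the false and true positive rates, which can be passed directly to a line plot to draw each model's ROC curve.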

## Q3 - Convolutional Neural Network (CNN)

### Objective:
Develop a Convolutional Neural Network (CNN) model to classify handwritten digits from the MNIST dataset. Your tasks involve loading and preprocessing the dataset, designing a CNN architecture, training the model while reserving a portion of the data for validation, and evaluating the model's performance. Additionally, assess if the model exhibits overfitting through the analysis of learning curves.

1. Load the MNIST dataset (using the `tensorflow.keras.datasets` module), which consists of grayscale images of handwritten digits. Normalize the image pixel values to the range [0, 1], reshape the images to fit the CNN input requirements, and convert the labels to one-hot encoded vectors for classification.
2. Design a CNN architecture for classifying MNIST handwritten digits with the following layers:
   1. Three convolutional layers with ReLU activation: the first with 32 filters of size 3x3; the second with 64 filters of size 3x3, followed by a max pooling layer of size 2x2; and the third with 64 filters of size 3x3, followed by another max pooling layer of size 2x2.
   2. A dense layer with 64 units and ReLU activation.
   3. An output layer with 10 units, one per digit class (0-9), using a proper output activation.
3. Keep 20% of the MNIST training dataset for validation and train the CNN on the remainder.
4. Train for 10 epochs. Use a suitable loss function and, if needed, an optimizer of your choice.
5. Plot the learning curves.
6. Evaluate the model's accuracy and F1-score on the MNIST test dataset.
7. Discussion Question: Explain whether the model overfits or not.
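
A minimal Keras sketch of the preprocessing and architecture described above might look like this. Several choices are assumptions rather than requirements: the `Flatten` layer (needed to connect the convolutional output to the dense layer), the softmax output, the Adam optimizer, and categorical cross-entropy loss. Random arrays shaped like MNIST are used here so the sketch runs without downloading data; the helper names `preprocess` and `build_cnn` are hypothetical.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers


def preprocess(images, labels):
    """Scale pixels to [0, 1], add a channel axis, and one-hot encode the labels."""
    x = images.astype("float32") / 255.0
    x = x.reshape(-1, 28, 28, 1)
    y = keras.utils.to_categorical(labels, num_classes=10)
    return x, y


def build_cnn():
    """Three conv layers (32, 64, 64 filters), two max-pool layers, dense 64, output 10."""
    model = keras.Sequential([
        keras.Input(shape=(28, 28, 1)),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),  # assumption: bridges the conv feature maps and the dense layer
        layers.Dense(64, activation="relu"),
        layers.Dense(10, activation="softmax"),  # assumption: softmax for 10 classes
    ])
    # Assumption: Adam with categorical cross-entropy, which matches one-hot labels.
    model.compile(
        optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"]
    )
    return model


# Assumption: random uint8 arrays standing in for the real MNIST images and labels.
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(16, 28, 28), dtype=np.uint8)
fake_labels = rng.integers(0, 10, size=16)

x, y = preprocess(fake_images, fake_labels)
model = build_cnn()
probs = model.predict(x, verbose=0)
print(x.shape, y.shape, probs.shape)
```

With the real data, `keras.datasets.mnist.load_data()` returns `(x_train, y_train), (x_test, y_test)`, and `model.fit(x_train, y_train, epochs=10, validation_split=0.2)` holds out the last 20% of the training arrays for validation. The returned `History` object stores `loss` and `val_loss` per epoch in its `history` dictionary, which is what the learning curves are drawn from.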

## Q4 — Use of GenAI
1. Did you use GenAI? If so, how?
2. What limitations did you come across and how did you address them?