# Lesson 26: multilayer perceptron activity

## Notebook set up
### Imports

In [None]:
# Third party imports
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, log_loss, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler, OrdinalEncoder, OneHotEncoder, MinMaxScaler

## 1. Data preparation

### 1.1. Load diabetes dataset

In [None]:
diabetes_df = pd.read_csv('https://gperdrizet.github.io/FSA_devops/assets/data/unit3/diabetes_prediction_train.csv')

In [None]:
diabetes_df.head()

In [None]:
diabetes_df.info()

Define the label and feature columns below. The label is `diabetes` (binary: 0 or 1). Using separate lists for numerical, nominal and ordinal features makes preprocessing easier.
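As a rough illustration of the separate-lists idea, here is a sketch on a toy dataframe. Every column name below is hypothetical and is not taken from the diabetes dataset; use `diabetes_df.info()` to decide which real columns belong in which list.

```python
import pandas as pd

# Toy dataframe with made-up column names (NOT the diabetes dataset's schema)
toy_df = pd.DataFrame({
    'height_cm': [170.0, 182.5, 160.2],
    'weight_kg': [65.0, 90.1, 55.4],
    'education': ['low', 'high', 'medium'],   # has a natural order -> ordinal
    'city': ['Lima', 'Oslo', 'Lima'],         # no natural order -> nominal
    'outcome': [0, 1, 0],
})

toy_label = 'outcome'
toy_numerical = ['height_cm', 'weight_kg']
toy_ordinal = ['education']
toy_nominal = ['city']

# The complete feature list is just the concatenation of the three lists
toy_features = toy_numerical + toy_ordinal + toy_nominal

# Keep only the features of interest plus the label
toy_df = toy_df[toy_features + [toy_label]]
print(toy_df.columns.tolist())
```

Keeping the lists separate means each preprocessing step later on can select exactly the columns it applies to.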

In [None]:
# Define the label
label = # YOUR CODE HERE

# Define numerical, ordinal and nominal features
# YOUR CODE HERE

# Complete feature list
features = # YOUR CODE HERE

In [None]:
# Select the features of interest and the label
diabetes_df = diabetes_df[features + [label]]

### 1.2. Train test split

Use `train_test_split` to split the data into training and testing sets. Use `random_state=315` for reproducibility.
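For reference, a minimal sketch of splitting a dataframe on toy data (the `toy_` names are placeholders for this example only). With no `test_size` given, `train_test_split` holds out 25% of the rows by default.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 20 rows, one feature and a binary label
toy_df = pd.DataFrame({'x': range(20), 'y': [0, 1] * 10})

# Passing a single dataframe returns two dataframes: train and test
toy_train_df, toy_test_df = train_test_split(toy_df, random_state=315)

print(len(toy_train_df), len(toy_test_df))
```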

In [None]:
training_df, testing_df = # YOUR CODE HERE

### 1.3. Preprocess numerical features

#### 1.3.1. Standard scale

Neural networks typically train faster and more stably when features are on a similar scale. Use `StandardScaler`: fit it on the training features only, then transform both the training and testing features.

**Hint:** Fit the scaler on `training_df[numerical_features]`, then transform both `training_df[numerical_features]` and `testing_df[numerical_features]`.
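The fit-on-train, transform-both pattern from the hint might look like this on toy data. The dataframes and column names here are invented for the example.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy train/test frames with made-up numerical columns
toy_train_df = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0], 'b': [10.0, 20.0, 30.0, 40.0]})
toy_test_df = pd.DataFrame({'a': [2.5, 5.0], 'b': [25.0, 50.0]})
toy_numerical = ['a', 'b']

toy_scaler = StandardScaler()

# Learn the mean and standard deviation from the training data only
toy_scaler.fit(toy_train_df[toy_numerical])

# Apply the same training statistics to both sets
toy_train_df[toy_numerical] = toy_scaler.transform(toy_train_df[toy_numerical])
toy_test_df[toy_numerical] = toy_scaler.transform(toy_test_df[toy_numerical])
```

Fitting on the training set alone keeps information about the test set out of the preprocessing step.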

In [None]:
feature_scaler = StandardScaler()

# YOUR CODE HERE: fit and transform the numerical features

#### 1.3.2. Clip outliers with IQR method
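One common reading of IQR clipping is to cap values at `Q1 - 1.5 * IQR` and `Q3 + 1.5 * IQR`, with the quartiles computed on the training data. The 1.5 multiplier is a conventional choice and an assumption here, not something this activity specifies. A sketch on a toy column:

```python
import pandas as pd

# Toy data with one obvious outlier in the training set
toy_train_df = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0, 100.0]})
toy_test_df = pd.DataFrame({'a': [-50.0, 2.5]})

# Quartiles and bounds come from the training data only
q1 = toy_train_df['a'].quantile(0.25)
q3 = toy_train_df['a'].quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr   # assumed multiplier of 1.5
upper = q3 + 1.5 * iqr

# Clip both sets to the same training-derived bounds
toy_train_df['a'] = toy_train_df['a'].clip(lower, upper)
toy_test_df['a'] = toy_test_df['a'].clip(lower, upper)
```

For several columns, the same steps can be repeated in a loop over the numerical feature list.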

In [None]:
# YOUR CODE HERE: implement IQR clipping for numerical features

### 1.4. Preprocess categorical features

#### 1.4.1. Ordinal feature encoding
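A small sketch of `OrdinalEncoder` on a made-up `size` column. Passing `categories` explicitly fixes the order; without it, the encoder sorts categories alphabetically, which may not match the intended ranking.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data: 'size' is a hypothetical ordinal feature
toy_train_df = pd.DataFrame({'size': ['small', 'large', 'medium', 'small']})
toy_test_df = pd.DataFrame({'size': ['medium', 'large']})

# One list of ordered categories per encoded column
toy_encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
toy_encoder.fit(toy_train_df[['size']])

# transform returns a 2D array; take column 0 for this single feature
toy_train_df['size'] = toy_encoder.transform(toy_train_df[['size']])[:, 0]
toy_test_df['size'] = toy_encoder.transform(toy_test_df[['size']])[:, 0]
```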

In [None]:
# YOUR CODE HERE: create and fit OrdinalEncoder, then transform both training and testing data

#### 1.4.2. Nominal feature encoding

In [None]:
# YOUR CODE HERE: create and fit OneHotEncoder, transform features, and concatenate back to dataframes

#### 1.4.3. Update the feature list

In [None]:
# YOUR CODE HERE: update the features list to include encoded features and remove the label

#### 1.4.4. Min/max scale categorical features
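A sketch of `MinMaxScaler` with `feature_range=(-1, 1)` on toy encoded columns (the names are placeholders). The intent, as assumed here, is to bring the encoded categorical features onto a range comparable to the standardized numerical ones.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy encoded categorical columns (made-up names)
toy_train_df = pd.DataFrame({'level': [0.0, 1.0, 2.0], 'flag': [0.0, 1.0, 1.0]})
toy_test_df = pd.DataFrame({'level': [1.0], 'flag': [0.0]})
toy_categorical = ['level', 'flag']

# Map each column's training minimum to -1 and training maximum to 1
toy_scaler = MinMaxScaler(feature_range=(-1, 1))
toy_scaler.fit(toy_train_df[toy_categorical])

toy_train_df[toy_categorical] = toy_scaler.transform(toy_train_df[toy_categorical])
toy_test_df[toy_categorical] = toy_scaler.transform(toy_test_df[toy_categorical])
```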

In [None]:
# YOUR CODE HERE: create MinMaxScaler with feature_range=(-1, 1), fit on categorical features, and transform

In [None]:
training_df.info()

## 2. Logistic regression model

Logistic regression is a linear model for classification. It serves as a good baseline before trying more complex models like neural networks.

### 2.1. Fit

Create a `LogisticRegression` model and fit it on the training data. Use `max_iter=1000` to ensure convergence.
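For orientation, a minimal fit on synthetic data from `make_classification` (not the diabetes data). Note that `fit` returns the fitted estimator itself.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data, for illustration only
toy_X, toy_y = make_classification(n_samples=200, n_features=5, random_state=315)

toy_model = LogisticRegression(max_iter=1000)
toy_fit_result = toy_model.fit(toy_X, toy_y)

# fit returns the same estimator object, now holding learned coefficients
print(toy_fit_result is toy_model)
print(toy_model.coef_.shape)
```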

In [None]:
logistic_model = # YOUR CODE HERE
fit_result = # YOUR CODE HERE

### 2.2. Test set evaluation

For classification, we can use accuracy, F1 score and/or AUC-ROC (among others) instead of R². Use sklearn's [`metrics`](https://scikit-learn.org/stable/api/sklearn.metrics.html) module.
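A sketch of the three metrics on synthetic data. One detail worth noticing: accuracy and F1 take hard class predictions, while `roc_auc_score` is usually given a score such as the predicted probability of the positive class. The variable names are placeholders for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data, for illustration only
X, y = make_classification(n_samples=400, n_features=5, random_state=315)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=315)

toy_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

toy_predictions = toy_model.predict(X_test)                 # hard labels (0 or 1)
toy_probabilities = toy_model.predict_proba(X_test)[:, 1]   # P(class == 1)

toy_accuracy = accuracy_score(y_test, toy_predictions)
toy_f1 = f1_score(y_test, toy_predictions)
toy_auc = roc_auc_score(y_test, toy_probabilities)

print(f'Accuracy: {toy_accuracy:.4f}, F1: {toy_f1:.4f}, AUC-ROC: {toy_auc:.4f}')
```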

In [None]:
logistic_predictions = # YOUR CODE HERE
logistic_accuracy = # YOUR CODE HERE
logistic_f1 = # YOUR CODE HERE
logistic_auc = # YOUR CODE HERE
print(f'Logistic regression accuracy on test set: {logistic_accuracy:.4f}')
print(f'Logistic regression F1 score on test set: {logistic_f1:.4f}')
print(f'Logistic regression AUC-ROC score on test set: {logistic_auc:.4f}')

### 2.3. Performance analysis

For classification, visualize performance using a confusion matrix.

In [None]:
# YOUR CODE HERE

## 3. Multilayer perceptron (MLP) classifier

Now let's build a neural network classifier using sklearn's [`MLPClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html).

### 3.1. Single epoch training function

Complete the training function below. It should:
1. Split the data into training and validation sets
2. Call `partial_fit` on the model (remember to pass `classes=[0, 1]` on the first call)
3. Record training and validation [`log_loss`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html) (aka binary cross-entropy) in the history dictionary

**Hint:** Use `model.partial_fit(X, y, classes=[0, 1])` for the first epoch. For subsequent epochs, `partial_fit` remembers the classes.
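To make the `partial_fit` pattern concrete, here is a sketch on synthetic data, outside of any function. Each call to `partial_fit` performs a single pass over the data it is given. The small hidden layer and the three epochs are arbitrary choices for this example.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic data, for illustration only
X, y = make_classification(n_samples=400, n_features=5, random_state=315)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=315)

toy_mlp = MLPClassifier(hidden_layer_sizes=(8,), random_state=315)
toy_history = {'training_loss': [], 'validation_loss': []}

for epoch in range(3):
    if epoch == 0:
        # The first call must be told the full set of class labels
        toy_mlp.partial_fit(X_train, y_train, classes=[0, 1])
    else:
        toy_mlp.partial_fit(X_train, y_train)

    # log_loss expects predicted probabilities, not hard labels
    toy_history['training_loss'].append(
        log_loss(y_train, toy_mlp.predict_proba(X_train), labels=[0, 1]))
    toy_history['validation_loss'].append(
        log_loss(y_val, toy_mlp.predict_proba(X_val), labels=[0, 1]))

print(toy_history)
```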

In [None]:
def train(model: MLPClassifier, df: pd.DataFrame, training_history: dict, classes: list = None) -> tuple[MLPClassifier, dict]:
    '''Trains sklearn MLP classifier model on given dataframe using validation split.
    Returns the updated model and training history dictionary containing training and
    validation log loss. If classes are not provided, assumes 0 and 1.'''

    global features, label

    df, val_df = train_test_split(df, random_state=315)
    
    # YOUR CODE HERE: call partial_fit on the model
    # If classes is provided, pass it to partial_fit
    
    # YOUR CODE HERE: append training and validation log loss to history
    
    return model, training_history

### 3.2. Model training

Create an `MLPClassifier` with:
- `hidden_layer_sizes=(64, 32)` - two hidden layers
- `activation='relu'` - ReLU activation function
- `learning_rate_init=0.001` - initial learning rate
- `warm_start=True` - keep weights between calls to fit
- `random_state=315` - for reproducibility

Train for 10 epochs using the training function above.

In [None]:
epochs = 10

training_history = {
    'training_loss': [],
    'validation_loss': []
}

mlp_model = # YOUR CODE HERE: create MLPClassifier

for epoch in range(epochs):

    # YOUR CODE HERE

### 3.3. Learning curves

Plot the training and validation loss over epochs to visualize the learning process.
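The plotting pattern could be sketched as follows. The loss values are generated from a simple decay formula purely so that there is something to draw; they are synthetic and do not come from any model.

```python
import matplotlib.pyplot as plt

# Synthetic, formula-generated values standing in for a real training history
toy_history = {
    'training_loss': [0.7 * 0.8 ** i for i in range(5)],
    'validation_loss': [0.7 * 0.85 ** i + 0.02 for i in range(5)],
}

toy_epochs = range(1, len(toy_history['training_loss']) + 1)

plt.plot(toy_epochs, toy_history['training_loss'], label='Training')
plt.plot(toy_epochs, toy_history['validation_loss'], label='Validation')
plt.title('Learning curves (synthetic values)')
plt.xlabel('Epoch')
plt.ylabel('Log loss')
plt.legend()
```

A validation curve that flattens or rises while the training curve keeps falling is the usual visual sign of overfitting.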

In [None]:
# YOUR CODE HERE: plot training and validation loss
# Use plt.plot() for each curve
# Add title, xlabel, ylabel, and legend

### 3.4. Test set evaluation

Evaluate the MLP model on the test set, similar to how you evaluated the logistic regression model.

In [None]:
mlp_predictions = # YOUR CODE HERE
mlp_accuracy = # YOUR CODE HERE
mlp_f1 = # YOUR CODE HERE
mlp_auc = # YOUR CODE HERE
print(f'MLP accuracy on test set: {mlp_accuracy:.4f}')
print(f'MLP F1 score on test set: {mlp_f1:.4f}')
print(f'MLP AUC-ROC score on test set: {mlp_auc:.4f}')

### 3.5. Performance analysis

Create a confusion matrix for the MLP model predictions.

In [None]:
# YOUR CODE HERE: create confusion matrix for MLP predictions
# Follow the same pattern as the logistic regression confusion matrix

## 4. Model comparison

Compare the performance of both models side by side.

In [None]:
print(f'Logistic regression accuracy on test set: {logistic_accuracy:.4f}')
print(f'MLP accuracy on test set: {mlp_accuracy:.4f}')

Create a side-by-side comparison of the confusion matrices for both models.

In [None]:
# YOUR CODE HERE