<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/Python-Notebook-Banners/Code_challenge.png"  style="display: block; margin-left: auto; margin-right: auto;"/>
</div>

# Code challenge: Classification - hyperparameter tuning
© ExploreAI Academy

In this train, we'll tackle a classification problem by tuning hyperparameters, using techniques like grid search to optimise model performance.

## Learning objectives

By the end of this train, you should be able to:
- Apply hyperparameter tuning to improve a classification model.
- Evaluate model performance with tuned hyperparameters.

## Instructions to students

- **Do not add or remove cells in this notebook. Do not edit or remove the `### START FUNCTION` or `### END FUNCTION` comments. Do not add any code outside of the functions you are required to edit. Doing any of this will lead to a mark of 0%!**
- Answer the questions according to the specifications provided.
- Use the given cell in each question to see if your function matches the expected outputs.
- Do not hard-code answers to the questions.
- The use of Stack Overflow, Google, and other online tools is permitted. However, copying a fellow student's code is not permissible and is considered a breach of the Honour Code below. Doing this will result in a mark of 0%.
- Good luck, and may the force be with you!

## Honour Code

I PASCHAL, UGWU, confirm - by submitting this document - that the solutions in this notebook are a result of my own work and that I abide by the EDSA honour code (https://drive.google.com/file/d/1QDCjGZJ8-FmJE3bZdIQNwnJyQKPhHZBn/view?usp=sharing).

Non-compliance with the honour code constitutes a material breach of contract.

## Overview

Hyperparameters have a direct impact on the performance and predictions made by machine learning models. Within this coding challenge, we will strengthen our ability to produce appropriate classification solutions by extending a base model with tuned hyperparameters.

<br></br>

<div align="center" style="width: 600px; font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/wine.jpg"
     alt="Some fine wine for your fine model"
     style="float: center; padding-bottom: 0.5em"
     width=600px/>
Some fine wine for your fine modelling process.
Photo by <a href="https://unsplash.com/@hermez777?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText"> Hermes Rivera</a> on Unsplash
</div>

The structure of this notebook is as follows:

 - First, we'll load our data to get a view of the predictor and response variables we will be modelling.
 - We'll then preprocess our data, binarising the target variable and splitting up the data into train and test sets.
 - We then model our data using a Support Vector Classifier.
 - Following this modelling, we define a custom metric as the log-loss in order to evaluate our produced model.
 - Using this metric, we then take several steps to improve our base model's performance by optimising the hyperparameters of the SVC through a grid search strategy.

## Imports

Let's go ahead and load the usual suspects.

In [None]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

## The dataset

For this coding challenge we'll be using the [Wine Quality dataset](https://archive.ics.uci.edu/ml/datasets/wine+quality) from the UCI Machine Learning Repository. The constituents of this dataset are red and white variants of the Portuguese "Vinho Verde" wine.

This dataset consists of the following variables:

 - fixed acidity
 - volatile acidity
 - citric acid
 - residual sugar
 - chlorides
 - free sulfur dioxide
 - total sulfur dioxide
 - density
 - pH
 - sulphates
 - alcohol
 - quality (score between 0 and 10)

### Reading in the data


**Note:** the variable we will be predicting is `quality`; it serves as the label for our classification task.

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/Explore-AI/Public-Data/master/Data/classification_sprint/winequality.csv')
df.head()

Unnamed: 0,type,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6
1,0,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6
2,0,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6
3,0,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6
4,0,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6


## Question 1 - Data preprocessing

We would like to classify the wine according to its quality using binary classification.
Write a function to preprocess the data so we can run it through the classifier. The function should:

* Convert the quality for lower quality wines (quality less than or equal to 4) to 0
* Convert the quality for higher quality wines (quality greater than or equal to 5) to 1
* Split the data into 75% training and 25% testing data
* Set `random_state` to 42 for the train-test split.

_**Function specifications:**_
* Should take a dataframe
* Should standardise the features using sklearn's `StandardScaler`
* Should convert the quality labels into binary labels
* Should fill `NaN` values with zeros
* Should return two `tuples` of the form `(X_train, y_train), (X_test, y_test)`.
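
The two transformations listed above can be sketched on toy data before writing the full function. This is only an illustration: the arrays `quality` and `feature` are made-up values rather than rows from the dataset, and the standardisation is done by hand with NumPy, assuming (as `StandardScaler` does) the population standard deviation.

```python
import numpy as np

# Toy quality scores (made up for illustration, not taken from the dataset).
quality = np.array([3, 4, 5, 6, 9])

# Scores of 4 or below become 0; scores of 5 or above become 1.
binary_quality = np.where(quality <= 4, 0, 1)
print(binary_quality)  # [0 0 1 1 1]

# Toy feature column, standardised by hand: subtract the mean and divide by
# the population standard deviation (NumPy's default, ddof=0).
feature = np.array([8.8, 9.5, 10.1, 9.9, 11.2])
standardised = (feature - feature.mean()) / feature.std()

# After standardising, the column has mean ~0 and standard deviation ~1.
print(standardised)
```

After this step every feature is on a comparable scale, which matters for an SVC because its kernel depends on distances between samples.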

In [None]:
### START FUNCTION
def data_preprocess(df):
    """
    Preprocesses the given DataFrame for classification, including standardization, label binarization, and train-test split.

    Parameters:
    df (DataFrame): Input DataFrame containing the wine quality dataset.

    Returns:
    tuple: Two tuples of the form `(X_train, y_train), (X_test, y_test)`.
           X_train (array-like): Features for training.
           y_train (array-like): Labels for training.
           X_test (array-like): Features for testing.
           y_test (array-like): Labels for testing.
    """
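    # Work on a copy so the caller's DataFrame is not modified in place
    # (re-running the cell on an already binarised frame would map every label to 0)
    df = df.copy()

    # Convert quality labels into binary labels (<= 4 -> 0, >= 5 -> 1)
    df['quality'] = df['quality'].apply(lambda x: 0 if x <= 4 else 1)

    # Fill NaN values with zeros
    df = df.fillna(0)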

    # Split features and labels
    X = df.drop('quality', axis=1)
    y = df['quality']

    # Standardize the features
    scaler = preprocessing.StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.25, random_state=42)

    # Convert y_train and y_test to numpy arrays
    y_train = np.array(y_train)
    y_test = np.array(y_test)

    return (X_train, y_train), (X_test, y_test)

### END FUNCTION

In [None]:
(X_train, y_train), (X_test, y_test) = data_preprocess(df)

In [None]:
print(X_train[:2])

[[-0.57136659  0.07127869 -0.48054096  1.17914161 -0.09303318 -0.79974133
   0.0830898  -0.15472329 -0.36573452  0.13010447  0.06101473  0.25842195]
 [-0.57136659  1.50396711 -0.72301571  0.56008035 -0.63948302 -0.05776881
  -0.70572997  0.62379657  0.16787589 -0.86828773 -0.47467813 -0.99931317]]


_**Expected outputs:**_
```python
(X_train, y_train), (X_test, y_test)= data_preprocess(df)
print(X_train[:2])
print(y_train[:2])
print(X_test[:2])
print(y_test[:2])


[[-0.57136659  0.07127869 -0.48054096  1.17914161 -0.09303318 -0.79974133
   0.0830898  -0.15472329 -0.36573452  0.13010447  0.06101473  0.25842195]
 [-0.57136659  1.50396711 -0.72301571  0.56008035 -0.63948302 -0.05776881
  -0.70572997  0.62379657  0.16787589 -0.86828773 -0.47467813 -0.99931317]]

[1 0]

[[-0.57136659 -0.15493527 -0.54115965  0.90400327 -0.66050032 -0.31460545
   0.53384396  0.03990667 -1.35291379 -0.26925241 -0.34075491  1.18076103]
 [-0.57136659  0.29749266 -1.20796522  2.8987562  -0.80762143 -0.45729248
  -0.19863155 -0.22549783 -1.03274754 -0.7185289  -0.87644778  0.25842195]]

[1 1]
```

## Question 2 - Model training

Now that you have processed your data, let's jump straight into model fitting. Write a function that should:
* Instantiate a `SVC` model.
* Train the `SVC` model with default parameters.
* Return the trained SVC model.

_**Function specifications:**_
* Should take two numpy `arrays` as input in the form `(X_train, y_train)`.
* Should return an sklearn `SVC` model which has a random state of 40 and gamma set to 'auto'.
* The returned model should be fitted to the data.
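
As a rough guide to what `gamma='auto'` means: in sklearn's `SVC`, `'auto'` resolves to `1 / n_features`, while the default `'scale'` resolves to `1 / (n_features * X.var())`. The sketch below uses a synthetic matrix `X` (an assumption standing in for the standardised training features, with 12 columns as in the outputs shown in Question 1) to show that the two settings nearly coincide once the features are standardised.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a standardised feature matrix with 12 columns.
X = rng.normal(size=(200, 12))
X = (X - X.mean(axis=0)) / X.std(axis=0)

n_features = X.shape[1]
gamma_auto = 1.0 / n_features               # value used for gamma='auto'
gamma_scale = 1.0 / (n_features * X.var())  # value used for gamma='scale'

print(gamma_auto, gamma_scale)
```

Both values are close to 1/12, so the grid values of `gamma` explored later (0.01, 0.1, 1) bracket this starting point.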

In [None]:
### START FUNCTION
def train_SVC_model(X_train, y_train):
    """
    Trains a Support Vector Classifier (SVC) model with default parameters on the given training data.

    Parameters:
    X_train (array-like): Features for training.
    y_train (array-like): Labels for training.

    Returns:
    SVC: Trained SVC model.
    """
    # Instantiate the SVC with the specified random_state and gamma;
    # all other parameters keep their default values
    svc_model = SVC(random_state=40, gamma='auto')

    # Fit the model to the training data
    svc_model.fit(X_train, y_train)

    return svc_model

### END FUNCTION

In [None]:
svc = train_SVC_model(X_train,y_train)
svc.classes_

array([0, 1])


_**Expected outputs:**_

```python
svc.classes_
```
```
array([0, 1], dtype=int64)
```

## Question 3 - Model testing

Now that you've trained your model, it's time to test its accuracy. However, we'll be using a custom scoring function for this. Create a function that implements the log loss function:

$$\Large  H(p,q)=  -\frac{1}{N}\sum_{i=1}^{N} \left[ y_{i}\log(\hat{y}_{i}) + (1- y_{i})\log(1 - \hat{y}_{i}) \right]$$

_**Function specifications:**_
* Should take two numpy `arrays` as input in the form `y_true` and `y_predicted`.
* Should return a `float64` for the log loss value rounded to 7 decimal places.

_**Hint:**_ the numpy subtract function can be used to perform a calculation across an array of values
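
To make the formula concrete, here is a small worked example using only the standard library. The labels and probabilities are made up for illustration, and the name `toy_log_loss` is not part of the challenge.

```python
import math

# Toy labels and predicted probabilities (made up for illustration).
y_true = [1, 0, 1]
y_prob = [0.9, 0.2, 0.6]

# Per-sample terms: y * log(p) + (1 - y) * log(1 - p).
terms = [y * math.log(p) + (1 - y) * math.log(1 - p)
         for y, p in zip(y_true, y_prob)]

# Log loss is the negative mean of those terms.
toy_log_loss = -sum(terms) / len(terms)
print(round(toy_log_loss, 7))  # 0.2797766
```

Confident and correct probabilities (such as 0.9 for a true label of 1) contribute little to the loss, while hesitant ones (such as 0.6) contribute more.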

In [None]:
### START FUNCTION
def custom_scoring_function(y_true, y_pred):
    """
    Calculates the log loss value for classification.

    Parameters:
    y_true (array-like): True labels.
    y_pred (array-like): Predicted probabilities of the positive class.

    Returns:
    float: Log loss value rounded to 7 decimal places.
    """
    epsilon = 1e-15
    y_pred = np.maximum(epsilon, y_pred)
    y_pred = np.minimum(1 - epsilon, y_pred)

    # Compute log loss
    log_loss = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

    return round(log_loss, 7)

### END FUNCTION

In [None]:
y_pred = svc.predict(X_test)
print('Log Loss value: ', custom_scoring_function(y_test, y_pred))
print('Accuracy: ',round(accuracy_score(y_test,y_pred),4))

Log Loss value:  1.2540518
Accuracy:  0.9637


_**Expected outputs:**_
```python
print('Log Loss value: ',custom_scoring_function(y_test,y_pred))
print('Accuracy: ',round(accuracy_score(y_test,y_pred),4))
```

```
Log Loss value:  1.2540518
Accuracy:  0.9637
```
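
A note on interpreting this number: `svc.predict` returns hard 0/1 labels rather than probabilities, so after clipping, every correct prediction contributes almost nothing and every wrong prediction contributes roughly `-log(epsilon)`. The sketch below illustrates this with made-up counts (`n_samples` and `n_errors` are assumptions, not the real test set).

```python
import math

epsilon = 1e-15  # same clipping constant as in custom_scoring_function

# Penalty for one wrong hard prediction after clipping.
penalty_per_error = -math.log(epsilon)

# Toy example: 100 hard 0/1 predictions, 4 of them wrong (made-up numbers).
n_samples, n_errors = 100, 4
approx_log_loss = penalty_per_error * n_errors / n_samples

print(round(penalty_per_error, 2))  # 34.54
print(round(approx_log_loss, 2))    # 1.38
```

Under this reading, the log loss on hard labels is approximately the error rate times 34.54. For the outputs above, (1 - 0.9637) × 34.54 ≈ 1.25, which is in line with the reported log loss.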

## Hyperparameter optimisation

### Question 4.1 - Getting model parameters
In order to improve the accuracy of our classifier, we have to search for the best possible model (`SVC` in this case) parameters. However, we first have to find out what parameters can be tuned for the given model. Write a function that returns a list of available hyperparameters for a given model.

_**Function specifications:**_
* Should take in an sklearn model (estimator) object.
* Should return a list of parameters for the given model.

In [None]:
### START FUNCTION
def get_model_hyperparams(model):
    """
    Retrieves the available hyperparameters for a given sklearn model (estimator).

    Parameters:
    model (object): Sklearn model (estimator) object.

    Returns:
    list: List of hyperparameters for the given model.
    """
    # Extract hyperparameters from the model
    hyperparams = model.get_params().keys()

    return list(hyperparams)

### END FUNCTION

In [None]:
get_model_hyperparams(svc)

['C',
 'break_ties',
 'cache_size',
 'class_weight',
 'coef0',
 'decision_function_shape',
 'degree',
 'gamma',
 'kernel',
 'max_iter',
 'probability',
 'random_state',
 'shrinking',
 'tol',
 'verbose']

_**Expected outputs:**_

```python
get_model_hyperparams(svc)
```

```
['C',
 'cache_size',
 'class_weight',
 'coef0',
 .
 .
 .
 'shrinking',
 'tol',
 'verbose']
```

### Question 4.2 - Hyperparameter search
The next step is to define a set of `SVC` hyperparameters to search over. Write a function that searches for optimal parameters using the given dictionary of hyperparameters:

- C_list = [0.1, 1, 10]
- {C: 0.1, 1, 10}
- gamma_list = [0.01, 0.1, 1]
- {gamma: 0.01, 0.1, 1}
- D = {'C':[0.1, 1, 10], 'gamma': [0.01, 0.1, 1]}

and using `custom_scoring_function` from **Question 3** above as a custom scoring function (_**Hint**_: Have a look at the `make_scorer` function in sklearn `metrics`).

_**Function specifications:**_
* Should define a parameter grid using the given list of `SVC` hyperparameters
* Should return an sklearn `GridSearchCV` object with a cross validation of 5.
* Should return a value rounded to 4 decimal places.
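
To get a feel for the size of this search, the grid can be enumerated by hand. The sketch below uses only `itertools` and does not fit any models; the names `candidates` and `cv_folds` are illustrative.

```python
from itertools import product

param_grid = {'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 1]}
cv_folds = 5

# Every combination of the listed values is one candidate model.
names = list(param_grid)
candidates = [dict(zip(names, values))
              for values in product(*param_grid.values())]

for candidate in candidates:
    print(candidate)

# Each candidate is fitted once per fold; the final refit on the full
# training data (GridSearchCV's default) is one additional fit.
print(len(candidates), 'candidates x', cv_folds, 'folds =',
      len(candidates) * cv_folds, 'fits')
```

With 3 values for `C` and 3 for `gamma`, the search evaluates 9 candidates, which is why grid search becomes expensive quickly as more hyperparameters are added.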

In [None]:
### START FUNCTION
def tune_SVC_model(X_train, y_train):
    """
    Tune hyperparameters of an SVC model using grid search.

    Parameters:
    X_train (array-like): Features for training.
    y_train (array-like): Labels for training.

    Returns:
    GridSearchCV: Tuned SVC model.
    """
    # Import make_scorer function
    from sklearn.metrics import make_scorer

    # Define parameter grid
    param_grid = {'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 1]}

    # Create SVC model
    svc = SVC(random_state=40, gamma='auto')

    # Create scorer object using custom scoring function
    custom_scorer = make_scorer(custom_scoring_function, greater_is_better=False)

    # Perform grid search with cross-validation
    grid_search = GridSearchCV(estimator=svc, param_grid=param_grid, scoring=custom_scorer, cv=5)

    # Fit grid search to data
    grid_search.fit(X_train, y_train)

    return grid_search

### END FUNCTION

In [None]:
# Tune SVC model hyperparameters
svc_tuned = tune_SVC_model(X_train, y_train)

y_pred = svc_tuned.predict(X_test)
print('Log Loss value: ',custom_scoring_function(y_test,y_pred))
print('Accuracy: ',round(accuracy_score(y_test,y_pred),4))

Log Loss value:  1.2115421
Accuracy:  0.9649


_**Expected outputs:**_
```python
print('Log Loss value: ',custom_scoring_function(y_test,y_pred))
print('Accuracy: ',round(accuracy_score(y_test,y_pred),4))
```

```
Log Loss value:  1.2115421
Accuracy:  0.9649
```

### Question 4.3 - Optimal model parameters
Write a function that returns the best hyperparameters for a given model (i.e. the `GridSearchCV`).

_**Function specifications:**_
* Should take in an sklearn GridSearchCV object.
* Should return a dictionary of optimal parameters for the given model.
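
It may help to recall how "best" is decided when the scorer is a loss. A scorer created with `make_scorer(..., greater_is_better=False)` reports the negated loss, and the search keeps the candidate with the highest score. The sketch below mimics that selection by hand; the losses in `mean_losses` are hypothetical numbers chosen for illustration, not results from this notebook.

```python
# Hypothetical mean cross-validated losses for three candidates (made-up numbers).
mean_losses = [
    ({'C': 0.1, 'gamma': 0.01}, 1.40),
    ({'C': 1, 'gamma': 0.1}, 1.31),
    ({'C': 10, 'gamma': 1}, 1.36),
]

# A scorer built with greater_is_better=False reports the negated loss.
scores = [(params, -loss) for params, loss in mean_losses]

# The search keeps the candidate with the highest score, i.e. the lowest loss.
best_params, best_score = max(scores, key=lambda item: item[1])
print(best_params, best_score)
```

This is also why `best_score_` on a fitted `GridSearchCV` is negative when a loss is used as the scoring function.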

In [None]:
### START FUNCTION
def get_best_params(model):
    """
    Retrieves the best hyperparameters from a trained GridSearchCV model.

    Parameters:
    model (GridSearchCV): Trained GridSearchCV model.

    Returns:
    dict: Dictionary of optimal hyperparameters for the given model.
    """
    # Extract best parameters from the GridSearchCV model
    best_params = model.best_params_

    return best_params

### END FUNCTION

In [None]:
get_best_params(svc_tuned)

{'C': 1, 'gamma': 1}

_**Expected outputs:**_
```python
get_best_params(svc_tuned)
```

```
{'C': 1, 'gamma': 1}
```


<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/ExploreAI_logos/EAI_Blue_Dark.png"  style="width:200px"/>
</div>