## **Chapter 3:** Supervised learning

**Supervised learning** is a cornerstone of machine learning, where the goal is to learn a mapping from inputs (`features`) to outputs (`targets`), given a labeled dataset. This chapter covers the two primary categories of supervised learning: *classification* and *regression*. Using `scikit-learn`, we'll apply these concepts to real-world datasets, with a detailed walkthrough of the Iris dataset for classification and the California Housing dataset for regression.

Scikit-learn offers a comprehensive suite of tools for implementing supervised learning models, which you can find explained in detail for [regression](https://scikit-learn.org/stable/supervised_learning.html#supervised-learning) and for [classification](https://scikit-learn.org/stable/supervised_learning.html#supervised-learning) algorithms.



### [3.1 Classification models](#31-classification-models)

The Iris dataset is *the* classic example for classification, where the task is to predict the species of an iris flower based on its sepal and petal dimensions.
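
Before training, it can help to look at what the features and targets actually contain. The following is a minimal sketch that only reads attributes of the object returned by `load_iris`; the variable name `iris_demo` is chosen here for illustration and is not used elsewhere in the chapter.

```python
from sklearn.datasets import load_iris

iris_demo = load_iris()

# Four numeric features per flower, 150 flowers in total
print(iris_demo.feature_names)
print(iris_demo.data.shape)

# Targets are integer class labels that index into target_names
print(list(iris_demo.target_names))
print(iris_demo.target[:5])
```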

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load data
iris = load_iris()
x, y = iris.data, iris.target

# Split the data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1, stratify=y, random_state=42)

# Initialize the model
classifier = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model
classifier.fit(x_train, y_train)

# Predictions
y_pred = classifier.predict(x_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Classification Accuracy: {accuracy:.2f}")
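
The split above passes `stratify=y`, which keeps the class proportions the same in the training and test sets. The sketch below illustrates the effect by comparing a stratified and a plain random split on Iris; the variable names and the `test_size=0.2` value are assumptions made for this illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_iris(return_X_y=True)

# Stratified split: each class keeps its share of the data in both subsets
_, _, y_tr_strat, y_te_strat = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# Plain random split: class counts in the test set can drift
_, _, y_tr_plain, y_te_plain = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

print("stratified test counts:", np.bincount(y_te_strat, minlength=3))
print("plain test counts:     ", np.bincount(y_te_plain, minlength=3))
```

Since Iris has 50 samples per class, the stratified test set of 30 samples contains exactly 10 of each species. Stratification matters most when the test set is small or the classes are imbalanced.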



### [3.2 Regression models](#32-regression-models)

In *regression*, the goal is to predict a continuous value. For example, predicting the price of a house based on its features (size, location, etc.) is a regression problem. Scikit-learn provides several regression models, including linear regression, decision trees, and support vector regression.
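
As a rough illustration of how interchangeable these estimators are, the sketch below fits the three model families named above on a small synthetic dataset. The data comes from `make_regression`, and all names (`X_demo`, `candidates`, `demo_mse`) are assumptions made for this example; the scores say nothing about how the models rank on real data.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with a continuous target (for illustration only)
X_demo, y_demo = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

candidates = {
    "linear regression": LinearRegression(),
    "decision tree": DecisionTreeRegressor(random_state=0),
    "support vector regression": SVR(),
}

# Every estimator exposes the same fit/predict interface
demo_mse = {}
for name, estimator in candidates.items():
    estimator.fit(X_tr, y_tr)
    demo_mse[name] = mean_squared_error(y_te, estimator.predict(X_te))
    print(f"{name}: MSE = {demo_mse[name]:.1f}")
```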

#### California housing dataset for regression

The Iris targets are class labels, so Iris is not suited to regression. Instead, let's use the [California Housing dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html), which is a popular dataset for regression tasks.

This dataset contains information related to housing in California, such as `median income`, `housing median age`, `average rooms`, `average bedrooms`, `population`, `average occupancy`, `latitude`, and `longitude`. The target variable is the `median house value` for California districts, expressed in units of 100,000 USD.

In [None]:
from sklearn.datasets import fetch_california_housing

# Load California housing dataset
housing = fetch_california_housing()
x_h, y_h = housing.data, housing.target

print(f"{x_h.shape = }")
print(f"{y_h.shape = }")


In this example, we also scale the features with `StandardScaler`. As introduced in the last chapter, scaling features is a common preprocessing step for many machine learning algorithms, especially those sensitive to the scale of the data, like SVMs or neural networks. To avoid leaking information from the test set into training, we split the data first and fit the scaler on the training set only.

Tree-based models like *RandomForest* are largely insensitive to feature scale, so scaling is not required for the model used below. We keep the step so that the same workflow still works if you swap in a scale-sensitive model.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# First, we split the dataset into training and testing sets
X_train_h, X_test_h, y_train_h, y_test_h = train_test_split(x_h, y_h, test_size=0.3, random_state=42)

# Then we fit the scaler on the training data only and apply it to both sets
scaler = StandardScaler()
X_train_h = scaler.fit_transform(X_train_h)
X_test_h = scaler.transform(X_test_h)
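
One way to make sure the scaler only ever sees training data is to bundle it with the model in a pipeline. The following is a minimal sketch under stated assumptions: it uses synthetic data from `make_regression` instead of the housing data (to avoid a download), rescales the columns by hand so that scaling matters, and uses `SVR` as an example of a scale-sensitive model. All variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X_s, y_s = make_regression(n_samples=200, n_features=3, noise=5.0, random_state=0)
X_s = X_s * np.array([1.0, 100.0, 0.01])  # give the columns very different scales

X_s_train, X_s_test, y_s_train, y_s_test = train_test_split(
    X_s, y_s, test_size=0.3, random_state=42
)

# fit() fits the scaler on the training data only, then fits the model
pipe = make_pipeline(StandardScaler(), SVR())
pipe.fit(X_s_train, y_s_train)

# predict() applies the already-fitted scaler before calling the model
y_s_pred = pipe.predict(X_s_test)

# The scaler's statistics come from the training split, not the full dataset
fitted_scaler = pipe.named_steps["standardscaler"]
print(np.allclose(fitted_scaler.mean_, X_s_train.mean(axis=0)))
```

With a pipeline, the same object can also be passed to cross-validation utilities, which then refit the scaler inside every fold.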


In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Initialize the RandomForestRegressor
regressor_h = RandomForestRegressor(n_estimators=100, random_state=42)

# Train the model on the training data
regressor_h.fit(X_train_h, y_train_h)

# Predict the median house values on the testing set
y_pred_h = regressor_h.predict(X_test_h)

# Evaluate the model using the mean squared error (MSE) metric
mse_h = mean_squared_error(y_test_h, y_pred_h)
print(f"Regression Mean Squared Error on California Housing Dataset: {mse_h:.2f}")

This workflow, from data loading and preprocessing to model training and evaluation, is typical for a regression task in supervised learning. Because the target is expressed in units of 100,000 USD, the MSE is in squared units of the target; its square root gives an error on the same scale as the house values.

By trying different datasets and regression models, you gain a better understanding of how to tackle various types of regression problems.

### [3.3 Model Evaluation](#33-model-evaluation)

Model evaluation is a critical step in the machine learning workflow. It allows you to assess the performance of your model and understand its strengths and weaknesses. In this section, we'll cover essential evaluation metrics and tools used in both classification and regression tasks. Understanding these concepts will help you choose the right metric for your specific problem and ensure that your model meets the desired objectives.

#### Classification Metrics

For classification problems, *accuracy*, *precision*, *recall*, *F1 score*, and the *confusion matrix* are commonly used metrics.

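* **Accuracy:** Accuracy measures the proportion of correct predictions (both true positives and true negatives) among the total number of cases examined. It's a good measure when the classes are roughly balanced; with imbalanced classes, a model that always predicts the majority class can still reach a high accuracy.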

In [None]:
from sklearn.metrics import accuracy_score

# Example true and predicted labels
y_true = [0, 0, 1, 0, 0, 1, 0, 0, 1, 0]
y_pred = [0, 1, 0, 0, 0, 1, 1, 0, 1, 0]

accuracy = accuracy_score(y_true, y_pred)
print(f"Accuracy: {accuracy:.2f}")


* **Precision and Recall:** Precision measures the accuracy of positive predictions (i.e., the proportion of true positives among all positive predictions), whereas recall (or sensitivity) measures the ability of the classifier to find all positive samples (i.e., the proportion of true positives among all actual positives).

In [None]:
from sklearn.metrics import precision_score, recall_score

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

print(f"Precision: {precision:.2f}, Recall: {recall:.2f}")


* **F1 Score:** The F1 score is the harmonic mean of precision and recall, providing a balance between the two metrics. It's especially useful when the class distribution is uneven.

In [None]:
from sklearn.metrics import f1_score

f1 = f1_score(y_true, y_pred)
print(f"F1 Score: {f1:.2f}")


* **Confusion Matrix:** The confusion matrix is a table that describes the performance of a classification model on a set of test data for which the true values are known. In scikit-learn, rows correspond to the true classes and columns to the predicted classes, so the off-diagonal entries show exactly which errors the classifier makes.

In [None]:
from sklearn.metrics import confusion_matrix

conf_matrix = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:\n", conf_matrix)
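
To connect the metrics above, the following sketch recomputes accuracy, precision, recall, and F1 by hand from the four confusion matrix counts. It reuses the same example labels under new names (`labels_true`, `labels_pred`, chosen here so that the cell does not overwrite anything) and is meant as an illustration of the definitions, not as a replacement for the scikit-learn functions.

```python
from sklearn.metrics import confusion_matrix

labels_true = [0, 0, 1, 0, 0, 1, 0, 0, 1, 0]
labels_pred = [0, 1, 0, 0, 0, 1, 1, 0, 1, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(labels_true, labels_pred).ravel()

accuracy_manual = (tp + tn) / (tp + tn + fp + fn)
precision_manual = tp / (tp + fp)  # correct among predicted positives
recall_manual = tp / (tp + fn)     # found among actual positives
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"Accuracy: {accuracy_manual:.2f}, Precision: {precision_manual:.2f}, "
      f"Recall: {recall_manual:.2f}, F1: {f1_manual:.2f}")
```

For these labels the counts are TN=5, FP=2, FN=1, TP=2, which gives an accuracy of 0.70, a precision of 0.50, a recall of 0.67, and an F1 score of 0.57.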


#### Regression Metrics

For regression tasks, *Mean Squared Error (MSE)* and *Mean Absolute Error (MAE)* are widely used metrics.

* **Mean Squared Error (MSE):** MSE measures the average squared difference between the predicted values and the actual values. Because the errors are squared, large errors are penalized more heavily, and the result is in squared units of the target.

In [None]:
from sklearn.metrics import mean_squared_error

# Example true and predicted values
y_true = [1.1, 2, 3.3, 4, 5]
y_pred = [2.2, 2, 3, 5, 6]

mse = mean_squared_error(y_true, y_pred)
print(f"Mean Squared Error: {mse:.2f}")


* **Mean Absolute Error (MAE):** MAE measures the average absolute difference between the predicted values and the actual values. It is in the same units as the target and weights all errors linearly, which makes it less sensitive to outliers than MSE.

In [None]:
from sklearn.metrics import mean_absolute_error

mae = mean_absolute_error(y_true, y_pred)
print(f"Mean Absolute Error: {mae:.2f}")
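
The two definitions can be checked by hand with NumPy. The sketch below reuses the example values under new, hypothetical names and also computes the root mean squared error (RMSE), which is simply the square root of the MSE and therefore back on the scale of the target.

```python
import numpy as np

values_true = np.array([1.1, 2, 3.3, 4, 5])
values_pred = np.array([2.2, 2, 3, 5, 6])

errors = values_pred - values_true

mse_manual = np.mean(errors ** 2)     # average of squared errors
mae_manual = np.mean(np.abs(errors))  # average of absolute errors
rmse_manual = np.sqrt(mse_manual)     # square root of the MSE

print(f"MSE: {mse_manual:.2f}, MAE: {mae_manual:.2f}, RMSE: {rmse_manual:.2f}")
```

For these values the MSE is 0.66 and the MAE is 0.68, matching the scikit-learn results.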

#### Selecting the right metric

The choice of metric depends on your specific problem and goals. For instance, in a medical diagnosis problem, recall might be more important than precision, as missing a positive case could be more detrimental than falsely labeling a negative case as positive. Conversely, in a spam detection system, precision might be more critical to avoid filtering out important emails.

Understanding these metrics and their implications will help you evaluate your models effectively, guiding you towards making improvements and ultimately achieving better performance.
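
The trade-off between precision and recall can be made concrete with a decision threshold. The sketch below uses made-up labels and made-up predicted probabilities (`toy_true`, `toy_scores`; both hypothetical, for illustration only) and compares a low and a high threshold.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical true labels and predicted probabilities for the positive class
toy_true = np.array([0, 0, 0, 0, 1, 0, 1, 0, 1, 1])
toy_scores = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.45, 0.60, 0.55, 0.80, 0.90])

threshold_results = {}
for threshold in (0.3, 0.7):
    # Everything at or above the threshold is predicted as positive
    toy_pred = (toy_scores >= threshold).astype(int)
    threshold_results[threshold] = (
        precision_score(toy_true, toy_pred),
        recall_score(toy_true, toy_pred),
    )
    p, r = threshold_results[threshold]
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

With the low threshold, all four positives are found (recall 1.00) at the cost of three false alarms (precision 0.57). With the high threshold, every positive prediction is correct (precision 1.00), but half of the positives are missed (recall 0.50). A medical screening setting would lean towards the first behavior, a spam filter towards the second.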

---
# --> 🚀💻💥 *Coding Challenge ([Step 4-5](#34-coding-challenge))*
---

### [3.4 Coding Challenge](#34-coding-challenge)

As a practical exercise, we revisit the injection molding dataset used in the matplotlib course unit. This challenge involves predicting the `"quality"` column from the provided dataset.

Like before, we'll break down the challenge into several steps, each focusing on a different aspect of the machine learning process.  

#### Task 

Develop a machine learning model to predict the `"quality"` of our injection molding experiments based on the various manufacturing parameters using scikit-learn.

#### The dataset 

The dataset should already be familiar from the previous exercise. It includes manufacturing parameters like `melt temperature`, `mold temperature`, and various measurements related to the manufacturing process, with the target variable being `"quality"`.

**Step 0:** Load and Examine the Dataset

* Load the `data.csv` dataset, encompassing process parameters and each lens's quality classification.
* Conduct an initial examination to comprehend its composition and content structure (e.g. by printing the head of the data).

In [None]:
# Load the injection molding data like we did in the last course unit
import pandas as pd

data = pd.read_csv("../data/data.csv", delimiter=";")

data.head()

**Step 1:** Data Loading and Preprocessing
* Separate the measurements from the target variable (`"quality"`).

In [None]:
# Separate the measurements from the target variable "quality"

y = data["quality"]
# Dropping the column by name does not depend on the column order
x = data.drop(columns="quality")

print(f"{x.shape = }")
print(f"{y.shape = }")


**Step 2:** Splitting the Data
* Split the dataset into training and testing sets using `train_test_split`.

* Select a `test_size` of `20%`

In [None]:
# Split the dataset into training and testing sets using `train_test_split`.
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

print(f"{x_train.shape = }")
print(f"{x_test.shape = }")
print(f"{y_train.shape = }")
print(f"{y_test.shape = }")

**Step 3:** Feature Scaling
* Apply feature scaling to the dataset using the `StandardScaler` to standardize the features.

In [None]:
# Apply feature scaling to the dataset using the StandardScaler

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)


**Step 4:** Model Selection and Training

* Choose a suitable model. For this example, we can again use `RandomForestClassifier` as a starting point.

* Train the model on the training set.

In [None]:
# Train the model on the training set

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

model = RandomForestClassifier(random_state=42)
model.fit(x_train_scaled, y_train)

# Perform 5-fold cross-validation on the training data (the test set stays held out)
scores_cv = cross_val_score(model, x_train_scaled, y_train, cv=5)

print("Cross-validation accuracy scores:", scores_cv) # (Scoring is covered in section 3.3)
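
Cross-validation splits the data into several folds and, in turn, trains on all but one fold and scores on the remaining one. The sketch below shows the idea on the bundled Iris data rather than the injection molding data, and wraps the scaler and the classifier in a pipeline so that the scaler is refit inside every fold. The pipeline, the `n_estimators=50` setting, and all variable names are assumptions made for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_cv, y_cv = load_iris(return_X_y=True)

# Scaler and model are refit together on the training part of each fold
cv_model = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=50, random_state=42),
)

# One accuracy score per fold
cv_scores = cross_val_score(cv_model, X_cv, y_cv, cv=5)

print("Fold accuracies:", cv_scores)
print(f"Mean accuracy: {cv_scores.mean():.2f} (std: {cv_scores.std():.2f})")
```

Reporting the mean together with the spread across folds gives a more stable estimate than a single train/test split.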


**Step 5:** Model Evaluation
* Evaluate the model's performance on the test set using appropriate metrics.

    * We are going to apply `accuracy`, `precision`, `recall` and `f1 score`

In [None]:
# Evaluate the model's performance on the test set using appropriate metrics.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

y_pred = model.predict(x_test_scaled)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, average='macro'))
print("Recall:", recall_score(y_test, y_pred, average='macro'))
print("F1 Score:", f1_score(y_test, y_pred, average='macro'))

print(confusion_matrix(y_test, y_pred))
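
The `average='macro'` argument used above computes the metric for each class separately and then takes the unweighted mean, so every class counts equally regardless of its size. The following sketch checks this on made-up three-class labels (`multi_true`, `multi_pred`; hypothetical, for illustration only).

```python
import numpy as np
from sklearn.metrics import precision_score

# Hypothetical labels for a three-class problem
multi_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
multi_pred = [0, 0, 1, 0, 1, 2, 2, 2, 0, 2]

# average=None returns one precision value per class
per_class = precision_score(multi_true, multi_pred, average=None)

# average="macro" is the unweighted mean of the per-class values
macro = precision_score(multi_true, multi_pred, average="macro")

print("Per-class precision:", per_class)
print("Macro precision:", macro)
print("Mean of per-class:", per_class.mean())
```

If the classes in your data are strongly imbalanced, `average='weighted'` is an alternative that weights each class by its number of true samples.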


This challenge walks through applying supervised learning to a real-world dataset, from preprocessing and model training to evaluation, and offers hands-on experience with scikit-learn.

[--> Back to Outline](#course-outline)

---