# UE6: Scikit-Learn

<img src=https://upload.wikimedia.org/wikipedia/commons/thumb/0/05/Scikit_learn_logo_small.svg/1200px-Scikit_learn_logo_small.svg.png height=200>

### **Introduction to Scikit-Learn for Data Analysis**


Welcome to the Introduction to **Scikit-learn** course, tailored to provide you with foundational knowledge and practical skills in machine learning using the Scikit-learn library.

With a blend of theoretical insights and hands-on exercises, you'll learn to leverage Scikit-learn for both *supervised* and *unsupervised* learning tasks, ensuring you're well-equipped to address a wide range of data science challenges.

---

## **Course Outline**

#### **Chapter 1. [Introduction to Scikit-learn](#1-introduction-to-scikit-learn)**
- **Chapter 1.1 [Overview of Scikit-learn](#11-overview-of-scikit-learn):** Introduction to machine learning and the significance of scikit-learn.
- **Chapter 1.2 [Installation and Setup](#12-installation-and-setup):**  Setting up the machine learning environment.
- **Chapter 1.3 [Key Features of Scikit-learn](#13-key-features-of-scikit-learn):** The estimator concept and the main reasons to use scikit-learn.

#### **Chapter 2. [Data preprocessing with sklearn](#2-data-preprocessing-with-sklearn)**
- **Chapter 2.1 [Introduction to the dataset](#21-introduction-to-the-dataset):** Introduction to the Iris dataset, which is used as the running example in this course unit.
- **Chapter 2.2 [Data preprocessing with sklearn](#22-data-preprocessing-with-sklearn):** Imputation, scaling, normalization, encoding, feature selection, and feature extraction.
- **Chapter 2.3 [Preparing for model training](#23-preparing-for-model-training):** Splitting data into training and testing sets, stratified splits, and cross-validation.

#### **Chapter 3. [Supervised Learning with Scikit-learn](#3-supervised-learning)**
- **Chapter 3.1 [Classification models](#31-classification-models):** Implementing classification models to predict discrete labels.
- **Chapter 3.2 [Regression models](#32-regression-models):** Employing regression models to forecast continuous outcomes.
- **Chapter 3.3 [Model evaluation](#33-model-evaluation):** Utilizing metrics for classification and for regression to evaluate and refine models.
- **Chapter 3.4 [Coding Challenge](#34-coding-challenge):** Deep dive into supervised learning techniques.

#### **Chapter 4. [Unsupervised Learning with Scikit-learn](#4-unsupervised-learning-with-scikit-learn)**
- **Chapter 4.1 [Clustering Techniques](#41-clustering-techniques):** Exploring K-means, hierarchical clustering, and DBSCAN.
- **Chapter 4.2 [Dimensionality Reduction](#42-dimensionality-reduction):** Principles of PCA, t-SNE, and LDA.
- **Chapter 4.3 [Association Rules](#43-association-rules):** Introduction to the Apriori algorithm for market basket analysis.
- **Chapter 4.4 [Coding Challenge](#44-coding-challenge):** Deep dive into unsupervised learning techniques.

#### **Chapter 5. [Advanced Topics and Best Practices](#5-advanced-topics-and-best-practices)**
- **Chapter 5.1 [Pipeline Construction](#51-pipeline-construction):** Streamlining workflows with pipelines.
- **Chapter 5.2 [Working with Imbalanced Data](#52-working-with-imbalanced-data):** Techniques for handling imbalanced datasets.
- **Chapter 5.3 [Ensemble Methods](#53-ensemble-methods):** Leveraging ensemble methods for improved model performance.
- **Chapter 5.4 [Best Practices in Machine Learning](#54-best-practices-in-machine-learning):** Guidelines for practical machine learning project management.
- **Chapter 5.5 [Learning Resources](#55-learning-resources):** Furthering your Scikit-learn knowledge.
- **Chapter 5.6 [Q&A Session](#56-qa-session):** Addressing common questions and challenges.

---

## **[1. Introduction to Scikit-learn](#1-introduction-to-scikit-learn)** 

**Scikit-learn** is an essential library for machine learning in Python, offering a wide array of tools for building and evaluating models, preprocessing data, and selecting and tuning algorithms. Known for its simplicity and ease of use, Scikit-learn enables both novices and experts to implement complex machine learning algorithms with minimal code. 

<img src=https://upload.wikimedia.org/wikipedia/commons/thumb/0/05/Scikit_learn_logo_small.svg/1200px-Scikit_learn_logo_small.svg.png height=100>

Initially developed as part of a [Google Summer of Code project](https://github.com/scikit-learn/scikit-learn/wiki/Google-summer-of-code-(GSOC)-2017), it has evolved into a robust, versatile library that supports a variety of [supervised](https://scikit-learn.org/stable/supervised_learning.html) and [unsupervised](https://scikit-learn.org/stable/unsupervised_learning.html) learning tasks. 

Whether you're looking to perform `classification`, `regression`, `clustering`, or `dimensionality reduction`, **Scikit-learn** provides efficient solutions that are foundational for data science projects and real-world applications.


### [1.1 Overview of Scikit-learn](#11-overview-of-scikit-learn)

Scikit-learn, often simply referred to as `sklearn`, makes it easy to implement many machine learning algorithms. With its comprehensive set of tools for data mining and data analysis, it is suited for both beginners and experts in the field of machine learning. 

It provides simple and efficient tools for predictive data analysis, accessible to everybody, and reusable in various contexts.

You can find a lot of useful information and explanations about scikit-learn in its [documentation](https://scikit-learn.org/stable/). The [example gallery](https://scikit-learn.org/stable/auto_examples/index.html) is an especially good starting point for exploring more machine learning models.

### [1.2 Installation and Setup](#12-installation-and-setup)

Getting started with Scikit-learn is straightforward. It can be [installed](https://scikit-learn.org/stable/install.html) using pip, the package manager for Python. 

```bash
pip install scikit-learn
```

This command installs `Scikit-learn` and its dependencies, including `NumPy` and `SciPy`. 

After installation, you can verify that Scikit-learn is correctly installed by importing it into a Python session:

```python
import sklearn
print(sklearn.__version__)
```

This code snippet imports the sklearn module and prints its version, ensuring that the library is ready for use.

### [1.3 Key Features of Scikit-learn](#13-key-features-of-scikit-learn)

Scikit-learn is designed around the concept of *estimators*: an estimator is any object that can learn from data. Whether it's a *classification*, *regression*, *clustering*, or *dimensionality reduction* algorithm, each one is implemented as a Python class.

#### What is an estimator?

In the context of **scikit-learn**, an `estimator` is any object that learns from data: it could be a *classification*, *regression*, or *clustering* algorithm or a *transformer* that extracts or filters useful features from raw data. The term *"estimator"* is a core concept in scikit-learn's design, underpinning how models and transformations are constructed and applied.

**Key Characteristics of Estimators:**

* Parameter Estimation from Data: At its core, an `estimator` is designed to *estimate* some parameters based on a dataset. For example, in the case of a linear regression model, the estimator would determine the coefficients ("parameters") that best fit the given data according to a particular criterion.

* Fit Method: All estimators have a `fit()` method, which is used to feed the training data to the estimator. For supervised learning algorithms, `fit()` takes two arguments: the data `x` and the labels `y` (e.g., `estimator.fit(x, y)`). For unsupervised learning algorithms, `fit()` takes only the data `x` as an argument.

* Hyperparameters and Parameters: Estimators can be instantiated with given *hyperparameters*, which are set prior to the learning process and not derived from the data. In contrast, the parameters inferred during the learning process (like the coefficients in linear regression) are attributes of the estimator.
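
The characteristics above can be seen in a minimal sketch. It assumes `LinearRegression` as the example estimator and a tiny made-up dataset that follows `y = 2x + 1`; neither is prescribed by this course, they are only chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data (illustrative assumption): y = 2 * x + 1
x_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([1.0, 3.0, 5.0, 7.0])

# Hyperparameters are passed to the constructor, before any data is seen
model = LinearRegression(fit_intercept=True)
print(model.get_params()["fit_intercept"])

# fit() estimates the parameters from the data
model.fit(x_toy, y_toy)

# Learned parameters are stored as attributes with a trailing underscore
print(model.coef_, model.intercept_)
```

The trailing underscore (`coef_`, `intercept_`) is the scikit-learn naming convention for attributes that only exist after `fit()` has been called.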

**Types of Estimators:**

Scikit-learn distinguishes between several types of estimators, primarily:

* Classifiers: Estimators that predict a category for a given sample. They are used in *supervised learning* for **categorical** outcomes.

* Regressors: Estimators that predict a continuous value. Like classifiers, they are used in *supervised learning* but for outcomes that are **continuous numbers**.

* Clusterers: Estimators that assign samples to clusters, used in *unsupervised learning*.

* Transformers: Estimators that transform input data to enhance it, reduce its *dimensionality*, or make it more suitable for a classifier or a regressor. Transformers also implement a `transform()` method in addition to `fit()`.

Understanding the concept of estimators is fundamental when working with scikit-learn, as it provides a consistent and predictable way to apply various machine learning algorithms and preprocessing techniques.
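
As a rough illustration of the four estimator types, the sketch below fits one of each on the Iris data. The concrete classes (`StandardScaler`, `LogisticRegression`, `LinearRegression`, `KMeans`) and the idea of predicting petal width from the other three features are illustrative assumptions, not choices made by this course.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.preprocessing import StandardScaler

x_demo, y_demo = load_iris(return_X_y=True)

# Transformer: fit() learns the column statistics, transform() applies them
x_std = StandardScaler().fit(x_demo).transform(x_demo)

# Classifier: predicts a category for each sample
clf = LogisticRegression(max_iter=1000).fit(x_std, y_demo)
labels = clf.predict(x_std[:3])

# Regressor: predicts a continuous value (petal width from the other features)
reg = LinearRegression().fit(x_demo[:, :3], x_demo[:, 3])
widths = reg.predict(x_demo[:3, :3])

# Clusterer: fit() receives only the data, no labels
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(x_std)

print(labels, widths, km.labels_[:3])
```

All four objects are created, fitted, and used through the same small set of methods, which is what makes it easy to exchange one algorithm for another.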

#### Why use scikit-learn?

The library is well-structured, making it easy to switch between different algorithms to find the one that works best for the task at hand. It supports a variety of preprocessing methods, feature selection techniques, and model evaluation metrics. 

Moreover, Scikit-learn's integration with `NumPy` and `Pandas` means it works well with a wide array of data sources, from simple `NumPy arrays` to complex `Pandas DataFrames`.

Scikit-learn's comprehensive [documentation](https://scikit-learn.org/stable/) and examples make it an invaluable resource for anyone looking to delve into machine learning. 

Its consistency and simplicity in API design ensure a short learning curve and allow researchers and practitioners to focus more on the problem to be solved, rather than the tool itself. For more specific applications, you can even implement [your own estimators](https://scikit-learn.org/stable/developers/develop.html) with sklearn. 

[--> Back to Outline](#course-outline)

---

## [2. Data preprocessing with sklearn](#2-data-preprocessing-with-sklearn)

**Data preprocessing** is a critical step in the machine learning pipeline. It involves transforming raw data into a format that is more suitable for modeling. This chapter will delve into essential preprocessing techniques using scikit-learn.

<img src=https://i.stack.imgur.com/jEULG.png height=400>


### [2.1 Introduction to the dataset](#21-introduction-to-the-dataset)

The *Iris dataset* is one of the most well-known datasets in the field of machine learning and statistics. It's often used as a simple, accessible example for teaching data visualization, machine learning, and data preprocessing techniques. The [dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set) was introduced by the British biologist [Ronald Fisher](https://en.wikipedia.org/wiki/Ronald_Fisher) in 1936 as an example of discriminant analysis.

```python 
# Load the data from sklearn
from sklearn import datasets
iris = datasets.load_iris()
```

When you load the Iris dataset using `load_iris()` from `sklearn.datasets`, you receive a dataset that consists of `150` samples of iris flowers. These samples are divided evenly across three different species of iris (`50` samples from each of `Setosa`, `Versicolor`, and `Virginica`). 

<img src=https://miro.medium.com/v2/resize:fit:1400/1*ZK9_HrpP_lhSzTq9xVJUQw.png height=200>

For each sample, the dataset includes four features (or `measurements`):

1. **Sepal Length:** The length of the sepal (the part of the flower that encases the bud before it blooms) in centimeters.
2. **Sepal Width:** The width of the sepal in centimeters.
3. **Petal Length:** The length of the petal (the colorful, often bright part of the flower) in centimeters.
4. **Petal Width:** The width of the petal in centimeters.

These measurements make up the feature matrix `X`, which is a `150x4` `np.array` where each row corresponds to a flower sample, and each column corresponds to one of the four features mentioned above.

```python
from sklearn import datasets

# Load the Iris dataset
iris = datasets.load_iris()
x_iris = iris.data
y_iris = iris.target

print(f"{x_iris.shape = }")
print(f"{y_iris.shape = }")
```

The target variable `y` is a 1-dimensional array of size `150`, where each entry is an integer representing the species of the corresponding flower in the feature matrix. The species are encoded as `0`, `1`, and `2`, which correspond to `Setosa`, `Versicolor`, and `Virginica`, respectively. 
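
To check the encoding described above, the following sketch prints the species name behind each integer code. It relies on the `target_names` attribute of the object returned by `load_iris()`; the variable name `iris_demo` is only used here.

```python
from sklearn import datasets

iris_demo = datasets.load_iris()

# target_names holds the species name for each integer code
for code, name in enumerate(iris_demo.target_names):
    print(code, name)

# Map the encoded labels of three samples back to species names
print(iris_demo.target_names[iris_demo.target[[0, 75, 149]]])
```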

The primary use of the Iris dataset in machine learning is for classification tasks, where the goal is to predict the species of an iris flower based on the measurements of its petals and sepals. Because of its simplicity and small size, it's particularly useful for demonstrating the basics of machine learning techniques, from data preprocessing to building various types of models, such as decision trees, support vector machines, or neural networks. 

In this course, we will use `iris` to introduce many tools from `scikit-learn`.

### [2.2 Data preprocessing with sklearn](#22-data-preprocessing-with-sklearn)

Data preprocessing is an essential stage in the machine learning pipeline, involving several crucial steps to make raw data more suitable for building models. In this chapter, we explore these steps using scikit-learn. 

Please note that sklearn provides just *one* way to solve these challenges. You can often use `numpy` or `pandas` to solve these tasks in a similar fashion.

#### Handling Missing Values with `SimpleImputer` 

Missing values in a dataset can lead to inaccurate models, and many estimators cannot handle them at all. **Scikit-learn**'s `SimpleImputer` offers a convenient way of dealing with this issue by replacing missing values with a statistic of the column (mean, median, or most frequent value) or with a constant.

For example, we can replace missing numerical values with the mean of the remaining non-missing values in each column.


```python
import numpy as np
from sklearn.impute import SimpleImputer

# Example dataset with missing values
x = np.array([[1, 2, np.nan], [4, np.nan, 6], [7, 8, 9]])

# Using SimpleImputer to replace missing values with the mean of the column
imputer = SimpleImputer(strategy='mean')
x_imputed = imputer.fit_transform(x)

x_imputed
```
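
To see which values the imputer actually fills in, one can inspect its `statistics_` attribute after fitting. The sketch below compares it with `np.nanmean`, which computes the column means while ignoring `NaN` entries; the variable names are only used for this illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

x_nan = np.array([[1, 2, np.nan], [4, np.nan, 6], [7, 8, 9]])

imputer_check = SimpleImputer(strategy="mean").fit(x_nan)

# statistics_ stores the per-column fill value learned by fit()
print(imputer_check.statistics_)   # [4.  5.  7.5]

# np.nanmean ignores NaNs and yields the same column means
print(np.nanmean(x_nan, axis=0))
```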

**Iris:** While the Iris dataset doesn't have missing values, let's simulate this scenario to demonstrate handling them:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Simulate missing values in the first feature
x_missing = x_iris.copy()
x_missing[::10, 0] = np.nan

imputer = SimpleImputer(strategy='mean')
x_imputed = imputer.fit_transform(x_missing)

x_imputed
```

#### Feature Scaling via `StandardScaler`

Many machine learning algorithms perform better when numerical input features are on a comparable scale. `StandardScaler` is a tool in scikit-learn that standardizes each feature by removing its mean and scaling it to unit variance.


```python
from sklearn.preprocessing import StandardScaler

# Dataset
x = [[0, 15], [1, -10], [2, 20]]

# Standardizing the features
scaler = StandardScaler()
x_scaled = scaler.fit_transform(x)

x_scaled
```

**Iris:** Feature scaling can be demonstrated by standardizing the Iris dataset's features:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
x_scaled = scaler.fit_transform(x_iris)

# x_scaled now has each feature standardized
x_scaled
```
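
Standardization follows the formula $z = (x - \mu) / \sigma$, applied column by column. As a quick sanity check, the sketch below reproduces the scaler's output by hand with NumPy; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

x_raw = load_iris().data
x_std = StandardScaler().fit_transform(x_raw)

# Manual standardization: z = (x - mean) / std, computed per column
x_manual = (x_raw - x_raw.mean(axis=0)) / x_raw.std(axis=0)

print(np.allclose(x_std, x_manual))

# After scaling, each column has (numerically) zero mean and unit standard deviation
print(x_std.mean(axis=0).round(6), x_std.std(axis=0).round(6))
```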

#### Normalizing Features with `Normalizer`

Unlike `StandardScaler`, which works column by column, scikit-learn's `Normalizer` rescales each individual sample (row) so that it has unit norm (the L2 norm by default). This can be particularly useful for sparse datasets.

```python
from sklearn.preprocessing import Normalizer

# Dataset
x = [[4, 1, 2, 2], [1, 3, 9, 3], [5, 7, 5, 1]]

# Normalizing each sample (row) to unit norm
normalizer = Normalizer()
x_normalized = normalizer.fit_transform(x)

x_normalized
```

**Iris:** Normalization ensures that each sample's feature vector has a unit norm. 

```python
from sklearn.preprocessing import Normalizer

normalizer = Normalizer()
x_normalized = normalizer.fit_transform(x_iris)

# x_normalized now ensures each sample has a unit norm
x_normalized
```
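
The sketch below checks what "unit norm" means in practice: every row is divided by its own Euclidean length, so each row of the result has length 1. It reuses the small example matrix from above under new, illustrative variable names.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

x_rows = np.array([[4.0, 1.0, 2.0, 2.0], [1.0, 3.0, 9.0, 3.0], [5.0, 7.0, 5.0, 1.0]])
x_unit = Normalizer(norm="l2").fit_transform(x_rows)

# Manual equivalent: divide each row by its Euclidean (L2) norm
x_manual = x_rows / np.linalg.norm(x_rows, axis=1, keepdims=True)

print(x_unit[0])                        # [0.8 0.2 0.4 0.4], since the first row has norm 5
print(np.linalg.norm(x_unit, axis=1))   # every row now has norm 1
```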

#### Encoding Categorical Variables Using `OneHotEncoder`

Categorical variables are often represented as `strings` or `numbers` that indicate categories. To make these variables understandable to machine learning algorithms, we use *one-hot encoding*, which represents each category as a binary vector.

```python
from sklearn.preprocessing import OneHotEncoder

# Dataset with categorical features
x = [['Male', 1], ['Female', 3], ['Female', 2]]

# Applying one-hot encoding
encoder = OneHotEncoder()
x_encoded = encoder.fit_transform(x).toarray()

x_encoded
```
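
To interpret the columns of the encoded matrix, it helps to look at the categories the encoder found. The following sketch uses the fitted encoder's `categories_` attribute; the variable names are illustrative.

```python
from sklearn.preprocessing import OneHotEncoder

x_cat = [["Male", 1], ["Female", 3], ["Female", 2]]

encoder_demo = OneHotEncoder()
x_onehot = encoder_demo.fit_transform(x_cat).toarray()

# One list of categories per input column, in sorted order
print(encoder_demo.categories_)

# Two categories in the first column plus three in the second give five output columns
print(x_onehot.shape)
print(x_onehot[0])   # 'Male' and 1 -> [0. 1. 1. 0. 0.]
```

Each input column is expanded into as many binary columns as it has distinct values, and exactly one of them is set to 1 per sample.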


**Iris:** The Iris dataset's target variable (`y`) is already numerical and represents classes as `0`, `1`, and `2`. Typically, you would **not** one-hot encode the target variable for most scikit-learn classifiers since they handle numerical class labels directly.

However, for demonstration:

```python
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()
y_encoded = encoder.fit_transform(y_iris.reshape(-1, 1)).toarray()

# y_encoded is now one-hot encoded
y_encoded
```


#### Feature Selection with `SelectKBest`

Not all features are equally important for a model. `SelectKBest` keeps the `k` features that score highest under a statistical test such as the [chi-squared](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html) test. Removing uninformative features can improve model performance and reduce overfitting. Note that `chi2` requires non-negative feature values.

```python
from sklearn.feature_selection import SelectKBest, chi2

# Sample dataset
x = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
y = [0, 1, 0]

# Selecting the 2 best features based on chi-squared test
selector = SelectKBest(chi2, k=2)
x_selected = selector.fit_transform(x, y)

x_selected
```


**Iris:** To select the two best features according to their [ANOVA F-value](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html):

```python
from sklearn.feature_selection import SelectKBest, f_classif

selector = SelectKBest(f_classif, k=2)
x_selected = selector.fit_transform(x_iris, y_iris)

# x_selected now contains only the 2 features deemed most important
x_selected
```
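
The selected array no longer tells you *which* columns were kept. As a small sketch, the fitted selector's `scores_` attribute and `get_support()` method can be used to recover that information; the variable names below are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

iris_fs = load_iris()
selector_fs = SelectKBest(f_classif, k=2).fit(iris_fs.data, iris_fs.target)

# scores_ holds one F-value per feature; larger means stronger class separation
for name, score in zip(iris_fs.feature_names, selector_fs.scores_):
    print(f"{name}: {score:.1f}")

# get_support() returns a boolean mask marking the selected columns
mask = selector_fs.get_support()
selected_names = [name for name, keep in zip(iris_fs.feature_names, mask) if keep]
print(selected_names)
```

For Iris, the two petal measurements receive the highest F-values and are therefore the ones retained.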


#### Feature Extraction Techniques

[Feature extraction](https://scikit-learn.org/stable/modules/feature_extraction.html) is crucial for converting raw data into a structured format. Scikit-learn provides several tools for this purpose, such as `DictVectorizer` for feature arrays represented as lists of dictionaries and `CountVectorizer` for converting text documents into a matrix of token counts.


```python
from sklearn.feature_extraction.text import CountVectorizer

# Sample text data
corpus = ['text data preprocessing', 'feature extraction with scikit-learn']

# Vectorizing text data
vectorizer = CountVectorizer()
x_vectorized = vectorizer.fit_transform(corpus).toarray()

x_vectorized
```
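
Each column of the count matrix corresponds to one token of the learned vocabulary. The sketch below makes that mapping visible through the fitted vectorizer's `vocabulary_` attribute (a dictionary from token to column index); the variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus_demo = ["text data preprocessing", "feature extraction with scikit-learn"]

vectorizer_demo = CountVectorizer()
counts = vectorizer_demo.fit_transform(corpus_demo).toarray()

# Tokens sorted by their column index in the count matrix
tokens = sorted(vectorizer_demo.vocabulary_, key=vectorizer_demo.vocabulary_.get)
print(tokens)

# First document: one count each for 'data', 'preprocessing', and 'text'
print(dict(zip(tokens, counts[0])))
```

Note that the default tokenizer splits `scikit-learn` at the hyphen into the two tokens `scikit` and `learn`.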

**Iris:** Feature extraction is more relevant to datasets with raw data like text or images. Since the Iris dataset is already in a structured format, feature extraction like text vectorization doesn't directly apply. However, you can transform or create new features based on existing ones, a process known as feature engineering, which might involve calculating interactions between features or other transformations.
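
As a minimal sketch of such feature engineering, one could add an interaction feature to the Iris data. The "petal area" feature below (petal length times petal width) is a hypothetical example chosen for illustration, not a feature used elsewhere in this course.

```python
import numpy as np
from sklearn.datasets import load_iris

x_fe = load_iris().data

# Hypothetical engineered feature: petal length (column 2) * petal width (column 3)
petal_area = x_fe[:, 2] * x_fe[:, 3]

# Append the new feature as a fifth column
x_extended = np.column_stack([x_fe, petal_area])

print(x_extended.shape)
```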

In summary, these examples demonstrate how to apply various preprocessing techniques to generic data and to the Iris dataset using scikit-learn, preparing the data for machine learning modeling. 

This chapter prepares you for the important task of data preprocessing, so that your dataset is in the best possible shape for training models.

### [2.3 Preparing for model training](#23-preparing-for-model-training)

Before training a model, it's vital to prepare your dataset properly. Besides data preprocessing, this preparation involves creating separate datasets for training and testing. By doing so, you can train your model on one portion of the data and evaluate its performance on a separate set that it hasn't seen before, which is crucial for assessing the model's ability to generalize.

<img src=https://miro.medium.com/v2/resize:fit:580/1*OECM6SWmlhVzebmSuvMtBg.png height=300>

#### Splitting a dataset using `train_test_split`

The purpose of splitting your dataset into training and testing sets is to provide an honest assessment of the model's performance. Scikit-learn offers the `train_test_split` function, which is an efficient way to randomly partition the dataset.

The `train_test_split` function shuffles your data and then splits it into training and testing subsets. You can specify the proportion of the dataset you want to allocate for testing with the `test_size` parameter.


```python
import numpy as np
from sklearn.model_selection import train_test_split

# Generate dummy data
x_dummy = np.random.rand(100, 4)  # 100 samples, 4 features
y_dummy = np.random.randint(0, 2, 100)  # Binary target variable

# Splitting the dataset into training and testing sets
x_train_dummy, x_test_dummy, y_train_dummy, y_test_dummy = train_test_split(x_dummy, y_dummy, test_size=0.2, random_state=42)

print(f"Training set size: {x_train_dummy.shape[0]} samples")
print(f"Testing set size: {x_test_dummy.shape[0]} samples")
```


Here's how to apply it to the Iris dataset:

```python
from sklearn.model_selection import train_test_split

# Assuming x_iris and y_iris are already defined (Iris dataset features and target)
x_train, x_test, y_train, y_test = train_test_split(x_iris, y_iris, test_size=0.2, random_state=42)

# This splits the dataset into 80% training data and 20% testing data
print(f"{x_train.shape = }")
print(f"{x_test.shape = }")
print(f"{y_train.shape = }")
print(f"{y_test.shape = }")
```


In this example, `20%` of the data is reserved for testing, and the `random_state` parameter ensures that the split is reproducible; using the same seed will produce the same split in future runs.

#### Stratified Splitting to Preserve Class Proportions

In some cases, especially in datasets where some classes are underrepresented, it's important to use a stratified split. This method ensures that the proportion of classes in both the training and testing sets reflects that of the original dataset.

```python
import numpy as np

# Let's check how many samples of each class ended up in the training and testing sets
print(f"{np.unique(y_train, return_counts=True)}")
print(f"{np.unique(y_test, return_counts=True)}")
```

Scikit-learn's `train_test_split` function allows for *stratified* splitting by using the `stratify` parameter:

```python
x_train, x_test, y_train, y_test = train_test_split(x_iris, y_iris, test_size=0.2, stratify=y_iris, random_state=41)

# This again splits the dataset into 80% training data and 20% testing data, this time accounting for the y ratio
# Let's check the class counts in the training and testing sets again
print(f"{np.unique(y_train, return_counts=True)}")
print(f"{np.unique(y_test, return_counts=True)}")
```

By setting `stratify=y_iris`, we ensure that the distribution of classes in both the training and testing sets matches the original dataset as closely as possible.
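
The effect of stratification can be verified with a self-contained sketch. Because Iris has exactly 50 samples per class, a stratified 80/20 split puts 40 samples of each class into the training set and 10 into the testing set. The variable names and the seed `0` are illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

x_st, y_st = load_iris(return_X_y=True)

_, _, y_tr_st, y_te_st = train_test_split(
    x_st, y_st, test_size=0.2, stratify=y_st, random_state=0
)

# np.bincount counts how often each class label (0, 1, 2) occurs
print(np.bincount(y_tr_st))   # [40 40 40]
print(np.bincount(y_te_st))   # [10 10 10]
```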

#### Cross-Validation: An Alternative to Train/Test Split

While splitting your data into `training` and `testing` sets is a good practice, it's also useful to employ `cross-validation` as an additional step. 

<img src=https://www.researchgate.net/publication/340567535/figure/fig2/AS:880966289588226@1587050139118/Train-test-cross-validation-split-methodology-used-in-this-paper-The-first-operation.jpg height=400>

Cross-validation involves splitting the dataset into multiple chunks (called `folds`) and training multiple models, each time holding out a different fold as the test set. This gives a more robust estimate of the model's performance than a single split.

Scikit-learn offers various [cross-validation strategies](https://scikit-learn.org/stable/modules/cross_validation.html), but a simple and commonly used method is **K-fold cross-validation**, which can be performed as follows:

```python
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

# Initialize a simple classifier (We will cover classifiers in detail in chapter 3.1)
classifier_dummy = RandomForestClassifier(random_state=42)

# Performing 5-fold cross-validation
scores_dummy = cross_val_score(classifier_dummy, x_dummy, y_dummy, cv=5)

print("Cross-validation accuracy scores:", scores_dummy) # (We will cover scoring in chapter 3.3)
```


Here is how to apply cross-validation to our Iris data.

```python
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

# Initialize the model
classifier = RandomForestClassifier(random_state=42)

# Perform 5-fold cross-validation
scores = cross_val_score(classifier, x_iris, y_iris, cv=5)

print("Accuracy scores for each fold:", scores)
print(f"Mean accuracy: {scores.mean():.2f}")
```


In this example, the dataset is split into five parts; the model is trained five times, each time using a different part as the test set. The `cross_val_score` function then returns the accuracy of the model for each fold, providing insight into how the model performs across different subsets of the data.
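
To make the procedure more concrete, the following sketch spells out the loop that `cross_val_score` performs internally, at least conceptually. It assumes stratified, unshuffled folds (`StratifiedKFold`), which is what scikit-learn documents as the default when `cv` is an integer and the estimator is a classifier; the variable names are illustrative.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

x_cv, y_cv = load_iris(return_X_y=True)
model_cv = RandomForestClassifier(random_state=42)

folds = StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, test_idx in folds.split(x_cv, y_cv):
    fold_model = clone(model_cv)                       # fresh, unfitted copy for each fold
    fold_model.fit(x_cv[train_idx], y_cv[train_idx])   # train on four folds
    manual_scores.append(fold_model.score(x_cv[test_idx], y_cv[test_idx]))  # test on the fifth

manual_scores = np.array(manual_scores)
auto_scores = cross_val_score(model_cv, x_cv, y_cv, cv=5)

print(manual_scores)
print(auto_scores)
```

Every sample is used for testing exactly once, and no model is ever evaluated on data it was trained on.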

By combining train/test splits with cross-validation, you can ensure that your models are both effective and generalizable, ready for the final steps of training and evaluation.

---
# --> 🚀💻💥 *Coding Challenge ([Step 0-3](#34-coding-challenge))*
---

## [3. Supervised learning](#3-supervised-learning)

**Supervised learning** is a cornerstone of machine learning, where the goal is to learn a mapping from inputs (`features`) to outputs (`targets`), given a labeled dataset. This chapter delves into the two primary categories of supervised learning: *classification* and *regression*. Using `scikit-learn`, we'll explore how to apply these concepts to real-world datasets, including a detailed walkthrough with the Iris dataset for classification and another dataset for regression to provide comprehensive insights.

Scikit-learn offers a comprehensive suite of tools for implementing supervised learning models, which you can find explained in detail for [regression](https://scikit-learn.org/stable/supervised_learning.html#supervised-learning) and for [classification](https://scikit-learn.org/stable/supervised_learning.html#supervised-learning) algorithms.



### [3.1 Classification models](#31-classification-models)

The Iris dataset is *the* classic example for classification, where the task is to predict the species of an iris flower based on its sepal and petal dimensions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load data
iris = load_iris()
x, y = iris.data, iris.target

# Split the data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1, stratify=y, random_state=42)

# Initialize the model
classifier = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model
classifier.fit(x_train, y_train)

# Predictions
y_pred = classifier.predict(x_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Classification Accuracy: {accuracy:.2f}")
```
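
Because all classifiers share the same `fit()`/`predict()` interface, trying a different algorithm only requires changing the line that creates the model. The sketch below compares three classifiers on the same split; `LogisticRegression` and `KNeighborsClassifier` (and their settings) are illustrative assumptions and are not otherwise covered in this chapter.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

x_cmp, y_cmp = load_iris(return_X_y=True)
x_tr, x_te, y_tr, y_te = train_test_split(
    x_cmp, y_cmp, test_size=0.1, stratify=y_cmp, random_state=42
)

# Three interchangeable classifiers (illustrative selection)
candidates = {
    "random forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "logistic regression": LogisticRegression(max_iter=1000),
    "k-nearest neighbors": KNeighborsClassifier(n_neighbors=5),
}

results = {}
for name, model in candidates.items():
    model.fit(x_tr, y_tr)                                      # same call for every model
    results[name] = accuracy_score(y_te, model.predict(x_te))
    print(f"{name}: {results[name]:.2f}")
```

With only 15 test samples, small differences between the models should not be over-interpreted; cross-validation gives a more reliable comparison.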



### [3.2 Regression models](#32-regression-models)

In `regression models`, the goal is to predict a continuous value. Scikit-learn provides several regression models, including linear regression, decision trees, and support vector regression. For example, predicting the price of a house based on its features (size, location, etc.) is a regression problem.

#### California housing dataset for regression

Since the Iris target is a class label rather than a continuous value, we use a different dataset for regression: the [California Housing dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html), which is a popular dataset for regression tasks.

This dataset contains information related to housing in California, such as `median income`, `housing median age`, `average rooms`, `average bedrooms`, `population`, `average occupancy`, `latitude`, and `longitude`. The target variable is the `median house value` for California districts.

```python
from sklearn.datasets import fetch_california_housing

# Load California housing dataset
housing = fetch_california_housing()
x_h, y_h = housing.data, housing.target

print(f"{x_h.shape = }")
print(f"{y_h.shape = }")
```


In this example, we also scale the data using `StandardScaler`. As introduced in the last chapter, scaling features is a common preprocessing step for many machine learning algorithms, especially for those sensitive to the scale of the data, like SVMs or neural networks. To avoid leaking information from the test set into training, we split the data first, fit the scaler on the training set only, and then apply it to both sets.

Tree-based models like *RandomForest* are largely insensitive to feature scale, so scaling is not strictly required for the model used below. We include it to show the general workflow, which matters as soon as you swap in a scale-sensitive model.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# First, we split the dataset into training and testing sets
X_train_h, X_test_h, y_train_h, y_test_h = train_test_split(x_h, y_h, test_size=0.3, random_state=42)

# Next, we fit the scaler on the training data only and apply it to both sets
scaler = StandardScaler()
X_train_h = scaler.fit_transform(X_train_h)
X_test_h = scaler.transform(X_test_h)
```


```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Initialize the RandomForestRegressor
regressor_h = RandomForestRegressor(n_estimators=100, random_state=42)

# Train the model on the training data
regressor_h.fit(X_train_h, y_train_h)

# Predict the median house values on the testing set
y_pred_h = regressor_h.predict(X_test_h)

# Evaluate the model using the mean squared error (MSE) metric
mse_h = mean_squared_error(y_test_h, y_pred_h)
print(f"Regression Mean Squared Error on California Housing Dataset: {mse_h:.2f}")
```

This comprehensive approach, from data loading and preprocessing to model training and evaluation, illustrates the typical workflow for a regression task in supervised learning. 

By exploring different datasets and regression models, you gain a deeper understanding of how to tackle various types of regression problems effectively.

### [3.3 Model Evaluation](#33-model-evaluation)

Model evaluation is a critical step in the machine learning workflow. It allows you to assess the performance of your model and understand its strengths and weaknesses. In this chapter, we’ll cover essential evaluation metrics and tools used in both classification and regression tasks. Understanding these concepts will help you choose the right metric for your specific problem and ensure that your model meets the desired objectives.

#### Classification Metrics

For classification problems, *accuracy*, *precision*, *recall*, *F1 score*, and the *confusion matrix* are commonly used metrics.

* **Accuracy:** Accuracy measures the proportion of true results (both true positives and true negatives) among the total number of cases examined. It's a good measure when the class distributions are similar.

```python
from sklearn.metrics import accuracy_score

# True and predicted labels for ten samples
y_true = [0,0,1,0,0,1,0,0,1,0]
y_pred = [0,1,0,0,0,1,1,0,1,0]

accuracy = accuracy_score(y_true, y_pred)
print(f"Accuracy: {accuracy:.2f}")
```


* **Precision and Recall:** Precision measures the accuracy of positive predictions (i.e., the proportion of true positives among all positive predictions), whereas recall (or sensitivity) measures the ability of the classifier to find all positive samples (i.e., the proportion of true positives among all actual positives).

```python
from sklearn.metrics import precision_score, recall_score

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

print(f"Precision: {precision:.2f}, Recall: {recall:.2f}")
```


* **F1-Score:** The F1 score is the harmonic mean of precision and recall, providing a balance between the two metrics. It's especially useful when the class distribution is uneven.

In [None]:
from sklearn.metrics import f1_score

f1 = f1_score(y_true, y_pred)
print(f"F1 Score: {f1:.2f}")


* **Confusion Matrix:** The confusion matrix is a table that describes the performance of a classification model on a set of test data for which the true values are known. It allows you to see the errors made by the classifier.

In [None]:
from sklearn.metrics import confusion_matrix

conf_matrix = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:\n", conf_matrix)
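
The four metrics above can all be derived from the counts in the confusion matrix. The following is a minimal sketch that reuses the example labels from the cells above and recomputes accuracy, precision, recall, and the F1 score by hand; the `*_manual` variable names are chosen here for illustration only.

```python
from sklearn.metrics import confusion_matrix

# Same example labels as in the cells above
y_true = [0, 0, 1, 0, 0, 1, 0, 0, 1, 0]
y_pred = [0, 1, 0, 0, 0, 1, 1, 0, 1, 0]

# For binary labels 0/1, rows are the true classes and columns the predicted
# classes, so ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy_manual = (tp + tn) / (tp + tn + fp + fn)
precision_manual = tp / (tp + fp)   # share of predicted positives that are correct
recall_manual = tp / (tp + fn)      # share of actual positives that were found
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"Accuracy: {accuracy_manual:.2f}, Precision: {precision_manual:.2f}, "
      f"Recall: {recall_manual:.2f}, F1: {f1_manual:.2f}")
```

For these labels the counts are TN=5, FP=2, FN=1, TP=2, so the hand-computed values match the results of `accuracy_score`, `precision_score`, `recall_score`, and `f1_score`.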


#### Regression Metrics

For regression tasks, *Mean Squared Error (MSE)* and *Mean Absolute Error (MAE)* are widely used metrics.

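* **Mean Squared Error (MSE):** MSE measures the average squared difference between the predicted and the actual values. Because the errors are squared, large errors are penalized more heavily than small ones, and the result is expressed in squared units of the target variable.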

In [None]:
from sklearn.metrics import mean_squared_error

y_true = [1.1,2,3.3,4,5]
y_pred = [2.2,2,3,5,6]

# Assuming y_true and y_pred are the true and predicted values respectively
mse = mean_squared_error(y_true, y_pred)
print(f"Mean Squared Error: {mse:.2f}")


* **Mean Absolute Error (MAE):** MAE measures the average absolute difference between the predicted and the actual values, providing a linear score in which each error contributes in proportion to its size.

In [None]:
from sklearn.metrics import mean_absolute_error

mae = mean_absolute_error(y_true, y_pred)
print(f"Mean Absolute Error: {mae:.2f}")

#### Selecting the right metric

The choice of metric depends on your specific problem and goals. For instance, in a medical diagnosis problem, recall might be more important than precision, as missing a positive case could be more detrimental than falsely labeling a negative case as positive. Conversely, in a spam detection system, precision might be more critical to avoid filtering out important emails.

Understanding these metrics and their implications will help you evaluate your models effectively, guiding you towards making improvements and ultimately achieving better performance.
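
The trade-off between precision and recall can be sketched with a decision threshold. The example below assumes a classifier that outputs a probability for the positive class; the labels and scores are hypothetical values invented for illustration, not results from a real model.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical true labels and predicted probabilities for the positive class
y_true_demo = np.array([0, 0, 0, 1, 0, 1, 0, 1, 1, 1])
y_score_demo = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.70, 0.80, 0.90])

results = []
for threshold in (0.3, 0.5, 0.75):
    # Predict "positive" whenever the score reaches the threshold
    y_pred_demo = (y_score_demo >= threshold).astype(int)
    precision = precision_score(y_true_demo, y_pred_demo)
    recall = recall_score(y_true_demo, y_pred_demo)
    results.append((threshold, precision, recall))
    print(f"threshold={threshold:.2f}  precision={precision:.2f}  recall={recall:.2f}")
```

In this toy example, a low threshold finds every positive case at the cost of more false alarms (the medical diagnosis scenario), while a high threshold makes fewer false alarms but misses positives (the spam filter scenario).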

---
# --> 🚀💻💥 *Coding Challenge ([Step 4-5](#34-coding-challenge))*
---

### [3.4 Coding Challenge](#34-coding-challenge)

As a practical exercise, we revisit the injection molding dataset used in the matplotlib course unit. This challenge involves predicting the `"quality"` column from the provided dataset.

Like before, we'll break down the challenge into several steps, each focusing on a different aspect of the machine learning process.  

#### Task 

Develop a machine learning model with scikit-learn that predicts the `"quality"` of our injection molding experiments based on the various manufacturing parameters.

#### The dataset 

The dataset should already be known from the previous exercise. It includes manufacturing parameters like `melt temperature`, `mold temperature`, and various measurements related to the manufacturing process, with the target variable being `"quality"`.

**Step 0:** Load and Examine the Dataset

* Load the `data.csv` dataset, encompassing process parameters and each lens's quality classification.
* Conduct an initial examination to comprehend its composition and content structure (e.g. by printing the head of the data).

In [None]:
# Load the injection molding data like we did in the last course unit
import pandas as pd

data = pd.read_csv("../data/data.csv", delimiter=";")

data.head()

**Step 1:** Data Loading and Preprocessing
* Separate the measurements from the target variable (`"quality"`).

In [None]:
# Separate the measurements from the target variable "quality"

import pandas as pd

y = data["quality"]
x = data[data.columns[:-1]]
# x = data.drop("quality", axis=1)

print(f"{x.shape = }")
print(f"{y.shape = }")


**Step 2:** Splitting the Data
* Split the dataset into training and testing sets using `train_test_split`.

* Select a `test_size` of `20%`

In [None]:
# Split the dataset into training and testing sets using `train_test_split`.
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

print(f"{x_train.shape = }")
print(f"{x_test.shape = }")
print(f"{y_train.shape = }")
print(f"{y_test.shape = }")

**Step 3:** Feature Scaling
* Apply feature scaling to the dataset using the `StandardScaler` to standardize the features.

In [None]:
# Apply feature scaling to the dataset using the StandardScaler

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)
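
The sketch below illustrates what `StandardScaler` does and why the scaler is fitted on the training data only. It uses synthetic stand-in features on very different scales instead of the real dataset; all `*_demo` names and the generated values are assumptions made for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in features on very different scales (not the real dataset)
x_train_demo = np.column_stack([rng.normal(250, 10, 80), rng.normal(0.5, 0.1, 80)])
x_test_demo = np.column_stack([rng.normal(250, 10, 20), rng.normal(0.5, 0.1, 20)])

scaler_demo = StandardScaler()
# fit_transform learns mean and standard deviation from the training data only
x_train_demo_scaled = scaler_demo.fit_transform(x_train_demo)
# transform reuses those training statistics, so no test information leaks in
x_test_demo_scaled = scaler_demo.transform(x_test_demo)

print("Train mean:", x_train_demo_scaled.mean(axis=0).round(3))
print("Train std: ", x_train_demo_scaled.std(axis=0).round(3))
print("Test mean: ", x_test_demo_scaled.mean(axis=0).round(3))
```

The scaled training features have mean 0 and standard deviation 1 per column. The test features are only approximately standardized, which is expected because they are scaled with the training statistics.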


**Step 4:** Model Selection and Training

* Choose a suitable model. For this example, we can again use `RandomForestClassifier` as a starting point.

* Train the model on the training set.

In [None]:
# Train the model on the training set

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

model = RandomForestClassifier(random_state=42)
model.fit(x_train_scaled, y_train)

# Perform 5-fold cross-validation on the training data,
# keeping the test set untouched for the final evaluation in Step 5
scores_cv = cross_val_score(model, x_train_scaled, y_train, cv=5)

print("Cross-validation accuracy scores:", scores_cv) # (Scoring is covered in chapter 3.3)


**Step 5:** Model Evaluation
* Evaluate the model's performance on the test set using appropriate metrics.

    * We are going to apply `accuracy`, `precision`, `recall` and `f1 score`

In [None]:
# Evaluate the model's performance on the test set using appropriate metrics.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

y_pred = model.predict(x_test_scaled)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, average='macro'))
print("Recall:", recall_score(y_test, y_pred, average='macro'))
print("F1 Score:", f1_score(y_test, y_pred, average='macro'))

print(confusion_matrix(y_test, y_pred))
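
The `average='macro'` argument used above is needed for targets with more than two classes. As a minimal sketch with hypothetical three-class labels (not taken from the dataset), macro averaging is the unweighted mean of the per-class scores, whereas `average='weighted'` weights each class by its number of true samples.

```python
from sklearn.metrics import precision_score

# Hypothetical three-class labels, invented for illustration
y_true_demo = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred_demo = [0, 0, 0, 1, 1, 2, 2, 2, 2, 0]

# average=None returns one precision value per class
per_class = precision_score(y_true_demo, y_pred_demo, average=None)
macro = precision_score(y_true_demo, y_pred_demo, average="macro")
weighted = precision_score(y_true_demo, y_pred_demo, average="weighted")

print("Per-class precision:", per_class)
print(f"Macro average:    {macro:.3f}")     # plain mean of the per-class values
print(f"Weighted average: {weighted:.3f}")  # mean weighted by class frequency
```

Macro averaging treats every class as equally important, which makes weak performance on a rare class visible in the final score.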


This challenge provides a comprehensive overview of applying supervised learning techniques to a real-world dataset, from preprocessing and model training to evaluation and optimization, offering a hands-on experience with scikit-learn.

[--> Back to Outline](#course-outline)

---

## **[5. Best Practices and Resources](#5-best-practices-and-resources)**

### [5.1 Best Practices in Machine Learning](#51-best-practices-in-machine-learning)

Some of the best practices in machine learning include understanding the problem and data, iterative modeling, and continuous evaluation and tuning.

1. Clearly define the problem you are trying to solve.
2. Spend time on preprocessing and understanding your data.
3. Use cross-validation to estimate the performance of your models.
4. Keep experimenting with different models and hyperparameters.
5. Document your experiments and results for future reference.

### [5.2 Learning Resources](#52-learning-resources)

To further your knowledge in Scikit-learn and machine learning, explore the official Scikit-learn documentation, online courses, books, and community forums.

* Official Scikit-learn [Documentation](https://scikit-learn.org/stable/index.html)

* Coursera, edX, and Udacity for machine learning courses

* Books like "[Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/)"

### [5.3 Q&A Session](#53-qa-session)

This section aims to shed light on some frequently asked questions and hurdles encountered by practitioners while working with Scikit-learn, offering insights and best practices for efficient machine learning model development and evaluation.

Q: How do I choose the right machine learning model for my problem?

A: Start with a simple model to establish a baseline and understand your data. The choice of model depends on the nature of your problem (classification, regression, clustering, etc.), the size and type of your data, and the computational efficiency you need. Use Scikit-learn's model selection tools, such as `cross_val_score`, to compare models.
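
A minimal sketch of such a comparison, assuming the iris dataset from this course unit and three arbitrarily chosen candidates (a `DummyClassifier` as the trivial baseline, plus logistic regression and a decision tree):

```python
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)

# Candidate models; the selection is illustrative, not a recommendation
candidates = {
    "baseline (most frequent class)": DummyClassifier(strategy="most_frequent"),
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=42),
}

mean_scores = {}
for name, estimator in candidates.items():
    # 5-fold cross-validated accuracy for each candidate
    scores = cross_val_score(estimator, X_iris, y_iris, cv=5)
    mean_scores[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

A model is only worth keeping if it clearly beats the trivial baseline; the spread across folds indicates how stable the estimate is.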

Q: What are the most effective ways to handle overfitting in Scikit-learn?

A: Overfitting can be mitigated by simplifying the model (reducing complexity), collecting more data, or applying regularization (`L1`, `L2`). Many Scikit-learn models provide parameters to adjust the regularization strength. Cross-validation does not prevent overfitting by itself, but it helps you detect it and select settings that generalize better.
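
The following sketch shows how overfitting can be detected by comparing training accuracy with cross-validated accuracy. It assumes synthetic, deliberately noisy data from `make_classification` and uses the `max_depth` of a decision tree as the complexity setting; both choices are made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; flip_y randomly reassigns about 20% of the labels (label noise)
X_noisy, y_noisy = make_classification(
    n_samples=300, n_features=10, n_informative=4, flip_y=0.2, random_state=42
)

results = {}
for label, max_depth in (("unrestricted", None), ("max_depth=3", 3)):
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=42)
    # Accuracy on the data the tree was trained on
    train_score = tree.fit(X_noisy, y_noisy).score(X_noisy, y_noisy)
    # Cross-validated accuracy estimates performance on unseen data
    cv_score = cross_val_score(tree, X_noisy, y_noisy, cv=5).mean()
    results[label] = (train_score, cv_score)
    print(f"{label}: train={train_score:.2f}, cross-validation={cv_score:.2f}")
```

The unrestricted tree memorizes the noisy training labels, so its training accuracy is far higher than its cross-validated accuracy. A large gap of this kind is the typical symptom of overfitting.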

Q: Is it possible to handle both numerical and categorical data with Scikit-learn?

A: Yes, Scikit-learn can handle both types of data. Numerical data can be used directly, while categorical data usually needs to be encoded before use. Use `OneHotEncoder` or `OrdinalEncoder` for categorical features; `LabelEncoder` is intended for encoding target labels. The `ColumnTransformer` is a powerful tool for applying different preprocessing steps to different columns.
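
A minimal sketch of mixed preprocessing with `ColumnTransformer`. The small DataFrame below is hypothetical: its column names and values are invented for illustration and are not taken from the course dataset.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical mini dataset with two numerical and one categorical column
df_demo = pd.DataFrame({
    "melt_temperature": [240.0, 250.0, 245.0, 260.0],
    "mold_temperature": [60.0, 65.0, 70.0, 62.0],
    "material": ["A", "B", "C", "A"],
})

preprocessor = ColumnTransformer(
    transformers=[
        # Standardize the numerical columns
        ("num", StandardScaler(), ["melt_temperature", "mold_temperature"]),
        # One-hot encode the categorical column; unseen categories are ignored
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["material"]),
    ],
    sparse_threshold=0,  # always return a dense NumPy array
)

features_demo = preprocessor.fit_transform(df_demo)
print(features_demo.shape)
print(features_demo.round(2))
```

The result has five columns: two scaled numerical features followed by one indicator column for each of the three material categories.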

Q: Can Scikit-learn be used for deep learning tasks?

A: While Scikit-learn includes some basic neural network models via `MLPClassifier` and `MLPRegressor`, it is primarily designed for traditional machine learning. For deep learning tasks, libraries like [TensorFlow](https://www.tensorflow.org/) or [PyTorch](https://pytorch.org/) are more specialized and offer greater flexibility and control.

Q: How do I improve the performance of my Scikit-learn model?

A: Model performance can be enhanced by feature engineering, hyperparameter tuning using `GridSearchCV` or `RandomizedSearchCV`, and using more complex models or ensemble methods. Additionally, ensure your data is properly preprocessed and consider using `pipelines` to streamline your workflow.
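
As a sketch of how these pieces fit together, the example below combines a `Pipeline` with `GridSearchCV` on the iris dataset. The step names `"scaler"` and `"clf"`, the choice of `SVC`, and the parameter grid are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=42, stratify=y_iris
)

# Scaling is part of the pipeline, so it is re-fitted inside every CV fold
pipeline = Pipeline([("scaler", StandardScaler()), ("clf", SVC())])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {"clf__C": [0.1, 1, 10], "clf__kernel": ["linear", "rbf"]}

search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print(f"Best cross-validation accuracy: {search.best_score_:.3f}")
print(f"Test accuracy: {search.score(X_te, y_te):.3f}")
```

Putting the scaler inside the pipeline keeps the preprocessing and the model together, so the held-out fold in each cross-validation split never influences the scaling statistics.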

[--> Back to Outline](#course-outline)

---