EM 538-001: Practical Machine Learning for Engineering Analytics (Spring 2025)  
Instructor: Fred Livingston (fjliving@ncsu.edu)  


#### Install scikit-learn library


In [None]:
# !pip install scikit-learn scipy

### Simple Holdout Method

<img src="images/estimator-api.png" alt="drawing" width="250"/>

#### Simple Holdout Example from File

In [3]:
from sklearn.model_selection import train_test_split
import pandas as pd

df_iris = pd.read_csv('data/iris.csv')
X = df_iris[['PetalLength[cm]', 'PetalWidth[cm]']]
y = df_iris['Species']


X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2,
                                                    random_state=123,
                                                    shuffle=True)

### 3-Way Holdout Method
- In machine learning practice, we often need to experiment with a machine learning algorithm's hyperparameters to find a good setting.
- The process of tuning hyperparameters and comparing and selecting the resulting models is also called "model selection" (in contrast to "algorithm selection").
- We will cover topics such as "model selection" and "algorithm selection" in more detail later in this course.
- For now, we are introducing the simplest way of performing model selection: using the "holdout method."
- In the 3-way holdout method, we split a dataset into 3 subsets: a training, a validation, and a test dataset.
- To avoid biasing the estimate of the generalization performance, we only want to use the test dataset once, which is why we use the validation dataset for hyperparameter tuning (model selection).
- Here, the validation dataset serves as an estimate of the generalization performance, too, but it becomes more biased than the final estimate on the test data because of its repeated re-use during model selection (think of "multiple hypothesis testing").
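
The workflow described in the bullets above can be sketched roughly as follows. This is only an illustration under stated assumptions: the use of `KNeighborsClassifier`, the candidate `n_neighbors` values, the 60/20/20 split, and the decision to refit on the combined training and validation data before the final test evaluation are choices made here for demonstration, not something prescribed by the course material.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data[:, 2:], iris.target  # petal length and petal width

# Step 1: set aside the test set (20% of all data); it is scored only once, at the end.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=123, stratify=y)

# Step 2: split the remainder into training and validation sets
# (25% of the remaining 80% is 20% of the full dataset).
X_train, X_valid, y_train, y_valid = train_test_split(
    X_temp, y_temp, test_size=0.25, shuffle=True, random_state=123,
    stratify=y_temp)

# Step 3: model selection. Fit one model per hyperparameter setting on the
# training set and compare the settings on the validation set.
candidate_k = [1, 3, 5, 7]  # assumed candidate values, for illustration only
valid_scores = {}
for k in candidate_k:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    valid_scores[k] = model.score(X_valid, y_valid)

best_k = max(valid_scores, key=valid_scores.get)

# Step 4: refit with the selected setting on training + validation data,
# then estimate the generalization performance once on the test set.
final_model = KNeighborsClassifier(n_neighbors=best_k)
final_model.fit(X_temp, y_temp)
test_accuracy = final_model.score(X_test, y_test)

print('Validation accuracy per k:', valid_scores)
print('Selected k:', best_k)
print('Test accuracy:', test_accuracy)
```

Because the validation accuracy is consulted once per candidate setting, the best validation score tends to be optimistic; the single test-set score at the end is the less biased estimate.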

<img src="images/holdout-tuning.png" alt="drawing" width="400"/>

## Stratification

- If we split a small dataset randomly into subsets, the class distribution in each subset can differ noticeably from that of the original dataset. This is problematic because machine learning algorithms/models assume that the training, validation, and test samples have been drawn from the same distribution; when that assumption is violated, both the models and the estimates of the generalization performance become less reliable.

<img src="images/iris-subsampling.png" alt="drawing" width="400"/>

- To ensure that the class label proportions are the same in each subset after splitting, we use an approach that is usually referred to as "stratification."
- Stratification is supported in scikit-learn's `train_test_split` function if we pass the class label array to the `stratify` parameter, as shown below.

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np

iris = load_iris()
X, y = iris.data[:, 2:], iris.target
X_temp, X_test, y_temp, y_test = \
        train_test_split(X, y, test_size=0.2, 
                         shuffle=True, random_state=123)
np.bincount(y_temp)

In [None]:
X_temp, X_test, y_temp, y_test = \
        train_test_split(X, y, test_size=0.2, 
                         shuffle=True, random_state=123, stratify=y)
np.bincount(y_temp)

In [None]:
X_train, X_valid, y_train, y_valid = \
        train_test_split(X_temp, y_temp, test_size=0.2,
                         shuffle=True, random_state=123, stratify=y_temp)

# np.bincount returns the number of samples per class (counts, not fractions)
print('Train size', X_train.shape, 'class counts', np.bincount(y_train))
print('Valid size', X_valid.shape, 'class counts', np.bincount(y_valid))
print('Test size', X_test.shape, 'class counts', np.bincount(y_test))
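
Since `np.bincount` reports per-class counts, it can help to normalize them into fractions when comparing subsets of different sizes. The following is a small sketch of that comparison; the helper `class_fractions` is a hypothetical name introduced here for illustration and is not part of scikit-learn or NumPy.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data[:, 2:], iris.target


def class_fractions(labels, n_classes=3):
    """Hypothetical helper: fraction of samples belonging to each class."""
    counts = np.bincount(labels, minlength=n_classes)
    return counts / counts.sum()


# Plain random split: class fractions in the subsets may drift from the original.
_, _, y_temp_plain, y_test_plain = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=123)

# Stratified split: class fractions in the subsets match the original labels.
_, _, y_temp_strat, y_test_strat = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=123, stratify=y)

print('Original      ', class_fractions(y))
print('Random temp   ', class_fractions(y_temp_plain))
print('Random test   ', class_fractions(y_test_plain))
print('Stratified temp', class_fractions(y_temp_strat))
print('Stratified test', class_fractions(y_test_strat))
```

Iris has 50 samples per class, so with stratification each subset keeps one third of its samples in each class, whereas the plain random split is free to deviate from that.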