***Set up:*** Let's define the default font sizes to make the figures prettier.

In [2]:
import matplotlib.pyplot as plt

plt.rc('font', size=14)
plt.rc('axes', labelsize=14, titlesize=14)
plt.rc('legend', fontsize=14)
plt.rc('xtick', labelsize=10)
plt.rc('ytick', labelsize=10)

In [3]:
import numpy as np
from sklearn.datasets import fetch_openml # download real-life datasets

np.random.seed(42)

mnist = fetch_openml('mnist_784', as_frame=False)

X, y = mnist.data, mnist.target

X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:] # split the data into training and test sets

### **Training a Binary Classifier.**

First, we need to simplify the problem and only try to identify one digit. A "5-detector" is an example of a binary classifier: it distinguishes between just two classes, 5 and not-5.

In [4]:
y_train_5 = (y_train == '5') # True for all 5s, False for all other digits
y_test_5 = (y_test == '5')

Now let’s pick a classifier and train it. A good place to start is a stochastic gradient descent (SGD) classifier, using Scikit-Learn’s [`SGDClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html) class.

In [5]:
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)

In [6]:
sgd_clf.predict([X[0]])

array([ True])

## **Performance Measures.**

### **1. Measuring Accuracy Using Cross-Validation:**

Let’s use the `cross_val_score()` function to evaluate our `SGDClassifier` model, using k-fold cross-validation with three folds. Remember that k-fold cross-validation means splitting the training set into k folds (in this case, three), then training the model k times, holding out a different fold each time for evaluation.

In [7]:
from sklearn.model_selection import cross_val_score
cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring='accuracy') # 3-fold cross-validation, accuracy as scoring metric

array([0.95035, 0.96035, 0.9604 ])
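To make the fold mechanics concrete, here is a minimal sketch on toy data. The array `X_toy` and the list `held_out` are hypothetical names used only for illustration. It uses plain `KFold` without shuffling to keep the folds easy to read, whereas `cross_val_score()` with a classifier and an integer `cv` uses stratified folds.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data (hypothetical): six samples, so three folds of two samples each.
X_toy = np.arange(6).reshape(-1, 1)

kf = KFold(n_splits=3)  # no shuffling, so the folds are contiguous blocks
held_out = []
for train_index, test_index in kf.split(X_toy):
    # Each iteration trains on two folds and evaluates on the remaining one.
    print("train:", train_index, "held out:", test_index)
    held_out.append(test_index.tolist())
```

Every sample lands in the held-out fold exactly once, which is why k-fold cross-validation produces k scores.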

Implementing cross-validation manually with `StratifiedKFold`:

In [8]:
from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone

skfolds = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

for train_index, test_index in skfolds.split(X_train, y_train_5):
    clone_clf = clone(sgd_clf)
    X_train_folds = X_train[train_index]
    y_train_folds = y_train_5[train_index]
    X_test_fold = X_train[test_index]
    y_test_fold = y_train_5[test_index]

    clone_clf.fit(X_train_folds, y_train_folds)
    y_pred = clone_clf.predict(X_test_fold)
    n_correct = sum(y_pred == y_test_fold)
    print(n_correct / len(y_pred))

0.9669
0.91625
0.96785


`StratifiedKFold` is a Scikit-Learn class that splits data into training and test sets during cross-validation. Its distinguishing feature is stratification: each fold preserves the class distribution of the full dataset.

When using `StratifiedKFold`, you specify the number of folds to create with `n_splits`. Each fold is built so that the class proportions of the original data are maintained, which ensures that every fold contains a representative share of each class.

For example, suppose you have a dataset of 100 samples in 2 classes, with 80 samples of class 0 and 20 of class 1, and you split it into 5 folds. `StratifiedKFold` divides the data so that each fold contains 80% class 0 samples and 20% class 1 samples (16 and 4 samples, respectively). Both classes are therefore represented in every fold in the same ratio as in the full dataset.
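The 80/20 example can be checked with a small sketch. The arrays `X_demo` and `y_demo` are hypothetical stand-ins (the features are all zeros because only the labels drive the stratification), and `fold_counts` is an illustrative helper list.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels: 80 samples of class 0 and 20 samples of class 1.
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.zeros((100, 1))  # placeholder features; only y affects stratification

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_counts = []
for _, test_index in skf.split(X_demo, y_demo):
    # Count how many samples of each class ended up in this test fold.
    counts = np.bincount(y_demo[test_index], minlength=2)
    fold_counts.append(counts.tolist())
    print(counts)
```

Shuffling changes which samples go into each fold, but not how many samples of each class a fold receives.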

To use `StratifiedKFold`, pass the training data (`X`) and the corresponding labels (`y`) to the `split()` method. It yields the indices of the training and test samples for each fold, which you can use to select the corresponding samples from the data.
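As a minimal sketch of what `split()` yields, the snippet below takes only the first fold of a toy dataset and uses the index arrays to select rows. `X_demo` and `y_demo` are hypothetical arrays made up for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy data: 12 samples with 2 features, two balanced classes.
X_demo = np.arange(24).reshape(12, 2)
y_demo = np.array([0, 1] * 6)

skf = StratifiedKFold(n_splits=3)
# split() is a generator; next() takes the index arrays for the first fold.
train_index, test_index = next(skf.split(X_demo, y_demo))

# The index arrays select matching rows from the features and the labels.
X_train_fold, y_train_fold = X_demo[train_index], y_demo[train_index]
X_test_fold, y_test_fold = X_demo[test_index], y_demo[test_index]
print(X_train_fold.shape, X_test_fold.shape)
```

The training and test indices of a fold never overlap, and together they cover every sample in the dataset.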