## Get the data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import datasets

## General technical points

The main API (Application Programming Interface) implemented by scikit-learn is that of the **estimator**. An estimator is any object that learns from data; it may be a classification, regression or clustering algorithm or a transformer that extracts or filters useful features from raw data.

In scikit-learn, all machine learning models are implemented as Python classes. This means in particular that
* they implement learning and predicting,
* they store information learned from the data,
* their .fit() methods govern fitting or training a model, and
* their .predict() methods are used to predict labels for unlabelled data points.
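
The points above can be sketched on a tiny example. The toy data and the names `X_toy`, `y_toy` and `model` are made up for illustration; `LinearRegression` is used only because its learned attributes are easy to read off.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data (made up for illustration): y = 2 * x + 1, without noise
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = 2.0 * X_toy.ravel() + 1.0

model = LinearRegression()  # instantiate; nothing has been learned yet
model.fit(X_toy, y_toy)     # learn from the data

# Learned quantities are stored on the object; by scikit-learn convention
# the names of such attributes end with an underscore.
slope = model.coef_[0]
intercept = model.intercept_

# Predict the target for a previously unseen data point
y_new = model.predict(np.array([[4.0]]))
print(slope, intercept, y_new)
```

Attributes with a trailing underscore, such as `coef_`, only exist after `.fit()` has been called.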

Generally, the scikit-learn API requires:
* all data to be in standard numpy or pandas format with features as columns and samples as rows,
* no missing values.
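
A minimal sketch of checking both requirements before fitting, assuming a small hypothetical table `df` with one missing entry. Dropping incomplete rows is only the simplest remedy, not necessarily the best one.

```python
import numpy as np
import pandas as pd

# Hypothetical raw table: rows are samples, columns are features
df = pd.DataFrame({"height": [1.2, 1.5, np.nan, 1.1],
                   "width": [0.4, 0.6, 0.5, 0.3]})

n_missing = int(df.isna().sum().sum())  # total number of missing entries
df_clean = df.dropna()                  # simplest remedy: drop incomplete rows
X = df_clean.to_numpy()                 # array of shape (n_samples, n_features)
print(n_missing, X.shape)
```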

## Supervised learning

Machine learning refers to methods that allow computers to learn from data. If the data have labels, i.e. the 'right' outcome is known for each sample, this is called supervised learning.
Supervised learning typically has feature variables (predictor variables) and a target variable.
The two basic versions are (1) classification, where the target is categorical, and (2) regression, where the target is continuous.

The **feature space** for supervised learning comes in the form of a samples matrix (or design matrix) X. Typically samples are in rows and features are in columns. The **target space** usually comes in the form of a vector y.

Thus, an estimator for classification is a Python object that implements the methods **fit(X_train, y_train)** and **predict(X_test)**.
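
To illustrate what "implements fit and predict" means, here is a toy estimator written from scratch. `MajorityClassifier` is a hypothetical class, not part of scikit-learn, and it skips the input validation that real estimators perform.

```python
import numpy as np

class MajorityClassifier:
    """Toy estimator: always predicts the most frequent training label."""

    def fit(self, X, y):
        labels, counts = np.unique(y, return_counts=True)
        self.majority_ = labels[np.argmax(counts)]  # state learned from the data
        return self  # returning self mirrors the scikit-learn convention

    def predict(self, X):
        # One prediction per row of X; the features themselves are ignored
        return np.full(len(X), self.majority_)

X_demo = np.zeros((5, 2))            # 5 samples, 2 features
y_demo = np.array([0, 1, 1, 1, 2])   # label 1 is the most frequent
clf = MajorityClassifier().fit(X_demo, y_demo)
pred = clf.predict(np.zeros((3, 2)))
print(pred)
```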

### Classification example

Classification trains a model on labelled samples for the purpose of categorizing unlabelled samples.

The iris data set offers a three-class classification problem with four features per sample.

In [2]:
iris_X, iris_y = datasets.load_iris(return_X_y=True)  # extract features and labels
print(f'Features dimension: {iris_X.shape}.\nLabels dimensions {iris_y.shape}.\nLabels:{np.unique(iris_y)}')

Features dimension: (150, 4).
Labels dimensions (150,).
Labels:[0 1 2]


Scikit-learn supports the creation of stratified training and testing samples of unordered data through random permutation.

N.B.: **Stratified sampling** is a method of sampling from a population by dividing members of the population into homogeneous subgroups. The use of stratified sampling ensures that relative class frequencies are approximately preserved in the training and test (or validation) sets.

In [3]:
from sklearn import model_selection as skmos
X_train, X_test, y_train, y_test = skmos.train_test_split(iris_X, iris_y, test_size=0.3, random_state=21, stratify=iris_y)
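
The effect of `stratify` can be checked on deliberately imbalanced labels. The arrays `X_all` and `y_all` are made up for illustration (80 samples of class 0 and 20 of class 1).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 80% class 0, 20% class 1
y_all = np.array([0] * 80 + [1] * 20)
X_all = np.arange(100).reshape(-1, 1)  # placeholder features

_, _, y_tr_strat, y_te_strat = train_test_split(
    X_all, y_all, test_size=0.3, random_state=0, stratify=y_all)

# Count how often each class occurs in each part of the split
train_counts = np.bincount(y_tr_strat)
test_counts = np.bincount(y_te_strat)
print(train_counts / train_counts.sum())
print(test_counts / test_counts.sum())
```

Both printed vectors should be close to the overall class shares of 0.8 and 0.2.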

The k-nearest neighbours classifier must first be instantiated and then fitted to the training data, using features and labels.

In [4]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)  # instantiate estimator
knn.fit(X_train, y_train)  # fit the estimator
y_pred = knn.predict(X_test)  # predict test targets

Fitting a classifier:
* modifies the classifier in place so that it stores what was learned from the data, and
* returns the classifier itself.
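
A short check of this behaviour, using illustrative names (`clf`, `returned`) that are not part of the notebook above:

```python
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = datasets.load_iris(return_X_y=True)

clf = KNeighborsClassifier(n_neighbors=5)
returned = clf.fit(X_iris, y_iris)
same_object = returned is clf  # fit() hands back the very same object
print(same_object)

# Because fit() returns the estimator, calls can be chained
first_preds = KNeighborsClassifier(n_neighbors=5).fit(X_iris, y_iris).predict(X_iris[:3])
print(first_preds)
```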

The simplest way to measure accuracy on the test sample is the **.score()** method, which for classifiers returns the mean accuracy.

In [5]:
knn.score(X_test, y_test)

0.9777777777777777
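
For a classifier, the score is the share of correctly predicted test labels. The sketch below rebuilds the split with the same parameters as above, under different variable names, and compares three equivalent ways of computing it.

```python
import numpy as np
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = datasets.load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=21, stratify=y_iris)

clf = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
y_hat = clf.predict(X_te)

acc_score = clf.score(X_te, y_te)         # predicts internally, then scores
acc_manual = np.mean(y_hat == y_te)       # share of correct predictions
acc_metric = accuracy_score(y_te, y_hat)  # same quantity via sklearn.metrics
print(acc_score, acc_manual, acc_metric)
```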

### Regression example

In [6]:
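# NOTE: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2; this cell needs an older version
boston_X, boston_y = datasets.load_boston(return_X_y=True)  # extract features and target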

from sklearn import model_selection as skmos
X_train, X_test, y_train, y_test = skmos.train_test_split(boston_X, boston_y, test_size=0.3, random_state=21)

In [7]:
from sklearn.linear_model import LinearRegression
reg = LinearRegression()  # instantiate regression
reg.fit(X_train, y_train)

LinearRegression()

In [8]:
reg.score(X_test, y_test)  # get R2 without need to .predict()

0.7115848967731223
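
For regressors, `.score()` returns the coefficient of determination, R2 = 1 - SS_res / SS_tot. The sketch below checks this by hand. It uses synthetic data (an assumption, chosen so that the example does not depend on any particular dataset being available), so the numbers differ from those above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic regression data: linear signal plus Gaussian noise
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 3))
y_syn = X_syn @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=21)
lin = LinearRegression().fit(X_tr, y_tr)

y_hat = lin.predict(X_te)
ss_res = np.sum((y_te - y_hat) ** 2)        # residual sum of squares
ss_tot = np.sum((y_te - y_te.mean()) ** 2)  # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot
r2_from_score = lin.score(X_te, y_te)
print(r2_manual, r2_from_score)
```

R2 is at most 1 and can be negative on test data when the model predicts worse than the test-set mean.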

## Cross validation: k-fold CV

The purpose of testing is to estimate how well a model predicts out-of-sample data. A single split of the data carries a greater risk of not being representative of the model's ability to generalize. Hence, where possible, multiple splits are preferred. An unordered dataset is typically split into k folds. <u>Each of the k folds is used as the test set in turn</u>, with the remaining k-1 folds used for training. There are as many train-test splits and scores as there are folds.

This maximizes the use of the available data: across the k rounds, every sample is used for training k-1 times and for testing exactly once. However, <u>the number of folds also determines the computational cost</u>, since the model is refitted once per fold.

In [9]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

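# NOTE: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2; this cell needs an older version
boston_X, boston_y = datasets.load_boston(return_X_y=True)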
reg = LinearRegression()
cv_results = cross_val_score(reg, boston_X, boston_y, cv=5)  # gives array of R2s
print(cv_results)

[ 0.63919994  0.71386698  0.58702344  0.07923081 -0.25294154]
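
What `cross_val_score` does can be written out as an explicit loop over folds. This is a sketch on synthetic data (an assumption, so the scores differ from those above). It relies on the fact that an integer `cv` for a regressor corresponds to `KFold` without shuffling.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data: linear signal plus Gaussian noise
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(100, 2))
y_syn = X_syn @ np.array([3.0, -1.0]) + rng.normal(scale=0.3, size=100)

kf = KFold(n_splits=5)  # five consecutive folds, no shuffling
manual_scores = []
for train_idx, test_idx in kf.split(X_syn):
    # Refit a fresh model on four folds, score it on the held-out fold
    fold_model = LinearRegression().fit(X_syn[train_idx], y_syn[train_idx])
    manual_scores.append(fold_model.score(X_syn[test_idx], y_syn[test_idx]))
manual_scores = np.array(manual_scores)

auto_scores = cross_val_score(LinearRegression(), X_syn, y_syn, cv=kf)
print(manual_scores)
print(auto_scores)
```

When the rows of a dataset are ordered, consecutive folds may differ systematically from one another. Passing `KFold(n_splits=5, shuffle=True, random_state=0)` as `cv` shuffles the rows before splitting.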


## Tryouts

https://scikit-learn.org/stable/tutorial/index.html#tutorial-menu

now:
https://scikit-learn.org/stable/tutorial/statistical_inference/supervised_learning.html