# Scikit-learn
- a free software machine learning library
- various classification, regression and clustering algorithms
- built on NumPy, SciPy, and matplotlib

In Scikit-learn, classifiers are Python objects. They are trained and evaluated using methods that all classifier objects implement.

We start by importing the libraries and modules that we will use in this class.

In [None]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

# Tools for scaling data, PCA, and standard datasets
from sklearn import preprocessing, decomposition, datasets

# Tools for tracking learning curves and performing cross-validation
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, validation_curve, learning_curve

# The k-NN learning algorithm
from sklearn.neighbors import KNeighborsClassifier as kNN

We now load into memory the *Breast Cancer Wisconsin (Diagnostic) Data Set*, a dataset for binary classification. The labels are `M` (malignant cancer) and `B` (benign cancer).

In [None]:
cancer = pd.read_csv("../Datasets/cancer.csv")
cancer.info()

By inspecting the dataset we note that the first column (`id`) and the last column can be dropped.

In [None]:
cancer.head()

Since the last column contains only `NaN` values, and there are no other `NaN` values in the dataset, we can delete it using the `dropna()` method applied to the columns. This deletes every column that contains at least one `NaN` value.

In [None]:
cancer = cancer.dropna(axis='columns')
cancer.head()
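As a minimal sketch of how `dropna()` behaves on columns, the toy frame below (hypothetical data, not taken from the cancer file) compares the default rule with the stricter `how='all'` option. The names `toy`, `toy_any` and `toy_all` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data): 'c' is entirely NaN, 'd' has a single NaN
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0],
    'b': [4.0, 5.0, 6.0],
    'c': [np.nan, np.nan, np.nan],
    'd': [7.0, np.nan, 9.0],
})

# Default rule (how='any'): drop every column with at least one missing value
toy_any = toy.dropna(axis='columns')
print(list(toy_any.columns))

# how='all': drop a column only if all of its values are missing
toy_all = toy.dropna(axis='columns', how='all')
print(list(toy_all.columns))
```

On the cancer table the two rules give the same result, because the only column with missing values is entirely empty.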

Next, we create the set of instances by dropping the column `id` and by dropping the column `diagnosis` containing the labels. We do this using the method `drop()`.

In [None]:
X = cancer.drop(columns=['id', 'diagnosis']).values
X

In [None]:
X.shape

Finally, we replace the categorical labels `B` and `M` with numerical labels `0` and `1`.

In [None]:
str_to_int = {'B' : 0, 'M' : 1}
cancer['diagnosis'] = cancer['diagnosis'].map(str_to_int)
np.unique(cancer['diagnosis'])

This allows us to use the new values in the column `diagnosis` as the vector of labels.

In [None]:
y = cancer['diagnosis'].values
y

Using the NumPy function `unique()` with the flag `return_counts` set, we can see the number of examples in each class.

In [None]:
np.unique(y, return_counts=True)
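The counts returned by `unique()` can be turned into class proportions with one division. The sketch below uses a small hypothetical label vector (`toy_labels`) rather than the cancer labels.

```python
import numpy as np

# Hypothetical label vector: four examples of class 0, three of class 1
toy_labels = np.array([0, 1, 1, 0, 0, 1, 0])

# values holds the distinct labels (sorted), counts how often each one occurs
values, counts = np.unique(toy_labels, return_counts=True)

# Dividing by the total number of examples gives the class proportions
proportions = counts / counts.sum()
print(values, counts, proportions)
```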

Our next step is to randomly split the dataset into training and test sets. Since the dataset is relatively small (569 points), we leave 40% of the data for testing. The `random_state` argument is used as a seed for the random number generator (the value 42 is arbitrary), so that we can repeat the experiment with the same random bits. The `stratify` argument creates a split with the same proportion of classes in the training and test sets, which is especially useful when the dataset is unbalanced.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.4,random_state=42,stratify=y)
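To see the effect of stratification in isolation, here is a small sketch on hypothetical unbalanced labels (80 negatives and 20 positives). The arrays `X_toy` and `y_toy` are made up for the illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical unbalanced dataset: 80 examples of class 0, 20 of class 1
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)  # one dummy feature per example

# Stratified 60/40 split: both parts keep the 80/20 class ratio
_, _, y_tr_s, y_te_s = train_test_split(
    X_toy, y_toy, test_size=0.4, random_state=42, stratify=y_toy)

# Fraction of positive labels in the training and in the test part
print(y_tr_s.mean(), y_te_s.mean())
```

Without `stratify`, the fraction of positives in the two parts would vary from one random split to the next.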

We are now ready to train a classifier for this dataset. We create a $1$-NN classifier object by calling `kNN(n_neighbors=1)`, where `kNN` is the alias we chose when importing the class `KNeighborsClassifier`. The object is assigned to the variable `knn`.

In [None]:
knn = kNN(n_neighbors=1)
type(knn)

Then we train the 1-NN classifier by invoking the method `fit()` with training points and training labels as arguments.

In [None]:
knn.fit(X_train, y_train)

Finally, we invoke the method `score()` to evaluate the accuracy of the trained model on both the training and the test set.

In [None]:
knn.score(X_train, y_train), knn.score(X_test, y_test)

As expected, the training accuracy is 1 (i.e., zero training error), while the test accuracy is noticeably lower.
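The reason the training accuracy of 1-NN equals 1 can be seen in a minimal NumPy sketch of the prediction rule. The helper `one_nn_predict` and the random data are hypothetical and are not how Scikit-learn implements the classifier; the sketch assumes Euclidean distance and no duplicated points with conflicting labels.

```python
import numpy as np

def one_nn_predict(X_ref, y_ref, X_query):
    """Give each query point the label of its closest reference point."""
    # Squared Euclidean distances between queries and references, shape (n_query, n_ref)
    d2 = ((X_query[:, None, :] - X_ref[None, :, :]) ** 2).sum(axis=2)
    # Index of the closest reference point for each query
    return y_ref[d2.argmin(axis=1)]

rng = np.random.default_rng(0)
X_pts = rng.normal(size=(50, 2))      # 50 random points in the plane
y_pts = rng.integers(0, 2, size=50)   # arbitrary binary labels

# Each training point is at distance 0 from itself, so it recovers its own label
train_acc = (one_nn_predict(X_pts, y_pts, X_pts) == y_pts).mean()
print(train_acc)
```

Even though the labels here are pure noise, the training accuracy is perfect, which is why it says nothing about the test accuracy.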

We perform a second experiment on the same random split, this time using 3-NN.

In [None]:
knn = kNN(n_neighbors=3) # 3-NN
knn.fit(X_train, y_train)
knn.score(X_train, y_train), knn.score(X_test, y_test)

Predictably, the training accuracy went down (by about 5%), while the test accuracy is now much closer to the training accuracy.

Next, we use the function `learning_curve()` to inspect the evolution of training and test performance of $7$-NN for increasing sizes of the training set.

For each value of the training set size, a 5-fold stratified cross-validation is performed to estimate the risk.

In [None]:
sizes = range(100, 401, 50)
train_size, train_score, val_score = learning_curve(kNN(n_neighbors=7), X, y, train_sizes=sizes, cv=5)

`val_score` is a matrix in which each row contains the accuracy on the $5$ folds of cross-validation for a given training size.

In [None]:
val_score
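The layout of the arrays returned by `learning_curve()` can be checked on a small synthetic problem. The sketch below assumes data generated with `make_classification` and two training sizes; the names `X_lc`, `y_lc`, `sizes_abs`, `train_sc` and `val_sc` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary problem standing in for a real dataset
X_lc, y_lc = make_classification(n_samples=200, n_features=5, random_state=0)

# Two training sizes, 5-fold cross-validation for each of them
sizes_abs, train_sc, val_sc = learning_curve(
    KNeighborsClassifier(n_neighbors=7), X_lc, y_lc, train_sizes=[50, 100], cv=5)

# One row per training size, one column per fold
print(sizes_abs, train_sc.shape, val_sc.shape)

# Averaging over axis 1 gives one accuracy value per training size
print(np.mean(val_sc, axis=1))
```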

The training and cross-validation scores are plotted as follows.

In [None]:
plt.title('7-NN vs. training size')
plt.plot(train_size, np.mean(val_score, 1), label='Validation accuracy')
plt.plot(train_size, np.mean(train_score, 1), label='Training accuracy')
plt.legend()
plt.xlabel('Training size')
plt.ylabel('Accuracy')
plt.show()

Now we want to plot the training and validation performance as a function of the parameter $k$ of $k$-NN. We start by creating a list of values of $k$ from 1 to 181 in steps of 20.

Then, we use the function `validation_curve()` to create a matrix of training scores and a matrix of validation scores, where, as before, rows are indexed by the values of $k$ used to generate the scores, and columns report the per-fold performance in a cross-validation experiment.

In [None]:
neighbors = range(1,200,20)
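# param_name and param_range are keyword-only arguments in recent scikit-learn versions
train_score, val_score = validation_curve(kNN(), X, y, param_name='n_neighbors', param_range=neighbors, cv=5)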
train_score, val_score

Plotting the results clearly reveals overfitting and underfitting regions of the parameter $k$, with the best value at about $k=25$.

In [None]:
plt.title('k-NN vs. number of neighbors')
plt.plot(neighbors, np.mean(val_score, 1), label='Validation accuracy')
plt.plot(neighbors, np.mean(train_score, 1), label='Training accuracy')
plt.legend()
plt.xlabel('Number of neighbors')
plt.ylabel('Accuracy')
plt.show()

We move on to a different dataset: the *Pima Indians Diabetes Database*. The goal is to predict whether or not a patient has diabetes, based on the diagnostic measurements included in the dataset. Hence, there are only two labels (binary classification).

In [None]:
pima = pd.read_csv("Datasets/diabetes.csv")
pima.info()

In [None]:
pima.head()

The `Outcome` column contains the labels. We use this to construct our sets of training points and training labels.

In [None]:
X = pima.drop(columns='Outcome').values
y = pima['Outcome'].values

As before, we count the proportions of positive and negative labels.

In [None]:
np.unique(y, return_counts=True)

Then we split the dataset into a training set (60%) and a test set (40%) using stratification.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.4,random_state=42, stratify=y)

The validation curve is plotted using the same range of values for $k$ as before.

In [None]:
neighbors = range(1,200,20)
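# param_name and param_range are keyword-only arguments in recent scikit-learn versions
train_score, val_score = validation_curve(kNN(), X, y, param_name='n_neighbors', param_range=neighbors, cv=5)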

Once more, the regions of underfitting and overfitting for the parameter $k$ are clearly seen in the plot.

In [None]:
plt.title('k-NN vs. number of neighbors')
plt.plot(neighbors, np.mean(val_score, 1), label='Validation accuracy')
plt.plot(neighbors, np.mean(train_score, 1), label='Training accuracy')
plt.legend()
plt.xlabel('Number of neighbors')
plt.ylabel('Accuracy')
plt.show()

### Cross-validation to evaluate performance of a given algorithm
The function `cross_val_score()` performs cross validation to estimate the risk of the classifier output by a given algorithm.

Here is an example using $5$-fold cross-validation on the entire dataset to evaluate the performance of $21$-NN.

In [None]:
knn = kNN(n_neighbors=21)
scores = cross_val_score(knn, X, y, cv=5)
scores, scores.mean()
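To make explicit what `cross_val_score()` computes, the sketch below reproduces it with a loop over folds on synthetic data. It relies on the fact that, for a classifier and an integer `cv`, Scikit-learn uses unshuffled stratified folds; the data and the names `X_cv`, `y_cv`, `manual_scores` and `library_scores` are assumptions made for the illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary problem standing in for a real dataset
X_cv, y_cv = make_classification(n_samples=200, n_features=5, random_state=0)

manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X_cv, y_cv):
    # Train a fresh 21-NN on four folds, then score it on the held-out fold
    clf_cv = KNeighborsClassifier(n_neighbors=21)
    clf_cv.fit(X_cv[train_idx], y_cv[train_idx])
    manual_scores.append(clf_cv.score(X_cv[test_idx], y_cv[test_idx]))
manual_scores = np.array(manual_scores)

# The library call performs the same five train/test rounds
library_scores = cross_val_score(KNeighborsClassifier(n_neighbors=21), X_cv, y_cv, cv=5)
print(manual_scores, library_scores)
```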

### Grid-search to find best value of parameter for the learning algorithm
We can use the class `GridSearchCV` to look for the best parameter of an algorithm using the entire dataset.
- Repeat 5-fold cross-validation on the entire dataset for each value of the parameter in the grid
- Select the parameter with the best cross-validated score

In [None]:
k_grid = {'n_neighbors': range(1, 100, 20)}
learner = GridSearchCV(estimator=kNN(), param_grid=k_grid, cv=5, return_train_score=True)
learner.fit(X, y)
learner.best_params_, learner.best_score_ # vars containing the best parameter value and its corresponding cv score
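The per-parameter scores behind the selection are stored in the attribute `cv_results_`. The following sketch, on synthetic data and with a hypothetical grid of three values, shows how `best_score_` and `best_params_` relate to those scores.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary problem standing in for a real dataset
X_gs, y_gs = make_classification(n_samples=200, n_features=5, random_state=0)

# Hypothetical grid with three candidate values of k
gs = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': [1, 21, 41]}, cv=5)
gs.fit(X_gs, y_gs)

# Mean cross-validated accuracy of each candidate, in grid order
mean_scores = gs.cv_results_['mean_test_score']
for params, score in zip(gs.cv_results_['params'], mean_scores):
    print(params, round(score, 3))

# best_score_ is the largest of these means, best_params_ the candidate achieving it
print(gs.best_params_, gs.best_score_)
```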

The algorithm with the best parameter, $21$-NN, is available in the attribute `learner.best_estimator_`, that is, `learner.best_estimator_ = kNN(n_neighbors=21)`. By default, this estimator has already been refit on the entire dataset.

We repeat the evaluation of this algorithm using 5-fold cross-validation.

In [None]:
model = learner.best_estimator_
scores = cross_val_score(model, X, y, cv=5)
scores.mean()

### Nested cross-validation to evaluate performance of a learning algorithm with parameters to tune
We saw that cross-validation allows us to use the data for choosing a good value of the parameter. However, we are still left with the problem of estimating the risk of the classifier generated by the algorithm. Nested cross-validation provides a way of estimating the risk of a classifier generated by an algorithm whose parameters are tuned using cross-validation on the training set.

In the following example, we:
- Run 5-fold cross-validation on the entire dataset
- On the training part of each fold, run *internal* 5-fold cross-validation to find the best value of the parameter
- Re-train the model on the training part of the outer fold using the optimized parameter
- Test the model on the testing part of the outer fold.

In [None]:
k_grid = {'n_neighbors': range(1,100,20)}
learner = GridSearchCV(estimator=kNN(), param_grid=k_grid, cv=5) # internal C-V
scores = cross_val_score(learner, X, y, cv=5) # external C-V
scores, scores.mean()

Note that the nested cross-validated estimate is $0.72$, while the cross-validated estimate we computed above using grid search on the entire dataset is higher, $0.74$. Two effects contribute to this discrepancy: nested CV runs grid search on the smaller inner folds, and the non-nested estimate is optimistically biased because the same data is used both to select $k$ and to evaluate it. The nested CV estimate is therefore the more reliable estimate of the risk.
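The four steps of nested cross-validation listed above can also be written as an explicit loop. This is a sketch on synthetic data with a hypothetical grid; it assumes unshuffled stratified outer folds, which is what `cross_val_score()` uses for a classifier when `cv` is an integer.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary problem standing in for a real dataset
X_n, y_n = make_classification(n_samples=200, n_features=5, random_state=0)
grid_n = {'n_neighbors': [1, 5, 9]}  # hypothetical grid

nested_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X_n, y_n):
    # Internal 5-fold CV on the outer training part selects k,
    # then GridSearchCV refits the best model on that whole training part
    inner = GridSearchCV(KNeighborsClassifier(), grid_n, cv=5)
    inner.fit(X_n[train_idx], y_n[train_idx])
    # The refit model is tested on the outer test part, never seen during tuning
    nested_scores.append(inner.score(X_n[test_idx], y_n[test_idx]))
nested_scores = np.array(nested_scores)

# Same computation in a single call
one_call = cross_val_score(GridSearchCV(KNeighborsClassifier(), grid_n, cv=5), X_n, y_n, cv=5)
print(nested_scores.mean(), one_call.mean())
```

The value of $k$ chosen by the inner search may differ from one outer fold to the next, so the estimate refers to the whole tuning procedure rather than to a single value of $k$.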

## Preprocessing the dataset
Many learning algorithms may work better when the training set is rescaled in certain ways. Note that, in order to avoid contributing to overfitting, these rescalings should not depend on the training labels.

We illustrate the most popular rescaling technique on the cancer dataset.

In [None]:
X = cancer.drop(columns=['id', 'diagnosis']).values
y = cancer['diagnosis'].values

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.4,random_state=42,stratify=y)

The `StandardScaler` class standardizes the values of each feature $i$. If $x_1(i),\ldots,x_m(i)$ are the values of the $i$-th feature over the data points $x_1,\ldots,x_m$, then `StandardScaler` replaces each value $x_t(i)$ with
$$x_t(i)' = \frac{x_t(i)-\mu_i}{\sigma_i}$$
where
$$\mu_i = \frac{1}{m}\sum_{t=1}^m x_t(i) \;\;\;\textrm{and}\;\;\; \sigma_i^2 = \frac{1}{m}\sum_{t=1}^m \bigl(x_t(i)-\mu_i\bigr)^2$$

Note that `standard_scaler.fit_transform()` is used to compute $\mu_i$ and $\sigma_i$ for each feature $i$ on the training data and then to rescale the training data. The testing data are rescaled using the parameters computed on the training data. Allowing the learner to compute the rescaling parameters using the testing data would imply that the test set is made available (without labels) before the classifier is generated. This is typically not allowed in the statistical learning model.

In [None]:
standard_scaler = preprocessing.StandardScaler()
X_train_standard = standard_scaler.fit_transform(X_train)
X_test_standard = standard_scaler.transform(X_test)
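The standardization formula can be checked by hand with NumPy. In this sketch the arrays `A_train` and `A_test` are random stand-ins for the training and test sets; note that the mean and standard deviation come from the training part only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
A_train = rng.normal(loc=5.0, scale=3.0, size=(30, 4))  # stand-in training set
A_test = rng.normal(loc=5.0, scale=3.0, size=(10, 4))   # stand-in test set

# Per-feature mean and standard deviation, with the 1/m normalization (ddof=0)
mu = A_train.mean(axis=0)
sigma = A_train.std(axis=0)

# Both parts are rescaled with the training statistics
manual_train = (A_train - mu) / sigma
manual_test = (A_test - mu) / sigma

scaler_demo = StandardScaler().fit(A_train)
print(np.allclose(manual_train, scaler_demo.transform(A_train)))
print(np.allclose(manual_test, scaler_demo.transform(A_test)))
```

After rescaling, every training feature has mean 0 and standard deviation 1; the test features are only approximately standardized, since their own statistics were not used.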

Next, we compute the test set performance with and without rescaling for different values of $k$.

In [None]:
neighbors = range(1,8)
test_scores = []
test_scores_standard = []

for k in neighbors:
    knn = kNN(n_neighbors=k)
    knn.fit(X_train, y_train)
    test_scores.append(knn.score(X_test, y_test))
    knn.fit(X_train_standard, y_train)
    test_scores_standard.append(knn.score(X_test_standard, y_test))

Plotting the performance in both cases shows the benefits of rescaling.

In [None]:
plt.title('k-NN vs. number of neighbors')
plt.plot(neighbors, test_scores, label='Testing accuracy')
plt.plot(neighbors, test_scores_standard, label='Testing accuracy (scaled)')
plt.legend()
plt.xlabel('Number of neighbors')
plt.ylabel('Accuracy')
plt.show()

We now repeat the same exercise using the Pima Indians dataset.

In [None]:
X = pima.drop(columns='Outcome').values
y = pima['Outcome'].values
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.4,random_state=42, stratify=y)
standard_scaler = preprocessing.StandardScaler()
X_train_standard = standard_scaler.fit_transform(X_train)
X_test_standard = standard_scaler.transform(X_test)

In [None]:
neighbors = range(1,100,20)
test_scores = []
test_scores_standard = []

for k in neighbors:
    knn = kNN(n_neighbors=k)
    knn.fit(X_train, y_train)
    test_scores.append(knn.score(X_test, y_test))
    knn.fit(X_train_standard, y_train)
    test_scores_standard.append(knn.score(X_test_standard, y_test))

Also in this case, we see that rescaling helps boost the test accuracy when $k$ is not chosen optimally.

In [None]:
plt.title('k-NN vs. number of neighbors')
plt.plot(neighbors, test_scores, label='Testing accuracy')
plt.plot(neighbors, test_scores_standard, label='Testing accuracy (scaled)')
plt.legend()
plt.xlabel('Number of neighbors')
plt.ylabel('Accuracy')
plt.show()

We finish by showing a graphical representation of the $k$-NN classifier.

We do this by
- loading the IRIS dataset using the shortcut provided by the Scikit-learn module `datasets`
- projecting the dataset onto the first two principal components via Principal Component Analysis.

In [None]:
iris = datasets.load_iris()

X = iris.data
y = iris.target

pca = decomposition.PCA(n_components=2)
pca.fit(X) # Compute PCA
X = pca.transform(X) # Project data onto first two principal components
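As a rough sketch of what the projection does, the same two-dimensional coordinates can be obtained from a singular value decomposition of the centered data. The comparison is made up to the sign of each component, since the direction of a principal axis is arbitrary; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris_X = load_iris().data

# Center the data: PCA works with deviations from the feature means
iris_centered = iris_X - iris_X.mean(axis=0)

# Rows of Vt are the principal directions, ordered by decreasing singular value
U, S, Vt = np.linalg.svd(iris_centered, full_matrices=False)

# Coordinates of each point along the first two principal directions
manual_proj = iris_centered @ Vt[:2].T

sk_proj = PCA(n_components=2).fit_transform(iris_X)

# Equal up to a possible sign flip of each column
print(np.allclose(np.abs(manual_proj), np.abs(sk_proj), atol=1e-6))
```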

A scatter plot of the projected data reveals that $k$-NN is going to have an easy time classifying this dataset...

In [None]:
plt.scatter(X[:,0], X[:,1], c=y)

The next code cell (adapted from https://scikit-learn.org/stable/auto_examples/neighbors/plot_classification.html) allows us to visualize the decision surface of the $k$-NN classifier. Each colored region is the set of data points that are assigned the same classification by $k$-NN.

In [None]:
from matplotlib.colors import ListedColormap

h = .02  # step size in the mesh

# Create color maps
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])

# we create an instance of Neighbours Classifier and fit the data.
clf = kNN(n_neighbors=1)
clf.fit(X, y)

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure()
plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

# Plot also the training points
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold, edgecolor='k', s=20)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.title("3-Class classification")

plt.show()

We repeat the same exercise using the Pima Indians dataset. Note that the job of $k$-NN is harder here.

In [None]:
X = pima.drop(columns='Outcome').values
y = pima['Outcome'].values

standard_scaler = preprocessing.StandardScaler()
X = standard_scaler.fit_transform(X)

pca = decomposition.PCA(n_components=2)
pca.fit(X)
X = pca.transform(X)

In [None]:
plt.scatter(X[:,0], X[:,1], c=y)

In [None]:
h = .02  # step size in the mesh

# Create color maps
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00'])

# we create an instance of Neighbours Classifier and fit the data.
clf = kNN(n_neighbors=1)
clf.fit(X, y)

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure()
plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

# Plot also the training points
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold, edgecolor='k', s=20)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.title("2-Class classification")

plt.show()