# Scikit-learn
- a free software machine learning library
- various classification, regression and clustering algorithms
- built on NumPy, SciPy, and matplotlib

In Scikit-learn, classifiers are Python objects. They are trained and evaluated using methods that all classifier objects implement.
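As a minimal sketch of this shared interface, the snippet below trains two different classifiers with the same `fit()`, `predict()`, and `score()` calls. The toy data is made up for illustration, and `DecisionTreeClassifier` is included only to show a second classifier type; it is not used elsewhere in this notebook.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy data (made up for illustration): one feature, two well-separated classes
X_toy = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
y_toy = [0, 0, 0, 1, 1, 1]

models = [KNeighborsClassifier(n_neighbors=1), DecisionTreeClassifier(random_state=0)]
toy_scores = {}
for clf in models:
    clf.fit(X_toy, y_toy)                      # same training call for every classifier
    preds = clf.predict([[0.05], [1.05]])      # same prediction call
    toy_scores[type(clf).__name__] = clf.score(X_toy, y_toy)  # mean accuracy
    print(type(clf).__name__, preds, toy_scores[type(clf).__name__])
```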

We start by importing the libraries and modules that we will use in this class.

In [None]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

# Tools for scaling data, PCA, and standard datasets
from sklearn import preprocessing, decomposition, datasets

# Tools for tracking learning curves and performing cross-validation
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, validation_curve, learning_curve

# The k-NN learning algorithm
from sklearn.neighbors import KNeighborsClassifier as kNN

We now load into memory the *Breast Cancer Wisconsin (Diagnostic) Data Set*, a dataset for binary classification. The labels are `M` (malignant cancer) and `B` (benign cancer).

In [None]:
cancer = pd.read_csv("../Datasets/cancer.csv")
cancer.info()

By inspecting the dataset we note that the first column (`id`) and the last column can be dropped.

In [None]:
cancer.head()

Since the last column contains only `NaN` values, and there are no other `NaN` values in the dataset, we can delete it using the `dropna()` method applied over the columns. This deletes any column that contains at least one `NaN` value.

In [None]:
cancer = cancer.dropna(axis='columns')
cancer.head()
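The behavior of `dropna()` over columns can be checked on a small hand-made frame. The sketch below uses hypothetical column names (`a`, `b`, `empty`) and also shows the `how="all"` option, which drops only columns that are entirely missing; that option would be the safer choice if other columns had a few missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the cancer dataset
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [4.0, np.nan, 6.0],             # one missing value
    "empty": [np.nan, np.nan, np.nan],   # entirely missing
})

any_dropped = toy.dropna(axis="columns")             # default how="any"
all_dropped = toy.dropna(axis="columns", how="all")  # drop only all-NaN columns

print(list(any_dropped.columns))
print(list(all_dropped.columns))
```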

Next, we create the set of instances by dropping the column `id` and by dropping the column `diagnosis` containing the labels. We do this using the method `drop()`.

In [None]:
X = cancer.drop(columns=['id', 'diagnosis']).values
X

In [None]:
X.shape

Finally, we replace the categorical labels `B` and `M` with numerical labels `0` and `1`.

In [None]:
str_to_int = {'B' : 0, 'M' : 1}
cancer['diagnosis'] = cancer['diagnosis'].map(str_to_int)
np.unique(cancer['diagnosis'])
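One thing to keep in mind with `map()` and a dictionary: values that do not appear among the dictionary keys are turned into `NaN` rather than raising an error. The sketch below illustrates this on a made-up series containing a deliberately unexpected label.

```python
import pandas as pd

# Made-up labels; the lowercase "b" is a deliberately unexpected value
labels = pd.Series(["B", "M", "M", "b"])
str_to_int = {"B": 0, "M": 1}

mapped = labels.map(str_to_int)          # unmapped values become NaN
n_unmapped = int(mapped.isna().sum())    # count labels the dictionary missed

print(mapped.tolist(), n_unmapped)
```

Checking that the number of unmapped values is zero is a cheap sanity test after this kind of recoding.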

This allows us to use the new values in the column `diagnosis` as the vector of labels.

In [None]:
y = cancer['diagnosis'].values
y

Using the NumPy function `unique()` with the flag `return_counts` set, we can see the number of examples in each class.

In [None]:
np.unique(y, return_counts=True)
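The counts returned by `unique()` can be turned into class proportions by dividing by their sum. A small sketch on a made-up label vector:

```python
import numpy as np

# Made-up labels for illustration
y_toy = np.array([0, 1, 0, 0, 1, 0])

classes, counts = np.unique(y_toy, return_counts=True)
proportions = counts / counts.sum()   # fraction of examples in each class

print(classes, counts, proportions)
```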

Our next step is to randomly split the dataset into training and test sets. Since the dataset is relatively small (569 points), we leave 40% of the data for testing. The `random_state` parameter is used as a seed for the random number generator (42 is an arbitrary choice), so that the experiment can be repeated with the same random split. The `stratify` parameter creates a split with the same proportion of classes in the training and test sets (especially useful when the dataset is unbalanced).

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.4,random_state=42,stratify=y)
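The effect of `stratify` is easiest to see on synthetic, deliberately unbalanced labels. In the sketch below (toy data, not the cancer dataset), 10% of the labels are positive, and the stratified split keeps that fraction in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, deliberately unbalanced labels: 90 zeros and 10 ones
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.arange(100).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.4, random_state=42, stratify=y_toy
)

frac_train = y_tr.mean()   # fraction of positives in the training part
frac_test = y_te.mean()    # fraction of positives in the test part
print(len(y_tr), len(y_te), frac_train, frac_test)
```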

We are now ready to train a classifier for this dataset. We create a $1$-NN classifier object by calling `kNN(n_neighbors=1)`, where `kNN` is the alias we assigned to the class `KNeighborsClassifier` when we imported it. The object is assigned to the variable `knn`.

In [None]:
knn = kNN(n_neighbors=1)
type(knn)

Then we train the 1-NN classifier by invoking the method `fit()` with the training points and training labels as arguments.

In [None]:
knn.fit(X_train, y_train)

Finally, we invoke the method `score()` to evaluate the accuracy of the trained model on both the training and the test set.

In [None]:
knn.score(X_train, y_train), knn.score(X_test, y_test)

As expected, the training accuracy is 1 (i.e., zero training error), while the test accuracy is noticeably lower.
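The reason 1-NN reaches zero training error is that every training point is its own nearest neighbor (at distance zero), so it is assigned its own label. The sketch below checks this with a hand-written 1-NN in NumPy on random toy data with random labels; it is an illustration of the argument, not scikit-learn's implementation. The argument assumes there are no duplicate points carrying different labels.

```python
import numpy as np

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(30, 2))       # random toy points
y_toy = rng.integers(0, 2, size=30)    # random labels, unrelated to the points

# Pairwise squared Euclidean distances between all training points
diff = X_toy[:, None, :] - X_toy[None, :, :]
d2 = (diff ** 2).sum(axis=2)

nearest = d2.argmin(axis=1)            # index of the nearest training point
train_acc = (y_toy[nearest] == y_toy).mean()

print(train_acc)
```

Even though the labels here are pure noise, the training accuracy is perfect, which is why training accuracy alone says little about how well 1-NN generalizes.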

We perform a second experiment on the same random split this time using 3-NN.

In [None]:
knn = kNN(n_neighbors=3) # 3-NN
knn.fit(X_train, y_train)
knn.score(X_train, y_train), knn.score(X_test, y_test)

As expected, the training accuracy went down (by about 5%), while the test accuracy is now close to the training accuracy.

Next, we use the function `learning_curve()` to inspect the evolution of the training and validation performance of $7$-NN for increasing sizes of the training set.

For each value of the training set size, a 5-fold stratified cross-validation is performed to estimate the risk.

In [None]:
sizes = range(100, 401, 50)
train_size, train_score, val_score = learning_curve(kNN(n_neighbors=7), X, y, train_sizes=sizes, cv=5)

`val_score` is a matrix in which each row contains the accuracy on the $5$ folds of cross-validation for a given training set size.

In [None]:
val_score
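To make the layout of the returned arrays concrete, the sketch below runs `learning_curve()` on synthetic data generated with `make_classification` (an assumption made for illustration, not the cancer dataset) and inspects the shapes: one row per training size, one column per fold.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data, for illustration only
X_toy, y_toy = make_classification(n_samples=200, random_state=0)

sizes_toy = [40, 80, 120]   # absolute training set sizes
sizes_out, tr, va = learning_curve(
    KNeighborsClassifier(n_neighbors=7), X_toy, y_toy,
    train_sizes=sizes_toy, cv=5,
)

print(sizes_out)            # training sizes actually used
print(tr.shape, va.shape)   # (number of sizes, number of folds)
print(va.mean(axis=1))      # mean validation accuracy per training size
```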

The training and cross-validation scores are plotted as follows.

In [None]:
plt.title('7-NN vs. training size')
plt.plot(train_size, np.mean(val_score, 1), label='Validation accuracy')
plt.plot(train_size, np.mean(train_score, 1), label='Training accuracy')
plt.legend()
plt.xlabel('Training size')
plt.ylabel('Accuracy')
plt.show()

Now we want to plot the training and validation performance as a function of the parameter $k$ of $k$-NN. We start by creating a range of values of $k$ that starts at 1 and increases in steps of 20 while staying below 200 (that is, 1, 21, ..., 181).

Then, we use the function `validation_curve()` to create a matrix of training scores and a matrix of validation scores, where, as before, rows are indexed by the values of $k$ used to generate the scores, and columns report the per-fold performance in a cross-validation experiment.

In [None]:
neighbors = range(1,200,20)
train_score, val_score = validation_curve(kNN(), X, y, param_name='n_neighbors', param_range=neighbors, cv=5)
train_score, val_score

Plotting the results clearly reveals overfitting and underfitting regions of the parameter $k$, with the best value at about $k=25$.

In [None]:
plt.title('k-NN vs. number of neighbors')
plt.plot(neighbors, np.mean(val_score, 1), label='Validation accuracy')
plt.plot(neighbors, np.mean(train_score, 1), label='Training accuracy')
plt.legend()
plt.xlabel('Number of neighbors')
plt.ylabel('Accuracy')
plt.show()
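Instead of reading the best $k$ off the plot, one can pick the value with the highest mean validation accuracy. Note that only the ten values in the grid are candidates. The sketch below does this on synthetic data from `make_classification` (an assumption for illustration; the resulting value says nothing about the cancer dataset).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data, for illustration only
X_toy, y_toy = make_classification(n_samples=300, random_state=0)

ks = list(range(1, 200, 20))   # 1, 21, 41, ..., 181
tr, va = validation_curve(
    KNeighborsClassifier(), X_toy, y_toy,
    param_name="n_neighbors", param_range=ks, cv=5,
)

mean_va = va.mean(axis=1)               # average over the 5 folds
best_k = ks[int(np.argmax(mean_va))]    # grid value with the highest mean score
print(ks, best_k)
```

A finer grid around `best_k` would be a natural follow-up if a more precise value is needed.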

We move on to a different dataset: the *Pima Indians Diabetes Database*. The goal is to predict whether or not a patient has diabetes, based on the diagnostic measurements included in the dataset. Hence, there are only two labels (binary classification).

In [None]:
pima = pd.read_csv("Datasets/diabetes.csv")
pima.info()

In [None]:
pima.head()

The `Outcome` column contains the labels. We use it to construct our set of instances and our vector of labels.

In [None]:
X = pima.drop(columns='Outcome').values
y = pima['Outcome'].values

As before, we count the proportions of positive and negative labels.

In [None]:
np.unique(y, return_counts=True)

Then we split the dataset into a training set (60%) and a test set (40%) using stratification.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.4,random_state=42, stratify=y)

The validation curve is computed using the same range of values for $k$ as before.

In [None]:
neighbors = range(1,200,20)
train_score, val_score = validation_curve(kNN(), X, y, param_name='n_neighbors', param_range=neighbors, cv=5)

Once more, the regions of underfitting and overfitting for the parameter $k$ are clearly seen in the plot.

In [None]:
plt.title('k-NN vs. number of neighbors')
plt.plot(neighbors, np.mean(val_score, 1), label='Validation accuracy')
plt.plot(neighbors, np.mean(train_score, 1), label='Training accuracy')
plt.legend()
plt.xlabel('Number of neighbors')
plt.ylabel('Accuracy')
plt.show()