<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Scikit-Learn-Introduction" data-toc-modified-id="Scikit-Learn-Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Scikit-Learn Introduction</a></span><ul class="toc-item"><li><span><a href="#Data-visualization" data-toc-modified-id="Data-visualization-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Data visualization</a></span></li><li><span><a href="#Data-preparation" data-toc-modified-id="Data-preparation-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Data preparation</a></span></li><li><span><a href="#Apply-model-and-make-predictions" data-toc-modified-id="Apply-model-and-make-predictions-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Apply model and make predictions</a></span><ul class="toc-item"><li><span><a href="#test/evaluate-model" data-toc-modified-id="test/evaluate-model-1.3.1"><span class="toc-item-num">1.3.1&nbsp;&nbsp;</span>test/evaluate model</a></span></li></ul></li><li><span><a href="#Measure---Quality-of-Classification" data-toc-modified-id="Measure---Quality-of-Classification-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Measure   Quality of Classification</a></span></li><li><span><a href="#Further-simple-models" data-toc-modified-id="Further-simple-models-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Further simple models</a></span><ul class="toc-item"><li><span><a href="#Gaussian-Naive-Bayes" data-toc-modified-id="Gaussian-Naive-Bayes-1.5.1"><span class="toc-item-num">1.5.1&nbsp;&nbsp;</span>Gaussian Naive Bayes</a></span></li><li><span><a href="#Logistic-Regression" data-toc-modified-id="Logistic-Regression-1.5.2"><span class="toc-item-num">1.5.2&nbsp;&nbsp;</span>Logistic Regression</a></span></li><li><span><a href="#Probabilistic-Classification" data-toc-modified-id="Probabilistic-Classification-1.5.3"><span class="toc-item-num">1.5.3&nbsp;&nbsp;</span>Probabilistic Classification</a></span></li></ul></li><li><span><a 
href="#Two-short-examples--on-unsupervised-ML" data-toc-modified-id="Two-short-examples--on-unsupervised-ML-1.6"><span class="toc-item-num">1.6&nbsp;&nbsp;</span>Two short examples  on unsupervised ML</a></span><ul class="toc-item"><li><span><a href="#Dimensionality-Reduction" data-toc-modified-id="Dimensionality-Reduction-1.6.1"><span class="toc-item-num">1.6.1&nbsp;&nbsp;</span>Dimensionality Reduction</a></span></li><li><span><a href="#Clustering" data-toc-modified-id="Clustering-1.6.2"><span class="toc-item-num">1.6.2&nbsp;&nbsp;</span>Clustering</a></span></li></ul></li><li><span><a href="#Clustering-applied-to-digit-data" data-toc-modified-id="Clustering-applied-to-digit-data-1.7"><span class="toc-item-num">1.7&nbsp;&nbsp;</span>Clustering applied to digit data</a></span></li><li><span><a href="#Image-data-with-sklearn:" data-toc-modified-id="Image-data-with-sklearn:-1.8"><span class="toc-item-num">1.8&nbsp;&nbsp;</span>Image data with sklearn:</a></span><ul class="toc-item"><li><span><a href="#PCA" data-toc-modified-id="PCA-1.8.1"><span class="toc-item-num">1.8.1&nbsp;&nbsp;</span>PCA</a></span></li><li><span><a href="#Test-Isomap" data-toc-modified-id="Test-Isomap-1.8.2"><span class="toc-item-num">1.8.2&nbsp;&nbsp;</span>Test Isomap</a></span></li><li><span><a href="#Digit-classification" data-toc-modified-id="Digit-classification-1.8.3"><span class="toc-item-num">1.8.3&nbsp;&nbsp;</span>Digit classification</a></span><ul class="toc-item"><li><span><a href="#First-kNN:" data-toc-modified-id="First-kNN:-1.8.3.1"><span class="toc-item-num">1.8.3.1&nbsp;&nbsp;</span>First kNN:</a></span></li><li><span><a href="#Then--Gaussian-Naive-Bayes:" data-toc-modified-id="Then--Gaussian-Naive-Bayes:-1.8.3.2"><span class="toc-item-num">1.8.3.2&nbsp;&nbsp;</span>Then  Gaussian Naive Bayes:</a></span></li></ul></li></ul></li></ul></li></ul></div>

# Scikit-Learn Introduction
A number of Python packages provide implementations of machine learning algorithms.
**[Scikit-Learn](http://scikit-learn.org)** is one of the most popular.
* it provides many of the common ML algorithms
* well-designed, uniform API (programming interface)
  * standardized and largely streamlined setup of the different models   
    &rarr; easy to switch
* good documentation

The first example is based on the **[Iris dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set)**. It was introduced by the statistician
Ronald Fisher in 1936 and has served as an instructive use case for classification ever since.
The data consists of
* measurements of the length and width of both sepal (part of the calyx) and petal (flower leaf)
* classification of Iris sub-species



In [None]:
# the usual setup: 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# seaborn provides an easy way to load the Iris dataset as a pandas DataFrame
import seaborn as sns
iris = sns.load_dataset('iris')
iris.head()

In [None]:
iris

## Data visualization
The first step should always be an investigation of the data properties, i.e.
* basic statistical properties
* visualization of distributions


In [None]:
# basic statistics with pandas
iris.describe()

In [None]:
# distribution of single feature
sns.histplot(data=iris,x='sepal_length',hue='species')

In [None]:
# combined plot of 2 features
sns.jointplot(data=iris,x='sepal_length',y='sepal_width', hue='species')

In [None]:
# combined plot matrix of all features in dataframe
#
# will provide scatter plot of all combinations of numerical columns in dataframe
# target (=species) can be given and will cause different colors
sns.pairplot(iris, hue='species', diag_kind='hist', height=2.0)

## Data preparation
For **supervised learning** with sklearn, the first step is always to split the data into
* a table/matrix of **features**
* a list of **targets**

Then split the data into a **train** and a **test** sample:
* `train_test_split` function from sklearn
* by default 75% for training and 25% for testing
  * can be specified as parameter
* randomized selection of entries  
&rarr; initial order does not matter
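
The split parameters mentioned above can be set explicitly. The following is a minimal sketch, not part of the original notebook: it uses `load_iris` from `sklearn.datasets` (bundled with scikit-learn) instead of the seaborn loader, and the names `X_demo`, `Xd_train` etc. are illustrative only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# load_iris ships with scikit-learn, so no download is needed
X_demo, y_demo = load_iris(return_X_y=True)

# 30% test sample, reproducible shuffling (random_state),
# and identical class proportions in both samples (stratify)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo)

print("train:", Xd_train.shape, " test:", Xd_test.shape)
```

Fixing `random_state` makes the split reproducible, which helps when comparing different models on the same test sample.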

In [None]:
# feature matrix
X=iris.loc[:,'sepal_length':'petal_width']
X.shape

In [None]:
# target
Y=iris.species
Y.shape

In [None]:
# break-up in train & test sample
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, Y)

In [None]:
print("X_train shape: {}".format(X_train.shape))
print("y_train shape: {}".format(y_train.shape))
print("X_test shape: {}".format(X_test.shape))
print("y_test shape: {}".format(y_test.shape))

In [None]:
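# k-nearest-neighbours classifier: a new point is assigned the majority
# class among its k closest training points (here k=3)
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)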

In [None]:
knn.fit(X_train, y_train)
#knn.fit(X, Y)


## Apply model and make predictions

In [None]:
# create a dummy iris
#X_new = np.array([[5, 4.9, 4, 1.2]])
# recent sklearn versions expect the same data type (DataFrame with column names) as used in training
X_new = pd.DataFrame(np.array([[5, 4.9, 4, 1.2]]),columns=X.columns)
# 2D format required: n_rows x n_columns (here 1x4)
X_new.shape

In [None]:
knn.predict(X_new) # apply model to new data point

### test/evaluate model


In [None]:
y_pred = knn.predict(X_test)   # predict on the held-out test sample only
print("Test set predictions:\n {}".format(y_pred))

In [None]:
y_test==y_pred

In [None]:
# use scikit-learn function for score
from sklearn.metrics import accuracy_score
score = accuracy_score(y_test, y_pred)
print("Test set score: {:.3f}".format(score))

***
Further useful checks are the **classification report** and the **confusion matrix**,  
which give detailed information on misclassifications:

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

(The meaning of `recall` etc. will be explained in a bit.) 

Another intuitive measure is the confusion matrix:

In [None]:
from sklearn.metrics import confusion_matrix

labels = np.unique(y_test)
mat = confusion_matrix(y_test, y_pred, labels=labels)
print (labels, '\n', mat)

***
**Repeat with different settings for number of neighbors**

**Usually high accuracy for Iris data**  
as the scatter plots suggested, there is a rather clear separation between the species
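
Repeating the fit for several neighbor counts can be sketched with a simple loop. This is an illustration under assumptions, not the notebook's own code: it loads the data with `load_iris`, fixes `random_state=0` so that all settings see the same split, and uses the estimator's `score` method for the test accuracy.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = load_iris(return_X_y=True)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, random_state=0)

scores = {}
for k in (1, 3, 5, 7, 9):
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(Xd_train, yd_train)
    scores[k] = clf.score(Xd_test, yd_test)  # mean accuracy on the test sample
    print("k = {}: test accuracy = {:.3f}".format(k, scores[k]))
```

With a test sample this small, differences between settings of one or two misclassified flowers are within the statistical fluctuation of the split.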

***
## Measure   Quality of Classification

The above `classification_report` presented several metrics which are useful to quantify how well the classification works.

For these we need to introduce the following terms (assuming a classification with two classes *P* and *N*)
* $t_p = $ true-positive: number of cases with predicted *P* and correct *P*
* $t_n = $ true-negative: number of cases with predicted *N* and correct *N*
* $f_p = $ false-positive: number of cases with predicted *P* and correct *N*
* $f_n = $ false-negative: number of cases with predicted *N* and correct *P*
![Confusion matrix](./figures/wikipedia_confusion_matrix.png "More details: see Wikipedia article on confusion matrix")


Based on these, the parameters in the `classification_report`  are defined as:
* `precision` (or `purity`): $ t_p / ( t_p + f_p ) $ , i.e. fraction of cases classified as *P* which are true *P*
* `recall` (or `efficiency`): $ t_p / ( t_p + f_n ) $ , i.e. fraction of true *P* which are classified as *P*
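* `f1-score`: harmonic mean of `precision` and `recall`, i.e. $ 2 \cdot \mathrm{precision} \cdot \mathrm{recall} / ( \mathrm{precision} + \mathrm{recall} ) $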

See https://en.wikipedia.org/wiki/Precision_and_recall for a more detailed discussion
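
As a worked example, the quantities can be computed by hand from a small set of labels. The labels below are hypothetical toy data, chosen only to make the arithmetic easy to follow; the F1-score is computed as the harmonic mean of precision and recall, as scikit-learn does.

```python
# hypothetical toy labels: 12 true "P" cases followed by 8 true "N" cases
y_true = ["P"] * 12 + ["N"] * 8
y_pred = ["P"] * 8 + ["N"] * 4 + ["P"] * 2 + ["N"] * 6

pairs = list(zip(y_true, y_pred))
tp = sum(t == "P" and p == "P" for t, p in pairs)  # true positives
tn = sum(t == "N" and p == "N" for t, p in pairs)  # true negatives
fp = sum(t == "N" and p == "P" for t, p in pairs)  # false positives
fn = sum(t == "P" and p == "N" for t, p in pairs)  # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print("tp={} tn={} fp={} fn={}".format(tp, tn, fp, fn))
print("precision={:.3f} recall={:.3f} f1={:.3f}".format(precision, recall, f1))
```

Here 8 of the 10 predicted *P* are correct (precision 0.8), while only 8 of the 12 true *P* are found (recall about 0.667).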

***
## Further simple models

### Gaussian Naive Bayes
Another conceptually simple model:
* The basic assumption is that for each category (*Iris species*) the variables follow a Gaussian distribution.
* In training, the model determines the parameters of these Gaussians.
* For classification, simply calculate the probability that a new Iris data point belongs to species `i`, based on the Gaussian probability density:
$$ P(x) = \frac{1}{\sigma_i \sqrt{2\pi}} \exp\left( -\frac{(x - \mu_i)^2}{2\sigma_i^2} \right) $$
* where $\mu_i$ and $\sigma_i$ are the mean and standard deviation of the respective variable for species `i`

(We'll look at why it's called "Bayes" in a bit more detail [here](http://localhost:8888/notebooks/Higgs-Gaussian.ipynb#GaussianNB).)
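
The recipe above can be sketched by hand with NumPy. This is a simplified illustration, not the `GaussianNB` implementation: it treats the features as independent (the "naive" part), so the per-feature Gaussian densities are multiplied, which is done here as a sum of logarithms, and it weights each species by its relative frequency. Data loading via `load_iris` and all variable names are assumptions for this example.

```python
import numpy as np
from sklearn.datasets import load_iris

X_demo, y_demo = load_iris(return_X_y=True)
classes = np.unique(y_demo)

# "training": mean, standard deviation and relative frequency per species
mu = np.array([X_demo[y_demo == c].mean(axis=0) for c in classes])
sigma = np.array([X_demo[y_demo == c].std(axis=0) for c in classes])
prior = np.array([np.mean(y_demo == c) for c in classes])

def log_gauss(x, mu_i, sigma_i):
    """Logarithm of the Gaussian density, evaluated element-wise."""
    return (-0.5 * np.log(2 * np.pi * sigma_i**2)
            - (x - mu_i)**2 / (2 * sigma_i**2))

# per sample and species: sum of log-densities over features + log prior
# resulting shape: (n_samples, n_classes)
log_post = np.array([
    log_gauss(X_demo, mu[i], sigma[i]).sum(axis=1) + np.log(prior[i])
    for i in range(len(classes))
]).T

y_hand = classes[np.argmax(log_post, axis=1)]  # most probable species
acc_hand = np.mean(y_hand == y_demo)
print("accuracy on the data used for fitting: {:.3f}".format(acc_hand))
```

Working with logarithms avoids numerical underflow when many small densities are multiplied.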

In [None]:
from sklearn.naive_bayes import GaussianNB # 1. choose model class
model = GaussianNB()                       # 2. instantiate model
model.fit(X_train, y_train)                # 3. fit model to data
y_gnb = model.predict(X_test)              # 4. predict on new data

In [None]:
# use scikit-learn function for score (argument order: true labels, predictions)
from sklearn.metrics import accuracy_score
score = accuracy_score(y_test, y_gnb)
print("Test set score: {:.3f}".format(score))

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_gnb))   # true labels first, otherwise precision and recall are swapped


In [None]:
mat = confusion_matrix(y_test, y_gnb, labels=labels)
print (labels,'\n', mat)

### Logistic Regression
This method is similar to standard linear regression. However, it can be used for discrete dependent variables, i.e. classification use cases.

More info: https://en.wikipedia.org/wiki/Logistic_regression
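
The link to linear regression can be made explicit for a two-class problem: the model computes a linear combination of the features and passes it through the logistic (sigmoid) function to obtain a probability. The sketch below is an illustration under assumptions: it restricts the Iris data (loaded via `load_iris`) to two species and compares the hand-computed probability with the output of `predict_proba`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = load_iris(return_X_y=True)
# keep only two species (targets 1 and 2) to get a binary problem
mask = y_demo > 0
X_bin, y_bin = X_demo[mask], y_demo[mask]

clf = LogisticRegression(max_iter=500).fit(X_bin, y_bin)

# linear part, as in linear regression: z = w . x + b
z = X_bin @ clf.coef_[0] + clf.intercept_[0]
# the logistic function maps z to a probability between 0 and 1
p_hand = 1.0 / (1.0 + np.exp(-z))

# probability of the second class as computed by scikit-learn
p_sklearn = clf.predict_proba(X_bin)[:, 1]
print("hand-computed probabilities agree:", np.allclose(p_hand, p_sklearn))
```

For more than two classes, as in the full Iris data, the same idea is generalized to one score per class.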



In [None]:
from sklearn.linear_model import LogisticRegression # 1. choose model class
model = LogisticRegression(max_iter=500)            # 2. instantiate model
model.fit(X_train, y_train)                         # 3. fit model to data
y_lr = model.predict(X_test)                        # 4. predict on new data

In [None]:
# use scikit-learn function for score (argument order: true labels, predictions)
from sklearn.metrics import accuracy_score
score = accuracy_score(y_test, y_lr)
print("Test set score: {:.3f}".format(score))

***
### Probabilistic Classification

Models can be used not only to return a single classification, as in the examples above, but also to provide a list of probabilities for the different possible outcomes:

In [None]:
yout=model.predict_proba(X_test)
print (yout[:5])

In [None]:
yout[y_lr != y_test]

In [None]:
list(zip(y_lr[:5],y_test[:5]))

Depending on the type of problem, this information can be used to distinguish clear-cut cases from those with overlapping classifications.

It can also be used to adjust the trade-off between precision and recall.
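
A minimal sketch of such a trade-off, under stated assumptions: the Iris data (loaded via `load_iris`) is turned into a hypothetical binary problem "virginica or not", and the predicted probability is cut at different thresholds instead of the implicit 0.5.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_iris(return_X_y=True)
# binary toy problem: species 2 (virginica) is the positive class
y_bin = (y_demo == 2).astype(int)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_bin, random_state=0, stratify=y_bin)

clf = LogisticRegression(max_iter=500).fit(Xd_train, yd_train)
p_pos = clf.predict_proba(Xd_test)[:, 1]  # probability of the positive class

results = []
for threshold in (0.2, 0.5, 0.8):
    yd_pred = (p_pos >= threshold).astype(int)
    prec = precision_score(yd_test, yd_pred, zero_division=0)
    rec = recall_score(yd_test, yd_pred, zero_division=0)
    results.append((threshold, prec, rec))
    print("threshold {:.1f}: precision = {:.3f}, recall = {:.3f}".format(
        threshold, prec, rec))
```

Raising the threshold can only remove cases from the set classified as positive, so the recall never increases; the precision typically improves in exchange.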

***
## Two short examples  on unsupervised ML

### Dimensionality Reduction
The Iris data is also a good showcase for unsupervised learning.  
A common problem is **dimensionality reduction**, i.e. checking whether there is a lower-dimensional representation which retains the essential features.
* In case of the Iris data there are four feature dimensions
* the scatter plots showed clear correlations between features
  * indication that fewer dimensions might be sufficient
  
One standard method is principal component analysis (PCA), which can be applied in case of (reasonably) linear correlations.
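
How much of the information a reduced representation retains can be quantified by the fraction of the total variance carried by each component. The following sketch assumes the data is loaded via `load_iris`; it fits a PCA with all four components and prints the attribute `explained_variance_ratio_`.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X_demo, _ = load_iris(return_X_y=True)

pca_full = PCA(n_components=4).fit(X_demo)  # keep all four components
ratios = pca_full.explained_variance_ratio_
cumulative = ratios.cumsum()

for i, (r, c) in enumerate(zip(ratios, cumulative), start=1):
    print("component {}: {:.3f} of variance, cumulative {:.3f}".format(i, r, c))
```

If the first two components carry almost all of the variance, a 2D representation loses little information.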

As before we have to do the usual scikit steps:
* Setup PCA model with 2 dimensions
* fit/train
* get reduced dimensions as output of transform

In [None]:
from sklearn.decomposition import PCA  # 1. Choose the model class
model = PCA(n_components=2)            # 2. Instantiate the model with hyperparameters
model.fit(X)                           # 3. Fit to data. Notice y is not specified!
X_2D = model.transform(X)              # 4. Transform the data to two dimensions
X_2D.shape

Visualize transformed data:

In [None]:
iris['PCA1'] = X_2D[:, 0]
iris['PCA2'] = X_2D[:, 1]
sns.lmplot(x="PCA1", y="PCA2", hue='species', data=iris, fit_reg=False);

We can display the coefficients of the PCA transformation using the `model.components_` property:

In [None]:
model.components_

In [None]:
plt.matshow(model.components_)
plt.colorbar()
plt.xticks(range(len(X.columns)), X.columns)
plt.xlabel("Feature")
plt.ylabel("Principal components");

Or we plot the correlation like we did before:

In [None]:
sns.lmplot(x="PCA1", y="petal_width", hue='species', data=iris, fit_reg=False);

**Exercise** Redo the kNN classification using the 2 PCA Iris components and compare with the original kNN using all 4 Iris features

### Clustering

A clustering algorithm attempts to find distinct groups of data without reference to any labels.
One powerful method is the Gaussian mixture model (GMM) *(for details see Data Science Handbook: 05.12-Gaussian-Mixtures.ipynb)*.  
A GMM attempts to model the data as a collection of Gaussian blobs.

We can fit the Gaussian mixture model as follows:

In [None]:
from sklearn.mixture import GaussianMixture       # 1. Choose the model class
#
model =  GaussianMixture(n_components=3,
                         covariance_type='full')  # 2. Instantiate the model with hyperparameters

model.fit(X)                                      # 3. Fit to data. Notice y is not specified!
y_gmm = model.predict(X)                          # 4. Determine cluster labels

In [None]:
iris['cluster'] = y_gmm
sns.lmplot(x="PCA1", y="PCA2", data=iris, hue='species',
           col='cluster', fit_reg=False);

##### Plot PCA data for each identified cluster  
Indicates good clustering, basically identical to species.
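
The agreement between clusters and species can also be expressed as a number. One option, used here as an illustration and not taken from the original notebook, is the adjusted Rand index from `sklearn.metrics`, which is 1 for identical groupings and close to 0 for random ones, independent of how the clusters are numbered. The settings `n_init=5` and `random_state=0` are assumptions to make the fit more stable and reproducible.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

X_demo, y_demo = load_iris(return_X_y=True)

gmm = GaussianMixture(n_components=3, covariance_type='full',
                      n_init=5, random_state=0)
labels_gmm = gmm.fit_predict(X_demo)  # fit and assign cluster labels

# compares the two groupings; cluster numbering does not matter
ari = adjusted_rand_score(y_demo, labels_gmm)
print("adjusted Rand index: {:.3f}".format(ari))
```

The species labels are used only for this evaluation, not for the clustering itself.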


***
## Clustering applied to digit data

Another classic ML example is handwritten digit data.

A suitable dataset is included with sklearn. First we take a look at it:


In [None]:
from sklearn.datasets import load_digits
digits = load_digits()
digits.images.shape

In [None]:
type(digits)

In [None]:
digits?

In [None]:
print(digits.DESCR)

The data is an sklearn-specific container (a `Bunch`), basically a list of 8x8 pixel images.

We plot a sub-set:

In [None]:
import matplotlib.pyplot as plt

fig, axes = plt.subplots(10, 10, figsize=(8, 8),
                         subplot_kw={'xticks':[], 'yticks':[]},
                         gridspec_kw=dict(hspace=0.1, wspace=0.1))

for i, ax in enumerate(axes.flat):
    ax.imshow(digits.images[i], cmap='binary', interpolation='nearest')
    ax.text(0.05, 0.05, str(digits.target[i]),
            transform=ax.transAxes, color='green')

Plot shows pixel image together with label (in green).

* Some images are obvious
* Others seem difficult 

In [None]:
# Look at data from 1st image (index 0) --> 2D table resembles 0
print (digits.images[0])

In [None]:
digits.images[0].shape

## Image data with sklearn:
To use the data with sklearn as before we need a 2D structure: `samples x features`, i.e. each 8x8 image must be flattened into a 1x64 array.
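
The flattening described above can be done by hand with NumPy's `reshape`. This short sketch (the variable names are illustrative) also checks the result against the flat array that the dataset provides as `data`.

```python
import numpy as np
from sklearn.datasets import load_digits

digits_demo = load_digits()
n_samples = len(digits_demo.images)

# flatten every 8x8 image into one row of 64 pixel values
X_flat = digits_demo.images.reshape(n_samples, -1)
print(X_flat.shape)

# compare with the flat array shipped with the dataset
same = np.array_equal(X_flat, digits_demo.data)
print("identical to digits.data:", same)
```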

This is already provided in the dataset as the element `data`:

In [None]:
print (digits.data[0])

In [None]:
# to use as before just re-label
X = digits.data
y = digits.target

### PCA

In [None]:
# first try PCA
from sklearn.decomposition import PCA  # 1. Choose the model class
model = PCA(n_components=2)            # 2. Instantiate the model with hyperparameters
model.fit(X)                           # 3. Fit to data. Notice y is not specified!
X_2D = model.transform(X)              # 4. Transform the data to two dimensions

**now reduced 64 to 2 dimensions  
&rarr; visualize it**

In [None]:
xout=pd.DataFrame()
xout['tag']=y
xout['PCA1'] = X_2D[:, 0]
xout['PCA2'] = X_2D[:, 1]
sns.lmplot(x="PCA1", y="PCA2", hue='tag', data=xout, fit_reg=False, markers='.');


Some digits are nicely isolated, others less so

Think about it, which digits tend to look similar?

We can also have a look at the *principal components* that the PCA has extracted from the digits dataset:

In [None]:
model.components_.shape

In [None]:
# plot PCA components of digits
fig, ax = plt.subplots(1, 2, subplot_kw={'xticks': (), 'yticks': ()})
for idx, comp in enumerate(model.components_):
    img = comp.reshape(8,8)
    ax.ravel()[idx].imshow(img, cmap="coolwarm")

The left image shows the most important, the right image the second-most important component extracted by the PCA.
Compare this to the previous plot to see that this actually makes sense: 
* If you focus on the blue ("negative") pixels in the left image, those resemble the digit "3". Indeed, from the previous plot we see that the digits 3 cluster at low values of PCA1 (and around 0 for PCA2, i.e. they typically contain little of the second component shown in the right image). 
* The red in the left image could be part of a "4" which indeed has high values for PCA1.
* Similarly, the red in the right image is somewhat the outline of a "0" which has large positive values for PCA2 (and small absolute values for PCA1).

Can you find the digit "1"?

### Test Isomap 
Alternative method for dimension reduction:

In [None]:
from sklearn.manifold import Isomap
iso = Isomap(n_components=2)
iso.fit(digits.data)
XI_2D = iso.transform(digits.data)

In [None]:
XI_2D.shape

In [None]:
xout_iso=pd.DataFrame()
xout_iso['tag']=y
xout_iso['ISO1'] = XI_2D[:, 0]
xout_iso['ISO2'] = XI_2D[:, 1]
sns.lmplot(x="ISO1", y="ISO2", hue='tag', data=xout_iso, markers='.',fit_reg=False,
           scatter_kws={'alpha':0.7});


Separation clearly better with that method!

***
### Digit classification
Of course we want not just clustering but classification, so let's also try the two models used before on the digits:

In [None]:
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=0)

#### First kNN:

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(Xtrain, ytrain)

In [None]:
# use scikit-learn function for score
from sklearn.metrics import accuracy_score
ypred = knn.predict(Xtest)
score = accuracy_score(ytest, ypred)
print("Test set score: {:.3f}".format(score))

***
**Detailed classification report**

In [None]:
from sklearn import metrics
print(metrics.classification_report(ytest, ypred))

**Check confusion matrix**  
very informative for such a case

In [None]:
from sklearn.metrics import confusion_matrix
mat = confusion_matrix(ytest, ypred)
sns.heatmap(mat, square=True, annot=True, cbar=False)
plt.xlabel('predicted value')
plt.ylabel('true value');

##### kNN performs really well!

***
#### Then  Gaussian Naive Bayes:

In [None]:
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(Xtrain, ytrain)
y_model = model.predict(Xtest)

In [None]:
score = accuracy_score(ytest, y_model)   # argument order: true labels, predictions
print("Test set score: {:.3f}".format(score))

In [None]:
from sklearn import metrics
print(metrics.classification_report(ytest, y_model))

In [None]:
mat = confusion_matrix(ytest, y_model)
sns.heatmap(mat, square=True, annot=True, cbar=False)
plt.xlabel('predicted value')
plt.ylabel('true value');

##### GNB significantly worse, many more mis-ids!
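
To see which digits are confused most often, the largest off-diagonal entries of the confusion matrix can be extracted. The sketch below is an illustration under assumptions: it repeats the split with `random_state=0` and the Gaussian Naive Bayes fit in a self-contained way, and the variable names are chosen for this example only.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

Xd, yd = load_digits(return_X_y=True)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(Xd, yd, random_state=0)
yd_gnb = GaussianNB().fit(Xd_train, yd_train).predict(Xd_test)

mat_demo = confusion_matrix(yd_test, yd_gnb)  # rows: true, columns: predicted
off_diag = mat_demo.copy()
np.fill_diagonal(off_diag, 0)                 # ignore correct classifications

# indices of the three largest off-diagonal entries
order = np.argsort(off_diag, axis=None)[::-1][:3]
top_pairs = [np.unravel_index(i, off_diag.shape) for i in order]

for t, p in top_pairs:
    print("true {} predicted as {}: {} times".format(t, p, off_diag[t, p]))
```

Since the labels are the digits 0 to 9, the row and column indices of the matrix coincide with the digit values.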