# Introduction to scikit-learn

![Creative Commons License](https://i.creativecommons.org/l/by/4.0/88x31.png)
This work by Jephian Lin is licensed under a [Creative Commons Attribution 4.0 International License](http://creativecommons.org/licenses/by/4.0/).

In [47]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## scikit-learn framework
`pip install scikit-learn`

scikit-learn (sklearn) is a library  
containing various machine learning algorithms

The nice thing about sklearn  
is the uniform interface shared by all models:  
**model**, **fit**, and **predict**/**transform**

The basic input of `fit` is `X` and `y`  
where `X` is a dataset matrix  
and `y` is an array of labels

The convention for a dataset matrix  
in sklearn  
is that each row is a sample and  
each column is a feature
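As a small illustration of this convention, the sketch below builds a made-up dataset (the names `X_toy` and `y_toy` and the numbers are hypothetical, not from any sklearn dataset) and reads off the number of samples and features from its shape.

```python
import numpy as np

# hypothetical toy dataset: 4 samples (rows), 3 features (columns)
X_toy = np.array([
    [5.1, 3.5, 1.4],
    [4.9, 3.0, 1.4],
    [6.2, 3.4, 5.4],
    [5.9, 3.0, 5.1],
])
# one label per sample, so y has as many entries as X has rows
y_toy = np.array([0, 0, 1, 1])

n_samples, n_features = X_toy.shape
print(n_samples, n_features)  # 4 3
```

A single feature stored as a 1-d array has to be reshaped into a column, for example with `x[:, np.newaxis]`, before it is passed to `fit`.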

#### datasets

The `sklearn.datasets` module  
contains many functions  
for loading datasets

In [36]:
from sklearn.datasets import load_digits
digits = load_digits()
X = digits['data']
y = digits['target']

Each model may be located  
in a different module

An easy way to find  
how to import a model  
is to Google it and  
find it in the sklearn documentation

#### $k$-nearest neighbors classifier

In [37]:
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier() ### select a model
model.fit(X, y) ### fit the model
y_model = model.predict(X) ### make prediction

In [38]:
from sklearn.metrics import accuracy_score
### the high score is because
### we used the training set to test
accuracy_score(y, y_model)

0.9905397885364496
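Since the score above is measured on the training set itself, it may be too optimistic. One common remedy is to hold out part of the data for testing. The sketch below uses `train_test_split` with its default split; the `random_state` value and the variable names are arbitrary choices for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

digits = load_digits()
X_all, y_all = digits['data'], digits['target']

# hold out 25% of the samples (the default); random_state only fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, random_state=0)

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)  # fit on the training part only

# compare accuracy on seen and unseen samples
train_acc = accuracy_score(y_train, knn.predict(X_train))
test_acc = accuracy_score(y_test, knn.predict(X_test))
print(train_acc, test_acc)
```

The accuracy on `X_test` is the more honest estimate, since those samples were never seen during `fit`.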

Use `?` to read the docstring  
(`??` also shows the source code)  
and find information  
about the hyperparameters

In [39]:
KNeighborsClassifier??

After training  
use `vars(model)`  
to see the hyperparameters  
and the trained parameters  
of the model

In [40]:
vars(model)

{'_fit_X': array([[ 0.,  0.,  5., ...,  0.,  0.,  0.],
        [ 0.,  0.,  0., ..., 10.,  0.,  0.],
        [ 0.,  0.,  0., ..., 16.,  9.,  0.],
        ...,
        [ 0.,  0.,  1., ...,  6.,  0.,  0.],
        [ 0.,  0.,  2., ..., 12.,  0.,  0.],
        [ 0.,  0., 10., ..., 12.,  1.,  0.]]),
 '_fit_method': 'kd_tree',
 '_tree': <sklearn.neighbors.kd_tree.KDTree at 0x23ed2a8>,
 '_y': array([0, 1, 2, ..., 8, 9, 8]),
 'algorithm': 'auto',
 'classes_': array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
 'effective_metric_': 'euclidean',
 'effective_metric_params_': {},
 'leaf_size': 30,
 'metric': 'minkowski',
 'metric_params': None,
 'n_jobs': None,
 'n_neighbors': 5,
 'outputs_2d_': False,
 'p': 2,
 'radius': None,
 'weights': 'uniform'}

In [41]:
model._fit_X.shape ### same as X

(1797, 64)
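`vars(model)` mixes hyperparameters with values learned from the data. A minimal sketch of how to tell them apart: `get_params()` returns only the hyperparameters, while public attributes learned by `fit` end with an underscore. The choice `n_neighbors=3` is arbitrary.

```python
from sklearn.datasets import load_digits
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()
knn = KNeighborsClassifier(n_neighbors=3)  # hyperparameter chosen before fitting

# hyperparameters are available even before fit
params = knn.get_params()
print(params['n_neighbors'])  # 3

# attributes ending with an underscore are set by fit
has_classes_before = hasattr(knn, 'classes_')
knn.fit(digits['data'], digits['target'])
has_classes_after = hasattr(knn, 'classes_')
print(has_classes_before, has_classes_after)
```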

#### $k$-means clustering

In [42]:
from sklearn.cluster import KMeans
model = KMeans()
model.fit(X)
y_model = model.predict(X)

In [43]:
KMeans?

In [44]:
vars(model)

{'algorithm': 'auto',
 'cluster_centers_': array([[ 0.00000000e+00,  9.16030534e-02,  3.23282443e+00,
          1.02480916e+01,  1.25458015e+01,  6.25190840e+00,
          7.13740458e-01,  3.81679389e-03,  7.63358779e-03,
          7.06106870e-01,  6.97328244e+00,  1.27862595e+01,
          1.31374046e+01,  1.04389313e+01,  1.76335878e+00,
          4.71844785e-16, -1.69135539e-17,  1.16030534e+00,
          8.01908397e+00,  1.22251908e+01,  1.27404580e+01,
          9.82824427e+00,  1.06870229e+00, -1.80411242e-16,
          4.33680869e-19,  1.46183206e+00,  7.98091603e+00,
          1.40190840e+01,  1.43244275e+01,  6.01145038e+00,
          2.44274809e-01,  8.67361738e-19,  0.00000000e+00,
          9.42748092e-01,  7.94274809e+00,  1.39389313e+01,
          1.33702290e+01,  3.49236641e+00,  8.77862595e-02,
          0.00000000e+00,  3.46944695e-18,  1.16030534e+00,
          9.31297710e+00,  1.10190840e+01,  1.26488550e+01,
          5.22519084e+00,  2.93893130e-01,  6.59194921e-17

In [45]:
model.cluster_centers_.shape 
### 8 centers in R^64

(8, 64)
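There are 8 centers because `n_clusters=8` is the default of `KMeans`. As a sketch, one may instead ask for 10 clusters to match the ten digits. The values of `n_init` and `random_state` below are illustrative choices, not recommendations.

```python
from sklearn.datasets import load_digits
from sklearn.cluster import KMeans

X_digits = load_digits()['data']

# one cluster per digit; n_init and random_state are arbitrary here
km = KMeans(n_clusters=10, n_init=10, random_state=0)
km.fit(X_digits)

print(km.cluster_centers_.shape)  # (10, 64)

# each center lives in R^64, so it can be viewed as an 8x8 image
center_images = km.cluster_centers_.reshape(10, 8, 8)
```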

#### Linear regression

In [48]:
### make sample data
x = np.linspace(0,10,20)
y = 3 + 0.5*x + 0.2*np.random.randn(20)

In [69]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
X = x[:, np.newaxis]
model.fit(X, y)
y_model = model.predict(X)

In [None]:
LinearRegression?

In [51]:
vars(model)

{'_residues': 1.0332626694540439,
 'coef_': array([0.5070146]),
 'copy_X': True,
 'fit_intercept': True,
 'intercept_': 2.9574823707599176,
 'n_jobs': None,
 'normalize': False,
 'rank_': 1,
 'singular_': array([13.57241785])}

`model.coef_` contains  
the coefficient of each column

`model.intercept_` is the constant term
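To see how these two attributes are used, the sketch below recomputes the predictions by hand as `X @ coef_ + intercept_` and compares them with `predict`. The sample data follows the same recipe as above, with a seeded generator (an assumption added here) so the run is reproducible.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)  # seeded only for reproducibility
x_demo = np.linspace(0, 10, 20)
y_demo = 3 + 0.5 * x_demo + 0.2 * rng.standard_normal(20)

X_demo = x_demo[:, np.newaxis]  # 20 samples, 1 feature
lin = LinearRegression().fit(X_demo, y_demo)

# prediction = (dataset matrix) times (coefficients) plus (constant term)
manual = X_demo @ lin.coef_ + lin.intercept_
print(np.allclose(manual, lin.predict(X_demo)))  # True
```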

## Scores of a model
Each model has its own default score

Many metrics for measuring  
the performance of a model  
are contained in `sklearn.metrics`  

See [here](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics) for a list of scores

For classification problems  
the score can be the **accuracy**  

In [54]:
from sklearn.datasets import load_digits
digits = load_digits()
X = digits['data']
y = digits['target']

In [55]:
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier()
model.fit(X, y)
y_model = model.predict(X)

The default score  
for `KNeighborsClassifier`  
is the accuracy

In [None]:
model?

In [57]:
model.score(X, y)

0.9905397885364496

In [58]:
from sklearn.metrics import accuracy_score
accuracy_score(y_model, y)

0.9905397885364496
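The accuracy is simply the fraction of samples whose predicted label equals the true label. A small sketch with made-up labels (`y_true` and `y_pred` are hypothetical):

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([0, 1, 2, 2, 1])
y_pred = np.array([0, 1, 2, 1, 1])  # one mistake out of five

# fraction of correct predictions
manual_acc = np.mean(y_true == y_pred)
print(manual_acc, accuracy_score(y_true, y_pred))  # 0.8 0.8
```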

For clustering problems  
the score can be  
the **adjusted Rand index**

See more [here](https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation)

In [60]:
from sklearn.cluster import KMeans
model = KMeans()
model.fit(X)
y_model = model.predict(X)

The default score  
for `KMeans`  
is the negation of the sum of squared distances  
from each sample to its nearest center

In [64]:
model.score?

In [63]:
model.score(X)

-1264988.112404823
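The value returned by `score` can be checked by hand: assign each sample to its nearest center, sum the squared Euclidean distances, and negate. The sketch below does this; `n_init` and `random_state` are arbitrary choices, so the number itself will differ from the output above.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.cluster import KMeans

X_digits = load_digits()['data']
km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X_digits)

# the center assigned to each sample
assigned = km.cluster_centers_[km.predict(X_digits)]

# sum of squared distances from each sample to its center
sum_sq = ((X_digits - assigned) ** 2).sum()

print(km.score(X_digits), -sum_sq)
```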

In [66]:
from sklearn.metrics import adjusted_rand_score
adjusted_rand_score(y, y_model)

0.5751372605057907
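The labels found by a clustering algorithm are arbitrary names, so comparing them to `y` with the accuracy would not make sense. The adjusted Rand index compares the groupings only. A sketch with made-up labels:

```python
from sklearn.metrics import adjusted_rand_score, accuracy_score

labels_true = [0, 0, 1, 1, 2, 2]
labels_perm = [2, 2, 0, 0, 1, 1]  # same grouping, different names

ari = adjusted_rand_score(labels_true, labels_perm)
acc = accuracy_score(labels_true, labels_perm)
print(ari, acc)  # 1.0 0.0
```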

For regression problems  
the score can be  
the mean squared error  
or the $R^2$ score

In [68]:
### make sample data
x = np.linspace(0,10,20)
y = 3 + 0.5*x + 0.2*np.random.randn(20)

In [2]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
X = x[:, np.newaxis]
model.fit(X, y)
y_model = model.predict(X)

The default score  
for `LinearRegression`  
is the $R^2$ score

In [70]:
model.score?

In [71]:
model.score(X, y)

0.9862303246979223

In [72]:
from sklearn.metrics import r2_score
r2_score(y, y_model)

0.9862303246979222

In [77]:
from sklearn.metrics import mean_squared_error
u = mean_squared_error(y, y_model)
u

0.031968249148189604

In [78]:
### mean of (y - y.mean())**2, i.e. the variance of y
v = ((y - y.mean())**2).mean()
v

2.321641465530105

In [79]:
1 - u/v

0.9862303246979222
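The computation above matches the usual definition of the $R^2$ score. Writing $\hat{y}_i$ for the entries of `y_model`, $\bar{y}$ for the mean of `y`, and $n$ for the number of samples (notation introduced here), it reads:

$$
R^2 = 1 - \frac{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})^2} = 1 - \frac{u}{v}
$$

The numerator is the mean squared error `u` and the denominator is the variance `v` of `y`. The score equals 1 for a perfect fit and 0 when the model does no better than always predicting $\bar{y}$.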

## Save and load a model
One may use the joblib package  
to save and load a model  
`pip install joblib`

In [1]:
import os
from joblib import dump, load

Training a model  
can possibly take a long time

Once a `model` is trained  
use `dump(model, 'filename.joblib')` to save the model

In [None]:
### make sample data
x = np.linspace(0,10,20)
y = 3 + 0.5*x + 0.2*np.random.randn(20)

In [None]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
X = x[:, np.newaxis]
model.fit(X, y)
y_model = model.predict(X)

In [83]:
dump(model, 'linear_model.joblib')

['linear_model.joblib']

In [86]:
os.listdir('.')

['Introduction-to-scikit-learn.ipynb',
 'linear_classifier.png',
 '256px-Colored_neural_network.svg.png',
 'Algorithms-spectral-embedding.ipynb',
 'A-taste-of-data-science.ipynb',
 'Complexity-sorting-and-vectorization.ipynb',
 'A-taste-of-feature-engineering.ipynb',
 'Introduction-to-NetworkX.ipynb',
 '.git',
 'Algorithms-linear-classifier.ipynb',
 'linear_model.joblib',
 'Algorithms-neural-network-feedforward-and-accuracy.ipynb',
 'Algorithms-data-to-graph.ipynb',
 '.ipynb_checkpoints',
 'LICENSE',
 'Algorithms-k-mean-clustering.ipynb',
 '256px-SVM_margin.png',
 'kmean.png',
 'spectral_embedding.png',
 'kNN.png',
 'README.md',
 'NeuralNetwork1.ipynb',
 'eball.png',
 'Algorithms-searching-algorithms.ipynb']

Use  
`model = load('filename.joblib')`  
to retrieve a model

In [87]:
new_model = load('linear_model.joblib')

In [88]:
vars(new_model)

{'_residues': 0.639364982963793,
 'coef_': array([0.49859111]),
 'copy_X': True,
 'fit_intercept': True,
 'intercept_': 3.0738209729588046,
 'n_jobs': None,
 'normalize': False,
 'rank_': 1,
 'singular_': array([13.57241785])}

In [3]:
### since this is only for illustration
### let's remove the file  
### to keep the folder clean
os.remove('linear_model.joblib')
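A quick way to check that nothing is lost when saving is to compare the predictions of the original model with those of the loaded one. The sketch below is self-contained and writes to a temporary folder, which is an assumption made here so that no cleanup is needed. The seeded generator and the variable names are also illustrative.

```python
import os
import tempfile
import numpy as np
from joblib import dump, load
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)  # seeded only for reproducibility
x_demo = np.linspace(0, 10, 20)
y_demo = 3 + 0.5 * x_demo + 0.2 * rng.standard_normal(20)
X_demo = x_demo[:, np.newaxis]
lin = LinearRegression().fit(X_demo, y_demo)

# the temporary folder and the file in it are removed automatically
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'linear_model.joblib')
    dump(lin, path)
    restored = load(path)

# the loaded model should predict exactly like the original
same = np.array_equal(lin.predict(X_demo), restored.predict(X_demo))
print(same)
```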