# Scikit-learn

## Data 


As mentioned earlier, scikit-learn expects the data as a NumPy array (`np.array`) of shape `(n_observations, n_features)`.

Often the number of features is so large that storing the full matrix is impractical (e.g. in text analysis or genomics). When most of the entries are zero, the data can be stored as a sparse matrix from `scipy.sparse`.
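
To get a feel for the savings, here is a minimal sketch that stores a mostly-zero matrix in CSR format. The toy data, its dimensions and the variable names are made up for illustration; they are not part of this tutorial's datasets.

```python
import numpy as np
from scipy import sparse

# Hypothetical toy data: 100 observations x 10000 features, almost all zeros
rng = np.random.RandomState(0)
dense = np.zeros((100, 10000))
rows = rng.randint(0, 100, size=2000)
cols = rng.randint(0, 10000, size=2000)
dense[rows, cols] = 1.0

# CSR keeps only the non-zero values plus two index arrays
X_sparse = sparse.csr_matrix(dense)
sparse_bytes = X_sparse.data.nbytes + X_sparse.indices.nbytes + X_sparse.indptr.nbytes

print("shape:", X_sparse.shape)
print("stored non-zeros:", X_sparse.nnz)
print("dense size in bytes:", dense.nbytes)
print("sparse size in bytes:", sparse_bytes)
```

Many scikit-learn estimators accept such a matrix directly in `fit`, but not all of them do, so it is worth checking the documentation of the estimator you plan to use.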

Scikit-learn ships with some widely used datasets, e.g. the Boston house prices dataset.

To find where your sklearn library is installed, run the following commands:

In [1]:
import sklearn
sklearn.base

<module 'sklearn.base' from '/home/piotrek/miniconda3/envs/BioML/lib/python3.6/site-packages/sklearn/base.py'>

If you are using Linux, you can use bash commands within Jupyter Notebook. If you are not using Linux, don't worry - this part is not essential.

In [7]:
!ls /home/piotrek/miniconda3/envs/BioML/lib/python3.6/site-packages/sklearn

base.py					   kernel_ridge.py
_build_utils				   learning_curve.py
calibration.py				   linear_model
__check_build				   manifold
cluster					   metrics
covariance				   mixture
cross_decomposition			   model_selection
cross_validation.py			   multiclass.py
datasets				   multioutput.py
decomposition				   naive_bayes.py
discriminant_analysis.py		   neighbors
dummy.py				   neural_network
ensemble				   pipeline.py
exceptions.py				   preprocessing
externals				   __pycache__
feature_extraction			   random_projection.py
feature_selection			   semi_supervised
gaussian_process			   setup.py
grid_search.py				   svm
__init__.py				   tests
_isotonic.cpython-36m-x86_64-linux-gnu.so  tree
isotonic.py				   utils
kernel_approximation.py


In [8]:
!ls /home/piotrek/miniconda3/envs/BioML/lib/python3.6/site-packages/sklearn/datasets/data

boston_house_prices.csv  diabetes_target.csv.gz  linnerud_exercise.csv
breast_cancer.csv	 digits.csv.gz		 linnerud_physiological.csv
diabetes_data.csv.gz	 iris.csv		 wine_data.csv


This is where the default datasets are kept.

Jupyter Notebooks contain "magic" functions. The most useful ones are:
- `%cd` - change the current working directory
- `%load` - load code into the current frontend
- `%matplotlib [gui]` - set up matplotlib to work interactively (choose backend)
- `%pwd` - print working directory
- `%timeit` - times a command
- `%%timeit` - times the execution of a cell
- `%%bash` - run cell with bash in subprocess
- `%%perl` - run cell with perl in subprocess
- `%magic` - print information about the magic function system

In [16]:
%magic

In [22]:
%%bash
head boston_house_prices.csv

506,13,,,,,,,,,,,,
"CRIM","ZN","INDUS","CHAS","NOX","RM","AGE","DIS","RAD","TAX","PTRATIO","B","LSTAT","MEDV"
0.00632,18,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24
0.02731,0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
0.02729,0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
0.03237,0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
0.06905,0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2
0.02985,0,2.18,0,0.458,6.43,58.7,6.0622,3,222,18.7,394.12,5.21,28.7
0.08829,12.5,7.87,0,0.524,6.012,66.6,5.5605,5,311,15.2,395.6,12.43,22.9
0.14455,12.5,7.87,0,0.524,6.172,96.1,5.9505,5,311,15.2,396.9,19.15,27.1


Let's import the data using a function from pandas.

In [29]:
from pandas import read_csv

#read_csv? # question mark allows you to peek into the documentation
#read_csv?? # shows you the source code
data_boston = read_csv("boston_house_prices.csv", skiprows=[0])
display(data_boston.head())
print(data_boston.shape)

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2


(506, 14)


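As it is a very popular dataset, scikit-learn has a built-in function that loads it along with metadata. Let's use it. Keep in mind that `load_boston` was deprecated in scikit-learn 1.0 and removed in 1.2, so the cell below needs an older release.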

In [30]:
import sklearn.datasets
boston = sklearn.datasets.load_boston()

Let's see what it contains:

In [31]:
boston.keys()

dict_keys(['data', 'target', 'feature_names', 'DESCR'])

In [32]:
print(boston["DESCR"])

Boston House Prices dataset

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
      

In [33]:
print("Shape of Boston data [n_observations, n_features]: ", boston.data.shape)
display(boston.data[0])
print("Shape of Boston target data: ", boston.target.shape)
display(boston.target[0])

Shape of Boston data [n_observations, n_features]:  (506, 13)


array([  6.32000000e-03,   1.80000000e+01,   2.31000000e+00,
         0.00000000e+00,   5.38000000e-01,   6.57500000e+00,
         6.52000000e+01,   4.09000000e+00,   1.00000000e+00,
         2.96000000e+02,   1.53000000e+01,   3.96900000e+02,
         4.98000000e+00])

Shape of Boston target data:  (506,)


24.0

Let's check for NaNs (Not a Number).

In [34]:
import numpy as np
print(np.isnan(boston.data).any())
print(np.isnan(boston.target).any())


False
False


Now that we know how to load the data, let's divide it into test and training sets.

## Test & Training sets

<img src="figures/train_test_split_matrix.svg" width="80%">

In [35]:
from sklearn.model_selection import train_test_split
train_test_split?

train_X, test_X, train_y, test_y = train_test_split(boston.data, boston.target, test_size=0.25, random_state=42)

Data processed in this way can be given as input to scikit-learn estimators.
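
To see what the split returns, here is a small sketch on stand-in data. The random arrays below are an assumption made for illustration; they only mimic the shape of the Boston data (506 observations, 13 features).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data with the same shape as the Boston dataset
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(506, 13))
y_demo = rng.normal(size=506)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

# Rows are shuffled and divided; the number of features stays the same
print("train:", X_tr.shape, y_tr.shape)
print("test: ", X_te.shape, y_te.shape)
```

Fixing `random_state` makes the split reproducible, which helps when comparing models on the same held-out data.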

# Ordinary least squares regression

The goal is to find a linear relationship between $p$ features and a dependent variable. In the real world, there is always some noise in this relationship.

<img src="figures/ols.png" width="80%">


First of all, we have to define a measure of fit, i.e. how good our model is. Let's use the sum of squared errors (residuals):

$$
SE = \sum_{i=1}^{n} r_i^2
$$

$$
r_i = y_i - f(x_i, \beta)
$$

where $y_i$ is the actual value and $f(x_i, \beta)$ is the value predicted by the regression model with parameters $\beta$, given the $p$ features of observation $i$.
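
As a small worked example of the formula, the sketch below computes the residuals and their squared sum for a hand-picked line. The four data points and the parameter vector `beta` are invented for illustration.

```python
import numpy as np

def f(x, beta):
    """Linear model with one feature: f(x, beta) = beta[0] + beta[1] * x."""
    return beta[0] + beta[1] * x

# Hypothetical observations and a candidate parameter vector (intercept 5, slope 2)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([5.0, 7.5, 8.5, 11.0])
beta = np.array([5.0, 2.0])

residuals = y - f(x, beta)      # r_i = y_i - f(x_i, beta)
SE = np.sum(residuals ** 2)     # sum of squared residuals

print("residuals:", residuals)
print("SE:", SE)
```

Here the residuals are 0, 0.5, -0.5 and 0, so the squared error is 0.5. Least squares regression looks for the `beta` that makes this number as small as possible.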

<img src="figures/varianceexplained.gif" width="80%">

(source: http://my.ilstu.edu/~wjschne/138/Psychology138Lab9.html)

Let's see a synthetic example.

In [39]:
# import necessary modules

%matplotlib notebook 
# setting matplotlib backend
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
sns.set_style("whitegrid")

Let's model $y = 2x + 5$

$x$ is the predictor (feature)

$y$ is the target value (we want to predict it, given x)

What we basically want to do is, given $x$, discover that you need to multiply it by $2$ and add $5$ to get $y$.

In [40]:
x = np.linspace(0, 10, 100) # 100 evenly spaced numbers between 0 and 10
y = 2 * x + 5

In [41]:
plt.rcParams['figure.figsize'] = (8, 8) # change of plot size (by default it's small)
plt.rcParams.update({'font.size': 17}) # font size
plt.plot(x, y, 'o') # plotting x, y as circles


[<matplotlib.lines.Line2D at 0x7f49e99a4b38>]

As mentioned before, scikit-learn requires matrices of shape `[n_observations, n_features]`.
We have 100 observations and one feature (the value of `x`).
We need to change the shape of the array so that scikit-learn can understand it.

<img src="figures/train_test_split_matrix.svg" width="80%">

In [46]:
print('Shape before: ', x.shape)
print(x)
X = x[:, np.newaxis] # newaxis adds a dimension
print('After: ', X.shape)
print(X)

Shape before:  (100,)
[  0.           0.1010101    0.2020202    0.3030303    0.4040404
   0.50505051   0.60606061   0.70707071   0.80808081   0.90909091
   1.01010101   1.11111111   1.21212121   1.31313131   1.41414141
   1.51515152   1.61616162   1.71717172   1.81818182   1.91919192
   2.02020202   2.12121212   2.22222222   2.32323232   2.42424242
   2.52525253   2.62626263   2.72727273   2.82828283   2.92929293
   3.03030303   3.13131313   3.23232323   3.33333333   3.43434343
   3.53535354   3.63636364   3.73737374   3.83838384   3.93939394
   4.04040404   4.14141414   4.24242424   4.34343434   4.44444444
   4.54545455   4.64646465   4.74747475   4.84848485   4.94949495
   5.05050505   5.15151515   5.25252525   5.35353535   5.45454545
   5.55555556   5.65656566   5.75757576   5.85858586   5.95959596
   6.06060606   6.16161616   6.26262626   6.36363636   6.46464646
   6.56565657   6.66666667   6.76767677   6.86868687   6.96969697
   7.07070707   7.17171717   7.27272727   7.37373737   
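
The same reshaping can also be written with `reshape`. The short sketch below (on a hypothetical five-element array, to keep the output small) shows that both spellings give the same `(n_observations, 1)` column.

```python
import numpy as np

x_demo = np.linspace(0, 10, 5)       # shape (5,): a flat vector

X_a = x_demo[:, np.newaxis]          # add a second axis of length 1
X_b = x_demo.reshape(-1, 1)          # -1 lets NumPy infer the number of rows

print(X_a.shape, X_b.shape)
print(np.array_equal(X_a, X_b))
```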

Let's divide the data into test and training set

In [50]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [58]:
plt.figure()
plt.plot(X_train, y_train, 'bo', label="training_data")
plt.plot(X_test, y_test, 'rx', label="test_data")
plt.legend()
plt.show()


### Linear regression

In [59]:
from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [53]:
LinearRegression?

In [68]:
print('Weight of x: ', regressor.coef_[0])
print('y intercept: ', regressor.intercept_)

Weight of x:  1.96101790152
y intercept:  4.95188221696


In [69]:
min_pt = X.min() * regressor.coef_[0] + regressor.intercept_
max_pt = X.max() * regressor.coef_[0] + regressor.intercept_
fig = plt.figure()
plt.plot([X.min(), X.max()], [min_pt, max_pt])
plt.plot(X_train, y_train, 'o')
plt.show()


Based on the training data, we found the relationship between the feature and the target value. That was really easy, so let's get closer to reality and add **noise**.

In [70]:
rng = np.random.RandomState(42)
y = 2 * x + 5 + rng.uniform(-3, 3, size=len(x))
fig = plt.figure()
plt.plot(x, y, 'o')
min_pt = 2 * X.min() + 5
max_pt = 2 * X.max() + 5

plt.plot([X.min(), X.max()], [min_pt, max_pt], 'black')



[<matplotlib.lines.Line2D at 0x7f49e7482b70>]

In [72]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
regressor.fit(X_train, y_train)
print('Weight of x: ', regressor.coef_[0])
print('y intercept: ', regressor.intercept_)

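# endpoints of the true line y = 2x + 5
min_pt = 2 * X.min() + 5
max_pt = 2 * X.max() + 5

# create the figure first, so the true line lands on the same plot as the fit
fig = plt.figure()
plt.plot([X.min(), X.max()], [min_pt, max_pt], 'black', label="actual (2x + 5)")
# endpoints of the fitted line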
min_pt_pred = X.min() * regressor.coef_[0] + regressor.intercept_
max_pt_pred = X.max() * regressor.coef_[0] + regressor.intercept_
plt.plot([X.min(), X.max()], [min_pt_pred, max_pt_pred], 'g--', label="fitted regression line")
plt.plot(X_train, y_train, 'bo', label="training data")
plt.plot(X_test, y_test, 'rx', label="test data")
plt.legend()
plt.show()

Weight of x:  1.96101790152
y intercept:  4.95188221696
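
As a sanity check, the fitted coefficients can be compared with an independent least squares fit. The sketch below rebuilds the noisy data as in the cell above and uses `np.polyfit` with degree 1, which is our choice of cross-check here and not something the tutorial relies on.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Same noisy data as above: y = 2x + 5 plus uniform noise
x = np.linspace(0, 10, 100)
X = x[:, np.newaxis]
rng = np.random.RandomState(42)
y = 2 * x + 5 + rng.uniform(-3, 3, size=len(x))
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

regressor = LinearRegression().fit(X_train, y_train)

# np.polyfit expects a flat x vector and returns the highest degree first
slope, intercept = np.polyfit(X_train[:, 0], y_train, deg=1)

print("LinearRegression:", regressor.coef_[0], regressor.intercept_)
print("np.polyfit:      ", slope, intercept)
```

Both approaches solve the same least squares problem, so the two printed lines should agree up to floating-point error.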



Using scikit-learn, we recovered the relationship almost exactly despite the noise in the data. Let's see what happens if we have **less training data**. We will use interactive widgets; to do that, let's convert our rough scripts into a function.

In [112]:
def fit_regression(data_size):
    plt.figure()
    x = np.linspace(0, 10, data_size)
    X = x[:, np.newaxis]
    rng = np.random.RandomState(42)
    y = 2 * x + 5 + rng.uniform(-3, 3, size=len(x))
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
    regressor.fit(X_train, y_train)
    print('Weight of x: ', regressor.coef_)
    print('y intercept: ', regressor.intercept_)

    min_pt = 2 * X.min() + 5
    max_pt = 2 * X.max() + 5


    plt.plot([X.min(), X.max()], [min_pt, max_pt], 'black', label="actual (2x + 5)")

    min_pt_pred = X.min() * regressor.coef_[0] + regressor.intercept_
    max_pt_pred = X.max() * regressor.coef_[0] + regressor.intercept_
    plt.plot([X.min(), X.max()], [min_pt_pred, max_pt_pred], 'g--',
             label="found {0:.2f}x {1:+.2f}".format(regressor.coef_[0], regressor.intercept_))
             #formatting strings info - https://docs.python.org/2/library/string.html#format-specification-mini-language
    
    plt.plot(X_train, y_train, 'bo', label="training data")
    plt.plot(X_test, y_test, 'rx', label="test data")
    plt.legend(fontsize="x-small")
    return regressor.coef_, regressor.intercept_

In [116]:
from ipywidgets import widgets
# more info about widgets here - https://ipywidgets.readthedocs.io/en/latest/
widgets.interact(fit_regression, data_size=widgets.BoundedIntText(min=4, max=300))


<function __main__.fit_regression>

## Too little data - large noise - fitted coefficients are very 'unstable'

In [117]:
def give_coefficients(data_size):
    x = np.linspace(0, 10, data_size)
    X = x[:, np.newaxis]
    rng = np.random.RandomState(42)
    y = 2 * x + 5 + rng.uniform(-3, 3, size=len(x))
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
    regressor.fit(X_train, y_train)
    return regressor.coef_, regressor.intercept_

a = list()
b = list()
for i in range(4, 4000):
    coefficients = give_coefficients(i)
    a.append(coefficients[0])
    b.append(coefficients[1])
fig = plt.figure()
plt.plot(np.arange(len(a)), a, label="fitted weight of x")
plt.plot([0, len(a)], [2, 2], label="actual weight of x")
plt.title("Fluctuations of fitted value for weight of x and y intercept")
plt.xlabel("Amount of data")
plt.ylabel("Fitted value")
plt.plot(np.arange(len(b)), b, label="fitted y intercept")
plt.plot([0, len(a)], [5, 5], label="actual y intercept")
plt.legend(fontsize="small")
plt.show()
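
Another way to quantify this instability is to repeat the fit for many different noise draws and look at the spread of the fitted slope. This is a variation on the experiment above (which keeps the seed fixed and varies the data size); the helper `fitted_slopes` and the choice of 200 repeats are assumptions made for this sketch.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fitted_slopes(data_size, n_repeats=200):
    """Fit y = 2x + 5 + noise for several noise draws and return the slopes."""
    x = np.linspace(0, 10, data_size)
    X = x[:, np.newaxis]
    slopes = []
    for seed in range(n_repeats):
        rng = np.random.RandomState(seed)
        y = 2 * x + 5 + rng.uniform(-3, 3, size=data_size)
        slopes.append(LinearRegression().fit(X, y).coef_[0])
    return np.array(slopes)

for n in (5, 50, 500):
    slopes = fitted_slopes(n)
    print("n = {0:3d}: mean slope {1:.3f}, std {2:.3f}".format(
        n, slopes.mean(), slopes.std()))
```

The mean stays close to the true slope of 2, while the standard deviation shrinks as the amount of data grows.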


And what if the relationship is not linear? For example,

$$ y = \sin(x) + \text{noise} $$

cannot be modeled with a line.

In [118]:
x = np.linspace(-5, 5, 200)
y = np.sin(x) + rng.uniform(-0.75, 0.75, size=len(x))
fig = plt.figure()
plt.plot(x, y, 'o')
plt.plot(x, np.sin(x))
plt.title(r'$y = \sin(x)$')
X = x[:, np.newaxis]


In [129]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
regressor = LinearRegression()
regressor.fit(X_train, y_train)
print('Weight of x: ', regressor.coef_[0])
print('y intercept: ', regressor.intercept_)
min_pt = X.min() * regressor.coef_[0] + regressor.intercept_
max_pt = X.max() * regressor.coef_[0] + regressor.intercept_

plt.figure()
plt.plot([X.min(), X.max()], [min_pt, max_pt], label="Fitted regression line")
plt.plot(X_train, y_train, 'o', label="Training data")

predict_data = np.linspace(-5, 5, 20)
predict_data = predict_data[:, np.newaxis]
predicted_values = regressor.predict(predict_data)
plt.plot(predict_data, predicted_values, 'gx', markersize=10, label="Predicted values")
plt.title("Sine cannot be modeled using linear regression")
plt.legend(loc="best", prop={'size': 8})


Weight of x:  -0.0546068400036
y intercept:  0.0324743499252



<matplotlib.legend.Legend at 0x7f49e64f2ba8>

This model is not complex enough. Let's use K-neighbors regression. Here, to predict a value, we look at the K nearest neighbors in feature space (usually using the Euclidean metric) and return the mean of their target values as the prediction.
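
To make the idea concrete, here is a minimal sketch of the prediction step written by hand and compared with `KNeighborsRegressor`. The helper `knn_predict`, the toy data and the query point are all made up for illustration; scikit-learn's own implementation uses faster neighbor search structures.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical noisy sine data: 40 observations, one feature
rng = np.random.RandomState(0)
X_demo = np.sort(rng.uniform(-5, 5, size=(40, 1)), axis=0)
y_demo = np.sin(X_demo[:, 0]) + rng.uniform(-0.75, 0.75, size=40)

def knn_predict(X_train, y_train, x_query, k):
    """Mean target of the k training points closest to x_query (Euclidean)."""
    distances = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(distances)[:k]
    return y_train[nearest].mean()

x_query = np.array([1.0])
manual = knn_predict(X_demo, y_demo, x_query, k=3)

model = KNeighborsRegressor(n_neighbors=3).fit(X_demo, y_demo)
from_sklearn = model.predict(x_query[np.newaxis, :])[0]

print("by hand:     ", manual)
print("scikit-learn:", from_sklearn)
```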

In [127]:
from sklearn.neighbors import KNeighborsRegressor
kneighbor_regression = KNeighborsRegressor(n_neighbors=1)
kneighbor_regression.fit(X_train, y_train)
y_pred_train = kneighbor_regression.predict(X_train)
plt.figure()
plt.plot(X_train, y_train, 'o', label="actual value", markersize=10)
plt.plot(X_train, y_pred_train, 's', label="predicted value", markersize=4)
plt.legend(loc='best', prop={'size': 8});


In [130]:
y_pred_test = kneighbor_regression.predict(X_test)
fig = plt.figure()
plt.plot(X_test, y_test, 'o', label="actual value", markersize=8)

plt.plot(X_test, y_pred_test, 's', label="predicted value", markersize=4)
plt.plot(x, np.sin(x))
plt.legend(loc='best', prop={'size': 8});


In [135]:
def N_neighbors_effect(K):
    fig = plt.figure()
    kneighbor_regression = KNeighborsRegressor(n_neighbors=K)
    kneighbor_regression.fit(X_train, y_train)
    y_pred_test = kneighbor_regression.predict(X_test)

    plt.plot(X_test, y_test, 'o', label="actual value", markersize=8)
    plt.plot(X_test, y_pred_test, 's', label="predicted value", markersize=4)
    plt.plot(x, np.sin(x), label="actual relationship")
    plt.title("K-Neighbors regression for K = " + str(K))
    plt.legend(loc='best',prop={'size': 8})
    plt.show()
    
widgets.interactive(N_neighbors_effect, K=widgets.IntSlider(min=1, value=1,max=180))

## Complexity of model has to match the problem

+ too complex and the model will memorize the training data and generalize poorly
+ too simple and the model won't capture the nuances of the problem
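
The two failure modes can be seen in numbers by comparing the score on the training and test data for different values of K. The sketch below recreates noisy sine data similar to the example above; the particular values of K and the 25% test split are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.RandomState(42)
x = np.linspace(-5, 5, 200)
y = np.sin(x) + rng.uniform(-0.75, 0.75, size=len(x))
X = x[:, np.newaxis]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

scores = {}
for k in (1, 10, 100):
    model = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    # score() returns R^2; store (training score, test score)
    scores[k] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print("K = {0:3d}: train R^2 = {1:.2f}, test R^2 = {2:.2f}".format(k, *scores[k]))
```

With K = 1 the model reproduces the training data perfectly, noise included, so the training score says little about generalization. With a very large K the prediction approaches a plain average and the sine shape is lost.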

<img src="figures/plot_kneigbors_regularization.png" width="80%">

## Back to linear models

We have to measure the quality of our model. A common measure used in regression problems is $R^2$.


<img src="figures/r_squared.png" width="65%">

MSE - Mean Squared Error can also be used:

$$MSE = \frac{1}{n} \sum_{i=1}^{n} (\text{predicted}_i - \text{true}_i)^2$$
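
Both measures are easy to compute by hand. The sketch below uses four invented values and assumes the usual definition $R^2 = 1 - SS_{res} / SS_{tot}$, then compares the results with `sklearn.metrics`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 10.0])

# MSE: mean of the squared differences
mse_manual = np.mean((y_pred - y_true) ** 2)

# R^2: 1 - (residual sum of squares) / (total sum of squares around the mean)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print("MSE:", mse_manual, mean_squared_error(y_true, y_pred))
print("R^2:", r2_manual, r2_score(y_true, y_pred))
```

For these numbers the squared differences sum to 1.5, giving an MSE of 0.375 and $R^2 = 1 - 1.5 / 20 = 0.925$. The `score` method of scikit-learn regressors returns this same $R^2$.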

In [136]:
fig = plt.figure()
x = np.linspace(0, 10, 100)
X = x[:, np.newaxis]
rng = np.random.RandomState(42)
y = 2 * x + 5 + rng.uniform(-3, 3, size=len(x))
Y = y[:, np.newaxis]
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state=42)
regressor.fit(X_train, y_train)
print('Weight of x: ', regressor.coef_)
print('y intercept: ', regressor.intercept_)

min_pt = 2 * X.min() + 5
max_pt = 2 * X.max() + 5


plt.plot([X.min(), X.max()], [min_pt, max_pt], 'black')
min_pt_pred = X.min() * regressor.coef_[0] + regressor.intercept_
max_pt_pred = X.max() * regressor.coef_[0] + regressor.intercept_
plt.plot([X.min(), X.max()], [min_pt_pred, max_pt_pred], 'g--')
plt.plot(X_train, y_train, 'bo')
plt.plot(X_test, y_test, 'rx')

plt.show()
regressor.score(X_test, y_test)


Weight of x:  [[ 1.9610179]]
y intercept:  [ 4.95188222]


0.89477978537519487

According to the Gauss-Markov theorem, Ordinary Least Squares is BLUE: the Best Linear Unbiased Estimator.

However, we might want to give up unbiasedness in order to get a better result (by decreasing the error that comes from variance).


<img src="figures/bias-and-variance.jpg" width="65%">

<img src="figures/biaserror.png" width="65%">

## Regularization

In OLS we were minimizing the following: 


$$ \text{min}_{w, b} \sum_i || w^\mathsf{T}x_i + b  - y_i||^2 $$

<img src="figures/regularization.png" width="65%">


For Ridge Regression we will minimize: 

$$ \text{min}_{w,b}  \sum_i || w^\mathsf{T}x_i + b  - y_i||^2  + \alpha ||w||_2^2$$ 

And for Lasso: 

$$ \text{min}_{w, b} \sum_i \frac{1}{2} || w^\mathsf{T}x_i + b  - y_i||^2  + \alpha ||w||_1$$ 


## In scikit-learn these methods are available here:

`from sklearn.linear_model import Ridge`

`ridge = Ridge(alpha=alpha).fit(X_train, y_train)`

`from sklearn.linear_model import Lasso`

`lasso = Lasso(alpha=alpha).fit(X_train, y_train)`
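
Here is a minimal sketch of how the penalties change the fitted weights. The synthetic data (five features, of which only the first two matter) and the value `alpha = 0.5` are assumptions chosen for illustration. Note also that scikit-learn's `Lasso` scales the squared error term by `1 / (2 * n_samples)`, so the same `alpha` does not mean the same penalty strength for `Ridge` and `Lasso`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Hypothetical data: only the first two of five features influence y
rng = np.random.RandomState(42)
X_demo = rng.normal(size=(100, 5))
true_w = np.array([3.0, -2.0, 0.0, 0.0, 0.0])
y_demo = X_demo @ true_w + 1.0 + rng.normal(scale=0.5, size=100)

alpha = 0.5  # assumed regularization strength, not a recommended value

ols = LinearRegression().fit(X_demo, y_demo)
ridge = Ridge(alpha=alpha).fit(X_demo, y_demo)
lasso = Lasso(alpha=alpha).fit(X_demo, y_demo)

np.set_printoptions(precision=3, suppress=True)
print("OLS:  ", ols.coef_)
print("Ridge:", ridge.coef_)
print("Lasso:", lasso.coef_)
```

Ridge shrinks all weights a little, while the L1 penalty of Lasso can push the weights of uninformative features to exactly zero, which makes it useful for feature selection.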

-----------------------------
# Clustering

Let's create three blobs of points.

In [140]:
from sklearn.datasets import make_blobs
plt.figure()
X, y = make_blobs(random_state=42)
plt.scatter(X[:, 0], X[:, 1], s=100)
plt.show()


We want the clustering algorithm to assign these points to clusters. We can **see** these clusters, but we are slow and limited to 3D. A computer cannot see them, but if we specify how to group the points, it can do so much faster and in more dimensions. In K-Means clustering, we give the number of clusters and the data as input, and the computer returns the assignment of each point to one of the clusters. Usually the Euclidean distance is used.
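
The core loop behind K-Means can be sketched in a few lines of NumPy: assign each point to its nearest center, then move each center to the mean of its points. The functions `assign_labels` and `kmeans_sketch`, the fixed number of iterations and the toy data are assumptions made for this illustration. scikit-learn's `KMeans` additionally uses a smarter initialization, several restarts and a convergence check.

```python
import numpy as np

def assign_labels(X, centers):
    """Index of the nearest center (Euclidean distance) for every point."""
    # distances has shape (n_points, n_clusters)
    distances = np.linalg.norm(
        X[:, np.newaxis, :] - centers[np.newaxis, :, :], axis=2
    )
    return distances.argmin(axis=1)

def kmeans_sketch(X, n_clusters, n_iter=20, seed=0):
    """Plain assign-and-update iterations with a fixed iteration count."""
    rng = np.random.RandomState(seed)
    # start from randomly chosen data points
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        labels = assign_labels(X, centers)
        # move each center to the mean of its points; keep it if the cluster is empty
        centers = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(n_clusters)
        ])
    return assign_labels(X, centers), centers

# Hypothetical toy data: three groups of 50 points each
rng = np.random.RandomState(42)
offsets = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X_demo = np.vstack([rng.normal(loc=o, scale=1.0, size=(50, 2)) for o in offsets])

labels, centers = kmeans_sketch(X_demo, n_clusters=3)
print(centers)
print(np.bincount(labels, minlength=3))
```

Because the starting centers are random, a plain loop like this can end up in a poor local optimum, which is one reason to prefer the library implementation in practice.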

In [143]:
from sklearn.cluster import KMeans
plt.figure()
kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=labels);


For us it's pretty obvious but in more dimensions it gets tricky.

## Sometimes we don't want to cluster by distance

Then density-based methods might be more useful.

In [144]:
from sklearn.datasets import make_moons
X, y = make_moons(n_samples=400,
                  noise=0.1,
                  random_state=1)
plt.figure()
plt.scatter(X[:,0], X[:,1])
plt.show()


In [149]:
from sklearn.cluster import DBSCAN

plt.figure()
db = DBSCAN(eps=0.22,
            min_samples=10,
            metric='euclidean')
prediction = db.fit_predict(X)


plt.scatter(X[:, 0], X[:, 1], c=prediction);
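
Besides looking at the plot, we can inspect the labels directly. DBSCAN marks points it considers noise with the label `-1`, and since `make_moons` also returns the true moon of every point, the adjusted Rand index can be used to compare a clustering with it. Using `adjusted_rand_score` and running `KMeans` for comparison are our additions in this sketch; which method scores higher depends on the data and the parameters.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X, y = make_moons(n_samples=400, noise=0.1, random_state=1)

db_labels = DBSCAN(eps=0.22, min_samples=10, metric='euclidean').fit_predict(X)
km_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# -1 is DBSCAN's label for noise, so it is excluded from the cluster count
n_db_clusters = len(set(db_labels) - {-1})
n_noise = int(np.sum(db_labels == -1))
ari_db = adjusted_rand_score(y, db_labels)
ari_km = adjusted_rand_score(y, km_labels)

print("DBSCAN clusters:", n_db_clusters, "noise points:", n_noise)
print("Adjusted Rand index, DBSCAN:", ari_db)
print("Adjusted Rand index, KMeans:", ari_km)
```

An adjusted Rand index of 1 means the clustering matches the true grouping exactly, while values near 0 mean it is no better than a random assignment.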


And another example of density-based clustering.

In [153]:
from sklearn.datasets import make_circles

plt.figure()
X, y = make_circles(n_samples=1500, 
                    factor=.4, 
                    noise=.05)

plt.scatter(X[:, 0], X[:, 1]);
plt.show()


In [156]:
from sklearn.datasets import make_circles
from sklearn.cluster import KMeans, DBSCAN

X, y = make_circles(n_samples=1500, 
                    factor=.4, 
                    noise=.05)

km = KMeans(n_clusters=2)
plt.figure()
plt.title("KMeans")
plt.scatter(X[:, 0], X[:, 1], c=km.fit_predict(X))

db = DBSCAN(eps=0.2)
plt.figure()
plt.title("DBSCAN - Density-based Spatial Clustering of Applications with Noise")
plt.scatter(X[:, 0], X[:, 1], c=db.fit_predict(X));



In [None]:
from sklearn.model_selection import KFold, StratifiedKFold, ShuffleSplit


Some of the examples come from the **SciPy 2017 Scikit-learn Tutorial**.