![Image](images/scikit-learn-logo.png)

Scikit-learn is an open-source machine learning library that implements many popular machine learning algorithms. It also provides tools for model fitting, data preprocessing, model selection and evaluation, along with many other utilities, all exposed as highly optimized Python functions and classes. Let's look at some scikit-learn features.

In [35]:
# Print scikit-learn version
import sklearn
sklearn.__version__

'0.24.2'


# Loading datasets

Scikit-learn ships with a few sample datasets, so we can start building models without first acquiring data from an external source (which is often non-trivial). They are provided by the module `sklearn.datasets`.

**APIs:** 

* **`datasets.load_iris(*[, return_X_y, as_frame])`** - Load and return the iris dataset (classification).
* **`datasets.load_digits(*[, n_class, …])`** - Load and return the digits dataset (classification).

**Examples:**

In [25]:
# Suppress warnings
import warnings
warnings.filterwarnings("ignore")

In [4]:
# Example 1

from sklearn import datasets

# import iris dataset
iris = datasets.load_iris()
X = iris.data[:, :2]  # we only take the first two features.
y = iris.target

In [10]:
# Example 2

# import the digits dataset
digits = datasets.load_digits()

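As a minimal sketch of what the loaders return, the snippet below inspects the two datasets loaded above. It relies on the `return_X_y` option listed in the APIs and on the `feature_names`, `target_names` and `images` attributes of the returned objects.

```python
from sklearn import datasets

# Load iris as a dictionary-like object and look at its contents
iris = datasets.load_iris()
print(iris.data.shape)          # (n_samples, n_features)
print(iris.feature_names)
print(list(iris.target_names))

# return_X_y=True skips the container and returns the arrays directly
X, y = datasets.load_iris(return_X_y=True)
print(X.shape, y.shape)

# The digits dataset also exposes each sample as an 8x8 image
digits = datasets.load_digits()
print(digits.data.shape, digits.images.shape)
```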

# Preprocessing

Preprocessing covers scaling, centering, normalization and binarization. The transformers share a very similar API built around four methods: `fit()`, `transform()`, `fit_transform()` and `inverse_transform()` (the last one is available only where the transformation can be undone). They are provided by the module `sklearn.preprocessing`.

**APIs:**

* **`fit(X[, y])`** - Fit to data.
* **`fit_transform(X[, y])`** - Fit to data, then transform it.
* **`inverse_transform(X)`** - Undo the transform.
* **`transform(X)`** - Apply the fitted transformation to X.

**Examples:**

In [13]:
# Example 1

from sklearn.preprocessing import StandardScaler

X = [[0, 15], [1, -10]]

# Fit the scaler (per-column mean and standard deviation), then scale the data
print(StandardScaler().fit(X).transform(X))

[[-1.  1.]
 [ 1. -1.]]


In [14]:
# Example 2

from sklearn.preprocessing import MinMaxScaler

data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]

scaler = MinMaxScaler()
scaler.fit(data)
print(scaler.transform(data))

[[0.   0.  ]
 [0.25 0.25]
 [0.5  0.5 ]
 [1.   1.  ]]

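The four methods listed above can be combined as in the following sketch. It fits a `StandardScaler` on one array, reuses the learned statistics on new data, and undoes the scaling. The `X_new` sample is made up for illustration and is chosen to equal the column means of `X_train`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[0.0, 15.0], [1.0, -10.0]])
X_new = np.array([[0.5, 2.5]])  # hypothetical unseen sample (the column means)

scaler = StandardScaler()

# fit_transform(X) is equivalent to fit(X).transform(X)
X_train_scaled = scaler.fit_transform(X_train)

# transform() reuses the mean and standard deviation learned from X_train
X_new_scaled = scaler.transform(X_new)

# inverse_transform() maps scaled values back to the original units
X_restored = scaler.inverse_transform(X_train_scaled)

print(X_new_scaled)
print(np.allclose(X_restored, X_train))
```

Fitting on the training data only and then calling `transform()` on new data keeps the new data from influencing the scaling parameters.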



# Dimensionality Reduction

More often than not, we have to work with only a few features or parameters, for example because features are highly correlated or have missing values. For this reason, there are many dimensionality reduction techniques, including PCA, NMF and ICA. In scikit-learn, they are provided by the module `sklearn.decomposition`.

**APIs:**

* **`fit(X[, y])`** - Fit model on training data X.
* **`fit_transform(X[, y])`** - Fit model to X and perform dimensionality reduction on X.
* **`get_params([deep])`** - Get parameters for this estimator.
* **`inverse_transform(X)`** - Transform X back to its original space.
* **`set_params(**params)`** - Set the parameters of this estimator.
* **`transform(X)`** - Perform dimensionality reduction on X.

**Examples:**

In [15]:
# Example 1

import numpy as np
from sklearn.decomposition import PCA

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])

pca = PCA(n_components=2)
pca.fit(X)
print(pca.explained_variance_ratio_)
print(pca.singular_values_)

[0.99244289 0.00755711]
[6.30061232 0.54980396]


In [16]:
# Example 2

from sklearn.decomposition import TruncatedSVD
from scipy.sparse import random as sparse_random

X = sparse_random(100, 100, density=0.01, format='csr', random_state=42)

svd = TruncatedSVD(n_components=5, n_iter=7, random_state=42)
svd.fit(X)
print(svd.explained_variance_ratio_)
print(svd.explained_variance_ratio_.sum())
print(svd.singular_values_)

[0.06460458 0.06339574 0.06394296 0.05352982 0.04061973]
0.28609283521378903
[1.55360944 1.5121377  1.51052009 1.37056529 1.19917045]

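To illustrate `fit_transform()` and `inverse_transform()` from the list above, the following sketch reduces the data from Example 1 to a single principal component and then maps it back. Keeping one component is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]], dtype=float)

# Keep only the first principal component
pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)               # shape (6, 1)

# Map the reduced data back to the original two-dimensional space
X_restored = pca.inverse_transform(X_reduced)  # shape (6, 2)

# The reconstruction is close but not exact, because one component was dropped
error = np.mean((X - X_restored) ** 2)

print(X_reduced.shape, X_restored.shape)
print(pca.explained_variance_ratio_)
print(error)
```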



# Feature Selection

Feature selection is another approach to dimensionality reduction. The `sklearn.feature_selection` module can be used for feature selection/dimensionality reduction on sample sets.

**APIs:**

* **`fit(X[, y])`** - Fit to data.
* **`fit_transform(X[, y])`** - Fit to data, then transform it.
* **`get_params([deep])`** - Get parameters for this estimator.
* **`get_support([indices])`** - Get a mask, or integer index, of the features selected.
* **`inverse_transform(X)`** - Reverse the transformation operation.
* **`set_params(**params)`** - Set the parameters of this estimator.
* **`transform(X)`** - Reduce X to the selected features.

**Examples:**

In [17]:
# Example 1 : Removing features with low variance

from sklearn.feature_selection import VarianceThreshold

X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]

sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
print(sel.fit_transform(X))

[[0 1]
 [1 0]
 [0 0]
 [1 1]
 [1 0]
 [1 1]]


In [12]:
# Example 2 : Tree-based feature selection

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel

X, y = load_iris(return_X_y=True)
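print(X.shape)

# No random_state is set, so the importances below vary between runs
clf = ExtraTreesClassifier(n_estimators=50)
clf = clf.fit(X, y)
print(clf.feature_importances_)

# Keep the features whose importance reaches the default threshold (the mean importance)
model = SelectFromModel(clf, prefit=True)
X_new = model.transform(X)
print(X_new.shape)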

(150, 4)
[0.09993554 0.05498416 0.38131317 0.46376713]
(150, 2)

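The `get_support()` method from the list above reports which of the original columns survived the selection. A minimal sketch, reusing the data and threshold from Example 1 and additionally reading the fitted `variances_` attribute:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]])

sel = VarianceThreshold(threshold=0.8 * (1 - 0.8))
X_selected = sel.fit_transform(X)

mask = sel.get_support()                 # boolean mask over the original columns
indices = sel.get_support(indices=True)  # integer positions of the kept columns

print(sel.variances_)  # per-column variance compared against the threshold
print(mask)
print(indices)
```

The first column has a variance below the threshold, so it is the one that gets dropped.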



# Models

Scikit-learn provides a collection of machine learning models, both supervised and unsupervised, covering classification, regression and clustering. The following are a few examples.


## Classification

**APIs:**

* **`decision_function(X)`** - Evaluate the decision function for the samples in X (available on some classifiers, such as `SVC`).
* **`fit(X, y[, sample_weight])`** - Fit the model according to the given training data.
* **`get_params([deep])`** - Get parameters for this estimator.
* **`predict(X)`** - Perform classification on samples in X.
* **`score(X, y[, sample_weight])`** - Return the mean accuracy on the given test data and labels.
* **`set_params(**params)`** - Set the parameters of this estimator.

**Examples:**

In [18]:
# Example 1 : Support Vector Classifier

from sklearn import svm

X = [[0, 0], [1, 1]]
y = [0, 1]

clf = svm.SVC()
clf.fit(X, y)

print(clf.predict([[2., 2.]]))

[1]


In [19]:
# Example 2 : Decision Tree Classifier

from sklearn import tree

X = [[0, 0], [1, 1]]
Y = [0, 1]

clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)

print(clf.predict([[2., 2.]]))

[1]

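The toy examples above only call `fit()` and `predict()`. The following sketch also exercises `score()` and `decision_function()` on the iris dataset. The 60/40 split, made with `train_test_split` from `sklearn.model_selection`, and the linear kernel are arbitrary choices for illustration.

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)

# Hold out 40% of the samples for testing (the ratio is an arbitrary choice)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

clf = svm.SVC(kernel="linear", C=1)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)              # predicted class labels
accuracy = clf.score(X_test, y_test)      # mean accuracy on the held-out data
decision = clf.decision_function(X_test)  # one column of scores per class

print(y_pred[:5])
print(accuracy)
print(decision.shape)
```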


## Regression

**APIs:**

* **`fit(X, y)`** - Fit the model to data matrix X and target(s) y.
* **`get_params([deep])`** - Get parameters for this estimator.
* **`predict(X)`** - Predict target values for the samples in X.
* **`score(X, y[, sample_weight])`** - Return the coefficient of determination (R²) of the prediction.
* **`set_params(**params)`** - Set the parameters of this estimator.

**Examples:**

In [20]:
# Example 1 : MLP Regressor

from sklearn.neural_network import MLPRegressor

X = [[0, 0], [2, 2]]
y = [0.5, 2.5]

# No random_state is set, so the prediction below varies between runs
regr = MLPRegressor()
regr.fit(X, y)
print(regr.predict([[1, 1]]))

[1.51911035]


In [21]:
# Example 2 : Support Vector Regressor

from sklearn import svm

X = [[0, 0], [2, 2]]
y = [0.5, 2.5]

regr = svm.SVR()
regr.fit(X, y)
print(regr.predict([[1, 1]]))

[1.5]

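For regressors, `score()` returns the coefficient of determination rather than an accuracy. The sketch below checks this by comparing it with `r2_score` from `sklearn.metrics`. The toy data and the linear kernel are made up for illustration.

```python
import numpy as np
from sklearn import svm
from sklearn.metrics import r2_score

# Toy data lying on the line y = x + 0.5 (made up for illustration)
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.5, 1.5, 2.5, 3.5])

regr = svm.SVR(kernel="linear")
regr.fit(X, y)

y_pred = regr.predict(X)
r2_from_score = regr.score(X, y)      # coefficient of determination
r2_from_metric = r2_score(y, y_pred)  # same quantity, computed explicitly

print(r2_from_score, r2_from_metric)
```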



## Clustering

Clustering of unlabeled data can be performed with the module `sklearn.cluster`.

**APIs:**

* **`fit(X[, y, sample_weight])`** - Learn the clusters on train data.
* **`fit_predict(X[, y, sample_weight])`** - Compute cluster centers and predict cluster index for each sample.
* **`fit_transform(X[, y, sample_weight])`** - Compute clustering and transform X to cluster-distance space.
* **`get_params([deep])`** - Get parameters for this estimator.
* **`predict(X[, sample_weight])`** - Predict the closest cluster each sample in X belongs to.
* **`set_params(**params)`** - Set the parameters of this estimator.
* **`transform(X)`** - Transform X to a cluster-distance space.

**Examples:**


In [38]:
# Example 1 : K-Means Clustering

from sklearn.cluster import KMeans
import numpy as np

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
print(kmeans.labels_)
print(kmeans.predict([[0, 0], [12, 3]]))
print(kmeans.cluster_centers_)

[1 1 1 0 0 0]
[1 0]
[[10.  2.]
 [ 1.  2.]]


In [40]:
# Example 2 : Mean Shift

from sklearn.cluster import MeanShift
import numpy as np
X = np.array([[1, 1], [2, 1], [1, 0], [4, 7], [3, 5], [3, 6]])

ms = MeanShift(bandwidth=2).fit(X)
print(ms.labels_)
print(ms.predict([[0, 0], [5, 5]]))
print(ms.cluster_centers_)

[1 1 1 0 0 0]
[1 0]
[[3.33333333 6.        ]
 [1.33333333 0.66666667]]

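The `fit_predict()` and `transform()` methods listed above are not used in the two examples. A minimal sketch with the K-Means data from Example 1 shows how they relate: each sample is assigned to the center it is closest to in cluster-distance space.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]], dtype=float)

# n_init is set explicitly because its default differs between scikit-learn versions
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)

labels = kmeans.fit_predict(X)   # fit, then return the cluster index of each sample
distances = kmeans.transform(X)  # distance from each sample to each cluster center

print(labels)
print(distances.shape)

# Each sample is assigned to its nearest center
print(np.array_equal(labels, distances.argmin(axis=1)))
```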


# Model Selection & Evaluation

## Cross-validation

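Cross-validation estimates how well a model generalizes to unseen data, which helps to detect overfitting. The data is split into several train/validation folds, the model is trained and scored on each of them, and the scores are averaged. When several candidate models are compared, the one with the highest average score is typically selected. Provided by `sklearn.model_selection`.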

**APIs:**

* **`cross_val_score(estimator, X, y=None, scoring=None, cv=None)`** - Evaluate a score by cross-validation.

**Examples:**

In [22]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import svm
from sklearn.model_selection import cross_val_score

X, y = datasets.load_iris(return_X_y=True)
print(X.shape, y.shape)

(150, 4) (150,)


In [23]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

print(X_train.shape, y_train.shape)

print(X_test.shape, y_test.shape)

clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)

# Cross-validation scores for 5 splits of the full dataset
# (cross_val_score refits a clone of clf on each split)
scores = cross_val_score(clf, X, y, cv=5)
print(scores)

# Print average score
print("%0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))

(90, 4) (90,)
(60, 4) (60,)
[0.96666667 1.         0.96666667 0.96666667 1.        ]
0.98 accuracy with a standard deviation of 0.02

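Selecting the model with the highest average score could look like the following sketch. The two candidates and their names in the dictionary are arbitrary choices for illustration, not a recommendation.

```python
from sklearn import datasets, svm, tree
from sklearn.model_selection import cross_val_score

X, y = datasets.load_iris(return_X_y=True)

# Candidate models to compare (an arbitrary pair chosen for illustration)
candidates = {
    "svc_linear": svm.SVC(kernel="linear", C=1),
    "decision_tree": tree.DecisionTreeClassifier(random_state=0),
}

# Mean 5-fold cross-validation accuracy per candidate
mean_scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in candidates.items()
}

# Pick the candidate with the highest average score
best_name = max(mean_scores, key=mean_scores.get)

print(mean_scores)
print(best_name)
```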


## Metrics 

The `sklearn.metrics` module implements functions assessing prediction error for specific purposes.

**APIs:**

* **`confusion_matrix(y_true, y_pred, *[, …])`** - Compute confusion matrix to evaluate the accuracy of a classification.
* **`roc_auc_score(y_true, y_score, *[, average, …])`** - Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC) from prediction scores.
* **`accuracy_score(y_true, y_pred, *[, …])`** - Accuracy classification score.
* **`classification_report(y_true, y_pred, *[, …])`** - Build a text report showing the main classification metrics.
* **`f1_score(y_true, y_pred, *[, labels, …])`** - Compute the F1 score, also known as balanced F-score or F-measure.
* **`precision_score(y_true, y_pred, *[, labels, …])`** - Compute the precision.
* **`recall_score(y_true, y_pred, *[, labels, …])`** - Compute the recall.

**Examples:**

In [24]:
# Example 1 : Confusion Matrix

from sklearn.metrics import confusion_matrix
y_true = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]
print(confusion_matrix(y_true, y_pred))

[[2 0 0]
 [0 0 1]
 [1 0 2]]


In [25]:
# Example 2 : Classification report

from sklearn.metrics import classification_report
y_true = [0, 1, 2, 2, 0]
y_pred = [0, 0, 2, 1, 0]
target_names = ['class 0', 'class 1', 'class 2']
print(classification_report(y_true, y_pred, target_names=target_names))

              precision    recall  f1-score   support

     class 0       0.67      1.00      0.80         2
     class 1       0.00      0.00      0.00         1
     class 2       1.00      0.50      0.67         2

    accuracy                           0.60         5
   macro avg       0.56      0.50      0.49         5
weighted avg       0.67      0.60      0.59         5

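The individual scoring functions listed above can also be called directly. The sketch below reuses the labels from Example 2 with `average="macro"`, and adds a small binary example for `roc_auc_score`. The binary labels and scores are made up for illustration.

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

y_true = [0, 1, 2, 2, 0]
y_pred = [0, 0, 2, 1, 0]

# average="macro" takes the unweighted mean of the per-class scores
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, average="macro")
recall = recall_score(y_true, y_pred, average="macro")
f1 = f1_score(y_true, y_pred, average="macro")
print(round(accuracy, 2), round(precision, 2), round(recall, 2), round(f1, 2))

# ROC AUC takes scores or probabilities rather than hard labels
# (this binary example is made up for illustration)
y_binary = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc_score(y_binary, y_score)
print(auc)
```

The first four values correspond to the `accuracy` and `macro avg` rows of the classification report in Example 2.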


# References

* Scikit-learn.org. 2021. scikit-learn: machine learning in Python — scikit-learn 0.16.1 documentation. [online] Available at: <https://scikit-learn.org/> [Accessed 2 August 2021].