# Getting started with scikit-learn
Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection, model evaluation, and many other utilities.
Guide: https://scikit-learn.org/stable/getting_started.html

## Fitting and predicting: estimator basics
Scikit-learn provides dozens of built-in machine learning algorithms and models, called estimators. Each estimator can be fitted to some data using its fit method.
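The fit method generally accepts two inputs:
· The sample matrix X, typically of shape (n_samples, n_features): samples are rows and features are columns.
· The target values y: real numbers for regression tasks, integers (or any other discrete set of values) for classification. For unsupervised tasks, y does not need to be specified.
Both X and y are usually expected to be array-like, e.g. lists or NumPy arrays.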
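
The two inputs described above can be sketched as follows. This is only an illustration: the toy data, the `_demo` variable names, and the choice of `LinearRegression` and `KMeans` as example estimators are assumptions made here, not part of the guide.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# X is array-like with shape (n_samples, n_features): 4 samples, 2 features
X_demo = np.array([[1.0, 2.0], [2.0, 1.0], [10.0, 11.0], [11.0, 10.0]])
print(X_demo.shape)  # (4, 2)

# supervised (regression): y holds one real number per sample
y_demo = np.array([3.0, 3.0, 21.0, 21.0])
reg_demo = LinearRegression().fit(X_demo, y_demo)

# unsupervised: fit receives X only, no y
km_demo = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_demo)
print(km_demo.labels_)  # one cluster label per sample
```

In both cases the call is `fit`; what changes is whether a target `y` is passed.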

In [1]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=0)
X = [[1, 2, 3], [11, 12, 13]]  # 2 samples, 3 features
y = [0, 1]  # classes of each sample
clf.fit(X, y)

RandomForestClassifier(random_state=0)

In [2]:
clf.predict(X) # predict classes of the training data

array([0, 1])
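
Once fitted, an estimator does not need retraining to predict on new samples. A minimal sketch, assuming the same toy data as above; the names `clf_demo` and `X_new` are chosen here for illustration so they do not overwrite the notebook's variables.

```python
from sklearn.ensemble import RandomForestClassifier

clf_demo = RandomForestClassifier(random_state=0)
clf_demo.fit([[1, 2, 3], [11, 12, 13]], [0, 1])

# new samples must have the same number of features (3) as the training data
X_new = [[4, 5, 6], [14, 15, 16]]
pred = clf_demo.predict(X_new)         # one predicted class per sample
proba = clf_demo.predict_proba(X_new)  # one row per sample, one column per class
print(pred)
print(proba.shape)  # (2, 2)
```

`predict_proba` is available on classifiers such as this one; each row of its output sums to 1.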

## Transformers and pre-processors
Machine learning workflows are often composed of different parts. A typical pipeline consists of a pre-processing step that transforms or imputes the data, and a final predictor that predicts target values.

In scikit-learn, pre-processors and transformers follow the same API as the estimator objects (they actually all inherit from the same BaseEstimator class). The transformer objects don’t have a predict method but rather a transform method that outputs a newly transformed sample matrix X:

In [4]:
from sklearn.preprocessing import StandardScaler

# scale data according to computed scaling values (what does this mean?)
StandardScaler().fit(X).transform(X)

array([[-1., -1., -1.],
       [ 1.,  1.,  1.]])
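
To make the "computed scaling values" concrete: `fit` learns a per-feature mean and standard deviation, and `transform` subtracts the mean and divides by the standard deviation. The sketch below reproduces this by hand with NumPy; the names `X_small` and `manual` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_small = np.array([[1, 2, 3], [11, 12, 13]], dtype=float)

scaler = StandardScaler().fit(X_small)
print(scaler.mean_)   # per-feature mean: [6. 7. 8.]
print(scaler.scale_)  # per-feature standard deviation: [5. 5. 5.]

# same computation by hand: (x - mean) / std, column by column
manual = (X_small - X_small.mean(axis=0)) / X_small.std(axis=0)
print(np.allclose(manual, scaler.transform(X_small)))  # True
```

Each column ends up with zero mean and unit variance, which is why both features of the toy data map to -1 and 1.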

## Pipelines: chaining pre-processors and estimators
Transformers and estimators (predictors) can be combined into a single object: a Pipeline. The pipeline offers the same API as a regular estimator: it can be fitted and used for prediction with fit and predict. As we will see later, using a pipeline also helps protect you from data leakage, i.e. disclosing some testing data in your training data.

In the following example, we load the Iris dataset, split it into train and test sets, and compute the accuracy score of a pipeline on the test data:

In [5]:
import sklearn
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# create a pipeline object (oh?)
pipe = make_pipeline(StandardScaler(), LogisticRegression())

# load the iris dataset and split it into train and test sets
X, y = load_iris(return_X_y=True)  # What is the Iris dataset? Sounds like a good teaching sample :D
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# fit the whole pipeline
pipe.fit(X_train, y_train)

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('logisticregression', LogisticRegression())])

In [6]:
# we can now use it like any other estimator
accuracy_score(y_test, pipe.predict(X_test))  # signature is (y_true, y_pred)

0.9736842105263158
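
One way to see what the pipeline does, and why it helps against leakage, is to write the same steps out by hand: the scaler is fitted on the training split only, then reused unchanged on the test split. This is a sketch under that reading; the `_manual` and `_demo` names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_iris, y_iris, random_state=0)

# manual version: the scaler only ever sees the training split
scaler_manual = StandardScaler().fit(X_tr)
clf_manual = LogisticRegression().fit(scaler_manual.transform(X_tr), y_tr)
manual_pred = clf_manual.predict(scaler_manual.transform(X_te))

# pipeline version: the same two steps behind a single fit/predict
pipe_demo = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_tr, y_tr)
pipe_pred = pipe_demo.predict(X_te)

print(np.array_equal(manual_pred, pipe_pred))
```

Calling `StandardScaler().fit(X_iris)` on the full dataset before splitting would let test-set statistics influence training, which is the leakage the pipeline avoids.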

## Model evaluation
Fitting a model to some data does not entail that it will predict well on unseen data. This needs to be directly evaluated. We have just seen the train_test_split helper that splits a dataset into train and test sets, but scikit-learn provides many other tools for model evaluation, in particular for cross-validation.

Here we briefly show how to perform 5-fold cross-validation using the cross_validate helper. Note that it is also possible to iterate over the folds manually, use different data splitting strategies, and use custom scoring functions.


In [7]:
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

X, y = make_regression(n_samples=1000, random_state=0)
lr = LinearRegression()
result = cross_validate(lr, X, y)  # defaults to 5-fold CV
result['test_score']  # r_squared score is high because dataset is easy

array([1., 1., 1., 1., 1.])
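
The manual alternative mentioned above can be sketched with `KFold`, together with a call that requests more than one metric. The assumption here is that, for a regressor, the default 5-fold split is an unshuffled `KFold`, so the manual scores should match the `r2` scores from `cross_validate`. Variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_validate

X_reg, y_reg = make_regression(n_samples=1000, random_state=0)

# iterate over the folds by hand: fit on 4 folds, score on the held-out fold
manual_scores = []
for train_idx, test_idx in KFold(n_splits=5).split(X_reg):
    model = LinearRegression().fit(X_reg[train_idx], y_reg[train_idx])
    manual_scores.append(r2_score(y_reg[test_idx], model.predict(X_reg[test_idx])))
print(np.round(manual_scores, 3))

# several metrics at once: results are keyed as 'test_<metric name>'
multi = cross_validate(LinearRegression(), X_reg, y_reg,
                       scoring=["r2", "neg_mean_absolute_error"])
print(sorted(multi.keys()))
```

Error metrics are exposed in negated form (`neg_mean_absolute_error`) so that a higher score is always better.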