# Getting Started

https://scikit-learn.org/stable/getting_started.html
    
This guide introduces the main features that scikit-learn provides.

It assumes a very basic working knowledge of machine learning practices (model fitting, predicting, cross-validation, etc.).

Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning. 

It also provides various tools for model fitting, data preprocessing, model selection and evaluation, and many other utilities.

# Fitting and predicting: estimator basics

Scikit-learn provides dozens of built-in machine learning algorithms and models, called estimators. 

Each estimator can be fitted to some data using its fit method.

## Example: 


We fit a RandomForestClassifier to some very basic training data and then use it to make predictions.

- Training/Fitting:

    - The fit method generally accepts 2 inputs: the samples matrix X and the target values y.
        
        - The samples matrix (or design matrix) X. The size of X is typically (n_samples, n_features), which means that samples are represented as rows and features are represented as columns.
        - The target values y, which are real numbers for regression tasks, or integers for classification (or any other discrete set of values). For unsupervised learning tasks, y does not need to be specified. y is usually a 1d array where the i-th entry corresponds to the target of the i-th sample (row) of X.
    
    - Both X and y are usually expected to be numpy arrays or equivalent array-like data types, though some estimators work with other formats such as sparse matrices.

- Prediction:

    - Once the estimator is fitted, it can be used for predicting target values of new data. 
    - You don't need to re-train the estimator:

In [7]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=0)
X = [[ 1,  2,  3],  # 2 samples, 3 features
     [11, 12, 13]]
y = [0, 1]  # classes of each sample
clf.fit(X, y)
# X holds 2 samples with 3 features each; y holds the class of each sample
print('Perform Random Forest Classifier:')
print(clf)  # the fitted estimator
print()
# predict classes of the training data
print('clf.predict(X):')
print(clf.predict(X))
print()
# [0 1]

# predict classes of new data without re-fitting
print('clf.predict([[1, 2, 3], [4, 5, 6], [7, 8, 9], [11, 12, 13],[14, 15, 16]]):')
print(clf.predict([[1, 2, 3], [4, 5, 6], [7, 8, 9], [11, 12, 13],[14, 15, 16]]))
# [0 0 1 1 1]

Perform Random Forest Classifier:
RandomForestClassifier(random_state=0)

clf.predict(X):
[0 1]

clf.predict([[1, 2, 3], [4, 5, 6], [7, 8, 9], [11, 12, 13],[14, 15, 16]]):
[0 0 1 1 1]
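
The shape conventions described above can be checked directly. The following is a minimal sketch, assuming NumPy arrays in place of the nested lists used in the example; it only illustrates that X is 2d with shape (n_samples, n_features), that y is 1d with one entry per row of X, and that predict returns one value per sample.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Samples matrix: 2 rows (samples) and 3 columns (features)
X = np.array([[1, 2, 3],
              [11, 12, 13]])
# Target values: one class label per row of X
y = np.array([0, 1])

n_samples, n_features = X.shape

clf = RandomForestClassifier(random_state=0)
clf.fit(X, y)

# predict returns one label per input sample
pred = clf.predict(X)

print('X.shape:', X.shape)
print('y.shape:', y.shape)
print('pred.shape:', pred.shape)
print('classes seen during fit:', clf.classes_)
```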


# Transformers and pre-processors

    
Machine learning workflows are often composed of different parts. 

- A typical pipeline consists of a pre-processing step that transforms or imputes the data, and a final predictor that predicts target values.
- In scikit-learn, pre-processors and transformers follow the same API as the estimator objects (they actually all inherit from the same BaseEstimator class). 
- The transformer objects don't have a predict method but rather a transform method that outputs a newly transformed sample matrix X:

## Example: Transform by Standard Scaler

In [8]:
from sklearn.preprocessing import StandardScaler
X = [[0, 15],
     [1, -10]]

print('StandardScaler().fit(X).transform(X):')
print(StandardScaler().fit(X).transform(X))
#array([[-1.,  1.],
#       [ 1., -1.]])

StandardScaler().fit(X).transform(X):
[[-1.  1.]
 [ 1. -1.]]
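
To make the result above less opaque, here is a small sketch of what the scaler learns in fit and applies in transform. It assumes the same X as in the example and compares the output against a manual computation with NumPy (subtract the per-column mean, divide by the per-column population standard deviation).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[0, 15],
              [1, -10]], dtype=float)

# fit learns the per-feature mean and standard deviation
scaler = StandardScaler().fit(X)

# transform applies (X - mean) / std using the learned statistics
scaled = scaler.transform(X)

# the same computation done by hand (np.std uses ddof=0 by default)
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print('scaler.mean_:', scaler.mean_)
print('scaler.scale_:', scaler.scale_)
print('matches manual computation:', np.allclose(scaled, manual))
```

Because the statistics are stored on the fitted object, the same scaler can later transform new data using the training mean and standard deviation.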


Sometimes, you want to apply different transformations to different features: the ColumnTransformer is designed for these use-cases.
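
As a rough illustration of that use-case, the sketch below applies a StandardScaler to the first column and a MinMaxScaler to the second. The toy data, the step names ("standardize", "minmax"), and the choice of MinMaxScaler are assumptions made for this example only.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# toy data: 3 samples, 2 features (illustrative values)
X = np.array([[0.0, 15.0],
              [1.0, -10.0],
              [2.0, 5.0]])

# each entry is (name, transformer, columns the transformer applies to)
ct = ColumnTransformer([
    ("standardize", StandardScaler(), [0]),
    ("minmax", MinMaxScaler(), [1]),
])

# the outputs of the transformers are concatenated column-wise
Xt = ct.fit_transform(X)

print('Xt.shape:', Xt.shape)
print(Xt)
```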
    
# Pipelines: chaining pre-processors and estimators


Transformers and estimators (predictors) can be combined together into a single unifying object: a Pipeline. The pipeline offers the same API as a regular estimator: it can be fitted and used for prediction with fit and predict.

As we will see later, using a pipeline also helps protect you from data leakage, i.e., disclosing some testing data in your training data.

## Example: Load iris dataset and Pipeline Transform

We load the Iris dataset and use train_test_split() to split it into train and test sets.

We then fit a pipeline on the train set and compute its accuracy score on the test set:

In [13]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# create a pipeline object
pipe = make_pipeline(StandardScaler(), LogisticRegression(random_state=0))

# load the iris dataset and split it into train and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# fit the whole pipeline:
# 1. standardize the features with StandardScaler
# 2. fit a logistic regression on the scaled features
print('pipe.fit(X_train, y_train):')
print(pipe.fit(X_train, y_train))
print()
# Pipeline(steps=[('standardscaler', StandardScaler()),
#                 ('logisticregression', LogisticRegression(random_state=0))])

# we can now use it like any other estimator
print('accuracy_score(pipe.predict(X_test), y_test):')
print(accuracy_score(pipe.predict(X_test), y_test))
# 0.97...

pipe.fit(X_train, y_train):
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('logisticregression', LogisticRegression(random_state=0))])

accuracy_score(pipe.predict(X_test), y_test):
0.9736842105263158
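
The point about data leakage can be made concrete with a sketch of what the pipeline does internally, assuming the same Iris split as above. The scaler is fitted on the training data only, and the test data is transformed with those training statistics; doing the steps by hand this way should give the same predictions as the pipeline.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# manual version: fit the scaler on the training data only
scaler = StandardScaler().fit(X_train)
model = LogisticRegression(random_state=0)
model.fit(scaler.transform(X_train), y_train)
# the test data is scaled with the statistics learned from the training data
manual_pred = model.predict(scaler.transform(X_test))

# pipeline version: the same two steps behind a single fit/predict API
pipe = make_pipeline(StandardScaler(), LogisticRegression(random_state=0))
pipe.fit(X_train, y_train)
pipe_pred = pipe.predict(X_test)

acc = accuracy_score(y_test, pipe_pred)
print('same predictions:', (manual_pred == pipe_pred).all())
print('accuracy:', acc)
```

The leaky alternative would be to fit the scaler on all of X before splitting, which lets test-set statistics influence training; the pipeline avoids this by construction.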


# Model Evaluation and Cross Validation

In the last example, we used train_test_split() to split a dataset into train and test sets.

Scikit-learn provides many other tools for model evaluation, in particular for cross-validation.

Fitting a model to some data does not mean that it will predict well on unseen data; this needs to be evaluated directly.

In the example below, we fit a model to automatically generated data and evaluate it with a 5-fold cross-validation procedure.

Note that it is also possible to manually iterate over the folds, use different data splitting strategies, and use custom scoring functions. 

## Example: Model Evaluation and Cross Validation

- We use make_regression to generate a dataset X, y.

- We then create a LinearRegression() model.

- Finally, we use the cross_validate function to perform the cross-validation and score the result.

In [17]:
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

X, y = make_regression(n_samples=1000, random_state=0)
lr = LinearRegression()

# defaults to 5-fold Cross-validation
print('result = cross_validate(lr, X, y):')
result = cross_validate(lr, X, y)  

# r_squared score is high because dataset is easy
print("result['test_score']:") 
print(result['test_score'])
#array([1., 1., 1., 1., 1.])

result = cross_validate(lr, X, y):
result['test_score']:
[1. 1. 1. 1. 1.]
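
The options mentioned earlier (manual iteration over folds, a different splitting strategy, a different scoring function) can be sketched as follows. The shuffled KFold splitter and the "neg_mean_absolute_error" scorer are choices made for this illustration, not something the example above uses.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

X, y = make_regression(n_samples=1000, random_state=0)
lr = LinearRegression()

# a different splitting strategy: 5 shuffled folds
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# a different scoring function: negated mean absolute error
# (scikit-learn scorers follow a "higher is better" convention)
result = cross_validate(lr, X, y, cv=cv, scoring="neg_mean_absolute_error")
print("result['test_score']:")
print(result['test_score'])

# manually iterating over the same folds, scoring with R^2
manual_scores = []
for train_idx, test_idx in cv.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    manual_scores.append(model.score(X[test_idx], y[test_idx]))
print('manual R^2 scores:')
print(manual_scores)
```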


# Automatic Hyper-Parameter Searches

    
All estimators have parameters (often called hyper-parameters in the literature) that can be tuned. 

The generalization power of an estimator often critically depends on a few parameters. 

For example, a RandomForestRegressor has an n_estimators parameter that determines the number of trees in the forest, and a max_depth parameter that determines the maximum depth of each tree.

Quite often, it is not clear what the exact values of these parameters should be since they depend on the data at hand.

Scikit-learn provides tools to automatically find the best parameter combinations (via cross-validation). 

In the following example, we randomly search over the parameter space of a random forest with a RandomizedSearchCV object. 

When the search is over, the RandomizedSearchCV behaves as a RandomForestRegressor that has been fitted with the best set of parameters.

## Example: Automatic Hyper-Parameter Searches

In [23]:
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split
from scipy.stats import randint

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# define the parameter space that will be searched over
param_distributions = {'n_estimators': randint(1, 5),
                       'max_depth': randint(5, 10)}

# now create a searchCV object and fit it to the data
search = RandomizedSearchCV(estimator=RandomForestRegressor(random_state=0), 
                            n_iter=5,
                            param_distributions=param_distributions, 
                            random_state=0)

print('search.fit(X_train, y_train):')
print(search.fit(X_train, y_train))
print()
# RandomizedSearchCV(estimator=RandomForestRegressor(random_state=0), n_iter=5,
#                    param_distributions={'max_depth': ...,
#                                         'n_estimators': ...},
#                    random_state=0)
print('search.best_params_:')
print(search.best_params_)
print()
# {'max_depth': 9, 'n_estimators': 4}

# the search object now acts like a normal random forest estimator
# with max_depth=9 and n_estimators=4
print('search.score(X_test, y_test):')
print(search.score(X_test, y_test))
#0.73...

search.fit(X_train, y_train):
RandomizedSearchCV(estimator=RandomForestRegressor(random_state=0), n_iter=5,
                   param_distributions={'max_depth': <scipy.stats._distn_infrastructure.rv_frozen object at 0x0000023340C31408>,
                                        'n_estimators': <scipy.stats._distn_infrastructure.rv_frozen object at 0x0000023340149AC8>},
                   random_state=0)

search.best_params_:
{'max_depth': 9, 'n_estimators': 4}

search.score(X_test, y_test):
0.735363411343253
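
To illustrate how the fitted search object "behaves as" the best estimator, the sketch below inspects best_estimator_ and checks that predictions made through the search object match it. It assumes a small synthetic dataset from make_regression (instead of the California housing data) and n_iter=3 so that it runs quickly without a download; the parameter ranges are the same as above.

```python
import numpy as np
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# small synthetic dataset (an assumption made to keep the sketch fast)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# randint(low, high) samples integers in [low, high), so high is excluded
param_distributions = {'n_estimators': randint(1, 5),
                       'max_depth': randint(5, 10)}

search = RandomizedSearchCV(estimator=RandomForestRegressor(random_state=0),
                            n_iter=3,
                            param_distributions=param_distributions,
                            random_state=0)
search.fit(X, y)

# with the default refit=True, the best candidate is refitted on all of X, y
best = search.best_estimator_
same = np.allclose(search.predict(X[:5]), best.predict(X[:5]))

print('search.best_params_:', search.best_params_)
print('candidates evaluated:', len(search.cv_results_['params']))
print('search.predict matches best_estimator_.predict:', same)
```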


# Next Steps

    
We discussed estimator fitting and predicting, pre-processing steps, pipelines, cross-validation tools and automatic hyper-parameter searches. 

This should give you an overview of some of the main features of the scikit-learn library.

You can also consult the API reference for an exhaustive list of the public API.