# Getting Started

### https://scikit-learn.org/stable/getting_started.html

### This guide assumes a very basic working knowledge of machine learning practices (model fitting, predicting, cross-validation, etc.)

### Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection, model evaluation, and many other utilities.

## Fitting and predicting: estimator basics

### Scikit-learn provides dozens of built-in machine learning algorithms and models, called estimators. Each estimator can be fitted to some data using its fit method.

### Here is a simple example where we fit a RandomForestClassifier to some very basic data:

In [1]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=0)
X = [[ 1,  2,  3],  # 2 samples, 3 features
     [11, 12, 13]]
y = [0, 1]  # classes of each sample
clf.fit(X, y)

RandomForestClassifier(random_state=0)

### About rows and columns
### Horizontal: row
### Vertical: column

### The fit method generally accepts 2 inputs:

### The samples matrix (or design matrix) X. The size of X is typically (n_samples, n_features), which means that samples are represented as rows and features are represented as columns.

### The target values y, which are real numbers for regression tasks, or integers for classification (or any other discrete, i.e. separate, set of values). For unsupervised learning tasks, y does not need to be specified. y is usually a 1d array where the i-th entry corresponds to the target of the i-th sample (row) of X.

### Both X and y are usually expected to be numpy arrays or equivalent array-like data types, though some estimators work with other formats such as sparse matrices (matrices in which most values are 0, as opposed to dense matrices).
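
### The shapes described above can be checked directly. The following is a minimal sketch, assuming SciPy is installed; the variable names (X_sparse, dense_clf, sparse_clf) are illustrative and not part of the original guide. It stores the same toy data once as a NumPy array and once as a SciPy sparse matrix and fits a RandomForestClassifier on each:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.ensemble import RandomForestClassifier

# Samples matrix: 2 samples (rows) x 3 features (columns)
X = np.array([[1, 2, 3],
              [11, 12, 13]])
# Target vector: one class label per row of X
y = np.array([0, 1])

print(X.shape)  # (2, 3) -> (n_samples, n_features)
print(y.shape)  # (2,)   -> (n_samples,)

# The same data stored as a SciPy sparse matrix (illustrative only:
# sparse storage pays off when most entries are 0, which is not the case here)
X_sparse = csr_matrix(X)

dense_clf = RandomForestClassifier(random_state=0).fit(X, y)
sparse_clf = RandomForestClassifier(random_state=0).fit(X_sparse, y)

dense_pred = dense_clf.predict(X)
sparse_pred = sparse_clf.predict(X_sparse)
print(dense_pred.shape, sparse_pred.shape)
```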

### Once the estimator is fitted, it can be used for predicting target values of new data. You don’t need to re-train the estimator:

In [2]:
clf.predict(X)  # predict classes of the training data

array([0, 1])

In [3]:
clf.predict([[4, 5, 6], [14, 15, 16]])  # predict classes of new data

array([0, 1])
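
### As a small illustration of the fit-then-predict order, the sketch below calls predict before and after fit. It assumes the usual scikit-learn behaviour of raising NotFittedError for an unfitted estimator; the flag name raised is only for illustration:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.exceptions import NotFittedError

X = [[1, 2, 3], [11, 12, 13]]
y = [0, 1]

clf = RandomForestClassifier(random_state=0)

# Predicting before fit fails: the estimator has learned nothing yet
try:
    clf.predict(X)
    raised = False
except NotFittedError:
    raised = True
print(raised)

# After a single fit, predict can be called repeatedly without re-training
clf.fit(X, y)
first = clf.predict([[4, 5, 6]])
second = clf.predict([[4, 5, 6]])
print((first == second).all())
```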

## Transformers and pre-processors

### Machine learning workflows are often composed of different parts. A typical pipeline consists of a pre-processing step that transforms or imputes the data, and a final predictor that predicts target values.
### In scikit-learn, pre-processors and transformers follow the same API as the estimator objects (they actually all inherit from the same BaseEstimator class). The transformer objects don’t have a predict method but rather a transform method that outputs a newly transformed sample matrix X:

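### scikit-learn data preprocessing: scale adjustment (scalers)
### StandardScaler: rescales each feature to mean 0 and variance 1; supports fit, transform, and fit_transform
### RobustScaler: uses the median and quartiles instead of the mean and variance, (X - Q2) / (Q3 - Q1), [Q2 = second quartile (median), Q3 = third quartile, Q1 = first quartile]
### MinMaxScaler: (X - Xmin) / (Xmax - Xmin); all values lie between 0 and 1; a normalization method that preserves the shape of the original data distribution
### Normalizer: rescales every feature vector to length 1 (like projecting onto a circle of radius 1); used when the length of the feature vector does not matter and only the direction or angle of the data is important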
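
### The four formulas above can be checked against hand-computed values. This is a minimal sketch on hypothetical toy data, assuming each scaler's default settings (for RobustScaler, the 25th to 75th percentile range); the *_manual names are illustrative:

```python
import numpy as np
from sklearn.preprocessing import (MinMaxScaler, Normalizer,
                                   RobustScaler, StandardScaler)

# Toy data (hypothetical): the first column contains an outlier (100)
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 40.0],
              [100.0, 50.0]])

standard = StandardScaler().fit_transform(X)
robust = RobustScaler().fit_transform(X)
minmax = MinMaxScaler().fit_transform(X)
unit = Normalizer().fit_transform(X)

# Hand-computed versions of the same formulas (column-wise statistics)
q1, q2, q3 = np.percentile(X, [25, 50, 75], axis=0)
standard_manual = (X - X.mean(axis=0)) / X.std(axis=0)
robust_manual = (X - q2) / (q3 - q1)
minmax_manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
# Normalizer works per row (per sample), not per column
unit_manual = X / np.linalg.norm(X, axis=1, keepdims=True)

print(np.allclose(standard, standard_manual))
print(np.allclose(robust, robust_manual))
print(np.allclose(minmax, minmax_manual))
print(np.allclose(unit, unit_manual))
```

### Note that the first three scalers compute their statistics per column (feature), while Normalizer rescales each row (sample) independently.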

In [5]:
from sklearn.preprocessing import StandardScaler
X = [[0, 15],
     [1, -10]]
# scale data according to computed scaling values
StandardScaler().fit(X).transform(X)

array([[-1.,  1.],
       [ 1., -1.]])

## Pipelines: chaining pre-processors and estimators

### Transformers and estimators (predictors) can be combined together into a single unifying object: a Pipeline. The pipeline offers the same API as a regular estimator: it can be fitted and used for prediction with fit and predict. As we will see later, using a pipeline will also help prevent data leakage, i.e. disclosing some testing data in your training data.

### In the following example, we load the Iris dataset, split it into train and test sets, and compute the accuracy score of a pipeline on the test data:

### The logistic model formula guarantees that, whatever value in (-∞, ∞) the independent variable takes, the dependent variable (the result) always lies strictly between 0 and 1

### (e^x) / (1+e^x)
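
### The formula above can be evaluated numerically. The following is a minimal sketch using NumPy; the helper name logistic is illustrative and is not a scikit-learn function:

```python
import numpy as np


def logistic(x):
    """Logistic (sigmoid) function: e^x / (1 + e^x)."""
    x = np.asarray(x, dtype=float)
    return np.exp(x) / (1.0 + np.exp(x))


# Five evenly spaced inputs from -10 to 10
xs = np.linspace(-10, 10, 5)
ps = logistic(xs)

print(logistic(0.0))  # 0.5, the midpoint of the curve
print(ps)             # increasing values, all between 0 and 1
```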

### A Pipeline is a class that handles a multi-step machine learning process, including preprocessing, model creation, and training, as a single object

### random_state=0 sets the random seed; using the same seed value reproduces the same random results on every run

In [6]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# create a pipeline object
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression()
)
# load the iris dataset and split it into train and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# fit the whole pipeline
pipe.fit(X_train, y_train)

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('logisticregression', LogisticRegression())])

In [7]:
# we can now use it like any other estimator
# (accuracy_score expects y_true first, then y_pred)
accuracy_score(y_test, pipe.predict(X_test))

0.9736842105263158
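
### To make the data-leakage point concrete, the sketch below compares the pipeline with a manual equivalent in which the scaler is fitted on the training split only. The names scaler, model, and manual_pred are illustrative; the comparison assumes default settings for both StandardScaler and LogisticRegression:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pipeline: scaling statistics are learned inside fit, from X_train only
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)

# Manual equivalent: fit the scaler on the training split only,
# then reuse its mean and variance to transform the test split
scaler = StandardScaler().fit(X_train)
model = LogisticRegression().fit(scaler.transform(X_train), y_train)
manual_pred = model.predict(scaler.transform(X_test))

same = np.array_equal(pipe.predict(X_test), manual_pred)
print(same)

# Leaky variant to avoid: StandardScaler().fit(X) would also use test rows
# when computing the mean and variance.
```

### The pipeline does the careful version automatically, which is why it is harder to leak test data by accident when all preprocessing lives inside it.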

## Model evaluation

### Fitting a model to some data does not entail that it will predict well on unseen data. This needs to be directly evaluated. We have just seen the train_test_split helper that splits a dataset into train and test sets, but scikit-learn provides many other tools for model evaluation, in particular for cross-validation (evaluating estimator performance).

### Here we briefly show how to perform a 5-fold cross-validation procedure, using the cross_validate helper. Note that it is also possible to manually iterate over the folds, use different data splitting strategies, and use custom scoring functions. Please refer to the User Guide for more details:

### make_regression is a function in the datasets subpackage that generates synthetic data for testing regression analysis

In [10]:
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
X, y = make_regression(n_samples=1000, random_state=0)
lr = LinearRegression()
result = cross_validate(lr, X, y)  # defaults to 5-fold CV (cross-validation)
result['test_score']  # r_squared score is high because dataset is easy

array([1., 1., 1., 1., 1.])

In [13]:
result

{'fit_time': array([0.00400114, 0.00400066, 0.00300074, 0.00300074, 0.0040009 ]),
 'score_time': array([0.0010004, 0.       , 0.       , 0.       , 0.       ]),
 'test_score': array([1., 1., 1., 1., 1.])}
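
### The options mentioned above (manual iteration over folds, a different splitting strategy, and other scorers) might look as follows. This is a sketch under stated assumptions: the noise level and dataset size are arbitrary choices, and the scorers are two of scikit-learn's built-in named scorers rather than a custom function:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_validate

# Noisy synthetic data (noise=10.0 is an arbitrary choice for illustration)
X, y = make_regression(n_samples=200, n_features=10, noise=10.0,
                       random_state=0)

# A different splitting strategy: shuffled 5-fold
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Manual iteration over the folds
manual_scores = []
for train_idx, test_idx in cv.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    manual_scores.append(r2_score(y[test_idx], model.predict(X[test_idx])))
manual_scores = np.array(manual_scores)

# The same folds through cross_validate, with two named scorers
result = cross_validate(LinearRegression(), X, y, cv=cv,
                        scoring=("r2", "neg_mean_absolute_error"))

print(np.allclose(manual_scores, result["test_r2"]))
print(sorted(result.keys()))
```

### With several scorers, the result dictionary holds one test_<name> entry per scorer instead of a single test_score entry.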

## Automatic parameter searches

### All estimators have parameters (often called hyper-parameters in the literature) that can be tuned. The generalization power of an estimator often critically depends on a few parameters. For example, a RandomForestRegressor has an n_estimators parameter that determines the number of trees in the forest, and a max_depth parameter that determines the maximum depth of each tree. Quite often, it is not clear what the exact values of these parameters should be since they depend on the data at hand.
### Scikit-learn provides tools to automatically find the best parameter combinations (via cross-validation). In the following example, we randomly search over the parameter space of a random forest with a RandomizedSearchCV object. When the search is over, the RandomizedSearchCV behaves as a RandomForestRegressor that has been fitted with the best set of parameters. Read more in the User Guide:

In [16]:
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split
from scipy.stats import randint
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# define the parameter space that will be searched over
param_distributions = {'n_estimators': randint(1, 5),
                       'max_depth': randint(5, 10)}
# now create a searchCV object and fit it to the data
search = RandomizedSearchCV(estimator=RandomForestRegressor(random_state=0),
                            n_iter=5,
                            param_distributions=param_distributions,
                            random_state=0)
search.fit(X_train, y_train)

RandomizedSearchCV(estimator=RandomForestRegressor(random_state=0), n_iter=5,
                   param_distributions={'max_depth': <scipy.stats._distn_infrastructure.rv_discrete_frozen object at 0x000001B8B6C665E0>,
                                        'n_estimators': <scipy.stats._distn_infrastructure.rv_discrete_frozen object at 0x000001B8B1AD49D0>},
                   random_state=0)

In [17]:
search.best_params_

{'max_depth': 9, 'n_estimators': 4}

In [18]:
# the search object now acts like a normal random forest estimator
# with max_depth=9 and n_estimators=4
search.score(X_test, y_test)

0.735363411343253
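
### The fitted search object can also be inspected. The sketch below repeats the search on a small synthetic dataset (an assumption, chosen to avoid downloading the California housing data) and looks at best_estimator_ and cv_results_; the names best and delegated are illustrative:

```python
import numpy as np
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Small synthetic dataset (an assumption, used instead of California housing)
X, y = make_regression(n_samples=300, n_features=8, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# randint(low, high) samples integers from low up to high - 1
param_distributions = {"n_estimators": randint(1, 5),
                       "max_depth": randint(5, 10)}

search = RandomizedSearchCV(RandomForestRegressor(random_state=0),
                            param_distributions=param_distributions,
                            n_iter=5, random_state=0)
search.fit(X_train, y_train)

# The best candidate, refitted on the whole training set
best = search.best_estimator_
print(search.best_params_)
print(len(search.cv_results_["params"]))  # one entry per sampled candidate

# predict on the search object delegates to the refitted best estimator
delegated = np.allclose(search.predict(X_test), best.predict(X_test))
print(delegated)
```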