Dataset to use: `melb_data.csv`

## 1. Data Preprocessing

### Task

- data loading

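- Categorical variables

    - Use only the categorical features that have 40 or fewer distinct classes, and encode them with one-hot encoding.
        - One-hot encoding settings:
            - handle_unknown='ignore'
            - sparse=False

    - If a feature contains `na` values, replace them with the most frequent value of that feature.

- Numerical variables
    - If a feature contains `na` values, replace them with the mean of that feature. Wherever a value was imputed, also record that position as an additional feature (a missing-value indicator) and use it for training.


If the steps above are carried out correctly, the `shape` of the finished dataframe is `(13580, 49)`.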
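
As a rough illustration of the categorical rule above, the following sketch imputes with the most frequent class and then one-hot encodes. The toy arrays and variable names are hypothetical and not taken from `melb_data.csv`. The encoder output is converted with `.toarray()` instead of passing `sparse=False`, only so that the sketch does not depend on the installed scikit-learn version.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy column: two known classes and one missing value.
train = np.array([["h"], ["u"], ["h"], [np.nan]], dtype=object)
# "t" never appears in the training data.
test = np.array([["t"]], dtype=object)

cat_pipe = make_pipeline(
    SimpleImputer(strategy="most_frequent"),   # nan -> "h" (most frequent class)
    OneHotEncoder(handle_unknown="ignore"),    # unseen classes -> all-zero row
)

train_encoded = cat_pipe.fit_transform(train).toarray()
test_encoded = cat_pipe.transform(test).toarray()

print(train_encoded)
print(test_encoded)
```

With `handle_unknown='ignore'`, a class that was not seen during `fit` does not raise an error; it is encoded as a row of zeros.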

In [None]:
!pwd

/content


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [None]:
df = pd.read_csv('/content/drive/MyDrive/data/melb_data.csv')

y = df.Price
X = df.drop(['Price'], axis="columns")

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=701)

In [None]:
from sklearn.compose import make_column_selector

numeric_selector = make_column_selector(dtype_include="number")
cate_selector = make_column_selector(dtype_include='object')

cate_cols = cate_selector(X_train)
numeric_cols = numeric_selector(X_train)

select_cols = numeric_cols + cate_cols

X_train = X_train[select_cols]
X_test = X_test[select_cols]
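
The task asks for categorical features with 40 or fewer classes, while the selector above picks every `object` column. One possible way to filter by class count is sketched below. The frame `toy`, the threshold `max_classes`, and the name `low_card_cols` are assumptions made for illustration; on the real data the threshold would be 40.

```python
import pandas as pd

# Hypothetical toy frame standing in for X_train.
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "Type": ["h", "u", "h", "t"],                  # 3 distinct classes
    "Address": ["a St", "b St", "c St", "d St"],   # 4 distinct classes
})

max_classes = 3  # the task uses 40; 3 keeps the toy example small

# Non-numeric columns are treated as categorical in this sketch.
categorical_cols = toy.select_dtypes(exclude="number").columns

# Keep only the columns whose number of distinct classes is within the limit.
low_card_cols = [c for c in categorical_cols if toy[c].nunique() <= max_classes]

print(low_card_cols)
```

Note that `nunique()` ignores missing values by default, so `na` entries do not count as a class.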

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer, MissingIndicator
from sklearn.compose import ColumnTransformer

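# Categorical branch: fill missing values with the most frequent class, then one-hot encode.
# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; use that name on newer versions.
cate_pipe = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore", sparse=False)
)

preprocessor = ColumnTransformer(transformers=[
        # Only 'CouncilArea' is routed through the categorical branch here.
        ('impute&onehot', cate_pipe, ['CouncilArea']),
        # Numeric columns: mean imputation plus indicator columns marking imputed positions.
        ('impute_numeric', SimpleImputer(strategy="mean"), numeric_selector),
        ('numeric_impute_indicator', MissingIndicator(), numeric_selector)
])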
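
To see what the two numeric branches contribute, here is a small sketch on a hypothetical array. It uses `SimpleImputer(add_indicator=True)`, which should produce the same columns as combining `SimpleImputer` and `MissingIndicator` on the same features: the imputed values first, followed by one indicator column per feature that had missing values during `fit`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data: only the first feature has a missing value.
X_toy = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, 30.0],
])

imputer = SimpleImputer(strategy="mean", add_indicator=True)
X_out = imputer.fit_transform(X_toy)

# Columns: feature 0 (imputed), feature 1, indicator for feature 0.
print(X_out)
print(X_out.shape)
```

Because indicator columns are created only for features that actually contain missing values, the final column count depends on the data the transformer was fitted on.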

In [None]:
X_processed = preprocessor.fit_transform(X, y)
X_processed

array([[0., 0., 0., ..., 0., 1., 1.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

## 2. Find the best Decision Tree using cross-validation

- Hyperparameters to consider
    - max_leaf_nodes
    - criterion: 'mse', 'mae'

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeRegressor

In [None]:
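# Build the full pipeline first so that its parameter names can be listed.
model = make_pipeline(preprocessor, DecisionTreeRegressor(random_state=0))

for k in model.get_params().keys():
    print(k)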

memory
steps
verbose
columntransformer
decisiontreeregressor
columntransformer__n_jobs
columntransformer__remainder
columntransformer__sparse_threshold
columntransformer__transformer_weights
columntransformer__transformers
columntransformer__verbose
columntransformer__impute&onehot
columntransformer__impute_numeric
columntransformer__numeric_impute_indicator
columntransformer__impute&onehot__memory
columntransformer__impute&onehot__steps
columntransformer__impute&onehot__verbose
columntransformer__impute&onehot__simpleimputer
columntransformer__impute&onehot__onehotencoder
columntransformer__impute&onehot__simpleimputer__add_indicator
columntransformer__impute&onehot__simpleimputer__copy
columntransformer__impute&onehot__simpleimputer__fill_value
columntransformer__impute&onehot__simpleimputer__missing_values
columntransformer__impute&onehot__simpleimputer__strategy
columntransformer__impute&onehot__simpleimputer__verbose
columntransformer__impute&onehot__onehotencoder__categories
colu
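
The names printed above follow the `<step name>__<parameter>` convention, where `make_pipeline` derives each step name from the lowercased class name. The sketch below shows this on a hypothetical two-step pipeline (`toy_pipe` is not part of the notebook); these are the keys a parameter grid has to use.

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeRegressor

# Hypothetical minimal pipeline, used only to inspect the naming.
toy_pipe = make_pipeline(SimpleImputer(), DecisionTreeRegressor(random_state=0))

# Step names are generated automatically from the class names.
step_names = [name for name, _ in toy_pipe.steps]

# Parameters of the tree are addressed as "decisiontreeregressor__<param>".
tree_keys = [k for k in toy_pipe.get_params()
             if k.startswith("decisiontreeregressor__")]

print(step_names)
print(tree_keys)
```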

In [None]:
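# Grid keys must use the step name that make_pipeline generates
# ("decisiontreeregressor"), followed by "__" and the parameter name.
# Note: scikit-learn 1.0 renamed "mse"/"mae" to "squared_error"/"absolute_error".
param_grid = {
    "decisiontreeregressor__max_leaf_nodes": [10, 20, 30, 40, 50],
    "decisiontreeregressor__criterion": ["mse", "mae"]
}
# Split the raw X again; the pipeline performs all preprocessing itself.
X_train, X_test, y_train, y_test = train_test_split(X, y)

model = make_pipeline(preprocessor, DecisionTreeRegressor(random_state=0))
grid = GridSearchCV(model, param_grid, cv=5, return_train_score=True)

grid.fit(X_train, y_train)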

GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('columntransformer',
                                        ColumnTransformer(n_jobs=None,
                                                          remainder='drop',
                                                          sparse_threshold=0.3,
                                                          transformer_weights=None,
                                                          transformers=[('impute&onehot',
                                                                         Pipeline(memory=None,
                                                                                  steps=[('simpleimputer',
                                                                                          SimpleImputer(add_indicator=False,
                                                                                                        copy=True,
               

In [None]:
grid.best_score_

0.6338096597687615
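
Since no `scoring` argument is passed to `GridSearchCV`, `best_score_` is the mean cross-validated R² of the best parameter combination, which is the default score for regressors. The sketch below illustrates this on synthetic data and also evaluates the refitted best model on a held-out split. The data and the names `search`, `best_cv_r2`, and `test_r2` are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the housing data.
X_syn, y_syn = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=0)

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    {"max_leaf_nodes": [5, 10, 20]},
    cv=5,
)
search.fit(X_tr, y_tr)

best_cv_r2 = search.best_score_      # mean R^2 over the 5 validation folds
best_params = search.best_params_    # parameter combination that achieved it
test_r2 = search.score(X_te, y_te)   # R^2 of the refitted best model on unseen data

print(best_params, best_cv_r2, test_r2)
```

The cross-validated score is computed only on the training split, so the held-out score gives a separate check of how the selected tree generalizes.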

In [None]:
pd.DataFrame(grid.cv_results_).loc[:,["mean_train_score","mean_test_score"]]

   mean_train_score  mean_test_score
0          0.527843         0.522049
1          0.622758         0.598927
2          0.662435         0.615089
3          0.690533         0.626382
4          0.713762         0.63381
5          0.503678         0.49708
6          0.587489         0.578085
7          0.631032         0.604562
8          0.65589          0.617114
9          0.675896         0.627076
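
A score table like the one above is easier to interpret when the parameter values sit next to the scores. The following sketch, again on synthetic data with hypothetical names (`search`, `summary`, `gap`), keeps the parameter column, adds the gap between train and test scores, and sorts by rank.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the housing data.
X_syn, y_syn = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    {"max_leaf_nodes": [5, 10, 20]},
    cv=5,
    return_train_score=True,  # needed for the mean_train_score column
)
search.fit(X_syn, y_syn)

results = pd.DataFrame(search.cv_results_)
summary = results[
    ["param_max_leaf_nodes", "mean_train_score", "mean_test_score", "rank_test_score"]
].copy()
# A large gap between train and test scores suggests overfitting.
summary["gap"] = summary["mean_train_score"] - summary["mean_test_score"]
summary = summary.sort_values("rank_test_score")

print(summary)
```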


In [None]:
grid.cv_results_

{'mean_fit_time': array([0.0588387 , 0.06333523, 0.06579275, 0.06807261, 0.06926956,
        4.57528982, 4.78661938, 4.89463739, 4.9308815 , 4.96790595]),
 'mean_score_time': array([0.00857439, 0.00853186, 0.0084312 , 0.00850887, 0.00876718,
        0.00863938, 0.00912151, 0.00897298, 0.00862508, 0.00865936]),
 'mean_test_score': array([0.52204915, 0.59892701, 0.61508872, 0.62638206, 0.63380966,
        0.49707986, 0.57808506, 0.60456183, 0.6171142 , 0.62707629]),
 'mean_train_score': array([0.52784272, 0.62275792, 0.66243504, 0.69053293, 0.71376243,
        0.50367839, 0.58748897, 0.6310324 , 0.65589021, 0.67589591]),
 'param_decisiontreeregressor__criterion': masked_array(data=['mse', 'mse', 'mse', 'mse', 'mse', 'mae', 'mae', 'mae',
                    'mae', 'mae'],
              mask=[False, False, False, False, False, False, False, False,
                    False, False],
        fill_value='?',
             dtype=object),
 'param_decisiontreeregressor__max_leaf_nodes': masked_ar