<font color="#CC3D3D"><p>
# ML Pipeline: Building a Custom Pipeline

<font color="blue"><p>
#### Model development procedure
1. Numeric features
 - Missing-value imputation: SimpleImputer(strategy=`???`)
 - Outlier handling: FunctionTransformer()
 - Scaling: StandardScaler()
2. Categorical features
 - Missing-value imputation: SimpleImputer(strategy="most_frequent")
 - Encoding: OneHotEncoder(handle_unknown="ignore")
3. Common
 - Feature Selection: SelectPercentile(percentile=`???`)
 - Modeling: LogisticRegression(C=`???`)
 - Hyperparameter Optimization: GridSearchCV()

In [1]:
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer  # still experimental 
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import StandardScaler, PowerTransformer 
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
from category_encoders import TargetEncoder  # compatible with scikit-learn
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, SelectPercentile
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import roc_auc_score
from sklearn import set_config

#### Load data

In [2]:
data = pd.read_csv('allstate_train.csv')
data.head()

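| Unnamed: 0 | customer_ID | shopping_pt | record_type | day | time | state | location | group_size | homeowner | car_age | ... | C_previous | duration_previous | A | B | C | D | E | F | G | cost |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10000000 | 1 | 0 | 0 | 08:35 | IN | 10001 | 2 | 0 | 2 | ... | 1.0 | 2.0 | 1 | 0 | 2 | 2 | 1 | 2 | 2 | 633 |
| 1 | 10000000 | 2 | 0 | 0 | 08:38 | IN | 10001 | 2 | 0 | 2 | ... | 1.0 | 2.0 | 1 | 0 | 2 | 2 | 1 | 2 | 1 | 630 |
| 2 | 10000000 | 3 | 0 | 0 | 08:38 | IN | 10001 | 2 | 0 | 2 | ... | 1.0 | 2.0 | 1 | 0 | 2 | 2 | 1 | 2 | 1 | 630 |
| 3 | 10000000 | 4 | 0 | 0 | 08:39 | IN | 10001 | 2 | 0 | 2 | ... | 1.0 | 2.0 | 1 | 0 | 2 | 2 | 1 | 2 | 1 | 630 |
| 4 | 10000000 | 5 | 0 | 0 | 11:55 | IN | 10001 | 2 | 0 | 2 | ... | 1.0 | 2.0 | 1 | 0 | 2 | 2 | 1 | 2 | 1 | 630 |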


#### Separate numeric/categorical features & split into train/test data

In [3]:
numeric_features = ['group_size','car_age','age_oldest','age_youngest','duration_previous','cost']
categorical_features = ['day','homeowner','car_value','risk_factor','married_couple','C_previous','state','shopping_pt']

X_train, X_test, y_train, y_test = train_test_split(data[numeric_features+categorical_features], 
                                                    data['record_type'], test_size=0.9, 
                                                    stratify=data['record_type'], random_state=0)
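
The split above uses `test_size=0.9`, so only 10% of the rows are used for training, and `stratify` keeps the class ratio of `record_type` the same in both parts. A minimal sketch of that behavior on toy labels (the arrays below are hypothetical, not the Allstate data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 rows of class 0, 20 rows of class 1
y = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.9, stratify=y, random_state=0
)

# With stratify=y, the share of class 1 is preserved in both parts
train_ratio = y_tr.mean()
test_ratio = y_te.mean()
print(len(y_tr), train_ratio)
print(len(y_te), test_ratio)
```

Without `stratify`, a training part this small could easily end up with a noticeably different class ratio than the full data.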

#### Build the pipeline: use ColumnTransformer to process numeric and categorical features differently

In [4]:
# Simplest approach to outliers: clip each column to its 5th-95th percentile range
def remove_outlier(X):  
    df = pd.DataFrame(X)
    return df.apply(lambda x: x.clip(x.quantile(.05), x.quantile(.95)), axis=0).values
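
Note that `remove_outlier` is stateless: it recomputes the quantiles on whatever data it receives, so the test data is clipped with its own quantiles rather than those of the training data. If training-set bounds are preferred, one option is a small custom transformer. The class below is a hypothetical sketch (`QuantileClipper` is not part of scikit-learn) and assumes missing values were already imputed:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class QuantileClipper(BaseEstimator, TransformerMixin):
    """Hypothetical helper: clip each column to quantiles learned in fit()."""

    def __init__(self, lower=0.05, upper=0.95):
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Learn per-column bounds from the training data only
        self.lower_ = np.quantile(X, self.lower, axis=0)
        self.upper_ = np.quantile(X, self.upper, axis=0)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        # Reuse the training bounds on any later data
        return np.clip(X, self.lower_, self.upper_)

train = np.arange(1, 101, dtype=float).reshape(-1, 1)  # values 1..100
new = np.array([[-50.0], [50.0], [500.0]])

clipper = QuantileClipper().fit(train)
clipped = clipper.transform(new)
print(clipped.ravel())
```

Such a class could take the place of `FunctionTransformer(remove_outlier)` in a pipeline step, since it follows the usual `fit`/`transform` interface.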

In [5]:
numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("outlier", FunctionTransformer(remove_outlier)), # wrap a plain function as a transformer to get a preprocessor that sklearn does not provide
        ("scaler", StandardScaler()),
    ]
)

categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")), 
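        ("encoder", OneHotEncoder(handle_unknown="ignore", sparse_output=False)), # dense output; sparse_output=True returns a sparse matrix, which saves memory (the parameter was named `sparse` before scikit-learn 1.2)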
    ]
)

column_transformer = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

preprocessor = Pipeline(
    steps=[
        ("column", column_transformer), 
        ("selector", SelectPercentile(percentile=50)),
    ]
)

model = Pipeline(
    steps=[
        ("preprocessor", preprocessor), 
        ("classifier", LogisticRegression()),
    ]
)

In [11]:
set_config(display="diagram")  # To view the text pipeline, change to display='text'.
model

#### Train the model through the pipeline

In [7]:
model.fit(X_train, y_train)
print("model score: %.3f" % roc_auc_score(y_test, model.predict_proba(X_test)[:,1]))

model score: 0.848
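
The "model score" printed above is the ROC AUC, computed from the second column of `predict_proba`, i.e. the predicted probability of the positive class. As a rough illustration on made-up numbers, the AUC equals the fraction of (positive, negative) pairs in which the positive example receives the higher score:

```python
from itertools import product
from sklearn.metrics import roc_auc_score

# Made-up labels and positive-class probabilities, for illustration only
y_true = [0, 0, 1, 1]
p_pos = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_score(y_true, p_pos)

# Manual check: count pairs where the positive outranks the negative
pos = [p for p, t in zip(p_pos, y_true) if t == 1]
neg = [p for p, t in zip(p_pos, y_true) if t == 0]
pairs = list(product(pos, neg))
manual_auc = sum(p > n for p, n in pairs) / len(pairs)

print(auc, manual_auc)
```

Because only the ranking matters, the metric needs probabilities (or scores) rather than the hard labels returned by `predict`.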


#### Hyperparameter optimization through the pipeline

In [8]:
%%time

param_grid = {
    "preprocessor__column__num__imputer__strategy": ["mean", "median"],
    "preprocessor__selector__percentile": range(50,100,20),
    "classifier__C": [0.1, 1.0, 10, 100],
}

grid_search = GridSearchCV(model, param_grid, scoring='roc_auc', cv=3)
grid_search.fit(X_train, y_train)

CPU times: total: 6min 14s
Wall time: 2min 48s
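
The keys of `param_grid` follow the nesting of the pipeline: step names are joined with `__` down to the parameter itself. The sketch below uses a reduced, hypothetical two-level pipeline to show how valid names can be listed with `get_params()`, then counts the candidates in the grid from this notebook with `ParameterGrid`:

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import Pipeline

# Reduced stand-in for the nested pipeline (not the full model above)
inner = Pipeline(steps=[("imputer", SimpleImputer(strategy="median"))])
outer = Pipeline(steps=[("preprocessor", inner), ("classifier", LogisticRegression())])

# Every tunable parameter appears under its "__"-joined path
names = set(outer.get_params().keys())
print("preprocessor__imputer__strategy" in names)
print("classifier__C" in names)

# Same grid as in the notebook; ParameterGrid only enumerates combinations
param_grid = {
    "preprocessor__column__num__imputer__strategy": ["mean", "median"],
    "preprocessor__selector__percentile": list(range(50, 100, 20)),
    "classifier__C": [0.1, 1.0, 10, 100],
}
n_candidates = len(ParameterGrid(param_grid))
print(n_candidates)
```

With `cv=3`, each candidate is fitted three times, and `GridSearchCV` refits the best one on the full training data by default, which helps explain the runtime reported above.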


In [9]:
print(f"Best params: {grid_search.best_params_}")
print(f"Internal CV score: {grid_search.best_score_:.3f}")
print("Test score from grid search: %.3f" % roc_auc_score(y_test, grid_search.predict_proba(X_test)[:,1]))

Best params: {'classifier__C': 100, 'preprocessor__column__num__imputer__strategy': 'mean', 'preprocessor__selector__percentile': 90}
Internal CV score: 0.850
Test score from grid search: 0.848
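
Beyond `best_params_`, the fitted search object keeps the score of every candidate in `cv_results_`, which is useful for judging how much the hyperparameters matter. A minimal sketch on synthetic data (the dataset and the small pipeline are assumptions made for illustration, not the model above):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

pipe = Pipeline(steps=[("scaler", StandardScaler()), ("classifier", LogisticRegression())])
search = GridSearchCV(pipe, {"classifier__C": [0.1, 1.0, 10]}, scoring="roc_auc", cv=3)
search.fit(X, y)

# One row per candidate, sorted from best to worst
results = pd.DataFrame(search.cv_results_)
cols = ["param_classifier__C", "mean_test_score", "std_test_score", "rank_test_score"]
table = results[cols].sort_values("rank_test_score")
print(table)
```

In the run above, the best `C` (100) and the best percentile (90) are both the largest values in their grids, so it may be worth checking this table and extending the grid in that direction.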


<font color="#CC3D3D"><p>
# End