<a href="https://colab.research.google.com/github/hyonnys/tp1/blob/main/tp1_RegressionData_modelpipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Part 0. Data Loading**

In [1]:
!pip install category_encoders

Collecting category_encoders
  Downloading category_encoders-2.6.1-py2.py3-none-any.whl (81 kB)
Installing collected packages: category_encoders
Successfully installed category_encoders-2.6.1


In [2]:
import os
import pandas as pd
import numpy as np
from category_encoders import OrdinalEncoder, OneHotEncoder
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split, GridSearchCV, KFold, cross_val_score, RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from xgboost.sklearn import XGBRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import DotProduct, ConstantKernel
from sklearn.metrics import r2_score as r2, mean_squared_error as mse, mean_absolute_error as mae
import datetime as dt
from tqdm import tqdm
import time

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
abalone = pd.read_csv('/content/drive/MyDrive/regression_data.csv')
abalone.head()

Unnamed: 0,Sex,Length,Diameter,Height,Whole weight,Shucked weight,Viscera weight,Shell weight,Rings
0,M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15
1,M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7
2,F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
3,M,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10
4,I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7


In [5]:
abalone.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4177 entries, 0 to 4176
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Sex             4177 non-null   object 
 1   Length          4177 non-null   float64
 2   Diameter        4177 non-null   float64
 3   Height          4177 non-null   float64
 4   Whole weight    4177 non-null   float64
 5   Shucked weight  4177 non-null   float64
 6   Viscera weight  4177 non-null   float64
 7   Shell weight    4177 non-null   float64
 8   Rings           4177 non-null   int64  
dtypes: float64(7), int64(1), object(1)
memory usage: 293.8+ KB


In [6]:
abalone.describe()

Unnamed: 0,Length,Diameter,Height,Whole weight,Shucked weight,Viscera weight,Shell weight,Rings
count,4177.0,4177.0,4177.0,4177.0,4177.0,4177.0,4177.0,4177.0
mean,0.523992,0.407881,0.139516,0.828742,0.359367,0.180594,0.238831,9.933684
std,0.120093,0.09924,0.041827,0.490389,0.221963,0.109614,0.139203,3.224169
min,0.075,0.055,0.0,0.002,0.001,0.0005,0.0015,1.0
25%,0.45,0.35,0.115,0.4415,0.186,0.0935,0.13,8.0
50%,0.545,0.425,0.14,0.7995,0.336,0.171,0.234,9.0
75%,0.615,0.48,0.165,1.153,0.502,0.253,0.329,11.0
max,0.815,0.65,1.13,2.8255,1.488,0.76,1.005,29.0


# **Part 1. Data Engineering**<br>
- Outlier Drop :  `Height (> 0.4)`

In [7]:
# Data engineering
# Inspect the Height outliers
outlier_height = abalone[abalone['Height'] > 0.4]
outlier_height

Unnamed: 0,Sex,Length,Diameter,Height,Whole weight,Shucked weight,Viscera weight,Shell weight,Rings
1417,M,0.705,0.565,0.515,2.21,1.1075,0.4865,0.512,10
2051,F,0.455,0.355,1.13,0.594,0.332,0.116,0.1335,8


In [8]:
# Remove the Height outliers and verify the result
abalone.drop(outlier_height.index, inplace=True)
abalone[abalone['Height']>0.4]

Unnamed: 0,Sex,Length,Diameter,Height,Whole weight,Shucked weight,Viscera weight,Shell weight,Rings


# **Part 2. Data Preprocessing**<br>
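#### 💡 Numerical features<br>
Scale Standardization<br>
- MinMaxScaler : rescales every value into the range 0 to 1 (sensitive to outliers, because the observed minimum and maximum define the range)
- StandardScaler : rescales each feature to mean 0 and standard deviation 1 (less sensitive to outliers)
<br>

Normalization <br>
- Log transform : used for distributions in which most values fall within a narrow range and a few values are much larger (all values must be non-negative; this notebook uses `np.log1p`, which also accepts zeros)
<br>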
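
As a minimal sketch of what these three transforms compute, the formulas can be applied by hand to a small made-up column that contains one large value. The array `x` is illustrative only and is not taken from the abalone data.

```python
import numpy as np

# Hypothetical feature column with one large value (illustrative only)
x = np.array([0.1, 0.2, 0.3, 0.4, 5.0])

# Min-max scaling: (x - min) / (max - min), the per-feature formula of MinMaxScaler
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization: (x - mean) / std, the per-feature formula of StandardScaler
# (StandardScaler uses the population standard deviation, i.e. ddof=0)
x_std = (x - x.mean()) / x.std()

# Log transform: log(1 + x), defined for x > -1, so zeros are allowed
x_log = np.log1p(x)

print(x_minmax.round(3))
print(x_std.round(3))
print(x_log.round(3))
```

pandas' `describe()` reports the sample standard deviation (ddof=1), which is why the scaled training columns later in this notebook show a standard deviation of about 1.00016 rather than exactly 1.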

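#### 💡 Categorical features<br>
- One Hot Encoding
  + Encodes each category as its own 0/1 indicator column, so the number of columns can grow.
  + Because it spreads one feature's information over several columns, it is inefficient for tree-based models.
- Ordinal Encoding
  + Encodes each category as a distinct integer.
  + The integers are treated as if they had a quantitative order, so it works better with tree-based models than with linear regression models.
<br>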
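
The difference between the two encodings can be sketched on a tiny made-up column. This example uses plain pandas rather than `category_encoders` (an assumption made to keep it dependency-light), and numbers the ordinal codes from 1 in order of first appearance to mirror the `Sex` values shown in the encoded frames later in this notebook.

```python
import pandas as pd

# Hypothetical categorical column using the same labels as `Sex`
sex = pd.Series(["M", "F", "I", "M"], name="Sex")

# One-hot encoding: one 0/1 indicator column per category
one_hot = pd.get_dummies(sex, prefix="Sex", dtype=int)

# Ordinal encoding: one integer per category, numbered by order of first appearance
codes, categories = pd.factorize(sex)
ordinal = pd.Series(codes + 1, name="Sex")

print(one_hot)
print(ordinal.tolist())
```

Each one-hot row contains exactly one 1, while the ordinal column stays a single feature whose integer order carries no real meaning for `Sex`.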


## **2-1. First Experiment**<br>

> ### *Data Preprocessing*<br>
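- Numerical features : **_Standard Scaling_**
- Categorical feature `Sex` : **_One Hot Encoding_** (built as `x_train_ohe`; the models below are fit on the ordinal-encoded `x_train_le`)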

<br>

> ### *Model Selection*<br>
- Linear Regression
- Ridge
- RandomForestRegressor
- Gradient Boosting Regressor
- KNeighbors Regressor
- SVM - SVR
<br>

> ### *Results*<br>
-
<br>

**📌 [sklearn.feature_selection.SelectKBest](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html?highlight=selectkbest#sklearn.feature_selection.SelectKBest)**

In [9]:
abalone1 = abalone.copy()

In [10]:
# Split the data into train and test sets
def data_split(df, target, features):
  y = df[target]
  x = df[features]
  x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=42)
  print(f'x_train: {x_train.shape}, y_train: {y_train.shape}')
  print(f'x_test: {x_test.shape}, y_test: {y_test.shape}')
  return x_train, x_test, y_train, y_test

In [11]:
target = 'Rings'
features1 = abalone1.columns.difference(['Rings'])
x_train, x_test, y_train, y_test = data_split(abalone1, target, features1)

x_train: (3131, 8), y_train: (3131,)
x_test: (1044, 8), y_test: (1044,)


In [12]:
numeric_feats = x_train.dtypes[x_train.dtypes != "object"].index
categoric_feats = x_train.dtypes[x_train.dtypes == "object"].index

In [13]:
sscaler = StandardScaler()
x_train[numeric_feats] = sscaler.fit_transform(x_train[numeric_feats])
# x_val[numeric_feats] = sscaler.transform(x_val[numeric_feats])
x_test[numeric_feats] = sscaler.transform(x_test[numeric_feats])

In [14]:
# Check the scaling result
# Means should be close to 0 and standard deviations close to 1
x_train[numeric_feats].describe().T[['mean', 'std']]

Unnamed: 0,mean,std
Diameter,1.089302e-16,1.00016
Height,-3.6310070000000005e-17,1.00016
Length,-1.815504e-16,1.00016
Shell weight,-6.808139000000001e-17,1.00016
Shucked weight,-1.860891e-16,1.00016
Viscera weight,2.4963180000000003e-17,1.00016
Whole weight,-3.177131e-17,1.00016
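
In this notebook the scaler is fit once on the whole training split, so the 5-fold scores computed later with `cross_val_score` are based on folds that already share scaling statistics. One possible alternative, sketched here on synthetic data, is to put the preprocessing inside a pipeline so that each fold fits it on its own training part. The frame `df`, its values, and the use of scikit-learn's `OneHotEncoder` instead of the `category_encoders` one are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the abalone frame (values are made up)
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    "Sex": rng.choice(["M", "F", "I"], size=n),
    "Length": rng.uniform(0.1, 0.8, size=n),
    "Height": rng.uniform(0.05, 0.25, size=n),
})
y = 20 * df["Length"] + rng.normal(0, 1, size=n)

# Scale numeric columns and one-hot encode `Sex` inside the pipeline,
# so each CV fold fits the preprocessing on its own training part only
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["Length", "Height"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex"]),
])
model = make_pipeline(preprocess, Ridge())

scores = cross_val_score(model, df, y, scoring="r2", cv=5)
print(scores.round(3), scores.mean().round(3))
```

For a dataset of this size the effect on the scores is likely small, but the pipeline form also keeps the train and test transformations in one object.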


In [15]:
# OneHotEncoder : categorical feature - Sex
ohe = OneHotEncoder()
x_train_ohe = ohe.fit_transform(x_train)
x_test_ohe = ohe.transform(x_test)

# Check the result: no object-dtype columns should remain
print((x_train_ohe.dtypes == 'object').sum())
print((x_test_ohe.dtypes == 'object').sum())

0
0


In [16]:
# Check the preprocessing result
x_train_ohe.head()

Unnamed: 0,Diameter,Height,Length,Sex_1,Sex_2,Sex_3,Shell weight,Shucked weight,Viscera weight,Whole weight
2667,0.414915,0.272779,0.501373,1,0,0,-0.109638,0.476648,0.195746,0.213957
4086,0.414915,0.403271,0.543362,0,1,0,-0.001912,0.035414,0.291626,0.083463
2552,-1.869461,-1.684603,-1.808021,0,0,1,-1.463391,-1.333311,-1.320068,-1.382547
1804,0.617971,0.142287,0.795296,0,1,0,0.116586,0.674753,0.821248,0.516742
247,-1.361822,-1.423619,-1.388131,0,0,1,-1.222803,-1.290539,-1.278976,-1.295892


In [17]:
# OrdinalEncoding
le = OrdinalEncoder()
x_train_le = le.fit_transform(x_train)
x_test_le = le.transform(x_test)

# Check the result: no object-dtype columns should remain
print((x_train_le.dtypes == 'object').sum())
print((x_test_le.dtypes == 'object').sum())

0
0


In [18]:
# Return the train, 5-fold cross-validated, and test R^2 scores
def print_score(model, X_train, y_train, X_test, y_test):

    train_score = np.round(model.score(X_train, y_train) , 3)
    val_score = np.round(np.mean(cross_val_score(model, X_train, y_train, scoring='r2', cv=5).round(3)),3)
    test_score = np.round(model.score(X_test, y_test),3)
    d = {'train':train_score, 'val':val_score, 'test':test_score}
    ser = pd.Series(data=d, index=['train', 'val', 'test'])

    return train_score, val_score, test_score, ser

In [19]:
# LinearRegression
ols = LinearRegression()
ols.fit(x_train_le, y_train)

# Compute the scores
ols_train, ols_val, ols_test, ols_ser = print_score(ols,x_train_le, y_train, x_test_le, y_test)

In [20]:
# Check the scores
ols_ser

train    0.530
val      0.524
test     0.565
dtype: float64

In [21]:
# Ridge
rg = Ridge()
rg.fit(x_train_le, y_train)

# Compute the scores
rg_train, rg_val, rg_test, rg_ser = print_score(rg,x_train_le, y_train, x_test_le, y_test)

In [22]:
# RandomForestRegressor_le
rf = RandomForestRegressor()
rf.fit(x_train_le, y_train)

# Compute the scores
rf_train, rf_val, rf_test, rf_ser = print_score(rf,x_train_le, y_train, x_test_le, y_test)

In [23]:
# SVM
svr = SVR()
svr.fit(x_train_le, y_train)

# Compute the scores
svr_train, svr_val, svr_test, svr_ser = print_score(svr,x_train_le, y_train, x_test_le, y_test)

In [24]:
# GradientBoostingRegressor
gb = GradientBoostingRegressor()
gb.fit(x_train_le, y_train)

# Compute the scores
gb_train, gb_val, gb_test, gb_ser = print_score(gb,x_train_le, y_train, x_test_le, y_test)


In [25]:
# KNeighborsRegressor
knn = KNeighborsRegressor(n_neighbors = 4)
knn.fit(x_train_le, y_train)

# Compute the scores
knn_train, knn_val, knn_test, knn_ser = print_score(knn,x_train_le, y_train, x_test_le, y_test)

In [26]:
frame = {'ols':ols_ser, 'ridge':rg_ser, 'rf':rf_ser, 'svr':svr_ser, 'gb':gb_ser, 'knn':knn_ser}
pd.DataFrame(frame)

Unnamed: 0,ols,ridge,rf,svr,gb,knn
train,0.53,0.53,0.933,0.544,0.661,0.676
val,0.524,0.525,0.53,0.529,0.537,0.438
test,0.565,0.565,0.565,0.572,0.573,0.499


## **2-2. Second Experiment**<br>
> ### *Data Preprocessing*<br>
- Numerical features : **_MinMax_** Scaling
- Categorical feature : **_Drop_**
<br>

> ### *Model Selection*<br>
- RandomForestRegressor
- Gradient Boosting Regressor
- KNeighbors Regressor
- SVM - SVR
<br>

> ### *Results*<br>
-
<br>

In [27]:
abalone2 = abalone.copy()

In [28]:
target = 'Rings'
features2 = abalone2.columns.difference(['Rings', 'Sex'])
x_train2, x_test2, y_train2, y_test2 = data_split(abalone2, target, features2)

x_train: (3131, 7), y_train: (3131,)
x_test: (1044, 7), y_test: (1044,)


In [29]:
mscaler = MinMaxScaler()
x_train_mscaled = mscaler.fit_transform(x_train2)
x_test_mscaled = mscaler.transform(x_test2)

In [30]:
# LinearRegression
ols.fit(x_train_mscaled, y_train2)
ols_train, ols_val, ols_test, ols_ser2 = print_score(ols, x_train_mscaled, y_train2, x_test_mscaled, y_test2)

# Ridge
rg.fit(x_train_mscaled, y_train2)
rg_train, rg_val, rg_test, rg_ser2 = print_score(rg, x_train_mscaled, y_train2, x_test_mscaled, y_test2)

# RandomForestRegressor
rf.fit(x_train_mscaled, y_train2)
rf_train, rf_val, rf_test, rf_ser2 = print_score(rf, x_train_mscaled, y_train2, x_test_mscaled, y_test2)

# SVR
svr.fit(x_train_mscaled, y_train2)
svr_train, svr_val, svr_test, svr_ser2 = print_score(svr, x_train_mscaled, y_train2, x_test_mscaled, y_test2)

# GradientBoostingRegressor
gb.fit(x_train_mscaled, y_train2)
gb_train, gb_val, gb_test, gb_ser2 = print_score(gb, x_train_mscaled, y_train2, x_test_mscaled, y_test2)

# KNeighborsRegressor
knn.fit(x_train_mscaled, y_train2)
knn_train, knn_val, knn_test, knn_ser2 = print_score(knn, x_train_mscaled, y_train2, x_test_mscaled, y_test2)

In [31]:
# XGBRegressor
xgb = XGBRegressor()
xgb.fit(x_train_mscaled, y_train2)

xgb_train, xgb_val, xgb_test, xgb_ser2 = print_score(xgb, x_train_mscaled, y_train2, x_test_mscaled, y_test2)

In [32]:
# Compare model performance
frame2 = {'ols':ols_ser2, 'ridge':rg_ser2, 'rf':rf_ser2, 'svr':svr_ser2,'xgb':xgb_ser2, 'gb':gb_ser2, 'knn':knn_ser2}
pd.DataFrame(frame2)

Unnamed: 0,ols,ridge,rf,svr,xgb,gb,knn
train,0.526,0.517,0.935,0.527,0.941,0.651,0.669
val,0.521,0.51,0.526,0.514,0.48,0.527,0.437
test,0.557,0.546,0.555,0.555,0.493,0.563,0.498


## **2-3. Third Experiment**<br>
> ### *Data Engineering*<br>
- Outlier Drop :  `Height (> 0.4)`
<br>

> ### *Data Preprocessing*<br>
- Numerical Features
  + **_Log_** transform (`np.log1p`)
- **_Drop Categorical Feature `Sex`_**
<br>

> ### *Model Selection*<br>
- Polynomial Regressor
- RandomForestRegressor
- Gradient Boosting Regressor
- KNeighbors Regressor
- SVM - SVR
- Gaussian Process Regressor
- XGBRegressor

In [33]:
x_train_log = np.log1p(x_train2)
x_test_log = np.log1p(x_test2)

In [34]:
def PolynomialRegression(degree=2, **kwargs):
    return make_pipeline(PolynomialFeatures(degree),
                         LinearRegression(**kwargs))

poly_ols = PolynomialRegression(degree=2)
poly_ols.fit(x_train_log, y_train2)

poly_train, poly_val, poly_test, poly_ser3 = print_score(poly_ols, x_train_log, y_train2, x_test_log, y_test2)

In [35]:
# LinearRegression
ols.fit(x_train_log, y_train2)
ols_train, ols_val, ols_test, ols_ser3 = print_score(ols, x_train_log, y_train2, x_test_log, y_test2)

# Ridge
rg.fit(x_train_log, y_train2)
rg_train, rg_val, rg_test, rg_ser3 = print_score(rg, x_train_log, y_train2, x_test_log, y_test2)

# RandomForestRegressor
rf.fit(x_train_log, y_train2)
rf_train, rf_val, rf_test, rf_ser3 = print_score(rf, x_train_log, y_train2, x_test_log, y_test2)

# SVR
svr.fit(x_train_log, y_train2)
svr_train, svr_val, svr_test, svr_ser3 = print_score(svr, x_train_log, y_train2, x_test_log, y_test2)

# GradientBoostingRegressor
gb.fit(x_train_log, y_train2)
gb_train, gb_val, gb_test, gb_ser3 = print_score(gb, x_train_log, y_train2, x_test_log, y_test2)

# KNeighborsRegressor
knn.fit(x_train_log, y_train2)
knn_train, knn_val, knn_test, knn_ser3 = print_score(knn, x_train_log, y_train2, x_test_log, y_test2)

In [36]:
# Gaussian Process Regressor
gpr = GaussianProcessRegressor()
gpr.fit(x_train_log, y_train2)

gpr_train, gpr_val, gpr_test, gpr_ser3 = print_score(gpr, x_train_log, y_train2, x_test_log, y_test2)

In [37]:
# XGBRegressor
xgb = XGBRegressor()
xgb.fit(x_train_log, y_train2)

xgb_train, xgb_val, xgb_test, xgb_ser3 = print_score(xgb, x_train_log, y_train2, x_test_log, y_test2)

In [38]:
frame3 = {'ols':ols_ser3,'poly_ols':poly_ser3, 'ridge':rg_ser3, 'rf':rf_ser3, 'svr':svr_ser3,\
          'xgb':xgb_ser3, 'gb':gb_ser3, 'knn':knn_ser3, 'gpr':gpr_ser3}
pd.DataFrame(frame3)

Unnamed: 0,ols,poly_ols,ridge,rf,svr,xgb,gb,knn,gpr
train,0.547,0.577,0.534,0.934,0.53,0.941,0.651,0.688,0.673
val,0.542,0.56,0.526,0.519,0.52,0.48,0.529,0.474,-0.204
test,0.576,0.598,0.566,0.558,0.568,0.493,0.564,0.542,-6.096
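
The negative `gpr` scores in the table are possible because R² has no lower bound: it drops below 0 whenever a model predicts worse than always predicting the mean of the target. The following is a minimal hand-written sketch of the definition (the helper `r2_manual` is hypothetical and is not scikit-learn's `r2_score`), applied to made-up numbers.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

y_true = [1, 2, 3]
perfect = r2_manual(y_true, [1, 2, 3])    # perfect predictions
mean_only = r2_manual(y_true, [2, 2, 2])  # always predicting the mean
reversed_ = r2_manual(y_true, [3, 2, 1])  # worse than predicting the mean

print(perfect, mean_only, reversed_)
```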


## **2-4. Fourth Experiment**<br>
### Data Engineering<br>
- Outlier Drop :  `Height (> 0.4)` , `Rings (> 15)`
<br>

### Data Preprocessing<br>
- Numerical features : log transform
- Categorical feature : Ordinal Encoding
<br>

### Results<br>
-

In [39]:
abalone3 = abalone.copy()

In [40]:
outlier_rings = abalone3[abalone3['Rings']> 15]
outlier_rings

Unnamed: 0,Sex,Length,Diameter,Height,Whole weight,Shucked weight,Viscera weight,Shell weight,Rings
6,F,0.530,0.415,0.150,0.7775,0.2370,0.1415,0.330,20
7,F,0.545,0.425,0.125,0.7680,0.2940,0.1495,0.260,16
9,F,0.550,0.440,0.150,0.8945,0.3145,0.1510,0.320,19
32,M,0.665,0.525,0.165,1.3380,0.5515,0.3575,0.350,18
33,F,0.680,0.550,0.175,1.7980,0.8150,0.3925,0.455,19
...,...,...,...,...,...,...,...,...,...
3929,F,0.650,0.515,0.215,1.4980,0.5640,0.3230,0.425,16
3930,F,0.670,0.535,0.185,1.5970,0.6275,0.3500,0.470,21
3931,I,0.550,0.440,0.165,0.8605,0.3120,0.1690,0.300,17
3944,M,0.550,0.440,0.160,0.9910,0.3480,0.1680,0.375,20


In [41]:
abalone3.drop(outlier_rings.index, inplace=True)
abalone3[abalone3['Rings']> 15]

Unnamed: 0,Sex,Length,Diameter,Height,Whole weight,Shucked weight,Viscera weight,Shell weight,Rings


In [42]:
# Split the data
target = 'Rings'
features3 = abalone3.columns.difference(['Rings'])
x_train3, x_test3, y_train3, y_test3 = data_split(abalone3, target, features3)

x_train: (2935, 8), y_train: (2935,)
x_test: (979, 8), y_test: (979,)


In [43]:
# Log transform
x_train3[numeric_feats] = np.log1p(x_train3[numeric_feats])
x_test3[numeric_feats] = np.log1p(x_test3[numeric_feats])

In [44]:
# OrdinalEncoding
le = OrdinalEncoder()
x_train3_le = le.fit_transform(x_train3)
x_test3_le = le.transform(x_test3)

# Check the result: no object-dtype columns should remain
print((x_train3_le.dtypes == 'object').sum())
print((x_test3_le.dtypes == 'object').sum())

0
0


In [45]:
x_train3_le.head()

Unnamed: 0,Diameter,Height,Length,Sex,Shell weight,Shucked weight,Viscera weight,Whole weight
2848,0.395415,0.14842,0.485508,1,0.298993,0.464363,0.230318,0.812706
1773,0.281412,0.131028,0.381855,2,0.189794,0.308954,0.142801,0.566166
2054,0.307485,0.10436,0.381855,3,0.114221,0.236257,0.08158,0.402461
62,0.34359,0.122218,0.425268,1,0.182322,0.263902,0.17689,0.528567
1293,0.336472,0.122218,0.41871,2,0.165514,0.20945,0.127953,0.459006


In [46]:
# LinearRegression
ols.fit(x_train3_le, y_train3)
ols_train4, ols_val4, ols_test4, ols_ser4 = print_score(ols,x_train3_le, y_train3, x_test3_le, y_test3)

# Ridge
rg.fit(x_train3_le, y_train3)
rg_train4, rg_val4, rg_test4, rg_ser4 = print_score(rg,x_train3_le, y_train3, x_test3_le, y_test3)


# RandomForestRegressor
rf.fit(x_train3_le, y_train3)
rf_train4, rf_val4, rf_test4, rf_ser4 = print_score(rf,x_train3_le, y_train3, x_test3_le, y_test3)

# SVR
svr.fit(x_train3_le, y_train3)
svr_train4, svr_val4, svr_test4, svr_ser4 = print_score(svr,x_train3_le, y_train3, x_test3_le, y_test3)

# XGBRegressor
xgb = XGBRegressor()
xgb.fit(x_train3_le, y_train3)

xgb_train4, xgb_val4, xgb_test4, xgb_ser4 = print_score(xgb,x_train3_le, y_train3, x_test3_le, y_test3)

In [47]:
frame4 = {'ols':ols_ser4,'ridge':rg_ser4, 'rf':rf_ser4, 'svr':svr_ser4,'xgb':xgb_ser4}
pd.DataFrame(frame4)

Unnamed: 0,ols,ridge,rf,svr,xgb
train,0.545,0.535,0.94,0.502,0.947
val,0.54,0.529,0.55,0.489,0.503
test,0.547,0.536,0.558,0.504,0.511


# **Part 3. Model Tuning**
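Nonlinear regression models (RandomForestRegressor, XGBRegressor) <br>
💡 Causes of overfitting <br>
- The amount of data is insufficient
- The data has high variance or heavy noise
- The model is highly complex (the number of learnable weights)
- The model is trained for too many epochs
<br>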
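
In the experiment tables above, the `rf` and `xgb` columns reach a train R² above 0.93 while their validation R² stays near 0.5, which is the kind of gap these causes produce. The sketch below reproduces that pattern on synthetic noisy data (the data and the helper `train_cv_gap` are made up for illustration) and shows how limiting tree depth narrows the gap.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic noisy data (made up): only the first feature carries signal
rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 5))
y = 3 * X[:, 0] + rng.normal(0, 1, size=300)

def train_cv_gap(max_depth):
    """Return (train R^2, mean 5-fold CV R^2) for a forest of the given depth."""
    model = RandomForestRegressor(n_estimators=50, max_depth=max_depth, random_state=42)
    cv_r2 = cross_val_score(model, X, y, scoring="r2", cv=5).mean()
    train_r2 = model.fit(X, y).score(X, y)
    return train_r2, cv_r2

deep_train, deep_cv = train_cv_gap(None)     # fully grown trees
shallow_train, shallow_cv = train_cv_gap(2)  # depth-limited trees

print(f"deep:    train={deep_train:.3f}, cv={deep_cv:.3f}")
print(f"shallow: train={shallow_train:.3f}, cv={shallow_cv:.3f}")
```

A smaller train/validation gap does not by itself mean a better model; the validation score is still the number to compare, which is what the searches in this part optimize.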

💡 Improving model performance <br>

✅ **[Baseline]**<br>
[Epoch 10, lr 0.001] <br>
[Train Dataset] Loss = 5.863, Accuracy = 0.824 <br>
[Test Dataset] Accuracy = 0.827 <br>


## **3-1. GridSearchCV**

In [48]:
abalone4 = abalone.copy()

# Split the data
target = 'Rings'
features4 = abalone4.columns.difference(['Rings','Sex'])
x_train4, x_test4, y_train4, y_test4 = data_split(abalone4, target, features4)

x_train: (3131, 7), y_train: (3131,)
x_test: (1044, 7), y_test: (1044,)


In [49]:
# Log transform
x_train4[numeric_feats] = np.log1p(x_train4[numeric_feats])
x_test4[numeric_feats] = np.log1p(x_test4[numeric_feats])

In [50]:
rf_grid = make_pipeline(
    RandomForestRegressor(
        n_estimators=200,
        max_depth=2,
        random_state = 42
    )
)
print(rf_grid)

Pipeline(steps=[('randomforestregressor',
                 RandomForestRegressor(max_depth=2, n_estimators=200,
                                       random_state=42))])


In [51]:
rf_params = {'randomforestregressor__n_estimators' : [100, 150, 200], 'randomforestregressor__max_depth': [4, 6, 8, 9, 10, 11, 12]}

In [52]:
rf_gridCV = GridSearchCV(rf_grid, param_grid=rf_params, scoring='r2', cv=5, verbose=3)
rf_gridCV.fit(x_train4, y_train4)

Fitting 5 folds for each of 21 candidates, totalling 105 fits
[CV 1/5] END randomforestregressor__max_depth=4, randomforestregressor__n_estimators=100;, score=0.476 total time=   0.5s
[CV 2/5] END randomforestregressor__max_depth=4, randomforestregressor__n_estimators=100;, score=0.473 total time=   0.7s
[CV 3/5] END randomforestregressor__max_depth=4, randomforestregressor__n_estimators=100;, score=0.495 total time=   0.7s
[CV 4/5] END randomforestregressor__max_depth=4, randomforestregressor__n_estimators=100;, score=0.529 total time=   0.7s
[CV 5/5] END randomforestregressor__max_depth=4, randomforestregressor__n_estimators=100;, score=0.461 total time=   0.8s
[CV 1/5] END randomforestregressor__max_depth=4, randomforestregressor__n_estimators=150;, score=0.476 total time=   1.5s
[CV 2/5] END randomforestregressor__max_depth=4, randomforestregressor__n_estimators=150;, score=0.476 total time=   1.1s
[CV 3/5] END randomforestregressor__max_depth=4, randomforestregressor__n_estimators

In [53]:
print("Best hyperparameters: ", rf_gridCV.best_params_)
print("Best CV R^2: ", rf_gridCV.best_score_)

Best hyperparameters:  {'randomforestregressor__max_depth': 9, 'randomforestregressor__n_estimators': 200}
Best CV R^2:  0.5323714661998805


In [54]:
pd.DataFrame(rf_gridCV.cv_results_).sort_values(by="rank_test_score").T

Unnamed: 0,11,10,8,14,7,9,13,17,12,20,...,19,16,15,18,5,4,3,2,1,0
mean_fit_time,1.772477,1.658683,1.854504,2.105215,1.216117,0.859069,1.353473,2.27442,1.194477,2.478599,...,1.855846,1.767274,0.988791,1.104476,2.097273,1.652839,2.129223,1.24624,1.039717,0.648036
std_fit_time,0.159026,0.418747,0.370921,0.371806,0.129794,0.007547,0.01831,0.395051,0.178535,0.514289,...,0.266437,0.333239,0.020911,0.120326,0.998673,0.849754,1.02114,0.095623,0.240667,0.09924
mean_score_time,0.024618,0.038109,0.023637,0.026258,0.017733,0.013081,0.01916,0.047736,0.017094,0.033911,...,0.023645,0.024275,0.015792,0.014441,0.025906,0.022352,0.026858,0.022541,0.017179,0.015521
std_score_time,0.002718,0.03276,0.003551,0.002452,0.004068,0.000783,0.000155,0.037188,0.001733,0.008052,...,0.003467,0.003911,0.002331,0.000317,0.009403,0.011631,0.010739,0.006599,0.004388,0.008501
param_randomforestregressor__max_depth,9,9,8,10,8,9,10,11,10,12,...,12,11,11,12,6,6,6,4,4,4
param_randomforestregressor__n_estimators,200,150,200,200,150,100,150,200,100,200,...,150,150,100,100,200,150,100,200,150,100
params,"{'randomforestregressor__max_depth': 9, 'rando...","{'randomforestregressor__max_depth': 9, 'rando...","{'randomforestregressor__max_depth': 8, 'rando...","{'randomforestregressor__max_depth': 10, 'rand...","{'randomforestregressor__max_depth': 8, 'rando...","{'randomforestregressor__max_depth': 9, 'rando...","{'randomforestregressor__max_depth': 10, 'rand...","{'randomforestregressor__max_depth': 11, 'rand...","{'randomforestregressor__max_depth': 10, 'rand...","{'randomforestregressor__max_depth': 12, 'rand...",...,"{'randomforestregressor__max_depth': 12, 'rand...","{'randomforestregressor__max_depth': 11, 'rand...","{'randomforestregressor__max_depth': 11, 'rand...","{'randomforestregressor__max_depth': 12, 'rand...","{'randomforestregressor__max_depth': 6, 'rando...","{'randomforestregressor__max_depth': 6, 'rando...","{'randomforestregressor__max_depth': 6, 'rando...","{'randomforestregressor__max_depth': 4, 'rando...","{'randomforestregressor__max_depth': 4, 'rando...","{'randomforestregressor__max_depth': 4, 'rando..."
split0_test_score,0.503827,0.501659,0.504933,0.501086,0.50223,0.503754,0.498078,0.500195,0.502286,0.498929,...,0.495426,0.498455,0.502124,0.499392,0.501961,0.500941,0.502273,0.47632,0.476139,0.476145
split1_test_score,0.519458,0.519227,0.518234,0.517636,0.516366,0.518964,0.517995,0.516487,0.520543,0.516931,...,0.516912,0.515104,0.515076,0.516747,0.511494,0.509771,0.50739,0.478511,0.475863,0.473095
split2_test_score,0.545769,0.545998,0.542334,0.541099,0.543438,0.545264,0.542413,0.541791,0.541169,0.543165,...,0.544028,0.542293,0.541841,0.542933,0.534704,0.535001,0.535406,0.49701,0.49564,0.494724


In [55]:
xgb_grid = XGBRegressor(n_estimators=10000,
                        max_depth=5,
                        learning_rate=0.01,
                        n_jobs=5,
                        early_stopping_rounds=30,
                        random_state=42)
print(xgb_grid)

XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, early_stopping_rounds=50,
             enable_categorical=False, eval_metric=None, feature_types=None,
             gamma=None, gpu_id=None, grow_policy=None, importance_type=None,
             interaction_constraints=None, learning_rate=0.01, max_bin=None,
             max_cat_threshold=None, max_cat_to_onehot=None,
             max_delta_step=None, max_depth=5, max_leaves=None,
             min_child_weight=None, missing=nan, monotone_constraints=None,
             n_estimators=10000, n_jobs=5, num_parallel_tree=None,
             predictor=None, random_state=42, ...)


In [58]:
xgb_params={ 'max_depth':[2, 4, 6, 8,9, 10],'learning_rate':[0.03, 0.05, 0.07]}

In [None]:
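# Early stopping needs at least one validation set, so hold one out of the training data
# (x_fit4 / x_val4 are new names) and pass it to fit() as eval_set for XGBRegressor.fit
x_fit4, x_val4, y_fit4, y_val4 = train_test_split(x_train4, y_train4, test_size=0.2, random_state=42)
xgb_gridCV = GridSearchCV(xgb_grid, xgb_params, scoring='r2', cv=5, verbose=3)
xgb_gridCV.fit(x_fit4, y_fit4, eval_set=[(x_val4, y_val4)], verbose=False)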

In [None]:
print("Best hyperparameters: ", xgb_gridCV.best_params_)
print("Best CV R^2: ", xgb_gridCV.best_score_)

## **3-2. Randomized SearchCV**

In [67]:
rr_params = {
    'n_estimators' : [100, 150, 200],
    'max_depth': [4, 6, 8, 9, 10, 11, 12]
}

In [63]:
rf_rs = RandomForestRegressor(max_depth=2,
                              random_state=42,
                              n_jobs=1)

In [69]:
rsCV = RandomizedSearchCV(
    rf_rs,
    rr_params,
    scoring='r2',
    n_iter=10,
    cv=5,
    verbose=3,
    random_state=42
)

rsCV.fit(x_train4, y_train4)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV 1/5] END .....max_depth=4, n_estimators=100;, score=0.476 total time=   1.5s
[CV 2/5] END .....max_depth=4, n_estimators=100;, score=0.473 total time=   0.7s
[CV 3/5] END .....max_depth=4, n_estimators=100;, score=0.495 total time=   0.7s
[CV 4/5] END .....max_depth=4, n_estimators=100;, score=0.529 total time=   0.5s
[CV 5/5] END .....max_depth=4, n_estimators=100;, score=0.461 total time=   0.5s
[CV 1/5] END ....max_depth=11, n_estimators=200;, score=0.500 total time=   2.0s
[CV 2/5] END ....max_depth=11, n_estimators=200;, score=0.516 total time=   2.0s
[CV 3/5] END ....max_depth=11, n_estimators=200;, score=0.542 total time=   1.9s
[CV 4/5] END ....max_depth=11, n_estimators=200;, score=0.599 total time=   2.1s
[CV 5/5] END ....max_depth=11, n_estimators=200;, score=0.493 total time=   2.9s
[CV 1/5] END ....max_depth=11, n_estimators=100;, score=0.502 total time=   1.4s
[CV 2/5] END ....max_depth=11, n_estimators=100;

In [70]:
print("Best hyperparameters: ", rsCV.best_params_)
print("Best CV R^2: ", rsCV.best_score_)

Best hyperparameters:  {'n_estimators': 200, 'max_depth': 9}
Best CV R^2:  0.5323714661998805
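
The randomized search above reports the same best parameters and the same score as the grid search, which suggests that its 10 sampled candidates happened to include the grid's best one. As a rough sketch of the difference between the two searches, the following runs both on synthetic data with the same 3 x 7 grid shape; the data and the smaller `n_estimators` values are assumptions chosen so the example runs quickly.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic data (made up) so the sketch runs on its own
rng = np.random.default_rng(0)
X = rng.uniform(size=(150, 4))
y = 5 * X[:, 0] + rng.normal(0, 0.5, size=150)

# Same 3 x 7 grid shape as above, with smaller forests to keep it fast
params = {"n_estimators": [10, 20, 30], "max_depth": [4, 6, 8, 9, 10, 11, 12]}
rf = RandomForestRegressor(random_state=42)

grid = GridSearchCV(rf, params, scoring="r2", cv=3).fit(X, y)
rand = RandomizedSearchCV(rf, params, scoring="r2", cv=3, n_iter=10, random_state=42).fit(X, y)

# Grid search tries every combination; randomized search samples n_iter of them
print(len(grid.cv_results_["params"]), grid.best_score_)
print(len(rand.cv_results_["params"]), rand.best_score_)
```

Randomized search costs roughly half the fits here (10 candidates instead of 21) and can never score above the full grid on the same folds, so it trades a possibly slightly worse optimum for less compute.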
