<a href="https://colab.research.google.com/github/nan-park/section2_project/blob/main/data_modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

In [2]:
from google.colab import files
files.upload();
# model_data.csv
# model_data_stationary.csv (model_data.csv transformed to be stationary)

Saving model_data.csv to model_data.csv
Saving model_data_stationary.csv to model_data_stationary.csv


In [3]:
url1 = 'model_data.csv'
url2 = 'model_data_stationary.csv'
df = pd.read_csv(url1, index_col=0)
df_stationary = pd.read_csv(url2, index_col=0)

# **Data Modeling**

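As a first pass, the time-series nature of the data is ignored: each row (one time step) is assumed to be independent of the rows at other time steps.<br>
In other words, the data is assumed to have no autocorrelation.<br>
Under that assumption, ordinary linear regression, random forest, XGBRegressor, and similar models can be used.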
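The independence assumption can be checked before relying on it. The following is a minimal sketch on synthetic data, not on the notebook's dataset: it compares the lag-1 autocorrelation of white noise with that of an AR(1) series. The series names and the coefficient 0.8 are hypothetical choices made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 265  # same length as the dataset used in this notebook

# White noise: independent draws, so lag-1 autocorrelation should be near 0.
noise = pd.Series(rng.normal(size=n))

# AR(1) series: each value depends on the previous one (hypothetical coefficient 0.8).
ar_values = np.zeros(n)
for t in range(1, n):
    ar_values[t] = 0.8 * ar_values[t - 1] + rng.normal()
ar1 = pd.Series(ar_values)

noise_lag1 = noise.autocorr(lag=1)
ar1_lag1 = ar1.autocorr(lag=1)
print(f"white noise lag-1 autocorrelation: {noise_lag1:.3f}")
print(f"AR(1) lag-1 autocorrelation: {ar1_lag1:.3f}")
```

On the real data the same check would be `df['소비자물가상승률(%)'].autocorr(lag=1)`; a value far from 0 would suggest the independence assumption is only a rough approximation.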

Target: the consumer price inflation rate (%), stored in the `소비자물가상승률(%)` column.

In [5]:
target = '소비자물가상승률(%)'
features = df.columns.tolist()
features.remove(target)

X = df[features]
y = df[target]

In [15]:
len(y)  # 265 observations

265

The data is split 8:2 into training and test sets.
Because the dataset is small, models are evaluated on the training set with 5-fold cross-validation (cv=5).

In [16]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
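To make the sizes concrete, here is a small sketch that applies the same 8:2 split to 265 row indices and then lists the validation-fold sizes of a 5-fold split of the training part. It uses placeholder indices rather than the real `X` and `y`; `KFold(n_splits=5)` without shuffling is what `cross_val_score(..., cv=5)` uses for a regressor.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

n_rows = 265  # number of observations reported above
indices = np.arange(n_rows)

# Same split settings as the notebook cell, applied to placeholder indices.
train_idx, test_idx = train_test_split(indices, test_size=0.2, random_state=42)

# Size of each validation fold when the training part is cut into 5 folds.
fold_sizes = [len(val) for _, val in KFold(n_splits=5).split(train_idx)]

print(f"train rows: {len(train_idx)}, test rows: {len(test_idx)}")
print(f"validation fold sizes: {fold_sizes}")
```

Each model is therefore fitted on roughly four fifths of the training rows and scored on the remaining fifth, five times.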

- Baseline model

The baseline predicts the training-set mean of the target for every row.

In [44]:
y_test_pred_base = [y_train.mean()] * len(y_test)
# Regression metrics, computed on the held-out test set (the printed label says "validation")
# from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
print('validation data eval(Baseline)')
print(f'mae : {mean_absolute_error(y_test, y_test_pred_base).round(5)}')
print(f'mse : {mean_squared_error(y_test, y_test_pred_base).round(5)}')
print(f'rmse : {(mean_squared_error(y_test, y_test_pred_base)**0.5).round(5)}')
print(f'r2 score : {r2_score(y_test, y_test_pred_base)}')

validation data eval(Baseline)
mae : 1.00976
mse : 1.6075
rmse : 1.26787
r2 score : -0.015812847536039953
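The slightly negative R² is expected for this kind of baseline: R² compares the model against the mean of the evaluated set, and a constant taken from the training set can never beat that. The sketch below illustrates this on synthetic data with scikit-learn's `DummyRegressor`, which implements the same mean-prediction baseline; all `*_demo` names and the generated values are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Synthetic targets; the features are irrelevant for a mean-only baseline.
y_train_demo = rng.normal(loc=2.0, scale=1.0, size=200)
y_test_demo = rng.normal(loc=2.0, scale=1.0, size=50)
X_train_demo = np.zeros((200, 1))
X_test_demo = np.zeros((50, 1))

# DummyRegressor(strategy="mean") always predicts the training-set mean.
baseline = DummyRegressor(strategy="mean").fit(X_train_demo, y_train_demo)
pred_demo = baseline.predict(X_test_demo)

r2_demo = r2_score(y_test_demo, pred_demo)
print(f"baseline R^2 on the test set: {r2_demo:.4f}")
```

Any model worth keeping should therefore score an R² clearly above 0 and an RMSE clearly below the baseline's.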


- Multiple linear regression (first-order)

In [47]:
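# from sklearn.linear_model import LinearRegression
# from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

# Regression metrics: neg_mean_absolute_error(-0.4727), r2(0.7), neg_mean_squared_error(-0.49253)
# LinearRegression(normalize=True) was removed in scikit-learn 1.2. Scaling in a pipeline
# replaces it; for ordinary least squares this does not change the predictions.
linear = make_pipeline(StandardScaler(), LinearRegression())
linear_rmse_score = -cross_val_score(linear, X_train, y_train, scoring='neg_root_mean_squared_error', n_jobs=-1, cv=5).mean()
print(f'Linear model rmse score : {linear_rmse_score.round(5)}')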

Linear model rmse score : 0.70049
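The RMSE of 0.70049 is not exactly the square root of the MSE noted in the code comment (sqrt(0.49253) is about 0.7018). A likely reason is that `cross_val_score` averages the per-fold scores: the mean of the fold RMSEs is never larger than the square root of the mean fold MSE. The sketch below shows this with made-up fold values; `fold_mse` is hypothetical and not taken from the notebook.

```python
import numpy as np

# Hypothetical per-fold MSE values for a 5-fold cross-validation.
fold_mse = np.array([0.35, 0.42, 0.50, 0.55, 0.64])

# What scoring='neg_root_mean_squared_error' averages: one RMSE per fold.
mean_of_fold_rmse = np.sqrt(fold_mse).mean()

# Square root of what scoring='neg_mean_squared_error' averages.
rmse_of_mean_mse = np.sqrt(fold_mse.mean())

# Square root of the MSE quoted in the notebook comment.
sqrt_reported_mse = np.sqrt(0.49253)

print(f"mean of fold RMSEs: {mean_of_fold_rmse:.5f}")
print(f"sqrt of mean fold MSE: {rmse_of_mean_mse:.5f}")
print(f"sqrt of reported MSE: {sqrt_reported_mse:.5f}")
```

The gap grows with the spread between folds, so the two numbers should be compared only against scores computed the same way.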


- Random forest regression (RandomForestRegressor)

In [80]:
from sklearn.ensemble import RandomForestRegressor
randomforest = RandomForestRegressor( # tree hyperparameters left at their defaults
    n_estimators=100,
    criterion='squared_error',
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=1,
    max_leaf_nodes=None,
    n_jobs=-1,
    random_state=42,
    verbose=3
)
# The score is an RMSE, so the variable is named accordingly.
random_rmse_score = -cross_val_score(randomforest, X_train, y_train, scoring='neg_root_mean_squared_error', n_jobs=-1, cv=5).mean()
print(f'RandomForest model rmse score : {random_rmse_score.round(5)}')

RandomForest model rmse score : 0.5379


Hyperparameter tuning (manual)

In [91]:
randomforest = RandomForestRegressor( # only max_depth differs from the defaults
    n_estimators=100,
    criterion='squared_error',
    max_depth=12, # default None
    min_samples_split=2,
    min_samples_leaf=1,
    max_leaf_nodes=None,
    n_jobs=-1,
    random_state=42,
    verbose=3
)
# The score is an RMSE, so the variable is named accordingly.
random_rmse_score = -cross_val_score(randomforest, X_train, y_train, scoring='neg_root_mean_squared_error', n_jobs=-1, cv=5).mean()
print(f'RandomForest model rmse score : {random_rmse_score.round(5)}')

RandomForest model rmse score : 0.53682


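Limiting max_depth to 12 barely changes the cross-validated RMSE (0.5379 vs. 0.53682), so the result is not very sensitive to this parameter.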
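Trying values one at a time can be automated. The following is a minimal sketch, assuming a grid search over `max_depth` is an acceptable substitute for manual tuning; it runs on synthetic data from `make_regression`, and the grid values, the `*_demo` names, and the reduced `n_estimators=20` are illustrative choices rather than settings from this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / y_train.
X_demo, y_demo = make_regression(n_samples=120, n_features=8, noise=10.0, random_state=42)

# Candidate depths; None lets each tree grow until its leaves are pure.
param_grid = {"max_depth": [4, 8, 12, None]}

search = GridSearchCV(
    RandomForestRegressor(n_estimators=20, random_state=42),
    param_grid,
    scoring="neg_root_mean_squared_error",  # same metric as the cells above
    cv=5,
)
search.fit(X_demo, y_demo)

best_depth = search.best_params_["max_depth"]
best_rmse = -search.best_score_  # flip the sign back to a positive RMSE
print(f"best max_depth: {best_depth}, CV RMSE: {best_rmse:.3f}")
```

With the real data, `X_demo` and `y_demo` would be replaced by `X_train` and `y_train`, and further parameters such as `min_samples_leaf` could be added to the grid.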

- XGBRegressor regression

In [51]:
from xgboost import XGBRegressor
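xgb = XGBRegressor( # not all defaults: booster and learning_rate are set explicitly
    booster='gblinear', # linear booster (the default is 'gbtree')
    objective='reg:squarederror', # 'reg:linear' is a deprecated alias of this objective
    eval_metric='rmse',
    max_depth=6, # tree-booster parameter; not used by gblinear
    learning_rate=0.01,
    colsample_bytree=1, # tree-booster parameter; not used by gblinear
    n_jobs=-1,
)
xgb_rmse_score = -cross_val_score(xgb, X_train, y_train, scoring='neg_root_mean_squared_error', n_jobs=-1, cv=5).mean()
print(f'XGB model rmse score : {xgb_rmse_score.round(5)}')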

XGB model rmse score : 1.00519


Hyperparameter tuning

In [122]:
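xgb = XGBRegressor( # learning_rate raised from 0.01 to 0.23
    booster='gblinear', # linear booster (the default is 'gbtree')
    objective='reg:squarederror', # 'reg:linear' is a deprecated alias of this objective
    eval_metric='rmse',
    max_depth=6, # tree-booster parameter; not used by gblinear
    learning_rate=0.23,
    n_jobs=-1,
    colsample_bytree=1, # tree-booster parameter; not used by gblinear
    gamma=0, # tree-booster parameter; not used by gblinear
    subsample=0.5, # tree-booster parameter; not used by gblinear
)
xgb_rmse_score = -cross_val_score(xgb, X_train, y_train, scoring='neg_root_mean_squared_error', n_jobs=-1, cv=5).mean()
print(f'XGB model rmse score : {xgb_rmse_score.round(5)}')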

XGB model rmse score : 0.76922
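For comparison, the cross-validated RMSE values printed in the cells above can be gathered into one sorted table. This is only a convenience sketch; the row labels are informal names chosen here. The baseline RMSE (1.26787) is left out because it was measured on the test set rather than by cross-validation.

```python
import pandas as pd

# CV RMSE values copied from the printed outputs of the cells above.
cv_rmse = pd.Series(
    {
        "LinearRegression": 0.70049,
        "RandomForest (default)": 0.5379,
        "RandomForest (max_depth=12)": 0.53682,
        "XGBRegressor (learning_rate=0.01)": 1.00519,
        "XGBRegressor (learning_rate=0.23)": 0.76922,
    },
    name="cv_rmse",
).sort_values()  # lowest (best) RMSE first

best_model = cv_rmse.index[0]
print(cv_rmse)
print(f"lowest CV RMSE: {best_model}")
```

Among the models tried so far, the random forest has the lowest cross-validated RMSE, while the XGBRegressor with the linear booster trails the plain linear regression.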
