### [Regression Data Analysis]
# KaKR House Price Prediction - BaseLine Model

- Baseline 출처 : [KaKR House Price Prediction Baseline](https://www.kaggle.com/kcs93023/2019-ml-month-2nd-baseline)  
위 링크에서 제공한 Baseline Model을 이용한 코드입니다.

---

### Import Module

In [1]:
import warnings
warnings.filterwarnings("ignore")

import os
from os.path import join

import pandas as pd
import numpy as np

import missingno as msno

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score
import xgboost as xgb
import lightgbm as lgb

import matplotlib.pyplot as plt
import seaborn as sns

### Load Dataset

In [2]:
data = pd.read_csv("./data/train.csv")
sub = pd.read_csv("./data/test.csv")
print(data.shape, sub.shape)

(15035, 21) (6468, 20)


### Delete Target Value

In [3]:
y = data['price']
del data['price']

### train data 생성

In [4]:
train_len = len(data)
data = pd.concat((data, sub), axis=0)

### submission 할 test data의 index 추출

In [5]:
sub_id = data['id'][train_len:]
del data['id']

### date column Parsing

In [None]:
data['date'] = data['date'].apply(lambda x : str(x[:6])).astype(str)

### Skewed columns preprocessing

In [6]:
skew_columns = ['bedrooms', 'sqft_living', 'sqft_lot', 'sqft_above', 'sqft_basement']

for c in skew_columns:
    data[c] = np.log1p(data[c].values)

### Submission test data & train data 분리

In [7]:
sub = data.iloc[train_len:, :]
x = data.iloc[:train_len, :]

### Boosting models 생성

In [8]:
gboost = GradientBoostingRegressor(random_state=2019)
xgboost = xgb.XGBRegressor(random_state=2019)
lightgbm = lgb.LGBMRegressor(random_state=2019)

models = [{'model':gboost, 'name':'GradientBoosting'}, {'model':xgboost, 'name':'XGBoost'},
          {'model':lightgbm, 'name':'LightGBM'}]

### CV score 함수 

In [9]:
def get_cv_score(models):
    kfold = KFold(n_splits=5, random_state=2019).get_n_splits(x.values)
    for m in models:
        print("Model {} CV score : {:.4f}".format(m['name'], np.mean(cross_val_score(m['model'], x.values, y)), 
                                             kf=kfold))

### RMSE 함수

In [11]:
def average_rmse(model, train, label):
    rmse_list = np.sqrt(-cross_val_score(model, train, label, scoring='neg_mean_squared_error', cv=5))
    print("{} RMSE lists: {}".format(model.__class__.__name__, np.round(rmse_list, 2)))
    print("{} RMSE average: {}".format(model.__class__.__name__, np.round(np.mean(rmse_list), 2)))

### R2 score 함수

In [13]:
def average_r2(model, train, label):
    r2_list = cross_val_score(model, train, label, scoring='r2', cv=5)
    print("{} r2_score lists: {}".format(model.__class__.__name__, np.round(r2_list, 4)))
    print("{} r2_score average: {}".format(model.__class__.__name__, np.round(np.mean(r2_list), 4)))

### Model Blending 함수 (평균값 사용)

In [19]:
def AveragingBlending(models, x, y, sub_x):
    for m in models : 
        m['model'].fit(x.values, y)
    
    predictions = np.column_stack([
        m['model'].predict(sub_x.values) for m in models
    ])
    return np.mean(predictions, axis=1)

### Scores 확인

In [10]:
get_cv_score(models)

Model GradientBoosting CV score : 0.8573
Model XGBoost CV score : 0.8539
Model LightGBM CV score : 0.8749


In [12]:
average_rmse(gboost, x.values, y)
average_rmse(xgboost, x.values, y)
average_rmse(lightgbm, x.values, y)

GradientBoostingRegressor RMSE lists: [139988.31 140042.2  134287.46 146891.34 130229.18]
GradientBoostingRegressor RMSE average: 138287.7
XGBRegressor RMSE lists: [143999.01 137914.56 128302.08 141671.79 134353.97]
XGBRegressor RMSE average: 137248.28
LGBMRegressor RMSE lists: [141292.54 137110.43 120027.57 123845.67 114511.62]
LGBMRegressor RMSE average: 127357.57


In [14]:
average_r2(gboost, x.values, y)
average_r2(xgboost, x.values, y)
average_r2(lightgbm, x.values, y)

GradientBoostingRegressor r2_score lists: [0.8787 0.8635 0.8546 0.8256 0.8756]
GradientBoostingRegressor r2_score average: 0.8596
XGBRegressor r2_score lists: [0.8717 0.8676 0.8673 0.8378 0.8676]
XGBRegressor r2_score average: 0.8624
LGBMRegressor r2_score lists: [0.8764 0.8691 0.8838 0.876  0.9039]
LGBMRegressor r2_score average: 0.8819


### Train set을 split 해서 score 확인

In [18]:
X_train, X_test, y_train, y_test = train_test_split(data[:15035], y, test_size=0.2, random_state=157)

In [20]:
y_pred = AveragingBlending(models, X_train, y_train, X_test)

In [21]:
mean_squared_error(y_test, y_pred) ** 0.5

124923.31748742981

In [22]:
r2_score(y_test, y_pred)

0.8825670708261794

---

### Submission data로 제출파일 생성

In [39]:
y_pred = AveragingBlending(models, x, y, sub)

In [40]:
submission = pd.read_csv("./data/sample_submission.csv")
submission.head()

Unnamed: 0,id,price
0,15035,100000
1,15036,100000
2,15037,100000
3,15038,100000
4,15039,100000


In [41]:
submission['price'] = y_pred
submission.describe()

Unnamed: 0,id,price
count,6468.0,6468.0
mean,18268.5,537178.4
std,1867.295103,327387.6
min,15035.0,158813.8
25%,16651.75,325995.0
50%,18268.5,462035.2
75%,19885.25,643118.8
max,21502.0,4915031.0


In [42]:
submission.to_csv("./baseline.csv", index_label=False, index=False)

In [43]:
submission_test = pd.read_csv("./submission.csv")
submission_test.head()

Unnamed: 0,id,price
0,15035,494290.6
1,15036,474450.8
2,15037,1256917.0
3,15038,306866.4
4,15039,286101.1


> submission score : 129439