# LightGBM

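A framework developed by Microsoft for working with **gradient boosting algorithms**.  
Gradient boosting is derived from decision trees: it is an **ensemble learning method that builds multiple decision trees sequentially**, with each new tree correcting the errors of the trees built before it.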
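
As a rough illustration of building trees sequentially, the sketch below fits depth-1 trees (stumps) one after another, each to the residuals left by the trees before it. It is a toy under stated assumptions (a single numeric feature, squared-error loss, made-up data, and the hypothetical helper names `fit_stump` and `boost`), not LightGBM's actual tree learner.

```python
# Toy gradient boosting for squared error with depth-1 trees (stumps).
# Illustrative only: LightGBM's real implementation is far more elaborate.

def fit_stump(xs, residuals):
    """Pick the threshold on a 1-D feature that minimizes squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # every value except the max is a candidate split
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm


def boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Fit stumps one after another, each on the current residuals."""
    base = sum(ys) / len(ys)  # start from the mean prediction
    trees = []
    preds = [base] * len(ys)
    for _ in range(n_rounds):
        # For the loss 1/2 * (y - p)^2 the negative gradient is the residual y - p.
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        preds = [p + learning_rate * tree(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(tree(x) for tree in trees)


def sse(ys, preds):
    """Sum of squared errors."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds))


# Made-up data with a jump between x = 4 and x = 5.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]

model = boost(xs, ys)
baseline_sse = sse(ys, [sum(ys) / len(ys)] * len(ys))
boosted_sse = sse(ys, [model(x) for x in xs])
print(baseline_sse, boosted_sse)
```

The `learning_rate` shrinks each tree's contribution, so many small corrections accumulate instead of one tree fitting everything at once.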

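**Characteristics of random forests**  
1. High accuracy
2. Works efficiently even with hundreds or thousands of explanatory variables
3. Estimates the importance of each explanatory variable for the target variable
4. Handles missing values effectively
5. Keeps error rates balanced even when class sizes are unbalanced (imbalanced data is fine)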

**Notes on LightGBM**  
- If any column has dtype object, convert it to category

**Dataset**  
House Prices: Advanced Regression Techniques  
https://www.kaggle.com/c/house-prices-advanced-regression-techniques

KaggleHousingPrice EDA  
https://qiita.com/katsu1110/items/a1c3185fec39e5629bcb  
https://qiita.com/AykeJq0ILeYFOR4/items/20589df26b550aaa16b0

Validation  
https://blog.amedama.jp/entry/2018/05/01/081842

DIC: HousingPrice lesson

In [117]:
import numpy as np
import pandas as pd

### Load the data

In [118]:
df = pd.read_csv('input/train.csv')

In [119]:
pd.set_option('display.max_columns', df.shape[1])
df.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal,250000


In [120]:
df.shape

(1460, 81)

In [121]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-n

### Features containing missing values

In [122]:
for col in df.columns:
    null_num = df[col].isnull().sum()
    # Report only the columns that contain at least one missing value
    if null_num > 0:
        print(f"{col} : {null_num}({df[col].dtype})")

LotFrontage : 259(float64)
Alley : 1369(object)
MasVnrType : 8(object)
MasVnrArea : 8(float64)
BsmtQual : 37(object)
BsmtCond : 37(object)
BsmtExposure : 38(object)
BsmtFinType1 : 37(object)
BsmtFinType2 : 38(object)
Electrical : 1(object)
FireplaceQu : 690(object)
GarageType : 81(object)
GarageYrBlt : 81(float64)
GarageFinish : 81(object)
GarageQual : 81(object)
GarageCond : 81(object)
PoolQC : 1453(object)
Fence : 1179(object)
MiscFeature : 1406(object)
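
The same summary can also be built without an explicit loop, which makes it easy to add the missing ratio (for example, `Alley` is missing in 1369 of 1460 rows, about 94%). The following is a minimal sketch on a hypothetical toy frame named `toy`, not the real `train.csv`.

```python
import pandas as pd

# Hypothetical toy frame standing in for the Kaggle training data.
toy = pd.DataFrame({
    "LotFrontage": [65.0, None, 68.0, None],
    "Alley": [None, None, "Grvl", None],
    "LotArea": [8450, 9600, 11250, 9550],
})

missing = toy.isnull().sum()      # missing count per column
missing = missing[missing > 0]    # keep only columns with at least one gap

summary = pd.DataFrame({
    "missing": missing,
    "ratio": missing / len(toy),  # fraction of rows that are missing
    "dtype": toy.dtypes[missing.index].astype(str),
}).sort_values("missing", ascending=False)
print(summary)
```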


### Convert dtype from object to category

In [124]:
obj_list = [col for col in df.columns if df[col].dtype == 'object']

In [126]:
for obj in obj_list:
    df[obj]=df[obj].astype("category")
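
To see what the conversion does, the sketch below inspects a small hypothetical Series. A category column stores a table of distinct labels plus one integer code per row, and missing values receive the code -1. As far as the LightGBM documentation describes it, the default `categorical_feature='auto'` picks up unordered pandas categorical columns automatically, which is why this conversion is sufficient.

```python
import pandas as pd

# Hypothetical values; None marks a missing entry.
s = pd.Series(["Pave", "Grvl", None, "Pave"]).astype("category")

print(list(s.cat.categories))  # distinct labels, sorted
print(list(s.cat.codes))       # one integer code per row; missing values get -1
```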

In [127]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null category
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null category
Alley            91 non-null category
LotShape         1460 non-null category
LandContour      1460 non-null category
Utilities        1460 non-null category
LotConfig        1460 non-null category
LandSlope        1460 non-null category
Neighborhood     1460 non-null category
Condition1       1460 non-null category
Condition2       1460 non-null category
BldgType         1460 non-null category
HouseStyle       1460 non-null category
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null category
RoofMatl         1460 non-null catego

In [103]:
X=df.drop(["Id", "SalePrice"],axis=1)

In [104]:
y=df["SalePrice"]

### Split the data

In [105]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
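
With 1460 rows and `test_size=0.2`, the split leaves 1168 rows for training and 292 for evaluation. The sketch below reproduces only the idea (shuffle the row indices with a fixed seed, then cut them in two) using the standard library; it is an assumption-laden illustration and will not reproduce the exact rows scikit-learn selects.

```python
import random

n_rows = 1460      # number of rows reported by df.shape
test_size = 0.2

n_test = round(n_rows * test_size)  # rows held out for evaluation
n_train = n_rows - n_test           # rows left for training

# Shuffle the row indices with a fixed seed so the split is repeatable.
rng = random.Random(0)
indices = list(range(n_rows))
rng.shuffle(indices)
test_idx, train_idx = indices[:n_test], indices[n_test:]

print(n_train, n_test)
```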

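### Training

**Validation**  
The evaluation metric can be monitored at every boosting round on a held-out validation set (the code below uses the hold-out split from above rather than k-fold cross-validation).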

In [106]:
import lightgbm as lgb

lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

In [107]:
lgbm_params = {
    # Regression problem
    'objective': 'regression',
    # Aim to minimize RMSE (root mean squared error)
    'metric': 'rmse'
}
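
For reference, RMSE is the square root of the mean of the squared differences between true and predicted values, so it is expressed in the same unit as the target (here, dollars). The helper below is a minimal sketch with hypothetical prices, not LightGBM's internal implementation.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical sale prices and predictions, in dollars.
y_true = [200000, 150000, 300000]
y_pred = [210000, 140000, 330000]
print(rmse(y_true, y_pred))
```

Because the errors are squared before averaging, the single 30000-dollar miss dominates the two 10000-dollar misses.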

In [110]:
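model = lgb.train(
    lgbm_params, 
    lgb_train, 
    valid_sets=lgb_eval,
    # Maximum number of boosting rounds (default: 100)
    num_boost_round=1000,
    # Stop when the validation score has not improved for this many rounds
    # (disabled by default). LightGBM 4.0 removed this argument; there, pass
    # callbacks=[lgb.early_stopping(stopping_rounds=10)] instead.
    early_stopping_rounds=10
)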

[1]	valid_0's rmse: 76683.9
Training until validation scores don't improve for 10 rounds.
[2]	valid_0's rmse: 70898.8
[3]	valid_0's rmse: 66074.5
[4]	valid_0's rmse: 61965
[5]	valid_0's rmse: 58346
[6]	valid_0's rmse: 54991.3
[7]	valid_0's rmse: 52326.3
[8]	valid_0's rmse: 49672.4
[9]	valid_0's rmse: 47550
[10]	valid_0's rmse: 45443
[11]	valid_0's rmse: 43907.9
[12]	valid_0's rmse: 42474.5
[13]	valid_0's rmse: 41076.2
[14]	valid_0's rmse: 39944.4
[15]	valid_0's rmse: 38998.6
[16]	valid_0's rmse: 38058.7
[17]	valid_0's rmse: 37236.8
[18]	valid_0's rmse: 36471.6
[19]	valid_0's rmse: 35820.9
[20]	valid_0's rmse: 35365.9
[21]	valid_0's rmse: 34974.4
[22]	valid_0's rmse: 34594.3
[23]	valid_0's rmse: 34276.7
[24]	valid_0's rmse: 34001.8
[25]	valid_0's rmse: 33822.7
[26]	valid_0's rmse: 33694.4
[27]	valid_0's rmse: 33519.7
[28]	valid_0's rmse: 33404.8
[29]	valid_0's rmse: 33265.1
[30]	valid_0's rmse: 32868.3
[31]	valid_0's rmse: 32724.2
[32]	valid_0's rmse: 32649.3
[33]	valid_0's rmse: 32691.
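
In the log above, round 33 is the first visible round where the validation RMSE rises instead of falling. With `early_stopping_rounds=10`, training continues until 10 consecutive rounds fail to beat the best score. The sketch below shows that stopping rule in plain Python; the function name and the score list are hypothetical and are not the values logged above.

```python
def best_round_with_early_stopping(scores, patience):
    """Return (best_round, stop_round), both 1-based, for a metric to minimize."""
    best_score = float("inf")
    best_round = 0
    for round_no, score in enumerate(scores, start=1):
        if score < best_score:
            best_score, best_round = score, round_no
        elif round_no - best_round >= patience:
            return best_round, round_no  # no improvement for `patience` rounds
    return best_round, len(scores)       # ran out of rounds without stopping early


# Hypothetical validation RMSE per round.
scores = [50.0, 40.0, 35.0, 36.0, 35.5, 34.0, 34.1, 34.3, 34.2, 33.0]
result = best_round_with_early_stopping(scores, patience=3)
print(result)
```

With a patience of 3, the loop stops at round 9 and never evaluates round 10, even though that round would have been better. A larger patience trades extra training time for a lower risk of stopping too early.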