### Analysis with LightGBM

First, some setup.

In [515]:
import pandas as pd
import numpy as np
import lightgbm as lgb

First, load the features extracted yesterday (20220901.ipynb).

In [516]:
rent = pd.read_csv('rent.csv')
area_size = pd.read_csv('area_size.csv')
house_age = pd.read_csv('house_age.csv')
n_floor = pd.read_csv('n_floor.csv')
room_arrange = pd.read_csv('room_arrange.csv')
contract_span = pd.read_csv('contract_span.csv')

In [517]:
X_train = pd.concat([house_age, area_size], axis=1)

In [518]:
y_train = rent

Split the data into training and validation sets.

In [519]:
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.3, random_state=0)

In [520]:
test_area_size = pd.read_csv('test_area_size.csv')
test_house_age = pd.read_csv('test_house_age.csv')
test_n_floor = pd.read_csv('test_n_floor.csv')
test_room_arrange = pd.read_csv('test_room_arrange.csv')
test_contract_span = pd.read_csv('test_contract_span.csv')

For now, try LightGBM with only building age (house_age) and floor area (area_size), which are relatively simple and easy to handle.

In [521]:
X_test = pd.concat([test_house_age, test_area_size], axis=1)

In [522]:
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_valid, y_valid, reference=lgb_train)

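params = {
    'objective':'regression',
    # 'metric' selects the evaluation metric; an unrecognized key such as 'params' is ignored and the default l2 is logged
    'metric':'rmse'
}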

model = lgb.train(params, lgb_train, valid_sets=[lgb_train, lgb_eval], verbose_eval=10, num_boost_round=1000, early_stopping_rounds=10)

y_pred = model.predict(X_test, num_iteration=model.best_iteration)

Please use params argument of the Dataset constructor to pass this parameter.


You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 510
[LightGBM] [Info] Number of data points in the train set: 22029, number of used features: 2
[LightGBM] [Info] Start training from score 118651.337373
Training until validation scores don't improve for 10 rounds
[10]	training's l2: 2.06477e+09	valid_1's l2: 2.48908e+09
[20]	training's l2: 1.58646e+09	valid_1's l2: 2.11262e+09
[30]	training's l2: 1.46379e+09	valid_1's l2: 2.03611e+09
[40]	training's l2: 1.4094e+09	valid_1's l2: 1.99866e+09
[50]	training's l2: 1.37179e+09	valid_1's l2: 1.97978e+09
[60]	training's l2: 1.34303e+09	valid_1's l2: 1.95681e+09
[70]	training's l2: 1.32483e+09	valid_1's l2: 1.94649e+09
[80]	training's l2: 1.30224e+09	valid_1's l2: 1.9259e+09
[90]	training's l2: 1.28021e+09	valid_1's l2: 1.90427e+09
[100]	training's l2: 1.2591e+09	valid_1's l2: 1.89371e+09
[110]	training's l2: 1.24262e+09	valid_1's l2: 1.88111e+

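The error is far too large to be useful: at round 100 the validation l2 (mean squared error) is 1.89371e+09, an RMSE of roughly 43,500 against a mean rent of about 118,651.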
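
The conversion from the logged l2 to an RMSE on the same scale as the rent can be sketched as follows. The two numbers are copied from the log above; the ratio to the mean rent is only a rough yardstick introduced here, not something LightGBM reports.

```python
import math

# Values copied from the training log above.
valid_l2_round_100 = 1.89371e9   # valid_1's l2 at round 100 (mean squared error)
mean_rent = 118651.337373        # "Start training from score", the mean of the training target

# RMSE is the square root of the mean squared error, in the same unit as rent.
rmse = math.sqrt(valid_l2_round_100)

# Rough relative error against the mean rent.
relative_error = rmse / mean_rent

print(f"RMSE: {rmse:.0f}")
print(f"RMSE / mean rent: {relative_error:.2f}")
```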

Next, add the room layout as a feature.

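Assign a score to each layout string. L, D, and K each add 1 point, and the occasionally appearing S also adds 1 point. The number of rooms is then added to that total.<br>
For example, 3LDK scores 3+1+1+1=6 points, and 1R scores 1 point.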
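
The rule above can be sketched as a small function, so that the same logic can be reused for both the training and the test data. The name `layout_score` and the sample strings are introduced here for illustration only.

```python
def layout_score(layout: str) -> int:
    """Score a layout string such as '3LDK' using the rule described above."""
    score = 0
    for ch in layout:
        if ch.isdigit():
            # Each digit adds its own value; int() also accepts full-width digits like '３'.
            score += int(ch)
        elif ch in ("L", "D", "K", "S"):
            score += 1
        # Any other character (for example 'R') contributes nothing.
    return score


for example in ["3LDK", "1R", "2SLDK"]:
    print(example, layout_score(example))
```

Note that digits are added one character at a time, so a layout with a two-digit room count would be scored digit by digit; whether such values occur in the data is not known here.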

In [523]:
room_arrange_scores = []
for ldks in room_arrange['間取り']:
    room_arrange_score = 0
    for s in ldks:
        if s.isdigit():
            room_arrange_score += int(s)
        elif (s in ['L', 'D', 'K', 'S']):
            room_arrange_score += 1
        else:
            pass
    
    room_arrange_scores.append(room_arrange_score)

In [524]:
room_arrange_scores = pd.Series(room_arrange_scores)

Looking at the maximum and minimum of the room_arrange scores, they seem reasonable enough.

In [525]:
print(max(room_arrange_scores))
print(min(room_arrange_scores))

9
1


Build the same scores for the test data.

In [526]:
test_room_arrange_scores = []
for ldks in test_room_arrange['間取り']:
    test_room_arrange_score = 0
    for s in ldks:
        if s.isdigit():
            test_room_arrange_score += int(s)
        elif (s in ['L', 'D', 'K', 'S']):
            test_room_arrange_score += 1
        else:
            pass
    
    test_room_arrange_scores.append(test_room_arrange_score)

In [527]:
test_room_arrange_scores = pd.Series(test_room_arrange_scores)

These also look fine.

In [528]:
print(max(test_room_arrange_scores))
print(min(test_room_arrange_scores))

11
1


Now run LightGBM again with the newly created room_arrange_scores together with floor area (area_size) and building age (house_age).

In [529]:
X_train = pd.concat([house_age, area_size, room_arrange_scores], axis=1)

In [530]:
y_train = rent

In [531]:
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.3, random_state=0)

In [532]:
X_test = pd.concat([test_house_age, test_area_size, test_room_arrange_scores], axis=1)

In [533]:
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_valid, y_valid, reference=lgb_train)

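params = {
    'objective':'regression',
    # 'metric' selects the evaluation metric; an unrecognized key such as 'params' is ignored and the default l2 is logged
    'metric':'rmse'
}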

model = lgb.train(params, lgb_train, valid_sets=[lgb_train, lgb_eval], verbose_eval=10, num_boost_round=1000, early_stopping_rounds=10)

y_pred = model.predict(X_test, num_iteration=model.best_iteration)

You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 520
[LightGBM] [Info] Number of data points in the train set: 22029, number of used features: 3
[LightGBM] [Info] Start training from score 118651.337373
Training until validation scores don't improve for 10 rounds
[10]	training's l2: 1.99721e+09	valid_1's l2: 2.41298e+09
[20]	training's l2: 1.48397e+09	valid_1's l2: 2.02411e+09
[30]	training's l2: 1.35231e+09	valid_1's l2: 1.95022e+09
[40]	training's l2: 1.29264e+09	valid_1's l2: 1.92746e+09
[50]	training's l2: 1.25185e+09	valid_1's l2: 1.89602e+09
[60]	training's l2: 1.21793e+09	valid_1's l2: 1.87348e+09
[70]	training's l2: 1.19107e+09	valid_1's l2: 1.8545e+09
[80]	training's l2: 1.1725e+09	valid_1's l2: 1.84695e+09
[90]	training's l2: 1.15241e+09	valid_1's l2: 1.83298e+09
[100]	training's l2: 1.13693e+09	valid_1's l2: 1.81906e+09
[110]	training's l2: 1.12522e+09	valid_1's l2: 1.81591e+09
[120]	training's l2: 1.10784e+09	valid_1's l2: 1.80278e+09
[130]	training's l2: 1.09168e+09	valid_1's l2: 1.78432e+09
[140]	training's l2: 1.07279e+09	valid_1's l2: 1.77055e+09
[150]	training's l2: 1.05701e+09	valid_1's l2: 1.75931e+09
[160]	training's l2: 1.04223e+09	valid_1's l2: 1.74911e+09
[170]	training's l2: 1.02962e+09	valid_1's l2: 1.74024e+09
[180]	training's l2: 1.014e+09	valid_1's l2: 1.72312e+09
[190]	training's l2: 1.00031e+09	valid_1's l2: 1.71233e+09
[200]	training's l2: 9.89271e+08	valid_1's l2: 1.70859e+09
[210]	training's l2: 9.77892e+08	valid_1's l2: 1.70187e+09
[220]	training's l2: 9.68978e+08	valid_1's l2: 1.69949e+09
[230]	training's l2: 9.57378e+08	valid_1's l2: 1.69149e+09
[240

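The result is still not usable: a validation l2 of 1.69149e+09 at round 230 corresponds to an RMSE of roughly 41,100.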

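Below, as an experiment, include all of floor area, building age, room layout, contract period, and floor number.

Before that, the floor number has not yet been processed into a form LightGBM can take, so process it here.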

Split the floor indicator into two features, floor_score and Floor_score.<br>
Here<br>
floor_score = (the floor the unit is on)<br>
Floor_score = (the total number of floors in the building)<br>
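
One way to derive the two scores separately is sketched below. It assumes, without confirmation from the data shown here, that each value looks like `2階／5階建` (the unit's floor followed by the building's total floors). The function name `parse_floor`, the pattern, and the sample strings are all hypothetical and would need adjusting to the real column.

```python
import re
from typing import Optional, Tuple

# Assumed format: "<unit floor>階／<total floors>階建". Adjust the pattern to the real data.
FLOOR_PATTERN = re.compile(r"(\d+)階\s*[／/]\s*(\d+)階建")


def parse_floor(text: str) -> Tuple[Optional[int], Optional[int]]:
    """Return (floor_score, Floor_score), or (None, None) if the text does not match."""
    match = FLOOR_PATTERN.search(text)
    if match is None:
        return None, None
    return int(match.group(1)), int(match.group(2))


print(parse_floor("2階／5階建"))
print(parse_floor("12階/14階建"))
print(parse_floor(""))
```

Returning `None` for unmatched values keeps one result per row, so the parsed columns stay the same length as the other features.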

↓ For the training and validation data

In [535]:
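floor_scores = []
Floor_scores = []
# Iterate over the values of the first column (assumes n_floor.csv holds the floor text there);
# looping over the DataFrame itself yields only column names.
for n in n_floor.iloc[:, 0]:
    if not isinstance(n, str) or n == "" or n == "所在階":
        # Missing or header-like entry: store NaN so the row count stays aligned with the other features.
        floor_score = np.nan
        Floor_score = np.nan
    else:
        # Both scores take the leading character for now; the total floor count is not parsed separately yet.
        floor_score = int(n[0])
        Floor_score = int(n[0])
    floor_scores.append(floor_score)
    Floor_scores.append(Floor_score)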


In [536]:
floor_scores = pd.Series(floor_scores)
Floor_scores = pd.Series(Floor_scores)



↓ For the test data

In [537]:
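test_floor_scores = []
test_Floor_scores = []
# Iterate over the values of the first column (assumes test_n_floor.csv holds the floor text there);
# looping over the DataFrame itself yields only column names.
for n in test_n_floor.iloc[:, 0]:
    if not isinstance(n, str) or n == "" or n == "所在階":
        # Missing or header-like entry: store NaN so the row count stays aligned with the other features.
        test_floor_score = np.nan
        test_Floor_score = np.nan
    else:
        # Both scores take the leading character for now; the total floor count is not parsed separately yet.
        test_floor_score = int(n[0])
        test_Floor_score = int(n[0])
    test_floor_scores.append(test_floor_score)
    test_Floor_scores.append(test_Floor_score)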

In [538]:
test_floor_scores = pd.Series(test_floor_scores)
test_Floor_scores = pd.Series(test_Floor_scores)



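Feed the floor area, building age, room layout, contract period, and floor data into LightGBM. In the recorded run below, LightGBM reports only 4 used features, so the two floor columns added no information.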

In [539]:
X_train = pd.concat([house_age, area_size, room_arrange_scores, contract_span, floor_scores, Floor_scores], axis=1)
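
Before training, it can help to confirm that every feature column actually carries values, since `pd.concat` with `axis=1` aligns rows on the index and fills gaps with NaN instead of raising. A minimal sketch with made-up data (the column names and values are illustrative only):

```python
import pandas as pd

# Made-up stand-ins for the real feature tables.
area = pd.DataFrame({"area_size": [20.0, 35.5, 50.0]})
empty_feature = pd.Series([], dtype="float64", name="floor_score")

# axis=1 aligns rows on the index, so the empty Series becomes an all-NaN column.
X = pd.concat([area, empty_feature], axis=1)

print(X.shape)
print(X.isna().sum())
```

A column whose NaN count equals the number of rows gives the model nothing to split on, so a check like this flags a feature that was built incorrectly.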

In [540]:
y_train = rent

In [541]:
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.3, random_state=0)

In [542]:
X_test = pd.concat([test_house_age, test_area_size, test_room_arrange_scores, test_contract_span, test_floor_scores, test_Floor_scores], axis=1)

In [543]:
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_valid, y_valid, reference=lgb_train)

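params = {
    'objective':'regression',
    # 'metric' selects the evaluation metric; an unrecognized key such as 'params' is ignored and the default l2 is logged
    'metric':'rmse'
}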

model = lgb.train(params, lgb_train, valid_sets=[lgb_train, lgb_eval], verbose_eval=10, num_boost_round=1000, early_stopping_rounds=10)

y_pred = model.predict(X_test, num_iteration=model.best_iteration)

Please use params argument of the Dataset constructor to pass this parameter.


You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 531
[LightGBM] [Info] Number of data points in the train set: 22029, number of used features: 4
[LightGBM] [Info] Start training from score 118651.337373
Training until validation scores don't improve for 10 rounds
[10]	training's l2: 1.97646e+09	valid_1's l2: 2.39879e+09
[20]	training's l2: 1.43091e+09	valid_1's l2: 1.98045e+09
[30]	training's l2: 1.28227e+09	valid_1's l2: 1.89983e+09
[40]	training's l2: 1.21898e+09	valid_1's l2: 1.86189e+09
[50]	training's l2: 1.17963e+09	valid_1's l2: 1.84071e+09
[60]	training's l2: 1.14409e+09	valid_1's l2: 1.80942e+09
[70]	training's l2: 1.11216e+09	valid_1's l2: 1.77531e+09
[80]	training's l2: 1.09034e+09	valid_1's l2: 1.75475e+09
[90]	training's l2: 1.07137e+09	valid_1's l2: 1.73214e+09
[100]	training's l2: 1.05152e+09	valid_1's l2: 1.7136e+09
[110]	training's l2: 1.03599e+09	valid_1's l2: 1.70162