### Analysis with LightGBM

First, some setup.

In [212]:
import pandas as pd
import numpy as np
import lightgbm as lgb

Start by loading the features extracted yesterday (20220901.ipynb).

In [213]:
rent = pd.read_csv('rent.csv')
area_size = pd.read_csv('area_size.csv')
house_age = pd.read_csv('house_age.csv')
n_floor = pd.read_csv('n_floor.csv')
room_arrange = pd.read_csv('room_arrange.csv')
contract_span = pd.read_csv('contract_span.csv')

In [214]:
X_train = pd.concat([house_age, area_size], axis=1)

In [215]:
y_train = rent

Split the data into training and validation sets.

In [216]:
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.3, random_state=0)

In [217]:
test_area_size = pd.read_csv('test_area_size.csv')
test_house_age = pd.read_csv('test_house_age.csv')
test_n_floor = pd.read_csv('test_n_floor.csv')
test_room_arrange = pd.read_csv('test_room_arrange.csv')
test_contract_span = pd.read_csv('test_contract_span.csv')

For now, try LightGBM with only building age (house_age) and floor area (area_size), which are relatively simple and easy to handle.

In [218]:
X_test = pd.concat([test_house_age, test_area_size], axis=1)

In [219]:
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_valid, y_valid, reference=lgb_train)

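params = {
    'objective':'regression',
    # 'params' is not a LightGBM parameter name, so this entry is ignored and
    # the log below reports the regression default, l2, instead of rmse.
    'params':'rmse'
}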

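# verbose_eval and early_stopping_rounds are accepted here by LightGBM 3.x;
# LightGBM 4.0 and later removed them in favor of callbacks.
model = lgb.train(params, lgb_train, valid_sets=[lgb_train, lgb_eval], verbose_eval=10, num_boost_round=1000, early_stopping_rounds=10)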

y_pred = model.predict(X_test, num_iteration=model.best_iteration)

Please use params argument of the Dataset constructor to pass this parameter.


You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 510
[LightGBM] [Info] Number of data points in the train set: 22029, number of used features: 2
[LightGBM] [Info] Start training from score 118651.337373
Training until validation scores don't improve for 10 rounds
[10]	training's l2: 2.06477e+09	valid_1's l2: 2.48908e+09
[20]	training's l2: 1.58646e+09	valid_1's l2: 2.11262e+09
[30]	training's l2: 1.46379e+09	valid_1's l2: 2.03611e+09
[40]	training's l2: 1.4094e+09	valid_1's l2: 1.99866e+09
[50]	training's l2: 1.37179e+09	valid_1's l2: 1.97978e+09
[60]	training's l2: 1.34303e+09	valid_1's l2: 1.95681e+09
[70]	training's l2: 1.32483e+09	valid_1's l2: 1.94649e+09
[80]	training's l2: 1.30224e+09	valid_1's l2: 1.9259e+09
[90]	training's l2: 1.28021e+09	valid_1's l2: 1.90427e+09
[100]	training's l2: 1.2591e+09	valid_1's l2: 1.89371e+09
[110]	training's l2: 1.24262e+09	valid_1's l2: 1.88111e+09
[120]	training's l2: 1.229e+09	valid_1's l2: 1.87253e+09
[130

The error is far too large to be useful.
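Note that the log above reports `l2` (mean squared error), so its values are in squared rent units and look much larger than an RMSE would. As a rough sketch, taking the square root of the last complete validation value in the log (round 120) gives a number on the same scale as the rent:

```python
import math

# valid_1's l2 at round 120, copied from the training log above
valid_l2 = 1.87253e9

# RMSE is the square root of l2 (mean squared error)
valid_rmse = math.sqrt(valid_l2)
print(round(valid_rmse))
```

This is only a unit conversion of a logged value, not a new result; the error is still large relative to the starting score of about 118,651 shown in the log.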

Next, add the room layout as a feature.

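Assign a score to each layout string: add 1 point for each of L, D, and K, and 1 point for S, which appears occasionally. Then add the number of rooms to that total.<br>
For example, 3LDK scores 3+1+1+1=6 points, and 1R scores 1 point.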

In [220]:
room_arrange_scores = []
for ldks in room_arrange['間取り']:
    room_arrange_score = 0
    for s in ldks:
        if s.isdigit():
            # number of rooms (each digit character is added separately)
            room_arrange_score += int(s)
        elif s in ['L', 'D', 'K', 'S']:
            # one point for each of L, D, K, S
            room_arrange_score += 1
    
    room_arrange_scores.append(room_arrange_score)

In [221]:
# Use the same column name as the test data so that X_train and X_test match
room_arrange_scores = pd.Series(data=room_arrange_scores, name='間取り得点')

Looking at the maximum and minimum of the room_arrange scores, they seem reasonable enough.

In [222]:
print(max(room_arrange_scores))
print(min(room_arrange_scores))

9
1


Build the same feature for the test data.

In [223]:
test_room_arrange_scores = []
for ldks in test_room_arrange['間取り']:
    test_room_arrange_score = 0
    for s in ldks:
        if s.isdigit():
            # number of rooms (each digit character is added separately)
            test_room_arrange_score += int(s)
        elif s in ['L', 'D', 'K', 'S']:
            # one point for each of L, D, K, S
            test_room_arrange_score += 1
    
    test_room_arrange_scores.append(test_room_arrange_score)

In [224]:
test_room_arrange_scores = pd.Series(data=test_room_arrange_scores, name='間取り得点')

These look fine too.

In [225]:
print(max(test_room_arrange_scores))
print(min(test_room_arrange_scores))

11
1
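
The training and test loops above apply the same scoring rule, so it could be factored into one helper. The following is a minimal sketch of such a refactoring; the name `layout_score` is hypothetical and does not appear in the notebook.

```python
def layout_score(layout):
    """Score a layout string: each digit adds its value, each of L/D/K/S adds 1."""
    score = 0
    for ch in layout:
        if ch.isdigit():
            score += int(ch)   # digits are summed one character at a time
        elif ch in "LDKS":
            score += 1
    return score

print(layout_score("3LDK"))  # 6
print(layout_score("1R"))    # 1
```

Assuming the column holds plain strings, something like `room_arrange['間取り'].map(layout_score)` should reproduce the loop results. One caveat of summing digit by digit, shared with the loops above: a two-digit room count such as `10LDK` would score 1+0+3=4 rather than 13.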


Now run LightGBM again with the newly created room_arrange_scores plus floor area (area_size) and building age (house_age).

In [226]:
X_train = pd.concat([house_age, area_size, room_arrange_scores], axis=1)

In [227]:
y_train = rent

In [228]:
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.3, random_state=0)

In [229]:
X_test = pd.concat([test_house_age, test_area_size, test_room_arrange_scores], axis=1)

In [230]:
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_valid, y_valid, reference=lgb_train)

params = {
    'objective':'regression',
    'metrics':'rmse'
}

model = lgb.train(params, lgb_train, valid_sets=[lgb_train, lgb_eval], verbose_eval=10, num_boost_round=1000, early_stopping_rounds=10)

y_pred = model.predict(X_test, num_iteration=model.best_iteration)



You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 520
[LightGBM] [Info] Number of data points in the train set: 22029, number of used features: 3
[LightGBM] [Info] Start training from score 118651.337373
Training until validation scores don't improve for 10 rounds
[10]	training's rmse: 44690.1	valid_1's rmse: 49122.1
[20]	training's rmse: 38522.3	valid_1's rmse: 44990.2
[30]	training's rmse: 36773.8	valid_1's rmse: 44161.3
[40]	training's rmse: 35953.3	valid_1's rmse: 43902.9
[50]	training's rmse: 35381.4	valid_1's rmse: 43543.3
[60]	training's rmse: 34898.8	valid_1's rmse: 43283.7
[70]	training's rmse: 34511.9	valid_1's rmse: 43064
[80]	training's rmse: 34241.8	valid_1's rmse: 42976.2
[90]	training's rmse: 33947.1	valid_1's rmse: 42813.4
[100]	training's rmse: 33718.4	valid_1's rmse: 42650.4
[110]	training's rmse: 33544.3	valid_1's rmse: 42613.5
[120]	training's rmse: 33284.3	valid_1's

The results are still not good enough.

Below, as an experiment, use all of floor area, building age, room layout, contract period, and floor number.

Before that, the floor data has not yet been processed into a form LightGBM can take, so process it here.

Split the floor information into two features, floor_score and Floor_score,<br>
where<br>
floor_score = (the floor the unit is on)<br>
Floor_score = (the total number of floors in the building).
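
As an illustration of this split, here is a minimal sketch that pulls the first two numbers out of one raw entry. The function name `parse_floors` and the sample string `"3階／10階建"` are assumptions made for the example; the actual format of the 所在階 column is not shown in this notebook.

```python
import re
from math import nan, isnan

def parse_floors(raw):
    """Return (unit_floor, total_floors) from the first two numbers in `raw`."""
    if not isinstance(raw, str):
        return nan, nan            # missing entry
    numbers = [int(n) for n in re.findall(r"\d+", raw)]
    unit_floor = numbers[0] if len(numbers) > 0 else nan
    total_floors = numbers[1] if len(numbers) > 1 else nan
    return unit_floor, total_floors

# Hypothetical raw value: unit on floor 3 of a 10-floor building
print(parse_floors("3階／10階建"))
```

Like the cells that follow, this treats the first number as the unit's floor, so an entry containing only one number would be read as floor_score with Floor_score left as NaN.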

↓ For the training and validation data

In [231]:
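from math import nan
import re

def extract_digits(s):
    # Return every run of digits in s as a list of strings.
    # Non-string entries (e.g. missing values) become NaN.
    try:
        return re.findall(r"\d+", s)
    except TypeError:
        return nan

# Assign the whole column at once instead of writing element by element
n_floor["所在階"] = n_floor["所在階"].apply(extract_digits)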

In [232]:
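floor_scores = []
Floor_scores = []
for n in n_floor["所在階"]:
    # n is a list of digit strings, or NaN when nothing could be extracted.
    # Append exactly one value per row so the result stays aligned with the other features.
    try:
        floor_score = int(n[0])   # first number: floor the unit is on
    except (TypeError, IndexError, ValueError):
        floor_score = nan
    try:
        Floor_score = int(n[1])   # second number: total floors in the building
    except (TypeError, IndexError, ValueError):
        Floor_score = nan
    floor_scores.append(floor_score)
    Floor_scores.append(Floor_score)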

floor_scores = pd.Series(data=floor_scores, name='所在階')
Floor_scores = pd.Series(data=Floor_scores, name='全体の階数')

#floor_scores = floor_scores.rename(columns={0:'所在階'})  # rename the column
floor_scores.to_csv("floor_scores.csv",index=False)
#Floor_scores = Floor_scores.rename(columns={0:'全体の階数'})  # rename the column
Floor_scores.to_csv("capital_floor_scores.csv",index=False)


↓ For the test data

In [233]:
test_n_floor['所在階'][0][0]

'['

In [234]:
test_Floor_scores

0        8.0
1        3.0
2        1.0
3        1.0
4        4.0
        ... 
31257    6.0
31258    8.0
31259    1.0
31260    1.0
31261    5.0
Name: 全体の階数, Length: 31262, dtype: float64

In [235]:
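test_floor_scores = []
test_Floor_scores = []
for n in test_n_floor["所在階"]:
    # The CSV stores each entry as text (its first character is '['),
    # so extract the digit runs again before indexing.
    digits = re.findall(r"\d+", n) if isinstance(n, str) else []
    try:
        test_floor_score = int(digits[0])   # first number: floor the unit is on
    except IndexError:
        test_floor_score = nan
    try:
        test_Floor_score = int(digits[1])   # second number: total floors in the building
    except IndexError:
        test_Floor_score = nan
    test_floor_scores.append(test_floor_score)
    test_Floor_scores.append(test_Floor_score)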

test_floor_scores = pd.Series(data=test_floor_scores, name='所在階')
test_Floor_scores = pd.Series(data=test_Floor_scores, name='全体の階数')

#test_floor_scores = test_floor_scores.rename(columns={0:'所在階'})  # rename the column
test_floor_scores.to_csv("test_floor_scores.csv",index=False)
#test_Floor_scores = test_Floor_scores.rename(columns={0:'全体の階数'})  # rename the column
test_Floor_scores.to_csv("test_capital_floor_scores.csv",index=False)

Feed the floor area, building age, room layout, contract period, and floor data into LightGBM.

In [236]:
X_train = pd.concat([house_age, area_size, room_arrange_scores, contract_span, floor_scores, Floor_scores], axis=1)

In [237]:
y_train = rent

In [238]:
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.3, random_state=0)

In [239]:
X_test = pd.concat([test_house_age, test_area_size, test_room_arrange_scores, test_contract_span, test_floor_scores, test_Floor_scores], axis=1)

In [240]:
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_valid, y_valid, reference=lgb_train)

params = {
    'objective':'regression',
    'metrics':'rmse'
}

model = lgb.train(params, lgb_train, valid_sets=[lgb_train, lgb_eval], verbose_eval=10, num_boost_round=1000, early_stopping_rounds=10)

y_pred = model.predict(X_test, num_iteration=model.best_iteration)



You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 632
[LightGBM] [Info] Number of data points in the train set: 22029, number of used features: 6
[LightGBM] [Info] Start training from score 118651.337373
Training until validation scores don't improve for 10 rounds
[10]	training's rmse: 40857.7	valid_1's rmse: 45533.1
[20]	training's rmse: 32346.4	valid_1's rmse: 39173.4
[30]	training's rmse: 29665.4	valid_1's rmse: 37324.9
[40]	training's rmse: 28509.5	valid_1's rmse: 36567
[50]	training's rmse: 27669.1	valid_1's rmse: 36003.8
[60]	training's rmse: 27034.3	valid_1's rmse: 35574.1
[70]	training's rmse: 26453.2	valid_1's rmse: 35259.6
[80]	training's rmse: 25996.2	valid_1's rmse: 34782.7
[90]	training's rmse: 25551	valid_1's rmse: 34575.5
[100]	training's rmse: 25258.7	valid_1's rmse: 34475
[110]	training's rmse: 24921.1	valid_1's rmse: 34285.6
[120]	training's rmse: 24613.8	valid_1's rmse: 34217.2
[130]	training's rmse: 24331.5	valid_1's rmse: 34057.
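
The log line "Training until validation scores don't improve for 10 rounds" comes from `early_stopping_rounds=10`. The following pure-Python sketch illustrates that stopping rule on a list of validation scores. It is a simplified illustration, not LightGBM's implementation, and the function name `early_stopping_round` is hypothetical.

```python
def early_stopping_round(valid_scores, patience=10):
    """Return the 1-based round with the best (lowest) score seen before
    `patience` rounds in a row fail to improve on it."""
    best_score = float("inf")
    best_round = 0
    for round_no, score in enumerate(valid_scores, start=1):
        if score < best_score:
            best_score, best_round = score, round_no
        elif round_no - best_round >= patience:
            break  # no improvement for `patience` rounds: stop training
    return best_round

# Made-up scores: the best value before stopping is at round 3
scores = [5.0, 4.0, 3.0, 3.5, 3.4, 3.6, 1.0]
print(early_stopping_round(scores, patience=3))
```

With a small patience the loop stops before reaching the later, better score, which is the trade-off this setting controls. In the notebook, the round chosen this way is what `model.best_iteration` passes to `predict`.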