This code was run in a Kaggle Notebook environment.

Library versions
- pandas 1.3.4
- numpy 1.19.5
- sklearn 0.23.2

Reference data from the Korea Research Institute for Human Settlements (국토연구원, KRIHS): https://www.bigdata-region.kr/#/dataset/0ad3c882-f7ee-4faf-970d-00c53cb65a84

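Exploring the reference data showed that digits 1 ~ 5 of the grid-space unique number (격자공간고유번호) correspond to the SIGNGU_NM and SIGNGU_CD columns of the reference data (the si/gun/gu district).

Digits 1 ~ 10 correspond to the SPG_NM column of the reference data.

Digits 1 ~ 2 indicate the region.

This confirmed that the sender (SEND_SPG_INNB) and receiver (REC_SPG_INNB) grid-space unique number variables can be split by digit position.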
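
The digit-based split described above can be sketched as follows. The ID values are made up for illustration, and `split_grid_id` is a hypothetical helper, not part of the original notebook. `int64` is requested explicitly because 10-digit prefixes do not fit in a 32-bit integer.

```python
import pandas as pd

# Made-up grid IDs for illustration only; real values come from the competition data.
df = pd.DataFrame({"SEND_SPG_INNB": [1129000014045300, 4148000690043300]})

def split_grid_id(series: pd.Series, n_digits: int) -> pd.Series:
    """Return the first n_digits of each ID as a 64-bit integer."""
    return series.astype(str).str[:n_digits].astype("int64")

df["region"] = split_grid_id(df["SEND_SPG_INNB"], 2)    # digits 1 ~ 2
df["sigungu"] = split_grid_id(df["SEND_SPG_INNB"], 5)   # digits 1 ~ 5
df["spg"] = split_grid_id(df["SEND_SPG_INNB"], 10)      # digits 1 ~ 10
print(df)
```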

### **Import Module**

In [110]:
# basic
import pandas as pd
import numpy as np
from glob import glob
from tqdm import tqdm
import os
import category_encoders as ce
import optuna
import warnings
warnings.filterwarnings(action='ignore') 

# sklearn
import sklearn
from sklearn.model_selection import KFold, GridSearchCV, train_test_split, StratifiedKFold, TimeSeriesSplit
from sklearn.metrics import mean_squared_error
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder
from sklearn.cluster import KMeans

# model
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor
from catboost import CatBoostRegressor, Pool
from sklearn.ensemble import GradientBoostingRegressor,RandomForestRegressor, BaggingRegressor, ExtraTreesRegressor, StackingRegressor

### **Load data**

In [111]:
# dacon data
train = pd.read_csv('../input/logistics-distribution/train_df.csv', encoding='cp949')
test = pd.read_csv('../input/logistics-distribution/test_df.csv', encoding='cp949')
submission = pd.read_csv('../input/logistics-distribution/sample_submission.csv', encoding='cp949')

# KRIHS (국토연구원) data
space_dir = '../input/space50'
space_list = os.listdir(space_dir)
space_df_list = []
for i in space_list:
    print(i + ' registered')
    # Expose each CSV as a global DataFrame named after the file (extension stripped)
    globals()[i[:-4]] = pd.read_csv(space_dir + '/' + i)
    space_df_list.append(i[:-4])

In [112]:
TC_NU_SPG_50_METER_50.head(3)

### **Feature Engineering**

The 'DL_GD_LCLS_NM' and 'DL_GD_MCLS_NM' variables were concatenated to create a new derived variable called 'group'.

In [113]:
train['group'] = train.DL_GD_LCLS_NM + '-' + train.DL_GD_MCLS_NM
test['group'] = test.DL_GD_LCLS_NM + '-' + test.DL_GD_MCLS_NM

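New variables were created from the sender and receiver grid-space unique numbers by digit position.

For each new variable, the sender and receiver values from the training data were collected into a single set, and that set was used to map both the training and test data to integer codes.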

In [114]:
# str10
train['SEND_SPG_INNB_str10'] = train['SEND_SPG_INNB'].astype(str).str[:10]
train['SEND_SPG_INNB_str10'] = train['SEND_SPG_INNB_str10'].astype(int)
test['SEND_SPG_INNB_str10'] = test['SEND_SPG_INNB'].astype(str).str[:10]
test['SEND_SPG_INNB_str10'] = test['SEND_SPG_INNB_str10'].astype(int)
train['REC_SPG_INNB_str10'] = train['REC_SPG_INNB'].astype(str).str[:10]
train['REC_SPG_INNB_str10'] = train['REC_SPG_INNB_str10'].astype(int)
test['REC_SPG_INNB_str10'] = test['REC_SPG_INNB'].astype(str).str[:10]
test['REC_SPG_INNB_str10'] = test['REC_SPG_INNB_str10'].astype(int)

ssi10 = set(train.SEND_SPG_INNB_str10)
#ssi10_t = set(test.SEND_SPG_INNB_str10)
rsi10 = set(train.REC_SPG_INNB_str10)
#rsi10_t = set(test.REC_SPG_INNB_str10)

#print('SEND_SPG_INNB set difference count :', len(ssi10.difference(ssi10_t)))
#print('REC_SPG_INNB set difference count :', len(rsi10.difference(rsi10_t)))

ssi10.update(rsi10)
#ssi10_t.update(rsi10_t)

#print('Set difference count after UPDATE :', len(ssi10.difference(ssi10_t)))

"""
SEND_SPG_INNB_str10_index = []
for i in list(ssi10.difference(ssi10_t)):
    train = train.drop(train[train['SEND_SPG_INNB_str10'] == i].index,axis='index')
    train = train.drop(train[train['REC_SPG_INNB_str10'] == i].index,axis='index')
    print(len(set(train.SEND_SPG_INNB_str10).difference(set(test.SEND_SPG_INNB_str10))))
"""

# mapping
dictionary_str10 = {}
for i,s in enumerate(ssi10):
    dictionary_str10[s] = i
    
train['SEND_SPG_INNB_str10'] = train.SEND_SPG_INNB_str10.map(dictionary_str10)
train['REC_SPG_INNB_str10'] = train.REC_SPG_INNB_str10.map(dictionary_str10)
test['SEND_SPG_INNB_str10'] = test.SEND_SPG_INNB_str10.map(dictionary_str10)
test['REC_SPG_INNB_str10'] = test.REC_SPG_INNB_str10.map(dictionary_str10)

# str5
train['SEND_SPG_INNB_str5'] = train['SEND_SPG_INNB'].astype(str).str[:5]
train['SEND_SPG_INNB_str5'] = train['SEND_SPG_INNB_str5'].astype(int)
test['SEND_SPG_INNB_str5'] = test['SEND_SPG_INNB'].astype(str).str[:5]
test['SEND_SPG_INNB_str5'] = test['SEND_SPG_INNB_str5'].astype(int)
train['REC_SPG_INNB_str5'] = train['REC_SPG_INNB'].astype(str).str[:5]
train['REC_SPG_INNB_str5'] = train['REC_SPG_INNB_str5'].astype(int)
test['REC_SPG_INNB_str5'] = test['REC_SPG_INNB'].astype(str).str[:5]
test['REC_SPG_INNB_str5'] = test['REC_SPG_INNB_str5'].astype(int)

ssi5 = set(train.SEND_SPG_INNB_str5)
#ssi5_t = set(test.SEND_SPG_INNB_str5)
rsi5 = set(train.REC_SPG_INNB_str5)
#rsi5_t = set(test.REC_SPG_INNB_str5)

#print('SEND_SPG_INNB set difference count :', len(ssi5.difference(ssi5_t)))
#print('REC_SPG_INNB set difference count :', len(rsi5.difference(rsi5_t)))

ssi5.update(rsi5)
#ssi5_t.update(rsi5_t)

"""
SEND_SPG_INNB_str5_index = []
for i in list(ssi5.difference(ssi5_t)):
    train = train.drop(train[train['SEND_SPG_INNB_str5'] == i].index,axis='index')
    train = train.drop(train[train['REC_SPG_INNB_str5'] == i].index,axis='index')
    print(len(set(train.SEND_SPG_INNB_str5).difference(set(test.SEND_SPG_INNB_str5))))
"""

# mapping
dictionary_str5 = {}
for i,s in enumerate(ssi5):
    dictionary_str5[s] = i
    
train['SEND_SPG_INNB_str5'] = train.SEND_SPG_INNB_str5.map(dictionary_str5)
train['REC_SPG_INNB_str5'] = train.REC_SPG_INNB_str5.map(dictionary_str5)
test['SEND_SPG_INNB_str5'] = test.SEND_SPG_INNB_str5.map(dictionary_str5)
test['REC_SPG_INNB_str5'] = test.REC_SPG_INNB_str5.map(dictionary_str5)

# str2
train['SEND_SPG_INNB_str2'] = train['SEND_SPG_INNB'].astype(str).str[:2]
train['SEND_SPG_INNB_str2'] = train['SEND_SPG_INNB_str2'].astype(int)
test['SEND_SPG_INNB_str2'] = test['SEND_SPG_INNB'].astype(str).str[:2]
test['SEND_SPG_INNB_str2'] = test['SEND_SPG_INNB_str2'].astype(int)
train['REC_SPG_INNB_str2'] = train['REC_SPG_INNB'].astype(str).str[:2]
train['REC_SPG_INNB_str2'] = train['REC_SPG_INNB_str2'].astype(int)
test['REC_SPG_INNB_str2'] = test['REC_SPG_INNB'].astype(str).str[:2]
test['REC_SPG_INNB_str2'] = test['REC_SPG_INNB_str2'].astype(int)

ssi2 = set(train.SEND_SPG_INNB_str2)
#ssi2_t = set(test.SEND_SPG_INNB_str2)
rsi2 = set(train.REC_SPG_INNB_str2)
#rsi2_t = set(test.REC_SPG_INNB_str2)

#print('SEND_SPG_INNB set difference count :', len(ssi2.difference(ssi2_t)))
#print('REC_SPG_INNB set difference count :', len(rsi2.difference(rsi2_t)))

ssi2.update(rsi2)
#ssi2_t.update(rsi2_t)

#print('Set difference count after UPDATE :', len(ssi2.difference(ssi2_t)))
#print(':', ssi2.difference(ssi2_t))

# mapping
dictionary_str2 = {}
for i,s in enumerate(ssi2):
    dictionary_str2[s] = i
    
train['SEND_SPG_INNB_str2'] = train.SEND_SPG_INNB_str2.map(dictionary_str2)
train['REC_SPG_INNB_str2'] = train.REC_SPG_INNB_str2.map(dictionary_str2)
test['SEND_SPG_INNB_str2'] = test.SEND_SPG_INNB_str2.map(dictionary_str2)
test['REC_SPG_INNB_str2'] = test.REC_SPG_INNB_str2.map(dictionary_str2)
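
The three blocks above repeat the same steps for 10, 5, and 2 digits. A compact sketch of that pattern is shown below on toy data. `encode_prefix` is a hypothetical helper, and it assigns codes in sorted order over string prefixes, so the integer codes would differ from those produced by the cell above. It also makes one property of the approach visible: because the mapping is built from training values only, any test prefix that never appears in train becomes NaN.

```python
import pandas as pd

def encode_prefix(train, test, n_digits, cols=("SEND_SPG_INNB", "REC_SPG_INNB")):
    """Label-encode the first n_digits of each ID column using train values only."""
    prefixes = {
        c: (train[c].astype(str).str[:n_digits], test[c].astype(str).str[:n_digits])
        for c in cols
    }
    # One shared vocabulary across sender and receiver columns (train only).
    vocab = sorted(set().union(*(tr for tr, _ in prefixes.values())))
    mapping = {p: i for i, p in enumerate(vocab)}
    for c, (tr, te) in prefixes.items():
        name = f"{c}_str{n_digits}"
        train[name] = tr.map(mapping)
        test[name] = te.map(mapping)  # NaN for prefixes never seen in train
    return mapping

# Toy data for illustration only.
toy_train = pd.DataFrame({"SEND_SPG_INNB": [1129000014, 4148000690],
                          "REC_SPG_INNB": [4148000690, 2635000000]})
toy_test = pd.DataFrame({"SEND_SPG_INNB": [1129000014, 5011000000],
                         "REC_SPG_INNB": [2635000000, 1129000014]})

mapping = encode_prefix(toy_train, toy_test, 2)
n_unseen = int(toy_test["SEND_SPG_INNB_str2"].isna().sum())
print(mapping, n_unseen)
```

Checking the NaN count on the real test set before modeling shows whether unseen prefixes need explicit handling.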

The 'group' variable was encoded with a Target Encoder.

In [115]:
# TargetEncoder
encoder = ce.target_encoder.TargetEncoder(cols=['group'])
encoder.fit(train['group'],train['INVC_CONT'])
train['group'] = encoder.transform(train['group'])
test['group'] = encoder.transform(test['group'])
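
To illustrate the idea behind target encoding, here is a minimal pandas sketch on toy data. It is a simplification: it replaces each category with the plain mean of the target, whereas the `category_encoders` `TargetEncoder` used above also smooths each category mean toward the global mean. The fallback to the global mean for unseen categories is an assumption of this sketch.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "group": ["A-x", "A-x", "B-y", "B-y", "B-y"],
    "INVC_CONT": [2, 4, 3, 6, 9],
})

# Mean target per category, learned from the training rows.
group_means = toy.groupby("group")["INVC_CONT"].mean()
global_mean = toy["INVC_CONT"].mean()

# Encode new rows; categories not seen in training fall back to the global mean.
new_groups = pd.Series(["A-x", "C-z"])
encoded = new_groups.map(group_means).fillna(global_mean)
print(encoded.tolist())
```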

In [116]:
train.head(3)

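Based on experiments, the variables derived from digits 1 ~ 10 and digits 1 ~ 2 were removed.

In the end, only the digits 1 ~ 5 variables and the group variable were used.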

In [135]:
train_all = train
test_all = test

train_all = train_all.drop(['SEND_SPG_INNB','REC_SPG_INNB','DL_GD_LCLS_NM','DL_GD_MCLS_NM'],axis=1)
test_all = test_all.drop(['SEND_SPG_INNB','REC_SPG_INNB','DL_GD_LCLS_NM','DL_GD_MCLS_NM'],axis=1)

train_all = train_all.drop(['SEND_SPG_INNB_str10','REC_SPG_INNB_str10','SEND_SPG_INNB_str2','REC_SPG_INNB_str2'],axis=1)
test_all = test_all.drop(['SEND_SPG_INNB_str10','REC_SPG_INNB_str10','SEND_SPG_INNB_str2','REC_SPG_INNB_str2'],axis=1)

X = train_all.drop(['INVC_CONT','index'],axis=1)
y = train_all['INVC_CONT']
#y_median = np.median(train_all['INVC_CONT'].values)
X_test = test_all[X.columns]

def rmse(y_pred, y_test):
    return np.sqrt(mean_squared_error(y_test, y_pred))

### **Modeling**

After experimenting with several models,
the seven best-performing ones were combined with stacking.

In [143]:
model1 = GradientBoostingRegressor()
model2 = XGBRegressor(learning_rate = 0.1, eval_metric = 'rmse', random_state=42)  # 'metrics' is not an XGBoost parameter; eval_metric is
model3 = RandomForestRegressor()
model4 = BaggingRegressor()
model5 = LGBMRegressor()
model6 = ExtraTreesRegressor()
model7 = CatBoostRegressor(learning_rate = 0.1, bootstrap_type = 'Bernoulli')

In [144]:
# GradientBoostingRegressor, XGBRegressor, RandomForestRegressor, BaggingRegressor, LGBMRegressor, ExtraTreesRegressor, CatBoostRegressor
estimators = [('gbr',model1),('xgb',model2),('rfr',model3),('br',model4),('lgb',model5),('etr',model6), ('cat',model7)]

In [145]:
stackingmodel = StackingRegressor(estimators=estimators,final_estimator=model1)
stackingmodel.fit(X, y)
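
The cell above fits the stack on the full training set, so it produces no held-out score. One way to estimate the stacked model's RMSE is cross-validation, sketched below on synthetic data. The data, the two small base models, and the `Ridge` final estimator are all assumptions chosen to keep the example fast; they are not the notebook's configuration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for X, y.
X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

stack = StackingRegressor(
    estimators=[
        ("rfr", RandomForestRegressor(n_estimators=20, random_state=0)),
        ("etr", ExtraTreesRegressor(n_estimators=20, random_state=0)),
    ],
    final_estimator=Ridge(),
    cv=3,  # folds used internally to build out-of-fold predictions for the final estimator
)

# scikit-learn returns negated RMSE so that higher is better; flip the sign back.
scores = -cross_val_score(stack, X_toy, y_toy, cv=3, scoring="neg_root_mean_squared_error")
print(scores.mean())
```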

In [153]:
stack_pred = stackingmodel.predict(X_test)
stack_pred

In [154]:
submission['INVC_CONT'] = np.round(stack_pred)
#submission.INVC_CONT[submission.INVC_CONT < 0] = abs(submission.INVC_CONT)
#submission.INVC_CONT[submission.INVC_CONT < 0] = y_median
submission.to_csv('stack_pred7.csv',index=False)
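
The commented-out lines above suggest that negative predictions were a concern. A minimal sketch of one alternative, clipping at zero after rounding, is shown below. This is a swapped-in technique (the commented code uses the absolute value or the median instead), and it assumes the target cannot be negative.

```python
import numpy as np

# Made-up predictions for illustration only.
preds = np.array([3.6, -1.4, 0.2, 7.5])

# Round to whole counts, then replace anything below zero with zero.
cleaned = np.clip(np.round(preds), 0, None)
print(cleaned)
```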