- 개발 환경(OS): Windows 10 Education, 64비트 운영 체제, x64 기반 프로세서
![image.png](attachment:image.png)

## Library version check

In [1]:
!pip install optuna sktime

Collecting sktime
  Downloading sktime-0.21.0-py3-none-any.whl (17.1 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m17.1/17.1 MB[0m [31m69.4 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
Collecting deprecated>=1.2.13
  Downloading Deprecated-1.2.14-py2.py3-none-any.whl (9.6 kB)
Collecting scikit-base<0.6.0
  Downloading scikit_base-0.5.0-py3-none-any.whl (118 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m118.2/118.2 kB[0m [31m32.2 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: scikit-base, deprecated, sktime
Successfully installed deprecated-1.2.14 scikit-base-0.5.0 sktime-0.21.0
[0m

In [2]:
import sys
import sktime
import tqdm as tq
import xgboost as xgb
import matplotlib
import optuna
import seaborn as sns
import sklearn as skl
import pandas as pd
import numpy as np
print("-------------------------- Python & library version --------------------------")
print("Python version: {}".format(sys.version))
print("pandas version: {}".format(pd.__version__))
print("numpy version: {}".format(np.__version__))
print("matplotlib version: {}".format(matplotlib.__version__))
print("tqdm version: {}".format(tq.__version__))
print("sktime version: {}".format(sktime.__version__))
print("xgboost version: {}".format(xgb.__version__))
print("seaborn version: {}".format(sns.__version__))
print("scikit-learn version: {}".format(skl.__version__))
print("------------------------------------------------------------------------------")

-------------------------- Python & library version --------------------------
Python version: 3.9.16 (main, Dec  7 2022, 01:11:51) 
[GCC 9.4.0]
pandas version: 1.5.0
numpy version: 1.23.4
matplotlib version: 3.6.1
tqdm version: 4.64.1
sktime version: 0.21.0
xgboost version: 1.6.2
seaborn version: 0.12.0
scikit-learn version: 1.1.2
------------------------------------------------------------------------------


## 0. load the libararies

In [3]:
import matplotlib.pyplot as plt
from tqdm import tqdm
from sktime.forecasting.model_selection import temporal_train_test_split
from sktime.utils.plotting import plot_series
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBRegressor
import warnings
warnings.filterwarnings(action='ignore')
pd.set_option('display.max_columns', 30)

## 1. preprocessing the data

#####  - time series를 일반 regression 문제로 변환하기 위해 시간 관련 변수 추가(월 / 주 / 요일)
#####  - 전력소비량의 건물별 요일별 시간대별 평균 / 건물별 시간대별 평균 / 건물별 시간대별 표준편차 변수 추가
###### 건물별 요일별 시간대별 표준편차 / 건물별 평균 등 여러 통계량 생성 후 몇개 건물에 테스트, 최종적으로 성능 향상에 도움이 된 위 3개 변수만 추가
#####  - 공휴일 변수 추가
#####  - 시간(hour)는 cyclical encoding하여 변수 추가(sin time & cos time) 후 삭제
#####  - CDH(Cooling Degree Hour) & THI(불쾌지수) 변수 추가
##### - 건물별 모델 생성 시 무의미한 태양광 발전 시설 / 냉방시설 변수 삭제

In [4]:
# ['datetime', 'product', 'target', 'large', 'mid', 'small', 'brand', 'sale', 'brand_count', 'index', 'product_feature']

In [5]:
path = ''

In [6]:
train = pd.read_parquet(path + 'df_all2.parquet')
test = pd.read_parquet(path + 'test_all2.parquet')

In [7]:
train.drop(columns=['sale', 'brand_count'], inplace=True)
train['brand'] = train['brand'].astype('object')
test['brand'] = test['brand'].astype('object')

In [8]:
def process(df):
    date = pd.to_datetime(df.datetime)
    df['day'] = date.dt.day
    df['dow'] = date.dt.weekday
    df['month'] = date.dt.month
    df['week'] = date.dt.weekofyear
    df['date'] = date.dt.date.astype('str')

    ### 공휴일 변수 추가
    df['holiday'] = df.apply(lambda x : 0 if x['day']<5 else 1, axis = 1)
    special_days = ['2020-01-01', '2020-06-06', '2020-08-15', '2020-08-17', '2020-01-24', '2020-01-25', '2020-01-26', '2020-01-27', '2020-03-01', '2020-04-15', '2020-04-30', '2020-05-05', '2020-09-30', '2020-10-01', '2020-10-02', '2020-10-03', '2020-10-09', '2020-12-25',
                    '2021-01-01', '2021-02-11', '2021-02-12', '2021-02-13', '2021-03-01', '2021-05-05', '2021-05-19', '2021-06-06', '2021-08-15', '2021-09-20', '2021-09-21', '2021-09-22', '2021-10-03', '2021-10-09', '2021-12-25',
                    '2022-01-01', '2022-01-31', '2022-02-01', '2022-02-02', '2022-03-01', '2022-05-05', '2022-05-08', '2022-06-06', '2022-08-15', '2022-09-09', '2022-09-10', '2022-09-11', '2022-09-12', '2022-10-03', '2022-10-09',  '2022-10-10', '2022-12-25',
                    '2023-01-01', '2023-01-21', '2023-01-22', '2023-01-23', '2023-01-24', '2023-03-01', '2023-05-05', '2023-05-27'
                   ]
    df.loc[df.date.isin(special_days), 'holiday'] = 1
    df['holiday'] = df.apply(lambda x : 0 if x['dow']<5 else 1, axis = 1)

    df.drop(columns=['date', 'datetime'], inplace=True)
    return df

In [9]:
train = process(train)
test = process(test)

In [10]:
train.columns

Index(['target', 'product', 'large', 'mid', 'small', 'brand', 'day', 'dow',
       'month', 'week', 'holiday'],
      dtype='object')

In [11]:
cat_col = ['product', 'large', 'mid', 'small', 'brand', 'day', 'dow',
       'month', 'week', 'holiday']

for i in cat_col:
    train[i] = train[i].astype('category')
    test[i] = test[i].astype('category')

#### 모델은 시계열 데이터에 좋은 성능을 보이는 XGBoost를 선정했습니다.

In [12]:
# Define SMAPE loss function
def SMAPE(true, pred):
    return np.mean((np.abs(true-pred))/(np.abs(true) + np.abs(pred))) * 100

#### 아래와 같이 평가 Metric인 SMAPE는 실제값보다 작게 추정할 때 더 좋지 않습니다.
#### 이는 전력사용량을 높게 예측하는 것보다 작게 예측할 때 실제로 더 큰 문제가 될 수 있음을 반영한 것으로 보입니다.

In [13]:
print("실제값이 100일 때 50으로 underestimate할 때의 SMAPE : {}".format(SMAPE(100, 50)))
print("실제값이 100일 때 150으로 overestimate할 때의 SMAPE : {}".format(SMAPE(100, 150)))

실제값이 100일 때 50으로 underestimate할 때의 SMAPE : 33.33333333333333
실제값이 100일 때 150으로 overestimate할 때의 SMAPE : 20.0


#### 그러나 일반 mse를 objective function으로 훈련할 때 과소추정하는 건물들이 있음을 확인했습니다.
#### 이때문에 SMAPE 점수가 높아진다고 판단, 이를 해결하기 위해 아래와 같이 objective function을 새로 정의했습니다.
#### 새 목적함수는 residual이 0보다 클 때, 즉 실제값보다 낮게 추정할 때 alpha만큼의 가중치를 곱해 반영합니다.

#### XGBoost는 custom objective function으로 훈련하기 위해선 아래와 같이
#### gradient(1차 미분함수) / hessian(2차 미분함수)를 정의해 두 값을 return해주어야 합니다.

In [14]:
#### alpha를 argument로 받는 함수로 실제 objective function을 wrapping하여 alpha값을 쉽게 조정할 수 있도록 작성했습니다.
# custom objective function for forcing model not to underestimate
def weighted_mse(alpha = 1):
    def weighted_mse_fixed(label, pred):
        residual = (label - pred).astype("float")
        grad = np.where(residual>0, -2*alpha*residual, -2*residual)
        hess = np.where(residual>0, 2*alpha, 2.0)
        return grad, hess
    return weighted_mse_fixed

## 3. model tuning

#### 다른 parameter를 고정하지 않고 전체 parameter를 튜닝하고자 하면 매우 오래걸리기 때문에
#### 모델 내 hyperparameter들은 아래와 같이 sklearn의 gridsearchCV를 활용해 튜닝하고,
#### XGBoost의 early stopping 기능으로 n_estimators를 튜닝하고,
#### weighted_mse의 alpha값을 튜닝했습니다.

#### ***참고***
##### gridsearch 코드는 빠른 튜닝을 위해 NIPA서버를 활용해 gpu버전으로 튜닝하여 cpu버전으로 찾은 parameter와는 값이 다릅니다.
###### (gpu와 cpu버전의 bootstrap 과정 등의 차이에 기인하는 것으로 생각됩니다.)
##### 그러므로 gridsearchCV 코드는 제출하되 튜닝된 parameter를 csv로 첨부합니다.
##### 이후 모든 과정은 첨부된 csv에 저장된 paramter를 사용하여 훈련, 예측합니다.  

In [15]:
y = train[train['product'] == i]['target']
x = train[train['product'] == i].iloc[:, 1:]

In [16]:
from sklearn.model_selection import PredefinedSplit, GridSearchCV
preds = np.array([])
product_unique = train['product'].unique()
alpha = 25

for i in tqdm(product_unique):
    y = train[train['product'] == i]['target']
    x = train[train['product'] == i].iloc[:, 2:]
    pred_df = pd.DataFrame()
    for seed in [42]:

        y_train, y_test, x_train, x_test = temporal_train_test_split(y = y, X = x, test_size = 21)

        y_train = train[train['product'] == i]['target']
        x_train, x_test = train[train['product'] == i].iloc[:, 2:], test.loc[test['product'] == i, ].iloc[:,1:]
        x_test = x_test[x_train.columns]

        model = XGBRegressor(tree_method='gpu_hist', objective=weighted_mse(alpha), random_state=seed, enable_categorical=True)
        model.fit(x_train, y_train)
        y_pred = model.predict(x_test)
        pred_df.loc[:,seed] = y_pred
    pred = pred_df.mean(axis=1)
    preds = np.append(preds, pred)

100%|██████████| 15890/15890 [2:47:10<00:00,  1.58it/s]  


In [17]:
submission = pd.read_csv(path + 'new_sub.csv')
submission['answer'] = preds
submission.to_csv(path + 'xgboost_weight_mse.csv', index = False)

In [20]:
from sklearn.model_selection import PredefinedSplit, GridSearchCV
preds = np.array([])
product_unique = train['product'].unique()

for i in tqdm(product_unique):
    y = train[train['product'] == i]['target']
    x = train[train['product'] == i].iloc[:, 2:]
    pred_df = pd.DataFrame()
    for seed in [42]:
        def objective(trial):
            y_train, y_test, x_train, x_test = temporal_train_test_split(y = y, X = x, test_size = 21)

            params = {
            'tree_method':'gpu_hist',  # this parameter means using the GPU when training our model to speedup the training process
            "objective": "reg:squarederror",
            'lambda': trial.suggest_loguniform('lambda', 1e-3, 10.0),
            "alpha": trial.suggest_loguniform("alpha", 0.01, 10.0),
            "gamma": trial.suggest_loguniform("lambda", 1e-8, 10.0),
            'colsample_bytree': trial.suggest_categorical('colsample_bytree', [0.3,0.4,0.5,0.6,0.7,0.8,0.9, 1.0]),
            'subsample': trial.suggest_categorical('subsample', [0.4,0.5,0.6,0.7,0.8,1.0]),
            "learning_rate": trial.suggest_loguniform("learning_rate", 0.005, 0.05),
            'n_estimators':trial.suggest_int("n_estimators", 30, 10000),
            'max_depth': trial.suggest_int('max_depth', 5, 17),
            # 'random_state': trial.suggest_categorical('random_state', [2020]),
            'min_child_weight': trial.suggest_int('min_child_weight', 1, 300),
            'random_state': seed,
            }


            pds = PredefinedSplit(np.append(-np.ones(len(x)-21), np.zeros(21)))
            model = xgb.XGBRegressor(**params, enable_categorical=True)
            model.fit(x_train, y_train)

            pred = model.predict(x_test)

            smape = SMAPE(y_test, pred)
            return smape

        study = optuna.create_study(direction="minimize")
        study.optimize(objective, n_trials=20)

        params=study.best_params
        params['tree_method'] = 'gpu_hist'
        params['random_state'] = seed

        y_train, y_test, x_train, x_test = temporal_train_test_split(y = y, X = x, test_size = 21)

        y_train = train[train['product'] == i]['target']
        x_train = train[train['product'] == i].iloc[:, 3:], test.loc[test.num == i, ].iloc[:,1:]
        x_test = x_test[x_train.columns]

        model = XGBRegressor(**params, enable_categorical=True)
        model.fit(x_train, y_train)
        y_pred = model.predict(x_test)
        pred_df.loc[:,seed] = y_pred
    pred = pred_df.mean(axis=1)
    preds = np.append(preds, pred)

  0%|          | 0/15890 [00:00<?, ?it/s][I 2023-08-05 16:50:31,383] A new study created in memory with name: no-name-efb7022b-285e-4099-abf6-7d6bfad7874c
[I 2023-08-05 16:50:34,146] Trial 0 finished with value: 100.0 and parameters: {'lambda': 0.034697598392250735, 'alpha': 0.11038843922208386, 'colsample_bytree': 0.3, 'subsample': 0.5, 'learning_rate': 0.04343317395947322, 'n_estimators': 6910, 'max_depth': 8, 'min_child_weight': 148}. Best is trial 0 with value: 100.0.
[I 2023-08-05 16:50:36,369] Trial 1 finished with value: 100.0 and parameters: {'lambda': 5.9014601704204495, 'alpha': 0.014077300049123567, 'colsample_bytree': 1.0, 'subsample': 0.7, 'learning_rate': 0.017670034107356028, 'n_estimators': 5678, 'max_depth': 7, 'min_child_weight': 110}. Best is trial 0 with value: 100.0.
[I 2023-08-05 16:50:37,343] Trial 2 finished with value: 100.0 and parameters: {'lambda': 0.009807917471135588, 'alpha': 0.016806139929513554, 'colsample_bytree': 0.3, 'subsample': 0.4, 'learning_rate'

AttributeError: 'DataFrame' object has no attribute 'num'

In [None]:
submission = pd.read_csv(path + 'new_sub.csv')
submission['answer'] = preds
submission.to_csv(path + 'xgboost_optuna.csv', index = False)

In [None]:
preds = np.array([])
for i in tqdm(range(60)):

    pred_df = pd.DataFrame()   # 시드별 예측값을 담을 data frame

    for seed in [0,1,2,3,4,5]: # 각 시드별 예측
        y_train = train.loc[train.num == i+1, 'power']
        x_train, x_test = train.loc[train.num == i+1, ].iloc[:, 3:], test.loc[test.num == i+1, ].iloc[:,1:]
        x_test = x_test[x_train.columns]

        xgb = XGBRegressor(seed = seed, n_estimators = best_it[i], eta = 0.01,
                           min_child_weight = xgb_params.iloc[i, 2], max_depth = xgb_params.iloc[i, 3],
                           colsample_bytree=xgb_params.iloc[i, 4], subsample=xgb_params.iloc[i, 5])

        if xgb_params.iloc[i,6] != 0:  # 만약 alpha가 0이 아니면 weighted_mse 사용
            xgb.set_params(**{'objective':weighted_mse(xgb_params.iloc[i,6])})

        xgb.fit(x_train, y_train)
        y_pred = xgb.predict(x_test)
        pred_df.loc[:,seed] = y_pred   # 각 시드별 예측 담기

    pred = pred_df.mean(axis=1)        # (i+1)번째 건물의 예측 =  (i+1)번째 건물의 각 시드별 예측 평균값
    preds = np.append(preds, pred)

In [None]:
xgb_params = pd.read_csv('./parameters/hyperparameter_xgb.csv')

### find the bset iteration (given alpha = 100)

In [None]:
scores = []   # smape 값을 저장할 list
best_it = []  # best interation을 저장할 list
for i in tqdm(range(60)):
    y = train.loc[train.num == i+1, 'power']
    x = train.loc[train.num == i+1, ].iloc[:, 3:]
    y_train, y_valid, x_train, x_valid = temporal_train_test_split(y = y, X = x, test_size = 168)

    xgb_reg = XGBRegressor(n_estimators = 10000, eta = 0.01, min_child_weight = xgb_params.iloc[i, 2],
                           max_depth = xgb_params.iloc[i, 3], colsample_bytree = xgb_params.iloc[i, 4],
                           subsample = xgb_params.iloc[i, 5], seed=0)
    xgb_reg.set_params(**{'objective':weighted_mse(100)}) # alpha = 100으로 고정

    xgb_reg.fit(x_train, y_train, eval_set=[(x_train, y_train),
                                            (x_valid, y_valid)], early_stopping_rounds=300, verbose=False)
    y_pred = xgb_reg.predict(x_valid)
    pred = pd.Series(y_pred)

    sm = SMAPE(y_valid, y_pred)
    scores.append(sm)
    best_it.append(xgb_reg.best_iteration) ## 실제 best iteration은 이 값에 +1 해주어야 함.

### alpha tuning for weighted MSE

In [None]:
alpha_list = []
smape_list = []
for i in tqdm(range(60)):
    y = train.loc[train.num == i+1, 'power']
    x = train.loc[train.num == i+1, ].iloc[:, 3:]
    y_train, y_test, x_train, x_test = temporal_train_test_split(y = y, X = x, test_size = 168)
    xgb = XGBRegressor(seed = 0,
                      n_estimators = best_it[i], eta = 0.01, min_child_weight = xgb_params.iloc[i, 2],
                      max_depth = xgb_params.iloc[i, 3], colsample_bytree = xgb_params.iloc[i, 4], subsample = xgb_params.iloc[i, 5])

    xgb.fit(x_train, y_train)
    pred0 = xgb.predict(x_test)
    best_alpha = 0
    score0 = SMAPE(y_test,pred0)

    for j in [1, 3, 5, 7, 10, 25, 50, 75, 100]:
        xgb = XGBRegressor(seed = 0,
                      n_estimators = best_it[i], eta = 0.01, min_child_weight = xgb_params.iloc[i, 2],
                      max_depth = xgb_params.iloc[i, 3], colsample_bytree = xgb_params.iloc[i, 4], subsample = xgb_params.iloc[i, 5])
        xgb.set_params(**{'objective' : weighted_mse(j)})

        xgb.fit(x_train, y_train)
        pred1 = xgb.predict(x_test)
        score1 = SMAPE(y_test, pred1)
        if score1 < score0:
            best_alpha = j
            score0 = score1

    alpha_list.append(best_alpha)
    smape_list.append(score0)
    print("building {} || best score : {} || alpha : {}".format(i+1, score0, best_alpha))

In [None]:
no_df = pd.DataFrame({'score':smape_list})
plt.bar(np.arange(len(no_df))+1, no_df['score'])
plt.plot([1,60], [10, 10], color = 'red')

## 4. test inference

### preprocessing for test data

In [None]:
# train set과 동일한 전처리 과정
test = pd.read_csv('./data/test.csv', encoding = 'cp949')
cols = ['num', 'date_time', 'temp', 'wind','hum' ,'prec', 'sun', 'non_elec', 'solar']
test.columns = cols
date = pd.to_datetime(test.date_time)
test['hour'] = date.dt.hour
test['day'] = date.dt.weekday
test['month'] = date.dt.month
test['week'] = date.dt.weekofyear
test['sin_time'] = np.sin(2*np.pi*test.hour/24)
test['cos_time'] = np.cos(2*np.pi*test.hour/24)
test['holiday'] = test.apply(lambda x : 0 if x['day']<5 else 1, axis = 1)
test.loc[('2020-08-17'<=test.date_time)&(test.date_time<'2020-08-18'), 'holiday'] = 1

## 건물별 일별 시간별 발전량 평균
tqdm.pandas()
test['day_hour_mean'] = test.progress_apply(lambda x : power_mean.loc[(power_mean.num == x['num']) & (power_mean.day == x['day']) & (power_mean.hour == x['hour']) ,'power'].values[0], axis = 1)

## 건물별 시간별 발전량 평균 넣어주기
tqdm.pandas()
test['hour_mean'] = test.progress_apply(lambda x : power_hour_mean.loc[(power_hour_mean.num == x['num']) & (power_hour_mean.hour == x['hour']) ,'power'].values[0], axis = 1)

tqdm.pandas()
test['hour_std'] = test.progress_apply(lambda x : power_hour_std.loc[(power_hour_std.num == x['num']) & (power_hour_std.hour == x['hour']) ,'power'].values[0], axis = 1)

test.drop(['non_elec', 'solar','hour','date_time'], axis = 1, inplace = True)

# pandas 내 선형보간 method 사용
for i in range(60):
    test.iloc[i*168:(i+1)*168, :]  = test.iloc[i*168:(i+1)*168, :].interpolate()


test['THI'] = 9/5*test['temp'] - 0.55*(1-test['hum']/100)*(9/5*test['hum']-26)+32

cdhs = np.array([])
for num in range(1,61,1):
    temp = test[test['num'] == num]
    cdh = CDH(temp['temp'].values)
    cdhs = np.concatenate([cdhs, cdh])
test['CDH'] = cdhs

test = test[['num','temp', 'wind', 'hum', 'prec', 'sun', 'day', 'month', 'week',
       'day_hour_mean', 'hour_mean', 'hour_std', 'holiday', 'sin_time',
       'cos_time', 'THI', 'CDH']]
test.head()

In [None]:
xgb_params['alpha'] = alpha_list
xgb_params['best_it'] = best_it
xgb_params.head()

In [None]:
#xgb_params.to_csv('./hyperparameter_xgb_final.csv', index=False)

In [None]:
## best hyperparameters 불러오기
xgb_params = pd.read_csv('./parameters/hyperparameter_xgb_final.csv')
xgb_params.head()

In [None]:
best_it = xgb_params['best_it'].to_list()
best_it[0]        # 1051

### seed ensemble
#### - seed별로 예측값이 조금씩 바뀝니다.
#### - seed의 영향을 제거하기 위해 6개의 seed(0부터 5)별로 훈련, 예측하여 6개 예측값의 평균을 구했습니다.

In [None]:
preds = np.array([])
for i in tqdm(range(60)):

    pred_df = pd.DataFrame()   # 시드별 예측값을 담을 data frame

    for seed in [0,1,2,3,4,5]: # 각 시드별 예측
        y_train = train.loc[train.num == i+1, 'power']
        x_train, x_test = train.loc[train.num == i+1, ].iloc[:, 3:], test.loc[test.num == i+1, ].iloc[:,1:]
        x_test = x_test[x_train.columns]

        xgb = XGBRegressor(seed = seed, n_estimators = best_it[i], eta = 0.01,
                           min_child_weight = xgb_params.iloc[i, 2], max_depth = xgb_params.iloc[i, 3],
                           colsample_bytree=xgb_params.iloc[i, 4], subsample=xgb_params.iloc[i, 5])

        if xgb_params.iloc[i,6] != 0:  # 만약 alpha가 0이 아니면 weighted_mse 사용
            xgb.set_params(**{'objective':weighted_mse(xgb_params.iloc[i,6])})

        xgb.fit(x_train, y_train)
        y_pred = xgb.predict(x_test)
        pred_df.loc[:,seed] = y_pred   # 각 시드별 예측 담기

    pred = pred_df.mean(axis=1)        # (i+1)번째 건물의 예측 =  (i+1)번째 건물의 각 시드별 예측 평균값
    preds = np.append(preds, pred)

In [None]:
preds = pd.Series(preds)

fig, ax = plt.subplots(60, 1, figsize=(100,200), sharex = True)
ax = ax.flatten()
for i in range(60):
    train_y = train.loc[train.num == i+1, 'power'].reset_index(drop = True)
    test_y = preds[i*168:(i+1)*168]
    ax[i].scatter(np.arange(2040) , train.loc[train.num == i+1, 'power'])
    ax[i].scatter(np.arange(2040, 2040+168) , test_y)
    ax[i].tick_params(axis='both', which='major', labelsize=6)
    ax[i].tick_params(axis='both', which='minor', labelsize=4)
#plt.savefig('./predict_xgb.png')
plt.show()

In [None]:
submission = pd.read_csv('./data/sample_submission.csv')
submission['answer'] = preds
submission.to_csv('./submission/submission_xgb_noclip.csv', index = False)

## 5. post processing

#### weighted mse와 같은 맥락에서, 과도한 underestimate를 막기 위해 예측값을 후처리했습니다.
##### - 예측 주로부터 직전 4주(train set 마지막 28일)의 건물별 요일별 시간대별 전력소비량의 최솟값을 구한 뒤,
##### - test set의 같은 건물 요일 시간대의 예측값과 비교하여 만약 1번의 최솟값보다 예측값이 작다면 최솟값으로 예측값을 대체해주었습니다.
##### - public score 0.01 , private score 0.08 정도의 성능 향상이 있었습니다.

In [None]:
train_to_post = pd.read_csv('./data/train.csv', encoding = 'cp949')
cols = ['num', 'date_time', 'power', 'temp', 'wind','hum' ,'prec', 'sun', 'non_elec', 'solar']
train_to_post.columns = cols
date = pd.to_datetime(train_to_post.date_time)
train_to_post['hour'] = date.dt.hour
train_to_post['day'] = date.dt.weekday
train_to_post['month'] = date.dt.month
train_to_post['week'] = date.dt.weekofyear
train_to_post = train_to_post.loc[(('2020-08-17'>train_to_post.date_time)|(train_to_post.date_time>='2020-08-18')), ].reset_index(drop = True)

pred_clip = []
test_to_post = pd.read_csv('./data/test.csv',  encoding = 'cp949')
cols = ['num', 'date_time', 'temp', 'wind','hum' ,'prec', 'sun', 'non_elec', 'solar']
test_to_post.columns = cols
date = pd.to_datetime(test_to_post.date_time)
test_to_post['hour'] = date.dt.hour
test_to_post['day'] = date.dt.weekday
test_to_post['month'] = date.dt.month
test_to_post['week'] = date.dt.weekofyear

## submission 불러오기
df = pd.read_csv('./submission/submission_xgb_noclip.csv')
for i in range(60):
    min_data = train_to_post.loc[train_to_post.num == i+1, ].iloc[-28*24:, :] ## 건물별로 직전 28일의 데이터 불러오기
    ## 요일별, 시간대별 최솟값 계산
    min_data = pd.pivot_table(min_data, values = 'power', index = ['day', 'hour'], aggfunc = min).reset_index()
    pred = df.answer[168*i:168*(i+1)].reset_index(drop=True) ## 168개 데이터, 즉 건물별 예측값 불러오기
    day =  test_to_post.day[168*i:168*(i+1)].reset_index(drop=True) ## 예측값 요일 불러오기
    hour = test_to_post.hour[168*i:168*(i+1)].reset_index(drop=True) ## 예측값 시간 불러오기
    df_pred = pd.concat([pred, day, hour], axis = 1)
    df_pred.columns = ['pred', 'day', 'hour']
    for j in range(len(df_pred)):
        min_power = min_data.loc[(min_data.day == df_pred.day[j])&(min_data.hour == df_pred.hour[j]), 'power'].values[0]
        if df_pred.pred[j] < min_power:
            pred_clip.append(min_power)
        else:
            pred_clip.append(df_pred.pred[j])

##### 초록색으로 표시된 값이 원래의 예측값, 주황색이 후처리된 예측값입니다.
##### 변동이 거의 없는 건물도 있으나, 유의미하게 바뀐 건물도 확인됩니다.

In [None]:
pred_origin = df.answer
pred_clip = pd.Series(pred_clip)

for i in range(60):
    power = train_to_post.loc[train_to_post.num == i+1, 'power'].reset_index(drop=True)
    preds = pred_clip[i*168:(i+1)*168]
    preds_origin = pred_origin[i*168:(i+1)*168]
    preds.index = range(power.index[-1], power.index[-1]+168)
    preds_origin.index = range(power.index[-1], power.index[-1]+168)

    plot_series(power, preds,  preds_origin, markers = [',', ',', ','])

#### create submission file

In [None]:
submission = pd.read_csv('./data/sample_submission.csv')
submission['answer'] = pred_clip
submission.to_csv('./submission//submission_xgb_final.csv', index = False)