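# Competition Overview

- Task
    - Predict solar power generation for each time slot, using the weather data and past generation data of a region provided as an example.

- Prediction
    - The model takes 7 days of data (Day 0 to Day 6) as input and must predict generation (TARGET) at 30-minute intervals for the following 2 days (Day 7 to Day 8), i.e. 48 time steps per day and 96 in total.

- Rules
    - Metric: pinball loss over the quantiles 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9
    - No external data
    - No pre-trained models

- Evaluation
    - Stage 1 (public score): scored on a random 50% sample of the test data; visible during the competition
    - Stage 2 (private score): scored on the remaining test data; released right after the competition ends
    - Final evaluation: winners are chosen from the private score together with the submitted code and a Word (.docx) write-up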
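
The pinball loss named in the rules can be sketched as follows. This is a minimal illustration, not the official scorer: the helper names `pinball_loss` and `mean_pinball_loss` are hypothetical, and averaging uniformly over time steps and quantiles is an assumption about how the competition aggregates the metric.

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Mean pinball (quantile) loss for one quantile level q in (0, 1)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    diff = y_true - y_pred
    # Under-prediction (diff > 0) costs q * diff, over-prediction costs (1 - q) * |diff|.
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))

def mean_pinball_loss(y_true, preds_by_quantile):
    """Average the loss over quantile levels (assumed aggregation); maps q -> predictions."""
    return float(np.mean([pinball_loss(y_true, p, q) for q, p in preds_by_quantile.items()]))

y = [10.0, 20.0]
low_guess = [8.0, 18.0]  # both predictions are 2 kW too low
print(pinball_loss(y, low_guess, 0.9))  # a low guess is expensive for a high quantile
print(pinball_loss(y, low_guess, 0.1))  # and cheap for a low quantile
```

The asymmetry is the point: the same prediction is penalized nine times more at q = 0.9 than at q = 0.1 when it falls below the truth, which pushes each quantile model toward its own part of the distribution.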


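# Data Description

- Hour - hour of day
- Minute - minute
- DHI - Diffuse Horizontal Irradiance (W/m2)
- DNI - Direct Normal Irradiance (W/m2)
- WS - Wind Speed (m/s)
- RH - Relative Humidity (%)
- T - Temperature (Degree C)
- TARGET - solar power generation (kW)


- Diffuse irradiance (DHI): radiation that reaches the surface after being scattered in many directions by air molecules and suspended particles in the atmosphere.

- Direct irradiance (DNI): radiation that reaches the surface directly from the sun without being absorbed by water vapor or fine dust in the atmosphere. It is measured on a plane perpendicular to the sun's rays.
    - DNI varies strongly with the weather. On clear days DNI is usually larger than GHI, while on overcast days DNI is very small and GHI is much larger.

- Global irradiance (GHI): the sum of the direct and diffuse radiation incident on a horizontal surface.

- GHI = DNI * cos(zenith angle) + DHI, where zenith angle = 90° - solar elevation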
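
As a worked example of the GHI relation, here is a minimal sketch. The function name `ghi_from_components` and the input values are illustrative only, and clipping the cosine at zero when the sun is below the horizon is an assumption added here.

```python
import math

def ghi_from_components(dni, dhi, elevation_deg):
    """GHI = DNI * cos(zenith) + DHI, with zenith = 90 degrees - solar elevation."""
    zenith = math.radians(90.0 - elevation_deg)
    # Assumption: no direct contribution when the sun is below the horizon.
    return dni * max(math.cos(zenith), 0.0) + dhi

# Sun 30 degrees above the horizon: cos(60 degrees) = 0.5
print(round(ghi_from_components(dni=800.0, dhi=100.0, elevation_deg=30.0), 1))  # 500.0
```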

![title](./Reference/sun.PNG)

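# Insight

## EDA ideas
- Solar generation does not accumulate from one period to the next, so the values from the most recent day look the most important.
- The most recent day is needed
    - When predicting generation for tomorrow and the day after, the latest observations give the best available picture of the conditions on those days.
- The full 7 days are needed for values where the day-to-day change is not what matters but the typical level at each time slot is (values that should not change much)
    - Averaging over the 7 days feeds the model the 'average' generation for that time of day.
    - A single day is not enough because the weather is highly variable and generation swings with it.
- Since external data cannot be joined, as much information as possible has to be extracted from the data itself.
    - Candidate new variables include GHI, the sunrise time, and the cosine term derived from the solar elevation.

## Modeling ideas
The characteristics of the data and of the prediction task found during the analysis are summarized below.

**1. The forecast horizon is very long**
- Forecasting 96 half-hourly values as a time series from one week of data is unstable because the horizon is too long.

**2. Weather prediction matters.**
- Solar generation depends heavily on the weather, so weather is a very important variable.
- However, weather forecasting is hard even for the meteorological agency, and it is unlikely to be feasible from such sparse data.
- The model therefore has to be robust; avoiding overfitting is critical.

**3. The evaluation data differs substantially from the training data.**
- Even when the score on a validation set held out from the training data was good, the score after submission was poor.
- This appears to happen because the test data and the training set come from different years.
- Generation characteristics likely differ between years, so the validation set does not give a reliable evaluation either.

The main conclusion from these insights is that the model should be as robust as possible. The modeling options below were considered with this in mind.

**Modeling**
- 1. Time series
    - The generation trend over the preceding days and the trend on the following day were very different.
    - A univariate time series model is therefore not adequate.
    - A multivariate time series model is needed (but an ARMAX fit performed poorly).
- 2. Deep learning models (LSTM, GRU, ...)
    - Even when the validation loss was very small, the loss after submitting to Dacon was very large.
    - Machine learning models can be tuned in detail through hyperparameters, whereas deep learning models are hard to tune intuitively (for example, it is hard to tell what effect dropout will have).
    - With validation-based evaluation unreliable, using a deep learning model was judged impractical.
- 3. Regression
    - CatBoost and LightGBM (lgbm), machine learning models that support a quantile loss, were considered.
    - The two models performed about the same when fitted.
    - Ensembling them did not help much either, so a single lgbm model was chosen.
- 4. lgbm
    - Keep it as robust as possible.
    - Tuning focused on the parameters that guard against overfitting.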

# Module Import 

In [1]:
import pandas as pd
import numpy as np
import os
import glob
import random
import seaborn as sns
import math
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from platform import python_version

os.environ['PYTHONHASHSEED'] = str(42)
np.random.seed(42)
random.seed(42)
#import warnings
#warnings.filterwarnings("ignore")

In [2]:
print(f'python version: {python_version()}')
print(f'numpy version : {np.__version__}')
print(f'pandas version : {pd.__version__}')
print(f'seaborn version : {sns.__version__}')

python version: 3.7.4
numpy version : 1.19.1
pandas version : 1.1.1
seaborn version : 0.10.1


In [3]:
train = pd.read_csv('./data/train/train.csv')
df_train = train.copy()
submission = pd.read_csv('./data/sample_submission.csv')

# EDA

## data : per-slot 7-day average and difference

- Add the 7-day average for each time slot.
- Because TARGET is the variable being predicted, its day-over-day difference is also added as TARGET_diff.

In [4]:
def data(df):
    # Note: modifies df in place and also returns it.
    w7=[1/7,1/7,1/7,1/7,1/7,1/7,1/7]
    col = ['DHI','DNI','WS','RH','T','TARGET']
    temp = df.copy()
    df_7 = (w7[0]*temp[col] + 
            w7[1]*temp[col].shift(-48) + 
            w7[2]*temp[col].shift(-48*2) + 
            w7[3]*temp[col].shift(-48*3) + 
            w7[4]*temp[col].shift(-48*4) + 
            w7[5]*temp[col].shift(-48*5) + 
            w7[6]*temp[col].shift(-48*6)) # per-slot average over the 7-day window that ends at "today"
    df_1 = temp[col].shift(-48*6) # values at "today" (the last day of the window)
    df[col] = df_1 # overwrite the raw columns with today's values
    col2 = ['TARGET']
    df_diff = temp[col2].shift(-48*6) - temp[col2].shift(-48*5) # today minus the previous day
    df[['DHI7','DNI7','WS7','RH7','T7','TARGET7']] = df_7 # add the 7-day averages
    df[['TARGET_diff']] = df_diff # add the difference
    df.dropna(inplace=True) # the last 6 days have no complete window
    return(df)

In [5]:
data(df_train)

Unnamed: 0,Day,Hour,Minute,DHI,DNI,WS,RH,T,TARGET,DHI7,DNI7,WS7,RH7,T7,TARGET7,TARGET_diff
0,0,0,0,0.0,0.0,1.9,86.51,-2.0,0.0,0.0,0.0,1.800000,78.188571,-7.142857,0.0,0.0
1,0,0,30,0.0,0.0,1.8,86.54,-2.0,0.0,0.0,0.0,1.814286,77.395714,-7.000000,0.0,0.0
2,0,1,0,0.0,0.0,1.7,85.72,-3.0,0.0,0.0,0.0,1.871429,77.324286,-7.285714,0.0,0.0
3,0,1,30,0.0,0.0,1.4,85.73,-3.0,0.0,0.0,0.0,1.842857,76.440000,-7.285714,0.0,0.0
4,0,2,0,0.0,0.0,1.1,87.04,-4.0,0.0,0.0,0.0,1.842857,77.774286,-7.428571,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52267,1088,21,30,0.0,0.0,2.4,70.70,-4.0,0.0,0.0,0.0,2.428571,68.254286,-3.000000,0.0,0.0
52268,1088,22,0,0.0,0.0,2.4,66.79,-4.0,0.0,0.0,0.0,2.514286,66.628571,-3.285714,0.0,0.0
52269,1088,22,30,0.0,0.0,2.2,66.78,-4.0,0.0,0.0,0.0,2.471429,66.621429,-3.285714,0.0,0.0
52270,1088,23,0,0.0,0.0,2.1,67.72,-4.0,0.0,0.0,0.0,2.514286,67.664286,-3.571429,0.0,0.0

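The shift-based window used in `data` can be checked on a toy frame. This is a sketch under simplified assumptions: `SLOTS_PER_DAY` is set to 2 instead of 48 to keep the numbers readable, and the `toy` frame and its `Slot` column are made up for illustration.

```python
import numpy as np
import pandas as pd

SLOTS_PER_DAY = 2  # toy value; the notebook uses 48 half-hour slots
N_DAYS = 9

toy = pd.DataFrame({
    "Day": np.repeat(np.arange(N_DAYS), SLOTS_PER_DAY),
    "Slot": np.tile(np.arange(SLOTS_PER_DAY), N_DAYS),
})
toy["TARGET"] = toy["Day"] * 10.0 + toy["Slot"]

# Shifting by whole days lines up the same slot of 7 consecutive days in one row.
window = sum(toy["TARGET"].shift(-SLOTS_PER_DAY * k) for k in range(7)) / 7
today = toy["TARGET"].shift(-SLOTS_PER_DAY * 6)  # last day of the window

# Rows whose window would run past the end of the data become NaN and are dropped.
print(pd.DataFrame({"window_mean": window, "today": today}).dropna())
```

Row 0 averages slot 0 of days 0 to 6 (0, 10, ..., 60), so its window mean is 30 while its "today" value is 60. Only the first three days keep a complete window, which mirrors why `dropna` removes the last six days in the notebook.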

## hour : merge hour and minute

- Hour and Minute are split into two columns for no real benefit.
- Fold Minute into Hour as a fractional hour.

In [6]:
def hour(df):
    df['Hour'] = df['Hour'] + df['Minute']/60
    return(df)

In [7]:
hour(df_train) 

Unnamed: 0,Day,Hour,Minute,DHI,DNI,WS,RH,T,TARGET,DHI7,DNI7,WS7,RH7,T7,TARGET7,TARGET_diff
0,0,0.0,0,0.0,0.0,1.9,86.51,-2.0,0.0,0.0,0.0,1.800000,78.188571,-7.142857,0.0,0.0
1,0,0.5,30,0.0,0.0,1.8,86.54,-2.0,0.0,0.0,0.0,1.814286,77.395714,-7.000000,0.0,0.0
2,0,1.0,0,0.0,0.0,1.7,85.72,-3.0,0.0,0.0,0.0,1.871429,77.324286,-7.285714,0.0,0.0
3,0,1.5,30,0.0,0.0,1.4,85.73,-3.0,0.0,0.0,0.0,1.842857,76.440000,-7.285714,0.0,0.0
4,0,2.0,0,0.0,0.0,1.1,87.04,-4.0,0.0,0.0,0.0,1.842857,77.774286,-7.428571,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52267,1088,21.5,30,0.0,0.0,2.4,70.70,-4.0,0.0,0.0,0.0,2.428571,68.254286,-3.000000,0.0,0.0
52268,1088,22.0,0,0.0,0.0,2.4,66.79,-4.0,0.0,0.0,0.0,2.514286,66.628571,-3.285714,0.0,0.0
52269,1088,22.5,30,0.0,0.0,2.2,66.78,-4.0,0.0,0.0,0.0,2.471429,66.621429,-3.285714,0.0,0.0
52270,1088,23.0,0,0.0,0.0,2.1,67.72,-4.0,0.0,0.0,0.0,2.514286,67.664286,-3.571429,0.0,0.0


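## Daymean : averages over the 7 days

- Knowing the season would tell us the amount of sunlight almost directly, so it is very useful information.
- The only usable variable for this is temperature (T), so it was used to express seasonality as well as possible.
- Averaging over all time slots of the week greatly reduces the variance caused by weather differences, and it gives low values in winter and high values in summer, so it can serve as a rough proxy for the daily amount of sunlight.
- The corresponding means of TARGET were added as well.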
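
The `groupby(...).transform('mean')` pattern behind these features broadcasts one group-level mean back onto every row of the group. A minimal sketch on made-up values (the `toy` frame is illustrative only):

```python
import pandas as pd

toy = pd.DataFrame({
    "Day": [0, 0, 1, 1],
    "T7": [-4.0, -2.0, 20.0, 24.0],  # per-slot 7-day mean temperature
})

# transform keeps the original row count, unlike agg, which returns one row per Day.
toy["Weekmean_T"] = toy.groupby("Day")["T7"].transform("mean")
print(toy)
```

Every row of a day ends up with the same value, so the feature acts as a day-level seasonality signal rather than a per-slot one.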

In [8]:
def Daymean(df) :
    df['Weekmean_T']= df['T7'].groupby(df['Day']).transform('mean')
    df['Weekmean_TARGET'] = df['TARGET7'].groupby(df['Day']).transform('mean')
    df['Daymean_TARGET'] = df['TARGET'].groupby(df['Day']).transform('mean')
    return(df)

In [9]:
Daymean(df_train)

Unnamed: 0,Day,Hour,Minute,DHI,DNI,WS,RH,T,TARGET,DHI7,DNI7,WS7,RH7,T7,TARGET7,TARGET_diff,Weekmean_T,Weekmean_TARGET,Daymean_TARGET
0,0,0.0,0,0.0,0.0,1.9,86.51,-2.0,0.0,0.0,0.0,1.800000,78.188571,-7.142857,0.0,0.0,-3.500000,6.573162,5.589715
1,0,0.5,30,0.0,0.0,1.8,86.54,-2.0,0.0,0.0,0.0,1.814286,77.395714,-7.000000,0.0,0.0,-3.500000,6.573162,5.589715
2,0,1.0,0,0.0,0.0,1.7,85.72,-3.0,0.0,0.0,0.0,1.871429,77.324286,-7.285714,0.0,0.0,-3.500000,6.573162,5.589715
3,0,1.5,30,0.0,0.0,1.4,85.73,-3.0,0.0,0.0,0.0,1.842857,76.440000,-7.285714,0.0,0.0,-3.500000,6.573162,5.589715
4,0,2.0,0,0.0,0.0,1.1,87.04,-4.0,0.0,0.0,0.0,1.842857,77.774286,-7.428571,0.0,0.0,-3.500000,6.573162,5.589715
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52267,1088,21.5,30,0.0,0.0,2.4,70.70,-4.0,0.0,0.0,0.0,2.428571,68.254286,-3.000000,0.0,0.0,-0.910714,9.479379,8.710275
52268,1088,22.0,0,0.0,0.0,2.4,66.79,-4.0,0.0,0.0,0.0,2.514286,66.628571,-3.285714,0.0,0.0,-0.910714,9.479379,8.710275
52269,1088,22.5,30,0.0,0.0,2.2,66.78,-4.0,0.0,0.0,0.0,2.471429,66.621429,-3.285714,0.0,0.0,-0.910714,9.479379,8.710275
52270,1088,23.0,0,0.0,0.0,2.1,67.72,-4.0,0.0,0.0,0.0,2.514286,67.664286,-3.571429,0.0,0.0,-0.910714,9.479379,8.710275


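## sunrise : approximating sunrise

- Once the sunrise time is known, it can be used to estimate the solar declination.
- Once the declination is known, the solar elevation can be approximated.
- Solar elevation is a very important variable for solar power, so deriving the Sunrise variable matters a lot.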

In [10]:
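# It is very unlikely that bad weather alone keeps the sun from showing up on time for 7 days in a row.
# So a time slot where TARGET7 is exactly 0 can be treated as a slot with no sun.
# The sunrise time can be determined from that.
def sunrise(df):
    for i in range(0,df.shape[0]):
        if (df.loc[i,'TARGET7'] == 0) & (df.loc[i,'Hour'] < 9):
            df.loc[i,'sun'] = df.loc[i,'Hour']
    df['sun'] = df['sun'].fillna(0) # assign back instead of calling fillna(inplace=True) on a .loc slice
    df['Sunrise'] = df['sun'].groupby(df['Day']).transform('max') # last sunless morning slot of each day
    df.drop(columns=['sun'],inplace=True)
    return(df) # sunrise can be assumed to occur within 30 minutes after the recorded time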

In [11]:
sunrise(df_train)

Unnamed: 0,Day,Hour,Minute,DHI,DNI,WS,RH,T,TARGET,DHI7,DNI7,WS7,RH7,T7,TARGET7,TARGET_diff,Weekmean_T,Weekmean_TARGET,Daymean_TARGET,Sunrise
0,0,0.0,0,0.0,0.0,1.9,86.51,-2.0,0.0,0.0,0.0,1.800000,78.188571,-7.142857,0.0,0.0,-3.500000,6.573162,5.589715,7.5
1,0,0.5,30,0.0,0.0,1.8,86.54,-2.0,0.0,0.0,0.0,1.814286,77.395714,-7.000000,0.0,0.0,-3.500000,6.573162,5.589715,7.5
2,0,1.0,0,0.0,0.0,1.7,85.72,-3.0,0.0,0.0,0.0,1.871429,77.324286,-7.285714,0.0,0.0,-3.500000,6.573162,5.589715,7.5
3,0,1.5,30,0.0,0.0,1.4,85.73,-3.0,0.0,0.0,0.0,1.842857,76.440000,-7.285714,0.0,0.0,-3.500000,6.573162,5.589715,7.5
4,0,2.0,0,0.0,0.0,1.1,87.04,-4.0,0.0,0.0,0.0,1.842857,77.774286,-7.428571,0.0,0.0,-3.500000,6.573162,5.589715,7.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52267,1088,21.5,30,0.0,0.0,2.4,70.70,-4.0,0.0,0.0,0.0,2.428571,68.254286,-3.000000,0.0,0.0,-0.910714,9.479379,8.710275,7.5
52268,1088,22.0,0,0.0,0.0,2.4,66.79,-4.0,0.0,0.0,0.0,2.514286,66.628571,-3.285714,0.0,0.0,-0.910714,9.479379,8.710275,7.5
52269,1088,22.5,30,0.0,0.0,2.2,66.78,-4.0,0.0,0.0,0.0,2.471429,66.621429,-3.285714,0.0,0.0,-0.910714,9.479379,8.710275,7.5
52270,1088,23.0,0,0.0,0.0,2.1,67.72,-4.0,0.0,0.0,0.0,2.514286,67.664286,-3.571429,0.0,0.0,-0.910714,9.479379,8.710275,7.5

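The row-by-row loop in `sunrise` can also be written without a Python loop. The sketch below is an alternative formulation, not the code used for the submission; `sunrise_vectorized` and its `morning_cutoff` parameter are hypothetical names, and the toy values are made up.

```python
import pandas as pd

def sunrise_vectorized(df, morning_cutoff=9):
    """Per day, take the latest morning slot whose 7-day mean TARGET is exactly 0."""
    dark_morning = (df["TARGET7"] == 0) & (df["Hour"] < morning_cutoff)
    # Keep Hour where the slot is dark, 0 elsewhere (same role as fillna(0) in the loop version).
    last_dark = df["Hour"].where(dark_morning, 0)
    out = df.copy()
    out["Sunrise"] = last_dark.groupby(df["Day"]).transform("max")
    return out

toy = pd.DataFrame({
    "Day": [0, 0, 0, 0, 1, 1, 1, 1],
    "Hour": [6.5, 7.0, 7.5, 8.0, 4.5, 5.0, 5.5, 6.0],
    "TARGET7": [0.0, 0.0, 0.0, 1.2, 0.0, 0.0, 0.8, 2.1],
})
result = sunrise_vectorized(toy)
print(result)
```

Unlike the loop, this version does not depend on the frame having a contiguous 0..n-1 index, and it returns a copy instead of modifying the input.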

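## GHI : computing GHI and cos

- GHI = DNI * cos(zenith angle) + DHI, where zenith angle = 90° - solar elevation.
- Computing it requires an estimate of the solar elevation.
- The elevation is estimated from the approximate sunrise time as follows.

- Declination from sunrise time
    - Sunrise is at about 5:10 on the summer solstice, about 7:40 on the winter solstice, and about 6:30 at the spring and autumn equinoxes.
    - So when the last sunless slot is around 5, the date is near the summer solstice (declination 23.5 degrees)
    - 4.5 -> 21 degrees
    - 5.0 -> 14 degrees
    - 5.5 -> 7 degrees
    - 6.0 -> 0 degrees
    - 6.5 -> -7 degrees
    - 7.0 -> -14 degrees
    - 7.5 -> -21 degrees
    - Approximating the declination this way gives the formula (6-x)*14 in degrees, where x is the Sunrise variable.
- Solar elevation
    - Formula: sin(b) = sin(a)sin(l) + cos(a)cos(l)cos(h)
        - b : solar elevation angle
        - a : solar declination
        - l : latitude (set to 37 degrees)
        - h : hour angle for the time of day
- The elevation is approximated with this formula and the sunrise-based declination.
- In addition, the 7-day average GHI and the cosine of the zenith angle (column cos) were added as features.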
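
The two approximations above can be checked numerically. The sketch below evaluates the elevation formula with the linear declination rule (6 - Sunrise) * 14; the function name `solar_elevation_deg` is hypothetical, and treating 12:00 clock time as solar noon is the same simplification the formula makes with h = Hour * 15 - 180.

```python
import numpy as np

def solar_elevation_deg(hour, sunrise, latitude_deg=37.0):
    """Solar elevation b from sin(b) = sin(a)sin(l) + cos(a)cos(l)cos(h)."""
    declination = np.radians((6.0 - sunrise) * 14.0)  # a, from the linear sunrise rule
    lat = np.radians(latitude_deg)                    # l
    hour_angle = np.radians(hour * 15.0 - 180.0)      # h, zero at 12:00
    sin_b = (np.sin(declination) * np.sin(lat)
             + np.cos(declination) * np.cos(lat) * np.cos(hour_angle))
    return np.degrees(np.arcsin(sin_b))

print(round(float(solar_elevation_deg(12.0, 6.0)), 1))  # 53.0, equinox-like day
print(round(float(solar_elevation_deg(12.0, 7.5)), 1))  # 32.0, winter-like day
print(float(solar_elevation_deg(0.0, 6.0)) < 0)         # True, sun below the horizon at midnight
```

At noon the formula reduces to b = 90 - |latitude - declination|, which gives 53 degrees for a declination of 0 and 32 degrees for a declination of -21.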

In [12]:
def GHI(df):
    h = df['Hour']
    rad = math.pi/180
    l = 37 * rad # latitude of Korea
    a = (6-df['Sunrise']) * 14 * rad # approximate the declination from the sunrise time
    b = np.arcsin(np.sin(a)*np.sin(l) + np.cos(a)*np.cos(l)*np.cos((h*15-180)*(np.pi/180))) # b is the solar elevation
    df['cos'] =  np.cos((np.pi/2)-b) # cosine of the zenith angle
    df.loc[df['cos'] < 0,['cos']] = 0 # sun below the horizon
    df['GHI7']= df['DNI7'] * np.cos((np.pi/2)-b) + df['DHI7'] # GHI formula
    return(df) 

In [13]:
GHI(df_train)

Unnamed: 0,Day,Hour,Minute,DHI,DNI,WS,RH,T,TARGET,DHI7,...,RH7,T7,TARGET7,TARGET_diff,Weekmean_T,Weekmean_TARGET,Daymean_TARGET,Sunrise,cos,GHI7
0,0,0.0,0,0.0,0.0,1.9,86.51,-2.0,0.0,0.0,...,78.188571,-7.142857,0.0,0.0,-3.500000,6.573162,5.589715,7.5,0.0,0.0
1,0,0.5,30,0.0,0.0,1.8,86.54,-2.0,0.0,0.0,...,77.395714,-7.000000,0.0,0.0,-3.500000,6.573162,5.589715,7.5,0.0,0.0
2,0,1.0,0,0.0,0.0,1.7,85.72,-3.0,0.0,0.0,...,77.324286,-7.285714,0.0,0.0,-3.500000,6.573162,5.589715,7.5,0.0,0.0
3,0,1.5,30,0.0,0.0,1.4,85.73,-3.0,0.0,0.0,...,76.440000,-7.285714,0.0,0.0,-3.500000,6.573162,5.589715,7.5,0.0,0.0
4,0,2.0,0,0.0,0.0,1.1,87.04,-4.0,0.0,0.0,...,77.774286,-7.428571,0.0,0.0,-3.500000,6.573162,5.589715,7.5,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52267,1088,21.5,30,0.0,0.0,2.4,70.70,-4.0,0.0,0.0,...,68.254286,-3.000000,0.0,0.0,-0.910714,9.479379,8.710275,7.5,0.0,0.0
52268,1088,22.0,0,0.0,0.0,2.4,66.79,-4.0,0.0,0.0,...,66.628571,-3.285714,0.0,0.0,-0.910714,9.479379,8.710275,7.5,0.0,0.0
52269,1088,22.5,30,0.0,0.0,2.2,66.78,-4.0,0.0,0.0,...,66.621429,-3.285714,0.0,0.0,-0.910714,9.479379,8.710275,7.5,0.0,0.0
52270,1088,23.0,0,0.0,0.0,2.1,67.72,-4.0,0.0,0.0,...,67.664286,-3.571429,0.0,0.0,-0.910714,9.479379,8.710275,7.5,0.0,0.0


## Delete : drop Day and Minute

In [14]:
df_train.drop(columns = ['Day','Minute'],inplace=True)

## Add the targets 1 and 2 days ahead

In [15]:
df_train['y_1day'] = train['TARGET'].shift(-48*7) # each row's features describe the day 6 days after its original position, so a 7-day shift is the value 1 day ahead
df_train['y_2day'] = train['TARGET'].shift(-48*8) # an 8-day shift is the value 2 days ahead
df_train = df_train.iloc[:-96] # the last 2 days have no true future values to predict, so drop them
df_train

Unnamed: 0,Hour,DHI,DNI,WS,RH,T,TARGET,DHI7,DNI7,WS7,...,TARGET7,TARGET_diff,Weekmean_T,Weekmean_TARGET,Daymean_TARGET,Sunrise,cos,GHI7,y_1day,y_2day
0,0.0,0.0,0.0,1.9,86.51,-2.0,0.0,0.0,0.0,1.800000,...,0.0,0.0,-3.500000,6.573162,5.589715,7.5,0.0,0.0,0.0,0.0
1,0.5,0.0,0.0,1.8,86.54,-2.0,0.0,0.0,0.0,1.814286,...,0.0,0.0,-3.500000,6.573162,5.589715,7.5,0.0,0.0,0.0,0.0
2,1.0,0.0,0.0,1.7,85.72,-3.0,0.0,0.0,0.0,1.871429,...,0.0,0.0,-3.500000,6.573162,5.589715,7.5,0.0,0.0,0.0,0.0
3,1.5,0.0,0.0,1.4,85.73,-3.0,0.0,0.0,0.0,1.842857,...,0.0,0.0,-3.500000,6.573162,5.589715,7.5,0.0,0.0,0.0,0.0
4,2.0,0.0,0.0,1.1,87.04,-4.0,0.0,0.0,0.0,1.842857,...,0.0,0.0,-3.500000,6.573162,5.589715,7.5,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52171,21.5,0.0,0.0,3.5,55.97,-1.0,0.0,0.0,0.0,2.557143,...,0.0,0.0,-1.889881,9.590381,8.875729,7.5,0.0,0.0,0.0,0.0
52172,22.0,0.0,0.0,3.9,54.23,-2.0,0.0,0.0,0.0,2.557143,...,0.0,0.0,-1.889881,9.590381,8.875729,7.5,0.0,0.0,0.0,0.0
52173,22.5,0.0,0.0,4.1,54.21,-2.0,0.0,0.0,0.0,2.485714,...,0.0,0.0,-1.889881,9.590381,8.875729,7.5,0.0,0.0,0.0,0.0
52174,23.0,0.0,0.0,4.3,56.46,-2.0,0.0,0.0,0.0,2.485714,...,0.0,0.0,-1.889881,9.590381,8.875729,7.5,0.0,0.0,0.0,0.0


## Applying to the test set

In [18]:
# Apply the same transformation functions to a test file
def preprocess_Test(df):
    temp = df.copy() # a copy is needed here to avoid errors
    temp = temp[['Day','Hour','Minute', 'DHI', 'DNI', 'WS', 'RH', 'T','TARGET']]
    data(temp)
    hour(temp)
    Daymean(temp)
    sunrise(temp)
    GHI(temp)
    return temp.iloc[-48:, :] 
# -48 keeps only one day of rows.
# The regression uses only the features of the latest day to predict the following day 1 and day 2.

In [19]:
# load each test file and apply the feature engineering
df_test = []
for i in range(81):
    file_path = './data/test/' + str(i) + '.csv'
    temp = pd.read_csv(file_path)
    temp = preprocess_Test(temp)
    df_test.append(temp) # collect the processed frame for each file

In [20]:
X_test = pd.concat(df_test)
X_test.shape

(3888, 22)

In [21]:
# drop the unneeded Day / Minute columns
X_test.drop(columns = ['Day','Minute'],inplace=True)

In [22]:
df_train

Unnamed: 0,Hour,DHI,DNI,WS,RH,T,TARGET,DHI7,DNI7,WS7,...,TARGET7,TARGET_diff,Weekmean_T,Weekmean_TARGET,Daymean_TARGET,Sunrise,cos,GHI7,y_1day,y_2day
0,0.0,0.0,0.0,1.9,86.51,-2.0,0.0,0.0,0.0,1.800000,...,0.0,0.0,-3.500000,6.573162,5.589715,7.5,0.0,0.0,0.0,0.0
1,0.5,0.0,0.0,1.8,86.54,-2.0,0.0,0.0,0.0,1.814286,...,0.0,0.0,-3.500000,6.573162,5.589715,7.5,0.0,0.0,0.0,0.0
2,1.0,0.0,0.0,1.7,85.72,-3.0,0.0,0.0,0.0,1.871429,...,0.0,0.0,-3.500000,6.573162,5.589715,7.5,0.0,0.0,0.0,0.0
3,1.5,0.0,0.0,1.4,85.73,-3.0,0.0,0.0,0.0,1.842857,...,0.0,0.0,-3.500000,6.573162,5.589715,7.5,0.0,0.0,0.0,0.0
4,2.0,0.0,0.0,1.1,87.04,-4.0,0.0,0.0,0.0,1.842857,...,0.0,0.0,-3.500000,6.573162,5.589715,7.5,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52171,21.5,0.0,0.0,3.5,55.97,-1.0,0.0,0.0,0.0,2.557143,...,0.0,0.0,-1.889881,9.590381,8.875729,7.5,0.0,0.0,0.0,0.0
52172,22.0,0.0,0.0,3.9,54.23,-2.0,0.0,0.0,0.0,2.557143,...,0.0,0.0,-1.889881,9.590381,8.875729,7.5,0.0,0.0,0.0,0.0
52173,22.5,0.0,0.0,4.1,54.21,-2.0,0.0,0.0,0.0,2.485714,...,0.0,0.0,-1.889881,9.590381,8.875729,7.5,0.0,0.0,0.0,0.0
52174,23.0,0.0,0.0,4.3,56.46,-2.0,0.0,0.0,0.0,2.485714,...,0.0,0.0,-1.889881,9.590381,8.875729,7.5,0.0,0.0,0.0,0.0


In [23]:
X_test

Unnamed: 0,Hour,DHI,DNI,WS,RH,T,TARGET,DHI7,DNI7,WS7,RH7,T7,TARGET7,TARGET_diff,Weekmean_T,Weekmean_TARGET,Daymean_TARGET,Sunrise,cos,GHI7
0,0.0,0.0,0.0,0.8,80.92,-2.8,0.0,0.0,0.0,1.728571,50.244286,-0.914286,0.0,0.0,0.638988,7.290951,3.280946,7.0,0.0,0.0
1,0.5,0.0,0.0,0.9,81.53,-2.9,0.0,0.0,0.0,1.742857,50.894286,-1.071429,0.0,0.0,0.638988,7.290951,3.280946,7.0,0.0,0.0
2,1.0,0.0,0.0,1.0,79.91,-3.0,0.0,0.0,0.0,1.757143,50.887143,-1.214286,0.0,0.0,0.638988,7.290951,3.280946,7.0,0.0,0.0
3,1.5,0.0,0.0,0.9,79.91,-3.0,0.0,0.0,0.0,1.742857,51.135714,-1.285714,0.0,0.0,0.638988,7.290951,3.280946,7.0,0.0,0.0
4,2.0,0.0,0.0,0.9,77.20,-3.0,0.0,0.0,0.0,1.771429,51.441429,-1.357143,0.0,0.0,0.638988,7.290951,3.280946,7.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
43,21.5,0.0,0.0,0.8,63.35,13.7,0.0,0.0,0.0,0.857143,68.795714,11.171429,0.0,0.0,14.960119,25.312016,23.704122,4.0,0.0,0.0
44,22.0,0.0,0.0,0.7,64.82,13.1,0.0,0.0,0.0,0.900000,69.504286,10.657143,0.0,0.0,14.960119,25.312016,23.704122,4.0,0.0,0.0
45,22.5,0.0,0.0,0.7,66.10,12.8,0.0,0.0,0.0,0.985714,71.531429,10.228571,0.0,0.0,14.960119,25.312016,23.704122,4.0,0.0,0.0
46,23.0,0.0,0.0,0.6,67.64,12.4,0.0,0.0,0.0,1.042857,72.258571,9.785714,0.0,0.0,14.960119,25.312016,23.704122,4.0,0.0,0.0


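# Modeling

- Hyperparameters were chosen based on the test score and the scores of submissions to Dacon.
- No validation set is split off in this notebook. n_estimators was kept very small to avoid overfitting, so early stopping was not needed.
- The hyperparameters were therefore tuned roughly on a split made in a separate file, and the final predictions were produced by training on all of the data.

- boosting ='dart' 
    - dart was chosen to reduce overfitting and improve predictive performance.
- drop_rate = 0.2,
    - Set slightly above the default of 0.1. Values that were too high also predicted poorly.
- n_estimators= 300,
    - Only 300 trees, to avoid fitting the training data too closely.
- learning_rate= 0.3,
- min_data_in_leaf = 1000,
    - A split is made only if each resulting leaf keeps at least 1000 samples, which improves robustness. Personally, this was the most influential parameter.
- bagging_fraction = 0.8 
- max_depth = 5,
    - Depth is capped at 5 to prevent deep splits.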

In [24]:
X_train_1 = df_train.iloc[:, :-2] 
Y_train_1 = df_train.iloc[:, -2]
X_train_2 = df_train.iloc[:, :-2]
Y_train_2 = df_train.iloc[:, -1]

In [25]:
X_train_1.head()

Unnamed: 0,Hour,DHI,DNI,WS,RH,T,TARGET,DHI7,DNI7,WS7,RH7,T7,TARGET7,TARGET_diff,Weekmean_T,Weekmean_TARGET,Daymean_TARGET,Sunrise,cos,GHI7
0,0.0,0.0,0.0,1.9,86.51,-2.0,0.0,0.0,0.0,1.8,78.188571,-7.142857,0.0,0.0,-3.5,6.573162,5.589715,7.5,0.0,0.0
1,0.5,0.0,0.0,1.8,86.54,-2.0,0.0,0.0,0.0,1.814286,77.395714,-7.0,0.0,0.0,-3.5,6.573162,5.589715,7.5,0.0,0.0
2,1.0,0.0,0.0,1.7,85.72,-3.0,0.0,0.0,0.0,1.871429,77.324286,-7.285714,0.0,0.0,-3.5,6.573162,5.589715,7.5,0.0,0.0
3,1.5,0.0,0.0,1.4,85.73,-3.0,0.0,0.0,0.0,1.842857,76.44,-7.285714,0.0,0.0,-3.5,6.573162,5.589715,7.5,0.0,0.0
4,2.0,0.0,0.0,1.1,87.04,-4.0,0.0,0.0,0.0,1.842857,77.774286,-7.428571,0.0,0.0,-3.5,6.573162,5.589715,7.5,0.0,0.0


In [26]:
X_train_2.head()

Unnamed: 0,Hour,DHI,DNI,WS,RH,T,TARGET,DHI7,DNI7,WS7,RH7,T7,TARGET7,TARGET_diff,Weekmean_T,Weekmean_TARGET,Daymean_TARGET,Sunrise,cos,GHI7
0,0.0,0.0,0.0,1.9,86.51,-2.0,0.0,0.0,0.0,1.8,78.188571,-7.142857,0.0,0.0,-3.5,6.573162,5.589715,7.5,0.0,0.0
1,0.5,0.0,0.0,1.8,86.54,-2.0,0.0,0.0,0.0,1.814286,77.395714,-7.0,0.0,0.0,-3.5,6.573162,5.589715,7.5,0.0,0.0
2,1.0,0.0,0.0,1.7,85.72,-3.0,0.0,0.0,0.0,1.871429,77.324286,-7.285714,0.0,0.0,-3.5,6.573162,5.589715,7.5,0.0,0.0
3,1.5,0.0,0.0,1.4,85.73,-3.0,0.0,0.0,0.0,1.842857,76.44,-7.285714,0.0,0.0,-3.5,6.573162,5.589715,7.5,0.0,0.0
4,2.0,0.0,0.0,1.1,87.04,-4.0,0.0,0.0,0.0,1.842857,77.774286,-7.428571,0.0,0.0,-3.5,6.573162,5.589715,7.5,0.0,0.0


In [27]:
X_test.head()

Unnamed: 0,Hour,DHI,DNI,WS,RH,T,TARGET,DHI7,DNI7,WS7,RH7,T7,TARGET7,TARGET_diff,Weekmean_T,Weekmean_TARGET,Daymean_TARGET,Sunrise,cos,GHI7
0,0.0,0.0,0.0,0.8,80.92,-2.8,0.0,0.0,0.0,1.728571,50.244286,-0.914286,0.0,0.0,0.638988,7.290951,3.280946,7.0,0.0,0.0
1,0.5,0.0,0.0,0.9,81.53,-2.9,0.0,0.0,0.0,1.742857,50.894286,-1.071429,0.0,0.0,0.638988,7.290951,3.280946,7.0,0.0,0.0
2,1.0,0.0,0.0,1.0,79.91,-3.0,0.0,0.0,0.0,1.757143,50.887143,-1.214286,0.0,0.0,0.638988,7.290951,3.280946,7.0,0.0,0.0
3,1.5,0.0,0.0,0.9,79.91,-3.0,0.0,0.0,0.0,1.742857,51.135714,-1.285714,0.0,0.0,0.638988,7.290951,3.280946,7.0,0.0,0.0
4,2.0,0.0,0.0,0.9,77.2,-3.0,0.0,0.0,0.0,1.771429,51.441429,-1.357143,0.0,0.0,0.638988,7.290951,3.280946,7.0,0.0,0.0


## LGB

In [28]:
quantiles = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

In [29]:
from lightgbm import LGBMRegressor
def LGBM(q, X_train, Y_train, X_test):
    model = LGBMRegressor(objective='quantile', # quantile regression, fitted once per quantile from 0.1 to 0.9
                          boosting ='dart',
                          alpha=q, # with the quantile objective, alpha sets the quantile to estimate
                          drop_rate = 0.2,
                          n_estimators= 300,
                          learning_rate= 0.3,
                          min_data_in_leaf = 1000,
                          bagging_fraction = 0.8,
                          max_depth = 5,
                          seed = 42)
    # No eval_set is passed, so there is nothing to log per iteration;
    # newer LightGBM releases no longer accept fit(verbose=...).
    model.fit(X_train, Y_train)
    pred = pd.Series(model.predict(X_test).round(2))
    return pred, model

In [38]:
# Predict the target for every quantile
def train_data(X_train, Y_train, X_test):
    LGBM_models=[]
    LGBM_actual_pred = pd.DataFrame()
    for q in quantiles:
        print(q)
        pred , model = LGBM(q, X_train, Y_train, X_test)
        LGBM_models.append(model)
        LGBM_actual_pred = pd.concat([LGBM_actual_pred,pred],axis=1)
    LGBM_actual_pred.columns=quantiles
    return LGBM_models, LGBM_actual_pred

In [39]:
# Target1
# Target1 is the prediction for one day ahead.
models_1, results_1 = train_data(X_train_1, Y_train_1, X_test)
results_1.sort_index()[:48]

0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9


Unnamed: 0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9
0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.02,0.17
1,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.02,0.17
2,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.02,0.17
3,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.02,0.17
4,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.02,0.17
5,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.02,0.17
6,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.02,0.17
7,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.02,0.17
8,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.02,0.17
9,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.02,0.17


In [40]:
# Target2
# Target2 is the prediction for two days ahead.
models_2, results_2 = train_data(X_train_2, Y_train_2 ,X_test)
results_2.sort_index()[:48]

0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9


Unnamed: 0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9
0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.02,0.04
1,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.02,0.04
2,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.02,0.04
3,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.02,0.04
4,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.02,0.04
5,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.02,0.04
6,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.02,0.04
7,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.02,0.03
8,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.02,0.04
9,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.02,0.04


In [41]:
results_1.sort_index().iloc[:48]

Unnamed: 0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9
0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.02,0.17
1,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.02,0.17
2,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.02,0.17
3,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.02,0.17
4,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.02,0.17
5,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.02,0.17
6,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.02,0.17
7,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.02,0.17
8,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.02,0.17
9,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.02,0.17


In [43]:
print(results_1.shape, results_2.shape)

(3888, 9) (3888, 9)


In [44]:
submission.loc[submission.id.str.contains("Day7"), "q_0.1":] = results_1.sort_index().values
submission.loc[submission.id.str.contains("Day8"), "q_0.1":] = results_2.sort_index().values
submission

Unnamed: 0,id,q_0.1,q_0.2,q_0.3,q_0.4,q_0.5,q_0.6,q_0.7,q_0.8,q_0.9
0,0.csv_Day7_0h00m,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.02,0.17
1,0.csv_Day7_0h30m,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.02,0.17
2,0.csv_Day7_1h00m,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.02,0.17
3,0.csv_Day7_1h30m,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.02,0.17
4,0.csv_Day7_2h00m,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.02,0.17
...,...,...,...,...,...,...,...,...,...,...
7771,80.csv_Day8_21h30m,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.02,0.03
7772,80.csv_Day8_22h00m,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.02,0.03
7773,80.csv_Day8_22h30m,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.03,0.03
7774,80.csv_Day8_23h00m,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.02,0.03


## Replace negatives with 0

- Because the model is tuned for robustness as much as possible, it occasionally predicts values below 0.
- Those predictions are replaced with 0.
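
The same replacement can be written in one step with `DataFrame.clip`. This is a sketch on a made-up two-row frame; `toy_submission` and `quantile_cols` are illustrative names, and selecting columns by the `q_` prefix assumes the submission layout shown in this notebook.

```python
import pandas as pd

toy_submission = pd.DataFrame({
    "id": ["0.csv_Day7_0h00m", "0.csv_Day7_0h30m"],
    "q_0.1": [-0.03, 0.00],
    "q_0.5": [0.10, -0.01],
})

# Clip only the quantile columns; the id column is left untouched.
quantile_cols = [c for c in toy_submission.columns if c.startswith("q_")]
toy_submission[quantile_cols] = toy_submission[quantile_cols].clip(lower=0)
print(toy_submission)
```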

In [47]:
np.sum(submission.iloc[:,1:] < 0)

q_0.1    119
q_0.2     81
q_0.3     42
q_0.4     17
q_0.5     18
q_0.6     25
q_0.7     19
q_0.8     18
q_0.9     13
dtype: int64

In [48]:
val = submission.iloc[:,1:]
val[val<0] = 0

In [50]:
submission.iloc[:,1:] = val

## Set night hours to 0

- The model occasionally predicts generation for hours when the sun is not up.
- Those cases are all set to 0 as well.
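
An alternative to positional slicing is to read the time of day from the `id` string, so the mask no longer depends on row order. This is a sketch, not the code used for the submission: `zero_night_rows` is a hypothetical helper, and the default bounds (keep 4h30m through 19h30m) are chosen to match the slots that the positional version leaves untouched.

```python
import pandas as pd

def zero_night_rows(sub, first_day_hour=4.5, last_day_hour=19.5):
    """Zero all quantile columns for rows whose id encodes a night-time slot."""
    # ids look like '0.csv_Day7_4h30m'; capture the hour and minute at the end.
    hm = sub["id"].str.extract(r"_(\d+)h(\d+)m$").astype(float)
    hour = hm[0] + hm[1] / 60.0
    night = (hour < first_day_hour) | (hour > last_day_hour)
    out = sub.copy()
    q_cols = [c for c in out.columns if c.startswith("q_")]
    out.loc[night, q_cols] = 0.0
    return out

toy = pd.DataFrame({
    "id": ["0.csv_Day7_4h00m", "0.csv_Day7_4h30m", "0.csv_Day7_19h30m", "0.csv_Day8_20h00m"],
    "q_0.5": [1.0, 1.0, 1.0, 1.0],
})
cleaned = zero_night_rows(toy)
print(cleaned)
```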

In [52]:
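def fill0(df):
    for i in range(0,162): # 81 test files x 2 predicted days = 162 blocks of 48 rows
        df.iloc[i*48:(i*48)+9,1:] = 0 # first 9 slots of the day (0h00m to 4h00m)
        df.iloc[i*48+40:(i*48)+48,1:] = 0 # last 8 slots of the day (20h00m to 23h30m)
    return(df)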

In [53]:
fill0(submission)

Unnamed: 0,id,q_0.1,q_0.2,q_0.3,q_0.4,q_0.5,q_0.6,q_0.7,q_0.8,q_0.9
0,0.csv_Day7_0h00m,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.csv_Day7_0h30m,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.csv_Day7_1h00m,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.csv_Day7_1h30m,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.csv_Day7_2h00m,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...
7771,80.csv_Day8_21h30m,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7772,80.csv_Day8_22h00m,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7773,80.csv_Day8_22h30m,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7774,80.csv_Day8_23h00m,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Submission

In [54]:
submission.to_csv('./data/hanadool-submission33.csv', index=False)