## Generate Related Time Series Data

* This section shows how to build a Related Time Series. To use a Related Time Series for training, it must have the same frequency as the Target Time Series. [Reference: https://docs.aws.amazon.com/forecast/latest/dg/related-time-series-datasets.html](https://docs.aws.amazon.com/forecast/latest/dg/related-time-series-datasets.html)


<div>
<img src="../images/related_time_series_frequency.PNG" width="800"/>
</div>
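
A quick way to check the frequency requirement before training is to compare the inferred frequency of both indexes. This is a minimal sketch, assuming both series carry a regular `DatetimeIndex`; the two indexes below are hypothetical stand-ins for the lab's data.

```python
import pandas as pd

# Hypothetical daily indexes standing in for the target and related series
target_index = pd.date_range("2015-01-01", periods=10, freq="D")
related_index = pd.date_range("2015-01-01", periods=10, freq="D")

# infer_freq returns a frequency string such as "D", or None if irregular
target_freq = pd.infer_freq(target_index)
related_freq = pd.infer_freq(related_index)

frequencies_match = target_freq is not None and target_freq == related_freq
print(target_freq, related_freq, frequencies_match)
```

If `frequencies_match` is false, the related series would need to be resampled before it is imported.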


* The following example joins US holidays with the Target time series to build a Related time series. For US holidays, Amazon Forecast also offers a simpler route: selecting the option shown below enables them directly.

```python
prophet_create_predictor_response=forecast.create_predictor(
      PredictorName=prophet_predictorName, 
      AlgorithmArn=prophet_algorithmArn,
      ForecastHorizon=forecastHorizon,
      PerformAutoML= False,
      PerformHPO=False,
      EvaluationParameters= {"NumberOfBacktestWindows": NumberOfBacktestWindows, 
                             "BackTestWindowOffset": BackTestWindowOffset}, 
      InputDataConfig= {"DatasetGroupArn": target_datasetGroupArn, 
                        "SupplementaryFeatures": [ 
                         { 
                            "Name": "holiday",
                            "Value": "US"
                         }
                      ]},
      FeaturizationConfig= {"...
```                                      

* __Korean holidays, however, do not include the lunar holidays (Seollal, Chuseok), so extra work is required. The rest of this notebook therefore shows how to build the Related Time Series yourself.__

<div>
<img src="../images/Korea_holidyays.PNG" width="500"/>
</div>
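
One way to cover lunar holidays is to keep a hand-maintained list of dates and turn it into a daily flag. The sketch below assumes two example dates for 2017 (Seollal on 2017-01-28 and Chuseok on 2017-10-04). Treat them as illustrative and verify any real list against an official calendar. The names `korean_lunar_holidays` and `calendar` are hypothetical.

```python
import pandas as pd

# Hand-maintained example dates (assumption: verify against an official calendar)
korean_lunar_holidays = pd.to_datetime(["2017-01-28", "2017-10-04"])

# One row per day in 2017, flagged 1 when the day is in the list above
calendar = pd.DataFrame(index=pd.date_range("2017-01-01", "2017-12-31", freq="D"))
calendar["is_holiday"] = calendar.index.isin(korean_lunar_holidays).astype(int)

print(calendar[calendar["is_holiday"] == 1])
```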

### Library Import

In [1]:
import pandas as pd
import numpy as np
import time
import warnings
import os
import boto3
import datetime

In [2]:
#####################################################
# Retrieve saved variables
#####################################################
%store -r

data_dir='data'
print("start date:", start_train_date)
print("end date:", end_val_date)

start date: 2015-01-01
end date: 2018-01-01


In [3]:
start=datetime.datetime.strptime(start_train_date,'%Y-%m-%d')
end=datetime.datetime.strptime(end_val_date,'%Y-%m-%d')
print(start,end)

2015-01-01 00:00:00 2018-01-01 00:00:00


### Marking Saturdays and Sundays as Holidays

In [4]:
holidays_df=[]

while start<end:
    if(start.weekday()==5 or start.weekday()==6):
        h=1
    else:
        h=0
    holidays_df.append([start,h])
    start+=datetime.timedelta(days=1)
    
holidays_df=pd.DataFrame(holidays_df,columns=["Date","is_holiday"])
holidays_df=holidays_df.set_index("Date")
holidays_df.head()

#days = [start + datetime.timedelta(days=x) for x in range((end-start).days + 1) if (start + datetime.timedelta(days=x)).weekday() == 6]

Date,is_holiday
2015-01-01,0
2015-01-02,0
2015-01-03,1
2015-01-04,1
2015-01-05,0
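
The loop above can also be written without explicit iteration. This is a minimal sketch on a one-week range; `weekend_df` is a hypothetical name. Note that `pd.date_range` includes the end date by default, whereas the loop stops before `end`.

```python
import pandas as pd

# One week starting on the lab's first training date
dates = pd.date_range("2015-01-01", "2015-01-07", freq="D")

# dayofweek: Monday=0 ... Saturday=5, Sunday=6
weekend_df = pd.DataFrame(
    {"is_holiday": (dates.dayofweek >= 5).astype(int)}, index=dates
)
weekend_df.index.name = "Date"
print(weekend_df)
```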


### Adding Public Holiday Information

In [5]:
# Load the holidays file
# Inspect the original file
holidays=pd.read_csv(os.path.join(data_dir,'datasets_7476_10641_usholidays.csv'))
holidays.head()


Unnamed: 0.1,Unnamed: 0,Date,Holiday
0,0,2010-12-31,New Year's Day
1,1,2011-01-17,"Birthday of Martin Luther King, Jr."
2,2,2011-02-21,Washington's Birthday
3,3,2011-05-30,Memorial Day
4,4,2011-07-04,Independence Day


In [6]:
# Extract only the US holiday dates
holidays=pd.read_csv(os.path.join(data_dir,'datasets_7476_10641_usholidays.csv'),usecols=["Date"])
holidays=holidays.set_index("Date")
holidays.head()

2010-12-31
2011-01-17
2011-02-21
2011-05-30
2011-07-04


## Joining Holiday Data with Target Data to Build the Related Time Series

In this lab, the forecast dimensions of the Target Time Series are store id and item id. Here we combine the holiday data with the Target time series to build the Related Time Series. __The Related Time Series must also contain data for the forecast window.__
[Reference](https://docs.aws.amazon.com/forecast/latest/dg/related-time-series-datasets.html)


<div>
<img src="../images/related_time_series_windows.PNG" width="800"/>
</div>
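
To make the forecast-window requirement concrete, the calendar can be extended past the last target date by the forecast horizon. This is a sketch under stated assumptions: `forecast_horizon_days = 90` is a hypothetical value, not the lab's setting, and `end_of_target` is taken from the last date shown in the lab output.

```python
import pandas as pd

end_of_target = pd.Timestamp("2017-12-31")  # last date in the target series
forecast_horizon_days = 90                  # hypothetical horizon, in days

# The related series has to run through the end of the forecast window
related_end = end_of_target + pd.Timedelta(days=forecast_horizon_days)
related_dates = pd.date_range("2015-01-01", related_end, freq="D")

print(related_dates[0], related_dates[-1])
```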



In [7]:
# Trim the data to the training and validation period
holidays=holidays[holidays.index<end_val_date]
holidays=holidays[holidays.index>=start_train_date]
holidays.head()

2015-01-01
2015-01-19
2015-02-16
2015-05-25
2015-07-03


In [8]:
# Flag every public holiday with 1; .loc avoids chained assignment
holidays_df.loc[holidays_df.index.isin(pd.to_datetime(holidays.index)),"is_holiday"]=1
holidays_df.head()

Date,is_holiday
2015-01-01,1
2015-01-02,0
2015-01-03,1
2015-01-04,1
2015-01-05,0


Adding the unique item_id and store values

In [9]:
df=total_stores_sales
df=df[['item_id','store']]
df.head()

date,item_id,store
2015-01-01,1,1
2015-01-02,1,1
2015-01-03,1,1
2015-01-04,1,1
2015-01-05,1,1


In [10]:
unique_store_id=pd.unique(df['store'])
print(unique_store_id)
unique_item_id=pd.unique(df['item_id'])
print(unique_item_id)

['1' '2' '3' '4' '5' '6' '7' '8' '9' '10']
['1' '2' '3' '4' '5' '6' '7' '8' '9' '10' '11' '12' '13' '14' '15' '16'
 '17' '18' '19' '20' '21' '22' '23' '24' '25' '26' '27' '28' '29' '30'
 '31' '32' '33' '34' '35' '36' '37' '38' '39' '40' '41' '42' '43' '44'
 '45' '46' '47' '48' '49' '50']


In [11]:
related_df=pd.DataFrame([])
for i in unique_store_id:
    for j in unique_item_id:
        temp=holidays_df.copy()
        temp['store']=i
        temp['item_id']=j       
        related_df=pd.concat([temp,related_df])
related_df=related_df[["item_id","store","is_holiday"]]
related_df.head()


Date,item_id,store,is_holiday
2015-01-01,50,10,1
2015-01-02,50,10,0
2015-01-03,50,10,1
2015-01-04,50,10,1
2015-01-05,50,10,0


In [12]:
related_df.tail()

Date,item_id,store,is_holiday
2017-12-27,1,1,0
2017-12-28,1,1,0
2017-12-29,1,1,0
2017-12-30,1,1,1
2017-12-31,1,1,1


In [13]:
print(len(related_df))

548000
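
The row count matches 1,096 days × 10 stores × 50 items = 548,000. The same date × store × item combination can be built without a nested loop by using a cross join. This is a minimal sketch on toy data, assuming pandas 1.2 or later for `how="cross"`; all frame names here are hypothetical.

```python
import pandas as pd

# Toy inputs: 3 days, 2 stores, 3 items
holidays_small = pd.DataFrame(
    {"Date": pd.date_range("2015-01-01", periods=3, freq="D"), "is_holiday": [1, 0, 1]}
)
stores = pd.DataFrame({"store": ["1", "2"]})
items = pd.DataFrame({"item_id": ["1", "2", "3"]})

# Cross joins produce every (date, store, item) combination
related_small = (
    holidays_small.merge(stores, how="cross")
    .merge(items, how="cross")
    .set_index("Date")[["item_id", "store", "is_holiday"]]
)
print(len(related_small))
```

Row order differs from the loop version, which prepends each new block, but the set of rows is the same kind of full combination.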


### Handling Missing Data in the Related Time Series
In this lab's data, none of the item/store series in the Target time series has missing dates. A plain join like the one below is therefore sufficient.

<code>
test_related_df=df.join(holidays_df)
</code>

However, when the Target time series has missing values, the related time series needs careful handling. Missing target values are filled automatically. In that case, the Related time series must also be prepared so that it has no missing values.




<div>
<img src="../images/related_time_series_missing_value.PNG" width="800"/>
</div>


<div>
<img src="../images/related_time_seires_filling_values.PNG" width="800"/>
</div>
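
The sketch below illustrates one way to close gaps in a related series: reindex onto the full daily range, then fill. It assumes a hypothetical series that skips 2015-01-06, a Tuesday that is not a holiday, so filling with 0 is correct here. In general the flag should be recomputed from the calendar rather than defaulted.

```python
import pandas as pd

# Hypothetical related series with one missing day (2015-01-06)
sparse = pd.DataFrame(
    {"is_holiday": [1, 0, 0]},
    index=pd.to_datetime(["2015-01-04", "2015-01-05", "2015-01-07"]),
)

# Reindex onto every day between the first and last date
full_index = pd.date_range(sparse.index.min(), sparse.index.max(), freq="D")
filled = sparse.reindex(full_index)

# Record which dates were missing, then fill them (assumption: 0 is right here)
missing_dates = filled.index[filled["is_holiday"].isna()]
filled["is_holiday"] = filled["is_holiday"].fillna(0).astype(int)
print(list(missing_dates))
```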


In [14]:
test_related_df=df.join(holidays_df)
print(len(test_related_df))

548000


In [15]:
related_time_series_filename="related_holidays.csv"
related_time_series_path=data_dir + "/" + related_time_series_filename
related_df.to_csv(related_time_series_path, header=False)

In [16]:
%store related_time_series_filename
%store related_time_series_path

Stored 'related_time_series_filename' (str)
Stored 'related_time_series_path' (str)
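
Because the CSV is written with `header=False`, the column order in the file has to line up with whatever schema is declared at import time. A small round-trip check is sketched below. The column names in `column_names` are assumptions for illustration, not the lab's schema, and the file is written to a temporary directory.

```python
import os
import tempfile

import pandas as pd

# One-row sample shaped like related_df
sample = pd.DataFrame(
    {"item_id": ["1"], "store": ["1"], "is_holiday": [1]},
    index=pd.DatetimeIndex(["2015-01-01"], name="Date"),
)

# Hypothetical column labels used only to read the header-less file back
column_names = ["timestamp", "item_id", "store", "is_holiday"]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "related_holidays.csv")
    sample.to_csv(path, header=False)
    check = pd.read_csv(
        path, header=None, names=column_names, dtype={"item_id": str, "store": str}
    )

print(check)
```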
