# [Module 2.3] Preparing Forecast Training Data (Import Dataset)
- This notebook takes the target_time_series.csv, related_time_series.csv, and store_meta.csv files created earlier and imports them so that Forecast can train on them.

The data files for step (1) below were prepared in the previous notebook; this notebook carries out steps (2) through (7).

![Fig.2.1.Forecast-Lifecycle](../StoreItemDemand/img/Fig.2.1.Forecast-Lifecycle.png)
**Source: By Kyoungtae Hwang**

In detail, this notebook performs the following tasks.<br>

- Create IAM role
    - Creates the role that the Forecast service assumes when it accesses other services (e.g. S3), and grants it the required permissions.


- (2) Upload the data files to S3 (Upload the Target Data to S3)
    - Uploads the target_time_series.csv, related_time_series.csv, and store_meta.csv files created in the previous notebook to S3.


- (3) Create a dataset group
    - Creates the parent Dataset Group that holds all of the datasets (Target Data Set, Related Data Set, Item Meta Data Set).

**Steps (4) and (5) below are repeated three times, once each for Target, Related, and Item-Meta.**
- (4) Create schema 
    - Defines a schema that specifies the column types, so that the Forecast service knows what kind of data it is receiving.


- (5) Create dataset
    - Creates the actual dataset (Target, Related, or Item-Meta).


- (6) Update dataset group
    - Attaches the Target, Related, and Item-Meta datasets created above to the Dataset Group.


- (7) Create dataset import job
    - Imports the three files (Target, Related, Item-Meta) uploaded to S3 so that the Forecast service can use them.
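
Steps (3) through (7) can be sketched as a single function. This is only an illustrative outline, not the notebook's actual code: `import_datasets`, the `specs` dictionaries, and the `RecordingForecast` stand-in are hypothetical names introduced here so the flow can be dry-run without AWS credentials. The four client methods it calls are the same ones used in the cells further down.

```python
class RecordingForecast:
    """Hypothetical stand-in for a Forecast client: records calls, returns fake ARNs."""

    def __init__(self):
        self.calls = []

    def create_dataset_group(self, **kwargs):
        self.calls.append(("create_dataset_group", kwargs))
        return {"DatasetGroupArn": "arn:fake:dataset-group/" + kwargs["DatasetGroupName"]}

    def create_dataset(self, **kwargs):
        self.calls.append(("create_dataset", kwargs))
        return {"DatasetArn": "arn:fake:dataset/" + kwargs["DatasetName"]}

    def update_dataset_group(self, **kwargs):
        self.calls.append(("update_dataset_group", kwargs))
        return {}

    def create_dataset_import_job(self, **kwargs):
        self.calls.append(("create_dataset_import_job", kwargs))
        return {"DatasetImportJobArn": "arn:fake:dataset-import-job/" + kwargs["DatasetImportJobName"]}


def import_datasets(forecast, group_name, specs, role_arn,
                    frequency="W", timestamp_format="yyyy-MM-dd"):
    """Run steps (3)-(7); `forecast` is a boto3 Forecast client or a stand-in."""
    # (3) Create the dataset group
    group_arn = forecast.create_dataset_group(
        DatasetGroupName=group_name, Domain="CUSTOM")["DatasetGroupArn"]

    # (4)+(5) Create one dataset per schema, mirroring the arguments used in this notebook
    dataset_arns = []
    for spec in specs:
        response = forecast.create_dataset(
            Domain="CUSTOM", DatasetType=spec["type"], DatasetName=spec["name"],
            DataFrequency=frequency, Schema=spec["schema"])
        dataset_arns.append(response["DatasetArn"])

    # (6) Attach all datasets to the group
    forecast.update_dataset_group(DatasetGroupArn=group_arn, DatasetArns=dataset_arns)

    # (7) Start one import job per dataset
    job_arns = []
    for spec, dataset_arn in zip(specs, dataset_arns):
        response = forecast.create_dataset_import_job(
            DatasetImportJobName="IMPORT_" + spec["name"],
            DatasetArn=dataset_arn,
            DataSource={"S3Config": {"Path": spec["s3_path"], "RoleArn": role_arn}},
            TimestampFormat=timestamp_format)
        job_arns.append(response["DatasetImportJobArn"])
    return group_arn, dataset_arns, job_arns


# Dry run with made-up names and paths
demo_specs = [
    {"name": "DemoDS_Target", "type": "TARGET_TIME_SERIES",
     "schema": {"Attributes": []}, "s3_path": "s3://demo-bucket/demo/target.csv"},
    {"name": "DemoDS_Related", "type": "RELATED_TIME_SERIES",
     "schema": {"Attributes": []}, "s3_path": "s3://demo-bucket/demo/related.csv"},
    {"name": "DemoDS_ItemMeta", "type": "ITEM_METADATA",
     "schema": {"Attributes": []}, "s3_path": "s3://demo-bucket/demo/meta.csv"},
]
demo_client = RecordingForecast()
import_datasets(demo_client, "DemoDSG", demo_specs, "arn:fake:role/demo")
print([name for name, _ in demo_client.calls])
```

Passing the client in as an argument keeps the ordering of the steps easy to test; with a real client created via `session.client(service_name='forecast')`, the same function would issue the actual API calls.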
    
---    
This process takes about 10 minutes.


In [1]:
import boto3
from time import sleep
import os
import pandas as pd
import json
import time
import pprint
import numpy as np

In [2]:
%store -r

## Project Name and Parameters

- Set the dataset names.
- Set DATASET_FREQUENCY to weekly ("W").
- Set TIMESTAMP_FORMAT to the yyyy-MM-dd format.
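
Before importing, it can help to confirm that the timestamp column really follows the declared format. The sketch below is an optional check, not part of the original notebook; `invalid_timestamps` and the `PYTHON_FORMATS` mapping are hypothetical helpers, and the mapping assumes that `yyyy-MM-dd` corresponds to `%Y-%m-%d` in Python's `strptime`.

```python
from datetime import datetime

# Assumed mapping from the Forecast-style format string to a strptime pattern
PYTHON_FORMATS = {
    "yyyy-MM-dd": "%Y-%m-%d",
    "yyyy-MM-dd HH:mm:ss": "%Y-%m-%d %H:%M:%S",
}


def invalid_timestamps(values, forecast_format="yyyy-MM-dd"):
    """Return the values that cannot be parsed with the given format."""
    pattern = PYTHON_FORMATS[forecast_format]
    bad = []
    for value in values:
        try:
            datetime.strptime(value, pattern)
        except ValueError:
            bad.append(value)
    return bad


# Made-up sample values: one valid date, one wrong layout, one impossible month
print(invalid_timestamps(["2010-02-05", "05/02/2010", "2010-13-01"]))
```

In practice the values would come from the timestamp column of target_time_series.csv, for example via `df["timestamp"].astype(str)` (the column name here is an assumption).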


In [3]:
DATASET_FREQUENCY = "W" 
TIMESTAMP_FORMAT = "yyyy-MM-dd"

project = 'WalmartKaggleWithThreeDatasets'
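# Five random digits; slicing str(np.random.uniform()) can occasionally yield fewer digits or non-digit characters
suffix = str(np.random.randint(10000, 100000))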
target_suffix = '_Target'
related_suffix = '_Related'
item_meta_suffix = '_ItemMeta'

target_datasetName= project+'DS' + target_suffix + suffix
item_meta_dataset_name= project+'DS' + item_meta_suffix + suffix
related_dataset_Name= project+'DS' + related_suffix + suffix
item_datasetGroupName= project +'DSG'+ item_meta_suffix + suffix
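
Because the resource names are assembled from several string pieces, a quick check before calling the API can catch a malformed name early. The sketch below assumes, based on my reading of the Forecast API reference, that dataset and dataset group names must start with a letter, contain only letters, digits, and underscores, and be at most 63 characters long; treat these limits as an assumption and verify them against the current documentation. `is_valid_forecast_name` is a hypothetical helper.

```python
import re

# Assumed naming rule: starts with a letter, then letters/digits/underscores, max 63 chars
NAME_PATTERN = re.compile(r"[a-zA-Z][a-zA-Z0-9_]*")
MAX_NAME_LENGTH = 63


def is_valid_forecast_name(name):
    """Return True if `name` satisfies the assumed Forecast naming rule."""
    return len(name) <= MAX_NAME_LENGTH and NAME_PATTERN.fullmatch(name) is not None


for candidate in ["WalmartKaggleWithThreeDatasetsDS_Target91891", "WalmartDS_Target45e-0"]:
    print(candidate, is_valid_forecast_name(candidate))
```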

In [4]:
region = boto3.Session().region_name
session = boto3.Session(region_name=region)
forecast = session.client(service_name='forecast')

## Create role

**Before running this step, the SageMaker notebook instance running this notebook must have these four policies attached: AmazonSageMakerFullAccess, AmazonS3FullAccess, AmazonForecastFullAccess, and IAMFullAccess.**
If the cell below raises an error, open [AddPolicy](../0.0.Prerequisite/Prerequisite.md), scroll down to "3. Add IAM Policy (Permission)", and add the permissions starting from there.

This step creates a role named `ForecastRoleWalmart` followed by the random suffix, and attaches two policies to it: AmazonForecastFullAccess and AmazonS3FullAccess. The Forecast service assumes this role when it accesses other services (e.g. S3).



In [5]:
iam = boto3.client("iam")

# Put the role name
role_name = "ForecastRoleWalmart" + suffix
assume_role_policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Service": "forecast.amazonaws.com"
          },
          "Action": "sts:AssumeRole"
        }
    ]
}

create_role_response = iam.create_role(
    RoleName = role_name,
    AssumeRolePolicyDocument = json.dumps(assume_role_policy_document)
)

# Attach the AmazonForecastFullAccess managed policy to the role.
# AmazonS3FullAccess is attached further below; if you prefer narrower S3 permissions,
# consider creating and attaching a policy that grants read access to your bucket only,
# or attaching the AmazonS3ReadOnlyAccess policy to the role instead.
policy_arn = "arn:aws:iam::aws:policy/AmazonForecastFullAccess"
iam.attach_role_policy(
    RoleName = role_name,
    PolicyArn = policy_arn
)

# Now add S3 support
iam.attach_role_policy(
    PolicyArn='arn:aws:iam::aws:policy/AmazonS3FullAccess',
    RoleName=role_name
)
time.sleep(60) # wait for a minute to allow IAM role policy attachment to propagate

role_arn = create_role_response["Role"]["Arn"]
print(role_arn)

arn:aws:iam::057716757052:role/ForecastRoleWalmart91891
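
As the comment in the cell above suggests, a bucket-scoped policy is an alternative to AmazonS3FullAccess. The following is a minimal sketch of such a policy document; `s3_read_policy` is a hypothetical helper, the bucket name is made up, and the choice of `s3:ListBucket` plus `s3:GetObject` is an assumption about what a read-only import needs (exporting forecasts back to S3 would need write permissions as well).

```python
import json


def s3_read_policy(bucket_name, prefix=""):
    """Build a read-only policy document for one bucket (optionally one key prefix)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket_name}",
            },
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket_name}/{prefix}*",
            },
        ],
    }


# Hypothetical bucket name and prefix, for illustration only
policy = s3_read_policy("my-forecast-bucket", "WalmartKaggleWithThreeDatasets/")
print(json.dumps(policy, indent=2))
```

The resulting document could then be attached as an inline policy, for example with `iam.put_role_policy(RoleName=role_name, PolicyName="ForecastS3Read", PolicyDocument=json.dumps(policy))`, where the policy name is again only a placeholder.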


## (2) Upload the data files to S3 - step (2) in the image at the top

Create the bucket if it does not exist yet, then upload the three csv files created in the previous notebook to S3.

In [6]:
import boto3
import sagemaker

s3_resource = boto3.resource('s3')
s3 = boto3.client('s3')

# if you want, replace with a name of your S3 bucket
bucket_name = sagemaker.Session().default_bucket()  

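if s3_resource.Bucket(bucket_name).creation_date is None:
    # Bucket does not exist yet, so create it.
    # In us-east-1 the bucket must be created without a LocationConstraint.
    if region == 'us-east-1':
        s3.create_bucket(Bucket=bucket_name)
    else:
        s3.create_bucket(Bucket=bucket_name, CreateBucketConfiguration={'LocationConstraint': region})
else:
    # Bucket already exists
    print("bucket name is ", bucket_name)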
    

bucket name is  sagemaker-ap-northeast-2-057716757052


In [7]:
# Upload Target File under a bucket folder
bucket_folder = project
s3_file_path = bucket_folder + "/" + target_time_series_filename

boto3.Session().resource('s3').Bucket(bucket_name).Object(s3_file_path).upload_file(target_time_series_path)
target_s3DataPath = "s3://"+bucket_name + "/" + s3_file_path

# Upload Related File under a bucket folder
bucket_folder = project
s3_file_path = bucket_folder + "/" + related_time_series_filename

boto3.Session().resource('s3').Bucket(bucket_name).Object(s3_file_path).upload_file(related_time_series_path)
related_s3DataPath = "s3://"+bucket_name + "/" + s3_file_path

# Upload Item Meta File under a bucket folder
bucket_folder = project
s3_file_path = bucket_folder + "/" + store_meta_filename

boto3.Session().resource('s3').Bucket(bucket_name).Object(s3_file_path).upload_file(store_meta_path)
item_meta_s3DataPath = "s3://"+bucket_name + "/" + s3_file_path
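
The three uploads above follow the same pattern: build an object key under the project folder, upload, and record the `s3://` path. A small helper can make that pattern explicit. This is a sketch only; `s3_locations` is a hypothetical function and the bucket name is made up.

```python
def s3_locations(bucket, folder, filenames):
    """Map each file name to its (object key, s3:// URI) under bucket/folder."""
    locations = {}
    for name in filenames:
        key = f"{folder}/{name}"
        locations[name] = (key, f"s3://{bucket}/{key}")
    return locations


files = ["target_time_series.csv", "related_time_series.csv", "store_meta.csv"]
for name, (key, uri) in s3_locations("demo-bucket", "WalmartKaggleWithThreeDatasets", files).items():
    # With real credentials, the upload itself would be:
    # boto3.resource("s3").Bucket("demo-bucket").Object(key).upload_file(local_path)
    print(uri)
```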

## (3) Create Dataset Group

Create the Dataset Group. The domain is CUSTOM.

In [8]:
# Create the DatasetGroup
create_dataset_group_response = forecast.create_dataset_group(
      DatasetGroupName=item_datasetGroupName,
      Domain="CUSTOM",
     )
item_meta_datasetGroupArn = create_dataset_group_response['DatasetGroupArn']

In [9]:
forecast.describe_dataset_group(DatasetGroupArn=item_meta_datasetGroupArn)

{'DatasetGroupName': 'WalmartKaggleWithThreeDatasetsDSG_ItemMeta91891',
 'DatasetGroupArn': 'arn:aws:forecast:ap-northeast-2:057716757052:dataset-group/WalmartKaggleWithThreeDatasetsDSG_ItemMeta91891',
 'DatasetArns': [],
 'Domain': 'CUSTOM',
 'Status': 'ACTIVE',
 'CreationTime': datetime.datetime(2020, 8, 23, 7, 49, 0, 982000, tzinfo=tzlocal()),
 'LastModificationTime': datetime.datetime(2020, 8, 23, 7, 49, 0, 982000, tzinfo=tzlocal()),
 'ResponseMetadata': {'RequestId': 'b5e3f218-a981-4505-a29d-8cae83b17db0',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'content-type': 'application/x-amz-json-1.1',
   'date': 'Sun, 23 Aug 2020 07:49:02 GMT',
   'x-amzn-requestid': 'b5e3f218-a981-4505-a29d-8cae83b17db0',
   'content-length': '322',
   'connection': 'keep-alive'},
  'RetryAttempts': 0}}

## (4) Create schema for target data

Create the target dataset schema.

In [10]:
# Specify the schema of your dataset here. Make sure the order of columns matches the raw data files.
target_schema ={
   "Attributes":[
      {
         "AttributeName":"timestamp",
         "AttributeType":"timestamp"
      },
      {
         "AttributeName":"target_value",
         "AttributeType":"float"
      },
      {
         "AttributeName":"item_id",
         "AttributeType":"string"
      }
   ]
}
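
Since the schema must list attributes in the same order as the columns of the raw file, a short comparison can confirm this before the dataset is created. The sketch below is an optional check; `schema_mismatches` is a hypothetical helper, and the column list is assumed to come from wherever the CSV was written (for example `list(df.columns)` in the previous notebook).

```python
from itertools import zip_longest


def schema_mismatches(schema, columns):
    """Return (position, schema_name, column_name) for every position that differs."""
    expected = [attr["AttributeName"] for attr in schema["Attributes"]]
    return [
        (index, schema_name, column_name)
        for index, (schema_name, column_name) in enumerate(zip_longest(expected, columns))
        if schema_name != column_name
    ]


demo_schema = {
    "Attributes": [
        {"AttributeName": "timestamp", "AttributeType": "timestamp"},
        {"AttributeName": "target_value", "AttributeType": "float"},
        {"AttributeName": "item_id", "AttributeType": "string"},
    ]
}

# Matching order yields an empty list; swapped columns are reported by position
print(schema_mismatches(demo_schema, ["timestamp", "target_value", "item_id"]))
print(schema_mismatches(demo_schema, ["timestamp", "item_id", "target_value"]))
```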

## (5) Create Target Time Series Dataset

In [11]:
response=forecast.create_dataset(
                    Domain="CUSTOM",
                    DatasetType='TARGET_TIME_SERIES',
                    DatasetName=target_datasetName,
                    DataFrequency=DATASET_FREQUENCY, 
                    Schema = target_schema
)

In [12]:
target_second_datasetArn = response['DatasetArn']
forecast.describe_dataset(DatasetArn=target_second_datasetArn)

{'DatasetArn': 'arn:aws:forecast:ap-northeast-2:057716757052:dataset/WalmartKaggleWithThreeDatasetsDS_Target91891',
 'DatasetName': 'WalmartKaggleWithThreeDatasetsDS_Target91891',
 'Domain': 'CUSTOM',
 'DatasetType': 'TARGET_TIME_SERIES',
 'DataFrequency': 'W',
 'Schema': {'Attributes': [{'AttributeName': 'timestamp',
    'AttributeType': 'timestamp'},
   {'AttributeName': 'target_value', 'AttributeType': 'float'},
   {'AttributeName': 'item_id', 'AttributeType': 'string'}]},
 'EncryptionConfig': {},
 'Status': 'ACTIVE',
 'CreationTime': datetime.datetime(2020, 8, 23, 7, 49, 7, 118000, tzinfo=tzlocal()),
 'LastModificationTime': datetime.datetime(2020, 8, 23, 7, 49, 7, 118000, tzinfo=tzlocal()),
 'ResponseMetadata': {'RequestId': '32f6610f-96eb-40e0-9f21-d4d2d71782bf',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'content-type': 'application/x-amz-json-1.1',
   'date': 'Sun, 23 Aug 2020 07:49:08 GMT',
   'x-amzn-requestid': '32f6610f-96eb-40e0-9f21-d4d2d71782bf',
   'content-length': '55

## (4) Create schema for related data

In [13]:
# Specify the schema of your dataset here. Make sure the order of columns matches the raw data files.
related_schema ={
   "Attributes":[
      {
         "AttributeName":"timestamp",
         "AttributeType":"timestamp"
      },
      {
         "AttributeName":"Temperature",
         "AttributeType":"float"
      },
      {
         "AttributeName":"Fuel_Price",
         "AttributeType":"float"
      },
      {
         "AttributeName":"item_id",
         "AttributeType":"string"
      }       
   ]
}

## (5) Create Related Time Series Dataset

In [14]:
response=forecast.create_dataset(
                    Domain="CUSTOM",
                    DatasetType='RELATED_TIME_SERIES',
                    DatasetName=related_dataset_Name,
                    DataFrequency=DATASET_FREQUENCY, 
                    Schema = related_schema
)

In [15]:
related_datasetArn = response['DatasetArn']
forecast.describe_dataset(DatasetArn=related_datasetArn)

{'DatasetArn': 'arn:aws:forecast:ap-northeast-2:057716757052:dataset/WalmartKaggleWithThreeDatasetsDS_Related91891',
 'DatasetName': 'WalmartKaggleWithThreeDatasetsDS_Related91891',
 'Domain': 'CUSTOM',
 'DatasetType': 'RELATED_TIME_SERIES',
 'DataFrequency': 'W',
 'Schema': {'Attributes': [{'AttributeName': 'timestamp',
    'AttributeType': 'timestamp'},
   {'AttributeName': 'Temperature', 'AttributeType': 'float'},
   {'AttributeName': 'Fuel_Price', 'AttributeType': 'float'},
   {'AttributeName': 'item_id', 'AttributeType': 'string'}]},
 'EncryptionConfig': {},
 'Status': 'ACTIVE',
 'CreationTime': datetime.datetime(2020, 8, 23, 7, 49, 12, 803000, tzinfo=tzlocal()),
 'LastModificationTime': datetime.datetime(2020, 8, 23, 7, 49, 12, 803000, tzinfo=tzlocal()),
 'ResponseMetadata': {'RequestId': '56f91a18-e33c-412d-b961-430764b0f3d5',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'content-type': 'application/x-amz-json-1.1',
   'date': 'Sun, 23 Aug 2020 07:49:14 GMT',
   'x-amzn-requestid'

## (4) Create schema for Item Meta data

In [16]:
# Specify the schema of your dataset here. Make sure the order of columns matches the raw data files.
item_meta_schema ={
   "Attributes":[
      {
         "AttributeName":"item_id",
         "AttributeType":"string"
      },       
      {
         "AttributeName":"StoreType",
         "AttributeType":"string"
      }       
   ]
}

## (5) Create Item-Meta Dataset

In [17]:
response=forecast.create_dataset(
                    Domain="CUSTOM",
                    DatasetType='ITEM_METADATA',
                    DatasetName=item_meta_dataset_name,
                    DataFrequency=DATASET_FREQUENCY, 
                    Schema = item_meta_schema
)

In [18]:
item_meta_datasetArn = response['DatasetArn']
forecast.describe_dataset(DatasetArn=item_meta_datasetArn)

{'DatasetArn': 'arn:aws:forecast:ap-northeast-2:057716757052:dataset/WalmartKaggleWithThreeDatasetsDS_ItemMeta91891',
 'DatasetName': 'WalmartKaggleWithThreeDatasetsDS_ItemMeta91891',
 'Domain': 'CUSTOM',
 'DatasetType': 'ITEM_METADATA',
 'Schema': {'Attributes': [{'AttributeName': 'item_id',
    'AttributeType': 'string'},
   {'AttributeName': 'StoreType', 'AttributeType': 'string'}]},
 'EncryptionConfig': {},
 'Status': 'ACTIVE',
 'CreationTime': datetime.datetime(2020, 8, 23, 7, 49, 17, 585000, tzinfo=tzlocal()),
 'LastModificationTime': datetime.datetime(2020, 8, 23, 7, 49, 17, 585000, tzinfo=tzlocal()),
 'ResponseMetadata': {'RequestId': 'affdbefa-41fe-4da6-86d7-b3112174d43b',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'content-type': 'application/x-amz-json-1.1',
   'date': 'Sun, 23 Aug 2020 07:49:18 GMT',
   'x-amzn-requestid': 'affdbefa-41fe-4da6-86d7-b3112174d43b',
   'content-length': '473',
   'connection': 'keep-alive'},
  'RetryAttempts': 0}}

## (6) Update dataset group with the target, related and item_meta dataset

In [19]:
# Attach the target, related, and item metadata datasets to the Dataset Group:
forecast.update_dataset_group(
    DatasetGroupArn=item_meta_datasetGroupArn, 
    DatasetArns=[target_second_datasetArn,
                 related_datasetArn,
                 item_meta_datasetArn])

{'ResponseMetadata': {'RequestId': 'f0c75e73-83bc-4c75-9fec-5460dfda87b1',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'content-type': 'application/x-amz-json-1.1',
   'date': 'Sun, 23 Aug 2020 07:49:20 GMT',
   'x-amzn-requestid': 'f0c75e73-83bc-4c75-9fec-5460dfda87b1',
   'content-length': '2',
   'connection': 'keep-alive'},
  'RetryAttempts': 0}}
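
After the update, the `DatasetArns` field returned by `describe_dataset_group` should list all three datasets. A minimal sketch of that check follows; `missing_from_group` is a hypothetical helper, and the response dictionary here is a made-up example shaped like the output shown in step (3).

```python
def missing_from_group(describe_response, expected_arns):
    """Return the expected dataset ARNs that are not attached to the group."""
    attached = set(describe_response.get("DatasetArns", []))
    return [arn for arn in expected_arns if arn not in attached]


# Made-up response and ARNs, for illustration only
fake_response = {
    "DatasetGroupName": "DemoDSG",
    "DatasetArns": ["arn:fake:dataset/target", "arn:fake:dataset/related"],
}
expected = ["arn:fake:dataset/target", "arn:fake:dataset/related", "arn:fake:dataset/item_meta"]
print(missing_from_group(fake_response, expected))
```

With the real client, the response would come from `forecast.describe_dataset_group(DatasetGroupArn=item_meta_datasetGroupArn)`.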

## (7) Create dataset_import_job used to download dataset from S3

In [20]:
# Target Import Job
datasetImportJobName = 'DSIMPORT_JOB_TARGET_WALMART' + suffix
ds_import_job_response=forecast.create_dataset_import_job(DatasetImportJobName=datasetImportJobName,
                                                          DatasetArn=target_second_datasetArn,
                                                          DataSource= {
                                                              "S3Config" : {
                                                                 "Path":target_s3DataPath,
                                                                 "RoleArn": role_arn
                                                              } 
                                                          },
                                                          TimestampFormat=TIMESTAMP_FORMAT
                                                         )

In [21]:
ds_target_second_import_job_arn=ds_import_job_response['DatasetImportJobArn']
print(ds_target_second_import_job_arn)

arn:aws:forecast:ap-northeast-2:057716757052:dataset-import-job/WalmartKaggleWithThreeDatasetsDS_Target91891/DSIMPORT_JOB_TARGET_WALMART91891


In [22]:
# Related Import Job
datasetImportJobName = 'DSIMPORT_JOB_RELATED_WALMART' + related_suffix + suffix
ds_import_job_response=forecast.create_dataset_import_job(DatasetImportJobName=datasetImportJobName,
                                                          DatasetArn=related_datasetArn,
                                                          DataSource= {
                                                              "S3Config" : {
                                                                 "Path":related_s3DataPath,
                                                                 "RoleArn": role_arn
                                                              } 
                                                          },
                                                          TimestampFormat=TIMESTAMP_FORMAT
                                                         )

In [23]:
ds_related_import_job_arn=ds_import_job_response['DatasetImportJobArn']
print(ds_related_import_job_arn)

arn:aws:forecast:ap-northeast-2:057716757052:dataset-import-job/WalmartKaggleWithThreeDatasetsDS_Related91891/DSIMPORT_JOB_RELATED_WALMART_Related91891


In [24]:
# Item Meta Import Job (note: the job name below still carries the RELATED prefix)
datasetImportJobName = 'DSIMPORT_JOB_RELATED_WALMART' + related_suffix + suffix
ds_import_job_response=forecast.create_dataset_import_job(DatasetImportJobName=datasetImportJobName,
                                                          DatasetArn=item_meta_datasetArn,
                                                          DataSource= {
                                                              "S3Config" : {
                                                                 "Path":item_meta_s3DataPath,
                                                                 "RoleArn": role_arn
                                                              } 
                                                          },
                                                          TimestampFormat=TIMESTAMP_FORMAT
                                                         )

In [25]:
ds_itme_meta_import_job_arn=ds_import_job_response['DatasetImportJobArn']
print(ds_itme_meta_import_job_arn)

arn:aws:forecast:ap-northeast-2:057716757052:dataset-import-job/WalmartKaggleWithThreeDatasetsDS_ItemMeta91891/DSIMPORT_JOB_RELATED_WALMART_Related91891


In [26]:
%%time

while True:
    dataTargetImportStatus = forecast.describe_dataset_import_job(DatasetImportJobArn=ds_target_second_import_job_arn)['Status']
    print("Target: ", dataTargetImportStatus)
    dataRelatedImportStatus = forecast.describe_dataset_import_job(DatasetImportJobArn=ds_related_import_job_arn)['Status']
    print("Related: ", dataRelatedImportStatus)
    # Poll the item metadata job by its own ARN (not the related job's ARN)
    dataItemMetaImportStatus = forecast.describe_dataset_import_job(DatasetImportJobArn=ds_itme_meta_import_job_arn)['Status']
    print("Item Metadata: ", dataItemMetaImportStatus)
    # Stop once every job has reached a terminal state; otherwise wait and poll again
    statuses = [dataTargetImportStatus, dataRelatedImportStatus, dataItemMetaImportStatus]
    if all(status in ('ACTIVE', 'CREATE_FAILED') for status in statuses):
        break
    sleep(30)

Target:  CREATE_PENDING
Related:  CREATE_PENDING
Item Metadata:  CREATE_PENDING
Target:  CREATE_IN_PROGRESS
Related:  CREATE_IN_PROGRESS
Item Metadata:  CREATE_IN_PROGRESS
Target:  CREATE_IN_PROGRESS
Related:  CREATE_IN_PROGRESS
Item Metadata:  CREATE_IN_PROGRESS
Target:  CREATE_IN_PROGRESS
Related:  CREATE_IN_PROGRESS
Item Metadata:  CREATE_IN_PROGRESS
Target:  CREATE_IN_PROGRESS
Related:  CREATE_IN_PROGRESS
Item Metadata:  CREATE_IN_PROGRESS
Target:  CREATE_IN_PROGRESS
Related:  CREATE_IN_PROGRESS
Item Metadata:  CREATE_IN_PROGRESS
Target:  CREATE_IN_PROGRESS
Related:  CREATE_IN_PROGRESS
Item Metadata:  CREATE_IN_PROGRESS
Target:  CREATE_IN_PROGRESS
Related:  CREATE_IN_PROGRESS
Item Metadata:  CREATE_IN_PROGRESS
Target:  CREATE_IN_PROGRESS
Related:  CREATE_IN_PROGRESS
Item Metadata:  CREATE_IN_PROGRESS
Target:  CREATE_IN_PROGRESS
Related:  CREATE_IN_PROGRESS
Item Metadata:  CREATE_IN_PROGRESS
Target:  CREATE_IN_PROGRESS
Related:  CREATE_IN_PROGRESS
Item Metadata:  CREATE_IN_PROGRESS
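
The polling loop above runs until every job reaches a terminal state, with no upper bound on the wait. A variant with a timeout might look like the sketch below. `wait_for_all` is a hypothetical helper, not part of the notebook; it takes the status lookup as a function so it can be exercised without AWS, and the one-hour default timeout is an arbitrary assumption.

```python
import time

TERMINAL_STATUSES = {"ACTIVE", "CREATE_FAILED"}


def wait_for_all(get_status, job_ids, poll_seconds=30, timeout_seconds=3600, sleep=time.sleep):
    """Poll every job until all are terminal; raise TimeoutError if that takes too long."""
    waited = 0
    while True:
        statuses = {job_id: get_status(job_id) for job_id in job_ids}
        if all(status in TERMINAL_STATUSES for status in statuses.values()):
            return statuses
        if waited >= timeout_seconds:
            raise TimeoutError(f"jobs still running after {waited}s: {statuses}")
        sleep(poll_seconds)
        waited += poll_seconds


# Dry run with scripted statuses instead of real API calls
scripted = {
    "target": iter(["CREATE_PENDING", "CREATE_IN_PROGRESS", "ACTIVE"]),
    "related": iter(["CREATE_IN_PROGRESS", "ACTIVE", "ACTIVE"]),
}
print(wait_for_all(lambda job_id: next(scripted[job_id]), ["target", "related"], sleep=lambda s: None))
```

With the real client, `get_status` could be `lambda arn: forecast.describe_dataset_import_job(DatasetImportJobArn=arn)['Status']`, and `job_ids` the three import job ARNs. A job that ends in `CREATE_FAILED` is returned like any other terminal status, so the caller still needs to inspect the result.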


In [27]:
%store project
%store region
%store bucket_name
%store bucket_folder
%store role_arn
%store role_name
%store suffix
%store target_suffix
%store item_meta_suffix
%store related_suffix

%store item_meta_datasetGroupArn
%store target_second_datasetArn
%store related_datasetArn
%store item_meta_datasetArn
%store ds_target_second_import_job_arn
%store ds_related_import_job_arn
%store ds_itme_meta_import_job_arn




Stored 'project' (str)
Stored 'region' (str)
Stored 'bucket_name' (str)
Stored 'bucket_folder' (str)
Stored 'role_arn' (str)
Stored 'role_name' (str)
Stored 'suffix' (str)
Stored 'target_suffix' (str)
Stored 'item_meta_suffix' (str)
Stored 'related_suffix' (str)
Stored 'item_meta_datasetGroupArn' (str)
Stored 'target_second_datasetArn' (str)
Stored 'related_datasetArn' (str)
Stored 'item_meta_datasetArn' (str)
Stored 'ds_target_second_import_job_arn' (str)
Stored 'ds_related_import_job_arn' (str)
Stored 'ds_itme_meta_import_job_arn' (str)
