# [Module 2] Preparing Forecast Training Data (Import Dataset)
- This notebook takes the target_time_series.csv file created in Module 1 and prepares it so that Forecast can train on it. The notebook covers the following steps:

- Create IAM role
    - Create the role that the Forecast service uses when it accesses other services (for example, S3).
- Create a dataset group
    - Create the top-level Dataset Group that holds all of the datasets (Target Data Set, Related Data Set, Item Meta Data Set).
- Create a schema for a dataset
    - Define a schema that specifies the column names and column types of the Target Data Set, so that the Forecast service knows what data to expect.
- Create the dataset
    - Create the actual Target Data Set.
- Attach the dataset to the dataset group
    - Add the Target Data Set created above to the Dataset Group.
- Upload the Target Data to S3
    - Upload the target_time_series.csv file created in [Module 1] to S3.
- Create a dataset import job
    - Import the target_time_series.csv file uploaded to S3 into the Target Data Set so that the Forecast service can use it.
    
    
* This process takes about 5 minutes.


In [1]:
import boto3
from time import sleep
import os
import pandas as pd
import json
import time
import pprint
import numpy as np

In [2]:
%store -r

## Parameters

- Set DATASET_FREQUENCY to daily ("D"). For weekly data, use "W" instead. Set TIMESTAMP_FORMAT to yyyy-MM-dd.
- Specify the names of the target dataset, the related dataset, and the dataset group.

In [3]:
DATASET_FREQUENCY = "D" 
TIMESTAMP_FORMAT = "yyyy-MM-dd"

suffix = str(np.random.uniform())[4:9]


# Enter a project name
project = 'StoreItemDemand'



target_datasetName= project+'DS' + suffix
related_datasetName=project+'Related'+suffix
datasetGroupName= project +'DSG'+ suffix
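
The names above are built from a random suffix, so a quick validity check may be worthwhile before calling the Forecast API. Because the suffix is sliced from the string form of a random float, it can occasionally be shorter than five characters or, when the float prints in scientific notation, contain characters such as `-`. The sketch below assumes that Forecast resource names must start with a letter, contain only letters, digits, and underscores, and be at most 63 characters long; please verify this against the Forecast API reference. `is_valid_forecast_name` and `build_names` are hypothetical helpers, not part of boto3.

```python
import re

# Assumed constraint (verify against the Forecast API reference):
# 1-63 characters, starting with a letter, then letters, digits, or underscores.
_NAME_PATTERN = re.compile(r"[a-zA-Z][a-zA-Z0-9_]{0,62}")


def is_valid_forecast_name(name: str) -> bool:
    """Return True if `name` matches the assumed Forecast naming rule."""
    return _NAME_PATTERN.fullmatch(name) is not None


def build_names(project: str, suffix: str) -> dict:
    """Build the resource names the same way the cell above does, then validate them."""
    names = {
        "target_dataset": project + "DS" + suffix,
        "related_dataset": project + "Related" + suffix,
        "dataset_group": project + "DSG" + suffix,
    }
    invalid = [n for n in names.values() if not is_valid_forecast_name(n)]
    if invalid:
        raise ValueError(f"Invalid Forecast resource names: {invalid}")
    return names


names = build_names("StoreItemDemand", "88082")
print(names)
```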

In [4]:
with open('/opt/ml/metadata/resource-metadata.json') as notebook_info:
     data = json.load(notebook_info)
     resource_arn = data['ResourceArn']
     region = resource_arn.split(':')[3]
print(region)

ap-northeast-2
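
The cell above takes the region from the fourth colon-separated field of the notebook's ARN. A slightly more defensive version of the same idea might look like the sketch below. `region_from_arn` is a hypothetical helper, and the ARN in the example uses a made-up account ID and resource name.

```python
def region_from_arn(arn: str) -> str:
    """Extract the region field from an ARN.

    ARN layout: arn:partition:service:region:account-id:resource
    """
    # Split into at most 6 parts so colons inside the resource part are kept intact.
    parts = arn.split(":", 5)
    if len(parts) < 6 or parts[0] != "arn":
        raise ValueError(f"Not a valid ARN: {arn!r}")
    return parts[3]


# Hypothetical ARN used only for illustration.
example_arn = "arn:aws:sagemaker:ap-northeast-2:123456789012:notebook-instance/demo"
print(region_from_arn(example_arn))
```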


In [5]:
session = boto3.Session(region_name=region)
forecast = session.client(service_name='forecast')
forecast_query = session.client(service_name='forecastquery')

## Create role
**Make sure the role for the SageMaker notebook instance has these policies attached: AmazonSageMakerFullAccess, AmazonS3FullAccess, AmazonForecastFullAccess, IAMFullAccess.**
- Create the role ForecastRolePOCXXX (where XXX is the random suffix) and attach two policies to it: AmazonForecastFullAccess and AmazonS3FullAccess. The Forecast service uses this role when it accesses other services (for example, S3).

In [6]:
iam = boto3.client("iam")

# Put the role name
role_name = "ForecastRolePOC" + suffix
assume_role_policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Service": "forecast.amazonaws.com"
          },
          "Action": "sts:AssumeRole"
        }
    ]
}

create_role_response = iam.create_role(
    RoleName = role_name,
    AssumeRolePolicyDocument = json.dumps(assume_role_policy_document)
)

# AmazonForecastFullAccess is the AWS managed policy for the Forecast service.
# If you would like to limit S3 access to a single bucket, please consider creating and attaching a new policy
# that provides read access to your bucket, or attaching the AmazonS3ReadOnlyAccess policy to the role
policy_arn = "arn:aws:iam::aws:policy/AmazonForecastFullAccess"
iam.attach_role_policy(
    RoleName = role_name,
    PolicyArn = policy_arn
)

# Now add S3 support
iam.attach_role_policy(
    PolicyArn='arn:aws:iam::aws:policy/AmazonS3FullAccess',
    RoleName=role_name
)
time.sleep(60) # wait for a minute to allow IAM role policy attachment to propagate

role_arn = create_role_response["Role"]["Arn"]
print(role_arn)

arn:aws:iam::870180618679:role/ForecastRolePOC88082
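
AmazonS3FullAccess is broad. If a narrower grant is preferred, a bucket-scoped read policy could be built as in the sketch below. This is only an illustration: `build_s3_read_policy` and the bucket name `my-forecast-bucket` are hypothetical, and whether these two actions are sufficient for every Forecast operation (for example, exporting forecasts back to S3 needs write access) is an assumption to verify.

```python
import json


def build_s3_read_policy(bucket_name: str) -> dict:
    """Return an IAM policy document granting read-only access to one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket_name}",
            },
            {
                # Reading applies to the objects inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            },
        ],
    }


policy_json = json.dumps(build_s3_read_policy("my-forecast-bucket"), indent=2)
print(policy_json)
```

The resulting JSON string could then be passed as the `PolicyDocument` argument of `iam.put_role_policy` in place of attaching AmazonS3FullAccess.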


## Create DatasetGroup

In [7]:
create_dataset_group_response = forecast.create_dataset_group(
      DatasetGroupName= datasetGroupName,
      Domain="CUSTOM",
     )
datasetGroupArn = create_dataset_group_response['DatasetGroupArn']
datasetGroupArn

'arn:aws:forecast:ap-northeast-2:870180618679:dataset-group/StoreItemDemandDSG88082'

- Check the creation status of the dataset group.

In [8]:
forecast.describe_dataset_group(DatasetGroupArn=datasetGroupArn)

{'DatasetGroupName': 'StoreItemDemandDSG88082',
 'DatasetGroupArn': 'arn:aws:forecast:ap-northeast-2:870180618679:dataset-group/StoreItemDemandDSG88082',
 'DatasetArns': [],
 'Domain': 'CUSTOM',
 'Status': 'ACTIVE',
 'CreationTime': datetime.datetime(2020, 7, 12, 16, 23, 40, 631000, tzinfo=tzlocal()),
 'LastModificationTime': datetime.datetime(2020, 7, 12, 16, 23, 40, 631000, tzinfo=tzlocal()),
 'ResponseMetadata': {'RequestId': 'a4f4bdeb-3f52-4538-aef3-55f036332d1a',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'content-type': 'application/x-amz-json-1.1',
   'date': 'Sun, 12 Jul 2020 16:23:40 GMT',
   'x-amzn-requestid': 'a4f4bdeb-3f52-4538-aef3-55f036332d1a',
   'content-length': '274',
   'connection': 'keep-alive'},
  'RetryAttempts': 0}}

## Create schemas

In [9]:
# Specify the schema of your dataset here. Make sure the order of columns matches the raw data files.
target_schema ={
   "Attributes":[
      {
         "AttributeName":"timestamp",
         "AttributeType":"timestamp"
      },
      {
         "AttributeName":"item_id",
         "AttributeType":"string"
      },
      {
         "AttributeName":"store",
         "AttributeType":"string"
      },       
      {
         "AttributeName":"target_value",
         "AttributeType":"float"
      },
   ]
}

In [10]:
related_schema ={
   "Attributes":[
      {
         "AttributeName":"timestamp",
         "AttributeType":"timestamp"
      },
      {
         "AttributeName":"item_id",
         "AttributeType":"string"
      },
      {
         "AttributeName":"store",
         "AttributeType":"string"
      },       
      {
         "AttributeName":"is_holidays",
         "AttributeType":"integer"
      },
   ]
}
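
Since the schema must list columns in the same order as the raw data files, a small check before uploading can catch a mismatch early. The sketch below compares a list of column names (for example, `list(df.columns)` from the DataFrame written in Module 1) against a schema. It assumes the DataFrame columns are named exactly like the schema attributes, which may not hold for your data. `column_order_mismatches` is a hypothetical helper.

```python
from itertools import zip_longest

# Copy of the target schema defined above, repeated so the example is self-contained.
target_schema = {
    "Attributes": [
        {"AttributeName": "timestamp", "AttributeType": "timestamp"},
        {"AttributeName": "item_id", "AttributeType": "string"},
        {"AttributeName": "store", "AttributeType": "string"},
        {"AttributeName": "target_value", "AttributeType": "float"},
    ]
}


def column_order_mismatches(data_columns, schema):
    """Return (position, data_column, schema_column) for every position that differs."""
    expected = [attr["AttributeName"] for attr in schema["Attributes"]]
    # zip_longest pads the shorter list with None, so missing or extra columns show up too.
    pairs = zip_longest(data_columns, expected)
    return [(i, actual, wanted) for i, (actual, wanted) in enumerate(pairs) if actual != wanted]


print(column_order_mismatches(["timestamp", "item_id", "store", "target_value"], target_schema))
print(column_order_mismatches(["timestamp", "store", "item_id", "target_value"], target_schema))
```

An empty list means the order matches. Any tuple in the result points at the first place to look.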

## Create Dataset

In [11]:
response=forecast.create_dataset(
                    Domain="CUSTOM",
                    DatasetType='TARGET_TIME_SERIES',
                    DatasetName=target_datasetName,
                    DataFrequency=DATASET_FREQUENCY, 
                    Schema = target_schema
)
target_datasetArn = response['DatasetArn']
forecast.describe_dataset(DatasetArn=target_datasetArn)

{'DatasetArn': 'arn:aws:forecast:ap-northeast-2:870180618679:dataset/StoreItemDemandDS88082',
 'DatasetName': 'StoreItemDemandDS88082',
 'Domain': 'CUSTOM',
 'DatasetType': 'TARGET_TIME_SERIES',
 'DataFrequency': 'D',
 'Schema': {'Attributes': [{'AttributeName': 'timestamp',
    'AttributeType': 'timestamp'},
   {'AttributeName': 'item_id', 'AttributeType': 'string'},
   {'AttributeName': 'store', 'AttributeType': 'string'},
   {'AttributeName': 'target_value', 'AttributeType': 'float'}]},
 'EncryptionConfig': {},
 'Status': 'ACTIVE',
 'CreationTime': datetime.datetime(2020, 7, 12, 16, 23, 40, 745000, tzinfo=tzlocal()),
 'LastModificationTime': datetime.datetime(2020, 7, 12, 16, 23, 40, 745000, tzinfo=tzlocal()),
 'ResponseMetadata': {'RequestId': '57959ebe-a631-483a-9108-29280c13a6e3',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'content-type': 'application/x-amz-json-1.1',
   'date': 'Sun, 12 Jul 2020 16:23:40 GMT',
   'x-amzn-requestid': '57959ebe-a631-483a-9108-29280c13a6e3',
   'co

In [12]:
response=forecast.create_dataset(
                    Domain="CUSTOM",
                    DatasetType='RELATED_TIME_SERIES',
                    DatasetName=related_datasetName,
                    DataFrequency=DATASET_FREQUENCY, 
                    Schema = related_schema
)
related_datasetArn = response['DatasetArn']
forecast.describe_dataset(DatasetArn=related_datasetArn)

{'DatasetArn': 'arn:aws:forecast:ap-northeast-2:870180618679:dataset/StoreItemDemandRelated88082',
 'DatasetName': 'StoreItemDemandRelated88082',
 'Domain': 'CUSTOM',
 'DatasetType': 'RELATED_TIME_SERIES',
 'DataFrequency': 'D',
 'Schema': {'Attributes': [{'AttributeName': 'timestamp',
    'AttributeType': 'timestamp'},
   {'AttributeName': 'item_id', 'AttributeType': 'string'},
   {'AttributeName': 'store', 'AttributeType': 'string'},
   {'AttributeName': 'is_holidays', 'AttributeType': 'integer'}]},
 'EncryptionConfig': {},
 'Status': 'ACTIVE',
 'CreationTime': datetime.datetime(2020, 7, 12, 16, 23, 40, 891000, tzinfo=tzlocal()),
 'LastModificationTime': datetime.datetime(2020, 7, 12, 16, 23, 40, 891000, tzinfo=tzlocal()),
 'ResponseMetadata': {'RequestId': '9057852e-9d83-4a33-b3e5-6fb48ebbaf19',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'content-type': 'application/x-amz-json-1.1',
   'date': 'Sun, 12 Jul 2020 16:23:40 GMT',
   'x-amzn-requestid': '9057852e-9d83-4a33-b3e5-6fb48ebba

## Attach the target and related time series datasets to the DatasetGroup

In [13]:
# Attach the Dataset to the Dataset Group:
forecast.update_dataset_group(
    DatasetGroupArn=datasetGroupArn, 
    DatasetArns=[target_datasetArn,
                 related_datasetArn])


{'ResponseMetadata': {'RequestId': 'af9254c5-632a-4f48-a532-4234b3698175',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'content-type': 'application/x-amz-json-1.1',
   'date': 'Sun, 12 Jul 2020 16:23:40 GMT',
   'x-amzn-requestid': 'af9254c5-632a-4f48-a532-4234b3698175',
   'content-length': '2',
   'connection': 'keep-alive'},
  'RetryAttempts': 0}}
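
The update call returns only response metadata, so it can be useful to confirm that both datasets are actually listed in the group. The sketch below inspects the `DatasetArns` field of a `describe_dataset_group` response. `missing_dataset_arns` is a hypothetical helper, and the response and ARNs in the example are made up so that it runs without AWS access.

```python
def missing_dataset_arns(describe_response: dict, expected_arns: list) -> list:
    """Return the expected ARNs that are not attached to the dataset group."""
    attached = set(describe_response.get("DatasetArns", []))
    return [arn for arn in expected_arns if arn not in attached]


# Made-up response with the same shape as the describe_dataset_group output shown earlier.
fake_response = {
    "DatasetGroupName": "DemoDSG",
    "DatasetArns": ["arn:aws:forecast:ap-northeast-2:123456789012:dataset/DemoDS"],
    "Status": "ACTIVE",
}
expected = [
    "arn:aws:forecast:ap-northeast-2:123456789012:dataset/DemoDS",
    "arn:aws:forecast:ap-northeast-2:123456789012:dataset/DemoRelated",
]
print(missing_dataset_arns(fake_response, expected))
```

In this notebook the response would come from `forecast.describe_dataset_group(DatasetGroupArn=datasetGroupArn)`, and the expected list would be `[target_datasetArn, related_datasetArn]`.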

## Create a bucket

In [14]:
import boto3
import sagemaker

s3_resource = boto3.resource('s3')
s3 = boto3.client('s3')

# if you want, replace with a name of your S3 bucket
bucket_name = sagemaker.Session().default_bucket()  

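if s3_resource.Bucket(bucket_name).creation_date is None:
    # Bucket does not exist yet, so create it.
    # us-east-1 rejects an explicit LocationConstraint, so omit it there.
    if region == 'us-east-1':
        s3.create_bucket(Bucket=bucket_name)
    else:
        s3.create_bucket(Bucket=bucket_name, CreateBucketConfiguration={'LocationConstraint': region})
else: 
    # Bucket exists
    print("bucket name is ", bucket_name)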
    

bucket name is  sagemaker-ap-northeast-2-870180618679


### Upload target data to S3

- Upload the target_time_series.csv file created in Module 1 to S3.

In [15]:
# Upload Target File under a bucket folder
bucket_folder = project
s3_file_path = bucket_folder + "/" + target_time_series_filename

boto3.Session().resource('s3').Bucket(bucket_name).Object(s3_file_path).upload_file(target_time_series_path)
target_s3DataPath = "s3://"+bucket_name + "/" + s3_file_path
target_s3DataPath

's3://sagemaker-ap-northeast-2-870180618679/StoreItemDemand/target_time_series.csv'

In [16]:
s3_file_path = bucket_folder + "/" + related_time_series_filename

boto3.Session().resource('s3').Bucket(bucket_name).Object(s3_file_path).upload_file(related_time_series_path)
related_s3DataPath = "s3://"+bucket_name + "/" + s3_file_path
related_s3DataPath

's3://sagemaker-ap-northeast-2-870180618679/StoreItemDemand/related_holidays.csv'
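
Both upload cells build the object key and the `s3://` path by string concatenation. A pair of small helpers, sketched below, keeps that logic in one place and avoids doubled slashes when a folder name ends with `/`. `s3_key` and `s3_uri` are hypothetical names, and `my-bucket` is a placeholder.

```python
def s3_key(folder: str, filename: str) -> str:
    """Join a folder prefix and a file name into an S3 object key."""
    return f"{folder.strip('/')}/{filename.lstrip('/')}"


def s3_uri(bucket: str, key: str) -> str:
    """Build the s3:// URI that Forecast expects as the import path."""
    return f"s3://{bucket}/{key}"


key = s3_key("StoreItemDemand/", "target_time_series.csv")
print(s3_uri("my-bucket", key))
```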

## Create dataset import jobs to load the datasets from S3
- Import the data from S3 into the Target Data Set.
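
The import job is given a TimestampFormat, and the timestamps in the CSV need to agree with it. Before starting an import, a rough local check like the sketch below may help. It maps the two Forecast format strings to Python `strptime` patterns and reports values that fail to parse. `invalid_timestamps` is a hypothetical helper, the sample values are made up, and note that `strptime` is lenient about zero padding, so this is a sanity check rather than an exact replica of Forecast's validation.

```python
from datetime import datetime

# Python strptime equivalents of the Forecast timestamp format strings.
_STRPTIME_FORMATS = {
    "yyyy-MM-dd": "%Y-%m-%d",
    "yyyy-MM-dd HH:mm:ss": "%Y-%m-%d %H:%M:%S",
}


def invalid_timestamps(values, timestamp_format):
    """Return the values that do not parse under the given Forecast timestamp format."""
    fmt = _STRPTIME_FORMATS[timestamp_format]
    bad = []
    for value in values:
        try:
            datetime.strptime(value, fmt)
        except ValueError:
            bad.append(value)
    return bad


sample = ["2013-01-01", "01/02/2013", "2013-13-01"]
print(invalid_timestamps(sample, "yyyy-MM-dd"))
```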

In [17]:
# Finally, import the target dataset from S3
datasetImportJobName = 'DATASET_IMPORT_JOB_TARGET' + suffix
response=forecast.create_dataset_import_job(DatasetImportJobName=datasetImportJobName,
                                                          DatasetArn=target_datasetArn,
                                                          DataSource= {
                                                              "S3Config" : {
                                                                 "Path":target_s3DataPath,
                                                                 "RoleArn": role_arn
                                                              } 
                                                          },
                                                          TimestampFormat=TIMESTAMP_FORMAT
                                                         )

In [18]:
target_import_job_arn=response['DatasetImportJobArn']
print(target_import_job_arn)

arn:aws:forecast:ap-northeast-2:870180618679:dataset-import-job/StoreItemDemandDS88082/DATASET_IMPORT_JOB_TARGET88082


In [19]:
# Import the related dataset from S3 in the same way
datasetImportJobName = 'DATASET_IMPORT_JOB_RELATED' + suffix
response=forecast.create_dataset_import_job(DatasetImportJobName=datasetImportJobName,
                                                          DatasetArn=related_datasetArn,
                                                          DataSource= {
                                                              "S3Config" : {
                                                                 "Path":related_s3DataPath,
                                                                 "RoleArn": role_arn
                                                              } 
                                                          },
                                                          TimestampFormat=TIMESTAMP_FORMAT
                                                         )

In [20]:
related_import_job_arn=response['DatasetImportJobArn']
print(related_import_job_arn)

arn:aws:forecast:ap-northeast-2:870180618679:dataset-import-job/StoreItemDemandRelated88082/DATASET_IMPORT_JOB_RELATED88082


In [21]:
%%time

while True:
    target_Status = forecast.describe_dataset_import_job(DatasetImportJobArn=target_import_job_arn)['Status']
    related_Status = forecast.describe_dataset_import_job(DatasetImportJobArn=related_import_job_arn)['Status']
    print("target_import_status:{}".format(target_Status))
    print("related_import_status:{}".format(related_Status))
    if (target_Status != 'ACTIVE' and target_Status != 'CREATE_FAILED') or (related_Status != 'ACTIVE' and related_Status != 'CREATE_FAILED') :
        sleep(30)
    else:
        break

target_import_status:CREATE_PENDING
related_import_status:CREATE_PENDING
target_import_status:CREATE_IN_PROGRESS
related_import_status:CREATE_IN_PROGRESS
target_import_status:CREATE_IN_PROGRESS
related_import_status:CREATE_IN_PROGRESS
target_import_status:CREATE_IN_PROGRESS
related_import_status:CREATE_IN_PROGRESS
target_import_status:CREATE_IN_PROGRESS
related_import_status:CREATE_IN_PROGRESS
target_import_status:CREATE_IN_PROGRESS
related_import_status:CREATE_IN_PROGRESS
target_import_status:CREATE_IN_PROGRESS
related_import_status:CREATE_IN_PROGRESS
target_import_status:CREATE_IN_PROGRESS
related_import_status:CREATE_IN_PROGRESS
target_import_status:CREATE_IN_PROGRESS
related_import_status:CREATE_IN_PROGRESS
target_import_status:CREATE_IN_PROGRESS
related_import_status:CREATE_IN_PROGRESS
target_import_status:CREATE_IN_PROGRESS
related_import_status:CREATE_IN_PROGRESS
target_import_status:CREATE_IN_PROGRESS
related_import_status:CREATE_IN_PROGRESS
target_import_status:CREATE_IN_PROGR
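
The polling loop above can be generalized into a helper that waits for any number of jobs and gives up after a timeout instead of looping forever. The sketch below is one possible shape, not the notebook's own method. `wait_for_all` is a hypothetical helper that takes zero-argument callables returning a status string, so it can be exercised here with fake status sequences and no AWS calls.

```python
import time

# Statuses at which an import job has stopped changing.
TERMINAL_STATUSES = {"ACTIVE", "CREATE_FAILED"}


def wait_for_all(get_status_fns, poll_seconds=30, timeout_seconds=3600, sleep=time.sleep):
    """Poll every status callable until all report a terminal status.

    get_status_fns maps a label to a zero-argument callable returning a status string.
    Returns the final {label: status} dict, or raises TimeoutError.
    """
    waited = 0
    while True:
        statuses = {name: fn() for name, fn in get_status_fns.items()}
        print(statuses)
        if all(status in TERMINAL_STATUSES for status in statuses.values()):
            return statuses
        if waited >= timeout_seconds:
            raise TimeoutError(f"Still waiting after {waited} seconds: {statuses}")
        sleep(poll_seconds)
        waited += poll_seconds


# Fake status sequences standing in for the two import jobs.
target_seq = iter(["CREATE_PENDING", "CREATE_IN_PROGRESS", "ACTIVE"])
related_seq = iter(["CREATE_PENDING", "ACTIVE", "ACTIVE"])
result = wait_for_all(
    {"target": lambda: next(target_seq), "related": lambda: next(related_seq)},
    sleep=lambda seconds: None,  # skip real sleeping in this demonstration
)
```

In this notebook each callable could wrap `forecast.describe_dataset_import_job(DatasetImportJobArn=...)['Status']` for the target and related import job ARNs. A final status of `CREATE_FAILED` still ends the wait, so the returned dict should be checked before moving on.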

In [22]:
%store project
%store suffix
%store region
%store target_import_job_arn
%store related_import_job_arn
%store datasetGroupArn
%store target_datasetArn
%store related_datasetArn
%store bucket_name
%store bucket_folder
%store role_arn
%store role_name
%store validation_stores_sales

Stored 'project' (str)
Stored 'suffix' (str)
Stored 'region' (str)
Stored 'target_import_job_arn' (str)
Stored 'related_import_job_arn' (str)
Stored 'datasetGroupArn' (str)
Stored 'target_datasetArn' (str)
Stored 'related_datasetArn' (str)
Stored 'bucket_name' (str)
Stored 'bucket_folder' (str)
Stored 'role_arn' (str)
Stored 'role_name' (str)
Stored 'validation_stores_sales' (DataFrame)
