# [Module 5] Create Dataset Group


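* Create an IAM role and assign permissions
    - Create the role that the Personalize service will use and grant it the required permissions.
* Create dataset groups (DatasetGroup)
* Create data schemas
* Create datasets (Dataset)
* Import data (load from S3 into the Personalize service)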

---

## 0. Environment Setup

In [1]:
# Imports
import boto3
import json
import numpy as np
import pandas as pd
import time
from datetime import datetime

import matplotlib.pyplot as plt
from matplotlib.dates import DateFormatter
import matplotlib.dates as mdate
from botocore.exceptions import ClientError

Next, confirm that your environment can communicate successfully with Amazon Personalize.

In [2]:
# Configure the SDK to Personalize:
personalize = boto3.client('personalize')

Define a suffix so that random digits can be appended to the names of the objects you create.

In [3]:
suffix = str(np.random.uniform())[4:9]
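
The resource names used later in this notebook are fixed strings, so the suffix is defined but not applied. As a minimal sketch of how it could be used, the helpers below (`make_suffix` and `with_suffix` are hypothetical names, not part of the original notebook) generate a fixed-length numeric suffix and append it to a resource name.

```python
import secrets


def make_suffix(length: int = 5) -> str:
    """Return a string of `length` random decimal digits."""
    return "".join(secrets.choice("0123456789") for _ in range(length))


def with_suffix(base_name: str, suffix: str) -> str:
    """Append the suffix to a resource name, separated by a hyphen."""
    return "{}-{}".format(base_name, suffix)


example_suffix = make_suffix()
# The printed name differs on every run because the suffix is random.
print(with_suffix("RetailDemo-base-dataset-group", example_suffix))
```

Appending a suffix like this avoids name collisions when the notebook is run more than once in the same account and region.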

In [4]:
%store -r

## 1. S3 Access Permissions for the Personalize Service
The Personalize service needs permission to access the S3 bucket. The bucket policy below grants that access.

In [5]:
s3 = boto3.client("s3")

policy = {
    "Version": "2012-10-17",
    "Id": "PersonalizeS3BucketAccessPolicy",
    "Statement": [
        {
            "Sid": "PersonalizeS3BucketAccessPolicy",
            "Effect": "Allow",
            "Principal": {
                "Service": "personalize.amazonaws.com"
            },
            "Action": [
                "s3:*",
            ],
            "Resource": [
                "arn:aws:s3:::{}".format(bucket),
                "arn:aws:s3:::{}/*".format(bucket)
            ]
        }
    ]
}

s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))

{'ResponseMetadata': {'RequestId': 'G0GHQPG95DHFGYTD',
  'HostId': '417v8TJOr9fJrBS0h5hb7aXj6hN2GAxt63H9++llaz3DSkPHKPbuA7+ZgHxLeJonnOJFKFKa43k=',
  'HTTPStatusCode': 204,
  'HTTPHeaders': {'x-amz-id-2': '417v8TJOr9fJrBS0h5hb7aXj6hN2GAxt63H9++llaz3DSkPHKPbuA7+ZgHxLeJonnOJFKFKa43k=',
   'x-amz-request-id': 'G0GHQPG95DHFGYTD',
   'date': 'Mon, 06 Mar 2023 08:48:28 GMT',
   'server': 'AmazonS3'},
  'RetryAttempts': 0}}
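
The policy above grants `s3:*`, which is broader than an import needs. The sketch below builds a narrower policy, assuming that read-only access (`s3:GetObject` and `s3:ListBucket`) is sufficient for importing data; exporting data back to the bucket would need additional write permissions. `build_personalize_bucket_policy` and the bucket name `my-example-bucket` are hypothetical.

```python
import json


def build_personalize_bucket_policy(bucket_name: str) -> dict:
    """Build a bucket policy that lets Personalize list the bucket and read objects."""
    return {
        "Version": "2012-10-17",
        "Id": "PersonalizeS3BucketAccessPolicy",
        "Statement": [
            {
                "Sid": "PersonalizeS3BucketAccessPolicy",
                "Effect": "Allow",
                "Principal": {"Service": "personalize.amazonaws.com"},
                # Assumption: read-only actions are enough for dataset import.
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    "arn:aws:s3:::{}".format(bucket_name),
                    "arn:aws:s3:::{}/*".format(bucket_name),
                ],
            }
        ],
    }


policy_json = json.dumps(build_personalize_bucket_policy("my-example-bucket"), indent=2)
print(policy_json)
```

The resulting JSON string could be passed to `s3.put_bucket_policy` in place of the broader policy.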

## Personalize IAM Role

Amazon Personalize also needs the ability to assume a role in AWS so that it has permission to perform certain tasks.
For example, Personalize must access S3, so it needs a role for that purpose, and that role needs S3 access permissions.

In [6]:
# If you already have a role you have been using, enter its ARN below and use it.

role_arn = "arn:aws:iam::376278017302:role/service-role/AmazonSageMaker-ExecutionRole-20230112T204234"
print(role_arn)

arn:aws:iam::376278017302:role/service-role/AmazonSageMaker-ExecutionRole-20230112T204234
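
The cell above reuses an existing role. If no suitable role exists, one could be created along the following lines. This is a sketch under stated assumptions: `create_personalize_role` is a hypothetical helper, the caller has IAM permissions, and the role name and managed policy in the usage comment are placeholders chosen for illustration.

```python
import json

# Trust policy that allows the Personalize service to assume the role.
PERSONALIZE_TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "personalize.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}


def create_personalize_role(iam_client, role_name, policy_arn):
    """Create a role Personalize can assume, attach one managed policy, return the ARN."""
    response = iam_client.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(PERSONALIZE_TRUST_POLICY),
    )
    iam_client.attach_role_policy(RoleName=role_name, PolicyArn=policy_arn)
    return response["Role"]["Arn"]


# Example usage (placeholders; requires `import boto3` and IAM permissions):
# iam = boto3.client("iam")
# role_arn = create_personalize_role(
#     iam, "PersonalizeRoleDemo", "arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess"
# )
```

A newly created role can take a short while to propagate, so it may be necessary to wait before using it in an import job.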


## 2. Create Dataset Groups and Wait

The largest unit in Personalize is the **Dataset Group**, which isolates your data, event trackers, solutions, and campaigns. It groups together things that share a common collection of data. Feel free to change the group names below if you like.

### 2.1 Create Dataset Group - Base Dataset

In [7]:
create_base_dataset_group_response = personalize.create_dataset_group(
    name = "RetailDemo-base-dataset-group"
)

In [8]:
base_dataset_group_arn = create_base_dataset_group_response['datasetGroupArn']
base_dataset_group_arn

'arn:aws:personalize:us-east-1:376278017302:dataset-group/RetailDemo-base-dataset-group'
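
Re-running the creation cell fails if a dataset group with the same name already exists. A minimal sketch of a look-up-then-create approach follows; `find_dataset_group_arn` and `get_or_create_dataset_group` are hypothetical helpers that take the `personalize` client as an argument.

```python
def find_dataset_group_arn(personalize_client, name):
    """Return the ARN of the dataset group called `name`, or None if it is absent."""
    kwargs = {}
    while True:
        page = personalize_client.list_dataset_groups(**kwargs)
        for group in page.get("datasetGroups", []):
            if group["name"] == name:
                return group["datasetGroupArn"]
        token = page.get("nextToken")
        if not token:
            return None
        # Follow pagination until every page has been checked.
        kwargs = {"nextToken": token}


def get_or_create_dataset_group(personalize_client, name):
    """Reuse an existing dataset group with this name, or create a new one."""
    existing_arn = find_dataset_group_arn(personalize_client, name)
    if existing_arn is not None:
        return existing_arn
    response = personalize_client.create_dataset_group(name=name)
    return response["datasetGroupArn"]


# Example usage with the client defined earlier in this notebook:
# base_dataset_group_arn = get_or_create_dataset_group(
#     personalize, "RetailDemo-base-dataset-group"
# )
```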

#### Wait Until the Dataset Group Becomes Active

A dataset group must be active before it can be used in any of the steps below. Run the cell below and wait until it shows DatasetGroup: ACTIVE.

In [10]:
max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_dataset_group_response = personalize.describe_dataset_group(
        datasetGroupArn = base_dataset_group_arn
    )
    status = describe_dataset_group_response["datasetGroup"]["status"]
    print("Base-DatasetGroup: {}".format(status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(15)

Base-DatasetGroup: ACTIVE
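
The same polling loop appears several times in this notebook. As a sketch, it could be factored into one helper; `wait_for_status` is a hypothetical name, and raising on failure or timeout is a design choice that differs from the loop above, which simply stops.

```python
import time


def wait_for_status(get_status, label, poll_seconds=15,
                    timeout_seconds=3 * 60 * 60, sleep=time.sleep):
    """Poll get_status() until it returns ACTIVE; raise on failure or timeout."""
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        status = get_status()
        print("{}: {}".format(label, status))
        if status == "ACTIVE":
            return status
        if status == "CREATE FAILED":
            raise RuntimeError("{} failed to create".format(label))
        sleep(poll_seconds)
    raise TimeoutError("{} did not become ACTIVE in time".format(label))


# Example usage (assumes `personalize` and `base_dataset_group_arn` from above):
# wait_for_status(
#     lambda: personalize.describe_dataset_group(
#         datasetGroupArn=base_dataset_group_arn
#     )["datasetGroup"]["status"],
#     "Base-DatasetGroup",
# )
```

Passing the status lookup as a callable lets the same helper serve dataset groups, datasets, and import jobs.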


### 2.2 Create Dataset Group - CVR-Applied Dataset

In [11]:
create_cvr_dataset_group_response = personalize.create_dataset_group(
    name = "RetailDemo-cvr-dataset-group"
)

In [12]:
cvr_dataset_group_arn = create_cvr_dataset_group_response['datasetGroupArn']
cvr_dataset_group_arn

'arn:aws:personalize:us-east-1:376278017302:dataset-group/RetailDemo-cvr-dataset-group'

#### Wait Until the Dataset Group Becomes Active

A dataset group must be active before it can be used in any of the steps below. Run the cell below and wait until it shows DatasetGroup: ACTIVE.

In [13]:
max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_dataset_group_response = personalize.describe_dataset_group(
        datasetGroupArn = cvr_dataset_group_arn
    )
    status = describe_dataset_group_response["datasetGroup"]["status"]
    print("CVR-DatasetGroup: {}".format(status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(15)

CVR-DatasetGroup: CREATE PENDING
CVR-DatasetGroup: CREATE PENDING
CVR-DatasetGroup: ACTIVE


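## 3. Create Schemas

A core part of how Personalize understands your data comes from the schemas defined below. Each schema tells the Personalize service how to interpret the data provided through the CSV files. The columns and types must match the contents of the files created earlier.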
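
Because the schema fields must line up with the CSV columns, a quick local check before uploading can catch mismatches early. The sketch below compares a CSV header with a schema's field names; `check_csv_header` is a hypothetical helper and the sample row is made-up illustration data.

```python
import csv
import io


def schema_field_names(schema: dict) -> list:
    """Return the field names declared in a Personalize schema dictionary."""
    return [field["name"] for field in schema["fields"]]


def check_csv_header(csv_text: str, schema: dict) -> list:
    """Return the schema fields missing from the CSV header (empty if none)."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return [name for name in schema_field_names(schema) if name not in header]


example_schema = {
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {"name": "USER_ID", "type": "string"},
        {"name": "ITEM_ID", "type": "string"},
        {"name": "EVENT_TYPE", "type": "string"},
        {"name": "TIMESTAMP", "type": "long"},
    ],
    "version": "1.0",
}

# Made-up sample data for illustration only.
sample_csv = "USER_ID,ITEM_ID,EVENT_TYPE,TIMESTAMP\n1,42,View,1591803788\n"
print(check_csv_header(sample_csv, example_schema))  # prints []
```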

### 3.1 Interactions

In [14]:
interaction_schema_name="RetailDemo-interaction-schema"

schema = {
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {
            "name": "USER_ID",
            "type": "string"
        },
        {
            "name": "ITEM_ID",
            "type": "string"
        },
        { 
            "name": "EVENT_TYPE",
            "type": "string"
        },        
        {
            "name": "TIMESTAMP",
            "type": "long"
        }
    ],
    "version": "1.0"
}


create_schema_response = personalize.create_schema( 
    name = interaction_schema_name,
    schema = json.dumps(schema)
)

interaction_schema_arn = create_schema_response['schemaArn']
print(json.dumps(create_schema_response, indent=2))

{
  "schemaArn": "arn:aws:personalize:us-east-1:376278017302:schema/RetailDemo-interaction-schema",
  "ResponseMetadata": {
    "RequestId": "d2a0b5f1-8940-4d1e-ade9-fa809f75dafb",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Mon, 06 Mar 2023 08:52:07 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "95",
      "connection": "keep-alive",
      "x-amzn-requestid": "d2a0b5f1-8940-4d1e-ade9-fa809f75dafb"
    },
    "RetryAttempts": 0
  }
}


### 3.2 Items

In [15]:
base_item_schema_name="RetailDemo-base-item-schema"

schema = {
    "type": "record",
    "name": "Items",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
    {
        "name": "ITEM_ID",
        "type": "string"
    },
    {
        "name": "NAME",
        "type": "string"
    },
    {
      "name": "CATEGORY_L1",
      "type": [
        "string"
      ],
      "categorical": True
    },
    {
      "name": "STYLE",
      "type": [
        "string"
      ],
      "categorical": True
    },
    {
        "name": "PRODUCT_DESCRIPTION",
        "type": "string"
    },
    {
      "name": "PRICE",
      "type": "float"
    },    
    ],
    "version": "1.0"
}

create_metadata_schema_response = personalize.create_schema(      
    name = base_item_schema_name,
    schema = json.dumps(schema)
)

base_item_schema_arn = create_metadata_schema_response['schemaArn']
print(json.dumps(create_metadata_schema_response, indent=2))

{
  "schemaArn": "arn:aws:personalize:us-east-1:376278017302:schema/RetailDemo-base-item-schema",
  "ResponseMetadata": {
    "RequestId": "9c7c56c4-063b-489d-8a48-617951659e9b",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Mon, 06 Mar 2023 08:52:12 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "93",
      "connection": "keep-alive",
      "x-amzn-requestid": "9c7c56c4-063b-489d-8a48-617951659e9b"
    },
    "RetryAttempts": 0
  }
}


In [16]:
cvr_item_schema_name="RetailDemo-cvr-item-schema"

schema = {
    "type": "record",
    "name": "Items",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
    {
        "name": "ITEM_ID",
        "type": "string"
    },
    {
      "name": "CVR",
      "type": "float"
    }, 
    {
        "name": "NAME",
        "type": "string"
    },
    {
      "name": "CATEGORY_L1",
      "type": [
        "string"
      ],
      "categorical": True
    },
    {
      "name": "STYLE",
      "type": [
        "string"
      ],
      "categorical": True
    },
    {
        "name": "PRODUCT_DESCRIPTION",
        "type": "string"
    },
    {
      "name": "PRICE",
      "type": "float"
    }     
    ],
    "version": "1.0"
}

create_metadata_schema_response = personalize.create_schema(      
    name = cvr_item_schema_name,
    schema = json.dumps(schema)
)

cvr_item_schema_arn = create_metadata_schema_response['schemaArn']
print(json.dumps(create_metadata_schema_response, indent=2))

{
  "schemaArn": "arn:aws:personalize:us-east-1:376278017302:schema/RetailDemo-cvr-item-schema",
  "ResponseMetadata": {
    "RequestId": "8217f568-6367-48b3-951f-0d1a2b9cb03b",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Mon, 06 Mar 2023 08:52:16 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "92",
      "connection": "keep-alive",
      "x-amzn-requestid": "8217f568-6367-48b3-951f-0d1a2b9cb03b"
    },
    "RetryAttempts": 0
  }
}


### 3.3 Users

In [17]:
user_schema_name="RetailDemo-user-schema"

schema = {
    "type": "record",
    "name": "Users",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
    {
        "name": "USER_ID",
        "type": "string"
    },
    {
      "name": "USER_NAME",
      "type": "string"
    },        
    {
      "name": "GENDER",
      "type": [
        "string"
      ],
      "categorical": True
    }        
    ],
    "version": "1.0"
}

create_metadata_schema_response = personalize.create_schema(      
    name = user_schema_name,
    schema = json.dumps(schema)
)

user_schema_arn = create_metadata_schema_response['schemaArn']
print(json.dumps(create_metadata_schema_response, indent=2))

{
  "schemaArn": "arn:aws:personalize:us-east-1:376278017302:schema/RetailDemo-user-schema",
  "ResponseMetadata": {
    "RequestId": "42377114-190f-46c6-a39c-2fedbfd91a47",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Mon, 06 Mar 2023 08:52:37 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "88",
      "connection": "keep-alive",
      "x-amzn-requestid": "42377114-190f-46c6-a39c-2fedbfd91a47"
    },
    "RetryAttempts": 0
  }
}


## 4. Create Datasets

After the groups, the next things to create are the actual datasets. Run the code cells below to create them.

### 4.1 Create Interactions Dataset (Base Dataset)

In [18]:
dataset_type = "INTERACTIONS"
create_base_dataset_response = personalize.create_dataset(
    name = "RetailDemo-base-interaction-dataset",
    datasetType = dataset_type,
    datasetGroupArn = base_dataset_group_arn,
    schemaArn = interaction_schema_arn
)

base_interaction_dataset_arn = create_base_dataset_response['datasetArn']
print(json.dumps(create_base_dataset_response, indent=2))

{
  "datasetArn": "arn:aws:personalize:us-east-1:376278017302:dataset/RetailDemo-base-dataset-group/INTERACTIONS",
  "ResponseMetadata": {
    "RequestId": "786337a0-20f5-44c4-b3b3-1a4400b3b9ac",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Mon, 06 Mar 2023 08:58:59 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "110",
      "connection": "keep-alive",
      "x-amzn-requestid": "786337a0-20f5-44c4-b3b3-1a4400b3b9ac"
    },
    "RetryAttempts": 0
  }
}


### 4.2 Create Items Dataset (Base Dataset)

In [19]:
dataset_type = "ITEMS"
create_base_item_dataset_response = personalize.create_dataset(
    name = "RetailDemo-base-item-dataset",
    datasetType = dataset_type,
    datasetGroupArn = base_dataset_group_arn,
    schemaArn = base_item_schema_arn,
  
)

base_item_dataset_arn = create_base_item_dataset_response['datasetArn']
print(json.dumps(create_base_item_dataset_response, indent=2))

{
  "datasetArn": "arn:aws:personalize:us-east-1:376278017302:dataset/RetailDemo-base-dataset-group/ITEMS",
  "ResponseMetadata": {
    "RequestId": "04790954-ef9d-4e2a-823d-830dd4f33bee",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Mon, 06 Mar 2023 08:59:01 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "103",
      "connection": "keep-alive",
      "x-amzn-requestid": "04790954-ef9d-4e2a-823d-830dd4f33bee"
    },
    "RetryAttempts": 0
  }
}


### 4.3 Create Users Dataset (Base Dataset)

In [20]:
dataset_type = "USERS"
create_base_user_dataset_response = personalize.create_dataset(
    name = "RetailDemo-base-user-dataset",
    datasetType = dataset_type,
    datasetGroupArn = base_dataset_group_arn,
    schemaArn = user_schema_arn,
  
)

base_user_dataset_arn = create_base_user_dataset_response['datasetArn']
print(json.dumps(create_base_user_dataset_response, indent=2))

{
  "datasetArn": "arn:aws:personalize:us-east-1:376278017302:dataset/RetailDemo-base-dataset-group/USERS",
  "ResponseMetadata": {
    "RequestId": "407b08ab-5550-42e3-860d-20e9abc486fb",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Mon, 06 Mar 2023 08:59:02 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "103",
      "connection": "keep-alive",
      "x-amzn-requestid": "407b08ab-5550-42e3-860d-20e9abc486fb"
    },
    "RetryAttempts": 0
  }
}


### 4.4 Create Interactions Dataset (CVR-Applied Dataset)

In [21]:
dataset_type = "INTERACTIONS"
create_cvr_dataset_response = personalize.create_dataset(
    name = "RetailDemo-cvr-interaction-dataset",
    datasetType = dataset_type,
    datasetGroupArn = cvr_dataset_group_arn,
    schemaArn = interaction_schema_arn
)

cvr_interaction_dataset_arn = create_cvr_dataset_response['datasetArn']
print(json.dumps(create_cvr_dataset_response, indent=2))

{
  "datasetArn": "arn:aws:personalize:us-east-1:376278017302:dataset/RetailDemo-cvr-dataset-group/INTERACTIONS",
  "ResponseMetadata": {
    "RequestId": "c3285be5-d6c9-4f0b-8832-a8b71f29ab76",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Mon, 06 Mar 2023 08:59:03 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "109",
      "connection": "keep-alive",
      "x-amzn-requestid": "c3285be5-d6c9-4f0b-8832-a8b71f29ab76"
    },
    "RetryAttempts": 0
  }
}


### 4.5 Create Items Dataset (CVR-Applied Dataset)

In [22]:
dataset_type = "ITEMS"
create_cvr_item_dataset_response = personalize.create_dataset(
    name = "RetailDemo-cvr-item-dataset",
    datasetType = dataset_type,
    datasetGroupArn = cvr_dataset_group_arn,
    schemaArn = cvr_item_schema_arn,
  
)

cvr_item_dataset_arn = create_cvr_item_dataset_response['datasetArn']
print(json.dumps(create_cvr_item_dataset_response, indent=2))

{
  "datasetArn": "arn:aws:personalize:us-east-1:376278017302:dataset/RetailDemo-cvr-dataset-group/ITEMS",
  "ResponseMetadata": {
    "RequestId": "d1f92fbb-4913-4219-997d-bb415629f1eb",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Mon, 06 Mar 2023 08:59:04 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "102",
      "connection": "keep-alive",
      "x-amzn-requestid": "d1f92fbb-4913-4219-997d-bb415629f1eb"
    },
    "RetryAttempts": 0
  }
}


### 4.6 Create Users Dataset (CVR-Applied Dataset)

In [23]:
dataset_type = "USERS"
create_cvr_user_dataset_response = personalize.create_dataset(
    name = "RetailDemo-cvr-user-dataset",
    datasetType = dataset_type,
    datasetGroupArn = cvr_dataset_group_arn,
    schemaArn = user_schema_arn,
  
)

cvr_user_dataset_arn = create_cvr_user_dataset_response['datasetArn']
print(json.dumps(create_cvr_user_dataset_response, indent=2))

{
  "datasetArn": "arn:aws:personalize:us-east-1:376278017302:dataset/RetailDemo-cvr-dataset-group/USERS",
  "ResponseMetadata": {
    "RequestId": "a1c062cb-61a9-4bec-8a0e-b710c4d2488a",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Mon, 06 Mar 2023 08:59:05 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "102",
      "connection": "keep-alive",
      "x-amzn-requestid": "a1c062cb-61a9-4bec-8a0e-b710c4d2488a"
    },
    "RetryAttempts": 0
  }
}


#### Wait One Minute for the Datasets to Be Created

In [24]:
time.sleep(60)
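
A fixed one-minute sleep may be longer or shorter than needed. As an alternative sketch, the datasets could be polled with `describe_dataset` until each one reports ACTIVE. `wait_for_datasets` is a hypothetical helper, and the timeout and polling interval are arbitrary assumptions.

```python
import time


def wait_for_datasets(personalize_client, dataset_arns, poll_seconds=10,
                      timeout_seconds=600, sleep=time.sleep):
    """Poll describe_dataset for each ARN until every dataset is ACTIVE."""
    pending = set(dataset_arns)
    deadline = time.time() + timeout_seconds
    while pending and time.time() < deadline:
        # Iterate over a sorted copy so the set can be modified safely.
        for arn in sorted(pending):
            response = personalize_client.describe_dataset(datasetArn=arn)
            status = response["dataset"]["status"]
            if status == "ACTIVE":
                pending.discard(arn)
            elif status == "CREATE FAILED":
                raise RuntimeError("Dataset creation failed: {}".format(arn))
        if pending:
            sleep(poll_seconds)
    if pending:
        raise TimeoutError("Datasets not ACTIVE in time: {}".format(sorted(pending)))


# Example usage (assumes the client and ARNs created earlier in this notebook):
# wait_for_datasets(personalize, [
#     base_interaction_dataset_arn, base_item_dataset_arn, base_user_dataset_arn,
#     cvr_interaction_dataset_arn, cvr_item_dataset_arn, cvr_user_dataset_arn,
# ])
```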

## 5. Import Datasets

Now that the dataset groups and datasets that will hold the information have been created,
run import jobs that load the data from S3 into Amazon Personalize for model building.
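
The cells that follow create one import job per dataset with nearly identical calls. A compact sketch of the same pattern is shown here; `start_import_jobs` is a hypothetical helper that takes the client, bucket, role ARN, and a list of `(job_name, dataset_arn, filename)` tuples.

```python
def start_import_jobs(personalize_client, bucket_name, role_arn, job_specs):
    """Start one dataset import job per spec and return {job_name: job_arn}."""
    job_arns = {}
    for job_name, dataset_arn, filename in job_specs:
        response = personalize_client.create_dataset_import_job(
            jobName=job_name,
            datasetArn=dataset_arn,
            dataSource={
                "dataLocation": "s3://{}/{}".format(bucket_name, filename)
            },
            roleArn=role_arn,
        )
        job_arns[job_name] = response["datasetImportJobArn"]
    return job_arns


# Example usage (assumes variables restored by `%store -r` and defined above):
# job_arns = start_import_jobs(personalize, bucket, role_arn, [
#     ("RetailDemo-base-item-dataset-import",
#      base_item_dataset_arn, base_items_filename),
#     ("RetailDemo-base-user-dataset-import",
#      base_user_dataset_arn, base_users_filename),
# ])
```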



#### Create Interactions Dataset Import Job (Base Dataset)

In [25]:
create_dataset_import_job_response = personalize.create_dataset_import_job(
    jobName = "RetailDeom-base-interaction-dataset-import",
    datasetArn = base_interaction_dataset_arn,
    dataSource = {
        "dataLocation": "s3://{}/{}".format(bucket, base_warm_train_interaction_filename)
    },
    roleArn = role_arn
)

base_interation_dataset_import_job_arn = create_dataset_import_job_response['datasetImportJobArn']
print(json.dumps(create_dataset_import_job_response, indent=2))

{
  "datasetImportJobArn": "arn:aws:personalize:us-east-1:376278017302:dataset-import-job/RetailDeom-base-interaction-dataset-import",
  "ResponseMetadata": {
    "RequestId": "bae5dc74-a4a1-4fb7-b8cb-cf78b0a900ea",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Mon, 06 Mar 2023 09:06:43 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "130",
      "connection": "keep-alive",
      "x-amzn-requestid": "bae5dc74-a4a1-4fb7-b8cb-cf78b0a900ea"
    },
    "RetryAttempts": 0
  }
}


#### Create Items Dataset Import Job (Base Dataset)

In [26]:
create_item_dataset_import_job_response = personalize.create_dataset_import_job(
    jobName = "RetailDemo-base-item-dataset-import",
    datasetArn = base_item_dataset_arn,
    dataSource = {
        "dataLocation": "s3://{}/{}".format(bucket, base_items_filename)
    },
    roleArn = role_arn
)

base_item_dataset_import_job_arn = create_item_dataset_import_job_response['datasetImportJobArn']
print(json.dumps(create_item_dataset_import_job_response, indent=2))



#### Create Users Dataset Import Job (Base Dataset)

In [27]:
create_user_dataset_import_job_response = personalize.create_dataset_import_job(
    jobName = "RetailDemo-base-user-dataset-import",
    datasetArn = base_user_dataset_arn,
    dataSource = {
        "dataLocation": "s3://{}/{}".format(bucket, base_users_filename)
    },
    roleArn = role_arn
)

base_user_dataset_import_job_arn = create_user_dataset_import_job_response['datasetImportJobArn']
print(json.dumps(create_user_dataset_import_job_response, indent=2))

{
  "datasetImportJobArn": "arn:aws:personalize:us-east-1:376278017302:dataset-import-job/RetailDemo-base-user-dataset-import",
  "ResponseMetadata": {
    "RequestId": "e90d2c22-b2e0-4221-a949-cec4397b8c77",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Mon, 06 Mar 2023 09:06:46 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "123",
      "connection": "keep-alive",
      "x-amzn-requestid": "e90d2c22-b2e0-4221-a949-cec4397b8c77"
    },
    "RetryAttempts": 0
  }
}


### Wait Until the Dataset Import Jobs Become Active

Import jobs take some time to complete. Wait until the output of the code cells below shows DatasetImportJob: ACTIVE.
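
When an import job fails, the status alone does not say why. The sketch below extracts the status the same way the polling cells do, and also returns the `failureReason` field if the response carries one. `import_job_status` is a hypothetical helper, and the sample responses are hand-written stand-ins for real API output.

```python
def import_job_status(describe_response):
    """Return (status, failure_reason) from a describe_dataset_import_job response."""
    job = describe_response["datasetImportJob"]
    latest_run = job.get("latestDatasetImportJobRun")
    # Prefer the latest run when present, mirroring the polling cells.
    source = latest_run if latest_run is not None else job
    return source["status"], source.get("failureReason")


# Hand-written sample responses for illustration only.
ok_response = {"datasetImportJob": {"status": "ACTIVE"}}
failed_response = {
    "datasetImportJob": {
        "status": "CREATE FAILED",
        "failureReason": "example failure message",
    }
}

print(import_job_status(ok_response))      # prints ('ACTIVE', None)
print(import_job_status(failed_response))  # prints ('CREATE FAILED', 'example failure message')
```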

#### Interactions

In [32]:
%%time

status = None
max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_dataset_import_job_response = personalize.describe_dataset_import_job(
        datasetImportJobArn = base_interation_dataset_import_job_arn
    )
    
    dataset_import_job = describe_dataset_import_job_response["datasetImportJob"]
    if "latestDatasetImportJobRun" not in dataset_import_job:
        status = dataset_import_job["status"]
        print("DatasetImportJob: {}".format(status))
    else:
        status = dataset_import_job["latestDatasetImportJobRun"]["status"]
        print("LatestDatasetImportJobRun: {}".format(status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(60)

DatasetImportJob: ACTIVE
CPU times: user 12.9 ms, sys: 0 ns, total: 12.9 ms
Wall time: 59.7 ms


#### Items

In [33]:
status = None
max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_dataset_import_job_response = personalize.describe_dataset_import_job(
        datasetImportJobArn = base_item_dataset_import_job_arn
    )
    
    dataset_import_job = describe_dataset_import_job_response["datasetImportJob"]
    if "latestDatasetImportJobRun" not in dataset_import_job:
        status = dataset_import_job["status"]
        print("DatasetImportJob: {}".format(status))
    else:
        status = dataset_import_job["latestDatasetImportJobRun"]["status"]
        print("LatestDatasetImportJobRun: {}".format(status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(60)

DatasetImportJob: ACTIVE


#### Users

In [34]:
status = None
max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_dataset_import_job_response = personalize.describe_dataset_import_job(
        datasetImportJobArn = base_user_dataset_import_job_arn
    )
    
    dataset_import_job = describe_dataset_import_job_response["datasetImportJob"]
    if "latestDatasetImportJobRun" not in dataset_import_job:
        status = dataset_import_job["status"]
        print("DatasetImportJob: {}".format(status))
    else:
        status = dataset_import_job["latestDatasetImportJobRun"]["status"]
        print("LatestDatasetImportJobRun: {}".format(status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(60)

DatasetImportJob: ACTIVE


#### Create Interactions Dataset Import Job (CVR-Applied Dataset)

In [28]:
create_dataset_import_job_response = personalize.create_dataset_import_job(
    jobName = "RetailDeom-cvr-interaction-dataset-import",
    datasetArn = cvr_interaction_dataset_arn,
    dataSource = {
        "dataLocation": "s3://{}/{}".format(bucket, cvr_warm_train_interaction_filename)
    },
    roleArn = role_arn
)

cvr_interation_dataset_import_job_arn = create_dataset_import_job_response['datasetImportJobArn']
print(json.dumps(create_dataset_import_job_response, indent=2))

{
  "datasetImportJobArn": "arn:aws:personalize:us-east-1:376278017302:dataset-import-job/RetailDeom-cvr-interaction-dataset-import",
  "ResponseMetadata": {
    "RequestId": "1a639dd9-169f-4757-828b-ac89b2072dc6",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Mon, 06 Mar 2023 09:06:48 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "129",
      "connection": "keep-alive",
      "x-amzn-requestid": "1a639dd9-169f-4757-828b-ac89b2072dc6"
    },
    "RetryAttempts": 0
  }
}


#### Create Items Dataset Import Job (CVR-Applied Dataset)

In [29]:
create_item_dataset_import_job_response = personalize.create_dataset_import_job(
    jobName = "RetailDemo-cvr-item-dataset-import",
    datasetArn = cvr_item_dataset_arn,
    dataSource = {
        "dataLocation": "s3://{}/{}".format(bucket, cvr_items_filename)
    },
    roleArn = role_arn
)

cvr_item_dataset_import_job_arn = create_item_dataset_import_job_response['datasetImportJobArn']
print(json.dumps(create_item_dataset_import_job_response, indent=2))



#### Create Users Dataset Import Job (CVR-Applied Dataset)

In [31]:
create_user_dataset_import_job_response = personalize.create_dataset_import_job(
    jobName = "RetailDemo-cvr-user-dataset-import",
    datasetArn = cvr_user_dataset_arn,
    dataSource = {
        "dataLocation": "s3://{}/{}".format(bucket, cvr_users_filename)
    },
    roleArn = role_arn
)

cvr_user_dataset_import_job_arn = create_user_dataset_import_job_response['datasetImportJobArn']
print(json.dumps(create_user_dataset_import_job_response, indent=2))

{
  "datasetImportJobArn": "arn:aws:personalize:us-east-1:376278017302:dataset-import-job/RetailDemo-cvr-user-dataset-import",
  "ResponseMetadata": {
    "RequestId": "69540997-0541-4317-bad8-258e920f41d9",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Mon, 06 Mar 2023 13:44:11 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "122",
      "connection": "keep-alive",
      "x-amzn-requestid": "69540997-0541-4317-bad8-258e920f41d9"
    },
    "RetryAttempts": 0
  }
}


### Wait Until the Dataset Import Jobs Become Active

Import jobs take some time to complete. Wait until the output of the code cells below shows DatasetImportJob: ACTIVE.

#### Interactions

In [35]:
%%time

status = None
max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_dataset_import_job_response = personalize.describe_dataset_import_job(
        datasetImportJobArn = cvr_interation_dataset_import_job_arn
    )
    
    dataset_import_job = describe_dataset_import_job_response["datasetImportJob"]
    if "latestDatasetImportJobRun" not in dataset_import_job:
        status = dataset_import_job["status"]
        print("DatasetImportJob: {}".format(status))
    else:
        status = dataset_import_job["latestDatasetImportJobRun"]["status"]
        print("LatestDatasetImportJobRun: {}".format(status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(60)

DatasetImportJob: ACTIVE
CPU times: user 3.3 ms, sys: 0 ns, total: 3.3 ms
Wall time: 28.3 ms


#### Items

In [36]:
status = None
max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_dataset_import_job_response = personalize.describe_dataset_import_job(
        datasetImportJobArn = cvr_item_dataset_import_job_arn
    )
    
    dataset_import_job = describe_dataset_import_job_response["datasetImportJob"]
    if "latestDatasetImportJobRun" not in dataset_import_job:
        status = dataset_import_job["status"]
        print("DatasetImportJob: {}".format(status))
    else:
        status = dataset_import_job["latestDatasetImportJobRun"]["status"]
        print("LatestDatasetImportJobRun: {}".format(status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(60)

DatasetImportJob: ACTIVE


#### Users

In [37]:
status = None
max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_dataset_import_job_response = personalize.describe_dataset_import_job(
        datasetImportJobArn = cvr_user_dataset_import_job_arn
    )
    
    dataset_import_job = describe_dataset_import_job_response["datasetImportJob"]
    if "latestDatasetImportJobRun" not in dataset_import_job:
        status = dataset_import_job["status"]
        print("DatasetImportJob: {}".format(status))
    else:
        status = dataset_import_job["latestDatasetImportJobRun"]["status"]
        print("LatestDatasetImportJobRun: {}".format(status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(60)

DatasetImportJob: ACTIVE


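## Review

Using the code above, you created the dataset groups and datasets and imported the data. The next step is to create solutions (models) based on them.


## Notes for the Next Notebook

A few values are needed for the next lab. Run the cell below to store them so that they can be used as-is in the next Jupyter notebook.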

In [38]:
%store base_dataset_group_arn
%store base_interaction_dataset_arn
%store base_item_dataset_arn
%store base_user_dataset_arn
%store base_item_schema_arn

%store cvr_dataset_group_arn
%store cvr_interaction_dataset_arn
%store cvr_item_dataset_arn
%store cvr_user_dataset_arn
%store cvr_item_schema_arn

%store interaction_schema_arn
%store user_schema_arn

%store role_arn

Stored 'base_dataset_group_arn' (str)
Stored 'base_interaction_dataset_arn' (str)
Stored 'base_item_dataset_arn' (str)
Stored 'base_user_dataset_arn' (str)
Stored 'base_item_schema_arn' (str)
Stored 'cvr_dataset_group_arn' (str)
Stored 'cvr_interaction_dataset_arn' (str)
Stored 'cvr_item_dataset_arn' (str)
Stored 'cvr_user_dataset_arn' (str)
Stored 'cvr_item_schema_arn' (str)
Stored 'interaction_schema_arn' (str)
Stored 'user_schema_arn' (str)
Stored 'role_arn' (str)
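
`%store` only works inside IPython. If the values are also needed from a plain Python script, one option is to write them to a JSON file. This is a minimal sketch: `save_values`, `load_values`, the file name, and the example ARN (with a placeholder account ID) are all hypothetical.

```python
import json
import os
import tempfile


def save_values(path, values):
    """Write a dictionary of string values to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(values, f, indent=2)


def load_values(path):
    """Read the dictionary back from the JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Placeholder value for illustration; in the notebook this would be the real ARN.
example_values = {
    "base_dataset_group_arn":
        "arn:aws:personalize:us-east-1:123456789012:dataset-group/example",
}
example_path = os.path.join(tempfile.gettempdir(), "personalize_arns.json")
save_values(example_path, example_values)
print(load_values(example_path) == example_values)  # prints True
```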
