# Create dataset group and dataset in Personalize
This notebook creates the Amazon Personalize dataset group, interactions schema, and interactions dataset used for the Amazon video recommendations, and imports the interactions data from S3.

<a id='contents' />

## Table of contents

1. [Loading libraries and data](#loading)
2. [Setting permissions](#permissions)
3. [Creating schemas](#schema)
4. [Create a dataset group and the datasets within it](#dataset)
5. [Create an import job for each of the datasets](#import)

<a id='loading' />

## Loading libraries and data
[(back to top)](#contents)

In [2]:
import json
import boto3
import time
import sagemaker

account_num = '<YOUR_ACCOUNT_NUMBER>'
bucket   = '<YOUR_BUCKET_NAME>'
print(bucket)
prefix   = 'tidy_data'
region   = boto3.Session().region_name
print(region)

checkride-mfcs-aiml
us-east-1


In [3]:
#Set a name for the dataset group
dataset_group_name = 'video-dataset-group'
#Set a name for the interactions schema
VIDEO_INTERACTION_SCHEMA_NAME = 'video-interactions-schema'
VIDEO_INTERACTION_SCHEMA_ARN  = 'arn:aws:personalize:{}:{}:schema/'.format(region, account_num) + \
                                VIDEO_INTERACTION_SCHEMA_NAME
#Input name of the data on interactions
interactions_filename = 'movies-interactions.csv'
#Interactions dataset name in Personalize
interactions_dataset_name='video-interactions'

# Maximum polling duration in seconds (1 hour); the wait loops add it to time.time()
MAX_WAIT_TIME = 60*60

In [4]:
personalize = boto3.client('personalize')
personalize_runtime = boto3.client('personalize-runtime')

<a id='permissions' />

## Setting permissions
[(back to top)](#contents)

### Set up IAM role and allow Personalize to access your bucket

**S3 Bucket Permissions for Personalize Access**

Amazon Personalize needs to read the contents of the S3 bucket, so we attach a bucket policy that grants the Personalize service `s3:GetObject` and `s3:ListBucket` on the bucket and its objects.

In [14]:
def allow_bucket_access():
    s3 = boto3.client('s3')
    policy = {
        "Version": "2012-10-17",
        "Id": "PersonalizeS3BucketAccessPolicy",
        "Statement": [
            {
                "Sid": "PersonalizeS3BucketAccessPolicy",
                "Effect": "Allow",
                "Principal": {
                    "Service": "personalize.amazonaws.com"
                },
                "Action": [
                    "s3:GetObject",
                    "s3:ListBucket"
                ],
                "Resource": [
                    "arn:aws:s3:::{}".format(bucket),
                    "arn:aws:s3:::{}/*".format(bucket)
                ]
            }
        ]
    }

    s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))

In [15]:
allow_bucket_access()
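
The policy above can also be built by a small pure function, which makes it easy to inspect before anything is sent to AWS. The following is a minimal sketch, not the notebook's own method: `build_personalize_bucket_policy` and its `key_prefix` parameter are hypothetical names, and restricting `s3:GetObject` to the `tidy_data` prefix is an assumption about where the training data lives.

```python
import json


def build_personalize_bucket_policy(bucket_name, key_prefix=None):
    """Return the bucket policy as a dict; nothing is sent to AWS here."""
    object_arn = "arn:aws:s3:::{}/*".format(bucket_name)
    if key_prefix:
        # Assumption: only objects under this prefix need to be readable.
        object_arn = "arn:aws:s3:::{}/{}/*".format(bucket_name, key_prefix.strip("/"))
    return {
        "Version": "2012-10-17",
        "Id": "PersonalizeS3BucketAccessPolicy",
        "Statement": [
            {
                "Sid": "PersonalizeS3BucketAccessPolicy",
                "Effect": "Allow",
                "Principal": {"Service": "personalize.amazonaws.com"},
                "Action": ["s3:GetObject", "s3:ListBucket"],
                # ListBucket applies to the bucket ARN, GetObject to the object ARN.
                "Resource": ["arn:aws:s3:::{}".format(bucket_name), object_arn],
            }
        ],
    }


# "example-bucket" is a placeholder, not a real bucket.
policy = build_personalize_bucket_policy("example-bucket", key_prefix="tidy_data")
print(json.dumps(policy, indent=2))
```

The resulting dict can be passed to `s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))`. Keep in mind that `put_bucket_policy` replaces any policy already attached to the bucket, so reviewing the printed document first is worthwhile.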

**Create an IAM Role that gives Amazon Personalize permissions to access your S3 bucket.**

Amazon Personalize needs to assume an IAM role in order to have the permissions to execute certain tasks, such as reading the training data from S3. We create an IAM role and attach the required policies to it.

In [16]:
def create_personalize_role():
    iam = boto3.client('iam')

    role_name = 'PersonalizeS3Role'
    assume_role_policy_document = {
        "Version": "2012-10-17",
        "Statement": [
            {
              "Effect": "Allow",
              "Principal": {
                "Service": "personalize.amazonaws.com"
              },
              "Action": "sts:AssumeRole"
            }
        ]
    }

    try:
        print('Creating role: {}...'.format(role_name))
        create_role_response = iam.create_role(
            RoleName = role_name,
            AssumeRolePolicyDocument = json.dumps(assume_role_policy_document)
        )
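    except iam.exceptions.EntityAlreadyExistsException:
        # Only an "already exists" error is expected here; any other error propagates.
        print('role creation failed. Likely already existed.')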

    print('Attaching Personalize full access policy...')
    pers_policy_arn = 'arn:aws:iam::aws:policy/service-role/AmazonPersonalizeFullAccess'
    iam.attach_role_policy(
        RoleName  = role_name,
        PolicyArn = pers_policy_arn
    )
    print('Attaching S3 read-only access policy...')
    s3_policy_arn = 'arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess'
    iam.attach_role_policy(
        RoleName  = role_name,
        PolicyArn = s3_policy_arn
    )

    print('Waiting for policy attachment to propagate...')
    time.sleep(60) # wait for a minute to allow IAM role policy attachment to propagate

    role_arn = 'arn:aws:iam::{}:role/{}'.format(account_num, role_name)
    return role_arn

In [17]:
role_arn = create_personalize_role()
print(role_arn)

Creating role: PersonalizeS3Role...
role creation failed. Likely already existed.
Attaching Personalize full access policy...
Attaching S3 read-only access policy...
Waiting for policy attachment to propagate...
arn:aws:iam::386102487792:role/PersonalizeS3Role
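
As a variation on the cell above, the role ARN can be read from the IAM response instead of being assembled from the account number. The sketch below assumes a boto3 IAM client is passed in as `iam`; `ensure_role` and `PERSONALIZE_TRUST_POLICY` are hypothetical names introduced for illustration. It relies only on `create_role`, `get_role`, and the client's `EntityAlreadyExistsException`.

```python
import json

# Same trust policy as in create_personalize_role() above.
PERSONALIZE_TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "personalize.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}


def ensure_role(iam, role_name, trust_policy=PERSONALIZE_TRUST_POLICY):
    """Create the role if it is missing and return (role_arn, created)."""
    try:
        response = iam.create_role(
            RoleName=role_name,
            AssumeRolePolicyDocument=json.dumps(trust_policy),
        )
        return response["Role"]["Arn"], True
    except iam.exceptions.EntityAlreadyExistsException:
        # The role is already there: look it up instead of guessing its ARN.
        response = iam.get_role(RoleName=role_name)
        return response["Role"]["Arn"], False


# Usage in the notebook (requires AWS credentials, so it is left commented out):
# role_arn, created = ensure_role(boto3.client('iam'), 'PersonalizeS3Role')
```

Passing the client in as an argument keeps the function easy to exercise with a stub, and any error other than "already exists" (for example a permissions problem) surfaces instead of being reported as an existing role.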


<a id='schema' />

## Creating schemas
[(back to top)](#contents)

### Create the interactions schema if it is not in place already.

First, define a schema to tell Amazon Personalize how the dataset to be uploaded is structured. In this case we use the user ID, the item ID, the timestamp, the event rating, whether the purchase was verified, and the event type (always "review").

In [18]:
try:
    # first see if the schema is already in place
    arn = VIDEO_INTERACTION_SCHEMA_ARN
    response = personalize.describe_schema(schemaArn=arn)
    interactions_schema_arn = response['schema']['schemaArn']
    print(interactions_schema_arn)
except Exception as e:
    print('Schema {} did not exist, creating it...'.format(arn))
    schema = {
        "type": "record",
        "name": "Interactions",
        "namespace": "com.amazonaws.personalize.schema",
        "fields": [
            {
                "name": "USER_ID",
                "type": "string"
            },
            {
                "name": "ITEM_ID",
                "type": "string"
            },
            {
                "name": "TIMESTAMP",
                "type": "long"
            },
            { 
                "name": "EVENT_RATING",
                "type": "float"
            },
            { 
                "name": "EVENT_VERIFIED_PURCHASE",
                "type": "string",
                "categorical": True
            },
            {
                "name": "EVENT_TYPE",
                "type": "string"
            }
        ],
        "version": "1.0"
    }

    create_schema_response = personalize.create_schema(
        name   = VIDEO_INTERACTION_SCHEMA_NAME,
        schema = json.dumps(schema)
    )

    interactions_schema_arn = create_schema_response['schemaArn']
    print(json.dumps(create_schema_response, indent=2))

Schema arn:aws:personalize:us-east-1:386102487792:schema/video-interactions-schema did not exist, creating it...
{
  "schemaArn": "arn:aws:personalize:us-east-1:386102487792:schema/video-interactions-schema",
  "ResponseMetadata": {
    "RequestId": "ac6e4c32-5506-4ec4-96b0-0fb5155516f6",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Thu, 30 Jul 2020 21:53:57 GMT",
      "x-amzn-requestid": "ac6e4c32-5506-4ec4-96b0-0fb5155516f6",
      "content-length": "91",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}
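
Because the column headers of the CSV file must match the field names in the schema, it can help to compare the two before starting an import. The following is a rough sketch under stated assumptions: `check_csv_header` is a hypothetical helper, and the sample row is made up for illustration rather than taken from `movies-interactions.csv`.

```python
import csv
import io


def check_csv_header(csv_text, schema):
    """Return a list of header problems; an empty list means the header matches."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    schema_fields = [field["name"] for field in schema["fields"]]
    problems = []
    for name in schema_fields:
        if name not in header:
            problems.append("missing column: {}".format(name))
    for name in header:
        if name not in schema_fields:
            problems.append("column not in schema: {}".format(name))
    return problems


# Field names copied from the interactions schema defined above.
schema = {
    "fields": [
        {"name": name}
        for name in (
            "USER_ID", "ITEM_ID", "TIMESTAMP",
            "EVENT_RATING", "EVENT_VERIFIED_PURCHASE", "EVENT_TYPE",
        )
    ]
}

# Hypothetical sample data with the EVENT_TYPE column left out on purpose.
sample = (
    "USER_ID,ITEM_ID,TIMESTAMP,EVENT_RATING,EVENT_VERIFIED_PURCHASE\n"
    "u1,i1,1596146037,5.0,Y\n"
)
print(check_csv_header(sample, schema))  # prints ['missing column: EVENT_TYPE']
```

In the notebook, the first line of the interactions file could be passed to the same function together with the `schema` dict from the cell above.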


<a id='dataset' />

## Create a dataset group and the datasets within it
[(back to top)](#contents)

### Create a dataset group

Data stored in a dataset group has no impact on any other dataset group or on models created earlier, which makes it possible to run many experiments in isolation. Before importing the data prepared earlier, we need a dataset group and, within it, a dataset that holds the interactions.

Dataset groups can house the following types of information:

- User-item-interactions
- Event streams (real-time interactions)
- User metadata
- Item metadata

In [26]:
print('\nCreating new dataset group {}'.format(dataset_group_name))
create_dataset_group_response = personalize.create_dataset_group(
    name = dataset_group_name
)

dataset_group_arn = create_dataset_group_response['datasetGroupArn']
print(json.dumps(create_dataset_group_response, indent=2))


Creating new dataset group video-dataset-group
{
  "datasetGroupArn": "arn:aws:personalize:us-east-1:386102487792:dataset-group/video-dataset-group",
  "ResponseMetadata": {
    "RequestId": "26f68de1-3b78-48bd-a84f-1eb4056cb6f2",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Thu, 30 Jul 2020 22:04:52 GMT",
      "x-amzn-requestid": "26f68de1-3b78-48bd-a84f-1eb4056cb6f2",
      "content-length": "98",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


Before we can use the dataset group, it must be active. This can take a minute or two. Wait for the cell below to show the ACTIVE status. It checks the status of the dataset group every 60 seconds, up to a maximum of 1 hour.

In [27]:
max_time = time.time() + MAX_WAIT_TIME
while time.time() < max_time:
    describe_dataset_group_response = personalize.describe_dataset_group(
        datasetGroupArn = dataset_group_arn
    )
    status = describe_dataset_group_response['datasetGroup']['status']
    print('DatasetGroup: {}'.format(status))
    
    if status == 'ACTIVE' or status == 'CREATE FAILED':
        break
        
    time.sleep(60)

DatasetGroup: CREATE PENDING
DatasetGroup: ACTIVE
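
The same polling pattern is used again later for the import job, so it can be factored into one helper. This is a minimal sketch, assuming the describe call is wrapped in a zero-argument callable; `wait_for_status` and its parameters are hypothetical names, and unlike the loop above it raises `TimeoutError` when the deadline passes instead of falling through silently.

```python
import time


def wait_for_status(describe, resource_key, timeout_s=3600, poll_s=60,
                    sleep=time.sleep, clock=time.time):
    """Poll describe() until the resource is ACTIVE or CREATE FAILED.

    describe     : zero-argument callable returning a describe_* response
    resource_key : top-level key of the response, e.g. 'datasetGroup'
    Returns the final status; raises TimeoutError after timeout_s seconds.
    """
    deadline = clock() + timeout_s
    while True:
        status = describe()[resource_key]["status"]
        print("{}: {}".format(resource_key, status))
        if status in ("ACTIVE", "CREATE FAILED"):
            return status
        if clock() >= deadline:
            raise TimeoutError(
                "{} still {} after {} s".format(resource_key, status, timeout_s)
            )
        sleep(poll_s)


# Usage in the notebook (requires AWS credentials, so it is left commented out):
# status = wait_for_status(
#     lambda: personalize.describe_dataset_group(datasetGroupArn=dataset_group_arn),
#     'datasetGroup',
# )
```

The `sleep` and `clock` parameters exist only so the helper can be tried out without real waiting; in normal use the defaults are fine.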


### Create the interactions dataset

Now, we will create the interactions dataset with the name defined at the beginning of the notebook.

In [28]:
dataset_type = 'INTERACTIONS'
create_dataset_response = personalize.create_dataset(
    name = interactions_dataset_name,
    datasetType = dataset_type,
    datasetGroupArn = dataset_group_arn,
    schemaArn = interactions_schema_arn
)

interactions_dataset_arn = create_dataset_response['datasetArn']
print(json.dumps(create_dataset_response, indent=2))

{
  "datasetArn": "arn:aws:personalize:us-east-1:386102487792:dataset/video-dataset-group/INTERACTIONS",
  "ResponseMetadata": {
    "RequestId": "3bdc696e-b4ba-4609-8269-1503f2a23a0c",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Thu, 30 Jul 2020 22:05:56 GMT",
      "x-amzn-requestid": "3bdc696e-b4ba-4609-8269-1503f2a23a0c",
      "content-length": "100",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


In [29]:
print(interactions_schema_arn)
print(dataset_group_arn)

arn:aws:personalize:us-east-1:386102487792:schema/video-interactions-schema
arn:aws:personalize:us-east-1:386102487792:dataset-group/video-dataset-group


<a id='import' />

## Create an import job for each of the datasets
[(back to top)](#contents)

Now, we execute an import job that will load the data from the S3 bucket into the Amazon Personalize dataset.
We create and run the dataset import job using the CreateDatasetImportJob API, specifying the datasetArn of the interactions dataset and the IAM role created above, and setting dataLocation to the S3 location where we stored the training data.

In [30]:
create_dataset_import_job_response = personalize.create_dataset_import_job(
    jobName = '{}-interactions-import'.format(dataset_group_name),
    datasetArn = interactions_dataset_arn,
    dataSource = {
        "dataLocation": "s3://{}/{}/{}".format(bucket, prefix, interactions_filename)
    },
    roleArn = role_arn
)

interactions_dataset_import_job_arn = create_dataset_import_job_response['datasetImportJobArn']
print(json.dumps(create_dataset_import_job_response, indent=2))

{
  "datasetImportJobArn": "arn:aws:personalize:us-east-1:386102487792:dataset-import-job/video-dataset-group-interactions-import",
  "ResponseMetadata": {
    "RequestId": "00d48c50-f70b-4421-9353-517a2572448e",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Thu, 30 Jul 2020 22:05:56 GMT",
      "x-amzn-requestid": "00d48c50-f70b-4421-9353-517a2572448e",
      "content-length": "127",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


### Wait for the dataset import jobs to complete
The cell below polls the import job every 60 seconds until it reports the ACTIVE or CREATE FAILED status.

In [31]:
%%time
print('Waiting for INTERACTIONS data import to complete...')
max_time = time.time() + MAX_WAIT_TIME
while time.time() < max_time:
    describe_dataset_import_job_response = personalize.describe_dataset_import_job(
        datasetImportJobArn = interactions_dataset_import_job_arn
    )
    status = describe_dataset_import_job_response['datasetImportJob']['status']
    print('DatasetImportJob: {}'.format(status))
    
    if status == 'ACTIVE' or status == 'CREATE FAILED':
        break
        
    time.sleep(60)

# Report success after the loop; inside the loop this check was never reached.
if status == 'ACTIVE':
    print('INTERACTIONS dataset is ACTIVE.')

Waiting for INTERACTIONS data import to complete...
DatasetImportJob: CREATE PENDING
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: ACTIVE
CPU times: user 119 ms, sys: 14.9 ms, total: 134 ms
Wall time: 19min 1s
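
If an import ends in CREATE FAILED, the describe response carries a `failureReason` field that explains why. The sketch below shows one way to surface it; `summarize_import_job` is a hypothetical helper, and the sample response is invented for illustration rather than copied from a real run.

```python
def summarize_import_job(response):
    """Turn a describe_dataset_import_job response into a one-line summary."""
    job = response["datasetImportJob"]
    name = job.get("jobName", "<unnamed>")
    status = job["status"]
    if status == "CREATE FAILED":
        reason = job.get("failureReason", "no reason reported")
        return "{} failed: {}".format(name, reason)
    return "{} is {}".format(name, status)


# Invented example response; a real one comes from
# personalize.describe_dataset_import_job(datasetImportJobArn=...).
example = {
    "datasetImportJob": {
        "jobName": "video-dataset-group-interactions-import",
        "status": "CREATE FAILED",
        "failureReason": "example failure text",
    }
}
print(summarize_import_job(example))
```

In the notebook this could be called with `describe_dataset_import_job_response` once the wait loop has finished.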


**In case we made a mistake or need to clean up the resources, we can delete them as follows:**

In [6]:
def delete_solution(s):
    try:
        personalize.delete_solution(solutionArn = s)
    except Exception as e:
        # Report the failure instead of silently ignoring it
        print(e)

In [7]:
def delete_dataset(d):
    try:
        print('Deleting {}'.format(d))
        personalize.delete_dataset(datasetArn=d)
    except Exception as e:
        print(e)
        pass

In [8]:
def delete_schema(s):
    try:
        print('Deleting {}'.format(s))
        personalize.delete_schema(schemaArn=s)
    except Exception as e:
        print(e)
        pass

In [9]:
# delete_solution('arn:aws:personalize:us-east-1:386102487792:solution/video-hrnn-metadata')

In [10]:
# delete_dataset("arn:aws:personalize:us-east-1:386102487792:dataset/video-dataset-group/INTERACTIONS")

Deleting arn:aws:personalize:us-east-1:386102487792:dataset/video-dataset-group/INTERACTIONS


In [13]:
# delete_schema(VIDEO_INTERACTION_SCHEMA_ARN)

Deleting arn:aws:personalize:us-east-1:386102487792:schema/video-interactions-schema
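
The order of deletion matters: a schema can only be deleted once no dataset references it, and a dataset group only once its datasets are gone. Dataset deletion is also asynchronous. The following is a sketch of one possible cleanup sequence, assuming a boto3 Personalize client is passed in as `personalize`; `wait_until_deleted` and `delete_in_order` are hypothetical helpers, and the polling interval is an arbitrary choice.

```python
import time


def wait_until_deleted(describe, not_found_error, poll_s=10, max_polls=60,
                       sleep=time.sleep):
    """Call describe() until it raises not_found_error. Returns True if gone."""
    for _ in range(max_polls):
        try:
            describe()
        except not_found_error:
            return True
        sleep(poll_s)
    return False


def delete_in_order(personalize, dataset_arn, schema_arn, dataset_group_arn,
                    sleep=time.sleep):
    """Delete the dataset, then its schema, then the dataset group."""
    not_found = personalize.exceptions.ResourceNotFoundException

    personalize.delete_dataset(datasetArn=dataset_arn)
    # Deletion is asynchronous, so wait until the dataset can no longer be described.
    gone = wait_until_deleted(
        lambda: personalize.describe_dataset(datasetArn=dataset_arn),
        not_found,
        sleep=sleep,
    )
    if not gone:
        raise TimeoutError("dataset still present: {}".format(dataset_arn))

    personalize.delete_schema(schemaArn=schema_arn)
    personalize.delete_dataset_group(datasetGroupArn=dataset_group_arn)


# Usage in the notebook (requires AWS credentials, so it is left commented out):
# delete_in_order(personalize, interactions_dataset_arn,
#                 VIDEO_INTERACTION_SCHEMA_ARN, dataset_group_arn)
```

Any solutions built on the dataset group would have to be removed first, for example with `delete_solution` above. If the schema is shared with other datasets, the `delete_schema` step should be skipped.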
