## Setup

Just as in the first lab, we have to prepare our environment by importing dependencies and creating clients.

### Import dependencies

The following libraries are needed for this lab.

In [1]:
import boto3
import json
import uuid
import time
import pandas as pd
from botocore.exceptions import ClientError
import sagemaker
sess = sagemaker.Session()

bucket = sess.default_bucket()
account = sess.account_id()

prefix = 'personalize-demo'

### Create clients

We will need the following AWS service clients in this lab.

In [2]:
personalize = boto3.client('personalize')

s3_client = boto3.client('s3')

### Reviewing the dataset

In [25]:
users = 'users.csv'

users_filename = f'{prefix}/{users}'
s3_client.upload_file(users, bucket, users_filename)

product = pd.read_csv(users)
product

Unnamed: 0,USER_ID,AGE,GENDER
0,1,31,M
1,2,58,F
2,3,43,M
3,4,38,M
4,5,24,M
...,...,...,...
5245,5246,37,M
5246,5247,46,M
5247,5248,50,M
5248,5249,33,M


In [26]:
items = 'items.csv'
items_filename = f'{prefix}/{items}'
s3_client.upload_file(items, bucket, items_filename)

product = pd.read_csv(items)
product

Unnamed: 0,ITEM_ID,PRICE,CATEGORY_L1,CATEGORY_L2,PRODUCT_DESCRIPTION,GENDER
0,e1669081-8ffc-4dec-97a6-e9176d7f6651,124.99,apparel,scarf,Sans pareil scarf for women,F
1,cfafd627-7d6b-43a5-be05-4c7937be417d,57.99,housewares,kitchen,A must-have for your kitchen,Any
2,6e6ad102-7510-4a02-b8ce-5a0cd6f431d1,133.99,apparel,jacket,This gainsboro jacket for women is perfect for...,F
3,49b89871-5fe7-4898-b99d-953e15fb42b2,196.99,electronics,speaker,High definition speakers to fill the house wit...,Any
4,5cb18925-3a3c-4867-8f1c-46efd7eba067,9.99,footwear,sandals,This spiffy pair of sandals for woman is perfe...,F
...,...,...,...,...,...,...
2460,36cfd856-dd30-46a9-8654-1f1de77e674a,128.99,floral,wreath,Easter wreath grown sustainably on our organic...,Any
2461,1ea9439f-dff5-41cf-aac3-718a6b4e7af6,77.99,footwear,sneaker,An all-around voguish pair of white sneakers,F
2462,ccdf737c-c4fd-4c78-abd2-d5ef0428ef20,56.99,housewares,kitchen,Ideal for every kitchen,Any
2463,12f93a36-e282-4445-92ae-356eb6a560fd,98.99,floral,arrangement,Roses arrangement grown sustainably on our org...,Any


In [27]:
interactions = 'interactions.csv'

interactions_filename = f'{prefix}/{interactions}'
s3_client.upload_file(interactions, bucket, interactions_filename)

df_interactions = pd.read_csv(interactions)
df_interactions

Unnamed: 0,ITEM_ID,USER_ID,EVENT_TYPE,TIMESTAMP,DISCOUNT
0,26bb732f-9159-432f-91ef-bad14fedd298,3156,View,1591803788,No
1,26bb732f-9159-432f-91ef-bad14fedd298,3156,View,1591803788,No
2,dc073623-4b95-47d9-93cb-0171c20baa04,332,View,1591803812,Yes
3,dc073623-4b95-47d9-93cb-0171c20baa04,332,View,1591803812,Yes
4,31efcfea-47d6-43f3-97f7-2704a5397e22,3981,View,1591803830,Yes
...,...,...,...,...,...
674999,9bc87696-e9bd-4241-86b0-234e054a607b,5165,View,1598204678,Yes
675000,9bc87696-e9bd-4241-86b0-234e054a607b,5165,AddToCart,1598204681,Yes
675001,9bc87696-e9bd-4241-86b0-234e054a607b,5165,ViewCart,1598204686,Yes
675002,9bc87696-e9bd-4241-86b0-234e054a607b,5165,StartCheckout,1598204686,Yes


In [28]:
set(df_interactions['DISCOUNT'])

{'No', 'Yes'}

## Configure Amazon Personalize

Now that we've prepared our three datasets and uploaded them to S3 we'll need to configure the Amazon Personalize service to understand our data so that it can be used to train models for generating recommendations.

Note: if you deployed the Retail Demo Store with the "auto create Personalize resources" flag set to "Yes", the following steps have already been automatically completed for you.

### Create Schemas for Datasets

Amazon Personalize requires a schema for each dataset so it can map the columns in our CSVs to fields for model training. Each schema is declared in JSON using the [Apache Avro](https://avro.apache.org/) format.

Let's define and create schemas in Personalize for our datasets.

Note that categorical fields include an additional attribute of `"categorical": true` and the textual field has an additional attribute of `"textual": true`. Categorical fields are those where one or more values can be specified for the field value (i.e. enumerated values). For example, one or more category names/codes for the `CATEGORY_L1` field. A textual field indicates that Personalize should apply a natural language processing (NLP) model to the field's value to extract model features from unstructured text. In this case, we're using the product description as the textual field. You can only have one textual field in the items dataset. 

Another detail to note is that when we call the [CreateSchema](https://docs.aws.amazon.com/personalize/latest/dg/API_CreateSchema.html) API, we pass an optional `domain` parameter with a value of `ECOMMERCE`. This tells Personalize that we are creating a schema for Retail/E-commerce domain. We will do this for all three schemas.

#### Items Datsaset Schema

In [29]:
items_schema_arn = 'arn:aws:personalize:us-west-2:420618410968:schema/octank-products-interactions'
items_schema = {
    "type": "record",
    "name": "Items",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {
            "name": "ITEM_ID",
            "type": "string"
        },
        {
            "name": "PRICE",
            "type": "float"
        },
        {
            "name": "CATEGORY_L1",
            "type": "string",
            "categorical": True,
        },
        {
            "name": "CATEGORY_L2",
            "type": "string",
            "categorical": True,
        },
        {
            "name": "PRODUCT_DESCRIPTION",
            "type": "string",
            "textual": True
        },
        {
            "name": "GENDER",
            "type": "string",
            "categorical": True,
        },
    ],
    "version": "1.0"
}

try:
    create_schema_response = personalize.create_schema(
        name = "octank-products",
        domain = 'ECOMMERCE',
        schema = json.dumps(items_schema)
    )
    items_schema_arn = create_schema_response['schemaArn']
    print(json.dumps(create_schema_response, indent=2))
except personalize.exceptions.ResourceAlreadyExistsException:
    print('You aready created this schema, seemingly')
    paginator = personalize.get_paginator('list_schemas')
    for paginate_result in paginator.paginate():
        for schema in paginate_result['schemas']:
            if schema['name'] == 'retaildemostore-products-items':
                items_schema_arn = schema['schemaArn']
                print(f"Using existing schema: {items_schema_arn}")
                break

You aready created this schema, seemingly


#### Users Dataset Schema

In [30]:
users_schema_arn = 'arn:aws:personalize:us-west-2:420618410968:schema/octank-users'
users_schema = {
    "type": "record",
    "name": "Users",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {
            "name": "USER_ID",
            "type": "string"
        },
        {
            "name": "AGE",
            "type": "int"
        },
        {
            "name": "GENDER",
            "type": "string",
            "categorical": True,
        }
    ],
    "version": "1.0"
}

try:
    create_schema_response = personalize.create_schema(
        name = "octank-users",
        domain = "ECOMMERCE",
        schema = json.dumps(users_schema)
    )
    print(json.dumps(create_schema_response, indent=2))
    users_schema_arn = create_schema_response['schemaArn']
except personalize.exceptions.ResourceAlreadyExistsException:
    print('You aready created this schema, seemingly')
    paginator = personalize.get_paginator('list_schemas')
    for paginate_result in paginator.paginate():
        for schema in paginate_result['schemas']:
            if schema['name'] == 'retaildemostore-products-users':
                users_schema_arn = schema['schemaArn']
                print(f"Using existing schema: {users_schema_arn}")
                break

You aready created this schema, seemingly


#### Interactions Dataset Schema

In [31]:
interactions_schema_arn = 'arn:aws:personalize:us-west-2:420618410968:schema/octank-products-interactions'
interactions_schema = {
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {
            "name": "ITEM_ID",
            "type": "string"
        },
        {
            "name": "USER_ID",
            "type": "string"
        },
        {
            "name": "EVENT_TYPE",  # "View", "Purchase", etc.
            "type": "string"
        },
        {
            "name": "TIMESTAMP",
            "type": "long"
        },
        {
            "name": "DISCOUNT",  # This is the contextual metadata - "Yes" or "No".
            "type": "string"
        },
    ],
    "version": "1.0"
}

try:
    create_schema_response = personalize.create_schema(
        name = "octank-products-interactions",
        domain = "ECOMMERCE",
        schema = json.dumps(interactions_schema)
    )
    print(json.dumps(create_schema_response, indent=2))
    interactions_schema_arn = create_schema_response['schemaArn']
except personalize.exceptions.ResourceAlreadyExistsException:
    print('You aready created this schema, seemingly')
    paginator = personalize.get_paginator('list_schemas')
    for paginate_result in paginator.paginate():
        for schema in paginate_result['schemas']:
            if schema['name'] == 'retaildemostore-products-interactions':
                interactions_schema_arn = schema['schemaArn']
                print(f"Using existing schema: {interactions_schema_arn}")
                break

You aready created this schema, seemingly


### Create and Wait for Dataset Group

Next we need to create the dataset group that will contain our three datasets. This is one of many Personalize operations that are asynchronous. That is, we call an API to create a resource and have to wait for it to become active.

#### Create Dataset Group

Note that we are also passing `ECOMMERCE` for the `domain` parameter here too.

In [32]:
dataset_group_arn = 'arn:aws:personalize:us-west-2:420618410968:dataset-group/octank-products'
try:
    create_dataset_group_response = personalize.create_dataset_group(
        name = 'octank-products',
        domain = 'ECOMMERCE'
    )
    dataset_group_arn = create_dataset_group_response['datasetGroupArn']
    print(json.dumps(create_dataset_group_response, indent=2))
except personalize.exceptions.ResourceAlreadyExistsException:
    print('You aready created this dataset group, seemingly')
    paginator = personalize.get_paginator('list_dataset_groups')
    for paginate_result in paginator.paginate():
        for dataset_group in paginate_result['datasetGroups']:
            if dataset_group['name'] == 'retaildemostore-products':
                dataset_group_arn = dataset_group['datasetGroupArn']
                break

You aready created this dataset group, seemingly


#### Wait for Dataset Group to Have ACTIVE Status

In [33]:
status = None
max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_dataset_group_response = personalize.describe_dataset_group(
        datasetGroupArn = dataset_group_arn
    )
    status = describe_dataset_group_response["datasetGroup"]["status"]
    print("DatasetGroup: {}".format(status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(15)

DatasetGroup: ACTIVE


### Create Items Dataset

Next we will create the datasets in Personalize for our three dataset types. Let's start with the items dataset.

In [34]:
items_dataset_arn = 'arn:aws:personalize:us-west-2:420618410968:dataset/octank-products/ITEMS'
try:
    dataset_type = "ITEMS"
    create_dataset_response = personalize.create_dataset(
        name = "octank-products-items",
        datasetType = dataset_type,
        datasetGroupArn = dataset_group_arn,
        schemaArn = items_schema_arn
    )

    items_dataset_arn = create_dataset_response['datasetArn']
    print(json.dumps(create_dataset_response, indent=2))
except personalize.exceptions.ResourceAlreadyExistsException:
    print('You aready created this dataset, seemingly')
    paginator = personalize.get_paginator('list_datasets')
    for paginate_result in paginator.paginate(datasetGroupArn = dataset_group_arn):
        for dataset in paginate_result['datasets']:
            if dataset['name'] == 'retaildemostore-products-items':
                items_dataset_arn = dataset['datasetArn']
                break
                
print(f'Items dataset ARN = {items_dataset_arn}')

You aready created this dataset, seemingly
Items dataset ARN = arn:aws:personalize:us-west-2:420618410968:dataset/octank-products/ITEMS


### Create Users Dataset

In [35]:
users_dataset_arn = 'arn:aws:personalize:us-west-2:420618410968:dataset/octank-products/USERS'
try:
    dataset_type = "USERS"
    create_dataset_response = personalize.create_dataset(
        name = "octank-products-users",
        datasetType = dataset_type,
        datasetGroupArn = dataset_group_arn,
        schemaArn = users_schema_arn
    )

    users_dataset_arn = create_dataset_response['datasetArn']
    print(json.dumps(create_dataset_response, indent=2))
except personalize.exceptions.ResourceAlreadyExistsException:
    print('You aready created this dataset, seemingly')
    paginator = personalize.get_paginator('list_datasets')
    for paginate_result in paginator.paginate(datasetGroupArn = dataset_group_arn):
        for dataset in paginate_result['datasets']:
            if dataset['name'] == 'retaildemostore-products-users':
                users_dataset_arn = dataset['datasetArn']
                break
                
print(f'Users dataset ARN = {users_dataset_arn}')

You aready created this dataset, seemingly
Users dataset ARN = arn:aws:personalize:us-west-2:420618410968:dataset/octank-products/USERS


### Create Interactions Dataset

In [36]:
interactions_dataset_arn = 'arn:aws:personalize:us-west-2:420618410968:dataset/octank-products/INTERACTIONS'
try:
    dataset_type = "INTERACTIONS"
    create_dataset_response = personalize.create_dataset(
        name = "octank-products-interactions",
        datasetType = dataset_type,
        datasetGroupArn = dataset_group_arn,
        schemaArn = interactions_schema_arn
    )

    interactions_dataset_arn = create_dataset_response['datasetArn']
    print(json.dumps(create_dataset_response, indent=2))
except personalize.exceptions.ResourceAlreadyExistsException:
    print('You aready created this dataset, seemingly')
    paginator = personalize.get_paginator('list_datasets')
    for paginate_result in paginator.paginate(datasetGroupArn = dataset_group_arn):
        for dataset in paginate_result['datasets']:
            if dataset['name'] == 'retaildemostore-products-interactions':
                interactions_dataset_arn = dataset['datasetArn']
                break
                
print(f'Interactions dataset ARN = {interactions_dataset_arn}')

You aready created this dataset, seemingly
Interactions dataset ARN = arn:aws:personalize:us-west-2:420618410968:dataset/octank-products/INTERACTIONS


### Wait for datasets to become active

It can take a minute or two for the datasets to be created. Let's wait for all three to become active.

In [37]:
%%time

dataset_arns = [ items_dataset_arn, users_dataset_arn, interactions_dataset_arn ]

max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    for dataset_arn in reversed(dataset_arns):
        response = personalize.describe_dataset(
            datasetArn = dataset_arn
        )
        status = response["dataset"]["status"]

        if status == "ACTIVE":
            print(f'Dataset {dataset_arn} successfully completed')
            dataset_arns.remove(dataset_arn)
        elif status == "CREATE FAILED":
            print(f'Dataset {dataset_arn} failed')
            if response.get('failureReason'):
                print('   Reason: ' + response['failureReason'])
            dataset_arns.remove(dataset_arn)

    if len(dataset_arns) > 0:
        print('At least one dataset is still in progress')
        time.sleep(60)
    else:
        print("All datasets have completed")
        break

Dataset arn:aws:personalize:us-west-2:420618410968:dataset/octank-products/INTERACTIONS successfully completed
Dataset arn:aws:personalize:us-west-2:420618410968:dataset/octank-products/USERS successfully completed
Dataset arn:aws:personalize:us-west-2:420618410968:dataset/octank-products/ITEMS successfully completed
All datasets have completed
CPU times: user 10.2 ms, sys: 0 ns, total: 10.2 ms
Wall time: 117 ms


## Import Datasets to Personalize

Up to this point we have generated CSVs containing data for our users, items, and interactions and staged them in an S3 bucket. We also created schemas in Personalize that define the columns in our CSVs. Then we created a datset group and three datasets in Personalize that will receive our data. In the following steps we will create import jobs with Personalize that will import the datasets from our S3 bucket into the service.

### Inspect permissions

By default, the Personalize service does not have permission to acccess the data we uploaded into the S3 bucket in our account. In order to grant access to the  Personalize service to read our CSVs, we need to set a Bucket Policy and create an IAM role that the Amazon Personalize service will assume.

The deployment process for the Retail Demo Store has already setup these resources for you. However, let's take a look at the bucket policy and IAM role to see the required permissions.

We'll start by displaying the bucket policy in the S3 staging bucket where we uploaded the CSVs. Note the service principal of `personalize.amazonaws.com` and the actions allowed on the staging bucket. The `s3:GetObject` is needed for import jobs to allow Personalize to read objects from the bucket and the `s3:PutObject` is used for export jobs, batch inference jobs, and batch segment jobs to allow Personalize to write output files to the bucket. The `s3:ListBucket` action allows Personalize to list the contents of a folder.

In [38]:
response = s3_client.get_bucket_policy(Bucket = bucket)
print(json.dumps(json.loads(response['Policy']), indent=2))

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "personalize.amazonaws.com"
      },
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::sagemaker-us-west-2-420618410968",
        "arn:aws:s3:::sagemaker-us-west-2-420618410968/*"
      ]
    }
  ]
}


Next, let's look at the IAM role that Personalize will need to assume to access the S3 bucket. Again, this role was created for you during the Retail Demo Store deployment. We'll start by inspecting the role itself. Notice the same service principal as the bucket policy.

In [39]:
iam = boto3.client("iam")

role_name = "octank-us-west-2-PersonalizeS3"

response = iam.get_role(RoleName = role_name)
role_arn = response['Role']['Arn']
print(json.dumps(response['Role'], indent=2, default = str))

{
  "Path": "/",
  "RoleName": "octank-us-west-2-PersonalizeS3",
  "RoleId": "AROAWD3WPTPMFWCGVRNNU",
  "Arn": "arn:aws:iam::420618410968:role/octank-us-west-2-PersonalizeS3",
  "CreateDate": "2022-07-01 14:46:10+00:00",
  "AssumeRolePolicyDocument": {
    "Version": "2012-10-17",
    "Statement": [
      {
        "Sid": "",
        "Effect": "Allow",
        "Principal": {
          "Service": "personalize.amazonaws.com"
        },
        "Action": "sts:AssumeRole"
      }
    ]
  },
  "Description": "Allows Amazon Personalize to call AWS services on your behalf.",
  "MaxSessionDuration": 3600,
  "Tags": [
    {
      "Key": "ab3-demo",
      "Value": "jingswu"
    }
  ],
  "RoleLastUsed": {
    "LastUsedDate": "2022-07-01 15:20:12+00:00",
    "Region": "us-west-2"
  }
}


### Create Import Jobs

With the permissions in place to allow Personalize to access our CSV files, let's create three import jobs to import each file into its respective dataset. Each import job can take several minutes to complete so we'll create all three import jobs and then wait for them all to complete. This allows them to import in parallel.

#### Create Items Dataset Import Job

In [40]:
import_job_suffix = str(uuid.uuid4())[:8]

items_create_dataset_import_job_response = personalize.create_dataset_import_job(
    jobName = "octank-products-" + import_job_suffix,
    datasetArn = items_dataset_arn,
    dataSource = {
        "dataLocation": "s3://{}/{}".format(bucket, items_filename)
    },
    roleArn = role_arn
)

items_dataset_import_job_arn = items_create_dataset_import_job_response['datasetImportJobArn']
print(json.dumps(items_create_dataset_import_job_response, indent=2))

{
  "datasetImportJobArn": "arn:aws:personalize:us-west-2:420618410968:dataset-import-job/octank-products-2e20d416",
  "ResponseMetadata": {
    "RequestId": "b80ff0a3-514e-4046-9cfe-8601ba495f53",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Fri, 01 Jul 2022 22:41:47 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "112",
      "connection": "keep-alive",
      "x-amzn-requestid": "b80ff0a3-514e-4046-9cfe-8601ba495f53"
    },
    "RetryAttempts": 0
  }
}


#### Create Users Dataset Import Job

In [41]:
import_job_suffix = str(uuid.uuid4())[:8]

users_create_dataset_import_job_response = personalize.create_dataset_import_job(
    jobName = "octank-products-" + import_job_suffix,
    datasetArn = users_dataset_arn,
    dataSource = {
        "dataLocation": "s3://{}/{}".format(bucket, users_filename)
    },
    roleArn = role_arn
)

users_dataset_import_job_arn = users_create_dataset_import_job_response['datasetImportJobArn']
print(json.dumps(users_create_dataset_import_job_response, indent=2))

{
  "datasetImportJobArn": "arn:aws:personalize:us-west-2:420618410968:dataset-import-job/octank-products-3e2db439",
  "ResponseMetadata": {
    "RequestId": "d8a8598b-d803-4603-a725-24981c786eb5",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Fri, 01 Jul 2022 22:41:49 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "112",
      "connection": "keep-alive",
      "x-amzn-requestid": "d8a8598b-d803-4603-a725-24981c786eb5"
    },
    "RetryAttempts": 0
  }
}


#### Create Interactions Dataset Import Job

In [42]:
import_job_suffix = str(uuid.uuid4())[:8]

interactions_create_dataset_import_job_response = personalize.create_dataset_import_job(
    jobName = "octank-products-" + import_job_suffix,
    datasetArn = interactions_dataset_arn,
    dataSource = {
        "dataLocation": "s3://{}/{}".format(bucket, interactions_filename)
    },
    roleArn = role_arn
)

interactions_dataset_import_job_arn = interactions_create_dataset_import_job_response['datasetImportJobArn']
print(json.dumps(interactions_create_dataset_import_job_response, indent=2))

{
  "datasetImportJobArn": "arn:aws:personalize:us-west-2:420618410968:dataset-import-job/octank-products-cbc1efae",
  "ResponseMetadata": {
    "RequestId": "f531b54f-5aa7-4019-afe2-63510be3c97d",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Fri, 01 Jul 2022 22:41:51 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "112",
      "connection": "keep-alive",
      "x-amzn-requestid": "f531b54f-5aa7-4019-afe2-63510be3c97d"
    },
    "RetryAttempts": 0
  }
}


### Wait for Import Jobs to Complete

It will take 10-15 minutes for the import jobs to complete, while you're waiting you can learn more about Datasets and Schemas here: https://docs.aws.amazon.com/personalize/latest/dg/how-it-works-dataset-schema.html

We will wait for all three jobs to finish.

#### Wait for Items Import Job to Complete

In [43]:
%%time

import_job_arns = [ items_dataset_import_job_arn, users_dataset_import_job_arn, interactions_dataset_import_job_arn ]

max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    for job_arn in reversed(import_job_arns):
        import_job_response = personalize.describe_dataset_import_job(
            datasetImportJobArn = job_arn
        )
        status = import_job_response["datasetImportJob"]['status']

        if status == "ACTIVE":
            print(f'Import job {job_arn} successfully completed')
            import_job_arns.remove(job_arn)
        elif status == "CREATE FAILED":
            print(f'Import job {job_arn} failed')
            if import_job_response.get('failureReason'):
                print('   Reason: ' + import_job_response['failureReason'])
            import_job_arns.remove(job_arn)

    if len(import_job_arns) > 0:
        print('At least one dataset import job still in progress')
        time.sleep(60)
    else:
        print("All import jobs have ended")
        break

At least one dataset import job still in progress
At least one dataset import job still in progress
At least one dataset import job still in progress
At least one dataset import job still in progress
Import job arn:aws:personalize:us-west-2:420618410968:dataset-import-job/octank-products-cbc1efae successfully completed
At least one dataset import job still in progress
At least one dataset import job still in progress
Import job arn:aws:personalize:us-west-2:420618410968:dataset-import-job/octank-products-3e2db439 successfully completed
Import job arn:aws:personalize:us-west-2:420618410968:dataset-import-job/octank-products-2e20d416 successfully completed
All import jobs have ended
CPU times: user 111 ms, sys: 11.6 ms, total: 123 ms
Wall time: 6min


## Create Recommenders

List all the recommendation Recipe for Ecommerce

In [26]:
recommender_response = personalize.list_recipes(domain = "ECOMMERCE")
print(json.dumps(recommender_response['recipes'], indent=2, default=str))

[
  {
    "name": "aws-ecomm-customers-who-viewed-x-also-viewed",
    "recipeArn": "arn:aws:personalize:::recipe/aws-ecomm-customers-who-viewed-x-also-viewed",
    "status": "ACTIVE",
    "creationDateTime": "2019-06-10 00:00:00+00:00",
    "lastUpdatedDateTime": "2022-06-30 18:30:54.674000+00:00",
    "domain": "ECOMMERCE"
  },
  {
    "name": "aws-ecomm-frequently-bought-together",
    "recipeArn": "arn:aws:personalize:::recipe/aws-ecomm-frequently-bought-together",
    "status": "ACTIVE",
    "creationDateTime": "2019-06-10 00:00:00+00:00",
    "lastUpdatedDateTime": "2022-06-30 18:30:54.674000+00:00",
    "domain": "ECOMMERCE"
  },
  {
    "name": "aws-ecomm-popular-items-by-purchases",
    "recipeArn": "arn:aws:personalize:::recipe/aws-ecomm-popular-items-by-purchases",
    "status": "ACTIVE",
    "creationDateTime": "2019-06-10 00:00:00+00:00",
    "lastUpdatedDateTime": "2022-06-30 18:30:54.674000+00:00",
    "domain": "ECOMMERCE"
  },
  {
    "name": "aws-ecomm-popular-items-by

## Recommend For You Recommender

aws-ecomm-recommended-for-you

In [27]:
recipe_arn = ''
for r in recommender_response['recipes']:
    if r['name'] == 'aws-ecomm-recommended-for-you':
        recipe_arn = r['recipeArn']
print(recipe_arn)

arn:aws:personalize:::recipe/aws-ecomm-recommended-for-you


In [28]:
# dataset_group_arn = 'arn:aws:personalize:us-west-2:420618410968:dataset-group/octank-products'

try:
    response = personalize.create_recommender(
      name = 'octank-recommended-for-you',
      recipeArn = recipe_arn,
      datasetGroupArn = dataset_group_arn
    )
    rfy_recommender_arn = response['recommenderArn']
    print(json.dumps(response, indent=2))
except personalize.exceptions.ResourceAlreadyExistsException:
    print('You aready created this recommender, seemingly')
    paginator = personalize.get_paginator('list_recommenders')
    for paginate_result in paginator.paginate(datasetGroupArn = dataset_group_arn):
        for recommender in paginate_result['recommenders']:
            if recommender['name'] == 'octank-recommended-for-you':
                rfy_recommender_arn = recommender['recommenderArn']
                break
                
print(f'Recommended For You recommender ARN = {rfy_recommender_arn}')

You aready created this recommender, seemingly
Recommended For You recommender ARN = arn:aws:personalize:us-west-2:420618410968:recommender/octank-recommended-for-you


Other recommender recipes

## Customers Who Viewed X Also Viewed

In [38]:
recipe_arn = ''
for r in recommender_response['recipes']:
    if r['name'] == 'aws-ecomm-customers-who-viewed-x-also-viewed':
        recipe_arn = r['recipeArn']
print(recipe_arn)

arn:aws:personalize:::recipe/aws-ecomm-customers-who-viewed-x-also-viewed


In [39]:
# dataset_group_arn = 'arn:aws:personalize:us-west-2:420618410968:dataset-group/octank-products'
try:
    response = personalize.create_recommender(
      name = 'octank-who-viewed-x-also-viewed',
      recipeArn = recipe_arn,
      datasetGroupArn = dataset_group_arn
    )
    wvx_recommender_arn = response['recommenderArn']
    print(json.dumps(response, indent=2))
except personalize.exceptions.ResourceAlreadyExistsException:
    print('You aready created this recommender, seemingly')
    paginator = personalize.get_paginator('list_recommenders')
    for paginate_result in paginator.paginate(datasetGroupArn = dataset_group_arn):
        for recommender in paginate_result['recommenders']:
            if recommender['name'] == 'octank-who-viewed-x-also-viewed':
                wvx_recommender_arn = recommender['recommenderArn']
                break
                
print(f'Recommended For You recommender ARN = {wvx_recommender_arn}')

You aready created this recommender, seemingly
Recommended For You recommender ARN = arn:aws:personalize:us-west-2:420618410968:recommender/octank-who-viewed-x-also-viewed


### Check if Recommender Solution Version is Ready

In [None]:
%%time

recommender_arns = [rfy_recommender_arn, wvx_recommender_arn]

max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    for recommender_arn in reversed(recommender_arns):
        response = personalize.describe_recommender(
            recommenderArn = recommender_arn
        )
        status = response["recommender"]["status"]

        if status == "ACTIVE":
            print(f'Recommender {recommender_arn} successfully completed')
            recommender_arns.remove(recommender_arn)
        elif status == "CREATE FAILED":
            print(f'Recommender {recommender_arn} failed')
            if response.get('failureReason'):
                print('   Reason: ' + response['failureReason'])
            recommender_arns.remove(recommender_arn)

    if len(recommender_arns) > 0:
        print('At least one recommender is still in progress')
        time.sleep(60)
    else:
        print("All recommenders have completed")
        break

Recommender arn:aws:personalize:us-west-2:420618410968:recommender/octank-recommended-for-you successfully completed
At least one recommender is still in progress
At least one recommender is still in progress
At least one recommender is still in progress
At least one recommender is still in progress
At least one recommender is still in progress
At least one recommender is still in progress
At least one recommender is still in progress
At least one recommender is still in progress
At least one recommender is still in progress
At least one recommender is still in progress
At least one recommender is still in progress
At least one recommender is still in progress
At least one recommender is still in progress
At least one recommender is still in progress
At least one recommender is still in progress
At least one recommender is still in progress
At least one recommender is still in progress
At least one recommender is still in progress
At least one recommender is still in progress
At least 

## Evaluate Recommender

### Get user persona

In [22]:
user_id = 100

persona = 'instruments_books_electronics'

print('Shopper persona for user {} is {}'.format(user_id, persona))

Shopper persona for user 100 is instruments_books_electronics


### Load product Image and Product info

In [23]:
items = pd.read_csv('products-details.csv')

item_details = items[items['id']=='e1669081-8ffc-4dec-97a6-e9176d7f6651'].to_dict(orient='records')[0]

item_details

{'id': 'e1669081-8ffc-4dec-97a6-e9176d7f6651',
 'url': 'http://d23ar2gbu3zcc4.cloudfront.net/#/product/e1669081-8ffc-4dec-97a6-e9176d7f6651',
 'sk': nan,
 'name': 'Sans Pareil Scarf',
 'category': 'apparel',
 'style': 'scarf',
 'description': 'Sans pareil scarf for women',
 'aliases': nan,
 'price': 124.99,
 'image': 'http://d23ar2gbu3zcc4.cloudfront.net/images/apparel/e1669081-8ffc-4dec-97a6-e9176d7f6651.jpg',
 'gender_affinity': 'F',
 'current_stock': 12,
 'featured': nan}

In [35]:
from ipywidgets import widgets, HBox, VBox
from IPython import display

personalize_runtime = boto3.client('personalize-runtime')

items = pd.read_csv('products-details.csv')

def evaluate_item_list(item_list):

    Vbox_list = []
    for item in item_list:
        item_details = items[items['id']==item['itemId']].to_dict(orient='records')[0]
        product_id = item_details["id"]
        product_name = item_details["name"]
        product_category = item_details["category"]

        Vbox_list.append(VBox([widgets.Label(f"Product Id: {product_id}"),
                               widgets.Label(f"Product Name: {product_name}"),
                               widgets.Label(f"Product Category: {product_category}"),
                               widgets.Image(value=open(f"../2-opensearch/images/{product_id}.jpg", 'rb').read())]))
    
    hbox = HBox(Vbox_list)
    display.display(hbox)

### Get recommendation using Recommend For You

In [37]:
# rfy_recommender_arn = 'arn:aws:personalize:us-west-2:420618410968:recommender/octank-recommended-for-you'

get_recommendations_response = personalize_runtime.get_recommendations(
    recommenderArn = rfy_recommender_arn,
    userId = str(user_id),
    numResults = 5
)

item_list = get_recommendations_response['itemList']

print('User persona: ' + persona)

evaluate_item_list(item_list)

User persona: instruments_books_electronics


HBox(children=(VBox(children=(Label(value='Product Id: 76fa669b-1611-4f31-8377-55c3e701ced4'), Label(value='Pr…

### Get recommendation using Who Viewed X Also Viewed

In [40]:
# wvx_recommender_arn = 'arn:aws:personalize:us-west-2:420618410968:recommender/octank-who-viewed-x-also-viewed'

get_recommendations_response = personalize_runtime.get_recommendations(
    recommenderArn = wvx_recommender_arn,
    userId = str(user_id),
    itemId = '5156955f-dda2-4e19-831e-752c92bd8f85',
    numResults = 5
)

item_list = get_recommendations_response['itemList']

print('User persona: ' + persona)

evaluate_item_list(item_list)

User persona: instruments_books_electronics


HBox(children=(VBox(children=(Label(value='Product Id: 8a94535e-4638-43ed-ab9a-2ac90849a98b'), Label(value='Pr…

### Create event tracker

In [41]:
try:
    event_tracker_response = personalize.create_event_tracker(
        datasetGroupArn=dataset_group_arn,
        name='octank-event-tracker'
    )

    event_tracker_arn = event_tracker_response['eventTrackerArn']
    event_tracking_id = event_tracker_response['trackingId']
except personalize.exceptions.ResourceAlreadyExistsException:
    print('You aready created an event tracker for this dataset group, seemingly')
    paginator = personalize.get_paginator('list_event_trackers')
    for paginate_result in paginator.paginate(datasetGroupArn = dataset_group_arn):
        for event_tracker in paginate_result['eventTrackers']:
            if event_tracker['name'] == 'retaildemostore-event-tracker':
                event_tracker_arn = event_tracker['eventTrackerArn']
                
                response = personalize.describe_event_tracker(eventTrackerArn = event_tracker_arn)
                event_tracking_id = response['eventTracker']['trackingId']
                break

print('Event Tracker ARN: ' + event_tracker_arn)
print('Event Tracking ID: ' + event_tracking_id)

Event Tracker ARN: arn:aws:personalize:us-west-2:420618410968:event-tracker/d82a4e40
Event Tracking ID: 197ec641-f3a7-4176-bbe1-eb9489c14e8e


### Check if event tracker is active

In [42]:
status = None
max_time = time.time() + 60*60 # 1 hours
while time.time() < max_time:
    describe_event_tracker_response = personalize.describe_event_tracker(
        eventTrackerArn = event_tracker_arn
    )
    status = describe_event_tracker_response["eventTracker"]["status"]
    print("EventTracker: {}".format(status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(15)

EventTracker: CREATE PENDING
EventTracker: CREATE IN_PROGRESS
EventTracker: ACTIVE


### Filter out purchased product

In [43]:
response = personalize.create_filter(
    name = 'octank-filter-exclude-purchased-products',
    datasetGroupArn = dataset_group_arn,
    filterExpression = 'EXCLUDE itemId WHERE INTERACTIONS.event_type in ("Purchase")'
)
 
filter_arn = response['filterArn']
print(f'Filter ARN: {filter_arn}')

Filter ARN: arn:aws:personalize:us-west-2:420618410968:filter/octank-filter-exclude-purchased-products


In [44]:
status = None
max_time = time.time() + 60*60 # 1 hours
while time.time() < max_time:
    describe_filter_response = personalize.describe_filter(
        filterArn = filter_arn
    )
    status = describe_filter_response["filter"]["status"]
    print("Filter: {}".format(status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(15)

Filter: CREATE IN_PROGRESS
Filter: CREATE IN_PROGRESS
Filter: ACTIVE


## Test Kenisis Stream

In [3]:
import random

event_type_sample_set = {'AddToCart', 'Purchase', 'StartCheckout', 'View', 'ViewCart'}

properties_sample_set = {'{"discount": "No"}', '{"discount": "Yes"}'}

def generate_personalize_event(user_id, item_list, event_tracking_id):
    event =dict()
    event['trackingId'] = event_tracking_id
    event['userId'] = str(user_id)
    event['sessionId'] = str(uuid.uuid4())
    
    event_list = []
    for item in item_list:
        interaction = dict()
        interaction['eventId'] = str(uuid.uuid4())
        interaction['eventType'] = random.choice(tuple(event_type_sample_set))
        interaction['itemId'] = item['itemId']
        interaction['sentAt'] = int(time.time())
        interaction['properties'] = random.choice(tuple(properties_sample_set))
        
        event_list.append(interaction)
    
    event['eventList'] = event_list
    return event

In [4]:
kds = boto3.client('kinesis')

stream_name = 'octank-event-stream'
event_tracking_id = '197ec641-f3a7-4176-bbe1-eb9489c14e8e'

item_list = [
    {'itemId': '988dde6a-b4a7-45a5-9e05-78dd796b6851'},
    {'itemId': '124db2fa-17c0-4e94-9844-d1b64a081df5'},
    {'itemId': '56dcfc2b-01d2-42d1-8002-32fdbe1a034a'},
]

for i in range(100):
    event = generate_personalize_event(100, item_list, event_tracking_id)

    response = kds.put_record(
        StreamName=stream_name,
        Data=json.dumps(event),
        PartitionKey=str(uuid.uuid4())
    )

    print(response)

{'ShardId': 'shardId-000000000000', 'SequenceNumber': '49630994345669250508510890036613360485269689110406103042', 'ResponseMetadata': {'RequestId': 'e1815f97-a270-56ba-bab8-d6b5ec389c8e', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': 'e1815f97-a270-56ba-bab8-d6b5ec389c8e', 'x-amz-id-2': 'Fq73gDWstA94UHGD6TkhWCFbbyfByogkhATWExVYUWvUqI9B/Yav7R+KyGBRnoudbQEyez6U0XUpl6tlxSGAZWUI3Qb+zCKw', 'date': 'Fri, 01 Jul 2022 23:44:34 GMT', 'content-type': 'application/x-amz-json-1.1', 'content-length': '110'}, 'RetryAttempts': 0}}
{'ShardId': 'shardId-000000000003', 'SequenceNumber': '49630994345736152744106481906039176565907248824098750514', 'ResponseMetadata': {'RequestId': 'f50fb13b-dc79-1dda-ae36-38199231d7ee', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': 'f50fb13b-dc79-1dda-ae36-38199231d7ee', 'x-amz-id-2': 'IepLqNQTBStgPIJX0GnqywHNHVOQWTyXyUYzze40Z4IbxqYmAGjifnHUEQ6M60EVFkumNADmqic3Tzl0HB3ynYGxdAAFWr7W', 'date': 'Fri, 01 Jul 2022 23:44:34 GMT', 'content-type': 'applic

{'ShardId': 'shardId-000000000002', 'SequenceNumber': '49630994345713851998907951282925446141485737002330488866', 'ResponseMetadata': {'RequestId': 'c15a49b9-94f1-9812-9a63-c09adab95226', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': 'c15a49b9-94f1-9812-9a63-c09adab95226', 'x-amz-id-2': 'CYdilLxmL4rh+k8+IP7TVsNfH0iMiaVlubdlmVnXkU2yVEE3V0B7IWM4QjUVbRLG+GPBKilvL8LNjYeJd7bKb0IEu7rh61p8', 'date': 'Fri, 01 Jul 2022 23:44:35 GMT', 'content-type': 'application/x-amz-json-1.1', 'content-length': '110'}, 'RetryAttempts': 0}}
{'ShardId': 'shardId-000000000002', 'SequenceNumber': '49630994345713851998907951282926655067305351631505195042', 'ResponseMetadata': {'RequestId': 'e4dc7cdc-770a-529c-bfe5-f5ff394298a8', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': 'e4dc7cdc-770a-529c-bfe5-f5ff394298a8', 'x-amz-id-2': 'NXjvD7yrzrTatQf8GcNE962TkZxvSkPbOsLBawDWjmMnkE5GgqqHnFE1KYrRg7WqwLqZH4uloWM+pt4e3typuT34H9qUgoQL', 'date': 'Fri, 01 Jul 2022 23:44:35 GMT', 'content-type': 'applic

{'ShardId': 'shardId-000000000002', 'SequenceNumber': '49630994345713851998907951282955669286976102731698143266', 'ResponseMetadata': {'RequestId': 'f3e3fac9-7a3c-aa3e-a8da-73ea3474600a', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': 'f3e3fac9-7a3c-aa3e-a8da-73ea3474600a', 'x-amz-id-2': 'BQWlMS2U9SFiIg1Tt6R2t/uhbSVHvXOp5Si6/Erk9nXxftryYZLuUG+tIAHFJdAcK56otuG/ybfjzBZgsxpkB+Wn0LTccNAf', 'date': 'Fri, 01 Jul 2022 23:44:35 GMT', 'content-type': 'application/x-amz-json-1.1', 'content-length': '110'}, 'RetryAttempts': 0}}
{'ShardId': 'shardId-000000000002', 'SequenceNumber': '49630994345713851998907951282956878212795717360872849442', 'ResponseMetadata': {'RequestId': 'fca471f3-ad5b-ed71-a79d-f8d0e3132745', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': 'fca471f3-ad5b-ed71-a79d-f8d0e3132745', 'x-amz-id-2': 'y+78uWr+9sbk/TE1QeIkJerP4sewzsoqcvbtxX4oKJ1W69LWofZ84qXhb+m9ngOPG4T7EMvpSSxHhebpEyHQ77YPFhpHZY7u', 'date': 'Fri, 01 Jul 2022 23:44:35 GMT', 'content-type': 'applic

{'ShardId': 'shardId-000000000001', 'SequenceNumber': '49630994345691551253709420659845565640013434728734523410', 'ResponseMetadata': {'RequestId': 'e165eb2b-3d60-e7de-ba5c-620873282dea', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': 'e165eb2b-3d60-e7de-ba5c-620873282dea', 'x-amz-id-2': '1LRBvYd9uLUA6apirBTHOU5/PwgUo8Bcbqc63WRzJM5iRUHv+G1MpgDNINlud4s1IbZeHhlWLYA+Q2yMtQynCMAgF446WEri', 'date': 'Fri, 01 Jul 2022 23:44:35 GMT', 'content-type': 'application/x-amz-json-1.1', 'content-length': '110'}, 'RetryAttempts': 0}}
{'ShardId': 'shardId-000000000001', 'SequenceNumber': '49630994345691551253709420659846774565833049357909229586', 'ResponseMetadata': {'RequestId': 'f75c5b6f-fe1a-1aaf-ac65-d24cb052d09b', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': 'f75c5b6f-fe1a-1aaf-ac65-d24cb052d09b', 'x-amz-id-2': 'ALFICjrkp1+bozlW1kAs2zG0qVQHvtH9sSbCRDaEwaAQG0939TWlQtHmPcE0mTn79/Igr+iJUO2B+hJrfSXY+fAt3k73JXcy', 'date': 'Fri, 01 Jul 2022 23:44:35 GMT', 'content-type': 'applic