## Create higher-quality recommendations in your e-commerce platform with Amazon Personalize

This notebook is part of the AWS re:Invent 2019 Builder's Session "RET310 - Create higher-quality recommendations in your e-commerce platform".

In this notebook, you can run and deploy in your own AWS account all the steps taken during the session. By the end, the Amazon Personalize campaign endpoints will be ready to serve recommendations to your e-commerce application.

More details about the Builder's Session are available at: https://www.portal.reinvent.awsevents.com/connect/sessionDetail.ww?SESSION_ID=98681

The dataset used in this exercise was created by Olist and made available on Kaggle. To download the latest version of the dataset, or to check details about the data and its schemas, see: https://www.kaggle.com/olistbr/brazilian-ecommerce

#### Important Note

As with all machine learning solutions, data preparation is a key step toward a higher-quality result. For the data preparation steps taken in this exercise, see the notebook "RET310 - Data preparation steps.ipynb", also present in this repository.

## Table of Contents

* [Section 1 - Preparation steps](#first-section)
* [Section 2 - Data schemas creation steps](#second-section)
* [Section 3 - Create dataset group and datasets](#third-section)
* [Section 4 - Importing Olist data into the datasets](#forth-section)
* [Section 5 - Creating Solutions and training new Solutions versions](#fifth-section)
* [Section 6 - Deploying Personalize Campaigns with the trained Solutions](#sixth-section)
* [Section 7 - Testing the deployed Campaigns using the Python SDK](#seventh-section)
* [Section 8 - Clean Up steps](#eighth-section)

### Section 1 - Preparation steps <a class="anchor" id="first-section"></a>

In this section you will complete the prerequisites for deploying the Amazon Personalize solution: importing Python modules, defining S3 bucket information, and creating IAM roles with appropriate access.

In [None]:
import json
import time
import boto3
personalize = boto3.client('personalize')

At this point, the IAM Role to be used by Amazon Personalize will be created:

In [None]:
iam = boto3.client('iam')
path='/'
role_name='ret310-personalize-role' # you may change this role name if needed
description='IAM role with permissions to run the lab RET310'
trust_policy={
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Service": "personalize.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

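try:
    response = iam.create_role(
        Path=path,
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(trust_policy),
        Description=description,
        MaxSessionDuration=3600
    )
    roleArn = response['Role']['Arn']
except iam.exceptions.EntityAlreadyExistsException:
    # The role already exists (for example, from a previous run): reuse it
    # instead of failing later because `response` was never assigned.
    roleArn = iam.get_role(RoleName=role_name)['Role']['Arn']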

With the IAM Role created, an IAM Policy is created and attached to the IAM Role, allowing the appropriate access to the required services:

In [None]:
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetObject",
                "personalize:*"
            ],
            "Resource": "*"
        }
    ]
}

response = iam.create_policy(PolicyName='ret310-policy', PolicyDocument=json.dumps(policy))
policyArn = response['Policy']['Arn']

iam.attach_role_policy(RoleName=role_name, PolicyArn=policyArn)

### Section 2 - Data schemas creation steps <a class="anchor" id="second-section"></a>

At this point, you define where the data is stored on S3 and start the initial Amazon Personalize setup, including the data schemas:

In [None]:
bucket = 'personalize-lcm'
users_data = 'users-olist.csv'
products_data = 'products-olist.csv'
interactions_data = 'orderItems-olist.csv'

In [None]:
# TODO: create the bucket policy
# TODO: fix the quotes
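
Amazon Personalize reads the CSV files directly from S3, so the bucket needs a policy that lets the `personalize.amazonaws.com` service principal list the bucket and read its objects. The cell above only notes this step. The sketch below builds such a policy document under that assumption; the helper name `build_personalize_bucket_policy` is illustrative and not part of the original notebook, and the `put_bucket_policy` call is left commented out so that nothing changes in your account until you have reviewed the policy.

```python
import json


def build_personalize_bucket_policy(bucket_name):
    """Return a bucket policy dict granting Amazon Personalize read access.

    Hypothetical helper: the statement mirrors the IAM policy used above,
    but scoped to a single bucket and to the Personalize service principal.
    """
    return {
        "Version": "2012-10-17",
        "Id": "PersonalizeS3BucketAccessPolicy",
        "Statement": [
            {
                "Sid": "PersonalizeS3BucketAccessPolicy",
                "Effect": "Allow",
                "Principal": {"Service": "personalize.amazonaws.com"},
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    "arn:aws:s3:::{}".format(bucket_name),
                    "arn:aws:s3:::{}/*".format(bucket_name),
                ],
            }
        ],
    }


bucket_policy = build_personalize_bucket_policy('personalize-lcm')
print(json.dumps(bucket_policy, indent=2))

# To apply it (assumes you own the bucket and may call s3:PutBucketPolicy):
# import boto3
# boto3.client('s3').put_bucket_policy(Bucket='personalize-lcm',
#                                      Policy=json.dumps(bucket_policy))
```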

Here the Amazon Personalize schemas are defined, following the Apache Avro format. For more details, check: https://docs.aws.amazon.com/personalize/latest/dg/how-it-works-dataset-schema.html

In [None]:
users_schema = {
    "type": "record",
    "name": "Users",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {
            "name": "USER_ID",
            "type": "string"
        },
        {
            "name": "ZipCode",
            "type": "long"
        },
        {
            "name": "State",
            "type": "string"
        }
    ],
    "version": "1.0"
}

products_schema = {
    "type": "record",
    "name": "Items",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {
            "name": "ITEM_ID",
            "type": "string"
        },
        {
            "name": "CATEGORY",
            "type": "string",
            "categorical": True
        }
    ],
    "version": "1.0"
}

interactions_schema = {
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {
            "name": "USER_ID",
            "type": "string"
        },
        {
            "name": "ITEM_ID",
            "type": "string"
        },
        {
            "name": "Timestamp",
            "type": "long"
        }
    ],
    "version": "1.0"
}
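
Before registering schemas, it can help to check them locally. The Amazon Personalize documentation lists `USER_ID`, `ITEM_ID` and `TIMESTAMP` as the required fields of an Interactions schema, `USER_ID` for Users and `ITEM_ID` for Items. The sketch below reports required names that are absent or that appear only with different casing. The `REQUIRED_FIELDS` table and the helper `missing_required_fields` are assumptions made for illustration, and whether the service rejects a casing mismatch is not verified here.

```python
# Required field names per dataset type (assumption based on the
# Amazon Personalize schema documentation linked above).
REQUIRED_FIELDS = {
    "Users": ["USER_ID"],
    "Items": ["ITEM_ID"],
    "Interactions": ["USER_ID", "ITEM_ID", "TIMESTAMP"],
}


def missing_required_fields(schema):
    """Map each missing required field to a look-alike name, or None.

    A look-alike is a field whose name matches only when casing is ignored.
    """
    present = [field["name"] for field in schema["fields"]]
    problems = {}
    for required in REQUIRED_FIELDS.get(schema["name"], []):
        if required in present:
            continue
        lookalikes = [name for name in present if name.upper() == required]
        problems[required] = lookalikes[0] if lookalikes else None
    return problems


# Standalone sample with the same field names as interactions_schema.
sample_interactions_schema = {
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {"name": "USER_ID", "type": "string"},
        {"name": "ITEM_ID", "type": "string"},
        {"name": "Timestamp", "type": "long"},
    ],
    "version": "1.0",
}

schema_problems = missing_required_fields(sample_interactions_schema)
print(schema_problems)
```

An empty dictionary means every required name is present exactly as listed. If a look-alike is reported, the field name and the matching CSV column header would need to be renamed together.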

With the Avro schemas defined, you will now register them in Amazon Personalize for the Users, Products and Interactions data:

In [None]:
create_users_schema_response = personalize.create_schema(name = 'ret310-users-schema', 
                                                         schema = json.dumps(users_schema))
create_products_schema_response = personalize.create_schema(name = 'ret310-products-schema', 
                                                            schema = json.dumps(products_schema))
create_interactions_schema_response = personalize.create_schema(name = 'ret310-interactions-schema', 
                                                                schema = json.dumps(interactions_schema))

users_schema_arn = create_users_schema_response['schemaArn']
products_schema_arn = create_products_schema_response['schemaArn']
interactions_schema_arn = create_interactions_schema_response['schemaArn']

### Section 3 - Create dataset group and datasets <a class="anchor" id="third-section"></a>

First you will create the dataset group to be used on this exercise - it will be called "ret310-dataset-group":

In [None]:
create_dataset_group_response = personalize.create_dataset_group(name = 'ret310-dataset-group')
dataset_group_arn = create_dataset_group_response['datasetGroupArn']

max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_dataset_group_response = personalize.describe_dataset_group(datasetGroupArn = dataset_group_arn)
    status = describe_dataset_group_response['datasetGroup']['status']
    print('DatasetGroup: {}'.format(status))
    
    if status == 'ACTIVE' or status == 'CREATE FAILED':
        break
        
    time.sleep(60)
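
The same polling loop is repeated for every resource in this notebook (dataset group, import jobs, solution versions, campaigns). As a sketch, it could be factored into one helper that takes a zero-argument function returning the current status. The name `wait_for_status` and its parameters are hypothetical; the demonstration uses a fake status sequence so that it runs without AWS access.

```python
import time


def wait_for_status(describe_fn, terminal=('ACTIVE', 'CREATE FAILED'),
                    timeout_s=3 * 60 * 60, poll_s=60):
    """Poll describe_fn() until it returns a terminal status.

    describe_fn: callable with no arguments returning the status string.
    Raises TimeoutError if no terminal status is seen within timeout_s.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = describe_fn()
        print('Status: {}'.format(status))
        if status in terminal:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                'Still {} after {} seconds'.format(status, timeout_s))
        time.sleep(poll_s)


# Demonstration with a fake sequence of statuses (no AWS call is made).
fake_statuses = iter(['CREATE PENDING', 'CREATE IN_PROGRESS', 'ACTIVE'])
final_status = wait_for_status(lambda: next(fake_statuses), poll_s=0)

# With the real client, the call would look like:
# wait_for_status(lambda: personalize.describe_dataset_group(
#     datasetGroupArn=dataset_group_arn)['datasetGroup']['status'])
```

Returning the final status lets the caller stop the notebook early when a resource ends in `CREATE FAILED` instead of continuing to the next cell.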

With the dataset group created, now you will create one dataset for each type of data: Users, Products and Interactions:

In [None]:
%%time

dataset_type = 'USERS'
create_dataset_response = personalize.create_dataset(name = 'ret310-users-dataset',
                                                     datasetType = dataset_type,
                                                     datasetGroupArn = dataset_group_arn,
                                                     schemaArn = users_schema_arn)

users_dataset_arn = create_dataset_response['datasetArn']

dataset_type = 'ITEMS'
create_dataset_response = personalize.create_dataset(name = 'ret310-items-dataset',
                                                     datasetType = dataset_type,
                                                     datasetGroupArn = dataset_group_arn,
                                                     schemaArn = products_schema_arn)

products_dataset_arn = create_dataset_response['datasetArn']

dataset_type = 'INTERACTIONS'
create_dataset_response = personalize.create_dataset(name = 'ret310-interactions-dataset',
                                                     datasetType = dataset_type,
                                                     datasetGroupArn = dataset_group_arn,
                                                     schemaArn = interactions_schema_arn)

interactions_dataset_arn = create_dataset_response['datasetArn']

### Section 4 - Importing Olist data into the datasets<a class="anchor" id="forth-section"></a>

Now with all the datasets properly created, you will import the Olist data, stored on S3, into them.

First you will import the Users data:

In [None]:
%%time

create_users_import_job_response = personalize.create_dataset_import_job(jobName = 'ret310-users-import-job',
                                                                         datasetArn = users_dataset_arn,
                                                                         roleArn = roleArn,
                                                                         dataSource = {
                                                                             'dataLocation': 's3://{}/{}'.format(bucket, users_data)
                                                                         })

dataset_import_job_arn = create_users_import_job_response['datasetImportJobArn']

max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    create_users_import_job_response = personalize.describe_dataset_import_job(datasetImportJobArn = dataset_import_job_arn)
    status = create_users_import_job_response['datasetImportJob']['status']
    print('DatasetImportJob: {}'.format(status))
    
    if status == 'ACTIVE' or status == 'CREATE FAILED':
        break
        
    time.sleep(60)
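
An import job can end in `CREATE FAILED` when the CSV column headers do not match the field names in the dataset schema. A quick local check before starting an import might look like the sketch below. The helper `header_mismatches` is hypothetical, and the sample row is made up for illustration rather than taken from the Olist files.

```python
import csv
import io


def header_mismatches(csv_text, schema):
    """Compare the CSV header row with the schema's field names."""
    header = next(csv.reader(io.StringIO(csv_text)))
    expected = [field["name"] for field in schema["fields"]]
    return {
        "missing_in_csv": [name for name in expected if name not in header],
        "extra_in_csv": [name for name in header if name not in expected],
    }


# Sample schema with the same field names as users_schema above.
sample_users_schema = {
    "type": "record",
    "name": "Users",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {"name": "USER_ID", "type": "string"},
        {"name": "ZipCode", "type": "long"},
        {"name": "State", "type": "string"},
    ],
    "version": "1.0",
}

# Made-up CSV content: a header plus a single placeholder row.
sample_csv = "USER_ID,ZipCode,State\nuser-1,1000,SP\n"

mismatches = header_mismatches(sample_csv, sample_users_schema)
print(mismatches)
```

Two empty lists indicate that the header and the schema agree on field names. The check does not validate data types or row contents.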

Then you will import the Products data:

In [None]:
%%time

create_products_import_job_response = personalize.create_dataset_import_job(jobName = 'ret310-items-import-job',
                                                                            datasetArn = products_dataset_arn,
                                                                            roleArn = roleArn,
                                                                            dataSource = {
                                                                                'dataLocation': 's3://{}/{}'.format(bucket, products_data)
                                                                            })

dataset_import_job_arn = create_products_import_job_response['datasetImportJobArn']

max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    create_products_import_job_response = personalize.describe_dataset_import_job(datasetImportJobArn = dataset_import_job_arn)
    status = create_products_import_job_response['datasetImportJob']['status']
    print('DatasetImportJob: {}'.format(status))
    
    if status == 'ACTIVE' or status == 'CREATE FAILED':
        break
        
    time.sleep(60)

And then you will finish the importing process, with the Interactions data import:

In [None]:
%%time

create_interactions_import_job_response = personalize.create_dataset_import_job(jobName = 'ret310-interactions-import-job',
                                                                                datasetArn = interactions_dataset_arn,
                                                                                roleArn = roleArn,
                                                                                dataSource = {
                                                                                    'dataLocation': 's3://{}/{}'.format(bucket, interactions_data)
                                                                                })

dataset_import_job_arn = create_interactions_import_job_response['datasetImportJobArn']

max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    create_interactions_import_job_response = personalize.describe_dataset_import_job(datasetImportJobArn = dataset_import_job_arn)
    status = create_interactions_import_job_response['datasetImportJob']['status']
    print('DatasetImportJob: {}'.format(status))
    
    if status == 'ACTIVE' or status == 'CREATE FAILED':
        break
        
    time.sleep(60)

### Section 5 - Creating Solutions and training new Solutions versions<a class="anchor" id="fifth-section"></a>

Now with all data ready, you will start creating and training Personalize Solutions. For this exercise, you will use Solutions based on the following recipes:
- Popularity
- Item-to-Item similarity (SIMS)
- Personalized Ranking

For more details about the Personalize predefined Recipes, check: https://docs.aws.amazon.com/personalize/latest/dg/working-with-predefined-recipes.html

First you will list all the available recipes:

In [None]:
list_recipes_response = personalize.list_recipes()
for recipe in list_recipes_response['recipes']:
    print(recipe['name'], '-', recipe['recipeArn'], '-', recipe['status'])

Then you will create the solutions to be used by this exercise into the "ret310-dataset-group":

In [None]:
popularity_arn = 'arn:aws:personalize:::recipe/aws-popularity-count'
sims_arn = 'arn:aws:personalize:::recipe/aws-sims'
ranking_arn = 'arn:aws:personalize:::recipe/aws-personalized-ranking'

create_solution_response = personalize.create_solution(name = 'ret310-popularity-solution',
                                                       datasetGroupArn = dataset_group_arn,
                                                       recipeArn = popularity_arn)
popularity_solution_arn = create_solution_response['solutionArn']

create_solution_response = personalize.create_solution(name = 'ret310-sims-solution',
                                                       datasetGroupArn = dataset_group_arn,
                                                       recipeArn = sims_arn)
sims_solution_arn = create_solution_response['solutionArn']

create_solution_response = personalize.create_solution(name='ret310-sims-hpo-solution',
                                                       recipeArn = sims_arn,
                                                       datasetGroupArn = dataset_group_arn,
                                                       performHPO=True,
                                                       performAutoML=False,
                                                       solutionConfig={
                                                                        "hpoConfig": {
                                                                            "algorithmHyperParameterRanges": {
                                                                                "categoricalHyperParameterRanges": [],
                                                                                "continuousHyperParameterRanges": [
                                                                                    {
                                                                                        "name": "popularity_discount_factor",
                                                                                        "minValue": 0,
                                                                                        "maxValue": 1
                                                                                    }
                                                                                ],
                                                                                "integerHyperParameterRanges": [
                                                                                    {
                                                                                        "name": "min_cointeraction_count",
                                                                                        "minValue": 0,
                                                                                        "maxValue": 10
                                                                                    }
                                                                                ]
                                                                            },
                                                                            "hpoResourceConfig": {
                                                                                "maxNumberOfTrainingJobs": "20",
                                                                                "maxParallelTrainingJobs": "5"
                                                                            }
                                                                        },
                                                                        "featureTransformationParameters": {
                                                                            "max_item_interaction_count_percentile": "0.9",
                                                                            "max_user_history_length_percentile": "0.995",
                                                                            "min_item_interaction_count_percentile": "0.01",
                                                                            "min_user_history_length_percentile": "0.005"
                                                                        },
                                                                        "algorithmHyperParameters": {
                                                                            "min_cointeraction_count": "3",
                                                                            "popularity_discount_factor": "0.5"
                                                                        },
                                                                    }
                                                      )
sims_hpo_solution_arn = create_solution_response['solutionArn']

create_solution_response = personalize.create_solution(name = 'ret310-ranking-solution',
                                                       datasetGroupArn = dataset_group_arn,
                                                       recipeArn = ranking_arn)
ranking_solution_arn = create_solution_response['solutionArn']

With all the solutions created, it is time to train them using the Olist dataset. First you will train the solution based on the Popularity recipe. It serves only as a quality baseline for comparison with the SIMS-based solutions:

In [None]:
%%time

create_solution_version_response = personalize.create_solution_version(solutionArn = popularity_solution_arn)
popularity_version_arn = create_solution_version_response['solutionVersionArn']

max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_solution_version_response = personalize.describe_solution_version(solutionVersionArn = popularity_version_arn)
    status = describe_solution_version_response['solutionVersion']['status']
    print('SolutionVersion: {}'.format(status))
    
    if status == 'ACTIVE' or status == 'CREATE FAILED':
        break
        
    time.sleep(60)

get_solution_metrics_response = personalize.get_solution_metrics(solutionVersionArn = popularity_version_arn)
print(json.dumps(get_solution_metrics_response['metrics'], indent=2))

Now you will create the first solution using SIMS recipe - it will use the standard parameters while training:

In [None]:
%%time

create_solution_version_response = personalize.create_solution_version(solutionArn = sims_solution_arn)
sims_version_arn = create_solution_version_response['solutionVersionArn']

max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_solution_version_response = personalize.describe_solution_version(solutionVersionArn = sims_version_arn)
    status = describe_solution_version_response['solutionVersion']['status']
    print('SolutionVersion: {}'.format(status))
    
    if status == 'ACTIVE' or status == 'CREATE FAILED':
        break
        
    time.sleep(60)

get_solution_metrics_response = personalize.get_solution_metrics(solutionVersionArn = sims_version_arn)
print(json.dumps(get_solution_metrics_response['metrics'], indent=2))

Now you will create another SIMS-based solution, this time using the hyperparameter optimization (HPO) feature of Amazon Personalize. With HPO, Personalize automatically tunes the Solution and returns a final solution version that uses the best-performing hyperparameters, as judged by the training metrics.

For more details about Amazon Personalize HPO, please check: https://docs.aws.amazon.com/personalize/latest/dg/customizing-solution-config-hpo.html

In [None]:
%%time

create_solution_version_response = personalize.create_solution_version(solutionArn = sims_hpo_solution_arn)
sims_hpo_version_arn = create_solution_version_response['solutionVersionArn']

max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_solution_version_response = personalize.describe_solution_version(solutionVersionArn = sims_hpo_version_arn)
    status = describe_solution_version_response['solutionVersion']['status']
    print('SolutionVersion: {}'.format(status))
    
    if status == 'ACTIVE' or status == 'CREATE FAILED':
        break
        
    time.sleep(60)

get_solution_metrics_response = personalize.get_solution_metrics(solutionVersionArn = sims_hpo_version_arn)
print(json.dumps(get_solution_metrics_response['metrics'], indent=2))
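
With metrics available for the Popularity baseline and for the SIMS versions, a side-by-side difference makes the comparison easier to read than three separate JSON dumps. The sketch below computes per-metric differences between two metric dictionaries. The helper `metric_deltas` is hypothetical, and the numbers are placeholders chosen for illustration, not results from this dataset.

```python
def metric_deltas(baseline, candidate):
    """Return candidate minus baseline for every metric present in both."""
    shared = sorted(set(baseline) & set(candidate))
    return {name: candidate[name] - baseline[name] for name in shared}


# Placeholder values only; real values come from get_solution_metrics().
sample_popularity_metrics = {'coverage': 0.25, 'precision_at_5': 0.125}
sample_sims_metrics = {'coverage': 0.5, 'precision_at_5': 0.25}

deltas = metric_deltas(sample_popularity_metrics, sample_sims_metrics)
for name, delta in deltas.items():
    print('{:<20} {:+.3f}'.format(name, delta))

# With the real responses, the inputs would be the 'metrics' dictionaries
# returned by personalize.get_solution_metrics(...) for each version ARN.
```

A positive difference means the candidate scored higher than the baseline on that metric.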

The last Solution you will train is based on the Personalized Ranking recipe. Unlike the previous Solutions, which recommend products, it reorders a given list of products for a specific user. This is useful for deciding the order in which to present products to a given customer (for example, when building a dynamic carousel on an e-commerce site):

In [None]:
%%time

create_solution_version_response = personalize.create_solution_version(solutionArn = ranking_solution_arn)
ranking_version_arn = create_solution_version_response['solutionVersionArn']

max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_solution_version_response = personalize.describe_solution_version(solutionVersionArn = ranking_version_arn)
    status = describe_solution_version_response['solutionVersion']['status']
    print('SolutionVersion: {}'.format(status))
    
    if status == 'ACTIVE' or status == 'CREATE FAILED':
        break
        
    time.sleep(60)

get_solution_metrics_response = personalize.get_solution_metrics(solutionVersionArn = ranking_version_arn)
print(json.dumps(get_solution_metrics_response['metrics'], indent=2))

### Section 6 - Deploying Personalize Campaigns with the trained Solutions<a class="anchor" id="sixth-section"></a>

Deploying Campaigns is how you start serving actual recommendations with Personalize. A Campaign gives you an endpoint backed by the Solution version you choose.

For more details about Amazon Personalize Campaigns, please check: https://docs.aws.amazon.com/personalize/latest/dg/campaigns.html

Here you will first create the campaign for the solution based on the SIMS recipe, trained without HPO:

In [None]:
%%time

create_campaign_response = personalize.create_campaign(name = 'ret310-sims-campaign',
                                                       solutionVersionArn = sims_version_arn,
                                                       minProvisionedTPS = 1)

sims_campaign_arn = create_campaign_response['campaignArn']

max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_campaign_response = personalize.describe_campaign(campaignArn = sims_campaign_arn)
    status = describe_campaign_response['campaign']['status']
    print('Campaign: {}'.format(status))
    
    if status == 'ACTIVE' or status == 'CREATE FAILED':
        break
        
    time.sleep(60)

Then you will create the campaign for the SIMS-based Solution trained with HPO:

In [None]:
%%time

create_campaign_response = personalize.create_campaign(name = 'ret310-sims-hpo-campaign',
                                                       solutionVersionArn = sims_hpo_version_arn,
                                                       minProvisionedTPS = 1)

sims_hpo_campaign_arn = create_campaign_response['campaignArn']

max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_campaign_response = personalize.describe_campaign(campaignArn = sims_hpo_campaign_arn)
    status = describe_campaign_response['campaign']['status']
    print('Campaign: {}'.format(status))
    
    if status == 'ACTIVE' or status == 'CREATE FAILED':
        break
        
    time.sleep(60)

Then you will create the campaign for the personalized ranking based solution:

In [None]:
%%time

create_campaign_response = personalize.create_campaign(name = 'ret310-ranking-campaign',
                                                       solutionVersionArn = ranking_version_arn,
                                                       minProvisionedTPS = 1)

ranking_campaign_arn = create_campaign_response['campaignArn']

max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_campaign_response = personalize.describe_campaign(campaignArn = ranking_campaign_arn)
    status = describe_campaign_response['campaign']['status']
    print('Campaign: {}'.format(status))
    
    if status == 'ACTIVE' or status == 'CREATE FAILED':
        break
        
    time.sleep(60)

### Section 7 - Testing the deployed Campaigns using the Python SDK<a class="anchor" id="seventh-section"></a>

As the last step of this exercise, we will simulate inference requests against the Amazon Personalize Campaign endpoints.

First we will call the get_recommendations() API against the campaign for the SIMS-based Solution trained with HPO:

In [None]:
personalize_runtime = boto3.client('personalize-runtime')

response = personalize_runtime.get_recommendations(campaignArn = sims_hpo_campaign_arn,
                                                   itemId = 'cef67bcfe19066a932b7673e239eb23d',
                                                   numResults = 10)

for item in response['itemList']:
    print('item:', item)

Then we will use the Personalized Ranking campaign:

In [None]:
personalize_runtime = boto3.client('personalize-runtime')

list_of_products = ['99a4788cb24856965c36a24e339b6058', 'aca2eb7d00ea1a7b8ebd4e68314663af',
                    '422879e10f46682990de24d770e7f83d', 'd1c427060a0f73f6b889a5c7c61f2ac4',
                    '53b36df67ebb7c41585e8d54d6772e08', '389d119b48cf3043d311335e499d9c6b',
                    '368c6c730842d78016ad823897a372db', '53759a2ecddad2bb87a079a1f1519f73',
                    '154e7e31ebfa092203795c972e5804a6', '2b4609f8948be18874494203496bc318']

print('initial products list:\n', list_of_products, '\n')

response = personalize_runtime.get_personalized_ranking(campaignArn = ranking_campaign_arn,
                                                userId = '18955e83d337fd6b2def6b18a428ac77',
                                                inputList = list_of_products)

print('ranked products list:\n')

for item in response['personalizedRanking']:
    print(item['itemId'])
print()
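
To see how the ranking campaign reordered the input, each returned item can be paired with its position in the original list. The sketch below does this for a response shaped like the one returned by `get_personalized_ranking` (a `personalizedRanking` list of dictionaries with an `itemId` key). The helper `rank_changes` and the sample item IDs are hypothetical.

```python
def rank_changes(input_list, ranking_response):
    """Return (item_id, original_position, ranked_position) tuples."""
    original_position = {item_id: pos
                         for pos, item_id in enumerate(input_list)}
    changes = []
    for new_pos, item in enumerate(ranking_response['personalizedRanking']):
        item_id = item['itemId']
        changes.append((item_id, original_position.get(item_id), new_pos))
    return changes


# Made-up IDs and response, shaped like the runtime API output.
sample_input = ['item-a', 'item-b', 'item-c']
sample_response = {
    'personalizedRanking': [
        {'itemId': 'item-b'},
        {'itemId': 'item-a'},
        {'itemId': 'item-c'},
    ]
}

changes = rank_changes(sample_input, sample_response)
for item_id, old_pos, new_pos in changes:
    print('{}: position {} -> {}'.format(item_id, old_pos, new_pos))
```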

### Section 8 - Clean Up steps<a class="anchor" id="eighth-section"></a>

After running all steps mentioned here, it is time to clean up your account. This will remove all objects/services created to make this exercise possible. It includes IAM and Personalize resources.

In [None]:
# Removing all the deployed Campaigns - the output of this cell must be []:

personalize.delete_campaign(campaignArn=sims_campaign_arn)
personalize.delete_campaign(campaignArn=sims_hpo_campaign_arn)
personalize.delete_campaign(campaignArn=ranking_campaign_arn)
time.sleep(300)

# list_campaigns filters by Solution ARN, not by solution version ARN
print(personalize.list_campaigns(solutionArn=sims_solution_arn)['campaigns'])
print(personalize.list_campaigns(solutionArn=sims_hpo_solution_arn)['campaigns'])
print(personalize.list_campaigns(solutionArn=ranking_solution_arn)['campaigns'])
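
The fixed `time.sleep(300)` calls in this section may wait longer than necessary, or not long enough. An alternative sketch is to poll the matching describe call until it raises a not-found error. The helper `wait_until_deleted` is hypothetical; it receives the exception class as a parameter so that the demonstration below can run with a fake exception and without AWS access.

```python
import time


def wait_until_deleted(describe_fn, not_found_exc, timeout_s=600, poll_s=15):
    """Poll describe_fn() until it raises not_found_exc.

    Returns True once the resource is gone, False if timeout_s elapses first.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            describe_fn()
        except not_found_exc:
            return True
        time.sleep(poll_s)
    return False


# Demonstration with a fake resource that disappears on the third call.
class FakeNotFound(Exception):
    pass


calls = {'count': 0}


def fake_describe():
    calls['count'] += 1
    if calls['count'] >= 3:
        raise FakeNotFound()
    return {'status': 'DELETE IN_PROGRESS'}


deleted = wait_until_deleted(fake_describe, FakeNotFound, poll_s=0)
print('deleted:', deleted)

# With the real client, the call would look like:
# wait_until_deleted(
#     lambda: personalize.describe_campaign(campaignArn=sims_campaign_arn),
#     personalize.exceptions.ResourceNotFoundException)
```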

In [None]:
# Removing all the Solutions and their versions - the output of this cell must be []:

personalize.delete_solution(solutionArn=popularity_solution_arn)
personalize.delete_solution(solutionArn=sims_solution_arn)
personalize.delete_solution(solutionArn=sims_hpo_solution_arn)
personalize.delete_solution(solutionArn=ranking_solution_arn)
time.sleep(300)

personalize.list_solutions(datasetGroupArn=dataset_group_arn)['solutions']

In [None]:
# Removing all datasets - the output of this cell must be []:

personalize.delete_dataset(datasetArn=users_dataset_arn)
personalize.delete_dataset(datasetArn=products_dataset_arn)
personalize.delete_dataset(datasetArn=interactions_dataset_arn)
time.sleep(300)

datasets = personalize.list_datasets(datasetGroupArn=dataset_group_arn)['datasets']
if datasets:
    for dataset in datasets:
        print(dataset['name'])
else:
    print([])

In [None]:
# Removing the dataset group - the output of this cell must be []:

personalize.delete_dataset_group(datasetGroupArn=dataset_group_arn)

dgs = personalize.list_dataset_groups()['datasetGroups']
available = []
for dg in dgs:
    if dg['datasetGroupArn'] == dataset_group_arn:  # compare ARNs, not the name
        available.append(1)
if not available:
    print(available)

In [None]:
# Removing the data schemas - the output of this cell must be []:

# Use the ARNs captured in Section 2 rather than hard-coded account-specific ARNs
personalize.delete_schema(schemaArn=users_schema_arn)
personalize.delete_schema(schemaArn=products_schema_arn)
personalize.delete_schema(schemaArn=interactions_schema_arn)

ret310_schemas = ['ret310-users-schema', 'ret310-products-schema', 'ret310-interactions-schema']

schemas = personalize.list_schemas()['schemas']
available = []
for schema in schemas:
    if schema['name'] in ret310_schemas:
        available.append(1)
if not available:
    print(available)

In [None]:
# Deleting all IAM objects

iam.detach_role_policy(RoleName=role_name, PolicyArn=policyArn)
time.sleep(10)
iam.delete_policy(PolicyArn=policyArn)
time.sleep(10)
iam.delete_role(RoleName=role_name)