# Train your movie recommendation model in Amazon Personalize

This notebook shows how to train a machine learning model with Amazon Personalize that recommends movies to users based on a movie they liked.

We will use the MovieLens 20M dataset to train our model. MovieLens is a well-known dataset of movie ratings. It comes in different sizes and formats; here we use ml-20m, which contains 20 million ratings applied to 27,000 movies by 138,000 users (see https://grouplens.org/datasets/movielens/).

To create a machine learning model, we need to complete the following steps:

1. Prepare the data so it can be imported into Amazon Personalize
2. Create a dataset group and configure a dataset import job to import the data into Amazon Personalize
3. Train and evaluate our model by creating a solution and solution version
4. Validate our model performance and deploy an endpoint which can serve predictions

# Setup

First, import the libraries required in this notebook and do some basic initialization.


In [1]:
%matplotlib inline

import boto3, os
import json
import numpy as np
import pandas as pd
import time
import sagemaker
from sklearn.utils import shuffle
os.environ['AWS_DEFAULT_REGION']="us-east-1"


We will create a new S3 bucket to store the data and assets required for the chatbot. The bucket is named movie-chatbot-resources-<account_number>. Change the name in the next cell if you want a different one.

In [2]:
sts = boto3.client('sts')
s3 = boto3.client('s3')
personalize = boto3.client('personalize')

accountId = sts.get_caller_identity()["Account"]
bucket = 'movie-chatbot-resources-' + accountId
s3.create_bucket(Bucket=bucket)

{'ResponseMetadata': {'RequestId': '3AC6698D551387EB',
  'HostId': 'PGrLhKVNPRmWSaLjxYCoh/DCeyRXvyNRQBCykyhBtewNfsj1M8JZFQ3yav6CTV1vjG1Zm5ixXec=',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amz-id-2': 'PGrLhKVNPRmWSaLjxYCoh/DCeyRXvyNRQBCykyhBtewNfsj1M8JZFQ3yav6CTV1vjG1Zm5ixXec=',
   'x-amz-request-id': '3AC6698D551387EB',
   'date': 'Tue, 13 Aug 2019 17:42:55 GMT',
   'location': '/movie-chatbot-resources-028626156119',
   'content-length': '0',
   'server': 'AmazonS3'},
  'RetryAttempts': 0},
 'Location': '/movie-chatbot-resources-028626156119'}

In [3]:
# file name for the file containing ratings
filename_ratings = "ratings.csv"
# file name for file containing movie title to ID mapping
filename_movies = "movies.csv"

suffix= "20m"

# Download and process data

Once we’ve downloaded and unzipped the dataset, let’s load the ‘ratings.csv’ file and apply the following processing:

- Shuffle reviews.
- Keep only movies rated 4 and above, and drop the rating column: in our use case we only want the model to recommend movies that users should really like.
- Rename columns to the names used in the schema.
- Keep only 1,000,000 interactions to minimize training time (this is just a demo after all!).

We can achieve this with standard functionality in [Pandas](https://pandas.pydata.org/) and [SciKit-Learn](https://scikit-learn.org/stable/).

In [4]:
!curl -O http://files.grouplens.org/datasets/movielens/ml-20m.zip
!unzip -o ml-20m.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  189M  100  189M    0     0  46.7M      0  0:00:04  0:00:04 --:--:-- 46.7M
Archive:  ml-20m.zip
  inflating: ml-20m/genome-scores.csv  
  inflating: ml-20m/genome-tags.csv  
  inflating: ml-20m/links.csv        
  inflating: ml-20m/movies.csv       
  inflating: ml-20m/ratings.csv      
  inflating: ml-20m/README.txt       
  inflating: ml-20m/tags.csv         


In [5]:
ratings = pd.read_csv('./ml-20m/ratings.csv', header=0, names=['USER_ID','ITEM_ID','RATING','TIMESTAMP'])
pd.set_option('display.max_rows', 10)
ratings

Unnamed: 0,USER_ID,ITEM_ID,RATING,TIMESTAMP
0,1,2,3.5,1112486027
1,1,29,3.5,1112484676
2,1,32,3.5,1112484819
3,1,47,3.5,1112484727
4,1,50,3.5,1112484580
...,...,...,...,...
20000258,138493,68954,4.5,1258126920
20000259,138493,69526,4.5,1259865108
20000260,138493,69644,3.0,1260209457
20000261,138493,70286,5.0,1258126944


In [6]:
data = shuffle(ratings)
data = data[data['RATING'] > 3.5 ] # Only take "good" movies into account
data = data.drop(columns='RATING') # Drop ratings column to simplify training
data = data[:1000000] # Only use first million ratings to improve training speed

print('unique users %d; unique items %d'%(
    len(data['USER_ID'].unique()), len(data['ITEM_ID'].unique())))

unique users 125242; unique items 13229


The movies.csv file contains the mapping of item IDs to movie titles. We will need this later in our chatbot Lambda function to map movie titles back to IDs. Because we reduced the interactions to 1,000,000, we also remove the movies that no longer have any ratings.

In [7]:
movies = pd.read_csv('./ml-20m/movies.csv', header=0, names=['ITEM_ID','title','genre'])
uniqueMovieIds = data['ITEM_ID'].unique() # get the unique movie IDs
movies = movies[movies.ITEM_ID.isin(uniqueMovieIds)]  # keep only movies that appear in the interactions data
movies


Unnamed: 0,ITEM_ID,title,genre
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
27054,130034,Stand by Me Doraemon (2014),Animation|Children|Drama|Fantasy
27069,130073,Cinderella (2015),Adventure|Children|Drama|Sci-Fi
27122,130490,Insurgent (2015),Action|Romance|Sci-Fi
27258,131166,WWII IN HD (2009),(no genres listed)
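
The title-to-ID mapping mentioned above could be built along the following lines. This is only a minimal sketch using the standard library: the helper name `build_title_index` is hypothetical, the lower-casing of titles is an assumption about how the chatbot might normalize user input, and the two sample rows are copied from the table above.

```python
import csv
import io

def build_title_index(csv_file):
    """Map lower-cased movie titles to their ITEM_ID (kept as strings)."""
    reader = csv.DictReader(csv_file)
    return {row["title"].strip().lower(): row["ITEM_ID"] for row in reader}

# Tiny in-memory stand-in for movies.csv with the same column names.
sample = io.StringIO(
    "ITEM_ID,title,genre\n"
    "1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy\n"
    "2,Jumanji (1995),Adventure|Children|Fantasy\n"
)
title_to_id = build_title_index(sample)
print(title_to_id["toy story (1995)"])  # prints 1
```

Keeping the IDs as strings is convenient here because the interactions schema used later in this notebook declares ITEM_ID as a string.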


## Upload data

We upload the preprocessed data to S3 so that it can be imported into Amazon Personalize later.

In [8]:
data.to_csv(filename_ratings, index=False)
movies.to_csv(filename_movies, index=False)

# Upload both files to the bucket, using the file name as the object key
s3.upload_file(filename_ratings, bucket, filename_ratings)
s3.upload_file(filename_movies, bucket, filename_movies)

# Configure permissions for import

Attach a bucket policy that allows Amazon Personalize to list the bucket and read the objects in it.

In [9]:
s3 = boto3.client("s3")

policy = {
    "Version": "2012-10-17",
    "Id": "PersonalizeS3BucketAccessPolicy",
    "Statement": [
        {
            "Sid": "PersonalizeS3BucketAccessPolicy",
            "Effect": "Allow",
            "Principal": {
                "Service": "personalize.amazonaws.com"
            },
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::{}".format(bucket),
                "arn:aws:s3:::{}/*".format(bucket)
            ]
        }
    ]
}

s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy));

In [10]:
from botocore.exceptions import ClientError
iam = boto3.client("iam")

role_name = "PersonalizeS3Role-"+suffix
assume_role_policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Service": "personalize.amazonaws.com"
          },
          "Action": "sts:AssumeRole"
        }
    ]
}
try:
    create_role_response = iam.create_role(
        RoleName = role_name,
        AssumeRolePolicyDocument = json.dumps(assume_role_policy_document)
    );
    role_arn = create_role_response["Role"]["Arn"]
except ClientError as e:
    if e.response['Error']['Code'] == 'EntityAlreadyExists':
        role_arn = iam.get_role(RoleName=role_name)['Role']['Arn']
    else:
        raise
iam.attach_role_policy(
    RoleName = role_name,
    PolicyArn = "arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess"
);
print(role_arn)

arn:aws:iam::028626156119:role/PersonalizeS3Role-20m


# Create Schema

Now create a schema in Amazon Personalize that describes the data format (see https://docs.aws.amazon.com/personalize/latest/dg/how-it-works-dataset-schema.html).


In [11]:
schema = {
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {
            "name": "USER_ID",
            "type": "string"
        },
        {
            "name": "ITEM_ID",
            "type": "string"
        },
        {
            "name": "TIMESTAMP",
            "type": "long"
        }
    ],
    "version": "1.0"
}



create_schema_response = personalize.create_schema(
    name = "DEMO-sims-schema"+suffix,
    schema = json.dumps(schema)
)

schema_arn = create_schema_response['schemaArn']
print(json.dumps(create_schema_response, indent=2))

{
  "schemaArn": "arn:aws:personalize:us-east-1:028626156119:schema/DEMO-sims-schema20m",
  "ResponseMetadata": {
    "RequestId": "a2313d29-f740-4673-a57a-79ef92ef2ded",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Tue, 13 Aug 2019 17:43:20 GMT",
      "x-amzn-requestid": "a2313d29-f740-4673-a57a-79ef92ef2ded",
      "content-length": "85",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}
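
Because the CSV columns were renamed to match the schema fields, a quick local check before starting the import can catch a mismatch early. The sketch below is an illustration only: `missing_schema_fields` is a hypothetical helper, it compares names but not types, and it does not replicate whatever validation Amazon Personalize performs itself.

```python
import csv
import io
import json

def missing_schema_fields(schema_json, csv_file):
    """Return the schema field names that are absent from the CSV header row."""
    schema_fields = [f["name"] for f in json.loads(schema_json)["fields"]]
    header = next(csv.reader(csv_file))
    return [name for name in schema_fields if name not in header]

# Same schema as in the cell above, serialized the same way.
schema_json = json.dumps({
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {"name": "USER_ID", "type": "string"},
        {"name": "ITEM_ID", "type": "string"},
        {"name": "TIMESTAMP", "type": "long"},
    ],
    "version": "1.0",
})

# In-memory stand-ins for a renamed file and for the raw MovieLens header.
renamed = io.StringIO("USER_ID,ITEM_ID,TIMESTAMP\n1,2,1112486027\n")
raw = io.StringIO("userId,movieId,timestamp\n1,2,1112486027\n")
print(missing_schema_fields(schema_json, renamed))  # []
print(missing_schema_fields(schema_json, raw))      # ['USER_ID', 'ITEM_ID', 'TIMESTAMP']
```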


# Create and Wait for Dataset Group creation

First we create a dataset group in Amazon Personalize. A dataset group is a container that holds all the data and Amazon Personalize objects required for a specific use case.

In [12]:
create_dataset_group_response = personalize.create_dataset_group(
    name = "DEMO-sims-dataset-group-"+suffix
)

dataset_group_arn = create_dataset_group_response['datasetGroupArn']
print(json.dumps(create_dataset_group_response, indent=2))

{
  "datasetGroupArn": "arn:aws:personalize:us-east-1:028626156119:dataset-group/DEMO-sims-dataset-group-20m",
  "ResponseMetadata": {
    "RequestId": "36475ddc-e2e1-49de-ad88-a512e23c0d2c",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Tue, 13 Aug 2019 17:43:20 GMT",
      "x-amzn-requestid": "36475ddc-e2e1-49de-ad88-a512e23c0d2c",
      "content-length": "106",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


In [13]:
%%time
status = None
max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_dataset_group_response = personalize.describe_dataset_group(
        datasetGroupArn = dataset_group_arn
    )
    status = describe_dataset_group_response["datasetGroup"]["status"]
    print("DatasetGroup: {}".format(status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(20)

DatasetGroup: CREATE PENDING
DatasetGroup: CREATE PENDING
DatasetGroup: ACTIVE
CPU times: user 7.4 ms, sys: 3.94 ms, total: 11.3 ms
Wall time: 40.2 s
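
The same polling pattern recurs for every long-running resource in this notebook. As a sketch, it could be factored into one helper. The function name `wait_for_status` and its parameters are hypothetical, and the status sequence below is a stand-in taken from the output above rather than a real API call.

```python
import time

def wait_for_status(get_status, label, timeout_s=3 * 60 * 60, poll_s=60,
                    terminal=("ACTIVE", "CREATE FAILED")):
    """Call get_status() repeatedly until it returns a terminal status or the timeout expires."""
    deadline = time.time() + timeout_s
    status = None
    while time.time() < deadline:
        status = get_status()
        print("{}: {}".format(label, status))
        if status in terminal:
            break
        time.sleep(poll_s)
    return status

# Stand-in for a describe_* call: replays a fixed sequence of statuses.
fake_statuses = iter(["CREATE PENDING", "CREATE PENDING", "ACTIVE"])
final_status = wait_for_status(lambda: next(fake_statuses), "DatasetGroup", poll_s=0)
```

With the real client, `get_status` would wrap the describe call used above, for example `lambda: personalize.describe_dataset_group(datasetGroupArn=dataset_group_arn)["datasetGroup"]["status"]`. Returning the final status lets the caller distinguish ACTIVE from CREATE FAILED, and a `None` or non-terminal result signals a timeout.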


# Prepare, Create, and Wait for Dataset Import Job

In [14]:
%%time
dataset_type = "INTERACTIONS"
create_dataset_response = personalize.create_dataset(
    datasetType = dataset_type,
    datasetGroupArn = dataset_group_arn,
    schemaArn = schema_arn,
    name = "DEMO-sims-dataset-"+suffix
)

dataset_arn = create_dataset_response['datasetArn']
print(json.dumps(create_dataset_response, indent=2))

{
  "datasetArn": "arn:aws:personalize:us-east-1:028626156119:dataset/DEMO-sims-dataset-group-20m/INTERACTIONS",
  "ResponseMetadata": {
    "RequestId": "694a7b65-4ac7-4229-8b57-d30921fa47d5",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Tue, 13 Aug 2019 17:44:00 GMT",
      "x-amzn-requestid": "694a7b65-4ac7-4229-8b57-d30921fa47d5",
      "content-length": "108",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}
CPU times: user 5.25 ms, sys: 137 µs, total: 5.39 ms
Wall time: 31.8 ms


In [15]:
create_dataset_import_job_response = personalize.create_dataset_import_job(
    jobName = "DEMO-sims-dataset-import-job-"+suffix,
    datasetArn = dataset_arn,
    dataSource = {
        "dataLocation": "s3://{}/{}".format(bucket, filename_ratings)
    },
    roleArn = role_arn
)

dataset_import_job_arn = create_dataset_import_job_response['datasetImportJobArn']
print(json.dumps(create_dataset_import_job_response, indent=2))

{
  "datasetImportJobArn": "arn:aws:personalize:us-east-1:028626156119:dataset-import-job/DEMO-sims-dataset-import-job-20m",
  "ResponseMetadata": {
    "RequestId": "68fde107-c54d-4463-a8e2-3289c21cba7f",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Tue, 13 Aug 2019 17:43:59 GMT",
      "x-amzn-requestid": "68fde107-c54d-4463-a8e2-3289c21cba7f",
      "content-length": "120",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


Wait for the dataset import job to reach ACTIVE status (this should take about 15 minutes).

In [None]:
%%time
status = None
max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_dataset_import_job_response = personalize.describe_dataset_import_job(
        datasetImportJobArn = dataset_import_job_arn
    )
    
    dataset_import_job = describe_dataset_import_job_response["datasetImportJob"]
    if "latestDatasetImportJobRun" not in dataset_import_job:
        status = dataset_import_job["status"]
        print("DatasetImportJob: {}".format(status))
    else:
        status = dataset_import_job["latestDatasetImportJobRun"]["status"]
        print("LatestDatasetImportJobRun: {}".format(status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(60)

# Create Solution

To train a model that can serve recommendations, you first create a solution in Amazon Personalize and then create a solution version to kick off the training job.

In [None]:
recipe_list = personalize.list_recipes()
for recipe in recipe_list['recipes']:
    print(recipe['recipeArn'])

There are many recipes for different scenarios. In this example, we only have interactions data, so we will choose one from the basic recipes.

| Feasible? | Recipe | Description 
|-------- | -------- |:------------
| Y | aws-popularity-count | Calculates popularity of items based on count of events against that item in user-item interactions dataset.
| Y | aws-hrnn | Predicts items a user will interact with. A hierarchical recurrent neural network which can model the temporal order of user-item interactions.
| N - requires metadata | aws-hrnn-metadata | Predicts items a user will interact with. HRNN with additional features derived from contextual metadata (user-item interaction metadata), user metadata (user dataset), and item metadata (item dataset).
| N - for bandits and requires metadata | aws-hrnn-coldstart | Predicts items a user will interact with. HRNN-metadata with personalized exploration of new items.
| Y - for item-based queries | aws-sims | Computes items similar to a given item based on co-occurrence of items in the same user history in the user-item interactions dataset.
| N - for reranking a short list | aws-personalized-ranking | Reranks a list of items for a user. Trains on the user-item interactions dataset.


We (or AutoML) could run all of these recipes and choose the best-performing model based on the offline metrics. Comparisons are recommended, especially against the popularity baseline, to see the lift in metrics from personalization. In this demo, however, we pick a single recipe, aws-sims, because we want recommendations based on a movie the user liked, and we use it to illustrate smell tests.

In [None]:
recipe_arn = "arn:aws:personalize:::recipe/aws-sims"

In [None]:
create_solution_response = personalize.create_solution(
    name = "DEMO-sims-solution-"+suffix,
    datasetGroupArn = dataset_group_arn,
    recipeArn = recipe_arn,
)

solution_arn = create_solution_response['solutionArn']
print(json.dumps(create_solution_response, indent=2))

In [None]:
create_solution_version_response = personalize.create_solution_version(
    solutionArn = solution_arn
)

solution_version_arn = create_solution_version_response['solutionVersionArn']
print(json.dumps(create_solution_version_response, indent=2))

Now wait for the solution version to reach ACTIVE status. This can take about 40 minutes!

In [None]:
%%time
status = None
max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_solution_version_response = personalize.describe_solution_version(
        solutionVersionArn = solution_version_arn
    )
    status = describe_solution_version_response["solutionVersion"]["status"]
    print("SolutionVersion: {}".format(status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(60)

### Get Metrics of Solution

Once your training is finished, you can get various evaluation metrics for your model. An explanation of the evaluation metrics is provided at https://docs.aws.amazon.com/personalize/latest/dg/working-with-training-metrics.html

For example, suppose we recommend four items and two of them are relevant, $r=[0,1,0,1]$. In this case, the metrics are:

|Name |Example |Explanation
|:------|:----------|:----------
|Precision@K |$\frac{2}{4} = 0.5$ |Total relevant items divided by total recommended items.
|Mean reciprocal ranks (MRR@K) |${\rm mean}(\frac{1}{2}, \frac{1}{4}) = 0.375$ |Considers positional effects by computing the mean of the inverse positions of all relevant items.
|Normalized discounted cumulative gains (NDCG@K) |$\frac{\frac{1}{\log(1 + 2)} + \frac{1}{\log(1 + 4)}}{\frac{1}{\log(1 + 1)} + \frac{1}{\log(1 + 2)}} = 0.65$ |Considers positional effects by applying inverse logarithmic weights based on the positions of relevant items, normalized by the largest possible scores from ideal recommendations.
|Average precision (AP@K) |${\rm mean}(\frac{1}{2}, \frac{2}{4}) = 0.5$ |Average of precision@K, where K ranges over the position of every relevant item.
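
The worked example in the table can be reproduced in a few lines of Python. This is a sketch that follows the definitions exactly as written in the table (in particular, MRR here averages over all relevant items, whereas some other definitions use only the first relevant item). The function name `ranking_metrics` is hypothetical, and the code makes no claim about how Amazon Personalize computes its metrics internally.

```python
import math

def ranking_metrics(r):
    """Compute the four table metrics for a list r of 0/1 relevance flags in ranked order."""
    k = len(r)
    # 1-based ranks of the relevant items, e.g. [2, 4] for r = [0, 1, 0, 1]
    positions = [i + 1 for i, rel in enumerate(r) if rel]
    n_rel = len(positions)
    if n_rel == 0:
        return {"precision": 0.0, "mrr": 0.0, "ndcg": 0.0, "ap": 0.0}
    precision = n_rel / k
    mrr = sum(1 / p for p in positions) / n_rel
    dcg = sum(1 / math.log(1 + p) for p in positions)
    # Ideal ranking: all relevant items occupy the top n_rel positions
    idcg = sum(1 / math.log(1 + p) for p in range(1, n_rel + 1))
    # Precision at each relevant position: (relevant items seen so far) / position
    ap = sum((j + 1) / p for j, p in enumerate(positions)) / n_rel
    return {"precision": precision, "mrr": mrr, "ndcg": dcg / idcg, "ap": ap}

metrics = ranking_metrics([0, 1, 0, 1])
for name, value in metrics.items():
    print("{}: {:.3f}".format(name, value))
```

For $r=[0,1,0,1]$ the values agree with the table up to rounding. The base of the logarithm cancels in the NDCG ratio, so the natural logarithm used here gives the same result as any other base.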

In [None]:
get_metrics_response = personalize.get_solution_metrics(
    solutionVersionArn = solution_version_arn
)

print(json.dumps(get_metrics_response, indent=2))

# Create and Wait for Campaign

To deploy the solution version and serve predictions, we need to create a campaign in Amazon Personalize.

In [None]:
%%time
create_campaign_response = personalize.create_campaign(
    name = "DEMO-sims-campaign-"+suffix,
    solutionVersionArn = solution_version_arn,
    minProvisionedTPS = 2,    
)

campaign_arn = create_campaign_response['campaignArn']
print(json.dumps(create_campaign_response, indent=2))

Wait for Campaign to Have ACTIVE Status (Takes about 10 minutes)

In [None]:
%%time
status = None
max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_campaign_response = personalize.describe_campaign(
        campaignArn = campaign_arn
    )
    status = describe_campaign_response["campaign"]["status"]
    print("Campaign: {}".format(status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(60)

## To aid in interpretation, let's look at some items

To better interpret the results, we need to map each item ID to a movie title. We can do this using the movies.csv file provided with the dataset.

In [None]:
movies = pd.read_csv('./ml-20m/movies.csv', header=0, names=['ITEM_ID','title','genre'])
movies=movies.set_index('ITEM_ID')
movies.head()

Now we pick a couple of items and check whether the recommended items are generally of similar genres. Note that the model did not use this metadata (genre) for training; this is a sanity or smell test to see if the model discovered similar items that 'make sense'.

### Movies similar to Toy Story

In [None]:
personalize_runtime = boto3.client('personalize-runtime')

rec_response = personalize_runtime.get_recommendations(
        campaignArn = campaign_arn,
        itemId = str(1)
    )
rec_items = [int(x['itemId']) for x in rec_response['itemList']]
movies.loc[rec_items[:5]]

### Movies similar to Father of the Bride Part II

In [None]:
rec_response = personalize_runtime.get_recommendations(
        campaignArn = campaign_arn,
        itemId = str(5)  # ITEM_ID 5 is "Father of the Bride Part II" (ITEM_ID 2 is Jumanji)
    )
rec_items = [int(x['itemId']) for x in rec_response['itemList']]

In [None]:
movies.loc[rec_items[:5]]
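
The smell test above can be made slightly more quantitative by measuring how many genres a query movie shares with another movie. The sketch below uses Jaccard similarity on the `|`-separated genre strings. The helper name `genre_overlap` is hypothetical, and the three genre strings are rows copied from the movies table earlier in this notebook, not model output.

```python
def genre_overlap(genres_a, genres_b):
    """Jaccard similarity between two '|'-separated MovieLens genre strings."""
    a, b = set(genres_a.split("|")), set(genres_b.split("|"))
    return len(a & b) / len(a | b)

toy_story = "Adventure|Animation|Children|Comedy|Fantasy"
jumanji = "Adventure|Children|Fantasy"
grumpier_old_men = "Comedy|Romance"

print(genre_overlap(toy_story, jumanji))           # 3 shared genres out of 5 in total
print(genre_overlap(toy_story, grumpier_old_men))  # 1 shared genre out of 6 in total
```

Applied to the `genre` column of the recommended rows, a consistently low overlap would be a hint to inspect the recommendations more closely, although similar items do not have to share genres.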

## Congratulations, we are now ready to use this campaign endpoint within our chatbot!

In [None]:
campaign_arn