# Generative AI for Personalized Marketing

## How to Use This Notebook

The code is broken up into cells. In a Jupyter notebook, to run code in a cell, you can click in the cell, and then click the play icon on the toolbar. Or, you can press the [Shift]+[Enter] key combination. When the code cell finishes running, the text to the left of it changes from an asterisk [ * ] to a number.

Use the following instructions and run the cells to generate this practice example.

### Items data

The items data consists of information about the content that is being interacted with, which generally comes from Content Management Systems (CMS).

### Interactions data

The ml-latest-small dataset from the [MovieLens](https://grouplens.org/datasets/movielens/) project is used as a proxy for user-item interactions.

### User data

User data is not used in this example to train the Amazon Personalize model because the MovieLens dataset does not provide this data. However, you will experiment with different user personas when working on the email preparation and prompt.

## Set up environment

#### Cell 1

In [None]:
# Import packages
import boto3
import time
import pandas as pd
import json
import random

### Prepare the item metadata

Load the items data from a CSV file.

Note: Your use of IMDb data is for the sole purpose of completing the AWS workshop and/or tutorial. Any use of IMDb data outside of the AWS workshop and/or tutorial requires a data license from IMDb. To obtain a data license, please contact: imdb-licensing-support@imdb.com. You will not (and will not allow a third party to) (i) use IMDb data, or any derivative works thereof, for any purpose; (ii) copy, sublicense, rent, sell, lease or otherwise transfer or distribute IMDb data or any portion thereof to any person or entity for any purpose not permitted within the workshop and/or tutorial; (iii) decompile, disassemble, or otherwise reverse engineer or attempt to reconstruct or discover any source code or underlying ideas or algorithms of IMDb data by any means whatsoever; or (iv) knowingly remove any product identification, copyright or other notices from IMDb data.

#### Cell 2

In [None]:
item_data = pd.read_csv('imdb/items.csv', sep=',', dtype={'PROMOTION': "string"})
item_data.head(5)

#### Cell 3

In [None]:
movies = pd.read_csv('imdb/items.csv', sep=',', usecols=[0,1], encoding='latin-1', dtype={'movieId': "str", 'imdbId': "str", 'tmdbId': "str"})
pd.set_option('display.max_rows', 25)
movies

### Prepare the interactions data

Read the interactions data from a CSV file.

#### Cell 4

In [None]:
interactions_df = pd.read_csv('interactions.csv')
interactions_df

#### Cell 5

In [None]:
# Get all unique user IDs from the interaction dataset

user_ids = interactions_df['USER_ID'].unique()
user_data = pd.DataFrame()
user_data["USER_ID"]= user_ids
user_data

## Introduction to Amazon Personalize

[Amazon Personalize](https://aws.amazon.com/pm/personalize/) is a fully managed machine learning (ML) service that uses your data to generate item recommendations for your users. Amazon Personalize helps developers build applications with a wide array of personalization use cases, and it automates many of the complicated steps to build, train, and deploy an ML model.  

Regardless of the use case, the algorithms all share a base of learning on user-item-interaction data, which is defined by three core attributes:

1. **UserID** - The user who interacted
1. **ItemID** - The item the user interacted with
1. **Timestamp** - The time at which the interaction occurred

Generally speaking, your data will not arrive in a perfect form for Amazon Personalize and will take some modification to be structured correctly. This notebook guides you through that process.
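
As an illustration of that kind of modification, the following minimal sketch maps raw rating rows onto the core attributes. The raw column names (`userId`, `movieId`, `rating`, `timestamp`), the sample rows, the rating threshold, and the `Watch` event label are all assumptions made for this example; they are not taken from the files used in this notebook.

```python
import csv
import io

# Hypothetical raw export; column names and values are made up for illustration.
RAW_CSV = """userId,movieId,rating,timestamp
1,tt0114709,4.0,964982703
1,tt0113228,2.0,964981247
2,tt0114709,5.0,1445714835
"""

def to_interactions(raw_text, min_rating=3.0, event_type="Watch"):
    """Map raw rating rows to USER_ID, ITEM_ID, EVENT_TYPE, TIMESTAMP rows."""
    rows = []
    for record in csv.DictReader(io.StringIO(raw_text)):
        # Assumption: only ratings at or above the threshold count as positive interactions.
        if float(record["rating"]) < min_rating:
            continue
        rows.append({
            "USER_ID": record["userId"],
            "ITEM_ID": record["movieId"],
            "EVENT_TYPE": event_type,
            "TIMESTAMP": int(record["timestamp"]),  # Unix time in seconds
        })
    # Order by user, then by time, so each user's history reads in sequence.
    rows.sort(key=lambda r: (r["USER_ID"], r["TIMESTAMP"]))
    return rows

interactions = to_interactions(RAW_CSV)
for row in interactions:
    print(row)
```

Filtering on a rating threshold is one possible design choice: it treats low ratings as "not an interaction" rather than as negative feedback.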

### Items data

The items data consists of information about the content that is being interacted with, which generally comes from content management systems (CMS). For the purpose of this practice example, the IMDb TT ID is used to provide a common identifier between the interactions data and the content metadata. MovieLens provides its own identifier as well as the IMDb TT ID (without the leading 'tt') in the 'links.csv' file. This dataset is not mandatory, but providing good item metadata helps you get the best results from your trained models.

### Interactions data

The interactions data consists of information about the interactions that the users of the fictional app have with the content. This usually comes from analytics tools or customer data platforms (CDPs). The best interactions data for use with Amazon Personalize captures which content users watched or clicked on and the sequential order of those interactions. To simulate the interactions data, data from the [MovieLens project](https://grouplens.org/datasets/movielens/) is used. MovieLens offers multiple versions of its dataset; this practice example uses a reduced version (approximately 100,000 ratings and 3,600 tag applications applied to 9,000 movies by 600 users).

### User data

The user data is the information you have about your users. It usually comes from customer relationship management (CRM) or subscriber management systems. Because no user data is included in the MovieLens data, a small synthetic dataset will be generated to simulate this component of the practice example. This dataset is not mandatory, but providing good user metadata helps you get the best results from your trained models. User data is not used to train the recommender in this practice example.

In this notebook, interactions and item data are imported into your environment. The data is inspected and converted to a format that Amazon Personalize can use to train models that generate personalized recommendations.

#### Cell 6

In [None]:
# Configure the SDK to Amazon Personalize
personalize = boto3.client('personalize')
personalize_runtime = boto3.client('personalize-runtime')

### Creating Amazon Personalize resources and importing data 

#### Get the account ID and Region

#### Cell 7

In [None]:
account_id = boto3.client('sts').get_caller_identity().get('Account')
print("account id:", account_id)

with open('/opt/ml/metadata/resource-metadata.json') as notebook_info:
    data = json.load(notebook_info)
    resource_arn = data['ResourceArn']
    region = resource_arn.split(':')[3]
print("region:", region)

### IAM role

Amazon Personalize needs the ability to assume AWS Identity and Access Management (IAM) roles to have the permissions to run certain tasks and access an Amazon Simple Storage Service (Amazon S3) data bucket for the files needed.  

#### Cell 8

In [None]:
# Set up a Boto3 client to access IAM functions 
iam = boto3.client('iam')

#  A role has been set up for this solution. The following obtains the ARN for that role 
#  and also prints the role name for your information

role_response = iam.get_role(RoleName='personalize_exec_role')
role_arn = role_response['Role']['Arn']

# Read the role name directly from the response instead of parsing the ARN
role_name = role_response['Role']['RoleName']
role_name

### S3 bucket

So far, we have downloaded, manipulated, and saved the data on the Amazon Elastic Block Store (Amazon EBS) volume attached to the instance running this Jupyter notebook.

By default, Amazon Personalize does not have permission to access the data uploaded to the S3 bucket in this practice account. To grant Amazon Personalize access to read the practice CSV files, you must set a bucket policy and create an IAM role that Amazon Personalize will assume.
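
For reference, a bucket policy of this kind might look like the following sketch. The bucket name is a hypothetical placeholder, and the exact set of actions you need can differ; this version grants the `personalize.amazonaws.com` service principal read and list access only.

```python
import json

def personalize_bucket_policy(bucket_name):
    """Build a bucket policy that lets Amazon Personalize read objects in the bucket."""
    return {
        "Version": "2012-10-17",
        "Id": "PersonalizeS3BucketAccessPolicy",
        "Statement": [
            {
                "Sid": "PersonalizeS3BucketAccessPolicy",
                "Effect": "Allow",
                "Principal": {"Service": "personalize.amazonaws.com"},
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}",      # the bucket itself (for ListBucket)
                    f"arn:aws:s3:::{bucket_name}/*",    # every object in the bucket (for GetObject)
                ],
            }
        ],
    }

# "personalized-marketing-example" is a made-up bucket name for illustration.
policy = personalize_bucket_policy("personalized-marketing-example")
print(json.dumps(policy, indent=2))
```

A policy like this could be attached with `s3.put_bucket_policy(Bucket=..., Policy=json.dumps(policy))`. In this practice environment the bucket and role are already provided, so treat the sketch as background rather than a required step.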

Use the metadata stored on the instance underlying this notebook to determine the Region that it operates in. If you were using a Jupyter notebook outside of Amazon SageMaker, you would define the Region yourself as a string (for example, `us-east-1`). The S3 bucket must be in the same Region as the Amazon Personalize resources that you create.
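
The Region lookup can be sketched as a small helper that parses an ARN when one is available and otherwise falls back to an environment variable. The helper names, the example ARN, and the `us-east-1` default are assumptions for illustration; `AWS_DEFAULT_REGION` is the environment variable that the AWS SDKs commonly read.

```python
import os

def region_from_arn(arn):
    """Return the Region field, which is the fourth colon-separated part of an ARN."""
    parts = arn.split(":")
    if len(parts) < 6 or parts[0] != "arn":
        raise ValueError(f"Not an ARN: {arn!r}")
    return parts[3]

def resolve_region(resource_arn=None, env=None, default="us-east-1"):
    """Prefer the Region from an ARN; otherwise use AWS_DEFAULT_REGION or a default."""
    env = os.environ if env is None else env
    if resource_arn:
        return region_from_arn(resource_arn)
    return env.get("AWS_DEFAULT_REGION", default)

# Hypothetical ARN with a placeholder account ID.
example_arn = "arn:aws:sagemaker:us-west-2:123456789012:notebook-instance/example"
print(resolve_region(resource_arn=example_arn))
print(resolve_region(env={"AWS_DEFAULT_REGION": "eu-west-1"}))
```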

#### Cell 9

In [None]:
# Set up a Boto3 client to access S3 functions 
s3 = boto3.client('s3')

# Get a list of all S3 buckets so that we can find the one that starts with "personalized-marketing"
response = s3.list_buckets()

# Filter buckets that start with 'personalized-marketing'
buckets_list = [bucket['Name'] for bucket in response['Buckets'] if bucket['Name'].startswith('personalized-marketing')]

# Get the one bucket name from the list
for data_bucket in buckets_list:
    data_bucket_name = data_bucket

# Display the name of the bucket found    
data_bucket_name

### Upload data to Amazon S3

Now, upload the CSV files of our two datasets, items and interactions.

#### Cell 10

In [None]:
interactions_filename = 'interactions.csv'
items_filename = "items.csv"

interactions_file = interactions_filename

try:
    s3.get_object(
        Bucket=data_bucket_name,
        Key=interactions_filename,
    )
    print("{} already exists in the bucket {}".format(interactions_filename, data_bucket_name))
except s3.exceptions.NoSuchKey:
    # Uploading the file if it does not already exist
    boto3.Session().resource('s3').Bucket(data_bucket_name).Object(interactions_filename).upload_file(interactions_filename)
    print("File {} uploaded to bucket {}".format(interactions_filename, data_bucket_name))

items_file = "imdb/" + items_filename

try:
    s3.get_object(
        Bucket=data_bucket_name,
        Key=items_filename,
    )
    print("{} already exists in the bucket {}".format(items_file, data_bucket_name))
except s3.exceptions.NoSuchKey:
    # Uploading the file if it does not already exist
    # Note that the following line will be needed for the DIY     
    boto3.Session().resource('s3').Bucket(data_bucket_name).Object(items_filename).upload_file(items_file)
    print("File {} uploaded to bucket {}".format(items_filename, data_bucket_name))
    

## Create the dataset group

The highest level of isolation and abstraction with Amazon Personalize is a *dataset group*. Information stored within one dataset group has no impact on any other dataset group or on models created from one; they are completely isolated. This isolation lets you run many experiments, and it is part of how your models are kept private and trained only on your data.

Before importing the data prepared earlier, you must create a dataset group and a dataset to hold the interactions.

Dataset groups can house the following types of information:

* User-item-interactions
* Event streams (real-time interactions)
* User metadata
* Item metadata

You must create a dataset group to contain your datasets. Your dataset group can be one of the following types:

* A Domain dataset group is where you create preconfigured resources for different business domains and use cases, such as getting recommendations for similar videos (VIDEO_ON_DEMAND domain) or best-selling items (ECOMMERCE domain). You choose your business domain, import your data, and create recommenders. You use recommenders in your application to get recommendations. Use a Domain dataset group if you have a video on demand or ecommerce application and want Amazon Personalize to find the best configurations for your use cases. If you start with a Domain dataset group, you can also add custom resources such as solutions with solution versions trained with recipes for custom use cases.

* A Custom dataset group is where you create configurable resources for custom use cases and batch recommendation workflows. You choose a recipe, train a solution version (model), and deploy the solution version with a campaign. You use a campaign in your application to get recommendations. Use a Custom dataset group if you don't have a video on demand or ecommerce application or want to configure and manage only custom resources, or want to get recommendations in a batch workflow. If you start with a Custom dataset group, you can't associate it with a domain later. Instead, create a new Domain dataset group.

You can create and manage Domain dataset groups and Custom dataset groups with the AWS Management Console, the AWS Command Line Interface (AWS CLI), or programmatically with the AWS SDKs.

In this solution, you create a Domain dataset group.
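
Several cells in this notebook rebuild a resource ARN by hand when a resource already exists. The pattern can be captured in one helper, sketched here with a hypothetical function name and placeholder Region and account ID; the resource names are the ones used in this notebook.

```python
def personalize_arn(region, account_id, resource_type, resource_name):
    """Compose an Amazon Personalize ARN from its parts."""
    return f"arn:aws:personalize:{region}:{account_id}:{resource_type}/{resource_name}"

# Placeholder values for illustration only.
region = "us-east-1"
account_id = "123456789012"

dataset_group_arn = personalize_arn(region, account_id, "dataset-group", "marketing-email-dataset")
schema_arn = personalize_arn(region, account_id, "schema", "marketing_interactions_schema")
# Dataset ARNs use the dataset group name followed by the dataset type.
dataset_arn = personalize_arn(region, account_id, "dataset", "marketing-email-dataset/INTERACTIONS")

print(dataset_group_arn)
print(schema_arn)
print(dataset_arn)
```

Building the string assumes the resource lives in the same account and Region as the notebook. Looking the resource up by name, for example with `personalize.list_dataset_groups()`, is a more defensive alternative.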

#### Cell 11

In [None]:
marketing_dataset_group_name = "marketing-email-dataset"
try:     
    # Try to create the dataset group. This block will run fully if the dataset group does not exist yet
    # Refer to this section for the DIY
    create_dataset_group_response = personalize.create_dataset_group(
        name = marketing_dataset_group_name,
        domain='VIDEO_ON_DEMAND'
    )

    marketing_dataset_group_arn = create_dataset_group_response['datasetGroupArn']
    print(json.dumps(create_dataset_group_response, indent=2))
    print ('\nCreating the Dataset Group with dataset_group_arn = {}'.format(marketing_dataset_group_arn))

except personalize.exceptions.ResourceAlreadyExistsException as e:
    # If the dataset group already exists, get the unique identifier, marketing_dataset_group_arn, 
    # from the existing resource
    
    marketing_dataset_group_arn = 'arn:aws:personalize:'+region+':'+account_id+':dataset-group/'+marketing_dataset_group_name 
    print ('\nThe Dataset Group with dataset_group_arn = {} already exists'.format(marketing_dataset_group_arn))
    print ('\nWe will be using the existing Dataset Group dataset_group_arn = {}'.format(marketing_dataset_group_arn))

#### Wait for the dataset group to have an ACTIVE status. 

Before you can use the dataset group to create more resources, the dataset group must be active. This should take a minute or two. Run the next code cell, and then wait for it to show the ACTIVE status. It checks the status of the dataset group every 15 seconds for up to 3 minutes.
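
The same wait-and-check pattern recurs throughout this notebook, so it can be useful to see it as a reusable function. The following is a minimal sketch, not the notebook's own code: `wait_for_status` and its parameters are hypothetical names, and the status source is a stand-in iterator so that the example runs without AWS access.

```python
import time

def wait_for_status(get_status, interval_seconds=15, timeout_seconds=180,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it reports ACTIVE or CREATE FAILED, or until time runs out."""
    deadline = clock() + timeout_seconds
    while True:
        status = get_status()
        print(f"Status: {status}")
        if status in ("ACTIVE", "CREATE FAILED"):
            return status
        if clock() >= deadline:
            raise TimeoutError(f"Still {status} after {timeout_seconds} seconds")
        sleep(interval_seconds)

# Stand-in for a describe call: yields a made-up sequence of statuses.
fake_statuses = iter(["CREATE PENDING", "CREATE IN_PROGRESS", "ACTIVE"])
final_status = wait_for_status(lambda: next(fake_statuses), sleep=lambda seconds: None)
print(final_status)
```

With a real client, `get_status` could be a lambda that calls `personalize.describe_dataset_group(datasetGroupArn=...)` and returns `["datasetGroup"]["status"]`. Raising on timeout makes a stalled resource visible instead of letting the loop end silently.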

#### Cell 12

In [None]:
max_time = time.time() + 3*60 # 3 minutes
while time.time() < max_time:
    describe_dataset_group_response = personalize.describe_dataset_group(
        datasetGroupArn = marketing_dataset_group_arn
    )
    status = describe_dataset_group_response["datasetGroup"]["status"]
    print("DatasetGroup: {}".format(status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(15)

## Create the interactions schema

Now that you've loaded and prepared the datasets, you'll configure Amazon Personalize to understand the data so that the service can be used to train models to generate recommendations. Amazon Personalize requires a schema for each dataset so that it can map the columns in the CSV files to fields for model training. Each schema is declared in JSON using the [Apache Avro](https://avro.apache.org/) format.

First, define a schema to tell Amazon Personalize what type of dataset you are uploading. Several mandatory fields are required in the schema, depending on the type of dataset. For more information, see Schemas in the Amazon Personalize Developer Guide at https://docs.aws.amazon.com/personalize/latest/dg/how-it-works-dataset-schema.html.

The interactions dataset has three required columns: `ITEM_ID`, `USER_ID`, and `TIMESTAMP`. The `TIMESTAMP` column represents when the user interacted with an item, and it must be expressed in Unix timestamp format (seconds). This dataset also has an `EVENT_TYPE` column. Columns must be defined in the same order in the schema as they appear in the dataset.
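
Because column order and timestamp units are easy to get wrong, a quick local check before uploading can save an import attempt. The sketch below is an illustration under stated assumptions: the function name and sample rows are made up, and the milliseconds test is a heuristic threshold, not a rule from Amazon Personalize.

```python
import csv
import io

# Field names and order follow the interactions columns described above.
EXPECTED_FIELDS = ["USER_ID", "ITEM_ID", "EVENT_TYPE", "TIMESTAMP"]

# Hypothetical sample rows, made up for illustration.
SAMPLE_CSV = """USER_ID,ITEM_ID,EVENT_TYPE,TIMESTAMP
1,tt0114709,Watch,964982703
2,tt0113228,Click,1445714835000
"""

def check_interactions_csv(csv_text, expected_fields=EXPECTED_FIELDS):
    """Return a list of problems found in the header order and timestamp units."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    if header != list(expected_fields):
        return [f"header {header} does not match {list(expected_fields)}"]
    problems = []
    ts_index = header.index("TIMESTAMP")
    for line_number, row in enumerate(reader, start=2):
        # Heuristic (an assumption): values this large are probably milliseconds.
        if int(row[ts_index]) >= 10**11:
            problems.append(f"line {line_number}: TIMESTAMP looks like milliseconds")
    return problems

problems = check_interactions_csv(SAMPLE_CSV)
print(problems)
```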

#### Cell 13

In [None]:
interactions_schema_name = "marketing_interactions_schema"

interactions_schema = {
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {
            "name": "USER_ID",
            "type": "string"
        },
        {
            "name": "ITEM_ID",
            "type": "string"
        },
        {
            "name": "EVENT_TYPE", # "Watch", "Click", etc.
            "type": "string"
        },
        {
            "name": "TIMESTAMP",
            "type": "long"
        }
    ],
    "version": "1.0"
}

try:
    # Try to create the interactions dataset schema. This block will run fully 
    # if the interactions dataset schema does not exist yet
    create_schema_response = personalize.create_schema(
        name = interactions_schema_name,
        schema = json.dumps(interactions_schema),
        domain='VIDEO_ON_DEMAND'
    )
    print(json.dumps(create_schema_response, indent=2))
    marketing_interactions_schema_arn = create_schema_response['schemaArn']
    print ('\nCreating the Interactions Schema with marketing_interactions_schema_arn = {}'.format(marketing_interactions_schema_arn))
    
except personalize.exceptions.ResourceAlreadyExistsException:
    # If the interactions dataset schema already exists, get the unique identifier marketing_interactions_schema_arn
    # from the existing resource 
    
    marketing_interactions_schema_arn = 'arn:aws:personalize:'+region+':'+account_id+':schema/'+interactions_schema_name 
    print('The schema {} already exists.'.format(marketing_interactions_schema_arn))
    print ('\nWe will be using the existing Interactions Schema with marketing_interactions_schema_arn = {}'.format(marketing_interactions_schema_arn))
 

## Create the interactions dataset

With a schema created, you can create a dataset within the dataset group. Note that this does not load the data yet; it creates an empty dataset whose structure is defined by the schema.

#### Cell 14

In [None]:
interactions_dataset_name = "marketing_interactions"
try:
    # Try to create the interactions dataset. This block will run fully 
    # if the interactions dataset does not exist yet
    
    dataset_type = 'INTERACTIONS'
    create_dataset_response = personalize.create_dataset(
        name = interactions_dataset_name,
        datasetType = dataset_type,
        datasetGroupArn = marketing_dataset_group_arn,
        schemaArn = marketing_interactions_schema_arn
    )

    marketing_interactions_dataset_arn = create_dataset_response['datasetArn']
    print(json.dumps(create_dataset_response, indent=2))
    print ('\nCreating the Interactions Dataset with marketing_interactions_dataset_arn = {}'.format(marketing_interactions_dataset_arn))
    
except personalize.exceptions.ResourceAlreadyExistsException:
    # If the interactions dataset already exists, get the unique identifier, marketing_interactions_dataset_arn, 
    # from the existing resource 
    marketing_interactions_dataset_arn =  'arn:aws:personalize:'+region+':'+account_id+':dataset/'+marketing_dataset_group_name+'/INTERACTIONS'
    print('The Interactions Dataset {} already exists.'.format(marketing_interactions_dataset_arn))
    print ('\nWe will be using the existing Interactions Dataset with marketing_interactions_dataset_arn = {}'.format(marketing_interactions_dataset_arn))
        

## Create the items (movies) schema

First, define a schema to tell Amazon Personalize what type of dataset you are uploading. Several reserved and mandatory keywords are required in the schema, based on the type of dataset. For more information, see Schemas in the Amazon Personalize Developer Guide at https://docs.aws.amazon.com/personalize/latest/dg/how-it-works-dataset-schema.html.

The item metadata has the following columns: `ITEM_ID`, `TITLE`, `YEAR`, `IMDB_RATING`, `IMDB_NUMBEROFVOTES`, `US_MATURITY_RATING_STRING`, `US_MATURITY_RATING`, `GENRES`, `CREATION_TIMESTAMP`, and `PROMOTION`. These columns must be defined in the same order in the schema as they appear in the dataset.
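
One way to keep the schema order in step with the column order is to generate the schema from a single ordered list. The following sketch assumes a hypothetical `build_schema` helper and uses a shortened column list; it is not the schema this notebook registers.

```python
import json

def build_schema(record_name, columns, categorical=()):
    """Build an Avro-style schema dict from (name, type) pairs, preserving their order."""
    fields = []
    for name, avro_type in columns:
        field = {"name": name, "type": avro_type}
        if name in categorical:
            # Mark columns that hold category labels rather than free text.
            field["categorical"] = True
        fields.append(field)
    return {
        "type": "record",
        "name": record_name,
        "namespace": "com.amazonaws.personalize.schema",
        "fields": fields,
        "version": "1.0",
    }

# Shortened column list for illustration only.
columns = [
    ("ITEM_ID", "string"),
    ("TITLE", "string"),
    ("GENRES", "string"),
    ("CREATION_TIMESTAMP", "long"),
]
schema = build_schema("Items", columns, categorical={"GENRES"})
schema_json = json.dumps(schema)
print(schema_json)
```

Note that Python's `True` becomes JSON `true` when the dictionary is serialized with `json.dumps`, which is the form the schema needs.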

#### Cell 15

In [None]:
items_schema_name = "marketing_items_schema"

items_schema = {
    "type": "record",
    "name": "Items",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {
            "name": "ITEM_ID",
            "type": "string"
        },
        {
            "name": "TITLE",
            "type": "string"
        },
        {
            "name": "YEAR",
            "type": "int"
        },
        {
            "name": "IMDB_RATING",
            "type": "int"
        },
        {
            "name": "IMDB_NUMBEROFVOTES",
            "type": "int"
        },
        {
            "name": "US_MATURITY_RATING_STRING",
            "type": "string"
        },
        {
            "name": "US_MATURITY_RATING",
            "type": "int"
        },
        {
            "name": "GENRES",
            "type": "string",
            "categorical": True
        },
        {
            "name": "CREATION_TIMESTAMP",
            "type": "long"
        },
        {
            "name": "PROMOTION",
            "type": "string"
        }
    ],
    "version": "1.0"
}

try:
    # Try to create the items dataset schema. This block will run fully 
    # if the items dataset schema does not exist yet
    
    create_schema_response = personalize.create_schema(
        name = items_schema_name,
        schema = json.dumps(items_schema),
        domain='VIDEO_ON_DEMAND'
    )
    marketing_items_schema_arn = create_schema_response['schemaArn']
    print(json.dumps(create_schema_response, indent=2))

    print ('\nCreating the Items Schema with marketing_items_schema_arn = {}'.format(marketing_items_schema_arn))
    
except personalize.exceptions.ResourceAlreadyExistsException:
    # If the items dataset schema already exists, get the unique identifier, marketing_items_schema_arn, 
    # from the existing resource 
    
    marketing_items_schema_arn = 'arn:aws:personalize:'+region+':'+account_id+':schema/'+items_schema_name 
    print('The schema {} already exists.'.format(marketing_items_schema_arn))
    print ('\nWe will be using the existing Items Schema with marketing_items_schema_arn = {}'.format(marketing_items_schema_arn))
 

## Create the items dataset

With a schema created, you can create a dataset within the dataset group. Note that this does not load the data yet; it creates an empty dataset whose structure is defined by the schema.

#### Cell 16

In [None]:
items_dataset_name = "marketing_items"

try:
    # Try to create the items dataset. This block will run fully if the items dataset does not exist yet
    dataset_type = "ITEMS"
    # Refer to the following code for the DIY
    create_dataset_response = personalize.create_dataset(
        name = items_dataset_name,
        datasetType = dataset_type,
        datasetGroupArn = marketing_dataset_group_arn,
        schemaArn = marketing_items_schema_arn
    )

    marketing_items_dataset_arn = create_dataset_response['datasetArn']
    print(json.dumps(create_dataset_response, indent=2))

    print ('\nCreating the Items Dataset with marketing_items_dataset_arn = {}'.format(marketing_items_dataset_arn))
    
except personalize.exceptions.ResourceAlreadyExistsException:
    # If the items dataset already exists, get the unique identifier, marketing_items_dataset_arn,
    # from the existing resource 
    
    marketing_items_dataset_arn =  'arn:aws:personalize:'+region+':'+account_id+':dataset/'+marketing_dataset_group_name+'/ITEMS'
    print('The Items Dataset {} already exists.'.format(marketing_items_dataset_arn))
    print ('\nWe will be using the existing Items Dataset with marketing_items_dataset_arn = {}'.format(marketing_items_dataset_arn))   

Now, wait for all the datasets to be created.

#### Cell 17

In [None]:
%%time
start_time = time.time()
max_time = time.time() + 6*60 # 6 Minutes
while time.time() < max_time:
    describe_dataset_response = personalize.describe_dataset(
        datasetArn = marketing_interactions_dataset_arn
    )
    status_interaction_dataset =  describe_dataset_response["dataset"]['status']
    print("Interactions Dataset: {}".format(status_interaction_dataset))
    
    if status_interaction_dataset == "ACTIVE":
        print("Build succeeded for {}".format(marketing_interactions_dataset_arn))
        
    elif status_interaction_dataset == "CREATE FAILED":
        print("Build failed for {}".format(marketing_interactions_dataset_arn))
        break
        
    if not status_interaction_dataset == "ACTIVE":
        print("The interaction dataset creation is still in progress")
    else:
        print("The interaction dataset  is ACTIVE")
        

    describe_dataset_response = personalize.describe_dataset(
        datasetArn = marketing_items_dataset_arn
    )
    status_item_dataset =  describe_dataset_response["dataset"]['status']
    print("Items Dataset: {}".format(status_item_dataset))
    
    if status_item_dataset == "ACTIVE":
        print("Build succeeded for {}".format(marketing_items_dataset_arn))
        
    elif status_item_dataset == "CREATE FAILED":
        print("Build failed for {}".format(marketing_items_dataset_arn))
        break
        
    if not status_item_dataset == "ACTIVE":
        print("The item dataset creation is still in progress")
    else:
        print("The item dataset  is ACTIVE")
        
    if status_interaction_dataset == "ACTIVE" and status_item_dataset == "ACTIVE":
        end_time = time.time()
        break
    time.sleep(15)

# Measure from now so this also works when the loop exits on failure or timeout
time_elapsed = time.time() - start_time
print(f"Time elapsed: {time_elapsed} seconds")

## Import the interactions data 

Earlier, the dataset group and dataset were created to house the information. Now, an import job is run to load the interactions data from the S3 bucket into the Amazon Personalize dataset. 
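
Before starting an import job, it helps to check whether one already exists. The sketch below shows one way to do that lookup on a list of job summaries. It is a variation, not the notebook's logic: it matches the job name exactly and skips failed jobs, and the function name and sample entries (including the placeholder ARNs) are assumptions.

```python
def find_import_job(jobs, job_name):
    """Return the ARN of a reusable import job with this exact name, or None."""
    for job in jobs:
        if job["jobName"] == job_name and job.get("status") != "CREATE FAILED":
            return job["datasetImportJobArn"]
    return None

# Hypothetical entries shaped like the summaries under 'datasetImportJobs'.
sample_jobs = [
    {"jobName": "dataset_import_interaction_old",
     "datasetImportJobArn": "arn:aws:personalize:us-east-1:123456789012:dataset-import-job/dataset_import_interaction_old",
     "status": "ACTIVE"},
    {"jobName": "dataset_import_interaction",
     "datasetImportJobArn": "arn:aws:personalize:us-east-1:123456789012:dataset-import-job/dataset_import_interaction",
     "status": "ACTIVE"},
]

existing_arn = find_import_job(sample_jobs, "dataset_import_interaction")
print(existing_arn)
```

Exact matching avoids picking up a job whose name merely contains the one you are looking for, and skipping failed jobs avoids waiting on an import that will never become active.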

#### Cell 18

In [None]:
interactions_import_job_name = "dataset_import_interaction"
# Check if the import job already exists

# List the import jobs
interactions_dataset_import_jobs = personalize.list_dataset_import_jobs(
    datasetArn=marketing_interactions_dataset_arn,
    maxResults=100
)['datasetImportJobs']

# Check if there is an existing job with the prefix
job_exists = False  
job_arn = None

for job in interactions_dataset_import_jobs:
    if (interactions_import_job_name in job['jobName']):
        job_exists = True
        job_arn = job['datasetImportJobArn']
    
if (job_exists):
    marketing_interactions_dataset_import_job_arn = job_arn
    print('The Interactions Import Job {} already exists.'.format(marketing_interactions_dataset_import_job_arn))
    print ('\nWe will be using the existing Interactions Import Job with marketing_interactions_dataset_import_job_arn = {}'.format(marketing_interactions_dataset_import_job_arn))
        
else:
    # If there is no import job with the prefix, create it   
    create_dataset_import_job_response = personalize.create_dataset_import_job(
        jobName = interactions_import_job_name,
        datasetArn = marketing_interactions_dataset_arn,
        dataSource = {
            "dataLocation": f"s3://{data_bucket_name}/interactions.csv"
        },
        roleArn = role_arn
    )
    marketing_interactions_dataset_import_job_arn = create_dataset_import_job_response['datasetImportJobArn']
    print(json.dumps(create_dataset_import_job_response, indent=2))
    
    print ('\nImporting the Interactions Data with marketing_interactions_dataset_import_job_arn = {}'.format(marketing_interactions_dataset_import_job_arn))


## Import the item metadata 

#### Cell 19

In [None]:
items_import_job_name = "dataset_import_item"

# Check if the import job already exists

# List the import jobs
items_dataset_import_jobs = personalize.list_dataset_import_jobs(
    datasetArn=marketing_items_dataset_arn,
    maxResults=100
)['datasetImportJobs']

job_exists = False
job_arn = None

print (items_dataset_import_jobs)

# Check if there is an existing job with the prefix
for job in items_dataset_import_jobs:
    if (items_import_job_name in job['jobName']):
        job_exists = True
        job_arn = job['datasetImportJobArn']
    
if (job_exists):
    marketing_items_dataset_import_job_arn =  job_arn
    print('The Items Import Job {} already exists.'.format(marketing_items_dataset_import_job_arn))
    print ('\nWe will be using the existing Items Import Job with marketing_items_dataset_import_job_arn = {}'.format(marketing_items_dataset_import_job_arn))
        
else:
    # If there is no import job with the prefix, create it    
    # Refer to the following code for the DIY
    create_dataset_import_job_response = personalize.create_dataset_import_job(
        jobName = items_import_job_name,
        datasetArn = marketing_items_dataset_arn,
        dataSource = {
            "dataLocation": f"s3://{data_bucket_name}/items.csv"
        },
        roleArn = role_arn
    )

    marketing_items_dataset_import_job_arn = create_dataset_import_job_response['datasetImportJobArn']
    print(json.dumps(create_dataset_import_job_response, indent=2))
    print ('\nImporting the Items Data with marketing_items_dataset_import_job_arn = {}'.format(marketing_items_dataset_import_job_arn))
    
    

Before you can use the dataset, the import job must be active. Run the next code cell, and then wait for it to show the ACTIVE status. It checks the status of the import jobs every 30 seconds for up to 20 minutes.

Importing the data takes about 15 minutes. 

You must wait for the data imports to be completed.
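
When several jobs must all finish, the individual statuses can be reduced to one overall result. This is a minimal sketch with a hypothetical function name; `IN_PROGRESS` is a label invented here for "keep waiting" and is not a status returned by the service.

```python
def overall_status(statuses):
    """Combine per-job statuses: any failure fails, all active succeeds, otherwise wait."""
    values = list(statuses.values())
    if not values:
        raise ValueError("No statuses to combine")
    if any(value == "CREATE FAILED" for value in values):
        return "CREATE FAILED"
    if all(value == "ACTIVE" for value in values):
        return "ACTIVE"
    return "IN_PROGRESS"

# Made-up status snapshots for illustration.
print(overall_status({"interactions": "ACTIVE", "items": "CREATE IN_PROGRESS"}))
print(overall_status({"interactions": "ACTIVE", "items": "ACTIVE"}))
print(overall_status({"interactions": "CREATE FAILED", "items": "ACTIVE"}))
```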

#### Cell 20

In [None]:
max_time = time.time() + 20*60 # 20 minutes
start_time = time.time()
while time.time() < max_time:

    # Interactions dataset import
    describe_dataset_import_job_response = personalize.describe_dataset_import_job(
        datasetImportJobArn = marketing_interactions_dataset_import_job_arn
    )
    status_interactions_import = describe_dataset_import_job_response["datasetImportJob"]['status']
    
    if status_interactions_import == "ACTIVE":
        print("Build succeeded for {}".format(marketing_interactions_dataset_import_job_arn))
        
    elif status_interactions_import == "CREATE FAILED":
        print("Build failed for {}".format(marketing_interactions_dataset_import_job_arn))
        break
        
    if not status_interactions_import == "ACTIVE":
        print("The interactions dataset import is still in progress")
    else:
        print("The interactions dataset import is ACTIVE")

    # Items dataset import
    describe_dataset_import_job_response = personalize.describe_dataset_import_job(
        datasetImportJobArn = marketing_items_dataset_import_job_arn
    )
    status_items_import = describe_dataset_import_job_response["datasetImportJob"]['status']
    
    if status_items_import == "ACTIVE":
        print("Build succeeded for {}".format(marketing_items_dataset_import_job_arn))
        
    elif status_items_import == "CREATE FAILED":
        print("Build failed for {}".format(marketing_items_dataset_import_job_arn))
        break
        
    if not status_items_import == "ACTIVE":
        print("The items dataset import is still in progress")
    else:
        print("The items dataset import is ACTIVE")

    if status_interactions_import == "ACTIVE" and status_items_import == 'ACTIVE':
        end_time = time.time()
        break

    print()
    time.sleep(30)
    
# Measure from now so this also works when the loop exits on failure or timeout
time_elapsed = time.time() - start_time
print(f"Time elapsed: {time_elapsed} seconds")

## Create a "Top picks for you" recommender

Create a preconfigured VIDEO_ON_DEMAND recommender that matches the use case. 

Each domain has different use cases. When you create a recommender, you create it for a specific use case, and each use case has different requirements for getting recommendations.

Explore the recommenders supported for the VIDEO_ON_DEMAND domain.

#### Cell 21

In [None]:
# List all VIDEO_ON_DEMAND recipes, following nextToken until no pages remain
display_available_recipes = []
list_recipes_kwargs = {'domain': 'VIDEO_ON_DEMAND'}
while True:
    available_recipes = personalize.list_recipes(**list_recipes_kwargs)
    display_available_recipes += available_recipes['recipes']
    if not available_recipes.get('nextToken'):
        break
    list_recipes_kwargs['nextToken'] = available_recipes['nextToken']
display(display_available_recipes)

Create a "Top picks for you" recommender. This type of recommender offers personalized streaming content recommendations for a user that you specify. With this use case, Amazon Personalize automatically filters videos that the user watched, based on the user ID (that you specify) and watch events.
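
Once a recommender like this is active, requesting recommendations for a user might look like the following sketch. It calls `get_recommendations` on a runtime client with `recommenderArn`, `userId`, and `numResults`, and reads item IDs from `itemList`. The `FakeRuntime` class, the ARN, the user ID, and the returned items are stand-ins invented so that the example runs without AWS access.

```python
def top_picks(runtime_client, recommender_arn, user_id, num_results=5):
    """Return the recommended item IDs for one user."""
    response = runtime_client.get_recommendations(
        recommenderArn=recommender_arn,
        userId=str(user_id),
        numResults=num_results,
    )
    return [item["itemId"] for item in response["itemList"]]

class FakeRuntime:
    """Stand-in for a personalize-runtime client; returns made-up item IDs."""
    def get_recommendations(self, recommenderArn, userId, numResults):
        made_up_items = ["tt0000001", "tt0000002", "tt0000003"]
        return {"itemList": [{"itemId": item_id} for item_id in made_up_items[:numResults]]}

# Placeholder ARN and user ID for illustration only.
example_arn = "arn:aws:personalize:us-east-1:123456789012:recommender/marketing_top_picks_for_you"
picks = top_picks(FakeRuntime(), example_arn, user_id=42, num_results=2)
print(picks)
```

Passing the client in as a parameter keeps the function easy to test; in this notebook the real client would be `personalize_runtime`.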

#### Cell 22

In [None]:
recommender_top_picks_for_you_name = "marketing_top_picks_for_you"

try:
    create_recommender_response = personalize.create_recommender(
        name = recommender_top_picks_for_you_name,
        recipeArn = 'arn:aws:personalize:::recipe/aws-vod-top-picks',
        datasetGroupArn = marketing_dataset_group_arn,
        recommenderConfig = {"enableMetadataWithRecommendations": True}
    )
    marketing_recommender_top_picks_arn = create_recommender_response["recommenderArn"]
    
    print (json.dumps(create_recommender_response))
    print ('\nCreating the Top Picks For You recommender with marketing_recommender_top_picks_arn = {}'.format(marketing_recommender_top_picks_arn))
    
except personalize.exceptions.ResourceAlreadyExistsException as e:
    marketing_recommender_top_picks_arn =  'arn:aws:personalize:'+region+':'+account_id+':recommender/'+recommender_top_picks_for_you_name
    print('The Top Picks For You recommender {} already exists.'.format(marketing_recommender_top_picks_arn))
    print ('\nWe will be using the existing Top Picks For You recommender with marketing_recommender_top_picks_arn = {}'.format(marketing_recommender_top_picks_arn))
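
The `except` branch above rebuilds the recommender ARN by string concatenation, and the same pattern appears for dataset groups and datasets elsewhere in this notebook. As a minimal sketch, the pattern could be wrapped in a small helper. Note that `build_personalize_arn` is a hypothetical name (it is not part of Boto3), and the Region and account ID below are placeholder values.

```python
def build_personalize_arn(region, account_id, resource_type, resource_name):
    """Build an Amazon Personalize ARN from its parts (hypothetical helper)."""
    return f"arn:aws:personalize:{region}:{account_id}:{resource_type}/{resource_name}"


# Placeholder Region and account ID, for illustration only
example_arn = build_personalize_arn(
    "us-east-1", "123456789012", "recommender", "marketing_top_picks_for_you"
)
print(example_arn)
```

In the notebook, you would pass the existing `region` and `account_id` variables instead of the placeholders.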
    
    

### View the recommender creation status

Set up a loop to check the creation status of the recommender. Training the recommender can take more than 60 minutes.

#### Cell 23

In [None]:
max_time = time.time() + 10*60*60 # 10 hours
start_time = time.time()
while time.time() < max_time:

    # Recommender top_picks_for_you
    version_response = personalize.describe_recommender(
        recommenderArn = marketing_recommender_top_picks_arn
    )
    status_top_picks = version_response["recommender"]["status"]

    if status_top_picks == "ACTIVE":
        print("Build succeeded for {}".format(marketing_recommender_top_picks_arn))
    elif status_top_picks == "CREATE FAILED":
        print("Build failed for {}".format(marketing_recommender_top_picks_arn))
        break

    if not status_top_picks == "ACTIVE":
        print("The Top Picks for You recommender build is still in progress")
    else:
        print("The Top Picks for You recommender is ACTIVE")

    if status_top_picks == 'ACTIVE':
        end_time = time.time()
        break
    print()
    time.sleep(60)
    
# Measure elapsed time here so that a failed or timed-out build does not raise a NameError
time_elapsed = time.time() - start_time
print(f"Time elapsed: {time_elapsed} seconds")
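
The status loops in this notebook all follow the same shape: describe the resource, check for a terminal status, sleep, and repeat. The following is a minimal sketch of how that shape could be factored into one function. The name `wait_for_status` and its parameters are assumptions made for illustration, and a stand-in status source replaces the real `describe_recommender` call so that the sketch runs without AWS access.

```python
import time


def wait_for_status(get_status, poll_seconds=60, timeout_seconds=10 * 60 * 60,
                    done_states=("ACTIVE", "CREATE FAILED")):
    """Poll get_status() until it returns a terminal state or the timeout expires.

    get_status is any zero-argument callable that returns a status string.
    Returns (last_status, seconds_elapsed).
    """
    start = time.time()
    status = get_status()
    while status not in done_states and time.time() - start < timeout_seconds:
        time.sleep(poll_seconds)
        status = get_status()
    return status, time.time() - start


# Stand-in for describe_recommender so the sketch runs without AWS access
fake_statuses = iter(["CREATE PENDING", "CREATE IN_PROGRESS", "ACTIVE"])
final_status, elapsed = wait_for_status(lambda: next(fake_statuses), poll_seconds=0)
print(final_status)
```

With the real client, `get_status` could be a lambda that returns `personalize.describe_recommender(recommenderArn=marketing_recommender_top_picks_arn)["recommender"]["status"]`.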

## Get personalized recommendations from Amazon Personalize

Select a random user to see their recommendations.

#### Cell 24

In [None]:
user_id = random.sample(list(user_ids), 1)[0]
user_id

Get 15 recommendations from the "Top picks for you" recommender that you trained.

#### Cell 25

In [None]:
get_recommendations_response = personalize_runtime.get_recommendations(
    recommenderArn = marketing_recommender_top_picks_arn,
    userId = str(user_id),
    numResults = 15,
    metadataColumns = {
        "ITEMS": ['TITLE', 'GENRES']
    }
)

print (get_recommendations_response['itemList'])

Getting recommendations works!

To get recommended movies and their metadata for each user, a more user-friendly access method can be created.

#### Cell 26

In [None]:
def getRecommendedMoviesForUserId(
    user_id, 
    marketing_recommender_top_picks_arn, 
    item_data, 
    number_of_movies_to_recommend = 5):
    # For a user_id, get the top n (number_of_movies_to_recommend) movies by using Amazon Personalize.
    # The title and genres come from the metadata that Amazon Personalize returns with each item;
    # item_data is accepted as a parameter but is not used in this function.
    # Return a list of movie dictionaries (movie_list) with the relevant data

    # Get recommended movies
    get_recommendations_response = personalize_runtime.get_recommendations(
        recommenderArn = marketing_recommender_top_picks_arn,
        userId = str(user_id),
        numResults = number_of_movies_to_recommend,
        metadataColumns = {
            "ITEMS": ['TITLE', 'GENRES']
        }
    )

    # Create a list of movies with title, genres 
    movie_list = []
    
    for recommended_movie in get_recommendations_response['itemList']:      
        movie_list.append(
            {
                'title' : recommended_movie['metadata']['title'],
                'genres' : recommended_movie['metadata']['genres'].replace('|', ' and ')
            }
        )
    return movie_list
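
To see the formatting step in isolation, the sketch below applies the same title and genre handling to a made-up `itemList`. The helper name `format_recommended_movies`, the sample data, and the fallback values for missing metadata are assumptions for illustration; only the `metadata`, `title`, and `genres` keys are taken from the function above.

```python
def format_recommended_movies(item_list):
    """Turn a Personalize-style itemList into title/genres dictionaries."""
    movies = []
    for item in item_list:
        # Fall back to defaults if an item arrives without metadata
        metadata = item.get("metadata", {})
        movies.append({
            "title": metadata.get("title", "Unknown title"),
            "genres": metadata.get("genres", "").replace("|", " and "),
        })
    return movies


# Made-up sample shaped like the itemList used in this notebook
sample_item_list = [
    {"itemId": "1", "metadata": {"title": "Example Movie A", "genres": "Comedy|Drama"}},
    {"itemId": "2", "metadata": {"title": "Example Movie B", "genres": "Action"}},
]
formatted = format_recommended_movies(sample_item_list)
print(formatted)
```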
    

A random user is selected next, and three movies are recommended for that user.  

Because a different user is selected each time, the recommendations differ each time this code cell runs.

#### Cell 27

In [None]:
user_id = random.sample(list(user_ids), 1)[0]
number_of_movies_to_recommend = 3 

movie_list = getRecommendedMoviesForUserId(user_id, marketing_recommender_top_picks_arn, item_data, number_of_movies_to_recommend)

# Print each movie in the list
for movie in movie_list:
    print ('Title: '+movie['title'])
    print ('Genres: '+movie['genres'])
    print ()

#### Cell 28

In [None]:
context = "\n".join([f"{movie['title']} ({movie['genres']})" for movie in movie_list])
print(context)

## Get the user's favorite movie genre

To personalize the marketing communication further, in this section you calculate a user's favorite movie genre based on the genres of all the movies that they interacted with in the past.

#### Cell 29

In [None]:
def getUserFavouriteGenres(user_id, interactions_df, movie_data):
    # For a user_id, get the user's favorite genre by looking at the user's interactions 
    # with each movie in the past and counting the genres to find the most common genre 

    # Get all movies the user has watched     
    movies_df = interactions_df[interactions_df['USER_ID'] == user_id]

    genres = {}

    for movie_id in movies_df['ITEM_ID']:

        movie_genres = movie_data[movie_data['ITEM_ID']==movie_id]['GENRES']
        
        if not len(movie_genres.tolist())==0:
            for movie_genre in movie_genres.tolist()[0].split('|'):
                if movie_genre in genres:
                    genres[movie_genre] +=1
                else:
                    genres[movie_genre] = 1

    genres_df = pd.DataFrame(list(genres.items()), columns =['GENRE', 'COUNT'])
    
    # Sort by most common
    genres_df.sort_values(by=['COUNT'], inplace=True, ascending = False)
    
    # Return the most common (favorite) genre       
    return genres_df.iloc[[0]]['GENRE'].values[0]
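
The counting logic in the function above can also be expressed with `collections.Counter` from the standard library. The sketch below is an alternative under stated assumptions, not the notebook's method: `most_common_genre` is a hypothetical helper that takes the pipe-delimited genre strings directly (for example, the `GENRES` values of the movies a user interacted with), and the sample strings are made up.

```python
from collections import Counter


def most_common_genre(genre_strings):
    """Return the most frequent genre across pipe-delimited genre strings."""
    counts = Counter()
    for genre_string in genre_strings:
        counts.update(genre_string.split("|"))
    if not counts:
        return None  # the user has no interactions with known movies
    return counts.most_common(1)[0][0]


# Made-up genre strings for the movies one user interacted with
watched_genres = ["Comedy|Drama", "Drama|Romance", "Drama", "Comedy"]
favorite = most_common_genre(watched_genres)
print(favorite)
```

Unlike the DataFrame version, this sketch returns `None` instead of raising an error when the user has no matching interactions.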

#### Cell 30

In [None]:
user_favorite_genre = getUserFavouriteGenres(user_id, interactions_df, item_data)
user_favorite_genre

## Using Amazon Bedrock

Now that you have personalized recommendations for each user, communications can be sent out.  Rather than sending out a generic form email to each user, Amazon Bedrock can help you create custom emails for each user.

#### Cell 31

In [None]:
# Set up a Boto3 client to access the functions within Amazon Bedrock
bedrock = boto3.client('bedrock-runtime') 

Set up some parameters needed to access the selected Amazon Bedrock model. In this practice example, you use the Amazon Titan Text G1 - Lite foundation model (FM).

Note: At the time of this writing, FM access is not turned on by default. The practice example will fail if you did not follow the earlier step, which requested access to this Amazon Titan model.

#### Cell 32

In [None]:
# Model parameters
# The LLM you will be using
model_id = 'amazon.titan-text-lite-v1'

# The desired MIME type of the inference body in the response
accept = 'application/json'

# The MIME type of the input data in the request
content_type = 'application/json'

# The maximum number of tokens to use in the generated response
max_tokens_to_sample = 1000

## Add user demographic information

Generate emails by assuming two different demographics for the users.

#### Cell 33

In [None]:
# Sample user demographics
user_demographic_1 = f'The user is a 50 year old adult called Otto.'
user_demographic_3 = f'The user is a young adult called Jane.'

## Generate personalized marketing emails

Generating a marketing email requires a prompt that tells the FM what you want it to do.

#### Cell 34

In [None]:
def generate_personalized_prompt(user_demographic, favorite_genre, movie_list, model_id, max_tokens_to_sample = 50):

    prompt_template = f'''You are a skilled publicist. Write a high-converting marketing email advertising several movies available in a video-on-demand streaming platform next week, 
    given the movie and user information below. Your email will leverage the power of storytelling and persuasive language. 
    You want the email to impress the user, so make it appealing to them based on the information contained in the <user> tags, 
    and take into account the user's favorite genre in the <genre> tags. 
    The movies to recommend and their information is contained in the <movie> tag. 
    All movies in the <movie> tag must be recommended. Give a summary of the movies and why the human should watch them. 
    Put the email between <email> tags.
    Sign it from "Cloud island movies".
    
    <user>
    {user_demographic}
    </user>

    <genre>
    {favorite_genre}
    </genre>

    <movie>
    {movie_list}
    </movie>

    '''

    prompt_input = json.dumps({
        "inputText":prompt_template,
        "textGenerationConfig": {
            # Use the caller's token limit instead of a hard-coded value
            "maxTokenCount": max_tokens_to_sample,
            "stopSequences": [],
            "temperature": 0.7,
            "topP": 0.9
        }
    })
      
    return prompt_input
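
The function above builds the prompt text and the request body in one step. As a minimal sketch, the request body could be built by a separate helper so that the generation settings are easy to vary between experiments. The name `build_titan_request_body` and its default values are assumptions for illustration; the field names are the ones used in the cell above.

```python
import json


def build_titan_request_body(prompt, max_token_count=1000, temperature=0.7, top_p=0.9):
    """Serialize a prompt and generation settings into a JSON request body."""
    return json.dumps({
        "inputText": prompt,
        "textGenerationConfig": {
            "maxTokenCount": max_token_count,
            "stopSequences": [],
            "temperature": temperature,
            "topP": top_p,
        },
    })


body = build_titan_request_body("Write a short greeting.", max_token_count=50)
print(json.loads(body)["textGenerationConfig"])
```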

Start with demographic 1, which is a 50-year-old user.

#### Cell 35

In [None]:
print ('User\'s demographic')
user_demographic = user_demographic_1
user_demographic

Next, use the prompt info from the previous code cell to create a prompt input that is JSON formatted.

#### Cell 36

In [None]:
# Create prompt input
prompt_input_json = generate_personalized_prompt(user_demographic, user_favorite_genre, movie_list, model_id, max_tokens_to_sample )
prompt_input_json

Now, put it all together and generate that personalized marketing email!

#### Cell 37

In [None]:
response = bedrock.invoke_model(
    body= prompt_input_json,
    modelId=model_id,
    accept=accept,    
    contentType=content_type
    )

response_body = json.loads(response.get('body').read())
model_output_string = response_body['results'][0]['outputText']
# model_output_str_clean = re.sub(r'<[^>]*>', '', model_output_string)

print(model_output_string)
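
The prompt asks the model to put the email between `<email>` tags, so the email body can be separated from any surrounding text. The following is a minimal sketch that uses a regular expression; `extract_email` is a hypothetical helper, the sample output is made up, and real model output might omit the tags, which is why the sketch falls back to returning the whole string.

```python
import re


def extract_email(model_output):
    """Return the text between <email> tags, or the whole output if none are found."""
    match = re.search(r"<email>(.*?)</email>", model_output, flags=re.DOTALL)
    if match:
        return match.group(1).strip()
    return model_output.strip()


# Made-up model output for illustration
sample_output = "Sure!\n<email>\nHi Otto,\nEnjoy these movies.\n</email>"
email_text = extract_email(sample_output)
print(email_text)
```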

# ------------ STOP --------------

## Use the following cells for the DIY section of the solution

The first step is to create a DataFrame that has only the movies that are rated G.

#### Cell DIY 1

In [None]:
item_data_rated_g = item_data[item_data['US_MATURITY_RATING_STRING'] == 'G']

item_data_rated_g.head(5)

Write the DataFrame out to a CSV file that can be uploaded to Amazon S3. Remember that Amazon Personalize must read the data from Amazon S3 to create the datasets.

#### Cell DIY 2

In [None]:
item_data_rated_g.to_csv('items_rated_g.csv', index=False)

The file created in the previous code cell must now be written to Amazon S3.  Refer to the earlier Cell 10 for the needed Boto3 command, and ensure that the new variable names are substituted for what was used before.

#### Cell DIY 3

In [None]:
items_rated_g_filename = "items_rated_g.csv"
items_rated_g_file = items_rated_g_filename

'''
For code copied from Cell 10, ensure that item_filename and 
item_file are both replaced with the new variable names above.

Also be aware of code indentation to avoid errors. 
'''

# Insert your code here

Now that the file is in Amazon S3, create the new dataset group and dataset within Amazon Personalize. Start by creating the dataset group.  Refer to the earlier Cell 11 as needed, and ensure that the new group name is used.

#### Cell DIY 4

In [None]:
marketing_dataset_rated_g_group_name = "marketing_email_dataset_rated_g"
'''
For code copied from Cell 11, ensure that marketing_dataset_group_name 
is replaced with the new variable name above.

Also be aware of code indentation to avoid errors. 
'''




try:     
    # Put your code here

    marketing_dataset_rated_g_group_arn = create_dataset_group_response['datasetGroupArn']
    print(json.dumps(create_dataset_group_response, indent=2))
    print ('\nCreating the Dataset Group with dataset_group_arn = {}'.format(marketing_dataset_rated_g_group_arn))

except personalize.exceptions.ResourceAlreadyExistsException as e:
    # If the dataset group already exists, get the unique identifier for marketing_dataset_rated_g_group_arn 
    # from the existing resource
    
    marketing_dataset_rated_g_group_arn = 'arn:aws:personalize:'+region+':'+account_id+':dataset-group/'+marketing_dataset_rated_g_group_name 
    print ('\nThe Dataset Group with dataset_group_arn = {} already exists'.format(marketing_dataset_rated_g_group_arn))
    print ('\nWe will be using the existing Dataset Group marketing_dataset_rated_g_group_arn = {}'.format(marketing_dataset_rated_g_group_arn))

Be sure that the dataset group is active.

#### Cell DIY 5

In [None]:
max_time = time.time() + 3*60 # 3 minutes
while time.time() < max_time:
    describe_dataset_group_response = personalize.describe_dataset_group(
        datasetGroupArn = marketing_dataset_rated_g_group_arn
    )
    status = describe_dataset_group_response["datasetGroup"]["status"]
    print("DatasetGroup: {}".format(status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(15)

Now, create the items dataset within the dataset group.  Refer to the earlier Cell 16 as needed.  Note that because the schema is the same, it can be reused rather than recreated.

#### Cell DIY 6

In [None]:
items_rated_g_dataset_name = "marketing_items_rated_g"
'''
For code copied from Cell 16, ensure that items_dataset_name 
is replaced with the new variable name above. Also be sure to replace
marketing_dataset_group_arn with marketing_dataset_rated_g_group_arn 
as used in DIY cells 4 & 5.

Also be aware of code indentation to avoid errors. 
'''



try:
    dataset_type = "ITEMS"
    # Enter your code here
    

    marketing_items_rated_g_dataset_arn = create_dataset_response['datasetArn']
    print(json.dumps(create_dataset_response, indent=2))

    print ('\nCreating the Items Dataset with marketing_items_rated_g_dataset_arn = {}'.format(marketing_items_rated_g_dataset_arn))
    
except personalize.exceptions.ResourceAlreadyExistsException:
    # If the items dataset already exists, get the unique identifier, marketing_items_rated_g_dataset_arn, 
    # from the existing resource 
    
    marketing_items_rated_g_dataset_arn =  'arn:aws:personalize:'+region+':'+account_id+':dataset/'+marketing_dataset_rated_g_group_name+'/ITEMS'
    print('The Items Dataset {} already exists.'.format(marketing_items_rated_g_dataset_arn))
    print ('\nWe will be using the existing Items Dataset with marketing_items_rated_g_dataset_arn = {}'.format(marketing_items_rated_g_dataset_arn))   

Be sure that the dataset is fully created.

#### Cell DIY 7

In [None]:
%%time
start_time = time.time()
max_time = time.time() + 6*60 # 6 Minutes
while time.time() < max_time:
    describe_dataset_response = personalize.describe_dataset(
        datasetArn = marketing_items_rated_g_dataset_arn
    )
    status_item_dataset =  describe_dataset_response["dataset"]['status']
    print("Items Dataset: {}".format(status_item_dataset))
    
    if status_item_dataset == "ACTIVE":
        print("Build succeeded for {}".format(marketing_items_rated_g_dataset_arn))
        
    elif status_item_dataset == "CREATE FAILED":
        print("Build failed for {}".format(marketing_items_rated_g_dataset_arn))
        break
        
    if not status_item_dataset == "ACTIVE":
        print("The item dataset creation is still in progress")
    else:
        print("The item dataset is ACTIVE")
        
    if status_item_dataset == "ACTIVE":
        end_time = time.time()
        break
    time.sleep(15)

# Measure elapsed time here so that a failed or timed-out build does not raise a NameError
time_elapsed = time.time() - start_time
print(f"Time elapsed: {time_elapsed} seconds")