# Spotlight Implicit Sequence

A model for recommending items given a sequence of previous user item interactions.

[Spotlight Documentation](https://maciejkula.github.io/spotlight/sequence/implicit.html)

## Table of contents

* [Sample files](#sample-files)
* [Step 1 - Prepare training data](#prepare-training-data)
 * [Download movielens 100k dataset](#download-movielens)
 * [Import ratings data](#import-ratings-data)
 * [Create training data file](#create-training-data-file)
 * [Upload training data file](#upload-training-data)
* [Step 2 - Create a model](#create-model)
 * [Run a SageMaker training job](#run-training-job)
 * [Create a SageMaker model](#create-sagemaker-model)
* [Step 3 - Get recommendations (inference)](#get-recommendations)
 * [Import movie titles](#import-movie-titles)
 * [Example users](#example-users)
 * [Create sequences for prediction](#create-sequences)
 * [Create batch transform input file](#create-batch-input)
 * [Upload the batch transform input file to s3](#upload-batch-input)
 * [Run the Batch Transform Job](#run-transform)
 * [Download the batch results](#download-batch-results)
 * [Recommendations with scores](#recommendations)
 * [User history](#user-history)
* [Step 4 - Optional Cleanup](#cleanup)

## Sample files <a id="sample-files"></a>

These links are to example files on github.

* [training input file](https://github.com/outpace/sagemaker-examples/blob/master/sequence_train_data/ml-100k-gt2.csv)
* [batch transform input file](https://github.com/outpace/sagemaker-examples/blob/master/sequence_batch_input/recommendation.requests)
* [batch transform output file](https://github.com/outpace/sagemaker-examples/blob/master/spotlight_sequence/recommendation.requests.out)

## Step 1 - Prepare training data <a id="prepare-training-data"></a>
### Download movielens 100k dataset <a id="download-movielens"></a>

In [1]:
!wget --no-clobber http://files.grouplens.org/datasets/movielens/ml-100k.zip
!unzip -o ml-100k.zip

--2018-11-02 15:31:55--  http://files.grouplens.org/datasets/movielens/ml-100k.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.34.235
Connecting to files.grouplens.org (files.grouplens.org)|128.101.34.235|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4924029 (4.7M) [application/zip]
Saving to: ‘ml-100k.zip’


2018-11-02 15:31:55 (18.5 MB/s) - ‘ml-100k.zip’ saved [4924029/4924029]

Archive:  ml-100k.zip
   creating: ml-100k/
  inflating: ml-100k/allbut.pl       
  inflating: ml-100k/mku.sh          
  inflating: ml-100k/README          
  inflating: ml-100k/u.data          
  inflating: ml-100k/u.genre         
  inflating: ml-100k/u.info          
  inflating: ml-100k/u.item          
  inflating: ml-100k/u.occupation    
  inflating: ml-100k/u.user          
  inflating: ml-100k/u1.base         
  inflating: ml-100k/u1.test         
  inflating: ml-100k/u2.base         
  inflating: ml-100k/u2.test         
  inflating: ml-100k/u3.base    

### Import ratings data <a id="import-ratings-data"></a>

Keep only ratings strictly higher than 2 to make this an implicit dataset.

In [2]:
import pandas as pd

sequence_df = pd.read_csv('ml-100k/u.data', sep="\t", header=None, names=["user_id", "item_id", "rating", "timestamp"])
sequence_df = sequence_df[sequence_df.rating>2].drop(["rating"], axis=1)
sequence_df.head()

Unnamed: 0,user_id,item_id,timestamp
0,196,242,881250949
1,186,302,891717742
5,298,474,884182806
7,253,465,891628467
8,305,451,886324817


### Create training data file <a id="create-training-data-file"></a>

Create a csv file from the dataframe above. Do not include the index, but include headers `user_id`, `item_id`, and `timestamp`. Show the head of the file.

In [3]:
train_data_dir = 'sequence_train_data'
train_data_file = '{}/ml-100k-gt2.csv'.format(train_data_dir)

!mkdir -p {train_data_dir}
sequence_df.to_csv(train_data_file, index=False)

!head {train_data_file}

user_id,item_id,timestamp
196,242,881250949
186,302,891717742
298,474,884182806
253,465,891628467
305,451,886324817
6,86,883603013
286,1014,879781125
200,222,876042340
210,40,891035994


### Upload training data to s3 <a id="upload-training-data"></a>

Choose a bucket, optionally customize the prefix, and upload the csv created above.

In [4]:
import sagemaker as sage

bucket = "sagemaker-validation-us-east-2"
prefix = "spotlight-implicit-sequence-test"

sess = sage.Session()

s3_train = sess.upload_data(train_data_dir, bucket, "{}/training".format(prefix))
"uploaded training data file to {}".format(s3_train)

'uploaded training data file to s3://sagemaker-validation-us-east-2/spotlight-implicit-sequence-test/training'

## Step 2 - Create a model <a id="create-model"></a>

### Run a SageMaker training job <a id="run-training-job"></a>

This code will start a training job, wait for it to be done, and report its status.

In [5]:
%%time

import boto3
import time
from sagemaker import get_execution_role

role = get_execution_role()
ecr_image = "435525115971.dkr.ecr.us-east-2.amazonaws.com/sagemaker/spotlight-sequence:6"
job_name_prefix = 'spotlight-implicit-sequence-test'
timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
job_name = job_name_prefix + timestamp
create_training_params = \
{
    "AlgorithmSpecification": {
        "TrainingImage": ecr_image,
        "TrainingInputMode": "File"
    },
    "RoleArn": role,
    "OutputDataConfig": {
        "S3OutputPath": 's3://{}/{}/output'.format(bucket, job_name_prefix)
    },
    "ResourceConfig": {
        "InstanceCount": 1,
        "InstanceType": "ml.p3.2xlarge",
        "VolumeSizeInGB": 50
    },
    "TrainingJobName": job_name,
    "StoppingCondition": {
        "MaxRuntimeInSeconds": 360000
    },
    "InputDataConfig": [
        {
            "ChannelName": "training",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": s3_train,
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
            "ContentType": "application/csv",
            "CompressionType": "None"
        }
    ]
}

sagemaker = boto3.client(service_name='sagemaker')
sagemaker.create_training_job(**create_training_params)
status = sagemaker.describe_training_job(TrainingJobName=job_name)['TrainingJobStatus']
print('Training job current status: {}'.format(status))

try:
    sagemaker.get_waiter('training_job_completed_or_stopped').wait(TrainingJobName=job_name)
    job_info = sagemaker.describe_training_job(TrainingJobName=job_name)
    status = job_info['TrainingJobStatus']
    print("Training job ended with status: " + status)
except:
    print('Training failed to start')
    message = sagemaker.describe_training_job(TrainingJobName=job_name)['FailureReason']
    print('Training failed with the following error: {}'.format(message))

Training job current status: InProgress
Training job ended with status: Completed
CPU times: user 148 ms, sys: 13.6 ms, total: 161 ms
Wall time: 4min


### Create a SageMaker model <a id="create-sagemaker-model"></a>

This will set up the model created during training within SageMaker to be used later for recommendations.

In [6]:
timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
model_name="spotlight-implicit-sequence-test" + timestamp
job_info = sagemaker.describe_training_job(TrainingJobName=job_name)
model_data = job_info['ModelArtifacts']['S3ModelArtifacts']

primary_container = {
    'Image': ecr_image,
    'ModelDataUrl': model_data,
}

create_model_response = sagemaker.create_model(
    ModelName = model_name,
    ExecutionRoleArn = role,
    PrimaryContainer = primary_container)

create_model_response

{'ModelArn': 'arn:aws:sagemaker:us-east-2:435525115971:model/spotlight-implicit-sequence-test-2018-11-02-15-35-58',
 'ResponseMetadata': {'RequestId': '26e7d995-647c-4701-ae14-b9fefa5570d6',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '26e7d995-647c-4701-ae14-b9fefa5570d6',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '114',
   'date': 'Fri, 02 Nov 2018 15:35:58 GMT'},
  'RetryAttempts': 0}}

## Step 3 - Get recommendations (Inference) <a id="get-recommendations"></a>

### Import movie titles <a id="import-movie-titles"></a>

Get movie titles in `u.item` from the movielens files downloaded earlier and join with interactions data.

In [7]:
titles_df = pd.read_csv('ml-100k/u.item', 
                        sep="|", 
                        header=None, 
                        encoding = "ISO-8859-1"
                       ).iloc[:,0:2].rename(index=str, columns={1:"movie_title"}).set_index([0])
sequence_df = sequence_df.join(titles_df, on='item_id').rename(index=str, columns={"user_id":"user_id",1:"movie_title"})
sequence_df.head()

Unnamed: 0,user_id,item_id,timestamp,movie_title
0,196,242,881250949,Kolya (1996)
1,186,302,891717742,L.A. Confidential (1997)
5,298,474,884182806,Dr. Strangelove or: How I Learned to Stop Worr...
7,253,465,891628467,"Jungle Book, The (1994)"
8,305,451,886324817,Grease (1978)


### Example users <a id="example-users"></a>

Find some example users to get their sequence of interactions in order to predict their next rating/watch.

In [8]:
example_users = sequence_df[sequence_df.user_id.isin([685, 302])].sort_values(by=['timestamp'], ascending=False)
example_users

Unnamed: 0,user_id,item_id,timestamp,movie_title
42723,685,269,879451401,"Full Monty, The (1997)"
43618,685,302,879451401,L.A. Confidential (1997)
53414,685,325,879451401,Crash (1996)
70388,685,324,879451401,Lost Highway (1997)
89437,685,882,879451401,Washington Square (1997)
98163,685,875,879451401,She's So Lovely (1997)
55280,302,358,879436981,Spawn (1997)
81824,302,271,879436911,Starship Troopers (1997)
58767,302,289,879436874,Evita (1996)
4826,302,328,879436844,Conspiracy Theory (1997)


### Create sequences for prediction <a id="create-sequences"></a>

Get each user's sequence of item_ids they interacted with.

In [9]:
user_685_sequence = example_users.sort_values(by=['timestamp'], ascending=False).item_id.values
user_685_sequence

array([269, 302, 325, 324, 882, 875, 358, 271, 289, 328, 301, 333, 307,
       258])

In [10]:
user_302_sequence = example_users.sort_values(by=['timestamp'], ascending=False).item_id.values
user_302_sequence

array([269, 302, 325, 324, 882, 875, 358, 271, 289, 328, 301, 333, 307,
       258])

### Create batch transform input file <a id="create-batch-input"></a>

Each row is a json object containing three keys:

* `sequence_id`: the id of the sequence - used to correlate results
* `sequence`: the item ids a user has interacted with in order
* `top_n`: the number of top scoring recommendations to return

The head of the batch input file is shown.

In [11]:
import json

batch_input_dir = 'sequence_batch_input'
batch_input_file = batch_input_dir + '/recommendation.requests'

!mkdir -p {batch_input_dir}

with open(batch_input_file, 'w') as outfile:
    json.dump({"sequence_id": "user_302", "sequence": [str(i) for i in user_302_sequence], "top_n": "5"}, outfile)
    outfile.write("\n")
    json.dump({"sequence_id": "user_685", "sequence": [str(i) for i in user_685_sequence], "top_n": "5"}, outfile)
   
!head {batch_input_file}

{"sequence_id": "user_302", "sequence": ["269", "302", "325", "324", "882", "875", "358", "271", "289", "328", "301", "333", "307", "258"], "top_n": "5"}
{"sequence_id": "user_685", "sequence": ["269", "302", "325", "324", "882", "875", "358", "271", "289", "328", "301", "333", "307", "258"], "top_n": "5"}

### Upload the batch transform input file to s3 <a id="upload-batch-input"></a>

In [12]:
batch_input = sess.upload_data(batch_input_dir, bucket, "{}/batch_input".format(prefix))
"uploaded training data file to {}".format(batch_input)

'uploaded training data file to s3://sagemaker-validation-us-east-2/spotlight-implicit-sequence-test/batch_input'

### Run the Batch Transform Job <a id="run-transform"></a>

This code will start a batch transform job, wait for it to be done, and report its status.

In [13]:
%%time

timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
batch_job_name = "spotlight-implicit-sequence-test" + timestamp
batch_output = 's3://{}/{}/output'.format(bucket, batch_job_name)
request = \
{
  "TransformJobName": batch_job_name,
  "ModelName": model_name,
  "BatchStrategy": "SingleRecord",
  "TransformInput": {
    "DataSource": {
      "S3DataSource": {
        "S3DataType": "S3Prefix",
        "S3Uri": batch_input
      }
    },
    "ContentType": "application/json",
    "CompressionType": "None",
    "SplitType": "Line"
  },
  "TransformOutput": {
    "S3OutputPath": batch_output,
    "Accept": "text/csv",
    "AssembleWith": "Line"
  },
  "TransformResources": {
    "InstanceType": "ml.p3.2xlarge",
    "InstanceCount": 1
  }
}

sagemaker.create_transform_job(**request)

print("Created Transform job with name: ", batch_job_name)

while(True):
    job_info = sagemaker.describe_transform_job(TransformJobName=batch_job_name)
    status = job_info['TransformJobStatus']
    if status == 'Completed':
        print("Transform job ended with status: " + status)
        break
    if status == 'Failed':
        message = job_info['FailureReason']
        print('Transform failed with the following error: {}'.format(message))
        raise Exception('Transform job failed') 
    time.sleep(30)

Created Transform job with name:  spotlight-implicit-sequence-test-2018-11-02-15-35-58
Transform job ended with status: Completed
CPU times: user 103 ms, sys: 3.93 ms, total: 107 ms
Wall time: 4min 1s


### Download the batch results <a id="download-batch-results"></a>

Show the head of the file.

In [14]:
!aws s3 cp {batch_output + '/recommendation.requests.out'} spotlight_sequence/

!head spotlight_sequence/recommendation.requests.out

download: s3://sagemaker-validation-us-east-2/spotlight-implicit-sequence-test-2018-11-02-15-35-58/output/recommendation.requests.out to spotlight_sequence/recommendation.requests.out
user_302,313,33.03297424316406
user_302,302,32.48430633544922
user_302,268,31.736766815185547
user_302,286,31.716033935546875
user_302,300,31.430864334106445
user_685,313,33.03297424316406
user_685,302,32.48430633544922
user_685,268,31.736766815185547
user_685,286,31.716033935546875
user_685,300,31.430864334106445


### Recommendations with scores <a id="recommendations"></a>

Import the recommendations from the batch output file downloaded above and join with titles dataframe. These are the top 5 movie recommendations for users 685 and 302.

In [15]:
recommendations_df = pd.read_csv('recommendation.requests.out', 
                                 header=None, 
                                 names=["sequence_id", "item_id", "score"])
recommendations_df = recommendations_df.join(titles_df, 
                                             on='item_id')
recommendations_df

Unnamed: 0,sequence_id,item_id,score,movie_title
0,685,333,14.640368,"Game, The (1997)"
1,685,272,13.269605,Good Will Hunting (1997)
2,685,268,13.025604,Chasing Amy (1997)
3,685,347,12.637764,Wag the Dog (1997)
4,685,315,12.535725,Apt Pupil (1998)
5,302,748,12.058822,"Saint, The (1997)"
6,302,333,11.834901,"Game, The (1997)"
7,302,323,11.719286,Dante's Peak (1997)
8,302,258,10.959321,Contact (1997)
9,302,313,10.367278,Titanic (1997)


### User history <a id="user-history"></a>

Show the example users interaction history again for convenience.

In [16]:
example_users

Unnamed: 0,user_id,item_id,timestamp,movie_title
42723,685,269,879451401,"Full Monty, The (1997)"
43618,685,302,879451401,L.A. Confidential (1997)
53414,685,325,879451401,Crash (1996)
70388,685,324,879451401,Lost Highway (1997)
89437,685,882,879451401,Washington Square (1997)
98163,685,875,879451401,She's So Lovely (1997)
55280,302,358,879436981,Spawn (1997)
81824,302,271,879436911,Starship Troopers (1997)
58767,302,289,879436874,Evita (1996)
4826,302,328,879436844,Conspiracy Theory (1997)


## Step 4 - Optional Clean up <a id="cleanup"></a>

In [17]:
# optionally uncomment and run the code to clean everything up

#!rm ml-100k.zip 2> /dev/null
#!rm -fr spotlight_sequence/ 2> /dev/null
#!rm -fr ml-100k/ 2> /dev/null
#!rm -fr {train_data_dir} 2> /dev/null
#!rm -fr {batch_input_dir} 2> /dev/null
#sagemaker.delete_model(ModelName= model_name)