# Spotlight Implicit Factorization

## Step 1 - Prepare training data
### Download the MovieLens 100k dataset

In [1]:
!wget --no-clobber http://files.grouplens.org/datasets/movielens/ml-100k.zip
!unzip -o ml-100k.zip

--2018-10-24 20:44:55--  http://files.grouplens.org/datasets/movielens/ml-100k.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.34.235
Connecting to files.grouplens.org (files.grouplens.org)|128.101.34.235|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4924029 (4.7M) [application/zip]
Saving to: ‘ml-100k.zip’


2018-10-24 20:44:55 (16.6 MB/s) - ‘ml-100k.zip’ saved [4924029/4924029]

Archive:  ml-100k.zip
   creating: ml-100k/
  inflating: ml-100k/allbut.pl       
  inflating: ml-100k/mku.sh          
  inflating: ml-100k/README          
  inflating: ml-100k/u.data          
  inflating: ml-100k/u.genre         
  inflating: ml-100k/u.info          
  inflating: ml-100k/u.item          
  inflating: ml-100k/u.occupation    
  inflating: ml-100k/u.user          
  inflating: ml-100k/u1.base         
  inflating: ml-100k/u1.test         
  inflating: ml-100k/u2.base         
  inflating: ml-100k/u2.test         
  inflating: ml-100k/u3.base    

### Import ratings data

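Keep only ratings strictly higher than 2 and drop the rating value itself, so that each remaining row records just a positive user-item interaction. This turns the explicit ratings into an implicit dataset.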

In [2]:
import pandas as pd

implicit_df = pd.read_csv('ml-100k/u.data', sep="\t", header=None, names=["user_id", "item_id", "rating", "timestamp"])
implicit_df = implicit_df[implicit_df.rating>2][["user_id", "item_id"]]
implicit_df.head()

|  | user_id | item_id |
|---|---|---|
| 0 | 196 | 242 |
| 1 | 186 | 302 |
| 5 | 298 | 474 |
| 7 | 253 | 465 |
| 8 | 305 | 451 |
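
As a small illustration of the filter above, the sketch below applies the same `rating > 2` rule to a made-up four-row ratings table and keeps only the user and item columns. The rows are hypothetical, not taken from MovieLens; only the column names match `u.data`.

```python
import pandas as pd

# Hypothetical toy ratings, using the same column names as u.data.
toy_df = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "item_id": [10, 11, 10, 12],
    "rating": [5, 2, 3, 1],
    "timestamp": [0, 1, 2, 3],
})

# Keep interactions rated strictly above 2 and drop the rating itself,
# so each remaining row only says "this user liked this item".
toy_implicit_df = toy_df.loc[toy_df["rating"] > 2, ["user_id", "item_id"]]

print(toy_implicit_df)
```

The surviving rows keep their original index labels (0 and 2 here), which is why the index of the real output has gaps as well.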


### Create training data file

Create a CSV file from the dataframe above. Leave out the index, but keep the header row with `user_id` and `item_id`.

In [3]:
from IPython.display import HTML

TRAIN_DATA_DIR = 'train_data'

!mkdir -p {TRAIN_DATA_DIR}
implicit_df.to_csv('{}/ml-100k-gt2.csv'.format(TRAIN_DATA_DIR), index=False)

HTML('After you run this block, click <a href="https://github.com/outpace/sagemaker-examples/blob/master/ml-100k-gt2.csv" target="_blank">here</a> to see what the training data file looks like.')
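
To check that a file has the shape described above (a `user_id,item_id` header and no index column), it can be read back with the standard `csv` module. This is a minimal sketch: the helper name `check_training_csv` is made up for illustration, and the sample text reuses the first rows shown earlier instead of the real file.

```python
import csv
import io

def check_training_csv(handle):
    """Return the number of data rows, or raise if the layout is unexpected."""
    reader = csv.reader(handle)
    header = next(reader)
    if header != ["user_id", "item_id"]:
        raise ValueError("unexpected header: {}".format(header))
    row_count = 0
    for row in reader:
        if len(row) != 2:
            raise ValueError("expected 2 columns, got: {}".format(row))
        row_count += 1
    return row_count

# Sample content in the same layout that to_csv(..., index=False) writes.
sample = io.StringIO("user_id,item_id\n196,242\n186,302\n298,474\n")
print(check_training_csv(sample))  # prints 3
```

For the real file, the same function could be called on `open('train_data/ml-100k-gt2.csv', newline='')`.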

### Upload training data to S3

Choose a bucket, optionally customize the prefix, and upload the CSV file created above.

In [4]:
import sagemaker as sage

bucket = "<enter an s3 bucket here>"
prefix = "spotlight-implicit-factorization-test"

sess = sage.Session()

s3_train = sess.upload_data(TRAIN_DATA_DIR, bucket, "{}/training".format(prefix))
"uploaded training data file to {}".format(s3_train)

'uploaded training data file to s3://sagemaker-validation-us-east-2/spotlight-implicit-factorization-test/training'

## Step 2 - Create a model

### Run a SageMaker training job

This code starts a training job, waits for it to finish, and reports its final status.

In [5]:
%%time

import boto3
import time
from sagemaker import get_execution_role

role = get_execution_role()
ecr_image = "435525115971.dkr.ecr.us-east-2.amazonaws.com/sagemaker/spotlight-implicit:76"
job_name_prefix = 'spotlight-implicit-factorization-test'
timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
job_name = job_name_prefix + timestamp
create_training_params = \
{
    "AlgorithmSpecification": {
        "TrainingImage": ecr_image,
        "TrainingInputMode": "File"
    },
    "RoleArn": role,
    "OutputDataConfig": {
        "S3OutputPath": 's3://{}/{}/output'.format(bucket, job_name_prefix)
    },
    "ResourceConfig": {
        "InstanceCount": 1,
        "InstanceType": "ml.p3.2xlarge",
        "VolumeSizeInGB": 50
    },
    "TrainingJobName": job_name,
    "StoppingCondition": {
        "MaxRuntimeInSeconds": 360000
    },
    "InputDataConfig": [
        {
            "ChannelName": "training",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": s3_train,
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
            "ContentType": "application/csv",
            "CompressionType": "None"
        }
    ]
}

sagemaker = boto3.client(service_name='sagemaker')
sagemaker.create_training_job(**create_training_params)
status = sagemaker.describe_training_job(TrainingJobName=job_name)['TrainingJobStatus']
print('Training job current status: {}'.format(status))

try:
    # The waiter raises if the job ends in the Failed state or if polling times out.
    sagemaker.get_waiter('training_job_completed_or_stopped').wait(TrainingJobName=job_name)
    job_info = sagemaker.describe_training_job(TrainingJobName=job_name)
    status = job_info['TrainingJobStatus']
    print("Training job ended with status: " + status)
except Exception:
    print('Training did not complete')
    # FailureReason is only present when the job itself failed.
    job_info = sagemaker.describe_training_job(TrainingJobName=job_name)
    message = job_info.get('FailureReason', 'no failure reason reported')
    print('Training failed with the following error: {}'.format(message))

Training job current status: InProgress
Training job ended with status: Completed
CPU times: user 138 ms, sys: 18.3 ms, total: 156 ms
Wall time: 4min 1s
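
The job name above is the prefix plus a UTC timestamp, which keeps names unique across runs. SageMaker's API reference gives a 63-character limit and an alphanumeric-plus-hyphen pattern for training job names; treat the exact pattern below as an assumption to confirm against the current documentation. The sketch builds a name the same way and validates it before any API call. The helper name `make_job_name` is hypothetical.

```python
import re
import time

# Assumed TrainingJobName pattern: alphanumerics separated by hyphens.
JOB_NAME_PATTERN = re.compile(r"[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}")

def make_job_name(prefix, now=None):
    """Append a UTC timestamp to prefix and validate the result."""
    # time.gmtime(None) uses the current time; pass seconds for a fixed value.
    stamp = time.strftime("-%Y-%m-%d-%H-%M-%S", time.gmtime(now))
    name = prefix + stamp
    if len(name) > 63 or not JOB_NAME_PATTERN.fullmatch(name):
        raise ValueError("invalid SageMaker job name: {}".format(name))
    return name

print(make_job_name("spotlight-implicit-factorization-test", now=0))
# prints spotlight-implicit-factorization-test-1970-01-01-00-00-00
```

Checking the name locally gives a clearer error than a rejected `create_training_job` request, for example when a prefix contains an underscore.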


### Create SageMaker model

This registers the model artifacts produced by the training job as a SageMaker model, so that they can be used later for recommendations.

In [6]:
timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
model_name="spotlight-implicit-factorization-test" + timestamp
job_info = sagemaker.describe_training_job(TrainingJobName=job_name)
model_data = job_info['ModelArtifacts']['S3ModelArtifacts']

primary_container = {
    'Image': ecr_image,
    'ModelDataUrl': model_data,
}

create_model_response = sagemaker.create_model(
    ModelName = model_name,
    ExecutionRoleArn = role,
    PrimaryContainer = primary_container)

create_model_response

{'ModelArn': 'arn:aws:sagemaker:us-east-2:435525115971:model/spotlight-implicit-factorization-test-2018-10-24-20-48-58',
 'ResponseMetadata': {'RequestId': 'e385d5db-54f9-4f43-ba43-8b5224f90ef5',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'e385d5db-54f9-4f43-ba43-8b5224f90ef5',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '119',
   'date': 'Wed, 24 Oct 2018 20:48:58 GMT'},
  'RetryAttempts': 0}}

## Step 3 - Get recommendations (Inference)

### Create batch transform input file

Each line of the file is a JSON object containing two keys:

* `user_id`: the id of the user to get recommendations for
* `top_n`: the number of top scoring recommendations to return

In [7]:
import json

BATCH_INPUT_DIR = 'batch_input'

!mkdir -p {BATCH_INPUT_DIR}

with open(BATCH_INPUT_DIR + '/recommendation.requests', 'w') as outfile:
    json.dump({"user_id": "685", "top_n": "5"}, outfile)
    outfile.write("\n")
    json.dump({"user_id": "302", "top_n": "5"}, outfile)
    
HTML('After you run this block, click <a href="https://github.com/outpace/sagemaker-examples/blob/master/recommendation.requests" target="_blank">here</a> to see what the batch input file looks like.')
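
The same file format extends to any number of users. The sketch below writes one JSON object per line for a list of user ids, keeping the string-valued `user_id` and `top_n` fields used above. The helper name `write_requests` is hypothetical, and an in-memory buffer stands in for the real file.

```python
import io
import json

def write_requests(handle, user_ids, top_n=5):
    """Write one JSON request per line, without a trailing newline."""
    lines = [
        json.dumps({"user_id": str(user_id), "top_n": str(top_n)})
        for user_id in user_ids
    ]
    handle.write("\n".join(lines))

buffer = io.StringIO()
write_requests(buffer, [685, 302])
print(buffer.getvalue())
```

With the real file, `handle` would be the object returned by `open(BATCH_INPUT_DIR + '/recommendation.requests', 'w')`.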

### Upload the batch transform input file to S3

In [8]:
batch_input = sess.upload_data(BATCH_INPUT_DIR, bucket, "{}/batch_input".format(prefix))
"uploaded batch input file to {}".format(batch_input)

'uploaded batch input file to s3://sagemaker-validation-us-east-2/spotlight-implicit-factorization-test/batch_input'

### Run the Batch Transform Job

This code starts a batch transform job, polls it every 30 seconds until it finishes, and reports its final status.

In [9]:
%%time

timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
batch_job_name = "spotlight-implicit-factorization-test" + timestamp
batch_output = 's3://{}/{}/output'.format(bucket, batch_job_name)
request = \
{
  "TransformJobName": batch_job_name,
  "ModelName": model_name,
  "BatchStrategy": "SingleRecord",
  "TransformInput": {
    "DataSource": {
      "S3DataSource": {
        "S3DataType": "S3Prefix",
        "S3Uri": batch_input
      }
    },
    "ContentType": "application/json",
    "CompressionType": "None",
    "SplitType": "Line"
  },
  "TransformOutput": {
    "S3OutputPath": batch_output,
    "Accept": "text/csv",
    "AssembleWith": "Line"
  },
  "TransformResources": {
    "InstanceType": "ml.p3.2xlarge",
    "InstanceCount": 1
  }
}

sagemaker.create_transform_job(**request)

print("Created Transform job with name: ", batch_job_name)

while True:
    job_info = sagemaker.describe_transform_job(TransformJobName=batch_job_name)
    status = job_info['TransformJobStatus']
    if status == 'Completed':
        print("Transform job ended with status: " + status)
        break
    if status == 'Failed':
        message = job_info.get('FailureReason', 'no failure reason reported')
        print('Transform failed with the following error: {}'.format(message))
        raise Exception('Transform job failed')
    if status == 'Stopped':
        # A stopped job never reaches Completed, so stop polling here too.
        raise Exception('Transform job was stopped')
    time.sleep(30)

Created Transform job with name:  spotlight-implicit-factorization-test-2018-10-24-20-48-59
Transform job ended with status: Completed
CPU times: user 110 ms, sys: 7.96 ms, total: 118 ms
Wall time: 4min 31s
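
The polling loop above can be factored into a small helper that also gives up after a deadline. This is a sketch under assumptions: `wait_for_status` and its parameters are hypothetical names, and the `describe` argument is any zero-argument callable that returns the current status string.

```python
import time

def wait_for_status(describe, done=("Completed",), failed=("Failed", "Stopped"),
                    poll_seconds=30, timeout_seconds=3600, sleep=time.sleep):
    """Poll describe() until a terminal status is seen or the timeout expires."""
    waited = 0
    while True:
        status = describe()
        if status in done:
            return status
        if status in failed:
            raise RuntimeError("job ended with status: {}".format(status))
        if waited >= timeout_seconds:
            raise TimeoutError("job still {} after {}s".format(status, waited))
        sleep(poll_seconds)
        waited += poll_seconds

# Demo with a fake job that reports InProgress twice, then Completed.
fake_statuses = iter(["InProgress", "InProgress", "Completed"])
result = wait_for_status(lambda: next(fake_statuses), sleep=lambda seconds: None)
print(result)  # prints Completed
```

In this notebook, `describe` could be `lambda: sagemaker.describe_transform_job(TransformJobName=batch_job_name)['TransformJobStatus']`. Injecting `sleep` keeps the helper testable without real waiting.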


### Download the batch results

In [10]:
!aws s3 cp {batch_output + '/recommendation.requests.out'} .

HTML('After you run this block, click <a href="https://github.com/outpace/sagemaker-examples/blob/master/recommendation.requests.out" target="_blank">here</a> to see what the batch output file looks like.')

Completed 269 Bytes/269 Bytes (2.1 KiB/s) with 1 file(s) remaining
download: s3://sagemaker-validation-us-east-2/spotlight-implicit-factorization-test-2018-10-24-20-48-59/output/recommendation.requests.out to ./recommendation.requests.out


### Import movie titles

Read the movie titles from `u.item` in the MovieLens files downloaded earlier and join them with the ratings data.

In [11]:
titles_df = pd.read_csv('ml-100k/u.item', sep="|", header=None, encoding = "ISO-8859-1").set_index([0]).iloc[:,0:1]
implicit_df = implicit_df.join(titles_df, on='item_id').rename(index=str, columns={"user_id":"user_id",1:"movie_title"})
implicit_df.head()

|  | user_id | item_id | movie_title |
|---|---|---|---|
| 0 | 196 | 242 | Kolya (1996) |
| 1 | 186 | 302 | L.A. Confidential (1997) |
| 5 | 298 | 474 | Dr. Strangelove or: How I Learned to Stop Worr... |
| 7 | 253 | 465 | Jungle Book, The (1994) |
| 8 | 305 | 451 | Grease (1978) |


### Recommendations with scores

Import the recommendations from the batch output file downloaded above and join them with the titles dataframe. These are the top 5 movie recommendations for users 685 and 302.

In [12]:
recommendations_df = pd.read_csv('recommendation.requests.out', header=None, names=["user_id", "item_id", "score"])
recommendations_df = recommendations_df.join(titles_df, on='item_id').rename(index=str, columns={"user_id":"user_id",1:"movie_title"})
recommendations_df

|  | user_id | item_id | score | movie_title |
|---|---|---|---|---|
| 0 | 685 | 333 | 14.640368 | Game, The (1997) |
| 1 | 685 | 272 | 13.269605 | Good Will Hunting (1997) |
| 2 | 685 | 268 | 13.025604 | Chasing Amy (1997) |
| 3 | 685 | 347 | 12.637764 | Wag the Dog (1997) |
| 4 | 685 | 315 | 12.535725 | Apt Pupil (1998) |
| 5 | 302 | 748 | 12.058822 | Saint, The (1997) |
| 6 | 302 | 333 | 11.834901 | Game, The (1997) |
| 7 | 302 | 323 | 11.719286 | Dante's Peak (1997) |
| 8 | 302 | 258 | 10.959321 | Contact (1997) |
| 9 | 302 | 313 | 10.367278 | Titanic (1997) |


### User history

For reference, here are the movies that users 685 and 302 rated higher than 2, which is the history the model was trained on.

In [13]:
implicit_df[implicit_df.user_id.isin([685,302])].sort_values(by=['user_id'], ascending=False)

|  | user_id | item_id | movie_title |
|---|---|---|---|
| 42723 | 685 | 269 | Full Monty, The (1997) |
| 43618 | 685 | 302 | L.A. Confidential (1997) |
| 53414 | 685 | 325 | Crash (1996) |
| 70388 | 685 | 324 | Lost Highway (1997) |
| 89437 | 685 | 882 | Washington Square (1997) |
| 98163 | 685 | 875 | She's So Lovely (1997) |
| 4826 | 302 | 328 | Conspiracy Theory (1997) |
| 9848 | 302 | 307 | Devil's Advocate, The (1997) |
| 14758 | 302 | 258 | Contact (1997) |
| 32327 | 302 | 301 | In & Out (1997) |

## Optional Clean up

In [14]:
!rm ml-100k.zip 2> /dev/null
!rm -fr ml-100k/ 2> /dev/null
!rm -fr {TRAIN_DATA_DIR} 2> /dev/null
!rm -fr {BATCH_INPUT_DIR} 2> /dev/null
sagemaker.delete_model(ModelName= model_name)

{'ResponseMetadata': {'RequestId': 'e7830882-4301-419d-af9c-4a8397deeb24',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'e7830882-4301-419d-af9c-4a8397deeb24',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '0',
   'date': 'Wed, 24 Oct 2018 20:53:31 GMT'},
  'RetryAttempts': 0}}
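
The cell above removes local files and the SageMaker model, but the objects uploaded to S3 remain. A minimal sketch for removing everything under a prefix follows. `delete_prefix` is a hypothetical helper that takes a boto3 S3 client as its first argument and relies on the `list_objects_v2` paginator and `delete_objects`. Deletion is irreversible, so check the bucket and prefix before running it.

```python
def delete_prefix(s3_client, bucket, prefix):
    """Delete every object under prefix in bucket and return how many were removed."""
    deleted = 0
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # A page holds at most 1000 keys, which is also the delete_objects limit.
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if keys:
            s3_client.delete_objects(Bucket=bucket, Delete={"Objects": keys})
            deleted += len(keys)
    return deleted

# Example call (not executed here; requires AWS credentials):
# import boto3
# delete_prefix(boto3.client("s3"), bucket, prefix)
```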