# Implicit Bayesian Personalized Ranking

A recommender model that learns matrix factorization embeddings by minimizing a pairwise ranking loss.

[Implicit BPR Documentation](https://implicit.readthedocs.io/en/latest/bpr.html)

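To make "pairwise ranking loss" concrete: BPR scores a user-item pair as the dot product of their embeddings and penalizes cases where an item the user interacted with scores below one they did not. The sketch below computes that loss for a single (user, positive item, negative item) triple with NumPy. It is illustrative only: the function name `bpr_loss` and the toy vectors are assumptions, regularization is left out, and it is not the implicit library's training code.

```python
import numpy as np

def bpr_loss(user_vec, pos_item_vec, neg_item_vec):
    """Pairwise BPR loss for one (user, positive, negative) triple (illustrative)."""
    # Score difference between the interacted item and the sampled negative item.
    x_uij = np.dot(user_vec, pos_item_vec) - np.dot(user_vec, neg_item_vec)
    # -log(sigmoid(x)) rewritten as log(1 + exp(-x)) for numerical stability.
    return np.logaddexp(0.0, -x_uij)

# Toy embeddings, made up for this example.
user = np.array([1.0, 0.5])
liked = np.array([2.0, 1.0])     # item the user interacted with
unseen = np.array([-1.0, 0.0])   # sampled item without an interaction

print(bpr_loss(user, liked, unseen))  # small: the ranking is already correct
print(bpr_loss(user, unseen, liked))  # large: the ranking is inverted
```

Training lowers this loss over many sampled triples, which pushes interacted items above non-interacted ones in each user's ranking.
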
## Table of contents

* [Sample files](#sample-files)
* [Step 1 - Prepare training data](#prepare-training-data)
  * [Download lastfm 360k dataset](#download)
  * [Prepare lastfm artist play training data](#prepare-data)
  * [Create training data file](#create-training-data-file)
  * [Upload training data to s3](#upload-training-data)
* [Step 2 - Create a model](#create-model)
  * [Run a SageMaker training job](#run-training-job)
  * [Create a SageMaker model](#create-sagemaker-model)
* [Step 3 - Get recommendations (Inference)](#get-recommendations)
  * [Example users](#example-users)
  * [Create batch transform input file](#create-batch-input)
  * [Upload the batch transform input file to s3](#upload-batch-input)
  * [Run the Batch Transform Job](#run-transform)
  * [Download the batch results](#download-batch-results)
  * [Recommendations with scores](#recommendations)
  * [User history](#user-history)
* [Step 4 - Optional Clean up](#cleanup)

## Sample files <a id="sample-files"></a>

These links point to example files on GitHub.

* [training input file](https://github.com/outpace/sagemaker-examples/blob/master/implicit-bpr/training/lastfm-360k-1mm-clean.csv)
* [batch transform input file](https://github.com/outpace/sagemaker-examples/blob/master/implicit-bpr/batch_input/recommendation.requests)
* [batch transform output file](https://github.com/outpace/sagemaker-examples/blob/master/implicit-bpr/recommendation.requests.out)

## Step 1 - Prepare training data <a id="prepare-training-data"></a>
### Download lastfm 360k dataset <a id="download"></a>

In [1]:
!wget --no-clobber http://mtg.upf.edu/static/datasets/last.fm/lastfm-dataset-360K.tar.gz
!tar -xvf lastfm-dataset-360K.tar.gz
!head lastfm-dataset-360K/usersha1-artmbid-artname-plays.tsv

--2018-11-08 17:54:04--  http://mtg.upf.edu/static/datasets/last.fm/lastfm-dataset-360K.tar.gz
Resolving mtg.upf.edu (mtg.upf.edu)... 84.89.139.55
Connecting to mtg.upf.edu (mtg.upf.edu)|84.89.139.55|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 569202935 (543M) [application/x-gzip]
Saving to: ‘lastfm-dataset-360K.tar.gz’


2018-11-08 17:59:59 (1.53 MB/s) - ‘lastfm-dataset-360K.tar.gz’ saved [569202935/569202935]

lastfm-dataset-360K/
lastfm-dataset-360K/usersha1-artmbid-artname-plays.tsv
lastfm-dataset-360K/README.txt
lastfm-dataset-360K/mbox_sha1sum.py
lastfm-dataset-360K/usersha1-profile.tsv
00000c289a1829a808ac09c00daf10bc3c4e223b	3bd73256-3905-4f3a-97e2-8b341527f805	betty blowtorch	2137
00000c289a1829a808ac09c00daf10bc3c4e223b	f2fb0ff0-5679-42ec-a55c-15109ce6e320	die Ärzte	1099
00000c289a1829a808ac09c00daf10bc3c4e223b	b3ae82c2-e60b-4551-a76d-6620f1b456aa	melissa etheridge	897
00000c289a1829a808ac09c00daf10bc3c4e223b	3d6bbeb7-f90e-4d10-b440-e153c0d10b53	e

### Prepare lastfm artist play training data <a id="prepare-data"></a>

Import the tab-separated lastfm file. Read only the first 1 million rows to save memory and processing time, then drop any rows with null values in `item_id`, `user_id`, `total_interactions`, or `artist_name`.

In [2]:
import pandas as pd

df = pd.read_csv('lastfm-dataset-360K/usersha1-artmbid-artname-plays.tsv', 
                 sep="\t", 
                 header=None, 
                 names=["user_id", "item_id", "artist_name", "total_interactions"], 
                 nrows=1000000)
df = df.dropna(subset=['item_id', 'user_id', 'total_interactions', 'artist_name'])
print(df.shape)
df.head()

(987161, 4)


| | user_id | item_id | artist_name | total_interactions |
| --- | --- | --- | --- | --- |
| 0 | 00000c289a1829a808ac09c00daf10bc3c4e223b | 3bd73256-3905-4f3a-97e2-8b341527f805 | betty blowtorch | 2137 |
| 1 | 00000c289a1829a808ac09c00daf10bc3c4e223b | f2fb0ff0-5679-42ec-a55c-15109ce6e320 | die Ärzte | 1099 |
| 2 | 00000c289a1829a808ac09c00daf10bc3c4e223b | b3ae82c2-e60b-4551-a76d-6620f1b456aa | melissa etheridge | 897 |
| 3 | 00000c289a1829a808ac09c00daf10bc3c4e223b | 3d6bbeb7-f90e-4d10-b440-e153c0d10b53 | elvenking | 717 |
| 4 | 00000c289a1829a808ac09c00daf10bc3c4e223b | bbd2ffd7-17f4-4506-8572-c1ea58c3f9a8 | juliette & the licks | 706 |


### Create training data file <a id="create-training-data-file"></a>

Write the `user_id`, `item_id`, and `total_interactions` columns of the dataframe above to a CSV file, with a header row and without the index. Show the head of the file.

In [3]:
data_dir = 'implicit-bpr'
train_data_dir = '{}/training'.format(data_dir)
train_data_file = '{}/lastfm-360k-1mm-clean.csv'.format(train_data_dir)

!mkdir -p {train_data_dir}
df[["user_id", "item_id", "total_interactions"]].to_csv(train_data_file, index=False)

!head {train_data_file}

user_id,item_id,total_interactions
00000c289a1829a808ac09c00daf10bc3c4e223b,3bd73256-3905-4f3a-97e2-8b341527f805,2137
00000c289a1829a808ac09c00daf10bc3c4e223b,f2fb0ff0-5679-42ec-a55c-15109ce6e320,1099
00000c289a1829a808ac09c00daf10bc3c4e223b,b3ae82c2-e60b-4551-a76d-6620f1b456aa,897
00000c289a1829a808ac09c00daf10bc3c4e223b,3d6bbeb7-f90e-4d10-b440-e153c0d10b53,717
00000c289a1829a808ac09c00daf10bc3c4e223b,bbd2ffd7-17f4-4506-8572-c1ea58c3f9a8,706
00000c289a1829a808ac09c00daf10bc3c4e223b,8bfac288-ccc5-448d-9573-c33ea2aa5c30,691
00000c289a1829a808ac09c00daf10bc3c4e223b,6531c8b1-76ea-4141-b270-eb1ac5b41375,545
00000c289a1829a808ac09c00daf10bc3c4e223b,21f3573f-10cf-44b3-aeaa-26cccd8448b5,507
00000c289a1829a808ac09c00daf10bc3c4e223b,c5db90c4-580d-4f33-b364-fbaa5a3a58b5,424

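Before uploading, it may be worth checking that the file has the three expected headers and that every interaction count is numeric. The sketch below uses only the standard library. The helper name `check_training_csv` is hypothetical, and the example writes a tiny temporary file (two rows copied from the output above) instead of reading the real training data. Whether the training container enforces this layout itself is not stated here.

```python
import csv
import os
import tempfile

EXPECTED_HEADER = ["user_id", "item_id", "total_interactions"]

def check_training_csv(path):
    """Return the number of data rows, raising ValueError on a malformed file."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        if header != EXPECTED_HEADER:
            raise ValueError("unexpected header: {}".format(header))
        rows = 0
        for line_no, row in enumerate(reader, start=2):
            if len(row) != 3 or not all(row):
                raise ValueError("bad row at line {}: {}".format(line_no, row))
            float(row[2])  # raises ValueError if the count is not numeric
            rows += 1
    return rows

# Tiny stand-in file; the real file would be train_data_file from the cell above.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write("user_id,item_id,total_interactions\n")
    tmp.write("00000c289a1829a808ac09c00daf10bc3c4e223b,3bd73256-3905-4f3a-97e2-8b341527f805,2137\n")
    tmp.write("00000c289a1829a808ac09c00daf10bc3c4e223b,f2fb0ff0-5679-42ec-a55c-15109ce6e320,1099\n")
    sample_path = tmp.name

n_rows = check_training_csv(sample_path)
print(n_rows)
os.remove(sample_path)
```
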

### Upload training data to s3 <a id="upload-training-data"></a>

Choose a bucket, optionally customize the prefix, and upload the csv created above.

In [4]:
import sagemaker as sage

bucket = "sagemaker-validation-us-east-2"
prefix = "implicit-bpr-test"

sess = sage.Session()

s3_train = sess.upload_data(train_data_dir, bucket, "{}/training".format(prefix))
"uploaded training data file to {}".format(s3_train)

'uploaded training data file to s3://sagemaker-validation-us-east-2/implicit-bpr-test/training'

## Step 2 - Create a model <a id="create-model"></a>

### Run a SageMaker training job <a id="run-training-job"></a>

This code starts a training job, waits for it to finish, and reports its final status.

In [5]:
%%time

import boto3
import time
from sagemaker import get_execution_role

role = get_execution_role()
ecr_image = "435525115971.dkr.ecr.us-east-2.amazonaws.com/sagemaker/implicit-bpr:12"
job_name_prefix = 'implicit-bpr-test'
timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
job_name = job_name_prefix + timestamp
create_training_params = \
{
    "AlgorithmSpecification": {
        "TrainingImage": ecr_image,
        "TrainingInputMode": "File"
    },
    "RoleArn": role,
    "OutputDataConfig": {
        "S3OutputPath": 's3://{}/{}/output'.format(bucket, job_name_prefix)
    },
    "ResourceConfig": {
        "InstanceCount": 1,
        "InstanceType": "ml.p3.2xlarge",
        "VolumeSizeInGB": 50
    },
    "TrainingJobName": job_name,
    "StoppingCondition": {
        "MaxRuntimeInSeconds": 360000
    },
    "InputDataConfig": [
        {
            "ChannelName": "training",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": s3_train,
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
            "ContentType": "application/csv",
            "CompressionType": "None"
        }
    ]
}

sagemaker = boto3.client(service_name='sagemaker')
sagemaker.create_training_job(**create_training_params)
status = sagemaker.describe_training_job(TrainingJobName=job_name)['TrainingJobStatus']
print('Training job current status: {}'.format(status))

try:
    # Block until the job reaches a terminal state, then report it.
    sagemaker.get_waiter('training_job_completed_or_stopped').wait(TrainingJobName=job_name)
    job_info = sagemaker.describe_training_job(TrainingJobName=job_name)
    status = job_info['TrainingJobStatus']
    print("Training job ended with status: " + status)
except Exception:
    # The waiter raises if the job fails or if polling times out.
    print('Training did not complete successfully')
    job_info = sagemaker.describe_training_job(TrainingJobName=job_name)
    message = job_info.get('FailureReason', 'no failure reason reported')
    print('Training failed with the following error: {}'.format(message))

Training job current status: InProgress
Training job ended with status: Completed
CPU times: user 184 ms, sys: 0 ns, total: 184 ms
Wall time: 4min


### Create a SageMaker model <a id="create-sagemaker-model"></a>

This registers the model artifacts produced by the training job as a SageMaker model, which is used later to generate recommendations.

In [6]:
timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
model_name="implicit-bpr-test" + timestamp
job_info = sagemaker.describe_training_job(TrainingJobName=job_name)
model_data = job_info['ModelArtifacts']['S3ModelArtifacts']

primary_container = {
    'Image': ecr_image,
    'ModelDataUrl': model_data,
}

create_model_response = sagemaker.create_model(
    ModelName = model_name,
    ExecutionRoleArn = role,
    PrimaryContainer = primary_container)

create_model_response

{'ModelArn': 'arn:aws:sagemaker:us-east-2:435525115971:model/implicit-bpr-test-2018-11-08-18-04-29',
 'ResponseMetadata': {'RequestId': 'eba0e9f5-db20-403c-878d-c8931968ad4c',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'eba0e9f5-db20-403c-878d-c8931968ad4c',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '99',
   'date': 'Thu, 08 Nov 2018 18:04:29 GMT'},
  'RetryAttempts': 0}}

## Step 3 - Get recommendations (Inference) <a id="get-recommendations"></a>

### Example users <a id="example-users"></a>

Pick two example users from the training data. Artist recommendations are requested for them in the steps below.

In [7]:
example_users = df[df.user_id.isin(["05c4bbb936abd2331e8f64037c95a61335d40e30",
                                    "030ebbd1d8b360ce465a20e30a67a43da97f1b20"])]
example_users

| | user_id | item_id | artist_name | total_interactions |
| --- | --- | --- | --- | --- |
| 207869 | 030ebbd1d8b360ce465a20e30a67a43da97f1b20 | a7022764-95fb-46af-a7d6-90056746451a | uma thurman | 651 |
| 207870 | 030ebbd1d8b360ce465a20e30a67a43da97f1b20 | 0743b15a-3c32-48c8-ad58-cb325350befa | blink-182 | 649 |
| 391892 | 05c4bbb936abd2331e8f64037c95a61335d40e30 | 99d7b49c-c18e-4a11-bf3e-b71710938df6 | phoenix | 3 |
| 391893 | 05c4bbb936abd2331e8f64037c95a61335d40e30 | bd4d397a-849a-48bf-be24-52eec87feeee | adriana calcanhotto | 2 |


### Create batch transform input file <a id="create-batch-input"></a>

Each line is a JSON object containing two keys:

* `user_id`: the ID of the user
* `top_n`: the number of top-scoring recommendations to return

The head of the batch input file is shown.

In [8]:
import json

batch_input_dir = '{}/batch_input'.format(data_dir)
batch_input_file = batch_input_dir + '/recommendation.requests'

!mkdir -p {batch_input_dir}

with open(batch_input_file, 'w') as outfile:
    json.dump({"user_id": "05c4bbb936abd2331e8f64037c95a61335d40e30", "top_n": "5"}, outfile)
    outfile.write("\n")
    json.dump({"user_id": "030ebbd1d8b360ce465a20e30a67a43da97f1b20", "top_n": "5"}, outfile)
   
!head {batch_input_file}

{"user_id": "05c4bbb936abd2331e8f64037c95a61335d40e30", "top_n": "5"}
{"user_id": "030ebbd1d8b360ce465a20e30a67a43da97f1b20", "top_n": "5"}

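For more than a handful of users, the request file could be generated in a loop instead of by hand. The sketch below is one way to do that. The helper name `write_recommendation_requests` is hypothetical, it writes to a temporary directory and not to `batch_input_dir`, and it keeps `top_n` as a string only because the hand-written requests above do.

```python
import json
import os
import tempfile

def write_recommendation_requests(user_ids, top_n, path):
    """Write one JSON request per line, in the same shape as the requests above."""
    requests = [json.dumps({"user_id": user_id, "top_n": str(top_n)})
                for user_id in user_ids]
    with open(path, "w") as outfile:
        # Join with newlines so there is no trailing newline, as in the cell above.
        outfile.write("\n".join(requests))

users = ["05c4bbb936abd2331e8f64037c95a61335d40e30",
         "030ebbd1d8b360ce465a20e30a67a43da97f1b20"]
requests_path = os.path.join(tempfile.mkdtemp(), "recommendation.requests")
write_recommendation_requests(users, 5, requests_path)

with open(requests_path) as f:
    lines = f.read().splitlines()
print(lines[0])
```
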
### Upload the batch transform input file to s3 <a id="upload-batch-input"></a>

In [9]:
batch_input = sess.upload_data(batch_input_dir, bucket, "{}/batch_input".format(prefix))
"uploaded batch input file to {}".format(batch_input)

'uploaded batch input file to s3://sagemaker-validation-us-east-2/implicit-bpr-test/batch_input'

### Run the Batch Transform Job <a id="run-transform"></a>

This code starts a batch transform job, polls until it finishes, and reports its final status.

In [10]:
%%time

timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
batch_job_name = "implicit-bpr-test" + timestamp
batch_output = 's3://{}/{}/output'.format(bucket, batch_job_name)
request = \
{
  "TransformJobName": batch_job_name,
  "ModelName": model_name,
  "BatchStrategy": "SingleRecord",
  "TransformInput": {
    "DataSource": {
      "S3DataSource": {
        "S3DataType": "S3Prefix",
        "S3Uri": batch_input
      }
    },
    "ContentType": "application/json",
    "CompressionType": "None",
    "SplitType": "Line"
  },
  "TransformOutput": {
    "S3OutputPath": batch_output,
    "Accept": "text/csv",
    "AssembleWith": "Line"
  },
  "TransformResources": {
    "InstanceType": "ml.p3.2xlarge",
    "InstanceCount": 1
  }
}

sagemaker.create_transform_job(**request)

print("Created Transform job with name: ", batch_job_name)

while True:
    job_info = sagemaker.describe_transform_job(TransformJobName=batch_job_name)
    status = job_info['TransformJobStatus']
    if status == 'Completed':
        print("Transform job ended with status: " + status)
        break
    if status in ('Failed', 'Stopped'):
        # Stop polling on any terminal status; FailureReason may be absent for stopped jobs.
        message = job_info.get('FailureReason', 'no failure reason reported')
        print('Transform job ended with status {}: {}'.format(status, message))
        raise Exception('Transform job did not complete')
    time.sleep(30)

Created Transform job with name:  implicit-bpr-test-2018-11-08-18-04-30
Transform job ended with status: Completed
CPU times: user 107 ms, sys: 1.35 ms, total: 109 ms
Wall time: 4min 1s

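The polling loop above has no upper bound on how long it waits. A variant with a timeout could look like the sketch below. To keep it runnable without AWS credentials, the status lookup is passed in as a function. `wait_for_status`, `get_status`, and the default intervals are hypothetical names and values, not part of boto3.

```python
import time

def wait_for_status(get_status, done=("Completed",), failed=("Failed", "Stopped"),
                    poll_seconds=30, timeout_seconds=3600, sleep=time.sleep):
    """Poll get_status() until it returns a terminal status or the timeout expires."""
    waited = 0
    while True:
        status = get_status()
        if status in done:
            return status
        if status in failed:
            raise RuntimeError("job ended with status: {}".format(status))
        if waited >= timeout_seconds:
            raise TimeoutError("gave up after {} seconds".format(waited))
        sleep(poll_seconds)
        waited += poll_seconds

# Stand-in for a real status call: reports InProgress twice, then Completed.
fake_statuses = iter(["InProgress", "InProgress", "Completed"])
result = wait_for_status(lambda: next(fake_statuses), sleep=lambda seconds: None)
print(result)
```

With the client from this notebook, `get_status` could be `lambda: sagemaker.describe_transform_job(TransformJobName=batch_job_name)['TransformJobStatus']`.
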

### Download the batch results <a id="download-batch-results"></a>

Copy the output file from S3 to the local data directory and show its head.

In [11]:
!aws s3 cp {batch_output + '/recommendation.requests.out'} {data_dir}

!head {data_dir}/recommendation.requests.out

download: s3://sagemaker-validation-us-east-2/implicit-bpr-test-2018-11-08-18-04-30/output/recommendation.requests.out to implicit-bpr/recommendation.requests.out
05c4bbb936abd2331e8f64037c95a61335d40e30,51a77a0b-69be-41c4-88f4-22f25a9a63e5,1.2097642421722412
05c4bbb936abd2331e8f64037c95a61335d40e30,c9e12a06-e0c5-4705-bb66-98231c5c9e71,1.1937133073806763
05c4bbb936abd2331e8f64037c95a61335d40e30,8a9ac1cb-faae-434e-8d60-b139a3707dfc,1.192223310470581
05c4bbb936abd2331e8f64037c95a61335d40e30,5083e4bf-3246-4ab9-a6d3-23be8db8db2c,1.1609201431274414
05c4bbb936abd2331e8f64037c95a61335d40e30,93942b87-215a-4626-b5ec-bf129d9fa2f6,1.153393030166626
030ebbd1d8b360ce465a20e30a67a43da97f1b20,9390a27f-d63d-43ac-a771-a0e0794fee61,1.1210126876831055
030ebbd1d8b360ce465a20e30a67a43da97f1b20,e9b062a3-8b46-48a7-87d2-ce94c7cc8322,1.1094765663146973
030ebbd1d8b360ce465a20e30a67a43da97f1b20,44f85621-f8db-45dd-93de-e9f069d28d7d,1.086712121963501
030ebbd1d8b360ce465a20e30a67a43da97f1b20,084308bd-1654-436f-ba03

### Recommendations with scores <a id="recommendations"></a>

Import the recommendations from the batch output file downloaded above and join with artist names. These are the top 5 artist recommendations for our example users.

In [12]:
recommendations_df = pd.read_csv('{}/recommendation.requests.out'.format(data_dir), 
                                 header=None, 
                                 names=["user_id", "item_id", "score"])
artist_names = df.groupby('item_id')[['artist_name']].first()
recommendations_df = recommendations_df.join(artist_names, on='item_id')
recommendations_df

| | user_id | item_id | score | artist_name |
| --- | --- | --- | --- | --- |
| 0 | 05c4bbb936abd2331e8f64037c95a61335d40e30 | 51a77a0b-69be-41c4-88f4-22f25a9a63e5 | 1.209764 | emily loizeau |
| 1 | 05c4bbb936abd2331e8f64037c95a61335d40e30 | c9e12a06-e0c5-4705-bb66-98231c5c9e71 | 1.193713 | gepe |
| 2 | 05c4bbb936abd2331e8f64037c95a61335d40e30 | 8a9ac1cb-faae-434e-8d60-b139a3707dfc | 1.192223 | mika |
| 3 | 05c4bbb936abd2331e8f64037c95a61335d40e30 | 5083e4bf-3246-4ab9-a6d3-23be8db8db2c | 1.16092 | olivia ruiz |
| 4 | 05c4bbb936abd2331e8f64037c95a61335d40e30 | 93942b87-215a-4626-b5ec-bf129d9fa2f6 | 1.153393 | bénabar |
| 5 | 030ebbd1d8b360ce465a20e30a67a43da97f1b20 | 9390a27f-d63d-43ac-a771-a0e0794fee61 | 1.121013 | the string quartet |
| 6 | 030ebbd1d8b360ce465a20e30a67a43da97f1b20 | e9b062a3-8b46-48a7-87d2-ce94c7cc8322 | 1.109477 | dean gray |
| 7 | 030ebbd1d8b360ce465a20e30a67a43da97f1b20 | 44f85621-f8db-45dd-93de-e9f069d28d7d | 1.086712 | terry pratchett |
| 8 | 030ebbd1d8b360ce465a20e30a67a43da97f1b20 | 084308bd-1654-436f-ba03-df6697104e19 | 1.085697 | green day |
| 9 | 030ebbd1d8b360ce465a20e30a67a43da97f1b20 | 6b014cfd-4927-4187-a741-715998e6d785 | 1.077631 | lemon demon |

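If a ranked list per user is more convenient than the flat table, the rows can be grouped by `user_id`. The sketch below does this on a small stand-in frame built from four rows of the table above; the variable names `recs` and `ranked` are assumptions, and in the notebook the same chain could be applied to `recommendations_df`.

```python
import pandas as pd

# Stand-in rows copied from the table above (two per user, scores as displayed).
recs = pd.DataFrame(
    [
        ("05c4bbb936abd2331e8f64037c95a61335d40e30", "emily loizeau", 1.209764),
        ("05c4bbb936abd2331e8f64037c95a61335d40e30", "gepe", 1.193713),
        ("030ebbd1d8b360ce465a20e30a67a43da97f1b20", "the string quartet", 1.121013),
        ("030ebbd1d8b360ce465a20e30a67a43da97f1b20", "dean gray", 1.109477),
    ],
    columns=["user_id", "artist_name", "score"],
)

# Sort by score so each user's list is ranked best-first, then collect artist names.
ranked = (recs.sort_values("score", ascending=False)
              .groupby("user_id", sort=False)["artist_name"]
              .agg(list)
              .to_dict())

for user_id, artists in ranked.items():
    print(user_id, artists)
```
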

### User history <a id="user-history"></a>

Show the example users' history again for convenience.

In [13]:
example_users

| | user_id | item_id | artist_name | total_interactions |
| --- | --- | --- | --- | --- |
| 207869 | 030ebbd1d8b360ce465a20e30a67a43da97f1b20 | a7022764-95fb-46af-a7d6-90056746451a | uma thurman | 651 |
| 207870 | 030ebbd1d8b360ce465a20e30a67a43da97f1b20 | 0743b15a-3c32-48c8-ad58-cb325350befa | blink-182 | 649 |
| 391892 | 05c4bbb936abd2331e8f64037c95a61335d40e30 | 99d7b49c-c18e-4a11-bf3e-b71710938df6 | phoenix | 3 |
| 391893 | 05c4bbb936abd2331e8f64037c95a61335d40e30 | bd4d397a-849a-48bf-be24-52eec87feeee | adriana calcanhotto | 2 |

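Comparing history and recommendations by eye works for two users. For more, an inner merge on `user_id` and `item_id` shows whether any recommended item already appears in a user's history. The sketch below uses a few rows copied from the tables above; the frame names `history`, `recs`, and `already_seen` are assumptions. Whether the model itself filters out previously played artists is not stated in this notebook.

```python
import pandas as pd

# History rows for one example user, copied from the table above.
history = pd.DataFrame({
    "user_id": ["030ebbd1d8b360ce465a20e30a67a43da97f1b20"] * 2,
    "item_id": ["a7022764-95fb-46af-a7d6-90056746451a",
                "0743b15a-3c32-48c8-ad58-cb325350befa"],
})

# Two of that user's recommendations, copied from the recommendations table.
recs = pd.DataFrame({
    "user_id": ["030ebbd1d8b360ce465a20e30a67a43da97f1b20"] * 2,
    "item_id": ["9390a27f-d63d-43ac-a771-a0e0794fee61",
                "e9b062a3-8b46-48a7-87d2-ce94c7cc8322"],
})

# An inner merge on both columns keeps only recommendations that repeat a history row.
already_seen = recs.merge(history, on=["user_id", "item_id"], how="inner")
print(len(already_seen))
```

In the notebook, the same merge could be run on `recommendations_df` and `example_users`.
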

## Step 4 - Optional Clean up <a id="cleanup"></a>

In [14]:
def cleanup():
    !rm lastfm-dataset-360K.tar.gz 2> /dev/null
    !rm -fr implicit-bpr/ 2> /dev/null
    !rm -fr lastfm-dataset-360K/ 2> /dev/null
    sagemaker.delete_model(ModelName = model_name)
    
# Optionally uncomment and run to remove the local files and the SageMaker model (objects uploaded to S3 are not deleted)

#cleanup()