# Implicit Bayesian Personalized Ranking

A recommender model that learns matrix factorization embeddings for users and items by minimizing a pairwise ranking loss over implicit feedback.

[Implicit BPR Documentation](https://implicit.readthedocs.io/en/latest/bpr.html)
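
For intuition, BPR trains on triples of a user, an item the user interacted with, and an item they did not, and pushes the score of the first item above the second. The following is a minimal sketch of the loss for one such triple in plain Python, assuming dot-product scores and L2 regularization. The function name `bpr_triple_loss` and the `reg` parameter are illustrative and are not part of the `implicit` library's API.

```python
import math

def bpr_triple_loss(user_vec, pos_item_vec, neg_item_vec, reg=0.01):
    """Illustrative BPR loss for one (user, positive item, negative item) triple."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # Difference between the score of the observed item and the unobserved item
    x_uij = dot(user_vec, pos_item_vec) - dot(user_vec, neg_item_vec)

    # -log(sigmoid(x)) written in a numerically stable form
    if x_uij >= 0:
        ranking_loss = math.log1p(math.exp(-x_uij))
    else:
        ranking_loss = -x_uij + math.log1p(math.exp(x_uij))

    # L2 penalty on the three embeddings involved (an assumption of this sketch)
    l2 = reg * (dot(user_vec, user_vec)
                + dot(pos_item_vec, pos_item_vec)
                + dot(neg_item_vec, neg_item_vec))
    return ranking_loss + l2

user = [1.0, 0.0]
played = [1.0, 0.0]      # aligned with the user embedding
not_played = [0.0, 1.0]  # orthogonal to the user embedding

# The loss is lower when the played item outranks the unplayed one
print(bpr_triple_loss(user, played, not_played, reg=0))
print(bpr_triple_loss(user, not_played, played, reg=0))
```

The loss shrinks as the played item's score pulls ahead of the unplayed item's score, which is what "pairwise ranking" refers to.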

## Table of contents

* [Sample files](#sample-files)
* [Step 1 - Prepare training data](#prepare-training-data)
 * [Download lastfm 1K dataset](#download)
 * [Prepare lastfm artist play training data](#prepare-data)
 * [Create training data file](#create-training-data-file)
 * [Upload training data to S3](#upload-training-data)
* [Step 2 - Create a model](#create-model)
 * [Run a SageMaker training job](#run-training-job)
 * [Create a SageMaker model](#create-sagemaker-model)
* [Step 3 - Get recommendations (inference)](#get-recommendations)
 * [Example users](#example-users)
 * [Create batch transform input file](#create-batch-input)
 * [Upload the batch transform input file to S3](#upload-batch-input)
 * [Run the Batch Transform Job](#run-transform)
 * [Download the batch results](#download-batch-results)
 * [Recommendations with scores](#recommendations)
 * [User history](#user-history)
* [Step 4 - Optional Cleanup](#cleanup)

## Sample files <a id="sample-files"></a>

These links point to example files on GitHub.

* [training input file](https://github.com/outpace/sagemaker-examples/blob/master/implicit-bpr/training/lastfm-360k-1mm-clean.csv)
* [batch transform input file](https://github.com/outpace/sagemaker-examples/blob/master/implicit-bpr/batch_input/recommendation.requests)
* [batch transform output file](https://github.com/outpace/sagemaker-examples/blob/master/implicit-bpr/recommendation.requests.out)

## Step 1 - Prepare training data <a id="prepare-training-data"></a>
### Download lastfm 1K dataset <a id="download"></a>

In [1]:
!wget --no-clobber http://mtg.upf.edu/static/datasets/last.fm/lastfm-dataset-1K.tar.gz
!tar -xvf lastfm-dataset-1K.tar.gz
!head lastfm-dataset-1K/userid-timestamp-artid-artname-traid-traname.tsv

File 'lastfm-dataset-1K.tar.gz' already there; not retrieving.

x lastfm-dataset-1K/
x lastfm-dataset-1K/userid-profile.tsv
x lastfm-dataset-1K/README.txt
x lastfm-dataset-1K/userid-timestamp-artid-artname-traid-traname.tsv
user_000001	2009-05-04T23:08:57Z	f1b1cf71-bd35-4e99-8624-24a6e15f133a	Deep Dish		Fuck Me Im Famous (Pacha Ibiza)-09-28-2007
user_000001	2009-05-04T13:54:10Z	a7f7df4a-77d8-4f12-8acd-5c60c93f4de8	坂本龍一		Composition 0919 (Live_2009_4_15)
user_000001	2009-05-04T13:52:04Z	a7f7df4a-77d8-4f12-8acd-5c60c93f4de8	坂本龍一		Mc2 (Live_2009_4_15)
user_000001	2009-05-04T13:42:52Z	a7f7df4a-77d8-4f12-8acd-5c60c93f4de8	坂本龍一		Hibari (Live_2009_4_15)
user_000001	2009-05-04T13:42:11Z	a7f7df4a-77d8-4f12-8acd-5c60c93f4de8	坂本龍一		Mc1 (Live_2009_4_15)
user_000001	2009-05-04T13:38:31Z	a7f7df4a-77d8-4f12-8acd-5c60c93f4de8	坂本龍一		To Stanford (Live_2009_4_15)
user_000001	2009-05-04T13:33:28Z	a7f7df4a-77d8-4f12-8acd-5c60c93f4de8	坂本龍一		Improvisation (Live_2009_4_15)
user_000001	2009-05-04T13:23:45Z	a7f

### Prepare lastfm artist play training data <a id="prepare-data"></a>

Import the tab-separated lastfm file, reading only the first 2,000,000 rows, and drop any rows with null values in `user_id`, `item_id`, or `artist_name`.

In [2]:
import pandas as pd

df = pd.read_csv('lastfm-dataset-1K/userid-timestamp-artid-artname-traid-traname.tsv', 
                 sep="\t", 
                 header=None, 
                 names=["user_id", "timestamp", "item_id", "artist_name", "song_id", "song_name"],
                 nrows=2000000)
df = df.dropna(subset=['user_id', 'item_id', 'artist_name'])
print(df.shape)
df.head()

(1940062, 6)


| | user_id | timestamp | item_id | artist_name | song_id | song_name |
|---|---|---|---|---|---|---|
| 0 | user_000001 | 2009-05-04T23:08:57Z | f1b1cf71-bd35-4e99-8624-24a6e15f133a | Deep Dish | | Fuck Me Im Famous (Pacha Ibiza)-09-28-2007 |
| 1 | user_000001 | 2009-05-04T13:54:10Z | a7f7df4a-77d8-4f12-8acd-5c60c93f4de8 | 坂本龍一 | | Composition 0919 (Live_2009_4_15) |
| 2 | user_000001 | 2009-05-04T13:52:04Z | a7f7df4a-77d8-4f12-8acd-5c60c93f4de8 | 坂本龍一 | | Mc2 (Live_2009_4_15) |
| 3 | user_000001 | 2009-05-04T13:42:52Z | a7f7df4a-77d8-4f12-8acd-5c60c93f4de8 | 坂本龍一 | | Hibari (Live_2009_4_15) |
| 4 | user_000001 | 2009-05-04T13:42:11Z | a7f7df4a-77d8-4f12-8acd-5c60c93f4de8 | 坂本龍一 | | Mc1 (Live_2009_4_15) |


### Create training data file <a id="create-training-data-file"></a>

Write the `user_id` and `item_id` columns of the dataframe above to a CSV file. Omit the index but keep the header row; each row is one interaction between a user and an item. The head of the file is shown after the code.

In [4]:
data_dir = 'implicit-bpr'
train_data_dir = '{}/training'.format(data_dir)
train_data_file = '{}/lastfm-1k-clean.csv'.format(train_data_dir)

!mkdir -p {train_data_dir}
df[["user_id", "item_id"]].to_csv(train_data_file, index=False)

!head {train_data_file}

user_id,item_id
user_000001,f1b1cf71-bd35-4e99-8624-24a6e15f133a
user_000001,a7f7df4a-77d8-4f12-8acd-5c60c93f4de8
user_000001,a7f7df4a-77d8-4f12-8acd-5c60c93f4de8
user_000001,a7f7df4a-77d8-4f12-8acd-5c60c93f4de8
user_000001,a7f7df4a-77d8-4f12-8acd-5c60c93f4de8
user_000001,a7f7df4a-77d8-4f12-8acd-5c60c93f4de8
user_000001,a7f7df4a-77d8-4f12-8acd-5c60c93f4de8
user_000001,a7f7df4a-77d8-4f12-8acd-5c60c93f4de8
user_000001,a7f7df4a-77d8-4f12-8acd-5c60c93f4de8
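
Because the file keeps one row per play, the same `user_id`/`item_id` pair can repeat many times, as in the head shown above. The following is a small sketch, using only the standard library, that checks a file for the two expected headers and counts rows per pair. The helper name `count_interactions` is hypothetical, and whether the training container treats repeated rows as a weight is not stated here.

```python
import csv
import io
from collections import Counter

def count_interactions(csv_file):
    """Count rows per (user_id, item_id) pair in a training CSV with a header row."""
    reader = csv.DictReader(csv_file)
    missing = {"user_id", "item_id"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError("missing columns: {}".format(sorted(missing)))
    return Counter((row["user_id"], row["item_id"]) for row in reader)

# In-memory sample that mirrors the first rows of the training file
sample = io.StringIO(
    "user_id,item_id\n"
    "user_000001,f1b1cf71-bd35-4e99-8624-24a6e15f133a\n"
    "user_000001,a7f7df4a-77d8-4f12-8acd-5c60c93f4de8\n"
    "user_000001,a7f7df4a-77d8-4f12-8acd-5c60c93f4de8\n"
)
counts = count_interactions(sample)
for pair, n in counts.most_common():
    print(pair, n)
```

To run it against the real file, open it with `open(train_data_file, newline="")` and pass the file object to `count_interactions`.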


### Upload training data to S3 <a id="upload-training-data"></a>

Choose a bucket, optionally customize the prefix, and upload the CSV file created above.

In [5]:
%%time

import sagemaker as sage

bucket = "sagemaker-validation-us-east-2"
prefix = "implicit-bpr-test"

sess = sage.Session()

s3_train = sess.upload_data(train_data_dir, bucket, "{}/training".format(prefix))
"uploaded training data file to {}".format(s3_train)

CPU times: user 3.64 s, sys: 2.5 s, total: 6.14 s
Wall time: 2min


## Step 2 - Create a model <a id="create-model"></a>

### Run a SageMaker training job <a id="run-training-job"></a>

This code starts a training job, waits for it to finish, and reports its final status.

In [6]:
%%time

import boto3
import time
from sagemaker import get_execution_role

#role = get_execution_role()
role = "arn:aws:iam::435525115971:role/service-role/AmazonSageMaker-ExecutionRole-20181012T121978"
ecr_image = "435525115971.dkr.ecr.us-east-2.amazonaws.com/sagemaker/implicit-bpr:35"
job_name_prefix = 'implicit-bpr-test'
timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
job_name = job_name_prefix + timestamp
create_training_params = \
{
    "AlgorithmSpecification": {
        "TrainingImage": ecr_image,
        "TrainingInputMode": "File"
    },
    "RoleArn": role,
    "OutputDataConfig": {
        "S3OutputPath": 's3://{}/{}/output'.format(bucket, job_name_prefix)
    },
    "ResourceConfig": {
        "InstanceCount": 1,
        "InstanceType": "ml.p3.2xlarge",
        "VolumeSizeInGB": 50
    },
    "TrainingJobName": job_name,
    "StoppingCondition": {
        "MaxRuntimeInSeconds": 360000
    },
    "InputDataConfig": [
        {
            "ChannelName": "training",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": s3_train,
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
            "ContentType": "application/csv",
            "CompressionType": "None"
        }
    ]
}

sagemaker = boto3.client(service_name='sagemaker')
sagemaker.create_training_job(**create_training_params)
status = sagemaker.describe_training_job(TrainingJobName=job_name)['TrainingJobStatus']
print('Training job current status: {}'.format(status))

try:
    sagemaker.get_waiter('training_job_completed_or_stopped').wait(TrainingJobName=job_name)
    job_info = sagemaker.describe_training_job(TrainingJobName=job_name)
    status = job_info['TrainingJobStatus']
    print("Training job ended with status: " + status)
except Exception:
    # The waiter raises if the job fails or the wait times out
    job_info = sagemaker.describe_training_job(TrainingJobName=job_name)
    message = job_info.get('FailureReason', 'no failure reason reported')
    print('Training failed with the following error: {}'.format(message))

Training job current status: InProgress
Training job ended with status: Completed
CPU times: user 135 ms, sys: 29.5 ms, total: 164 ms
Wall time: 6min 4s


### Create a SageMaker model <a id="create-sagemaker-model"></a>

This registers the model artifacts produced by the training job as a SageMaker model, which is used later to generate recommendations.

In [7]:
timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
model_name="implicit-bpr-test" + timestamp
job_info = sagemaker.describe_training_job(TrainingJobName=job_name)
model_data = job_info['ModelArtifacts']['S3ModelArtifacts']

primary_container = {
    'Image': ecr_image,
    'ModelDataUrl': model_data,
}

create_model_response = sagemaker.create_model(
    ModelName = model_name,
    ExecutionRoleArn = role,
    PrimaryContainer = primary_container)

create_model_response

{'ModelArn': 'arn:aws:sagemaker:us-east-2:435525115971:model/implicit-bpr-test-2019-02-19-16-37-35',
 'ResponseMetadata': {'RequestId': '5ce4b923-22c0-4287-b2ab-af93b72f0c1b',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '5ce4b923-22c0-4287-b2ab-af93b72f0c1b',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '99',
   'date': 'Tue, 19 Feb 2019 16:37:35 GMT'},
  'RetryAttempts': 0}}

## Step 3 - Get recommendations (Inference) <a id="get-recommendations"></a>

### Example users <a id="example-users"></a>

Find some example users to generate artist recommendations for.

In [8]:
df.user_id.value_counts()

user_000033    90853
user_000054    89589
user_000012    74111
user_000074    72632
user_000068    72326
user_000021    69069
user_000002    56730
user_000022    48806
user_000089    48016
user_000067    46498
user_000095    44527
user_000026    44416
user_000023    41973
user_000029    41715
user_000093    39284
user_000025    37602
user_000069    37552
user_000008    36416
user_000031    35615
user_000064    32923
user_000084    31832
user_000052    31659
user_000075    30420
user_000006    26929
user_000019    26826
user_000060    23321
user_000016    21710
user_000087    20502
user_000072    19900
user_000017    19523
               ...  
user_000043     8345
user_000076     8290
user_000062     7655
user_000047     7248
user_000092     6693
user_000065     6431
user_000037     6065
user_000009     5052
user_000010     4684
user_000020     4599
user_000046     4524
user_000058     4300
user_000036     4143
user_000027     3555
user_000071     3110
user_000073     2993
user_000042  

In [11]:
df[df.user_id=='user_000088'].artist_name.value_counts()

Nada                      18
Robbie Williams           15
The Promise Ring          12
Manga                     12
Yalın                     10
Secret Machines            5
Zeki Müren                 5
Garbage                    5
[Unknown]                  5
The Killers                5
The Bravery                4
Feeder                     4
Özlem Tekin                3
William Shatner            3
Tarkan                     3
Bush                       3
Bob Dylan                  3
Placebo                    3
Nev                        3
Radiohead                  3
New Stories                3
Katy Rose                  2
Frank Sinatra              2
Tanju Okan                 2
Pilli Bebek                2
Sezen Aksu                 2
Badly Drawn Boy            2
The White Stripes          2
Foo Fighters               2
Blondie                    2
                          ..
Ferdi Tayfur               1
Mum                        1
Nilüfer                    1
Aylin Aslım   

In [12]:
df[df.user_id=='user_000061'].artist_name.value_counts()

Akon              1
Ne-Yo             1
Usher             1
*Nsync            1
Gwen Stefani      1
Npr               1
Mariah Carey      1
Kelly Clarkson    1
Jennifer Lopez    1
Name: artist_name, dtype: int64

In [29]:
example_users = df[df.user_id.isin(["user_000061",
                                    "user_000014"])]
example_users

| | user_id | timestamp | item_id | artist_name | song_id | song_name |
|---|---|---|---|---|---|---|
| 314829 | user_000014 | 2009-05-01T00:45:51Z | 61386f55-12b6-45be-baf2-7c9406965808 | The Appleseed Cast | d1d7aae2-a9d9-477a-b917-cc172eb1a4ff | Marigold And Patchwork |
| 314830 | user_000014 | 2008-02-18T19:32:39Z | d13f0f47-36f9-4661-87fe-2de56f45c649 | Tegan And Sara | b2974285-d734-4d6d-9a09-12c5ee1fc1e4 | I Know I Know I Know |
| 314831 | user_000014 | 2007-04-03T20:33:25Z | b2274e6d-10e8-4068-9dd2-eb55d4a144b5 | Pretty Girls Make Graves | 7d118308-65e4-4c8f-8143-b60605f602d2 | Speakers Push The Air |
| 314832 | user_000014 | 2007-01-10T21:49:09Z | d13f0f47-36f9-4661-87fe-2de56f45c649 | Tegan And Sara | fcf9c7e9-8d43-4ba2-8db3-94d46933e6d5 | Not With You |
| 314833 | user_000014 | 2007-01-10T21:45:59Z | fa6521a7-56b5-4e56-b946-fda469becba9 | 4 Strings | 564c6b6f-c05f-4359-8814-8033ebcaaea9 | Take Me Away (Into The Night) (Vocal Radio Mix) |
| 314834 | user_000014 | 2007-01-10T21:43:53Z | d13f0f47-36f9-4661-87fe-2de56f45c649 | Tegan And Sara | | Dont Go Looking |
| 314835 | user_000014 | 2007-01-10T21:40:03Z | ad92dd9c-56ef-4443-b2c0-ff2f7d9cca49 | The Sounds | 1ce3c67a-91bc-4a20-979a-7b5de71d31c5 | Rock 'N Roll |
| 314836 | user_000014 | 2007-01-10T21:36:12Z | 59745a87-e2d6-4892-9847-5c07b2708d6b | Morningwood | 78cede7b-7b46-4361-8665-5b56f9090bde | Nth Degree |
| 314837 | user_000014 | 2007-01-10T21:31:53Z | 340c151c-4c19-48a8-92d9-3ce8e1da6264 | Scarling. | 776490ac-44b6-4c4c-8a46-5eb8f6dded0d | Manorexic |
| 314838 | user_000014 | 2007-01-10T21:28:44Z | b2274e6d-10e8-4068-9dd2-eb55d4a144b5 | Pretty Girls Make Graves | 9899fd98-601f-4d0d-a356-7d9401698307 | Sad Girls Por Vida |


### Create batch transform input file <a id="create-batch-input"></a>

Each row is a JSON object containing two keys:

* `user_id`: the ID of the user
* `top_n`: the number of top scoring recommendations to return

The head of the batch input file is shown.

In [30]:
import json

batch_input_dir = '{}/batch_input'.format(data_dir)
batch_input_file = batch_input_dir + '/recommendation.requests'

!mkdir -p {batch_input_dir}

with open(batch_input_file, 'w') as outfile:
    json.dump({"user_id": "user_000061", "top_n": "5"}, outfile)
    outfile.write("\n")
    json.dump({"user_id": "user_000014", "top_n": "5"}, outfile)
   
!head {batch_input_file}

{"user_id": "user_000061", "top_n": "5"}
{"user_id": "user_000014", "top_n": "5"}
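
The same one-JSON-object-per-line format can be produced for any list of users. The following is a minimal sketch with a hypothetical helper, `write_batch_requests`; it keeps `top_n` as a string only to match the file shown above, since whether the container also accepts an integer is not stated here.

```python
import io
import json

def write_batch_requests(user_ids, top_n, outfile):
    """Write one JSON request per line for each user id."""
    lines = [json.dumps({"user_id": uid, "top_n": str(top_n)}) for uid in user_ids]
    # Join with newlines so the file has no trailing blank record
    outfile.write("\n".join(lines))

# Write to an in-memory buffer; pass a real file object to write to disk
buffer = io.StringIO()
write_batch_requests(["user_000061", "user_000014"], 5, buffer)
print(buffer.getvalue())
```

With a real file, `open(batch_input_file, 'w')` can be passed in place of the buffer.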

### Upload the batch transform input file to S3 <a id="upload-batch-input"></a>

In [31]:
batch_input = sess.upload_data(batch_input_dir, bucket, "{}/batch_input".format(prefix))
"uploaded batch input file to {}".format(batch_input)

'uploaded batch input file to s3://sagemaker-validation-us-east-2/implicit-bpr-test/batch_input'

### Run the Batch Transform Job <a id="run-transform"></a>

This code starts a batch transform job, waits for it to finish, and reports its status.

In [32]:
%%time

timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
batch_job_name = "implicit-bpr-test" + timestamp
batch_output = 's3://{}/{}/output'.format(bucket, batch_job_name)
request = \
{
  "TransformJobName": batch_job_name,
  "ModelName": model_name,
  "BatchStrategy": "SingleRecord",
  "TransformInput": {
    "DataSource": {
      "S3DataSource": {
        "S3DataType": "S3Prefix",
        "S3Uri": batch_input
      }
    },
    "ContentType": "application/json",
    "CompressionType": "None",
    "SplitType": "Line"
  },
  "TransformOutput": {
    "S3OutputPath": batch_output,
    "Accept": "text/csv",
    "AssembleWith": "Line"
  },
  "TransformResources": {
    "InstanceType": "ml.p3.2xlarge",
    "InstanceCount": 1
  }
}

sagemaker.create_transform_job(**request)

print("Created Transform job with name: ", batch_job_name)

while True:
    job_info = sagemaker.describe_transform_job(TransformJobName=batch_job_name)
    status = job_info['TransformJobStatus']
    if status == 'Completed':
        print("Transform job ended with status: " + status)
        break
    if status in ('Failed', 'Stopped'):
        # Stop polling on any terminal non-success status
        message = job_info.get('FailureReason', 'no failure reason reported')
        print('Transform failed with the following error: {}'.format(message))
        raise Exception('Transform job ended with status: ' + status)
    time.sleep(30)

Created Transform job with name:  implicit-bpr-test-2019-02-19-16-55-14
Transform job ended with status: Completed
CPU times: user 155 ms, sys: 20.8 ms, total: 176 ms
Wall time: 3min 34s
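
The polling loop above waits indefinitely if the job never reaches a terminal status. The following is a sketch of a reusable polling helper with a timeout. The name `wait_for_status` and its parameters are hypothetical; it takes a `describe` callable so that it can be exercised without AWS, and the stand-in responses below are made up for illustration.

```python
import time

def wait_for_status(describe, status_key,
                    terminal=("Completed", "Failed", "Stopped"),
                    poll_seconds=30, timeout_seconds=3600, sleep=time.sleep):
    """Poll describe() until status_key holds a terminal value or the timeout passes."""
    waited = 0
    while True:
        info = describe()
        if info[status_key] in terminal:
            return info
        if waited >= timeout_seconds:
            raise TimeoutError("still {} after {}s".format(info[status_key], waited))
        sleep(poll_seconds)
        waited += poll_seconds

# Stand-in for a describe call so the sketch runs without AWS
responses = iter([
    {"TransformJobStatus": "InProgress"},
    {"TransformJobStatus": "Completed"},
])
result = wait_for_status(lambda: next(responses), "TransformJobStatus",
                         sleep=lambda seconds: None)
print(result["TransformJobStatus"])
```

In the notebook, the callable could be `lambda: sagemaker.describe_transform_job(TransformJobName=batch_job_name)`, with the caller checking the returned status before downloading results.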


### Download the batch results <a id="download-batch-results"></a>

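Copy the output file from S3 and show the head of the file. Each row contains a user id, a recommended item id, and its score.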

In [33]:
!aws s3 cp {batch_output + '/recommendation.requests.out'} {data_dir}

!head {data_dir}/recommendation.requests.out

download: s3://sagemaker-validation-us-east-2/implicit-bpr-test-2019-02-19-16-55-14/output/recommendation.requests.out to implicit-bpr/recommendation.requests.out
user_000061,183105b5-3e68-4748-9086-2c1c11bf7a3d,3.366915225982666
user_000061,73e5e69d-3554-40d8-8516-00cb38737a1c,3.3082168102264404
user_000061,494e8d09-f85b-4543-892f-a5096aed1cd4,3.1607978343963623
user_000061,8d53ba6e-968c-4f72-9571-4a4f3ed4b3f0,3.152585983276367
user_000061,5508631d-697f-4839-a669-06637e5bcb90,3.109872817993164
user_000014,59745a87-e2d6-4892-9847-5c07b2708d6b,5.114985942840576
user_000014,d614b0ad-fe3a-4927-b413-48cb831a814b,4.9967041015625
user_000014,4449ccf6-c948-4d33-aa97-b6ad98ce4b5b,4.7396039962768555
user_000014,d94f79b0-c690-4a60-9a45-a37a11b78051,4.510592937469482
user_000014,299278d3-25dd-4f30-bae4-5b571c28034d,4.473502159118652


### Recommendations with scores <a id="recommendations"></a>

Import the recommendations from the batch output file downloaded above and join with artist names. These are the top 5 artist recommendations for our example users.

In [35]:
recommendations_df = pd.read_csv('{}/recommendation.requests.out'.format(data_dir), 
                                 header=None, 
                                 names=["user_id", "item_id", "score"])
artist_names = df.groupby(['item_id']).agg(lambda x: x.iloc[0])[["artist_name"]]
recommendations_df = recommendations_df.join(artist_names, on='item_id')
recommendations_df

| | user_id | item_id | score | artist_name |
|---|---|---|---|---|
| 0 | user_000061 | 183105b5-3e68-4748-9086-2c1c11bf7a3d | 3.366915 | Beyoncé |
| 1 | user_000061 | 73e5e69d-3554-40d8-8516-00cb38737a1c | 3.308217 | Rihanna |
| 2 | user_000061 | 494e8d09-f85b-4543-892f-a5096aed1cd4 | 3.160798 | Mariah Carey |
| 3 | user_000061 | 8d53ba6e-968c-4f72-9571-4a4f3ed4b3f0 | 3.152586 | Pink |
| 4 | user_000061 | 5508631d-697f-4839-a669-06637e5bcb90 | 3.109873 | Jordin Sparks |
| 5 | user_000014 | 59745a87-e2d6-4892-9847-5c07b2708d6b | 5.114986 | Morningwood |
| 6 | user_000014 | d614b0ad-fe3a-4927-b413-48cb831a814b | 4.996704 | Frou Frou |
| 7 | user_000014 | 4449ccf6-c948-4d33-aa97-b6ad98ce4b5b | 4.739604 | Metric |
| 8 | user_000014 | d94f79b0-c690-4a60-9a45-a37a11b78051 | 4.510593 | Jem |
| 9 | user_000014 | 299278d3-25dd-4f30-bae4-5b571c28034d | 4.473502 | The Postal Service |
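
Some of these recommendations are artists the user has already played: Morningwood, for example, also appears in the `user_000014` rows shown earlier. If only new artists are wanted, the results can be filtered after the fact. The following is a minimal sketch with a hypothetical helper, `drop_already_played`, working on plain tuples so it runs without pandas.

```python
def drop_already_played(recommendations, history):
    """Keep only (user_id, item_id, score) rows whose item is not in the user's history.

    history maps user_id -> set of item_ids the user has already played.
    """
    return [
        (user_id, item_id, score)
        for user_id, item_id, score in recommendations
        if item_id not in history.get(user_id, set())
    ]

recs = [
    ("user_000014", "59745a87-e2d6-4892-9847-5c07b2708d6b", 5.114986),  # Morningwood
    ("user_000014", "d614b0ad-fe3a-4927-b413-48cb831a814b", 4.996704),  # Frou Frou
]
# Illustrative history containing only the Morningwood item id
history = {"user_000014": {"59745a87-e2d6-4892-9847-5c07b2708d6b"}}

novel = drop_already_played(recs, history)
print(novel)
```

With the dataframes above, the history mapping could be built with `df.groupby("user_id")["item_id"].apply(set).to_dict()`.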


### User history <a id="user-history"></a>

Show the example users' history again for convenience.

In [26]:
df[df.user_id=='user_000061'].artist_name.value_counts()[:5]

Akon            1
Ne-Yo           1
Usher           1
*Nsync          1
Gwen Stefani    1
Name: artist_name, dtype: int64

In [28]:
df[df.user_id=='user_000014'].artist_name.value_counts()[:5]

Tegan And Sara           36
The Organ                15
Ghostland Observatory    14
Metric                   13
Fischerspooner           12
Name: artist_name, dtype: int64

## Step 4 - Optional Cleanup <a id="cleanup"></a>

In [None]:
def cleanup():
    !rm lastfm-dataset-1K.tar.gz 2> /dev/null
    !rm -fr implicit-bpr/ 2> /dev/null
    !rm -fr lastfm-dataset-1K/ 2> /dev/null
    sagemaker.delete_model(ModelName = model_name)
    
# optionally uncomment and run the code to clean everything up  

#cleanup()