# Implicit Bayesian Personalized Ranking

A recommender model that learns matrix factorization embeddings for users and items by minimizing a pairwise ranking loss.

[Implicit BPR Documentation](https://implicit.readthedocs.io/en/latest/bpr.html)
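
To make "pairwise ranking loss" concrete, here is a minimal sketch of the standard BPR loss for a single (user, played item, unplayed item) triple. It is an illustration only, not the code used by the Implicit library or the SageMaker algorithm; the function name `bpr_loss` is hypothetical, and the regularization term that a real trainer would add is left out.

```python
import math


def bpr_loss(user_vec, pos_item_vec, neg_item_vec):
    """Loss for one triple: -log(sigmoid(score_pos - score_neg)).

    Scores are dot products of the user embedding with each item embedding.
    The loss is small when the played item outscores the unplayed one.
    """
    score_pos = sum(u * i for u, i in zip(user_vec, pos_item_vec))
    score_neg = sum(u * j for u, j in zip(user_vec, neg_item_vec))
    diff = score_pos - score_neg
    # Numerically stable form of log(1 + exp(-diff)).
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


# The played item ranks above the unplayed one, so the loss is low ...
good = bpr_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
# ... and swapping the two items makes it higher.
bad = bpr_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
print(good < bad)
```

Training amounts to adjusting the user and item vectors so that this loss, summed over many sampled triples, goes down.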

## Table of contents

* [Sample files](#sample-files)
* [Step 1 - Prepare training data](#prepare-training-data)
  * [Download lastfm 1K dataset](#download)
  * [Prepare lastfm artist play data](#prepare-data)
  * [Create training data file](#create-training-data-file)
  * [Upload training data file](#upload-training-data)
* [Step 2 - Create a model](#create-model)
  * [Run a SageMaker training job](#run-training-job)
  * [Create a SageMaker model](#create-sagemaker-model)
* [Step 3 - Get recommendations (inference)](#get-recommendations)
  * [Example users](#example-users)
  * [Create batch transform input file](#create-batch-input)
  * [Upload the batch transform input file to s3](#upload-batch-input)
  * [Run the Batch Transform Job](#run-transform)
  * [Download the batch results](#download-batch-results)
  * [Recommendations with scores](#recommendations)
  * [User history](#user-history)
* [Step 4 - Optional Cleanup](#cleanup)

## Sample files <a id="sample-files"></a>

The following links point to example files on GitHub.

* [training input file](https://github.com/outpace/sagemaker-examples/blob/master/implicit-bpr/training/lastfm-1K-2mm-clean.csv)
* [batch transform input file](https://github.com/outpace/sagemaker-examples/blob/master/implicit-bpr/batch_input/recommendation.requests)
* [batch transform output file](https://github.com/outpace/sagemaker-examples/blob/master/implicit-bpr/recommendation.requests.out)

## Step 1 - Prepare training data <a id="prepare-training-data"></a>
### Download lastfm 1K dataset <a id="download"></a>

In [1]:
!mkdir -p ../data
!wget -O ../data/lastfm-dataset-1K.tar.gz --no-clobber http://mtg.upf.edu/static/datasets/last.fm/lastfm-dataset-1K.tar.gz
!tar -C  ../data/ -xvf ../data/lastfm-dataset-1K.tar.gz
!head ../data/lastfm-dataset-1K/userid-timestamp-artid-artname-traid-traname.tsv

File `../data/lastfm-dataset-1K.tar.gz' already there; not retrieving.
x lastfm-dataset-1K/
x lastfm-dataset-1K/userid-profile.tsv
x lastfm-dataset-1K/README.txt
x lastfm-dataset-1K/userid-timestamp-artid-artname-traid-traname.tsv
user_000001	2009-05-04T23:08:57Z	f1b1cf71-bd35-4e99-8624-24a6e15f133a	Deep Dish		Fuck Me Im Famous (Pacha Ibiza)-09-28-2007
user_000001	2009-05-04T13:54:10Z	a7f7df4a-77d8-4f12-8acd-5c60c93f4de8	坂本龍一		Composition 0919 (Live_2009_4_15)
user_000001	2009-05-04T13:52:04Z	a7f7df4a-77d8-4f12-8acd-5c60c93f4de8	坂本龍一		Mc2 (Live_2009_4_15)
user_000001	2009-05-04T13:42:52Z	a7f7df4a-77d8-4f12-8acd-5c60c93f4de8	坂本龍一		Hibari (Live_2009_4_15)
user_000001	2009-05-04T13:42:11Z	a7f7df4a-77d8-4f12-8acd-5c60c93f4de8	坂本龍一		Mc1 (Live_2009_4_15)
user_000001	2009-05-04T13:38:31Z	a7f7df4a-77d8-4f12-8acd-5c60c93f4de8	坂本龍一		To Stanford (Live_2009_4_15)
user_000001	2009-05-04T13:33:28Z	a7f7df4a-77d8-4f12-8acd-5c60c93f4de8	坂本龍一		Improvisation (Live_2009_4_15)
user_000001	2009-05-04T13:23:

### Prepare lastfm artist play training data <a id="prepare-data"></a>

Import the tab-separated lastfm file. Read only the first 2 million rows to save memory and processing time, then drop any rows with null values in `user_id`, `item_id`, or `artist_name`.

In [2]:
import pandas as pd

df = pd.read_csv('../data/lastfm-dataset-1K/userid-timestamp-artid-artname-traid-traname.tsv', 
                 sep="\t", 
                 header=None, 
                 names=["user_id", "timestamp", "item_id", "artist_name", "song_id", "song_name"],
                 nrows=2000000)
df = df.dropna(subset=['user_id', 'item_id', 'artist_name'])
print(df.shape)
df.head()

(1940062, 6)


Unnamed: 0,user_id,timestamp,item_id,artist_name,song_id,song_name
0,user_000001,2009-05-04T23:08:57Z,f1b1cf71-bd35-4e99-8624-24a6e15f133a,Deep Dish,,Fuck Me Im Famous (Pacha Ibiza)-09-28-2007
1,user_000001,2009-05-04T13:54:10Z,a7f7df4a-77d8-4f12-8acd-5c60c93f4de8,坂本龍一,,Composition 0919 (Live_2009_4_15)
2,user_000001,2009-05-04T13:52:04Z,a7f7df4a-77d8-4f12-8acd-5c60c93f4de8,坂本龍一,,Mc2 (Live_2009_4_15)
3,user_000001,2009-05-04T13:42:52Z,a7f7df4a-77d8-4f12-8acd-5c60c93f4de8,坂本龍一,,Hibari (Live_2009_4_15)
4,user_000001,2009-05-04T13:42:11Z,a7f7df4a-77d8-4f12-8acd-5c60c93f4de8,坂本龍一,,Mc1 (Live_2009_4_15)


### Create training data file <a id="create-training-data-file"></a>

Create a CSV file from the dataframe above. Leave out the index and keep only the `user_id` and `item_id` columns, with a header row, so that each row is one interaction between a user and an item. Show the head of the file.

In [3]:
train_data_dir = 'training'
train_data_file = '{}/lastfm-1K-2mm-clean.csv'.format(train_data_dir)

!mkdir -p {train_data_dir}
df[["user_id", "item_id"]].to_csv(train_data_file, index=False)

!head {train_data_file}

user_id,item_id
user_000001,f1b1cf71-bd35-4e99-8624-24a6e15f133a
user_000001,a7f7df4a-77d8-4f12-8acd-5c60c93f4de8
user_000001,a7f7df4a-77d8-4f12-8acd-5c60c93f4de8
user_000001,a7f7df4a-77d8-4f12-8acd-5c60c93f4de8
user_000001,a7f7df4a-77d8-4f12-8acd-5c60c93f4de8
user_000001,a7f7df4a-77d8-4f12-8acd-5c60c93f4de8
user_000001,a7f7df4a-77d8-4f12-8acd-5c60c93f4de8
user_000001,a7f7df4a-77d8-4f12-8acd-5c60c93f4de8
user_000001,a7f7df4a-77d8-4f12-8acd-5c60c93f4de8
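
Before uploading, it can be worth checking that the file has the expected shape. The sketch below is a hypothetical helper (`summarize_training_csv` is not part of the notebook or of any SageMaker API) that verifies the `user_id,item_id` header and counts rows, distinct user-item pairs, users, and items using only the standard library.

```python
import csv
import io
from collections import Counter


def summarize_training_csv(lines):
    """Validate the header and summarize a user_id,item_id interaction file."""
    reader = csv.DictReader(lines)
    if reader.fieldnames != ["user_id", "item_id"]:
        raise ValueError("unexpected header: {}".format(reader.fieldnames))
    # Count how often each (user, item) pair occurs; repeats are repeat plays.
    pair_counts = Counter((row["user_id"], row["item_id"]) for row in reader)
    return {
        "rows": sum(pair_counts.values()),
        "distinct_pairs": len(pair_counts),
        "users": len({user for user, _ in pair_counts}),
        "items": len({item for _, item in pair_counts}),
    }


# A three-row sample taken from the head of the file shown above.
sample = io.StringIO(
    "user_id,item_id\n"
    "user_000001,f1b1cf71-bd35-4e99-8624-24a6e15f133a\n"
    "user_000001,a7f7df4a-77d8-4f12-8acd-5c60c93f4de8\n"
    "user_000001,a7f7df4a-77d8-4f12-8acd-5c60c93f4de8\n"
)
summary = summarize_training_csv(sample)
print(summary)
```

To run it on the real file, pass an open file handle instead of the sample, for example `open(train_data_file, newline="")`.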


### Upload training data to s3 <a id="upload-training-data"></a>

Choose an S3 bucket in the region where the training job will run, optionally customize the key prefix, and upload the CSV file created above.

In [4]:
import sagemaker

bucket = "sagemaker-validation-us-west-2"
prefix = "implicit-bpr-test"

sess = sagemaker.Session()

s3_train = sess.upload_data(train_data_dir, bucket, "{}/training".format(prefix))
"uploaded training data file to {}".format(s3_train)

'uploaded training data file to s3://sagemaker-validation-us-west-2/implicit-bpr-test/training'

## Step 2 - Create a model <a id="create-model"></a>

### Run a SageMaker training job <a id="run-training-job"></a>

Supply a SageMaker execution role ARN and the ARN of the algorithm from your subscription in the matching region. This code starts a training job, waits for it to finish, and reports its progress.
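
A region mismatch between the algorithm ARN and the rest of the setup is easy to miss, so a small pre-flight check can help before running the cell below. This is a sketch under the assumption that ARNs follow the usual `arn:partition:service:region:account:resource` layout; the helper names `arn_region` and `check_algorithm_region` are hypothetical.

```python
def arn_region(arn):
    """Return the region field of an ARN (empty for global services such as IAM)."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError("not an ARN: {!r}".format(arn))
    return parts[3]


def check_algorithm_region(algo_arn, expected_region):
    """Raise if the algorithm ARN points at a different region than expected."""
    region = arn_region(algo_arn)
    if region != expected_region:
        raise ValueError(
            "algorithm is in {!r}, expected {!r}".format(region, expected_region)
        )
    return region


algo = ("arn:aws:sagemaker:us-west-2:594846645681:algorithm/"
        "implicit-bpr-2-085a8f2de7e8d057d9c758785eb4e51d")
print(check_algorithm_region(algo, "us-west-2"))
```

The expected region would be whichever region your bucket and SageMaker session use.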

In [5]:
%%time

import time

role_arn = "arn:aws:iam::435525115971:role/service-role/AmazonSageMaker-ExecutionRole-20181012T121978"
algo_arn = "arn:aws:sagemaker:us-west-2:594846645681:algorithm/implicit-bpr-2-085a8f2de7e8d057d9c758785eb4e51d"
job_name_prefix = 'implicit-bpr-test'
timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
job_name = job_name_prefix + timestamp

estimator = sagemaker.AlgorithmEstimator(
    algorithm_arn = algo_arn,
    role=role_arn,
    train_instance_count=1,
    train_instance_type="ml.c5.2xlarge",
    input_mode='File',
    output_path='s3://{}/{}/output'.format(bucket, job_name_prefix),
    base_job_name=job_name_prefix
)

# Only a training channel is passed to the job; it points at the prefix uploaded above.
inputs = {"training": s3_train}

estimator.fit(inputs)

INFO:sagemaker:Creating training-job with name: implicit-bpr-test-2019-02-19-23-31-52-168


2019-02-19 23:31:52 Starting - Starting the training job...
2019-02-19 23:31:53 Starting - Launching requested ML instances......
2019-02-19 23:33:11 Starting - Preparing the instances for training...
2019-02-19 23:33:56 Downloading - Downloading input data...
2019-02-19 23:34:18 Training - Downloading the training image...
2019-02-19 23:34:46 Training - Training image download completed. Training in progress.
CUDA is available: False
Beginning training
of pandas will change to not sort by default.
To accept the future behavior, pass 'sort=False'.

  data = pd.concat(raw_data)
/opt/ml/input/data/training data shape: (4867285, 3)
trained_df_joined data shape: (4867285, 5)
Found users_max: 20558
Found items_max: 76452
pickling 20559 users
pickling 76453 items
pickling (20559, 76453) user items
Effective hyperparameters: {'use_gpu': False, 'learning_rate': 0.01, 'iterations': 100,

### Create a SageMaker model <a id="create-sagemaker-model"></a>

This creates a model package and a model in SageMaker from the artifacts produced during training. The model is used later to generate recommendations.

In [6]:
model_name = estimator.latest_training_job.name
sess.create_model_package_from_algorithm(model_name, 'test', algo_arn, estimator.model_data)
sess.wait_for_model_package(model_name, poll=5)
sess.create_model(model_name, role_arn, [{'ModelPackageName': model_name}], enable_network_isolation=True)

INFO:sagemaker:Creating model package with name: implicit-bpr-test-2019-02-19-23-31-52-168


..........

INFO:sagemaker:Creating model with name: implicit-bpr-test-2019-02-19-23-31-52-168





'implicit-bpr-test-2019-02-19-23-31-52-168'

## Step 3 - Get recommendations (Inference) <a id="get-recommendations"></a>

### Example users <a id="example-users"></a>

Pick a few example users for whom to recommend artists.

In [7]:
example_users = df[df.user_id.isin(["user_000061",
                                    "user_000014"])]
example_users

Unnamed: 0,user_id,timestamp,item_id,artist_name,song_id,song_name
314829,user_000014,2009-05-01T00:45:51Z,61386f55-12b6-45be-baf2-7c9406965808,The Appleseed Cast,d1d7aae2-a9d9-477a-b917-cc172eb1a4ff,Marigold And Patchwork
314830,user_000014,2008-02-18T19:32:39Z,d13f0f47-36f9-4661-87fe-2de56f45c649,Tegan And Sara,b2974285-d734-4d6d-9a09-12c5ee1fc1e4,I Know I Know I Know
314831,user_000014,2007-04-03T20:33:25Z,b2274e6d-10e8-4068-9dd2-eb55d4a144b5,Pretty Girls Make Graves,7d118308-65e4-4c8f-8143-b60605f602d2,Speakers Push The Air
314832,user_000014,2007-01-10T21:49:09Z,d13f0f47-36f9-4661-87fe-2de56f45c649,Tegan And Sara,fcf9c7e9-8d43-4ba2-8db3-94d46933e6d5,Not With You
314833,user_000014,2007-01-10T21:45:59Z,fa6521a7-56b5-4e56-b946-fda469becba9,4 Strings,564c6b6f-c05f-4359-8814-8033ebcaaea9,Take Me Away (Into The Night) (Vocal Radio Mix)
314834,user_000014,2007-01-10T21:43:53Z,d13f0f47-36f9-4661-87fe-2de56f45c649,Tegan And Sara,,Dont Go Looking
314835,user_000014,2007-01-10T21:40:03Z,ad92dd9c-56ef-4443-b2c0-ff2f7d9cca49,The Sounds,1ce3c67a-91bc-4a20-979a-7b5de71d31c5,Rock 'N Roll
314836,user_000014,2007-01-10T21:36:12Z,59745a87-e2d6-4892-9847-5c07b2708d6b,Morningwood,78cede7b-7b46-4361-8665-5b56f9090bde,Nth Degree
314837,user_000014,2007-01-10T21:31:53Z,340c151c-4c19-48a8-92d9-3ce8e1da6264,Scarling.,776490ac-44b6-4c4c-8a46-5eb8f6dded0d,Manorexic
314838,user_000014,2007-01-10T21:28:44Z,b2274e6d-10e8-4068-9dd2-eb55d4a144b5,Pretty Girls Make Graves,9899fd98-601f-4d0d-a356-7d9401698307,Sad Girls Por Vida


### Create batch transform input file <a id="create-batch-input"></a>

Each row is a JSON object containing two keys:

* `user_id`: the ID of the user
* `top_n`: the number of top-scoring recommendations to return

The head of the batch input file is shown.

In [8]:
import json

batch_input_dir = 'batch_input'
batch_input_file = batch_input_dir + '/recommendation.requests'

!mkdir -p {batch_input_dir}

with open(batch_input_file, 'w') as outfile:
    json.dump({"user_id": "user_000061", "top_n": "5"}, outfile)
    outfile.write("\n")
    json.dump({"user_id": "user_000014", "top_n": "5"}, outfile)
   
!head {batch_input_file}

{"user_id": "user_000061", "top_n": "5"}
{"user_id": "user_000014", "top_n": "5"}
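
The same format extends to any number of users. The sketch below builds the JSON Lines payload from a list of user IDs; `build_requests` is a hypothetical helper, and `top_n` is written as a string only to mirror the sample file above (whether the algorithm also accepts integers is not confirmed here).

```python
import json


def build_requests(user_ids, top_n=5):
    """Return one JSON object per line, one line per user."""
    return "\n".join(
        # top_n is serialized as a string to match the sample input file.
        json.dumps({"user_id": user_id, "top_n": str(top_n)})
        for user_id in user_ids
    )


payload = build_requests(["user_000061", "user_000014"])
print(payload)
```

Writing `payload` to `batch_input_file` would reproduce the two-line file shown above.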

### Upload the batch transform input file to s3 <a id="upload-batch-input"></a>

In [9]:
batch_input = sess.upload_data(batch_input_dir, bucket, "{}/batch_input".format(prefix))
"uploaded batch input file to {}".format(batch_input)

'uploaded batch input file to s3://sagemaker-validation-us-west-2/implicit-bpr-test/batch_input'

### Run the Batch Transform Job <a id="run-transform"></a>

This code starts a batch transform job, polls every 30 seconds until it finishes, and reports its final status.

In [10]:
%%time

import boto3

sage = boto3.client(service_name='sagemaker')

timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
batch_job_name = "implicit-bpr-test" + timestamp
batch_output = 's3://{}/{}/output'.format(bucket, batch_job_name)
request = \
{
  "TransformJobName": batch_job_name,
  "ModelName": model_name,
  "BatchStrategy": "SingleRecord",
  "TransformInput": {
    "DataSource": {
      "S3DataSource": {
        "S3DataType": "S3Prefix",
        "S3Uri": batch_input
      }
    },
    "ContentType": "application/json",
    "CompressionType": "None",
    "SplitType": "Line"
  },
  "TransformOutput": {
    "S3OutputPath": batch_output,
    "Accept": "text/csv",
    "AssembleWith": "Line"
  },
  "TransformResources": {
    "InstanceType": "ml.c5.2xlarge",
    "InstanceCount": 1
  }
}

sage.create_transform_job(**request)

print("Created Transform job with name: ", batch_job_name)

while True:
    job_info = sage.describe_transform_job(TransformJobName=batch_job_name)
    status = job_info['TransformJobStatus']
    if status == 'Completed':
        print("Transform job ended with status: " + status)
        break
    # Treat a stopped job like a failure so the loop cannot spin forever.
    if status in ('Failed', 'Stopped'):
        message = job_info.get('FailureReason', 'no failure reason reported')
        print('Transform ended with status {}: {}'.format(status, message))
        raise Exception('Transform job ended with status ' + status)
    time.sleep(30)

Created Transform job with name:  implicit-bpr-test-2019-02-19-23-36-43
Transform job ended with status: Completed
CPU times: user 196 ms, sys: 27.6 ms, total: 223 ms
Wall time: 3min 35s
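
The polling loop above waits indefinitely if the job never reaches a terminal state. One possible variation, sketched here with hypothetical names (`wait_for_status` and its parameters are not part of boto3 or the SageMaker SDK), adds a timeout and takes the status lookup as a callable so it can be exercised without AWS.

```python
import time


def wait_for_status(get_status, done=("Completed",), failed=("Failed", "Stopped"),
                    poll_seconds=30, timeout_seconds=3600, sleep=time.sleep):
    """Poll get_status() until it returns a terminal status or the timeout passes."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status in done:
            return status
        if status in failed:
            raise RuntimeError("job ended with status {}".format(status))
        if time.monotonic() >= deadline:
            raise TimeoutError("job still {} after {}s".format(status, timeout_seconds))
        sleep(poll_seconds)


# Simulated job that needs two polls before completing; sleep is stubbed out.
statuses = iter(["InProgress", "InProgress", "Completed"])
final = wait_for_status(lambda: next(statuses), sleep=lambda seconds: None)
print(final)
```

With the client from the cell above, the callable could be `lambda: sage.describe_transform_job(TransformJobName=batch_job_name)['TransformJobStatus']`.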


### Download the batch results <a id="download-batch-results"></a>

Copy the output file from S3 and show its head.

In [11]:
!aws s3 cp {batch_output + '/recommendation.requests.out'} .

!head recommendation.requests.out

download: s3://sagemaker-validation-us-west-2/implicit-bpr-test-2019-02-19-23-36-43/output/recommendation.requests.out to ./recommendation.requests.out
user_000061,45a663b5-b1cb-4a91-bff6-2bef7bbfdd76,3.0888047218322754
user_000061,494e8d09-f85b-4543-892f-a5096aed1cd4,2.982588529586792
user_000061,1fda852b-92e9-4562-82fa-c52820a77b23,2.970466136932373
user_000061,a796b92e-c137-4895-9c89-10f900617a4f,2.9396846294403076
user_000061,4f9675d2-f6d5-486c-9b26-33dcca998500,2.9361276626586914
user_000014,59745a87-e2d6-4892-9847-5c07b2708d6b,4.627895355224609
user_000014,4449ccf6-c948-4d33-aa97-b6ad98ce4b5b,4.552852630615234
user_000014,d614b0ad-fe3a-4927-b413-48cb831a814b,4.408276557922363
user_000014,681bf706-8664-4658-ab1a-39e1d385fae2,4.340916156768799
user_000014,766a2b45-441f-4096-af05-dbbca9518c9d,4.335169315338135
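
Each output row is `user_id,item_id,score` with no header. As a dependency-free alternative to loading the file with pandas, the sketch below groups the rows by user and sorts each user's items by descending score; `parse_recommendations` is a hypothetical helper name.

```python
import csv
from collections import defaultdict


def parse_recommendations(lines):
    """Map each user_id to a list of (item_id, score), highest score first."""
    by_user = defaultdict(list)
    for row in csv.reader(lines):
        if not row:  # skip blank lines
            continue
        user_id, item_id, score = row
        by_user[user_id].append((item_id, float(score)))
    for recs in by_user.values():
        recs.sort(key=lambda rec: rec[1], reverse=True)
    return dict(by_user)


# Three rows copied from the output shown above.
sample = """user_000061,45a663b5-b1cb-4a91-bff6-2bef7bbfdd76,3.0888047218322754
user_000061,494e8d09-f85b-4543-892f-a5096aed1cd4,2.982588529586792
user_000014,59745a87-e2d6-4892-9847-5c07b2708d6b,4.627895355224609
""".splitlines()
recs = parse_recommendations(sample)
print(recs["user_000061"][0])
```

For the downloaded file, pass `open('recommendation.requests.out', newline='')` instead of the sample lines.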


### Recommendations with scores <a id="recommendations"></a>

Import the recommendations from the batch output file downloaded above and join with artist names. These are the top 5 artist recommendations for our example users.

In [12]:
recommendations_df = pd.read_csv('recommendation.requests.out', 
                                 header=None, 
                                 names=["user_id", "item_id", "score"])
# Take the first artist name seen for each item_id (null names were dropped earlier).
artist_names = df.groupby('item_id')[['artist_name']].first()
recommendations_df = recommendations_df.join(artist_names, on='item_id')
recommendations_df

Unnamed: 0,user_id,item_id,score,artist_name
0,user_000061,45a663b5-b1cb-4a91-bff6-2bef7bbfdd76,3.088805,Britney Spears
1,user_000061,494e8d09-f85b-4543-892f-a5096aed1cd4,2.982589,Mariah Carey
2,user_000061,1fda852b-92e9-4562-82fa-c52820a77b23,2.970466,The Pussycat Dolls
3,user_000061,a796b92e-c137-4895-9c89-10f900617a4f,2.939685,Destiny'S Child
4,user_000061,4f9675d2-f6d5-486c-9b26-33dcca998500,2.936128,Fergie
5,user_000014,59745a87-e2d6-4892-9847-5c07b2708d6b,4.627895,Morningwood
6,user_000014,4449ccf6-c948-4d33-aa97-b6ad98ce4b5b,4.552853,Metric
7,user_000014,d614b0ad-fe3a-4927-b413-48cb831a814b,4.408277,Frou Frou
8,user_000014,681bf706-8664-4658-ab1a-39e1d385fae2,4.340916,Esthero
9,user_000014,766a2b45-441f-4096-af05-dbbca9518c9d,4.335169,Phantom Planet


### User history <a id="user-history"></a>

Show the five most-played artists for each of the example users, for reference.

In [13]:
df[df.user_id=='user_000061'].artist_name.value_counts()[:5]

Gwen Stefani      1
Usher             1
Jennifer Lopez    1
Kelly Clarkson    1
Npr               1
Name: artist_name, dtype: int64

In [14]:
df[df.user_id=='user_000014'].artist_name.value_counts()[:5]

Tegan And Sara           36
The Organ                15
Ghostland Observatory    14
Metric                   13
Fischerspooner           12
Name: artist_name, dtype: int64
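
Comparing the two lists by hand shows that some recommended artists, such as Morningwood and Metric for `user_000014`, already appear in that user's history. The sketch below makes that comparison explicit. `split_by_history` is a hypothetical helper, and the history list contains only the artists visible in this notebook's output, so the "new" list may include artists the user has in fact played.

```python
def split_by_history(recommended, history):
    """Split recommendations into already-played and not-yet-played, keeping order."""
    played = set(history)
    already = [name for name in recommended if name in played]
    new = [name for name in recommended if name not in played]
    return already, new


# Artists for user_000014 as shown in the tables above (a partial history only).
recommended = ["Morningwood", "Metric", "Frou Frou", "Esthero", "Phantom Planet"]
history = ["Tegan And Sara", "The Organ", "Ghostland Observatory", "Metric",
           "Fischerspooner", "Morningwood", "The Sounds", "Scarling."]
already, new = split_by_history(recommended, history)
print(already)
print(new)
```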

## Step 4 - Optional Cleanup <a id="cleanup"></a>

In [15]:
def cleanup():
    !rm ../data/lastfm-dataset-1K.tar.gz 2> /dev/null
    !rm -fr ../data/lastfm-dataset-1K/ 2> /dev/null
    sess.delete_model(model_name)
    
# Optionally uncomment the call below to remove the downloaded dataset and delete the SageMaker model.

#cleanup()