# Implicit Bayesian Personalized Ranking

A recommender model that learns matrix factorization embeddings by minimizing a pairwise ranking loss.

[Implicit BPR Documentation](https://implicit.readthedocs.io/en/latest/bpr.html)
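As a rough illustration of what "pairwise ranking loss" means, the sketch below computes the standard BPR objective for a single (user, played artist, unplayed artist) triple: the negative log-sigmoid of the score difference plus an L2 penalty. The names `dot` and `bpr_loss` and the toy vectors are assumptions made for illustration; the implicit library's optimized training loop is not reproduced here.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def bpr_loss(user, pos_item, neg_item, regularization=0.01):
    """BPR loss for one (user, observed item, unobserved item) triple.

    Illustrative helper, not part of the implicit library's API.
    """
    # Difference between the predicted scores of the observed and unobserved items
    x_uij = dot(user, pos_item) - dot(user, neg_item)
    # -log(sigmoid(x)) rewritten as log(1 + exp(-x))
    ranking_term = math.log1p(math.exp(-x_uij))
    # L2 penalty on every embedding involved in the triple
    l2_term = regularization * (
        dot(user, user) + dot(pos_item, pos_item) + dot(neg_item, neg_item)
    )
    return ranking_term + l2_term


# Toy 2-factor embeddings (made-up values)
user = [1.0, 0.5]
played = [0.9, 0.4]
unplayed = [-0.2, 0.1]

print(bpr_loss(user, played, unplayed))  # observed artist scored higher: smaller loss
print(bpr_loss(user, unplayed, played))  # ranking reversed: larger loss
```

Training lowers this loss by pushing each user's score for artists they played above their score for artists they did not play.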

## Table of contents

* [Sample files](#sample-files)
* [Step 1 - Prepare training data](#prepare-training-data)
  * [Download lastfm 360k dataset](#download)
  * [Prepare lastfm artist play training data](#prepare-data)
  * [Create training data file](#create-training-data-file)
  * [Upload training data to S3](#upload-training-data)
* [Step 2 - Create a model](#create-model)
  * [Run a SageMaker training job](#run-training-job)
  * [Create a SageMaker model](#create-sagemaker-model)
* [Step 3 - Get recommendations (inference)](#get-recommendations)
  * [Example users](#example-users)
  * [Create batch transform input file](#create-batch-input)
  * [Upload the batch transform input file to S3](#upload-batch-input)
  * [Run the Batch Transform Job](#run-transform)
  * [Download the batch results](#download-batch-results)
  * [Recommendations with scores](#recommendations)
  * [User history](#user-history)
* [Step 4 - Optional Clean up](#cleanup)

## Sample files <a id="sample-files"></a>

These links point to example files on GitHub.

* [training input file](https://github.com/outpace/sagemaker-examples/blob/master/implicit-bpr/training/lastfm-360k-1mm-clean.csv)
* [batch transform input file](https://github.com/outpace/sagemaker-examples/blob/master/implicit-bpr/batch_input/recommendation.requests)
* [batch transform output file](https://github.com/outpace/sagemaker-examples/blob/master/implicit-bpr/recommendation.requests.out)

## Step 1 - Prepare training data <a id="prepare-training-data"></a>
### Download lastfm 360k dataset <a id="download"></a>

In [8]:
!wget --no-clobber http://mtg.upf.edu/static/datasets/last.fm/lastfm-dataset-360K.tar.gz
!tar -xvf lastfm-dataset-360K.tar.gz
!head lastfm-dataset-360K/usersha1-artmbid-artname-plays.tsv

File 'lastfm-dataset-360K.tar.gz' already there; not retrieving.

x lastfm-dataset-360K/
x lastfm-dataset-360K/usersha1-artmbid-artname-plays.tsv
x lastfm-dataset-360K/README.txt
x lastfm-dataset-360K/mbox_sha1sum.py
x lastfm-dataset-360K/usersha1-profile.tsv
00000c289a1829a808ac09c00daf10bc3c4e223b	3bd73256-3905-4f3a-97e2-8b341527f805	betty blowtorch	2137
00000c289a1829a808ac09c00daf10bc3c4e223b	f2fb0ff0-5679-42ec-a55c-15109ce6e320	die Ärzte	1099
00000c289a1829a808ac09c00daf10bc3c4e223b	b3ae82c2-e60b-4551-a76d-6620f1b456aa	melissa etheridge	897
00000c289a1829a808ac09c00daf10bc3c4e223b	3d6bbeb7-f90e-4d10-b440-e153c0d10b53	elvenking	717
00000c289a1829a808ac09c00daf10bc3c4e223b	bbd2ffd7-17f4-4506-8572-c1ea58c3f9a8	juliette & the licks	706
00000c289a1829a808ac09c00daf10bc3c4e223b	8bfac288-ccc5-448d-9573-c33ea2aa5c30	red hot chili peppers	691
00000c289a1829a808ac09c00daf10bc3c4e223b	6531c8b1-76ea-4141-b270-eb1ac5b41375	magica	545
00000c289a1829a808ac09c00daf10bc3c4e223b	21f3573f-10cf-44b3-

### Prepare lastfm artist play training data <a id="prepare-data"></a>

Import the tab-separated lastfm file. Read only the first 1 million rows to save memory and processing time, then drop any rows with null values in `item_id`, `user_id`, `total_interactions`, or `artist_name`.

In [9]:
import pandas as pd

df = pd.read_csv('lastfm-dataset-360K/usersha1-artmbid-artname-plays.tsv', 
                 sep="\t", 
                 header=None, 
                 names=["user_id", "item_id", "artist_name", "total_interactions"], 
                 nrows=1000000)
df = df.dropna(subset=['item_id', 'user_id', 'total_interactions', 'artist_name'])
print(df.shape)
df.head()

(987161, 4)


| | user_id | item_id | artist_name | total_interactions |
|---|---|---|---|---|
| 0 | 00000c289a1829a808ac09c00daf10bc3c4e223b | 3bd73256-3905-4f3a-97e2-8b341527f805 | betty blowtorch | 2137 |
| 1 | 00000c289a1829a808ac09c00daf10bc3c4e223b | f2fb0ff0-5679-42ec-a55c-15109ce6e320 | die Ärzte | 1099 |
| 2 | 00000c289a1829a808ac09c00daf10bc3c4e223b | b3ae82c2-e60b-4551-a76d-6620f1b456aa | melissa etheridge | 897 |
| 3 | 00000c289a1829a808ac09c00daf10bc3c4e223b | 3d6bbeb7-f90e-4d10-b440-e153c0d10b53 | elvenking | 717 |
| 4 | 00000c289a1829a808ac09c00daf10bc3c4e223b | bbd2ffd7-17f4-4506-8572-c1ea58c3f9a8 | juliette & the licks | 706 |


### Create training data file <a id="create-training-data-file"></a>

Create a CSV file from the dataframe above. Do not include the index; write only the `user_id` and `item_id` columns, with their headers. Show the head of the file.

In [49]:
data_dir = 'implicit-bpr'
train_data_dir = '{}/training'.format(data_dir)
train_data_file = '{}/lastfm-360k-1mm-clean.csv'.format(train_data_dir)

!mkdir -p {train_data_dir}
df[["user_id", "item_id"]].to_csv(train_data_file, index=False)

!head {train_data_file}

user_id,item_id
00000c289a1829a808ac09c00daf10bc3c4e223b,3bd73256-3905-4f3a-97e2-8b341527f805
00000c289a1829a808ac09c00daf10bc3c4e223b,f2fb0ff0-5679-42ec-a55c-15109ce6e320
00000c289a1829a808ac09c00daf10bc3c4e223b,b3ae82c2-e60b-4551-a76d-6620f1b456aa
00000c289a1829a808ac09c00daf10bc3c4e223b,3d6bbeb7-f90e-4d10-b440-e153c0d10b53
00000c289a1829a808ac09c00daf10bc3c4e223b,bbd2ffd7-17f4-4506-8572-c1ea58c3f9a8
00000c289a1829a808ac09c00daf10bc3c4e223b,8bfac288-ccc5-448d-9573-c33ea2aa5c30
00000c289a1829a808ac09c00daf10bc3c4e223b,6531c8b1-76ea-4141-b270-eb1ac5b41375
00000c289a1829a808ac09c00daf10bc3c4e223b,21f3573f-10cf-44b3-aeaa-26cccd8448b5
00000c289a1829a808ac09c00daf10bc3c4e223b,c5db90c4-580d-4f33-b364-fbaa5a3a58b5
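
Before uploading, it can be worth checking that the file has the expected header and no blank IDs. The following is a minimal sketch using only the standard library. `validate_training_csv` is a hypothetical helper, and the two-column layout is assumed from the file head shown above, not taken from the algorithm's documentation.

```python
import csv
import io


def validate_training_csv(lines, expected_header=("user_id", "item_id")):
    """Return the number of data rows, raising ValueError on a malformed file.

    Hypothetical helper; `lines` is any iterable of text lines, such as an open file.
    """
    reader = csv.reader(lines)
    header = next(reader, None)
    if header is None or tuple(header) != tuple(expected_header):
        raise ValueError("unexpected header: {}".format(header))
    row_count = 0
    for line_number, row in enumerate(reader, start=2):
        # Every row needs one non-empty value per header column
        if len(row) != len(expected_header) or any(not value.strip() for value in row):
            raise ValueError("bad row at line {}: {}".format(line_number, row))
        row_count += 1
    return row_count


# Two rows copied from the file head above
sample = io.StringIO(
    "user_id,item_id\n"
    "00000c289a1829a808ac09c00daf10bc3c4e223b,3bd73256-3905-4f3a-97e2-8b341527f805\n"
    "00000c289a1829a808ac09c00daf10bc3c4e223b,f2fb0ff0-5679-42ec-a55c-15109ce6e320\n"
)
print(validate_training_csv(sample))
```

To check the real file, pass `open(train_data_file, newline='')` instead of the in-memory sample.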


### Upload training data to S3 <a id="upload-training-data"></a>

Choose a bucket, optionally customize the prefix, and upload the training directory containing the CSV created above.

In [50]:
import sagemaker

bucket = "sagemaker-validation-us-west-2"
prefix = "implicit-bpr-test"

sess = sagemaker.Session()

s3_train = sess.upload_data(train_data_dir, bucket, "{}/training".format(prefix))
"uploaded training data file to {}".format(s3_train)

'uploaded training data file to s3://sagemaker-validation-us-west-2/implicit-bpr-test/training'

## Step 2 - Create a model <a id="create-model"></a>

### Run a SageMaker training job <a id="run-training-job"></a>

This code starts a training job, waits for it to finish, and prints its log output as it runs.

In [13]:
%%time

import time
import sagemaker

role_arn = "arn:aws:iam::435525115971:role/service-role/AmazonSageMaker-ExecutionRole-20181012T121978"
algo_arn = "arn:aws:sagemaker:us-west-2:594846645681:algorithm/implicit-bpr-2-085a8f2de7e8d057d9c758785eb4e51d"
job_name_prefix = 'implicit-bpr-test'
timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
job_name = job_name_prefix + timestamp

estimator = sagemaker.AlgorithmEstimator(
    algorithm_arn = algo_arn,
    role=role_arn,
    train_instance_count=1,
    train_instance_type="ml.c5.2xlarge",
    input_mode='File',
    output_path='s3://{}/{}/output'.format(bucket, job_name_prefix),
    base_job_name=job_name_prefix
)

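# Optional testing channel; it is defined here but not passed to fit() below
test = sagemaker.session.s3_input(
    s3_data="s3://marriott-ml-us-west-2/implicit-bpr/moments-all/testing/",
    distribution='FullyReplicated',
    content_type="text/csv",
    s3_data_type='S3Prefix')

# Add "testing": test to this dict to supply the testing channel as well
inputs = {"training": s3_train}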

estimator.fit(inputs)

INFO:sagemaker:Creating training-job with name: implicit-bpr-test-2019-02-15-19-19-20-982


2019-02-15 19:19:21 Starting - Starting the training job...
2019-02-15 19:19:22 Starting - Launching requested ML instances......
2019-02-15 19:20:54 Starting - Preparing the instances for training...
2019-02-15 19:21:24 Downloading - Downloading input data...
2019-02-15 19:21:43 Training - Downloading the training image...
2019-02-15 19:22:30 Uploading - Uploading generated training model
CUDA is available: False
Beginning training
/opt/ml/input/data/training data shape: (987161, 3)
trained_df_joined data shape: (987161, 5)
Found users_max: 20461
Found items_max: 66796
pickling 20462 users
pickling 66797 items
pickling (20462, 66797) user items
Effective hyperparameters: {'use_gpu': False, 'learning_rate': 0.01, 'iterations': 100, 'regularization': 0.01, 'factors': 100, 'verify_negative_samples': False}
  0%|          | 0/100 [00:00<?, ?it/s]
  1%|1         | 1/100 [00:00<00:10,  9.86

### Create a SageMaker model <a id="create-sagemaker-model"></a>

This registers the trained model artifacts as a model package and then creates a SageMaker model from that package, so it can be used later for recommendations.

In [33]:
model_name = estimator.latest_training_job.name
sess.create_model_package_from_algorithm(model_name, 'test', algo_arn, estimator.model_data)

INFO:sagemaker:Creating model package with name: implicit-bpr-test-2019-02-15-19-19-20-982


In [43]:
sess.create_model(model_name, role_arn, [{'ModelPackageName': model_name}], enable_network_isolation=True)

INFO:sagemaker:Creating model with name: implicit-bpr-test-2019-02-15-19-19-20-982


'implicit-bpr-test-2019-02-15-19-19-20-982'

## Step 3 - Get recommendations (Inference) <a id="get-recommendations"></a>

### Example users <a id="example-users"></a>

Pick a few example users to generate artist recommendations for.

In [34]:
example_users = df[df.user_id.isin(["05c4bbb936abd2331e8f64037c95a61335d40e30",
                                    "030ebbd1d8b360ce465a20e30a67a43da97f1b20"])]
example_users

| | user_id | item_id | artist_name | total_interactions |
|---|---|---|---|---|
| 207869 | 030ebbd1d8b360ce465a20e30a67a43da97f1b20 | a7022764-95fb-46af-a7d6-90056746451a | uma thurman | 651 |
| 207870 | 030ebbd1d8b360ce465a20e30a67a43da97f1b20 | 0743b15a-3c32-48c8-ad58-cb325350befa | blink-182 | 649 |
| 391892 | 05c4bbb936abd2331e8f64037c95a61335d40e30 | 99d7b49c-c18e-4a11-bf3e-b71710938df6 | phoenix | 3 |
| 391893 | 05c4bbb936abd2331e8f64037c95a61335d40e30 | bd4d397a-849a-48bf-be24-52eec87feeee | adriana calcanhotto | 2 |


### Create batch transform input file <a id="create-batch-input"></a>

Each line is a JSON object containing two keys:

* `user_id`: the ID of the user
* `top_n`: the number of top-scoring recommendations to return

The head of the batch input file is shown.

In [35]:
import json

batch_input_dir = '{}/batch_input'.format(data_dir)
batch_input_file = batch_input_dir + '/recommendation.requests'

!mkdir -p {batch_input_dir}

with open(batch_input_file, 'w') as outfile:
    json.dump({"user_id": "05c4bbb936abd2331e8f64037c95a61335d40e30", "top_n": "5"}, outfile)
    outfile.write("\n")
    json.dump({"user_id": "030ebbd1d8b360ce465a20e30a67a43da97f1b20", "top_n": "5"}, outfile)
   
!head {batch_input_file}

{"user_id": "05c4bbb936abd2331e8f64037c95a61335d40e30", "top_n": "5"}
{"user_id": "030ebbd1d8b360ce465a20e30a67a43da97f1b20", "top_n": "5"}
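
For more than a couple of users, the same file can be generated in a loop. The sketch below is one way to do it, assuming the one-JSON-object-per-line layout shown above. `build_requests` is a hypothetical helper name, and `top_n` is written as a string only because the example above does so.

```python
import json


def build_requests(user_ids, top_n=5):
    """Return JSON Lines text with one recommendation request per user.

    Hypothetical helper mirroring the request layout used in this notebook.
    """
    lines = [
        json.dumps({"user_id": user_id, "top_n": str(top_n)})
        for user_id in user_ids
    ]
    # Join with newlines and no trailing newline, matching the file written above
    return "\n".join(lines)


requests_text = build_requests([
    "05c4bbb936abd2331e8f64037c95a61335d40e30",
    "030ebbd1d8b360ce465a20e30a67a43da97f1b20",
])
print(requests_text)
```

Writing `requests_text` to `batch_input_file` would produce the same two lines as the cell above.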

### Upload the batch transform input file to S3 <a id="upload-batch-input"></a>

In [36]:
batch_input = sess.upload_data(batch_input_dir, bucket, "{}/batch_input".format(prefix))
"uploaded batch input file to {}".format(batch_input)

'uploaded batch input file to s3://sagemaker-validation-us-west-2/implicit-bpr-test/batch_input'

### Run the Batch Transform Job <a id="run-transform"></a>

This code starts a batch transform job, polls its status every 30 seconds until it reaches a terminal state, and reports the result.

In [48]:
%%time

import time
import boto3

sage = boto3.client(service_name='sagemaker')

timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
batch_job_name = "implicit-bpr-test" + timestamp
batch_output = 's3://{}/{}/output'.format(bucket, batch_job_name)
request = \
{
  "TransformJobName": batch_job_name,
  "ModelName": model_name,
  "BatchStrategy": "SingleRecord",
  "TransformInput": {
    "DataSource": {
      "S3DataSource": {
        "S3DataType": "S3Prefix",
        "S3Uri": batch_input
      }
    },
    "ContentType": "application/json",
    "CompressionType": "None",
    "SplitType": "Line"
  },
  "TransformOutput": {
    "S3OutputPath": batch_output,
    "Accept": "text/csv",
    "AssembleWith": "Line"
  },
  "TransformResources": {
    "InstanceType": "ml.c5.2xlarge",
    "InstanceCount": 1
  }
}

sage.create_transform_job(**request)

print("Created Transform job with name: ", batch_job_name)

# Poll until the job reaches a terminal status
while True:
    job_info = sage.describe_transform_job(TransformJobName=batch_job_name)
    status = job_info['TransformJobStatus']
    if status == 'Completed':
        print("Transform job ended with status: " + status)
        break
    if status in ('Failed', 'Stopped'):
        # FailureReason is only present when the job has failed
        message = job_info.get('FailureReason', 'no failure reason reported')
        print('Transform job ended with status {}: {}'.format(status, message))
        raise Exception('Transform job failed')
    time.sleep(30)

Created Transform job with name:  implicit-bpr-test-2019-02-15-22-25-20
Transform job ended with status: Completed
CPU times: user 171 ms, sys: 18.2 ms, total: 189 ms
Wall time: 3min 34s
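
The polling loop above runs until the job ends. A variant, sketched here under assumptions, adds an overall timeout and takes the describe call as a parameter so the logic can be exercised without AWS credentials. `wait_for_transform_job` and its parameters are hypothetical; in the notebook, `describe` would be something like `lambda: sage.describe_transform_job(TransformJobName=batch_job_name)`.

```python
import time


def wait_for_transform_job(describe, poll_seconds=30, timeout_seconds=3600, sleep=time.sleep):
    """Poll `describe()` until the job reaches a terminal status or the timeout expires.

    Hypothetical helper: `describe` is any zero-argument callable returning a dict
    with a 'TransformJobStatus' key, as boto3's describe_transform_job does.
    """
    waited = 0
    while True:
        job_info = describe()
        status = job_info["TransformJobStatus"]
        if status == "Completed":
            return job_info
        if status in ("Failed", "Stopped"):
            reason = job_info.get("FailureReason", "no reason reported")
            raise RuntimeError("Transform job ended with status {}: {}".format(status, reason))
        if waited >= timeout_seconds:
            raise TimeoutError("Transform job still {} after {}s".format(status, waited))
        sleep(poll_seconds)
        waited += poll_seconds


# Fake describe call that reports InProgress twice, then Completed
responses = iter(["InProgress", "InProgress", "Completed"])
result = wait_for_transform_job(
    lambda: {"TransformJobStatus": next(responses)},
    sleep=lambda seconds: None,  # skip real sleeping in the demonstration
)
print(result["TransformJobStatus"])
```

Injecting `sleep` keeps the demonstration instant; with the default `time.sleep`, the helper waits `poll_seconds` between calls.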


### Download the batch results <a id="download-batch-results"></a>

Copy the output file from S3 into the local data directory and show the head of the file.

In [None]:
!aws s3 cp {batch_output + '/recommendation.requests.out'} {data_dir}

!head {data_dir}/recommendation.requests.out

### Recommendations with scores <a id="recommendations"></a>

Import the recommendations from the batch output file downloaded above and join with artist names. These are the top 5 artist recommendations for our example users.

In [None]:
recommendations_df = pd.read_csv('{}/recommendation.requests.out'.format(data_dir), 
                                 header=None, 
                                 names=["user_id", "item_id", "score"])
# Keep the first artist_name seen for each item_id, indexed by item_id for the join
artist_names = df.drop_duplicates('item_id').set_index('item_id')[["artist_name"]]
recommendations_df = recommendations_df.join(artist_names, on='item_id')
recommendations_df
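
If pandas is not available where the results are consumed, the same file can be read with the standard library. This sketch assumes the headerless three-column `user_id,item_id,score` layout used in the cell above. `group_recommendations` is a hypothetical helper, and the sample rows are made-up placeholders, not real model output.

```python
import csv
import io
from collections import defaultdict


def group_recommendations(lines):
    """Map each user_id to its (item_id, score) pairs, best score first.

    Assumes headerless CSV rows of user_id,item_id,score.
    """
    by_user = defaultdict(list)
    for user_id, item_id, score in csv.reader(lines):
        by_user[user_id].append((item_id, float(score)))
    for recommendations in by_user.values():
        recommendations.sort(key=lambda pair: pair[1], reverse=True)
    return dict(by_user)


# Placeholder rows for illustration only
sample = io.StringIO(
    "user-a,item-1,0.25\n"
    "user-a,item-2,0.75\n"
    "user-b,item-3,0.50\n"
)
for user_id, recommendations in group_recommendations(sample).items():
    print(user_id, recommendations)
```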

### User history <a id="user-history"></a>

Show the example users' history again for convenience.

In [None]:
example_users

## Step 4 - Optional Clean up <a id="cleanup"></a>

In [None]:
def cleanup():
    !rm lastfm-dataset-360K.tar.gz 2> /dev/null
    !rm -fr implicit-bpr/ 2> /dev/null
    !rm -fr lastfm-dataset-360K/ 2> /dev/null
    # Use the boto3 SageMaker client created for the batch transform job
    sage.delete_model(ModelName=model_name)
    
# optionally uncomment and run the code to clean everything up  

#cleanup()