# Factorization Machine Algorithm Hyperparameter Tuning

Before running this notebook, familiarize yourself with the contents of [Recommendation-Machine](./Recommendation-Machine.ipynb). This notebook only shows how to perform a hyperparameter search for the algorithm and does not focus on deployment or inference.

1. [Prerequisites and Preprocessing](#Prequisites)
2. [Perform Hyperparameter Tuning of the Model](#Hyperparameter-Tuning)
3. [Perform Batch Inference with the best Model](#Perform-Batch)
4. [Cleanup](#Clean)

<a id='Prequisites'></a>

## Prerequisites and Preprocessing
---

### Permissions and environment variables

Here we set up the linkage and authentication to AWS services. There are three parts to this:

* The role used to give training and hosting access to your data. It is obtained automatically from the role used to start the notebook.
* The S3 bucket that you want to use for training and model data.
* The Amazon SageMaker Factorization Machines Docker image, which need not be changed.

In [1]:
import sagemaker
from sagemaker import get_execution_role

role = get_execution_role()

sess = sagemaker.Session()
bucket = sess.default_bucket()
prefix = "builtin-notebooks/Recomendation-Machine/Explicit"
print(f"role: {role} bucket: {bucket}")

train_key = 'train.protobuf'
train_prefix = '{}/{}'.format(prefix, 'train')
s3_train = 's3://{}/{}/train/'.format(bucket, prefix)

test_key = 'test.protobuf'
test_prefix = '{}/{}'.format(prefix, 'test')

# S3 output location
output_prefix = 's3://{}/{}/output'.format(bucket, prefix)



role: arn:aws:iam::338408246139:role/service-role/AmazonSageMaker-ExecutionRole-20210707T172488 bucket: sagemaker-us-east-1-338408246139


In [2]:
import boto3
import sagemaker
from sagemaker.image_uris import retrieve

training_image = retrieve(region=boto3.Session().region_name, framework="factorization-machines", version='latest')

Defaulting to the only supported framework/algorithm version: 1. Ignoring framework/algorithm version: latest.


### Data preparation

For more details consult the notebook [Recommendation-Machine-Explicit](./Recommendation-Machine-Explicit.ipynb).

In [3]:
import pandas as pd

ratings = pd.read_csv('ml-100k/ua.base', sep='\t', names=['userId','movieId','rating','timestamp'] )
ratings_test = pd.read_csv('ml-100k/ua.test', sep='\t', names=['userId','movieId','rating','timestamp'] )

#ratings.drop(columns='timestamp', inplace=True)
print('Shape of ratings dataset for training: {}'.format(ratings.shape))
ratings.head()

Shape of ratings dataset for training: (90570, 4)


| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1 | 5 | 874965758 |
| 1 | 1 | 2 | 3 | 876893171 |
| 2 | 1 | 3 | 4 | 878542960 |
| 3 | 1 | 4 | 3 | 876893119 |
| 4 | 1 | 5 | 3 | 889751712 |


In [4]:
ratings['rating_bin'] = (ratings.rating>=4).astype('float32')
ratings_test['rating_bin'] = (ratings_test.rating>=4).astype('float32')
# Keep only test rows whose movieId also appears in the training set
ratings_test = ratings_test[ratings_test.movieId.isin(ratings.movieId)]
ratings_test.reset_index(drop=True, inplace=True)
ratings.head()

| | userId | movieId | rating | timestamp | rating_bin |
|---|---|---|---|---|---|
| 0 | 1 | 1 | 5 | 874965758 | 1.0 |
| 1 | 1 | 2 | 3 | 876893171 | 0.0 |
| 2 | 1 | 3 | 4 | 878542960 | 1.0 |
| 3 | 1 | 4 | 3 | 876893119 | 0.0 |
| 4 | 1 | 5 | 3 | 889751712 | 0.0 |
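The cell that creates `rating_bin` does two things: it turns the 1-5 star rating into a binary target (4 stars or more counts as a positive), and it drops test rows whose movie never appears in the training data. The following is a minimal sketch of the same two steps on toy rows, written without pandas. The rows and the helper name `binarize` are made up for illustration and are not part of the notebook.

```python
# Toy (userId, movieId, rating) rows; illustrative values, not taken from ml-100k.
train_rows = [(1, 10, 5), (1, 11, 3), (2, 10, 4)]
test_rows = [(1, 10, 2), (2, 12, 5), (2, 11, 4)]


def binarize(rating, threshold=4):
    """Map a star rating to 1.0 (rating >= threshold) or 0.0 (otherwise)."""
    return 1.0 if rating >= threshold else 0.0


# Step 1: add the binary target to every row.
train_labeled = [(u, m, r, binarize(r)) for u, m, r in train_rows]

# Step 2: keep only test rows whose movie also appears in the training set.
train_movies = {m for _, m, _ in train_rows}
test_labeled = [(u, m, r, binarize(r)) for u, m, r in test_rows if m in train_movies]

print(train_labeled)
print(test_labeled)  # the row with movieId 12 is dropped
```

Filtering the test set this way means every remaining test row has a movie the model has seen during training.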


In [5]:
from sklearn.preprocessing import OneHotEncoder
import numpy as np

enc = OneHotEncoder(handle_unknown='ignore',sparse=True)
enc.fit(ratings[['userId','movieId']])

X_train_OH = enc.transform(ratings[['userId','movieId']]).astype('float32')
Y_train_OH = ratings['rating_bin']

X_test_OH = enc.transform(ratings_test[['userId','movieId']]).astype('float32')
Y_test_OH = ratings_test['rating_bin']

columns = X_train_OH.shape[1]
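To make the shape of the encoded data concrete: one-hot encoding the `userId` and `movieId` columns yields one column per distinct user followed by one column per distinct movie, so each row has at most two non-zero entries and `columns` equals the number of users plus the number of movies seen in training. The sketch below mimics that layout in plain Python on toy pairs. The helpers `fit_one_hot` and `transform` are hypothetical stand-ins, not scikit-learn's API.

```python
def fit_one_hot(rows):
    """Assign a column index to each distinct userId and movieId."""
    users = sorted({u for u, _ in rows})
    movies = sorted({m for _, m in rows})
    user_col = {u: i for i, u in enumerate(users)}
    # Movie columns are placed after all user columns.
    movie_col = {m: len(users) + j for j, m in enumerate(movies)}
    return user_col, movie_col


def transform(row, user_col, movie_col):
    """Return one sparse row as {column_index: 1.0}; unknown ids are skipped."""
    u, m = row
    cols = {}
    if u in user_col:
        cols[user_col[u]] = 1.0
    if m in movie_col:
        cols[movie_col[m]] = 1.0
    return cols


train_pairs = [(1, 10), (1, 11), (2, 10)]  # toy (userId, movieId) pairs
user_col, movie_col = fit_one_hot(train_pairs)
feature_dim = len(user_col) + len(movie_col)

print(feature_dim)                              # 2 users + 2 movies
print(transform((2, 11), user_col, movie_col))  # one user entry, one movie entry
print(transform((2, 99), user_col, movie_col))  # unseen movie: user entry only
```

Skipping unknown ids corresponds to the `handle_unknown='ignore'` setting used in the cell above.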

In [6]:
import io
import sagemaker.amazon.common as smac

# Helper that writes a dataset in protobuf format to an S3 bucket
def writeDatasetToProtobuf(X, Y, bucket, prefix, key):
    buf = io.BytesIO()
    smac.write_spmatrix_to_sparse_tensor(buf, X, Y)
    buf.seek(0)
    obj = '{}/{}'.format(prefix, key)
    boto3.resource('s3').Bucket(bucket).Object(obj).upload_fileobj(buf)
    return 's3://{}/{}'.format(bucket,obj)

In [7]:
%%time
train_data = writeDatasetToProtobuf(X_train_OH, Y_train_OH, bucket, train_prefix, train_key)    
test_data = writeDatasetToProtobuf(X_test_OH, Y_test_OH, bucket, test_prefix, test_key)    
  
print(train_data)
print(test_data)
print('Output: {}'.format(output_prefix))

s3://sagemaker-us-east-1-338408246139/builtin-notebooks/Recomendation-Machine/Explicit/train/train.protobuf
s3://sagemaker-us-east-1-338408246139/builtin-notebooks/Recomendation-Machine/Explicit/test/test.protobuf
Output: s3://sagemaker-us-east-1-338408246139/builtin-notebooks/Recomendation-Machine/Explicit/output
CPU times: user 12.7 s, sys: 312 ms, total: 13 s
Wall time: 12.9 s


<a id='Hyperparameter-Tuning'></a>

# Perform Hyperparameter Tuning of the Model

***

Now that we are done with all the setup that is needed, we are ready to tune the hyperparameters of our Factorization Machine.

For this example, two hyperparameters will be tuned: **epochs** and **mini_batch_size**, selected as the ones with the greatest impact on the objective metric. See [here](https://docs.aws.amazon.com/sagemaker/latest/dg/fm-tuning.html) for more detail and the full list of hyperparameters that can be tuned.

To begin, let us create a ``sagemaker.estimator.Estimator`` object. Before launching the tuning job, the training jobs that it will launch need to be configured through this object, which specifies the following information:

* The container image for the algorithm (Factorization Machines).
* The S3 location for training and test data (supplied later, when `fit` is called).
* The type and number of instances to use for the training jobs.
* The output location where the model artifacts are stored after training.

In [8]:
s3_output_location = "s3://{}/{}/hp-tuning-output".format(bucket, prefix)
fm = sagemaker.estimator.Estimator(
    training_image,
    role,
    instance_count=1,
    instance_type="ml.c4.xlarge",
    output_path=s3_output_location,
    sagemaker_session=sess
)

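Next, set the hyperparameter values. Hyperparameters that are not tuned keep these values in every training job, while the values given here for **epochs** and **mini_batch_size** are replaced by values drawn from the tuning ranges: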

* **feature_dim**: The dimension of the input feature space. This could be very high with sparse input.
* **num_factors**: The dimensionality of factorization. As mentioned initially, factorization machines find a lower dimensional representation of the interactions for all features. Making this value smaller provides a more parsimonious model, closer to a linear model, but may sacrifice information about interactions. Making it larger provides a higher-dimensional representation of feature interactions, but adds computational complexity and can lead to overfitting. In a practical application, time should be invested to tune this parameter to the appropriate value.
* **predictor_type**: The type of predictor. Use `binary_classifier` for binary classification tasks and `regressor` for regression tasks.
* **epochs**: The number of training epochs to run.
* **mini_batch_size**: The size of mini-batch used for training. This value can be tuned for relatively minor improvements in fit and speed, but selecting a reasonable value relative to the dataset is appropriate in most cases.

You can check all the available hyperparameters at [Factorization Machines Hyperparameters](https://docs.aws.amazon.com/sagemaker/latest/dg/fact-machines-hyperparameters.html).
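To see what **feature_dim** and **num_factors** control, recall the standard second-order factorization machine score: a bias `w0`, a linear term with one weight per feature, and a pairwise term in which the interaction between features `i` and `j` is the dot product of their `num_factors`-dimensional factor vectors. The sketch below computes that score twice, once with the explicit sum over pairs and once with the well-known linear-time reformulation. It illustrates the model family only and makes no claim about how the SageMaker container implements it; all weights are made up.

```python
from itertools import combinations


def fm_score_naive(x, w0, w, V):
    """Score using the explicit sum over all feature pairs (i < j)."""
    n, k = len(x), len(V[0])
    linear = sum(w[i] * x[i] for i in range(n))
    pairwise = sum(
        sum(V[i][f] * V[j][f] for f in range(k)) * x[i] * x[j]
        for i, j in combinations(range(n), 2)
    )
    return w0 + linear + pairwise


def fm_score_fast(x, w0, w, V):
    """Same score using 0.5 * sum_f [(sum_i v_if x_i)^2 - sum_i (v_if x_i)^2]."""
    n, k = len(x), len(V[0])
    linear = sum(w[i] * x[i] for i in range(n))
    pairwise = 0.0
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(n))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(n))
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise


# feature_dim = 4 (2 users + 2 movies), num_factors = 2; weights are illustrative.
w0 = 0.1
w = [0.2, -0.1, 0.05, 0.3]
V = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.6], [0.2, 0.1]]
x = [0.0, 1.0, 0.0, 1.0]  # one-hot row: user column 1, movie column 3

print(fm_score_naive(x, w0, w, V))
print(fm_score_fast(x, w0, w, V))
```

With one-hot user and movie inputs, only one pair of features is active per row, so the pairwise term reduces to the dot product of that user's and that movie's factor vectors.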


In [9]:
fm.set_hyperparameters(
    feature_dim=columns,
    num_factors=64,
    predictor_type='binary_classifier',
    epochs=15,
    mini_batch_size=400
)

In [10]:
data_channels = {
    "train": train_data,
    "test": test_data,
}

Next, the tuning job needs to be specified with the following configuration:

* The hyperparameters that SageMaker Automatic Model Tuning will tune: **epochs** and **mini_batch_size**
* The maximum number of training jobs it will run to optimize the objective metric: 30
* The number of parallel training jobs that will run in the tuning job: 2
* The objective metric that Automatic Model Tuning will use: test:binary_classification_accuracy
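A quick back-of-the-envelope calculation shows how these settings relate to the search space. The sketch below uses the integer ranges and job limits from the next cell of this notebook (`epochs` in 10-30, `mini_batch_size` in 100-600, 30 jobs, 2 in parallel) and treats both range bounds as inclusive. The variable names are illustrative.

```python
import math

# Values taken from the tuning configuration in this notebook.
epochs_range = (10, 30)
mini_batch_size_range = (100, 600)
max_jobs = 30
max_parallel_jobs = 2

# Number of distinct integer values in each range (bounds inclusive).
n_epochs = epochs_range[1] - epochs_range[0] + 1
n_batch = mini_batch_size_range[1] - mini_batch_size_range[0] + 1
grid_size = n_epochs * n_batch

# Jobs run at most max_parallel_jobs at a time, so this is a lower bound
# on the number of sequential waves of training jobs.
min_waves = math.ceil(max_jobs / max_parallel_jobs)

print(grid_size)  # 21 * 501 combinations
print(min_waves)
print(f"{max_jobs / grid_size:.2%} of all combinations are evaluated")
```

Only a small fraction of the combinations is ever trained, which is why the tuner relies on a search strategy rather than an exhaustive grid. Keeping the parallelism low gives a sequential strategy more completed results to learn from before it picks the next candidates, at the cost of a longer wall-clock time.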

In [11]:
from time import gmtime, strftime
from sagemaker.tuner import (
    IntegerParameter,
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
)

tuning_job_name = "Factorization-Machine-job-{}".format(
    strftime("%d-%H-%M-%S", gmtime())
)

hyperparameter_ranges = {
    "epochs": IntegerParameter(10, 30),
    "mini_batch_size": IntegerParameter(100, 600),
}

objective_metric_name = 'test:binary_classification_accuracy'

tuner = HyperparameterTuner(
    fm,
    objective_metric_name,
    hyperparameter_ranges,
    objective_type="Maximize",
    max_jobs=30,
    max_parallel_jobs=2,
    early_stopping_type="Auto"
)
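The objective used above, `test:binary_classification_accuracy`, is maximized by the tuner. As a reminder of what an accuracy metric measures, here is a minimal sketch that thresholds scores and counts matches against 0/1 labels. The function name, the 0.5 threshold and the sample values are assumptions for illustration; the exact computation inside the SageMaker container is not shown in this notebook.

```python
def binary_classification_accuracy(y_true, scores, threshold=0.5):
    """Fraction of examples whose thresholded score matches the 0/1 label."""
    if len(y_true) != len(scores):
        raise ValueError("y_true and scores must have the same length")
    if not y_true:
        raise ValueError("at least one example is required")
    predicted = [1.0 if s >= threshold else 0.0 for s in scores]
    correct = sum(1 for t, p in zip(y_true, predicted) if t == p)
    return correct / len(y_true)


# Illustrative labels and scores (not produced by the notebook).
y_true = [1.0, 0.0, 1.0, 0.0, 0.0]
scores = [0.9, 0.4, 0.3, 0.2, 0.7]

acc = binary_classification_accuracy(y_true, scores)
print(acc)  # 3 of 5 predictions match
```

Because a higher accuracy is better, the tuner is created with `objective_type="Maximize"`; an error metric would call for `"Minimize"` instead.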

## Launch the Tuning job
Start tuning by calling the `fit` method of the `tuner`. This will launch a SageMaker hyperparameter tuning job, which in turn runs up to 30 training jobs using the fixed hyperparameters and the tuning ranges defined above.

When it's done, run the next cell to see the tuning results.

In [12]:
%%time
tuner.fit(inputs=data_channels, include_cls_metadata=False, logs=True)

................................................................!
CPU times: user 4.04 s, sys: 420 ms, total: 4.46 s
Wall time: 1

In [21]:
from IPython.display import display
from sagemaker import HyperparameterTuningJobAnalytics

tuner_metrics = HyperparameterTuningJobAnalytics(tuner.latest_tuning_job.job_name)
total_time = tuner_metrics.dataframe()["TrainingElapsedTimeSeconds"].sum() / 3600

display(f"The total training time is {total_time:.2f} hours")
display(tuner_metrics.dataframe().sort_values(["FinalObjectiveValue"], ascending=False))
display(tuner_metrics.dataframe()["TrainingJobStatus"].value_counts())

'The total training time is 1.00 hours'

| | epochs | mini_batch_size | TrainingJobName | TrainingJobStatus | FinalObjectiveValue | TrainingStartTime | TrainingEndTime | TrainingElapsedTimeSeconds |
|---|---|---|---|---|---|---|---|---|
| 21 | 30.0 | 100.0 | factorization-machin-211222-1919-009-051c5774 | Completed | 0.696648 | 2021-12-22 19:41:10+00:00 | 2021-12-22 19:43:28+00:00 | 138.0 |
| 5 | 30.0 | 114.0 | factorization-machin-211222-1919-025-8d886616 | Completed | 0.694739 | 2021-12-22 20:24:46+00:00 | 2021-12-22 20:27:13+00:00 | 147.0 |
| 14 | 30.0 | 105.0 | factorization-machin-211222-1919-016-ba1a0ddc | Completed | 0.694633 | 2021-12-22 19:58:13+00:00 | 2021-12-22 20:00:15+00:00 | 122.0 |
| 4 | 30.0 | 121.0 | factorization-machin-211222-1919-026-4e84a0fd | Completed | 0.694527 | 2021-12-22 20:25:12+00:00 | 2021-12-22 20:27:17+00:00 | 125.0 |
| 18 | 30.0 | 110.0 | factorization-machin-211222-1919-012-cd158a54 | Completed | 0.694527 | 2021-12-22 19:47:23+00:00 | 2021-12-22 19:49:21+00:00 | 118.0 |
| 19 | 30.0 | 102.0 | factorization-machin-211222-1919-011-e392de49 | Completed | 0.694315 | 2021-12-22 19:46:44+00:00 | 2021-12-22 19:48:47+00:00 | 123.0 |
| 9 | 30.0 | 108.0 | factorization-machin-211222-1919-021-7aaef662 | Completed | 0.694209 | 2021-12-22 20:14:26+00:00 | 2021-12-22 20:16:45+00:00 | 139.0 |
| 17 | 30.0 | 107.0 | factorization-machin-211222-1919-013-8713387b | Completed | 0.693997 | 2021-12-22 19:52:07+00:00 | 2021-12-22 19:54:45+00:00 | 158.0 |
| 16 | 30.0 | 101.0 | factorization-machin-211222-1919-014-5868c818 | Completed | 0.693891 | 2021-12-22 19:52:43+00:00 | 2021-12-22 19:54:46+00:00 | 123.0 |
| 11 | 30.0 | 104.0 | factorization-machin-211222-1919-019-b1d616ed | Completed | 0.693466 | 2021-12-22 20:08:40+00:00 | 2021-12-22 20:11:12+00:00 | 152.0 |


Completed    30
Name: TrainingJobStatus, dtype: int64
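One thing worth checking in results like these is whether the best job sits on the edge of a tuning range, since that can hint that the range is too narrow. The sketch below runs that check in plain Python on the top three rows of the results shown above, together with the ranges used in this notebook. The dictionary layout and variable names are illustrative and not part of the SageMaker SDK.

```python
# Top three rows from the tuning results shown above.
rows = [
    {"epochs": 30, "mini_batch_size": 100,
     "FinalObjectiveValue": 0.696648, "TrainingElapsedTimeSeconds": 138.0},
    {"epochs": 30, "mini_batch_size": 114,
     "FinalObjectiveValue": 0.694739, "TrainingElapsedTimeSeconds": 147.0},
    {"epochs": 30, "mini_batch_size": 105,
     "FinalObjectiveValue": 0.694633, "TrainingElapsedTimeSeconds": 122.0},
]

# Tuning ranges used in this notebook (lower bound, upper bound).
ranges = {"epochs": (10, 30), "mini_batch_size": (100, 600)}

# The objective is maximized, so the best job has the highest value.
best = max(rows, key=lambda r: r["FinalObjectiveValue"])

# Flag hyperparameters whose best value equals one of the range bounds.
at_boundary = {name: best[name] in bounds for name, bounds in ranges.items()}

# Total training time of these three jobs, in seconds.
elapsed_seconds = sum(r["TrainingElapsedTimeSeconds"] for r in rows)

print(best["epochs"], best["mini_batch_size"])
print(at_boundary)
print(elapsed_seconds)
```

Here the best job uses the largest allowed `epochs` and the smallest allowed `mini_batch_size`, so a follow-up tuning job with wider ranges may be worth trying. The differences in accuracy between the top jobs are small, so this is a hint rather than a conclusion.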

<a id='Perform-Batch'></a>

# Perform Batch Inference with the best Model

***

We will use the best model found by the tuning job to perform batch inference over a fixed set of observations.

In [22]:
def writeDatasetToProtobuf2(X, bucket, prefix, key, d_type, Y=None):
    buf = io.BytesIO()
    if d_type=="sparse":
        smac.write_spmatrix_to_sparse_tensor(buf, X, labels=Y)
    else:
        smac.write_numpy_to_dense_tensor(buf, X, labels=Y)
        
    buf.seek(0)
    obj = '{}/{}'.format(prefix, key)
    boto3.resource('s3').Bucket(bucket).Object(obj).upload_fileobj(buf)
    return 's3://{}/{}'.format(bucket,obj)

In [23]:
# Upload inference data to S3
s3_batch_output = "s3://{}/{}/batch_output/".format(bucket, prefix)
prefix_batch = "{}/batch_inference".format(prefix)

s3_batch_input = writeDatasetToProtobuf2(X_test_OH, bucket, prefix_batch, test_key, "sparse")
print("Batch inference data path: ", s3_batch_input)

Batch inference data path:  s3://sagemaker-us-east-1-338408246139/builtin-notebooks/Recomendation-Machine/Explicit/batch_inference/test.protobuf


In [24]:
fm_transformer = tuner.best_estimator().transformer(
    instance_count=1,
    output_path=s3_batch_output,
    instance_type="ml.c4.xlarge",
    max_payload=1
)


2021-12-22 19:43:28 Starting - Preparing the instances for training
2021-12-22 19:43:28 Downloading - Downloading input data
2021-12-22 19:43:28 Training - Training image download completed. Training in progress.
2021-12-22 19:43:28 Uploading - Uploading generated training model
2021-12-22 19:43:28 Completed - Training job completed


### Launch the Transform job

In [25]:
import uuid

In [26]:
%%time 

transform_job_name = f"{tuner.best_training_job()}-{str(uuid.uuid4())[:8]}"
display(f"Launching Batch Transform Job {transform_job_name}")

s3_batch_inference = "s3://{}/{}/batch_inference/".format(bucket, prefix)

fm_transformer.transform(
    data=s3_batch_inference,
    split_type='RecordIO',
    content_type="application/x-recordio-protobuf",
    job_name=transform_job_name,
    wait=True,
    logs=False
)

'Launching Batch Transform Job factorization-machin-211222-1919-009-051c5774-1cdf5bf3'

................................................................!
CPU times: user 266 ms, sys: 14.8 ms, total: 281 ms
Wall time: 5min 22s


In [27]:
from IPython.display import display, display_markdown
transform_job_desc = sess.describe_transform_job(transform_job_name)

transform_job_name = transform_job_desc["TransformJobName"]
transform_job_arn = transform_job_desc["TransformJobArn"]
transform_creation_time = transform_job_desc["CreationTime"]
transform_start_time = transform_job_desc["TransformStartTime"]
transform_end_time = transform_job_desc["TransformEndTime"]
transform_job_output = transform_job_desc["TransformOutput"]["S3OutputPath"]
transform_time = int((transform_end_time - transform_start_time).total_seconds())

transform_desc_md = f"""# **Transform Job**
| | |
|---|---|
| **Name** | {transform_job_name} |
| **ARN**  | {transform_job_arn} |
| **Creation Time** | {transform_creation_time} |
| **Output** | {transform_job_output} |
| **Transform Start Time** | {transform_start_time} |
| **Transform End Time** | {transform_end_time} |
| **Transform Time** | {transform_time} seconds |
| **Transform Input** | {s3_batch_inference} |
"""

display_markdown(transform_desc_md, raw=True)

# **Transform Job**
| | |
|---|---|
| **Name** | factorization-machin-211222-1919-009-051c5774-1cdf5bf3 |
| **ARN**  | arn:aws:sagemaker:us-east-1:338408246139:transform-job/factorization-machin-211222-1919-009-051c5774-1cdf5bf3 |
| **Creation Time** | 2021-12-22 20:47:04.160000+00:00 |
| **Output** | s3://sagemaker-us-east-1-338408246139/builtin-notebooks/Recomendation-Machine/Explicit/batch_output/ |
| **Transform Start Time** | 2021-12-22 20:50:44.725000+00:00 |
| **Transform End Time** | 2021-12-22 20:52:24.772000+00:00 |
| **Transform Time** | 100 seconds |
| **Transform Input** | s3://sagemaker-us-east-1-338408246139/builtin-notebooks/Recomendation-Machine/Explicit/batch_inference/ |
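To locate the predictions afterwards, it helps to know that Batch Transform writes one output object per input object under the output path, named after the input object with an `.out` suffix. The sketch below derives that location with string handling only. The helper name `expected_output_uri` and the bucket `example-bucket` are hypothetical, and the result should be confirmed by listing the output prefix in S3.

```python
import posixpath


def expected_output_uri(input_uri, output_prefix_uri):
    """Join the output prefix with '<input object name>.out'."""
    input_name = posixpath.basename(input_uri)
    return posixpath.join(output_prefix_uri, input_name + ".out")


# Hypothetical bucket; the prefix mirrors the one used in this notebook.
base = "s3://example-bucket/builtin-notebooks/Recomendation-Machine/Explicit"
input_uri = base + "/batch_inference/test.protobuf"
output_prefix_uri = base + "/batch_output/"

result = expected_output_uri(input_uri, output_prefix_uri)
print(result)
```

The helper works whether or not the output prefix ends with a slash, since `posixpath.join` only adds a separator when one is missing.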


<a id='Clean'></a>
# Cleanup

***

In [28]:
fm_transformer.delete_model()