### Machine Learning Immersion Day 

# What Are We Doing Here?

We're going to make a [factorisation machine](https://sagemaker.readthedocs.io/en/stable/factorization_machines.html) that will act as a binary recommender (like vs don't like) against data from [movielens](https://movielens.org).

This is based on a [blog post](https://medium.com/@julsimon/building-a-movie-recommender-with-factorization-machines-on-amazon-sagemaker-cedbfc8c93d8).


To get there, we'll do the following:

##### A. Data Preparation
1. Import dependencies
2. Download the movielens data set - this contains movies and ratings
3. Create separate training and test data sets from what we've downloaded
4. Shuffle this data (to aid with training)

##### B. Data Preparation
5. Create a one-hot encoded sparse matrix of features & labels
6. Serialize in ProtoBuf format (a requirement of the factorisation machine Estimator in SageMaker)

##### C. Train & Deploy the Model
7. Create a factorisation machine and configure its hyperparameters (non-learnt parameters)
8. Train the model by calling `fit()`
9. Deploy a SageMaker endpoint with `deploy()`
10. Configure serialization options for the predictor

##### D. Test Inference
11. Call the predictor



# A. Data preparation


## 1. Import Dependencies
* Import the dependencies we need
* Configure the S3 bucket name we'll use

*Don't forget to change the your_initials value to the initials you used in Lab1*

In [22]:
your_initials = 'pmj'
bucket_name = your_initials + '-ml-id-lab'

import sagemaker
import sagemaker.amazon.common as smac
from sagemaker import get_execution_role
from sagemaker.predictor import json_deserializer
  
import boto3, csv, io, json
import numpy as np
from scipy.sparse import lil_matrix

> While the code is running, there will be a bracketed asterisk showing to the left of the code **[*]**. Once finished, the asterisk will be replaced with a number showing the order of execution within the current notebook document's state.

Next, download one of the data files used in Lab1 to the notebook. 

## 2. Download the movielens data set 

In [23]:
s3 = boto3.resource('s3')
s3.Bucket(bucket_name).download_file('movielens-data/u.data/data.csv', 'u.data')

The downloaded file is a compacted version of the data explored in Lab1. This is the description of the file:

> ```text
> u.data
>
> The full u data set, 100000 ratings by 943 users on 1682 items.
> Each user has rated at least 20 movies.  Users and items are numbered consecutively from 1.
> The data is randomly ordered. This is a tab separated list of:
> user id | item id | rating | timestamp
> The time stamps are unix seconds since 1/1/1970 UTC
> ```
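
To make the file layout concrete, here is a minimal sketch that parses two rating lines in this format and converts the Unix timestamp to a UTC datetime. The two sample lines are copied from the test data shown later in this notebook; the helper name `parse_ratings` is made up for illustration.

```python
import csv
import io
from datetime import datetime, timezone

# Two sample lines in the u.data layout: user id, item id, rating, timestamp.
sample = "196\t242\t3\t881250949\n186\t302\t3\t891717742\n"


def parse_ratings(text):
    """Yield (user_id, item_id, rating, rated_at) tuples from tab-separated text."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    for user_id, item_id, rating, timestamp in reader:
        # Timestamps are seconds since 1/1/1970 UTC
        rated_at = datetime.fromtimestamp(int(timestamp), tz=timezone.utc)
        yield int(user_id), int(item_id), int(rating), rated_at


parsed = list(parse_ratings(sample))
for row in parsed:
    print(row)
```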

While this is an intuitive and relatively compact way of storing the information, it is not optimal for training factorisation machine models. In order to have good training data, this data needs to be split and transformed.

First, split the data into one larger training part and one smaller testing part (10 samples per user).

When the code finishes, the two rating counters are returned and stored in `nbRatingsTrain` and `nbRatingsTest`. The resulting line counts are checked in step 3.1.
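
Before running this against the real file, the per-user holdout idea can be sketched on a tiny in-memory sample. The rows and the helper name `split_per_user` below are hypothetical; the logic mirrors the description above (the first N ratings seen for each user go to test, the rest to training).

```python
from collections import defaultdict


def split_per_user(rows, max_test_per_user):
    """Send each user's first `max_test_per_user` rows to test, the rest to train."""
    seen = defaultdict(int)
    train, test = [], []
    for row in rows:
        user_id = row[0]
        if seen[user_id] < max_test_per_user:
            test.append(row)
            seen[user_id] += 1
        else:
            train.append(row)
    return train, test


# Hypothetical (userId, movieId, rating) rows, not taken from the real file.
rows = [(1, 10, 5), (2, 11, 3), (1, 12, 4), (1, 13, 2), (2, 14, 1), (1, 15, 5)]
train, test = split_per_user(rows, max_test_per_user=2)
print(len(train), len(test))
```

With the real data, every one of the 943 users has at least 20 ratings, so this scheme yields 943 x 10 = 9430 test ratings and 100000 - 9430 = 90570 training ratings.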

## 3. Create separate training and test data sets


In [34]:
nbUsers = 943
nbMovies = 1682
nbFeatures = nbUsers+nbMovies
# Pick 10 ratings _per user_ and save as test data (to "ua.test")
# Save the rest as training data set (to "ua.base")
maxRatingsByUser = 10

def getStringUserId(userId):
    return str(int(userId)-1)


def initialiseTestRatings():
    # Since we only want a maximum of 10, we want to keep track of the number of ratings per user as we build our datasets
    # Create a dictionary, and initialise the count for each userId to 0
    testRatingsByUser = {}
    for userId in range(nbUsers):
        testRatingsByUser[str(userId)] = 0
    return testRatingsByUser

def initialiseFiles():

    # IPython allows us to use ! to execute shell commands
    # Clean any existing 'base' and 'test' files
    !rm -f ua.base || touch ua.base
    !rm -f ua.test || touch ua.test

    testRatingsByUser = initialiseTestRatings()

    # Obtain file handle to the main data file (read access)
    # Obtain file handles to the desired "base" and "test" files we just initialised (write access)
    with open('u.data', 'r') as data_file, open('ua.base', 'w') as uabase_file, open('ua.test', 'w') as uatest_file:

        # Use tabs as delimiters
        filedata_reader = csv.reader(data_file, delimiter='\t')
        uabase_writer = csv.writer(uabase_file, delimiter='\t')
        uatest_writer = csv.writer(uatest_file, delimiter='\t')

        # skip headers
        next(filedata_reader, None)

        # Initialise counters
        nbRatingsTrain = 0
        nbRatingsTest = 0

        # For every rating line in file
        for userId, movieId, rating, timestamp in filedata_reader:

            # If we're within the max ratings per user limit, keep the record as test data
            if testRatingsByUser[getStringUserId(userId)] < maxRatingsByUser:
                uatest_writer.writerow([userId, movieId, rating, timestamp])
                testRatingsByUser[getStringUserId(
                    userId)] = testRatingsByUser[getStringUserId(userId)] + 1
                nbRatingsTest = nbRatingsTest+1

            # If we've already got enough test data for the user in question, use for training
            else:
                uabase_writer.writerow([userId, movieId, rating, timestamp])
                nbRatingsTrain = nbRatingsTrain+1
                
    return nbRatingsTrain, nbRatingsTest


nbRatingsTrain, nbRatingsTest = initialiseFiles()


### 3.1 Check the newly partitioned data
Make sure the partitioned data looks good by printing the first 10 rows of each file. 
> Notice that the exclamation mark starting each line in this snippet means that the line is to be executed as a shell command, rather than as Python code.

In [28]:
!echo
!echo "TRAINING DATA:"
!echo -e "userId\tmovieId\trating\ttimestamp"
!head -10 ua.base
!echo
!echo "TESTING DATA:"
!echo -e "userId\tmovieId\trating\ttimestamp"
!head -10 ua.test
!echo
!echo "TRAINING DATA (COUNT)"
!wc -l ua.base
!echo
!echo "TEST DATA (COUNT)"
!wc -l ua.test
!echo


TRAINING DATA:
userId	movieId	rating	timestamp
13	498	4	882139901
13	892	3	882774224
13	229	4	882397650
181	741	1	878962918
181	1015	1	878963121
13	864	4	882141924
222	812	2	881059117
269	234	1	891449406
13	901	1	883670672
276	70	4	874790826

TESTING DATA:
userId	movieId	rating	timestamp
196	242	3	881250949
186	302	3	891717742
22	377	1	878887116
244	51	2	880606923
166	346	1	886397596
298	474	4	884182806
115	265	2	881171488
253	465	5	891628467
305	451	3	886324817
6	86	3	883603013

TRAINING DATA (COUNT)
90570 ua.base

TEST DATA (COUNT)
9430 ua.test



The output should show ten lines containing four columns for each file. You may notice that the training data has recurring lines that contain the same value in the first column (userId). **These types of regularities in the training data can lead to suboptimal training**.

## 4. Shuffle this data (to aid with training)
Create a new file containing shuffled training data.

> If you're intrigued, `shuf` is available through `coreutils` on the Mac, but not installed by default.
> `brew install coreutils` to get it.
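
If `shuf` is not available at all, the same step can be sketched in Python. This is an illustrative alternative, not what the notebook runs: the helper name `shuffle_lines` and the demo file names are made up, and the whole file is read into memory, which is fine for a file of this size.

```python
import os
import random
import tempfile


def shuffle_lines(src_path, dst_path, seed=None):
    """Write the lines of src_path to dst_path in random order."""
    with open(src_path, "r") as src:
        lines = src.readlines()
    # A dedicated Random instance keeps the shuffle reproducible when a seed is given
    random.Random(seed).shuffle(lines)
    with open(dst_path, "w") as dst:
        dst.writelines(lines)
    return len(lines)


# Demo on a throwaway file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "demo.base")
    dst = os.path.join(tmp, "demo.base.shuffled")
    with open(src, "w") as f:
        f.writelines("%d\t1\t3\t0\n" % i for i in range(100))
    count = shuffle_lines(src, dst, seed=42)
    with open(src) as f:
        original = f.readlines()
    with open(dst) as f:
        shuffled = f.readlines()

print(count, sorted(original) == sorted(shuffled))
```

In the notebook this would be called as `shuffle_lines('ua.base', 'ua.base.shuffled')`.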

### 4.1 Check the newly shuffled data

In [29]:
!shuf ua.base -o ua.base.shuffled
!head -10 ua.base.shuffled

57	126	3	883697293
624	100	5	879792581
429	1425	3	882387633
915	347	5	891031477
642	1000	3	885602340
640	1054	1	886474010
371	73	5	880435397
38	409	5	892433135
286	117	2	876521650
843	566	3	879444766


Great, we now have some test and training data.

# B. Data preparation

Now that we have our files, we need to get them into a format that SageMaker expects.

1. We create a one-hot encoded sparse matrix
2. We serialise this matrix as ProtoBuf

## 5. Create a one-hot encoded sparse matrix of features & labels

You now have two sets of source data, but need to process them more before training and testing a factorization machine model. What is needed for each of the sets is:

- Create a one-hot encoded sparse matrix holding **features** (the input to the model)
- Create a **label** array (the expected output from the model)
- Serialize both of the above into protobuf format and write them to the S3 bucket.

If you've not come across the term before (I hadn't), [one-hot](https://en.wikipedia.org/wiki/One-hot) indicates that only a single bit is flipped - e.g. 010000:

> In digital circuits and machine learning, one-hot is a group of bits among which the legal combinations of values are only those with a single high (1) bit and all the others low (0).

This is a [sparse matrix](https://en.wikipedia.org/wiki/Sparse_matrix) by nature:

> In numerical analysis and computer science, a sparse matrix or sparse array is a matrix in which most of the elements are zero.

We'll define `loadDataset()`, which loads a dataset and returns a one-hot encoded sparse feature matrix and a label vector. To do so we use [`lil_matrix`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.lil_matrix.html), which constructs a:

> Row-based linked list sparse matrix. This is a structure for constructing sparse matrices incrementally. 
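
To see the column layout on a small scale, here is a sketch with toy sizes (3 users and 4 movies, assumed purely for illustration). It uses a plain Python list rather than `lil_matrix` so the whole row can be printed. The helper name `encode_row` is made up. Note that each row carries two set bits: one in the user block and one in the movie block, so it is really two one-hot vectors side by side.

```python
def encode_row(user_id, movie_id, nb_users, nb_movies):
    """Return a dense row: user columns first, then movie columns (ids are 1-based)."""
    row = [0.0] * (nb_users + nb_movies)
    # Flip the bit for the user
    row[user_id - 1] = 1.0
    # Flip the bit for the movie, offset by the number of users
    row[nb_users + movie_id - 1] = 1.0
    return row


# Toy sizes (assumed): 3 users and 4 movies give 7 columns.
row = encode_row(user_id=2, movie_id=3, nb_users=3, nb_movies=4)
print(row)
```

With the real sizes (943 users, 1682 movies) each row has 2625 columns, of which only two are non-zero. That is why a sparse structure is used.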

In [31]:
def loadDataset(filename, lines, columns):

    # Features are one-hot encoded in a sparse matrix
    features = lil_matrix((lines, columns)).astype('float32')

    # Labels are stored in a vector
    labels = []

    lineNumber = 0
    with open(filename, 'r') as file:

        samples = csv.reader(file, delimiter='\t')

        for userId, movieId, rating, timestamp in samples:

            # Flip the bit for the co-ordinates of the line number & userId
            features[lineNumber, int(userId) - 1] = 1

            # Also flip the bit for the co-ordinates of the movieId
            # We effectively use a second vector after the first,
            # offset by the number of users.
            features[lineNumber, int(nbUsers) + int(movieId) - 1] = 1

            # If it's a 4-5 star movie, we append 1 to the label set
            # Recall that we are making a binary recommender (like/dislike)
            if int(rating) >= 4:
                labels.append(1)
            else:
                labels.append(0)

            lineNumber = lineNumber + 1

    return features, np.array(labels).astype('float32')





### 5.1 Verify the Sparse Matrices we've created

#### 5.1.1 The Training Data Set

In [40]:
def checkDataSet(features, labels, nbRatings, nbFeatures):

    print("Features Shape:", features.shape)
    print("Labels Shape:", labels.shape)

    assert features.shape == (nbRatings, nbFeatures)
    assert labels.shape == (nbRatings, )

    nonzero_labels = np.count_nonzero(labels)

    print("Training labels: %d ones, %d zeros" %
          (nonzero_labels, nbRatings-nonzero_labels))

train_features, train_labels = loadDataset('ua.base.shuffled', nbRatingsTrain, nbFeatures)
checkDataSet(train_features, train_labels, nbRatingsTrain, nbFeatures)

Features Shape: (90570, 2625)
Labels Shape: (90570,)
Training labels: 49906 ones, 40664 zeros


#### 5.1.2 The Test Data Set

In [41]:
test_features, test_labels = loadDataset('ua.test', nbRatingsTest, nbFeatures)
checkDataSet(test_features, test_labels, nbRatingsTest, nbFeatures)

Features Shape: (9430, 2625)
Labels Shape: (9430,)
Training labels: 5469 ones, 3961 zeros
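
The label counts above also give a useful yardstick: the accuracy of a trivial model that always predicts the majority class. A short sketch using the counts printed above (the helper name `majority_baseline` is made up):

```python
def majority_baseline(ones, zeros):
    """Accuracy obtained by always predicting the more frequent label."""
    total = ones + zeros
    return max(ones, zeros) / total


# Counts taken from the checkDataSet output above
train_baseline = majority_baseline(49906, 40664)
test_baseline = majority_baseline(5469, 3961)

print("train: %.3f, test: %.3f" % (train_baseline, test_baseline))
```

Any trained model should beat these figures to be worth deploying.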


## 6. Serialize in ProtoBuf format
Now, you will serialise these structures in [protobuf](https://developers.google.com/protocol-buffers/) format on S3. Start by defining target names for the S3 objects, and a function to do the serialisation and return the path to the object on S3.

We use the [SageMaker API](https://aws.amazon.com/blogs/machine-learning/introduction-to-the-amazon-sagemaker-neural-topic-model/) to transform our sparse matrix. `write_spmatrix_to_sparse_tensor()` will:

> convert scipy sparse matrix into RecordIO Protobuf format.


In [42]:
prefix = 'sagemaker/recommender-fm'

train_key = 'train.protobuf'
train_prefix = '{}/{}'.format(prefix, 'train')

test_key = 'test.protobuf'
test_prefix = '{}/{}'.format(prefix, 'test')


def writeDatasetToProtobuf(X, Y, bucket, prefix, key):

    buf = io.BytesIO()

    # Using SageMaker
    smac.write_spmatrix_to_sparse_tensor(buf, X, Y)

    buf.seek(0)

    bucket_object = '{}/{}'.format(prefix, key)
    boto3.resource('s3').Bucket(bucket).Object(bucket_object).upload_fileobj(buf)
    return 's3://{}/{}'.format(bucket, bucket_object)


Last, write the data by calling the function for the two sets.

### 6.1 Serialise the data as ProtoBuf and store in S3:

In [45]:
train_data = writeDatasetToProtobuf(
    train_features, train_labels, bucket_name, train_prefix, train_key)
print("Training data at: %s" % (train_data))

test_data = writeDatasetToProtobuf(
    test_features, test_labels, bucket_name, test_prefix, test_key)
print("Testing data at: %s" % (test_data))


Training data at: s3://pmj-ml-id-lab/sagemaker/recommender-fm/train/train.protobuf
Testing data at: s3://pmj-ml-id-lab/sagemaker/recommender-fm/test/test.protobuf


### 6.2 Check the Sizes of the Serialised Data
You should now see objects at these paths in the S3 console. Note how efficiently the sparse matrix is stored: only about 5.5 MB for the training set.

You have now finished preparing data and are ready to start training your model.

In [46]:
def printObjectSize(bucket, prefix, key):
    MBFACTOR = float(1<<20)
    size = boto3.resource('s3').Bucket(bucket).Object('{}/{}'.format(prefix, key)).content_length
    print(round(size/MBFACTOR,2), "Mb")

printObjectSize(bucket_name, train_prefix, train_key)
printObjectSize(bucket_name, test_prefix, test_key)

5.53 Mb
0.58 Mb
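
To put those sizes in perspective, here is a back-of-the-envelope sketch of what the training matrix would occupy if it were stored densely as float32. The shape comes from the `checkDataSet` output; the rest is plain arithmetic.

```python
rows, cols = 90570, 2625      # training matrix shape reported by checkDataSet
bytes_per_float32 = 4

dense_bytes = rows * cols * bytes_per_float32
nonzeros = rows * 2           # one user bit and one movie bit per row
density = nonzeros / (rows * cols)

print("dense size: %.1f MiB" % (dense_bytes / (1 << 20)))
print("non-zero entries: %d (%.4f%% of all cells)" % (nonzeros, 100 * density))
```

The dense form would be hundreds of megabytes, while fewer than 0.1% of the cells actually hold a value.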


# C. Train & Deploy the Model

In this part of the lab, you will invoke Amazon SageMaker training and testing from the notebook.

## 7. Create a factorisation machine and configure its hyperparameters

Create a [factorization machine](https://docs.aws.amazon.com/sagemaker/latest/dg/fact-machines.html) Estimator object and set the hyperparameters to be used when training.

Wait, what's a factorisation machine?

> A factorization machine is a **general-purpose supervised learning algorithm** that you can use for **both classification and regression tasks**. It is an extension of a linear model that is designed to capture interactions between features within high dimensional sparse datasets economically. For example, in a click prediction system, the factorization machine model can capture click rate patterns observed when ads from a certain ad-category are placed on pages from a certain page-category. Factorization machines are a good choice for tasks dealing with high dimensional sparse datasets, such as click prediction and item recommendation.
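
To make "an extension of a linear model that captures interactions between features" more tangible, here is a minimal sketch of the standard second-order factorization machine score: a bias `w0`, a linear term with weights `w`, and a pairwise term in which the interaction weight of features i and j is the dot product of their latent factor vectors `v[i]` and `v[j]`. All parameter values below are made up, and this says nothing about how SageMaker implements training internally. It only illustrates the model equation, in both the naive pairwise form and the equivalent cheaper form.

```python
import math


def fm_score(x, w0, w, v):
    """Second-order FM score for a dense feature vector x (cost O(k * n)).

    w0: global bias, w: one linear weight per feature,
    v: one list of k latent factors per feature.
    """
    n = len(x)
    linear = sum(w[i] * x[i] for i in range(n))
    pairwise = 0.0
    for f in range(len(v[0])):
        s = sum(v[i][f] * x[i] for i in range(n))
        s_sq = sum((v[i][f] * x[i]) ** 2 for i in range(n))
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise


def fm_score_naive(x, w0, w, v):
    """Same score, computed by looping over every feature pair i < j."""
    n = len(x)
    total = w0 + sum(w[i] * x[i] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(v[i], v[j]))
            total += dot * x[i] * x[j]
    return total


def sigmoid(z):
    """Squash a raw score into (0, 1) for binary classification."""
    return 1.0 / (1.0 + math.exp(-z))


# Made-up parameters: 2 users + 3 movies = 5 features, k = 2 factors.
w0 = 0.1
w = [0.2, -0.1, 0.05, 0.3, -0.2]
v = [[0.5, -0.3], [0.1, 0.4], [0.2, 0.2], [-0.6, 0.1], [0.3, 0.7]]
x = [1.0, 0.0, 0.0, 1.0, 0.0]   # user 1 rating movie 2

score = fm_score(x, w0, w, v)
naive = fm_score_naive(x, w0, w, v)
print(round(score, 4), round(naive, 4), round(sigmoid(score), 4))
```

For a one-hot encoded (user, movie) row, the pairwise term reduces to a single dot product between the user's and the movie's factor vectors, which is what makes the model suitable for recommendation.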

And what are [hyperparameters](https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning))?

> In machine learning, a hyperparameter is a parameter whose value is set before the learning process begins. By contrast, the values of other parameters are derived via training.

Let's invoke the training process.
Note that a warning here is expected... we can ignore it.

    WARNING:sagemaker:Couldn't call 'get_role' to get Role ARN from role name pmj-ml-lab1-SagemakerExecutionRole-J02I9C9O0B26 to get Role path.

In [101]:
output_prefix = 's3://{}/{}/output'.format(bucket_name, prefix)
  
containers = {'us-west-2': '174872318107.dkr.ecr.us-west-2.amazonaws.com/factorization-machines:latest',
             'us-east-1': '382416733822.dkr.ecr.us-east-1.amazonaws.com/factorization-machines:latest',
             'us-east-2': '404615174143.dkr.ecr.us-east-2.amazonaws.com/factorization-machines:latest',
             'ap-northeast-1': '351501993468.dkr.ecr.ap-northeast-1.amazonaws.com/factorization-machines:latest',
             'ap-northeast-2': '835164637446.dkr.ecr.ap-northeast-2.amazonaws.com/factorization-machines:latest',
             'ap-southeast-2': '712309505854.dkr.ecr.ap-southeast-2.amazonaws.com/factorization-machines:latest',
             'eu-central-1': '664544806723.dkr.ecr.eu-central-1.amazonaws.com/factorization-machines:latest',
             'eu-west-1': '438346466558.dkr.ecr.eu-west-1.amazonaws.com/factorization-machines:latest'}
  
print("The trained model will be written to: %s" % (output_prefix))

fm = sagemaker.estimator.Estimator(containers[boto3.Session().region_name],
                                  get_execution_role(), 
                                  train_instance_count=1, 
                                  train_instance_type='ml.c4.xlarge',
                                  output_path=output_prefix,
                                  sagemaker_session=sagemaker.Session())
  
fm.set_hyperparameters(feature_dim=nbFeatures,
                     predictor_type='binary_classifier',
                     mini_batch_size=1000,
                     num_factors=64,
                     epochs=100,
                     bias_init_sigma=0.1)

The trained model will be written to: s3://pmj-ml-id-lab/sagemaker/recommender-fm/output




## 8. Train the model by calling `fit()`

Now, invoke training on Amazon SageMaker.

### 8.1 Monitoring the Training Process
- While the training is running, Amazon SageMaker will continuously produce output below the cell. 
- This particular training job should take 4-5 minutes. Training is finished when you see `Billable seconds: ###` at the end of the output.
- You can also monitor progress of the training in the Amazon SageMaker console by selecting **Training jobs** in the main menu.

### 8.2 Model Output
The trained model will be written to the path defined by **output_prefix**. You can verify in the S3 console that there is a **model.tar.gz** object under that path.

You have now trained your model and are ready to start using it.

In [102]:
fm.fit({'train': train_data, 'test': test_data})

INFO:sagemaker:Creating training-job with name: factorization-machines-2018-12-20-11-13-07-491


2018-12-20 11:13:07 Starting - Starting the training job...
2018-12-20 11:13:10 Starting - Launching requested ML instances...
2018-12-20 11:14:06 Starting - Preparing the instances for training.........
2018-12-20 11:15:15 Downloading - Downloading input data..
Docker entrypoint called with argument(s): train
[12/20/2018 11:15:46 INFO 139924024543040] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/resources/default-conf.json: {u'factors_lr': u'0.0001', u'linear_init_sigma': u'0.01', u'epochs': 1, u'_wd': u'1.0', u'_num_kv_servers': u'auto', u'use_bias': u'true', u'factors_init_sigma': u'0.001', u'_log_level': u'info', u'bias_init_method': u'normal', u'linear_init_method': u'normal', u'linear_lr': u'0.001', u'factors_init_method': u'normal', u'_tuning_objective_metric': u'', u'bias_wd': u'0.01', u'use_linear': u'true', u'bias_lr': u'0.1', u'mini_batch_size': u'1000', u'_use_full_symbolic': u'true', u'batch_metrics_publish_interval': u


2018-12-20 11:15:44 Training - Training image download completed. Training in progress.
[12/20/2018 11:15:55 INFO 139924024543040] #quality_metric: host=algo-1, epoch=13, train binary_classification_accuracy <score>=0.718208791209
[12/20/2018 11:15:55 INFO 139924024543040] #quality_metric: host=algo-1, epoch=13, train binary_classification_cross_entropy <loss>=0.596807039701
[12/20/2018 11:15:55 INFO 139924024543040] #quality_metric: host=algo-1, epoch=13, train binary_f_1.000 <score>=0.765511124116
#metrics {"Metrics": {"update.time": {"count": 1, "max": 594.398021697998, "sum": 594.398021697998, "min": 594.398021697998}}, "EndTime": 1545304555.750512, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "factorization-machines"}, "StartTime": 1545304555.155709}
[12/20/2018 11:15:55 INFO 139924024543040] #progress_metric: host=algo-1, completed 14 % of epochs
#metrics {"Metrics": {"Max Batches Seen Between Resets": {

[12/20/2018 11:16:05 INFO 139924024543040] #quality_metric: host=algo-1, epoch=30, train binary_classification_accuracy <score>=0.732142857143
[12/20/2018 11:16:05 INFO 139924024543040] #quality_metric: host=algo-1, epoch=30, train binary_classification_cross_entropy <loss>=0.557855535822
[12/20/2018 11:16:05 INFO 139924024543040] #quality_metric: host=algo-1, epoch=30, train binary_f_1.000 <score>=0.768410751442
#metrics {"Metrics": {"update.time": {"count": 1, "max": 599.3161201477051, "sum": 599.3161201477051, "min": 599.3161201477051}}, "EndTime": 1545304565.846439, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "factorization-machines"}, "StartTime": 1545304565.246688}
[12/20/2018 11:16:05 INFO 139924024543040] #progress_metric: host=algo-1, completed 31 % of epochs
#metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 91, "sum": 91.0, "min": 91}, "Number of Batches Since Last Reset":

[12/20/2018 11:16:15 INFO 139924024543040] #quality_metric: host=algo-1, epoch=47, train binary_classification_accuracy <score>=0.736824175824
[12/20/2018 11:16:15 INFO 139924024543040] #quality_metric: host=algo-1, epoch=47, train binary_classification_cross_entropy <loss>=0.540047135909
[12/20/2018 11:16:15 INFO 139924024543040] #quality_metric: host=algo-1, epoch=47, train binary_f_1.000 <score>=0.770257954971
#metrics {"Metrics": {"update.time": {"count": 1, "max": 584.3830108642578, "sum": 584.3830108642578, "min": 584.3830108642578}}, "EndTime": 1545304575.83591, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "factorization-machines"}, "StartTime": 1545304575.251122}
[12/20/2018 11:16:15 INFO 139924024543040] #progress_metric: host=algo-1, completed 48 % of epochs
#metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 91, "sum": 91.0, "min": 91}, "Number of Batches Since Last Reset":

[12/20/2018 11:16:25 INFO 139924024543040] #quality_metric: host=algo-1, epoch=64, train binary_classification_accuracy <score>=0.744241758242
[12/20/2018 11:16:25 INFO 139924024543040] #quality_metric: host=algo-1, epoch=64, train binary_classification_cross_entropy <loss>=0.529100989038
[12/20/2018 11:16:25 INFO 139924024543040] #quality_metric: host=algo-1, epoch=64, train binary_f_1.000 <score>=0.775021749638
#metrics {"Metrics": {"update.time": {"count": 1, "max": 579.5421600341797, "sum": 579.5421600341797, "min": 579.5421600341797}}, "EndTime": 1545304585.780154, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "factorization-machines"}, "StartTime": 1545304585.200201}
[12/20/2018 11:16:25 INFO 139924024543040] #progress_metric: host=algo-1, completed 65 % of epochs
#metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 91, "sum": 91.0, "min": 91}, "Number of Batches Since Last Reset":

[12/20/2018 11:16:35 INFO 139924024543040] #quality_metric: host=algo-1, epoch=81, train binary_classification_accuracy <score>=0.746307692308
[12/20/2018 11:16:35 INFO 139924024543040] #quality_metric: host=algo-1, epoch=81, train binary_classification_cross_entropy <loss>=0.521562185434
[12/20/2018 11:16:35 INFO 139924024543040] #quality_metric: host=algo-1, epoch=81, train binary_f_1.000 <score>=0.776545289119
#metrics {"Metrics": {"update.time": {"count": 1, "max": 588.4160995483398, "sum": 588.4160995483398, "min": 588.4160995483398}}, "EndTime": 1545304595.819903, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "factorization-machines"}, "StartTime": 1545304595.231021}
[12/20/2018 11:16:35 INFO 139924024543040] #progress_metric: host=algo-1, completed 82 % of epochs
#metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 91, "sum": 91.0, "min": 91}, "Number of Batches Since Last Reset":

[12/20/2018 11:16:45 INFO 139924024543040] #quality_metric: host=algo-1, epoch=98, train binary_classification_accuracy <score>=0.748923076923
[12/20/2018 11:16:45 INFO 139924024543040] #quality_metric: host=algo-1, epoch=98, train binary_classification_cross_entropy <loss>=0.515757062304
[12/20/2018 11:16:45 INFO 139924024543040] #quality_metric: host=algo-1, epoch=98, train binary_f_1.000 <score>=0.778681855167
#metrics {"Metrics": {"update.time": {"count": 1, "max": 585.2870941162109, "sum": 585.2870941162109, "min": 585.2870941162109}}, "EndTime": 1545304605.807794, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "factorization-machines"}, "StartTime": 1545304605.222088}
[12/20/2018 11:16:45 INFO 139924024543040] #progress_metric: host=algo-1, completed 99 % of epochs
#metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 91, "sum": 91.0, "min": 91}, "Number of Batches Since Last Reset":

Billable seconds: 100
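
The `#quality_metric` lines in this output are regular enough to be pulled out with a regular expression if you want to track how training progressed. Below is a rough sketch: the pattern is inferred only from the lines shown in this notebook (it is not a documented log format), and the helper name `parse_quality_metrics` is made up.

```python
import re

# Pattern inferred from the log lines in this notebook, not from any official spec
METRIC_RE = re.compile(
    r"#quality_metric: host=(?P<host>[\w-]+), epoch=(?P<epoch>\d+), "
    r"train (?P<name>\S+) <(?:score|loss)>=(?P<value>[0-9.]+)"
)


def parse_quality_metrics(log_text):
    """Return {metric_name: [(epoch, value), ...]} for train metrics in the log."""
    metrics = {}
    for match in METRIC_RE.finditer(log_text):
        metrics.setdefault(match.group("name"), []).append(
            (int(match.group("epoch")), float(match.group("value")))
        )
    return metrics


# Two lines copied from the training output in this notebook
log_text = """
[12/20/2018 11:15:55 INFO 139924024543040] #quality_metric: host=algo-1, epoch=13, train binary_classification_accuracy <score>=0.718208791209
[12/20/2018 11:16:45 INFO 139924024543040] #quality_metric: host=algo-1, epoch=98, train binary_classification_accuracy <score>=0.748923076923
"""

metrics = parse_quality_metrics(log_text)
for epoch, value in metrics["binary_classification_accuracy"]:
    print("epoch %3d: accuracy %.4f" % (epoch, value))
```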


The [F1 score](https://en.wikipedia.org/wiki/F1_score) is a typical measure of success for a binary classifier:

> In statistical analysis of binary classification, the F1 score (also F-score or F-measure) is a measure of a test's accuracy.

With this set of hyperparameters, we've ended up with:

    #quality_metric: host=algo-1, test binary_classification_accuracy <score>=0.697348886532
    #quality_metric: host=algo-1, test binary_classification_cross_entropy <loss>=0.577387183342
    #quality_metric: host=algo-1, test binary_f_1.000 <score>=0.743206766241
    
Roughly 70% accuracy on the test set - not great!

Among the hyperparameters we can tune are the initialisation parameters:

- bias_init_sigma
- linear_init_sigma
- factors_init_sigma


    setting bias_init_sigma = 0.1 gave us test binary_classification_accuracy <score>=0.695758218452





## 9. Deploy a SageMaker endpoint with `deploy()`

In the last section of this lab you will deploy a development endpoint and test run some inferences of your model. **Do not start this section unless your training job from the earlier step has status Completed.**

The following will start up an endpoint instance. You can monitor progress through the notebook, or on the Amazon SageMaker console by choosing **Endpoints** in the menu.

In [49]:
fm_predictor = fm.deploy(instance_type='ml.c4.xlarge', initial_instance_count=1)



INFO:sagemaker:Creating model with name: factorization-machines-2018-12-20-09-55-12-702
INFO:sagemaker:Creating endpoint with name factorization-machines-2018-12-20-09-51-00-201


---------------------------------------------------------------------------!

## 10. Configure serialization options for the predictor

In [50]:
def fm_serializer(data):

    js = {'instances': []}

    for row in data:
        js['instances'].append({'features': row.tolist()})

    return json.dumps(js)

# For example:
# {
#     "instances": [
#         {"features": [1.5, 16.0, 14.0, 23.0]},
#         {"features": [-2.0, 100.2, 15.2, 9.2]}
#     ]
# }

fm_predictor.content_type = 'application/json'
fm_predictor.serializer = fm_serializer

# json_deserializer is in sagemaker.predictor
fm_predictor.deserializer = json_deserializer
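
To see what actually goes over the wire, here is a small standalone sketch that builds the same `instances`/`features` JSON layout from plain Python lists. The helper name `build_payload` and the 7-column toy row are assumptions for illustration; `fm_serializer` above does the equivalent for NumPy rows.

```python
import json


def build_payload(rows):
    """Wrap dense feature rows in the {"instances": [{"features": [...]}]} layout."""
    return json.dumps({"instances": [{"features": list(row)} for row in rows]})


# One toy row with 7 columns (assumed sizes: 3 users + 4 movies)
payload = build_payload([[0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0]])
decoded = json.loads(payload)
print(payload)
```

Note that the payload is dense: with the real feature count, each instance carries 2625 numbers even though only two of them are non-zero.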

# D. Test Inference



## 11. Call the predictor
Now you are ready to call the endpoint with 10 test inputs.

The cell prints a text table with three columns:

- **Prediction**
- **Score** (from the model)
- **Expected**

If the model works well, Prediction and Expected values should match on each row.

In [84]:
# Recall that test_features is our sparse matrix (not yet protobuf)
result = fm_predictor.predict(test_features[1000:1010].toarray())

def encodeFeatures(userId, nbUsers, nbMovies, movieId):

    # nbRatingsTest, nbFeatures as lines, columns
    # where nbFeatures = nbUsers+nbMovies
    lines = 1
    columns = nbUsers + nbMovies

    lineNumber = 0
    features = lil_matrix((lines, columns)).astype('float32')
    features[lineNumber, int(userId) - 1] = 1
    features[lineNumber, int(nbUsers) + int(movieId) - 1] = 1

    return features

def printPredictionWithExpectation(prediction, test_labels):
    # Header
    print("Prediction (Score) Expected")

    # Body
    for index, p in enumerate(prediction['predictions']):
        print("%10.2f %6.2f %8.2f" %
              (p['predicted_label'], p['score'], test_labels[1000 + index]))
        
printPredictionWithExpectation(result, test_labels)

Prediction (Score) Expected
      1.00   0.85     1.00
      0.00   0.48     0.00
      1.00   0.87     1.00
      0.00   0.48     0.00
      0.00   0.49     0.00
      0.00   0.34     0.00
      0.00   0.49     1.00
      0.00   0.35     0.00
      1.00   0.73     1.00
      1.00   0.58     1.00
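
As a worked example of the metrics mentioned earlier, the ten rows above can be scored by hand. The two lists below are copied from the Prediction and Expected columns. Ten samples are far too few to judge the model, so treat this purely as an illustration of how accuracy, precision, recall and F1 are computed.

```python
# Copied from the Prediction and Expected columns of the table above
predicted = [1, 0, 1, 0, 0, 0, 0, 0, 1, 1]
expected = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]

pairs = list(zip(predicted, expected))
tp = sum(1 for p, e in pairs if p == 1 and e == 1)  # true positives
fp = sum(1 for p, e in pairs if p == 1 and e == 0)  # false positives
fn = sum(1 for p, e in pairs if p == 0 and e == 1)  # false negatives
tn = sum(1 for p, e in pairs if p == 0 and e == 0)  # true negatives

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print("accuracy=%.2f precision=%.2f recall=%.2f f1=%.2f"
      % (accuracy, precision, recall, f1))
```

The single miss is the seventh row, where a score of 0.49 fell just short of a positive prediction.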


## 12. Check Some Recommendations
### 12.1 Check we can get to Athena

In [2]:
import sys
!{sys.executable} -m pip install PyAthena

You are using pip version 10.0.1, however version 18.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


### 12.2 Check the lowest-rated movies of a sample user

In [66]:
from pyathena import connect
import pandas as pd
conn = connect(s3_staging_dir='s3://pmj-ml-id-lab/sagemaker/recommender-fm/test/',
               region_name='eu-west-1')

df = pd.read_sql('select * from "pmj-ml-lab-movielens"."user_ratings" where user_id = 269 order by rating asc', conn)
df

Unnamed: 0,user_id,movie_id,rating,movie_name
0,269,63,1,"Santa Clause, The (1994)"
1,269,167,1,Private Benjamin (1980)
2,269,717,1,"Juror, The (1996)"
3,269,405,1,Mission: Impossible (1996)
4,269,940,1,Airheads (1994)
5,269,809,1,Rising Sun (1993)
6,269,1478,1,Dead Presidents (1995)
7,269,121,1,Independence Day (ID4) (1996)
8,269,660,1,Fried Green Tomatoes (1991)
9,269,234,1,Jaws (1975)


In [121]:
def printPrediction(prediction, movie):
    for index, p in enumerate(prediction['predictions']):
        print("%30s %6.2f %8.2f" %
              (movie['name'], p['predicted_label'], p['score']))

def makeAndPrintPrediction(userId, movie):
    encodedPrediction = encodeFeatures(userId, nbUsers, nbMovies, movie['id'])
    prediction = fm_predictor.predict(encodedPrediction.toarray())
    printPrediction(prediction, movie)


movies = [
    {"id": 234, "name": "Jaws"},
    {"id": 834, "name": "Halloween"},
    {"id": 219, "name": "Nightmare on Elm Street"},
    {"id": 1, "name": "Toy Story"},
    {"id": 8, "name": "Babe"},
    {"id": 53, "name": "Natural Born Killers"},
    {"id": 520, "name": "The Great Escape"},    
    {"id": 177, "name": "The Good, The Bad and The Ugly"}        
]

scaredy_cat = 269

print("%30s %4.5s %6.9s" %
      (" ", "👍", "💯"))
for movie in movies:
    makeAndPrintPrediction(scaredy_cat, movie)


                                  👍      💯
                          Jaws   0.00     0.40
                     Halloween   0.00     0.23
       Nightmare on Elm Street   0.00     0.15
                     Toy Story   0.00     0.30
                          Babe   0.00     0.46
          Natural Born Killers   0.00     0.33
              The Great Escape   1.00     0.56
The Good, The Bad and The Ugly   1.00     0.58
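
To turn these scores into an ordered list of recommendations, they can simply be sorted. A small sketch using the values printed above. The 0.5 cut-off is an assumption: it is consistent with the labels shown in this notebook, but the notebook does not state the threshold the endpoint uses.

```python
# (movie name, score) pairs copied from the output above
scores = [
    ("Jaws", 0.40),
    ("Halloween", 0.23),
    ("Nightmare on Elm Street", 0.15),
    ("Toy Story", 0.30),
    ("Babe", 0.46),
    ("Natural Born Killers", 0.33),
    ("The Great Escape", 0.56),
    ("The Good, The Bad and The Ugly", 0.58),
]

# Highest score first
ranked = sorted(scores, key=lambda item: item[1], reverse=True)

# Assumed threshold: scores above 0.5 count as a "like"
liked = [name for name, score in ranked if score > 0.5]

for name, score in ranked:
    print("%30s %5.2f" % (name, score))
```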


In [123]:
fm_predictor.delete_endpoint()

INFO:sagemaker:Deleting endpoint with name: factorization-machines-2018-12-20-09-51-00-201
