# NCF Recommender with Explicit Feedback


In this notebook we demonstrate how to build a neural network recommendation system, Neural Collaborative Filtering (NCF), with explicit feedback. We use the Recommender API in Analytics Zoo to build the model and the BigDL optimizer to train it.

An explicit-feedback recommendation system ([Recommendation systems: Principles, methods and evaluation](http://www.sciencedirect.com/science/article/pii/S1110866515000341)) normally prompts the user through the system interface to provide ratings for items in order to construct and improve the model. The accuracy of the recommendations depends on the quantity of ratings provided by the user.

NCF ([He et al., 2017](https://www.comp.nus.edu.sg/~xiangnan/papers/ncf.pdf)) leverages a multi-layer perceptron to learn the user–item interaction function. At the same time, NCF can express and generalize matrix factorization under its framework. The `include_mf` (Boolean) parameter lets users build an NCF with or without matrix factorization.
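
To make the architecture described above concrete, here is a minimal pure-Python sketch of an NCF-style forward pass. It is not the Analytics Zoo implementation: the function names, the toy sizes, the random weights and the omission of biases are all assumptions made for illustration. The MLP branch concatenates a user embedding and an item embedding, and the optional MF branch multiplies a second pair of embeddings element-wise.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Small random weight matrix stored as a list of rows (illustrative only)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def dense(x, weights, activation=None):
    """Fully connected layer without bias: y_j = sum_i x_i * W[i][j]."""
    out = [sum(xi * row[j] for xi, row in zip(x, weights))
           for j in range(len(weights[0]))]
    if activation == "relu":
        out = [max(0.0, v) for v in out]
    return out


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def ncf_forward(user_id, item_id, params, include_mf=True):
    # MLP branch: concatenate user and item embeddings, then apply hidden layers.
    h = params["mlp_user"][user_id] + params["mlp_item"][item_id]
    for w in params["hidden"]:
        h = dense(h, w, activation="relu")
    # Optional MF branch: element-wise product of a second pair of embeddings.
    if include_mf:
        mf = [u * v for u, v in zip(params["mf_user"][user_id],
                                    params["mf_item"][item_id])]
        h = h + mf
    out_weights = params["out_mf"] if include_mf else params["out"]
    return softmax(dense(h, out_weights))


# Hypothetical toy sizes: 4 users, 6 items, 5 rating classes.
rng = random.Random(0)
EMBED, MF_EMBED, CLASSES = 3, 3, 5
params = {
    "mlp_user": make_matrix(4, EMBED, rng),
    "mlp_item": make_matrix(6, EMBED, rng),
    "mf_user": make_matrix(4, MF_EMBED, rng),
    "mf_item": make_matrix(6, MF_EMBED, rng),
    "hidden": [make_matrix(2 * EMBED, 4, rng), make_matrix(4, 2, rng)],
    "out": make_matrix(2, CLASSES, rng),
    "out_mf": make_matrix(2 + MF_EMBED, CLASSES, rng),
}
probs_with_mf = ncf_forward(1, 2, params, include_mf=True)
probs_without_mf = ncf_forward(1, 2, params, include_mf=False)
print(len(probs_with_mf), round(sum(probs_with_mf), 6))
```

Either way the output is a probability vector over the 5 rating classes; the MF branch only changes what is fed into the final layer.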

Data: 
* The dataset we use is MovieLens 1M ([link](https://grouplens.org/datasets/movielens/1m/)), which contains 1 million ratings from 6000 users on 4000 movies. There are 5 levels of rating. We classify each (user, movie) pair into one of 5 classes and evaluate the algorithm using Mean Absolute Error.
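
As a reminder of what Mean Absolute Error measures for this task, the sketch below averages the absolute gap between true and predicted star ratings. The helper name and the four example ratings are made up for illustration and are not results from this notebook.

```python
def mean_absolute_error(true_ratings, predicted_ratings):
    """Average absolute difference between two equally long rating lists."""
    if len(true_ratings) != len(predicted_ratings):
        raise ValueError("rating lists must have the same length")
    if not true_ratings:
        raise ValueError("rating lists must not be empty")
    total = sum(abs(t - p) for t, p in zip(true_ratings, predicted_ratings))
    return total / len(true_ratings)


# Hypothetical ratings on the 1 to 5 star scale.
true_ratings = [5, 3, 4, 1]
predicted_ratings = [4, 3, 2, 1]
mae = mean_absolute_error(true_ratings, predicted_ratings)
print(mae)  # 0.75
```

Unlike plain accuracy, this metric penalizes a prediction that is two stars off more than one that is one star off.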
  
References: 
* A Keras implementation of Movie Recommendation ([notebook](https://github.com/ririw/ririw.github.io/blob/master/assets/Recommending%20movies.ipynb)) from the [blog](http://blog.richardweiss.org/2016/09/25/movie-embeddings.html).
* Neural Collaborative Filtering ([He et al., 2017](https://www.comp.nus.edu.sg/~xiangnan/papers/ncf.pdf))

## Initialization

Import the necessary libraries.

In [1]:
from zoo.pipeline.api.keras.layers import *
from zoo.models.recommendation import UserItemFeature
from zoo.models.recommendation import NeuralCF
from zoo.common.nncontext import init_nncontext
import matplotlib
from sklearn import metrics
from operator import itemgetter
from bigdl.dataset import movielens
from bigdl.util.common import *

matplotlib.use('agg')
import matplotlib.pyplot as plt
%pylab inline



Prepending /home/juimdpp/anaconda3/envs/venv/lib/python3.7/site-packages/bigdl/share/conf/spark-bigdl.conf to sys.path
Adding /home/juimdpp/anaconda3/envs/venv/lib/python3.7/site-packages/zoo/share/lib/analytics-zoo-bigdl_0.12.2-spark_2.4.3-0.10.0-jar-with-dependencies.jar to BIGDL_JARS
Prepending /home/juimdpp/anaconda3/envs/venv/lib/python3.7/site-packages/zoo/share/conf/spark-analytics-zoo.conf to sys.path
Populating the interactive namespace from numpy and matplotlib


Initialize the NN context. This returns a SparkContext with a configuration optimized for BigDL performance.

In [2]:
sc = init_nncontext("NCF Example")

NN_CONTEXT
GETORCREATE
INIT
dict_items([('spark.shuffle.reduceLocality.enabled', 'false'), ('spark.shuffle.blockTransferService', 'nio'), ('spark.scheduler.minRegisteredResourcesRatio', '1.0'), ('spark.scheduler.maxRegisteredResourcesWaitingTime', '3600s'), ('spark.speculation', 'false'), ('spark.serializer', 'org.apache.spark.serializer.JavaSerializer')])
pyspark_submit_args is:  --driver-class-path /home/juimdpp/anaconda3/envs/venv/lib/python3.7/site-packages/zoo/share/lib/analytics-zoo-bigdl_0.12.2-spark_2.4.3-0.10.0-jar-with-dependencies.jar:/home/juimdpp/anaconda3/envs/venv/lib/python3.7/site-packages/bigdl/share/lib/bigdl-0.12.2-jar-with-dependencies.jar pyspark-shell 
GETORCREATE


In [3]:
sc.defaultParallelism
sc._conf.getAll()


[('spark.executorEnv.OMP_NUM_THREADS', '1'),
 ('spark.eventLog.enabled', 'true'),
 ('spark.serializer', 'org.apache.spark.serializer.JavaSerializer'),
 ('spark.eventLog.dir', 'file:///tmp/spark-events'),
 ('spark.shuffle.reduceLocality.enabled', 'false'),
 ('spark.shuffle.blockTransferService', 'nio'),
 ('spark.executor.id', 'driver'),
 ('spark.driver.host', 'anna-01'),
 ('spark.executorEnv.KMP_BLOCKTIME', '0'),
 ('spark.driver.port', '42263'),
 ('spark.executorEnv.KMP_AFFINITY', 'granularity=fine,compact,1,0'),
 ('spark.rdd.compress', 'True'),
 ('spark.speculation', 'false'),
 ('spark.history.fs.logDirectory', 'file:///tmp/spark-events'),
 ('spark.scheduler.maxRegisteredResourcesWaitingTime', '3600s'),
 ('spark.serializer.objectStreamReset', '100'),
 ('spark.master', 'local[*]'),
 ('spark.scheduler.minRegisteredResourcesRatio', '1.0'),
 ('spark.app.name', 'NCF Example'),
 ('spark.driver.extraClassPath',
  '/home/juimdpp/anaconda3/envs/venv/lib/python3.7/site-packages/zoo/share/lib/ana

## Data Preparation

Download and read movielens 1M data

In [4]:
# Download (if needed) and load the ratings as an ndarray of (user_id, movie_id, rating) rows.
movielens_data = movielens.get_id_ratings("/tmp/movielens/")

In [15]:
print(movielens_data)
print(type(movielens_data))
print(movielens_data.size * movielens_data.itemsize)

[[   1 1193    5]
 [   1  661    3]
 [   1  914    3]
 ...
 [6040  562    5]
 [6040 1096    4]
 [6040 1097    4]]
<class 'numpy.ndarray'>
24005016


Understand the data. Each record is in format of (userid, movieid, rating_score). UserIDs range between 1 and 6040. MovieIDs range between 1 and 3952. Ratings are made on a 5-star scale (whole-star ratings only). Counts of users and movies are recorded for later use.

In [6]:
min_user_id = np.min(movielens_data[:,0])
max_user_id = np.max(movielens_data[:,0])
min_movie_id = np.min(movielens_data[:,1])
max_movie_id = np.max(movielens_data[:,1])
rating_labels= np.unique(movielens_data[:,2])

print(movielens_data.shape)
print(min_user_id, max_user_id, min_movie_id, max_movie_id, rating_labels)

(1000209, 3)
1 6040 1 3952 [1 2 3 4 5]


Transform the original data into an RDD of samples.
The BigDL optimizer used to train the model requires data in the format of RDD([Sample](https://bigdl-project.github.io/master/#APIGuide/Data/#sample)). A `Sample` is a BigDL data structure that can be constructed from 2 numpy arrays, `feature` and `label`. The API is `Sample.from_ndarray(feature, label)`.
Here, labels are transformed to be zero-based, since the original labels start from 1.

In [7]:
def build_sample(user_id, item_id, rating):
    sample = Sample.from_ndarray(np.array([user_id, item_id]), np.array([rating]))
    return UserItemFeature(user_id, item_id, sample)
pairFeatureRdds = sc.parallelize(movielens_data)\
    .map(lambda x: build_sample(x[0], x[1], x[2]-1))
pairFeatureRdds.take(3)

[<zoo.models.recommendation.recommender.UserItemFeature at 0x7f09f2bf4c90>,
 <zoo.models.recommendation.recommender.UserItemFeature at 0x7f0a85e69150>,
 <zoo.models.recommendation.recommender.UserItemFeature at 0x7f09fea76410>]
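
The label shift in the cell above can be checked in isolation. The following sketch uses plain Python lists instead of `Sample` objects; the helper names are hypothetical, and the three records are the first rows printed earlier in this notebook.

```python
def to_feature_and_label(record):
    """Split a (user_id, movie_id, rating) record into a feature and a zero-based label."""
    user_id, movie_id, rating = record
    return [user_id, movie_id], [rating - 1]


def to_rating(class_index):
    """Map a zero-based class index back to the original 1 to 5 star rating."""
    return class_index + 1


records = [(1, 1193, 5), (1, 661, 3), (1, 914, 3)]
pairs = [to_feature_and_label(r) for r in records]
for feature, label in pairs:
    print(feature, label, "->", to_rating(label[0]))
```

With 5 classes, the zero-based labels fall in the range 0 to 4, so any predicted class index needs the inverse shift before it is compared with a star rating.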

Randomly split the data into train (80%) and validation (20%)

In [8]:
trainPairFeatureRdds, valPairFeatureRdds = pairFeatureRdds.randomSplit([0.8, 0.2], seed= 1)
valPairFeatureRdds.cache()
train_rdd= trainPairFeatureRdds.map(lambda pair_feature: pair_feature.sample)
val_rdd= valPairFeatureRdds.map(lambda pair_feature: pair_feature.sample)
val_rdd.persist()

PythonRDD[3] at RDD at PythonRDD.scala:53
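
A weighted random split of this kind assigns each record to a split independently, so the resulting sizes are only approximately 80% and 20%. The sketch below imitates that behavior in plain Python as an illustration; it is not Spark's implementation, and `random_split` is a hypothetical helper.

```python
import random


def random_split(records, weights, seed):
    """Assign each record to one split by an independent random draw."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative, normalized upper bounds, e.g. [0.8, 1.0] for weights [0.8, 0.2].
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    bounds[-1] = 1.0  # guard against floating point drift
    splits = [[] for _ in weights]
    for rec in records:
        draw = rng.random()  # uniform in [0, 1)
        for i, bound in enumerate(bounds):
            if draw < bound:
                splits[i].append(rec)
                break
    return splits


records = list(range(10000))
train, val = random_split(records, [0.8, 0.2], seed=1)
print(len(train), len(val))
```

Fixing the seed makes the split reproducible, which matters when training and validation metrics are compared across runs.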

## Build Model

In Analytics Zoo, it is simple to build an NCF model by calling the NeuralCF API. You need to specify the user count, item count and class number according to your data, then add hidden layers as needed. You can also choose to include matrix factorization in the network. The model can be fed into a BigDL Optimizer or an Analytics Zoo NNClassifier. Please refer to the documentation for more details. In this example, we demonstrate how to use the BigDL optimizer.

In [9]:
ncf = NeuralCF(user_count=max_user_id, 
               item_count=max_movie_id, 
               class_num=5, 
               hidden_layers=[20, 10], 
               include_mf = False)

HEYY
creating: createZooKerasInput
creating: createZooKerasFlatten
creating: createZooKerasSelect
CALLZOOFUNC
com.intel.analytics.zoo.tfpark.python.PythonTFPark@7509b72b
com.intel.analytics.zoo.pipeline.nnframes.python.PythonNNFrames@798fd45d
com.intel.analytics.zoo.feature.python.PythonImageFeature@334b2238
com.intel.analytics.zoo.pipeline.api.keras.python.PythonAutoGrad@1612b3b0
result com.intel.analytics.zoo.pipeline.api.autograd.Variable@4a15823
CALLZOOFUNC
com.intel.analytics.zoo.tfpark.python.PythonTFPark@7509b72b
com.intel.analytics.zoo.pipeline.nnframes.python.PythonNNFrames@798fd45d
com.intel.analytics.zoo.feature.python.PythonImageFeature@334b2238
com.intel.analytics.zoo.pipeline.api.keras.python.PythonAutoGrad@1612b3b0
result com.intel.analytics.zoo.pipeline.api.autograd.Variable@67596b1c
creating: createZooKerasFlatten
creating: createZooKerasSelect
CALLZOOFUNC
com.intel.analytics.zoo.tfpark.python.PythonTFPark@7509b72b
com.intel.analytics.zoo.pipeline.nnframes.python.Pytho

## Compile model

Compile the model with a specific optimizer, loss, and metrics for evaluation. The optimizer tries to minimize the loss of the neural net with respect to its weights and biases over the training set. To create an Optimizer in BigDL, you need to specify at least the following arguments: model (a neural network model), criterion (the loss function), training_rdd (the training dataset) and batch size. Please refer to the [ProgrammingGuide](https://bigdl-project.github.io/master/#ProgrammingGuide/optimization/) and [Optimizer](https://bigdl-project.github.io/master/#APIGuide/Optimizers/Optimizer/) documentation for more details on creating efficient optimizers.

In [10]:
ncf.compile(optimizer= "adam",
            loss= "sparse_categorical_crossentropy",
            metrics=['accuracy'])

COMP
creating: createAdam
creating: createZooKerasSparseCategoricalCrossEntropy
creating: createZooKerasSparseCategoricalAccuracy
CALLZOOFUNC
com.intel.analytics.zoo.tfpark.python.PythonTFPark@7509b72b
com.intel.analytics.zoo.pipeline.nnframes.python.PythonNNFrames@798fd45d
com.intel.analytics.zoo.feature.python.PythonImageFeature@334b2238
com.intel.analytics.zoo.pipeline.api.keras.python.PythonAutoGrad@1612b3b0
result None
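
For intuition about the loss and metric named in the compile call, the sketch below computes sparse categorical cross-entropy and accuracy by hand for two examples. It assumes the model outputs class probabilities and the labels are zero-based integers; the helper names and the probability vectors are made up for illustration.

```python
import math


def sparse_categorical_crossentropy(probabilities, labels):
    """Mean negative log-probability assigned to the true (integer) class."""
    eps = 1e-12  # avoid log(0)
    losses = [-math.log(max(p[y], eps)) for p, y in zip(probabilities, labels)]
    return sum(losses) / len(losses)


def sparse_accuracy(probabilities, labels):
    """Fraction of examples whose most probable class equals the true class."""
    hits = sum(1 for p, y in zip(probabilities, labels) if p.index(max(p)) == y)
    return hits / len(labels)


# Two hypothetical predictions over 5 rating classes.
probabilities = [
    [0.1, 0.1, 0.1, 0.1, 0.6],  # confident in class 4
    [0.2, 0.2, 0.2, 0.2, 0.2],  # undecided
]
labels = [4, 2]
loss = sparse_categorical_crossentropy(probabilities, labels)
accuracy = sparse_accuracy(probabilities, labels)
print(round(loss, 4), accuracy)
```

"Sparse" here refers to the labels being class indices rather than one-hot vectors, which matches the single-integer labels built in the data preparation step.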


## Collect logs

You can use TensorBoard to view the training and validation summaries.

In [11]:
tmp_log_dir = create_tmp_path()
ncf.set_tensorboard(tmp_log_dir, "training_ncf")

CALLZOOFUNC
com.intel.analytics.zoo.tfpark.python.PythonTFPark@7509b72b
com.intel.analytics.zoo.pipeline.nnframes.python.PythonNNFrames@798fd45d
com.intel.analytics.zoo.feature.python.PythonImageFeature@334b2238
com.intel.analytics.zoo.pipeline.api.keras.python.PythonAutoGrad@1612b3b0
result None


## Train the model

In [12]:
ncf.fit(train_rdd, 
        nb_epoch= 3, 
        batch_size= 120000, 
        validation_data=val_rdd)

FIT
CALLZOOFUNC
com.intel.analytics.zoo.tfpark.python.PythonTFPark@7509b72b
com.intel.analytics.zoo.pipeline.nnframes.python.PythonNNFrames@798fd45d
com.intel.analytics.zoo.feature.python.PythonImageFeature@334b2238
com.intel.analytics.zoo.pipeline.api.keras.python.PythonAutoGrad@1612b3b0
result None
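
With a batch size this large, each epoch takes only a handful of iterations. The back-of-the-envelope calculation below assumes that one iteration consumes one batch, that a final partial batch still counts as an iteration, and that the training split holds roughly 80% of the 1,000,209 records; the actual count depends on the random split and on how the library ends an epoch.

```python
import math


def iterations_per_epoch(num_samples, batch_size):
    """Number of batches needed to see every sample once (last batch may be partial)."""
    return math.ceil(num_samples / batch_size)


total_records = 1000209                  # from movielens_data.shape
approx_train = int(total_records * 0.8)  # approximate size of the training split
per_epoch = iterations_per_epoch(approx_train, 120000)
total_iterations = per_epoch * 3         # nb_epoch=3
print(per_epoch, total_iterations)  # 7 21
```

So the loss curves plotted later contain only a few dozen points, which is worth keeping in mind when reading them.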


## Prediction

Zoo models make inferences on the given data using the `model.predict(val_rdd)` API, which returns an RDD of results. `predict_class` returns the predicted label.

In [None]:
results = ncf.predict(val_rdd)
results.take(5)

results_class = ncf.predict_class(val_rdd)
results_class.take(5)
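
The relationship between the two calls can be pictured as an argmax over the per-class scores. The sketch below assumes a prediction is a list of 5 class probabilities and that class indices are zero-based; whether the library reports zero-based or one-based classes should be checked against its actual output. The helper name and the probability vector are hypothetical.

```python
def class_from_probabilities(probs):
    """Index of the most probable class (first index wins on ties)."""
    return max(range(len(probs)), key=probs.__getitem__)


probs = [0.05, 0.10, 0.15, 0.50, 0.20]  # made-up scores for one (user, movie) pair
predicted_class = class_from_probabilities(probs)
predicted_rating = predicted_class + 1  # undo the zero-based label shift
print(predicted_class, predicted_rating)  # 3 4
```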

In Analytics Zoo, Recommender provides 3 APIs to predict user-item pairs and to make recommendations for users or items given candidates.

Predict for user-item pairs:

In [None]:
userItemPairPrediction = ncf.predict_user_item_pair(valPairFeatureRdds)
for result in userItemPairPrediction.take(5): print(result)

Recommend 3 items for each user given candidates in the feature RDDs

In [None]:
userRecs = ncf.recommend_for_user(valPairFeatureRdds, 3)
for result in userRecs.take(5): print(result)

Recommend 3 users for each item given candidates in the feature RDDs

In [None]:
itemRecs = ncf.recommend_for_item(valPairFeatureRdds, 3)
for result in itemRecs.take(5): print(result)
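
Conceptually, both recommendation calls group the candidate predictions by one key and keep the top entries per group. The sketch below shows that idea on plain tuples; the single `score` field and the helper name are assumptions for illustration and do not reflect how Analytics Zoo ranks candidates internally.

```python
import heapq
from collections import defaultdict


def top_k_by_key(predictions, key_index, k):
    """Group (user_id, item_id, score) rows by one column and keep the k best per group."""
    grouped = defaultdict(list)
    for row in predictions:
        grouped[row[key_index]].append(row)
    return {key: heapq.nlargest(k, rows, key=lambda r: r[2])
            for key, rows in grouped.items()}


# Hypothetical scored candidates: (user_id, item_id, score).
predictions = [
    (1, 10, 0.9), (1, 11, 0.4), (1, 12, 0.7), (1, 13, 0.8),
    (2, 10, 0.3), (2, 12, 0.6),
]
user_recs = top_k_by_key(predictions, key_index=0, k=3)  # items for each user
item_recs = top_k_by_key(predictions, key_index=1, k=3)  # users for each item
print(user_recs[1])
print(item_recs[10])
```

Only pairs present in the candidate set can be recommended, which is why the calls above take the feature RDD as input.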

## Evaluation

Plot the train and validation loss curves

In [None]:
#retrieve train and validation summary object and read the loss data into ndarray's. 
train_loss = np.array(ncf.get_train_summary("Loss"))
val_loss = np.array(ncf.get_validation_summary("Loss"))
#plot the train and validation curves
# each event data is a tuple in form of (iteration_count, value, timestamp)
plt.figure(figsize = (12,6))
plt.plot(train_loss[:,0],train_loss[:,1],label='train loss')
plt.plot(val_loss[:,0],val_loss[:,1],label='val loss',color='green')
plt.scatter(val_loss[:,0],val_loss[:,1],color='green')
plt.legend();
plt.xlim(0,train_loss.shape[0]+10)
plt.grid(True)
plt.title("loss")
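
Since each summary event is a tuple of (iteration_count, value, timestamp), the same data can be inspected without plotting. The sketch below picks the iteration with the lowest validation loss; the three events are made-up placeholders, not output from this notebook, and `best_event` is a hypothetical helper.

```python
def best_event(events):
    """Return (iteration, value) of the event with the smallest value."""
    iteration, value, _timestamp = min(events, key=lambda e: e[1])
    return iteration, value


# Placeholder events in the (iteration_count, value, timestamp) layout.
val_events = [
    (7, 1.52, 1600000000.0),
    (14, 1.41, 1600000060.0),
    (21, 1.44, 1600000120.0),
]
best_iteration, best_value = best_event(val_events)
print(best_iteration, best_value)  # 14 1.41
```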

Plot the validation accuracy

In [None]:
plt.figure(figsize = (12,6))
top1 = np.array(ncf.get_validation_summary("Top1Accuracy"))
plt.plot(top1[:,0],top1[:,1],label='top1')
plt.title("top1 accuracy")
plt.grid(True)
plt.legend();