# NCF Recommender with Explicit Feedback Using Orca Data Preprocessing and TF Estimator

In this notebook we demonstrate how to use Orca data preprocessing and the Orca TF Estimator to scale Python-based preprocessing and TensorFlow training onto a big data cluster.

In this example, we build a neural network recommendation system, Neural Collaborative Filtering (NCF), with explicit feedback.

We use Orca data preprocessing to run the Pandas preprocessing in parallel, and the Orca TensorFlow Estimator to train a TensorFlow graph model in the same cluster.

For Orca data preprocessing, please refer to [Orca Data](https://analytics-zoo.github.io/master/#Orca/data/).

For the Orca TensorFlow Estimator, please refer to [Orca TF Estimator](https://analytics-zoo.github.io/master/#Orca/orca-tf-estimator/).

A recommendation system ([Recommendation systems: Principles, methods and evaluation](http://www.sciencedirect.com/science/article/pii/S1110866515000341)) normally prompts the user through the system interface to provide ratings for items in order to construct and improve the model. The accuracy of the recommendations depends on the quantity of ratings that users provide.

NCF ([He, 2015](https://www.comp.nus.edu.sg/~xiangnan/papers/ncf.pdf)) uses a multi-layer perceptron to learn the user–item interaction function. At the same time, NCF can express and generalize matrix factorization under its framework. An `includeMF` (Boolean) option is provided for users to build an NCF with or without matrix factorization.

Data: 
* The dataset we use is MovieLens 1M ([link](https://grouplens.org/datasets/movielens/1m/)), which contains 1 million ratings from about 6,000 users on about 4,000 movies. There are 5 rating levels. We classify each (user, movie) pair into one of 5 classes and evaluate the algorithm using Mean Absolute Error.
  
References: 
* A TensorFlow implementation of the NCF recommendation model ([ali-ncf](https://github.com/alibaba/ai-matrix/tree/master/macro_benchmark/NCF)).
* Neural Collaborative Filtering ([He, 2015](https://www.comp.nus.edu.sg/~xiangnan/papers/ncf.pdf))

## Initialization

Import the necessary libraries.

In [1]:
import os
import zipfile
import argparse

import numpy as np
import tensorflow as tf

from bigdl.dataset import base
from sklearn.model_selection import train_test_split

import zoo.orca.learn.tf.estimator
from zoo.orca.data import SharedValue
import zoo.orca.data.pandas

import matplotlib
matplotlib.use('agg')
import matplotlib.pyplot as plt
%pylab inline


Populating the interactive namespace from numpy and matplotlib


## Data Preparation

Download the MovieLens 1M data.

In [2]:
SOURCE_URL = 'http://files.grouplens.org/datasets/movielens/'
WHOLE_DATA = 'ml-1m.zip'
data_dir='/tmp'

In [3]:
local_file = base.maybe_download(WHOLE_DATA, data_dir, SOURCE_URL + WHOLE_DATA)
extracted_to = os.path.join(data_dir, "ml-1m")
if not os.path.exists(extracted_to):
    print("Extracting %s to %s" % (local_file, data_dir))
    # The context manager closes the archive even if extraction fails
    with zipfile.ZipFile(local_file, 'r') as zip_ref:
        zip_ref.extractall(data_dir)
rating_files = os.path.join(extracted_to, "ratings.dat")

Replace "::" with ":" in ratings.dat and save the result to ratings_new.dat, so that the file can be read with the CSV support in Spark 2.4.

In [4]:
new_rating_files = os.path.join(extracted_to, "ratings_new.dat")
if not os.path.exists(new_rating_files):
    # Stream the input line by line, rewriting the separator as we go;
    # both files are closed automatically when the block exits
    with open(rating_files, "rt") as fin, open(new_rating_files, "wt") as fout:
        for line in fin:
            fout.write(line.replace('::', ':'))

Read the CSV file into an XShards of Pandas DataFrames in parallel on Spark using the Orca data preprocessing API:

In [5]:
COLUMN_NAMES = ['user', 'item', 'label']

In [6]:
full_data = zoo.orca.data.pandas.read_csv(new_rating_files, sep=':', header=None, names=COLUMN_NAMES,
                                              usecols=[0, 1, 2], dtype={0: np.int32, 1: np.int32, 2: np.int32}) \
        .partition_by('user')

Create the user_id -> user_index and item_id -> item_index maps.

In [7]:
user_set = set(full_data['user'].unique())
item_set = set(full_data['item'].unique())

user_size = len(user_set)
item_size = len(item_set)
print('user size %d' % user_size)
print('item size %d' % item_size)

user size 6040
item size 3706


In [8]:
def re_index(s):
    """ for reindexing the item set. """
    i = 0
    s_map = {}
    for key in s:
        s_map[key] = i
        i += 1

    return s_map

In [9]:
user_map = re_index(user_set)
item_map = re_index(item_set)
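
Because `re_index` simply numbers the elements in iteration order, the same mapping can be sketched with `enumerate`. The helper name and the IDs below are made up for illustration. Since set iteration order is arbitrary, only the range of indices is guaranteed, not which ID receives which index.

```python
def re_index_compact(s):
    """Map each element of s to a contiguous index starting at 0."""
    return {key: i for i, key in enumerate(s)}

# Hypothetical raw IDs with gaps, similar in spirit to MovieLens item IDs
sample_ids = {10, 42, 7, 3952}
sample_map = re_index_compact(sample_ids)

# Every ID gets a distinct index in [0, len(sample_ids))
print(sorted(sample_map.values()))
```

Contiguous indices matter here because the embedding layers built later are sized by `user_size` and `item_size`, so every index has to fall below those sizes.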

Replace user IDs with user indices and item IDs with item indices, running the Pandas operations in parallel with the Orca XShards transform_shard() API.

In [10]:
def set_user_item(df, item_map, user_map):
    user_list = []
    item_list = []
    item_map = item_map.value
    user_map = user_map.value
    for i in range(len(df)):
        user_list.append(user_map[df['user'][i]])
        item_list.append(item_map[df['item'][i]])
    df['user'] = user_list
    df['item'] = item_list
    return df

user_map_shared_value = SharedValue(user_map)
item_map_shared_value = SharedValue(item_map)

full_data = full_data.transform_shard(set_user_item, item_map_shared_value, user_map_shared_value)
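
As an alternative to the row-by-row loop, the lookup can be sketched with `Series.map`, which is usually faster on large shards. This is a plain Pandas illustration on a made-up DataFrame with plain dicts; inside `transform_shard` the maps would still have to be unwrapped from the `SharedValue` objects via `.value`, as in the cell above.

```python
import pandas as pd

# Hypothetical shard and lookup tables, for illustration only
toy_df = pd.DataFrame({"user": [5, 9, 5], "item": [100, 200, 300], "label": [3, 4, 5]})
toy_user_map = {5: 0, 9: 1}
toy_item_map = {100: 0, 200: 1, 300: 2}

def set_user_item_vectorized(df, item_map, user_map):
    # Series.map looks up each value in the dict. Unlike the label-based
    # df['user'][i] access, it does not assume the index is 0..n-1.
    df = df.copy()
    df["user"] = df["user"].map(user_map)
    df["item"] = df["item"].map(item_map)
    return df

toy_out = set_user_item_vectorized(toy_df, toy_item_map, toy_user_map)
print(toy_out["user"].tolist(), toy_out["item"].tolist())
```

One difference to keep in mind: an ID missing from the dict becomes `NaN` with `map`, whereas the loop would raise a `KeyError`.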

Shift the labels to start from 0 using the XShards.transform_shard() API.

In [11]:
def update_label(df):
    df['label'] = df['label'] - 1
    return df

full_data = full_data.transform_shard(update_label)

Split the data into train and test sets using scikit-learn.

In [12]:
# split to train/test dataset
def split_train_test(data):
    # splitting the full set into train and test sets.
    train, test = train_test_split(data, test_size=0.2, random_state=100)
    return (train, test)

train_data, test_data = full_data.transform_shard(split_train_test).split()
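
Since `split_train_test` is applied through `transform_shard`, the 80/20 split presumably happens inside each shard independently rather than over the whole dataset at once. A minimal sketch of what happens to a single shard, using a made-up 10-row DataFrame:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 10-row shard standing in for one partition of full_data
toy_shard = pd.DataFrame({"user": range(10), "item": range(10), "label": [0] * 10})

toy_train, toy_test = train_test_split(toy_shard, test_size=0.2, random_state=100)

# 20% of the 10 rows go to the test set, and the two parts do not overlap
print(len(toy_train), len(toy_test))
```

Fixing `random_state` makes the split reproducible for a given shard.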

Convert each shard into a dictionary of x and y before training and validation.

In [13]:
def to_train_val_shard(df):
    result = {
        "x": (df['user'].to_numpy(), df['item'].to_numpy()),
        "y": df['label'].to_numpy()
    }
    return result

train_data = train_data.transform_shard(to_train_val_shard)
test_data = test_data.transform_shard(to_train_val_shard)
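
To make the resulting layout concrete, here is the same conversion applied to a made-up three-row DataFrame. The order of the arrays in "x" (user first, then item) is the order in which the model inputs are passed to the estimator later in this notebook.

```python
import pandas as pd

# Hypothetical shard contents, for illustration only
toy_df = pd.DataFrame({"user": [0, 1, 2], "item": [4, 5, 6], "label": [2, 0, 4]})

toy_xy = {
    "x": (toy_df["user"].to_numpy(), toy_df["item"].to_numpy()),
    "y": toy_df["label"].to_numpy(),
}

# "x" is a tuple with one array per model input; "y" holds the labels
print([a.shape for a in toy_xy["x"]], toy_xy["y"].shape)
```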

## Build TensorFlow Model

In [14]:
class NCF(object):
    def __init__(self, embed_size, user_size, item_size):
        self.user = tf.placeholder(dtype=tf.int32, shape=(None,))
        self.item = tf.placeholder(dtype=tf.int32, shape=(None,))
        self.label = tf.placeholder(dtype=tf.int32, shape=(None,))

        with tf.name_scope("GMF"):
            user_embed_GMF = tf.contrib.layers.embed_sequence(self.user,
                                                              vocab_size=user_size,
                                                              embed_dim=embed_size,
                                                              unique=False
                                                              )
            item_embed_GMF = tf.contrib.layers.embed_sequence(self.item,
                                                              vocab_size=item_size,
                                                              embed_dim=embed_size,
                                                              unique=False
                                                              )
            GMF = tf.multiply(user_embed_GMF, item_embed_GMF, name='GMF')

        # MLP part starts
        with tf.name_scope("MLP"):
            user_embed_MLP = tf.contrib.layers.embed_sequence(self.user,
                                                              vocab_size=user_size,
                                                              embed_dim=embed_size,
                                                              unique=False,
                                                              )

            item_embed_MLP = tf.contrib.layers.embed_sequence(self.item,
                                                              vocab_size=item_size,
                                                              embed_dim=embed_size,
                                                              unique=False
                                                              )
            interaction = tf.concat([user_embed_MLP, item_embed_MLP],
                                    axis=-1, name='interaction')

            layer1_MLP = tf.layers.dense(inputs=interaction,
                                         units=embed_size * 2,
                                         name='layer1_MLP')
            layer1_MLP = tf.layers.dropout(layer1_MLP, rate=0.2)

            layer2_MLP = tf.layers.dense(inputs=layer1_MLP,
                                         units=embed_size,
                                         name='layer2_MLP')
            layer2_MLP = tf.layers.dropout(layer2_MLP, rate=0.2)

            layer3_MLP = tf.layers.dense(inputs=layer2_MLP,
                                         units=embed_size // 2,
                                         name='layer3_MLP')
            layer3_MLP = tf.layers.dropout(layer3_MLP, rate=0.2)

        # Concatenate the two parts together
        with tf.name_scope("concatenation"):
            concatenation = tf.concat([GMF, layer3_MLP], axis=-1,
                                      name='concatenation')
            self.logits = tf.layers.dense(inputs=concatenation,
                                          units=5,
                                          name='predict')
            
            self.logits_softmax = tf.nn.softmax(self.logits)

            self.class_number = tf.argmax(self.logits_softmax, 1)

        with tf.name_scope("loss"):
            self.loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
                labels=self.label, logits=self.logits, name='loss'))

        with tf.name_scope("optimzation"):
            self.optim = tf.train.AdamOptimizer(1e-3, name='Adam')
            self.optimzer = self.optim.minimize(self.loss)

In [15]:
embedding_size = 16
model = NCF(embedding_size, user_size, item_size)

The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

Instructions for updating:
Use keras.layers.Dense instead.
Instructions for updating:
Please use `layer.__call__` method instead.
Instructions for updating:
Use keras.layers.dropout instead.
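
To make the data flow of the graph easier to follow, the forward pass can be sketched in plain NumPy. This is only an illustration under stated assumptions: the sizes are small made-up values, random weights stand in for trained parameters, biases are omitted, and dropout is left out (it only applies during training).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical small sizes; the notebook uses embed_size=16 and the MovieLens sizes
n_users, n_items, embed, n_classes = 6, 4, 8, 5

def dense(x, units):
    # Random weights stand in for trained parameters; bias omitted for brevity.
    # No activation, matching tf.layers.dense called without `activation`.
    w = rng.normal(scale=0.1, size=(x.shape[-1], units))
    return x @ w

# Separate embedding tables for the GMF and MLP branches
user_gmf = rng.normal(size=(n_users, embed))
item_gmf = rng.normal(size=(n_items, embed))
user_mlp = rng.normal(size=(n_users, embed))
item_mlp = rng.normal(size=(n_items, embed))

users = np.array([0, 3, 5])
items = np.array([1, 1, 2])

# GMF branch: element-wise product of user and item embeddings
gmf = user_gmf[users] * item_gmf[items]

# MLP branch: concatenated embeddings passed through three dense layers
mlp = np.concatenate([user_mlp[users], item_mlp[items]], axis=-1)
for units in (embed * 2, embed, embed // 2):
    mlp = dense(mlp, units)

# Merge both branches and project to one logit per rating class
logits = dense(np.concatenate([gmf, mlp], axis=-1), n_classes)
exp = np.exp(logits - logits.max(axis=-1, keepdims=True))  # numerically stable softmax
probs = exp / exp.sum(axis=-1, keepdims=True)
ncf_classes = probs.argmax(axis=-1)  # predicted class in 0..4

print(logits.shape, ncf_classes)
```

Because softmax preserves the ordering of the logits, taking the argmax of the probabilities gives the same class as taking the argmax of the logits directly.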


## Train the model

Create Orca TF Estimator for training/validation/prediction.

In [16]:
model_dir = "./"

estimator = zoo.orca.learn.tf.estimator.Estimator.from_graph(
            inputs=[model.user, model.item],
            outputs=[model.class_number],
            labels=[model.label],
            loss=model.loss,
            optimizer=model.optim,
            model_dir=model_dir,
            metrics={"loss": model.loss})






Train the TF model in parallel on Spark.

In [17]:
batch_size = 1280
epochs = 10

estimator.fit(data=train_data,
                    batch_size=batch_size,
                    epochs=epochs,
                    validation_data=test_data
            )



creating: createFakeOptimMethod
Instructions for updating:
Use `tf.cast` instead.

creating: createStatelessMetric





creating: createTFTrainingHelper
creating: createIdentityCriterion
creating: createTFParkSampleToMiniBatch
creating: createTFParkSampleToMiniBatch
creating: createEstimator
creating: createMaxEpoch
creating: createEveryEpoch
INFO:tensorflow:Restoring parameters from /tmp/tmp2txi_b2r/model


<zoo.orca.learn.tf.estimator.TFOptimizerWrapper at 0x7f25dc1f5390>

## Save TF model

Save the trained model as a TensorFlow checkpoint for later evaluation and prediction.

In [18]:
checkpoint_path = os.path.join(model_dir, "NCF.ckpt")
estimator.save_tf_checkpoint(checkpoint_path)
estimator.sess.close()

'TFNdarrayDataset' object has no attribute 'name'
'TFNdarrayDataset' object has no attribute 'name'
'TFNdarrayDataset' object has no attribute 'name'


## Prediction

Restore the saved TensorFlow checkpoint.

In [23]:
tf.reset_default_graph()

sess = tf.Session()
model = NCF(embedding_size, user_size, item_size)

saver = tf.train.Saver(tf.global_variables())
checkpoint_path = os.path.join(model_dir, "NCF.ckpt")
saver.restore(sess, checkpoint_path)

INFO:tensorflow:Restoring parameters from ./NCF.ckpt


Create a new TF Estimator for prediction.

In [24]:
estimator = zoo.orca.learn.tf.estimator.Estimator.from_graph(
        inputs=[model.user, model.item],
        outputs=[model.class_number],
        labels=[model.label],
        sess=sess,
        model_dir=model_dir
        )

Prepare the XShards for prediction. It should contain only the 'x' values.

In [26]:
def to_predict(data):
    del data['y']
    return data

predict_data = test_data.transform_shard(to_predict)
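
Note that `to_predict` deletes 'y' from the dictionary it receives. Whether that can affect `test_data` depends on how XShards evaluates transforms, which we do not verify here. A non-mutating variant, sketched on a made-up shard with a hypothetical helper name, builds a new dictionary instead:

```python
def to_predict_copy(shard):
    # Build a new dict holding only the features, leaving the input untouched
    return {"x": shard["x"]}

# Hypothetical shard with the same keys as the training data
toy_eval_shard = {"x": ([0, 1], [4, 5]), "y": [2, 0]}
toy_predict_shard = to_predict_copy(toy_eval_shard)

print(sorted(toy_predict_shard), sorted(toy_eval_shard))
```

This keeps the labels available on the original shard for a later comparison against the predictions.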

Run prediction on the XShards. The result is also an XShards, but it contains only 'prediction' values.

In [28]:
predict_result = estimator.predict(predict_data)
predictions = predict_result.collect()
print(predictions[0]['prediction'])

INFO:tensorflow:Assets added to graph.
INFO:tensorflow:No assets to write.
INFO:tensorflow:SavedModel written to: /tmp/tmpaxod7ns5/saved_model.pb
[2. 3. 2. ... 0. 2. 3.]
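
The introduction mentions Mean Absolute Error as the evaluation metric. A minimal NumPy sketch of that computation follows; the helper is our own, and the labels and predictions are made-up values, not results from this notebook. In practice the labels would come from the 'y' arrays of the test shards and the predictions from the collected 'prediction' arrays.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted classes."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    return float(np.mean(np.abs(y_true - y_pred)))

# Hypothetical labels and predicted classes (both in 0..4), not real results
toy_labels = [2, 3, 0, 4]
toy_preds = [2.0, 4.0, 2.0, 4.0]

print(mean_absolute_error(toy_labels, toy_preds))  # 0.75
```

Because both labels and predictions were shifted to the range 0..4 by the same offset, the MAE is the same as it would be on the original 1..5 rating scale.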
