# NCF Recommender with Explicit Feedback Using Orca Data Preprocessing and TF Estimator

In this notebook we demonstrate how to use Orca data preprocessing and the Orca TF Estimator to scale Python-based preprocessing and TensorFlow training out to a big data cluster.

In this example, we build a neural network recommendation system, Neural Collaborative Filtering (NCF), with explicit feedback.

We use Orca preprocessing to run the pandas preprocessing in parallel, and the Orca TF Estimator to train a TensorFlow graph model in the same cluster.

An explicit-feedback recommender system ([Recommendation systems: Principles, methods and evaluation](http://www.sciencedirect.com/science/article/pii/S1110866515000341)) normally prompts the user through the system interface to provide ratings for items in order to construct and improve the model. The accuracy of the recommendations depends on the quantity of ratings provided by users.

NCF ([He, 2015](https://www.comp.nus.edu.sg/~xiangnan/papers/ncf.pdf)) uses a multi-layer perceptron to learn the user–item interaction function. At the same time, NCF can express and generalize matrix factorization under its framework. A Boolean option, `includeMF`, lets users build an NCF model with or without matrix factorization.
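
To make the two branches concrete, here is a minimal NumPy sketch of an NCF-style forward pass. It is not the model trained in this notebook: the layer sizes, the random weights, and the `ncf_forward` helper with its `include_mf` flag are assumptions chosen for illustration. It only shows how a matrix factorization branch (element-wise product of embeddings) can be combined with a multi-layer perceptron branch before a 5-class output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen for illustration only.
n_users, n_items, n_classes = 6, 4, 5
mf_dim, mlp_dim, hidden = 3, 4, 8

# Separate embedding tables for the MF branch and the MLP branch.
user_mf = rng.normal(size=(n_users, mf_dim))
item_mf = rng.normal(size=(n_items, mf_dim))
user_mlp = rng.normal(size=(n_users, mlp_dim))
item_mlp = rng.normal(size=(n_items, mlp_dim))

# Randomly initialized weights; a real model would learn these.
w_hidden = rng.normal(size=(2 * mlp_dim, hidden))
b_hidden = np.zeros(hidden)
w_out_with_mf = rng.normal(size=(hidden + mf_dim, n_classes))
w_out_mlp_only = rng.normal(size=(hidden, n_classes))


def ncf_forward(users, items, include_mf=True):
    """Return class probabilities of shape (batch, n_classes)."""
    # MLP branch: concatenate user and item embeddings, apply one ReLU layer.
    x = np.concatenate([user_mlp[users], item_mlp[items]], axis=1)
    h = np.maximum(x @ w_hidden + b_hidden, 0.0)
    if include_mf:
        # MF branch: element-wise product of user and item embeddings.
        mf = user_mf[users] * item_mf[items]
        logits = np.concatenate([h, mf], axis=1) @ w_out_with_mf
    else:
        logits = h @ w_out_mlp_only
    # Numerically stable softmax over the rating classes.
    logits = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)


users = np.array([0, 2, 5])
items = np.array([1, 3, 0])
probs = ncf_forward(users, items, include_mf=True)
print(probs.shape)
```

Setting `include_mf=False` in this sketch drops the element-wise product branch and leaves a plain MLP over the concatenated embeddings.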

Data: 
* The dataset we use is MovieLens 1M ([link](https://grouplens.org/datasets/movielens/1m/)), which contains about 1 million ratings from about 6000 users on about 4000 movies. There are 5 rating levels. We classify each (user, movie) pair into one of 5 classes and evaluate the algorithm using Mean Absolute Error.
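
As a small worked example of this evaluation, the sketch below turns 5-class predictions into ratings and computes the Mean Absolute Error. The probabilities and true ratings are made-up values for illustration, and the mapping of class index `k` to rating `k + 1` is an assumption.

```python
import numpy as np

# Hypothetical class probabilities for three (user, movie) pairs;
# column k is assumed to correspond to rating k + 1.
probs = np.array([
    [0.10, 0.10, 0.20, 0.50, 0.10],  # most likely rating: 4
    [0.60, 0.20, 0.10, 0.05, 0.05],  # most likely rating: 1
    [0.00, 0.10, 0.20, 0.30, 0.40],  # most likely rating: 5
])
true_ratings = np.array([5, 1, 3])

# Pick the most likely class and shift it to the 1..5 rating scale.
predicted_ratings = probs.argmax(axis=1) + 1

# Mean Absolute Error: average distance between predicted and true ratings.
mae = np.abs(predicted_ratings - true_ratings).mean()
print(mae)  # 1.0, from absolute errors of 1, 0 and 2
```

Unlike plain classification accuracy, this metric penalizes a prediction of 5 for a true rating of 3 more than a prediction of 4.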
  
References: 
* A Keras implementation of Movie Recommendation ([notebook](https://github.com/ririw/ririw.github.io/blob/master/assets/Recommending%20movies.ipynb)) from the [blog](http://blog.richardweiss.org/2016/09/25/movie-embeddings.html).
* Neural Collaborative Filtering ([He, 2015](https://www.comp.nus.edu.sg/~xiangnan/papers/ncf.pdf))

## Initialization

Import the necessary libraries

In [3]:
import os
import zipfile
import argparse

import numpy as np
import tensorflow as tf

from bigdl.dataset import base
from sklearn.model_selection import train_test_split

import zoo.orca.learn.tf.estimator
from zoo.orca.data import SharedValue
import zoo.orca.data.pandas

import matplotlib
matplotlib.use('agg')
import matplotlib.pyplot as plt
%pylab inline

Populating the interactive namespace from numpy and matplotlib


## Data Preparation

Download the MovieLens 1M data

In [7]:
SOURCE_URL = 'http://files.grouplens.org/datasets/movielens/'
WHOLE_DATA = 'ml-1m.zip'
data_dir='/tmp'

In [8]:
local_file = base.maybe_download(WHOLE_DATA, data_dir, SOURCE_URL + WHOLE_DATA)
extracted_to = os.path.join(data_dir, "ml-1m")
if not os.path.exists(extracted_to):
    print("Extracting %s to %s" % (local_file, data_dir))
    # The context manager closes the archive even if extraction fails.
    with zipfile.ZipFile(local_file, 'r') as zip_ref:
        zip_ref.extractall(data_dir)
rating_files = os.path.join(extracted_to, "ratings.dat")

Replace "::" with ":" in ratings.dat and save the result to ratings_new.dat, because the Spark 2.4 CSV reader only supports a single-character delimiter

In [9]:
new_rating_files = os.path.join(extracted_to, "ratings_new.dat")
if not os.path.exists(new_rating_files):
    # Open the input and output files; both are closed when the block exits.
    with open(rating_files, "rt") as fin, open(new_rating_files, "wt") as fout:
        for line in fin:
            # Replace the two-character delimiter and write the line out.
            fout.write(line.replace('::', ':'))
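
The following sketch shows why this replacement is safe. It assumes the `UserID::MovieID::Rating::Timestamp` layout of ratings.dat, where every field is an integer and so never contains a colon itself. It uses two in-memory sample lines and Python's standard `csv` module rather than Spark, so it only illustrates the single-character delimiter parsing, not the Orca reading step.

```python
import csv
import io

# Two sample lines in the ratings.dat layout (UserID::MovieID::Rating::Timestamp).
raw = "1::1193::5::978300760\n1::661::3::978302109\n"

# Same replacement as in the cell above, applied to an in-memory string.
converted = raw.replace("::", ":")

# Parse with a single-character delimiter and convert every field to int.
reader = csv.reader(io.StringIO(converted), delimiter=":")
rows = [[int(field) for field in row] for row in reader]

for user_id, movie_id, rating, timestamp in rows:
    print(user_id, movie_id, rating, timestamp)
```

Each line still splits into exactly four fields after the conversion, which is what a single-character-delimiter CSV reader needs.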