![](../../../image/colab_logo_32px.png)[Run in Google Colab](https://colab.research.google.com/github/intel-analytics/analytics-zoo/blob/master/docs/docs/colab-notebook/friesian/examples/basic_ranking.ipynb) &nbsp;![](../../../image/GitHub-Mark-32px.png)[View source on GitHub](https://github.com/intel-analytics/analytics-zoo/blob/master/docs/docs/colab-notebook/friesian/examples/basic_ranking.ipynb)

In [None]:
#
# Copyright 2016 The BigDL Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
#
# Copyright 2020 The TensorFlow Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# This example is based on the TensorFlow Recommenders example [basic ranking](https://www.tensorflow.org/recommenders/examples/basic_ranking).
# 

# Basic Ranking Example

In this tutorial, we're going to:

1. Use Friesian FeatureTable to load and preprocess the MovieLens data and split it into training and test sets.
2. Convert the preprocessed FeatureTable to an Orca TF Dataset and do some further data preprocessing.
3. Fit and evaluate the TFRS ranking model using Orca TF Estimator and Orca TF Dataset.

# Environment Preparation

### Install Java 8

Run the cell below on **Google Colab** to install JDK 1.8.

**Note**: If you run this notebook on your own computer, the cell needs root permission to install Java 8. (You may skip this cell if Java 8 is already set up on your computer.)

In [None]:
# Install jdk8
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
import os
# Set environment variable JAVA_HOME.
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
!update-alternatives --set java /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java
!java -version

### Install BigDL Friesian

You can install the latest pre-release version using `pip install --pre --upgrade bigdl-friesian[train]`.

In [1]:
# Install the latest pre-release version of BigDL Friesian.
# Installing BigDL Friesian from pip will automatically install pyspark and its other dependencies.
!pip install --pre --upgrade bigdl-friesian[train]

Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting bigdl-friesian[train]
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/79/62/d026698fa78f206dd3ac15a6f248a7cfcaff87471e22412d6ed364532eba/bigdl_friesian-2.1.0b20220612-py3-none-macosx_10_11_x86_64.whl (183 kB)
Collecting bigdl-orca==2.1.0b20220612
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/5c/2b/e998effe89061fc9a8f76363df8ec0fa53871bff28127e0c12c6604cd544/bigdl_orca-2.1.0b20220612-py3-none-macosx_10_11_x86_64.whl (28.0 MB)
Collecting bigdl-dllib==2.1.0b20220612
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/fd/9b/9cb4bd4ec2812db50e7a01a0801ab7a86c51cea287621c6540a25eef45a0/bigdl_dllib-2.1.0b20220612-py3-none-macosx_10_11_x86_64.whl (51.8 MB)

In [None]:
# Install required dependencies
!pip install tensorflow tensorflow-recommenders

In [2]:
import tensorflow as tf
print(tf.__version__)

2.8.0


## Distributed TFRS using Orca and Friesian APIs



In [3]:
import os
import math
import tensorflow as tf
import tensorflow_recommenders as tfrs

from bigdl.friesian.models import TFRSModel
from bigdl.orca import init_orca_context, stop_orca_context
from bigdl.orca import OrcaContext
from bigdl.friesian.feature import FeatureTable
from bigdl.orca.learn.tf2 import Estimator
from bigdl.orca.data.tf.data import Dataset

In [4]:
# Setting log_output to True is recommended when running BigDL in a Jupyter notebook:
# it displays the terminal's stdout and stderr in the notebook.
OrcaContext.log_output = True

cluster_mode = "local"

if cluster_mode == "local":
    init_orca_context(cores=1, memory="4g") # run in local mode
elif cluster_mode == "k8s":
    init_orca_context(cluster_mode="k8s", num_nodes=2, cores=4) # run on K8s cluster
elif cluster_mode == "yarn":
    init_orca_context(
        cluster_mode="yarn-client", cores=4, num_nodes=2, memory="2g",
        driver_memory="10g", driver_cores=1
        ) # run on Hadoop YARN cluster


Initializing orca context
Current pyspark location is : /Users/yita/anaconda3/envs/py37/lib/python3.7/site-packages/pyspark/__init__.py
Start to getOrCreate SparkContext
pyspark_submit_args is:  --driver-class-path /Users/yita/anaconda3/envs/py37/lib/python3.7/site-packages/bigdl/share/core/lib/all-2.1.0-20220314.094552-2.jar:/Users/yita/anaconda3/envs/py37/lib/python3.7/site-packages/bigdl/share/dllib/lib/bigdl-dllib-spark_2.4.6-2.1.0-SNAPSHOT-jar-with-dependencies.jar:/Users/yita/anaconda3/envs/py37/lib/python3.7/site-packages/bigdl/share/friesian/lib/bigdl-friesian-spark_2.4.6-2.1.0-SNAPSHOT-jar-with-dependencies.jar:/Users/yita/anaconda3/envs/py37/lib/python3.7/site-packages/bigdl/share/orca/lib/bigdl-orca-spark_2.4.6-2.1.0-SNAPSHOT-jar-with-dependencies.jar pyspark-shell 
2022-06-13 20:30:25 WARN  Utils:66 - Your hostname, chenyinadeMacBook-Pro.local resolves to a loopback address: 127.0.0.1; using 192.168.0.104 instead (on interface en0)
2022-06-13 20:30:25 WARN  Utils:66 - Set S

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


2022-06-13 20:31:01,170 Thread-3 WARN The bufferSize is set to 4000 but bufferedIo is false: false
2022-06-13 20:31:01,173 Thread-3 WARN The bufferSize is set to 4000 but bufferedIo is false: false
2022-06-13 20:31:01,174 Thread-3 WARN The bufferSize is set to 4000 but bufferedIo is false: false
2022-06-13 20:31:01,175 Thread-3 WARN The bufferSize is set to 4000 but bufferedIo is false: false
22-06-13 20:31:01 [Thread-3] INFO  Engine$:121 - Auto detect executor number and executor cores number
22-06-13 20:31:01 [Thread-3] INFO  Engine$:123 - Executor number is 1 and executor cores number is 1
22-06-13 20:31:02 [Thread-3] INFO  ThreadPool$:95 - Set mkl threads to 1 on thread 14
2022-06-13 20:31:02 WARN  SparkContext:66 - Using an existing SparkContext; some configuration may not take effect.
22-06-13 20:31:02 [Thread-3] INFO  Engine$:456 - Find existing spark context. Checking the spark conf...
cls.getname: com.intel.analytics.bigdl.dllib.utils.python.api.Sample
BigDLBasePickler registe



This is the only place where you need to specify local or distributed mode. See the Orca Context documentation for more details.

**Note**: You should `export HADOOP_CONF_DIR=/path/to/hadoop/conf/dir` when running on a Hadoop YARN cluster.

### Define the model

In [5]:
class SampleRankingModel(tfrs.models.Model):
    def __init__(self, user_id_num, movie_title_num):
        super().__init__()
        embedding_dim = 32
        self.task = tfrs.tasks.Ranking(
            loss=tf.keras.losses.MeanSquaredError(),
            metrics=[tf.keras.metrics.RootMeanSquaredError()]
        )
        self.user_embedding = tf.keras.layers.Embedding(user_id_num + 1, embedding_dim)
        self.movie_embedding = tf.keras.layers.Embedding(movie_title_num + 1, embedding_dim)
        self.ratings = tf.keras.Sequential([
            # Learn multiple dense layers.
            tf.keras.layers.Dense(256, activation="relu"),
            tf.keras.layers.Dense(64, activation="relu"),
            # Make rating predictions in the final layer.
            tf.keras.layers.Dense(1)
        ])

    def call(self, features):
        embeddings = tf.concat([self.user_embedding(features["user_id"]),
                                self.movie_embedding(features["movie_title"])], axis=1)
        return self.ratings(embeddings)

    def compute_loss(self, inputs, training: bool = False) -> tf.Tensor:
        labels = inputs["user_rating"]
        rating_predictions = self(inputs)
        return self.task(labels=labels, predictions=rating_predictions)
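
To make the data flow in `call` concrete, the following dependency-free sketch mimics an embedding lookup followed by concatenation. The helper names (`make_embedding_table`, `rank_input`), the tiny table sizes, and the embedding dimension of 4 are all hypothetical choices for illustration; the real model uses `tf.keras.layers.Embedding` with dimension 32 and feeds the concatenated vector into the dense layers.

```python
import random

EMBEDDING_DIM = 4  # hypothetical; the notebook uses 32


def make_embedding_table(num_ids, dim, rng):
    """Build one row per index 0..num_ids, mirroring Embedding(num_ids + 1, dim)."""
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)]
            for _ in range(num_ids + 1)]


def rank_input(user_table, movie_table, user_idx, movie_idx):
    """Look up both embeddings and concatenate them into one feature vector."""
    return user_table[user_idx] + movie_table[movie_idx]


rng = random.Random(0)
user_table = make_embedding_table(num_ids=3, dim=EMBEDDING_DIM, rng=rng)
movie_table = make_embedding_table(num_ids=5, dim=EMBEDDING_DIM, rng=rng)

# The largest valid index equals num_ids, which is why the table has num_ids + 1 rows.
features = rank_input(user_table, movie_table, user_idx=3, movie_idx=5)
print(len(user_table), len(movie_table), len(features))
```

The concatenated vector has twice the embedding dimension, which in the notebook's model means the first dense layer receives 64 input features per example.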

### Define the dataset

Use Friesian FeatureTable to get and preprocess the movielens data and split it into a training and test set.

First, we will download the [ml-1m dataset](https://grouplens.org/datasets/movielens/1m/) and unzip it.

In [12]:
!wget https://files.grouplens.org/datasets/movielens/ml-1m.zip && unzip ml-1m.zip



In [None]:
data_dir = "./ml-1m/"

# Column layouts of the "::"-separated .dat files:
# ratings.dat: UserID::MovieID::Rating::Timestamp
# users.dat (not used here): UserID::Gender::Age::Occupation::Zip-code
# movies.dat: MovieID::Title::Genres
dataset = {
    "ratings": ["userid", "movieid", "rating", "timestamp"],
    "movies": ["movieid", "title", "genres"]
}

Then we will use Friesian FeatureTable to read the .dat files.

In [None]:
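tbl_dict = dict()
for data, cols in dataset.items():
    # The files separate fields with "::". Reading them with the single-character
    # delimiter ":" leaves an empty column between every pair of real columns.
    tbl = FeatureTable.read_csv(os.path.join(data_dir, data + ".dat"),
                                delimiter=":", header=False)
    # Keep every other column to drop the empty ones.
    tmp_cols = tbl.columns[::2]
    tbl = tbl.select(tmp_cols)
    # Rename the auto-generated column names to the names defined above.
    col_dict = dict(zip(tmp_cols, cols))
    tbl = tbl.rename(col_dict)
    tbl_dict[data] = tbl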
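
The `[::2]` column selection above is easier to follow with a plain-Python sketch of the same idea. The helper `parse_dat_line` and both sample lines are illustrative only and are not part of the Friesian API; the second line is a made-up record that shows a limitation of splitting on a single `:`.

```python
def parse_dat_line(line):
    """Split on a single ':' and keep every other field, dropping the empty ones."""
    fields = line.rstrip("\n").split(":")
    return fields[::2]


# Illustrative line in the ratings layout UserID::MovieID::Rating::Timestamp.
rating_line = "1::1193::5::978300760"
raw_fields = rating_line.split(":")
print(raw_fields)                   # empty strings sit at the odd positions
print(parse_dat_line(rating_line))  # only the four real values remain

# Hypothetical movie record whose title itself contains ':'.
# The extra split shifts the fields, so the result is misaligned.
tricky_line = "2::Some Title: A Subtitle (1994)::Action"
misaligned = parse_dat_line(tricky_line)
print(misaligned)
```

In this sketch, a value that contains the delimiter does not parse cleanly, so it is worth inspecting the joined table (for example with `show`) before relying on it.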

In [None]:
full_tbl = tbl_dict["ratings"].join(tbl_dict["movies"], "movieid")\
    .dropna(columns=None).select(["userid", "title", "rating"])
full_tbl = full_tbl.cast(["rating"], "int")
full_tbl = full_tbl.cast(["userid"], "string")
full_tbl.show(5, False)

Generate a unique integer index for each value of the categorical features, then encode those columns with the generated string indices.

In [None]:
str_idx = full_tbl.gen_string_idx(["userid", "title"])
user_id_size = str_idx[0].size()
title_size = str_idx[1].size()
full_tbl = full_tbl.encode_string(["userid", "title"], str_idx)
full_tbl.show(5, False)
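
Conceptually, string indexing maps every distinct value to an integer id and then replaces the values with their ids. The sketch below illustrates that idea in plain Python. It assumes that ids start at 1 with 0 left for unknown values (which would be consistent with the `+ 1` in the embedding sizes of the model) and that more frequent values receive smaller ids. Both points are assumptions about the behavior, and `gen_index` and `encode` are hypothetical helpers, not Friesian functions.

```python
from collections import Counter


def gen_index(values):
    """Map each distinct value to an integer id starting at 1 (0 is left for unknowns)."""
    counts = Counter(values)
    # Assumption: more frequent values get smaller ids; ties are broken alphabetically.
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: idx for idx, value in enumerate(ordered, start=1)}


def encode(values, index):
    """Replace each value with its id, using 0 for values missing from the index."""
    return [index.get(v, 0) for v in values]


titles = ["Toy Story", "Heat", "Toy Story", "Jumanji", "Toy Story", "Heat"]
index = gen_index(titles)
encoded = encode(titles + ["Unseen"], index)
print(index)
print(encoded)
```

With ids in the range 1 to `len(index)`, an embedding table needs `len(index) + 1` rows so that the largest id is still a valid row.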

Sample 10% of the data and split it into training and test sets.

In [None]:
part_tbl = full_tbl.sample(0.1, seed=42)
train_tbl, test_tbl = part_tbl.random_split([0.8, 0.2])
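
As a rough mental model of sampling and splitting, the plain-Python sketch below assigns each row independently using a random draw. This is an assumption about the general approach, not a description of how Friesian or Spark implement it, and the helpers `sample_rows` and `split_rows` are hypothetical. One consequence of per-row draws is that the resulting sizes are only approximately 10% and 80/20.

```python
import random


def sample_rows(rows, fraction, seed):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


def split_rows(rows, weights, seed):
    """Assign each row to one part, with probability proportional to its weight."""
    rng = random.Random(seed)
    total = float(sum(weights))
    bounds, running = [], 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)
    parts = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last part also catches any floating-point rounding at the boundary.
            if draw < bound or i == len(bounds) - 1:
                parts[i].append(row)
                break
    return parts


rows = list(range(10_000))
sampled = sample_rows(rows, 0.1, seed=42)
train_rows, test_rows = split_rows(sampled, [0.8, 0.2], seed=42)
print(len(sampled), len(train_rows), len(test_rows))
```

Every sampled row lands in exactly one of the two parts, so the part sizes always add up to the sample size even though the individual sizes vary with the seed.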

In [None]:
train_count = train_tbl.size()
steps = math.ceil(train_count / 8192)
print("train size: ", train_count, ", steps: ", steps)

test_count = test_tbl.size()
test_steps = math.ceil(test_count / 4096)
print("test size: ", test_count, ", steps: ", test_steps)
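
The step counts above are the number of batches needed to cover the data once, rounded up so that a final partial batch still counts as a step. A small worked example, using hypothetical row counts (the real counts depend on the random sample) and a hypothetical helper `steps_for`:

```python
import math


def steps_for(num_rows, batch_size):
    """Number of batches needed to see every row once, counting a partial last batch."""
    return math.ceil(num_rows / batch_size)


# Hypothetical sizes, roughly what an 80/20 split of about 100,000 rows would give.
train_rows, test_rows = 80_000, 20_000

print(steps_for(train_rows, 8192))  # 80000 / 8192 is about 9.77, so 10 steps
print(steps_for(test_rows, 4096))   # 20000 / 4096 is about 4.88, so 5 steps
print(steps_for(8192, 8192))        # an exact multiple needs no extra step: 1
```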

Create Orca TF Datasets from the Friesian FeatureTables.

In [None]:
train_ds = Dataset.from_feature_table(train_tbl)
test_ds = Dataset.from_feature_table(test_tbl)

Once the Orca TF Dataset is created, we can perform further data preprocessing using the `map` function. The model reads `features["user_id"]` and `features["movie_title"]` in its `call` method and `inputs["user_rating"]` in its `compute_loss` method, so we rename the keys of the Dataset to match.

In [None]:
train_ds = train_ds.map(lambda x: {
    "movie_title": x["title"],
    "user_id": x["userid"],
    "user_rating": x["rating"],
})
test_ds = test_ds.map(lambda x: {
    "movie_title": x["title"],
    "user_id": x["userid"],
    "user_rating": x["rating"],
})
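
The two lambdas above apply the same renaming to each record. As a sketch of what happens to a single record, here is the same mapping on a plain Python dict. `KEY_MAP`, `rename_keys`, and the sample values are hypothetical and only illustrate the renaming; they are not part of the Orca API.

```python
# Maps FeatureTable column names to the feature names the model expects.
KEY_MAP = {"title": "movie_title", "userid": "user_id", "rating": "user_rating"}


def rename_keys(record, key_map=KEY_MAP):
    """Return a new record holding only the mapped columns under their new names."""
    return {new: record[old] for old, new in key_map.items()}


row = {"userid": 12, "title": 345, "rating": 4}  # hypothetical encoded row
renamed = rename_keys(row)
print(renamed)
```

Keeping the mapping in one dict is a design choice that avoids repeating it for the training and test sets.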

Create an Orca Estimator using the SampleRankingModel.

In [None]:
def model_creator(config):
    model = SampleRankingModel(user_id_num=user_id_size, movie_title_num=title_size)
    model = TFRSModel(model)
    model.compile(loss=tf.keras.losses.MeanSquaredError(),
                  optimizer=tf.keras.optimizers.Adagrad(config["lr"]))
    return model

In [None]:
config = {
    "lr": 0.1
}

est = Estimator.from_keras(model_creator=model_creator,
                           verbose=True,
                           config=config, backend="tf2")

Then train the model using the Orca TF Dataset.

In [None]:
est.fit(train_ds, epochs=3, batch_size=8192, steps_per_epoch=steps)

Finally, we can evaluate our model on the test set.

In [None]:
est.evaluate(test_ds, batch_size=4096, num_steps=test_steps)
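
The ranking task in the model tracks `RootMeanSquaredError`, so it helps to recall what that number measures: the square root of the mean squared difference between true and predicted ratings. A minimal sketch with made-up ratings and a hypothetical `rmse` helper (the real metric is computed by Keras during evaluation):

```python
import math


def rmse(labels, predictions):
    """Root mean squared error between two equally long, non-empty sequences."""
    if not labels or len(labels) != len(predictions):
        raise ValueError("labels and predictions must be non-empty and the same length")
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


labels = [4, 3, 5]                # made-up true ratings
predictions = [3.5, 3.0, 4.0]     # made-up model outputs
score = rmse(labels, predictions)
print(round(score, 4))            # squared errors 0.25, 0, 1 -> sqrt(1.25 / 3)
```

Because ratings are on a 1 to 5 scale, an RMSE value can be read directly as a typical error in rating points.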

Shut down the Estimator and stop the Orca context.

In [None]:
est.shutdown()
stop_orca_context()