# Wide & Deep Recommendation for large-scale data - Deep Learning Model Training

This notebook demonstrates distributed training of a [Wide & Deep Learning](https://arxiv.org/abs/1606.07792) model on the preprocessed features of the [Twitter RecSys Challenge 2021 dataset](http://www.recsyschallenge.com/2021/).

<img src="figures/overview-recsys.png" alt="overview-recsys" width="750"/>

First, we import the BigDL packages needed for cluster initialization and distributed training, along with TensorFlow for defining the Wide & Deep model.

In [2]:
import math
from time import time

import tensorflow as tf
from tensorflow.keras.callbacks import EarlyStopping

from bigdl.orca import init_orca_context, stop_orca_context, OrcaContext
from bigdl.orca.learn.tf2.estimator import Estimator
from bigdl.friesian.feature import FeatureTable

2022-03-23 10:55:39.569652: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-03-23 10:55:39.569679: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.


Initialize the environment on the YARN cluster. You only need to prepare the Python environment on the driver node with [Anaconda](https://www.anaconda.com/products/individual); BigDL automatically packs it and distributes it across the cluster.
During initialization you can also specify the resources allocated to the application, such as the number of nodes, the number of cores and the amount of memory. BigDL provides detailed deployment guides for [Hadoop/YARN](https://bigdl.readthedocs.io/en/latest/doc/UserGuide/hadoop.html) and [K8S](https://bigdl.readthedocs.io/en/latest/doc/UserGuide/k8s.html) clusters.
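
As a rough guide to what these settings request, the sketch below adds up the executor resources for the configuration used in the next cell (6 executors with 28 cores and 30g of memory each, plus 130g of spark.executor.memoryOverhead). It assumes that on YARN each executor container is sized as executor memory plus memory overhead; the helper name `cluster_footprint` is hypothetical and not part of BigDL.

```python
def cluster_footprint(num_executors, executor_cores,
                      executor_memory_gb, memory_overhead_gb):
    """Return (total_cores, per_container_gb, total_gb) for the executors only."""
    total_cores = num_executors * executor_cores
    # Assumption: container size = executor memory + memory overhead.
    per_container_gb = executor_memory_gb + memory_overhead_gb
    total_gb = num_executors * per_container_gb
    return total_cores, per_container_gb, total_gb


total_cores, per_container_gb, total_gb = cluster_footprint(6, 28, 30, 130)
print(total_cores, per_container_gb, total_gb)  # 168 160 960
```

The driver (4 cores, 36g) is requested on top of these figures.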

In [4]:
# To display terminal's stdout and stderr in the Jupyter notebook.
OrcaContext.log_output = True

cluster_mode = "yarn"

executor_cores = 28
num_executor = 6
executor_memory = "30g"
driver_cores = 4
driver_memory = "36g"
conf = {"spark.executor.memoryOverhead": "130g",
        "spark.network.timeout": "10000000",
        "spark.sql.broadcastTimeout": "7200",
        "spark.sql.shuffle.partitions": "2000",
        "spark.locality.wait": "0s",
        "spark.sql.crossJoin.enabled": "true",
        "spark.task.cpus": "1",
        "spark.executor.heartbeatInterval": "200s",
        "spark.driver.maxResultSize": "40G",
        "spark.eventLog.enabled": "true",
        "spark.app.name": "recsys-demo-train"}

if cluster_mode == "local":  # For local machine
    sc = init_orca_context(cluster_mode="local",
                           cores=executor_cores, memory=executor_memory)
elif cluster_mode == "yarn":  # For Hadoop/YARN cluster
    sc = init_orca_context(cluster_mode="yarn", cores=executor_cores,
                           num_nodes=num_executor, memory=executor_memory,
                           driver_cores=driver_cores, driver_memory=driver_memory,
                           conf=conf, object_store_memory="80g",
                           env={"KMP_BLOCKTIME": "1",
                                "KMP_AFFINITY": "granularity=fine,compact,1,0",
                                "OMP_NUM_THREADS": "28"})


Initializing orca context
Current pyspark location is : /root/anaconda3/envs/bigdl/lib/python3.7/site-packages/pyspark/__init__.py
Initializing SparkContext for yarn-client mode
Start to pack current python env
Collecting packages...
Packing environment at '/root/anaconda3/envs/bigdl' to '/tmp/tmpmq670r4z/python_env.tar.gz'
[########################################] | 100% Completed | 13.4s
Packing has been completed: /tmp/tmpmq670r4z/python_env.tar.gz
pyspark_submit_args is: --master yarn --deploy-mode client --archives /tmp/tmpmq670r4z/python_env.tar.gz#python_env --driver-cores 4 --driver-memory 36g --num-executors 6 --executor-cores 28 --executor-memory 30g --driver-class-path /root/anaconda3/envs/bigdl/lib/python3.7/site-packages/bigdl/share/dllib/lib/bigdl-dllib-spark_2.4.6-2.0.0-jar-with-dependencies.jar:/root/anaconda3/envs/bigdl/lib/python3.7/site-packages/bigdl/share/orca/lib/bigdl-orca-spark_2.4.6-2.0.0-jar-with-dependencies.jar:/root/anaconda3/envs/bigdl/lib/python3.7/site-

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


2022-03-23 10:55:59 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2022-03-23 10:56:00 WARN  Client:66 - Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
2022-03-23 10:57:08,470 Thread-5 WARN The bufferSize is set to 4000 but bufferedIo is false: false
2022-03-23 10:57:08,471 Thread-5 WARN The bufferSize is set to 4000 but bufferedIo is false: false
2022-03-23 10:57:08,471 Thread-5 WARN The bufferSize is set to 4000 but bufferedIo is false: false
2022-03-23 10:57:08,472 Thread-5 WARN The bufferSize is set to 4000 but bufferedIo is false: false
22-03-23 10:57:08 [Thread-5] INFO  Engine$:121 - Auto detect executor number and executor cores number
22-03-23 10:57:08 [Thread-5] INFO  Engine$:123 - Executor number is 6 and executor cores number is 28



User settings:

   KMP_AFFINITY=granularity=fine,compact,1,0
   KMP_BLOCKTIME=0
   KMP_SETTINGS=1
   OMP_NUM_THREADS=1

Effective settings:

   KMP_ABORT_DELAY=0
   KMP_ADAPTIVE_LOCK_PROPS='1,1024'
   KMP_ALIGN_ALLOC=64
   KMP_ALL_THREADPRIVATE=224
   KMP_ATOMIC_MODE=2
   KMP_BLOCKTIME=0
   KMP_CPUINFO_FILE: value is not defined
   KMP_DETERMINISTIC_REDUCTION=false
   KMP_DEVICE_THREAD_LIMIT=2147483647
   KMP_DISP_HAND_THREAD=false
   KMP_DISP_NUM_BUFFERS=7
   KMP_DUPLICATE_LIB_OK=false
   KMP_FORCE_REDUCTION: value is not defined
   KMP_FOREIGN_THREADS_THREADPRIVATE=true
   KMP_FORKJOIN_BARRIER='2,2'
   KMP_FORKJOIN_BARRIER_PATTERN='hyper,hyper'
   KMP_FORKJOIN_FRAMES=true
   KMP_FORKJOIN_FRAMES_MODE=3
   KMP_GTID_MODE=3
   KMP_HANDLE_SIGNALS=false
   KMP_HOT_TEAMS_MAX_LEVEL=1
   KMP_HOT_TEAMS_MODE=0
   KMP_INIT_AT_FORK=true
   KMP_ITT_PREPARE_DELAY=0
   KMP_LIBRARY=throughput
   KMP_LOCK_KIND=queuing
   KMP_MALLOC_POOL_INCR=1M
   KMP_MWAIT_HINTS=0
   KMP_NUM_LOCKS_IN_BLOCK=1
   KMP_

22-03-23 10:57:08 [Thread-5] INFO  ThreadPool$:95 - Set mkl threads to 1 on thread 28
2022-03-23 10:57:08 WARN  SparkContext:66 - Using an existing SparkContext; some configuration may not take effect.
22-03-23 10:57:08 [Thread-5] INFO  Engine$:446 - Find existing spark context. Checking the spark conf...
cls.getname: com.intel.analytics.bigdl.dllib.utils.python.api.Sample
BigDLBasePickler registering: bigdl.dllib.utils.common  Sample
cls.getname: com.intel.analytics.bigdl.dllib.utils.python.api.EvaluatedResult
BigDLBasePickler registering: bigdl.dllib.utils.common  EvaluatedResult
cls.getname: com.intel.analytics.bigdl.dllib.utils.python.api.JTensor
BigDLBasePickler registering: bigdl.dllib.utils.common  JTensor
cls.getname: com.intel.analytics.bigdl.dllib.utils.python.api.JActivity
BigDLBasePickler registering: bigdl.dllib.utils.common  JActivity


Load the preprocessed train and validation data.

In [5]:
train_tbl = FeatureTable.read_parquet("/path/to/preprocessed/train/data")
valid_tbl = FeatureTable.read_parquet("/path/to/preprocessed/valid/data")


Define the feature groups used as Wide & Deep inputs:

- wide_cols and cross_cols are one-hot encoded and fed into the linear (wide) layer.
- embedding_cols are converted to embedding vectors and fed into the DNN.
- indicator_cols are one-hot encoded and fed into the DNN.
- continuous_cols are fed into the DNN directly.
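
To make the encodings concrete, here is a minimal pure-Python sketch of one-hot encoding an integer category index. The function and the sample values are illustrative only; the notebook itself relies on `tf.keras.backend.one_hot` inside the model definition.

```python
def one_hot(index, depth):
    """Return a list of length `depth` with 1.0 at position `index`.

    An out-of-range index yields an all-zero vector.
    """
    return [1.0 if i == index else 0.0 for i in range(depth)]


# A column with dimension 2 (such as tweet_type in the next cell) is
# encoded by the model with depth 2 + 1 = 3.
print(one_hot(1, 3))  # [0.0, 1.0, 0.0]
print(one_hot(5, 3))  # [0.0, 0.0, 0.0]
```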

In [8]:
wide_cols = ['engaged_with_user_is_verified', 'enaging_user_is_verified']
wide_dims = [1, 1]
cross_cols = ['present_media_language']
cross_dims = [600]

embedding_cols = []
embedding_dims = []

cat_cols = ['present_media',
            'tweet_type',
            'language']
cat_dims = [12, 2, 66]
count_cols = ['engaged_with_user_follower_count',
              'engaged_with_user_following_count',
              'enaging_user_follower_count',
              'enaging_user_following_count']
count_dims = [7, 7, 7, 7]
indicator_cols = cat_cols + count_cols
indicator_dims = cat_dims + count_dims

continuous_cols = ['len_hashtags',
                   'len_domains',
                   'len_links']

column_info = { "wide_base_cols": wide_cols,
                "wide_base_dims": wide_dims,
                "wide_cross_cols": cross_cols,
                "wide_cross_dims": cross_dims,
                "indicator_cols": indicator_cols,
                "indicator_dims": indicator_dims,
                "continuous_cols": continuous_cols,
                "embed_cols": [],
                "embed_in_dims": [],
                "embed_out_dims": [],
                "label": "label"}

Define the Wide & Deep model with the TensorFlow Keras API.

In [9]:
def build_model(column_info, hidden_units=(100, 50, 25)):
    """Build the Wide & Deep Keras model described by column_info."""
    wide_base_input_layers = []
    wide_base_layers = []
    for i in range(len(column_info["wide_base_cols"])):
        wide_base_input_layers.append(tf.keras.layers.Input(shape=[], dtype="int32"))
        wide_base_layers.append(tf.keras.backend.one_hot(wide_base_input_layers[i], column_info["wide_base_dims"][i] + 1))

    wide_cross_input_layers = []
    wide_cross_layers = []
    for i in range(len(column_info["wide_cross_cols"])):
        wide_cross_input_layers.append(tf.keras.layers.Input(shape=[], dtype="int32"))
        wide_cross_layers.append(tf.keras.backend.one_hot(wide_cross_input_layers[i], column_info["wide_cross_dims"][i]))

    indicator_input_layers = []
    indicator_layers = []
    for i in range(len(column_info["indicator_cols"])):
        indicator_input_layers.append(tf.keras.layers.Input(shape=[], dtype="int32"))
        indicator_layers.append(tf.keras.backend.one_hot(indicator_input_layers[i], column_info["indicator_dims"][i] + 1))

    embed_input_layers = []
    embed_layers = []
    for i in range(len(column_info["embed_in_dims"])):
        embed_input_layers.append(tf.keras.layers.Input(shape=[], dtype="int32"))
        iembed = tf.keras.layers.Embedding(column_info["embed_in_dims"][i] + 1,
                                           output_dim=column_info["embed_out_dims"][i])(embed_input_layers[i])
        flat_embed = tf.keras.layers.Flatten()(iembed)
        embed_layers.append(flat_embed)

    continuous_input_layers = []
    continuous_layers = []
    for i in range(len(column_info["continuous_cols"])):
        continuous_input_layers.append(tf.keras.layers.Input(shape=[]))
        continuous_layers.append(tf.keras.layers.Reshape(target_shape=(1,))(continuous_input_layers[i]))

    if len(wide_base_layers + wide_cross_layers) > 1:
        wide_input = tf.keras.layers.concatenate(wide_base_layers + wide_cross_layers, axis=1)
    else:
        wide_input = (wide_base_layers + wide_cross_layers)[0]
    wide_out = tf.keras.layers.Dense(1)(wide_input)
    if len(indicator_layers + embed_layers + continuous_layers) > 1:
        deep_concat = tf.keras.layers.concatenate(indicator_layers +
                                                  embed_layers +
                                                  continuous_layers, axis=1)
    else:
        deep_concat = (indicator_layers + embed_layers + continuous_layers)[0]
    linear = deep_concat
    for ilayer in range(0, len(hidden_units)):
        linear_mid = tf.keras.layers.Dense(hidden_units[ilayer])(linear)
        bn = tf.keras.layers.BatchNormalization()(linear_mid)
        relu = tf.keras.layers.ReLU()(bn)
        dropout = tf.keras.layers.Dropout(0.1)(relu)
        linear = dropout
    deep_out = tf.keras.layers.Dense(1)(linear)
    added = tf.keras.layers.add([wide_out, deep_out])
    out = tf.keras.layers.Activation("sigmoid")(added)
    model = tf.keras.models.Model(wide_base_input_layers +
                                  wide_cross_input_layers +
                                  indicator_input_layers +
                                  embed_input_layers +
                                  continuous_input_layers,
                                  out)

    return model
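
As a sanity check on the input sizes, the following sketch computes the width of the wide and deep inputs that build_model produces for the column_info defined earlier. It follows the code as written: base wide columns and indicator columns are one-hot encoded with depth dim + 1, cross columns with depth dim, and each continuous column contributes a single value. The names `wide_width` and `deep_width` are illustrative.

```python
wide_dims = [1, 1]
cross_dims = [600]
cat_dims = [12, 2, 66]
count_dims = [7, 7, 7, 7]
indicator_dims = cat_dims + count_dims
num_continuous = 3   # len_hashtags, len_domains, len_links
embed_out_dims = []  # no embedding columns in this configuration

# Wide part: one-hot base columns (dim + 1) plus one-hot cross columns (dim).
wide_width = sum(d + 1 for d in wide_dims) + sum(cross_dims)
# Deep part: one-hot indicator columns, flattened embeddings, raw continuous values.
deep_width = (sum(d + 1 for d in indicator_dims)
              + sum(embed_out_dims)
              + num_continuous)

print(wide_width)  # 604
print(deep_width)  # 118
```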

The following figure shows the architecture of the Wide & Deep model.

<img src="figures/wnd.png" alt="wnd" width="750"/>

Set the hyperparameters for model training.

In [11]:
config = {
    "lr": 0.0001,
    "column_info": column_info,
    "inter_op_parallelism": 4,
    "intra_op_parallelism": 24
}
batch_size = 25600

Define the model_creator function, which creates the Wide & Deep model on each node and compiles it with an optimizer, a loss and metrics.

In [12]:
def model_creator(config):
    model = build_model(column_info=config["column_info"],
                        hidden_units=[1024, 1024])
    optimizer = tf.keras.optimizers.Adam(config["lr"])
    model.compile(optimizer=optimizer,
                  loss='binary_crossentropy',
                  metrics=['binary_accuracy', 'binary_crossentropy', 'AUC', 'Precision', 'Recall'])
    return model

The training architecture is shown in the following diagram. Distributed training is built on [RayOnSpark](https://medium.com/riselab/rayonspark-running-emerging-ai-applications-on-big-data-clusters-with-ray-and-analytics-zoo-923e0136ed6a) and [RaySGD](https://medium.com/distributed-computing-with-ray/faster-and-cheaper-pytorch-with-raysgd-a5a44d4fd220). On the underlying YARN cluster, the Spark driver first launches Spark executors across the cluster, and the existing SparkContext then creates a RayContext to launch Ray on the same cluster. The Ray processes run alongside the Spark executors and are managed by them. TensorFlow runners live inside the Ray processes and train on the local in-memory Spark data partitions. The runners communicate and synchronize with each other via [TensorFlow distribute MirroredStrategy](https://www.tensorflow.org/api_docs/python/tf/distribute/MirroredStrategy).

The cluster setup and implementation details are transparent to users, who only need to create a TensorFlow Estimator in BigDL to drive the entire process.

<img src="figures/e2etrain.png" alt="e2etrain" width="800"/>
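
Conceptually, synchronous data-parallel training has every worker compute gradients on its own data partition, average them across workers, and apply the same update everywhere so that the model replicas stay identical. The sketch below illustrates that idea in plain Python for a one-parameter least-squares model. It is a toy illustration of the general technique, not how BigDL, Ray or TensorFlow implement it, and all names and data in it are made up.

```python
def local_gradient(w, shard):
    """Gradient of the mean squared error for y ~ w * x on one worker's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values):
    """Stand-in for the all-reduce step: every worker receives the mean."""
    return sum(values) / len(values)


def train(shards, w=0.0, lr=0.05, steps=200):
    for _ in range(steps):
        grads = [local_gradient(w, shard) for shard in shards]  # one per worker
        w -= lr * all_reduce_mean(grads)  # identical update on every replica
    return w


# Two workers, each holding a partition of data generated from y = 3 * x.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
w = train(shards)
print(round(w, 3))  # approximately 3.0
```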

Create the TensorFlow Estimator in BigDL for distributed Wide & Deep training.

In [14]:
estimator = Estimator.from_keras(
    model_creator=model_creator,
    verbose=True,
    config=config)


Launching Ray on cluster with Spark barrier mode



Start to launch ray driver on local
Executing command: ray start --address 172.168.0.107:38853 --redis-password 123456 --num-cpus 0 --object-store-memory 80000000000 --node-ip-address 172.168.0.101
2022-03-23 10:58:43,531	INFO scripts.py:747 -- Local node IP: 172.168.0.101
2022-03-23 10:58:43,654	SUCC scripts.py:755 -- --------------------
2022-03-23 10:58:43,654	SUCC scripts.py:756 -- Ray runtime started.
2022-03-23 10:58:43,654	SUCC scripts.py:757 -- --------------------
2022-03-23 10:58:43,654	INFO scripts.py:759 -- To terminate the Ray runtime, run
2022-03-23 10:58:43,655	INFO scripts.py:760 --   ray stop

E0323 10:58:43.621758087  303499 fork_posix.cc:70]           Fork support is only compatible with the epoll1 and poll polling strategies
E0323 10:58:43.629680754  303499 fork_posix.cc:70]           Fork support is only compatible with the epoll1 and poll polling strategies
[2022-03-23 10:58:43,653 I 303499 303499] global_state_accessor.cc:360: This node has an IP address of 172.1

2022-03-23 10:58:44,960	INFO worker.py:843 -- Connecting to existing Ray cluster at address: 172.168.0.107:38853


{'node_ip_address': '172.168.0.101', 'raylet_ip_address': '172.168.0.101', 'redis_address': '172.168.0.107:38853', 'object_store_address': '/tmp/ray/session_2022-03-23_10-58-37_224781_259004/sockets/plasma_store', 'raylet_socket_name': '/tmp/ray/session_2022-03-23_10-58-37_224781_259004/sockets/raylet', 'webui_url': '172.168.0.107:8265', 'session_dir': '/tmp/ray/session_2022-03-23_10-58-37_224781_259004', 'metrics_export_port': 56881, 'node_id': '21217676fa1a3e4eee9e05be637197366ff94ddb13ed6ecc56433c9b'}


[2m[36m(Worker pid=259340, ip=172.168.0.107)[0m 2022-03-23 10:58:46.251583: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: python_env/lib:python_env/lib/python3.7/lib-dynload::/opt/cloudera/parcels/CDH-5.15.2-1.cdh5.15.2.p0.3/lib/hadoop/lib/native
[2m[36m(Worker pid=259340, ip=172.168.0.107)[0m 2022-03-23 10:58:46.251618: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
[2m[36m(Worker pid=279949, ip=172.168.0.102)[0m 2022-03-23 10:58:46.293903: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: python_env/lib:python_env/lib/python3.7/lib-dynload::/opt/cloudera/p

Each Keras epoch is set to 50 training steps, so the model is evaluated on the validation dataset every 50 iterations. We add an EarlyStopping callback that stops training once the validation AUC has failed to improve for three consecutive evaluations.
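
The patience rule can be sketched in a few lines of plain Python. This is a simplified illustration of the idea for a metric that should increase, not the Keras implementation, and the AUC values are made up.

```python
def early_stop_round(val_aucs, patience=3):
    """Return the 0-based evaluation round at which training stops, or None.

    Training stops once the metric has not improved on its best value
    for `patience` consecutive rounds.
    """
    best = float("-inf")
    wait = 0
    for round_idx, auc in enumerate(val_aucs):
        if auc > best:
            best = auc
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return round_idx
    return None


# Made-up validation AUC values, one per evaluation round.
history = [0.660, 0.670, 0.672, 0.671, 0.670, 0.669, 0.680]
print(early_stop_round(history))  # 5
```

In this example the best value is reached in round 2, the next three rounds fail to beat it, and training stops before the later improvement in round 6 is ever seen.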

In [15]:
train_count = train_tbl.size()
print("Total number of train records: {}".format(train_count))
total_steps = math.ceil(train_count / batch_size)
steps_per_epoch = 50
epochs = math.ceil(total_steps / steps_per_epoch)   # To train the full dataset for an entire epoch
val_count = valid_tbl.size()
print("Total number of val records: {}".format(val_count))
val_steps = math.ceil(val_count / batch_size)

callbacks = [EarlyStopping(monitor='val_auc', mode='max', verbose=1, patience=3)]

2022-03-23 10:59:01 WARN  Utils:66 - Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.



Total number of train records: 747694282
Total number of val records: 14461760
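
With the record counts printed above and the settings from this notebook (batch_size = 25600, steps_per_epoch = 50), the step arithmetic works out as follows. The result is consistent with the `Epoch 1/585` lines in the training log further down.

```python
import math

train_count = 747694282
val_count = 14461760
batch_size = 25600
steps_per_epoch = 50

total_steps = math.ceil(train_count / batch_size)   # steps for one full pass
epochs = math.ceil(total_steps / steps_per_epoch)   # 50-step Keras "epochs"
val_steps = math.ceil(val_count / batch_size)       # steps per validation run

print(total_steps, epochs, val_steps)  # 29207 585 565
```

Each Keras "epoch" here is therefore a 50-step slice of the training data rather than a full pass over it.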


Great! Let's launch distributed training across the underlying YARN cluster. The training progress and per-step metrics are printed as training runs.

In [18]:
def label_cols(column_info):
    return [column_info["label"]]

def feature_cols(column_info):
    return (column_info["wide_base_cols"] + column_info["wide_cross_cols"] +
            column_info["indicator_cols"] + column_info["embed_cols"] +
            column_info["continuous_cols"])

start = time()
estimator.fit(data=train_tbl.df,
              epochs=epochs,
              batch_size=batch_size,
              steps_per_epoch=steps_per_epoch,
              validation_data=valid_tbl.df,
              validation_steps=val_steps,
              callbacks=callbacks,
              feature_cols=feature_cols(column_info),
              label_cols=label_cols(column_info))
end = time()
print("Training time is: ", end - start)

Try to unpersist an uncached rdd


[2m[36m(Worker pid=314929, ip=172.168.0.106)[0m Epoch 1/585


[2m[36m(Worker pid=314929, ip=172.168.0.106)[0m 2022-03-23 11:13:27.299909: W tensorflow/core/framework/dataset.cc:768] Input of GeneratorDatasetOp::Dataset will not be optimized because the dataset does not implement the AsGraphDefInternal() method needed to apply optimizations.


 1/50 [..............................] - ETA: 4:20 - loss: 0.8724 - binary_accuracy: 0.4895 - binary_crossentropy: 0.8724 - auc: 0.4726 - precision: 0.4919 - recall: 0.7762
 2/50 [>.............................] - ETA: 9s - loss: 0.8396 - binary_accuracy: 0.5014 - binary_crossentropy: 0.8396 - auc: 0.4978 - precision: 0.4987 - recall: 0.8051  
 3/50 [>.............................] - ETA: 8s - loss: 0.8149 - binary_accuracy: 0.5174 - binary_crossentropy: 0.8149 - auc: 0.5247 - precision: 0.5102 - recall: 0.8144
 4/50 [=>............................] - ETA: 8s - loss: 0.7979 - binary_accuracy: 0.5313 - binary_crossentropy: 0.7979 - auc: 0.5474 - precision: 0.5197 - recall: 0.8127
 5/50 [==>...........................] - ETA: 8s - loss: 0.7878 - binary_accuracy: 0.5408 - binary_crossentropy: 0.7878 - auc: 0.5627 - precision: 0.5263 - recall: 0.8050
 6/50 [==>...........................] - ETA: 8s - loss: 0.7790 - binary_accuracy: 0.5500 - binary_crossentropy: 0.7790 - auc: 0.5759 - preci

[2m[36m(Worker pid=314929, ip=172.168.0.106)[0m Epoch 2/585
 1/50 [..............................] - ETA: 13s - loss: 0.6617 - binary_accuracy: 0.6115 - binary_crossentropy: 0.6617 - auc: 0.6546 - precision: 0.6065 - recall: 0.6300
 2/50 [>.............................] - ETA: 10s - loss: 0.6624 - binary_accuracy: 0.6108 - binary_crossentropy: 0.6624 - auc: 0.6538 - precision: 0.6051 - recall: 0.6286
 3/50 [>.............................] - ETA: 9s - loss: 0.6626 - binary_accuracy: 0.6113 - binary_crossentropy: 0.6626 - auc: 0.6534 - precision: 0.6050 - recall: 0.6295 
 4/50 [=>............................] - ETA: 9s - loss: 0.6621 - binary_accuracy: 0.6115 - binary_crossentropy: 0.6621 - auc: 0.6541 - precision: 0.6052 - recall: 0.6305
 5/50 [==>...........................] - ETA: 8s - loss: 0.6625 - binary_accuracy: 0.6116 - binary_crossentropy: 0.6625 - auc: 0.6537 - precision: 0.6044 - recall: 0.6316
 6/50 [==>...........................] - ETA: 8s - loss: 0.6623 - binary_accura



[2m[36m(Worker pid=314929, ip=172.168.0.106)[0m Epoch 3/585
 1/50 [..............................] - ETA: 11s - loss: 0.6547 - binary_accuracy: 0.6221 - binary_crossentropy: 0.6547 - auc: 0.6662 - precision: 0.6178 - recall: 0.6274
 2/50 [>.............................] - ETA: 8s - loss: 0.6559 - binary_accuracy: 0.6194 - binary_crossentropy: 0.6559 - auc: 0.6633 - precision: 0.6135 - recall: 0.6232 
 3/50 [>.............................] - ETA: 8s - loss: 0.6553 - binary_accuracy: 0.6191 - binary_crossentropy: 0.6553 - auc: 0.6633 - precision: 0.6136 - recall: 0.6228
 4/50 [=>............................] - ETA: 8s - loss: 0.6551 - binary_accuracy: 0.6189 - binary_crossentropy: 0.6551 - auc: 0.6632 - precision: 0.6127 - recall: 0.6242
 5/50 [==>...........................] - ETA: 8s - loss: 0.6556 - binary_accuracy: 0.6192 - binary_crossentropy: 0.6556 - auc: 0.6626 - precision: 0.6128 - recall: 0.6250
 6/50 [==>...........................] - ETA: 7s - loss: 0.6557 - binary_accurac



[2m[36m(Worker pid=314929, ip=172.168.0.106)[0m Epoch 4/585
 1/50 [..............................] - ETA: 9s - loss: 0.6535 - binary_accuracy: 0.6242 - binary_crossentropy: 0.6535 - auc: 0.6668 - precision: 0.6217 - recall: 0.6360
 2/50 [>.............................] - ETA: 10s - loss: 0.6525 - binary_accuracy: 0.6236 - binary_crossentropy: 0.6525 - auc: 0.6683 - precision: 0.6179 - recall: 0.6358
 3/50 [>.............................] - ETA: 9s - loss: 0.6522 - binary_accuracy: 0.6229 - binary_crossentropy: 0.6522 - auc: 0.6685 - precision: 0.6175 - recall: 0.6343 
 4/50 [=>............................] - ETA: 8s - loss: 0.6527 - binary_accuracy: 0.6219 - binary_crossentropy: 0.6527 - auc: 0.6677 - precision: 0.6184 - recall: 0.6316
 5/50 [==>...........................] - ETA: 8s - loss: 0.6535 - binary_accuracy: 0.6211 - binary_crossentropy: 0.6535 - auc: 0.6664 - precision: 0.6169 - recall: 0.6301
 6/50 [==>...........................] - ETA: 8s - loss: 0.6539 - binary_accurac



[2m[36m(Worker pid=314929, ip=172.168.0.106)[0m Epoch 5/585
 1/50 [..............................] - ETA: 9s - loss: 0.6526 - binary_accuracy: 0.6253 - binary_crossentropy: 0.6526 - auc: 0.6681 - precision: 0.6164 - recall: 0.6440
 2/50 [>.............................] - ETA: 8s - loss: 0.6540 - binary_accuracy: 0.6211 - binary_crossentropy: 0.6540 - auc: 0.6656 - precision: 0.6155 - recall: 0.6381
 3/50 [>.............................] - ETA: 8s - loss: 0.6531 - binary_accuracy: 0.6224 - binary_crossentropy: 0.6531 - auc: 0.6673 - precision: 0.6166 - recall: 0.6384
 4/50 [=>............................] - ETA: 8s - loss: 0.6527 - binary_accuracy: 0.6232 - binary_crossentropy: 0.6527 - auc: 0.6681 - precision: 0.6170 - recall: 0.6393
 5/50 [==>...........................] - ETA: 8s - loss: 0.6528 - binary_accuracy: 0.6228 - binary_crossentropy: 0.6528 - auc: 0.6680 - precision: 0.6162 - recall: 0.6374
 6/50 [==>...........................] - ETA: 7s - loss: 0.6525 - binary_accuracy:



[2m[36m(Worker pid=314929, ip=172.168.0.106)[0m Epoch 6/585
 1/50 [..............................] - ETA: 12s - loss: 0.6524 - binary_accuracy: 0.6213 - binary_crossentropy: 0.6524 - auc: 0.6658 - precision: 0.6171 - recall: 0.6241
 2/50 [>.............................] - ETA: 10s - loss: 0.6519 - binary_accuracy: 0.6214 - binary_crossentropy: 0.6519 - auc: 0.6671 - precision: 0.6201 - recall: 0.6238
 3/50 [>.............................] - ETA: 9s - loss: 0.6518 - binary_accuracy: 0.6217 - binary_crossentropy: 0.6518 - auc: 0.6679 - precision: 0.6177 - recall: 0.6259 
 4/50 [=>............................] - ETA: 8s - loss: 0.6506 - binary_accuracy: 0.6233 - binary_crossentropy: 0.6506 - auc: 0.6700 - precision: 0.6190 - recall: 0.6292
 5/50 [==>...........................] - ETA: 8s - loss: 0.6508 - binary_accuracy: 0.6232 - binary_crossentropy: 0.6508 - auc: 0.6698 - precision: 0.6189 - recall: 0.6304
 6/50 [==>...........................] - ETA: 8s - loss: 0.6515 - binary_accura



[2m[36m(Worker pid=314929, ip=172.168.0.106)[0m Epoch 7/585
 1/50 [..............................] - ETA: 11s - loss: 0.6478 - binary_accuracy: 0.6271 - binary_crossentropy: 0.6478 - auc: 0.6737 - precision: 0.6190 - recall: 0.6316
 2/50 [>.............................] - ETA: 8s - loss: 0.6489 - binary_accuracy: 0.6255 - binary_crossentropy: 0.6489 - auc: 0.6718 - precision: 0.6169 - recall: 0.6298 
 3/50 [>.............................] - ETA: 8s - loss: 0.6484 - binary_accuracy: 0.6262 - binary_crossentropy: 0.6484 - auc: 0.6728 - precision: 0.6198 - recall: 0.6300
 4/50 [=>............................] - ETA: 8s - loss: 0.6481 - binary_accuracy: 0.6265 - binary_crossentropy: 0.6481 - auc: 0.6733 - precision: 0.6194 - recall: 0.6309
 5/50 [==>...........................] - ETA: 8s - loss: 0.6484 - binary_accuracy: 0.6262 - binary_crossentropy: 0.6484 - auc: 0.6730 - precision: 0.6194 - recall: 0.6300
 6/50 [==>...........................] - ETA: 7s - loss: 0.6480 - binary_accurac



[2m[36m(Worker pid=314929, ip=172.168.0.106)[0m Epoch 8/585
 1/50 [..............................] - ETA: 10s - loss: 0.6498 - binary_accuracy: 0.6251 - binary_crossentropy: 0.6498 - auc: 0.6724 - precision: 0.6171 - recall: 0.6364
 2/50 [>.............................] - ETA: 8s - loss: 0.6497 - binary_accuracy: 0.6271 - binary_crossentropy: 0.6497 - auc: 0.6725 - precision: 0.6205 - recall: 0.6381 
 3/50 [>.............................] - ETA: 8s - loss: 0.6493 - binary_accuracy: 0.6259 - binary_crossentropy: 0.6493 - auc: 0.6726 - precision: 0.6201 - recall: 0.6349
 4/50 [=>............................] - ETA: 8s - loss: 0.6487 - binary_accuracy: 0.6263 - binary_crossentropy: 0.6487 - auc: 0.6732 - precision: 0.6207 - recall: 0.6336
 5/50 [==>...........................] - ETA: 8s - loss: 0.6485 - binary_accuracy: 0.6266 - binary_crossentropy: 0.6485 - auc: 0.6734 - precision: 0.6212 - recall: 0.6319
 6/50 [==>...........................] - ETA: 8s - loss: 0.6483 - binary_accurac



[2m[36m(Worker pid=314929, ip=172.168.0.106)[0m Epoch 9/585
 1/50 [..............................] - ETA: 12s - loss: 0.6463 - binary_accuracy: 0.6300 - binary_crossentropy: 0.6463 - auc: 0.6757 - precision: 0.6237 - recall: 0.6336
 2/50 [>.............................] - ETA: 9s - loss: 0.6453 - binary_accuracy: 0.6306 - binary_crossentropy: 0.6453 - auc: 0.6776 - precision: 0.6251 - recall: 0.6350 
 3/50 [>.............................] - ETA: 8s - loss: 0.6468 - binary_accuracy: 0.6290 - binary_crossentropy: 0.6468 - auc: 0.6756 - precision: 0.6245 - recall: 0.6313
 4/50 [=>............................] - ETA: 7s - loss: 0.6472 - binary_accuracy: 0.6289 - binary_crossentropy: 0.6472 - auc: 0.6752 - precision: 0.6249 - recall: 0.6320
 5/50 [==>...........................] - ETA: 8s - loss: 0.6485 - binary_accuracy: 0.6275 - binary_crossentropy: 0.6485 - auc: 0.6733 - precision: 0.6237 - recall: 0.6312
 6/50 [==>...........................] - ETA: 7s - loss: 0.6484 - binary_accurac



[2m[36m(Worker pid=314929, ip=172.168.0.106)[0m Epoch 10/585
 1/50 [..............................] - ETA: 11s - loss: 0.6495 - binary_accuracy: 0.6244 - binary_crossentropy: 0.6495 - auc: 0.6705 - precision: 0.6233 - recall: 0.6261
 2/50 [>.............................] - ETA: 8s - loss: 0.6471 - binary_accuracy: 0.6253 - binary_crossentropy: 0.6471 - auc: 0.6734 - precision: 0.6239 - recall: 0.6267 
 3/50 [>.............................] - ETA: 8s - loss: 0.6478 - binary_accuracy: 0.6247 - binary_crossentropy: 0.6478 - auc: 0.6728 - precision: 0.6233 - recall: 0.6275
 4/50 [=>............................] - ETA: 7s - loss: 0.6488 - binary_accuracy: 0.6243 - binary_crossentropy: 0.6488 - auc: 0.6718 - precision: 0.6229 - recall: 0.6279
 5/50 [==>...........................] - ETA: 7s - loss: 0.6490 - binary_accuracy: 0.6242 - binary_crossentropy: 0.6490 - auc: 0.6717 - precision: 0.6229 - recall: 0.6287
 6/50 [==>...........................] - ETA: 7s - loss: 0.6491 - binary_accura



[2m[36m(Worker pid=314929, ip=172.168.0.106)[0m Epoch 11/585
 1/50 [..............................] - ETA: 10s - loss: 0.6472 - binary_accuracy: 0.6282 - binary_crossentropy: 0.6472 - auc: 0.6751 - precision: 0.6209 - recall: 0.6417
 2/50 [>.............................] - ETA: 8s - loss: 0.6467 - binary_accuracy: 0.6291 - binary_crossentropy: 0.6467 - auc: 0.6758 - precision: 0.6234 - recall: 0.6420 
 3/50 [>.............................] - ETA: 8s - loss: 0.6466 - binary_accuracy: 0.6293 - binary_crossentropy: 0.6466 - auc: 0.6759 - precision: 0.6235 - recall: 0.6428
 4/50 [=>............................] - ETA: 8s - loss: 0.6473 - binary_accuracy: 0.6294 - binary_crossentropy: 0.6473 - auc: 0.6750 - precision: 0.6256 - recall: 0.6416
 5/50 [==>...........................] - ETA: 8s - loss: 0.6476 - binary_accuracy: 0.6286 - binary_crossentropy: 0.6476 - auc: 0.6744 - precision: 0.6245 - recall: 0.6404
 6/50 [==>...........................] - ETA: 7s - loss: 0.6479 - binary_accura



[2m[36m(Worker pid=314929, ip=172.168.0.106)[0m Epoch 12/585
 1/50 [..............................] - ETA: 10s - loss: 0.6475 - binary_accuracy: 0.6250 - binary_crossentropy: 0.6475 - auc: 0.6734 - precision: 0.6206 - recall: 0.6374
 2/50 [>.............................] - ETA: 7s - loss: 0.6481 - binary_accuracy: 0.6259 - binary_crossentropy: 0.6481 - auc: 0.6731 - precision: 0.6200 - recall: 0.6382 
 3/50 [>.............................] - ETA: 8s - loss: 0.6454 - binary_accuracy: 0.6291 - binary_crossentropy: 0.6454 - auc: 0.6772 - precision: 0.6235 - recall: 0.6396
 4/50 [=>............................] - ETA: 8s - loss: 0.6462 - binary_accuracy: 0.6289 - binary_crossentropy: 0.6462 - auc: 0.6765 - precision: 0.6237 - recall: 0.6402
 5/50 [==>...........................] - ETA: 7s - loss: 0.6466 - binary_accuracy: 0.6289 - binary_crossentropy: 0.6466 - auc: 0.6758 - precision: 0.6242 - recall: 0.6400
 6/50 [==>...........................] - ETA: 7s - loss: 0.6463 - binary_accura



[2m[36m(Worker pid=314929, ip=172.168.0.106)[0m Epoch 13/585
 1/50 [..............................] - ETA: 10s - loss: 0.6440 - binary_accuracy: 0.6320 - binary_crossentropy: 0.6440 - auc: 0.6793 - precision: 0.6309 - recall: 0.6373
 2/50 [>.............................] - ETA: 10s - loss: 0.6436 - binary_accuracy: 0.6321 - binary_crossentropy: 0.6436 - auc: 0.6794 - precision: 0.6287 - recall: 0.6397
 3/50 [>.............................] - ETA: 9s - loss: 0.6448 - binary_accuracy: 0.6309 - binary_crossentropy: 0.6448 - auc: 0.6778 - precision: 0.6265 - recall: 0.6404 
 4/50 [=>............................] - ETA: 8s - loss: 0.6444 - binary_accuracy: 0.6317 - binary_crossentropy: 0.6444 - auc: 0.6788 - precision: 0.6261 - recall: 0.6413
 5/50 [==>...........................] - ETA: 8s - loss: 0.6448 - binary_accuracy: 0.6313 - binary_crossentropy: 0.6448 - auc: 0.6784 - precision: 0.6256 - recall: 0.6415
 6/50 [==>...........................] - ETA: 8s - loss: 0.6449 - binary_accur



[2m[36m(Worker pid=314929, ip=172.168.0.106)[0m Epoch 14/585
 1/50 [..............................] - ETA: 12s - loss: 0.6470 - binary_accuracy: 0.6288 - binary_crossentropy: 0.6470 - auc: 0.6754 - precision: 0.6250 - recall: 0.6384
 2/50 [>.............................] - ETA: 8s - loss: 0.6477 - binary_accuracy: 0.6288 - binary_crossentropy: 0.6477 - auc: 0.6737 - precision: 0.6227 - recall: 0.6397 
 3/50 [>.............................] - ETA: 8s - loss: 0.6470 - binary_accuracy: 0.6284 - binary_crossentropy: 0.6470 - auc: 0.6746 - precision: 0.6218 - recall: 0.6393
 4/50 [=>............................] - ETA: 8s - loss: 0.6453 - binary_accuracy: 0.6307 - binary_crossentropy: 0.6453 - auc: 0.6770 - precision: 0.6237 - recall: 0.6408
 5/50 [==>...........................] - ETA: 8s - loss: 0.6447 - binary_accuracy: 0.6308 - binary_crossentropy: 0.6447 - auc: 0.6777 - precision: 0.6250 - recall: 0.6402
 6/50 [==>...........................] - ETA: 7s - loss: 0.6444 - binary_accura



[2m[36m(Worker pid=314929, ip=172.168.0.106)[0m Epoch 15/585
 1/50 [..............................] - ETA: 9s - loss: 0.6448 - binary_accuracy: 0.6301 - binary_crossentropy: 0.6448 - auc: 0.6785 - precision: 0.6212 - recall: 0.6437
 2/50 [>.............................] - ETA: 9s - loss: 0.6427 - binary_accuracy: 0.6327 - binary_crossentropy: 0.6427 - auc: 0.6814 - precision: 0.6240 - recall: 0.6467
 3/50 [>.............................] - ETA: 8s - loss: 0.6430 - binary_accuracy: 0.6317 - binary_crossentropy: 0.6430 - auc: 0.6807 - precision: 0.6256 - recall: 0.6447
 4/50 [=>............................] - ETA: 9s - loss: 0.6437 - binary_accuracy: 0.6315 - binary_crossentropy: 0.6437 - auc: 0.6797 - precision: 0.6264 - recall: 0.6442
 5/50 [==>...........................] - ETA: 8s - loss: 0.6437 - binary_accuracy: 0.6321 - binary_crossentropy: 0.6437 - auc: 0.6799 - precision: 0.6277 - recall: 0.6429
 6/50 [==>...........................] - ETA: 8s - loss: 0.6436 - binary_accuracy



[2m[36m(Worker pid=314929, ip=172.168.0.106)[0m Epoch 16/585
 1/50 [..............................] - ETA: 10s - loss: 0.6453 - binary_accuracy: 0.6283 - binary_crossentropy: 0.6453 - auc: 0.6764 - precision: 0.6227 - recall: 0.6390
 2/50 [>.............................] - ETA: 10s - loss: 0.6464 - binary_accuracy: 0.6277 - binary_crossentropy: 0.6464 - auc: 0.6750 - precision: 0.6247 - recall: 0.6378
 3/50 [>.............................] - ETA: 9s - loss: 0.6453 - binary_accuracy: 0.6294 - binary_crossentropy: 0.6453 - auc: 0.6768 - precision: 0.6273 - recall: 0.6415 
 4/50 [=>............................] - ETA: 9s - loss: 0.6456 - binary_accuracy: 0.6294 - binary_crossentropy: 0.6456 - auc: 0.6766 - precision: 0.6263 - recall: 0.6417
 5/50 [==>...........................] - ETA: 8s - loss: 0.6455 - binary_accuracy: 0.6303 - binary_crossentropy: 0.6455 - auc: 0.6767 - precision: 0.6254 - recall: 0.6434
 6/50 [==>...........................] - ETA: 8s - loss: 0.6451 - binary_accur



[2m[36m(Worker pid=314929, ip=172.168.0.106)[0m Epoch 17/585
 1/50 [..............................] - ETA: 11s - loss: 0.6447 - binary_accuracy: 0.6342 - binary_crossentropy: 0.6447 - auc: 0.6785 - precision: 0.6285 - recall: 0.6445
 2/50 [>.............................] - ETA: 9s - loss: 0.6431 - binary_accuracy: 0.6346 - binary_crossentropy: 0.6431 - auc: 0.6807 - precision: 0.6294 - recall: 0.6462 
 3/50 [>.............................] - ETA: 8s - loss: 0.6424 - binary_accuracy: 0.6361 - binary_crossentropy: 0.6424 - auc: 0.6817 - precision: 0.6311 - recall: 0.6476
 4/50 [=>............................] - ETA: 9s - loss: 0.6434 - binary_accuracy: 0.6341 - binary_crossentropy: 0.6434 - auc: 0.6799 - precision: 0.6286 - recall: 0.6460
 5/50 [==>...........................] - ETA: 8s - loss: 0.6430 - binary_accuracy: 0.6337 - binary_crossentropy: 0.6430 - auc: 0.6805 - precision: 0.6283 - recall: 0.6456
 6/50 [==>...........................] - ETA: 8s - loss: 0.6428 - binary_accura



[2m[36m(Worker pid=314929, ip=172.168.0.106)[0m Epoch 18/585
 1/50 [..............................] - ETA: 10s - loss: 0.6420 - binary_accuracy: 0.6319 - binary_crossentropy: 0.6420 - auc: 0.6812 - precision: 0.6317 - recall: 0.6446
 2/50 [>.............................] - ETA: 7s - loss: 0.6419 - binary_accuracy: 0.6345 - binary_crossentropy: 0.6419 - auc: 0.6816 - precision: 0.6306 - recall: 0.6494 
 3/50 [>.............................] - ETA: 8s - loss: 0.6419 - binary_accuracy: 0.6342 - binary_crossentropy: 0.6419 - auc: 0.6813 - precision: 0.6285 - recall: 0.6493
 4/50 [=>............................] - ETA: 8s - loss: 0.6414 - binary_accuracy: 0.6345 - binary_crossentropy: 0.6414 - auc: 0.6821 - precision: 0.6294 - recall: 0.6511
 5/50 [==>...........................] - ETA: 7s - loss: 0.6414 - binary_accuracy: 0.6348 - binary_crossentropy: 0.6414 - auc: 0.6823 - precision: 0.6289 - recall: 0.6512
 6/50 [==>...........................] - ETA: 8s - loss: 0.6420 - binary_accura



[2m[36m(Worker pid=314929, ip=172.168.0.106)[0m Epoch 19/585
 1/50 [..............................] - ETA: 10s - loss: 0.6404 - binary_accuracy: 0.6369 - binary_crossentropy: 0.6404 - auc: 0.6846 - precision: 0.6419 - recall: 0.6403
 2/50 [>.............................] - ETA: 8s - loss: 0.6423 - binary_accuracy: 0.6334 - binary_crossentropy: 0.6423 - auc: 0.6814 - precision: 0.6330 - recall: 0.6392 
 3/50 [>.............................] - ETA: 8s - loss: 0.6421 - binary_accuracy: 0.6338 - binary_crossentropy: 0.6421 - auc: 0.6814 - precision: 0.6323 - recall: 0.6407
 4/50 [=>............................] - ETA: 8s - loss: 0.6427 - binary_accuracy: 0.6341 - binary_crossentropy: 0.6427 - auc: 0.6805 - precision: 0.6316 - recall: 0.6418
 5/50 [==>...........................] - ETA: 8s - loss: 0.6414 - binary_accuracy: 0.6354 - binary_crossentropy: 0.6414 - auc: 0.6826 - precision: 0.6326 - recall: 0.6441
 6/50 [==>...........................] - ETA: 8s - loss: 0.6417 - binary_accura



[2m[36m(Worker pid=226680, ip=172.168.0.103)[0m Epoch 19: early stopping
[2m[36m(Worker pid=259340, ip=172.168.0.107)[0m Epoch 19: early stopping
[2m[36m(Worker pid=314019, ip=172.168.0.104)[0m Epoch 19: early stopping
[2m[36m(Worker pid=310081, ip=172.168.0.105)[0m Epoch 19: early stopping
[2m[36m(Worker pid=279949, ip=172.168.0.102)[0m Epoch 19: early stopping
[2m[36m(Worker pid=314929, ip=172.168.0.106)[0m 
[2m[36m(Worker pid=314929, ip=172.168.0.106)[0m Epoch 19: early stopping
Training time is:  1185.4728910923004


After 19 epochs of training (early stopping ended the run at epoch 19), we get validation [AUC](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) 0.7044, [precision](https://en.wikipedia.org/wiki/Precision_and_recall) 0.6566 and [recall](https://en.wikipedia.org/wiki/Precision_and_recall) 0.6757. The training time reported above is about 1185 seconds, i.e. roughly 20 minutes.
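To make the "Epoch 19: early stopping" lines in the log easier to interpret, here is a minimal pure-Python sketch of patience-based early stopping. It is a simplified illustration, not the Keras `EarlyStopping` implementation. The function name `early_stopping_epoch`, the `patience` value and the per-epoch scores are all hypothetical; they are not taken from this run.

```python
from typing import List, Optional


def early_stopping_epoch(
    val_scores: List[float],
    patience: int,
    min_delta: float = 0.0,
) -> Optional[int]:
    """Return the 1-based epoch at which training would stop, or None.

    Training stops once the monitored score (higher is better, e.g. AUC)
    has failed to improve by more than `min_delta` for `patience`
    consecutive epochs.
    """
    best = float("-inf")
    epochs_without_improvement = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best + min_delta:
            # New best score: remember it and reset the counter.
            best = score
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    # The score kept improving often enough; no early stop was triggered.
    return None


# Hypothetical validation AUC per epoch (NOT the values from this run).
hypothetical_val_auc = [0.66, 0.68, 0.69, 0.70, 0.70, 0.699, 0.70]
stop_epoch = early_stopping_epoch(hypothetical_val_auc, patience=3)
print(stop_epoch)
```

In this toy series the best score is reached at epoch 4 and the following three epochs do not improve on it, so the sketch stops at epoch 7. The same mechanism explains why the run above ends well before the configured maximum of 585 epochs.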

After training, we save the model so that it can be used for serving.

In [19]:
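# Retrieve the trained Keras model from the Orca Estimator.
model = estimator.get_model()
# Export it in the TensorFlow SavedModel format for serving.
tf.saved_model.save(model, "recsys_wnd/")

# Release the cluster resources held by this application.
stop_orca_context()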

2022-03-23 11:26:01.562497: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2022-03-23 11:26:01.562564: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2022-03-23 11:26:01.562601: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (aep-001): /proc/driver/nvidia/version does not exist
2022-03-23 11:26:01.563329: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-03-23 11:26:02.425917: W tensorflow/python/util/util.cc:368] Sets are not currently considered sequenc

INFO:tensorflow:Assets written to: recsys_wnd/assets
Stopping orca context
