## Parallelized Prophet Modeling with Ray

Prophet is a simple yet powerful additive forecasting model. It is simple because its implementation is intuitive and requires tuning only a few parameters; it is powerful because it provides an algorithmically efficient way to identify time-related patterns in the data. Together, these make Prophet an ideal starting point, and possibly an end point, for a forecasting model.
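
For reference, "additive" means the forecast is a sum of components. The decomposition below follows the Prophet paper (Taylor and Letham) rather than anything defined in this notebook, so the symbols are introduced here only for illustration: $g(t)$ is the trend, $s(t)$ the seasonality, $h(t)$ the holiday effects, and $\varepsilon_t$ the error term.

```latex
y(t) = g(t) + s(t) + h(t) + \varepsilon_t
```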

However, real-world production use cases bring scaling challenges in model training and inference. In retail, for example, we would like a forecasting model for every store x SKU combination, which can lead to 100K+ models. Business demands may also require that all of these models be retrained overnight on a weekly basis.
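
To get a feel for the scale, a back-of-the-envelope budget helps. The sketch below is only illustrative: the helper name `per_model_budget_seconds`, the 96 CPUs, and the 8-hour window are assumptions, and it ignores scheduling and I/O overhead.

```python
def per_model_budget_seconds(n_models: int, n_cpus: int, window_hours: float) -> float:
    """Wall-clock seconds each model may take if every CPU trains one model at a time."""
    if n_models <= 0 or n_cpus <= 0:
        raise ValueError("n_models and n_cpus must be positive")
    return window_hours * 3600 * n_cpus / n_models

# Hypothetical numbers: 100K models, 96 CPUs, an 8-hour overnight window.
budget = per_model_budget_seconds(n_models=100_000, n_cpus=96, window_hours=8)
print(f"{budget:.1f} s per model")
```

Under these assumed numbers, each model gets roughly 27.6 seconds for training and saving, which is why per-group parallelism matters.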

In [0]:
%pip install -qU mlflow ray[default]==2.44.1 ray[data]==2.44.1 delta-sharing
dbutils.library.restartPython()

In [0]:
catalog = "main"
schema = "ray_gtm_examples"
table = "data_synthetic_timeseries_10000_groups"
write_table = "prophet_model_directory"
label = "y"

## Optional: Generate a massive time-series dataset

In [0]:
# Default is 10k groups x 50k datapoints per group; edit these values within the 00_generate_timeseries_data notebook.
# Generating and saving all the data may take about 15 minutes.
%run ./00_generate_timeseries_data

In [0]:
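# Save the synthetic data as a table if it does not already exist.
# `id_sdf` is assumed to be the Spark DataFrame produced by the 00_generate_timeseries_data notebook.
import pandas as pd

if not spark.catalog.tableExists(f"{catalog}.{schema}.{table}"):
  # Create table for features
  id_sdf.write.mode('overwrite').saveAsTable(f"{catalog}.{schema}.{table}")
  print(f"Saved {catalog}.{schema}.{table} ... OK!")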

## Ray Data with `map_groups`


In [0]:
import ray
from ray.util.spark import setup_ray_cluster, shutdown_ray_cluster

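restart = True
if restart:
  # Tear down any existing Ray-on-Spark cluster and local Ray session; ignore errors if none is running.
  try:
    shutdown_ray_cluster()
  except Exception:
    pass
  try:
    ray.shutdown()
  except Exception:
    pass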

# The configuration below mirrors my Spark worker cluster setup. Change it to match your own cluster configuration.
setup_ray_cluster(
  min_worker_nodes=6,
  max_worker_nodes=6,
  num_cpus_worker_node=16,
  num_gpus_worker_node=0,
  collect_log_to_path="/dbfs/Users/jon.cheung@databricks.com/ray_collected_logs"
)

In [0]:
from prophet import Prophet
import pandas as pd
import mlflow
import os
from mlflow.utils.databricks_utils import get_databricks_env_vars
from ray.data import from_spark
import pickle
from datetime import datetime

def train_and_inference_prophet(grouped_data: pd.DataFrame) -> pd.DataFrame:
  # Identify the group from its first row (by position, so it works for any index)
  group_name = grouped_data['group_name'].iloc[0]

  # Fit the model on this group's data (Prophet expects `ds` and `y` columns)
  m = Prophet(daily_seasonality=True)
  m.fit(grouped_data)

  # Pickle the fitted model to a Unity Catalog Volume (the directory must already exist)
  base_dir = f'/Volumes/{catalog}/{schema}/prophet_binaries/'
  file_name = f'{group_name}_prophet.pkl'
  with open(base_dir + file_name, 'wb') as file:
    pickle.dump(m, file)

  # Return a one-row dataframe recording where the model was saved
  to_write = pd.DataFrame({'group_name': [group_name],
                           'model_binary_directory': [base_dir + file_name],
                           'algorithm': ['prophet'],
                           'creation_time': [str(datetime.now())]})

  return to_write
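
Before scaling out, it can help to check a per-group function locally. The sketch below emulates the one-DataFrame-in, one-row-out contract with a plain pandas `groupby`; it uses neither Ray nor Prophet. `summarize_group` and the toy data are hypothetical stand-ins, and a simple mean replaces model fitting. It also shows why reading the group key by position (`iloc`) is safer than by label: in pandas, each group keeps its original row index, which need not contain 0.

```python
import pandas as pd

def summarize_group(group: pd.DataFrame) -> pd.DataFrame:
    # Read the key by position; the group's index labels may not include 0.
    group_name = group["group_name"].iloc[0]
    # Stand-in for model fitting: just record the mean of y.
    return pd.DataFrame({"group_name": [group_name],
                         "n_rows": [len(group)],
                         "mean_y": [group["y"].mean()]})

# Hypothetical toy data with two groups.
toy = pd.DataFrame({
    "group_name": ["a", "a", "b", "b", "b"],
    "ds": pd.date_range("2024-01-01", periods=5, freq="D"),
    "y": [1.0, 3.0, 10.0, 20.0, 30.0],
})

# Apply the function once per group and stack the one-row results.
results_local = pd.concat(
    [summarize_group(g) for _, g in toy.groupby("group_name")],
    ignore_index=True,
)
print(results_local)
```

Here group `b` carries index labels 2 to 4, so a label-based lookup of row 0 would raise a `KeyError` while the positional lookup works.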

In [0]:

# Point Ray Data at a Volume directory for its temporary data shards. This intermediate location is also used when writing back to a Spark table from Ray Data.
os.environ['RAY_UC_VOLUMES_FUSE_TEMP_DIR'] = f'/Volumes/{catalog}/{schema}/ray_data_tmp_dir'

# Convert Spark data to Ray data. Note that this is an in-memory operation.
ray_data = from_spark(spark.read.table(f"{catalog}.{schema}.{table}"), 
                      use_spark_chunk_api=False)


# TODO: Read Ray Data from Delta Sharing for a more memory-efficient load than from_spark (i.e. no in-memory Spark -> Ray Data conversion)
# profile_file = "config.share"
# SHARE_NAME = 'internal-ray-share'
# SHARE_SCHEMA = 'ray_gtm_examples'
# SHARE_TABLE = 'data_synthetic_timeseries_100_groups'
# table_url = f"{profile_file}#{SHARE_NAME}.{SHARE_SCHEMA}.{SHARE_TABLE}"

# ray_data = ray.data.read_delta_sharing_tables(
#     url=table_url
# )

# Start the map_groups process. Request pandas batches so each group arrives as a pd.DataFrame.
grouped = ray_data.groupby("group_name")
results = grouped.map_groups(train_and_inference_prophet,
                             num_cpus=1,
                             batch_format="pandas")

# Write grouped results to a Delta table
ray.data.Dataset.write_databricks_table(results,
                                        f"{catalog}.{schema}.{write_table}",
                                        mode='overwrite')
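
The pattern above, one pickled model per group plus a table that maps each group to its file, can be illustrated without Ray or Unity Catalog. The sketch below is a stand-alone toy under stated assumptions: `fit_mean_model`, `save_model`, and `load_model` are hypothetical names, a temporary directory stands in for the Volume, a plain dict stands in for the Delta table, and a dict holding a mean stands in for a fitted Prophet model.

```python
import os
import pickle
import tempfile

def fit_mean_model(values):
    # Toy stand-in for a fitted model: just the training mean.
    return {"mean": sum(values) / len(values)}

def save_model(model, group_name, base_dir, registry):
    # Pickle the model and record where it lives, keyed by group.
    path = os.path.join(base_dir, f"{group_name}_model.pkl")
    with open(path, "wb") as f:
        pickle.dump(model, f)
    registry[group_name] = path
    return path

def load_model(group_name, registry):
    # Look up the path for the group, then unpickle the model.
    with open(registry[group_name], "rb") as f:
        return pickle.load(f)

registry = {}
with tempfile.TemporaryDirectory() as tmp_dir:
    save_model(fit_mean_model([1.0, 2.0, 3.0]), "store1_sku42", tmp_dir, registry)
    restored = load_model("store1_sku42", registry)
    # "Forecast" three periods ahead by repeating the stored mean.
    forecast = [restored["mean"]] * 3

print(forecast)
```

Keeping only paths in the lookup table keeps it small even with 100K+ groups, at the cost of one file read per inference request.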

In [0]:
experiment_name = '/Users/jon.cheung@databricks.com/ray_prophet_map_batches'

import pickle
from pprint import pprint
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.catalog import *

# Placeholder host and token; replace them with your workspace URL and access token.
w = WorkspaceClient(host='https://xxx.databricks.com', token='xxx')

class ProphetInferenceRouter(mlflow.pyfunc.PythonModel):
  def __init__(self, catalog, schema, table):
    self.model_table = f"{catalog}.{schema}.{table}"

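  def predict(self, identifier, horizon):
    # ASSUMPTION: the original left the model lookup blank. This sketch reads the
    # pickle path for `identifier` from the model directory table via the notebook's
    # `spark` session, then loads the bytes from the Volume.
    row = (spark.read.table(self.model_table)
           .filter(f"group_name = '{identifier}'")
           .select("model_binary_directory")
           .first())
    with open(row["model_binary_directory"], "rb") as file:
      model_binary = file.read()

    # Unpickle the fitted Prophet model and forecast `horizon` periods ahead
    loaded = pickle.loads(model_binary)
    future = loaded.make_future_dataframe(periods=horizon)
    forecast = loaded.predict(future)
    return forecast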