## Parallelized Prophet Modeling with Ray

Prophet is a simple yet powerful additive forecasting model. It is simple because its implementation is intuitive and requires tuning only a few parameters; it is powerful because it provides an algorithmically efficient way to identify time-related patterns in the data. These two qualities make Prophet an ideal starting point, and possibly the end point, for a forecasting model.

However, in real-world production use cases we must overcome scaling challenges in model training and inference. In retail, for example, we would like to fit a forecasting model for every store x SKU combination, which can lead to 100K+ models. Furthermore, the business may require that all of these models be trained and used for inference within short time frames.
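To make the scale concrete, here is a rough back-of-the-envelope sketch. The store count, SKU count, and per-model time are hypothetical numbers chosen only to illustrate the arithmetic; they are not measurements from this notebook.

```python
# Hypothetical sizes, chosen only to illustrate the arithmetic.
n_stores = 500
n_skus = 200
seconds_per_model = 2.0  # assumed average fit + predict time for one series

# One model per store x SKU combination.
n_models = n_stores * n_skus

# Time needed if the models were trained one after another on a single CPU.
serial_hours = n_models * seconds_per_model / 3600

print(f"{n_models:,} models, about {serial_hours:.1f} hours if trained one at a time")
```

Even with a fast per-model fit, training serially takes days rather than minutes at this scale, which is why the work needs to be spread across a cluster.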

In this notebook, we use Ray Data and `map_groups` to parallelize training across the Ray cluster. We use Spark for efficient data loading and for writing the results back to Delta.

For compute, you need only 5 worker nodes (`m5.4xlarge`, each with 16 CPUs and 64 GiB of memory) and no special configuration.

```json
{
    "num_workers": 5,
    "cluster_name": "Multi Node MLR",
    "spark_version": "16.4.x-cpu-ml-scala2.13",
    "aws_attributes": {
        "first_on_demand": 1,
        "availability": "SPOT_WITH_FALLBACK",
        "zone_id": "auto",
        "spot_bid_price_percent": 100,
        "ebs_volume_type": "GENERAL_PURPOSE_SSD",
        "ebs_volume_count": 1,
        "ebs_volume_size": 100
    },
    "node_type_id": "m5.4xlarge",
    "driver_node_type_id": "m5.4xlarge",
    "autotermination_minutes": 45,
    "enable_elastic_disk": true,
    "single_user_name": "jon.cheung@databricks.com",
    "enable_local_disk_encryption": false,
    "data_security_mode": "SINGLE_USER",
    "runtime_engine": "STANDARD",
    "assigned_principal": "user:jon.cheung@databricks.com"
}
```

In [0]:
%pip install -qU ray[all]==2.47.1

dbutils.library.restartPython()

In [0]:
catalog_name = "main"
schema_name = "ray_gtm_examples"
data_table = "data_synthetic_timeseries_10000_groups"
write_table = "prophet_inference_10000_groups"
label="y"

# This is for stashing the cluster logs
ray_logs_path = "/dbfs/Users/jon.cheung@databricks.com/ray_collected_logs"

## Optional: Generate a massive time-series dataset

The default is 10k groups x 50k data points per group; edit these settings in the `00-generate-timeseries-data` notebook.
Generating and saving all the data may take 15 minutes or so.
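As a rough sanity check on what the default settings imply, the sketch below estimates the row count and raw size of the dataset. The 24 bytes per row is an assumption (roughly an 8-byte timestamp, an 8-byte float, and about 8 bytes for the group key); the real footprint depends on the actual schema, compression, and in-memory overhead, and `estimate_dataset_size` is a hypothetical helper.

```python
def estimate_dataset_size(n_groups: int, points_per_group: int, bytes_per_row: int):
    """Return (row count, approximate raw size in GiB) for a long-format time-series table."""
    rows = n_groups * points_per_group
    gib = rows * bytes_per_row / 2**30
    return rows, gib

# Defaults from the text: 10k groups x 50k data points per group.
# bytes_per_row=24 is an assumed figure, not a measured one.
rows, gib = estimate_dataset_size(10_000, 50_000, bytes_per_row=24)
print(f"{rows:,} rows, roughly {gib:.1f} GiB of raw values")
```

An estimate like this is useful mainly for judging whether the data fits comfortably in cluster memory before converting it from Spark to Ray Data.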

In [0]:
%run ./00-generate-timeseries-data

In [0]:
# Persist the synthetic time series only if the table does not already exist
if not spark.catalog.tableExists(f"{catalog_name}.{schema_name}.{data_table}"): 
  id_sdf.write.mode('overwrite').saveAsTable(f"{catalog_name}.{schema_name}.{data_table}")
  print("... OK!")

## Ray Data with `map_groups`


In [0]:
import ray
from ray.util.spark import setup_ray_cluster, shutdown_ray_cluster

# Shut down any existing Ray cluster first so that the setup below starts clean
restart = True
if restart:
  try:
    shutdown_ray_cluster()
  except Exception:
    pass
  try:
    ray.shutdown()
  except Exception:
    pass

# This notebook will use a hybrid Ray + Spark cluster. Spark for data handling (i.e. read/write) and Ray for task parallelism (i.e. ML training). 
# Since I have 5 worker nodes, I'll set 3 workers for the Ray cluster, leaving 2 for Spark. 
setup_ray_cluster(
  min_worker_nodes=3,
  max_worker_nodes=3,
  num_cpus_worker_node=16,
  num_gpus_worker_node=0,
  num_cpus_head_node=16, 
  collect_log_to_path=ray_logs_path
)
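With this configuration, and with one CPU reserved per group as in the training step later in this notebook, we can estimate how many groups train at the same time. The sketch below counts only worker CPUs and ignores scheduling overhead; `parallel_waves` is a hypothetical helper, and the result is an approximation rather than a guarantee of how Ray schedules tasks.

```python
import math

def parallel_waves(n_groups: int, worker_nodes: int, cpus_per_node: int, cpus_per_task: int = 1):
    """Return (concurrent task slots, number of full passes needed to cover all groups)."""
    slots = (worker_nodes * cpus_per_node) // cpus_per_task
    waves = math.ceil(n_groups / slots)
    return slots, waves

# 3 Ray worker nodes x 16 CPUs, 1 CPU per Prophet model, 10,000 groups.
slots, waves = parallel_waves(n_groups=10_000, worker_nodes=3, cpus_per_node=16)
print(f"{slots} groups at a time, about {waves} waves to cover all groups")
```

Multiplying the number of waves by a measured per-group training time gives a first estimate of wall-clock time, which helps when deciding how many workers to give Ray versus Spark.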

In [0]:
from prophet import Prophet
import pandas as pd
import mlflow
import os

def train_and_inference_prophet(grouped_data: pd.DataFrame, horizon: int):
    """
    Trains a Prophet model on grouped time series data and generates future forecasts.

    This function is designed to be applied to a Ray dataset using `map_groups`.
    It expects the input DataFrame to have 'ds' (datestamp), 'y' (target variable),
    and 'group_name' columns.

    Args:
        grouped_data (pd.DataFrame): A DataFrame containing the time series data
                                     for a specific group. It must have 'ds', 'y',
                                     and 'group_name' columns.
        horizon (int): The number of future periods to forecast.

    Returns:
        pd.DataFrame: A DataFrame containing the forecasted data for the specified
                      horizon, including the 'ds' (converted to string) and 'group_name'
                      columns.
    """
    # Extract the group name from the first row (positional, so it works for any index)
    group_name = grouped_data['group_name'].iloc[0]

    # Initialize and fit the Prophet model
    m = Prophet(daily_seasonality=True)
    m.fit(grouped_data)

    # Create a dataframe for future dates and generate forecasts
    future = m.make_future_dataframe(periods=horizon)
    forecast = m.predict(future)
    
    # Extract the forecasted data for the specified horizon
    to_write = forecast.iloc[-horizon:].copy()  # Copy so the column assignments below do not modify a slice of forecast
    to_write['ds'] = to_write['ds'].astype(str)  # Convert date to string for Spark writes back to Delta
    to_write['group_name'] = group_name  # Add group name to the forecasted data

    return to_write

In [0]:
import os
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import catalog
from databricks.sdk.errors import ResourceAlreadyExists

# Create Volumes path for Ray Data to store the data shards when converting from Spark --> Ray Data.
# This path is also used when writing back to a Spark table from Ray Data. 

w = WorkspaceClient()
ray_fuse_temp_directory = f'/Volumes/{catalog_name}/{schema_name}/ray_data_tmp_dir/'

try:
    created_volume = w.volumes.create(catalog_name=catalog_name,
                                        schema_name=schema_name,
                                        name='ray_data_tmp_dir',
                                        volume_type=catalog.VolumeType.MANAGED
                                        )
    print(f"Volume {ray_fuse_temp_directory} created successfully")
except ResourceAlreadyExists:
    print(f"Volume {ray_fuse_temp_directory} already exists. Skipping volumes creation.")

os.environ['RAY_UC_VOLUMES_FUSE_TEMP_DIR']=ray_fuse_temp_directory


In [0]:
import ray

# Since we are using a hybrid cluster, we are going to read data using Spark and then convert to Ray. 
# Note that this is an in-memory operation and requires using a non-autoscaling cluster.
sdf = spark.read.table(f"{catalog_name}.{schema_name}.{data_table}")
ray_data = ray.data.from_spark(sdf)

# Consider using ray.data.read_delta_sharing_tables() for memory efficient load compared to from_spark (i.e. no need to do in-memory Spark -> Ray Data)

# Group data by group_name and apply the train_and_inference_prophet function to each group. 
# batch_format="pandas" makes each group arrive as a pandas DataFrame, matching the function signature.
# Note that we are using 1 CPU per group to train a Prophet model.
grouped = ray_data.groupby("group_name")
results = grouped.map_groups(train_and_inference_prophet, 
                             fn_kwargs={"horizon": 14}, 
                             batch_format="pandas",
                             num_cpus=1)

# Write grouped results to Databricks Unity Catalog. 
ray.data.Dataset.write_databricks_table(results, 
                                        f"{catalog_name}.{schema_name}.{write_table}",
                                         mode='overwrite')
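
Conceptually, the `groupby` + `map_groups` step above is a split-apply-combine: rows are split by `group_name`, a function is applied to each group independently, and the per-group outputs are combined into one result. The sketch below illustrates that pattern in plain Python. It is not how Ray implements it; `map_groups_local` and `naive_forecast` are hypothetical names, and the mean-based forecast is only a stand-in for the Prophet model.

```python
from collections import defaultdict
from statistics import mean

def naive_forecast(rows, horizon):
    """Stand-in for the Prophet step: repeat the group's mean `horizon` times."""
    group_name = rows[0]["group_name"]
    level = mean(r["y"] for r in rows)
    return [{"group_name": group_name, "step": h + 1, "yhat": level} for h in range(horizon)]

def map_groups_local(rows, key, fn, **fn_kwargs):
    """Split rows by `key`, apply `fn` to each group, and concatenate the outputs."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    out = []
    for group_rows in groups.values():
        # Each group is processed independently, which is what makes the work parallelizable.
        out.extend(fn(group_rows, **fn_kwargs))
    return out

data = [
    {"group_name": "a", "y": 1.0},
    {"group_name": "b", "y": 10.0},
    {"group_name": "a", "y": 3.0},
    {"group_name": "b", "y": 20.0},
]
result = map_groups_local(data, "group_name", naive_forecast, horizon=2)
for row in result:
    print(row)
```

Because no group depends on any other, the per-group function calls can run on different CPUs at once; Ray Data takes care of distributing them across the cluster.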