
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Training with Pandas Function API

This notebook demonstrates how to use the Pandas Function API to train, track, and apply a separate machine learning model for each IoT device at scale.

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this lesson you:<br>
 - Use <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.GroupedData.applyInPandas.html?highlight=applyinpandas#pyspark.sql.GroupedData.applyInPandas" target="_blank"> **`.groupBy().applyInPandas()`** </a> to train one model per IoT device in parallel

In [0]:
%pip install mlflow

In [0]:
%run ./Includes/Classroom-Setup

Create dummy data with:
- **`device_id`**: 10 different devices
- **`record_id`**: 100k unique records (10k per device)
- **`feature_1`**: a feature for model training
- **`feature_2`**: a feature for model training
- **`feature_3`**: a feature for model training
- **`label`**: the variable we're trying to predict

In [0]:
import pyspark.sql.functions as f

df = (spark
      .range(1000*100)
      .select(f.col("id").alias("record_id"), (f.col("id")%10).alias("device_id"))
      .withColumn("feature_1", f.rand() * 1)
      .withColumn("feature_2", f.rand() * 2)
      .withColumn("feature_3", f.rand() * 3)
      .withColumn("label", (f.col("feature_1") + f.col("feature_2") + f.col("feature_3")) + f.rand())
     )

display(df)

record_id,device_id,feature_1,feature_2,feature_3,label
0,0,0.6946372461392856,0.7577028817542781,0.0112440557844216,1.56102021994577
1,1,0.8713828508946553,1.1141254210835765,2.517164660698353,5.471653720015507
2,2,0.0713600393164457,0.2148640603489986,1.8818924505111343,3.139185509359318
3,3,0.4475890179885491,0.6059122241814121,1.5708151023130503,2.790459974902946
4,4,0.1963568462238419,0.5273904726915719,0.0433482439231922,0.9006344069750835
5,5,0.2031819875110142,1.5124529767075523,0.2020990183099075,2.699533484567216
6,6,0.2582358696874994,0.2562560668274218,2.1035239599806306,3.0891780248298124
7,7,0.722510008102463,0.8859570885874051,1.5639290290286838,3.408172837045554
8,8,0.3686123783052292,1.7515756059583216,2.7347358457668496,5.54605150614343
9,9,0.5167366016033235,0.8600531466342609,2.0866882952183143,3.908036057417946
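
For experimenting without a Spark cluster, the synthetic data above can be approximated locally. The following is a minimal sketch using NumPy and pandas. The helper name `make_dummy_data`, its parameters, and the fixed seed are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def make_dummy_data(n_records: int = 100_000, n_devices: int = 10, seed: int = 42) -> pd.DataFrame:
    """Build a local pandas approximation of the Spark DataFrame (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    record_id = np.arange(n_records)
    pdf = pd.DataFrame({
        "record_id": record_id,
        "device_id": record_id % n_devices,       # round-robin assignment, as in the Spark code
        "feature_1": rng.random(n_records) * 1,   # uniform in [0, 1)
        "feature_2": rng.random(n_records) * 2,   # uniform in [0, 2)
        "feature_3": rng.random(n_records) * 3,   # uniform in [0, 3)
    })
    # The label is the sum of the three features plus uniform noise in [0, 1)
    pdf["label"] = pdf[["feature_1", "feature_2", "feature_3"]].sum(axis=1) + rng.random(n_records)
    return pdf

local_df = make_dummy_data()
print(local_df.groupby("device_id").size())  # number of records per device
```

Each device receives the same number of records because `device_id` is derived from `record_id` modulo the number of devices.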


Define the return schema

In [0]:
import pyspark.sql.types as t

train_return_schema = t.StructType([
    t.StructField("device_id", t.IntegerType()), # unique device ID
    t.StructField("n_used", t.IntegerType()),    # number of records used in training
    t.StructField("model_path", t.StringType()), # path to the model for a given device
    t.StructField("mse", t.FloatType())          # metric for model performance
])
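
When the returned pandas DataFrame has string column labels, `applyInPandas` matches them to the schema fields by name, so a misspelled or missing column only fails once the job runs. The sketch below shows one way to catch such mismatches earlier in plain pandas. The helper `conform_to_schema` and the `EXPECTED_DTYPES` mapping are hypothetical, and the dtype choices are an assumed mirror of `train_return_schema`.

```python
import pandas as pd

# Assumed pandas dtypes mirroring train_return_schema
# (IntegerType -> int32, StringType -> object, FloatType -> float32)
EXPECTED_DTYPES = {
    "device_id": "int32",
    "n_used": "int32",
    "model_path": "object",
    "mse": "float32",
}

def conform_to_schema(pdf: pd.DataFrame, expected: dict = EXPECTED_DTYPES) -> pd.DataFrame:
    """Order and cast columns to the expected layout; raise if a column is missing."""
    missing = [name for name in expected if name not in pdf.columns]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    return pdf[list(expected)].astype(expected)

# Placeholder row; the model path is illustrative only
example = pd.DataFrame(
    [[3, 10000, "runs:/<run_id>/3", 0.0133]],
    columns=["device_id", "n_used", "model_path", "mse"],
)
conformed = conform_to_schema(example)
print(conformed.dtypes)
```

Calling such a helper on the last line of a grouped function keeps the column order and dtypes aligned with the declared schema.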

Define a pandas function that takes all the data for a given device, trains a model, logs it in a nested MLflow run, and returns a pandas DataFrame that matches the schema above.

In [0]:
import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

def train_model(df_pandas: pd.DataFrame) -> pd.DataFrame:
    """
    Trains an sklearn model on grouped instances
    """
    # Pull metadata
    device_id = df_pandas["device_id"].iloc[0]
    n_used = df_pandas.shape[0]
    run_id = df_pandas["run_id"].iloc[0] # Pulls run ID to do a nested run

    # Train the model
    X = df_pandas[["feature_1", "feature_2", "feature_3"]]
    y = df_pandas["label"]
    rf = RandomForestRegressor()
    rf.fit(X, y)

    # Evaluate the model
    predictions = rf.predict(X)
    mse = mean_squared_error(y, predictions) # In-sample MSE; a train/test split would give a less optimistic estimate

    # Resume the top-level training
    with mlflow.start_run(run_id=run_id) as outer_run:
        # Small hack for running as a job
        experiment_id = outer_run.info.experiment_id
        print(f"Current experiment_id = {experiment_id}")

        # Create a nested run for the specific device
        with mlflow.start_run(run_name=str(device_id), nested=True, experiment_id=experiment_id) as run:
            mlflow.sklearn.log_model(rf, str(device_id))
            mlflow.log_metric("mse", mse)

            artifact_uri = f"runs:/{run.info.run_id}/{device_id}"
            # Create a return pandas DataFrame that matches the schema above
            return_df = pd.DataFrame([[device_id, n_used, artifact_uri, mse]], 
                                    columns=["device_id", "n_used", "model_path", "mse"])

    return return_df 
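
The function above scores the model on the same rows it was trained on. One possible way to add the train/test split mentioned in its comments is sketched below, assuming scikit-learn's `train_test_split`. The helper `fit_and_score`, the 80/20 split, the smaller forest, and the demo data are illustrative assumptions, and MLflow logging is left out to keep the sketch self-contained.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def fit_and_score(df_pandas: pd.DataFrame, test_size: float = 0.2, seed: int = 42):
    """Fit on a training split and report MSE on both splits (hypothetical helper)."""
    X = df_pandas[["feature_1", "feature_2", "feature_3"]]
    y = df_pandas["label"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed
    )
    rf = RandomForestRegressor(n_estimators=20, random_state=seed)
    rf.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, rf.predict(X_train))
    test_mse = mean_squared_error(y_test, rf.predict(X_test))  # held-out estimate
    return rf, train_mse, test_mse

# Small synthetic frame shaped like the data for a single device
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    "feature_1": rng.random(n) * 1,
    "feature_2": rng.random(n) * 2,
    "feature_3": rng.random(n) * 3,
})
demo["label"] = demo.sum(axis=1) + rng.random(n)

_, train_mse, test_mse = fit_and_score(demo)
print(f"train MSE = {train_mse:.4f}, test MSE = {test_mse:.4f}")
```

The held-out MSE is typically noticeably higher than the in-sample value, which is why it is the more useful number to log per device.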


Apply the pandas function to grouped data. 

Note that how you apply this in practice depends largely on where the data for inference is located. In this example, we'll reuse the training data, which contains our device and run IDs.

In [0]:
with mlflow.start_run(run_name="Training session for all devices") as run:
    run_id = run.info.run_id

    model_directories_df = (df
        .withColumn("run_id", f.lit(run_id)) # Add run_id
        .groupby("device_id")
        .applyInPandas(train_model, schema=train_return_schema)
        .cache()
    )

combined_df = df.join(model_directories_df, on="device_id", how="left")
display(combined_df)

device_id,record_id,feature_1,feature_2,feature_3,label,n_used,model_path,mse
0,0,0.6946372461392856,0.7577028817542781,0.0112440557844216,1.56102021994577,10000,runs:/c5ed006757ee4583b986f47596276d44/0,0.013440495
1,1,0.8713828508946553,1.1141254210835765,2.517164660698353,5.471653720015507,10000,runs:/780c96732a284965b6e5f4b216dd6279/1,0.013498714
2,2,0.0713600393164457,0.2148640603489986,1.8818924505111343,3.139185509359318,10000,runs:/f8fd2f66e542423c965fcde2fa23b3c3/2,0.013742719
3,3,0.4475890179885491,0.6059122241814121,1.5708151023130503,2.790459974902946,10000,runs:/b73f58c71660480a9d046e067d7cb353/3,0.013303993
4,4,0.1963568462238419,0.5273904726915719,0.0433482439231922,0.9006344069750835,10000,runs:/fe21643293f74bd88586757f375fb60c/4,0.013726118
5,5,0.2031819875110142,1.5124529767075523,0.2020990183099075,2.699533484567216,10000,runs:/06d2c3e5b2074c7aa442489d5b64a1fc/5,0.0136968205
6,6,0.2582358696874994,0.2562560668274218,2.1035239599806306,3.0891780248298124,10000,runs:/d545df3bdf6845bab0baa5d181b5a739/6,0.013768146
7,7,0.722510008102463,0.8859570885874051,1.5639290290286838,3.408172837045554,10000,runs:/a01069bfa4b4492c97dbd24ee06b21da/7,0.013405702
8,8,0.3686123783052292,1.7515756059583216,2.7347358457668496,5.54605150614343,10000,runs:/f339307ee01f4333863d89038d633482/8,0.013560054
9,9,0.5167366016033235,0.8600531466342609,2.0866882952183143,3.908036057417946,10000,runs:/6ae06f3da94141c9a48d8d120e54e8cc/9,0.013675543
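
A function written for `applyInPandas` is an ordinary pandas function, so its grouped behavior can be checked locally before it runs on a cluster. The sketch below emulates the split-apply-combine step with plain pandas. `summarize_device` is a hypothetical stand-in for `train_model` without the MLflow calls, and `apply_in_pandas_locally` is an assumed helper, not a Spark API.

```python
import pandas as pd

def summarize_device(df_pandas: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for train_model: returns one summary row per device."""
    return pd.DataFrame({
        "device_id": [df_pandas["device_id"].iloc[0]],
        "n_used": [len(df_pandas)],
        "mean_label": [df_pandas["label"].mean()],
    })

def apply_in_pandas_locally(pdf: pd.DataFrame, key: str, func) -> pd.DataFrame:
    """Call func once per group and concatenate the results (local emulation only)."""
    parts = [func(group) for _, group in pdf.groupby(key)]
    return pd.concat(parts, ignore_index=True)

sample = pd.DataFrame({
    "device_id": [0, 1, 0, 1, 2],
    "label": [1.0, 2.0, 3.0, 4.0, 5.0],
})
result = apply_in_pandas_locally(sample, "device_id", summarize_device)
print(result)
```

This runs the groups sequentially in one process. On Spark, each group is handed to the function on an executor, which is what allows the per-device models to train in parallel.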


Define a pandas function to apply the model. *Because each group holds all the rows for one device, this needs only one read from DBFS per device.*

In [0]:
apply_return_schema = t.StructType([
    t.StructField("record_id", t.IntegerType()),
    t.StructField("prediction", t.FloatType())
])

def apply_model(df_pandas: pd.DataFrame) -> pd.DataFrame:
    """
    Applies model to data for a particular device, represented as a pandas DataFrame
    """
    model_path = df_pandas["model_path"].iloc[0]

    input_columns = ["feature_1", "feature_2", "feature_3"]
    X = df_pandas[input_columns]

    model = mlflow.sklearn.load_model(model_path)
    prediction = model.predict(X)

    return_df = pd.DataFrame({
        "record_id": df_pandas["record_id"],
        "prediction": prediction
    })
    return return_df

prediction_df = combined_df.groupby("device_id").applyInPandas(apply_model, schema=apply_return_schema)
display(prediction_df)

record_id,prediction
0,1.853209
10,4.7296352
20,4.135297
30,5.5866423
40,2.9728177
50,3.346861
60,2.4716926
70,2.760603
80,4.108098
90,4.4955606
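
The point about loading the model once per device can be illustrated without Spark or MLflow. In the sketch below, `load_model_stub` is a hypothetical stand-in for `mlflow.sklearn.load_model` that records every call, and the `runs:/stub/...` paths and two-feature rows are made up for the example.

```python
import pandas as pd

load_calls = []  # records every (hypothetical) model load

def load_model_stub(model_path: str):
    """Hypothetical stand-in for mlflow.sklearn.load_model; returns a trivial predictor."""
    load_calls.append(model_path)
    offset = float(model_path.rsplit("/", 1)[-1])  # use the device suffix as a fake bias
    return lambda X: X.sum(axis=1) + offset

def apply_model_locally(df_pandas: pd.DataFrame) -> pd.DataFrame:
    """Load the model once for the group, then score all of the group's rows."""
    model = load_model_stub(df_pandas["model_path"].iloc[0])
    features = df_pandas[["feature_1", "feature_2"]]
    return pd.DataFrame({
        "record_id": df_pandas["record_id"],
        "prediction": model(features),
    })

rows = pd.DataFrame({
    "record_id": range(6),
    "device_id": [0, 1, 0, 1, 0, 1],
    "model_path": ["runs:/stub/0", "runs:/stub/1"] * 3,
    "feature_1": [0.1] * 6,
    "feature_2": [0.2] * 6,
})
predictions = pd.concat(
    [apply_model_locally(group) for _, group in rows.groupby("device_id")],
    ignore_index=True,
)
print(f"{len(rows)} rows scored with {len(load_calls)} model loads")
```

Grouping by device before scoring means the load count grows with the number of devices rather than with the number of rows.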


&copy; 2022 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>