## ONNX Inference on Spark

In this example, we train a LightGBM model, convert it to ONNX format, and use the converted model to run inference on test data in Spark.

Maven dependencies:

- com.microsoft.onnxruntime:onnxruntime:1.8.1
- com.microsoft.ml.spark:mmlspark-core:{mmlspark_version}
- com.microsoft.ml.spark:mmlspark-deep-learning:{mmlspark_version}

> For the MMLSpark dependencies, set the resolver to `https://mmlspark.azureedge.net/maven` and use the version built from the master branch.
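
One way to supply these coordinates when building a Spark session is through the standard `spark.jars.packages` and `spark.jars.repositories` settings. The sketch below only assembles the configuration values. The helper name `build_spark_conf` is hypothetical, and the version string passed to it is a placeholder that must be replaced with a real MMLSpark version.

```python
from typing import Dict

# Resolver URL taken from the note above.
MMLSPARK_RESOLVER = "https://mmlspark.azureedge.net/maven"


def build_spark_conf(mmlspark_version: str) -> Dict[str, str]:
    """Assemble Spark settings that pull the Maven dependencies listed above.

    `build_spark_conf` is a hypothetical helper, not part of MMLSpark.
    """
    packages = [
        "com.microsoft.onnxruntime:onnxruntime:1.8.1",
        f"com.microsoft.ml.spark:mmlspark-core:{mmlspark_version}",
        f"com.microsoft.ml.spark:mmlspark-deep-learning:{mmlspark_version}",
    ]
    return {
        # Comma-separated Maven coordinates resolved when the session starts.
        "spark.jars.packages": ",".join(packages),
        # Additional repository to search for the MMLSpark artifacts.
        "spark.jars.repositories": MMLSPARK_RESOLVER,
    }


# "<mmlspark_version>" is a placeholder, not a real release number.
conf = build_spark_conf("<mmlspark_version>")
for key, value in conf.items():
    print(f"{key}={value}")
```

With PySpark, each key/value pair could then be passed to `SparkSession.builder.config(key, value)` before calling `getOrCreate()`.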

Python dependencies:

- onnxmltools==1.7.0
- lightgbm==3.2.1

Download the training data.

In [None]:
import pandas as pd
data = pd.read_csv("https://mmlspark.blob.core.windows.net/publicwasb/company_bankruptcy_prediction_data.csv")
data

Train a model with LightGBM.

In [None]:
from lightgbm import LGBMClassifier
from sklearn.model_selection import train_test_split

# Separate the label column from the feature matrix.
y = data["Bankrupt?"].values
x = data.drop(["Bankrupt?"], axis=1).values
# Hold out 15% of the rows for evaluation, preserving the class ratio.
x, x_test, y, y_test = train_test_split(x, y, test_size=0.15, random_state=42, stratify=y)

model = LGBMClassifier(
    boosting_type="gbdt", objective="binary", is_unbalance=True,
    num_leaves=31, max_depth=-1, n_estimators=1000, learning_rate=0.05,
    reg_alpha=0.5, reg_lambda=1, min_child_weight=20, min_split_gain=0.01,
    subsample=0.7, subsample_freq=2, colsample_bytree=0.7,
    random_state=2021, n_jobs=-1,
)
# Report AUC on both splits and stop early if the metric stops improving for 300 rounds.
model.fit(x, y, verbose=1, eval_set=[(x, y), (x_test, y_test)], eval_names=["train", "test"], eval_metric="auc", early_stopping_rounds=300)


Convert the model to ONNX format, load it into an `ONNXModel`, and inspect the model inputs and outputs.

In [None]:
from mmlspark.onnx import ONNXModel
import numpy as np

def convertModel(lgbm_model: LGBMClassifier, X: np.ndarray) -> bytes:
  from onnxmltools.convert import convert_lightgbm
  from onnxconverter_common.data_types import FloatTensorType
  # Size the input tensor from the argument X (not the global x); -1 leaves the batch dimension dynamic.
  initial_types = [("input", FloatTensorType([-1, X.shape[1]]))]
  onnx_model = convert_lightgbm(lgbm_model, initial_types=initial_types, target_opset=9)
  # Serialize the ONNX protobuf to bytes so it can be set as the model payload.
  return onnx_model.SerializeToString()

model_payload_ml = convertModel(model, x)
onnx_ml = ONNXModel().setModelPayload(model_payload_ml)

print("Model inputs: " + str(onnx_ml.getModelInputs()))
print("Model outputs: " + str(onnx_ml.getModelOutputs()))

Map the model input to the input dataframe's column name (FeedDict), and map the output dataframe's column names to the model outputs (FetchDict).

In [None]:
input_name = list(onnx_ml.getModelInputs().keys())[0]
output_name_prob = list(onnx_ml.getModelOutputs().keys())[0]
output_name_pred = list(onnx_ml.getModelOutputs().keys())[1]

(
    onnx_ml.setDeviceType("CPU")
    .setFeedDict({input_name: "features"})
    .setFetchDict({"probability": output_name_prob, "prediction": output_name_pred})
    .setMiniBatchSize(5000)
)
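
The two dictionaries point in opposite directions, which is easy to mix up: FeedDict is keyed by model input name, while FetchDict is keyed by output column name. The plain-Python sketch below mimics that wiring on a list of row dictionaries and processes the rows in fixed-size batches, similar in spirit to `setMiniBatchSize`. It is an illustration only, not how `ONNXModel` is implemented. The names `fake_model`, `batches`, and `apply_model`, and the tensor names `input`, `label`, and `probabilities`, are assumptions made for the example.

```python
from typing import Callable, Dict, Iterator, List

Row = Dict[str, object]
Batch = Dict[str, List[object]]


def fake_model(inputs: Batch) -> Batch:
    """Stand-in for a model session: named inputs in, named outputs out."""
    scores = [sum(features) / len(features) for features in inputs["input"]]
    return {
        "probabilities": scores,
        "label": [int(score > 0.5) for score in scores],
    }


def batches(rows: List[Row], size: int) -> Iterator[List[Row]]:
    """Yield consecutive slices of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def apply_model(
    rows: List[Row],
    model: Callable[[Batch], Batch],
    feed_dict: Dict[str, str],
    fetch_dict: Dict[str, str],
    mini_batch_size: int,
) -> List[Row]:
    result = []
    for batch in batches(rows, mini_batch_size):
        # FeedDict: model input name -> input column name.
        inputs = {name: [row[col] for row in batch] for name, col in feed_dict.items()}
        outputs = model(inputs)
        for i, row in enumerate(batch):
            new_row = dict(row)
            # FetchDict: output column name -> model output name.
            for col, name in fetch_dict.items():
                new_row[col] = outputs[name][i]
            result.append(new_row)
    return result


rows = [{"features": [0.1, 0.2]}, {"features": [0.8, 0.9]}, {"features": [0.5, 0.7]}]
scored = apply_model(
    rows,
    fake_model,
    feed_dict={"input": "features"},
    fetch_dict={"probability": "probabilities", "prediction": "label"},
    mini_batch_size=2,
)
for row in scored:
    print(row["prediction"], round(row["probability"], 2))
```

Each output row keeps its original `features` column and gains one column per FetchDict key, which mirrors the shape of the dataframe returned by the transform step.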

Create some test data and run it through the ONNX model.

In [None]:
from pyspark.ml.feature import VectorAssembler

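n = 1000 * 1000  # number of synthetic rows to score
m = 95  # number of feature columns; must match the width of the training features
test = np.random.rand(n, m)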
testPdf = pd.DataFrame(test)
cols = list(map(str, testPdf.columns))
testDf = spark.createDataFrame(testPdf).repartition(200)
testDf = VectorAssembler().setInputCols(cols).setOutputCol("features").transform(testDf).drop(*cols).cache()

display(onnx_ml.transform(testDf))
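
The row count, partition count, and mini-batch size used above line up neatly, as the arithmetic below shows. This sketch assumes that rows are spread evenly across partitions and that mini-batches are formed within a partition; both are assumptions about runtime behavior rather than guarantees.

```python
import math

n_rows = 1000 * 1000    # rows generated in the cell above
n_partitions = 200      # argument passed to repartition
mini_batch_size = 5000  # value passed to setMiniBatchSize

# Approximate rows per partition under an even distribution.
rows_per_partition = math.ceil(n_rows / n_partitions)
# Number of mini-batches needed to cover one partition.
batches_per_partition = math.ceil(rows_per_partition / mini_batch_size)

print(f"~{rows_per_partition} rows per partition")
print(f"~{batches_per_partition} mini-batch(es) per partition")
```

Under these assumptions, each partition holds roughly one mini-batch's worth of rows, so changing one of the three values without the others would change how many model invocations each partition needs.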