## Vectorized PandasUDF + Keras for Inference

One of the exciting things we can do with Arrow/Vectorized PandasUDF is efficiently integrate Spark with numeric or even GPU code that was *not* designed with Spark in mind.

For example, we might build a model with Keras+TensorFlow, without Spark in the picture at all, and then use that model to perform inference efficiently on big datasets with Spark.

In this module, we'll do just that: we'll train a simple neural network using Keras, save it to disk, and then use it from Spark via PandasUDF.

We'll start by using Pandas to read our Diamonds dataset. To save time (and skip featurization), we'll use just the 6 continuous variables in our model to predict price.

In [3]:
import pandas as pd
import IPython.display as disp

input_file = "/dbfs/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv"

df = pd.read_csv(input_file, header = 0)
df.drop(df.columns[0], axis=1, inplace=True)
df.drop(df.columns[1:4], axis=1, inplace=True)
disp.display(df)
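
The positional drops above depend on the column order in the CSV. As a sketch of an equivalent approach, the same seven columns can be selected by name. The single row below is an illustrative stand-in (not read from the file), and the column layout is assumed to match the diamonds CSV, with the unnamed index column appearing as `Unnamed: 0`.

```python
import pandas as pd

# One illustrative row with the assumed column layout of diamonds.csv.
raw = pd.DataFrame(
    [[1, 0.23, "Ideal", "E", "SI2", 61.5, 55.0, 326, 3.95, 3.98, 2.43]],
    columns=["Unnamed: 0", "carat", "cut", "color", "clarity",
             "depth", "table", "price", "x", "y", "z"],
)

# Positional version, mirroring the cell above.
by_position = raw.copy()
by_position.drop(by_position.columns[0], axis=1, inplace=True)    # index column
by_position.drop(by_position.columns[1:4], axis=1, inplace=True)  # cut, color, clarity

# Name-based version: keep the 6 continuous features plus the price target.
keep = ["carat", "depth", "table", "price", "x", "y", "z"]
by_name = raw[keep]

print(list(by_position.columns))
print(list(by_name.columns))
```

Selecting by name makes it easier to see that `price` sits at position 3, which is the column the next cell splits off as the target.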

We'll do a train/test split, and look at a few rows as a sanity check:

In [5]:
from sklearn.model_selection import train_test_split

X = df.drop(df.columns[3], axis=1)
y = df.iloc[:,3:4]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

print(X[:5])

print(y[:5])

Now we'll build a simple feed-forward network in Keras, train it for a minute or so, and then check its performance on the test set.

In [7]:
dbutils.library.installPyPI("tensorflow")

In [8]:
import tensorflow as tf
import numpy as np

model = tf.keras.Sequential()
model.add(tf.keras.layers.Dense(16, input_dim=6, kernel_initializer='normal', activation='relu')) 
model.add(tf.keras.layers.Dense(1, kernel_initializer='normal', activation='linear'))

model.compile(loss='mean_squared_error', optimizer='adam', metrics=['mean_squared_error'])
model.fit(X_train, y_train, epochs=64, batch_size=128, validation_split=0.1, verbose=2)

scores = model.evaluate(X_test, y_test)
print("\nroot %s: %f" % (model.metrics_names[1], np.sqrt(scores[1])))

In [9]:
model.save("/tmp/model")

In [10]:
%sh cp -r /tmp/model /dbfs/tmp/model

OK, now let's get a Spark DataFrame with our test data. In real life, we'd read this directly from S3, HDFS, Kafka, or wherever.

But since this test set is already on the driver (and not very big), we can make a distributed Spark DF from the Pandas DF.

In [12]:
testDF = spark.createDataFrame(pd.DataFrame(X_test, columns=["carat", "depth", "table", "x", "y", "z"]))

display(testDF)

When our data shows up, we'll have to reshape it a little bit, and then we can do a regular Keras `model.predict()` on it. A raw prediction looks like this:

In [14]:
model.predict(X_test[:5])

That's almost perfect ... but Spark is going to expect a flat array of outputs from the PandasUDF (one for each input row).

In [16]:
pd.Series(model.predict(X_test[:5]).flatten())
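
The input side needs a similar reshape. A scalar PandasUDF receives one pandas Series per column, while Keras expects a 2-D array of records x features. The sketch below uses two made-up columns and a stand-in for `model.predict` (a row sum, chosen only because it returns the same `(records, 1)` shape) to show both reshapes end to end.

```python
import numpy as np
import pandas as pd

# Stand-ins for the per-column batches a scalar PandasUDF receives (illustrative values).
carat = pd.Series([0.23, 0.21, 0.29])
depth = pd.Series([61.5, 59.8, 62.4])
v = (carat, depth)

# Stack the Series into (columns, records), then transpose to (records, columns).
reshaped = np.asarray(v).reshape(len(v), -1).transpose()
print(reshaped.shape)

# Hypothetical stand-in for model.predict: returns one value per record, shaped (records, 1).
fake_predictions = reshaped.sum(axis=1, keepdims=True)

# Flatten to one value per input row, which is what Spark expects back.
flat = pd.Series(fake_predictions.flatten())
print(len(flat))
```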

With that out of the way, the following implementation (which uses the reshape tricks above) should work.

*Note: in real life, don't load the model on each call ... make sure the model is available ahead of time on the executors and just loaded once (e.g. by distributing a module/pyfile)*

In [18]:
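from pyspark.sql.functions import pandas_udf, PandasUDFType

import numpy as np

@pandas_udf("double", PandasUDFType.SCALAR)
def keras_predict(*v):
  # v is a tuple of pandas Series, one per input column, each holding one batch of rows
  # stack to len(v) x records, then transpose to records x len(v) columns
  reshaped = np.asarray(v).reshape(len(v), -1).transpose()
  # load from the /dbfs copy so the path resolves on the executors, not only on the driver
  keras_model = tf.keras.models.load_model('/dbfs/tmp/model')
  # flatten the (records, 1) predictions into one value per input row
  return pd.Series(keras_model.predict(reshaped).flatten())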

And now we can call `keras_predict` on the columns in the Spark DataFrame:

In [20]:
testDF.select(keras_predict(*testDF.columns)).show()
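
The earlier note about loading the model only once can be sketched as a small per-process cache. This is a minimal sketch, not the notebook's implementation: `get_model` and `_MODEL_CACHE` are hypothetical names, and it assumes the helper lives in a module shipped to the executors (for example with `SparkContext.addPyFile`), so that the module-level cache persists across calls within a Python worker. The loader is passed in as an argument so the pattern can be shown without TensorFlow.

```python
# Hypothetical helper module (e.g. model_cache.py) distributed to the executors.
_MODEL_CACHE = {}

def get_model(path, loader):
    """Return the model for `path`, calling `loader(path)` only on first use in this process."""
    if path not in _MODEL_CACHE:
        _MODEL_CACHE[path] = loader(path)
    return _MODEL_CACHE[path]

# Stand-in loader that records each call; a real UDF would pass tf.keras.models.load_model.
load_calls = []

def fake_loader(path):
    load_calls.append(path)
    return {"loaded_from": path}

first = get_model("/dbfs/tmp/model", fake_loader)
second = get_model("/dbfs/tmp/model", fake_loader)

# The loader ran once, and both calls returned the same cached object.
print(len(load_calls), first is second)
```

Inside `keras_predict`, the `load_model` line would then become something like `keras_model = get_model('/dbfs/tmp/model', tf.keras.models.load_model)`, so each worker process pays the loading cost once rather than once per batch.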

And now ... we've got Spark and all of our Python goodness playing nice!