
[Feature Store] Performance improvements for preview and hist calculation in Spark engine #1860

Merged

merged 10 commits into mlrun:development on Apr 5, 2022

Conversation

theSaarco (Member) commented Apr 3, 2022

When ingesting a feature-set using the Spark engine, the code by default produces a preview (the first 20 rows of the data-frame) and a histogram (calculated over the entire DF) for int and float columns. The existing code was naive in its approach to these calculations, essentially running a separate select on the DF for each column, which may result in tens of queries.
This poses no issue when the DF is a basic tabular DF, and completes in negligible time. However, when aggregations are added, these queries run on top of the query that calculates the aggregations, which contains summaries and group-by expressions. Running that query so many times accumulates into very long execution times.
This PR reduces the number of queries performed in both of these stages. For the preview it uses a single query with sample (it's unclear why the previous approach was needed; it was needlessly complex). For the histogram calculation it adds a column per histogram bin, and to keep the number of columns from growing too large, it limits the number of histogram columns per query (500 columns, which is 25 fields, since we're using 20 bins per histogram). If more fields need histograms, additional queries are executed. Still, this reduces the number of queries performed by a factor of 25.
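
As a rough illustration of the bin-per-column idea, here is a minimal PySpark sketch (not the PR's actual code; the column name, bin edges, and min/max handling are assumptions):

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(float(i),) for i in range(100)], ["value"])

num_bins = 20
min_val, max_val = 0.0, 100.0  # assumed known from an earlier pass over the DF
bin_width = (max_val - min_val) / num_bins

# One aggregate column per bin: count rows whose value falls inside the bin.
# (Edge handling is simplified; the last bin would normally include max_val.)
bin_counts = [
    F.sum(
        F.when(
            (F.col("value") >= min_val + i * bin_width)
            & (F.col("value") < min_val + (i + 1) * bin_width),
            1,
        ).otherwise(0)
    ).alias(f"value_bin_{i}")
    for i in range(num_bins)
]
# All 20 bins per field (for as many fields as fit under the limit) go into one query.
histogram_row = df.agg(*bin_counts).collect()[0]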
The max number of columns is configurable by passing an environment variable to the Spark runtime, for example:

import mlrun.feature_store as fstore
from mlrun import code_to_function

my_func = code_to_function("func", kind="remote-spark")
# Allow up to 100 histogram columns in each Spark query
my_func.set_env("MLRUN_MAX_HISTOGRAM_COLUMNS_IN_QUERY", "100")
config = fstore.RunConfig(local=False, function=my_func, handler="ingest_handler")

This config is then passed to the ingest call.
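
For completeness, a hedged sketch of that ingest call (the feature set, source name, and path below are placeholders, not from this PR):

import mlrun.feature_store as fstore
from mlrun.datastore.sources import CSVSource

# Placeholder feature set and source -- illustrative only.
feature_set = fstore.FeatureSet(
    "my_features", entities=[fstore.Entity("id")], engine="spark"
)
source = CSVSource("my_source", path="v3io:///projects/data.csv")

fstore.ingest(feature_set, source, run_config=config)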

This also fixes a bug in the existing preview code, which used zip to align lists of values: when columns contain nulls, the results could come out empty. This happens often when doing aggregations in the usual emit-per-period mode, since the calculation fields only have values for a given window per row, so when multiple windows are used some fields always contain null values.
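
A minimal illustration of that failure mode (not the actual preview code): zip stops at the shortest input, so dropping nulls per column before zipping loses rows.

# Illustrative only -- the real preview code differs.
col_a = [1, 2, 3, 4]
col_b = [None, 10, None, None]

# Dropping nulls per column before zipping silently truncates the result:
rows = list(zip(col_a, [v for v in col_b if v is not None]))
print(rows)  # [(1, 10)] -- three rows lost; an all-null column would yield []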

pass
hist_columns.append(col)

# We may need multiple queries here. See above comment for reasoning.
Collaborator
"above comment" is ambiguous.

theSaarco (Member Author)
It's the only comment above this one 😄

hist_columns.append(col)

# We may need multiple queries here. See above comment for reasoning.
max_columns_per_query = int(MAX_HISTOGRAM_COLUMNS_IN_QUERY // num_bins)
Collaborator

I think // already results in an int, so the cast is redundant.

theSaarco (Member Author)

I thought that too, but you'd be surprised to find out it doesn't. If you do // between two floats (or a float and an int), the result is a float. See for example: https://stackoverflow.com/questions/1282945/why-does-integer-division-yield-a-float-instead-of-another-integer.
So, the cast is not redundant.
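
For example:

>>> 500 // 20    # int // int -> int
25
>>> 500.0 // 20  # float // int -> float
25.0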

# Limits how many histogram columns are calculated in a single query. By default we're using 20 bins per
# histogram, so 500 allows histograms over 25 columns in a single query. If there are more, additional
# queries will be executed.
MAX_HISTOGRAM_COLUMNS_IN_QUERY = 500
Contributor

Make it configurable?

theSaarco (Member Author)

Sure, why not?
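
A hedged sketch of what such an override might look like (the merged implementation may differ):

import os

# Illustrative sketch -- falls back to 500 when the variable is unset.
MAX_HISTOGRAM_COLUMNS_IN_QUERY = int(
    os.environ.get("MLRUN_MAX_HISTOGRAM_COLUMNS_IN_QUERY", 500)
)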

Hedingber merged commit 241db77 into mlrun:development on Apr 5, 2022