Tune hlink partition usage and memory options #40
This PR adjusts the heuristic used to compute the number of partitions to request from Spark and modifies the `hlink.spark.session.SparkConnection` class to expose the `spark.driver.memory` option through its convenience functions. It also renames a related test file to match the convention we've been using.

`spark_shuffle_partitions_heuristic()` is now capped at 10,000, so the returned value lies in the range [200, 10000]. This prevents large datasets from requesting too many partitions. Before this change, a dataset of ~2.7 billion records would have requested about 100,000 partitions; now it requests only 10,000.
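For reference, here is a minimal sketch of the capped heuristic. The per-record scaling factor is an assumption, chosen so that ~2.7 billion records map to roughly 100,000 uncapped partitions (matching the example above); the actual hlink heuristic may compute its uncapped estimate differently.

```python
def spark_shuffle_partitions_heuristic(dataset_size: int) -> int:
    """Sketch of a capped shuffle-partition heuristic (not the exact hlink code).

    Assumes roughly one partition per 25,000 records, then clamps the
    result to the [200, 10000] range described in this PR.
    """
    records_per_partition = 25_000  # assumed scaling factor, for illustration only
    estimate = dataset_size // records_per_partition
    # Clamp: never fewer than 200 partitions, never more than 10,000.
    return max(200, min(estimate, 10_000))


# A small dataset hits the floor; a huge one now hits the new cap.
assert spark_shuffle_partitions_heuristic(1_000_000) == 200
assert spark_shuffle_partitions_heuristic(2_700_000_000) == 10_000
```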
The `SparkConnection.connect` method now has a `driver_memory` argument that can be used to adjust the amount of memory allocated to the Spark driver. Connections made with `SparkConnection.local` now automatically set the driver memory to the same value as the executor memory, since the driver and executor run on the same machine in that case. This lets users tune how much memory their driver gets instead of always using Spark's default amount.
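The sketch below shows how these options might be used together. Only the `SparkConnection.connect` / `SparkConnection.local` method names and the `driver_memory` keyword come from this PR; the constructor arguments and the other keyword arguments are illustrative placeholders and may not match hlink's actual signatures.

```python
from hlink.spark.session import SparkConnection

# Constructor arguments here are placeholders for illustration, not
# hlink's documented signature.
conn = SparkConnection(
    derby_dir="/tmp/derby",
    warehouse_dir="/tmp/warehouse",
    tmp_dir="/tmp/spark_tmp",
    python="python3",
    db_name="linking",
)

# connect() now accepts driver_memory (per this PR), so the driver's memory
# can be tuned instead of relying on Spark's default. The other keyword
# names shown are assumptions.
spark = conn.connect(executor_memory="16G", driver_memory="8G")

# local() sets driver memory equal to executor memory automatically, since
# the driver and executors share one machine in local mode.
spark_local = conn.local(executor_memory="16G")
```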