Tune hlink partition usage and memory options #40

Merged on Sep 13, 2022 (6 commits)

riley-harper
Contributor

This PR adjusts the heuristic used to compute the number of partitions to request from Spark and modifies the hlink.spark.session.SparkConnection class to expose the spark.driver.memory option through its convenience functions. It also renames a related test file to match the convention we've been using.

  • The value returned by spark_shuffle_partitions_heuristic() is now capped at 10,000, so the returned value lies in the range [200..10000]. This prevents large datasets from requesting too many partitions. A dataset of size ~2.7 billion would request about 100,000 partitions before this change, but now it requests only 10,000.
  • The SparkConnection.connect method now has a driver_memory argument that sets the amount of memory allocated to the Spark driver. Connections made with SparkConnection.local now automatically set the driver memory to match the executor memory, since the driver and executor run on the same machine in this case. This lets users tune how much memory the driver gets instead of always using Spark's default amount.
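
For readers who want a concrete picture of the two changes, here is a minimal sketch in plain Python. It is not hlink's actual code: the function names `capped_partitions` and `local_memory_conf`, the `records_per_partition` parameter, and its default of 25,000 are all assumptions made for illustration (the default is chosen only so that ~2.7 billion maps to roughly 100,000 before capping, as described above). The only details taken from the PR are the [200..10000] range and the Spark options `spark.driver.memory` and `spark.executor.memory`.

```python
from typing import Dict, Optional

MIN_PARTITIONS = 200      # lower bound stated in the PR description
MAX_PARTITIONS = 10_000   # new upper bound introduced by this PR


def capped_partitions(dataset_size: int, records_per_partition: int = 25_000) -> int:
    """Hypothetical stand-in for the partition heuristic.

    The real formula is not shown in the PR; the division below is an
    assumed placeholder. Only the clamping to [200, 10000] reflects the PR.
    """
    uncapped = dataset_size // records_per_partition
    return max(MIN_PARTITIONS, min(MAX_PARTITIONS, uncapped))


def local_memory_conf(executor_memory: str,
                      driver_memory: Optional[str] = None) -> Dict[str, str]:
    """Hypothetical helper that builds Spark memory options for a local connection.

    When no driver memory is given, it defaults to the executor memory,
    mirroring the behavior described for SparkConnection.local.
    """
    if driver_memory is None:
        driver_memory = executor_memory
    # Each pair could be passed to SparkSession.builder.config(key, value).
    return {
        "spark.executor.memory": executor_memory,
        "spark.driver.memory": driver_memory,
    }


if __name__ == "__main__":
    print(capped_partitions(2_700_000_000))
    print(local_memory_conf("8G"))
```

Under these assumptions, a dataset of 2.7 billion hits the cap, a tiny dataset stays at the floor of 200, and a local connection that does not specify driver memory ends up with matching driver and executor values.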

… range [200..10000]

- This prevents very large datasets from requesting ~100000+ partitions
This should allow us to add a --driver_memory argument that lets users
adjust the amount of memory required for the Spark driver.
Collaborator

jacwellington left a comment

Looks great!

riley-harper merged commit 1e9ed71 into main Sep 13, 2022
riley-harper deleted the max_partitions branch September 13, 2022 20:49