Tune hlink partition usage and memory options #40
This PR adjusts the heuristic used to compute the number of partitions to request from Spark and modifies the `hlink.spark.session.SparkConnection` class to expose the `spark.driver.memory` option through its convenience functions. It also renames a related test file to match the convention we've been using.

`spark_shuffle_partitions_heuristic()` is now capped at 10,000, so the returned value lies in the range [200, 10000]. This prevents large datasets from requesting too many partitions. Before this change, a dataset of ~2.7 billion records would have requested about 100,000 partitions; now it requests only 10,000.
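For reference, here is a minimal sketch of the capped heuristic. The per-record scaling factor is an assumption, chosen so that ~2.7 billion records map to roughly 100,000 uncapped partitions (matching the example above); the actual hlink heuristic may compute its uncapped estimate differently.

```python
def spark_shuffle_partitions_heuristic(dataset_size: int) -> int:
    """Sketch of a capped shuffle-partition heuristic (not the exact hlink code).

    Assumes roughly one partition per 25,000 records, then clamps the
    result to the [200, 10000] range described in this PR.
    """
    records_per_partition = 25_000  # assumed scaling factor, for illustration only
    estimate = dataset_size // records_per_partition
    # Clamp: never fewer than 200 partitions, never more than 10,000.
    return max(200, min(estimate, 10_000))


# A small dataset hits the floor; a huge one now hits the new cap.
assert spark_shuffle_partitions_heuristic(1_000_000) == 200
assert spark_shuffle_partitions_heuristic(2_700_000_000) == 10_000
```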
The `SparkConnection.connect` method now has a `driver_memory` argument that can be used to adjust the amount of memory allocated to the Spark driver. Connections made with `SparkConnection.local` now automatically set the driver memory to the same value as the executor memory, since the driver and executor run on the same machine in that case. This lets users tune how much memory their driver gets instead of always using Spark's default amount.
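The sketch below shows how these options might be used together. Only the `SparkConnection.connect` / `SparkConnection.local` method names and the `driver_memory` keyword come from this PR; the constructor arguments and the other keyword arguments are illustrative placeholders and may not match hlink's actual signatures.

```python
from hlink.spark.session import SparkConnection

# Constructor arguments here are placeholders for illustration, not
# hlink's documented signature.
conn = SparkConnection(
    derby_dir="/tmp/derby",
    warehouse_dir="/tmp/warehouse",
    tmp_dir="/tmp/spark_tmp",
    python="python3",
    db_name="linking",
)

# connect() now accepts driver_memory (per this PR), so the driver's memory
# can be tuned instead of relying on Spark's default. The other keyword
# names shown are assumptions.
spark = conn.connect(executor_memory="16G", driver_memory="8G")

# local() sets driver memory equal to executor memory automatically, since
# the driver and executors share one machine in local mode.
spark_local = conn.local(executor_memory="16G")
```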