Join GitHub today
GitHub is home to over 31 million developers working together to host and review code, manage projects, and build software together.Sign up
Optionally distribute Hive writes on partition keys #579
Today, when writing Hive tables, Presto arbitrarily distributes the data across the writer nodes. This results in good parallelization for the common case of a query writing a single partition, since each worker will write a separate file for the partition. For queries that need to writer hundreds or thousands of partitions, this behavior causes problems as each worker will end up writing a file to each partition, so there are hundreds (or thousands) of large output buffers and open file streams.
To resolve this issue, we should add a session property that when set changes the InsertLayout and CreateTableLayout in HiveMetadata to declare a partitioning based on the partition keys.