I am trying to create a table via a query like

CREATE TABLE hive.temp.test_table AS
SELECT * FROM hive.temp.big_table LIMIT 100000000
The temp database points to the S3 key bucket/prefix/.
The query runs quickly and writes the data to the temporary S3 key
s3://bucket/tmp/<unique-query-identifier>/
There are 15 workers, so we end up with 15 files of around 4-5 GiB each in that directory.
My issue arises in the next step, when Presto moves the files from this tmp directory to the actual S3 directory (s3://bucket/prefix/test_table) that the Hive table points to. This step is extremely slow.
I can watch the files move. The move is in progress right now and has already taken 25 minutes, with another 5 files still to go, i.e. it is 2/3 done.
If I just do a simple aws s3 cp --recursive command from that same tmp directory, it takes around 10 minutes: 100+ MiB/s over about 60 GiB (15 × 4 GiB).
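For reference, here is the arithmetic behind those throughput numbers, using the sizes and timings stated above (the Presto total extrapolates 25 minutes for 2/3 of the files to ~37.5 minutes for all of them):

```python
# Rough throughput comparison for the two copy paths, using figures from the issue.
total_gib = 15 * 4                 # 15 workers x ~4 GiB per file = ~60 GiB
total_mib = total_gib * 1024

aws_cp_seconds = 10 * 60           # `aws s3 cp --recursive` took ~10 minutes
presto_seconds = 25 * 60 * 3 // 2  # 25 min for 2/3 of the files -> ~37.5 min total

print(f"aws s3 cp:    {total_mib / aws_cp_seconds:.0f} MiB/s")
print(f"presto moves: {total_mib / presto_seconds:.0f} MiB/s")
```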
Some background: Presto was initially failing on this query with the error
com.amazonaws.SdkClientException: Unable to execute HTTP request: Read timed out
I am seeing similar behaviour. Did you ever find a fix? Lots of
2021-08-02T10:32:38.504Z INFO transaction-finishing-304 com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem rename s3://XXXXXX/tmp/presto-root/75e2a5ff-bd1e-4425-8932-15488bdc16f4/year=2019/month=10/day=30 s3://XXXXXXX/foo/year=2019/month=10/day=30 using algorithm version 1
It looks like the coordinator is doing all of these renames serially, whereas they seem like an obvious candidate for running in parallel.
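A minimal sketch of that parallel idea (hypothetical helper names, not Presto's actual code): issue the per-file copies from a thread pool instead of one at a time, with the single-object S3 copy left abstract.

```python
from concurrent.futures import ThreadPoolExecutor

def move_all(pairs, copy_one, max_workers=16):
    """Run copy_one(src, dst) for every (src, dst) pair concurrently.

    copy_one would wrap the S3 copy + delete for one object; it is
    left abstract here, since the parallel structure is the point.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(copy_one, src, dst) for src, dst in pairs]
        for f in futures:
            f.result()  # re-raise any per-file failure

# Example with a stand-in copy function that just records its calls:
moved = []
move_all([("tmp/a", "final/a"), ("tmp/b", "final/b")],
         lambda src, dst: moved.append((src, dst)))
```

Since S3 "renames" are copy-then-delete and each object is independent, there is no ordering constraint preventing this kind of fan-out.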
So I changed the hive.properties configuration.
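For reference, the Hive connector's S3 timeout and retry knobs in hive.properties look like this (illustrative values only; the issue does not say which ones were actually changed):

```properties
hive.s3.socket-timeout=5m
hive.s3.connect-timeout=2m
hive.s3.max-retry-time=10m
hive.s3.max-connections=500
```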
Any thoughts on how to speed this up would be greatly appreciated.