Hive S3 connector slow to rename s3 directories #14453

benrifkind · 2020-04-28T16:17:04Z

I am trying to create a table via a query like

CREATE TABLE hive.temp.test_table AS
select * from hive.temp.big_table limit 100000000

The temp database points to the s3-key: bucket/prefix/

The query runs quickly and writes the data to the temporary s3 key

s3://bucket/tmp/<unique-query-identifier>/

There are 15 workers, so we end up with 15 files of around 4-5 GiB each in that directory.

My issue arises in the next step, when Presto moves the files from this tmp directory to the actual s3 directory (s3://bucket/prefix/test_table) that the Hive table points to. This step is extremely slow.

I can watch the files move. The move is in progress right now: it has already taken 25 minutes and there are still another 5 files to move, i.e., it is 2/3 done.

If I just run a simple aws s3 cp --recursive command on that same tmp directory, it takes around 10 minutes. That is a copy speed of 100+ MiB/s for about 60 GiB (15 * 4 GiB).
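As a rough sanity check on these numbers, here is the arithmetic as a small sketch. It assumes every file is exactly 4 GiB (the lower end of the 4-5 GiB range above) and that the 10 files already moved took the full 25 minutes; the function and variable names are only illustrative.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def throughput_mib_s(total_bytes: int, seconds: float) -> float:
    """Average throughput in MiB/s for a transfer of total_bytes in seconds."""
    return total_bytes / seconds / MIB


# Assumption: each of the 15 files is 4 GiB.
file_size = 4 * GIB

# aws s3 cp --recursive: all 15 files in about 10 minutes.
cli_rate = throughput_mib_s(15 * file_size, 10 * 60)

# Presto move: 10 of 15 files done after 25 minutes.
presto_rate = throughput_mib_s(10 * file_size, 25 * 60)

print(f"aws s3 cp: {cli_rate:.1f} MiB/s")
print(f"presto:    {presto_rate:.1f} MiB/s")
print(f"ratio:     {cli_rate / presto_rate:.2f}x")
```

Under those assumptions the CLI copy averages a bit over 100 MiB/s, while the Presto move averages under 30 MiB/s, so the move is several times slower than a plain recursive copy of the same data.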

Some background. Presto was initially failing on this query with error

com.amazonaws.SdkClientException: Unable to execute HTTP request: Read timed out

So I changed the following settings in hive.properties:

hive.s3.max-client-retries=20
hive.s3.connect-timeout=60s
hive.s3.socket-timeout=60s

Any thoughts on how to speed this up would be greatly appreciated.

mbasmanova · 2020-06-27T03:04:18Z

@same3r @wenleix Any ideas?

friendofasquid · 2021-08-02T11:00:18Z

I am seeing similar behaviour. Did you ever find a fix? Lots of log lines like this:

2021-08-02T10:32:38.504Z INFO transaction-finishing-304 com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem rename s3://XXXXXX/tmp/presto-root/75e2a5ff-bd1e-4425-8932-15488bdc16f4/year=2019/month=10/day=30 s3://XXXXXXX/foo/year=2019/month=10/day=30 using algorithm version 1

It looks like the coordinator is doing these renames serially, whereas they seem like a very good candidate to run in parallel.
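To illustrate what running these in parallel could look like, here is a minimal sketch using a thread pool. It is not how Presto or the EMR filesystem is implemented. `rename_all` and `fake_rename` are hypothetical names, and `fake_rename` only sleeps to stand in for a real S3 rename, which on S3 is typically a copy followed by a delete and therefore mostly waits on the network.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Tuple


def rename_all(
    pairs: Iterable[Tuple[str, str]],
    rename_one: Callable[[str, str], None],
    max_workers: int = 8,
) -> List[str]:
    """Run rename_one(src, dst) for every pair concurrently.

    Returns the destination paths in input order. Any exception raised
    by a rename is re-raised here once that rename's future is checked.
    """
    pairs = list(pairs)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(rename_one, src, dst) for src, dst in pairs]
        for future in futures:
            future.result()  # propagate failures instead of hiding them
    return [dst for _, dst in pairs]


# Hypothetical stand-in for an S3 rename: records the move and sleeps
# briefly to imitate a slow, I/O-bound copy + delete.
moved: List[Tuple[str, str]] = []


def fake_rename(src: str, dst: str) -> None:
    time.sleep(0.05)
    moved.append((src, dst))


# Placeholder paths, one per output file.
pairs = [(f"tmp/query-id/part-{i}", f"prefix/test_table/part-{i}") for i in range(15)]

start = time.perf_counter()
destinations = rename_all(pairs, fake_rename, max_workers=15)
elapsed = time.perf_counter() - start
print(f"renamed {len(destinations)} objects in {elapsed:.2f}s")
```

Because each rename spends most of its time waiting on I/O, threads are enough to overlap them. With a real client in place of `fake_rename`, the worker count would need to be bounded to stay within S3 request limits.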
