Hive S3 connector slow to rename s3 directories #14453

Open
benrifkind opened this issue Apr 28, 2020 · 2 comments

Comments

@benrifkind

I am trying to create a table via a query like

CREATE TABLE hive.temp.test_table AS
SELECT * FROM hive.temp.big_table LIMIT 100000000

The temp database points to the S3 prefix s3://bucket/prefix/

The query runs quickly and writes the data to a temporary S3 prefix:

s3://bucket/tmp/<unique-query-identifier>/

There are 15 workers, so we end up with 15 files of around 4-5 GiB each in that directory.

My issue arises in the next step, when Presto moves the files from this tmp directory to the actual S3 directory (s3://bucket/prefix/test_table) that the Hive table points to. This step is extremely slow.

I can watch the files move. The move is currently in progress: it has already taken 25 minutes and there are still another 5 files to move, i.e., it is about 2/3 done.

If I just do a simple aws s3 cp --recursive command from that same tmp directory it takes around 10 minutes. That is a copy speed of 100+ MiB/s for about 60 GiB (15 * 4 GiB).
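
To make the comparison concrete, here is a rough back-of-the-envelope calculation in Python. It assumes 4 GiB per file (the low end of the 4-5 GiB range above) and treats the rounded timings from this post as exact, so the results are only approximate.

```python
# Rough throughput comparison using the figures quoted in this post.
# Assumption: every file is 4 GiB (the post says "around 4-5 GiB").
GIB = 1024 ** 3
MIB = 1024 ** 2


def throughput_mib_per_s(total_bytes: int, seconds: float) -> float:
    """Average throughput in MiB/s for moving total_bytes in the given time."""
    return total_bytes / seconds / MIB


file_size = 4 * GIB

# aws s3 cp --recursive: all 15 files in about 10 minutes.
cli_rate = throughput_mib_per_s(15 * file_size, 10 * 60)

# Presto move: 10 of the 15 files after about 25 minutes.
presto_rate = throughput_mib_per_s(10 * file_size, 25 * 60)

print(f"aws s3 cp:   {cli_rate:.1f} MiB/s")
print(f"Presto move: {presto_rate:.1f} MiB/s")
print(f"ratio:       {cli_rate / presto_rate:.2f}x")
```

Under these assumptions the Presto move runs at roughly a quarter of the CLI copy speed.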

Some background: Presto was initially failing on this query with the error

com.amazonaws.SdkClientException: Unable to execute HTTP request: Read timed out

So I changed the following settings in hive.properties:

hive.s3.max-client-retries=20
hive.s3.connect-timeout=60s
hive.s3.socket-timeout=60s

Any thoughts on how to speed this up would be greatly appreciated.

@mbasmanova
Contributor

@same3r @wenleix Any ideas?

@friendofasquid

I am seeing similar behaviour. Did you ever find a fix? I see lots of log lines like this:

2021-08-02T10:32:38.504Z INFO transaction-finishing-304 com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem rename s3://XXXXXX/tmp/presto-root/75e2a5ff-bd1e-4425-8932-15488bdc16f4/year=2019/month=10/day=30 s3://XXXXXXX/foo/year=2019/month=10/day=30 using algorithm version 1

It looks like the coordinator is doing these renames serially, whereas this seems like a very good candidate for running in parallel.
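
As an illustration of the serial versus parallel difference, here is a minimal Python sketch. It is not Presto's or EMRFS's code: `rename_serial`, `rename_parallel` and `fake_rename` are hypothetical names, and `fake_rename` just sleeps to stand in for an S3 "rename" (which, since S3 has no native rename, amounts to a copy followed by a delete).

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

# (source, destination) pairs; the paths below are made up for the demo.
RenamePair = Tuple[str, str]
RenameFn = Callable[[str, str], None]


def rename_serial(pairs: List[RenamePair], rename_one: RenameFn) -> None:
    """Rename one directory at a time, as the log lines above suggest."""
    for src, dst in pairs:
        rename_one(src, dst)


def rename_parallel(pairs: List[RenamePair], rename_one: RenameFn,
                    max_workers: int = 8) -> None:
    """Submit every rename to a thread pool and wait for all of them."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(rename_one, src, dst) for src, dst in pairs]
        for future in futures:
            future.result()  # re-raises any exception from a worker thread


def fake_rename(src: str, dst: str) -> None:
    """Hypothetical stand-in for a copy + delete against S3."""
    time.sleep(0.05)


pairs = [(f"tmp/query-id/day={d:02d}", f"table/day={d:02d}") for d in range(1, 16)]

start = time.perf_counter()
rename_serial(pairs, fake_rename)
serial_s = time.perf_counter() - start

start = time.perf_counter()
rename_parallel(pairs, fake_rename)
parallel_s = time.perf_counter() - start

print(f"serial:   {serial_s:.2f}s")
print(f"parallel: {parallel_s:.2f}s")
```

Because each rename is dominated by waiting on the network rather than CPU, threads are enough to overlap the operations in this sketch. A real implementation would also need to handle partial failures and cleanup, which this example ignores.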
