presto 0.198 all queries slow down and worker seems do not work #11952

Open · yangjun616 opened this issue Nov 21, 2018 · 13 comments

@yangjun616

We use Presto 0.198 with 14 workers. Sometimes queries suddenly slow down and queue up a lot; even a very simple query takes a long time. What we can see is that code_cache_usage is low and GC time and counts on all workers are low, so it seems the workers are doing very little work.
First we restarted the coordinator, which did not help. Then we restarted all the workers, which fixed it.
We are also sure that all the catalogs are healthy, because we have two Presto clusters using the same catalogs and the other cluster works without any problems.

yangjun616 changed the title from "presto 0.198 query slow down and work seems do not work" to "presto 0.198 all queries slow down and worker seems do not work" on Nov 21, 2018
@mnoumanshahzad

I am running Presto 0.212 on AWS EMR 5.19 and I am facing the same issue.

Can someone help me figure out how to debug this situation?

Presto UI: [screenshot]

Some of the running queries: [screenshot]

Ganglia cluster overview: [screenshot]

You can see in the Ganglia overview that the cluster performs fine when it starts off, but then something goes wrong. CPU usage is very low, even though there are 11 queries running in my case (the oldest one running for approx. 38 minutes).

I have had much more complex queries with far larger input data that ran fine.

I recently upgraded from Presto 0.203, and I remember this issue existed there as well, but it only appeared after the cluster had been running for several days.

@dain
Contributor

dain commented Nov 22, 2018

In my experience, this is almost always caused by slow storage, and specifically S3 throttling. You can verify this by looking at a jstack of your workers to see where the worker threads are "stuck" (worker threads are named after the query). If they are stuck in S3 code, you are getting throttled.
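
A minimal sketch of that check, assuming the worker JVM's main class matches "PrestoServer" and that worker thread names start with the query ID (both assumptions; adjust to your deployment):

```sh
# Find the Presto worker JVM and dump all thread stacks.
PID=$(pgrep -f PrestoServer | head -n 1)
jstack "$PID" > /tmp/presto-worker-jstack.txt

# Worker threads carry the query ID in their name, so grep for a slow query's
# ID and look at where those threads are blocked.
grep -A 30 '20181121_' /tmp/presto-worker-jstack.txt | less

# A rough signal for S3 trouble: how many stack frames sit in the AWS SDK.
grep -c 'com.amazonaws' /tmp/presto-worker-jstack.txt
```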

@nezihyigitbasi
Contributor

When S3 throttles your requests, the individual HTTP requests should fail with a 503 error code instead of threads getting stuck (Presto will then retry those requests and log the failures in the server logs). More info can be found in the S3 docs: https://docs.aws.amazon.com/AmazonS3/latest/API/ErrorResponses.html and https://docs.aws.amazon.com/AmazonS3/latest/dev/ErrorBestPractices.html#UsingErrorsSlowDown

If threads are getting stuck in S3 code, it can be due to many different issues (we need to see the stack trace to understand the cause): S3 being slow, the network being slow, no free connections in the HTTP client connection pools, etc.
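
One quick way to check for the throttling case described above is to grep a worker's server log for throttling responses and S3 client errors; the log path and the exact strings below are assumptions and vary by installation:

```sh
# Look for 503 / "Slow Down" responses and S3 client exceptions in a worker's
# server log (the path is installation-specific; on EMR it is often under
# /var/log/presto/).
grep -iE '503|slow ?down|AmazonS3Exception' /var/log/presto/server.log | tail -n 50
```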

@dpat

dpat commented Jun 6, 2019

We are still seeing these same issues with 0.206. Has a workaround been found?

@dain The server logs show no S3 failures, so this is not an S3 throttling issue.

The Ganglia graphs look identical in behavior to @mnoumanshahzad's, with an initially normal CPU usage level and then a drop in CPU usage as memory plateaus at around 50% of cluster capacity.

Also: is there a single command that can restart all Presto workers from the coordinator?

Edit / temporary workaround: restarting presto-server on the worker nodes clears up memory and resolves the slowdown. I have not yet looked into what is causing this memory build-up; for now a cron job stops and starts presto-server across the worker nodes every couple of hours (a sketch of that job is below).
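
A rough sketch of that cron-driven restart, assuming ssh access to the workers and hypothetical host names (the service-manager command differs between EMR releases, so treat it as a placeholder):

```sh
#!/usr/bin/env bash
# restart-presto-workers.sh -- restart presto-server on each worker over ssh.
set -euo pipefail

WORKERS=(worker-1 worker-2 worker-3)   # hypothetical host names

for host in "${WORKERS[@]}"; do
  ssh "$host" 'sudo systemctl restart presto-server' \
    || echo "restart failed on $host" >&2
done
```

Scheduled from cron with an entry such as `0 */2 * * * /path/to/restart-presto-workers.sh` to run every two hours.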

@zsaltys

zsaltys commented Jan 3, 2020

We are observing the same issues: over time Presto performance degrades. We don't use S3; we run on EMR with HDFS, version 0.229. I went over the Ganglia graphs. We did a restart Thursday evening and some things definitely changed.

[three Ganglia screenshots]

The most interesting one is the process count, which seems to have gone down with the restart. It also seems the cluster was able to read from HDFS much faster after the restart.

No idea how to troubleshoot this further to find the root cause. Is there anything we can do or provide to help you figure out what might be happening? The only option I have at this point may be to do regular restarts.

@zsaltys

zsaltys commented Jan 3, 2020

@dain could you take a look at these screenshots? Could you provide any tips or clues on what I should look into the next time this happens?

@findepi
Contributor

findepi commented Jan 3, 2020

@zsaltys for your information, @dain now works on the https://github.com/prestosql/presto/ repo; you can find him there.
You can also reach him on the Presto Community Slack.

@shixuan-fan
Contributor

Random thought: a jstack dump might be interesting to look at, just to see what the threads are busy with.
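
For example, a few dumps spaced apart make it easier to tell whether the same threads stay busy or blocked (the PID lookup again assumes a PrestoServer main class):

```sh
# Capture three thread dumps ten seconds apart.
PID=$(pgrep -f PrestoServer | head -n 1)
for i in 1 2 3; do
  jstack "$PID" > "/tmp/jstack-$i.txt"
  sleep 10
done

# Quick summary of thread states across the dumps.
grep -h 'java.lang.Thread.State' /tmp/jstack-*.txt | sort | uniq -c
```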

@tooptoop4

On the 'Performance boost - timeline' slide of https://www.starburstdata.com/wp-content/uploads/2019/06/Lyft-Dynamic-Presto-Scaling.pdf, Lyft mentions daily recycling of nodes.

@tooptoop4

@zsaltys can you take a heap dump on a fresh cluster and another one a week later?
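
For reference, a heap dump can be captured with standard JDK tooling; the PID lookup and output path below are assumptions. Comparing the fresh dump with the week-old one in a heap analyzer such as Eclipse MAT should show what is accumulating.

```sh
# Dump only live (reachable) objects; this triggers a full GC first.
PID=$(pgrep -f PrestoServer | head -n 1)
jmap -dump:live,format=b,file=/tmp/presto-worker-$(date +%F).hprof "$PID"
```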

@tooptoop4

Did you solve this, @yangjun616 @mnoumanshahzad @zsaltys @dpat? I also see one particular older worker processing splits much more slowly than the other workers on v336.

@friendofasquid

friendofasquid commented Jun 30, 2021

We discovered this issue after experiencing similar behaviour.

We used jstack as recommended and found the cause to be a user issuing a rather large regex query, perhaps repeatedly. Subsequent failures seem to have been related to something fixed in 0.255. (We're using the latest EMR version, which ships 0.245.)

[screenshot]

@tooptoop4

@friendofasquid or anyone else, have you encountered the issue on release >= 0.255?
