presto 0.198 all queries slow down and worker seems do not work #11952

Open · yangjun616 opened this issue Nov 21, 2018 · 13 comments

@yangjun616

We use Presto 0.198 with 14 workers. Sometimes queries suddenly slow down and queue up a lot; even a very simple query takes a long time. What we can see is that code_cache_usage is low and GC time and counts on all workers are low, so it seems the workers are doing very little work.
First we restarted the coordinator, which did not help. Then we restarted all the workers, which fixed it.
We are also sure that all the catalogs are healthy, because we have two Presto clusters using the same catalogs and the other cluster works without any problems.

yangjun616 changed the title from "presto 0.198 query slow down and work seems do not work" to "presto 0.198 all queries slow down and worker seems do not work" on Nov 21, 2018
@mnoumanshahzad

I am running Presto 0.212 on AWS EMR 5.19 and I am facing the same issue.

Can someone help me figure out how to debug this situation?

Presto UI: [screenshot]

Some of the running queries: [screenshot]

Ganglia cluster overview: [screenshot]

You can see in the Ganglia overview that the cluster performs fine when it starts off, but then something goes wrong. CPU usage is very low, even though there are 11 queries running in my case (the oldest one running for approx. 38 minutes).

I have had much more complex queries with far larger input data that ran fine.

I recently upgraded from Presto 0.203, and I remember this issue existed there as well, but it only appeared after the cluster had been running for several days.

@dain
Contributor

dain commented Nov 22, 2018

In my experience, this is almost always caused by slow storage, and specifically S3 throttling. You can verify this by looking at a jstack of your workers to see where the worker threads are "stuck" (worker threads are named after the query). If they are stuck in S3 code, you are getting throttled.
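
A minimal sketch of that check, assuming the worker JVM's main class matches "PrestoServer" and that worker thread names start with the query ID (both assumptions; adjust to your deployment):

```sh
# Find the Presto worker JVM and dump all thread stacks.
PID=$(pgrep -f PrestoServer | head -n 1)
jstack "$PID" > /tmp/presto-worker-jstack.txt

# Worker threads carry the query ID in their name, so grep for a slow query's
# ID and look at where those threads are blocked.
grep -A 30 '20181121_' /tmp/presto-worker-jstack.txt | less

# A rough signal for S3 trouble: how many stack frames sit in the AWS SDK.
grep -c 'com.amazonaws' /tmp/presto-worker-jstack.txt
```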

@nezihyigitbasi
Contributor

When S3 throttles your requests, the individual HTTP requests should fail with a 503 error code instead of threads getting stuck (Presto will then retry those requests and log the failures in the server logs). More info can be found in the S3 docs: https://docs.aws.amazon.com/AmazonS3/latest/API/ErrorResponses.html and https://docs.aws.amazon.com/AmazonS3/latest/dev/ErrorBestPractices.html#UsingErrorsSlowDown

If threads are getting stuck in S3 code, it can be due to many different issues (we need to see the stack trace to understand the cause): S3 being slow, the network being slow, no free connections in the HTTP client connection pools, etc.
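
One quick way to check for the throttling case described above is to grep a worker's server log for throttling responses and S3 client errors; the log path and the exact strings below are assumptions and vary by installation:

```sh
# Look for 503 / "Slow Down" responses and S3 client exceptions in a worker's
# server log (the path is installation-specific; on EMR it is often under
# /var/log/presto/).
grep -iE '503|slow ?down|AmazonS3Exception' /var/log/presto/server.log | tail -n 50
```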

@dpat

dpat commented Jun 6, 2019

We are still seeing these same issues with 0.206. Has a workaround been found?

@dain The server logs show no S3 failures, so this is not an S3 throttling issue.

The Ganglia graphs look identical in behavior to @mnoumanshahzad's, with an initially normal CPU usage level and then a drop in CPU usage as memory plateaus at around 50% of cluster capacity.

Also: is there a single command that can restart all Presto workers from the coordinator?

Edit / temporary workaround: restarting presto-server on the worker nodes clears up memory and resolves the slowdown. I have not yet looked into what is causing this memory build-up; for now a cron job stops and starts presto-server across the worker nodes every couple of hours (a sketch of that job is below).
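
A rough sketch of that cron-driven restart, assuming ssh access to the workers and hypothetical host names (the service-manager command differs between EMR releases, so treat it as a placeholder):

```sh
#!/usr/bin/env bash
# restart-presto-workers.sh -- restart presto-server on each worker over ssh.
set -euo pipefail

WORKERS=(worker-1 worker-2 worker-3)   # hypothetical host names

for host in "${WORKERS[@]}"; do
  ssh "$host" 'sudo systemctl restart presto-server' \
    || echo "restart failed on $host" >&2
done
```

Scheduled from cron with an entry such as `0 */2 * * * /path/to/restart-presto-workers.sh` to run every two hours.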

@zsaltys

zsaltys commented Jan 3, 2020

We are observing the same issues: over time Presto performance degrades. We don't use S3; we run on EMR with HDFS, version 0.229. I went over the Ganglia graphs. We did a restart Thursday evening and some things definitely changed.

[three Ganglia screenshots]

The most interesting one is the process count, which seems to have gone down with the restart. It also seems the cluster was able to read from HDFS much faster after the restart.

No idea how to troubleshoot this further to find the root cause. Is there anything we can do or provide to help you figure out what might be happening? The only option I have at this point may be to do regular restarts.

@zsaltys

zsaltys commented Jan 3, 2020

@dain could you take a look at these screenshots? Could you provide any tips or clues on what I should look into the next time this happens?

@findepi
Contributor

findepi commented Jan 3, 2020

@zsaltys for your information, @dain now works on the https://github.com/prestosql/presto/ repo; you can find him there.
You can also reach him on the Presto Community Slack.

@shixuan-fan
Contributor

Random thought: a jstack dump might be interesting to look at, just to see what the threads are busy with.
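
For example, a few dumps spaced apart make it easier to tell whether the same threads stay busy or blocked (the PID lookup again assumes a PrestoServer main class):

```sh
# Capture three thread dumps ten seconds apart.
PID=$(pgrep -f PrestoServer | head -n 1)
for i in 1 2 3; do
  jstack "$PID" > "/tmp/jstack-$i.txt"
  sleep 10
done

# Quick summary of thread states across the dumps.
grep -h 'java.lang.Thread.State' /tmp/jstack-*.txt | sort | uniq -c
```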

@tooptoop4

On the 'Performance boost - timeline' slide of https://www.starburstdata.com/wp-content/uploads/2019/06/Lyft-Dynamic-Presto-Scaling.pdf, Lyft mentions daily recycling of nodes.

@tooptoop4

@zsaltys can you take a heap dump on a fresh cluster and another one a week later?
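
For reference, a heap dump can be captured with standard JDK tooling; the PID lookup and output path below are assumptions. Comparing the fresh dump with the week-old one in a heap analyzer such as Eclipse MAT should show what is accumulating.

```sh
# Dump only live (reachable) objects; this triggers a full GC first.
PID=$(pgrep -f PrestoServer | head -n 1)
jmap -dump:live,format=b,file=/tmp/presto-worker-$(date +%F).hprof "$PID"
```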

@tooptoop4

Did you solve this, @yangjun616 @mnoumanshahzad @zsaltys @dpat? I also see one particular older worker processing splits much more slowly than the other workers on v336.

@friendofasquid

friendofasquid commented Jun 30, 2021

We discovered this issue after experiencing similar behaviour.

We used jstack as recommended and found the cause to be a user issuing a rather large regex query, perhaps repeatedly. Subsequent failures seem to have been related to something fixed in 0.255. (We're using the latest EMR version, which ships 0.245.)

[screenshot]

@tooptoop4

@friendofasquid or anyone else, have you encountered the issue on release >= 0.255?
