
QueuedThreadPool increased thread usage and no idle thread decay #4105

Closed
AdrianFarmadin opened this issue Sep 19, 2019 · 38 comments
Labels: Bug (For general bugs on Jetty side)

@AdrianFarmadin

AdrianFarmadin commented Sep 19, 2019

I recently updated my Jetty version from 9.4.7.v20170914 to 9.4.20.v20190813 and I noticed an increase in the JVM thread count from 410 to 2300. The update was done on a test environment with stable load.

According to the thread dump, 1912 threads are in the TIMED_WAITING state.

"jetty-1982" #1982 daemon prio=5 os_prio=0 tid=0x00007f9f28019800 nid=0x13ba0 waiting on condition [0x00007f9ae3677000]
   java.lang.Thread.State: TIMED_WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x00000003c0001150> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
	at org.eclipse.jetty.util.BlockingArrayQueue.poll(BlockingArrayQueue.java:382)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.idleJobPoll(QueuedThreadPool.java:851)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:890)
	at java.lang.Thread.run(Thread.java:748)

The active threads were also affected.
[chart: activeThreads]

Is this a regression in the new version?

@sbordet
Contributor

sbordet commented Sep 19, 2019

That stack trace shows a thread idle in the pool.

How many cores does your machine have?

2300 seems too large a number, but could it be due to your application allocating HttpClient instances?

What happened on the 13th? Did you revert to 9.4.7?

@joakime
Contributor

joakime commented Sep 19, 2019

Are you using the QueuedThreadPool?
I ask because the idle thread name jetty-1982 does not match the default naming scheme for QueuedThreadPool.
If you are using QueuedThreadPool, what is its configuration?

If you are using a Java Executor (instead of the QueuedThreadPool), which one? and how is it configured?

@AdrianFarmadin
Author

How many cores does your machine have?

Physical CPU cores: 32
Logical CPU cores: 64

2300 seems too large a number, but could it be due to your application allocating HttpClient instances?

I'm not aware that it allocates HttpClient instances. I'll investigate further to be sure.

What happened on the 13th? Did you revert to 9.4.7?

Yes, on the 13th I reverted to 9.4.7.

Are you using the QueuedThreadPool?

Yes, I'm using the QueuedThreadPool with the following configuration:

int minThreads = 8;
int maxThreads = 2048;
int maxCapacity = 50000;
BlockingArrayQueue<Runnable> threadPoolQueue = new BlockingArrayQueue<>(maxThreads, minThreads, maxCapacity);
QueuedThreadPool threadPool = new QueuedThreadPool(maxThreads, minThreads, 60_000, threadPoolQueue);
threadPool.setName("jetty");
threadPool.setDaemon(true);

I have Java Flight Recordings from before and after the Jetty update. Maybe they can be helpful.

@gregw
Contributor

gregw commented Sep 23, 2019

Interesting?!?!? We have done some work on the QueuedThreadPool, but if anything we believe it should result in fewer idle threads rather than more.

With so many idle threads and your configured timeout of 60s, I would expect the number of idle threads to slowly decay over time... but that doesn't appear to be the case. So something must be momentarily using all those threads at least once per minute, OR we have a bug where we prefer to create a new thread rather than reuse one that is idle (or just about to become idle).

I can't see anything particularly wrong with your QueuedThreadPool configuration (well, I don't like the bound on its size, as the failures you get with a bounded thread pool queue are no better and no less random than out-of-memory exceptions... but at least you have a large bound). However, just to remove doubt about that, could you run with a configuration of:

int minThreads = 8;
int maxThreads = 4096;
QueuedThreadPool threadPool = new QueuedThreadPool(maxThreads, minThreads, 60_000);
threadPool.setName("jetty");
threadPool.setDaemon(true);

Note that I have doubled maxThreads, because if the problem persists (as I suspect it will) I want to see if the thread count goes well beyond 2048 - i.e. is 1912 some natural level, or was it just bouncing off maxThreads? (Hopefully you have the memory for that.)

It would then also be good to run with the idle time reduced to 10s, just so we can see if there is any reduction and how frequent the peaks that create more idle threads are. Seeing a plot of thread count at 1s resolution would be handy.

@AdrianFarmadin
Author

Could the problem be caused by the following scenario?

In the QueuedThreadPool.ensureThreads() method:
Whenever there are not enough idle threads available to handle a new job (the idle count goes below 0), ensureThreads() is called.
This is essentially a loop which spawns new threads until the idle count is >= 0 or the max limit is reached.

But the loop does not change the idle count itself. Instead it relies on each spawned thread to increase the idle count via addCounts(). This introduces a race condition between the spawning loop and the Runner code that is executed in the new threads.

So depending on when the first threads actually run, the loop might already have started dozens or even hundreds of threads before enough of them have run to bring the count back up to 0.

[screenshot: screenshot-9]
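To make the suspected race concrete, here is a simplified, self-contained sketch of that pattern. It is an illustration only, not the actual Jetty source; the class and field names are invented for the example.

import java.util.concurrent.atomic.AtomicInteger;

// Simplified illustration of the suspected race (not the real Jetty code).
// The spawning loop keeps starting threads while the idle count is negative,
// but only the spawned threads themselves increment that count, so the loop
// can overshoot badly if thread start-up lags behind.
public class EnsureThreadsRaceSketch
{
    private final AtomicInteger idleCount = new AtomicInteger(-1); // one queued job, no idle thread
    private final AtomicInteger threadCount = new AtomicInteger();
    private final int maxThreads = 2048;

    void ensureThreads()
    {
        while (idleCount.get() < 0 && threadCount.get() < maxThreads)
        {
            threadCount.incrementAndGet();
            new Thread(() ->
            {
                // The idle count only rises here, possibly long after the
                // loop above has already spawned many more threads.
                idleCount.incrementAndGet();
                // ... then poll the job queue, as Runner.idleJobPoll() does ...
            }).start();
        }
    }
}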

@AdrianFarmadin
Author

I ran the tests with the suggested configuration:

int minThreads = 8;
int maxThreads = 4096;
QueuedThreadPool threadPool = new QueuedThreadPool(maxThreads, minThreads, 60_000);
threadPool.setName("jetty");
threadPool.setDaemon(true);

The number of threads increased to 4400 (3636 were in TIMED_WAITING state). I noticed there was a spike in active threads at process start.

With an idle timeout of 1 min or 10 s, the number of threads didn't decrease over time. But with an idle timeout of 1 s, the number of threads fell over time to ~1700.

Charts for 1min, 10s and 1s idle timeout tests:
[charts: idleTimeoutThreads, idleTimeoutActiveThreads]

Only 1s test run:
[charts: idleTimeout1sThreads, idleTimeout1sActiveThreads]

@joakime
Contributor

joakime commented Sep 24, 2019

Looking at the "1s test run", it seems like each peak in the "Threads" graph is correlated with a peak of "Active Threads" in the "Webserver" graph.

Out of curiosity, what is your configured selector count on this 64-core machine?
Also, how many ServerConnectors do you have?

@sbordet
Contributor

sbordet commented Sep 24, 2019

@gregw I think @AdrianFarmadin has a point in his analysis about the race. Thoughts?

@gregw
Contributor

gregw commented Sep 24, 2019

@sbordet Agreed that looks like it could indeed be a problem.... pondering!

lachlan-roberts added commits that referenced this issue Sep 25, 2019
@joakime joakime added this to To do in Jetty 9.4.21 via automation Sep 25, 2019
@joakime joakime moved this from To do to In Progress in Jetty 9.4.21 Sep 25, 2019
lachlan-roberts added a commit that referenced this issue Sep 25, 2019
…d in QTP (#4118)

* Issue #4105 - starting a thread in QTP now increments idle count
* Issue #4105 - improve comments in test
@joakime
Contributor

joakime commented Sep 25, 2019

Merged PR #4118 to address the idle count issue pointed out by @AdrianFarmadin.

@AdrianFarmadin
Author

I tested the fix from PR #4118 and there is an improvement. The thread count is now around 3700 and there is no spike in active threads on startup.

[chart: threads-PR4118]

Out of curiosity, what is your configured selector count on this 64-core machine?
Also, how many ServerConnectors do you have?

The selector count should be -1 (the default) and there are only 2 ServerConnectors: one for secure connections and one for unsecured connections (not used while testing).

@sbordet
Contributor

sbordet commented Sep 26, 2019

@AdrianFarmadin so if you set an aggressive idle timeout for threads (e.g. 5 seconds), do you see the thread count constrained to a lower number?

The point being that if you have moderate activity on the server, every thread may be "touched" before it expires, so a single spike may create a lot of threads that will then never expire - each does a little work and pauses for a long time, but not long enough to be timed out.

Setting a more aggressive idle timeout will reduce the likelihood of this situation.
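Concretely, a minimal sketch of such a configuration, reusing the pool settings mentioned earlier in this thread (the 5 second value is only an example):

import org.eclipse.jetty.util.thread.QueuedThreadPool;

// Sketch only: the pool configured earlier in this issue, but with a much more
// aggressive thread idle timeout so idle threads expire sooner.
QueuedThreadPool threadPool = new QueuedThreadPool(2048, 8);
threadPool.setName("jetty");
threadPool.setDaemon(true);
threadPool.setIdleTimeout(5_000); // 5 seconds, in milliseconds (the default is 60_000)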

@gregw
Contributor

gregw commented Sep 29, 2019

Note that the default state of the QueuedThreadPool is that our idle threads are all blocked in a call to poll(long time, TimeUnit unit) on a default instance of our BlockingArrayQueue, which uses an unfair head lock.

The issue may be that on your JVM/OS the implementation of the lock is actually being fair (as it is allowed to be), waking up the longest-waiting thread... thus effectively touching the most idle thread and preventing shrinkage.

If this is the case, there is not much we can do about it, other than introducing a slower thread pool that implements a stack of idle threads... or perhaps we can implement our own lock that is deliberately unfair and always wakes up the most recent waiter?

But for now, the main question we need to answer is: were all 3700 threads actually non-idle at the same time? Did you have enough load that 3699 threads were busy and starting the 3700th was the correct thing to do?

@gregw
Contributor

gregw commented Sep 29, 2019

Ignore me about unfair ReentrantLocks. It is the Condition implementation that we really want to be unfair... investigating how easily that can be done...

@gregw
Contributor

gregw commented Sep 30, 2019

@AdrianFarmadin if you'd like to be experimental, perhaps you could try running with the following extension to QueuedThreadPool:

import java.util.concurrent.TimeUnit;

import org.eclipse.jetty.util.thread.QueuedThreadPool;
import org.eclipse.jetty.util.thread.ReservedThreadExecutor;

public class ReservedQueuedThreadPool extends QueuedThreadPool
{
    public ReservedQueuedThreadPool(int maxThreads, int minThreads)
    {
        super(maxThreads, minThreads);
        // Hack: size the nested ReservedThreadExecutor to half of maxThreads.
        setReservedThreads(maxThreads / 2);
    }

    @Override
    public void execute(Runnable job)
    {
        // Offer normal jobs to the reserved (most recently active) threads first,
        // but never re-offer the ReservedThread runnables themselves.
        if (!job.getClass().getSimpleName().equals("ReservedThread") && tryExecute(job))
            return;
        super.execute(job);
    }

    @Override
    public void setIdleTimeout(int idleTimeout)
    {
        // Keep the reserved executor's idle timeout in sync with the pool's.
        ReservedThreadExecutor reserved = getBean(ReservedThreadExecutor.class);
        if (reserved != null)
            reserved.setIdleTimeout(idleTimeout, TimeUnit.MILLISECONDS);
        super.setIdleTimeout(idleTimeout);
    }
}

The idea here is that this is a QueuedThreadPool that first offers tasks to its nested ReservedThreadExecutor (sized to half of maxThreads), which does have the semantic of offering jobs to the most recently active thread. Queued jobs are offered first to the "idle" threads in the core pool, and only if there are no threads reserved there are they executed normally.

"idle" threads will first become idle reserved threads, then normal idle waiting for jobs on the real job queue.

Note that the simpleName test and the half-maxThreads size are just hacks to test the concept of giving priority to recently active threads to help shrink the thread pool. It may also be beneficial because recently active threads are more likely to be running on CPU cores with hot code/data caches.
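If you try it, a usage sketch might look like this (the sizes and the 5 second idle timeout are illustrative only):

import org.eclipse.jetty.server.Server;

// Sketch only: plugging the experimental ReservedQueuedThreadPool defined above
// into a Server, with an aggressive idle timeout for the experiment.
ReservedQueuedThreadPool pool = new ReservedQueuedThreadPool(4096, 8);
pool.setName("jetty");
pool.setDaemon(true);
pool.setIdleTimeout(5_000);

Server server = new Server(pool);
// ... add connectors and handlers as usual, then server.start() ...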

@sbordet thoughts?

@AdrianFarmadin
Author

I did some more testing based on the comments above. All tests were done with the same load and hardware.

Baseline without changes:
[chart: noChange]

@AdrianFarmadin so if you set an aggressive idle timeout for threads (e.g. 5 seconds), do you see the thread count constrained to a lower number?

Fix from PR #4118 + 5 seconds idle timeout for threads
[chart: patch5s]

@AdrianFarmadin if you'd like to be experimental, perhaps you could try running with the following extension to QueuedThreadPool:

Fix from PR #4118 + 5 seconds idle timeout for threads + ReservedThreadExecutor
[chart: patch5s+ReservedQueuedThreadPool]

But for now, the main question we need to answer is: were all 3700 threads actually non-idle at the same time? Did you have enough load that 3699 threads were busy and starting the 3700th was the correct thing to do?

In version 9.4.7.v20170914 there were nearly always 75 active threads.
[chart: September03-9.4.7.v20170914]

After the update to 9.4.20.v20190813, the number of active threads increased to 186, with some spikes up to nearly 1000.
[chart: September28-9.4.20.v20190813]

The fix from PR #4118 had no influence on the count of active threads.
[chart: September28-patch]

In the test for PR #4118 + 5 seconds idle timeout for threads + ReservedThreadExecutor, the number of active threads was back to 75.
[chart: ReservedQueuedThreadPool]

@gregw
Contributor

gregw commented Sep 30, 2019

Thanks for the detailed info.

@sbordet interesting that without the reserved thread executor the OP is still seeing spikes in active threads, but that the hack removes those spikes.

That makes me think that on this machine/load, the QTP is still favouring new or long-term idle threads rather than recently active threads.

I think there could be benefit in using reserved threads for normal dispatch to try to favour recently active threads. This can be done as per the code snippet above (an extra reserved executor), or I have a branch where we can use the existing reserved threads on the QTP. I'll push that later today.

@AdrianFarmadin
Author

@sbordet interesting that without the reserved thread executor the OP is still seeing spikes in active threads, but that the hack removes those spikes.

The spikes are removed only with a 5 second idle timeout. If I use a 1 minute idle timeout, the spikes occur again and the active thread count is around 100. The spikes with ReservedQueuedThreadPool are not as frequent as with QueuedThreadPool.

Test for PR #4118 + 1 minute idle timeout + ReservedThreadExecutor
[chart: ReservedQueuedThreadPool-1m-idle-timeout]

@sbordet
Contributor

sbordet commented Oct 1, 2019

@AdrianFarmadin in your comment with the various configurations and their graphs, there is one where you say "Fix from PR #4118 + 5 seconds idle timeout for threads + ReservedThreadExecutor" and the graph shows about 800-1000 threads (third graph from the top).
But at the end of the comment you say "In the test for PR #4118 + 5 seconds idle timeout for threads + ReservedThreadExecutor, the number of active threads was back to 75" (last graph).

The 2 graphs are quite different. What am I missing?

Do you have a reproducer that we can try ourselves?
Or at least a detailed description of the load you have on your system?

@sbordet
Contributor

sbordet commented Oct 1, 2019

@gregw I don't know. BlockingArrayQueue has not changed between 9.4.7 and 9.4.20.
The scheme is that all idle threads are polling the queue which uses an unfair j.u.c.l.ReentrantLock.

Having said that, the Runnable used by the QTP has changed.
In 9.4.7 we were starting threads from there, rather than from execute() like we do now, so there may be a timing issue, but I'd be surprised if it resulted in such precise fairness.

@AdrianFarmadin how do you measure "Active Threads"? Is there any possibility that the value from the 2 different Jetty versions is not measuring the same quantity?
Do you have a measure of the max queue size?

Also, what kind of traffic do you have? I ask because in 9.4.9 we changed the way we create and destroy EndPoints (submitted to the thread pool rather than just run).
So if you have a high churn of connection opening/closing, you will have more tasks submitted at a high rate, causing more threads, etc.
HTTP/1.0 traffic would cause this high connection churn.

@gregw
Contributor

gregw commented Oct 1, 2019

@sbordet I think the difference is that the first few graphs are for total threads (including idle), the last few graphs are for active threads only.

@gregw gregw closed this as completed Oct 2, 2019
@joakime joakime added this to To do in Jetty 9.4.22 via automation Oct 2, 2019
@joakime joakime moved this from To do to Done in Jetty 9.4.22 Oct 2, 2019
@joakime joakime added the Bug For general bugs on Jetty side label Oct 2, 2019
@joakime joakime changed the title JVM threads increase after update from 9.4.7.v20170914 to 9.4.20.v20190813 QueuedThreadPool does not expire threads as it once used to Oct 3, 2019
@joakime joakime changed the title QueuedThreadPool does not expire threads as it once used to QueuedThreadPool should reuse oldest threads first, to allow idle threads to expire Oct 3, 2019
@AdrianFarmadin
Author

I tested the fix in #4146 over a longer timeframe and the results are still different from 9.4.7.v20170914. I also found that the regression started in jetty-9.4.8.v20171121.

[charts: JVM-thread-count, active-threads]

@gregw
Contributor

gregw commented Oct 8, 2019

So the fix has remedied the idle thread problem, as we see idle thread decay for both timeout settings. The difference is not so much in idle threads; rather, from 9.4.8 there are spikes in active threads that are not seen in 9.4.7. The bug in 9.4.20 was that we were racing when creating new threads, but those were not active threads, and I don't think we see that bug in the active thread spikes.

So the question is: are these active thread spikes real or not? Are the spikes due to a change in the way we offer load to the QTP, or a change in the way the QTP handles load? If we have changed the QTP so that an active thread takes more time or is excluded from becoming idle, that could cause more threads to be created. Or perhaps we are just offering more load?

Issues from 9.4.8 (a big release) that we need to examine include:

These look like changes that may affect QTP behaviour, the offered load, and the fairness/behaviour of contended locks on the QTP. We will need to research.

@AdrianFarmadin can you confirm or deny whether you are seeing any quality-of-service changes? Are requests still being handled with similar latency? Is max throughput about the same?

@sbordet I have reopened this issue for now... but I think this probably deserves a new issue... I thought we'd do some research first before deciding.

@gregw gregw reopened this Oct 8, 2019
Jetty 9.4.22 automation moved this from Done to In progress Oct 8, 2019
@gregw gregw changed the title QueuedThreadPool should reuse oldest threads first, to allow idle threads to expire QueuedThreadPool increased thread usage and no idle thread decay Oct 8, 2019
@AdrianFarmadin
Author

@AdrianFarmadin can you confirm or deny whether you are seeing any quality-of-service changes? Are requests still being handled with similar latency? Is max throughput about the same?

Are there any built-in Jetty metrics I can use?

@gregw
Contributor

gregw commented Oct 8, 2019

@AdrianFarmadin The built-in Jetty metrics are from the StatsHandler, but they measure the latency within handlers and thus exclude some of the vital accepting/selecting/scheduling/flushing parts of QoS that can really only be measured externally. Unfortunately, our own external CI load test results were only recorded permanently from 9.4.8 onwards.
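For reference, a minimal sketch of wiring the statistics handler in (the class is org.eclipse.jetty.server.handler.StatisticsHandler; the port and the DefaultHandler below are stand-ins for your real setup):

import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.server.handler.DefaultHandler;
import org.eclipse.jetty.server.handler.StatisticsHandler;

public class StatsHandlerExample
{
    public static void main(String[] args) throws Exception
    {
        Server server = new Server(8080);

        // Wrap the application's handler tree (DefaultHandler here as a stand-in)
        // so request counts, dispatch counts and latencies are recorded. They can
        // be read from the StatisticsHandler getters, or via JMX if an MBeanContainer
        // is registered with the server.
        StatisticsHandler stats = new StatisticsHandler();
        stats.setHandler(new DefaultHandler());
        server.setHandler(stats);

        server.start();
        server.join();
    }
}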

@olamy can you run our load tests at several rates of load so we can compare 9.4.7 and 9.4.8, just to make sure there is no obvious change in capacity? If possible, could you also measure idle threads and active threads during those tests, looking for any spikes that match what @AdrianFarmadin is seeing.

My current thinking is that the changes we made in 9.4.8 were a result of a client pushing the server to extreme loads, where we were suffering from live locks and unfairness issues at very high request rates. I believe we had to make some scheduling changes that may favour executing jobs over a single thread iterating on them. This could explain the observed spikes in used threads.

It would also be really valuable if you could collect information from the reserved thread pool as well. In 9.4.22 that is the JMX object org.eclipse.jetty.util.thread:type=reservedthreadexecutor,id=0, and the available, capacity and pending threads there would be good information to have.
In 9.4.7 the corresponding reserved thread pool is per connector, so it is probably named org.eclipse.jetty.util.thread:context=HTTP/1.1@6f8e8894,type=reservedthreadexecutor,id=0 for HTTP/1.1.
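If it helps, a sketch of reading those values from the platform MBean server, run inside the server JVM (the object name is the 9.4.22 one quoted above; the attribute names are assumed from the values mentioned and may need adjusting to whatever your JMX tree actually shows):

import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class ReservedThreadExecutorJmxDump
{
    public static void main(String[] args) throws Exception
    {
        // Platform MBean server: this only sees the Jetty MBeans when run in the
        // same JVM as the server (or adapt it to a remote JMX connection).
        MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();

        // Object name per the 9.4.22 naming quoted above.
        ObjectName name = new ObjectName("org.eclipse.jetty.util.thread:type=reservedthreadexecutor,id=0");

        // Attribute names are an assumption based on the available/capacity/pending
        // values mentioned above; check the exact names in your JMX console.
        System.out.println("available = " + mbs.getAttribute(name, "available"));
        System.out.println("capacity  = " + mbs.getAttribute(name, "capacity"));
        System.out.println("pending   = " + mbs.getAttribute(name, "pending"));
    }
}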

Failing that... or perhaps anyway, I think we should write a special version of the QTP that detects spikes of new thread creation and records who scheduled the job that caused it and why. That will tell us a lot about why you are seeing these spikes.

@gregw
Contributor

gregw commented Oct 8, 2019

@AdrianFarmadin is your testing using persistent connections? One big difference we introduced in 9.4.8 is that we now execute (dispatch to the thread pool) the creation of new connections and the closing of old ones. Previously we did this in the selector thread, but that had issues as it can take some time... and for some protocols it can call user code. Thus it is plausible that the active thread spikes are related to connection spikes that were previously handled by the selector threads.

I think the next step is definitely an instrumented QTP to record who starts threads and why, so we can confirm or deny this theory. This will take me a day or two.

If it is this, then I do not believe it is a problem. Running more threads is still below the total memory commitment represented by maxThreads, and it will give you a fairer, less DoS-vulnerable server.

@olamy
Member

olamy commented Oct 8, 2019

@gregw tests are running now to collect data.
@AdrianFarmadin out of curiosity, what tool are you using to generate such graphs, and how do you collect the information?

@AdrianFarmadin
Author

Regarding QoS metrics, I only have statistics about Jetty from self-monitoring.

[screenshot: Statistics]

The reservedthreadexecutor Capacity in JMX is the same (64) for all versions.
[charts: Available, Pending]
I'll monitor whether the spikes in pending threads reappear.

is your testing using persistent connections?

I need to check this.

out of curiosity, what tool are you using to generate such graphs, and how do you collect the information?

I'm using Dynatrace for metric collection and graph generation. All metrics are exposed via JMX.

@AdrianFarmadin
Author

AdrianFarmadin commented Oct 11, 2019

is your testing using persistent connections?

Yes, all connections are persistent.

@gregw
Contributor

gregw commented Oct 13, 2019

@AdrianFarmadin
Thanks for that extra information. The two stats results are interesting, in both their similarities and their differences.

Is this live traffic or unit testing?

The mean request time is longer in 9.4.x, at 333ms vs 259ms. Could that be natural variation in the load/processing, or is it something repeated on all runs?

Another big difference is the max open connections: 28211 vs 17972. That means that at one point 9.4.x had to deal with 10k more open connections, and therefore 2*10k more dispatched jobs, which is plausibly the source of the active thread spikes.

I think the total connections stat is wrong - it looks like the current connections. Is that an error in your stats collection (wrong jmx name) or is our stats collection not working for you?

@olamy
Member

olamy commented Oct 14, 2019

@gregw argh, I just figured out the error I have with our load testing and 9.4.7. Our load test is based on #1807, which was introduced with 9.4.8. I need to find a way to ignore this with 9.4.7.

@gregw gregw moved this from In progress to Review in progress in Jetty 9.4.22 Oct 15, 2019
@AdrianFarmadin
Author

Is this live traffic or unit testing?

Live traffic. The servers and traffic generators are running on separate hosts. All machines are bare-metal servers on the same network.

The mean request time is longer in 9.4.x, at 333ms vs 259ms. Could that be natural variation in the load/processing, or is it something repeated on all runs?

It seems to be variation. After a couple of hours the difference is a lot smaller.

Statistics for 4 days
[screenshot: Statistics4days]

Statistics for 20h after a refresh (all values reset to zero; no server restart):
[screenshot: Statistics20h]

I think the total connections stat is wrong - it looks like the current connections. Is that an error in your stats collection (wrong jmx name) or is our stats collection not working for you?

It's a wrong jmx name. Thank you for pointing it out.

@sbordet
Contributor

sbordet commented Oct 21, 2019

What stands out is the number of concurrent dispatches, 1558 vs 180 and 555 vs 139.
Along with that, the max dispatch time, 2835 vs 7766 and 705 vs 5969.

The server seems to run at ~4000 requests/s, and 9.4.x seems to be a lot more concurrent, with less latency, than 9.4.7. That could explain the additional thread usage.

@gregw thoughts?

@joakime joakime added this to To do in Jetty 9.4.23 via automation Oct 22, 2019
@joakime joakime removed this from To do in Jetty 9.4.23 Oct 22, 2019
@joakime joakime moved this from Review in progress to In progress in Jetty 9.4.22 Oct 23, 2019
@joakime joakime added this to To do in Jetty 9.4.23 via automation Oct 23, 2019
@joakime joakime moved this from To do to In progress in Jetty 9.4.23 Oct 23, 2019
@gregw
Contributor

gregw commented Nov 14, 2019

We dispatch more, so we should expect more threads.
Performance is better and we have fixed our shrinking and other issues. I think we are good.

@gregw gregw closed this as completed Nov 14, 2019
Jetty 9.4.23 automation moved this from In progress to Done Nov 14, 2019