Eliminate "Stun4J Message Processor" by replacing it with thread pool. #155

Merged
merged 38 commits into from Nov 29, 2018

@mstyura
Contributor

mstyura commented Nov 4, 2018

It was observed in a test environment that ice4j creates too many "Stun4J Message Processor" threads.

With C conferences of P peers each, C * P * 3 (the default value) threads were created under the hood to decode and process STUN messages.
So with 30 conferences of 3 peers each, 270 threads were created just to process STUN messages.

All these threads are replaced with a single ForkJoinPool.

The Java version was also updated to 1.8, because I've used some of the 1.8 API to reduce thread usage. The rest of Jitsi's repositories seem to have already migrated to 1.8, so I hope it is not a problem that I've updated the version for ice4j.
Permission to migrate to Java 8 has been confirmed by Ingo Bauersachs.

@mstyura


Contributor

mstyura commented Nov 4, 2018

Similar tables to those provided in pull request #154 (the difference is visible when compared against the #154 tables).

org.ice4j.TERMINATION_DELAY = 3 seconds, ice4j with PRs 150, 151, 152, 153, 154 (not merged yet), 155 (this PR)

Name Value Comment
Computer: 3 GHz Intel Core i7 mac mini, osx 10.14
JVB version: 1.1-20180615.220358-69
ice4j version: 2.0.1-SNAPSHOT # with PR 150, 151, 152, 153, 154 (not merged yet), 155 (this PR)
org.ice4j.TERMINATION_DELAY, ms 3000
Conf per server: 30
Peers per conf: 3
Est. RTP streams: 270
CPU load: 97-100%
Active threads: 1173
Network IN: ~2838 KBit/s
Network OUT: ~5689 KBit/s
Average RTT: ~739-892 ms
ICE connection stability: stable

org.ice4j.TERMINATION_DELAY = 1 week, ice4j with PRs 150, 151, 152, 153, 154 (not merged yet), 155 (this PR)

Name Value Comment
Computer: 3 GHz Intel Core i7 mac mini, osx 10.14
JVB version: 1.1-20180615.220358-69
ice4j version: 2.0.1-SNAPSHOT # with PR 150, 151, 152, 153, 154 (not merged yet), 155 (this PR)
org.ice4j.TERMINATION_DELAY, ms 604800000
Conf per server: 30
Peers per conf: 3
Est. RTP streams: 270
CPU load: 98-100%
Active threads: 1177
Network IN: ~2659 KBit/s
Network OUT: ~5346 KBit/s
Average RTT: ~946-1040 ms
ICE connection stability: stable
@mstyura


Contributor

mstyura commented Nov 4, 2018

@bbaldino could you please review these changes?

@mstyura


Contributor

mstyura commented Nov 5, 2018

For the sake of comparison, adding tables for a smaller load where CPU usage does not even come close to 100%. The tables show that reducing the number of threads decreases CPU load even when it is not close to 100%.

20 conf, 3 peers each, no fixes, default termination delay = 3 sec (borrowed from #154)

Name Value Comment
Computer: 3 GHz Intel Core i7 mac mini, osx 10.14
JVB version: 1.1-20180615.220358-69
ice4j version: 2.0.0-20181005.221549-11 # without my fixes
org.ice4j.TERMINATION_DELAY, ms 3000
Conf per server: 20
Peers per conf: 3
Est. RTP streams: 180
CPU load: 66-76%
Active threads: 1094
Network IN: ~2443 KBit/s
Network OUT: ~4896 KBit/s
Average RTT: ~5-150 ms
ICE connection stability: stable # by "stable" I mean that my test client does not report a temporary ICE connection state change to "disconnected" during the call.

20 conf, 3 peers each, PR 150, 151, 152, 153, 154 (not merged yet), 155 (this PR), default termination delay = 3 sec

Name Value Comment
Computer: 3 GHz Intel Core i7 mac mini, osx 10.14
JVB version: 1.1-20180615.220358-69
ice4j version: 2.0.1-SNAPSHOT # with PR 150, 151, 152, 153, 154 (not merged yet), 155 (this PR)
org.ice4j.TERMINATION_DELAY, ms 3000
Conf per server: 20
Peers per conf: 3
Est. RTP streams: 180
CPU load: 51-61%
Active threads: 799
Network IN: ~ 2409 KBit/s
Network OUT: ~ 4832 KBit/s
Average RTT: ~10-55 ms
ICE connection stability: stable

20 conf, 3 peers each, PR 150, 151, 152, 153, 154 (not merged yet), 155 (this PR), termination delay = 1 week

Name Value Comment
Computer: 3 GHz Intel Core i7 mac mini, osx 10.14
JVB version: 1.1-20180615.220358-69
ice4j version: 2.0.1-SNAPSHOT # with PR 150, 151, 152, 153, 154 (not merged yet), 155 (this PR)
org.ice4j.TERMINATION_DELAY, ms 604800000
Conf per server: 20
Peers per conf: 3
Est. RTP streams: 180
CPU load: 50-64%
Active threads: 789
Network IN: ~ 2431 KBit/s
Network OUT: ~ 4878 KBit/s
Average RTT: ~3-28 ms
ICE connection stability: stable
@mstyura


Contributor

mstyura commented Nov 5, 2018

Jenkins, test this please

@mstyura


Contributor

mstyura commented Nov 5, 2018

@bbaldino is there anything I could do to get this (#155) and #154 merged sooner? Maybe some comments/explanations from me are needed?

@mstyura


Contributor

mstyura commented Nov 6, 2018

Adding a thread histogram for 20 conferences with 3 peers each, with both #154 & #155 applied to the current master; single-instance threads are omitted.

org.jitsi.impl.neomedia.RTPConnectorInputStream.receiveThread -> 120
org.jitsi.impl.neomedia.RTPConnectorOutputStream$Queue.sendThread -> 120
org.jitsi.videobridge.SctpConnection-pool-12-thread -> 61
CsrcAudioLevelDispatcher-pool-14-thread -> 60
IceConnector@ -> 60
MergingDatagramSocket reader thread for -> 60
org.jitsi.util.RawPacketQueue-SctpConnection -> 60
org.jitsi.impl.neomedia.rtp.translator.OutputDataStreamImpl -> 40
org.jitsi.impl.neomedia.rtp.translator.PushSourceStreamImpl -> 40
ConferenceSpeechActivity-pool-9-thread -> 20
RTCP Forwarder: org.jitsi.impl.neomedia.rtp.translator.RTPConnectorImpl@ -> 20
RTCP Reporter -> 20
RTP Forwarder: org.jitsi.impl.neomedia.rtp.translator.RTPConnectorImpl@ -> 20
RTPEventHandler -> 20
SSRC Cache Cleaner -> 20
GC task thread -> 8
ice4j.Agent-pool-10-thread -> 8
ice4j.StunClientTransaction-pool-13-thread -> 8
C2 CompilerThread -> 3
ForkJoinPool-1-worker -> 3
org.ice4j.ice.harvest.AbstractUdpListener thread for -> 3
org.jitsi.util.concurrent.RecurringRunnableExecutor.thread -> 3
org.jitsi.impl.osgi.framework.AsyncExecutor -> 2
@saghul


Member

saghul commented Nov 6, 2018

is there anything I could do to get this (#155) and #154 merged sooner? Maybe some comments/explanations from me are needed?

@mstyura Please be patient. We are currently transitioning to a new home and it may take us a bit longer than usual to handle issues / PRs until things settle. It should just be a few weeks, but be warned.

Cheers,

@bbaldino


Member

bbaldino commented Nov 12, 2018

@mstyura I am off vacation now and starting to get back into things, so I will try to take a look at this. I worry about the Java version change though; @damencho @bgrozev, are we OK to change the Java version in ice4j?

@bbaldino

Some thoughts at a micro level from a first pass. I also still need to check on the Java version change.

* @throws IllegalArgumentException if any of the mentioned properties of
* <tt>netAccessManager</tt> are <tt>null</tt>
*/
MessageProcessor(NetAccessManager netAccessManager)
MessageProcessor(


@bbaldino

bbaldino Nov 12, 2018

Member

So previously we had an instance of this which would repeatedly read from a message queue, but now we create a new instance for each message?


@mstyura

mstyura Nov 13, 2018

Contributor

@bbaldino yes, the meaning of the class has slightly changed: previously it was a started thread that read from a queue. Now this class represents a kind of "job" which is put into a queue along with the message that needs to be processed. I used an ExecutorService to do the queued processing, but it needs a Runnable to put into the ExecutorService queue, so I've reused MessageProcessor since it was already a Runnable.
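
To illustrate (a minimal sketch with hypothetical names, not the actual ice4j classes): the long-running reader loop becomes a per-message Runnable job handed to an executor, roughly like this:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: a per-message "job" instead of a long-running reader thread.
final class MessageJob implements Runnable
{
    private final byte[] rawMessage; // stands in for ice4j's RawMessage

    MessageJob(byte[] rawMessage)
    {
        this.rawMessage = rawMessage;
    }

    @Override
    public void run()
    {
        // decode and dispatch the STUN message held in rawMessage
    }
}

final class MessageDispatcher
{
    // ForkJoinPool-style executor (illustrative; the PR later moved to a fixed-size pool).
    private final ExecutorService executor = Executors.newWorkStealingPool();

    // Called by the socket-reading code for every received datagram.
    void onMessageReceived(byte[] rawMessage)
    {
        executor.execute(new MessageJob(rawMessage));
    }
}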


@bbaldino

bbaldino Nov 13, 2018

Member

Gotcha, will this be done for every packet though? Have you noticed if creating one of these per packet has affected performance at all?


@mstyura

mstyura Nov 14, 2018

Contributor

Gotcha, will this be done for every packet though?

Yes, for now in this pull request a MessageProcessor is created per RawMessage. Essentially MessageProcessor is just a Runnable wrapping a RawMessage, making it possible to put RawMessage processing into the ExecutorService's queue.

Have you noticed if creating one of these per packet has affected performance at all?

I did not notice a performance issue due to the increased allocation count for MessageProcessor.
What I did notice is a performance gain due to the decreased number of threads; see the tables below, which compare the version without this pull request and the version with it for 20 audio-only conferences with 3 peers each.

baseline

Name Value Comment
Computer: 3 GHz Intel Core i7 mac mini, osx 10.14
JVB version: 1.1-20180615.220358-69
ice4j version: 2.0.0-20181105.145933-16 # with PR 154 (not merged yet)
org.ice4j.TERMINATION_DELAY, ms 604800000
Conf per server: 20
Peers per conf: 3
Est. RTP streams: 180
CPU load: 58-68 %
Active threads: 974
Network IN: ~ 2538 KBit/s
Network OUT: ~ 4892 KBit/s
Average RTT: ~3-20 ms
ICE connection stability: stable

baseline + PR 155

Name Value Comment
Computer: 3 GHz Intel Core i7 mac mini, osx 10.14
JVB version: 1.1-20180615.220358-69
ice4j version: 2.0.0-20181105.145933-16 # with PR 154 (not merged yet), 155 (this PR)
org.ice4j.TERMINATION_DELAY, ms 604800000
Conf per server: 20
Peers per conf: 3
Est. RTP streams: 180
CPU load: 50-64% # notice ~4% drop here
Active threads: 789
Network IN: ~ 2431 KBit/s
Network OUT: ~ 4878 KBit/s
Average RTT: ~3-28 ms
ICE connection stability: stable

Actually, short-lived allocations like MessageProcessor are cheap both to allocate and to collect, since a MessageProcessor object is short-lived and is unlikely to survive the young generation, which is cheap to collect.

Since you're worried about allocation, I'll add a pool of MessageProcessor objects per NetAccessManager, so allocations will go down, but MessageProcessor objects will then probably end up living in older generations eventually. This might have an impact, but I'm not sure how hard it is to measure.
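
For illustration, such pooling could look roughly like the sketch below (hypothetical class names, not necessarily how the branch implements it): jobs are recycled through a ConcurrentLinkedQueue instead of being allocated per packet.

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch of pooling message-processing jobs to avoid per-packet allocation.
final class MessageJobPool
{
    private final Queue<PooledMessageJob> pool = new ConcurrentLinkedQueue<>();

    PooledMessageJob acquire(byte[] rawMessage)
    {
        PooledMessageJob job = pool.poll();
        if (job == null)
        {
            job = new PooledMessageJob(this); // grow the pool on demand
        }
        job.rawMessage = rawMessage;
        return job;
    }

    void release(PooledMessageJob job)
    {
        job.rawMessage = null; // drop the reference so the message can be collected
        pool.offer(job);
    }
}

final class PooledMessageJob implements Runnable
{
    private final MessageJobPool owner;
    byte[] rawMessage; // stands in for ice4j's RawMessage

    PooledMessageJob(MessageJobPool owner)
    {
        this.owner = owner;
    }

    @Override
    public void run()
    {
        try
        {
            // decode and dispatch the STUN message here
        }
        finally
        {
            owner.release(this); // return the job to the pool for reuse
        }
    }
}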


@mstyura

mstyura Nov 14, 2018

Contributor

Since you're worried about allocation, I'll add a pool of MessageProcessor objects per NetAccessManager, so allocations will go down, but MessageProcessor objects will then probably end up living in older generations eventually. This might have an impact, but I'm not sure how hard it is to measure.

@bbaldino I've added pooling of MessageProcessor into a separate branch on my fork to estimate how complex it would be to implement.
If you think the pool is worth it despite the slightly increased complexity, I'll update the current PR to include the changes.
Thank you very much for the review and comments!


@mstyura

mstyura Nov 14, 2018

Contributor

In case I'm lucky enough today 😸 and the pull request is acceptable, but pooling is decided to be included and I'm not responding due to my timezone, feel free to merge the branch in my repository, and this pull request will be updated to include pooling as well.


@mstyura

mstyura Nov 15, 2018

Contributor

After all the changes I like the version with pooling more, so I'm adding it by default in this PR.

* but underlying transport (UDP) also does not have such guarantee,
* so it is ok when some of the messages will be processed out of
* arrival (enqueuing) order, but faster.
*/


@bbaldino

bbaldino Nov 12, 2018

Member

While this is certainly true, I'd have to think about whether or not we want to introduce a place where packets could be reordered. From what I've seen, a single thread is capable of reading a very high bitrate off of a socket, so depending on where this is in the flow (I'm afraid I don't exactly know), I'm not sure the tradeoff of introducing potential reordering makes sense.


@mstyura

mstyura Nov 13, 2018

Contributor

@bbaldino yes, I agree that this needs to be considered. There is an option to use a different executor service (for example Executors.newFixedThreadPool), which has one queue for all processing threads, instead of an ExecutorService that has a separate queue per thread. But conceptually, even with a single queue, imagine pkt1 and pkt2 are enqueued: thread 1 takes pkt1 from the queue and then thread 2 immediately takes pkt2; if the OS kernel then decides to suspend thread 1 but continues executing thread 2, pkt2 may well be handled before pkt1 even though pkt1 was popped from the queue first. So the only way to have truly sequential, in-order execution is to have a single thread. But previously there were already 3 threads handling a single queue, so it was already possible for items to be put into the queue in one order and processed in a different order, and back then there were a lot more threads that the OS kernel was switching between much more aggressively.
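
A tiny illustration of that point (hypothetical demo code, not part of this PR): even with a single shared queue and submission order preserved, two worker threads give no guarantee about completion order.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Both tasks enter one shared queue in order, but with two worker threads
// nothing guarantees that pkt1 finishes before pkt2.
public class ReorderingDemo
{
    public static void main(String[] args)
    {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        pool.execute(() -> System.out.println("processing pkt1"));
        pool.execute(() -> System.out.println("processing pkt2")); // may complete first
        pool.shutdown();
    }
}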


@bbaldino

bbaldino Nov 13, 2018

Member

Understood, but did we have multiple threads reading from this queue before? I thought it was only a single thread (MessageProcessor had a queue which it read from a single thread, no?)


@mstyura

mstyura Nov 14, 2018

Contributor

Understood, but did we have multiple threads reading from this queue before?

@bbaldino Yes, there was a single queue per connected peer. Actually, the queue was created per NetAccessManager, which is created per StunStack; a StunStack is created per Agent, and an Agent is created for each peer connected to the JVB.
By default 3 MessageProcessors (which were basically custom Thread implementations) were created to process RawMessages from a single queue.
This results in an extremely large number of threads created to process STUN messages. For 30 audio-only conferences with 3 peers each, 270 threads were created to process STUN messages, even though the physical machine may only have, for example, 4 cores, and there is no reason to create more threads than cores for computational work like decoding STUN messages.

I thought it was only a single thread (MessageProcessor had a queue which it read from a single thread, no?)

Actually no; even though it was possible to set the default pool size to 1 by replacing the DEFAULT_THREAD_POOL_SIZE constant, that would still create 1 thread per connected peer. So for 30 conferences with 3 peers each it would result in 90 threads on, for example, a 4-core machine.

A thread allocation strategy where the thread count is linear in the number of connected peers does not scale well.
The number of CPU cores is fixed, and when too many threads are created it puts extra pressure on the OS kernel scheduler as well as on the JVM itself, which slows down the performance of the system in general.
In all my PRs I'm reducing the number of threads by putting more work on existing threads and by avoiding blocking & waiting inside the tasks executed on those threads.


@bbaldino

bbaldino Nov 28, 2018

Member

Where did we end up on this? Do we still have the multiple threads? A couple of places you mentioned are per X, but we only ever have one instance of X per connection (one peer connection and everything is mux'd).


@mstyura

mstyura Nov 29, 2018

Contributor

do we still have the multiple threads?

Yes.
With the move from ForkJoinPool to a regular pool there is again a single queue of tasks to execute, but multiple (a fixed count of) executor threads.
The processing threads are managed by the pool, and their number is equal to the number of CPU cores.

a couple places you mentioned are per X, but we only ever have one instance of X per connection (one peer connection and everything is mux'd).

The problem with the old code was that the number of processing threads was linear in the number of peers connected to the JVB; it was actually 3 threads for each peer.
The number of processing threads is now fixed and equal to the number of CPU cores.
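
In other words, the shared pool is conceptually something like the sketch below (illustrative only; the actual field and class names in ice4j may differ):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// One executor shared by all NetAccessManager instances: a single task queue
// and a fixed number of worker threads equal to the CPU core count.
final class StunMessageExecutor
{
    static final ExecutorService INSTANCE =
        Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
}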

@mstyura


Contributor

mstyura commented Nov 13, 2018

@mstyura I am off vacation now and starting to get back into things, so I will try to take a look at this. I worry about the Java version change though; @damencho @bgrozev, are we OK to change the Java version in ice4j?

When I did #157 I also needed JDK 8 classes, which made it problematic to compile locally, but the build server was actually able to compile without the JDK version update in the pom file. Maybe the version update is not necessary and IntelliJ IDEA confused me, but I've checked videobridge and libjitsi - they specify Java 8 in their pom files.

@mstyura


Contributor

mstyura commented Nov 13, 2018

BTW, @bbaldino, may I ask which timezone you live in? I'm on UTC+3. Knowing when I can reach you here would help me adjust my working hours, so we could proceed more quickly with fixing my PRs.

@saghul


Member

saghul commented Nov 13, 2018

Brian is on PST, so I’m afraid the latency will be high. We do value your contributions so please don’t give up :-)

@bbaldino


Member

bbaldino commented Nov 13, 2018

@mstyura yes we do value this work a lot, thanks for the patience. Beyond the time zone, part of the reluctance just comes from the size of the changes--I think it's great work and in the right direction, but the lack of tests makes it a bit nerve-wracking to approve. Actually I wonder if you have developed any ice4j throughput/performance tests? If so, we could get those in and use the results as a baseline for these changes (it would also be great to be able to measure just how much things are improving). If not, would you be open to exploring implementing such a test?

I had done some throughput tests of the different socket types, etc a while back (some of the work I believe is here--I'd have to double check to see if everything made it into that repo)

@mstyura


Contributor

mstyura commented Nov 14, 2018

but the lack of tests makes it a bit nerve-wracking to approve.

@bbaldino Do you mean unit tests or performance tests for comparison?

Actually I wonder if you have developed any ice4j throughput/performance tests?

Yes, I have a Node.js script which uses Puppeteer to launch multiple headless Chromium instances running JavaScript that generates load for the JVB.
At a high level this script creates C conferences, connects P peers to each conference, and starts sending a specified audio file as voice into the conference.

If so, we could get those in and use the results as a baseline for these changes (it would also be great to be able to measure just how much things are improving).

I've provided performance measurements from my environment in the descriptions of, and comments on, the pull requests I've created.
I measure the version before the change (baseline) and after the change while I load the JVB with my script.
But of course the numbers can't be trusted from your point of view, because my test environment is closed, the script is closed, and I'm a random man from the Internet trying to change core parts of ice4j who has not earned his "trust" budget yet (I hope) :)
That is fine; I'll keep trying to demonstrate that I'm very, very interested in making ice4j and the JVB better and more performant, because who wants to pay for extra server instances when you could change very little code and achieve better performance?
And I really do not want to create an internal fork of Jitsi products to achieve my goal of making the JVB as performant as I need.
I'm trying hard to make sure an internal fork does not happen right now - my internal opponents argue that the changes are necessary right now and there is no time to push them into the upstream project.
I really believe all performance (and, of course, all other) enhancements should stay open.

So, back to the details of my test environment, which I'm not ready to open right now.
For local tests I ran the JVB on a relatively small 4-core mac mini, and Node.js with Chromium on a much faster Windows machine, so I could easily generate a load with audio-only conferences that the JVB can't handle on the mac mini. There is no bottleneck on the machine generating the load; the bottleneck is on the machine running the JVB.

I don't have a working solution to test video conferences now, because for the first phase of using the JVB I only need audio conferences from my server which uses the JVB.
As I've mentioned in other pull requests and here, I'm using the JVB with my custom REST API.
At a high level the API has 2 methods: one to allocate conferences on the JVB and a second to join a conference with an SDP offer generated by the client.

So my load-testing script relies on this API and is only compatible with my server, but in essence my server is not doing any work other than translating create/join requests with SDP into the corresponding method calls on Videobridge objects, which is mostly handleColibriConferenceIQ.

Basically I test only the JVB (not meet and the other software around the bridge) and only in audio-only mode.
This scenario is a bit different from how the JVB was tested before, but in my tests I don't come anywhere close to being bottlenecked on the network.
Audio-only conferences with low traffic (under ~2.5 MBit/s IN, ~5 MBit/s OUT) revealed that I have a huge CPU bottleneck inside the JVB.
My guess was that it happens due to an abnormal (from my point of view) number of threads being created (thousands of threads for a 4-core machine is way too many; for example, my 8-core Windows machine with 296 processes currently has only 3779 threads across all processes).

So I've started working on reducing the number of threads used in the JVB; the threads mostly originate from ice4j.
The first tests with TerminationThread revealed that just killing 60 threads which did nothing useful at all - the only work they did was being stuck in Object.wait - noticeably reduces CPU usage.
Duplicating the measurements here:

baseline

Name Value Comment
Computer: 3 GHz Intel Core i7 mac mini, osx 10.14
JVB version: 1.1-20180615.220358-69
ice4j version: 2.0.0-20181005.221549-11 # before my fixes
org.ice4j.TERMINATION_DELAY, ms 604800000
Conf per server: 20
Peers per conf: 3
Est. RTP streams: 180
CPU load: 94-99 %
Active threads: 1323
Network IN: ~ 2499 KBit/s
Network OUT: ~ 5010 KBit/s
Average RTT: 80-170 ms
ICE connection stability: stable

baseline + PR 150

Name Value Comment
Computer: 3 GHz Intel Core i7 mac mini, osx 10.14
JVB version: 1.1-20180615.220358-69
ice4j version: 2.0.0-20181024.160538-12 # with PR 150
org.ice4j.TERMINATION_DELAY, ms 604800000
Conf per server: 20
Peers per conf: 3
Est. RTP streams: 180
CPU load: 88-93% # please note noticeable drop here
Active threads: 1230
Network IN: ~ 2432 KBit/s
Network OUT: ~ 4906 KBit/s
Average RTT: 8-90 ms # for some reason it results in better RTT as well
ICE connection stability: stable

I had done some throughput tests of the different socket types, etc a while back (some of the work I believe is here--I'd have to double check to see if everything made it into that repo)

As I said above, in my environment with audio-only conferences the bottleneck is not the socket (at least not yet) - it is the CPU, whose resources are wasted on thread context switches and synchronization due to too many threads being created.
Regarding throughput, I've already noticed that 2 * K * C * P threads are created to handle socket datagrams when there are C conferences with P peers each having K ICE candidates.
I'm not yet ready to solve this (but it's on my list); for now I'm getting familiar with ice4j in general and solving the simpler cases.

I'm not against developing/adapting my load script specifically for the JVB to work via the standard REST API, but right now I'm really pressed for time (the JVB needs to be faster ASAP, without making an internal fork), so I just don't have time to do it at the moment.
But I could share more details about load testing with Puppeteer and headless Chromium with anyone interested in performing tests right now, rather than waiting until I have time to implement it specifically for the JVB's standard REST API.

@mstyura


Contributor

mstyura commented Nov 15, 2018

@paweldomas could you please have a look at this PR as well when you have time? Thank you very much in advance!

@mstyura


Contributor

mstyura commented Nov 20, 2018

@bbaldino if you have time, could you please have a look at this PR again? Since last time I've introduced pooling, so as not to add an extra allocation per queued RawMessage. Thanks a lot!!!

@mstyura


Contributor

mstyura commented Nov 29, 2018

could you maybe lay out a high level flow of packets now?

Sure.
How it was done before my changes:

  1. Each NetAccessManager had an instance field LinkedBlockingQueue<RawMessage> storing all incoming RawMessage messages, which were added by org.ice4j.stack.Connector.
  2. org.ice4j.stack.Connector is a thread created by NetAccessManager for each socket added to the NetAccessManager.
  3. Each NetAccessManager created 3 MessageProcessor threads to process messages from its single queue.
  4. Since there are N NetAccessManager instances for N peers, this resulted in 3 * N threads.

Here is how it's done now:

  1. Instead of passing a BlockingQueue<RawMessage> into org.ice4j.stack.Connector, each NetAccessManager now passes a java.util.function.Consumer<RawMessage>, so when a message is read by the Connector thread it is passed to the consumer callback instead of being added to a queue.
  2. The consumer callback is supplied by NetAccessManager. It wraps the RawMessage into a MessageProcessingTask and puts it into the executor's queue.
  3. There is a single executor service shared between all instances of NetAccessManager. The executor service has a fixed number of threads equal to the number of CPU cores.
  4. No matter how many peers are connected to the JVB, there is a fixed number of processing threads equal to the number of CPU cores (see the sketch below).
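
A condensed sketch of that handoff (simplified, with byte[] standing in for RawMessage and hypothetical names; not the literal ice4j code):

import java.util.concurrent.ExecutorService;
import java.util.function.Consumer;

// The Connector thread no longer appends to a per-NetAccessManager BlockingQueue;
// it invokes a Consumer supplied by NetAccessManager, which submits the work
// to the executor shared by all NetAccessManager instances.
final class NetAccessManagerSketch
{
    private final ExecutorService sharedExecutor; // shared, sized to CPU cores

    NetAccessManagerSketch(ExecutorService sharedExecutor)
    {
        this.sharedExecutor = sharedExecutor;
    }

    Consumer<byte[]> messageConsumer()
    {
        return rawMessage -> sharedExecutor.execute(() -> process(rawMessage));
    }

    private void process(byte[] rawMessage)
    {
        // decode the STUN message and hand it to the STUN stack
    }
}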
@mstyura


Contributor

mstyura commented Nov 29, 2018

Good day, @bbaldino!
I've fixed what you've asked for (and a bit more, after careful self-review) and answered the questions. Feel free to ask more. I think today I'll be online until 13:00 PST and will be able to quickly answer questions and make fixes if necessary. Thank you very much, Brian!

@bbaldino

OK, I think I follow the changes well enough. Have you done some load tests/before & after with video as well as audio? Still a bit paranoid about regressions.

Also, please add your name to the author section for the classes you made substantial changes to.

@mstyura


Contributor

mstyura commented Nov 29, 2018

Also, please add your name to the author section for the classes you made substantial changes to.

Done.

Have you done some load tests/before & after with video as well as audio?

I've done load tests with audio-only conferences on one of the first revisions of the PR. I'll re-do them with the latest version as well; it will be done within 1-2 hours. Unfortunately I only have infrastructure to perform load testing of audio-only conferences. Since the code I've touched only deals with STUN messages, AFAIK their number does not increase when a video stream is added (as long as audio and video are multiplexed on the same socket), so I believe the audio-only test accurately reflects the changes for video as well.

Still a bit paranoid about regressions.

I understand you. In case a regression is detected, the change could be undone or, even better, investigated and, if necessary, additional threads (but a fixed amount) added to the pool 😄

@bbaldino


Member

bbaldino commented Nov 29, 2018

Also, please add your name to the author section for the classes you made substantial changes to.

Done.

Have you done some load tests/before & after with video as well as audio?

I've done load tests with audio-only conferences on one of the first revisions of the PR. I'll re-do them with the latest version as well; it will be done within 1-2 hours. Unfortunately I only have infrastructure to perform load testing of audio-only conferences. Since the code I've touched only deals with STUN messages, AFAIK their number does not increase when a video stream is added (as long as audio and video are multiplexed on the same socket), so I believe the audio-only test accurately reflects the changes for video as well.

Yes, that's a good point.

Still a bit paranoid about regressions.

I understand you. In case a regression is detected, the change could be undone or, even better, investigated and, if necessary, additional threads (but a fixed amount) added to the pool 😄

👍

@mstyura


Contributor

mstyura commented Nov 29, 2018

@bbaldino here are fresh results comparing the latest ice4j master with master + this PR. Note, my test machine has changed since the very first tests, from a hardware mac to a Linux VM which is a local attempt to mimic an Amazon c5.large instance.

before

Name Value Comment
VM CPU 2 cores Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz
RAM 4 Gb
OS CentOS Linux release 7.5.1804 (Core)
JVB version: 1.1-20180615.220358-69 # old version, i know
libjitsi version: 1.0-20180607.160234-352
ice4j version: 2.0.0-20181120.222039-17 # latest master
org.ice4j.TERMINATION_DELAY, ms 604800000
Conf per server: 25
Peers per conf: 5
Est. RTP streams: 625
CPU load: 87-96 %
Active threads: 1783
Network IN: ~ 5072 KBit/s
Network OUT: ~ 20273 KBit/s
Average RTT: 4-180 ms # mostly stay low
ICE connection stability: stable

after

Name Value Comment
VM CPU 2 cores Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz
RAM 4 Gb
OS CentOS Linux release 7.5.1804 (Core)
JVB version: 1.1-20180615.220358-69
libjitsi version: 1.0-20180607.160234-352
ice4j version: 2.0.0-20181120.222039-17 # latest master + PR 155
org.ice4j.TERMINATION_DELAY, ms 604800000
Conf per server: 25
Peers per conf: 5
Est. RTP streams: 625
CPU load: 84-94 % # a little bit lower
Active threads: 1414 # 369 threads fewer; the estimated reduction was ~373 (125 peers * 3 threads - N, where N = 2 cores), so some other threads probably live on their own
Network IN: ~ 5113 KBit/s
Network OUT: ~ 20455 KBit/s
Average RTT: 2 - 172 ms # even more values stayed low
ICE connection stability: stable
@bbaldino

Cool, thanks for the patience and hard work @mstyura

@bbaldino bbaldino merged commit ee4e8c9 into jitsi:master Nov 29, 2018

1 check passed

default 490 tests run, 356 skipped, 0 failed.