
don't use safe caller for event delegation to subscribers #861

Merged: 1 commit merged on Jun 19, 2019
Conversation

@maggu2810 (Contributor)

These changes should address #829.

At the moment we warn if an event subscriber exceeds the expected maximum runtime.
This has been done before, too.
It would also be possible to interrupt the event subscriber when it exceeds the limit (I played around with another implementation today that does this), but I don't think we should do that. It could lead to very confusing behaviour under very high load. The warning should be enough to detect a badly behaving event subscriber.

Signed-off-by: Markus Rathgeb <maggu2810@gmail.com>
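
For illustration only, here is a minimal sketch of the warning approach described above (single-threaded dispatch plus a watcher that only logs and never interrupts). The class name and the timeout constant are invented for this example; this is not the code in the PR:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.ScheduledFuture;
    import java.util.concurrent.TimeUnit;

    // Sketch only: events are handed to a single-threaded executor (so order is kept),
    // while a separate watcher thread logs a warning if a subscriber runs longer than
    // an assumed limit. The subscriber is never interrupted.
    class EventDispatchSketch {

        private static final long MAX_RUNTIME_MS = 5000; // assumed limit, not taken from the PR

        private final ExecutorService executor = Executors.newSingleThreadExecutor();
        private final ScheduledExecutorService watcher = Executors.newSingleThreadScheduledExecutor();

        void dispatch(String subscriberName, Runnable handler) {
            executor.submit(() -> {
                // Arm a one-shot warning that only fires if the handler is still running.
                ScheduledFuture<?> warning = watcher.schedule(
                        () -> System.err.printf("Subscriber '%s' takes more than %d ms to process an event%n",
                                subscriberName, MAX_RUNTIME_MS),
                        MAX_RUNTIME_MS, TimeUnit.MILLISECONDS);
                try {
                    handler.run();
                } finally {
                    warning.cancel(false); // handler finished (or the warning already fired)
                }
            });
        }
    }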
@mhilbush (Contributor) commented Jun 9, 2019

Adding a summary of my testing from here.

Test parameters

Build 1607:

  • CPU: ranging between 25% and 85% on one core
  • Memory: ~15.4%
  • Event Manager thread CPU time rate of increase (see note 2 below)
   8322719   ESH-OSGiEventManager
   8328524   ESH-OSGiEventManager
   8334892   ESH-OSGiEventManager
   8341976   ESH-OSGiEventManager

Build 1607 with org.openhab.core as described above:

  • CPU: ranging between 2% and 5% with occasional spikes to 20-25%
  • Memory: ~8.0%
  • Event Manager thread CPU time rate of increase (see note 2 below)
 2839      OH-OSGiEventManager
 10243     OH-EventHandlerExecutor-1
 758       OH-EventHandlerWatcher-1
-----------------------
 2854      OH-OSGiEventManager
 10301     OH-EventHandlerExecutor-1
 762       OH-EventHandlerWatcher-1
-----------------------
 2868      OH-OSGiEventManager
 10354     OH-EventHandlerExecutor-1
 766       OH-EventHandlerWatcher-1
-----------------------
 2884      OH-OSGiEventManager
 10412     OH-EventHandlerExecutor-1
 770       OH-EventHandlerWatcher-1
-----------------------
 2900      OH-OSGiEventManager
 10471     OH-EventHandlerExecutor-1
 775       OH-EventHandlerWatcher-1

Note 1: Event rate measured with all event types in org.ops4j.pax.logging.cfg set to INFO level and the following command:

    tail -f /opt/openhab2/userdata/logs/events.log | pv --line-mode --average-rate > /dev/null

Note 2: Event Manager thread CPU time measured by taking snapshots every 10 seconds using the following command:

        /opt/openhab2/runtime/bin/client -p habopen -l 0 -- threads --list | \
            awk -F"│" '{ gsub(/[ \t]+$/, "", $2);  print $4 $2 }' | \
            sort --numeric-sort --reverse

@maggu2810 (Contributor, Author)

Thanks a lot to @mhilbush for his testing!

@mhilbush (Contributor) commented Jun 9, 2019

I cranked the event rate up to 69 events per second. Pretty darn good!

  • CPU: 15-20% with spikes to 40%
  • Memory: ~8.5%
  • Event Manager thread CPU time

 7030      OH-OSGiEventManager
 34676     OH-EventHandlerExecutor-1
 2200      OH-EventHandlerWatcher-1
-----------------------
 7060      OH-OSGiEventManager
 35206     OH-EventHandlerExecutor-1
 2216      OH-EventHandlerWatcher-1
-----------------------
 7087      OH-OSGiEventManager
 35696     OH-EventHandlerExecutor-1
 2232      OH-EventHandlerWatcher-1
-----------------------
 7118      OH-OSGiEventManager
 36207     OH-EventHandlerExecutor-1
 2249      OH-EventHandlerWatcher-1
-----------------------
 7147      OH-OSGiEventManager
 36718     OH-EventHandlerExecutor-1
 2265      OH-EventHandlerWatcher-1
-----------------------
 7178      OH-OSGiEventManager
 37254     OH-EventHandlerExecutor-1
 2283      OH-EventHandlerWatcher-1
-----------------------
 7207      OH-OSGiEventManager
 37752     OH-EventHandlerExecutor-1
 2301      OH-EventHandlerWatcher-1
-----------------------

> Thanks a lot to @mhilbush for his testing!

Thanks. I ran a Fortune 100 web site for 8 years. Our philosophy was:

  1. Test it till it breaks
  2. Fix issues
  3. Go back to step 1

private final ScheduledExecutorService watcher = Executors
        .newSingleThreadScheduledExecutor(new NamedThreadFactory("EventHandlerWatcher"));
private final ExecutorService executor = Executors
        .newSingleThreadExecutor(new NamedThreadFactory("EventHandlerExecutor"));
(Member)

AFAIR, the safeCaller does not use a single thread but a pool of x threads. Is it safe to reduce this to a single thread, or might we run into other issues?

I also remember a feature where we assigned events to bindings and made sure that if one binding exhausts all threads of a pool, it does not negatively impact the others. I cannot find this in the code right now, so I assume it was not part of the internal event handling itself.

@maggu2810 (Contributor, Author) commented Jun 10, 2019

Events should be handled in order.
Using multiple threads for the event handling can result in a different event order.
Assume the changed events "1->2" and "2->3". If you use different threads, how do you ensure that the submitted executions keep their order and one thread does not overtake the other?
E.g. you submit the event handling for "1->2" to the executor, and after that you submit the event handling for "2->3". The executor uses thread1 for "1->2" and thread2 for "2->3". Sometimes thread1 runs before thread2, other times thread2 runs before thread1 (or they run at the same time on different CPUs, or execution switches between them, etc.). The event order is no longer deterministic.

So, regardless of how it is done, events should be handled in order.
IIRC this was also discussed some years ago.

As written above, we could interrupt the event subscriber if the limit is exceeded.
But if you realize that an add-on consumes too many resources, it would perhaps be better to fix that add-on (and notice the bad behaviour).
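
To make the ordering argument concrete, here is a small standalone illustration (not code from this PR): a single-threaded executor delivers the two state changes strictly in submission order, while a thread pool may deliver them in either order.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class EventOrderDemo {

        static void deliver(String event) {
            System.out.println(Thread.currentThread().getName() + ": " + event);
        }

        public static void main(String[] args) throws InterruptedException {
            // Single thread: tasks run strictly in submission order, so "1->2" always precedes "2->3".
            ExecutorService ordered = Executors.newSingleThreadExecutor();
            ordered.submit(() -> deliver("state changed 1->2"));
            ordered.submit(() -> deliver("state changed 2->3"));
            ordered.shutdown();
            ordered.awaitTermination(1, TimeUnit.SECONDS);

            // Thread pool: both tasks may run concurrently on different threads,
            // so a subscriber can observe "2->3" before "1->2".
            ExecutorService pool = Executors.newFixedThreadPool(2);
            pool.submit(() -> deliver("state changed 1->2"));
            pool.submit(() -> deliver("state changed 2->3"));
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.SECONDS);
        }
    }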

(Contributor)

Just to put one more data point out there...

On my test system (a ~9-year-old AMD CPU with 6 cores running at 3.7 GHz each) I max out one core at ~90 events per second. I produce this load with 5 systeminfo things set to update all channels (except disk, because I don't know the performance penalty of that function) at a refresh rate of interval_high=1, interval_medium=4.

The top cpu consuming threads are:

 735734    OH-EventHandlerExecutor-1
 20119     Timer-7
 19935     ESH-usb-serial-discovery-linux-sysfs-1
 18066     pool-4-thread-1
 17530     OH-OSGiEventManager
 11320     EventAdminAsyncThread #11
 11261     EventAdminAsyncThread #12
 9450      safeCall-40
 9211      OH-EventHandlerWatcher-1
 9129      safeCall-38
 8779      pool-11-thread-1
 8565      Start Level: Equinox Container: 7dcb28cd-c6e7-4135-a1f8-6d3d46449a09
 8531      safeCall-39
 6941      sshd-SshServer[5e8ba151]-nio2-thread-1
 6514      sshd-SshServer[5e8ba151]-nio2-thread-2
 6321      safeCall-37
 6212      pool-9-thread-1
 6205      ZWaveReceiveInputThread
 4087      safeCall-41
 2974      Finalizer
 2922      Reference Handler
 2681      items-67
 2461      pool-706-thread-1
 2097      fileinstall-/opt/openhab2/userdata/etc
 2078      Karaf Lock Monitor Thread
 1925      items-68
 1643      openHAB-job-scheduler_Worker-1
 1580      openHAB-job-scheduler_Worker-2
 1555      ResourcePublisher
 1516      Active Thread: Equinox Container: 7dcb28cd-c6e7-4135-a1f8-6d3d46449a09

Overall, openHAB is consuming a little less than 2 CPUs' worth of load.

top - 08:03:15 up 10 days, 14:04,  2 users,  load average: 2.03, 1.43, 1.13
Tasks: 330 total,   2 running, 263 sleeping,   0 stopped,   0 zombie
%Cpu(s): 32.6 us,  1.8 sy,  0.0 ni, 65.3 id,  0.1 wa,  0.0 hi,  0.2 si,  0.0 st
KiB Mem :  8167104 total,  2072852 free,  2031328 used,  4062924 buff/cache
KiB Swap: 16876540 total, 16876540 free,        0 used.  5811896 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                                                
31984 root      20   0 7042648 747712  23360 S 181.7  9.2  65:11.77 java                                                                                                                                                   
10228 mark      20   0  819608 100000  53848 S   9.3  1.2  23:08.14 Xorg                                                                                                                                                   
15095 xrdp      20   0   59036  28976   8084 S   7.0  0.4  20:59.19 xrdp                                                                                                                                                   
10900 mark      20   0  542100  41572  25628 S   4.3  0.5   8:15.48 xfce4-terminal                                                                                                                                         
10323 mark      20   0  429236  26368  20684 S   0.7  0.3   0:41.54 xfwm4                                                                                                                                                  
 1932 root      20   0 1668112  66936  38988 S   0.3  0.8  15:09.72 dockerd                                                                                                                                                
 3341 gdm       20   0  589708  24896  18020 R   0.3  0.3   5:55.77 gsd-color                                                                                                                                              
27420 mark      20   0   13824    876    776 S   0.3  0.0   0:01.59 tail                                                                                                                                                   
31641 mark      20   0   42052   3800   2912 R   0.3  0.0   0:00.07 top                                                                                                                                                    
    1 root      20   0  226512  10252   6708 S   0.0  0.1   0:42.78 systemd                                                                                                                                                

(Member)

@maggu2810 We never had the requirement to respect a specific order and, AFAIR, older OSGi specs didn't ask for ordered delivery either. I remember discussions where we found out that Equinox used single-threaded delivery while Felix used a thread pool instead.

It seems that OSGi 7 has now introduced a specific setting for defining whether ordering is required or not.

> But if you realize that an add-on consumes too many resources, it would perhaps be better to fix that add-on (and notice the bad behaviour).

Sure, but the idea was that an add-on's misbehaviour is constrained to its own functionality and does not cause others to misbehave at the same time. I remember situations where a binding took too long when consuming events, which resulted in late delivery of events to other bindings; those then showed very odd behavior, and it was difficult to figure out which binding was the culprit. It is much better if only the binding that actually has the bug behaves badly.

(Contributor)

IMO, this PR is a dramatic improvement over the current implementation. If there are no issues other than the ability to scale beyond the rates I've outlined above, can't this be merged? It resolves a very critical issue in the current snapshots.

If there's a desire to change the implementation to scale better, can't that be done in another PR?

(Member)

I had a look at the code and the changes in this PR don't impact how commands are sent to ThingHandlers using the SafeCaller:

safeCaller.create(handler, ThingHandler.class)
        .withTimeout(CommunicationManager.THINGHANDLER_EVENT_TIMEOUT).onTimeout(() -> {
            logger.warn("Handler for thing '{}' takes more than {}ms for handling a command",
                    handler.getThing().getUID(), CommunicationManager.THINGHANDLER_EVENT_TIMEOUT);
        }).build().handleCommand(link.getLinkedUID(), command);

So commands will still be handled asynchronously, which was my main concern. Commands will thus continue to be handled even if a couple of ThingHandlers hijack threads, e.g. due to network timeouts or while waiting on locks.

(Member)

I was just about to ask @mhilbush to add some command handling to the load generator (with e.g. 300ms processing time of the handler).
But @wborn's observation that this had actually moved to the ProfileCallback should hopefully answer it: this should no longer be related to the event dispatching itself. I consider my second concern clarified then.

Thank you all for the hard work and your patience with me!

(Contributor)

> Thank you all for the hard work and your patience with me!

No worries. It's a change to a critical area, so I get the need for diligence.

> I was just about to ask @mhilbush to add some command handling to the load generator (with e.g. 300ms processing time of the handler).

I wrote the load generator in a hurry, so there's room for improvement. As I plan to use it in the future, and possibly extend it, what were you looking for here?

(Member)

> what were you looking for here?

A feature that makes the load generator not only create status updates but also commands, and then has the handleCommand method in the ThingHandler block for e.g. 300ms to simulate some heavy work (like accessing external devices). This would better simulate a setup where users can directly observe whether there is a delay in event processing or not: if switching on a device suddenly takes a few seconds because the command hangs in a single-threaded queue, it is instantly visible and annoying to the user. But as mentioned above, this PR should not have an issue with this behavior, so such an addition would only be interesting for future tests in other contexts.
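
As an illustration of what such a load-generator extension could look like, here is a hedged sketch of a ThingHandler whose handleCommand blocks for ~300 ms. The class name is hypothetical, and the package names assume a recent openhab-core (older 2.x builds still use the org.eclipse.smarthome namespace); it is not part of this PR:

    import org.openhab.core.thing.ChannelUID;
    import org.openhab.core.thing.Thing;
    import org.openhab.core.thing.ThingStatus;
    import org.openhab.core.thing.binding.BaseThingHandler;
    import org.openhab.core.types.Command;

    // Hypothetical load-test handler: every command "costs" ~300 ms, simulating an
    // external device access, so any queuing delay in command delivery becomes visible.
    public class LoadTestThingHandler extends BaseThingHandler {

        public LoadTestThingHandler(Thing thing) {
            super(thing);
        }

        @Override
        public void initialize() {
            updateStatus(ThingStatus.ONLINE);
        }

        @Override
        public void handleCommand(ChannelUID channelUID, Command command) {
            try {
                Thread.sleep(300); // simulated heavy work per command (e.g. device I/O)
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            // A real handler would now update the channel state to reflect the command.
        }
    }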

@maggu2810 (Contributor, Author) commented Jun 19, 2019

@mhilbush Can you check this summary of your tests?

Is my understanding correct that you state the test makes no sense against the current master branch because it is so slow?

| id | naming here | #763 | #861 | notes | events / second | cpu load |
|----|-------------|------|------|-------|-----------------|----------|
| 1 | Baseline | no | no | some time in the past, before we noticed that the proxies were using wrong classloading | > 1370 | |
| 2 | Test 2 | no | yes | should not be important, as #763 should not be reverted again | > 1370 | performs substantially better than "1" in terms of CPU consumption; performs better than "1" in terms of top CPU-consuming threads |
| 3 | Current Master | yes | no | current master | <= 40 | |
| 4 | Test 1 | yes | yes | master if this one gets merged | > 1370 | performs substantially better than "1" in terms of CPU consumption; performs better than "1" in terms of top CPU-consuming threads |

Can you fill such a table WRT your current machine?

@mhilbush (Contributor)

> Is my understanding correct that you state the test makes no sense against the current master branch because it is so slow?

Yes, the current master branch is too slow to run anything close to the volume I ran against Baseline, Test 1, and Test 2.

> Can you fill such a table WRT your current machine?

I was just thinking about doing a table like this. 😉 Working on it now.

@mhilbush (Contributor)

Can you update the ID column with the following so that the rows map to the tests above? Otherwise, you've captured it correctly.

1 => Baseline
2 => Test 2
3 => Current Master
4 => Test 1

@maggu2810 (Contributor, Author)

I assume the naming "Baseline" is quite confusing, so I preferred not to choose a special name at all, but an ID, a description, and a truth table. 😉
But if it makes you happier, I will add another table that maps to the naming in your comment above.

@mhilbush (Contributor)

You can leave it as is. I used the term Baseline to reflect the performance of the system prior to PRs #763 and #861.

@kaikreuzer (Member)

Thank you all for the hard work and your patience with me!

@openhab-bot (Collaborator)

This pull request has been mentioned on openHAB Community. There might be relevant details there:

https://community.openhab.org/t/preparation-for-2-5m2/75738/73

@wborn added this to the 2.5 milestone on Jul 30, 2019
Rosi2143 pushed a commit to Rosi2143/openhab-core that referenced this pull request Dec 26, 2020
splatch pushed a commit to ConnectorIO/copybara-hab-core that referenced this pull request Jul 11, 2023