poller service starting multiple polling process for same machine #5619

Closed
5 tasks done
boudreau opened this issue Jan 26, 2017 · 17 comments

@boudreau
Contributor

boudreau commented Jan 26, 2017

I'm using distributed pollers and have started using the poller service instead of crontab.

The poller service seems to be working well, but I often see multiple polling processes started for the same device ID:

librenms 12553 59796 0 08:32 ? 00:00:00 /bin/sh -c /usr/bin/env php /var/www/librenms/poller.php -h 677 >> /dev/null 2>&1
librenms 13051 59796 0 08:32 ? 00:00:00 /bin/sh -c /usr/bin/env php /var/www/librenms/poller.php -h 677 >> /dev/null 2>&1
librenms 13263 59796 0 08:32 ? 00:00:00 /bin/sh -c /usr/bin/env php /var/www/librenms/poller.php -h 677 >> /dev/null 2>&1

This happens on the main poller/DB/web site as well as the other pollers.
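
A quick way to spot this condition from a process listing can be sketched in Python. This is a minimal sketch, assuming input lines shaped like the `ps` output above; the function name `duplicate_devices` and the regular expression are illustrative and not part of LibreNMS. A listing may contain both the `/bin/sh -c` wrapper and its `php` child for one poll, so filter the input first if both appear.

```python
import re
from collections import Counter

# Assumed shape of a poller command line: "... poller.php -h <device id> ..."
POLLER_RE = re.compile(r"poller\.php\s+-h\s+(\d+)")


def duplicate_devices(ps_lines):
    """Return {device_id: process_count} for devices with more than one poller."""
    counts = Counter()
    for line in ps_lines:
        match = POLLER_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    # Keep only devices that have several poller processes at once.
    return {dev: n for dev, n in counts.items() if n > 1}
```

Fed the three lines shown above, this would report device 677 with three processes.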

It should not start multiple pollers for the same device, since there should be a DB lock on the device. Could multiple queries be running before the lock is active in the DB?
How can I solve this?

Thanks for the help

Component Version
LibreNMS 08edfc6
DB Schema 157
PHP 7.0.14
MySQL 5.5.50-MariaDB
RRDTool 1.5.6
SNMP NET-SNMP 5.7.2

DO NOT DELETE THIS INFORMATION.

Please read this information carefully.

GitHub issues is for feature requests or bugs, please do not post issues asking for help or how to do X, Y or Z.
You can use our irc channel ##librenms on freenode to ask questions or our community site.

Please confirm each of the sections below by putting an x in the box like [x].

  • Is your install up to date? Updating your install
    Please do not submit an issue if your install is not up to date within the last 24 hours or on a stable monthly release.
  • Please include all of the information between the ==================================== section of ./validate.php which you can run from the cli.
  • Unless your issue is for a WebUI fix or feature then please provide ALL info asked for here.
  • Please provide as much detail as possible.
  • Please do NOT post more than 10 lines of debug information here, use a pastebin service or GitHub Gists.
@laf
Member

laf commented Jan 26, 2017

@clinta any input on this?

@clinta
Contributor

clinta commented Jan 27, 2017

Can you provide a log from the poller service?

@pollix

pollix commented Jan 30, 2017

I have the same issue using 32 service workers on a non-distributed poller.
Restarting the service several times can reproduce the issue.

@laf
Member

laf commented Jan 30, 2017

I'm sure the info asked for above is still required :)

@pollix

pollix commented Jan 30, 2017

I set the service log level to debug.
It is basically DDoSing my weakest devices.

poller-service.txt

cat poller-service.txt | grep 141

@boudreau
Contributor Author

boudreau commented Jan 30, 2017

Here is a sample log. I got multiple polls on machines such as 93, 122, 136, 175, 224, 371 and 1288.

poller-service-log.txt

It's not too bad when polling a well-performing machine, but I have some older ones that get hammered and finish with poll times of over 300 seconds.

@laf
Member

laf commented Feb 6, 2017

@clinta some debug above if it's of help?

@laf
Member

laf commented Mar 24, 2017

@clinta Are you able to spare some time to look at this?

@laf
Member

laf commented Apr 23, 2017

@clinta any joy in taking a look at this?

@f0o
Member

f0o commented Apr 30, 2017

@boudreau are you using a single SQL instance, or a clustered/multi-master setup? SQL locks are not replicated in a cluster, so that could cause a device to be polled multiple times. Just an idea.

@boudreau
Contributor Author

boudreau commented May 1, 2017

I have a single instance running.
The poller is threaded. Could multiple poller threads be receiving the same response from the DB for the "oldest not polled device" and starting the job before the lock is taken in the DB?
Could we confirm the lock in the DB before starting the poll?
Just an idea.
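
The idea of confirming a claim before polling can be illustrated in-process. The following is a minimal sketch, not the poller service's actual code: `DeviceClaims` and `poll_if_unclaimed` are hypothetical names, and a claim made this way only protects worker threads inside one process, so distributed pollers would still need a working lock in the DB.

```python
import threading


class DeviceClaims:
    """In-process registry of devices currently being polled (hypothetical)."""

    def __init__(self):
        self._mutex = threading.Lock()
        self._active = set()

    def try_claim(self, device_id):
        # Check and insert under one mutex so two workers cannot both succeed.
        with self._mutex:
            if device_id in self._active:
                return False
            self._active.add(device_id)
            return True

    def release(self, device_id):
        with self._mutex:
            self._active.discard(device_id)


def poll_if_unclaimed(claims, device_id, poll):
    """Run poll(device_id) only if this worker wins the claim."""
    if not claims.try_claim(device_id):
        return False  # another worker is already polling this device
    try:
        poll(device_id)
        return True
    finally:
        # Always give the claim back, even if the poll raised.
        claims.release(device_id)
```

The key point is that the check and the claim happen as one atomic step; a separate "is it locked?" query followed by a later "lock it" call leaves a window in which several threads can pick the same device.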

@murrant
Member

murrant commented May 4, 2017

I started to look into this, but I don't know enough. It looked like the locking was not working anymore with a specific version of mysql/mariadb...

@boudreau
Contributor Author

Do you know which version of mariadb has a working lock system?

@boudreau
Contributor Author

I upgraded MariaDB on our second server to version 10.1.22, and I now see far fewer duplicate polls started at the same time.
For the same time period on our older server with MariaDB 5.5.52, I got about 3525 duplicate pollers started on the same device, with up to 22 in the same second:

grep poller librenms.log | uniq -d -c | sort
10 /var/www/librenms/poller.php 1099 2017-05-29 01:47:56 - 1 devices polled in 12.48 secs
11 /var/www/librenms/poller.php 1099 2017-05-29 01:47:55 - 1 devices polled in 12.48 secs
11 /var/www/librenms/poller.php 48 2017-05-29 06:53:44 - 1 devices polled in 133.4 secs
12 /var/www/librenms/poller.php 184 2017-05-28 12:47:21 - 1 devices polled in 101.1 secs
12 /var/www/librenms/poller.php 452 2017-05-29 01:40:31 - 1 devices polled in 138.1 secs
13 /var/www/librenms/poller.php 230 2017-05-28 21:34:48 - 1 devices polled in 103.2 secs
13 /var/www/librenms/poller.php 77 2017-05-28 15:26:01 - 1 devices polled in 219.7 secs
22 /var/www/librenms/poller.php 668 2017-05-28 05:29:09 - 1 devices polled in 124.1 secs
2 /var/www/librenms/poller.php 100 2017-05-28 19:51:19 - 1 devices polled in 10.57 secs
2 /var/www/librenms/poller.php 1002 2017-05-28 02:27:09 - 1 devices polled in 16.13 secs
2 /var/www/librenms/poller.php 1003 2017-05-28 16:14:25 - 1 devices polled in 10.38 secs
<....>
7 /var/www/librenms/poller.php 239 2017-05-29 07:08:01 - 1 devices polled in 102.6 secs
7 /var/www/librenms/poller.php 61 2017-05-28 02:27:34 - 1 devices polled in 164.2 secs
7 /var/www/librenms/poller.php 64 2017-05-28 11:32:27 - 1 devices polled in 109.4 secs
7 /var/www/librenms/poller.php 65 2017-05-28 02:15:43 - 1 devices polled in 138.8 secs
7 /var/www/librenms/poller.php 667 2017-05-28 08:41:33 - 1 devices polled in 140.5 secs
7 /var/www/librenms/poller.php 85 2017-05-29 08:18:47 - 1 devices polled in 126.3 secs
8 /var/www/librenms/poller.php 1075 2017-05-29 05:05:26 - 1 devices polled in 12.48 secs
8 /var/www/librenms/poller.php 364 2017-05-28 05:59:25 - 1 devices polled in 100.7 secs

The first number is the number of duplicate entries in the log.
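
`uniq -d` only collapses adjacent identical lines, so duplicates separated by other log entries are not counted. A rough Python equivalent that counts identical lines anywhere in the log might look like this. It is a sketch under assumptions: the log line layout (script path, device ID, timestamp) is inferred from the sample output above, and the function names are illustrative.

```python
from collections import Counter


def duplicate_polls(lines):
    """Return {line: count} for poller log lines that occur more than once."""
    counts = Counter(line.strip() for line in lines if "poller" in line)
    return {line: n for line, n in counts.items() if n > 1}


def duplicates_per_device(dupes):
    """Sum the extra polls per device, assuming the ID is the second field."""
    per_device = Counter()
    for line, n in dupes.items():
        fields = line.split()
        if len(fields) > 1:
            # n identical entries mean n - 1 polls beyond the expected one.
            per_device[fields[1]] += n - 1
    return per_device
```

Passing an open `librenms.log` file object to `duplicate_polls` and the result to `duplicates_per_device` would give a per-device total that is easier to compare between the two servers.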

On our new server:

There were only 143 duplicated entries, with at most 2 duplicates at a time:
2 /var/www/librenms/poller.php 1019 2017-05-28 13:44:12 - 1 devices polled in 0.057 secs
2 /var/www/librenms/poller.php 1020 2017-05-28 18:35:47 - 1 devices polled in 3.169 secs
2 /var/www/librenms/poller.php 1021 2017-05-28 07:42:15 - 1 devices polled in 0.058 secs
2 /var/www/librenms/poller.php 1059 2017-05-28 11:06:57 - 1 devices polled in 12.46 secs
<...>

So I guess it is much better with a more recent MariaDB.

Hope this gives some insight.

@murrant
Member

murrant commented Jun 1, 2017

Wish I could help here since I think the service based poller is superior, but you may have to use cron based polling for now.

Basically, the locking is broken. So either repair the current locking scheme or replace with a different one.

@laf
Member

laf commented Jul 3, 2017

@boudreau @pollix Can you both test #6938 - that should contain a fix for this.

murrant added a commit to murrant/librenms that referenced this issue Jul 11, 2017
I believe this will fix the locking on mysql >= 5.7.5.  As of that version, GET_LOCK will not return 0 if calling from the same connection.

fixes: librenms#5619
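
The commit message above implies that a poller sharing one connection cannot rely on `GET_LOCK` alone to reject a second request for the same device. One possible guard, sketched here in Python and not necessarily what the referenced commit does, is to remember which lock names the connection already holds. `ConnectionLocks` is a hypothetical helper; the cursor is assumed to be a DB-API 2.0 cursor whose driver uses `%s` placeholders (as PyMySQL and MySQLdb do).

```python
import threading


class ConnectionLocks:
    """Tracks named locks held through one DB connection (hypothetical)."""

    def __init__(self, cursor):
        self._cursor = cursor  # assumed: DB-API cursor on a MySQL-compatible server
        self._held = set()
        self._mutex = threading.Lock()

    def acquire(self, name):
        with self._mutex:
            # Refuse locally first: on MySQL >= 5.7.5 the server grants a
            # repeat request for the same name from the same connection.
            if name in self._held:
                return False
            # Timeout 0 means "do not wait if another session holds it".
            self._cursor.execute("SELECT GET_LOCK(%s, 0)", (name,))
            row = self._cursor.fetchone()
            if row and row[0] == 1:
                self._held.add(name)
                return True
            return False

    def release(self, name):
        with self._mutex:
            if name in self._held:
                self._cursor.execute("SELECT RELEASE_LOCK(%s)", (name,))
                self._cursor.fetchone()
                self._held.discard(name)
```

An alternative design would be to open a separate connection per worker so that the server sees each request as coming from a different session; which approach suits the poller service is left open here.
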
@laf
Member

laf commented Mar 20, 2018

Closing this. It's documented here for anyone experiencing the issue, but no fix is forthcoming.

@laf laf closed this as completed Mar 20, 2018
@lock lock bot locked as resolved and limited conversation to collaborators May 19, 2018