Devices stop polling if down for some time #398

Closed
rkerr opened this issue Apr 13, 2018 · 10 comments

Comments

@rkerr

rkerr commented Apr 13, 2018

I'm not exactly sure when this issue was introduced. I'm currently running 2.39.21 and have noticed it with several other 2.39 versions, but it's not clear when it started.

If a device is down when a discover/macsuck job is scheduled, the job never completes (or possibly never runs?) and remains in the queue as 'queued'. This means no further discovers get queued, so when the device comes back up it is never discovered again.

e.g. if I select the queued jobs from the admin table for a number of devices that went down around 18:00, I see:

   job    |          entered           |  device  |  action  | status
----------+----------------------------+----------+----------+--------
 87351050 | 2018-04-12 18:30:36.427629 | x.x.x.47 | discover | queued
 87351052 | 2018-04-12 18:30:36.427629 | x.x.x.44 | discover | queued
 87351135 | 2018-04-12 18:30:36.427629 | x.x.x.40 | discover | queued
 87351132 | 2018-04-12 18:30:36.427629 | x.x.x.45 | discover | queued
 87351235 | 2018-04-12 18:30:36.427629 | x.x.x.42 | discover | queued
 87351034 | 2018-04-12 18:30:36.427629 | x.x.x.43 | discover | queued
 87350976 | 2018-04-12 18:30:36.427629 | x.x.x.46 | discover | queued
(started, finished and the other columns were all empty)
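
For reference, a query along these lines against the admin (jobs) table produces that listing; this is only a sketch, with column names taken from the header above:

SELECT job, entered, device, action, status
  FROM admin
 WHERE action = 'discover' AND status = 'queued'
 ORDER BY entered;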

These devices are now up and running but haven't been discovered since. Manually triggering a rediscover through the web frontend results in an error in netdisco-web.log:

[54497] 2018-04-13 14:02:51 error bless( {'msg' => 'DBIx::Class::Storage::DBI::_dbh_execute_for_fetch(): execute_for_fetch() aborted with 'ERROR: duplicate key value violates unique constraint "jobs_queued"
DETAIL: Key (action, (COALESCE(subaction, 'x'::text)), (COALESCE(device, '255.255.255.255'::inet)), (COALESCE(port, 'x'::text)))=(discover, x, x.x.x.40, x) already exists.'

Running the discover on the CLI with netdisco-do works fine for that run, but doesn't clear the problem so further automatic discoveries don't run.

Restarting the backend seems to resolve the issue temporarily, but the next time a device is down for any period of time that device will stop automatically discovering again.

@ollyg
Member

ollyg commented Apr 18, 2018

Hi @rkerr, thanks for this report, and sorry for the bug. There are a few different things going on here:

  • When devices are not contactable, Netdisco tries up to 10 times and then blocks the device for a week before trying again. Restarting the daemon tries once more, which is what you're seeing.
  • It looks like you haven't got schema update 32 applied on your system, which would have removed the jobs_queued constraint. You can run bin/netdisco-db-deploy --redeploy-all to fix this (a quick check is sketched after this list). You will then be able to queue jobs which are duplicates.
  • I have applied a patch which causes manually queued jobs to always run (and ignore the failed connections counter).
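
A quick way to check whether that old constraint is still present (a sketch against the PostgreSQL system catalog; the constraint name is taken from the error message above):

SELECT conname, pg_get_constraintdef(oid)
  FROM pg_constraint
 WHERE conname = 'jobs_queued';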

So to fix this, there are a couple of options:

  • change the workers retry_after to something else:
workers:
  retry_after: '1 day'
  • or with the next release queue the job from the web interface
  • or with the next release trigger the job at the CLI, using the --queue flag to enqueue it (sketched below)
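
For the CLI option, the invocation would look roughly like this (the flag name is as described above; the device address is only an example):

netdisco-do discover -d 192.0.2.40 --queue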

Hope this helps,
Oliver.

@rkerr
Author

rkerr commented Apr 19, 2018

Interesting; the database certainly thought it was at schema version 54. Having redeployed all the changes, that constraint has now gone, so somehow it got missed. Possibly I was running a dev version at some point.

Is it possible to switch this defer behaviour off? I guess I can set the retry time down to 10 seconds but it doesn't really seem like a behaviour I have any use/need for. To be honest I'm struggling to think of why anyone would want that behaviour...

Also... shouldn't the status be error/defer not stuck on queued if it was deferring?

@ollyg
Member

ollyg commented Apr 19, 2018

Hi @rkerr I'm glad this issue is resolved.

It should be possible to switch the feature off but I think there's a bug. I need to fix that! In the meantime you can do the following in config:

workers:
  max_deferrals: 10000

(see the docs here but ignore the advice to set to zero: https://github.com/netdisco/netdisco/wiki/Configuration#workers)

There are certainly many use cases for this feature! Most commonly, it makes the backend much more efficient where there are a lot of devices discovered through neighbor relations that Netdisco is not able to contact. It is also key to allowing multiple backends to work smoothly, where different backends can contact different sets of devices. To be honest, the case like yours, where devices go away and come back, is seen as more unusual :-).

@rkerr
Author

rkerr commented Apr 19, 2018

I'm not entirely sure it is resolved... I haven't updated yet, as doing so would restart the backend and that fixes the issue in the short term anyway. I thought it better to leave it in a broken state to try to understand a bit better.

Having now read the commit you linked (my bad, I should have read it first), I'm not sure that deferrals are the issue. It looks like deferral happens if the device_skip table has deferrals > 10 and last_defer is less than retry_after ago. For these devices deferrals = 0 and last_defer is null - is that what you're expecting here?
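
A query roughly like the following shows that state (device_skip column names as I read them from that commit; the addresses are placeholders for the affected devices):

SELECT backend, device, deferrals, last_defer
  FROM device_skip
 WHERE device IN ('192.0.2.40', '192.0.2.44');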

I can understand why you'd want this behaviour for devices identified by LLDP but never previously discovered; however, that's a slightly different case from devices that have previously been discovered but are now timing out. Perhaps it would make sense to have different behaviour for those two cases?

I tend to use rather wide regexes in discover_no_type to keep most of the LLDP neighbours out of the way, but I guess that only works if you have a single-vendor network.
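
For example, something along these lines in the deployment config (the patterns here are purely illustrative, not what I actually use):

discover_no_type:
  - 'cisco\s+ip\s+phone'
  - '(?i)wireless\s+access\s+point'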

@inphobia
Member

I've been seeing similar things on 2.39.30: items queued for over a day (the job overview showed all queued jobs). I didn't notice at first since I was setting up a new install, but I can say that items were queued for well over the deadline.

I did notice some strange database behaviour: when my job queue stops, it seems I tend to have a database connection that's stuck on a commit, but I haven't figured out why yet:

Database side:

postgres=# SELECT * FROM pg_stat_activity where pid = 24326;
-[ RECORD 1 ]----+------------------------------
datid            | 315286
datname          | testdisc
pid              | 24326
usesysid         | 289223
usename          | testdisc
application_name |
client_addr      |
client_hostname  |
client_port      | -1
backend_start    | 2018-05-28 00:20:17.455307+02
xact_start       |
query_start      | 2018-05-30 02:50:29.507962+02
state_change     | 2018-05-30 02:50:29.508725+02
waiting          | f
state            | idle
query            | commit
(1 row)

OS side:
postgres@linux001:~> ps -aef | grep 24326
postgres 24326 12046 0 May28 ? 00:00:00 postgres: testdisc testdisc [local] idle

postgres@linux001:~> date
Wed May 30 03:30:05 CEST 2018

postgres@linux001:~> strace -p 24326
Process 24326 attached
recvfrom(11, ^CProcess 24326 detached
<detached ...>

postgres@linux001:~> date
Wed May 30 03:31:29 CEST 2018

lsof -p 24326

COMMAND  PID   USER     FD  TYPE DEVICE             SIZE/OFF NODE    NAME
postgres 24326 postgres 11u unix 0xffff8805f6b1fbc0 0t0      1752433 /var/run/postgresql/.s.PGSQL.5432 type=STREAM

From what I can see, the postgres side has 1 process that's waiting for data on its unix socket but not receiving anything; that's about as far as I've gotten before I ran out of time.
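
If anyone wants to look for the same thing, a query along these lines lists long-idle sessions for the Netdisco database user (the testdisc user name matches the output above; adjust for your install):

SELECT pid, usename, state, query, state_change
  FROM pg_stat_activity
 WHERE usename = 'testdisc'
   AND state <> 'active'
 ORDER BY state_change;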

I do have to add that I only started seeing this after I upgraded; I didn't report it because I had no time to troubleshoot.
I did set up a second Netdisco instance on the same host, under a different user, where I used perlbrew to set up a completely separate environment to rule out a bad system Perl module (a few got updated on the system), but that one hasn't been online long enough to give useful data.

The second strange thing I noticed is that sometimes I have devices that Netdisco won't stop polling, even though they always refuse the connection. I saw 1 device with over 7000 connections refused, while I'm using the default of 10, after which it should start to wait.

But as I said, I upgraded Netdisco a few weeks ago, so I was first trying to figure out whether I had screwed something up, but it seems I'm not the only one with these symptoms. I think I first saw them somewhere around May 10-15th, but I'm not sure.

@taylor5042

taylor5042 commented May 30, 2018 via email

@inphobia
Member

inphobia commented Jun 8, 2018

Well, 2 weeks (and 1 or 2 Netdisco restarts) later, someone really, really wants to get some data from 1 of my devices and won't take no for an answer:

select * from device_skip order by deferrals;

linux001.aquafin.be | 10.98.65.126 | {} | 10 | 2018-06-06 07:22:00.639148
linux001.aquafin.be | 10.98.52.116 | {} | 10 | 2018-06-06 07:22:02.09192
linux001.aquafin.be | 10.98.51.117 | {} | 10 | 2018-06-06 07:22:04.979775
linux001.aquafin.be | 10.98.51.115 | {} | 10 | 2018-06-06 07:22:31.160593
linux001.aquafin.be | 10.98.4.138 | {} | 10 | 2018-06-06 07:11:52.663068
linux001.aquafin.be | 10.98.8.123 | {} | 10 | 2018-06-06 07:11:52.850854
linux001.aquafin.be | 10.98.3.116 | {} | 10 | 2018-06-06 07:13:28.013948
linux001.aquafin.be | 10.98.64.1 | {} | 11 | 2018-06-06 12:54:12.521929
linux001.aquafin.be | 10.98.60.1 | {} | 11 | 2018-06-05 17:11:08.389781
linux001.aquafin.be | 10.50.110.100 | {} | 11 | 2018-06-05 17:13:15.885833
linux001.aquafin.be | 10.111.100.50 | {} | 7575 | 2018-06-09 00:48:31.001755

Now, I might have added that IP manually (I was lugging around a 3-year-old database that never had any nodes expired on it, so I extracted all devices and added them to the new setup with netdisco-do discover), but if I understand the wiki right, manual jobs take precedence over scheduled ones but should still trigger the 10-times timeout code.
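
(The re-add was done with something roughly like the following; the old database name and the file name are only placeholders:)

psql -At -d olddisco -c 'SELECT ip FROM device' > devices.txt
while read ip; do netdisco-do discover -d "$ip"; done < devices.txt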

@ollyg
Member

ollyg commented Jun 17, 2018

Hi @inphobia, thanks for the update, this does help. Please give release 2.039031 a try, which I have just pushed to CPAN. It should address this issue with user-submitted jobs that are being retried forever.
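
For a typical installation the upgrade is roughly the following (a sketch; check the deployment docs for your setup):

~/bin/localenv cpanm --notest App::Netdisco
~/bin/netdisco-backend restart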

@inphobia
Member

It seems I have 1 device that still goes to 12:

linux001.aquafin.be | 172.22.139.77 | {} | 10 | 2018-06-18 04:40:01.816838
linux001.aquafin.be | 10.50.110.104 | {} | 10 | 2018-06-18 03:25:46.037538
linux001.aquafin.be | 10.50.110.102 | {} | 10 | 2018-06-18 03:33:39.230272
linux001.aquafin.be | 10.50.110.103 | {} | 10 | 2018-06-18 03:33:30.155197
linux001.aquafin.be | 10.50.102.23 | {} | 10 | 2018-06-18 04:54:38.133412
linux001.aquafin.be | 10.80.1.11 | {} | 10 | 2018-06-18 05:00:21.084916
linux001.aquafin.be | 10.80.1.20 | {} | 10 | 2018-06-18 05:00:21.040312
linux001.aquafin.be | 10.80.1.12 | {} | 10 | 2018-06-18 04:59:23.692281
linux001.aquafin.be | 10.98.48.110 | {} | 10 | 2018-06-18 03:25:46.046965
linux001.aquafin.be | 10.98.67.115 | {} | 10 | 2018-06-18 03:25:30.520367
linux001.aquafin.be | 10.80.1.10 | {} | 10 | 2018-06-18 05:10:08.988231
linux001.aquafin.be | 172.22.161.65 | {} | 10 | 2018-06-18 04:40:18.894744
linux001.aquafin.be | 10.50.110.100 | {} | 12 | 2018-06-18 07:04:24.566127

But I guess that's my doing. After updating to 2.039031 and restarting, it stayed at 11, and when I pressed "discover" it changed to 12 but did not run out of control as before. Seems to be fixed.

thx!

// nick

@ollyg
Member

ollyg commented Jun 19, 2018

Thanks for the confirmation! Great to hear!
