Devices stop polling if down for some time #398

Closed
rkerr opened this issue Apr 13, 2018 · 10 comments

Comments

@rkerr

rkerr commented Apr 13, 2018

I'm not exactly sure when this issue was introduced. I'm currently running 2.39.21 and have noticed it with several other 2.39 versions, but it's not clear when it started.

If a device is down when a discover/macsuck job is scheduled, the job never completes (or possibly never runs?) and remains in the queue as 'queued'. This means no further discovers get queued, so when the device comes back up it is never discovered again.

e.g. if I select the queued jobs from the admin table for a number of devices that went down around 18:00, I see:

   job    |          entered           |  device  |  action  | status
----------+----------------------------+----------+----------+--------
 87351050 | 2018-04-12 18:30:36.427629 | x.x.x.47 | discover | queued
 87351052 | 2018-04-12 18:30:36.427629 | x.x.x.44 | discover | queued
 87351135 | 2018-04-12 18:30:36.427629 | x.x.x.40 | discover | queued
 87351132 | 2018-04-12 18:30:36.427629 | x.x.x.45 | discover | queued
 87351235 | 2018-04-12 18:30:36.427629 | x.x.x.42 | discover | queued
 87351034 | 2018-04-12 18:30:36.427629 | x.x.x.43 | discover | queued
 87350976 | 2018-04-12 18:30:36.427629 | x.x.x.46 | discover | queued
(started, finished and the other columns were all empty)
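
For reference, a query along these lines against the admin (jobs) table produces that listing; this is only a sketch, with column names taken from the header above:

SELECT job, entered, device, action, status
  FROM admin
 WHERE action = 'discover' AND status = 'queued'
 ORDER BY entered;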

These devices are now up and running but haven't been discovered since. Manually triggering a rediscover through the web frontend results in an error in netdisco-web.log:

[54497] 2018-04-13 14:02:51 error bless( {'msg' => 'DBIx::Class::Storage::DBI::_dbh_execute_for_fetch(): execute_for_fetch() aborted with 'ERROR: duplicate key value violates unique constraint "jobs_queued"
DETAIL: Key (action, (COALESCE(subaction, 'x'::text)), (COALESCE(device, '255.255.255.255'::inet)), (COALESCE(port, 'x'::text)))=(discover, x, x.x.x.40, x) already exists.'

Running the discover on the CLI with netdisco-do works fine for that run, but doesn't clear the problem so further automatic discoveries don't run.

Restarting the backend seems to resolve the issue temporarily, but the next time a device is down for any period of time that device will stop automatically discovering again.

@ollyg
Member

ollyg commented Apr 18, 2018

Hi @rkerr, thanks for this report, and sorry for the bug. There are a few different things going on here:

  • When devices are not contactable, Netdisco tries up to 10 times and then blocks the device for a week before trying again. Restarting the daemon tries once more, which is what you're seeing.
  • It looks like you haven't got schema update 32 applied on your system, which would have removed the jobs_queued constraint. You can run bin/netdisco-db-deploy --redeploy-all to fix this (a quick check is sketched after this list). You will then be able to queue jobs which are duplicates.
  • I have applied a patch which causes manually queued jobs to always run (and ignore the failed connections counter).
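
A quick way to check whether that old constraint is still present (a sketch against the PostgreSQL system catalog; the constraint name is taken from the error message above):

SELECT conname, pg_get_constraintdef(oid)
  FROM pg_constraint
 WHERE conname = 'jobs_queued';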

So to fix this, there are a couple of options:

  • change the workers retry_after to something else:
workers:
  retry_after: '1 day'
  • or with the next release queue the job from the web interface
  • or with the next release trigger the job at the CLI, using the --queue flag to enqueue it (sketched below)
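
For the CLI option, the invocation would look roughly like this (the flag name is as described above; the device address is only an example):

netdisco-do discover -d 192.0.2.40 --queue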

Hope this helps,
Oliver.

@rkerr
Author

rkerr commented Apr 19, 2018

Interesting; the database certainly thought it was at schema version 54. Having redeployed all the changes, that constraint has now gone, so somehow it got missed. Possibly I was running a dev version at some point.

Is it possible to switch this defer behaviour off? I guess I can set the retry time down to 10 seconds but it doesn't really seem like a behaviour I have any use/need for. To be honest I'm struggling to think of why anyone would want that behaviour...

Also... shouldn't the status be error/defer not stuck on queued if it was deferring?

@ollyg
Member

ollyg commented Apr 19, 2018

Hi @rkerr I'm glad this issue is resolved.

It should be possible to switch the feature off but I think there's a bug. I need to fix that! In the meantime you can do the following in config:

workers:
  max_deferrals: 10000

(see the docs here but ignore the advice to set to zero: https://github.com/netdisco/netdisco/wiki/Configuration#workers)

There are certainly many use cases for this feature! Most commonly, it makes the backend much more efficient where there are a lot of devices discovered through neighbor relations that Netdisco is not able to contact. It is also key to allowing multiple backends to work smoothly, where different backends can contact different sets of devices. To be honest, the case like yours, where devices go away and come back, is seen as more unusual :-).

@rkerr
Author

rkerr commented Apr 19, 2018

I'm not entirely sure it is resolved... I haven't updated yet, as doing so would restart the backend and that fixes the issue in the short term anyway. I thought it better to leave it in a broken state to try to understand a bit better.

Having now read the commit you linked (my bad, I should have read it first), I'm not sure that deferrals are the issue. It looks like deferral happens if the device_skip table has deferrals > 10 and last_defer is less than retry_after ago. For these devices deferrals = 0 and last_defer is null - is that what you're expecting here?
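
A query roughly like the following shows that state (device_skip column names as I read them from that commit; the addresses are placeholders for the affected devices):

SELECT backend, device, deferrals, last_defer
  FROM device_skip
 WHERE device IN ('192.0.2.40', '192.0.2.44');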

I can understand why you'd want this behaviour for devices identified by LLDP but never previously discovered; however, that's a slightly different case from devices that have previously been discovered but are now timing out. Perhaps it would make sense to have different behaviour for those two cases?

I tend to use rather wide regexes in discover_no_type to keep most of the LLDP neighbours out of the way, but I guess that only works if you have a single-vendor network.
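
For example, something along these lines in the deployment config (the patterns here are purely illustrative, not what I actually use):

discover_no_type:
  - 'cisco\s+ip\s+phone'
  - '(?i)wireless\s+access\s+point'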

@inphobia
Member

I've been seeing similar things on 2.39.30: items queued for over a day (the job overview showed all queued jobs). I didn't notice at first since I was setting up a new install, but I can say that items were queued for well over the deadline.

I did notice some strange database behaviour: when my job queue stops, it seems I tend to have a database connection that's stuck on a commit, but I haven't figured out why yet:

Database side:

postgres=# SELECT * FROM pg_stat_activity where pid = 24326;
-[ RECORD 1 ]----+------------------------------
datid            | 315286
datname          | testdisc
pid              | 24326
usesysid         | 289223
usename          | testdisc
application_name |
client_addr      |
client_hostname  |
client_port      | -1
backend_start    | 2018-05-28 00:20:17.455307+02
xact_start       |
query_start      | 2018-05-30 02:50:29.507962+02
state_change     | 2018-05-30 02:50:29.508725+02
waiting          | f
state            | idle
query            | commit
(1 row)

OS side:
postgres@linux001:~> ps -aef | grep 24326
postgres 24326 12046 0 May28 ? 00:00:00 postgres: testdisc testdisc [local] idle

postgres@linux001:~> date
Wed May 30 03:30:05 CEST 2018

postgres@linux001:~> strace -p 24326
Process 24326 attached
recvfrom(11, ^CProcess 24326 detached
<detached ...>

postgres@linux001:~> date
Wed May 30 03:31:29 CEST 2018

lsof -p 24326

COMMAND  PID   USER     FD  TYPE DEVICE             SIZE/OFF NODE    NAME
postgres 24326 postgres 11u unix 0xffff8805f6b1fbc0 0t0      1752433 /var/run/postgresql/.s.PGSQL.5432 type=STREAM

From what I can see, the postgres side has 1 process that's waiting for data on its unix socket but not receiving anything; that's about as far as I've gotten before I ran out of time.
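
If anyone wants to look for the same thing, a query along these lines lists long-idle sessions for the Netdisco database user (the testdisc user name matches the output above; adjust for your install):

SELECT pid, usename, state, query, state_change
  FROM pg_stat_activity
 WHERE usename = 'testdisc'
   AND state <> 'active'
 ORDER BY state_change;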

I do have to add that I only started seeing this after I upgraded; I didn't report it because I had no time to troubleshoot.
I did set up a second Netdisco instance on the same host, under a different user, where I used perlbrew to set up a completely separate environment to rule out a bad system Perl module (a few got updated on the system), but that one hasn't been online long enough to give useful data.

The second strange thing I noticed is that sometimes I have devices that Netdisco won't stop polling, even though they always refuse the connection. I saw 1 device with over 7000 connections refused, while I'm using the default of 10, after which it should start to wait.

But as I said, I upgraded Netdisco a few weeks ago, so I was first trying to figure out whether I had screwed something up, but it seems I'm not the only one with these symptoms. I think I first saw them somewhere around May 10-15th, but I'm not sure.

@taylor5042

taylor5042 commented May 30, 2018 via email

@inphobia
Member

inphobia commented Jun 8, 2018

Well, 2 weeks (and 1 or 2 Netdisco restarts) later, someone really, really wants to get some data from 1 of my devices and won't take no for an answer:

select * from device_skip order by deferrals;

linux001.aquafin.be | 10.98.65.126 | {} | 10 | 2018-06-06 07:22:00.639148
linux001.aquafin.be | 10.98.52.116 | {} | 10 | 2018-06-06 07:22:02.09192
linux001.aquafin.be | 10.98.51.117 | {} | 10 | 2018-06-06 07:22:04.979775
linux001.aquafin.be | 10.98.51.115 | {} | 10 | 2018-06-06 07:22:31.160593
linux001.aquafin.be | 10.98.4.138 | {} | 10 | 2018-06-06 07:11:52.663068
linux001.aquafin.be | 10.98.8.123 | {} | 10 | 2018-06-06 07:11:52.850854
linux001.aquafin.be | 10.98.3.116 | {} | 10 | 2018-06-06 07:13:28.013948
linux001.aquafin.be | 10.98.64.1 | {} | 11 | 2018-06-06 12:54:12.521929
linux001.aquafin.be | 10.98.60.1 | {} | 11 | 2018-06-05 17:11:08.389781
linux001.aquafin.be | 10.50.110.100 | {} | 11 | 2018-06-05 17:13:15.885833
linux001.aquafin.be | 10.111.100.50 | {} | 7575 | 2018-06-09 00:48:31.001755

Now, I might have added that IP manually (I was lugging around a 3-year-old database that never had any nodes expired on it, so I extracted all devices and added them to the new setup with netdisco-do discover), but if I understand the wiki right, manual jobs take precedence over scheduled ones but should still trigger the 10-times timeout code.
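
(The re-add was done with something roughly like the following; the old database name and the file name are only placeholders:)

psql -At -d olddisco -c 'SELECT ip FROM device' > devices.txt
while read ip; do netdisco-do discover -d "$ip"; done < devices.txt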

@ollyg
Member

ollyg commented Jun 17, 2018

Hi @inphobia, thanks for the update, this does help. Please give release 2.039031 a try, which I have just pushed to CPAN. It should address this issue with user-submitted jobs that are being retried forever.
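
For a typical installation the upgrade is roughly the following (a sketch; check the deployment docs for your setup):

~/bin/localenv cpanm --notest App::Netdisco
~/bin/netdisco-backend restart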

@inphobia
Member

It seems I have 1 device that still goes to 12:

linux001.aquafin.be | 172.22.139.77 | {} | 10 | 2018-06-18 04:40:01.816838
linux001.aquafin.be | 10.50.110.104 | {} | 10 | 2018-06-18 03:25:46.037538
linux001.aquafin.be | 10.50.110.102 | {} | 10 | 2018-06-18 03:33:39.230272
linux001.aquafin.be | 10.50.110.103 | {} | 10 | 2018-06-18 03:33:30.155197
linux001.aquafin.be | 10.50.102.23 | {} | 10 | 2018-06-18 04:54:38.133412
linux001.aquafin.be | 10.80.1.11 | {} | 10 | 2018-06-18 05:00:21.084916
linux001.aquafin.be | 10.80.1.20 | {} | 10 | 2018-06-18 05:00:21.040312
linux001.aquafin.be | 10.80.1.12 | {} | 10 | 2018-06-18 04:59:23.692281
linux001.aquafin.be | 10.98.48.110 | {} | 10 | 2018-06-18 03:25:46.046965
linux001.aquafin.be | 10.98.67.115 | {} | 10 | 2018-06-18 03:25:30.520367
linux001.aquafin.be | 10.80.1.10 | {} | 10 | 2018-06-18 05:10:08.988231
linux001.aquafin.be | 172.22.161.65 | {} | 10 | 2018-06-18 04:40:18.894744
linux001.aquafin.be | 10.50.110.100 | {} | 12 | 2018-06-18 07:04:24.566127

But I guess that's my doing. After updating to 2.039031 and restarting, it stayed at 11, and when I pressed "discover" it changed to 12 but did not run out of control as before. Seems to be fixed.

thx!

// nick

@ollyg
Member

ollyg commented Jun 19, 2018

Thanks for the confirmation! Great to hear!
