Devices stop polling if down for some time #398
Comments
Hi @rkerr, thanks for this report, and sorry for the bug. There are a few different things going on here:
So to fix this, there are a couple of options:
Hope this helps,
Interesting, certainly the database thought it was schema version 54. Having redeployed all the changes, that constraint has now gone, so somehow it got missed. Possibly I was running a dev version at some point. Is it possible to switch this defer behaviour off? I guess I can set the retry time down to 10 seconds, but it doesn't really seem like a behaviour I have any use for. To be honest I'm struggling to think of why anyone would want that behaviour... Also, shouldn't the status be error/defer rather than stuck on queued if it was deferring?
Hi @rkerr, I'm glad this issue is resolved. It should be possible to switch the feature off, but I think there's a bug. I need to fix that! In the meantime you can do the following in config:
(See the docs here, but ignore the advice to set it to zero: https://github.com/netdisco/netdisco/wiki/Configuration#workers)

There are certainly many use cases for this feature! Most commonly, it makes the backend much more efficient where there are a lot of devices discovered through neighbor relations but Netdisco is not able to contact them. It is also key to allowing multiple backends to work smoothly, where different backends can contact different sets of devices. To be honest, the case (as yours) where devices go away and come back is seen as more unusual :-).
I'm not entirely sure it is resolved... I haven't updated yet, as doing so would restart the backend and that fixes the issue in the short term anyway. I thought it better to leave it in a broken state to try to understand a bit better.

Having now read the commit you linked (my bad, should have read it first) I'm not sure that deferrals are the issue. It looks like deferral happens if the device_skip table has deferrals > 10 and last_defer is less than retry_after ago. For these devices deferrals = 0 and last_defer is null - is that what you're expecting here?

I can understand why you'd want this behaviour for devices identified by LLDP but never previously discovered, however that's a slightly different case from devices that have previously been discovered but are now timing out. Perhaps it would make sense to have different behaviour for those two cases?

I tend to use rather wide regexes in discover_no_type to keep most of the LLDP neighbours out of the way, but I guess that only works if you have a single-vendor network.
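As a quick reference for anyone checking the same thing, the deferral state described here can be inspected directly in the database. A minimal sketch; device_skip, deferrals and last_defer are named above, while the assumption that the device address column is simply called device is mine:

SELECT device, deferrals, last_defer
  FROM device_skip
 WHERE deferrals > 0 OR last_defer IS NOT NULL
 ORDER BY deferrals DESC;

Devices whose deferrals exceed the configured limit and whose last_defer is recent are the ones the backend will skip until retry_after has passed.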
I've been seeing similar things on 2.39.30: items queued for over a day (the job overview was all queued jobs). I didn't notice at first since I was setting up a new install, but I can say that items were queued for long over the deadline.

I did notice some strange database behaviour: when my job queue stops, it seems I tend to have a database connection that's stuck on a commit transaction, but I haven't figured out why yet.

Database side:

postgres=# SELECT * FROM pg_stat_activity where pid = 24326;
datid | datname | pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | xact_start | query_start | state_change | waiting | state | query
--------+----------+-------+----------+----------+------------------+-------------+-----------------+-------------+-------------------------------+------------+-------------------------------+-------------------------------+---------+-------+--------
315286 | testdisc | 24326 | 289223 | testdisc | | | | -1 | 2018-05-28 00:20:17.455307+02 | | 2018-05-30 02:50:29.507962+02 | 2018-05-30 02:50:29.508725+02 | f | idle | commit
(1 row)

OS side:

~> ps -aef | grep 24326
postgres 24326 12046 0 May28 ? 00:00:00 postgres: testdisc testdisc [local] idle
~> date
Wed May 30 03:30:05 CEST 2018
~> strace -p 24326
Process 24326 attached
recvfrom(11, ^CProcess 24326 detached
<detached ...>
~> date
Wed May 30 03:31:29 CEST 2018
~> lsof -p 24326
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
postgres 24326 postgres 11u unix 0xffff8805f6b1fbc0 0t0 1752433 /var/run/postgresql/.s.PGSQL.5432 type=STREAM

From what I can see, the postgres side has one process that's waiting for data on its unix socket but is not receiving anything; that's about as far as I've gotten before I ran out of time. I have to add that I only started seeing this after I upgraded; I didn't report it at the time due to having no time to troubleshoot. I did set up a second netdisco instance on the same host, under a different user, where I used "perlbrew" to build a completely separate environment to rule out a bad system perl module (a few got updated on the system), but that one hasn't been online long enough to give useful data.

The second strange thing I noticed is that sometimes I have devices where netdisco won't stop polling, even if they always refuse the connection. I saw one device with over 7000 connections refused, while I'm using the default of 10, at which point it should start to wait. But as I said, I upgraded netdisco a few weeks ago, so I was first trying to figure out if I'd screwed something up, but it seems I'm not the only one with these symptoms. I think I first saw them somewhere around May 10-15th, but I'm not sure.
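A side note for anyone chasing the same hang: the stuck connection can be located without already knowing its PID. A minimal sketch, assuming the Netdisco database is the testdisc one shown above and using an arbitrary 30-minute idle threshold:

SELECT pid, usename, state, query, state_change
  FROM pg_stat_activity
 WHERE datname = 'testdisc'
   AND state = 'idle'
   AND state_change < now() - interval '30 minutes'
 ORDER BY state_change;

Any row whose last query is a commit and whose state_change is hours old matches the symptom described here.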
I upgraded last week and I am seeing similar issues with devices that stop polling.
I know these devices were unreachable for some time, but they have returned to being reachable; however, they do not get discovered again unless I manually poll them.
I haven’t had time to diagnose these further.
Well, 2 weeks (and 1 or 2 netdisco restarts) later, someone really, really wants to get some data from 1 of my devices and won't take no for an answer:

Now, I might have added that IP manually (I was lugging around a 3-year-old database that never had any nodes expired on it, so I extracted all devices and added them to the new setup).
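For readers wondering how to spot a device that is being hammered like this, counting queue entries per device is one way. A rough sketch, assuming the job queue is the admin table mentioned later in this thread and that it has device, action and status columns (the column names are an assumption):

SELECT device, action, status, count(*) AS jobs
  FROM admin
 GROUP BY device, action, status
 ORDER BY jobs DESC
 LIMIT 20;

A single device with thousands of rows for the same action would stand out immediately.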
Hi @inphobia, thanks for the update, this does help. Please give release 2.039031 a try, which I have just pushed to CPAN. It should address this issue with user-submitted jobs that are being retried forever.
It seems I have 1 device that still goes to 12:

But I guess that's my doing. After updating to 2.039031 and restarting it stayed at 11, and when I pressed "discover" it changed to 12 but did not run out of control as before. Seems to be fixed. Thanks! // nick
Thanks for the confirmation! Great to hear!
I'm not exactly sure when this issue was introduced - currently running 2.39.21 and noticed it with several other 2.39 versions, but it's not clear when it started.
If a device is down when the job schedules a discover/macsuck, the job never completes (or possibly never runs?) and remains in the queue as 'queued'. This means no further discovers get queued, so when the device comes back up it will never be discovered again.
E.g. if I select queued jobs from the admin table for a number of devices that went down around 18:00, I see:
These devices are now up and running but haven't been discovered since. Trying to manually prompt a rediscover through the web frontend results in an error in netdisco-web.log:
Running the discover on the CLI with netdisco-do works fine for that run, but doesn't clear the problem, so further automatic discoveries don't run.

Restarting the backend seems to resolve the issue temporarily, but the next time a device is down for any period of time, that device will stop automatically discovering again.
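To make the "select queued jobs from the admin table" check above concrete, a minimal sketch of such a query; the admin table name comes from the report, while the device, action, status and entered column names are assumptions about a stock Netdisco schema:

SELECT device, action, status, entered
  FROM admin
 WHERE status = 'queued'
 ORDER BY entered;

Jobs that stay in this list long after their device has come back up are the ones that never get rescheduled.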