discover jobs are never dequeued or skipped for non discoverable devices #466
Comments
Hi @inphobia

OK, thanks for the report. I'm going to put to one side the poller performance report, which may well have a bug in it.

What you're seeing seems to be standard behaviour. Regardless of what's in the job queue, you should pay close attention to which jobs are actually selected and run, and that part isn't clear from your report.

It's fine for jobs to be in the queue, and due to Netdisco's design as a distributed computing application, parts of the code will certainly queue devices without checking deferrals. (Note to self: this could be made smarter by the queueing instance checking its own name against the deferrals backend name.)

Anyway, stuff gets queued, and it's in various states, but then the poller instance picking jobs will skip over those rows if max_deferrals is reached. Please let me know if you have evidence of jobs being picked and run where max_deferrals should have stopped them.

So in summary: just because something's in the queue doesn't mean it'll run. The distributed design of Netdisco means the job queue is always messy. The intelligence is in the code which picks jobs to run, and hopefully it gets it right!

https://github.com/netdisco/netdisco/blob/master/lib/App/Netdisco/DB/Result/Virtual/TastyJobs.pm#L13

cheers
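To make the "queued is not the same as picked" point concrete, here is a minimal Python sketch of the idea. It is not Netdisco's code (the real selection is the SQL in TastyJobs.pm linked above); the `Job` class, the `pick_jobs` function, and the `deferrals` mapping keyed by `(backend, device)` are all hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    # Hypothetical stand-in for a row in the job queue.
    job_id: int
    device: str
    action: str
    status: str = "queued"


def pick_jobs(queue, deferrals, backend, max_deferrals, limit=10):
    """Return up to `limit` queued jobs that `backend` is allowed to run.

    `deferrals` is an assumed mapping of (backend, device) -> defer count.
    Jobs for devices at or over `max_deferrals` are skipped, not deleted,
    so they remain visible in the queue.
    """
    picked = []
    for job in queue:
        if job.status != "queued":
            continue
        count = deferrals.get((backend, job.device), 0)
        if count >= max_deferrals:
            continue  # the row stays in the queue but is never selected
        picked.append(job)
        if len(picked) >= limit:
            break
    return picked
```

Under these assumptions, a device that one backend has deferred too often still shows up in the queue, and a different backend with no deferrals recorded for it would still pick the job. That matches the description above: the queue itself can look messy while the picker decides what actually runs.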
and now it could be that this device has been entered to be discovered via cli, but the admin table has been cleared since, and all jobs have also been deleted a few times via the web job queue.

there is 1 strange thing i noticed, but i'm not sure if it is relevant. when i was looking into the tastyjobs query with pgadmin i noticed the query was cut off:

the strange thing here is that it's exactly 1024 characters before it gets cut off. so either this is a limit in pgadmin (also checked with the psql cli, it seems to be cut off there as well), in postgres itself (which should be able to take much larger queries) or in one of the perl database modules, or i'm just on the wrong track.
Expected Behavior
devices should be removed from the jobqueue after they reach max_deferrals or placed on hold somehow.
Current Behavior
devices which we cannot connect to via snmp remain in the jobqueue it seems. they also seem to be enqueued multiple times.
Possible Solution
Steps to Reproduce (for bugs)
discover job for this device is queued multiple times:
device is not known:
device has reached max_deferrals but discover jobs are still being enqueued even after last_defer timestamp:
this also has some strange impact on the pollerperformance report.
while the jobs were enqueued i ran the query that's behind pollerperformance:
there are a few with 8+ hours, but that's mostly related to what will happen now.
i restart the netdisco backend:
wait 10 seconds, then run the same query again:
now all of a sudden several of my jobs took almost 2 days....
the jobqueue however got reset to error for all the queued jobs:
and since the backend was restarted the defer count was also decremented by 1:
the device we cannot connect to was discovered via lldp:
perhaps relevant parts of my config:
i've tried deleting the entire jobqueue (from the webinterface) or select queued jobs but that did not help.
Context
these jobs seem to be dequeued only after a backend restart. seems very similar to #398 but is not exactly the same.
Your Environment