discover jobs are never dequeued or skipped for non discoverable devices #466

inphobia · 2018-12-13T20:09:30Z

Expected Behavior

devices should be removed from the jobqueue after they reach max_deferrals or placed on hold somehow.

Current Behavior

devices which we cannot connect to via snmp remain in the jobqueue it seems. they also seem to be enqueued multiple times.

Possible Solution

Steps to Reproduce (for bugs)

discover job for this device is queued multiple times:

netdisco-do psql -e "SELECT * FROM public.admin WHERE device = '10.114.1.14'"
[28994] 2018-12-13 19:19:46  info App::Netdisco version 2.039033 loaded.
[28994] 2018-12-13 19:19:46  info psql:  started at Thu Dec 13 20:19:46 2018
  job   |          entered           | started | finished |   device    | port |  action  | subaction  | status | username | userip |         log         | debug |    device_key
--------+----------------------------+---------+----------+-------------+------+----------+------------+--------+----------+--------+---------------------+-------+-------------------
 107769 | 2018-12-12 23:09:03.581714 |         |          | 10.114.1.14 |      | discover | with-nodes | error  |          |        | duplicate of 102815 |       | ec:74:ba:18:f5:7c
 105298 | 2018-12-12 19:08:35.60186  |         |          | 10.114.1.14 |      | discover | with-nodes | error  |          |        | duplicate of 107769 |       | ec:74:ba:18:f5:7c
 102815 | 2018-12-12 15:06:21.367012 |         |          | 10.114.1.14 |      | discover | with-nodes | error  |          |        | duplicate of 107769 |       | ec:74:ba:18:f5:7c
 110232 | 2018-12-13 03:06:24.198368 |         |          | 10.114.1.14 |      | discover | with-nodes | queued |          |        |                     |       | ec:74:ba:18:f5:7c
 112708 | 2018-12-13 07:08:44.263661 |         |          | 10.114.1.14 |      | discover | with-nodes | queued |          |        |                     |       | ec:74:ba:18:f5:7c
 115186 | 2018-12-13 11:13:28.062473 |         |          | 10.114.1.14 |      | discover | with-nodes | queued |          |        |                     |       | ec:74:ba:18:f5:7c
 117659 | 2018-12-13 15:11:36.075763 |         |          | 10.114.1.14 |      | discover | with-nodes | queued |          |        |                     |       | ec:74:ba:18:f5:7c
 120134 | 2018-12-13 19:12:34.120261 |         |          | 10.114.1.14 |      | discover | with-nodes | queued |          |        |                     |       | ec:74:ba:18:f5:7c

device is not known:

netdisco-do psql -e "SELECT ip,creation,last_discover FROM public.device WHERE ip = '10.114.1.14'"
[29035] 2018-12-13 19:23:04  info App::Netdisco version 2.039033 loaded.
[29035] 2018-12-13 19:23:04  info psql:  started at Thu Dec 13 20:23:04 2018
 ip | creation | last_discover
----+----------+---------------
(0 rows)

netdisco-do psql -e "SELECT * FROM public.community WHERE ip = '10.114.1.14'"
[29723] 2018-12-13 19:54:32  info App::Netdisco version 2.039033 loaded.
[29723] 2018-12-13 19:54:32  info psql:  started at Thu Dec 13 20:54:32 2018
 ip | snmp_comm_rw | snmp_auth_tag_read | snmp_auth_tag_write
----+--------------+--------------------+---------------------
(0 rows)

device has reached max_deferrals but discover jobs are still being enqueued even after last_defer timestamp:

netdisco-do psql -e "SELECT * FROM public.device_skip WHERE device = '10.114.1.14'"
[29051] 2018-12-13 19:25:02  info App::Netdisco version 2.039033 loaded.
[29051] 2018-12-13 19:25:02  info psql:  started at Thu Dec 13 20:25:02 2018
       backend       |   device    | actionset | deferrals |         last_defer
---------------------+-------------+-----------+-----------+----------------------------
 xxx.aquafin.be | 10.114.1.14 | {}        |        10 | 2018-12-13 03:13:51.780795
(1 row)

this also has some strange impact on the pollerperformance report.
while the jobs were enqueued i ran the query thats behind pollerperformance:

netdisco-do psql -e"SELECT action, entered, to_char( entered, 'YYYY-MM-DD HH24:MI:SS' ) AS entered_stamp, COUNT( device ) AS number, MIN( started ) AS start, MAX( finished ) AS end, justify_interval( extract ( epoch FROM( max( finished ) - min( started ) ) ) * interval '1 second' ) AS elapsed FROM admin WHERE action IN ( 'discover', 'macsuck', 'arpnip', 'nbtstat' ) GROUP BY action, entered HAVING count( device ) > 1 ORDER BY entered DESC, elapsed DESC LIMIT 30"
[29155] 2018-12-13 19:33:15  info App::Netdisco version 2.039033 loaded.
[29155] 2018-12-13 19:33:15  info psql:  started at Thu Dec 13 20:33:15 2018
  action  |          entered           |    entered_stamp    | number |        start        |         end         | elapsed
----------+----------------------------+---------------------+--------+---------------------+---------------------+----------
 arpnip   | 2018-12-13 20:05:30.887272 | 2018-12-13 20:05:30 |    687 | 2018-12-13 20:05:31 | 2018-12-13 20:08:28 | 00:02:57
 discover | 2018-12-13 19:05:21.272108 | 2018-12-13 19:05:21 |    895 | 2018-12-13 19:05:22 | 2018-12-13 19:14:27 | 00:09:05
 macsuck  | 2018-12-13 16:55:20.579174 | 2018-12-13 16:55:20 |    868 | 2018-12-13 16:55:21 | 2018-12-13 17:11:28 | 00:16:07
 arpnip   | 2018-12-13 16:05:14.56144  | 2018-12-13 16:05:14 |    687 | 2018-12-13 16:05:15 | 2018-12-13 16:07:18 | 00:02:03
 discover | 2018-12-13 15:05:37.064871 | 2018-12-13 15:05:37 |    897 | 2018-12-13 15:05:37 | 2018-12-13 15:14:48 | 00:09:11
 macsuck  | 2018-12-13 12:55:20.35119  | 2018-12-13 12:55:20 |    868 | 2018-12-13 12:55:21 | 2018-12-13 13:12:31 | 00:17:10
 arpnip   | 2018-12-13 12:05:12.058855 | 2018-12-13 12:05:12 |    687 | 2018-12-13 12:05:13 | 2018-12-13 12:07:15 | 00:02:02
 discover | 2018-12-13 11:05:34.687823 | 2018-12-13 11:05:34 |    896 | 2018-12-13 11:05:35 | 2018-12-13 11:15:21 | 00:09:46
 macsuck  | 2018-12-13 08:55:44.40293  | 2018-12-13 08:55:44 |    868 | 2018-12-13 08:55:45 | 2018-12-13 09:12:18 | 00:16:33
 arpnip   | 2018-12-13 08:05:30.711681 | 2018-12-13 08:05:30 |    687 | 2018-12-13 08:05:31 | 2018-12-13 08:07:23 | 00:01:52
 discover | 2018-12-13 07:05:22.121741 | 2018-12-13 07:05:22 |    896 | 2018-12-13 07:05:23 | 2018-12-13 07:13:43 | 00:08:20
 macsuck  | 2018-12-13 04:55:08.392145 | 2018-12-13 04:55:08 |    867 | 2018-12-13 04:55:09 | 2018-12-13 05:11:49 | 00:16:40
 arpnip   | 2018-12-13 04:05:11.035508 | 2018-12-13 04:05:11 |    686 | 2018-12-13 04:05:11 | 2018-12-13 04:07:13 | 00:02:02
 discover | 2018-12-13 03:05:18.058139 | 2018-12-13 03:05:18 |    896 | 2018-12-13 03:05:18 | 2018-12-13 03:13:47 | 00:08:29
 macsuck  | 2018-12-13 00:55:45.835736 | 2018-12-13 00:55:45 |    866 | 2018-12-13 00:55:46 | 2018-12-13 01:13:09 | 00:17:23
 arpnip   | 2018-12-13 00:05:37.024234 | 2018-12-13 00:05:37 |    685 | 2018-12-13 00:05:37 | 2018-12-13 00:07:28 | 00:01:51
 discover | 2018-12-12 23:05:04.765919 | 2018-12-12 23:05:04 |    895 | 2018-12-12 23:05:05 | 2018-12-13 01:26:18 | 02:21:13
 macsuck  | 2018-12-12 20:55:45.68859  | 2018-12-12 20:55:45 |    868 | 2018-12-12 20:55:46 | 2018-12-13 01:25:33 | 04:29:47
 arpnip   | 2018-12-12 20:05:04.857937 | 2018-12-12 20:05:04 |    687 | 2018-12-12 20:05:05 | 2018-12-13 01:25:33 | 05:20:28
 discover | 2018-12-12 19:05:28.154389 | 2018-12-12 19:05:28 |    896 | 2018-12-12 19:05:29 | 2018-12-12 19:13:57 | 00:08:28
 macsuck  | 2018-12-12 16:55:39.898355 | 2018-12-12 16:55:39 |    869 | 2018-12-12 16:55:40 | 2018-12-13 01:25:33 | 08:29:53
 arpnip   | 2018-12-12 16:05:27.293894 | 2018-12-12 16:05:27 |    688 | 2018-12-12 16:05:28 | 2018-12-13 01:25:32 | 09:20:04
 discover | 2018-12-12 15:05:34.83978  | 2018-12-12 15:05:34 |    902 | 2018-12-12 15:05:35 | 2018-12-13 01:25:35 | 10:20:00
 macsuck  | 2018-12-12 12:55:30.418396 | 2018-12-12 12:55:30 |    874 | 2018-12-12 12:55:31 | 2018-12-12 13:12:03 | 00:16:32
 arpnip   | 2018-12-12 12:05:30.642958 | 2018-12-12 12:05:30 |    693 | 2018-12-12 12:05:31 | 2018-12-12 12:07:28 | 00:01:57
 discover | 2018-12-12 11:05:26.545956 | 2018-12-12 11:05:26 |    901 | 2018-12-12 11:05:27 | 2018-12-12 11:14:22 | 00:08:55

there are a few with 8+ hours, but thats mostly related to what will happen now.

i restart the netdisco backend:

netdisco-backend stop
Netdisco Backend                                              [Stopped]

netdisco-backend start
Netdisco Backend                                              [Started]

wait 10 seconds, then run the same query again:

netdisco-do psql -e"SELECT action, entered, to_char( entered, 'YYYY-MM-DD HH24:MI:SS' ) AS entered_stamp, COUNT( device ) AS number, MIN( started ) AS start, MAX( finished ) AS end, justify_interval( extract ( epoch FROM( max( finished ) - min( started ) ) ) * interval '1 second' ) AS elapsed FROM admin WHERE action IN ( 'discover', 'macsuck', 'arpnip', 'nbtstat' ) GROUP BY action, entered HAVING count( device ) > 1 ORDER BY entered DESC, elapsed DESC LIMIT 30"
[29461] 2018-12-13 19:35:07  info App::Netdisco version 2.039033 loaded.
[29461] 2018-12-13 19:35:07  info psql:  started at Thu Dec 13 20:35:07 2018
  action  |          entered           |    entered_stamp    | number |        start        |         end         |    elapsed
----------+----------------------------+---------------------+--------+---------------------+---------------------+----------------
 arpnip   | 2018-12-13 20:05:30.887272 | 2018-12-13 20:05:30 |    687 | 2018-12-13 20:05:31 | 2018-12-13 20:08:28 | 00:02:57
 discover | 2018-12-13 19:05:21.272108 | 2018-12-13 19:05:21 |    895 | 2018-12-13 19:05:22 | 2018-12-13 19:14:27 | 00:09:05
 macsuck  | 2018-12-13 16:55:20.579174 | 2018-12-13 16:55:20 |    868 | 2018-12-13 16:55:21 | 2018-12-13 20:34:59 | 03:39:38
 arpnip   | 2018-12-13 16:05:14.56144  | 2018-12-13 16:05:14 |    687 | 2018-12-13 16:05:15 | 2018-12-13 16:07:18 | 00:02:03
 discover | 2018-12-13 15:05:37.064871 | 2018-12-13 15:05:37 |    897 | 2018-12-13 15:05:37 | 2018-12-13 20:35:07 | 05:29:30
 macsuck  | 2018-12-13 12:55:20.35119  | 2018-12-13 12:55:20 |    868 | 2018-12-13 12:55:21 | 2018-12-13 13:12:31 | 00:17:10
 arpnip   | 2018-12-13 12:05:12.058855 | 2018-12-13 12:05:12 |    687 | 2018-12-13 12:05:13 | 2018-12-13 20:34:58 | 08:29:45
 discover | 2018-12-13 11:05:34.687823 | 2018-12-13 11:05:34 |    896 | 2018-12-13 11:05:35 | 2018-12-13 11:15:21 | 00:09:46
 macsuck  | 2018-12-13 08:55:44.40293  | 2018-12-13 08:55:44 |    868 | 2018-12-13 08:55:45 | 2018-12-13 09:12:18 | 00:16:33
 arpnip   | 2018-12-13 08:05:30.711681 | 2018-12-13 08:05:30 |    687 | 2018-12-13 08:05:31 | 2018-12-13 08:07:23 | 00:01:52
 discover | 2018-12-13 07:05:22.121741 | 2018-12-13 07:05:22 |    896 | 2018-12-13 07:05:23 | 2018-12-13 07:13:43 | 00:08:20
 macsuck  | 2018-12-13 04:55:08.392145 | 2018-12-13 04:55:08 |    867 | 2018-12-13 04:55:09 | 2018-12-13 05:11:49 | 00:16:40
 arpnip   | 2018-12-13 04:05:11.035508 | 2018-12-13 04:05:11 |    686 | 2018-12-13 04:05:11 | 2018-12-13 04:07:13 | 00:02:02
 discover | 2018-12-13 03:05:18.058139 | 2018-12-13 03:05:18 |    896 | 2018-12-13 03:05:18 | 2018-12-13 03:13:47 | 00:08:29
 macsuck  | 2018-12-13 00:55:45.835736 | 2018-12-13 00:55:45 |    866 | 2018-12-13 00:55:46 | 2018-12-13 01:13:09 | 00:17:23
 arpnip   | 2018-12-13 00:05:37.024234 | 2018-12-13 00:05:37 |    685 | 2018-12-13 00:05:37 | 2018-12-13 00:07:28 | 00:01:51
 discover | 2018-12-12 23:05:04.765919 | 2018-12-12 23:05:04 |    895 | 2018-12-12 23:05:05 | 2018-12-13 01:26:18 | 02:21:13
 macsuck  | 2018-12-12 20:55:45.68859  | 2018-12-12 20:55:45 |    868 | 2018-12-12 20:55:46 | 2018-12-13 01:25:33 | 04:29:47
 arpnip   | 2018-12-12 20:05:04.857937 | 2018-12-12 20:05:04 |    687 | 2018-12-12 20:05:05 | 2018-12-13 20:34:55 | 1 day 00:29:50
 discover | 2018-12-12 19:05:28.154389 | 2018-12-12 19:05:28 |    896 | 2018-12-12 19:05:29 | 2018-12-13 20:34:59 | 1 day 01:29:30
 macsuck  | 2018-12-12 16:55:39.898355 | 2018-12-12 16:55:39 |    869 | 2018-12-12 16:55:40 | 2018-12-13 01:25:33 | 08:29:53
 arpnip   | 2018-12-12 16:05:27.293894 | 2018-12-12 16:05:27 |    688 | 2018-12-12 16:05:28 | 2018-12-13 01:25:32 | 09:20:04
 discover | 2018-12-12 15:05:34.83978  | 2018-12-12 15:05:34 |    902 | 2018-12-12 15:05:35 | 2018-12-13 20:33:48 | 1 day 05:28:13
 macsuck  | 2018-12-12 12:55:30.418396 | 2018-12-12 12:55:30 |    874 | 2018-12-12 12:55:31 | 2018-12-13 20:33:44 | 1 day 07:38:13
 arpnip   | 2018-12-12 12:05:30.642958 | 2018-12-12 12:05:30 |    693 | 2018-12-12 12:05:31 | 2018-12-13 20:33:45 | 1 day 08:28:14
 discover | 2018-12-12 11:05:26.545956 | 2018-12-12 11:05:26 |    901 | 2018-12-12 11:05:27 | 2018-12-12 11:14:22 | 00:08:55
(26 rows)

now all of the sudden several of my jobs took almost 2 days....

the jobqueue however got reset to error for all the queued jobs:

 netdisco-do psql -e "SELECT * FROM public.admin WHERE device = '10.114.1.14'"
[29480] 2018-12-13 19:36:09  info App::Netdisco version 2.039033 loaded.
[29480] 2018-12-13 19:36:09  info psql:  started at Thu Dec 13 20:36:09 2018
  job   |          entered           | started | finished |   device    | port |  action  | subaction  | status | username | userip |         log         | debug |    device_key
--------+----------------------------+---------+----------+-------------+------+----------+------------+--------+----------+--------+---------------------+-------+-------------------
 107769 | 2018-12-12 23:09:03.581714 |         |          | 10.114.1.14 |      | discover | with-nodes | error  |          |        | duplicate of 102815 |       | ec:74:ba:18:f5:7c
 105298 | 2018-12-12 19:08:35.60186  |         |          | 10.114.1.14 |      | discover | with-nodes | error  |          |        | duplicate of 107769 |       | ec:74:ba:18:f5:7c
 102815 | 2018-12-12 15:06:21.367012 |         |          | 10.114.1.14 |      | discover | with-nodes | error  |          |        | duplicate of 107769 |       | ec:74:ba:18:f5:7c
 112708 | 2018-12-13 07:08:44.263661 |         |          | 10.114.1.14 |      | discover | with-nodes | error  |          |        | duplicate of 110232 |       | ec:74:ba:18:f5:7c
 115186 | 2018-12-13 11:13:28.062473 |         |          | 10.114.1.14 |      | discover | with-nodes | error  |          |        | duplicate of 110232 |       | ec:74:ba:18:f5:7c
 120134 | 2018-12-13 19:12:34.120261 |         |          | 10.114.1.14 |      | discover | with-nodes | error  |          |        | duplicate of 110232 |       | ec:74:ba:18:f5:7c
 117659 | 2018-12-13 15:11:36.075763 |         |          | 10.114.1.14 |      | discover | with-nodes | error  |          |        | duplicate of 110232 |       | ec:74:ba:18:f5:7c
 110232 | 2018-12-13 03:06:24.198368 |         |          | 10.114.1.14 |      | discover | with-nodes | error  |          |        | duplicate of 120134 |       | ec:74:ba:18:f5:7c

and since the backend was restarted the defer count was also decremented by 1:

netdisco-do psql -e "SELECT * FROM public.device_skip WHERE device = '10.114.1.14'"
[29600] 2018-12-13 19:41:55  info App::Netdisco version 2.039033 loaded.
[29600] 2018-12-13 19:41:55  info psql:  started at Thu Dec 13 20:41:55 2018
       backend       |   device    | actionset | deferrals |         last_defer
---------------------+-------------+-----------+-----------+----------------------------
xxx.aquafin.be | 10.114.1.14 | {}        |         9 | 2018-12-13 03:13:51.780795

the device we cannot connect to was discover via lldp:

netdisco-do psql -e "SELECT * FROM public.node WHERE mac = 'ec:74:ba:18:f5:7c'"
[29617] 2018-12-13 19:44:03  info App::Netdisco version 2.039033 loaded.
[29617] 2018-12-13 19:44:03  info psql:  started at Thu Dec 13 20:44:03 2018
        mac        |    switch     | port | vlan | active |   oui    |         time_first         |        time_recent         |         time_last
-------------------+---------------+------+------+--------+----------+----------------------------+----------------------------+----------------------------
 ec:74:ba:18:f5:7c | 10.114.100.15 | fa19 | 1    | t      | ec:74:ba | 2018-12-09 00:56:07.623889 | 2018-12-09 00:56:07.623889 | 2018-12-13 16:57:16.737285

perhaps relevant parts of my config:

schedule:
  discoverall:
    when: '5 3,7,11,15,19,23 * * *'
  macwalk:
    when:
      min: 55
      hour: '*/4'
  arpwalk:
    when:
      min: 5
      hour: '*/4'
  nbtwalk: null
  expire:
    when: '30 23 * * *'

workers:
  tasks: 12
  timeout: 600

i've tried deleting the entire jobqueue (from the webinterface) or select queued jobs but that did not help.

Context

these jobs seem to be only dequeued after a backend restart. seems very similar too #398 but is not exactly the same.

Your Environment

Netdisco version used: 2.39.33
SNMP::Info version used: 3.63
postgres 10.00.5
perl 5.26.1
db schema 53

The text was updated successfully, but these errors were encountered:

ollyg · 2018-12-28T17:20:28Z

Hi @inphobia

Ok, thanks for the report. I'm going to put to one side the poller performance report, which may well have a bug in it.

What you're seeing seems standard behaviour. Regardless of what's in the job queue, you should pay close attention to what jobs are actually selected and run, and that part isn't clear from your report. It's fine for jobs to be in the queue, and due to Netdisco's design as a distributed computing application, parts of the code will certainly queue devices without checking deferrals.

(note to self: this could be made smarter by the queueing instance checking its own name against the deferrals backend name)

Anyway, stuff gets queued, and it's in various states, but then the poller instance picking jobs will skip over those rows if max_deferrals is reached. Please let me know if you have evidence of jobs being picked and run where max_deferrals should have stopped them.

So in summary - just because something's in the queue doesn't mean it'll run. The distributed design of Netdisco means the job queue is always messy. The intelligence is in the code which picks jobs to run, and hopefully it gets it right!

https://github.com/netdisco/netdisco/blob/master/lib/App/Netdisco/DB/Result/Virtual/TastyJobs.pm#L13

cheers
Oliver.

inphobia · 2018-12-29T15:03:10Z

like:

and

now it could be that this device has been entered to be discovered via cli, but the admin table has been cleared since, also all jobs have been deleted a few times via the web job queue.

there is 1 strange thing i noticed, but i'm not sure if it is relevant. when i was looking into the tastyjobs query with pgadmin i noticed the query was cut off:

SELECT me.job, me.entered, me.started, me.finished, me.device, me.port, me.action, me.subaction, me.status, me.username, me.userip, me.log, me.debug, me.device_key, me.job_priority FROM (  WITH my_jobs AS
    (SELECT admin.* FROM admin
       LEFT OUTER JOIN device_skip ds
         ON (ds.backend = $1 AND admin.device = ds.device
             AND admin.action = ANY (ds.actionset))
      WHERE admin.status = 'queued'
        AND ds.device IS NULL)

  SELECT my_jobs.*,
         CASE WHEN ( (my_jobs.username IS NOT NULL AND ((ds.deferrals IS NULL AND ds.last_defer IS NULL)
                                                        OR my_jobs.entered > ds.last_defer))
                    OR (my_jobs.action = ANY (string_to_array(btrim($2, '{"}'), '","'))) )
              THEN 100
              ELSE 0
          END AS job_priority
    FROM my_jobs

    LEFT OUTER JOIN device_skip ds
      ON (ds.backend = $3 AND ds.device = my_jobs.device)

   WHERE ds.deferrals < $4
      OR (my_jobs.username IS NOT NULL AND (ds.la

the strange thing here is that it's an exact 1024 characters before it is getting cut off. so either this is a limit in pgadmin (also checked with psql cli, seems to be cut of there as well), in postgres itself (should be able to take much larger queries) or in one of the perl database modules, or i'm just on a wrong track.

inphobia changed the title ~~discover jobs are never unqueued for non discoverable devices~~ discover jobs are never dequeued or skipped for non discoverable devices Dec 13, 2018

inphobia mentioned this issue Dec 14, 2018

make job queue list in webinterface configurable #469

Merged

ollyg closed this as completed Dec 28, 2018

inphobia mentioned this issue Oct 29, 2019

db error during upgrade to 2.044003 #668

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

discover jobs are never dequeued or skipped for non discoverable devices #466

discover jobs are never dequeued or skipped for non discoverable devices #466

inphobia commented Dec 13, 2018

ollyg commented Dec 28, 2018

inphobia commented Dec 29, 2018

discover jobs are never dequeued or skipped for non discoverable devices #466

discover jobs are never dequeued or skipped for non discoverable devices #466

Comments

inphobia commented Dec 13, 2018

Expected Behavior

Current Behavior

Possible Solution

Steps to Reproduce (for bugs)

Context

Your Environment

ollyg commented Dec 28, 2018

inphobia commented Dec 29, 2018