Netdisco attempts SNMP arpnip on all Pseudo devices which slows down job queue #561

Closed
inphobia opened this issue Apr 4, 2019 · 12 comments

@inphobia
Member

inphobia commented Apr 4, 2019

arpnip is run & remains queued for pseudo devices, most likely due to 4ef5691.

iirc the reasoning was that devices that don't do snmp at all can still run sshcollector, but for real pseudo devices (created via the web interface but not accessible) this will cause arpnip to be queued for them.

Expected Behavior

don't run any workers on pseudo devices.

Current Behavior

arpwalk seems to queue arpnip for pseudo devices.

Possible Solution

workaround is to add these devices to the "devices_no" setting.

Steps to Reproduce (for bugs)

  1. create pseudodevice via webinterface
  2. run arpnip or wait for the arpwalk schedule to trigger
  3. notice that their jobs won't be dequeued:
netdisco-do psql -e "copy (SELECT * FROM public.admin where device <<= '169.254.0.0/16') TO STDOUT WITH CSV HEADER"
[26342] 2019-04-04 20:38:18  info App::Netdisco version 2.042005 loaded.
[26342] 2019-04-04 20:38:18  info psql:  started at Thu Apr  4 22:38:18 2019
job,entered,started,finished,device,port,action,subaction,status,username,userip,log,debug,device_key
30732,2019-04-04 22:21:15.319336,,,169.254.8.8,,arpnip,,queued,testdisc,,,,
30374,2019-04-04 00:10:19.119934,,,169.254.8.8,,arpnip,,error,,,duplicate of 30732,,
30369,2019-04-04 00:10:19.119934,,,169.254.9.9,,arpnip,,error,,,duplicate of 30404,,
30404,2019-04-04 01:16:33.009236,,,169.254.9.9,,arpnip,,error,testdisc,,duplicate of 30369,,
30423,2019-04-04 04:10:19.89362,,,169.254.9.9,,arpnip,,queued,,,,,
30370,2019-04-04 00:10:19.119934,,,169.254.10.10,,arpnip,,queued,,,,,
[26342] 2019-04-04 20:38:18  info psql: finished at Thu Apr  4 22:38:18 2019
[26342] 2019-04-04 20:38:18  info psql: status done: psql session closed.

(tested with arpwalk in webinterface, netdisco-do arpwalk & netdisco-do arpnip --enqueue)
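
A minimal sketch of how the stuck jobs in the CSV output above could be summarised. It assumes the log lines have been stripped and only the CSV header and rows remain; the function name `queued_jobs_by_device` is made up for illustration and is not part of Netdisco.

```python
import csv
import io
from collections import defaultdict

# Rows copied from the psql output above (info log lines removed).
ADMIN_CSV = """\
job,entered,started,finished,device,port,action,subaction,status,username,userip,log,debug,device_key
30732,2019-04-04 22:21:15.319336,,,169.254.8.8,,arpnip,,queued,testdisc,,,,
30374,2019-04-04 00:10:19.119934,,,169.254.8.8,,arpnip,,error,,,duplicate of 30732,,
30369,2019-04-04 00:10:19.119934,,,169.254.9.9,,arpnip,,error,,,duplicate of 30404,,
30404,2019-04-04 01:16:33.009236,,,169.254.9.9,,arpnip,,error,testdisc,,duplicate of 30369,,
30423,2019-04-04 04:10:19.89362,,,169.254.9.9,,arpnip,,queued,,,,,
30370,2019-04-04 00:10:19.119934,,,169.254.10.10,,arpnip,,queued,,,,,
"""


def queued_jobs_by_device(csv_text):
    """Map device IP -> job ids that are 'queued' and were never started."""
    stuck = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["status"] == "queued" and not row["started"]:
            stuck[row["device"]].append(int(row["job"]))
    return dict(stuck)


if __name__ == "__main__":
    for device, jobs in queued_jobs_by_device(ADMIN_CSV).items():
        print(device, jobs)
```

Each of the three pseudo devices ends up with exactly one job that stays queued; the remaining rows are the "duplicate of" errors.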

Context

we use pseudo devices to mock up topology for a mpls network.

Your Environment

  • Netdisco version used: 2.042005
  • SNMP::Info version used: 3.66
@ollyg
Member

ollyg commented Apr 5, 2019

Very sorry, cannot reproduce this!

Note that you do have a misunderstanding: "arpwalk seems to queue arpnip for pseudo devices"

This is correct! A pseudo device is needed for sshcollector targets if the target cannot run an SNMP discover.

@ollyg ollyg closed this as completed Apr 5, 2019
@inphobia
Member Author

background: we use pseudodevs to mock up a topology over parts of our network where we can't access all data (managed mpls, private cellular for example).

now, i did find an easy way to reproduce the issue:

i added a pseudo device through the webui. i gave it ip 169.254.33.33 & 3 ports, and did not create any manual topo or do anything other than add the device.

  • device created:
netdisco-do psql -e "copy (SELECT * FROM public.device where ip <<= '169.254.33.0/24') TO STDOUT WITH CSV HEADER"
[22632] 2019-04-29 18:28:09  info App::Netdisco version 2.042007 loaded.
[22632] 2019-04-29 18:28:09  info psql:  started at Mon Apr 29 20:28:09 2019
ip,creation,dns,description,uptime,contact,name,location,layers,ports,mac,serial,model,ps1_type,ps2_type,ps1_status,ps2_status,fan,slots,vendor,os,os_ver,log,snmp_ver,snmp_comm,snmp_class,vtp_domain,last_discover,last_macsuck,last_arpnip
169.254.33.33,2019-04-29 20:27:50.671621,ps4,,,,,,00000100,,,,,,,,,,,netdisco,,,,,,,,2019-04-29 20:27:50.671621,,
[22632] 2019-04-29 18:28:09  info psql: finished at Mon Apr 29 20:28:09 2019
[22632] 2019-04-29 18:28:09  info psql: status done: psql session closed.
  • not in skip table:
netdisco-do psql -e "copy (SELECT * FROM public.device_skip where device <<= '169.254.33.0/24') TO STDOUT WITH CSV HEADER"
[22657] 2019-04-29 18:28:59  info App::Netdisco version 2.042007 loaded.
[22657] 2019-04-29 18:28:59  info psql:  started at Mon Apr 29 20:28:59 2019
backend,device,actionset,deferrals,last_defer
[22657] 2019-04-29 18:28:59  info psql: finished at Mon Apr 29 20:28:59 2019
[22657] 2019-04-29 18:28:59  info psql: status done: psql session closed.
  • no jobs known:
netdisco-do psql -e "copy (SELECT * FROM public.admin where device <<= '169.254.33.0/24') TO STDOUT WITH CSV HEADER"
[22640] 2019-04-29 18:28:31  info App::Netdisco version 2.042007 loaded.
[22640] 2019-04-29 18:28:31  info psql:  started at Mon Apr 29 20:28:31 2019
job,entered,started,finished,device,port,action,subaction,status,username,userip,log,debug,device_key
[22640] 2019-04-29 18:28:31  info psql: finished at Mon Apr 29 20:28:31 2019
[22640] 2019-04-29 18:28:31  info psql: status done: psql session closed.

then run a netdisco-do arpwalk

  • device table (now with last_arpnip date)
netdisco-do psql -e "copy (SELECT * FROM public.device where ip <<= '169.254.33.0/24') TO STDOUT WITH CSV HEADER"
[23103] 2019-04-29 18:36:38  info App::Netdisco version 2.042007 loaded.
[23103] 2019-04-29 18:36:38  info psql:  started at Mon Apr 29 20:36:38 2019
ip,creation,dns,description,uptime,contact,name,location,layers,ports,mac,serial,model,ps1_type,ps2_type,ps1_status,ps2_status,fan,slots,vendor,os,os_ver,log,snmp_ver,snmp_comm,snmp_class,vtp_domain,last_discover,last_macsuck,last_arpnip
169.254.33.33,2019-04-29 20:27:50.671621,ps4,,,,,,00000100,,,,,,,,,,,netdisco,,,,,,,,2019-04-29 20:27:50.671621,,2019-04-29 20:36:02.135988
[23103] 2019-04-29 18:36:38  info psql: finished at Mon Apr 29 20:36:38 2019
[23103] 2019-04-29 18:36:38  info psql: status done: psql session closed.
  • now in skip table
netdisco-do psql -e "copy (SELECT * FROM public.device_skip where device <<= '169.254.33.0/24') TO STDOUT WITH CSV HEADER"
[23098] 2019-04-29 18:36:34  info App::Netdisco version 2.042007 loaded.
[23098] 2019-04-29 18:36:34  info psql:  started at Mon Apr 29 20:36:34 2019
backend,device,actionset,deferrals,last_defer
linux002.aquafin.be,169.254.33.33,{},10,2019-04-29 20:36:02.141656
[23098] 2019-04-29 18:36:34  info psql: finished at Mon Apr 29 20:36:34 2019
[23098] 2019-04-29 18:36:34  info psql: status done: psql session closed.
  • and had jobs queued for it
netdisco-do psql -e "copy (SELECT * FROM public.admin where device <<= '169.254.33.0/24') TO STDOUT WITH CSV HEADER"
[23092] 2019-04-29 18:36:29  info App::Netdisco version 2.042007 loaded.
[23092] 2019-04-29 18:36:29  info psql:  started at Mon Apr 29 20:36:29 2019
job,entered,started,finished,device,port,action,subaction,status,username,userip,log,debug,device_key
41168,2019-04-29 20:31:51.971153,,,169.254.33.33,,arpnip,,queued,testdisc,,,,
[23092] 2019-04-29 18:36:29  info psql: finished at Mon Apr 29 20:36:29 2019
[23092] 2019-04-29 18:36:29  info psql: status done: psql session closed.

important tidbits from our config

devices_only:
  - 169.254.0.0/16

(and some other devices & networks, but those don't matter here)

we have no device_auth stanzas specifically for these devices, and only 1 stanza for ssh collecting (i'll show that one, uses pubkey authentication)

  - tag: 'sshcollector'
    driver: cli
    platform: NXOS
    username: yeahnotreally
    only: 'model:N9KC9332PQ'

running arpnip manually shows netdisco tries to connect to the pseudodevice via snmp:

netdisco-do arpnip -d 169.254.33.33 -e specify -DI
[23227] 2019-04-29 18:40:41  info App::Netdisco version 2.042007 loaded.
[23227] 2019-04-29 18:40:42  info arpnip: [169.254.33.33] started at Mon Apr 29 20:40:42 2019
[23227] 2019-04-29 18:40:42 debug arpnip: running with timeout 600s
[23227] 2019-04-29 18:40:42 debug => running workers for phase: check
[23227] 2019-04-29 18:40:42 debug -> run worker check/_base_/0
[23227] 2019-04-29 18:40:42 debug arpnip is able to run
[23227] 2019-04-29 18:40:42 debug => running workers for phase: main
[23227] 2019-04-29 18:40:42 debug -> run worker main/nodes/200
[23227] 2019-04-29 18:40:42 debug skip: driver or action not applicable
[23227] 2019-04-29 18:40:42 debug -> run worker main/nodes/100
[23227] 2019-04-29 18:40:42 debug snmp reader cache warm: [169.254.33.33]
[23227] 2019-04-29 18:40:42 debug [169.254.33.33:161] try_connect with ver: 2, class: SNMP::Info, comm: <hidden>
SNMP::Info::_global uptime : DISMAN-EVENT-MIB::sysUpTimeInstance : .1.3.6.1.2.1.1.3.0
SNMP::Info::_global(uptime) Timeout at /home/testdisc/perl5/lib/perl5/App/Netdisco/Transport/SNMP.pm line 239.
[23227] 2019-04-29 18:40:48 debug [169.254.33.33:161] try_connect with ver: 1, class: SNMP::Info, comm: <hidden>
SNMP::Info::_global uptime : DISMAN-EVENT-MIB::sysUpTimeInstance : .1.3.6.1.2.1.1.3.0
SNMP::Info::_global(uptime) Timeout at /home/testdisc/perl5/lib/perl5/App/Netdisco/Transport/SNMP.pm line 239.
[23227] 2019-04-29 18:40:54 debug [169.254.33.33:161] try_connect with ver: 2, class: SNMP::Info, comm: <hidden>
SNMP::Info::_global uptime : DISMAN-EVENT-MIB::sysUpTimeInstance : .1.3.6.1.2.1.1.3.0
SNMP::Info::_global(uptime) Timeout at /home/testdisc/perl5/lib/perl5/App/Netdisco/Transport/SNMP.pm line 239.
[23227] 2019-04-29 18:41:00 debug [169.254.33.33:161] try_connect with ver: 1, class: SNMP::Info, comm: <hidden>
SNMP::Info::_global uptime : DISMAN-EVENT-MIB::sysUpTimeInstance : .1.3.6.1.2.1.1.3.0
SNMP::Info::_global(uptime) Timeout at /home/testdisc/perl5/lib/perl5/App/Netdisco/Transport/SNMP.pm line 239.
[23227] 2019-04-29 18:41:06 debug arpnip failed: could not SNMP connect to 169.254.33.33
[23227] 2019-04-29 18:41:06 debug -> run worker main/subnets/100
[23227] 2019-04-29 18:41:06 debug arpnip failed: could not SNMP connect to 169.254.33.33
[23227] 2019-04-29 18:41:06 debug => running workers for phase: store
[23227] 2019-04-29 18:41:06 debug -> run worker store/nodes/0
[23227] 2019-04-29 18:41:06 debug  [169.254.33.33] arpnip - processed 0 ARP Cache entries
[23227] 2019-04-29 18:41:06 debug  [169.254.33.33] arpnip - processed 0 IPv6 Neighbor Cache entries
[23227] 2019-04-29 18:41:06 debug Ended arpnip for 169.254.33.33
[23227] 2019-04-29 18:41:06  info arpnip: finished at Mon Apr 29 20:41:06 2019
[23227] 2019-04-29 18:41:06  info arpnip: status defer: arpnip failed: could not SNMP connect to 169.254.33.33
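
To put a number on the cost, here is a small sketch that measures the gaps between the `try_connect` lines in the log above. The roughly 6 seconds per attempt is simply what this log shows, not a documented default, and the helper names are invented for the example.

```python
from datetime import datetime

# Lines copied from the debug log above.
LOG_LINES = [
    "[23227] 2019-04-29 18:40:42 debug [169.254.33.33:161] try_connect with ver: 2, class: SNMP::Info, comm: <hidden>",
    "[23227] 2019-04-29 18:40:48 debug [169.254.33.33:161] try_connect with ver: 1, class: SNMP::Info, comm: <hidden>",
    "[23227] 2019-04-29 18:40:54 debug [169.254.33.33:161] try_connect with ver: 2, class: SNMP::Info, comm: <hidden>",
    "[23227] 2019-04-29 18:41:00 debug [169.254.33.33:161] try_connect with ver: 1, class: SNMP::Info, comm: <hidden>",
    "[23227] 2019-04-29 18:41:06 debug arpnip failed: could not SNMP connect to 169.254.33.33",
]


def log_time(line):
    """Parse the '[pid] YYYY-MM-DD HH:MM:SS level ...' prefix of a log line."""
    parts = line.split()
    return datetime.strptime(parts[1] + " " + parts[2], "%Y-%m-%d %H:%M:%S")


def seconds_between(lines):
    """Seconds elapsed between each pair of consecutive log lines."""
    times = [log_time(line) for line in lines]
    return [int((later - earlier).total_seconds())
            for earlier, later in zip(times, times[1:])]


if __name__ == "__main__":
    gaps = seconds_between(LOG_LINES)
    print(gaps, "total:", sum(gaps), "seconds")
```

Four timed-out attempts add up to 24 seconds of worker time for a single pseudo device, and that repeats on every arpwalk.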

there are a few solutions i can think of, from easiest to hardest:

  • easiest for me is to just remove 169.254.0.0/16 from devices_only, not really a reason to have it there.
  • always refuse to run snmp actions on pseudo devices
  • introduce a noop driver, so devices/worker steps can be skipped without errors.
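
The second option could look roughly like the sketch below. It is only an illustration in Python, not Netdisco's Perl code: `usable_stanzas` and the stanza dictionaries are hypothetical, and the pseudo-device test relies on the vendor column being "netdisco", as in the device row shown earlier.

```python
def usable_stanzas(device, stanzas):
    """Return the credential stanzas that may be tried against a device.

    Hypothetical rule: pseudo devices (vendor "netdisco") never get SNMP
    credentials, so an SNMP worker has nothing to try and can skip at once
    instead of waiting for timeouts.
    """
    if device.get("vendor") == "netdisco":
        return [s for s in stanzas if s.get("driver") != "snmp"]
    return list(stanzas)


if __name__ == "__main__":
    # Illustrative stanzas; the tag names are placeholders.
    stanzas = [
        {"tag": "some_snmp_stanza", "driver": "snmp"},
        {"tag": "sshcollector", "driver": "cli"},
    ]
    pseudo = {"ip": "169.254.33.33", "vendor": "netdisco"}
    print(usable_stanzas(pseudo, stanzas))
```

With this rule a cli stanza still applies to a pseudo device, so sshcollector targets keep working while plain mock-up devices are left alone.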

version used: 2.042007, snmp::info 3.68

@inphobia inphobia reopened this Apr 29, 2019
@ollyg
Member

ollyg commented Apr 30, 2019

Thanks - this makes sense. Can you elaborate on why "remove 169.254.0.0/16 from devices_only" will be a solution? I don't follow...

@ollyg
Member

ollyg commented Apr 30, 2019

Also what is the output of netdisco-do dumpconfig -e device_auth -d 169.254.33.33 ?

@ollyg
Member

ollyg commented Apr 30, 2019

Netdisco is trying SNMP because (I suspect) it thinks it has valid SNMP credentials... so the behaviour is, in some way, exactly according to design.

How could it tell the credentials are not to be used? Perhaps as you suggest, an assumption could be made for pseudo devices never to do SNMP. This needs more thought.

@ollyg
Member

ollyg commented Apr 30, 2019

Hang on, is something fishy? Why is this happening:

[23227] 2019-04-29 18:40:42 debug -> run worker main/nodes/200
[23227] 2019-04-29 18:40:42 debug skip: driver or action not applicable

Netdisco gives higher priority to CLI worker than SNMP worker, so even with creds for both, CLI should succeed and then the job is done. Why is CLI skipped and then Netdisco moves to SNMP?

@rc9000
Member

rc9000 commented Apr 30, 2019

As I read this, the device_auth does not apply so SNMP arpnip will be run.

What worker is run is now purely based on layers also for pseudo devices, right? So if @inphobia your goal is to have no job run, setting the layers attribute to all zeros should do it I assume.
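
For reference, the layers value in the device row earlier in this thread is 00000100. The sketch below decodes such a string, assuming the usual convention that the rightmost character stands for layer 1; the function names are made up for the example.

```python
def layers_set(layers):
    """Return the layer numbers flagged in a bit string such as '00000100'.

    Assumption: the rightmost character is layer 1, the next one layer 2, etc.
    """
    return [pos + 1 for pos, bit in enumerate(reversed(layers)) if bit == "1"]


def has_layer(layers, number):
    """True if the given layer number is flagged in the bit string."""
    return number in layers_set(layers)


if __name__ == "__main__":
    print(layers_set("00000100"))  # the pseudo device from this thread
    print(layers_set("00000000"))  # all zeros: no layer-based job applies
```

Under that reading the pseudo device advertises layer 3 only, which is why arpnip is considered applicable to it.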

@inphobia
Member Author

> Thanks - this makes sense. Can you elaborate on why "remove 169.254.0.0/16 from devices_only" will be a solution? I don't follow...

well, there's really no need to run anything on 169.254.0.0/16 atm, it was left over from some previous tests.

> Also what is the output of netdisco-do dumpconfig -e device_auth -d 169.254.33.33 ?

netdisco-do dumpconfig -e device_auth -d 169.254.33.33
[19782] 2019-04-30 13:25:08  info App::Netdisco version 2.042007 loaded.
[19782] 2019-04-30 13:25:08  info dumpconfig: [169.254.33.33] started at Tue Apr 30 15:25:08 2019
\ [
    [0] {
        community   "xxx",
        driver      "snmp",
        no          "group:v3hosts",
        only        [
            [0] "any"
        ],
        read        1,
        tag         "default_v2_readonly_2",
        write       ""
    },
    [1] {
        community   "xxx",
        driver      "snmp",
        no          "group:v3hosts",
        only        [
            [0] "any"
        ],
        read        1,
        tag         "default_v2_readonly_4",
        write       ""
    }
]
[19782] 2019-04-30 13:25:08  info dumpconfig: finished at Tue Apr 30 15:25:08 2019
[19782] 2019-04-30 13:25:08  info dumpconfig: status done: Dumped config

> Why is CLI skipped and then Netdisco moves to SNMP?

since we don't have any cli creds for these devices. we have cli options for others. point i was trying to make is that we have a cli config, but not for these devices. could have been more clear.

> What worker is run is now purely based on layers also for pseudo devices, right? So if @inphobia your goal is to have no job run, setting the layers attribute to all zeros should do it I assume.

correct; but new pseudo devs are always created with l3 & old ones were converted to include l3 a few updates ago (which was a great idea btw).

perhaps i should get started on more pseudodev editing options, so you can customize them a bit more (like enabling/disabling l3 support). been thinking about that for a while now. being able to do so from device details would be nice, but all of that depends on snmp write atm. extending the admin page -> pseudo devices would be a logical place but could get cluttered real quick.

@ollyg
Member

ollyg commented Apr 30, 2019

Oh okay! Now I understand what your issue is (I think!).

On pseudo devices where you don't have/want/need CLI based ARP collection, Netdisco needlessly attempts SNMP connect because default/applicable credentials exist, and this is bogging down the backend and the queue?

I'll have a think about it. (arguably better in another ticket as this is not the same as the original issue)

@inphobia
Member Author

adding them to devices_no is the easiest fix of course, either by address or perhaps just with vendor:netdisco

perhaps vendor:netdisco might even be a good default. if you're at the stage where you do cli collection on pseudo devices i guess you are at a level where changing that parameter won't be hard. on the other hand, devices_no is a common setting, so if we want to go this route imo it's best to add it to config & deployment.yml (else you might overwrite it without knowing).

@ollyg ollyg changed the title from "arpnip remains queued for pseudo devices" to "Netdisco attempts SNMP arpnip on all Pseudo devices which slows down job queue" on May 27, 2019
@ollyg ollyg added the bug label May 27, 2019
@inphobia
Member Author

i've been thinking about pseudo devices (in the context of a more generic device override system) lately.

perhaps the "model" field could be a way out here. currently pseudos have vendor "netdisco" but no model filled out. perhaps we could define a model type for pseudo devices that should arpnip and another for plain static devices.

like:

  • model: "plain" -> no actions are run other than topology checks
  • model: "arpdev" -> arpnip can be run on these devices.
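
As a rough illustration of this idea (nothing below exists in Netdisco; the lookup table, the `may_run` helper and the fallback for an empty model are all assumptions), the model field could drive a simple lookup:

```python
# Hypothetical model names taken from the proposal above.
ALLOWED_ACTIONS = {
    "plain": frozenset(),              # static device: no polling actions
    "arpdev": frozenset({"arpnip"}),   # pseudo device that should be arpnip'ed
}


def may_run(vendor, model, action):
    """Decide whether a polling action may be queued for a device."""
    if vendor != "netdisco":
        return True  # real devices are outside the scope of this sketch
    # Assumption: a pseudo device without a model behaves like "plain".
    return action in ALLOWED_ACTIONS.get(model or "plain", frozenset())


if __name__ == "__main__":
    print(may_run("netdisco", "plain", "arpnip"))
    print(may_run("netdisco", "arpdev", "arpnip"))
```

Existing pseudo devices have no model filled in, so treating an empty model as "plain" would keep them out of the arpnip queue by default.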

when time permits i'm thinking & working on a more generic device type override for netdisco. we don't always have or even want write access to devices, but would like to change certain parameters, location is a big one. so i'm trying to build some kind of shadow version of certain dev tables where overrides can live. kinda like manual topology but on steroids.
i'm thinking of things like using that to redesign network segments based on what netdisco knows, adding some kind of user defined tags to extend our excellent acl framework.

big dreams, now i just need the skill, time & talent to make them happen :)

ollyg added a commit that referenced this issue Sep 3, 2019
this patch resets all pseudo devices to have no layer3 support but adds a
feature to the pseudo devices admin panel to enable layer3 support. it also
changes arpnip and arpwalk behaviour to always permit the action if layer3
is available (ignoring the vendor).

documentation will need updating to tell users to create pseudo devices
with layer3 support when they want to arpnip an unsupported platform.

arpnip with ssh/cli against a supported platform (one that can be discovered)
will continue to work normally.

Squashed commit of the following:

commit 9dad5be
Author: Oliver Gorwits <oliver@cpan.org>
Date:   Tue Sep 3 09:03:53 2019 +0100

    allow pseudo with layer 3 to run arpnip

commit 7d97943
Author: Oliver Gorwits <oliver@cpan.org>
Date:   Tue Sep 3 08:59:10 2019 +0100

    allow pseudo devices with layer 2/3 capability

commit d1fdf57
Author: Oliver Gorwits <oliver@cpan.org>
Date:   Tue Sep 3 08:55:41 2019 +0100

    move pseudo and layer checks to is_able from is_able_now

commit e0f72ef
Author: Oliver Gorwits <oliver@cpan.org>
Date:   Tue Sep 3 08:51:42 2019 +0100

    ports defaults to one

commit 86ba012
Author: Oliver Gorwits <oliver@cpan.org>
Date:   Tue Sep 3 08:50:45 2019 +0100

    add tooltip for arpnip toggle

commit cdd2470
Author: Oliver Gorwits <oliver@cpan.org>
Date:   Tue Sep 3 08:34:46 2019 +0100

    simplify template

commit 46236d6
Author: Oliver Gorwits <oliver@cpan.org>
Date:   Sun Sep 1 23:53:56 2019 +0100

    a fix up for pseudo devices which need layer 3

commit 016d249
Author: Oliver Gorwits <oliver@cpan.org>
Date:   Sun Sep 1 20:37:11 2019 +0100

    do not wrap buttons

commit 1ec1402
Author: Oliver Gorwits <oliver@cpan.org>
Date:   Sun Sep 1 20:33:03 2019 +0100

    implement user settable layer-three service for pseudo devices

commit a267efa
Author: Oliver Gorwits <oliver@cpan.org>
Date:   Sun Sep 1 18:39:22 2019 +0100

    only set layer if successful action

commit b108be5
Author: Oliver Gorwits <oliver@cpan.org>
Date:   Sun Sep 1 18:32:19 2019 +0100

    should defer SNMP against pseudo devices

commit 897ba3a
Merge: e0ddbaa a734890
Author: Oliver Gorwits <oliver@cpan.org>
Date:   Sun Sep 1 14:54:36 2019 +0100

    Merge branch 'master' into og-pseudo-vs-cli-arpnip

commit e0ddbaa
Author: Oliver Gorwits <oliver@cpan.org>
Date:   Mon Aug 26 11:35:13 2019 +0100

    as last commit, for discover

commit 61f9c89
Author: Oliver Gorwits <oliver@cpan.org>
Date:   Sun Aug 25 23:55:38 2019 +0100

    move pseudo and layer checks into is_*able functions

commit 8b010d4
Author: Oliver Gorwits <oliver@cpan.org>
Date:   Sun Aug 25 18:38:11 2019 +0100

    any device completing macsuck/arpnip must have that layer

commit a11bce7
Author: Oliver Gorwits <oliver@cpan.org>
Date:   Sun Aug 25 18:33:27 2019 +0100

    clean up device layers

commit d2661bf
Author: Oliver Gorwits <oliver@cpan.org>
Date:   Sun Aug 25 18:18:02 2019 +0100

    first make arpnip behave like other jobs towards pseudo devices
@ollyg
Member

ollyg commented Sep 3, 2019

fixed in
[master 2897eda] #587 #561 update pseudo devices to better support ssh arpnip

@ollyg ollyg closed this as completed Sep 3, 2019