Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce macsuck bandwidth usage to database #680

Merged
merged 3 commits into from Dec 21, 2019
Merged

Reduce macsuck bandwidth usage to database #680

merged 3 commits into from Dec 21, 2019

Conversation

rc9000
Copy link
Member

@rc9000 rc9000 commented Nov 7, 2019

From IRC today:

19:31 < rc9000> hi guys! some days ago our db cluster switched my postgres instance 
                to the backup dc, but netdisco remained at the main dc.
19:31 < rc9000> a little later I got complaints that I was using about 400mbit/s of 
                the already highly saturated link
19:32 < rc9000> it turns out that get_port_macs queries a full list of all device_port 
                ips and macs, which is about 13MB in our case
19:33 < rc9000> and we poll 9000ish devices in 40 minutes, so the math works out :) 
19:34 < rc9000> i'm looking at the code and it seems i could make a slightly more specific 
                get_port_macs inside walk_fwtable, that contains only the macs really 
                returned from snmp->fw_mac

This PR tries to fix the above with a get_port_macs that's called inside walk_fwtable and only returns the macs found in the current target device/vlan.

If I'm reading walk_fwtable right the only lookups that can ever be made are for macs that the current snmp->fw_mac contains, but I'd appreciate if @ollyg and other cracks could verify this assumption. Also, I did not find usages of get_port_macs outside of Macsuck/Nodes.pm, so I hope this doesn't break anything.

I'm trying it now in our environment and it both seems to work ok so far and has reduced bandwidth DB -> poller by orders of magnitude.

get_port_macs transfers a full list of all device_port.(mac,ip) in every macsuck.
With 8k devices and 40k interfaces it takes up around 15MB. Transferring them 8k times
during an 1h macsuck cycle requires bandwidth in the 300 to 400 mbit/s range.

This patch changes get_port_macs to be called inside walk_fwtable and only transfer
the macs found in the current target device/vlan.
@inphobia
Copy link
Member

inphobia commented Nov 8, 2019

at first glance seems to work just fine, but i'll need to wait a few days so i actually have users moving around etc...
(and also only have a few days of history atm, will see if i can copy production db to test & run with the same config for a few days)

@rc9000
Copy link
Member Author

rc9000 commented Nov 8, 2019

Some related RRD:

Traffic before/after on database server switch port (hard to see now but there are some green pixels there)

rrdsgp48

Effect on poller performance

scr 2019-11-08 at 08 53 46

@rc9000
Copy link
Member Author

rc9000 commented Nov 11, 2019

Performance implications of this PR from a single test over 7k devices, mostly Cisco:

  • 190502 queries performed for 7186 devices macsucked (26.5 vlans/queries per device)
  • max mac addresses in "in" clause 868, this query takes 0.28 sec and finishes successfully.
  • average query takes 0.0304 sec.
  • 26 of the queries took >= 1 sec
  • all these queries together take 5806 sec

Full CSV this comes from:
gm_profile_res.csv.zip

@ollyg ollyg merged commit 26d3fbd into master Dec 21, 2019
rc9000 added a commit that referenced this pull request Jan 26, 2020
Hi @ollyg! Unfortunately I have found some issues with the code we
finally released in #680:

* get_port_macs expects an array ref but values() returns array,
  so the code was never called due to the return unless... check
* When fw_mac_list had exactly two entries, only the second value
  was bound as a scalar to the parameter. This is probably due
  to the shorthand bind formats described in
  https://metacpan.org/pod/DBIx::Class::ResultSet#DBIC-BIND-VALUES,
  but I'm not a 100% on this.
* return unless now checks for an entry in the list, with the old
  check the statement was also executed for empty lists

In cases where only the device(_port)?.mac lookup worked for uplink
detection, users of 02.044005 - 010 might get a lot of uplinks not
labeled as such.
@rc9000 rc9000 deleted the rc-utilportmac branch February 3, 2020 00:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants