Analyze performance regression of gw_mac_address.py #14818

Open
LKreutzer opened this issue Jan 9, 2023 · 2 comments

@LKreutzer
Contributor

Calculating the MAC address of a device was recently refactored. This made the inout_non_nat sudo test flaky, which was mitigated by increasing the timeout in #14817.

  • Why did increasing the timeout make the test more stable?
  • Is there a performance issue originating from the refactoring?

CC: @wolfseb

@wolfseb
Contributor

wolfseb commented Feb 27, 2023

Analysis for testFlowVlanSnapshotMatch_static2:
Test setup:

  • 3 IP addresses are added to the mobilityd gateway info: 2 with a vlan and 1 without
  • The egress service continuously sends ARP packets to these IPs to retrieve their MAC addresses (this is the code path touched by the refactoring). The frequency of this check can be adjusted by changing non_nat_gw_probe_frequency in the setup (the default in the test was 0.5s).

Behaviour before the refactoring:

  • Within the runtime of the test, two ARP packets are received from each of the 3 IP addresses. Each send/receive round trip takes 1.1-1.2s on average (in 50 test executions there was one case of 1.7s).

Behaviour after the refactoring:

  • Sending and receiving packets usually takes between 0.2s and 1.2s (mostly <0.8s) with strong fluctuation. Durations much larger than this are common, e.g. 7.5s, sometimes up to 20s. When a timeout is set on the socket, receiving raises an error and the test fails whenever the response packet takes longer than that timeout to arrive.
  • The number of received packets ranges between 3 and 9; most of the time it is 6.
  • When the timeout on the socket is removed and non_nat_gw_probe_frequency is set to >1s, the number of received packets quickly drops to exactly 3.
  • The time it takes to receive the packets depends somewhat on the value of non_nat_gw_probe_frequency, e.g.
    • for 5s, the first packet takes 5-6.5s to be received, and the other 2 take <1s.
    • for 10s, most of the time all packets take <1s (mostly <0.5s), but the third one sometimes takes >10s.
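
To illustrate why a socket timeout turns an occasional slow response into a test failure, here is a minimal sketch. It is not the code in gw_mac_address.py: it uses `socket.socketpair()` as a stand-in for the ARP socket so it runs without root privileges, and the names `probe`, `delayed_responder` and `run_probe` are hypothetical helpers introduced only for this example.

```python
import socket
import threading
import time


def probe(sock, payload, timeout_s):
    """Send a request and wait for one reply.

    Returns (reply, elapsed_seconds); reply is None if the timeout expired.
    """
    sock.settimeout(timeout_s)
    start = time.monotonic()
    sock.send(payload)
    try:
        reply = sock.recv(1024)
    except socket.timeout:
        # With a timeout set, a slow peer surfaces as an exception
        # instead of a long blocking wait.
        reply = None
    return reply, time.monotonic() - start


def delayed_responder(sock, delay_s):
    """Hypothetical peer that answers one request after delay_s seconds."""
    data = sock.recv(1024)
    time.sleep(delay_s)
    sock.send(b"reply:" + data)


def run_probe(delay_s, timeout_s):
    """Run one probe against a peer with an artificial response delay."""
    requester, responder = socket.socketpair()
    worker = threading.Thread(target=delayed_responder, args=(responder, delay_s))
    worker.start()
    try:
        return probe(requester, b"who-has", timeout_s)
    finally:
        worker.join()
        requester.close()
        responder.close()


if __name__ == "__main__":
    for delay in (0.05, 0.5):
        reply, elapsed = run_probe(delay_s=delay, timeout_s=0.2)
        status = "ok" if reply is not None else "timed out"
        print(f"delay={delay:.2f}s -> {status} after {elapsed:.2f}s")
```

Under these assumptions, a response that arrives within the timeout is returned normally, while a response delayed past the timeout is reported as a failure even though it would eventually have arrived. This matches the observation that raising the timeout makes the test more stable without explaining where the long delays come from.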

The issue seems to lie somewhere in the concurrency between the egress service and the sending/receiving of ARP packets via socket. It is unclear what causes these varying times (and especially the occasional extremely long wait times). On average, the performance is better than before (sending and receiving packets mostly takes <1.0s compared to 1.1-1.2s before the refactoring).

@MoritzThomasHuebner
Contributor

Also flagging that this issue would likely be resolved by #14970.


3 participants