Some latency runs are too long for the plugin #967

Closed
tonin opened this issue Mar 4, 2020 · 1 comment
tonin commented Mar 4, 2020

When doing custom latency measurements involving a lot of packets, it seems the plugin does not wait long enough for the run to terminate.

A run that fails every time I run it (on 4.2.3):

[root@localhost database]# pscheduler task latency -p 1008 -i 0.00002 -c 500000 --source 193.219.48.252 --dest 193.219.48.250
Submitting task...
Task URL:
https://193.219.48.252/pscheduler/tasks/e72ba541-74f6-409d-9382-971ac4e314d0
Running with tool 'owping'
Fetching first run...

Next scheduled run:
https://193.219.48.252/pscheduler/tasks/e72ba541-74f6-409d-9382-971ac4e314d0/runs/306715d9-7a48-463b-b313-9c17e03dd56e
Starts 2020-03-04T18:13:39Z (~1 seconds)
Ends   2020-03-04T18:14:00Z (~20 seconds)
Waiting for result...

Run did not complete: Failed

Limit system diagnostics for this run:

  Hints:
    requester: 193.219.48.252
    server: 193.219.48.252
  Identified as everybody, local-interfaces
  Classified as default, friendlies
  Application: Hosts we trust to do everything
    Group 1: Limit 'always' passed
    Group 1: Want all, 1/1 passed, 0/1 failed: PASS
    Application PASSES
  Application: Defaults applied to non-friendly hosts
    Group 1: Limit 'innocuous-tests' passed
    Group 1: Limit 'throughput-default-time' failed: Test is not 'throughput'
    Group 1: Limit 'throughput-default-udp' failed: Test is not 'throughput'
    Group 1: Limit 'idleex-default' failed: Test is not 'idleex'
    Group 1: Want any, 1/4 passed, 3/4 failed: PASS
    Application PASSES
  Proposal meets limits
  Priority set at 0:
    Initial priority  (Set to 0)


Error:
    Process took too long to run.

Diagnostics from 193.219.48.252:
  No diagnostics.

Error from 193.219.48.252:
  No error.

No further runs scheduled.
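
For context, a back-of-the-envelope calculation (my own arithmetic, not pscheduler output): just emitting the packets in this task takes about 10 seconds, so the ~20 second run window shown above leaves only about 10 seconds of headroom for everything else owping has to do:

# Rough timing for the failing task above.
count = 500000       # -c
interval = 0.00002   # -i, seconds between packets
print(count * interval)   # about 10 seconds of sending alone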

But the same run with the default test parameters is successful:

[root@localhost database]# pscheduler task latency --source 193.219.48.252 --dest 193.219.48.250
Submitting task...
Task URL:
https://193.219.48.252/pscheduler/tasks/a397e9fa-7197-4409-9115-6eccd6e2eb13
Running with tool 'owping'
Fetching first run...

Next scheduled run:
https://193.219.48.252/pscheduler/tasks/a397e9fa-7197-4409-9115-6eccd6e2eb13/runs/5b2054f6-1d12-4c3e-9121-46cf68a5e87c
Starts 2020-03-04T18:15:52Z (~1 seconds)
Ends   2020-03-04T18:16:13Z (~20 seconds)
Waiting for result...

Packet Statistics
-----------------
Packets Sent ......... 100 packets
Packets Received ..... 100 packets
Packets Lost ......... 0 packets
Packets Duplicated ... 0 packets
Packets Reordered .... 0 packets

(...)

And running the same measurement directly with owping is successful too:

[root@localhost database]# /usr/bin/owping -R -s 1008 -c 500000 -i 2e-05 -L 0 -t -S 193.219.48.252 193.219.48.250
Approximately 12.9 seconds until results available

--- owping statistics from [localhost.localdomain]:9886 to [193.219.48.250]:9070 ---
SID:	c1db30fae20a6e9bf78fd7ec693057db
first:	2020-03-04T20:14:52.928
last:	2020-03-04T20:15:02.922
500000 sent, 0 lost (0.000%), 0 duplicates
one-way delay min/median/max = -1.9/-1.9/-0.673 ms, (err=0.667 ms)
one-way jitter = 0 ms (P95-P50)
hops = 0 (consistently)
no reordering

When pscheduler runs owping, it calls it with -R to get the full packet trace, which is then sent back from the destination to the source. What probably happens is that when the number of packets is huge, as in this test (500,000 packets), pscheduler needs to wait long enough for the whole packet trace to be returned to the source before considering the run finished. I'm not sure we can cap this waiting time, as the number of packets can potentially be very high. Or the additional time should be proportional to the number of packets (a rough sketch of that idea follows the raw trace below).

[root@localhost database]# /usr/bin/owping -R -s 1008 -c 500000 -i 2e-05 -L 0 -t -S 193.219.48.252 [193.219.48.250]:861
0 16287954065985537823 1 0.000227928 16287954065978865742 1 0.000425339 255
1 16287954065985902896 1 0.000227928 16287954065979467037 1 0.000425339 255
2 16287954065986048924 1 0.000227928 16287954065979690376 1 0.000425339 255
3 16287954065986173479 1 0.000227928 16287954065979832110 1 0.000425339 255
4 16287954065986298033 1 0.000227928 16287954065980051153 1 0.000425339 255
5 16287954065986426882 1 0.000227928 16287954065980197182 1 0.000425339 255
6 16287954065986547141 1 0.000227928 16287954065980411930 1 0.000425339 255
7 16287954065986667400 1 0.000227928 16287954065980562254 1 0.000425339 255
8 16287954065986791954 1 0.000227928 16287954065980785592 1 0.000425339 255
9 16287954065986912213 1 0.000227928 16287954065980931621 1 0.000425339 255
10 16287954065987032472 1 0.000227928 16287954065981146370 1 0.000425339 255
11 16287954065987157026 1 0.000227928 16287954065981537212 1 0.000425339 255
(...)
499995 16287954108815454120 1 0.000227928 16287954108806289978 1 0.000425339 255
499996 16287954108815518545 1 0.000227928 16287954108806345813 1 0.000425339 255
499997 16287954108815574379 1 0.000227928 16287954108806405943 1 0.000425339 255
499998 16287954108815634509 1 0.000227928 16287954108806466072 1 0.000425339 255
499999 16287954108815694638 1 0.000227928 16287954108806526202 1 0.000425339 255
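
To illustrate the proportional-time idea above (this is only a sketch; the function name, constants, and per-packet allowance are made up for illustration, not the actual plugin code), the time the plugin allows for a run could scale with the packet count instead of being fixed:

import datetime

# Values from the failing task above.
PACKET_COUNT = 500000        # -c
PACKET_INTERVAL = 0.00002    # -i, seconds between packets

# Assumed fixed allowance for setup/teardown (illustrative value).
BASE_SLOP = datetime.timedelta(seconds=10)

# Assumed per-packet allowance for streaming the -R packet records back
# from the receiver; the right constant would need to be measured.
PER_PACKET_SLOP = datetime.timedelta(microseconds=20)

def estimated_duration(count, interval):
    """Sending time plus an allowance that scales with the packet count."""
    sending = datetime.timedelta(seconds=count * interval)
    return sending + BASE_SLOP + count * PER_PACKET_SLOP

print(estimated_duration(PACKET_COUNT, PACKET_INTERVAL))
# roughly 0:00:30 for this task, versus the fixed ~20 s window that times out above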
tonin self-assigned this Mar 19, 2020
tonin added a commit that referenced this issue Mar 19, 2020:
…sing time that is dependant on the number of packets being sent. Fix #927 and #967
tonin added this to the sprint-20200323 milestone Mar 19, 2020
tonin commented Mar 23, 2020

Confirmed that the fix solves the problem, as seen on the LoLa systems that were trying to run these high-packet-count tests.

tonin closed this Mar 23, 2020
tonin added a commit that referenced this issue Mar 25, 2020:
machines doing runs with very high packet count. Refs #927 and #967