fix(proto): cap PTO backoff at 2 seconds post-handshake#523
Conversation
Documentation for this PR has been generated and is available at: https://n0-computer.github.io/noq/pr/523/docs/noq/ Last updated: 2026-03-30T12:08:07Z
Performance Comparison Report
| Scenario | noq | upstream | Delta | CPU (avg/max) |
|---|---|---|---|---|
| large-single | 5901.0 Mbps | 7996.1 Mbps | -26.2% | 97.0% / 98.6% |
| medium-concurrent | 5620.4 Mbps | 7957.9 Mbps | -29.4% | 96.5% / 98.1% |
| medium-single | 4384.6 Mbps | 4749.2 Mbps | -7.7% | 98.7% / 100.0% |
| small-concurrent | 4016.9 Mbps | 5001.7 Mbps | -19.7% | 97.0% / 99.6% |
| small-single | 3673.3 Mbps | 4684.1 Mbps | -21.6% | 96.8% / 98.6% |
Netsim Benchmarks (network simulation)
| Condition | noq | upstream | Delta |
|---|---|---|---|
| ideal | 3016.7 Mbps | 3938.6 Mbps | -23.4% |
| lan | 782.4 Mbps | 802.3 Mbps | -2.5% |
| lossy | 69.8 Mbps | 55.9 Mbps | +25.0% |
| wan | 83.8 Mbps | 83.8 Mbps | ~0% |
Summary
noq is 21.9% slower on average
e31a3050c5bc177d303fa86e6fcbb8fb21b42550 - artifacts
Raw Benchmarks (localhost)
| Scenario | noq | upstream | Delta | CPU (avg/max) |
|---|---|---|---|---|
| large-single | 5357.5 Mbps | 7962.5 Mbps | -32.7% | 93.5% / 98.3% |
| medium-concurrent | 5524.5 Mbps | 7827.3 Mbps | -29.4% | 94.7% / 105.0% |
| medium-single | 3617.4 Mbps | 4749.4 Mbps | -23.8% | 94.7% / 106.0% |
| small-concurrent | 3815.0 Mbps | 5276.4 Mbps | -27.7% | 99.4% / 162.0% |
| small-single | 3431.1 Mbps | 4846.9 Mbps | -29.2% | 90.7% / 98.1% |
Netsim Benchmarks (network simulation)
| Condition | noq | upstream | Delta |
|---|---|---|---|
| ideal | 3030.7 Mbps | 4092.6 Mbps | -25.9% |
| lan | 782.4 Mbps | 810.3 Mbps | -3.4% |
| lossy | 69.8 Mbps | 69.8 Mbps | ~0% |
| wan | 83.8 Mbps | 83.8 Mbps | ~0% |
Summary
noq is 28.0% slower on average
b536af97431b460cd53ee38aec9210ca5a64b014 - artifacts
Raw Benchmarks (localhost)
| Scenario | noq | upstream | Delta | CPU (avg/max) |
|---|---|---|---|---|
| large-single | 5361.5 Mbps | 7840.7 Mbps | -31.6% | 92.4% / 97.5% |
| medium-concurrent | 5511.1 Mbps | 7685.0 Mbps | -28.3% | 91.7% / 96.9% |
| medium-single | 3873.2 Mbps | 4190.1 Mbps | -7.6% | 88.5% / 96.9% |
| small-concurrent | 3795.3 Mbps | 4889.1 Mbps | -22.4% | 90.7% / 98.3% |
| small-single | 3492.1 Mbps | 4346.8 Mbps | -19.7% | 89.3% / 97.3% |
Netsim Benchmarks (network simulation)
| Condition | noq | upstream | Delta |
|---|---|---|---|
| ideal | 3092.6 Mbps | 4006.1 Mbps | -22.8% |
| lan | 782.4 Mbps | 810.3 Mbps | -3.4% |
| lossy | 69.8 Mbps | 69.8 Mbps | ~0% |
| wan | 83.8 Mbps | 83.8 Mbps | ~0% |
Summary
noq is 23.2% slower on average
cac0daac3903724f201ae0eb22ac922b19ef086a - artifacts
Netsim Benchmarks (network simulation)
| Condition | noq | upstream | Delta |
|---|---|---|---|
| ideal | 2931.6 Mbps | N/A | N/A |
| lan | 782.4 Mbps | N/A | N/A |
| lossy | 69.8 Mbps | N/A | N/A |
| wan | 83.8 Mbps | N/A | N/A |
f6cf1dd3e6bdc7fe99c252152e112cb46bbfe684 - artifacts
Raw Benchmarks (localhost)
| Scenario | noq | upstream | Delta | CPU (avg/max) |
|---|---|---|---|---|
| large-single | 5382.6 Mbps | 7900.0 Mbps | -31.9% | 98.4% / 188.0% |
| medium-concurrent | 5409.1 Mbps | 7508.9 Mbps | -28.0% | 95.0% / 130.0% |
| medium-single | 4022.2 Mbps | 4749.0 Mbps | -15.3% | 93.0% / 106.0% |
| small-concurrent | 3846.9 Mbps | 5412.8 Mbps | -28.9% | 91.4% / 99.4% |
| small-single | 3459.2 Mbps | 4866.4 Mbps | -28.9% | 89.9% / 97.8% |
Netsim Benchmarks (network simulation)
| Condition | noq | upstream | Delta |
|---|---|---|---|
| ideal | 3099.7 Mbps | 4134.5 Mbps | -25.0% |
| lan | 782.4 Mbps | 810.4 Mbps | -3.4% |
| lossy | 69.8 Mbps | 55.9 Mbps | +25.0% |
| wan | 83.8 Mbps | 83.8 Mbps | ~0% |
Summary
noq is 26.4% slower on average
a18eddeda9a478bed04adb95f31f3d61732cd555 - artifacts
No results available
7c42112deeb9344aecbd0d509b19878cd263e9e2 - artifacts
Raw Benchmarks (localhost)
| Scenario | noq | upstream | Delta | CPU (avg/max) |
|---|---|---|---|---|
| large-single | 5612.9 Mbps | 7918.3 Mbps | -29.1% | 96.6% / 109.0% |
| medium-concurrent | 5642.8 Mbps | 7833.6 Mbps | -28.0% | 95.7% / 108.0% |
| medium-single | 3800.5 Mbps | 4511.8 Mbps | -15.8% | 90.7% / 98.2% |
| small-concurrent | 3944.9 Mbps | 5290.8 Mbps | -25.4% | 94.4% / 111.0% |
| small-single | 3599.6 Mbps | 4746.1 Mbps | -24.2% | 96.5% / 112.0% |
Netsim Benchmarks (network simulation)
| Condition | noq | upstream | Delta |
|---|---|---|---|
| ideal | 3135.2 Mbps | 3925.3 Mbps | -20.1% |
| lan | 782.4 Mbps | 810.3 Mbps | -3.4% |
| lossy | 69.9 Mbps | 69.8 Mbps | ~0% |
| wan | 83.8 Mbps | 83.8 Mbps | ~0% |
Summary
noq is 24.2% slower on average
6d2bcad430707dc090f3806606b8390603e9d1bd - artifacts
Raw Benchmarks (localhost)
| Scenario | noq | upstream | Delta | CPU (avg/max) |
|---|---|---|---|---|
| large-single | 5938.5 Mbps | 7968.6 Mbps | -25.5% | 96.7% / 107.0% |
| medium-concurrent | 5469.3 Mbps | 7641.2 Mbps | -28.4% | 95.8% / 108.0% |
| medium-single | 4119.3 Mbps | 4189.5 Mbps | -1.7% | 92.9% / 109.0% |
| small-concurrent | 3893.6 Mbps | 4874.6 Mbps | -20.1% | 93.0% / 108.0% |
| small-single | 3604.2 Mbps | 4402.4 Mbps | -18.1% | 89.5% / 97.4% |
Netsim Benchmarks (network simulation)
| Condition | noq | upstream | Delta |
|---|---|---|---|
| ideal | 3072.5 Mbps | 3663.9 Mbps | -16.1% |
| lan | 782.5 Mbps | 796.5 Mbps | -1.8% |
| lossy | 69.9 Mbps | 55.9 Mbps | +25.0% |
| wan | 83.8 Mbps | 83.8 Mbps | ~0% |
Summary
noq is 19.7% slower on average
5a4f32450f2c78f641bf5d04cd2024e904623649 - artifacts
No results available
83a158e039422cd4b6773e9dc50c41268c122e8f - artifacts
Raw Benchmarks (localhost)
| Scenario | noq | upstream | Delta | CPU (avg/max) |
|---|---|---|---|---|
| large-single | 5375.3 Mbps | 8180.3 Mbps | -34.3% | 96.4% / 129.0% |
| medium-concurrent | 5420.1 Mbps | 7457.0 Mbps | -27.3% | 91.9% / 97.3% |
| medium-single | 3842.2 Mbps | 4155.1 Mbps | -7.5% | 96.7% / 127.0% |
| small-concurrent | 3814.9 Mbps | 4996.3 Mbps | -23.6% | 93.7% / 102.0% |
| small-single | 3392.7 Mbps | 4053.9 Mbps | -16.3% | 84.9% / 95.7% |
Netsim Benchmarks (network simulation)
| Condition | noq | upstream | Delta |
|---|---|---|---|
| ideal | 3034.9 Mbps | 3967.0 Mbps | -23.5% |
| lan | 782.4 Mbps | 810.3 Mbps | -3.4% |
| lossy | 69.8 Mbps | 69.9 Mbps | ~0% |
| wan | 83.8 Mbps | 83.8 Mbps | ~0% |
Summary
noq is 23.6% slower on average
8793168f020876502c340e8b8755832e22d90fca - artifacts
Raw Benchmarks (localhost)
| Scenario | noq | upstream | Delta | CPU (avg/max) |
|---|---|---|---|---|
| large-single | 5792.7 Mbps | 7823.0 Mbps | -26.0% | 95.9% / 105.0% |
| medium-concurrent | 5287.1 Mbps | 7970.3 Mbps | -33.7% | 92.7% / 104.0% |
| medium-single | 4157.5 Mbps | 4737.5 Mbps | -12.2% | 97.3% / 160.0% |
| small-concurrent | 3907.9 Mbps | 5121.1 Mbps | -23.7% | 90.7% / 98.8% |
| small-single | 3557.2 Mbps | 4675.8 Mbps | -23.9% | 94.7% / 108.0% |
Summary
noq is 25.1% slower on average
2c616ad496abb80df45f403da7b3a1418f98e537 - artifacts
Raw Benchmarks (localhost)
| Scenario | noq | upstream | Delta | CPU (avg/max) |
|---|---|---|---|---|
| large-single | 5342.6 Mbps | 7896.9 Mbps | -32.3% | 94.9% / 125.0% |
| medium-concurrent | 5429.7 Mbps | 7825.9 Mbps | -30.6% | 94.8% / 126.0% |
| medium-single | 3610.2 Mbps | 4749.0 Mbps | -24.0% | 92.0% / 102.0% |
| small-concurrent | 3794.7 Mbps | 5297.1 Mbps | -28.4% | 96.1% / 129.0% |
| small-single | 3551.7 Mbps | 4858.7 Mbps | -26.9% | 94.6% / 128.0% |
Netsim Benchmarks (network simulation)
| Condition | noq | upstream | Delta |
|---|---|---|---|
| ideal | 3092.4 Mbps | 3603.9 Mbps | -14.2% |
| lan | 782.5 Mbps | 796.4 Mbps | -1.7% |
| lossy | 69.8 Mbps | 69.8 Mbps | ~0% |
| wan | 83.8 Mbps | 83.8 Mbps | ~0% |
Summary
noq is 26.8% slower on average
8d218842889c6e2e80112ec00ada63d135d598c4 - artifacts
No results available
e69ecfb31cdbd7bbea687d1a49f7812e044f47a3 - artifacts
Raw Benchmarks (localhost)
| Scenario | noq | upstream | Delta | CPU (avg/max) |
|---|---|---|---|---|
| large-single | 5361.6 Mbps | 7834.2 Mbps | -31.6% | 93.8% / 101.0% |
| medium-concurrent | 5415.7 Mbps | 7864.2 Mbps | -31.1% | 91.5% / 97.1% |
| medium-single | 3784.3 Mbps | 4561.3 Mbps | -17.0% | 91.4% / 98.8% |
| small-concurrent | 3896.6 Mbps | 5071.6 Mbps | -23.2% | 99.2% / 130.0% |
| small-single | 3494.8 Mbps | 4654.8 Mbps | -24.9% | 86.6% / 96.3% |
Netsim Benchmarks (network simulation)
| Condition | noq | upstream | Delta |
|---|---|---|---|
| ideal | 2905.2 Mbps | N/A | N/A |
| lan | 775.1 Mbps | N/A | N/A |
| lossy | 69.8 Mbps | N/A | N/A |
| wan | 83.8 Mbps | N/A | N/A |
Summary
noq is 26.8% slower on average
4fbdccc243c1eb04877b63520949c4de64c754bd - artifacts
No results available
185a6e067ff98d8c6ca6d7ddb11f02551e62b9d9 - artifacts
Raw Benchmarks (localhost)
| Scenario | noq | upstream | Delta | CPU (avg/max) |
|---|---|---|---|---|
| large-single | 5560.5 Mbps | 8062.3 Mbps | -31.0% | 91.2% / 96.7% |
| medium-concurrent | 5351.9 Mbps | 7856.4 Mbps | -31.9% | 98.8% / 191.0% |
| medium-single | 4618.5 Mbps | 4749.6 Mbps | -2.8% | 90.3% / 97.6% |
| small-concurrent | 3789.4 Mbps | 5372.8 Mbps | -29.5% | 97.9% / 135.0% |
| small-single | 3348.5 Mbps | 4821.1 Mbps | -30.5% | 93.0% / 102.0% |
Netsim Benchmarks (network simulation)
| Condition | noq | upstream | Delta |
|---|---|---|---|
| ideal | 3083.4 Mbps | N/A | N/A |
| lan | 782.5 Mbps | N/A | N/A |
| lossy | 69.8 Mbps | N/A | N/A |
| wan | 83.8 Mbps | N/A | N/A |
Summary
noq is 26.5% slower on average
ce171940ac91339fd7fb6b32053dfbe66e56ad52 - artifacts
Raw Benchmarks (localhost)
| Scenario | noq | upstream | Delta | CPU (avg/max) |
|---|---|---|---|---|
| large-single | 5460.0 Mbps | 7875.9 Mbps | -30.7% | 93.3% / 98.1% |
| medium-concurrent | 5576.5 Mbps | 7920.7 Mbps | -29.6% | 91.4% / 97.0% |
| medium-single | 4120.6 Mbps | 4596.9 Mbps | -10.4% | 94.8% / 107.0% |
| small-concurrent | 3860.0 Mbps | 5224.9 Mbps | -26.1% | 96.5% / 107.0% |
| small-single | 3527.2 Mbps | 4690.7 Mbps | -24.8% | 93.6% / 108.0% |
Netsim Benchmarks (network simulation)
| Condition | noq | upstream | Delta |
|---|---|---|---|
| ideal | 2863.0 Mbps | 3631.3 Mbps | -21.2% |
| lan | 782.5 Mbps | 796.4 Mbps | -1.7% |
| lossy | 69.8 Mbps | 55.9 Mbps | +25.0% |
| wan | 83.8 Mbps | 83.8 Mbps | ~0% |
Summary
noq is 24.5% slower on average
a948f32bdc33af9806c3074c622cc0a230e48f19 - artifacts
Raw Benchmarks (localhost)
| Scenario | noq | upstream | Delta | CPU (avg/max) |
|---|---|---|---|---|
| large-single | 6193.8 Mbps | 7899.3 Mbps | -21.6% | 97.1% / 98.6% |
| medium-concurrent | 5521.5 Mbps | 7625.9 Mbps | -27.6% | 96.5% / 98.1% |
| medium-single | 3862.2 Mbps | 4189.1 Mbps | -7.8% | 96.6% / 98.6% |
| small-concurrent | 4000.2 Mbps | 5170.6 Mbps | -22.6% | 97.0% / 99.5% |
| small-single | 3707.0 Mbps | 4458.9 Mbps | -16.9% | 96.7% / 98.8% |
Netsim Benchmarks (network simulation)
| Condition | noq | upstream | Delta |
|---|---|---|---|
| ideal | N/A | 3926.3 Mbps | N/A |
| lan | N/A | 798.8 Mbps | N/A |
| lossy | N/A | 69.8 Mbps | N/A |
| wan | N/A | 83.8 Mbps | N/A |
Summary
noq is 20.6% slower on average
0c7dee4bc03990a3fb7dded05022bdab800f3993 - artifacts
Raw Benchmarks (localhost)
| Scenario | noq | upstream | Delta | CPU (avg/max) |
|---|---|---|---|---|
| large-single | 5456.2 Mbps | 7818.6 Mbps | -30.2% | 94.8% / 104.0% |
| medium-concurrent | 5467.4 Mbps | 7966.8 Mbps | -31.4% | 95.4% / 109.0% |
| medium-single | 4140.2 Mbps | 4594.7 Mbps | -9.9% | 100.1% / 162.0% |
| small-concurrent | 3956.7 Mbps | 5217.5 Mbps | -24.2% | 91.1% / 99.1% |
| small-single | 3491.8 Mbps | 4727.0 Mbps | -26.1% | 95.3% / 109.0% |
Netsim Benchmarks (network simulation)
| Condition | noq | upstream | Delta |
|---|---|---|---|
| ideal | 2961.0 Mbps | N/A | N/A |
| lan | 782.4 Mbps | N/A | N/A |
| lossy | 69.8 Mbps | N/A | N/A |
| wan | 83.8 Mbps | N/A | N/A |
Summary
noq is 25.8% slower on average
536791ce61a59004a10c716d76f6eceb4023959f - artifacts
No results available
75629fb94db052d2c134eecdb84feb9fed10be42 - artifacts
No results available
21893a499bd682022b5acb87ddc0e08553bc8f2a - artifacts
Raw Benchmarks (localhost)
| Scenario | noq | upstream | Delta | CPU (avg/max) |
|---|---|---|---|---|
| large-single | 5209.2 Mbps | 7949.8 Mbps | -34.5% | 91.3% / 97.4% |
| medium-concurrent | 5417.3 Mbps | 7850.7 Mbps | -31.0% | 92.8% / 97.4% |
| medium-single | 3580.0 Mbps | 4689.8 Mbps | -23.7% | 96.5% / 128.0% |
| small-concurrent | 3846.4 Mbps | 5192.9 Mbps | -25.9% | 88.3% / 98.3% |
| small-single | 3417.6 Mbps | 4714.5 Mbps | -27.5% | 92.9% / 102.0% |
Netsim Benchmarks (network simulation)
| Condition | noq | upstream | Delta |
|---|---|---|---|
| ideal | 3144.6 Mbps | 3952.9 Mbps | -20.4% |
| lan | 782.4 Mbps | 810.3 Mbps | -3.4% |
| lossy | 69.8 Mbps | 69.8 Mbps | ~0% |
| wan | 83.8 Mbps | 83.8 Mbps | ~0% |
Summary
noq is 27.6% slower on average
84c9e1177531cbd77e64ac469c60e550ed1b7f1b - artifacts
Raw Benchmarks (localhost)
| Scenario | noq | upstream | Delta | CPU (avg/max) |
|---|---|---|---|---|
| large-single | 5487.2 Mbps | 8031.2 Mbps | -31.7% | 94.3% / 105.0% |
| medium-concurrent | 5377.2 Mbps | 7486.0 Mbps | -28.2% | 92.2% / 97.1% |
| medium-single | 4067.9 Mbps | 4749.2 Mbps | -14.3% | 86.6% / 95.8% |
| small-concurrent | 3816.5 Mbps | 5369.0 Mbps | -28.9% | 94.0% / 103.0% |
| small-single | 3434.3 Mbps | 4852.6 Mbps | -29.2% | 89.9% / 97.4% |
Netsim Benchmarks (network simulation)
| Condition | noq | upstream | Delta |
|---|---|---|---|
| ideal | 3065.8 Mbps | 3897.5 Mbps | -21.3% |
| lan | 782.4 Mbps | 805.9 Mbps | -2.9% |
| lossy | 69.8 Mbps | 55.9 Mbps | +25.0% |
| wan | 83.8 Mbps | 83.8 Mbps | ~0% |
Summary
noq is 25.9% slower on average
b24e9ef4011b0221e13f9f8edfa0aa5171f58409 - artifacts
No results available
24b4c76a40f16238faee0ff0b82047740cf32d6b - artifacts
Raw Benchmarks (localhost)
| Scenario | noq | upstream | Delta | CPU (avg/max) |
|---|---|---|---|---|
| large-single | 5669.7 Mbps | 7928.0 Mbps | -28.5% | 92.1% / 97.3% |
| medium-concurrent | 5496.3 Mbps | 7404.3 Mbps | -25.8% | 98.7% / 181.0% |
| medium-single | 4438.0 Mbps | 4760.8 Mbps | -6.8% | 95.9% / 122.0% |
| small-concurrent | 3840.3 Mbps | 5341.5 Mbps | -28.1% | 95.9% / 125.0% |
| small-single | 3503.6 Mbps | 4713.2 Mbps | -25.7% | 92.4% / 102.0% |
Netsim Benchmarks (network simulation)
| Condition | noq | upstream | Delta |
|---|---|---|---|
| ideal | 2877.3 Mbps | 3925.3 Mbps | -26.7% |
| lan | 782.1 Mbps | 800.8 Mbps | -2.3% |
| lossy | 69.9 Mbps | 55.9 Mbps | +25.0% |
| wan | 83.8 Mbps | 83.8 Mbps | ~0% |
Summary
noq is 23.6% slower on average
d1566b5115c9c14868029188dd8c9a8ff1e0cdf1 - artifacts
Raw Benchmarks (localhost)
| Scenario | noq | upstream | Delta | CPU (avg/max) |
|---|---|---|---|---|
| large-single | 5807.2 Mbps | 8064.7 Mbps | -28.0% | 92.6% / 97.7% |
| medium-concurrent | 5290.2 Mbps | 7795.9 Mbps | -32.1% | 93.7% / 105.0% |
| medium-single | 3798.8 Mbps | 4189.7 Mbps | -9.3% | 95.8% / 108.0% |
| small-concurrent | 3883.3 Mbps | 5055.1 Mbps | -23.2% | 96.4% / 111.0% |
| small-single | 3527.9 Mbps | 4467.7 Mbps | -21.0% | 93.1% / 108.0% |
Netsim Benchmarks (network simulation)
| Condition | noq | upstream | Delta |
|---|---|---|---|
| ideal | 3037.5 Mbps | 3939.2 Mbps | -22.9% |
| lan | 782.5 Mbps | 810.3 Mbps | -3.4% |
| lossy | 69.8 Mbps | 55.9 Mbps | +25.0% |
| wan | 83.8 Mbps | 83.8 Mbps | ~0% |
Summary
noq is 23.7% slower on average
checked locally and the failing tests just do not finish
Force-pushed cac0daa to f6cf1dd
should be better now
Force-pushed f6cf1dd to a18edde
Force-pushed 7c42112 to 6d2bcad
flub left a comment
Generally LGTM though I do have some questions.
I think I understand what the PR is aiming to solve, but I'm failing to understand the solution or why the approach is appropriate, beyond using picoquic as a sort of authoritative source.
Is the issue we are solving that PTO might become larger than the idle timeout? This is what I understood. I guess there are some other issues, but I'll focus on this one for the moment.
If this is the case, why divide by 16 only above 15s? This makes for ~1s PTOs for idle timeouts of 15s but 2s PTOs for idle timeouts of 14s. It's such a weird criterion and I'm finding it hard to justify.
Why not cap the "desired" PTO to the min of something like 50% of the idle timeout and 2s, ensuring one PTO can always fire before the idle timeout (regardless of value)? Basically, I'm having issues with the "only if `idle_timeout` > 15s" condition. I've intentionally left out the hard cap special case of large RTT paths, which I think is a fine upper bound instead of the 2s.
I went ahead and gave a try at swapping picoquic's weird criterion with the one I propose, and all tests, including your new ones, pass. I'm keen on understanding if there are objections to this, but for now I'm inclined toward an approach that seems to work, is easier to understand, and does not have an arbitrary application criterion.
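For concreteness, a minimal sketch of this proposed criterion, with illustrative names and an assumed `MAX_BACKOFF_EXPONENT` value rather than the crate's actual API:

```rust
use std::time::Duration;

// Assumed values for illustration; the crate's real constants may differ.
const MAX_BACKOFF_EXPONENT: u32 = 10;
const HARD_CAP: Duration = Duration::from_secs(2);

/// Cap the backed-off PTO at min(idle_timeout / 2, 2s), so at least one
/// probe can always fire before the idle timeout, with no special-casing
/// on the idle timeout's magnitude.
fn proposed_pto(pto_base: Duration, pto_count: u32, idle_timeout: Duration) -> Duration {
    let backoff = 2u32.pow(pto_count.min(MAX_BACKOFF_EXPONENT));
    let desired = pto_base * backoff;
    desired.min(idle_timeout / 2).min(HARD_CAP)
}
```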
matheus23 left a comment
There is another bug that was hiding here, which this uncovers. Trying to fix it, but we should keep this workaround in this PR imho.
I disagree with you three.
There is definitely a bug in here. Why does noq-proto keep resending packets forever and never hit the idle timeout? This should not happen.
I'm looking at logs and will try to figure out the bug that this PR introduces.
For some reason I had a hard time understanding y'all here.
I'm fine with keeping the workaround in this PR and fixing the underlying bug ASAP.
The danger is that together with this PR's changes to the loss timer, there will be situations in which users will have broken connections that never time out. That's bad.
But I do see how the underlying bug for this is also present on main.
Perhaps we should fix it on main first, then remove the workaround here, then merge this?
fyi @flub @divagant-martian @matheus23 this started breaking patchbay tests in the combined branch on degradation level 6, so I will go back to the previous version
With `now + duration`, the PTO fires `duration` after the current poll, regardless of when the last probe was sent. When the PTO cap compresses `duration` to near the path RTT (e.g. cap=1875ms vs RTT=1600ms at 800ms one-way latency), consecutive PTO firings overlap with in-flight probes that haven't had time to be acknowledged yet.

In patchbay degrade_ladder tests at level 6 (800ms latency, 20% loss, 20% reorder), this causes the client to exhaust its holepunch timeout (15s) with spurious retransmissions before the PATH_CHALLENGE/RESPONSE exchange completes. The server side is less affected because it does not initiate holepunching.

Fix: use `max(last_ack_eliciting, now) + duration` so each PTO waits at least `duration` from the last send, giving the peer a full RTT to respond before the next probe.
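Reading that intent literally (each PTO waits at least `duration` after the last ack-eliciting send, and the timer is never armed in the past), a minimal sketch of the deadline computation; names are illustrative and the actual code may differ:

```rust
use std::time::{Duration, Instant};

/// Arm the PTO timer relative to the last ack-eliciting send rather than
/// the current poll (sketch only).
fn pto_deadline(last_ack_eliciting: Instant, now: Instant, duration: Duration) -> Instant {
    // Count from the last send; if the cap made `duration` small relative
    // to the time since that send, fire no earlier than `now` so the
    // timer never lands in the past.
    (last_ack_eliciting + duration).max(now)
}
```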
Force-pushed 185a6e0 to ce17194
This adopts a variant of the algorithm to cap the PTO interval, but keeps computing it from the last_ack_eliciting time to keep the computation stable, and adds documentation of why this matters. Also some other small doc and comment improvements around, and a fix for the `SendableFrames::other` computation.
```rust
let backoff = 2u32.pow(path.pto_count.min(MAX_BACKOFF_EXPONENT));
let duration = path.rtt.pto_base() * backoff;
```
This affects the tail-loss probes during the handshake. There are tests for this and we did not have issues with that. So I have left this computation unchanged.
Adding a comment as to why the handshake is a special case would be nice!
I... I don't know, other than the existing comment. I mostly left this untouched because it was already untouched, and because it was already always using `now + duration`, which I don't really follow. Possibly it is because the space is essentially fixed, so you can't get into trouble there.
In theory the handshake has the same issue: when a packet is lost the exponential backoff will grow to large delays, and you probably give up when the idle timeout hits. I think the cap approach should be applicable there as well; you would increase the number of packets sent before giving up. OTOH it is not as needed, since this is mostly about improving how quickly you detect a loss IIUC.
FWIW, I also don't know why this doesn't introduce any jitter in the PTO calculations.
PTAL: I've implemented a version that also uses the cap during the handshake. This otherwise keeps the same behaviour as before: counting from now. I think that behaviour is fine because when this clause kicks in there is only ever one space that gets used, but it is still weird to me. I also suspect that the previous version was rather sad as well and resulted in way fewer tail-loss probes than would be desired during this time.
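A minimal sketch of that handshake variant, with illustrative names and an assumed `MAX_BACKOFF_EXPONENT` value:

```rust
use std::time::{Duration, Instant};

const MAX_BACKOFF_EXPONENT: u32 = 10; // assumed value, for illustration

/// Handshake-time PTO: same exponential backoff and cap, but the deadline
/// still counts from `now`, since only one packet number space is in play
/// while this clause applies.
fn handshake_pto_deadline(now: Instant, pto_base: Duration, pto_count: u32, cap: Duration) -> Instant {
    let backoff = 2u32.pow(pto_count.min(MAX_BACKOFF_EXPONENT));
    now + (pto_base * backoff).min(cap)
}
```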
PTAL, I implemented a more gradual cap on the PTO interval, and adjusted the computation so that it computes the correct duration since the last ack-eliciting packet was sent.
Co-authored-by: Philipp Krüger <philipp.krueger1@gmail.com>
Description
Cap PTO backoff post-handshake to match picoquic's `picoquic_current_retransmit_timer()`. Without a cap, PTO grows exponentially without bound during connectivity gaps, making recovery take tens of seconds even after the link is restored.

Relates to #376
Implementation
Matches picoquic's `timing.c:42-88`:

- Idle-timeout cap (`timing.c:59-63`): cap PTO at `idle_timeout / 16` so ~16 retransmits fit before the idle timer fires. Only when `idle_timeout > 15s`.
- Hard cap (`timing.c:77-85`): matches `PICOQUIC_LARGE_RETRANSMIT_TIMER`. For satellite paths (RTT > 610ms), uses `1.5 * smoothed_rtt` instead.
- The timer is armed from `max(last_ack_eliciting, now)` to prevent scheduling in the past when the cap makes the duration small relative to how long ago the last packet was sent.

Only applies post-handshake to avoid interfering with connection establishment. A sketch of the combined cap follows below.
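A minimal sketch of the combined cap, with constants as illustrative stand-ins for picoquic's values in timing.c:

```rust
use std::time::Duration;

const LARGE_RETRANSMIT_TIMER: Duration = Duration::from_secs(2); // PICOQUIC_LARGE_RETRANSMIT_TIMER
const SATELLITE_RTT_THRESHOLD: Duration = Duration::from_millis(610);
const IDLE_CAP_THRESHOLD: Duration = Duration::from_secs(15);

/// Cap a backed-off PTO post-handshake, as described above (sketch only).
fn cap_pto(desired: Duration, smoothed_rtt: Duration, idle_timeout: Option<Duration>) -> Duration {
    let mut pto = desired;
    // Idle-timeout cap: leave room for ~16 retransmits before the idle
    // timer fires, applied only when the idle timeout exceeds 15s.
    if let Some(idle) = idle_timeout {
        if idle > IDLE_CAP_THRESHOLD {
            pto = pto.min(idle / 16);
        }
    }
    // Hard cap: 2s, or 1.5 * smoothed_rtt on satellite-class paths.
    let hard_cap = if smoothed_rtt > SATELLITE_RTT_THRESHOLD {
        smoothed_rtt + smoothed_rtt / 2
    } else {
        LARGE_RETRANSMIT_TIMER
    };
    pto.min(hard_cap)
}
```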
References
picoquic `picoquic_current_retransmit_timer()` (reference implementation)