fix(proto): cap PTO backoff at 2 seconds post-handshake#523

Merged
flub merged 20 commits into main from fix/pto-cap
Mar 30, 2026
Conversation

@dignifiedquire (Contributor) commented Mar 21, 2026

Description

Cap PTO backoff post-handshake to match picoquic's picoquic_current_retransmit_timer(). Without a cap, PTO grows exponentially without bound during connectivity gaps, making recovery take tens of seconds even after the link is restored.

Relates to #376

Implementation

Matches picoquic's timing.c:42-88:

  1. Idle timeout bound (timing.c:59-63): cap PTO at idle_timeout / 16 so ~16 retransmits fit before the idle timer fires. Only when idle_timeout > 15s.
  2. Hard cap at 2s (timing.c:77-85): matches PICOQUIC_LARGE_RETRANSMIT_TIMER. For satellite paths (RTT > 610ms), uses 1.5 * smoothed_rtt instead.
  3. PTO base time: use max(last_ack_eliciting, now) to prevent scheduling in the past when the cap makes duration small relative to how long ago the last packet was sent.

Only applies post-handshake to avoid interfering with connection establishment.
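The three steps above can be sketched as follows. This is an illustrative reconstruction, not the actual noq-proto code: the function name `cap_pto` and its signature are hypothetical, and the constants (15 s threshold, divisor 16, 2 s hard cap, 610 ms satellite boundary, 1.5 × smoothed RTT) are taken from the description above.

```rust
use std::time::Duration;

/// Sketch of the post-handshake PTO cap (hypothetical API, constants
/// as described in the PR text, mirroring picoquic's timing.c).
fn cap_pto(duration: Duration, idle_timeout: Option<Duration>, smoothed_rtt: Duration) -> Duration {
    let mut capped = duration;

    // 1. Idle timeout bound: only when idle_timeout > 15s, cap at
    //    idle_timeout / 16 so ~16 probes fit before the idle timer fires.
    if let Some(idle) = idle_timeout {
        if idle > Duration::from_secs(15) {
            capped = capped.min(idle / 16);
        }
    }

    // 2. Hard cap at 2s; for satellite-class paths (RTT > 610ms),
    //    use 1.5 * smoothed_rtt instead.
    let hard_cap = if smoothed_rtt > Duration::from_millis(610) {
        smoothed_rtt * 3 / 2
    } else {
        Duration::from_secs(2)
    };
    capped.min(hard_cap)
}

fn main() {
    // An exponentially backed-off 8s PTO is pulled down to the 2s hard cap.
    assert_eq!(
        cap_pto(Duration::from_secs(8), Some(Duration::from_secs(60)), Duration::from_millis(100)),
        Duration::from_secs(2)
    );
    // Satellite path (800ms RTT): cap becomes 1.5 * srtt = 1200ms.
    assert_eq!(
        cap_pto(Duration::from_secs(8), None, Duration::from_millis(800)),
        Duration::from_millis(1200)
    );
    // 32s idle timeout: idle bound is 32/16 = 2s, same as the hard cap here.
    assert_eq!(
        cap_pto(Duration::from_secs(8), Some(Duration::from_secs(32)), Duration::from_millis(100)),
        Duration::from_secs(2)
    );
}
```

Step 3 (the PTO base time) is a separate change to when the timer is armed, not to the duration, so it is not shown here.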

References

github-actions bot commented Mar 21, 2026

Documentation for this PR has been generated and is available at: https://n0-computer.github.io/noq/pr/523/docs/noq/

Last updated: 2026-03-30T12:08:07Z

github-actions bot commented Mar 21, 2026

Performance Comparison Report

84a2d4398ffba83c893290b5ad495bc4f921fc57 - artifacts

Raw Benchmarks (localhost)

Scenario noq upstream Delta CPU (avg/max)
large-single 5901.0 Mbps 7996.1 Mbps -26.2% 97.0% / 98.6%
medium-concurrent 5620.4 Mbps 7957.9 Mbps -29.4% 96.5% / 98.1%
medium-single 4384.6 Mbps 4749.2 Mbps -7.7% 98.7% / 100.0%
small-concurrent 4016.9 Mbps 5001.7 Mbps -19.7% 97.0% / 99.6%
small-single 3673.3 Mbps 4684.1 Mbps -21.6% 96.8% / 98.6%

Netsim Benchmarks (network simulation)

Condition noq upstream Delta
ideal 3016.7 Mbps 3938.6 Mbps -23.4%
lan 782.4 Mbps 802.3 Mbps -2.5%
lossy 69.8 Mbps 55.9 Mbps +25.0%
wan 83.8 Mbps 83.8 Mbps ~0%

Summary

noq is 21.9% slower on average

---
e31a3050c5bc177d303fa86e6fcbb8fb21b42550 - artifacts

Raw Benchmarks (localhost)

Scenario noq upstream Delta CPU (avg/max)
large-single 5357.5 Mbps 7962.5 Mbps -32.7% 93.5% / 98.3%
medium-concurrent 5524.5 Mbps 7827.3 Mbps -29.4% 94.7% / 105.0%
medium-single 3617.4 Mbps 4749.4 Mbps -23.8% 94.7% / 106.0%
small-concurrent 3815.0 Mbps 5276.4 Mbps -27.7% 99.4% / 162.0%
small-single 3431.1 Mbps 4846.9 Mbps -29.2% 90.7% / 98.1%

Netsim Benchmarks (network simulation)

Condition noq upstream Delta
ideal 3030.7 Mbps 4092.6 Mbps -25.9%
lan 782.4 Mbps 810.3 Mbps -3.4%
lossy 69.8 Mbps 69.8 Mbps ~0%
wan 83.8 Mbps 83.8 Mbps ~0%

Summary

noq is 28.0% slower on average

---
b536af97431b460cd53ee38aec9210ca5a64b014 - artifacts

Raw Benchmarks (localhost)

Scenario noq upstream Delta CPU (avg/max)
large-single 5361.5 Mbps 7840.7 Mbps -31.6% 92.4% / 97.5%
medium-concurrent 5511.1 Mbps 7685.0 Mbps -28.3% 91.7% / 96.9%
medium-single 3873.2 Mbps 4190.1 Mbps -7.6% 88.5% / 96.9%
small-concurrent 3795.3 Mbps 4889.1 Mbps -22.4% 90.7% / 98.3%
small-single 3492.1 Mbps 4346.8 Mbps -19.7% 89.3% / 97.3%

Netsim Benchmarks (network simulation)

Condition noq upstream Delta
ideal 3092.6 Mbps 4006.1 Mbps -22.8%
lan 782.4 Mbps 810.3 Mbps -3.4%
lossy 69.8 Mbps 69.8 Mbps ~0%
wan 83.8 Mbps 83.8 Mbps ~0%

Summary

noq is 23.2% slower on average

---
cac0daac3903724f201ae0eb22ac922b19ef086a - artifacts

Netsim Benchmarks (network simulation)

Condition noq upstream Delta
ideal 2931.6 Mbps N/A N/A
lan 782.4 Mbps N/A N/A
lossy 69.8 Mbps N/A N/A
wan 83.8 Mbps N/A N/A
---
f6cf1dd3e6bdc7fe99c252152e112cb46bbfe684 - artifacts

Raw Benchmarks (localhost)

Scenario noq upstream Delta CPU (avg/max)
large-single 5382.6 Mbps 7900.0 Mbps -31.9% 98.4% / 188.0%
medium-concurrent 5409.1 Mbps 7508.9 Mbps -28.0% 95.0% / 130.0%
medium-single 4022.2 Mbps 4749.0 Mbps -15.3% 93.0% / 106.0%
small-concurrent 3846.9 Mbps 5412.8 Mbps -28.9% 91.4% / 99.4%
small-single 3459.2 Mbps 4866.4 Mbps -28.9% 89.9% / 97.8%

Netsim Benchmarks (network simulation)

Condition noq upstream Delta
ideal 3099.7 Mbps 4134.5 Mbps -25.0%
lan 782.4 Mbps 810.4 Mbps -3.4%
lossy 69.8 Mbps 55.9 Mbps +25.0%
wan 83.8 Mbps 83.8 Mbps ~0%

Summary

noq is 26.4% slower on average

---
a18eddeda9a478bed04adb95f31f3d61732cd555 - artifacts

No results available

---
7c42112deeb9344aecbd0d509b19878cd263e9e2 - artifacts

Raw Benchmarks (localhost)

Scenario noq upstream Delta CPU (avg/max)
large-single 5612.9 Mbps 7918.3 Mbps -29.1% 96.6% / 109.0%
medium-concurrent 5642.8 Mbps 7833.6 Mbps -28.0% 95.7% / 108.0%
medium-single 3800.5 Mbps 4511.8 Mbps -15.8% 90.7% / 98.2%
small-concurrent 3944.9 Mbps 5290.8 Mbps -25.4% 94.4% / 111.0%
small-single 3599.6 Mbps 4746.1 Mbps -24.2% 96.5% / 112.0%

Netsim Benchmarks (network simulation)

Condition noq upstream Delta
ideal 3135.2 Mbps 3925.3 Mbps -20.1%
lan 782.4 Mbps 810.3 Mbps -3.4%
lossy 69.9 Mbps 69.8 Mbps ~0%
wan 83.8 Mbps 83.8 Mbps ~0%

Summary

noq is 24.2% slower on average

---
6d2bcad430707dc090f3806606b8390603e9d1bd - artifacts

Raw Benchmarks (localhost)

Scenario noq upstream Delta CPU (avg/max)
large-single 5938.5 Mbps 7968.6 Mbps -25.5% 96.7% / 107.0%
medium-concurrent 5469.3 Mbps 7641.2 Mbps -28.4% 95.8% / 108.0%
medium-single 4119.3 Mbps 4189.5 Mbps -1.7% 92.9% / 109.0%
small-concurrent 3893.6 Mbps 4874.6 Mbps -20.1% 93.0% / 108.0%
small-single 3604.2 Mbps 4402.4 Mbps -18.1% 89.5% / 97.4%

Netsim Benchmarks (network simulation)

Condition noq upstream Delta
ideal 3072.5 Mbps 3663.9 Mbps -16.1%
lan 782.5 Mbps 796.5 Mbps -1.8%
lossy 69.9 Mbps 55.9 Mbps +25.0%
wan 83.8 Mbps 83.8 Mbps ~0%

Summary

noq is 19.7% slower on average

---
5a4f32450f2c78f641bf5d04cd2024e904623649 - artifacts

No results available

---
83a158e039422cd4b6773e9dc50c41268c122e8f - artifacts

Raw Benchmarks (localhost)

Scenario noq upstream Delta CPU (avg/max)
large-single 5375.3 Mbps 8180.3 Mbps -34.3% 96.4% / 129.0%
medium-concurrent 5420.1 Mbps 7457.0 Mbps -27.3% 91.9% / 97.3%
medium-single 3842.2 Mbps 4155.1 Mbps -7.5% 96.7% / 127.0%
small-concurrent 3814.9 Mbps 4996.3 Mbps -23.6% 93.7% / 102.0%
small-single 3392.7 Mbps 4053.9 Mbps -16.3% 84.9% / 95.7%

Netsim Benchmarks (network simulation)

Condition noq upstream Delta
ideal 3034.9 Mbps 3967.0 Mbps -23.5%
lan 782.4 Mbps 810.3 Mbps -3.4%
lossy 69.8 Mbps 69.9 Mbps ~0%
wan 83.8 Mbps 83.8 Mbps ~0%

Summary

noq is 23.6% slower on average

---
8793168f020876502c340e8b8755832e22d90fca - artifacts

Raw Benchmarks (localhost)

Scenario noq upstream Delta CPU (avg/max)
large-single 5792.7 Mbps 7823.0 Mbps -26.0% 95.9% / 105.0%
medium-concurrent 5287.1 Mbps 7970.3 Mbps -33.7% 92.7% / 104.0%
medium-single 4157.5 Mbps 4737.5 Mbps -12.2% 97.3% / 160.0%
small-concurrent 3907.9 Mbps 5121.1 Mbps -23.7% 90.7% / 98.8%
small-single 3557.2 Mbps 4675.8 Mbps -23.9% 94.7% / 108.0%

Summary

noq is 25.1% slower on average

---
2c616ad496abb80df45f403da7b3a1418f98e537 - artifacts

Raw Benchmarks (localhost)

Scenario noq upstream Delta CPU (avg/max)
large-single 5342.6 Mbps 7896.9 Mbps -32.3% 94.9% / 125.0%
medium-concurrent 5429.7 Mbps 7825.9 Mbps -30.6% 94.8% / 126.0%
medium-single 3610.2 Mbps 4749.0 Mbps -24.0% 92.0% / 102.0%
small-concurrent 3794.7 Mbps 5297.1 Mbps -28.4% 96.1% / 129.0%
small-single 3551.7 Mbps 4858.7 Mbps -26.9% 94.6% / 128.0%

Netsim Benchmarks (network simulation)

Condition noq upstream Delta
ideal 3092.4 Mbps 3603.9 Mbps -14.2%
lan 782.5 Mbps 796.4 Mbps -1.7%
lossy 69.8 Mbps 69.8 Mbps ~0%
wan 83.8 Mbps 83.8 Mbps ~0%

Summary

noq is 26.8% slower on average

---
8d218842889c6e2e80112ec00ada63d135d598c4 - artifacts

No results available

---
e69ecfb31cdbd7bbea687d1a49f7812e044f47a3 - artifacts

Raw Benchmarks (localhost)

Scenario noq upstream Delta CPU (avg/max)
large-single 5361.6 Mbps 7834.2 Mbps -31.6% 93.8% / 101.0%
medium-concurrent 5415.7 Mbps 7864.2 Mbps -31.1% 91.5% / 97.1%
medium-single 3784.3 Mbps 4561.3 Mbps -17.0% 91.4% / 98.8%
small-concurrent 3896.6 Mbps 5071.6 Mbps -23.2% 99.2% / 130.0%
small-single 3494.8 Mbps 4654.8 Mbps -24.9% 86.6% / 96.3%

Netsim Benchmarks (network simulation)

Condition noq upstream Delta
ideal 2905.2 Mbps N/A N/A
lan 775.1 Mbps N/A N/A
lossy 69.8 Mbps N/A N/A
wan 83.8 Mbps N/A N/A

Summary

noq is 26.8% slower on average

---
4fbdccc243c1eb04877b63520949c4de64c754bd - artifacts

No results available

---
185a6e067ff98d8c6ca6d7ddb11f02551e62b9d9 - artifacts

Raw Benchmarks (localhost)

Scenario noq upstream Delta CPU (avg/max)
large-single 5560.5 Mbps 8062.3 Mbps -31.0% 91.2% / 96.7%
medium-concurrent 5351.9 Mbps 7856.4 Mbps -31.9% 98.8% / 191.0%
medium-single 4618.5 Mbps 4749.6 Mbps -2.8% 90.3% / 97.6%
small-concurrent 3789.4 Mbps 5372.8 Mbps -29.5% 97.9% / 135.0%
small-single 3348.5 Mbps 4821.1 Mbps -30.5% 93.0% / 102.0%

Netsim Benchmarks (network simulation)

Condition noq upstream Delta
ideal 3083.4 Mbps N/A N/A
lan 782.5 Mbps N/A N/A
lossy 69.8 Mbps N/A N/A
wan 83.8 Mbps N/A N/A

Summary

noq is 26.5% slower on average

---
ce171940ac91339fd7fb6b32053dfbe66e56ad52 - artifacts

Raw Benchmarks (localhost)

Scenario noq upstream Delta CPU (avg/max)
large-single 5460.0 Mbps 7875.9 Mbps -30.7% 93.3% / 98.1%
medium-concurrent 5576.5 Mbps 7920.7 Mbps -29.6% 91.4% / 97.0%
medium-single 4120.6 Mbps 4596.9 Mbps -10.4% 94.8% / 107.0%
small-concurrent 3860.0 Mbps 5224.9 Mbps -26.1% 96.5% / 107.0%
small-single 3527.2 Mbps 4690.7 Mbps -24.8% 93.6% / 108.0%

Netsim Benchmarks (network simulation)

Condition noq upstream Delta
ideal 2863.0 Mbps 3631.3 Mbps -21.2%
lan 782.5 Mbps 796.4 Mbps -1.7%
lossy 69.8 Mbps 55.9 Mbps +25.0%
wan 83.8 Mbps 83.8 Mbps ~0%

Summary

noq is 24.5% slower on average

---
a948f32bdc33af9806c3074c622cc0a230e48f19 - artifacts

Raw Benchmarks (localhost)

Scenario noq upstream Delta CPU (avg/max)
large-single 6193.8 Mbps 7899.3 Mbps -21.6% 97.1% / 98.6%
medium-concurrent 5521.5 Mbps 7625.9 Mbps -27.6% 96.5% / 98.1%
medium-single 3862.2 Mbps 4189.1 Mbps -7.8% 96.6% / 98.6%
small-concurrent 4000.2 Mbps 5170.6 Mbps -22.6% 97.0% / 99.5%
small-single 3707.0 Mbps 4458.9 Mbps -16.9% 96.7% / 98.8%

Netsim Benchmarks (network simulation)

Condition noq upstream Delta
ideal N/A 3926.3 Mbps N/A
lan N/A 798.8 Mbps N/A
lossy N/A 69.8 Mbps N/A
wan N/A 83.8 Mbps N/A

Summary

noq is 20.6% slower on average

---
0c7dee4bc03990a3fb7dded05022bdab800f3993 - artifacts

Raw Benchmarks (localhost)

Scenario noq upstream Delta CPU (avg/max)
large-single 5456.2 Mbps 7818.6 Mbps -30.2% 94.8% / 104.0%
medium-concurrent 5467.4 Mbps 7966.8 Mbps -31.4% 95.4% / 109.0%
medium-single 4140.2 Mbps 4594.7 Mbps -9.9% 100.1% / 162.0%
small-concurrent 3956.7 Mbps 5217.5 Mbps -24.2% 91.1% / 99.1%
small-single 3491.8 Mbps 4727.0 Mbps -26.1% 95.3% / 109.0%

Netsim Benchmarks (network simulation)

Condition noq upstream Delta
ideal 2961.0 Mbps N/A N/A
lan 782.4 Mbps N/A N/A
lossy 69.8 Mbps N/A N/A
wan 83.8 Mbps N/A N/A

Summary

noq is 25.8% slower on average

---
536791ce61a59004a10c716d76f6eceb4023959f - artifacts

No results available

---
75629fb94db052d2c134eecdb84feb9fed10be42 - artifacts

No results available

---
21893a499bd682022b5acb87ddc0e08553bc8f2a - artifacts

Raw Benchmarks (localhost)

Scenario noq upstream Delta CPU (avg/max)
large-single 5209.2 Mbps 7949.8 Mbps -34.5% 91.3% / 97.4%
medium-concurrent 5417.3 Mbps 7850.7 Mbps -31.0% 92.8% / 97.4%
medium-single 3580.0 Mbps 4689.8 Mbps -23.7% 96.5% / 128.0%
small-concurrent 3846.4 Mbps 5192.9 Mbps -25.9% 88.3% / 98.3%
small-single 3417.6 Mbps 4714.5 Mbps -27.5% 92.9% / 102.0%

Netsim Benchmarks (network simulation)

Condition noq upstream Delta
ideal 3144.6 Mbps 3952.9 Mbps -20.4%
lan 782.4 Mbps 810.3 Mbps -3.4%
lossy 69.8 Mbps 69.8 Mbps ~0%
wan 83.8 Mbps 83.8 Mbps ~0%

Summary

noq is 27.6% slower on average

---
84c9e1177531cbd77e64ac469c60e550ed1b7f1b - artifacts

Raw Benchmarks (localhost)

Scenario noq upstream Delta CPU (avg/max)
large-single 5487.2 Mbps 8031.2 Mbps -31.7% 94.3% / 105.0%
medium-concurrent 5377.2 Mbps 7486.0 Mbps -28.2% 92.2% / 97.1%
medium-single 4067.9 Mbps 4749.2 Mbps -14.3% 86.6% / 95.8%
small-concurrent 3816.5 Mbps 5369.0 Mbps -28.9% 94.0% / 103.0%
small-single 3434.3 Mbps 4852.6 Mbps -29.2% 89.9% / 97.4%

Netsim Benchmarks (network simulation)

Condition noq upstream Delta
ideal 3065.8 Mbps 3897.5 Mbps -21.3%
lan 782.4 Mbps 805.9 Mbps -2.9%
lossy 69.8 Mbps 55.9 Mbps +25.0%
wan 83.8 Mbps 83.8 Mbps ~0%

Summary

noq is 25.9% slower on average

---
b24e9ef4011b0221e13f9f8edfa0aa5171f58409 - artifacts

No results available

---
24b4c76a40f16238faee0ff0b82047740cf32d6b - artifacts

Raw Benchmarks (localhost)

Scenario noq upstream Delta CPU (avg/max)
large-single 5669.7 Mbps 7928.0 Mbps -28.5% 92.1% / 97.3%
medium-concurrent 5496.3 Mbps 7404.3 Mbps -25.8% 98.7% / 181.0%
medium-single 4438.0 Mbps 4760.8 Mbps -6.8% 95.9% / 122.0%
small-concurrent 3840.3 Mbps 5341.5 Mbps -28.1% 95.9% / 125.0%
small-single 3503.6 Mbps 4713.2 Mbps -25.7% 92.4% / 102.0%

Netsim Benchmarks (network simulation)

Condition noq upstream Delta
ideal 2877.3 Mbps 3925.3 Mbps -26.7%
lan 782.1 Mbps 800.8 Mbps -2.3%
lossy 69.9 Mbps 55.9 Mbps +25.0%
wan 83.8 Mbps 83.8 Mbps ~0%

Summary

noq is 23.6% slower on average

---
d1566b5115c9c14868029188dd8c9a8ff1e0cdf1 - artifacts

Raw Benchmarks (localhost)

Scenario noq upstream Delta CPU (avg/max)
large-single 5807.2 Mbps 8064.7 Mbps -28.0% 92.6% / 97.7%
medium-concurrent 5290.2 Mbps 7795.9 Mbps -32.1% 93.7% / 105.0%
medium-single 3798.8 Mbps 4189.7 Mbps -9.3% 95.8% / 108.0%
small-concurrent 3883.3 Mbps 5055.1 Mbps -23.2% 96.4% / 111.0%
small-single 3527.9 Mbps 4467.7 Mbps -21.0% 93.1% / 108.0%

Netsim Benchmarks (network simulation)

Condition noq upstream Delta
ideal 3037.5 Mbps 3939.2 Mbps -22.9%
lan 782.5 Mbps 810.3 Mbps -3.4%
lossy 69.8 Mbps 55.9 Mbps +25.0%
wan 83.8 Mbps 83.8 Mbps ~0%

Summary

noq is 23.7% slower on average

@n0bot n0bot bot added this to iroh Mar 21, 2026
@github-project-automation github-project-automation bot moved this to 🚑 Needs Triage in iroh Mar 21, 2026
@divagant-martian (Collaborator)

checked locally and the failing tests just do not finish

@dignifiedquire (Contributor, Author)

should be better now

@github-project-automation github-project-automation bot moved this from 🚑 Needs Triage to ✅ Done in iroh Mar 23, 2026
@github-project-automation github-project-automation bot moved this from ✅ Done to 🚑 Needs Triage in iroh Mar 23, 2026
@dignifiedquire dignifiedquire added this to the noq: iroh v0.98 milestone Mar 23, 2026
dignifiedquire added a commit that referenced this pull request Mar 23, 2026
#527 is being reverted upstream (#534). The busy-loop was caused by the
PTO cap bug (#523), not swallowed ENETUNREACH. Proper fix needs a
notify_destination_unreachable API in noq-proto (deferred).
@flub (Collaborator) left a comment

Generally LGTM though I do have some questions.

Comment thread noq-proto/src/connection/mod.rs Outdated
Comment thread noq-proto/src/connection/mod.rs Outdated
Comment thread noq-proto/src/connection/mod.rs Outdated
Comment thread noq-proto/src/tests/multipath.rs Outdated
Comment thread noq-proto/src/tests/proptests.rs Outdated
@github-project-automation github-project-automation bot moved this from 🚑 Needs Triage to 🏗 In progress in iroh Mar 23, 2026
@dignifiedquire dignifiedquire requested a review from flub March 23, 2026 17:50
Comment thread noq-proto/src/connection/mod.rs Outdated
@dignifiedquire dignifiedquire requested a review from flub March 24, 2026 13:19
@divagant-martian (Collaborator) left a comment

I think I understand what the PR is aiming to solve, but I'm failing to understand the solution or why the approach is appropriate, beyond using picoquic as a sort of authoritative source.

Is the issue we are solving that PTO might become larger than the idle timeout? This is what I understood. I guess there are some other issues, but I'll focus on this one for the moment.

If this is the case, why divide by 16 only above 15s? This makes for ~1s PTOs for idle timeouts of 15s but 2s PTOs for idle timeouts of 14s. It's such a weird criterion and I'm finding it hard to justify.

Why not cap the "desired" PTO to the min of something like 50% of the idle timeout and 2s, ensuring one PTO can always fire before the idle timeout (regardless of value)? Basically, I'm having issues with the only if idle_timeout > 15s. I've intentionally left out the hard cap special case of large RTT paths, which I think is a fine upper bound instead of the 2s.

I went ahead and gave a try at swapping picoquic's weird criterion with the one I propose, and all tests, including your new ones, pass. I'm keen on understanding if there are objections to this, but for now I'm inclined toward an approach that seems to work, is easier to understand, and does not have an arbitrary application criterion.
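The reviewer's proposed alternative could be sketched like this. The function name `cap_pto_simple` is hypothetical; it just encodes the suggested `min(idle_timeout / 2, 2s)` rule, with no 15 s special case:

```rust
use std::time::Duration;

/// Hypothetical sketch of the reviewer-proposed cap: always bound the
/// PTO by min(idle_timeout / 2, 2s), so at least one probe can fire
/// before the idle timer regardless of the idle timeout's value.
fn cap_pto_simple(duration: Duration, idle_timeout: Option<Duration>) -> Duration {
    let mut cap = Duration::from_secs(2);
    if let Some(idle) = idle_timeout {
        cap = cap.min(idle / 2);
    }
    duration.min(cap)
}

fn main() {
    // 14s idle timeout: cap = min(7s, 2s) = 2s; no cliff at a 15s boundary.
    assert_eq!(
        cap_pto_simple(Duration::from_secs(8), Some(Duration::from_secs(14))),
        Duration::from_secs(2)
    );
    // Very short idle timeout (1s): cap = 500ms, so a probe still fires in time.
    assert_eq!(
        cap_pto_simple(Duration::from_secs(8), Some(Duration::from_secs(1))),
        Duration::from_millis(500)
    );
}
```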

Comment thread noq-proto/src/connection/mod.rs Outdated
Comment thread noq-proto/src/connection/mod.rs Outdated
Comment thread noq-proto/src/connection/mod.rs Outdated
@matheus23 (Member) left a comment

There is another bug that was hiding here, which this uncovers. Trying to fix it... but we should keep this workaround in this PR imho.

I disagree with you three.
There is definitely a bug in here. Why does noq-proto keep resending packets forever and never hit the idle timeout? This should not happen.
I'm looking at logs and will try to figure out the bug that this PR introduces.

Comment thread noq-proto/src/connection/mod.rs Outdated
Comment thread noq-proto/src/connection/mod.rs Outdated
@matheus23 matheus23 dismissed their stale review March 25, 2026 09:59

For some reason I had a hard time understanding y'all here.
I'm fine with keeping the workaround in this PR and fixing the underlying bug ASAP.

The danger is that together with this PR's changes to the loss timer, there will be situations in which users will have broken connections that never time out. That's bad.

But I do see how the underlying bug for this is also present on main.

Perhaps we should fix it on main first, then remove the workaround here, then merge this?

@dignifiedquire (Contributor, Author)

I swapped this for `now + duration` and all tests work; am I not running all the tests?

fyi @flub @divagant-martian @matheus23: this started breaking patchbay tests in the combined branch on degradation level 6, so I will go back to the previous version.

With `now + duration`, the PTO fires `duration` after the current poll,
regardless of when the last probe was sent. When the PTO cap compresses
`duration` below the path RTT (e.g. cap=1875ms vs RTT=1600ms at 800ms
one-way latency), consecutive PTO firings overlap with in-flight probes
that haven't had time to be acknowledged yet.

In patchbay degrade_ladder tests at level 6 (800ms latency, 20% loss,
20% reorder), this causes the client to exhaust its holepunch timeout
(15s) with spurious retransmissions before the PATH_CHALLENGE/RESPONSE
exchange completes. The server side is less affected because it does not
initiate holepunching.

Fix: use `max(last_ack_eliciting, now) + duration` so each PTO waits
at least `duration` from the last send, giving the peer a full RTT to
respond before the next probe.
This adopts a variant of the algorithm to cap the PTO interval, but keeps computing it from the `last_ack_eliciting` time to keep the computation stable, and adds documentation of why this matters.

It also includes some other small doc and comment improvements, and fixes the `SendableFrames::other` computation.
};

let backoff = 2u32.pow(path.pto_count.min(MAX_BACKOFF_EXPONENT));
let duration = path.rtt.pto_base() * backoff;
Collaborator
This affects the tail-loss probes during the handshake. There are tests for this and we did not have issues with that. So I have left this computation unchanged.

Member
Adding a comment as to why the handshake is a special case would be nice!

Collaborator
I... I don't know, other than the existing comment. I mostly left this untouched because it was already untouched, and because it was already always using `now + duration`, which I don't really follow. Possibly it is because the space is essentially fixed, so you can't get into trouble there.

In theory the handshake has the same issue: when a packet is lost, the exponential backoff will grow to large delays, and you probably give up when the idle timeout hits. I think the cap approach should be applicable there as well; it would increase the number of packets sent before giving up. OTOH it is not as needed, since this is mostly about improving how quickly you detect a loss, IIUC.

FWIW, I also don't know why this doesn't introduce any jitter in the PTO calculations.

Collaborator
PTAL: I've implemented a version that also uses the cap during the handshake. This otherwise keeps the same behaviour as before: counting from now. I think that behaviour is fine because when this clause kicks in there is only ever one space in use. But it is still weird to me. I also suspect the previous version was rather sad as well, resulting in way fewer tail-loss probes than would be desired during this time.

@flub (Collaborator) commented Mar 26, 2026

PTAL, I implemented a more gradual cap on the PTO interval. And adjusted the computation so that it computes the correct duration since the last ack-eliciting packet was sent.

@flub flub dismissed their stale review March 26, 2026 18:24

i made the changes myself now :)

@matheus23 (Member) left a comment
Only nits


Comment thread noq-proto/src/connection/mod.rs Outdated
Comment thread noq-proto/src/connection/mod.rs Outdated
Comment thread noq-proto/src/connection/mod.rs
@flub flub added this pull request to the merge queue Mar 30, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Mar 30, 2026
@flub flub added this pull request to the merge queue Mar 30, 2026
Merged via the queue into main with commit 50faca4 Mar 30, 2026
36 checks passed
@flub flub deleted the fix/pto-cap branch March 30, 2026 16:26
@github-project-automation github-project-automation bot moved this from 🏗 In progress to ✅ Done in iroh Mar 30, 2026