Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: in poll_send, return Ok(n) when we have no available addr, or else quinn will kill the endpoint #2226

Closed
ramfox opened this issue Apr 23, 2024 · 2 comments · Fixed by #2322
Assignees
Labels
bug Something isn't working c-iroh-net

Comments

@ramfox
Copy link
Contributor

ramfox commented Apr 23, 2024

No description provided.

@ramfox ramfox added this to the v0.15.0 milestone Apr 23, 2024
@ramfox ramfox added bug Something isn't working c-iroh-net labels Apr 23, 2024
@divagant-martian
Copy link
Contributor

Noticed this as well while checking quinn's code. Where exactly did you experience this thought? This is true as well for poll_recv which is the one that has been giving us more trouble

@ramfox ramfox changed the title bug: in poll_send, returning anything except Ok(n) results in quinn killing the endpoint bug: in poll_send, returning an Ready(Err(e)) & not Ok(n) results in quinn killing the endpoint Apr 24, 2024
@ramfox
Copy link
Contributor Author

ramfox commented Apr 24, 2024

This has been part of pasha's connection issues. We currently are in a strange state with discovery where (when joining a document) it is possible we attempt to dial before we have any addresses. That will be fixed in #2225 . But it illuminated this bug:

When we have no possible addresses, we return an error, and this shuts down the entire Endpoint

I'll take a look at the recv side as well.

@ramfox ramfox changed the title bug: in poll_send, returning an Ready(Err(e)) & not Ok(n) results in quinn killing the endpoint bug: in poll_send, return Ok(n) when we have no available addr, or else quinn will kill the endpoint Apr 24, 2024
@ramfox ramfox modified the milestones: v0.15.0, v0.16.0 Apr 26, 2024
github-merge-queue bot pushed a commit that referenced this issue Apr 26, 2024
…2225)

## Description

A bug in the discovery flow assumed that if we had a mapped quic address
for a `node_id`, then we had at least a relay URL or one direct address
available for that node.

This meant there were instances in which discovery should have been
launched before attempting to dial the node, but was never launched,
leading to `no UDP or relay address available for node` errors.

Now we check if addresses are available for a `node_id` and launch
discovery if we do not have any before we attempt to dial.

We also now take into account the "alive" status of any relay URL
information we have on a `node_id` when determining if we need to run
discovery for that `node_id`

### tests
This also refactors the test DNS server and the test Pkarr relay server
to use `CleanupDropGuard` for resource cleanup. It also moves the
functions that launch those servers into the `iroh-net::test_utils`
crate. This ended up being an unnecessary refactor (I ended up writing
the test in the `discovery` crate anyway), but I left it in case we need
to do other tests that rely on discovery outside of the `discovery`
crate.

## Notes & open questions

The above issue uncovered a more serious bug: the endpoint currently
dies when it attempts to dial a node without any available address
information because we return the `no UDP or relay address available for
node` error. We should not do this. In that situation, we should let
that connection timeout or launch discovery services inside the
magicsocket. That bug (#2226) is not fixed in this PR.

## Breaking changes
- Created new public struct `RelayUrlInfo` that combines the relay_url
and additional information about the state of our connection to the
remote node at this relay URL:
```
struct RelayUrlInfo {
    relay_url: RelayUrl,
    /// How long since there has been activity on this relay url
    last_alive: Option<Duration>,
    /// Latest latency information for this relay url
    latency: Option<Duration>,
}
```
- `NodeInfo.relay_url` (called `ConnectionInfo` outside of `iroh-net`)
is now `Option<RelayUrlInfo>`, changed from `Option<RelayUrl>`

## Change checklist

- [x] Self-review.
- [x] Documentation updates if relevant.
- [x] Tests if relevant.

---------

Co-authored-by: Kasey <klhuizinga@gmail.com>
@ramfox ramfox self-assigned this May 1, 2024
@ramfox ramfox modified the milestones: v0.16.0, v0.17.0 May 13, 2024
@dignifiedquire dignifiedquire removed this from the v0.17.0 milestone May 22, 2024
github-merge-queue bot pushed a commit that referenced this issue Jun 6, 2024
…r direct addresses in `poll_send` (#2322)

## Description

If we have no relay URL or addresses to send transmits for a NodeID in
`poll_send`, what do we do?

Returning `Polling::Ready(Err(e))` causes the endpoint to error, which
causes all connections to fail.

If we return `Polling::Pending` (in this case), we have no mechanism for
waking the waker once the `poll_send` is returned. Also, even if we wake
up and continue to poll, we will attempt to send the same transmits that
we know we cannot send.

If we return `Polling::Ready(Ok(0))`, we will get into a loop in Quinn
that attempts to keep re-sending the same transmits.

However, if we report back to Quinn that we *have* sent those transmits
(by returning `Polling::Ready(Ok(n))`), then Quinn will move on and
attempt to send new transmits. QUIC mechanisms might cause those
transmits to be re-sent when we get no ACKs for them, but eventually,
the connection will time out.

closes #2226 

## Change checklist

- [x] Self-review.
- [x] Documentation updates if relevant.
- [x] Tests if relevant.
- [x] All breaking changes documented.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working c-iroh-net
Projects
Archived in project
3 participants