Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kad: get_providers() sometimes resolves with FinishedWithNoAdditionalRecord and no closest peers #5418

Open
nazar-pc opened this issue May 25, 2024 · 7 comments

Comments

@nazar-pc
Copy link
Contributor

nazar-pc commented May 25, 2024

I have an app that does a lot of get_providers() calls via Kademlia.

Most of the time they finish successfully, but occasionally the first and only query result is GetProvidersOk::FinishedWithNoAdditionalRecord with empty list of closest peers, which seems wrong both because the key is actually present on the network and due to closest peers list being empty.

It would be reasonable if the key was not found and I got a list of closest peers, but the fact that list of closest peers is empty while Kademlia k-buckets are not empty tells me something is off here, but I'm having a hard time tracing what, where and why.

Sometimes I first get GetProvidersOk::FoundProviders with empty list followed by GetProvidersOk::FinishedWithNoAdditionalRecord with a single peer ID, again for a key that definitely exists on the network.

Any suggestions for debugging steps here?

libp2p-kad 0.45.3 (latest at the moment of writing), disjoint_query_paths: true

@guillaumemichel
Copy link
Contributor

I am not very familiar with the disjoint paths query. It seems to be because the ClosestPeersIters are empty, but I am not sure why?

Does the same issue occur if you use another query routing technique (closest or fixed)?

  • Is your node connected with the rest of the network when the query fails?
  • Is it possible that the querying node is storing the provider record itself, and that all entries are expired?
  • Do you enable Caching? If yes, was the key requested already by the node?

@nazar-pc
Copy link
Contributor Author

Does the same issue occur if you use another query routing technique (closest or fixed)?

I'm just calling https://docs.rs/libp2p-kad/0.45.3/libp2p_kad/struct.Behaviour.html#method.get_providers in this case. I guess I can try to modify the code to try it out, but it is slower to test that way.

Is your node connected with the rest of the network when the query fails?

I'm quite sure that I am. There are many concurrent queries happening and only a few end up like this and not even every time, sometimes I have 1000 successful in a row.

Is it possible that the querying node is storing the provider record itself, and that all entries are expired?

Not possible, the node making requests in thsi particular situation is such that physically and by design has no records.

Do you enable Caching? If yes, was the key requested already by the node?

I did not and wasn't even aware of such cache 😅

@guillaumemichel
Copy link
Contributor

Do you enable Caching? If yes, was the key requested already by the node?

I did not and wasn't even aware of such cache 😅

By default you should get 1 peer in the FinishedWithNoAdditionalRecord.

Does the same issue occur if you use another query routing technique (closest or fixed)?

I'm just calling https://docs.rs/libp2p-kad/0.45.3/libp2p_kad/struct.Behaviour.html#method.get_providers in this case. I guess I can try to modify the code to try it out, but it is slower to test that way.

You can simply try to set disjoint_query_paths: false, and see if the problem still occurs.

Is the code you are working on public?

@nazar-pc
Copy link
Contributor Author

You can simply try to set disjoint_query_paths: false, and see if the problem still occurs.

I will.

Is the code you are working on public?

Yes, all of it, but it is a big project and might be a bit tricky as a demo of this issue. Let me do some experiments and post results here.

@nazar-pc
Copy link
Contributor Author

It does happen with and without disjoint query paths.

A smaller reproduction using https://github.com/subspace/subspace/tree/debug-piece-not-found-libp2p-wip is this:

cargo run --example benchmark -- --bootstrap-nodes=/dns/bootstrap-0.gemini-3h.subspace.network/tcp/30533/p2p/12D3KooWK7NuL4S6aEdy5gELnvhCGo6EyrWVARnBy7W4AJTVkaF1 --bootstrap-nodes=/dns/bootstrap-1.gemini-3h.subspace.network/tcp/30533/p2p/12D3KooWQK33n2raSXzjH8JyqbFtczBmbwZiK9Tpicdw3rveJesj --protocol-version=0c121c75f4ef450f40619e1fca9d1e8e7fbabc42c895bc4790801e85d5a91c34 --out-peers=100 --pending-out-peers=100 parallel --parallelism-level=100 --max-pieces=256 --start-with=256 --retries=0

You'll see some Piece not found and for some of them (not all, keys are printed in logs) you'll see the only records matched by key are these:

2024-05-30T14:37:33.057585Z ERROR subspace_networking::node_runner: Get providers query yielded no results key="90b2ce0508c101000000000000" closest_peers=0 closest_peers=[]
2024-05-30T14:37:33.057618Z  WARN benchmark: Piece not found. piece_index=449 key="90b2ce0508c101000000000000" pure_duration=9.319213208s full_duration=180.36635061s

While expectation is to see something like this (another one that was not found):

2024-05-30T14:37:27.970135Z ERROR subspace_networking::node_runner: Get providers query yielded 0 results key="90b2ce05089201000000000000"
2024-05-30T14:37:30.060845Z ERROR subspace_networking::node_runner: Get providers query yielded no results key="90b2ce05089201000000000000" closest_peers=1 closest_peers=[PeerId("12D3KooWLjVBE9kzygAVTL1qHfEF8WpupjkUqgqnpUYZTRPvjL3y")]
2024-05-30T14:37:50.274517Z  WARN benchmark: Piece not found. piece_index=402 key="90b2ce05089201000000000000" pure_duration=30.749968564s full_duration=197.583262458s

Error logs are not errors, I just made them errors so I have different colors 🙃

subspace-networking is the crate that encapsulates our libp2p-based networking stack used for these examples.

@guillaumemichel
Copy link
Contributor

What is the size of the network?

Given the high duration of the lookups (I am not sure which duration should be considered), it seems like the node was able to establish some connections and perform the lookup even though it didn't find the record. So it should have some peers to return in the GetProvidersOk::FinishedWithNoAdditionalRecord.

@nazar-pc
Copy link
Contributor Author

What is the size of the network?

We have over 8000 consensus nodes (clients) and at least the same number (likely higher) farmers that are serving data to clients.

So it should have some peers to return in the GetProvidersOk::FinishedWithNoAdditionalRecord.

That is my understanding as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants