
ipns: only use cache for prefetching then do confirmation with actual dht #545

Open
Jorropo opened this issue Jan 11, 2024 · 5 comments · Fixed by ipfs/kubo#10321
Labels
effort/hours (Estimated to take one or several hours) · P2 (Medium: Good to have, but can wait until someone steps up)

Comments

@Jorropo (Contributor) commented Jan 11, 2024

I had a report from a user complaining about stale IPNS records.

After trying out a small repro on my machine, it is fairly easy to have 1h+ sync times because we will cache IPNS records for 1h without ever challenging them.

The IPNS code is surprising: it implements rollback-based resolution. It can pick up poor candidates early and start the next recursive resolution, then, when better candidates arrive, roll back and reconcile the previous name resolution with the newer, better results (or keep the existing result if the better candidates line up with the existing timeline).
This allows it to optimistically start with bad candidates and later either validate them, or cancel and restart with better ones.

I think the cache should be used as a first untrusted candidate, so when a cache entry is found, we add it as the first candidate in the rollback resolution process, but then we proceed with the rollback resolution as usual.

This would also combine nicely with #397 (although either one on its own is still a good improvement).
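
To make that concrete, here is a minimal sketch of the idea (hypothetical names, not the actual boxo API): the cached record is emitted immediately as an unconfirmed candidate, and the real lookup still runs and can confirm or supersede it.

package namesketch

import "context"

// Candidate is an illustrative stand-in for one resolved IPNS record.
type Candidate struct {
	Path     string
	Sequence uint64
}

// resolveWithCache sketches "cache as first untrusted candidate": a cached
// record is emitted right away (so prefetching can start), but the actual
// lookup still runs and its results can confirm or replace the cached one.
func resolveWithCache(ctx context.Context, name string, cache map[string]Candidate,
	lookup func(context.Context, string) <-chan Candidate) <-chan Candidate {
	out := make(chan Candidate)
	go func() {
		defer close(out)
		if c, ok := cache[name]; ok {
			out <- c // unconfirmed candidate from the cache
		}
		for c := range lookup(ctx, name) {
			out <- c // candidates confirmed by the actual resolution
		}
	}()
	return out
}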

@Jorropo added the need/triage (Needs initial labeling and prioritization) label Jan 11, 2024
@Jorropo (Contributor, Author) commented Jan 11, 2024

For anyone interested in fixing this:
The rollback resolution code starts at:

func resolveAsync(ctx context.Context, r resolver, p path.Path, options ResolveOptions) <-chan AsyncResult {

Through interfaces it then calls back into:

func (ns *namesys) resolveOnceAsync(ctx context.Context, p path.Path, options ResolveOptions) <-chan AsyncResult {

Here is where the cache is checked:

boxo/namesys/namesys.go

Lines 175 to 181 in 76d9292

if resolvedBase, ttl, lastMod, ok := ns.cacheGet(resolvablePath.String()); ok {
p, err = joinPaths(resolvedBase, p)
span.SetAttributes(attribute.Bool("CacheHit", true))
span.RecordError(err)
out <- AsyncResult{Path: p, TTL: ttl, LastMod: lastMod, Err: err}
close(out)
return out

See the return out: I think the early return should be removed so that the code continues execution below (after writing the cached entry to the channel).
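
For illustration, a rough sketch of that change (untested, just to show the shape; variable names taken from the excerpt above):

if resolvedBase, ttl, lastMod, ok := ns.cacheGet(resolvablePath.String()); ok {
	// Emit the cached entry as an early, unconfirmed candidate.
	// Note: do not overwrite p here, the resolution below still needs it.
	cachedPath, cacheErr := joinPaths(resolvedBase, p)
	span.SetAttributes(attribute.Bool("CacheHit", true))
	span.RecordError(cacheErr)
	out <- AsyncResult{Path: cachedPath, TTL: ttl, LastMod: lastMod, Err: cacheErr}
	// close(out) and return out removed: execution continues below, so the
	// real lookup can confirm or supersede the cached candidate.
}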

@lidel (Member) commented Jan 12, 2024

it is fairly easy to have 1h+ sync times because we will cache IPNS records for 1h without ever challenging them.

When one publishes an IPNS record with ipfs name publish, the implicit default caching hint is --ttl 1h,
which seems to align with your observation (a "stale" record is cached for ~1h before we check for an update).

What was the TTL of the record you used in your test?
Do you still see the "~1h sync problem" if you resolve an IPNS name published with --ttl set to 1m?

My mental model is that IPNS caching based on TTL is a feature similar to TTL in DNS records.
Services should leverage the TTL for caching, avoid work, and only check for updates when the TTL window expires.
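
In code, that model is roughly the following (illustrative only, not the actual boxo code):

import "time"

// cacheStillValid reports whether a cached result can be reused without a
// lookup, the same way a DNS resolver honours a record's TTL window.
func cacheStillValid(lastMod time.Time, ttl time.Duration) bool {
	return time.Since(lastMod) < ttl
}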

I think the cache should be used as a first untrusted candidate, so when a cache entry is found, we add it as the first candidate in the rollback resolution process, but then we proceed with the rollback resolution as usual.

Are you suggesting we should always do lookup for updated records, even when we have a valid record cached with TTL still being in effect? Or did I miss some nuance around rollback?

cc @hacdias since we made TTL-related cache changes around Kubo 0.24; I believe our intention was to avoid doing a lookup for updates if we have had a valid result in cache for less than the TTL window.

@Jorropo (Contributor, Author) commented Jan 12, 2024

Do you still see the "~1h sync problem" if you resolve an IPNS name published with --ttl set to 1m?

No. However, this is not controlled by the user in question: they are running a gateway, and their users are complaining that IPNS records don't update as fast as on ipfs.io. My guess is that their users are being load-balanced to different ipfs.io instances and thus not experiencing the caching.

My mental model is that IPNS caching based on TTL is a feature similar to TTL in DNS records.
Services should leverage the TTL for caching, avoid work, and only check for updates when the TTL window expires.

Then should ipfs name publish not have a default --ttl, and instead ask the user to provide one?
Or maybe the default should be 1m?

Are you suggesting we should always do lookup for updated records, even when we have a valid record cached with TTL still being in effect? Or did I miss some nuance around rollback?

Yes. A hard cache of 10s to 1m could still be useful, but yes.

@hacdias (Member) commented Jan 16, 2024

cc @hacdias since we made TTL-related cache changes around Kubo 0.24; I believe our intention was to avoid doing a lookup for updates if we have had a valid result in cache for less than the TTL window.

Yes, that was the goal.

Then should ipfs name publish not have a default --ttl, and instead ask the user to provide one?
Or maybe the default should be 1m?

This has already been discussed: ipfs/specs#371. The idea was to make it similar to what is already expected from DNS, for example. I don't think the default is bad. The publisher of the website should be responsible for setting the best TTL for their use case. If they are changing the website all the time, they should also create records with a very short TTL.

@hacdias added the P2 (Medium: Good to have, but can wait until someone steps up) and effort/hours (Estimated to take one or several hours) labels, and removed the need/triage (Needs initial labeling and prioritization) label Jan 23, 2024
@hacdias (Member) commented Jan 23, 2024

Triage notes (updated by @lidel):

  • Caching is useful when used correctly, and should NOT be disabled by default

  • The TTL of an IPNS record is separate from its signature expiration date. Both are controlled by the publisher, not the gateway, but only the TTL can be adjusted on the gateway (just like a DNS record's TTL is up to the publisher, not the DNS server, yet a resolver can put a "max ceiling" on it, as allowed by RFC 2181)

  • What eth.limo really wants is to ignore the TTL; to do that they need to either disable IPNS caching or set a "max cache TTL" similar to DNS.MaxCacheTTL.

  • Next step here:

    • Firstly, we should add IPNS.MaxCacheTTL and make it work the same way the already existing DNS.MaxCacheTTL does (this should be part of boxo/namesys, as an optional feature that puts a ceiling on the TTL used for caching of IPNS Records); see the sketch below this list.
    • We could also change the config option Ipns.ResolveCacheSize to the Priority type, so that caching can be disabled by setting the JSON value to false, but IPNS.MaxCacheTTL feels more useful (operators can still set it to something low, like 1 minute)
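
A minimal sketch of that ceiling (IPNS.MaxCacheTTL is the assumed option name; the behaviour mirrors the DNS.MaxCacheTTL description above):

// cacheTTL clamps the publisher-provided TTL to the gateway's configured
// ceiling before it is used for caching; 0 means "no ceiling".
func cacheTTL(recordTTL, maxCacheTTL time.Duration) time.Duration {
	if maxCacheTTL > 0 && recordTTL > maxCacheTTL {
		return maxCacheTTL
	}
	return recordTTL
}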

@hacdias self-assigned this Jan 23, 2024
@hacdias moved this to 🥞 Todo in IPFS Shipyard Team Jan 23, 2024
@hacdias moved this from 🥞 Todo to 🏃‍♀️ In Progress in IPFS Shipyard Team Jan 25, 2024
@github-project-automation bot moved this from 🏃‍♀️ In Progress to 🎉 Done in IPFS Shipyard Team Jan 26, 2024
@Jorropo reopened this Jan 26, 2024
@hacdias moved this from 🎉 Done to 🥞 Todo in IPFS Shipyard Team Jan 26, 2024
@hacdias removed their assignment Jan 26, 2024