Reconsider network auto-update by default #214

Open
brycedrennan opened this issue Oct 25, 2020 · 4 comments

@brycedrennan
Collaborator

Wanting the latest dataset is understandable and often useful, but fetching it over the network by default can cause problems in some environments:

  • Ephemeral environments that cannot cache the network responses to disk, such as k8s tasks or other distributed systems. They refetch the list on every invocation.
  • Firewalled or no-connection environments. I believe the library still works in this case, but only after the delay of a failed HTTP connection.

Not sure what a solution would look like but here are some ideas:

  • Automate publishing the Python package on a schedule, each time with an updated tld_set.
  • Make the default non-autoupdating, but let the self-updating version be enabled easily via a function argument, something like use_latest or use_autoupdating.
  • Add a TTL to the cached version. For example, we could set it to 7 days, and the library would automatically refetch the list if the cached version was older than that.
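
To make the TTL idea concrete, here is a minimal sketch using only the standard library. The name cache_is_stale and its parameters are hypothetical and are not part of tldextract; the sketch only assumes the cached list is a file on disk whose modification time records when it was last fetched.

```python
import os
import time

SEVEN_DAYS = 7 * 24 * 60 * 60  # the 7-day TTL from the example, in seconds


def cache_is_stale(path, ttl_seconds=SEVEN_DAYS, now=None):
    """Return True if the cached file is missing or older than the TTL.

    Hypothetical helper: `path` is wherever the cached list lives.
    """
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(path)
    except OSError:
        return True  # a missing or unreadable cache counts as stale
    return age > ttl_seconds
```

A caller would refetch the list only when cache_is_stale(...) returns True, so an ephemeral container with a freshly baked cache would make no network call at all.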
@jayaddison

This sounds useful @brycedrennan - allowing workloads that use tldextract to be offline-friendly, stateless and side-effect-free within their container would be great.

Something that occurs to me is that this is a bit like an HTTP caching problem. We have a bundle of content (the TLD metadata) that the application might like to refresh after some time has elapsed. The content may already exist locally (bundled within the package, or in an existing cache), and/or we may want to download an updated version (particularly if the local content has expired). We may want to save the updated copy, or it may be discardable.

Rather than reimplement HTTP caching, perhaps it'd be worth investigating in-process Python libraries that could take the burden of disk caching off of tldextract's plate? (CacheControl looks promising - fairly standards-compliant and supports file-based storage)

From that point, it'd be possible to extract flags that enable content updates. If we know that a version of the TLD metadata is bundled in the package (which is fine since it's small), then the defaults could be expiry=None (interpretation: no expiry; use the packaged metadata by default) and cache=False (interpretation: for any content that is retrieved, do not cache the results by default).
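
As a rough illustration of how those two flags might interact, here is a sketch under stated assumptions. The function resolve_suffixes and the fetch and save callbacks are hypothetical stand-ins, not tldextract or CacheControl APIs; only the flag names expiry and cache and their default meanings come from the description above.

```python
import time
from typing import Callable, Optional


def resolve_suffixes(
    local: str,
    local_timestamp: float,
    fetch: Callable[[], str],
    save: Callable[[str], None],
    expiry: Optional[float] = None,
    cache: bool = False,
    now: Optional[float] = None,
) -> str:
    """Choose between the local copy and a fresh download (hypothetical)."""
    if expiry is None:
        return local  # no expiry: use the packaged data, never fetch
    now = time.time() if now is None else now
    if now - local_timestamp <= expiry:
        return local  # local copy is still within its lifetime
    fresh = fetch()  # local copy expired: download an update
    if cache:
        save(fresh)  # persist only when the caller opted in
    return fresh
```

With both defaults left alone, the function never calls fetch or save, which is the offline, side-effect-free behaviour discussed earlier in this thread.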

NB: Hopefully it'd be possible to retain tldextract's existing XDG-friendly behaviour by passing that configuration through to CacheControl.

@gwittel

gwittel commented Jan 25, 2022

Reviving this because it's so old. In our case, we run tldextract --update when building a container. We'd love for the cache to be static. That would also avoid custom work to make the cache writable by the runtime user (also a problem for us).

In the context of containers, I'd be happy with setting a TLDEXTRACT_NO_UPDATE=1 type environment variable. In that mode, it would use the cache, and error out if the cache is absent.
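
A minimal sketch of that mode, for discussion only: TLDEXTRACT_NO_UPDATE is the variable proposed above and is not an existing tldextract setting, and read_suffix_cache, CacheMissingError, and the fetch callback are hypothetical names. The sketch also simplifies the normal mode to "fetch only when the cache is absent".

```python
import os


class CacheMissingError(RuntimeError):
    """Raised when updates are disabled and no cache file exists."""


def updates_disabled(environ=None):
    # TLDEXTRACT_NO_UPDATE is the proposed (hypothetical) variable name.
    environ = os.environ if environ is None else environ
    return environ.get("TLDEXTRACT_NO_UPDATE") == "1"


def read_suffix_cache(cache_path, fetch, environ=None):
    """Return cached data; refuse to fetch when updates are disabled."""
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as f:
            return f.read()
    if updates_disabled(environ):
        raise CacheMissingError(
            f"no cache at {cache_path} and TLDEXTRACT_NO_UPDATE=1 is set"
        )
    return fetch()  # normal mode: fall back to the network
```

Failing loudly when the cache is absent means a misbuilt image is caught at startup rather than by a silent network call at runtime.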

@hauntsaninja
Contributor

hauntsaninja commented Feb 4, 2022

I ran into this as well. Would a PR adding an environment variable to disable auto-update be welcome?

I'd also be happy to add a GitHub Actions workflow or something similar to keep the file checked into the repo up to date.

@mosajjal

@brycedrennan, is it possible to have a variation of tldextract that works fully offline? I'm not super familiar with PyPI, but I've seen packages with variations in brackets, e.g.:

pip install tldextract[offline]

If that's possible, we can run a monthly job to repackage the offline variant and push it to PyPI. Users can then set things up on their side to pull the latest version.
