Data can be stale while data loader is unmodified #1148

Closed
timmolendijk opened this issue Mar 27, 2024 · 12 comments
Labels
enhancement New feature or request

Comments

@timmolendijk

timmolendijk commented Mar 27, 2024

I might be missing something here but I’m confused as to the suggested workflow for the following use case:

  • let’s say we build a dashboard for today’s weather;
  • it uses a single data loader that fetches its data from some third party weather API;
  • the request is constant but the response isn’t (a.k.a. the weather changes);
  • we set up some cronjob to rebuild the site every hour or so.

So what currently happens is that the weather data isn’t updated despite the rebuild, because the data loader hasn’t changed. This is obviously not the desired behavior in this scenario. How am I supposed to go about this? The only solution I can think of is manually deleting the data file from the cache before every build, while leaving the npm cache untouched to avoid breaking the site due to unexpected version bumps. Although this works, it also feels like a brittle hack.

An alternative approach — albeit equally “dirty” in my view — would seem to be to touch the data loader before every build, but that doesn’t appear to work in my environment (macOS) for reasons that I do not fully comprehend. (Manually editing and saving the data loader does work.)

Shouldn’t this workflow be more explicitly supported? Would be interested in learning others’ perspective on this. Thanks!

@timmolendijk timmolendijk added the enhancement New feature or request label Mar 27, 2024
@mbostock
Member

mbostock commented Mar 27, 2024

In the continuous deployment case, our expectation is that you don’t have any cache, or if you do, you’re selective about what you keep in the cache to control what gets built. So your initial idea of deleting (or not retaining) the cache is exactly what we recommend. I wouldn’t characterize this as manual, though: you can either delete the entire cache if you want to run all your data loaders, or equivalently prepopulate the cache with any data you don’t want to recompute. We do the latter in our CI using GitHub Actions, keying the cache on the current date so that our data isn’t refetched more than once per day even if we land many commits.
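To illustrate the date-keying idea, here is a minimal sketch of a cache key that changes once per day. The function name `dailyCacheKey` and the `data` prefix are hypothetical, and the sketch assumes the day boundary is taken in UTC; the actual CI configuration may differ.

```javascript
// Hypothetical helper: derive a cache key that changes once per (UTC) day.
function dailyCacheKey(now = new Date(), prefix = "data") {
  // toISOString() always reports UTC in the form YYYY-MM-DDTHH:mm:ss.sssZ,
  // so the first 10 characters are the calendar date.
  return `${prefix}-${now.toISOString().slice(0, 10)}`;
}

const morning = dailyCacheKey(new Date("2024-03-27T08:00:00Z"));
const evening = dailyCacheKey(new Date("2024-03-27T20:00:00Z"));
const nextDay = dailyCacheKey(new Date("2024-03-28T08:00:00Z"));

console.log(morning === evening); // same day: the cached data is reused
console.log(morning === nextDay); // new day: the key changes, so loaders re-run
```

Any number of builds on the same day share one key, so the cached data is restored; the first build of the next day misses the cache and refetches.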

@mbostock
Member

You can see an example of this caching technique here:

https://observablehq.com/framework/deploying#caching

@timmolendijk
Author

Thanks, I had seen that example indeed. I won’t be using GitHub Actions though, so just observable build and the file system for me.

@mbostock
Member

In that case, you could add another cron job that clears your cache (or parts of it) at the desired cadence, or clear it before building if you always want fresh data.
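A cron-driven cache clear could look roughly like the sketch below. It assumes the cache lives at docs/.observablehq/cache (the path used by the default template's clean script) and that cached outputs sit at the same relative path as the data file they belong to; `clearCache` and the file name `data/weather.json` are hypothetical. In a project whose package.json sets "type": "module", the script would need a .cjs extension.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Assumed cache location for a project whose source root is "docs".
const DEFAULT_CACHE_DIR = path.join("docs", ".observablehq", "cache");

// Remove the whole cache, or only the listed entries (paths relative to cacheDir).
// Returns the absolute paths that were actually removed.
function clearCache(cacheDir = DEFAULT_CACHE_DIR, only = null) {
  const targets = only === null ? [cacheDir] : only.map((rel) => path.join(cacheDir, rel));
  const removed = [];
  for (const target of targets) {
    if (!fs.existsSync(target)) continue; // nothing cached yet; not an error
    fs.rmSync(target, {recursive: true, force: true});
    removed.push(path.resolve(target));
  }
  return removed;
}

// Example (commented out so that loading this file has no side effects):
// drop only the cached weather data and keep everything else.
// clearCache(DEFAULT_CACHE_DIR, ["data/weather.json"]);
```

Calling `clearCache()` with no arguments before each build forces every data loader to run; passing a list keeps the rest of the cache, so only the named data is regenerated.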

@timmolendijk
Author

timmolendijk commented Mar 27, 2024

Alright, good to know that messing with the cache folder directly isn’t advised against. Thanks for the input!

@timmolendijk
Author

timmolendijk commented Mar 27, 2024

Figured out a solution. Just putting my experiences out here, as an n=1:

The preview server is very cool, but I find myself struggling with it for the same reasons I mentioned in my opening post: I often want to regenerate a data file, but the preview server won’t pick up on any changes in the loader (because there aren’t any), which makes me constantly second-guess whether I am truly looking at the most recent version of my project. I end up feeling more confident, and as a consequence faster, by just invalidating the cache and rebuilding dist explicitly.

I think what would make my life better is a command (e.g. observable reload) that flushes the data cache and reruns all data loaders, and nothing else (i.e. it wouldn’t do any site building). This would enable me to use the preview server with all its fancy hot-reloading, while also giving me full control over data file generation.

I’m sure there are some challenges with this suggestion (such as how reload would know which data is used by the site), so I imagine this is less straightforward than it sounds. But I wanted to share my thoughts regardless.

@mbostock
Member

I like that idea, and I think it’s doable. 👍

Does the command need to run all the data loaders greedily, or would it be enough to flush the data loader cache and only run the data loaders as needed on access? We could make this immediate for the data loaders that are referenced by the current page in preview (we already watch associated files), but I think it’d be nice to defer running other data loaders that are not needed for the current page.
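The difference between the two options can be sketched with a toy in-memory model. This is not Framework's implementation: `LoaderCache` and its methods are hypothetical, and the loaders are plain synchronous functions, whereas real data loaders are separate programs that can take minutes.

```javascript
// Toy model of a data loader cache; names and structure are illustrative only.
class LoaderCache {
  constructor(loaders) {
    this.loaders = loaders; // Map of name -> function producing data
    this.outputs = new Map(); // cached outputs, keyed by loader name
  }
  // Lazy access: run the loader only if no cached output exists.
  get(name) {
    if (!this.outputs.has(name)) this.outputs.set(name, this.loaders.get(name)());
    return this.outputs.get(name);
  }
  // Flush only: loaders re-run later, on first access.
  flush() {
    this.outputs.clear();
  }
  // Greedy variant: flush, then immediately re-run every loader.
  reloadAll() {
    this.flush();
    for (const name of this.loaders.keys()) this.get(name);
  }
}

const runs = []; // records which loaders ran, in order
const cache = new LoaderCache(new Map([
  ["weather", () => (runs.push("weather"), "sunny")],
  ["archive", () => (runs.push("archive"), "old data")]
]));

cache.get("weather"); // first access runs the weather loader
cache.flush();
cache.get("weather"); // lazy: only the weather loader re-runs
console.log(runs); // weather ran twice; archive has not run at all
cache.reloadAll();
console.log(runs); // greedy: both loaders have now run again
```

With the lazy variant, a loader that no visited page references is never run; the greedy variant pays for every loader up front but leaves nothing to trigger from the browser.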

@mbostock
Member

Clarifying one more point…

> An alternative approach, albeit equally “dirty” in my view, would seem to be to touch the data loader before every build, but that doesn’t appear to work in my environment (macOS) for reasons that I do not fully comprehend.

In build, we don’t consider modification times — we only consider whether or not the cached data loader output exists. That’s the useStale: true option here:

const loader = loaders.find(join("/", file), {useStale: true});

So like we discussed above, you can force the data loader to re-run by removing the cached output. It’s expected that touching the data loader won’t have any effect on build. We do this mainly because in CI (at least with GitHub Actions and its cache action), modification times typically aren’t meaningful, so cache existence is a more reliable test of intent.

In preview, however, I would like touching the data loader to trigger the data loader to re-run (rather than requiring you to change the source contents of the data loader). I’ve put up a fix for that in #1153.
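As a rough mental model of the two policies, consider the sketch below. It is a simplification, not Framework's code: `shouldRunLoader` is a hypothetical function, and the non-stale branch assumes a plain comparison of modification times, which is the behavior described as desirable for preview.

```javascript
const fs = require("node:fs");

// Simplified model of the two policies; not Framework's actual logic.
// useStale: true  -> run the loader only when no cached output exists (build).
// useStale: false -> also re-run when the loader is newer than its output (preview).
function shouldRunLoader(loaderPath, cachePath, {useStale = false} = {}) {
  if (!fs.existsSync(cachePath)) return true; // no cached output: must run
  if (useStale) return false; // output exists: accept it regardless of age
  const loaderTime = fs.statSync(loaderPath).mtimeMs;
  const cacheTime = fs.statSync(cachePath).mtimeMs;
  return loaderTime > cacheTime; // touching the loader makes it newer
}
```

Under this model, touching the loader changes the answer only when `useStale` is false, whereas deleting the cached output forces a run in both modes.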

@timmolendijk
Author

timmolendijk commented Mar 27, 2024

> Does the command need to run all the data loaders greedily, or would it be enough to flush the data loader cache and only run the data loaders as needed on access? We could make this immediate for the data loaders that are referenced by the current page in preview (we already watch associated files), but I think it’d be nice to defer running other data loaders that are not needed for the current page.

Flushing the data cache and deferred running of the loaders would be sufficient in principle. In principle, because in practice there’s an issue here I’ve been running into. Consider the following workflow:

  • run observable preview → site opens in browser
  • edit a data loader of a data file (that is used on the site) in my code editor
  • save changes → preview server notifies site in browser

<no fresh data file is being regenerated at this point!>

  • I would have to navigate to my browser and give focus to the tab in which my site runs for the updated data loader to start running
  • which I would like to verify so I then head back to my terminal (in editor)
  • my data loaders typically take a few minutes to run so I will stick around in my editor/terminal until I can confirm that the loader has successfully finished
  • now I can head to my browser for a second time to check out the result

As you can see, putting the in-browser site in charge of triggering data loaders creates some overhead in this process. Once again, the better approach for me at the current stage is to not use the preview server but instead explicitly flush the cache and rebuild the site.


> In preview, however, I would like touching the data loader to trigger the data loader to re-run (rather than requiring you to change the source contents of the data loader). I’ve put up a fix for that in #1153.

Ah, so as far as I understand, the behavior that I was expecting wasn’t actually there yet, but has now been addressed. That is wonderful! 🥳

@rbcavanaugh

Stumbling across this issue after struggling to understand why my preview site wasn't updating when source data updated. What is the suggested method to ensure that the cache is cleared? Is it simply to run rm -rf docs/.observablehq/cache before running npm run dev again?

@mbostock
Member

Yes @rbcavanaugh, or npm run clean if you’re using the default template (observable create).

"clean": "rimraf docs/.observablehq/cache",

@rbcavanaugh

fantastic - thanks!
