
File hash does not change after its content updates #188

Closed
h2jorm opened this issue Dec 10, 2017 · 40 comments
Labels: 💬 RFC Request For Comments

@h2jorm

h2jorm commented Dec 10, 2017

This is a 🙋 feature request.

🤔 Expected Behavior

An output file hash should change when its content updates.

😯 Current Behavior

File hash does not change.

🔦 Context

<!-- index.html -->
<html>
<body>
  <script src="./main.js"></script>
</body>
</html>

// main.js
console.log(1);

Building generates 'b695675d84099f097ec37d68c8c83fce.js':

parcel build --no-cache --no-minify index.html

Then change main.js:

// main.js
console.log(2);

Build again.

parcel build --no-cache --no-minify index.html

The JavaScript file name is still 'b695675d84099f097ec37d68c8c83fce.js'.
I am not sure whether this is the expected behavior. However, when I use webpack, the output file hash changes every time the content updates.

🌍 Your Environment

Parcel: v1.1.0
Node: v8.9.1
npm/Yarn: yarn v1.3.2
Operating System: macOS High Sierra 10.13.1
@DeMoorJasper
Member

DeMoorJasper commented Dec 10, 2017

As far as I know this is expected behaviour.
What does change when a file changes is the hash contained in the fragment that is used for bundling:

{
  "dependencies":[],
  "generated": {
    "js":"\"use strict\";\n\nalert('helo');"
  },
  "hash": "d99518b2c556df9c6c4d8a2e9bd72423" // <-- This changes on change
}

It is generated here in Asset.js:

generateHash() {
  return objectHash(this.generated);
}

The filename hash, however, is based on the following parameters (so builds, minified builds, and served files should have different cache names):

const OPTION_KEYS = ['publicURL', 'minify', 'hmr'];
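
A rough sketch of how a name hash could be derived from just those option values (an illustration under that assumption, not Parcel's actual implementation):

const crypto = require('crypto');

// hash only the picked option values; identical options always yield the
// same name, regardless of the asset's contents
function optionsHash(options) {
  const picked = ['publicURL', 'minify', 'hmr'].map(key => String(options[key]));
  return crypto.createHash('md5').update(picked.join('|')).digest('hex');
}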

@sudhakar

@davidnagli Could you reopen this issue please? IMHO, it's okay not to change the hash during a development build. But for a production build, if the hash doesn't change even when the content changes, then Cache-Control and ETag headers cannot be used effectively.

In my case, I put react.js, react-dom.js, etc. into a separate bundle, vendor.js, which rarely changes, so I set it to cache for 1 year. If I happen to add one or two more libs, I wouldn't be able to bust the cache, since the hash never changes and the browser thinks "I already have this file, no need to ask the server again" :(

@devongovett
Member

The filenames are not currently generated based on a hash of the contents. You could do something like versioning, e.g. http://mycdn.com/v1.0.0/somelib.js; when you publish a new version, the URL will change.
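
For example, assuming Parcel's --public-url option is used to point the generated asset URLs at a versioned CDN path, the build command might look like:

parcel build index.html --no-cache --public-url https://mycdn.com/v1.0.0/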

@sudhakar

Yes @devongovett, that's a good idea, but it would be extra configuration to achieve cache busting. I am happy with the current setup for now, but IMO it's nice to have it in the core, so users get cache busting for free!

@jouni-kantola

jouni-kantola commented Dec 11, 2017

Why close an issue that would clearly add value if it were fixed, @davidnagli?

From my point of view, hashes should always be based on content. Long-term caching should be content-based, not release-based.

@devongovett
Member

One other reason this will be hard to change is that the filenames need to be generated before the entire contents of a file are available, since assets are processed in parallel.

@devongovett devongovett reopened this Dec 11, 2017
@davidnagli
Contributor

Sorry for closing the issue incorrectly. It was my understanding that this was Parcel's expected behavior.

@devongovett So are we going to overhaul the caching system?

@davidnagli davidnagli added the 💬 RFC Request For Comments label Dec 12, 2017
@davidnagli davidnagli added this to Discussion in RFC via automation Dec 12, 2017
@davidnagli davidnagli moved this from Discussion to Design in RFC Dec 12, 2017
@eXon
Contributor

eXon commented Dec 12, 2017

Maybe it would be easier to add the hash to the URL as a query parameter instead of in the filename. That way you can easily cache my-app.js?hash-here without having to rename the real files. It's the best of both worlds.
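
For illustration, if the content hash were appended as a query string, the generated HTML might look like this (hypothetical output, not what Parcel currently emits):

<script src="/my-app.js?d99518b2c556df9c6c4d8a2e9bd72423"></script>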

@chee

chee commented Dec 12, 2017

What about creating a random string (build time, perhaps) and using it in the filename?

like

filename = `${hash}.${buildstamp}.${ext}`

With a buildstamp of Date.now().toString(36) you'd get a filename like:

d710beaad39d4ee3906c24983931b45b.jb47tk6c.js

You would get a cache-busted file with every build, and the contents of the file would not need to be known before the filename is generated.

(the buildstamp would be the same for every file in that build)
(the entry would not get stamped)
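
A minimal runnable sketch of that naming (hypothetical helper, not Parcel code):

// one buildstamp per build, shared by every output file
const buildstamp = Date.now().toString(36);

const bundleName = (hash, ext) => `${hash}.${buildstamp}.${ext}`;
// bundleName('d710beaad39d4ee3906c24983931b45b', 'js')
//   -> e.g. 'd710beaad39d4ee3906c24983931b45b.jb47tk6c.js'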

@DeMoorJasper
Member

DeMoorJasper commented Dec 17, 2017

If this is about releases causing cache issues for users, why not just append the version number found in package.json to the hash, or before the extension?
This would need to be implemented in Parcel, not in an after-build script or whatever.
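
A minimal sketch of that idea (hypothetical helper, not Parcel code):

// append the package.json version before the extension
const { version } = require('./package.json');
const bundleName = (hash, ext) => `${hash}.${version}.${ext}`;
// e.g. 'b695675d84099f097ec37d68c8c83fce.1.1.0.js'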

@jouni-kantola

This is a typical scenario I think should be supported. I'm addressing this from a web performance/UX perspective rather than DX.

  1. I have 1/n vendor bundles with filenames that include a hash
  2. I have n code-split bundles with application code, styles, etc.
  3. I fix a bug in an application module and release
  4. The client only needs to download the specific code-split bundle where I fixed the bug

A scenario like this could save the client from re-downloading hundreds of KiB. If the whole release were versioned, everything would be cache busted.

@chee

chee commented Dec 17, 2017

@DeMoorJasper I'm sure there are many people using npm to manage their codebase who aren't bumping their version number every time they make a change, because they aren't publishing it as a module.

Think of a continuous deployment setup where several people are merging pull requests into a main branch that's being built on the server and sent down the tubes.

They'd, as I would, want the cache to be busted per build rather than by the version number (which may not have changed).

When code-splitting, it'd be great for a built file's name to be the same as last time unless a dependency changed, so the same thing never needs to be redownloaded if it isn't changing.

@DeMoorJasper
Member

@chee it was just an idea; I totally forgot about browser caches and web performance, and I'm wondering how this would be implemented.
Now I'm leaning more towards your timestamp approach.

@vforvalerio87

This really should be fixed IMHO. I just adopted Parcel in a project, and every time I make any change to JS or CSS I have to manually add an incrementing number and change the reference in the HTML; otherwise, when I deploy to production (which has browser caching and a CDN), the server won't give me the updated version of those files.
In my opinion the best approach would be the content checksum.

@shawwn
Contributor

shawwn commented Dec 20, 2017 via email

@shawwn
Contributor

shawwn commented Dec 20, 2017

When code-splitting, it'd be great for a built file's name to be the same as last time unless a dependency changed, so the same thing never needs to be redownloaded if it isn't changing.

Good point. The query parameter should be the UTC mtime of the source asset file, not random. That will preserve caching.

One way to do this without breaking asset-level parallelism is to modify HTMLPackager and CSSPackager to scan for bundled URLs (/dist/${hash}.${ext}) and substitute in the query parameter (/dist/${hash}.${ext}?${mtime}).

That can happen as a post-processing step, after bundling. It should be possible to do this efficiently.

That will preserve all the previous advantages, like the fact that assets can generate bundled filenames without needing the contents of the other assets or querying the parent process.
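
A rough post-processing sketch of that idea (hypothetical helper, not Parcel's actual packager code), assuming a map from each bundled URL to its source file:

const fs = require('fs');

// append each source asset's mtime (in seconds) to its bundled URL
function stampBundleUrls(packagedFile, bundleUrlToSource) {
  let contents = fs.readFileSync(packagedFile, 'utf8');
  for (const [bundleUrl, sourcePath] of Object.entries(bundleUrlToSource)) {
    const mtime = Math.floor(fs.statSync(sourcePath).mtimeMs / 1000);
    contents = contents.split(bundleUrl).join(`${bundleUrl}?${mtime}`);
  }
  fs.writeFileSync(packagedFile, contents);
}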

@vforvalerio87

vforvalerio87 commented Dec 20, 2017

The mtime approach is undesirable, though, because it busts the cache for every asset every time. I'd rather stick to manually renaming assets in that case, because I don't want every user to re-download everything every time I deploy a website (possibly numerous times per day).

@shawwn
Contributor

shawwn commented Dec 20, 2017 via email

@jouni-kantola

To be honest, I'd much rather have a slow build where hashing is content-based than have users re-downloading assets when they shouldn't need to. What about a CLI flag?

@Munter

Munter commented Dec 25, 2017

Here are some learnings from Assetgraph, where we solved the same issue.

You absolutely want to do content hashing so you can achieve deterministic, content-addressable file names that lend themselves well to far-future cache expiry. A random build-specific hash busts the cache too often. Query parameters aren't always treated correctly by proxies between the server and client.

You do, however, not need to do content hashing many times. You can get away with doing it once, at the point where you know you are done making source code modifications and are ready to write out to disk.

The hash renaming must be done in a depth-first post-order graph traversal to ensure content hashes update all the way up to the entry points when deeply nested dependencies change. Any other traversal order will result in caching errors.

@chee

chee commented Dec 26, 2017

Query parameters aren't always treated correctly by proxies in between the server and client.

Some proxy software classifies anything with a query string as dynamic content, and so does not cache it at all. This is, for instance, Squid's default behaviour.

@benhutton

@devongovett what are you thinking about this one? Is asset fingerprinting something that you agree should be baked into Parcel's core? Is it something we should try to figure out how to add the correct hooks for, so it can be written as a plugin? Or is it something I should try to accomplish some other way, by writing a post-processor?

@shanebo

shanebo commented Jan 9, 2018

@devongovett, @benhutton and I are willing to invest some time and energy into this fingerprinting issue, but we don't want to head in an implementation direction you aren't a fan of. Would you be open to putting some thought into this with us so we can work on a PR or a plugin?

@shawwn
Contributor

shawwn commented Jan 9, 2018 via email

@devongovett
Member

@shanebo @benhutton I think this will be very difficult to achieve in the current parallel architecture. We can't know the content hash until all assets have been processed, but we need to know the final bundled URL during asset processing so URLs can be placed in the right places (e.g. CSS files linking to images, HTML files linking to CSS, etc.).

If you have suggestions for ways around this, or alternative hashing/versioning strategies, let me know!

@chee

chee commented Jan 10, 2018

@devongovett what if the files were generated as template files, with placeholders for the paths

<script src="{{main.js}}"></script>

then those are compiled afterwards in a separate operation?

(so the hash of the file would be taken over the content of the file with the template markers still in it)

The difficulty might be that if a code-split dependency changes but the parent does not, the parent would still need a cache bust even though nothing in it is different.
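
A rough sketch of filling those placeholders in a separate pass (hypothetical helper names, not Parcel's API), given a map from source names to final hashed filenames:

const fs = require('fs');

function fillPlaceholders(templatePath, outputPath, finalNames) {
  let html = fs.readFileSync(templatePath, 'utf8');
  // replace every {{name}} marker with the final hashed filename
  for (const [name, hashed] of Object.entries(finalNames)) {
    html = html.split(`{{${name}}}`).join(hashed);
  }
  fs.writeFileSync(outputPath, html);
}

// e.g. fillPlaceholders('dist/index.tpl.html', 'dist/index.html', { 'main.js': 'a1b2c3.js' })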

@Munter

Munter commented Jan 10, 2018

@chee oh no, please don't invent conventions that take the source code away from working in a browser :'(

@Munter

Munter commented Jan 10, 2018

@devongovett

We can't know the content hash until all assets have been processed, but we need to know the final bundled URL during asset processing so URLs can be placed in the right places (e.g. CSS files linking to images, HTML files linking to CSS, etc.).

Why do you need to know the final URL before all assets are processed? What if you just used the current names as temporary names? You could actually keep it like this in the development environment, since content-addressable URLs are more of a production feature.

When you do a production build you could just rename the files once more and update the references to them. Or does the build pipeline somehow lose the references to an asset in the middle of the pipeline?

@benhutton

Likely what needs to happen, either as a plugin or as part of the Parcel core, is some sort of post-processor that does the depth-first post-order traversal mentioned at the bottom of #188 (comment).

We assemble the full and final graph and do a quick walk through it, renaming files and then re-referencing them further up.

It should be relatively fast, but I agree that it only needs to happen in production. It could easily be hidden behind some sort of flag on the executable.

As to why we care about this particular strategy so much: it's the only thing that seems to work reliably with CDNs. (That we know of! If someone else has a better solution, I'm all ears.)

  • Query strings are non-ideal, as mentioned above in #188 (comment). They also don't play nicely with image transformation SaaS like imgix. But even if we did go with a query string, it would have the same post-processing concerns that we have here.
  • Versioning the whole directory is both error-prone and leads to overly aggressive invalidation.
  • ETags would be the remaining option, but that takes the CDN approach off the table, removing our assets from the edge locations and potentially bogging down our servers with a ton of unnecessary asset queries.

@shawwn, @shanebo and I will try hitting you up on slack later today to talk through this more.

@chee

chee commented Jan 10, 2018

@Munter I'm talking about doing this as part of the compile step, not as something the developer would have to do; the placeholders would go away.

@Munter

Munter commented Jan 10, 2018

@benhutton I just jumped on your Slack as well (@Munter). Feel free to ping me if you need any feedback on how we implemented this in assetgraph. I don't know if the models are close enough to each other to do the same, but it feels close from inspecting the sources here.

@DeMoorJasper
Member

DeMoorJasper commented Jan 10, 2018

@benhutton what about just keeping the current naming system (for initial and development naming) but renaming all files and references to the content hash at the end of a production build? It's sort of the same as what @chee and you suggested, but I'm pretty sure it'll be way easier to implement.

@benhutton

benhutton commented Jan 11, 2018

@DeMoorJasper I think that maybe we're talking about the same thing? Only change things for production, and do it at the end.

I don't think there is any way around doing a tree traversal, though. That is, I think that this algorithm will NOT work:

  1. Find the md5 hash of every file.
  2. Rename those files to include the hash.
  3. Go back and edit the files to include references to the new file names with the hashes.

Instead, we need to do the tree traversal that @Munter described.

  1. Find your graph.
  2. Find a leaf node.
  3. Hash, rename.
  4. Update references to that file.
  5. Walk the tree back up, repeating for each file.

The idea is that when any given node changes, all the nodes above it will end up changing too as the references trickle up. And any nodes that are NOT affected will NOT change. So you are busting exactly the right caches at the right time.

Here's the big principle: A file doesn't get edited after it gets hashed. The hash is of the FINAL content of that file.
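
A rough sketch of that post-order pass (hypothetical graph shape and helper, not Parcel's internals):

const crypto = require('crypto');
const md5 = text => crypto.createHash('md5').update(text).digest('hex');

// node: { name, ext, contents, children: [childNode, ...] }
function hashAndRename(node, seen = new Map()) {
  if (seen.has(node)) return seen.get(node);       // shared deps are hashed once
  for (const child of node.children) {
    const childName = hashAndRename(child, seen);  // children first (post-order)
    // rewrite this node's references to the child's final, hashed name
    node.contents = node.contents.split(child.name).join(childName);
  }
  // only now hash this node: the hash covers its FINAL contents
  const finalName = `${md5(node.contents)}.${node.ext}`;
  seen.set(node, finalName);
  return finalName;
}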

@DeMoorJasper does that make sense?
@Munter am I describing the algorithm you had in mind accurately?

@Munter

Munter commented Jan 11, 2018

@benhutton That is exactly the right algorithm, and you describe the correct reason for it.

This image always helps me visualise it best:

Traversal order: A C E D B H I G F

It's still important to start at your entry point(s) and just remember to put the hashing logic after child traversal. This is what we do in AssetGraph: https://github.com/assetgraph/assetgraph/blob/master/lib/AssetGraph.js#L445-L462

When you extend Parcel with multiple entry points, you probably want to keep track of seen assets to avoid double work as well.

@fritx

fritx commented Feb 9, 2018

Is there any workaround for now?

@augnustin

augnustin commented Mar 8, 2018

I completely 👍 the MD5 hash naming strategy and I'm glad this was the final pick. Parcel is beautiful in large part because it is plug and play, and it needs to remain that way!

Looking forward to seeing this available in production. Any idea whether this would be within a few months or much longer than that?

Cheers

@augnustin

augnustin commented Mar 15, 2018

I'd like to mention that IMHO this issue is top priority:

For now, my deploys are completely unpredictable. I try many things before I can serve the latest version of my assets. Among those:

  • assets rebuild
  • rm -rf public/* && assets rebuild
  • service nginx restart

Still I get unpredictable results...

This makes it unusable in a real production context.

@augnustin

OK, I fixed my issue by running rm -rf .cache. This might be a separate issue, but I'm reporting it here in case someone faces the same situation. I'll create the other issue when I have more reproducible results to share.

@devongovett
Member

devongovett commented Mar 21, 2018

Should be solved by #1025 which generates content-hashed filenames for static assets. Please help test using the master branch - a release will hopefully come next week!

RFC automation moved this from Design to Done Mar 21, 2018
@augnustin

Wonderful! Great responsiveness.

Definitely willing to test it as soon as it is released. If you can announce it here or in #1025, that would be perfect.

Cheers
