File hash does not change after its content updates #188

Closed
liqingfx opened this Issue Dec 10, 2017 · 40 comments

@liqingfx

liqingfx commented Dec 10, 2017

This is a 🙋 feature request.

馃 Expected Behavior

An output file hash should change when its content updates.

😯 Current Behavior

File hash does not change.

🔦 Context

<!-- index.html -->
<html>
<body>
  <script src="./main.js"></script>
</body>
</html>
// main.js
console.log(1);

When I build, 'b695675d84099f097ec37d68c8c83fce.js' is generated.

parcel build --no-cache --no-minify index.html

And then, change main.js

// main.js
console.log(2);

Build again.

parcel build --no-cache --no-minify index.html

The JavaScript file name is still 'b695675d84099f097ec37d68c8c83fce.js'.
I am not sure whether this is the expected behavior. However, when I use webpack, the output file hash changes every time its content updates.

🌍 Your Environment

Software            Version(s)
Parcel              v1.1.0
Node                v8.9.1
npm/Yarn            yarn v1.3.2
Operating System    macOS High Sierra 10.13.1
@DeMoorJasper

Member

DeMoorJasper commented Dec 10, 2017

As far as I know this is expected behaviour.
What does change on file change is the hash contained in the fragment that is used for bundling:

{
  "dependencies":[],
  "generated": {
    "js":"\"use strict\";\n\nalert('helo');"
  },
  "hash": "d99518b2c556df9c6c4d8a2e9bd72423" // <-- This changes on change
}

Being generated here in Asset.js

generateHash() {
  return objectHash(this.generated);
}

The filename hash is, however, based on the following parameters (so build, minify, and serve files should have different cache names):

const OPTION_KEYS = ['publicURL', 'minify', 'hmr'];

@davidnagli davidnagli added 🐛 Bug and removed 🐛 Bug labels Dec 10, 2017

@davidnagli davidnagli closed this Dec 11, 2017

@sudhakar

sudhakar commented Dec 11, 2017

@davidnagli Could you reopen this issue please? IMHO, it's okay not to change the hash during a development build. But for a production build, if the hash doesn't change even when the content changes, then Cache-Control & ETag headers cannot be used effectively.

In my case, I put react.js, react-dom.js, etc. into a separate bundle, vendor.js, which rarely changes, so I set it to be cached for 1 year. If I add one or two more libs, I wouldn't be able to bust the cache, as the hash never changes and the browser thinks "I already have this file & no need to ask the server again" :(

@devongovett

Member

devongovett commented Dec 11, 2017

The filenames are not currently generated based on a hash of the contents. You could do something like versioning, e.g. http://mycdn.com/v1.0.0/somelib.js. When you publish a new version, the URL will change.

@sudhakar

sudhakar commented Dec 11, 2017

Yes @devongovett, that's a good idea, but it would be extra configuration to achieve cache busting. I am happy with the current setup for now. But IMO, it's nice to have this in the core, so users get cache busting for free!

@jouni-kantola

jouni-kantola commented Dec 11, 2017

Why close an issue that clearly adds value if it would be fixed, @davidnagli?

From my point of view, hashes should always be based on content. Long-term caching should be content based not release based.

@devongovett

Member

devongovett commented Dec 11, 2017

One other reason this will be hard to change is that the filenames need to be generated before the entire contents of the file are available, since assets are processed in parallel.

@devongovett devongovett reopened this Dec 11, 2017

@davidnagli

Member

davidnagli commented Dec 11, 2017

Sorry for closing the issue incorrectly. It was my understanding that this was Parcel's expected behavior.

@devongovett So are we going to overhaul the caching system?

@davidnagli davidnagli added this to Discussion in RFC via automation Dec 12, 2017

@davidnagli davidnagli moved this from Discussion to Design in RFC Dec 12, 2017

@eXon

Contributor

eXon commented Dec 12, 2017

Maybe it would be easier to add the hash to the URL at the query level instead of in the filename. That way you can easily cache my-app.js?hash-here without having to rename the files for real. It's the best of both worlds.

@chee

Contributor

chee commented Dec 12, 2017

What about creating a random string (build time, perhaps) and using it in the filename?

like

filename = `${hash}.${buildstamp}.${ext}`

With a buildstamp of Date.now().toString(36) you'd get a filename like:

d710beaad39d4ee3906c24983931b45b.jb47tk6c.js

You would get a cachebust file with every build, and the contents of the file would not need to be known before the filename.

(the buildstamp would be the same for every file in that build)
(entry would not get stamped)
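A rough sketch of that scheme (hypothetical; `stampedFilename` is an illustrative name, not a Parcel API):

```javascript
// Hypothetical sketch of the buildstamp naming idea: one timestamp is
// computed once per build and appended to every output filename, so every
// build busts the cache without needing the file contents up front.
const buildstamp = Date.now().toString(36);

function stampedFilename(hash, ext) {
  return `${hash}.${buildstamp}.${ext}`;
}
```

Every asset in a single build shares the same `buildstamp`, so a deploy invalidates all assets at once, which is the trade-off debated later in the thread.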

@DeMoorJasper

Member

DeMoorJasper commented Dec 17, 2017

If this is about releasing code/cache issues with users, why not just append the version number found in package.json to the hash, or before the extension?
This would need to be implemented in Parcel, not in an after-build script or whatever.

@jouni-kantola

jouni-kantola commented Dec 17, 2017

This is a typical scenario I think should be supported. I'm addressing this from a web performance/UX perspective rather than DX.

  1. I have 1/n vendor bundles with filenames including a hash
  2. I have n code-split bundles with application code, styles, etc.
  3. I fix a bug in an application module, and release
  4. Client only needs to download the specific code-split bundle where I fixed the bug

A scenario like this could save the client from re-downloading hundreds of KiBs. If the whole release were versioned, everything would be cache-busted.

@chee

Contributor

chee commented Dec 17, 2017

@DeMoorJasper I'm sure there are many people using npm for managing their codebase who aren't bumping their version number every time they make a change, because they aren't publishing it as a module.

Think of a continuous deployment setup where there are several people merging pull requests into a main branch that's being built on the server and sent down the tubes.

They'd, as I would, want the cache to be busted per build rather than by the version number (which may not have changed).

When code-splitting, it'd be great for a built file's name to be the same as last time unless a dependency changed, so the same thing never needs to be redownloaded if it isn't changing.

@DeMoorJasper

Member

DeMoorJasper commented Dec 17, 2017

@chee it was just an idea; I totally forgot about browser cache and web performance. Wondering how this would be implemented. Now I'm leaning more towards your timestamp approach.

@vforvalerio87

vforvalerio87 commented Dec 20, 2017

This really should be fixed IMHO; I just adopted Parcel in a project, and every time I make any change to JS or CSS I have to manually add a progressive number and change the reference in the HTML. Otherwise, when I deploy to production (which has browser caching and a CDN), the server won't give me the updated version of those files.
In my opinion the best approach would be the content checksum approach.

@shawwn

Member

shawwn commented Dec 20, 2017

When code-splitting, it'd be great for a built file's name to be the same as last time unless a dependency changed, so the same thing never needs to be redownloaded if it isn't changing.

Good point. The query parameter should be the UTC mtime of the source asset file, not random. That will preserve caching.

One way to do this without breaking asset-level parallelism is to modify HTMLPackager and CSSPackager to scan for bundled URLs (/dist/${hash}.${ext}) and substitute in the query parameter (/dist/${hash}.${ext}?${mtime}).

That can happen as a post-processing step, after bundling. It should be possible to do this efficiently.

That will preserve all the previous advantages, like the fact that assets can generate bundled filenames without needing the contents of the other assets or querying the parent process.
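A sketch of that post-processing pass (hypothetical; `appendMtimeQuery` and the regex are illustrative, not actual packager code):

```javascript
// Hypothetical post-processing step: after bundling, scan packaged
// HTML/CSS output for bundled URLs of the form /dist/<hash>.<ext> and
// append each source asset's mtime as a query parameter. Filenames stay
// fixed, so asset-level parallelism is preserved.
function appendMtimeQuery(output, mtimes) {
  // `mtimes` maps a bundled filename (e.g. 'abc123.js') to its UTC mtime.
  return output.replace(/\/dist\/([0-9a-f]+\.\w+)/g, (match, file) =>
    file in mtimes ? `${match}?${mtimes[file]}` : match
  );
}
```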

@vforvalerio87

vforvalerio87 commented Dec 20, 2017

The mtime approach is undesirable, though, because it busts the cache for every asset every time. I'd rather stick to manually renaming assets in that case, because I don't want every user to re-download everything every time I deploy a website (possibly numerous times per day).

@jouni-kantola

jouni-kantola commented Dec 20, 2017

To be honest, I'd much rather have a slow build where hashing is content-based than have users re-downloading assets when they shouldn't need to. What about a flag to the CLI?

@Munter

Munter commented Dec 25, 2017

Here are some learnings from Assetgraph, where we solved the same issue.

You absolutely want to do content hashing so you can achieve deterministic, content-addressable file names that lend themselves well to far-future cache expiry. A random build-specific hash busts the cache too often. Query parameters aren't always treated correctly by proxies between the server and client.

You do not, however, need to do content hashing many times. You can get away with doing it once, at the point where you know you are done making source code modifications and are ready to write out to disk.

The hash renaming must be done in a depth-first post-order graph traversal to ensure content hashes update all the way up to the entry points when deeply nested dependencies change. Any other traversal order will result in caching errors.

@chee

Contributor

chee commented Dec 26, 2017

Query parameters aren't always treated correctly by proxies in between the server and client.

Some proxy software classifies anything with a query string as dynamic content, and so does not cache it at all. This is, for instance, Squid's default behaviour.

@benhutton

benhutton commented Jan 9, 2018

@devongovett what are you thinking about this one? Is asset fingerprinting something that you agree should be baked into Parcel's core? Is it something we should try to figure out how to add the correct hooks to write a plugin for? Is it something I should try to find some other way to write a post-processor to accomplish?

@shanebo

shanebo commented Jan 9, 2018

@devongovett, @benhutton and I are willing to invest some time and energy into this fingerprinting issue, but we don't want to head in an implementation direction you aren't a fan of. Would you be open to putting some thought into this with us so we can try to work on a PR, or a plugin?

@devongovett

Member

devongovett commented Jan 10, 2018

@shanebo @benhutton I think this will be very difficult to achieve in the current parallel architecture. We can't know the content hash until all assets have been processed, but we need to know the final bundled URL during asset processing so URLs can be placed in the right places (e.g. CSS files linking to images, HTML files linking to CSS, etc.).

If you have suggestions for ways around this, or alternative hashing/versioning strategies, let me know!

@chee

Contributor

chee commented Jan 10, 2018

@devongovett what if the files were generated as template files, with placeholders for the paths

<script src="{{main.js}}"></script>

then those are compiled afterwards in a separate operation?

(so the hash of the file would be the content of the file with the template markers in it)

The difficulty might be if a code-split dependency changes but the parent does not: the parent would still need a cache bust even though nothing in its own source is different.
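A sketch of that second compile pass (hypothetical; `fillPlaceholders` and the `{{...}}` marker syntax are illustrative, not Parcel internals):

```javascript
// Hypothetical second pass: bundles are first emitted and hashed with
// template markers still in place; once every final path is known, the
// markers are substituted in one sweep over the output.
function fillPlaceholders(template, finalPaths) {
  return template.replace(/\{\{([^}]+)\}\}/g, (match, name) =>
    name in finalPaths ? finalPaths[name] : match
  );
}
```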

@Munter

Munter commented Jan 10, 2018

@chee oh no, please don't invent conventions that take the source code away from working in a browser :'(

@Munter

Munter commented Jan 10, 2018

@devongovett

We can't know the content hash until all assets have been processed, but we need to know the final bundled URL during asset processing so URLs can be placed in the right places (e.g. CSS files linking to images, HTML files linking to CSS, etc.).

Why do you need to know the final URL before all assets are processed? What if you just used the current names as temporary names? You could actually keep it like this in the development environment, since content-addressable URLs are more of a production feature.

When you do a production build you could just rename the files once more and update the references to them. Or does the build pipeline somehow lose the references to an asset in the middle of the pipeline?

@benhutton

benhutton commented Jan 10, 2018

Likely what needs to happen, either as a plugin or as part of the Parcel core, is some sort of post-processor that does the depth-first post-order traversal mentioned at the bottom of #188 (comment)

We assemble the full and final graph and do a quick walk through it, renaming files and then re-referencing them further up.

It should be relatively fast, but I agree that it only needs to happen in production. It could easily be hidden behind some sort of flag on the executable.

As to why we care about this particular strategy so much: it's the only thing that seems to work reliably with CDNs. (That we know of! If someone else has a better solution, I'm all ears.)

  • Query strings are non-ideal, as mentioned here: #188 (comment). They also don't play nicely with image transformation SaaS like imgix. But even if we did go with a query string, it would have the same post-processing concerns that we have here.
  • Versioning the whole directory is both error-prone and leads to overly aggressive invalidation
  • ETags would be the remaining option, but that takes the CDN approach off the table, removing our assets from the edge locations and potentially bogging down our servers with a ton of unnecessary asset queries.

@shawwn, @shanebo and I will try hitting you up on slack later today to talk through this more.

@chee

Contributor

chee commented Jan 10, 2018

@Munter I'm talking about doing this as part of the compile step, not as something the developer would have to do, and the placeholders would go away in the output.

@Munter

Munter commented Jan 10, 2018

@benhutton I just jumped on your slack as well (@Munter). Feel free to ping me if you need any feedback on how we implemented this in assetgraph. I don't know if the models are close enough to each other to be able to do the same, but it feels close from inspecting the sources here

@DeMoorJasper

Member

DeMoorJasper commented Jan 10, 2018

@benhutton what about just keeping the current naming system (for initial and development naming) but renaming all references to the content hash at the end of a production build? It's sort of the same as what @chee and you suggested, but I'm pretty sure it'll be way easier to implement.

@benhutton

benhutton commented Jan 11, 2018

@DeMoorJasper I think that maybe we're talking about the same thing? Only change things for production, and do it at the end.

I don't think there is any way around doing a tree traversal, though. That is, I think that this algorithm will NOT work:

  1. Find the md5 hash of every file.
  2. Rename those files to include the hash.
  3. Go back and edit the files to include references to the new file names with the hashes.

Instead, we need to do the tree traversal that @Munter described.

  1. Find your graph.
  2. Find a leaf node.
  3. Hash, rename.
  4. Update references to that file.
  5. Walk the tree back up, repeating for each file.

The idea is that when any given node changes, all the nodes above it will end up changing too as the references trickle up. And any nodes that are NOT affected will NOT change. So you are busting exactly the right caches at the right time.

Here's the big principle: A file doesn't get edited after it gets hashed. The hash is of the FINAL content of that file.

@DeMoorJasper does that make sense?
@Munter am I describing the algorithm you had in mind accurately?

@Munter

Munter commented Jan 11, 2018

@benhutton That is exactly the right algorithm, and for exactly the right reason.

This image always helps me visualise it best:

[image omitted: tree diagram of a depth-first post-order traversal]

Traversal order: A C E D B H I G F

It's still important to start at your entry point(s) and just remember to put the hashing logic after child traversal. This is what we do in AssetGraph: https://github.com/assetgraph/assetgraph/blob/master/lib/AssetGraph.js#L445-L462

When you extend Parcel with multiple entry points, you probably want to keep track of seen assets to avoid double work as well.

@fritx

fritx commented Feb 9, 2018

Is there any workaround for now?

@augnustin

augnustin commented Mar 8, 2018

I completely 👍 the MD5 hash naming strategy and I'm glad this was the final pick. A lot of Parcel's beauty comes from being plug-and-play, and it needs to remain this way!

Looking forward to seeing this available in production. Any idea if this would be within a few months or much more than that?

Cheers

@augnustin

augnustin commented Mar 15, 2018

I'd like to mention that IMHO this issue is top priority:

For now, my deploys are completely unpredictable. I try many things before I can serve the latest version of my assets. Among those:

  • assets rebuild
  • rm -rf public/* && assets rebuild
  • service nginx restart

still I get unpredictable results...

This makes it unusable in a real production context.

@augnustin

augnustin commented Mar 15, 2018

OK, I fixed my issue by doing rm -rf .cache. This might be another issue, but I'm reporting it here in case someone faces the same situation. I'll create the other issue when I have more predictable results to share.

@devongovett

Member

devongovett commented Mar 21, 2018

Should be solved by #1025 which generates content-hashed filenames for static assets. Please help test using the master branch - a release will hopefully come next week!

RFC automation moved this from Design to Done Mar 21, 2018

@augnustin

augnustin commented Mar 21, 2018

Wonderful! Great reactivity.

Definitely willing to test it as soon as it is released. If you can announce it here or in #1025, that would be perfect.

Cheers
