Option to not modify mtime on unmodified files #10842

Open
JonHurst opened this issue Mar 16, 2023 · 11 comments

@JonHurst

Naïve synchronisers such as aws s3 sync rely on file size and file modification time to decide whether to upload a file. Hugo appears to overwrite existing files, updating their file modification time, even when the file in question is unchanged.

I propose that something like a --preserveModificationTimes flag be added. This would cause Hugo to check whether a file that it is about to overwrite has identical content and cancel the operation if this is the case, thus preserving the mtime.
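
The check proposed above can be sketched in a few lines. This is only an illustration of the idea, not Hugo's code: the flag does not exist at the time of writing, and the function name `write_if_changed` is hypothetical.

```python
from pathlib import Path


def write_if_changed(path: Path, data: bytes) -> bool:
    """Write data to path unless the file already holds exactly these bytes.

    Returns True if the file was written, False if it was left untouched
    (in which case its mtime is preserved).
    """
    try:
        # Cheap size check first, then a full byte-for-byte comparison.
        if path.stat().st_size == len(data) and path.read_bytes() == data:
            return False
    except FileNotFoundError:
        pass  # No existing file, so fall through and write it.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    return True
```

The size comparison avoids reading the existing file at all when the lengths differ, which keeps the extra cost small for files that really have changed.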

@bep bep removed the NeedsTriage label Mar 16, 2023
@bep bep added this to the v0.112.0 milestone Mar 16, 2023
@bep
Member

bep commented Mar 16, 2023

Naïve synchronisers such as aws s3 sync rely on file size and file modification time to decide whether to upload a file.

Why not use hugo deploy? (or s3deploy)

@JonHurst
Author

Why not use hugo deploy? (or s3deploy)

There are plenty of workarounds. rclone --checksum is another. Personally I wrote a 35 line python script that takes about 200ms to fix the mtimes after a hugo run. So definitely a nice to have rather than a need to have.
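
The script mentioned above is not reproduced in this thread. As a rough sketch of how a post-build fix-up of this kind might work, the following keeps a manifest of content hashes and mtimes between builds and restores the old mtime for any file whose content is unchanged. The function name, the manifest layout and the use of SHA-256 are all assumptions made for illustration, and the manifest is assumed to live outside the output directory.

```python
import hashlib
import json
import os
from pathlib import Path


def restore_mtimes(public_dir: Path, manifest_path: Path) -> int:
    """Reset mtimes of files whose content matches the previous build.

    Returns the number of files whose mtime was restored.
    """
    try:
        previous = json.loads(manifest_path.read_text())
    except FileNotFoundError:
        previous = {}  # First run: nothing to restore.

    current = {}
    restored = 0
    for path in sorted(public_dir.rglob("*")):
        if not path.is_file():
            continue
        key = path.relative_to(public_dir).as_posix()
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        mtime_ns = path.stat().st_mtime_ns
        old = previous.get(key)
        if old is not None and old["sha256"] == digest:
            # Unchanged content: put the earlier timestamp back.
            mtime_ns = old["mtime_ns"]
            os.utime(path, ns=(mtime_ns, mtime_ns))
            restored += 1
        current[key] = {"sha256": digest, "mtime_ns": mtime_ns}

    # Save the state for the next build.
    manifest_path.write_text(json.dumps(current, indent=2))
    return restored
```

Run after each build, for example as `restore_mtimes(Path("public"), Path(".mtimes.json"))`, and before the synchronising tool.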

aws s3 sync is, however, the canonical way to synchronise local filesystems with an s3 bucket. It is likely to have capabilities that workaround tools lack, and it will be where new capabilities land first. Its developers have chosen to rely on existence, file size and mtime as the three discriminators for upload -- the md5 checksum used by the other tools, I believe, comes from undocumented behaviour of the etag (could be wrong on this). To support it, and any other tool that has made the same assumptions, just requires that the modification times of unmodified files are not modified, which seems like the "right" thing to do anyway.
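
To make the three discriminators concrete, here is a simplified model of that kind of decision. It is not taken from the aws-cli source; `FileInfo` and `needs_upload` are hypothetical names, and the real tool may treat edge cases differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FileInfo:
    size: int      # bytes
    mtime: float   # seconds since the epoch


def needs_upload(local: FileInfo, remote: Optional[FileInfo]) -> bool:
    """Decide on existence, size and mtime alone; content is never read."""
    if remote is None:
        return True                      # not in the bucket yet
    if local.size != remote.size:
        return True                      # size differs
    return local.mtime > remote.mtime    # local copy looks newer
```

Under this model a rebuilt but byte-identical file is uploaded again purely because its mtime moved forward, which is the behaviour this issue is about.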

For fingerprinted files all that is needed is a check for existence before writing. For static files, I believe the behaviour is already not to update the mtime on copy. For non-fingerprinted files, there is obviously a performance hit because the read has to come from disk whereas the write would be cached, and there is a byte-by-byte comparison with the string in memory to do. I suspect that this overhead is tiny, particularly in Go, but putting it behind a flag would mitigate this anyway.

@jmooring
Member

See aws/aws-cli#6750. That's the right place to address your use case.

@JonHurst
Author

JonHurst commented Mar 20, 2023

See aws/aws-cli#6750. That's the right place to address your use case.

I don't think that aws-cli is being particularly unreasonable in assuming that a file with a newer mtime has been updated. Sure, aws-cli could, and probably will in the future, incorporate a checksum comparison to account for unmodified files with changed mtimes, but that is still basically a workaround and is specific to aws.

Something, somewhere has to decide that the file is unchanged and therefore doesn't need synchronising. That can be the synchronising tool itself; it can be, as I am doing, a script run between the hugo run and the synchronising tool; or it can be hugo. Of the options, the most natural and efficient is to do it in hugo -- hugo knows what's static, knows what's fingerprinted and has the string ready in memory for everything else.

As I say, it's not something that is personally causing me trouble; I'm quite happy with my solution. I do, however, believe it would be a useful feature if hugo could, behind a flag if necessary, only change mtimes when files have been modified.

@flowerbug

FileZilla is another file synchronisation tool which can use modification time and size to decide whether or not to upload a file.

i for one would welcome this flag/feature as it would save many useless uploads.

not everyone is using cloud/webhosting scripts or git for development and uploads of websites.

@bep bep modified the milestones: v0.112.0, v0.113.0 Apr 15, 2023
@bep bep modified the milestones: v0.113.0, v0.114.0, v0.115.0 Jun 8, 2023
@bep bep modified the milestones: v0.115.0, v0.116.0 Jun 30, 2023
@bep bep modified the milestones: v0.116.0, v0.117.0 Aug 1, 2023
@lpar

lpar commented Aug 1, 2023

Not quite the same thing, but I'd like a way to set mtime of each generated file to be the same as whatever .Lastmod ends up being. The mtimes getting reset every build is causing my Last-Modified HTTP headers to be incorrect after site upload.

Setting mtime based on .Lastmod would work for rsync deployment as well as S3, and since the data sources for .Lastmod are configurable it would be a nice general solution.

For now I think I'm going to have to write the times into a Dublin Core last modification meta element in the head, then write a utility to walk the public folder, read all the files, and update all the mtimes from those values.
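
A utility along those lines might look like the sketch below. The meta element name `DCTERMS.modified`, the attribute order in the regular expression and the function name `apply_meta_mtimes` are assumptions for illustration; the templates would need to emit matching markup with an ISO 8601 timestamp.

```python
import os
import re
from datetime import datetime
from pathlib import Path

# Assumed markup: <meta name="DCTERMS.modified" content="2023-08-01T12:00:00Z">
META_RE = re.compile(
    r'<meta\s+name="DCTERMS\.modified"\s+content="([^"]+)"', re.IGNORECASE
)


def apply_meta_mtimes(public_dir: Path) -> int:
    """Set each HTML file's mtime from its last-modified meta element.

    Returns the number of files updated; files without the element are skipped.
    """
    updated = 0
    for page in public_dir.rglob("*.html"):
        match = META_RE.search(page.read_text(encoding="utf-8", errors="replace"))
        if match is None:
            continue
        # fromisoformat() only accepts a trailing "Z" from Python 3.11 on,
        # so normalise it to an explicit UTC offset first.
        stamp = datetime.fromisoformat(match.group(1).replace("Z", "+00:00"))
        ts = stamp.timestamp()
        os.utime(page, (ts, ts))
        updated += 1
    return updated
```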

@jmooring
Member

jmooring commented Aug 1, 2023

@lpar This makes sense, but remember that some pages may not be backed by a file (e.g., top level sections, taxonomy pages, term pages). Those would have .Lastmod = 0001-01-01T00:00:00Z. In that case what would you want?

@jmooring
Member

jmooring commented Aug 2, 2023

@lpar

I'm going to have to ...

This seems a bit easier:

git clone --single-branch -b hugo-github-issue-10842 https://github.com/jmooring/hugo-testing hugo-github-issue-10842
cd hugo-github-issue-10842
hugo
./touch.sh

Files of interest:

  • hugo.toml
  • layouts/_default/home.lastmod.csv
  • touch.sh

Create a data file while building the site, then run a short script. On my rather average laptop, it takes about 2 seconds to run against a 500 page site, and about 8 seconds for a 2000 page site.
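
The contents of the data file and of touch.sh are not shown in this thread, so the following is only a portable sketch of the same two-step idea. It assumes a CSV with two columns per row, a path relative to the publish directory and an ISO 8601 timestamp; that layout and the name `touch_from_csv` are assumptions, not the format used in the linked repository.

```python
import csv
import os
from datetime import datetime
from pathlib import Path


def touch_from_csv(csv_path: Path, public_dir: Path) -> int:
    """Apply the timestamps listed in a build-generated CSV to published files.

    Returns the number of files whose mtime was set.
    """
    touched = 0
    with csv_path.open(newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle):
            if len(row) != 2:
                continue  # skip blank or malformed rows
            rel_path, lastmod = row
            target = public_dir / rel_path.lstrip("/")
            if not target.is_file():
                continue  # listed page was not published
            # Normalise a trailing "Z" for Python versions before 3.11.
            stamp = datetime.fromisoformat(lastmod.strip().replace("Z", "+00:00"))
            ts = stamp.timestamp()
            os.utime(target, (ts, ts))
            touched += 1
    return touched
```

Because it relies only on the Python standard library, a script like this does not depend on which bash version is installed.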

@lpar

lpar commented Aug 4, 2023

Oh, that's cunning. The shell script doesn't work when I test because I'm on a Mac so bash is years out of date and there's no readarray, but it gave me the idea for something even more cunning:

config.toml:

[mediaTypes]
[mediaTypes."text/x-shellscript"]
  suffixes = ["sh"]

[outputFormats.lastmod]
baseName = 'lastmod'
isPlainText = true
mediaType = 'text/x-shellscript'
notAlternative = true

[outputs]
home = ["HTML","JSON","lastmod"]

layouts/_default/home.lastmod.sh:

#!/bin/sh
{{ range site.Pages.ByDate -}}
{{- if .File -}}
touch -d {{ .Lastmod.UTC.Format "2006-01-02T15:04:05Z" }} public{{ path.Join .RelPermalink "index.html" }}
{{ end -}}
{{- end -}}
rm public/lastmod.sh

Then I can just run public/lastmod.sh from the GitHub action.

Normally this would be a terrible idea for security reasons, but since there's no user content being placed directly in the shell script and it's running in a container anyway, I think it's OK for this particular task.

@bep bep modified the milestones: v0.117.0, v0.118.0 Aug 30, 2023
@bep bep modified the milestones: v0.118.0, v0.119.0 Sep 15, 2023
@bep bep modified the milestones: v0.119.0, v0.120.0 Oct 5, 2023
@bep bep removed this from the v0.120.0 milestone Oct 31, 2023
@bep bep added this to the v0.121.0 milestone Oct 31, 2023
@bep bep modified the milestones: v0.121.0, v0.122.0 Dec 6, 2023
@bep bep modified the milestones: v0.122.0, v0.123.0, v0.124.0 Jan 27, 2024
@bep bep modified the milestones: v0.124.0, v0.125.0 Mar 4, 2024
@mboelen

mboelen commented Apr 1, 2024

Oh, that's cunning. The shell script doesn't work when I test because I'm on a Mac so bash is years out of date and there's no readarray, but it gave me the idea for something even more cunning:

config.toml:

[mediaTypes]
[mediaTypes."text/x-shellscript"]
  suffixes = ["sh"]

[outputFormats.lastmod]
baseName = 'lastmod'
isPlainText = true
mediaType = 'text/x-shellscript'
notAlternative = true

[outputs]
home = ["HTML","JSON","lastmod"]

layouts/_default/home.lastmod.sh:

#!/bin/sh
{{ range site.Pages.ByDate -}}
{{- if .File -}}
touch -d {{ .Lastmod.UTC.Format "2006-01-02T15:03:05Z" }} public{{ path.Join .RelPermalink "index.html" }}
{{ end -}}
{{- end -}}
rm public/lastmod.sh

Then I can just run public/lastmod.sh from the GitHub action.

Normally this would be a terrible idea for security reasons, but since there's no user content being placed directly in the shell script and it's running in a container anyway, I think it's OK for this particular task.

Works fine here as well (had to change Minute though, as it refers to '03' and should be '04')

@lpar

lpar commented Apr 3, 2024

@mboelen Thanks, the bug was fixed in our internal copy a few days later but I forgot to come back here and update. I've updated the original comment now in case anyone copypastes it without reading the rest of the thread.
