io/fs, net/http: define interface for automatic ETag serving #60940

Open
oliverpool opened this issue Jun 22, 2023 · 13 comments

Comments

@oliverpool

oliverpool commented Jun 22, 2023

Renewal of #43223

In the discussion of io/fs and embed, a few people asked for automatic serving of ETag headers for static content, using content hashes.

Here is a proposal which tries to address the concerns raised in #43223.

Updated proposal (fs.FileInfo)

First, in io/fs, define

// A FileHashesInfo provides the file hashes in constant time.
type FileHashesInfo interface {
	fs.FileInfo

	// FileHashes returns content hashes of the file that uniquely
	// identifies the file contents.
	// The returned hashes should be of the form algorithm-base64,
	// and implementations are encouraged to use sha256, sha384, or sha512
	// as the algorithms and RawStdEncoding as base64 encoding,
	// for interoperability with other systems.
	//
	// FileHashes must NOT compute any hash of the file during the call.
	// That is, it must run in time O(1) not O(length of file).
	// If no content hash is already available, FileHashes should
	// return nil rather than take the time to compute one.
	FileHashes() []string
}

Second, in net/http, when serving a File (for instance in serveFile, right before serveContent), if its FileInfo implements FileHashesInfo and FileHashes returns a hash that is alphanumeric (no spaces, no Unicode, no symbols, to avoid any kind of header problems), use that hash as the default ETag.

func setEtag(w ResponseWriter, fi fs.FileInfo) {
	if ch, ok := fi.(fs.FileHashesInfo); ok {
		if w.Header().Get("Etag") != "" {
			return
		}
		for _, h := range ch.FileHashes() {
			// TODO: skip the hash if unsuitable (define "suitable")
			// TODO: should the etag be weak or strong?
			w.Header().Set("Etag", `W/"`+h+`"`)
			break
		}
	}
}

Third (probably out of scope for this proposal), add the FileHashes method on embed.FS.*file (which implements the FileInfo interface).
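
As an illustration of the first step, here is a minimal sketch of a FileInfo that satisfies the proposed interface. FileHashesInfo is declared locally because it does not exist in io/fs; hashedInfo and sriHash are hypothetical names. The hash is computed once, when the file is registered, so FileHashes itself stays O(1).

```go
package main

import (
	"crypto/sha256"
	"encoding/base64"
	"fmt"
	"io/fs"
	"testing/fstest"
)

// FileHashesInfo mirrors the interface proposed above.
// Assumption: it is declared here because io/fs does not provide it.
type FileHashesInfo interface {
	fs.FileInfo
	FileHashes() []string
}

// sriHash formats a SHA-256 digest as "sha256-<base64>" using RawStdEncoding.
// It is meant to run once, ahead of time, never inside FileHashes.
func sriHash(data []byte) string {
	sum := sha256.Sum256(data)
	return "sha256-" + base64.RawStdEncoding.EncodeToString(sum[:])
}

// hashedInfo (hypothetical) decorates a FileInfo with precomputed hashes.
type hashedInfo struct {
	fs.FileInfo
	hashes []string
}

// FileHashes only returns the stored slice, so it runs in constant time.
func (h hashedInfo) FileHashes() []string { return h.hashes }

func main() {
	data := []byte("hello")
	fsys := fstest.MapFS{"hello.txt": {Data: data}}
	fi, err := fs.Stat(fsys, "hello.txt")
	if err != nil {
		panic(err)
	}
	var info fs.FileInfo = hashedInfo{FileInfo: fi, hashes: []string{sriHash(data)}}
	if hi, ok := info.(FileHashesInfo); ok {
		fmt.Println(hi.Name(), hi.FileHashes())
	}
}
```

A caller only needs the type assertion shown in main; a FileInfo without the method is served exactly as before.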


This proposal fixes the following objections:

The API as proposed does not let the caller request a particular implementation.

The caller simply gets all available hashes and can pick the ones it wants.

The API as proposed does not let the implementation say which hash it used.

The implementers are encouraged to indicate the algorithm used for each hash.

The API as proposed does not let the implementation return multiple hashes.

This one does.

what is expected to happen if the ContentHash returns an error before transport?

This implementation cannot return an error (the implementer could choose to panic, but returning nil seems better suited).

Drop this proposal and let third-party code fill this need.

It is currently very cumbersome, since the middleware would need to open the file as well (which means having the exact same logic regarding URL cleanup as the http.FileServer). Here is my attempt: https://git.sr.ht/~oliverpool/exp/tree/main/item/httpetag/fileserver.go (even uglier, since I have to use reflect to retrieve the underlying fs.File from the http.File).


Could a "github-collaborator" post a message in #43223 to notify the people who engaged in the previous proposal about this updated proposal?


Original proposal (fs.File)

First, in io/fs, define

// A ContentHashFile is a file that can return hashes of its content in constant time.
type ContentHashFile interface {
	fs.File

	// ContentHash returns content hashes of the file that uniquely
	// identifies the file contents.
	// The returned hashes should be of the form algorithm-base64.
	// Implementations are encouraged to use sha256, sha384, or sha512
	// as the algorithms and RawStdEncoding as the base64 encoding,
	// for interoperability with other systems (e.g. Subresource Integrity).
	//
	// ContentHash must NOT compute any hash of the file during the call.
	// That is, it must run in time O(1) not O(length of file).
	// If no content hash is already available, ContentHash should
	// return nil rather than take the time to compute one.
	ContentHash() []string
}

Second, in net/http, when serving a File (for instance in serveFile, right before serveContent), if it implements ContentHashFile and ContentHash returns a hash that is alphanumeric (no spaces, no Unicode, no symbols, to avoid any kind of header problems), use that hash as the default ETag.

func setEtag(w http.ResponseWriter, file File) {
	if ch, ok := file.(fs.ContentHashFile); ok {
		if w.Header().Get("Etag") != "" {
			return
		}
		for _, h := range ch.ContentHash() {
			// TODO: skip the hash if unsuitable (space, unicode, symbol)
			// TODO: should the etag be weak or strong?
			w.Header().Set("Etag", `W/"`+h+`"`)
			break
		}
	}
}

Third, add the ContentHash method on http.ioFile file (as a proxy to the fs.File ContentHash method).

Fourth (probably out of scope for this proposal), add the ContentHash method on embed.FS files.
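
For the File-based variant, a wrapping file system could look roughly like the sketch below. ContentHashFile is declared locally because it is not part of io/fs; hashFS, hashFile and demoHashes are hypothetical names, and the hash table is assumed to be filled in ahead of time.

```go
package main

import (
	"crypto/sha256"
	"encoding/base64"
	"fmt"
	"io/fs"
	"testing/fstest"
)

// ContentHashFile mirrors the interface from the original proposal.
// Assumption: declared locally, since io/fs does not define it.
type ContentHashFile interface {
	fs.File
	ContentHash() []string
}

// hashFS (hypothetical) wraps an fs.FS and attaches precomputed hashes.
type hashFS struct {
	fs.FS
	hashes map[string][]string // keyed by file name
}

// hashFile (hypothetical) carries the hashes of one opened file.
type hashFile struct {
	fs.File
	hashes []string
}

// ContentHash returns the stored hashes without reading the file.
func (f hashFile) ContentHash() []string { return f.hashes }

// Open opens the underlying file and attaches its precomputed hashes.
func (h hashFS) Open(name string) (fs.File, error) {
	f, err := h.FS.Open(name)
	if err != nil {
		return nil, err
	}
	return hashFile{File: f, hashes: h.hashes[name]}, nil
}

// demoHashes builds a one-file hashFS and returns the hashes seen by a caller.
func demoHashes(name string) []string {
	data := []byte("hello")
	sum := sha256.Sum256(data)
	fsys := hashFS{
		FS:     fstest.MapFS{name: {Data: data}},
		hashes: map[string][]string{name: {"sha256-" + base64.RawStdEncoding.EncodeToString(sum[:])}},
	}
	f, err := fsys.Open(name)
	if err != nil {
		return nil
	}
	defer f.Close()
	if ch, ok := f.(ContentHashFile); ok {
		return ch.ContentHash()
	}
	return nil
}

func main() {
	fmt.Println(demoHashes("hello.txt"))
}
```

Note that embedding fs.File only promotes Stat, Read and Close, so a wrapper like this hides any extra methods of the wrapped file (Seek, for example).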

@gopherbot gopherbot added this to the Proposal milestone Jun 22, 2023
@oliverpool
Author

I have a hacky implementation available here:
https://git.sr.ht/~oliverpool/exp/tree/main/item/httpetag

@oliverpool
Author

Thinking out loud (sorry for the noise), it seems even better to add an optional method to fs.FileInfo (instead of fs.File):

Updated proposal:

First, in io/fs, define

// A FileHashesInfo provides the file hashes in constant time.
type FileHashesInfo interface {
	fs.FileInfo

	// FileHashes returns content hashes of the file that uniquely
	// identifies the file contents.
	// The returned hashes should be of the form algorithm-base64,
	// and implementations are encouraged to use sha256, sha384, or sha512
	// as the algorithms and RawStdEncoding as base64 encoding,
	// for interoperability with other systems.
	//
	// FileHashes must NOT compute any hash of the file during the call.
	// That is, it must run in time O(1) not O(length of file).
	// If no content hash is already available, FileHashes should
	// return nil rather than take the time to compute one.
	FileHashes() []string
}

Second, in net/http, when serving a File (for instance in serveFile, right before serveContent), if its FileInfo implements FileHashesInfo and FileHashes returns a hash that is alphanumeric (no spaces, no Unicode, no symbols, to avoid any kind of header problems), use that hash as the default ETag.

func setEtag(w ResponseWriter, fi fs.FileInfo) {
	if ch, ok := fi.(fs.FileHashesInfo); ok {
		if w.Header().Get("Etag") != "" {
			return
		}
		for _, h := range ch.FileHashes() {
			// TODO: skip the hash if unsuitable (define "suitable")
			// TODO: should the etag be weak or strong?
			w.Header().Set("Etag", `W/"`+h+`"`)
			break
		}
	}
}

Third (probably out of scope for this proposal), add the FileHashes method on embed.FS.*file (which implements the FileInfo interface).

@rsc
Contributor

rsc commented Jul 12, 2023

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@rsc
Contributor

rsc commented Jul 21, 2023

To summarize the proposal above:

  1. Add a new extension method to FileInfo, not File.
  2. To avoid argument about which hash to use, the method returns a slice of hashes.
  3. To identify the hashes, each hash in the slice is algorithm-base64, same format as Content-Security-Policy hashes.
  4. The web server sets multiple ETag headers, one for each hash.

I think we can probably improve on a few of these decisions.

  1. I am not sure about attaching the method to FileInfo. It is extremely unlikely that any FileInfo implementation would already have the hashes in its state. Instead, it would have to call back to some accessor method that uses the File handle. If the File has been closed, this may not be available anymore at all. It seems clearer to keep the method an extension of File than an extension of FileInfo. It will probably be less work for a File implementation to add a new exported method on File than to thread a new method on FileInfo back to a new unexported method on File.

     type HashFile interface {
         File
         Hash() ([]Hash, error)
     }
    

    seems fine to me.

  2. Nothing to improve here.

  3. That's a nice format, but it means you have to pick it apart with string manipulation to get the algorithm. I suggest instead having a

     package fs
     
     type Hash struct {
         Algorithm string
         Sum []byte
     }
    
     func (h Hash) String() string { ... }
    

    where the String method returns that standard CSP form. Clients who want the string can easily get it; clients who want the algorithm can easily get it; and clients who want a different form can easily compute it.

  4. I can't find anything that says it is legal to send back multiple ETag headers, and I can't see what that means if you want to send back an If-Match header - which one do you use? Instead I think we should let the fs decide what the preferred hash is and put that one first. Then the web server just uses the first hash as the ETag.
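
On point 4, a single ETag is what a conditional request gets compared against. The sketch below assumes the file system's first hash has already been turned into a quoted ETag value (demoETag is a made-up placeholder, and handler and statusFor are hypothetical helpers). It relies on the documented behavior of http.ServeContent, which uses an Etag header set by the caller when evaluating If-None-Match.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

// demoETag is a placeholder standing in for the first hash of a file.
const demoETag = `"sha256-placeholder"`

// handler sets one ETag, then lets ServeContent handle the preconditions.
func handler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Etag", demoETag)
	http.ServeContent(w, r, "hello.txt", time.Time{}, strings.NewReader("hello"))
}

// statusFor returns the status code for a GET carrying the given
// If-None-Match value (no header is sent when the value is empty).
func statusFor(ifNoneMatch string) int {
	req := httptest.NewRequest(http.MethodGet, "/hello.txt", nil)
	if ifNoneMatch != "" {
		req.Header.Set("If-None-Match", ifNoneMatch)
	}
	rec := httptest.NewRecorder()
	handler(rec, req)
	return rec.Code
}

func main() {
	fmt.Println("no validator:      ", statusFor(""))
	fmt.Println("matching validator:", statusFor(demoETag))
	fmt.Println("stale validator:   ", statusFor(`"something-else"`))
}
```

Only the matching validator should produce 304 Not Modified; with several ETag headers it would be unclear which value a client is expected to echo back.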

@oliverpool
Author

Thanks for the feedback!

  1. Add a new extension method to FileInfo, not File.

I am not sure about attaching the method to FileInfo

Logically, I would put the ContentHash "near" the ModTime (hence my proposition to augment FileInfo).

Trying to be more concrete, I was able to find 3 implementations of fs.FS in the stdlib:

  • embed.FS: file is both the File and the FileInfo, so no difference here (for now at least)
  • os.DirFS: very unlikely to provide the pre-computed content hash. Must be wrapped to provide this feature. As you point out, it is probably a little bit more work, but I found it quite doable to augment the Stat method: https://git.sr.ht/~oliverpool/exp/tree/fileinfo/item/httpetag/embed.go
  • zip.Reader: CRC32 is stored in FileHeader (which is wrapped to provide the fs.FileInfo)

So the first case does not really influence the decision.
The second case is in favor of File.
And the zip case favors a bit the FileInfo attachment.

For cases outside of the stdlib, I looked up S3 implementations and found that the hashes were returned when GETting the file or requesting the HEAD (so adding it to the FileInfo would mirror both ways of accessing the hashes, while attaching to File would prevent exposing it when doing a HEAD request).
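
To make the zip case concrete, the sketch below reads the CRC32 stored in an archive's FileHeader without re-reading the file contents. The Hash struct follows the shape suggested in this thread and is declared locally; buildZip, storedCRC and crcMatches are hypothetical helpers, and the big-endian byte order of Sum is an assumption made for the example.

```go
package main

import (
	"archive/zip"
	"bytes"
	"encoding/binary"
	"fmt"
	"hash/crc32"
)

// Hash mirrors the struct suggested in this thread; it is not part of io/fs.
type Hash struct {
	Algorithm string
	Sum       []byte
}

// buildZip returns an in-memory archive holding a single file.
func buildZip(name string, content []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	w, err := zw.Create(name)
	if err != nil {
		return nil, err
	}
	if _, err := w.Write(content); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// storedCRC returns the CRC32 recorded in the archive's directory for name.
// It never opens or decompresses the file itself.
func storedCRC(archive []byte, name string) (Hash, error) {
	zr, err := zip.NewReader(bytes.NewReader(archive), int64(len(archive)))
	if err != nil {
		return Hash{}, err
	}
	for _, f := range zr.File {
		if f.Name == name {
			sum := make([]byte, 4)
			binary.BigEndian.PutUint32(sum, f.CRC32) // assumed byte order
			return Hash{Algorithm: "crc32", Sum: sum}, nil
		}
	}
	return Hash{}, fmt.Errorf("%s: not found in archive", name)
}

// crcMatches reports whether the stored CRC32 equals a freshly computed one.
func crcMatches(content []byte) bool {
	archive, err := buildZip("hello.txt", content)
	if err != nil {
		return false
	}
	h, err := storedCRC(archive, "hello.txt")
	if err != nil {
		return false
	}
	return binary.BigEndian.Uint32(h.Sum) == crc32.ChecksumIEEE(content)
}

func main() {
	fmt.Println("stored CRC32 matches:", crcMatches([]byte("hello, etag")))
}
```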

Hash() ([]Hash, error)

I would drop the error (since the hashes should not be computed and are likely retrieved along with the other properties).

  3. To identify the hashes, each hash in the slice is algorithm-base64, same format as Content-Security-Policy hashes.

I really like your struct proposition, because it also simplifies the ETag logic: just encode the raw byte with an encoding producing the right characters!

  4. The web server sets multiple ETag headers, one for each hash.

My example code only sets the ETag once. I think this should be sufficient. However, for this to work well, the implementer should:

  1. Always send the hashes in the same order (otherwise the ETag will unexpectedly change between requests)
  2. Send the "preferred" hashes first ("preferred" should be defined more precisely, maybe "strongest" in the cryptographic sense?)

PS: do you think that dropping a comment in the previous proposal would be a good idea, to gather more feedback?

@oliverpool
Author

Updated proposal, taking into account the comments above:

// ContentHashesInfo provides pre-computed hashes of the file contents.
type ContentHashesInfo interface {
	FileInfo

	// ContentHashes returns pre-computed hashes of the file contents.
	//
	// ContentHashes must NOT compute any hash of the file during the call.
	// That is, it must run in time O(1) not O(length of file).
	// If no content hash is already available, ContentHashes should
	// return nil rather than take the time to compute one.
	//
	// The order of the returned hashes must be stable.
	ContentHashes() []Hash
}

// Hash indicates the hash of a given content.
type Hash struct {
	// Algorithm indicates the algorithm used. Implementations are encouraged
	// to use package-like name for interoperability with other systems
	// (lowercase, without dash: e.g. sha256, sha1, crc32)
	Algorithm string
	// Sum is the result of the hash, it should not be modified.
	Sum []byte
}

I have created a new fileinfo_struct branch in my demo code.
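
For readers who want to try the shape of this API, here is a minimal self-contained sketch. Hash and ContentHashesInfo are declared locally (they are not in io/fs); memInfo and newMemInfo are hypothetical. The String method follows the algorithm-base64 form suggested earlier in the thread, and the use of padded StdEncoding there is an assumption.

```go
package main

import (
	"crypto/sha256"
	"encoding/base64"
	"fmt"
	"hash/crc32"
	"io/fs"
	"time"
)

// Hash mirrors the struct proposed above.
type Hash struct {
	Algorithm string
	Sum       []byte
}

// String renders the hash as algorithm-base64 (StdEncoding is an assumption).
func (h Hash) String() string {
	return h.Algorithm + "-" + base64.StdEncoding.EncodeToString(h.Sum)
}

// ContentHashesInfo mirrors the proposed interface.
type ContentHashesInfo interface {
	fs.FileInfo
	ContentHashes() []Hash
}

// memInfo (hypothetical) describes an in-memory file with precomputed hashes.
type memInfo struct {
	name    string
	size    int64
	modTime time.Time
	hashes  []Hash
}

// newMemInfo computes the hashes once, with the stronger hash listed first.
func newMemInfo(name string, data []byte) memInfo {
	sha := sha256.Sum256(data)
	crc := crc32.ChecksumIEEE(data)
	return memInfo{
		name:    name,
		size:    int64(len(data)),
		modTime: time.Now(),
		hashes: []Hash{
			{Algorithm: "sha256", Sum: sha[:]},
			{Algorithm: "crc32", Sum: []byte{byte(crc >> 24), byte(crc >> 16), byte(crc >> 8), byte(crc)}},
		},
	}
}

func (m memInfo) Name() string       { return m.name }
func (m memInfo) Size() int64        { return m.size }
func (m memInfo) Mode() fs.FileMode  { return 0o444 }
func (m memInfo) ModTime() time.Time { return m.modTime }
func (m memInfo) IsDir() bool        { return false }
func (m memInfo) Sys() any           { return nil }

// ContentHashes returns the stored slice: constant time, stable order.
func (m memInfo) ContentHashes() []Hash { return m.hashes }

// Compile-time check that memInfo satisfies the interface.
var _ ContentHashesInfo = memInfo{}

func main() {
	for _, h := range newMemInfo("hello.txt", []byte("hello")).ContentHashes() {
		fmt.Println(h)
	}
}
```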

@rsc
Contributor

rsc commented Sep 20, 2023

I'm still on the fence about FileInfo vs File, but I'm willing to try FileInfo and see how it goes. It seems like we are at:

type HashFileInfo interface {
    FileInfo
    Hash() []Hash
}

type Hash struct {
    Algorithm string
    Sum []byte
}

The remaining question in my reply above is (4), namely what does HTTP do when Hash returns multiple hashes? As far as I can tell it makes no sense to send back multiple ETag headers.

@oliverpool
Author

oliverpool commented Sep 21, 2023

what does HTTP do when Hash returns multiple hashes?

I would suggest using the first suitable hash, for instance the first one with at least 32 bits (truncated to at most 512 bits):

		if w.Header().Get("Etag") != "" {
			return
		}
		const minLen, maxLen = 4, 64
		for _, h := range ch.ContentHashes() {
			buf := h.Sum
			if len(buf) < minLen {
				// hash should have at least 32 bits
				continue
			}
			if len(buf) > maxLen {
				buf = buf[:maxLen]
			}
			// Strong etag: any encoding middleware should set it to weak.
			w.Header().Set("Etag", `"`+base64.RawStdEncoding.EncodeToString(buf)+`"`)
			break
		}
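
The fragment above can be turned into a standalone function for experimentation. This is only a sketch: Hash is declared locally and etagFor is a hypothetical name, but the selection rule (skip sums shorter than 4 bytes, truncate to 64 bytes, encode with RawStdEncoding, emit a strong ETag) is the one described in the comment.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// Hash mirrors the struct discussed in this thread.
type Hash struct {
	Algorithm string
	Sum       []byte
}

// Bounds on the number of hash bytes used for the ETag (32 to 512 bits).
const minLen, maxLen = 4, 64

// etagFor returns a strong, quoted ETag built from the first suitable hash,
// or the empty string when no hash is long enough.
func etagFor(hashes []Hash) string {
	for _, h := range hashes {
		sum := h.Sum
		if len(sum) < minLen {
			continue // fewer than 32 bits: try the next hash
		}
		if len(sum) > maxLen {
			sum = sum[:maxLen]
		}
		return `"` + base64.RawStdEncoding.EncodeToString(sum) + `"`
	}
	return ""
}

func main() {
	hashes := []Hash{
		{Algorithm: "tiny", Sum: []byte{1, 2}},        // skipped: too short
		{Algorithm: "demo", Sum: []byte{0, 1, 2, 3}}, // first suitable hash
	}
	fmt.Println(etagFor(hashes))
}
```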

@willfaught
Contributor

Nit: Hash() returns more than one Hash. Hashes()?

@rsc
Contributor

rsc commented Oct 4, 2023

It seems fine for Hash to return []Hash. It doesn't have to be Hashes.
Using the first Hash as the Etag seems fine too.

Have all remaining concerns about this proposal been addressed?

@rsc
Contributor

rsc commented Oct 11, 2023

Based on the discussion above, this proposal seems like a likely accept.
— rsc for the proposal review group

@rsc
Contributor

rsc commented Oct 26, 2023

No change in consensus, so accepted. 🎉
This issue now tracks the work of implementing the proposal.
— rsc for the proposal review group

The proposal details are as follows.

In io/fs, we add:

type HashFileInfo interface {
    FileInfo
    Hash() []Hash
}

type Hash struct {
    Algorithm string
    Sum []byte
}

Then, in net/http.serveFile, serveFile calls Stat, and if the result implements HashFileInfo, it calls info.Hash. If that returns >=1 hashes, serveFile uses hash[0] as the Etag header, formatting it using Alg+":"+base64(Sum).

In package embed, the file type would add a Hash method and an assertion that it implements HashFileInfo. It would return a single hash with Algorithm “sha256”.
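
The accepted shape can be exercised today with locally declared types. The sketch below is not the net/http implementation: Hash and HashFileInfo are redeclared here, etagFromInfo and fakeInfo are hypothetical, and two details the text leaves open are assumptions, namely padded StdEncoding for base64 and the surrounding double quotes that HTTP entity tags require.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"io/fs"
)

// Hash and HashFileInfo mirror the accepted API; they are declared locally.
type Hash struct {
	Algorithm string
	Sum       []byte
}

type HashFileInfo interface {
	fs.FileInfo
	Hash() []Hash
}

// etagFromInfo formats the first hash as Alg+":"+base64(Sum).
// Assumptions: StdEncoding and the surrounding quotes.
func etagFromInfo(fi fs.FileInfo) string {
	hfi, ok := fi.(HashFileInfo)
	if !ok {
		return ""
	}
	hashes := hfi.Hash()
	if len(hashes) == 0 {
		return ""
	}
	h := hashes[0]
	return `"` + h.Algorithm + ":" + base64.StdEncoding.EncodeToString(h.Sum) + `"`
}

// fakeInfo (hypothetical) is a stand-in FileInfo carrying fixed hashes.
type fakeInfo struct {
	fs.FileInfo
	hashes []Hash
}

// Hash returns the stored hashes, preferred one first.
func (f fakeInfo) Hash() []Hash { return f.hashes }

func main() {
	fi := fakeInfo{hashes: []Hash{{Algorithm: "sha256", Sum: []byte{0, 1, 2}}}}
	fmt.Println(etagFromInfo(fi))
}
```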

@rsc rsc changed the title proposal: io/fs, net/http: define interface for automatic ETag serving io/fs, net/http: define interface for automatic ETag serving Oct 26, 2023
@rsc rsc modified the milestones: Proposal, Backlog Oct 26, 2023
@mauri870
Member

mauri870 commented Nov 2, 2023

@oliverpool If you're interested in working on this, feel free to send a patch

Projects
Status: Accepted
Development

No branches or pull requests

5 participants