Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

io/fs, net/http: define interface for automatic ETag serving #60940

Open
oliverpool opened this issue Jun 22, 2023 · 55 comments
Open

io/fs, net/http: define interface for automatic ETag serving #60940

oliverpool opened this issue Jun 22, 2023 · 55 comments

Comments

@oliverpool
Copy link

oliverpool commented Jun 22, 2023

Renewal of #43223

In the discussion of io/fs and embed, a few people asked for automatic serving of ETag headers for static content, using content hashes.

Here is a proposal which tries to address the concerns raised in #43223.

Accepted proposal

In io/fs, define

// HashFileInfo provides hashes of the file content in constant time.
type HashFileInfo interface {
	FileInfo
	// Hash returns content hashes of the file that uniquely
	// identifies the file contents.
	//
	// Hash must NOT compute any hash of the file during the call.
	// That is, it must run in time O(1) not O(length of file).
	// If no content hash is already available, Hash should
	// return nil rather than take the time to compute one.
	//
	// The order of the returned hashes must be constant (preferred hashes first).
	Hash() []Hash
}
// Hash indicates the hash of a given content.
type Hash struct {
	// Algorithm indicates the algorithm used. Implementations are encouraged
	// to use package-like name for interoperability with other systems
	// (lowercase, without dash: e.g. sha256, sha1, crc32)
	Algorithm string
	// Sum is the result of the hash, it should not be modified by the caller.
	Sum []byte
}

Then, in net/http.serveFile, serveFile calls Stat, and if the result implements HashFileInfo, it calls info.Hash. If that returns >=1 hashes, serveFile uses hash[0] as the Etag header, formatting it using Alg+":"+base64(Sum).

In package embed, the file type would add a Hash method and an assertion that it implements HashFileInfo. It would return a single hash with Algorithm “sha256”.


Original proposal (fs.File)

First, in io/fs, define

// A ContentHashFile is a file that can return hashes of its content in constant time.
type ContentHashFile interface {
	fs.File

	// ContentHash returns content hashes of the file that uniquely
	// identifies the file contents.
	// The returned hashes should be of the form algorithm-base64.
	// Implementations are encouraged to use sha256, sha384, or sha512
	// as the algorithms and RawStdEncoding as the base64 encoding,
	// for interoperability with other systems (e.g. Subresource Integrity).
	//
	// ContentHash must NOT compute any hash of the file during the call.
	// That is, it must run in time O(1) not O(length of file).
	// If no content hash is already available, ContentHash should
	// return nil rather than take the time to compute one.
	ContentHash() []string
}

Second, in net/http, when serving a File (in serveFile, right before serveContent for instance), if it implements ContentHashFile and the ContentHash method succeeds and is alphanumeric (no spaces, no Unicode, no symbols, to avoid any kind of header problems), use that result as the default ETag.

func setEtag(w http.ResponseWriter, file File) {
	if ch, ok := file.(fs.ContentHashFile); ok {
		if w.Header().Get("Etag") != "" {
			return
		}
		for _, h := range ch.ContentHash() {
			// TODO: skip the hash if unsuitable (space, unicode, symbol)
			// TODO: should the etag be weak or strong?
			w.Header().Set("Etag", `W/"`+h+`"`)
			break
		}
	}
}

Third, add the ContentHash method on http.ioFile file (as a proxy to the fs.File ContentHash method).

Fourth (probably out of scope for this proposal), add the ContentHash method on embed.FS files.

This proposal fixes the following objections:

The API as proposed does not let the caller request a particular implementation.

The caller will simply get all available implementations and can filter them out.

The API as proposed does not let the implementation say which hash it used.

The implementers are encouraged to indicate the algorithm used for each hash.

The API as proposed does not let the implementation return multiple hashes.

This one does.

what is expected to happen if the ContentHash returns an error before transport?

This implementation cannot return an error (the implementer choose to panic. Returning nil seems better suited).

Drop this proposal and let third-party code fill this need.

It is currently very cumbersome, since the middleware would need to open the file as well (which means having the exact same logic regarding URL cleanup as the http.FileServer). Here is my attempt: https://git.sr.ht/~oliverpool/exp/tree/main/item/httpetag/fileserver.go (even uglier, since I have use reflect to retrieve the underlying fs.File from the http.File).


Could a "github-collaborator" post a message in #43223 to notify the people who engaged in previous proposal of this updated proposal?

@gopherbot gopherbot added this to the Proposal milestone Jun 22, 2023
@oliverpool
Copy link
Author

I have a hacky implementation available here:
https://git.sr.ht/~oliverpool/exp/tree/main/item/httpetag

@oliverpool
Copy link
Author

Thinking out loud (sorry for the noise), it seems even better to add an optional method to fs.FileInfo (instead of fs.File):

Updated proposal:

First, in io/fs, define

// A FileHashesInfo provides the file hashes in constant time.
type FileHashesInfo interface {
	fs.FileInfo

	// FileHashes returns content hashes of the file that uniquely
	// identifies the file contents.
	// The returned hashes should be of the form algorithm-base64,
	// and implementations are encouraged to use sha256, sha384, or sha512
	// as the algorithms and RawStdEncoding as base64 encoding,
	// for interoperability with other systems.
	//
	// FileHashes must NOT compute any hash of the file during the call.
	// That is, it must run in time O(1) not O(length of file).
	// If no content hash is already available, FileHashes should
	// return nil rather than take the time to compute one.
	FileHashes() []string
}

Second, in net/http, when serving a File (in serveFile, right before serveContent for instance), if its FileInfo implements FileHashesInfo and the FileHashes method succeeds and is alphanumeric (no spaces, no Unicode, no symbols, to avoid any kind of header problems), use that result as the default ETag.

func setEtag(w ResponseWriter, fi fs.FileInfo) {
	if ch, ok := fi.(FileHashesInfo); ok {
		if w.Header().Get("Etag") != "" {
			return
		}
		for _, h := range ch.FileHashes() {
			// TODO: skip the hash if unsuitable (define "suitable")
			// TODO: should the etag be weak or strong?
			w.Header().Set("Etag", `W/"`+h+`"`)
			break
		}
	}
}

Third (probably out of scope for this proposal), add the FileHashes method on embed.FS.*file (which implements the FileInfo interface).

@rsc
Copy link
Contributor

rsc commented Jul 12, 2023

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@rsc
Copy link
Contributor

rsc commented Jul 21, 2023

To summarize the proposal above:

  1. Add a new extension method to FileInfo, not File.
  2. To avoid argument about which hash to use, the method returns a slice of hashes.
  3. To identify the hashes, each hash in the slice is algorithm-base64, same format as Content-Security-Policy hashes.
  4. The web server sets multiple ETag headers, one for each hash.

I think we can probably improve on a few of these decisions.

  1. I am not sure about attaching the method to FileInfo. It is extremely unlikely that any FileInfo implementation would already have the hashes in its state. Instead, it would have to call back to some accessor method that uses the File handle. If the File has been closed, this may not be available anymore at all. It seems clearer to keep the method an extension of File than an extension of FileInfo. It will probably be less work for a File implementation to add a new exported method on File than to thread a new method on FileInfo back to a new unexported method on File.

     type HashFile struct {
         File
         Hash() ([]Hash, error)
     }
    

    seems fine to me.

  2. Nothing to improve here.

  3. That's a nice format, but it means you have to pick it apart with string manipulation to get the algorithm. I suggest instead having a

     package fs
     
     type Hash struct {
         Algorithm string
         Sum []byte
     }
    
     func (h Hash) String() string { ... }
    

    where the String method returns that standard CSP form. Clients who want the string can easily get it; clients who want the algorithm can easily get it; and clients who want a different form can easily compute it.

  4. I can't find anything that says it is legal to send back multiple ETag headers, and I can't see what that means if you want to send back an If-Match header - which one do you use? Instead I think we should let the fs decide what the preferred hash is and put that one first. Then the web server just uses the first hash as the ETag.

@oliverpool
Copy link
Author

Thanks for the feedback!

  1. Add a new extension method to FileInfo, not File.

I am not sure about attaching the method to FileInfo

Logically, I would put the ContentHash "near" the ModTime (hence my proposition to augment FileInfo).

Trying to be more concrete, I was able to find 3 implementations of fs.FS in the stdlib:

  • embed.FS: file is both the File and the FileInfo, so no difference here (for now at least)
  • os.DirFS: very unlikely to provide the pre-computed content hash. Must be wrapped to provide this feature. As you point out, it is probably a little bit more work, but I found it quite doable to augment the Stat method: https://git.sr.ht/~oliverpool/exp/tree/fileinfo/item/httpetag/embed.go
  • zip.Reader: CRC32 is stored in FileHeader (which is wrapped to provide the fs.FileInfo)

So the first case dos not really influence the decision.
The second case is in favor of File.
And the zip case favors a bit the FileInfo attachment.

For cases outside of the stdlib, I looked up S3 implementations and found that the hashes were returned when GETting the file or requesting the HEAD (so adding it to the FileInfo would mirror both ways of accessing the hashes, while attaching to File would prevent exposing it when doing a HEAD request).

Hash() ([]Hash, error)

I would drop the error (since the hashes should not be computed and likely retrieved along the other properties).

  1. To identify the hashes, each hash in the slice is algorithm-base64, same format as Content-Security-Policy hashes.

I really like your struct proposition, because it also simplifies the ETag logic: just encode the raw byte with an encoding producing the right characters!

  1. The web server sets multiple ETag headers, one for each hash.

My example code only sets the ETAg once. I think this should be sufficient. However to work fine, the implementer should:

  1. Always send the hashes in the same order (otherwise the ETag will unexpectedly change between requests)
  2. Send the "preferred" hashes first ("preferred" should be defined more precisely, maybe "strongest" in the cryptographic sense ?)

PS: do you think that dropping a comment in the previous proposal would be a good idea, to gather more feedback?

@oliverpool
Copy link
Author

Updated proposal, taking into accounts the comments above:

// ContentHashesInfo provides pre-computed hashes of the file contents.
type ContentHashesInfo interface {
	FileInfo

	// ContentHashes returns pre-computed hashes of the file contents.
	//
	// ContentHashes must NOT compute any hash of the file during the call.
	// That is, it must run in time O(1) not O(length of file).
	// If no content hash is already available, ContentHashes should
	// return nil rather than take the time to compute one.
	//
	// The order of the returned hash must be stable.
	ContentHashes() []Hash
}

// Hash indicates the hash of a given content.
type Hash struct {
	// Algorithm indicates the algorithm used. Implementations are encouraged
	// to use package-like name for interoperability with other systems
	// (lowercase, without dash: e.g. sha256, sha1, crc32)
	Algorithm string
	// Sum is the result of the hash, it should not be modified.
	Sum []byte
}

I have created a new fileinfo_struct branch in my demo code.

@rsc
Copy link
Contributor

rsc commented Sep 20, 2023

I'm still on the fence about FileInfo vs File, but I'm willing to try FileInfo and see how it goes. It seems like we are at:

type HashFileInfo interface {
    FileInfo
    Hash() []Hash
}

type Hash struct {
    Algorithm string
    Sum []byte
}

The remaining question in my reply above is (4), namely what does HTTP do when Hash returns multiple hashes? As far as I can tell it makes no sense to send back multiple ETag headers.

@oliverpool
Copy link
Author

oliverpool commented Sep 21, 2023

what does HTTP do when Hash returns multiple hashes?

I would suggest to use the first suitable hash. For instance taking the first one with at least 32 bits (and truncating it to 512 bits):

		if w.Header().Get("Etag") != "" {
			return
		}
		const minLen, maxLen = 4, 64
		for _, h := range ch.ContentHashes() {
			buf := h.Sum
			if len(buf) < minLen {
				// hash should have at least 32 bits
				continue
			}
			if len(buf) > maxLen {
				buf = buf[:maxLen]
			}
			// Strong etag: any encoding middleware should set it to weak.
			w.Header().Set("Etag", `"`+base64.RawStdEncoding.EncodeToString(buf)+`"`)
			break
		}

@willfaught
Copy link
Contributor

Nit: Hash() returns more than one Hash. Hashes()?

@rsc
Copy link
Contributor

rsc commented Oct 4, 2023

It seems fine for Hash to return []Hash. It doesn't have to be Hashes.
Using the first Hash as the Etag seems fine too.

Have all remaining concerns about this proposal been addressed?

@rsc
Copy link
Contributor

rsc commented Oct 11, 2023

Based on the discussion above, this proposal seems like a likely accept.
— rsc for the proposal review group

@rsc
Copy link
Contributor

rsc commented Oct 26, 2023

No change in consensus, so accepted. 🎉
This issue now tracks the work of implementing the proposal.
— rsc for the proposal review group

The proposal details are as follows.

In io/fs, we add:

type HashFileInfo interface {
    FileInfo
    Hash() []Hash
}

type Hash struct {
    Algorithm string
    Sum []byte
}

Then, in net/http.serveFile, serveFile calls Stat, and if the result implements HashFileInfo, it calls info.Hash. If that returns >=1 hashes, serveFile uses hash[0] as the Etag header, formatting it using Alg+":"+base64(Sum).

In package embed, the file type would add a Hash method and an assertion that it implements HashFileInfo. It would return a single hash with Algorithm “sha256”.

@rsc rsc changed the title proposal: io/fs, net/http: define interface for automatic ETag serving io/fs, net/http: define interface for automatic ETag serving Oct 26, 2023
@rsc rsc modified the milestones: Proposal, Backlog Oct 26, 2023
@mauri870
Copy link
Member

mauri870 commented Nov 2, 2023

@oliverpool If you're interested in working on this, feel free to send a patch

@earthboundkid
Copy link
Contributor

Should there be a package function in fs like

// Hashes returns fi.Hash() if fi implements HashFileInfo. Otherwise it returns nil.
func Hashes(fi FileInfo) []Hash {
    if hfi, ok := fi.(HashFileInfo); ok {
       return hfi.Hash()
    }
    return nil
}

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/562875 mentions this issue: fs: add HashFileInfo interface to enable ETag serving in net/http

@oliverpool
Copy link
Author

Should there be a package function

@earthboundkid I don't think that there is a need for it. However if usage shows that there is interest in such a package function, it can be added later.

sgrankin added a commit to sgrankin/cs that referenced this issue Feb 16, 2024
Enables browser caching.
Calculate & cache hashes on demand.

Remove if golang/go#60940 is ever complete.
@neild
Copy link
Contributor

neild commented Feb 26, 2024

A couple points I observed while reviewing https://go.dev/cl/562875, copying here for visibility:

A zip file's hash is a CRC32. 32 bits is super small. Do we want to automatically serve Etags headers with hashes that small? If we set a larger minimum (128 bits, maybe?), do we want to bother having zip entries implement FileInfoHash?

embed uses SHA-256 hashes truncated down to 128 bits. @rsc's comment above says that embed would add a Hash method returning a hash with algorithm "sha256". Do we want to expose truncated 128-bit hashes as used by the package today, or do we want to extend embed's hash to include the full sha256?

@oliverpool
Copy link
Author

A zip file's hash is a CRC32. 32 bits is super small.

I would argue that ETag should only be used for integrity check (not for security) and that in this case 32 bits should be fine (however note that if a stronger algorithm is available earlier in the list, it will be selected instead of CRC32).

Do we want to expose truncated 128-bit hashes as used by the package today, or do we want to extend embed's hash to include the full sha256?

I am opened to both (in the initial implementation, I tried to keep the changes as small as possible).

do we need the algorithm in [the ETag header] at all?

This follows the latest comment from the proposal review group, however the algorithm name could be dropped if wanted.

@neild
Copy link
Contributor

neild commented Feb 27, 2024

Aren't ETags used to identify when a cache needs reloading? If you get a duplicate tag, the user is left with a stale entry. I believe 32 bits gets you a 1% probability of a collision with only ~1000 entries, ~75% chance with 100000. That seems pretty high.

@oliverpool
Copy link
Author

ETags are used in combination with the URL. So the only issue is if the same URL gets the same hash (which probability should be quite low). Strictly speaking it is associated with a given resource. In the io.FS case every file is a different resource (identified by a different URL in our case).

Source: https://www.rfc-editor.org/rfc/rfc7232#section-2.3 (emphasis mine):

An entity-tag is an opaque validator for differentiating between multiple representations of the same resource

Furthermore MDN gives an example (emphasis mine):

Typically, the ETag value is a hash of the content, a hash of the last modification timestamp, or just a revision number

I am pretty sure that browsers associate the ETag value with a given URL.

@neild
Copy link
Contributor

neild commented Feb 28, 2024

On the other hand, it seems Google Cloud accepts crc32 and even recommends it:

This is an integrity check: You provide some data and its expected hash, and GCS returns an error if the data doesn't match the hash. The expectation is that the data and hash match unless something has gone wrong. The birthday paradox doesn't apply.

ETags (so far as I can tell) are used for cache validation: The expectation is that the data for a resource will change over time, and you use the ETag to tell when it has done so. In this case, the birthday paradox does apply.

ETags are used in combination with the URL. So the only issue is if the same URL gets the same hash (which probability should be quite low).

Are there not URLs where the data changes frequently? If a URL has data which changes once per second, I believe you'd have a ~50% chance of a 32-bit hash collision happening in a day.

I'm not at all familiar with ETag usage in practice, but looking at various sources of documentation I can't seem to find any examples of ETags containing hashes this short. GCS (https://cloud.google.com/storage/docs/hashes-etags#etags) returns an MD5 under certain circumstances, and an undefined value in others. (I'm going to guess that they started with MD5, people came to depend on it, and now they can't change it to something better.) https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/ETag has an example with a 160-bit value encoded as a hex string.

The use case of serving files directly out of a .zip is pretty specialized and probably not really much of a concern in practice, but unless there's established practice of using a 32-bit CRC as an ETag value, I don't think we should be the ones to start doing it. (And possibly not even then.)

@neild
Copy link
Contributor

neild commented Feb 28, 2024

Regarding the hash provided for //go:embed files:

I don't know anything about the history of the implementation of embedded files, so I did a little bit of code archaeology. I'm not clear on why we store a per-file hash at all, but the hash is computed by cmd/internal/notsha256. The history of notsha256 indicates that it was added specifically as an arbitrary hash which does not need to match any defined standard.

I can't figure out why there's a hash in there; the embed package doesn't seem to use it.

I think deriving a "half-sha256" from this hash value is not a good idea, since this is specifically an arbitrary hash.

Storing an actual SHA-256 for embedded files may be possible, but https://go.dev/cl/402594 (which introduced notsha256) indicates computing a SHA-256 at build time may cause problems when GOEXPERIMENT=boringcrypto.

I see two possible paths:

  1. embed.FS provides sha256 hashes, as described in the accepted proposal: io/fs, net/http: define interface for automatic ETag serving #60940 (comment)
  2. embed.FS provides an arbitrary 128-bit hash value of no defined algorithm. (If we take this approach, do we want to grease the value in some way to avoid accidental dependence on the hash used? How?)

@oliverpool
Copy link
Author

oliverpool commented Feb 28, 2024

Regarding CRC32, apparently express used it until mid-2014 for etags: expressjs/express#1435

Possible solution for http:

  • try to find a hash with at least 128 bits: set the ETag and return if found
  • otherwise look again for a hash with at least 32 bits: if found, send it as weak ETag, as suggested by @mitar

Edit: as noted by @AGWA this would not be conformant with the RFC 7232 express made this mistake as well


Regarding embed, I think having a full sha-256 would be much more useful:

  • it can be used for SRI
  • it shouldn’t affect build time (since the algo is already run)
  • it will slightly increase the build size (however the embedded files are likely larger than 16 bytes, making the overhead acceptable)

@AGWA
Copy link

AGWA commented Feb 28, 2024

Weak ETags are meant for cases where two resources aren't byte-for-byte equal, but it's still OK to consider them equivalent for caching (e.g. they're simply encoded differently, or the only differences are semantically unimportant). See the entirety of RFC 7232 Section 2.1 for details. It's not OK for two entirely different resources to have the same ETag, whether it's weak or strong. IMO, that makes CRC32 and other short hashes unsuitable for ETags.

@neild
Copy link
Contributor

neild commented Feb 28, 2024

I recognize that we're almost certainly not going to try to redesign the API in the accepted proposal. With that said, I think there are a couple unfortunate points in HashFileInfo: It returns a []Hash, and therefore needs to allocate a slice. And Hash.Sum is a []byte, so it also needs to allocate a new value for each returned sum or trust the caller not to modify it.

I think I'd prefer something like:

type HashFileInfo interface {
  FileInfo

  // Hash returns a hash of the content of the file.
  //
  // If algorithm is zero, Hash returns a hash using an undefined algorighm.
  // If algorithm is non-zero, Hash returns a hash using the named algorithm if available.
  // Hash returns "" if no hash is available.
  //
  // Hash does not compute any hash of the file during the call.
  // That is, it runs in time O(1), not O(length of file).
  Hash(algorithm Algorithm) string
}

type Algorithm string

const (
  SHA256 = "sha256"
  // etc.
)

@mitar
Copy link
Contributor

mitar commented Feb 28, 2024

@neild: I think the issue with your proposal is that it makes hash encoding to a string non-optional? Maybe Hash should get two arguments then, one for algorithm and one for encoding?

@mitar
Copy link
Contributor

mitar commented Feb 28, 2024

Could we just document that returned slices should not be modified?

@oliverpool
Copy link
Author

oliverpool commented Feb 29, 2024

@rsc wrote #43223 (comment) in 2021, regarding the previous (declined) proposal.

  1. The API as proposed does not let the caller request a particular implementation.

For (1), I think that's a different proposal. There will be callers who don't care about the implementation. They just want any hash the file system has available. Like HTTP does for ETags. That's the focus of this proposal. I don't believe we have to engineer it more than that.

@earthboundkid
Copy link
Contributor

Returning a fresh slice every time seems bad for garbage collection. Returning the same slice seems like a recipe for spooky bugs. Time for iter.Seq[Hash]?

@oliverpool
Copy link
Author

To sum up the discussions that happened lately, I see 3 open questions:

  1. net/http how many bits must a "suitable hash" for ETag have, to have a collision probability low enough? (from the discussion, 32 bits are too low)
  2. embed: should the full sha256 hash be used instead of the current half-not-sha256? (as far as I understand, most people are in favor of this)
  3. io/fs: should the HashFileInfo be reworked to reduce allocations? (changing the signature had been previously discarded, documenting that the returned slices should not be modified could be brittle - I think that the work spared by using the HashFileInfo instead of reading the file is already a big-enough improvement regarding allocation)

My main question is: how can we reach a decision regarding those 3 points?

  1. Is there some agreed-upon recommendation regarding file-integrity to know how many bytes should be enough for an ETag?
  2. Who knows why a half-not-sha256 has been chosenAnd what would it take to store a proper sha256` hash?
  3. @rsc does your comment from 2021 still holds regarding the method signature?

@neild
Copy link
Contributor

neild commented Mar 4, 2024

I propose that 128 bits is the minimum sufficient hash size for automatically serving an ETag header. 128 bits is more than enough to avoid birthday paradox collisions.

I propose that rather than baking this limitation into net/http, we avoid adding any implementations of HashFileInfo to the standard library that return hashes less than 128 bits. This means not implementing HashFileInfo in archive/zip. (Note that users can already read the CRC32 of a file in a zip from the FileHeader; this only affects the fs.FS implementation in that package.)

I don't have a strong opinion on whether embed files should expose a sha256 hash or not, but either:

  • embed files should expose a full sha256; or
  • embed files should expose the existing 128-bit hash, without giving it a name, without guaranteeing that it will remain stable over time, and (ideally) with some grease in place to ensure that we don't get locked into keeping it stable. (Perhaps hash the Go release name into the value? Although this leaks the Go release via the Etag value, for what it's worth.)

@ianlancetaylor
Copy link
Contributor

@neild

I can't figure out why there's a hash in there; the embed package doesn't seem to use it.

Pretty sure it was put there specifically to support ETag. It's ready to be used when we are ready to use it.

@oliverpool
Copy link
Author

I propose that rather than baking this limitation into net/http, we avoid adding any implementations of HashFileInfo to the standard library that return hashes less than 128 bits. This means not implementing HashFileInfo in archive/zip.

I am not fully convinced by this line of reasoning. The goal of having an exported HashFileInfo interface is to allow anyone to implement it. So if we want net/http to behave correctly with 3rd party libraries, it must perform some checks on the hashes (we can't forbid 3rd-party libraries to exposes crc32 hashes or weaker). Having this check, I see no reason not to expose the crc32 hash in archive/zip (allowing its access through an interface is the very goal of an interface).

@oliverpool
Copy link
Author

I have updated the CL with @neild suggestions:

  • http: the algorithm name is not part of the ETag anymore
  • http: only hashes with a length between 128 and 264 bits are considered suitable
  • embed: the algorithm name is made to change on every release: go1.22.2/cmd/internal/notsha256[:16] (until the proper sha256 hash is exposed)

I am not sure that I would be able to change the hash of embed to a full sha256 and I would prefer if this could be done in a future CL.

I would appreciate any feedback, to move this forward.

@neild
Copy link
Contributor

neild commented Apr 8, 2024

Proposal committee, I'd appreciate your input here. This proposal is accepted, but there are some questions regarding it that need addressing. To summarize (see discussion above for details):

  1. The accepted proposal states that package embed will provide sha256 hashes. Embedded files currently contain an unexposed 128-bit hash. This hash is generated with cmd/internal/notsha256, which is explicitly not sha256 to avoid issues with the bootstrap process when GOEXPERIMENT=boringcrypto (see https://go.dev/cl/402594). Should embed use this hash or use sha256 as in the proposal? (Using sha256 will either require generating the hash at runtime, or figuring out another way around the bootstrap issue.)
  2. If embed should use the 128-bit notsha256 hash, what should the hash algorithm name be? (The HashFileInfo interface doesn't allow for unnamed, arbitrary hashes.)
  3. If embed should use the 128-bit notsha256 hash, should we apply grease to it by, for example, hashing the Go release into the value?
  4. The HashFileInfo interface requires allocating a slice and copying the hash value(s) on every call, if it is to avoid providing the caller with something that aliases private memory. Do we want to consider an allocation-free API instead (I suggested one above in io/fs, net/http: define interface for automatic ETag serving #60940 (comment))?

@mitar
Copy link
Contributor

mitar commented Apr 8, 2024

Isn't it maybe easier to make a generic fs wrapper with which you can wrap embed's fs which computes sha256 (or any other hash you really want)?

So then you could have something like:

//go:embed dir
var dirFiles embed.FS

var dirFilesWithHash = fs.WithHash(dirFiles, myHash)

This could then work with any source of a file system, like fstest.MapFS and so on.

@rsc
Copy link
Contributor

rsc commented May 8, 2024

I haven't replied here yet because every time I sit down to reply, I don't have any good answers. I will try to figure this out, but it will probably have to be after the Go 1.23 freeze. The intersection of cgo, sha256, boringcrypto, and these hashes makes things more difficult than they should be.

@rsc
Copy link
Contributor

rsc commented May 9, 2024

What if we define a new hash function "goembed1" which is notsha256("goembed1"+content) truncated to 128 bits?
We add the new interfaces and APIs as described in the accept message, but instead of providing a sha256 hash, we provide a goembed1 hash. This would leak the fact that the server is serving from a go:embed file system. I don't know if that matters.

@rsc
Copy link
Contributor

rsc commented May 9, 2024

I also agree with @neild's comment above that we should not add anything that returns a file hash < 128 bits to the standard library.

@mitar
Copy link
Contributor

mitar commented May 9, 2024

@rsc What about my proposal for a general fs.WithHash function? And we simply leave the internal embed hash internal. That seems to make make everything more composable, developer can control exactly which hash they want to use, and also testing is possible with non-embed but still fs types.

@rsc
Copy link
Contributor

rsc commented May 9, 2024

Anyone can write fs.WithHash themselves and they are welcome to. The point here was to provide a pre-computed hash that avoids the startup cost.

@neild
Copy link
Contributor

neild commented May 9, 2024

I don't have any particularly good answers either.

Perhaps we should say that go:embed files should provide a sha256 hash, and that figuring out whatever boringcrypto issues block providing one is a prerequisite for implementing that portion of this proposal?

We can add the fs.HashFileInfo interface even if embedded files don't implement it at first.

@oliverpool
Copy link
Author

notsha256("goembed1"+content) truncated to 128 bits

This sounds like a reasonable compromise and a good way forward in the meantime.

This would leak the fact that the server is serving from a go:embed file system. I don't know if that matters.

The leak would be fixed if the algorithm name is not considered for the ETag (as currently implemented, as suggested in https://go-review.googlesource.com/c/go/+/562875/19..26/src/net/http/fs.go#614). Or do you mean that the caller can recompute notsha256("goembed1"+content) and check the resulting hash against the ETag? (in this case the leak is more or less already present since go:embed files have not modtime when served).

we should not add anything that returns a file hash < 128 bits to the standard library

I would rather have the standard library makes use of (standard) interfaces as much as possible. The archive/zip package currently exposes the CRC32 hash in a specific way. I see no harm into exposing it as part of the interface. fs.HashFileInfo should not be considered a secure primitive (MD5 and SHA-1 produce 128+ bits and should be avoided whenever possible anyway).

We can add the fs.HashFileInfo interface even if embedded files don't implement it at first.

The core problem this proposal attempts to address is "automatic serving of ETag headers for static content, using content hashes" (as stated in #43223). I would prefer delaying this only if a clear and timed implementation strategy is worked out.

@neild
Copy link
Contributor

neild commented May 10, 2024

Or do you mean that the caller can recompute notsha256("goembed1"+content) and check the resulting hash against the ETag? (in this case the leak is more or less already present since go:embed files have not modtime when served).

Many files served by many web servers don't have a modification time. However, only files served by Go webservers using //go:embed files would have an ETag matching notsha256("goembed1"+content).

I don't know that this matters, but it would be an informational leak.

The archive/zip package currently exposes the CRC32 hash in a specific way. I see no harm into exposing it as part of the interface.

The proposed HashFileInfo documentation states that the hashes provided "uniquely identify" the file contents. While obviously no hash is a literal unique identifier, a 32-bit hash is short enough to be highly susceptible to birthday paradox collisions even outside adversarial scenarios, and cannot serve as a (pseudo-)unique identifier.

@rsc
Copy link
Contributor

rsc commented May 15, 2024

I hadn't seen the CL yet that drops the algorithm: from the Etag. That sounds like a good suggestion, and it eliminates the concerns about leaking that the hash is goembed1. That works for me.

@oliverpool
Copy link
Author

Is there anything that can be done to make some progress on this issue?

Is the current solution good enough? #60940 (comment)

  • http: the algorithm name is not part of the ETag anymore
  • http: only hashes with a length between 128 and 264 bits are considered suitable
  • embed: the algorithm name is made to change on every release: go1.22.2/cmd/internal/notsha256[:16] (until the proper sha256 hash is exposed - reminder: the name is ignored by the http handler)

@neild
Copy link
Contributor

neild commented Jun 4, 2024

It seems that cmd/internal/codesign is using notsha256 and inverting the result to produce an actually-sha256 hash.

This is disgusting, but it's precedent. Should we do the same for embedded file hashes? (We'd also need to increase the size of the stored hash, since it's only 128 bits right now.)

@oliverpool
Copy link
Author

Should we do the same for embedded file hashes?

Who can take the responsibility to answer this question?

I would really appreciate if we could move this issue forward.

@oliverpool
Copy link
Author

Friendly ping: should we rely on the fact that notsha256 is an inverted sha256, like cmd/internal/codesign already does?

@daddz
Copy link

daddz commented Nov 7, 2024

I'll add another friendly ping, this would be a really great feature to have, thanks @oliverpool for driving it this far!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Accepted
Development

No branches or pull requests