New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[wgsl] Encoding for shader text needs to be specified #565
Comments
Can comments contain UTF-8 also? (and maybe identifiers but that's more controversial) |
UTF-8? The web is specified in some arbitrary Unicode encoding, as determined by the Content-Encoding header (or meta charset equivalent). If the strings come from JavaScript, they will be with the byte encoding of the JavaScript interpreter, which is likely UCS-2 / UTF-16, but this is an implementation detail. |
UTF-8 is the best encoding for a textual language for the web. HTML has to deal with legacy encodings, but a new format doesn't need to. |
The functions which accept shader source are IDL functions, which would accept |
Any text format transmitted over the internet needs to have its charset defined, or have a known mechanism for getting the charset. Otherwise, out-of-band knowledge is required to decode it to a DOMString. Let's assume there's never a dedicated API to load a WGSL shader, and they are always loaded through something generic like Fetch. Then there needs to be a way to transform the bytes that come off the wire into a DOMString. Fetch can give you various packagings. Some are raw bytes. If you get one of those, you need out-of-band knowledge of the encoding. That's not good, because it stops a webpage from having generic code to load shaders, if those shaders might have come from someone using a different encoding. There's also a text packaging, which assumes UTF-8 encoding. Thus, UTF-8 will be the easiest option for authors. For completeness' sake, there's also the option of putting the encoding as a charset parameter on the MIME type (but then the server needs to know the encoding somehow, which is inconvenient), or the file format can have an in-band charset declaration (like HTML and CSS do), but then there probably needs to be a dedicated loading API, or an API that takes an ArrayBuffer of bytes, or a way to turn an array of bytes into a DOMString. Those are all worse than just defining it to be utf-8. When we submit a MIME type definition for WGSL to IANA, we will probably need to explain charset considerations there. That is yet another reason the encoding must be defined. It's not good to transmit text over the Internet without defining the charset encoding. Of course, none of this stops anyone from locally editing a WGSL file in whatever encoding they, or their OS, or their text editor prefer. But when it's severed on the web, the charset must be defined somehow. |
(quoting myself)
This is true. However, after discussing this offline, I realized that it's still valuable for the spec to list an encoding, even if it isn't used anywhere inside WebGPU's implementation. If we consider the possibility that WGSL is a huge success, and has loads of tools, simply stating an official encoding in the spec is generally a good idea to reduce problems in long toolchains of compilers/optimizers/compressors etc. And if we had to pick a particular encoding, yeah, UTF-8 is the clear winner. |
Seems like consensus to spec as utf-8 for shader text. TBD details like identifiers. |
Resolves gpuweb#565. DOMString is UTF-16.
Resolves gpuweb#565. USVString is UTF-16.
You could use USVString for the methods taking a shader source as well, to indicate lone surrogates are not allowed (and end up as U+FFFD). (As for the commits above, note that USVString is not UTF-16 necessarily. What kind of backing you use for USVString is up to implementations and an ECMAScript engine written in Rust might well use Rust's native string type for it.) |
We do indeed use USVString for shader sources now, and I think we have a pretty good handle on the encoding considerations (finally). WGSL doesn't actually specify an encoding but says "UTF-8 is always a valid encoding for a WGSL program." |
I'd recommend saying that for interchange UTF-8 must be used. Sure, someone could handcraft something else and make it work, but there's no reason to even hint that's a thing that's endorsed. |
We don't have to say UTF-8 must be used. The tools can figure this out. |
Why not though? If you're defining a new interchange format it seems good to define both the encoding and a MIME type upfront rather than have the wider ecosystem try to sort that out. (Look how that is going for C/C++.) |
The WGSL standard itself is not defining an interchange format, it's defining a language. |
It's somewhat typical for "web" formats/languages to also define their MIME type and encoding in a section. You could do it separately, but that might lead to a bunch of confusion (see JSON, JavaScript). |
Yeah, I'm not opposed to doing it in a section or what not. That seems fine. |
Although the WGSL spec had the change to say that UTF-8 is always valid, it seems like (anticipating from w3c/i18n-activity#1500 thread) there needs to be a description that guides character encoding detection. In my opinion, the following RFC (this https://www.rfc-editor.org/rfc/rfc4329#page-4 RFC) is a good model in case that can be a solution:
Also see CSS input byte stream section: https://www.w3.org/TR/css-syntax-3/#input-byte-stream |
(This issue, by the comment above, now relates to #2546 issue) |
Then you might as well require UTF-8 exclusively, no? We shouldn't allow anything else for new formats so if you are going to touch the subject after all, just settle it with UTF-8. |
@annevk I think having UTF-8 as the only character encoding scheme sounds like a good step, but I'd like to defer the actual feasibility of taking that step and whether that'd answer the question to the UA involved with WebGPU & WGSL. |
I don't think this is necessary for v1. |
WGSL meeting minutes 2022-04-19
|
FWIW, it's not that weird that the in-memory representation is "UTF-16" (though that's not necessarily a given over the long term, fully ASCII strings already use a byte-per-code-point representation, and it's really more WTF-16 when it's not a byte buffer or something else), that's the case for other things too. E.g., WebVTT will be "UTF-16" (with above caveats) in memory, but always UTF-8 in a file. Same for JavaScript modules. Same for JSON as consumed by web browsers (e.g., through And yeah, the straightforward solution here would be to have a section that defines |
It would be nice if there was a standard decision / snippet we could just include here -- we're basically in the same boat as JavaScript/WebVTT. Some text we can copy/paste/refer to so each WG that wants to define a "Unicode file format" doesn't have to go through this process from scratch. |
@magcius like this? ("The input byte stream" section from CSS): https://www.w3.org/TR/css-syntax-3/#input-byte-stream |
That's far too complex as you exclusively deal with UTF-8 here. Something like this would be sufficient (based on text in the JSON RFC):
(I haven't checked if a BOM is generally ignored. If it is this last paragraph can be simplified somewhat.) And then separately submit a MIME type registration to IANA through https://www.iana.org/form/media-types. I would remove the encoding bits in "Textual Structure" and instead have a dedicated "Interchange" section or some such that states the above. |
To reiterate some concerns voiced during the WG calls: Is there no way to word this as UTF-8 must be supported, but other encodings may be supported? |
@ben-clayton That is also something I wonder about since sometimes Apache servers are configured to serve old charset, etc. So running UTF-8 decoding without any encoding scheme detection could mean hard to detect interpretation mismatches. Imagine a WGSL source with ASCII extended (in specific identifiers) can parse without any issues since most values will lie on the 0-127 range but then for the rest, depending on the identifier, it can parse as valid UTF-8 whereas the intended source is different. However, I think that's what CSS's approach supports well (why can't WGSL adopt it?). |
CSS is a legacy text format, we shouldn't take on its encoding complexity for a new format. New formats should use UTF-8 exclusively, I think there's pretty broad consensus on that in the internet community. I think it's fine to say that when the files are not used for interchange you can use something else, but if you plan to send these over the network you should match JSON, JavaScript modules, and frankly the majority of text transmitted over the internet today. |
For implementation purposes, I was expecting that no BOM would be treated as UTF-8. If a BOM was included, then we'd try to support that (erroring if the encoding is not supported by tooling).
Sure. My concerns here aren't really to do with network transmission, but files on disk. Tint is already in use by multiple teams / companies for offline shader compilation. I don't know the typical origin of or reasoning for non-UTF-8 files, but I've encountered enough to know they are in use. Stating that WGSL must only be encoded in UTF-8 is going to conflict with offline tooling for these teams that default to non-UTF-8 file formats. |
To my surprise, Rust https://doc.rust-lang.org/reference/input-format.html and Swift https://forums.swift.org/t/may-swift-source-code-be-in-utf-16-in-xcode/34494/2 are UTF-8 source only. |
Okay. It seems like this is the general direction of languages and compilers. I guess code-editors will move in this direction too. |
WGSL meeting minutes 2022-05-31
|
@xfq FYI, with today's meeting's decision, after the PR for this comes along, encoding of WGSL will be well-specified to be UTF-8 with ignoring of BOM. Thank you and i18n WG very much again for your help on various issues; much appreciated, and hopefully, the resolution here is satisfactory. I will delegate the close of the issue to post-PR-merge, which will also reflect in the i18n repo. |
The PR didd not take the BOM into account. But RFC 3629 says protocols that only use UTF-8 should forbid the byte order mark. I have updated the PR to say the BOM is not to be used. |
* Add draft text/wgsl media type registration. WGSL uses UTF-8 Fixes: #1682 #565 * Update wgsl/media-type-registration.txt Remove useless blanks Co-authored-by: Anne van Kesteren <annevk@annevk.nl> * Update wgsl/media-type-registration.txt Remove useless blanks Co-authored-by: Anne van Kesteren <annevk@annevk.nl> * Update wgsl/media-type-registration.txt Remove useless blanks Co-authored-by: Anne van Kesteren <annevk@annevk.nl> * Update wgsl/media-type-registration.txt Remove useless blanks Co-authored-by: Anne van Kesteren <annevk@annevk.nl> * Clarify there is no byte-order-mark Follow the recommendation of RFC3629, that when a text segement only uses UTF-8, then the byte order mark is forbidden. Co-authored-by: Anne van Kesteren <annevk@annevk.nl>
#3542 added a draft media type encoding and fixed the encoding to be UTF-8 without byte order mark. |
To my understanding the spec currently doesn't state the encoding for the shader text, which should be stated explicitly for compat.
From talking to @dj2 , the tokens of the language and the regex for var names are ASCII. The only thing that currently can be non-ASCII is the entry point names, which are UTF-8.
The text was updated successfully, but these errors were encountered: