[wgsl] Encoding for shader text needs to be specified #565

zoddicus · 2020-02-27T20:45:58Z

To my understanding the spec currently doesn't state the encoding for the shader text, which should be stated explicitly for compat.

From talking to @dj2 , the tokens of the language and the regex for var names are ASCII. The only thing that currently can be non-ASCII is the entry point names, which are UTF-8.

Kangz · 2020-02-27T20:47:39Z

Can comments contain UTF-8 also? (and maybe identifiers but that's more controversial)

magcius · 2020-02-27T20:58:22Z

UTF-8? The web is specified in some arbitrary Unicode encoding, as determined by the Content-Encoding header (or meta charset equivalent). If the strings come from JavaScript, they will be with the byte encoding of the JavaScript interpreter, which is likely UCS-2 / UTF-16, but this is an implementation detail.

othermaciej · 2020-02-28T03:38:33Z

UTF-8 is the best encoding for a textual language for the web. HTML has to deal with legacy encodings, but a new format doesn't need to.

litherum · 2020-02-28T05:43:58Z

The functions which accept shader source are IDL functions, which would accept DOMString. Encoding is handled by the underlying loading infrastructure of the webpage. It doesn't need to be defined ~~(indeed: shouldn't be defined)~~ by the shading language spec.

othermaciej · 2020-02-29T06:54:27Z

Any text format transmitted over the internet needs to have its charset defined, or have a known mechanism for getting the charset. Otherwise, out-of-band knowledge is required to decode it to a DOMString.

Let's assume there's never a dedicated API to load a WGSL shader, and they are always loaded through something generic like Fetch. Then there needs to be a way to transform the bytes that come off the wire into a DOMString. Fetch can give you various packagings. Some are raw bytes. If you get one of those, you need out-of-band knowledge of the encoding. That's not good, because it stops a webpage from having generic code to load shaders, if those shaders might have come from someone using a different encoding. There's also a text packaging, which assumes UTF-8 encoding. Thus, UTF-8 will be the easiest option for authors.

For completeness' sake, there's also the option of putting the encoding as a charset parameter on the MIME type (but then the server needs to know the encoding somehow, which is inconvenient), or the file format can have an in-band charset declaration (like HTML and CSS do), but then there probably needs to be a dedicated loading API, or an API that takes an ArrayBuffer of bytes, or a way to turn an array of bytes into a DOMString. Those are all worse than just defining it to be utf-8.

When we submit a MIME type definition for WGSL to IANA, we will probably need to explain charset considerations there. That is yet another reason the encoding must be defined. It's not good to transmit text over the Internet without defining the charset encoding.

Of course, none of this stops anyone from locally editing a WGSL file in whatever encoding they, or their OS, or their text editor prefer. But when it's severed on the web, the charset must be defined somehow.

litherum · 2020-03-02T21:18:03Z

(quoting myself)

It doesn't need to be defined

This is true. However, after discussing this offline, I realized that it's still valuable for the spec to list an encoding, even if it isn't used anywhere inside WebGPU's implementation. If we consider the possibility that WGSL is a huge success, and has loads of tools, simply stating an official encoding in the spec is generally a good idea to reduce problems in long toolchains of compilers/optimizers/compressors etc.

And if we had to pick a particular encoding, yeah, UTF-8 is the clear winner.

kdashg · 2021-02-19T18:57:27Z

Seems like consensus to spec as utf-8 for shader text. TBD details like identifiers.

Resolves gpuweb#565.

Resolves gpuweb#565. DOMString is UTF-16.

Resolves gpuweb#565. USVString is UTF-16.

annevk · 2021-04-28T06:28:17Z

You could use USVString for the methods taking a shader source as well, to indicate lone surrogates are not allowed (and end up as U+FFFD).

(As for the commits above, note that USVString is not UTF-16 necessarily. What kind of backing you use for USVString is up to implementations and an ECMAScript engine written in Rust might well use Rust's native string type for it.)

kainino0x · 2021-04-28T07:37:38Z

We do indeed use USVString for shader sources now, and I think we have a pretty good handle on the encoding considerations (finally). WGSL doesn't actually specify an encoding but says "UTF-8 is always a valid encoding for a WGSL program."

annevk · 2021-04-28T08:05:55Z

I'd recommend saying that for interchange UTF-8 must be used. Sure, someone could handcraft something else and make it work, but there's no reason to even hint that's a thing that's endorsed.

kvark · 2021-04-28T15:36:58Z

We don't have to say UTF-8 must be used. The tools can figure this out.
I think the issue may be closed now?

annevk · 2021-04-28T15:39:11Z

Why not though? If you're defining a new interchange format it seems good to define both the encoding and a MIME type upfront rather than have the wider ecosystem try to sort that out. (Look how that is going for C/C++.)

kvark · 2021-04-28T15:42:28Z

The WGSL standard itself is not defining an interchange format, it's defining a language.
The MIME type in #1682 should totally make a reasonable choice of the encoding.

annevk · 2021-04-28T15:45:37Z

It's somewhat typical for "web" formats/languages to also define their MIME type and encoding in a section. You could do it separately, but that might lead to a bunch of confusion (see JSON, JavaScript).

kvark · 2021-04-28T16:09:08Z

Yeah, I'm not opposed to doing it in a section or what not. That seems fine.
I just want a clear separation of "this is the language" versus "this is a container for the source".

dneto0 · 2021-08-27T15:26:31Z

See also the discussion in PR #1626 (which later became #1646 "UTF-8 is always a valid encoding...").
There are similar arguments, by different sets of people.

mehmetoguzderin · 2022-04-05T21:42:43Z

Although the WGSL spec had the change to say that UTF-8 is always valid, it seems like (anticipating from w3c/i18n-activity#1500 thread) there needs to be a description that guides character encoding detection. In my opinion, the following RFC (this https://www.rfc-editor.org/rfc/rfc4329#page-4 RFC) is a good model in case that can be a solution:

First, respect the charset value in the headers.
Second, respect the byte-order mark (BOM) if the charset value is absent.
Third, as a fallback, detect as UTF-8.

Also see CSS input byte stream section: https://www.w3.org/TR/css-syntax-3/#input-byte-stream

mehmetoguzderin · 2022-04-05T21:44:01Z

(This issue, by the comment above, now relates to #2546 issue)

annevk · 2022-04-06T05:32:13Z

Then you might as well require UTF-8 exclusively, no? We shouldn't allow anything else for new formats so if you are going to touch the subject after all, just settle it with UTF-8.

mehmetoguzderin · 2022-04-06T11:21:18Z

@annevk I think having UTF-8 as the only character encoding scheme sounds like a good step, but I'd like to defer the actual feasibility of taking that step and whether that'd answer the question to the UA involved with WebGPU & WGSL.

litherum · 2022-04-20T02:44:07Z

I don't think this is necessary for v1.

kdashg · 2022-04-20T17:10:12Z

WGSL meeting minutes 2022-04-19

KG: The finder point is, are we defining a file format here. If so, it should be UTF-8.
MOD: Internationalization feedback: If we say that UTF-8 is valid, then you have to give a method for determining the encoding of the file
MM: Think we should not spend much time on this. Think we shouldn’t have to go farther
KG: Folks keep bothering us about this.
MM: Weird that in-memory JS representation will be UTF-16.
DN: My recollection: We register text/wgsl MIME type, and in this spec say in WGSL “if you transfer this across a wire, use text/wgsl” and hence make it UTF-8.
KG: We are chartered to specify the string that enters create-shader-module. Outside of that folks are on their own.
MM: can separate what’s in a file vs. what’s given in the browser.
KG: In part the i18n feedback is “looks like you’re specifying a file format, shouldn’t it be utf-8?”
BC: Don’t want to say a file must be utf-8. Expect tools
MM: Say “should”.
KG: I’ll propose a PR.

annevk · 2022-04-20T17:21:31Z

FWIW, it's not that weird that the in-memory representation is "UTF-16" (though that's not necessarily a given over the long term, fully ASCII strings already use a byte-per-code-point representation, and it's really more WTF-16 when it's not a byte buffer or something else), that's the case for other things too. E.g., WebVTT will be "UTF-16" (with above caveats) in memory, but always UTF-8 in a file. Same for JavaScript modules. Same for JSON as consumed by web browsers (e.g., through fetch()).

And yeah, the straightforward solution here would be to have a section that defines text/wgsl and says that UTF-8 must be used for producing its bytes and consuming its bytes. And make it explicit that for the moment this is not a testable requirement as browsers do not directly consume these resources in any fashion.

magcius · 2022-04-20T17:27:18Z

It would be nice if there was a standard decision / snippet we could just include here -- we're basically in the same boat as JavaScript/WebVTT. Some text we can copy/paste/refer to so each WG that wants to define a "Unicode file format" doesn't have to go through this process from scratch.

mehmetoguzderin · 2022-04-20T17:29:20Z

@magcius like this? ("The input byte stream" section from CSS): https://www.w3.org/TR/css-syntax-3/#input-byte-stream

annevk · 2022-04-21T07:21:57Z

That's far too complex as you exclusively deal with UTF-8 here.

Something like this would be sufficient (based on text in the JSON RFC):

WSGL program text exchanged between systems that are not part of a closed ecosystem MUST be encoded using UTF-8 and where applicable must use the text/wgsl MIME type.

Implementations MAY/MUST NOT [TODO: pick one, it doesn't really matter] add a byte order mark (U+FEFF) to the beginning of a networked-transmitted WGSL program text. Implementations that parse networked-transmitted WSGL program text MUST ignore the presence of a byte order mark rather than treating it as an error.

(I haven't checked if a BOM is generally ignored. If it is this last paragraph can be simplified somewhat.)

And then separately submit a MIME type registration to IANA through https://www.iana.org/form/media-types.

I would remove the encoding bits in "Textual Structure" and instead have a dedicated "Interchange" section or some such that states the above.

ben-clayton · 2022-04-21T09:24:23Z

To reiterate some concerns voiced during the WG calls:
There are concerns about the spec stating that UTF-8 is the only valid encoding for WGSL files. For better or worse, we still see source files being generated in non-UTF-8 encodings, and I suspect WGSL compilers, like Tint, would want to support UTF-16 and possibly other encodings.

Is there no way to word this as UTF-8 must be supported, but other encodings may be supported?

mehmetoguzderin · 2022-04-21T09:41:18Z

@ben-clayton That is also something I wonder about since sometimes Apache servers are configured to serve old charset, etc. So running UTF-8 decoding without any encoding scheme detection could mean hard to detect interpretation mismatches. Imagine a WGSL source with ASCII extended (in specific identifiers) can parse without any issues since most values will lie on the 0-127 range but then for the rest, depending on the identifier, it can parse as valid UTF-8 whereas the intended source is different. However, I think that's what CSS's approach supports well (why can't WGSL adopt it?).

annevk · 2022-04-21T09:49:52Z

CSS is a legacy text format, we shouldn't take on its encoding complexity for a new format. New formats should use UTF-8 exclusively, I think there's pretty broad consensus on that in the internet community. I think it's fine to say that when the files are not used for interchange you can use something else, but if you plan to send these over the network you should match JSON, JavaScript modules, and frankly the majority of text transmitted over the internet today.

ben-clayton · 2022-04-21T09:56:18Z

So running UTF-8 decoding without any encoding scheme detection could mean hard to detect interpretation mismatches

For implementation purposes, I was expecting that no BOM would be treated as UTF-8. If a BOM was included, then we'd try to support that (erroring if the encoding is not supported by tooling).

New formats should use UTF-8 exclusively, I think there's pretty broad consensus on that in the internet community. I think it's fine to say that when the files are not used for interchange you can use something else, but if you plan to send these over the network you should match JSON, JavaScript modules, and frankly the majority of text transmitted over the internet today.

Sure. My concerns here aren't really to do with network transmission, but files on disk. Tint is already in use by multiple teams / companies for offline shader compilation. I don't know the typical origin of or reasoning for non-UTF-8 files, but I've encountered enough to know they are in use. Stating that WGSL must only be encoded in UTF-8 is going to conflict with offline tooling for these teams that default to non-UTF-8 file formats.

mehmetoguzderin · 2022-04-21T09:56:39Z

To my surprise, Rust https://doc.rust-lang.org/reference/input-format.html and Swift https://forums.swift.org/t/may-swift-source-code-be-in-utf-16-in-xcode/34494/2 are UTF-8 source only.

ben-clayton · 2022-04-21T10:05:32Z

And so is Go.

Okay. It seems like this is the general direction of languages and compilers. I guess code-editors will move in this direction too.
I rescind my file-format / offline tooling concerns.

kdashg · 2022-05-31T19:12:00Z

WGSL meeting minutes 2022-05-31

KG: Said I’d write something. Write up that we’re UTF-8
DN: Anything to discuss, or just do?
BC: Was against, been convinced general direction languages are moving. Think just doing UTF-8 and reject other encodings is fine
MOD: I agree and we should ignore BOM.
KG: Anyone to write?
DN: I can do it.

mehmetoguzderin · 2022-05-31T19:56:08Z

@xfq FYI, with today's meeting's decision, after the PR for this comes along, encoding of WGSL will be well-specified to be UTF-8 with ignoring of BOM. Thank you and i18n WG very much again for your help on various issues; much appreciated, and hopefully, the resolution here is satisfactory. I will delegate the close of the issue to post-PR-merge, which will also reflect in the i18n repo.

Fixes: gpuweb#1682 gpuweb#565

dneto0 · 2022-10-31T17:38:03Z

The PR didd not take the BOM into account. But RFC 3629 says protocols that only use UTF-8 should forbid the byte order mark. I have updated the PR to say the BOM is not to be used.
See discussion at #3542 (comment)

* Add draft text/wgsl media type registration. WGSL uses UTF-8 Fixes: #1682 #565 * Update wgsl/media-type-registration.txt Remove useless blanks Co-authored-by: Anne van Kesteren <annevk@annevk.nl> * Update wgsl/media-type-registration.txt Remove useless blanks Co-authored-by: Anne van Kesteren <annevk@annevk.nl> * Update wgsl/media-type-registration.txt Remove useless blanks Co-authored-by: Anne van Kesteren <annevk@annevk.nl> * Update wgsl/media-type-registration.txt Remove useless blanks Co-authored-by: Anne van Kesteren <annevk@annevk.nl> * Clarify there is no byte-order-mark Follow the recommendation of RFC3629, that when a text segement only uses UTF-8, then the byte order mark is forbidden. Co-authored-by: Anne van Kesteren <annevk@annevk.nl>

dneto0 · 2022-10-31T18:21:50Z

#3542 added a draft media type encoding and fixed the encoding to be UTF-8 without byte order mark.

zoddicus added the wgsl WebGPU Shading Language Issues label Feb 27, 2020

dj2 added this to Specification in WGSL Mar 24, 2020

grorg moved this from Resolved: Needs Specification to Under Discussion in WGSL Apr 25, 2020

dj2 mentioned this issue Oct 21, 2020

Should WGSL support Unicode identifiers? #1160

Closed

kdashg moved this from Needs Discussion to Resolved: Needs Specification Work in WGSL Feb 19, 2021

kdashg added a commit to kdashg/gpuweb that referenced this issue Apr 12, 2021

Specify WGSL as UTF-8-encoded text.

09be9c5

Resolves gpuweb#565.

kdashg added a commit to kdashg/gpuweb that referenced this issue Apr 12, 2021

Specify WGSL as UTF-16-encoded text.

55ba8c4

Resolves gpuweb#565. DOMString is UTF-16.

kdashg added a commit to kdashg/gpuweb that referenced this issue Apr 12, 2021

Specify WGSL as UTF-16-encoded text.

8ef882a

Resolves gpuweb#565. USVString is UTF-16.

annevk mentioned this issue Apr 28, 2021

Define a MIME type #1682

Closed

Kangz added this to the V1.0 milestone Feb 2, 2022

dneto0 assigned kdashg Apr 19, 2022

w3cbot added i18n-needs-resolution Issue the Internationalization Group has raised and looks for a response on. and removed i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response. labels Apr 24, 2022

mehmetoguzderin mentioned this issue May 18, 2022

wgsl: Non UTF-8 text is a valid encoding. gpuweb/cts#1429

Closed

kdashg removed their assignment May 31, 2022

dneto0 added the wgsl resolved Resolved - waiting for a change to the WGSL specification label May 31, 2022

kdashg assigned dneto0 May 31, 2022

Kangz added copyediting Pure editorial stuff (copyediting, bikeshed, etc.) and removed wgsl resolved Resolved - waiting for a change to the WGSL specification labels Sep 20, 2022

dneto0 added a commit to dneto0/gpuweb that referenced this issue Oct 20, 2022

Add draft text/wgsl media type registration. WGSL uses UTF-8

48791e8

Fixes: gpuweb#1682 gpuweb#565

dneto0 mentioned this issue Oct 20, 2022

Add draft text/wgsl media type registration. WGSL uses UTF-8 #3542

Merged

dneto0 closed this as completed Oct 31, 2022

WGSL automation moved this from Resolved: Needs Specification Work to Done Oct 31, 2022

[wgsl] Encoding for shader text needs to be specified #565

[wgsl] Encoding for shader text needs to be specified #565

Comments

zoddicus commented Feb 27, 2020

Kangz commented Feb 27, 2020

magcius commented Feb 27, 2020

othermaciej commented Feb 28, 2020

litherum commented Feb 28, 2020 • edited

othermaciej commented Feb 29, 2020

litherum commented Mar 2, 2020

kdashg commented Feb 19, 2021

annevk commented Apr 28, 2021

kainino0x commented Apr 28, 2021

annevk commented Apr 28, 2021

kvark commented Apr 28, 2021

annevk commented Apr 28, 2021 • edited

kvark commented Apr 28, 2021

annevk commented Apr 28, 2021

kvark commented Apr 28, 2021

dneto0 commented Aug 27, 2021

mehmetoguzderin commented Apr 5, 2022 • edited

mehmetoguzderin commented Apr 5, 2022

annevk commented Apr 6, 2022

mehmetoguzderin commented Apr 6, 2022

litherum commented Apr 20, 2022

kdashg commented Apr 20, 2022

annevk commented Apr 20, 2022

magcius commented Apr 20, 2022 • edited

mehmetoguzderin commented Apr 20, 2022

annevk commented Apr 21, 2022 • edited

ben-clayton commented Apr 21, 2022

mehmetoguzderin commented Apr 21, 2022

annevk commented Apr 21, 2022

ben-clayton commented Apr 21, 2022 • edited

mehmetoguzderin commented Apr 21, 2022

ben-clayton commented Apr 21, 2022

kdashg commented May 31, 2022

mehmetoguzderin commented May 31, 2022

dneto0 commented Oct 31, 2022

dneto0 commented Oct 31, 2022

litherum commented Feb 28, 2020 •

edited

annevk commented Apr 28, 2021 •

edited

mehmetoguzderin commented Apr 5, 2022 •

edited

magcius commented Apr 20, 2022 •

edited

annevk commented Apr 21, 2022 •

edited

ben-clayton commented Apr 21, 2022 •

edited