Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[wgsl] Encoding for shader text needs to be specified #565

Closed
zoddicus opened this issue Feb 27, 2020 · 41 comments
Closed

[wgsl] Encoding for shader text needs to be specified #565

zoddicus opened this issue Feb 27, 2020 · 41 comments
Assignees
Labels
copyediting Pure editorial stuff (copyediting, bikeshed, etc.) i18n-needs-resolution Issue the Internationalization Group has raised and looks for a response on. wgsl WebGPU Shading Language Issues
Projects
Milestone

Comments

@zoddicus
Copy link
Contributor

To my understanding the spec currently doesn't state the encoding for the shader text, which should be stated explicitly for compat.

From talking to @dj2 , the tokens of the language and the regex for var names are ASCII. The only thing that currently can be non-ASCII is the entry point names, which are UTF-8.

@zoddicus zoddicus added the wgsl WebGPU Shading Language Issues label Feb 27, 2020
@Kangz
Copy link
Contributor

Kangz commented Feb 27, 2020

Can comments contain UTF-8 also? (and maybe identifiers but that's more controversial)

@magcius
Copy link

magcius commented Feb 27, 2020

UTF-8? The web is specified in some arbitrary Unicode encoding, as determined by the Content-Encoding header (or meta charset equivalent). If the strings come from JavaScript, they will be with the byte encoding of the JavaScript interpreter, which is likely UCS-2 / UTF-16, but this is an implementation detail.

@othermaciej
Copy link

UTF-8 is the best encoding for a textual language for the web. HTML has to deal with legacy encodings, but a new format doesn't need to.

@litherum
Copy link
Contributor

litherum commented Feb 28, 2020

The functions which accept shader source are IDL functions, which would accept DOMString. Encoding is handled by the underlying loading infrastructure of the webpage. It doesn't need to be defined (indeed: shouldn't be defined) by the shading language spec.

@othermaciej
Copy link

Any text format transmitted over the internet needs to have its charset defined, or have a known mechanism for getting the charset. Otherwise, out-of-band knowledge is required to decode it to a DOMString.

Let's assume there's never a dedicated API to load a WGSL shader, and they are always loaded through something generic like Fetch. Then there needs to be a way to transform the bytes that come off the wire into a DOMString. Fetch can give you various packagings. Some are raw bytes. If you get one of those, you need out-of-band knowledge of the encoding. That's not good, because it stops a webpage from having generic code to load shaders, if those shaders might have come from someone using a different encoding. There's also a text packaging, which assumes UTF-8 encoding. Thus, UTF-8 will be the easiest option for authors.

For completeness' sake, there's also the option of putting the encoding as a charset parameter on the MIME type (but then the server needs to know the encoding somehow, which is inconvenient), or the file format can have an in-band charset declaration (like HTML and CSS do), but then there probably needs to be a dedicated loading API, or an API that takes an ArrayBuffer of bytes, or a way to turn an array of bytes into a DOMString. Those are all worse than just defining it to be utf-8.

When we submit a MIME type definition for WGSL to IANA, we will probably need to explain charset considerations there. That is yet another reason the encoding must be defined. It's not good to transmit text over the Internet without defining the charset encoding.

Of course, none of this stops anyone from locally editing a WGSL file in whatever encoding they, or their OS, or their text editor prefer. But when it's severed on the web, the charset must be defined somehow.

@litherum
Copy link
Contributor

litherum commented Mar 2, 2020

(quoting myself)

It doesn't need to be defined

This is true. However, after discussing this offline, I realized that it's still valuable for the spec to list an encoding, even if it isn't used anywhere inside WebGPU's implementation. If we consider the possibility that WGSL is a huge success, and has loads of tools, simply stating an official encoding in the spec is generally a good idea to reduce problems in long toolchains of compilers/optimizers/compressors etc.

And if we had to pick a particular encoding, yeah, UTF-8 is the clear winner.

@dj2 dj2 added this to Specification in WGSL Mar 24, 2020
@grorg grorg moved this from Resolved: Needs Specification to Under Discussion in WGSL Apr 25, 2020
@kdashg kdashg moved this from Needs Discussion to Resolved: Needs Specification Work in WGSL Feb 19, 2021
@kdashg
Copy link
Contributor

kdashg commented Feb 19, 2021

Seems like consensus to spec as utf-8 for shader text. TBD details like identifiers.

kdashg added a commit to kdashg/gpuweb that referenced this issue Apr 12, 2021
kdashg added a commit to kdashg/gpuweb that referenced this issue Apr 12, 2021
Resolves gpuweb#565.
DOMString is UTF-16.
kdashg added a commit to kdashg/gpuweb that referenced this issue Apr 12, 2021
Resolves gpuweb#565.
USVString is UTF-16.
@annevk
Copy link
Contributor

annevk commented Apr 28, 2021

You could use USVString for the methods taking a shader source as well, to indicate lone surrogates are not allowed (and end up as U+FFFD).

(As for the commits above, note that USVString is not UTF-16 necessarily. What kind of backing you use for USVString is up to implementations and an ECMAScript engine written in Rust might well use Rust's native string type for it.)

@kainino0x
Copy link
Contributor

We do indeed use USVString for shader sources now, and I think we have a pretty good handle on the encoding considerations (finally). WGSL doesn't actually specify an encoding but says "UTF-8 is always a valid encoding for a WGSL program."

@annevk
Copy link
Contributor

annevk commented Apr 28, 2021

I'd recommend saying that for interchange UTF-8 must be used. Sure, someone could handcraft something else and make it work, but there's no reason to even hint that's a thing that's endorsed.

@kvark
Copy link
Contributor

kvark commented Apr 28, 2021

We don't have to say UTF-8 must be used. The tools can figure this out.
I think the issue may be closed now?

@annevk
Copy link
Contributor

annevk commented Apr 28, 2021

Why not though? If you're defining a new interchange format it seems good to define both the encoding and a MIME type upfront rather than have the wider ecosystem try to sort that out. (Look how that is going for C/C++.)

@kvark
Copy link
Contributor

kvark commented Apr 28, 2021

The WGSL standard itself is not defining an interchange format, it's defining a language.
The MIME type in #1682 should totally make a reasonable choice of the encoding.

@annevk
Copy link
Contributor

annevk commented Apr 28, 2021

It's somewhat typical for "web" formats/languages to also define their MIME type and encoding in a section. You could do it separately, but that might lead to a bunch of confusion (see JSON, JavaScript).

@kvark
Copy link
Contributor

kvark commented Apr 28, 2021

Yeah, I'm not opposed to doing it in a section or what not. That seems fine.
I just want a clear separation of "this is the language" versus "this is a container for the source".

@dneto0
Copy link
Contributor

dneto0 commented Aug 27, 2021

See also the discussion in PR #1626 (which later became #1646 "UTF-8 is always a valid encoding...").
There are similar arguments, by different sets of people.

@Kangz Kangz added this to the V1.0 milestone Feb 2, 2022
@mehmetoguzderin
Copy link
Member

mehmetoguzderin commented Apr 5, 2022

Although the WGSL spec had the change to say that UTF-8 is always valid, it seems like (anticipating from w3c/i18n-activity#1500 thread) there needs to be a description that guides character encoding detection. In my opinion, the following RFC (this https://www.rfc-editor.org/rfc/rfc4329#page-4 RFC) is a good model in case that can be a solution:

  1. First, respect the charset value in the headers.
  2. Second, respect the byte-order mark (BOM) if the charset value is absent.
  3. Third, as a fallback, detect as UTF-8.

Also see CSS input byte stream section: https://www.w3.org/TR/css-syntax-3/#input-byte-stream

@mehmetoguzderin
Copy link
Member

(This issue, by the comment above, now relates to #2546 issue)

@annevk
Copy link
Contributor

annevk commented Apr 6, 2022

Then you might as well require UTF-8 exclusively, no? We shouldn't allow anything else for new formats so if you are going to touch the subject after all, just settle it with UTF-8.

@mehmetoguzderin
Copy link
Member

@annevk I think having UTF-8 as the only character encoding scheme sounds like a good step, but I'd like to defer the actual feasibility of taking that step and whether that'd answer the question to the UA involved with WebGPU & WGSL.

@litherum
Copy link
Contributor

I don't think this is necessary for v1.

@kdashg
Copy link
Contributor

kdashg commented Apr 20, 2022

WGSL meeting minutes 2022-04-19
  • KG: The finder point is, are we defining a file format here. If so, it should be UTF-8.
  • MOD: Internationalization feedback: If we say that UTF-8 is valid, then you have to give a method for determining the encoding of the file
  • MM: Think we should not spend much time on this. Think we shouldn’t have to go farther
  • KG: Folks keep bothering us about this.
  • MM: Weird that in-memory JS representation will be UTF-16.
  • DN: My recollection: We register text/wgsl MIME type, and in this spec say in WGSL “if you transfer this across a wire, use text/wgsl” and hence make it UTF-8.
  • KG: We are chartered to specify the string that enters create-shader-module. Outside of that folks are on their own.
  • MM: can separate what’s in a file vs. what’s given in the browser.
  • KG: In part the i18n feedback is “looks like you’re specifying a file format, shouldn’t it be utf-8?”
  • BC: Don’t want to say a file must be utf-8. Expect tools
  • MM: Say “should”.
  • KG: I’ll propose a PR.

@annevk
Copy link
Contributor

annevk commented Apr 20, 2022

FWIW, it's not that weird that the in-memory representation is "UTF-16" (though that's not necessarily a given over the long term, fully ASCII strings already use a byte-per-code-point representation, and it's really more WTF-16 when it's not a byte buffer or something else), that's the case for other things too. E.g., WebVTT will be "UTF-16" (with above caveats) in memory, but always UTF-8 in a file. Same for JavaScript modules. Same for JSON as consumed by web browsers (e.g., through fetch()).

And yeah, the straightforward solution here would be to have a section that defines text/wgsl and says that UTF-8 must be used for producing its bytes and consuming its bytes. And make it explicit that for the moment this is not a testable requirement as browsers do not directly consume these resources in any fashion.

@magcius
Copy link

magcius commented Apr 20, 2022

It would be nice if there was a standard decision / snippet we could just include here -- we're basically in the same boat as JavaScript/WebVTT. Some text we can copy/paste/refer to so each WG that wants to define a "Unicode file format" doesn't have to go through this process from scratch.

@mehmetoguzderin
Copy link
Member

@magcius like this? ("The input byte stream" section from CSS): https://www.w3.org/TR/css-syntax-3/#input-byte-stream

@annevk
Copy link
Contributor

annevk commented Apr 21, 2022

That's far too complex as you exclusively deal with UTF-8 here.

Something like this would be sufficient (based on text in the JSON RFC):

WSGL program text exchanged between systems that are not part of a closed ecosystem MUST be encoded using UTF-8 and where applicable must use the text/wgsl MIME type.

Implementations MAY/MUST NOT [TODO: pick one, it doesn't really matter] add a byte order mark (U+FEFF) to the beginning of a networked-transmitted WGSL program text. Implementations that parse networked-transmitted WSGL program text MUST ignore the presence of a byte order mark rather than treating it as an error.

(I haven't checked if a BOM is generally ignored. If it is this last paragraph can be simplified somewhat.)

And then separately submit a MIME type registration to IANA through https://www.iana.org/form/media-types.

I would remove the encoding bits in "Textual Structure" and instead have a dedicated "Interchange" section or some such that states the above.

@ben-clayton
Copy link
Collaborator

To reiterate some concerns voiced during the WG calls:
There are concerns about the spec stating that UTF-8 is the only valid encoding for WGSL files. For better or worse, we still see source files being generated in non-UTF-8 encodings, and I suspect WGSL compilers, like Tint, would want to support UTF-16 and possibly other encodings.

Is there no way to word this as UTF-8 must be supported, but other encodings may be supported?

@mehmetoguzderin
Copy link
Member

@ben-clayton That is also something I wonder about since sometimes Apache servers are configured to serve old charset, etc. So running UTF-8 decoding without any encoding scheme detection could mean hard to detect interpretation mismatches. Imagine a WGSL source with ASCII extended (in specific identifiers) can parse without any issues since most values will lie on the 0-127 range but then for the rest, depending on the identifier, it can parse as valid UTF-8 whereas the intended source is different. However, I think that's what CSS's approach supports well (why can't WGSL adopt it?).

@annevk
Copy link
Contributor

annevk commented Apr 21, 2022

CSS is a legacy text format, we shouldn't take on its encoding complexity for a new format. New formats should use UTF-8 exclusively, I think there's pretty broad consensus on that in the internet community. I think it's fine to say that when the files are not used for interchange you can use something else, but if you plan to send these over the network you should match JSON, JavaScript modules, and frankly the majority of text transmitted over the internet today.

@ben-clayton
Copy link
Collaborator

ben-clayton commented Apr 21, 2022

So running UTF-8 decoding without any encoding scheme detection could mean hard to detect interpretation mismatches

For implementation purposes, I was expecting that no BOM would be treated as UTF-8. If a BOM was included, then we'd try to support that (erroring if the encoding is not supported by tooling).

New formats should use UTF-8 exclusively, I think there's pretty broad consensus on that in the internet community. I think it's fine to say that when the files are not used for interchange you can use something else, but if you plan to send these over the network you should match JSON, JavaScript modules, and frankly the majority of text transmitted over the internet today.

Sure. My concerns here aren't really to do with network transmission, but files on disk. Tint is already in use by multiple teams / companies for offline shader compilation. I don't know the typical origin of or reasoning for non-UTF-8 files, but I've encountered enough to know they are in use. Stating that WGSL must only be encoded in UTF-8 is going to conflict with offline tooling for these teams that default to non-UTF-8 file formats.

@mehmetoguzderin
Copy link
Member

@ben-clayton
Copy link
Collaborator

And so is Go.

Okay. It seems like this is the general direction of languages and compilers. I guess code-editors will move in this direction too.
I rescind my file-format / offline tooling concerns.

@w3cbot w3cbot added i18n-needs-resolution Issue the Internationalization Group has raised and looks for a response on. and removed i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response. labels Apr 24, 2022
@kdashg kdashg removed their assignment May 31, 2022
@dneto0 dneto0 added the wgsl resolved Resolved - waiting for a change to the WGSL specification label May 31, 2022
@kdashg
Copy link
Contributor

kdashg commented May 31, 2022

WGSL meeting minutes 2022-05-31
  • KG: Said I’d write something. Write up that we’re UTF-8
  • DN: Anything to discuss, or just do?
  • BC: Was against, been convinced general direction languages are moving. Think just doing UTF-8 and reject other encodings is fine
  • MOD: I agree and we should ignore BOM.
  • KG: Anyone to write?
  • DN: I can do it.

@mehmetoguzderin
Copy link
Member

@xfq FYI, with today's meeting's decision, after the PR for this comes along, encoding of WGSL will be well-specified to be UTF-8 with ignoring of BOM. Thank you and i18n WG very much again for your help on various issues; much appreciated, and hopefully, the resolution here is satisfactory. I will delegate the close of the issue to post-PR-merge, which will also reflect in the i18n repo.

@Kangz Kangz added copyediting Pure editorial stuff (copyediting, bikeshed, etc.) and removed wgsl resolved Resolved - waiting for a change to the WGSL specification labels Sep 20, 2022
dneto0 added a commit to dneto0/gpuweb that referenced this issue Oct 20, 2022
@dneto0
Copy link
Contributor

dneto0 commented Oct 31, 2022

The PR didd not take the BOM into account. But RFC 3629 says protocols that only use UTF-8 should forbid the byte order mark. I have updated the PR to say the BOM is not to be used.
See discussion at #3542 (comment)

dneto0 added a commit that referenced this issue Oct 31, 2022
* Add draft text/wgsl media type registration. WGSL uses UTF-8

Fixes: #1682 #565

* Update wgsl/media-type-registration.txt

Remove useless blanks

Co-authored-by: Anne van Kesteren <annevk@annevk.nl>

* Update wgsl/media-type-registration.txt

Remove useless blanks

Co-authored-by: Anne van Kesteren <annevk@annevk.nl>

* Update wgsl/media-type-registration.txt

Remove useless blanks

Co-authored-by: Anne van Kesteren <annevk@annevk.nl>

* Update wgsl/media-type-registration.txt

Remove useless blanks

Co-authored-by: Anne van Kesteren <annevk@annevk.nl>

* Clarify there is no byte-order-mark

Follow the recommendation of RFC3629, that when a text segement
only uses UTF-8, then the byte order mark is forbidden.

Co-authored-by: Anne van Kesteren <annevk@annevk.nl>
@dneto0
Copy link
Contributor

dneto0 commented Oct 31, 2022

#3542 added a draft media type encoding and fixed the encoding to be UTF-8 without byte order mark.

@dneto0 dneto0 closed this as completed Oct 31, 2022
WGSL automation moved this from Resolved: Needs Specification Work to Done Oct 31, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
copyediting Pure editorial stuff (copyediting, bikeshed, etc.) i18n-needs-resolution Issue the Internationalization Group has raised and looks for a response on. wgsl WebGPU Shading Language Issues
Projects
WGSL
Done
Development

No branches or pull requests