Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discussion: In an ESM-first mode, how should extensionless entry points be handled? #49431

Open
GeoffreyBooth opened this issue Sep 1, 2023 · 10 comments
Labels
esm Issues and PRs related to the ECMAScript Modules implementation. module Issues and PRs related to the module subsystem. wasi Issues and PRs related to the WebAssembly System Interface. wasm Issues and PRs related to WebAssembly.

Comments

@GeoffreyBooth
Copy link
Member

Building off of #49295 (comment) and #31415, we’re considering a new mode, probably enabled by flag, where all of the current places where Node defaults to CommonJS would instead default to ESM. One of the trickiest questions to answer for defining such a new mode is how extensionless entry points should be handled.

In an early version of the ESM implementation, extensionless entry points followed the same interpretation of .js files; if they were in a "type": "module" package.json scope, they were run as ESM, otherwise they were run as CommonJS. They couldn’t be referenced via import; extensionless support was limited to entry points, to handle the CLI tool use case without complicating the resolution algorithm. #31415 removed this ESM extensionless entry point support because there was desire to someday permit extensionless entry points to be not just ESM JavaScript but also Wasm. We haven’t followed up since.

There is a strong desire from users for the ability to write CLI apps or “shell scripts” using ESM, without needing things like symlinks or CommonJS wrappers in order to enable extensionless entry points being interpreted as ESM. I can think of two potential ways to enable such, which we could limit to being enabled by our new flag:

  1. Extensionless is just always ESM in a "type": "module" scope (or in this new ESM-first mode, outside of a "type": "commonjs" scope). There’s simply no ability for extensionless Wasm entry points; Node.js is a JavaScript runtime, after all. This tracks how there’s also currently no support for extensionless native/.node-file entry points.

  2. Use the Wasm magic byte to disambiguate between Wasm and ESM entry. Either always check for magic byte before trying to parse, or check after failure to parse as ESM JavaScript.

Are there other approaches?

I think we need to find some way to enable this use case, either behind this new “ESM by default” flag or in Node generally. It keeps coming up and the pressure from users will only increase as ESM adoption continues. @LiviaMedeiros @nodejs/loaders @nodejs/wasi @nodejs/tsc

@GeoffreyBooth GeoffreyBooth added module Issues and PRs related to the module subsystem. esm Issues and PRs related to the ECMAScript Modules implementation. wasm Issues and PRs related to WebAssembly. wasi Issues and PRs related to the WebAssembly System Interface. labels Sep 1, 2023
@LiviaMedeiros
Copy link
Contributor

Perhaps it can go like this:

  • switching on content type
    • text/javascript -> ESM
    • application/wasm -> Wasm
    • application/json -> JSON
    • unknown -> error (or fallback to ESM)
    • not applicable -> switching on extension
      • mjs -> ESM
      • cjs -> CJS
      • js -> default JS
      • wasm -> Wasm
      • json -> JSON
      • empty string -> switching on file header
        • 0x0061736D######## -> Wasm ver. ########
        • unknown -> ESM
      • unknown -> error (or fallback to ESM)

I don't really understand what "failure to parse as ESM JavaScript" means here (maybe "no SyntaxError"?) and don't think we should implement it, but AFAIK we can't start valid code with NUL byte even in sloppy mode, so switching on file header is deterministic in any order of checks.

Default JS means whatever is defined by package scope, or fallback to internal default.

Fallback to ESM adds convenience at the cost of breaking change if we add anything else later.

I feel -0 about idea of using magic overall: at least because this method can give false positive on arbitrary input data, or we may run into potential collisions if we add more detectable-by-file-contents formats in the future.
But realistically, it looks fine.

@GeoffreyBooth
Copy link
Member Author

So I did a little experiment in the node repo:

node
Welcome to Node.js v20.5.1.
Type ".help" for more information.
> fs.readFileSync('./test/fixtures/simple.wasm', 'utf8')
'\x00asm\x01\x00\x00\x00\x01\x07\x01`\x02\x7F\x7F\x01\x7F\x03\x02\x01\x00\x07\x07\x01\x03add\x00\x00\n\t\x01\x07\x00 \x00 \x01j\x0B'

Those \x00asm characters are the Wasm magic bytes, as @aduh95 mentioned. If I try to eval this file as JavaScript, or try to eval just the initial \x00 byte, I get Uncaught SyntaxError: Invalid or unexpected token. (Others more expert in this than I, please correct me if I’m doing this wrong or interpreting the results incorrectly.)

Since all Wasm files start with these bytes, and these bytes always fail to parse as JavaScript, I think it’s safe to say that it’s impossible to have an ambiguous file that might be runnable as either Wasm or JavaScript. Or put another way, trying to parse any Wasm file as JavaScript will always fail. So therefore I think the last part of your algorithm, where we’re down to the “no file extension present” section, can go like this:

if no file extension:
  read the file as a utf8 string and try to run it as ESM JavaScript
  if it throws a parsing error (Uncaught SyntaxError: Invalid or unexpected token):
    check for the Wasm magic bytes. if found, run the file as Wasm
    // in the future, check for other magic byte headers for other supported types and run those
    else error

Because we’re trying JavaScript first, we’re free to expand the list of supported binary file types in the future, presuming they each have magic bytes or defined headers. We’ll never be able to support a non-JavaScript string source type as an extensionless file, but I think that’s okay.

@aduh95
Copy link
Contributor

aduh95 commented Sep 2, 2023

@GeoffreyBooth do you have a load hook implementation in mind that could behave as you are describing? AFAIK we must be deterministic, trying to trigger SyntaxError will not work – and would be quite inefficient anyway.

@GeoffreyBooth
Copy link
Member Author

GeoffreyBooth commented Sep 2, 2023

@GeoffreyBooth do you have a load hook implementation in mind that could behave as you are describing? AFAIK we must be deterministic, trying to trigger SyntaxError will not work – and would be quite inefficient anyway.

Can you explain in particular why my proposed flow wouldn’t work? Not saying I have high confidence in it, but I think it should work, and it would be deterministic. Here’s how I would phrase it for the docs:

Node.js will load an extensionless entry point file and attempt to parse it as ES module JavaScript. If it fails to parse, Node.js will inspect the contents of the loaded entry point file and attempt to recognize a known header signature; and if such a header is found, Node.js will evaluate the file as that type. The following header signatures will be evaluated in order:

  • \x00asm: Wasm

It’s basically like try/catch with opening files.

@LiviaMedeiros
Copy link
Contributor

Because we’re trying JavaScript first, we’re free to expand the list of supported binary file types in the future, presuming they each have magic bytes or defined headers. We’ll never be able to support a non-JavaScript string source type as an extensionless file, but I think that’s okay.

Even if we assume that we can implement check if file contents are ES module, right now we're discussing only two formats, and they are mutually exclusive. Because of exclusivity, it doesn't matter in which order do we check, and it should be implemented in "Wasm -> ESM" order in code anyway, because complexity and performance cost is incomparable.

I think, the order of actions should be like this:

  1. We implement support for extensionless ES modules.
  2. We implement support for extensionless Wasm by adding magic number check.
  3. We implement support for various future formats by adding magic number checks and/or heuristics.

Between 1 and 2 there is no breaking change, because trying to launch extensionless Wasm after 1 will throw because it can never be a valid JavaScript code. Hence, I think 1 has good chances to land first.

2 adds guessing by magic number, so this change might be controversial and meet some opposition.

3, if there is less simple check (for example, check if code is not ESM but CJS or TS), it will be even more controversial.

Two questions I have at this point:

  • Let's say, we support both extensionless ESM and extensionless Wasm. User runs node extless, and extless is a file with JavaScript that has syntax error somewhere inside.
    What do we want to throw: the SyntaxError as is? Or we throw vague "failed to recognize extless format"? Or we throw a whole AggregateError with failures to parse it as ESM and as Wasm? Or something else?

  • Can we agree that this issue does not depend on ESM-first mode, and it won't be a breaking change to start implementing it right now?

@bakkot
Copy link

bakkot commented Sep 6, 2023

What would a "wasm entry point" mean? Wasm can't do anything unless provided with an environment, which is normally done from JS. I guess the idea would be to provide the WASI interface? But that still doesn't provide a way to import any JS things, so it would basically just be wasmtime. Which I guess is kind of handy, but isn't that related to node.

@GeoffreyBooth
Copy link
Member Author

What would a “wasm entry point” mean?

I’m not sure; you can see what #31388 did (which was removed in #31415); I assume that’s what we would be trying to enable.

@GeoffreyBooth
Copy link
Member Author

Because of exclusivity, it doesn't matter in which order do we check, and it should be implemented in "Wasm -> ESM” order in code anyway, because complexity and performance cost is incomparable.

This is true; however since the vast majority of cases will be JavaScript, shouldn’t we try that first before attempting to read the magic bytes? As in, optimize for the common case?

I think, the order of actions should be like this:

  1. We implement support for extensionless ES modules.
  2. We implement support for extensionless Wasm by adding magic number check.
  3. We implement support for various future formats by adding magic number checks and/or heuristics.

Agreed. I don’t know what other formats there might even be; .node files?

  • Let’s say, we support both extensionless ESM and extensionless Wasm. User runs node extless, and extless is a file with JavaScript that has syntax error somewhere inside.
    What do we want to throw: the SyntaxError as is? Or we throw vague “failed to recognize extless format”? Or we throw a whole AggregateError with failures to parse it as ESM and as Wasm? Or something else?

We have a few cases within the ESM loader where inside the error path, we do some additional checks to try to give a more informative error, like for example suggesting changing import { named } from 'commonjs-package' to import pkg from 'commonjs-package'; const { named } = pkg.

What I was thinking was that when we try to evaluate the entry point, if it errors we would catch that error, and then check for magic bytes and if found, run as Wasm; but if the magic bytes aren’t found, we just re-throw the original JavaScript error. The user doesn’t need to know about the Wasm check. If the magic bytes are found, the original error is discarded and we try to evaluate as Wasm and potentially throw any errors generated from that.

  • Can we agree that this issue does not depend on ESM-first mode, and it won’t be a breaking change to start implementing it right now?

Within a "type": "module" scope it wouldn’t be a breaking change, but otherwise (in an implicit or explicit "type": "commonjs" scope) it would be a breaking change, since extensionless files are currently parsed as CommonJS. Maybe we can do the same “if fail to execute as JavaScript, try to run as Wasm” for a CommonJS scope as well, but it’s a lot less clear that doing so wouldn’t be a breaking change. It would probably need a flag to be enabled, so if we’re going to flag it why not just put it within the overall new mode’s flag rather than something else? That way it can apply everywhere, within a "type": "module" scope or outside, which would be desirable since outside the scope is where this is most desired.

@LiviaMedeiros
Copy link
Contributor

This is true; however since the vast majority of cases will be JavaScript, shouldn’t we try that first before attempting to read the magic bytes? As in, optimize for the common case?

It depends on if we can try it without any overhead for JavaScript. I would assume that it wouldn't be easy to achieve, if possible at all.

Even if it's feasible, this can be discussed in a PR that adds the guessing by magic number and/or syntax error, so we can measure complexity and do benchmarks. It's a purely internal implementation, the outcome will be the same anyway. Documenting this as "switching on [magically guessing the type]" without explicit order can also work.

Within a "type": "module" scope it wouldn’t be a breaking change, but otherwise (in an implicit or explicit "type": "commonjs" scope) it would be a breaking change, since extensionless files are currently parsed as CommonJS. Maybe we can do the same “if fail to execute as JavaScript, try to run as Wasm” for a CommonJS scope as well, but it’s a lot less clear that doing so wouldn’t be a breaking change. It would probably need a flag to be enabled, so if we’re going to flag it why not just put it within the overall new mode’s flag rather than something else? That way it can apply everywhere, within a "type": "module" scope or outside, which would be desirable since outside the scope is where this is most desired.

I don't see a problem here.

After first change (I'll go ahead and make a PR), it will be symmetrical: extensionless inside module means ESM, extensionless inside commonjs means CJS, extensionless in typeless means default value.
Current default value is CJS so nothing breaks. The --module flag will flip both .js and extensionless in the same way.

Running Wasm-alike file as CJS should throw SyntaxError. If we'll want to have extensionless Wasm work in commonjs scope, it can be done in same way as with module scope, maybe even in same PR.

@andrew-aladjev

This comment was marked as off-topic.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
esm Issues and PRs related to the ECMAScript Modules implementation. module Issues and PRs related to the module subsystem. wasi Issues and PRs related to the WebAssembly System Interface. wasm Issues and PRs related to WebAssembly.
Projects
None yet
Development

No branches or pull requests

5 participants