Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug: --experimental-detect-module doesn’t run as ESM files that have CommonJS parse errors above the first ESM syntax #50917

Closed
chjj opened this issue Nov 26, 2023 · 28 comments · Fixed by #52024
Labels
esm Issues and PRs related to the ECMAScript Modules implementation. module Issues and PRs related to the module subsystem.

Comments

@chjj
Copy link
Contributor

chjj commented Nov 26, 2023

Case 1 (a perfectly valid ES module):

const module = {};
import fs from 'node:fs';

Results in:

$ node --experimental-detect-module case1.js
case1.js:1
const module = {};
      ^

SyntaxError: Identifier 'module' has already been declared

Case 2 (another valid module)

const response = await fetch('https://icanhazip.com');
import fs from 'node:fs';
console.log(await response.text());

Results in:

$ node --experimental-detect-module case2.js
case2.js:1
const response = await fetch('https://icanhazip.com');
                 ^^^^^

SyntaxError: await is only valid in async functions and the top level bodies of modules

Solutions

The first case can be solved by placing the code in another function wrapper, allowing module, exports, etc to be shadowed (note that this might require manual removal of the hashbang if there is one).

The second case can be solved by adding the above "await" error message to the list of ESM errors to check for.

@targos
Copy link
Member

targos commented Nov 26, 2023

I tried to fix Case 2, but there's a problem with top-level await: it doesn't always generate the same syntax error.

With the test case: console.log(await Promise.resolve("foo"))
Error is: SyntaxError: missing ) after argument list

@targos
Copy link
Member

targos commented Nov 26, 2023

PR with the test if someone knows how to handle this case: #50918

@targos
Copy link
Member

targos commented Nov 26, 2023

Case 1 is similar as it also generates an error that's not specific to ESM syntax.

@chjj
Copy link
Contributor Author

chjj commented Nov 26, 2023

@targos, case 1 can be fixed with an additional wrapper. If we assume the normal CJS wrapper is something like:

(function(module, exports, require, __filename, __dirname) {
  CODE
})();

For the check we can do something like:

(function(module, exports, require, __filename, __dirname) {
  (function() {
    CODE
  })();
})();

This allows all of the CJS "globals" to be shadowed and redeclared with const, let, etc.

That said, I'm not familiar enough with the codebase to know whether the compiled ESM checking function is being cached for later use in the case that the code is indeed a CJS module.

If the function is being cached for later use, this might be some kind of breaking change for code that expects to throw on things on const module = {} -- however, I expect code that somehow relies on this behavior to be astonishingly rare, and I struggle to even think of a use case for it.

@chjj
Copy link
Contributor Author

chjj commented Nov 26, 2023

Just a thought: I feel like the proper way to do this would be to actually traverse the AST and check for ESM nodes before the compilation step. Does v8 expose the raw AST?

@joyeecheung
Copy link
Member

Just a thought: I feel like the proper way to do this would be to actually traverse the AST and check for ESM nodes before the compilation step. Does v8 expose the raw AST?

V8 does not expose the AST (for encapsulation concerns, I gather). And we rely on V8 errors for now because using another parser would likely result in a performance regression in the no-failure case.

@joyeecheung
Copy link
Member

joyeecheung commented Nov 26, 2023

It seems to me both cases can be solved with a different strategy: when there are syntax errors when trying to parse it as CommonJS, don't even detect the errors and just try parsing it as ESM unconditionally. And if that fails, it really fails. (I thought that was the original plan, but I don't know why it wasn't implemented that way, though I didn't give any thorough review the original PR myself either)

@targos
Copy link
Member

targos commented Nov 26, 2023

When there are syntax errors when trying to parse it as CommonJS, don't even detect the errors and just try parsing it as ESM unconditionally

If you do that, ambiguous files (which are syntactically valid in ESM and CJS) will be executed as ESM.

@meyfa
Copy link
Contributor

meyfa commented Nov 26, 2023

If you do that, ambiguous files (which are syntactically valid in ESM and CJS) will be executed as ESM.

Ambiguous files would still be executed as CJS, no? Only if they cause a syntax error (any syntax error) when parsing as CJS would they be parsed again as ESM. If they are truly ambiguous, this should cause the same syntax error. If it succeeds, however, then they weren't ambiguous to begin with, and should just be executed as ESM.

I think the proposal is to ignore the specific type of syntax error; it isn't to try ESM first (which would obviously be a much larger change).

@joyeecheung
Copy link
Member

joyeecheung commented Nov 26, 2023

If you do that, ambiguous files (which are syntactically valid in ESM and CJS) will be executed as ESM.

No, it'll still be executed in CJS. We can first try to compile it as CJS, and if it fails due to SyntaxError (any SyntaxError, not just selected ones) we try to compile it as ESM, but it if passes we execute it as CJS. And if the second parse as ESM still doesn't compile we fail.

@targos
Copy link
Member

targos commented Nov 26, 2023

What about:

const module = {};

This is not unambiguous ESM and throws a SyntaxError is CJS. I know it's not a very realistic example but I'm afraid we can find some if we think about it.

@joyeecheung
Copy link
Member

We can just say in this case we recognize it as ESM. The rule is set by us, and I think it's fair to just say, if a script attempts to declare a new binding module, it will be recognized as ESM.

@joyeecheung
Copy link
Member

Or we don’t need to say anything about any syntax in particular, just that

  1. If it can be parsed as CJS, it’s executed as CJS. Doesn’t matter if it can be parsed as ESM or not. If the user wants to execute a script that can be parsed both ways as ESM, just make that explicit through any of the available options l
  2. If it cannot be parsed as CJS, we will attempt to parse it as ESM. If that parses we execute it as ESM.
  3. If it doesn’t parse as any of these Node.js will fail.

@joyeecheung joyeecheung added the loaders Issues and PRs related to ES module loaders label Nov 28, 2023
@GeoffreyBooth
Copy link
Member

Please don’t forget to tag @nodejs/loaders

@GeoffreyBooth
Copy link
Member

GeoffreyBooth commented Dec 3, 2023

So unless I’m misremembering, the current algorithm is:

  1. Try to parse as CommonJS. If V8 throws one of these errors:

node/src/node_contextify.cc

Lines 1406 to 1409 in 2e458d9

constexpr std::array<std::string_view, 3> esm_syntax_error_messages = {
"Cannot use import statement outside a module", // `import` statements
"Unexpected token 'export'", // `export` statements
"Cannot use 'import.meta' outside a module"}; // `import.meta` references

  1. Then try again to parse and evaluate as ESM. Otherwise run as CommonJS.

So if in step 1 V8 throws on some other parse error, before the first import or export statement, then it just stops. I think that’s what we’re seeing in both of these cases: in each one, the first line contains code that’s invalid as CommonJS, and so V8 errors on it, and so V8 never gets far enough to error on the import or export statement and recategorize the code as ESM.

From the beginning detection was never intended to catch every valid ESM file; when we implemented it we knowingly excluded top-level await from the list of syntax that would trigger “this is ESM,” because we can’t distinguish the error that V8 throws for top-level await from the identical error that V8 throws for any await inside a sync function. The goal is only to detect ES modules that would throw a “cannot use import/export statements outside of an ES module” error and run those as ESM. We could perhaps update the docs to clarify this, that detection is conservative and requires this error to be thrown if people want it to succeed.

I don’t think we want to have multiple parsing passes to catch edge cases such as CommonJS parsing errors before the first import statement. Such code is very rare, and it’s not worth the performance hit to everyone to catch something that will throw an error. We could perhaps update the error path, so only when we’re preparing the SyntaxError: Identifier 'module' has already been declared or SyntaxError: await is only valid in async functions and the top level bodies of modules error messages in these examples we check “does this parse as ESM?” and if so suggest using one of the explicit methods of running the file as ESM.

@GeoffreyBooth
Copy link
Member

GeoffreyBooth commented Dec 3, 2023

@joyeecheung

  1. If it can be parsed as CJS, it’s executed as CJS. Doesn’t matter if it can be parsed as ESM or not. If the user wants to execute a script that can be parsed both ways as ESM, just make that explicit through any of the available options.
  2. If it cannot be parsed as CJS, we will attempt to parse it as ESM. If that parses we execute it as ESM.
  3. If it doesn’t parse as any of these Node.js will fail.

If you change step 2 to

  1. If it cannot be parsed as CJS, and during the attempted parse it threw one of these errors, we will attempt to run it as ESM.

Then you’re describing the current behavior. Though it’s an interesting idea to attempt “run as ESM” for any parsing error when attempting to parse as CommonJS, rather than restricting the “try as ESM” behavior to only ESM syntax errors.

So the first thing I’m wondering is whether there’s a difference in practice between “we will attempt to run it as ESM” and “we will attempt to parse it as ESM. If that parses we execute it as ESM.” Is the latter an additional parsing pass? If so, is there a noticeable performance cost to doing so? (Perhaps not, if V8 caches.)

Let’s say there’s no meaningful performance cost. In that case, why not try again as ESM for any CommonJS parsing failure? What would be the downside? The only one I can think of is that we might get a different parsing error, that the user might not expect. For example, a module consisting of:

const { readFile } = require('fs/promises');
const contents = await readFile('file.txt');

Would throw SyntaxError: await is only valid in async functions and the top level bodies of modules when run as CommonJS or ReferenceError: require is not defined in ES module scope, you can use import instead when run as ESM. Which error is the “right” one to throw in this case? Presumably we would be throwing the second one, as it’s the most recent error before the process exits, but in this case it would be quite confusing since from the user’s perspective this is obviously CommonJS code. Or perhaps it doesn’t matter which error the user gets?

The other possible footgun I can think of is if a user is editing a CommonJS file and makes a mistake that causes it to fail parsing as CommonJS, but it does still parse as ESM. Then the user would have unintentionally opted into the “run ESM by detection” behavior. It’s quite explicit to need to type import or export or import.meta to opt into ESM, and users wouldn’t be surprised if we ran such code as ESM; but a CommonJS-only syntax error is a lot more surprising of a way to opt into ESM detection. I can’t think of a realistic example of such a case, however.

@GeoffreyBooth GeoffreyBooth added module Issues and PRs related to the module subsystem. esm Issues and PRs related to the ECMAScript Modules implementation. and removed loaders Issues and PRs related to ES module loaders labels Dec 3, 2023
@ljharb
Copy link
Member

ljharb commented Dec 3, 2023

What it means is that a file will be in sloppy mode up until the first import or export is written, at which point it would suddenly be in strict mode. While the programming models in which that's likely are rare, when they do happen, it will be surprising and very very difficult to debug.

@joyeecheung
Copy link
Member

joyeecheung commented Dec 3, 2023

Would throw SyntaxError: await is only valid in async functions and the top level bodies of modules when run as CommonJS or ReferenceError: require is not defined in ES module scope, you can use import instead when run as ESM. Which error is the “right” one to throw in this case? Presumably we would be throwing the second one, as it’s the most recent error before the process exits; perhaps it doesn’t matter which error the user gets?

I don’t think we need to extend to reference errors. Just syntax errors are enough. So it can just fail as 1. Or just that we can stash the first error somewhere and if ESM errors again, in step 3 we throw the first one. Or we can just show both. If users can be surprised by what happens, we just tell them what happens.

@aduh95
Copy link
Contributor

aduh95 commented Dec 3, 2023

Yeah, let’s not extend to ReferenceErrors, otherwise we would be executing some code twice. Showing both errors is the best UX, so we should try to do that – unless it turns out to be too difficult to do

@GeoffreyBooth
Copy link
Member

I think at the very least we can improve the documentation and error messages. I’m unsure what, if anything, we should do beyond that. The only thing I can think of is to remove the restriction for “try again as ESM” from just the ESM-related syntax errors to any parsing/syntax error, but I’m wary that doing so could introduce other issues; but the only potential problem I can think of is the “different errors thrown when compiled as CommonJS versus when compiled as ESM” issue, which is something we can certainly handle one way or another. @joyeecheung @targos @aduh95 can you think of any issues that would be created if we did so?

I think in practice what this would look like would be replacing

node/src/node_contextify.cc

Lines 1487 to 1492 in 2e458d9

for (const auto& error_message : esm_syntax_error_messages) {
if (message.find(error_message) != std::string_view::npos) {
found_error_message_caused_by_module_syntax = true;
break;
}
}
with a check that the V8 error message began with SyntaxError:.

@GeoffreyBooth
Copy link
Member

I can think of one reason not to change "try as ESM after getting an ESM-related parse error" to "try as ESM after getting any parse error": other tools would then have no reliable way to know how Node will run a file. Currently they can look at our docs, where it lists the syntax we check for (import statement but not expression, export, import.meta) and parse the JavaScript accordingly to know what Node will do. If we run as ESM for anything that fails to parse in CommonJS, that ability would be lost.

The counterargument is that if there's a CommonJS syntax error before the first appearance of this syntax, Node will still fail to run the file as ESM even though it has the opt-in syntax. So the current behavior is more like "if the file parses as CommonJS except for this syntax, Node will run as ESM" which is much less straightforward to evaluate by tools; though I'd expect many to skip over the edge case presented by this issue and just assume that files can parse correctly.

@GeoffreyBooth
Copy link
Member

I’ve been discussing with @guybedford and we think we have a solution for this. So cjs-module-lexer, which is already in the codebase, can tell us if a source string contains ESM syntax; and it can do so without erroring early on (most) other syntax errors. (If a file is so mangled that it can’t be parsed, like unterminated strings or the like, it might not find the ESM syntax; but such a file wouldn’t be runnable as either CommonJS or as ESM, so it wouldn’t matter.) So the flow could be (new part in bold):

  • Pass a module’s source to V8’s compileFunction, asking V8 to try to parse the source as CommonJS.
  • If it parses, run as CommonJS (end).
  • Else inspect the parse error that V8 threw.
  • If it’s one of the ESM syntax ones (import or export statement, import.meta) run as ESM (end).
  • (New addition): Else ask cjs-module-lexer to parse the source and see if it finds ESM syntax.
  • If cjs-module-lexer finds ESM syntax, run as ESM (end).
  • Else throw original V8 syntax error.

This should cause the examples in this issue, which contain code that can run as ESM but not as CommonJS, to successfully evaluate as ESM; while not expanding our algorithm beyond what it currently is, and is easily reproducible by other tools. And the additional parse would only happen for the rare case of a file that:

  • Is .js without "type" set in the nearest parent package.json, and
  • Can’t parse (or run) as CommonJS, and
  • Can parse as ESM, and
  • Contains syntax above the first import or export statement or import.meta reference, that can’t parse as CommonJS but can parse as ESM: top-level await or redeclaring the CommonJS wrapper variables

@GeoffreyBooth GeoffreyBooth changed the title node --experimental-detect-module is currently flawed Bug: --experimental-detect-module doesn’t run as ESM files that have CommonJS parse errors above the first ESM syntax Dec 12, 2023
@GeoffreyBooth
Copy link
Member

GeoffreyBooth commented Mar 7, 2024

In discussion with @joyeecheung and @guybedford I think there might be an even simpler solution. The insight is twofold:

  • There are a limited number of syntax errors that could be thrown when code is parsed as CommonJS but not as ESM. This issue lists two of them, for const module/exports/require/__filename/__dirname; or top-level await.
  • It’s possible to write ESM code that throws the same error (like you can have an ES module with const module; const module; or await in a sync function) but if you do so, you’ll get the same error, and it won’t matter if the error was thrown from a CommonJS parse or an ESM parse.

We could add these additional errors to this list of exceptions that trigger a retry as ESM. So the flow would be (new part in bold):

  • Pass a module’s source to V8’s compileFunction, asking V8 to try to parse the source as CommonJS.
  • If it parses, run as CommonJS (end).
  • Else inspect the parse error that V8 threw.
  • If it’s one of the ESM syntax ones (import or export statement, import.meta) or has already been declared for one of the CommonJS module-scoped variables, or await in a sync function, run as ESM.
  • Else throw original V8 syntax error.

Would this work? If so, are there any errors to add to our list besides the below?

  • SyntaxError: Identifier 'module' has already been declared
  • SyntaxError: Identifier 'exports' has already been declared
  • SyntaxError: Identifier 'require' has already been declared
  • SyntaxError: Identifier '__filename' has already been declared
  • SyntaxError: Identifier '__dirname' has already been declared
  • SyntaxError: await is only valid in async functions and the top level bodies of modules

The other nice thing about this approach is that it’s still replicable by other tools.

Edit: Here’s an example that would be an issue with this approach:

exports.blah = 6;
const module = 'test';

The const module would trigger ESM evaluation, which would cause a runtime error on the undefined exports. If we print both errors, then any ESM file that errors would print a usually irrelevant CommonJS parse error (the error that triggered ESM evaluation in the first place). If we print only the ESM error, there will be cases like this one where the user intended CommonJS and opted into ESM evaluation by accident.

@joyeecheung
Copy link
Member

I don't understand why that's an issue. What's wrong with saying "hey looks like this file is neither correct CJS nor ESM. If we try to interpret it as CJS the error is here, if we try to interpret it as ESM the error is here"?

@GeoffreyBooth
Copy link
Member

I don’t understand why that’s an issue. What’s wrong with saying “hey looks like this file is neither correct CJS nor ESM. If we try to interpret it as CJS the error is here, if we try to interpret it as ESM the error is here”?

Imagine a 1000-line CommonJS file; you add const module = on line 900 and suddenly the whole file gets parsed as ESM and errors on the module.exports = way up on line 5. It would be surprising.

Also I assume it wouldn’t be too hard to show two parse errors if neither mode parses successfully, since both parses happen one after each other in the same function, but to show a CommonJS parse error and an ESM runtime error would mean somehow remembering the CommonJS parse errors for every ES module being parsed and linked before they’re later evaluated.

@joyeecheung
Copy link
Member

I don't think that's that problematic. If they don't want surprises, just make the module type explicit. Otherwise treat it as sloppy mode, which is just...JavaScript.

Re. reparsing I think they can be done together. At least for require(esm) that's easily doable. For the ESM loader, some refactoring needs to be done because it has too many abstractions in the way.

@GeoffreyBooth
Copy link
Member

GeoffreyBooth commented Mar 8, 2024

I don’t think that’s that problematic.

Well great; if such an outcome is acceptable, then we have multiple solutions.

Obviously something which has fewer surprises would be better, though. Here’s another potential solution, based on #50917 (comment):

If the CommonJS parse throws one of the six potential errors listed above, (has already been declared / await is only valid), then we do a special second CommonJS parse where add a wrapper inside the CommonJS wrapper:

(function(module, exports, require, __filename, __dirname) { (async function() {
  CODE
})(); })();

If the has already been declared or await is only valid errors were thrown in the first parse because of a variable declaration or await at the top level, this new inner async function will eliminate those errors, and then we retry as ESM; else we throw whatever error was thrown by the first CommonJS parse.

If this second CommonJS parse still errors (on the same error?) then the user really did write code like const require = ...'; const require = ... or await in a sync function that wasn’t the CommonJS wrapper, and therefore this module won’t successfully parse as ESM; so just return the first CommonJS parse error.

I think this would cover all the cases in this issue, without needing to involve cjs-module-lexer; and this extra CommonJS parse would only happen in very limited circumstances on an error path, so it would rarely affect performance. Does this seem like it could work?

@joyeecheung
Copy link
Member

joyeecheung commented Jul 29, 2024

I just ran into the "accidentally opted into ESM mode" issue by simply running a file that does const module = require('module');:

❯ node test-tsc.js
/Users/joyee/projects/node/test-tsc.js:3
const module = require('module');
      ^

SyntaxError: Identifier 'module' has already been declared
    at internalCompileFunction (node:internal/vm:73:18)
    at wrapSafe (node:internal/modules/cjs/loader:1153:20)
    at Module._compile (node:internal/modules/cjs/loader:1197:27)
    at Module._extensions..js (node:internal/modules/cjs/loader:1287:10)
    at Module.load (node:internal/modules/cjs/loader:1091:32)
    at Module._load (node:internal/modules/cjs/loader:938:12)
    at Function.executeUserEntryPoint [as runMain] (node:internal/modules/run_main:83:12)
    at node:internal/main/run_main_module:23:47

Node.js v20.5.0
❯ out/Release/node test-tsc.js
(node:11443) [MODULE_TYPELESS_PACKAGE_JSON] Warning: file:///Users/joyee/projects/node/test-tsc.js parsed as an ES module because module syntax was detected; to avoid the performance penalty of syntax detection, add "type": "module" to /package.json
(Use `node --trace-warnings ...` to show where the warning was created)
file:///Users/joyee/projects/node/test-tsc.js:3
const module = require('module');
               ^

ReferenceError: require is not defined in ES module scope, you can use import instead
    at file:///Users/joyee/projects/node/test-tsc.js:3:16
    at ModuleJob.run (node:internal/modules/esm/module_job:262:25)
    at async onImport.tracePromise.__proto__ (node:internal/modules/esm/loader:482:26)

Node.js v23.0.0-pre

The latter is not a case that we can detect based on syntax - it's a reference error thrown at evaluation time. I think the suggestion myself and @aduh95 made previously about showing both errors still make sense - the current message isn't very helpful for users accidentally opting into ESM mode. Showing both errors at least helps users understand what's going on.

On a side note: the current modified error message isn't super helpful either, a beginner would think it's telling them to do const module = import('module'), which doesn't make sense. If it's modified to be ESM it should be import module from 'node:module' or const module = await import('node:module').

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
esm Issues and PRs related to the ECMAScript Modules implementation. module Issues and PRs related to the module subsystem.
Projects
None yet
7 participants