Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Return decoded specifiers and identifiers #61

Closed
evanw opened this issue Feb 19, 2021 · 9 comments
Closed

Return decoded specifiers and identifiers #61

evanw opened this issue Feb 19, 2021 · 9 comments

Comments

@evanw
Copy link

evanw commented Feb 19, 2021

Context: vitejs/vite#2083. Import paths are strings and can therefore contain escape sequences, like this:

import test from "./\u0442\u0435\u0441\u0442";
export default test;

Right now the documentation for this module seems to encourage using this library in a way that doesn't handle any escape sequences. Taking a raw slice of the input means that the backslashes are literal backslashes in the path instead of the actual underlying escape sequences. So in other words you would get the string "./\\u0442\\u0435\\u0441\\u0442" instead of the string "./тест" and then fail to import the path.

This can of course be worked around by interpreting the escape sequences yourself. But I thought I'd file this issue because it could be nice for either this library to do that for you or for there to be a recommended way of doing this, since one reason to use this library is to extract the import paths. If that's out of scope, maybe just calling out that you need to be sure to handle escape sequences yourself when you use this library with a link to the spec?

@guybedford
Copy link
Owner

Thanks, yes the API of this project could be more polished to avoid these footguns. It was actually originally written for https://github.com/guybedford/es-module-shims hence why it is entirely offset based, with a heavy focus on the overall library footprint.

I've added a new section to the readme in https://github.com/guybedford/es-module-lexer#escape-sequences, I think that should cover the concerns, but further clarifications welcome.

@evanw
Copy link
Author

evanw commented Feb 20, 2021

For this reason, when handling specifiers or identifiers it can advisable to pass the full string to JSON.parse:

JSON.parse(source.substring(imports[i].s, imports[i].e));

That workaround had occurred to me too. However, it doesn't work for a few reasons:

  • It's missing the quote characters, so it won't actually work at all. You'd have to do this instead:

    JSON.parse('"' + source.substring(imports[i].s, imports[i].e) + '"');
  • This is JavaScript, not JSON. They support different escape sequences. Here are some counter-examples that don't work with this approach:

    import './\x61\x62\x63.js';
    import './\u{20204}.js';
  • It will technically break for single-quoted strings containing a double quote. I assume this case would never come up in real code though.

@guybedford
Copy link
Owner

It's missing the quote characters, so it won't actually work at all. You'd have to do this instead:

The .s and .e offset indices include the " quote characters exactly for this reason, so you don't need additions.

This is JavaScript, not JSON. They support different escape sequences.

Both of the examples support JSON.parse in my test.

It will technically break for single-quoted strings containing a double quote

Because of the first point, this case works out fine too.

@guybedford
Copy link
Owner

Because of the first point, this case works out fine too.

Ah, of course this doesn't, will put some thought to this one.

@guybedford guybedford changed the title Recommended way to handle Unicode text? Return decoded specifiers and identifiers Feb 20, 2021
@evanw
Copy link
Author

evanw commented Feb 20, 2021

Here's my test:

import lexer from 'es-module-lexer';

function tryCatch(fn) {
  console.log(`\n${fn}:`);
  try {
    return fn();
  } catch (err) {
    return err;
  }
}

let js, imp;

js = `import 'abc.js';`;
[[imp]] = await lexer.parse(js);
console.log(tryCatch(() => JSON.parse(js.substring(imp.s, imp.e))));
console.log(tryCatch(() => JSON.parse('"' + js.substring(imp.s, imp.e) + '"')));

js = `import './\\x61\\x62\\x63.js';`;
[[imp]] = await lexer.parse(js);
console.log(tryCatch(() => JSON.parse('"' + js.substring(imp.s, imp.e) + '"')));

js = `import './\\u{20204}.js';`;
[[imp]] = await lexer.parse(js);
console.log(tryCatch(() => JSON.parse('"' + js.substring(imp.s, imp.e) + '"')));

I'm running with node v15.2.0 and es-module-lexer v0.3.26 from npm. This is the output:


() => JSON.parse(js.substring(imp.s, imp.e)):
SyntaxError: Unexpected token a in JSON at position 0
    at JSON.parse (<anonymous>)
    at file:///Users/evan/Desktop/example.mjs:16:33
    at tryCatch (file:///Users/evan/Desktop/example.mjs:6:12)
    at file:///Users/evan/Desktop/example.mjs:16:13

() => JSON.parse('"' + js.substring(imp.s, imp.e) + '"'):
abc.js

() => JSON.parse('"' + js.substring(imp.s, imp.e) + '"'):
SyntaxError: Unexpected token x in JSON at position 4
    at JSON.parse (<anonymous>)
    at file:///Users/evan/Desktop/example.mjs:21:33
    at tryCatch (file:///Users/evan/Desktop/example.mjs:6:12)
    at file:///Users/evan/Desktop/example.mjs:21:13

() => JSON.parse('"' + js.substring(imp.s, imp.e) + '"'):
SyntaxError: Unexpected token { in JSON at position 5
    at JSON.parse (<anonymous>)
    at file:///Users/evan/Desktop/example.mjs:25:33
    at tryCatch (file:///Users/evan/Desktop/example.mjs:6:12)
    at file:///Users/evan/Desktop/example.mjs:25:13

Edit: I think indirect eval should work fine though:

import lexer from 'es-module-lexer';

function tryCatch(fn) {
  console.log(`\n${fn}:`);
  try {
    return fn();
  } catch (err) {
    return err;
  }
}

let js, imp;

js = `import 'abc.js';`;
[[imp]] = await lexer.parse(js);
console.log(tryCatch(() => (0, eval)(js.substring(imp.s - 1, imp.e + 1))));

js = `import './\\x61\\x62\\x63.js';`;
[[imp]] = await lexer.parse(js);
console.log(tryCatch(() => (0, eval)(js.substring(imp.s - 1, imp.e + 1))));

js = `import './\\u{20204}.js';`;
[[imp]] = await lexer.parse(js);
console.log(tryCatch(() => (0, eval)(js.substring(imp.s - 1, imp.e + 1))));

js = `import './".js';`;
[[imp]] = await lexer.parse(js);
console.log(tryCatch(() => (0, eval)(js.substring(imp.s - 1, imp.e + 1))));

The output:

() => (0, eval)(js.substring(imp.s - 1, imp.e + 1)):
abc.js

() => (0, eval)(js.substring(imp.s - 1, imp.e + 1)):
./abc.js

() => (0, eval)(js.substring(imp.s - 1, imp.e + 1)):
./𠈄.js

() => (0, eval)(js.substring(imp.s - 1, imp.e + 1)):
./".js

@guybedford
Copy link
Owner

Ok, I've put together a PR in #62. Let me know if that seems like it can work here.

@evanw
Copy link
Author

evanw commented Feb 20, 2021

Yeah that approach seems good to me.

@guybedford
Copy link
Owner

Released in 0.4.0.

@evanw
Copy link
Author

evanw commented Feb 20, 2021

I can confirm the fix. Looks good!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants