Skip to content
This repository has been archived by the owner on Sep 2, 2023. It is now read-only.

Loader Hooks #351

Closed
Tracked by #12 ...
reasonablytall opened this issue Jul 17, 2019 · 108 comments
Closed
Tracked by #12 ...

Loader Hooks #351

reasonablytall opened this issue Jul 17, 2019 · 108 comments

Comments

@reasonablytall
Copy link
Contributor

reasonablytall commented Jul 17, 2019

Hooking into the dependency loading steps in Nodejs should be easy, efficient, and reliable across CJS+ESM. Loader hooks would allow for developers to make systematic changes to dependency loading without breaking other systems.

It looks like discussion on this topic has died down, but I'm really interested in loader hooks and would be excited to work on an implementation! There's of prior discussion to parse through, and with this issue I'm hoping to reignite discussion and to create a place for feedback.

Some of that prior discussion:


edit (mylesborins)

here is a link to the design doc

https://docs.google.com/document/d/1J0zDFkwxojLXc36t2gcv1gZ-QnoTXSzK1O6mNAMlync/edit#heading=h.xzp5p5pt8hlq

@jkrems jkrems added the modules-agenda To be discussed in a meeting label Jul 17, 2019
@reasonablytall
Copy link
Contributor Author

Some use cases I've encountered:

I'm working on a custom dependency bundler and loader designed to improve cold-start startup times by transparently loading from a bundle to avoid file-system overhead. Currently, I have to monkey-patch module and reimplement CJS resolution with @soldair's node-module-resolution. I have to deeply understand and often reimplement CJS+ESM internals to work on this.

I also want to load modules from V8 code-cache similar to v8-compile-cache. Again I have to re-implement Module._compile and manually handle fallback for other extensions.

Some other use cases that would benefit:

  • Transpilers/compileres (babel, ts-node)
  • Bundlers (webpack, browserify, ncc)
  • Other custom loaders (pnp, tink, loading from zip, etc)
  • Code instrumentation (coverage, logging, testing)

@devsnek
Copy link
Member

devsnek commented Jul 17, 2019

I think the current exposed hooks are the right hooks to expose, but we definitely need to work on polishing the API:

  • dynamic modules are a bit rough atm
  • how are hooks registered
  • how do multiple registered hooks interact

@MylesBorins
Copy link
Member

Very excited to see interest in this! I believe that @bmeck has a POC that has memory leaks that need to be fixed. @guybedford may know about this too

@jkrems
Copy link
Contributor

jkrems commented Jul 17, 2019

@devsnek I think there's more things that list is missing. E.g. providing resource content, not just format. Or the question of if we can extend aspects of this feature to CommonJS (e.g. for the tink/entropic/yarn case that currently requires monkey-patching the CommonJS loader or even the fs module itself). The current hooks were a good starting point but I would disagree that they are the right hooks.

@devsnek
Copy link
Member

devsnek commented Jul 17, 2019

@jkrems i think cjs loader hooks are outside the realm of our design. cjs can only deal with local files and it uses filenames, not urls.

Providing resource content is an interesting idea though. I wonder if we could just figure out a way to pass vm modules to the loader.

@bmeck
Copy link
Member

bmeck commented Jul 17, 2019

@devsnek we discussed and even implemented a PoC of intercepting CJS in the middle of last year and had a talk on the how/why in both

These would only allow for files for require since that is what CJS works with but it should be tenable. Interaction with require.cache is a bit precarious but solvable if enough agreement can be reached.

@devsnek
Copy link
Member

devsnek commented Jul 17, 2019

@bmeck i don't doubt it can be done, i'm just less convinced it makes sense to include with the esm loader hooks given the large differences in the systems.

@guybedford
Copy link
Contributor

@a-lxe thanks for opening this discussion. It was interesting to hear you say that multiple loaders were one of the features you find important here. The PR at nodejs/node#18914 could certainly be revived. Is this something you are thinking of working on? I'd be glad to collaborate on this work if you would like to discuss it further at all.

@reasonablytall
Copy link
Contributor Author

@guybedford Yeah! At least to me it seems right now the singular --loader api is insufficient for current loader use cases. For example in my projects I test with ts-node istanbul, mocha, and source-map-support -- each of which hooks into loading in one way or another IIRC. Optimally these could each independently interface with a loader hook api and smoothly compound on each other.

I think a node loader hook api needs to provide mechanisms for compounding on and falling back to already registered hooks (or the default cjs/esm behavior). I'm not really sure yet where to focus work, but I definitely want to collaborate :)

@guybedford
Copy link
Contributor

@a-lxe agreed we need a way to chain loaders. Would the approach in nodejs/node#18914 work for you, or if not, how would you want to go about it differently? One way to start might be to get that rebased and working again and then to iterate on it from there.

@reasonablytall
Copy link
Contributor Author

@guybedford I like the way nodejs/node#18914 chains the loaders and provides parent to allow fallback/augmentation of both the resolve + dynamic instantiation steps. I have some ideals for what a loader hook api should look like (particularly wrt supporting cjs) but I don't think those should get in the way of providing multiple --loader for esm. To be honest working on reviving that PR would be really useful for me in getting up to speed with things, so I would be happy to get started on that.

Some gripes which are more relevant to the initial --loader implementation rather than the multiple --loader feature:

  • Why is there no runtime api for registering loaders? The current mechanisms using --require to register loaders via preloaded modules feel nice to me, and allow for opting to manually register at a particular point and with dynamic parameters.
  • A loader has to implement both hooks (and add fallback overhead) even if it only affects one.
  • Similarly, I feel like the hooks could be more granular. @bmeck 's Resource APIs for Node splits things into locate, retrieve, and translate hooks. With an additional initialize hook for actually creating the module, these match the Allow for multiple --loader flags node#18914 functionality, with an added bonus that only initialize needs to have coupling with cjs/esm. I'm curious what you think on this.
  • Doesn't hook into cjs require :'(

Also, your last comment on nodejs/node#18914 hints at another loaders implementation by @bmeck. Does this exist in an actionable state?

@BridgeAR this work also exists as part of the new loaders work which @bmeck started, so that effectively takes over from this PR already. Closing sounds sensible to me.

@guybedford
Copy link
Contributor

Why is there no runtime api for registering loaders?

Loaders are a higher-level feature of the environment, kind of like a boot system feature. They sit at the root of the security model for the application, so there are some security concerns here. In addition to that, hooking loaders during runtime can lead to unpredictable results, since any already-loaded modules will not get loaders applied. I'm sure @bmeck can clarify on these points, but those are the two I remember on this discussion offhand.

A loader has to implement both hooks (and add fallback overhead) even if it only affects one.

There is nothing to say we won't have CJS loader hooks or a generalized hook system, but it's just that our priority to date has been getting the ESM loader worked out. In addition the ESM hooks allow async functions, while CJS hooks would need some manipulation to support async calls. There's also the problem of the loaders running in different resolution spaces (URLs v paths) as discussed. Once we have our base ESM loader API finalized I'm sure we could extend it to CJS with some extra resolution metadata and handling of the resolution spaces, but I very much feel that loader unification is a "nice to have" that is additive over the base-level ESM API which should be the priority for us to consolidate and work towards first. That loader stability and architecture should take preference in the development process. That said, if you want to work on CJS unification first, feel free, but there are no guarantees the loader API will be stable or even unflagged unless we work hard towards that singular goal right now. So what I'm saying is chained loaders, whether the loader is off-thread, whether the API will be abstracted to deal with multi-realm and non-registry based API, and the translate hook all take preference in the path to a stable API to me, overy unifying ESM and CJS hooks. And that path is already very tenuous and unlikely, so that we should focus our combined efforts on the API stability first and foremost.

Similarly, I feel like the hooks could be more granular.

Implementing a translate or fetch hook for --loader could certainly be done and was a deliberate omission in the loader API. It is purely a problem of writing the code, making a PR, and the real hard part - getting consensus!

Doesn't hook into cjs require :'(

As mentioned above, this work can be done, but I would prefer to get the ground work done first.

@reasonablytall
Copy link
Contributor Author

That all makes a lot of sense and I appreciate you describing it for me 🙂

I can start with pulling nodejs/node#18914 and getting that in a working state.

@GeoffreyBooth
Copy link
Member

Just to spark some discussion, here’s a wholly theoretical potential API that I could imagine being useful to me as a developer:

import { registerHook } from 'module';
import { promises as fs, constants as fsConstants } from 'fs';

registerHook('beforeRead', async function automaticExtensionResolution (module) {
  const extensions = ['', '.mjs', '.js', '.cjs'];
  for (let i = 0; i < extensions.length; i++) {
    const resolvedPathWithExtension = `${module.resolvedPath}${extensions[i]}`;
    try {
      await fs.access(resolvedPathWithExtension, fsConstants.R_OK);
      module.originalResolvedPath = module.resolvedPath;
      module.resolvedPath = resolvedPathWithExtension;
      break;
    } catch {}
  }
  return module;
}, 10);

The new registerHook method takes three arguments:

  • The hook name, which is a point in Node’s code where these callbacks will be run. beforeRead, afterRead, etc.
  • The function to call at that point, which takes as input an object with all the properties related to import or require resolution and module loading that developers might want to override. Properties set on this object persist to other callbacks registered to later hooks in the process (e.g. module.foo set during beforeRead would be accessible in a different callback registered to afterRead).
  • (Optional): The priority to call registered callbacks. If multiple callbacks have the same priority level, they are evaluated in the order that they were registered.

In the first example, my automaticExtensionResolution callback is registered to beforeRead because it’s important to rewrite the path that Node tries to load before Node tries to load any files from disk (because './file' wouldn’t exist but './file.js' might, and we don’t want an exception thrown before our callback can tell Node to load './file.js' instead). I’m imagining the module object here has an unused specifier property with whatever the original string was, e.g. pkg/file, and what Node would resolve that to in resolvedPath, e.g. ./node_modules/pkg/file.

Another example:

import { registerHook } from 'module';
import CoffeeScript from 'coffeescript';

registerHook('afterRead', async function transpileCoffeeScript (module) {
  if (/\.coffee$|\.litcoffee$|\.coffee\.md$/.test(module.resolvedPath)) {
    module.source = CoffeeScript.compile(module.source);
  }
  return module;
}, 10);

This hook is registered after Node has loaded the file contents from disk (module.source) but before the contents are added to Node’s cache or evaluated. This gives my callback a chance to modify those contents before Node does anything with them.

And so on. I have no idea how close or far any of the above is from the actual implementation of the module machinery; hopefully it’s not so distant as to be useless. Most of the loader use cases in our README could be satisfied by an API like this:

  • Code coverage/instrumentation: In an afterRead hook, a callback could count lines of code or the like.
  • Runtime loaders, transpilation at import time: In an afterRead hook, example above.
  • Arbitrary sources for module source text: In a beforeRead hook, our callback could assign content into module.source (and then Node would know to not read from disk for this module).
  • Mock modules (injection): Basically the same as previous, a beforeRead hook could return a different path to load instead, or prefill the source code to use: if (module.specifier === 'request') module.source = mockRequest etc. Ideally Node would handle if source were an actual module rather than just a string to be evaluated.
  • Specifier resolution customization: This is like the extension resolution example above, though if we also want to support specifiers that Node can’t resolve, like import 'https://something', we would need another hook like beforeResolve.
  • Package encapsulation: We’re already implementing this as "exports", but it could just as easily be implemented as a loader, at least for ESM. It would be in beforeRead like the extension resolution example.
  • Conditional imports: In beforeRead or afterRead, based on some condition the module.source could be set to an empty string.

Anyway this is just to start a discussion of what kind of public-facing API we would want, and the kind of use cases it would support. I’m not at all married to any of the above, I’m just hoping that we come up with something that has roughly the same versatility as this.

@reasonablytall
Copy link
Contributor Author

Thanks for this write-up @GeoffreyBooth! Some thoughts to add to the discussion:

To me this looks like a transformer architecture, which exposes the entire in-progress module object to each hook, as opposed to the current --loader implementation, which has functional resolve and instantiate hooks. I would worry about exposing too large a surface area, ie developers doing something like reading a new file to override the old in the afterRead hook. Besides the transformer architecture, the differences largely come down to which hooks are exposed.

This api also doesn't allow for a loader to prevent other loaders from acting, which the wip multiple loader implementation at nodejs/node#18914 does. I don't think that's a bad thing, and I would be interested in hearing what people think on that front.

I'm not sure about the optional priority parameter. I don't think loaders should know much about what other loaders are registered or be making decisions about which order they're executed in. The user controls the order by choosing the order in which they register the loaders.

@GeoffreyBooth
Copy link
Member

These are all good points. I would err on the side of exposing a lot of surface area, though, as that’s what users are used to from CommonJS. A lot of the power of things like stubs are because the surface area is huge.

In particular, I think we do want to allow reading a new file to override the old, or at least modifying the loaded string (which of course could be modified by loading the contents of a new file); otherwise we can’t have transpilers, for example, or stubs that are conditional based on the contents of a file rather than just the name in the specifier.

The priority option is just a convenience, so that the user doesn’t need to be careful about the order that they register hooks.

One thing that I thought of after posting was to add the concept of package scope to this. A lot of loaders will only be useful in the app’s package scope, not the combined scope of the app plus all its dependencies. We probably want some easy way to limit the callbacks to just the package scope around process.cwd().

@reasonablytall
Copy link
Contributor Author

On the afterRead point, you're totally right -- there needs to be a way of knowing+overriding the original loaded source (which can currently be done by loading the source and modifying it in the instantiate hook). I think I gave a bad example: by providing the entire in-progress module object, the user can modify aspects of it even in hooks where it shouldn't be modified (a better example might be module.resolvedPath = 'something' in afterRead).

@GeoffreyBooth
Copy link
Member

by providing the entire in-progress module object, the user can modify aspects of it even in hooks where it shouldn’t be modified (a better example might be module.resolvedPath = 'something' in afterRead).

Node could simply ignore any changes to “shouldn’t be modified” properties. That’s probably better than trying to lock them down or removing them from the module object, since late hooks might very well want to know things like what the resolvedPath was at an earlier step. This also feels like something we can work out in the implementation stage of building this.

@bmeck
Copy link
Member

bmeck commented Jul 24, 2019

So there is a lot to talk about on loaders. We have had multiple meetings discussing some design constraints to keep in mind. I think setting up another meeting just to review things from the past would be helpful.

@arcanis
Copy link

arcanis commented Jul 29, 2019

At the moment, from PnP's perspective:

  • We'd need a hook that takes the bare request + the path of the file that makes the require call, and is expected to return the unqualified resolution (so in a node_modules context it means such a loader would return /n_m/lodash rather than /n_m/lodash/index.js).

    • This is required for the CJS use case, as we have no reason to reimplement the extension / main resolution part.
  • We'd need a hook to instruct Node how to load the source for virtual files. For example, given /cache/lodash.zip/index.js, we would replace the default hook (that would use readFile) by a custom one that would open the zip, get the file source, and return it.

    • Note that in this case the source file might not exist at all. So rather than discussing in terms of paths, I think it would be beneficial to treat the resolution as an opaque identifier that doesn't necessarily match anything on the filesystem. We would then have a particular function (let's say require.toDiskPath) that would turn such opaque identifiers into paths usable on the disk (or throw an exception if impossible).

@devsnek
Copy link
Member

devsnek commented Jul 29, 2019

the first one is already possible with our current design (not including cjs). the second one is interesting and should probably exist, but it is unlikely that a cjs version can be added without breaking some modules that are loaded by it.

@arcanis
Copy link

arcanis commented Jul 29, 2019

the first one is already possible with our current design (not including cjs). the second one is interesting and should probably exist, but it is unlikely that a cjs version can be added without breaking some modules that are loaded by it.

We're currently monkey-patching the fs module to add transparent support for zip and a few other extra files. It works for compatibility purposes, and it's likely we'll have to keep it as long as cjs environments are the norm, but I'm aware of the shortcomings of the approach and I'd prefer to have a long-term plan to sunset it someday, at least for ESM 🙂

I think for cjs it would be doable if the require.resolve API was deprecated and split between two functions: require.req(string): Resolution and require.path(Resolution): URL, but it might be out of scope for this group and as I mentioned the fs layer is working decently well enough at the moment that it's not an emergency to find something else.

@bmeck
Copy link
Member

bmeck commented Jul 29, 2019

We're currently monkey-patching the fs module to add transparent support for zip and a few other extra files. It works for compatibility purposes, and it's likely we'll have to keep it as long as cjs environments are the norm, but I'm aware of the shortcomings of the approach and I'd prefer to have a long-term plan to sunset it someday, at least for ESM 🙂

Part of my worry and reason why I feel we need to expand loader hooks as best we can for CJS is exactly that we don't guarantee this to work currently or in the future. Even if we cannot support C++ modules (the main problem with this FS patching approach that has been known since at least 2014 when I spoke on it) we can cover most situations and WASM at least can begin to replace some C++ module needs. I see this as a strong indication that we need to solve this somehow or provide some level of discretion for what is supported.

@bmeck
Copy link
Member

bmeck commented Aug 5, 2019

We have a mostly stable design document, please feel free to comment or request edit access as needed, at https://docs.google.com/document/d/1J0zDFkwxojLXc36t2gcv1gZ-QnoTXSzK1O6mNAMlync/edit#heading=h.xzp5p5pt8hlq .

The main contention is towards the bottom around potential implementations, but reading the things before then explain a lot of different ideas from historical threads and research over the past few years and have been summarized.

@guybedford
Copy link
Contributor

guybedford commented Aug 6, 2019

Great to see work moving here! I really like the overall model, we maybe just have a few disagreements about the exact APIs. I’ve already statement my feedback in the doc, but will summarize it again here:

  1. I think the resolve hook and the body hook should be separated to allow for proper composability. By having the same call that resolves the module also load the module, makes it harder to add simple instrumentation hooks of source, or simple resolver hooks. For example, say I wanted a loader that resolves .coffee extensions before .js extensions. Calling the parent resolve function will give me { resolved: /path/to/resolved.js, body: streamingBodyForResolvedJs } for that resolved file. That is the loader might have already opened the file descriptor potentially for the .js resolution, when it is in fact a .coffee resolution that we want to load. This conflation seems like it might cause issues.

  2. Using web APIs like Response and Blob seems like bringing unnecessary web baggage to the API. For example Response can be replaced by an async iterator and a string identifier for the format. Blob can be replaced by simply an interface containing the source string and a format string. I’m not sure what benefit is brought by using these web APIs that were designed to serve specific web use cases we don’t necessarily share (at least without seeing justification for what these use cases are and why we might share them). With the use of MIMEs for formats, do we now expect every transform to define its own MIME type?

  3. How would loader errors be handled over the loader serialization protocol? Can an Error instance be properly cloned over this boundary with the stack information etc? Or would we just print the stack trace to the console directly from the loader context, while providing an opaque error to the user. We need to ensure we maintain a good debugging experience for loader errors, so we need to make sure we can get error stacks. Or is the stack information itself an information leak?

Most of the above is relatively superficial though - the core of the model seems good to me. (1) means having two-phase messaging with loaders, so is slightly architectural though.

@bmeck
Copy link
Member

bmeck commented Aug 6, 2019

@guybedford

Per "separation". I agree there needs to be a "fetch"/"retrieve" hook of some kind, but not that resolve` should not be able to return a body. The problem you explain above is about passing data to parent loaders such as list of extensions, but does not seem to be fixed by separating loaders that I can tell.


Per APIs, we can argue about which APIs to use but we should start making lists of what features are desirable rather than bike shedding without purpose. To that end I'd like to posit the following:

  1. Most APIs working on sources do not support streaming such as JSON.parse, JS parsers such as esprima, and WebAssembly.compile/instantiate. Even naive RegExp searches on the body will want to buffer them to a full body before searching. I think we should not focus on streaming for the first iteration in light of this.
  2. Data may be wanted in either a binary format or a textual format. This largely depends on the format. Consumption methods for both should be available as some naive steps can lead to corruption like split UTF code points. I like Blob because it does support this via .text() and .arrayBuffer().
  3. Streaming sources need care about how they are consumed. For example, reading the start of a stream to see if it begins with a magic number. If they cannot be replayed/cloned safely this is a problem. I like Response or a subset of that API because it has already solved these problems while preserving meta-data.
  4. When possible, opaque data structures allow for streaming either eagerly or lazily and can be largely swapped without consequence as we determine the best approach. When doing things eagerly, they can buffer and even complete reading before being requested. When doing things lazily, they can avoid costly I/O waste if they are not consumed. To this end, I believe we should have an opaque wrapper that does provide meta-data and if a resource is available prior to the stream of the resource's body.

If that sounds fine, we can add constraints and a data structure to the design document.

Overall, I do not think streaming is necessarily the best first pass given how little I expect it to be useful currently.

I found Blob to be a well suited fit for the above points if we wrap it in a container type so that we can iterate on streaming. It has plenty of existing documentation on how to use it as well as compatibility and familiarity. It may not be the most ergonomic API for all use cases, but I think it fits well and don't see advantages in making our own.


Error stacks are able to be serialized properly, but it depends on what you are seeking from a debugging experience. They are a leak technically, but I do not consider them a fatal leak since a loader can throw their own object instead of a JS Error if they wish to censor things. Not all things thrown necessarily have a stack associated with them, so if the question is mostly about how Errors are serialized it would just be ensuring they serialize properly (whatever we decide) when being sent across the messaging system. There is a question of async stack traces if we nest messages across threads but I am unsure if we want to even support cross thread stack traces as the ids of modules could conflict unless we add more data to represent the context.

I would be wary about user actionability on these messages as Loaders are likely to be more difficult to write properly than other APIs. However, debuggers and the like should also work if they want to debug things that way.

@guybedford
Copy link
Contributor

Per "separation". I agree there needs to be a "fetch"/"retrieve" hook of some kind, but not that resolve` should not be able to return a body. The problem you explain above is about passing data to parent loaders such as list of extensions, but does not seem to be fixed by separating loaders that I can tell.

As another example, consider a loader which applies a security policy that only certain modules on the file system can be loaded. This loader is added last in the chain, and basically provides a filter on the resolver, throwing for resolutions that are not permitted. The issue then with the model is that by the time the permission loader throws, the file might have already been opened by the underlying parent loader. This is the sort of separation of concerns that concerns me.

Per APIs, we can argue about which APIs to use but we should start making lists of what features are desirable rather than bike shedding without purpose.

The basic requirement is being able to determine what buffer to execute, and how to execute it in the module system. The simplest interface that captures this requirement is -

interface Output {
  source: String | Buffer;
  format: 'wasm' | 'module' | 'addon' | 'commonjs' | 'json'
}

The above could be extended to support streams by supporting source as an async iterator as well, but I'm certainly not pushing streams support yet either.

Error stacks are able to be serialized properly, but it depends on what you are seeking from a debugging experience.

Thanks for the clarifications re error stacks, we should just make sure we are aware of the debugging experience implications and properly support these workflows. Just getting the sync stack copied across as a string should be fine I guess.

@GeoffreyBooth
Copy link
Member

GeoffreyBooth commented May 8, 2020

This has been something I've been thinking about as well. I think for a loader to handle both ESM and CommonJS files, it also needs to hook into require.extensions (at least for now). I'd like to proxy/override or wrap require.extensions['.js'] for extensions that should be handled similarly, but without necessarily having the “throw if this is ESM” check. I wonder if there's a not-terrible way to prevent that, aside from copying the source of require.extensions['.js'] into my code.

Another thing we might want to consider is making the ESM loader hooks simply the loader hooks, to apply to both CommonJS and ESM, finally replacing the deprecated require.extensions. I haven't given thought into how that would work in practice, but there are lots of use cases such as instrumentation where the loader should affect all files, not just ESM ones, and it's more work for loader authors to hook into both of Node's loaders in very different ways.

@cspotcode
Copy link

cspotcode commented May 8, 2020

@GeoffreyBooth the problem with unifying the hooks is that require() must be synchronous.

Right now, when a user installs ts-node's ESM hooks, we know they can't have installed any other hooks. Depending how you think about, we are responsible for implementing some basic features that would ideally be implemented by third-party ESM hooks.

If a third-party library installs require.extensions['.coffee'], for example, do we need special logic in our hooks to resolve() .coffee files and getFormat them the same as .js files? We cannot pass them to defaultGetFormat because it doesn't understand .coffee files.

@GeoffreyBooth
Copy link
Member

the problem with unifying the hooks is that require() must be synchronous.

Hmmmm. That is a problem. @jkrems or @weswigham is there any hope here, or would hooks that work with CommonJS run into the same issues that Wes’ “require of ESM” PR did?

I suppose one could write synchronous hooks? CoffeeScript transpilation is synchronous, for example; I dunno if TypeScript’s is? Obviously there couldn’t be a sync HTTP loader but there are plenty of useful loader cases that don’t need async.

Anyway applying these hooks to CommonJS as well is a long-term maybe goal. AFAIK CommonJS wasn’t designed to be hookable/customizable, and the current ways people do it are monkey-patched hacks more or less. Perhaps it’s best left as is.

@bmeck
Copy link
Member

bmeck commented May 26, 2020

@GeoffreyBooth one of the motivations of moving loading off thread/isolated was we have proof of concept that we can use a SharedArrayBuffer to sleep the main thread while the loader does async tasks but looks as if it were blocking via Atomics.wait. It does not have the same effect as require of ESM.

@GeoffreyBooth
Copy link
Member

one of the motivations of moving loading off thread/isolated

What's the benefit of moving to a threaded loader in terms of user story? Do we get improved performance? Improved security? New functionality that wouldn't have been possible in a single-threaded loader?

@jkrems
Copy link
Contributor

jkrems commented May 27, 2020

Do we get improved performance?

Decreased resource/memory usage is the most expected outcome, especially with multiple contexts or threads that all need hooks. If the APIs for hooks aren't designed to run in isolation, this would be hard or impossible to achieve in userland (since none of the hook implementations would be compatible with those assumptions by default). It's much more realistic to run one TSC or babel instance than 100 per process.

Another aspect is increased stability: The loader can't accidentally be broken by or break the application, at least not as easily as when they run in the same global scope.

@bmeck
Copy link
Member

bmeck commented May 27, 2020

@GeoffreyBooth

So we have 2 users effectively:

  1. application code
  2. loader code

I think for application code not much would be visibly affected and/or varies to much per application to state much about it.

I think for loader code there are a lot of pro/con to consider but I believe the pros outweigh the cons significantly. In theory, a portion of the pros could be left to users to do things like spin up workers inside of their own loader, but a variety of things are not feasible or have higher advantages if done by the runtime itself.

The overall story is a bit complicated in terms of performance but overall I'd say for simple workflows that putting them on a thread is worse, but for complex workflows and applications it is better.

  • 😄 - loaders only need to be instantiated once instead of per thread/context
  • 😢 - spinning up a loader is much more costly (expect an extra 10ms)
  • 😄 - you can spin up multiple loaders at once
  • 😢 - loaders have to communicate using serialized messages, no object sharing
  • 😄 - doing CPU heavy work in a loader doesn't block the application/other loaders (if multiple loaders are ever allowed)
  • 😢 - you have to do duplicate work sometimes in threads

Security has a bunch of discussions about what your security model is but for some simple statements that aren't really controversial:

  • 😄 - prototype pollution won't be a way to attack loader hook callbacks
  • 😄 - when auditing, don't have to worry about application code mutating references inside of loader code
  • 😄 - trust boundary is well defined by using a different context, so things like the permissions PR can apply different levels of trust to the loader vs application code
  • 😐 - generated code from a loader is still susceptible to attacks, same as status quo
  • 😐 - loader DOS E.G. ReDOS won't be a DOS for the application itself. Still a concern though.

Per features that if loaders are not on same thread:

  • 😄 - blocking the main thread to do asynchronous operations
  • 😄 - you can share data to be reused across all threads for loading purposes (code/resolution cache)
  • 😢 - you cannot directly share object references with a loader and application code. This affects patterns like testdouble uses

I do strongly think we need to solve the object reference problem, but we haven't really spent time looking into it to my knowledge. Even if we don't move off thread we likely need to solve the object reference problem in order to allow a solution for saving variables in getGlobalPreloadCode such that they can be referenced in code generated by a loader without being mutable/globals.

@just-boris
Copy link

Another use-case for custom loaders. Stubbing CSS-imports in UI libraries.

Currently there are similar solutions for CommonJS

I tried to implement the same functionality using the new Node.js loader API: https://gist.github.com/just-boris/b07d66e306c94cf42db41b010231fbbf

Works well for such cases.

@milahu
Copy link

milahu commented Sep 29, 2021

solved the "machine-level store and symlinked node_modules" problem
with LD_PRELOAD and nodejs-hide-symlinks ... like an early marriage of pnpm and /nix/store : )

@frank-dspeed
Copy link

Hi i am defining a new packaging module standard it is in Fact a ECMAScript Module based standard for packaging modules to get reused so i want to define some fundamentals and conventions on property to express importent meta like integrity related content hashes and the used hash algorithm and verification algorithm.

how can i pull that off chained with the loader hooks api to maybe directly support the needed functions

i call it web-module standard it is designed to form a world wide web scale shared p2p distributed content addressed module build cache and module based package system

so Web 4.0 compose apps out of modules without even the need for nodejs any web platform works for that.

and we can directly link remote contexts then no roundtrips via packaging are needed if we agree on a module standard that can get directly used even when the module is relativ Zero Trust. like it is with any npm package.

what do you think?

@GeoffreyBooth
Copy link
Member

Closing we we have https://github.com/nodejs/loaders now devoted to this.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
No open projects
Custom Loading
Outstanding Needs
Development

No branches or pull requests