Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Javascript backend - run node package install once per workspace #20826

Merged
merged 10 commits into from
Jul 8, 2024

Conversation

riisi
Copy link
Contributor

@riisi riisi commented Apr 21, 2024

Fixes #19236 and fixes #19235.

I've been poking around at this backend for a while trying to understand it. I think I'm most of the way there...
I wanted to raise this PR now for feedback - I think it's fairly complete as is - but I'm not sure if I have any blind spots in both understanding the original design intent and how it's being used.

There is a lot more functionality than I expected going in - huge props to @tobni for that - it would be great to fix the remaining issues and get some docs out so it can be used more & we can get more feedback.

Current process

For each first party package, the backend runs 'npm install' on its first party dependency packages. This is a recursive process.

Each npm install command is executed in the context of that package, generating a 'node_modules' directory containing 1st and 3rd party deps (in separate directories). Source code (from javascript_sources targets) is then copied in.
The output of node_modules from any 1st party dependency installs is merged to form node_modules for the package.

Problems

Per #19235, newer versions of package managers generate integrity files in each node_modules directories. These cause conflicts on merge.

Secondly, and more critically, it seems (I've not dug too much into this, but some past discussion here), that each npm install is effectively installing all transient deps.

E.g. in the case of a tree like package-A -> package-B -> package-C -> package-D ... I think while the intent is for package-A install to only install direct dependents, it's installing all transitive, so there's a lot more IO than intended. Effectively, it is merging ((package-A + trans deps) + (package-B + trans deps) + (package-c + trans deps)...))

New approach

Perform one install for the package in the workspace context.
The install command is npm install --workspace package-a.

The install command needs to have visibility of the root package.json, which contains the 'workspaces' configuration, as well as any dependent package.json. Currently this is all package.json in the project.

The output is a single node_modules directory at the project level (assuming default package manager config - hoisting options could be configured differently).

Advantages

Simpler and is similar to how the user would use the package manager natively to install in a mono-repo setup.
The backend is simpler to troubleshoot as there is only one sandbox per install.
We delegate more of the package resolution process to the package managers.
Performance should be theoretically better given the issue above - but I've not tested. If anyone has anything handy for testing, please let me know.

Disadvantages

We are not caching each module install within Pants, which I'm guessing was the original intent. But I'm not sure we're gaining much from this approach currently.

@riisi riisi added bug backend: JavaScript JavaScript backend-related issues labels Apr 21, 2024
Copy link
Contributor

@tobni tobni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Happy to see contributions to the JS backend.

  1. There's some unused fields / code that should be removed following the removal of the rule calls / package caches,
  2. There ought to be at least one test case that support the implementation. I suspect all tests pass because of how much heavy lifting the package managers already do. At least exposing the integrity file case as passing would be a good case to cover.
  3. Did you consider Install entire node project when creating node_modules digest, subset in sandboxes #19236? I'm fine with this change as is because it is a path forward, but I'm curious if you spent any thought on it.

This README is mainly for understanding the current backend functionality for PR review purposes.
Plan to migrate much of this content to user docs.

## Definitions
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only skimmed through this README, LMK if you want it thoroughly checked.

### Dependencies between packages

In the example above, for an installation of 'package-a' to also install 'package-c', the following things must in place:
- Both packages must be in a workspace - i.e. the directories added to a common parent package.json
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is indeed true, but I think the backend is fairly close to being able to serve as a workspace manager as well. Whether that is complexity pants is ready to take on: I doubt.


No import, package.json dep:
- node_modules missing source code for 1st party deps, everything else present
- Probably in this case, it would be nice to generate a warning, e.g. "Package-a source code has a dependency on package-b but package-b is not present in package-a's listed dependencies in package.json".
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably, but I'd also expect linters to detect unused deps more effectively. There's also a plethora of plugin-style packages that hijack instead of introducing explicit exports.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's true - that would work.


If any of the package.json files in the workspace is modified, this causes a reinstall, as the process for 'Preparing configured default npm version.', (and also the subsequent installation process) has these files as an input.

It may be possible to limit this by only including the package.json files for packages' dependencies.
Copy link
Contributor

@tobni tobni Apr 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not when you also install the entirety of the workspace each time.

If pants want to cache these results and also support the fields presence in package.json, we'd need to parse and re-write relevant subsets of that file to the processes that require package.json contents, as that is the "pants way".

src/python/pants/backend/javascript/nodejs_project_test.py Outdated Show resolved Hide resolved
src/python/pants/backend/javascript/nodejs_project.py Outdated Show resolved Hide resolved
@huonw
Copy link
Contributor

huonw commented Apr 21, 2024

Fixes #19236 and #19235.

(Totally minor, but FYI GitHub's auto-closing doesn't handle "and" like this. I've adjusted it to repeat the fixes. 😄 )

@riisi
Copy link
Contributor Author

riisi commented Apr 21, 2024

Thanks for reviewing so quickly! It will probably take me a while to go back through everything.

  1. Did you consider Install entire node project when creating node_modules digest, subset in sandboxes #19236? I'm fine with this change as is because it is a path forward, but I'm curious if you spent any thought on it.

Only briefly. I've just been trying to find a path forward mainly while trying to understand the backend at the same time.

If I'm understanding that correctly, it's suggesting to, instead of running npm install --workspace package-a for each workspace, run npm install --workspaces (i.e. install workspaces for the whole project at once) and then extract the results for each workspace?

It seems to me there would be a couple of disadvantages:

  • if a user is only working with a particular workspace, the whole project needs to be installed - if there are issues affecting other parts of the project, (e.g. a. missing package, or package conflicts, as hypotheticals), I was thinking it might be better to be able to only need to install a single workspace
  • If the lockfile is updated (or other upstream cache invalidation), then all workspaces need to be re-installed which would take longer - unless there's a way around this?

@riisi riisi force-pushed the javascript branch 2 times, most recently from 1c5c8b7 to fb8e41d Compare May 12, 2024 05:15
@riisi
Copy link
Contributor Author

riisi commented May 12, 2024

I've reworked the package integration test to cover the case of importing another workspace package, which triggered the integrity file conflict. In doing so, I found a bug, which held me up for a while.

A newer version of pnpm (>8) than the default for node 16 (6.11.0) is needed for workspaces support. I found that the version specified on the subsystem options was not being used. Although it was being installed in a separate sandbox, the output was not captured, and on the process invocation for pnpm install, corepack was installing the default version.

I've left a couple of comments on the fix for now - I'm just not sure if I've missed anything re. the original intent for the prepare_corepack_tool rule @tobni ?

@riisi riisi added the category:bugfix Bug fixes for released features label May 12, 2024
Copy link
Contributor

@tobni tobni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While the new bug you've found is real, I would argue keeping it to a separate issue and pull request is cleaner. I can look into this.

To make your integration test pass I believe removing src/js from the path will work as a temporary.

w.r.t to the pnpm version, I think maybe best we can do is bump the default? It is unfortunate, but I'm not keen on validating tool versions for features. I believe pants should aim to please instead of throwing errors that might not be relevant for a particular usecase. Users can use the nodejs subsystem and its package managers without writing javascript, after all.

Edit:
Honestly, I think just bumping to nodejs v18 should do the trick.

src/python/pants/backend/javascript/subsystems/nodejs.py Outdated Show resolved Hide resolved
src/python/pants/backend/javascript/subsystems/nodejs.py Outdated Show resolved Hide resolved
@riisi
Copy link
Contributor Author

riisi commented Jun 1, 2024

@tobni I've taken a stab at updating the node version and removing /src/js from the integration test files.
This does seem to resolve the bug above, as well as providing a pnpm version that supports workspaces.

However, I still have one more problem when the test is run without explicitly providing package manager versions (e.g. "--nodejs-package-managers={'npm': '7.20.1', 'pnpm': '8.15.5', 'yarn': '1.22.15'}",) - I get the error The local project doesn't feature a 'packageManager' field - please explicit the package manager to pack, or update the manifest to reference it.

In the file layout scenario used in the test, the workspace root shouldn't contain a package.json (for pnpm and yarn), and not necessarily a package manager.

To address this for now, I modified the logic in prepare_corepack_tool() to supply the package manager name, but seems undesirable as this would override the package.json spec if it was provided.

However, in my testing it seems that if a version is provided in package.json, then it is used: i.e.

@rule(desc="Preparing Corepack managed tool.")
async def prepare_corepack_tool(
    request: CorepackToolRequest, environment: NodeJSProcessEnvironment, nodejs: NodeJS
) -> CorepackToolDigest:
    version = request.version or nodejs.package_managers.get(request.tool) # <-- version here is that from package.json and fed into CLI args
    if not version and request.input_digest == EMPTY_DIGEST:
        raise ValueError(f"Could not determine tool version for {request.tool}.")
    tool_spec = f"{request.tool}@{version}" if version else request.tool
    tool_description = tool_spec if version else f"default {tool_spec} version"

so this doesn't actually matter.

This goes back to my earlier comment above on the same section of code:

            # TODO: Given below, is there ever any need to supply an input_digest (e.g. package.json files) here?

Thoughts on what to do here?

Aside from this, there are some failing tests, I think all related to package-lock files needing updating after the node updates.
I can fix - just want to avoid rework in case we need to backtrack on anything above.

@tobni
Copy link
Contributor

tobni commented Jun 1, 2024

Essentially corepack_home is missing a {chroot} token.

Did you try this? To clarify, the environment variable COREPACK_HOME is not correctly set when preparing the tool. This is why the working_directory setting is not working, and why removing 1 level of folders:

...removing /src/js from the integration test files.

works. {chroot} is a magic token that should be added to the env variable passed in the sandbox (Process constructor). {chroot} tokens become absolute paths to the sandbox root folder. The $COREPACK_HOME env variable is currently accidentally relative inside the sandbox, so when assigning a cwd other then what {chroot} turns into causes $COREPACK_HOME results to be not captured in the output digest.

However, I still have one more problem when the test is run without explicitly providing package manager versions (e.g. "--nodejs-package-managers={'npm': '7.20.1', 'pnpm': '8.15.5', 'yarn': '1.22.15'}",) - I get the error The local project doesn't feature a 'packageManager' field - please explicit the package manager to pack, or update the manifest to reference it.

Had a little look through and here's what I gather:

I was fairly certain that corepack vendored their own releases of package managers so that e.g corepack prepare pnpm --activate worked OOTB if no other version was provided. Testing now reveals I was mistaken, in order for that mode of operation to work the required invocation would be corepack prepare --all --activate. Unfortunately this invocation downloads all package managers, but I think that price tag is worth it.

I think that desired behaviour here is to "just work" if we can and allow configurability that allows larger repos to continue to function. I.e the rules should be

  1. If package.json#packageManager contains a package manager version, it should be used; otherwise
  2. If pants.toml#[nodejs].package_managers specifies a version for a package manager, this should be used; otherwise
  3. Lastly we should fall back on the version corepack provides via its "last known good releases".

I think what is actually missing is case 3, which would be different from the wip you've pushed to remote.

And I agree, as you've spotted, there is no reason to provide a digest because the relevant bit (if present) to achieve 1 is already parsed earlier, so there's no need for corepack to do it again! I suspect the addition of install_from_resolve support is what caused the digest to become unnecessary.

Lastly:
I think a nodejs version upgrade has to be a separate pull request from the rest of these changes. It needs to be very clear in release documentation that the default nodejs version is being bumped.

@tobni
Copy link
Contributor

tobni commented Jun 6, 2024

I'm going to address the corepack issues via #21020 to cut out the noise from the changes herein.

Edit:
#21021

@riisi
Copy link
Contributor Author

riisi commented Jun 6, 2024 via email

@riisi
Copy link
Contributor Author

riisi commented Jun 22, 2024

I removed the last WIP commit with the node upgrade and rebased after your fix was merged @tobni. It works fine now in a subdir.

The update to pnpm v8 is causing a coverage test to fail - see comment here. I think we should do the node upgrade first and fix that there.

@tobni
Copy link
Contributor

tobni commented Jul 2, 2024

@riisi Headsup that nodejs18 PR was just merged :)

@riisi
Copy link
Contributor Author

riisi commented Jul 5, 2024

@tobni Updated this now following @krishnan-chandra 's changes.

Copy link
Contributor

@tobni tobni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, thanks!

@tobni tobni merged commit df64a7c into pantsbuild:main Jul 8, 2024
25 checks passed
@riisi riisi deleted the javascript branch July 8, 2024 23:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backend: JavaScript JavaScript backend-related issues bug category:bugfix Bug fixes for released features
Projects
None yet
3 participants