Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[sdk/nodejs] Fix construct to wait for RPC operations to complete #6452

Merged
merged 6 commits into from
Apr 5, 2021

Conversation

justinvp
Copy link
Member

@justinvp justinvp commented Mar 2, 2021

This change fixes the provider implementation of Construct for multi-lang components written in Node.js to wait for any in-flight RPCs to finish before returning the results, s.t. all registered child resources are created.

Tests fail before the SDK change, and pass after.

Fixes #6365

// We want to ensure the component's slow child resource will actually be created when the
// component is created inside `construct`.

const CREATION_DELAY = 15 * 1000; // 15 second delay in milliseconds
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm generally not a huge fan of introducing delays like this in tests. I was thinking about having the provider spin until it received some signal, but I couldn't think of a good place where to send the signal in the tests. Open to other ideas.

15 seconds seemed to be the least amount of time to reliably reproduce to issue, at least on my machine. Any less and sometimes the tests would pass without the SDK change.

Copy link
Contributor

@EvanBoyle EvanBoyle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One question so far. I'll build this change locally and report back my findings as well.

@@ -308,6 +308,10 @@ class Server implements grpc.UntypedServiceImplementation {
};

const result = await this.provider.construct(name, type, inputs, opts);

// Wait for RPC operations to complete and disconnect.
await runtime.disconnect();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the mapping of provider instance to construct request? One-to-one or one-to-many?

disconnecting the runtime causes the rpc clients for the engine and resmon to disconnect, but I believe those connections can be re-established the next time a mon/engine client is accessed in the node runtime.

Copy link
Member Author

@justinvp justinvp Mar 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the mapping of provider instance to construct request? One-to-one or one-to-many?

Pretty sure one-to-many. The provider process will run and can receive multiple construct calls. We reset the values on each invocation of construct:

public async construct(call: any, callback: any): Promise<void> {
try {
const req: any = call.request;
const type = req.getType();
const name = req.getName();
if (!this.provider.construct) {
callback(new Error(`unknown resource type ${type}`), undefined);
return;
}
// Configure the runtime.
//
// NOTE: these are globals! We should ensure that all settings are identical between calls, and eventually
// refactor so we can avoid the global state.
runtime.resetOptions(req.getProject(), req.getStack(), req.getParallel(), this.engineAddr,
req.getMonitorendpoint(), req.getDryrun());
const pulumiConfig: {[key: string]: string} = {};
const rpcConfig = req.getConfigMap();
if (rpcConfig) {
for (const [k, v] of rpcConfig.entries()) {
pulumiConfig[k] = v;
}
}
runtime.setAllConfig(pulumiConfig);

We should be able to disconnect from the monitor and engine as the next call to construct will re-establish connections to these.

Though, this is changing globals... Seems like we'll run into problems if there are ever concurrent calls to construct, though not sure offhand if that's even possible given the event loop. @pgavlin, do you know off hand if concurrent calls are possible? Should we be serializing calls to construct in the Node.js implementation so that only one call to construct runs at a time?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like we'll run into problems if there are ever concurrent calls to construct, though not sure offhand if that's even possible given the event loop. @pgavlin, do you know off hand if concurrent calls are possible?

Concurrent calls are absolutely possible, yes. Serializing construct calls is a good short-term workaround, but I'm not sure that it's sustainable long-term due to the limits it places on parallelism. We might want to scope the engine connection somehow (or at least the in-flight RPCs). IIRC @lukehoban suggested investigating node's createContext API to see if we could run each invocation of Construct in its own context.

I think that we will also need to await inside of the provider's implementation of Cancel.

@EvanBoyle
Copy link
Contributor

I did a quick check, and this doesn't solve all of my runtime issues. There is one issue in particular where when using nested multi-lang components, one output resolves as undefined on the first update but resolves fine on the second update. This change doesn't seem to fix that.

@justinvp
Copy link
Member Author

justinvp commented Mar 3, 2021

There is one issue in particular where when using nested multi-lang components, one output resolves as undefined on the first update but resolves fine on the second update. This change doesn't seem to fix that.

@EvanBoyle, Do you have a small repro? Or point me to the code where you're seeing this?

@EvanBoyle
Copy link
Contributor

Ok, good news is the issue where the program was not waiting for the ECS service/task to get created is no longer an issue. I was able to remove the service and task from the state of the component and it is still working as expected.

I don't have a simple repro for the output issue. But my repro is here:

https://github.com/pulumi/astro/blob/main/provider/nodejs/astro/components/fargateContainerApp.ts#L120

If you change that line to instead be const endpoint = app.endpoint, then that output seems to get dropped from the dependency graph. Other components begin so see astoOutputs resolve as undefined at that point. I'm wondering if there is some issue about serialized outputs from resource references. In that example, FargateApp is a component resource, and LocalApp is a component resource. Trying to use LocalApp.endpoint as part of FargateApp.astroOutputs seems to cause astroOutputs to resolve undefined.

@justinvp
Copy link
Member Author

justinvp commented Mar 8, 2021

Let's tackle the nested issue separately (tracked in #6477).

Copy link
Member

@pgavlin pgavlin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The serialization looks good. My only concern is that this will deadlock on nested calls to construct. We can fix that when we fix the more general concurrency issues.

This change fixes the provider implementation of `Construct` for multi-lang components written in Node.js to wait for any in-flight RPCs to finish before returning the results, s.t. all registered child resources are created.
Wait for RPC operations after serializing the state as serializing some values requires the monitor to be connected (e.g. to determine if the monitor supports resource references).
Serialize invocations of `construct` so that each call runs one after another, avoiding concurrent runs, since we're `construct` modifies global state.
@justinvp justinvp merged commit 394f79f into master Apr 5, 2021
@pulumi-bot pulumi-bot deleted the justin/constructwait branch April 5, 2021 18:11
flostadler added a commit that referenced this pull request Jun 20, 2024
Due to global state issues in the nodejs runtime, we serialized MLC construct requests in #6452.
This had the downside that cloud resources were not created in parallel if multiple instances
of the same component were instantiated in the same program.

This change removes the serialization of Construct and Call calls.
The global state issues were already fixed in #10568
github-merge-queue bot pushed a commit that referenced this pull request Jun 25, 2024
<!--- 
Thanks so much for your contribution! If this is your first time
contributing, please ensure that you have read the
[CONTRIBUTING](https://github.com/pulumi/pulumi/blob/master/CONTRIBUTING.md)
documentation.
-->

# Description

<!--- Please include a summary of the change and which issue is fixed.
Please also include relevant motivation and context. -->
Due to global state issues in the nodejs runtime, we serialized MLC
construct requests in #6452.
This had the downside that cloud resources were not created in parallel
if multiple instances of the same component were instantiated in the
same program.

This change removes the serialization of Construct and Call calls and
moves the uncaught error handling into a generic handler that keeps
track of all inflight Call/Construct callbacks and responds to them in
case of a uncaught error in user code.
The global state issues were already fixed in #10568

Fixes #7629 

## Checklist

- [x] I have run `make tidy` to update any new dependencies
- [x] I have run `make lint` to verify my code passes the lint check
  - [x] I have formatted my code using `gofumpt`

<!--- Please provide details if the checkbox below is to be left
unchecked. -->
- [ ] I have added tests that prove my fix is effective or that my
feature works
<!--- 
User-facing changes require a CHANGELOG entry.
-->
- [x] I have run `make changelog` and committed the
`changelog/pending/<file>` documenting my change
<!--
If the change(s) in this PR is a modification of an existing call to the
Pulumi Cloud,
then the service should honor older versions of the CLI where this
change would not exist.
You must then bump the API version in
/pkg/backend/httpstate/client/api.go, as well as add
it to the service.
-->
- [ ] Yes, there are changes in this PR that warrants bumping the Pulumi
Cloud API version
<!-- @pulumi employees: If yes, you must submit corresponding changes in
the service repo. -->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[sdk/nodejs] construct doesn't wait for all child resources to be created
3 participants