
Empty preview step on tests is proposing changes that are not expected #199

Closed
metral opened this issue Jul 23, 2019 · 25 comments · Fixed by #200 or #239

@metral
Contributor

metral commented Jul 23, 2019

After the initial update of the replace-cluster-add-subnets test, the empty preview step shows changes even though it should not:

  • See the templateBody of the ASG being planned for an update.
  • See the empty preview step erroring as changes are being proposed when not expected.

e.g. here is a sample of the changes proposed in a run of another test, migrate-nodegroups, which also hits this templateBody change proposal:

[screenshot: proposed templateBody diff from a migrate-nodegroups run]

@metral
Contributor Author

metral commented Jul 24, 2019

Reopening, as we hit this issue again in a Travis CI run today. Output: https://s3-us-west-2.amazonaws.com/eng.pulumi.com/travis-logs/pulumi/pulumi-eks/834.1/failed-tests.tar.gz

@metral
Contributor Author

metral commented Aug 7, 2019

Another run of CI hit this same issue again in test replace-cluster-add-subnets, showing a proposed change for the templateBody of the CFStack, when the code was not touched: https://gist.github.com/metral/60aff5b74d7cafa980cee56403853151#file-log-txt-L1502.

Generally speaking, we're observing that changes to the templateBody are proposed for the NodeGroups of tests that were created in the initial update and then run through the empty preview step.

Note: The templateBody change in the OP screenshot shows the subnets used going from 4 subnets to 2, even though no changes were made. This is in us-west-2, which has 4 AZs.

@metral changed the title from "Test replace-cluster-add-subnets errors on empty preview as it does not expect changes" to "Empty preview step on tests is proposing changes that are not expected" on Aug 7, 2019
@metral
Contributor Author

metral commented Aug 7, 2019

Full output logs with -v=10: command-output.tar.gz.

Note the changes in VPCZoneIdentifier from 4 subnets to 2 in the planned Diff of command-output/pulumi-preview-empty.20190807-145341.daef9.log:

18331 provider_plugin.go:572] Provider[aws, 0xc0004cfaa0].Diff(urn:pulumi:p-it-argon-migrate-no-42244f66::migrate-nodegroups::eks:index:NodeGroup$aws:cloudformation/stack:Stack::migrate-nodegroups-ng-standard-nodes,arn:aws:cloudformation:us-west-2:153052954103:stack/migrate-nodegroups-ng-standard-df8d4dab/79b171c0-b95d-11e9-94a6-06737af08cb6) success: changes=2 #replaces=[] #stables=[onFailure name disableRollback timeoutInMinutes] delbefrepl=false, diffs=#[templateBody], detaileddiff=map[templateBody:kind:UPDATE ]
I0807 14:53:39.954061   18331 step_generator.go:461] Planner decided to update 'urn:pulumi:p-it-argon-migrate-no-42244f66::migrate-nodegroups::eks:index:NodeGroup$aws:cloudformation/stack:Stack::migrate-nodegroups-ng-standard-nodes' (oldprops=map[__defaults:{[]} name:{migrate-nodegroups-ng-standard-df8d4dab} tags:{map[Name:{migrate-nodegroups-ng-standard-nodes} __defaults:{[]}]} templateBody:{
        AWSTemplateFormatVersion: '2010-09-09'
        Outputs:
            NodeGroup:
                Value: !Ref NodeGroup
        Resources:
            NodeGroup:
                Type: AWS::AutoScaling::AutoScalingGroup
                Properties:
                  DesiredCapacity: 3
                  LaunchConfigurationName: migrate-nodegroups-ng-standard-nodeLaunchConfiguration-93e2d38
                  MinSize: 3
                  MaxSize: 10
                  VPCZoneIdentifier: ["subnet-0ee3176e3cf1f57ad","subnet-0600768622fca39b4","subnet-0eb02d2044415fff7","subnet-0479a9bdc91f0ddad"]
                  Tags:

                  - Key: Name
                    Value: migrate-nodegroups-eksCluster-f4a345b-worker
                    PropagateAtLaunch: 'true'
                  - Key: kubernetes.io/cluster/migrate-nodegroups-eksCluster-f4a345b
                    Value: owned
                    PropagateAtLaunch: 'true'
                UpdatePolicy:
                  AutoScalingRollingUpdate:
                    MinInstancesInService: '1'
                    MaxBatchSize: '1'
        }] inputs=map[__defaults:{[]} name:{migrate-nodegroups-ng-standard-df8d4dab} tags:{map[Name:{migrate-nodegroups-ng-standard-nodes} __defaults:{[]}]} templateBody:{
        AWSTemplateFormatVersion: '2010-09-09'
        Outputs:
            NodeGroup:
                Value: !Ref NodeGroup
        Resources:
            NodeGroup:
                Type: AWS::AutoScaling::AutoScalingGroup
                Properties:
                  DesiredCapacity: 3
                  LaunchConfigurationName: migrate-nodegroups-ng-standard-nodeLaunchConfiguration-93e2d38
                  MinSize: 3
                  MaxSize: 10
                  VPCZoneIdentifier: ["subnet-0ee3176e3cf1f57ad","subnet-0600768622fca39b4"]
                  Tags:

                  - Key: Name
                    Value: migrate-nodegroups-eksCluster-f4a345b-worker
                    PropagateAtLaunch: 'true'
                  - Key: kubernetes.io/cluster/migrate-nodegroups-eksCluster-f4a345b
                    Value: owned
                    PropagateAtLaunch: 'true'
                UpdatePolicy:
                  AutoScalingRollingUpdate:
                    MinInstancesInService: '1'
                    MaxBatchSize: '1'
        }]
I0807 14:53:39.954102   18331 plan_executor.go:142] planExecutor.Execute(...): incoming event (nil? false, <nil>)
I0807 14:53:39.954112   18331 plan_executor.go:224] planExecutor.handleSingleEvent(...): received RegisterResourceEvent
I0807 14:53:39.954131   18331 registry.go:146] GetProvider(urn:pulumi:p-it-argon-migrate-no-42244f66::migrate-nodegroups::pulumi:providers:aws::default_0_18_26::ae7273b7-36d6-4b22-813f-0592ae1547c9)
I0807 14:53:39.954148   18331 provider_plugin.go:424] Provider[aws, 0xc0004cfaa0].Check(urn:pulumi:p-it-argon-migrate-no-42244f66::migrate-nodegroups::eks:index:NodeGroup$aws:cloudformation/stack:Stack::migrate-nodegroups-ng-2xlarge-nodes) executing (#olds=4,#news=3
I0807 14:53:39.954157   18331 rpc.go:69] Marshaling property for RPC[Provider[aws, 0xc0004cfaa0].Check(urn:pulumi:p-it-argon-migrate-no-42244f66::migrate-nodegroups::eks:index:NodeGroup$aws:cloudformation/stack:Stack::migrate-nodegroups-ng-2xlarge-nodes).olds]: __defaults={[]}
I0807 14:53:39.954169   18331 rpc.go:69] Marshaling property for RPC[Provider[aws, 0xc0004cfaa0].Check(urn:pulumi:p-it-argon-migrate-no-42244f66::migrate-nodegroups::eks:index:NodeGroup$aws:cloudformation/stack:Stack::migrate-nodegroups-ng-2xlarge-nodes).olds]: name={migrate-nodegroups-ng-2xlarge-eca6a7bd}
I0807 14:53:39.954176   18331 rpc.go:69] Marshaling property for RPC[Provider[aws, 0xc0004cfaa0].Check(urn:pulumi:p-it-argon-migrate-no-42244f66::migrate-nodegroups::eks:index:NodeGroup$aws:cloudformation/stack:Stack::migrate-nodegroups-ng-2xlarge-nodes).olds]: tags={map[Name:{migrate-nodegroups-ng-2xlarge-nodes} __defaults:{[]}]}
I0807 14:53:39.954187   18331 rpc.go:69] Marshaling property for RPC[Provider[aws, 0xc0004cfaa0].Check(urn:pulumi:p-it-argon-migrate-no-42244f66::migrate-nodegroups::eks:index:NodeGroup$aws:cloudformation/stack:Stack::migrate-nodegroups-ng-2xlarge-nodes).olds]: Name={migrate-nodegroups-ng-2xlarge-nodes}

Subnets Definition & Use:

Note that awsx.ec2.Vpc.numberOfAvailabilityZones is not set, so it defaults to 2. We're in us-west-2, which has 4 AZs, and we create 1 private and 1 public subnet in each of the 2 AZs used, so we have 4 subnets: 2 private and 2 public.

The Diff is proposing dropping the last 2 elements in the array of 4 when it is rendered as a string for the CFTemplateBody.

// Allocate a new VPC with custom settings, and a public & private subnet per AZ.
const vpc = new awsx.ec2.Vpc(`${projectName}`, {
    cidrBlock: "172.16.0.0/16",
    subnets: [{ type: "public" }, { type: "private" }],
});

// Export VPC ID and Subnets.
export const vpcId = vpc.id;
export const allVpcSubnets = vpc.privateSubnetIds.concat(vpc.publicSubnetIds);

// Create 3 IAM Roles and matching InstanceProfiles to use with the nodegroups.
const roles = iam.createRoles(projectName, 3);
const instanceProfiles = iam.createInstanceProfiles(projectName, roles);

// Create an EKS cluster.
const myCluster = new eks.Cluster(`${projectName}`, {
    version: "1.13",
    vpcId: vpcId,
    subnetIds: allVpcSubnets,
    nodeAssociatePublicIpAddress: false,
    skipDefaultNodeGroup: true,
    deployDashboard: false,
    instanceRoles: roles,
    enabledClusterLogTypes: ["api", "audit", "authenticator",
        "controllerManager", "scheduler"],
});

export const kubeconfig = myCluster.kubeconfig;

Computing subnets & using them in the CFTemplateBody:

const workerSubnetIds = args.nodeSubnetIds
    ? pulumi.output(args.nodeSubnetIds)
    : pulumi.output(core.subnetIds).apply(ids => computeWorkerSubnets(parent, ids));

if (args.desiredCapacity === undefined) {
    args.desiredCapacity = 2;
}
if (args.minSize === undefined) {
    args.minSize = 1;
}
if (args.maxSize === undefined) {
    args.maxSize = 2;
}
let minInstancesInService = 1;
if (args.spotPrice) {
    minInstancesInService = 0;
}

const autoScalingGroupTags: InputTags = pulumi.all([
    eksCluster.name,
    args.autoScalingGroupTags,
]).apply(([clusterName, asgTags]) => (<aws.Tags>{
    "Name": `${clusterName}-worker`,
    [`kubernetes.io/cluster/${clusterName}`]: "owned",
    ...asgTags,
}));

const cfnTemplateBody = pulumi.all([
    nodeLaunchConfiguration.id,
    args.desiredCapacity,
    args.minSize,
    args.maxSize,
    tagsToAsgTags(autoScalingGroupTags),
    workerSubnetIds.apply(JSON.stringify),
]).apply(([launchConfig, desiredCapacity, minSize, maxSize, asgTags, vpcSubnetIds]) => `
        AWSTemplateFormatVersion: '2010-09-09'
        Outputs:
            NodeGroup:
                Value: !Ref NodeGroup
        Resources:
            NodeGroup:
                Type: AWS::AutoScaling::AutoScalingGroup
                Properties:
                  DesiredCapacity: ${desiredCapacity}
                  LaunchConfigurationName: ${launchConfig}
                  MinSize: ${minSize}
                  MaxSize: ${maxSize}
                  VPCZoneIdentifier: ${vpcSubnetIds}
                  Tags:
                  ${asgTags}
                UpdatePolicy:
                  AutoScalingRollingUpdate:
                    MinInstancesInService: '${minInstancesInService}'
                    MaxBatchSize: '1'
        `);

computeWorkerSubnets()

async function computeWorkerSubnets(parent: pulumi.Resource, subnetIds: string[]): Promise<string[]> {
    const publicSubnets: string[] = [];
    const privateSubnets: string[] = [];
    for (const subnetId of subnetIds) {
        // Fetch the route table for this subnet.
        const routeTable = await (async () => {
            try {
                // Attempt to get the explicit route table for this subnet. If there is no explicit route
                // table for this subnet, this call will throw.
                return await aws.ec2.getRouteTable({ subnetId: subnetId }, { parent: parent });
            } catch {
                // If we reach this point, the subnet may not have an explicitly associated route table. In this
                // case the subnet is associated with its VPC's main route table (see
                // https://docs.aws.amazon.com/vpc/latest/userguide/VPC_Route_Tables.html#RouteTables for details).
                const subnet = await aws.ec2.getSubnet({ id: subnetId }, { parent: parent });
                const mainRouteTableInfo = await aws.ec2.getRouteTables({
                    vpcId: subnet.vpcId,
                    filters: [{
                        name: "association.main",
                        values: ["true"],
                    }],
                }, { parent: parent });
                return await aws.ec2.getRouteTable({ routeTableId: mainRouteTableInfo.ids[0] }, { parent: parent });
            }
        })();

        // Once we have the route table, check its list of routes for a route to an internet gateway.
        const hasInternetGatewayRoute = routeTable.routes
            .find(r => !!r.gatewayId && !isPrivateCIDRBlock(r.cidrBlock)) !== undefined;
        if (hasInternetGatewayRoute) {
            publicSubnets.push(subnetId);
        } else {
            privateSubnets.push(subnetId);
        }
    }
    return privateSubnets.length === 0 ? publicSubnets : privateSubnets;
}
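
isPrivateCIDRBlock is referenced above but not shown. A minimal sketch of what such a check could look like, assuming it simply tests the route's destination against the RFC 1918 private ranges (an illustrative reconstruction, not the library's actual source):

// Illustrative reconstruction (assumption): treat a CIDR block as "private"
// if its base address falls within one of the RFC 1918 ranges.
const privateRanges = [
    { base: "10.0.0.0", maskBits: 8 },
    { base: "172.16.0.0", maskBits: 12 },
    { base: "192.168.0.0", maskBits: 16 },
];

// Convert a dotted-quad IPv4 address to an unsigned 32-bit integer.
function ipToUint32(ip: string): number {
    return ip.split(".").reduce((acc, octet) => ((acc << 8) + parseInt(octet, 10)) >>> 0, 0);
}

function isPrivateCIDRBlock(cidr: string): boolean {
    const ip = ipToUint32(cidr.split("/")[0]);
    return privateRanges.some(range => {
        const mask = (~0 << (32 - range.maskBits)) >>> 0;
        return ((ip & mask) >>> 0) === ((ipToUint32(range.base) & mask) >>> 0);
    });
}

Under this assumption, a default route (0.0.0.0/0) to an internet gateway is not private, so the hasInternetGatewayRoute check above classifies such a subnet as public.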

/cc @pgavlin @lukehoban thoughts?

@lukehoban self-assigned this Aug 8, 2019
@lukehoban added this to the 0.26 milestone Aug 8, 2019
@lukehoban
Member

@metral Have you been able to reproduce this locally outside of the test harness by following similar steps? This would be easier to investigate with a repro.

I tried to take the test case in question, replace-cluster-add-subnets, deploy it, and then run an empty preview. I did this 5 times and never saw any changes proposed.

I actually can't tell from the notes above whether these are the same steps that you believe are triggering this in CI, or whether a more involved set of steps is leading to this.

I'm also generally confused by many of the notes above. The test explicitly creates a VPC with only 3 public subnets and then deploys the cluster using 2 of the 3. I have no idea where the snippet you show above is ever coming up with 4 subnets - that doesn't even seem physically possible using the example in question here.

@lukehoban
Member

Note that awsx.ec2.Vpc.numberOfAvailabilityZones is not set, so it defaults to 2. We're in us-west-2, which has 4 AZs, and we create 1 private and 1 public subnet in each of the 2 AZs used, so we have 4 subnets: 2 private and 2 public.

I do not follow this - especially since (a) this example uses awsx.Network which creates only public subnets by default and (b) it explicitly requests only 3 AZs.

@lukehoban
Member

Oh - I guess I was confused as the examples in this issue shift between two different test cases. I think the references to 4 subnets are all related to migrate-nodegroups?

@metral
Contributor Author

metral commented Aug 9, 2019

I'm also generally confused by many of the notes above. The test explicitly creates a VPC with only 3 public subnets and then deploys the cluster using 2 of the 3. I have no idea where the snippet you show above is ever coming up with 4 subnets - that doesn't even seem physically possible using the example in question here.

Sorry for the lack of clarity. The 4 subnets were in reference to migrate-nodegroups, not replace-cluster-add-subnets, as it creates 1 private and 1 public subnet in each of the 2 AZs, for 4 total.

I do not follow this - especially since (a) this example uses awsx.Network which creates only public subnets by default and (b) it explicitly requests only 3 AZs

This was also in reference to migrate-nodegroups.

The repro is just running both replace-cluster-add-subnets and migrate-nodegroups locally.

I'm running a series of tests on the most recent v0.18.10 master to see if this still occurs, since I have not seen the error today, and will update with more clarity if necessary.

@lukehoban
Member

Got it.

I was unable to repro against replace-cluster-add-subnets. I'll take a look at migrate-nodegroups as well.

@metral
Contributor Author

metral commented Aug 9, 2019

On 10 runs of both tests locally, replace-cluster-add-subnets failed twice with this error. Interestingly enough, this time one of the failed runs went from 1 subnet to 2 on the empty preview step.

[screenshot: empty preview diff showing VPCZoneIdentifier going from 1 subnet to 2]

@lukehoban
Member

Do you have copies of the checkpoints before and after this error?

What exact set of stack operations is leading to this when it happens?

went from 1 subnet to 2

It should never have been possible to end up deploying only 1 subnet with this test. The key question is really how you managed to get a deployment with only 1 subnet in the first place.

@metral
Contributor Author

metral commented Aug 9, 2019

Do you have copies of the checkpoints before and after this error?

No, since I was running make test_all with only those 2 tests enabled, and the tests don't save state.

It should never have been possible to end up deploying only 1 subnet with this test.

Agreed, this is odd. I'll look into running more tests with checkpoints being saved in between each step.

@metral
Contributor Author

metral commented Aug 21, 2019

@lukehoban here is the command output for the replace-cluster-add-subnets test hitting this templateBody diff UPDATE issue with the subnets on the empty preview step. I ran this test only locally, with the other tests and examples commented out, across 5 runs, and 1 managed to reproduce it.

envvars set:

export AWS_REGION=us-west-2
export TF_LOG=TRACE
export PULUMI_TEST_DEBUG_LOG_LEVEL=9
export PULUMI_TEST_DEBUG_UPDATES=true

replace-cluster-add-subnet-command-output.tar.gz

@metral
Contributor Author

metral commented Aug 22, 2019

Another set of output from a failed CI job yesterday for the replace-cluster-add-subnets test:

https://s3-us-west-2.amazonaws.com/eng.pulumi.com/travis-logs/pulumi/pulumi-eks/964.1/failed-tests.tar.gz

This time, unlike the usual failed runs of this test, the diff goes from using 1 subnet to 2 subnets, whereas other failed runs usually propose fewer subnets:

Resources:
    NodeGroup:
        Type: AWS::AutoScaling::AutoScalingGroup
        Properties:
          DesiredCapacity: 2
          LaunchConfigurationName: test-replace-cluster-add-subnets-nodeLaunchConfiguration-da70d18
          MinSize: 1
          MaxSize: 2
          VPCZoneIdentifier: ["subnet-0161c3fc1520e5f57"]
          Tags:

          - Key: Name
            Value: test-replace-cluster-add-subnets-eksCluster-52a59d5-worker
            PropagateAtLaunch: 'true'
          - Key: kubernetes.io/cluster/test-replace-cluster-add-subnets-eksCluster-52a59d5
            Value: owned
            PropagateAtLaunch: 'true'
        UpdatePolicy:
          AutoScalingRollingUpdate:
            MinInstancesInService: '1'
            MaxBatchSize: '1'
}] inputs=map[__defaults:{[]} name:{test-replace-cluster-add-subnets-defe117e} tags:{map[Name:{test-replace-cluster-add-subnets-nodes} __defaults:{[]}]} templateBody:{
AWSTemplateFormatVersion: '2010-09-09'
Outputs:
    NodeGroup:
        Value: !Ref NodeGroup
Resources:
    NodeGroup:
        Type: AWS::AutoScaling::AutoScalingGroup
        Properties:
          DesiredCapacity: 2
          LaunchConfigurationName: test-replace-cluster-add-subnets-nodeLaunchConfiguration-da70d18
          MinSize: 1
          MaxSize: 2
          VPCZoneIdentifier: ["subnet-0161c3fc1520e5f57","subnet-03a3c1bb5eee25f16"]
          Tags:

envvars set in CI:

export AWS_REGION=us-west-2
export PULUMI_TEST_DEBUG_LOG_LEVEL=9
export PULUMI_TEST_DEBUG_UPDATES=true

@lukehoban assigned pgavlin and unassigned lukehoban Aug 23, 2019
@pgavlin
Member

pgavlin commented Aug 23, 2019

I believe I know what's happening here. FWIW, this repro'd locally when I ran the Pulumi program manually (i.e. outside of the test harness).

I applied the following patch to determine which subnet IDs were being passed to the cluster's constructor and the subnet groupings determined by computeWorkerSubnets:

diff --git a/nodejs/eks/examples/tests/replace-cluster-add-subnets/index.ts b/nodejs/eks/examples/tests/replace-cluster-add-subnets/index.ts
index 2feb420..da7d293 100644
--- a/nodejs/eks/examples/tests/replace-cluster-add-subnets/index.ts
+++ b/nodejs/eks/examples/tests/replace-cluster-add-subnets/index.ts
@@ -13,6 +13,8 @@ const publicSubnetIds = vpc.publicSubnetIds.sort();
 // Remove this line after the init update to repro: https://git.io/fj8cn
 publicSubnetIds.pop();

+pulumi.all(publicSubnetIds).apply(ids => console.log(`public subnets IDs: ${JSON.stringify(ids)}`));
+
 const cluster = new eks.Cluster(projectName, {
     vpcId: vpc.vpcId,
     subnetIds: publicSubnetIds,
diff --git a/nodejs/eks/nodegroup.ts b/nodejs/eks/nodegroup.ts
index aa26340..e683383 100644
--- a/nodejs/eks/nodegroup.ts
+++ b/nodejs/eks/nodegroup.ts
@@ -551,6 +551,8 @@ async function computeWorkerSubnets(parent: pulumi.Resource, subnetIds: string[]
                 // this subnet, this call will throw.
                 return await aws.ec2.getRouteTable({ subnetId: subnetId }, { parent: parent });
             } catch {
+                console.log("looking for main route table");
+
                 // If we reach this point, the subnet may not have an explicitly associated route table. In this case
                 // the subnet is associated with its VPC's main route table (see
                 // https://docs.aws.amazon.com/vpc/latest/userguide/VPC_Route_Tables.html#RouteTables for details).
@@ -575,6 +577,7 @@ async function computeWorkerSubnets(parent: pulumi.Resource, subnetIds: string[]
             privateSubnets.push(subnetId);
         }
     }
+    console.log(`subnets: ${JSON.stringify({publicSubnets, privateSubnets})}`);
     return privateSubnets.length === 0 ? publicSubnets : privateSubnets;
 }

Running the example produced the following output:

Updating (repro):

     Type                                   Name                                                           Status                  Info
 +   pulumi:pulumi:Stack                    test-replace-cluster-add-subnets-repro                         **creating failed**     1 error; 4 messages
 +   ├─ awsx:network:Network                test-replace-cluster-add-subnets-net                           created
 +   │  ├─ aws:ec2:Vpc                      test-replace-cluster-add-subnets-net                           created
 +   │  ├─ aws:ec2:Subnet                   test-replace-cluster-add-subnets-net-0                         created
 +   │  ├─ aws:ec2:InternetGateway          test-replace-cluster-add-subnets-net                           created
 +   │  ├─ aws:ec2:Subnet                   test-replace-cluster-add-subnets-net-2                         created
 +   │  ├─ aws:ec2:Subnet                   test-replace-cluster-add-subnets-net-1                         created
 +   │  ├─ aws:ec2:RouteTable               test-replace-cluster-add-subnets-net                           created
 +   │  ├─ aws:ec2:RouteTableAssociation    test-replace-cluster-add-subnets-net-1                         created
 +   │  ├─ aws:ec2:RouteTableAssociation    test-replace-cluster-add-subnets-net-0                         created
 +   │  └─ aws:ec2:RouteTableAssociation    test-replace-cluster-add-subnets-net-2                         created
 +   └─ eks:index:Cluster                   test-replace-cluster-add-subnets                               created
 +      ├─ eks:index:ServiceRole            test-replace-cluster-add-subnets-instanceRole                  created
 +      │  ├─ aws:iam:Role                  test-replace-cluster-add-subnets-instanceRole-role             created
 +      │  ├─ aws:iam:RolePolicyAttachment  test-replace-cluster-add-subnets-instanceRole-e1b295bd         created
 +      │  ├─ aws:iam:RolePolicyAttachment  test-replace-cluster-add-subnets-instanceRole-03516f97         created
 +      │  └─ aws:iam:RolePolicyAttachment  test-replace-cluster-add-subnets-instanceRole-3eb088f2         created
 +      ├─ eks:index:ServiceRole            test-replace-cluster-add-subnets-eksRole                       created
 +      │  ├─ aws:iam:Role                  test-replace-cluster-add-subnets-eksRole-role                  created
 +      │  ├─ aws:iam:RolePolicyAttachment  test-replace-cluster-add-subnets-eksRole-90eb1c99              created
 +      │  └─ aws:iam:RolePolicyAttachment  test-replace-cluster-add-subnets-eksRole-4b490823              created
 +      ├─ aws:ec2:SecurityGroup            test-replace-cluster-add-subnets-eksClusterSecurityGroup       created
 +      ├─ pulumi-nodejs:dynamic:Resource   test-replace-cluster-add-subnets-cfnStackName                  created
 +      ├─ aws:ec2:SecurityGroupRule        test-replace-cluster-add-subnets-eksClusterInternetEgressRule  created
 +      ├─ aws:iam:InstanceProfile          test-replace-cluster-add-subnets-instanceProfile               created
 +      └─ aws:eks:Cluster                  test-replace-cluster-add-subnets-eksCluster                    created

System Messages
  ^C received; cancelling. If you would like to terminate immediately, press ^C again.
  Note that terminating immediately may lead to orphaned resources and other inconsistencies.


Diagnostics:
  pulumi:pulumi:Stack (test-replace-cluster-add-subnets-repro):
    public subnets IDs: ["subnet-04dd72b952839976c","subnet-09f45234dc4ff170e"]
    looking for main route table
    looking for main route table
    subnets: {"publicSubnets":[],"privateSubnets":["subnet-04dd72b952839976c","subnet-09f45234dc4ff170e"]}

    error: update canceled

Resources:
    + 26 created

Duration: 11m10s

Note that computeWorkerSubnets has decided that the two public subnets are in fact private. It arrived at that result because it could not find a route table that had been explicitly associated with either subnet. I believe that this is because the appropriate RouteTableAssociations had not yet been created when computeWorkerSubnets ran. Further evidence is provided by a second preview, which indicates the expected set of public subnets:

hela:eks199 pgavlin$ pulumi up
Previewing update (repro):

     Type                                  Name                                                                Plan       Info
     pulumi:pulumi:Stack                   test-replace-cluster-add-subnets-repro                                         2 messages
     └─ eks:index:Cluster                  test-replace-cluster-add-subnets
 +      ├─ pulumi:providers:kubernetes     test-replace-cluster-add-subnets-eks-k8s                            create
 +      ├─ pulumi-nodejs:dynamic:Resource  test-replace-cluster-add-subnets-vpc-cni                            create
 +      ├─ aws:ec2:SecurityGroup           test-replace-cluster-add-subnets-nodeSecurityGroup                  create
 +      ├─ kubernetes:core:ConfigMap       test-replace-cluster-add-subnets-nodeAccess                         create
 +      ├─ aws:ec2:SecurityGroupRule       test-replace-cluster-add-subnets-eksNodeIngressRule                 create
 +      ├─ aws:ec2:SecurityGroupRule       test-replace-cluster-add-subnets-eksClusterIngressRule              create
 +      ├─ aws:ec2:SecurityGroupRule       test-replace-cluster-add-subnets-eksNodeInternetEgressRule          create
 +      ├─ aws:ec2:SecurityGroupRule       test-replace-cluster-add-subnets-eksNodeClusterIngressRule          create
 +      ├─ aws:ec2:SecurityGroupRule       test-replace-cluster-add-subnets-eksExtApiServerClusterIngressRule  create
 +      ├─ aws:ec2:LaunchConfiguration     test-replace-cluster-add-subnets-nodeLaunchConfiguration            create
 +      ├─ aws:cloudformation:Stack        test-replace-cluster-add-subnets-nodes                              create
 +      └─ pulumi:providers:kubernetes     test-replace-cluster-add-subnets-provider                           create

Diagnostics:
  pulumi:pulumi:Stack (test-replace-cluster-add-subnets-repro):
    public subnets IDs: ["subnet-04dd72b952839976c","subnet-09f45234dc4ff170e"]
    subnets: {"publicSubnets":["subnet-04dd72b952839976c","subnet-09f45234dc4ff170e"],"privateSubnets":[]}

Resources:
    + 12 to create
    26 unchanged

The worst-case scenario for this race is when some but not all RouteTableAssociations have been created: in this case, computeWorkerSubnets will incorrectly partition public subnets into the list of private subnets. Worker nodes would then be attached to public subnets, even if private subnets were also provided. IIRC, the nodes attached to public subnets would not be able to communicate with the cluster. This case will also cause a subsequent empty preview step to error as changes will be proposed when not expected.

In order to solve this issue, we must either
a) add retry logic to computeWorkerSubnets, or
b) ensure that the RouteTableAssociations are created prior to invoking computeWorkerSubnets

Of these, I think that (b) is a more reliable approach, though it does put more of the onus on the user. In the case of this example, we can either have awsx.Network fold a dependency on the RTAs into the list of subnet IDs, or we can simply have the EKS cluster depend on the VPC instance. For now, I suggest we take the latter approach for the sake of simplicity.
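
For concreteness, a rough sketch of what option (a) could look like, wrapping the explicit route-table lookup from computeWorkerSubnets in a retry. The helper name, attempt count, and delay are illustrative assumptions, not part of the library:

import * as aws from "@pulumi/aws";
import * as pulumi from "@pulumi/pulumi";

// Sketch of option (a) (an assumption, not the shipped fix): retry the
// explicit route-table lookup a few times before giving up, so a freshly
// created RouteTableAssociation has time to become visible.
async function getRouteTableWithRetry(
    subnetId: string,
    parent: pulumi.Resource,
    attempts: number = 5,
    delayMs: number = 2000,
) {
    let lastError: any;
    for (let i = 0; i < attempts; i++) {
        try {
            return await aws.ec2.getRouteTable({ subnetId: subnetId }, { parent: parent });
        } catch (err) {
            // The association may simply not be visible yet: wait and retry.
            lastError = err;
            await new Promise(resolve => setTimeout(resolve, delayMs));
        }
    }
    throw lastError;
}

Note that a retry cannot distinguish an association that is not yet visible from a subnet that legitimately has no explicit route table (the main-route-table fallback above), so it narrows the race rather than eliminating it, which is one more reason (b) is the more reliable approach.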

@pgavlin
Member

pgavlin commented Aug 23, 2019

we can simply have the EKS cluster depend on the VPC instance. For now, I suggest we take the latter approach for the sake of simplicity.

I tried this locally, and it doesn't work: simply adding the VPC as a dependency cannot block the Cluster constructor, and even though the data source calls are parented, they do not wait for their parent to resolve (since they only need access to its provider bag, which is available promptly). Instead, I think we can fix this by folding a dependency on the VPC's URN into the list of public subnet IDs:

diff --git a/nodejs/eks/examples/tests/replace-cluster-add-subnets/index.ts b/nodejs/eks/examples/tests/replace-cluster-add-subnets/index.ts
index 2feb420..3bbfeef 100644
--- a/nodejs/eks/examples/tests/replace-cluster-add-subnets/index.ts
+++ b/nodejs/eks/examples/tests/replace-cluster-add-subnets/index.ts
@@ -13,6 +13,8 @@ const publicSubnetIds = vpc.publicSubnetIds.sort();
 // Remove this line after the init update to repro: https://git.io/fj8cn
 publicSubnetIds.pop();

+publicSubnetIds = publicSubnetIds.map(id => pulumi.all([vpc.urn, id]));
+
 const cluster = new eks.Cluster(projectName, {
     vpcId: vpc.vpcId,
     subnetIds: publicSubnetIds,

@lukehoban
Member

cc @CyrusNajmabadi as an FYI as well - as this is an interesting case of something we may ultimately need to fix in awsx.ec2.Vpc (or more generally in our component model).

@pgavlin
Member

pgavlin commented Aug 23, 2019

The alternate approach I suggested above did not work either, presumably because vpc.urn does not depend upon the VPC's children. I do think that we can just use vpc.subnetIds, though, which mixes in the proper dependencies.
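
Applied to the test program, that change would look roughly like the following sketch, which assumes (per the comment above) that the Network's subnetIds values fold in the RouteTableAssociation dependencies; the test's pop() of one subnet is omitted for brevity:

// Sketch: pass the Network's subnetIds, whose values carry dependencies on
// the RouteTableAssociations, instead of the bare publicSubnetIds.
const cluster = new eks.Cluster(projectName, {
    vpcId: vpc.vpcId,
    subnetIds: vpc.subnetIds,
});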

@pgavlin
Member

pgavlin commented Aug 23, 2019

We could also migrate that test to use awsx.ec2.Vpc instead of awsx.Network, which does look like it has the proper dependencies in its list of public subnets.

@metral
Contributor Author

metral commented Aug 23, 2019

We could also migrate that test to use awsx.ec2.Vpc instead of awsx.Network, which does look like it has the proper dependencies in its list of public subnets.

I plan on moving it to awsx.ec2.Vpc along with any other trailing tests using awsx.Network, but this will not change much, as I've seen the same error for the migrate-nodegroups test, which uses awsx.ec2.Vpc.

See #199 (comment)

and its definition here:

// Allocate a new VPC with custom settings, and a public & private subnet per AZ.
const vpc = new awsx.ec2.Vpc(`${projectName}`, {
    cidrBlock: "172.16.0.0/16",
    subnets: [{ type: "public" }, { type: "private" }],
});

// Export VPC ID and Subnets.
export const vpcId = vpc.id;
export const allVpcSubnets = vpc.privateSubnetIds.concat(vpc.publicSubnetIds);

pgavlin added a commit that referenced this issue Aug 23, 2019
`publicSubnetIds` does not properly incorporate a dependency upon each
subnet's `RouteTableAssociation`, which causes a race with the code in
`computeWorkerSubnets` that reads the subnet's route table in order to
determine the sets of public and private subnets.

Part of #199.
@pgavlin
Member

pgavlin commented Aug 23, 2019

I can't speak to that test, as I haven't yet attempted a repro.

My read of the awsx code would seem to indicate that the proper dependencies are there, and I have confirmed that using either of the minor changes listed in #199 (comment) or #199 (comment) fixes replace-cluster-add-subnets locally.

@pgavlin
Member

pgavlin commented Aug 23, 2019

It is possible that AWS may have an eventual consistency thing happening here on top of the dependency issues mentioned above.

@lukehoban
Member

Per hashicorp/terraform#5335 (comment), it sounds like this class of eventual consistency can happen in AWS.

Looking at the debug logs, the initial "creation" returns a success for the aws_route however the ec2/DescribeRouteTables does not actually return the rule immediately.

@lukehoban
Member

I believe the right answer here is going to be to add the option to pass publicSubnetIds+privateSubnetIds instead of subnetIds. And when doing so, avoid using the magic lookup functions. We'll most likely want to eventually deprecate subnetIds, but I don't think we have to do that right away. This effectively amounts to re-opening and fixing #179.

We could also consider adding sleeps to the lookups we do in computeWorkerSubnets to mitigate the issue even for folks who continue relying on this.

Either way - these are changes we should make in the EKS library - strictly speaking, there does not appear to be a bug in any lower layers here (except for the overall eventual consistency in AWS itself).
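
A hypothetical shape for that option, reusing the migrate-nodegroups program from earlier in this thread (the publicSubnetIds/privateSubnetIds argument names are assumptions about the proposal, not the final API):

// Hypothetical usage (assumed argument names): the caller supplies the
// public/private partition directly, so the route-table lookups in
// computeWorkerSubnets, and the race they expose, are skipped entirely.
const myCluster = new eks.Cluster(`${projectName}`, {
    vpcId: vpcId,
    publicSubnetIds: vpc.publicSubnetIds,    // assumed new argument
    privateSubnetIds: vpc.privateSubnetIds,  // assumed new argument
    skipDefaultNodeGroup: true,
});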

@lukehoban assigned metral and unassigned pgavlin Aug 25, 2019
@pgavlin
Member

pgavlin commented Aug 26, 2019

there does not appear to strictly speaking be a bug in any lower layers here

There is a bug in the deprecated awsx.Network type: the IDs in the publicSubnetIds property do not incorporate the RouteTableAssociations as dependencies. Though fixing it may not reliably fix these failures, we may want to do so for the sake of consistency with the other places these IDs are exposed (e.g. the subnetIds property of awsx.Network).

@metral
Contributor Author

metral commented Sep 4, 2019

There is a bug in the deprecated awsx.Network type: the IDs in the publicSubnetIds property do not incorporate the RouteTableAssociations as dependencies.

@pgavlin Do you want to open an issue in pulumi/aws to track this?
