Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

An Update can cause a Delete #362

Closed
lukehoban opened this issue Apr 13, 2019 · 3 comments
Closed

An Update can cause a Delete #362

lukehoban opened this issue Apr 13, 2019 · 3 comments
Assignees
Labels
p1 A bug severe enough to be the next item assigned to an engineer
Milestone

Comments

@lukehoban
Copy link
Member

We have seen a case where calling Update on a tfbridge-based Pulumi provider can lead to invoking a Delete on the underlying Terraform provider.

The repro (below) is unfortunately complex, but the result is that we see the following update (note this is not a replace or delete):

Updating (dev):

     Type                      Name        Status                  Info
     pulumi:pulumi:Stack       luke-dev  **failed**              1 error; 46 debugs
 ~   └─ gcp:container:Cluster  kepler      **updating failed**     [diff: ~ipAllocationPolicy,nodePools]; 1 error; 1 warning

But the following logs in the provider:

    debug: Deleting GKE cluster kepler
    debug: Locking "google-container-cluster/pulumi-development/us-west2/kepler"
    debug: Locked "google-container-cluster/pulumi-development/us-west2/kepler"
    debug: Waiting for state to become: [success]
    debug: Google API Request Details:
    debug: ---[ REQUEST ]---------------------------------------
    debug: DELETE /v1beta1/projects/pulumi-development/locations/us-west2/clusters/kepler?alt=json&prettyPrint=false HTTP/1.1
    debug: Host: container.googleapis.com
    debug: User-Agent: google-api-go-client/0.5 Terraform/0.11.9 (+https://www.terraform.io) terraform-provider-google-beta/dev
    debug: Accept-Encoding: gzip
    debug:
    debug:
    debug: -----------------------------------------------------
    debug: Google API Response Details:

And indeed - from our Update logic in pulumi-terraform there is a path to this code which calls the Delete on the underlying provider.

https://github.com/pulumi/pulumi-terraform/blob/788550dffb09eff703d3ee8fb27d4d163cbb7211/vendor/github.com/hashicorp/terraform/helper/schema/resource.go#L258-L264

It appears that somehow, we can construct a Diff object for an update which has d.Destroy || d.RequiresNew().


Repro:

import * as pulumi from '@pulumi/pulumi';
import * as gcp from '@pulumi/gcp';

const project = "pulumi-development";

const network = createK8sNetworkPrivate(project, 'kepler');
const cluster = createK8sPrivateClusterAutoSubnets(
  'kepler',
  '1.11.7-gke.12',
  project,
  'us-west2',
  network,
  'kepler-subnet'
);
createK8sNodePool(
  'kepler',
  project,
  'us-west2',
  cluster.name,
  '1.11.7-gke.12',
  3,
  'n1-standard-2',
  ['kepler-target-pool'],
  3,
  10,
  [
    'https://www.googleapis.com/auth/compute',
    'https://www.googleapis.com/auth/devstorage.read_only',
    'https://www.googleapis.com/auth/logging.write',
    'https://www.googleapis.com/auth/monitoring',
    'https://www.googleapis.com/auth/trace.append'
  ]
);
// Create private cluster subnet and export it
export function createK8sNetworkPrivate(project: string, network: string) {
  const k8sNetworkPrivate = new gcp.compute.Network(`${network}-private`, {
    project,
    name: `${network}-private`,
    autoCreateSubnetworks: false
  });
  return k8sNetworkPrivate;
}

export function createK8sPrivateClusterAutoSubnets(
  name: string,
  minMasterVersion: string,
  project: string,
  region: string,
  network: gcp.compute.Network,
  subnetName: string | pulumi.Output<string>
) {
  const k8sCluster = new gcp.container.Cluster(name, {
    name,
    minMasterVersion,
    project,
    region,
    privateClusterConfig: {
      enablePrivateNodes: true,
      masterIpv4CidrBlock: '172.16.2.0/28'
    },
    ipAllocationPolicy: {
      createSubnetwork: true,
      subnetworkName: subnetName
    },
    network: network.selfLink,
    removeDefaultNodePool: true,
    nodePools: [
      {
        name: 'default-pool',
        nodeCount: 0
      }
    ]
  });
  return k8sCluster;
}

// Create K8s Node Pool
export function createK8sNodePool(
  nodePoolName: string,
  project: string,
  region: string,
  clustername: pulumi.Output<string>,
  version: string,
  initialNodeCount: number,
  machineType: string,
  nodePoolTags: string[],
  minNodeCount: number,
  maxNodeCount: number,
  nodeOAuthScopes: string[]
) {
  const k8s_node_pool = new gcp.container.NodePool(nodePoolName, {
    project,
    region,
    cluster: clustername,
    version,
    initialNodeCount,
    nodeConfig: {
      machineType,
      tags: nodePoolTags,
      oauthScopes: nodeOAuthScopes
    },
    management: {
      autoRepair: true,
      autoUpgrade: true
    },
    autoscaling: {
      minNodeCount,
      maxNodeCount
    }
  });

  return k8s_node_pool;
}

Deploy that then:

  1. Comment out the single node pool (so it is an empty array)
  2. Add nodeIpv4CidrBlock: "10.174.56.0/22", to ipAllocationPolicy (getting the correct value to set from the outputs of the currently deployed cluster.
  3. Do another pulumi update
@lukehoban lukehoban self-assigned this Apr 13, 2019
@lukehoban lukehoban added this to the 0.22 milestone Apr 13, 2019
@lukehoban
Copy link
Member Author

The "diff" for nodePools appears to be what triggers this. The diff we compute for node_pool:# is old: 1, new:0. This then causes a replacement because node_pool is marked ForceNew.

We can see in the logs that Diff was passed different values than Update for the olds - which is what allowed this to happen:

I0413 10:38:48.761221   43201 rpc.go:68] Marshaling property for RPC[Provider[gcp, 0xc00012f680].Diff(urn:pulumi:dev::naveen::gcp:container/cluster:Cluster::kepler,kepler).olds]: nodePools={[]}
I0413 10:38:48.761401   43201 rpc.go:68] Marshaling property for RPC[Provider[gcp, 0xc00012f680].Diff(urn:pulumi:dev::naveen::gcp:container/cluster:Cluster::kepler,kepler).news]: nodePools={[]}
I0413 10:38:48.864082   43201 rpc.go:68] Marshaling property for RPC[Provider[gcp, 0xc00012f680].Update(kepler,urn:pulumi:dev::naveen::gcp:container/cluster:Cluster::kepler).olds]: nodePools={[{map[__defaults:{[]} name:{default-pool} nodeCount:{0}]}]}
I0413 10:38:48.864365   43201 rpc.go:68] Marshaling property for RPC[Provider[gcp, 0xc00012f680].Update(kepler,urn:pulumi:dev::naveen::gcp:container/cluster:Cluster::kepler).news]: nodePools={[]}

The result is that the diff doesn't indicate any replacement to Pulumi's planning engine, but then the Update that gets computed includes a diff that forces replacement.

@lukehoban
Copy link
Member Author

lukehoban commented Apr 13, 2019

So - root cause of this is the following two lines in the Pulumi engine:

diff, err := prov.Diff(urn, id, oldOutputs, newInputs, allowUnknowns)
outs, rst, upderr := prov.Update(s.URN(), s.old.ID, s.old.All(), s.new.Inputs)

Note that for "olds" - we pass only the old outputs to Diff, but we pass a merge of old inputs and old outputs to Update. In cases where the old output differed from the old input, this will lead to calls to Update which have changes the provider did not get a chance to see during Diff. This seems to deeply violate the contract of the Pulumi provider interface.

Subtly - the call to .All does a map merge, which recursively merges arrays. So even though it intends to overlay outputs on top of inputs, it actually merges the values of the arrays in outputs and inputs. This leads to not replacing the input array of length 1 with the output array of length 0, but instead returning effectively the input array unchanged.

// All returns all resource state, including the inputs and outputs, overlaid in that order.
func (s *State) All() PropertyMap {
	return s.Inputs.Merge(s.Outputs)
}

@swgillespie What are the expected semantics here? I assume we have to present the same values to Diff and to Update. But I'm unsure:

  1. Are merged outputs and inputs the right thing to be passing to both?
  2. Is state.All() actually a correct merge? What are the required semantics of "merged outputs and inputs"?

lukehoban added a commit that referenced this issue Apr 13, 2019
If the computed diff requires replacement, the contract with the Pulumi provider model has been violated.  We should avoid progressing and doing deletes in this case.

Related to #362.
lukehoban added a commit that referenced this issue Apr 15, 2019
If the computed diff requires replacement, the contract with the Pulumi provider model has been violated.  We should avoid progressing and doing deletes in this case.

Related to #362.
@lukehoban
Copy link
Member Author

The core issue is fixed with #363. It will no longer be possible to unintentionally trigger a delete. However, it will now be an assert instead. Ultimately, we will have to somehow reconcile the core resource model issues here with using outputs in some places and merged inputs+outputs in other places. I've opened pulumi/pulumi#2650 to track that - but it is not as critical as we've patched the most significant failure mode this can trigger.

@infin8x infin8x added the p1 A bug severe enough to be the next item assigned to an engineer label Jul 10, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
p1 A bug severe enough to be the next item assigned to an engineer
Projects
None yet
Development

No branches or pull requests

2 participants