
Updating tags on AKS cluster attempts to "replace" it, but ends up deleting it completely #182

Closed
dbilleci-lightstream opened this issue Feb 14, 2019 · 12 comments

@dbilleci-lightstream

dbilleci-lightstream commented Feb 14, 2019

Found this out while doing some dev work. Every time we deploy an AKS cluster, we set a "created" tag whose value is the current timestamp. While iterating during development, I found that if you redeploy the stack (which includes the AKS cluster, and therefore a new value for that tag), the cluster attempts to replace itself (for a tag?). It then appears to "create" a new cluster of the same name, which I think is really just an update, and then "deletes the old one" of the same name, which means you end up with a deleted cluster.
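
The tag that changes on every deploy is set along these lines (a sketch of the pattern, not the exact program):

    // Sketch: a "created" tag holding the current timestamp, so the tag value
    // differs on every `pulumi up` and produces a tags diff each time.
    const aksTags = {
        created: new Date().toISOString(),
    };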


    # Attempting to create my stack
    Previewing update (aks-backend-x-stage-eus2):
    
         Type                                         Name                                  Plan
     +   pulumi:pulumi:Stack                          aks-backend-aks-backend-x-stage-eus2  create
     +   ├─ azure:core:ResourceGroup                  Backend-X-Stage-Eus2-Rg               create
     +   ├─ azure:containerservice:KubernetesCluster  Backend-X-Stage-Eus2-Aks              create
     +   └─ azure:role:Assignment                     a1307151-3007-0006-0000-000000000000  create
    
    Resources:
        + 4 to create
    
    Do you want to perform this update? yes
    Updating (aks-backend-x-stage-eus2):
    
         Type                                         Name                                  Status                  Info
     +   pulumi:pulumi:Stack                          aks-backend-aks-backend-x-stage-eus2  created
     +   ├─ azure:core:ResourceGroup                  Backend-X-Stage-Eus2-Rg               created
     +   ├─ azure:containerservice:KubernetesCluster  Backend-X-Stage-Eus2-Aks              created
     +   └─ azure:role:Assignment                     a1307151-3007-0006-0000-000000000000  **creating failed**     1 error
    
    Diagnostics:
      azure:role:Assignment (a1307151-3007-0006-0000-000000000000):
        error: Plan apply failed: authorization.RoleAssignmentsClient#Create: Failure responding to request: StatusCode=400 -- Original Error: autorest/azure: Service returned an error. Status=400 Code="PrincipalNotFound" Message="Principal a1307151a68745cdb5fc6c430fff5cb3 does not exist in the directory a948e66e-9b93-46ca-a43b-0f42b9978a41."
    
    Resources:
        + 3 created
    
    Duration: 12m20s
    
    error: update failed
    
    -------
    
    # whoops! error, (fixed it) let's try again (tag gets updated)
    root@b450443eff38:/data/pulumi/infra/azure/aks-backend# pulumi up
    Previewing update (aks-backend-x-stage-eus2):
    
         Type                                         Name                                  Plan        Info
         pulumi:pulumi:Stack                          aks-backend-aks-backend-x-stage-eus2
     +-  ├─ azure:containerservice:KubernetesCluster  Backend-X-Stage-Eus2-Aks              replace     [diff: ~tags]
     +   └─ azure:role:Assignment                     b18a5cc9-3007-0006-0000-000000000000  create
    
    Resources:
        + 1 to create
        +-1 to replace
        2 changes. 2 unchanged
    
    Do you want to perform this update? yes
    Updating (aks-backend-x-stage-eus2):
    
         Type                                         Name                                  Status       Info
         pulumi:pulumi:Stack                          aks-backend-aks-backend-x-stage-eus2
     +-  ├─ azure:containerservice:KubernetesCluster  Backend-X-Stage-Eus2-Aks              replaced     [diff: ~tags]
     +   └─ azure:role:Assignment                     b18a5cc9-3007-0006-0000-000000000000  created
    
    Resources:
        + 1 created
        +-1 replaced
        2 changes. 2 unchanged
    
    Duration: 6m20s
    
    -------
    
    
    # says it's replaced.. but where is it??? I can't find it anywhere, no resource.. it's gone! let's refresh and find out what happened..
    root@b450443eff38:/data/pulumi/infra/azure/aks-backend# pulumi refresh
    Previewing refresh (aks-backend-x-stage-eus2):
    
         Type                                         Name                                  Plan       
         pulumi:pulumi:Stack                          aks-backend-aks-backend-x-stage-eus2             
         ├─ azure:core:ResourceGroup                  Backend-X-Stage-Eus2-Rg                          
     -   ├─ azure:role:Assignment                     b18a5cc9-3007-0006-0000-000000000000  delete     
     -   └─ azure:containerservice:KubernetesCluster  Backend-X-Stage-Eus2-Aks              delete     
     
    Resources:
        - 2 to delete
        2 unchanged
    

It's gone! Is this Pulumi or the TF provider?

@dbilleci-lightstream

To be clear: I think it's odd that a tag change would require tearing down a cluster in the first place, but if that is required, so be it. What I'd expect is that a cluster (a new one) is still left over at the end of the operation, and there isn't! There is NO CLUSTER at all! Sorry if I wasn't clear.

@lukehoban

Yes - this doesn't look right - we'll look into it.

@lukehoban lukehoban self-assigned this Feb 14, 2019
@lukehoban lukehoban added this to the 0.21 milestone Feb 14, 2019
@hausdorff

Updating a tag should not cause a replace. The docs say it's a PATCH request, and we've verified you can update tags in-place from the console. Seems like a bug in TF.

@lukehoban

lukehoban commented Feb 15, 2019

I looked into this.

It is actually another case similar to pulumi/pulumi-aws#451.

The ultimate offender is agentPoolProfile.osDiskSizeGb. This is not at all clear from the diff we present to the user in the CLI, but the Diff returned from the provider indicates that the replace is being forced by that property changing.

In the TF schema it looks like:

"os_disk_size_gb": {
    Type:         schema.TypeInt,
    Optional:     true,
    ForceNew:     true,
    Computed:     true,
    ValidateFunc: validation.IntAtLeast(1),
},

This value is computed and gets populated with the value 30, but no default is set for it. Since it is not provided in the inputs, a diff is then seen (changing from 30 to nothing). In theory this is an upstream schema issue - anything that is both Computed and ForceNew is dangerous. I suspect, though will verify, that this same problem reproduces in Terraform.

If I add osDiskSizeGb: 30 to my agentPoolProfile, I get an "update" proposed instead of a "replace", as expected - so that is the workaround here (see the sketch after the preview output below).

     Type                                         Name          Plan       Info
     pulumi:pulumi:Stack                          20190214-dev             2 warnings
 ~   └─ azure:containerservice:KubernetesCluster  cluster       update     [diff: +tags~agentPoolProfile]
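
For reference, the workaround in program code looks roughly like this - a minimal sketch using the provider API as it appears elsewhere in this issue, with placeholder names and values rather than a real configuration:

    import * as azure from "@pulumi/azure";

    // Sketch: pin osDiskSizeGb explicitly so the provider no longer sees a
    // diff from 30 -> <unset> on that property, and therefore no longer
    // proposes a replace when only the tags change.
    const cluster = new azure.containerservice.KubernetesCluster("example-aks", {
        location: "East US 2",                // placeholder
        resourceGroupName: "example-rg",      // placeholder
        dnsPrefix: "exampleaks",
        agentPoolProfile: {
            name: "default",
            count: 2,
            vmSize: "Standard_DS2_v2",
            osDiskSizeGb: 30,                 // explicitly set the value the service would otherwise compute
        },
        linuxProfile: {
            adminUsername: "azureuser",
            sshKeys: [{ keyData: "<ssh-public-key>" }],   // placeholder
        },
        servicePrincipal: {
            clientId: "<spn-client-id>",          // placeholder
            clientSecret: "<spn-client-secret>",  // placeholder
        },
        tags: { created: new Date().toISOString() },
    });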

There are three issues to follow up on:

  1. The fact that the pulumi CLI misleads on what is causing the actual diff/replace here
  2. Verify whether this reproduces in Terraform and if so report to upstream provider
  3. Understand whether we can/should do anything to be more defensive against this kind of issue in providers

@lukehoban

This apparently does not reproduce with Terraform, so we'll need to look at what fix is appropriate on the Pulumi side.

$ terraform apply
azurerm_kubernetes_cluster.test: Refreshing state... (ID: /subscriptions/0282681f-7a9e-424b-80b2-...nerService/managedClusters/acctestaks1)

An execution plan has been generated and is shown below.
Resource actions are indicated with the following symbols:
  ~ update in-place

Terraform will perform the following actions:

  ~ azurerm_kubernetes_cluster.test
      tags.%:           "0" => "1"
      tags.Environment: "" => "Production"

@lukehoban

Opened pulumi/pulumi#2453 on (1) above.

@lukehoban

lukehoban commented Feb 15, 2019

says it's replaced.. but where is it??? I can't find it anywhere, no resource.. it's gone! let's refresh and find out what happened..

I wasn't able to reproduce this aspect of the original bug report. Even when I do cause a replace to be triggered - it does successfully create a new cluster and delete the old cluster.

Duration: 6m20s

This is also surprisingly short for a replace to have actually happened. For me this took 16m27s, which sounds more correct as creating a new AKS cluster by itself takes ~12m. Definitely seems like something went wrong creating the new cluster - and that that wasn't reported back correctly.

@slikk66

slikk66 commented Feb 15, 2019

@lukehoban try setting the {name} parameter on the cluster properties and see if that triggers the "delete" aspect - that's how I had it configured. This is my personal account; I filed the bug report from my work account (I was forced to create a separate one for work...)

@slikk66

slikk66 commented Feb 15, 2019

Here's the code:

            var aks_name = AzureUtils.get_resource_name('aks', this.context['env_id'], this.context['role_title_color'], this.context['region_key'], true);

            var cluster = new azure.containerservice.KubernetesCluster(aks_name,{
                name: aks_name,
                location: this.definition_data.region_map[this.context['region_key']]['name'],
                resourceGroupName: resource_group_container_name,
                dnsPrefix: aks_name.toLowerCase(),
                agentPoolProfile: {
                    name: "default",
                    count: Number(this.aks['aks_config']['node_count']),
                    vnetSubnetId: aks_subnet_id,
                    vmSize: this.aks['aks_config']['node_vm_size']
                },
                linuxProfile: {
                    adminUsername: 'azureuser',
                    sshKeys: aks_ssh_keys
                },
                addonProfile: {
                    httpApplicationRouting: {
                        enabled: false
                    }
                },
                tags: aks_tags,
                networkProfile: {
                    networkPlugin: 'kubenet',
                    podCidr: "192.168.0.1/17",
                    serviceCidr: "192.168.128.1/17",
                    dockerBridgeCidr: "172.17.0.1/16",
                    dnsServiceIp: "192.168.128.10"
                },
                servicePrincipal: {
                    clientId: this.config.require('spn_client_id'), 
                    clientSecret: this.config.require('spn_client_secret')
                }
            },{
                dependsOn: resource_group_container
            });

@lukehoban

Okay - I can reproduce the disappearing cluster issue - and it is indeed tied to specifying name.

What happens is:

  1. Changing the name forces replacement
  2. By default Pulumi does create-before-delete replacement
  3. Azure has a strange default behaviour where "create" will silently "update" if the name matches an existing resource
  4. As a result, an "update" of the existing resources happens during the create-replace instead of creating a new resource
  5. Finally, the delete of the "replaced" resource happens, which deletes the resource that was just updated

The root of this is (3), which I believe is being changed in the 2.0.0 versions of the Azure Terraform provider per https://www.terraform.io/docs/providers/azurerm/guides/2.0-upgrade-guide.html#changes-when-importing-existing-resources.

A workaround for now (which will be necessary to do the replacement at all after the 2.0.0 change) is to set deleteBeforeReplace: true or to leave off the name. The former deletes the original cluster before creating the new one with the same name; the latter lets the replacement cluster be created with a different name. A sketch of the deleteBeforeReplace option follows.
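
Shown as a fragment of the code posted above (only the resource options change; the cluster arguments are elided):

    // Sketch: opt the cluster into delete-before-replace so the old cluster is
    // removed before a new one with the same explicit name is created, avoiding
    // the "create silently becomes update" collision described in (3) above.
    const cluster = new azure.containerservice.KubernetesCluster(aks_name, {
        name: aks_name,   // explicit name: any change to it forces a replacement
        // ...same cluster arguments as in the code posted above...
    }, {
        dependsOn: resource_group_container,
        deleteBeforeReplace: true,   // delete the old cluster before creating the new one
    });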

@lukehoban

Opened 3 issues to track various pieces of this:

  1. Report the actual diff as computed by the provider pulumi#2453 on the incorrect diff displayed to the user
  2. Properties marked as Computed, Optional and ForceNew cause unexpected Replaces pulumi-terraform#329 on the unexpected replaces
  3. Enable ARM_PROVIDER_STRICT to protect against dangerous behavior #193 to ensure that Create fails when there is an existing resource which would have prevented the disappearance

I'll close this one out and we'll track the work in those other three issues.

@pgavlin

pgavlin commented Mar 4, 2019

FWIW, I cannot reproduce this with the latest bits. Looking through the history, it looks like this commit changed the relevant field from an optional forcenew with a default to an optional computed forcenew. If I remove the computed annotation from the schema, I can reproduce this issue.
