
Updating tags on AKS cluster attempts to "replace" it, but ends up deleting it completely #182

Closed
dbilleci-lightstream opened this issue Feb 14, 2019 · 12 comments

@dbilleci-lightstream

dbilleci-lightstream commented Feb 14, 2019

Found this out while doing some dev work. Every time we deploy an AKS cluster, we set a "created" tag whose value is the current timestamp. While iterating during development, I found that if you redeploy the stack (which includes the AKS cluster, and therefore a new value for that tag), the cluster attempts to replace itself (for a tag?). It then appears to "create" a new cluster of the same name, which I think is really just an update, and then "deletes the old one" of the same name, which means you end up with a deleted cluster.
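
The tag that changes on every deploy is set along these lines (a sketch of the pattern, not the exact program):

    // Sketch: a "created" tag holding the current timestamp, so the tag value
    // differs on every `pulumi up` and produces a tags diff each time.
    const aksTags = {
        created: new Date().toISOString(),
    };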


    # Attempting to create my stack
    Previewing update (aks-backend-x-stage-eus2):
    
         Type                                         Name                                  Plan
     +   pulumi:pulumi:Stack                          aks-backend-aks-backend-x-stage-eus2  create
     +   ├─ azure:core:ResourceGroup                  Backend-X-Stage-Eus2-Rg               create
     +   ├─ azure:containerservice:KubernetesCluster  Backend-X-Stage-Eus2-Aks              create
     +   └─ azure:role:Assignment                     a1307151-3007-0006-0000-000000000000  create
    
    Resources:
        + 4 to create
    
    Do you want to perform this update? yes
    Updating (aks-backend-x-stage-eus2):
    
         Type                                         Name                                  Status                  Info
     +   pulumi:pulumi:Stack                          aks-backend-aks-backend-x-stage-eus2  created
     +   ├─ azure:core:ResourceGroup                  Backend-X-Stage-Eus2-Rg               created
     +   ├─ azure:containerservice:KubernetesCluster  Backend-X-Stage-Eus2-Aks              created
     +   └─ azure:role:Assignment                     a1307151-3007-0006-0000-000000000000  **creating failed**     1 error
    
    Diagnostics:
      azure:role:Assignment (a1307151-3007-0006-0000-000000000000):
        error: Plan apply failed: authorization.RoleAssignmentsClient#Create: Failure responding to request: StatusCode=400 -- Original Error: autorest/azure: Service returned an error. Status=400 Code="PrincipalNotFound" Message="Principal a1307151a68745cdb5fc6c430fff5cb3 does not exist in the directory a948e66e-9b93-46ca-a43b-0f42b9978a41."
    
    Resources:
        + 3 created
    
    Duration: 12m20s
    
    error: update failed
    
    -------
    
    # whoops! error, (fixed it) let's try again (tag gets updated)
    root@b450443eff38:/data/pulumi/infra/azure/aks-backend# pulumi up
    Previewing update (aks-backend-x-stage-eus2):
    
         Type                                         Name                                  Plan        Info
         pulumi:pulumi:Stack                          aks-backend-aks-backend-x-stage-eus2
     +-  ├─ azure:containerservice:KubernetesCluster  Backend-X-Stage-Eus2-Aks              replace     [diff: ~tags]
     +   └─ azure:role:Assignment                     b18a5cc9-3007-0006-0000-000000000000  create
    
    Resources:
        + 1 to create
        +-1 to replace
        2 changes. 2 unchanged
    
    Do you want to perform this update? yes
    Updating (aks-backend-x-stage-eus2):
    
         Type                                         Name                                  Status       Info
         pulumi:pulumi:Stack                          aks-backend-aks-backend-x-stage-eus2
     +-  ├─ azure:containerservice:KubernetesCluster  Backend-X-Stage-Eus2-Aks              replaced     [diff: ~tags]
     +   └─ azure:role:Assignment                     b18a5cc9-3007-0006-0000-000000000000  created
    
    Resources:
        + 1 created
        +-1 replaced
        2 changes. 2 unchanged
    
    Duration: 6m20s
    
    -------
    
    
    # says it's replaced.. but where is it??? I can't find it anywhere, no resource.. it's gone! let's refresh and find out what happened..
    root@b450443eff38:/data/pulumi/infra/azure/aks-backend# pulumi refresh
    Previewing refresh (aks-backend-x-stage-eus2):
    
         Type                                         Name                                  Plan       
         pulumi:pulumi:Stack                          aks-backend-aks-backend-x-stage-eus2             
         ├─ azure:core:ResourceGroup                  Backend-X-Stage-Eus2-Rg                          
     -   ├─ azure:role:Assignment                     b18a5cc9-3007-0006-0000-000000000000  delete     
     -   └─ azure:containerservice:KubernetesCluster  Backend-X-Stage-Eus2-Aks              delete     
     
    Resources:
        - 2 to delete
        2 unchanged
    

It's gone! Is this Pulumi or the TF provider?

@dbilleci-lightstream

To be clear: I think it's odd that a tag change would require tearing down a cluster in the first place, but if that is required, so be it. What I'd expect is that a cluster (a new one) is still left over at the end of the operation, and there isn't! There is NO CLUSTER at all! Sorry if I wasn't clear.

@lukehoban

Yes - this doesn't look right - we'll look into it.

@lukehoban lukehoban self-assigned this Feb 14, 2019
@lukehoban lukehoban added this to the 0.21 milestone Feb 14, 2019
@hausdorff

Updating a tag should not cause a replace. The docs say it's a PATCH request, and we've verified you can update tags in-place from the console. Seems like a bug in TF.

@lukehoban

lukehoban commented Feb 15, 2019

I looked into this.

It is actually another case similar to pulumi/pulumi-aws#451.

The ultimate offender is agentPoolProfile.osDiskSizeGb. This is not at all clear from the diff we present to the user in the CLI, but the Diff returned from the provider indicates that the replace is being forced by that property changing.

In the TF schema it looks like:

"os_disk_size_gb": {
    Type:         schema.TypeInt,
    Optional:     true,
    ForceNew:     true,
    Computed:     true,
    ValidateFunc: validation.IntAtLeast(1),
},

This value is computed and gets populated with the value 30, but no default is set for it. Since it is not provided in the inputs, a diff is then seen (changing from 30 to nothing). In theory this is an upstream schema issue - anything that is both Computed and ForceNew is dangerous. I suspect, though will verify, that this same problem reproduces in Terraform.

If I add osDiskSizeGb: 30 to my agentPoolProfile, I get an "update" proposed instead of a "replace", as expected - so that is the workaround here (see the sketch after the preview output below).

     Type                                         Name          Plan       Info
     pulumi:pulumi:Stack                          20190214-dev             2 warnings
 ~   └─ azure:containerservice:KubernetesCluster  cluster       update     [diff: +tags~agentPoolProfile]
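
For reference, the workaround in program code looks roughly like this - a minimal sketch using the provider API as it appears elsewhere in this issue, with placeholder names and values rather than a real configuration:

    import * as azure from "@pulumi/azure";

    // Sketch: pin osDiskSizeGb explicitly so the provider no longer sees a
    // diff from 30 -> <unset> on that property, and therefore no longer
    // proposes a replace when only the tags change.
    const cluster = new azure.containerservice.KubernetesCluster("example-aks", {
        location: "East US 2",                // placeholder
        resourceGroupName: "example-rg",      // placeholder
        dnsPrefix: "exampleaks",
        agentPoolProfile: {
            name: "default",
            count: 2,
            vmSize: "Standard_DS2_v2",
            osDiskSizeGb: 30,                 // explicitly set the value the service would otherwise compute
        },
        linuxProfile: {
            adminUsername: "azureuser",
            sshKeys: [{ keyData: "<ssh-public-key>" }],   // placeholder
        },
        servicePrincipal: {
            clientId: "<spn-client-id>",          // placeholder
            clientSecret: "<spn-client-secret>",  // placeholder
        },
        tags: { created: new Date().toISOString() },
    });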

There are three issues to follow up on:

  1. The fact that the pulumi CLI misleads on what is causing the actual diff/replace here
  2. Verify whether this reproduces in Terraform and if so report to upstream provider
  3. Understand whether we can/should do anything to be more defensive against this kind of issue in providers

@lukehoban

This apparently does not reproduce with Terraform, so we'll need to look at what fix is appropriate on the Pulumi side.

$ terraform apply
azurerm_kubernetes_cluster.test: Refreshing state... (ID: /subscriptions/0282681f-7a9e-424b-80b2-...nerService/managedClusters/acctestaks1)

An execution plan has been generated and is shown below.
Resource actions are indicated with the following symbols:
  ~ update in-place

Terraform will perform the following actions:

  ~ azurerm_kubernetes_cluster.test
      tags.%:           "0" => "1"
      tags.Environment: "" => "Production"

@lukehoban

Opened pulumi/pulumi#2453 on (1) above.

@lukehoban

lukehoban commented Feb 15, 2019

says it's replaced.. but where is it??? I can't find it anywhere, no resource.. it's gone! let's refresh and find out what happened..

I wasn't able to reproduce this aspect of the original bug report. Even when I do cause a replace to be triggered - it does successfully create a new cluster and delete the old cluster.

Duration: 6m20s

This is also surprisingly short for a replace to have actually happened. For me this took 16m27s, which sounds more correct as creating a new AKS cluster by itself takes ~12m. Definitely seems like something went wrong creating the new cluster - and that that wasn't reported back correctly.

@slikk66

slikk66 commented Feb 15, 2019

@lukehoban try setting the {name} parameter on the cluster properties and see if that triggers the "delete" aspect - that's how I had it configured. This is my personal account; I filed the bug report from my work account (I was forced to create a separate one for work...)

@slikk66

slikk66 commented Feb 15, 2019

Here's the code:

            var aks_name = AzureUtils.get_resource_name('aks', this.context['env_id'], this.context['role_title_color'], this.context['region_key'], true);

            var cluster = new azure.containerservice.KubernetesCluster(aks_name,{
                name: aks_name,
                location: this.definition_data.region_map[this.context['region_key']]['name'],
                resourceGroupName: resource_group_container_name,
                dnsPrefix: aks_name.toLowerCase(),
                agentPoolProfile: {
                    name: "default",
                    count: Number(this.aks['aks_config']['node_count']),
                    vnetSubnetId: aks_subnet_id,
                    vmSize: this.aks['aks_config']['node_vm_size']
                },
                linuxProfile: {
                    adminUsername: 'azureuser',
                    sshKeys: aks_ssh_keys
                },
                addonProfile: {
                    httpApplicationRouting: {
                        enabled: false
                    }
                },
                tags: aks_tags,
                networkProfile: {
                    networkPlugin: 'kubenet',
                    podCidr: "192.168.0.1/17",
                    serviceCidr: "192.168.128.1/17",
                    dockerBridgeCidr: "172.17.0.1/16",
                    dnsServiceIp: "192.168.128.10"
                },
                servicePrincipal: {
                    clientId: this.config.require('spn_client_id'), 
                    clientSecret: this.config.require('spn_client_secret')
                }
            },{
                dependsOn: resource_group_container
            });

@lukehoban

Okay - I can reproduce the disappearing cluster issue - and it is indeed tied to specifying name.

What happens is:

  1. Changing the name forces replacement
  2. By default Pulumi does create-before-delete replacement
  3. Azure has a strange default behaviour where "create" will silently "update" if the name matches an existing resource
  4. As a result, an "update" of the existing resources happens during the create-replace instead of creating a new resource
  5. Finally, the delete of the "replaced" resource happens, which deletes the resource that was just updated

The root of this is (3), which I believe is being changed in the 2.0.0 versions of the Azure Terraform provider per https://www.terraform.io/docs/providers/azurerm/guides/2.0-upgrade-guide.html#changes-when-importing-existing-resources.

A workaround for now (which will be necessary to do the replacement at all after the 2.0.0 change) is to set deleteBeforeReplace: true or to leave off the name. The former deletes the original cluster before creating the new one with the same name; the latter lets the replacement cluster be created with a different name. A sketch of the deleteBeforeReplace option follows.
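
Shown as a fragment of the code posted above (only the resource options change; the cluster arguments are elided):

    // Sketch: opt the cluster into delete-before-replace so the old cluster is
    // removed before a new one with the same explicit name is created, avoiding
    // the "create silently becomes update" collision described in (3) above.
    const cluster = new azure.containerservice.KubernetesCluster(aks_name, {
        name: aks_name,   // explicit name: any change to it forces a replacement
        // ...same cluster arguments as in the code posted above...
    }, {
        dependsOn: resource_group_container,
        deleteBeforeReplace: true,   // delete the old cluster before creating the new one
    });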

@lukehoban

Opened 3 issues to track various pieces of this:

  1. Report the actual diff as computed by the provider pulumi#2453 on the incorrect diff displayed to the user
  2. Properties marked as Computed, Optional and ForceNew cause unexpected Replaces pulumi-terraform#329 on the unexpected replaces
  3. Enable ARM_PROVIDER_STRICT to protect against dangerous behavior #193 to ensure that Create fails when there is an existing resource which would have prevented the disappearance

I'll close this one out and we'll track the work in those other three issues.

@pgavlin

pgavlin commented Mar 4, 2019

FWIW, I cannot reproduce this with the latest bits. Looking through the history, it looks like this commit changed the relevant field from an optional forcenew with a default to an optional computed forcenew. If I remove the computed annotation from the schema, I can reproduce this issue.
