Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

resource/aws_ecs_task_definition: prevent spurious environment variable diffs #11463

Merged

Conversation

jbergknoff-rival
Copy link
Contributor

@jbergknoff-rival jbergknoff-rival commented Jan 3, 2020

An ECS container definition's environment variables are stored by ECS as a list (rather than a map), and the ECS service reorders them arbitrarily because there is no meaningful order to them. There's already code in place to sort the list of environment variable when computing the plan, so that, if that reordering is the only change, then the container definition is considered clean.

However, if anything else in the container definition changes, then we rightfully get a plan. The environment variables still show up in the plan, and, because of the reordering, there appears to be a big change. It's extremely difficult to inspect manually.

Here's an example:

provider "aws" {}

resource "aws_ecs_task_definition" "default" {
  family                = "reordering-demonstration"
  container_definitions = jsonencode(
    [
      {
        "name": "main",
        "image": "hello-world",
        "cpu": 10,
        "memory": 512,
        "essential": true,
        "environment": [
          {"name": "abc", "value": "123"},
          {"name": "def", "value": "456"},
          {"name": "ghi", "value": "789"},
          {"name": "jkl", "value": "321"},
          {"name": "mno", "value": "654"},
          {"name": "pqr", "value": "987"},
        ]
      }
    ]
  )
}

Apply, and then change the memory from 512 to 1024. The next plan is:

Terraform will perform the following actions:

  # aws_ecs_task_definition.default must be replaced
-/+ resource "aws_ecs_task_definition" "default" {
      ~ arn                      = "..." -> (known after apply)
      ~ container_definitions    = jsonencode(
          ~ [ # forces replacement
              ~ {
                    cpu          = 10
                  ~ environment  = [
                      - {
                          - name  = "pqr"
                          - value = "987"
                        },
                      - {
                          - name  = "ghi"
                          - value = "789"
                        },
                      - {
                          - name  = "jkl"
                          - value = "321"
                        },
                        {
                            name  = "abc"
                            value = "123"
                        },
                        {
                            name  = "def"
                            value = "456"
                        },
                      + {
                          + name  = "ghi"
                          + value = "789"
                        },
                      + {
                          + name  = "jkl"
                          + value = "321"
                        },
                        {
                            name  = "mno"
                            value = "654"
                        },
                      + {
                          + name  = "pqr"
                          + value = "987"
                        },
                    ]
                    essential    = true
                    image        = "hello-world"
                  ~ memory       = 512 -> 1024
                  - mountPoints  = [] -> null
                    name         = "main"
                  - portMappings = [] -> null
                  - volumesFrom  = [] -> null
                } # forces replacement,
            ]
        )
        family                   = "reordering-demonstration"
      ~ id                       = "reordering-demonstration" -> (known after apply)
      + network_mode             = (known after apply)
      - requires_compatibilities = [] -> null
      ~ revision                 = 5 -> (known after apply)
      - tags                     = {} -> null
    }

Plan: 1 to add, 0 to change, 1 to destroy.

The correct plan would show memory = 512 -> 1024 but environment would be left alone.

This PR attempts to correct the problem by ordering the list of environment variables (by name) when we read the task definition from ECS, and when we serialize to state. In my testing (with the toy example above), it fixes the issue.

This approach also gives a clean plan if environment variables in the container definition are reordered in the Terraform source. That seems like reasonable behavior to me, because there is no meaningful order to the list.

Community Note

  • Please vote on this pull request by adding a 馃憤 reaction to the original pull request comment to help the community and maintainers prioritize this request
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for pull request followers and do not help prioritize the request

Release note for CHANGELOG:

BUG FIXES:
* resource/aws_ecs_task_definition: don't show a spurious environment variable diff when another property on the task definition is changing.

Output from acceptance testing:

The useful thing to test here is that, while a change to some other attribute is legitimately resulting in a plan, the unchanged environment variables don't show up as changed in the diff. Unfortunately, I didn't see a way to inspect the contents of a plan using the testing framework. If somebody can advise on where/how to write a test for this, I'll add one. It probably should be a unit test, though, rather than an acceptance test.

@jbergknoff-rival jbergknoff-rival requested a review from Jan 3, 2020
@ghost ghost added size/XS needs-triage service/ecs labels Jan 3, 2020
saich
saich approved these changes Jan 3, 2020
Copy link

@saich saich left a comment

Adding unit tests could speed up the review process

@MOURIK
Copy link

@MOURIK MOURIK commented Feb 28, 2020

I confirm the issue... Any update regarding this merge request ?
thanks in advance.

@rafilkmp3
Copy link

@rafilkmp3 rafilkmp3 commented Mar 10, 2020

I really can't believe this is not merged yet

@rafilkmp3
Copy link

@rafilkmp3 rafilkmp3 commented Mar 10, 2020

@jbergknoff-rival can I build myself the provider with your code? this is today my big issue

@jbergknoff-rival
Copy link
Contributor Author

@jbergknoff-rival jbergknoff-rival commented Mar 10, 2020

can I build myself the provider with your code?

Yes, absolutely. My team runs a custom build of this provider with a bunch of useful PRs and bugfixes included (including this one).

@rafilkmp3
Copy link

@rafilkmp3 rafilkmp3 commented Mar 28, 2020

another week on paradise

@pkoch
Copy link

@pkoch pkoch commented Mar 29, 2020

@rafilkmp3 Still working on this? Do you know enough go/tf to perhaps add some tests to this patch, like @saich mentioned?

@saich @jbergknoff-rival: what next steps can we take here to see this get to master faster?

@rafilkmp3
Copy link

@rafilkmp3 rafilkmp3 commented Mar 29, 2020

@pkoch I'm not working on it, I just use the @ jbergknoff-rival version and I feel powerless because it is not merged. Trust me if I had been able to develop tests to make this happen, I certainly would have done it a long time ago.

@jbergknoff-rival
Copy link
Contributor Author

@jbergknoff-rival jbergknoff-rival commented Mar 30, 2020

@pkoch it's in the hands of the maintainers, and I don't know of anything we can do to get this merged. As far as I know, none of the maintainers have seen this PR. I'd like to add tests if somebody more in-the-know would advise on my question (in the PR description) about how to write a test against the contents of a plan.

@nevins-ellevation
Copy link

@nevins-ellevation nevins-ellevation commented Mar 30, 2020

Could you do something like the existing testAccCheckEcsTaskDefinitionRecreated test but validate that when you re-ordered the environment it didn't change revisions?

@jbergknoff-rival
Copy link
Contributor Author

@jbergknoff-rival jbergknoff-rival commented Mar 30, 2020

Could you do something like the existing testAccCheckEcsTaskDefinitionRecreated test but validate that when you re-ordered the environment it didn't change revisions?

Thanks for the suggestion. I don't think this would test what this PR addresses, though, which is: when there is a good reason to show a plan, the environment variables should only show up as changed if they're actually changing.

The PR doesn't affect any interactions with ECS, doesn't change whether plans are clean or not, etc.

@nevins-ellevation
Copy link

@nevins-ellevation nevins-ellevation commented Mar 31, 2020

Could you do something like the existing testAccCheckEcsTaskDefinitionRecreated test but validate that when you re-ordered the environment it didn't change revisions?

Thanks for the suggestion. I don't think this would test what this PR addresses, though, which is: when there is a good reason to show a plan, the environment variables should only show up as changed if they're actually changing.

The PR doesn't affect any interactions with ECS, doesn't change whether plans are clean or not, etc.

My thought is that if you do two terraform runs, if changes in the order of the environmental variables doesn't cause the task revision to have changed between the runs you know that the changes in this PR are working correctly.

@jbergknoff-rival jbergknoff-rival force-pushed the jbergknoff/ecs-task-definition-env-ordering branch from 2e89b7b to e718a30 Compare Apr 2, 2020
@Jimmyscene
Copy link

@Jimmyscene Jimmyscene commented Apr 8, 2020

Another bump, this bug is really annoying

@Jimmyscene
Copy link

@Jimmyscene Jimmyscene commented Apr 16, 2020

@bflad any opinions on this?

@nathanielks
Copy link
Contributor

@nathanielks nathanielks commented Apr 16, 2020

As far as I can tell, I don't believe there's any acceptance tests that can be added with the existing testing tool kit. The TestAwsEcsContainerDefinitionsAreEquivalent_arrays test already verifies that environment variables will be sorted when comparing local state to the AWS API. The challenge for this PR is it needs the ability to test the diff Terraform is generating, which I don't believe there's a testing API for currently.

I've done some digging and I believe it would be possible to add some sort of Plan diff testing functionality around here: https://github.com/terraform-providers/terraform-provider-aws/blob/2b074d0a45342f79c2cf763a26b5b4d157feadc7/vendor/github.com/hashicorp/terraform-plugin-sdk/helper/resource/testing_config.go#L164-L172

A DiffTestFunc property could be added to TestStep which could be run after the plan has been made and the diff itself could be tested using p.Changes. I'll report back if I'm able to get something working.

As an aside, I did learn about some testing functionality I wasn't aware of previously. When defining TestStep's, it's possible to specify PlanOnly and ExpectNonEmptyPlan for simply verifying a plan generates a change or not. Check out this test for an example of how to use those fields.

@nathanielks
Copy link
Contributor

@nathanielks nathanielks commented Apr 16, 2020

One challenge I'm experiencing in attempting to test this out is making changes to testing_config.go doesn't appear to make any difference when running make testacc for some reason. I'm still a golang novice, so I am unsure of why it's not picking up the changes I'm making. For reference, I'm simply adding an additional logging statement to line 84 to verify changes to the file get picked up, and the log isn't getting output.

The change itself:

	// If this step is a PlanOnly step, skip over this first Plan and subsequent
	// Apply, and use the follow up Plan that checks for perpetual diffs
	if !step.PlanOnly {
		// Plan!
		if p, stepDiags := ctx.Plan(); stepDiags.HasErrors() {
			return state, newOperationError("plan", stepDiags)
		} else {
+			// Add an additional logging statement
+			log.Printf("[WARN] Test: please work?")
			log.Printf("[WARN] Test: Step plan: %s", legacyPlanComparisonString(newState, p.Changes))
		}

The command I'm running:

TF_LOG=WARN make testacc TEST=./aws TESTARGS='-run=TestAccAWSEcrRepository_image_scanning_configuration'

@rabidscorpio
Copy link

@rabidscorpio rabidscorpio commented Apr 29, 2020

@bflad This is a clean PR that fixes a long-standing issue (see #3035), can we please get some traction on this?

@nwsparks
Copy link

@nwsparks nwsparks commented May 22, 2020

What is preventing this from moving forward?

@stuzlogle
Copy link

@stuzlogle stuzlogle commented May 26, 2020

no idea if this is a blocker but the provider has a quarterly roadmap I found when looking for the progress on another issue - https://github.com/terraform-providers/terraform-provider-aws/blob/master/ROADMAP.md

@nwsparks
Copy link

@nwsparks nwsparks commented Jun 19, 2020

If this was a feature addition my opinion would be different, but this is a major bug on a widely used service that that is incredibly frustrating to deal with as well as dangerous since it makes it difficult to tell what is actually changing in the environment of the application.

If you sort by reactions, this is the 3rd most upvoted open PR which per the faq is one of the main criteria for review. As you've mentioned @pkoch the PR appears fine and has multiple approvals, so people are just left to guess as to why it isn't being merged? Some guidance would surely be nice instead and perhaps more priority given to bug fixes in the future by the maintainers.

In any case, apologies for the distraction and appreciate the attempts to get it moving.

@pkoch
Copy link

@pkoch pkoch commented Jun 19, 2020

Just learned that Hashicorp provides community office hours: https://www.hashicorp.com/community/office-hours/#technical-office-hours. Perhaps that's a good forum to present our case after we've made sure we've gone through all the expected due-diligence.

@nathanielks
Copy link
Contributor

@nathanielks nathanielks commented Jun 19, 2020

Great idea, @pkoch 馃憤

@pkoch
Copy link

@pkoch pkoch commented Jun 19, 2020

To those looking for a bit of relief from this problem, I can point out that ECS now supports using environmentFiles for EC2 launch type.

@pkoch
Copy link

@pkoch pkoch commented Jun 19, 2020

I think @nathanielks is being a bit too bold in expanding the scope of this PR to make the diffing possible. I understand that the instructions do ask for an acceptance test coverage of new behavior but I'm not sure it's a reasonable ask if there's no testing infra that can help us right now. Adding that additional burden for that piece of infra to this PR seems undue.

One thing that bothers me here is that, even if we we managed to get that plan diffing feature in, since we have to assume AWS is going to be random and nondeterministic in their ordering, any acceptance test will inherently be flaky with false passes 鈥 AWS could happen to return just the right order. I'm not sure how allergic the maintainers of this repo are to that, but that'd be something I'd like an explicit ack on.

I think our best bet would be to add a well placed unit test. This is so internal and it's hard-to-control enough that I feel comfortable arguing in favor of that. I'm not sure which part of the code it should exercise, tho. I took a dumb approach at something by just fork-lifting StateFunc's meat, but I bet a person more familiar with the project's code organization and preferences would have a better idea than mine. You can check said unit test at https://gist.github.com/pkoch/33d5e70018e70b05b79daa2dc99ee867

Since we can't meet the established criteria for PR acceptance, we're stuck waiting for someone with more power than us.

Meanwhile, I took the liberty of booking an office hours slot with them. It's going to be at 12:30 - 13:00 Lisbon Time, on June 23, 2020. Wish me luck! 馃

@nathanielks
Copy link
Contributor

@nathanielks nathanielks commented Jun 19, 2020

Thanks for taking the time to book a spot with them, @pkoch!

@jboero
Copy link

@jboero jboero commented Jun 23, 2020

@paultyng do you or anyone on modules team have an idea when this can merge? @pkoch asking on office hours during Hashiconf EU. Thanks!

@pkoch
Copy link

@pkoch pkoch commented Jun 23, 2020

Just left the call. Had a great nice chat with @jboero, where we tried getting the attention of some folks that might be able to provide some firmer guidance (as you can see in his comment), and we'll keep trying to move this forward once inch at a time, as people are able to squeeze this into their busy schedules.

Thank you so much for helping us out, @jboero! 馃グ

@breathingdust
Copy link
Member

@breathingdust breathingdust commented Jun 23, 2020

Hi all 馃憢 ,

Sorry for the delay in response. We recognize the value this has to the community and as such this item is already in our backlog as a priority to address. We are currently working through our ROADMAP items for this quarter, once that is complete we will be in position to offer feedback and hopefully merge this pull request.

We appreciate your patience!

@bflad bflad added bug and removed needs-triage labels Jun 24, 2020
@bflad bflad self-assigned this Jun 24, 2020
bflad
bflad approved these changes Jun 24, 2020
Copy link
Member

@bflad bflad left a comment

This looks right to me and agreed that adding testing for this is not necessarily straightforward, so rather than holding this up further on that, I will approve this on a lack of regressions from our existing unit testing and acceptance testing. Thank you @jbergknoff-rival and everyone involved. 馃殌

Output from acceptance testing:

--- PASS: TestAccAWSEcsTaskDefinition_arrays (18.99s)
--- PASS: TestAccAWSEcsTaskDefinition_basic (31.95s)
--- PASS: TestAccAWSEcsTaskDefinition_changeVolumesForcesNewResource (32.13s)
--- PASS: TestAccAWSEcsTaskDefinition_constraint (19.21s)
--- PASS: TestAccAWSEcsTaskDefinition_ExecutionRole (20.29s)
--- PASS: TestAccAWSEcsTaskDefinition_Fargate (25.76s)
--- PASS: TestAccAWSEcsTaskDefinition_Inactive (37.80s)
--- PASS: TestAccAWSEcsTaskDefinition_inferenceAccelerator (19.31s)
--- PASS: TestAccAWSEcsTaskDefinition_ProxyConfiguration (31.52s)
--- PASS: TestAccAWSEcsTaskDefinition_Tags (52.97s)
--- PASS: TestAccAWSEcsTaskDefinition_withDockerVolume (19.42s)
--- PASS: TestAccAWSEcsTaskDefinition_withDockerVolumeMinimalConfig (19.41s)
--- PASS: TestAccAWSEcsTaskDefinition_withEcsService (76.30s)
--- PASS: TestAccAWSEcsTaskDefinition_withEFSVolume (32.52s)
--- PASS: TestAccAWSEcsTaskDefinition_withEFSVolumeMinimal (32.26s)
--- PASS: TestAccAWSEcsTaskDefinition_withIPCMode (20.19s)
--- PASS: TestAccAWSEcsTaskDefinition_withNetworkMode (20.36s)
--- PASS: TestAccAWSEcsTaskDefinition_withPidMode (29.78s)
--- PASS: TestAccAWSEcsTaskDefinition_withScratchVolume (18.39s)
--- PASS: TestAccAWSEcsTaskDefinition_withTaskRoleArn (20.21s)
--- PASS: TestAccAWSEcsTaskDefinition_withTaskScopedDockerVolume (18.99s)

@bflad bflad added this to the v2.68.0 milestone Jun 24, 2020
@bflad bflad merged commit 0be34d4 into hashicorp:master Jun 24, 2020
1 check passed
bflad added a commit that referenced this issue Jun 24, 2020
@jbergknoff-rival
Copy link
Contributor Author

@jbergknoff-rival jbergknoff-rival commented Jun 24, 2020

@bflad thank you! While this is fresh in our minds, please also have a look at #13813 which solves the same problem for various resources with IAM-like policies.

@jbergknoff-rival jbergknoff-rival deleted the jbergknoff/ecs-task-definition-env-ordering branch Jun 25, 2020
@jbergknoff-rival jbergknoff-rival restored the jbergknoff/ecs-task-definition-env-ordering branch Jun 25, 2020
@ghost
Copy link

@ghost ghost commented Jun 26, 2020

This has been released in version 2.68.0 of the Terraform AWS provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading.

For further feature requests or bug reports with this functionality, please create a new GitHub issue following the template for triage. Thanks!

@pkoch
Copy link

@pkoch pkoch commented Jun 26, 2020

@rafilkmp3 The day has come. Welcome to paradise.

@jfirebaugh
Copy link

@jfirebaugh jfirebaugh commented Jun 30, 2020

Is anyone else getting even worse diffing with this change? I now see the "after" state represented as stringified JSON with no whitespace (i.e. virtually unreadable):

  # <resource> must be replaced
-/+ resource "aws_ecs_task_definition" "main" {
      ...
      ~ container_definitions    = jsonencode(
            <nicely formatted JSON data>
        ) -> "<stringified JSON with no whitespace>" # forces replacement
      ...
    }

@pkoch
Copy link

@pkoch pkoch commented Jun 30, 2020

I have tried this out and only seen the expected behavior. :/

Can you perhaps produce a minimal example and open a new issue?

@jbergknoff-rival jbergknoff-rival deleted the jbergknoff/ecs-task-definition-env-ordering branch Jun 30, 2020
@jfirebaugh
Copy link

@jfirebaugh jfirebaugh commented Jun 30, 2020

I reduced it to this diff, as generated by the 0.67.0 provider:

    logConfiguration = {
      logDriver = "awslogs"
      options = {
        ~ awslogs-region        = "us-west-2" -> null
          awslogs-group         = "group"
          awslogs-stream-prefix = "prefix"
      }
      secretOptions = []
    }

For whatever reason, the 0.68.0 provider is unable to generate a readable diff for this change. However, changing awslogs-region to null was an unintended change in my configuration, and after fixing it, I do see the expected behavior. So this is probably not something that will be commonly encountered.

@ghost
Copy link

@ghost ghost commented Jul 24, 2020

I'm going to lock this issue because it has been closed for 30 days . This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thanks!

@hashicorp hashicorp locked and limited conversation to collaborators Jul 24, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug service/ecs size/XS
Projects
None yet
Development

Successfully merging this pull request may close these issues.