CI/CD Best Practice for plan-all output? #720

Closed
wollerman opened this issue May 31, 2019 · 5 comments

@wollerman

I'm starting to place our infrastructure into a CI/CD pipeline with steps for validate -> plan (with planfile output) -> apply.

At first I thought I could simply run terragrunt plan-all -out "planfile" in one step, save that plan file, and then run terragrunt apply-all -input=false "planfile" in another step. However, it turns out that terragrunt actually places these plan files into the individual modules' .terragrunt-cache directories.

Next I provided an absolute path, like terragrunt plan-all -out /my/plan/file, which seems to generate the single file I'm after.

However, on the next step of trying to apply, since it's a single plan I expect I should use terragrunt apply /my/plan/file as opposed to the apply-all syntax. Doing this actually fails:

$ cd $CI_COMMIT_REF_NAME
$ terragrunt apply -input=false "$CI_PROJECT_DIR/planfile"
[terragrunt] [/builds/path/to/infrastructure-live/dev] 2019/05/31 22:39:40 Running command: terraform --version
[terragrunt] 2019/05/31 22:39:40 Reading Terragrunt config file at /builds/path/to/infrastructure-live/dev/terraform.tfvars
[terragrunt] 2019/05/31 22:39:40 Did not find any Terraform files (*.tf) in /builds/path/to/infrastructure-live/dev
[terragrunt] 2019/05/31 22:39:40 Unable to determine underlying exit code, so Terragrunt will exit with error code 1
ERROR: Job failed: exit code 1

I'm not sure if I should actually be letting the individual modules deal with plan files or trying to consolidate into this single file.

Is there a best practice for how to handle this situation?

@yorinasub17
Contributor

plan-all simply iterates over the subdirectories, looking for those that have tfvars files and running the equivalent of terragrunt plan. Each CLI arg that isn't understood by terragrunt is passed through to the individual plan.

Given that, when you give an absolute path for the plan output, each module's plan writes to that same file, replacing whatever was there. This means the plan that is actually stored there is just the last one that ran, which isn't what you expect.
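
For contrast, here is a minimal sketch of what a relative path would do instead. This is an untested illustration, not an officially supported workflow, and it assumes the .terragrunt-cache working directories persist between the two steps:

# A relative -out path resolves inside each module's working directory,
# so every module gets its own plan file instead of clobbering one file:
$ terragrunt plan-all -out=tfplan

# Extra CLI args are passed through, so this runs "terraform apply tfplan"
# in each module's working directory:
$ terragrunt apply-all tfplan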

This use case that you are describing (having a single plan output from plan-all) is a feature that isn't currently supported. In fact, we generally discourage using the xxx-all flavors for day-to-day operational usage. We designed the xxx-all variants with the specific purpose of standing up the stack from scratch. Once the infra is up, we recommend running the individual plans and applies manually.

There are a few reasons for this:

plan-all output is broken for certain use cases

If you have interdependencies between the modules, then the plan output from plan-all will be wrong, because plan-all does not do any apply in between. To see why, suppose you had the following folder structure, where vpc manages a VPC and eks will manage an EKS cluster in that VPC. The two are linked through a terragrunt dependency chain.

.
├── account.tfvars
├── eks
│   └── terraform.tfvars
└── vpc
    └── terraform.tfvars
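
For concreteness, the dependency might be declared in the eks module's config roughly like this. This is a sketch in the terraform.tfvars config style this thread uses; the exact contents are illustrative, not taken from the issue:

# eks/terraform.tfvars (illustrative)
terragrunt = {
  # Tells terragrunt that eks must be planned/applied after vpc
  dependencies {
    paths = ["../vpc"]
  }
}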

In this scenario, when you run terragrunt plan-all, what you will see is the output of running terragrunt plan in the vpc folder, and then the output of running terragrunt plan in the eks folder.

Now suppose that nothing is deployed yet. In this case, the latter plan will fail, because the vpc isn't deployed and so terraform doesn't know what to do for some resources (the VPC lookup will fail if you are using terraform_remote_state).
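
As an illustration, the failing lookup in the eks module might look something like the following (0.12-style syntax; the bucket, key, and region are hypothetical, purely to show the shape of the dependency):

# Inside the eks module (illustrative; backend details are made up)
data "terraform_remote_state" "vpc" {
  backend = "s3"
  config = {
    bucket = "my-state-bucket"
    key    = "vpc/terraform.tfstate"
    region = "us-east-1"
  }
}

# Until vpc has actually been applied, there is no state to read, so a
# reference like this has nothing to resolve to:
#   data.terraform_remote_state.vpc.outputs.vpc_id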

What if we had already deployed the resources? In this case the plan command will complete successfully, because all the information it needs is available in the state, so there will be no errors.

However, if the vpc has some changes that affect the eks module, then this plan is WRONG. This is simply because the plan for the eks module is based on the current state, not the future state. When you do apply-all, the vpc changes will apply first, which will modify the state and thus change what terraform will do for the eks module when it applies that module.

So in practice, plan-all is something that you really shouldn't use unless you know exactly what you are looking for.

We have discussed trying to address this concern in the past; see #262 for the relevant thread.

If you are always running xxx-all, should that be a single module?

The whole reason to componentize your infrastructure is to localize changes and limit the blast radius: you are trying to keep operator error in one component from spilling over into other components. Using the xxx-all variants is roughly the equivalent of having all the code defined in a single stack, except you lose all the benefits of doing so (you can't cross-reference modules, it isn't treated as a single atomic operation, etc) and keep all the problems (slow to do anything, harder to read/parse the output, less secure because you now need permissions for everything rather than just a small subset, etc). Of course, there are certain situations where this is useful (a massive version bump across the stack, updating the OS AMI across the stack, etc), and for those cases the xxx-all variants exist, but it is usually not recommended to do this all the time.

You can see our post 5 Lessons Learned From Writing Over 300,000 Lines of Infrastructure Code for more reasons why you don't want a single stack.

More error prone

Using the xxx-all variants on a regular basis depends on having all the dependencies between all your modules defined clearly in the terragrunt config. Otherwise, you can have very subtle bugs where the apply happens in the wrong order, leading to a partial rollout. Since there is no syntactic support to enforce this, you are much more likely to get the dependency chain wrong. Basically, the added complexity introduces a lot more room to make mistakes with the xxx-all variants of the commands.

CI/CD Pipeline

So going back to your original question about CI/CD best practices: this is an area that we are actively working on. In general, we find that it is much harder to set up a generic CI/CD pipeline for arbitrary infrastructure changes than for application code. This is mostly because:

  1. The cost of a mistake is catastrophic. If you roll out the wrong version of your app, you can oftentimes roll back to a previous version. If you roll out the wrong infra code and it destroys your VPC (or worse, your database), you have wiped out your entire app, potentially with all its data.
  2. It violates the principle of least privilege. If you want to support CI/CD for arbitrary infra code, your CI/CD system needs the privileges to deploy it. This ranges all the way from creating EC2 instances to read/write access to IAM roles/users/etc. Pretty soon your CI system will need admin-level privileges on your AWS account, which is probably not something you want for long.
  3. Terraform will fail. Very often. Failed deployments are VERY common with infra code, and most CI/CD systems do not have first-class support for error handling and retries.
  4. Manual steps are necessary. Unlike an app deploy, infra deploys typically require many gates and approval chains. This means that your pipeline needs to be broken up with flows that wait for approval steps (approving the plan output, approval from multiple entities, etc). This, again, is not something that most CI/CD systems support first class.

Right now, our general recommendation is to build very limited CI/CD pipelines for the infrastructure that changes often. For example, you can have a CI/CD pipeline just for rolling out your application, or one just for making changes to the ASG.* This can help you work around a lot of the issues mentioned above.

In this model, you are unlikely to use the plan-all variant, and will instead be using plan and apply on a single module. In this case, it will work very similarly to plain terraform, although the main difference is that you will most likely want to use an absolute path to store the plan output so that it is findable (as you discovered).
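
As a rough sketch of that single-module flow, assuming a GitLab-style pipeline like the one in the original question (the module directory is hypothetical):

$ cd infrastructure-live/dev/my-app    # hypothetical module directory
$ terragrunt plan -input=false -out="$CI_PROJECT_DIR/planfile"
# ...manual approval gate between pipeline stages goes here...
$ terragrunt apply -input=false "$CI_PROJECT_DIR/planfile"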

* Note that we don't have any special tooling for implementing such a pipeline (e.g. we don't have scripts that autodetect a change and route it to the relevant pipeline).

@wollerman
Author

I really appreciate the extremely thorough writeup! You've made some awesome points.

You're right that most CI/CD systems don't have manual triggering as a first-class citizen. We have the capability, but it seems there are plenty of other reasons why we should avoid this approach for now.

Again, thanks for the top to bottom explanation!

@marshall7m

Here’s a Terraform module that detects all changes within Terragrunt dependency chains and orchestrates a continuous deployment flow that respects the dependency chain order. The module offers AWS account-level dependencies and approval requirements. In addition, it includes handling for rolling back newly deployed Terraform provider resources. See the repository's README.md for a full description of how this all works.

Side note: I’m the owner of this repository. If you do check it out, I'm open to any questions and brutally honest feedback!

@norman-zon

One option to get around this in CI, at least in GitHub Actions, is to not use run-all, but to use matrix jobs that run for each terragrunt.hcl file. This also offers the advantage of being able to re-run single jobs and, in my tests, was generally faster.
This leverages GitHub Actions' built-in fromJson, which can be used to generate a 'dynamic' job matrix that gets assembled in an earlier job:

jobs:
  build_matrix:
    name: Build Matrix
    runs-on: [self-hosted, linux]

    steps:
      - name: Checkout
        uses: actions/checkout@v3

      - name: Get all directories with terragrunt.hcl
        uses: sergeysova/jq-action@v2
        id: find
        with:
          cmd: for file in $(find * -mindepth 1 -name terragrunt.hcl); do dirname $file; done | jq -R -s -c 'split("\n")[:-1]' | sed 's/"/\\"/g'

      - id: set-matrix
        # Note: ::set-output is deprecated on newer runners; current
        # workflows should write to $GITHUB_OUTPUT instead.
        run: echo ::set-output name=matrix::${{ steps.find.outputs.value }}

    outputs:
      matrix: ${{ steps.set-matrix.outputs.matrix }}

  run:
    name: Run Terragrunt
    runs-on: [self-hosted, linux]
    needs: build_matrix
    strategy:
      fail-fast: false
      matrix:
        dir: ${{ fromJson(needs.build_matrix.outputs.matrix) }}

    steps:
      - name: Checkout
        uses: actions/checkout@v2

      - uses: actions/setup-node@v2
        with:
          node-version: '14'

      - uses: hashicorp/setup-terraform@v1
        with:
          terraform_wrapper: false # needed for terragrunt to be able to parse TF output

      - name: Setup Terragrunt
        uses: autero1/action-terragrunt@v1.1.1
        with:
          terragrunt_version: "latest"

      - name: Run terragrunt
        run: |
          cd ${{ matrix.dir }}
          terragrunt plan --terragrunt-ignore-external-dependencies -detailed-exitcode -lock=false -parallelism=30

Hope this is useful to someone else.

@trallnag

@norman-zon, won't this lead to issues if one stack depends on another one?
