
Terraform state not updated on a failed run #527

Closed

Spritekin opened this issue Jul 23, 2023 · 8 comments · Fixed by #539 or #531

Comments

@Spritekin
Contributor

Spritekin commented Jul 23, 2023

I have a run with a series of resources, and the plan finishes OK.
During the apply, some of the resources are created, but the run fails mid-execution due to a problem that prevents a pod from being created.
The failure looks like this:

Terraform has been successfully initialized!
module.oss_prod_sonarqube_12.module.postgresql.kubernetes_secret.this: Creating...
module.oss_prod_sonarqube_12.aws_ebs_volume.this: Creating...
module.oss_prod_sonarqube_12.module.postgresql.aws_ebs_volume.this: Creating...
module.oss_prod_sonarqube_12.aws_route53_record.hostname[0]: Creating...
module.oss_prod_sonarqube_12.aws_ebs_volume.this: Still creating... [10s elapsed]
module.oss_prod_sonarqube_12.module.postgresql.aws_ebs_volume.this: Still creating... [10s elapsed]
module.oss_prod_sonarqube_12.aws_ebs_volume.this: Creation complete after 11s [id=vol-0b79209ea04d62be9]
module.oss_prod_sonarqube_12.module.postgresql.aws_ebs_volume.this: Creation complete after 11s [id=vol-04889688362e7f912]
module.oss_prod_sonarqube_12.kubernetes_persistent_volume.this: Creating...
module.oss_prod_sonarqube_12.module.postgresql.kubernetes_persistent_volume.this: Creating...
╷
│ Error: persistentvolumes "sonarqube-12-postgresql" already exists
│
│   with module.oss_prod_sonarqube_12.module.postgresql.kubernetes_persistent_volume.this,
│   on .terraform/modules/oss_prod_sonarqube_12.postgresql/local_volume.tf line 24, in resource "kubernetes_persistent_volume" "this":
│   24: resource "kubernetes_persistent_volume" "this" {
│
╵
╷
│ Error: secrets "sonarqube-12-postgresql-credentials" already exists
│
│   with module.oss_prod_sonarqube_12.module.postgresql.kubernetes_secret.this,
│   on .terraform/modules/oss_prod_sonarqube_12.postgresql/main.tf line 24, in resource "kubernetes_secret" "this":
│   24: resource "kubernetes_secret" "this" {
│
╵
╷
│ Error: creating Route 53 Record: InvalidChangeBatch: [Tried to create resource record set [name='sonarqube-12.oss.ossy.dev.', type='CNAME'] but it already exists]
│ 	status code: 400, request id: e4654ae8-c869-462f-9b74-f4f2bd742b95
│
│   with module.oss_prod_sonarqube_12.aws_route53_record.hostname[0],
│   on .terraform/modules/oss_prod_sonarqube_12/dns.tf line 5, in resource "aws_route53_record" "hostname":
│    5: resource "aws_route53_record" "hostname" {
│
╵
╷
│ Error: persistentvolumes "sonarqube-12" already exists
│
│   with module.oss_prod_sonarqube_12.kubernetes_persistent_volume.this,
│   on .terraform/modules/oss_prod_sonarqube_12/local_volume.tf line 30, in resource "kubernetes_persistent_volume" "this":
│   30: resource "kubernetes_persistent_volume" "this" {
│
╵
 

The problem is that, in case of a failure, the state file is not updated, so the resources created by this run are not under Terraform control.
If I decide to delete the workspace, any objects created during the failed run are ignored and left in the account.

So, in case of a failed run, the state file should be updated with references to any resources created during the run.

This behavior differs from Terraform Cloud, which creates multiple state files during a single run and keeps all the created objects in the state.

@leg100
Owner

leg100 commented Jul 23, 2023

This behavior differs from Terraform Cloud, which creates multiple state files during a single run and keeps all the created objects in the state.

It looks like this functionality - not specific to Terraform Cloud - was added only recently, in v1.5.0:

https://github.com/hashicorp/terraform/blob/v1.5/CHANGELOG.md#150-june-12-2023

Terraform CLI's local operations mode will now attempt to persist state snapshots to the state storage backend periodically during the apply step, thereby reducing the window for lost data if the Terraform process is aborted unexpectedly. (hashicorp/terraform#32680)

OTF currently runs terraform apply and, if it fails, it skips uploading the state file. I'll change the behaviour so that the state file is uploaded regardless of whether the apply failed or succeeded.
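As a minimal, hypothetical sketch of that current behaviour (the function names below are illustrative only, not OTF's actual code), a run whose steps execute sequentially and stop at the first error never reaches a state-upload step that comes after the apply:

package main

import (
	"errors"
	"fmt"
)

// runSteps is an illustrative sketch of sequential step execution: the
// first failing step aborts the run, so any later step is skipped.
func runSteps(steps []func() error) error {
	for _, step := range steps {
		if err := step(); err != nil {
			return err // remaining steps, e.g. the state upload, never run
		}
	}
	return nil
}

func main() {
	err := runSteps([]func() error{
		func() error { fmt.Println("terraform plan"); return nil },
		func() error { return errors.New("terraform apply failed mid-run") },
		func() error { fmt.Println("uploading state"); return nil }, // skipped
	})
	fmt.Println("run result:", err)
}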

Thank you for bringing this to my attention.

@Spritekin
Contributor Author

Spritekin commented Jul 26, 2023

Hi @leg100

I have little experience with Go, so please let me see if I understand this.

The worker finishes the phase and, regardless of the outcome, it invokes FinishPhase, but it will set "Errored" in the finishOptions.

	// Regardless of success, mark phase as finished
	_, err = w.FinishPhase(ctx, r.ID, r.Phase(), finishOptions)
	if err != nil {
		log.Error(err, "finishing phase")
		return
	}
}

In service.FinishPhase, it creates a report for the output (if not Errored), then updates the state in the DB. It still passes a reference to the error.

otf/internal/run/service.go

Lines 324 to 326 in 6b9e6b1

run, err := s.db.UpdateStatus(ctx, runID, func(run *Run) error {
	return run.Finish(phase, opts)
})

In the DB UpdateStatus method, it checks if the run was errored and, if it was, it won't update the status.

otf/internal/run/db.go

Lines 113 to 115 in 6b9e6b1

if err := fn(run); err != nil {
	return err
}

So if I remove those 3 lines in db.go, it should continue saving the state even in the case of an error, which is what this issue is about.

Is that right?

@leg100
Owner

leg100 commented Jul 26, 2023

@Spritekin That part of the code is concerned with updating the run status, not the Terraform state.

See the fix I've made here:

https://github.com/leg100/otf/pull/539/files

If you read ./internal/agent/steps.go from the top, you'll get an understanding of what's going on. Each run is composed of many steps, and each step is processed sequentially. If one step fails, the run stops and later steps are skipped.

The problem in this case is that the terraform apply step fails, so it would skip the next and last step, which uploads the state.

The fix is to combine the two steps into one. In Go, a deferred function call runs at the end of the enclosing function regardless of any error, so that is what is used to ensure any change to the state is uploaded regardless of what happens to the apply.
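Roughly, the pattern looks like this (a sketch only; the names below are illustrative and not the actual identifiers in ./internal/agent/steps.go):

package main

import (
	"errors"
	"fmt"
)

// applyAndUploadState is an illustrative sketch of the combined step: the
// deferred state upload runs on every return path, so partial state is
// persisted even when terraform apply fails.
func applyAndUploadState(apply, uploadState func() error) error {
	defer func() {
		if err := uploadState(); err != nil {
			fmt.Println("uploading state:", err)
		}
	}()
	// If apply fails, the deferred upload above still runs before the
	// error is returned to the caller.
	return apply()
}

func main() {
	err := applyAndUploadState(
		func() error { return errors.New("apply failed mid-run") },
		func() error { fmt.Println("state uploaded"); return nil },
	)
	fmt.Println("step result:", err)
}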

leg100 added a commit that referenced this issue Jul 26, 2023
leg100 pushed a commit that referenced this issue Jul 26, 2023
🤖 I have created a release *beep* *boop*
---


## [0.1.2](v0.1.1...v0.1.2) (2023-07-26)


### Bug Fixes

* agent race error ([#537](#537)) ([6b9e6b1](6b9e6b1))
* handle run-events request from terraform cloud backend ([#534](#534)) ([b1998bd](b1998bd))
* terraform apply partial state updates ([#539](#539)) ([d25e7e4](d25e7e4)), closes [#527](#527)


### Miscellaneous

* removed unused config file ([84fe3b1](84fe3b1))

---
This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
@leg100
Owner

leg100 commented Jul 26, 2023

@Spritekin It would be appreciated if you could give the new release a test run, and even better if you could replicate the error scenario above, to see whether it successfully persists partial state updates.

@Spritekin
Contributor Author

Spritekin commented Jul 26, 2023

Sure, I will do that now... I'm rebuilding my images with AWS CLI support.

https://hub.docker.com/r/spritekin/otfd

@Spritekin
Contributor Author

Spritekin commented Jul 26, 2023

Now I can't replicate the failure... don't worry, I will build a new module soon; new modules usually fail during tests. I will keep you posted.

@pat-s
Contributor

pat-s commented Jul 26, 2023

@leg100 Maybe it would be helpful to list custom images like this one, or mine, in the README? That would prevent other users from building duplicate ones unnecessarily.

@Spritekin
Contributor Author

@leg100 I'm still testing, but it's looking good. I say this because in a first run I got a series of objects being created, then my module failed... the apply phase looks like this:

Terraform has been successfully initialized!
module.oss_prod_penpot_16.module.postgresql.kubernetes_secret.this: Creation complete after 2s [id=n1/penpot-16-postgresql-credentials]
module.oss_prod_penpot_16.module.redis.kubernetes_secret.this: Creation complete after 2s [id=n1/penpot-16-redis-credentials]
module.oss_prod_penpot_16.module.postgresql.aws_ebs_volume.this: Creation complete after 10s [id=vol-01aa843277aca7c3b]
module.oss_prod_penpot_16.aws_ebs_volume.this: Creation complete after 10s [id=vol-08e4a387cf4a974d9]
module.oss_prod_penpot_16.module.redis.aws_ebs_volume.this: Creation complete after 10s [id=vol-0e7cd4a17438f3929]
module.oss_prod_penpot_16.module.postgresql.kubernetes_persistent_volume.this: Creation complete after 1s [id=penpot-16-postgresql]
module.oss_prod_penpot_16.module.redis.kubernetes_persistent_volume.this: Creation complete after 1s [id=penpot-16-redis]
module.oss_prod_penpot_16.module.postgresql.kubernetes_persistent_volume_claim.this: Creation complete after 0s [id=n1/penpot-16-postgresql]
module.oss_prod_penpot_16.kubernetes_persistent_volume_claim.this: Creation complete after 0s [id=n1/penpot-16]
module.oss_prod_penpot_16.module.redis.kubernetes_persistent_volume_claim.this: Creation complete after 12s [id=n1/penpot-16-redis]
module.oss_prod_penpot_16.module.postgresql.helm_release.postgresql: Creation complete after 23s [id=penpot-16-postgresql]
module.oss_prod_penpot_16.aws_route53_record.hostname[0]: Creation complete after 47s [id=Z05392673CT6X22PUH929_penpot-16.oss.ossy.dev_CNAME]
module.oss_prod_penpot_16.module.redis.helm_release.redis: Creation complete after 41s [id=penpot-16-redis]
╷
│ Error: template: penpot/templates/frontend/ingress.yaml:25:19: executing "penpot/templates/frontend/ingress.yaml" at <.Values.ingress.tls>: nil pointer evaluating interface {}.tls
│
│   with module.oss_prod_penpot_16.helm_release.application,
│   on .terraform/modules/oss_prod_penpot_16/main.tf line 272, in resource "helm_release" "application":
│  272: resource "helm_release" "application" {
│

But then if I try to rerun, the plan already recognises the created objects and tries to refresh them:

module.oss_prod_penpot_16.module.postgresql.kubernetes_secret.this: Refreshing state... [id=n1/penpot-16-postgresql-credentials]
module.oss_prod_penpot_16.module.redis.kubernetes_persistent_volume.this: Refreshing state... [id=penpot-16-redis]
module.oss_prod_penpot_16.kubernetes_persistent_volume_claim.this: Refreshing state... [id=n1/penpot-16]
module.oss_prod_penpot_16.module.postgresql.kubernetes_persistent_volume_claim.this: Refreshing state... [id=n1/penpot-16-postgresql]
module.oss_prod_penpot_16.module.redis.kubernetes_persistent_volume_claim.this: Refreshing state... [id=n1/penpot-16-redis]
module.oss_prod_penpot_16.module.postgresql.helm_release.postgresql: Refreshing state... [id=penpot-16-postgresql]
module.oss_prod_penpot_16.module.redis.helm_release.redis: Refreshing state... [id=penpot-16-redis]
 

This wouldn't happen before; previously it tried to recreate all the objects and failed because they already existed.

Then I tried deleting the stack and the objects were deleted correctly.

module.oss_prod_penpot_16.module.redis.helm_release.redis: Destroying... [id=penpot-16-redis]
module.oss_prod_penpot_16.module.postgresql.helm_release.postgresql: Destroying... [id=penpot-16-postgresql]
module.oss_prod_penpot_16.module.redis.kubernetes_secret.this: Destroying... [id=n1/penpot-16-redis-credentials]
module.oss_prod_penpot_16.module.postgresql.kubernetes_secret.this: Destroying... [id=n1/penpot-16-postgresql-credentials]

So I think this feature is good. Thank you so much!
