
Terraform state not updated on a failed run #527

Closed

Spritekin opened this issue Jul 23, 2023 · 8 comments · Fixed by #539 or #531

Comments

@Spritekin
Contributor

Spritekin commented Jul 23, 2023

I have a run with a series of resources, and the plan finishes OK.
During the apply, some of the resources are created, but the run fails mid-execution due to a problem that prevents a pod from being created.
The failure looks like this:

Terraform has been successfully initialized!
module.oss_prod_sonarqube_12.module.postgresql.kubernetes_secret.this: Creating...
module.oss_prod_sonarqube_12.aws_ebs_volume.this: Creating...
module.oss_prod_sonarqube_12.module.postgresql.aws_ebs_volume.this: Creating...
module.oss_prod_sonarqube_12.aws_route53_record.hostname[0]: Creating...
module.oss_prod_sonarqube_12.aws_ebs_volume.this: Still creating... [10s elapsed]
module.oss_prod_sonarqube_12.module.postgresql.aws_ebs_volume.this: Still creating... [10s elapsed]
module.oss_prod_sonarqube_12.aws_ebs_volume.this: Creation complete after 11s [id=vol-0b79209ea04d62be9]
module.oss_prod_sonarqube_12.module.postgresql.aws_ebs_volume.this: Creation complete after 11s [id=vol-04889688362e7f912]
module.oss_prod_sonarqube_12.kubernetes_persistent_volume.this: Creating...
module.oss_prod_sonarqube_12.module.postgresql.kubernetes_persistent_volume.this: Creating...
╷
│ Error: persistentvolumes "sonarqube-12-postgresql" already exists
│
│   with module.oss_prod_sonarqube_12.module.postgresql.kubernetes_persistent_volume.this,
│   on .terraform/modules/oss_prod_sonarqube_12.postgresql/local_volume.tf line 24, in resource "kubernetes_persistent_volume" "this":
│   24: resource "kubernetes_persistent_volume" "this" {
│
╵
╷
│ Error: secrets "sonarqube-12-postgresql-credentials" already exists
│
│   with module.oss_prod_sonarqube_12.module.postgresql.kubernetes_secret.this,
│   on .terraform/modules/oss_prod_sonarqube_12.postgresql/main.tf line 24, in resource "kubernetes_secret" "this":
│   24: resource "kubernetes_secret" "this" {
│
╵
╷
│ Error: creating Route 53 Record: InvalidChangeBatch: [Tried to create resource record set [name='sonarqube-12.oss.ossy.dev.', type='CNAME'] but it already exists]
│ 	status code: 400, request id: e4654ae8-c869-462f-9b74-f4f2bd742b95
│
│   with module.oss_prod_sonarqube_12.aws_route53_record.hostname[0],
│   on .terraform/modules/oss_prod_sonarqube_12/dns.tf line 5, in resource "aws_route53_record" "hostname":
│    5: resource "aws_route53_record" "hostname" {
│
╵
╷
│ Error: persistentvolumes "sonarqube-12" already exists
│
│   with module.oss_prod_sonarqube_12.kubernetes_persistent_volume.this,
│   on .terraform/modules/oss_prod_sonarqube_12/local_volume.tf line 30, in resource "kubernetes_persistent_volume" "this":
│   30: resource "kubernetes_persistent_volume" "this" {
│
╵
 

The problem is that, in case of a failure, the state file is not updated, so the resources created by this run are not under Terraform control.
If I decide to delete the workspace, any objects created during the failed run are ignored and left in the account.

So, in case of a failed run, the state file should be updated with references to any resources created during the run.

This behavior differs from Terraform Cloud, which creates multiple state files during a single run and keeps all the created objects in the state.

@leg100
Owner

leg100 commented Jul 23, 2023

This behavior differs from Terraform Cloud, which creates multiple state files during a single run and keeps all the created objects in the state.

It looks like this functionality - not specific to Terraform Cloud - was added only recently, in v1.5.0:

https://github.com/hashicorp/terraform/blob/v1.5/CHANGELOG.md#150-june-12-2023

Terraform CLI's local operations mode will now attempt to persist state snapshots to the state storage backend periodically during the apply step, thereby reducing the window for lost data if the Terraform process is aborted unexpectedly. (hashicorp/terraform#32680)

OTF currently runs terraform apply and, if it fails, it skips uploading the state file. I'll change the behaviour so that the state file is uploaded regardless of whether the apply failed or succeeded.
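As a minimal, hypothetical sketch of that current behaviour (the function names below are illustrative only, not OTF's actual code), a run whose steps execute sequentially and stop at the first error never reaches a state-upload step that comes after the apply:

package main

import (
	"errors"
	"fmt"
)

// runSteps is an illustrative sketch of sequential step execution: the
// first failing step aborts the run, so any later step is skipped.
func runSteps(steps []func() error) error {
	for _, step := range steps {
		if err := step(); err != nil {
			return err // remaining steps, e.g. the state upload, never run
		}
	}
	return nil
}

func main() {
	err := runSteps([]func() error{
		func() error { fmt.Println("terraform plan"); return nil },
		func() error { return errors.New("terraform apply failed mid-run") },
		func() error { fmt.Println("uploading state"); return nil }, // skipped
	})
	fmt.Println("run result:", err)
}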

Thank you for bringing this to my attention.

@Spritekin
Contributor Author

Spritekin commented Jul 26, 2023

Hi @leg100

I have little experience with Go, so please let me see if I understand this.

The worker finishes the phase and, regardless of the outcome, it invokes FinishPhase, but it will set "Errored" in the finishOptions.

	// Regardless of success, mark phase as finished
	_, err = w.FinishPhase(ctx, r.ID, r.Phase(), finishOptions)
	if err != nil {
		log.Error(err, "finishing phase")
		return
	}
}

In service.FinishPhase, it creates a report for the output (if not Errored), then updates the state in the DB. It still passes a reference to the error.

otf/internal/run/service.go

Lines 324 to 326 in 6b9e6b1

run, err := s.db.UpdateStatus(ctx, runID, func(run *Run) error {
	return run.Finish(phase, opts)
})

In the DB UpdateStatus method, it checks if the run was errored and, if it was, it won't update the status.

otf/internal/run/db.go

Lines 113 to 115 in 6b9e6b1

if err := fn(run); err != nil {
	return err
}

So if I remove those 3 lines in db.go, it should continue saving the state even in the case of an error, which is what this issue is about.

Is that right?

@leg100
Owner

leg100 commented Jul 26, 2023

@Spritekin That part of the code is concerned with updating the run status, not the Terraform state.

See the fix I've made here:

https://github.com/leg100/otf/pull/539/files

If you read ./internal/agent/steps.go from the top, you'll get an understanding of what's going on. Each run is composed of many steps, and each step is processed sequentially. If one step fails, the run stops and later steps are skipped.

The problem in this case is that the terraform apply step fails, so it would skip the next and last step, which uploads the state.

The fix is to combine the two steps into one. In Go, a deferred function call runs at the end of the enclosing function regardless of any error, so that is what is used to ensure any change to the state is uploaded regardless of what happens to the apply.
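Roughly, the pattern looks like this (a sketch only; the names below are illustrative and not the actual identifiers in ./internal/agent/steps.go):

package main

import (
	"errors"
	"fmt"
)

// applyAndUploadState is an illustrative sketch of the combined step: the
// deferred state upload runs on every return path, so partial state is
// persisted even when terraform apply fails.
func applyAndUploadState(apply, uploadState func() error) error {
	defer func() {
		if err := uploadState(); err != nil {
			fmt.Println("uploading state:", err)
		}
	}()
	// If apply fails, the deferred upload above still runs before the
	// error is returned to the caller.
	return apply()
}

func main() {
	err := applyAndUploadState(
		func() error { return errors.New("apply failed mid-run") },
		func() error { fmt.Println("state uploaded"); return nil },
	)
	fmt.Println("step result:", err)
}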

leg100 added a commit that referenced this issue Jul 26, 2023
leg100 pushed a commit that referenced this issue Jul 26, 2023
🤖 I have created a release *beep* *boop*
---


## [0.1.2](v0.1.1...v0.1.2) (2023-07-26)


### Bug Fixes

* agent race error ([#537](#537)) ([6b9e6b1](6b9e6b1))
* handle run-events request from terraform cloud backend ([#534](#534)) ([b1998bd](b1998bd))
* terraform apply partial state updates ([#539](#539)) ([d25e7e4](d25e7e4)), closes [#527](#527)


### Miscellaneous

* removed unused config file ([84fe3b1](84fe3b1))

---
This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
@leg100
Owner

leg100 commented Jul 26, 2023

@Spritekin It would be appreciated if you could give the new release a test run, and even better if you could replicate the error scenario above, to see whether it successfully persists partial state updates.

@Spritekin
Contributor Author

Spritekin commented Jul 26, 2023

Sure, I will do that now... I'm rebuilding my images with AWS CLI support.

https://hub.docker.com/r/spritekin/otfd

@Spritekin
Contributor Author

Spritekin commented Jul 26, 2023

Now I can't replicate the failure... don't worry, I will build a new module soon; new modules usually fail during tests. I will keep you posted.

@pat-s
Contributor

pat-s commented Jul 26, 2023

@leg100 Maybe it would be helpful to list custom images like this one, or mine, in the README? That would prevent other users from building duplicate ones unnecessarily.

@Spritekin
Contributor Author

@leg100 I'm still testing, but it's looking good. I say this because in a first run I got a series of objects being created, then my module failed... the apply phase looks like this:

Terraform has been successfully initialized!
module.oss_prod_penpot_16.module.postgresql.kubernetes_secret.this: Creation complete after 2s [id=n1/penpot-16-postgresql-credentials]
module.oss_prod_penpot_16.module.redis.kubernetes_secret.this: Creation complete after 2s [id=n1/penpot-16-redis-credentials]
module.oss_prod_penpot_16.module.postgresql.aws_ebs_volume.this: Creation complete after 10s [id=vol-01aa843277aca7c3b]
module.oss_prod_penpot_16.aws_ebs_volume.this: Creation complete after 10s [id=vol-08e4a387cf4a974d9]
module.oss_prod_penpot_16.module.redis.aws_ebs_volume.this: Creation complete after 10s [id=vol-0e7cd4a17438f3929]
module.oss_prod_penpot_16.module.postgresql.kubernetes_persistent_volume.this: Creation complete after 1s [id=penpot-16-postgresql]
module.oss_prod_penpot_16.module.redis.kubernetes_persistent_volume.this: Creation complete after 1s [id=penpot-16-redis]
module.oss_prod_penpot_16.module.postgresql.kubernetes_persistent_volume_claim.this: Creation complete after 0s [id=n1/penpot-16-postgresql]
module.oss_prod_penpot_16.kubernetes_persistent_volume_claim.this: Creation complete after 0s [id=n1/penpot-16]
module.oss_prod_penpot_16.module.redis.kubernetes_persistent_volume_claim.this: Creation complete after 12s [id=n1/penpot-16-redis]
module.oss_prod_penpot_16.module.postgresql.helm_release.postgresql: Creation complete after 23s [id=penpot-16-postgresql]
module.oss_prod_penpot_16.aws_route53_record.hostname[0]: Creation complete after 47s [id=Z05392673CT6X22PUH929_penpot-16.oss.ossy.dev_CNAME]
module.oss_prod_penpot_16.module.redis.helm_release.redis: Creation complete after 41s [id=penpot-16-redis]
╷
│ Error: template: penpot/templates/frontend/ingress.yaml:25:19: executing "penpot/templates/frontend/ingress.yaml" at <.Values.ingress.tls>: nil pointer evaluating interface {}.tls
│
│   with module.oss_prod_penpot_16.helm_release.application,
│   on .terraform/modules/oss_prod_penpot_16/main.tf line 272, in resource "helm_release" "application":
│  272: resource "helm_release" "application" {
│

But then if I try to rerun, the plan already recognises the created objects and tries to refresh them:

module.oss_prod_penpot_16.module.postgresql.kubernetes_secret.this: Refreshing state... [id=n1/penpot-16-postgresql-credentials]
module.oss_prod_penpot_16.module.redis.kubernetes_persistent_volume.this: Refreshing state... [id=penpot-16-redis]
module.oss_prod_penpot_16.kubernetes_persistent_volume_claim.this: Refreshing state... [id=n1/penpot-16]
module.oss_prod_penpot_16.module.postgresql.kubernetes_persistent_volume_claim.this: Refreshing state... [id=n1/penpot-16-postgresql]
module.oss_prod_penpot_16.module.redis.kubernetes_persistent_volume_claim.this: Refreshing state... [id=n1/penpot-16-redis]
module.oss_prod_penpot_16.module.postgresql.helm_release.postgresql: Refreshing state... [id=penpot-16-postgresql]
module.oss_prod_penpot_16.module.redis.helm_release.redis: Refreshing state... [id=penpot-16-redis]
 

This wouldn't happen before; previously it tried to recreate all the objects and failed because they already existed.

Then I tried deleting the stack and the objects were deleted correctly.

module.oss_prod_penpot_16.module.redis.helm_release.redis: Destroying... [id=penpot-16-redis]
module.oss_prod_penpot_16.module.postgresql.helm_release.postgresql: Destroying... [id=penpot-16-postgresql]
module.oss_prod_penpot_16.module.redis.kubernetes_secret.this: Destroying... [id=n1/penpot-16-redis-credentials]
module.oss_prod_penpot_16.module.postgresql.kubernetes_secret.this: Destroying... [id=n1/penpot-16-postgresql-credentials]

So I think this feature is good. Thank you so much!
