Bug - Error: Failed to persist state to backend on modernisation-platform #5859

Closed
SteveLinden opened this issue Dec 21, 2023 · 7 comments · Fixed by #5911 or #6022
bug Something isn't working

Comments

@SteveLinden
Contributor

SteveLinden commented Dec 21, 2023

Expected Behavior

The state should save without issues.

Actual Behavior

We get the error in the title; see https://github.com/ministryofjustice/modernisation-platform/actions/runs/7285740796/job/19855252914#step:7:113 for an example.

Steps to Reproduce the Problem

Run a full release that amends everything, e.g. adding a role access change. The number of failures per run is not consistent, but they have been happening more often recently.

Version

Example is the run for PR #5840

Modules

modernisation-platform

Account

No response

@SteveLinden added the bug label on Dec 21, 2023
@ewastempel
Contributor

The error to be fixed:

Error: Failed to save state

Error saving state: failed to upload state: operation error S3: PutObject,
failed to rewind transport stream for retry, request stream is not seekable

Error: Failed to persist state to backend

The error shown above has prevented Terraform from writing the updated state
to the configured backend. To allow for recovery, the state has been written
to the file "errored.tfstate" in the current working directory.

Running "terraform apply" again at this point will create a forked state,
making it harder to recover.

To retry writing this state, use the following command:
    terraform state push errored.tfstate

@ewastempel
Contributor

The fix has been temporarily deployed (see the Slack thread) to the scheduled baseline pipeline only, and it has already worked on the happy path and with a failure on apply. It still needs evidence of an errored-state failure on apply followed by a successful state push. This will take some time, but if no pipeline fails due to an errored state for a week or two, that is probably a good enough test.

Leaving this issue open until there is more evidence, and to then roll the fix out to all other pipelines.

Putting it into the blocked column (or feel free to put it back into the backlog, if easier).
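
For illustration only, the recovery step could be sketched roughly as below. This is not the actual workflow code: it assumes Terraform runs in a known working directory and leaves `errored.tfstate` there on a persistence failure (as the error message states), and the names `apply_with_state_push` and `run_command` are hypothetical.

```python
import os
import subprocess

ERRORED_STATE = "errored.tfstate"  # file name taken from Terraform's error message


def run_command(cmd, cwd):
    """Run a command in cwd and return its exit code."""
    return subprocess.run(cmd, cwd=cwd).returncode


def apply_with_state_push(workdir=".", run=run_command):
    """Apply, and push the errored state if Terraform left one behind.

    `run` is injectable (an assumption of this sketch) so the logic can be
    exercised without a real Terraform binary.
    """
    apply_rc = run(["terraform", "apply", "-auto-approve"], workdir)
    if apply_rc == 0:
        return "applied"  # happy path

    if not os.path.exists(os.path.join(workdir, ERRORED_STATE)):
        return "apply_failed"  # ordinary apply failure, nothing to recover

    # Terraform could not persist state: retry the upload rather than
    # re-running apply, which would fork the state.
    push_rc = run(["terraform", "state", "push", ERRORED_STATE], workdir)
    return "state_pushed" if push_rc == 0 else "push_failed"
```

The four return values correspond to the cases listed above: the happy path, a plain failure on apply, an errored state followed by a successful push, and a push that also fails. Only the last one would need manual recovery.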

@dms1981
Contributor

dms1981 commented Jan 8, 2024

https://mojdt.slack.com/archives/C015UBQ78MR/p1702984257007459 is a short Slack conversation with our AWS TAM, where we were given some guidance and linked to the S3 performance design considerations whitepaper:

https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html

@ewastempel
Contributor

Thanks David, but according to the doc, we are not hitting the limit in a single workflow run. FYI, the implemented solution does not address the limits; it pushes the errored state instead.
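
To make the comparison with the limit concrete, here is a small sketch of how write rates could be checked from CloudTrail data events. The 3,500 figure is the per-prefix PUT/COPY/POST/DELETE requests-per-second guideline from the linked S3 performance page. The function name, treating the first two path segments of the key as the "prefix", and the sample events are all assumptions made for illustration; they are not taken from our pipelines.

```python
from collections import Counter

# Per-prefix write guideline from the S3 performance documentation.
S3_WRITES_PER_SECOND_PER_PREFIX = 3500


def peak_writes_per_second(events, prefix_depth=2):
    """Return {prefix: highest PutObject count seen in any single second}.

    CloudTrail `eventTime` has one-second resolution, so grouping on the
    raw string is enough to bucket requests per second.
    """
    per_second = Counter()
    for event in events:
        if event.get("eventName") != "PutObject":
            continue
        key = event["requestParameters"]["key"]
        prefix = "/".join(key.split("/")[:prefix_depth])
        per_second[(prefix, event["eventTime"])] += 1

    peaks = {}
    for (prefix, _second), count in per_second.items():
        peaks[prefix] = max(peaks.get(prefix, 0), count)
    return peaks


# Hypothetical, heavily trimmed events (real records carry many more fields).
sample_events = [
    {"eventName": "PutObject", "eventTime": "2024-01-16T14:24:07Z",
     "requestParameters": {"key": "environments/bootstrap/example-a/terraform.tfstate"}},
    {"eventName": "PutObject", "eventTime": "2024-01-16T14:24:07Z",
     "requestParameters": {"key": "environments/bootstrap/example-b/terraform.tfstate"}},
]

for prefix, peak in peak_writes_per_second(sample_events).items():
    status = "over" if peak > S3_WRITES_PER_SECOND_PER_PREFIX else "under"
    print(f"{prefix}: peak {peak} writes/s ({status} the guideline)")
```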

@ewastempel
Contributor

It appears that the error may be a terraform bug:
hashicorp/terraform#34528

We see the same issue in v1.6.6. Terraform trace for more insights:

2024-01-16T14:27:51.0906051Z 2024-01-16T14:27:50.918Z [DEBUG] states/remote: state read lineage is: e29388d3-e6cf-9115-e169-5f1d9f976c58; lineage is: e29388d3-e6cf-9115-e169-5f1d9f976c58
2024-01-16T14:27:51.0907551Z 2024-01-16T14:27:50.934Z [INFO]  backend-s3: Uploading remote state: tf_backend.operation=Put tf_backend.req_id=23d0d7ed-07da-d344-e6e7-c8ba9a9e84cd tf_backend.s3.bucket=*** tf_backend.s3.path=***
2024-01-16T14:27:51.0918815Z 2024-01-16T14:27:50.941Z [DEBUG] backend-s3: HTTP Request Sent: aws.region=eu-west-2 aws.s3.bucket=*** aws.s3.key=*** rpc.method=PutObject rpc.service=S3 rpc.system=aws-api tf_aws.sdk=aws-sdk-go-v2 tf_aws.signing_region="" tf_backend.operation=Put tf_backend.req_id=23d0d7ed-07da-d344-e6e7-c8ba9a9e84cd tf_backend.s3.bucket=*** tf_backend.s3.path=*** http.request.header.x_amz_decoded_content_length=129597 http.request.header.authorization="AWS4-HMAC-SHA256 Credential=ASIA************KEUA/20240116/eu-west-2/s3/aws4_request, SignedHeaders=accept-encoding;amz-sdk-invocation-id;content-encoding;content-length;content-type;host;x-amz-acl;x-amz-content-sha256;x-amz-date;x-amz-decoded-content-length;x-amz-sdk-checksum-algorithm;x-amz-security-token;x-amz-server-side-encryption;x-amz-trailer, Signature=*****" http.request.header.x_amz_date=20240116T142750Z http.request.header.x_amz_content_sha256=STREAMING-UNSIGNED-PAYLOAD-TRAILER http.request.header.x_amz_trailer=x-amz-checksum-sha256 net.peer.name=***.s3.eu-west-2.amazonaws.com http.request_content_length=129679 http.request.body="[Redacted: 126.6 KB (129,679 bytes), Type: application/json]" http.url=https://***.s3.eu-west-2.amazonaws.com/***?x-id=PutObject http.request.header.x_amz_security_token="*****" http.request.header.x_amz_acl=bucket-owner-full-control http.request.header.amz_sdk_request="attempt=1; max=5" http.request.header.accept_encoding=identity http.request.header.x_amz_server_side_encryption=AES256 http.request.header.content_encoding=aws-chunked http.request.header.x_amz_sdk_checksum_algorithm=SHA256 http.method=PUT http.user_agent="APN/1.0 HashiCorp/1.0 Terraform/1.6.6 (+https://www.terraform.io) aws-sdk-go-v2/1.24.0 os/linux lang/go#1.21.5 md/GOOS#linux md/GOARCH#amd64 api/s3#1.47.5 ft/s3-transfer" http.request.header.amz_sdk_invocation_id=1fae9720-5924-45a9-832c-2bce77cca664 http.request.header.content_type=application/json
2024-01-16T14:27:51.5915142Z 2024-01-16T14:27:51.591Z [TRACE] statemgr.Filesystem: reading initial snapshot from errored.tfstate
2024-01-16T14:27:51.5916619Z 2024-01-16T14:27:51.591Z [TRACE] statemgr.Filesystem: snapshot file has nil snapshot, but that's okay
2024-01-16T14:27:51.5917812Z 2024-01-16T14:27:51.591Z [TRACE] statemgr.Filesystem: read nil snapshot
2024-01-16T14:27:51.5919595Z 2024-01-16T14:27:51.591Z [TRACE] statemgr.Filesystem: Importing snapshot with lineage "e29388d3-e6cf-9115-e169-5f1d9f976c58" serial 251 as the initial state snapshot at errored.tfstate
2024-01-16T14:27:51.5921167Z 2024-01-16T14:27:51.591Z [TRACE] statemgr.Filesystem: preparing to manage state snapshots at errored.tfstate
2024-01-16T14:27:51.5922242Z 2024-01-16T14:27:51.591Z [TRACE] statemgr.Filesystem: no previously-stored snapshot exists
2024-01-16T14:27:51.5923195Z 2024-01-16T14:27:51.591Z [TRACE] statemgr.Filesystem: state file backups are disabled
2024-01-16T14:27:51.5924429Z 2024-01-16T14:27:51.591Z [TRACE] statemgr.Filesystem: forcing lineage "e29388d3-e6cf-9115-e169-5f1d9f976c58" serial 251 for migration/import
2024-01-16T14:27:51.5925663Z 2024-01-16T14:27:51.591Z [TRACE] statemgr.Filesystem: writing snapshot at errored.tfstate
2024-01-16T14:27:51.5940360Z 
2024-01-16T14:27:51.5940616Z Error: Failed to save state
2024-01-16T14:27:51.5940990Z 
2024-01-16T14:27:51.5941454Z Error saving state: failed to upload state: operation error S3: PutObject,
2024-01-16T14:27:51.5942382Z failed to rewind transport stream for retry, request stream is not seekable
2024-01-16T14:27:51.5942793Z 
2024-01-16T14:27:51.5942939Z Error: Failed to persist state to backend
2024-01-16T14:27:51.5943194Z 
2024-01-16T14:27:51.5943479Z The error shown above has prevented Terraform from writing the updated state
2024-01-16T14:27:51.5944159Z to the configured backend. To allow for recovery, the state has been written
2024-01-16T14:27:51.5944977Z to the file "errored.tfstate" in the current working directory.
2024-01-16T14:27:51.5945328Z 
2024-01-16T14:27:51.5945594Z Running "terraform apply" again at this point will create a forked state,
2024-01-16T14:27:51.5946094Z making it harder to recover.
2024-01-16T14:27:51.5946293Z 
2024-01-16T14:27:51.5946481Z To retry writing this state, use the following command:
2024-01-16T14:27:51.5946923Z     terraform state push errored.tfstate
2024-01-16T14:27:51.5947279Z 
2024-01-16T14:27:51.6009041Z data.aws_iam_policy_document.assume_role_policy: Read complete after 0s [id=<REDACTED>]
2024-01-16T14:27:51.6010040Z data.aws_iam_roles.github_actions_role: Read complete after 0s [id=<REDACTED>]
2024-01-16T14:27:51.6010940Z module.member-access[0].data.aws_iam_policy_document.assume-role-policy: Reading...
2024-01-16T14:27:51.6011803Z module.member-access-us-east[0].data.aws_iam_policy_document.assume-role-policy: Reading...
2024-01-16T14:27:51.6012988Z module.member-access-eu-central[0].data.aws_iam_policy_document.assume-role-policy: Reading...
2024-01-16T14:27:51.6013994Z module.member-access[0].data.aws_iam_policy_document.assume-role-policy: Read complete after 0s [id=<REDACTED>]
2024-01-16T14:27:51.6015084Z module.member-access-us-east[0].data.aws_iam_policy_document.assume-role-policy: Read complete after 0s [id=<REDACTED>]
2024-01-16T14:27:51.6016230Z module.member-access-eu-central[0].data.aws_iam_policy_document.assume-role-policy: Read complete after 0s [id=<REDACTED>]
2024-01-16T14:27:51.6017237Z module.member-access[0].data.aws_iam_policy_document.combined-assume-role-policy: Reading...
2024-01-16T14:27:51.6018331Z module.member-access-us-east[0].data.aws_iam_policy_document.combined-assume-role-policy: Reading...
2024-01-16T14:27:51.6019555Z module.member-access-us-east[0].data.aws_iam_policy_document.combined-assume-role-policy: Read complete after 0s [id=<REDACTED>]
2024-01-16T14:27:51.6021455Z module.member-access-eu-central[0].data.aws_iam_policy_document.combined-assume-role-policy: Reading...
2024-01-16T14:27:51.6022749Z module.member-access[0].data.aws_iam_policy_document.combined-assume-role-policy: Read complete after 0s [id=<REDACTED>]
2024-01-16T14:27:51.6023985Z module.member-access-eu-central[0].data.aws_iam_policy_document.combined-assume-role-policy: Read complete after 0s [id=<REDACTED>]
2024-01-16T14:27:51.6025019Z module.member-access[0].aws_iam_role.default: Refreshing state... [id=<REDACTED>]
2024-01-16T14:27:51.6025868Z module.member-access-eu-central[0].aws_iam_role.default: Refreshing state... [id=<REDACTED>]
2024-01-16T14:27:51.6026729Z module.member-access-us-east[0].aws_iam_role.default: Refreshing state... [id=<REDACTED>]
2024-01-16T14:27:51.6027463Z data.aws_iam_session_context.whoami: Read complete after 1s [id=<REDACTED>]
2024-01-16T14:27:51.6028171Z data.aws_organizations_organization.root_account: Read complete after 1s [id=<REDACTED>]
2024-01-16T14:27:51.6029100Z module.ssm-cross-account-access.aws_iam_role_policy_attachment.default: Refreshing state... [id=<REDACTED>]
2024-01-16T14:27:51.6030158Z module.instance-scheduler-access[0].aws_iam_role_policy_attachment.default: Refreshing state... [id=<REDACTED>]
2024-01-16T14:27:51.6031179Z module.member-access-us-east[0].aws_iam_role_policy_attachment.default: Refreshing state... [id=<REDACTED>]
2024-01-16T14:27:51.6032185Z module.member-access-eu-central[0].aws_iam_role_policy_attachment.default: Refreshing state... [id=<REDACTED>]
2024-01-16T14:27:51.6033154Z module.member-access[0].aws_iam_role_policy_attachment.default: Refreshing state... [id=<REDACTED>]

Note: the S3 bucket and Terraform backend values were further redacted with ***.

Additionally, CloudTrail does not show any errors for the above HTTP request:

2024-01-16T14:26:09.295+00:00 {"eventVersion":"1.09","userIdentity":{"type":"AssumedRole","principalId":"AROA5YRRXHENV4XBVCHPR:s3-replication","arn":"arn:aws:sts::946070829339:assumed-role/AWSS3BucketReplication-terraform-state/s3-replication","accountId":"946070829339","accessKeyId":"ASIA5YRRXHENUFNQVHN6","sessionContext":{"sessionIssuer":{"type":"Role","principalId":"AROA5YRRXHENV4XBVCHPR","arn":"arn:aws:iam::946070829339:role/AWSS3BucketReplication-terraform-state","accountId":"946070829339","userName":"AWSS3BucketReplication-terraform-state"},"attributes":{"creationDate":"2024-01-16T14:20:19Z","mfaAuthenticated":"false"}},"invokedBy":"s3.amazonaws.com"},"eventTime":"2024-01-16T14:24:29Z","eventSource":"s3.amazonaws.com","eventName":"PutObject","awsRegion":"eu-west-1","sourceIPAddress":"s3.amazonaws.com","userAgent":"s3.amazonaws.com","requestParameters":{"bucketName":"modernisation-platform-terraform-state-replication","accessControlList":{"x-amz-grant-full-control":"id=\"22f81ca85d0d968a6c79bd16b75cca751d35a70b9a567245a376350149977d4b\", id=\"d26d93f7a0f00df8f4a8d63e50c3a9fa259e7ad1d02069bf40d84dd999c8c41f\""},"Host":"s3.eu-west-1.amazonaws.com","x-amz-server-side-encryption":"AES256","x-amz-version-id":"Oj4l.k7Sjr25T7.fcaheWHxPEjEI7R3Q","key":"environments/bootstrap/delegate-access/nomis-data-hub-development/terraform.tfstate","x-amz-storage-class":"STANDARD"},"responseElements":{"x-amz-server-side-encryption":"AES256","x-amz-expiration":"expiry-date=\"Fri, 16 Jan 2026 00:00:00 GMT\", rule-id=\"main\"","x-amz-version-id":"Oj4l.k7Sjr25T7.fcaheWHxPEjEI7R3Q"},"additionalEventData":{"SignatureVersion":"SigV4","aclRequired":"Yes","CipherSuite":"ECDHE-RSA-AES128-GCM-SHA256","bytesTransferredIn":484292,"SSEApplied":"SSE_S3","AuthenticationMethod":"AuthHeader","x-amz-id-2":"mbUoyGDeBaax91fI2c5fehUOd40GS/S9kjLtR/TQL45BbvbU9cDpGhQYGKVqmEZ33L0E9fbGEQ0=","bytesTransferredOut":0},"requestID":"P8ZXMGC84BM13J4T","eventID":"b1a78289-1ca6-42b0-8d35-974357919dc0","readOnly":false,"resources":[{"type":"AWS::S3::Object","ARN":"arn:aws:s3:::modernisation-platform-terraform-state-replication/environments/bootstrap/delegate-access/nomis-data-hub-development/terraform.tfstate"},{"accountId":"946070829339","type":"AWS::S3::Bucket","ARN":"arn:aws:s3:::modernisation-platform-terraform-state-replication"}],"eventType":"AwsApiCall","managementEvent":false,"recipientAccountId":"946070829339","eventCategory":"Data"}

2024-01-16T14:26:55.698+00:00 {"eventVersion":"1.09","userIdentity":{"type":"AWSAccount","principalId":"AROAUJX7QETDMJM3NOGOH:githubactionsrolesession","accountId":"295814833350"},"eventTime":"2024-01-16T14:24:07Z","eventSource":"s3.amazonaws.com","eventName":"PutObject","awsRegion":"eu-west-2","sourceIPAddress":"20.75.95.33","userAgent":"[APN/1.0 HashiCorp/1.0 Terraform/1.6.6 (+https://www.terraform.io) aws-sdk-go-v2/1.24.0 os/linux lang/go#1.21.5 md/GOOS#linux md/GOARCH#amd64 api/s3#1.47.5 ft/s3-transfer]","requestParameters":{"bucketName":"modernisation-platform-terraform-state","Host":"modernisation-platform-terraform-state.s3.eu-west-2.amazonaws.com","x-amz-acl":"bucket-owner-full-control","x-amz-server-side-encryption":"AES256","key":"environments/bootstrap/delegate-access/nomis-data-hub-development/terraform.tfstate","x-id":"PutObject"},"responseElements":{"x-amz-server-side-encryption":"AES256","x-amz-expiration":"expiry-date=\"Fri, 16 Jan 2026 00:00:00 GMT\", rule-id=\"main\"","x-amz-version-id":"Oj4l.k7Sjr25T7.fcaheWHxPEjEI7R3Q"},"additionalEventData":{"SignatureVersion":"SigV4","CipherSuite":"ECDHE-RSA-AES128-GCM-SHA256","bytesTransferredIn":484374,"SSEApplied":"SSE_S3","AuthenticationMethod":"AuthHeader","x-amz-id-2":"9DbayMa6lMrO5IKUZQVuCvTzxQ5co5+8IEXD80mEs7gvw8k1YBZTeyOBTXG2YlGAPRQcM682K2EBSo9uylAyyQ==","bytesTransferredOut":0},"requestID":"R9BZEZ499V7M2F1V","eventID":"affb50fc-5365-482e-8551-611f4b8cef94","readOnly":false,"resources":[{"type":"AWS::S3::Object","ARN":"arn:aws:s3:::modernisation-platform-terraform-state/environments/bootstrap/delegate-access/nomis-data-hub-development/terraform.tfstate"},{"accountId":"946070829339","type":"AWS::S3::Bucket","ARN":"arn:aws:s3:::modernisation-platform-terraform-state"}],"eventType":"AwsApiCall","managementEvent":false,"recipientAccountId":"946070829339","sharedEventID":"06d62d60-ce1d-4507-a9c3-e53f90e69753","eventCategory":"Data","tlsDetails":{"tlsVersion":"TLSv1.2","cipherSuite":"ECDHE-RSA-AES128-GCM-SHA256","clientProvidedHostHeader":"modernisation-platform-terraform-state.s3.eu-west-2.amazonaws.com"}}

This is a good indication that the problem lies with Terraform: CloudTrail shows no error and the state was actually saved in this instance, yet Terraform still fails.
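
As a sketch of how that check can be done on a record, the snippet below reads one CloudTrail event and reports whether it failed. It relies on CloudTrail only including an `errorCode` field when a request fails. The function name `summarize_put` is hypothetical, and the sample is a heavily trimmed copy of the second record above.

```python
import json


def summarize_put(record_json):
    """Summarise a CloudTrail S3 data event given as a JSON string."""
    record = json.loads(record_json)
    # responseElements can be null on failed requests, hence the `or {}`.
    response = record.get("responseElements") or {}
    return {
        "event": record["eventName"],
        "bucket": record["requestParameters"]["bucketName"],
        "failed": "errorCode" in record,
        "version_id": response.get("x-amz-version-id"),
    }


# Trimmed copy of the Terraform PutObject record shown above.
SAMPLE_RECORD = json.dumps({
    "eventName": "PutObject",
    "requestParameters": {"bucketName": "modernisation-platform-terraform-state"},
    "responseElements": {"x-amz-version-id": "Oj4l.k7Sjr25T7.fcaheWHxPEjEI7R3Q"},
})

print(summarize_put(SAMPLE_RECORD))
```

A record with `failed` set to false and a populated `version_id` means S3 accepted the write and created a new object version, which is what the records above show.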

@ewastempel
Contributor

The state push fix for the state persistence failure is now rolled out to the scheduled baseline workflow, with Slack alerts temporarily suppressed when the state push is successful.
There will be separate issues to track the fix rollout to other workflows.
Also, once hashicorp/terraform#34528 is fixed, the alerting should be re-enabled.

@ewastempel
Contributor

To roll out the fix to other repos/workflows see this issue: #6038
