ETCD-636: add automated backup sidecar #1287

Closed
wants to merge 22 commits into from

Elbehery
Contributor

This PR adds an etcd backup sidecar container to the etcd pod manifest.

Whenever the snapshot state changes, the container copies it from the etcd data dir into the backup dir.

fixes https://issues.redhat.com/browse/ETCD-636

cc @openshift/openshift-team-etcd

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 30, 2024
@openshift-ci-robot

openshift-ci-robot commented Jun 30, 2024

@Elbehery: This pull request references ETCD-636 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.17.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested review from dusk125 and tjungblu June 30, 2024 14:19
Contributor

openshift-ci bot commented Jun 30, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Elbehery

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 30, 2024
@Elbehery
Contributor Author

/retest

@Elbehery Elbehery force-pushed the add_backup_sidecar branch 2 times, most recently from 822750b to 803e77a Compare July 1, 2024 00:07
- |
#!/bin/sh
set -euo pipefail
cp --verbose --recursive --preserve --reflink=auto /var/lib/etcd/ /var/backup/etcd
Contributor

I suggest you make this a command, so we have some Go code that we can also test properly, e.g. in here: https://github.com/openshift/cluster-etcd-operator/tree/master/pkg/cmd

Didn't we want to take the snapshot with etcdctl for starters? There's also no retention or schedule.

@Elbehery
Contributor Author

Elbehery commented Jul 9, 2024

/label tide/merge-method-squash

@openshift-ci openshift-ci bot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Jul 9, 2024
@Elbehery Elbehery force-pushed the add_backup_sidecar branch 6 times, most recently from 08917b6 to 898db4b Compare July 9, 2024 19:48
@Elbehery Elbehery changed the title ETCD-636: add automated backup side car ETCD-636: add automated backup sidecar Jul 9, 2024
@Elbehery
Contributor Author

Elbehery commented Jul 9, 2024

/test e2e-operator

Contributor

openshift-ci bot commented Jul 9, 2024

@Elbehery: The following commands are available to trigger required jobs:

  • /test e2e-agnostic-ovn
  • /test e2e-agnostic-ovn-upgrade
  • /test e2e-aws-ovn-etcd-scaling
  • /test e2e-aws-ovn-serial
  • /test e2e-aws-ovn-single-node
  • /test e2e-metal-assisted
  • /test e2e-metal-ipi-ovn-ipv6
  • /test e2e-operator
  • /test images
  • /test unit
  • /test verify
  • /test verify-deps

The following commands are available to trigger optional jobs:

  • /test configmap-scale
  • /test e2e-aws
  • /test e2e-aws-disruptive
  • /test e2e-aws-disruptive-ovn
  • /test e2e-aws-etcd-certrotation
  • /test e2e-aws-etcd-recovery
  • /test e2e-azure
  • /test e2e-azure-ovn-etcd-scaling
  • /test e2e-gcp
  • /test e2e-gcp-disruptive
  • /test e2e-gcp-disruptive-ovn
  • /test e2e-gcp-ovn-etcd-scaling
  • /test e2e-metal-ovn-ha-cert-rotation-shutdown
  • /test e2e-metal-ovn-sno-cert-rotation-shutdown
  • /test e2e-metal-single-node-live-iso
  • /test e2e-operator-fips
  • /test e2e-vsphere-ovn-etcd-scaling
  • /test okd-scos-images

Use /test all to run the following jobs that were automatically triggered:

  • pull-ci-openshift-cluster-etcd-operator-master-e2e-agnostic-ovn
  • pull-ci-openshift-cluster-etcd-operator-master-e2e-agnostic-ovn-upgrade
  • pull-ci-openshift-cluster-etcd-operator-master-e2e-aws-etcd-certrotation
  • pull-ci-openshift-cluster-etcd-operator-master-e2e-aws-etcd-recovery
  • pull-ci-openshift-cluster-etcd-operator-master-e2e-aws-ovn-etcd-scaling
  • pull-ci-openshift-cluster-etcd-operator-master-e2e-aws-ovn-serial
  • pull-ci-openshift-cluster-etcd-operator-master-e2e-aws-ovn-single-node
  • pull-ci-openshift-cluster-etcd-operator-master-e2e-metal-ovn-ha-cert-rotation-shutdown
  • pull-ci-openshift-cluster-etcd-operator-master-e2e-metal-ovn-sno-cert-rotation-shutdown
  • pull-ci-openshift-cluster-etcd-operator-master-e2e-operator
  • pull-ci-openshift-cluster-etcd-operator-master-e2e-operator-fips
  • pull-ci-openshift-cluster-etcd-operator-master-images
  • pull-ci-openshift-cluster-etcd-operator-master-unit
  • pull-ci-openshift-cluster-etcd-operator-master-verify
  • pull-ci-openshift-cluster-etcd-operator-master-verify-deps

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@Elbehery
Contributor Author

Elbehery commented Jul 9, 2024

failures are due to authentication to the image registry

{  release "release-latest" failed: could not watch pod: the pod ci-op-0zcbppy9/release-latest failed after 1m13s (failed containers: release): ContainerFailed one or more containers exited

Container release exited with code 1, reason Error
---
29a0873cc59feb1f2f00fc818420330b4210e10021601e3f3c63ba87cc790e5e oc-mirror
info: Loading sha256:fcad3a8a4a17cbc13e8eaaa8503b2ec3c5d4c006ddc39cc798daaba62d8691d9 operator-lifecycle-manager
info: Loading sha256:2dd0da125e0e23b5530863690b8a629d524d1a393b63b6c3ef27cb46f7a0f961 openstack-cluster-api-controllers
info: Loading sha256:fd1e3fb553482d31c4831b845aaf2ef68ea4976cb595b40f14e3dc014e74b2d1 operator-marketplace
info: Loading sha256:88e955c3aaf50a4c75f61ed34542354389386463c61a1931dabac61bacb6d056 operator-framework-tools
info: Loading sha256:dd5d0691c5135ebc8173c5b3486c4fd7b187b65b0375e557bfa35115c7660e74 service-ca-operator
info: Loading sha256:85d166c140588335aa7712be55565931d2f4eccbb62e0031e43f8f4976e3633b tests
info: Loading sha256:5a3b73c0132212ad41a9e40d3c09df29e67f56b1757bdc018d1fe60df7493c1a vsphere-cluster-api-controllers
info: Loading sha256:0f036c80d66513d8125850a11f4d3117888af17b6f2037eed07148a83875a91f vsphere-problem-detector
info: Included 190 images from 72 input operators into the release
error: failed to push image registry.build03.ci.openshift.org/ci-op-0zcbppy9/release:latest: uploading the source layer sha256:7a4643f5f2a50088993f8d8f43a8f86bc0c497a96e1323a5a5eaf051bfa8dcc8 failed: Patch "https://registry.build03.ci.openshift.org/v2/ci-op-0zcbppy9/release/blobs/uploads/175ca850-94e9-4f12-96bd-2215443abd65?_state=OK6IMXDu0oEqW_1kApl-CScd7gplm-bE0vJwspCQk7t7Ik5hbWUiOiJjaS1vcC0wemNicHB5OS9yZWxlYXNlIiwiVVVJRCI6IjE3NWNhODUwLTk0ZTktNGYxMi05NmJkLTIyMTU0NDNhYmQ2NSIsIk9mZnNldCI6MCwiU3RhcnRlZEF0IjoiMjAyNC0wNy0wOVQxMjoyOTo1MC45ODg4MTkyODhaIn0%3D": http2: Transport: cannot retry err [stream error: stream ID 2349; REFUSED_STREAM; received from peer] after Request.Body was written; define Request.GetBody to avoid this error
{"component":"entrypoint","error":"wrapped process failed: exit status 1","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:84","func":"sigs.k8s.io/prow/pkg/entrypoint.Options.internalRun","level":"error","msg":"Error executing test process","severity":"error","time":"2024-07-09T12:30:21Z"}
---}

@Elbehery
Contributor Author

Elbehery commented Jul 9, 2024

/retest-required

@Elbehery
Contributor Author

/retest-required

@Elbehery Elbehery force-pushed the add_backup_sidecar branch 6 times, most recently from 4795bb3 to 7e61751 Compare July 23, 2024 00:37
Contributor

openshift-ci bot commented Jul 23, 2024

@Elbehery: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                                         Commit   Details  Required  Rerun command
ci/prow/e2e-gcp-qe-no-capabilities                803e77a  link     false     /test e2e-gcp-qe-no-capabilities
ci/prow/e2e-aws-ovn-single-node                   b4f9919  link     true      /test e2e-aws-ovn-single-node
ci/prow/e2e-aws-etcd-recovery                     b4f9919  link     false     /test e2e-aws-etcd-recovery
ci/prow/e2e-operator                              b4f9919  link     true      /test e2e-operator
ci/prow/e2e-aws-etcd-certrotation                 b4f9919  link     false     /test e2e-aws-etcd-certrotation
ci/prow/e2e-aws-ovn-etcd-scaling                  b4f9919  link     true      /test e2e-aws-ovn-etcd-scaling
ci/prow/e2e-metal-ovn-sno-cert-rotation-shutdown  b4f9919  link     false     /test e2e-metal-ovn-sno-cert-rotation-shutdown
ci/prow/e2e-metal-ovn-ha-cert-rotation-shutdown   b4f9919  link     false     /test e2e-metal-ovn-ha-cert-rotation-shutdown
ci/prow/e2e-operator-fips                         b4f9919  link     false     /test e2e-operator-fips

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@Elbehery
Contributor Author

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 23, 2024
Comment on lines +338 to +343
- hostPath:
    path: /var/backup/etcd
  name: backup-dir
- hostPath:
    path: /etc/kubernetes
  name: config-dir
Contributor

I still have to review this later and see whether all of it aligns with what's outlined in the enhancement, but manifest changes related to this feature, like this one, need to be feature-gated and can't be in GA by default.

@Elbehery
Contributor Author

closing this in favor of #1301

@Elbehery Elbehery closed this Jul 27, 2024
Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
do-not-merge/hold: Indicates that a PR should not merge because someone has issued a /hold command.
jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
tide/merge-method-squash: Denotes a PR that should be squashed by tide when it merges.
7 participants