Make authorized networks update more resilient #142

Merged: 4 commits, Jun 14, 2024

@aptituz aptituz commented Jun 13, 2024

This change makes the cluster update more resilient by:

  1. Adding a check for running operations before trying
    to update the cluster. Only afterwards is the cluster
    configuration fetched and updated, to ensure we
    work on the most current state.
  2. Adding retries if the cluster update operation fails.
    This is necessary because two different jobs can
    pass the check for running operations and reach the
    point of attempting the update with only a minimal
    delay between them, and in these cases one of them
    would fail.
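
For readers skimming the PR, the ordering described in the list can be sketched roughly as follows. This is only an illustration, not the code in this PR: the `ClusterClient` interface, its method names, and `addAuthorizedNetwork` are hypothetical stand-ins for whatever GKE client the action actually uses.

```typescript
// Hypothetical minimal client interface (an assumption for illustration;
// the action's real GKE client is not shown in this conversation).
interface ClusterClient {
  listRunningOperations(): Promise<string[]>;
  getAuthorizedNetworks(): Promise<string[]>;
  setAuthorizedNetworks(cidrs: string[]): Promise<void>;
}

// Default sleep helper; injectable so the flow can be exercised without waiting.
const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function addAuthorizedNetwork(
  client: ClusterClient,
  cidr: string,
  pollMs: number = 5000,
  sleep: (ms: number) => Promise<void> = defaultSleep,
): Promise<string[]> {
  // 1. Wait until no operation is running on the cluster.
  while ((await client.listRunningOperations()).length > 0) {
    await sleep(pollMs);
  }

  // 2. Only now fetch the configuration, so it reflects the most current state.
  const current = await client.getAuthorizedNetworks();
  const updated = current.includes(cidr) ? current : [...current, cidr];

  // 3. Send the update. This can still fail if another job passed the same
  //    check at nearly the same time, which is why retries are needed as well.
  await client.setAuthorizedNetworks(updated);
  return updated;
}
```

Fetching the configuration after the wait, rather than before it, is the point of step 1: a configuration read while another operation is still running may already be stale by the time the update is sent.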

The retry uses an exponential backoff algorithm with
a bit of randomization, so that many users of this
action do not become synchronized and hit the GKE
API all at once.
Fixes: #75
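
As a rough illustration of the retry behaviour described above (again a sketch, not the PR's implementation): the function names, the default base and maximum delays, and the choice of adding up to 50% random extra delay are all assumptions made for this example.

```typescript
// Exponential delay (base * 2^attempt, capped at maxMs) plus a random extra
// of up to 50% of that delay. The 50% jitter fraction is an assumed value.
// Note that the cap is applied before the jitter is added.
function backoffDelayMs(
  attempt: number,
  baseMs: number = 1000,
  maxMs: number = 30000,
  random: () => number = Math.random,
): number {
  const exponential = Math.min(maxMs, baseMs * Math.pow(2, attempt));
  const jitter = Math.floor(random() * exponential * 0.5);
  return exponential + jitter;
}

// Runs `op`, retrying on failure with the delay above between attempts.
// `sleep` and `delayFor` are injectable so the helper can be tested quickly.
async function retryWithBackoff<T>(
  op: () => Promise<T>,
  maxAttempts: number = 5,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
  delayFor: (attempt: number) => number = backoffDelayMs,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
      // Do not sleep after the final failed attempt.
      if (attempt < maxAttempts - 1) {
        await sleep(delayFor(attempt));
      }
    }
  }
  throw lastError;
}
```

The random component is what keeps concurrent jobs from retrying in lockstep: without it, two jobs that fail at the same moment would also retry at the same moment and collide again.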

aptituz and others added 3 commits June 6, 2024 15:02
Make authorized networks update more resilient
@aptituz aptituz requested a review from a team as a code owner June 13, 2024 08:28
aptituz commented Jun 13, 2024

In case you would like to see the updated action in action (what fun to write!), here is an example run, which also showcases the problem that the new action solves: https://github.com/metro-digital-inner-source/companion-card-service/actions/runs/9479120342/job/26117833835#step:9:1

cova-fe previously approved these changes Jun 14, 2024
@cova-fe cova-fe left a comment

lgtm

aptituz commented Jun 14, 2024

@cova-fe Thanks for the review. I noticed that one of the workflows was failing because I apparently forgot to run "npm run prepare". I've now run it, and I think you need to approve the PR again for the workflows to run again.

@cova-fe cova-fe merged commit c7547a2 into metro-digital:main Jun 14, 2024
2 checks passed
Development

Successfully merging this pull request may close these issues.

gcp-gke-control-plane-auth-networks-updater should be more resilient