Bug 1883903:Add retries to SDN's RBAC proxy #786

Merged

juanluisvaladas
Contributor

Because kube-proxy may not be initialized by the time the RBAC proxy
starts, the proxy may crashloop for a while. This has no actual impact, but
the restarts show up in oc get pod and people may worry about them.

Contributor

@danielmellado danielmellado left a comment

Overall LGTM, although could you please log a warning message should the retry fail?

@juanluisvaladas juanluisvaladas force-pushed the rbac-add-retries branch 2 times, most recently from 9b6c5df to 0cc2102 Compare September 10, 2020 11:05
@juanluisvaladas
Contributor Author

@danielmellado Done

Comment on lines 192 to 193
until [ "$retries" -ge 20 ]
do
Contributor

Suggested change
-until [ "$retries" -ge 20 ]
-do
+while [[ "${retries}" -lt 20 ]]; do

Contributor

All the stylistic suggestions here still apply, here and elsewhere: single-line rather than multi-line, and while rather than until, because those are both more standard; [[ rather than [, and ${retries} rather than $retries, because they have fewer gotchas and match the kube/OCP shell script style guide that was never finalized...
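
To illustrate the "fewer gotchas" point, here is a small sketch of two commonly cited pitfalls. The variable names (`value`, `single`, `double`, `unbraced`, `braced`) are made up for the example, and these may not be the exact cases the reviewer had in mind.

```shell
#!/usr/bin/env bash
value=""

# [ ] is an ordinary command: the unquoted empty variable vanishes, leaving
# `[ -n ]`, which tests the literal string "-n" and therefore succeeds.
if [ -n $value ]; then single="non-empty"; else single="empty"; fi

# [[ ]] is shell syntax with no word splitting, so the empty value is seen as empty.
if [[ -n $value ]]; then double="non-empty"; else double="empty"; fi

retries=3
# Without braces the shell looks up a variable named "retries_left", which is unset.
unbraced="$retries_left"
# With braces the variable name ends at the closing brace.
braced="${retries}_left"

echo "[ ]: ${single}, [[ ]]: ${double}"
echo "unbraced='${unbraced}' braced='${braced}'"
```

Quoting the variable inside `[ ]` avoids the first problem as well; `[[ ]]` simply removes the need to remember.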

"https://${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}/api/v1/namespaces/openshift-sdn/services/sdn" |
python -c 'import json,sys; print(json.load(sys.stdin)["metadata"]["creationTimestamp"])' &&
break ||
echo "WARN: Failed to get sdn service from API" 1>&2
Contributor

If this is expected, it shouldn't be a "warning". And you should say you're retrying.

Also, can you use if/else rather than &&/||? (I realize the variable assignment might make that tricky, so maybe no...)

(EDIT: fixed "shouldn't")
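
One way the if/else form could look, offered as a sketch rather than the code that was merged. `fetch_ts` and `counter_file` are hypothetical stand-ins for the `curl | python` pipeline so the example runs offline, and the INFO wording is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the curl | python pipeline: fails twice, then
# prints a timestamp. The counter lives in a file because $( ) runs in a subshell.
counter_file=$(mktemp)
echo 0 > "${counter_file}"
fetch_ts() {
  local n
  n=$(( $(cat "${counter_file}") + 1 ))
  echo "${n}" > "${counter_file}"
  if [[ "${n}" -lt 3 ]]; then
    return 1
  fi
  echo "2020-09-10T11:05:00Z"
}

retries=0
TS=""
while [[ "${retries}" -lt 20 ]]; do
  # An assignment's exit status is that of its command substitution,
  # so it can be used directly as the if condition.
  if TS=$(fetch_ts); then
    break
  else
    echo "INFO: Failed to get sdn service from API, retrying" 1>&2
  fi
  retries=$(( retries + 1 ))
done
rm -f "${counter_file}"
echo "TS=${TS} retries=${retries}"
```

Unlike `a && b || c`, where `c` also runs when `b` fails, the else branch here runs only when the fetch itself fails. The `break` sits in the parent shell, outside the `$( )`.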

)
retries=$(( retries + 1 ))
sleep 15
done

TS=$(date -d "${TS}" +%s)
WARN_TS=$(( ${TS} + $(( 20 * 60)) ))
Contributor

hm... github won't let me review outside the context of the diff, but below here I see:

              if [[ "${CUR_TS}" -gt "WARN_TS"  ]]; then
                echo $(date -Iseconds) WARN: sdn-metrics-certs not mounted after 20 minutes.
              elif [[ "${HAS_LOGGED_INFO}" -eq 0 ]] ; then
                echo $(date -Iseconds) INFO: sdn-metrics-certs not mounted. Waiting one hour.

"one hour" should be "20 minutes" shouldn't it?

Contributor Author

Yes, I'll slide that into the PR.
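
A sketch of how the deadline and both log messages could share one constant, so the text cannot drift from the actual wait again. `WAIT_MINUTES`, `warn_deadline`, and `cert_wait_message` are hypothetical names, not part of the script under review; the functions take epoch seconds, which the script obtains with `date -d "${TS}" +%s`.

```shell
#!/usr/bin/env bash
# Single source of truth for the wait, shared by the deadline and the log text.
WAIT_MINUTES=20

# Prints the epoch second after which a warning is due.
# $1: start time in epoch seconds.
warn_deadline() {
  echo $(( $1 + WAIT_MINUTES * 60 ))
}

# Prints the log line for the given current time and deadline (both epoch seconds).
cert_wait_message() {
  local cur_ts="$1" warn_ts="$2"
  if [[ "${cur_ts}" -gt "${warn_ts}" ]]; then
    echo "WARN: sdn-metrics-certs not mounted after ${WAIT_MINUTES} minutes."
  else
    echo "INFO: sdn-metrics-certs not mounted. Waiting ${WAIT_MINUTES} minutes."
  fi
}

start=$(date +%s)
deadline=$(warn_deadline "${start}")
cert_wait_message "${start}" "${deadline}"
```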

--cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
-H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" \
"https://${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}/api/v1/namespaces/openshift-sdn/services/sdn" |
python -c 'import json,sys; print(json.load(sys.stdin)["metadata"]["creationTimestamp"])' &&
Contributor

Just a nitpick, but are we 100% sure that the json python module would be available within the system? Maybe there's something around that installs it and I'm not aware of it, but if that's not the case, we should also catch that exception or make sure it's installed.

Contributor Author

It's a RHEL container so we're guaranteed to have it.
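
For images where the interpreter is not guaranteed, a defensive variant might check for it before parsing. This is only a sketch: `pick_python` and `extract_creation_ts` are hypothetical helpers, and the sample JSON is made up. The `json` module itself is part of the Python standard library, so the interpreter is the only thing that can be missing.

```shell
#!/usr/bin/env bash
# Prints the first Python interpreter found on PATH; fails if there is none.
pick_python() {
  command -v python3 || command -v python
}

# Reads a Service object as JSON on stdin and prints its creationTimestamp.
# Fails if no interpreter is found or the JSON lacks the field.
extract_creation_ts() {
  local py
  py=$(pick_python) || { echo "ERROR: no python interpreter found" 1>&2; return 1; }
  "${py}" -c 'import json,sys; print(json.load(sys.stdin)["metadata"]["creationTimestamp"])'
}

echo '{"metadata": {"creationTimestamp": "2020-09-10T11:05:00Z"}}' | extract_creation_ts
```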

@juanluisvaladas juanluisvaladas force-pushed the rbac-add-retries branch 2 times, most recently from 943a321 to f1efa45 Compare September 10, 2020 21:50
@juanluisvaladas
Contributor Author

@danwinship while I was addressing it, I realized it never exited if the retries were never successful, so besides addressing your comments I also added these new lines:

          if [ "${retries}" -ge 20 ]; then
            echo $(date -Iseconds) FATAL: Unable to get sdn service from API.
            exit 1
          fi
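
Put together, the bounded retry with a hard failure could be sketched as a reusable function. `retry`, `max_retries`, and `delay` are hypothetical names for illustration; the diff hard-codes 20 retries and a 15 second sleep. The function returns 1 instead of calling `exit` so it can be tried in an interactive shell.

```shell
#!/usr/bin/env bash
# Runs "$@" up to max_retries times, sleeping between failed attempts.
# Returns 0 on the first success, 1 once the retries are exhausted.
retry() {
  local max_retries="$1" delay="$2"
  shift 2
  local retries=0
  while [[ "${retries}" -lt "${max_retries}" ]]; do
    if "$@"; then
      return 0
    fi
    retries=$(( retries + 1 ))
    # Only announce a retry (and sleep) when another attempt will follow.
    if [[ "${retries}" -lt "${max_retries}" ]]; then
      echo "INFO: attempt ${retries}/${max_retries} failed, retrying in ${delay}s" 1>&2
      sleep "${delay}"
    fi
  done
  echo "FATAL: giving up after ${max_retries} attempts" 1>&2
  return 1
}

retry 3 0 true && echo "succeeded"
retry 3 0 false || echo "gave up; a script would exit 1 here"
```

In the script itself, the caller would follow a failed `retry` with `exit 1`, matching the FATAL branch quoted above.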

@juanluisvaladas
Contributor Author

/retest

@danwinship
Contributor

Do we need a similar patch for any of the other daemonsets?

@juanluisvaladas
Contributor Author

Yes, ovnkube-node and ovnkube-master daemonsets will need this as well, but because #778 is not merged yet I can't modify it here, so I asked @bond95 to add this code in his PR

@rcarrillocruz
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Sep 14, 2020
@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

7 similar comments

@juanluisvaladas
Contributor Author

Still flaky
/hold

@juanluisvaladas juanluisvaladas changed the title Add retries to SDN's RBAC proxy [WIP]Add retries to SDN's RBAC proxy Sep 18, 2020
@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 18, 2020
@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Sep 18, 2020
@juanluisvaladas
Contributor Author

Something is wrong. I just added -x to get more logging, so don't review yet.

@juanluisvaladas juanluisvaladas force-pushed the rbac-add-retries branch 2 times, most recently from cc3bd35 to 8a95b2d Compare September 21, 2020 09:20
@juanluisvaladas
Contributor Author

@danwinship I had to replace the if curl | python; then break; fi with a curl | python && break.

With the variable assignment it wasn't working as I expected.

@juanluisvaladas
Contributor Author

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 30, 2020
@juanluisvaladas juanluisvaladas changed the title [WIP]Add retries to SDN's RBAC proxy Add retries to SDN's RBAC proxy Sep 30, 2020
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 30, 2020
@juanluisvaladas
Contributor Author

There was an issue with doing the break inside the subshell. This is now fixed and ready to merge. Can you please lgtm?
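
The thread does not show the failing code, so the following is only a guess at the class of problem: a `break` that runs inside `$( )` executes in a child process and cannot leave the parent's loop. The loop bodies and the names `iterations`, `iterations_fixed`, and `ts` are illustrative.

```shell
#!/usr/bin/env bash
# Case 1: break inside the command substitution. It runs in the subshell,
# so the parent loop still runs all three iterations.
iterations=0
for i in 1 2 3; do
  ts=$( { echo "timestamp-${i}" && break; } 2>/dev/null )
  iterations=$(( iterations + 1 ))
done
echo "break in subshell: ${iterations} iterations"

# Case 2: break in the parent shell, keyed on the substitution's exit status.
iterations_fixed=0
for i in 1 2 3; do
  iterations_fixed=$(( iterations_fixed + 1 ))
  ts=$(echo "timestamp-${i}") && break
done
echo "break in parent: ${iterations_fixed} iteration, ts=${ts}"
```

The second form keeps the variable assignment and the loop control in the same shell, which is what the `curl | python && break` rewrite mentioned earlier relies on.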

@rcarrillocruz
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Sep 30, 2020
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: juanluisvaladas, rcarrillocruz

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot
Contributor

openshift-ci-robot commented Sep 30, 2020

@juanluisvaladas: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-azure | 7782ba1 | link | /test e2e-azure |
| ci/prow/e2e-vsphere | 7782ba1 | link | /test e2e-vsphere |
| ci/prow/e2e-metal-ipi | dbdc25e | link | /test e2e-metal-ipi |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@juanluisvaladas
Contributor Author

/retitle Bug 1883903:Add retries to SDN's RBAC proxy

@openshift-ci-robot openshift-ci-robot changed the title Add retries to SDN's RBAC proxy Bug 1883903:Add retries to SDN's RBAC proxy Sep 30, 2020
@openshift-ci-robot openshift-ci-robot added bugzilla/severity-low Referenced Bugzilla bug's severity is low for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Sep 30, 2020
@openshift-ci-robot
Contributor

@juanluisvaladas: This pull request references Bugzilla bug 1883903, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1883903:Add retries to SDN's RBAC proxy

@openshift-merge-robot openshift-merge-robot merged commit fe416a3 into openshift:master Sep 30, 2020
@openshift-ci-robot
Contributor

@juanluisvaladas: All pull requests linked via external trackers have merged:

Bugzilla bug 1883903 has been moved to the MODIFIED state.

In response to this:

Bug 1883903:Add retries to SDN's RBAC proxy
