
Bug 1802534: gcp-routes: move to MCO, implement downfile, tweak timing #1670

Merged: 3 commits merged into openshift:master on May 20, 2020

Conversation

@squeed (Contributor) commented Apr 21, 2020

Move gcp-routes to the MCO, call it openshift-gcp-routes. Also, improve the behavior significantly:

  1. always accept connections from external sources
  2. use downfiles instead of stopping the service (see the sketch after this list)
  3. implement the equivalent in gcp-routes-controller.
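For illustration, here is a minimal Go sketch of the downfile approach from points 2 and 3. It is not the actual gcp-routes-controller code: the downfile path, the localhost:6443/readyz probe, and the timings are assumptions made up for the example; only the overall idea (toggle a marker file instead of stopping the service) comes from this description.

```go
// Minimal sketch of a downfile-based health gate. All paths and endpoints
// below are hypothetical, not taken from gcp-routes-controller.
package main

import (
	"crypto/tls"
	"net/http"
	"os"
	"time"
)

// downFile is an illustrative marker path; while it exists, the route script
// would withdraw the load-balancer VIP rules but keep running.
const downFile = "/run/gcp-routes/vip.down"

// apiserverHealthy is an illustrative local probe; the real controller's
// check and TLS configuration differ.
func apiserverHealthy() bool {
	client := &http.Client{
		Timeout: 2 * time.Second,
		Transport: &http.Transport{
			// Sketch only: skip certificate verification.
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}
	resp, err := client.Get("https://localhost:6443/readyz")
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

func main() {
	for {
		if apiserverHealthy() {
			// Healthy: remove the downfile so the script restores the VIP
			// rules and external connections are accepted again.
			os.Remove(downFile)
		} else {
			// Unhealthy: create the downfile; the script drops the VIP rules
			// on its next pass, without the service itself being stopped.
			if f, err := os.Create(downFile); err == nil {
				f.Close()
			}
		}
		time.Sleep(5 * time.Second)
	}
}
```

Presumably the appeal of a downfile over stopping the unit outright is that the route script keeps running and can restore the VIP rules quickly once the file disappears.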

@squeed (Contributor, Author) commented Apr 21, 2020

cc @sttts

@squeed (Contributor, Author) commented Apr 21, 2020

PR adding downfile support to gcp-routes.sh: https://gitlab.cee.redhat.com/coreos/redhat-coreos/merge_requests/899

@squeed changed the title from "cmd/gcp-routes-controller: use downfile if available, tweak timing" to "Bug 1802534: cmd/gcp-routes-controller: use downfile if available, tweak timing" on Apr 21, 2020
@openshift-ci-robot (Contributor)

@squeed: This pull request references Bugzilla bug 1802534, which is valid. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.5.0) matches configured target release for branch (4.5.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

In response to this:

Bug 1802534: cmd/gcp-routes-controller: use downfile if available, tweak timing

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot added the bugzilla/valid-bug label (indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting) on Apr 21, 2020
@squeed (Contributor, Author) commented Apr 21, 2020

/hold
until the GCP PR is merged and "deployed" into CI.

@openshift-ci-robot added the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) on Apr 21, 2020
@LorbusChris (Member)

/cherry-pick fcos
/cc @vrutkovs

@openshift-cherrypick-robot

@LorbusChris: once the present PR merges, I will cherry-pick it on top of fcos in a new PR and assign it to you.

In response to this:

/cherry-pick fcos
/cc @vrutkovs

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@squeed (Contributor, Author) commented Apr 21, 2020

Seems the unit test scaffolding is out-of-date, so this is not going to pass unit tests. Investigating.

@umohnani8 (Contributor)

/test e2e-gcp-op

@kikisdeliveryservice (Contributor)

seeing:
E0421 19:17:08.844158 1 leaderelection.go:331] error retrieving resource lock openshift-sdn/openshift-network-controller: Get https://api-int.ci-op-jd83nzz1-1354f.origin-ci-int-gce.dev.openshift.com:6443/api/v1/namespaces/openshift-sdn/configmaps/openshift-network-controller: dial tcp 10.0.0.2:6443: i/o timeout

https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1670/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1940/artifacts/e2e-gcp-op/pods/openshift-sdn_sdn-controller-2t46q_sdn-controller.log

@squeed (Contributor, Author) commented Apr 22, 2020

seeing:
E0421 19:17:08.844158 1 leaderelection.go:331] error retrieving resource lock openshift-sdn/openshift-network-controller: Get https://api-int.ci-op-jd83nzz1-1354f.origin-ci-int-gce.dev.openshift.com:6443/api/v1/namespaces/openshift-sdn/configmaps/openshift-network-controller: dial tcp 10.0.0.2:6443: i/o timeout

https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1670/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1940/artifacts/e2e-gcp-op/pods/openshift-sdn_sdn-controller-2t46q_sdn-controller.log

Nuts. Investigating.

@squeed (Contributor, Author) commented Apr 22, 2020

Those error messages are red herrings; the node maintained access to the internal api server at that time. I suspect they're just from masters rebooting. Continuing the investigation.

@squeed (Contributor, Author) commented Apr 22, 2020

Okay, this is weird: tests are failing because one of the worker nodes failed to start the MCD after a reboot - because CRI-O won't create the sandbox.

Apr 21 21:21:32.525864 ci-op-d4rtc-w-b-xrmnd.c.openshift-gce-devel-ci.internal hyperkube[1630]: I0421 21:21:32.525753 1630 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-machine-config-operator", Name:"machine-config-daemon-4kr98", UID:"fb999c1d-c1ac-4e8f-bf2e-98e2b785acf4", APIVersion:"v1", ResourceVersion:"27971", FieldPath:""}): type: 'Warning' reason: 'FailedCreatePodSandBox' Failed to create pod sandbox: rpc error: code = Unknown desc = reserving pod sandbox name: error reserving pod name k8s_machine-config-daemon-4kr98_openshift-machine-config-operator_fb999c1d-c1ac-4e8f-bf2e-98e2b785acf4_1 for id 6bfb87bd099b9954b3ee5654c35e9b6fc8403df3f06f6f1a57331195f94dc28e: name is reserved

This definitely isn't caused by my change - mine only touches the master nodes. But it's not good. I wonder if the problem lies with kubelet, crio, or libpod?

@squeed (Contributor, Author) commented Apr 22, 2020

Known (and very scary) bug: 1785399
/retest

@squeed (Contributor, Author) commented Apr 22, 2020

Updated, fixed unit test scaffolding. This job is pretty prone to triggering the crio leak, so it might be a while before this goes green. AFAICT it's not a bug in the PR.

@knobunc (Contributor) commented Apr 22, 2020

/retest

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

10 similar comments

@squeed (Contributor, Author) commented May 19, 2020

This seems to be a horrid test issue.
/hold
to stop the retests until it clears up

@openshift-ci-robot added the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) on May 19, 2020
@squeed (Contributor, Author) commented May 19, 2020

/test e2e-aws

1 similar comment

@squeed (Contributor, Author) commented May 20, 2020

/hold cancel
now that quay is back
/retest

@openshift-ci-robot removed the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) on May 20, 2020
@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

2 similar comments

@squeed (Contributor, Author) commented May 20, 2020

It looks like whatever problem is behind https://bugzilla.redhat.com/show_bug.cgi?id=1828606 is biting us.

Given that this PR doesn't touch AWS at all, would this be a candidate for an /override?

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot merged commit 50bc7b4 into openshift:master on May 20, 2020
@openshift-ci-robot (Contributor)

@squeed: All pull requests linked via external trackers have merged: openshift/machine-config-operator#1670. Bugzilla bug 1802534 has been moved to the MODIFIED state.

In response to this:

Bug 1802534: gcp-routes: move to MCO, implement downfile, tweak timing

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-cherrypick-robot

@LorbusChris: new pull request created: #1741

In response to this:

/cherry-pick fcos
/cc @vrutkovs

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sttts (Contributor) commented Jun 4, 2020

/cherry-pick release-4.4

@openshift-cherrypick-robot

@sttts: new pull request created: #1780

In response to this:

/cherry-pick release-4.4

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels: approved, bugzilla/severity-high, bugzilla/valid-bug, lgtm