refactor csr approvals #9711

Merged

Contributor

@michaelgugino michaelgugino commented Aug 22, 2018

Currently, the csr approval process for nodes is quite
fragile.

This commit creates a new custom module, oc_csr_client,
which handles the multiple steps involved in
approving pending node certificates.

The module attempts to approve all 'client' csrs
for any nodes provided via node_list. Missing csrs
are ignored as long as the missing node is in a
'Ready' status as reported by oc get nodes.

Next, the module approves csrs for 'server' certificates.
As in the client process, missing node csrs
are acceptable as long as the node's api endpoint
is reachable without error, indicating a server
certificate is deployed.

If there is a long delay between issuing a csr and
approving it, there may be several outstanding
'server' csrs. This module will approve any
outstanding csrs.
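
The acceptance rule described above (a node is fine if its csr was approved, or if it already passes the relevant health check) can be sketched as a small pure function. This is only an illustration under stated assumptions: the function and argument names are hypothetical and are not taken from the module in this PR.

```python
def nodes_still_pending(node_list, nodes_with_approved_csr, nodes_known_good):
    """Return requested nodes that have neither an approved csr nor a passing check.

    nodes_known_good stands for 'Ready' nodes in the client phase, or nodes
    whose api endpoint answered without error in the server phase.
    All names here are hypothetical.
    """
    approved = set(nodes_with_approved_csr)
    healthy = set(nodes_known_good)
    return [n for n in node_list if n not in approved and n not in healthy]


def check_csr_phase(node_list, nodes_with_approved_csr, nodes_known_good, phase):
    """Raise if any requested node is unaccounted for in the given phase."""
    pending = nodes_still_pending(
        node_list, nodes_with_approved_csr, nodes_known_good)
    if pending:
        raise RuntimeError(
            "{} csr phase incomplete for: {}".format(phase, ", ".join(pending)))
    return True
```

Keeping the rule free of any oc calls makes it easy to unit test separately from the command handling.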

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1571515

@openshift-ci-robot added the do-not-merge/work-in-progress label Aug 22, 2018
@openshift-ci-robot added the size/L and approved labels Aug 22, 2018
def get_ready_nodes(module, oc_bin, oc_conf):
    '''Get list of nodes currently ready via oc'''
    # plain output is way less to parse.
    command = "{} {} get nodes".format(oc_bin, oc_conf)
Member

@vrutkovs Aug 22, 2018

An overengineered version would be:
oc get nodes -o jsonpath="{range .items[*]}{@.metadata.name}:{range @.status.conditions[*]}{@.type}={@.status};{end}{end}"
to output something like
ip-172-18-14-252.ec2.internal:OutOfDisk=False;MemoryPressure=False;DiskPressure=False;PIDPressure=False;Ready=True;
then split it and keep only the lines with Ready=True.
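
A minimal sketch of the "split and filter" step, for comparison. It assumes the template emits one node per line (the jsonpath above would need a newline added after each node for that), and the function name is hypothetical.

```python
def parse_ready_nodes(jsonpath_output):
    """Return names of nodes whose condition list contains Ready=True.

    Expects lines shaped like 'name:Type=Status;Type=Status;' (assumed format).
    """
    ready = []
    for line in jsonpath_output.splitlines():
        line = line.strip()
        if not line:
            continue
        # The node name is everything before the first ':'.
        name, _, conditions = line.partition(":")
        pairs = [c for c in conditions.split(";") if c]
        # Exact match on the pair, so other condition types cannot match.
        if "Ready=True" in pairs:
            ready.append(name)
    return ready
```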

Contributor Author

That's incomprehensible and probably just as much splitting and filtering.

Member

Going back to past discussions where David stated that oc output is not guaranteed to be stable, I think we should either get yaml output or use a template like Vadim is suggesting. Perhaps not as complicated, but it still needs to be a format that we're sure of, so that if they change column ordering we're not busted.
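
One way to avoid depending on column order is to ask for structured output and read the node conditions directly. The sketch below assumes the command is run as oc get nodes -o json and that the result is a standard node list; the function name is hypothetical and this is not the code that was merged.

```python
import json


def ready_nodes_from_json(oc_output):
    """Return names of nodes reporting a Ready condition with status "True"."""
    data = json.loads(oc_output)
    ready = []
    for item in data.get("items", []):
        conditions = item.get("status", {}).get("conditions", [])
        # Condition status values are the strings "True", "False" or "Unknown".
        if any(c.get("type") == "Ready" and c.get("status") == "True"
               for c in conditions):
            ready.append(item["metadata"]["name"])
    return ready
```

Because the lookup is by key rather than by position, added or reordered columns in the plain output would not affect it.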

@openshift-ci-robot added the size/XL label and removed the size/L label Aug 22, 2018
@michaelgugino force-pushed the csr-approve-refactor branch 2 times, most recently from 76cedc7 to 45a59ee Aug 22, 2018 22:13
@openshift-ci-robot added the size/XXL label and removed the size/XL label Aug 22, 2018
@michaelgugino changed the title from "WIP: refactor csr approvals" to "refactor csr approvals" Aug 22, 2018
@openshift-ci-robot removed the do-not-merge/work-in-progress label Aug 22, 2018
Contributor Author

@michaelgugino left a comment

I have run this in a few different iterations against a local cluster (removed certs of a node, restarted), though not on a fresh cluster. Hopefully CI passes :)


short_description: Retrieve, approve, and verify node client csrs

version_added: "2.4"
Contributor Author

2.6

from ansible.module_utils.basic import AnsibleModule

try:
    from json.decoder import JSONDecodeError
Contributor Author

python2 vs python 3.4+
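
A sketch of the usual compatibility pattern this note points at. json.decoder.JSONDecodeError is a subclass of ValueError and, as far as I know, first appeared in Python 3.5; older interpreters raise a plain ValueError on malformed JSON. The helper name is hypothetical.

```python
import json

try:
    # Newer Python 3 releases raise a dedicated exception for malformed JSON.
    from json.decoder import JSONDecodeError
except ImportError:
    # Older interpreters (including Python 2) raise ValueError instead.
    JSONDecodeError = ValueError


def parse_json_or_none(text):
    """Return the decoded object, or None if text is not valid JSON."""
    try:
        return json.loads(text)
    except JSONDecodeError:
        return None
```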

oc_csr_client:
  oc_bin: "/usr/bin/oc"
  oc_conf: "/etc/origin/master/admin.kubeconfig"
  node_list: "['node1.example.com', 'node2.example.com']"
Contributor Author

Remove double quotes on this line.

@@ -0,0 +1,6 @@
{"ANSIBLE_MODULE_ARGS": {
Contributor Author

Remove this file.

@@ -0,0 +1,288 @@
#!/usr/bin/env python
# pylint: disable=missing-docstring
Contributor Author

Remove this line and add some docstrings :)


DOCUMENTATION = '''
---
module: oc_csr_client
Contributor Author

Rename module?

Member

Yeah, I'm not sure what we should name it though. All the other oc modules actually map pretty much 1:1 with oc foo bar cli interactions.

        supports_check_mode=False,
        argument_spec=module_args
    )
    oc_bin = module.params['oc_bin']
Contributor Author

This stuff gets passed around a lot. Probably should refactor into a class object; this will help with reporting during failures.
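
A rough sketch of what such a class could look like, not the refactor that was actually merged. The class, method, and result key names are hypothetical, and passing the kubeconfig via --config is an assumption; module is expected to behave like an AnsibleModule (run_command returning rc, stdout, stderr, and fail_json accepting keyword arguments).

```python
class CSRApprover(object):
    """Hypothetical holder for state shared across the approval steps."""

    def __init__(self, module, oc_bin, oc_conf, node_list):
        self.module = module
        self.oc_bin = oc_bin
        self.oc_conf = oc_conf
        self.node_list = node_list
        # Context accumulated along the way and attached to any failure.
        self.result = {"changed": False, "rc": 0, "oc_get_nodes": None,
                       "client_csrs": None, "server_csrs": None}

    def run_oc(self, args, key):
        """Run an oc subcommand, record stdout under key, fail on non-zero rc."""
        # Assumption: the kubeconfig path is passed with --config.
        command = [self.oc_bin, "--config", self.oc_conf] + list(args)
        rc, stdout, stderr = self.module.run_command(command)
        self.result[key] = stdout
        if rc != 0:
            self.result["rc"] = rc
            self.module.fail_json(
                msg="oc {} failed: {}".format(" ".join(args), stderr),
                state=self.result)
        return stdout
```

With the state on the instance, every failure path can report the same accumulated result dictionary instead of each helper receiving oc_bin and oc_conf separately.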

@michaelgugino force-pushed the csr-approve-refactor branch 2 times, most recently from a7cbcec to 1a96157 Aug 23, 2018 02:41
@sdodson removed the request for review from mtnbikenc Aug 23, 2018 13:21
@michaelgugino force-pushed the csr-approve-refactor branch 2 times, most recently from c87511b to 37f799f Aug 23, 2018 17:38
Member

@sdodson left a comment

looks like it's probably functionally correct, just some nits about module name and defaults.

- name: Approve node certificates when bootstrapping
  oc_csr_client:
    oc_bin: "{{ openshift_client_binary }}"
    oc_conf: "{{ openshift.common.config_base }}/master/admin.kubeconfig"
Member

If we're naming this module oc_ I'd expect it to default this like all the other oc modules do.

when:
- l_nodes_to_join|length > 0

- when: approve_out is failed
Member

Unless we're sure this is 100%, let's leave in the debug log generation.

Contributor Author

This is no longer necessary. Now the module provides actually useful failure information.

Member

ok sounds good


@michaelgugino force-pushed the csr-approve-refactor branch 2 times, most recently from f49176d to a9c01f0 Aug 23, 2018 20:51
Currently, the csr approval process for nodes is quite
fragile.

This commit creates a new custom module, oc_csr_approve,
which handles the multiple steps involved in
approving pending node certificates.

The module attempts to approve all 'client' csrs
for any nodes provided via node_list. Missing csrs
are ignored as long as the missing node is in a
'Ready' status as reported by oc get nodes.

Next, the module approves csrs for 'server' certificates.
As in the client process, missing node csrs
are acceptable as long as the node's api endpoint
is reachable without error, indicating a server
certificate is deployed.

If there is a long delay between issuing a csr and
approving it, there may be several outstanding
'server' csrs. This module will approve any
outstanding csrs.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1571515
Member

@vrutkovs left a comment

/lgtm

Worked perfectly here

@openshift-ci-robot added the lgtm label Aug 24, 2018
@michaelgugino
Contributor Author

/test install

@openshift-ci-robot

openshift-ci-robot commented Aug 24, 2018

@michaelgugino: The following test failed, say /retest to rerun them all:

Test name | Commit | Details | Rerun command
ci/prow/gcp-crio | fef0430 | link | /test gcp-crio

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@sdodson
Member

sdodson commented Aug 24, 2018

May require tweaking of the retries and time but we can figure that out later.
/lgtm

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: michaelgugino, sdodson, vrutkovs

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [michaelgugino,sdodson,vrutkovs]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sdodson
Member

sdodson commented Aug 24, 2018

gcp-crio was an e2e flake.

@sdodson merged commit 6016dea into openshift:master Aug 24, 2018
@michaelgugino
Contributor Author

/cherrypick release-3.10

@openshift-cherrypick-robot

@michaelgugino: #9711 failed to apply on top of branch "release-3.10":

error: Failed to merge in the changes.
Using index info to reconstruct a base tree...
M	roles/openshift_node/tasks/upgrade.yml
Falling back to patching base and 3-way merge...
Auto-merging roles/openshift_node/tasks/upgrade.yml
CONFLICT (content): Merge conflict in roles/openshift_node/tasks/upgrade.yml
Patch failed at 0001 Refactor csr approvals: oc_csr_approve

In response to this:

/cherrypick release-3.10

