data/bootstrap: add a script to collect info if cluster failed to start #1561
Conversation
    FILTER=gzip queue resources/openapi.json.gz ${OC} get --raw /openapi/v2

    echo "Gathering node journals ..."
There is an expectation that the apiserver is not up when we are running this script.
Right, I added a TODO item to handle that, but I don't think it would hurt to attempt to collect those (along with the other resources which assume the API server is working).
Some APIs may be available even if bootstrap failed. I believe @deads2k called out he'd like to fetch a list of api resources if possible.
> There is an expectation that the apiserver is not up when we are running this script.

This script must not die if the KAS is down, and it should use the KAS wherever it can. That means @abhinavdahiya is correct in saying that you cannot rely on oc adm node-logs to work.
I think that a best effort section for gathering cluster-specific resources that does a series of oc get pod -ojson, oc get node -ojson and the like (see https://github.com/openshift/release/blob/master/ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml#L361-L381) is valuable.
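A best-effort block along those lines might look like the sketch below. The resource list, the --request-timeout value, and the OC override are illustrative assumptions, not code from this PR:

```shell
#!/usr/bin/env bash
# Best-effort gathering sketch: every command may fail while the KAS
# is down, so each one is allowed to fail without killing the script.
# OC is overridable so the sketch can be exercised without a cluster.
ARTIFACTS="${1:-/tmp/artifacts}"
OC="${OC:-oc}"
mkdir -p "${ARTIFACTS}/resources"

# --request-timeout keeps us from hanging when the apiserver is down.
"${OC}" --request-timeout=5s get nodes -o json \
    > "${ARTIFACTS}/resources/nodes.json" 2>/dev/null || true
"${OC}" --request-timeout=5s get pods --all-namespaces -o json \
    > "${ARTIFACTS}/resources/pods.json" 2>/dev/null || true
"${OC}" --request-timeout=5s get events --all-namespaces -o json \
    > "${ARTIFACTS}/resources/events.json" 2>/dev/null || true
```

Because every command ends in || true, a dead apiserver degrades this to empty files rather than a failed gather.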
👍 on not relying on oc adm node-logs; we need a method for that which succeeds 100% of the time.
    #!/usr/bin/env bash

    ARTIFACTS="${1:-/tmp/artifacts}"
    mkdir -p "${ARTIFACTS}"
You'll want an earlier set -e or some &&-chaining or some such to error the script out when commands like this one fail for whatever reason.
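One way to wire that in at the top of the script is sketched below; this is a hedged illustration, not the PR's code, and the best-effort gathering sections further down would still want || true on individual commands:

```shell
#!/usr/bin/env bash
# Fail fast during setup: -e exits on command failure, -u on unset
# variables, and pipefail surfaces failures from the left-hand side
# of pipelines that -e alone would miss.
set -euo pipefail

ARTIFACTS="${1:-/tmp/artifacts}"
mkdir -p "${ARTIFACTS}/bootstrap/journals"
```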
Also, no need for the mkdir here, since that's covered by the mkdir -p "${ARTIFACTS}/bootstrap/journals" a few lines down.
    ARTIFACTS="${1:-/tmp/artifacts}"
    mkdir -p "${ARTIFACTS}"

    echo "Gathering bootstrap journals ..."
Probably write these to stderr, if we plan on using stdout to stream a tarball back to the install host?
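A sketch of that separation (the tar command in the comment is an illustrative assumption about how the tarball might be streamed):

```shell
#!/usr/bin/env bash
# Send human-readable progress to stderr so stdout stays free for data.
echo "Gathering bootstrap journals ..." >&2

# stdout would then carry only the payload, e.g. a tarball streamed
# back to the install host over ssh:
#   tar -czf - "${ARTIFACTS}"
```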
If managing stdout turns out to be too risky, we can revert to using scp, I guess, but it'd be nice if this were a one-liner that resulted in a tarball on the local host.
> ... but it'd be nice if this were a one-liner that resulted in a tarball on the local host.

Hey, you can ssh ... && scp ... all on one line ;).
    mkdir -p "${ARTIFACTS}/bootstrap/containers"
    for container in $(crictl ps --all --quiet)
    do
        container_name=$(crictl ps -a --id ${container} -v | grep -oP "Name: \K(.*)")
I would rather use --output=json and then use python or some such (we don't have jq handy) to extract the container name from the JSON.
don't make us depend on python on RHCOS hosts ;)
> don't make us depend on python on RHCOS hosts ;)

grep for structured data makes me sad ;). Maybe we can update crictl to get Go templating or some such to give us an eventual off-ramp?
    for container in $(crictl ps --all --quiet)
    do
        container_name=$(crictl ps -a --id ${container} -v | grep -oP "Name: \K(.*)")
        crictl logs "${container}" >& "${ARTIFACTS}/bootstrap/containers/${container_name}.log"
We probably want to at least attempt to include logs from previous attempts at this container, via --last with some kind of suffix here.
Force-pushed 6dad200 to 40e9fc9.
@wking PTAL

I'll work on addressing the shellcheck errors now.

Still some shellcheck issues:
Force-pushed e62dffa to a0bde99.
@vrutkovs I saw this used on a down cluster this morning and the output from it is great.

I've pushed 7bca822 onto the top of this branch, using while read ... to avoid some of the ShellCheck issues and adding some more quoting. @vrutkovs, assuming that this gets through CI (I removed a few ShellCheck comments in the hopes that I could find ways to avoid those too, but they might fail this round), are you ok with us squashing these changes down onto your original commit, or would you prefer they stay separate?
Thanks, I'd made some similar changes and they broke things, I'll give yours a try soon.
Force-pushed 25796cc to bc3f820.
    $ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1561/pull-ci-openshift-installer-master-e2e-aws/5282/artifacts/e2e-aws/pods.json | pods-with-issues | grep -B1 'cannot get resource\|master has not created a default'
    W0413 01:10:20.870684 1 authentication.go:272] Unable to get configmap/extension-apiserver-authentication in kube-system. Usually fixed by 'kubectl create rolebinding -n kube-system ROLE_NAME --role=extension-apiserver-authentication-reader --serviceaccount=YOUR_NS:YOUR_SA'
    configmaps "extension-apiserver-authentication" is forbidden: User "system:kube-controller-manager" cannot get resource "configmaps" in API group "" in the namespace "kube-system"
    --
    I0413 00:59:11.440413 2541 node.go:267] Starting openshift-sdn network plugin
    F0413 00:59:11.478516 2541 cmd.go:114] Failed to start sdn: failed to validate network configuration: master has not created a default cluster network, network plugin "redhat/openshift-ovs-networkpolicy" can not start
    --
    I0413 00:59:11.630112 2262 node.go:267] Starting openshift-sdn network plugin
    F0413 00:59:11.650826 2262 cmd.go:114] Failed to start sdn: failed to validate network configuration: master has not created a default cluster network, network plugin "redhat/openshift-ovs-networkpolicy" can not start

This is rhbz#1698672:

/retest

/retest
Green :)
Force-pushed cef6dce to 8d67a3f.
    mkdir -p "${ARTIFACTS}/bootstrap/containers"
    sudo crictl ps --all --quiet | while read -r container
    do
        container_name="$(sudo crictl ps -a --id "${container}" -v | grep -oP "Name: \\K(.*)")"
Should we stay consistent with options? -a is the same as --all.

Do we need to run this command twice? Why don't we get the JSON output, or parse this information line by line? Something like:

    sudo crictl ps --all | grep -v CONTAINER_ID | while read -r container_line
    do
        container_name="$(echo "${container_line}" | awk '{print $5}')"
    done

This would save on having to make two crictl calls for information you fundamentally already have.
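A runnable version of that single-pass idea, driven here by canned sample output so it can be exercised without a cluster; the sample header and container line are invented, and the real script would pipe sudo crictl ps --all instead:

```shell
#!/usr/bin/env bash
# Single-pass parse of 'crictl ps --all'-style output.  NAME is taken
# as column 5, matching the suggestion above; verify the column order
# against the header printed by your crictl version before relying on it.
sample_ps_output='CONTAINER_ID IMAGE CREATED STATE NAME ATTEMPT POD_ID
6dad200f00d quay.io/coreos/etcd:v3 2m Running etcd-member 0 1561abc'

echo "${sample_ps_output}" | grep -v CONTAINER_ID | while read -r container_line
do
    container_name="$(echo "${container_line}" | awk '{print $5}')"
    echo "${container_name}"   # prints: etcd-member
done
```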
    }
    mkdir -p "${ARTIFACTS}/control-plane" "${ARTIFACTS}/resources"

    echo "Gathering cluster resources ..."
If the bootstrap API is down, will any of the commands below run/work? They all use oc, which could be problematic in that case. Should we consider other ways to get at this information?
Some might work; it depends on how bad the state is.
    echo "Gather remote logs"
    export MASTERS=()
    if [ "$(stat --printf="%s" "${ARTIFACTS}/resources/masters.list")" -ne "0" ]
if test -s "${ARTIFACTS}/resources/masters.list"?
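For reference, test -s is true only when the file exists and is non-empty, and unlike the stat call it does not error out on a missing file. A small sketch using a temp file as a stand-in for masters.list:

```shell
#!/usr/bin/env bash
# 'test -s FILE' succeeds iff FILE exists and has a size greater than zero.
masters_list="$(mktemp)"   # stand-in for "${ARTIFACTS}/resources/masters.list"

if test -s "${masters_list}"
then
    echo "have masters"
else
    echo "masters list empty or missing"   # taken here: mktemp files start empty
fi

echo "10.0.0.1" >> "${masters_list}"
test -s "${masters_list}" && echo "now have masters"
rm -f "${masters_list}"
```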
    wait

    echo "Gather remote logs"
    export MASTERS=()
No need to export this, is there? You're just using it in the local shell.
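Beyond being unnecessary here, export would not even help a child process see the variable, since bash does not pass arrays through the environment. A small demonstration (values are made up):

```shell
#!/usr/bin/env bash
# Arrays work fine as plain shell variables within the local shell ...
MASTERS=(10.0.0.1 10.0.0.2)
echo "${#MASTERS[@]} masters"   # prints: 2 masters

# ... and 'export' does not carry a bash array into a child process:
export MASTERS
bash -c 'echo "child sees: ${MASTERS[0]:-nothing}"'   # prints: child sees: nothing
```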
    # Find out master IPs from etcd discovery record
    DOMAIN=$(sudo oc --config=/opt/openshift/auth/kubeconfig whoami --show-server | grep -oP "api.\\K([a-z\\.]*)")
    # shellcheck disable=SC2031
    mapfile -t MASTERS < "$(dig -t SRV "_etcd-server-ssl._tcp.${DOMAIN}" +short | cut -f 4 -d ' ' | sed 's/.$//')"
nit: s/.$// -> s/\.$//
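The nit matters because an unescaped dot matches any character, not just a literal dot. For illustration (hostnames made up):

```shell
#!/usr/bin/env bash
# dig appends a trailing dot to SRV targets; strip exactly that dot.
echo "etcd-0.example.com." | sed 's/\.$//'   # prints: etcd-0.example.com

# The unescaped '.' matches ANY final character, so on a name that
# happens to lack the trailing dot it would eat a real character:
echo "etcd-0.example.com" | sed 's/.$//'     # prints: etcd-0.example.co
echo "etcd-0.example.com" | sed 's/\.$//'    # prints: etcd-0.example.com
```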
/lgtm
data/data/bootstrap/files/usr/local/bin/installer-masters-gather.sh (outdated; resolved)
Force-pushed 3098329 to 838dd94.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED.

This pull-request has been approved by: sdodson, vrutkovs, wking. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Approvers can indicate their approval by writing /approve in a comment.
Don't remove existing artifact gathering just yet. Depends on openshift/installer#1561
/retest
    echo "Gathering bootstrap containers ..."
    mkdir -p "${ARTIFACTS}/bootstrap/containers"
    sudo crictl ps --all --quiet | while read -r container
Does crictl ps --all --quiet include containers which have terminated (e.g. due to invalid flags)?
Yeah, it should list those too; --all includes all states.
Initial version of a script which collects information from bootstrap and control-plane nodes, for use when the bootstrap never completes. This is a first pass, intended to be good enough that we start deriving value from it in our CI testing, where we often hit bootstrap failures and lack the logging needed to debug them.
See https://jira.coreos.com/browse/CORS-1050
/cc @jstuever
TODO:
Followup: