
data/bootstrap: add a script to collect info if cluster failed to start #1561

Merged
1 commit merged into openshift:master on Apr 15, 2019

Conversation


@vrutkovs vrutkovs commented Apr 9, 2019

Initial version of a script that collects information from the bootstrap and control-plane nodes, for use when bootstrapping never completes. This is a first pass, intended to be good enough that we can start deriving value from it in our CI testing, where we often see bootstrap failures but lack the logging needed to debug them.

$ ssh -A core@bootstrap /usr/local/bin/installer-gather.sh                                                                                                         
Gathering bootstrap journals ...
Gathering bootstrap containers ...
Gathering cluster resources ...
Waiting for logs ...
Gather remote logs
Gathering master journals ...
Gathering master containers ...
Waiting for logs ...
Gathering master journals ...
Gathering master containers ...
Waiting for logs ...
Gathering master journals ...
Gathering master containers ...
Waiting for logs ...
Log bundle written to ~/log-bundle.tar.gz
$ scp core@bootstrap:log-bundle.tar.gz .
$ tar xfv log-bundle.tar.gz 
./
./bootstrap/
./bootstrap/journals/
./bootstrap/journals/bootkube.log
./bootstrap/journals/openshift.log
./bootstrap/journals/kubelet.log
./bootstrap/journals/crio.log
./bootstrap/containers/
./bootstrap/pods/
./bootstrap/pods/a24cf71f605b.log
./bootstrap/pods/a24cf71f605b.inspect
./bootstrap/pods/995443839a3a.log
./bootstrap/pods/995443839a3a.inspect
...
./control-plane/
./control-plane/ip-10-0-143-189.us-east-2.compute.internal/
./control-plane/ip-10-0-143-189.us-east-2.compute.internal/journals/
./control-plane/ip-10-0-143-189.us-east-2.compute.internal/journals/openshift.log
./control-plane/ip-10-0-143-189.us-east-2.compute.internal/journals/kubelet.log
./control-plane/ip-10-0-143-189.us-east-2.compute.internal/journals/crio.log
./control-plane/ip-10-0-143-189.us-east-2.compute.internal/journals/bootkube.log
./control-plane/ip-10-0-143-189.us-east-2.compute.internal/containers/
./control-plane/ip-10-0-143-189.us-east-2.compute.internal/containers/kube-controller-manager-3.log
./control-plane/ip-10-0-143-189.us-east-2.compute.internal/containers/kube-controller-manager-3.inspect
...
./resources/
./resources/nodes.list
./resources/masters.list
./resources/containers
./resources/api-pods
./resources/apiservices.json
./resources/clusteroperators.json

See https://jira.coreos.com/browse/CORS-1050

/cc @jstuever

TODO:

  • Gather journal/container status and logs from master/nodes when apiserver is not up

Followup:

  • Accept control plane hosts via arguments in order to ensure this still works even if there is no API from which to fetch the list of masters

@openshift-ci-robot openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 9, 2019

FILTER=gzip queue resources/openapi.json.gz ${OC} get --raw /openapi/v2

echo "Gathering node journals ..."
Contributor:

There is an expectation that the apiserver is not up when we are running this script.

Member Author:

Right, added a TODO item to handle that, but I don't think it would hurt to attempt to collect those (along with other resources that assume the API server is working).

Member:

Some APIs may be available even if bootstrap failed. I believe @deads2k called out he'd like to fetch a list of api resources if possible.

Contributor:

There is an expectation that the apiserver is not up when we are running this script.

This script must not die if the KAS is down, and it should use the KAS wherever it can. That means that @abhinavdahiya is correct in saying that you cannot rely on oc adm node-logs to work.

I think that a best effort section for gathering cluster-specific resources that does a series of oc get pod -ojson, oc get node -ojson and the like (see https://github.com/openshift/release/blob/master/ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml#L361-L381) is valuable.
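A best-effort gather section along these lines could look like the sketch below. This is illustrative only: the helper name `best_effort`, the paths, and the demo commands are assumptions, and the real `oc` invocations are shown only in comments since the API server may be unreachable.

```shell
#!/usr/bin/env bash
# Sketch of a best-effort resource gather: every command is allowed to
# fail without killing the script, since the apiserver may be down.
ARTIFACTS="${ARTIFACTS:-/tmp/artifacts}"
mkdir -p "${ARTIFACTS}/resources"

best_effort() {
    # $1 = output file name, rest = command. Failures warn, not abort.
    local out="$1"; shift
    if ! "$@" > "${ARTIFACTS}/resources/${out}" 2>/dev/null; then
        echo "WARNING: could not gather ${out}" >&2
    fi
}

# With a live API server these would be, e.g.:
#   best_effort pods.json  oc --request-timeout=5s get pods --all-namespaces -o json
#   best_effort nodes.json oc --request-timeout=5s get nodes -o json
# Demonstrated here with commands that need no cluster:
best_effort demo.txt echo "gather ok"
best_effort missing.json false   # fails, warns, script keeps going
```

The `--request-timeout` idea is a hedge so a dead apiserver cannot hang the whole gather.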

@sdodson commented Apr 10, 2019:

👍 on not relying on oc adm node-logs, we need a 100% successful method for that.

#!/usr/bin/env bash

ARTIFACTS="${1:-/tmp/artifacts}"
mkdir -p "${ARTIFACTS}"
Member:

You'll want an earlier set -e or some &&-chaining or some such to error the script out when commands like this one fail for whatever reason.
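The `set -e` suggestion could be sketched like this; the paths are illustrative and the `&&`-chaining alternative is shown only as a comment.

```shell
#!/usr/bin/env bash
# Sketch: fail the whole script as soon as any setup command fails,
# instead of continuing with a half-made artifacts tree.
set -euo pipefail

ARTIFACTS="${1:-/tmp/artifacts}"
mkdir -p "${ARTIFACTS}/bootstrap/journals"

# Equivalent &&-chaining style, without set -e:
#   mkdir -p "${ARTIFACTS}/bootstrap/journals" && cd "${ARTIFACTS}"
echo "artifacts dir ready: ${ARTIFACTS}"
```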

Member:

Also, no need for the mkdir here, since that's covered by the mkdir -p "${ARTIFACTS}/bootstrap/journals" a few lines down.

ARTIFACTS="${1:-/tmp/artifacts}"
mkdir -p "${ARTIFACTS}"

echo "Gathering bootstrap journals ..."
Member:

Probably write these to stderr, if we plan on using stdout to stream a tarball back to the install host?
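The stderr idea can be sketched in a couple of lines; the `log` helper name and the payload string are made up for illustration.

```shell
# Progress messages go to fd 2 so stdout stays clean for streaming a
# tarball back over ssh.
log() { echo "$@" >&2; }

# Anything captured from stdout is only the payload; log lines bypass it.
payload="$(log "Gathering bootstrap journals ..."; echo "TARBALL-BYTES")"
echo "${payload}"   # TARBALL-BYTES
```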

Member:

If managing stdout turns out to be too risky we can revert to using scp I guess, but it'd be nice if this were a one-liner that resulted in a tarball on the local host.

Member:

... but it'd be nice if this were a one-liner that resulted in a tarball on the local host.

Hey, you can ssh ... && scp ... all on one line ;).

mkdir -p "${ARTIFACTS}/bootstrap/containers"
for container in $(crictl ps --all --quiet)
do
container_name=$(crictl ps -a --id ${container} -v | grep -oP "Name: \K(.*)")
Member:

I would rather use --output=json and then use python or some such (we don't have jq handy) to extract the container name from the JSON.

Contributor:

don't make us depend on python on RHCOS hosts ;)

Member:

don't make us depend on python on RHCOS hosts ;)

grep for structured data makes me sad ;). Maybe we can update crictl to get Go templating or some such to give us an eventual off ramp?
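For comparison, the JSON route discussed here could look like the sketch below. The JSON sample is canned (crictl is not assumed to be installed), its shape follows `crictl ps --output=json`'s `containers` list as I understand it, and python3 stands in for jq, which the thread notes is unavailable on RHCOS; the container ID and name are made up.

```shell
# Sketch: extract container names from crictl-style JSON instead of
# grepping free-form text.
sample='{"containers":[{"id":"a24cf71f605b","metadata":{"name":"etcd-member"}}]}'
name="$(printf '%s' "${sample}" | python3 -c '
import json, sys
for c in json.load(sys.stdin)["containers"]:
    print(c["metadata"]["name"])
')"
echo "${name}"   # etcd-member
```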

for container in $(crictl ps --all --quiet)
do
container_name=$(crictl ps -a --id ${container} -v | grep -oP "Name: \K(.*)")
crictl logs "${container}" >& "${ARTIFACTS}/bootstrap/containers/${container_name}.log"
Member:

We probably want to at least attempt to include logs from previous attempts at this container, via --last with some kind of suffix here.

@sdodson sdodson force-pushed the installer-gather branch 4 times, most recently from 6dad200 to 40e9fc9 Compare April 12, 2019 14:46

sdodson commented Apr 12, 2019

@wking PTAL
I've taken this over from Vadim while he's focused on CI. We intend to follow up with cleanup but wanted to get this in so we could start gathering this data in CI jobs that fail to attain bootstrap complete.


sdodson commented Apr 12, 2019

I'll work on addressing the shellcheck errors now.


wking commented Apr 12, 2019

Still some shellcheck issues:

./data/data/bootstrap/files/usr/local/bin/installer-gather.sh:19:81: note: Backslash is literal in "\K". Prefer explicit escaping: "\\K". [SC1117]
./data/data/bootstrap/files/usr/local/bin/installer-masters-gather.sh:17:40: note: Double quote to prevent globbing and word splitting. [SC2086]
./data/data/bootstrap/files/usr/local/bin/installer-masters-gather.sh:17:74: note: Backslash is literal in "\K". Prefer explicit escaping: "\\K". [SC1117]


deads2k commented Apr 12, 2019

@vrutkovs I saw this used on a down cluster this morning and the output from it is great.


wking commented Apr 12, 2019

I've pushed 7bca822 onto the top of this branch, using while read ... to avoid some of the ShellCheck issues and adding some more quoting. @vrutkovs, assuming that this gets through CI (I removed a few ShellCheck comments in the hopes that I could find ways to avoid those too, but they might fail this round), are you ok with us squashing these changes down onto your original commit, or would you prefer they stay separate?


sdodson commented Apr 12, 2019 via email

@wking wking force-pushed the installer-gather branch 2 times, most recently from 25796cc to bc3f820 Compare April 13, 2019 00:15

wking commented Apr 13, 2019

e2e-aws:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1561/pull-ci-openshift-installer-master-e2e-aws/5282/artifacts/e2e-aws/pods.json | pods-with-issues | grep -B1 'cannot get resource\|master has not created a default'
W0413 01:10:20.870684       1 authentication.go:272] Unable to get configmap/extension-apiserver-authentication in kube-system.  Usually fixed by 'kubectl create rolebinding -n kube-system ROLE_NAME --role=extension-apiserver-authentication-reader --serviceaccount=YOUR_NS:YOUR_SA'
configmaps "extension-apiserver-authentication" is forbidden: User "system:kube-controller-manager" cannot get resource "configmaps" in API group "" in the namespace "kube-system"
--
I0413 00:59:11.440413    2541 node.go:267] Starting openshift-sdn network plugin
F0413 00:59:11.478516    2541 cmd.go:114] Failed to start sdn: failed to validate network configuration: master has not created a default cluster network, network plugin "redhat/openshift-ovs-networkpolicy" can not start
--
I0413 00:59:11.630112    2262 node.go:267] Starting openshift-sdn network plugin
F0413 00:59:11.650826    2262 cmd.go:114] Failed to start sdn: failed to validate network configuration: master has not created a default cluster network, network plugin "redhat/openshift-ovs-networkpolicy" can not start

This is rhbz#1698672:

/retest


wking commented Apr 13, 2019

e2e-aws:

Flaky tests:

[Feature:Platform][Smoke] Managed cluster should start all core operators [Suite:openshift/conformance/parallel]
[sig-network] Services should have session affinity work for NodePort service [Suite:openshift/conformance/parallel] [Suite:k8s]

Failing tests:

[sig-storage] In-tree Volumes [Driver: aws] [Testpattern: Dynamic PV (ext4)] volumes should allow exec of files on the volume [Suite:openshift/conformance/parallel] [Suite:k8s]

/retest


wking commented Apr 14, 2019

Green :)

@vrutkovs vrutkovs force-pushed the installer-gather branch 5 times, most recently from cef6dce to 8d67a3f Compare April 15, 2019 15:08
mkdir -p "${ARTIFACTS}/bootstrap/containers"
sudo crictl ps --all --quiet | while read -r container
do
container_name="$(sudo crictl ps -a --id "${container}" -v | grep -oP "Name: \\K(.*)")"
Contributor:

Should we stay consistent with options? -a is the same as --all
Do we need to run this command twice? Why don't we get the json output, or parse this information line by line?

sudo crictl ps --all | grep -v CONTAINER_ID | while read -r container_line
do
    container_name="$(awk '{print $5}' <<< "${container_line}")"   # something like this
done

This would save having to make two crictl calls for information you fundamentally already have.

}
mkdir -p "${ARTIFACTS}/control-plane" "${ARTIFACTS}/resources"

echo "Gathering cluster resources ..."
Contributor:

If the bootstrap API is down, will any of these commands (below) still work? They all use oc, which could be problematic in that case.

Should we consider other ways to get at this information?

Member Author:

Some might work; it depends on how bad the state is.


echo "Gather remote logs"
export MASTERS=()
if [ "$(stat --printf="%s" "${ARTIFACTS}/resources/masters.list")" -ne "0" ]
Member:

if test -s "${ARTIFACTS}/resources/masters.list"

?
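The suggested `test -s` form can be seen on a throwaway file: it is true exactly when the file exists and has size greater than zero, matching the intent of the stat size check. The mktemp path and the sample content are ephemeral and illustrative.

```shell
# Sketch: test -s as a replacement for stat --printf="%s" ... -ne 0.
f="$(mktemp)"
test -s "${f}" && status="populated" || status="empty"
echo "${status}"   # empty: mktemp creates a zero-byte file
echo "master-0" > "${f}"
test -s "${f}" && status="populated" || status="empty"
echo "${status}"   # populated
rm -f "${f}"
```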

wait

echo "Gather remote logs"
export MASTERS=()
Member:

No need to export this is there? You're just using it in the local shell.

# Find out master IPs from etcd discovery record
DOMAIN=$(sudo oc --config=/opt/openshift/auth/kubeconfig whoami --show-server | grep -oP "api.\\K([a-z\\.]*)")
# shellcheck disable=SC2031
mapfile -t MASTERS < "$(dig -t SRV "_etcd-server-ssl._tcp.${DOMAIN}" +short | cut -f 4 -d ' ' | sed 's/.$//')"
Member:

nit: s/.$// -> s/\.$//
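With the escaping nit applied, the SRV parsing pipeline can be exercised against canned `dig +short` output; the hostnames below are made up, and `printf` stands in for the live DNS query.

```shell
# Sketch: parse SRV targets into MASTERS, stripping the trailing dot
# with an explicitly escaped literal: s/\.$//
srv_output='0 10 2380 etcd-0.example.com.
0 10 2380 etcd-1.example.com.'

mapfile -t MASTERS < <(printf '%s\n' "${srv_output}" | cut -f 4 -d ' ' | sed 's/\.$//')
printf '%s\n' "${MASTERS[@]}"   # etcd-0.example.com, then etcd-1.example.com
```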


sdodson commented Apr 15, 2019

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 15, 2019
@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 15, 2019
@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Apr 15, 2019

wking commented Apr 15, 2019

/lgtm
/hold cancel

@openshift-ci-robot openshift-ci-robot added lgtm Indicates that a PR is ready to be merged. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Apr 15, 2019
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sdodson, vrutkovs, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 15, 2019
sdodson added a commit to sdodson/release that referenced this pull request Apr 15, 2019
Don't remove existing artifact gathering just yet.

Depends on openshift/installer#1561
@sdodson
Copy link
Member

sdodson commented Apr 15, 2019

/retest
certified-operators crashlooping

        "Pod openshift-marketplace/certified-operators-cb6975674-gcnkt is not healthy: Back-off 5m0s restarting failed container=certified-operators pod=certified-operators-cb6975674-gcnkt_openshift-marketplace(aa89f2af-5faf-11e9-801b-0abc0535887e)",

@openshift-merge-robot openshift-merge-robot merged commit 2b2487c into openshift:master Apr 15, 2019

echo "Gathering bootstrap containers ..."
mkdir -p "${ARTIFACTS}/bootstrap/containers"
sudo crictl ps --all --quiet | while read -r container
Contributor:

does crictl ps --all --quiet include containers which have terminated (e.g. due to invalid flags)?

Member Author:

Yeah, it should list those too; --all includes all states.

sdodson added a commit to sdodson/release that referenced this pull request Apr 18, 2019
Don't remove existing artifact gathering just yet.

Depends on openshift/installer#1561
sdodson added a commit to sdodson/release that referenced this pull request Apr 22, 2019
Don't remove existing artifact gathering just yet.

Depends on openshift/installer#1561