
must-gather: add more info for ceph crash #818

Merged 1 commit on Oct 14, 2020

Contributor

@crombus commented Oct 12, 2020

Add a command to collect ceph crash info for every crash,
and collect the core dumps of Rook crashes.

Signed-off-by: crombus <pkundra@redhat.com>

@@ -159,4 +162,15 @@ for ns in $namespaces; do
{ timeout 120 oc debug nodes/"${node}" -- bash -c "test -f /host/var/lib/rook/log/${ns}/ceph-volume.log && cat /host/var/lib/rook/log/${ns}/ceph-volume.log" > "${NODE_OUTPUT_DIR}"/ceph-volume.log; } >> "${BASE_COLLECTION_PATH}"/gather-debug.log 2>&1
done
oc delete -f pod_helper.yaml

# Collecting ceph prepare volume logs
Contributor

This comment still reads "Collecting ceph prepare volume logs"; it should say that the block captures crash dumps.

Contributor Author

done

@@ -159,4 +162,15 @@ for ns in $namespaces; do
{ timeout 120 oc debug nodes/"${node}" -- bash -c "test -f /host/var/lib/rook/log/${ns}/ceph-volume.log && cat /host/var/lib/rook/log/${ns}/ceph-volume.log" > "${NODE_OUTPUT_DIR}"/ceph-volume.log; } >> "${BASE_COLLECTION_PATH}"/gather-debug.log 2>&1
done
oc delete -f pod_helper.yaml

# Collecting ceph crash dump
for node in $(oc get nodes -l cluster.ocs.openshift.io/openshift-storage='' --no-headers | grep -w 'Ready' | awk '{print $1}'); do
Contributor

Suggested change
- for node in $(oc get nodes -l cluster.ocs.openshift.io/openshift-storage='' --no-headers | grep -w 'Ready' | awk '{print $1}'); do
+ for node in $(oc get nodes -l cluster.ocs.openshift.io/openshift-storage='' --no-headers | awk '/Ready/ {print $1}'); do

Contributor Author

done
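
One caveat with the awk suggestion above: the regex /Ready/ also matches NotReady, which grep -w 'Ready' did not. A minimal sketch of a stricter filter follows, assuming the default `oc get nodes --no-headers` column layout (name in column 1, status in column 2). `ready_nodes` is a hypothetical helper name and the sample node names are made up.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print names of nodes whose STATUS field contains
# the exact token "Ready" (e.g. "Ready" or "Ready,SchedulingDisabled"),
# but not "NotReady". Reads `oc get nodes --no-headers`-style text on stdin.
ready_nodes() {
  awk '{
    n = split($2, conds, ",")
    for (i = 1; i <= n; i++) {
      if (conds[i] == "Ready") { print $1; break }
    }
  }'
}

# Made-up sample input for illustration only.
sample_nodes='node-a   Ready                      worker   10d   v1.19.0
node-b   NotReady                   worker   10d   v1.19.0
node-c   Ready,SchedulingDisabled   worker   10d   v1.19.0'

printf '%s\n' "${sample_nodes}" | ready_nodes
```

In the loop under review, the pipeline after `--no-headers` could then be replaced by `| ready_nodes`, if excluding NotReady nodes is the intent.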

printf "collecting crash logs from node %s \n" "${node}" | tee -a "${BASE_COLLECTION_PATH}"/gather-debug.log
CRASH_OUTPUT_DIR=${CEPH_COLLECTION_PATH}/crash_${node}
mkdir -p "${CRASH_OUTPUT_DIR}"
oc debug nodes/"${node}" --to-namespace="${ns}" -- bash -c "sleep 5m" & >> "${BASE_COLLECTION_PATH}"/gather-debug.log 2>&1
Contributor

Suggested change
- oc debug nodes/"${node}" --to-namespace="${ns}" -- bash -c "sleep 5m" & >> "${BASE_COLLECTION_PATH}"/gather-debug.log 2>&1
+ oc debug nodes/"${node}" --to-namespace="${ns}" -- bash -c "sleep 5m" & >> "${BASE_COLLECTION_PATH}"/gather-debug.log 2>&1

Contributor Author

done
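
A note on the line above: in `cmd & >> file 2>&1` the `&` terminates the command, so the redirection that follows applies to an empty command and the output of `oc debug` does not reach gather-debug.log. A minimal sketch of the ordering, using a hypothetical `start_debug_pod` function as a stand-in for `oc debug` and a temporary file as a stand-in for the log (both are assumptions for illustration):

```shell
#!/usr/bin/env bash
# Stand-in for "${BASE_COLLECTION_PATH}"/gather-debug.log.
log_file="$(mktemp)"

# Hypothetical stand-in for the long-running `oc debug ... sleep 5m` call.
start_debug_pod() {
  echo "starting debug pod for $1"
  echo "warning from debug pod" >&2
}

# Redirections come before the trailing `&`, so both stdout and stderr of
# the background job are appended to the log.
start_debug_pod "node-a" >> "${log_file}" 2>&1 &
bg_pid=$!

# Wait for the background job only so the example can print the log.
wait "${bg_pid}"
cat "${log_file}"
```

The real script would not wait on the debug pod here; the `wait` exists only to make the sketch deterministic.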

mkdir -p "${CRASH_OUTPUT_DIR}"
oc debug nodes/"${node}" --to-namespace="${ns}" -- bash -c "sleep 5m" & >> "${BASE_COLLECTION_PATH}"/gather-debug.log 2>&1
sleep 60
oc rsync -n "${ns}" `oc get pods -n "${ns}"| grep debug| awk '{print $1}'`:/host/var/lib/rook/openshift-storage/crash/ "${CRASH_OUTPUT_DIR}"
Contributor

Suggested change
- oc rsync -n "${ns}" `oc get pods -n "${ns}"| grep debug| awk '{print $1}'`:/host/var/lib/rook/openshift-storage/crash/ "${CRASH_OUTPUT_DIR}"
+ oc rsync -n "${ns}" $(oc get pods -n "${ns}"|awk '/debug/ {print $1}'):/host/var/lib/rook/openshift-storage/crash/ "${CRASH_OUTPUT_DIR}"

Contributor Author

done

@@ -136,6 +136,9 @@ for ns in $namespaces; do
for i in $(timeout 120 oc -n "${ns}" exec "${HOSTNAME}"-helper -- bash -c "ceph osd lspools --connect-timeout=15"|awk '{print $2}'); do
{ timeout 120 oc -n "${ns}" exec "${HOSTNAME}"-helper -- bash -c "rbd ls -p $i" >> "${COMMAND_OUTPUT_DIR}/pools_rbd_$i"; } >> "${BASE_COLLECTION_PATH}"/gather-debug.log 2>&1;
done
for i in $(timeout 120 oc -n "${ns}" exec "${HOSTNAME}"-helper -- bash -c "ceph crash ls --connect-timeout=15"|awk '{print $1}'); do
{ timeout 120 oc -n "${ns}" exec "${HOSTNAME}"-helper -- bash -c "ceph crash info $i" >> "${COMMAND_OUTPUT_DIR}/pools_rbd_$i"; } >> "${BASE_COLLECTION_PATH}"/gather-debug.log 2>&1;
Contributor

Don't you need --connect-timeout=15 here as well?

Contributor Author

added
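
The loop under discussion can be sketched as below. This is an illustration under assumptions, not the merged code: `run_ceph` is a hypothetical stub standing in for the `timeout 120 oc -n "${ns}" exec "${HOSTNAME}"-helper -- bash -c "ceph ..."` wrapper, the sample `ceph crash ls` output and crash IDs are made up, the `crash_info_` file prefix is a hypothetical name (the hunk above reuses `pools_rbd_`), and skipping a header row assumes the Ceph release in use prints one.

```shell
#!/usr/bin/env bash
# Stand-in for "${COMMAND_OUTPUT_DIR}".
out_dir="$(mktemp -d)"

# Hypothetical stub replacing the real `oc exec ... ceph ...` call so the
# sketch runs without a cluster. It only echoes canned, made-up data.
run_ceph() {
  case "$*" in
    "crash ls --connect-timeout=15")
      printf 'ID ENTITY NEW\n'
      printf '2020-10-12_10:00:00.000000Z_aaaa osd.0 *\n'
      printf '2020-10-12_11:00:00.000000Z_bbbb mon.a *\n'
      ;;
    "crash info "*" --connect-timeout=15")
      printf '{"crash_id": "%s"}\n' "$3"
      ;;
    *)
      return 1
      ;;
  esac
}

# NR > 1 skips the assumed header row; $1 is the crash ID column.
collect_crash_info() {
  local id
  for id in $(run_ceph crash ls --connect-timeout=15 | awk 'NR > 1 {print $1}'); do
    run_ceph crash info "${id}" --connect-timeout=15 >> "${out_dir}/crash_info_${id}"
  done
}

collect_crash_info
ls "${out_dir}"
```

Passing the timeout to both calls means an unreachable cluster fails quickly in either step rather than waiting for the outer `timeout 120`.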

mkdir -p "${CRASH_OUTPUT_DIR}"
oc debug nodes/"${node}" --to-namespace="${ns}" -- bash -c "sleep 5m" & >> "${BASE_COLLECTION_PATH}"/gather-debug.log 2>&1
sleep 60
oc rsync -n "${ns}" `oc get pods -n "${ns}"| awk '/debug/{print $1}'`:/host/var/lib/rook/openshift-storage/crash/ "${CRASH_OUTPUT_DIR}"
Contributor

Backticks are the legacy form of command substitution; use $() instead.

Contributor Author

done
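
To illustrate the point about command substitution: `$(...)` nests without escaping, whereas nested backticks need backslashes. A small sketch follows, with a hypothetical `list_pods` stub and made-up pod names standing in for `oc get pods -n "${ns}"`:

```shell
#!/usr/bin/env bash
# Hypothetical stub for `oc get pods -n "${ns}"`; pod names are made up.
list_pods() {
  printf 'NAME READY STATUS\n'
  printf 'rook-ceph-mon-a 1/1 Running\n'
  printf 'node-a-debug 1/1 Running\n'
}

# $() form: the inner pipeline needs no extra escaping.
debug_pod="$(list_pods | awk '/debug/ {print $1}')"

# Build the rsync source argument from it (printed here, not executed).
rsync_src="${debug_pod}:/host/var/lib/rook/openshift-storage/crash/"
echo "${rsync_src}"

# Nesting is where the two forms differ: $() nests as-is.
upper_pod="$(echo "$(list_pods | awk '/debug/ {print $1}')" | tr 'a-z' 'A-Z')"
echo "${upper_pod}"
```

If more than one pod name contains "debug", the substitution yields several names, so a real script may want a stricter match than /debug/.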

@openshift-ci-robot added the lgtm and approved labels Oct 13, 2020
@openshift-bot

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-ci-robot removed the lgtm label Oct 13, 2020
@@ -136,6 +136,7 @@ for ns in $namespaces; do
for i in $(timeout 120 oc -n "${ns}" exec "${HOSTNAME}"-helper -- bash -c "ceph osd lspools --connect-timeout=15"|awk '{print $2}'); do
{ timeout 120 oc -n "${ns}" exec "${HOSTNAME}"-helper -- bash -c "rbd ls -p $i" >> "${COMMAND_OUTPUT_DIR}/pools_rbd_$i"; } >> "${BASE_COLLECTION_PATH}"/gather-debug.log 2>&1;
done
<<<<<<< HEAD
Member

Remove this.

Contributor Author

done
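
A stray `<<<<<<< HEAD` line like the one flagged above can be caught before pushing. A minimal sketch, assuming a plain grep over the script is acceptable; `has_conflict_markers` is a hypothetical helper name and the sample file is made up:

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed (exit 0) if the file contains a line that
# starts with a Git conflict marker, printing the matching lines.
has_conflict_markers() {
  grep -nE '^(<<<<<<<|=======|>>>>>>>)( |$)' "$1"
}

# Made-up sample script containing a stray marker, as in the hunk above.
sample_script="$(mktemp)"
printf '%s\n' 'done' '<<<<<<< HEAD' 'for i in 1 2; do' > "${sample_script}"

if has_conflict_markers "${sample_script}"; then
  echo "conflict markers found; remove them before merging"
fi
```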

@crombus force-pushed the crash branch 7 times, most recently from 061008f to 53dc021, October 13, 2020 16:03
add command to check for ceph crash info
for every crash and collect core dump of
rook crash.

Signed-off-by: crombus <pkundra@redhat.com>
Contributor Author

@crombus commented Oct 13, 2020

/retest

Member

@agarwal-mudit left a comment

lgtm

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: agarwal-mudit, jarrpa

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot added the lgtm label Oct 14, 2020
@agarwal-mudit
Member

/retest

@openshift-ci-robot

@crombus: The following test failed, say /retest to rerun all failed tests:

Test name | Commit | Details | Rerun command
ci/prow/red-hat-storage-ocs-ci-e2e-aws | 66aabd7 | link | /test red-hat-storage-ocs-ci-e2e-aws

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot merged commit 966b77e into red-hat-storage:master Oct 14, 2020
Contributor Author

@crombus commented Oct 14, 2020

/cherry-pick release-4.6

@openshift-cherrypick-robot

@crombus: new pull request created: #830

In response to this:

/cherry-pick release-4.6


@crombus deleted the crash branch October 19, 2020 04:41
Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
lgtm: Indicates that a PR is ready to be merged.
9 participants