
Conversation

@rajatcing

Signed-off-by: RAJAT SINGH rajasing@redhat.com
Fixes #840

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 11, 2020
@rajatcing
Author

@jarrpa I have some doubts that need some clarification.

@rajatcing rajatcing force-pushed the must-gather-ceph-commands branch from 873fed9 to 1c64dc0 Compare November 12, 2020 12:02
@rajatcing
Author

rajatcing commented Nov 12, 2020

Well, collection from paths under /var/log works, but the script keeps looping for one of the pods.

sent 138 bytes  received 160035 bytes  24642.00 bytes/sec
total size is 159588  speedup is 1.00
collecting finished for node ip-10-0-131-8.ec2.internal
collecting kernel logs  from node ip-10-0-244-172.ec2.internal
Starting pod/ip-10-0-244-172ec2internal-debug ...
To use host binaries, run `chroot /host`
waiting for the debug pod to be in ready state
waiting for the debug pod to be in ready state
waiting for the debug pod to be in ready state
waiting for the debug pod to be in ready state
waiting for the debug pod to be in ready state
waiting for the debug pod to be in ready state

cc @crombus
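One way to keep a wait loop like the one in the log above from spinning forever is to bound the number of attempts. This is a minimal sketch, not the PR's code: the function name `wait_for_debug_pod`, its arguments, and its defaults are hypothetical, and it assumes the debug pod name and namespace are already known.

```shell
# Hypothetical helper: wait for a debug pod to become Ready, but give up
# after a bounded number of attempts instead of looping forever.
# Arguments: pod name, namespace, max attempts (default 30), delay in seconds (default 2).
wait_for_debug_pod() {
    local pod="$1" namespace="$2" max_attempts="${3:-30}" delay="${4:-2}"
    local attempt=1 status
    while [ "$attempt" -le "$max_attempts" ]; do
        # Read the Ready condition of the pod; empty if the pod does not exist yet.
        status=$(oc get pod "$pod" -n "$namespace" \
            -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' 2>/dev/null)
        if [ "$status" = "True" ]; then
            return 0
        fi
        echo "waiting for the debug pod to be in ready state (${attempt}/${max_attempts})"
        sleep "$delay"
        attempt=$((attempt + 1))
    done
    echo "debug pod ${pod} not ready after ${max_attempts} attempts, skipping node" >&2
    return 1
}
```

A caller could skip the node when the function returns non-zero. Where it is available, `oc wait --for=condition=Ready pod/<name> --timeout=60s` may be a shorter alternative to a hand-written loop.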

@rajatcing
Author

/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 12, 2020
@rajatcing rajatcing marked this pull request as draft November 12, 2020 12:10
Contributor

@crombus crombus left a comment

Thanks for the PR.

Contributor

@crombus crombus left a comment

Based on the comments on the bug: create a function for kernel-level logs that collects the whole /var/log folder using rsync, plus the output of journalctl -o verbose. Call that function inside crash_collection() before the rsync command, and make sure it starts in a separate process. This will work fine for now.
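The suggestion above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the PR: `kernel_log_collection`, the argument order, the `BASE_COLLECTION_PATH` variable, and the output layout are all hypothetical, and the sketch assumes a debug pod for the node is already running with the host filesystem mounted under /host.

```shell
# Hypothetical kernel-level log collector for one node.
# Assumptions: the debug pod is already running, and BASE_COLLECTION_PATH
# points at the must-gather output directory.
kernel_log_collection() {
    local node="$1" debug_pod="$2" namespace="$3"
    local dest="${BASE_COLLECTION_PATH:-/must-gather}/kernel_logs/${node}"
    mkdir -p "$dest"
    # Copy the whole /var/log tree of the node (mounted under /host in the debug pod).
    oc rsync -n "$namespace" "${debug_pod}:/host/var/log" "$dest" \
        || echo "rsync of /var/log failed for node ${node}" >&2
    # Capture the journal in verbose format using the host's journalctl.
    oc exec -n "$namespace" "$debug_pod" -- chroot /host journalctl -o verbose --no-pager \
        > "${dest}/journal_verbose.log" 2>/dev/null \
        || echo "journalctl collection failed for node ${node}" >&2
}

# Hypothetical outline of crash_collection() showing only where the call would go.
crash_collection() {
    local node="$1" debug_pod="$2" namespace="$3"
    # Start kernel-log collection as a separate background process.
    kernel_log_collection "$node" "$debug_pod" "$namespace" &
    local kernel_pid=$!
    # The existing crash rsync step would run here.
    # Wait for the background collection before the debug pod is removed.
    wait "$kernel_pid"
}
```

Running the collector in the background lets the crash rsync proceed in parallel; the final `wait` keeps the debug pod alive until both are done.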

@rajatcing rajatcing force-pushed the must-gather-ceph-commands branch from 1c64dc0 to 67fc1fd Compare November 24, 2020 05:44
@rajatcing
Author

That makes sense, but I'm hitting an issue that you might need to take a look at.

@rajatcing rajatcing force-pushed the must-gather-ceph-commands branch 2 times, most recently from e1fb847 to 8d29cd9 Compare November 24, 2020 05:49
@rajatcing
Author

Added some code for you to take a look at 👍

@crombus crombus added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Nov 24, 2020
@rajatcing rajatcing force-pushed the must-gather-ceph-commands branch 3 times, most recently from 72c3cd2 to dcc1b67 Compare November 26, 2020 11:44
@rajatcing rajatcing force-pushed the must-gather-ceph-commands branch from dcc1b67 to e8150b8 Compare December 1, 2020 10:45
@rajatcing rajatcing requested a review from jarrpa December 1, 2020 14:02
@rajatcing rajatcing force-pushed the must-gather-ceph-commands branch 2 times, most recently from 2b1a0ab to 3c96f2d Compare December 2, 2020 08:31
@rajatcing
Author

@crombus While I see what you're going for with requesting the functions be moved to the cluster-scoped file, I think it makes sense to leave them in the Ceph file. Though they operate on Nodes, which are cluster-scoped, they are being run for the purpose of collecting system-level information to troubleshoot Ceph problems specifically. And we certainly want to avoid looping through nodes more times than is strictly necessary.

Right Jose, that is what I was thinking. It does not make sense to start debug pods twice: once for journal collection and then again for Ceph-related collection. We can keep it as it is for now and start making changes in the next PRs, as I described above.
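The point about visiting each node only once can be illustrated with a short sketch. Every helper named here (`start_debug_pod`, `stop_debug_pod`, `journal_collection`, `ceph_crash_collection`) is a placeholder for whatever the script actually defines; only `oc get nodes -o name` is a real command.

```shell
# Hypothetical single pass over the nodes: one debug pod per node,
# shared by every per-node collector.
collect_from_all_nodes() {
    local node name
    for node in $(oc get nodes -o name); do
        # "oc get nodes -o name" prints entries such as "node/<name>"; strip the prefix.
        name="${node#node/}"
        # Skip the node if its debug pod cannot be started.
        start_debug_pod "$name" || continue
        journal_collection "$name"
        ceph_crash_collection "$name"
        stop_debug_pod "$name"
    done
}
```

With this shape, adding another per-node collector means adding one call inside the loop rather than a second loop with its own debug pods.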

@rajatcing rajatcing force-pushed the must-gather-ceph-commands branch 3 times, most recently from d4a9878 to a2db3a6 Compare December 9, 2020 11:06
@rajatcing rajatcing requested a review from crombus December 9, 2020 14:09
@crombus
Contributor

crombus commented Dec 9, 2020

@rajatsing I tested this locally and hit some strange issues:

[must-gather-5mnnl] POD rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1650) [Receiver=3.1.2]
[must-gather-5mnnl] POD rsync: [Receiver] write error: Broken pipe (32)
[must-gather-5mnnl] POD error: exit status 23

@rajatcing
Author

@rajatsing I tested in my local I had some strange issues

[must-gather-5mnnl] POD rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1650) [Receiver=3.1.2]
[must-gather-5mnnl] POD rsync: [Receiver] write error: Broken pipe (32)
[must-gather-5mnnl] POD error: exit status 23

I believe these errors occur because the path does not exist on the node:

[must-gather-h7hzm] POD rsync: link_stat "/host/var/log/sysstat" failed: No such file or directory (2)
[must-gather-h7hzm] POD rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1659) [Receiver=3.1.3]
[must-gather-h7hzm] POD rsync: [Receiver] write error: Broken pipe (32)
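If a missing source path is indeed the cause, one possible guard is to check each path before handing it to rsync. This is a sketch under assumptions: `rsync_existing_paths` and the `HOST_ROOT` variable are hypothetical names, and the sketch assumes it runs where the host filesystem is mounted under /host.

```shell
# Hypothetical helper: rsync only the log paths that exist on the node,
# so a missing directory does not cause a partial-transfer error (code 23).
# Usage: rsync_existing_paths <destination> <path> [<path> ...]
rsync_existing_paths() {
    local dest="$1" root="${HOST_ROOT:-/host}" path
    shift
    for path in "$@"; do
        if [ -e "${root}${path}" ]; then
            rsync -a "${root}${path}" "$dest" || echo "rsync failed for ${path}" >&2
        else
            echo "skipping ${path}: not present on this node" >&2
        fi
    done
}
```

For example, `rsync_existing_paths /must-gather/logs /var/log/audit /var/log/sysstat` would copy /var/log/audit and only log a message for /var/log/sysstat on a node that lacks it; the destination path here is illustrative.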

@rajatcing rajatcing force-pushed the must-gather-ceph-commands branch from a2db3a6 to 8f71e48 Compare December 10, 2020 10:13
@rajatcing
Author

I don't understand why CI fails when make ocs-operator-ci passes locally. Are there any issues with the CI? It also does not give me any useful error information to work with.

error: some steps failed:
  * could not run steps: step ocs-operator-ci failed: test "ocs-operator-ci" failed: the pod ci-op-rxlt4gb8/ocs-operator-ci was deleted without completing after 30s (failed containers: )
time="2020-12-10T10:17:07Z" level=info msg="Reporting job state 'failed' with reason 'executing_graph:step_failed:running_pod'"

@rajatcing rajatcing force-pushed the must-gather-ceph-commands branch from 8f71e48 to 82c1b84 Compare December 10, 2020 11:09
Contributor

@crombus crombus left a comment

Overall looks good. Just one request: I think you can merge the two functions journal_collection and kernel_collections.

Signed-off-by: RAJAT SINGH <rajasing@redhat.com>
@rajatcing rajatcing force-pushed the must-gather-ceph-commands branch from 82c1b84 to 69903ed Compare December 10, 2020 11:54
@rajatcing rajatcing requested a review from crombus December 10, 2020 11:55
Contributor

@crombus crombus left a comment

perfect.

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: crombus

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 10, 2020
@openshift-merge-robot
Contributor

@rajatsing: The following test failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/red-hat-storage-ocs-ci-e2e-aws | 69903ed | link | /test red-hat-storage-ocs-ci-e2e-aws |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@crombus
Contributor

crombus commented Dec 11, 2020

/hold until the decision is made on this.

@crombus crombus added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 11, 2020
@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 11, 2021
@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 10, 2021
@openshift-ci-robot

@rajatsing: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-bot

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci
Contributor

openshift-ci bot commented May 10, 2021

@openshift-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot closed this May 10, 2021

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files.
do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command.
kind/feature Categorizes issue or PR as related to a new feature.
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.
needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD.
ok-to-test Indicates a non-member PR verified by an org member that is safe to test.

Development

Successfully merging this pull request may close these issues.

must-gather: add kernel level logs

8 participants