add pod lifecycle intervals to separate pages #26908

Merged
merged 13 commits into openshift:master on Apr 4, 2022

Conversation

deads2k (Contributor) commented Mar 14, 2022

This may be the final edition. It sorts the pod lifecycle on the intervals chart by namespace.

There are still lots and lots of them, but you can see it ripple out by namespace if this works.

The openshift-tests artifact directory now contains multiple intervals charts, one per namespace, showing the pods in that namespace. This makes it possible to determine which pods were running under what conditions. Keep in mind that readiness here reflects the container status, not the success or failure of an individual check.
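
For anyone inspecting those artifacts locally, a minimal Go sketch that loads one intervals file and prints each interval (the file name is a placeholder, and the struct only mirrors the fields visible in the JSON excerpts quoted later in this thread, not the actual monitorapi types):

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// eventInterval mirrors the fields seen in the e2e-intervals_*.json files.
type eventInterval struct {
	Level   string `json:"level"`
	Locator string `json:"locator"`
	Message string `json:"message"`
	From    string `json:"from"`
	To      string `json:"to"`
}

type intervalList struct {
	Items []eventInterval `json:"items"`
}

func main() {
	raw, err := os.ReadFile("e2e-intervals_everything.json") // placeholder path
	if err != nil {
		panic(err)
	}
	var list intervalList
	if err := json.Unmarshal(raw, &list); err != nil {
		panic(err)
	}
	for _, item := range list.Items {
		fmt.Printf("%s -> %s  %s  %s\n", item.From, item.To, item.Locator, item.Message)
	}
}
```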

openshift-ci bot commented Mar 14, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Mar 14, 2022
deads2k changed the title from "Pod events 4 sort" to "add pod lifecycle intervals to separate pages" on Mar 21, 2022
deads2k (Contributor Author) commented Mar 21, 2022

/retest

1 similar comment

DennisPeriquet (Contributor) commented:

In this job from this PR, I looked at
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/26908/pull-ci-openshift-origin-master-e2e-gcp/1503480460666212352/artifacts/e2e-gcp/openshift-e2e-test/artifacts/junit/e2e-intervals_everything_20220314-220614.json
and searched for "ns/openshift-etcd pod/revision-pruner-7-ci-op-33pz4yh4-2a78c-c8x64-master-1 uid/0f5d9416-976b-4825-953b-084df247fffc", which returned:

        {
            "level": "Info",
            "locator": "ns/openshift-etcd pod/revision-pruner-7-ci-op-33pz4yh4-2a78c-c8x64-master-1 uid/0f5d9416-976b-4825-953b-084df247fffc",
            "message": "constructed/true reason/Created ",
            "from": "2022-03-14T22:06:14Z",
            "to": "2022-03-14T22:06:14Z"
        },
        {
            "level": "Info",
            "locator": "ns/openshift-etcd pod/revision-pruner-7-ci-op-33pz4yh4-2a78c-c8x64-master-1 uid/0f5d9416-976b-4825-953b-084df247fffc",
            "message": "constructed/true reason/Scheduled node/ci-op-33pz4yh4-2a78c-c8x64-master-1",
            "from": "2022-03-14T22:06:14Z",
            "to": "2022-03-14T21:57:54Z"
        },
        {
            "level": "Info",
            "locator": "ns/openshift-etcd pod/revision-pruner-7-ci-op-33pz4yh4-2a78c-c8x64-master-1 uid/0f5d9416-976b-4825-953b-084df247fffc container/pruner",
            "message": "constructed/true reason/NotReady ",
            "from": "2022-03-14T22:06:14Z",
            "to": "2022-03-14T21:57:54Z"
        },

The from and to fields look backwards (from is later than to), so these intervals don't show up on the chart.

DennisPeriquet (Contributor) commented Mar 21, 2022

I looked at this one (de49307):

curl -sk 'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/26908/pull-ci-openshift-origin-master-e2e-aws-serial/1505958001251454976/artifacts/e2e-aws-serial/openshift-e2e-test/artifacts/junit/e2e-intervals_operators_20220321-180600.json'|jq '.items[]|select (.from > .to)'

...
(there are 82 cases where from > to)

{
  "level": "Info",
  "locator": "ns/openshift-etcd pod/installer-4-ip-10-0-140-153.us-east-2.compute.internal uid/da3166b9-2b57-4191-82a2-292d8e3faa75",
  "message": "constructed/true reason/Scheduled node/ip-10-0-140-153.us-east-2.compute.internal",
  "from": "2022-03-21T18:06:01Z",
  "to": "2022-03-21T17:51:32Z"
}
{
  "level": "Info",
  "locator": "ns/openshift-etcd pod/installer-4-ip-10-0-140-153.us-east-2.compute.internal uid/da3166b9-2b57-4191-82a2-292d8e3faa75 container/installer",
  "message": "constructed/true reason/NotReady ",
  "from": "2022-03-21T18:06:01Z",
  "to": "2022-03-21T17:51:32Z"
}

DennisPeriquet (Contributor) commented Mar 21, 2022

Here's another one (using CI for de49307):

$ curl -sk 'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/26908/pull-ci-openshift-origin-master-e2e-aws-csi/1505935488840634368/artifacts/e2e-aws-csi/openshift-e2e-test/artifacts/junit/e2e-intervals_kube-apiserver_20220321-164339.json' |jq '.items[]|select (.from > .to)'

...
{
  "level": "Info",
  "locator": "ns/openshift-kube-apiserver pod/revision-pruner-7-ip-10-0-214-214.us-west-1.compute.internal uid/6a81964d-169c-47e0-a986-551429370ae9",
  "message": "constructed/true reason/Scheduled node/ip-10-0-214-214.us-west-1.compute.internal",
  "from": "2022-03-21T16:43:39Z",
  "to": "2022-03-21T16:43:14Z"
}
{
  "level": "Info",
  "locator": "ns/openshift-kube-apiserver pod/revision-pruner-7-ip-10-0-214-214.us-west-1.compute.internal uid/6a81964d-169c-47e0-a986-551429370ae9 container/pruner",
  "message": "constructed/true reason/NotReady ",
  "from": "2022-03-21T16:43:39Z",
  "to": "2022-03-21T16:43:14Z"
}

if (m[2] == "ContainerStart") {
    return [item.locator, ` (container lifecycle)`, "ContainerStart"];
}
}
Contributor:

should we add a case for ContainerExit?

Contributor Author:

> should we add a case for ContainerExit?

container exit shouldn't have an interval because that's the absence of an interval, right?

continue
}
annotationTokens := strings.Split(curr, "/")
annotations[annotationTokens[0]] = annotationTokens[1]
Contributor:

small savings but can we just do this:

if annotationTokens[0] == "reason" {
  return annotationTokens[1]
}

though I'm not sure what to return if you didn't find "reason" in the tokens.

Contributor Author:

> small savings but can we just do this:
>
> if annotationTokens[0] == "reason" {
>   return annotationTokens[1]
> }
>
> though I'm not sure what to return if you didn't find "reason" in the tokens.

I'd like to keep the logic that produces the annotations because I suspect we will need it again.
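
For reference, here is a minimal Go sketch of the annotation parsing under discussion (the function name and the program scaffolding are assumptions; only the splitting logic comes from the diff context above). It shows why keeping the full map is useful: callers can later look up annotations other than reason/.

```go
package main

import (
	"fmt"
	"strings"
)

// annotationsFromMessage splits a monitor message such as
// "constructed/true reason/Scheduled node/master-1" into a key/value map.
func annotationsFromMessage(message string) map[string]string {
	annotations := map[string]string{}
	for _, curr := range strings.Split(message, " ") {
		if !strings.Contains(curr, "/") {
			continue // skip tokens that are not key/value pairs
		}
		annotationTokens := strings.SplitN(curr, "/", 2)
		annotations[annotationTokens[0]] = annotationTokens[1]
	}
	return annotations
}

func main() {
	m := annotationsFromMessage("constructed/true reason/Scheduled node/master-1")
	fmt.Println(m["reason"], m["node"]) // Scheduled master-1
}
```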

deads2k (Contributor Author) commented Mar 22, 2022

/retest

deads2k (Contributor Author) commented Mar 22, 2022

/test all

deads2k (Contributor Author) commented Mar 22, 2022

@DennisPeriquet ok, worked it out. Those pods terminate before the test starts. I can fix the times, but they still won't show up on the chart.
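
A sketch of one possible shape for that time fix (an assumption on my part, not necessarily the PR's actual change): clamp interval timestamps to the monitoring window so pods that terminated before the test started can no longer produce from > to.

```go
package main

import (
	"fmt"
	"time"
)

// clampToWindow is a hypothetical helper: it pins from/to inside the
// monitoring window, so an event that ended before the test started collapses
// to a zero-length interval instead of a reversed one.
func clampToWindow(from, to, windowStart time.Time) (time.Time, time.Time) {
	if from.Before(windowStart) {
		from = windowStart
	}
	if to.Before(from) {
		to = from
	}
	return from, to
}

func main() {
	start := time.Date(2022, 3, 14, 22, 6, 14, 0, time.UTC) // test start
	from := start                     // observation began at test start
	to := start.Add(-9 * time.Minute) // event actually ended earlier
	f, t := clampToWindow(from, to, start)
	fmt.Println(f.Format(time.RFC3339), t.Format(time.RFC3339))
}
```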

podCoordinates := monitorapi.PodFrom(inLocator)

// no hit for deleted, but if it's a RunOnce pod with all terminated containers, the logical "this pod is over"
// happens when the last container is terminated.
Contributor:

I can see how this comment is relevant on lines 172-173, but I'm not sure how it's relevant here.

if !ok {
    return t.delegate.getEndTime(locator)
}
for i := len(containerEvents) - 1; i >= 0; i-- {
Contributor:

why did you choose to walk the containerEvents from the end like this vs. using for ... range like in line 141?
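
To illustrate the question, a minimal sketch of the reverse walk (the event type and the reason string are assumptions, not the actual monitorapi types): because the container events are time-ordered, scanning from the end returns the most recent exit as soon as it is found, whereas a forward for ... range would have to scan the whole slice and keep overwriting a candidate.

```go
package main

import (
	"fmt"
	"strings"
)

// intervalEvent loosely mirrors the items in the e2e-intervals JSON above.
type intervalEvent struct {
	Locator string
	Message string
	From    string
	To      string
}

// lastContainerExit walks the time-ordered events backwards, so the first
// match is the latest exit.
func lastContainerExit(containerEvents []intervalEvent) (intervalEvent, bool) {
	for i := len(containerEvents) - 1; i >= 0; i-- {
		if strings.Contains(containerEvents[i].Message, "reason/ContainerExit") {
			return containerEvents[i], true
		}
	}
	return intervalEvent{}, false
}

func main() {
	events := []intervalEvent{
		{Message: "constructed/true reason/ContainerStart ", From: "2022-03-14T21:50:00Z"},
		{Message: "constructed/true reason/ContainerExit ", From: "2022-03-14T21:57:54Z"},
	}
	if last, ok := lastContainerExit(events); ok {
		fmt.Println("pod logically ended at", last.From) // the last container exit
	}
}
```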

deads2k (Contributor Author) commented Mar 22, 2022

/retest

DennisPeriquet (Contributor) commented:

> @DennisPeriquet ok, worked it out. Those pods terminate before the test starts. I can fix the times, but they still won't show up on the chart.

Cool. I did a check on the junit JSONs from a job run of 122151c and didn't find any cases where .from was after .to.

openshift-bot (Contributor) commented:

/retest-required

Please review the full test history for this PR and help us cut down flakes.

6 similar comments

deads2k (Contributor Author) commented Mar 31, 2022

/test all

openshift-bot (Contributor) commented:

/retest-required

Please review the full test history for this PR and help us cut down flakes.

7 similar comments

DennisPeriquet (Contributor) commented:

On the two required (and failing) jobs, I'm seeing a lot of:

ns/e2e-test-whereabouts-e2e-59pln pod/whereabouts-pod-fgf26 node/ci-op-vqn19vmz-2a78c-tgfb2-worker-c-drphp - never deleted - reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_whereabouts-pod-fgf26_e2e-test-whereabouts-e2e-59pln_3a5636c4-7d4d-4e2c-ad3c-e35ba7bc63df_0(83b31776a78f0592ff291661771cfed6a1af6d718fc348e73b8f73f1fa549c62): error adding pod e2e-test-whereabouts-e2e-59pln_whereabouts-pod-fgf26 to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [e2e-test-whereabouts-e2e-59pln/whereabouts-pod-fgf26/3a5636c4-7d4d-4e2c-ad3c-e35ba7bc63df:whereaboutstestbridge]: error adding container to network "whereaboutstestbridge": Error at storage engine: Could not allocate IP in range: ip: 192.168.2.225 / - 192.168.2.230 / range: net.IPNet{IP:net.IP{0xc0, 0xa8, 0x2, 0xe0}, Mask:net.IPMask{0xff, 0xff, 0xff, 0xf8}}
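
For context on the failure, the net.IPNet literal in that error decodes to 192.168.2.224/29, and the whereabouts range (192.168.2.225 - 192.168.2.230) spans only six assignable addresses, which would explain exhaustion when several test pods request IPs at once. A quick, illustrative Go check (not part of the PR):

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	// The IPNet literal from the error message above.
	ipnet := net.IPNet{
		IP:   net.IP{0xc0, 0xa8, 0x2, 0xe0},      // 192.168.2.224
		Mask: net.IPMask{0xff, 0xff, 0xff, 0xf8}, // /29
	}
	ones, bits := ipnet.Mask.Size()
	// Prints: 192.168.2.224/29 has 8 addresses
	fmt.Printf("%s has %d addresses\n", ipnet.String(), 1<<(bits-ones))
}
```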

DennisPeriquet (Contributor) commented:

/lgtm cancel

openshift-ci bot removed the lgtm label (Indicates that a PR is ready to be merged.) on Apr 1, 2022
deads2k (Contributor Author) commented Apr 4, 2022

Fix identified in a different PR by @DennisPeriquet; relabelling.

deads2k added the lgtm label (Indicates that a PR is ready to be merged.) on Apr 4, 2022
deads2k (Contributor Author) commented Apr 4, 2022

/test all

deads2k (Contributor Author) commented Apr 4, 2022

/skip

openshift-bot (Contributor) commented:

/retest-required

Please review the full test history for this PR and help us cut down flakes.

1 similar comment

deads2k (Contributor Author) commented Apr 4, 2022

failed on "pods should successfully create sandboxes by other"

/override ci/prow/e2e-aws-fips
/override ci/prow/e2e-gcp

openshift-ci bot commented Apr 4, 2022

@deads2k: Overrode contexts on behalf of deads2k: ci/prow/e2e-aws-fips, ci/prow/e2e-gcp

In response to this:

> failed on "pods should successfully create sandboxes by other"
>
> /override ci/prow/e2e-aws-fips
> /override ci/prow/e2e-gcp

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-merge-robot merged commit 02cd062 into openshift:master on Apr 4, 2022
openshift-ci bot commented Apr 4, 2022

@deads2k: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
ci/prow/e2e-aws-cgroupsv2 | 612c4db | link | false | /test e2e-aws-cgroupsv2
ci/prow/e2e-aws-single-node | 612c4db | link | false | /test e2e-aws-single-node
ci/prow/e2e-agnostic-cmd | 612c4db | link | false | /test e2e-agnostic-cmd

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
