Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gce-master-scale-correctness - test cases not showing up on testgrid tab #111510

Closed
azylinski opened this issue Jul 28, 2022 · 50 comments · Fixed by GoogleCloudPlatform/testgrid#1055
Assignees
Labels
kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one.
Milestone

Comments

@azylinski
Copy link
Contributor

azylinski commented Jul 28, 2022

Which jobs are failing?

Since 8th JUL, the test cases don't show up on testgrid:
https://testgrid.k8s.io/sig-release-master-informing#gce-master-scale-correctness&show-stale-tests=

Which tests are failing?

ci-kubernetes-e2e-gce-scale-correctness

It's not clear from testgrid, which runs pass and which failed atm

Since when has it been failing?

2022-07-08

Testgrid link

https://testgrid.k8s.io/sig-release-master-informing#gce-master-scale-correctness&show-stale-tests=

Reason for failure (if possible)

Since 8th JUL, the build-log and junit artifacts looks different.

Before, the junit file was split into parts: junit_{}.xml and it contains only a testcase data (name, time, skipped*, ...), E.g: https://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-correctness/1545015201953222656/artifacts/

From 07-08, there's a single junit_01.xml file, which is 650MB+ and contains all the std-out and std-err informations, which is a duplication of build logs. E.g.:
https://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-correctness/1550813416132710400/artifacts/

Looking at the difference in kubernetes/kubernetes source code - there were number of changes to: ginkgo, klog and traces - all could be related:
2a017f9...4569e64

cc @chendave - I see you've made number of commits; would you be able to help?

Anything else we need to know?

No response

Relevant SIG(s)

No response

@azylinski azylinski added the kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. label Jul 28, 2022
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jul 28, 2022
@k8s-ci-robot
Copy link
Contributor

@azylinski: There are no sig labels on this issue. Please add an appropriate label by using one of the following commands:

  • /sig <group-name>
  • /wg <group-name>
  • /committee <group-name>

Please see the group list for a listing of the SIGs, working groups, and committees available.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Jul 28, 2022
@k8s-ci-robot
Copy link
Contributor

@azylinski: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@azylinski
Copy link
Contributor Author

/assign @chendave

@mborsz
Copy link
Member

mborsz commented Jul 29, 2022

/milestone 1.25

@k8s-ci-robot
Copy link
Contributor

@mborsz: The provided milestone is not valid for this repository. Milestones in this repository: [next-candidate, v1.16, v1.17, v1.18, v1.19, v1.20, v1.21, v1.22, v1.23, v1.24, v1.25, v1.26]

Use /milestone clear to clear the milestone.

In response to this:

/milestone 1.25

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mborsz
Copy link
Member

mborsz commented Jul 29, 2022

/milestone v1.25

@k8s-ci-robot k8s-ci-robot added this to the v1.25 milestone Jul 29, 2022
@aojea
Copy link
Member

aojea commented Jul 29, 2022

I think that this is related to the ginkgov2 migration, the reporters in ginkgo has changed.

Do you have a link to the code where the scale test run the e2e commands?

@mborsz
Copy link
Member

mborsz commented Jul 29, 2022

The test is defined here: https://github.com/kubernetes/test-infra/blob/afcd7bf377861006c31a98cffe673939243741fe/config/jobs/kubernetes/sig-scalability/sig-scalability-release-blocking-jobs.yaml#L4

We use kubekins.

My guess is that the old "junit shards" we see were generated by different ginkgo processes/threads (number of junit files matches --ginkgo-parallel=40 set in our test) and #109111 mentions that we the old parameter "--parallel" no longer works.

Probably we need to migrate kubekins to use ginkgo -procs=N instead of "--parallel".

@aojea
Copy link
Member

aojea commented Jul 29, 2022

@mborsz
Copy link
Member

mborsz commented Jul 29, 2022

I just checked kubetest code and it looks like the kubetest's --ginkgo-parallel flag gets propagated to --nodes ginkgo flag and it seems to work fine:

The most recent test (https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-correctness/1552625181665529856) has Running in parallel across 40 processes in build log.

@aojea
Copy link
Member

aojea commented Jul 29, 2022

it seems is using --nodes for parallel

https://github.com/kubernetes/test-infra/blob/444b10105ab0b6356e3327a2100342bb782f2a57/kubetest/e2e/runner.go#L158

I think that now it aggregates the reports, also running in parallel

These functions are called exactly once at the end of the test suite after any AfterSuite nodes have run. When running in parallel the functions are called on the primary Ginkgo process after all other processes have finished and is guaranteed to have an aggregated copy of types.Report that includes all SpecReports from all the Ginkgo parallel processes.

https://onsi.github.io/ginkgo/MIGRATING_TO_V2

@aojea
Copy link
Member

aojea commented Jul 29, 2022

besides that, all the other jobs are reporting a single junit xml file, is this the real problem? @BenTheElder do you have more ideas on why this file is not being interpreted by testgrid?

@mborsz
Copy link
Member

mborsz commented Jul 29, 2022

My guess is that 600MiB of junit exceeds some size limit in used in testgrid and for that reason it isn't interpreted by testgrid.

@azylinski
Copy link
Contributor Author

re: #111510 (comment)

https://github.com/kubernetes/test-infra/blob/master/kubetest/main.go#L1012
I think this envvar needs to be changed from: GINKGO_PARALLEL_NODES to GINKGO_PARALLEL_PROCESS

Ref: https://onsi.github.io/ginkgo/MIGRATING_TO_V2#renamed-ginkgoparallelnode

@aojea
Copy link
Member

aojea commented Jul 29, 2022

that doesn't look related, AFAIK the migration is about an API and that is a kubetest variable, we are passing the flag correctly.

I think that the junit generation is a better lead, in k/k there is tooling to strip the file to avoid issues with large sizes

#109112

it seems kubetest dumps it entirely

https://github.com/kubernetes/test-infra/blob/444b10105ab0b6356e3327a2100342bb782f2a57/kubetest/process/process.go#L66-L125

someone is up to adapt kubetest to either split the xml generated or prune the large messages in k/k?

@chendave
Copy link
Member

chendave commented Jul 29, 2022

sorry, I just notice this, I am trying to answer some questions here if possible.

From 07-08, there's a single junit_01.xml file, which is 650MB+ and contains all the std-out and std-err informations, which is a duplication of build logs.

as @aojea has said, Ginkgo V2 merged all results into a single junit file, the file is named as "junitxxx01.yaml", pls see: https://github.com/kubernetes/kubernetes/blob/master/test/e2e/framework/test_context.go#L336-L339

@chendave
Copy link
Member

chendave commented Jul 29, 2022

#109111 mentions that we the old parameter "--parallel" no longer works.

it's only for golang's parallel flag, we shouldn't use that as it's not accepted by Ginkgov2, here is an example, https://github.com/kubernetes/test-infra/pull/26784/files.

--ginkgo-parallel is fine.

@chendave
Copy link
Member

I think this envvar needs to be changed from: GINKGO_PARALLEL_NODES to GINKGO_PARALLEL_PROCESS

I don't think so, pls see:

if [[ -n "${GINKGO_PARALLEL_NODES:-}" ]]; then
ginkgo_args+=("--nodes=${GINKGO_PARALLEL_NODES}")
elif [[ ${GINKGO_PARALLEL} =~ ^[yY]$ ]]; then
ginkgo_args+=("--nodes=25")
fi

@chendave
Copy link
Member

someone is up to adapt kubetest to either split the xml generated or prune the large messages in k/k?

+1 on the pruning and adapt with new junit file.

@mborsz
Copy link
Member

mborsz commented Jul 29, 2022

In deck's logs I see:

{"artifact":"https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-correctness/1552625181665529856/artifacts/junit_01.xml", "component":"deck", "error":"file size over specified limit", "file":"k8s.io/test-infra/prow/spyglass/lenses/junit/lens.go:165", "func":"k8s.io/test-infra/prow/spyglass/lenses/junit.Lens.getJvd.func1", "level":"warning", "msg":"Error reading artifact"}

So at least for deck the reason for not showing junit is too large file.

I think testgrid implements similar logic probably.

@chendave
Copy link
Member

From 07-08, there's a single junit_01.xml file, which is 650MB+ and contains all the std-out and std-err informations, which is a duplication of build logs.

@onsi is there any way to keep a simplified junit report instead of dump everything in that file? as what Ginkgo V1 did?

@mborsz
Copy link
Member

mborsz commented Jul 29, 2022

@aojea
Copy link
Member

aojea commented Jul 29, 2022

the solution seems to adapt kubetest

it seems kubetest dumps it entirely
https://github.com/kubernetes/test-infra/blob/444b10105ab0b6356e3327a2100342bb782f2a57/kubetest/process/process.go#L66-L125
someone is up to adapt kubetest to either split the xml generated or prune the large messages in k/k?

I can try to take a shot after code freeze

@BenTheElder
Copy link
Member

This is really something that we shouldn't have to do in kubetest :/ (consider this also impacts tools like sonobuoy), and we're generally not accepting more complexity in that tool https://github.com/kubernetes/test-infra/tree/master/kubetest#deprecation-notice

@chendave
Copy link
Member

junit_v2.txt
junit_v1.txt

Attached v1 junit report and v2 junit report as a reference, v2 has embedded system-err into each testcases even the testcase is skipped, this make the file for v2 is larger than v1.

v1,

      <testcase name="[sig-storage] In-tree Volumes [Driver: rbd][Feature:Volumes][Serial] [Testpattern: Generic Ephemeral-volume (default fs) (late-binding)] ephemeral should support expansion of pvcs created for ephemeral pvcs" classname="Kubernetes e2e suite" time="0">
          <skipped></skipped>
      </testcase>

v2

          <testcase name="[It] [sig-storage] In-tree Volumes [Driver: rbd][Feature:Volumes][Serial] [Testpattern: Generic Ephemeral-volume (default fs) (late-binding)] ephemeral should support multiple inline ephemeral volumes" classname="Kubernetes e2e suite" status="skipped" time="0">
              <skipped message="skipped"></skipped>
              <system-err>[ReportAfterEach] TOP-LEVEL&#xA;  /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/e2e_test.go:142&#xA;</system-err>
          </testcase>

Just run one testcases, and v1 has 21145L, and v2 has 28258L

# du -sh junit_v1.xml
2.0M    junit_v1.xml
# du -sh junit_v2.xml
3.5M    junit_v2.xml

@onsi
Copy link
Contributor

onsi commented Aug 1, 2022

hey all,

when running in parallel ginkgo v2 now merges the individual reports generated by each parallel process into one composite process. there is no way to turn this off (v1's behavior was a shortcut at the time and somewhat ugly and confusing).

v2 also merges junit reports from multiple suites into a single file. this can be turned off with --keep-separate-reports

in v2 i also took a closer look at what few official-seeming junit specs exist and updated the reporter to match. currently there is no mechanism to control that default behavior.

you can, however, build a custom reporter to do some filtering/processing in-suite and for a project of k8s scope that might make the most sense.

for example - a reporter that modifies/filters the report object before manually calling GenerateJUnitReport or that builds the JUnitTestCases manually. I'd be happy to share more detail/help if that approach is preferred.

@chendave
Copy link
Member

chendave commented Aug 1, 2022

@azylinski How can we trigger this CI: ci-kubernetes-e2e-gce-scale-correctness?

@azylinski
Copy link
Contributor Author

azylinski commented Aug 1, 2022

I'm not sure if you can easily, as the project is constantly used by correctness and performance tests.
Or at least I don't know how to run it - I've just wrote down in this issue to what I've observed.

@marseel
Copy link
Member

marseel commented Aug 1, 2022

FYI this is CI job that is being run every day: https://prow.k8s.io/?job=ci-kubernetes-e2e-gce-scale-correctness
It requires 5k nodes cluster to run and we have 1 gcp project shared with other scalability tests that can handle it so there is no easy way to manually trigger it.

@chendave
Copy link
Member

chendave commented Aug 2, 2022

might be easiest if we trim / split / ... in a custom reporter or on exit in e2e.test,

I have tried to split, but split the 600+ M xml file would be time consuming.

custom reporter or on exit in e2e.test

I am not sure whether that job rely on the e2e.test, if so, we might consider to add a custom reporter for that. For what I observed for each jobs triggered regularly for a PR, the size of report is less than 50M. The gap between 50 and 600 are huge, poor background for me to get more info on that job.

@BenTheElder Talking about the custom reporter, shall the testgrid or Spyglass smart enough to fetch the result from two different junit xml file? one is trimmed and the other one is original.

@chendave
Copy link
Member

chendave commented Aug 2, 2022

/cc @pohly for thoughts.

@pohly
Copy link
Contributor

pohly commented Aug 2, 2022

I think we should try to reduce the amount of information that gets stored in the single JUnit file. We also should ensure that we only store a single output stream (#109744 (comment)). That addresses the main problem ("too much information").

Splitting into different files feels like a poor workaround because eventually they need to be merged again in memory when dealing with the entire job run.

shall the testgrid or Spyglass smart enough to fetch the result from two different junit xml file? one is trimmed and the other one is original.

I don't think we need the original, more complete JUnit file.

@chendave
Copy link
Member

chendave commented Aug 2, 2022

I don't think we need the original, more complete JUnit file.

I am afraid someone need the original one, since the original one holds the single of truth if anyone wants to check the details on the test execution.

Splitting into different files feels like a poor workaround

+1 for couple of reasons,

@onsi said that the split by the the number of parallelism is just a workaround at that time, v2 sound like did the right things, but this doesn't work for testgrid, 😅 .

split is time consuming and not trivial to implement.

@chendave
Copy link
Member

chendave commented Aug 2, 2022

Here is the comparison after this pr:
#111627

https://gcsweb.k8s.io/gcs/kubernetes-jenkins/pr-logs/pull/111627/pull-kubernetes-e2e-kind-ipv6/1554370753451790336/artifacts/

original one:
junit_01.xml - 25272885
trimmed one:
junit_01_trimmed.xml - 2492432

more than 90% off.

similarly, we will eventually get around 60 M for the job gce-master-scale-correctness?

@chendave
Copy link
Member

chendave commented Aug 2, 2022

Spyglass seems work, here is the link, stdout and stderr is not impacted when the test failed.

while the testcase passed or skipped, Spyglass doesn't provide the link to the stdout or stderr, so the Spyglass should be fine.

Spyglass reads the trimmed xml file btw, not sure how it selects the file though.

@chendave
Copy link
Member

chendave commented Aug 5, 2022

checked https://testgrid.k8s.io/sig-release-master-informing#gce-master-scale-correctness&show-stale-tests= all testcase status should be back.

https://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-correctness/1555161967272923136/artifacts/, the size now is about 2.5M.

But one thing should be noted is that, name of the testcases has been changed a little bit which is caused by the new format junit xml.

Here is one example,
In V1, the testcase named as:
"Kubernetes e2e suite.[sig-storage] Volumes NFSv4 should be mountable for NFSv4"
now has been updated to:
"Kubernetes e2e suite.[It] [sig-storage] Volumes NFSv4 should be mountable for NFSv4"

pls note the string of "[It]" that is added.

So, you can only check the status of "Kubernetes e2e suite.[It] [sig-storage] Volumes NFSv4 should be mountable for NFSv4" now.

We can somehow modify the report again to trim the string "[It]", this should be easy, but I am not quite sure whether we need to make everything the same as before.

If you are okay with the new test name, I think we can close this issue, otherwise we can update the name as well.

@azylinski @pohly @aojea thoughts?

@chendave
Copy link
Member

chendave commented Aug 5, 2022

I personally think it's okay, k8s just deliver this file while how the downstream tools consume those data is not predictable.

If the "IT" string added to the testcase is not necessary then it should be a bug from Ginkgo, and should be fixed in Ginkgo, in this case, we can filter out the string as a temporary workaround. @onsi

@chendave
Copy link
Member

chendave commented Aug 5, 2022

Another diff I found,

All skipped testcase won't shown in testgrid,

e.g.
"[sig-storage] In-tree Volumes [Driver: nfs] [Testpattern: Dynamic PV (block volmode)] multiVolume [Slow] should access to two volumes with different volume mode and retain data across pod recreation on different node"

the new format is
<testcase name="[It] [sig-storage] In-tree Volumes [Driver: nfs] [Testpattern: Dynamic PV (block volmode)] multiVolume [Slow] should access to two volumes with different volume mode and retain data across pod recreation on different node" classname="Kubernetes e2e suite" status="skipped" time="0.000383338">
<skipped message="skipped - Driver nfs doesn't support Block -- skipping"></skipped>
</testcase>

pls note the attribute of "status="skipped" ,

it was,

<testcase name="[sig-storage] In-tree Volumes [Driver: nfs] [Testpattern: Dynamic PV (block volmode)] multiVolume [Slow] should access to two volumes with different volume mode and retain data across pod recreation on different node" classname="Kubernetes e2e suite" time="0">
<skipped></skipped>
</testcase>

Can testgrid or some tool grep the "status" to get the result instead?

@aojea do you know someone familiar with the tool can help to update the tool a little bit?

@aojea
Copy link
Member

aojea commented Aug 5, 2022

I see , it seems the new reporting format breaks testgrid, @michelle192837 can you help us to recover the functionality in testgrid?

@helayoty
Copy link
Member

@azylinski @chendave 👋 the release 1.25 bug triage Lead. I'm reaching out to see the status of this issue and if we are still targeting the current release. Also, is it a release blocker?

Can you please update the priority as well?

@BenTheElder
Copy link
Member

FWIW I think this is implemented here https://github.com/GoogleCloudPlatform/testgrid/blob/master/pkg/updater/eval.go (junit in GCS => testgrid)

@chendave
Copy link
Member

@helayoty I don't think this is a release blocker, the change needed in Kubernetes has been merged, meanwhile, I pushed a pr in testgrid to close this GoogleCloudPlatform/testgrid#1055.

@helayoty
Copy link
Member

@helayoty I don't think this is a release blocker, the change needed in Kubernetes has been merged, meanwhile, I pushed a pr in testgrid to close this GoogleCloudPlatform/testgrid#1055.

Thanks for replying back, I'll move it to 1.26 then

@k8s-ci-robot
Copy link
Contributor

@helayoty: The provided milestone is not valid for this repository. Milestones in this repository: [next-candidate, v1.16, v1.17, v1.18, v1.19, v1.20, v1.21, v1.22, v1.23, v1.24, v1.25, v1.26]

Use /milestone clear to clear the milestone.

In response to this:

/milestone 1.26

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@helayoty
Copy link
Member

/milestone v1.26

@k8s-ci-robot k8s-ci-robot modified the milestones: v1.25, v1.26 Aug 16, 2022
@aojea
Copy link
Member

aojea commented Aug 17, 2022

/close

this issue has been resolved, I ca see that the test cases are showing up now, since the different fixes added to the e2e framework.

The remaining part is to show the skip test cases in testgrid, but that is another issue, better not to add more things here

@k8s-ci-robot
Copy link
Contributor

@aojea: Closing this issue.

In response to this:

/close

this issue has been resolved, I ca see that the test cases are showing up now, since the different fixes added to the e2e framework.

The remaining part is to show the skip test cases in testgrid, but that is another issue, better not to add more things here

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one.
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

10 participants