gce-master-scale-correctness - test cases not showing up on testgrid tab #111510

azylinski · 2022-07-28T14:21:37Z

Which jobs are failing?

Since 8th JUL, the test cases don't show up on testgrid:
https://testgrid.k8s.io/sig-release-master-informing#gce-master-scale-correctness&show-stale-tests=

Which tests are failing?

ci-kubernetes-e2e-gce-scale-correctness

It's not clear from testgrid, which runs pass and which failed atm

Since when has it been failing?

2022-07-08

Testgrid link

https://testgrid.k8s.io/sig-release-master-informing#gce-master-scale-correctness&show-stale-tests=

Reason for failure (if possible)

Since 8th JUL, the build-log and junit artifacts looks different.

Before, the junit file was split into parts: junit_{}.xml and it contains only a testcase data (name, time, skipped*, ...), E.g: https://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-correctness/1545015201953222656/artifacts/

From 07-08, there's a single junit_01.xml file, which is 650MB+ and contains all the std-out and std-err informations, which is a duplication of build logs. E.g.:
https://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-correctness/1550813416132710400/artifacts/

Looking at the difference in kubernetes/kubernetes source code - there were number of changes to: ginkgo, klog and traces - all could be related:
2a017f9...4569e64

cc @chendave - I see you've made number of commits; would you be able to help?

Anything else we need to know?

No response

Relevant SIG(s)

No response

The text was updated successfully, but these errors were encountered:

k8s-ci-robot · 2022-07-28T14:22:37Z

@azylinski: There are no sig labels on this issue. Please add an appropriate label by using one of the following commands:

/sig <group-name>
/wg <group-name>
/committee <group-name>

Please see the group list for a listing of the SIGs, working groups, and committees available.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot · 2022-07-28T14:22:38Z

@azylinski: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

azylinski · 2022-07-29T08:23:06Z

/assign @chendave

mborsz · 2022-07-29T09:20:51Z

/milestone 1.25

k8s-ci-robot · 2022-07-29T09:20:54Z

@mborsz: The provided milestone is not valid for this repository. Milestones in this repository: [next-candidate, v1.16, v1.17, v1.18, v1.19, v1.20, v1.21, v1.22, v1.23, v1.24, v1.25, v1.26]

Use /milestone clear to clear the milestone.

In response to this:

/milestone 1.25

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

mborsz · 2022-07-29T09:21:12Z

/milestone v1.25

aojea · 2022-07-29T09:26:50Z

I think that this is related to the ginkgov2 migration, the reporters in ginkgo has changed.

Do you have a link to the code where the scale test run the e2e commands?

mborsz · 2022-07-29T09:30:17Z

The test is defined here: https://github.com/kubernetes/test-infra/blob/afcd7bf377861006c31a98cffe673939243741fe/config/jobs/kubernetes/sig-scalability/sig-scalability-release-blocking-jobs.yaml#L4

We use kubekins.

My guess is that the old "junit shards" we see were generated by different ginkgo processes/threads (number of junit files matches --ginkgo-parallel=40 set in our test) and #109111 mentions that we the old parameter "--parallel" no longer works.

Probably we need to migrate kubekins to use ginkgo -procs=N instead of "--parallel".

aojea · 2022-07-29T09:36:26Z

hmm, that seems a kubetest flag https://github.com/kubernetes/test-infra/blob/444b10105ab0b6356e3327a2100342bb782f2a57/kubetest/main.go#L150 ,

mborsz · 2022-07-29T09:40:14Z

I just checked kubetest code and it looks like the kubetest's --ginkgo-parallel flag gets propagated to --nodes ginkgo flag and it seems to work fine:

The most recent test (https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-correctness/1552625181665529856) has Running in parallel across 40 processes in build log.

aojea · 2022-07-29T09:41:54Z

it seems is using --nodes for parallel

https://github.com/kubernetes/test-infra/blob/444b10105ab0b6356e3327a2100342bb782f2a57/kubetest/e2e/runner.go#L158

I think that now it aggregates the reports, also running in parallel

These functions are called exactly once at the end of the test suite after any AfterSuite nodes have run. When running in parallel the functions are called on the primary Ginkgo process after all other processes have finished and is guaranteed to have an aggregated copy of types.Report that includes all SpecReports from all the Ginkgo parallel processes.

https://onsi.github.io/ginkgo/MIGRATING_TO_V2

aojea · 2022-07-29T09:43:13Z

besides that, all the other jobs are reporting a single junit xml file, is this the real problem? @BenTheElder do you have more ideas on why this file is not being interpreted by testgrid?

mborsz · 2022-07-29T09:46:47Z

My guess is that 600MiB of junit exceeds some size limit in used in testgrid and for that reason it isn't interpreted by testgrid.

azylinski · 2022-07-29T09:47:36Z

re: #111510 (comment)

https://github.com/kubernetes/test-infra/blob/master/kubetest/main.go#L1012
I think this envvar needs to be changed from: GINKGO_PARALLEL_NODES to GINKGO_PARALLEL_PROCESS

Ref: https://onsi.github.io/ginkgo/MIGRATING_TO_V2#renamed-ginkgoparallelnode

aojea · 2022-07-29T10:06:30Z

that doesn't look related, AFAIK the migration is about an API and that is a kubetest variable, we are passing the flag correctly.

I think that the junit generation is a better lead, in k/k there is tooling to strip the file to avoid issues with large sizes

#109112

it seems kubetest dumps it entirely

https://github.com/kubernetes/test-infra/blob/444b10105ab0b6356e3327a2100342bb782f2a57/kubetest/process/process.go#L66-L125

someone is up to adapt kubetest to either split the xml generated or prune the large messages in k/k?

chendave · 2022-07-29T10:43:19Z

sorry, I just notice this, I am trying to answer some questions here if possible.

From 07-08, there's a single junit_01.xml file, which is 650MB+ and contains all the std-out and std-err informations, which is a duplication of build logs.

as @aojea has said, Ginkgo V2 merged all results into a single junit file, the file is named as "junitxxx01.yaml", pls see: https://github.com/kubernetes/kubernetes/blob/master/test/e2e/framework/test_context.go#L336-L339

chendave · 2022-07-29T10:50:24Z

#109111 mentions that we the old parameter "--parallel" no longer works.

it's only for golang's parallel flag, we shouldn't use that as it's not accepted by Ginkgov2, here is an example, https://github.com/kubernetes/test-infra/pull/26784/files.

--ginkgo-parallel is fine.

chendave · 2022-07-29T10:59:33Z

I think this envvar needs to be changed from: GINKGO_PARALLEL_NODES to GINKGO_PARALLEL_PROCESS

I don't think so, pls see:

kubernetes/hack/ginkgo-e2e.sh

Lines 135 to 139 in 3c12379

    
           if [[ -n "${GINKGO_PARALLEL_NODES:-}" ]]; then 
        
             ginkgo_args+=("--nodes=${GINKGO_PARALLEL_NODES}") 
        
           elif [[ ${GINKGO_PARALLEL} =~ ^[yY]$ ]]; then 
        
             ginkgo_args+=("--nodes=25") 
        
           fi

chendave · 2022-07-29T11:02:40Z

someone is up to adapt kubetest to either split the xml generated or prune the large messages in k/k?

+1 on the pruning and adapt with new junit file.

mborsz · 2022-07-29T11:11:51Z

In deck's logs I see:

{"artifact":"https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-correctness/1552625181665529856/artifacts/junit_01.xml", "component":"deck", "error":"file size over specified limit", "file":"k8s.io/test-infra/prow/spyglass/lenses/junit/lens.go:165", "func":"k8s.io/test-infra/prow/spyglass/lenses/junit.Lens.getJvd.func1", "level":"warning", "msg":"Error reading artifact"}

So at least for deck the reason for not showing junit is too large file.

I think testgrid implements similar logic probably.

chendave · 2022-07-29T11:12:46Z

From 07-08, there's a single junit_01.xml file, which is 650MB+ and contains all the std-out and std-err informations, which is a duplication of build logs.

@onsi is there any way to keep a simplified junit report instead of dump everything in that file? as what Ginkgo V1 did?

mborsz · 2022-07-29T11:13:57Z

Deck is using limit of 200M:
https://github.com/kubernetes/test-infra/blob/444b10105ab0b6356e3327a2100342bb782f2a57/config/prow/config.yaml#L50-L50

aojea · 2022-07-29T11:16:25Z

the solution seems to adapt kubetest

it seems kubetest dumps it entirely
https://github.com/kubernetes/test-infra/blob/444b10105ab0b6356e3327a2100342bb782f2a57/kubetest/process/process.go#L66-L125
someone is up to adapt kubetest to either split the xml generated or prune the large messages in k/k?

I can try to take a shot after code freeze

BenTheElder · 2022-07-29T20:05:22Z

This is really something that we shouldn't have to do in kubetest :/ (consider this also impacts tools like sonobuoy), and we're generally not accepting more complexity in that tool https://github.com/kubernetes/test-infra/tree/master/kubetest#deprecation-notice

chendave · 2022-07-30T10:57:56Z

junit_v2.txt
junit_v1.txt

Attached v1 junit report and v2 junit report as a reference, v2 has embedded system-err into each testcases even the testcase is skipped, this make the file for v2 is larger than v1.

v1,

      <testcase name="[sig-storage] In-tree Volumes [Driver: rbd][Feature:Volumes][Serial] [Testpattern: Generic Ephemeral-volume (default fs) (late-binding)] ephemeral should support expansion of pvcs created for ephemeral pvcs" classname="Kubernetes e2e suite" time="0">
          <skipped></skipped>
      </testcase>

v2

          <testcase name="[It] [sig-storage] In-tree Volumes [Driver: rbd][Feature:Volumes][Serial] [Testpattern: Generic Ephemeral-volume (default fs) (late-binding)] ephemeral should support multiple inline ephemeral volumes" classname="Kubernetes e2e suite" status="skipped" time="0">
              <skipped message="skipped"></skipped>
              <system-err>[ReportAfterEach] TOP-LEVEL&#xA;  /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/e2e_test.go:142&#xA;</system-err>
          </testcase>

Just run one testcases, and v1 has 21145L, and v2 has 28258L

# du -sh junit_v1.xml
2.0M    junit_v1.xml
# du -sh junit_v2.xml
3.5M    junit_v2.xml

onsi · 2022-08-01T13:12:58Z

hey all,

when running in parallel ginkgo v2 now merges the individual reports generated by each parallel process into one composite process. there is no way to turn this off (v1's behavior was a shortcut at the time and somewhat ugly and confusing).

v2 also merges junit reports from multiple suites into a single file. this can be turned off with --keep-separate-reports

in v2 i also took a closer look at what few official-seeming junit specs exist and updated the reporter to match. currently there is no mechanism to control that default behavior.

you can, however, build a custom reporter to do some filtering/processing in-suite and for a project of k8s scope that might make the most sense.

for example - a reporter that modifies/filters the report object before manually calling GenerateJUnitReport or that builds the JUnitTestCases manually. I'd be happy to share more detail/help if that approach is preferred.

chendave · 2022-08-01T13:41:10Z

@azylinski How can we trigger this CI: ci-kubernetes-e2e-gce-scale-correctness?

azylinski · 2022-08-01T14:08:50Z

I'm not sure if you can easily, as the project is constantly used by correctness and performance tests.
Or at least I don't know how to run it - I've just wrote down in this issue to what I've observed.

marseel · 2022-08-01T16:41:40Z

FYI this is CI job that is being run every day: https://prow.k8s.io/?job=ci-kubernetes-e2e-gce-scale-correctness
It requires 5k nodes cluster to run and we have 1 gcp project shared with other scalability tests that can handle it so there is no easy way to manually trigger it.

chendave · 2022-08-02T02:15:09Z

might be easiest if we trim / split / ... in a custom reporter or on exit in e2e.test,

I have tried to split, but split the 600+ M xml file would be time consuming.

custom reporter or on exit in e2e.test

I am not sure whether that job rely on the e2e.test, if so, we might consider to add a custom reporter for that. For what I observed for each jobs triggered regularly for a PR, the size of report is less than 50M. The gap between 50 and 600 are huge, poor background for me to get more info on that job.

@BenTheElder Talking about the custom reporter, shall the testgrid or Spyglass smart enough to fetch the result from two different junit xml file? one is trimmed and the other one is original.

chendave · 2022-08-02T02:35:11Z

/cc @pohly for thoughts.

pohly · 2022-08-02T08:00:09Z

I think we should try to reduce the amount of information that gets stored in the single JUnit file. We also should ensure that we only store a single output stream (#109744 (comment)). That addresses the main problem ("too much information").

Splitting into different files feels like a poor workaround because eventually they need to be merged again in memory when dealing with the entire job run.

shall the testgrid or Spyglass smart enough to fetch the result from two different junit xml file? one is trimmed and the other one is original.

I don't think we need the original, more complete JUnit file.

chendave · 2022-08-02T08:20:08Z

I don't think we need the original, more complete JUnit file.

I am afraid someone need the original one, since the original one holds the single of truth if anyone wants to check the details on the test execution.

Splitting into different files feels like a poor workaround

+1 for couple of reasons,

@onsi said that the split by the the number of parallelism is just a workaround at that time, v2 sound like did the right things, but this doesn't work for testgrid, 😅 .

split is time consuming and not trivial to implement.

chendave · 2022-08-02T08:25:43Z

Here is the comparison after this pr:
#111627

https://gcsweb.k8s.io/gcs/kubernetes-jenkins/pr-logs/pull/111627/pull-kubernetes-e2e-kind-ipv6/1554370753451790336/artifacts/

original one:
junit_01.xml - 25272885
trimmed one:
junit_01_trimmed.xml - 2492432

more than 90% off.

similarly, we will eventually get around 60 M for the job gce-master-scale-correctness?

chendave · 2022-08-02T08:29:59Z

Spyglass seems work, here is the link, stdout and stderr is not impacted when the test failed.

while the testcase passed or skipped, Spyglass doesn't provide the link to the stdout or stderr, so the Spyglass should be fine.

Spyglass reads the trimmed xml file btw, not sure how it selects the file though.

chendave · 2022-08-05T02:29:47Z

checked https://testgrid.k8s.io/sig-release-master-informing#gce-master-scale-correctness&show-stale-tests= all testcase status should be back.

https://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-correctness/1555161967272923136/artifacts/, the size now is about 2.5M.

But one thing should be noted is that, name of the testcases has been changed a little bit which is caused by the new format junit xml.

Here is one example,
In V1, the testcase named as:
"Kubernetes e2e suite.[sig-storage] Volumes NFSv4 should be mountable for NFSv4"
now has been updated to:
"Kubernetes e2e suite.[It] [sig-storage] Volumes NFSv4 should be mountable for NFSv4"

pls note the string of "[It]" that is added.

So, you can only check the status of "Kubernetes e2e suite.[It] [sig-storage] Volumes NFSv4 should be mountable for NFSv4" now.

We can somehow modify the report again to trim the string "[It]", this should be easy, but I am not quite sure whether we need to make everything the same as before.

If you are okay with the new test name, I think we can close this issue, otherwise we can update the name as well.

@azylinski @pohly @aojea thoughts?

chendave · 2022-08-05T02:45:17Z

I personally think it's okay, k8s just deliver this file while how the downstream tools consume those data is not predictable.

If the "IT" string added to the testcase is not necessary then it should be a bug from Ginkgo, and should be fixed in Ginkgo, in this case, we can filter out the string as a temporary workaround. @onsi

chendave · 2022-08-05T03:16:29Z

Another diff I found,

All skipped testcase won't shown in testgrid,

e.g.
"[sig-storage] In-tree Volumes [Driver: nfs] [Testpattern: Dynamic PV (block volmode)] multiVolume [Slow] should access to two volumes with different volume mode and retain data across pod recreation on different node"

the new format is
<testcase name="[It] [sig-storage] In-tree Volumes [Driver: nfs] [Testpattern: Dynamic PV (block volmode)] multiVolume [Slow] should access to two volumes with different volume mode and retain data across pod recreation on different node" classname="Kubernetes e2e suite" status="skipped" time="0.000383338">
<skipped message="skipped - Driver nfs doesn't support Block -- skipping"></skipped>
</testcase>

pls note the attribute of "status="skipped" ,

it was,

Can testgrid or some tool grep the "status" to get the result instead?

@aojea do you know someone familiar with the tool can help to update the tool a little bit?

aojea · 2022-08-05T06:57:16Z

I see , it seems the new reporting format breaks testgrid, @michelle192837 can you help us to recover the functionality in testgrid?

helayoty · 2022-08-15T15:53:50Z

@azylinski @chendave 👋 the release 1.25 bug triage Lead. I'm reaching out to see the status of this issue and if we are still targeting the current release. Also, is it a release blocker?

Can you please update the priority as well?

BenTheElder · 2022-08-15T17:53:48Z

FWIW I think this is implemented here https://github.com/GoogleCloudPlatform/testgrid/blob/master/pkg/updater/eval.go (junit in GCS => testgrid)

chendave · 2022-08-16T07:05:30Z

@helayoty I don't think this is a release blocker, the change needed in Kubernetes has been merged, meanwhile, I pushed a pr in testgrid to close this GoogleCloudPlatform/testgrid#1055.

helayoty · 2022-08-16T14:37:00Z

@helayoty I don't think this is a release blocker, the change needed in Kubernetes has been merged, meanwhile, I pushed a pr in testgrid to close this GoogleCloudPlatform/testgrid#1055.

Thanks for replying back, I'll move it to 1.26 then

k8s-ci-robot · 2022-08-16T14:41:56Z

@helayoty: The provided milestone is not valid for this repository. Milestones in this repository: [next-candidate, v1.16, v1.17, v1.18, v1.19, v1.20, v1.21, v1.22, v1.23, v1.24, v1.25, v1.26]

Use /milestone clear to clear the milestone.

In response to this:

/milestone 1.26

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

helayoty · 2022-08-16T14:42:43Z

/milestone v1.26

aojea · 2022-08-17T06:45:08Z

/close

this issue has been resolved, I ca see that the test cases are showing up now, since the different fixes added to the e2e framework.

The remaining part is to show the skip test cases in testgrid, but that is another issue, better not to add more things here

k8s-ci-robot · 2022-08-17T06:45:25Z

@aojea: Closing this issue.

In response to this:

/close

this issue has been resolved, I ca see that the test cases are showing up now, since the different fixes added to the e2e framework.

The remaining part is to show the skip test cases in testgrid, but that is another issue, better not to add more things here

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

azylinski added the kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. label Jul 28, 2022

k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jul 28, 2022

k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Jul 28, 2022

k8s-ci-robot assigned chendave Jul 29, 2022

k8s-ci-robot added this to the v1.25 milestone Jul 29, 2022

chendave mentioned this issue Aug 1, 2022

Don't merge the junit report #111599

Closed

chendave mentioned this issue Aug 2, 2022

e2e: Trim junit reporter to adapt with testgrid #111627

Merged

azylinski mentioned this issue Aug 2, 2022

REQUEST: New membership for azylinski kubernetes/org#3583

Closed

9 tasks

pohly mentioned this issue Aug 3, 2022

e2e: trim junit report for Spyglass #111669

Merged

chendave mentioned this issue Aug 16, 2022

fix: Add skipped status back for Ginkgo v2 GoogleCloudPlatform/testgrid#1055

Merged

k8s-ci-robot modified the milestones: v1.25, v1.26 Aug 16, 2022

k8s-ci-robot closed this as completed Aug 17, 2022

gce-master-scale-correctness - test cases not showing up on testgrid tab #111510

gce-master-scale-correctness - test cases not showing up on testgrid tab #111510

Comments

azylinski commented Jul 28, 2022 • edited Loading

Which jobs are failing?

Which tests are failing?

Since when has it been failing?

Testgrid link

Reason for failure (if possible)

Anything else we need to know?

Relevant SIG(s)

k8s-ci-robot commented Jul 28, 2022

k8s-ci-robot commented Jul 28, 2022

azylinski commented Jul 29, 2022

mborsz commented Jul 29, 2022

k8s-ci-robot commented Jul 29, 2022

mborsz commented Jul 29, 2022

aojea commented Jul 29, 2022

mborsz commented Jul 29, 2022

aojea commented Jul 29, 2022

mborsz commented Jul 29, 2022

aojea commented Jul 29, 2022

aojea commented Jul 29, 2022

mborsz commented Jul 29, 2022

azylinski commented Jul 29, 2022

aojea commented Jul 29, 2022 • edited Loading

chendave commented Jul 29, 2022 • edited Loading

chendave commented Jul 29, 2022 • edited Loading

chendave commented Jul 29, 2022

chendave commented Jul 29, 2022

mborsz commented Jul 29, 2022 • edited Loading

chendave commented Jul 29, 2022

mborsz commented Jul 29, 2022

aojea commented Jul 29, 2022

BenTheElder commented Jul 29, 2022

chendave commented Jul 30, 2022

onsi commented Aug 1, 2022

chendave commented Aug 1, 2022

azylinski commented Aug 1, 2022 • edited Loading

marseel commented Aug 1, 2022

chendave commented Aug 2, 2022 • edited Loading

chendave commented Aug 2, 2022

pohly commented Aug 2, 2022

chendave commented Aug 2, 2022

chendave commented Aug 2, 2022

chendave commented Aug 2, 2022 • edited Loading

chendave commented Aug 5, 2022

chendave commented Aug 5, 2022

chendave commented Aug 5, 2022 • edited Loading

aojea commented Aug 5, 2022

helayoty commented Aug 15, 2022

BenTheElder commented Aug 15, 2022

chendave commented Aug 16, 2022

helayoty commented Aug 16, 2022

k8s-ci-robot commented Aug 16, 2022

helayoty commented Aug 16, 2022

aojea commented Aug 17, 2022

k8s-ci-robot commented Aug 17, 2022

azylinski commented Jul 28, 2022 •

edited

Loading

aojea commented Jul 29, 2022 •

edited

Loading

chendave commented Jul 29, 2022 •

edited

Loading

chendave commented Jul 29, 2022 •

edited

Loading

mborsz commented Jul 29, 2022 •

edited

Loading

azylinski commented Aug 1, 2022 •

edited

Loading

chendave commented Aug 2, 2022 •

edited

Loading

chendave commented Aug 2, 2022 •

edited

Loading

chendave commented Aug 5, 2022 •

edited

Loading