gce-master-scale-correctness - test cases not showing up on testgrid tab #111510
Comments
@azylinski: There are no sig labels on this issue. Please add an appropriate label by using one of the following commands:
Please see the group list for a listing of the SIGs, working groups, and committees available. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@azylinski: This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the The Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/assign @chendave |
/milestone 1.25 |
@mborsz: The provided milestone is not valid for this repository. Milestones in this repository: [ Use In response to this:
/milestone v1.25 |
I think this is related to the Ginkgo v2 migration: the reporters in Ginkgo have changed. Do you have a link to the code where the scale test runs the e2e commands? |
The test is defined here: https://github.com/kubernetes/test-infra/blob/afcd7bf377861006c31a98cffe673939243741fe/config/jobs/kubernetes/sig-scalability/sig-scalability-release-blocking-jobs.yaml#L4 We use kubekins. My guess is that the old "junit shards" we see were generated by different ginkgo processes/threads (number of junit files matches Probably we need to migrate kubekins to use |
hmm, that seems a kubetest flag https://github.com/kubernetes/test-infra/blob/444b10105ab0b6356e3327a2100342bb782f2a57/kubetest/main.go#L150 , |
I just checked kubetest code and it looks like the kubetest's --ginkgo-parallel flag gets propagated to --nodes ginkgo flag and it seems to work fine: The most recent test (https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-correctness/1552625181665529856) has |
It seems to be using --nodes for parallelism. I think it now aggregates the reports, even when running in parallel.
|
Besides that, all the other jobs are reporting a single junit xml file too. Is this the real problem? @BenTheElder do you have more ideas on why this file is not being interpreted by testgrid? |
My guess is that 600MiB of junit exceeds some size limit used in testgrid, and for that reason testgrid doesn't interpret it. |
https://github.com/kubernetes/test-infra/blob/master/kubetest/main.go#L1012 Ref: https://onsi.github.io/ginkgo/MIGRATING_TO_V2#renamed-ginkgoparallelnode |
That doesn't look related. AFAIK the migration is about an API, and that is a kubetest variable; we are passing the flag correctly. I think the junit generation is a better lead: in k/k there is tooling to strip the file to avoid issues with large sizes, while kubetest seems to dump it entirely. Is someone up for adapting kubetest to either split the generated xml or prune the large messages in k/k? |
Sorry, I just noticed this. I'll try to answer some of the questions here if possible.
As @aojea said, Ginkgo v2 merges all results into a single junit file, named "junitxxx01.yaml"; please see: https://github.com/kubernetes/kubernetes/blob/master/test/e2e/framework/test_context.go#L336-L339 |
It's only for golang's parallel flag; we shouldn't use it, as it's not accepted by Ginkgo v2. Here is an example: https://github.com/kubernetes/test-infra/pull/26784/files.
|
I don't think so, pls see: Lines 135 to 139 in 3c12379
|
+1 on pruning and adapting to the new junit file. |
In deck's logs I see:
So, at least for deck, the reason the junit results are not shown is that the file is too large. I think testgrid probably implements similar logic. |
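As a rough illustration of the kind of size guard suspected here, the sketch below skips junit artifacts above a limit before parsing them. The limit value, the function name `split_by_size`, and the file sizes are all assumptions for illustration; the actual limits in deck and testgrid are not stated in this thread.

```python
# Hypothetical limit; the real value used by deck or testgrid is not stated in this thread.
MAX_JUNIT_BYTES = 100 * 1024 * 1024


def split_by_size(artifacts, limit=MAX_JUNIT_BYTES):
    """Split {file name: size in bytes} into (parseable, too_large) name lists."""
    parseable = [name for name, size in artifacts.items() if size <= limit]
    too_large = [name for name, size in artifacts.items() if size > limit]
    return parseable, too_large


# Illustrative sizes: one small shard and a 650MB merged file like the one reported.
artifacts = {"junit_02.xml": 2 * 1024 * 1024, "junit_01.xml": 650 * 1024 * 1024}
parseable, too_large = split_by_size(artifacts)
print("parsed:", parseable)
print("skipped as too large:", too_large)
```

With a guard like this, a single oversized file means no test cases are read at all, which would match what the tab shows.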
@onsi is there any way to keep a simplified junit report, as Ginkgo v1 did, instead of dumping everything into that file? |
The solution seems to be adapting kubetest.
I can try to take a shot after code freeze |
This is really something that we shouldn't have to do in kubetest :/ (consider this also impacts tools like sonobuoy), and we're generally not accepting more complexity in that tool https://github.com/kubernetes/test-infra/tree/master/kubetest#deprecation-notice |
Attached are a v1 junit report and a v2 junit report for reference. The v2 entry additionally embeds a <system-err> element.
v1:
<testcase name="[sig-storage] In-tree Volumes [Driver: rbd][Feature:Volumes][Serial] [Testpattern: Generic Ephemeral-volume (default fs) (late-binding)] ephemeral should support expansion of pvcs created for ephemeral pvcs" classname="Kubernetes e2e suite" time="0">
<skipped></skipped>
</testcase>
v2:
<testcase name="[It] [sig-storage] In-tree Volumes [Driver: rbd][Feature:Volumes][Serial] [Testpattern: Generic Ephemeral-volume (default fs) (late-binding)] ephemeral should support multiple inline ephemeral volumes" classname="Kubernetes e2e suite" status="skipped" time="0">
<skipped message="skipped"></skipped>
<system-err>[ReportAfterEach] TOP-LEVEL
 /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/e2e_test.go:142
</system-err>
</testcase>
Running just one test case, v1 has 21145 lines and v2 has 28258 lines:
# du -sh junit_v1.xml
2.0M junit_v1.xml
# du -sh junit_v2.xml
3.5M junit_v2.xml |
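To get a feel for how much of a report is captured output, here is a minimal sketch that estimates the share of each testcase taken up by <system-out> and <system-err>. The function name `output_share` and the shortened sample documents are assumptions for illustration; they only mirror the shape of the v1 and v2 entries quoted in this thread.

```python
import xml.etree.ElementTree as ET


def output_share(junit_xml: str) -> float:
    """Return the fraction of serialized testcase characters that belong to
    <system-out>/<system-err> children."""
    root = ET.fromstring(junit_xml)
    captured = 0
    total = 0
    for case in root.iter("testcase"):
        total += len(ET.tostring(case, encoding="unicode"))
        for tag in ("system-out", "system-err"):
            for node in case.findall(tag):
                captured += len(ET.tostring(node, encoding="unicode"))
    return captured / total if total else 0.0


# Shortened, illustrative samples in the shape of the v1 and v2 entries.
V1 = (
    '<testsuite><testcase name="a" classname="Kubernetes e2e suite" time="0">'
    '<skipped></skipped></testcase></testsuite>'
)
V2 = (
    '<testsuite><testcase name="[It] a" classname="Kubernetes e2e suite" '
    'status="skipped" time="0"><skipped message="skipped"></skipped>'
    '<system-err>[ReportAfterEach] TOP-LEVEL</system-err></testcase></testsuite>'
)
print("v1 share:", output_share(V1))
print("v2 share:", round(output_share(V2), 2))
```

Running something like this over a real artifact would show whether pruning captured output is enough, or whether other fields also contribute to the size.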
hey all, when running in parallel ginkgo v2 now merges the individual reports generated by each parallel process into one composite process. there is no way to turn this off (v1's behavior was a shortcut at the time and somewhat ugly and confusing). v2 also merges junit reports from multiple suites into a single file. this can be turned off with in v2 i also took a closer look at what few official-seeming junit specs exist and updated the reporter to match. currently there is no mechanism to control that default behavior. you can, however, build a custom reporter to do some filtering/processing in-suite and for a project of k8s scope that might make the most sense. for example - a reporter that modifies/filters the report object before manually calling |
@azylinski How can we trigger this CI: ci-kubernetes-e2e-gce-scale-correctness? |
I'm not sure if you can easily, as the project is constantly used by correctness and performance tests. |
FYI this is CI job that is being run every day: https://prow.k8s.io/?job=ci-kubernetes-e2e-gce-scale-correctness |
I have tried splitting, but splitting the 600+ M xml file would be time-consuming.
I am not sure whether that job relies on e2e.test; if so, we might consider adding a custom reporter for it. From what I observed, for jobs triggered regularly for a PR, the report size is less than 50M. The gap between 50M and 600M is huge, and I lack the background to get more info on that job. @BenTheElder Speaking of the custom reporter, are testgrid or Spyglass smart enough to fetch the results from two different junit xml files, one trimmed and one original? |
/cc @pohly for thoughts. |
I think we should try to reduce the amount of information that gets stored in the single JUnit file. We also should ensure that we only store a single output stream (#109744 (comment)). That addresses the main problem ("too much information"). Splitting into different files feels like a poor workaround because eventually they need to be merged again in memory when dealing with the entire job run.
I don't think we need the original, more complete JUnit file. |
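One possible shape for such a reduction, sketched as post-processing of the XML rather than as a Ginkgo reporter: drop <system-out> and <system-err> from test cases that did not fail. Keeping output for failed cases is an assumption of this sketch, not something decided in this thread, and `prune_junit` is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def prune_junit(junit_xml: str) -> str:
    """Remove captured output from testcases without <failure> or <error>."""
    root = ET.fromstring(junit_xml)
    for case in root.iter("testcase"):
        failed = case.find("failure") is not None or case.find("error") is not None
        if failed:
            continue  # assumption: output is still wanted when debugging failures
        for tag in ("system-out", "system-err"):
            for node in case.findall(tag):  # findall returns a list, safe to mutate
                case.remove(node)
    return ET.tostring(root, encoding="unicode")


# Illustrative input: one skipped case and one failed case, both with output.
SAMPLE = (
    '<testsuite>'
    '<testcase name="[It] skipped case" status="skipped">'
    '<skipped message="skipped"></skipped>'
    '<system-err>captured noise</system-err>'
    '</testcase>'
    '<testcase name="[It] failed case" status="failed">'
    '<failure message="boom">details</failure>'
    '<system-err>useful log</system-err>'
    '</testcase>'
    '</testsuite>'
)
pruned = prune_junit(SAMPLE)
print(pruned)
```

Doing the same filtering inside the suite, before the file is written, avoids ever producing the large file; the sketch only shows which elements would go.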
I am afraid someone needs the original one, since it holds the single source of truth if anyone wants to check the details of the test execution.
+1 for a couple of reasons: @onsi said that splitting by the number of parallel processes was just a shortcut at the time, and v2 sounds like it did the right thing, but this doesn't work for testgrid 😅. Splitting is time-consuming and not trivial to implement. |
Here is the comparison after this pr: original one: more than 90% off. similarly, we will eventually get around 60 M for the job |
Spyglass seems work, here is the link, while the testcase passed or skipped, Spyglass doesn't provide the link to the Spyglass reads the trimmed xml file btw, not sure how it selects the file though. |
I checked https://testgrid.k8s.io/sig-release-master-informing#gce-master-scale-correctness&show-stale-tests= and all test case statuses should be back. Per https://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-correctness/1555161967272923136/artifacts/, the size is now about 2.5M. One thing to note: the test case names have changed slightly because of the new junit xml format. For example, the string "[It]" is now added, so you can only check the status of "Kubernetes e2e suite.[It] [sig-storage] Volumes NFSv4 should be mountable for NFSv4" now. We could modify the report again to trim the "[It]" string, which should be easy, but I am not sure whether we need to make everything the same as before. If you are okay with the new test names, I think we can close this issue; otherwise we can update the names as well. @azylinski @pohly @aojea thoughts? |
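If the old names were wanted back, trimming could look roughly like the sketch below. It is plain string handling on the name attribute, with `normalize_name` as a hypothetical helper; it assumes the prefix is exactly "[It] " at the start of the name, as in the example above.

```python
def normalize_name(name: str) -> str:
    """Strip a leading "[It] " so v2 names line up with v1 names."""
    prefix = "[It] "
    return name[len(prefix):] if name.startswith(prefix) else name


new_name = "[It] [sig-storage] Volumes NFSv4 should be mountable for NFSv4"
print(normalize_name(new_name))
```

Names without the prefix pass through unchanged, so the same helper could be applied to reports from before and after the migration.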
I personally think it's okay: k8s just delivers this file, and how downstream tools consume the data is not predictable. If the "[It]" string added to the test case name is not necessary, then it is a bug in Ginkgo and should be fixed in Ginkgo; in that case, we can filter out the string as a temporary workaround. @onsi |
Another difference I found: skipped test cases are no longer shown in testgrid. The new format carries the attribute status="skipped", whereas previously an entry looked like this: <testcase name="[sig-storage] In-tree Volumes [Driver: nfs] [Testpattern: Dynamic PV (block volmode)] multiVolume [Slow] should access to two volumes with different volume mode and retain data across pod recreation on different node" classname="Kubernetes e2e suite" time="0"> Can testgrid or some other tool read "status" to get the result instead? @aojea do you know someone familiar with the tool who can help update it a little? |
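A consumer that wants to handle both formats could classify a testcase along these lines. This is only a sketch of the idea, not how testgrid evaluates results; `case_result` and the sample elements are hypothetical, and the third sample (status attribute without a <skipped> child) is an assumed edge case.

```python
import xml.etree.ElementTree as ET


def case_result(case: ET.Element) -> str:
    """Classify a <testcase> as failed, skipped, or passed."""
    if case.find("failure") is not None or case.find("error") is not None:
        return "failed"
    # v1 marks skips with a <skipped> child; v2 also sets status="skipped".
    if case.find("skipped") is not None or case.get("status") == "skipped":
        return "skipped"
    return "passed"


v1_skipped = ET.fromstring('<testcase name="a" time="0"><skipped></skipped></testcase>')
v2_skipped = ET.fromstring(
    '<testcase name="[It] a" status="skipped" time="0">'
    '<skipped message="skipped"></skipped></testcase>'
)
status_only = ET.fromstring('<testcase name="[It] b" status="skipped" time="0"></testcase>')
passed = ET.fromstring('<testcase name="[It] c" status="passed" time="1"></testcase>')
for case in (v1_skipped, v2_skipped, status_only, passed):
    print(case.get("name"), "->", case_result(case))
```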
I see, it seems the new reporting format breaks testgrid. @michelle192837 can you help us recover the functionality in testgrid? |
@azylinski @chendave 👋 I'm the release 1.25 bug triage lead. I'm reaching out to check the status of this issue and whether we are still targeting the current release. Also, is it a release blocker? Can you please update the priority as well? |
FWIW I think this is implemented here https://github.com/GoogleCloudPlatform/testgrid/blob/master/pkg/updater/eval.go (junit in GCS => testgrid) |
@helayoty I don't think this is a release blocker, the change needed in Kubernetes has been merged, meanwhile, I pushed a pr in testgrid to close this GoogleCloudPlatform/testgrid#1055. |
Thanks for replying back, I'll move it to 1.26 then |
@helayoty: The provided milestone is not valid for this repository. Milestones in this repository: [ Use In response to this:
/milestone v1.26 |
/close
This issue has been resolved: I can see that the test cases are showing up now, thanks to the fixes added to the e2e framework. The remaining part is to show skipped test cases in testgrid, but that is a separate issue; better not to add more things here. |
@aojea: Closing this issue. In response to this:
Which jobs are failing?
Since 8th JUL, the test cases don't show up on testgrid:
https://testgrid.k8s.io/sig-release-master-informing#gce-master-scale-correctness&show-stale-tests=
Which tests are failing?
ci-kubernetes-e2e-gce-scale-correctness
At the moment it's not clear from testgrid which runs passed and which failed.
Since when has it been failing?
2022-07-08
Testgrid link
https://testgrid.k8s.io/sig-release-master-informing#gce-master-scale-correctness&show-stale-tests=
Reason for failure (if possible)
Since 8th JUL, the build-log and junit artifacts look different.
Before, the junit file was split into parts, junit_{}.xml, and it contained only testcase data (name, time, skipped*, ...). E.g.: https://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-correctness/1545015201953222656/artifacts/
From 07-08, there's a single junit_01.xml file, which is 650MB+ and contains all the std-out and std-err information, which is a duplication of build logs. E.g.: https://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-correctness/1550813416132710400/artifacts/
Looking at the difference in the kubernetes/kubernetes source code, there were a number of changes to ginkgo, klog and traces, any of which could be related:
2a017f9...4569e64
cc @chendave - I see you've made a number of commits; would you be able to help?
Anything else we need to know?
No response
Relevant SIG(s)
No response