flake data excludes pod-utils jobs #14643

Closed
BenTheElder opened this issue Oct 7, 2019 · 23 comments
Closed

flake data excludes pod-utils jobs #14643

BenTheElder opened this issue Oct 7, 2019 · 23 comments
Labels: area/metrics, area/prow, kind/bug

@BenTheElder
Member

See http://storage.googleapis.com/k8s-metrics/flakes-latest.json etc. (the files produced by metrics/)
and http://velodrome.k8s.io/dashboard/db/bigquery-metrics?orgId=1

The flake data is very misleading; for example, pull-kubernetes-verify shows "no flakes", which is definitely wrong.

What seems to be happening is that we only include data from bootstrap.py results, not pod-utils (I think), possibly due to handling of the repos data (per @cjwagner).

We should fix this; not having flake data is a pretty big regression for managing Kubernetes presubmits. I didn't realize that jobs I'd migrated were losing this data.

@BenTheElder added the kind/bug and area/metrics labels on Oct 7, 2019
@stevekuznetsov
Contributor

What data/files is the flake analysis using? Are the pod-utils not uploading something they should?
/assign

@BenTheElder
Member Author

I'm not sure I fully understand the pipeline just yet; I took a quick look and it seems to go something like:

I think the issue is that repos is now a JSON blob, so @fejta suggested something like making this a field in the database and then reading that, instead of this:

where i.key = 'repos'
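
For illustration only, a rough sketch (not the actual kettle code) of what "making repos a field in the database" could look like when the build row is assembled; the function name, the fallback order, and the assumption that started/finished are already-parsed dicts are all mine:

def extract_repos(started, finished):
    # Bootstrap-style jobs record repos under finished.json's metadata...
    meta = (finished or {}).get('metadata') or {}
    if 'repos' in meta:
        return meta['repos']
    # ...while pod-utils jobs write repos at the top level of started.json.
    return (started or {}).get('repos')

The query could then read that dedicated field instead of unnesting metadata key/value pairs with where i.key = 'repos'.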

@BenTheElder
Member Author

I'm not actually sure which component is at fault here or exactly why this doesn't work, but looking at the data we produce: jobs that run fully on pod-utils are missing, and jobs that migrated to pod-utils on newer branches show "0 flakes" when they definitely have non-zero flakes.

I would tend to suggest that the pipeline is a bit hairy and probably at fault, but the results are generally very useful for identifying sources of flakiness.

The data looks (?) present in the pod-utils output to me, but I'm not fully familiar with that format or the BigQuery pipeline...

https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/83519/pull-kubernetes-e2e-gce/1181359333854679040/started.json
https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/83519/pull-kubernetes-e2e-kind/1181359333888233473/started.json

https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/83519/pull-kubernetes-e2e-gce/1181359333854679040/finished.json
https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/83519/pull-kubernetes-e2e-kind/1181359333888233473/finished.json

... maybe it's reading repos from finished.json instead of started?

@stevekuznetsov
Contributor

Hmm -- not sure, I've never looked at that pipeline myself. Happy to help if we can identify what the utils should be doing to be compliant.

@BenTheElder
Member Author

BenTheElder commented Oct 8, 2019 via email

@BenTheElder
Member Author

This came up again today wrt pull-kubernetes-integration flakes, but I don't really understand this pipeline and I'm pretty far over capacity.

It seems like we run BigQuery quer{y,ies} and then pipe the results through jq? ... these are fairly gnarly.
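
Roughly this shape, I think (a sketch only, assuming the bq and jq CLIs; the table name, query, and filter below are placeholders, not the real metrics config):

import subprocess

# Run a placeholder query and post-process the JSON rows with jq.
rows = subprocess.check_output([
    "bq", "query", "--format=json", "--use_legacy_sql=false",
    "SELECT path, result FROM `k8s-gubernator.build.all` LIMIT 10",
])
filtered = subprocess.check_output(["jq", "[.[] | {path, result}]"], input=rows)
print(filtered.decode())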

@stevekuznetsov
Contributor

Who's an expert on that pipeline?

@BenTheElder
Member Author

Cole is the only person I can remember touching it in the past year or so.

@spiffxp
Member

spiffxp commented Jan 8, 2020

/assign
I'll take a look

Work that may overlap #15469

@spiffxp
Member

spiffxp commented Jan 10, 2020

The flakes query looks for version != 'unknown' for CI jobs and metadata.key == 'repos' for PR jobs: https://github.com/kubernetes/test-infra/blob/5deb5b970e73cdd55b3068b9c50962e8657bdb23/metrics/configs/flakes-config.yaml

Looking at fields in the builds table for pr:pull-kubernetes-e2e-kind vs. pr:pull-kubernetes-e2e-gce: kind has a null version and an empty metadata field, while gce's are populated.

metadata comes from either finished.json or started.json

def get_metadata():
    metadata = None
    if finished and 'metadata' in finished:
        metadata = finished['metadata']
    elif started:
        metadata = started.get('metadata')

version comes from finished.json

if 'version' in finished:
    build['version'] = finished['version']
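
Restating those two rules in Python (an illustration of the criteria described above, not the actual query): a pod-utils-only job fails whichever check applies to it, because its finished.json carries neither a version nor a metadata block:

def counted_by_flakes_query(build, is_pr_job):
    # PR jobs need a 'repos' key in metadata; CI jobs need a real version.
    if is_pr_job:
        return bool((build.get('metadata') or {}).get('repos'))
    return build.get('version') not in (None, 'unknown')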

@spiffxp
Member

spiffxp commented Jan 10, 2020

example kind job: https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/85282/pull-kubernetes-e2e-kind/1215213961905967104

finished.json has no version field and no metadata field

{"timestamp":1578565960,"passed":false,"result":"FAILURE","revision":"05c8dce8bcb1874ad57bcdeb391c11fcccff2a58"}

started.json has repos in it, but no metadata field

{"timestamp":1578564680,"pull":"85282","repo-version":"49162743c0055b4395dd40bdf910f2c0472973b5","repos":{"kubernetes/kubernetes":"master:ef69bc910f0e47bbe3cf396d4bebf4f678cf6f3a,85282:05c8dce8bcb1874ad57bcdeb391c11fcccff2a58"}}

example gce job: https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/86450/pull-kubernetes-e2e-gce/1215395413390004227

finished.json has a version field and metadata with repos populated:

{
  "timestamp": 1578610945, 
  "version": "v1.18.0-alpha.1.550+64e0fc900b5b3f", 
  "result": "FAILURE", 
  "passed": false, 
  "job-version": "v1.18.0-alpha.1.550+64e0fc900b5b3f", 
  "metadata": {
    "repo-commit": "64e0fc900b5b3fcd5e5a16cb76ed40b1b900df15", 
    "node_os_image": "cos-77-12371-89-0", 
    "repos": {
      "k8s.io/kubernetes": "master:aef336d71253d9897f83425e80a231763d1385e8,86450:91a6050b58898d14f48ef893733cff070b17c0db", 
      "k8s.io/release": "master"
    }, 
    "infra-commit": "dd307d2a7", 
    "repo": "k8s.io/kubernetes", 
    "master_os_image": "cos-77-12371-89-0", 
    "job-version": "v1.18.0-alpha.1.550+64e0fc900b5b3f", 
    "pod": "c130ee54-332c-11ea-9e6e-4a9fb1cbefb2", 
    "revision": "v1.18.0-alpha.1.550+64e0fc900b5b3f"
  }
}
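
Applying the quoted extraction logic to the kind example makes the gap concrete (a quick illustration, not output from the real pipeline):

kind_started = {"timestamp": 1578564680, "pull": "85282",
                "repos": {"kubernetes/kubernetes": "master:...,85282:..."}}
kind_finished = {"timestamp": 1578565960, "passed": False, "result": "FAILURE"}

# finished.json has no 'metadata', and started.json's repos sit at the top
# level rather than under 'metadata', so the build row gets neither:
metadata = kind_finished.get('metadata') or kind_started.get('metadata')
version = kind_finished.get('version')
print(metadata, version)  # None None

The gce example populates both, which is why it still shows up in the flake metrics.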

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Apr 13, 2020
@spiffxp
Member

spiffxp commented Apr 14, 2020

/remove-lifecycle stale
There is a lot of organic data munging in the existing pipeline, and assumptions about the use of bootstrap and/or scripts in k/k's hack directory.

I got as far as writing a Google Doc proposal that suggested adding repo and repo_commit fields, which would more closely match what testgrid is going to support going forward. Unfortunately it looks like this would require plumbing through the job -> pod-utils -> gcs -> kettle -> bigquery -> metrics-queries pipeline and touching almost every part along the way.

I was left with the impression that if we need to touch every part of the pipeline, maybe we want to consider rewriting parts of it piecemeal.
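
For reference, the shape of what the proposal suggested adding is roughly this (hypothetical sketch; where the fields would actually live, started.json, finished.json metadata, or the build table itself, was part of the open plumbing question):

proposed_fields = {
    "repo": "kubernetes/kubernetes",                            # primary repo under test
    "repo_commit": "05c8dce8bcb1874ad57bcdeb391c11fcccff2a58",  # commit actually checked out
}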

@k8s-ci-robot removed the lifecycle/stale label on Apr 14, 2020
@BenTheElder
Member Author

@spiffxp lacking this data seems problematic. Can we at least add some snippet we run in the wrapper script to dump this to e.g. metadata.json, or update the pipeline to consume prowjob.json, or ...?
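
Something along these lines is what I have in mind, purely a sketch: it assumes the Prow-provided env vars below and that the sidecar merges an artifacts/metadata.json into finished.json's metadata (my recollection, not verified here), and it only covers the single-repo presubmit case:

import json
import os

# Reconstruct a bootstrap-style 'repos' entry from env vars Prow sets for
# presubmits, and write it where (I believe) the sidecar picks it up.
repo = "%s/%s" % (os.environ["REPO_OWNER"], os.environ["REPO_NAME"])
refs = "%s:%s,%s:%s" % (
    os.environ["PULL_BASE_REF"], os.environ["PULL_BASE_SHA"],
    os.environ["PULL_NUMBER"], os.environ["PULL_PULL_SHA"])
with open(os.path.join(os.environ["ARTIFACTS"], "metadata.json"), "w") as f:
    json.dump({"repos": {repo: refs}}, f)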

@MushuEE
Contributor

MushuEE commented Jul 6, 2020

/assign
Going to need to build this into Flake efforts

@spiffxp
Member

spiffxp commented Oct 22, 2020

#19666 covers updating queries

@spiffxp
Member

spiffxp commented Jan 12, 2021

Current status:

We haven't decided whether to swap out the old for the new:

  • There are more jobs in the new set of results: this is expected, and good!
  • Most jobs have flakiest: null - is this expected?
  • For jobs that appear in both sets of results, the new results show more flakes and lower consistency. Do we know why? Do we care?

When we decide to swap out old for new, we should also look at updating other queries before calling this done (ref: #20013)

@spiffxp
Member

spiffxp commented Jan 12, 2021

/milestone v1.21

@k8s-ci-robot added this to the v1.21 milestone on Jan 12, 2021
@MushuEE
Contributor

MushuEE commented Jan 12, 2021

Thanks @spiffxp, I have only done one side-by-side comparison of job results and had not seen that discrepancy in the job data. I will try to look into this soon.

@MushuEE
Contributor

MushuEE commented Jan 13, 2021

Here are the results I am seeing:
Old query top 10:

job build_consistency commit_consistency flakes runs commits
pr:pull-kubernetes-e2e-gce-ubuntu-containerd 0.943 0.934 26 491 395
ci-kubernetes-e2e-gce-multizone 0.83 0.701 20 194 67
ci-kubernetes-e2e-gci-gce-ipvs 0.628 0.746 16 137 63
ci-kubernetes-e2e-gci-gce-flaky 0.591 0.761 16 242 67
ci-kubernetes-e2e-gci-gce-ip-alias 0.925 0.761 16 254 67
ci-kubernetes-e2e-gci-gce 0.937 0.783 15 255 69
ci-kubernetes-e2e-gci-gce-proto 0.925 0.791 14 240 67
ci-kubernetes-e2e-gci-gce-kube-dns-nodecache 0.799 0.794 13 134 63
pr:pull-kubernetes-node-e2e 0.974 0.969 13 494 415
ci-kubernetes-e2e-gci-gce-coredns 0.87 0.813 12 138 64

New query top 10:

job build_consistency commit_consistency flakes runs commits
ci-kubernetes-cached-make-test 0.526 0.057 66 603 70
pr:pull-kubernetes-e2e-gce-ubuntu-containerd 0.945 0.938 29 564 465
ci-kubernetes-generate-make-test-cache 0.753 0.687 21 158 67
ci-kubernetes-coverage-unit 0.771 0.692 20 157 65
ci-kubernetes-e2e-gce-multizone 0.83 0.701 20 194 67
ci-kubernetes-e2e-gci-gce-ipvs 0.628 0.746 16 137 63
ci-kubernetes-e2e-gci-gce-flaky 0.591 0.761 16 242 67
ci-kubernetes-e2e-gci-gce-ip-alias 0.925 0.761 16 254 67
pr:pull-kubernetes-node-e2e 0.972 0.967 16 562 480
ci-kubernetes-e2e-gci-gce 0.937 0.783 15 255 69

@MushuEE
Contributor

MushuEE commented Jan 15, 2021

/close

@k8s-ci-robot
Contributor

@MushuEE: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@MushuEE
Contributor

MushuEE commented Jan 15, 2021

#20500
