flake data excludes pod-utils jobs #14643

Closed
BenTheElder opened this issue Oct 7, 2019 · 23 comments
Closed

flake data excludes pod-utils jobs #14643

BenTheElder opened this issue Oct 7, 2019 · 23 comments
Labels: area/metrics, area/prow, kind/bug

@BenTheElder
Member

See http://storage.googleapis.com/k8s-metrics/flakes-latest.json etc. (the files produced by metrics/)
and http://velodrome.k8s.io/dashboard/db/bigquery-metrics?orgId=1

The flake data is very misleading; for example, pull-kubernetes-verify shows "no flakes", which is definitely wrong.

What seems to be happening is that we only include data from bootstrap.py results, not pod-utils (I think), possibly due to handling of the repos data (per @cjwagner).

We should fix this; not having flake data is a pretty big regression for managing Kubernetes presubmits. I didn't realize that jobs I'd migrated were losing this data.

@BenTheElder added the kind/bug and area/metrics labels on Oct 7, 2019
@stevekuznetsov
Contributor

What data/files is the flake analysis using? Are the pod-utils not uploading something they should?
/assign

@BenTheElder
Member Author

I'm not sure I fully understand the pipeline just yet; I took a quick look and it seems to go something like:

I think the issue is that repos is now a JSON blob, so @fejta suggested something like making this a field in the database and then reading that, instead of this:

where i.key = 'repos'
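
For illustration only, a rough sketch (not the actual kettle code) of what "making repos a field in the database" could look like when the build row is assembled; the function name, the fallback order, and the assumption that started/finished are already-parsed dicts are all mine:

def extract_repos(started, finished):
    # Bootstrap-style jobs record repos under finished.json's metadata...
    meta = (finished or {}).get('metadata') or {}
    if 'repos' in meta:
        return meta['repos']
    # ...while pod-utils jobs write repos at the top level of started.json.
    return (started or {}).get('repos')

The query could then read that dedicated field instead of unnesting metadata key/value pairs with where i.key = 'repos'.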

@BenTheElder
Member Author

I'm not actually sure which component is at fault here or exactly why this doesn't work, but looking at the data we produce: jobs that run fully on pod-utils are missing, and jobs that migrated to pod-utils on newer branches show "0 flakes" when they definitely have non-zero flakes.

I would tend to suggest that the pipeline is a bit hairy and probably at fault, but the results are generally very useful for identifying sources of flakiness.

The data looks (?) present in the pod-utils output to me, but I'm not fully familiar with that format or the BigQuery pipeline...

https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/83519/pull-kubernetes-e2e-gce/1181359333854679040/started.json
https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/83519/pull-kubernetes-e2e-kind/1181359333888233473/started.json

https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/83519/pull-kubernetes-e2e-gce/1181359333854679040/finished.json
https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/83519/pull-kubernetes-e2e-kind/1181359333888233473/finished.json

... maybe it's reading repos from finished.json instead of started?

@stevekuznetsov
Contributor

Hmm -- not sure, I've never looked at that pipeline myself. Happy to help if we can identify what the utils should be doing to be compliant.

@BenTheElder
Member Author

BenTheElder commented Oct 8, 2019 via email

@BenTheElder
Member Author

This came up again today wrt pull-kubernetes-integration flakes, but I don't really understand this pipeline and I'm pretty far over capacity.

It seems like we run BigQuery quer{y,ies} and then pipe the results through jq? ... these are fairly gnarly.
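
Roughly this shape, I think (a sketch only, assuming the bq and jq CLIs; the table name, query, and filter below are placeholders, not the real metrics config):

import subprocess

# Run a placeholder query and post-process the JSON rows with jq.
rows = subprocess.check_output([
    "bq", "query", "--format=json", "--use_legacy_sql=false",
    "SELECT path, result FROM `k8s-gubernator.build.all` LIMIT 10",
])
filtered = subprocess.check_output(["jq", "[.[] | {path, result}]"], input=rows)
print(filtered.decode())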

@stevekuznetsov
Contributor

Who's an expert on that pipeline?

@BenTheElder
Member Author

Cole is the only person I can remember touching it in the past year or so.

@spiffxp
Member

spiffxp commented Jan 8, 2020

/assign
I'll take a look

Work that may overlap #15469

@spiffxp
Member

spiffxp commented Jan 10, 2020

The flakes query looks for version != 'unknown' for CI jobs and metadata.key == 'repos' for PR jobs: https://github.com/kubernetes/test-infra/blob/5deb5b970e73cdd55b3068b9c50962e8657bdb23/metrics/configs/flakes-config.yaml

Looking at fields in the builds table for pr:pull-kubernetes-e2e-kind vs. pr:pull-kubernetes-e2e-gce: kind has a null version and an empty metadata field, while gce's are populated.

metadata comes from either finished.json or started.json

def get_metadata():
    metadata = None
    if finished and 'metadata' in finished:
        metadata = finished['metadata']
    elif started:
        metadata = started.get('metadata')

version comes from finished.json

if 'version' in finished:
    build['version'] = finished['version']
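
Restating those two rules in Python (an illustration of the criteria described above, not the actual query): a pod-utils-only job fails whichever check applies to it, because its finished.json carries neither a version nor a metadata block:

def counted_by_flakes_query(build, is_pr_job):
    # PR jobs need a 'repos' key in metadata; CI jobs need a real version.
    if is_pr_job:
        return bool((build.get('metadata') or {}).get('repos'))
    return build.get('version') not in (None, 'unknown')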

@spiffxp
Member

spiffxp commented Jan 10, 2020

example kind job: https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/85282/pull-kubernetes-e2e-kind/1215213961905967104

finished.json has no version field and no metadata field

{"timestamp":1578565960,"passed":false,"result":"FAILURE","revision":"05c8dce8bcb1874ad57bcdeb391c11fcccff2a58"}

started.json has repos in it, but no metadata field

{"timestamp":1578564680,"pull":"85282","repo-version":"49162743c0055b4395dd40bdf910f2c0472973b5","repos":{"kubernetes/kubernetes":"master:ef69bc910f0e47bbe3cf396d4bebf4f678cf6f3a,85282:05c8dce8bcb1874ad57bcdeb391c11fcccff2a58"}}

example gce job: https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/86450/pull-kubernetes-e2e-gce/1215395413390004227

finished.json has a version field and metadata with repos populated:

{
  "timestamp": 1578610945, 
  "version": "v1.18.0-alpha.1.550+64e0fc900b5b3f", 
  "result": "FAILURE", 
  "passed": false, 
  "job-version": "v1.18.0-alpha.1.550+64e0fc900b5b3f", 
  "metadata": {
    "repo-commit": "64e0fc900b5b3fcd5e5a16cb76ed40b1b900df15", 
    "node_os_image": "cos-77-12371-89-0", 
    "repos": {
      "k8s.io/kubernetes": "master:aef336d71253d9897f83425e80a231763d1385e8,86450:91a6050b58898d14f48ef893733cff070b17c0db", 
      "k8s.io/release": "master"
    }, 
    "infra-commit": "dd307d2a7", 
    "repo": "k8s.io/kubernetes", 
    "master_os_image": "cos-77-12371-89-0", 
    "job-version": "v1.18.0-alpha.1.550+64e0fc900b5b3f", 
    "pod": "c130ee54-332c-11ea-9e6e-4a9fb1cbefb2", 
    "revision": "v1.18.0-alpha.1.550+64e0fc900b5b3f"
  }
}
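
Applying the quoted extraction logic to the kind example makes the gap concrete (a quick illustration, not output from the real pipeline):

kind_started = {"timestamp": 1578564680, "pull": "85282",
                "repos": {"kubernetes/kubernetes": "master:...,85282:..."}}
kind_finished = {"timestamp": 1578565960, "passed": False, "result": "FAILURE"}

# finished.json has no 'metadata', and started.json's repos sit at the top
# level rather than under 'metadata', so the build row gets neither:
metadata = kind_finished.get('metadata') or kind_started.get('metadata')
version = kind_finished.get('version')
print(metadata, version)  # None None

The gce example populates both, which is why it still shows up in the flake metrics.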

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Apr 13, 2020
@spiffxp
Member

spiffxp commented Apr 14, 2020

/remove-lifecycle stale
There is a lot of organic data munging in the existing pipeline, and assumptions about the use of bootstrap and/or scripts in k/k's hack directory.

I got as far as writing a Google Doc proposal that suggested adding repo and repo_commit fields, which would more closely match what testgrid is going to support going forward. Unfortunately it looks like this would require plumbing through the job -> pod-utils -> gcs -> kettle -> bigquery -> metrics-queries pipeline and touching almost every part along the way.

I was left with the impression that if we need to touch every part of the pipeline, maybe we want to consider rewriting parts of it piecemeal.
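
For reference, the shape of what the proposal suggested adding is roughly this (hypothetical sketch; where the fields would actually live, started.json, finished.json metadata, or the build table itself, was part of the open plumbing question):

proposed_fields = {
    "repo": "kubernetes/kubernetes",                            # primary repo under test
    "repo_commit": "05c8dce8bcb1874ad57bcdeb391c11fcccff2a58",  # commit actually checked out
}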

@k8s-ci-robot removed the lifecycle/stale label on Apr 14, 2020
@BenTheElder
Member Author

@spiffxp lacking this data seems problematic. Can we at least add some snippet we run in the wrapper script to dump this to e.g. metadata.json, or update the pipeline to consume prowjob.json, or ...?
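
Something along these lines is what I have in mind, purely a sketch: it assumes the Prow-provided env vars below and that the sidecar merges an artifacts/metadata.json into finished.json's metadata (my recollection, not verified here), and it only covers the single-repo presubmit case:

import json
import os

# Reconstruct a bootstrap-style 'repos' entry from env vars Prow sets for
# presubmits, and write it where (I believe) the sidecar picks it up.
repo = "%s/%s" % (os.environ["REPO_OWNER"], os.environ["REPO_NAME"])
refs = "%s:%s,%s:%s" % (
    os.environ["PULL_BASE_REF"], os.environ["PULL_BASE_SHA"],
    os.environ["PULL_NUMBER"], os.environ["PULL_PULL_SHA"])
with open(os.path.join(os.environ["ARTIFACTS"], "metadata.json"), "w") as f:
    json.dump({"repos": {repo: refs}}, f)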

@MushuEE
Contributor

MushuEE commented Jul 6, 2020

/assign
Going to need to build this into Flake efforts

@spiffxp
Member

spiffxp commented Oct 22, 2020

#19666 covers updating queries

@spiffxp
Member

spiffxp commented Jan 12, 2021

Current status:

We haven't decided whether to swap out the old for the new:

  • There are more jobs in the new set of results: this is expected, and good!
  • Most jobs have flakiest: null - is this expected?
  • For jobs that appear in both sets of results, the new results show more flakes and lower consistency. Do we know why? Do we care?

When we decide to swap out old for new, we should also look at updating other queries before calling this done (ref: #20013)

@spiffxp
Member

spiffxp commented Jan 12, 2021

/milestone v1.21

@k8s-ci-robot added this to the v1.21 milestone on Jan 12, 2021
@MushuEE
Contributor

MushuEE commented Jan 12, 2021

Thanks @spiffxp, I have only done one side-by-side comparison of job results and had not seen that discrepancy in the job data. I will try to look into this soon.

@MushuEE
Contributor

MushuEE commented Jan 13, 2021

Here are the results I am seeing:
Old query top 10:

job build_consistency commit_consistency flakes runs commits
pr:pull-kubernetes-e2e-gce-ubuntu-containerd 0.943 0.934 26 491 395
ci-kubernetes-e2e-gce-multizone 0.83 0.701 20 194 67
ci-kubernetes-e2e-gci-gce-ipvs 0.628 0.746 16 137 63
ci-kubernetes-e2e-gci-gce-flaky 0.591 0.761 16 242 67
ci-kubernetes-e2e-gci-gce-ip-alias 0.925 0.761 16 254 67
ci-kubernetes-e2e-gci-gce 0.937 0.783 15 255 69
ci-kubernetes-e2e-gci-gce-proto 0.925 0.791 14 240 67
ci-kubernetes-e2e-gci-gce-kube-dns-nodecache 0.799 0.794 13 134 63
pr:pull-kubernetes-node-e2e 0.974 0.969 13 494 415
ci-kubernetes-e2e-gci-gce-coredns 0.87 0.813 12 138 64

New query top 10:

job build_consistency commit_consistency flakes runs commits
ci-kubernetes-cached-make-test 0.526 0.057 66 603 70
pr:pull-kubernetes-e2e-gce-ubuntu-containerd 0.945 0.938 29 564 465
ci-kubernetes-generate-make-test-cache 0.753 0.687 21 158 67
ci-kubernetes-coverage-unit 0.771 0.692 20 157 65
ci-kubernetes-e2e-gce-multizone 0.83 0.701 20 194 67
ci-kubernetes-e2e-gci-gce-ipvs 0.628 0.746 16 137 63
ci-kubernetes-e2e-gci-gce-flaky 0.591 0.761 16 242 67
ci-kubernetes-e2e-gci-gce-ip-alias 0.925 0.761 16 254 67
pr:pull-kubernetes-node-e2e 0.972 0.967 16 562 480
ci-kubernetes-e2e-gci-gce 0.937 0.783 15 255 69

@MushuEE
Contributor

MushuEE commented Jan 15, 2021

/close

@k8s-ci-robot
Contributor

@MushuEE: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@MushuEE
Contributor

MushuEE commented Jan 15, 2021

#20500
