go.k8s.io/triage is out of date #9271
Closed · spiffxp opened this issue Sep 5, 2018 · 20 comments · 4 participants

@spiffxp (Member) commented Sep 5, 2018

/area kettle
/kind bug
/assign

http://velodrome.k8s.io/dashboard/db/bigquery-metrics?panelId=12&fullscreen&orgId=1

Called out in https://github.com/kubernetes/test-infra/blob/master/docs/oss-oncall-log.md

Last log entry

==== 2018-08-29 14:55:36 PDT ========================================
PULLED 471
ACK irrelevant 469
EXTEND-ACK  2
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-kops-aws-channelalpha/5155 True True 2018-08-29 14:23:01 PDT SUCCESS
gs://kubernetes-jenkins/pr-logs/pull/batch/pull-kubernetes-e2e-kops-aws/104011 True True 2018-08-29 14:05:59 PDT SUCCESS
ACK "finished.json" 2
Downloading JUnit artifacts.

Replace the pod

spiffxp@spiffxp-macbookpro:kettle (master %)$ k get pods
NAME                      READY     STATUS    RESTARTS   AGE
kettle-5df45c4dcb-7tnx9   1/1       Running   202        26d
spiffxp@spiffxp-macbookpro:kettle (master %)$ kubectl delete pod -l app=kettle
pod "kettle-5df45c4dcb-7tnx9" deleted
spiffxp@spiffxp-macbookpro:kettle (master %)$ kubectl rollout status deployment/kettle
deployment "kettle" successfully rolled out
spiffxp@spiffxp-macbookpro:kettle (master %)$ kubectl get pod -l app=kettle
NAME                      READY     STATUS        RESTARTS   AGE
kettle-5df45c4dcb-7tnx9   1/1       Terminating   202        26d
kettle-5df45c4dcb-fzkjt   1/1       Running       0          16s

Watch the logs

spiffxp@spiffxp-macbookpro:kettle (master %)$ kubectl logs -f $(kubectl get pod -l app=kettle -oname)
Activated service account credentials for: [kettle@k8s-gubernator.iam.gserviceaccount.com]
Loading builds from gs://kubernetes-jenkins/pr-logs
already have 1296792 builds
gs://kubernetes-jenkins/pr-logs/pull/kubeflow_kubebench/74/kubeflow-kubebench-presubmit/144
gs://kubernetes-jenkins/pr-logs/pull/kubeflow_examples/242/kubeflow-examples-presubmit/672
gs://kubernetes-jenkins/pr-logs/pull/kubeflow_examples/242/kubeflow-examples-presubmit/666
gs://kubernetes-jenkins/pr-logs/pull/kubeflow_examples/242/kubeflow-examples-presubmit/664
gs://kubernetes-jenkins/pr-logs/pull/cloud-provider-gcp/50/cloud-provider-gcp-tests/30
# ...
@spiffxp (Member Author) commented Sep 5, 2018

I will close this when the spice is flowing once more

@krzyzacy (Member) commented Sep 5, 2018

hummm restart the pod?

@spiffxp (Member Author) commented Sep 5, 2018

At last glance, the kettle pod is still trying to catch up, and data is flowing into BigQuery. I'm not sure whether anything needs to be done to refresh triage or the metrics driving our velodrome dashboard. Going to wait a bit longer.
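
For what it's worth, one way to check how far behind the BigQuery data is would be to look at the newest row kettle has uploaded; this is a minimal sketch, and the dataset/table and column names are assumptions rather than anything confirmed in this thread:

bq query --nouse_legacy_sql \
  'SELECT MAX(started) AS newest_started FROM `k8s-gubernator.build.all`'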

@BenTheElder (Member) commented Sep 10, 2018

/assign
currently oncall and looking into this

@spiffxp (Member Author) commented Sep 24, 2018

/unassign @BenTheElder
Trying to get to the point where a run takes less than two hours. In the meantime, I've updated the triage page with results from a manual run on my laptop:

https://storage.googleapis.com/k8s-gubernator/triage/index.html
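
(For reference, the manual update presumably comes down to copying the locally generated results into the bucket that page is served from; the file name below is an assumption, only the gs://k8s-gubernator/triage/ destination appears in this thread.)

gsutil cp failure_data.json gs://k8s-gubernator/triage/failure_data.json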

@spiffxp (Member Author) commented Sep 24, 2018

/area triage
/remove-area kettle

@BenTheElder (Member) commented Sep 24, 2018

Thanks @spiffxp!

@spiffxp (Member Author) commented Sep 25, 2018

https://storage.googleapis.com/k8s-gubernator/triage/index.html now looks truncated midway through, trying to figure out why

@spiffxp (Member Author) commented Sep 26, 2018

It was truncated because I was downloading an old tarball of results. Whoops. One last thing, going to try and get https://k8s-testgrid.appspot.com/sig-testing-misc#triage populated

@spiffxp (Member Author) commented Sep 26, 2018

/area kettle
https://k8s-testgrid.appspot.com/sig-testing-misc#metrics-kettle

I think kettle may be failing again

@spiffxp (Member Author) commented Sep 26, 2018

Kettle's log output is confusing me; it's streaming:

gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335654
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335666
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335650
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335640
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335655
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335649
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335647
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335663
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335662
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335646
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335659
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335645
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335652
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335641
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335639
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335635
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335633
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335631
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335643
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335628
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335630

Trying a delete/rollout and seeing what happens
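
That is, the same steps as earlier in this issue:

kubectl delete pod -l app=kettle
kubectl rollout status deployment/kettle
kubectl logs -f $(kubectl get pod -l app=kettle -oname)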

@spiffxp (Member Author) commented Sep 27, 2018

It fell into the same loop, but on a different bucket. I have kicked kettle again, and it seems to be going further this time?

I looked at logs for the past few days in stackdriver. Normal behavior is:

  • make_db.py gets called around 1am PDT
  • Starts loading buckets with "Loading builds from gs://kubernetes-jenkins/pr-logs", which is hardcoded as the first bucket to load
  • Second to last bucket loaded is gs://kubernetes-jenkins/logs/
  • Last bucket loaded is gs://istio-circleci/

Last night:

  • I see the entry for kubernetes-jenkins/logs,
  • ... but not istio-circleci
  • Instead, it starts looping on ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new as described in the previous comment

@spiffxp (Member Author) commented Sep 27, 2018

Current suspicion: while deciding how to enumerate builds for a given job, we hit an error reading the latest-build.txt file; the error is silently swallowed, which kicks us over to the non-sequential path:

https://github.com/kubernetes/test-infra/blob/master/kettle/make_db.py#L137-L145

This path ends up going through a while True loop that could keep looping indefinitely:

https://github.com/kubernetes/test-infra/blob/master/kettle/make_db.py#L96-L109
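
If that's right, the failure would look roughly like this; a simplified sketch, not the actual make_db.py code (gcs, read, and list_subdirs are hypothetical helpers):

def get_builds(gcs, job_prefix):
    # Sequential path: latest-build.txt tells us the newest build number.
    try:
        latest = int(gcs.read(job_prefix + 'latest-build.txt'))
        return [job_prefix + str(n) + '/' for n in range(1, latest + 1)]
    except Exception:
        pass  # error swallowed silently -> fall through to the listing path

    # Non-sequential path: page through the build directories. If the paging
    # marker never advances, or completion is never signalled, this while True
    # loop re-lists the same job's builds forever.
    builds = []
    marker = None
    while True:
        page, marker = gcs.list_subdirs(job_prefix, marker)
        builds.extend(page)
        if not marker:
            return builds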

@spiffxp (Member Author) commented Sep 27, 2018

Now suspect this is related to tot being down while the rest of prow was down during the 2018-09-25 outage (https://docs.google.com/document/d/1kwqU4sCycwxfTsV774lnrtFakCg90rMXNShmjSqyEJI/view); the huge build numbers in the looping paths above look like snowflake-style IDs rather than the small sequential numbers tot normally vends, which would break enumeration via latest-build.txt.

@BenTheElder (Member) commented Sep 28, 2018

Is this stable now or still giving us problems?

@spiffxp (Member Author) commented Sep 28, 2018

I think it's stable. I'll close after verifying, and open a follow-up issue for extending how far back we look. I'm tempted to punt on hardening kettle if we're going to revisit tot/snowflake IDs.

@spiffxp (Member Author) commented Sep 28, 2018

/close

  • ref: #9615 for triage lookback
  • ref: #9604 for tot

@k8s-ci-robot (Contributor) commented Sep 28, 2018

@spiffxp: Closing this issue.

In response to this:

/close

  • ref: #9615 for triage lookback
  • ref: #9604 for tot

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
