
Uploading logs to GCS seems to have stopped working #34446

Closed
wojtek-t opened this issue Oct 10, 2016 · 18 comments · Fixed by #34647
Assignees: gmarek, wojtek-t
Labels: area/test-infra, kind/bug, priority/critical-urgent

Comments

@wojtek-t (Member) commented Oct 10, 2016

It seems that we stopped uploading logs to GCS around Friday/Saturday.

As an example, this is the first kubemark-scale run that doesn't have logs:
https://pantheon.corp.google.com/storage/browser/kubernetes-jenkins/logs/kubernetes-kubemark-gce-scale/1468/?project=kubernetes-jenkins

But we also don't have any logs for the kubernetes-e2e-gce suite (which seems to be our main suite), e.g.:
https://pantheon.corp.google.com/storage/browser/kubernetes-jenkins/logs/kubernetes-e2e-gce/24856/?project=kubernetes-jenkins
https://pantheon.corp.google.com/storage/browser/kubernetes-jenkins/logs/kubernetes-e2e-gce/24857/?project=kubernetes-jenkins
https://pantheon.corp.google.com/storage/browser/kubernetes-jenkins/logs/kubernetes-e2e-gce/24858/?project=kubernetes-jenkins

This seems to be the last run for which we have logs:
https://pantheon.corp.google.com/storage/browser/kubernetes-jenkins/logs/kubernetes-e2e-gce/24740/?project=kubernetes-jenkins

which suggests that it broke around 2016-10-08 01:00:00.
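
For reference, the Pantheon browser paths above map directly to gs:// URIs, so a quick way to compare runs (assuming gsutil is installed and you have read access to the kubernetes-jenkins bucket) is:

# Bucket paths taken from the links above; requires gsutil with read access.
gsutil ls gs://kubernetes-jenkins/logs/kubernetes-e2e-gce/24740/   # last run with logs
gsutil ls gs://kubernetes-jenkins/logs/kubernetes-e2e-gce/24856/   # run with logs missing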

@kubernetes/test-infra-admins @kubernetes/sig-testing
@fejta @ixdy @spxtr

@wojtek-t added the kind/bug, priority/critical-urgent, and area/test-infra labels Oct 10, 2016
@wojtek-t (Member, Author) commented Oct 10, 2016

P0 - because it makes any debugging effectively impossible.

@ixdy (Member) commented Oct 10, 2016

@rmmh is working on this

@wojtek-t (Member, Author)

It seems that even after yesterday's fixes, we are still missing logs from all kubemark runs. @gmarek

@wojtek-t (Member, Author)

Lack of kubemark logs is significantly slowing down any scalability-related work, which is our P0 goal for this quarter, so please prioritize it.
@fgrzadkowski - FYI ^^

@wojtek-t (Member, Author)

As an example:
https://pantheon.corp.google.com/storage/browser/kubernetes-jenkins/logs/kubernetes-kubemark-gce-scale/1473/?project=kubernetes-jenkins
is a run from an hour ago, and it contains only the started.json file.

@fejta (Contributor) commented Oct 11, 2016

I'm going to migrate this job so that our code (rather than a Jenkins plugin) captures and uploads logs. We want to do that anyway, and it will hopefully resolve this.

Odd... if we look at job/kubernetes-e2e-gce/configure, we see a bunch of conditional build steps in the post-build actions configuration, which matches what is specified in the YAML:
https://github.com/kubernetes/test-infra/blob/master/jenkins/job-configs/kubernetes-jenkins/kubernetes-e2e-gce.yaml#L46

We also have this specified in the kubemark YAML:
https://github.com/kubernetes/test-infra/blob/master/jenkins/job-configs/kubernetes-jenkins/kubernetes-kubemark.yaml#L26

However, I don't see where this is applied in the actual configuration: /job/kubernetes-kubemark-gce-scale/configure.
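
For context, the step these configs are supposed to wire up boils down to an artifact upload along these lines (a sketch only, not the actual plugin configuration; the bucket layout comes from the run links above, and JOB_NAME and BUILD_NUMBER are the standard Jenkins environment variables):

# Sketch of what the post-build upload does, not the real conditional build step.
gsutil -m cp -r /workspace/_artifacts \
    "gs://kubernetes-jenkins/logs/${JOB_NAME}/${BUILD_NUMBER}/artifacts"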

@ixdy (Member) commented Oct 11, 2016

I suspect that the job updater ran at some point this weekend while the conditional build step plugin wasn't working, so it wasn't able to add that step. I'm not sure why it hasn't rectified it since.

@rmmh (Contributor) commented Oct 11, 2016

That seems to be the problem. I've cleared the job cache and done a full rebuild, so gce-scale and anything else that was updated while the post-build plugin was missing should be fixed now.
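
If the job updater here is Jenkins Job Builder (an assumption; the thread doesn't name the tool), this matches JJB's caching: it records a hash of each job it pushes and skips jobs it believes are unchanged, so a job that Jenkins mangled while the plugin was missing never gets re-pushed until the cache is cleared. Roughly:

# Sketch, assuming Jenkins Job Builder: drop the local cache, then re-push everything.
rm -rf ~/.cache/jenkins_jobs/
jenkins-jobs update jenkins/job-configs/kubernetes-jenkins/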

@rmmh closed this as completed Oct 11, 2016
@wojtek-t (Member, Author)

Hmm - it doesn't seem to be fully fixed.

Basically, we used to have logs from the kubemark master machine, e.g.:
https://pantheon.corp.google.com/storage/browser/kubernetes-jenkins/logs/kubernetes-kubemark-500-gce/6550/artifacts/?project=kubernetes-jenkins
(note the kubemark-500-kubemark-master/ directory)

However, even now (this is the latest run), those logs are missing:
https://pantheon.corp.google.com/storage/browser/kubernetes-jenkins/logs/kubernetes-kubemark-500-gce/6686/artifacts/?project=kubernetes-jenkins

So something is still broken...

@wojtek-t reopened this Oct 11, 2016
@rmmh (Contributor) commented Oct 12, 2016

All the files in _artifacts are being properly uploaded. Something else is failing to copy the logs onto the node.

In gce-scale/1477:

SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSDumping master and node logs to /workspace/_artifacts
Copying 'kern kube-apiserver kube-scheduler kube-controller-manager etcd glbc cluster-autoscaler docker kubelet supervisor/supervisord supervisor/kubelet-stdout supervisor/kubelet-stderr supervisor/docker-stdout supervisor/docker-stderr' from kubemark-2000-kubemark-master
Node SSH not supported for kubemark

In gce-scale/1465, it seems to be copying more:

SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSDumping master and node logs to /workspace/_artifacts
scp: /var/log/glbc.log*: No such file or directory
scp: /var/log/cluster-autoscaler.log*: No such file or directory
scp: /var/log/supervisor/kubelet-stdout.log*: No such file or directory
scp: /var/log/supervisor/kubelet-stderr.log*: No such file or directory
scp: /var/log/supervisor/docker-stdout.log*: No such file or directory
scp: /var/log/supervisor/docker-stderr.log*: No such file or directory
ERROR: (gcloud.compute.copy-files) [/usr/bin/scp] exited with return code [1]. See https://cloud.google.com/compute/docs/troubleshooting#ssherrors for troubleshooting hints.
Node SSH not supported for kubemark

Did something else change in kubemark between these time periods? Is scp from the master broken now? This is a separate issue.
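
For reference, the scp failure in the 1465 output is generic: gcloud compute copy-files shells out to scp, and scp exits nonzero when any remote glob (e.g. /var/log/glbc.log*) matches nothing, even if the other files were copied fine. A tolerant version would copy each pattern separately, roughly like this (a sketch with assumed variable names, not the actual log-dump.sh code):

# Sketch only: MASTER_NAME and ZONE are placeholders; the log list comes from the output above.
for f in kern kube-apiserver kube-scheduler kube-controller-manager etcd docker kubelet; do
  gcloud compute copy-files "${MASTER_NAME}:/var/log/${f}.log*" /workspace/_artifacts/ \
      --zone "${ZONE}" || echo "no ${f} logs on master, skipping"
done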

@rmmh closed this as completed Oct 12, 2016
@wojtek-t (Member, Author)

I'm not aware of any other changes.
@gmarek - we should try to debug it tomorrow.

@wojtek-t reopened this Oct 12, 2016
@wojtek-t assigned gmarek and wojtek-t and unassigned rmmh and fejta Oct 12, 2016
@spxtr (Contributor) commented Oct 12, 2016

We did touch log-dump.sh recently in #34153. Might be related.

@zmerlynn (Member)

Yeah, I was just trying to work out whether it was. I tried to keep the fake kubemark path alive as best I could.

@wojtek-t (Member, Author)

@zmerlynn - if you could figure out whether it was caused by your PR (the timing matches), that would be great...

zmerlynn added a commit to zmerlynn/kubernetes that referenced this issue Oct 12, 2016
@zmerlynn (Member)

@wojtek-t: #34647

@wojtek-t (Member, Author)

I confirm that this is fixed now.
@zmerlynn - thanks a lot for fixing this!
