Zuul Job failures #65

Closed
chavafg opened this issue Sep 14, 2018 · 16 comments
@chavafg
Contributor

chavafg commented Sep 14, 2018

Recently, the Zuul jobs have been failing. I see several different failures:

  1. Job timeout.
    Some of the jobs time out and, unfortunately, there seems to be no way to check why (or at which step) a job timed out.
    For example: http://logs.openstack.org/06/706/225e10cfc4bb99722b6f5734a1e840138bcea8a0/third-party-check/kata-runsh/e318be6/ara-report/
    shows that the run.yaml could not be executed and no logs of that task are available.
    On the other hand, I see that the post.yaml was executed, which collects the kata logs, meaning the machine didn't hang. So I was wondering whether there is a way to know the reason for this timeout.

  2. Unable to apply a git patch.
    It seems that sometimes we are unable to apply a patch with git:

Applying patch: /home/zuul/src/github.com/kata-containers/packaging/obs-packaging/qemu-lite/patches/0001-memfd-fix-configure-test.patch

*** Please tell me who you are.

Run

  git config --global user.email "you@example.com"
  git config --global user.name "Your Name"

to set your account's default identity.
fatal: empty ident name (for <zuul@ubuntu-xenial-vexxhost-vexxhost-sjc1-0001993195.(none)>) not allowed

For this one, I think we have two options: 1. add a git config to the Zuul jobs before running the setup, or 2. replace git am with patch.

  3. For some reason, sometimes the Vexxhost machines where we run the CI do not have nested virtualization enabled:
time="2018-09-13T17:01:05Z" level=info msg="CPU property found" arch=amd64 description="Intel Architecture CPU" name=GenuineIntel pid=29482 source=runtime type=attribute
time="2018-09-13T17:01:05Z" level=error msg="CPU property not found" arch=amd64 description="Virtualization support" name=vmx pid=29482 source=runtime type=flag
time="2018-09-13T17:01:05Z" level=info msg="CPU property found" arch=amd64 description="64Bit CPU" name=lm pid=29482 source=runtime type=flag
time="2018-09-13T17:01:05Z" level=info msg="CPU property found" arch=amd64 description=SSE4.1 name=sse4_1 pid=29482 source=runtime type=flag
time="2018-09-13T17:01:05Z" level=info msg="kernel property found" arch=amd64 description="Kernel-based Virtual Machine" name=kvm pid=29482 source=runtime type=module
time="2018-09-13T17:01:05Z" level=info msg="kernel property found" arch=amd64 description="Host kernel accelerator for virtio" name=vhost pid=29482 source=runtime type=module
time="2018-09-13T17:01:05Z" level=info msg="kernel property found" arch=amd64 description="Host kernel accelerator for virtio network" name=vhost_net pid=29482 source=runtime type=module
time="2018-09-13T17:01:05Z" level=info msg="kernel property found" arch=amd64 description="Intel KVM" name=kvm_intel pid=29482 source=runtime type=module
time="2018-09-13T17:01:05Z" level=error msg="open /sys/module/kvm_intel/parameters/nested: no such file or directory" arch=amd64 name=kata-runtime pid=29482 source=runtime
open /sys/module/kvm_intel/parameters/nested: no such file or directory
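
For reference, a quick way to check these two properties by hand on a node (just a rough sketch; the paths are the ones from the log above):

  # Does the CPU advertise VMX (hardware virtualization support)?
  grep -c -w vmx /proc/cpuinfo

  # Is nested virtualization enabled for kvm_intel?
  # (this is the file the runtime failed to open above)
  cat /sys/module/kvm_intel/parameters/nested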
@jodh-intel
Contributor

Related: #42.

@cboylan

cboylan commented Sep 14, 2018

For 1 I have pushed https://review.openstack.org/602627 to double the timeout to two hours from one hour. We time out the job run itself independently of the log collection so that we can debug things (which is why you see all of the kata logs); unfortunately, due to the way ARA works, it has a hard time when we stop Ansible under it. The good news is you can see the text version of the log at http://logs.openstack.org/06/706/225e10cfc4bb99722b6f5734a1e840138bcea8a0/third-party-check/kata-runsh/e318be6/job-output.txt.gz. My read of that is that the job was just taking its time and the timeout is too short, but let us know if that isn't how you read it.

For 2 I pushed https://review.openstack.org/602628 which will globally configure some generic throwaway git identity details.
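
That amounts to something along these lines (just a sketch; the exact values used in the change may differ):

  # Give git a generic, throwaway identity so `git am` can create
  # commits when the packaging patches are applied.
  git config --global user.email "zuul@example.com"
  git config --global user.name "Zuul CI"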

For 3 we'll need more information to be sure of what is happening. Can you provide a link to the log files? One potential reason is that Vexxhost turned on a new region which we are using; it should have nested virt enabled everywhere, but maybe a hypervisor is misconfigured. If you can give us the log link, @mnaser can use those details to check this.

@cboylan

cboylan commented Sep 14, 2018

Managed to catch an instance of 3 myself. @mnaser http://logs.openstack.org/74/74/a77fc6b82dab0368f1cdc4d4d39ef6390c7a9526/third-party-check/kata-runsh/f17b1f9/job-output.txt.gz#_2018-09-14_14_53_28_559500 ran in sjc1 and if you scroll to http://logs.openstack.org/74/74/a77fc6b82dab0368f1cdc4d4d39ef6390c7a9526/third-party-check/kata-runsh/f17b1f9/job-output.txt.gz#_2018-09-14_14_57_59_437331 it claims not to have vmx. Let me know if the instance info there isn't enough for you and I will find the instance UUID on our end.

openstack-gerrit pushed a commit to openstack/openstack-zuul-jobs that referenced this issue Sep 16, 2018
As indicated at kata-containers/ci#65, our existing hour-long timeout is not sufficient. Bump it up to two hours to give plenty of room.

Change-Id: I39a706fe70f0f552a7bb986765acef065bbbace1
openstack-gerrit pushed a commit to openstack/openstack-zuul-jobs that referenced this issue Sep 16, 2018
The kata test jobs apply patches to git repos with git. This creates commits, which requires you to have a user configured in git. Set up a global git config with generic Zuul identity info in it to address this.

More details at kata-containers/ci#65

Change-Id: I08a6a13501fad92cd290f0a9e5559f61b11d7fab
@mnaser
Member

mnaser commented Sep 17, 2018

@cboylan this should be resolved, sjc1 now has vmx again.

@chavafg
Contributor Author

chavafg commented Sep 17, 2018

Got this error on the last recheck:

1..32
not ok 1 ctr not found correct error message
# (from function `start_crio' in file helpers.bash, line 211,
#  in test file ctr.bats, line 10)
#   `start_crio' failed
# time="2018-09-17T13:26:37Z" level=error msg="error opening storage: /dev/vdb is not available for use with devicemapper" 

@mnaser, our cri-o tests require a /dev/vdb device. Can this also be added to sjc1?

@mnaser
Member

mnaser commented Sep 17, 2018

Is there a way to work around this in your CI to remove that expectation? Maybe using a loopback device?
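
Roughly something like this (the file path and size are placeholders, not a tested setup for the CRI-O devicemapper storage):

  # Create a sparse backing file and attach it to a free loop device,
  # then point the storage setup at the loop device instead of /dev/vdb.
  sudo truncate -s 10G /var/lib/crio-storage.img
  LOOP_DEV=$(sudo losetup --find --show /var/lib/crio-storage.img)
  echo "using ${LOOP_DEV} in place of /dev/vdb"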

@chavafg
Contributor Author

chavafg commented Sep 17, 2018

IIRC we have disabled support for loopback devices, @sboeuf?

@sboeuf

sboeuf commented Sep 17, 2018

That was about us running the CRI-O tests in a stable environment. CRI-O was not stable using loopback devices; that's why we moved away from them.

@chavafg
Contributor Author

chavafg commented Sep 17, 2018

@mnaser @cboylan we have this /dev/vdb device on the VMs that Jenkins launches; is there a way to also restrict the Zuul CI to that VM type?

@mnaser
Member

mnaser commented Sep 17, 2018

Kata currently runs its own custom flavour in the Jenkins CI, while it uses our normal flavours in SJC1 (Zuul).

I really suggest that we come up with a solution for this together, as depending on that custom flavour means that no one can run those tests on their own VMs.

@cboylan any workaround ideas?

@cboylan

cboylan commented Sep 17, 2018

Might help to have more info about what the block device is used for.

But generally, my suggestion would be to use a loopback device. At least when testing Swift and Cinder we've used them with success: with Swift, to provide an XFS filesystem regardless of the host system; with Cinder, to provide a dedicated VG out of which Cinder can provision LVs with LVM.
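
For example, the Cinder-style approach boils down to roughly this (names and sizes are illustrative, not the actual job configuration; the Swift jobs instead run mkfs.xfs on the loop device):

  # Back a dedicated volume group with a loop device so LVM volumes
  # can be provisioned without a real spare disk.
  sudo truncate -s 8G /opt/test-volumes.img
  LOOP_DEV=$(sudo losetup --find --show /opt/test-volumes.img)
  sudo pvcreate "${LOOP_DEV}"
  sudo vgcreate test-volumes "${LOOP_DEV}"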

@chavafg
Contributor Author

chavafg commented Sep 18, 2018

So I have tested the use of loopback devices locally and it still works.
I could add a condition in our cri-o setup to use a loopback device when running on Zuul (or when not running on Jenkins). We need to keep in mind that using loopback devices will add ~5 minutes of execution time, since they are slower.

wdyt @sboeuf @egernst ?
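
Something along these lines (a rough sketch only; the variable name is made up and it keys off the presence of /dev/vdb rather than off which CI system is running):

  # Use the real /dev/vdb when it exists (Jenkins flavour), otherwise
  # fall back to a loopback device (Zuul / generic flavours).
  if [ -b /dev/vdb ]; then
      CRIO_STORAGE_DEV=/dev/vdb
  else
      sudo truncate -s 10G /var/lib/crio-storage.img
      CRIO_STORAGE_DEV=$(sudo losetup --find --show /var/lib/crio-storage.img)
  fi
  echo "cri-o devicemapper will use ${CRIO_STORAGE_DEV}"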

@sboeuf

sboeuf commented Sep 18, 2018

@chavafg adding 5 minutes is a lot... Also, my main concern is the stability of the CI: I want to make sure we don't end up with inconsistent failures from the CRI-O tests because we're using a loopback device.

@cboylan

cboylan commented Sep 18, 2018

To clarify, this isn't really a Zuul vs. Jenkins issue as much as it is a Vexxhost cloud region A vs. region B issue. Zuul (and Nodepool) are able to speak to the new Vexxhost region, where mnaser would prefer not to set up kata-specific flavors to run the jobs (at least that is my understanding).

The reason for not doing that is to use something a bit more generic which ensures others can run these tests too.

The upside to using multiple regions is more resources overall but also more availability as we can lose an entire cloud region and keep running test jobs.

@grahamwhaley
Contributor

@chavafg - can we close this one now?

@chavafg
Contributor Author

chavafg commented Jan 10, 2019

Yes, let's close this one. The issues described here are already solved.

@chavafg chavafg closed this as completed Jan 10, 2019
GabyCT added a commit to GabyCT/ci that referenced this issue Feb 12, 2019
This will check that it is possible to perform a yum update inside a container.

Fixes kata-containers#65

Signed-off-by: Gabriela Cervantes <gabriela.cervantes.tellez@intel.com>