functional tests are failing on 'shell to vm' #399

Closed
nvgoldin opened this issue Dec 21, 2016 · 19 comments

@nvgoldin
Contributor

The CirrOS VM doesn't get an IP. Logging in with lago console and executing /etc/init.d/S40network restart resolves the issue. The only change made to automation was afec1ac. Could it be that dnsmasq is not yet configured when we run lago start (now that KVM is enabled, it probably completes faster than before)? @david-caro - any ideas?

@nvgoldin nvgoldin added the tests label Dec 21, 2016
@mykaul

mykaul commented Dec 21, 2016

I would say that this is exactly why we need to bring some order to SSH and root passwords, and to password-based versus public-key-based authentication. We need to make sure that such a VM:

  1. Does have a root password (for the ovirt-system-tests use case).
  2. Does have an SSH password (which of course happens to be the root password), for Lago to connect to it.
  3. Does not have an SSH public key (we do not inject a public key into it; note that ovirt-system-tests might in the future, via cloud-init).

Currently, there's quite a bit of a mixup. It's very similar to the situation with Hosted-Engine.

@gbenhaim
Member

We've already experienced this issue before: #318
The interface is shown in the VM, but for some reason ifup/ifdown can't recognize it. I don't think this issue is related to SSH, but I also don't have a good explanation for why it is related to the boot menu.

@mykaul

mykaul commented Dec 22, 2016 via email

@gbenhaim
Member

As I see it we have two options:

  1. Enable the boot menu.
  2. Search for another image to use in the functional tests.

@mykaul @nvgoldin @ifireball @david-caro
Thoughts ?

@mykaul

mykaul commented Dec 23, 2016 via email

@nvgoldin
Contributor Author

nvgoldin commented Dec 25, 2016

I looked at dnsmasq logs:

Dec 25 07:54:57 ng-lt dnsmasq-dhcp[4624]: DHCP, IP range 192.168.201.100 -- 192.168.201.254, lease time 1h
Dec 25 07:54:57 ng-lt dnsmasq-dhcp[4624]: DHCPv6, IP range fd8f:1391:3a82:5e0d::c0a8:c964 -- fd8f:1391:3a82:5e0d::c0a8:c9fe, lease time 1h
Dec 25 07:54:57 ng-lt dnsmasq-dhcp[4624]: router advertisement on fd8f:1391:3a82:5e0d::
Dec 25 07:54:57 ng-lt dnsmasq-dhcp[4624]: IPv6 router advertisement enabled
Dec 25 07:54:57 ng-lt dnsmasq-dhcp[4624]: DHCP, sockets bound exclusively to interface 71b4-785d519
Dec 25 07:54:57 ng-lt dnsmasq[4624]: reading /etc/resolv.conf
Dec 25 07:54:57 ng-lt dnsmasq[4624]: using nameserver 10.35.255.14#53
Dec 25 07:54:57 ng-lt dnsmasq[4624]: using nameserver 10.38.5.26#53
Dec 25 07:54:57 ng-lt dnsmasq[4624]: using nameserver 10.200.0.246#53
Dec 25 07:54:57 ng-lt dnsmasq[4624]: using nameserver 10.192.206.245#53
Dec 25 07:54:57 ng-lt dnsmasq[4624]: using nameserver 10.200.0.245#53
Dec 25 07:54:57 ng-lt dnsmasq[4624]: read /etc/hosts - 3 addresses
Dec 25 07:54:57 ng-lt dnsmasq[4624]: read /var/lib/libvirt/dnsmasq/71b4-785d5195a1.addnhosts - 4 addresses
Dec 25 07:54:57 ng-lt dnsmasq-dhcp[4624]: read /var/lib/libvirt/dnsmasq/71b4-785d5195a1.hostsfile
Dec 25 07:54:57 ng-lt dnsmasq-dhcp[4624]: duplicate IP address 192.168.201.3 (lago_functional_tests_vm01) in dhcp-config directive
Dec 25 07:54:57 ng-lt dnsmasq-dhcp[4624]: duplicate IP address 192.168.201.2 (lago_functional_tests_vm02) in dhcp-config directive
Dec 25 07:54:57 ng-lt dnsmasq-dhcp[4624]: duplicate IP address fd8f:1391:3a82:5e0d::c0a8:c903 (lago_functional_tests_vm01) in dhcp-config directive
Dec 25 07:54:57 ng-lt dnsmasq-dhcp[4624]: duplicate IP address fd8f:1391:3a82:5e0d::c0a8:c902 (lago_functional_tests_vm02) in dhcp-config directive
Dec 25 07:55:13 ng-lt dnsmasq-dhcp[4624]: RTR-ADVERT(71b4-785d519) fd8f:1391:3a82:5e0d::
Dec 25 07:55:27 ng-lt dnsmasq-dhcp[4624]: RTR-ADVERT(71b4-785d519) fd8f:1391:3a82:5e0d::
Dec 25 07:55:33 ng-lt dnsmasq-dhcp[4624]: RTR-ADVERT(71b4-785d519) fd8f:1391:3a82:5e0d::

and after the reboot:

Dec 25 07:55:44 ng-lt dnsmasq-dhcp[4624]: DHCPDISCOVER(71b4-785d519) 54:52:c0:a8:c9:02
Dec 25 07:55:44 ng-lt dnsmasq-dhcp[4624]: DHCPOFFER(71b4-785d519) 192.168.201.2 54:52:c0:a8:c9:02
Dec 25 07:55:44 ng-lt dnsmasq-dhcp[4624]: DHCPREQUEST(71b4-785d519) 192.168.201.2 54:52:c0:a8:c9:02
Dec 25 07:55:44 ng-lt dnsmasq-dhcp[4624]: DHCPACK(71b4-785d519) 192.168.201.2 54:52:c0:a8:c9:02 lago_functional_tests_vm02

So I'm not sure yet why it is complaining about duplicate definitions (and we try to add 4 hosts), or why rebooting the CirrOS VM resolves the issue. Also note that this doesn't always happen.
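To make log excerpts like the ones above easier to compare between passing and failing runs, here is a minimal Python sketch that pulls out the "duplicate IP address" warnings and the DHCP handshake events per MAC address. The function name `summarize_dnsmasq_log` and the regular expressions are assumptions written against the log lines quoted in this comment only; they are not part of Lago or dnsmasq and may need adjusting for other log formats.

```python
import re
from collections import defaultdict

# Patterns written against the dnsmasq-dhcp lines quoted in this thread (assumption).
DUPLICATE_RE = re.compile(
    r"duplicate IP address (\S+) \((\S+)\) in dhcp-config directive"
)
DHCP_EVENT_RE = re.compile(
    r"(DHCPDISCOVER|DHCPOFFER|DHCPREQUEST|DHCPACK)\(([^)]+)\)\s+(.*)$"
)
MAC_RE = re.compile(r"\b(?:[0-9a-f]{2}:){5}[0-9a-f]{2}\b", re.IGNORECASE)


def summarize_dnsmasq_log(lines):
    """Return (duplicates, events) extracted from dnsmasq log lines.

    duplicates: list of (hostname, ip) pairs reported as duplicate.
    events: dict mapping a MAC address to the ordered list of DHCP
            message types that dnsmasq logged for it.
    """
    duplicates = []
    events = defaultdict(list)
    for line in lines:
        dup = DUPLICATE_RE.search(line)
        if dup:
            duplicates.append((dup.group(2), dup.group(1)))
            continue
        event = DHCP_EVENT_RE.search(line)
        if event:
            mac = MAC_RE.search(event.group(3))
            if mac:
                events[mac.group(0).lower()].append(event.group(1))
    return duplicates, dict(events)
```

A MAC address that never appears in `events` for a failing run would suggest that the VM's DHCP requests did not reach dnsmasq at all, as opposed to dnsmasq refusing to answer them.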

@mykaul

mykaul commented Dec 25, 2016

We need to ensure Lago is not creating those definitions, i.e. that it is not racy. I think the networks are not prepared in parallel (but it's worth checking).
I've opened a bug on libvirt, as it seems to me that libvirt's dnsmasq config is sometimes not ready in time.

@nvgoldin
Contributor Author

@mykaul - will check. I also need to check whether the networkCreateXML call to libvirt is blocking, in the sense that it returns only when the operation has completed.

For reference, another failure: http://jenkins.ovirt.org/job/lago_master_check-merged-fc25-x86_64/3/console

@mykaul

mykaul commented Dec 25, 2016 via email

@nvgoldin
Contributor Author

nvgoldin commented Dec 25, 2016

perhaps we should ask isActive() on it?

exactly what I'm checking now..

is #404 ok to merge? we can try merging and see how others go.
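The isActive() idea above could be sketched roughly as follows. This is a minimal polling helper, assuming the object passed in behaves like a libvirt-python `virNetwork` (for example the one returned by `networkCreateXML`) and exposes `isActive()`. The helper name `wait_until_active` and its parameters are hypothetical, and this is not necessarily what #404 or #405 implement.

```python
import time


def wait_until_active(network, timeout=30.0, interval=0.5,
                      clock=time.monotonic, sleep=time.sleep):
    """Poll network.isActive() until it is truthy or the timeout expires.

    `network` is assumed to expose isActive() the way libvirt-python's
    virNetwork does. `clock` and `sleep` are injectable so the helper
    can be tested without real waiting.
    Returns True if the network became active, False on timeout.
    """
    deadline = clock() + timeout
    while True:
        if network.isActive():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

Note that an active network in libvirt's view may not guarantee that dnsmasq is already answering DHCP requests, so a check like this narrows the race rather than proving it is gone.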

@mykaul

mykaul commented Dec 25, 2016 via email

@nvgoldin
Contributor Author

Might have been resolved by #405. Let's keep this ticket open for a while to see if it happens again in the following weeks.

@nvgoldin
Contributor Author

Unfortunately another failure:
http://jenkins.ovirt.org/job/lago_master_check-merged-fc25-x86_64/13/console
Again, restarting the network on the CirrOS VM resolved the issue. The difference here is that it was running on a bare-metal slave, which probably made things faster. So it could be that the fix in #405 just gives more waiting time but does not fix the underlying issue.

@nvgoldin
Contributor Author

nvgoldin commented Jan 4, 2017

Moved the jobs back to VMs only for now.

@nvgoldin
Contributor Author

nvgoldin commented Jan 5, 2017

Finally got some real logs from the CirrOS VM on bootup:

Sending discover...

Sending discover...
Usage: /sbin/cirros-dhcpc <up|down>
No lease, failing
WARN: /etc/rc3.d/S40-network failed

For now I'm going to try adding a workaround inside the image itself.
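The actual workaround is not shown in this thread. Purely as an illustration of the general idea (retrying the lease request instead of giving up after the first "No lease, failing"), a sketch could look like the following. `acquire_lease_with_retries` and the `request_lease` callable are hypothetical names; inside the image itself this would more likely live in a shell init script.

```python
import time


def acquire_lease_with_retries(request_lease, attempts=5, delay=2.0,
                               sleep=time.sleep):
    """Call request_lease() until it reports success or attempts run out.

    `request_lease` is a hypothetical callable that returns True once a
    DHCP lease was obtained. Returns the 1-based number of the attempt
    that succeeded, or None if every attempt failed.
    """
    for attempt in range(1, attempts + 1):
        if request_lease():
            return attempt
        if attempt < attempts:
            # Give the DHCP server time to come up before trying again.
            sleep(delay)
    return None
```

A retry like this only masks the race between the guest's DHCP client and the host-side dnsmasq; it does not explain why the first requests go unanswered.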

gerrit-ovirt-org pushed a commit to oVirt/jenkins that referenced this issue Jan 5, 2017
lago functional tests fail when they run "too quickly",
specifically this happens on bare-metals. So bringing
them back to VMs for now.

bug-url: lago-project/lago#399

Change-Id: I8c08145a4209e56e9e06b339fb853720314eb7b7
Signed-off-by: Nadav Goldin <ngoldin@redhat.com>
@mykaul

mykaul commented Jan 5, 2017 via email

@nvgoldin
Contributor Author

nvgoldin commented Jan 5, 2017

Did you open a bug on cirros?

Not sure if this is a CirrOS bug. I need to do some more work to debug this on the libvirt/dnsmasq side; I suspect them, since I didn't see those DHCP requests reach dnsmasq at all (only after restarting the network/DHCP).

@nvgoldin
Contributor Author

nvgoldin commented Jan 8, 2017

@nvgoldin
Contributor Author

Safe to say this was fixed with the CirrOS workaround; I haven't seen it for a while.
