zone provisioning on admin interface seems to get stuck #158

Open
MerlinDMC opened this Issue Dec 30, 2012 · 37 comments

@MerlinDMC
Contributor

MerlinDMC commented Dec 30, 2012

Tested in VMware (image 20121228T011955Z).

I set up two NICs (admin on a host-only network, external on a bridged adapter); external was added manually, admin by the configuration wizard.

Provisioning a base64 1.8.4 image works if I add no NIC or use the external nic_tag:

[root@00-0c-29-18-ec-10 ~]# vmadm create << EOF
> {
>   "brand": "joyent",
>   "image_uuid": "fdea06b0-3f24-11e2-ac50-0b645575ce9d",
>   "autoboot": true,
>   "max_physical_memory": 256,
>   "max_swap": 256
> }
> EOF
Successfully created 3f6be33b-c745-456c-9f68-a300338b62b0
[root@00-0c-29-18-ec-10 ~]# vmadm create << EOF
> {
>   "brand": "joyent",
>   "image_uuid": "fdea06b0-3f24-11e2-ac50-0b645575ce9d",
>   "autoboot": true,
>   "max_physical_memory": 256,
>   "max_swap": 256,
>   "nics": [
>     {
>       "nic_tag": "external",
>       "ip": "10.0.0.213",
>       "netmask": "255.255.0.0",
>       "gateway": "10.0.0.254",
>       "primary": true
>     }
>   ]
> }
> EOF
Successfully created 545c9840-ec1d-486c-bbd6-50de99c8b2e9

Provisioning fails with a timeout after 5 minutes if I use the admin nic_tag:

[root@00-0c-29-18-ec-10 ~]# vmadm create << EOF
> {
>   "brand": "joyent",
>   "image_uuid": "fdea06b0-3f24-11e2-ac50-0b645575ce9d",
>   "autoboot": true,
>   "max_physical_memory": 256,
>   "max_swap": 256,
>   "nics": [
>     {
>       "nic_tag": "admin",
>       "ip": "10.0.0.123",
>       "netmask": "255.255.0.0",
>       "gateway": "10.0.0.254",
>       "primary": true
>     }
>   ]
> }
> EOF
timed out waiting for /var/svc/provisioning to move  for b0b4aa72-6082-42e2-97cf-51d5352e3e95

The config used:

[root@00-0c-29-18-ec-10 ~]# cat /usbkey/config 
#
# This file was auto-generated and must be source-able by bash.
#

coal=true
# admin_nic is the nic admin_ip will be connected to for headnode zones.
admin_nic=0:c:29:18:ec:10
admin_ip=dhcp
admin_netmask=
admin_network=...
admin_gateway=dhcp

external_nic=0:c:29:18:ec:1a
external0_ip=192.168.1.8
external0_netmask=255.255.255.0
external0_network=192.168.1.0
external0_gateway=192.168.1.1

headnode_default_gateway=192.168.1.1

dns_resolvers=8.8.8.8,8.8.4.4
dns_domain=

ntp_hosts=pool.ntp.org
compute_node_ntp_hosts=dhcp
@rmustacc
Member

rmustacc commented Dec 30, 2012

Does the same thing happen if instead of using the joyent brand you switch it to the joyent-minimal brand? I would assume that the traditional joyent brand's zoneinit is trying to reach the outside world to update pkgsrc and cannot.

@MerlinDMC
Contributor

MerlinDMC commented Dec 31, 2012

None of those zones have network connectivity to the outside world - the gateway 10.0.0.254 is not reachable.

But it works fine with the joyent-minimal brand.

@ryancnelson

ryancnelson commented Dec 31, 2012

I can confirm seeing this recently too… only on joyent-brand zones where the networking is either broken or on an unreachable "island network" segment… it looks like we've got a dependency on external-reaching network connectivity in those datasets. A quick look suggests it isn't the pkgin update issue from a while back, though.

I've seen this on old and new datasets -- is it possible this isn't in the dataset itself, but something common to any smartos-zone dataset, like a provisioning message that's trying to connect someplace?


@mamash
Contributor

mamash commented Dec 31, 2012

There should be no dependency on external networking.

Daniel - can you check the /zones/<uuid>/root/var/svc/log/system-zoneinit:default.log file? It contains the full bash xtrace of the zoneinit run.

@MerlinDMC
Contributor

MerlinDMC commented Dec 31, 2012

I tracked that down a little bit more.

In checkDatasetProvisionable (VM.js#L5972), getZoneinitJSON is used to read the zoneinit.json file and determine whether features.var_svc_provisioning is true or false.

I used the base64 1.8.4 dataset, which has a zoneinit.json file, so the logic should read it and see var_svc_provisioning: true. It doesn't.

It falls back to the check for the 04-mdata.sh file and logs:
{"name":"vmadm","hostname":"00-0c-29-18-ec-10","pid":10350,"level":30,"msg":"/zones/fdea06b0-3f24-11e2-ac50-0b645575ce9d/root/var/zoneinit/includes/04-mdata.sh exists","time":"2012-12-31T10:44:41.445Z","v":0}

This happens because getZoneinitJSON() seems to assume an absolute rootpath as its argument, but the path at VM.js#L5895 is relative, so the ENOENT exception is thrown there and the code ends up using the fallback checks.

Fixing that wouldn't fix my problem, but I guess it also shouldn't happen.

The /var/svc/provisioning file still exists after the timeout, so I assume the smartdc-mdata:execute script never runs (that's where the mv /var/svc/provision{ing,_success} happens).

ghost assigned joshwilsdon Dec 31, 2012

@joshwilsdon
Member

joshwilsdon commented Dec 31, 2012

This sounds like a bug in these datasets. I'll try to do some testing and add a workaround to vmadm so that we can provision 1.8.4 and any other datasets with this problem.

@robinbowes

robinbowes commented Jan 2, 2013

I've just hit the same issue.

This is my config:

swap=1.25x
admin_nic=0:25:90:51:f1:80
admin_ip=192.168.1.20
admin_netmask=255.255.255.0
admin_network=192.168.1.0
admin_gateway=192.168.1.1

dns_resolvers=192.168.1.90
dns_domain=robinbowes.com

ntp_hosts=0.pool.ntp.org

compute_node_ntp_hosts=0.pool.ntp.org
etherstub=stub0

default_keymap=uk

And this is the json I'm using to create the new VM:

{
  "zonename": "fifo",
  "alias": "fifo",
  "autoboot": true,
  "brand": "joyent",
  "image_uuid": "fdea06b0-3f24-11e2-ac50-0b645575ce9d",
  "max_physical_memory": 512,
  "resolvers": [
    "192.168.1.90"
  ],
  "nics": [
    {
      "interface": "net0",
      "nic_tag": "admin",
      "gateway": "192.168.1.1",
      "ip": "192.168.1.21",
      "netmask": "255.255.255.0"
    }
  ]
}

Is there a workaround at this stage?

R.

@MerlinDMC
Contributor

MerlinDMC commented Jan 2, 2013

As a workaround, setting "nowait": true in the payload should skip the wait-for-provisioning step.
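
For example, a sketch reusing the failing payload from the issue description (untested here; nowait only skips the wait, so the zone can still come up with zoneinit unfinished):

vmadm create << EOF
{
  "brand": "joyent",
  "image_uuid": "fdea06b0-3f24-11e2-ac50-0b645575ce9d",
  "autoboot": true,
  "max_physical_memory": 256,
  "max_swap": 256,
  "nowait": true,
  "nics": [
    {
      "nic_tag": "admin",
      "ip": "10.0.0.123",
      "netmask": "255.255.0.0",
      "gateway": "10.0.0.254",
      "primary": true
    }
  ]
}
EOF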

@mamash
Contributor

mamash commented Jan 2, 2013

I just tried this, and couldn't reproduce. Can someone provide the /zones/<uuid>/root/var/svc/log/system-zoneinit:default.log file? That's really the only source of information on why zoneinit failed.

@MerlinDMC
Contributor

MerlinDMC commented Jan 2, 2013

[root@00-0c-29-18-ec-10 ~]# vmadm list
UUID                                  TYPE  RAM      STATE             ALIAS
11f7efe9-a5d1-4f20-aede-07b57b982221  OS    256      failed            -

system-zoneinit:default.log

@mamash
Contributor

mamash commented Jan 2, 2013

Hm OK, so zoneinit is happy actually. What does the SMF log for mdata:execute (svcs -L mdata:execute) say then?

@MerlinDMC
Contributor

MerlinDMC commented Jan 2, 2013

smartdc-mdata:fetch.log

[ Dec 31 13:55:47 Disabled. ]
[ Dec 31 13:55:54 Enabled. ]

This one does not execute the script /lib/svc/method/mdata-fetch after it gets enabled, so mdata:execute never gets enabled and can't move the provisioning file.

If I force another boot with zoneadm -z 11f7efe9-a5d1-4f20-aede-07b57b982221 boot, the zone comes up, mdata:fetch runs fine, and the provisioning file is moved to /var/svc/provision_success.
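
For anyone else hitting this, the manual recovery from the global zone looks roughly like this (a sketch; the UUID is the failed zone from the listing above):

# boot the zone that got stuck after provisioning timed out
zoneadm -z 11f7efe9-a5d1-4f20-aede-07b57b982221 boot

# on this boot the mdata services should come up
zlogin 11f7efe9-a5d1-4f20-aede-07b57b982221 svcs mdata:fetch mdata:execute

# and the marker file should have been renamed
zlogin 11f7efe9-a5d1-4f20-aede-07b57b982221 ls /var/svc | grep provision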

@mamash
Contributor

mamash commented Jan 2, 2013

After some troubleshooting, it looks like the problem is that Cron is unhappy with the admin network setup and will not execute anything. This includes the 'at' job that reboots the zone at the end of the zoneinit run, which triggers the mdata:fetch -> mdata:execute chain and concludes the provisioning.

We'll debug this on our end.
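
In the meantime, a quick way to see whether a zone is stuck at that step (a sketch; run from the global zone, substituting the zone's UUID):

# is the zoneinit reboot job still sitting in the at queue?
zlogin <uuid> at -l

# is cron itself unhappy, and what does its log say?
zlogin <uuid> svcs -xv cron
zlogin <uuid> tail /var/cron/log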

@alcir

alcir commented Jan 16, 2013

Is there any news?

@alcir

alcir commented Jan 16, 2013

I tried, in various boot scripts, to redirect "ls -la /var/svc" into a file:
/etc/rc2_d/S99blah
/lib/svc/method/fs-usr
/var/zoneinit/includes/00-mdata.sh

and it seems that /var/svc/mdata is not present from the moment the zone boots.

@alcir

alcir commented Jan 16, 2013

And if, in the source dataset, I copy /var/svc/mdata to /var/svc/mdata.tmp and in a /var/zoneinit/includes script I add a line like
mv /var/svc/mdata.tmp /var/svc/mdata

the provisioning is successful:
Successfully created 80546aec-0e03-4f34-9bae-df6260aaef44
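
Roughly, that amounts to the following inside the source zone before re-creating the image (a sketch; the include name 03-restore-mdata.sh is made up):

cp -Rp /var/svc/mdata /var/svc/mdata.tmp
cat > /var/zoneinit/includes/03-restore-mdata.sh <<'EOS'
# put back the mdata bits that the provisioning cleanup removed
[ ! -e /var/svc/mdata.tmp ] || mv /var/svc/mdata.tmp /var/svc/mdata
EOS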

@mamash
Contributor

mamash commented Jan 16, 2013

alessioc, that's definitely unrelated to the bug described in this ticket, but I see where it's coming from. Can you please confirm your platform version (uname -v) and the image UUID you're trying to use?

@alcir

alcir commented Jan 17, 2013

uname -v
joyent_20130111T010112Z

image UUID
fdea06b0-3f24-11e2-ac50-0b645575ce9d

@mamash
Contributor

mamash commented Jan 17, 2013

alessioc, /var/svc/mdata is there for older platforms only. Newer platforms (from circa mid-December) delete /var/svc/mdata from the image during provisioning, as part of the cleanup, because a replacement SMF manifest for 'mdata' is auto-imported from /lib/svc/manifest/system/mdata.xml, which uses /lib/svc/method/mdata-fetch and /lib/svc/method/mdata-execute respectively.

You cannot opt out of the SMF import from /lib/svc/manifest, but somehow it's failing in your case. I cannot reproduce that with joyent_20130111T010112Z and fdea06b0-3f24-11e2-ac50-0b645575ce9d though. It's not related to this issue either way.

@alcir

alcir commented Jan 17, 2013

I use sm-prepare-img before creating my own dataset. Maybe the problem is that sm-prepare-img recreates mdata bits that are no longer needed?

@AlainODea
Contributor

AlainODea commented Jan 18, 2013

Here is my experience deploying the base-1.8.4 image.

I get timed out waiting for /var/svc/provisioning to move for:

{
  "brand": "joyent",
  "dataset_uuid": "84cb7edc-3f22-11e2-8a2a-3f2a7b148699",
  "nics": [
    {
      "nic_tag": "admin",
      "ip": "10.100.0.16",
      "netmask": "255.255.0.0",
      "gateway": "10.100.0.1"
    }
  ]
}

Provisioning works, though without SSH access, with "brand":"joyent-minimal":

{
  "brand": "joyent-minimal",
  "dataset_uuid": "84cb7edc-3f22-11e2-8a2a-3f2a7b148699",
  "nics": [
    {
      "nic_tag": "admin",
      "ip": "10.100.0.16",
      "netmask": "255.255.0.0",
      "gateway": "10.100.0.1"
    }
  ]
}

I created an "external" NIC by adding another NIC on the same subnet as the admin NIC. I changed /usbkey/config to include:

external_nic=a0:36:9f:7:a5:26
external_ip=10.100.0.14
external_netmask=255.255.0.0
external_network=...
external_gateway=10.100.0.12

Using that with "nic_tag":"external" gave me a working OS zone with SSH:

{
  "brand": "joyent",
  "dataset_uuid": "84cb7edc-3f22-11e2-8a2a-3f2a7b148699",
  "nics": [
    {
      "nic_tag": "external",
      "ip": "10.100.0.16",
      "netmask": "255.255.0.0",
      "gateway": "10.100.0.1"
    }
  ]
}

I get similar behavior on Live Image 20121115T191935Z with image UUID 84cb7edc-3f22-11e2-8a2a-3f2a7b148699 (aka base-1.8.4). The error I get in the first case is "timed out waiting for zone to transition to running". Using "brand":"joyent-minimal" or "nic_tag":"external" also works there as a workaround.

In 20130111T010112Z, using the base-1.8.1 (UUID 55330ab4-066f-11e2-bd0f-434f2462fada) image I get success and working SSH:

{
  "brand": "joyent",
  "dataset_uuid": "55330ab4-066f-11e2-bd0f-434f2462fada",
  "nics": [
    {
      "nic_tag": "admin",
      "ip": "10.100.0.17",
      "netmask": "255.255.0.0",
      "gateway": "10.100.0.1"
    }
  ]
}

I could not get 20121115T191935Z to import the base-1.8.1 image.

In 20130111T010112Z, using the base-1.8.2 (UUID ef22b182-3d7a-11e2-a7a9-af27913943e2) image gets timed out waiting for /var/svc/provisioning to move:

{
  "brand": "joyent",
  "dataset_uuid": "ef22b182-3d7a-11e2-a7a9-af27913943e2",
  "nics": [
    {
      "nic_tag": "admin",
      "ip": "10.100.0.18",
      "netmask": "255.255.0.0",
      "gateway": "10.100.0.1"
    }
  ]
}

Using "brand":"joyent-minimal" or "nic_tag":"external" gets around the issue on base-1.8.2 as well.

Is the problem in the dataset/image itself?

@mamash
Contributor

mamash commented Jan 18, 2013

No, the dataset is fine. The problem is that under (not yet fully diagnosed) network conditions, such as using an admin network (though the default 'admin' nic on VMware is fine), Cron inside the zone refuses to operate properly, which makes the zone fail its first post-provision reboot and doesn't trigger the mdata:execute service that would normally move /var/svc/provisioning to /var/svc/provision_success (and signal to vmadm that everything is done).

We're troubleshooting the Cron/audit problem on our end, though it's not of the highest priority, so may take 2-3 days more.

@mdrobnak

mdrobnak commented Jan 19, 2013

I no longer get a "failed to provision" using the 12-28-2012 image, but have seen the "island network" issue. I have found that network/routing-setup incorrectly removes the default gateway on zones that are set up using admin_nic along with DHCP for the addressing. Hopefully this helps with troubleshooting the root cause.

So for now I have a script that disables that service; after vmadm returns that the zone started, I stop and start it again. That isn't ideal, but it has worked for me.
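
For reference, checking and restoring the missing default route by hand looks roughly like this (a sketch; the gateway address is just an example from earlier payloads):

# does the zone still have a default route?
zlogin <uuid> netstat -rn

# if routing-setup removed it, put it back manually
zlogin <uuid> route add default 192.168.1.1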

@joshwilsdon
Member

joshwilsdon commented Jan 19, 2013

As far as I can tell, the problem here occurs when you have either a bad gateway or unreachable resolvers. If you have an invalid gateway but set the resolvers to 127.0.0.1 (note: currently this is also broken, but I've created an internal ticket with a patch for this problem), the VM will provision successfully. If you have valid resolvers but a bad gateway, it fails in the same fashion because it can't reach the resolvers.

Investigation is ongoing, but it seems that this is getting closer to the real problem here.

@AlainODea
Contributor

AlainODea commented Jan 19, 2013

Thank you @joshwilsdon, I think I get the problem now. I don't supply resolvers so the zone falls back to using 8.8.8.8 and 8.8.4.4 as resolvers. I wasn't able to see the attempts on the firewall because I block as much as possible in my switches first. The wisdom of my approach there is rapidly waning as it hampers observability.

It would be nice if vmadm or zoneinit checked resolvers early and failed with a specific message. I'm not sure if that is feasible.
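
Something along these lines is the kind of check I mean (a sketch only, not something vmadm or zoneinit does today; ICMP reachability is also only a rough proxy for DNS actually working):

for r in $(awk '/^nameserver/ { print $2 }' /etc/resolv.conf); do
  ping "$r" 2 > /dev/null || echo "resolver $r looks unreachable" >&2
done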

@mdrobnak

mdrobnak commented Jan 19, 2013

Hmm. Is there a way to tell the zone to procure resolver information from DHCP, like with KVM instances? Or must this be set in the JSON?

@AlainODea
Contributor

AlainODea commented Jan 20, 2013

Definitely DNS. I ran snoop 10.100.0.16 (the IP of the OS VM I am creating). It repeatedly attempts to reach Google DNS to resolve first www.joyent.com and then repeatedly attempts to resolve the zone UUID. This happens whether or not I specify resolvers in the JSON payload. I have verified that the /etc/resolv.conf within the zone for the OS VM vmadm is creating contains only the 8.8.8.8 and 8.8.4.4 Google public DNS resolvers. I have been trying to track where vmadm loses the resolvers I provide, but I have not been successful.
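
A variant of that snoop invocation that narrows to just the DNS traffic makes this easier to see (substitute your VM's IP):

snoop -r host 10.100.0.16 and port 53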

@AlainODea
Contributor

AlainODea commented Jan 20, 2013

I thought it would be extremely clever to alter /etc/resolv.conf within the image's own zone. I still get Google public DNS when I deploy, despite supplying resolvers. I am guessing vmadm is using a snapshot, so my manipulations are not able to affect the deployed OS VM. I suspect I could probably trick vmadm by creating my own snapshot, but I'd rather have a clean, non-underhanded fix. I may have to derive my own images from base, etc. to get around this.

@joshwilsdon
Member

joshwilsdon commented Jan 20, 2013

The dataset will need to be fixed before you can set any resolvers other than Google's when using the joyent brand, since it overwrites resolv.conf with the Google DNS servers when zoneinit runs. If you're able to reach the Internet so that cron can run the reboot with Google's DNS, you could try setting them after the zoneinit reboot using a user-script; otherwise you'd just have to set them manually after the provision completes, if you're using one of the newer datasets that has this problem.

Re: "I have been trying to track where vmadm loses the resolvers I provide, but I have not been successful" - it's not vmadm losing the resolvers. It's the zoneinit scripts. They're failing to pull the resolvers out of the data provided by the metadata agent (the same data you see with 'vmadm get <uuid>').

I think the problem with cron failing to run 'at' jobs when DNS is broken should still be fixed separately, though. I believe someone will be working on that early next week.
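
Manually, that would be something like this once the zone is up (a sketch; the resolver address is just an example):

zlogin <uuid> sh -c 'echo "nameserver 192.168.1.90" > /etc/resolv.conf'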

@AlainODea
Contributor

AlainODea commented Jan 20, 2013

Thank you Josh. For now I'll see about allowing DNS out from the VMs to Google public DNS. I will take a look at the zoneinit scripts to see if I can tweak them locally to get around this.

@perOeberg

perOeberg commented Jan 21, 2013

I'm following this with great interest as I'm having the same issues.

@AlainODea
Contributor

AlainODea commented Jan 21, 2013

Is the problem that it attempts mdata-get before setting up the resolvers from the JSON payload?

Looking in /var/zoneinit/includes/ I see 00-mdata.sh and 12-network.sh. It is fatal to do name-based networking before setting the resolvers. However, the Metadata API is not supposed to do networking at all, but operate through zsockets or tty to the host. Has something changed there?

How can I see where zoneinit fails? Is there an execution trace for it?

@joshwilsdon
Member

joshwilsdon commented Jan 21, 2013

Hi,

The metadata failure is just that 02-config.sh is looking at the wrong key in the 1.8.4 dataset. It's looking at sdc:resolvers, but that key has never worked. To get the metadata it should be using sdc:resolvers.0 and sdc:resolvers.1. I have created a patch for this and have been told that it will be fixed in 1.8.5.

The metadata is still read through a zsock, so there's no DNS lookup being done at that point. The problem here is that we're always setting 8.8.8.8 and 8.8.4.4 as the resolvers, and sometimes those are unreachable. When they're unreachable, they expose an additional bug in cron (which we're also tracking down) where cron does DNS lookups for at jobs. Because /var/zoneinit/includes/999-cleanup.sh uses at(1) to do the reboot, the DNS lookup problems delay the at job long enough that the provision is considered timed out.

As for logging, the only log for zoneinit is the service log in /var/svc/log/system-zoneinit:default.log in the zone.

Thanks,
Josh
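
For illustration, the kind of change described would look roughly like this inside the zone (a sketch only; the actual 02-config.sh fix may differ):

# collect the indexed resolver keys until one is missing
for i in 0 1 2 3; do
  r=$(/usr/sbin/mdata-get sdc:resolvers.${i} 2>/dev/null) || break
  echo "nameserver ${r}"
done > /etc/resolv.conf.new

# only replace resolv.conf if we actually got something back
[ -s /etc/resolv.conf.new ] && mv /etc/resolv.conf.new /etc/resolv.conf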

@mamash
Contributor

mamash commented Jan 21, 2013

On Jan 21, 2013, at 20:49, joshwilsdon notifications@github.com wrote:

The metadata failure is just that 02-config.sh is looking at the wrong key
in the 1.8.4 dataset. It's looking at sdc:resolvers but that key has never
worked. To get the metadata it should be using sdc:resolvers.0 and
sdc:resolvers.1. I have created a patch for this and have been told that
this will be fixed in 1.8.5.

Yeah, the code in 02-config.sh still reflects the situation where resolvers in metadata were broken in the platform (since a long time ago). I'll be fixing this.

The metadata is still read through a zsock so there's no DNS lookup being
done at that part. The problem here is that we're always setting 8.8.8.8
and 8.8.4.4 as the resolvers and sometimes those are unreachable. When
they're unreachable they're exposing an additional bug in cron which we're
also tracking down where cron is doing DNS lookups for at jobs.
Because /var/zoneinit/includes/999-cleanup.sh uses at(1) to do the reboot,
the DNS lookup problems are preventing the at job from running long enough
that the provision is being considered timed-out.

Josh, btw, the problem is not just at(1) but cron in general; no cron job will run in this context.

-F

@AlainODea
Contributor

AlainODea commented Jan 21, 2013

@mamash thank you for the clarification :)

I wound up allowing Google Public DNS from the research subnet. That resolved the timeout issue.

The resolvers are still 8.8.8.8 and 8.8.4.4 after zoneinit however. That's a little odd. Is that expected?

@hj1980

hj1980 commented Feb 4, 2013

Just a thought... If you are running SmartOS itself within a virtualized environment, you will need to configure its own adapter to permit promiscuous mode. It needs to be able to receive frames destined for the MAC address of the inner VM.

(I was running SmartOS within VirtualBox; turning on promiscuous mode fixed this same issue for me.)
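
For VirtualBox that's the per-adapter promiscuous-mode setting, for example (VM name and adapter number are just examples; run it with the VM powered off):

VBoxManage modifyvm "SmartOS" --nicpromisc1 allow-all

On VMware the equivalent is allowing promiscuous mode on the vSwitch or port group.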

@AlainODea
Contributor

AlainODea commented Feb 23, 2013

@hj1980 good point. I am running SmartOS on bare metal myself.

ghost assigned mamash Feb 27, 2013
