[STABILITY] workers losing ens4 network, not able to recover #4264

chefsalim · 2017-12-08T01:37:18Z

We're seeing workers in prod (and acceptance) lose the ens4 network

Dec 08 01:23:45 ip-10-0-0-242 hab[18939]: builder-worker.live(O): FATAL: Could not find network interface ens4

Checking the IP address:

root@ip-10-0-0-242:~# ip address show dev ens4 | grep 'inet '
Device "ens4" does not exist.
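
The same check can be scripted, for example as a pre-flight test before a worker starts. This is a minimal sketch using Python's standard `socket.if_nametoindex`; the `interface_exists` helper is hypothetical and not part of Habitat or the worker.

```python
import socket


def interface_exists(name: str) -> bool:
    """Return True if the OS currently exposes a network interface with this name."""
    try:
        # if_nametoindex raises OSError when no interface has the given name.
        socket.if_nametoindex(name)
    except OSError:
        return False
    return True


if __name__ == "__main__":
    # "ens4" is the interface named in the log line above.
    print("ens4 present:", interface_exists("ens4"))
```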

We need to figure out a more robust way to apply the ens4 interface.


fnichol · 2017-12-19T19:45:55Z

I'm going to look at only setting up the network namespace at worker start rather than around each build activity.

This change modifies where and when an optional network namespace is set up for use when running Studio builds. Prior to this change, a network namespace would be created just before each `studio build` invocation and destroyed immediately afterwards. Now, a namespace will be created when the worker service starts and will, by default, reuse an existing network namespace (the key signal being the presence of an `airlock-ns/` directory). The network namespace management continues to be provided by the Airlock program.

If for some reason the existing network namespace needs to be destroyed on each start/restart of the worker service, a new configuration option of:

```toml
recreate_ns_dir = true
```

can be set. The default is `false`, which will reuse any network namespace found.

Note that the failure scenarios are different, but hopefully more desirable:

* On a fresh installation, a network namespace will be created once, at boot.
* If the network namespace creation fails, this happens at worker boot and will send the service into a failed/flapping state, loudly explaining what is happening. Previously, failures of this nature would have been delayed until build time.
* If a worker is upgraded or restarted, it will reuse an existing network namespace setup.
* If the network namespace is having issues, an operator can enable network namespace recreation mode by setting `recreate_ns_dir = true` as above. The Supervisor would restart the worker service and the destroy/create logic would be fired.

Finally, note that one major failure scenario is still unaddressed: if the network interface's settings are not restored after a network namespace teardown, then re-creating the namespace will fail. The good news is that such a failure would effectively bring the worker offline, since it can't finish booting itself, and would take it out of the ready pool of workers (rather than leaving a pool of dead workers, as is the case now).

This final failure scenario can most likely be solved by restarting the host's networking service. Sadly, this failure is highly dependent on the cloud platform and operating system combination.

Closes #4264

Signed-off-by: Fletcher Nichol <fnichol@nichol.ca>
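
The start-up decision described above (reuse by default, destroy and recreate when `recreate_ns_dir = true`, create on a fresh install) can be sketched roughly as follows. This is an illustration only, not the worker's actual code: `WorkerConfig` and `plan_namespace_setup` are hypothetical names, and the function only returns action labels because the real Airlock invocations are not shown in this thread.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List


@dataclass
class WorkerConfig:
    # Mirrors the `recreate_ns_dir` option; the default reuses an existing namespace.
    recreate_ns_dir: bool = False


def plan_namespace_setup(ns_dir: Path, config: WorkerConfig) -> List[str]:
    """Return the ordered namespace actions to take when the worker starts.

    The presence of the namespace directory (e.g. `airlock-ns/`) is treated
    as the signal that a namespace already exists.
    """
    exists = ns_dir.is_dir()
    if exists and not config.recreate_ns_dir:
        return ["reuse"]
    if exists:
        # Operator asked for a fresh namespace on every start/restart.
        return ["destroy", "create"]
    # Fresh installation: nothing to reuse or destroy.
    return ["create"]


if __name__ == "__main__":
    print(plan_namespace_setup(Path("airlock-ns"), WorkerConfig()))
```

If any planned action fails, the point of the change is that the failure surfaces at worker start rather than at build time.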

chefsalim added A-builder labels Dec 8, 2017

chefsalim added the C-STABILITY label Dec 8, 2017

chefsalim assigned fnichol Dec 11, 2017

fnichol mentioned this issue Dec 20, 2017

[builder-worker] Create network namespace only at server boot. #4353

Merged

thesentinels closed this as completed in #4353 Jan 17, 2018

eeyun added Type:Stability and removed C-STABILITY labels Mar 13, 2018

christophermaier added Type: Bug and removed C-bug labels Jul 24, 2020
