[STABILITY] workers losing ens4 network, not able to recover #4264

Closed
chefsalim opened this issue Dec 8, 2017 · 1 comment · Fixed by #4353
Labels: Type: Bug, Type:Stability

Comments

@chefsalim
Contributor

We're seeing workers in prod (and acceptance) lose the ens4 network interface:

Dec 08 01:23:45 ip-10-0-0-242 hab[18939]: builder-worker.live(O): FATAL: Could not find network interface ens4

Checking the IP address:

root@ip-10-0-0-242:~# ip address show dev ens4 | grep 'inet '
Device "ens4" does not exist.

We need to figure out a more robust way to apply the ens4 interface.
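
As a rough illustration of the check that fails in the log above, here is a minimal Python sketch (not the worker's actual code). It assumes Linux, where the interfaces visible to the current network namespace are listed under `/sys/class/net`. The function names `interface_exists` and `require_interface` are hypothetical, and `sysfs_root` is a parameter only so the check can be exercised against a temporary directory.

```python
import os


def interface_exists(name: str, sysfs_root: str = "/sys/class/net") -> bool:
    """Return True if an entry for `name` exists under the sysfs net directory.

    Assumption: on Linux, each visible network interface has an entry here.
    """
    return os.path.exists(os.path.join(sysfs_root, name))


def require_interface(name: str, sysfs_root: str = "/sys/class/net") -> None:
    """Raise with a message like the one in the log if `name` is missing."""
    if not interface_exists(name, sysfs_root):
        raise RuntimeError(f"Could not find network interface {name}")
```

Calling `require_interface("ens4")` on an affected worker would be expected to raise, matching the `ip address show dev ens4` result above.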

@fnichol
Collaborator

fnichol commented Dec 19, 2017

I'm going to look at only setting up the network namespace at worker start rather than around each build activity.

fnichol added a commit that referenced this issue Dec 20, 2017
This change modifies where and when an optional network namespace is set
up for use when running Studio builds.

Prior to this change a network namespace would be created just before
each `studio build` invocation and destroyed immediately afterwards.
Now, a namespace will be created when the worker service starts and
will, by default, reuse an existing network namespace (the key signal
being the presence of an `airlock-ns/` directory). The network namespace
management continues to be provided by the Airlock program.

If for some reason the existing network namespace needs to be destroyed on
each start/restart of the worker service, a new configuration option of:

```toml
recreate_ns_dir = true
```

can be set. The default is `false`, which reuses any network namespace
found.
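
The start-time decision described above could be sketched roughly as follows. This is an illustrative Python sketch, not the worker's actual implementation: `NsAction` and `plan_namespace` are hypothetical names, and the only inputs taken from the commit message are the presence of the `airlock-ns/` directory and the `recreate_ns_dir` option.

```python
import os
from enum import Enum


class NsAction(Enum):
    CREATE = "create"      # no namespace directory yet: create one
    REUSE = "reuse"        # directory present and recreate_ns_dir is false
    RECREATE = "recreate"  # directory present and recreate_ns_dir is true


def plan_namespace(ns_dir: str, recreate_ns_dir: bool = False) -> NsAction:
    """Decide what to do with the network namespace at worker start.

    The presence of `ns_dir` is treated as the signal that a namespace
    already exists, as described in the commit message.
    """
    if not os.path.isdir(ns_dir):
        return NsAction.CREATE
    if recreate_ns_dir:
        return NsAction.RECREATE
    return NsAction.REUSE
```

With the default `recreate_ns_dir = false`, an existing directory always yields `REUSE`, so a restart does not touch the host interface.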

Note that the failure scenarios are different, but hopefully more
desirable:

* On a fresh installation, a network namespace will be created once, at
boot.
* If the network namespace creation fails, this happens at worker boot
and will send the service into a failed/flapping state, loudly explaining
what is happening. Previously, failures of this nature would have been
delayed until build time.
* If a worker is upgraded or restarted, it will reuse an existing
network namespace setup.
* If the network namespace is having issues, an operator can enable
network namespace recreation mode by setting the `recreate_ns_dir =
true` as above. The Supervisor would restart a worker service and the
destroy/create logic would be fired.
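
The fail-at-boot behavior in the list above can be sketched as follows. This is a hedged Python illustration only: `run_worker`, `NamespaceSetupError`, and the two callables are hypothetical, standing in for whatever the worker and Airlock actually do.

```python
import sys


class NamespaceSetupError(Exception):
    """Hypothetical error raised when namespace setup fails."""


def run_worker(setup_namespace, accept_builds, log=sys.stderr) -> int:
    """Set up the namespace once at start; only then begin accepting builds.

    Returns a process-style exit code: 0 on a clean run, 1 if setup failed.
    """
    try:
        setup_namespace()
    except NamespaceSetupError as err:
        # Fail loudly at boot instead of deferring the failure to build time.
        print(f"FATAL: network namespace setup failed: {err}", file=log)
        return 1
    accept_builds()
    return 0
```

Because `accept_builds` is never reached when setup fails, a worker in this state would not take jobs, which is the behavior the commit message is aiming for.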

Finally, note that one major failure scenario is still unaddressed: if
the network interface's settings are not restored after a network
namespace teardown, then re-creating the namespace will fail. The good
news is that such a failure would effectively bring the worker offline
since it can't finish booting itself and would put itself out of the
ready pool of workers (rather than leaving a pool of dead workers as is
the case now). This final failure scenario can most likely be solved by
restarting the host's networking service. Sadly, this failure is highly
dependent on the cloud platform and operating system combination.

Closes #4264

Signed-off-by: Fletcher Nichol <fnichol@nichol.ca>
fnichol added a commit that referenced this issue Dec 21, 2017
fnichol added a commit that referenced this issue Dec 22, 2017
fnichol added a commit that referenced this issue Jan 17, 2018
fnichol added a commit that referenced this issue Jan 17, 2018
@christophermaier christophermaier added Type: Bug Issues that describe broken functionality and removed C-bug labels Jul 24, 2020