[STABILITY] workers losing ens4 network, not able to recover #4264
I'm going to look at only setting up the network namespace at worker start rather than around each build activity.
fnichol added a commit that referenced this issue on Dec 20, 2017
This change modifies where and when an optional network namespace is set up for use when running Studio builds. Prior to this change, a network namespace would be created just before each `studio build` invocation and destroyed immediately afterwards. Now, a namespace will be created when the worker service starts and will, by default, reuse an existing network namespace (the key signal being the presence of an `airlock-ns/` directory). The network namespace management continues to be provided by the Airlock program.

If for some reason the existing network namespace needs to be destroyed on each start/restart of the worker service, a new configuration option of:

```toml
recreate_ns_dir = true
```

can be set. The default is `false`, which will reuse any network namespaces found.

Note that the failure scenarios are different, but hopefully more desirable:

* On a fresh installation, a network namespace will be created once, at boot.
* If the network namespace creation fails, this happens at worker boot and will send the service into a failed/flapping state, loudly explaining what is happening. Previously, failures of this nature would have been delayed until build time.
* If a worker is upgraded or restarted, it will reuse an existing network namespace setup.
* If the network namespace is having issues, an operator can enable network namespace recreation mode by setting `recreate_ns_dir = true` as above. The Supervisor would restart the worker service and the destroy/create logic would be fired.

Finally, note that one major failure scenario is still unaddressed: if the network interface's settings are not restored after a network namespace teardown, then re-creating the namespace will fail. The good news is that such a failure would effectively bring the worker offline, since it can't finish booting itself, and would take it out of the ready pool of workers (rather than leaving a pool of dead workers, as is the case now).

This final failure scenario can most likely be solved by restarting the host's networking service. Sadly, this failure is highly dependent on the cloud platform and operating system combination.

Closes #4264

Signed-off-by: Fletcher Nichol <fnichol@nichol.ca>
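The start-up decision described in the commit message can be sketched roughly as follows. This is an illustrative Python model only, not the worker's actual code: the function name `plan_namespace_action` and the `NsAction` enum are hypothetical, and the only details taken from the commit message are the `airlock-ns/` directory check and the `recreate_ns_dir` option.

```python
from enum import Enum
from pathlib import Path


class NsAction(Enum):
    """Hypothetical labels for what the worker does at start-up."""
    CREATE = "create"      # no namespace directory found: create one
    REUSE = "reuse"        # directory found and recreation not requested
    RECREATE = "recreate"  # directory found, operator asked for destroy/create


def plan_namespace_action(ns_dir: Path, recreate_ns_dir: bool = False) -> NsAction:
    """Decide what to do with the network namespace when the worker starts.

    `ns_dir` stands in for the `airlock-ns/` directory, whose presence is
    the signal that a namespace already exists.
    """
    if not ns_dir.is_dir():
        # Fresh installation: the namespace is created once, at boot.
        return NsAction.CREATE
    if recreate_ns_dir:
        # Operator opted in via `recreate_ns_dir = true`.
        return NsAction.RECREATE
    # Default: reuse whatever namespace setup is already there.
    return NsAction.REUSE


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        ns_dir = Path(tmp) / "airlock-ns"
        print(plan_namespace_action(ns_dir).value)
        ns_dir.mkdir()
        print(plan_namespace_action(ns_dir).value)
        print(plan_namespace_action(ns_dir, recreate_ns_dir=True).value)
```

Keeping the decision separate from the actual create/destroy calls makes the three cases easy to check without touching the host's networking.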
fnichol added a commit that referenced this issue on Dec 21, 2017
fnichol added a commit that referenced this issue on Dec 22, 2017
fnichol added a commit that referenced this issue on Jan 17, 2018
christophermaier added the Type: Bug label (issues that describe broken functionality) and removed the C-bug label on Jul 24, 2020
We're seeing workers in prod (and acceptance) lose the ens4 network.
Checking the IP address:
We need to figure out a more robust way to apply the ens4 interface.
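As a minimal sketch of one way to spot an affected worker, assuming Python is available on the host and that the symptom shows up as `ens4` no longer being listed among the interfaces visible to the host, something like the following could be used. The helper name `interface_present` is hypothetical. It only checks that the interface exists, not that it has an address assigned.

```python
import socket
from typing import Iterable, Optional


def interface_present(name: str, names: Optional[Iterable[str]] = None) -> bool:
    """Return True if `name` is among the visible network interfaces.

    `names` can be supplied for testing; by default the list comes from
    socket.if_nameindex(), which returns (index, name) pairs for the
    interfaces visible in the current network namespace.
    """
    if names is None:
        names = [if_name for _, if_name in socket.if_nameindex()]
    return name in set(names)


if __name__ == "__main__":
    status = "present" if interface_present("ens4") else "missing"
    print(f"ens4 is {status} in this network namespace")
```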