New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move load balancer sandbox creation/deletion into libnetwork #35422

Merged
merged 3 commits into from Dec 1, 2017

Conversation

@pradipd
Contributor

pradipd commented Nov 7, 2017

- What I did
Fix for #35310

- How I did it
2 changes:

  1. Move sandbox creation/deletion into libnetwork. (docker/libnetwork#2011)
  2. Limit the number of scenarios in which we create the per-network lb sandbox. We only create the per network lb sandbox for:
    a. ingress network (both linux/windows)
    b. any overlay network on windows.

- How to verify it
Reran repro (described at the end of #35310)
UT/IT/PY

- Description for the changelog
Move load balancer sandbox creation/deletion into libnetwork

- A picture of a cute animal (not mandatory but encouraged)
kittens.

@pradipd pradipd requested review from dnephin and vdemeester as code owners Nov 7, 2017

@pradipd pradipd changed the title from Lbfix to Move load balancer sandbox creation/deletion into libnetwork Nov 7, 2017

@thaJeztah

This comment has been minimized.

Show comment
Hide comment
@thaJeztah

thaJeztah Nov 7, 2017

Member

ping @mavenugo PTAL

Member

thaJeztah commented Nov 7, 2017

ping @mavenugo PTAL

func (daemon *Daemon) deleteNetwork(networkID string, dynamic bool) error {
nw, err := daemon.FindNetwork(networkID)
if err != nil {
return err
}
if nw.Info().Ingress() {

This comment has been minimized.

@mavenugo

mavenugo Nov 7, 2017

Contributor

Can you pls explain why we need this ?

@mavenugo

mavenugo Nov 7, 2017

Contributor

Can you pls explain why we need this ?

This comment has been minimized.

@pradipd

pradipd Nov 7, 2017

Contributor

Design wise, I treated ingress like other overlay networks with a load balancer sandbox.
It requires the same steps to setup and remove the network (i.e. we must create the ingress-sbox, the ingress-endpoint, etc).
However, the ingress sandbox's lifetime is slightly different than the other networks.
The ingress sandbox exists even when there are no tasks connected to it.
But, other networks create the load balancer sandbox on the fly when the first task is scheduled, and remove it when the last task goes away.

Currently, when a task is removed, we delete all the networks it is connect to.
For ingress, this always returns "ActiveEndpoints" error.
The ingress sandbox and network are actually deleted in releaseIngress().

Now, by moving the deletion of the lb sandbox into libnetwork, libnetwork needs to figure out when to delete the lb sandbox (https://github.com/docker/libnetwork/pull/2011/files#diff-cba6b1b3488fe48ebbef7e1821a3745dR962). Libnetwork, deletes the network when the only endpoint left is the lb endpoint.
But, the ingress sandbox's exists even when there are no tasks left. So, we don't delete the ingress network here and only in releaseIngress().

One alternative I considered was to make the ingress's sandbox lifetime behave exactly like other overlay sandbox's. I.e. only create the ingress sandbox when a task is scheduled and remove it when the last task is gone. That would remove these special cases for ingress. However, I felt like that would be a bigger design change with more ramifications.

@pradipd

pradipd Nov 7, 2017

Contributor

Design wise, I treated ingress like other overlay networks with a load balancer sandbox.
It requires the same steps to setup and remove the network (i.e. we must create the ingress-sbox, the ingress-endpoint, etc).
However, the ingress sandbox's lifetime is slightly different than the other networks.
The ingress sandbox exists even when there are no tasks connected to it.
But, other networks create the load balancer sandbox on the fly when the first task is scheduled, and remove it when the last task goes away.

Currently, when a task is removed, we delete all the networks it is connect to.
For ingress, this always returns "ActiveEndpoints" error.
The ingress sandbox and network are actually deleted in releaseIngress().

Now, by moving the deletion of the lb sandbox into libnetwork, libnetwork needs to figure out when to delete the lb sandbox (https://github.com/docker/libnetwork/pull/2011/files#diff-cba6b1b3488fe48ebbef7e1821a3745dR962). Libnetwork, deletes the network when the only endpoint left is the lb endpoint.
But, the ingress sandbox's exists even when there are no tasks left. So, we don't delete the ingress network here and only in releaseIngress().

One alternative I considered was to make the ingress's sandbox lifetime behave exactly like other overlay sandbox's. I.e. only create the ingress sandbox when a task is scheduled and remove it when the last task is gone. That would remove these special cases for ingress. However, I felt like that would be a bigger design change with more ramifications.

This comment has been minimized.

@mavenugo

mavenugo Nov 8, 2017

Contributor

The suggestion to change the ingress sandbox lifetime wont work, because the purpose of ingress network was to provide the overlay connectivity from any node for any service with a task attached to any other node.

Also, technically the ingress-sbox is similar to the lb-sandbox that you have introduced. But the management of this sandbox and the ingress-endpoint is handled outside of your existing changes (I may have to spend some more time to understand if that is not the case). PS: I didnt review your previous changes and that might answer the lack of my knowledge in this area of the code.
I will comment more on this once I review the ingress-sbox management more thoroughly.

@mavenugo

mavenugo Nov 8, 2017

Contributor

The suggestion to change the ingress sandbox lifetime wont work, because the purpose of ingress network was to provide the overlay connectivity from any node for any service with a task attached to any other node.

Also, technically the ingress-sbox is similar to the lb-sandbox that you have introduced. But the management of this sandbox and the ingress-endpoint is handled outside of your existing changes (I may have to spend some more time to understand if that is not the case). PS: I didnt review your previous changes and that might answer the lack of my knowledge in this area of the code.
I will comment more on this once I review the ingress-sbox management more thoroughly.

This comment has been minimized.

@mavenugo

mavenugo Dec 1, 2017

Contributor

@pradipd okay. the changes makes sense.

@mavenugo

mavenugo Dec 1, 2017

Contributor

@pradipd okay. the changes makes sense.

@@ -358,6 +317,15 @@ func (daemon *Daemon) createNetwork(create types.NetworkCreateRequest, id string
nwOptions = append(nwOptions, libnetwork.NetworkOptionConfigFrom(create.ConfigFrom.Network))
}
if agent && driver == "overlay" && (create.Ingress || runtime.GOOS == "windows") {

This comment has been minimized.

@mavenugo

mavenugo Nov 7, 2017

Contributor

Why is Ingress network special here ?

@mavenugo

mavenugo Nov 7, 2017

Contributor

Why is Ingress network special here ?

This comment has been minimized.

@pradipd

pradipd Nov 7, 2017

Contributor

As mentioned above, I treated ingress like other overlay networks with a load balancer sandbox since setting up and deleting the load balancer sandbox is the same as other networks.

So, on linux, if we are creating the ingress network, we go ahead and create the load balancer sandbox. For all other overlay networks on linux, we don't do this...yet. Until linux takes advantage of the load balancer sandboxes.

@pradipd

pradipd Nov 7, 2017

Contributor

As mentioned above, I treated ingress like other overlay networks with a load balancer sandbox since setting up and deleting the load balancer sandbox is the same as other networks.

So, on linux, if we are creating the ingress network, we go ahead and create the load balancer sandbox. For all other overlay networks on linux, we don't do this...yet. Until linux takes advantage of the load balancer sandboxes.

This comment has been minimized.

@mavenugo

mavenugo Nov 8, 2017

Contributor

Same comment as above. I have to make sure this changes doesnt adversely impact Linux ingress network since the ingress-sbox is already managed in different code path. So, I would like to see if we can narrow down the scope of this change only to Windows without any impacts to Linux. We can remove the windows check once Linux makes full use of the LB-Sandbox concept.

But again, I have to spend a bit more time to understand the current changes made to ingress-sbox management in your previous ILB/ELB changes in to give a strong recommendation.

@mavenugo

mavenugo Nov 8, 2017

Contributor

Same comment as above. I have to make sure this changes doesnt adversely impact Linux ingress network since the ingress-sbox is already managed in different code path. So, I would like to see if we can narrow down the scope of this change only to Windows without any impacts to Linux. We can remove the windows check once Linux makes full use of the LB-Sandbox concept.

But again, I have to spend a bit more time to understand the current changes made to ingress-sbox management in your previous ILB/ELB changes in to give a strong recommendation.

This comment has been minimized.

@pradipd

pradipd Nov 8, 2017

Contributor

I can limit the change to windows even more if you prefer. It would just mean duplicate code for setting up the sandbox for ingress and other overlay networks on windows. But, it would be less risky for linux.
Here is the previous code to setup the ingress-sbox (https://github.com/moby/moby/blob/9be245f438f9fb2eaeb7891673b16aed9262a192/daemon/network.go)

And here is the current code in libnetwork PR (https://github.com/docker/libnetwork/pull/2011/files#diff-cba6b1b3488fe48ebbef7e1821a3745dR2057)

@pradipd

pradipd Nov 8, 2017

Contributor

I can limit the change to windows even more if you prefer. It would just mean duplicate code for setting up the sandbox for ingress and other overlay networks on windows. But, it would be less risky for linux.
Here is the previous code to setup the ingress-sbox (https://github.com/moby/moby/blob/9be245f438f9fb2eaeb7891673b16aed9262a192/daemon/network.go)

And here is the current code in libnetwork PR (https://github.com/docker/libnetwork/pull/2011/files#diff-cba6b1b3488fe48ebbef7e1821a3745dR2057)

This comment has been minimized.

@mavenugo

mavenugo Dec 1, 2017

Contributor

Thanks for the clarification. I think this change is fine as is.

@mavenugo

mavenugo Dec 1, 2017

Contributor

Thanks for the clarification. I think this change is fine as is.

@coolljt0725

This comment has been minimized.

Show comment
Hide comment
@coolljt0725
Contributor

coolljt0725 commented Nov 22, 2017

@mavenugo

This comment has been minimized.

Show comment
Hide comment
@mavenugo

mavenugo Nov 22, 2017

Contributor

@pradipd apologies on the delay. I didnt find the time to evaluate the impact on the ingress changes (both the current one and the one introduced prior to this change). We can get this and the libnetwork PR moving once I get to complete that analysis... In the meantime if someone else can do the same, that might accelerate this review.

Contributor

mavenugo commented Nov 22, 2017

@pradipd apologies on the delay. I didnt find the time to evaluate the impact on the ingress changes (both the current one and the one introduced prior to this change). We can get this and the libnetwork PR moving once I get to complete that analysis... In the meantime if someone else can do the same, that might accelerate this review.

@fcrisciani

This comment has been minimized.

Show comment
Hide comment
@fcrisciani

fcrisciani Nov 29, 2017

Contributor

@pradipd can you revendor the latest libnetwork?

Contributor

fcrisciani commented Nov 29, 2017

@pradipd can you revendor the latest libnetwork?

if agent && driver == "overlay" && (create.Ingress || runtime.GOOS == "windows") {
nodeIP, exists := daemon.GetAttachmentStore().GetIPForNetwork(id)
if !exists {
return nil, fmt.Errorf("Failed to find a load balancer IP to use for network: %v", id)

This comment has been minimized.

@fcrisciani

fcrisciani Nov 29, 2017

Contributor

we should consider also the case where the error is associated with the ingress network else this message can be misleading

@fcrisciani

fcrisciani Nov 29, 2017

Contributor

we should consider also the case where the error is associated with the ingress network else this message can be misleading

This comment has been minimized.

@fcrisciani

fcrisciani Nov 29, 2017

Contributor

actually considering that the ingress is actually an instance of a load balancer sandbox, the id should give enough information on what to look for

@fcrisciani

fcrisciani Nov 29, 2017

Contributor

actually considering that the ingress is actually an instance of a load balancer sandbox, the id should give enough information on what to look for

@pradipd

This comment has been minimized.

Show comment
Hide comment
@pradipd

pradipd Nov 29, 2017

Contributor

Will try and get it today.

Contributor

pradipd commented Nov 29, 2017

Will try and get it today.

pradipd added some commits Nov 30, 2017

vndr libnetwork 64ae58878fc8f95e4a167499d654e13fa36abdc7
Signed-off-by: Pradip Dhara <pradipd@microsoft.com>
docker changes corresponding to libnetwork changes.
Signed-off-by: Pradip Dhara <pradipd@microsoft.com>
Adding test for creating service multiple times.
Signed-off-by: Pradip Dhara <pradipd@microsoft.com>
@pradipd

This comment has been minimized.

Show comment
Hide comment
@pradipd

pradipd Nov 30, 2017

Contributor

@thaJeztah @mavenugo @fcrisciani Can someone restart/retry the "z" and "experimental" builds?

Contributor

pradipd commented Nov 30, 2017

@thaJeztah @mavenugo @fcrisciani Can someone restart/retry the "z" and "experimental" builds?

@mavenugo

This comment has been minimized.

Show comment
Hide comment
@mavenugo

mavenugo Nov 30, 2017

Contributor

@pradipd i tried. it is still failing. we have to look into this.

Contributor

mavenugo commented Nov 30, 2017

@pradipd i tried. it is still failing. we have to look into this.

@mavenugo

LGTM

@pradipd

This comment has been minimized.

Show comment
Hide comment
@pradipd

pradipd Dec 1, 2017

Contributor

@thaJeztah @mavenugo @fcrisciani Can someone restart/retry the "experimental" build?

Contributor

pradipd commented Dec 1, 2017

@thaJeztah @mavenugo @fcrisciani Can someone restart/retry the "experimental" build?

@fcrisciani

LGTM

@mavenugo mavenugo merged commit 4bb2c24 into moby:master Dec 1, 2017

7 checks passed

dco-signed All commits are signed
experimental Jenkins build Docker-PRs-experimental 38134 has succeeded
Details
janky Jenkins build Docker-PRs 46827 has succeeded
Details
powerpc Jenkins build Docker-PRs-powerpc 7237 has succeeded
Details
vendor Jenkins build Docker-PRs-vendor 3975 has succeeded
Details
windowsRS1 Jenkins build Docker-PRs-WoW-RS1 18382 has succeeded
Details
z Jenkins build Docker-PRs-s390x 7056 has succeeded
Details

@pradipd pradipd deleted the pradipd:lbfix branch Dec 1, 2017

@pradipd

This comment has been minimized.

Show comment
Hide comment
@pradipd

pradipd Dec 1, 2017

Contributor

Thanks!

Contributor

pradipd commented Dec 1, 2017

Thanks!

@swaroopkundeti

This comment has been minimized.

Show comment
Hide comment
@swaroopkundeti

swaroopkundeti Jan 8, 2018

This is still persisting on 17.12 version

I get

Unable to complete atomic operation, key modified
--

swaroopkundeti commented Jan 8, 2018

This is still persisting on 17.12 version

I get

Unable to complete atomic operation, key modified
--

ctelfer added a commit to ctelfer/moby that referenced this pull request Mar 8, 2018

Add test for ingress removal on service removal
The commit moby#35422 had the result of
accidentally causing the removal of the ingress network when the
last member of a service left the network.  This did not appear
in swarm instances because the swarm manager would still maintain
and return cluster state about the network even though it had
removed its sandbox and endpoint.  This test verifies that after a
service gets added and removed that the ingress sandbox remains
in a functional state.

Signed-off-by: Chris Telfer <ctelfer@docker.com>

ctelfer added a commit to ctelfer/moby that referenced this pull request Mar 8, 2018

Add test for ingress removal on service removal
The commit moby#35422 had the result of
accidentally causing the removal of the ingress network when the
last member of a service left the network.  This did not appear
in swarm instances because the swarm manager would still maintain
and return cluster state about the network even though it had
removed its sandbox and endpoint.  This test verifies that after a
service gets added and removed that the ingress sandbox remains
in a functional state.

Signed-off-by: Chris Telfer <ctelfer@docker.com>

ctelfer added a commit to ctelfer/moby that referenced this pull request Mar 12, 2018

Add test for ingress removal on service removal
The commit moby#35422 had the result of
accidentally causing the removal of the ingress network when the
last member of a service left the network.  This did not appear
in swarm instances because the swarm manager would still maintain
and return cluster state about the network even though it had
removed its sandbox and endpoint.  This test verifies that after a
service gets added and removed that the ingress sandbox remains
in a functional state.

Signed-off-by: Chris Telfer <ctelfer@docker.com>

thaJeztah added a commit to thaJeztah/docker-ce that referenced this pull request Mar 14, 2018

Add test for ingress removal on service removal
The commit moby/moby#35422 had the result of
accidentally causing the removal of the ingress network when the
last member of a service left the network.  This did not appear
in swarm instances because the swarm manager would still maintain
and return cluster state about the network even though it had
removed its sandbox and endpoint.  This test verifies that after a
service gets added and removed that the ingress sandbox remains
in a functional state.

Signed-off-by: Chris Telfer <ctelfer@docker.com>
(cherry picked from commit 805b6a7)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>

docker-jenkins pushed a commit to docker/docker-ce that referenced this pull request Mar 14, 2018

Add test for ingress removal on service removal
The commit moby/moby#35422 had the result of
accidentally causing the removal of the ingress network when the
last member of a service left the network.  This did not appear
in swarm instances because the swarm manager would still maintain
and return cluster state about the network even though it had
removed its sandbox and endpoint.  This test verifies that after a
service gets added and removed that the ingress sandbox remains
in a functional state.

Signed-off-by: Chris Telfer <ctelfer@docker.com>
Upstream-commit: 805b6a7
Component: engine
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment