Rewrite Allocator #2516

dperny · 2018-02-16T21:36:11Z

The Allocator code is bad. I'd hazard to say it's irredeemably bad. It's a source of constant bugs and breakages. Things are allocated twice or not allocated at all. The issues don't seem to be present (usually) in the levels deeper than swarmkit (libnetwork). Instead, libnetwork tends to give garbage responses when given garbage input.

Some examples of problems:

A bunch of code for handling multiple allocators, even though only 1 allocator exists and no other allocators are anywhere in the roadmap
Responsibilities muddled. All of the network allocation methods are just implemented straight on the Allocator object, even thought presence of a allocActor type in the Run function suggests they should be separate.
Code is under-commented and tangled, making parsing it often inscrutable. The tangled nature of the code makes piece-wise refactoring almost impossible, because of the fragile nature of the code.
The network allocator keeps a local state from which it makes decisions, which is supposed to be a mirror of the state as committed into raft. However, logic errors can cause this internal state to become irrecoverably inconsistent. If we're lucky, a leadership change fixes this before it becomes a problem. If we're not, bad local state is used to make bad decisions and create inconsistent distributed state. We need to reduce the footprint of the local state as much as feasible.
The allocator requires the local state to be initialized from the distributed state before performing any new allocations. However, the logic for allocation and initialization follow the same code paths, and use a boolean flag to separate them. Errors in the initialization logic cause allocations and deallocations to occur before the local state is fully initialized, resulting in duplicate IP addresses.

We should rewrite the whole thing from scratch. It's not a small project, and there's a lot of risk in a rewrite versus a refactoring. However, a clean slate would let us escape the most ingrained design flaws.

dperny · 2018-02-16T21:47:49Z

Another example. I happened to notice this while going through the code:

https://github.com/docker/swarmkit/blob/2543c41ec5bb2646d2cebba6869016efe484d687/manager/allocator/allocator.go#L97-L109

You can't take a pointer to a loop variable. the same memory location is reused for each iteration. The value pointed to changes every time. This was a landmine waiting to happen.

anshulpundir · 2018-02-21T01:29:30Z

Can you please include code snippets for the last two points in the problem statement. I tried reading the code but it wasn't obvious to me. @dperny

dperny · 2018-02-21T21:52:49Z

The network allocator, like, the IPAM stuff, doesn't live in the raft store. It has its own internal state. Before you can allocate new IP addresses, you have to "allocate" all of the old IP addresses that are assigned in the raft store. This involves iterating through every object (node, network, service, task) and "allocating" their assigned IP addresses. However, it's possible for an object to be committed only partially allocated, where it has some IP addresses assigned and some not. This can happen, for example, when a spec has been updated, but the allocator hasn't run on the object yet.

In the cnmallocator code, eventually you get to something like this, ipam.RequestAddress (in all code paths, slightly different for each object type).

https://github.com/docker/swarmkit/blob/49a9d7f6ba3c1925262641e694c18eb43575f74b/manager/allocator/cnmallocator/networkallocator.go#L620

The difference between initialization and new allocation is whether or not addr is empty or an ip address. We have tons of bugs where we're passing an empty IP address to this method before we've fully iterated through every object and populated the ipam with the existing IPs. There's nothing stopping this from happening, because this is the same code path for new and old allocations. If you have an object in the above-mentioned half-allocated state, it's really tricky to get this right.

This is in doNetworkInit. The magic boolean flag on each method call control whether or not we're initializing the state, or we're allocating new things. This boolean flag gets passed down through methods and sub-methods, and switches off whether or not we perform allocation.

https://github.com/docker/swarmkit/blob/49a9d7f6ba3c1925262641e694c18eb43575f74b/manager/allocator/network.go#L155-L175

dperny · 2018-08-21T14:31:52Z

This is done as part of #2615, and work on it is tracked on the new-allocator branch in this repo.

To complete the new allocator rewrite, we need to merge #2686, which removes the old allocator.

Additionally, there have been changes to the old allocator since the rewrite began, which must be "forward-ported" to the new one. These are:

Only allocate attachments needed by nodes #2725 - Only allocate attachments needed by nodes
Global Default Address Pool feature support #2714 - Global Default Address Pool feature support

olljanat · 2018-11-28T18:59:18Z

@dperny btw #2714 and #2725 looks to be merged (so they should me marked as done).

Is there any time plan for this one?

dperny added area/networking exp/expert kind/enhancement kind/proposal labels Feb 16, 2018

dperny mentioned this issue Feb 16, 2018

Remove Allocator task voting #2517

Closed

dperny mentioned this issue Mar 28, 2018

Add new PortAllocator #2579

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rewrite Allocator #2516

Rewrite Allocator #2516

dperny commented Feb 16, 2018

dperny commented Feb 16, 2018

anshulpundir commented Feb 21, 2018

dperny commented Feb 21, 2018

dperny commented Aug 21, 2018

olljanat commented Nov 28, 2018

Rewrite Allocator #2516

Rewrite Allocator #2516

Comments

dperny commented Feb 16, 2018

dperny commented Feb 16, 2018

anshulpundir commented Feb 21, 2018

dperny commented Feb 21, 2018

dperny commented Aug 21, 2018

olljanat commented Nov 28, 2018