
TLA+ model for SwarmKit tasks #2613

Merged
merged 1 commit into from Jun 7, 2018

Conversation

Contributor

@talex5 talex5 commented Apr 20, 2018

- What I did

I want to learn about how SwarmKit works, and extend its documentation. As a first step, I've been reading through the docs in the design directory and writing a more formal version of them. If someone familiar with SwarmKit has time to read over and correct my interpretation, that would be great. I marked with XXX things I was particularly unsure about. For example, the docs say very little about how a service is removed.

To make it more precise, I've written the document using TLA+ notation, but hopefully it's still readable to people not familiar with that.

So far, this is based only on reading the design docs. I plan to start reading through the Go code soon, but I wanted to get some early feedback first. Once it's correct, it could be added to the existing documentation.

- How to test it

It's easiest to read the PDF rendering of the spec.

You can also load the spec into the TLA toolbox and run the model checker on it to test the claimed properties. The document explains how to do that.

- Description for the changelog

Add a TLA+ spec for SwarmKit's task model


codecov bot commented Apr 20, 2018

Codecov Report

Merging #2613 into master will decrease coverage by 0.48%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #2613      +/-   ##
==========================================
- Coverage   62.31%   61.82%   -0.49%     
==========================================
  Files         134      134              
  Lines       21739    21821      +82     
==========================================
- Hits        13546    13491      -55     
- Misses       6740     6892     +152     
+ Partials     1453     1438      -15

Collaborator

@dperny dperny left a comment


Hopefully these explanations will make some things clearer.

(* This has every field mentioned in `task_model.md' except for `spec', which
it doesn't seem to use for anything.

XXX: can desired_state be any state, or just some (e.g. {ready, running, shutdown, remove})? *)
Collaborator

I believe that there are no guarantees about this, but the current behavior should be only running, shutdown, and remove. I have also envisioned that one day for single-run tasks, the desired state might be completed. So to model accurately, it may be best to model the desired state as any possible state. And if that doesn't work, then we can constrain states as needed.
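As a rough sketch of that suggestion in the spec's notation (the names `TaskState` and `desired_state` are assumptions used only for illustration; the actual spec may use different identifiers):

(* Assumed for illustration: all task states listed in `task_model.md',
   declared elsewhere as constants or model values. *)
TaskState == { new, pending, assigned, accepted, preparing, ready, starting,
               running, complete, shutdown, failed, rejected, remove, orphaned }

(* Per the comment above: don't constrain the desired state to a subset;
   allow any state, and tighten later if needed. *)
DesiredStateOK == \A t \in tasks : t.desired_state \in TaskState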

\/ CreateService
\/ UpdateService

(* XXX: what about removing a service? (not described in any detail in the design docs) *)
Collaborator

Removing a service is possible. The original way it was done was to delete the service object. This cascaded to the Orchestrator, which saw that the service was removed and deleted all of its tasks. The deletion of the tasks caused the Dispatcher to send the workers an updated set of Assignments which did not include the deleted tasks, and in turn the workers shut down the deleted tasks.

This caused an issue because shutting down a task is far from instant, while deleting a task is. When a task is deleted, its network resources are freed by the Allocator, and can be reused. However, because the task may take some time to shut down, it was possible for network resources (IP addresses) to be reused while the old tasks' containers were still shutting down.

The solution for this was to add the Remove state, which (being a terminal state) indicates to a worker that the task should be shut down. Then, when the task is shut down, that state is reported back to the manager, which affirmatively knows that the task is terminated, and at that point can safely delete the resources it uses.

However, a further extension of this problem is that a service itself can have resources associated with it, specifically Virtual IPs. Deleting the service would free those Virtual IPs, creating the same problem. So, to fix that, I believe @nishanttotla was or is working on a change that flags the service object for removal instead of removing it outright. The flag would be treated the way a deletion used to be: it kicks off moving the tasks to desired state removed, and the service is finally deleted when all of its tasks have been deleted.
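A hedged sketch of how that proposed flow might look in the model's notation; treating `services` as a function from ids to records and the `marked_for_removal` field are assumptions for illustration only:

(* The user asks to remove a service: flag it rather than deleting it outright,
   and mark all of its tasks for removal. *)
FlagServiceForRemoval(sid) ==
  /\ sid \in DOMAIN services
  /\ services' = [services EXCEPT ![sid].marked_for_removal = TRUE]
  /\ tasks' = { IF t.service = sid THEN [t EXCEPT !.desired_state = remove] ELSE t
                : t \in tasks }

(* Only once every task of a flagged service is gone is the service itself deleted,
   at which point its Virtual IPs and other service-level resources can be freed. *)
DeleteFlaggedService(sid) ==
  /\ services[sid].marked_for_removal
  /\ \A t \in tasks : t.service # sid
  /\ services' = [s \in DOMAIN services \ {sid} |-> services[s]]
  /\ UNCHANGED tasks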

Contributor Author

OK, I'll wait for @nishanttotla to do that (or submit a design for it) and add it to the model then.

To check I understand the current system:

  1. The service is removed from the system.
  2. The orchestrator sets the desired state for every task without a service to remove.
  3. Once a task with no service is shut down, the reaper will remove it.

are configured to start the new task before stopping the old one.'' - need to implement this

XXX: ``The updater takes care of making sure that each slot converges to having a single running task.''
How does it do that?
Collaborator

You'd have to look at the code itself, but the pattern is probably something along the lines of watching the event stream for task updates and, during a rolling update, moving the desired state of the older tasks in a slot to shutdown.
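As a hedged sketch of that pattern, using the record fields that appear elsewhere in this thread (`slot`, `service`, `desired_state`) and a hypothetical `IsNewer` ordering purely for illustration:

(* During a rolling update, an older task that shares a slot with a newer task
   gets its desired state moved to shutdown. *)
UpdaterShutsDownOldTask ==
  \E t_old, t_new \in tasks :
     /\ t_old.service = t_new.service
     /\ t_old.slot = t_new.slot
     /\ IsNewer(t_new, t_old)          \* hypothetical: t_new was created by the update
     /\ t_old.desired_state # shutdown
     /\ tasks' = (tasks \ {t_old}) \union { [t_old EXCEPT !.desired_state = shutdown] }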


XXX: `orchestrators.md' says: ``[The Restart supervisor] atomically changes the state of the old task to Shutdown''.
However, this conflicts with the claim that only the agent can change the state of a running task.
Presumably it means the desired state.
Collaborator

Yes, probably means desired state. Every state after assigned is set by the agent. Feel free to either open a new PR or add to this PR an update to the documentation.

(* XXX: spec says ``Tasks that were removed because of service removal
or scale down are not kept around in task history.
Does that include tasks that had already finished for some
other reason? *)
Collaborator

Going further off my explanation of slots, and the fact that the primary purpose of a slot is to maintain task history, the tasks in a slot have very little relationship between each other beyond the fact that they share a slot number.

When a task is removed in a scale down or service removal, the history for that slot is no longer useful in any meaningful way, so all tasks with the removed slot number in any state are removed. The slot no longer "exists" (as far as the concept of existence is meaningful for a slot) and therefore no longer requires any history.
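In the model's terms that could be sketched roughly as follows (assuming, as elsewhere in this thread, that `tasks` is a set of task records with `service` and `slot` fields):

(* When a slot disappears (scale down or service removal), every task that
   carries that slot number is deleted, whatever state it is in. *)
ForgetSlot(sid, slot) ==
  tasks' = { t \in tasks : ~(t.service = sid /\ t.slot = slot) }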

(* Every task has a service:

XXX: The spec says: ``In some cases, there are tasks that exist independent of any service.
These do not have a value set in service_id.''. Need an example of one. *)
Collaborator

Network attachment tasks. They're an abomination, an absolutely disgusting hack that should never have been shipped. They are tasks which use the NetworkAttachmentSpec runtime spec, a spec which contains a container ID. These tasks are created through the grpc server implemented in resourceapi, and they are created by the agent. The purpose of Network Attachment tasks is to attach regular docker run containers to a swarmkit overlay network that has been created with the --attachable option.

There have been bugs in the past with attachment tasks because we have no visibility into their lifecycle, and their creation and destruction is entirely at the whim of the agent. The end result was that network attachment tasks would remain in the object store after they had been terminated, leaking resources and causing the associated attached networks to be unremovable.

I added a workaround to remove network attachment tasks when a node is removed in #2409, but the underlying issue IIRC remains.

I'm sorry.

Collaborator

Also, @jlhawn is working on a hack week project to run one-off tasks, like docker run-esque tasks, that would exist independently of services.

Transition == {
<< null, new >>,
<< new, pending >>,
<< pending, assigned >>,
Collaborator

I can't quite remember (I'd have to go check the code), but it may be possible to move from new to a terminal state, if the task cannot be allocated because there aren't enough free resources. I think. I'm not sure.

<< shutdown, remove >>,
<< rejected, remove >>,
<< complete, remove >>,
<< failed, remove >>,
Collaborator

Tasks will never actually BE in the removed state. They will only be in the desired state of removed. The state functions mainly as a flag to the task reaper, to indicate that once a task is in a terminal state, it and its history can be deleted.

<< shutdown, orphaned >>,

<< remove, null >>,
<< orphaned, null >>
Collaborator

Tasks can be deleted (moved to the null state, I think, in terms of the implementation of this model) from any terminal state. For example, when a task history is deleted, the old tasks will all be in terminal states (like SHUTDOWN or FAILED) and will be deleted straight from those states. Once a task is in a terminal state, the resources it had in use are definitely free and it is safe to remove it.
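In the style of the Transition set quoted above, this comment roughly amounts to adding a pair from each terminal state to null; a sketch based on this discussion rather than on the code:

(* Per the comment above: deletion (a move to null) is possible from any terminal
   state, not only from remove and orphaned. *)
<< complete, null >>,
<< shutdown, null >>,
<< failed, null >>,
<< rejected, null >>,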

=============================================================================
\* Modification History
\* Last modified Thu Apr 19 14:55:46 BST 2018 by tal
\* Created Mon Apr 16 11:31:19 BST 2018 by tal
Collaborator

this is a dumb nit, but wouldn't the modification history just be tracked in git? Or does this history get baked into the PDF that a TLA+ file creates?

Contributor Author

It's just some noise the TLA+ toolbox updates automatically when you save. Although it looks like if you delete it, it doesn't create it again, so that's easily solved!

Contributor Author

talex5 commented Apr 23, 2018

Thanks for the detailed review @dperny! I've made some updates:

Task.pdf

  • Explicitly model SwarmKit's view of a node's state (up or down). This is needed because we shouldn't automatically orphan a node's tasks as soon as the node goes down. It also makes it easier to write
    properties that depend on node state.

  • Note that removing services will be modelled once SwarmKit's new design for this is ready.

  • Confirm that slots do not exist as separate objects.

  • Confirm that already-terminated tasks are removed when their slot is no longer needed.

  • remove is not actually a possible state of a task, only a desired state. The reaper removes any task that has finished and has this as its desired state.

  • Model the allocator rejecting tasks.

  • Some minor corrections to the existing markdown docs.

Note: I also made a change to the tasks variable, which should make things a bit easier to read as the model now mainly talks about things in terms of tasks rather than task IDs.

I also updated the set of valid transitions to group them by the component that performs them, and to check this.

progress through the `ACCEPTED`, `PREPARING`, and `STARTING` states on the
way to `RUNNING`. If a task exits without an error code, it moves to the
progress through the `ACCEPTED`, `PREPARING`, `READY', and `STARTING` states on
the way to `RUNNING`. If a task exits without an error code, it moves to the
Contributor

If a task exits without an error code, it moves to the COMPLETE state.

Let's also clarify this by saying that the actual state is changed to COMPLETE by the agent.

Contributor Author

Actually, it says the agent handles these states on the line above the bit shown by GitHub's diff, so maybe it's OK as it is.

Contributor

OK, it's fine as is.

Collaborator

dperny commented Apr 24, 2018

OK, I'll wait for @nishanttotla to do that (or submit a design for it) and add it to the model then.

To check I understand the current system:

  1. The service is removed from the system.
  2. The orchestrator sets the desired state for every task without a service to remove.
  3. Once a task with no service is shut down, the reaper will remove it.

More accurately: the service is removed from the system, which involves setting the desired state of all of its tasks to REMOVE, and when any task with desired state REMOVE enters a terminal state, the task reaper will delete it.

(* A description of the task model in SwarmKit.
Currently, this is mostly based on `design/task_model.md'.

Note: I am not yet familiar with the SwarmKit code, and this document is likely to
Contributor

nit: looks like this is no longer true :)

Contributor

anshulpundir commented Apr 24, 2018

More accurately: the service is removed from the system, which involves setting the desired state of all of its tasks to REMOVE, and when any task with desired state REMOVE enters a terminal state, the task reaper will delete it.

Even more accurately, the task reaper will delete any task which is in a terminal state or in an unassigned state (i.e. < ASSIGNED).
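Putting the last two comments together, a hedged sketch of the reaper's precondition in the spec's notation (the set of terminal states and the "< ASSIGNED" check follow this discussion; the operator names are illustrative):

(* A task is reapable once it is flagged for removal and either it has reached a
   terminal state or it was never assigned to a node. *)
Reapable(t) ==
  /\ t.desired_state = remove
  /\ \/ t.state \in { complete, shutdown, failed, rejected, orphaned }   \* terminal
     \/ t.state \in { new, pending }                                     \* i.e. < assigned

ReapTask ==
  \E t \in tasks :
     /\ Reapable(t)
     /\ tasks' = tasks \ {t}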

Also:

The orchestrator sets the desired state for every task without a service to remove.

Not every task without a service, but just the tasks for the service that was just deleted.

@dperny @talex5

Contributor Author

talex5 commented Apr 24, 2018

Thanks for the explanation. I've added service removal. I modelled it as flagging the service and all its tasks for removal and then removing it when it had no tasks left (i.e. the proposed design rather than the current one). I added a comment about it in the model.

I also updated the reaper to allow removing tasks that haven't yet been assigned (as explained by @anshulpundir), and added a couple of exceptions to the rule that only one component is in charge of each state to cover that.

Task.pdf

@GordonTheTurtle

Please sign your commits following these rules:
https://github.com/moby/moby/blob/master/CONTRIBUTING.md#sign-your-work
The easiest way to do this is to amend the last commit:

$ git clone -b "tla" git@github.com:talex5/swarmkit.git somewhere
$ cd somewhere
$ git rebase -i HEAD~842353920240
editor opens
change each 'pick' to 'edit'
save the file and quit
$ git commit --amend -s --no-edit
$ git rebase --continue # and repeat the amend for each commit
$ git push -f

Amending updates the existing PR. You DO NOT need to open a new one.

@talex5 talex5 changed the title from "[RFC] TLA+ model for SwarmKit tasks" to "TLA+ model for SwarmKit tasks" on May 2, 2018
Contributor Author

talex5 commented May 2, 2018

I've now split the spec into multiple files, and created a separate low-level spec for worker nodes. I also added a Makefile so you can build the PDFs and run the model checker without having to install the TLA toolbox (it uses a Docker image). This could also be used in CI.

Contributor Author

talex5 commented May 11, 2018

I think this PR is ready for merging now.

Contributor

@anshulpundir anshulpundir left a comment


Some minor comments but otherwise looks good!

I also think we should squash the commits together.

progress through the `ACCEPTED`, `PREPARING`, and `STARTING` states on the
way to `RUNNING`. If a task exits without an error code, it moves to the
progress through the `ACCEPTED`, `PREPARING`, `READY', and `STARTING` states on
the way to `RUNNING`. If a task exits without an error code, it moves to the
Contributor

OK, it's fine as is.

(* The user removes a service.

Note: Currently, SwarmKit deletes the service from its records immediately.
However, this isn't right because we need to wait for the resources to be freed.
Contributor

this isn't right because

Instead of saying this isn't right, I'd rather say that this is not very elegant, because we still wait for resources to be freed before deleting the tasks, and resources are tied to tasks and not services.

Contributor Author

@dperny mentioned that there are also service-level resources that need to be freed. How about:

However, this isn't right because we need to wait for service-level resources such as Virtual IPs to be freed.

\E id \in UnusedId(sid, slot) :
tasks' = tasks \union { NewTask(sid, slot, id, running) }

(* Add a task if a slot exists, contains no runnable tasks, and we weren't trying to remove it.
Contributor

supernit: remove comma before 'and' ?

Contributor Author

I think this is correct. According to https://en.wikipedia.org/wiki/Serial_comma:

A majority of American style guides mandate use of the serial comma

(* We don't model the updater explicitly, but we allow any task to be restarted (perhaps with
a different image) at any time, which should cover the behaviours of the restart supervisor.

XXX: wait for new task to be ready before shutting down old one?
Contributor

Is this a TODO ? I think some clarification is needed.

Contributor Author

I agree. I've updated it to:

TODO: SwarmKit also allows ``start-first'' mode updates where we first get the new task to
`running' and then mark the old task for shutdown. Add this to the model.
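A hedged sketch of what the start-first flavour might look like, reusing `UnusedId` and `NewTask` from the fragment quoted earlier (the two-action split is the point being illustrated; the guard details are assumptions):

(* start-first, step 1: create the replacement task in the same slot with desired
   state `running', while the old task keeps running. *)
StartFirstCreate(sid, slot) ==
  \E id \in UnusedId(sid, slot) :
     tasks' = tasks \union { NewTask(sid, slot, id, running) }

(* start-first, step 2: only once the replacement is actually running is the old
   task in that slot marked for shutdown. *)
StartFirstShutdownOld(sid, slot) ==
  \E t_old, t_new \in tasks :
     /\ t_old # t_new
     /\ t_old.service = sid /\ t_new.service = sid
     /\ t_old.slot = slot /\ t_new.slot = slot
     /\ t_new.state = running
     /\ tasks' = (tasks \ {t_old}) \union { [t_old EXCEPT !.desired_state = shutdown] }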


TODO: ``However, there are also cases where a slot may have multiple tasks with the
desired state of RUNNING. This can happen during rolling updates when the updates
are configured to start the new task before stopping the old one.'' - need to implement this *)
Contributor

need to implement this

By this you mean implement the TLA model for this ?


(* The allocator rejects the task because there aren't enough resources.

XXX: Not clear whether SwarmKit actually does this.
Contributor

Don't think it does. The task just stays in NEW (or READY ?) until resources are available. cc @dperny

Contributor Author

He said he wasn't sure: #2613 (comment)
Shall I remove this then?

Contributor

Browsing through the code, I don't see a case where the current allocator fails tasks if there aren't enough resources. @talex5

Contributor Author

OK, thanks for checking! I've removed this bit now.


---------------------------- MODULE Scheduler -------------------------------

\* Actions performed the scheduler
Contributor

performed the scheduler => performed by the scheduler

(* Forget about tasks in remove or orphan states.

Orphaned tasks belong to nodes that we are assuming are lost forever.
Or, at the very least, it's likely that a node outage lasting longer than the timeout
Contributor

has crashed => has happened ?

Contributor Author

How about:

Orphaned tasks belong to nodes that we are assuming are lost forever (or have crashed and will come up with nothing running, which is an equally fine outcome).

I think this is what @dperny meant in #2613 (comment).


``Moving a task to the Orphaned state is not desirable,
because it's the one case where we break the otherwise invariant
that the agent sets all states past ASSIGNED.''
Contributor

that the agent sets all states past ASSIGNED.

Clarify that this is the actual state and not the desired state ?

These documents use the TLA+ notation to describe the behaviour of
SwarmKit more precisely than the prose documentation in the design
directory.

Run `make pdfs` to render these documents as PDF files.
The best one to start with is `SwarmKit.pdf`, which introduces the TLA+
notation and describes the overall components of SwarmKit.

The specifications can also be executed by the TLC model checker to help
find mistakes. Use `make check` to run the checks.

Signed-off-by: Thomas Leonard <thomas.leonard@docker.com>