
Bumping Swarmkit's version #38525

Open: wk8 wants to merge 2 commits into master from wk8:wk8/bump_swarmkit

Conversation

5 participants
@wk8 (Contributor) commented Jan 10, 2019

Mainly for the sake of testing Swarmkit's new "soft delete" capabilities; see
docker/swarmkit#2778 for more details.

This also reflects Swarmkit's new API changes in moby's own types, namely
the PendingDelete fields added to both Services and NetworkResources.

It also amends the relevant tests to wait for services to be fully shut down
after removing them.

Signed-off-by: Jean Rouge <jean.rouge@docker.com>
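
For reference, a minimal sketch (abbreviated and illustrative; all fields other than PendingDelete are elided or simplified) of what the type change amounts to on the moby side:

package swarmtypes // illustrative package name

// Abbreviated sketch of the change described above: a PendingDelete flag on
// Service (and, equivalently, on NetworkResource) reporting that removal has
// been requested but the underlying tasks have not finished shutting down.
type Service struct {
	ID string
	// ...existing fields elided...
	PendingDelete bool `json:",omitempty"`
}

type NetworkResource struct {
	Name string
	ID   string
	// ...existing fields elided...
	PendingDelete bool `json:",omitempty"`
}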

Bumping Swarmkit's version
Mainly for the sake of testing the new "soft delete" capabilities of
Swarmkit; see docker/swarmkit#2778 for more details.

Signed-off-by: Jean Rouge <jer329@cornell.edu>

@wk8 wk8 force-pushed the wk8:wk8/bump_swarmkit branch from f875aa2 to 7f0f93c Jan 10, 2019

@codecov (bot) commented Jan 10, 2019

Codecov Report

Merging #38525 into master will increase coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #38525      +/-   ##
==========================================
+ Coverage   36.62%   36.64%   +0.01%     
==========================================
  Files         608      608              
  Lines       45173    45175       +2     
==========================================
+ Hits        16545    16553       +8     
+ Misses      26341    26336       -5     
+ Partials     2287     2286       -1

@wk8 wk8 force-pushed the wk8:wk8/bump_swarmkit branch 2 times, most recently from bf54789 to d41b944 Jan 10, 2019

@olljanat (Contributor) commented Jan 10, 2019

@wk8 Janky failure:

02:50:56 ============================= test session starts ==============================
02:50:56 platform linux2 -- Python 2.7.13, pytest-2.9.1, py-1.7.0, pluggy-0.3.1
02:50:56 rootdir: /docker-py, inifile: pytest.ini
02:50:56 plugins: cov-2.1.0
02:50:57 collected 338 items
02:50:57 
02:51:23 ../../../../../docker-py/tests/integration/api_build_test.py ....s..s........
02:51:24 ../../../../../docker-py/tests/integration/api_client_test.py .......
02:51:25 ../../../../../docker-py/tests/integration/api_config_test.py .....
02:52:49 ../../../../../docker-py/tests/integration/api_container_test.py ....x....x.......x............................................................
02:52:59 ../../../../../docker-py/tests/integration/api_exec_test.py .............
02:53:06 ../../../../../docker-py/tests/integration/api_healthcheck_test.py ....
02:53:17 ../../../../../docker-py/tests/integration/api_image_test.py ...s........s....
02:53:52 ../../../../../docker-py/tests/integration/api_network_test.py ...........................
02:54:15 ../../../../../docker-py/tests/integration/api_plugin_test.py ...........
02:54:16 ../../../../../docker-py/tests/integration/api_secret_test.py .....
02:54:35 ../../../../../docker-py/tests/integration/api_service_test.py ............................FF.......F.FF.............
02:55:07 ../../../../../docker-py/tests/integration/api_swarm_test.py ..s...x......
02:55:22 ../../../../../docker-py/tests/integration/api_volume_test.py .........
02:55:22 ../../../../../docker-py/tests/integration/client_test.py ....
02:55:23 ../../../../../docker-py/tests/integration/errors_test.py .
02:56:11 ../../../../../docker-py/tests/integration/models_containers_test.py ..............................
02:56:27 ../../../../../docker-py/tests/integration/models_images_test.py .............
02:56:28 ../../../../../docker-py/tests/integration/models_networks_test.py ....
02:56:34 ../../../../../docker-py/tests/integration/models_nodes_test.py .
02:56:35 ../../../../../docker-py/tests/integration/models_resources_test.py .
02:56:37 ../../../../../docker-py/tests/integration/models_services_test.py ......F.......X.
02:56:38 ../../../../../docker-py/tests/integration/models_swarm_test.py .
02:56:38 ../../../../../docker-py/tests/integration/models_volumes_test.py ..
02:56:42 ../../../../../docker-py/tests/integration/regression_test.py ......
02:56:42 
02:56:42  generated xml file: /go/src/github.com/docker/docker/bundles/test-docker-py/results.xml 
02:56:42 =========================== short test summary info ============================
02:56:42 XFAIL ../../../../../docker-py/tests/integration/api_container_test.py::CreateContainerTest::test_create_with_cpu_rt_options
02:56:42   CONFIG_RT_GROUP_SCHED isn't enabled
02:56:42 XFAIL ../../../../../docker-py/tests/integration/api_container_test.py::CreateContainerTest::test_create_with_init_path
02:56:42   init-path removed in 17.05.0
02:56:42 XFAIL ../../../../../docker-py/tests/integration/api_container_test.py::CreateContainerTest::test_create_with_storage_opt
02:56:42   Not supported on most drivers
02:56:42 XFAIL ../../../../../docker-py/tests/integration/api_swarm_test.py::SwarmTest::test_init_swarm_with_log_driver
02:56:42   This doesn't seem to be taken into account by the engine
02:56:42 SKIP [1] tests/integration/api_image_test.py:289: Doesn't work inside a container - FIXME
02:56:42 SKIP [1] /docker-py/tests/integration/api_swarm_test.py:31: Test stalls the engine on 1.12.0
02:56:42 SKIP [3] /docker-py/tests/helpers.py:69: Feature requires Docker Engine experimental mode
02:56:42 =================================== FAILURES ===================================
02:56:42 _____________ ServiceTest.test_create_service_with_unicode_config ______________
02:56:42 /docker-py/tests/integration/api_service_test.py:642: in test_create_service_with_unicode_config
02:56:42     config_id = self.client.create_config(config_name, config_data)
02:56:42 /docker-py/docker/utils/decorators.py:34: in wrapper
02:56:42     return f(self, *args, **kwargs)
02:56:42 /docker-py/docker/api/config.py:35: in create_config
02:56:42     self._post_json(url, data=body), True
02:56:42 /docker-py/docker/api/client.py:229: in _result
02:56:42     self._raise_for_status(response)
02:56:42 /docker-py/docker/api/client.py:225: in _raise_for_status
02:56:42     raise create_api_error_from_http_exception(e)
02:56:42 /docker-py/docker/errors.py:31: in create_api_error_from_http_exception
02:56:42     raise cls(e, response=response, explanation=explanation)
02:56:42 E   APIError: 409 Client Error: Conflict ("rpc error: code = AlreadyExists desc = config favorite_touhou already exists")
02:56:42 _____________ ServiceTest.test_create_service_with_unicode_secret ______________
02:56:42 /docker-py/tests/integration/api_service_test.py:590: in test_create_service_with_unicode_secret
02:56:42     secret_id = self.client.create_secret(secret_name, secret_data)
02:56:42 /docker-py/docker/utils/decorators.py:34: in wrapper
02:56:42     return f(self, *args, **kwargs)
02:56:42 /docker-py/docker/api/secret.py:46: in create_secret
02:56:42     self._post_json(url, data=body), True
02:56:42 /docker-py/docker/api/client.py:229: in _result
02:56:42     self._raise_for_status(response)
02:56:42 /docker-py/docker/api/client.py:225: in _raise_for_status
02:56:42     raise create_api_error_from_http_exception(e)
02:56:42 /docker-py/docker/errors.py:31: in create_api_error_from_http_exception
02:56:42     raise cls(e, response=response, explanation=explanation)
02:56:42 E   APIError: 409 Client Error: Conflict ("rpc error: code = AlreadyExists desc = secret favorite_touhou already exists")
02:56:42 ________________________ ServiceTest.test_list_services ________________________
02:56:42 /docker-py/tests/integration/api_service_test.py:76: in test_list_services
02:56:42     assert len(test_services) == 0
02:56:42 E   AssertionError: assert 9 == 0
02:56:42 E    +  where 9 = len([{'CreatedAt': '2019-01-10T02:54:17.610899445Z', 'Endpoint': {'Spec': {}}, 'ID': 'obpnh07dag0cwi8z2yu3phd68', 'Pending...10T02:54:23.874616888Z', 'Endpoint': {'Spec': {}}, 'ID': 'x1i3jvdrgrei9h1yj9i9l2bk7', 'PendingDelete': True, ...}, ...])
02:56:42 ____________________ ServiceTest.test_remove_service_by_id _____________________
02:56:42 /docker-py/tests/integration/api_service_test.py:118: in test_remove_service_by_id
02:56:42     assert len(test_services) == 0
02:56:42 E   AssertionError: assert 4 == 0
02:56:42 E    +  where 4 = len([{'CreatedAt': '2019-01-10T02:54:17.610899445Z', 'Endpoint': {'Spec': {}}, 'ID': 'obpnh07dag0cwi8z2yu3phd68', 'Pending...9-01-10T02:54:24.061397702Z', 'Endpoint': {'Spec': {}}, 'ID': 'fet1grsijuqh9q06mo4t7t8w7', 'PendingDelete': True, ...}])
02:56:42 ___________________ ServiceTest.test_remove_service_by_name ____________________
02:56:42 /docker-py/tests/integration/api_service_test.py:124: in test_remove_service_by_name
02:56:42     assert len(test_services) == 0
02:56:42 E   AssertionError: assert 5 == 0
02:56:42 E    +  where 5 = len([{'CreatedAt': '2019-01-10T02:54:24.142811361Z', 'Endpoint': {'Spec': {}}, 'ID': '0p7bqsqb23lezynqqaquroex0', 'Pending...9-01-10T02:54:24.061397702Z', 'Endpoint': {'Spec': {}}, 'ID': 'fet1grsijuqh9q06mo4t7t8w7', 'PendingDelete': True, ...}])
02:56:42 _________________________ ServiceTest.test_list_remove _________________________
02:56:42 /docker-py/tests/integration/models_services_test.py:80: in test_list_remove
02:56:42     assert service not in client.services.list()
02:56:42 E   AssertionError: assert <Service: bke7s3fwnr> not in [<Service: bke7s3fwnr>, <Service: de3pe7myoh>, <Service: df7yub3t6n>, <Service: ozisoi19iv>, <Service: u4fwx2vhug>, <Service: uwsemomnq7>, ...]
02:56:42 E    +  where [<Service: bke7s3fwnr>, <Service: de3pe7myoh>, <Service: df7yub3t6n>, <Service: ozisoi19iv>, <Service: u4fwx2vhug>, <Service: uwsemomnq7>, ...] = <bound method ServiceCollection.list of <docker.models.services.ServiceCollection object at 0x7fd67c4c6510>>()
02:56:42 E    +    where <bound method ServiceCollection.list of <docker.models.services.ServiceCollection object at 0x7fd67c4c6510>> = <docker.models.services.ServiceCollection object at 0x7fd67c4c6510>.list
02:56:42 E    +      where <docker.models.services.ServiceCollection object at 0x7fd67c4c6510> = <docker.client.DockerClient object at 0x7fd67c4d80d0>.services
02:56:42 === 6 failed, 322 passed, 5 skipped, 4 xfailed, 1 xpassed in 346.28 seconds ====
06:00:50 Build timed out (after 200 minutes). Marking the build as failed.
06:00:50 Build timed out (after 200 minutes). Marking the build as aborted.

And Experimental failure:

03:47:46 ----------------------------------------------------------------------
03:47:46 FAIL: docker_api_swarm_service_test.go:402: DockerSwarmSuite.TestAPISwarmServiceConstraintLabel
03:47:46 
03:47:46 [d19a4a7199ab1] waiting for daemon to start
03:47:46 [d19a4a7199ab1] daemon started
03:47:46 
03:47:46 [da3902fd25507] waiting for daemon to start
03:47:46 [da3902fd25507] daemon started
03:47:46 
03:47:46 [dc49c2f95d361] waiting for daemon to start
03:47:46 [dc49c2f95d361] daemon started
03:47:46 
03:47:46 waited for 1.376717ms (out of 5s)
03:47:46 waited for 924.416453ms (out of 30s)
03:47:46 assertion failed: error is not nil: Error response from daemon: rpc error: code = AlreadyExists desc = name conflicts with an existing object: service top already exists
03:47:46 [d19a4a7199ab1] exiting daemon
03:47:46 [da3902fd25507] exiting daemon
03:47:46 [dc49c2f95d361] exiting daemon
03:48:01 
03:48:01 ----------------------------------------------------------------------
03:48:01 FAIL: docker_api_swarm_service_test.go:350: DockerSwarmSuite.TestAPISwarmServiceConstraintRole
03:48:01 
03:48:01 [d0b43fa1e3376] waiting for daemon to start
03:48:01 [d0b43fa1e3376] daemon started
03:48:01 
03:48:01 [dfecb3662b28b] waiting for daemon to start
03:48:01 [dfecb3662b28b] daemon started
03:48:01 
03:48:01 [da65e7126d230] waiting for daemon to start
03:48:01 [da65e7126d230] daemon started
03:48:01 
03:48:01 waited for 1.398944ms (out of 5s)
03:48:01 waited for 927.794596ms (out of 30s)
03:48:01 assertion failed: error is not nil: Error response from daemon: rpc error: code = AlreadyExists desc = name conflicts with an existing object: service top already exists
03:48:01 [d0b43fa1e3376] exiting daemon
03:48:01 [dfecb3662b28b] exiting daemon
03:48:01 [da65e7126d230] exiting daemon
03:48:16 

These look to be new issues, at least ones I have not seen earlier.

The z job looks to be an infra issue (java.nio.file.DirectoryNotEmptyException)
and PowerPC failed due to #38460, so I will trigger a rebuild for those.

@olljanat (Contributor) commented Jan 10, 2019

(reserved for my derek commands)

@cpuguy83 (Contributor) commented Jan 10, 2019

Yeah, this seems like a breaking change from swarmkit.

@wk8 wk8 requested review from tianon and vdemeester as code owners Jan 11, 2019

@GordonTheTurtle commented Jan 11, 2019

Please sign your commits following these rules:
https://github.com/moby/moby/blob/master/CONTRIBUTING.md#sign-your-work
The easiest way to do this is to amend the last commit:

$ git clone -b "wk8/bump_swarmkit" git@github.com:wk8/moby.git somewhere
$ cd somewhere
$ git rebase -i HEAD~842354192848
editor opens
change each 'pick' to 'edit'
save the file and quit
$ git commit --amend -s --no-edit
$ git rebase --continue # and repeat the amend for each commit
$ git push -f

Amending updates the existing PR. You DO NOT need to open a new one.

@wk8 wk8 force-pushed the wk8:wk8/bump_swarmkit branch from 8232050 to 7cbaf75 Jan 11, 2019

@GordonTheTurtle GordonTheTurtle removed the dco/no label Jan 11, 2019

@wk8 wk8 force-pushed the wk8:wk8/bump_swarmkit branch 4 times, most recently from cd2e72e to 736529e Jan 11, 2019

wk8 added a commit to wk8/docker-py that referenced this pull request Jan 12, 2019

Accounting for asynchronous service removals
Swarmkit services' removal will soon become asynchronous (see
moby/moby#38525); that is, a service will only actually
get deleted once all of its containers are shut down.

This patch changes service tests' `tearDown` to wait for services to be
actually removed, so that there's no name collision with subsequent tests.

This change is fully backward compatible with the engine's current behaviour.

A couple of other misc changes: consolidated polling into a single
`wait_until_truthy` function, and allowed passing extra flags to `py.test` in
the Makefile test-integration targets through a new optional `pytest_options`
env variable.

Signed-off-by: Jean Rouge <jer329@cornell.edu>

@wk8 wk8 force-pushed the wk8:wk8/bump_swarmkit branch from 736529e to a9efff0 Jan 12, 2019

Reflecting Swarmkit's new API changes in moby's own types
Namely the `PendingDelete` fields added to both `Services`
and `NetworkResources` in
docker/swarmkit#2759

Also amended the relevant tests to wait for services to be
fully shut down after removing them.

Signed-off-by: Jean Rouge <jer329@cornell.edu>
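
To illustrate the kind of wait the amended tests need, here is a minimal sketch (the helper name, timeout, and polling interval are illustrative, not the actual code in this PR) that uses the Go client to poll until a removed service is fully gone:

package servicewait // illustrative

import (
	"context"
	"fmt"
	"time"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/api/types/filters"
	"github.com/docker/docker/client"
)

// waitServiceRemoved polls ServiceList until no service with the given name
// remains, or the timeout expires.
func waitServiceRemoved(ctx context.Context, cli client.APIClient, name string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	nameFilter := filters.NewArgs(filters.Arg("name", name))
	for {
		services, err := cli.ServiceList(ctx, types.ServiceListOptions{Filters: nameFilter})
		if err != nil {
			return err
		}
		if len(services) == 0 {
			return nil // the service (and its name) is fully gone
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("service %q still present after %v", name, timeout)
		}
		time.Sleep(100 * time.Millisecond)
	}
}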

@wk8 wk8 force-pushed the wk8:wk8/bump_swarmkit branch from a9efff0 to 516067f Jan 12, 2019

@wk8 (Contributor) commented Jan 13, 2019

Note to reviewers: the build failure should be fixed once docker/docker-py#2228 is merged. Thanks!

@@ -14,9 +13,6 @@ type ConfigConstructor func(*swarm.Config)

// CreateConfig creates a config given the specified spec
func (d *Daemon) CreateConfig(t assert.TestingT, configSpec swarm.ConfigSpec) string {
	if ht, ok := t.(test.HelperT); ok {
		ht.Helper()
	}

@wk8 (Contributor) commented Jan 13, 2019

Note for all of these: d.NewClientT(t) already does the exact same thing.

@cpuguy83 (Contributor) commented Jan 13, 2019

Sorry no, changing docker-py is not a valid fix for a breaking API change.

@cpuguy83 (Contributor) commented Jan 13, 2019

These tests exist to pinpoint issues like this; just changing the test is not how to deal with it.

This is a breaking change in swarmkit. It should be addressed there, and if not, then we need to change the implementation here so that these changes do not break existing users.

@wk8 (Contributor) commented Jan 13, 2019

@cpuguy83: yes, this is absolutely a breaking change in Swarmkit, one that has been deemed the lesser of two evils after ample discussion within the Swarmkit team and outside of it.

Since it is indeed a breaking change, and it changes a behavior that tests both in this repo and in docker-py rely upon, it seems logical to me to adapt the tests to exercise the new desired behavior?

@olljanat (Contributor) commented Jan 13, 2019

@wk8 remember that the Moby API must be backward compatible, so you cannot change behavior for API versions < 1.40.

@cpuguy83 (Contributor) commented Jan 15, 2019

Glossing over the actual bug being solved... I'll assume this had due diligence applied in how to solve the problem... why can't async delete be opt-in?

If swarmkit is not going to handle this, then we need to handle it in moby. We simply cannot have such a breaking change.

Basically:

Create("foo")
Delete("foo")
Create("foo")

Should just work. If Delete returns then "foo" must be available immediately again.
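
To make that expectation concrete, here is a minimal sketch of the sequence using the Go client (the service name, image, and command are illustrative); with asynchronous removal, it is the second create that can fail with a "name conflicts with an existing object" error while the old tasks are still shutting down:

package main

import (
	"context"
	"log"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/api/types/swarm"
	"github.com/docker/docker/client"
)

func main() {
	ctx := context.Background()
	cli, err := client.NewClientWithOpts(client.FromEnv)
	if err != nil {
		log.Fatal(err)
	}

	spec := swarm.ServiceSpec{
		Annotations: swarm.Annotations{Name: "foo"},
		TaskTemplate: swarm.TaskSpec{
			ContainerSpec: &swarm.ContainerSpec{Image: "busybox:latest", Command: []string{"top"}},
		},
	}

	// Create "foo", delete it, then immediately create it again under the same name.
	resp, err := cli.ServiceCreate(ctx, spec, types.ServiceCreateOptions{})
	if err != nil {
		log.Fatal(err)
	}
	if err := cli.ServiceRemove(ctx, resp.ID); err != nil {
		log.Fatal(err)
	}
	if _, err := cli.ServiceCreate(ctx, spec, types.ServiceCreateOptions{}); err != nil {
		log.Fatal(err) // fails under async removal if the old "foo" is still pending delete
	}
}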

@cpuguy83 (Contributor) commented Jan 15, 2019

btw, when I say "opt-in", I mean the delete would still be asynchronous internally (marked for removal), but the API would either only return when the name is actually freed, or, with async=true, return as soon as the service has been marked.
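
Sketching that idea (the handler shape, the Remover interface, and its method names are all hypothetical, not the actual Moby route code):

package asyncdelete // illustrative

import (
	"context"
	"net/http"
)

// Remover is a hypothetical backend abstraction for this sketch.
type Remover interface {
	MarkServiceForRemoval(id string) error
	WaitServiceRemoved(ctx context.Context, id string) error
}

// handleServiceDelete sketches the opt-in described above: removal is always
// asynchronous internally, but the request only returns early when the caller
// explicitly asks for it with ?async=true.
func handleServiceDelete(ctx context.Context, w http.ResponseWriter, r *http.Request, id string, backend Remover) error {
	if err := backend.MarkServiceForRemoval(id); err != nil {
		return err
	}
	if r.URL.Query().Get("async") != "true" {
		// Default behaviour: block until the name is actually freed again.
		if err := backend.WaitServiceRemoved(ctx, id); err != nil {
			return err
		}
	}
	w.WriteHeader(http.StatusNoContent)
	return nil
}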

@wk8 (Contributor) commented Jan 15, 2019

@cpuguy83: sorry, I certainly didn't mean to gloss over the actual problem; apologies if it came across that way.

The actual problem being solved is that currently nothing prevents you from creating a service with a network attached to it, and then deleting the service and the network even if the service's containers have not shut down yet. This means the containers can't perform their cleanup properly if they need to use the network to do so; it can also lead to interesting failures if a network and/or a service with the same name are re-created before the old ones are gone.

The same applies to any service-level resource down the road.

Note that it's only a breaking change if you do anything that assumes (and needs to assume) that a service gets deleted instantaneously, which is conceptually wrong and will lead to errors if that's really an assumption you rely on.

I understand your concern, though; how would you feel about, depending on the client's version:

  • if < 1.40, then we only return from the API call after the delete has occurred
  • otherwise, we return instantly, as we used to, but the actual removal happens asynchronously

That would ensure we don't break older clients, while still returning fast in the 99% use case, where people don't try to re-create a service or a service-level resource with the same name right away.
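
A rough sketch of what that version gating could look like (the Backend interface and its method names are hypothetical; only versions.LessThan is an existing Moby helper):

package versiongate // illustrative

import (
	"context"

	"github.com/docker/docker/api/types/versions"
)

// Backend is a hypothetical abstraction over the swarm backend for this sketch.
type Backend interface {
	RemoveService(id string) error                           // marks the service for deletion
	WaitServiceRemoved(ctx context.Context, id string) error // blocks until it is actually gone
}

// removeService blocks for API clients < 1.40 so the old synchronous behaviour
// is preserved, and returns immediately for newer clients.
func removeService(ctx context.Context, clientVersion, id string, b Backend) error {
	if err := b.RemoveService(id); err != nil {
		return err
	}
	if versions.LessThan(clientVersion, "1.40") {
		return b.WaitServiceRemoved(ctx, id)
	}
	return nil
}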

@thaJeztah (Member) left a comment

See my comment inline about adding a separate field for this.

I also agree with @cpuguy83: we can't change this behavior and make it the default; doing so would break any consumer of the API out there. This behavior should be opt-in, enabled through a new option in the API request (to be added in API v1.40 and up).

	PreviousSpec  *ServiceSpec  `json:",omitempty"`
	Endpoint      Endpoint      `json:",omitempty"`
	UpdateStatus  *UpdateStatus `json:",omitempty"`
	PendingDelete bool          `json:",omitempty"`

@thaJeztah (Member) commented Jan 15, 2019

I'm curious why we are adding a separate boolean for this; it does not really match the current design of how objects in swarmkit work. For services we already have an UpdateStatus, which can have a single state: https://github.com/moby/moby/blob/master/vendor/github.com/docker/swarmkit/api/types.pb.go#L386-L394

Now we're adding another field that's about the status of the service, which means:

a) we have to check two properties
b) those properties can actually "conflict"

For example, what would UpdateStatus == ROLLBACK_STARTED + PendingDelete == true mean?

I noticed this in the example output on the docker/cli pull request, where output could look something like this:

ID                  NAME                MODE                REPLICAS            IMAGE               PORTS               PENDING DELETION?
id-foo              name-foo            replicated          0/2                 busybox:latest      *:30000->3232/tcp   false

That output doesn't tell me much in 99% of situations (where deleting is not a thing). A STATUS column, on the other hand, would provide much more information: e.g. I'd be able to tell why 0/2 replicas are up; perhaps someone just triggered an update and it's still converging (STATUS: update started), or perhaps there's an issue with these tasks and they keep failing (STATUS: running - it "should be running" - time to investigate).

Looking at the current design in swarmkit: for tasks, SwarmKit uses a model where a task goes through a list of states

https://github.com/moby/moby/blob/master/vendor/github.com/docker/swarmkit/api/types.pb.go#L64-L89

These states can only go in a single direction (new -> pending -> ... -> remove), which makes them unambiguous and means only a single field is needed to know exactly what's happening (well, two: DesiredState and Status.State).

Could we use the same approach for these objects (networks, services)? My fear is that at some point someone will run into a similar situation for creating a service ("service is created, but tasks are not up yet"), and we'll be adding a CreatePending bool field, and so on. If I'm not mistaken, we already have this situation for networks (network is created, but not functional yet).

Services and Networks could go through similar stages as tasks, and on removal, the service would have a status:

  • DesiredState: ServiceStateRunning + Status.State: ServiceStateStarting (service has been deployed, and waiting to be converged)
  • DesiredState: ServiceStateRemoved + Status.State: ServiceStateShutdown (service removal is pending, but tasks are still running).

That information could be useful in other situations as well (service created, but tasks are not up yet, so no networking), and could be used by the CLI (which currently has to poll the service's task statuses to see if the service has converged).

There would be a difference in that services are long-lived objects, and their state doesn't go in a single direction (e.g. after updating a service, it will go back), but it would still be a closer match to the current design, and more flexible if more statuses are needed in the future.
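
As a rough illustration of that state-based alternative (all names below are hypothetical, loosely mirroring the task-state model rather than any existing service API):

package servicestate // illustrative

// ServiceState is a hypothetical state enum for services and networks.
type ServiceState int

const (
	ServiceStateStarting ServiceState = iota // deployed, waiting to converge
	ServiceStateRunning
	ServiceStateShutdown // tasks are being shut down
	ServiceStateRemoved  // object can be garbage-collected
)

// ServiceStatus pairs the operator's intent with the observed state, the same
// two-field pattern tasks use (DesiredState + Status.State).
type ServiceStatus struct {
	DesiredState ServiceState
	State        ServiceState
}

// PendingDelete can then be derived instead of being stored as a separate flag.
func (s ServiceStatus) PendingDelete() bool {
	return s.DesiredState == ServiceStateRemoved && s.State != ServiceStateRemoved
}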

@cpuguy83 (Contributor) commented Jan 15, 2019

K8s uses DeletionTimestamp, though it is a little different; basically the timestamp is the time when the object should no longer be visible:

The resource is expected to be deleted (no longer visible from resource lists, and not reachable by name) after the time in this field, once the finalizers list is empty.
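
For comparison, this is how that pattern is typically consumed in Kubernetes client code (a small illustrative helper, not part of any change proposed here):

package k8sexample // illustrative

import corev1 "k8s.io/api/core/v1"

// isBeingDeleted reports whether deletion has been requested for the Pod:
// a non-nil DeletionTimestamp plays the role a PendingDelete flag would.
func isBeingDeleted(pod *corev1.Pod) bool {
	return pod.ObjectMeta.DeletionTimestamp != nil
}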

@thaJeztah thaJeztah added this to backlog in maintainers-session Jan 15, 2019

@cpuguy83 (Contributor) commented Jan 15, 2019

It would be nice to only do this for API >= 1.40; however, this is a very coarse opt-in for such a major change.
I think a dedicated option for async delete on the API request is warranted.
