
Fix agent ServicePodWorker to not block on wait_for_port #2776

Merged
merged 8 commits into master from fix/agent-servicepodworker-async-waitforport on Sep 13, 2017

Conversation

@SpComb (Contributor) commented Sep 6, 2017

Fixes #2775 by having the deploy fail if the container restarts during wait_for_port
Fixes #2415 by allowing the ServicePodWorker actor to terminate during wait_for_port
Maybe fixes #2625?
Mitigates #2710 by allowing deploys pending on wait_for_port to be aborted by stopping the service

Splits apart the apply exclusive { ensure_desired_state; sync_state_to_master } block, with an optional wait_for_port in between. Because the wait_for_port -> wait_until! -> sleep chain runs as a suspendable task inside the actor, the actor still sees later update calls while waiting, and the check_starting! run from the wait_until! loop picks those up to cancel the wait.

Needs extra logic to protect sync_state_to_master against sending stale state from a suspended apply task, after a new update state has been applied and sent.
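
A minimal sketch of the resulting apply flow, assuming the wait_until!/check_starting!/port_open? helper names and signatures (an illustration of the description above, not the literal diff):

```ruby
def apply
  exclusive {
    @container = ensure_desired_state
  }

  # Capture the state this task was started for, so later update() calls that
  # swap @service_pod can be detected below.
  service_pod = @service_pod
  container = @container

  if service_pod.running? && service_pod.wait_for_port
    # Not exclusive: the actor keeps processing update() calls while this task
    # sleeps; check_starting! raises if the pod was stopped/redeployed or the
    # container restarted, which cancels the wait and fails the deploy.
    wait_until!("port #{service_pod.wait_for_port} is open", timeout: 300, interval: 1.0) {
      check_starting!(service_pod, container)
      port_open?(container, service_pod.wait_for_port)
    }
  end

  # Guard against a suspended task reporting stale state for an old deploy_rev.
  sync_state_to_master(service_pod, container) if service_pod.deploy_rev == @service_pod.deploy_rev
end
```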

Workaround for cancelling a deploy stuck on wait_for_port

This allows a service deploy stuck on a broken wait_for_port to be interrupted with kontena service stop, which immediately fails the ongoing deploy: the agent syncs the service_pod state with the current deploy_rev and state: stopped. There is still a race here with the wait_for_port task syncing its own error state:

I, [2017-09-06T16:03:37.567719 #18]  INFO -- GridServiceInstanceDeployer: waited 13.7s of 300.0s until: service development/null/redis-waitforport-1 is running on node development/core-01 at 2017-09-06 16:03:23 UTC yielded GridServiceInstance
W, [2017-09-06T16:03:37.568076 #18]  WARN -- GridServiceInstanceDeployer: Failed to deploy service instance development/null/redis-waitforport-1 to node core-01: GridServiceInstanceDeployer::StateError: Service instance is not running, but stopped
W, [2017-09-06T16:03:37.600901 #1]  WARN -- Kontena::Workers::ServicePodWorker: wait_for_port failed: service stopped

TODO

  • fix the duplicate async.wait_for_port tasks?
  • specs
  • more specs?

elsif service_pod.running? && service_pod.wait_for_port
# delay sync_state_to_master until started
# XXX: apply() gets called twice for each deploy_rev, and this launches two wait_for_port tasks...
async.wait_for_port(service_pod, @container)
Contributor Author

This is a side effect of the #2766 container create/start events setting @container_state_changed = true and thus triggering two calls to apply for each service_pod.deploy_rev:

Celluloid::Actor 0x1de478ea90
Celluloid::Cell 0x1de478ecac: Kontena::Workers::ServicePodWorker
State: Running (executing tasks)
	/usr/lib/ruby/gems/2.4.0/gems/celluloid-0.17.3/lib/celluloid/mailbox.rb:63:in `sleep'
	/usr/lib/ruby/gems/2.4.0/gems/celluloid-0.17.3/lib/celluloid/mailbox.rb:63:in `wait'
	/usr/lib/ruby/gems/2.4.0/gems/celluloid-0.17.3/lib/celluloid/mailbox.rb:63:in `block in check'
	/usr/lib/ruby/gems/2.4.0/gems/timers-4.1.1/lib/timers/wait.rb:33:in `while_time_remaining'
	/usr/lib/ruby/gems/2.4.0/gems/timers-4.1.1/lib/timers/wait.rb:11:in `for'
	/usr/lib/ruby/gems/2.4.0/gems/celluloid-0.17.3/lib/celluloid/mailbox.rb:58:in `check'
	/usr/lib/ruby/gems/2.4.0/gems/celluloid-0.17.3/lib/celluloid/actor.rb:155:in `block in run'
	/usr/lib/ruby/gems/2.4.0/gems/timers-4.1.1/lib/timers/group.rb:66:in `wait'
	/usr/lib/ruby/gems/2.4.0/gems/celluloid-0.17.3/lib/celluloid/actor.rb:152:in `run'
	/usr/lib/ruby/gems/2.4.0/gems/celluloid-0.17.3/lib/celluloid/actor.rb:131:in `block in start'
	/usr/lib/ruby/gems/2.4.0/gems/celluloid-essentials-0.20.5/lib/celluloid/internals/thread_handle.rb:14:in `block in initialize'
	/usr/lib/ruby/gems/2.4.0/gems/celluloid-0.17.3/lib/celluloid/actor/system.rb:78:in `block in get_thread'
	/usr/lib/ruby/gems/2.4.0/gems/celluloid-0.17.3/lib/celluloid/group/spawner.rb:50:in `block in instantiate'


	Tasks:
	  1) Celluloid::Task::Fibered[call]: sleeping
	      {:dangerous_suspend=>false, :method_name=>:wait_for_port}
		Celluloid::Task::Fibered backtrace unavailable. Please try `Celluloid.task_class = Celluloid::Task::Threaded` if you need backtraces here.


	  2) Celluloid::Task::Fibered[call]: sleeping
	      {:dangerous_suspend=>false, :method_name=>:wait_for_port}
		Celluloid::Task::Fibered backtrace unavailable. Please try `Celluloid.task_class = Celluloid::Task::Threaded` if you need backtraces here.

Very annoying to try and fix...
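
For context, a rough sketch of why each deploy_rev ends up with two apply calls and two wait_for_port tasks (the handler and loop names here are hypothetical, this is not the #2766 code verbatim):

```ruby
# Docker emits separate 'create' and 'start' events for a new container; each
# event marks the worker as dirty, so two consecutive passes of the apply loop
# both see the same service_pod.deploy_rev and each one launches its own
# async.wait_for_port task.
def on_container_event(event)
  @container_state_changed = true
end

def apply_loop
  loop do
    apply if @container_state_changed || service_pod_changed?
    @container_state_changed = false
    sleep 1
  end
end
```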

# Only terminate this actor after we have successfully ensure_terminated the Docker container
# Otherwise, stick around... the manager will notice we're still there and re-signal to destroy
self.terminate if service_pod.terminated?
self.terminate
Contributor Author

Subtle change here: this no longer calls sync_state_to_master after the call to destroy. That's probably a good thing; the server no longer cares about the instance at that point, and the call would always log an error when terminating service pods:

E, [2017-09-06T13:25:41.627773 #8] ERROR -- RpcServer: RuntimeError: Instance not found
D, [2017-09-06T13:25:41.627884 #8] DEBUG -- RpcServer: /app/app/services/rpc/node_service_pod_handler.rb:48:in `set_state'
/app/app/services/rpc_server.rb:63:in `handle_request'
/app/app/services/rpc_server.rb:47:in `process!'
/usr/lib/ruby/gems/2.4.0/gems/celluloid-0.17.3/lib/celluloid/calls.rb:28:in `public_send'
/usr/lib/ruby/gems/2.4.0/gems/celluloid-0.17.3/lib/celluloid/calls.rb:28:in `dispatch'
/usr/lib/ruby/gems/2.4.0/gems/celluloid-0.17.3/lib/celluloid/call/async.rb:7:in `dispatch'
/usr/lib/ruby/gems/2.4.0/gems/celluloid-0.17.3/lib/celluloid/cell.rb:50:in `block in dispatch'
/usr/lib/ruby/gems/2.4.0/gems/celluloid-0.17.3/lib/celluloid/cell.rb:76:in `block in task'
/usr/lib/ruby/gems/2.4.0/gems/celluloid-0.17.3/lib/celluloid/actor.rb:339:in `block in task'
/usr/lib/ruby/gems/2.4.0/gems/celluloid-0.17.3/lib/celluloid/task.rb:44:in `block in initialize'
/usr/lib/ruby/gems/2.4.0/gems/celluloid-0.17.3/lib/celluloid/task/fibered.rb:14:in `block in create'

@SpComb (Contributor Author) commented Sep 6, 2017

This is the minimal fix. I think a more complete fix would be to delegate the wait_for_port checking to a ContainerHealthCheckWorker actor supervised by the ServicePodWorker, and thus also allow waiting on any service health checks for the rolling deploys. It would also be easier to properly deal with the various state changes as an actor, rather than in the async task wait_until! loop.
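
A very rough sketch of that direction, with a supervised per-container checker actor instead of the current in-task wait loop (all names here are hypothetical, including container_ready? and on_container_ready):

```ruby
class ContainerHealthCheckWorker
  include Celluloid

  # parent is the supervising ServicePodWorker; it is notified once the
  # container becomes ready, and can then sync the service instance through
  # initialized -> starting -> running to the master.
  def initialize(service_pod, container, parent)
    @service_pod = service_pod
    @container = container
    @parent = parent
    async.run
  end

  def run
    every(1.0) do
      # container_ready? stands in for the port wait / health check logic
      if container_ready?(@container, @service_pod)
        @parent.async.on_container_ready(@service_pod, @container)
        terminate
      end
    end
  end
end
```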

raise "service stopped" if !@service_pod.running?
raise "service redeployed" if @service_pod.deploy_rev != service_pod.deploy_rev
raise "container recreated" if @container.id != container.id
raise "container restarted" if @container.started_at != container.started_at
Contributor Author

This isn't optimal: it doesn't trigger on the container die event, only on the restart after it, which may be delayed.

@SpComb force-pushed the fix/agent-servicepodworker-async-waitforport branch from a1672cd to a9c37aa on September 7, 2017 08:13
@SpComb changed the base branch from fix/agent-servicepodworker-container-events to master on September 7, 2017 08:13
elsif service_pod.running? && service_pod.wait_for_port
# delay sync_state_to_master until started
# XXX: apply() gets called twice for each deploy_rev, and this launches two wait_for_port tasks...
async.wait_for_port(service_pod, @container)
Contributor

I don't like that wait_for_port is such a special case here.. could it return state (container) that is reported normally back to master (not inside wait_for_port)?

Contributor

Or maybe this could be a separate class/actor that checks pod readiness.

@SpComb (Contributor Author) Sep 8, 2017

WIP on a health check actor managed by the service pod worker that tracks the service container ready/healthy state, where the server deploy waits for the service instance to go initialized -> starting -> running...

Contributor Author

> I don't like that wait_for_port is such a special case here.. could it return state (container) that is reported normally back to master (not inside wait_for_port)?

It has to be an async call if it's in the exclusive block, and I'm hesitant to break apart the exclusive { ... sync_state_to_master } block because the way the @service_pod works is rather unclear. It has to avoid calling sync_state_to_master if the @service_pod has changed, because that would report the status of the old service rev with the newer deploy rev.

Maybe it would be better to use the same non-exclusive if @service_pod.deploy_rev == service_pod.deploy_rev check in all cases, though, both with and without wait_for_port.
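
A sketch of that uniform guard, with an illustrative report payload helper and RPC path:

```ruby
def sync_state_to_master(service_pod, container)
  # Drop stale reports: if a later update() has swapped in a @service_pod with
  # a newer deploy_rev, this task's view of the instance is out of date.
  return unless service_pod.deploy_rev == @service_pod.deploy_rev

  rpc_client.async.request('/node_service_pods/set_state', [state_report(service_pod, container)])
end
```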

Contributor Author

Changed to move the sync_state_to_master out of the exclusive block.

expect(subject.wrapped_object).to receive(:ensure_desired_state).once.and_return(restarted_container)
expect(subject.wrapped_object).to receive(:wait_for_port)
# XXX: in this case both the failing initial wait_for_port, and the restart will report state...
expect(subject.wrapped_object).to receive(:sync_state_to_master).with(service_pod, restarted_container)
@SpComb (Contributor Author) Sep 12, 2017

If the restarted container completes wait_for_port immediately without sleeping, then it might report state: 'running' right away. The initial apply -> wait_for_port task will then race with that to report error: "container restarted", because the @service_pod.deploy_rev is still the same.

The deploy may either succeed because the wait_for_port passed for the restarted container, or fail because the initial container start didn't. I suppose both of those outcomes are correct?

@jakolehm (Contributor) left a comment

LGTM, we should probably include this in 1.4.

@@ -76,6 +76,11 @@ def suspiciously_dead?
false
end

# @return DatetTime
Contributor

DateTime

@SpComb added this to the 1.4.0 milestone on Sep 13, 2017
@SpComb (Contributor Author) commented Sep 13, 2017

Let's merge and see if the double-apply ends up causing any issues. I suspect the fix for that is just getting rid of the @container_state_changed entirely.
