
Fix agent ServicePodWorker to not block on wait_for_port #2776

Merged
merged 8 commits into master from fix/agent-servicepodworker-async-waitforport on Sep 13, 2017

Conversation

@SpComb (Contributor) commented Sep 6, 2017

Fixes #2775 by having the deploy fail if the container restarts during wait_for_port
Fixes #2415 by allowing the ServicePodWorker actor to terminate during wait_for_port
Maybe fixes #2625?
Mitigates #2710 by allowing deploys pending on wait_for_port to be aborted by stopping the service

Splits apart the apply exclusive { ensure_desired_state; sync_state_to_master } block, with an optional wait_for_port in between. Because the wait_for_port -> wait_until! -> sleep chain runs as a suspendable task inside the actor, the actor still sees later update calls while waiting, and the check_starting! run from the wait_until! loop picks those up to cancel the wait.

Needs extra logic to protect sync_state_to_master against sending stale state from a suspended apply task, after a new update state has been applied and sent.
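
A minimal sketch of the resulting apply flow, assuming the wait_until!/check_starting!/port_open? helper names and signatures (an illustration of the description above, not the literal diff):

```ruby
def apply
  exclusive {
    @container = ensure_desired_state
  }

  # Capture the state this task was started for, so later update() calls that
  # swap @service_pod can be detected below.
  service_pod = @service_pod
  container = @container

  if service_pod.running? && service_pod.wait_for_port
    # Not exclusive: the actor keeps processing update() calls while this task
    # sleeps; check_starting! raises if the pod was stopped/redeployed or the
    # container restarted, which cancels the wait and fails the deploy.
    wait_until!("port #{service_pod.wait_for_port} is open", timeout: 300, interval: 1.0) {
      check_starting!(service_pod, container)
      port_open?(container, service_pod.wait_for_port)
    }
  end

  # Guard against a suspended task reporting stale state for an old deploy_rev.
  sync_state_to_master(service_pod, container) if service_pod.deploy_rev == @service_pod.deploy_rev
end
```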

Workaround for cancelling a deploy stuck on wait_for_port

This allows a service deploy stuck on a broken wait_for_port to be interrupted with kontena service stop, which immediately fails the ongoing deploy: the agent syncs the service_pod state with the current deploy_rev and state: stopped. There is still a race here with the wait_for_port task syncing its own error state:

I, [2017-09-06T16:03:37.567719 #18]  INFO -- GridServiceInstanceDeployer: waited 13.7s of 300.0s until: service development/null/redis-waitforport-1 is running on node development/core-01 at 2017-09-06 16:03:23 UTC yielded GridServiceInstance
W, [2017-09-06T16:03:37.568076 #18]  WARN -- GridServiceInstanceDeployer: Failed to deploy service instance development/null/redis-waitforport-1 to node core-01: GridServiceInstanceDeployer::StateError: Service instance is not running, but stopped
W, [2017-09-06T16:03:37.600901 #1]  WARN -- Kontena::Workers::ServicePodWorker: wait_for_port failed: service stopped

TODO

  • fix the duplicate async.wait_for_port tasks?
  • specs
  • more specs?

elsif service_pod.running? && service_pod.wait_for_port
# delay sync_state_to_master until started
# XXX: apply() gets called twice for each deploy_rev, and this launches two wait_for_port tasks...
async.wait_for_port(service_pod, @container)
Contributor Author

This is a side effect of the #2766 container create/start events setting @container_state_changed = true and thus triggering two calls to apply for each service_pod.deploy_rev:

Celluloid::Actor 0x1de478ea90
Celluloid::Cell 0x1de478ecac: Kontena::Workers::ServicePodWorker
State: Running (executing tasks)
	/usr/lib/ruby/gems/2.4.0/gems/celluloid-0.17.3/lib/celluloid/mailbox.rb:63:in `sleep'
	/usr/lib/ruby/gems/2.4.0/gems/celluloid-0.17.3/lib/celluloid/mailbox.rb:63:in `wait'
	/usr/lib/ruby/gems/2.4.0/gems/celluloid-0.17.3/lib/celluloid/mailbox.rb:63:in `block in check'
	/usr/lib/ruby/gems/2.4.0/gems/timers-4.1.1/lib/timers/wait.rb:33:in `while_time_remaining'
	/usr/lib/ruby/gems/2.4.0/gems/timers-4.1.1/lib/timers/wait.rb:11:in `for'
	/usr/lib/ruby/gems/2.4.0/gems/celluloid-0.17.3/lib/celluloid/mailbox.rb:58:in `check'
	/usr/lib/ruby/gems/2.4.0/gems/celluloid-0.17.3/lib/celluloid/actor.rb:155:in `block in run'
	/usr/lib/ruby/gems/2.4.0/gems/timers-4.1.1/lib/timers/group.rb:66:in `wait'
	/usr/lib/ruby/gems/2.4.0/gems/celluloid-0.17.3/lib/celluloid/actor.rb:152:in `run'
	/usr/lib/ruby/gems/2.4.0/gems/celluloid-0.17.3/lib/celluloid/actor.rb:131:in `block in start'
	/usr/lib/ruby/gems/2.4.0/gems/celluloid-essentials-0.20.5/lib/celluloid/internals/thread_handle.rb:14:in `block in initialize'
	/usr/lib/ruby/gems/2.4.0/gems/celluloid-0.17.3/lib/celluloid/actor/system.rb:78:in `block in get_thread'
	/usr/lib/ruby/gems/2.4.0/gems/celluloid-0.17.3/lib/celluloid/group/spawner.rb:50:in `block in instantiate'


	Tasks:
	  1) Celluloid::Task::Fibered[call]: sleeping
	      {:dangerous_suspend=>false, :method_name=>:wait_for_port}
		Celluloid::Task::Fibered backtrace unavailable. Please try `Celluloid.task_class = Celluloid::Task::Threaded` if you need backtraces here.


	  2) Celluloid::Task::Fibered[call]: sleeping
	      {:dangerous_suspend=>false, :method_name=>:wait_for_port}
		Celluloid::Task::Fibered backtrace unavailable. Please try `Celluloid.task_class = Celluloid::Task::Threaded` if you need backtraces here.

Very annoying to try and fix...
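
For context, a rough sketch of why each deploy_rev ends up with two apply calls and two wait_for_port tasks (the handler and loop names here are hypothetical, this is not the #2766 code verbatim):

```ruby
# Docker emits separate 'create' and 'start' events for a new container; each
# event marks the worker as dirty, so two consecutive passes of the apply loop
# both see the same service_pod.deploy_rev and each one launches its own
# async.wait_for_port task.
def on_container_event(event)
  @container_state_changed = true
end

def apply_loop
  loop do
    apply if @container_state_changed || service_pod_changed?
    @container_state_changed = false
    sleep 1
  end
end
```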

# Only terminate this actor after we have successfully ensure_terminated the Docker container
# Otherwise, stick around... the manager will notice we're still there and re-signal to destroy
self.terminate if service_pod.terminated?
self.terminate
Contributor Author

Subtle change here: this no longer calls sync_state_to_master after the call to destroy. That's probably a good thing; the server no longer cares about the instance at that point, and the call would always log an error when terminating service pods:

E, [2017-09-06T13:25:41.627773 #8] ERROR -- RpcServer: RuntimeError: Instance not found
D, [2017-09-06T13:25:41.627884 #8] DEBUG -- RpcServer: /app/app/services/rpc/node_service_pod_handler.rb:48:in `set_state'
/app/app/services/rpc_server.rb:63:in `handle_request'
/app/app/services/rpc_server.rb:47:in `process!'
/usr/lib/ruby/gems/2.4.0/gems/celluloid-0.17.3/lib/celluloid/calls.rb:28:in `public_send'
/usr/lib/ruby/gems/2.4.0/gems/celluloid-0.17.3/lib/celluloid/calls.rb:28:in `dispatch'
/usr/lib/ruby/gems/2.4.0/gems/celluloid-0.17.3/lib/celluloid/call/async.rb:7:in `dispatch'
/usr/lib/ruby/gems/2.4.0/gems/celluloid-0.17.3/lib/celluloid/cell.rb:50:in `block in dispatch'
/usr/lib/ruby/gems/2.4.0/gems/celluloid-0.17.3/lib/celluloid/cell.rb:76:in `block in task'
/usr/lib/ruby/gems/2.4.0/gems/celluloid-0.17.3/lib/celluloid/actor.rb:339:in `block in task'
/usr/lib/ruby/gems/2.4.0/gems/celluloid-0.17.3/lib/celluloid/task.rb:44:in `block in initialize'
/usr/lib/ruby/gems/2.4.0/gems/celluloid-0.17.3/lib/celluloid/task/fibered.rb:14:in `block in create'

@SpComb (Contributor Author) commented Sep 6, 2017

This is the minimal fix. I think a more complete fix would be to delegate the wait_for_port checking to a ContainerHealthCheckWorker actor supervised by the ServicePodWorker, and thus also allow waiting on any service health checks for the rolling deploys. It would also be easier to properly deal with the various state changes as an actor, rather than in the async task wait_until! loop.
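
A very rough sketch of that direction, with a supervised per-container checker actor instead of the current in-task wait loop (all names here are hypothetical, including container_ready? and on_container_ready):

```ruby
class ContainerHealthCheckWorker
  include Celluloid

  # parent is the supervising ServicePodWorker; it is notified once the
  # container becomes ready, and can then sync the service instance through
  # initialized -> starting -> running to the master.
  def initialize(service_pod, container, parent)
    @service_pod = service_pod
    @container = container
    @parent = parent
    async.run
  end

  def run
    every(1.0) do
      # container_ready? stands in for the port wait / health check logic
      if container_ready?(@container, @service_pod)
        @parent.async.on_container_ready(@service_pod, @container)
        terminate
      end
    end
  end
end
```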

raise "service stopped" if !@service_pod.running?
raise "service redeployed" if @service_pod.deploy_rev != service_pod.deploy_rev
raise "container recreated" if @container.id != container.id
raise "container restarted" if @container.started_at != container.started_at
Contributor Author

This isn't optimal: it doesn't trigger on the container die event, only on the restart after it, which may be delayed.

@SpComb force-pushed the fix/agent-servicepodworker-async-waitforport branch from a1672cd to a9c37aa on September 7, 2017 08:13
@SpComb changed the base branch from fix/agent-servicepodworker-container-events to master on September 7, 2017 08:13
elsif service_pod.running? && service_pod.wait_for_port
# delay sync_state_to_master until started
# XXX: apply() gets called twice for each deploy_rev, and this launches two wait_for_port tasks...
async.wait_for_port(service_pod, @container)
Contributor

I don't like that wait_for_port is such a special case here.. could it return state (container) that is reported normally back to master (not inside wait_for_port)?

Contributor

Or maybe this could be a separate class/actor that checks pod readiness.

@SpComb (Contributor Author) Sep 8, 2017

WIP on a health check actor managed by the service pod worker that tracks the service container ready/healthy state, where the server deploy waits for the service instance to go initialized -> starting -> running...

Contributor Author

> I don't like that wait_for_port is such a special case here.. could it return state (container) that is reported normally back to master (not inside wait_for_port)?

It has to be an async call if it's in the exclusive block, and I'm hesitant to break apart the exclusive { ... sync_state_to_master } block because the way the @service_pod works is rather unclear. It has to avoid calling sync_state_to_master if the @service_pod has changed, because that would report the status of the old service rev with the newer deploy rev.

Maybe it would be better to use the same non-exclusive if @service_pod.deploy_rev == service_pod.deploy_rev check in all cases, though, both with and without wait_for_port.
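
A sketch of that uniform guard, with an illustrative report payload helper and RPC path:

```ruby
def sync_state_to_master(service_pod, container)
  # Drop stale reports: if a later update() has swapped in a @service_pod with
  # a newer deploy_rev, this task's view of the instance is out of date.
  return unless service_pod.deploy_rev == @service_pod.deploy_rev

  rpc_client.async.request('/node_service_pods/set_state', [state_report(service_pod, container)])
end
```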

Contributor Author

Changed to move the sync_state_to_master out of the exclusive block.

expect(subject.wrapped_object).to receive(:ensure_desired_state).once.and_return(restarted_container)
expect(subject.wrapped_object).to receive(:wait_for_port)
# XXX: in this case both the failing initial wait_for_port, and the restart will report state...
expect(subject.wrapped_object).to receive(:sync_state_to_master).with(service_pod, restarted_container)
@SpComb (Contributor Author) Sep 12, 2017

If the restarted container completes wait_for_port immediately without sleeping, then it might report state: 'running' right away. The initial apply -> wait_for_port task will then race with that to report error: "container restarted", because the @service_pod.deploy_rev is still the same.

The deploy may either succeed because the wait_for_port passed for the restarted container, or fail because the initial container start didn't. I suppose both of those outcomes are correct?

@jakolehm (Contributor) left a comment

LGTM, we should probably include this in 1.4.

@@ -76,6 +76,11 @@ def suspiciously_dead?
false
end

# @return DatetTime
Contributor

DateTime

@SpComb added this to the 1.4.0 milestone on Sep 13, 2017
@SpComb (Contributor Author) commented Sep 13, 2017

Let's merge and see if the double-apply ends up causing any issues. I suspect the fix for that is just getting rid of the @container_state_changed entirely.
