Fix agent ServicePodWorker to not block on wait_for_port #2776
Conversation
```ruby
elsif service_pod.running? && service_pod.wait_for_port
  # delay sync_state_to_master until started
  # XXX: apply() gets called twice for each deploy_rev, and this launches two wait_for_port tasks...
  async.wait_for_port(service_pod, @container)
```
This is a side effect of the #2766 container create/start events setting `@container_state_changed = true`, and thus triggering two calls to `apply` for each `service_pod.deploy_rev`:
```
Celluloid::Actor 0x1de478ea90
Celluloid::Cell 0x1de478ecac: Kontena::Workers::ServicePodWorker
State: Running (executing tasks)
/usr/lib/ruby/gems/2.4.0/gems/celluloid-0.17.3/lib/celluloid/mailbox.rb:63:in `sleep'
/usr/lib/ruby/gems/2.4.0/gems/celluloid-0.17.3/lib/celluloid/mailbox.rb:63:in `wait'
/usr/lib/ruby/gems/2.4.0/gems/celluloid-0.17.3/lib/celluloid/mailbox.rb:63:in `block in check'
/usr/lib/ruby/gems/2.4.0/gems/timers-4.1.1/lib/timers/wait.rb:33:in `while_time_remaining'
/usr/lib/ruby/gems/2.4.0/gems/timers-4.1.1/lib/timers/wait.rb:11:in `for'
/usr/lib/ruby/gems/2.4.0/gems/celluloid-0.17.3/lib/celluloid/mailbox.rb:58:in `check'
/usr/lib/ruby/gems/2.4.0/gems/celluloid-0.17.3/lib/celluloid/actor.rb:155:in `block in run'
/usr/lib/ruby/gems/2.4.0/gems/timers-4.1.1/lib/timers/group.rb:66:in `wait'
/usr/lib/ruby/gems/2.4.0/gems/celluloid-0.17.3/lib/celluloid/actor.rb:152:in `run'
/usr/lib/ruby/gems/2.4.0/gems/celluloid-0.17.3/lib/celluloid/actor.rb:131:in `block in start'
/usr/lib/ruby/gems/2.4.0/gems/celluloid-essentials-0.20.5/lib/celluloid/internals/thread_handle.rb:14:in `block in initialize'
/usr/lib/ruby/gems/2.4.0/gems/celluloid-0.17.3/lib/celluloid/actor/system.rb:78:in `block in get_thread'
/usr/lib/ruby/gems/2.4.0/gems/celluloid-0.17.3/lib/celluloid/group/spawner.rb:50:in `block in instantiate'
Tasks:
  1) Celluloid::Task::Fibered[call]: sleeping
     {:dangerous_suspend=>false, :method_name=>:wait_for_port}
     Celluloid::Task::Fibered backtrace unavailable. Please try `Celluloid.task_class = Celluloid::Task::Threaded` if you need backtraces here.
  2) Celluloid::Task::Fibered[call]: sleeping
     {:dangerous_suspend=>false, :method_name=>:wait_for_port}
     Celluloid::Task::Fibered backtrace unavailable. Please try `Celluloid.task_class = Celluloid::Task::Threaded` if you need backtraces here.
```
Very annoying to try and fix...
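Not something this PR changes, but a hypothetical way to avoid launching two `wait_for_port` tasks for the same revision would be to remember which `deploy_rev` already has a wait in flight; the `maybe_wait_for_port` method and `@waiting_for_port_rev` ivar below are made up for illustration:

```ruby
# Hypothetical sketch only: dedupe the async wait_for_port tasks by tracking which
# deploy_rev is already being waited on, so the second apply() for the same
# revision does not launch a second task.
def maybe_wait_for_port(service_pod)
  return unless service_pod.running? && service_pod.wait_for_port
  return if @waiting_for_port_rev == service_pod.deploy_rev

  @waiting_for_port_rev = service_pod.deploy_rev
  async.wait_for_port(service_pod, @container)
end
```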
```ruby
# Only terminate this actor after we have successfully ensure_terminated the Docker container
# Otherwise, stick around... the manager will notice we're still there and re-signal to destroy
self.terminate if service_pod.terminated?
self.terminate
```
Subtle change here: this no longer calls `sync_state_to_master` after the call to `destroy`. That's probably a good thing, the server won't care anymore, and it would always log an error when terminating service pods:
```
E, [2017-09-06T13:25:41.627773 #8] ERROR -- RpcServer: RuntimeError: Instance not found
D, [2017-09-06T13:25:41.627884 #8] DEBUG -- RpcServer: /app/app/services/rpc/node_service_pod_handler.rb:48:in `set_state'
/app/app/services/rpc_server.rb:63:in `handle_request'
/app/app/services/rpc_server.rb:47:in `process!'
/usr/lib/ruby/gems/2.4.0/gems/celluloid-0.17.3/lib/celluloid/calls.rb:28:in `public_send'
/usr/lib/ruby/gems/2.4.0/gems/celluloid-0.17.3/lib/celluloid/calls.rb:28:in `dispatch'
/usr/lib/ruby/gems/2.4.0/gems/celluloid-0.17.3/lib/celluloid/call/async.rb:7:in `dispatch'
/usr/lib/ruby/gems/2.4.0/gems/celluloid-0.17.3/lib/celluloid/cell.rb:50:in `block in dispatch'
/usr/lib/ruby/gems/2.4.0/gems/celluloid-0.17.3/lib/celluloid/cell.rb:76:in `block in task'
/usr/lib/ruby/gems/2.4.0/gems/celluloid-0.17.3/lib/celluloid/actor.rb:339:in `block in task'
/usr/lib/ruby/gems/2.4.0/gems/celluloid-0.17.3/lib/celluloid/task.rb:44:in `block in initialize'
/usr/lib/ruby/gems/2.4.0/gems/celluloid-0.17.3/lib/celluloid/task/fibered.rb:14:in `block in create'
```
This is the minimal fix. I think a more complete fix would be to delegate the…
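A rough sketch of what the destroy path described above could look like; `ensure_terminated` comes from the quoted comment, but the exact structure is an assumption, not the merged code:

```ruby
# Sketch: only terminate the actor once the Docker container has actually been
# removed, and skip the sync_state_to_master for the destroyed instance (the
# master would just log "Instance not found" for it).
def destroy(service_pod)
  ensure_terminated(service_pod) # stop and remove the Docker container
  # If the container is still around we stay alive, so the manager notices us
  # and re-signals the destroy on its next pass.
  self.terminate if service_pod.terminated?
end
```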
raise "service stopped" if !@service_pod.running? | ||
raise "service redeployed" if @service_pod.deploy_rev != service_pod.deploy_rev | ||
raise "container recreated" if @container.id != container.id | ||
raise "container restarted" if @container.started_at != container.started_at |
This isn't optimal: it doesn't trigger on the container `die` event; it triggers on the restart after that, which may be delayed.
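If one wanted to react to the `die` event itself, a purely hypothetical approach could be a Celluloid::Notifications subscription; the `container:event` topic name and the event shape here are assumptions, not the agent's actual API:

```ruby
# Hypothetical sketch: notice the container dying as soon as the 'die' event
# fires, instead of only once the restarted container's started_at changes.
class DieWatcher
  include Celluloid
  include Celluloid::Notifications

  def initialize(container_id)
    @container_id = container_id
    @died = false
    subscribe('container:event', :on_container_event) # topic name is an assumption
  end

  def on_container_event(_topic, event)
    @died = true if event.status == 'die' && event.id == @container_id
  end

  def died?
    @died
  end
end
```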
```ruby
elsif service_pod.running? && service_pod.wait_for_port
  # delay sync_state_to_master until started
  # XXX: apply() gets called twice for each deploy_rev, and this launches two wait_for_port tasks...
  async.wait_for_port(service_pod, @container)
```
I don't like that `wait_for_port` is such a special case here... could it return state (container) that is reported normally back to master (not inside `wait_for_port`)?
Or maybe this could be a separate class/actor that checks pod readiness.
WIP on a health check actor managed by the service pod worker that tracks the service container ready/healthy state, where the server deploy waits for the service instance to go `initialized` -> `starting` -> `running` ...
> I don't like that `wait_for_port` is such a special case here... could it return state (container) that is reported normally back to master (not inside `wait_for_port`)?
It has to be an `async` call if it's in the `exclusive` block, and I'm hesitant to break apart the `exclusive { ... sync_state_to_master }` block, because the way the `@service_pod` works is rather unclear. It has to avoid calling `sync_state_to_master` if the `@service_pod` has changed, because that would report the status of the old service rev with the newer deploy rev.

Maybe it would be better to use the same non-exclusive `if @service_pod.deploy_rev == service_pod.deploy_rev` check in all cases, though, both with and without `wait_for_port`.
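For context, a rough sketch of that non-exclusive shape, using the method names from this PR; the exact structure is an assumption rather than the merged implementation:

```ruby
# Sketch: keep ensure_desired_state exclusive, but do the optional wait_for_port
# and the sync back to master outside the exclusive block, guarded by a
# deploy_rev comparison so a suspended apply can't report stale state after a
# newer update has replaced @service_pod.
def apply
  service_pod = @service_pod # snapshot of the revision this apply is working on

  exclusive {
    @container = ensure_desired_state
  }

  if service_pod.running? && service_pod.wait_for_port
    wait_for_port(service_pod, @container) # may sleep; the actor still sees update calls
  end

  # skip the sync if a newer deploy_rev was applied while this task was suspended
  sync_state_to_master(service_pod, @container) if @service_pod.deploy_rev == service_pod.deploy_rev
end
```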
Changed to move the `sync_state_to_master` out of the `exclusive` block.
```ruby
expect(subject.wrapped_object).to receive(:ensure_desired_state).once.and_return(restarted_container)
expect(subject.wrapped_object).to receive(:wait_for_port)
# XXX: in this case both the failing initial wait_for_port, and the restart will report state...
expect(subject.wrapped_object).to receive(:sync_state_to_master).with(service_pod, restarted_container)
```
If the restarted container completes `wait_for_port` immediately without sleeping, then it might report `state: 'running'` right away. The initial `apply` -> `wait_for_port` task will then race with that to report `error: "container restarted"`, because the `@service_pod.deploy_rev` is still the same.

The deploy may either succeed because the `wait_for_port` passed for the restarted container, or fail because the initial container start didn't. I suppose both of those outcomes are correct?
LGTM, we probably should include this in 1.4.
agent/lib/docker/container.rb (outdated diff)

```
@@ -76,6 +76,11 @@ def suspiciously_dead?
    false
  end

  # @return DatetTime
```
DateTime
Let's merge and see if the double-apply ends up causing any issues. I suspect the fix for that is just getting rid of the…
* Fixes #2775 by having the deploy fail if the container restarts during `wait_for_port`
* Fixes #2415 by allowing the `ServicePodWorker` actor to terminate during `wait_for_port`
* Maybe fixes #2625?
* Mitigates #2710 by allowing deploys pending on `wait_for_port` to be aborted by stopping the service

Splits apart the `apply` `exclusive { ensure_desired_state; sync_state_to_master }` block, with an optional `wait_for_port` in between. The `wait_for_port` -> `wait_until!` -> `sleep` in the actor class allows the actor to see later `update` calls, and the `wait_for_port` `wait_until!` -> `check_starting!` can pick these up to cancel the wait (see the sketch after this description).

Needs extra logic to protect `sync_state_to_master` against sending stale state from a suspended `apply` task, after a new `update` state has been applied and sent.

Workaround for cancelling a deploy stuck on `wait_for_port`

This allows interrupting a service deploy with a broken `wait_for_port` by using `kontena service stop`, which will immediately fail the ongoing deploy, as the agent will sync the `service_pod` state with the current `deploy_rev` and `state: stopped` - although there's also a race here with the `wait_for_port` task syncing its `error` state:

TODO

* `async.wait_for_port`?
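A minimal sketch of the `wait_for_port` -> `wait_until!` -> `check_starting!` flow described above, using the method names from this PR but with assumed signatures; `port_open?` is a hypothetical helper, not the agent's actual API:

```ruby
# Sketch: wait_until! sleeps inside the actor task, so the actor mailbox keeps
# processing update calls; each iteration re-checks whether this wait is still
# valid before probing the port again.
def wait_for_port(service_pod, container)
  wait_until!("port #{service_pod.wait_for_port} is listening", timeout: 300, interval: 1) do
    check_starting!(service_pod, container) # raises if stopped/redeployed/recreated/restarted
    port_open?(container, service_pod.wait_for_port)
  end
end
```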