
Service pool lock must be held and reinitialized across fork #414

Closed
jrosser opened this issue Nov 2, 2018 · 14 comments

Comments


@jrosser jrosser commented Nov 2, 2018

From an OSA AIO deploy, an async task misbehaves:

TASK [os_horizon : Compile messages for translation] *********************************************************************************************************************************
ok: [aio1_horizon_container-bfb96d25] => (item=/openstack/venvs/horizon-19.0.0.0b1/lib/python2.7/dist-packages/horizon)
ok: [aio1_horizon_container-bfb96d25] => (item=/openstack/venvs/horizon-19.0.0.0b1/lib/python2.7/dist-packages/openstack_dashboard)
ERROR! [pid 21444] 17:12:49.261466 E mitogen: Received DEL_ROUTE for 107138 from mitogen.setns.Stream(u'setns.aio1_horizon_container-bfb96d25'), expected None
ERROR! [pid 21444] 17:12:49.262085 E mitogen.ctx.setns.aio1_horizon_container-bfb96d25: mitogen: Received DEL_ROUTE for 107138 from mitogen.sudo.Stream(u'sudo.horizon'), expected mitogen.core.Stream('parent')
ERROR! [pid 21444] 17:12:49.262392 E mitogen.ctx.setns.aio1_horizon_container-bfb96d25.sudo.horizon: mitogen: Received DEL_ROUTE for 107138 from mitogen.fork.Stream(u'fork.1309'), expected mitogen.core.Stream('parent')

TASK [os_horizon : Ensure messages are compiled for translation] *********************************************************************************************************************
changed: [aio1_horizon_container-bfb96d25] => (item={'_ansible_parsed': True, '_ansible_item_result': True, '_ansible_no_log': False, u'ansible_job_id': u'171d24cb4603029a', 'failed': False, u'started': 1, u'changed': False, 'item': u'/openstack/venvs/horizon-19.0.0.0b1/lib/python2.7/dist-packages/horizon', u'finished': 0, '_ansible_ignore_errors': None, '_ansible_item_label': u'/openstack/venvs/horizon-19.0.0.0b1/lib/python2.7/dist-packages/horizon'})
FAILED - RETRYING: Ensure messages are compiled for translation (300 retries left).
FAILED - RETRYING: Ensure messages are compiled for translation (299 retries left).
FAILED - RETRYING: Ensure messages are compiled for translation (298 retries left).
FAILED - RETRYING: Ensure messages are compiled for translation (297 retries left).
FAILED - RETRYING: Ensure messages are compiled for translation (296 retries left).
FAILED - RETRYING: Ensure messages are compiled for translation (295 retries left).

It retries 300 times and then fails.

These are the tasks in question:

https://github.com/openstack/openstack-ansible-os_horizon/blob/3d60097f94b8a27eabcfd931d4f80a9d8f2b7280/tasks/horizon_post_install.yml#L110

https://github.com/openstack/openstack-ansible-os_horizon/blob/4283200534eafa444efd9bb408ddcb5c98a1d442/tasks/main.yml#L105


@dw dw commented Nov 5, 2018

Hi Johnathan!

Could you see from the target whether the process was still running? Async is quite icky because it touches the filesystem -- maybe it failed to write the job status file; still investigating.

Thanks for reporting this


@dw dw commented Nov 5, 2018

Yowzer, that's a fun one. It's creating those tasks in a loop: https://github.com/openstack/openstack-ansible-os_horizon/blob/4283200534eafa444efd9bb408ddcb5c98a1d442/tasks/horizon_post_install.yml#L124

At least that is something to write a test case for. I'm curious whether the async tasks are sensible to begin with: does compress not need to wait for collectstatic to finish?


@dw dw commented Nov 5, 2018

The error messages are a red herring (#413). I've reproduced it locally. Enabling MITOGEN_ROUTER_DEBUG=1 makes the jobs succeed. Very interesting.


@dw dw commented Nov 5, 2018

Ah! Has this only started occurring on recent masters?


@jrosser jrosser commented Nov 5, 2018

Sadly this failure was in CI, so I didn't get to investigate it directly. I'm not sure when this started occurring, as your recent changes to Mitogen have made the OSA build get quite some way further than it did before.

Also, as far as I can see from a successful run, "Ensure messages are compiled for translation" didn't have to wait very long for the async task to complete. It might be worth adding another item to your test case that finishes instantaneously, so that it is ready before the async_status task runs.

dw added a commit that referenced this issue Nov 5, 2018
dw added a commit that referenced this issue Nov 5, 2018

@dw dw commented Jan 20, 2019

Fairly sure the same issue is showing up on Travis, issue #425

dw added a commit that referenced this issue Jan 21, 2019

@dw dw commented Jan 21, 2019

MO:

  • must be a loop of async tasks like your example
  • must be over SSH + sudo, local subprocess isn't enough to tickle it (why?)
  • all tasks progress except one, which sits idle after writing its status file
  • strace shows a potential deadlock:
proc]# strace -ff -p 29858
strace: Process 29858 attached with 3 threads
[pid 29864] futex(0x55ea9be52f60, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...>
[pid 29860] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid 29858] futex(0x55ea9be52f60, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff
  • 29860 looks like the broker thread; it's asleep in poll and seemingly happy, so the lock is probably not one of the safe-to-take-in-async locks
  • the remaining two threads are blocked waiting for the same lock
  • 29858 is the main thread; it's the function call dispatcher
  • 29864 I think must be either a service pool worker (but I think there should be 8 of those), or the transient thread created to start the service pool on first request
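
For anyone hitting a similar hang, a minimal sketch of one way to see which Python-level threads are parked where (illustrative only, not Mitogen internals -- the function name and the SIGUSR1 idea are assumptions):

import signal
import sys
import threading
import traceback

def dump_thread_stacks(out=sys.stderr):
    # Map thread idents to Thread objects so the dump shows names.
    # Note: Thread.ident is not the kernel TID strace prints; on Python 3.8+
    # Thread.native_id gives the TID for correlating with strace output.
    threads_by_ident = {t.ident: t for t in threading.enumerate()}
    for ident, frame in sys._current_frames().items():
        thread = threads_by_ident.get(ident)
        name = thread.name if thread else '<unknown>'
        out.write('Thread %r (ident %r):\n' % (name, ident))
        traceback.print_stack(frame, file=out)
        out.write('\n')

# Install early, then "kill -USR1 <pid>" against the stuck process:
signal.signal(signal.SIGUSR1, lambda signum, frame: dump_thread_stacks())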

@dw dw commented Jan 21, 2019

It's the service pool initialization lock.

Sequence is:

  • controller writes a file to the target account (this is the only master->target service)
  • stub CALL_SERVICE handler registered by the bootstrap is called back on the broker thread
  • can't do anything with the message on the broker thread, because we can't deserialize there and can't use the importer there (since it needs the broker to run)
  • so fire a thread that keeps a copy of the message in its args for a safe place to import from
  • that thread calls get_or_init_pool(), which takes the lock to prevent any other such threads running
  • controller starts an async task by asking the dispatcher thread to fork
  • newly forked async task gets a snapshot of memory with the pool lock held
  • everything reinits, except the pool lock

No clue what is causing the main thread to block on it too; it doesn't matter. The problem is that the service pool lock isn't part of the cleanups done by on_fork().
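
A standalone sketch of that failure mode (plain threading and os.fork, nothing Mitogen-specific; Python 3 assumed for the acquire timeout): a lock held by another thread at the moment of fork is copied into the child still locked, and the thread that would release it does not exist there.

import os
import threading
import time

pool_lock = threading.Lock()      # stand-in for the service pool init lock

def slow_pool_init():
    with pool_lock:               # worker thread holds the lock...
        time.sleep(3)

threading.Thread(target=slow_pool_init).start()
time.sleep(0.1)                   # let the worker grab the lock

pid = os.fork()                   # ...and the fork happens while it is held
if pid == 0:
    # Child: the copied lock is locked, and its owner was not copied across
    # fork, so a blocking acquire() would hang forever. The timeout only
    # exists to make the demonstration terminate.
    print('child acquired lock:', pool_lock.acquire(timeout=2))   # -> False
    os._exit(0)
os.waitpid(pid, 0)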

There are actually 2 locks in play here: the importer lock can cause trouble too. It's already documented as "no imports off the main thread", but I broke my own rule. The solution is to take the service pool lock during fork and discard it after fork, and to record in the docs that this is done to prevent off-main-thread import lock acquisition.
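
The general shape of that fix, sketched with the standard library rather than Mitogen's real hooks (on_fork() is Mitogen's; the names and the use of os.register_at_fork below are illustrative assumptions): hold the lock across fork so no other thread is inside the critical section when the snapshot is taken, then hand the child a fresh lock.

import os
import threading

_service_pool_lock = threading.Lock()   # hypothetical module-level lock

def _before_fork():
    # Serialize fork against every user of the lock: nobody can be
    # mid-critical-section when the child's memory image is taken.
    _service_pool_lock.acquire()

def _after_fork_in_parent():
    _service_pool_lock.release()

def _after_fork_in_child():
    # The copied lock is held by the forking thread; rather than trying to
    # release the stale copy, replace it with a fresh, unlocked lock.
    global _service_pool_lock
    _service_pool_lock = threading.Lock()

# Python 3.7+ exposes hooks for exactly this pattern:
os.register_at_fork(before=_before_fork,
                    after_in_parent=_after_fork_in_parent,
                    after_in_child=_after_fork_in_child)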

@dw dw changed the title from "misbehaving async tasks" to "Service pool lock must be held and reinitialized across fork" Jan 21, 2019

@dw dw commented Jan 21, 2019

There are definitely more locks buried in the thread bootstrap machinery, so we should entirely pause the broker thread during fork too, by adding a lock it needs in order to run a loop iteration and taking that lock during fork.
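
A sketch of that idea in isolation (the class and method names are invented for illustration; Mitogen's real Broker has its own loop and API): give the loop thread a lock it must hold for each iteration, and let the forking thread take that lock to quiesce the loop around fork().

import threading

class PausableLoop(object):
    def __init__(self):
        self._iter_lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run)
        self._thread.daemon = True

    def start(self):
        self._thread.start()

    def _run(self):
        while not self._stop.is_set():
            # Each iteration needs the lock, so holding it from another
            # thread parks the loop at a safe point between iterations.
            with self._iter_lock:
                self._one_iteration()

    def _one_iteration(self):
        self._stop.wait(0.05)     # stand-in for poll() plus I/O dispatch

    def pause(self):
        # Returns only once the loop is guaranteed to be between iterations.
        self._iter_lock.acquire()

    def resume(self):
        self._iter_lock.release()

# Around fork, the forking thread would do something like:
#   loop.pause()
#   pid = os.fork()
#   if pid:
#       loop.resume()             # parent: let the broker continue
#   else:
#       pass                      # child: discard and rebuild the loop/broker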

dw added a commit that referenced this issue Jan 21, 2019

@dw dw commented Jan 21, 2019

I have opened #472 to track further fork "improvement" work. Those annoying log messages are still there; they will be addressed as part of #413.

Thanks for reporting this!

@dw dw closed this Jan 21, 2019
dw added a commit that referenced this issue Jan 21, 2019
* origin/dmw:
  issue #414: reinitialize service pool lock on fork
  issue #414: reenable test.
  core: many docstring updates and an example substitute for Channel
  core: make Receiver a self-closing context manager.
  core: make Receiver.to_sender() use Router.myself().
  docs: update Changelog.
  docs: update Changelog.

@jrosser jrosser commented Jan 22, 2019

Looks like I should try openstack-ansible again and see how far it gets - I still have a little hack required, like this: jrosser@edba3fb.

Feels like there is still something to do to integrate with the OSA inventory, where the final 'via' could be driven by the physical_host hostvar for each container.


@dw dw commented Jan 22, 2019

We can definitely look at pulling in your usability changes; there's a ticket already open for it, but I've just been swamped elsewhere. Thanks for your patience!

Note that if you're playing with master, there are some new DEL_ROUTE errors getting printed just now. It may be an apt time for me to set up a local AIO and try this out again.


@dw dw commented Jan 22, 2019

Related ticket is #403


@jrosser jrosser commented Jan 22, 2019

If you are able to have a quick look with an AIO, that would be great - it feels like there is probably a very small fix required to get the connection to the containers going.
