Service pool lock must be held and reinitialized across fork #414
Comments
Hi Johnathan! Could you see from the target whether the process was still running? Async is quite icky because it touches the filesystem -- maybe it failed to write the job status file; still investigating. Thanks for reporting this!
Yowzer, that's a fun one: it's creating those tasks in a loop.. https://github.com/openstack/openstack-ansible-os_horizon/blob/4283200534eafa444efd9bb408ddcb5c98a1d442/tasks/horizon_post_install.yml#L124 At least that is something to write a test case for. I'm curious whether the async tasks are sensible to begin with.. does compress not need to wait for collectstatic to finish?
The error messages are a red herring (#413). I've reproduced it locally. Enabling MITOGEN_ROUTER_DEBUG=1 makes the jobs succeed. Very interesting.
Ah! Has this only started occurring on recent masters?
Sadly this failure was in CI, so I didn't get to investigate it directly. I'm not sure when this started occurring, as your recent changes to Mitogen have made the OSA build get quite some way further than it did before. Also, as far as I can see from a successful run, the "Ensure messages are compiled for translation" task didn't have to wait very long for the async task to complete; it might be worth adding another item to your test case that finishes instantaneously, so that it is already complete before the async_status task runs.
Fairly sure the same issue is showing up on Travis, issue #425 |
MO:
It's the service pool initialization lock. Sequence is:
No clue what is causing the main thread to block on it too; it doesn't matter, the problem is that the service pool lock isn't part of the cleanups done by on_fork(). There are actually two locks in play here: the importer lock can cause trouble too. It's already documented as "no imports off the main thread", but I broke my own rule. The solution is to take the service pool lock during fork and discard it after fork, and to record in the docs that this is done to prevent off-main-thread import lock acquisition.
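To make the shape of that fix concrete, here is a minimal sketch of the acquire-before-fork / reinitialize-in-child pattern. This is not Mitogen's actual code: the lock name is hypothetical, and Mitogen routes the equivalent hooks through its own on_fork() machinery rather than os.register_at_fork(), which only exists on Python 3.7+.

```python
import os
import threading

# Hypothetical stand-in for the service pool initialization lock.
_service_pool_lock = threading.Lock()

def _before_fork():
    # Take the lock so no other thread can be holding it at the moment of
    # fork(); a child that inherits a held lock would block on it forever.
    _service_pool_lock.acquire()

def _after_fork_in_parent():
    # The parent keeps all of its threads, so simply release.
    _service_pool_lock.release()

def _after_fork_in_child():
    # The child has only the forking thread; discard the inherited lock
    # object entirely and start with a fresh, unlocked one.
    global _service_pool_lock
    _service_pool_lock = threading.Lock()

# Wire the hooks up (illustrative only; Mitogen uses its own on_fork() hooks).
os.register_at_fork(
    before=_before_fork,
    after_in_parent=_after_fork_in_parent,
    after_in_child=_after_fork_in_child,
)
```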
There are definitely more locks buried in the thread bootstrap machinery, so we should entirely pause the broker thread during fork too, by adding a lock it needs in order to run a loop iteration and taking that lock during fork.
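A rough sketch of that "pause the broker during fork" idea, assuming the broker runs a simple poll/dispatch loop. The class and method names below are made up for illustration and are not Mitogen's API.

```python
import threading

class Broker:
    """Illustrative broker whose loop can be quiesced around fork()."""

    def __init__(self):
        # The loop must hold this lock for each iteration, so fork code can
        # take it to guarantee the broker thread is parked at a safe point.
        self._iter_lock = threading.Lock()
        self._alive = True

    def _loop(self):
        while self._alive:
            with self._iter_lock:
                self._do_one_iteration()  # poll descriptors, dispatch I/O

    def _do_one_iteration(self):
        pass  # placeholder for the real select()/dispatch step

    def pause_for_fork(self):
        # Called in the parent just before fork(): once acquired, the broker
        # cannot be mid-iteration, so none of its internal locks are held.
        self._iter_lock.acquire()

    def resume_after_fork_in_parent(self):
        self._iter_lock.release()

    def reset_after_fork_in_child(self):
        # The broker thread does not exist in the child; recreate the lock.
        self._iter_lock = threading.Lock()
```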
I have opened #472 to track further fork "improvement" work. Those annoying log messages are still there; they will be addressed as part of #413. Thanks for reporting this!
* origin/dmw:
  issue #414: reinitialize service pool lock on fork
  issue #414: reenable test.
  core: many docstring updates and an example substitute for Channel
  core: make Receiver a self-closing context manager.
  core: make Receiver.to_sender() use Router.myself().
  docs: update Changelog.
  docs: update Changelog.
Looks like I should try openstack-ansible again and see where it gets - I still have a little hack required, like this: jrosser@edba3fb. Feels like there is still something to do to integrate with the OSA inventory, where the final 'via' could be driven by the physical_host hostvar for each container.
We can definitely look at pulling in your usability changes.. there's a ticket already open for it, I've just been swamped elsewhere. Thanks for your patience! Note that if you're playing with master, there are some new DEL_ROUTE errors getting printed just now. It may be an apt time for me to set up a local AIO and try this out again.
Related ticket is #403 |
If you are able to have a quick look with an AIO, that would be great - it feels like there is probably a very small fix required to get the connection to the containers going.
From an OSA AIO deploy, an async task....
TASK [os_horizon : Compile messages for translation] *********************************************************************************************************************************
ok: [aio1_horizon_container-bfb96d25] => (item=/openstack/venvs/horizon-19.0.0.0b1/lib/python2.7/dist-packages/horizon)
ok: [aio1_horizon_container-bfb96d25] => (item=/openstack/venvs/horizon-19.0.0.0b1/lib/python2.7/dist-packages/openstack_dashboard)
ERROR! [pid 21444] 17:12:49.261466 E mitogen: Received DEL_ROUTE for 107138 from mitogen.setns.Stream(u'setns.aio1_horizon_container-bfb96d25'), expected None
ERROR! [pid 21444] 17:12:49.262085 E mitogen.ctx.setns.aio1_horizon_container-bfb96d25: mitogen: Received DEL_ROUTE for 107138 from mitogen.sudo.Stream(u'sudo.horizon'), expected mitogen.core.Stream('parent')
ERROR! [pid 21444] 17:12:49.262392 E mitogen.ctx.setns.aio1_horizon_container-bfb96d25.sudo.horizon: mitogen: Received DEL_ROUTE for 107138 from mitogen.fork.Stream(u'fork.1309'), expected mitogen.core.Stream('parent')
TASK [os_horizon : Ensure messages are compiled for translation] *********************************************************************************************************************
changed: [aio1_horizon_container-bfb96d25] => (item={'_ansible_parsed': True, '_ansible_item_result': True, '_ansible_no_log': False, u'ansible_job_id': u'171d24cb4603029a', 'failed': False, u'started': 1, u'changed': False, 'item': u'/openstack/venvs/horizon-19.0.0.0b1/lib/python2.7/dist-packages/horizon', u'finished': 0, '_ansible_ignore_errors': None, '_ansible_item_label': u'/openstack/venvs/horizon-19.0.0.0b1/lib/python2.7/dist-packages/horizon'})
FAILED - RETRYING: Ensure messages are compiled for translation (300 retries left).
FAILED - RETRYING: Ensure messages are compiled for translation (299 retries left).
FAILED - RETRYING: Ensure messages are compiled for translation (298 retries left).
FAILED - RETRYING: Ensure messages are compiled for translation (297 retries left).
FAILED - RETRYING: Ensure messages are compiled for translation (296 retries left).
FAILED - RETRYING: Ensure messages are compiled for translation (295 retries left).
Retries 300 times and fails.
These are the tasks in question:
https://github.com/openstack/openstack-ansible-os_horizon/blob/3d60097f94b8a27eabcfd931d4f80a9d8f2b7280/tasks/horizon_post_install.yml#L110
https://github.com/openstack/openstack-ansible-os_horizon/blob/4283200534eafa444efd9bb408ddcb5c98a1d442/tasks/main.yml#L105