
fix top block locking race condition #485

Closed
wants to merge 3 commits

Conversation

pinkavaj
Contributor

  • First patch fixes the state workflow (it does not fix any testable issue, but the state was used inconsistently/wrongly)
  • Second patch is QA for the top_block race condition; unfortunately it fails with different kinds of errors
    (ranging from invalid values to segfaults), and what is more unfortunate, it sometimes passes (I'm not sure why)
  • Third patch fixes the bug tested by the previous QA

d_state is intended to track the running state of the top block.

  * initial state is IDLE
  * set state to RUNNING when start() is called
  * set state to IDLE when stop() is called
  * set state to IDLE when all work is done (requires a call to wait())
  * after unlock(), start execution (continue execution) only if state
    is RUNNING (do nothing otherwise)
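The workflow above can be sketched as a toy model. This is a hypothetical simplification, not GNU Radio's actual `top_block` code; the names `tb_state`, `top_block_model`, `wait_done()`, and `unlock_restarts()` are illustrative. The one property the first patch insists on is modeled here: every access to `d_state` happens while `d_mutex` is held.

```cpp
#include <mutex>

// Hypothetical model of the intended d_state workflow (illustrative names).
enum tb_state { IDLE, RUNNING };

class top_block_model {
public:
    top_block_model() : d_state(IDLE), d_lock_count(0) {}

    void start() {
        std::lock_guard<std::mutex> guard(d_mutex);
        d_state = RUNNING;              // start() moves the block to RUNNING
    }

    void stop() {
        std::lock_guard<std::mutex> guard(d_mutex);
        d_state = IDLE;                 // stop() returns it to IDLE
    }

    void wait_done() {                  // stands in for wait() completing
        std::lock_guard<std::mutex> guard(d_mutex);
        d_state = IDLE;                 // all work finished -> IDLE
    }

    void lock() {
        std::lock_guard<std::mutex> guard(d_mutex);
        ++d_lock_count;
    }

    // Returns true only when execution should actually resume, i.e. the
    // last lock was released while the block is in the RUNNING state.
    bool unlock_restarts() {
        std::lock_guard<std::mutex> guard(d_mutex);
        if (--d_lock_count > 0)
            return false;               // still locked by someone else
        return d_state == RUNNING;      // restart only if RUNNING
    }

    tb_state state() {
        std::lock_guard<std::mutex> guard(d_mutex);
        return d_state;
    }

private:
    std::mutex d_mutex;                 // guards every d_state access
    tb_state d_state;
    int d_lock_count;
};
```

The key point of the model: `unlock()` only restarts execution when the lock count drops to zero *and* the state is RUNNING, so an unlock after `stop()` does nothing.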
@jmcorgan
Contributor

This is on top of #426? Given your comments above, it seems this isn't really a fix so much as a guess at what might be a problem. This isn't getting merged before 3.7.6.2 or 3.7.7 releases, and after the release we can revisit the above and #426 in a more systematic and testable way.

@pinkavaj
Contributor Author

#426 really does not fix all the trouble, and this fix addresses something different from #426. After the deadlock was removed, a new race condition appeared that I had missed. Of course, check it twice before any decision is made.

@jmcorgan jmcorgan self-assigned this Apr 13, 2015
@jmcorgan jmcorgan removed the hold label Apr 13, 2015
@jmcorgan
Contributor

I'd like you to write out the issues that this proposed patch fixes, how they happen, and how these changes address them.

@jmcorgan jmcorgan added the hold label Apr 16, 2015
@pinkavaj
Contributor Author

The first patch has an extensive comment describing how d_state is expected to behave (I would appreciate any comment if my expectations are wrong) and fixes its behaviour accordingly. This patch can be cherry-picked alone; there is no QA attached to it. What the comment does not mention is that it also fixes a case where d_state is changed while d_mutex is unlocked. This is wrong because d_state can be modified by multiple threads concurrently, which results in undefined behaviour. It should be noted that the probability of this race condition is low, almost negligible, but in my opinion it is still wrong.

The second patch is the QA. It fails in a more or less undefined manner because it tests a case where a race condition occurs and its propagation depends on some hidden state; sometimes it does not fail, just because of some lucky coincidence (does the QA work for you?).

The third patch fixes the race condition tested by the QA in the previous patch. Care should be taken here because I did not draw a flow graph for all possible states. This patch fixes a single case of the race condition, which can be described this way:

  • some thread calls wait()
  • some other thread calls lock(), does something, and then calls unlock() (d_lock_count == 0)
  • unlock() calls restart(), which calls wait_for_jobs()
    Now suppose the thread that called wait() wakes up: it returns from wait_for_jobs() (because we are locked and all jobs are going to stop) and is halted waiting on the d_mutex lock. Then another context switch occurs, restart() continues executing and starts new jobs, and then unlock() finishes. This unlocks d_mutex but sets d_lock_count = 0. Now jobs are running, but wait() exits because no lock is held, and there is no check for whether new jobs were started after wait_for_jobs() finished but before the lock on d_mutex was acquired.
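The interleaving above can be replayed deterministically in a toy model. This is a hypothetical sketch of the described bug, not the actual top_block code; `buggy_model`, `unlock_and_restart()`, and `wait_returns()` are illustrative names standing in for the effects of the real calls.

```cpp
// Hypothetical replay of the race: wait() only checks d_lock_count after
// wait_for_jobs() returns, never whether new jobs were started meanwhile.
struct buggy_model {
    int d_lock_count = 0;
    bool d_jobs_running = true;

    // What happens at the end of unlock(): restart() starts new jobs,
    // then unlock() resets the lock count.
    void unlock_and_restart() {
        d_jobs_running = true;   // restart() starts new jobs
        d_lock_count = 0;        // unlock() finishes
    }

    // The buggy tail of wait(): after wait_for_jobs() returns and d_mutex
    // is reacquired, it decides to exit based on d_lock_count alone.
    bool wait_returns() const {
        return d_lock_count == 0;  // BUG: ignores d_jobs_running
    }
};
```

Replaying the scenario: with `d_lock_count == 1` and jobs stopping, `wait_for_jobs()` returns in the waiting thread; a context switch lets `unlock_and_restart()` run first; then `wait_returns()` is true even though jobs are running again, which is exactly the reported race.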

The fix adds a new state flag which is modified while d_mutex is locked and indicates whether the job state should be checked again.

I hope the description above is understandable. I admit there should be a test for each possible workflow, or at least each possible workflow should be checked for race conditions. I did neither of those, but if you insist, I can try to write a mathematical proof of whether certain race conditions can or cannot happen.
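A sketch of the fix idea under the same toy-model assumptions (hypothetical names, not the actual patch): here the new state is modeled as a sequence counter, `d_restart_seq`, bumped under d_mutex by restart(), so wait() can detect a restart that slipped in between wait_for_jobs() returning and d_mutex being reacquired.

```cpp
#include <mutex>

// Hypothetical sketch of the fix: restart() records, while d_mutex is
// held, that jobs were (re)started; wait() re-checks instead of exiting
// when it sees that record change. Names are illustrative only.
class waiter_model {
public:
    waiter_model() : d_restart_seq(0), d_jobs_running(false) {}

    void restart() {
        std::lock_guard<std::mutex> guard(d_mutex);
        d_jobs_running = true;
        ++d_restart_seq;               // recorded while d_mutex is held
    }

    void jobs_finished() {
        std::lock_guard<std::mutex> guard(d_mutex);
        d_jobs_running = false;
    }

    // One iteration of the wait() loop: returns true when it is safe to
    // return from wait(), false when a restart slipped in and the job
    // state must be checked again.
    bool wait_iteration(unsigned &seen_seq) {
        // ... wait_for_jobs() would block here, without d_mutex held ...
        std::lock_guard<std::mutex> guard(d_mutex);
        if (d_restart_seq != seen_seq) {
            seen_seq = d_restart_seq;  // a restart happened meanwhile:
            return false;              // re-check instead of returning
        }
        return !d_jobs_running;
    }

private:
    std::mutex d_mutex;
    unsigned d_restart_seq;
    bool d_jobs_running;
};
```

With this shape, the interleaving from the previous comment no longer lets wait() exit early: even if restart() runs between wait_for_jobs() returning and the mutex being reacquired, the changed counter forces another iteration.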

@pinkavaj
Contributor Author

Any progress here? Do you need more info or anything else?

@jmcorgan
Contributor

We're going to be doing a systematic review of this area of code before making further changes to it.

@jmcorgan
Contributor

jmcorgan commented Dec 8, 2015

We're going to be making stable release 3.7.9 soon. After that, I'd like to walk through with you where you see issues here, what fixes you are proposing, and how to implement tests that pass/fail on these fixes. In the meantime, I am closing this PR.
