
fix top block locking race condition #485

Closed
wants to merge 3 commits

Conversation

pinkavaj
Contributor

  • First patch fixes the state workflow (it does not fix any testable issue, but the state was used inconsistently/wrongly)
  • Second patch is QA for the top_block race condition; unfortunately it fails with different kinds of errors
    (ranging from invalid values to segfaults), and what is more unfortunate, it sometimes passes (I'm not sure why)
  • Third patch fixes the bug tested by the previous QA

d_state is intended to track the running state of the top block.

  * initial state is IDLE
  * set state to RUNNING when start() is called
  * set state to IDLE when stop() is called
  * set state to IDLE when all work is done (requires a call to wait())
  * after unlock(), start execution (continue execution) only if state
    is RUNNING (do nothing otherwise)
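The workflow above can be sketched as a toy model. This is a hypothetical simplification, not GNU Radio's actual `top_block` code; the names `tb_state`, `top_block_model`, `wait_done()`, and `unlock_restarts()` are illustrative. The one property the first patch insists on is modeled here: every access to `d_state` happens while `d_mutex` is held.

```cpp
#include <mutex>

// Hypothetical model of the intended d_state workflow (illustrative names).
enum tb_state { IDLE, RUNNING };

class top_block_model {
public:
    top_block_model() : d_state(IDLE), d_lock_count(0) {}

    void start() {
        std::lock_guard<std::mutex> guard(d_mutex);
        d_state = RUNNING;              // start() moves the block to RUNNING
    }

    void stop() {
        std::lock_guard<std::mutex> guard(d_mutex);
        d_state = IDLE;                 // stop() returns it to IDLE
    }

    void wait_done() {                  // stands in for wait() completing
        std::lock_guard<std::mutex> guard(d_mutex);
        d_state = IDLE;                 // all work finished -> IDLE
    }

    void lock() {
        std::lock_guard<std::mutex> guard(d_mutex);
        ++d_lock_count;
    }

    // Returns true only when execution should actually resume, i.e. the
    // last lock was released while the block is in the RUNNING state.
    bool unlock_restarts() {
        std::lock_guard<std::mutex> guard(d_mutex);
        if (--d_lock_count > 0)
            return false;               // still locked by someone else
        return d_state == RUNNING;      // restart only if RUNNING
    }

    tb_state state() {
        std::lock_guard<std::mutex> guard(d_mutex);
        return d_state;
    }

private:
    std::mutex d_mutex;                 // guards every d_state access
    tb_state d_state;
    int d_lock_count;
};
```

The key point of the model: `unlock()` only restarts execution when the lock count drops to zero *and* the state is RUNNING, so an unlock after `stop()` does nothing.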
@jmcorgan
Contributor

This is on top of #426? Given your comments above, it seems this isn't really a fix so much as a guess at what might be a problem. This isn't getting merged before 3.7.6.2 or 3.7.7 releases, and after the release we can revisit the above and #426 in a more systematic and testable way.

@pinkavaj
Contributor Author

#426 really does not fix all the trouble, and this fix addresses something different from #426. After the deadlock was removed, a new race condition appeared that I had missed. Of course, check it twice before any decision is made.

@jmcorgan jmcorgan self-assigned this Apr 13, 2015
@jmcorgan jmcorgan removed the hold label Apr 13, 2015
@jmcorgan
Contributor

I'd like you to write out the issues that this proposed patch fixes, how they happen, and how these changes address them.

@jmcorgan jmcorgan added the hold label Apr 16, 2015
@pinkavaj
Contributor Author

The first patch has an extensive comment describing how d_state is expected to behave (I would appreciate any comment if my expectations are wrong) and fixes its behaviour accordingly. This patch can be cherry-picked alone; there is no QA attached to it. What the comment does not mention is that it also fixes a case where d_state is changed while d_mutex is unlocked. This is wrong because d_state can be modified by multiple threads concurrently, which results in undefined behaviour. It should be noted that the probability of this race condition is low, almost negligible, but in my opinion it is still wrong.

The second patch is the QA. It fails in a more or less undefined manner because it tests a case where a race condition occurs and its propagation depends on some hidden state; sometimes it does not fail, just because of some lucky coincidence (does the QA work for you?).

The third patch fixes the race condition tested by the QA in the previous patch. Care should be taken here because I did not draw a flow graph for all possible states. This patch fixes a single case of the race condition, which can be described this way:

  • some thread calls wait()
  • some other thread calls lock(), does something, and then calls unlock() (d_lock_count == 0)
  • unlock() calls restart(), which calls wait_for_jobs()
    Now suppose the thread that called wait() wakes up: it returns from wait_for_jobs() (because we are locked and all jobs are going to stop) and is halted waiting on the d_mutex lock. Then another context switch occurs, restart() continues executing and starts new jobs, and then unlock() finishes. This unlocks d_mutex but sets d_lock_count = 0. Now jobs are running, but wait() exits because no lock is held, and there is no check for whether new jobs were started after wait_for_jobs() finished but before the lock on d_mutex was acquired.
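The interleaving above can be replayed deterministically in a toy model. This is a hypothetical sketch of the described bug, not the actual top_block code; `buggy_model`, `unlock_and_restart()`, and `wait_returns()` are illustrative names standing in for the effects of the real calls.

```cpp
// Hypothetical replay of the race: wait() only checks d_lock_count after
// wait_for_jobs() returns, never whether new jobs were started meanwhile.
struct buggy_model {
    int d_lock_count = 0;
    bool d_jobs_running = true;

    // What happens at the end of unlock(): restart() starts new jobs,
    // then unlock() resets the lock count.
    void unlock_and_restart() {
        d_jobs_running = true;   // restart() starts new jobs
        d_lock_count = 0;        // unlock() finishes
    }

    // The buggy tail of wait(): after wait_for_jobs() returns and d_mutex
    // is reacquired, it decides to exit based on d_lock_count alone.
    bool wait_returns() const {
        return d_lock_count == 0;  // BUG: ignores d_jobs_running
    }
};
```

Replaying the scenario: with `d_lock_count == 1` and jobs stopping, `wait_for_jobs()` returns in the waiting thread; a context switch lets `unlock_and_restart()` run first; then `wait_returns()` is true even though jobs are running again, which is exactly the reported race.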

The fix adds a new state flag which is modified while d_mutex is locked and indicates whether the job state should be checked again.

I hope the description above is understandable. I admit there should be a test for each possible workflow, or at least each possible workflow should be checked for race conditions. I did neither of those, but if you insist, I can try to write a mathematical proof of whether certain race conditions can or cannot happen.
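A sketch of the fix idea under the same toy-model assumptions (hypothetical names, not the actual patch): here the new state is modeled as a sequence counter, `d_restart_seq`, bumped under d_mutex by restart(), so wait() can detect a restart that slipped in between wait_for_jobs() returning and d_mutex being reacquired.

```cpp
#include <mutex>

// Hypothetical sketch of the fix: restart() records, while d_mutex is
// held, that jobs were (re)started; wait() re-checks instead of exiting
// when it sees that record change. Names are illustrative only.
class waiter_model {
public:
    waiter_model() : d_restart_seq(0), d_jobs_running(false) {}

    void restart() {
        std::lock_guard<std::mutex> guard(d_mutex);
        d_jobs_running = true;
        ++d_restart_seq;               // recorded while d_mutex is held
    }

    void jobs_finished() {
        std::lock_guard<std::mutex> guard(d_mutex);
        d_jobs_running = false;
    }

    // One iteration of the wait() loop: returns true when it is safe to
    // return from wait(), false when a restart slipped in and the job
    // state must be checked again.
    bool wait_iteration(unsigned &seen_seq) {
        // ... wait_for_jobs() would block here, without d_mutex held ...
        std::lock_guard<std::mutex> guard(d_mutex);
        if (d_restart_seq != seen_seq) {
            seen_seq = d_restart_seq;  // a restart happened meanwhile:
            return false;              // re-check instead of returning
        }
        return !d_jobs_running;
    }

private:
    std::mutex d_mutex;
    unsigned d_restart_seq;
    bool d_jobs_running;
};
```

With this shape, the interleaving from the previous comment no longer lets wait() exit early: even if restart() runs between wait_for_jobs() returning and the mutex being reacquired, the changed counter forces another iteration.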

@pinkavaj
Contributor Author

Any progress here? Do you need more info or anything else?

@jmcorgan
Contributor

We're going to be doing a systematic review of this area of code before making further changes to it.

@jmcorgan
Contributor

jmcorgan commented Dec 8, 2015

We're going to be making stable release 3.7.9 soon. After that, I'd like to walk through with you where you see issues here, what fixes you are proposing, and how to implement tests that pass/fail on these fixes. In the meantime, I am closing this PR.
