
Fully lock adding node queues during hinted handoff #4353

Merged: 1 commit merged into master from hh_file_thrashing on Oct 7, 2015

Conversation

otoolep (Contributor) commented Oct 7, 2015

I believe this change addresses the issues with hinted-handoff not fully replicating all data to nodes that come back online after an outage. A detailed explanation follows.

During testing of hinted-handoff (HH) under various scenarios, HH stats showed that the HH Processor was occasionally encountering errors while unmarshalling hinted data. This error was not handled correctly, and in clusters with more than 3 nodes it could cause the HH service to stall until the node was restarted. This was the high-level reason why HH data was not being replicated.

Furthermore, watching the hinted-handoff data at the byte level showed that HH segment block lengths were randomly being set to 0 while the block data itself was fine (block data contains the hinted writes). This was the root cause of the unmarshalling errors outlined above. It was, in turn, tracked down to the HH system opening each segment file multiple times concurrently. This is not thread-safe, so the multiple open calls were corrupting the file.

Finally, the reason a segment file was being opened multiple times in parallel was that WriteShard on the HH Processor was manipulating HH node queues in an unsafe manner. Since WriteShard can be called concurrently, this could add queues for the same node more than once, and each queue addition opens segment files.

This change fixes the locking in WriteShard so that the check for an existing HH queue for a given node is performed in a synchronized manner.

Before this change, under specific testing I could make HH fail 100% of the time within a couple of minutes due to corrupt HH data. With this change in place, everything looks fine.
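
For reference, the shape of the fix is a double-checked lock: a cheap existence check under the read lock, then a second check under the write lock before creating the queue, so only one caller ever creates it. Below is a simplified, generic sketch of that pattern (illustrative names only, not the actual code in this PR):

    package hh

    import "sync"

    type nodeQueue struct{} // stands in for the real queue, which wraps the on-disk segment files

    type processor struct {
        mu     sync.RWMutex
        queues map[uint64]*nodeQueue
    }

    func newProcessor() *processor {
        return &processor{queues: make(map[uint64]*nodeQueue)}
    }

    // queueFor returns the queue for ownerID, creating it exactly once even
    // when called from many goroutines concurrently.
    func (p *processor) queueFor(ownerID uint64) *nodeQueue {
        p.mu.RLock()
        q, ok := p.queues[ownerID]
        p.mu.RUnlock()
        if ok {
            return q
        }

        p.mu.Lock()
        defer p.mu.Unlock()

        // Check again under the write lock: another goroutine may have created
        // the queue between the RUnlock above and this Lock.
        if q, ok = p.queues[ownerID]; ok {
            return q
        }

        q = &nodeQueue{} // in the real code this is addQueue(), which opens the segment files
        p.queues[ownerID] = q
        return q
    }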

otoolep changed the title from "Locking changes" to "Fully lock adding node queue during hinted handoff" on Oct 7, 2015
    -    if queue, err = p.addQueue(ownerID); err != nil {
    +    if err := func() error {
    +        // Check again under write-lock.
    +        p.mu.Lock()
otoolep (Contributor, Author) commented on this diff:

This locking pattern is the same as used by the expvar package in the standard library.

otoolep (Contributor, Author) commented Oct 7, 2015

@jwilder

otoolep changed the title from "Fully lock adding node queue during hinted handoff" to "Fully lock adding node queues during hinted handoff" on Oct 7, 2015
    func (p *Processor) WriteShard(shardID, ownerID uint64, points []models.Point) error {
        p.mu.RLock()
        queue, ok := p.queues[ownerID]
otoolep (Contributor, Author) commented on this diff:

This is the root cause. p.queues was not being locked when checked for an entry for ownerID.
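
To spell out the interleaving (simplified): two concurrent WriteShard calls for the same owner could both pass the check and both add a queue, roughly:

    goroutine A: checks p.queues[ownerID], sees no queue
    goroutine B: checks p.queues[ownerID], sees no queue
    goroutine A: addQueue(ownerID)  // opens the node's segment files
    goroutine B: addQueue(ownerID)  // opens the same files again, concurrently

and that second, concurrent open is what was corrupting the segment files.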

corylanou (Contributor) commented:

Would be nice if we were able to validate this via a test.

jwilder (Contributor) commented Oct 7, 2015

👍 That definitely needs synchronization.

otoolep (Contributor, Author) commented Oct 7, 2015

@corylanou - I am open to suggestions, but I'm not sure how this is any different from any other piece of code in our system where, if the locking is wrong, we go in and fix the lock. It's difficult to create a unit test for this since it may always pass, because it's a race.

One way we can now check - and I will make sure we do - is to always run one of the burn-in boxes with race detection on. That way we will start catching issues like this during system testing.


otoolep added a commit that referenced this pull request on Oct 7, 2015: "Fully lock adding node queues during hinted handoff"
otoolep merged commit 889fd58 into master on Oct 7, 2015
otoolep deleted the hh_file_thrashing branch on Oct 7, 2015 at 16:25
otoolep (Contributor, Author) commented Oct 7, 2015

@corylanou - let me know if you have any ideas for testing this. I will be enabling race-detection on a single burn-in now.

corylanou (Contributor) commented:

@otoolep yeah, I don't have an easy answer. We could possibly add a test that runs a bunch of goroutines all calling HH and hope the race detector catches it, but again, it's hard to write a test that can reproduce this consistently, which is why the race detector exists. I think your idea of the burn-in box with race enabled is a great step in the right direction.
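
Something along these lines, run with go test -race, is roughly what I had in mind (a rough sketch only, assuming the usual testing and sync imports; newTestProcessor is a stand-in for whatever setup would build a Processor over a temp dir, and nil points just keeps the sketch short):

    func TestWriteShard_Concurrent(t *testing.T) {
        p := newTestProcessor(t) // stand-in helper, not a real function in the repo

        var wg sync.WaitGroup
        for i := 0; i < 50; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                // Every goroutine targets the same ownerID so they all race on
                // creating that node's queue.
                if err := p.WriteShard(1, 2, nil); err != nil {
                    t.Error(err)
                }
            }()
        }
        wg.Wait()
    }

Even then it only fails when the detector happens to catch the race, so the burn-in box is still the more reliable net.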
