
Fully lock adding node queues during hinted handoff #4353

Merged: 1 commit merged into master from hh_file_thrashing on Oct 7, 2015

Conversation

otoolep (Contributor) commented Oct 7, 2015

I believe this change addresses the issues with hinted-handoff not fully replicating all data to nodes that come back online after an outage. A detailed explanation follows.

During testing of hinted-handoff (HH) under various scenarios, HH stats showed that the HH Processor was occasionally encountering errors while unmarshalling hinted data. This error was not handled correctly, and in clusters with more than 3 nodes it could cause the HH service to stall until the node was restarted. This was the high-level reason why HH data was not being replicated.

Furthermore, watching the hinted-handoff data at the byte level showed that HH segment block lengths were randomly being set to 0 while the block data itself was fine (block data contains the hinted writes). This was the root cause of the unmarshalling errors outlined above. It was, in turn, tracked down to the HH system opening each segment file multiple times concurrently. This is not thread-safe, so the multiple open calls were corrupting the file.

Finally, the reason a segment file was being opened multiple times in parallel was that WriteShard on the HH Processor was manipulating HH node queues in an unsafe manner. Since WriteShard can be called concurrently, this could add queues for the same node more than once, and each queue addition opens segment files.

This change fixes the locking in WriteShard so that the check for an existing HH queue for a given node is performed in a synchronized manner.

Before this change, under specific testing I could make HH fail 100% of the time within a couple of minutes due to corrupt HH data. With this change in place, everything looks fine.
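
For reference, the shape of the fix is a double-checked lock: a cheap existence check under the read lock, then a second check under the write lock before creating the queue, so only one caller ever creates it. Below is a simplified, generic sketch of that pattern (illustrative names only, not the actual code in this PR):

    package hh

    import "sync"

    type nodeQueue struct{} // stands in for the real queue, which wraps the on-disk segment files

    type processor struct {
        mu     sync.RWMutex
        queues map[uint64]*nodeQueue
    }

    func newProcessor() *processor {
        return &processor{queues: make(map[uint64]*nodeQueue)}
    }

    // queueFor returns the queue for ownerID, creating it exactly once even
    // when called from many goroutines concurrently.
    func (p *processor) queueFor(ownerID uint64) *nodeQueue {
        p.mu.RLock()
        q, ok := p.queues[ownerID]
        p.mu.RUnlock()
        if ok {
            return q
        }

        p.mu.Lock()
        defer p.mu.Unlock()

        // Check again under the write lock: another goroutine may have created
        // the queue between the RUnlock above and this Lock.
        if q, ok = p.queues[ownerID]; ok {
            return q
        }

        q = &nodeQueue{} // in the real code this is addQueue(), which opens the segment files
        p.queues[ownerID] = q
        return q
    }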

otoolep changed the title from "Locking changes" to "Fully lock adding node queue during hinted handoff" on Oct 7, 2015
    -    if queue, err = p.addQueue(ownerID); err != nil {
    +    if err := func() error {
    +        // Check again under write-lock.
    +        p.mu.Lock()
otoolep (Contributor, Author) commented on this diff:

This locking pattern is the same as used by the expvar package in the standard library.

otoolep (Contributor, Author) commented Oct 7, 2015

@jwilder

otoolep changed the title from "Fully lock adding node queue during hinted handoff" to "Fully lock adding node queues during hinted handoff" on Oct 7, 2015
    func (p *Processor) WriteShard(shardID, ownerID uint64, points []models.Point) error {
        p.mu.RLock()
        queue, ok := p.queues[ownerID]
otoolep (Contributor, Author) commented on this diff:

This is the root cause. p.queues was not being locked when checked for an entry for ownerID.
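
To spell out the interleaving (simplified): two concurrent WriteShard calls for the same owner could both pass the check and both add a queue, roughly:

    goroutine A: checks p.queues[ownerID], sees no queue
    goroutine B: checks p.queues[ownerID], sees no queue
    goroutine A: addQueue(ownerID)  // opens the node's segment files
    goroutine B: addQueue(ownerID)  // opens the same files again, concurrently

and that second, concurrent open is what was corrupting the segment files.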

corylanou (Contributor) commented:

Would be nice if we were able to validate this via a test.

jwilder (Contributor) commented Oct 7, 2015

👍 That definitely needs synchronization.

otoolep (Contributor, Author) commented Oct 7, 2015

@corylanou - I am open to suggestions, but I'm not sure how this is any different from any other piece of code in our system where, if the locking is wrong, we go in and fix the lock. It's difficult to create a unit test for this since it may always pass, because it's a race.

One way we can now check - and I will make sure we do - is to always run one of the burn-in boxes with race detection on. That way we will start catching issues like this during system testing.


otoolep added a commit that referenced this pull request on Oct 7, 2015: "Fully lock adding node queues during hinted handoff"
otoolep merged commit 889fd58 into master on Oct 7, 2015
otoolep deleted the hh_file_thrashing branch on Oct 7, 2015 at 16:25
otoolep (Contributor, Author) commented Oct 7, 2015

@corylanou - let me know if you have any ideas for testing this. I will be enabling race-detection on a single burn-in now.

corylanou (Contributor) commented:

@otoolep yeah, I don't have an easy answer. We could possibly add a test that runs a bunch of goroutines all calling HH and hope the race detector catches it, but again, it's hard to write a test that can reproduce this consistently, which is why the race detector exists. I think your idea of the burn-in box with race enabled is a great step in the right direction.
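
Something along these lines, run with go test -race, is roughly what I had in mind (a rough sketch only, assuming the usual testing and sync imports; newTestProcessor is a stand-in for whatever setup would build a Processor over a temp dir, and nil points just keeps the sketch short):

    func TestWriteShard_Concurrent(t *testing.T) {
        p := newTestProcessor(t) // stand-in helper, not a real function in the repo

        var wg sync.WaitGroup
        for i := 0; i < 50; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                // Every goroutine targets the same ownerID so they all race on
                // creating that node's queue.
                if err := p.WriteShard(1, 2, nil); err != nil {
                    t.Error(err)
                }
            }()
        }
        wg.Wait()
    }

Even then it only fails when the detector happens to catch the race, so the burn-in box is still the more reliable net.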
