Close TSMReaders from FileStore.Close after releasing FileStore mutex #9866

Merged
1 commit merged on May 17, 2018

Conversation

@jacobmarble (Member) commented on May 17, 2018

More work to properly fix #9786

Confirmed that DROP SHARD X works properly when the service is loaded with 10 concurrent select sum(f)-style queries: files are deleted and file descriptors are closed, so #9792 is working as intended (WAI).

Looking more closely at retention policy (RP) enforcement:

  • created a 1-minute RP (one-line local code modification)
  • set a 10-second RP check interval (config file setting)
  • loaded the service with the same queries mentioned above
  • watched the oldest shard (89) disappear from SHOW SHARDS
  • shard 89's files were never deleted from the filesystem
  • shard 89's file descriptors were never closed
  • waited 30 minutes, then killed the query load

Discovered that an RP event and exactly one query were deadlocked. The RP event was blocking here:

// Close closes the TSMReader.
func (t *TSMReader) Close() error {
	t.refsWG.Wait() // blocked forever: the query's Done() never runs
	// ...

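For context, refsWG counts in-flight uses of the reader. A sketch of the ref-counting helpers, modeled on the tsm1 reader (a sketch, not the verbatim source):

// Ref marks the reader as in use (e.g. by a query); Unref releases that use.
func (t *TSMReader) Ref()   { t.refsWG.Add(1) }
func (t *TSMReader) Unref() { t.refsWG.Done() }
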
The hung query appeared in SHOW QUERIES (its duration climbed as high as 1h; it never returned):

> show queries
qid  query                   database duration status
---  -----                   -------- -------- ------
8216 SHOW QUERIES            rain     38µs     running
8158 SELECT sum(rand) FROM m rain     1m10s    killed

The deadlock happens because:

  1. the RP event holds the FileStore mutex (write) while waiting on a TSMReader WaitGroup
  2. a query waits for the FileStore mutex (read) in order to call Done() on that same WaitGroup (see the sketch after this list)
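
A minimal, runnable Go sketch of this lock ordering, using toy names (store, reader) that stand in for FileStore and TSMReader — an illustration, not the actual code:

package main

import (
	"sync"
	"time"
)

// Toy stand-ins: store plays the FileStore, reader plays the TSMReader.
type reader struct{ refsWG sync.WaitGroup }

type store struct {
	mu sync.RWMutex
	r  *reader
}

func main() {
	s := &store{r: &reader{}}
	s.r.refsWG.Add(1) // a query holds a reference to the reader

	// "RP event": holds the write lock while waiting for references to drain.
	go func() {
		s.mu.Lock()
		defer s.mu.Unlock()
		s.r.refsWG.Wait() // blocks: the Done() below needs the read lock
	}()

	time.Sleep(100 * time.Millisecond) // let the RP goroutine take the write lock

	// "Query": must take the read lock to release its reference, but the
	// write lock above excludes it. The runtime reports:
	// fatal error: all goroutines are asleep - deadlock!
	s.mu.RLock()
	s.r.refsWG.Done()
	s.mu.RUnlock()
}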

This change closes the FileStore under the write lock by clearing its members (so the store looks closed), then releases the write lock before closing the underlying files.
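
Continuing the toy model above, the shape of the fix might look like this (an assumed sketch, not the actual diff):

// Close makes the store look closed under the write lock, then does the
// blocking work outside the lock, so a query can still RLock and call Done().
func (s *store) Close() error {
	s.mu.Lock()
	r := s.r      // snapshot the member(s) to close later
	s.r = nil     // clear members so the store looks closed
	s.mu.Unlock() // release before any blocking Wait()

	if r != nil {
		r.refsWG.Wait() // now reachable: queries can RLock and Done()
	}
	return nil
}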

To test this change:

  • set a 1-minute RP
  • write random points with 100k series cardinality, in an infinite loop
  • run select sum(field) from m in 5 concurrent infinite loops
  • watch for RP deadlocks and hung queries

Before this change, the test fails within 1-3 RP cycles (in one case deadlocking 3 of the 5 queries). After the change, the test has not failed in 200 RP cycles (about 2.5 hours).

@ghost assigned jacobmarble on May 17, 2018
@ghost added the review label on May 17, 2018
@hercules-influx (Contributor) commented:

During a run of megacheck the following issues were discovered:


Successfully merging this pull request may close these issues.

TSM files not closed when shard deleted