Close TSMReaders from FileStore.Close after releasing FileStore mutex #9866

Merged
1 commit merged on May 17, 2018

Conversation

@jacobmarble (Member) commented on May 17, 2018

More work to properly fix #9786

Confirmed that DROP SHARD X works properly when the service is loaded with 10 concurrent select sum(f)-style queries: files are deleted and file descriptors are closed, so #9792 is working as intended (WAI).

Looking more closely at retention policy (RP) enforcement:

  • created a 1-minute RP (one-line local code modification)
  • set a 10-second RP check interval (config file setting)
  • loaded the service with the same queries mentioned above
  • watched the oldest shard (89) disappear from SHOW SHARDS
  • shard 89's files were never deleted from the filesystem
  • shard 89's file descriptors were never closed
  • waited 30 minutes, then killed the query load

Discovered that an RP event and exactly one query were deadlocked. The RP event was blocking here:

// Close closes the TSMReader.
func (t *TSMReader) Close() error {
	t.refsWG.Wait() // blocked forever: the query's Done() never runs
	// ...

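For context, refsWG counts in-flight uses of the reader. A sketch of the ref-counting helpers, modeled on the tsm1 reader (a sketch, not the verbatim source):

// Ref marks the reader as in use (e.g. by a query); Unref releases that use.
func (t *TSMReader) Ref()   { t.refsWG.Add(1) }
func (t *TSMReader) Unref() { t.refsWG.Done() }
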
The hung query appeared in SHOW QUERIES (its duration climbed as high as 1h; it never returned):

> show queries
qid  query                   database duration status
---  -----                   -------- -------- ------
8216 SHOW QUERIES            rain     38µs     running
8158 SELECT sum(rand) FROM m rain     1m10s    killed

The deadlock happens because:

  1. the RP event holds the FileStore mutex (write) while waiting on a TSMReader WaitGroup
  2. a query waits for the FileStore mutex (read) in order to call Done() on that same WaitGroup (see the sketch after this list)
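
A minimal, runnable Go sketch of this lock ordering, using toy names (store, reader) that stand in for FileStore and TSMReader — an illustration, not the actual code:

package main

import (
	"sync"
	"time"
)

// Toy stand-ins: store plays the FileStore, reader plays the TSMReader.
type reader struct{ refsWG sync.WaitGroup }

type store struct {
	mu sync.RWMutex
	r  *reader
}

func main() {
	s := &store{r: &reader{}}
	s.r.refsWG.Add(1) // a query holds a reference to the reader

	// "RP event": holds the write lock while waiting for references to drain.
	go func() {
		s.mu.Lock()
		defer s.mu.Unlock()
		s.r.refsWG.Wait() // blocks: the Done() below needs the read lock
	}()

	time.Sleep(100 * time.Millisecond) // let the RP goroutine take the write lock

	// "Query": must take the read lock to release its reference, but the
	// write lock above excludes it. The runtime reports:
	// fatal error: all goroutines are asleep - deadlock!
	s.mu.RLock()
	s.r.refsWG.Done()
	s.mu.RUnlock()
}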

This change closes the FileStore under the write lock by clearing its members (so the store looks closed), then releases the write lock before closing the underlying files.
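
Continuing the toy model above, the shape of the fix might look like this (an assumed sketch, not the actual diff):

// Close makes the store look closed under the write lock, then does the
// blocking work outside the lock, so a query can still RLock and call Done().
func (s *store) Close() error {
	s.mu.Lock()
	r := s.r      // snapshot the member(s) to close later
	s.r = nil     // clear members so the store looks closed
	s.mu.Unlock() // release before any blocking Wait()

	if r != nil {
		r.refsWG.Wait() // now reachable: queries can RLock and Done()
	}
	return nil
}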

To test this change:

  • set a 1-minute RP
  • write random points with 100k series cardinality, in an infinite loop
  • run select sum(field) from m in 5 concurrent infinite loops
  • watch for RP deadlocks and hung queries

Before this change, the test fails within 1-3 RP cycles (in one case deadlocking 3 of the 5 queries). After the change, the test has not failed in 200 RP cycles (about 2.5 hours).

@ghost assigned jacobmarble on May 17, 2018
@ghost added the review label on May 17, 2018
@hercules-influx (Contributor) commented:

During a run of megacheck the following issues were discovered:


Successfully merging this pull request may close these issues.

TSM files not closed when shard deleted