fix: prevent retention service from hanging #25055

Merged: 4 commits into master-1.x from gw_25054 on Jun 13, 2024

gwossum (Member) commented Jun 11, 2024

Fix issue that can cause the retention service to hang waiting on a Shard.Close call. When this occurs, no other shards will be deleted by the retention service. This is usually noticed as an increase in disk usage because old shards are not cleaned up.

The fix adds two new methods to Store: SetShardNewReadersBlocked and ShardInUse. ShardInUse can be polled to check whether a shard has active readers; the retention service uses it to skip over in-use shards instead of hanging. SetShardNewReadersBlocked controls whether new read access may be granted to a shard. This is required to prevent race conditions between the ShardInUse check and the deletion of the shard.
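
To illustrate why both methods are needed, here is a minimal sketch of the block-then-check ordering. It is not the code from this PR: `shardGuard`, `acquire`, `release`, `setNewReadersBlocked`, `inUse`, and `tryDelete` are hypothetical names chosen for this example, and the reader bookkeeping is reduced to a counter behind a mutex.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errNewReadersBlocked is a hypothetical error for refused read access.
var errNewReadersBlocked = errors.New("new readers are blocked for this shard")

// shardGuard is a hypothetical stand-in for a shard's reader bookkeeping.
type shardGuard struct {
	mu                sync.Mutex
	readers           int  // number of active readers
	newReadersBlocked bool // when true, acquire refuses new readers
}

// acquire registers a reader unless new readers are currently blocked.
func (g *shardGuard) acquire() error {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.newReadersBlocked {
		return errNewReadersBlocked
	}
	g.readers++
	return nil
}

// release unregisters a reader previously registered with acquire.
func (g *shardGuard) release() {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.readers > 0 {
		g.readers--
	}
}

// setNewReadersBlocked plays the role described for SetShardNewReadersBlocked.
func (g *shardGuard) setNewReadersBlocked(blocked bool) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.newReadersBlocked = blocked
}

// inUse plays the role described for ShardInUse: a non-blocking poll.
func (g *shardGuard) inUse() bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	return g.readers > 0
}

// tryDelete blocks new readers first, then polls for active readers.
// It calls del and returns true only if the shard is idle; otherwise it
// unblocks new readers again and returns false without waiting.
func tryDelete(g *shardGuard, del func()) bool {
	g.setNewReadersBlocked(true)
	if g.inUse() {
		g.setNewReadersBlocked(false) // let readers back in; retry later
		return false
	}
	del()
	return true
}

func main() {
	g := &shardGuard{}
	if err := g.acquire(); err != nil {
		fmt.Println("unexpected:", err)
		return
	}
	fmt.Println("deleted with active reader:", tryDelete(g, func() {}))
	g.release()
	fmt.Println("deleted after release:", tryDelete(g, func() {}))
}
```

Blocking new readers before polling means the reader count can only go down between the check and the delete. If the order were reversed, a reader could be admitted after the poll reported the shard idle, which is the race described above.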

If the retention service skips over a shard because it is in use, the shard will be checked again the next time the retention service runs. It can be deleted on a subsequent check once it is no longer in use. If a shard is stuck in use, the retention service will not be able to delete it; this can be observed in the logs so that an operator can intervene manually. Other shards can still be deleted by the retention service even if one shard is stuck with readers.
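
The skip-and-retry behavior can be sketched as a single retention pass over a list of expired shards. This is an illustration under assumptions, not the retention service code: the method signatures on `shardStore` are guesses based on the description above, and `DeleteShard`, `fakeStore`, and `retentionPass` are hypothetical names used only for this example.

```go
package main

import "fmt"

// shardStore is a hypothetical, minimal view of what a retention pass needs.
// The signatures are assumptions, not the actual Store API.
type shardStore interface {
	SetShardNewReadersBlocked(id uint64, blocked bool) error
	ShardInUse(id uint64) (bool, error)
	DeleteShard(id uint64) error
}

// fakeStore is an in-memory stand-in used only for this sketch.
type fakeStore struct {
	readers map[uint64]int  // active reader count per shard ID
	blocked map[uint64]bool // whether new readers are blocked per shard ID
	deleted []uint64        // shard IDs deleted so far, in order
}

func (s *fakeStore) SetShardNewReadersBlocked(id uint64, blocked bool) error {
	if _, ok := s.readers[id]; !ok {
		return fmt.Errorf("shard %d not found", id)
	}
	s.blocked[id] = blocked
	return nil
}

func (s *fakeStore) ShardInUse(id uint64) (bool, error) {
	n, ok := s.readers[id]
	if !ok {
		return false, fmt.Errorf("shard %d not found", id)
	}
	return n > 0, nil
}

func (s *fakeStore) DeleteShard(id uint64) error {
	if _, ok := s.readers[id]; !ok {
		return fmt.Errorf("shard %d not found", id)
	}
	delete(s.readers, id)
	delete(s.blocked, id)
	s.deleted = append(s.deleted, id)
	return nil
}

// retentionPass tries to delete every expired shard and returns the IDs it
// skipped. A skipped shard never stops the loop, so the remaining shards
// are still processed in the same pass.
func retentionPass(s shardStore, expired []uint64) (skipped []uint64) {
	for _, id := range expired {
		if err := s.SetShardNewReadersBlocked(id, true); err != nil {
			fmt.Printf("shard %d: cannot block new readers: %v\n", id, err)
			skipped = append(skipped, id)
			continue
		}
		inUse, err := s.ShardInUse(id)
		if err != nil || inUse {
			// Re-admit readers, log the skip, and retry on the next pass.
			_ = s.SetShardNewReadersBlocked(id, false)
			fmt.Printf("shard %d: skipped (in use: %v, err: %v)\n", id, inUse, err)
			skipped = append(skipped, id)
			continue
		}
		if err := s.DeleteShard(id); err != nil {
			_ = s.SetShardNewReadersBlocked(id, false)
			fmt.Printf("shard %d: delete failed: %v\n", id, err)
			skipped = append(skipped, id)
		}
	}
	return skipped
}

func main() {
	s := &fakeStore{
		readers: map[uint64]int{1: 0, 2: 1, 3: 0},
		blocked: map[uint64]bool{},
	}
	skipped := retentionPass(s, []uint64{1, 2, 3})
	fmt.Println("first pass, skipped:", skipped, "deleted:", s.deleted)

	s.readers[2] = 0 // the reader on shard 2 finishes before the next pass
	skipped = retentionPass(s, skipped)
	fmt.Println("second pass, skipped:", skipped, "deleted:", s.deleted)
}
```

In this sketch, shard 2 is skipped on the first pass while shards 1 and 3 are deleted, and shard 2 is deleted on the second pass once its reader is gone. A shard whose reader count never drops would be logged as skipped on every pass.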

closes: #25054

davidby-influx (Contributor) left a comment

Left one question about something I don't understand.

tsdb/store.go (outdated, resolved)
Move the misplaced `ScheduleFullCompaction` call in test code to where it belongs.
Remove commented-out code.
davidby-influx (Contributor) left a comment

Mostly test changes to improve rigor, and a few error-handling suggestions.

tsdb/engine/tsm1/file_store.go (outdated, resolved)
tsdb/engine/tsm1/file_store.go (outdated, resolved)
tsdb/engine/tsm1/file_store_test.go (outdated, resolved)
tsdb/engine/tsm1/file_store_test.go (outdated, resolved)
tsdb/engine/tsm1/file_store_test.go (outdated, resolved)
tsdb/shard_test.go (outdated, resolved)
tsdb/shard_test.go (outdated, resolved)
tsdb/store.go (outdated, resolved)
tsdb/store.go (outdated, resolved)
tsdb/engine/tsm1/file_store.go (resolved)
Improve error messages, tests, and comments.
davidby-influx (Contributor) left a comment

Very nice!

@gwossum gwossum merged commit b4bd607 into master-1.x Jun 13, 2024
9 checks passed
@gwossum gwossum deleted the gw_25054 branch June 13, 2024 16:07
gwossum added a commit that referenced this pull request Jun 21, 2024
Fix issue that can cause the retention service to hang waiting on a
`Shard.Close` call. When this occurs, no other shards will be deleted
by the retention service. This is usually noticed as an increase in
disk usage because old shards are not cleaned up.

The fix adds two new methods to `Store`, `SetShardNewReadersBlocked`
and `InUse`. `InUse` can be used to poll if a shard has active readers,
which the retention service uses to skip over in-use shards to prevent
the service from hanging. `SetShardNewReadersBlocked` determines if
new read access may be granted to a shard. This is required to prevent
race conditions around the use of `InUse` and the deletion of shards.

If the retention service skips over a shard because it is in-use, the
shard will be checked again the next time the retention service is run.
It can be deleted on subsequent checks if it is no longer in-use. If
a shard is stuck in-use, the retention service will not be able to
delete it, which can be observed in the logs for manual
intervention. Other shards can still be deleted by the retention service
even if a shard is stuck with readers.

This is a port of ad68ec8 from master-1.x to main-2.x.

closes: #25076
(cherry picked from commit b4bd607)
gwossum added a commit that referenced this pull request Jun 24, 2024
* fix: prevent retention service from hanging (#25055)

This is a port of ad68ec8 from master-1.x to main-2.x.

closes: #25076
(cherry picked from commit b4bd607)