Fix for when consumer would stop working due to errPartialCache returned from fileStore. #2761

Merged
derekcollison merged 2 commits into main from fs_partial_err on Dec 27, 2021

Conversation

derekcollison
Member

Stabilize filestore to eliminate sporadic errPartialCache errors under certain situations.

The filestore would release a msgBlock lock while trying to load a cache block if it thought it needed to flush pending data. With async false this should be very rare, but careful inspection showed it was possible and, under the right conditions, could lead to a consumer getting an errPartialCache returned when trying to load a message.

Resolves #2732 (Hopefully)

/cc @nats-io/core

…urn hint about clearing cache.

Signed-off-by: Derek Collison <derek@nats.io>
… certain situations. Related to #2732

The filestore would release a msgBlock lock while trying to load a cache block if it thought it needed to flush pending data.
With async false this should be very rare, but careful inspection showed it was possible.

I constructed an artificial test with sleeps throughout the filestore code to reproduce it.
It involved two goroutines that had gotten through and were waiting on the last msg block, and another one that was writing.
After the write, having released the lock but before flushing, we would also sleep artificially.
This would lead to the second read seeing that the cache load was already in progress and returning no error.
If the load was for a sequence before the current write sequence, and async was false, the cache fseq would be higher than what was requested.
This would cause errPartialCache to be returned.

Once the error was returned to the consumer level in loopAndGather, that goroutine would exit and the consumer would cease to function.

This change removes the unlock of a msgBlock to perform a flush, ensuring that two cacheLoads will not yield the errPartialCache.
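
To illustrate the idea of flushing and loading under one lock, here is a minimal sketch. It is not the nats-server code: the `block` type, its fields, and the `write`, `fetch`, and `run` functions are hypothetical names made up for this example, and `errPartial` only stands in for errPartialCache. The point it shows is that when the flush happens without releasing the mutex, a concurrent reader can never observe a cache load that is still in progress.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errPartial is a stand-in for the filestore's errPartialCache (assumption).
var errPartial = errors.New("partial cache")

// block is a hypothetical, heavily simplified message block.
// It is NOT the nats-server msgBlock type.
type block struct {
	mu      sync.Mutex
	disk    []uint64 // sequences already flushed
	pending []uint64 // sequences written but not yet flushed
	cache   []uint64 // sequences currently held in the cache
}

// write appends a sequence to the pending buffer.
func (b *block) write(seq uint64) {
	b.mu.Lock()
	b.pending = append(b.pending, seq)
	b.mu.Unlock()
}

// fetch returns seq from the cache, loading the cache first if needed.
// The flush and the load both run while b.mu is held, so no other
// goroutine can see a half-finished load.
func (b *block) fetch(seq uint64) (uint64, error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if len(b.pending) > 0 {
		// Flush pending data without releasing the lock.
		b.disk = append(b.disk, b.pending...)
		b.pending = b.pending[:0]
		b.cache = nil // the cache is stale after a flush
	}
	if b.cache == nil {
		b.cache = append([]uint64(nil), b.disk...)
	}
	if len(b.cache) == 0 || seq < b.cache[0] || seq > b.cache[len(b.cache)-1] {
		return 0, errPartial
	}
	return b.cache[seq-b.cache[0]], nil
}

// run writes sequence 1, then fetches it from several reader goroutines
// while a single writer appends sequences 2..last. It returns the number
// of fetches that failed or returned the wrong value.
func run(readers int, last uint64) int {
	b := &block{}
	b.write(1)
	var wg sync.WaitGroup
	var mu sync.Mutex
	failed := 0
	wg.Add(1)
	go func() {
		defer wg.Done()
		for seq := uint64(2); seq <= last; seq++ {
			b.write(seq)
		}
	}()
	for i := 0; i < readers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 100; j++ {
				if v, err := b.fetch(1); err != nil || v != 1 {
					mu.Lock()
					failed++
					mu.Unlock()
				}
			}
		}()
	}
	wg.Wait()
	return failed
}

func main() {
	fmt.Println("failed fetches:", run(4, 1000))
}
```

The trade-off in this sketch is that readers wait for the flush to finish instead of proceeding in parallel, which is acceptable here because the flush is expected to be rare.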

I also updated the consumer so that, if this does happen in the future, it does not exit the loopAndGather goroutine.
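
As a rough illustration of a delivery loop that survives a transient error, the sketch below retries the same sequence when it sees a partial-cache error rather than returning. Everything here is an assumption for the example: `gather`, `flakyFetch`, `maxRetries`, and `errPartial` are hypothetical names, and the actual handling inside loopAndGather may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// errPartial is a stand-in for the filestore's errPartialCache (assumption).
var errPartial = errors.New("partial cache")

// gather delivers sequences first..last using fetch. A transient errPartial
// triggers a retry of the same sequence instead of ending the loop. Any other
// error, or errPartial persisting past maxRetries, stops the loop; the bound
// exists only so this sketch always terminates.
func gather(fetch func(uint64) (string, error), first, last uint64, maxRetries int) ([]string, error) {
	var out []string
	for seq := first; seq <= last; seq++ {
		var msg string
		var err error
		for attempt := 0; attempt <= maxRetries; attempt++ {
			msg, err = fetch(seq)
			if !errors.Is(err, errPartial) {
				break
			}
		}
		if err != nil {
			return out, err
		}
		out = append(out, msg)
	}
	return out, nil
}

// flakyFetch returns a fetch function that fails with errPartial the first
// time each sequence is requested and succeeds on every later request.
func flakyFetch() func(uint64) (string, error) {
	seen := map[uint64]bool{}
	return func(seq uint64) (string, error) {
		if !seen[seq] {
			seen[seq] = true
			return "", errPartial
		}
		return fmt.Sprintf("msg-%d", seq), nil
	}
}

func main() {
	msgs, err := gather(flakyFetch(), 1, 3, 2)
	fmt.Println(msgs, err)
}
```

With `maxRetries` set to 0 the same loop gives up on the first partial-cache error, which mirrors the behavior described above where the consumer stopped functioning.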

Signed-off-by: Derek Collison <derek@nats.io>
Contributor

@matthiashanel matthiashanel left a comment

LGTM

@derekcollison derekcollison merged commit 34555ae into main Dec 27, 2021
@derekcollison derekcollison deleted the fs_partial_err branch December 27, 2021 20:03
Development

Successfully merging this pull request may close these issues.

Consumer stopped working after errPartialCache (nats-server oom-killed)
2 participants