Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FIXED] cleaning up sanitize=thread found several races #771

Merged
merged 14 commits into from
Jul 22, 2024

Conversation

levb
Copy link
Collaborator

@levb levb commented Jul 13, 2024

note: I've been working on a bigger PR for async fetch (aka Consume), it is not ready yet but since it will be a big PR I'll try to send in parts that are ready to simplify the review. This is part 1.

  • Lock on msgDlvWorker->lock where we were not before.
  • Use natsSock_Shutdown rather than natsSock_Close from natsConnection_Reconnect to avoid a potential data race with the readloop.
  • Cleanup thread sanitizer CI runs and prior test failures. Fixes Tests failing -fsanitize=thread #672

Before this PR warnings generated by-fsanitize=thread in CI were ignored, since the detailed test logs were dropped and not scanned. This PR adds the log scanning.

It also uncovered several issues with the existing tests, all of them but one were relatively trivial. In Test_ForceReconnect detected a race caused by us forecefully close-ing an fd which is still being read on by the readloop. I could not think of a simple solution that would not have performance implications, so I disabled the test for thread sanitizer. The code seems to be doing the right thing anyway, and the readloop thread would error out, as designed.

The following tests were fixed:

  • Test_SrvVersionAtLeast
  • Test_ForcedReconnect
  • Test_ClientAutoUnsubAndReconnect
  • Test_AutoUnsubNoUnsubOnDestroy
  • Test_EventLoop
  • Test_JetStreamOrderedConsSrvRestart
  • Test_JetStreamOrderedConsumerWithErrors

This PR also adds a few more CI jobs, including (experimentally for now) DEV_MODE. I will remove the DEV_MODE jobs if they prove flaky and not easily fixable. Same for the (added) debug/sanitize jobs, will monitor and remove/fix if too slow/flaky.

@levb levb requested review from mtmk and kozlovic July 13, 2024 05:07
test/test.c Outdated Show resolved Hide resolved
Copy link

codecov bot commented Jul 13, 2024

Codecov Report

Attention: Patch coverage is 28.57143% with 15 lines in your changes missing coverage. Please review.

Project coverage is 68.79%. Comparing base (1553d4a) to head (bf65830).
Report is 2 commits behind head on main.

Files Patch % Lines
src/js.c 20.00% 3 Missing and 5 partials ⚠️
src/conn.c 36.36% 3 Missing and 4 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #771      +/-   ##
==========================================
+ Coverage   68.71%   68.79%   +0.08%     
==========================================
  Files          39       39              
  Lines       15207    15227      +20     
  Branches     3143     3153      +10     
==========================================
+ Hits        10449    10476      +27     
+ Misses       1700     1684      -16     
- Partials     3058     3067       +9     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@kozlovic kozlovic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All good catches with missing unlocking in tests. I have a question regarding the need of verbose output for sanitize thread and about the reconnect implementation.

.github/workflows/build-test.yml Outdated Show resolved Hide resolved
test/test.c Outdated Show resolved Hide resolved
@levb levb changed the title [TEST+CI only] cleanup of sanitizeThread [FIXED] use natsSock_Shutdown in Reconnect to avoid a race Jul 13, 2024
@levb levb requested a review from kozlovic July 13, 2024 20:04
Copy link
Member

@kozlovic kozlovic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove redirect to temp file? By the way, if you would have preferred having different PRs, you could have had one with just the close->shutdown change, that's it, and then once merged, rebase the PR that was just CI/Tests.
Finally, I had disabled sanitize=thread in Travis due to some bug in Ubuntu/travis combination. We may want to reactivate from time to time to see if that works now.

.github/workflows/build-test.yml Outdated Show resolved Hide resolved
@@ -45,6 +45,10 @@ foreach(name ${listOfTestNames})
# Make sure the test passes
set_tests_properties(Test_${name} PROPERTIES PASS_REGULAR_EXPRESSION "ALL PASSED")

# Set TSAN_OPTIONS for the test
set_tests_properties(Test_${name} PROPERTIES
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this have an effect on any other builds? (where we don't need/want sanitize)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so, since TSAN is for Thread SANitizer, and it's an environment variable for running the tests, not a build flag but I put it in an if for NATS_SANITIZE only just in case.

@levb
Copy link
Collaborator Author

levb commented Jul 15, 2024

@kozlovic The travis build still does sanitize=thread for clang. I tried to enable gcc, and got the same cmake configuration-time failure, so disabled it back. I added a fuller matrix of sanitize to GH Actions, will see what it produces.

@levb levb requested a review from kozlovic July 15, 2024 09:55
@levb
Copy link
Collaborator Author

levb commented Jul 15, 2024

@kozlovic The fuller matrix found a couple more, in the code, missing libDlvWorker lock; please check.

@levb levb changed the title [FIXED] use natsSock_Shutdown in Reconnect to avoid a race [FIXED] cleaning up sanitize=thread found several races Jul 15, 2024
sanitize: address
server_version: main

san-thread:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Not sure why you would shorten to san- here while you used santize- for the address matrix run above :-)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shorter is better? :) (will change for consistency)

@@ -1137,17 +1137,29 @@ _resendSubscriptions(natsConnection *nc)

adjustedMax = 0;
natsSub_Lock(sub);
if (sub->libDlvWorker != NULL)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are macro for lock/unlock of this in sub.c. Maybe we move the definition of these in sub.h (we include this header in conn.c), and so you could replace these 2 lines with:

SUB_DLV_WORKER_LOCK(sub);

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kozlovic I stayed away from further exposing these macros. I am about to submit a re-factor (just passed all tests now without kicking!) that hopefully eliminates the need for double-locking, leaving the worker lock almost exclusively to protect the message queue. Ditto for the sub->closed comment.

It's ok with me if we keep this PR "hanging" for now until you see the upcoming solution, and then we can decide which way to fix? I just want to have it as my base branch to get the CI/unrelated test fixes out of the way.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand the last part, that is, do you want to merge this PR to be the base for the other work or the other work will be the (re)base for the CI changes. Either way, I am going to LGTM this PR so that you proceed as you wish.

// If JS ordered consumer, trigger a reset. Don't check the error
// condition here. If there is a failure, it will be retried
// at the next HB interval.
if ((sub->jsi != NULL) && (sub->jsi->ordered))
{
jsSub_resetOrderedConsumer(sub, sub->jsi->sseq+1);
if (sub->libDlvWorker != NULL)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and:

SUB_DLV_WORKER_UNLOCK(sub);

@@ -2045,6 +2045,10 @@ _hbTimerFired(natsTimer *timer, void* closure)
natsStatus s = NATS_OK;

natsSub_Lock(sub);
if (sub->libDlvWorker != NULL)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same than for conn.c. We also include sub.h here, so that would work.

if (sub->libDlvWorker != NULL)
{
natsMutex_Lock(sub->libDlvWorker->lock);
}
if (!sub->closed)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Humm... there are may places where we check this under the sub's lock only. Sure, it is set under both locks and checked only under the lib's lock in nats.c (the code that deals with the delivery's pool thread), but if we find many other reports, then I think it would be simpler to modify _deliverMsgs so that we capture sub->closed under the sub's lock after releasing the delivery's lock. Anyway, for now change is ok, but again, let's keep that in mind.

Copy link
Member

@kozlovic kozlovic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@kozlovic kozlovic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@levb levb merged commit bd700a9 into main Jul 22, 2024
28 checks passed
@levb levb deleted the lev-ci-logging-threadsanitizer branch July 22, 2024 22:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Tests failing -fsanitize=thread
2 participants