Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SERVER-30283 Update topology_coordinator's .h and .cpp files #1196

Closed
wants to merge 3 commits into from

Conversation

cramaechi
Copy link
Contributor

PingStats, defined in topology_coordinator.cpp, is a data structure used
to track statistics about replication heartbeat messages. It has a
variable called _numFailuresSinceLastStart, which keeps track of failed
heartbeat attempts. When a good heartbeat response comes back, this
variable is set to std::numeric_limitsmax(), presumably as an easy
way for this value to always be larger than others in certain
comparisons. In the case that we actually get back a good hearbeat
response (PingStats::hit) and then a failed heartbeat response from the
same node (PingStats::miss), we will actually overflow this
_numFailuresSinceLastStart variable. The integer max function should
not be used to determine if a good heartbeat response was received.

To remediate the issue, a field named _goodHeartbeatReceived was
added to PingStats to indicate whether a good heartbeat response
(PingStats::hit) was received since the last start
(PingStats::start). The assignment statement responsible for the
potential overflow is then replaced by a new assignment statement
that sets _goodHeartbeatReceived to true when a good heartbeat
response comes back.

Resolves: SERVER-30283

PingStats, defined in topology_coordinator.cpp, is a data structure used
to track statistics about replication heartbeat messages. It has a
variable called _numFailuresSinceLastStart, which keeps track of failed
heartbeat attempts. When a good heartbeat response comes back, this
variable is set to std::numeric_limits<int>max(), presumably as an easy
way for this value to always be larger than others in certain
comparisons. In the case that we actually get back a good hearbeat
response (PingStats::hit) and then a failed heartbeat response from the
same node (PingStats::miss), we will actually overflow this
_numFailuresSinceLastStart variable. The integer max function should
not be used to determine if a good heartbeat response was received.

To remediate the issue, a field named _goodHeartbeatReceived was
added to PingStats to indicate whether a good heartbeat response
(PingStats::hit) was received since the last start
(PingStats::start). The assignment statement responsible for the
potential overflow is then replaced by a new assignment statement
that sets _goodHeartbeatReceived to true when a good heartbeat
response comes back.

Resolves: SERVER-30283
Copy link
Contributor

@will62794 will62794 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As part of this patch, you should also consider adding a small C++ unit test in db/repl/topology_coordinator_test.cpp to exercise the methods of PingStats.

Thanks for taking on this ticket!

* heartbeat has been received since the last call to start().
*/
bool noMoreRetries() const {
return _numFailuresSinceLastStart > 2 || _goodHeartbeatReceived == true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The max number of heartbeat failures should not be hard-coded. You should use the kMaxHeartbeatRetries constant.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, to make sure there is no confusion, it might be prudent to surround each condition in parentheses.

* exceeded the max number of heartbeat retries or whether a good
* heartbeat has been received since the last call to start().
*/
bool noMoreRetries() const {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name hasFailed might be a clearer name for this method.

if (hbStats.getNumFailuresSinceLastStart() <= kMaxHeartbeatRetries &&
alreadyElapsed < _rsConfig.getHeartbeatTimeoutPeriod()) {
if ((hbStats.getNumFailuresSinceLastStart() <= kMaxHeartbeatRetries) &&
(hbStats.getGoodHeartbeatReceived() == false) &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we replace these two condition checks with one call to noMoreRetries? If so, it looks like this would also make the getGoodHeartbeatReceived method obsolete, since it is only used here.

if (!_rsConfig.isInitialized() ||
(hbStats.getNumFailuresSinceLastStart() > kMaxHeartbeatRetries) ||
(alreadyElapsed >= _rsConfig.getHeartbeatTimeoutPeriodMillis())) {
if (!_rsConfig.isInitialized() || (hbStats.noMoreRetries() == true) ||
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you are checking if an expression X is true, it is redundant to write if(X == true), you can just write if(X). You can apply this to the hbStats.noMoreRetries() condition.

if (!_rsConfig.isInitialized() ||
(hbStats.getNumFailuresSinceLastStart() > kMaxHeartbeatRetries) ||
(alreadyElapsed >= _rsConfig.getHeartbeatTimeoutPeriodMillis())) {
if (!_rsConfig.isInitialized() || (hbStats.noMoreRetries() == true) ||
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same applies here as on line 911.

@@ -1093,8 +1096,9 @@ HeartbeatResponseAction TopologyCoordinator::processHeartbeatResponse(
if (!hbResponse.isOK()) {
if (isUnauthorized) {
hbData.setAuthIssue(now);
} else if (hbStats.getNumFailuresSinceLastStart() > kMaxHeartbeatRetries ||
alreadyElapsed >= _rsConfig.getHeartbeatTimeoutPeriod()) {
} else if ((hbStats.noMoreRetries() == true) ||
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same applies here as on line 911.

*/
int getNumFailuresSinceLastStart() const {
return _numFailuresSinceLastStart;
}

/**
* Gets a flag that indicates whether the number of failed heartbeats has
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When describing a function like this, it's a bit redundant to say "Gets a flag...". The fact that a function returns a boolean should be enough information about the nature of the return value. In a comment like this, it can be clearer to just concisely describe the case when the function returns true, and when the function returns false.

}

/**
* Gets a flag that indicates whether a good heartbeat was received
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same applies here as on line 1054.

@cramaechi
Copy link
Contributor Author

cramaechi commented Jan 3, 2018

Hi will62794,

All the changes have been applied, as per your instructions.

I also added a new method called PingStats::retriesLeft(), which will replace the arithmetic expression in the following line of code:

src/mongo/db/repl/topology_coordinator.cpp
screen shot 2018-01-03 at 4 09 17 pm

The reason for this particular change is that kMaxHeartbeatRetries will no longer be declared in src/mongo/db/repl/topology_coordinator.cpp, but instead in the .h file. So instead of incurring the overhead of calling the accessors to kMaxHeartbeatRetries and numFailuresSinceLastStart, I will now just be calling one method to handle the computation.

As for the C++ unit test, are there any comprehensive guides available to aid me in the design and implementation? I have already read the guide provided in the Wiki section of this repo, but are there any others?

@will62794
Copy link
Contributor

will62794 commented Jan 3, 2018

Sounds good. Once you add the unit test feel free to push the new changes and I will take another look at them.

The best guide for how to add a unit test is probably to look at an example i.e. src/mongo/db/repl/topology_coordinator_test.cpp. Since PingStats is a class inside the TopologyCoordinator file, you could probably add a test case in there.

@cramaechi
Copy link
Contributor Author

Hi will62794,

After building src/mongo/db/repl/topology_coordinator_test.cpp with my new unit test, I received a compilation error notifying me that PingStats was a private member class of TopologyCoordinator, which made me realize that I was attempting to access a private member of a class.

Typically, unit tests only test the public interface, so are we allowed to test private members of a class in this context? If not, then we can still have the peace of mind that there are multiple tests in src/mongo/db/repl/topology_coordinator_test.cpp that are already exercising all the methods of PingStats via public methods (e.g., TopologyCoordinator::prepareHeartbeatRequest(), TopologyCoordinator::prepareHeartbeatResponse(), TopologyCoordinator::processHeartbeatResponse(), etc..).

I will push my other changes for now and await your decision on the unit test part.

For documentation purposes, here's what my (unfinished) unit test would look like:

TEST_F(TopoCoordTest, UpdatePingStatsWhenHeartbeatRequestHasBeenProcessed) {
    // This test confirms that topology coordinator correctly updates heartbeat
    // stats for each heartbeat request processed.

    Date_t firstRequestDate = unittest::assertGet(dateFromISOString("2018-01-04T13:00Z"));
    TopologyCoordinator::PingStats hbStats;

    // Initial heartbeat request started at firstRequestDate.
    hbStats.start(firstRequestDate);
    ASSERT_EQUALS(hbStats.getLastHeartbeatStartDate(), firstRequestDate);
    ASSERT_EQUALS(hbStats.retriesLeft(), 2);
    ASSERT_EQUALS(hbStats.getCount(), 0);
    ASSERT_FALSE(hbStats.hasFailed());
    
    hbStats.miss(); 
    ASSERT_EQUALS(hbStats.retriesLeft(), 1);
    ASSERT_EQUALS(hbStats.getCount(), 0);
    ASSERT_FALSE(hbStats.hasFailed()); 

    hbStats.hit(Milliseconds(100))
    ASSERT_EQUALS(hbStats.retriesLeft(), 1);
    ASSERT_EQUALS(hbStats.getCount(), 1);
    ASSERT_TRUE(hbStats.hasFailed());

    hbStats.miss();
    ASSERT_EQUALS(hbStats.retriesLeft(), 0);
    ASSERT_EQUALS(hbStats.getCount(), 1);
    ASSERT_TRUE(hbStats.hasFailed());
}

@cramaechi
Copy link
Contributor Author

My new changes have now been pushed.

@will62794
Copy link
Contributor

Hi cramaechi,

You are right, I overlooked the fact that PingStats is private to topology_coordinator.cpp. As long as we are exercising PingStats via the public interface of TopologyCoordinator, that should be fine. Sorry about that confusion.

Also, it looks like your latest changes include many other unrelated changes, which makes it difficult to do another review pass. Did you perhaps rebase your branch against master and then push? That may be causing the noise. If you could try to clean that up and push your changes again, then I will take another look.

Thanks!

@cramaechi
Copy link
Contributor Author

cramaechi commented Jan 8, 2018

Hi will62794,

Sorry for the trouble. I believe I have resolved the conflict.

A new pull request has been opened.

}

private:
static constexpr Milliseconds UninitializedPing{-1};

// Maximum number of retries for a failed heartbeat.
int _kMaxHeartbeatRetries = 2;
Copy link
Contributor

@will62794 will62794 Jan 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make this a constant in the header file, instead of a member variable of PingStats.

@cramaechi
Copy link
Contributor Author

Change has been made.

@will62794
Copy link
Contributor

@will62794 will62794 changed the title Update topology_coordinator's .h and .cpp files SERVER-30283 Update topology_coordinator's .h and .cpp files Jan 9, 2018
@will62794
Copy link
Contributor

Hi cramaechi,

This patch looks good, thanks! I have compiled it and run the appropriate tests and everything seems to look fine. I will let you know when I get this merged. Thank you for your work!

Will

@cramaechi
Copy link
Contributor Author

Hi will62794,

No problem. I look forward to contributing more!

Best Regards,
Chibuikem Amaechi

@will62794 will62794 closed this in 6e3b0de Jan 9, 2018
@will62794
Copy link
Contributor

Your changes have been merged at 6e3b0d!

@cramaechi
Copy link
Contributor Author

Great. Thank you!

@cramaechi cramaechi deleted the server-30283 branch January 9, 2018 22:52
apocarteres pushed a commit to apocarteres/mongo that referenced this pull request Jan 10, 2018
Closes mongodb#1196

Signed-off-by: William Schultz <william.schultz@mongodb.com>
will62794 pushed a commit that referenced this pull request Feb 26, 2018
Closes #1196

Signed-off-by: William Schultz <william.schultz@mongodb.com>
(cherry picked from commit 6e3b0de)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants