INDY-1683: bugfix backup_instance_faulty_processor with quorum logic #954
Conversation
Changes:
- add tests
- bugfix backup_instance_faulty_processor with quorum logic

Signed-off-by: toktar <renata.toktar@dsr-corporation.com>
from plenum.test.test_node import ensureElectionsDone, checkNodesConnected

@pytest.fixture(scope="module", params=[{"instance_degraded": "local", "instance_primary_disconnected": "local"},
I think we can write unit tests for `BackupInstanceFaultyProcessor`. Please have a look at the `fake_node` fixture we can use for it.
I can imagine the following unit tests:
- test that 1 call of `on_backup_degradation` leads to replica removal with local strategy
- test that 1 call of `on_backup_degradation` does not lead to replica removal with quorum strategy
- test that 1 call of `on_backup_primary_disconnected` leads to replica removal with local strategy
- test that 1 call of `on_backup_primary_disconnected` does not lead to replica removal with quorum strategy
- test that `process_backup_instance_faulty_msg` does not lead to removal in case of local strategy
- test that f calls of `process_backup_instance_faulty_msg` with the same messages do not lead to removal in case of quorum strategy
- test that f+1 calls of `process_backup_instance_faulty_msg` with the same messages lead to removal in case of quorum strategy
- test that a BackupInstanceFaulty from our own node is not required to remove a replica (just a quorum of any nodes)
- test that only BackupInstanceFaulty messages with the same viewNo as the current one lead to removal
- test that n BackupInstanceFaulty messages with different instances, where each instance accumulates only f votes in total, do not lead to removal (in case of quorum)
- test that n BackupInstanceFaulty messages with different instances, where one instance accumulates f+1 votes and the others only f, lead to removal of that one instance only
- test that n BackupInstanceFaulty messages with different instances, where all instances accumulate f+1 votes in total, lead to removal of all instances
- test that `restore_replicas` restores all replicas
- test BackupInstanceFaulty with empty `instances` values
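To illustrate the quorum-strategy cases above, here is a rough, self-contained sketch of the "f calls do not remove, f+1 calls remove" behaviour. `QuorumSketch` is a stand-in model, not the real `BackupInstanceFaultyProcessor` API:

```python
# Minimal model of the f+1 quorum check on BackupInstanceFaulty votes.
# All class and method names here are hypothetical stand-ins.

class QuorumSketch:
    def __init__(self, n):
        self.f = (n - 1) // 3          # max faulty nodes, assuming n = 3f + 1
        self.votes = {}                # inst_id -> set of sender names
        self.removed = set()

    def process_msg(self, inst_id, frm):
        self.votes.setdefault(inst_id, set()).add(frm)
        if len(self.votes[inst_id]) >= self.f + 1:   # quorum reached
            self.removed.add(inst_id)

proc = QuorumSketch(n=7)               # 7 nodes -> f = 2
for frm in ("Node1", "Node2"):         # only f messages: no removal yet
    proc.process_msg(1, frm)
assert 1 not in proc.removed
proc.process_msg(1, "Node3")           # f + 1 messages: removal triggered
assert 1 in proc.removed
```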
- Added test_on_backup_degradation_for_one with local strategy
- Added test_on_backup_degradation_for_one with quorum strategy
- Added test_on_backup_primary_disconnected_for_one with local strategy
- Added test_on_backup_primary_disconnected_for_one with quorum strategy
- Was in method __process_backup_instance_faulty_msg_work_with_different_msgs()
- Added a check in __process_backup_instance_faulty_msg_work_with_different_msgs(), but I'm not sure that this check is correct. Should we sleep for some time, and if so, for how long?
- In test_process_backup_instance_faulty_msg() we send n-1 messages, which is more than f. Is that enough?
- Changed sending in __process_backup_instance_faulty_msg_work_with_different_msgs()
- Added test_process_backup_instance_with_incorrect_view_no
- Why shouldn't it be removed?
- Was in test_restore_replicas
- Added test_process_backup_instance_empty_msg
Signed-off-by: toktar <renata.toktar@dsr-corporation.com>
Signed-off-by: toktar <renata.toktar@dsr-corporation.com>
Signed-off-by: toktar <renata.toktar@dsr-corporation.com>
…nto task-1682-remove-replicas-with-ic
Signed-off-by: toktar <renata.toktar@dsr-corporation.com>
…nto task-1682-remove-replicas-with-ic
Signed-off-by: toktar <renata.toktar@dsr-corporation.com>
Signed-off-by: toktar <renata.toktar@dsr-corporation.com>
Changes:
- change logic to remove replica with quorum of own messages
- add tests to test_instance_faulty_processor.py

Signed-off-by: toktar <renata.toktar@dsr-corporation.com>
Changes:
- refactoring test_instance_faulty_processor.py
- update test_replica_removing_after_node_started.py
- change REPLICAS_REMOVING_WITH_DEGRADATION to quorum

Signed-off-by: toktar <renata.toktar@dsr-corporation.com>
…nto task-1682-remove-replicas-with-ic
if inst_id not in self.node.replicas.keys():
    continue
self.backup_instances_faulty.setdefault(inst_id, dict()).setdefault(frm, 0)
self.backup_instances_faulty[inst_id].setdefault(self.node.name, 0)
self.backup_instances_faulty[inst_id][frm] += 1
if not self.node.quorums.backup_instance_faulty.is_reached(
Please add a comment explaining this condition
I will add. This code adds a default value for the message sender and for our own node's messages. But maybe you see a cleaner way to do it.
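The `setdefault` bookkeeping under discussion can be seen in isolation; this is a simplified model of the vote accounting, not the actual processor code:

```python
# Simplified model: before incrementing the sender's counter, make sure
# counters exist for both the sender and this node itself, so a later
# quorum check over the dict's keys always sees our own node.
backup_instances_faulty = {}
own_name = "Alpha"   # hypothetical name of this node

def record_vote(inst_id, frm):
    counters = backup_instances_faulty.setdefault(inst_id, {})
    counters.setdefault(frm, 0)        # default for the message sender
    counters.setdefault(own_name, 0)   # default for our own node
    counters[frm] += 1

record_vote(1, "Beta")
assert backup_instances_faulty == {1: {"Beta": 1, "Alpha": 0}}
```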
""" | ||
instance_to_remove = 1 | ||
view_no = txnPoolNodeSet[0].viewNo | ||
Node will change view even though it does not find the master to be degraded |
Are we talking about a view change or removal of replicas here? Is the comment correct?
if not self.node.quorums.backup_instance_faulty.is_reached(
        len(self.backup_instances_faulty[inst_id])):
        len(self.backup_instances_faulty[inst_id].keys())) \
        and not self.node.quorums.backup_instance_faulty.is_reached(
- Should we use a different quorum here?
- Should we take into account only degradations from this node that occur in a row? That is, if we face degradation once per hour, it should not be accumulated, and the replica should not be removed, right?
As for the second item: should we send a BackupInstanceNotFaulty message when no degradation is observed during the next check in the monitor? BackupInstanceNotFaulty could clear all `backup_instance_faulty` entries for this node.
Unit tests need to be added for this.
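Since BackupInstanceNotFaulty is only being proposed here, a rough sketch of how such a clearing message might be handled (the handler name and data layout are hypothetical):

```python
# Hypothetical handler: a BackupInstanceNotFaulty message from a node clears
# that node's accumulated degradation votes across all instances, so stale
# votes observed hours apart do not pile up toward a quorum.
def process_backup_instance_not_faulty(backup_instances_faulty, frm):
    for counters in backup_instances_faulty.values():
        counters.pop(frm, None)   # drop this sender's votes everywhere

votes = {1: {"Alpha": 2, "Beta": 1}, 2: {"Alpha": 1}}
process_backup_instance_not_faulty(votes, "Alpha")
assert votes == {1: {"Beta": 1}, 2: {}}
```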
backup_instance_faulty_processor.on_backup_degradation(degraded_backups)

assert not (set(node.replicas.remove_replica_calls) - set(degraded_backups))
Why can't we just compare two sets here? `assert set(node.replicas.remove_replica_calls) == set(degraded_backups)`
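For reference, the two assertions are not equivalent; the subset-style check passes even when some degraded backup was never removed:

```python
# Toy values to show the difference between the two assertions.
removed = {1, 2}          # instances for which remove_replica was called
degraded = {1, 2, 3}      # instances reported as degraded

# The original assertion only checks "removed is a subset of degraded":
assert not (removed - degraded)    # passes, even though 3 was not removed

# Full set equality would additionally fail when a degraded backup
# was NOT removed:
assert removed != degraded         # equality check would raise here
```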
backup_instance_faulty_processor.restore_replicas()
# check that all replicas were restored and backup_instances_faulty has been cleaned
assert not backup_instance_faulty_processor.backup_instances_faulty
But was it non-empty before we called `restore_replicas`?
assert nodes.issubset(backup_instance_faulty_processor.backup_instances_faulty[instance_to_remove])
assert not node.replicas.remove_replica_calls

# check that messages from all nodes lead to replica removing
The comment says that we should send just 1 more message (sufficient for quorum)
node = FakeNode(tdir, config=tconf)
node.view_change_in_progress = False
node.requiredNumberOfInstances = len(node.replicas)
node.allNodeNames = ["Node{}".format(i)
Should it be `"Node{}".format(i+1)`?
We have a `range(1, (node.requiredNumberOfInstances - 1) * 3 + 2)`. But I will change it to `range((node.requiredNumberOfInstances - 1) * 3 + 1)` to make the code more understandable.
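The two forms generate the same node names only if the format index is shifted consistently; a quick check with an assumed value of 4 required instances:

```python
# Assuming requiredNumberOfInstances = 4, so (4 - 1) * 3 + 1 = 10 nodes.
required = 4
n = (required - 1) * 3 + 1

# Original: 1-based range with format(i); proposed: 0-based range with
# format(i + 1). Both yield Node1 .. Node10.
names_a = ["Node{}".format(i) for i in range(1, n + 1)]
names_b = ["Node{}".format(i + 1) for i in range(n)]
assert names_a == names_b == ["Node{}".format(k) for k in range(1, 11)]
```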