
INDY-1683: bugfix backup_instance_faulty_processor with quorum logic #954

Merged

Conversation

Contributor

@Toktar Toktar commented Oct 24, 2018

Changes:

  • add tests
  • bugfix backup_instance_faulty_processor with quorum logic

from plenum.test.test_node import ensureElectionsDone, checkNodesConnected


@pytest.fixture(scope="module", params=[{"instance_degraded": "local", "instance_primary_disconnected": "local"},
Contributor

I think we can write unit tests for BackupInstanceFaultyProcessor. Please have a look at the fake_node fixture we can use for it.
I can imagine the following unit tests (a rough sketch of the first two is given after the list):

  • test that 1 call of on_backup_degradation leads to replica removal with local strategy
  • test that 1 call of on_backup_degradation does not lead to replica removal with quorum strategy
  • test that 1 call of on_backup_primary_disconnected leads to replica removal with local strategy
  • test that 1 call of on_backup_primary_disconnected does not lead to replica removal with quorum strategy
  • test that process_backup_instance_faulty_msg does not lead to removal in case of local strategy
  • test that f calls of process_backup_instance_faulty_msg with the same messages do not lead to removal in case of quorum strategy
  • test that f+1 calls of process_backup_instance_faulty_msg with the same messages lead to removal in case of quorum strategy
  • test that a BackupInstanceFaulty from our Node is not required to remove a replica (just a quorum of any nodes)
  • test that only BackupInstanceFaulty with the same viewNo as the current one leads to removal
  • test that n BackupInstanceFaulty with different instances, where every instance has a count of f in total, do not lead to removal (in case of quorum)
  • test that n BackupInstanceFaulty with different instances, where one instance has a count of f+1 and the others have f, lead to removal of this 1 instance only
  • test that n BackupInstanceFaulty with different instances, where all instances have a count of f+1 in total, lead to removal of all instances
  • test that restore_replicas restores all replicas
  • test BackupInstanceFaulty with empty instances values
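
A rough sketch of what the first two cases could look like (illustrative only, not the merged test code; the import path, the constructor taking the node, and the fixture names here are my assumptions, while remove_replica_calls is the bookkeeping used by the FakeNode elsewhere in this PR):

from plenum.server.backup_instance_faulty_processor import BackupInstanceFaultyProcessor


def test_one_degradation_removes_replica_with_local_strategy(fake_node_local):
    processor = BackupInstanceFaultyProcessor(fake_node_local)
    processor.on_backup_degradation([1])
    # with the "local" strategy a single report is enough to remove the replica
    assert 1 in fake_node_local.replicas.remove_replica_calls


def test_one_degradation_does_not_remove_replica_with_quorum_strategy(fake_node_quorum):
    processor = BackupInstanceFaultyProcessor(fake_node_quorum)
    processor.on_backup_degradation([1])
    # with the "quorum" strategy one local report must not remove anything
    assert not fake_node_quorum.replicas.remove_replica_calls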

Contributor Author

@Toktar Toktar Oct 25, 2018

  1. Added test_on_backup_degradation_for_one with local strategy
  2. Added test_on_backup_degradation_for_one with quorum strategy
  3. Added test_on_backup_primary_disconnected_for_one with local strategy
  4. Added test_on_backup_primary_disconnected_for_one with quorum strategy
  5. Already covered in __process_backup_instance_faulty_msg_work_with_different_msgs()
  6. Added a check in __process_backup_instance_faulty_msg_work_with_different_msgs(), but I'm not sure that this check is correct. Should we sleep for some time, and if so, for how long?
  7. In test_process_backup_instance_faulty_msg() we send n-1 messages, which is more than f. Is that enough?
  8. Changed the sending in __process_backup_instance_faulty_msg_work_with_different_msgs()
  9. Added test_process_backup_instance_with_incorrect_view_no
  10. Why shouldn't it lead to removal?
  11. Already covered in test_restore_replicas
  12. Added test_process_backup_instance_empty_msg

Changes:
- change logic to remove replica with quorum of own messages
- add tests to test_instance_faulty_processor.py

Signed-off-by: toktar <renata.toktar@dsr-corporation.com>
Changes:
- refactoring test_instance_faulty_processor.py
- update test_replica_removing_after_node_started.py
- change REPLICAS_REMOVING_WITH_DEGRADATION to quorum

Signed-off-by: toktar <renata.toktar@dsr-corporation.com>
if inst_id not in self.node.replicas.keys():
    continue
self.backup_instances_faulty.setdefault(inst_id, dict()).setdefault(frm, 0)
self.backup_instances_faulty[inst_id].setdefault(self.node.name, 0)
self.backup_instances_faulty[inst_id][frm] += 1
if not self.node.quorums.backup_instance_faulty.is_reached(
Contributor

Please add a comment explaining this condition

Contributor Author

I will add one. This code adds a default value for the message sender and for our own node's messages. But maybe you see a cleaner way to do it.
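
For example, the explanatory comment could read roughly like this (a sketch over the snippet quoted above, not the merged wording):

# Register a zero vote count for the sender of this message and for our own
# node the first time either is seen for this instance, so the quorum check
# below can look both of them up, then count the sender's vote.
self.backup_instances_faulty.setdefault(inst_id, dict()).setdefault(frm, 0)
self.backup_instances_faulty[inst_id].setdefault(self.node.name, 0)
self.backup_instances_faulty[inst_id][frm] += 1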

"""
instance_to_remove = 1
view_no = txnPoolNodeSet[0].viewNo
Node will change view even though it does not find the master to be degraded
Contributor

Are we talking about a view change or removal of replicas here? Is the comment correct?

  if not self.node.quorums.backup_instance_faulty.is_reached(
-         len(self.backup_instances_faulty[inst_id])):
+         len(self.backup_instances_faulty[inst_id].keys())) \
+         and not self.node.quorums.backup_instance_faulty.is_reached(
Contributor

  1. Should we use a different quorum here?
  2. Should we take into account only degradations from this node that happen in a row? That is, if we face a degradation once per hour, it should not be accumulated, and the replica should not be removed, right?

Contributor

As for the second item: should we send BackupInstanceNotFaulty when no degradation is observed during the next check in the monitor? BackupInstanceNotFaulty could clear all backup_instances_faulty entries for this node.
Unit tests need to be added for this.
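
For reference, the condition quoted above presumably continues roughly like this. This is a hedged reconstruction: the second argument is cut off in the diff and is inferred from the commit message "change logic to remove replica with quorum of own messages".

if not self.node.quorums.backup_instance_faulty.is_reached(
        len(self.backup_instances_faulty[inst_id].keys())) \
        and not self.node.quorums.backup_instance_faulty.is_reached(
            self.backup_instances_faulty[inst_id][self.node.name]):
    # neither a quorum of distinct reporters nor a quorum of our own reports
    continue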


backup_instance_faulty_processor.on_backup_degradation(degraded_backups)

assert not (set(node.replicas.remove_replica_calls) - set(degraded_backups))
Contributor

Why can't we just compare two sets here? assert set(node.replicas.remove_replica_calls) == set(degraded_backups)


backup_instance_faulty_processor.restore_replicas()
# check that all replicas were restored and backup_instances_faulty has been cleaned
assert not backup_instance_faulty_processor.backup_instances_faulty
Contributor

But was it non-empty before we called restore_replicas?
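
To illustrate the point, the test could first check that there was something to clean (sketch only):

# the processor should have tracked at least one faulty instance before restoring
assert backup_instance_faulty_processor.backup_instances_faulty
backup_instance_faulty_processor.restore_replicas()
# all replicas restored and the bookkeeping cleared
assert not backup_instance_faulty_processor.backup_instances_faulty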

assert nodes.issubset(backup_instance_faulty_processor.backup_instances_faulty[instance_to_remove])
assert not node.replicas.remove_replica_calls

# check that messages from all nodes lead to replica removing
Contributor

The comment says messages from all nodes, but sending just 1 more message (sufficient for a quorum) should be enough.

node = FakeNode(tdir, config=tconf)
node.view_change_in_progress = False
node.requiredNumberOfInstances = len(node.replicas)
node.allNodeNames = ["Node{}".format(i)
Contributor

Should it be "Node{}".format(i+1)?

Contributor Author

We have a range(1, (node.requiredNumberOfInstances - 1) * 3 + 2). But I will change it to range((node.requiredNumberOfInstances - 1) * 3 + 1) to make the code more understandable.
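
To illustrate the arithmetic being discussed (hypothetical value, assuming requiredNumberOfInstances == 2, i.e. a 4-node pool):

n_instances = 2
["Node{}".format(i) for i in range(1, (n_instances - 1) * 3 + 2)]
# -> ['Node1', 'Node2', 'Node3', 'Node4']
["Node{}".format(i) for i in range((n_instances - 1) * 3 + 1)]
# -> ['Node0', 'Node1', 'Node2', 'Node3']; format(i + 1) would keep the 1-based names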

@ashcherbakov ashcherbakov merged commit 85d9bb4 into hyperledger:master Nov 7, 2018