Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bookkeeping for reachability of nodes; reporting them to Lokid #268

Merged
merged 5 commits into from
Sep 6, 2019

Conversation

msgmaxim
Copy link
Contributor

  • Keep track of all nodes that failed the reachability test.
  • Periodically test two nodes: one randomly selected and one that is known to have failed previously (this is done to give failing nodes a better chance to be recommissioned as soon as possible once they are back online).
  • When a node is not reachable for longer than 120mins (to match uptime proofs) it will be reported to Lokid via a new endpoint. The same endpoint is used to report previously offending nodes as healthy.
  • To be on the safe side and keep things simple, there is currently some redundancy in how many times a node's status is reported to Lokid.

common/include/loki_common.h Outdated Show resolved Hide resolved
httpserver/service_node.cpp Show resolved Hide resolved
httpserver/service_node.cpp Outdated Show resolved Hide resolved
httpserver/service_node.cpp Outdated Show resolved Hide resolved
httpserver/service_node.cpp Outdated Show resolved Hide resolved
httpserver/service_node.cpp Show resolved Hide resolved
class reach_record_t {

// The time the node failed for the first time
// (and hasn't come back online)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this comment be above line 20?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, updated.


bool reachability_records_t::record_unreachable(const sn_pub_key_t& sn) {

auto it = offline_nodes_.find(sn);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the iterator can be const while still being able to mutate the element in the map.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right. Updated.

LOKI_LOG(warn, " - will REPORT this node to Lokid!");
return true;
} else {
if (it->second.reported) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change to

} else if (...) {

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, and I changed the order to simplify further.


void reachability_records_t::set_reported(const sn_pub_key_t& sn) {

auto it = offline_nodes_.find(sn);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Const iterator?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

}

void Swarm::update_state(const all_swarms_t& swarms,
const std::vector<sn_record_t>& decommissioned,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if I am only looking at a partial diff, but I can't see where this new decommissioned variable is being used in update_state?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a bug, good catch! Should be adding them to all_funded_nodes_, otherwise they wouldn't be tested and won't have a chance to come back online! Fixed now.

@msgmaxim msgmaxim merged commit e60b24e into oxen-io:dev Sep 6, 2019
@msgmaxim msgmaxim deleted the reachable-nodes branch September 9, 2019 02:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants