-
Notifications
You must be signed in to change notification settings - Fork 47
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bookkeeping for reachability of nodes; reporting them to Lokid #268
Conversation
msgmaxim
commented
Aug 30, 2019
- Keep track of all nodes that failed the reachability test.
- Periodically test two nodes: one randomly selected and one that is known to have failed previously (this is done to give failing nodes a better chance to be recommissioned as soon as possible once they are back online).
- When a node is not reachable for longer than 120mins (to match uptime proofs) it will be reported to Lokid via a new endpoint. The same endpoint is used to report previously offending nodes as healthy.
- To be on the safe side and keep things simple, there is currently some redundancy in how many times a node's status is reported to Lokid.
dcc4e96
to
effcd7e
Compare
effcd7e
to
69aa778
Compare
httpserver/reachability_testing.h
Outdated
class reach_record_t { | ||
|
||
// The time the node failed for the first time | ||
// (and hasn't come back online) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this comment be above line 20?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, updated.
httpserver/reachability_testing.cpp
Outdated
|
||
bool reachability_records_t::record_unreachable(const sn_pub_key_t& sn) { | ||
|
||
auto it = offline_nodes_.find(sn); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the iterator can be const while still being able to mutate the element in the map.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you are right. Updated.
httpserver/reachability_testing.cpp
Outdated
LOKI_LOG(warn, " - will REPORT this node to Lokid!"); | ||
return true; | ||
} else { | ||
if (it->second.reported) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change to
} else if (...) {
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, and I changed the order to simplify further.
httpserver/reachability_testing.cpp
Outdated
|
||
void reachability_records_t::set_reported(const sn_pub_key_t& sn) { | ||
|
||
auto it = offline_nodes_.find(sn); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Const iterator?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
} | ||
|
||
void Swarm::update_state(const all_swarms_t& swarms, | ||
const std::vector<sn_record_t>& decommissioned, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if I am only looking at a partial diff, but I can't see where this new decommissioned
variable is being used in update_state
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a bug, good catch! Should be adding them to all_funded_nodes_
, otherwise they wouldn't be tested and won't have a chance to come back online! Fixed now.