
Stop cleaning monitor updates on new block connect #2779

Merged: 1 commit into lightningdevkit:main on Dec 15, 2023

Conversation

@G8XSU (Contributor) commented Dec 8, 2023

Previously, we cleaned up monitor updates both at the consolidation threshold and on new block connects. With this change we only clean up when our consolidation criteria are met. We also remove the monitor read from the cleanup logic in the case of update consolidation.
Note: In the case of a channel-closing monitor update, we still need to read the old monitor before persisting the new one in order to determine the cleanup range.

Closes #2706
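
For readers skimming the thread, here is a minimal sketch of the new criterion. It is an illustration of the behavior described above, not the actual code in lightning/src/util/persist.rs; the sentinel value follows the u64::MAX discussion later in this thread.

```rust
// Sentinel update id used for channel-closing monitor updates (u64::MAX,
// per the discussion further down in this thread).
const CLOSED_CHANNEL_UPDATE_ID: u64 = u64::MAX;

/// Sketch of the new cleanup criterion: a full persist (and hence cleanup)
/// happens only at the consolidation threshold or on a closing update --
/// never merely because a new block connected.
fn is_consolidation_point(update_id: u64, maximum_pending_updates: u64) -> bool {
    update_id == CLOSED_CHANNEL_UPDATE_ID || update_id % maximum_pending_updates == 0
}
```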

@codecov-commenter commented Dec 8, 2023

Codecov Report

Attention: 7 lines in your changes are missing coverage. Please review.

Comparison is base (c2bbfff) 88.53% compared to head (ef09096) 88.55%.
Report is 181 commits behind head on main.

Files Patch % Lines
lightning/src/util/persist.rs 85.41% 5 Missing and 2 partials ⚠️


Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2779      +/-   ##
==========================================
+ Coverage   88.53%   88.55%   +0.02%     
==========================================
  Files         114      114              
  Lines       89497    91569    +2072     
  Branches    89497    91569    +2072     
==========================================
+ Hits        79237    81092    +1855     
- Misses       7882     7988     +106     
- Partials     2378     2489     +111     


let maybe_old_monitor = self.read_monitor(&monitor_name);
match maybe_old_monitor {
Ok((_, ref old_monitor)) => {
// Check that this key isn't already storing a monitor with a higher update_id
@G8XSU (author):
Removed this check since we no longer read old_monitor in the case of a full monitor persist.

Collaborator:
Yea, I think it's fine. I'd previously suggested doing it here, but really ChainMonitor should handle it for us.

}
// This means the channel monitor is new.
Err(ref e) if e.kind() == io::ErrorKind::NotFound => {}
_ => return chain::ChannelMonitorUpdateStatus::UnrecoverableError,
@G8XSU (author):
We no longer fail in this case.

@@ -641,59 +624,6 @@ where
&monitor_bytes,
) {
Ok(_) => {
// Assess cleanup. Typically, we'll clean up only between the last two known full
@G8XSU (author):
Moved the cleanup logic to update_persisted_channel, so that we only clean up on consolidation due to max_pending_updates.
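
To make the restructuring concrete, the decision update_persisted_channel now effectively makes can be modeled roughly as below. This is a sketch under assumptions: the enum, names, and exact range endpoints are illustrative placeholders rather than the library's API. What it captures is that only the threshold and channel-closing cases trigger a full persist plus cleanup, and only the closing case still reads the old monitor.

```rust
/// Illustrative model of the persister's choice after this change; not the
/// library's API.
#[derive(Debug, PartialEq)]
enum CleanupDecision {
    /// Persist without deleting anything (ordinary updates, and the full
    /// monitor re-persist that happens on a new block connect).
    NoCleanup,
    /// Persist the full monitor, then delete stale updates in `start..=end`.
    CleanupRange { start: u64, end: u64 },
    /// Channel-closing update: the cleanup range depends on the old monitor's
    /// latest update_id, so the old monitor must still be read first.
    ReadOldMonitorForRange,
}

const CLOSED_CHANNEL_UPDATE_ID: u64 = u64::MAX;

/// `update_id` is `None` when no `ChannelMonitorUpdate` accompanies the call,
/// e.g. on a new block connect -- which now never triggers cleanup.
fn decide_cleanup(update_id: Option<u64>, maximum_pending_updates: u64) -> CleanupDecision {
    match update_id {
        None => CleanupDecision::NoCleanup,
        Some(CLOSED_CHANNEL_UPDATE_ID) => CleanupDecision::ReadOldMonitorForRange,
        Some(id) if id % maximum_pending_updates == 0 => CleanupDecision::CleanupRange {
            // The range is deterministic from the threshold, so no monitor
            // read is needed. Deleting an id for which no update object was
            // ever written is harmless (see the discussion of the removed
            // panic below).
            start: id.saturating_sub(maximum_pending_updates),
            end: id,
        },
        Some(_) => CleanupDecision::NoCleanup,
    }
}
```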

} else if update_id != start && monitor.get_latest_update_id() != CLOSED_CHANNEL_UPDATE_ID
{
// We're deleting something we should know doesn't exist.
panic!(
@G8XSU (author):
There is a possibility of issuing some extra deletes: no update is persisted in the case of a block connect, but we will still try to clean up an update with that update_id.

TheBlueMatt previously approved these changes Dec 8, 2023
@G8XSU (author) commented Dec 8, 2023

Squashed commits into one.

@G8XSU (author) commented Dec 8, 2023

Tagging @domZippilli as well.

@TheBlueMatt added this to the 0.0.119 milestone Dec 11, 2023
@domZippilli (Contributor) commented:

Tagging @domZippilli as well.

LGTM. I haven't cloned it and taken an in-depth look yet, but this seems to make sense. Incidentally, we had a problem at c= where we weren't connecting new blocks for a little bit, so with no new blocks we were effectively running in a manner very similar to this changeset, which gives a kind of "preview" of what we should end up with. The I/O utilization was really low, and memory was dead flat. After a few restarts, things got going again, and there were no force closes or anything either. So I'm optimistic that this changeset will be good not only for reduced heap fragmentation, but also for even further reduced I/O churn.

I guess the main question on my mind is: We were designing this very defensively on the first pass. How confident are we that we really don't need to do that monitor read anymore?

@TheBlueMatt (Collaborator) commented:

I guess the main question on my mind is: We were designing this very defensively on the first pass. How confident are we that we really don't need to do that monitor read anymore?

For users of the ChainMonitor, quite confident - it already does the "fail on dup monitor" stuff for us. I suppose in theory someone could prune a monitor, leaving it in its datastore but removing it from the in-memory store, and end up overwriting that monitor, but if it was safe to prune (i.e., nothing more we need to claim on-chain for a closed channel), then it was also safe to overwrite. Now, someone may have a bug where they remove a monitor too soon, and then an attacker could exploit this to make them unable to recover after the bug, but that seems a somewhat esoteric concern.

The more general reason we had it is for those trying to use this without a ChainMonitor, but I'm not sure it's worth paying any real price for that - users implementing chain::Watch manually should read its documentation and implement the API faithfully :).

@tnull self-requested a review December 11, 2023 18:41
@G8XSU (author) commented Dec 11, 2023

For now, I'm not too worried about overwriting a monitor without reading it first, since we already do that in the non-MUP implementation.

We were designing this very defensively on the first pass. How confident are we that we really don't need to do that monitor read anymore?

I think earlier we were focusing more on not creating redundant deletes.
With the approach in this PR, we just stop cleaning on block connects (which was in any case being forced upon MUP). We could have done this in the first attempt, but I think we missed this minor optimization.

Worst case, there will be some stale, un-deleted monitor updates, but that should be fine.

@tnull (Contributor) left a comment:
Did a first pass, generally looks good. Some questions, some nits.

(Four outdated review threads on lightning/src/util/persist.rs, resolved; not shown.)
);
Some((start, end))
}
_ => None
@tnull (Contributor):
Is there a reason why we shouldn't do some cleanup (i.e., default to the behavior of the else clause below) if we fail to read the old monitor?

Collaborator:
We could also probably just panic instead.

@G8XSU (author), Dec 12, 2023:
As mentioned here, earlier we were returning an UnrecoverableError in this case, but I think not failing/panicking would be better here, since the worst case is wasted storage and nothing critical to the functioning of the node.

We can't do some default cleanup since we don't know which update range to clean up. Let me know if you have any suggestions.

@G8XSU (author):
Defaulting to the else clause below won't help, since it is pointless to clean up u64::MAX .. u64::MAX-max_pending_updates.
It will not result in any deletes, just extra IO calls.
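
To spell that out with concrete numbers (a hypothetical illustration with a placeholder helper, not the library code):

```rust
const CLOSED_CHANNEL_UPDATE_ID: u64 = u64::MAX;

// The "else clause" derives the cleanup range from the monitor that was just
// persisted, without reading anything.
fn default_range(latest_update_id: u64, max_pending_updates: u64) -> (u64, u64) {
    (latest_update_id.saturating_sub(max_pending_updates), latest_update_id)
}

fn main() {
    // For a closing update, the freshly written monitor reports the sentinel
    // id, so the derived range sits just below u64::MAX -- update ids for
    // which no update object was ever written.
    let (start, end) = default_range(CLOSED_CHANNEL_UPDATE_ID, 100);
    assert_eq!((start, end), (u64::MAX - 100, u64::MAX));
    // Deleting those keys is a no-op: extra IO calls, zero actual deletes.
    // The real last update id (and hence the real range) can only come from
    // reading the old monitor, which is why the closing case still reads it.
}
```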

@tnull (Contributor), Dec 13, 2023:
As mentioned here, earlier we were returning an UnrecoverableError in this case, but I think not failing/panicking would be better here, since the worst case is wasted storage and nothing critical to the functioning of the node.

Hmm, yes, we could ignore the error in this specific case I guess, but generally unrecoverable IO errors are fatal for us? So "as a policy" it might be generally preferable to stop as soon as we see one? But I think we also had this discussion in the original PR and the conclusion was not to error out under certain circumstances, so I don't want to block this PR by restarting that conversation.

We can't do some default cleanup since we don't know which update range to clean up. Let me know if you have any suggestions.

Defaulting to the else clause below won't help, since it is pointless to clean up u64::MAX .. u64::MAX-max_pending_updates. It will not result in any deletes, just extra IO calls.

Ah, right, the update ids will be u64::MAX for closed channels. Fair point. It's still a bit weird to get an unrecoverable IO error here and just continue on as if nothing happened. 🤷‍♂️

If you're positive that we shouldn't error out or panic here, let's at least leave a comment noting that the error case is unhandled and explaining why.

@G8XSU (author), Dec 13, 2023:

Yeah, in an earlier version of the code in this PR it was a bit more evident; now it is slightly hidden (with .ok()). Will add a comment.

It's still a bit weird to get an unrecoverable IO error here and just continue on as if nothing happened.

Note that we were not getting an unrecoverable error, but rather converting a monitor read failure into an unrecoverable one.

There are other instances, such as an error in a delete call, where we don't fail. It's best to treat this as optimistic cleanup.

@G8XSU (author):

we could ignore the error in this specific case I guess, but generally unrecoverable IO errors are fatal for us? So "as a policy" it might be generally preferable to stop as soon as we see one?

I don't think that is an ideal policy for all cases. I would hate it if my node panicked in production because a read call got throttled, when the read was totally optional for the functioning of my node.

@tnull (Contributor), Dec 14, 2023:
Note that we were not getting an unrecoverable error, but rather converting a monitor read failure into an unrecoverable one.

Yeah, that's why I spoke of lower-case 'unrecoverable error' rather than UnrecoverableError. In any case, it's still unrecoverable for us at this point.

I don't think that is an ideal policy for all cases. I would hate it if my node panicked in production because a read call got throttled, when the read was totally optional for the functioning of my node.

If we 'properly' fail to read the monitor due to an IO error at this point it would indicate something is seriously wrong and the user needs to take action ASAP, as much more serious issues could arise from broken IO down the line. That is really the purpose of UnrecoverableError, no? And, as Dom mentioned below, any recoverable errors should be handled in-line by the KVStore implementation.

I mean, sure, it's not strictly a bug here to continue on if we fail to read the monitor, but I'm not sure if we want to trust the user to have proper log monitoring set up to catch that there are IO errors happening before it's too late.

Contributor:

I agree that panicking here, though potentially inconvenient for the user, can definitely save them from a big mishap later down the line.

However, I also notice one thing: the old_monitor is only read once throughout the life of a channel, that is, when the channel is closed.

So, if IO errors are happening for a user, they should have been detected by this point in the life of the channel. A panic here, though potentially useful, will therefore probably be inconsequential in the grand scheme of things.

(Another outdated review thread on lightning/src/util/persist.rs, resolved; not shown.)
@domZippilli (Contributor) commented Dec 14, 2023 via email

@TheBlueMatt (Collaborator) left a comment:
I don't have a super strong opinion about panicking early on the read failure, but it may be worth noting that it's possible the read failure hit a bad sector/flash segment and the write that we just did moved the data to a new segment/remapped sector and managed to fix the issue, so it's not like it's impossible that continuing could Just Work, even outside of some enterprise environment.

In any case, unless someone feels super strongly, I say we go ahead and merge this.

@G8XSU (author) commented Dec 14, 2023

In summary, I don't think we should panic.
Even after retries, if a read fails in cleanup, we shouldn't fail, since that doesn't result in any functional degradation to the node.
A few KBs of wasted storage for the client doesn't mean we should shut down the node immediately.

In cleanup, we don't even fail if all the delete calls start failing. A read failure in "optimistic cleanup" is no different or worse than a delete failure.

I'm not sure if we want to trust the user to have proper log monitoring set up to catch that there are IO errors happening before it's too late.

A read failure in cleanup certainly does not result in any mishap later down the line. If, later on, there is a read failure that is critical for functioning, we should fail there instead of preempting it here (for example, on node restart).
The same way we trust clients to have a metric on node shutdown, we should trust our clients to have proper checks in place for read-error metrics, storage usage metrics, etc. And in any case, we will fail if there is a critical read/write failure.

@tnull (Contributor) left a comment:
In summary, I don't think we should panic.

Yes, as mentioned above and elsewhere it feels a bit odd to me to just proceed on read failure, but it's also not really wrong or critical here.

LGTM

SP::Target: SignerProvider + Sized
{
// Cleans up monitor updates for given monitor in range `start..=end`.
fn cleanup_in_range(&self, monitor_name: MonitorName, start: u64, end: u64) {
@tnull (Contributor):
non-blocking nit: Fine for now, but if we touch this again we may want to change the semantics of cleanup_in_range. I already brought this up in the original PR, but usually start..end in Rust has the semantics start <= x < end (cf. https://doc.rust-lang.org/std/ops/struct.Range.html), so for a reader it may be a bit unexpected to have this work as start..=end. If we want to keep this version we could at least rename the variables to first/last to be clearer and to show that this is unconventional behavior.

@G8XSU (author):
I didn't want to complicate the semantics of calculating cleanup_range by changing the end calculation.
It probably makes sense to rename it to last (but that's not super critical for now, since it is a private function).
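
As a reference point for the naming discussion, the inclusive-range behavior boils down to something like the sketch below. `remove_update_key` is a hypothetical stand-in for deleting one persisted update object; the real cleanup_in_range builds key names from the monitor_name and treats delete failures as non-fatal, per the "optimistic cleanup" discussion above.

```rust
use std::io;

/// Minimal sketch of an inclusive-range cleanup. `remove_update_key` is a
/// hypothetical stand-in for deleting one persisted update object.
fn cleanup_in_range_sketch<F>(start: u64, end: u64, mut remove_update_key: F)
where
    F: FnMut(u64) -> io::Result<()>,
{
    // Inclusive bound: both `start` and `end` are cleaned up, i.e. the
    // `start..=end` semantics discussed in this thread.
    for update_id in start..=end {
        // Best-effort ("optimistic") cleanup: a failed delete only leaves a
        // stale update object behind, so the error is not propagated.
        if let Err(e) = remove_update_key(update_id) {
            eprintln!("Failed to clean up monitor update {update_id}: {e}");
        }
    }
}
```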

@tnull merged commit e1897de into lightningdevkit:main Dec 15, 2023
15 checks passed
@shaavan (Contributor) left a comment:
Post-merge ACK

I also agree with not panicking here, as its potential usefulness is outweighed by the unnecessary preemptive failure it might create.

Other than this, the PR looks perfect! Thanks a lot for this great cleanup improvement!

Summary and notes:
  1. Not needing to clean up the monitor in the case of persist_new_channel makes sense, as we are doing a full persist.
  2. Not needing to read the old_monitor when update_id % maximum_pending_updates == 0 in order to determine the cleanup range is a great use of the deterministic length of the cleanup.
  3. Separating cleanup_in_range into its own function makes the code modular and the functionality usable separately, a good preemptive measure for future updates.

As a small follow-up, we can add a comment in the persist_new_channel fn to explain why we don't need to do an old_monitor read & cleanup there.

Issue closed by this pull request:
Stop reading monitors when persisting in updating persister
6 participants