Update lagging validators on blob reads #2220

Open. Wants to merge 2 commits into base: main.

Contributor

@andresilva91 andresilva91 commented Jul 4, 2024

Motivation

We need to update lagging validators if they're not aware of blobs that have already been published.

Proposal

Several changes were needed:

  • Wrote the code that updates the lagging validators. It is initially used in two places: when we first stage block execution, and when we synchronize the chain state from the validators.
  • Created a test-only ReadBlob SystemOperation, so that this can be tested without creating new applications or modifying existing ones.
  • Added hashed_certificate_values and hashed_blobs to process_validated_block. This fixes an existing bug: we check for missing blobs there, but didn't pass that information along on a retry.
  • Moved synchronize_chain_state and try_synchronize_chain_state_from to the client, so that the lagging validator code can also be called on chain synchronization.
  • Fixed a bug where our test LocalValidatorClient wasn't handling certificates properly: with NoConfirm, when handling a validated certificate, it didn't actually call the handle code as it should.
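
The staging path from the first bullet can be pictured with a small synchronous toy model. This is only a sketch under stated assumptions: `StageError`, `stage_with_blob_recovery`, and the set-based `local` and `network` stores are hypothetical names for illustration, not the PR's actual types; only the `BlobNotFoundOnRead` variant name is taken from the diff. The real code is async and talks to validators.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a blob identifier.
type BlobId = u64;

#[derive(Debug, PartialEq)]
enum StageError {
    /// Named after the `BlobNotFoundOnRead` variant that appears in the diff.
    BlobNotFoundOnRead(BlobId),
}

/// Toy execution: "running" a block means reading each listed blob in order.
/// Fails on the first blob that the local store does not contain.
fn stage_block_execution(block: &[BlobId], local: &HashSet<BlobId>) -> Result<(), StageError> {
    match block.iter().find(|id| !local.contains(*id)) {
        Some(missing) => Err(StageError::BlobNotFoundOnRead(*missing)),
        None => Ok(()),
    }
}

/// Stages the block; whenever a blob is missing, copies it from `network`
/// (assumption: a set of blobs that can be fetched) and restarts execution
/// from the beginning. Returns how many restarts were needed.
fn stage_with_blob_recovery(
    block: &[BlobId],
    local: &mut HashSet<BlobId>,
    network: &HashSet<BlobId>,
) -> Result<usize, StageError> {
    let mut restarts = 0;
    loop {
        match stage_block_execution(block, local) {
            Ok(()) => return Ok(restarts),
            Err(StageError::BlobNotFoundOnRead(id)) => {
                // Give up if nobody can provide the blob.
                if !network.contains(&id) {
                    return Err(StageError::BlobNotFoundOnRead(id));
                }
                local.insert(id);
                restarts += 1;
            }
        }
    }
}
```

In this model every missing blob costs one full restart, so a block that reads several unknown blobs is re-executed several times.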

Test Plan

Three tests were written; more will come in following PRs.

Contributor Author

andresilva91 commented Jul 4, 2024

This stack of pull requests is managed by Graphite. Learn more about stacking.

@andresilva91 andresilva91 force-pushed the 07-04-update_lagging_validators_on_blob_reads branch from 396f3f8 to f96dc7a Compare July 4, 2024 16:16
@andresilva91 andresilva91 force-pushed the 06-25-read_blob_system_api branch 3 times, most recently from 22affac to 48c1f52 Compare July 8, 2024 12:42
@andresilva91 andresilva91 force-pushed the 07-04-update_lagging_validators_on_blob_reads branch 3 times, most recently from 246dfc5 to c017110 Compare July 8, 2024 13:09
Contributor

@afck afck left a comment

Nice! But so far this addresses only a part of the lagging validator situation: We also need to be able to re-propose another owner's ValidatedBlock proposal in a later round, even if neither we nor a quorum of validators have the blob. (Maybe that's a separate issue, for a different PR.)

linera-storage/src/lib.rs (outdated, resolved)
linera-execution/src/system.rs (outdated, resolved)
linera-core/src/client.rs (outdated, resolved)
linera-core/src/client.rs (outdated, resolved)
@@ -1328,6 +1335,59 @@ where
                    message.action = MessageAction::Reject;
                    continue;
                }
            } else if let ChainError::ExecutionError(
                ExecutionError::SystemError(SystemExecutionError::BlobNotFoundOnRead(blob_id)),
                _,
Contributor

We should probably also try to fetch the missing blobs in the ChainExecutionContext::IncomingMessage(index) case, and reject that message only after this has failed. (This doesn't apply to this PR, because messages can't read blobs yet.)

So we could leave stage_block_execution_and_discard_failing_messages as it is, and move this logic into an inner self.stage_block_execution(block) call (which in turn calls self.client.local_node.stage_block_execution(block.clone())). But as discussed, we will ultimately want to fetch the blobs without restarting execution from the beginning.
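
The suggested order for incoming messages (try to fetch first, reject only if that fails) could look roughly like the following. This is a minimal sketch, not the PR's code: `IncomingMessage`, `try_fetch_blob`, and `stage_messages` are hypothetical, the blob stores are plain sets, and `MessageAction::Accept` is assumed as the counterpart of the `MessageAction::Reject` seen in the diff.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a blob identifier.
type BlobId = u64;

#[derive(Debug, Clone, Copy, PartialEq)]
enum MessageAction {
    Accept,
    Reject,
}

/// Hypothetical incoming message that may need one blob to execute.
struct IncomingMessage {
    required_blob: Option<BlobId>,
    action: MessageAction,
}

/// Hypothetical fetch: copies the blob into `local` if `remote` has it.
/// Returns whether the blob is available locally afterwards.
fn try_fetch_blob(local: &mut HashSet<BlobId>, remote: &HashSet<BlobId>, blob_id: BlobId) -> bool {
    if remote.contains(&blob_id) {
        local.insert(blob_id);
        true
    } else {
        false
    }
}

/// Rejects a message only after fetching its missing blob has failed.
fn stage_messages(
    messages: &mut [IncomingMessage],
    local: &mut HashSet<BlobId>,
    remote: &HashSet<BlobId>,
) {
    for message in messages.iter_mut() {
        // Messages that need no blob are left untouched.
        let Some(blob_id) = message.required_blob else {
            continue;
        };
        if local.contains(&blob_id) {
            continue;
        }
        if !try_fetch_blob(local, remote, blob_id) {
            message.action = MessageAction::Reject;
        }
    }
}
```

Here a message is rejected only when the blob cannot be obtained from anywhere; a blob that is merely missing locally does not cause a rejection.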

Contributor Author

So is all of this for a separate PR, once messages are able to read blobs? 🤔

Contributor

If you prefer, but we have to make sure we remember it. The new system operation is meant to simulate what in production will actually be user operations and incoming messages. So I'm a bit worried about any code that only makes the operation work, but not messages, because then our tests give us a false sense of security.

linera-core/src/unit_tests/client_tests.rs (outdated, resolved)
linera-core/src/unit_tests/client_tests.rs (outdated, resolved)
linera-chain/src/data_types.rs (outdated, resolved)
@andresilva91 andresilva91 force-pushed the 07-04-update_lagging_validators_on_blob_reads branch from c017110 to 3400e87 Compare July 8, 2024 13:53
Base automatically changed from 06-25-read_blob_system_api to main July 8, 2024 14:50
@andresilva91 andresilva91 force-pushed the 07-04-update_lagging_validators_on_blob_reads branch 4 times, most recently from 9e103fd to 265a1a4 Compare July 10, 2024 13:44
@andresilva91 andresilva91 marked this pull request as ready for review July 10, 2024 14:02

graphite-app bot commented Jul 10, 2024

Graphite Automations

"Assign reviewers" took an action on this PR • (07/10/24)

6 reviewers were added to this PR based on Andre da Silva's automation.

@andresilva91 andresilva91 force-pushed the 07-04-update_lagging_validators_on_blob_reads branch from 265a1a4 to 7983873 Compare July 10, 2024 14:24
linera-chain/src/data_types.rs (outdated, resolved)
linera-chain/src/data_types.rs (outdated, resolved)
        _,
    ) = &**chain_error
    {
        self.update_lagging_validators(*blob_id).await?;
Contributor

Why do we need to update the lagging validators if our own local node is missing the blob? It would be clearer if synchronize_from_validators really only synchronized from, not to them.

Contributor Author

I see your point, but AFAIU in our test we try to confirm a block that was validated by a different client, from a client without the blob, and also with lagging validators. How should we deal with this case if we're not updating the lagging validators and getting the blob while processing the certificate? I just couldn't find an alternative, but I'm probably missing something 🤔

Contributor

I'd expect the second client to only update the lagging validators when they send back an error about the missing blob.
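
A rough sketch of that reactive flow, assuming the client already holds the blob: the certificate is sent first, and a validator receives the blob only if it answers with a missing-blob error. `NodeError::BlobNotFound`, `Validator`, and `submit_certificate` are hypothetical names for this toy model, and the real exchange would be async.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a blob identifier.
type BlobId = u64;

#[derive(Debug, PartialEq)]
enum NodeError {
    /// Hypothetical error a validator returns when it lacks a required blob.
    BlobNotFound(BlobId),
}

/// Toy validator that only tracks which blobs it knows.
#[derive(Default)]
struct Validator {
    blobs: HashSet<BlobId>,
}

impl Validator {
    /// Accepts a certificate whose block reads `required`, if the blob is known.
    fn handle_certificate(&self, required: BlobId) -> Result<(), NodeError> {
        if self.blobs.contains(&required) {
            Ok(())
        } else {
            Err(NodeError::BlobNotFound(required))
        }
    }
}

/// Sends the certificate to every validator. A validator is updated only after
/// it reports the blob as missing; the certificate is then sent again.
/// Returns how many validators had to be updated.
fn submit_certificate(validators: &mut [Validator], required: BlobId) -> usize {
    let mut uploads = 0;
    for validator in validators.iter_mut() {
        if let Err(NodeError::BlobNotFound(id)) = validator.handle_certificate(required) {
            // Assumption: the client holds the blob and can upload it.
            validator.blobs.insert(id);
            uploads += 1;
            validator
                .handle_certificate(required)
                .expect("validator has the blob after the upload");
        }
    }
    uploads
}
```

Validators that are already up to date never receive the blob, so a second submission of the same certificate triggers no uploads.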

linera-core/src/unit_tests/test_utils.rs (outdated, resolved)
Contributor Author

I think this test_re_propose_validated test has only been passing because of the NoConfirm bug 🤔

@andresilva91 andresilva91 force-pushed the 07-04-update_lagging_validators_on_blob_reads branch from 7983873 to 8fba8a5 Compare July 12, 2024 18:17
Contributor

afck commented Jul 13, 2024

You mean the existing one on main? If I only apply your test_utils changes, test_re_propose_validated still passes for me on main.
