store: implement repair process #11

Open
peterbourgon opened this Issue Jan 16, 2017 · 9 comments

@peterbourgon
Member

peterbourgon commented Jan 16, 2017

The repair process should walk the complete data set (the complete timespace) and essentially perform read repair. When it completes, we should have guaranteed that all records are at the desired replication factor. (There may be smarter ways to do this.)
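
A rough sketch of what that walk could look like, just to make the idea concrete. The types and method names here (`Record`, `Store`, `Query`, `Replicate`) are hypothetical stand-ins, not the actual store API:

```go
// Hypothetical sketch only: these types stand in for whatever the store
// layer actually exposes; none of them exist in the oklog codebase.
package repair

import "time"

// Record is a single log record, identified by its ULID.
type Record struct {
	ULID string
	Data []byte
}

// Store is the minimal surface a repair walk would need.
type Store interface {
	// Query returns every copy of every record in [from, to), across all nodes.
	Query(from, to time.Time) ([]Record, error)
	// Replicate writes a record to enough peers to bring it up to the
	// desired replication factor (missing = copies still needed).
	Replicate(r Record, missing int) error
}

// Repair walks the complete timespace in fixed windows and re-replicates
// any record observed fewer times than the desired replication factor.
func Repair(s Store, oldest, newest time.Time, window time.Duration, factor int) error {
	for from := oldest; from.Before(newest); from = from.Add(window) {
		records, err := s.Query(from, from.Add(window))
		if err != nil {
			return err
		}
		copies := map[string]int{}    // ULID -> number of copies observed
		sample := map[string]Record{} // one representative copy per ULID
		for _, r := range records {
			copies[r.ULID]++
			sample[r.ULID] = r
		}
		for ulid, n := range copies {
			if n < factor {
				if err := s.Replicate(sample[ulid], factor-n); err != nil {
					return err
				}
			}
		}
	}
	return nil
}
```

The interesting design questions are how to size the windows and how to pace the walk so it doesn't swamp normal query traffic.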

@lafikl
lafikl commented Jan 17, 2017

Wouldn't a Merkle tree-like system be a better alternative, at least to filter out segments that don't need replicating?

@peterbourgon
Member
peterbourgon commented Jan 17, 2017

Hmm, yes, potentially! I'd have to do a bit more research, but it seems like each store node could keep a Merkle tree of the segments it knows about, and that could be done without coördination.

@lafikl
lafikl commented Jan 17, 2017

You could repurpose the same hashes to detect data corruption at the segment level, too.

I have built a similar system at work to detect divergence between different systems.
It's a bit of an expensive operation, but totally worth it.

@lafikl
lafikl commented Jan 17, 2017

I think hashing the ULIDs alone, without fetching the data, would be enough, since they're unique.
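
To make the combined idea concrete, here is one way a node might do it, sketched in Go (hypothetical code, not anything in the oklog tree): hash the sorted segment ULIDs per time window as leaves, fold the leaves into a Merkle root, and only exchange segment lists for the windows whose leaf hashes disagree.

```go
// Hypothetical sketch: hash segment ULIDs per time window, then fold the
// window hashes into a Merkle root. Two nodes whose roots match hold the
// same segments; if the roots differ, they only need to compare the
// windows whose leaf hashes differ. None of this is existing oklog code.
package merkle

import (
	"crypto/sha256"
	"sort"
)

// leaf hashes the sorted segment ULIDs that fall inside one time window.
func leaf(ulids []string) [32]byte {
	sort.Strings(ulids)
	h := sha256.New()
	for _, u := range ulids {
		h.Write([]byte(u))
	}
	var out [32]byte
	copy(out[:], h.Sum(nil))
	return out
}

// root folds the ordered window hashes pairwise until one hash remains.
func root(leaves [][32]byte) [32]byte {
	if len(leaves) == 0 {
		return sha256.Sum256(nil)
	}
	for len(leaves) > 1 {
		var next [][32]byte
		for i := 0; i < len(leaves); i += 2 {
			if i+1 == len(leaves) {
				next = append(next, leaves[i]) // odd leaf out: carry it up unchanged
				continue
			}
			var pair []byte
			pair = append(pair, leaves[i][:]...)
			pair = append(pair, leaves[i+1][:]...)
			next = append(next, sha256.Sum256(pair))
		}
		leaves = next
	}
	return leaves[0]
}
```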

@untoldone
untoldone commented Mar 17, 2018

@peterbourgon I've spent the last hour or so reading about oklog and what you've been working on -- really excited to find it!

I was just hoping to clarify the implications of this pending enhancement.

In the case that a store node is lost, is there currently no (simple) way to restore the cluster to a state where a node can be lost again without permanently losing segments?

@peterbourgon
Member
peterbourgon commented Mar 18, 2018

@untoldone If you're running a 3-node cluster with a replication factor of 2, and a single node dies and (let's say) is replaced by a new, empty node, then ca. 33% of your log data will exist only on a single node and will be vulnerable to permanent loss if that node dies. (Note that this will remain true until you've run in this mode for your full log retention period, after which you'll be back to fully replicated.)

Pretend we're in this situation, and someone performs a query which returns log data that exists only on a single node. It's possible for the query subsystem to detect that it received only a single copy of each of those logs (instead of the expected 2) and then trigger a so-called read repair, where it takes the logs it received and writes them to a node that doesn't already have them. This enhancement is about implementing that process, but rather than doing it reactively in response to user queries, doing it proactively by making regular, continuous queries over the entire log time range, in effect generating fake read load on the cluster. (In that sense, this ticket depends on #6.)

To answer your question, there are several ways to avoid this situation. One, perform regular backups of node disks, and restore them if/when a node dies — OK Log handles this scenario gracefully, without operator intervention. Two, have a larger cluster, with a higher replication factor, to tolerate more node loss.

Hope this helps.
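
For illustration, a minimal sketch of that proactive walk, assuming a generic query hook; the names and parameters are made up, not the oklog query API:

```go
// Hypothetical sketch of the "proactive" variant described above: a
// background loop that continuously queries successive slices of the
// retention period, so the ordinary query-path read repair gets exercised
// over the whole data set.
package walk

import (
	"context"
	"time"
)

// QueryFunc stands in for whatever issues a normal store query; the query
// path itself is assumed to detect under-replication and write back copies.
type QueryFunc func(ctx context.Context, from, to time.Time) error

// Walk steps a window back through the retention period, one query per tick,
// then wraps around and starts a fresh pass.
func Walk(ctx context.Context, query QueryFunc, retention, window, pace time.Duration) error {
	ticker := time.NewTicker(pace)
	defer ticker.Stop()
	to := time.Now()
	oldest := to.Add(-retention)
	for {
		from := to.Add(-window)
		if from.Before(oldest) {
			// Wrapped around: restart from "now".
			to = time.Now()
			oldest = to.Add(-retention)
			from = to.Add(-window)
		}
		if err := query(ctx, from, to); err != nil {
			return err
		}
		to = from
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}
```

The pace parameter is what keeps the fake read load from competing with real queries.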

@untoldone
untoldone commented Mar 19, 2018

Thanks! This is really helpful, and I appreciate the quick reply.

Sorry if this is off-topic for this bug, but as a follow-up to your response: hypothetically, in a cluster of 3 storage nodes, would it be safe to have the data directory frequently synced (like every hour) to alternative durable storage (e.g. S3), and, in the case of node failure, copy all data files back to a new node before re-joining the cluster? Would I want to back up both the store and ingest directories? Is there anything I should be aware of in regard to data consistency around backups (e.g. an index file that might be out of sync with data files)?

I didn't notice anything in the design that suggested this wouldn't be ok, but thought I'd ask, as I didn't see any explicit comments around this type of topic.

@peterbourgon
Member
peterbourgon commented Mar 19, 2018

would it be safe to have the data directory frequently synced (like every hour) to alternative durable storage (e.g. S3)

Yes.

in the case of node failure, [we can] copy all data files back to a new node before re-joining the cluster?

Yes, also safe.

Would I want to back up both the store and ingest directories?

Yes, also safe.

Is there anything I should be aware of in regard to data consistency around backups (e.g. an index file that might be out of sync with data files)?

Nope, there are no index files or anything like that. If you start up a new ingest or store node with some restored data in the relevant data dir, the node should pick it up and manage everything transparently.
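
Purely as an illustration (the directory layout here is an assumption, and rsync or an object-store client would do the same job), a periodic backup can be as simple as walking the store and ingest data directories and copying the files elsewhere:

```go
// Illustrative backup sketch: copy a data directory tree to a backup
// location, preserving its layout. Point it at wherever your store and
// ingest data directories actually live.
package backup

import (
	"io"
	"os"
	"path/filepath"
)

// copyTree copies every regular file under src into dst, preserving layout.
func copyTree(src, dst string) error {
	return filepath.Walk(src, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		rel, err := filepath.Rel(src, path)
		if err != nil {
			return err
		}
		target := filepath.Join(dst, rel)
		if info.IsDir() {
			return os.MkdirAll(target, 0755)
		}
		in, err := os.Open(path)
		if err != nil {
			return err
		}
		defer in.Close()
		out, err := os.Create(target)
		if err != nil {
			return err
		}
		defer out.Close()
		_, err = io.Copy(out, in)
		return err
	})
}
```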

@untoldone
untoldone commented Mar 25, 2018

Awesome, thanks!
