This repository has been archived by the owner. It is now read-only.

store: implement repair process #11

Open
peterbourgon opened this issue Jan 16, 2017 · 9 comments
@peterbourgon (Member) commented Jan 16, 2017

The repair process should walk the complete data set (the complete timespace) and essentially perform read repair. When it completes, we should have guaranteed that all records are at the desired replication factor. (There may be smarter ways to do this.)

@lafikl commented Jan 17, 2017

Wouldn't a Merkle-tree-like system be a better alternative?
At least to filter out segments that don't need replicating.

@peterbourgon (Member, Author) commented Jan 17, 2017

Hmm, yes, potentially! I'd have to do a bit more research, but it seems like each store node could keep a Merkle tree of the segments it knows about, and that could be done without coördination.

@lafikl commented Jan 17, 2017

You could repurpose the same hashes to detect data corruption on a segment level, too.

I have built a similar system at work to detect divergence between different systems.
It's a somewhat expensive operation, but totally worth it.

@lafikl commented Jan 17, 2017

I think hashing the ULIDs alone, without fetching the data, would be enough, since they're unique.

@untoldone commented Mar 17, 2018

@peterbourgon I've spent the last hour or so reading about oklog and what you've been working on -- really excited to find it!

I was just hoping to clarify the implications of this pending enhancement.

In the case that a store node is lost, is there currently no (simple) way to restore the cluster to a state where a node can be lost again without permanently losing segments?

@peterbourgon (Member, Author) commented Mar 18, 2018

@untoldone If you're running a 3-node cluster with a replication factor of 2, and a single node dies, and (let's say) is replaced by a new, empty node, then ca. 33% of your log data will exist only on a single node, and is vulnerable to permanent loss if that node dies. (Observe that this scenario will continue to be true until you've run in this mode for your log retention period, after which you'll be back to fully replicated.)

Pretend we're in this situation, and someone performs a query, which returns log data that only exists on a single node. It's possible for the query subsystem to detect that it only received a single copy of each of those logs (instead of the expected 2) and then trigger a so-called read repair, where it takes the logs it received and writes them to a node that doesn't already have them. This enhancement is about implementing that process, but rather than doing it reactively to user queries, doing it proactively by making regular, continuous queries of the entire log time range, in effect generating fake read load on the cluster. (In that sense, this ticket depends on #6.)

To answer your question, there are several ways to avoid this situation. One, perform regular backups of node disks, and restore them if/when a node dies — OK Log handles this scenario gracefully, without operator intervention. Two, have a larger cluster, with a higher replication factor, to tolerate more node loss.

Hope this helps.

@untoldone commented Mar 19, 2018

Thanks! This is really helpful and I really appreciate the quick reply.

Sorry if this is off-topic for this bug, but as a follow-up to your response: hypothetically, in a cluster of 3 storage nodes, would it be safe to have the data directory frequently synced (like every hour) to alternative durable storage (e.g. S3, etc.) and, in the case of node failure, copy all data files back to a new node before re-joining the cluster? Would I want to back up both the store and ingest directories? Is there anything I should be aware of in regard to data consistency around backups (e.g. an index file that might be out of sync with data files, etc.)?

I didn't notice anything in the design that suggested this wouldn't be ok, but thought I'd ask as I didn't see any explicit comments around this type of topic.

@peterbourgon (Member, Author) commented Mar 19, 2018

would it be safe to have the data directory frequently synced (like every hour) to alternative durable storage (e.g. S3, etc.)

Yes.

in the case of node failure, [we can] copy all data files back to a new node before re-joining the cluster?

Yes, also safe.

Would I want to back up both the store and ingest directories?

Yes, also safe.

Is there anything I should be aware of in regard to data consistency around backups (e.g. an index file that might be out of sync with data files etc)?

Nope, there are no index files or anything. If you start up a new ingest or store node with some restored data in the relevant data dir, the node should pick up and manage everything transparently.

@untoldone commented Mar 25, 2018

Awesome, thanks!
