
Persist data #35

Closed
brancz opened this issue Apr 7, 2022 · 17 comments

@brancz
Member

brancz commented Apr 7, 2022

Currently, once the configured size of data is reached, the active-append table-block is swapped out for a new one and the old one is thrown away. We of course want to persist data in some way. Since we already keep the data in parquet format in memory, it would be great to write that out and memory map it.

@ostafen
Contributor

ostafen commented May 10, 2022

I could try to work on this, if that's fine with you.
Do you plan to make writes crash-safe, i.e. support crash recovery?

@ostafen
Contributor

ostafen commented May 11, 2022

@brancz, from the code I actually see that, when a table block is rotated, the old one is simply discarded, and queries only act on the current active block of each table.
So, what's the purpose of memory-mapping discarded blocks?

@thorfour
Contributor

@ostafen Hello! We're actually working on this issue right now! Stay tuned!

@thorfour thorfour self-assigned this May 11, 2022
@ostafen
Contributor

ostafen commented May 11, 2022

Okay, if you need help, I would be happy to participate in this :)

@thorfour
Contributor

Awesome! I hope to have something that at least compiles :) very soon

@ostafen
Contributor

ostafen commented May 11, 2022

Okay, I'll stay tuned on this :)

@thorfour
Contributor

I've pushed a very WIP branch hackday-storage where it just writes the rotated blocks to disk, and then on query reads all of them. It's terribly inefficient, but it's at least a starting point that can inform an implementation.

@ostafen
Contributor

ostafen commented May 12, 2022

I took a look at the code.
I think the RotateBlock() function is a good place to put the code for persisting the block.
Also, instead of writing each block to a separate file, I would keep a single write file open, appending blocks in a log-like fashion (the file could also be rotated when a given size is reached).
I would also store references to read-only table blocks (which could additionally be memory mapped) inside the Table struct, and update the Iterator() function so it iterates not only over the current active block.
Finally, we may want to perform crash recovery on startup (in fact, only the last written block could be damaged).
What do you think?

@thorfour
Contributor

Those all sound like reasonable approaches to me

@ostafen
Contributor

ostafen commented May 12, 2022

If you want, I could try to provide a PR addressing these points

@ostafen
Contributor

ostafen commented May 13, 2022

I think there is a potential problem arising from the fact that the block is written to disk in the background by a separate goroutine.
A concurrent query may try to read a block (inside the IterateRowGroup() function) while it is being written to disk, thus reading a corrupted version of the block.
It could also happen that the disk write has not yet started at that point, so the on-disk block would be skipped and not read at all. Depending on the workload and configuration, multiple blocks could also be waiting to be written to disk.

To solve this problem, when a query iterates over the set of table blocks of the database, the following property should hold: all blocks created before the current active block have been correctly synced to disk.
This condition could be checked by giving each block a sequence number and storing, in a variable, the sequence number of the last table block that has been correctly (and entirely) written to disk.
When a read operation arrives, one should first check whether lastWrittenBlockSeqNumber < activeBlock.seqNumber-1.
If so, we could either wait for the condition to become false or simply return an error.

@thorfour
Contributor

I was thinking we'd likely want to drop a metadata file alongside blocks that informs us about their contents and potential validity. We could update it on successful block writes to avoid reading an incomplete block. If we use a ULID for the block name, we could implement the sequence number you've described to determine the last valid block.

@ostafen
Contributor

ostafen commented May 13, 2022

The problem is that if a query is submitted while you are writing a block to disk, you have to wait for the incomplete block to be written to disk. If you simply detect that a block on disk is invalid and skip it, you may end up with an incomplete result set, because the block may contain rows satisfying the query.
Assume a user runs the same query twice, and that just before the second run the current active block is rotated because of a concurrent insertion. In that case, they could get a different number of rows for the same query, because rows contained in the block being written would not be returned the second time.
So I would use the sequence numbers to detect pending writes, but the important part is that if you find
lastWrittenBlockSeqNumber < activeBlock.seqNumber-1 during a query, you have to wait until lastWrittenBlockSeqNumber == activeBlock.seqNumber-1 holds, or return an error telling the user to re-run the query.
Do you agree with this point?

@thorfour
Contributor

Yea definitely agree. What if we read from the in-memory block that is actively being written to disk? So in our table we have something like

active *TableBlock
pending []*TableBlock

so on read we query from both active and any pending blocks (maybe we only ever keep a single pending block?), and once a pending block has been successfully written to disk, we can remove it from the in-memory pending list.

@ostafen
Contributor

ostafen commented May 15, 2022

This is a valid alternative I was also thinking about. There is only one drawback with this approach: under very high write pressure, several blocks could be queued in the pending list (keeping only one pending block is not enough in the worst case), so there could be situations where the memory limit imposed by activeMemorySize is exceeded by far. However, if this is acceptable for your purposes, we could go with this solution :)

@thorfour
Contributor

I think we would have to do something like this to avoid temporarily not finding data during queries when a block is being written.

@ostafen
Contributor

ostafen commented May 15, 2022

Good, I'll go with this solution. I'll wait for #60 to be merged before opening a PR.
Meanwhile, you can take a look at my fork to check the progress: https://github.com/ostafen/arcticdb
