
Persist data #35

Closed
brancz opened this issue Apr 7, 2022 · 17 comments

@brancz
Member

brancz commented Apr 7, 2022

Currently, once the configured size of data is reached, the active-append table-block is swapped out for a new one and the old one is thrown away. We of course want to persist data in some way. Since we already keep the data in parquet format in memory, it would be great to write that out and memory map it.

@ostafen
Contributor

ostafen commented May 10, 2022

I could try to work on this, if that's fine with you.
Do you plan to make writes crash-safe, i.e. support crash recovery?

@ostafen
Contributor

ostafen commented May 11, 2022

@brancz, from the code I actually see that, when a table block is rotated, the old one is simply discarded, and queries only act on the current active block of each table.
So, what's the purpose of memory-mapping discarded blocks?

@thorfour
Contributor

@ostafen Hello! We're actually working on this issue right now! Stay tuned!

@thorfour thorfour self-assigned this May 11, 2022
@ostafen
Contributor

ostafen commented May 11, 2022

Okay, if you need help, I would be happy to participate in this :)

@thorfour
Contributor

Awesome! I hope to have something that at least compiles :) very soon

@ostafen
Contributor

ostafen commented May 11, 2022

Okay, I'll stay tuned on this :)

@thorfour
Contributor

I've pushed a very WIP branch hackday-storage where it just writes the rotated blocks to disk, and then on query reads all of them. It's terribly inefficient, but it's at least a starting point that can inform an implementation.

@ostafen
Contributor

ostafen commented May 12, 2022

I took a look at the code.
I think the RotateBlock() function is a good place to put the code for persisting the block.
Also, instead of writing each block to a separate file, I would keep a single write file open, appending blocks in a log-like fashion (the file could also be rotated when a given size is reached).
I would also store references to read-only table blocks (which could additionally be memory mapped) inside the Table struct, and update the Iterator() function so it iterates not only over the current active block.
Finally, we may want to perform crash recovery on startup (in fact, only the last written block could be damaged).
What do you think?

@thorfour
Contributor

Those all sound like reasonable approaches to me

@ostafen
Contributor

ostafen commented May 12, 2022

If you want, I could try to provide a PR addressing these points

@ostafen
Contributor

ostafen commented May 13, 2022

I think there is a potential problem arising from the fact that the block is written to disk in the background by a separate goroutine.
A concurrent query may try to read a block (inside the IterateRowGroup() function) while it is being written to disk, thus reading a corrupted version of the block.
It could also happen that the disk write has not yet started at that point, so the on-disk block would be skipped and not read at all. Depending on the workload and configuration, multiple blocks could also be waiting to be written to disk.

To solve this problem, when a query iterates over the set of table blocks of the database, the following property should hold: all blocks created before the current active block have been correctly synced to disk.
This condition could be checked by giving each block a sequence number and storing, in a variable, the sequence number of the last table block that has been correctly (and entirely) written to disk.
When a read operation arrives, one should first check whether lastWrittenBlockSeqNumber < activeBlock.seqNumber-1.
If so, we could either wait for the condition to become false or simply return an error.

@thorfour
Contributor

I was thinking we'd likely want to drop a metadata file alongside blocks that informs us about their contents and potential validity. We could update it on successful block writes to avoid reading an incomplete block. If we use a ULID for the block name, we could implement the sequence number you've described to determine the last valid block.

@ostafen
Contributor

ostafen commented May 13, 2022

The problem is that if a query is submitted while you are writing a block to disk, you have to wait for the incomplete block to be written to disk. If you simply detect that a block on disk is invalid and skip it, you may end up with an incomplete result set, because the block may contain rows satisfying the query.
Assume a user runs the same query twice, and that just before the second run the current active block is rotated because of a concurrent insertion. In that case, they could get a different number of rows for the same query, because rows contained in the block being written would not be returned the second time.
So I would use the sequence numbers to detect pending writes, but the important part is that if you find
lastWrittenBlockSeqNumber < activeBlock.seqNumber-1 during a query, you have to wait until lastWrittenBlockSeqNumber == activeBlock.seqNumber-1 holds, or return an error telling the user to re-run the query.
Do you agree with this point?

@thorfour
Contributor

Yea definitely agree. What if we read from the in-memory block that is actively being written to disk? So in our table we have something like

active *TableBlock
pending []*TableBlock

so on read we query from both active and any pending blocks (maybe we only ever keep a single pending block?), and once a pending block has been successfully written to disk, we can remove it from the in-memory pending list.

@ostafen
Contributor

ostafen commented May 15, 2022

This is a valid alternative I was also thinking about. There is only one drawback with this approach: under very high write pressure, several blocks could be queued in the pending list (keeping only one pending block is not enough in the worst case), so there could be situations where the memory limit imposed by activeMemorySize is exceeded by far. However, if this is acceptable for your purposes, we could go with this solution :)

@thorfour
Contributor

I think we would have to do something like this to avoid temporarily not finding data during queries when a block is being written.

@ostafen
Contributor

ostafen commented May 15, 2022

Good, I'll go with this solution. I'll wait for #60 to be merged before opening a PR.
Meanwhile, you can take a look at my fork to check the progress: https://github.com/ostafen/arcticdb
