Node-group based Storage #1474

Open
15 of 20 tasks
semihsalihoglu-uw opened this issue Apr 19, 2023 · 0 comments

In contrast to our previous major design documents, I want to write this one up publicly. If we can keep it relatively up to date, it can help others who are interested in our designs understand them more easily than by reading the code.

Overview

The general goal of the 2nd version of our storage is to implement efficient compression techniques and zone maps by moving towards a design that stores columns and rel-lists in a partitioned way (these partitions are called RowGroups in ORC, Parquet, and some publications).
We call these partitions NodeGroups. We could continue calling them RowGroups, but that term is confusing for rel-lists, where a partition contains the rel-lists of a set of nodes rather than a fixed number of relationship records.

NodeGroup

A node group can be viewed as a horizontal partition of a table. Consequently, a table is logically composed of a set of node groups.

Our storage consists of two types of tables: node tables and rel tables. For rel tables, the data is stored in both forward and backward directions on disk, resulting in two duplicate copies. Data within rel tables can be stored as columns when the multiplicity type is x_to_one, or as rel-lists when the multiplicity type is x_to_many.

Within a node group, each property of the table has a chunk of its data stored sequentially in the file. Each column chunk is independent, and can be compressed using different algorithms. Column chunks within the same node group are not necessarily stored together in the file.
In addition to the data itself, we maintain information about positions (page id and page offset), compression details and statistics, as well as other potential metadata for each column chunk. This information helps organize node groups and enables skipping scans of unnecessary node groups with the help of zone maps. This information is stored per column chunk as ColumnChunkMetadata, which is read into memory and cached during the database initialization. Accesses to a column chunk inside a node group will first consult its ColumnChunkMetadata.
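
As a rough illustration, the kind of information ColumnChunkMetadata would need to hold might look like the following sketch. The field names here are illustrative assumptions, not the actual struct in the code.

#include <cstdint>

struct ColumnChunkMetadataSketch {
    uint64_t pageIdx;         // first page of the chunk in data.kz
    uint64_t pageOffset;      // offset within that page where the chunk starts
    uint64_t numPages;        // number of pages occupied by the chunk
    uint64_t numValues;       // number of values stored in the chunk
    uint8_t  compressionType; // which compression scheme was applied to the chunk
    int64_t  minValue;        // zone map statistics (meaningful for numeric chunks)
    int64_t  maxValue;
};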

In the query processor, a node group is also the morsel that threads grab when scanning tables. Note that this is different from our current code (as of April 2023), where a morsel is a single vector.

NodeGroup Sizes

For tables stored as columns, a horizontal partition is a set of tuples within a specific node offset range (configurable, but defaults to 64*2048 = 131072), such as [0, 131071].

For tables stored as rel-lists, a horizontal partition consists of the set of tuples whose source node offsets fall within a specific node offset range (configurable, but defaults to 512), such as [0, 511].
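
For columns, mapping a node offset to its node group is simple arithmetic. A minimal sketch, assuming the default node group size of 64*2048 = 131072:

#include <cstdint>

static constexpr uint64_t NODE_GROUP_SIZE = 64 * 2048; // 131072 tuples per node group

inline uint64_t nodeGroupIdx(uint64_t nodeOffset) {
    return nodeOffset / NODE_GROUP_SIZE; // e.g., offset 131072 falls into node group 1
}

inline uint64_t offsetInNodeGroup(uint64_t nodeOffset) {
    return nodeOffset % NODE_GROUP_SIZE; // position of the tuple within its node group
}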

Physical Layout

Typically, a database consists of the following disk files:

  • Catalog file (catalog.kz): This file contains table schema information and table-level statistics.
  • Data file (data.kz): This file stores node groups.
  • Metadata file (metadata.kz): This file holds column chunk metadata and the list of free pages. This can eventually be incorporated into the data file or merged with the catalog file, but it is kept separate for now to simplify the design.
  • Index file (n-x.hindex): For each node table, there will be one index file for the primary key. In the future, we may consider moving index data to the data file, as this can simplify the buffer manager (see Notes).
  • WAL file (.wal): This file serves as the write-ahead log for transactions.

Data File

The data file stores all data from both node and rel tables within the database.

For instance, consider a database with two tables: person(name, age), knows(since). The data file would contain the following:

|--------------------| >--|
| person-NG0 name    |    |
|--------------------|    |
| person-NG0 age     |    |
|--------------------|    | person
| person-NG1 name    |    |
|--------------------|    |
| person-NG1 age     |    |
|--------------------|    |
|    ...  ...        |    |
|--------------------| >--|
| knows-NG0  header  |    |
|--------------------|    |
| knows-NG0  nbrID   |    | knows
|--------------------|    |
| knows-NG0  relID   |    | 
|--------------------| >--|
| person-NG2 name    |    |
|--------------------|    | person
| person-NG2 age     |    |
|--------------------| >--|
| knows-NG1  relID   |    |
|--------------------|    |
| knows-NG1  nbrID   |    | knows
|--------------------| >--|
|    ...  ...        |
|--------------------|

Metadata File

The metadata file is always read into RAM during database initialization, and is flushed back when updates occur.
The file contains two in-memory disk arrays: one for the list of free pages in the data file and another for node group info.

FreePagesList is an in-memory disk array of PageRange entries. A PageRange entry records a sequential set of free pages, which typically occurs due to the movement of column chunks.

|----------------------| >--|
| FreePagesList Header |    | In mem disk array header of free pages in the file.
|----------------------| >--|
| ColumnMetaDA Header  |    | In mem disk array header of ColumnChunkMetadata.
|----------------------| >--|
|     ...   ...        |
|----------------------|
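
A sketch of what a PageRange entry and the free-pages list could look like; the names follow the description above but are not taken from the actual code.

#include <cstdint>
#include <vector>

struct PageRange {
    uint64_t startPageIdx; // first free page in data.kz
    uint64_t numPages;     // number of consecutive free pages in the range
};

// Conceptually a disk array of PageRange entries that is kept in memory and
// flushed back to metadata.kz when it changes.
using FreePagesList = std::vector<PageRange>;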

Column Chunk

There are two data structures for table properties: Columns and Rel-Lists.
Properties in node tables and x_to_one rel tables are stored as Columns, while properties in x_to_many rel tables are stored as Rel-Lists.

Data Chunk

The basic storage layout of a column chunk includes both values and null bits.
For instance, an INT64 column chunk looks like this:

| Null bits             | | Values                 |
|-----------------------| |------------------------|
| 0001000000000111...   | |10 | 20 | 30 | 0 |  ... |

Note that the null bits themselves are stored as an independent column chunk, too.

A string column chunk consists of three parts: offsets, values (dict) and null bits.
The length of a string is calculated from the offsets.

| Null bits             | | Offsets               | | Values                      |
|-----------------------| |-----------------------| |-----------------------------|
| 0001000000000111...   | | 2 | 7 | 9 | 14 ...    | | ab | cdead | rd | dse4d ... |
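
A minimal sketch of how the i-th string is read back from this layout, assuming the offsets are cumulative end offsets as in the example above:

#include <cstdint>
#include <string>
#include <vector>

// Returns the i-th string; its byte range in the dictionary is derived from
// two consecutive offsets, which is how the string length is recovered.
std::string readString(const std::vector<uint32_t>& offsets, const std::string& dict,
                       uint32_t i) {
    uint32_t start = (i == 0) ? 0 : offsets[i - 1];
    uint32_t end = offsets[i];
    return dict.substr(start, end - start); // e.g., i = 1 -> bytes [2, 7) = "cdead"
}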

A list column chunk comprises three parts: null bits, offsets, values.

| Null bits             | | Offsets             | | Values                  |
|-----------------------| |---------------------| |-------------------------|
| 0001000000000111...   | | 1 | 2 | 4 | 4 | ... | | 10 | 20 | 35 | 0 | ...  |

Empty lists are allowed, which is why null bits are necessary.

Rel-Lists Column Chunk

Rel-Lists is a special type of "List" in that it is guaranteed to have no nulls (a source node can have no rels, which results in a list of size 0 rather than a null).
All Rel-Lists within the same node group share the same offsets array, called lists_header, which is stored independently as a column chunk.
Accessing any Rel-Lists requires going to the lists_header first to obtain the corresponding value offsets.

Consider the following example with the knows node group 0:

# adjLists: src -> [dst]
# 0 -> [10, 15], 1 -> [11, 200, 201, 20, 15, 19], 2 -> [100, 101], 3 -> [], ...

# lists_header column chunk:
| Offsets               |
-------------------------
| 2 | 8 | 10 | 10 | ... |

# nbrID data (no nulls):
| Values                                   |
--------------------------------------------
| 10 | 15 | 11 | 200 | 201 | 20 | 15 | ... |

# since data
| Null bits        | | Values                                        |
|------------------| |-----------------------------------------------|
| 0000000000000... | | 2000 | 2001 | 2021 | 2022 | 1993 | 1903 | ... |
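
A minimal sketch of how a rel-list is located within a node group, following the example above: the lists_header chunk stores cumulative (logical) CSR end offsets, and the resulting range is then read from the nbrID and property chunks.

#include <cstdint>
#include <utility>
#include <vector>

// Returns the [start, end) element range of the rel-list of the node at
// position `posInGroup` within its node group.
std::pair<uint64_t, uint64_t> relListRange(const std::vector<uint64_t>& listsHeader,
                                           uint64_t posInGroup) {
    uint64_t start = (posInGroup == 0) ? 0 : listsHeader[posInGroup - 1];
    uint64_t end = listsHeader[posInGroup];
    return {start, end}; // e.g., node 1 -> [2, 8): nbrIDs 11, 200, 201, 20, 15, 19
}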

Compressed Partitioned Column and Rel-Lists Storage

Compression is applied to each property in each NodeGroup independently. The logic is very similar whether the property's data is stored as a compressed Column or a compressed List. When the data is stored as a Column, we simply store the compressed null bits and the compressed data in separate sets of pages (though they are stored consecutively during the initial copy). When we compress data stored as lists, we do exactly the same and additionally store the "logical" CSR offsets in the list headers. Logical here means that the CSR offsets refer to element positions in lists, not byte positions.

Compression Techniques:

The primary criterion for choosing our compression techniques is that they must allow random access to elements when decompressing. This is because our plans heavily use node offset semi-masks to decrease the amount of node property data we scan and copy into vectors. For example, consider the query MATCH (a:Person)-[:Likes]->(b:Person) WHERE a.ID = 0 RETURN b.age. Suppose the node with ID 0 has only one neighbor, whose ID is 1234. We will create a semi-mask that is true only for ID 1234 and pass it to the scan operator that scans b.age. If we ensure that the compression technique allows random access, that scan operator can read two pages, one to read the null bit of node 1234's age and one to read the actual age value, and then copy and decompress those 2 values. This is more efficient than decompressing the entire set of pages of the NodeGroup to read a single value.
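
To make the random-access requirement concrete, here is a sketch of why fixed-width encodings such as bit packing satisfy it: given the bit width recorded in the chunk's metadata, the page and bit position of a single value can be computed directly, so only that page needs to be read. The page size and names are illustrative assumptions, and values straddling a page boundary are ignored for simplicity.

#include <cstdint>

static constexpr uint64_t PAGE_SIZE_BYTES = 4096; // illustrative page size

struct ValueLocation {
    uint64_t pageIdxInChunk;  // page within the column chunk that holds the value
    uint64_t bitOffsetInPage; // bit position of the value inside that page
};

ValueLocation locateBitPackedValue(uint64_t posInChunk, uint64_t bitWidth) {
    uint64_t bitPos = posInChunk * bitWidth;
    return {bitPos / (PAGE_SIZE_BYTES * 8), bitPos % (PAGE_SIZE_BYTES * 8)};
}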

We should start with the following compression techniques:

Bit/BytePacking

Find the maximum number of bits k needed to represent the values in the NodeGroup's property and write each value using k bits. For example, if the column is age, is an INT64, and all values are between 0 and 63, we would use 6 bits. In a first implementation we can do byte packing, which is easier to implement, but we should probably bite the bullet and implement the more compact bit-packed version.
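
A minimal sketch of the idea (not the actual implementation): compute the number of bits k needed for the largest value in the chunk and pack each value using k bits.

#include <cstdint>
#include <vector>

uint64_t bitsNeeded(const std::vector<uint64_t>& values) {
    uint64_t maxVal = 0;
    for (auto v : values) maxVal = (v > maxVal) ? v : maxVal;
    uint64_t k = 0;
    while (maxVal > 0) { ++k; maxVal >>= 1; } // e.g., max value 63 -> k = 6
    return k == 0 ? 1 : k;
}

std::vector<uint8_t> bitPack(const std::vector<uint64_t>& values, uint64_t k) {
    std::vector<uint8_t> out((values.size() * k + 7) / 8, 0);
    uint64_t bitPos = 0;
    for (auto v : values) {
        for (uint64_t b = 0; b < k; ++b, ++bitPos) {
            if ((v >> b) & 1) out[bitPos / 8] |= (uint8_t)(1u << (bitPos % 8));
        }
    }
    return out;
}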

String Dictionary Compression

This is useful when some strings are exactly repeated. We store strings in 4 parts, including null bits (if necessary).
Consider a NodeGroup with 5 nodes and a string property species: "cat", "bear", "cat", "cat", "dog". The following is what we would store. The indices would be bit packed using ceil(log_2(num distinct strings in the node group)) bits. The dictionary offsets would be stored using ceil(log_2(dictionary size)) bits.

| Null bits || Indices           || Dict. Offsets || Dict. Strings |
|-----------||-------------------||---------------||---------------|
| 11111     || 0 | 1 | 0 | 0 | 2 || 0 | 3 | 7     || catbeardog    |
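
A sketch of how the parts above could be built during the initial copy; the bit packing of indices and offsets is omitted for brevity, and the names are illustrative.

#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

struct StringDictChunk {
    std::vector<uint32_t> indices;     // one entry per value, to be bit packed
    std::vector<uint32_t> dictOffsets; // start offset of each distinct string
    std::string dictStrings;           // concatenated distinct strings, e.g., "catbeardog"
};

StringDictChunk buildDictChunk(const std::vector<std::string>& values) {
    StringDictChunk chunk;
    std::unordered_map<std::string, uint32_t> dictIdx;
    for (const auto& s : values) {
        auto it = dictIdx.find(s);
        if (it == dictIdx.end()) {
            it = dictIdx.emplace(s, (uint32_t)chunk.dictOffsets.size()).first;
            chunk.dictOffsets.push_back((uint32_t)chunk.dictStrings.size());
            chunk.dictStrings += s;
        }
        chunk.indices.push_back(it->second);
    }
    return chunk; // for the example: indices 0,1,0,0,2; offsets 0,3,7; "catbeardog"
}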

Note: One thing we did in our storage v1 was to adopt the Umbra-style 16-byte fixed-length + overflow representation for strings on disk (see the ku_string_t struct in ku_string_t.h). This is however redundant, since there is no clear value in storing the 4 prefix characters of long strings along with an additional pointer. I can't remember what we thought back in the day.

Possible improvement: Global FSST symbol table.
FSST (reference implementation by Peter Boncz et al. here) is a technique to further compress the dictionary part (so catbeardog). This is done with a secondary dictionary, called the "symbol table", of up to 255 codes that compresses commonly appearing substrings into single bytes. So strings remain strings, except there is a dictionary of 255 codes to substrings. In the original design in the FSST paper and DuckDB, each RowGroup gets its own separate FSST symbol table. I would advise against doing this because, according to this DuckDB PR, it slows down query processing time a bit. This is probably because scans now need to resolve 2 dictionaries. In our case, where we care about random access and decompression, this would also slow down random read operations because, no matter what, we would have to scan the entire FSST symbol table of a NodeGroup before we can decompress it (though maybe this is OK after all).

However, we can try a design in which we keep a "Global FSST" symbol table per column that is generated during the initial copy of the column. The reason is that, again according to the same DuckDB PR, FSST can reduce storage quite a bit over vanilla dictionary compression (e.g., from 510MB to 250MB in the TPCH SF1 experiment). We could sample a small number (~10K) of strings from a column, use that sample to generate an FSST symbol table, and use it to compress all strings. We would then keep the FSST symbol table of each column always in memory, so we could get some of the compression benefits but incur no (or minimal) overhead in decompression. This is something we can study first and decide.

Constant Compression

When the data values or null bits in a particular NodeGroup for a property are all exactly the same, we could store that value once in the NodeGroupInfo for the property. This should be useful for compressing null bits in the many cases where all values in a NodeGroup for a property are non-null.
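
A minimal sketch of detecting whether a chunk qualifies for constant compression; if it does, only the single value needs to be recorded in the chunk's metadata and no data pages are written.

#include <cstdint>
#include <optional>
#include <vector>

std::optional<uint64_t> tryConstantCompress(const std::vector<uint64_t>& values) {
    if (values.empty()) return std::nullopt;
    for (auto v : values) {
        if (v != values.front()) return std::nullopt; // not constant; use another scheme
    }
    return values.front(); // store this single value instead of the whole chunk
}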

Zone Maps

For numeric (and possibly date) properties, the NodeGroupInfo of each NodeGroup's column or list should keep the min and max values of the data stored in that NodeGroup. We could have different versions of NodeGroupInfo to hold these fields, or we could just keep them as nulls for non-integer types. There are a few questions to answer here:

  1. Should we keep these for integers only, or also for floats?
  2. Should we keep them for every integer (and float) property, or should we have the flexibility to not keep zone maps?
  3. Should we also do it for dates and timestamps?

For all of these questions, my instinct is: do it for all numeric types and do not have the flexibility, i.e., do it for every property.

The other important choice is how zone maps will be used in Scans.

  1. Frontend: The frontend will need to change to generate zone map filters by analyzing the query at compile time. For example, if the query is MATCH (a:Person) WHERE a.age > 30, then we need to infer that any node group of the age property whose maximum is <= 30 does not need to be scanned (see the sketch after this list). See this issue in DuckDB for further opportunities.
  2. ScanNodeID followed by Flatten + ScanRel: We have plans where we first put in a ScanNodeID as the leaf operator and then, instead of a join, we put in a Flatten and a ScanRel operator. There can be queries where separate min/max filters can be ingested into both of those scans. For example, MATCH (a:Person)-[e:Likes]->(b:Person) WHERE a.age > 30 AND e.year < 2020. However, currently only the leaf ScanNodeID grabs morsels. We would need to modify ScanRel to benefit from zone maps as well.
  3. Dynamic ZoneMap Generation: There can be queries where we can generate a zone map on the fly and pass it in. There are good systems papers on these and some of these could be effective. For example:
MATCH (a:Person)
WITH average(a.age) as sum_age
MATCH (b:Person)
WHERE b.age > sum_age + 5
RETURN count(*)

sum_age+5 can be computed dynamically and ingested into the scan of b.age during query processing.
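
A minimal sketch of the zone map check itself, for a predicate like a.age > 30: a node group whose maximum age is <= 30 cannot contain a qualifying tuple, so the scan can skip it. Names are illustrative.

#include <cstdint>

struct ZoneMap {
    int64_t minValue;
    int64_t maxValue;
};

// Returns true if no tuple in the node group can satisfy `value > threshold`,
// so the entire chunk can be skipped without scanning it.
bool canSkipGreaterThan(const ZoneMap& zm, int64_t threshold) {
    return zm.maxValue <= threshold; // e.g., max age <= 30 -> skip for a.age > 30
}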

Non-nullable Constraint

Some node and rel properties can have a non-nullable constraint, in which case we do not store any null bits. This constraint should be declared in Cypher and stored in the Catalog. Therefore, the class that serves as the interface to read a NodeGroup's column or list data from disk (I think @ray6080 is calling this ColumnChunk) should get this information from the Catalog.

The update logic should also change to check that we do not write a null value to a non-nullable property's storage structure (i.e., column or list).
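
An illustrative version of that check, assuming the catalog exposes whether a property is declared non-nullable:

#include <stdexcept>

// Rejects a null write before it reaches the column or list storage structure.
void validateWrite(bool propertyIsNonNullable, bool valueIsNull) {
    if (propertyIsNonNullable && valueIsNull) {
        throw std::runtime_error("Cannot write NULL to a non-nullable property.");
    }
}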

Compressed Query Processing

Uncompressing data as we scan it into ValueVectors simplifies the query processor a lot. However, there are obvious cases where this is redundant, and we can improve processing time by writing data into ValueVectors in compressed format and then uncompressing it only when writing back to FlatTuples as the user iterates over results. I would limit this to properties that are merely returned and on which no expression ever runs, i.e., where we have the guarantee that the properties will never be computed on. Another alternative is to delay uncompressing until a computation is performed on the compressed data in ValueVectors. Yet another alternative is to have the ability to perform computations directly on compressed data. Each of these is a more advanced feature, but it is unclear how much performance benefit we would get from them versus the complication they would add to the query processor. I would be hesitant to go in this direction for now.

Even the simpler feature however requires quite a lot of change. Specifically:

  • ValueVector: Needs to hold the information that the data it has is compressed and how it is compressed.
  • FactorizedTable: Needs to inherit the same information.
  • FlatTupleIterator: Needs to check whether the data in a particular column of the FactorizedTable is compressed and uncompress it there.

It is not clear which compression schemes we should limit this feature to, but the obvious ones are constant compression and bit packing.
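
As a rough sketch, the kind of state a ValueVector (and, inheriting the same information, a FactorizedTable column) might need to carry so that decompression can be deferred to FlatTupleIterator; this is an assumption for illustration, not the existing ValueVector API.

#include <cstdint>

enum class CompressionScheme : uint8_t { UNCOMPRESSED, CONSTANT, BIT_PACKED };

struct VectorCompressionState {
    CompressionScheme scheme = CompressionScheme::UNCOMPRESSED;
    uint8_t bitWidth = 0;      // used when scheme == BIT_PACKED
    int64_t constantValue = 0; // used when scheme == CONSTANT
};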

TODO Items

  • Remove large rel-lists.
  • Separate null bits and data in property columns.
  • Rework copy transaction to get rid of renaming mechanism.
  • Node table storage. (Consolidate all node table data into a single file, while rel tables remain stored separately.)
    • Node-Group based data copier and scans. (read-only; updates may be disabled)
    • WAL-based node table updates. (create, set, delete)
  • Rel table storage. (Combine all node and rel table data into a single file)
    • Morsel-driven parallelism.
    • Node-Group based data storage. (read-only; updates may be disabled)
    • Local storage based Rel table updates. (create, set, delete)
  • Statistics: Table level and node group level. (Consider relocating table statistics to catalog).
  • Rework Scan operator for node groups.
  • ZoneMap.
  • No-null constraints.
  • Basic compression framework
    • Constant compression.
    • Dictionary compression.
    • Bit/Byte-Packing compression.
  • Node-Group based local storage.
  • Free pages list.

Other Features To Consider

Here is a set of other features that I think make sense, though I have not thought hard about them. These are simply to record some initial thoughts. Some of them are independent of the sequence above, so they could be designed and implemented concurrently.

Gaps at the End of Rel-Lists:

Currently we store rel-lists as compact CSRs. This means that each NodeGroup (or chunk) in our case would have to be rewritten entirely even if a single edge is inserted into it. We should leave gaps at the end of rel-lists to make space for growth without requiring rewriting an entire NodeGroup's worth of data. I suggest a 1.1x growth factor for each list, though we would need to understand the update speed vs. scan speed vs. memory trade-offs.

This requires changing:

  1. Update/checkpoint logic. The interesting part is the algorithm we use to find space in the left or right neighbors, similar to how B+ trees borrow space from sibling nodes.
  2. List headers: List headers can no longer be stored as 32 bits. We need to store both the start and the length of each rel-list, so we would move to 64 bits. To win back some space, we could bit-pack both the offset and the length (see the sketch below).
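
An illustrative 64-bit list header storing both the start offset and the length of a rel-list; the 48/16 bit split below is an arbitrary example, not a design decision.

#include <cstdint>

inline uint64_t packListHeader(uint64_t startOffset, uint64_t length) {
    return (startOffset << 16) | (length & 0xFFFF); // 48 bits for start, 16 for length
}

inline uint64_t listStart(uint64_t header) { return header >> 16; }
inline uint64_t listLength(uint64_t header) { return header & 0xFFFF; }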

Undirected Edges:

TODO(Semih/Xiyang): Design

RDF-Specialized Sub/Object Node Table-Predicate Rel Table Pairs (S/O-P Table Pair)

We should support specialized storage for datasets that are more naturally represented as RDF triples. The immediate design is to have a specialized notion of a Node Table that stores both subjects and objects. We should not separate these into two tables, as a particular entity in the RDF model can be both a subject and an object. We would then have a separate "RDF Rel Table" to store the actual graph.

Hash Index Changes: We already support string primary key hash indexes on node tables, so we can index all of the node URIs using our hash index. We would need to extend this hash index to also index all URIs of predicates, assuming we want URIs to uniquely identify both subjects/objects and predicates. Alternatively, we could store a separate hash index for the rels but change the copy statements and insert/update logic to ensure that an insertion of a predicate or a subject/object is unique across the 2 hash indices. We would further need to store these URIs in two columns to serve as a reverse index: one for nodes (subjects/objects) and the other for rels (predicates). Then there would be changes in the query planner, which needs to translate URIs in queries to integer IDs.

Note I'm only imagining supporting existing Cypher in a more optimized way on RDF triples, whose subjects and objects are modeled as a node table and its predicates are modeled as a rel table. We would need to think much harder to find a design that supports more advanced RDF/SPARQL style processing (e.g., OWL-based reasoning etc.).

We would have several further choices in such a design:
(1) Number of S/O-P Table Pairs to support: should we give S/O-P Table Pairs names and support multiple of them or limit a database to have only 1 such S/O-P Table Pair?
(2) Should we mix the RDF triples with the property graph model, so users can add properties on the S/O Node Table and P Rel Table? This flexibility could be useful in modeling but should be weighed against the complexity it introduces in the system.

Avoiding the Primary Key Hash Index When the Primary Key is the Consecutive Integer Domain

This is implemented in PR: #1493 with some further TODOs in Issue: #1496.
