On-disk size of data.kz is large when using multiple `CREATE` statements #3411

prrao87 · 2024-04-30T12:57:40Z

Migrating this from our Discord channel, as reported by a user. When a few thousand `CREATE` statements are sent as individual transactions, the on-disk size of the database directory is much larger (400 MB) than when the same data is loaded in batches via `COPY` (50 MB). The question is whether some sort of compaction process can be triggered to reduce disk usage when batching is not possible.

Original message:

Is there a way to reduce the on-disk footprint of kuzu graph db? Can we trigger some sort of compaction on-demand? If I load a few thousand transactions, nodes and edges combined, my on-disk size grows to 400MB. Same transactions when batched create less than 50MB. However, batching is not practical in many cases, therefore is there a way to load transactions and then trigger compaction to reduce the disk footprint?

Questions we asked to clarify:

1. "A few thousand transactions, nodes and edges combined": is this for creation only, i.e., no deletions?
2. Does "batched" mean a few thousand creations within a single transaction?
3. Does most of the size difference come from the data.kz file?

Clarifications:

1. Creation only, no deletion.
2. It was not a single batch; there were around 20 batches.
3. The size difference is all from the data.kz file. That single file has most of the data.
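
For reference, the "around 20 batches" setup can be approximated by splitting the records into roughly equal chunks and loading each chunk with one bulk statement. Below is a minimal sketch of the splitting step only, in plain Python. The helper name `split_into_batches` is hypothetical and not part of Kuzu; the actual load call for each chunk is left to the caller.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def split_into_batches(rows: Sequence[T], num_batches: int) -> Iterator[List[T]]:
    """Yield up to `num_batches` contiguous chunks of near-equal size."""
    if num_batches <= 0:
        raise ValueError("num_batches must be positive")
    base, extra = divmod(len(rows), num_batches)
    start = 0
    for b in range(num_batches):
        # The first `extra` chunks each get one additional row.
        size = base + (1 if b < extra else 0)
        if size == 0:
            break  # fewer rows than batches: stop once the rows run out
        yield list(rows[start:start + size])
        start += size
```

Each yielded chunk could then be written to a temporary file and loaded with a single `COPY`, as the batched functions in the reproducible example do for the full data set.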

Reproducible example:

Here is a test that reproduces the issue. It defines just two node types and creates 1,000 instances of each, once batched and once non-batched, then prints the directory size for each.

import kuzu
import os
import shutil
import pandas as pd
from tempfile import NamedTemporaryFile


def run_kuzu_query(query, conn):
    qR = conn.execute(query)
    return qR.get_as_df()


def init_kuzu_db(_kuzu_loc):
    if os.path.exists(_kuzu_loc):
        shutil.rmtree(_kuzu_loc)
    ddl1 = 'create node table ContainerNode(my_key STRING, name1 STRING, name2 STRING, name3 STRING[], count INT64, PRIMARY KEY (my_key))'
    ddl2 = 'create node table ContentNode(my_key STRING, name1 STRING, name2 STRING, name3 STRING[], val1 DOUBLE, val2 DOUBLE, val3 DOUBLE, x BOOLEAN, y BOOLEAN, z BOOLEAN, w STRING[], PRIMARY KEY (my_key))'
    ddl3 = 'CREATE REL TABLE ContContRel(FROM ContainerNode TO ContentNode)'
    buf_pool_gb = 1
    _kuzuDB = kuzu.Database(_kuzu_loc, buffer_pool_size=buf_pool_gb*(1024**3))
    conn = kuzu.Connection(_kuzuDB, num_threads=1)
    run_kuzu_query(ddl1, conn)
    run_kuzu_query(ddl2, conn)
    run_kuzu_query(ddl3, conn)
    return conn


def create_container_nodes_no_batch(conn):
    for i in range(1000):
        dml  = "CREATE(t:ContainerNode {{ my_key : '{0}', name1 : 'xyz', name2 : 'abc', name3 : ['pqr'], count : {1} }})".format(i, i)
        run_kuzu_query(dml, conn)


def create_content_nodes_no_batch(conn):
    for i in range(1000):
        dml  = ("CREATE(t:ContentNode {{ my_key : '{0}', name1 : 'xyz', name2 : 'abc', name3 : ['pqr'], val1 : {1}, val2: {2}, val3: 3.5, "
                "x: TRUE, y: FALSE, z: TRUE, w: ['random'] }})").format(i, i, 1.5*i)
        run_kuzu_query(dml, conn)


def create_container_nodes_batched(conn):
    # Build all rows up front; appending to a DataFrame one row at a time is slow.
    rows = [[f'{i}', 'xyz', 'abc', ['pqr'], i] for i in range(1000)]
    df = pd.DataFrame(rows, columns=['my_key', 'name1', 'name2', 'name3', 'count'])

    tf = NamedTemporaryFile(suffix=".parquet")
    with tf:
        df.to_parquet(tf)
        tf.flush()
        dml = f"""COPY ContainerNode FROM "{tf.name}" """
        conn.execute(dml)



def create_content_nodes_batched(conn):
    columns = ['my_key', 'name1', 'name2', 'name3', 'val1', 'val2', 'val3', 'x', 'y', 'z', 'w']
    # Use the same values as the non-batched path (val1 = i, val2 = 1.5 * i)
    # so that both runs store identical data.
    rows = [[f'{i}', 'xyz', 'abc', ['pqr'], 1.0*i, 1.5*i, 3.5, True, False, True, ['random']]
            for i in range(1000)]
    df = pd.DataFrame(rows, columns=columns)
    tf = NamedTemporaryFile(suffix=".parquet")
    with tf:
        df.to_parquet(tf)
        tf.flush()
        dml = f"""COPY ContentNode FROM "{tf.name}" """
        conn.execute(dml)


def get_dir_size(path):
    total = 0
    with os.scandir(path) as it:
        for entry in it:
            if entry.is_file():
                total += entry.stat().st_size
            elif entry.is_dir():
                total += get_dir_size(entry.path)
    return total


## Without batching
conn_no_batch = init_kuzu_db("/tmp/no_batch.kuzudb")
create_container_nodes_no_batch(conn_no_batch)
create_content_nodes_no_batch(conn_no_batch)

## With batching
conn_batched = init_kuzu_db("/tmp/batched.kuzudb")
create_container_nodes_batched(conn_batched)
create_content_nodes_batched(conn_batched)


print('Size without batching:\t', get_dir_size("/tmp/no_batch.kuzudb"))
print('Size with batching:\t\t', get_dir_size("/tmp/batched.kuzudb"))
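
To check which file accounts for the difference (the clarification above attributes it to data.kz), a per-file breakdown is more informative than a single directory total. A minimal sketch using only the standard library follows. The helper names `file_sizes` and `print_breakdown` are made up for illustration, and the paths in the usage note are the ones from the script above.

```python
import os
from typing import Dict


def file_sizes(path: str) -> Dict[str, int]:
    """Map each file under `path` (as a relative path) to its size in bytes."""
    sizes = {}
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            sizes[os.path.relpath(full, path)] = os.path.getsize(full)
    return sizes


def print_breakdown(path: str) -> None:
    """Print files from largest to smallest with their share of the total."""
    sizes = file_sizes(path)
    total = sum(sizes.values()) or 1  # avoid division by zero for empty dirs
    for rel, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{size:>12,d}  {100 * size / total:5.1f}%  {rel}")
```

Calling `print_breakdown("/tmp/no_batch.kuzudb")` and `print_breakdown("/tmp/batched.kuzudb")` after the script has run should show whether a single file dominates in each case.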


ray6080 · 2024-04-30T13:04:20Z

Looking at the script, one hypothesis is that the size difference is due to compression: a few thousand transactions might trigger re-compression of existing tuples, and right now we don't reclaim that space (this will certainly be added later), whereas a single COPY statement doesn't trigger re-compression at all.

Will profile a bit more to verify if that's the case.
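
The hypothesis can be illustrated with a toy model. This is not Kuzu's storage layer: the page size, the bit-packing scheme, and all function names below are assumptions made purely for illustration. Values are inserted one at a time into a bit-packed chunk, and whenever a value needs a wider bit width, the chunk is rewritten into freshly allocated pages. If the old pages are never reclaimed, the file ends up larger than it would be after a single bulk load.

```python
PAGE_SIZE = 4096  # bytes; an arbitrary page size for this toy model


def pages_needed(num_values: int, bit_width: int) -> int:
    """Pages required to store `num_values` values bit-packed at `bit_width` bits."""
    total_bytes = (num_values * bit_width + 7) // 8
    return max(1, -(-total_bytes // PAGE_SIZE))  # ceiling division


def bulk_load_pages(values) -> int:
    """File size in pages when all values are known up front and written once."""
    width = max(max(1, v.bit_length()) for v in values)
    return pages_needed(len(values), width)


def incremental_pages(values, reclaim: bool) -> int:
    """File size in pages after inserting non-negative ints one at a time.

    When a value needs more bits than the current width, the whole chunk is
    re-compressed into new pages. With `reclaim=False` the old pages remain
    in the file as dead space.
    """
    file_pages = 0  # pages allocated in the file, live or dead
    live_pages = 0  # pages currently holding the chunk
    width = 0
    for n, v in enumerate(values, start=1):
        new_width = max(width, max(1, v.bit_length()))
        new_pages = pages_needed(n, new_width)
        if new_width > width:
            # Re-compression: rewrite the chunk into fresh pages.
            if reclaim:
                file_pages -= live_pages
            file_pages += new_pages
            live_pages, width = new_pages, new_width
        elif new_pages > live_pages:
            # Same width: the chunk just grows by appending pages.
            file_pages += new_pages - live_pages
            live_pages = new_pages
    return file_pages
```

In this model, `incremental_pages(values, reclaim=True)` always matches `bulk_load_pages(values)`, while `reclaim=False` can only be equal or larger. The gap grows with the number of re-compressions, which is the behaviour the hypothesis describes.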

ray6080 · 2024-07-23T15:45:15Z

This is now solved with MVCC (#3718).

prrao87 added the bug label Apr 30, 2024

prrao87 assigned ray6080 Apr 30, 2024

ray6080 mentioned this issue in Release v0.5.0 #3666 on Jul 12, 2024

ray6080 closed this as completed Jul 23, 2024
