milesgranger/pytpch

Python bindings to TPC-H data generation

Ergonomically create TPC-H data thru Python as Arrow tables.

NOTE: This was a weekend project and is still a WIP. For now, only x86_64 Linux wheels are available on PyPI.

import pytpch
import pyarrow as pa

# Generate TPC-H data at scale 1 (~1GB)
tables: dict[str, pa.Table] = pytpch.dbgen(sf=1)

# Generate a single table at scale 1
tables: dict[str, pa.Table] = pytpch.dbgen(sf=1, table=pytpch.Table.Nation)

# Generate a single chunk (out of n chunks) of a single table.
# This is especially helpful at larger scale factors: you can generate
# subsets of the data and store them, or join them back together after
# generating the chunks in parallel.
tables: dict[str, pa.Table] = pytpch.dbgen(sf=1, n_steps=10, step=1, table=pytpch.Table.Nation)


# NOTE! As mentioned in the docs for this function, it is NOT thread-safe.
#       If you want to generate data in parallel, for now you must do so in separate
#       processes, e.g. with `multiprocessing` or `concurrent.futures.ProcessPoolExecutor`.
#       Making it thread-safe is a TODO: the original C code uses copious global and
#       static function variables to maintain state. The refactoring in milesgranger/libdbgen
#       resets that state between calls, but the shared globals themselves remain, so the
#       function is still not thread-safe.
#
# Example of generating data in parallel:
from concurrent.futures import ProcessPoolExecutor

n_steps = 10  # 10 total chunks

def gen_step(step):
    return pytpch.dbgen(sf=10, n_steps=n_steps, nth_step=step)

with ProcessPoolExecutor() as executor:
    jobs: list[dict[str, pa.Table]] = list(executor.map(gen_step, range(n_steps)))
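
# Combine the per-chunk results into one table per name. A minimal sketch
# (not part of pytpch itself): it assumes every chunk dict uses the same
# table-name keys, which holds when each step is generated with the same arguments.
combined: dict[str, pa.Table] = {
    name: pa.concat_tables([job[name] for job in jobs])
    for name in jobs[0]
}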
  

# The default reference queries (1-22) are provided as module constants, e.g.:
print(pytpch.QUERY_1)
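
# The reference queries are plain SQL strings, so they can be run against the
# generated Arrow tables by any engine that can read them. A minimal sketch using
# DuckDB (an assumption, not a pytpch dependency); it also assumes `tables` holds
# the full set from `pytpch.dbgen(sf=1)` above and that its keys match the table
# names the queries reference:
import duckdb

con = duckdb.connect()
for name, table in tables.items():
    con.register(name, table)  # expose each Arrow table to SQL under its name
result: pa.Table = con.execute(pytpch.QUERY_1).fetch_arrow_table()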

Tell me more...

Python bindings (thru Rust, b/c why not) to libdbgen, which is a fork of databricks/tpch-dbgen, for generating TPC-H data.

tpch-dbgen was originally a CLI for generating TPC-H data as CSV files. I wanted to make it into an ergonomic Python API for use in other projects.

TODOS (roughly in order of priority):

  • Support for more than Linux x86_64 (mostly just adapting C lib and updating CI)
  • Remove verbose stdout
  • Write directly to Arrow, removing CSV writing (w/ nanoarrow probably)
  • Make thread safe (remove global and static function variables in C lib, and remove changing of CWD)
  • Separate out the Rust stuff into its own crate.

Build from source...

Roughly:

  • git clone --recursive git@github.com:milesgranger/pytpch.git
  • python -m pip install maturin
  • maturin build --release

That'll only work if you're on x86_64 Linux for now; you can try adapting build.rs for other platforms, but good luck with that.
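If the build succeeds, the wheel typically ends up under target/wheels/; install it with python -m pip install on that file, or run maturin develop --release instead to build and install directly into the active virtual environment.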
