Skip to content

paultiq/fabduckdb

Repository files navigation

Introducing Fabulous Duckdb [Fab]

Fabulous DuckDB [Fab] extends DuckDB's Python API with additional control flow, query templating, and pseudo-table functions.

The Fabulous Mr Duck

Fab intercepts and rewrites SQL statements called via PyConnection.execute(), extending the DuckDB SQL syntax to add three primary features:

  • Loops / Iteration: Parameterizes a statement using parameters from a separate iteration statement
  • Dynamic SQL Execution: Executes SQL statements created by other SQL statements
  • Inline Table Functions: Allows functions that return DataFrames (or other DuckDB supported types), Parquet Files, or CSV files to be embedded in SQL statements.

Quick Start

%pip install 'fabduckdb @ git+https://github.com/paultiq/fabduckdb'
import duckdb
import fabduckdb

createstatement = "CREATE or REPLACE TABLE abc as SELECT * FROM range(1,10,1) t(x);CREATE or REPLACE TABLE def as SELECT * FROM range(20,30,2) t(x)"
duckdb.default_connection.execute(createstatement)

loopstatement = "loop (select * from {{x}}) over (select x from (values('abc'), ('def')) t(x)) using 'union all'"
duckdb.default_connection.execute(loopstatement).df()

Details

  • Loop Statements: Iterates over a the iteration statement, fed as parameters to the template (Jinja2) query
loop (select * from {{tablename}}) over (select tablename from tableoftablenames) [USING 'None' | 'Union' | ...]
create table xyz as (loop (select * from {{tablename}}) over (select tablename from tableoftablenames) USING ('union all by name')
  • Dynamic Queries: Executes the quer(ies) dynamically generated by the inner select
import fabduckdb
query = "execute (select 'select * from range(' || x::varchar || ')' from range(3) t(x)) USING 'Union'"
duckdb.connect().execute(query)

A con [duckdb.pyconnection] object is passed to every function. If generates_filepath is True, a filename [str] object is also passed.

import fabduckdb
fabduckdb.register_function("dfcreate", lambda rows, cols, con: pd.DataFrame(np.random.rand(rows, cols)), generates_filepath=False)

duckdb.connect().execute("select * from dfcreate(3,4)").df()
  • File Functions: Extracts and rewrites embedded functions that populate external files
import fabduckdb
fabduckdb.register_function("dfcreate", lambda rows, cols, filename, con: pd.DataFrame(np.random.rand(rows, cols)).to_parquet(filename), generates_filepath=True)
duckdb.connect().execute("select * from dfcreate(3,4)").df()

Project Status

This project is currently at the Proof of Concept stage: to prove the idea, get feedback, and refine the syntax and flow of operations.

This is also intended to make it easy to experiment with language extensions.

The next step is likely to propose changes the DuckDB Python API that eliminate the need to monkey-patch.

Current Limitations

  • Functions have no visibility to CTEs or subqueries: Functions are executed first and the function is rewritten to use their output (either an DataFrame/similar object, or a file path).
  • Only PyConnection.execute() calls are supported. .sql() and pyrelations are not supported.
  • Objects created by functions are registered, and left registered until the next statement execution, as they can't be unregistered until they've been consumed.
  • Files created by functions are not yet deleted: need to come up with a way to specify the directory path and/or clean-up policy
  • Parameters:
    • Use named parameters: pass $name parameters in a Dict
    • Anonymous parameters won't rewrite properly, although they're fine for functions that don't need rewriting.
    • parameters are string-replaced in function parameters. This will replace both $TEST and any variable partially matching $TEST, along with $TEST appearing inside any string.

To Do's

  • More test cases
  • Think through unregistration, deletion
  • Caching: Wanted to implement a caching operation, but the chaining of operations makes this a bit more involved than simply adding a decorator over a function: it's the final .df() (or whatever) operation that we want to cache, but this is chained.

How it Works

Loops

TBD

Dynamic Execute

TBD

Functions

TBD

Decorators

TBD

More Information

Decorators

Allows code to be injected before or after the execute() call is passed to underlying DuckDB Python bindings.

Why? Primarily envisioned for performance or timing use: ensure all queries are logged (and timed) in consistent paths.

decorator_example.ipynb

import time
def execute_decorator(func):
    def wrapper(*args, **kwargs):
        start = time.time()

        result = func(*args, **kwargs)

        end = time.time()
        logger.info(f"Query duration: {round(end-start, 3)}s, {args} {kwargs}")
        return result
    return wrapper

# append=True: stacks the decorator on top of any existing decorators
# append=False: decorates the original function and discards any additional decorators
fabduckdb.wrap_execute(execute_decorator, append=False)

Python Injection

For registered functions, Fab will execute the Python inline. Functions must be registered: eval() of arbitrary Python code is not supported (although could be):

Assuming somefunction is registered with fab_duckdb.register_function, this transforms this statement:

select * from somefunction(arguments)
    Into:
python (somefunction(arguments)) into ('df123');
** Implicit: duckdb register('df123', 'obj');
select * from df123;
duckdb unregister('df123', 'obj');

Python Statement

Executes the statement, storing the result in the named "into" object. The "into" object is then registered in the connection. After the subsequent operation, the object is unregistered, so it's not pinned.

Loop

loop (jinja2_template_statement) over (iterationstatement) [how 'None'|'Union'|...}

Parameterizes the statement, using each row from iteration statement. Each row from iteration statement creates a new statement, with the row results used as a Jinja2 parameters into the statement. The column names from iterationstatement are used as parameters. Note that the column names must match the parameter names inside the {{}}.

Example:

CREATE or REPLACE TABLE abc as SELECT * FROM range(1,10,1)
;
CREATE or REPLACE TABLE def as SELECT * FROM range(20,30,2)
;
loop (select * from {{x}}) over (select x from (values('abc'), ('def')) t(x))

How: None [Default]

None executes each statement sequentially. None would be useful for insert and update statements, such as to split a single expensive operation that may hit memory limits into individual sub-queries.

    loop (INSERT INTO abc SELECT * FROM mytable WHERE id between x and y) on (x,y) using (SELECT * from (values(1,10), (10,20)) with ('none')

transforms into two individual statements:

    INSERT INTO abc SELECT * FROM mytable WHERE id between 1 and 10);
    INSERT INTO abc SELECT * FROM mytable WHERE id between 10 and 20);

USING: [Union | Union All | Union By Name | Union All By Name]

Applies the Union Function to the individual results:

    loop (SELECT * FROM mytable WHERE id between x and y) using (x,y) over (SELECT * from (values(1,10), (10,20)) USING ('union all')

transforms into the following statements:

    SELECT * FROM mytable WHERE id between 1 and 10)
    UNION ALL
    SELECT * FROM mytable WHERE id between 10 and 20);

More Examples

Retrieving multiple files

%load_ext magic_duckdb
import duckdb
import fabduckdb

with duckdb.connect() as con:
    con.execute("install httpfs")
    con.execute("create or replace table files as (values ('https://raw.githubusercontent.com/codeforamerica/ohana-api/master/data/sample-csv/addresses.csv'), ('https://raw.githubusercontent.com/codeforamerica/ohana-api/master/data/sample-csv/contacts.csv'))")
    df=con.execute("loop (select * from '{{col0}}') over (select * from files) using 'union all by name'").df()
    display(df)

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published