Ambiguity in `DataFrame.write_database(if_exists='replace')` could lead to data loss. #12779

itay747 · 2023-11-29T17:44:58Z

Description

It is possible to incorrectly assume that the if_exists="replace" kwarg to DataFrame.write_database refers to a record-level operation, similar to PostgREST's Prefer: resolution=merge-duplicates header.

This assumption lead to a destructive operation where the table is dropped and recreated with the schema of the supplied DataFrame. To make matters worse, the only other non-NOOP value that if_exists accepts, append, does in fact refer to a record-level operation (as one would expect), since supplying a DataFrame with a different schema to the target table leads to an exception (again, as one would expect) as opposed to overwriting the table.

A potential fix is to rename if_exists to mode, in tandem with DataFrame.write_delta(mode='overwrite').

Tangential idea: This also opens up the door to adding a new feat mode="merge" where duplicated records in the DataFrame are merged on their primary key the way PostgREST does.

Link

https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.DataFrame.write_database.html

The text was updated successfully, but these errors were encountered:

alexander-beedie · 2023-11-29T18:50:35Z

Thanks for the issue. I don't really see any ambiguity in the docstring here though; seems fairly clear what's going to happen, and it's also consistent with other major DataFrame libraries 🤔

Polars:
‘replace’ will create a new database table, overwriting an existing one.

This is the same behaviour (and parameter name) used in Pandas¹ ...

if_exists{‘fail’, ‘replace’, ‘append’}, default ‘fail’

    How to behave if the table already exists.
    fail: Raise a ValueError.
    replace: Drop the table before inserting new values.    <<<<<<<
    append: Insert new values to the existing table.

...and ADBC² (which calls the same parameter "mode" instead):

mode
How to deal with existing data:

‘append’: append to a table (error if table does not exist)
‘create’: create a table and insert (error if table exists)
‘create_append’: create a table (if not exists) and insert
‘replace’: drop existing table (if any), then same as ‘create’    <<<<<<<

Given the precedents (and the docstring), I don't think most users will be that surprised.

alexander-beedie · 2023-11-29T18:53:26Z

Thinking about it some more, we could consider renaming the parameter if_table_exists, just to really drive home that this parameter applies to tables, not data. If that helps someone to think twice before accidentally dropping their database table, I'd support that... ;)

itay747 · 2023-11-29T19:33:33Z

we could consider renaming the parameter if_table_exists

I think this is an improvement, but it comes at the expense of worsening the inconsistency from DataFrame.write_delta, which afaik is the only other DataFrame write operation that targets a url endpoint.

alexander-beedie · 2023-11-29T19:41:40Z

comes at the expense of worsening the inconsistency from DataFrame.write_delta,

Not really. That method already has a different parameter name, and different options; improving write_database doesn't change that either way 🤔 Good you draw attention to it though - might not be a bad idea to standardise the two method parameters/options! Will have a look at that separately later.

astrojuanlu · 2024-05-21T06:10:20Z

Tangential idea: This also opens up the door to adding a new feat mode="merge" where duplicated records in the DataFrame are merged on their primary key the way PostgREST does.

Do you know if there an issue tracking this already? Would be similar to .write_delta(mode="merge") https://docs.pola.rs/py-polars/html/reference/api/polars.DataFrame.write_delta.html

itay747 added the documentation Improvements or additions to documentation label Nov 29, 2023

alexander-beedie mentioned this issue Nov 29, 2023

depr(python): Rename write_database parameter if_exists to if_table_exists #12783

Merged

alexander-beedie closed this as completed in #12783 Nov 30, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ambiguity in `DataFrame.write_database(if_exists='replace')` could lead to data loss. #12779

Ambiguity in `DataFrame.write_database(if_exists='replace')` could lead to data loss. #12779

itay747 commented Nov 29, 2023 •

edited

Loading

alexander-beedie commented Nov 29, 2023 •

edited

Loading

alexander-beedie commented Nov 29, 2023 •

edited

Loading

itay747 commented Nov 29, 2023 •

edited

Loading

alexander-beedie commented Nov 29, 2023 •

edited

Loading

astrojuanlu commented May 21, 2024

Ambiguity in DataFrame.write_database(if_exists='replace') could lead to data loss. #12779

Ambiguity in DataFrame.write_database(if_exists='replace') could lead to data loss. #12779

Comments

itay747 commented Nov 29, 2023 • edited Loading

Description

Link

alexander-beedie commented Nov 29, 2023 • edited Loading

Footnotes

alexander-beedie commented Nov 29, 2023 • edited Loading

itay747 commented Nov 29, 2023 • edited Loading

alexander-beedie commented Nov 29, 2023 • edited Loading

astrojuanlu commented May 21, 2024

Ambiguity in `DataFrame.write_database(if_exists='replace')` could lead to data loss. #12779

Ambiguity in `DataFrame.write_database(if_exists='replace')` could lead to data loss. #12779

itay747 commented Nov 29, 2023 •

edited

Loading

alexander-beedie commented Nov 29, 2023 •

edited

Loading

alexander-beedie commented Nov 29, 2023 •

edited

Loading

itay747 commented Nov 29, 2023 •

edited

Loading

alexander-beedie commented Nov 29, 2023 •

edited

Loading