Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ambiguity in DataFrame.write_database(if_exists='replace') could lead to data loss. #12779

Closed
itay747 opened this issue Nov 29, 2023 · 5 comments · Fixed by #12783
Closed

Ambiguity in DataFrame.write_database(if_exists='replace') could lead to data loss. #12779

itay747 opened this issue Nov 29, 2023 · 5 comments · Fixed by #12783
Labels
documentation Improvements or additions to documentation

Comments

@itay747
Copy link

itay747 commented Nov 29, 2023

Description

It is possible to incorrectly assume that the if_exists="replace" kwarg to DataFrame.write_database refers to a record-level operation, similar to PostgREST's Prefer: resolution=merge-duplicates header.

This assumption lead to a destructive operation where the table is dropped and recreated with the schema of the supplied DataFrame. To make matters worse, the only other non-NOOP value that if_exists accepts, append, does in fact refer to a record-level operation (as one would expect), since supplying a DataFrame with a different schema to the target table leads to an exception (again, as one would expect) as opposed to overwriting the table.

A potential fix is to rename if_exists to mode, in tandem with DataFrame.write_delta(mode='overwrite').

Tangential idea: This also opens up the door to adding a new feat mode="merge" where duplicated records in the DataFrame are merged on their primary key the way PostgREST does.

Link

https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.DataFrame.write_database.html

@itay747 itay747 added the documentation Improvements or additions to documentation label Nov 29, 2023
@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Nov 29, 2023

Thanks for the issue. I don't really see any ambiguity in the docstring here though; seems fairly clear what's going to happen, and it's also consistent with other major DataFrame libraries 🤔

Polars:
‘replace’ will create a new database table, overwriting an existing one.

This is the same behaviour (and parameter name) used in Pandas1 ...

if_exists{‘fail’, ‘replace’, ‘append’}, default ‘fail’

    How to behave if the table already exists.
    fail: Raise a ValueError.
    replace: Drop the table before inserting new values.    <<<<<<<
    append: Insert new values to the existing table.

...and ADBC2 (which calls the same parameter "mode" instead):

mode
How to deal with existing data:

‘append’: append to a table (error if table does not exist)
‘create’: create a table and insert (error if table exists)
‘create_append’: create a table (if not exists) and insert
‘replace’: drop existing table (if any), then same as ‘create’    <<<<<<<

Given the precedents (and the docstring), I don't think most users will be that surprised.

Footnotes

  1. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_sql.html#pandas.DataFrame.to_sql

  2. https://arrow.apache.org/adbc/0.8.0/python/api/adbc_driver_manager.html#adbc_driver_manager.dbapi.Cursor.adbc_ingest

@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Nov 29, 2023

Thinking about it some more, we could consider renaming the parameter if_table_exists, just to really drive home that this parameter applies to tables, not data. If that helps someone to think twice before accidentally dropping their database table, I'd support that... ;)

@itay747
Copy link
Author

itay747 commented Nov 29, 2023

we could consider renaming the parameter if_table_exists

I think this is an improvement, but it comes at the expense of worsening the inconsistency from DataFrame.write_delta, which afaik is the only other DataFrame write operation that targets a url endpoint.

@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Nov 29, 2023

comes at the expense of worsening the inconsistency from DataFrame.write_delta,

Not really. That method already has a different parameter name, and different options; improving write_database doesn't change that either way 🤔 Good you draw attention to it though - might not be a bad idea to standardise the two method parameters/options! Will have a look at that separately later.

@astrojuanlu
Copy link

Tangential idea: This also opens up the door to adding a new feat mode="merge" where duplicated records in the DataFrame are merged on their primary key the way PostgREST does.

Do you know if there an issue tracking this already? Would be similar to .write_delta(mode="merge") https://docs.pola.rs/py-polars/html/reference/api/polars.DataFrame.write_delta.html

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants