BigQuery: Deprecate automatic schema conversion in load_table_from_dataframe #9042

Closed
4 tasks done
tswast opened this issue Aug 16, 2019 · 6 comments · Fixed by #9176
Labels
api: bigquery Issues related to the BigQuery API. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.

Comments

@tswast
Contributor

tswast commented Aug 16, 2019

After discussion around #9024, I'm coming to realize that there are a lot of inconsistencies with the pandas DataFrame serialization when we have to autodetect the schema. I propose that we warn when we are given a DataFrame but can't determine the correct schema.
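To make the proposal concrete, here is a minimal sketch of a "warn when the schema can't be fully determined" check. It is not the library's implementation: `DTYPE_TO_BQ`, `detect_schema`, and the choice of `PendingDeprecationWarning` are assumptions made for illustration. The input is a plain dict of column name to dtype name (roughly what `df.dtypes.astype(str).to_dict()` gives), so the example runs without pandas.

```python
import warnings

# Hypothetical mapping from pandas dtype names to BigQuery types.
# This is an assumption for the sketch, not the library's real table.
DTYPE_TO_BQ = {
    "int64": "INTEGER",
    "float64": "FLOAT",
    "bool": "BOOLEAN",
}


def detect_schema(dtypes):
    """Return (schema, unknown_columns) for a {column: dtype name} dict.

    Emits a warning when at least one column's type can't be determined
    from the dtype name alone.
    """
    schema = {}
    unknown = []
    for column, dtype_name in dtypes.items():
        bq_type = DTYPE_TO_BQ.get(dtype_name)
        if bq_type is None:
            # For example "object" columns, which may hold strings,
            # structs, dates, or a mix of values.
            unknown.append(column)
        else:
            schema[column] = bq_type
    if unknown:
        warnings.warn(
            "Could not determine the type of columns: {}".format(
                ", ".join(unknown)
            ),
            PendingDeprecationWarning,
            stacklevel=2,
        )
    return schema, unknown
```

Calling `detect_schema({"id": "int64", "payload": "object"})` returns a partial schema for `id`, reports `payload` as undetermined, and issues one warning.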

I realize this will be a step backwards in terms of usability. I think the following feature requests need to be prioritized if we proceed with this deprecation:

@tswast tswast added type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. api: bigquery Issues related to the BigQuery API. labels Aug 16, 2019
@tswast
Contributor Author

tswast commented Aug 16, 2019

#9049 is pending (needs unit test updates), which will make me slightly more comfortable with this deprecation. My current thought for the "load dataframe" algorithm is:

@tswast
Contributor Author

tswast commented Aug 22, 2019

Added #5572 (comment) as another blocker. If we don't provide a way to explicitly serialize index(es), we actually lose the ability to write indexes outside the deprecated path.

@tswast
Contributor Author

tswast commented Aug 23, 2019

Thought: Do we want to 100% deprecate the to_parquet code path? What if someone explicitly chooses to use an autodetected schema? Maybe if autodetect is set, we don't warn when we can't determine a full schema? Warning unless autodetect is true would certainly be more explicit: "hey, we're gonna try to detect some data types based on the actual values, but we might not get it exactly how you want."

@plamut
Contributor

plamut commented Aug 23, 2019

Hmm ... I would still find it useful to be warned about any potential problems with schema detection, even if I explicitly instruct the method to detect it for me. Suppressing the warning might make me think that everything went well behind the scenes.

How about having an ignore_schema_warnings (default: False) option to explicitly ignore any warnings? With an explanation that if an explicit schema is provided (and thus the autodetect is not used), the option has no effect?
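A rough sketch of how such an option could behave, under the semantics described above. The helper name `maybe_warn_about_schema`, its parameters other than `ignore_schema_warnings`, and the use of `UserWarning` are hypothetical; nothing here is an existing API of the BigQuery client.

```python
import warnings


def maybe_warn_about_schema(
    explicit_schema, undetected_columns, ignore_schema_warnings=False
):
    """Warn about schema detection problems; return True if a warning was issued.

    Hypothetical helper: when an explicit schema is given, autodetection is
    not used, so ignore_schema_warnings has no effect.
    """
    if explicit_schema is not None:
        # Explicit schema: nothing was autodetected, nothing to warn about.
        return False
    if not undetected_columns:
        # Autodetection covered every column.
        return False
    if ignore_schema_warnings:
        # The caller explicitly opted out of detection warnings.
        return False
    warnings.warn(
        "Schema could not be fully determined for columns: {}".format(
            ", ".join(undetected_columns)
        ),
        UserWarning,
        stacklevel=2,
    )
    return True
```

With the default `ignore_schema_warnings=False`, a caller who relies on autodetection still hears about columns that could not be resolved; passing `True` silences that one case only.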

@tswast
Contributor Author

tswast commented Aug 23, 2019

> Suppressing the warning might make me think that everything went well behind the scenes.

In a lot of cases, to_parquet works fine. It's just that we've been hit with a bunch of issues recently around missing struct support. I do wish we had a way to determine when that case might be hit without inspecting the actual values (slow!).

My thought with #9042 (comment) is that there will still be times when you have an arbitrary DataFrame, can't exactly guess the schema, and want to explicitly opt in to whatever pandas does to convert DataFrames to Parquet files, even if we know it might not always pick the type we want (such as confusing nullable integer columns with float columns).

@plamut
Contributor

plamut commented Aug 23, 2019

I am not that familiar with what a "typical" use case is, but the scenario described above sounds plausible. We can still mention the potential misdetection of the schema in the "opt-in" parameter's docstring, along with the fact that warnings will not be issued. Users will then know that they are explicitly turning off this aspect.
