BigQuery: get table schema if not supplied (and have pyarrow) in load_table_from_dataframe
#8142
Comments
Hi @tswast! I was looking into how to solve this issue, because #8105 closed my issue #8093. It would be great if we could do this in the background. Would this be as simple as adding the following code here https://github.com/googleapis/google-cloud-python/blob/master/bigquery/google/cloud/bigquery/client.py#L1523? This code would get the schema of the destination table and apply it to the job config:

    if not job_config.schema:
        client = Client()
        job_config.schema = client.get_table(destination).schema

Note: I am not sure if you want to initialize the client here.
A client object is already available as
This is necessary, but not sufficient. There are several cases to handle.
Note: B & C can both be true if there are some new columns and some missing columns in the DataFrame. Also, for those users that are using fastparquet instead of pyarrow, I don't want to force them to use pyarrow, as that's a breaking change (though given the difference in behavior, we may want to consider dropping fastparquet as a supported serialization library).
Oh, of course :P
How are these cases (B and C) expected to be handled?
I am not sure I understand how you would force users to use pyarrow. Can you elaborate?
#9064 actually handles both these cases, as it filters the schema by column name and re-orders the schema to match the DataFrame column order.
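For readers following along, the filter-and-reorder step described above could look roughly like the sketch below. This is not the code from #9064; `Field` is a hypothetical stand-in for `google.cloud.bigquery.SchemaField`, and `match_schema_to_columns` is an illustrative name, used so the example runs without the BigQuery library installed.

```python
from collections import namedtuple

# Hypothetical stand-in for google.cloud.bigquery.SchemaField, so the sketch
# runs without the BigQuery client library installed.
Field = namedtuple("Field", ["name", "field_type"])


def match_schema_to_columns(table_schema, dataframe_columns):
    """Return the table fields that appear in the DataFrame, in DataFrame column order."""
    fields_by_name = {field.name: field for field in table_schema}
    return [
        fields_by_name[column]
        for column in dataframe_columns
        if column in fields_by_name  # columns unknown to the table are skipped
    ]


table_schema = [
    Field("id", "INTEGER"),
    Field("name", "STRING"),
    Field("created", "TIMESTAMP"),
]
# "created" is missing from the DataFrame; "extra" is missing from the table.
dataframe_columns = ["name", "extra", "id"]

matched = match_schema_to_columns(table_schema, dataframe_columns)
print([field.name for field in matched])
```

In this sketch, columns that exist only in the DataFrame are silently dropped from the schema; how such columns should actually be treated is a separate decision.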
FYI: #9096 will affect this implementation. After getting the table schema, you'll have to filter out any columns not present in the dataframe. If there are any columns in the dataframe that aren't present in the table, we have 2 options:
I believe option 2 is what pandas-gbq does, but option 1 is current behavior. If we do want to pursue option 2, then we should file it as a separate feature request.
As I'm writing some samples for this, I'm realizing we probably don't want to fetch the schema if the write disposition is |
@tswast Sounds like something to update in the PR?
Follow-up to #8105 (comment)
When a table schema isn't supplied in load_table_from_dataframe, try to get the existing table schema. This will prevent errors due to ambiguous pandas types (#7370) without having to explicitly provide a schema.
Note: this behavior is similar to that of pandas-gbq, which always fetches the table schema and then compares it to make sure it's compatible with the dataframe schema.
https://github.com/pydata/pandas-gbq/blob/59228d9c20cee12b24caa5cc41d3f2e6c0337932/pandas_gbq/gbq.py#L1115-L1121
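A minimal sketch of the requested behavior, under stated assumptions: `fill_missing_schema` and its `have_pyarrow` parameter are hypothetical names, and the client and table objects are fakes so the example runs without BigQuery credentials. Only `client.get_table(destination).schema` mirrors the call discussed in this issue.

```python
import importlib.util
from types import SimpleNamespace


def fill_missing_schema(client, destination, job_config, have_pyarrow=None):
    """Copy the destination table's schema into job_config when none was supplied."""
    if have_pyarrow is None:
        # Detect pyarrow without importing it.
        have_pyarrow = importlib.util.find_spec("pyarrow") is not None
    if job_config.schema or not have_pyarrow:
        # Respect an explicit schema, and leave behavior unchanged without pyarrow.
        return job_config
    job_config.schema = list(client.get_table(destination).schema)
    return job_config


# Hypothetical fakes standing in for a BigQuery client, table, and job config.
fake_table = SimpleNamespace(schema=["id:INTEGER", "name:STRING"])
fake_client = SimpleNamespace(get_table=lambda destination: fake_table)

config = SimpleNamespace(schema=None)
fill_missing_schema(fake_client, "my_dataset.my_table", config, have_pyarrow=True)
print(config.schema)
```

A real implementation would also have to decide what to do when the destination table does not exist yet, which this sketch does not cover.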