ALLOW_FIELD_ADDITION not working #1095
Comments
I need some more information. Is it that this feature doesn't work in combination with schema autodetection? I see we have two system tests / samples with passing tests where we set ALLOW_FIELD_ADDITION. Neither of these uses autodetect with load_table_from_json, though.
I have been able to do this successfully using the existing documentation for both the CLI and the Node.js client library. I would also like to see more information about the failing Python code.
One scenario that could explain your issue is if the new field doesn't appear until a few hundred rows into the JSONL file. Since autodetect only looks at a limited number of rows, if the new column/field doesn't appear until the end, I suspect autodetect doesn't include it in the updated schema.
@tswast no, it's not about the number of rows before the new field. Although the above documentation is for DataFrames, the behavior for JSON data shouldn't be any different (EDIT: it actually is).
EDIT: The behavior is actually different, although it shouldn't be from a usage perspective. Proof below (with DataFrames it works):
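A minimal sketch of the working DataFrame case; the table ID and column names are placeholders, and the destination table is assumed to already exist with columns id and name:

```python
import pandas as pd
from google.cloud import bigquery

client = bigquery.Client()
table_id = "your_project.your_dataset.your_table"  # placeholder

# DataFrame containing a column ("new_col") that does not exist in the table yet.
df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"], "new_col": ["x", "y"]})

job_config = bigquery.LoadJobConfig(
    # Partial schema: only the existing columns are pinned; the new one is left to detection.
    schema=[
        bigquery.SchemaField("id", "INTEGER"),
        bigquery.SchemaField("name", "STRING"),
    ],
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION],
)

client.load_table_from_dataframe(df, table_id, job_config=job_config).result()
# With the DataFrame path, "new_col" shows up in the table schema after the load.
print([field.name for field in client.get_table(table_id).schema])
```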
Thanks for the reproducible test case! This appears to be a bug in load_table_from_json.
Taking a look at this to see what options we have. Thanks for the robust troubleshooting. That really helps.
Behind the scenes there are several moving pieces here. With that in mind: by providing a schema out of the gate, as shown in the snippet from your code, is it possible that we short-circuit the autodetection process and never allow BigQuery to do the expected processing?
Does the code not work properly if we simply load the config without providing a schema?
NOTE: as indicated by Tim, my understanding is that behind the scenes, when performing autodetection, BigQuery will attempt to define the correct schema, but there are limits with JSON. BigQuery only parses a few (hundred?) rows to make a reasonable estimate of what the schema should look like.
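For concreteness, this is roughly the comparison being asked about, as a sketch with placeholder column names: a config that both enables autodetect and pins part of the schema, versus one that relies on autodetect alone.

```python
from google.cloud import bigquery

# Config A: partial schema provided up front, plus autodetect and ALLOW_FIELD_ADDITION.
config_with_schema = bigquery.LoadJobConfig(
    autodetect=True,
    schema=[bigquery.SchemaField("id", "INTEGER")],  # placeholder column
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION],
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
)

# Config B: no schema at all, so BigQuery has to autodetect every column.
config_autodetect_only = bigquery.LoadJobConfig(
    autodetect=True,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION],
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
)
```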
Hi @chalmerlowe, so it's crucial to be able to provide a partial schema definition (also via load_table_from_json).
Any updates on this? I'm trying to add a new BigQuery column from Apache Beam and this setting seems to have no effect. I resorted to creating the columns manually in BQ.
I can also add: we just ran into the problem that our database dump didn't get extended automatically.
It would be great if this or #1646 would finally get some priority, as it impacts the usability of BigQuery.
@tpcgold Thank you for your input. There are a lot of changes, proposed changes, suggestions, and complexity in the area of loading tables, and we hope to identify what makes the most sense in terms of cleaning up the code and fixing important bugs, so that we can evolve this code effectively.
For the DataFrame path the partial schema appears to be applied. However, with load_table_from_json it is not.
@tswast @chalmerlowe
The backend defines what parquet data types we need to set for various BigQuery data types, documented here: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-parquet#parquet_conversions
I'm pretty sure the user-provided schema is ignored in the case of file types that contain embedded schema, but I don't recall. We have logic in our pandas conversion to convert BigQuery types to Arrow so they serialize correctly. This is why we pass in a BQ schema when converting the DataFrame, like your recent fix to python-bigquery/google/cloud/bigquery/client.py (line 2658 in 6249032).
@Linchin Are you able to reproduce the bug with that configuration? That's what the user wants in this case, right?
Thank you @tswast for explaining how this works. So I feel like the issue is with how the user-provided schema is handled.
This is expected for file formats with embedded schema.
See: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-parquet#parquet_schemas
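As a sketch of what an embedded-schema format means in practice (placeholder table ID and column names, and assuming pyarrow is installed): a Parquet file already carries its own column types, so per the comment above any schema set on the job config would not drive the load.

```python
import io

import pyarrow as pa
import pyarrow.parquet as pq
from google.cloud import bigquery

client = bigquery.Client()
table_id = "your_project.your_dataset.your_table"  # placeholder

# The Parquet file embeds its own schema (id: int64, new_col: string).
buf = io.BytesIO()
pq.write_table(pa.table({"id": [1, 2], "new_col": ["x", "y"]}), buf)
buf.seek(0)

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION],
)

# The column types come from the Parquet file itself, not from a user-provided schema.
client.load_table_from_file(buf, table_id, job_config=job_config).result()
```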
In that case, maybe we can add documentation somewhere to make it clearer that the user-provided schema is ignored for file formats with embedded schema.
Hi, linking me into the conversation. Yes, I would expect the same behavior: from a usage perspective, the load methods need to behave the same. In addition, the "partial schema definition" described in the docs should be honored there as well.
Hi @tpcgold, thank you for the comments. I think it's reasonable that when users are adding a new column, the backend should be able to autodetect the type, and only for the new column. However, I think this would be a backend feature request, and outside the clients' scope. You can find how to file a feature request on the support page.
As to this: could you clarify for me what the request is here? I think this feature is working as intended right now.
Yeah, DataFrame as described above is working well, but the same partial-schema approach is not working with load_table_from_json() and load_table_from_file().
Also, for further context: with this not working, BigQuery lacks the capability to have forward-compatible changes uploaded automatically via the BigQuery Python client.
The sample referenced above is for DataFrames. @tswast @chalmerlowe, do you have more context on why only the DataFrame path supports a partial schema?
I believe DataFrame allowed this as a way to override the default types that we got from pandas -> arrow -> parquet. Specifically, I see in BigQuery DataFrames we have problems differentiating between DATETIME and TIMESTAMP without this logic. I'll have to think about what this would mean for loading JSON, where we don't have any local type-detection logic at the moment other than fetching the table schema, which I believe we recently added.
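A sketch of the kind of override described above (table ID and column name are assumptions): pinning one column in a partial schema forces DATETIME, where the pandas -> arrow -> parquet conversion on its own might otherwise produce TIMESTAMP depending on the library version.

```python
import datetime

import pandas as pd
from google.cloud import bigquery

client = bigquery.Client()

# Naive (timezone-less) datetimes are the ambiguous case between DATETIME and TIMESTAMP.
df = pd.DataFrame({"event_at": [datetime.datetime(2023, 1, 1, 12, 0)]})

job_config = bigquery.LoadJobConfig(
    # Partial schema: only the ambiguous column is pinned; any other columns
    # would still get their types from the pandas -> arrow conversion.
    schema=[bigquery.SchemaField("event_at", "DATETIME")],
)

client.load_table_from_dataframe(
    df, "your_project.your_dataset.your_table", job_config=job_config  # placeholder
).result()
```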
Here is another way to show how weird the function works nowadays:
This code creates a table with two columns. autodetect = True + a provided schema + ALLOW_FIELD_ADDITION "works", but the extra column is not added. I add this comment to demonstrate that it's not just about adding columns: we must also provide partial schemas to avoid auto-detection. If I may comment, auto-detection is confusing: even though the value is quoted, it considers id to be an INT; I would have expected that behavior only if id were given as id: 3. Anyway, thanks for the further improvements.
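A minimal sketch of the scenario described above (table ID and column names are placeholders): quoted id values in the JSON rows, combined with autodetect, a partial schema for the known column, and ALLOW_FIELD_ADDITION.

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "your_project.your_dataset.your_table"  # placeholder

# "id" is sent as a quoted string, yet autodetect reportedly infers INTEGER for it.
rows = [{"id": "3", "name": "a", "extra": "x"}]

job_config = bigquery.LoadJobConfig(
    autodetect=True,
    schema=[bigquery.SchemaField("name", "STRING")],  # partial schema for the known column
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION],
)

client.load_table_from_json(rows, table_id, job_config=job_config).result()
# Reported behavior: the load "works", but the extra column is not added,
# and the quoted "3" is still detected as INTEGER rather than STRING.
print(client.get_table(table_id).schema)
```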
Since about a month ago we have been unable to use ALLOW_FIELD_ADDITION in our Python scripts, as it doesn't work (has no effect) with python-bigquery anymore. The scripts did work as expected before, so probably something has changed in the job insert API rather than in the library.
This issue really grinds my gears
google-cloud-bigquery version: 2.31.0
Steps to reproduce
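A minimal sketch of the scenario discussed in this thread, with placeholder project, dataset, table, and column names:

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "your_project.your_dataset.your_table"  # placeholder; starts with columns id, name

# Rows include a column ("new_col") that does not exist in the destination table yet.
rows = [{"id": 1, "name": "a", "new_col": "x"}]

job_config = bigquery.LoadJobConfig(
    autodetect=True,
    schema=[
        bigquery.SchemaField("id", "INTEGER"),
        bigquery.SchemaField("name", "STRING"),
    ],
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION],
)

client.load_table_from_json(rows, table_id, job_config=job_config).result()

# Expected: "new_col" is appended to the table schema.
# Reported: the job succeeds but the table schema is unchanged.
print([field.name for field in client.get_table(table_id).schema])
```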