Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BQ update error: Cannot add required fields to an existing schema. (field: application), invalid #63

Closed
whd opened this issue Aug 16, 2019 · 6 comments

Comments

@whd
Copy link
Member

whd commented Aug 16, 2019

* google_bigquery_table.telemetry_first_shutdown_v4_live: googleapi: Error 400: Provided Schema does not match Table moz-fx-data-shar-nonprod-efed:telemetry_live.first_shutdown_v4. Cannot add required fields to an existing schema. (field: application), invalid
* google_bigquery_table.telemetry_first_shutdown_v4_stable: googleapi: Error 400: Provided Schema does not match Table moz-fx-data-shar-nonprod-efed:telemetry_stable.first_shutdown_v4. Cannot add required fields to an existing schema. (field: application), invalid

This is in the process of going from an "empty" schema to one copied via #58. I haven't seen this error before and I'm not sure what's up and how to resolve it. Generally if the upgrade path from "schemaless" to "schemad" pings involves an incompatible schema change that would be no bueno, but it may be specific to this case.

@acmiyaguchi
Copy link
Contributor

This is related to the following documentation: https://cloud.google.com/bigquery/docs/managing-table-schemas#manually_adding_an_empty_column

Apparently, it's not possible to add a required column to an existing table. This actually makes sense -- fields that are currently not in the schema can't possible be added as a required field down the line because the existing pings would fail to include those required fields.

This makes migrating the first_shutdown ping from the empty schema to the full main ping schema tricky, it probably needs to be blown away unfortunately. We don't have a good story to deal with this error.

@whd
Copy link
Member Author

whd commented Aug 16, 2019

This sounds awful (but does make some sense), because it means the upgrade path for going from schema-less processing to schema'd processing will usually involve an incompatible schema change. We should mention this in the Monday technical sync but perhaps the number of cases where we actually end up creating schemas for previously schema-less pings will be rather small.

The other outstanding one I believe is the sync ping.

@jklukas
Copy link
Contributor

jklukas commented Aug 19, 2019

I filed Filed https://enterprise.google.com/supportcenter/managecases#Case/001000000040sBR/U-20348571 to inquire about any performance implications of relaxing all fields to NULLABLE.

It is worth noting that we already do not support REQUIRED columns for tables populated via queries in bigquery-etl. AFAICT, when a new table is created as a result of a query job with createDisposition CREATE_IF_NEEDED, all fields are nullable. I am not aware of any way to mark fields in a query as required.

Thus, if we decide to drop REQUIRED mode for all tables populated by the pipeline, we will be no worse off than our derived tables. Also, I believe we would be able to proceed with that change without removing any existing tables, since BQ allows relaxing of mode from REQUIRED to NULLABLE for existing fields.

@whd
Copy link
Member Author

whd commented Aug 19, 2019

I filed mozilla/jsonschema-transpiler#84 for this feature.

Also, I believe we would be able to proceed with that change without removing any existing tables, since BQ allows relaxing of mode from REQUIRED to NULLABLE for existing fields.

This is true, but the implementation of mozilla/jsonschema-transpiler#84 does then block the deployment of BQ tables in the interim.

@jklukas
Copy link
Contributor

jklukas commented Aug 20, 2019

Got a response back from the google support case, which makes it sound unlikely that we'll suffer any performance issues from making everything nullable. I'm inclined to move ahead with the change.

The Engineering team has advised me that the difference in query performance between 'REQUIRED' and 'NULLABLE' is expected to be very small.

In intensive filtering queries, the REQUIRED MODE would be slightly more optimal than NULLABLE since it contains one less state--NULL. In conclusion, it would really depend on how much weight your query is placing on the NULL state.

FYI, BigQuery permits you to relax the schema field MODE from 'REQUIRED' to 'NULLABLE' and not vice versa. Therefore initializing your schema with 'REQUIRED' would be a safe choice for your upcoming roll-out because its modifiable anytime to NULLABLE (and not vice versa.)

Lastly, you can run a benchmarking test by running two queries and then comparing the query performance information as outlined in [1].

In general, it is recommended to follow the documented best practices [2] in order to achieve optimal performance. As long as you are not using any of the documented anti-patterns, you should not have to worry about BigQuery’s query planning performance.

I hope the information I have provided you with has fulfilled your inquiries. I will be holding this case open for you for the next few days to give you time to investigate the above. If you have any additional questions or comments concerning the inquiry I will be happy to assist you further.

I hope this message finds you well, and that you have a wonderful day.

[1] https://cloud.google.com/bigquery/query-plan-explanation
[2] https://cloud.google.com/bigquery/docs/best-practices-performance-patterns

@whd
Copy link
Member Author

whd commented Aug 20, 2019

We've deployed this to stage and the first shutdown ping looks plenty complicated now, so this looks to have been effective.

@whd whd closed this as completed Aug 20, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants