
ALLOW_FIELD_ADDITION not working #1095

Open · tpcgold opened this issue Jan 4, 2022 · 25 comments

Labels: api: bigquery (Issues related to the googleapis/python-bigquery API.) · priority: p2 (Moderately-important priority. Fix may not be included in next release.) · type: bug (Error or flaw in code with unintended results or allowing sub-optimal usage patterns.)

Comments

@tpcgold commented Jan 4, 2022

For about a month now we have been unable to use ALLOW_FIELD_ADDITION in our Python scripts: it no longer has any effect with python-bigquery. The scripts worked as expected before, so probably something changed in the job insert API rather than in the library.

This issue really grinds my gears

  • OS type and version: Windows 11
  • Python version: 3.9
  • pip version: 21.3.1
  • google-cloud-bigquery version: 2.31.0

Steps to reproduce

  1. Create table
  2. Upload a dataframe or JSON (it doesn't matter which) that contains a new field (with autodetect enabled)
@product-auto-label bot added the "api: bigquery" label on Jan 4, 2022
@yoshi-automation added the "triage me" and "🚨 This issue needs some love." labels on Jan 5, 2022
@plamut added the "priority: p2" and "type: bug" labels and removed the "🚨 This issue needs some love." and "triage me" labels on Jan 11, 2022
@tswast (Contributor) commented Jan 11, 2022

I need some more information. Is it that this feature doesn't work in combination with autodetect?

I see we have two system tests / samples with passing tests where we set ALLOW_FIELD_ADDITION:

Neither of these uses autodetect, though.
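For reference, here is a minimal sketch of the pattern those tests exercise (not one of the linked tests themselves; table and field names are placeholders): ALLOW_FIELD_ADDITION combined with an explicit schema rather than autodetect.

from google.cloud import bigquery

client = bigquery.Client()
table_id = "your-project.your_dataset.your_table"  # placeholder

# List the full target schema explicitly, including the new column,
# instead of relying on autodetect.
job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("a", "STRING"),
        bigquery.SchemaField("b", "INTEGER"),
        bigquery.SchemaField("c", "STRING"),  # newly added column
    ],
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION],
)

job = client.load_table_from_json(
    [{"a": "asdf", "b": 4525, "c": "asdf"}], table_id, job_config=job_config
)
job.result()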

@steffnay (Contributor) commented:

I have been able to do this successfully using the existing documentation for both the CLI and the Node.js client library. I would also like to see more information about the failing Python code.

@tswast added the "priority: p3" and "type: question" labels and removed the "type: bug" and "priority: p2" labels on Jan 12, 2022
@tswast (Contributor) commented Jan 13, 2022

One scenario that could explain your issue is if the new field doesn't appear until a few hundred rows into the jsonl file. Since autodetect only looks at a limited number of rows, if the new column/field doesn't appear until the end I suspect autodetect doesn't include it in the updated schema.
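If that were the cause, one mitigation (purely a sketch on my part, not something the library does) would be to move a row that contains every field to the front of the data, so the sampled rows include the new column:

def front_load_widest_row(rows):
    # Move the row with the most fields to the front so that autodetect,
    # which samples only a limited number of rows, sees every field.
    widest = max(rows, key=len)
    return [widest] + [r for r in rows if r is not widest]

rows = [{"a": "x", "b": 1} for _ in range(500)] + [{"a": "y", "b": 2, "c": "new"}]
rows = front_load_widest_row(rows)  # the row containing "c" is now first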

@tpcgold (Author) commented Mar 4, 2022

@tswast No, it's not about the number of rows before the new field appears.
@steffnay Here you go, a minimal example of BigQuery failing to use the partial schema definition:
https://cloud.google.com/bigquery/docs/samples/bigquery-load-table-dataframe
https://cloud.google.com/bigquery/docs/pandas-gbq-migration

Although the above documentation is for DataFrames, the behavior for JSON data shouldn't be any different (EDIT: it actually is).


import os

from google.cloud import bigquery


def load_json_to_bigquery(project: str, dataset: str, table_id: str, json_rows):
    client = bigquery.Client()

    try:
        table_ref = bigquery.TableReference.from_string('.'.join([project, dataset, table_id]))
        table = bigquery.Table(table_ref)
        try:
            bqschema = client.get_table(table).schema
            job_config = bigquery.LoadJobConfig(schema=bqschema)
        except Exception as e:
            job_config = bigquery.LoadJobConfig()

        job_config.autodetect = True
        job_config.schema_update_options = [
            bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION,
            bigquery.SchemaUpdateOption.ALLOW_FIELD_RELAXATION
        ]
        # job_config.max_bad_records = 5
        # job_config.ignore_unknown_values = True
        job_config.time_partitioning = bigquery.TimePartitioning(
            type_=bigquery.TimePartitioningType.DAY,
        )

        job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
        job_config.create_disposition = 'CREATE_IF_NEEDED'
        job_config.write_disposition = 'WRITE_APPEND'
        job = client.load_table_from_json(json_rows=json_rows, destination=table, job_config=job_config)
        # Wait for the load job to complete.
        print(job.result())
    except Exception as e:
        print('FAILED')
        print('Exception on loading data' + str(e))
        print(job.errors)


os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = ''
project = ''
dataset = ''
table_id = ''

# this creates a table with a STRING and b INTEGER
json_rows = [{"a": "asdf", "b": 4525}]
load_json_to_bigquery(project=project, dataset=dataset, table_id=table_id, json_rows=json_rows)

# this adds a and b to the table but ignores c!!!
json_rows = [{"a": "asdf", "b": 4525, "c": "asdf"}]
load_json_to_bigquery(project=project, dataset=dataset, table_id=table_id, json_rows=json_rows)

EDIT: The behavior is actually different although it shouldn't from a usage perspective. Proof: (with DataFrames it works)

import os
import pandas as pd
from google.cloud import bigquery


def load_df_to_bigquery(project: str, dataset: str, table_id: str, df):
    client = bigquery.Client()

    try:
        table_ref = bigquery.TableReference.from_string('.'.join([project, dataset, table_id]))
        table = bigquery.Table(table_ref)
        try:
            bqschema = client.get_table(table).schema
            job_config = bigquery.LoadJobConfig(schema=bqschema)
        except Exception as e:
            job_config = bigquery.LoadJobConfig()

        job_config.autodetect = True
        job_config.schema_update_options = [
            bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION,
            bigquery.SchemaUpdateOption.ALLOW_FIELD_RELAXATION
        ]
        # job_config.max_bad_records = 5
        # job_config.ignore_unknown_values = True
        job_config.time_partitioning = bigquery.TimePartitioning(
            type_=bigquery.TimePartitioningType.DAY,
        )

        job_config.create_disposition = 'CREATE_IF_NEEDED'
        job_config.write_disposition = 'WRITE_APPEND'
        job = client.load_table_from_dataframe(dataframe=df, destination=table, job_config=job_config)
        # Wait for the load job to complete.
        print(job.result())
    except Exception as e:
        print('FAILED')
        print('Exception on loading data' + str(e))
        print(job.errors)


os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = ''
project = ''
dataset = ''
table_id = ''

# this creates a table with a STRING and b INTEGER
df = pd.DataFrame([{"a": "asdf", "b": 4525}])
load_df_to_bigquery(project=project, dataset=dataset, table_id=table_id, df=df)

# with DataFrames the new column c is actually added to the table (unlike the JSON case above)
df = pd.DataFrame([{"a": "asdf", "b": 4525, "c": "asdf"}])
load_df_to_bigquery(project=project, dataset=dataset, table_id=table_id, df=df)

@tswast (Contributor) commented Mar 4, 2022

Thanks for the reproducible test case! This appears to be a bug in load_table_from_json. I agree it should not ignore new columns.

@tswast added the "priority: p2" and "type: bug" labels and removed the "type: question" and "priority: p3" labels on Mar 4, 2022
@tswast self-assigned this on Mar 4, 2022
@tswast removed their assignment on Apr 14, 2022
@chalmerlowe (Contributor) commented:

Taking a look at this to see what options we have. Thanks for the robust troubleshooting. That really helps.

@chalmerlowe (Contributor) commented:

@tpcgold

Behind the scenes, the load_table_from_json() function and load_table_from_dataframe() function handle detection of schemas differently:

The *_from_dataframe() function basically extracts the existing schema, if one exists, and compares it to the columns in the Pandas DataFrame to determine whether any new columns have been added. NOTE: all of this is handled in the Python client OR related helper libraries.

The *_from_json() function checks to see if a job_config is provided AND is valid. If not, it will attempt to generate one (referred to in the code as new_job_config) and will try to populate it with a set of default values. Next it detects whether new_job_config includes a schema. If not, it sets new_job_config.autodetect to True. Note: setting *.autodetect to True happens in the helper libraries, but the actual schema detection process happens in BigQuery (i.e. that process is not under the purview of these Python libraries, more on this later).

With that in mind...

By providing a schema out of the gate, as shown in the snippet from your code, is it possible that we short-circuit the autodetection process and never allow BigQuery to do the expected processing?

try: 
    bqschema = client.get_table(table).schema
    job_config = bigquery.LoadJobConfig(schema=bqschema)

Does the code not work properly if we simply load the config?

job_config = bigquery.LoadJobConfig()

NOTE: as indicated by Tim, my understanding is that behind the scenes, when performing autodetection, BigQuery will attempt to define the correct schema, but there are limits with JSON. BigQuery only parses a few (hundred?) rows to make a reasonable estimate of what the schema should look like.
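For clarity, the suggestion above amounts to something like the following (a sketch only; the project/dataset/table name is a placeholder):

from google.cloud import bigquery

client = bigquery.Client()
destination = "your-project.your_dataset.your_table"  # placeholder

# No pre-fetched schema: leave job_config.schema unset so the backend's
# autodetection is not short-circuited.
job_config = bigquery.LoadJobConfig(
    autodetect=True,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    schema_update_options=[
        bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION,
        bigquery.SchemaUpdateOption.ALLOW_FIELD_RELAXATION,
    ],
)

job = client.load_table_from_json(
    [{"a": "asdf", "b": 4525, "c": "asdf"}], destination, job_config=job_config
)
job.result()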

@tpcgold (Author) commented May 12, 2023

Hi @chalmerlowe,
The reason for pulling the schema of the existing table first is that otherwise the BigQuery Python library ignores the types already defined on the existing BigQuery table, which leads to errors (e.g. "Cannot convert X to Y").
Hence we fixed it by pulling the schema first, to hand the library the right "partial" schema in advance.

So it's crucial to be able to provide a partial schema definition (also via load_table_from_json), as long as the detection ignores the types that already exist in BigQuery.

@HaraldNordgren commented May 27, 2023

Any updates on this? I'm trying to add a new BigQuery column from Apache Beam and this setting seems to have no effect.

I resorted to creating the columns manually in BQ.
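For the Beam case, schema update options are normally passed through WriteToBigQuery's additional_bq_parameters when the file-loads insert method is used; a rough sketch follows (the table name is a placeholder, and this reflects my understanding of the Beam API rather than anything confirmed in this issue):

import apache_beam as beam

with beam.Pipeline() as p:
    (
        p
        | beam.Create([{"a": "asdf", "b": 4525, "c": "asdf"}])
        | beam.io.WriteToBigQuery(
            table="your-project:your_dataset.your_table",  # placeholder
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            # Forwarded to the underlying load job configuration.
            additional_bq_parameters={"schemaUpdateOptions": ["ALLOW_FIELD_ADDITION"]},
            # (a schema= argument or schema autodetection may also be needed,
            # depending on the setup)
        )
    )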

@tpcgold (Author) commented Jun 13, 2023

I can also add that this, let's call it "bug", also occurs if we upload a table from a file:
(Note: all the additional checks are there to avoid BigQuery issues, as the library itself doesn't handle them.)

We just ran into the problem that our database dump didn't extend the table automatically.

from google.cloud import bigquery


def load_file_to_bigquery(project: str, dataset: str, table_id: str, path: str, columns: list):
    client = bigquery.Client()

    try:
        table_ref = bigquery.TableReference.from_string('.'.join([project, dataset, table_id]))
        table = bigquery.Table(table_ref)

        try:
            bqschema = client.get_table(table).schema
            bqschema = [field for field in bqschema if field.name in set(columns)]
            job_config = bigquery.LoadJobConfig(schema=bqschema)
        except Exception as e:
            job_config = bigquery.LoadJobConfig()

        job_config.allow_quoted_newlines = True
        # job_config.skip_leading_rows = 1
        job_config.autodetect = True
        # job_config.ignore_unknown_values = True
        job_config.schema_update_options = [
            bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION,
            bigquery.SchemaUpdateOption.ALLOW_FIELD_RELAXATION
        ]
        # job_config.max_bad_records = 5
        # job_config.field_delimiter = ';'

        job_config.time_partitioning = bigquery.TimePartitioning(
            type_=bigquery.TimePartitioningType.DAY,
            field="partitiondate"
        )

        job_config.create_disposition = 'CREATE_IF_NEEDED'
        job_config.write_disposition = 'WRITE_APPEND'

        with open(path, "rb") as file:
            job = client.load_table_from_file(
                file, table, job_config=job_config
            )
            # Wait for the load job to complete.
        print(job.result())
    except Exception as e:
        print('FAILED')
        print('Exception on loading data' + str(e))

@tpcgold (Author) commented Jan 4, 2024

It would be great if this or #1646 finally got some priority, as it impacts the usability of BigQuery.

@chalmerlowe (Contributor) commented:

@tpcgold Thank you for your input.
The BigQuery Client Library Team is actively looking at #1646 (which includes this issue) to identify an appropriate set of next steps.

There are a lot of changes, proposed changes, suggestions, and complexity in the area of loading tables, and we hope to identify what makes the most sense in terms of cleaning up the code and fixing important bugs, so that we can evolve this code effectively.

@Linchin (Contributor) commented Feb 2, 2024

For the load_table_from_json() method, the new column gets successfully added if I explicitly set job_config.schema = None. So I feel like it's working as intended: in the sample, when the table already exists, we pull the schema and use it as part of the input of the new load job. The backend is just using the schema we give it to parse the data, and ignoring the part that's not described in it.

However, with load_table_from_dataframe(), the backend adds column c even though the explicitly provided schema doesn't include it. It's likely that the schema mapped from the dataframe takes precedence over the schema provided manually. This is indeed a behavior discrepancy between these similar methods, but which is the best way to fix it is debatable.
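In code, that looks roughly like this (a sketch with placeholder names, not the exact test code):

from google.cloud import bigquery

client = bigquery.Client()
destination = "your-project.your_dataset.your_table"  # placeholder

job_config = bigquery.LoadJobConfig(
    autodetect=True,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION],
)
job_config.schema = None  # explicitly clear any schema; per the comment above, the new column is then added

job = client.load_table_from_json(
    [{"a": "asdf", "b": 4525, "c": "asdf"}], destination, job_config=job_config
)
job.result()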

@Linchin (Contributor) commented Feb 2, 2024

@tswast @chalmerlowe
I wonder what you think about this.

  1. Is it necessary to enforce that load_table_from_json() and load_table_from_dataframe() behave exactly the same in this regard?
  2. Should we request the backend to give user provided schema higher priority when loading dataframe?
  3. How can we improve the documentation to reduce confusion about it?

@tswast (Contributor) commented Feb 5, 2024

> Is it necessary to enforce that load_table_from_json() and load_table_from_dataframe() behave exactly the same about this

load_table_from_dataframe with Parquet never uses autodetect schema, as the parquet file has an embedded schema, so it would have to be a different fix. I'm not sure about CSV in this regard, though.

> Should we request the backend to give user provided schema higher priority when loading dataframe?

The backend defines what parquet data types we need to set for various BigQuery data types, documented here: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-parquet#parquet_conversions
I'm pretty sure the user-provided schema is ignored in the case of file types that contain an embedded schema, but I don't recall.

We have logic in our pandas conversion to convert BigQuery types to Arrow so they serialize correctly. This is why we pass in a BQ schema in the dataframe_to_parquet helper:

Like your recent fix to load_table_from_json(), we already make a call to get_table():

table = self.get_table(destination)

@Linchin Are you able to reproduce the bug with load_table_from_dataframe()? The logic looks correct to me, but maybe we need to reset the schema to None (or include the new fields) after the parquet serialization?

@tswast (Contributor) commented Feb 5, 2024

> However, with load_table_from_dataframe(), the backend adds column c despite the schema provided explicitly doesn't include it. It's likely that the schema mapped from the dataframe takes precedence over the schema manually provided. This is indeed a behavior discrepancy between these similar methods, but which is the best way to fix it is debatable.

That's what the user wants in this case, right?

@Linchin (Contributor) commented Feb 5, 2024

Thank you @tswast for explaining how bq_schema is used here, it's very nuanced.

For load_table_from_json(), the user is expecting: with new data including a new column, and autodetect = True, the backend will add the new column automatically. The user sample provided above is not adding the new column. I'm arguing that the backend did not do it in this case because the user also provided the table schema (as in the sample), which doesn't have the new column. So I think load_table_from_json() is working as intended. I also tested that if the schema is empty, the new column is added.

For load_table_from_dataframe(), the user is expecting something similar: with new data including a new column, and autodetect = True, the backend will add the new column automatically. This function is working like this, so it looks fine. But I see that in the sample above, the user is also providing the table schema without the new column - despite this, the new column is added anyway.

So I feel like the issue is with load_table_from_dataframe(), rather than load_table_from_json(). It is ignoring the user defined schema passed through job_config.schema and using the dataframe instead.

@tswast (Contributor) commented Feb 5, 2024

> But I see that in the sample above, the user is also providing the table schema without the new column - despite this, the new column is added anyway. So I feel like the issue is with load_table_from_dataframe(), rather than load_table_from_json(). It is ignoring the user defined schema passed through job_config.schema and using the dataframe instead.

This is expected for file formats with embedded schema.

> When you load Parquet files into BigQuery, the table schema is automatically retrieved from the self-describing source data.

See: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-parquet#parquet_schemas

@Linchin (Contributor) commented Feb 5, 2024

In that case, maybe we can add documentation somewhere to make it clearer that the user-provided schema in load_table_from_dataframe() only serves as a reference, and the definitive schema comes from the dataframe itself?

@tpcgold (Author) commented Feb 6, 2024

Hi, looping myself into the conversation. Yes, I would expect load_table_from_json and load_table_from_file (also affected) to behave exactly like load_table_from_dataframe.

From a usage perspective, they need to behave the same.

In addition, the "partial schema definition" described in the docs:
https://cloud.google.com/bigquery/docs/samples/bigquery-load-table-dataframe
is a critical feature and needs to work.
There are a lot of reasons for this, e.g. autodetection failing on the rows to upload.
Also, the library or the BigQuery backend (if the table exists) fails to get the correct schema from the existing table with autodetect enabled. This leads to issues uploading the rows, as the autodetected types are sometimes not compatible with the existing table, although they could be inserted with the correct schema. (As already described above: #1095 (comment))
So the partial schema is rather a workaround for another problem with the library.
In other words, from the mentioned "priority" perspective:
Table Schema (full or partial) > User Schema (full or partial) > Autodetect Schema (full or partial)
So the existing table schema would override the autodetected schema, and the user schema would override the autodetected schema for any given field. (This would avoid conflicts and allow extendability.)

In short: ALLOW_FIELD_ADDITION should do what its name says: add fields when enabled.
Other config options should not have an impact on this.
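A rough client-side sketch of the merge order proposed above (table schema over user schema, autodetect for anything covered by neither); merge_schemas is a hypothetical helper, not an existing library function, and whether the backend would then still add new fields is exactly what this issue is about:

from google.cloud import bigquery

def merge_schemas(table_schema, user_schema):
    # Hypothetical helper: fields from the existing table win over the
    # user-provided (partial) schema; anything in neither is left to autodetect.
    merged = {field.name: field for field in (user_schema or [])}
    merged.update({field.name: field for field in (table_schema or [])})
    return list(merged.values())

client = bigquery.Client()
destination = "your-project.your_dataset.your_table"  # placeholder
user_schema = [bigquery.SchemaField("b", "INTEGER")]  # partial, user-provided

try:
    table_schema = client.get_table(destination).schema
except Exception:
    table_schema = []

job_config = bigquery.LoadJobConfig(
    autodetect=True,  # for new fields not covered by the merged schema
    schema=merge_schemas(table_schema, user_schema),
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION],
)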

@Linchin (Contributor) commented Feb 7, 2024

Hi @tpcgold, thank you for the comments. I think it's reasonable that when users are adding a new column, the backend should be able to autodetect the type, and only for the new column. However, I think this would be a backend feature request, and outside the clients' scope. You can find how to file a feature request in the support page:
https://cloud.google.com/bigquery/docs/getting-support

As to this:

> In addition, the "partial schema definition" described in the docs:
> https://cloud.google.com/bigquery/docs/samples/bigquery-load-table-dataframe
> is a critical feature and needs to work.

Could you clarify what the request is here? I think this feature is working as intended right now.

@tpcgold (Author) commented Feb 8, 2024

Yeah, DataFrames as described above are working well,
but the feature to

# Specify a (partial) schema. All columns are always written to the
# table. The schema is used to assist in data type definitions.

is not working with load_table_from_json() and load_table_from_file(),
although it should work the same from a user's perspective.
So I do think that this combination (partial schema and ALLOW_FIELD_ADDITION) not working looks like a "client" issue, as it's handled differently in those functions.

Also, as further context: with this not working, BigQuery lacks the capability to automatically apply forward-compatible changes uploaded via the BigQuery Python client.
This has a major impact on the usability of BigQuery and is not just a "nice to have" request.
So the backend and client teams should find a solution; I still think it's rather a client issue (as it impacts everything except DataFrames).

@Linchin (Contributor) commented Feb 8, 2024

The sample

https://cloud.google.com/bigquery/docs/samples/bigquery-load-table-dataframe

is for load_table_from_dataframe(), which doesn't apply to load_table_from_json(). As the client doesn't do the schema detection locally, this is still a backend feature request. To be more specific, the client passes the config, the schema and file to the backend, and the backend decides how to handle it.

@tswast @chalmerlowe Do you have more context on why only the dataframe path supports a partial schema?

@tswast (Contributor) commented Feb 8, 2024

I believe DataFrame allowed this as a way to override the default types that we got from pandas -> arrow -> parquet. Specifically, I see that in BigQuery DataFrames we have problems differentiating between DATETIME and TIMESTAMP without this logic.

I'll have to think about what this would mean for loading JSON where we don't have any local type-detection logic at the moment other than the fetching of the table schema, which I believe we recently added.
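For context, this is the kind of override the DataFrame path was designed for: a sketch based on the partial-schema sample linked earlier (column and table names are placeholders):

import datetime

import pandas as pd
from google.cloud import bigquery

client = bigquery.Client()
destination = "your-project.your_dataset.your_table"  # placeholder

df = pd.DataFrame({"event_ts": [datetime.datetime(2024, 2, 8, 12, 0)]})

# Without a hint, a naive pandas datetime column could end up as either
# DATETIME or TIMESTAMP; the partial schema pins the intended BigQuery type.
job_config = bigquery.LoadJobConfig(
    schema=[bigquery.SchemaField("event_ts", "TIMESTAMP")],
)

client.load_table_from_dataframe(df, destination, job_config=job_config).result()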

@ytd-oc commented Mar 29, 2024

Here is another way to show how weird the function behaves nowadays:

import pprint

from google.cloud import bigquery

# BigQueryUtils: custom wrapper around the BigQuery client (not part of google-cloud-bigquery)
bq = BigQueryUtils(project_id)
bq.execute_query_string(
    project_id,
    f"""
    CREATE or replace TABLE {project_id}.{dataset_id}.{table_id} (
        id STRING,
        updated_at TIMESTAMP,
    )
    """,
)

data = [
    {
        "id": "1",
        "updated_at": "2021-01-01 00:00:00",
    },
    {
        "id": "2",
        "updated_at": "2021-01-02 00:00:00",
    },
    {
        "id": "3",
        "updated_at": "2021-01-03 00:00:00",
        "truc": "test",
    },
]

job_config = bigquery.LoadJobConfig(
    autodetect=True,
    schema=bq.get_table(project_id, dataset_id, table_id).schema,
    schema_update_options=bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION,
    write_disposition= "WRITE_APPEND",
)

job = bq.bq_client.load_table_from_json(data, f"{project_id}.{dataset_id}.{table_id}", job_config=job_config)


try:
    r = job.result()
except Exception as err:
    print(f"an error has occurred while inserting rows in to {table_id}: {err}")
pprint.pprint(job.errors)

This code creates a table with two columns
and tries to insert some rows, the third of which has an extra column.

  • autodetect = True + provided schema + ALLOW_FIELD_ADDITION → works, but the extra column is not added
  • autodetect = False + provided schema + ALLOW_FIELD_ADDITION → "row starting at position 98: No such field: truc."
  • autodetect = False + no schema + ALLOW_FIELD_ADDITION → "row starting at position 98: No such field: truc."
  • autodetect = True + no schema + ALLOW_FIELD_ADDITION → "Field 'id' has changed type from STRING to INTEGER"

This comment is to demonstrate that it's not just about adding columns: we must also provide partial schemas to avoid autodetection. If I may comment, autodetection is confusing: even though the value is quoted, it considers id to be an INTEGER; I would have expected that behavior only if id were given as id: 3.

Anyway, thanks for the further improvements.
