
ALLOW_FIELD_ADDITION not working #1095

Open · tpcgold opened this issue Jan 4, 2022 · 25 comments

Labels: api: bigquery (Issues related to the googleapis/python-bigquery API.) · priority: p2 (Moderately-important priority. Fix may not be included in next release.) · type: bug (Error or flaw in code with unintended results or allowing sub-optimal usage patterns.)

Comments

@tpcgold commented Jan 4, 2022

For about a month now we have been unable to use ALLOW_FIELD_ADDITION in our Python scripts: it no longer has any effect with python-bigquery. The scripts worked as expected before, so probably something changed in the job insert API rather than in the library.

This issue really grinds my gears

  • OS type and version: Windows 11
  • Python version: 3.9
  • pip version: 21.3.1
  • google-cloud-bigquery version: 2.31.0

Steps to reproduce

  1. Create table
  2. Upload a dataframe or JSON (it doesn't matter which) that contains a new field (with autodetect enabled)
@product-auto-label bot added the "api: bigquery" label on Jan 4, 2022
@yoshi-automation added the "triage me" and "🚨 This issue needs some love." labels on Jan 5, 2022
@plamut added the "priority: p2" and "type: bug" labels and removed the "🚨 This issue needs some love." and "triage me" labels on Jan 11, 2022
@tswast (Contributor) commented Jan 11, 2022

I need some more information. Is it that this feature doesn't work in combination with autodetect?

I see we have two system tests / samples with passing tests where we set ALLOW_FIELD_ADDITION:

Neither of these uses autodetect, though.
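For reference, here is a minimal sketch of the pattern those tests exercise (not one of the linked tests themselves; table and field names are placeholders): ALLOW_FIELD_ADDITION combined with an explicit schema rather than autodetect.

from google.cloud import bigquery

client = bigquery.Client()
table_id = "your-project.your_dataset.your_table"  # placeholder

# List the full target schema explicitly, including the new column,
# instead of relying on autodetect.
job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("a", "STRING"),
        bigquery.SchemaField("b", "INTEGER"),
        bigquery.SchemaField("c", "STRING"),  # newly added column
    ],
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION],
)

job = client.load_table_from_json(
    [{"a": "asdf", "b": 4525, "c": "asdf"}], table_id, job_config=job_config
)
job.result()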

@steffnay (Contributor) commented:

I have been able to do this successfully using the existing documentation for both the CLI and the Node.js client library. I would also like to see more information about the failing Python code.

@tswast added the "priority: p3" and "type: question" labels and removed the "type: bug" and "priority: p2" labels on Jan 12, 2022
@tswast (Contributor) commented Jan 13, 2022

One scenario that could explain your issue is if the new field doesn't appear until a few hundred rows into the jsonl file. Since autodetect only looks at a limited number of rows, if the new column/field doesn't appear until the end I suspect autodetect doesn't include it in the updated schema.
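If that were the cause, one mitigation (purely a sketch on my part, not something the library does) would be to move a row that contains every field to the front of the data, so the sampled rows include the new column:

def front_load_widest_row(rows):
    # Move the row with the most fields to the front so that autodetect,
    # which samples only a limited number of rows, sees every field.
    widest = max(rows, key=len)
    return [widest] + [r for r in rows if r is not widest]

rows = [{"a": "x", "b": 1} for _ in range(500)] + [{"a": "y", "b": 2, "c": "new"}]
rows = front_load_widest_row(rows)  # the row containing "c" is now first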

@tpcgold (Author) commented Mar 4, 2022

@tswast No, it's not about the number of rows before the new field appears.
@steffnay Here you go, a minimal example of BigQuery failing to use the partial schema definition:
https://cloud.google.com/bigquery/docs/samples/bigquery-load-table-dataframe
https://cloud.google.com/bigquery/docs/pandas-gbq-migration

Although the above documentation is for DataFrames, the behavior for JSON data shouldn't be any different (EDIT: it actually is).


import os

from google.cloud import bigquery


def load_json_to_bigquery(project: str, dataset: str, table_id: str, json_rows):
    client = bigquery.Client()

    try:
        table_ref = bigquery.TableReference.from_string('.'.join([project, dataset, table_id]))
        table = bigquery.Table(table_ref)
        try:
            bqschema = client.get_table(table).schema
            job_config = bigquery.LoadJobConfig(schema=bqschema)
        except Exception as e:
            job_config = bigquery.LoadJobConfig()

        job_config.autodetect = True
        job_config.schema_update_options = [
            bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION,
            bigquery.SchemaUpdateOption.ALLOW_FIELD_RELAXATION
        ]
        # job_config.max_bad_records = 5
        # job_config.ignore_unknown_values = True
        job_config.time_partitioning = bigquery.TimePartitioning(
            type_=bigquery.TimePartitioningType.DAY,
        )

        job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
        job_config.create_disposition = 'CREATE_IF_NEEDED'
        job_config.write_disposition = 'WRITE_APPEND'
        job = client.load_table_from_json(json_rows=json_rows, destination=table, job_config=job_config)
        # Wait for the load job to complete.
        print(job.result())
    except Exception as e:
        print('FAILED')
        print('Exception on loading data' + str(e))
        print(job.errors)


os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = ''
project = ''
dataset = ''
table_id = ''

# this creates a table with a STRING and b INTEGER
json_rows = [{"a": "asdf", "b": 4525}]
load_json_to_bigquery(project=project, dataset=dataset, table_id=table_id, json_rows=json_rows)

# this adds a and b to the table but ignores c!!!
json_rows = [{"a": "asdf", "b": 4525, "c": "asdf"}]
load_json_to_bigquery(project=project, dataset=dataset, table_id=table_id, json_rows=json_rows)

EDIT: The behavior is actually different although it shouldn't from a usage perspective. Proof: (with DataFrames it works)

import os
import pandas as pd
from google.cloud import bigquery


def load_df_to_bigquery(project: str, dataset: str, table_id: str, df):
    client = bigquery.Client()

    try:
        table_ref = bigquery.TableReference.from_string('.'.join([project, dataset, table_id]))
        table = bigquery.Table(table_ref)
        try:
            bqschema = client.get_table(table).schema
            job_config = bigquery.LoadJobConfig(schema=bqschema)
        except Exception as e:
            job_config = bigquery.LoadJobConfig()

        job_config.autodetect = True
        job_config.schema_update_options = [
            bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION,
            bigquery.SchemaUpdateOption.ALLOW_FIELD_RELAXATION
        ]
        # job_config.max_bad_records = 5
        # job_config.ignore_unknown_values = True
        job_config.time_partitioning = bigquery.TimePartitioning(
            type_=bigquery.TimePartitioningType.DAY,
        )

        job_config.create_disposition = 'CREATE_IF_NEEDED'
        job_config.write_disposition = 'WRITE_APPEND'
        job = client.load_table_from_dataframe(dataframe=df, destination=table, job_config=job_config)
        # Wait for the load job to complete.
        print(job.result())
    except Exception as e:
        print('FAILED')
        print('Exception on loading data' + str(e))
        print(job.errors)


os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = ''
project = ''
dataset = ''
table_id = ''

# this creates a table with a STRING and b INTEGER
df = pd.DataFrame([{"a": "asdf", "b": 4525}])
load_df_to_bigquery(project=project, dataset=dataset, table_id=table_id, df=df)

# with DataFrames the new column c is actually added to the table (unlike the JSON case above)
df = pd.DataFrame([{"a": "asdf", "b": 4525, "c": "asdf"}])
load_df_to_bigquery(project=project, dataset=dataset, table_id=table_id, df=df)

@tswast (Contributor) commented Mar 4, 2022

Thanks for the reproducible test case! This appears to be a bug in load_table_from_json. I agree it should not ignore new columns.

@tswast added the "priority: p2" and "type: bug" labels and removed the "type: question" and "priority: p3" labels on Mar 4, 2022
@tswast self-assigned this on Mar 4, 2022
@tswast removed their assignment on Apr 14, 2022
@chalmerlowe (Contributor) commented:

Taking a look at this to see what options we have. Thanks for the robust troubleshooting. That really helps.

@chalmerlowe (Contributor) commented:

@tpcgold

Behind the scenes, the load_table_from_json() function and load_table_from_dataframe() function handle detection of schemas differently:

The *_from_dataframe() function basically extracts the existing schema, if one exists, and compares it to the columns in the Pandas DataFrame to determine whether any new columns have been added. NOTE: all of this is handled in the Python client OR related helper libraries.

The *_from_json() function checks to see if a job_config is provided AND is valid. If not, it will attempt to generate one (referred to in the code as new_job_config) and will try to populate it with a set of default values. Next it detects whether new_job_config includes a schema. If not, it sets new_job_config.autodetect to True. Note: setting *.autodetect to True happens in the helper libraries, but the actual schema detection process happens in BigQuery (i.e. that process is not under the purview of these Python libraries, more on this later).

With that in mind...

By providing a schema out of the gate, as shown in the snippet from your code, is it possible that we short-circuit the autodetection process and never allow BigQuery to do the expected processing?

try: 
    bqschema = client.get_table(table).schema
    job_config = bigquery.LoadJobConfig(schema=bqschema)

Does the code not work properly if we simply load the config?

job_config = bigquery.LoadJobConfig()

NOTE: as indicated by Tim, my understanding is that behind the scenes, when performing autodetection, BigQuery will attempt to define the correct schema, but there are limits with JSON. BigQuery only parses a few (hundred?) rows to make a reasonable estimate of what the schema should look like.
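For clarity, the suggestion above amounts to something like the following (a sketch only; the project/dataset/table name is a placeholder):

from google.cloud import bigquery

client = bigquery.Client()
destination = "your-project.your_dataset.your_table"  # placeholder

# No pre-fetched schema: leave job_config.schema unset so the backend's
# autodetection is not short-circuited.
job_config = bigquery.LoadJobConfig(
    autodetect=True,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    schema_update_options=[
        bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION,
        bigquery.SchemaUpdateOption.ALLOW_FIELD_RELAXATION,
    ],
)

job = client.load_table_from_json(
    [{"a": "asdf", "b": 4525, "c": "asdf"}], destination, job_config=job_config
)
job.result()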

@tpcgold (Author) commented May 12, 2023

Hi @chalmerlowe,
The reason for pulling the schema of the existing table first is that otherwise the BigQuery Python library ignores the types already defined on the existing BigQuery table, which leads to errors (e.g. "Cannot convert X to Y").
Hence we fixed it by pulling the schema first, to hand the library the right "partial" schema in advance.

So it's crucial to be able to provide a partial schema definition (also via load_table_from_json), as long as the detection ignores the types that already exist in BigQuery.

@HaraldNordgren commented May 27, 2023

Any updates on this? I'm trying to add a new BigQuery column from Apache Beam and this setting seems to have no effect.

I resorted to creating the columns manually in BQ.
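For the Beam case, schema update options are normally passed through WriteToBigQuery's additional_bq_parameters when the file-loads insert method is used; a rough sketch follows (the table name is a placeholder, and this reflects my understanding of the Beam API rather than anything confirmed in this issue):

import apache_beam as beam

with beam.Pipeline() as p:
    (
        p
        | beam.Create([{"a": "asdf", "b": 4525, "c": "asdf"}])
        | beam.io.WriteToBigQuery(
            table="your-project:your_dataset.your_table",  # placeholder
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            # Forwarded to the underlying load job configuration.
            additional_bq_parameters={"schemaUpdateOptions": ["ALLOW_FIELD_ADDITION"]},
            # (a schema= argument or schema autodetection may also be needed,
            # depending on the setup)
        )
    )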

@tpcgold (Author) commented Jun 13, 2023

I can also add that this, let's call it "bug", also occurs if we upload a table from a file:
(Note: all the additional checks are there to avoid BigQuery issues, as the library itself doesn't handle them.)

We just ran into the problem that our database dump didn't extend the table automatically.

from google.cloud import bigquery


def load_file_to_bigquery(project: str, dataset: str, table_id: str, path: str, columns: list):
    client = bigquery.Client()

    try:
        table_ref = bigquery.TableReference.from_string('.'.join([project, dataset, table_id]))
        table = bigquery.Table(table_ref)

        try:
            bqschema = client.get_table(table).schema
            bqschema = [field for field in bqschema if field.name in set(columns)]
            job_config = bigquery.LoadJobConfig(schema=bqschema)
        except Exception as e:
            job_config = bigquery.LoadJobConfig()

        job_config.allow_quoted_newlines = True
        # job_config.skip_leading_rows = 1
        job_config.autodetect = True
        # job_config.ignore_unknown_values = True
        job_config.schema_update_options = [
            bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION,
            bigquery.SchemaUpdateOption.ALLOW_FIELD_RELAXATION
        ]
        # job_config.max_bad_records = 5
        # job_config.field_delimiter = ';'

        job_config.time_partitioning = bigquery.TimePartitioning(
            type_=bigquery.TimePartitioningType.DAY,
            field="partitiondate"
        )

        job_config.create_disposition = 'CREATE_IF_NEEDED'
        job_config.write_disposition = 'WRITE_APPEND'

        with open(path, "rb") as file:
            job = client.load_table_from_file(
                file, table, job_config=job_config
            )
            # Wait for the load job to complete.
        print(job.result())
    except Exception as e:
        print('FAILED')
        print('Exception on loading data' + str(e))

@tpcgold (Author) commented Jan 4, 2024

It would be great if this or #1646 finally got some priority, as it impacts the usability of BigQuery.

@chalmerlowe (Contributor) commented:

@tpcgold Thank you for your input.
The BigQuery Client Library Team is actively looking at #1646 (which includes this issue) to identify an appropriate set of next steps.

There are a lot of changes, proposed changes, suggestions, and complexity in the area of loading tables, and we hope to identify what makes the most sense in terms of cleaning up the code and fixing important bugs, so that we can evolve this code effectively.

@Linchin (Contributor) commented Feb 2, 2024

For the load_table_from_json() method, the new column gets successfully added if I explicitly set job_config.schema = None. So I feel like it's working as intended: in the sample, when the table already exists, we pull the schema and use it as part of the input of the new load job. The backend is just using the schema we give it to parse the data, and ignoring the part that's not described in it.

However, with load_table_from_dataframe(), the backend adds column c even though the explicitly provided schema doesn't include it. It's likely that the schema mapped from the dataframe takes precedence over the schema provided manually. This is indeed a behavior discrepancy between these similar methods, but which is the best way to fix it is debatable.
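In code, that looks roughly like this (a sketch with placeholder names, not the exact test code):

from google.cloud import bigquery

client = bigquery.Client()
destination = "your-project.your_dataset.your_table"  # placeholder

job_config = bigquery.LoadJobConfig(
    autodetect=True,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION],
)
job_config.schema = None  # explicitly clear any schema; per the comment above, the new column is then added

job = client.load_table_from_json(
    [{"a": "asdf", "b": 4525, "c": "asdf"}], destination, job_config=job_config
)
job.result()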

@Linchin (Contributor) commented Feb 2, 2024

@tswast @chalmerlowe
I wonder what you think about this.

  1. Is it necessary to enforce that load_table_from_json() and load_table_from_dataframe() behave exactly the same in this regard?
  2. Should we request the backend to give user provided schema higher priority when loading dataframe?
  3. How can we improve the documentation to reduce confusion about it?

@tswast (Contributor) commented Feb 5, 2024

> Is it necessary to enforce that load_table_from_json() and load_table_from_dataframe() behave exactly the same about this

load_table_from_dataframe with Parquet never uses autodetect schema, as the parquet file has an embedded schema, so it would have to be a different fix. I'm not sure about CSV in this regard, though.

> Should we request the backend to give user provided schema higher priority when loading dataframe?

The backend defines what parquet data types we need to set for various BigQuery data types, documented here: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-parquet#parquet_conversions
I'm pretty sure the user-provided schema is ignored in the case of file types that contain an embedded schema, but I don't recall.

We have logic in our pandas conversion to convert BigQuery types to Arrow so they serialize correctly. This is why we pass in a BQ schema in the dataframe_to_parquet helper:

Like your recent fix to load_table_from_json(), we already make a call to get_table():

table = self.get_table(destination)

@Linchin Are you able to reproduce the bug with load_table_from_dataframe()? The logic looks correct to me, but maybe we need to reset the schema to None (or include the new fields) after the parquet serialization?

@tswast (Contributor) commented Feb 5, 2024

> However, with load_table_from_dataframe(), the backend adds column c despite the schema provided explicitly doesn't include it. It's likely that the schema mapped from the dataframe takes precedence over the schema manually provided. This is indeed a behavior discrepancy between these similar methods, but which is the best way to fix it is debatable.

That's what the user wants in this case, right?

@Linchin (Contributor) commented Feb 5, 2024

Thank you @tswast for explaining how bq_schema is used here, it's very nuanced.

For load_table_from_json(), the user is expecting: with new data including a new column, and autodetect = True, the backend will add the new column automatically. The user sample provided above is not adding the new column. I'm arguing that the backend did not do it in this case because the user also provided the table schema (as in the sample), which doesn't have the new column. So I think load_table_from_json() is working as intended. I also tested that if the schema is empty, the new column is added.

For load_table_from_dataframe(), the user is expecting something similar: with new data including a new column, and autodetect = True, the backend will add the new column automatically. This function is working like this, so it looks fine. But I see that in the sample above, the user is also providing the table schema without the new column - despite this, the new column is added anyway.

So I feel like the issue is with load_table_from_dataframe(), rather than load_table_from_json(). It is ignoring the user defined schema passed through job_config.schema and using the dataframe instead.

@tswast (Contributor) commented Feb 5, 2024

> But I see that in the sample above, the user is also providing the table schema without the new column - despite this, the new column is added anyway. So I feel like the issue is with load_table_from_dataframe(), rather than load_table_from_json(). It is ignoring the user defined schema passed through job_config.schema and using the dataframe instead.

This is expected for file formats with embedded schema.

> When you load Parquet files into BigQuery, the table schema is automatically retrieved from the self-describing source data.

See: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-parquet#parquet_schemas

@Linchin (Contributor) commented Feb 5, 2024

In that case, maybe we can add documentation somewhere to make it clearer that the user-provided schema in load_table_from_dataframe() only serves as a reference, and the definitive schema comes from the dataframe itself?

@tpcgold (Author) commented Feb 6, 2024

Hi, looping myself into the conversation. Yes, I would expect load_table_from_json and load_table_from_file (also affected) to behave exactly like load_table_from_dataframe.

From a usage perspective, they need to behave the same.

In addition, the "partial schema definition" described in the docs:
https://cloud.google.com/bigquery/docs/samples/bigquery-load-table-dataframe
is a critical feature and needs to work.
There are a lot of reasons for this, e.g. autodetection failing on the rows to upload.
Also, the library or the BigQuery backend (if the table exists) fails to get the correct schema from the existing table with autodetect enabled. This leads to issues uploading the rows, as the autodetected types are sometimes not compatible with the existing table, although they could be inserted with the correct schema. (As already described above: #1095 (comment))
So the partial schema is rather a workaround for another problem with the library.
In other words, from the mentioned "priority" perspective:
Table Schema (full or partial) > User Schema (full or partial) > Autodetect Schema (full or partial)
So the existing table schema would override the autodetected schema, and the user schema would override the autodetected schema for any given field. (This would avoid conflicts and allow extendability.)

In short: ALLOW_FIELD_ADDITION should do what its name says: add fields when enabled.
Other config options should not have an impact on this.
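A rough client-side sketch of the merge order proposed above (table schema over user schema, autodetect for anything covered by neither); merge_schemas is a hypothetical helper, not an existing library function, and whether the backend would then still add new fields is exactly what this issue is about:

from google.cloud import bigquery

def merge_schemas(table_schema, user_schema):
    # Hypothetical helper: fields from the existing table win over the
    # user-provided (partial) schema; anything in neither is left to autodetect.
    merged = {field.name: field for field in (user_schema or [])}
    merged.update({field.name: field for field in (table_schema or [])})
    return list(merged.values())

client = bigquery.Client()
destination = "your-project.your_dataset.your_table"  # placeholder
user_schema = [bigquery.SchemaField("b", "INTEGER")]  # partial, user-provided

try:
    table_schema = client.get_table(destination).schema
except Exception:
    table_schema = []

job_config = bigquery.LoadJobConfig(
    autodetect=True,  # for new fields not covered by the merged schema
    schema=merge_schemas(table_schema, user_schema),
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION],
)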

@Linchin (Contributor) commented Feb 7, 2024

Hi @tpcgold, thank you for the comments. I think it's reasonable that when users are adding a new column, the backend should be able to autodetect the type, and only for the new column. However, I think this would be a backend feature request, and outside the clients' scope. You can find how to file a feature request in the support page:
https://cloud.google.com/bigquery/docs/getting-support

As to this:

> In addition, the "partial schema definition" described in the docs:
> https://cloud.google.com/bigquery/docs/samples/bigquery-load-table-dataframe
> is a critical feature and needs to work.

Could you clarify what the request is here? I think this feature is working as intended right now.

@tpcgold (Author) commented Feb 8, 2024

Yeah, DataFrames as described above are working well,
but the feature to

# Specify a (partial) schema. All columns are always written to the
# table. The schema is used to assist in data type definitions.

is not working with load_table_from_json() and load_table_from_file(),
although it should work the same from a user's perspective.
So I do think that this combination (partial schema and ALLOW_FIELD_ADDITION) not working looks like a "client" issue, as it's handled differently in those functions.

Also, as further context: with this not working, BigQuery lacks the capability to automatically apply forward-compatible changes uploaded via the BigQuery Python client.
This has a major impact on the usability of BigQuery and is not just a "nice to have" request.
So the backend and client teams should find a solution; I still think it's rather a client issue (as it impacts everything except DataFrames).

@Linchin (Contributor) commented Feb 8, 2024

The sample

https://cloud.google.com/bigquery/docs/samples/bigquery-load-table-dataframe

is for load_table_from_dataframe(), which doesn't apply to load_table_from_json(). As the client doesn't do the schema detection locally, this is still a backend feature request. To be more specific, the client passes the config, the schema and file to the backend, and the backend decides how to handle it.

@tswast @chalmerlowe Do you have more context on why only the dataframe path supports a partial schema?

@tswast (Contributor) commented Feb 8, 2024

I believe DataFrame allowed this as a way to override the default types that we got from pandas -> arrow -> parquet. Specifically, I see that in BigQuery DataFrames we have problems differentiating between DATETIME and TIMESTAMP without this logic.

I'll have to think about what this would mean for loading JSON where we don't have any local type-detection logic at the moment other than the fetching of the table schema, which I believe we recently added.
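For context, this is the kind of override the DataFrame path was designed for: a sketch based on the partial-schema sample linked earlier (column and table names are placeholders):

import datetime

import pandas as pd
from google.cloud import bigquery

client = bigquery.Client()
destination = "your-project.your_dataset.your_table"  # placeholder

df = pd.DataFrame({"event_ts": [datetime.datetime(2024, 2, 8, 12, 0)]})

# Without a hint, a naive pandas datetime column could end up as either
# DATETIME or TIMESTAMP; the partial schema pins the intended BigQuery type.
job_config = bigquery.LoadJobConfig(
    schema=[bigquery.SchemaField("event_ts", "TIMESTAMP")],
)

client.load_table_from_dataframe(df, destination, job_config=job_config).result()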

@ytd-oc commented Mar 29, 2024

Here is another way to show how weird the function behaves nowadays:

import pprint

from google.cloud import bigquery

# BigQueryUtils: custom wrapper around the BigQuery client (not part of google-cloud-bigquery)
bq = BigQueryUtils(project_id)
bq.execute_query_string(
    project_id,
    f"""
    CREATE or replace TABLE {project_id}.{dataset_id}.{table_id} (
        id STRING,
        updated_at TIMESTAMP,
    )
    """,
)

data = [
    {
        "id": "1",
        "updated_at": "2021-01-01 00:00:00",
    },
    {
        "id": "2",
        "updated_at": "2021-01-02 00:00:00",
    },
    {
        "id": "3",
        "updated_at": "2021-01-03 00:00:00",
        "truc": "test",
    },
]

job_config = bigquery.LoadJobConfig(
    autodetect=True,
    schema=bq.get_table(project_id, dataset_id, table_id).schema,
    schema_update_options=bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION,
    write_disposition= "WRITE_APPEND",
)

job = bq.bq_client.load_table_from_json(data, f"{project_id}.{dataset_id}.{table_id}", job_config=job_config)


try:
    r = job.result()
except Exception as err:
    print(f"an error has occurred while inserting rows in to {table_id}: {err}")
pprint.pprint(job.errors)

This code creates a table with two columns
and tries to insert some rows, the third of which has an extra column.

  • autodetect = True + provided schema + ALLOW_FIELD_ADDITION → works, but the extra column is not added
  • autodetect = False + provided schema + ALLOW_FIELD_ADDITION → "row starting at position 98: No such field: truc."
  • autodetect = False + no schema + ALLOW_FIELD_ADDITION → "row starting at position 98: No such field: truc."
  • autodetect = True + no schema + ALLOW_FIELD_ADDITION → "Field 'id' has changed type from STRING to INTEGER"

This comment is to demonstrate that it's not just about adding columns: we must also provide partial schemas to avoid autodetection. If I may comment, autodetection is confusing: even though the value is quoted, it considers id to be an INTEGER; I would have expected that behavior only if id were given as id: 3.

Anyway, thanks for the further improvements.
