Cannot append to REQUIRED field when using client.load_table_from_file without providing table schema #2373

@henryharbeck

Environment details

  • OS type and version: Ubuntu 20.04 (WSL)
  • Python version: 3.11.8
  • pip version: 24.1.2
  • google-cloud-bigquery version: 3.25.0

Steps to reproduce

  1. Create a table that has a single field with mode=REQUIRED
  2. Attempt to append to the table using client.load_table_from_file with a Parquet file written from memory to a BytesIO buffer. The library writing to the buffer either has no option to mark fields as nullable/required (Polars), or leaves the field at the default nullable=True (PyArrow).

Issue details

I am unable to use client.load_table_from_file to append to an existing table with a REQUIRED field without providing the table schema in the LoadJobConfig. The docs say that the schema does not need to be supplied if the table already exists.

This issue is similar to googleapis/google-cloud-python#8093, but relates to load_table_from_file rather than load_table_from_dataframe. It is also somewhat related to googleapis/google-cloud-python#8142 (as explicitly supplying the BigQuery table schema fixes the issue), but again this relates to load_table_from_file rather than load_table_from_dataframe.

As an aside, the fix should definitely not require PyArrow. The current Polars code works without PyArrow if the BigQuery table schema is provided.

I am filing this as a bug rather than a feature request because the docs for schema in JobConfigurationLoad say:

The schema can be omitted if the destination table already exists

That does not hold in the example below.

Code example

Apologies in advance that the example is a bit long.

It demonstrates that Parquet files written to BytesIO buffers by both Polars and PyArrow cannot be appended to a BigQuery table whose field has mode=REQUIRED.

from io import BytesIO

import pyarrow as pa
import pyarrow.parquet as pq
from google.cloud import bigquery

import polars as pl

PROJECT = "<project>"


def create_and_return_table(table_name: str, client: bigquery.Client) -> bigquery.Table:
    schema = [bigquery.SchemaField("foo", "INTEGER", mode="REQUIRED")]
    table = bigquery.Table(f"{PROJECT}.testing.{table_name}", schema=schema)

    client.delete_table(table, not_found_ok=True)
    return client.create_table(table)


def polars_way(table: bigquery.Table, client: bigquery.Client):
    df = pl.DataFrame({"foo": [1, 2, 3]})

    with BytesIO() as stream:
        df.write_parquet(stream)

        job_config = bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.PARQUET,
            # Default option, but make it explicit
            write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
            # The issue does not occur if the schema is explicitly provided.
            # schema=table.schema,
        )

        job = client.load_table_from_file(
            stream,
            destination=table,
            rewind=True,
            job_config=job_config,
        )

    job.result()


def pyarrow_way(table: bigquery.Table, client: bigquery.Client):
    # nullable=True is the default, but make it explicit
    # This issue does not occur if nullable=False, but the docs imply that shouldn't be
    # required
    pyarrow_schema = pa.schema([pa.field("foo", pa.int64(), nullable=True)])
    pyarrow_table = pa.Table.from_pydict({"foo": [1, 2, 3]}, schema=pyarrow_schema)

    with BytesIO() as stream:
        writer = pq.ParquetWriter(stream, pyarrow_schema)
        writer.write(pyarrow_table)
        writer.close()

        job_config = bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.PARQUET,
            # Default option, but make it explicit
            write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
            # The issue does not occur if the schema is explicitly provided.
            # schema=table.schema,
        )
        job = client.load_table_from_file(
            stream,
            destination=table,
            rewind=True,
            job_config=job_config,
        )

    job.result()


def main():
    client = bigquery.Client()
    table = create_and_return_table("test_pl", client)
    polars_way(table, client)
    table = create_and_return_table("test_pa", client)
    pyarrow_way(table, client)


if __name__ == "__main__":
    main()

Stack trace

Both polars_way and pyarrow_way raise the same error. Both tracebacks are shown below.

# polars_way
Traceback (most recent call last):
  File "/home/henry/development/polars_bq/combined.py", line 93, in <module>
    main()
  File "/home/henry/development/polars_bq/combined.py", line 80, in main
    polars_way(table, client)
  File "/home/henry/development/polars_bq/combined.py", line 41, in polars_way
    job.result()
  File "/home/henry/development/polars_bq/.venv/lib/python3.11/site-packages/google/cloud/bigquery/job/base.py", line 966, in result
    return super(_AsyncJob, self).result(timeout=timeout, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/henry/development/polars_bq/.venv/lib/python3.11/site-packages/google/api_core/future/polling.py", line 261, in result
    raise self._exception
google.api_core.exceptions.BadRequest: 400 Provided Schema does not match Table <project>:testing.test_pl.
Field foo has changed mode from REQUIRED to NULLABLE; reason: invalid,
message: Provided Schema does not match Table <project>:testing.test_pl. Field foo has changed mode from REQUIRED to NULLABLE

# pyarrow_way
Traceback (most recent call last):
  File "/home/henry/development/polars_bq/combined.py", line 86, in <module>
    main()
  File "/home/henry/development/polars_bq/combined.py", line 82, in main
    pyarrow_way(table, client)
  File "/home/henry/development/polars_bq/combined.py", line 74, in pyarrow_way
    job.result()
  File "/home/henry/development/polars_bq/.venv/lib/python3.11/site-packages/google/cloud/bigquery/job/base.py", line 966, in result
    return super(_AsyncJob, self).result(timeout=timeout, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/henry/development/polars_bq/.venv/lib/python3.11/site-packages/google/api_core/future/polling.py", line 261, in result
    raise self._exception
google.api_core.exceptions.BadRequest: 400 Provided Schema does not match Table <project>:testing.test_pa.
Field foo has changed mode from REQUIRED to NULLABLE; reason: invalid,
message: Provided Schema does not match Table <project>:testing.test_pa. Field foo has changed mode from REQUIRED to NULLABLE

Labels

api: bigquery (Issues related to the googleapis/python-bigquery API.)
