Description
Environment details
- OS type and version: Ubuntu 20.04 (WSL)
- Python version: 3.11.8
- pip version: 24.1.2
- google-cloud-bigquery version: 3.25.0
Steps to reproduce
- Create a table that has a single field with mode=REQUIRED
- Attempt to append to the table using client.load_table_from_file with a Parquet file written from memory to a BytesIO buffer. The library writing to the buffer either does not expose an option for nullable/required fields (Polars), or the field is left as nullable=True, PyArrow's default.
Issue details
I am unable to use client.load_table_from_file to append to an existing table with a REQUIRED field without providing the table schema in the LoadJobConfig. The docs say that the schema does not need to be supplied if the table already exists.
This issue is similar to googleapis/google-cloud-python#8093, but relates to load_table_from_file rather than load_table_from_dataframe. It is also somewhat related to googleapis/google-cloud-python#8142 (as explicitly supplying the BigQuery table schema fixes the issue), but again this relates to load_table_from_file rather than load_table_from_dataframe.
As an aside, the fix should definitely not require PyArrow. The current Polars code works without PyArrow if the BigQuery table schema is provided.
I am filing this as a bug rather than a feature request because the docs for schema in JobConfigurationLoad say:

> The schema can be omitted if the destination table already exists

which does not hold up in the example below.
Code example
Apologies in advance that the example is a bit long. It demonstrates Parquet files, written to BytesIO buffers from both Polars and PyArrow, failing to load into a BigQuery table with mode=REQUIRED.
```python
from io import BytesIO

import pyarrow as pa
import pyarrow.parquet as pq
from google.cloud import bigquery
import polars as pl

PROJECT = "<project>"


def create_and_return_table(table_name: str, client: bigquery.Client) -> bigquery.Table:
    schema = [bigquery.SchemaField("foo", "INTEGER", mode="REQUIRED")]
    table = bigquery.Table(f"{PROJECT}.testing.{table_name}", schema=schema)
    client.delete_table(table, not_found_ok=True)
    return client.create_table(table)


def polars_way(table: bigquery.Table, client: bigquery.Client):
    df = pl.DataFrame({"foo": [1, 2, 3]})
    with BytesIO() as stream:
        df.write_parquet(stream)
        job_config = bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.PARQUET,
            # Default option, but make it explicit
            write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
            # The issue does not occur if the schema is explicitly provided.
            # schema=table.schema,
        )
        job = client.load_table_from_file(
            stream,
            destination=table,
            rewind=True,
            job_config=job_config,
        )
        job.result()


def pyarrow_way(table: bigquery.Table, client: bigquery.Client):
    # nullable=True is the default, but make it explicit.
    # The issue does not occur if nullable=False, but the docs imply that
    # shouldn't be required.
    pyarrow_schema = pa.schema([pa.field("foo", pa.int64(), nullable=True)])
    pyarrow_table = pa.Table.from_pydict({"foo": [1, 2, 3]}, schema=pyarrow_schema)
    with BytesIO() as stream:
        writer = pq.ParquetWriter(stream, pyarrow_schema)
        writer.write(pyarrow_table)
        writer.close()
        job_config = bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.PARQUET,
            # Default option, but make it explicit
            write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
            # The issue does not occur if the schema is explicitly provided.
            # schema=table.schema,
        )
        job = client.load_table_from_file(
            stream,
            destination=table,
            rewind=True,
            job_config=job_config,
        )
        job.result()


def main():
    client = bigquery.Client()
    table = create_and_return_table("test_pl", client)
    polars_way(table, client)
    table = create_and_return_table("test_pa", client)
    pyarrow_way(table, client)


if __name__ == "__main__":
    main()
```

Stack trace
Both polars_way and pyarrow_way raise the same error. Here they both are.
```
# polars_way
Traceback (most recent call last):
  File "/home/henry/development/polars_bq/combined.py", line 93, in <module>
    main()
  File "/home/henry/development/polars_bq/combined.py", line 80, in main
    polars_way(table, client)
  File "/home/henry/development/polars_bq/combined.py", line 41, in polars_way
    job.result()
  File "/home/henry/development/polars_bq/.venv/lib/python3.11/site-packages/google/cloud/bigquery/job/base.py", line 966, in result
    return super(_AsyncJob, self).result(timeout=timeout, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/henry/development/polars_bq/.venv/lib/python3.11/site-packages/google/api_core/future/polling.py", line 261, in result
    raise self._exception
google.api_core.exceptions.BadRequest: 400 Provided Schema does not match Table <project>:testing.test_pl. Field foo has changed mode from REQUIRED to NULLABLE; reason: invalid, message: Provided Schema does not match Table <project>:testing.test_pl. Field foo has changed mode from REQUIRED to NULLABLE

# pyarrow_way
Traceback (most recent call last):
  File "/home/henry/development/polars_bq/combined.py", line 86, in <module>
    main()
  File "/home/henry/development/polars_bq/combined.py", line 82, in main
    pyarrow_way(table, client)
  File "/home/henry/development/polars_bq/combined.py", line 74, in pyarrow_way
    job.result()
  File "/home/henry/development/polars_bq/.venv/lib/python3.11/site-packages/google/cloud/bigquery/job/base.py", line 966, in result
    return super(_AsyncJob, self).result(timeout=timeout, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/henry/development/polars_bq/.venv/lib/python3.11/site-packages/google/api_core/future/polling.py", line 261, in result
    raise self._exception
google.api_core.exceptions.BadRequest: 400 Provided Schema does not match Table <project>:testing.test_pa. Field foo has changed mode from REQUIRED to NULLABLE; reason: invalid, message: Provided Schema does not match Table <project>:testing.test_pa. Field foo has changed mode from REQUIRED to NULLABLE
```
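For anyone hitting this in the meantime, the workaround mentioned above can be summarised as a LoadJobConfig fragment (a sketch, not tested against every case; client and table are assumed to be the authenticated client and destination table from the example):

```python
# Workaround sketch: pass the existing table's schema explicitly so the
# REQUIRED mode survives the Parquet load instead of being inferred as
# NULLABLE from the file.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    # client.get_table() returns the live schema, including REQUIRED modes.
    schema=client.get_table(table).schema,
)
```

This costs one extra tables.get call per load, which is exactly what the JobConfigurationLoad docs suggest should be unnecessary.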