Description
Working on a PR...
Problem Description
pandas-gbq version: 0.10.0
GBQ table:
Reproducible example:
from copy import deepcopy
import datetime
import pandas_gbq
import pandas as pd
from google.oauth2.service_account import Credentials
df = pd.DataFrame(
    {
        "field1": ["a", "b"],
        "field2": [1, 2],
        "field3": [datetime.date(2019, 1, 1), datetime.date(2019, 5, 1)],
    }
)
original_schema = [
    {
        "name": "field1",
        "type": "STRING",
        "mode": "REQUIRED",
    },
    {
        "name": "field2",
        "type": "INTEGER",
    },
    {
        "name": "field3",
        "type": "DATE",
    },
]
original_schema_cp = deepcopy(original_schema)
pandas_gbq.to_gbq(
    dataframe=df,
    destination_table="XXXXX.schematest",
    project_id="XXXXX",
    credentials=Credentials.from_service_account_file("XXXXX.json"),
    if_exists="append",
    table_schema=original_schema,
)
Results:
>>> original_schema
[{'name': 'field1', 'type': 'STRING', 'mode': 'REQUIRED'},
{'name': 'field2', 'type': 'INTEGER', 'mode': 'NULLABLE'},
{'name': 'field3', 'type': 'DATE', 'mode': 'NULLABLE'}]
>>> original_schema_cp == original_schema
False

Ahhhhh! Nowhere is it noted in https://pandas-gbq.readthedocs.io/en/latest/api.html that, in some situations, the object passed to table_schema may be modified in place. Most libraries go out of their way to avoid modifying mutable arguments; pandas-gbq should do the same.
Debugging
A pdb session shows that table_schema is modified in connector.load_data(), which in turn calls load_chunks(). The specific place in load_chunks() where the inadvertent modification happens is here:
# https://github.com/bsolomon1124/pandas-gbq/blob/59228d9c20cee12b24caa5cc41d3f2e6c0337932/pandas_gbq/load.py#L72
for field in schema["fields"]:
    if "mode" not in field:
        field["mode"] = "NULLABLE"

I think this is because, after table_schema = schema.update_schema(default_schema, dict(fields=table_schema)), the new table_schema still holds references to the dict elements of the original table_schema argument, so the loop above writes "mode" into the caller's dicts.
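To illustrate the suspected mechanism without touching BigQuery, here is a minimal sketch. Note that merge_fields_shallow is a hypothetical stand-in for the merge step, not the actual implementation of schema.update_schema; it only assumes that the merged list reuses the caller's field dicts. fill_default_mode_copy is one possible non-mutating replacement for the loop, offered as a suggestion rather than the project's fix.

```python
from copy import deepcopy


def merge_fields_shallow(default_fields, user_fields):
    """Hypothetical stand-in for the schema merge (an assumption, not the
    real update_schema): prefer the user's field dicts and reuse the very
    same dict objects in the merged result."""
    by_name = {field["name"]: field for field in user_fields}
    return [by_name.get(field["name"], field) for field in default_fields]


def fill_default_mode_in_place(fields):
    """Mirrors the loop in load_chunks(): mutates each field dict."""
    for field in fields:
        if "mode" not in field:
            field["mode"] = "NULLABLE"
    return fields


def fill_default_mode_copy(fields):
    """Non-mutating alternative: build new dicts, leave the inputs alone."""
    return [{**field, "mode": field.get("mode", "NULLABLE")} for field in fields]


user_schema = [
    {"name": "field1", "type": "STRING", "mode": "REQUIRED"},
    {"name": "field2", "type": "INTEGER"},
]
default_schema = [
    {"name": "field1", "type": "STRING"},
    {"name": "field2", "type": "INTEGER"},
]
snapshot = deepcopy(user_schema)

# Copying variant: the caller's schema is left untouched.
safe = fill_default_mode_copy(merge_fields_shallow(default_schema, user_schema))
unchanged_after_copy = user_schema == snapshot

# In-place variant: the caller's dicts gain a "mode" key.
fill_default_mode_in_place(merge_fields_shallow(default_schema, user_schema))
changed_after_in_place = user_schema != snapshot
```

Until the library stops mutating its argument, a caller-side workaround is to pass table_schema=deepcopy(original_schema) to to_gbq, so that any in-place changes land on the copy.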
