
to_gbq should not modify table_schema inplace #277

@bsolomon1124


Working on a PR...


Problem Description

pandas-gbq version: 0.10.0

GBQ table:

[screenshot of the existing GBQ table]

Reproducible example:

from copy import deepcopy
import datetime
import pandas_gbq
import pandas as pd
from google.oauth2.service_account import Credentials

df = pd.DataFrame(
    {
        "field1": ["a", "b"],
        "field2": [1, 2],
        "field3": [datetime.date(2019, 1, 1), datetime.date(2019, 5, 1)],
    }
)

original_schema = [
    {
        "name": "field1",
        "type": "STRING",
        "mode": "REQUIRED",
    },
    {
        "name": "field2",
        "type": "INTEGER",
    },
    {
        "name": "field3",
        "type": "DATE",
    },
]
original_schema_cp = deepcopy(original_schema)

pandas_gbq.to_gbq(
    dataframe=df,
    destination_table="XXXXX.schematest",
    project_id="XXXXX",
    credentials=Credentials.from_service_account_file("XXXXX.json"),
    if_exists="append",
    table_schema=original_schema,
)

Results:

>>> original_schema
[{'name': 'field1', 'type': 'STRING', 'mode': 'REQUIRED'},
 {'name': 'field2', 'type': 'INTEGER', 'mode': 'NULLABLE'},
 {'name': 'field3', 'type': 'DATE', 'mode': 'NULLABLE'}]
>>> original_schema_cp == original_schema
False

Ahhhhh! Nowhere is it noted in https://pandas-gbq.readthedocs.io/en/latest/api.html that, in some situations, the object passed to table_schema may be modified in place. Most libraries go out of their way to avoid modifying mutable arguments; pandas-gbq should do the same.
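
In the meantime, a caller-side workaround (just a sketch, reusing the same placeholders as the example above) is to hand to_gbq a throwaway deep copy, so any in-place edits never reach the original list:

from copy import deepcopy

# Workaround sketch: to_gbq can only mutate the copy, not original_schema.
pandas_gbq.to_gbq(
    dataframe=df,
    destination_table="XXXXX.schematest",
    project_id="XXXXX",
    credentials=Credentials.from_service_account_file("XXXXX.json"),
    if_exists="append",
    table_schema=deepcopy(original_schema),
)
# original_schema is left untouched.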

Debugging

A pdb session shows that table_schema is modified in connector.load_data(), which in turn calls load_chunks(). The specific place in load_chunks() where the inadvertent modification happens is here:

# https://github.com/bsolomon1124/pandas-gbq/blob/59228d9c20cee12b24caa5cc41d3f2e6c0337932/pandas_gbq/load.py#L72
    for field in schema["fields"]:
        if "mode" not in field:
            field["mode"] = "NULLABLE"

This is because, I think, in table_schema = schema.update_schema(default_schema, dict(fields=table_schema)), the new table_schema still holds references to the same field dicts as the original table_schema argument, so the field["mode"] = "NULLABLE" assignment mutates the caller's list.
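
To illustrate the aliasing without touching BigQuery at all, and to sketch one way the library could avoid it (the _with_default_modes helper below is hypothetical, not existing pandas-gbq code):

import copy

# dict(fields=table_schema) builds a new outer dict, but the field dicts
# inside are still the caller's objects, so mutating them leaks back out.
original = [{"name": "field2", "type": "INTEGER"}]
schema = dict(fields=original)
for field in schema["fields"]:
    if "mode" not in field:
        field["mode"] = "NULLABLE"
print(original)  # [{'name': 'field2', 'type': 'INTEGER', 'mode': 'NULLABLE'}]


# Possible fix sketch: deep-copy the schema before filling in defaults,
# so the caller's field dicts are never modified.
def _with_default_modes(schema):
    schema = copy.deepcopy(schema)
    for field in schema["fields"]:
        if "mode" not in field:
            field["mode"] = "NULLABLE"
    return schema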
