
df.to_gbq(): InvalidSchema when using 'type': 'INT64' and if_exists = 'append' #322

Closed
Azerane opened this issue Jul 29, 2020 · 5 comments · Fixed by #340
Labels
priority: p2 (Moderately-important priority. Fix may not be included in next release.)
type: bug (Error or flaw in code with unintended results or allowing sub-optimal usage patterns.)

Comments


Azerane commented Jul 29, 2020

Hello there,

I'm facing an issue with the function df.to_gbq().

From my understanding, when if_exists='append' is used and the destination_table already exists, the original schema is retrieved from destination_table.

The retrieved schema contains every field, but the type comes back as INTEGER rather than INT64. Since the code simply checks that every field in schema_local also appears in schema_remote, this leads to a mismatch, and schema_is_subset returns False instead of True.

I'm not sure where we should fix this error, but it's quite annoying when working with integer values.

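For illustration, here is a minimal stand-in for the subset check (simplified from the pandas_gbq source, so treat it as a sketch) showing why the comparison fails:

    # Fields are compared as literal dicts, so the standard SQL alias
    # INT64 never equals the legacy name INTEGER that the BigQuery API
    # returns for an existing table.
    def schema_is_subset(schema_remote, schema_local):
        fields_remote = schema_remote["fields"]
        fields_local = schema_local["fields"]
        return all(field in fields_remote for field in fields_local)

    remote = {"fields": [{"name": "int_col", "type": "INTEGER"}]}
    local = {"fields": [{"name": "int_col", "type": "INT64"}]}
    print(schema_is_subset(remote, local))  # False, though the types are equivalent
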
@tswast added the "type: bug" label (Error or flaw in code with unintended results or allowing sub-optimal usage patterns.) on Jul 29, 2020
@norbertsteele

Is there a workaround that anyone has found?

@gilangardya

Hi @Azerane,

Maybe you get that error because you are using object instead of int64 to represent some integer values.
If you need the object representation because you store big integers (e.g. 2**65), you should change the data type, since BigQuery's INTEGER range is -2^63 .. 2^63-1.
But if you don't need that, you can cast the Series to int64 before loading into the BigQuery table, or you can specify the table_schema in the to_gbq parameters.
If that's not your case, please share more details; that would be great.

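For example, a minimal sketch of both suggestions (the DataFrame contents and destination table below are placeholders):

    import pandas as pd
    import pandas_gbq

    df = pd.DataFrame({"int_col": ["1", "2", "3"]})  # object dtype

    # Suggestion 1: cast the column to int64 before uploading.
    df["int_col"] = df["int_col"].astype("int64")

    # Suggestion 2: pass an explicit table_schema to to_gbq.
    pandas_gbq.to_gbq(
        df,
        "my_dataset.my_table",  # placeholder destination
        table_schema=[{"name": "int_col", "type": "INTEGER"}],
        if_exists="append",
    )
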
Maybe it can help you too, @norbertsteele.

@tswast self-assigned this on Nov 9, 2020

tswast commented Nov 9, 2020

I'm able to reproduce with the following code example:

import pandas
import pandas_gbq

df1 = pandas.DataFrame({"int_col": [1, 2, 3]})
pandas_gbq.to_gbq(
    df1,
    "my_dataset.int64table",
    table_schema=[{"name": "int_col", "type": "INT64"}],
)
df2 = pandas.DataFrame({"int_col": [4, 5, 6]})

# This fails, despite using the same schema as was used for the initial
# upload.
pandas_gbq.to_gbq(
    df2,
    "my_dataset.int64table",
    table_schema=[{"name": "int_col", "type": "INT64"}],
    if_exists="append",
)

@tswast added the "priority: p2" label (Moderately-important priority. Fix may not be included in next release.) on Nov 9, 2020

tswast commented Nov 9, 2020

Is there a workaround that anyone has found?

The workaround is to use legacy SQL type names (INTEGER, not INT64), but this should be fixed once #340 is merged and released (target: 0.14.1).

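Applied to the reproduction above, a sketch of the workaround looks like this:

    import pandas
    import pandas_gbq

    df2 = pandas.DataFrame({"int_col": [4, 5, 6]})

    # The legacy name INTEGER matches what the BigQuery API reports for
    # the existing table, so the schema subset check passes.
    pandas_gbq.to_gbq(
        df2,
        "my_dataset.int64table",
        table_schema=[{"name": "int_col", "type": "INTEGER"}],
        if_exists="append",
    )
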

SDonatas commented Dec 18, 2022

I had this error for a few days. The problem is misleading error handling: in my case, the pandas DataFrame did not have its datetime columns converted as specified in the supplied schema. The integer error, at least for me, was a false flag.

The fragment of code below solved it for me:

    import json
    import pandas as pd

    if schema is None:
        # Default every column to STRING when no schema was supplied.
        schema = [{'name': col, 'type': 'STRING'} for col in df.columns]
        writeSchema(schema)  # writeSchema is our own helper, not pandas_gbq

    def clean_types(x):
        # Coerce each row's values to the types named in the schema.
        for schema_item in schema:
            name_par = schema_item['name']
            type_par = schema_item['type']

            if type_par == 'STRING':
                if type(x[name_par]) in (list, dict):
                    x[name_par] = json.dumps(x[name_par])
                else:
                    x[name_par] = str(x[name_par])
            elif type_par == 'INTEGER':
                x[name_par] = int(x[name_par])
            elif type_par in ('TIMESTAMP', 'DATETIME'):
                x[name_par] = pd.to_datetime(x[name_par])

        return x

    df = df.apply(clean_types, axis=1)

And the schema JSON:

[ { "name": "object", "type": "STRING" }, { "name": "id", "type": "STRING" }, { "name": "subject", "type": "STRING" }, { "name": "from_email", "type": "STRING" }, { "name": "from_name", "type": "STRING" }, { "name": "sent_at", "type": "DATETIME" }, { "name": "send_time", "type": "DATETIME" }, { "name": "name", "type": "STRING" }, { "name": "lists", "type": "STRING" }, { "name": "excluded_lists", "type": "STRING" }, { "name": "status", "type": "STRING" }, { "name": "status_id", "type": "INTEGER" }, { "name": "status_label", "type": "STRING" }, { "name": "created", "type": "DATETIME" }, { "name": "updated", "type": "DATETIME" }, { "name": "num_recipients", "type": "INTEGER" }, { "name": "campaign_type", "type": "STRING" }, { "name": "is_segmented", "type": "BOOLEAN" }, { "name": "message_type", "type": "STRING" }, { "name": "template_id", "type": "STRING" } ]

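Presumably the schema is then loaded from a file and handed to to_gbq along these lines (a sketch; the file path and table name are placeholders):

    import json
    import pandas_gbq

    with open("schema.json") as f:  # placeholder path
        schema = json.load(f)

    df = df.apply(clean_types, axis=1)
    pandas_gbq.to_gbq(df, "my_dataset.my_table", table_schema=schema, if_exists="append")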