Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Empty strings inconsistently converted to NULL's when using df.to_gbq() #366

Closed
jerome-asselin-ivadolabs opened this issue Jun 21, 2021 · 5 comments · Fixed by #413
Closed
Assignees
Labels
api: bigquery Issues related to the googleapis/python-bigquery-pandas API. priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. 🚨 This issue needs some love. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@jerome-asselin-ivadolabs

Here is a reproducible example of inconsistent conversion of empty strings to NULL when exporting to Bigquery.

import pandas as pd
from google.cloud import bigquery

df1 = pd.DataFrame.from_dict([{'a': ""}])
df2 = pd.DataFrame.from_dict([{'a': "", 'b': ""}])

df1.to_gbq('aa_temp.df1') # Keeps empty string, as expected.
df2.to_gbq('aa_temp.df2') # Converts empty strings to null values.

client = bigquery.Client()
print(client.list_rows('aa_temp.df1').to_dataframe())
print(client.list_rows('aa_temp.df2').to_dataframe())

Actual Output:

  a
0  
      a     b
0  None  None

Expected Output:

  a
0  
      a     b
0    

Using python 3.7, pandas 1.2.4, pandas-gbq 0.15.0.

@tswast tswast added the type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. label Jul 1, 2021
@tswast tswast added api: bigquery Issues related to the googleapis/python-bigquery-pandas API. priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. labels Jul 16, 2021
@yoshi-automation yoshi-automation added the 🚨 This issue needs some love. label Jul 20, 2021
@tswast tswast self-assigned this Aug 9, 2021
@tseaver
Copy link
Contributor

tseaver commented Oct 5, 2021

Running through this in the debugger, I found that BigQuery correctly infers schema for both dataframes (all fields are STRING w/ NULLABLE). The difference between the two dataframes seems to be in how dataframe.to_csv is generating the file used to upload the data:

>>> import pandas as pd
>>> df1 = pd.DataFrame.from_dict([{'a': ""}])
>>> df1.to_csv(None, index=False, header=False)
'""\n'
>>> df2 = pd.DataFrame.from_dict([{'a': "", 'b': ""}])
>>> df2.to_csv(None, index=False, header=False)
',\n'

When loading the file dumped from df1 preserves the empty string, because that file has the "". When loading the file dumped from df2, which does not have the "", the columns are loaded (properly) as null values.

@tswast
Copy link
Collaborator

tswast commented Oct 7, 2021

I wonder if there's a flag we can pass to to_csv to force it to create a null value indicator? I think we'd also have to set https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.job.LoadJobConfig.html#google.cloud.bigquery.job.LoadJobConfig.null_marker in that case.

@tswast
Copy link
Collaborator

tswast commented Oct 7, 2021

Per https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#JobConfigurationLoad.FIELDS.null_marker

The default value is the empty string

So that explains the current behavior.

@tswast
Copy link
Collaborator

tswast commented Oct 7, 2021

The other option we can do is to start sending Parquet data instead of CSV, as that should correctly disambiguate between null and empty string in all cases.

@tswast
Copy link
Collaborator

tswast commented Oct 27, 2021

Confirmed with a test case in #413 that using Parquet as the serialization format addresses this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: bigquery Issues related to the googleapis/python-bigquery-pandas API. priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. 🚨 This issue needs some love. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.
Projects
None yet
4 participants