Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

load_table_from_dataframe column values loaded as null to BQ if column name has spaces in it #1566

Closed
osmanamjad opened this issue May 10, 2023 · 2 comments
Labels
api: bigquery Issues related to the googleapis/python-bigquery API. status: will not fix Invalid (untrue/unsound/erroneous), inconsistent with product, not on roadmap. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.

Comments

@osmanamjad
Copy link

Environment details

  • OS type and version: Linux
  • Python version: 3.10.10
  • pip version: 22.3.1
  • google-cloud-bigquery version: 3.10.0

Steps to reproduce

  1. Create a Pandas dataframe with a column name that has a space in it.
  2. Call load_table_from_dataframe on the pandas dataframe.
  3. Resulting BQ table does not have values for the column whose name had a space in it.

Code example

import pandas as pd
from google.cloud import bigquery

# Initialize a pandas DataFrame (Step 1)
data = {'Full Name': ['John', 'Alice', 'Bob'],
        'Age': [25, 30, 35],
        'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)

# Create a BigQuery client
client = bigquery.Client()

# Define the BigQuery dataset and table
dataset_id = 'your_dataset_id'
table_id = 'your_table_id'

# Load the DataFrame into BigQuery table (Step 2)
job_config = bigquery.LoadJobConfig()
job = client.load_table_from_dataframe(df, f"{dataset_id}.{table_id}", job_config=job_config)
job.result()  # Wait for the job to complete

The issue is specifically when the job config's source_format is not specified to CSV. I understand this means the source_format defaults to Parquet which does not support having spaces in column names. However, this is not obvious from the user's perspective so I am wondering if there should be some error if a column name has a space in it (or any other characters that are not supported).

@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery API. label May 10, 2023
@tswast tswast added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. priority: p2 Moderately-important priority. Fix may not be included in next release. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. and removed type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. priority: p2 Moderately-important priority. Fix may not be included in next release. labels May 10, 2023
@tswast
Copy link
Contributor

tswast commented May 10, 2023

Since spaces in column names are a new feature https://cloud.google.com/bigquery/docs/release-notes#March_22_2023 I'll mark this as a feature request.

Do we know if this is a limitation of the parquet format or the BigQuery parquet loader?

@chalmerlowe
Copy link
Contributor

The issue with spaces in column names appears to be a well-known issue with the parquet format itself.
The standard solution shown on the Internet is to create column names that do not have spaces, i.e. if given: Full Name one might try:

  • FullName
  • Full_Name
  • Fullname

The downside/risk of implementing an automagic column name reformatter is that it will mangle the user's column names:

  • invisibly, without user knowledge leading to difficult to debug troubleshooting
  • might lead to column name collisions (i.e. if the dataset already has two columns Product ID and ProductID, then how does the reformatter resolve this?)

I prefer to leave the user in charge of understanding their data and handling the names and naming of their data manually rather than seeking to invisibly manipulate it for them.

Closing as will not fix.

@chalmerlowe chalmerlowe added the status: will not fix Invalid (untrue/unsound/erroneous), inconsistent with product, not on roadmap. label Jul 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: bigquery Issues related to the googleapis/python-bigquery API. status: will not fix Invalid (untrue/unsound/erroneous), inconsistent with product, not on roadmap. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.
Projects
None yet
Development

No branches or pull requests

3 participants