
BigQuery: Upload STRUCT / RECORD fields from load_table_from_dataframe #21

Closed
tswast opened this issue May 30, 2019 · 8 comments · Fixed by #146
Labels
- api: bigquery — Issues related to the googleapis/python-bigquery API.
- status: blocked — Resolving the issue is dependent on other work.
- type: feature request — ‘Nice-to-have’ improvement, new feature or different behavior or design.

Comments

@tswast
Contributor

tswast commented May 30, 2019

Is your feature request related to a problem? Please describe.

If you have a pandas Series containing dictionaries, ideally this could be uploaded to BigQuery as a STRUCT / RECORD column. Currently this fails with a "file does not exist" error, because the Arrow `write_table` call fails with "ArrowInvalid: Nested column branch had multiple children".

Describe the solution you'd like

Upload of a RECORD column succeeds. This will require a fix to https://jira.apache.org/jira/browse/ARROW-2587.

Describe alternatives you've considered

Change the intermediate file format to JSON or some other type. This isn't ideal, since most other formats are row-oriented, but pandas DataFrames are column-oriented.
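As a sketch of that JSON alternative (hypothetical, not what the library does; it uses pandas' row-oriented newline-delimited JSON output, which matches BigQuery's NEWLINE_DELIMITED_JSON source format and serializes dict cells as nested objects):

```python
import json

import pandas as pd

df = pd.DataFrame({"bar": [{"aaa": 1, "bbb": 2}, {"aaa": 3, "bbb": 4}]})

# orient="records", lines=True emits one JSON object per row
# (newline-delimited JSON), with dict cells as nested objects.
ndjson = df.to_json(orient="records", lines=True)
print(ndjson)
```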

@plamut plamut transferred this issue from googleapis/google-cloud-python Feb 4, 2020
@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery API. label Feb 4, 2020
@plamut plamut added the type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. label Feb 4, 2020
@meredithslota
Contributor

https://jira.apache.org/jira/browse/ARROW-2587 is still open. I'm not sure what we are able to do until that is fixed.

@meredithslota meredithslota added the status: blocked Resolving the issue is dependent on other work. label Mar 27, 2020
@emkornfield

Writing nested structs will be fixed in Arrow 0.17.0 release (sometime in the next few weeks).

wesm pushed a commit to apache/arrow that referenced this issue Mar 29, 2020
 - Plumbs through engine version
 - Makes engine version settable via environment variable
 - Adds unit tests coverage

Should also unblock: googleapis/python-bigquery#21

CC @wesm

Closes #6751 from emkornfield/add_flag_to_python

Authored-by: Micah Kornfield <emkornfield@gmail.com>
Signed-off-by: Wes McKinney <wesm+git@apache.org>
@plamut
Contributor

plamut commented Mar 31, 2020

I can confirm that the error from the issue description is not reproducible anymore with the latest pyarrow (development version, compiled from source).

I was able to successfully load the following data (into a new table, that is):

```python
import pandas as pd
from google.cloud import bigquery

client = bigquery.Client()

schema = [
    bigquery.SchemaField(
        "bar",
        "STRUCT",
        fields=[
            bigquery.SchemaField("aaa", "INTEGER", mode="REQUIRED"),
            bigquery.SchemaField("bbb", "INTEGER", mode="REQUIRED"),
        ],
        mode="REQUIRED",
    ),
]

dict_series = [
    {"aaa": 1, "bbb": 2}, {"aaa": 3, "bbb": 4}, {"aaa": 5, "bbb": 6}
]
df = pd.DataFrame(data={"bar": dict_series}, columns=["bar"])

job_config = bigquery.LoadJobConfig(schema=schema)
client.load_table_from_dataframe(
    df, "my.table.reference", job_config=job_config
).result()
```

This resulted in the following table and schema on the backend:

| Row | bar.aaa | bar.bbb |
|-----|---------|---------|
| 1   | 1       | 2       |
| 2   | 3       | 4       |
| 3   | 5       | 6       |

Schema:

| Field name | Type    | Mode     |
|------------|---------|----------|
| bar        | RECORD  | REQUIRED |
| bar.aaa    | INTEGER | REQUIRED |
| bar.bbb    | INTEGER | REQUIRED |

@MainHanzo

Hello, I am very interested in this feature and I would love to compile the latest pyarrow from source.
Could you please give me some guidance on how to compile it locally? I haven't found any instructions on how to build this project.

@plamut
Contributor

plamut commented Apr 10, 2020

@MainHanzo Fortunately, compiling from source is not needed, as the pyarrow maintainers made the nightly pre-release builds available (comment).

If you still want to compile the project on your own, check the pyarrow docs (I didn't manage to compile it for Python 3.7 on my machine, though, but succeeded with Python 3.6).

@jack-tee

Should this be fixed now? I'm using pyarrow 0.17.1 and google-cloud-bigquery 1.25.0.
If I pass a dataframe to .load_table_from_dataframe() and the table does not exist in BigQuery, the struct field loads correctly.
If the table already exists, I get this error and a link to this page: Caused by: Uploading dataframes with struct (record) column types is not supported.

@plamut
Contributor

plamut commented Jul 11, 2020

@jack-tee AFAIK it should be fixed in pyarrow, yes, although only for Python 3.

The error you are seeing is raised by the client library itself. There is a PR that will remove this error-raising part, but it's not merged yet (it's actually put on hold, because it only works in Python 3 and we are dropping Python 2 support in the near future anyway).

How urgently do you need the fix?

If feasible, you can temporarily comment out the linked code block by hand until the fix is actually released.

@jack-tee

Thanks for clarifying @plamut. When I saw that I could load struct fields into a table that didn't exist but not into an existing table I thought perhaps that path had been missed, but your explanation makes sense.

I can work around it for now. Thanks :)
