fix: support ARRAY data type when loading from DataFrame with Parquet #980
Conversation
Force-pushed from 2b1b5c9 to 3f533c0
Force-pushed from 3f533c0 to 62fd565
I've made this default to `True`. Users who need the old behaviour can set the argument to `False`.
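For illustration, usage under this revision of the PR might look like the sketch below (the argument was later removed from the PR; the table ID is a placeholder):

```python
import pandas
from google.cloud import bigquery

client = bigquery.Client()  # assumes default credentials and project
df = pandas.DataFrame({"tags": [["a", "b"], ["c"]]})

# Opt back into the legacy (non-compliant) Parquet nesting.
client.load_table_from_dataframe(
    df,
    "my_dataset.my_table",  # placeholder table ID
    parquet_use_compliant_nested_type=False,
).result()
```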
Force-pushed from 4c77f16 to f44a6ae
I suppose another question is whether or not we should set `enable_list_inference` by default as well.
These suggestions might be acceptable if released in `3.0`. Would it be acceptable to default `use_compliant_nested_type` to `True` and bump the minimum `pyarrow` version to `4.0`?
@judahrand First of all, thanks for the research and the fix!
BigQuery version 3.0 is definitely an opportunity to make such breaking changes. But if it turns out that the change fixes the issue without negative consequences, then it can probably be accepted even in one of the upcoming I'm just not sure about bumping As for the fix itself, can you also add a corresponding test which proves that the issue has indeed been fixed? Since it has to do with parquet decoding on the server side, a system test would be needed, but that requires access to an active Google Cloud Project. If you do not have time/resources to write such test, that's OK, we can add one. It's probably just the matter of adjusting one of the code snippets posted in #19. |
I would prefer not to bump the pyarrow minimum to 4.0, especially since it was only released this year. I learned recently that there are some Apache Beam / Dataflow systems that are stuck on pyarrow 2.0.
I agree, but I can't really see another way to address this (and it is quite annoying for some use cases). For example, [Feast](https://github.com/feast-dev/feast) is currently having to implement an ugly hack around this. It's a shame that PyArrow just didn't write proper Parquet for so long! Plus, tbh, we're already on …
We can conditionally apply the fix using a feature flag whose value depends on the detected dependency version (i.e. the installed `pyarrow` version). We can take the best from both worlds.

P.S.: And for the record, we've also done a few innocent-looking version bumps only to later realize that the bump caused problems at other places, i.e. outside the Python BigQuery client - hence all the caution.
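A minimal sketch of such a version-derived feature flag (the flag name is hypothetical; `use_compliant_nested_type` was added in pyarrow 4.0):

```python
import packaging.version
import pyarrow

# True only when the installed pyarrow can write Parquet-compliant
# nested types (the option first appeared in pyarrow 4.0).
_COMPLIANT_NESTED_TYPES_SUPPORTED = packaging.version.parse(
    pyarrow.__version__
) >= packaging.version.parse("4.0.0")
```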
Force-pushed from 3ddbc66 to a9ec5ca
I've added this to the basic system test. I suppose I should probably also add tests for all the combinations of `enable_list_inference` and `use_compliant_nested_type`?
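Such a matrix could look roughly like the following pytest sketch (fixture names are hypothetical, and the `parquet_use_compliant_nested_type` argument reflects this revision of the PR, not the final code):

```python
import pandas
import pytest
from google.cloud import bigquery
from google.cloud.bigquery.format_options import ParquetOptions


@pytest.mark.parametrize("compliant", [True, False])
@pytest.mark.parametrize("list_inference", [True, False])
def test_load_dataframe_with_list_column(
    bigquery_client, random_table_id, compliant, list_inference
):
    df = pandas.DataFrame({"tags": [["a", "b"], ["c"]]})

    parquet_options = ParquetOptions()
    parquet_options.enable_list_inference = list_inference
    job_config = bigquery.LoadJobConfig(parquet_options=parquet_options)

    bigquery_client.load_table_from_dataframe(
        df,
        random_table_id,
        job_config=job_config,
        parquet_use_compliant_nested_type=compliant,
    ).result()

    # Only the compliant + list-inference combination should yield a
    # repeated (ARRAY) column; the others produce nested STRUCT wrappers.
    table = bigquery_client.get_table(random_table_id)
    is_array = table.schema[0].mode == "REPEATED"
    assert is_array == (compliant and list_inference)
```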
Can you point me to an example of where this has been done before so I can replicate the pattern here?
I recommend adding a property for this. See the similar feature flag for the BQ Storage client.
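For instance, such a flag might gate the writer kwargs where the DataFrame is serialized to Parquet; a hedged sketch (the helper name is hypothetical, not the PR's actual code):

```python
import pyarrow.parquet


def _write_parquet(arrow_table, filepath, compression="snappy"):
    """Write an Arrow table to Parquet, using compliant nested types
    only when the installed pyarrow supports the option."""
    kwargs = {"compression": compression}
    if _COMPLIANT_NESTED_TYPES_SUPPORTED:  # flag from the earlier sketch
        # Only pass the option when the installed pyarrow understands it.
        kwargs["use_compliant_nested_type"] = True
    pyarrow.parquet.write_table(arrow_table, filepath, **kwargs)
```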
google/cloud/bigquery/client.py (Outdated)

```diff
@@ -2456,6 +2457,7 @@ def load_table_from_dataframe(
     project: str = None,
     job_config: LoadJobConfig = None,
     parquet_compression: str = "snappy",
+    parquet_use_compliant_nested_type: bool = True,
```
Do we even need this? I would argue that the non-compliant version is just plain wrong, right?
I don't think we officially supported list/struct types for the pandas connector yet.
I'm more than happy to not expose this! I agree that it's just plain wrong.
I was just thinking about anyone who needs a workaround for a table they've already created with a 'wrong' format.
I think I should clarify this. The reason it might break existing users' tables is that the format pyarrow uses for nested lists by default is `list<item: list<item: value>>`.

By default (without `use_compliant_nested_type`) it turns this into `list<list<item: value>>` when writing to Parquet.

When `enable_list_inference` is off in BigQuery, this ends up as an `ARRAY<STRUCT>` in BigQuery where the `STRUCT` looks like:

```
{
  "list": {
    "item": value
  }
}
```

When `enable_list_inference` is on in BigQuery, this ends up as an `ARRAY<STRUCT>` in BigQuery where the `STRUCT` looks like:

```
{
  "item": value
}
```

When PyArrow is told to `use_compliant_nested_type`, it outputs `list<list<element: value>>`.

When `enable_list_inference` is off in BigQuery, this ends up as an `ARRAY<STRUCT>` in BigQuery where the `STRUCT` looks like:

```
{
  "list": {
    "element": value
  }
}
```

When `enable_list_inference` is on in BigQuery, this ends up as an `ARRAY<value>` in BigQuery, which is what it should be!

My point being that if we switch on `use_compliant_nested_type` without giving the option to turn it off, `item` always becomes `element`. This is guaranteed to be incompatible with schemas created with the legacy version.

Does that make sense?
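A quick sketch that makes the naming difference visible locally (assumes pyarrow >= 4.0, where `use_compliant_nested_type` was introduced):

```python
import io

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"values": pa.array([[1, 2], [3]], type=pa.list_(pa.int64()))})

for compliant in (False, True):
    buf = io.BytesIO()
    pq.write_table(table, buf, use_compliant_nested_type=compliant)
    schema = pq.read_schema(io.BytesIO(buf.getvalue()))
    # Non-compliant output names the repeated field "item"; compliant
    # output names it "element", which BigQuery's list inference
    # recognises as a plain ARRAY.
    print(compliant, schema.field("values").type)
```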
If you're happy to have this breaking change and 'officially' support list types for the pandas connector, then I'm happy not to expose it and will make the change.
Yes, this change seems to be what's needed to officially support list types.
Makes sense. I hadn't really expected anyone to actually use that rather strange schema...
Since #19 is still open, I don't consider lists supported as of today, so I'm okay making this breaking change.
That said, if you think it would be disruptive, we can consider adding a field now and then immediately removing it in `google-cloud-bigquery` 3.0.
I think it's reasonable to make the breaking change. If someone cares enough to maintain the old behaviour, they can manually write the DataFrame using whatever options they want and then use whatever job config they like with `load_table_from_file`. It doesn't seem worth adding the option only to immediately remove it. Especially as, as you say, the old schema is difficult to use.
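For anyone who does need the legacy schema, that workaround might look roughly like this (the table ID is a placeholder; `use_compliant_nested_type=False` assumes pyarrow >= 4.0):

```python
import io

import pandas
import pyarrow
import pyarrow.parquet
from google.cloud import bigquery

df = pandas.DataFrame({"tags": [["a", "b"], ["c"]]})

# Serialize the DataFrame yourself, keeping the legacy "item" naming.
buf = io.BytesIO()
pyarrow.parquet.write_table(
    pyarrow.Table.from_pandas(df),
    buf,
    use_compliant_nested_type=False,
)
buf.seek(0)

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET)
client.load_table_from_file(
    buf, "my_dataset.my_table", job_config=job_config
).result()
```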
Cool. Let's remove it, then. Especially since folks have a workaround if they really need the unsupported schema.
Done.
Force-pushed from 4e46f83 to a7fb8fa
Cool
Force-pushed from 6993df4 to ea54491
@tswast Are we all good here?
@judahrand Thank you very much for the contribution and for addressing our feedback! I hope to release this change soon.
Fixes #19 🦕