
fix: support ARRAY data type when loading from DataFrame with Parquet #980

Merged
merged 22 commits into from Oct 7, 2021

Conversation

judahrand
Contributor

@judahrand judahrand commented Sep 21, 2021

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #19

@judahrand judahrand requested a review from a team September 21, 2021 21:45
@judahrand judahrand requested a review from a team as a code owner September 21, 2021 21:45
@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery API. label Sep 21, 2021
@google-cla google-cla bot added the cla: yes This human has signed the Contributor License Agreement. label Sep 21, 2021
@judahrand
Contributor Author

judahrand commented Sep 21, 2021

I've made this default to True because it is technically the correct thing to do from a BigQuery perspective: BQ expects Parquet, so we should be giving it compliant Parquet. However, an argument could be made that it should default to False for backwards compatibility.

Users who need the old behaviour can set the argument to False manually.

@judahrand judahrand changed the title Use compliant Parquet by default loading from dataframe Use compliant Parquet by default when loading from dataframe Sep 21, 2021
@judahrand judahrand changed the title Use compliant Parquet by default when loading from dataframe fix: use compliant Parquet by default when loading from dataframe Sep 21, 2021
@plamut plamut added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Sep 22, 2021
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Sep 22, 2021
@judahrand
Contributor Author

I suppose another question is whether or not we should set parquet_options.enable_list_inference = True by default in this method?

@judahrand
Contributor Author

judahrand commented Sep 22, 2021

These suggestions might be acceptable if released in 3.0.0?

Would it be acceptable to default use_compliant_nested_type and job_config.parquet_options.enable_list_inference to True in 3.0.0? This is the only way that a DataFrame with lists in the columns can be written to BigQuery and the same DataFrame retrieved with .to_dataframe(). I would have thought this would be the expected behaviour? @plamut

@plamut
Contributor

plamut commented Sep 22, 2021

@judahrand First of all, thanks for the research and the fix!

These suggestions might be acceptable if released in 3.0.0?

BigQuery version 3.0 is definitely an opportunity to make such breaking changes. But if it turns out that the change fixes the issue without negative consequences, then it can probably be accepted even in one of the upcoming 2.x releases. Since no tests broke, this is already a good sign.

I'm just not sure about bumping pyarrow to >= 4.0.0 - we generally try to keep the version pins wide to reduce the chance of conflicting requirements with other cloud libraries or 3rd party dependencies. @tswast do you know if bumping the minimum pyarrow version would affect some users?

As for the fix itself, can you also add a corresponding test which proves that the issue has indeed been fixed?

Since it has to do with parquet decoding on the server side, a system test would be needed, but that requires access to an active Google Cloud Project. If you do not have time/resources to write such a test, that's OK, we can add one. It's probably just a matter of adjusting one of the code snippets posted in #19.

@tswast
Contributor

tswast commented Sep 22, 2021

@tswast do you know if bumping the minimum pyarrow version would affect some users?

I would prefer not to bump pyarrow minimum to 4.0, especially since it was only released this year. I learned recently that there are some Apache Beam / Dataflow systems that are stuck on pyarrow 2.0.

@judahrand
Contributor Author

@tswast do you know if bumping the minimum pyarrow version would affect some users?

I would prefer not to bump pyarrow minimum to 4.0, especially since it was only released this year. I learned recently that there are some Apache Beam / Dataflow systems that are stuck on pyarrow 2.0.

I agree, but I can't really see another way to address this (and it is quite annoying for some use cases). For example, [Feast](https://github.com/feast-dev/feast) is currently having to implement an ugly hack around this. It's a shame that PyArrow just didn't write proper Parquet for so long! 🤣

Plus, to be honest, we already require 3.0.0, which was released in January 2021; 4.0.0 came out in April 2021, so it's only a few months' difference.

@plamut
Contributor

plamut commented Sep 22, 2021

We can conditionally apply the fix using a feature flag whose value depends on the detected dependency version (pyarrow in our case). We've done it several times in the past, and it proved successful - it improved the user experience while not spoiling it for those who were stuck on older versions (for any reason).

We can take the best of both worlds. 🙂

P.S.: And for the record, we've also done a few innocent-looking version bumps only to later realize that the bump caused problems elsewhere, i.e. outside the Python BigQuery client - hence all the caution.

@judahrand
Contributor Author

@judahrand First of all, thanks for the research and the fix!

These suggestions might be acceptable if released in 3.0.0?

BigQuery version 3.0 is definitely an opportunity to make such breaking changes. But if it turns out that the change fixes the issue without negative consequences, then it can probably be accepted even in one of the upcoming 2.x releases. Since no tests broke, this is already a good sign.

I'm just not sure about bumping pyarrow to >= 4.0.0 - we generally try to keep the version pins wide to reduce the chance of conflicting requirements with other cloud libraries or 3rd party dependencies. @tswast do you know if bumping the minimum pyarrow version would affect some users?

As for the fix itself, can you also add a corresponding test which proves that the issue has indeed been fixed?

Since it has to do with parquet decoding on the server side, a system test would be needed, but that requires access to an active Google Cloud Project. If you do not have time/resources to write such a test, that's OK, we can add one. It's probably just a matter of adjusting one of the code snippets posted in #19.

I've extended the basic test_load_table_from_dataframe_w_automatic_schema test to check that simple REPEATED versions of all the basic types end up in the correct schema in BQ, which is a decent initial pass.

I suppose I should probably also add tests for all the combinations of use_compliant_nested_type and enable_list_inference?

@judahrand
Contributor Author

We can conditionally apply the fix using a feature flag whose value depends on the detected dependency version (pyarrow in our case). We've done it several times in the past, and it proved successful - it improved the user experience while not spoiling it for those who were stuck on older versions (for any reason).

We can take the best of both worlds. 🙂

P.S.: And for the record, we've also done a few innocent-looking version bumps only to later realize that the bump caused problems elsewhere, i.e. outside the Python BigQuery client - hence all the caution.

Can you point me to an example of where this has been done before so I can replicate the pattern here?

@tswast
Contributor

tswast commented Sep 22, 2021

Can you point me to an example of where this has been done before so I can replicate the pattern here?

I recommend adding a property like use_compliant_nested_type to PyarrowVersions:

class PyarrowVersions:

See a similar feature flag for the BQ Storage client:

def is_read_session_optional(self) -> bool:

@@ -2456,6 +2457,7 @@ def load_table_from_dataframe(
project: str = None,
job_config: LoadJobConfig = None,
parquet_compression: str = "snappy",
parquet_use_compliant_nested_type: bool = True,
Contributor

Do we even need this? I would argue that the non-compliant version is just plain wrong, right?

I don't think we officially supported list/struct types for the pandas connector yet.

Contributor Author

I'm more than happy to not expose this!! I agree that it's just plain wrong.

I was just thinking about anyone who needs a workaround for a table they've already created with the 'wrong' format.

Contributor Author

@judahrand judahrand Sep 22, 2021

I think I should clarify this. The reason it might break existing users' tables is that the format pyarrow uses for nested lists by default is list<item: list<item: value>>.

By default (without use_compliant_nested_type) it turns this into list<list<item: value>> when writing to Parquet.

When enable_list_inference is off in BigQuery, this ends up as an ARRAY<STRUCT> in BigQuery, where the STRUCT looks like:

{
    "list": {
        "item": value
    }
}

When enable_list_inference is on in BigQuery, this ends up as an ARRAY<STRUCT> in BigQuery, where the STRUCT looks like:

{
    "item": value
}

When PyArrow is told to use_compliant_nested_type, it outputs list<list<element: value>>.

When enable_list_inference is off in BigQuery, this ends up as an ARRAY<STRUCT> in BigQuery, where the STRUCT looks like:

{
    "list": {
        "element": value
    }
}

When enable_list_inference is on in BigQuery, this ends up as an ARRAY<value> in BigQuery, which is what it should be!

My point being that if we switch on use_compliant_nested_type without giving the option to turn it off, item always becomes element. This is guaranteed to be incompatible with schemas created by the legacy version.

Does that make sense?

Contributor Author

If you're happy to have this breaking change and 'officially' support list types for the pandas connector, then I'm happy not to expose it and will make the change.

Contributor

Yes, this change seems to be what's needed to officially support list types. 😁

Contributor

Makes sense. I hadn't really expected anyone to actually use that rather strange schema...

Since #19 is still open, I don't consider lists supported as of today, so I'm okay making this breaking change.

Contributor

That said, if you think it would be disruptive we can consider adding a field now, but then we immediately remove it in google-cloud-bigquery 3.0 🤔

Contributor Author

I think it's reasonable to make the breaking change. If someone cares enough to maintain the old behaviour, they can write the DataFrame manually using whatever options they want and then use whatever JobConfig they like with load_table_from_file. It doesn't seem worth it to add the option and immediately remove it, especially since, as you say, the old schema is... difficult to use.

Contributor

Cool. Let's remove it, then. Especially since folks have a workaround if they really need the unsupported schema.

Contributor Author

Done.

@judahrand
Contributor Author

Can you point me to an example of where this has been done before so I can replicate the pattern here?

I recommend adding a property like use_compliant_nested_type to to PyarrowVersions:

class PyarrowVersions:

See a similar feature flag for the BQ Storage client:

def is_read_session_optional(self) -> bool:

Cool 😁 Implemented this, plus a bit of tidying up.
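The version-gating pattern that was recommended looks roughly like this (a simplified, hypothetical sketch of the idea, not the library's actual helper class; the version parsing is deliberately minimal):

```python
# Hypothetical sketch of a version-gated feature flag: the client only
# passes use_compliant_nested_type to pyarrow when the installed version
# supports it (the option was added in pyarrow 4.0.0).
class PyarrowVersions:
    def __init__(self, installed_version: str):
        # Keep just the numeric major.minor.micro components.
        self._version = tuple(
            int(part) for part in installed_version.split(".")[:3]
        )

    @property
    def use_compliant_nested_type(self) -> bool:
        return self._version >= (4, 0, 0)


def parquet_writer_kwargs(versions: PyarrowVersions) -> dict:
    # Build pyarrow writer arguments conditionally, so older pyarrow
    # versions never see a keyword they don't understand.
    kwargs = {"compression": "snappy"}
    if versions.use_compliant_nested_type:
        kwargs["use_compliant_nested_type"] = True
    return kwargs
```

This way users stuck on pyarrow < 4.0 keep the legacy behaviour transparently, while everyone else gets compliant nested types.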

@google-cla google-cla bot added cla: yes This human has signed the Contributor License Agreement. and removed cla: no This human has *not* signed the Contributor License Agreement. labels Sep 27, 2021
@judahrand
Contributor Author

@tswast Are we all good here? 😄

@tswast tswast added the owlbot:run Add this label to trigger the Owlbot post processor. label Oct 6, 2021
@gcf-owl-bot gcf-owl-bot bot removed the owlbot:run Add this label to trigger the Owlbot post processor. label Oct 6, 2021
@tswast tswast added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Oct 6, 2021
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Oct 6, 2021
@tswast tswast added the owlbot:run Add this label to trigger the Owlbot post processor. label Oct 7, 2021
@gcf-owl-bot gcf-owl-bot bot removed the owlbot:run Add this label to trigger the Owlbot post processor. label Oct 7, 2021
@tswast tswast added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Oct 7, 2021
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Oct 7, 2021
@tswast tswast added the automerge Merge the pull request once unit tests and other checks pass. label Oct 7, 2021
@tswast
Contributor

tswast commented Oct 7, 2021

@judahrand Thank you very much for the contribution and for addressing our feedback! I hope to release this change soon.

@gcf-merge-on-green gcf-merge-on-green bot merged commit 1e59083 into googleapis:main Oct 7, 2021
12 checks passed
@gcf-merge-on-green gcf-merge-on-green bot removed the automerge Merge the pull request once unit tests and other checks pass. label Oct 7, 2021
@judahrand judahrand deleted the dataframe-arrays branch November 19, 2021 09:34
Successfully merging this pull request may close these issues.

BigQuery: Upload pandas DataFrame containing arrays
4 participants