
fix: support ARRAY data type when loading from DataFrame with Parquet #980

Merged

merged 22 commits into googleapis:main from dataframe-arrays on Oct 7, 2021

Conversation

@judahrand (Contributor) commented Sep 21, 2021

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #19 🦕

@judahrand judahrand force-pushed the dataframe-arrays branch from 3f533c0 to 62fd565 Sep 21, 2021
@judahrand (Contributor, Author) commented Sep 21, 2021

I've made this default to True because it is technically the correct thing to do from a BigQuery perspective: BQ expects Parquet, so we should be giving it compliant Parquet. However, the argument could be made that it should default to False for backwards compatibility?

Users who need the old behaviour can set the argument to False manually.
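
For illustration, a minimal sketch of how a caller would opt back into the old behaviour via the argument as proposed at this point in the PR (it is reconsidered later in the review; the table ID is a placeholder):

```python
import pandas as pd
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.my_table"  # hypothetical table

df = pd.DataFrame({"tags": [["a", "b"], ["c"]]})  # a column of lists

# Defaults to True (compliant nested Parquet types); pass False to
# restore the legacy pyarrow behaviour.
client.load_table_from_dataframe(
    df, table_id, parquet_use_compliant_nested_type=False
).result()
```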


@judahrand judahrand changed the title Use compliant Parquet by default loading from dataframe Use compliant Parquet by default when loading from dataframe Sep 21, 2021
@judahrand judahrand changed the title Use compliant Parquet by default when loading from dataframe fix: use compliant Parquet by default when loading from dataframe Sep 21, 2021
@judahrand (Contributor, Author) commented Sep 22, 2021

I suppose another question is whether or not we should set `parquet_options.enable_list_inference = True` by default in this method?
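
For context, a rough sketch of what setting that option explicitly looks like with the existing API (the table ID is a placeholder):

```python
import pandas as pd
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.my_table"  # hypothetical table

parquet_options = bigquery.ParquetOptions()
parquet_options.enable_list_inference = True  # the option under discussion

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    parquet_options=parquet_options,
)

df = pd.DataFrame({"tags": [["a", "b"], ["c"]]})
client.load_table_from_dataframe(df, table_id, job_config=job_config).result()
```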


@judahrand (Contributor, Author) commented Sep 22, 2021

These suggestions might be acceptable if released in 3.0.0?

Would it be acceptable to default `use_compliant_nested_type` and `job_config.parquet_options.enable_list_inference` to True in 3.0.0? This is the only way that a DataFrame with lists in its columns can be written to BigQuery and the same DataFrame retrieved with `.to_dataframe()`. I would have thought this would be the expected behaviour? @plamut
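
A sketch of the round trip in question, assuming both options default to on (the table ID is a placeholder):

```python
import pandas as pd
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.array_table"  # hypothetical table

original = pd.DataFrame({"values": [[1, 2], [3]]})
client.load_table_from_dataframe(original, table_id).result()

# With compliant nested types plus list inference, the column is a
# REPEATED INTEGER field, so it round-trips as a column of lists.
round_tripped = client.list_rows(table_id).to_dataframe()
print(round_tripped["values"])
```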


@plamut (Contributor) commented Sep 22, 2021

@judahrand First of all, thanks for the research and the fix!

> These suggestions might be acceptable if released in 3.0.0?

BigQuery version 3.0 is definitely an opportunity to make such breaking changes. But if it turns out that the change fixes the issue without negative consequences, then it can probably be accepted even in one of the upcoming 2.x releases. Since no tests broke, that's already a good sign.

I'm just not sure about bumping pyarrow to >= 4.0.0 - we generally try to keep the version pins wide to reduce the chance of conflicting requirements with other cloud libraries or third-party dependencies. @tswast, do you know if bumping the minimum pyarrow version would affect some users?

As for the fix itself, can you also add a corresponding test which proves that the issue has indeed been fixed?

Since it has to do with Parquet decoding on the server side, a system test would be needed, but that requires access to an active Google Cloud project. If you don't have the time/resources to write such a test, that's OK - we can add one. It's probably just a matter of adjusting one of the code snippets posted in #19.


@tswast (Contributor) commented Sep 22, 2021

> @tswast do you know if bumping the minimum pyarrow version would affect some users?

I would prefer not to bump pyarrow minimum to 4.0, especially since it was only released this year. I learned recently that there are some Apache Beam / Dataflow systems that are stuck on pyarrow 2.0.


@judahrand (Contributor, Author) commented Sep 22, 2021

> > @tswast do you know if bumping the minimum pyarrow version would affect some users?
>
> I would prefer not to bump the pyarrow minimum to 4.0, especially since it was only released this year. I learned recently that there are some Apache Beam / Dataflow systems that are stuck on pyarrow 2.0.

I agree, but I can't really see another way to address this (and it is quite annoying for some use cases). For example, [Feast](https://github.com/feast-dev/feast) is currently having to implement an ugly hack around this. It's a shame that pyarrow just didn't write proper Parquet for so long! 🤣

Plus, tbh, we're already on 3.0.0, which was released in Jan 2021; 4.0.0 came out in April 2021, so it's only a few months' difference.


@plamut (Contributor) commented Sep 22, 2021

We can conditionally apply the fix using a feature flag whose value depends on the detected dependency version (pyarrow in our case). We've done it several times in the past, and it proved successful - it improved the user experience while not spoiling it for those who were stuck on older versions (for any reason).

We can take the best of both worlds. 🙂

P.S. And for the record, we've also done a few innocent-looking version bumps only to later realize that the bump caused problems in other places, i.e. outside the Python BigQuery client - hence all the caution.
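
The general shape of such a flag - a sketch, assuming the fix should only apply on pyarrow >= 4.0.0, where `use_compliant_nested_type` first appeared (the helper name is made up):

```python
import packaging.version
import pyarrow

_MIN_COMPLIANT_VERSION = packaging.version.Version("4.0.0")


def parquet_write_kwargs() -> dict:
    """Only pass the new flag when the installed pyarrow understands it."""
    kwargs = {}
    if packaging.version.parse(pyarrow.__version__) >= _MIN_COMPLIANT_VERSION:
        kwargs["use_compliant_nested_type"] = True
    return kwargs


# e.g. pyarrow.parquet.write_table(arrow_table, path, **parquet_write_kwargs())
```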


@judahrand judahrand force-pushed the dataframe-arrays branch from 3ddbc66 to a9ec5ca Sep 22, 2021
@judahrand (Contributor, Author) commented Sep 22, 2021

> @judahrand First of all, thanks for the research and the fix!
>
> > These suggestions might be acceptable if released in 3.0.0?
>
> BigQuery version 3.0 is definitely an opportunity to make such breaking changes. But if it turns out that the change fixes the issue without negative consequences, then it can probably be accepted even in one of the upcoming 2.x releases. Since no tests broke, that's already a good sign.
>
> I'm just not sure about bumping pyarrow to >= 4.0.0 - we generally try to keep the version pins wide to reduce the chance of conflicting requirements with other cloud libraries or third-party dependencies. @tswast, do you know if bumping the minimum pyarrow version would affect some users?
>
> As for the fix itself, can you also add a corresponding test which proves that the issue has indeed been fixed?
>
> Since it has to do with Parquet decoding on the server side, a system test would be needed, but that requires access to an active Google Cloud project. If you don't have the time/resources to write such a test, that's OK - we can add one. It's probably just a matter of adjusting one of the code snippets posted in #19.

I've added to the basic test_load_table_from_dataframe_w_automatic_schema test to check that simple REPEATED versions of all the basic types end up with the correct schema in BQ, which is a decent initial pass.

I suppose I should probably also add tests for all the combinations of `use_compliant_nested_type` and `enable_list_inference`?


@judahrand (Contributor, Author) commented Sep 22, 2021

> We can conditionally apply the fix using a feature flag whose value depends on the detected dependency version (pyarrow in our case). We've done it several times in the past, and it proved successful - it improved the user experience while not spoiling it for those who were stuck on older versions (for any reason).
>
> We can take the best of both worlds. 🙂
>
> P.S. And for the record, we've also done a few innocent-looking version bumps only to later realize that the bump caused problems in other places, i.e. outside the Python BigQuery client - hence all the caution.

Can you point me to an example of where this has been done before so I can replicate the pattern here?


@tswast (Contributor) commented Sep 22, 2021

> Can you point me to an example of where this has been done before so I can replicate the pattern here?

I recommend adding a property like `use_compliant_nested_type` to PyarrowVersions:

`class PyarrowVersions:`

See a similar feature flag for the BQ Storage client:

`def is_read_session_optional(self) -> bool:`
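
Following that suggestion, the property might look roughly like this - a sketch, with only the class and property names taken from the discussion:

```python
import packaging.version


class PyarrowVersions:
    """Lazy version checks for the optional pyarrow dependency (sketch)."""

    def __init__(self):
        self._installed_version = None

    @property
    def installed_version(self) -> packaging.version.Version:
        if self._installed_version is None:
            import pyarrow  # deferred so the client works without pyarrow

            self._installed_version = packaging.version.parse(pyarrow.__version__)
        return self._installed_version

    @property
    def use_compliant_nested_type(self) -> bool:
        # pyarrow gained the use_compliant_nested_type flag in 4.0.0.
        return self.installed_version.major >= 4
```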


```diff
@@ -2456,6 +2457,7 @@ def load_table_from_dataframe(
     project: str = None,
     job_config: LoadJobConfig = None,
     parquet_compression: str = "snappy",
+    parquet_use_compliant_nested_type: bool = True,
```
@tswast (Contributor) commented Sep 22, 2021

Do we even need this? I would argue that the non-compliant version is just plain wrong, right?

I don't think we officially supported list/struct types for the pandas connector yet.


@judahrand (Contributor, Author) commented Sep 22, 2021

I'm more than happy to not expose this!! I agree that it's just plain wrong.

I was just thinking about anyone who needs a workaround for a table they've already created with the 'wrong' format.


@judahrand (Contributor, Author) commented Sep 22, 2021

I think I should clarify this. The reason it might break existing users' tables is that the format pyarrow uses for nested lists by default is `list<item: list<item: value>>`.

By default (without `use_compliant_nested_type`) it turns this into `list<list<item: value>>` when writing to Parquet.

When `enable_list_inference` is off in BigQuery, this ends up as an `ARRAY<STRUCT>` in BigQuery where the `STRUCT` looks like:

```json
{
    "list": {
        "item": value
    }
}
```

When `enable_list_inference` is on in BigQuery, this ends up as an `ARRAY<STRUCT>` in BigQuery where the `STRUCT` looks like:

```json
{
    "item": value
}
```

When pyarrow is told to `use_compliant_nested_type`, it outputs `list<list<element: value>>`.

When `enable_list_inference` is off in BigQuery, this ends up as an `ARRAY<STRUCT>` in BigQuery where the `STRUCT` looks like:

```json
{
    "list": {
        "element": value
    }
}
```

When `enable_list_inference` is on in BigQuery, this ends up as an `ARRAY<value>` in BigQuery. Which is what it should be!

My point being that if we switch on `use_compliant_nested_type` without giving the option to turn it off, `item` always becomes `element`. This is guaranteed to be incompatible with schemas created with the legacy version.

Does that make sense?
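
The difference is easy to see by writing the same table both ways and printing the resulting Parquet schemas - a rough sketch (requires pyarrow >= 4.0; the column name is arbitrary):

```python
import io

import pyarrow as pa
import pyarrow.parquet as pq

# A column whose values are lists of lists of integers.
table = pa.table({"col": [[[1, 2]], [[3]]]})

for compliant in (False, True):
    buffer = io.BytesIO()
    pq.write_table(table, buffer, use_compliant_nested_type=compliant)
    buffer.seek(0)
    # Legacy naming calls the inner field "item"; the compliant
    # three-level encoding calls it "element".
    print(f"use_compliant_nested_type={compliant}")
    print(pq.ParquetFile(buffer).schema)
```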


@judahrand (Contributor, Author) commented Sep 22, 2021

If you're happy to have this breaking change and to 'officially' support list types for the pandas connector, then I'm happy to not expose it and will make the change.


@tswast (Contributor) commented Sep 23, 2021

Yes, this change seems to be what's needed to officially support list types. 😁


@tswast (Contributor) commented Sep 23, 2021

Makes sense. I hadn't really expected anyone to actually use that rather strange schema...

Since #19 is still open, I don't consider lists supported as of today, so I'm okay making this breaking change.


@tswast (Contributor) commented Sep 23, 2021

That said, if you think it would be disruptive we can consider adding a field now, but then immediately remove it in google-cloud-bigquery 3.0 🤔


@judahrand (Contributor, Author) commented Sep 23, 2021

I think it's reasonable to make the breaking change. If someone cares enough to maintain the old behaviour, they can write the DataFrame to Parquet manually using whatever options they want and then use whatever JobConfig they like with load_table_from_file. It doesn't seem worth it to add the option and immediately remove it. Especially as, as you say, the old schema is... difficult to use.
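
For the record, that workaround might look like the following sketch (the table ID is a placeholder):

```python
import io

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.legacy_table"  # hypothetical table

df = pd.DataFrame({"values": [[1, 2], [3]]})

# Write the Parquet file manually, keeping pyarrow's legacy nested naming.
buffer = io.BytesIO()
pq.write_table(pa.Table.from_pandas(df), buffer, use_compliant_nested_type=False)
buffer.seek(0)

job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET)
client.load_table_from_file(buffer, table_id, job_config=job_config).result()
```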


@tswast (Contributor) commented Sep 27, 2021

Cool. Let's remove it, then. Especially since folks have a workaround if they really need the unsupported schema.


@judahrand (Contributor, Author) commented Sep 27, 2021

Done.


@judahrand judahrand force-pushed the dataframe-arrays branch from 4e46f83 to a7fb8fa Sep 22, 2021
@judahrand (Contributor, Author) commented Sep 22, 2021

> Can you point me to an example of where this has been done before so I can replicate the pattern here?

> I recommend adding a property like `use_compliant_nested_type` to PyarrowVersions:
>
> `class PyarrowVersions:`
>
> See a similar feature flag for the BQ Storage client:
>
> `def is_read_session_optional(self) -> bool:`

Cool 😁 Implemented this, plus a bit of tidying up.


@judahrand judahrand force-pushed the dataframe-arrays branch from 9d5f66f to 6993df4 Sep 27, 2021
@google-cla google-cla bot added cla: yes and removed cla: no labels Sep 27, 2021
@judahrand judahrand force-pushed the dataframe-arrays branch from 6993df4 to ea54491 Sep 27, 2021
@judahrand (Contributor, Author) commented Sep 30, 2021

@tswast Are we all good here? 😄


@tswast approved these changes Oct 7, 2021

@tswast (Contributor) commented Oct 7, 2021

@judahrand Thank you very much for the contribution and for addressing our feedback! I hope to release this change soon.


@gcf-merge-on-green gcf-merge-on-green bot merged commit 1e59083 into googleapis:main Oct 7, 2021
12 checks passed
@judahrand judahrand deleted the dataframe-arrays branch Nov 19, 2021