Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: load_table_from_dataframe method for issue 1692 #1698

Conversation

Gaurang033
Copy link
Contributor

Fixes #1692

@Gaurang033 Gaurang033 requested review from a team as code owners October 23, 2023 16:17
@product-auto-label product-auto-label bot added size: s Pull request size is small. api: bigquery Issues related to the googleapis/python-bigquery API. labels Oct 23, 2023
@dandhlee
Copy link
Contributor

Please keep titles less than 50 characters.

google/cloud/bigquery/_pandas_helpers.py Outdated Show resolved Hide resolved
google/cloud/bigquery/_pandas_helpers.py Outdated Show resolved Hide resolved
tests/unit/test_client.py Outdated Show resolved Hide resolved
@Gaurang033 Gaurang033 force-pushed the feature/fix_1692_load_table_from_dataframe branch 2 times, most recently from ed452fb to 6dd6a79 Compare October 23, 2023 17:49
@Gaurang033 Gaurang033 changed the title fix: load_table_from_dataframe does not error out when nan in a requi… fix: load_table_from_dataframe method for issue 1692 Oct 23, 2023
tests/unit/test_client.py Outdated Show resolved Hide resolved
@Gaurang033 Gaurang033 force-pushed the feature/fix_1692_load_table_from_dataframe branch from 6dd6a79 to 98e568f Compare October 23, 2023 18:14
@Gaurang033
Copy link
Contributor Author

@dandhlee could you please review the changes.

@Gaurang033 Gaurang033 force-pushed the feature/fix_1692_load_table_from_dataframe branch 2 times, most recently from f123095 to d06a2ac Compare October 31, 2023 23:50
@Linchin Linchin added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Nov 2, 2023
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Nov 2, 2023
@Linchin Linchin added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Nov 2, 2023
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Nov 2, 2023
@Gaurang033 Gaurang033 force-pushed the feature/fix_1692_load_table_from_dataframe branch from afe6a01 to a731061 Compare November 3, 2023 21:12
@Gaurang033 Gaurang033 force-pushed the feature/fix_1692_load_table_from_dataframe branch from a731061 to 3a57815 Compare November 3, 2023 21:12
@Linchin Linchin added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Nov 6, 2023
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Nov 6, 2023
@@ -302,6 +302,20 @@ def bq_to_arrow_array(series, bq_field):
return pyarrow.Array.from_pandas(series, type=arrow_type)


def _check_nullability(arrow_fields, dataframe):
"""Throws error if dataframe has null values and column doesn't allow nullable"""
if dataframe.index.name:
Copy link
Contributor

@Linchin Linchin Nov 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please help me understand what lines 307-308 are for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's hard to let you know which exact code as I am on vacation. But when dataframe with index is used, index name is transformed as airow column name. There were two way to fix it. One was to put exception for this case or the create another column with index name. I choose the second option as it's easier. Without this the dataframe unit test case where they use index names will fail.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, hope you are having a good time in vacation! I played with dataframe's index a little bit, and I think there are several corner cases (which are likely non-comprehensive) that we need to cover:

  • Index doesn't have a name, does it still get converted into arrow?
  • Multiple index
  • Index with the same name as columns, which is possible with dataframes
  • Index columns have the same names (possible too)
  • multiindex

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Gaurang033 Thanks for offering up this PR.
@Linchin I appreciate this summary of additional edge cases that may not be covered by this solution.

I too worry about the edge cases, but more importantly, I worry about spending too much time and energy trying to create a work around for what we all agree is a problem in pyarrow. This feels like it creates greater complexity in our code, increased fragility, and a higher maintenance burden in the long run. Am I missing something?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am also uncertain if we should add logic in our repo to correct an issue with pyarrow. I have been thinking about this PR as more of a temporary patchwork that maybe reverted later, but for now does help our customers. However if the logic covering the corner cases get too convoluted with the behaviors of pyarrow, I agree that perhaps it's a better idea to open an issue with pyarrow instead.

@Linchin Linchin added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Nov 14, 2023
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Nov 14, 2023
@Linchin Linchin removed the request for review from mrfaizal November 14, 2023 19:57
@tswast
Copy link
Contributor

tswast commented Nov 22, 2023

Let's just let the server-side determine if we aren't matching the correct schema. I propose #1735 instead.

@tswast tswast closed this Nov 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: bigquery Issues related to the googleapis/python-bigquery API. size: s Pull request size is small.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

load_table_from_dataframe does not error out when nan in a required column - Million dollar bug
6 participants