
to_dataframe fails when fetching timestamp values outside nanosecond bounds #168

Closed
tswast opened this issue Jul 15, 2020 · 1 comment
tswast (Contributor) commented Jul 15, 2020

Environment details

  • OS type and version: macOS Catalina (10.15.5)
  • Python version: Python 3.7.6
  • pip version: pip 20.0.2
  • google-cloud-bigquery version: 1.24.0

Output of pip show google-cloud-bigquery:

Name: google-cloud-bigquery
Version: 1.24.0
Summary: Google BigQuery API client library
Home-page: https://github.com/GoogleCloudPlatform/google-cloud-python
Author: Google LLC
Author-email: googleapis-packages@google.com
License: Apache 2.0
Location: /Users/swast/miniconda3/envs/ibis-dev/lib/python3.7/site-packages
Requires: google-cloud-core, google-api-core, google-resumable-media, google-auth, protobuf, six

Code example

Code:

from google.cloud import bigquery

client = bigquery.Client()
df = client.query("SELECT TIMESTAMP '4567-01-01 00:00:00' AS `tmp`").to_dataframe()

Stack trace

---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
<ipython-input-3-6b8b40790c39> in <module>
----> 1 df = client.query("SELECT TIMESTAMP '4567-01-01 00:00:00' AS `tmp`").to_dataframe()

~/miniconda3/envs/ibis-dev/lib/python3.7/site-packages/google/cloud/bigquery/job.py in to_dataframe(self, bqstorage_client, dtypes, progress_bar_type, create_bqstorage_client)
   3372             dtypes=dtypes,
   3373             progress_bar_type=progress_bar_type,
-> 3374             create_bqstorage_client=create_bqstorage_client,
   3375         )
   3376 

~/miniconda3/envs/ibis-dev/lib/python3.7/site-packages/google/cloud/bigquery/table.py in to_dataframe(self, bqstorage_client, dtypes, progress_bar_type, create_bqstorage_client)
   1729                 create_bqstorage_client=create_bqstorage_client,
   1730             )
-> 1731             df = record_batch.to_pandas()
   1732             for column in dtypes:
   1733                 df[column] = pandas.Series(df[column], dtype=dtypes[column])

~/miniconda3/envs/ibis-dev/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib._PandasConvertible.to_pandas()

~/miniconda3/envs/ibis-dev/lib/python3.7/site-packages/pyarrow/table.pxi in pyarrow.lib.Table._to_pandas()

~/miniconda3/envs/ibis-dev/lib/python3.7/site-packages/pyarrow/pandas_compat.py in table_to_blockmanager(options, table, categories, ignore_metadata, types_mapper)
    764     _check_data_column_metadata_consistency(all_columns)
    765     columns = _deserialize_column_index(table, all_columns, column_indexes)
--> 766     blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
    767 
    768     axes = [columns, index]

~/miniconda3/envs/ibis-dev/lib/python3.7/site-packages/pyarrow/pandas_compat.py in _table_to_blocks(options, block_table, categories, extension_columns)
   1100     columns = block_table.column_names
   1101     result = pa.lib.table_to_blocks(options, block_table, categories,
-> 1102                                     list(extension_columns.keys()))
   1103     return [_reconstruct_block(item, columns, extension_columns)
   1104             for item in result]

~/miniconda3/envs/ibis-dev/lib/python3.7/site-packages/pyarrow/table.pxi in pyarrow.lib.table_to_blocks()

~/miniconda3/envs/ibis-dev/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Casting from timestamp[us, tz=UTC] to timestamp[ns] would result in out of bounds timestamp: 81953424000000000
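For context on why this cast must fail (a quick check, not part of the original report): the value in the error message is microseconds since the Unix epoch, and pandas' `datetime64[ns]` stores nanoseconds since the epoch in a signed 64-bit integer, capping the representable range at `pd.Timestamp.max` (in the year 2262). Converting this value to nanoseconds overflows that range:

```python
import pandas as pd

# The out-of-bounds value from the ArrowInvalid message, in microseconds
# since the Unix epoch (TIMESTAMP '4567-01-01 00:00:00').
microseconds = 81953424000000000

# datetime64[ns] stores nanoseconds since epoch in a signed 64-bit int,
# so the representable maximum is 2**63 - 1 nanoseconds.
print(pd.Timestamp.max)                 # 2262-04-11 23:47:16.854775807
print(microseconds * 1000 > 2**63 - 1)  # True: the cast to ns overflows
```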

Potential solutions

In order of my preference:

tswast (Contributor, Author) commented Jul 15, 2020

Add option to use datetime objects for timestamp/datetime columns.

This is listed last in my order of preference because I imagine there could be many possible options like this, and it feels wrong to have to pass an option just to avoid an exception when these values are encountered.

Add option to use Fletcher to make a dataframe backed by the Arrow table: https://github.com/xhochy/fletcher

This could be really interesting to try, and much better for performance with string and (out-of-bounds) timestamp columns. From https://uwekorn.com/2020/02/25/fletcher-status-report.html, it sounds like we'd probably be using FletcherChunkedArray, since I think we are converting each page to a pyarrow array and concatenating them at the end.


plamut self-assigned this Jul 31, 2020
gcf-merge-on-green bot pushed a commit that referenced this issue Aug 15, 2020
Fixes #168.

This PR fixes the problem of converting query results to pandas with `pyarrow` when the data contains timestamps that fall outside the bounds representable at nanosecond precision.

The fix requires `pyarrow>=1.0.0` and thus only works on Python 3.

### PR checklist
- [x] Make sure to open an issue as a [bug/issue](https://github.com/googleapis/python-bigquery/issues/new/choose) before writing your code!  That way we can discuss the change, evaluate designs, and agree on the general idea
- [x] Ensure the tests and linter pass
- [x] Code coverage does not decrease (if any source code was changed)
- [x] Appropriate docs were updated (if necessary)