BUG: Reading from parquet throws UnknownTimeZoneError using timezone-aware date in index #35997

Closed
2 tasks done
alippai opened this issue Aug 30, 2020 · 11 comments · Fixed by #36004
Labels: Bug, Datetime (Datetime data dtype), IO Parquet (parquet, feather)

@alippai
Contributor

alippai commented Aug 30, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

import pandas as pd
from datetime import datetime, timezone

df = pd.DataFrame([[datetime.now(timezone.utc)]], columns=['date']).set_index('date')
df.to_parquet('out.parquet')
pd.read_parquet('out.parquet')

Problem description

The bug above happens with pandas 1.1.1 and pyarrow 1.0.1.
The timezone-aware date in the index should survive the parquet round trip.
If date is not the index, or if I pass ignore_metadata=True to pyarrow.Table.to_pandas(), it works (but date is then not restored as the index automatically).

Expected Output

A correct DataFrame

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit : f2ca0a2
python : 3.8.2.final.0
python-bits : 64
OS : Linux
OS-release : 4.19.104-microsoft-standard
Version : #1 SMP Wed Feb 19 06:37:35 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.1.1
numpy : 1.19.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.2
setuptools : 49.6.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.18.1
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 1.0.1
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.5.2
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

@alippai added the Bug and Needs Triage labels on Aug 30, 2020
@dsaxton
Member

dsaxton commented Aug 31, 2020

Thanks @alippai, looks like something may have changed for 1.1.0. Previously (1.0.5 and 1.0.0 at least) writing with the pyarrow engine would actually raise an ArrowInvalid exception, whereas now you can write but can't read. If you have the dependencies installed then an apparent workaround is to write using fastparquet instead:

In [1]: import pandas as pd
   ...: from datetime import datetime, timezone
   ...:
   ...: print(pd.__version__)
   ...:
   ...: idx = 5 * [datetime.now(timezone.utc)]
   ...: df = pd.DataFrame(index=idx)
   ...: df.to_parquet("out.parquet", engine="fastparquet")
   ...: pd.read_parquet("out.parquet", engine="pyarrow")
   ...:
1.1.1
Out[1]:
Empty DataFrame
Columns: []
Index: [2020-08-31 00:50:41.828864+00:00, 2020-08-31 00:50:41.828864+00:00, 2020-08-31 00:50:41.828864+00:00, 2020-08-31 00:50:41.828864+00:00, 2020-08-31 00:50:41.828864+00:00]

cc @jorisvandenbossche who may know what's happening

@dsaxton added the IO Parquet and Datetime labels and removed the Needs Triage label on Aug 31, 2020
@dsaxton changed the title from "BUG: Reading from parque throws UnknownTimeZoneError using timezone-aware date in index" to "BUG: Reading from parquet throws UnknownTimeZoneError using timezone-aware date in index" on Aug 31, 2020
@dsaxton
Member

dsaxton commented Aug 31, 2020

Likely relevant:

commit be9ee6d
Author: Joris Van den Bossche jorisvandenbossche@gmail.com
Date: Wed Feb 5 02:08:59 2020 +0100

BUG: avoid specifying default coerce_timestamps in to_parquet (#31652)

doc/source/whatsnew/v1.1.0.rst | 4 +++-
pandas/io/parquet.py | 8 +-------
pandas/tests/io/test_parquet.py | 7 +++++++
3 files changed, 11 insertions(+), 8 deletions(-)

@alippai
Contributor Author

alippai commented Aug 31, 2020

Note that pyarrow version="2.0" doesn't help, that was the first setup I got the error with. I made the example more minimal later :)

alippai added a commit to alippai/pandas that referenced this issue Aug 31, 2020
@alippai
Contributor Author

alippai commented Aug 31, 2020

I've added a test, maybe it helps: https://travis-ci.org/github/pandas-dev/pandas/jobs/722666704

@alippai
Contributor Author

alippai commented Aug 31, 2020

One more interesting thing. Using timezone from datetime fails, but using timezone from pytz works:
Code:

import pandas as pd
from datetime import datetime, timezone
import pyarrow.parquet as pq
from pytz import timezone as pytztimezone

normal_date = datetime(2011, 8, 15, 8, 15, 12, 0, timezone.utc)
pytz_date = datetime(2011, 8, 15, 8, 15, 12, 0, pytztimezone('UTC'))

print(f'Normal date: {normal_date.tzinfo}')
print(f'pytz: {pytz_date.tzinfo}')
print(f'Normal date: {normal_date.tzname()}')
print(f'pytz: {pytz_date.tzname()}')

pd.DataFrame(index=[pytz_date]).to_parquet('pytz.parquet')
pd.DataFrame(index=[normal_date]).to_parquet('normal.parquet')
print(pq.read_table('pytz.parquet').schema.metadata)
print(pq.read_table('normal.parquet').schema.metadata)
print(pq.read_table('pytz.parquet').to_pandas())
print(pq.read_table('normal.parquet').to_pandas())

Output:

Normal date: UTC
pytz: UTC
Normal date: UTC
pytz: UTC
{b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"name": null, "field_name": null, "pandas_type": "empty", "numpy_type": "object", "metadata": null}], "columns": [{"name": null, "field_name": "__index_level_0__", "pandas_type": "datetimetz", "numpy_type": "datetime64[ns]", "metadata": {"timezone": "UTC"}}], "creator": {"library": "pyarrow", "version": "1.0.1"}, "pandas_version": "1.2.0.dev0+182.g1e1e942e7"}'}
{b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"name": null, "field_name": null, "pandas_type": "empty", "numpy_type": "object", "metadata": null}], "columns": [{"name": null, "field_name": "__index_level_0__", "pandas_type": "datetimetz", "numpy_type": "datetime64[ns]", "metadata": {"timezone": "+00:00"}}], "creator": {"library": "pyarrow", "version": "1.0.1"}, "pandas_version": "1.2.0.dev0+182.g1e1e942e7"}'}
Empty DataFrame
Columns: []
Index: [2011-08-15 08:15:12+00:00]
Traceback (most recent call last):
  File "app.py", line 19, in <module>
    print(pq.read_table('normal.parquet').to_pandas())
  File "pyarrow/array.pxi", line 715, in pyarrow.lib._PandasConvertible.to_pandas
  File "pyarrow/table.pxi", line 1565, in pyarrow.lib.Table._to_pandas
  File "/home/alippai/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 769, in table_to_blockmanager
    table, index = _reconstruct_index(table, index_descriptors,
  File "/home/alippai/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 916, in _reconstruct_index
    result_table, index_level, index_name = _extract_index_level(
  File "/home/alippai/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 973, in _extract_index_level
    index_level = (pd.Series(values).dt.tz_localize('utc')
  File "/home/alippai/repositories/pandas/pandas/core/accessor.py", line 99, in f
    return self._delegate_method(name, *args, **kwargs)
  File "/home/alippai/repositories/pandas/pandas/core/indexes/accessors.py", line 104, in _delegate_method
    result = method(*args, **kwargs)
  File "/home/alippai/repositories/pandas/pandas/core/indexes/datetimes.py", line 233, in tz_convert
    arr = self._data.tz_convert(tz)
  File "/home/alippai/repositories/pandas/pandas/core/arrays/datetimes.py", line 797, in tz_convert
    tz = timezones.maybe_get_tz(tz)
  File "pandas/_libs/tslibs/timezones.pyx", line 91, in pandas._libs.tslibs.timezones.maybe_get_tz
    cpdef inline tzinfo maybe_get_tz(object tz):
  File "pandas/_libs/tslibs/timezones.pyx", line 106, in pandas._libs.tslibs.timezones.maybe_get_tz
    tz = pytz.timezone(tz)
  File "/home/alippai/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/pytz/__init__.py", line 181, in timezone
    raise UnknownTimeZoneError(zone)
pytz.exceptions.UnknownTimeZoneError: '+00:00'
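
To compare the two files without going through the failing to_pandas() call, the stored timezone string can be read straight out of the b'pandas' schema metadata. A minimal stdlib-only sketch: stored_timezones is a hypothetical helper name, and the bytes literal is a trimmed copy of the normal.parquet metadata printed above, not a full file.

```python
import json

# Trimmed copy of the b'pandas' schema metadata printed above for normal.parquet.
raw_metadata = (
    b'{"index_columns": ["__index_level_0__"], '
    b'"columns": [{"name": null, "field_name": "__index_level_0__", '
    b'"pandas_type": "datetimetz", "numpy_type": "datetime64[ns]", '
    b'"metadata": {"timezone": "+00:00"}}]}'
)


def stored_timezones(pandas_metadata: bytes) -> dict:
    """Map each tz-aware field name to the timezone string stored for it."""
    meta = json.loads(pandas_metadata)
    return {
        col["field_name"]: col["metadata"]["timezone"]
        for col in meta["columns"]
        if col.get("pandas_type") == "datetimetz" and col.get("metadata")
    }


print(stored_timezones(raw_metadata))
```

On a real file the bytes would come from pq.read_table('normal.parquet').schema.metadata[b'pandas'], as in the script above.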

alippai added a commit to alippai/pandas that referenced this issue Aug 31, 2020
@alippai
Contributor Author

alippai commented Aug 31, 2020

fastparquet fails to write a date that has a UTC offset but no timezone name:

import pandas as pd
from datetime import datetime, timezone
idx = [datetime.strptime('2019-01-04T16:41:24+0200', "%Y-%m-%dT%H:%M:%S%z")]
df = pd.DataFrame(index=idx)
df.to_parquet("out.parquet", engine="fastparquet")

Raises:

Traceback (most recent call last):
  File "/home/alippai/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/fastparquet/util.py", line 234, in get_column_metadata
    pd.Series([pd.to_datetime('now')]).dt.tz_localize(str(dtype.tz))
  File "/home/alippai/repositories/pandas/pandas/core/accessor.py", line 99, in f
    return self._delegate_method(name, *args, **kwargs)
  File "/home/alippai/repositories/pandas/pandas/core/indexes/accessors.py", line 104, in _delegate_method
    result = method(*args, **kwargs)
  File "/home/alippai/repositories/pandas/pandas/core/indexes/datetimes.py", line 240, in tz_localize
    arr = self._data.tz_localize(tz, ambiguous, nonexistent)
  File "/home/alippai/repositories/pandas/pandas/core/arrays/datetimes.py", line 967, in tz_localize
    tz = timezones.maybe_get_tz(tz)
  File "pandas/_libs/tslibs/timezones.pyx", line 91, in pandas._libs.tslibs.timezones.maybe_get_tz
    cpdef inline tzinfo maybe_get_tz(object tz):
  File "pandas/_libs/tslibs/timezones.pyx", line 106, in pandas._libs.tslibs.timezones.maybe_get_tz
    tz = pytz.timezone(tz)
  File "/home/alippai/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/pytz/__init__.py", line 181, in timezone
    raise UnknownTimeZoneError(zone)
pytz.exceptions.UnknownTimeZoneError: 'UTC+02:00'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "app.py", line 12, in <module>
    pd.DataFrame(index=[normal_date]).to_parquet('normal.parquet', engine='fastparquet')
  File "/home/alippai/repositories/pandas/pandas/util/_decorators.py", line 199, in wrapper
    return func(*args, **kwargs)
  File "/home/alippai/repositories/pandas/pandas/core/frame.py", line 2396, in to_parquet
    to_parquet(
  File "/home/alippai/repositories/pandas/pandas/io/parquet.py", line 303, in to_parquet
    return impl.write(
  File "/home/alippai/repositories/pandas/pandas/io/parquet.py", line 211, in write
    self.api.write(
  File "/home/alippai/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/fastparquet/writer.py", line 875, in write
    fmd = make_metadata(data, has_nulls=has_nulls, ignore_columns=ignore,
  File "/home/alippai/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/fastparquet/writer.py", line 697, in make_metadata
    get_column_metadata(data[column], column))
  File "/home/alippai/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/fastparquet/util.py", line 237, in get_column_metadata
    raise ValueError("Time-zone information could not be serialised: "
ValueError: Time-zone information could not be serialised: UTC+02:00, please use another
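
The 'UTC+02:00' in the error appears to be the string form of the fixed-offset tzinfo that %z produces, which the traceback shows fastparquet passing through str(dtype.tz). A stdlib-only sketch (no pandas or fastparquet involved), assuming pandas keeps that tzinfo object unchanged on the index:

```python
from datetime import datetime

parsed = datetime.strptime("2019-01-04T16:41:24+0200", "%Y-%m-%dT%H:%M:%S%z")

# %z yields a datetime.timezone with a fixed offset and no zone name.
print(type(parsed.tzinfo).__name__)  # timezone
print(str(parsed.tzinfo))            # UTC+02:00
print(parsed.utcoffset())            # 2:00:00
```

That string is not a name pytz.timezone() can look up, which matches the UnknownTimeZoneError in the first traceback.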

@jreback added this to the Contributions Welcome milestone on Sep 5, 2020
@alippai
Contributor Author

alippai commented Sep 6, 2020

@jreback The PR is ready to review now

@jorisvandenbossche
Member

@alippai will take a look tomorrow (I will also look into whether this is actually something we should solve long term in pyarrow, but I need to better understand the actual issue first)

@alippai
Contributor Author

alippai commented Sep 7, 2020

@jorisvandenbossche regarding pyarrow: the behavior in #35997 (comment) is the most interesting part. For two equivalent timezones, pyarrow serializes the metadata differently. This is something you will probably want to change eventually.

The PR doesn't fix/improve the serialization, but it helps reading back the already written metadata.

@jorisvandenbossche
Member

looks like something may have changed for 1.1.0. Previously (1.0.5 and 1.0.0 at least) writing with the pyarrow engine would actually raise an ArrowInvalid exception, whereas now you can write but can't read.

@dsaxton that's actually because of the commit you linked to (#31652). If you pass coerce_timestamps=None to get pyarrow's default, you get the same error with pandas 1.0 as with 1.1.

So it seems this bug has been present for some time. With pyarrow 0.17 I also get a similar error when round-tripping a pandas DataFrame with a tz-aware index to a pyarrow Table.

@alippai the issue is indeed with how pyarrow stores the timezone in the schema metadata. For pytz it uses "UTC" (which is correctly recognized by pandas afterwards), but for datetime.timezone.utc it uses "+00:00" in the schema's pandas_metadata. So your PR to ensure pandas recognizes such format should indeed fix the issue.

I am still wondering why it doesn't fail for normal columns, though. And pyarrow should probably also recognize datetime.timezone.utc properly as "UTC".

@jorisvandenbossche
Member

So the reason is that for normal columns pyarrow first converts the string "+01:00" to a Python timezone with an internal utility (pa.lib.string_to_tzinfo), and for index columns we didn't do this. I am fixing this on the pyarrow side with apache/arrow#8162

I think we should still recognize datetime.timezone.utc as "UTC" as well (regardless of that, the above fix is needed in general for "fixed offset" timezones). For that I opened https://issues.apache.org/jira/browse/ARROW-9963

Independently from my fix in arrow, I think we can certainly still try to support "+01:00"-like strings in pandas as well (-> #36004)
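
As a rough illustration of what recognizing "+01:00"-like strings could look like, here is a stdlib-only sketch. It is not the code from #36004 or from pyarrow; parse_fixed_offset is a hypothetical helper, and the accepted format (a sign followed by HH:MM) is an assumption based on the strings seen in this thread.

```python
import re
from datetime import timedelta, timezone
from typing import Optional

# Matches strings such as "+00:00", "+01:00" or "-05:30".
_OFFSET_RE = re.compile(r"^([+-])(\d{2}):(\d{2})$")


def parse_fixed_offset(tz: str) -> Optional[timezone]:
    """Return a fixed-offset tzinfo for '+HH:MM' / '-HH:MM', else None."""
    match = _OFFSET_RE.match(tz)
    if match is None:
        return None  # e.g. "UTC" or "Europe/Brussels": leave to a name lookup
    sign, hours, minutes = match.groups()
    if int(hours) > 23 or int(minutes) > 59:
        return None  # out of range for datetime.timezone
    delta = timedelta(hours=int(hours), minutes=int(minutes))
    return timezone(-delta if sign == "-" else delta)


print(parse_fixed_offset("+01:00"))
print(parse_fixed_offset("UTC"))
```

A caller would try this first and fall back to the usual name-based lookup when it returns None.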

@jreback changed the milestone from Contributions Welcome to 1.2 on Oct 7, 2020