BUG: DataFrame with index having tzlocal() timezone could not be saved to parquet #33786

Open
2 of 3 tasks
vfilimonov opened this issue Apr 25, 2020 · 3 comments
Labels: Bug, IO Parquet, Timezones

vfilimonov (Contributor) commented Apr 25, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


If a DataFrame has an index whose timezone is set to dateutil.tz.tzlocal(), it cannot be saved to parquet.

It might be related to #24310

Code Sample, a copy-pastable example

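import pandas as pd
from dateutil.tz import tzlocal

# Daily index localized to the machine's local timezone (a dateutil tzlocal object)
ind = pd.date_range('2020-02-01', '2020-04-14').tz_localize(tzlocal())
x = pd.DataFrame([[1, 2]] * len(ind), index=ind, columns=['A', 'B'])
x.to_parquet('tmp.parquet')  # raises ValueError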

Problem description

The code raises ValueError: Unable to convert timezone "tzlocal()" to string.

However, saving the same DataFrame to CSV, for example, works fine:

x.to_csv('tmp.csv')

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.1.final.0
python-bits : 64
OS : Darwin
OS-release : 18.7.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.3
numpy : 1.18.2
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 42.0.1.post20191125
Cython : None
pytest : 5.3.0
hypothesis : None
sphinx : 2.2.0
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.4.1
html5lib : None
pymysql : 0.9.3
psycopg2 : 2.8.5 (dt dec pq3 ext lo64)
jinja2 : 2.10.1
IPython : 7.13.0
pandas_datareader: None
bs4 : 4.6.3
bottleneck : None
fastparquet : 0.3.2
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 3.1.3
numexpr : 2.7.0
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.16.0
pytables : None
pytest : 5.3.0
pyxlsb : None
s3fs : 0.4.2
scipy : 1.3.1
sqlalchemy : 1.3.13
tables : 3.4.4
tabulate : None
xarray : None
xlrd : 1.1.0
xlwt : None
xlsxwriter : None
numba : 0.46.0

vfilimonov added the Bug and Needs Triage labels Apr 25, 2020
jorisvandenbossche (Member) commented

@vfilimonov dateutil timezones are currently not supported by pyarrow, see https://issues.apache.org/jira/browse/ARROW-5248

So the best option, for now, is to convert the timezone to a datetime.timezone fixed offset or a pytz timezone.

jorisvandenbossche added the IO Parquet and Timezones labels and removed the Needs Triage label Apr 25, 2020
vfilimonov (Contributor, Author) commented

Thank you for the quick response, @jorisvandenbossche!

What would be the easiest way to convert the index to a fixed offset?

One workaround I can think of is to iterate over the index and parse each timestamp from its string form:

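import pandas as pd
from dateutil.tz import tzlocal
ind = pd.date_range('2020-02-01', '2020-04-14').tz_localize(tzlocal())

# Round-trip each timestamp through its string form, which keeps only the UTC offset;
# the result is a plain list of Timestamps
ind = [pd.Timestamp(str(ts)) for ts in ind]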
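
A vectorized alternative to the string round-trip might look like the sketch below. It assumes that DatetimeIndex.tz_convert is acceptable here, and the one-hour offset is only a placeholder, not a value taken from this issue. Converting keeps the same instants in time and only swaps the timezone object attached to the index.

```python
import datetime as dt

import pandas as pd
from dateutil.tz import tzlocal

ind = pd.date_range("2020-02-01", "2020-04-14").tz_localize(tzlocal())
x = pd.DataFrame([[1, 2]] * len(ind), index=ind, columns=["A", "B"])

# Option 1: convert the index to UTC (same instants, different tz object).
x_utc = x.copy()
x_utc.index = x.index.tz_convert("UTC")

# Option 2: convert to a datetime.timezone fixed offset.
# The +01:00 offset is a placeholder; substitute the offset you actually need.
fixed = dt.timezone(dt.timedelta(hours=1))
x_fixed = x.copy()
x_fixed.index = x.index.tz_convert(fixed)
```

After either conversion, x_utc.to_parquet('tmp.parquet') should no longer hit the tzlocal() conversion error, although that was not verified in this thread. Note that a single fixed offset cannot represent daylight saving changes, so displayed wall-clock times may differ from the local ones on part of the range.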

jbrockmendel (Member) commented

@jorisvandenbossche is there any prospect of this being supported in pyarrow? If not, it might make sense to add a helpful error message for this case.
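
As a rough illustration of what such a message could look like on the user side, here is a minimal sketch. check_index_tz_for_parquet is a hypothetical helper, not an existing pandas or pyarrow function, and the wording of the message is only an assumption.

```python
import pandas as pd
from dateutil.tz import tzlocal


def check_index_tz_for_parquet(df):
    """Hypothetical pre-flight check; not part of pandas or pyarrow."""
    tz = getattr(df.index, "tz", None)
    if isinstance(tz, tzlocal):
        raise ValueError(
            "The index uses dateutil.tz.tzlocal(), which cannot be converted "
            "to a timezone string for parquet. Convert it first, for example "
            "with df.index = df.index.tz_convert('UTC')."
        )


ind = pd.date_range("2020-02-01", "2020-04-14").tz_localize(tzlocal())
x = pd.DataFrame([[1, 2]] * len(ind), index=ind, columns=["A", "B"])

# Capture the message instead of letting the error propagate.
try:
    check_index_tz_for_parquet(x)
    message = None
except ValueError as err:
    message = str(err)
```

Calling the helper before to_parquet would surface the problem with a hint about the fix, instead of the lower-level "Unable to convert timezone" error.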
