groupby with daily frequency fails with AmbiguousTimeError on clock change day in Cuba #25758

Closed
pierremoulinier opened this issue Mar 17, 2019 · 3 comments · Fixed by #33137
Labels
Bug Timezones Timezone data dtype

@pierremoulinier

Code Sample

import pandas as pd
from datetime import datetime
start = datetime(2018, 11, 3, 12)
end = datetime(2018, 11, 5, 12)
index = pd.date_range(start, end, freq="1H")
index = index.tz_localize('UTC').tz_convert('America/Havana')
data = list(range(len(index)))
dataframe = pd.DataFrame(data, index=index)
groups = dataframe.groupby(pd.Grouper(freq='1D'))

Problem description

On a long clock-change day in Cuba (e.g. 2018-11-04), midnight local time is an ambiguous timestamp, and pd.Grouper does not handle this as I expect: the call to groupby in the code above raises an AmbiguousTimeError.
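To make the ambiguity concrete, here is a minimal sketch using only the standard library (Python 3.9+ with system time zone data available, which is an assumption about the environment). It does not involve pandas at all; it only shows that the wall-clock time 00:00 on 2018-11-04 occurs twice in America/Havana, one hour apart.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

havana = ZoneInfo("America/Havana")

# The same wall-clock reading, 2018-11-04 00:00, happens twice that night.
# fold=0 selects the first occurrence (daylight time, UTC-04:00),
# fold=1 selects the second occurrence (standard time, UTC-05:00).
first = datetime(2018, 11, 4, 0, 0, tzinfo=havana, fold=0)
second = datetime(2018, 11, 4, 0, 0, tzinfo=havana, fold=1)

first_offset = first.utcoffset()
second_offset = second.utcoffset()

# Convert to UTC before subtracting: subtraction between two datetimes that
# share a tzinfo object ignores the offsets and would report zero here.
gap = second.astimezone(timezone.utc) - first.astimezone(timezone.utc)

print(first_offset, second_offset, gap)
```

A naive "2018-11-04 00:00" therefore cannot be localized without saying which of the two instants is meant, which is what the error is complaining about.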

This issue is similar in nature to #23742, but #23742 appears to have been fixed in 0.24 whereas this one was not.

Expected Output

The call to groupby should return three groups, one for each day (the 3rd, 4th, and 5th of November). The group for the 4th of November should be labelled '2018-11-04 00:00:00-04:00' (the first midnight, before the clock change) and should contain the 25 hourly data points for that day.
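For reference, the expected split can be reproduced with a workaround sketch that avoids pd.Grouper and groups on the local wall-clock date instead. This is not the fix that was eventually merged, only an illustration of the grouping described above; the column name `value` and the variable names are made up for the example, and the group labels come out as plain `datetime.date` objects rather than tz-aware timestamps.

```python
import pandas as pd

# Same hourly range as in the code sample, built directly in UTC.
index = pd.date_range(
    "2018-11-03 12:00",
    "2018-11-05 12:00",
    freq=pd.Timedelta(hours=1),
    tz="UTC",
).tz_convert("America/Havana")
frame = pd.DataFrame({"value": range(len(index))}, index=index)

# index.date yields the local calendar date of each timestamp, so no
# midnight label ever has to be localized and nothing is ambiguous.
by_day = frame.groupby(frame.index.date)["value"]
sizes = by_day.size()
means = by_day.mean()

print(sizes)
print(means)
```

With this grouping, November 4th holds 25 rows, as expected for the long day.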

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.8.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.125-linuxkit
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.24.2
pytest: 3.3.2
pip: None
setuptools: 40.6.3
Cython: 0.29.6
numpy: 1.15.4
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2016.6.1
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@jreback
Contributor

jreback commented Mar 17, 2019

you have a pretty old pytz

try updating as this should work correctly in 0.24.2

@mroeschke
Member

I am getting the same result on master.

I think the problem is that, since '1D' means one calendar day here, it's ambiguous which of these two results is correct:

In [2]: groups
Out[2]:
                              0
2018-11-03 00:00:00-04:00   7.5
2018-11-04 00:00:00-04:00  28.0
2018-11-05 00:00:00-05:00  44.5

In [2]: groups
Out[2]:
                              0
2018-11-03 00:00:00-04:00   8.0
2018-11-04 00:00:00-05:00  28.5
2018-11-05 00:00:00-05:00  44.5

We probably need to expose the ambiguous (and nonexistent) keywords somehow in the resample API for this corner case.
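For context, this is roughly what the `ambiguous` keyword already does on `Timestamp.tz_localize`. The sketch below only shows that existing keyword choosing between the two candidate midnight labels; it does not show a resample or Grouper parameter, since whether and how one should be exposed is the open question here.

```python
import pandas as pd

naive_midnight = pd.Timestamp("2018-11-04 00:00")

# ambiguous=True treats the wall-clock time as daylight time (first midnight).
first_midnight = naive_midnight.tz_localize("America/Havana", ambiguous=True)

# ambiguous=False treats it as standard time (second midnight, an hour later).
second_midnight = naive_midnight.tz_localize("America/Havana", ambiguous=False)

print(first_midnight)   # label used in the first result above
print(second_midnight)  # label used in the second result above
```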

WillAyd added the Timezones (Timezone data dtype) and Bug labels on Mar 22, 2019
@pierremoulinier
Author

pierremoulinier commented Mar 23, 2019

Thanks for the follow-up, and sorry about the late reply.

@mroeschke I don't know if we need to expose these new keywords in the API, as I don't think the meaning of freq='1D' is ambiguous here.

The "calendar day" in Cuba on this date starts at 2018-11-04 00:00:00-04:00 (and this day lasts 25 hours). This means that of the two possible behaviours you suggest, only the first seems correct to me.

If for some reason someone wants to aggregate the data between 2018-11-04 00:00:00-04:00 and 2018-11-04 00:00:00-05:00 with the data from the previous day (November 3rd), they would have to write a custom grouper (but I would be curious to know why exactly someone would want to do this as those 60 minutes unambiguously belong to November 4th in Cuba).
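As a rough idea of what such a custom grouping could look like, here is one possible sketch: convert the index to a fixed UTC-05:00 offset (Cuban standard time) and group on the date in that offset, so the repeated hour lands on November 3rd. The fixed-offset approach and the names `cuba_standard` and `value` are assumptions made for the example, not something pandas provides for this purpose.

```python
from datetime import timedelta, timezone

import pandas as pd

index = pd.date_range(
    "2018-11-03 12:00",
    "2018-11-05 12:00",
    freq=pd.Timedelta(hours=1),
    tz="UTC",
).tz_convert("America/Havana")
frame = pd.DataFrame({"value": range(len(index))}, index=index)

# Fixed offset with no DST rules, so every day is exactly 24 hours long.
cuba_standard = timezone(timedelta(hours=-5))

# Under this offset, 2018-11-04 00:00-04:00 reads as 2018-11-03 23:00, so the
# first of the two midnight hours is counted with the previous day.
standard_dates = frame.index.tz_convert(cuba_standard).date
shifted_means = frame.groupby(standard_dates)["value"].mean()

print(shifted_means)
```

This reproduces the means of the second candidate result (8.0, 28.5, 44.5), which is why that result reads to me as a fixed-offset grouping rather than a calendar-day one.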
