Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ERR: validate partial string indexing with tz-aware end-points #16785

Closed
1kastner opened this issue Jun 28, 2017 · 12 comments

Comments

@1kastner
Copy link

commented Jun 28, 2017

Code Sample, a copy-pastable example if possible

import pandas as pd
import dateutil.parser

df = pd.DataFrame(index=pd.date_range('2016-01-01T00:00', '2016-12-31T23:59', freq='T'))

df[
    "2016-01-01T00:00-02:00"
    :
    "2016-01-01T02:03"
]  # returns 124 entries

df[
    dateutil.parser.parse("2016-01-01T00:00-02:00")
    :
    dateutil.parser.parse("2016-01-01T02:03")
]  # returns 4 entries

df[
    pd.Timestamp("2016-01-01T00:00-02:00")
    :
    pd.Timestamp("2016-01-01T02:03")
]  # returns 4 entries

Problem description

The current behavior is that the time zone information is ignored without any warning when providing strings.

Expected Output

Either use the time zone information and only return the four desired entries OR warn the user about the fact that the time zone information is not used

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.0.final.0
python-bits: 32
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.20.1
pytest: None
pip: 9.0.1
setuptools: 28.8.0
Cython: None
numpy: 1.12.1
scipy: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
pandas_gbq: None
pandas_datareader: None

@1kastner 1kastner changed the title pandas DatetimeIndex not time zone sensitive DatetimeIndex not time zone sensitive Jun 28, 2017

@jreback

This comment has been minimized.

Copy link
Contributor

commented Jun 28, 2017

Your end-point is naive (FYI dateutil.parser is not used anywhere in pandas except as a very last resort as it is slow).

In [8]: pd.Timestamp("2016-01-01T00:00-02:00")
Out[8]: Timestamp('2016-01-01 00:00:00-0200', tz='pytz.FixedOffset(-120)')

In [9]: pd.Timestamp("2016-01-01T02:03")
Out[9]: Timestamp('2016-01-01 02:03:00')

In [12]: dateutil.parser.parse("2016-01-01T02:03")
Out[12]: datetime.datetime(2016, 1, 1, 2, 3)

In [13]: dateutil.parser.parse("2016-01-01T00:00-02:00")
Out[13]: datetime.datetime(2016, 1, 1, 0, 0, tzinfo=tzoffset(None, -7200))

Full working example

In [4]: idx = pd.date_range('2016-01-01T00:00', '2016-12-31T23:59', freq='T')

In [5]: df = pd.DataFrame({'A': range(len(idx))}, index=idx)

In [6]: df
Out[6]: 
                          A
2016-01-01 00:00:00       0
2016-01-01 00:01:00       1
2016-01-01 00:02:00       2
2016-01-01 00:03:00       3
2016-01-01 00:04:00       4
2016-01-01 00:05:00       5
...                     ...
2016-12-31 23:54:00  527034
2016-12-31 23:55:00  527035
2016-12-31 23:56:00  527036
2016-12-31 23:57:00  527037
2016-12-31 23:58:00  527038
2016-12-31 23:59:00  527039

[527040 rows x 1 columns]

In [7]: df["2016-01-01T00:00-02:00":"2016-01-02T00:00-02:00"]
Out[7]: 
                        A
2016-01-01 00:00:00     0
2016-01-01 00:01:00     1
2016-01-01 00:02:00     2
2016-01-01 00:03:00     3
2016-01-01 00:04:00     4
2016-01-01 00:05:00     5
...                   ...
2016-01-01 23:55:00  1435
2016-01-01 23:56:00  1436
2016-01-01 23:57:00  1437
2016-01-01 23:58:00  1438
2016-01-01 23:59:00  1439
2016-01-02 00:00:00  1440

[1441 rows x 1 columns]

@1kastner if you are proposing that we raise if the partial-string indexing with a tz must match on both end-points (if given), I would agree.

If you would like to submit a PR would be great. Just step thru to see where the conversions happens.

@jreback jreback added this to the 0.21.0 milestone Jun 28, 2017

@jreback jreback changed the title DatetimeIndex not time zone sensitive ERR: validate partial string indexing with tz-aware end-points Jun 28, 2017

@1kastner

This comment has been minimized.

Copy link
Author

commented Jun 28, 2017

Thank you very much for this explanation! I will try to find a good spot to insert the proposed exception.

@jreback

This comment has been minimized.

Copy link
Contributor

commented Sep 23, 2017

@1kastner want to do a PR for this?

@jreback jreback modified the milestones: 0.21.0, Next Major Release Sep 23, 2017

@1kastner

This comment has been minimized.

Copy link
Author

commented Sep 24, 2017

After 17th Oct. (probably a few days earlier) I will have some free time and I'd be interested to solve this. If you prefer to have it solved by somebody else before this date, that'd be no deal for me.

@jreback

This comment has been minimized.

Copy link
Contributor

commented Sep 24, 2017

@1kastner totally fine.

@1kastner

This comment has been minimized.

Copy link
Author

commented Oct 19, 2017

So the easiest solution seems now to invoke pd.Timestamp("timestamp-string") on the respective endpoint. I do not agree that the timezones should match. Because in some countries they have daylight saving time, so it is completely normal for several countries to change their timezone twice a year.

Because of that it is so critical (when not working with UTC) that the timezone information is not ignored.

@1kastner

This comment has been minimized.

Copy link
Author

commented Oct 19, 2017

Well, in datetimes.py the parsed offset is totally ignored in the method _parsed_string_to_bounds, for my case lines 1292-1296. Instead, the assumed timezone defaults to the timezone of the DatetimeIndex. As said before, because of daylight saving time (also called "summer time" in German), this does not necessarily make sense. You should be free to choose whatever timezone you need for your project, changing with different contexts. I can understand that if no timezone is provided, it defaults to the timezone of the DatetimeIndex timezone instead of UTC. But if a timezone is provided, neglecting it here is not nice.

@bkmrkr

This comment has been minimized.

Copy link

commented Oct 19, 2017

@1kastner

This comment has been minimized.

Copy link
Author

commented Oct 19, 2017

@bkmrkr Can't you unsubscribe yourself?

@1kastner

This comment has been minimized.

Copy link
Author

commented Oct 19, 2017

@jreback Now I struggle with Visual Studio... It might take some some time to properly check everything. But the main idea is to introduce the following change:

    def _parsed_string_to_bounds(self, reso, parsed):
        # ...
        if parsed.tzinfo is None:
            target_tz = self.tz
        else:
            target_tz = parsed.tzinfo

        if reso == 'year':
            return (Timestamp(datetime(parsed.year, 1, 1), tz=target_tz),
                    Timestamp(datetime(parsed.year, 12, 31, 23,
                                       59, 59, 999999), tz=target_tz))
    #...

Of course throughout all resolutions self.tz needs to be replaced with target_tz.

@jreback jreback modified the milestones: Next Major Release, 0.22.0 Nov 16, 2017

@jreback jreback modified the milestones: 0.23.0, Next Major Release Apr 14, 2018

@1kastner

This comment has been minimized.

Copy link
Author

commented Dec 27, 2018

At #24076 it seems like something has changed.
I will soon re-check whether this issue still exists or whether it can be closed.

@1kastner

This comment has been minimized.

Copy link
Author

commented Jan 1, 2019

I locally built the version 0.24.0.dev0+1448.gb9284a2.dirty and the problem continued to exist.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.