New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DatetimeIndex union fails in 0.19rc1 when constructed from differences in DatetimeIndexes #14323

Closed
Liam3851 opened this Issue Sep 29, 2016 · 6 comments

Comments

Projects
None yet
3 participants
@Liam3851
Contributor

Liam3851 commented Sep 29, 2016

When constructing a union of 2 DatetimeIndex objects that themselves were constructed from differences from a third DatetimeIndex, the union operator is ignored. This appears to be new behavior in 0.19rc1; the code functioned correctly under 0.18.1.

x = pd.DatetimeIndex(['20160923', '20160924'])                                                                                                                                                 
y = pd.DatetimeIndex(['20160923', '20160925'])                                             
z = pd.date_range("20160921", "20160930")                                                  
a = z.difference(x)                                                                        
b = z.difference(y)                                                                        

In [59]: a.union(b)                                                                                 
Out[59]:                                                                                            
DatetimeIndex(['2016-09-21', '2016-09-22', '2016-09-25', '2016-09-26',                              
               '2016-09-27', '2016-09-28', '2016-09-29', '2016-09-30'],                             
              dtype='datetime64[ns]', freq='D')                                                     

In [60]: a.union(b).difference(a)                                                                   
Out[60]: Index([], dtype='object')

#### Expected Output

# a.union(b) should contain 2016-09-24
# ('2016-09-24' exists in b, since it was in z and not in y,
#  and thus '2016-09-24' should be in a.union(b)).

In [59]: a.union(b)                                                                                 
Out[59]:                                                                                            
DatetimeIndex(['2016-09-21', '2016-09-22', '2016-09-24', '2016-09-25',                              
               '2016-09-26', '2016-09-27', '2016-09-28', '2016-09-29',                              
               '2016-09-30'],                                                                       
              dtype='datetime64[ns]', freq=None, tz=None)

In [60]: a.union(b).difference(a)                                                                   
Out[60]: DatetimeIndex(['2016-09-24'], dtype='datetime64[ns]', freq=None, tz=None)       

Output of pd.show_versions()

## INSTALLED VERSIONS

commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 62 Stepping 4, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.19.0rc1+0.g497a3bc.dirty
nose: 1.3.7
pip: 8.1.2
setuptools: 27.2.0
Cython: 0.24.1
numpy: 1.11.1
scipy: 0.18.1
statsmodels: 0.6.1
xarray: 0.8.2
IPython: 5.1.0
sphinx: 1.4.6
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.1
html5lib: None
httplib2: 0.9.2
apiclient: 1.4.2
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.42.0
pandas_datareader: 0.2.1

@jorisvandenbossche

This comment has been minimized.

Show comment
Hide comment
@jorisvandenbossche

jorisvandenbossche Sep 29, 2016

Member

@Liam3851 Thanks for the report! I can confirm it is indeed a bug/regression.

Seems it has something to do with offset being now defined:

In [36]: pd.__version__
Out[36]: '0.19.0rc1+34.gc128626'

In [37]: a
Out[37]: 
DatetimeIndex(['2016-09-21', '2016-09-22', '2016-09-25', '2016-09-26',
               '2016-09-27', '2016-09-28', '2016-09-29', '2016-09-30'],
              dtype='datetime64[ns]', freq='D')

In [38]: a.offset
Out[38]: <Day>

vs

In [12]: pd.__version__
Out[12]: '0.18.1'

In [13]: a
Out[13]: 
DatetimeIndex(['2016-09-21', '2016-09-22', '2016-09-25', '2016-09-26',
               '2016-09-27', '2016-09-28', '2016-09-29', '2016-09-30'],
              dtype='datetime64[ns]', freq=None)

In [15]: a.offset is None
Out[15]: True
Member

jorisvandenbossche commented Sep 29, 2016

@Liam3851 Thanks for the report! I can confirm it is indeed a bug/regression.

Seems it has something to do with offset being now defined:

In [36]: pd.__version__
Out[36]: '0.19.0rc1+34.gc128626'

In [37]: a
Out[37]: 
DatetimeIndex(['2016-09-21', '2016-09-22', '2016-09-25', '2016-09-26',
               '2016-09-27', '2016-09-28', '2016-09-29', '2016-09-30'],
              dtype='datetime64[ns]', freq='D')

In [38]: a.offset
Out[38]: <Day>

vs

In [12]: pd.__version__
Out[12]: '0.18.1'

In [13]: a
Out[13]: 
DatetimeIndex(['2016-09-21', '2016-09-22', '2016-09-25', '2016-09-26',
               '2016-09-27', '2016-09-28', '2016-09-29', '2016-09-30'],
              dtype='datetime64[ns]', freq=None)

In [15]: a.offset is None
Out[15]: True
@Liam3851

This comment has been minimized.

Show comment
Hide comment
@Liam3851

Liam3851 Oct 3, 2016

Contributor

It looks to me like this was introduced with #13514. Index.difference now returns a shallow copy of the original index with the differenced values:

return this._shallow_copy(the_diff, name=result_name)

Before, this code returned a new DatetimeIndex:

return Index(theDiff, name=result_name)

I'm not too familiar with this code, but it looks to me like _shallow_copy is copying all the attributes, including the freq (which is no longer valid after the differencing operation). The Index constructor version would have re-computed the frequency during the construction of the new Index (thus determining there was no valid frequency).

Contributor

Liam3851 commented Oct 3, 2016

It looks to me like this was introduced with #13514. Index.difference now returns a shallow copy of the original index with the differenced values:

return this._shallow_copy(the_diff, name=result_name)

Before, this code returned a new DatetimeIndex:

return Index(theDiff, name=result_name)

I'm not too familiar with this code, but it looks to me like _shallow_copy is copying all the attributes, including the freq (which is no longer valid after the differencing operation). The Index constructor version would have re-computed the frequency during the construction of the new Index (thus determining there was no valid frequency).

@jorisvandenbossche

This comment has been minimized.

Show comment
Hide comment
@jorisvandenbossche

jorisvandenbossche Oct 4, 2016

Member

@Liam3851 I think that is a perfect assessment of the situation. We could use _simple_new instead of _shallow_copy (which does not retain the attributes), but then eg the timezone of a DatetimeIndex would get lost.

@sinhrks do you think of a generic approach without adding code to specifically invalidate the freq in case of a DatetimeIndex

Member

jorisvandenbossche commented Oct 4, 2016

@Liam3851 I think that is a perfect assessment of the situation. We could use _simple_new instead of _shallow_copy (which does not retain the attributes), but then eg the timezone of a DatetimeIndex would get lost.

@sinhrks do you think of a generic approach without adding code to specifically invalidate the freq in case of a DatetimeIndex

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Oct 4, 2016

Contributor

you can just pass freq=None to _shallow_copy
don't directly use _simple_new

Contributor

jreback commented Oct 4, 2016

you can just pass freq=None to _shallow_copy
don't directly use _simple_new

@jorisvandenbossche

This comment has been minimized.

Show comment
Hide comment
@jorisvandenbossche

jorisvandenbossche Oct 4, 2016

Member

Ah, I thought that would not work for PeriodIndex (which wants to keep its freq), but apparently it does.

@Liam3851 Wants to do a PR for this change?

Member

jorisvandenbossche commented Oct 4, 2016

Ah, I thought that would not work for PeriodIndex (which wants to keep its freq), but apparently it does.

@Liam3851 Wants to do a PR for this change?

@Liam3851

This comment has been minimized.

Show comment
Hide comment
@Liam3851

Liam3851 Oct 4, 2016

Contributor

I'm doing a PR (sorry, still a newb at Git and pandas building). Just want to get the requirements straight for the unit tests:

  1. DatetimeIndex and TimedeltaIndex should be None freq after the differencing, not an inferred frequency (it appears this was the old behavior, counter to what I said above).
  2. PeriodIndex should retain its frequency after differencing, even without the missing datapoints (because the Periods themselves have frequency)
  3. All three should have a union of differences matching a union of indexes.

That about right?

Contributor

Liam3851 commented Oct 4, 2016

I'm doing a PR (sorry, still a newb at Git and pandas building). Just want to get the requirements straight for the unit tests:

  1. DatetimeIndex and TimedeltaIndex should be None freq after the differencing, not an inferred frequency (it appears this was the old behavior, counter to what I said above).
  2. PeriodIndex should retain its frequency after differencing, even without the missing datapoints (because the Periods themselves have frequency)
  3. All three should have a union of differences matching a union of indexes.

That about right?

@jreback jreback closed this in bee90a7 Oct 24, 2016

jorisvandenbossche added a commit to jorisvandenbossche/pandas that referenced this issue Nov 2, 2016

[Backport 14346] BUG: GH14323 Union of differences from DatetimeIndex…
… incorrect

closes #14323

Sets freq to None when doing a difference operation on a DatetimeIndex
or TimedeltaIndex, rather than retaining the frequency (which can
cause  problems with downstream operations). Frequency of PeriodIndex
is retained.

Author: David Krych <davidk@ciphercap.com>

Closes #14346 from Liam3851/dtind_diff_14323 and squashes the following commits:

1dbf582 [David Krych] BUG: GH14323 Union of differences from DatetimeIndex incorrect

(cherry picked from commit bee90a7)

amolkahat added a commit to amolkahat/pandas that referenced this issue Nov 26, 2016

BUG: GH14323 Union of differences from DatetimeIndex incorrect
closes #14323

Sets freq to None when doing a difference operation on a DatetimeIndex
or TimedeltaIndex, rather than retaining the frequency (which can
cause  problems with downstream operations). Frequency of PeriodIndex
is retained.

Author: David Krych <davidk@ciphercap.com>

Closes #14346 from Liam3851/dtind_diff_14323 and squashes the following commits:

1dbf582 [David Krych] BUG: GH14323 Union of differences from DatetimeIndex incorrect
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment