DatetimeIndex union fails in 0.19rc1 when constructed from differences in DatetimeIndexes #14323

Closed
Liam3851 opened this Issue Sep 29, 2016 · 6 comments

Comments

Projects
None yet
3 participants

When constructing a union of 2 DatetimeIndex objects that themselves were constructed from differences from a third DatetimeIndex, the union operator is ignored. This appears to be new behavior in 0.19rc1; the code functioned correctly under 0.18.1.

x = pd.DatetimeIndex(['20160923', '20160924'])                                                                                                                                                 
y = pd.DatetimeIndex(['20160923', '20160925'])                                             
z = pd.date_range("20160921", "20160930")                                                  
a = z.difference(x)                                                                        
b = z.difference(y)                                                                        

In [59]: a.union(b)                                                                                 
Out[59]:                                                                                            
DatetimeIndex(['2016-09-21', '2016-09-22', '2016-09-25', '2016-09-26',                              
               '2016-09-27', '2016-09-28', '2016-09-29', '2016-09-30'],                             
              dtype='datetime64[ns]', freq='D')                                                     

In [60]: a.union(b).difference(a)                                                                   
Out[60]: Index([], dtype='object')

#### Expected Output

# a.union(b) should contain 2016-09-24
# ('2016-09-24' exists in b, since it was in z and not in y,
#  and thus '2016-09-24' should be in a.union(b)).

In [59]: a.union(b)                                                                                 
Out[59]:                                                                                            
DatetimeIndex(['2016-09-21', '2016-09-22', '2016-09-24', '2016-09-25',                              
               '2016-09-26', '2016-09-27', '2016-09-28', '2016-09-29',                              
               '2016-09-30'],                                                                       
              dtype='datetime64[ns]', freq=None, tz=None)

In [60]: a.union(b).difference(a)                                                                   
Out[60]: DatetimeIndex(['2016-09-24'], dtype='datetime64[ns]', freq=None, tz=None)       

Output of pd.show_versions()

## INSTALLED VERSIONS

commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 62 Stepping 4, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.19.0rc1+0.g497a3bc.dirty
nose: 1.3.7
pip: 8.1.2
setuptools: 27.2.0
Cython: 0.24.1
numpy: 1.11.1
scipy: 0.18.1
statsmodels: 0.6.1
xarray: 0.8.2
IPython: 5.1.0
sphinx: 1.4.6
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.1
html5lib: None
httplib2: 0.9.2
apiclient: 1.4.2
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.42.0
pandas_datareader: 0.2.1

@Liam3851 Thanks for the report! I can confirm it is indeed a bug/regression.

Seems it has something to do with offset being now defined:

In [36]: pd.__version__
Out[36]: '0.19.0rc1+34.gc128626'

In [37]: a
Out[37]: 
DatetimeIndex(['2016-09-21', '2016-09-22', '2016-09-25', '2016-09-26',
               '2016-09-27', '2016-09-28', '2016-09-29', '2016-09-30'],
              dtype='datetime64[ns]', freq='D')

In [38]: a.offset
Out[38]: <Day>

vs

In [12]: pd.__version__
Out[12]: '0.18.1'

In [13]: a
Out[13]: 
DatetimeIndex(['2016-09-21', '2016-09-22', '2016-09-25', '2016-09-26',
               '2016-09-27', '2016-09-28', '2016-09-29', '2016-09-30'],
              dtype='datetime64[ns]', freq=None)

In [15]: a.offset is None
Out[15]: True

@jorisvandenbossche jorisvandenbossche modified the milestone: 0.19.0, 0.19.1 Sep 29, 2016

Liam3851 commented Oct 3, 2016

It looks to me like this was introduced with #13514. Index.difference now returns a shallow copy of the original index with the differenced values:

return this._shallow_copy(the_diff, name=result_name)

Before, this code returned a new DatetimeIndex:

return Index(theDiff, name=result_name)

I'm not too familiar with this code, but it looks to me like _shallow_copy is copying all the attributes, including the freq (which is no longer valid after the differencing operation). The Index constructor version would have re-computed the frequency during the construction of the new Index (thus determining there was no valid frequency).

@Liam3851 I think that is a perfect assessment of the situation. We could use _simple_new instead of _shallow_copy (which does not retain the attributes), but then eg the timezone of a DatetimeIndex would get lost.

@sinhrks do you think of a generic approach without adding code to specifically invalidate the freq in case of a DatetimeIndex

Contributor

jreback commented Oct 4, 2016

you can just pass freq=None to _shallow_copy
don't directly use _simple_new

Ah, I thought that would not work for PeriodIndex (which wants to keep its freq), but apparently it does.

@Liam3851 Wants to do a PR for this change?

Liam3851 commented Oct 4, 2016

I'm doing a PR (sorry, still a newb at Git and pandas building). Just want to get the requirements straight for the unit tests:

  1. DatetimeIndex and TimedeltaIndex should be None freq after the differencing, not an inferred frequency (it appears this was the old behavior, counter to what I said above).
  2. PeriodIndex should retain its frequency after differencing, even without the missing datapoints (because the Periods themselves have frequency)
  3. All three should have a union of differences matching a union of indexes.

That about right?

jreback closed this in bee90a7 Oct 24, 2016

@jorisvandenbossche jorisvandenbossche added a commit to jorisvandenbossche/pandas that referenced this issue Nov 2, 2016

@jorisvandenbossche David Krych + jorisvandenbossche [Backport 14346] BUG: GH14323 Union of differences from DatetimeIndex…
… incorrect


closes #14323

Sets freq to None when doing a difference operation on a DatetimeIndex
or TimedeltaIndex, rather than retaining the frequency (which can
cause  problems with downstream operations). Frequency of PeriodIndex
is retained.

Author: David Krych <davidk@ciphercap.com>

Closes #14346 from Liam3851/dtind_diff_14323 and squashes the following commits:

1dbf582 [David Krych] BUG: GH14323 Union of differences from DatetimeIndex incorrect

(cherry picked from commit bee90a7)
6400cdd

@amolkahat amolkahat added a commit to amolkahat/pandas that referenced this issue Nov 26, 2016

@amolkahat David Krych + amolkahat BUG: GH14323 Union of differences from DatetimeIndex incorrect
closes #14323

Sets freq to None when doing a difference operation on a DatetimeIndex
or TimedeltaIndex, rather than retaining the frequency (which can
cause  problems with downstream operations). Frequency of PeriodIndex
is retained.

Author: David Krych <davidk@ciphercap.com>

Closes #14346 from Liam3851/dtind_diff_14323 and squashes the following commits:

1dbf582 [David Krych] BUG: GH14323 Union of differences from DatetimeIndex incorrect
3b71c0d
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment