BUG: Index.difference and Index.intersection don't preserve the Index type for some Index subclasses in corner cases #20040

Closed
Dr-Irv opened this issue Mar 7, 2018 · 4 comments · Fixed by #20062
Labels: Algos, Bug, Dtype Conversions

Dr-Irv commented Mar 7, 2018

Code Sample, a copy-pastable example if possible

import pandas as pd

pi1 = pd.PeriodIndex(start='2000', end='2010', freq='A')
print(pi1.difference(pi1), pi1.intersection(pi1.drop(pi1)))

ci = pd.CategoricalIndex(['a','b','c'], categories=['a','b','c'])
print(ci.difference(ci), ci.intersection(ci.drop(ci)))

ri = pd.RangeIndex(start=1, stop=5)
print(ri.difference(ri), ri.intersection(ri.drop(ri)))

Problem description

For several Index subclasses, taking the difference of an index with itself returns a plain Index rather than an empty index of the same subclass, so the type of the subclass is not preserved.

From a set algebra point of view, for a set S, S.difference(S) should equal S.intersection(nullset).

The output from the above is:

Index([], dtype='object') PeriodIndex([], dtype='period[A-DEC]', freq='A-DEC')
Index([], dtype='object') CategoricalIndex([], categories=['a', 'b', 'c'], ordered=False, dtype='category')
Index([], dtype='object') Int64Index([], dtype='int64')

There is some discussion in pull request #19849, where I discovered this bug, but at the request of @jreback, I have split this into a separate issue.

Expected Output

PeriodIndex([], dtype='period[A-DEC]', freq='A-DEC') PeriodIndex([], dtype='period[A-DEC]', freq='A-DEC')
CategoricalIndex([], categories=['a', 'b', 'c'], ordered=False, dtype='category') CategoricalIndex([], categories=['a', 'b', 'c'], ordered=False, dtype='category')
RangeIndex(start=0, stop=0, step=1) RangeIndex(start=0, stop=0, step=1)

Note that for RangeIndex, the result of the intersection operation is also incorrect: it is an Int64Index rather than a RangeIndex.
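
The expectation can be phrased as a small check: both S.difference(S) and S.intersection(nullset) should be empty and have the same type as S. The sketch below is only an illustration. The helper name preserves_type_when_empty is made up here and is not part of pandas. It relies only on the difference and intersection methods, so it is demonstrated on a builtin frozenset; passing a pandas index together with idx.drop(idx) as the empty operand is assumed to work the same way.

```python
def preserves_type_when_empty(s, nullset):
    """Return True if s.difference(s) and s.intersection(nullset) are both
    empty and both have exactly the same type as s.

    `s` can be any object with set-like `difference` and `intersection`
    methods; `nullset` is an empty object of the same kind.
    """
    diff = s.difference(s)
    inter = s.intersection(nullset)
    both_empty = len(diff) == 0 and len(inter) == 0
    same_type = type(diff) is type(s) and type(inter) is type(s)
    return both_empty and same_type


# Builtin frozensets satisfy the check.
print(preserves_type_when_empty(frozenset({1, 2, 3}), frozenset()))  # True
```

Given the output reported above, the same check would return False for pi1, ci, and ri on pandas 0.22.0, because difference returns a plain Index in all three cases.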

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.22.0
pytest: 3.3.2
pip: 9.0.1
setuptools: 38.4.0
Cython: 0.27.3
numpy: 1.14.0
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.6.6
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.1.2
openpyxl: 2.4.10
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.1
pymysql: 0.7.11.None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Dr-Irv commented Mar 7, 2018

I'm willing to work on this, but can we have a discussion on the implementation? The solution suggested in the discussion in #19849 is to use self._shallow_copy([]), but that method doesn't work correctly for empty indexes. I think it is easier to add a method that creates an empty index while preserving the other properties of the index (e.g., categories for CategoricalIndex, step for RangeIndex, freq for PeriodIndex).

Alternatively, I can make self._shallow_copy([]) work for the various Index subclasses with an empty list argument.
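
To make the second option concrete, here is a toy model in plain Python. It is not the pandas implementation: the classes ToyIndex and ToyCategoricalIndex and the method _rebuild are invented for illustration, with _rebuild playing the role that _shallow_copy would play. The point is only that if every subclass rebuilds results through a hook that carries its own metadata, an empty result keeps the subclass type for free.

```python
class ToyIndex:
    """Toy stand-in for a generic index; not the pandas implementation."""

    def __init__(self, values):
        self.values = list(values)

    def _rebuild(self, values):
        # Hypothetical hook: build a new index of the same class
        # from `values`, which may be an empty list.
        return type(self)(values)

    def difference(self, other):
        exclude = set(other.values)
        kept = [v for v in self.values if v not in exclude]
        # Always go through the hook, even when `kept` is empty,
        # so the result never falls back to the base class.
        return self._rebuild(kept)


class ToyCategoricalIndex(ToyIndex):
    """Toy subclass that carries extra metadata (its categories)."""

    def __init__(self, values, categories):
        super().__init__(values)
        self.categories = list(categories)

    def _rebuild(self, values):
        # Carry the categories over to the new index.
        return type(self)(values, categories=self.categories)


ci = ToyCategoricalIndex(["a", "b", "c"], categories=["a", "b", "c"])
empty = ci.difference(ci)
print(type(empty).__name__, empty.values, empty.categories)
# ToyCategoricalIndex [] ['a', 'b', 'c']
```

The first option, a dedicated method for creating an empty index, would amount to a separate hook that is called only when the result is empty; the toy above folds both cases into one code path.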

gfyoung added the Dtype Conversions, Algos, and Bug labels on Mar 8, 2018

gfyoung commented Mar 8, 2018

@Dr-Irv: That seems like a good first attempt to patch this, though other options are welcome of course.

Dr-Irv commented Mar 8, 2018

@gfyoung By "That seems", do you mean having a method to create an empty index, or fixing _shallow_copy([])?

gfyoung commented Mar 8, 2018

Oh, sorry! I was referring to fixing _shallow_copy([]).
