qcut: Using cut with IntervalIndex provided by qcut producing wrong NaN values #17284

Closed
prcastro opened this Issue Aug 18, 2017 · 10 comments

@prcastro
Contributor

prcastro commented Aug 18, 2017

xref #17282

Code Sample

>>> x.isnull().sum()
0

>>> x.value_counts()
0.000000     693
12.561725      1
13.568112      1
12.521249      1
13.007628      1
6.993961       1
14.815512      1
6.017280       1
12.944714      1
Name: 0, dtype: int64

>>> categorized = pd.qcut(x, 10, duplicates='drop')
>>> categorized.isnull().sum()
0

>>> categorized.cat.categories  # Notice how all values of x are contained in the only interval
IntervalIndex([(-0.001, 14.816]]
              closed='right',
              dtype='interval[float64]')

>>> res = pd.cut(x, categorized.cat.categories)
>>> res.isnull().sum()
701

Copy pastable

import pandas as pd

x = pd.read_csv('x.csv', header=None).iloc[:, 0]  # x.csv is provided in a comment below
categorized = pd.qcut(x, 10, duplicates='drop')
res = pd.cut(x, categorized.cat.categories)
res.isnull().sum()

Problem description

When I use qcut to get the IntervalIndex corresponding to the quantiles of a float64 series, and then use it as the bins for cut on the same series, it doesn't work. cut produces a new series with many NaN values, even though the original series contains no NaN and all of its values fall within the single interval of the IntervalIndex.

Expected Output

cut should produce the same result as qcut, with no NaN values, but it does not.
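
To make the expectation concrete, here is a small pure-Python check (a sketch, not pandas code) that counts how many values fall outside a right-closed interval. The helper name `count_outside` and the `sample` data are hypothetical stand-ins; the interval edges are the ones printed in the code sample above.

```python
def count_outside(values, left, right):
    """Count values that fall outside the right-closed interval (left, right]."""
    return sum(1 for v in values if not (left < v <= right))


# Hypothetical stand-in data: mostly zeros plus a few positive values.
sample = [0.0] * 20 + [6.017, 12.521, 14.815]

# Every value lies inside (-0.001, 14.816], so nothing should be reported missing.
missing = count_outside(sample, -0.001, 14.816)
print(missing)
```

If this count is 0 for the real data, then every NaN produced by cut comes from the lookup step rather than from the data itself.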

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-514.26.2.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.1
pytest: 3.0.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 5.3.0
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.3.0
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

@gfyoung gfyoung added the Reshaping label Aug 18, 2017

@prcastro

Contributor

prcastro commented Aug 18, 2017

To be clear: Even if the feature in #17282 is implemented, this one would persist.

@jreback

Contributor

jreback commented Aug 19, 2017

you would have to show your original data input. it looks like object dtype.

@jreback

Contributor

jreback commented Aug 22, 2017

@prcastro pls show a complete copy-pastable example in the top section (just update). This should be a minimal example which shows the problem.

@jreback jreback added the Can't Repro label Aug 22, 2017

@prcastro

Contributor

prcastro commented Aug 22, 2017

Updated

@jreback

Contributor

jreback commented Aug 24, 2017

@prcastro your example in the top is not copy-pastable. This needs to be a minimal example.

@prcastro

Contributor

prcastro commented Aug 28, 2017

@jreback Is this considered a minimal example?

import pandas as pd
x = pd.Series([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.9997242948235283, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 4.554204550787334, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.688134638736401, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 4.200711854240529, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.5832425505088623, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0647107369924282, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.373043556642607, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.5822156198526636, 0.0, 0.0, 0.0, 0.0, 1.672944473242426, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.5384474167160302, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.667706820558076, 0.0, 0.0, 4.034952986707273, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.422144328051685, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
categorized = pd.qcut(x, 10, duplicates='drop')
res = pd.cut(x, categorized.cat.categories)
res.isnull().sum()

@jreback

Contributor

jreback commented Aug 29, 2017

@prcastro yes, thank you. I am not exactly sure what is going on here. If you could debug and see where it's going wrong, that would be great.

@jreback jreback added Bug and removed Can't Repro labels Aug 29, 2017

@prcastro

Contributor

prcastro commented Aug 30, 2017

@jreback I traced the problem to IntervalIndex.get_indexer method. If I do:

import pandas as pd
x = pd.Series([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.9997242948235283, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 4.554204550787334, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.688134638736401, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 4.200711854240529, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.5832425505088623, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0647107369924282, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.373043556642607, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.5822156198526636, 0.0, 0.0, 0.0, 0.0, 1.672944473242426, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.5384474167160302, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.667706820558076, 0.0, 0.0, 4.034952986707273, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.422144328051685, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
categorized = pd.qcut(x, 10, duplicates='drop')
bins = categorized.cat.categories
bins.get_indexer(x)

It wrongly returns a lot of -1 values (which are later converted to NaN, causing the original problem). After execution enters the if statement here, the value in the stop variable seems to be wrong.

I took some time to investigate why this was happening. What I found was that when computing stop in the method IntervalIndex._searchsorted_monotonic, the program does not enter the else clause here. If it does, everything seems to work fine.

However, I don't understand get_indexer well enough to pin down the cause of this bug, and I don't know why, or even whether, it should enter the else clause in IntervalIndex._searchsorted_monotonic that I pointed to above.
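
To illustrate why -1 entries from the indexer surface as NaN in the final result, here is a pure-Python sketch of the conversion step. It is an assumption-laden illustration, not pandas' actual code path; the helper name `codes_to_labels` and the example label are hypothetical.

```python
import math


def codes_to_labels(codes, categories):
    """Map integer codes to category labels; -1 is the 'not found' sentinel."""
    return [categories[c] if c >= 0 else math.nan for c in codes]


# One category (a single interval, written as a string for illustration only).
labels = codes_to_labels([0, -1, 0], ["(-0.001, 4.554]"])

# Count the positions that ended up missing.
n_missing = sum(1 for lab in labels if isinstance(lab, float) and math.isnan(lab))
print(n_missing)
```

Under this reading, every spurious -1 from get_indexer becomes one spurious NaN in the output of cut.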

@jschendel

Member

jschendel commented Aug 30, 2017

To distill this down a bit further, I think the root of the issue occurs when an IntervalIndex only has a single element:

In [2]: idx1 = pd.IntervalIndex.from_tuples([(0, 5)])

In [3]: idx1.contains(3)
Out[3]: False

In [4]: idx1.get_indexer([3])
Out[4]: array([-1], dtype=int64)

The same operations work when there are two elements:

In [5]: idx2 = pd.IntervalIndex.from_tuples([(0, 5), (5, 10)])

In [6]: idx2.contains(3)
Out[6]: True

In [7]: idx2.get_indexer([3])
Out[7]: array([0], dtype=int64)

A quick fix could be to write a special case in IntervalIndex._searchsorted_monotonic for when there's only a single element. It would probably be beneficial to examine the logic as a whole, though, to see if it could be made more robust.
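
For reference, here is a pure-Python sketch of the result one would expect from a lookup over sorted, non-overlapping, right-closed intervals, whether there is one interval or several. It uses the standard-library `bisect` module and is not the pandas implementation; the function name `reference_get_indexer` is hypothetical.

```python
from bisect import bisect_left


def reference_get_indexer(intervals, values):
    """For each value, return the position of the interval (left, right]
    that contains it, or -1 if no interval does.

    Assumes `intervals` is sorted and non-overlapping.
    """
    rights = [right for _, right in intervals]
    result = []
    for value in values:
        # The first interval whose right edge is >= value is the only candidate.
        pos = bisect_left(rights, value)
        if pos < len(intervals) and intervals[pos][0] < value:
            result.append(pos)
        else:
            result.append(-1)
    return result


single = reference_get_indexer([(0, 5)], [3])
double = reference_get_indexer([(0, 5), (5, 10)], [3])
print(single, double)
```

Both calls return position 0 for the value 3, which is what get_indexer should also report for idx1 and idx2 above. The sketch needs no special case for a single interval because the candidate position is bounds-checked before the left edge is compared.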

@shoyer

Member

shoyer commented Aug 30, 2017

This sounds related to the IntervalIndex indexing issues being tracked down in #16316 and #16386.
