New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error with pivot_table and categorical data when add dropna args in version 0.23 #21133

Closed
diegogarcilazo opened this Issue May 19, 2018 · 4 comments

Comments

Projects
None yet
4 participants
@diegogarcilazo

diegogarcilazo commented May 19, 2018

pd.__version__
'0.23.0'

labels = ["< 1 mes","1 a 11 meses","1 a 5 años","6 a 11 años","12 a 17 años","18 a + años","alli"]

df = pd.DataFrame(
    {'a': pd.cut(range(0,50), bins = [0,5,10,20,30,40,50,60], labels = labels, right = True),
     'b':(' '.join(['One'] * 20) + ' ' +' '.join(['Two'] * 30)).split(), 'c': range(10,60)})

df.pivot_table(index='a', aggfunc = 'count', fill_value = 0\
            , margins = True, margins_name = 'Total', values = 'c',dropna = False)

output
               c
a               
< 1 mes        5
1 a 11 meses   5
1 a 5 años    10
6 a 11 años   10
12 a 17 años  10
18 a + años    9
alli           0
Total         50


df.pivot_table(index='a', aggfunc = 'count', fill_value = 0\
            , margins = True, margins_name = 'Total', values = 'c',dropna = True)

output

               c
a               
NaN            5
< 1 mes        5
1 a 11 meses  10
1 a 5 años    10
6 a 11 años   10
12 a 17 años   9
18 a + años    0
Total         49

pd.__version__
'0.22.0'

df.pivot_table(index='a', aggfunc = 'count', fill_value = 0\
            , margins = True, margins_name = 'Total', values = 'c',dropna = False)

               c
a               
< 1 mes        5
1 a 11 meses   5
1 a 5 años    10
6 a 11 años   10
12 a 17 años  10
18 a + años    9
alli           0
Total         50

df.pivot_table(index='a', aggfunc = 'count', fill_value = 0\
            , margins = True, margins_name = 'Total', values = 'c',dropna = True)

               c
a               
< 1 mes        5
1 a 11 meses   5
1 a 5 años    10
6 a 11 años   10
12 a 17 años  10
18 a + años    9
alli           0
Total         49

When entering the argument dropna = False with the function pivot_table the value and order of the categories is correct. But when dropna = True both the order and the value of the categories do not perform as expected, add a NaN category and drop last category. With other dataset w/o NaN values dropna = True causes categories lose coherence with the values.

The same goes for the pd.crosstab () function

INSTALLED VERSIONS ------------------ commit: None python: 3.6.5.final.0 python-bits: 64 OS: Windows OS-release: 10 machine: AMD64 processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel byteorder: little LC_ALL: None LANG: None LOCALE: Spanish_Argentina.1252

pandas: 0.23.0
pytest: 3.5.1
pip: 10.0.1
setuptools: 39.1.0
Cython: 0.28.2
numpy: 1.14.2
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: 1.7.4
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.3
numexpr: 2.6.5
feather: 0.3.1
matplotlib: 2.2.2
openpyxl: 2.5.3
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.4
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.7
pymysql: None
psycopg2: 2.7.1 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@diegogarcilazo diegogarcilazo changed the title from Pandas error with pivot_table and categorical data when add dropna args in version 0.23 to Error with pivot_table and categorical data when add dropna args in version 0.23 May 19, 2018

@gfyoung

This comment has been minimized.

Member

gfyoung commented May 21, 2018

@diegogarcilazo : Thanks for reporting this! Did this behavior work in a previous version of pandas?

@jreback

This comment has been minimized.

Contributor

jreback commented May 21, 2018

@diegogarcilazo pls edit the top to show the output

@diegogarcilazo

This comment has been minimized.

diegogarcilazo commented May 21, 2018

Thanks you for pay attention. @jreback I just added the outputs. @gfyoung In version 0.22 this behavior does not appear I just added the output with 0.22 version too.

@jorisvandenbossche

This comment has been minimized.

Member

jorisvandenbossche commented May 24, 2018

@diegogarcilazo Thanks for the example. I made a simplified version that also illustrates the regression:

In [59]: df = pd.DataFrame({'A': pd.Categorical([np.nan, 'low', 'high', 'low', 'high'], categories=['low', 'high'], ordered=True), 
    ...:                    'B': range(5)})

In [60]: df.pivot_table(index='A', values='B', dropna=False)
Out[60]: 
      B
A      
low   2
high  3

In [61]: df.pivot_table(index='A', values='B', dropna=True)
Out[61]: 
        B
A        
NaN   2.0
low   3.0
high  NaN

For that last one, the correct output on 0.22:

In [23]: df.pivot_table(index='A', values='B', dropna=True)
Out[23]: 
      B
A      
low   2
high  3

@jreback jreback self-assigned this May 25, 2018

jreback added a commit to jreback/pandas that referenced this issue May 29, 2018

jreback added a commit to jreback/pandas that referenced this issue May 31, 2018

jreback added a commit to jreback/pandas that referenced this issue Jun 7, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment