Inconsistent behaviour when calling apply() on a categorical column with missing data #20714

Closed
mojones opened this issue Apr 16, 2018 · 12 comments · Fixed by #25095

@mojones
Contributor

mojones commented Apr 16, 2018

Code Sample, a copy-pastable example if possible

>>> import pandas as pd
>>> import numpy as np
>>> s1 = pd.Series(['1-1','1-1',np.NaN], dtype='category')
>>> s1.apply(lambda x: x.split('-')[0])
0      1
1      1
2    NaN
dtype: category
Categories (1, object): [1]
>>> s2 = pd.Series(['1-1','1-2',np.NaN], dtype='category')
>>> s2.apply(lambda x: x.split('-')[0])
0    1
1    1
2    1
dtype: object

Problem description

In the above code, s1 shows the expected behaviour. We are trying to transform a categorical series by getting the part before the hyphen, and for rows where the original value is NaN the output is also NaN.

The series s2 shows the unexpected behaviour - note only a single change to the original series, the middle value has changed from '1-1' to '1-2'. The third value, which was NaN in the original series now becomes '1' in the output rather than staying as NaN. Also, the dtype of the result series is now object rather than category. It looks like maybe the NaN is somehow getting the applied value of the previous row.

Expected Output

0      1
1      1
2    NaN

Output of pd.show_versions()

pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.13.0-38-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 38.4.0
Cython: None
numpy: 1.14.0
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.1.1
openpyxl: None
xlrd: None
xlwt: 1.3.0
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@chris-b1
Contributor

Thanks for the report. This is partially a symptom of, or at least related to, #15706, but that is more of an API issue; this is an actual bug.

If the result of mapping the function over the categories isn't unique, we take from the mapped categories using the codes. However, we are using np.take, which wraps around the -1 code used for missing values; we should use our take_1d instead.

return np.take(new_categories, self._codes)
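The wrap-around described above can be reproduced outside pandas. The following is a minimal sketch, assuming the mapped categories and codes that s2 from the report would produce; `take_with_missing` is a hypothetical helper written for illustration and is not pandas' take_1d.

```python
import numpy as np

# Result of applying the function to s2's categories ['1-1', '1-2'].
new_categories = np.array(['1', '1'], dtype=object)
# Integer codes for s2's rows; -1 is the sentinel used for a missing value.
codes = np.array([0, 1, -1])

# np.take treats -1 as "last element", so the missing row picks up '1'.
wrapped = np.take(new_categories, codes)


def take_with_missing(values, codes, fill_value=np.nan):
    """Hypothetical helper: like np.take, but -1 codes become fill_value."""
    codes = np.asarray(codes)
    result = np.empty(len(codes), dtype=object)
    mask = codes == -1
    # Index only with the valid codes, then fill the masked positions.
    result[~mask] = values[codes[~mask]]
    result[mask] = fill_value
    return result


fixed = take_with_missing(new_categories, codes)
print(wrapped)
print(fixed)
```

The first array has '1' in the last position, matching the buggy output in the report; the second keeps the missing value in place.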

@chris-b1 chris-b1 added this to the Next Major Release milestone Apr 16, 2018
@ladydata

ladydata commented Apr 18, 2018

Hey @chris-b1 can I work on this bug? I will aim to have it done before the next major release.
cc: @geoninja

@chris-b1
Contributor

chris-b1 commented Apr 18, 2018 via email

@nprad
Contributor

nprad commented May 13, 2018

Hello @chris-b1. Is @ladydata still working on this issue? If not, can I take this issue up?

@ladydata

Hi @nprad, I'm participating in a sprint tomorrow and I plan to work on this issue. Please don't be discouraged: there are more than 2000 issues waiting to be worked on, and you will surely find something else that you're interested in. Also, if for any reason I'm unable to work on this issue specifically, I will ping you, but as I said, it has been in my plans.

@ladydata

ladydata commented May 16, 2018

Hello @chris-b1, I currently have the result below. Should s2 be of the category type without casting? If so, any suggestion on the best way to approach this?

>>> import pandas as pd
>>> import numpy as np

>>> s1 = pd.Series(['1-1','1-1',np.NaN], dtype='category')
>>> s1.apply(lambda x: x.split('-')[0])
0      1
1      1
2    NaN
dtype: category
Categories (1, object): [1]

>>> s2 = pd.Series(['1-1','1-2',np.NaN], dtype='category')
>>> s2.apply(lambda x: x.split('-')[0])
0      1
1      1
2    NaN
dtype: object

>>> s2.apply(lambda x: x.split('-')[0]).astype('category')
0      1
1      1
2    NaN
dtype: category
Categories (1, object): [1]

@chris-b1
Contributor

Yes, that's consistent with the current API - there's certainly an argument for changing it (#15706), but to just fix the bug that behavior is fine.

In [4]: s1 = pd.Series(['1-1','1-2'], dtype='category')

In [5]: s1.apply(lambda x: x.split('-')[0])
Out[5]:
0    1
1    1
dtype: object
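The dtype difference between the two cases follows from the function being mapped over the categories rather than over every row, with a fallback when the mapped categories are no longer unique. Here is a rough sketch of that branching in plain Python; `map_categorical` and its return shape are hypothetical and only illustrate the idea, not pandas' actual implementation.

```python
def map_categorical(categories, codes, func):
    """Hypothetical sketch: map func over categories, not over every row."""
    new_categories = [func(c) for c in categories]
    if len(set(new_categories)) == len(new_categories):
        # Mapped categories are still unique: the codes can be reused as-is.
        return "category", new_categories, list(codes)
    # Duplicates: expand to plain values, keeping -1 codes as missing (None).
    values = [None if code == -1 else new_categories[code] for code in codes]
    return "object", values, None


def first_part(x):
    return x.split('-')[0]


# One category, so the mapped categories stay unique.
s1_result = map_categorical(['1-1'], [0, 0, -1], first_part)
# Two categories that both map to '1', so the result falls back to values.
s2_result = map_categorical(['1-1', '1-2'], [0, 1, -1], first_part)
print(s1_result)
print(s2_result)
```

In this sketch the missing row stays missing in both branches, which is the behaviour the fix aims for, while the second case still loses its categorical dtype.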

@ladydata

@chris-b1 great! I will write the tests based on that. I will also take a look at the older issue you mentioned.

@manuhortet

Hey @ladydata, did you write those tests? If not, I'm happy to take on the task.

@ladydata

Hey @manuhortet, yes, I did write a couple of tests but haven't completed the process. I will set a deadline to finish by the end of the month, and if it's not done by then, from August 1st it's all yours. Sound good?

@manuhortet

Hi again @ladydata, should I take this already? 😄

@ladydata

ladydata commented Aug 1, 2018

@manuhortet sure, go ahead!
