str.split on np.nan gives np.nan in one column but None in another column #18450

Closed
JeroenDelcour opened this Issue Nov 23, 2017 · 3 comments

JeroenDelcour commented Nov 23, 2017

import pandas as pd
import numpy as np

s = pd.Series(['19HT|C2', np.nan, '20ZT|C1'])
print(s)
0    19HT|C2
1        NaN
2    20ZT|C1
dtype: object
s_split = s.str.split('|', expand=True)
print(s_split)
      0     1
0  19HT    C2
1   NaN  None
2  20ZT    C1
print(s_split.dtypes)
0    object
1    object
dtype: object
print(type(s_split.loc[1, 0]))
<class 'float'>
print(type(s_split.loc[1, 1]))
<class 'NoneType'>

Problem description

When np.nan gets split, it becomes np.nan (of type float) in the first column but None (of type NoneType) in the second column. I'd consider this unexpected behavior. How come splitting a value of one type results in two values of different types?

Expected Output

      0     1
0  19HT    C2
1   NaN   NaN
2  20ZT    C1

Either np.nan or None in both columns, but not a mix of both. I'd say np.nan makes most sense, since that's the original value of the row.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Linux
OS-release: 4.10.0-40-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.21.0
pytest: 3.0.5
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.13.1
scipy: 0.19.1
pyarrow: 0.7.1
xarray: None
IPython: 5.1.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: 0.4.0
matplotlib: 2.0.2
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 4.1.1
bs4: 4.5.3
html5lib: 0.9999999
sqlalchemy: 1.1.5
pymysql: None
psycopg2: None
jinja2: 2.9.4
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Member

WillAyd commented Nov 24, 2017

Just to be clear, the issue here is how expand=True handles NaN values. Even without NaN in the input, expansion still mixes str and None instances whenever the split character is missing from one of the records. For example,

s = pd.Series(['19HT|C2', np.nan, '20ZT|C1', 'foo'])
s.str.split("|", expand=True)

yields

      0     1
0  19HT    C2
1   NaN  None
2  20ZT    C1
3   foo  None

The str_split docstring is a little ambiguous because it says it should propagate NaN values, but I think that refers to the value returned irrespective of the expansion mechanism. I'll leave it to others here to comment on whether this is a bug in how expansion works or whether the docstring should be modified. The relevant docstring line:

pattern, propagating NA values. Equivalent to :meth:`str.split`.
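
To illustrate the distinction drawn above (a small sketch, not part of the original comment): without expand=True the missing row is expected to stay a float NaN, and a record lacking the separator becomes a one-element list, so None should only show up once the lists are padded out into columns. The explicit dtype=object is an assumption made here to match the "dtype: object" shown in the report.

```python
import numpy as np
import pandas as pd

# dtype=object is set explicitly to mirror the object-dtype Series in the report
s = pd.Series(['19HT|C2', np.nan, '20ZT|C1', 'foo'], dtype=object)

# Without expand=True, each string becomes a list of parts;
# the NaN row is passed through unchanged rather than split.
unexpanded = s.str.split('|')

# Print the Python type and value of each element for inspection
for value in unexpanded:
    print(type(value).__name__, value)
```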

Contributor

jreback commented Nov 24, 2017

This is a bug; we want to use np.nan as the missing value indicator.

Contributor

louis-red commented Jun 27, 2018

FYI, this behavior (pd.Series.str.split with expand=True pads the expanded columns with None) is still present in my version of pandas: 0.23.1
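
For readers on an affected version, one possible workaround is to normalize the missing cells after splitting. This is only a sketch and was not proposed in the thread: it assumes that replacing every missing cell (None or NaN) with np.nan via DataFrame.where is acceptable for your data, and the names normalized and missing_types are just illustrative.

```python
import numpy as np
import pandas as pd

# Same data as the example above; dtype=object mirrors the original report
s = pd.Series(['19HT|C2', np.nan, '20ZT|C1', 'foo'], dtype=object)
s_split = s.str.split('|', expand=True)

# Keep cells that hold a value; overwrite every missing cell
# (whether it is None or NaN) with np.nan.
normalized = s_split.where(s_split.notna(), np.nan)

# Collect the Python types of the remaining missing cells for inspection
missing_types = {
    type(v).__name__ for v in normalized.to_numpy().ravel() if pd.isna(v)
}
print(normalized)
print(missing_types)
```

The non-missing strings are left untouched, so this only changes which sentinel marks a missing cell.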
