BUG: dtype=str in 0.23.0 converts NaN to 'n' #22477

stephanievela · 2018-08-23T02:50:15Z

>>> import pandas as pd
>>> series = pd.Series(index=range(5), dtype=str)
>>> series
0    n
1    n
2    n
3    n
4    n
dtype: object
>>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Darwin
OS-release: 14.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.0
pytest: 3.0.5
pip: 9.0.1
setuptools: 40.0.0
Cython: 0.25.2
numpy: 1.11.3
scipy: 0.18.1
pyarrow: None
xarray: None
IPython: 5.1.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: 3.3.0
numexpr: 2.6.1
feather: None
matplotlib: 2.0.0
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.2
bs4: 4.5.3
html5lib: None
sqlalchemy: 1.1.5
pymysql: None
psycopg2: None
jinja2: 2.9.4
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Before I upgraded pandas to 0.23.0, pd.Series(dtype=str, index=range(5)) gave me a series filled with NaN values. Since the upgrade, dtype=str turns the null values into the lowercase letter 'n'. For comparison, here is the expected NaN-filled output from pandas 0.22.0:

>>> import pandas as pd
>>> series = pd.Series(index=range(5), dtype=str)
>>> series
0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
dtype: object
>>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Darwin
OS-release: 14.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: 3.0.5
pip: 9.0.1
setuptools: 40.0.0
Cython: 0.25.2
numpy: 1.11.3
scipy: 0.18.1
pyarrow: None
xarray: None
IPython: 5.1.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: 3.3.0
numexpr: 2.6.1
feather: None
matplotlib: 2.0.0
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.2
bs4: 4.5.3
html5lib: None
sqlalchemy: 1.1.5
pymysql: None
psycopg2: None
jinja2: 2.9.4
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
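
For readers who only need an all-NaN object Series, a possible workaround is sketched below. It assumes that passing NaN explicitly with dtype=object avoids the dtype=str code path described above; that assumption is not confirmed anywhere in this thread and was not checked against 0.23.0.

```python
import numpy as np
import pandas as pd

# Assumption: an explicit NaN fill value with dtype=object sidesteps the
# dtype=str handling reported in this issue (not verified on pandas 0.23.0).
series = pd.Series(np.nan, index=range(5), dtype=object)
print(series)
```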

jorisvandenbossche · 2018-08-23T08:50:47Z

Thanks for the report! That certainly seems like a regression.

I seem to remember we had some issues about it last year, but can't directly find it.

cc @toobaz you worked a bit on the series constructor and related things, not sure if this is related

abrakababra · 2018-08-24T16:56:24Z

Hello, is this related or a different issue?

>>> import numpy as np
>>> import pandas as pd
>>> x = np.array(['x', 'y'])
>>> pd.Series(x).str.decode(encoding='UTF-8', errors='strict')
0   NaN
1   NaN
dtype: float64

I think this prevents me from loading HDF files that were written with earlier pandas versions. When accessing certain HDF nodes I get a ValueError that traces back to this line in pytables.py:

data = Series(data).str.decode(encoding, errors=errors).values

It raises because the original "data" (a NumPy ndarray of objects) was something like x above, but Series.str.decode turns every element into NaN.
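
The output above is consistent with decode only being meaningful for bytes elements, so str elements presumably come back as NaN. A minimal sketch of one way around that is shown below; `decode_bytes_only` is a hypothetical helper for illustration, not part of pandas or pytables.py.

```python
import pandas as pd


def decode_bytes_only(values, encoding="UTF-8", errors="strict"):
    """Hypothetical helper: decode bytes elements, pass str elements through."""
    decoded = [
        v.decode(encoding, errors) if isinstance(v, bytes) else v
        for v in values
    ]
    # Keep an object Series so the result does not depend on dtype inference.
    return pd.Series(decoded, dtype=object)


print(decode_bytes_only([b"x", "y"]).tolist())  # ['x', 'y']
```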

kokes · 2018-08-27T14:22:49Z

git bisect reveals this as the first bad commit: c8fcfcb

The bisect script did just this:

import pandas
assert pandas.Series(index=range(5), dtype=str).isnull().iloc[0] == True
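
The same check can be wrapped so that it reports an exit code, which is the form `git bisect run` expects (0 for a good commit, 1 for a bad one). This is only a sketch: the function names are hypothetical, and rebuilding pandas at each bisect step is not shown.

```python
import pandas as pd


def nan_is_preserved(n: int = 5) -> bool:
    """Return True if a data-less dtype=str Series is filled with nulls."""
    series = pd.Series(index=range(n), dtype=str)
    return bool(series.isnull().all())


def bisect_exit_code() -> int:
    """Map the check to an exit code: 0 marks a good commit, 1 a bad one."""
    return 0 if nan_is_preserved() else 1
```

A wrapper script could finish with `sys.exit(bisect_exit_code())` and be handed to `git bisect run`.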

TomAugspurger · 2018-08-27T14:40:05Z

@abrakababra that looks different.

@kokes thanks (and sorry). I think the issue is that dtype should be sanitized to be a dtype pandas expects by the time we get here. Right now it's '<U' and it should just be object.
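
To illustrate the idea of sanitizing the dtype, here is a minimal sketch. `sanitize_dtype` is a hypothetical helper, not the pandas internal function, and the last lines reflect an assumption about how the 'n' arises (NaN being cast into a one-character unicode array), not a confirmed trace of the pandas code path.

```python
import numpy as np


def sanitize_dtype(dtype):
    """Hypothetical helper: coerce string-like NumPy dtypes to object."""
    if dtype is None:
        return None
    dtype = np.dtype(dtype)
    # "U" is NumPy's unicode kind and "S" its bytes kind.
    if dtype.kind in ("U", "S"):
        return np.dtype(object)
    return dtype


print(np.dtype(str))        # the raw NumPy dtype that str maps to
print(sanitize_dtype(str))  # object

# Assumption about the mechanism: casting NaN into a one-character unicode
# array keeps only the first character of "nan".
print(np.array([np.nan]).astype("<U1"))
```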

toobaz · 2018-08-28T06:23:40Z

> cc @toobaz you worked a bit on the series constructor and related things, not sure if this is related

Maybe... I'm quite busy until mid-September, but for the moment I think this is the issue you were looking for: #9428. It is extremely similar, so the fix might be too.

Nikoleta-v3 · 2018-09-01T10:26:35Z

Hey everyone!
I am currently at a sprint at EuroSciPy 2018. I worked with @jorisvandenbossche on this and will open a PR to close the issue 👍

Add a check so that if the dtype is str, it will create an empty array of type object and then pass the values. Add a test for an empty series, to check that it fills the series with NaN and not with 'n'. Also add a test for cases where no string values are given.

More specifically, the cases that seem to have an issue are when:
- the series is empty
- it's a single-element series

Closes #22477
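
The two cases listed above can be sketched as plain checks. These are illustrative only and are not copied from the PR; the function names are hypothetical.

```python
import pandas as pd


def check_no_data_string_dtype():
    # With no data and dtype=str, the value should be null, not the letter 'n'.
    result = pd.Series(index=[1], dtype=str)
    assert pd.isna(result.iloc[0])


def check_single_element_string_dtype(item):
    # A single scalar should be stored as its string form.
    result = pd.Series(item, index=[1], dtype=str)
    assert result.iloc[0] == str(item)


check_no_data_string_dtype()
for item in ["entry", 13]:
    check_single_element_string_dtype(item)
```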

jorisvandenbossche added the Regression label (functionality that used to work in a prior pandas version) Aug 23, 2018

jorisvandenbossche added this to the 0.23.5 milestone Aug 23, 2018

Nikoleta-v3 mentioned this issue Sep 1, 2018

BUG: Fix (22477) dtype=str converts NaN to 'n' #22564 (Merged)

jreback modified the milestones: 0.23.5, 0.24.0 Oct 23, 2018

jorisvandenbossche closed this as completed in #22564 Nov 20, 2018

jorisvandenbossche pushed a commit that referenced this issue Nov 20, 2018

BUG: Fix dtype=str converts NaN to 'n' (#22564)

f0b2ff3
