BUG: dtype=str in 0.23.0 converts NaN to 'n' #22477

Closed
stephanievela opened this issue Aug 23, 2018 · 6 comments

@stephanievela
commented Aug 23, 2018

>>> import pandas as pd
>>> series = pd.Series(index=range(5), dtype=str)
>>> series
0    n
1    n
2    n
3    n
4    n
dtype: object
>>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Darwin
OS-release: 14.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.0
pytest: 3.0.5
pip: 9.0.1
setuptools: 40.0.0
Cython: 0.25.2
numpy: 1.11.3
scipy: 0.18.1
pyarrow: None
xarray: None
IPython: 5.1.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: 3.3.0
numexpr: 2.6.1
feather: None
matplotlib: 2.0.0
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.2
bs4: 4.5.3
html5lib: None
sqlalchemy: 1.1.5
pymysql: None
psycopg2: None
jinja2: 2.9.4
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Before I upgraded pandas to 0.23.0, pd.Series(dtype=str, index=range(5)) gave me a Series filled with NaN values. Since the upgrade, dtype=str turns those null values into the lowercase letter 'n'. For comparison, the expected NaN-filled output from pandas 0.22.0 is appended below.

>>> import pandas as pd
>>> series = pd.Series(index=range(5), dtype=str)
>>> series
0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
dtype: object
>>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Darwin
OS-release: 14.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: 3.0.5
pip: 9.0.1
setuptools: 40.0.0
Cython: 0.25.2
numpy: 1.11.3
scipy: 0.18.1
pyarrow: None
xarray: None
IPython: 5.1.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: 3.3.0
numexpr: 2.6.1
feather: None
matplotlib: 2.0.0
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.2
bs4: 4.5.3
html5lib: None
sqlalchemy: 1.1.5
pymysql: None
psycopg2: None
jinja2: 2.9.4
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
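
A possible stopgap on an affected version, not something proposed in the report itself: request object dtype explicitly. This is a minimal sketch that assumes the goal is only a NaN-filled placeholder Series that will hold strings later.

```python
import pandas as pd

# Assumption: explicitly asking for object dtype avoids the str-dtype path
# that produces 'n'. The missing values stay as NaN.
series = pd.Series(index=range(5), dtype=object)

print(series.dtype)
print(series.isna().all())
```

The printed dtype matches what the 0.22.0 session above reports (object), so downstream code that checks for missing values with isna() should behave the same way.
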
@jorisvandenbossche

Member

commented Aug 23, 2018

Thanks for the report! That certainly seems like a regression.

I seem to remember we had some issues about it last year, but can't directly find it.

cc @toobaz you worked a bit on the series constructor and related things, not sure if this is related

@jorisvandenbossche jorisvandenbossche added this to the 0.23.5 milestone Aug 23, 2018

@abrakababra

commented Aug 24, 2018

Hello, is this related or a different issue?

import numpy as np
import pandas as pd

x = np.array(['x', 'y'])
pd.Series(x).str.decode(encoding='UTF-8', errors='strict')
0   NaN
1   NaN
dtype: float64

I think this prevents me from loading HDF files that were written with earlier pandas versions. When accessing certain HDF nodes I get a ValueError that traces back to this line in pytables.py:

data = Series(data).str.decode(encoding, errors=errors).values

It raises because the original "data" (a NumPy ndarray object) was something like x above, but Series.str.decode turns every element into NaN.

@kokes

Contributor

commented Aug 27, 2018

git bisect reveals this as the first bad commit: c8fcfcb

The bisect script did just this

import pandas
# A good commit yields NaN here; a bad commit yields the string 'n', which is not null.
assert pandas.Series(index=range(5), dtype=str).isnull().iloc[0]
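
For anyone repeating the bisect, here is a slightly more defensive variant of the same check. It is only a sketch: the function name `check` is made up for illustration, and the exit codes follow the `git bisect run` convention (0 for good, 1 for bad, 125 to skip a commit that cannot be tested).

```python
def check() -> int:
    """Return a git-bisect exit code: 0 = good, 1 = bad, 125 = skip."""
    try:
        import pandas as pd
    except Exception:
        # pandas does not import at this commit (e.g. a broken build),
        # so ask bisect to skip it rather than mark it good or bad.
        return 125

    series = pd.Series(index=range(5), dtype=str)
    # Good commits give a Series of NaN; the regression gives the string 'n'.
    return 0 if series.isnull().all() else 1


# To use this with `git bisect run`, end the script with:
#     raise SystemExit(check())
print(check())
```

Returning 125 on import failure keeps commits with an unbuildable extension from being misclassified as bad.
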
@TomAugspurger

Contributor

commented Aug 27, 2018

@abrakababra that looks different.

@kokes thanks (and sorry). I think the issue is that dtype should be sanitized to be a dtype pandas expects by the time we get here. Right now it's <U' and it should just be object.
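
A NumPy-only sketch of why an unsanitized str dtype could end up showing 'n'. This is an illustration of the point above under my own assumptions, not the actual pandas code path: a bare `str` dtype carries no length, so NumPy allocates one-character slots, and writing NaN into them keeps only the first character of 'nan'.

```python
import numpy as np

# NumPy's bare `str` dtype has no length, so a fresh array gets the
# minimal fixed width of one character.
fixed_width = np.empty(5, dtype=str)
fixed_width.fill(np.nan)  # the text 'nan' is cut down to fit one character

# An object array stores the float itself, so NaN survives intact.
as_object = np.empty(5, dtype=object)
as_object.fill(np.nan)

print(fixed_width.dtype, repr(fixed_width[0]))
print(as_object.dtype, repr(as_object[0]))
```

Converting the requested dtype to object before allocating, as suggested above, corresponds to the second array here.
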

@toobaz

Member

commented Aug 28, 2018

cc @toobaz you worked a bit on the series constructor and related things, not sure if this is related

Maybe... I'm quite busy until mid September, but for the moment I think this is the issue you were looking at: #9428 . It is extremely similar, so the fix might be too.

@Nikoleta-v3

Contributor

commented Sep 1, 2018

Hey everyone!
I am currently at a sprint at EuroSciPy 2018. I worked with @jorisvandenbossche on this and will open a PR to close the issue 👍

Nikoleta-v3 added a commit to Nikoleta-v3/pandas that referenced this issue Sep 1, 2018

Closes pandas-dev#22477
Add a check so that if the dtype is str, it will create
an empty array of type object and then pass the values.

Add a test for an empty series, to check that it fills the series
with NaN and not with 'n'.

Also add a test for cases that no string values are given.

@jreback jreback modified the milestones: 0.23.5, 0.24.0 Oct 23, 2018

Nikoleta-v3 added a commit to Nikoleta-v3/pandas that referenced this issue Nov 11, 2018

Closes pandas-dev#22477
Add a check so if the dtype is str is will create
an empty array type object and then pass the values.

Add test for an empty series. To chech that it fills the series
with NaN and not with 'n'.

Also add a test for cases that no string values are given.

jorisvandenbossche added a commit that referenced this issue Nov 20, 2018

BUG: Fix dtype=str converts NaN to 'n' (#22564)
More specifically the cases that seem to have an issue
are when:
- the series is empty
- it's a single element series

* Closes #22477

Pingviinituutti added a commit to Pingviinituutti/pandas that referenced this issue Feb 28, 2019

BUG: Fix dtype=str converts NaN to 'n' (pandas-dev#22564)
More specifically the cases that seem to have an issue
are when:
- the series is empty
- it's a single element series

* Closes pandas-dev#22477
