BUG: dtype=str in 0.23.0 converts NaN to 'n' #22477

Closed
stephanievela opened this issue Aug 23, 2018 · 6 comments · Fixed by #22564
Labels
Regression Functionality that used to work in a prior pandas version
Milestone

Comments

@stephanievela

>>> import pandas as pd
>>> series = pd.Series(index=range(5), dtype=str)
>>> series
0    n
1    n
2    n
3    n
4    n
dtype: object
>>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Darwin
OS-release: 14.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.0
pytest: 3.0.5
pip: 9.0.1
setuptools: 40.0.0
Cython: 0.25.2
numpy: 1.11.3
scipy: 0.18.1
pyarrow: None
xarray: None
IPython: 5.1.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: 3.3.0
numexpr: 2.6.1
feather: None
matplotlib: 2.0.0
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.2
bs4: 4.5.3
html5lib: None
sqlalchemy: 1.1.5
pymysql: None
psycopg2: None
jinja2: 2.9.4
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Before I upgraded pandas to 0.23.0, pd.Series(dtype=str, index=range(5)) gave me a series filled with NaN values. With 0.23.0, however, dtype=str turns the null values into the lowercase letter 'n'. For comparison, the expected all-NaN output from 0.22.0 is appended below.

>>> import pandas as pd
>>> series = pd.Series(index=range(5), dtype=str)
>>> series
0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
dtype: object
>>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Darwin
OS-release: 14.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: 3.0.5
pip: 9.0.1
setuptools: 40.0.0
Cython: 0.25.2
numpy: 1.11.3
scipy: 0.18.1
pyarrow: None
xarray: None
IPython: 5.1.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: 3.3.0
numexpr: 2.6.1
feather: None
matplotlib: 2.0.0
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.2
bs4: 4.5.3
html5lib: None
sqlalchemy: 1.1.5
pymysql: None
psycopg2: None
jinja2: 2.9.4
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
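
A possible stopgap on an affected version, sketched under the assumption that an object-dtype series is acceptable and that only the dtype=str path is broken: request dtype=object explicitly, which is also the dtype reported for the string series in both outputs above.

```python
import pandas as pd

# Assumption: asking for object dtype directly avoids the dtype=str code path.
series = pd.Series(index=range(5), dtype=object)

# Every entry should be a missing value rather than the string 'n'.
print(series)
print(series.isna().all())
```

This only sidesteps the symptom; it has not been checked against 0.23.0 here.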
@jorisvandenbossche
Member

Thanks for the report! That certainly seems like a regression.

I seem to remember we had some issues about it last year, but can't directly find it.

cc @toobaz you worked a bit on the series constructor and related things, not sure if this is related

@jorisvandenbossche jorisvandenbossche added the Regression Functionality that used to work in a prior pandas version label Aug 23, 2018
@jorisvandenbossche jorisvandenbossche added this to the 0.23.5 milestone Aug 23, 2018
@abrakababra

Hello, is this related or a different issue?

import numpy as np
import pandas as pd

x = np.array(['x','y'])
pd.Series(x).str.decode(encoding='UTF-8',errors='strict')
0   NaN
1   NaN
dtype: float64

I think this prevents me from loading HDF files that were written with earlier pandas versions. When accessing certain HDF nodes I get a ValueError that traces back to this line in pytables.py:

data = Series(data).str.decode(encoding, errors=errors).values

It's raising because the original "data" (a NumPy ndarray) was something like the x above, but Series.str.decode turns it all into NaNs.
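
For comparison, a small sketch of how Series.str.decode treats bytes versus text elements. The sample values are made up for illustration, and the printed form of the second result may differ between pandas versions.

```python
import pandas as pd

# Bytes elements have a .decode method, so they come back as text.
as_bytes = pd.Series([b"x", b"y"], dtype=object)
decoded = as_bytes.str.decode("utf-8", errors="strict")
print(decoded.tolist())

# Text elements have no .decode method; pandas reports them as missing,
# which matches the all-NaN result shown above.
as_text = pd.Series(["x", "y"], dtype=object)
print(as_text.str.decode("utf-8", errors="strict"))
```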

@kokes
Contributor

kokes commented Aug 27, 2018

git bisect reveals this as the first bad commit: c8fcfcb

The bisect script did just this:

import pandas
assert pandas.Series(index=range(5), dtype=str).isnull().iloc[0] == True
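
A slightly more defensive variant of such a check, as a sketch only. The function name bisect_exit_code is hypothetical; the exit codes follow the git bisect run convention (0 marks a commit good, 1 marks it bad, 125 asks bisect to skip it).

```python
import importlib


def bisect_exit_code():
    """Return an exit code suitable for `git bisect run` (hypothetical helper)."""
    try:
        pd = importlib.import_module("pandas")
    except Exception:
        # The build is broken at this commit, so it cannot be judged.
        return 125
    series = pd.Series(index=range(5), dtype=str)
    # Good commits fill the series with missing values, bad ones with 'n'.
    return 0 if bool(series.isnull().iloc[0]) else 1


print(bisect_exit_code())
```

To use it with git bisect run, the script would end with raise SystemExit(bisect_exit_code()) instead of the print call.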

@TomAugspurger
Contributor

@abrakababra that looks different.

@kokes thanks (and sorry). I think the issue is that dtype should be sanitized to be a dtype pandas expects by the time we get here. Right now it's <U' and it should just be object.
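
To illustrate the point about the dtype, here is a NumPy-only sketch. It shows that str maps to a fixed-width Unicode dtype, shows one plausible way a NaN ends up as 'n' in such an array (an assumption about the mechanism, not taken from the pandas source), and maps string dtypes to object with a hypothetical helper, sanitize_dtype, that is not a pandas API.

```python
import numpy as np

# NumPy maps the builtin str to a fixed-width Unicode dtype (kind "U").
print(np.dtype(str).kind)

# Casting NaN into a one-character Unicode array keeps only the first
# character of "nan", which would explain the reported 'n'.
truncated = np.array([np.nan]).astype("<U1")
print(truncated)


def sanitize_dtype(dtype):
    """Hypothetical helper: fall back to object for NumPy string dtypes."""
    dtype = np.dtype(dtype)
    if dtype.kind in ("U", "S"):
        return np.dtype(object)
    return dtype


# With object dtype the array can hold real NaN values.
empty = np.empty(5, dtype=sanitize_dtype(str))
empty.fill(np.nan)
print(empty)
```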

@toobaz
Member

toobaz commented Aug 28, 2018

> cc @toobaz you worked a bit on the series constructor and related things, not sure if this is related

Maybe... I'm quite busy until mid-September, but for the moment I think this is the issue you were looking for: #9428. It is extremely similar, so the fix might be too.

@Nikoleta-v3
Contributor

Nikoleta-v3 commented Sep 1, 2018

Hey everyone!
I am currently at a sprint at EuroSciPy 2018. I worked with @jorisvandenbossche on this and will open a PR to close the issue 👍

Nikoleta-v3 added a commit to Nikoleta-v3/pandas that referenced this issue Sep 1, 2018
Add a check so that if the dtype is str, it will create
an empty object-type array and then pass the values.

Add a test for an empty series, to check that it fills the series
with NaN and not with 'n'.

Also add a test for cases that no string values are given.
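
A test along the lines described in the commit message might look like the sketch below. The test name is made up, and this is not the test from the actual PR.

```python
import pandas as pd


def test_empty_str_series_is_filled_with_nan():
    # Hypothetical test: a str series built without data must hold NaN, not 'n'.
    result = pd.Series(index=range(5), dtype=str)
    assert result.isna().all()
    assert "n" not in result.tolist()


# Run the check directly so the sketch works without pytest.
test_empty_str_series_is_filled_with_nan()
```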
@jreback jreback modified the milestones: 0.23.5, 0.24.0 Oct 23, 2018
Nikoleta-v3 added a commit to Nikoleta-v3/pandas that referenced this issue Nov 11, 2018
Add a check so that if the dtype is str, it will create
an empty object-type array and then pass the values.

Add a test for an empty series, to check that it fills the series
with NaN and not with 'n'.

Also add a test for cases that no string values are given.
jorisvandenbossche pushed a commit that referenced this issue Nov 20, 2018
More specifically the cases that seem to have an issue
are when:
- the series is empty
- it's a single element series

* Closes #22477
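
One reading of the two cases named in this commit message, sketched as a quick check. The helper name check_single_element and the sample value "M" are assumptions for illustration.

```python
import pandas as pd


def check_single_element(dtype):
    """Hypothetical check covering a no-data series and a scalar-filled one."""
    # No data, one index label: the value should be missing.
    no_data = pd.Series(index=[1], dtype=dtype)
    # A single scalar broadcast over one index label should be kept as-is.
    scalar = pd.Series("M", index=[1], dtype=dtype)
    return bool(no_data.isna().iloc[0]), scalar.iloc[0]


print(check_single_element(str))
```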
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this issue Feb 28, 2019
More specifically the cases that seem to have an issue
are when:
- the series is empty
- it's a single element series

* Closes pandas-dev#22477