Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Fix (22477) dtype=str converts NaN to 'n' #22564

Merged
merged 19 commits into from Nov 20, 2018

Conversation

Nikoleta-v3
Copy link
Contributor

@Nikoleta-v3 Nikoleta-v3 commented Sep 1, 2018

Now:

>>> import pandas as pd
>>> result = pd.Series(index=range(5), dtype=str)
>>> result
0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
dtype: object

This is implemented by adding a check so if the dtype is str is will create an empty array type object and then pass the values. Two tests have been implemented:

  • test for an empty series. To check that it fills the series with NaN and not with 'n'. For example now:
  • test for cases that no string values are given.

@Nikoleta-v3 Nikoleta-v3 changed the title Fix issue 22477 BUG: Fix (22477) dtype=str converts NaN to 'n' Sep 1, 2018
@codecov
Copy link

codecov bot commented Sep 1, 2018

Codecov Report

Merging #22564 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #22564      +/-   ##
==========================================
+ Coverage   92.28%   92.28%   +<.01%     
==========================================
  Files         161      161              
  Lines       51457    51461       +4     
==========================================
+ Hits        47489    47493       +4     
  Misses       3968     3968
Flag Coverage Δ
#multiple 90.68% <100%> (ø) ⬆️
#single 42.31% <28.57%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/dtypes/cast.py 88.99% <100%> (+0.07%) ⬆️
pandas/core/dtypes/common.py 94.37% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a289aee...0692db0. Read the comment docs.

@Nikoleta-v3
Copy link
Contributor Author

Just saw that this failed. I am travelling tomorrow but will pick it up asap 👍

@gfyoung gfyoung added the Regression Functionality that used to work in a prior pandas version label Sep 1, 2018
@gfyoung
Copy link
Member

gfyoung commented Sep 1, 2018

@Nikoleta-v3 : Awesome! Also, don't forget the whatsnew entry!

@gfyoung gfyoung added this to the 0.23.5 milestone Sep 1, 2018
dtype = np.float64
subarr = np.empty(length, dtype=dtype)
dtype = np.dtype('float64')
if isinstance(dtype, np.dtype) and dtype.kind in ("U", "S"):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think dtype.kind in ("U", "S") can be replaced with is_string_dtype(dtype).

is_string_dtype can be found in pandas/core/dtypes/common

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure will replace asap

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can also be an elif

dtype = np.float64
subarr = np.empty(length, dtype=dtype)
dtype = np.dtype('float64')
if isinstance(dtype, np.dtype) and is_string_dtype(dtype):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the isinstance check for np.dtype needed here?

if isinstance(dtype, np.dtype) and is_string_dtype(dtype):
subarr = np.empty(length, dtype=object)
if not isna(value):
value = to_str(value)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prefer to have all of the dtype checking in the if/elif/else and then construct the subarr after

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, dtype needs to be overwritten with object because we don't want actual string dtypes (which is easy to do, just noting :-))

@@ -367,6 +367,8 @@ def is_datetime64_dtype(arr_or_dtype):
tipo = _get_dtype_type(arr_or_dtype)
except TypeError:
return False
except UnicodeEncodeError:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what hits this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

an array of a Unicode element. The problem is with numpy's dtype I opened an issue a few days ago: numpy/numpy#11860

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can catch the TypeError and UnicodeDecodeError on a single line (unless we maybe want to add a specific comment pointing to the numpy issue explaining why)

@@ -137,6 +137,26 @@ def test_constructor_no_data_index_order(self):
result = pd.Series(index=['b', 'a', 'c'])
assert result.index.tolist() == ['b', 'a', 'c']

def test_constructor_no_data_string_type(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls parametrize these tests

# GH 22477
result = pd.Series(index=[1], dtype=str)
assert result.isna().all()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check the value using iloc instead here which returns a scalar

@Nikoleta-v3
Copy link
Contributor Author

Hey everyone! Thank you very much for your comments. I haven't forgotten about this, I am currently very busy preparing for PyCon UK.

I will make sure to fix the failures and address your comments as soon as possible.

@jreback jreback modified the milestones: 0.23.5, 0.24.0 Oct 23, 2018
@jreback
Copy link
Contributor

jreback commented Nov 1, 2018

can you rebase and fixup

More specifically the cases that seem to have an issue
are when:
- the series in empty
- it's a single element series
Add a check so if the dtype is str is will create
an empty array type object and then pass the values.

Add test for an empty series. To chech that it fills the series
with NaN and not with 'n'.

Also add a test for cases that no string values are given.
To allow the developers to remember why the specific test was added
This is currently failing.
is_datetime64_dtype is trying to check the type
of unicodes but numpy does not support unicode and
this line breaks.

Add except error and return false
Test for unicode still fails for python 2
This was breaking for python 2.
The fix is to use pandas text_type to return string type
parametrize tests and use iloc to check value
@pep8speaks
Copy link

Hello @Nikoleta-v3! Thanks for updating the PR.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls add a whatsnew note as well. In the Missing section of bug fixes

dtype = np.float64
subarr = np.empty(length, dtype=dtype)
dtype = np.dtype('float64')
if isinstance(dtype, np.dtype) and dtype.kind in ("U", "S"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can also be an elif

result = pd.Series(index=[1], dtype=str)
assert np.isnan(result.iloc[0])

@pytest.mark.parametrize('item', ['13'])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need to parameterize this test (only 1 case), and you need to change the name as the next test overwrites it.

@@ -807,7 +807,7 @@ def test_constructor_corner_shape(self):

@pytest.mark.parametrize("data, index, columns, dtype, expected", [
(None, lrange(10), ['a', 'b'], object, np.object_),
(None, None, ['a', 'b'], 'int64', np.dtype('int64')),
(None, None, ['a', 'b'], 'int64', np.dtype('float64')),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @TomAugspurger this makes it pass, but this looks slightly suspect, e.g. in master

In [4]: pd.DataFrame(columns=list('ab'),dtype=int).dtypes
Out[4]: 
a    int64
b    int64
dtype: object

though we coerce almost always on assignment

In [9]: df = pd.DataFrame(columns=list('ab'),dtype=int)

In [10]: df.loc[0] = 5

In [11]: df
Out[11]: 
   a  b
0  5  5

In [12]: df.dtypes
Out[12]: 
a    int64
b    int64
dtype: object

In [13]: df = pd.DataFrame(columns=list('ab'),dtype=int)

In [14]: df.loc[0, 'a'] = 5

In [15]: df
Out[15]: 
     a   b
0  5.0 NaN

In [16]: df.dtypes
Out[16]: 
a    float64
b    float64
dtype: object

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think the DataFrame coercion might be wrong here

@jreback
Copy link
Contributor

jreback commented Nov 20, 2018

i think this is fixed now.

@Nikoleta-v3
Copy link
Contributor Author

Thank you for the commits @jreback 👍

if not isna(value):
value = to_str(value)
else:
subarr = np.empty(length, dtype=dtype)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback by putting this here in the else statement, there is no creation of subarr if you are in the first if case of if length and is_integer_dtype(dtype) and isna(value) (which seems to indicate this is not covered by the tests)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and do u gave have a case that doesn’t work?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and do u gave have a case that doesn’t work?

@jreback
Copy link
Contributor

jreback commented Nov 20, 2018

ok should be fixed up.

@jreback
Copy link
Contributor

jreback commented Nov 20, 2018

@jorisvandenbossche if you have any other issues.

@jorisvandenbossche jorisvandenbossche merged commit f0b2ff3 into pandas-dev:master Nov 20, 2018
@jorisvandenbossche
Copy link
Member

Thanks @Nikoleta-v3 and @jreback !

if you have any other issues.

I suppose that integer line is still not covered in the tests, but yeah, that's not related to what this PR was trying to fix.

thoo added a commit to thoo/pandas that referenced this pull request Nov 20, 2018
…fixed

* upstream/master:
  DOC: Removing rpy2 dependencies, and converting examples using it to regular code blocks (pandas-dev#23737)
  BUG: Fix dtype=str converts NaN to 'n' (pandas-dev#22564)
  DOC: update pandas.core.resample.Resampler.nearest docstring (pandas-dev#20381)
  REF/TST: Add more pytest idiom to parsers tests (pandas-dev#23810)
  Added support for Fraction and Number (PEP 3141) to pandas.api.types.is_scalar (pandas-dev#22952)
  DOC: Updating to_timedelta docstring (pandas-dev#23259)
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
More specifically the cases that seem to have an issue
are when:
- the series in empty
- it's a single element series

* Closes pandas-dev#22477
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
More specifically the cases that seem to have an issue
are when:
- the series in empty
- it's a single element series

* Closes pandas-dev#22477
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: dtype=str in 0.23.0 converts NaN to 'n'
6 participants