BUG: Fix (22477) dtype=str converts NaN to 'n' #22564

Nikoleta-v3 · 2018-09-01T12:08:09Z

closes BUG: dtype=str in 0.23.0 converts NaN to 'n' #22477
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

Now:

>>> import pandas as pd
>>> result = pd.Series(index=range(5), dtype=str)
>>> result
0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
dtype: object

This is implemented by adding a check: if the dtype is str, an empty array of object dtype is created and then filled with the values. Two tests have been added:

a test for an empty series, checking that it is filled with NaN and not with 'n';
a test for cases where no string values are given.
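
The 'n' in the bug report is consistent with how NumPy fills a fixed-width string array: an empty array created with dtype=str is one character wide, so NaN is stringified and truncated. A minimal sketch of that behaviour and of the object-array approach described above, using plain NumPy rather than the pandas internals (the variable names are illustrative only):

```python
import numpy as np

length = 3

# dtype=str gives a fixed-width unicode array, so filling it with NaN
# stores only the leading character of the text 'nan'.
as_str = np.empty(length, dtype=str)
as_str.fill(np.nan)

# An object array stores the float NaN itself.
as_obj = np.empty(length, dtype=object)
as_obj.fill(np.nan)

print(as_str.dtype, as_str.tolist())
print(as_obj.dtype, as_obj.tolist())
```

This is only meant to show why switching to object dtype avoids the truncation; the actual change lives in pandas/core/dtypes/cast.py.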

codecov · 2018-09-01T13:27:26Z

Codecov Report

Merging #22564 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #22564      +/-   ##
==========================================
+ Coverage   92.28%   92.28%   +<.01%     
==========================================
  Files         161      161              
  Lines       51457    51461       +4     
==========================================
+ Hits        47489    47493       +4     
  Misses       3968     3968

Flag	Coverage Δ
#multiple	`90.68% <100%> (ø)`	⬆️
#single	`42.31% <28.57%> (-0.01%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/dtypes/cast.py	`88.99% <100%> (+0.07%)`	⬆️
pandas/core/dtypes/common.py	`94.37% <100%> (ø)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a289aee...0692db0. Read the comment docs.

Nikoleta-v3 · 2018-09-01T21:38:39Z

Just saw that this failed. I am travelling tomorrow but will pick it up asap 👍

gfyoung · 2018-09-01T22:09:07Z

@Nikoleta-v3 : Awesome! Also, don't forget the whatsnew entry!

mroeschke · 2018-09-04T00:16:30Z

pandas/core/dtypes/cast.py

-            dtype = np.float64
-        subarr = np.empty(length, dtype=dtype)
+            dtype = np.dtype('float64')
+        if isinstance(dtype, np.dtype) and dtype.kind in ("U", "S"):


I think dtype.kind in ("U", "S") can be replaced with is_string_dtype(dtype).

is_string_dtype can be found in pandas/core/dtypes/common

Sure will replace asap

this can also be an elif
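
For readers unfamiliar with the kind codes in the diff: "U" and "S" are NumPy's one-character categories for unicode and byte strings. A small sketch of what the original check accepts, using a hypothetical helper named is_numpy_string_kind (it is not a pandas or NumPy function):

```python
import numpy as np


def is_numpy_string_kind(dtype):
    """Hypothetical helper mirroring the check in the diff above."""
    return isinstance(dtype, np.dtype) and dtype.kind in ("U", "S")


checks = {
    "np.dtype(str)": is_numpy_string_kind(np.dtype(str)),
    "np.dtype(bytes)": is_numpy_string_kind(np.dtype(bytes)),
    "np.dtype(object)": is_numpy_string_kind(np.dtype(object)),
    "np.dtype('float64')": is_numpy_string_kind(np.dtype("float64")),
    # The builtin class itself is not an np.dtype instance.
    "str": is_numpy_string_kind(str),
}

for name, matched in checks.items():
    print(name, matched)
```

The last entry shows why the isinstance guard matters for this particular form of the check: dtype.kind only exists on np.dtype instances.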

jreback · 2018-09-04T11:20:08Z

pandas/core/dtypes/cast.py

-            dtype = np.float64
-        subarr = np.empty(length, dtype=dtype)
+            dtype = np.dtype('float64')
+        if isinstance(dtype, np.dtype) and is_string_dtype(dtype):


is the isinstance check for np.dtype needed here?

jreback · 2018-09-04T11:21:31Z

pandas/core/dtypes/cast.py

+        if isinstance(dtype, np.dtype) and is_string_dtype(dtype):
+            subarr = np.empty(length, dtype=object)
+            if not isna(value):
+                value = to_str(value)


prefer to have all of the dtype checking in the if/elif/else and then construct the subarr after

In that case, dtype needs to be overwritten with object because we don't want actual string dtypes (which is easy to do, just noting :-))
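
A rough sketch of the structure being asked for here: all dtype decisions in one if/elif chain, with the array built once afterwards. The function fill_from_scalar and its NaN test are simplified stand-ins invented for illustration; the real code uses pandas helpers such as is_integer_dtype, isna and to_str.

```python
import numpy as np


def fill_from_scalar(value, length, dtype):
    """Hypothetical, simplified stand-in for the helper under review."""
    dtype = np.dtype(dtype)
    value_is_nan = isinstance(value, float) and np.isnan(value)

    # Step 1: decide on the dtype (and adjust the value) in one chain.
    if length and dtype.kind in ("i", "u") and value_is_nan:
        # Integer arrays cannot hold NaN, so fall back to float.
        dtype = np.dtype("float64")
    elif dtype.kind in ("U", "S"):
        # Overwrite with object so no fixed-width string array is created.
        dtype = np.dtype(object)
        if not value_is_nan:
            value = str(value)

    # Step 2: build the array exactly once, whichever branch ran.
    subarr = np.empty(length, dtype=dtype)
    subarr.fill(value)
    return subarr


print(fill_from_scalar(np.nan, 2, str))
print(fill_from_scalar(13, 2, str))
print(fill_from_scalar(np.nan, 2, "int64"))
```

Constructing subarr after the chain means no branch can forget to create it.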

jreback · 2018-09-04T11:21:42Z

pandas/core/dtypes/common.py

@@ -367,6 +367,8 @@ def is_datetime64_dtype(arr_or_dtype):
        tipo = _get_dtype_type(arr_or_dtype)
    except TypeError:
        return False
+    except UnicodeEncodeError:


what hits this?

An array with a Unicode element. The problem is with numpy's dtype; I opened an issue a few days ago: numpy/numpy#11860

You can catch the TypeError and UnicodeEncodeError on a single line (unless we want to add a specific comment pointing to the numpy issue to explain why)
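
The single-line form suggested here would look roughly like the sketch below. The function is_datetime64_kind is a hypothetical stand-in built on np.dtype, not the pandas implementation, which goes through _get_dtype_type.

```python
import numpy as np


def is_datetime64_kind(arr_or_dtype):
    """Hypothetical stand-in for a dtype check that should never raise."""
    try:
        dtype = np.dtype(arr_or_dtype)
    except (TypeError, UnicodeEncodeError):
        # UnicodeEncodeError is caught because of numpy/numpy#11860.
        return False
    return dtype.kind == "M"


print(is_datetime64_kind("datetime64[ns]"))
print(is_datetime64_kind("float64"))
print(is_datetime64_kind("not-a-dtype"))
```

Grouping both exceptions in one tuple keeps a single return False, and the comment on the except line is one place to point at the numpy issue.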

jreback · 2018-09-04T11:22:15Z

pandas/tests/series/test_constructors.py

@@ -137,6 +137,26 @@ def test_constructor_no_data_index_order(self):
        result = pd.Series(index=['b', 'a', 'c'])
        assert result.index.tolist() == ['b', 'a', 'c']

+    def test_constructor_no_data_string_type(self):


pls parametrize these tests

jreback · 2018-09-04T11:22:51Z

pandas/tests/series/test_constructors.py

+        # GH 22477
+        result = pd.Series(index=[1], dtype=str)
+        assert result.isna().all()
+


check the value using iloc instead here which returns a scalar
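
Taken together, the two requests on this file (parametrize, and check a scalar via iloc) could be sketched as follows. The test name and the choice of index values are assumptions made for illustration, not the tests that were merged.

```python
import pandas as pd
import pytest


@pytest.mark.parametrize("index", [[1], list(range(5)), ["b", "a", "c"]])
def test_no_data_str_dtype_is_nan(index):
    # GH 22477: missing values must stay missing, not become 'n'.
    result = pd.Series(index=index, dtype=str)
    for pos in range(len(result)):
        # iloc with an integer position returns a scalar.
        assert pd.isna(result.iloc[pos])
```

Checking each scalar, rather than result.isna().all(), makes a failure point at the exact position that holds an unexpected value.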

Nikoleta-v3 · 2018-09-14T18:45:35Z

Hey everyone! Thank you very much for your comments. I haven't forgotten about this, I am currently very busy preparing for PyCon UK.

I will make sure to fix the failures and address your comments as soon as possible.

jreback · 2018-11-01T01:25:09Z

can you rebase and fixup

pep8speaks · 2018-11-11T22:10:16Z

Hello @Nikoleta-v3! Thanks for updating the PR.

There are no PEP8 issues in the file pandas/core/dtypes/cast.py !
There are no PEP8 issues in the file pandas/core/dtypes/common.py !
There are no PEP8 issues in the file pandas/tests/series/test_constructors.py !

jreback

pls add a whatsnew note as well. In the Missing section of bug fixes

jreback · 2018-11-11T23:28:56Z

pandas/core/dtypes/cast.py

-            dtype = np.float64
-        subarr = np.empty(length, dtype=dtype)
+            dtype = np.dtype('float64')
+        if isinstance(dtype, np.dtype) and dtype.kind in ("U", "S"):


this can also be an elif

jreback · 2018-11-11T23:31:45Z

pandas/tests/series/test_constructors.py

+        result = pd.Series(index=[1], dtype=str)
+        assert np.isnan(result.iloc[0])
+
+    @pytest.mark.parametrize('item', ['13'])


you don't need to parameterize this test (only 1 case), and you need to change the name as the next test overwrites it.

jreback · 2018-11-19T02:05:37Z

pandas/tests/frame/test_constructors.py

@@ -807,7 +807,7 @@ def test_constructor_corner_shape(self):

    @pytest.mark.parametrize("data, index, columns, dtype, expected", [
        (None, lrange(10), ['a', 'b'], object, np.object_),
-        (None, None, ['a', 'b'], 'int64', np.dtype('int64')),
+        (None, None, ['a', 'b'], 'int64', np.dtype('float64')),


cc @TomAugspurger this makes it pass, but this looks slightly suspect, e.g. in master

In [4]: pd.DataFrame(columns=list('ab'),dtype=int).dtypes
Out[4]:
a    int64
b    int64
dtype: object

though we coerce almost always on assignment

In [9]: df = pd.DataFrame(columns=list('ab'),dtype=int)

In [10]: df.loc[0] = 5

In [11]: df
Out[11]:
   a  b
0  5  5

In [12]: df.dtypes
Out[12]:
a    int64
b    int64
dtype: object

In [13]: df = pd.DataFrame(columns=list('ab'),dtype=int)

In [14]: df.loc[0, 'a'] = 5

In [15]: df
Out[15]:
     a    b
0  5.0  NaN

In [16]: df.dtypes
Out[16]:
a    float64
b    float64
dtype: object

jreback

i think the DataFrame coercion might be wrong here

jreback · 2018-11-20T01:28:47Z

i think this is fixed now.

Nikoleta-v3 · 2018-11-20T10:16:24Z

Thank you for the commits @jreback 👍

jorisvandenbossche · 2018-11-20T10:27:33Z

pandas/core/dtypes/cast.py

+            if not isna(value):
+                value = to_str(value)
+        else:
+            subarr = np.empty(length, dtype=dtype)


@jreback by putting this in the else branch, subarr is never created when you hit the first branch, if length and is_integer_dtype(dtype) and isna(value) (which seems to indicate that this case is not covered by the tests)

and do you have a case that doesn't work?

jreback · 2018-11-20T13:17:42Z

ok should be fixed up.

jreback · 2018-11-20T14:31:52Z

@jorisvandenbossche if you have any other issues.

jorisvandenbossche · 2018-11-20T15:27:16Z

Thanks @Nikoleta-v3 and @jreback !

if you have any other issues.

I suppose that integer line is still not covered in the tests, but yeah, that's not related to what this PR was trying to fix.

…fixed * upstream/master: DOC: Removing rpy2 dependencies, and converting examples using it to regular code blocks (pandas-dev#23737) BUG: Fix dtype=str converts NaN to 'n' (pandas-dev#22564) DOC: update pandas.core.resample.Resampler.nearest docstring (pandas-dev#20381) REF/TST: Add more pytest idiom to parsers tests (pandas-dev#23810) Added support for Fraction and Number (PEP 3141) to pandas.api.types.is_scalar (pandas-dev#22952) DOC: Updating to_timedelta docstring (pandas-dev#23259)

More specifically, the cases that seem to have an issue are when: - the series is empty - it's a single-element series * Closes pandas-dev#22477

Nikoleta-v3 changed the title ~~Fix issue 22477~~ BUG: Fix (22477) dtype=str converts NaN to 'n' Sep 1, 2018

Nikoleta-v3 force-pushed the fix_issue_22477 branch from e94bebe to e4eb011 Compare September 1, 2018 13:36

gfyoung added the Regression Functionality that used to work in a prior pandas version label Sep 1, 2018

gfyoung added this to the 0.23.5 milestone Sep 1, 2018

mroeschke reviewed Sep 4, 2018

jreback requested changes Sep 4, 2018

jreback modified the milestones: 0.23.5, 0.24.0 Oct 23, 2018

Nikoleta-v3 added 11 commits November 11, 2018 20:17

tests for creating series string dtype

f069fc2

More specifically, the cases that seem to have an issue are when: - the series is empty - it's a single-element series

Closes pandas-dev#22477

062786f

Add a check so that if the dtype is str, an empty array of object dtype is created and then filled with the values. Add a test for an empty series, to check that it fills the series with NaN and not with 'n'. Also add a test for cases where no string values are given.

undo changes to series.py

a522d7f

comment issue number under test

c8667dd

To allow the developers to remember why the specific test was added

add test for strings

4717e36

add test for unicode elements: fails

bdad724

This is currently failing.

except unicode in is_datetime64_dtype

7691c82

is_datetime64_dtype tries to check the type of unicode input, but numpy does not support unicode here and this line breaks. Add an except clause for the error and return False. The test for unicode still fails on Python 2.

series with dtype accept unicode

00a7ed8

This was breaking for python 2. The fix is to use pandas text_type to return string type

fixes failure with python2

e9a290d

tweak tests as requested on pr

aa6b4a9

parametrize tests and use iloc to check value

style tweak

ee854d7

Nikoleta-v3 force-pushed the fix_issue_22477 branch from 76bbb9c to ee854d7 Compare November 11, 2018 22:10

jreback requested changes Nov 11, 2018

jreback added 2 commits November 18, 2018 17:33

Merge branch 'master' into PR_TOOL_MERGE_PR_22564

64f6e1c

fixup

fdad0c5

jreback approved these changes Nov 18, 2018

jreback added 2 commits November 18, 2018 18:25

Merge branch 'master' into PR_TOOL_MERGE_PR_22564

31021b6

fix test

9711d35

jreback reviewed Nov 19, 2018

jreback requested changes Nov 19, 2018

jreback added 2 commits November 19, 2018 20:17

Merge branch 'master' into PR_TOOL_MERGE_PR_22564

086d2b5

fixup

27701e0

jreback approved these changes Nov 20, 2018

jorisvandenbossche requested changes Nov 20, 2018

jreback added 2 commits November 20, 2018 08:12

Merge branch 'master' into PR_TOOL_MERGE_PR_22564

265f92d

fixup

0692db0

jorisvandenbossche approved these changes Nov 20, 2018

jorisvandenbossche merged commit f0b2ff3 into pandas-dev:master Nov 20, 2018

jreback mentioned this pull request Nov 21, 2018

Empty series with index and "str" dtype is initialized with values "n" instead of NaN #23838

Closed
