Construction of Series from dict containing NaN as key #18496

toobaz · 2017-11-26T01:52:58Z

closes Initialization of Series from dict disregards np.nan key #18480
closes Inconsistent dtype of category in empty Series between dict and list input #18515
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

This is also a prerequisite for fixing #18455 (which is a prerequisite for fixing #18460 ). The workaround to #18485 is annoying, but it is easy to remove it when the bug is fixed.

jreback · 2017-11-26T01:56:48Z

doc/source/whatsnew/v0.22.0.txt

@@ -205,4 +205,5 @@ Other

 - Improved error message when attempting to use a Python keyword as an identifier in a numexpr query (:issue:`18221`)
 - Fixed a bug where creating a Series from an array that contains both tz-naive and tz-aware values will result in a Series whose dtype is tz-aware instead of object (:issue:`16406`)
+- Fixed initialization of Series from dict containing NaN as key (:issue:`18480`)


intialization -> construction
use double-backticks around Series and NaN and dict

jreback · 2017-11-26T01:58:00Z

pandas/core/series.py

-                            if data else np.nan)
-
+                        remap_to_mi = False
+                        keys = maybe_mi_keys


this is WAY too complicated, likely slow. pls try to simplify. at the very least, should all be split out to a more accessible location, e.g. move to a helper function, below _sanitize_array is prob ok. call it _dict_to_array or something.

going to need quite some simplification. I am ok with throwing out some cases if that will help.

jreback · 2017-11-26T01:58:15Z

pandas/tests/series/test_constructors.py

@@ -625,6 +625,18 @@ def test_constructor_dict(self):
        expected.iloc[1] = 1
        assert_series_equal(result, expected)

+        # GH 18480 - NaN key


separate test

codecov · 2017-11-26T13:28:58Z

Codecov Report

Merging #18496 into master will increase coverage by <.01%.
The diff coverage is 93.18%.

@@            Coverage Diff             @@
##           master   #18496      +/-   ##
==========================================
+ Coverage    91.3%   91.31%   +<.01%     
==========================================
  Files         163      163              
  Lines       49781    49817      +36     
==========================================
+ Hits        45451    45488      +37     
+ Misses       4330     4329       -1

Flag	Coverage Δ
#multiple	`89.11% <93.18%> (ø)`	⬆️
#single	`40.74% <54.54%> (+0.01%)`	⬆️

Impacted Files	Coverage Δ
pandas/core/series.py	`94.66% <93.18%> (-0.11%)`	⬇️
pandas/core/indexes/datetimes.py	`95.04% <0%> (-0.48%)`	⬇️
pandas/core/indexes/datetimelike.py	`96.47% <0%> (-0.45%)`	⬇️
pandas/core/indexes/base.py	`96.34% <0%> (-0.06%)`	⬇️
pandas/plotting/_converter.py	`65.25% <0%> (+1.81%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 38f41e6...d3d9f45. Read the comment docs.

codecov · 2017-11-26T13:29:05Z

Codecov Report

Merging #18496 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #18496      +/-   ##
==========================================
- Coverage   91.44%   91.43%   -0.02%     
==========================================
  Files         157      157              
  Lines       51379    51374       -5     
==========================================
- Hits        46985    46972      -13     
- Misses       4394     4402       +8

Flag	Coverage Δ
#multiple	`89.3% <100%> (ø)`	⬆️
#single	`40.62% <72.22%> (-0.06%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/indexes/datetimes.py	`95.69% <ø> (-0.03%)`	⬇️
pandas/core/indexes/base.py	`96.46% <ø> (-0.01%)`	⬇️
pandas/core/indexes/datetimelike.py	`97.11% <ø> (+0.19%)`	⬆️
pandas/core/base.py	`96.54% <100%> (-0.01%)`	⬇️
pandas/core/series.py	`94.8% <100%> (+0.04%)`	⬆️
pandas/io/gbq.py	`25% <0%> (-58.34%)`	⬇️
pandas/core/frame.py	`97.81% <0%> (-0.1%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d74ac70...1582c42. Read the comment docs.

jreback · 2017-11-26T22:24:58Z

pandas/core/series.py

+                remap_to_mi = False
+                if data:
+                    keys, values = zip(*compat.iteritems(data))
+                    # Workaround for #18485 - part 1/3


as I said before. this giant block of code is not acceptable here. its needs to be moved to a separate function, and made much simpler.

It is smaller than before, but no problem, I'll isolate it. Any comment on the logic?

I mean: different than "made much simpler" which is everybody's dream more than a comment.

my point is the logic is so dense its not even understandable. this is too complex, no-one will be able to follow this, vet it for bugs or anything. you need to break this apart into logical abstractions, build things up into much smaller units.

@toobaz don't get me wrong, I am happy to have a fix for this, but it has to be code that is maintainable, limits special cases, and it non completely non-performant.

toobaz · 2017-11-26T23:01:47Z

@jorisvandenbossche @jreback Bringing here a conversation from Gitter: one possible way to approach this (avoiding the problem that {np.nan : 2}[float('nan')] raises KeyError) is to modify lib.fast_multiget so that it iterates the dict and then matches to specific keys, rather than directly looking for such keys. Or it could iterate the dict only if np.nan appears in the desired keys (or if the desired keys are unspecified), and follow the usual behavior otherwise.

jreback · 2017-11-26T23:02:56Z

@toobaz happy to have you modify cython code directly to avoid perf penalty (and it also abstracts away some of the complexity). inline comments a plus!

jorisvandenbossche · 2017-11-26T23:57:56Z

@toobaz sidenote: can you give the PR a bit more descriptive title?

toobaz · 2017-11-27T07:56:40Z

Since I introduced no (explicit) loops I preferred to keep code in *.py.

I could add a test at the beginning of lib.fast_multiget asserting the absence of NaN in the argument. However the case of tuples would be tricky: it's just more complicated - and not particularly useful I guess - to check if they contain NaNs, but at the same time raising an error for any tuple would be wrong.

jorisvandenbossche · 2017-11-27T08:41:34Z

pandas/core/indexes/base.py

+                locs = self.get_indexer(keys)
+                order = - np.ones(len(self), dtype=int)
+                order[locs] = np.arange(len(keys))
+                values = values[order]


Isn't this just a reindex ?

Yes, I just wanted to avoid creating a Series object, particularly so inside an Index method... but I can do it if you prefer.

Since this method is only used in Series.init, I would then try to refactor to do the reindex there, that would eliminate much of this custom code?

Yes... (that would basically be my previous version of this PR, except that I would move code from Series.__init__ to Series._init_from_dict). Notice however that if we leave Index._get_values_from_dict as it is... it is broken (doesn't work for NaNs).

But your previous PR had a lot of complex code checking for tuples etc (I think that part was the main objection), while this one does not have that any more.

OK, then if you're OK with leaving Index._get_values_from_dict as it is, I will try again.

jorisvandenbossche · 2017-11-27T08:43:48Z

pandas/tests/series/test_constructors.py

+        assert_series_equal(result, expected)
+
+        # Different NaNs:
+        d = {1: 'a', 2: 'b', float('nan'): 'c', float('nan'): 'd'}


Can you add a test with None key as well? (which already works, but would be nice to assert)

(done below)

jorisvandenbossche · 2017-11-27T08:45:20Z

pandas/tests/series/test_constructors.py

+        # Different NaNs:
+        d = {1: 'a', 2: 'b', float('nan'): 'c', float('nan'): 'd'}
+        result = Series(d).sort_values()
+        expected = Series(['a', 'b', 'c', 'd'], index=[1, 2, np.nan, np.nan])


Is the order of the expected result guaranteed like this? (I mean: I though we sort the index, but not the values?)

That's why I'm sort_values()ing just above ;-)

Ah, didn't see that :-)

jorisvandenbossche · 2017-11-27T08:45:32Z

Suggestion (but didn't think it fully through): we can make a distinction on whether index was passed or not. If index was passed, we do want to do dict-lookup (as otherwise it could potentially give a performance bottleneck) -> current code. If index is not passed, we can just get the keys, values of the full dict (as we want all those anyway, so no need to then do a dict-lookup for each key).

That doesn't solve the problem with pd.Series({np.nan: 1, 2:2}, index=[1, np.nan,]), but would already solve the majority of the cases I think?

toobaz · 2017-11-27T09:09:53Z

If index was passed, we do want to do dict-lookup (as otherwise it could potentially give a performance bottleneck) -> current code. If index is not passed, we can just get the keys, values of the full dict (as we want all those anyway, so no need to then do a dict-lookup for each key).

In my current version the lookup is not a bottleneck because it is just an iteration on keys/values, precisely what is needed anyway in the "no index" case (where, by the way, no reindexing happens because the indexes are identical()). However, still in the "no index" case, keys are being iterated twice, and I can fix this at the cost of few lines in Series.__init__.

toobaz · 2017-11-28T23:58:49Z

pandas/tests/series/test_combine_concat.py

@@ -181,7 +181,7 @@ def test_concat_empty_series_dtypes(self):
        # categorical
        assert pd.concat([Series(dtype='category'),
                          Series(dtype='category')]).dtype == 'category'
-        assert pd.concat([Series(dtype='category'),
+        assert pd.concat([Series(np.array([]), dtype='category'),


maybe add this issue number here as well

toobaz · 2017-11-29T07:11:28Z

@jreback ping

jreback

much simpler thanks, some comments.

jreback · 2017-11-29T11:51:32Z

pandas/core/series.py

@@ -303,6 +293,23 @@ def _can_hold_na(self):

    _index = None

+    def _init_from_dict(self, data, index, dtype):
+        # Looking for NaN in dict doesn't work ({np.nan : 1}[float('nan')]


can you make this a proper doc-string

Just to double check: given the size of the method now, do you still prefer it isolated (rather than having exactly the same lines of code in __init__)?

yes absolutely, having these separate makes them much easier to grok.

jreback · 2017-11-29T11:52:21Z

pandas/core/series.py

+            keys, values = zip(*compat.iteritems(data))
+        else:
+            keys, values = [], []
+        s = Series(values, index=keys, dtype=dtype)


can you add some comments here

jreback · 2017-11-29T11:55:18Z

pandas/tests/series/test_constructors.py

+        expected = Series(['a', 'b', 'c', 'd'], index=[1, 2, np.nan, None])
+        assert_series_equal(result, expected)
+
+        # MultiIndex:


I would add another test that parametrizes over the missing values specifically (with a fixed structured for d), e.g. parametrize over None, np.nan, float('nan')

jreback · 2017-11-29T11:56:52Z

pandas/core/base.py

@@ -875,7 +875,7 @@ def _map_values(self, mapper, na_action=None):
                # we specify the keys here to handle the
                # possibility that they are tuples
                from pandas import Series, Index
-                index = Index(mapper, tupleize_cols=False)


can you add a test that hits this (IOW that makes you change this).

Already there (notice here I'm just reverting my previous PR for consistency with the new default behaviour)... shall I mention this issue in that test?

oh, then add this issue number there as well

jorisvandenbossche · 2017-11-29T10:23:09Z

pandas/core/series.py

@@ -303,6 +293,23 @@ def _can_hold_na(self):

    _index = None

+    def _init_from_dict(self, data, index, dtype):


In frame.py it is called _init_dict so let's use that, and I would move it directly after the __init__

jorisvandenbossche · 2017-11-29T10:25:23Z

pandas/core/series.py

-                        index = Index(_try_sort(data))
-
-                try:
-                    data = index._get_values_from_dict(data)


I think this was the only usage of _get_values_from_dict, so this could be cleaned up. There are also multiple implementation (for the different types of indices, not sure if those differences are important and are all catched in the new implementation)

The different implementations seem unused and broken, see for instance

In [2]: pd.Index([])._get_values_from_dict({}) Out[2]: array([], dtype=float64) In [3]: pd.DatetimeIndex([])._get_values_from_dict({}) Out[3]: array([ nan])

however, in principle they do something sensible (not necessarily expected), which is to look for Timestamp keys in the dict. The "new implementation", that is the Series construction, doesn't care about this (and shouldn't, I think).

I'm OK with removing all of this if you want (in another PR).

It can be removed here (it is this PR that changes the implementation)

jorisvandenbossche

Small question on a test, for the rest looks good!

jorisvandenbossche · 2017-11-30T08:30:48Z

pandas/tests/series/test_combine_concat.py

@@ -181,7 +181,8 @@ def test_concat_empty_series_dtypes(self):
        # categorical
        assert pd.concat([Series(dtype='category'),
                          Series(dtype='category')]).dtype == 'category'
-        assert pd.concat([Series(dtype='category'),
+        # GH 18515
+        assert pd.concat([Series(np.array([]), dtype='category'),


Why do you change this? (it's only the dtype of the category, it is still a categorical series, so for the concat that does not matter)

Yes, the dtype of the category does matter for concat (and rightly so, since conceptually the fact that a categorical of ints is really a categorical is only an implementation detail when you're going to concat it to a non-categorical anyway).

(... or just try that test before and after my change)

How does this PR change the behaviour of empty category series?

I get that when #18515 is fixed, this test should be changed this way, but please leave that change for a PR that actually fixes #18515

Is it clear to you that that test fails with this PR? If yes, please clarify/reformulate what you're asking for.

If this PR changes (fixes) the dtype of categories for pd.Series(dtype='category'), please specify so and add an explicit test for this (and a whatsnew bug fix note).

So data is None is handled with data = {}, and thus this PR affects this behaviour? So this does fix #18515

OK, asking is good. This PR makes (incidentally, but it is the correct behaviour) category series initialized with an empty dict behave like category series initialized with an empty list, that is have object dtype. This fixes #18515 (but not #17261 ). Since this test is about concatenating an empty float category Series to an empty float (non-category) Series, I had to fix the former so that it still had dtype float64.

So since you reopened #18515, now I will add "closes #18515" and push again.

OK?

Yep.
And in general, please add commits instead of amending if you make such additional changes. Makes reviewing easier.

jreback · 2017-12-01T00:58:16Z

doc/source/whatsnew/v0.22.0.txt

@@ -208,5 +209,6 @@ Other

 - Improved error message when attempting to use a Python keyword as an identifier in a numexpr query (:issue:`18221`)
 - Fixed a bug where creating a Series from an array that contains both tz-naive and tz-aware values will result in a Series whose dtype is tz-aware instead of object (:issue:`16406`)
+- Fixed construction of :class:`Series` from ``dict`` containing ``NaN`` as key (:issue:`18480`)


Construction of a :class:`Series` from a dict

jreback · 2017-12-01T00:59:54Z

pandas/core/series.py

+            keys, values = zip(*compat.iteritems(data))
+        else:
+            keys, values = [], []
+        # Input is now list-like, so rely on "standard" construction:


blank line here

jreback · 2017-12-01T01:00:06Z

pandas/core/series.py

+            keys, values = [], []
+        # Input is now list-like, so rely on "standard" construction:
+        s = Series(values, index=keys, dtype=dtype)
+        # Now we just make sure the order is respected, if any


jreback · 2017-12-01T01:01:00Z

pandas/core/series.py

+        s = Series(values, index=keys, dtype=dtype)
+        # Now we just make sure the order is respected, if any
+        if index is not None and not index.identical(keys):
+            s = s.reindex(index)


you could do:

if index is not None: s = s.reindex(index, copy=False)

jreback · 2017-12-01T01:01:54Z

pandas/tests/series/test_constructors.py

-
-        # the are Index() and RangeIndex() which don't compare type equal
+        empty2 = Series(input_class())
+        # these are Index() and RangeIndex() which don't compare type equal


blank line before comments

jreback · 2017-12-01T01:04:10Z

pandas/tests/series/test_constructors.py

        # but are just .equals
        assert_series_equal(empty, empty2, check_index_type=False)

-        empty = Series(index=lrange(10))


so I think these tests got eliminated? can you add another test (or add onto the construction via np.nan, None, float('nan') which hits this case?

This test is wrong (dtype shouldn't be float64), the right one would fail, see #17261 . Adding an xfailing test.

The xfailing test would basically be this one with explicit dtype, it's better to change it when fixing #17261 . Adding a fixed test for this instead.

closes pandas-dev#18480 closes pandas-dev#18515

jreback · 2017-12-01T19:02:45Z

thanks!

From pandas-dev#18496 Special cases empty series construction, since the reindex is not necessary.

jreback requested changes Nov 26, 2017

View reviewed changes

toobaz force-pushed the series_init_dict_nan_18480 branch from f1a10df to d3d9f45 Compare November 26, 2017 13:28

toobaz force-pushed the series_init_dict_nan_18480 branch 2 times, most recently from 6d701ec to b10b0ce Compare November 26, 2017 18:29

jreback reviewed Nov 26, 2017

View reviewed changes

jreback added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Nov 26, 2017

toobaz changed the title ~~Series init dict nan 18480~~ Construction of Series from dict containing NaN as key Nov 27, 2017

toobaz mentioned this pull request Nov 27, 2017

Creating series with multiindex from dict with tuple keys with None #18505

Closed

toobaz force-pushed the series_init_dict_nan_18480 branch 2 times, most recently from 69b7a12 to 236df68 Compare November 27, 2017 07:17

jorisvandenbossche reviewed Nov 27, 2017

View reviewed changes

toobaz force-pushed the series_init_dict_nan_18480 branch 2 times, most recently from 04aada8 to d50f170 Compare November 28, 2017 22:30

toobaz commented Nov 28, 2017

View reviewed changes

jreback requested changes Nov 29, 2017

View reviewed changes

jreback reviewed Nov 29, 2017

View reviewed changes

jorisvandenbossche reviewed Nov 29, 2017

View reviewed changes

toobaz force-pushed the series_init_dict_nan_18480 branch from d50f170 to 4887f78 Compare November 29, 2017 13:15

toobaz force-pushed the series_init_dict_nan_18480 branch 2 times, most recently from 43e97ca to 1ee3c3e Compare November 29, 2017 13:37

jorisvandenbossche reviewed Nov 30, 2017

View reviewed changes

jorisvandenbossche mentioned this pull request Nov 30, 2017

Inconsistent dtype of category in empty Series between dict and list input #18515

Closed

toobaz force-pushed the series_init_dict_nan_18480 branch from d1804f2 to 2d75ffc Compare November 30, 2017 16:16

jreback requested changes Dec 1, 2017

View reviewed changes

toobaz added 3 commits December 1, 2017 11:42

BUG: fix initialization of Series with dict containing NaN as key

0d49cc6

closes pandas-dev#18480 closes pandas-dev#18515

CLN: remove now unused _get_values_from_dict

dc4c76e

TST: test and whatsnew for pandas-dev#18515

1582c42

toobaz force-pushed the series_init_dict_nan_18480 branch from 2d75ffc to 1582c42 Compare December 1, 2017 11:32

jreback added this to the 0.22.0 milestone Dec 1, 2017

jreback approved these changes Dec 1, 2017

View reviewed changes

jreback merged commit d270bbb into pandas-dev:master Dec 1, 2017

toobaz deleted the series_init_dict_nan_18480 branch December 1, 2017 20:14

TomAugspurger mentioned this pull request Apr 29, 2018

PERF: Regressions since v0.21 #18532

Closed

TomAugspurger added a commit to TomAugspurger/pandas that referenced this pull request Apr 29, 2018

PERF: Fixed regression in Series(index=idx) constructor

f31cb6b

From pandas-dev#18496 Special cases empty series construction, since the reindex is not necessary.

TomAugspurger mentioned this pull request Apr 29, 2018

PERF: Fixed regression in Series(index=idx) constructor #20865

Merged

		@@ -303,6 +293,23 @@ def _can_hold_na(self):

		_index = None

		def _init_from_dict(self, data, index, dtype):

Construction of Series from dict containing NaN as key #18496

Construction of Series from dict containing NaN as key #18496

Conversation

toobaz commented Nov 26, 2017 • edited by jreback Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

codecov bot commented Nov 26, 2017

Codecov Report

codecov bot commented Nov 26, 2017 • edited Loading

Codecov Report

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

toobaz commented Nov 26, 2017

jreback commented Nov 26, 2017

jorisvandenbossche commented Nov 26, 2017

toobaz commented Nov 27, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jorisvandenbossche Nov 27, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jorisvandenbossche commented Nov 27, 2017

toobaz commented Nov 27, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

toobaz commented Nov 29, 2017

jreback left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jorisvandenbossche left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback commented Dec 1, 2017

toobaz commented Nov 26, 2017 •

edited by jreback

Loading

codecov bot commented Nov 26, 2017 •

edited

Loading

jorisvandenbossche Nov 27, 2017 •

edited

Loading