Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Construction of Series from dict containing NaN as key #18496

Merged
merged 3 commits into from
Dec 1, 2017

Conversation

toobaz
Copy link
Member

@toobaz toobaz commented Nov 26, 2017

This is also a prerequisite for fixing #18455 (which is a prerequisite for fixing #18460 ). The workaround to #18485 is annoying, but it is easy to remove it when the bug is fixed.

@@ -205,4 +205,5 @@ Other

- Improved error message when attempting to use a Python keyword as an identifier in a numexpr query (:issue:`18221`)
- Fixed a bug where creating a Series from an array that contains both tz-naive and tz-aware values will result in a Series whose dtype is tz-aware instead of object (:issue:`16406`)
- Fixed initialization of Series from dict containing NaN as key (:issue:`18480`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

intialization -> construction
use double-backticks around Series and NaN and dict

if data else np.nan)

remap_to_mi = False
keys = maybe_mi_keys
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is WAY too complicated, likely slow. pls try to simplify. at the very least, should all be split out to a more accessible location, e.g. move to a helper function, below _sanitize_array is prob ok. call it _dict_to_array or something.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

going to need quite some simplification. I am ok with throwing out some cases if that will help.

@@ -625,6 +625,18 @@ def test_constructor_dict(self):
expected.iloc[1] = 1
assert_series_equal(result, expected)

# GH 18480 - NaN key
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

separate test

@codecov
Copy link

codecov bot commented Nov 26, 2017

Codecov Report

Merging #18496 into master will increase coverage by <.01%.
The diff coverage is 93.18%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18496      +/-   ##
==========================================
+ Coverage    91.3%   91.31%   +<.01%     
==========================================
  Files         163      163              
  Lines       49781    49817      +36     
==========================================
+ Hits        45451    45488      +37     
+ Misses       4330     4329       -1
Flag Coverage Δ
#multiple 89.11% <93.18%> (ø) ⬆️
#single 40.74% <54.54%> (+0.01%) ⬆️
Impacted Files Coverage Δ
pandas/core/series.py 94.66% <93.18%> (-0.11%) ⬇️
pandas/core/indexes/datetimes.py 95.04% <0%> (-0.48%) ⬇️
pandas/core/indexes/datetimelike.py 96.47% <0%> (-0.45%) ⬇️
pandas/core/indexes/base.py 96.34% <0%> (-0.06%) ⬇️
pandas/plotting/_converter.py 65.25% <0%> (+1.81%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 38f41e6...d3d9f45. Read the comment docs.

@codecov
Copy link

codecov bot commented Nov 26, 2017

Codecov Report

Merging #18496 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18496      +/-   ##
==========================================
- Coverage   91.44%   91.43%   -0.02%     
==========================================
  Files         157      157              
  Lines       51379    51374       -5     
==========================================
- Hits        46985    46972      -13     
- Misses       4394     4402       +8
Flag Coverage Δ
#multiple 89.3% <100%> (ø) ⬆️
#single 40.62% <72.22%> (-0.06%) ⬇️
Impacted Files Coverage Δ
pandas/core/indexes/datetimes.py 95.69% <ø> (-0.03%) ⬇️
pandas/core/indexes/base.py 96.46% <ø> (-0.01%) ⬇️
pandas/core/indexes/datetimelike.py 97.11% <ø> (+0.19%) ⬆️
pandas/core/base.py 96.54% <100%> (-0.01%) ⬇️
pandas/core/series.py 94.8% <100%> (+0.04%) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.81% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d74ac70...1582c42. Read the comment docs.

@toobaz toobaz force-pushed the series_init_dict_nan_18480 branch 2 times, most recently from 6d701ec to b10b0ce Compare November 26, 2017 18:29
remap_to_mi = False
if data:
keys, values = zip(*compat.iteritems(data))
# Workaround for #18485 - part 1/3
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as I said before. this giant block of code is not acceptable here. its needs to be moved to a separate function, and made much simpler.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is smaller than before, but no problem, I'll isolate it. Any comment on the logic?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean: different than "made much simpler" which is everybody's dream more than a comment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my point is the logic is so dense its not even understandable. this is too complex, no-one will be able to follow this, vet it for bugs or anything. you need to break this apart into logical abstractions, build things up into much smaller units.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, good!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@toobaz don't get me wrong, I am happy to have a fix for this, but it has to be code that is maintainable, limits special cases, and it non completely non-performant.

@toobaz
Copy link
Member Author

toobaz commented Nov 26, 2017

@jorisvandenbossche @jreback Bringing here a conversation from Gitter: one possible way to approach this (avoiding the problem that {np.nan : 2}[float('nan')] raises KeyError) is to modify lib.fast_multiget so that it iterates the dict and then matches to specific keys, rather than directly looking for such keys. Or it could iterate the dict only if np.nan appears in the desired keys (or if the desired keys are unspecified), and follow the usual behavior otherwise.

@jreback
Copy link
Contributor

jreback commented Nov 26, 2017

@toobaz happy to have you modify cython code directly to avoid perf penalty (and it also abstracts away some of the complexity). inline comments a plus!

@jreback jreback added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Nov 26, 2017
@jorisvandenbossche
Copy link
Member

@toobaz sidenote: can you give the PR a bit more descriptive title?

@toobaz toobaz changed the title Series init dict nan 18480 Construction of Series from dict containing NaN as key Nov 27, 2017
@toobaz toobaz force-pushed the series_init_dict_nan_18480 branch 2 times, most recently from 69b7a12 to 236df68 Compare November 27, 2017 07:17
@toobaz
Copy link
Member Author

toobaz commented Nov 27, 2017

Since I introduced no (explicit) loops I preferred to keep code in *.py.

I could add a test at the beginning of lib.fast_multiget asserting the absence of NaN in the argument. However the case of tuples would be tricky: it's just more complicated - and not particularly useful I guess - to check if they contain NaNs, but at the same time raising an error for any tuple would be wrong.

locs = self.get_indexer(keys)
order = - np.ones(len(self), dtype=int)
order[locs] = np.arange(len(keys))
values = values[order]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this just a reindex ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I just wanted to avoid creating a Series object, particularly so inside an Index method... but I can do it if you prefer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this method is only used in Series.init, I would then try to refactor to do the reindex there, that would eliminate much of this custom code?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes... (that would basically be my previous version of this PR, except that I would move code from Series.__init__ to Series._init_from_dict). Notice however that if we leave Index._get_values_from_dict as it is... it is broken (doesn't work for NaNs).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But your previous PR had a lot of complex code checking for tuples etc (I think that part was the main objection), while this one does not have that any more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, then if you're OK with leaving Index._get_values_from_dict as it is, I will try again.

assert_series_equal(result, expected)

# Different NaNs:
d = {1: 'a', 2: 'b', float('nan'): 'c', float('nan'): 'd'}
Copy link
Member

@jorisvandenbossche jorisvandenbossche Nov 27, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a test with None key as well? (which already works, but would be nice to assert)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(done below)

# Different NaNs:
d = {1: 'a', 2: 'b', float('nan'): 'c', float('nan'): 'd'}
result = Series(d).sort_values()
expected = Series(['a', 'b', 'c', 'd'], index=[1, 2, np.nan, np.nan])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the order of the expected result guaranteed like this? (I mean: I though we sort the index, but not the values?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's why I'm sort_values()ing just above ;-)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, didn't see that :-)

@jorisvandenbossche
Copy link
Member

Suggestion (but didn't think it fully through): we can make a distinction on whether index was passed or not. If index was passed, we do want to do dict-lookup (as otherwise it could potentially give a performance bottleneck) -> current code. If index is not passed, we can just get the keys, values of the full dict (as we want all those anyway, so no need to then do a dict-lookup for each key).

That doesn't solve the problem with pd.Series({np.nan: 1, 2:2}, index=[1, np.nan,]), but would already solve the majority of the cases I think?

@toobaz
Copy link
Member Author

toobaz commented Nov 27, 2017

If index was passed, we do want to do dict-lookup (as otherwise it could potentially give a performance bottleneck) -> current code. If index is not passed, we can just get the keys, values of the full dict (as we want all those anyway, so no need to then do a dict-lookup for each key).

In my current version the lookup is not a bottleneck because it is just an iteration on keys/values, precisely what is needed anyway in the "no index" case (where, by the way, no reindexing happens because the indexes are identical()). However, still in the "no index" case, keys are being iterated twice, and I can fix this at the cost of few lines in Series.__init__.

@toobaz toobaz force-pushed the series_init_dict_nan_18480 branch 2 times, most recently from 04aada8 to d50f170 Compare November 28, 2017 22:30
@@ -181,7 +181,7 @@ def test_concat_empty_series_dtypes(self):
# categorical
assert pd.concat([Series(dtype='category'),
Series(dtype='category')]).dtype == 'category'
assert pd.concat([Series(dtype='category'),
assert pd.concat([Series(np.array([]), dtype='category'),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #18515

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add this issue number here as well

@toobaz
Copy link
Member Author

toobaz commented Nov 29, 2017

@jreback ping

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

much simpler thanks, some comments.

@@ -303,6 +293,23 @@ def _can_hold_na(self):

_index = None

def _init_from_dict(self, data, index, dtype):
# Looking for NaN in dict doesn't work ({np.nan : 1}[float('nan')]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you make this a proper doc-string

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to double check: given the size of the method now, do you still prefer it isolated (rather than having exactly the same lines of code in __init__)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes absolutely, having these separate makes them much easier to grok.

keys, values = zip(*compat.iteritems(data))
else:
keys, values = [], []
s = Series(values, index=keys, dtype=dtype)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add some comments here

expected = Series(['a', 'b', 'c', 'd'], index=[1, 2, np.nan, None])
assert_series_equal(result, expected)

# MultiIndex:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add another test that parametrizes over the missing values specifically (with a fixed structured for d), e.g. parametrize over None, np.nan, float('nan')

@@ -875,7 +875,7 @@ def _map_values(self, mapper, na_action=None):
# we specify the keys here to handle the
# possibility that they are tuples
from pandas import Series, Index
index = Index(mapper, tupleize_cols=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a test that hits this (IOW that makes you change this).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already there (notice here I'm just reverting my previous PR for consistency with the new default behaviour)... shall I mention this issue in that test?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, then add this issue number there as well

@@ -303,6 +293,23 @@ def _can_hold_na(self):

_index = None

def _init_from_dict(self, data, index, dtype):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In frame.py it is called _init_dict so let's use that, and I would move it directly after the __init__

index = Index(_try_sort(data))

try:
data = index._get_values_from_dict(data)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this was the only usage of _get_values_from_dict, so this could be cleaned up. There are also multiple implementation (for the different types of indices, not sure if those differences are important and are all catched in the new implementation)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The different implementations seem unused and broken, see for instance

In [2]: pd.Index([])._get_values_from_dict({})
Out[2]: array([], dtype=float64)

In [3]: pd.DatetimeIndex([])._get_values_from_dict({})
Out[3]: array([ nan])

however, in principle they do something sensible (not necessarily expected), which is to look for Timestamp keys in the dict. The "new implementation", that is the Series construction, doesn't care about this (and shouldn't, I think).

I'm OK with removing all of this if you want (in another PR).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can be removed here (it is this PR that changes the implementation)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, ping

@toobaz toobaz force-pushed the series_init_dict_nan_18480 branch 2 times, most recently from 43e97ca to 1ee3c3e Compare November 29, 2017 13:37
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small question on a test, for the rest looks good!

@@ -181,7 +181,8 @@ def test_concat_empty_series_dtypes(self):
# categorical
assert pd.concat([Series(dtype='category'),
Series(dtype='category')]).dtype == 'category'
assert pd.concat([Series(dtype='category'),
# GH 18515
assert pd.concat([Series(np.array([]), dtype='category'),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you change this? (it's only the dtype of the category, it is still a categorical series, so for the concat that does not matter)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the dtype of the category does matter for concat (and rightly so, since conceptually the fact that a categorical of ints is really a categorical is only an implementation detail when you're going to concat it to a non-categorical anyway).

(... or just try that test before and after my change)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this PR change the behaviour of empty category series?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get that when #18515 is fixed, this test should be changed this way, but please leave that change for a PR that actually fixes #18515

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it clear to you that that test fails with this PR? If yes, please clarify/reformulate what you're asking for.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this PR changes (fixes) the dtype of categories for pd.Series(dtype='category'), please specify so and add an explicit test for this (and a whatsnew bug fix note).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So data is None is handled with data = {}, and thus this PR affects this behaviour? So this does fix #18515

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, asking is good. This PR makes (incidentally, but it is the correct behaviour) category series initialized with an empty dict behave like category series initialized with an empty list, that is have object dtype. This fixes #18515 (but not #17261 ). Since this test is about concatenating an empty float category Series to an empty float (non-category) Series, I had to fix the former so that it still had dtype float64.

So since you reopened #18515, now I will add "closes #18515" and push again.

OK?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep.
And in general, please add commits instead of amending if you make such additional changes. Makes reviewing easier.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, ping

@@ -208,5 +209,6 @@ Other

- Improved error message when attempting to use a Python keyword as an identifier in a numexpr query (:issue:`18221`)
- Fixed a bug where creating a Series from an array that contains both tz-naive and tz-aware values will result in a Series whose dtype is tz-aware instead of object (:issue:`16406`)
- Fixed construction of :class:`Series` from ``dict`` containing ``NaN`` as key (:issue:`18480`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Construction of a :class:`Series` from a dict

keys, values = zip(*compat.iteritems(data))
else:
keys, values = [], []
# Input is now list-like, so rely on "standard" construction:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blank line here

keys, values = [], []
# Input is now list-like, so rely on "standard" construction:
s = Series(values, index=keys, dtype=dtype)
# Now we just make sure the order is respected, if any
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

s = Series(values, index=keys, dtype=dtype)
# Now we just make sure the order is respected, if any
if index is not None and not index.identical(keys):
s = s.reindex(index)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could do:

if index is not None:
    s = s.reindex(index, copy=False)


# the are Index() and RangeIndex() which don't compare type equal
empty2 = Series(input_class())
# these are Index() and RangeIndex() which don't compare type equal
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blank line before comments

# but are just .equals
assert_series_equal(empty, empty2, check_index_type=False)

empty = Series(index=lrange(10))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so I think these tests got eliminated? can you add another test (or add onto the construction via np.nan, None, float('nan') which hits this case?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is wrong (dtype shouldn't be float64), the right one would fail, see #17261 . Adding an xfailing test.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The xfailing test would basically be this one with explicit dtype, it's better to change it when fixing #17261 . Adding a fixed test for this instead.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, ping

@jreback jreback added this to the 0.22.0 milestone Dec 1, 2017
@jreback jreback merged commit d270bbb into pandas-dev:master Dec 1, 2017
@jreback
Copy link
Contributor

jreback commented Dec 1, 2017

thanks!

@toobaz toobaz deleted the series_init_dict_nan_18480 branch December 1, 2017 20:14
TomAugspurger added a commit to TomAugspurger/pandas that referenced this pull request Apr 29, 2018
From pandas-dev#18496

Special cases empty series construction, since the reindex is not necessary.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
3 participants