BUG: Fix initialization of DataFrame from dict with NaN as key #18600

Merged
merged 3 commits into from Apr 1, 2018

5 participants
@toobaz
Member

toobaz commented Dec 2, 2017

  • closes #18455
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

This does not solve the MI example in #18455, but that should be included in #18485 .
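
For readers who want to try the case in the title, here is a minimal sketch that mirrors the test data quoted later in this review (`cols`, `idx`, `values`). Using `np.nan` as `value` is an assumption for illustration; on a pandas release that contains this fix, the frame should come out fully populated rather than with an all-NaN column.

```python
import numpy as np
import pandas as pd

value = np.nan  # the dict key (and index label) under test

cols = [1, value, 3]
idx = ["a", value]
values = [[0, 3], [1, 4], [2, 5]]

# Each column is a Series whose own index also contains NaN.
data = {cols[c]: pd.Series(values[c], index=idx) for c in range(3)}

df = pd.DataFrame(data)
print(df)
print("NaN column labels:", int(df.columns.isna().sum()))
```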

@@ -416,44 +416,29 @@ def _init_dict(self, data, index, columns, dtype=None):
Needs to handle a lot of exceptional cases.
"""
if columns is not None:
columns = _ensure_index(columns)
arrays = Series(data, index=columns, dtype=object)
data_names = arrays.index

@jreback

jreback Dec 2, 2017

Contributor

this will be a perf issue

@toobaz

toobaz Dec 2, 2017

Member

Maybe... but right now it seems to be worse...

      before           after         ratio
     [d163de70]       [f7447b3f]
-      47.9±0.3ms       43.5±0.4ms     0.91  frame_ctor.FromDicts.time_frame_ctor_nested_dict
-      31.0±0.1ms       28.1±0.3ms     0.91  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('BusinessDay', 2)
-        31.3±1ms       28.2±0.2ms     0.90  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('BDay', 2)
-      31.8±0.3ms       28.0±0.4ms     0.88  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('CustomBusinessDay', 2)
-      32.8±0.3ms       28.2±0.2ms     0.86  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('Day', 1)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

@toobaz

toobaz Dec 2, 2017

Member

There does seem to be a performance loss on very small dfs. E.g. for pd.DataFrame(data) with data = {1 : [2], 3 : [4], 5 : [6]} I get results around 530 µs per loop before and 570 µs after. So we are talking about a ~10% gain on large dfs vs. a ~7.5% loss on small dfs.

... or I can avoid that Series and sort manually, at the cost of a bit of added complexity, probably ~10 LoCs.
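
A rough way to reproduce the small-frame measurement above, sketched with the standard-library `timeit` module. The loop count is an arbitrary assumption, and absolute numbers will differ by machine and pandas version, so only a before/after comparison on the same setup is meaningful.

```python
import timeit

import pandas as pd

# The tiny dict from the comment above.
data = {1: [2], 3: [4], 5: [6]}

n_loops = 200  # arbitrary; raise it for more stable numbers
total_seconds = timeit.timeit(lambda: pd.DataFrame(data), number=n_loops)
per_loop_us = total_seconds / n_loops * 1e6

print(f"{per_loop_us:.0f} µs per loop")
```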

@toobaz

toobaz Dec 2, 2017

Member

uhm... those asv results also seem pretty unstable:

      before           after         ratio
     [d163de70]       [f7447b3f]
+      30.7±0.2ms         41.1±3ms     1.34  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('QuarterBegin', 1)
+      29.4±0.1ms         39.1±4ms     1.33  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('CBMonthBegin', 2)
+      30.7±0.7ms         40.4±3ms     1.32  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('BDay', 2)
+      31.1±0.7ms         39.1±5ms     1.26  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('SemiMonthEnd', 2)
+      30.4±0.1ms         38.0±4ms     1.25  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('Hour', 2)
-        33.8±1ms       30.4±0.8ms     0.90  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('Micro', 1)
-      48.3±0.6ms       42.6±0.8ms     0.88  frame_ctor.FromDicts.time_frame_ctor_nested_dict
-        23.8±1ms       20.5±0.2ms     0.86  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('FY5253Quarter_1', 2)
-      41.2±0.7ms         30.4±2ms     0.74  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('CustomBusinessHour', 2)
-      8.35±0.9ms      6.05±0.01ms     0.72  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('FY5253_2', 2)
-        42.4±2ms       30.5±0.5ms     0.72  frame_ctor.FromDictwithTimestampOffsets.time_frame_ctor('BMonthEnd', 2)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

I'll try to sort manually and see how it goes.

@jreback

jreback Dec 2, 2017

Contributor

i actually doubt we have good benchmarks on this; you are measuring the same benchmark here

we need benchmarks that construct with different dtypes

and reducing code complexity is paramount here (though of course don’t want to sacrifice perf)
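
One possible shape for such a benchmark, written in the asv style (a class with `params`, `setup` and `time_*` methods). The class name, the dtype list and the sizes are all hypothetical; this is not a benchmark from the pandas suite.

```python
import numpy as np
import pandas as pd


class FromDictDtypes:
    # Hypothetical asv-style benchmark: one run per dtype.
    params = ["int64", "float64", "object", "datetime64[ns]"]
    param_names = ["dtype"]

    def setup(self, dtype):
        n_rows, n_cols = 1000, 50  # arbitrary sizes
        base = np.arange(n_rows)
        self.data = {i: base.astype(dtype) for i in range(n_cols)}
        self.columns = list(range(n_cols))

    def time_dict(self, dtype):
        # Path without explicit columns.
        pd.DataFrame(self.data)

    def time_dict_with_columns(self, dtype):
        # Path with columns passed, which is the branch touched in this PR.
        pd.DataFrame(self.data, columns=self.columns)
```

asv would discover and time the `time_*` methods itself; calling `setup` and the methods by hand is enough to check that the class runs.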

@jreback

Contributor

jreback commented Jan 21, 2018

pls rebase if you can continue on this.

@pep8speaks

pep8speaks commented Feb 4, 2018

Hello @toobaz! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on April 01, 2018 at 15:27 Hours UTC
@codecov

codecov bot commented Feb 5, 2018

Codecov Report

Merging #18600 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #18600      +/-   ##
==========================================
+ Coverage   91.84%   91.84%   +<.01%     
==========================================
  Files         152      152              
  Lines       49265    49256       -9     
==========================================
- Hits        45247    45241       -6     
+ Misses       4018     4015       -3
Flag Coverage Δ
#multiple 90.23% <100%> (ø) ⬆️
#single 41.9% <93.33%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/generic.py 95.94% <ø> (+0.04%) ⬆️
pandas/core/internals.py 95.53% <100%> (ø) ⬆️
pandas/core/series.py 93.9% <100%> (+0.11%) ⬆️
pandas/core/frame.py 97.15% <100%> (-0.02%) ⬇️
pandas/core/dtypes/cast.py 87.85% <0%> (+0.16%) ⬆️

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6a22cf7...22701fc. Read the comment docs.

@@ -591,3 +591,5 @@ Other
^^^^^
- Improved error message when attempting to use a Python keyword as an identifier in a ``numexpr`` backed query (:issue:`18221`)
- Fixed construction of a :class:`DataFrame` from a ``dict`` containing ``NaN`` as key (:issue:`18455`)

@jreback

jreback Feb 5, 2018

Contributor

move to reshaping

v.fill(np.nan)
# no obvious "empty" int column
if missing.any() and not (dtype is not None and

@jreback

jreback Feb 5, 2018

Contributor

use is_integer_dtype
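
To illustrate the suggestion, the sketch below compares the hand-written check from the hunk with `pandas.api.types.is_integer_dtype` on a few dtypes. The helper name `is_int_manual` is made up for this example, and the comparison only covers the dtypes listed; it is not a claim that the two checks agree everywhere.

```python
import numpy as np
from pandas.api.types import is_integer_dtype


def is_int_manual(dtype):
    # The hand-written form from the hunk above, wrapped in a helper.
    return dtype is not None and issubclass(dtype.type, np.integer)


candidates = [
    np.dtype("int64"),
    np.dtype("uint8"),
    np.dtype("float64"),
    np.dtype("O"),
    None,  # no dtype passed
]

for dtype in candidates:
    print(dtype, is_int_manual(dtype), is_integer_dtype(dtype))
```

Note that `is_integer_dtype(None)` simply returns `False`, so the explicit `dtype is not None` guard is no longer needed at the call site.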

# no obvious "empty" int column
if missing.any() and not (dtype is not None and
issubclass(dtype.type, np.integer)):
if dtype is None or np.issubdtype(dtype, np.flexible):

@jreback

jreback Feb 5, 2018

Contributor

use is_object_dtype

@toobaz

toobaz Feb 5, 2018

Member

It is not equivalent:

In [2]: a = np.array('abc'.split())

In [3]: pd.core.dtypes.common.is_object_dtype(a.dtype)
Out[3]: False

In [4]: np.issubdtype(a.dtype, np.flexible)
Out[4]: True

data_names.append(k)
arrays.append(v)
nan_dtype = dtype
v = np.empty(len(index), dtype=nan_dtype)

@jreback

jreback Feb 5, 2018

Contributor

use construct_1d_arraylike_from_scalar
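
`construct_1d_arraylike_from_scalar` is a pandas-internal helper, and its implementation is not shown in this thread. As a rough idea of what is being replaced, the `np.empty(...)` plus `fill(np.nan)` pattern from the hunks above can be wrapped as follows; `filled_array` is a hypothetical stand-in written with plain NumPy, not the pandas function.

```python
import numpy as np


def filled_array(value, length, dtype):
    """Hypothetical stand-in: a 1-D array holding `length` copies of `value`."""
    arr = np.empty(length, dtype=dtype)
    arr.fill(value)
    return arr


v_float = filled_array(np.nan, 3, np.float64)
v_object = filled_array(np.nan, 3, object)

print(v_float, v_float.dtype)
print(v_object, v_object.dtype)
```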

subarr = np.array(subarr, dtype=dtype, copy=copy)
# Take care in creating object arrays (but generators are not
# supported, hence the __len__ check):
if dtype == 'object' and (hasattr(subarr, '__len__') and

@jreback

jreback Feb 5, 2018

Contributor

use is_object_dtype

@jreback

jreback Feb 5, 2018

Contributor

this should be a separate branch rather than a nested if

@toobaz

toobaz Feb 5, 2018

Member

Do you mean

if is_object_dtype(dtype) and (hasattr(subarr, '__len__') and
                          not isinstance(subarr, np.ndarray)):
    [...]
elif not is_extension_type(subarr):
    [...]

?

@@ -287,8 +287,49 @@ def test_constructor_dict(self):
with tm.assert_raises_regex(ValueError, msg):
DataFrame({'a': 0.7}, columns=['a'])
with tm.assert_raises_regex(ValueError, msg):

@jreback

jreback Feb 5, 2018

Contributor

make a separate test (this change), with a comment

@toobaz

toobaz Feb 8, 2018

Member

(done)

cols = [1, value, 3]
idx = ['a', value]
values = [[0, 3], [1, 4], [2, 5]]
data = {cols[c]: pd.Series(values[c], index=idx) for c in range(3)}

@jreback

jreback Feb 5, 2018

Contributor

don't use pd. on anything

result = (DataFrame(data)
.sort_values((11, 21))
.sort_values(('a', value), axis=1))
expected = pd.DataFrame(np.arange(6, dtype='int64').reshape(2, 3),

@jreback

jreback Feb 5, 2018

Contributor

same

@@ -735,15 +776,15 @@ def test_constructor_corner(self):
# does not error but ends up float
df = DataFrame(index=lrange(10), columns=['a', 'b'], dtype=int)
- assert df.values.dtype == np.object_
+ assert df.values.dtype == np.dtype('float64')

@jreback

jreback Feb 5, 2018

Contributor

why is this changing?

@toobaz

toobaz Feb 5, 2018

Member

Because it was wrong: an int should not upcast to object (the passed dtype is currently not considered). issue? whatsnew?
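
The case under discussion can be checked directly with the snippet below. According to this thread, the resulting dtype is `float64` once the fix is in; other pandas versions may behave differently, so the snippet prints the dtypes rather than asserting them.

```python
import pandas as pd

# No data, only labels, with an integer dtype requested.
df = pd.DataFrame(index=range(10), columns=["a", "b"], dtype=int)

# Every cell is missing, so an integer column cannot represent the result.
print(df.dtypes)
print(df.isna().all().all())
```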

@jreback

jreback Feb 10, 2018

Contributor

hmm, yeah this looks suspect. I would make a new issue for this

@@ -511,7 +511,7 @@ def test_read_one_empty_col_with_header(self):
)
expected_header_none = DataFrame(pd.Series([0], dtype='int64'))
tm.assert_frame_equal(actual_header_none, expected_header_none)
- expected_header_zero = DataFrame(columns=[0], dtype='int64')
+ expected_header_zero = DataFrame(columns=[0])

@jreback

jreback Feb 5, 2018

Contributor

why is this changing?

@toobaz

toobaz Feb 5, 2018

Member

The test was wrong and worked by accident. The result is, and should be, of object dtype; but the "expected" one was too, because the passed dtype wasn't being considered (see above).

@jreback

jreback Feb 10, 2018

Contributor

ok again add this as an example in a new issue

@TomAugspurger

Contributor

TomAugspurger commented Feb 5, 2018

@toobaz did you add a test case for #19497 to see if it's fixed?

@toobaz

Member

toobaz commented Feb 5, 2018

> @toobaz did you add a test case for #19497 to see if it's fixed?

See #19497 (comment)

@toobaz

Member

toobaz commented Feb 8, 2018

@jreback ping. The new commit removes a workaround to #18455.

@@ -6468,7 +6468,6 @@ def _where(self, cond, other=np.nan, inplace=False, axis=None, level=None,
if not is_bool_dtype(dt):
raise ValueError(msg.format(dtype=dt))
cond = cond.astype(bool, copy=False)

@jreback

jreback Feb 8, 2018

Contributor

what caused you to change this?

@toobaz

toobaz Feb 8, 2018

Member

It's useless (bool dtype is checked just above)... but it's admittedly unrelated to the rest of the PR (it just came out debugging it).

if not is_extension_type(subarr):
# Take care in creating object arrays (but generators are not
# supported, hence the __len__ check):
if is_object_dtype(dtype) and (hasattr(subarr, '__len__') and

@jreback

jreback Feb 8, 2018

Contributor

aren't you just checking is_list_like?

@toobaz

toobaz Feb 8, 2018

Member

No, for instance

In [2]: pd.core.dtypes.common.is_list_like((x for x in range(3)))
Out[2]: True

I don't know whether there was a general discussion about iterators as input; we could either decide to drop support for them, or to centralize its handling where possible (e.g. at least for indexes and data), in which case I would change these two lines. I can open a new issue for this.

@jreback

jreback Feb 10, 2018

Contributor

ok, then add a function in pandas.core.dtypes.inference to is_generator and use it here (similar to is_iterator)

@toobaz

toobaz Feb 11, 2018

Member

Replacing hasattr(subarr, '__len__') with is_list_like(dtype) and not is_iterator(dtype) should do, I don't think we need is_generator. But alternatively, I could add an argument is_list_like(iterators=True), and use is_list_like(iterators=False) here. I think it would come handy at several other places.
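
A small sketch of the combination proposed here, using the public `pandas.api.types.is_list_like` and `is_iterator`. The wrapper name `is_sized_list_like` is hypothetical and only serves to show which inputs pass the combined check.

```python
import numpy as np
from pandas.api.types import is_iterator, is_list_like


def is_sized_list_like(obj):
    """Hypothetical helper: list-like, but not a one-shot iterator."""
    return is_list_like(obj) and not is_iterator(obj)


samples = {
    "list": [1, 2, 3],
    "ndarray": np.array([1, 2, 3]),
    "generator": (x for x in range(3)),
    "list iterator": iter([1, 2, 3]),
    "string": "abc",
}

for name, obj in samples.items():
    print(f"{name}: {is_sized_list_like(obj)}")
```

Generators and other iterators are list-like for pandas but are rejected by the second condition, which is the distinction the `__len__` check was making.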

@jreback

jreback Feb 11, 2018

Contributor

just make a new function, much simpler that way

@toobaz

@jreback

jreback Feb 11, 2018

Contributor

yes there is, we do this elsewhere.

@toobaz

toobaz Feb 11, 2018

Member

Please elaborate on what "new function" would make things "simpler".

@@ -762,3 +763,4 @@ Other
^^^^^
- Improved error message when attempting to use a Python keyword as an identifier in a ``numexpr`` backed query (:issue:`18221`)
- Suppressed error in the construction of a :class:`DataFrame` from a ``dict`` containing scalar values when the corresponding keys are not included in the passed index (:issue:`18600`)

@jreback

jreback Feb 10, 2018

Contributor

move to reshaping

@@ -418,44 +419,28 @@ def _init_dict(self, data, index, columns, dtype=None):
Needs to handle a lot of exceptional cases.
"""
if columns is not None:
columns = _ensure_index(columns)
arrays = Series(data, index=columns, dtype=object)

@jreback

jreback Feb 10, 2018

Contributor

do we have an asv that actually hits this path here, e.g. not-none columns and a dict as input? I am concerned that this Series conversion to object is going to cause issues (and an asv or 2 will determine this)

@toobaz

toobaz Feb 12, 2018

Member

Added some, see below

index = extract_index(list(data.values()))
# GH10856
# raise ValueError if only scalars in dict

@jreback

jreback Feb 10, 2018

Contributor

do you need the .tolist()?

@toobaz

toobaz Feb 12, 2018

Member

(removed)

v.fill(np.nan)
# no obvious "empty" int column
if missing.any() and not is_integer_dtype(dtype):
if dtype is None or np.issubdtype(dtype, np.flexible):

@jreback

jreback Feb 10, 2018

Contributor

why is the flexible needed here? is this actually hit by a test?

@toobaz

@jreback

jreback Feb 11, 2018

Contributor

i would appreciate an actual explanation. we do not check for this dtype anywhere else in the codebase. so at the very least this needs a comment

@toobaz

toobaz Feb 11, 2018

Member

Sure, I would also appreciate an explanation (on that code @ajcr wrote and you committed).

v = construct_1d_arraylike_from_scalar(np.nan, len(index),
nan_dtype)
arrays.loc[missing] = [v] * missing.sum()
arrays = arrays.tolist()

@jreback

jreback Feb 10, 2018

Contributor

this is a 2-D here yes? can you add a comment

@jreback

jreback Feb 10, 2018

Contributor

do you need to do this conversion?

@toobaz

Member

toobaz commented Feb 12, 2018

ASV run:

       before           after         ratio
     [324379ce]       [ef2340f7]
-      33.3±0.9ms       29.9±0.1ms     0.90  frame_ctor.FromDicts.time_nested_dict_index
-      32.8±0.2ms      28.5±0.09ms     0.87  frame_ctor.FromDictwithTimestamp.time_dict_with_timestamp_offsets(<Hour>)
-      50.4±0.3ms       42.4±0.5ms     0.84  frame_ctor.FromDicts.time_nested_dict_columns
-         417±3μs          281±6μs     0.67  frame_ctor.FromRecords.time_frame_from_records_generator(None)
-     1.32±0.03ms          277±1μs     0.21  frame_ctor.FromRecords.time_frame_from_records_generator(1000)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

The Travis CI problem seems unrelated.

@toobaz

Member

toobaz commented Mar 31, 2018

@jreback rebased, ready to merge if there are no further comments

@jreback

Contributor

jreback commented Apr 1, 2018

will look

@jreback

jreback approved these changes Apr 1, 2018

if you can edit the whatsnew slightly as indicated, ok to merge (left another comment but can try to address in the future)

@@ -1135,6 +1135,9 @@ Reshaping
- Bug in :func:`DataFrame.unstack` which casts int to float if ``columns`` is a ``MultiIndex`` with unused levels (:issue:`17845`)
- Bug in :func:`DataFrame.unstack` which raises an error if ``index`` is a ``MultiIndex`` with unused labels on the unstacked level (:issue:`18562`)
- Fixed construction of a :class:`Series` from a ``dict`` containing ``NaN`` as key (:issue:`18480`)
- Fixed construction of a :class:`DataFrame` from a ``dict`` containing ``NaN`` as key (:issue:`18455`)
- Suppressed error in the construction of a :class:`DataFrame` from a ``dict`` containing scalar values when the corresponding keys are not included in the passed index (:issue:`18600`)
- Fixed (changed from ``object`` to ``float64``) dtype of DataFrame initialized with ``dtype=int`` and without data (:issues:`19646`)

@jreback

jreback Apr 1, 2018

Contributor

this 3rd one not super clear, see if you can reword a bit

if not is_extension_type(subarr):
# Take care in creating object arrays (but iterators are not
# supported):
if is_object_dtype(dtype) and (is_list_like(subarr) and

@jreback

jreback Apr 1, 2018

Contributor

this is pretty hard to read, but ok for now, see if can simplify in the future

@toobaz

toobaz Apr 1, 2018

Member

Yes, for sure we will need some unified mechanism to process iterators

@jreback jreback added this to the 0.23.0 milestone Apr 1, 2018

@toobaz toobaz merged commit 4efb39f into pandas-dev:master Apr 1, 2018

3 checks passed

ci/circleci Your tests passed on CircleCI!
continuous-integration/appveyor/pr AppVeyor build succeeded
continuous-integration/travis-ci/pr The Travis CI build passed

@toobaz toobaz deleted the toobaz:df_init_dict_nan branch Apr 1, 2018
