Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: MultiIndex not dropping nan level and invalid code value #26408

Merged
merged 17 commits into from
Jun 1, 2019

Conversation

jiangyue12392
Copy link
Contributor

@jiangyue12392 jiangyue12392 commented May 15, 2019

Please take note that another bug is also fixed in this PR:

When the MultiIndex is constructed with code value < -1, a segmentation fault would be resulted in subsequent operations e.g. putting the MultiIndex in a DataFrame
Input:

x = pd.MultiIndex(levels=[['a'], ['x', np.nan]], labels=[[0,-2], [0,1]])
pd.DataFrame(index=x)

A segmentation fault can be resulted by these commands

@codecov
Copy link

codecov bot commented May 15, 2019

Codecov Report

Merging #26408 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26408      +/-   ##
==========================================
- Coverage   91.69%   91.68%   -0.01%     
==========================================
  Files         174      174              
  Lines       50743    50749       +6     
==========================================
+ Hits        46527    46530       +3     
- Misses       4216     4219       +3
Flag Coverage Δ
#multiple 90.19% <100%> (ø) ⬆️
#single 41.18% <83.33%> (-0.15%) ⬇️
Impacted Files Coverage Δ
pandas/core/indexes/multi.py 95.64% <100%> (+0.02%) ⬆️
pandas/io/gbq.py 78.94% <0%> (-10.53%) ⬇️
pandas/core/frame.py 97.02% <0%> (-0.12%) ⬇️
pandas/util/testing.py 90.7% <0%> (+0.1%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3b24fb6...e0d3bb4. Read the comment docs.

@codecov
Copy link

codecov bot commented May 15, 2019

Codecov Report

Merging #26408 into master will increase coverage by 0.14%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26408      +/-   ##
==========================================
+ Coverage   91.69%   91.84%   +0.14%     
==========================================
  Files         174      174              
  Lines       50743    50658      -85     
==========================================
- Hits        46527    46525       -2     
+ Misses       4216     4133      -83
Flag Coverage Δ
#multiple 90.38% <100%> (+0.18%) ⬆️
#single 41.7% <55%> (+0.37%) ⬆️
Impacted Files Coverage Δ
pandas/core/indexes/multi.py 95.67% <100%> (+0.04%) ⬆️
pandas/io/gbq.py 78.94% <0%> (-10.53%) ⬇️
pandas/core/dtypes/common.py 95.85% <0%> (-1.56%) ⬇️
pandas/core/indexes/timedeltas.py 90.96% <0%> (-0.63%) ⬇️
pandas/core/indexes/datetimes.py 96.36% <0%> (-0.49%) ⬇️
pandas/compat/__init__.py 93.54% <0%> (-0.4%) ⬇️
pandas/core/indexes/frozen.py 91.78% <0%> (-0.33%) ⬇️
pandas/io/excel/_base.py 91.81% <0%> (-0.33%) ⬇️
pandas/plotting/_style.py 76.92% <0%> (-0.26%) ⬇️
pandas/plotting/_misc.py 38.23% <0%> (-0.23%) ⬇️
... and 53 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3b24fb6...a141e40. Read the comment docs.

@@ -387,8 +387,8 @@ MultiIndex
^^^^^^^^^^

- Bug in which incorrect exception raised by :class:`Timedelta` when testing the membership of :class:`MultiIndex` (:issue:`24570`)
-
-
- Bug in having code value < -1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you make something like

Bug in :class:`Multindex` construction from levelsandcodesthat would incorrectly allowsNaNlevels

pandas/core/indexes/multi.py Outdated Show resolved Hide resolved
@jreback jreback added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate MultiIndex labels May 15, 2019
@TomAugspurger
Copy link
Contributor

Should we reserve negative codes for use by pandas, like we do for Categorical?

@jreback
Copy link
Contributor

jreback commented May 16, 2019

Should we reserve negative codes for use by pandas, like we do for Categorical?

yes

pandas/core/indexes/multi.py Show resolved Hide resolved


def test_na_levels():
tm.assert_index_equal(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment with the issue number

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue number added

MultiIndex(levels=[[np.nan, None, pd.NaT, 128, 2]],
codes=[[0, -1, 1, 2, 3, 4]]),
MultiIndex(levels=[[np.nan, None, pd.NaT, 128, 2]],
codes=[[-1, -1, -1, -1, 3, 4]]))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use
result=
expected=
tm.assert_index_equal(result, expected)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test cases refactored

return result

@classmethod
def _reassign_na_codes(cls, level, code):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

call this _validate_codes, pls add a doc-string.

This should be an instance method (and not a classmethod)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

method level changed and moved into validation method

@@ -387,8 +387,8 @@ MultiIndex
^^^^^^^^^^

- Bug in which incorrect exception raised by :class:`Timedelta` when testing the membership of :class:`MultiIndex` (:issue:`24570`)
-
-
- Bug in :class:`Multindex` construction from levels and codes that would incorrectly allows code values < -1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make this a single note

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

notes merged

@@ -243,10 +243,23 @@ def __new__(cls, levels=None, codes=None, sortorder=None, names=None,

if verify_integrity:
result._verify_integrity()

codes = [cls._reassign_na_codes(level, code)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should actually be IN _verify_integrity

MultiIndex(levels=[[np.nan, None, pd.NaT, 128, 2]],
codes=[[-1, -1, -1, -1, 3, 4]]))
tm.assert_index_equal(
MultiIndex(levels=[[np.nan, 's', pd.NaT, 128, None]],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should also verify if .set_levels and/or .set_codes is called. Can you add tests for each.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new test cases added

@@ -387,8 +387,7 @@ MultiIndex
^^^^^^^^^^

- Bug in which incorrect exception raised by :class:`Timedelta` when testing the membership of :class:`MultiIndex` (:issue:`24570`)
-
-
- Bug in :class:`Multindex` construction from levels and codes that would incorrectly allows code values < -1 or NaN levels (:issue:`19387`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May need to move this under the "API breaking changes" section, as previously we did allow codes < 0?.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah sure

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved the whatsnew entry to an API breaking change

@@ -67,6 +67,7 @@ def test_constructor_mismatched_codes_levels(idx):
length_error = (r"On level 0, code max \(3\) >= length of level \(1\)\."
" NOTE: this index is in an inconsistent state")
label_error = r"Unequal code lengths: \[4, 2\]"
code_value_error = (r"On level 0, code value \(-2\) < -1")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't need the outer parentheses here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

@@ -387,8 +387,7 @@ MultiIndex
^^^^^^^^^^

- Bug in which incorrect exception raised by :class:`Timedelta` when testing the membership of :class:`MultiIndex` (:issue:`24570`)
-
-
- Bug in :class:`Multindex` construction from levels and codes that would incorrectly allows code values < -1 or NaN levels (:issue:`19387`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah sure

return result

def _validate_codes(cls, level, code):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you make this accept self, and add doc-string

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed and doc-string added

" level (%d). NOTE: this index is in an"
" inconsistent state" % (i, level_codes.max(),
len(level)))
raise ValueError("On level {level}, code max ({max_code})"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can do like this

msg = "On level...........".format(....)
raise ValueError(msg)

(and if needed make the message nicely formatted over multiple lines, with parens e.g.

msg = ("........"
             ".......".format)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

formatting updated

pandas/core/indexes/multi.py Show resolved Hide resolved
@@ -696,8 +713,9 @@ def _set_codes(self, codes, level=None, copy=False, validate=True,

if verify_integrity:
self._verify_integrity(codes=new_codes)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you change _verfiy_integrity to return codes, this then becomes

if verify_integrity:
      new_codes = self._verify_integrity(codes=new_codes)
self._codes = new_codes

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may be missing it, but I don't see an ASV benchmark for MultiIndex with user-supplied codes. Do we suspect this will have a (major) performance impact? Should we add an ASV for it?

.. _whatsnew_0250.api_breaking.multi_indexing:


MultiIndexing contracted from levels and codes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think "contracted" is the right word. Maybe "Changed allowed codes in MultiIndex"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed the typo in whatsnew, I think I meant constructed

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good, some minor comments.

can you run the asv suite and see if any regressions

return result

def _validate_codes(self, level, code):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you try to type annotate level and code here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure about type annotation, could you please point me to any example in the code base?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure we have started generic codes in pandas._typing (see how they are used); this can be a followup


Parameters
----------
code : optional list
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code : list, optional (same with level)

code : optional list
Code to reassign.
level : optional list
Level to check for Nan.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nan -> missing value (NaN, NaT, None)

codes = [self._validate_codes(level, code)
for level, code in zip(levels, codes)]
new_codes = FrozenList(
_ensure_frozen(level_codes, lev, copy=False)._shallow_copy()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do you need the _shallow_copy()? the constructor (of frozen) should do this

@jreback jreback added this to the 0.25.0 milestone May 26, 2019
@jiangyue12392
Copy link
Contributor Author

@jreback
This is the asv report

before after ratio
[3b24fb6] [af35fa3]
<tmp7> <multi1>

  •     413±2μs         916±50μs     2.22  multiindex_object.Duplicates.time_remove_unused_levels
    
  •    440±30μs         903±20μs     2.05  indexing.MultiIndexing.time_frame_ix
    
  •    407±20μs         830±10μs     2.04  indexing.MultiIndexing.time_series_ix
    
  •     187±1ms          341±2ms     1.82  stat_ops.SeriesMultiIndexOps.time_op([0, 1], 'skew')
    
  •     783±3ms       1.41±0.01s     1.81  stat_ops.FrameMultiIndexOps.time_op([0, 1], 'kurt')
    
  •   199±0.6ms          353±2ms     1.77  stat_ops.SeriesMultiIndexOps.time_op([0, 1], 'kurt')
    
  •  94.9±0.8ms          161±1ms     1.70  stat_ops.FrameMultiIndexOps.time_op(1, 'kurt')
    
  •  24.6±0.5ms       40.9±0.3ms     1.66  stat_ops.SeriesMultiIndexOps.time_op(1, 'skew')
    
  •  26.0±0.5ms       42.1±0.5ms     1.62  stat_ops.SeriesMultiIndexOps.time_op(1, 'kurt')
    
  •  68.2±0.4ms          101±1ms     1.48  stat_ops.SeriesMultiIndexOps.time_op(1, 'mad')
    
  •     622±1ms          917±4ms     1.48  stat_ops.SeriesMultiIndexOps.time_op([0, 1], 'mad')
    
  •     321±2μs          457±2μs     1.43  multiindex_object.Values.time_datetime_level_values_sliced
    
  •     379±7ms          534±2ms     1.41  stat_ops.FrameMultiIndexOps.time_op([0, 1], 'skew')
    
  •    1.23±0ms      1.73±0.01ms     1.41  reshape.SparseIndex.time_unstack
    
  •  24.5±0.2ms       34.1±0.2ms     1.39  stat_ops.FrameMultiIndexOps.time_op(0, 'kurt')
    
  •  6.50±0.2ms       8.77±0.3ms     1.35  stat_ops.SeriesMultiIndexOps.time_op(0, 'skew')
    
  • 6.96±0.08ms       9.39±0.1ms     1.35  stat_ops.SeriesMultiIndexOps.time_op(0, 'kurt')
    
  •  51.4±0.6ms       68.6±0.4ms     1.33  stat_ops.FrameMultiIndexOps.time_op(1, 'skew')
    
  •  13.0±0.5ms       17.1±0.2ms     1.32  stat_ops.SeriesMultiIndexOps.time_op(0, 'mad')
    
  •     957±6μs      1.22±0.01ms     1.28  groupby.GroupByMethods.time_dtype_as_group('float', 'value_counts', 'transformation')
    
  •     957±9μs      1.21±0.01ms     1.27  groupby.GroupByMethods.time_dtype_as_group('float', 'value_counts', 'direct')
    
  • 1.08±0.01ms      1.36±0.01ms     1.25  groupby.GroupByMethods.time_dtype_as_field('float', 'value_counts', 'transformation')
    
  • 3.76±0.03ms      4.71±0.05ms     1.25  period.DataFramePeriodColumn.time_set_index
    
  • 1.09±0.01ms      1.35±0.01ms     1.25  groupby.GroupByMethods.time_dtype_as_field('float', 'value_counts', 'direct')
    
  •     931±5μs         1.16±0ms     1.24  groupby.GroupByMethods.time_dtype_as_field('datetime', 'value_counts', 'direct')
    
  •     211±6μs        262±0.9μs     1.24  indexing_engines.NumericEngineIndexing.time_get_loc((<class 'pandas._libs.index.Int8Engine'>, <class 'numpy.int8'>), 'monotonic_incr')
    
  •     940±2μs      1.16±0.01ms     1.23  groupby.GroupByMethods.time_dtype_as_field('datetime', 'value_counts', 'transformation')
    
  •     207±3μs        253±0.9μs     1.22  reindex.Fillna.time_float_32('pad')
    
  •     955±6μs      1.17±0.02ms     1.22  groupby.GroupByMethods.time_dtype_as_group('datetime', 'value_counts', 'direct')
    
  •     957±9μs      1.17±0.02ms     1.22  groupby.GroupByMethods.time_dtype_as_group('datetime', 'value_counts', 'transformation')
    
  •    1.51±0ms      1.82±0.02ms     1.20  sparse.FromCoo.time_sparse_series_from_coo
    
  • 7.80±0.05ms      9.37±0.04ms     1.20  ctors.MultiIndexConstructor.time_multiindex_from_iterables
    
  •     238±8μs          282±1μs     1.19  indexing_engines.NumericEngineIndexing.time_get_loc((<class 'pandas._libs.index.Int16Engine'>, <class 'numpy.int16'>), 'monotonic_incr')
    
  •     945±2μs      1.11±0.01ms     1.18  groupby.GroupByMethods.time_dtype_as_group('int', 'value_counts', 'direct')
    
  •    756±20μs          888±5μs     1.17  groupby.GroupByMethods.time_dtype_as_group('object', 'value_counts', 'transformation')
    
  •     267±3μs          313±3μs     1.17  reindex.Fillna.time_reindexed('backfill')
    
  •     873±6μs         1.01±0ms     1.16  groupby.GroupByMethods.time_dtype_as_field('object', 'value_counts', 'transformation')
    
  •     275±4μs          318±3μs     1.16  reindex.Fillna.time_reindexed('pad')
    
  •     948±3μs      1.10±0.01ms     1.16  groupby.GroupByMethods.time_dtype_as_group('int', 'value_counts', 'transformation')
    
  •    984±20μs         1.13±0ms     1.15  groupby.GroupByMethods.time_dtype_as_field('int', 'value_counts', 'direct')
    
  •     289±1μs          331±4μs     1.14  indexing_engines.NumericEngineIndexing.time_get_loc((<class 'pandas._libs.index.UInt32Engine'>, <class 'numpy.uint32'>), 'monotonic_incr')
    
  •     424±5ms          483±2ms     1.14  groupby.Transform.time_transform_lambda_max
    
  •  13.0±0.2ms      14.7±0.05ms     1.14  rolling.Pairwise.time_pairwise(1000, 'cov', True)
    
  •     216±1μs         245±10μs     1.13  indexing_engines.NumericEngineIndexing.time_get_loc((<class 'pandas._libs.index.UInt8Engine'>, <class 'numpy.uint8'>), 'monotonic_incr')
    
  •  13.0±0.1ms      14.7±0.07ms     1.13  rolling.Pairwise.time_pairwise(10, 'cov', True)
    
  • 6.78±0.04ms      7.64±0.04ms     1.13  groupby.Transform.time_transform_multi_key2
    
  •    789±20μs          884±7μs     1.12  groupby.GroupByMethods.time_dtype_as_group('object', 'value_counts', 'direct')
    
  • 5.75±0.02ms      6.42±0.02ms     1.12  multiindex_object.Integer.time_get_indexer
    
  • 3.30±0.01ms      3.69±0.03ms     1.12  reshape.SimpleReshape.time_stack
    
  • 1.81±0.01ms      2.02±0.01ms     1.12  reshape.SimpleReshape.time_unstack
    
  • 14.6±0.09ms       16.2±0.1ms     1.11  rolling.Pairwise.time_pairwise(10, 'corr', True)
    

return result

def _validate_codes(self, level, code):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure we have started generic codes in pandas._typing (see how they are used); this can be a followup


Parameters
----------
code : list, optional
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these are not optional

if verify_integrity:
self._verify_integrity(codes=new_codes)

new_codes = self._verify_integrity(codes=new_codes)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this only be called if verfiy_integrity?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, changed

@@ -1760,9 +1799,10 @@ def __setstate__(self, state):

self._set_levels([Index(x) for x in levels], validate=False)
self._set_codes(codes)
new_codes = self._verify_integrity()
self._set_codes(new_codes)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't we end up calling verify_integrity twice? here (and in _set_codes)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that the set_codes are called with default verify_integrity=False, verification would not be called twice

@jreback
Copy link
Contributor

jreback commented May 29, 2019

@jiangyue12392 on the asv's

so looks like somewhat slower, can you see if we are always checking codes? (I think we are checking twice sometims, and shouldn't check unless verify_integrity=True).

in the tests you may need to test with verfiy_integrity =True as well as the default (False)

@jreback
Copy link
Contributor

jreback commented May 29, 2019

how’s perf now? (you can just show some before / after with a sample benchmark that showed up before)

@jiangyue12392
Copy link
Contributor Author

@jreback
The new asv results on mutiindex related objects are better. The report said "BENCHMARKS NOT SIGNIFICANTLY CHANGED". For example, multiindex_object.Duplicates.time_remove_unused_levels that deteriorates in the previous run is now almost the same for this branch and the master

@jreback
Copy link
Contributor

jreback commented May 29, 2019

great will look soon

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small test comment, otherwise lgtm. ping on green.

[[np.nan, 's', pd.NaT, 128, None]])
tm.assert_index_equal(result, expected)

result = MultiIndex(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you replace the actual OP tests as well (e.g. the .dropna()); I don't think they are very different that what you have, but they also attempt to .dropna()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropna tests added, and green now.

@jreback jreback merged commit 73d8f96 into pandas-dev:master Jun 1, 2019
@jreback
Copy link
Contributor

jreback commented Jun 1, 2019

thanks @jiangyue12392 this was very nice! a non-trivial change and were very responsive! keep em coming!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate MultiIndex
Projects
None yet
Development

Successfully merging this pull request may close these issues.

MultiIndex.dropna() does not always drop NANs
3 participants