Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DEPR MultiIndex.is_lexsorted and MultiIndex.lexsort_depth #38701

Merged
merged 21 commits into from Jan 4, 2021

Conversation

MarcoGorelli
Copy link
Member

@MarcoGorelli MarcoGorelli commented Dec 26, 2020

The issue is that is_lexsorted only looks at whether the codes are sorted - the order of the codes doesn't necessarily reflect the sorting of the index values, e.g. if the index is created from a .groupby(sort=False) operation (see the linked issue for an example).

So, instead of is_lexsorted, users should use is_monotonic_increasing. There isn't an equivalent function to point them to for lexsort_depth, right?

@MarcoGorelli
Copy link
Member Author

lexsort_depth also probably needs to be made private, although there's already a _lexsort_depth...will think about how to rename things

@MarcoGorelli MarcoGorelli changed the title DEPR MultiIndex.is_lexsorted DEPR MultiIndex.is_lexsorted and MultiIndex.lexsort_depth Dec 26, 2020
return self._get_lexsort_depth()

@cache_readonly
def _get_lexsort_depth(self):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's already a _lexsort_depth function, so am replacing this with _get_lexsort_depth

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

umm that is extra confusing. pls don't do this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can just move the .sortorder stuff to _lexsort_depth

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your review - the difficulty I'm running into is that in pandas/core/indexes/multi.py there's

        if self.sortorder is not None:
            if self.sortorder > self._lexsort_depth():
                raise ValueError(
                    "Value for sortorder must be inferior or equal to actual "
                    f"lexsort_depth: sortorder {self.sortorder} "
                    f"with lexsort_depth {self._lexsort_depth()}"
                )

so I can't just move self.sortorder into self._lexsort_depth else this will never raise, as self._lexsort_depth() would always just return self.sortorder in this snippet.

For now I've factored out the part which is used in the above check into

    def _codes_lexsort_depth(self) -> int:
        int64_codes = [ensure_int64(level_codes) for level_codes in self.codes]
        for k in range(self.nlevels, 0, -1):
            if libalgos.is_lexsorted(int64_codes[:k]):
                return k
        return 0

but I'd imagine this will be considered equally confusing. Any suggestions?

Comment on lines 1824 to 1835
>>> pd.MultiIndex.from_arrays([['a', 'b', 'c'], ['d', 'e', 'f']]).is_lexsorted()
>>> pd.MultiIndex.from_arrays([['a', 'b'], ['d', 'e']])._is_lexsorted()
True
>>> pd.MultiIndex.from_arrays([['a', 'b', 'c'], ['d', 'f', 'e']]).is_lexsorted()
>>> pd.MultiIndex.from_arrays([['a', 'b'], ['d', 'f']])._is_lexsorted()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

making these a bit shorter so they fit on a single line

@MarcoGorelli MarcoGorelli marked this pull request as ready for review December 26, 2020 13:39
@MarcoGorelli MarcoGorelli added Deprecate Functionality to remove in pandas MultiIndex labels Dec 26, 2020
@moink
Copy link
Member

moink commented Dec 26, 2020

Some of the CI failures caused by #38703

@MarcoGorelli
Copy link
Member Author

Some of the CI failures caused by #38703

I haven't checked yet, but thanks for looking into it!

return self._get_lexsort_depth()

@cache_readonly
def _get_lexsort_depth(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

umm that is extra confusing. pls don't do this

return self._get_lexsort_depth()

@cache_readonly
def _get_lexsort_depth(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can just move the .sortorder stuff to _lexsort_depth

@@ -35,15 +35,15 @@ def test_sort_index_and_reconstruction_doc_example(self):
),
)
result = df.sort_index()
assert result.index.is_lexsorted()
assert result.index._is_lexsorted()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for the cases where these are the same as is_monotonic (vast majority), ok with removing the _is_lexsorted assert


@cache_readonly
def lexsort_depth(self):
warnings.warn(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

metion in the whatsnew as well

@MarcoGorelli MarcoGorelli marked this pull request as draft January 1, 2021 15:43
@MarcoGorelli MarcoGorelli marked this pull request as ready for review January 1, 2021 17:22
Comment on lines 932 to 936
value
a aa 2
bb 1
b aa 4
bb 3
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

something went wrong with pasting the indents here, will fix it in the next commit

return self.sortorder
return self._codes_lexsort_depth()

def _codes_lexsort_depth(self) -> int:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is very confusing, why do you think you need yet another method here? we want to reduce the api

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I totally agree - see this comment for why I thought it was necessary: #38701 (comment) :

the difficulty I'm running into is that in pandas/core/indexes/multi.py there's

        if self.sortorder is not None:
            if self.sortorder > self._lexsort_depth():
                raise ValueError(
                    "Value for sortorder must be inferior or equal to actual "
                    f"lexsort_depth: sortorder {self.sortorder} "
                    f"with lexsort_depth {self._lexsort_depth()}"
                )

so I can't just move self.sortorder into self._lexsort_depth else this will never raise, as self._lexsort_depth() would always just return self.sortorder in this snippet.

Would it be less confusing if this was a module-level function? There must be some way of calling only this part of _lexsort_depth because it's used in _verify_integrity from pandas/core/indexes/multi.py :

if self.sortorder is not None:
if self.sortorder > self._lexsort_depth():
raise ValueError(
"Value for sortorder must be inferior or equal to actual "
f"lexsort_depth: sortorder {self.sortorder} "
f"with lexsort_depth {self._lexsort_depth()}"
)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made it a module-level function, is this clearer?

return self.sortorder

return self._lexsort_depth()
warnings.warn(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need the cache on this one (as its already on _lexsort_depth) and this is now user facing

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, ping on green.

@@ -176,6 +176,15 @@ def new_meth(self_or_cls, *args, **kwargs):
return new_meth


def _lexsort_depth(codes: List[np.ndarray], nlevels: int) -> int:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you move this to almost the end of the file, e.g. where the module level functions are

@jreback jreback merged commit 586b490 into pandas-dev:master Jan 4, 2021
@jreback
Copy link
Contributor

jreback commented Jan 4, 2021

thanks @MarcoGorelli


return self._lexsort_depth()
warnings.warn(
"MultiIndex.is_lexsorted is deprecated as a public function, "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MarcoGorelli should the message here refer to lexsort_depth instead of is_lexsorted?

this warning isn't being caught in a bunch of tests; not sure why that isn't caught by the npdev build

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, looks like a typo, thanks for catching - we probably want to always use match= when catching warning to help prevent this, i'll fix this now

this warning isn't being caught in a bunch of tests; not sure why that isn't caught by the npdev build

not sure what you mean here sorry

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Deprecate Functionality to remove in pandas MultiIndex
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bug: groupby with sort=False creates buggy MultiIndex
4 participants