New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH/BUG: Fix names, levels and labels handling in MultiIndex #4039

Merged
merged 4 commits into from Aug 11, 2013

Conversation

Projects
None yet
6 participants
@jtratner
Contributor

jtratner commented Jun 26, 2013

This PR covers:

Fixes: #4202, #3714, #3742 (there are might be some others, but I've blanked on them...)

Bug fixes:

  • MultiIndex preserves names as much as possible and it's now harder to overwrite index metadata by making changes down the line.
  • set_values no longer messes up names.

External API Changes:

  • Names, levels and labels are now validated each time and 'mostly' immutable.
  • names, levels and labels produce containers that are immutable (using new containers FrozenList and FrozenNDArray)
  • MultiIndex now shallow copies levels and labels before storing them.
  • Adds astype method to MultiIndex to resolve issue with set_values in NDFrame
  • Direct setting of levels and labels is "deprecated" with a setter that raises a DeprecationWarning (but still functions)
  • New set_names, set_labels, and set_levels methods allow setting of these attributes and take an inplace=True keyword argument to mutate in place.
  • Index has a rename method that works similarly to the set_* methods.
  • Improved exceptions on Index methods to be more descriptive / more specific (e.g., replacing Exception with ValueError, etc.)
  • Index.copy() now accepts keyword arguments (name=,names=, levels=, labels=,) which return a new copy with those attributes set. It also accepts deep, which is there for compatibility with other copy() methods, but doesn't actually change what copy does (though, for MultiIndex, it makes the copy operation slower)

Internal changes:

  • MultiIndex now uses _set_levels, _get_levels, _set_labels, _get_labels internally to handle labels and levels (and uses that directly in __array_finalize__ and __setstate__, etc.)
  • MultiIndex.copy(deep=True) will deepcopy levels, labels, and names.
  • Index objects handle names with _set_names and _get_names.
  • Index now inherits from FrozenNDArray which (mostly) blocks mutable methods (except for view() and reshape())
  • Index now actually copies ndarrays when copy=True is passed to constructor and dtype=None
@cpcloud

View changes

pandas/core/index.py Outdated
values = list(values)
if len(values) != self.nlevels:
raise ValueError(('Length of names (%d) must be same as level '
'(%d)') % (len(values),self.nlevels))

This comment has been minimized.

@cpcloud

cpcloud Jun 26, 2013

Member

complete bikeshedding, but no need for double parens here, as long as there's one set of parens python knows what 2 do :)

This comment has been minimized.

@jtratner

jtratner Jun 26, 2013

Contributor

heh, I just moved it straight from __new__, certainly worth it to change. :)

@jtratner

This comment has been minimized.

Contributor

jtratner commented Jun 26, 2013

as an aside, this leads to weird things like

idx.names = list("abcdef")
idx.droplevel("d") # Gives an index out of range-esque error
@cpcloud

This comment has been minimized.

Member

cpcloud commented Jun 26, 2013

yep i used to get that and i hacked around by recreating frames/series and some other trickery that will hopefully never see the light of day :)

@jtratner

This comment has been minimized.

Contributor

jtratner commented Jun 26, 2013

@cpcloud okay, I think I'm understanding more of the problem here: when you slice an index, the levels remain the same...e.g.:

chunklet = idx[-3:]
assert chunklet.levels[0] is idx.levels[0] # True

So, when you assign names, it mutates the underlying levels of both. This seems to follow the convention in __new__ to assign names to the levels directly...but if you do this after the fact, you end up mutating earlier copies.

Moreover, if you pass levels to the MultiIndex constructor, it doesn't copy them, e.g.:

new_idx = MultiIndex(idx.levels, idx.labels)
assert new_idx.levels[0] is idx.levels[0] # True

so what ought to be happening here? Should names be assigned to underlying levels or just left alone?

@cpcloud

View changes

pandas/tests/test_index.py Outdated
index.names = ["a", "b"]
ind_names = list(index.names)
level_names = [level.name for level in index.levels]
self.assertListEqual(ind_names, level_names)

This comment has been minimized.

@cpcloud

cpcloud Jun 27, 2013

Member

fyi you can't use this because it was introduced in py27

This comment has been minimized.

@jtratner

jtratner Jun 27, 2013

Contributor

thanks for the heads up ... didn't realize (and definitely not worth
porting). Second change to make.

On Wed, Jun 26, 2013 at 10:01 PM, Phillip Cloud notifications@github.comwrote:

In pandas/tests/test_index.py:

  •    # initializing with bad names (should always be equivalent)
    
  •    major_axis, minor_axis = self.index.levels
    
  •    major_labels, minor_labels = self.index.labels
    
  •    assertRaisesRegexp(ValueError, "^Length of names", MultiIndex, levels=[major_axis, minor_axis],
    
  •                       labels=[major_labels, minor_labels],
    
  •                       names=['first'])
    
  •    assertRaisesRegexp(ValueError, "^Length of names", MultiIndex, levels=[major_axis, minor_axis],
    
  •                       labels=[major_labels, minor_labels],
    
  •                       names=['first', 'second', 'third'])
    
  •    # names are assigned
    
  •    index.names = ["a", "b"]
    
  •    ind_names = list(index.names)
    
  •    level_names = [level.name for level in index.levels]
    
  •    self.assertListEqual(ind_names, level_names)
    

fyi you can't use this because it was introduced in py27http://docs.python.org/2/library/unittest.html#unittest.TestCase.assertListEqual


Reply to this email directly or view it on GitHubhttps://github.com//pull/4039/files#r4907657
.

This comment has been minimized.

@cpcloud

cpcloud Jun 27, 2013

Member

In python >= 2.7 assertEqual will dispatch to e.g., assertListEqual if two lists are passed, so that's nice.

@jtratner

This comment has been minimized.

Contributor

jtratner commented Jun 27, 2013

@cpcloud made those two changes and rebased.

@cpcloud

This comment has been minimized.

Member

cpcloud commented Jun 27, 2013

@wesm any reason why names should be settable to a sequence larger than nlevels after the fact?

@wesm

This comment has been minimized.

Member

wesm commented Jun 27, 2013

nope. i just never added any validation and left names as a simple attribute.

@cpcloud

This comment has been minimized.

Member

cpcloud commented Jun 27, 2013

+1 for a validator here...@jtratner what breaks?

@jreback

This comment has been minimized.

Contributor

jreback commented Jun 27, 2013

this is basically same issue as #3742

@cpcloud

This comment has been minimized.

Member

cpcloud commented Jun 27, 2013

indeed...close this one, that one?

@jreback

This comment has been minimized.

Contributor

jreback commented Jun 27, 2013

def should be some kind of validator on setting of names attrib in index; weird things can happen if e.g. levels are changed

@jreback

This comment has been minimized.

Contributor

jreback commented Jun 27, 2013

can close that one (maybe move example to here though) as another test case

@cpcloud

This comment has been minimized.

Member

cpcloud commented Jun 27, 2013

Example from #3742 cc @thriveth

I have raised the issue in this question on Stack Overflow, but I'm not sure it ever made it to the Pandas issue tracker.

I have a MultiIndex'ed DataFrame which I want to expand by using set_value(), but doing this destroys the names attribute of the index. This does not happen when setting the value of an already existing entry in the DataFrame. An easily reproducible example is to create the dataframe by:

lev1 = ['hans', 'hans', 'hans', 'grethe', 'grethe', 'grethe']
lev2 = ['1', '2', '3'] * 2
idx = pd.MultiIndex.from_arrays(
    [lev1, lev2], 
    names=['Name', 'Number'])
df = pd.DataFrame(
    np.random.randn(6, 4),
    columns=['one', 'two', 'three', 'four'],
    index=idx)
df = df.sortlevel()
df 

This shows a neat and nice object, just as I expected, with proper naming of the index columns. If I now run:

df.set_value(('grethe', '3'), 'one', 99.34)

the result is also as expected. But if I run:

df.set_value(('grethe', '4'), 'one', 99.34)

The column names of the index are gone, and the names attribute has been set to [None, None].

@jreback

This comment has been minimized.

Contributor

jreback commented Jun 27, 2013

also #3714 same issue too, except assigning to levels, needs validation as well

@jtratner

This comment has been minimized.

Contributor

jtratner commented Jun 27, 2013

This is what I was trying to get across earlier :) If you pass levels through the MultiIndex constructor, they have their names set to the names keyword argument or to None (and are not copied).

        if names is None:
            # !!!This is why names get reset to None
            subarr.names = [None] * subarr.nlevels
        else:
            if len(names) != subarr.nlevels:
                raise AssertionError(('Length of names (%d) must be same as level '
                                      '(%d)') % (len(names),subarr.nlevels))

            subarr.names = list(names)

        # THIS IS WHERE NAMES GET OVERWRITTEN WITHOUT BEING COPIED
        # set the name
        for i, name in enumerate(subarr.names):
            subarr.levels[i].name = name

An easy solution would be for the MultiIndex to copy the levels it receives first and then rename them. Then, any time you set the names attribute, it would set the name on every level and the names attribute on the copied levels, e.g. force _ensure_index to make a copy if it's already an index, so that at this point you'd be all good:

levels = [_ensure_index(lev) for lev in levels]

Because later on it just assigns it to the object:

subarr.levels = levels

I think levels should be a cached_readonly property, so that you don't end up creating indices twice (once in the levels = part and another on the setattr levels). If not,you could turn levels into a property that makes a copy before setting on the object.

If this all makes sense to you, I can write it up into a PR soon.

@jtratner

This comment has been minimized.

Contributor

jtratner commented Jun 27, 2013

@jreback @cpcloud Are Index and MultiIndex supposed to be immutable? If so, then neither should be able to set names directly (like MultiIndex does in its __new__ method). I think it makes more sense to allow them to mutate on names...much easier internally.

@jreback

This comment has been minimized.

Contributor

jreback commented Jun 27, 2013

well they are immutable on the values
but not names / levels (though they really should be)
because state sharing is a problem then

@jtratner

This comment has been minimized.

Contributor

jtratner commented Jun 28, 2013

@cpcloud I learned something today: Python cheats and compares lists first by object equality, then checks individual items...I'm wondering if this error is lurking other places in the code:

>>> arr1, arr2 = np.array(range(10)), np.array(range(10))
>>> assert [arr1, arr2] == [arr1, arr2] # succeeds
>>> assert [arr1, arr2] == [arr1, arr2.copy()]
Traceback (most recent call last):
  File "<console>", line 1, in <module>
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
>>> assert_almost_equal([arr1, arr2], [arr1.copy(), arr2])
True

I found this in test_copy, where a test case was not actually testing what it claimed to be testing :P .

    def test_copy(self):
        i_copy = self.index.copy()

        # Equal...but not the same object
        self.assert_(i_copy.levels == self.index.levels)
        self.assert_(i_copy.levels is not self.index.levels)
@jtratner

This comment has been minimized.

Contributor

jtratner commented Jun 28, 2013

Actually that whole test is not great, because it's actually (was) testing that two different lists were created.

@cpcloud

This comment has been minimized.

Member

cpcloud commented Jun 28, 2013

@jtratner that is good to know.

@cpcloud

This comment has been minimized.

Member

cpcloud commented Jun 28, 2013

i always assumed that sequence equality was done recursively. docs seem to imply that that is the case..strange

@jtratner

This comment has been minimized.

Contributor

jtratner commented Jun 28, 2013

@cpcloud yep, that's exactly right. (I'm sure that's implementation dependent, but it is important to know if using numpy arrays.

@cpcloud

This comment has been minimized.

Member

cpcloud commented Jun 28, 2013

maybe some sort of lightweight Levels class is in order? overloading __eq__ to compare levels or something more general for use with labels and levels?

@cpcloud

This comment has been minimized.

Member

cpcloud commented Jun 28, 2013

Maybe you could use Categorical?

@jtratner

This comment has been minimized.

Contributor

jtratner commented Jun 28, 2013

@cpcloud maybe. Right now I just changed levels, names and labels to return tuples and used assert_almost_equals. Very simple.

On that note - is it okay to make that change? Much easier to prevent erroneous assignment (like index.levels[i] = some_new_index) by producing an immutable object than to try to use something else. Have to change a ton of test cases that assumed names produces a list, but hopefully that's not a big deal...

@cpcloud

This comment has been minimized.

Member

cpcloud commented Jun 28, 2013

Categorical might be jumping the gun a bit. I would rather have it immutable, but that prevents sharing. a perf comparison might be useful here

@jtratner

This comment has been minimized.

Contributor

jtratner commented Jun 28, 2013

@cpcloud well, I'm using shallow copies/views, which I think means that only metadata is copied. This is necessary anyways, because you want to be able to set names on the underlying levels without worrying about messing up other indexes.

@jtratner

This comment has been minimized.

Contributor

jtratner commented Jun 28, 2013

(I just pushed what I have so far - it's failing because a ton of tests assume that index names, levels and labels will be lists...)

jtratner added some commits Jun 29, 2013

ENH: Index inherits from FrozenNDArray + add FrozenList
* `FrozenNDArray` - thin wrapper around ndarray that disallows setting methods
  (will be used for levels on `MultiIndex`)
* `FrozenList` - thin wrapper around list that disallows setting methods
  (needed because of type checks elsewhere)

Index inherits from FrozenNDArray now and also actually copies for deepcopy.
Assumption is that underlying array is still immutable-ish
ENH: Make core/index exceptions more descriptive
* `assert_copy`: You can use `assert_copy` to check that two iterables
  produce copies that are not the same object. (uses `assert_almost_equal`
  under the hood).
* Fix assert_almost_equal to handle non-ndarrays (previously failed
  after iterable check)
* Fix test_mixed_panel to reflect true name behavior
@jtratner

This comment has been minimized.

Contributor

jtratner commented Aug 10, 2013

@jreback tell me if you want docs on set_names, set_levels, or the new copy constructor.

@jreback

This comment has been minimized.

Contributor

jreback commented Aug 10, 2013

I think an example for v0.13 is good
and maybe add in indexing.rst? not sure we actually have a section on index names so put where u think

@jtratner

This comment has been minimized.

Contributor

jtratner commented Aug 10, 2013

okay, there's an example in v0.13.0.txt and changed indexing.rst slightly to add the index names section (moved around the Index objects part so it could address MultiIndex too)

@jtratner

This comment has been minimized.

Contributor

jtratner commented Aug 10, 2013

@jreback this is all working now + has the docs, etc.

@jreback

View changes

doc/source/release.rst Outdated
@@ -82,6 +88,20 @@ pandas 0.13
- removed the ``warn`` argument from ``open``. Instead a ``PossibleDataLossError`` exception will
be raised if you try to use ``mode='w'`` with an OPEN file handle (:issue:`4367`)
- allow a passed locations array or mask as a ``where`` condition (:issue:`4467`)
- ``Index`` and ``MultiIndex`` changes (:issue:`4039`):

This comment has been minimized.

@jreback

jreback Aug 11, 2013

Contributor

doesn't this need to be indented? (eg outer level should be same as existing and inner level indented?)

This comment has been minimized.

@cpcloud

cpcloud Aug 11, 2013

Member

yep needs to be indented

This comment has been minimized.

@jtratner

jtratner Aug 11, 2013

Contributor

That indentation level is because the previous entry is a sub-bullet of the HDFStore changes. (this matches up with the outer indentation level, which is 2 spaces.).

This comment has been minimized.

@cpcloud

cpcloud Aug 11, 2013

Member

then can you change the ones below it cuz this one sticks out

This comment has been minimized.

@jtratner

jtratner Aug 11, 2013

Contributor

What follows are sub-bullets of "Index and MultiIndex changes", just like how the HDFStore changes are grouped above it - I can change it if you want, I was matching the look of other elements that have multiple changes.

This comment has been minimized.

@cpcloud

cpcloud Aug 11, 2013

Member

oh you're right! sorry about that. only thing is that to prevent sphinx from complaining you should put a newline between an outdented bullet point and the previously indented one

This comment has been minimized.

@jtratner

jtratner Aug 11, 2013

Contributor

@cpcloud added the space.

@jreback

View changes

doc/source/v0.13.0.txt Outdated
..code-block ::
# instead of setting levels directly

This comment has been minimized.

@jreback

jreback Aug 11, 2013

Contributor

I would make this more explicit, I'll saying before 0.13 you did this, but that is deprecated, so now use set_*

This comment has been minimized.

@jtratner

jtratner Aug 11, 2013

Contributor

I fixed this up - see what you think.

jtratner added some commits Jun 28, 2013

BUG/ENH: Make names, levels and labels properties.
* `names` is now a property *and* is set up as an immutable tuple.
* `levels` are always (shallow) copied now and it is deprecated to set directly
* `labels` are set up as a property now, moving all the processing of
  labels out of `__new__` + shallow-copied.
* `levels` and `labels` are immutable.
* Add names tests, motivating example from #3742, reflect tuple-ish
  output from names, and level names check to reindex test.
* Add set_levels, set_labels, set_names and rename to index
* Deprecate setting labels and levels directly

Similar to other set_* methods...allows mutation if necessary but
otherwise returns same object.

Labels are now converted to `FrozenNDArray` and wrapped in a
`FrozenList`. Should mostly resolve #3714 because you have to work to
actually make assignments to an `Index`.

BUG: Give MultiIndex its own astype method

Fixes issue with set_value forgetting names.
ENH: Additional keyword arguments for Index.copy()
* Index derivatives can set `name` or `names` as well as
  `dtype` on copy. MultiIndex can set `levels`, `labels`, and `names`.
* Also, `__deepcopy__` just calls `copy(deep=True)`
* Now, BlockManager.copy() takes an additional argument `copy_axes` which
  copies axes as well. Defaults to False.
* `Series.copy()` takes an optional deep argument, which causes it to
  copy its index.
* `DataFrame.copy()` passes `copy_axes=True` when deepcopying.
* Add copy kwarg to MultiIndex `__new__`
@jreback

This comment has been minimized.

Contributor

jreback commented Aug 11, 2013

ok bombs away

jreback added a commit that referenced this pull request Aug 11, 2013

Merge pull request #4039 from jtratner/fix-multiindex-naming
ENH/BUG: Fix names, levels and labels handling in MultiIndex

@jreback jreback merged commit 429e9f3 into pandas-dev:master Aug 11, 2013

@jreback

This comment has been minimized.

Contributor

jreback commented Aug 12, 2013

@jtratner

just rebased #3482 on top of this in master, no problem

noticed that you had a separate copy for Series (which copies index and such properly). Is this not also needed in core/internals/copy? specificy the axes are ALWAYS shallow copied here...?

@jtratner

This comment has been minimized.

Contributor

jtratner commented Aug 12, 2013

@jreback actually, I was flip-flopping on this for a while. I tried copying axes in a few places, and I kept getting issues with the check that self.ref_items is self.items failing. It's fine for series, because it passes it through the constructor again, but DataFrame passes its data to the new object, so it doesn't reassign ref_items. I didn't want to mess with it too much because I wasn't sure where to change it. Maybe I needed to change it in block instead?

@jtratner jtratner deleted the jtratner:fix-multiindex-naming branch Aug 12, 2013

@jtratner

This comment has been minimized.

Contributor

jtratner commented Aug 12, 2013

@jreback After you brought this up - I'm thinking that you could copy index first, then pass it as a parameter to the copy on blocks to overwrite items and ref_items. Maybe that would work? What's the difference between ref_items and items?

@jreback

This comment has been minimized.

Contributor

jreback commented Aug 12, 2013

@jtratner items are references to ref_items that are identical or a take (and not an indexing operation).

So the only thing you can do is to copy it BEFORE you start and then use that one (and yes, could be done in copy), you could have to pass the new ref_items, then change the Block.copy to use the new items, e.g. items = ref_items.take(self.ref_locs); very much like a renaming operation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment