Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Fix copy s.t. it always copies index/columns. #4830

Merged
merged 3 commits into from Sep 24, 2013

Conversation

@jtratner
Copy link
Contributor

commented Sep 13, 2013

Fixes #4202 (and maybe some others too).

Only copies index/columns with deep=True on BlockManager. Plus some
tests...yay!

@jtratner

This comment has been minimized.

Copy link
Contributor Author

commented Sep 13, 2013

@hayd copying will work now :)

@jreback

View changes

pandas/core/internals.py Outdated
if deep:
new_axes = [ax.copy(deep=deep) for ax in self.axes]
else:
new_axes = list(self.axes)
return self.apply('copy', axes=new_axes, deep=deep, do_integrity_check=False)

This comment has been minimized.

Copy link
@jreback

jreback Sep 13, 2013

Contributor

just pass ref_items=new_axes[0] if deep else None (and it will get passed to the block), the you dont' need the mod in apply

This comment has been minimized.

Copy link
@jtratner

jtratner Sep 13, 2013

Author Contributor

duh - good catch. 😄

@hayd

View changes

pandas/core/internals.py Outdated
else:
new_axes = list(self.axes)
return self.apply('copy', axes=new_axes, deep=deep,
ref_items=new_axes[0] do_integrity_check=False)

This comment has been minimized.

Copy link
@hayd

hayd Sep 13, 2013

Contributor

comma missing?

This comment has been minimized.

Copy link
@jtratner

jtratner Sep 13, 2013

Author Contributor

whoops. that's why gotta test locally too...

@jtratner

This comment has been minimized.

Copy link
Contributor Author

commented Sep 14, 2013

this works for everything but SparseSeries and SparsePanel at this point...still trying to figure them out.

@jtratner

View changes

pandas/core/groupby.py Outdated
@@ -1988,7 +1988,8 @@ def transform(self, func, *args, **kwargs):

# broadcasting
if isinstance(res, Series):
if res.index is obj.index:
# TODO: Possibly could just check for length here...not clear
if res.index.identical(obj.index):

This comment has been minimized.

Copy link
@jtratner

jtratner Sep 15, 2013

Author Contributor

@jreback, @cpcloud or others - can't do this check any more here, because of the change in how copying works (and because more ops end up returning views rather than the same object), I think identical is okay, because it seems relatively fast. I didn't see much of a change on perf tests, but they are relatively unreliable on my computer. What do you think? I wasn't sure whether what mattered was whether it was the same object or if it was just the length...

This comment has been minimized.

Copy link
@cpcloud

cpcloud Sep 15, 2013

Member

Hm is is much faster than identical, even for tiny indices:

In [9]: i = Index([1, 2, 3])

In [10]: timeit i.identical(i)
1000000 loops, best of 3: 1.43 µs per loop

In [11]: timeit i is i
10000000 loops, best of 3: 60.2 ns per loop

Is there any way to try the is check and if that fails then use identical.

This comment has been minimized.

Copy link
@jtratner

jtratner Sep 15, 2013

Author Contributor

@cpcloud with these changes, in the entire test suite, the is check never succeeds. I was thinking of checking is on the base first, since it's a view. Would that work? Here's what I get timing wise for this:

In [3]: %timeit i is i
%timeit 10000000 loops, best of 3: 95.6 ns per loop

In [4]: %timeit i.base is i.base
1000000 loops, best of 3: 369 ns per loop

In [5]: %timeit i.identical(i)
100000 loops, best of 3: 2.1 µs per loop

Works with MultiIndex too:

In [6]: mi = MultiIndex.from_tuples(zip(range(10), range(10)))

In [7]: mi.base
Out[7]: array([], dtype=object)
In [8]: mi2 = MultiIndex.from_tuples(zip(range(10), range(10))
   ...: )

In [9]: mi2
Out[9]:
MultiIndex
[(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (7, 7), (8, 8), (9, 9)]

In [10]: mi2.base
Out[10]: array([], dtype=object)

In [11]: mi.base is mi2.base
Out[11]: False

In [12]: %timeit mi is mi
%timeit mi.base i10000000 loops, best of 3: 95.8 ns per loop

In [13]: %timeit mi.base is mi.base
1000000 loops, best of 3: 351 ns per loop

In [14]: %timeit mi.identical(mi)
100000 loops, best of 3: 12.6 µs per loop

In [15]: %timeit mi.base is mi2.base
1000000 loops, best of 3: 355 ns per loop

Length check is not MI friendly:

In [18]: %timeit len(mi) == len(mi2)
100000 loops, best of 3: 4.96 µs per loop

btw - maybe we should, at some point, optimize the length check on multiindex...

This comment has been minimized.

Copy link
@jtratner

jtratner Sep 15, 2013

Author Contributor

nope. base doesn't work in every case either :-(

@jreback

This comment has been minimized.

Copy link
Contributor

commented Sep 15, 2013

re identical

u might try something like

if a is b or a.base is b.base: ( maybe do a share memory check too)
return True

return a.identizal(b)

so the actual identity case is a lot faster

@jtratner

This comment has been minimized.

Copy link
Contributor Author

commented Sep 15, 2013

Problem is that bases gets changed with view. I.e.:

In [2]: i = Index(range(10)
   ...: )

In [3]: i.base
Out[3]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [4]: i.view().base
Out[4]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int64)

In [5]: i.view().view().base is i
Out[5]: True

so slight tweak on base.

@jtratner

This comment has been minimized.

Copy link
Contributor Author

commented Sep 15, 2013

@jreback @cpcloud here's what I'm thinking for index, adding an is_ method:

    def is_(self, other):
        """returns True if and only if the indices actually are the same underlying array.
         Generally survives through views. Doesn't care about metadata."""
        if self is other:
            return True
        s_base, o_base = self.base, other.base
        return self is o_base or other is s_base or s_base is o_base

(base calls aren't cached, which is why it needs to be setup this way)

@cpcloud

View changes

pandas/sparse/panel.py Outdated
d = self._construct_axes_dict()
if deep:
new_data = dict((k, v.copy()) for k, v in compat.iteritems(self._frames))
d = dict((k, v.copy()) for k, v in compat.iteritems(d))

This comment has been minimized.

Copy link
@cpcloud

cpcloud Sep 15, 2013

Member

should these be v.copy(deep=True)?

This comment has been minimized.

Copy link
@jreback

jreback Sep 15, 2013

Contributor

yes they should

This comment has been minimized.

Copy link
@jtratner

jtratner Sep 15, 2013

Author Contributor

I'll change.

@jtratner

This comment has been minimized.

Copy link
Contributor Author

commented Sep 15, 2013

blah, this doesn't quite work either :-/ multiple levels of view screw up .base.

@jtratner

This comment has been minimized.

Copy link
Contributor Author

commented Sep 16, 2013

It's going to take some thinking to decide on how best to efficiently handle flexible metadata for Index. I was thinking of setting an _id attribute that gets copied on views but changed when slicing, copying or setting attributes. then the is_ check would just be comparing the equality of two ids (probably hashes of id + type name).

@jtratner

This comment has been minimized.

Copy link
Contributor Author

commented Sep 21, 2013

Got this working (and faster) with the is_ method from #4909. @cpcloud you want to take a look at either this or #4909 to see whether you think this is fast enough for groupby?

@cpcloud

This comment has been minimized.

Copy link
Member

commented Sep 21, 2013

i'll vbench it and report back :)

@jtratner

This comment has been minimized.

Copy link
Contributor Author

commented Sep 21, 2013

if it's not good enough, we can shave off about 600-800ns per iteration by skipping the function call and using self.index._id is other.index._id

@jtratner

This comment has been minimized.

Copy link
Contributor Author

commented Sep 22, 2013

Is this the right behavior?

ind = Index(range(10))
df = DataFrame(range(10), ind)
assert df.index is ind
assert (df + 1).index is ind
@jreback

This comment has been minimized.

Copy link
Contributor

commented Sep 22, 2013

@jtratner yep

@@ -116,7 +116,7 @@ def __init__(self, data, index=None, sparse_index=None, kind='block',

if is_sparse_array:
if isinstance(data, SparseSeries) and index is None:
index = data.index
index = data.index.view()

This comment has been minimized.

Copy link
@jtratner

jtratner Sep 22, 2013

Author Contributor

so then, should all of these places actually make views on the index? Less sure about sparse...

This comment has been minimized.

Copy link
@jreback

jreback Sep 22, 2013

Contributor

in this case what's the different? whether I assign an index or its view?

This comment has been minimized.

Copy link
@jtratner

jtratner Sep 22, 2013

Author Contributor

just whether you want to be creating a view every time. Minimal difference (if at all).

jtratner and others added 3 commits Sep 12, 2013
BUG: Fix copy s.t. it always copies index/columns.
Only copies index/columns with `deep=True` on `BlockManager`. Plus some
tests...yay! Change copy to make views of indices.

Requires changing groupby to check whether indices are identical, rather
than comparing with `is`. Plus add tests for name checks through
everything. Also, fixes tests to use `is_()` rather than `is`

CLN: Change groupby to use 'is_' instead
@jtratner

This comment has been minimized.

Copy link
Contributor Author

commented Sep 24, 2013

@jreback @cpcloud does this look okay to merge?

@jreback

This comment has been minimized.

Copy link
Contributor

commented Sep 24, 2013

ok

@cpcloud

This comment has been minimized.

Copy link
Member

commented Sep 24, 2013

yep

jtratner added a commit that referenced this pull request Sep 24, 2013
Merge pull request #4830 from jtratner/copy-index-and-columns
BUG: Fix copy s.t. it always copies index/columns.

@jtratner jtratner merged commit d7d9a6c into pandas-dev:master Sep 24, 2013

@jtratner jtratner deleted the jtratner:copy-index-and-columns branch Sep 24, 2013

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.