BUG: fix NDFrame.as_blocks() for sparse containers #6748

Merged: jreback merged 1 commit into pandas-dev:master on Mar 31, 2014

Conversation

@immerrr (Contributor) commented Mar 31, 2014:

SparseBlocks don't consolidate, so the previous implementation silently dropped all but the last block for each dtype:

In [1]: pd.__version__
Out[1]: '0.13.1-527-g73506cb'

In [2]: pd.SparseDataFrame({'a': [1,2,3, np.nan, np.nan], 'b': [1,2,3, np.nan, np.nan]})
Out[2]: 
    a   b
0   1   1
1   2   2
2   3   3
3 NaN NaN
4 NaN NaN

[5 rows x 2 columns]

In [3]: _2.blocks
Out[3]: 
{'float64':     b
0   1
1   2
2   3
3 NaN
4 NaN

[5 rows x 1 columns]}

whereas the last output should be:

In [3]: _2.blocks
Out[3]: 
{'float64':     a   b
0   1   1
1   2   2
2   3   3
3 NaN NaN
4 NaN NaN

[5 rows x 2 columns]}

This change also drops the columns kwarg of as_blocks, since it duplicates reindex functionality.
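
For reference, here is a minimal sketch of the corrected approach (the method name is hypothetical, and it assumes the internal BlockManager.combine(blocks) helper, which merges a list of blocks into a single manager): collect all blocks per dtype first, then combine each group, instead of letting the last block per dtype win:

def as_blocks_sketch(self):
    # sparse blocks never consolidate, so there may be several
    # blocks with the same dtype: collect them all
    by_dtype = {}
    for b in self._data.blocks:
        by_dtype.setdefault(str(b.dtype), []).append(b)
    # combine each group into one container instead of keeping
    # only the last block, as the old implementation effectively did
    result = {}
    for dtype, blocks in by_dtype.items():
        combined = self._data.combine(blocks)
        result[dtype] = self._constructor(combined).__finalize__(self)
    return result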

(The review thread below is attached to this hunk of the as_blocks diff:)

-                BlockManager([b], [b.items, self.index])).__finalize__(self)
-        return bd
+            bd.setdefault(str(b.dtype), []).append(b)

Contributor commented:

had meant to move the guts of this to internals anyhow; e.g. maybe get_as_blocks can return the dict of str(dtype) to block manager (already combined), so generic just needs to iterate and call _constructor(...).__finalize__(...)

if it works better to refactor this later (in your internals changes checklist), ok too

lmk

Contributor (Author) commented:

This will most likely get overwritten during refactoring.

It did occur to me that this is very much like the "group blocks by ftype" operation that happens during merging/concatenation. And a generic implementation looks rather trivial:

import itertools

def mapreduce_blocks(mgr, keyfunc, reducer):
    # itertools.groupby only groups *consecutive* elements,
    # so sort the blocks by key first
    blocks = sorted(mgr.blocks, key=keyfunc)
    return dict((key, reducer(list(val_iter)))
                for key, val_iter in itertools.groupby(blocks, key=keyfunc))

def group_blocks_by_ftype(mgr):
    return mapreduce_blocks(mgr, keyfunc=lambda b: str(b.ftype),
                            reducer=list)

def combine_blocks_by_dtype(mgr):
    return mapreduce_blocks(mgr, keyfunc=lambda b: str(b.dtype),
                            reducer=mgr.combine)
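
A quick usage sketch (hypothetical; mgr is a frame's BlockManager, and the keys shown are only illustrative):

mgr = df._data
group_blocks_by_ftype(mgr)    # e.g. {'float64:sparse': [block_a, block_b]}
combine_blocks_by_dtype(mgr)  # e.g. {'float64': <combined BlockManager>}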

Such one-liner functions are indeed a natural extension of the BlockManager external API, but I'm not sure they deserve to be part of it.

Contributor commented:

ok....maybe just reference this issue then in your refactoring...

will merge then

Contributor (Author) commented:

I've decided to split this off into a separate comment.

Which brings me to another topic: minimalism in API design. I tend to share the opinion that an API is perfect not when you can't add anything else, but when you can't take anything away from it. It happens quite often that an interface gets overburdened with small details; e.g. it hurts my eyes to wade through the numpy.ndarray attrs/methods when I forget the exact spelling of the one I want and know is there:

In [1]: np.arange(10)
Out[1]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [2]: _1.
_1.T             _1.choose        _1.data          _1.flatten       _1.nbytes        _1.repeat        _1.sort          _1.tostring
_1.all           _1.clip          _1.diagonal      _1.getfield      _1.ndim          _1.reshape       _1.squeeze       _1.trace
_1.any           _1.compress      _1.dot           _1.imag          _1.newbyteorder  _1.resize        _1.std           _1.transpose
_1.argmax        _1.conj          _1.dtype         _1.item          _1.nonzero       _1.round         _1.strides       _1.var
_1.argmin        _1.conjugate     _1.dump          _1.itemset       _1.prod          _1.searchsorted  _1.sum           _1.view
_1.argsort       _1.copy          _1.dumps         _1.itemsize      _1.ptp           _1.setfield      _1.swapaxes      
_1.astype        _1.ctypes        _1.fill          _1.max           _1.put           _1.setflags      _1.take          
_1.base          _1.cumprod       _1.flags         _1.mean          _1.ravel         _1.shape         _1.tofile        
_1.byteswap      _1.cumsum        _1.flat          _1.min           _1.real          _1.size          _1.tolist        

And, unfortunately, pandas containers add more on top of that:

In [1]: pd.Series()
Out[1]: Series([], dtype: float64)

In [2]: _1.
Display all 225 possibilities? (y or n)

The desire to have a bit of everything at your fingertips is tempting indeed, but 225 (two hundred and twenty-five!) methods and properties are a bit too many for my liking. And it is a lot to wrap your head around when you're only starting out.

Contributor commented:

that's a valid point, except that you can search the docs (lots of docs), and you can always start the completion with a letter (to narrow things down).

pandas provides a lot of functionality (as does numpy). not sure that this is a problem per se.

the alternative is cryptic functions that do too much with too much overloading, which IMHO is a bigger/worse problem.

Contributor (Author) commented:

> the alternative is cryptic functions that do too much with too much overloading, which IMHO is a bigger/worse problem.

I like how overloading is handled in BLAS interfaces: there is a naming convention of prefixes/suffixes that correspond to certain variations of the core algorithm, so you as a programmer make an educated guess about what the data is going to look like, or just use the generic implementation.
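
For illustration, scipy exposes those typed BLAS routines directly (a small sketch, assuming scipy is installed):

import numpy as np
from scipy.linalg import blas

a = np.ones((2, 2))
b = np.eye(2)

# the 'gemm' suffix names the core algorithm (general matrix multiply);
# the prefix selects the data variant, so there is no run-time overloading
blas.sgemm(1.0, a, b)                                  # single-precision real
blas.dgemm(1.0, a, b)                                  # double-precision real
blas.zgemm(1.0, a.astype(complex), b.astype(complex))  # double-precision complex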

As for the naming, I've gone through the API reference a bit and found a ray of hope: the string functions, which obviously form a cluster of functionality for Series containers, are conveniently grouped under the .str attribute. That's a very nice application of namespacing, and a viable approach to the API clutter problem. Following the Zen of Python: Namespaces are one honking great idea -- let's do more of those!
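
For example, the whole string-method cluster sits behind a single attribute:

import pandas as pd

s = pd.Series(['apple', 'Banana', 'cherry'])
s.str.upper()         # string methods are namespaced under .str
s.str.contains('an')  # -> boolean Series, no root-level clutter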

Contributor commented:

Whenever I monkey-patch new functionality I try to create namespaces. Since we control the IPython autocomplete, we could always move methods from the root into namespaces; the original methods would still work, they just wouldn't autocomplete at the root. I do this for subclasses of pandas objects: if I created a subclass like OHLC, I'd rather only see the custom methods I made.
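
A minimal sketch of that pattern (all names here are hypothetical): a property returns a small namespace object wrapping the parent, so the custom helpers autocomplete under one attribute:

import pandas as pd

class OHLCMethods(object):
    # hypothetical namespace object bundling custom OHLC helpers
    def __init__(self, frame):
        self._frame = frame

    def spread(self):
        # assumes the wrapped frame has 'high' and 'low' columns
        return self._frame['high'] - self._frame['low']

class OHLC(pd.DataFrame):
    @property
    def ohlc(self):
        # custom methods live under .ohlc instead of the root namespace
        return OHLCMethods(self)

With that, OHLC({'high': [2, 3], 'low': [1, 1]}).ohlc.spread() computes the per-row range while the frame's root namespace stays uncluttered.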

@jreback jreback added this to the 0.14.0 milestone Mar 31, 2014
jreback added a commit that referenced this pull request Mar 31, 2014
BUG: fix NDFrame.as_blocks() for sparse containers
@jreback jreback merged commit 1c438e8 into pandas-dev:master Mar 31, 2014
Labels: Bug, Internals (related to non-user-accessible pandas implementation), Sparse (sparse data type)