BUG: fix NDFrame.as_blocks() for sparse containers #6748

Merged: jreback merged 1 commit into pandas-dev:master on Mar 31, 2014

Conversation

@immerrr (Contributor) commented Mar 31, 2014:

SparseBlocks don't consolidate, so the previous implementation silently dropped all but the last block for each dtype:

In [1]: pd.__version__
Out[1]: '0.13.1-527-g73506cb'

In [2]: pd.SparseDataFrame({'a': [1,2,3, np.nan, np.nan], 'b': [1,2,3, np.nan, np.nan]})
Out[2]: 
    a   b
0   1   1
1   2   2
2   3   3
3 NaN NaN
4 NaN NaN

[5 rows x 2 columns]

In [3]: _2.blocks
Out[3]: 
{'float64':     b
0   1
1   2
2   3
3 NaN
4 NaN

[5 rows x 1 columns]}

whereas the last output should be:

In [3]: _2.blocks
Out[3]: 
{'float64':     a   b
0   1   1
1   2   2
2   3   3
3 NaN NaN
4 NaN NaN

[5 rows x 2 columns]}

This change also drops the columns kwarg of as_blocks, since it duplicates reindex functionality.
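
For reference, here is a minimal sketch of the corrected approach (the method name is hypothetical, and it assumes the internal BlockManager.combine(blocks) helper, which merges a list of blocks into a single manager): collect all blocks per dtype first, then combine each group, instead of letting the last block per dtype win:

def as_blocks_sketch(self):
    # sparse blocks never consolidate, so there may be several
    # blocks with the same dtype: collect them all
    by_dtype = {}
    for b in self._data.blocks:
        by_dtype.setdefault(str(b.dtype), []).append(b)
    # combine each group into one container instead of keeping
    # only the last block, as the old implementation effectively did
    result = {}
    for dtype, blocks in by_dtype.items():
        combined = self._data.combine(blocks)
        result[dtype] = self._constructor(combined).__finalize__(self)
    return result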

(The review thread below is attached to this hunk of the as_blocks diff:)

-                BlockManager([b], [b.items, self.index])).__finalize__(self)
-        return bd
+            bd.setdefault(str(b.dtype), []).append(b)

Contributor commented:

had meant to move the guts of this to internals anyhow; e.g. maybe get_as_blocks can return the dict of str(dtype) to block manager (already combined), so generic just needs to iterate and call _constructor(...).__finalize__(...)

if it works better to refactor this later (in your internals changes checklist), ok too

lmk

Contributor (Author) commented:

This will most likely get overwritten during refactoring.

It did occur to me that this is very much like the "group blocks by ftype" operation that happens during merging/concatenation. And a generic implementation looks rather trivial:

import itertools

def mapreduce_blocks(mgr, keyfunc, reducer):
    # itertools.groupby only groups *consecutive* elements,
    # so sort the blocks by key first
    blocks = sorted(mgr.blocks, key=keyfunc)
    return dict((key, reducer(list(val_iter)))
                for key, val_iter in itertools.groupby(blocks, key=keyfunc))

def group_blocks_by_ftype(mgr):
    return mapreduce_blocks(mgr, keyfunc=lambda b: str(b.ftype),
                            reducer=list)

def combine_blocks_by_dtype(mgr):
    return mapreduce_blocks(mgr, keyfunc=lambda b: str(b.dtype),
                            reducer=mgr.combine)
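
A quick usage sketch (hypothetical; mgr is a frame's BlockManager, and the keys shown are only illustrative):

mgr = df._data
group_blocks_by_ftype(mgr)    # e.g. {'float64:sparse': [block_a, block_b]}
combine_blocks_by_dtype(mgr)  # e.g. {'float64': <combined BlockManager>}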

Such one-liner functions are indeed a natural extension of the BlockManager external API, but I'm not sure they deserve to be part of it.

Contributor commented:

ok....maybe just reference this issue then in your refactoring...

will merge then

Contributor (Author) commented:

I've decided to split this off into a separate comment.

Which brings me to another topic: minimalism in API design. I tend to share the opinion that an API is perfect not when you can't add anything else, but when you can't take anything away from it. It happens quite often that an interface gets overburdened with small details; e.g. it hurts my eyes to wade through the numpy.ndarray attrs/methods when I forget the exact spelling of the one I want and know is there:

In [1]: np.arange(10)
Out[1]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [2]: _1.
_1.T             _1.choose        _1.data          _1.flatten       _1.nbytes        _1.repeat        _1.sort          _1.tostring
_1.all           _1.clip          _1.diagonal      _1.getfield      _1.ndim          _1.reshape       _1.squeeze       _1.trace
_1.any           _1.compress      _1.dot           _1.imag          _1.newbyteorder  _1.resize        _1.std           _1.transpose
_1.argmax        _1.conj          _1.dtype         _1.item          _1.nonzero       _1.round         _1.strides       _1.var
_1.argmin        _1.conjugate     _1.dump          _1.itemset       _1.prod          _1.searchsorted  _1.sum           _1.view
_1.argsort       _1.copy          _1.dumps         _1.itemsize      _1.ptp           _1.setfield      _1.swapaxes      
_1.astype        _1.ctypes        _1.fill          _1.max           _1.put           _1.setflags      _1.take          
_1.base          _1.cumprod       _1.flags         _1.mean          _1.ravel         _1.shape         _1.tofile        
_1.byteswap      _1.cumsum        _1.flat          _1.min           _1.real          _1.size          _1.tolist        

And, unfortunately, pandas containers add more on top of that:

In [1]: pd.Series()
Out[1]: Series([], dtype: float64)

In [2]: _1.
Display all 225 possibilities? (y or n)

The desire to have a bit of everything at your fingertips is tempting indeed, but 225 (two hundred and twenty-five!) methods and properties are a bit too many for my liking. And it is a lot to wrap your head around when you're only starting out.

Contributor commented:

that's a valid point, except that you can search the docs (lots of docs), and you can always start the completion with a letter (to narrow things down).

pandas provides a lot of functionality (as does numpy). not sure that this is a problem per se.

the alternative is cryptic functions that do too much with too much overloading, which IMHO is a bigger/worse problem.

Contributor (Author) commented:

> the alternative is cryptic functions that do too much with too much overloading, which IMHO is a bigger/worse problem.

I like how overloading is handled in BLAS interfaces: there is a naming convention of prefixes/suffixes that correspond to certain variations of the core algorithm, so you as a programmer make an educated guess about what the data is going to look like, or just use the generic implementation.
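
For illustration, scipy exposes those typed BLAS routines directly (a small sketch, assuming scipy is installed):

import numpy as np
from scipy.linalg import blas

a = np.ones((2, 2))
b = np.eye(2)

# the 'gemm' suffix names the core algorithm (general matrix multiply);
# the prefix selects the data variant, so there is no run-time overloading
blas.sgemm(1.0, a, b)                                  # single-precision real
blas.dgemm(1.0, a, b)                                  # double-precision real
blas.zgemm(1.0, a.astype(complex), b.astype(complex))  # double-precision complex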

As for the naming, I've gone through the API reference a bit and found a ray of hope: the string functions, which obviously form a cluster of functionality for Series containers, are conveniently grouped under the .str attribute. That's a very nice application of namespacing, and a viable approach to the API clutter problem. Following the Zen of Python: Namespaces are one honking great idea -- let's do more of those!
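
For example, the whole string-method cluster sits behind a single attribute:

import pandas as pd

s = pd.Series(['apple', 'Banana', 'cherry'])
s.str.upper()         # string methods are namespaced under .str
s.str.contains('an')  # -> boolean Series, no root-level clutter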

Contributor commented:

Whenever I monkey-patch new functionality I try to create namespaces. Since we control the IPython autocomplete, we could always move methods from the root into namespaces; the original methods would still work, they just wouldn't autocomplete at the root. I do this for subclasses of pandas objects: if I created a subclass like OHLC, I'd rather only see the custom methods I made.
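
A minimal sketch of that pattern (all names here are hypothetical): a property returns a small namespace object wrapping the parent, so the custom helpers autocomplete under one attribute:

import pandas as pd

class OHLCMethods(object):
    # hypothetical namespace object bundling custom OHLC helpers
    def __init__(self, frame):
        self._frame = frame

    def spread(self):
        # assumes the wrapped frame has 'high' and 'low' columns
        return self._frame['high'] - self._frame['low']

class OHLC(pd.DataFrame):
    @property
    def ohlc(self):
        # custom methods live under .ohlc instead of the root namespace
        return OHLCMethods(self)

With that, OHLC({'high': [2, 3], 'low': [1, 1]}).ohlc.spread() computes the per-row range while the frame's root namespace stays uncluttered.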

@jreback jreback added this to the 0.14.0 milestone Mar 31, 2014
jreback added a commit that referenced this pull request Mar 31, 2014
BUG: fix NDFrame.as_blocks() for sparse containers
@jreback jreback merged commit 1c438e8 into pandas-dev:master Mar 31, 2014
Labels: Bug, Internals (related to non-user-accessible pandas implementation), Sparse (sparse data type)