Support dtypes other than float in sparse data structures #667

Closed
wesm opened this Issue Jan 23, 2012 · 5 comments

Owner

wesm commented Jan 23, 2012

No description provided.

Contributor

jreback commented Jul 24, 2013

Partially addressed in #3482.

jreback added the Testing label Feb 15, 2014

jreback modified the milestone: 0.15.0, 0.14.0 Feb 15, 2014

jreback modified the milestone: 0.16.0, 0.17.0 Jan 26, 2015

Would it be easier to work towards supporting floats other than float64 first? Supporting other kinds of dtypes in general seems like a major effort that should probably be tackled in smaller steps.

I'm particularly interested in reducing memory usage of dummy variables (i.e. bool) and small-valued counts (e.g. uint8). When sparsifying dataframes that contain such dtypes, being able to convert to float16 rather than float64 would already help a lot.

I've posted a StackOverflow question (and answer) regarding my attempts to achieve this.
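
To make the memory argument above concrete, here is a minimal sketch of the per-value storage cost for the dtypes mentioned. It uses only the standard library's `struct` module as a stand-in for the corresponding NumPy dtypes (the element sizes are the same), and it ignores the sparse index overhead entirely. The names `ITEM_SIZE`, `stored_bytes`, `survives_float16`, and the count `n_stored` are hypothetical and chosen for illustration only.

```python
import struct

# Bytes per stored value. The struct format codes stand in for the NumPy
# dtypes named in the comment above; the element sizes are the same.
ITEM_SIZE = {
    "float64": struct.calcsize("d"),
    "float16": struct.calcsize("e"),
    "uint8": struct.calcsize("B"),
    "bool": struct.calcsize("?"),
}


def stored_bytes(n_stored, dtype):
    """Bytes used by the values of n_stored non-fill entries (index overhead ignored)."""
    return n_stored * ITEM_SIZE[dtype]


def survives_float16(value):
    """Return True if value round-trips through IEEE 754 half precision unchanged."""
    return struct.unpack("e", struct.pack("e", float(value)))[0] == value


n_stored = 1_000_000  # hypothetical number of non-fill values
sizes = {dtype: stored_bytes(n_stored, dtype) for dtype in ITEM_SIZE}

# Every uint8 value (0..255) is exactly representable in float16, so small
# counts and dummy variables would not lose information in that conversion.
uint8_safe = all(survives_float16(v) for v in range(256))

for dtype, nbytes in sizes.items():
    print(f"{dtype:8s} {nbytes:>10,d} bytes")
print("uint8 values survive float16:", uint8_safe)
```

Under these assumptions, float16 storage takes a quarter of the bytes of float64, while native bool or uint8 storage would halve that again. Larger integers are not safe in float16: `survives_float16(2049)` is false, because half precision only represents integers exactly up to 2048.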

kawochen referenced this issue Apr 10, 2016

Open

BUG: Sparse master issue #10627

11 of 18 tasks complete
Contributor

jreback commented Apr 10, 2016

@aolieman sparse already supports quite a few dtypes, and more in 0.18.1, see changes here

@jreback thanks for mentioning the 0.18.1 fixes. They should solve some of the issues I encountered, but dtype coercion still occurs with frames. Even if a SparseDataFrame is directly constructed from a SparseSeries that has the desired dtype, the resulting frame always uses float64.

Attempts to construct a one-column frame in 0.18.0 (same result for multiple columns):

In []: dense_series = pd.Series([False]*5 + [True]*3 + [False]*5, dtype='bool', name='b')

In []: dense_df = pd.DataFrame(dense_series)

In []: sparse_df = dense_df.to_sparse(fill_value=False)

In []: sparse_df['b'].dtype
Out[]: dtype('float64')

In []: sparse_series = dense_series.to_sparse(fill_value=False)

In []: sparse_series.dtype
Out[]: dtype('bool')

In []: sparse_df = pd.SparseDataFrame(sparse_series)

In []: sparse_df['b'].dtype
Out[]: dtype('float64')

In []: sparse_df = pd.SparseDataFrame(sparse_series, dtype='bool')

In []: sparse_df['b'].dtype
Out[]: dtype('bool')

In []: sparse_df.info()
------------------------
[traceback omitted]
AttributeError: ("'SingleBlockManager' object has no attribute 'view'", 'occurred at index b')

In []: sparse_df['b'].values
Out[]:
SingleBlockManager
Items: RangeIndex(start=0, stop=13, step=1)
BoolBlock: 13 dtype: bool

My apologies if this is solved in 0.18.1 (which I'm not able to test right now) or if I'm doing it wrong.

Contributor

jreback commented Apr 11, 2016

Well, it's an open issue. You're welcome to contribute tests and such.

jreback added a commit that referenced this issue Aug 3, 2016

sinhrks + jreback ENH: add sparse op for other dtypes
closes #13848
xref #667
45d54d0

jreback modified the milestone: 0.19.0, Next Major Release Aug 18, 2016
