ENH: Sparse int64 and bool dtype support enhancement #13849
sinhrks
added Enhancement Dtypes Sparse
labels
Jul 30, 2016
sinhrks
added this to the
0.19.0
milestone
Jul 30, 2016
codecov-io
commented
Jul 30, 2016
Current coverage is 85.27% (diff: 98.63%)
@@             master   #13849   diff @@
==========================================
  Files           139      139
  Lines         50511    50523    +12
  Methods           0        0
  Messages          0        0
  Branches          0        0
==========================================
+ Hits          43071    43083    +12
  Misses         7440     7440
  Partials          0        0
This was referenced Aug 1, 2016
jreback
added a commit
that referenced
this pull request
Aug 4, 2016
|
|
sinhrks + jreback |
2beab41
|
|
@sinhrks getting tons of warnings compiling on windows....all the same
|
This was referenced Aug 8, 2016
jreback
added a commit
that referenced
this pull request
Aug 9, 2016
|
|
sinhrks + jreback |
ae26ec7
|
sinhrks
changed the title from
(WIP)ENH: Sparse now supports int64 and bool dtype to ENH: Sparse now supports int64 and bool dtype
Aug 16, 2016
|
rebased and added the doc, now ready for review. |
jreback
commented on an outdated diff
Aug 18, 2016
jreback
commented on an outdated diff
Aug 18, 2016
| @@ -777,6 +779,43 @@ Sparse Changes | ||
| These changes allow pandas to handle sparse data with more dtypes, and for work to make a smoother experience with data handling. | ||
| +- Sparse data structure now supports ``int64`` and ``bool`` ``dtype`` (:issue:`13849`) | ||
| + | ||
| +Previously, sparse data have ``float64`` dtype by default, even if all inputs are ``int`` or ``bool``. You had to specify ``dtype`` explicitly to create sparse data with ``int64`` dtype. Also, you must specify ``fill_value`` to actually sparcify the data, becuase ``fill_value`` 's default is ``np.nan`` which doesn't appear in ``int64`` data. |
|
|
|
lgtm. just some doc corrections. ping on green. |
jreback
commented on an outdated diff
Aug 18, 2016
| @@ -13,6 +13,8 @@ Highlights include: | ||
| - ``.rolling()`` are now time-series aware, see :ref:`here <whatsnew_0190.enhancements.rolling_ts>` | ||
| - pandas development api, see :ref:`here <whatsnew_0190.dev_api>` | ||
| - ``PeriodIndex`` now has its own ``period`` dtype. see ref:`here <whatsnew_0190.api.perioddtype>` | ||
| +- :func:`read_csv` now supports parsing ``Categorical`` data, see :ref:`here <whatsnew_0190.enhancements.read_csv_categorical>` |
|
|
jreback
commented on the diff
Aug 18, 2016
| else: | ||
| # array-like | ||
| if sparse_index is None: | ||
| - values, sparse_index = make_sparse(data, kind=kind, | ||
| - fill_value=fill_value) | ||
| + if dtype is not None: |
jreback
Contributor
|
jreback
commented on an outdated diff
Aug 18, 2016
| @@ -255,6 +245,16 @@ def _simple_new(cls, data, sp_index, fill_value): | ||
| result._fill_value = fill_value | ||
| return result | ||
| + @classmethod | ||
| + def _get_default_fill_value(cls, arr_or_dtype): | ||
| + if is_bool_dtype(arr_or_dtype): | ||
| + # if we have a bool type, make sure that we have a bool fill_value | ||
| + return False | ||
| + elif is_integer_dtype(arr_or_dtype): | ||
| + return 0 | ||
| + else: |
jreback
Contributor
|
|
@jreback Thx for review. One point is whether we should prohibit dtypes other than "relatively-well" supported ones (currently There are few issues which uses sparse data with CC: @sstanovnik |
sinhrks
changed the title from
ENH: Sparse now supports int64 and bool dtype to ENH: Sparse int64 and bool dtype support enhancement
Aug 20, 2016
|
yeah |
|
Let me add my 2¢, since you went as far as CC-ing me. I can't make a very informed opinion, since I don't know enough about pandas' internals, and I obviously have an interest (biolab/orange3#1347) for supporting arbitrary types. My thoughts are that you should be able to throw the same kind of data in a dense or a sparse DataFrame so that they are equivalent. An example off the top of my head is a SparseDataFrame with a recommendation dataset with rows as movies and columns as users, and additional metadata (string) columns about each movie. I don't know if this is possible, but judging from my time with BlockManager, you could maybe use dense string columns mixed in-between an otherwise sparse structure, if supporting sparse string storage is too hard. As I said, I may be completely off-target here, just some thoughts :) |
|
Sorry, not familiar with sparse. But: using object dtype, does it work enough to use it for certain cases? If yes, I would not remove it. |
|
@sinhrks Does this also close pydata#13110? |
|
I think object dtype can be used in some cases, but I'm not fully sure as it is not well tested. I won't remove it ATM and will add more tests to clarify (in another PR). #13110 should be closed. Added whatsnew.
jreback
commented on an outdated diff
Aug 25, 2016
| @@ -17,6 +17,7 @@ Highlights include: | ||
| - ``.rolling()`` are now time-series aware, see :ref:`here <whatsnew_0190.enhancements.rolling_ts>` | ||
| - pandas development api, see :ref:`here <whatsnew_0190.dev_api>` | ||
| - ``PeriodIndex`` now has its own ``period`` dtype. see ref:`here <whatsnew_0190.api.perioddtype>` | ||
| +- Sparse now supports other ``int`` and ``bool`` dtypes, see :ref:`here <whatsnew_0190.sparse>` |
|
|
jreback
commented on the diff
Aug 25, 2016
jreback
commented on an outdated diff
Aug 25, 2016
| @@ -790,6 +791,50 @@ Sparse Changes | ||
| These changes allow pandas to handle sparse data with more dtypes, and for work to make a smoother experience with data handling. | ||
| + | ||
| +``int64`` and ``bool`` support enhancements | ||
| +""""""""""""""""""""""""""""""""""""""""""" | ||
| + | ||
| +Sparse data structure now gained enhanced support of ``int64`` and ``bool`` ``dtype`` (:issue:`667`, :issue:`13849`) |
|
|
jreback
commented on an outdated diff
Aug 25, 2016
| @@ -790,6 +791,50 @@ Sparse Changes | ||
| These changes allow pandas to handle sparse data with more dtypes, and for work to make a smoother experience with data handling. | ||
| + | ||
| +``int64`` and ``bool`` support enhancements | ||
| +""""""""""""""""""""""""""""""""""""""""""" | ||
| + | ||
| +Sparse data structure now gained enhanced support of ``int64`` and ``bool`` ``dtype`` (:issue:`667`, :issue:`13849`) | ||
| + | ||
| +Previously, sparse data have ``float64`` dtype by default, even if all inputs are ``int`` or ``bool``. You had to specify ``dtype`` explicitly to create sparse data with ``int64`` dtype. Also, you must specify ``fill_value`` to actually sparcify the data, becuase ``fill_value`` 's default is ``np.nan`` which doesn't appear in ``int64`` data. |
jreback
Contributor
|
jreback
commented on the diff
Aug 25, 2016
| res = s.fillna(-1) | ||
| exp = SparseArray([0, 0, 0, 0], fill_value=0) | ||
| tm.assert_sp_array_equal(res, exp) | ||
| + # fill_value can be nan if there is no missing hole. | ||
| + # only fill_value will be changed | ||
| + s = SparseArray([0, 0, 0, 0], fill_value=np.nan) |
jreback
Contributor
|
|
Disclaimer: I have never used sparse and am not familiar with the implementation (so my apologies if this is a stupid or naive question), but I quickly looked at the PR and have the following question. Previously, for integer and boolean Series, the 0 or False values were regarded as actual values, not as an indication of 'not a value' in the sparse series. Isn't this a big change? (I don't know how much you could use it before this PR for this to be a problem) |
|
OK, so probably my question should be categorized in the naive category :-) |
jorisvandenbossche
commented on the diff
Aug 27, 2016
| @@ -132,6 +132,61 @@ keeps an arrays of all of the locations where the data are not equal to the | ||
| fill value. The ``block`` format tracks only the locations and sizes of blocks | ||
| of data. | ||
| +.. _sparse.dtype: | ||
| + | ||
| +Sparse Dtypes | ||
| +------------- | ||
| + | ||
| +Sparse data should have the same dtype as its dense representation. Currently, | ||
| +``float64``, ``int64`` and ``bool`` dtypes are supported. Depending on the original | ||
| +dtype, ``fill_value`` default changes: |
jorisvandenbossche
Owner
|
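The integer and block index formats described in the diff context above can be illustrated with a small pure-Python sketch. The helper names `integer_index` and `block_index` are hypothetical and are not pandas functions; the sketch only shows what each format records, under the assumption that a position is stored whenever its value differs from the fill value.

```python
def integer_index(values, fill_value):
    """Integer format: every location whose value differs from fill_value."""
    return [i for i, v in enumerate(values) if v != fill_value]


def block_index(values, fill_value):
    """Block format: start location and length of each run of stored values."""
    locs, lengths = [], []
    prev = None
    for i in integer_index(values, fill_value):
        if prev is not None and i == prev + 1:
            # Position continues the current block.
            lengths[-1] += 1
        else:
            # Position starts a new block.
            locs.append(i)
            lengths.append(1)
        prev = i
    return locs, lengths


data = [0, 0, 1, 2, 0, 3, 0, 0, 4, 5, 6]
print(integer_index(data, fill_value=0))  # [2, 3, 5, 8, 9, 10]
print(block_index(data, fill_value=0))    # ([2, 5, 8], [2, 1, 3])
```

The block format is more compact when stored values cluster into long runs, while the integer format is simpler when they are scattered.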
jorisvandenbossche
commented on an outdated diff
Aug 27, 2016
|
Joris, your example already works: you can have any values you want as actual values (both True and False). The fill value is the missing-value indicator used when I need to densify (it's the default), so this is not a conceptual change at all, just a change to keep dtype consistency. |
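The distinction made in the comment above, between explicitly stored values and the fill value, can be sketched in a few lines of plain Python. The function `densify` is a hypothetical helper written for illustration, not the pandas implementation; the names `sp_values` and `sp_index` are borrowed from pandas' attribute naming only for readability.

```python
def densify(sp_values, sp_index, length, fill_value):
    """Rebuild a dense list from explicitly stored values.

    sp_values: the values that are physically stored
    sp_index:  the position of each stored value in the dense result
    Any position not listed in sp_index receives fill_value.
    """
    dense = [fill_value] * length
    for pos, value in zip(sp_index, sp_values):
        dense[pos] = value
    return dense


# Both True and False are stored as real values at positions 0 and 3;
# fill_value=False only covers the positions that are not stored.
dense = densify([True, False], [0, 3], length=5, fill_value=False)
print(dense)  # [True, False, False, False, False]
```

In this sketch a stored `False` and a filled `False` are indistinguishable after densifying, which is why the choice of fill value affects only storage and dtype, not the dense result.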
|
@jreback I was looking at the |
|
@jreback Is the rest of this PR OK to merge for you, Jeff? (it's closing a lot of issues for 0.19.0 :-)) |
|
@sinhrks Can you update the docstrings for SparseDataFrame, SparseSeries and SparseArray? They all still mention the fact that only floats are supported or that nan is the default fill value. |
jorisvandenbossche
merged commit b6d3a81
into pandas-dev:master
Aug 31, 2016
|
@sinhrks Thanks a lot! |
|
@sinhrks appveyor started failing (some int dtype issues):
|
|
@jorisvandenbossche thx for pointing out, will fix. |
sinhrks commented Jul 30, 2016
git diff upstream/master | flake8 --diff

Currently, sparse doesn't actually support int64 and bool dtypes. When int or bool values are passed, they are coerced to float64 unless the dtype keyword is explicitly specified.

on current master
after this PR

The created data should have the dtype of the passed values (the same as a normal Series).

Also, fill_value is automatically chosen according to the following rules (because np.nan cannot appear in int or bool dtype):

Basic rule: the sparse dtype must not change when it is converted to dense.

- sparse_index is specified and data has a hole (missing values):
  - fill_value is np.nan
  - dtype is float64 or object (which can store both data and fill_value)
- sparse_index is None (all values are provided via data, no missing values):
  - If fill_value is not explicitly passed, the following default is used, depending on the dtype:
    - float: np.nan
    - int: 0
    - bool: False
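
The default fill_value rules above can be checked with a short sketch. It assumes a recent pandas in which the class is exposed as `pd.arrays.SparseArray`; the PR itself targeted the 0.19-era sparse classes, so the exact spelling and printed output on that version may differ.

```python
import numpy as np
import pandas as pd

# Assumption: a recent pandas where SparseArray lives under pd.arrays.
ints = pd.arrays.SparseArray([1, 0, 0, 2])
bools = pd.arrays.SparseArray([True, False, False])
floats = pd.arrays.SparseArray([1.0, np.nan, np.nan])

for arr in (ints, bools, floats):
    # dtype.subtype is the dtype of the stored values; fill_value is the
    # value implied at every position that is not stored explicitly.
    print(arr.dtype.subtype, arr.fill_value, arr.sp_values)

# Converting to dense keeps the integer dtype, which is the basic rule above.
dense = np.asarray(ints)
print(dense.dtype, dense)
```

With no fill_value passed, the integer input is expected to get 0, the boolean input False, and the float input NaN, so none of the three needs to be coerced to float64 to be stored sparsely.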