ENH: Sparse int64 and bool dtype support enhancement #13849

Merged
merged 1 commit into from Aug 31, 2016

5 participants
Member

sinhrks commented Jul 30, 2016 edited

  • closes #667, closes #8292, closes #13001, closes #8276, closes #13110
    • tests added / passed
    • passes git diff upstream/master | flake8 --diff
    • whatsnew entry

Currently, sparse does not actually support int64 and bool dtypes. When int or bool values are passed, they are coerced to float64 unless the dtype keyword is explicitly specified.

on current master

pd.SparseArray([1, 2, 0, 0 ])
# [1.0, 2.0, 0.0, 0.0]
# Fill: nan
# IntIndex
# Indices: array([0, 1, 2, 3], dtype=int32)

pd.SparseArray([True, False, True])
# [1.0, 0.0, 1.0]
# Fill: nan
# IntIndex
# Indices: array([0, 1, 2], dtype=int32)

after this PR

The created data should have the dtype of the passed values (the same as a normal Series).

pd.SparseArray([1, 2, 0, 0 ])
# [1, 2, 0, 0]
# Fill: 0
# IntIndex
# Indices: array([0, 1], dtype=int32)

pd.SparseArray([True, False, True])
# [True, False, True]
# Fill: False
# IntIndex
# Indices: array([0, 2], dtype=int32)

Also, fill_value is automatically specified according to the following rules (because np.nan cannot appear in int or bool dtype):

Basic rule: the dtype must not change when sparse data is converted to dense.

  • If sparse_index is specified and data has a hole (missing values):
    • fill_value is np.nan
    • dtype is float64 or object (which can store both data and fill_value)
  • If sparse_index is None (all values are provided via data, no missing values)
    • if fill_value is not explicitly passed, the following default is used depending on the dtype:
      • float: np.nan
      • int: 0
      • bool: False
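
The default rules above can be sketched in plain Python. This is only an illustration and not the pandas implementation: `default_fill_value`, `_is_fill` and `sparsify` are hypothetical helpers, and Python lists stand in for NumPy arrays.

```python
import math


def default_fill_value(values):
    """Hypothetical helper: choose a default fill value from the element types."""
    if values and all(isinstance(v, bool) for v in values):
        return False
    if values and all(isinstance(v, int) for v in values):
        return 0
    return math.nan  # float (and anything else in this sketch)


def _is_fill(value, fill_value):
    """True if `value` equals the fill value, treating NaN as equal to NaN."""
    if isinstance(fill_value, float) and math.isnan(fill_value):
        return isinstance(value, float) and math.isnan(value)
    return value == fill_value


def sparsify(values, fill_value=None):
    """Return (stored values, their positions, fill value used)."""
    if fill_value is None:
        fill_value = default_fill_value(values)
    indices = [i for i, v in enumerate(values) if not _is_fill(v, fill_value)]
    return [values[i] for i in indices], indices, fill_value


print(sparsify([1, 2, 0, 0]))         # ([1, 2], [0, 1], 0)
print(sparsify([True, False, True]))  # ([True, True], [0, 2], False)
```

The stored positions match the `Indices` shown in the "after this PR" output: only the entries that differ from the fill value are kept.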

sinhrks added this to the 0.19.0 milestone Jul 30, 2016

codecov-io commented Jul 30, 2016 edited

Current coverage is 85.27% (diff: 98.63%)

Merging #13849 into master will increase coverage by <.01%

@@             master     #13849   diff @@
==========================================
  Files           139        139          
  Lines         50511      50523    +12   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43071      43083    +12   
  Misses         7440       7440          
  Partials          0          0          

Powered by Codecov. Last update 10bf721...341585a

@jreback jreback added a commit that referenced this pull request Aug 4, 2016

@sinhrks @jreback sinhrks + jreback ENH: sparse astype now supports int64 and bool
split from #13849

Author: sinhrks <sinhrks@gmail.com>

Closes #13900 from sinhrks/sparse_astype and squashes the following commits:

1c669ad [sinhrks] ENH: sparse astype now supports int64 and bool
2beab41
Contributor

jreback commented Aug 7, 2016

@sinhrks getting tons of warnings compiling on windows....all the same

pandas\src\sparse.c(63861) : warning C4244: '=' : conversion from 'Py_ssize_t' to '__pyx_t_5numpy_int32_t', possible loss of data
pandas\src\sparse.c(63870) : warning C4244: '=' : conversion from 'Py_ssize_t' to '__pyx_t_5numpy_int32_t', possible loss of data
pandas\src\sparse.c(66180) : warning C4244: '=' : conversion from 'Py_ssize_t' to '__pyx_t_5numpy_int32_t', possible loss of data
pandas\src\sparse.c(66189) : warning C4244: '=' : conversion from 'Py_ssize_t' to '__pyx_t_5numpy_int32_t', possible loss of data
pandas\src\sparse.c(68499) : warning C4244: '=' : conversion from 'Py_ssize_t' to '__pyx_t_5numpy_int32_t', possible loss of data

@jreback jreback added a commit that referenced this pull request Aug 9, 2016

@sinhrks @jreback sinhrks + jreback BLD: Fix sparse warnings
closes #13942
xref #13849
ae26ec7

sinhrks changed the title from (WIP)ENH: Sparse now supports int64 and bool dtype to ENH: Sparse now supports int64 and bool dtype Aug 16, 2016

Member

sinhrks commented Aug 16, 2016

rebased and added the doc, now ready for review.

@jreback jreback commented on an outdated diff Aug 18, 2016

doc/source/whatsnew/v0.19.0.txt
@@ -777,6 +779,43 @@ Sparse Changes
These changes allow pandas to handle sparse data with more dtypes, and for work to make a smoother experience with data handling.
+- Sparse data structure now supports ``int64`` and ``bool`` ``dtype`` (:issue:`13849`)
@jreback

jreback Aug 18, 2016

Contributor

make this a separate sub-section (as the first one is too long?)

@jreback jreback commented on an outdated diff Aug 18, 2016

doc/source/whatsnew/v0.19.0.txt
@@ -777,6 +779,43 @@ Sparse Changes
These changes allow pandas to handle sparse data with more dtypes, and for work to make a smoother experience with data handling.
+- Sparse data structure now supports ``int64`` and ``bool`` ``dtype`` (:issue:`13849`)
+
+Previously, sparse data have ``float64`` dtype by default, even if all inputs are ``int`` or ``bool``. You had to specify ``dtype`` explicitly to create sparse data with ``int64`` dtype. Also, you must specify ``fill_value`` to actually sparcify the data, becuase ``fill_value`` 's default is ``np.nan`` which doesn't appear in ``int64`` data.
@jreback

jreback Aug 18, 2016

Contributor

could add a ref to #667 here as well

Contributor

jreback commented Aug 18, 2016

lgtm. just some doc corrections. ping on green.

@jreback jreback commented on an outdated diff Aug 18, 2016

doc/source/whatsnew/v0.19.0.txt
@@ -13,6 +13,8 @@ Highlights include:
- ``.rolling()`` are now time-series aware, see :ref:`here <whatsnew_0190.enhancements.rolling_ts>`
- pandas development api, see :ref:`here <whatsnew_0190.dev_api>`
- ``PeriodIndex`` now has its own ``period`` dtype. see ref:`here <whatsnew_0190.api.perioddtype>`
+- :func:`read_csv` now supports parsing ``Categorical`` data, see :ref:`here <whatsnew_0190.enhancements.read_csv_categorical>`
@jreback

jreback Aug 18, 2016

Contributor

say dtype='category' for parsing Categorical data.

@jreback jreback commented on the diff Aug 18, 2016

pandas/sparse/array.py
else:
# array-like
if sparse_index is None:
- values, sparse_index = make_sparse(data, kind=kind,
- fill_value=fill_value)
+ if dtype is not None:
@jreback

jreback Aug 18, 2016

Contributor

I think you should explicitly list the supported dtypes here to give a nice error message, for now, to avoid unwanted conversions (e.g. this won't raise on M8), and add a test for non-supported dtypes.

@jreback

jreback Aug 18, 2016

Contributor

prob need to have this handled for .astype as well. you already have something like this getting the fill value, maybe can use that to validate. (e.g. use the if test, then raise if its not there).

@jreback

jreback Aug 18, 2016

Contributor

actually maybe should put this in sparse.array or have a sparse.common module that you can import to do centralized sparse things.

@jreback jreback commented on an outdated diff Aug 18, 2016

pandas/sparse/array.py
@@ -255,6 +245,16 @@ def _simple_new(cls, data, sp_index, fill_value):
result._fill_value = fill_value
return result
+ @classmethod
+ def _get_default_fill_value(cls, arr_or_dtype):
+ if is_bool_dtype(arr_or_dtype):
+ # if we have a bool type, make sure that we have a bool fill_value
+ return False
+ elif is_integer_dtype(arr_or_dtype):
+ return 0
+ else:
@jreback

jreback Aug 18, 2016

Contributor
elif is_floating_dtype(arr_or_dtype):
    return np.nan

raise ValueError("unsupported dtype {...} for Sparse")
@jreback

jreback Aug 25, 2016

Contributor

add a clause for object, raise otherwise
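
A minimal sketch of the validation being asked for here, assuming a lookup keyed by dtype name. The table, the helper name `get_default_fill_value`, and the choice of NaN for `object` are assumptions for illustration only; the real method works on arrays or NumPy dtypes rather than strings.

```python
import math

# Assumed mapping for this sketch; "object" -> NaN is a guess based on the
# PR description, not confirmed behavior.
_DEFAULT_FILL_VALUES = {
    "bool": False,
    "int64": 0,
    "float64": math.nan,
    "object": math.nan,
}


def get_default_fill_value(dtype_name):
    """Return the default fill value for a dtype name, or raise if unsupported."""
    try:
        return _DEFAULT_FILL_VALUES[dtype_name]
    except KeyError:
        raise ValueError(
            "unsupported dtype {!r} for Sparse".format(dtype_name)
        ) from None
```

With this shape, a dtype such as `datetime64[ns]` (M8) fails loudly instead of being silently converted.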

Member

sinhrks commented Aug 20, 2016

@jreback Thx for the review. One point is whether we should prohibit dtypes other than the "relatively well" supported ones (currently float64; after this PR, int64 and bool as well). Maybe my title was misleading, so I changed it.

There are a few issues which use sparse data with object dtype, like #11633 and #13917. So I feel we should keep the current behavior for other dtypes rather than raise (it may be better to show a warning).

CC: @sstanovnik

sinhrks changed the title from ENH: Sparse now supports int64 and bool dtype to ENH: Sparse int64 and bool dtype support enhancement Aug 20, 2016

Contributor

jreback commented Aug 20, 2016

yeah object dtypes are only partially supported ATM. I think can leave that ok, though possibly we could warn, @jorisvandenbossche ?

Contributor

sstanovnik commented Aug 20, 2016

Let me add my 2¢, since you went as far as CC-ing me. I can't make a very informed opinion, since I don't know enough about pandas' internals, and I obviously have an interest (biolab/orange3#1347) for supporting arbitrary types.

My thoughts are that you should be able to throw the same kind of data in a dense or a sparse DataFrame so that they are equivalent. An example off the top of my head is a SparseDataFrame with a recommendation dataset with rows as movies and columns as users, and additional metadata (string) columns about each movie. I don't know if this is possible, but judging from my time with BlockManager, you could maybe use dense string columns mixed in-between an otherwise sparse structure, if supporting sparse string storage is too hard.

As I said, I may be completely off-target here, just some thoughts :)

Sorry, not familiar with sparse. But: using object dtype, does it work enough to use it for certain cases? If yes, I would not remove it.

@sinhrks Does this also close pydata#13110?

Member

sinhrks commented Aug 21, 2016

I think object dtype can be used in some cases, but I'm not fully sure as it is not well tested. I won't remove it ATM, and will add more tests to clarify (in another PR).

#13110 should be closed. Added whatsnew.

@jreback jreback commented on an outdated diff Aug 25, 2016

doc/source/whatsnew/v0.19.0.txt
@@ -17,6 +17,7 @@ Highlights include:
- ``.rolling()`` are now time-series aware, see :ref:`here <whatsnew_0190.enhancements.rolling_ts>`
- pandas development api, see :ref:`here <whatsnew_0190.dev_api>`
- ``PeriodIndex`` now has its own ``period`` dtype. see ref:`here <whatsnew_0190.api.perioddtype>`
+- Sparse now supports other ``int`` and ``bool`` dtypes, see :ref:`here <whatsnew_0190.sparse>`
@jreback

jreback Aug 25, 2016

Contributor

would leave out other

@jreback jreback commented on the diff Aug 25, 2016

doc/source/whatsnew/v0.19.0.txt
@@ -790,6 +791,50 @@ Sparse Changes
These changes allow pandas to handle sparse data with more dtypes, and for work to make a smoother experience with data handling.
@jreback

jreback Aug 25, 2016

Contributor

need a sub-section ref

@jreback jreback commented on an outdated diff Aug 25, 2016

doc/source/whatsnew/v0.19.0.txt
@@ -790,6 +791,50 @@ Sparse Changes
These changes allow pandas to handle sparse data with more dtypes, and for work to make a smoother experience with data handling.
+
+``int64`` and ``bool`` support enhancements
+"""""""""""""""""""""""""""""""""""""""""""
+
+Sparse data structure now gained enhanced support of ``int64`` and ``bool`` ``dtype`` (:issue:`667`, :issue:`13849`)
@jreback

jreback Aug 25, 2016

Contributor

structure -> structures

@jreback jreback commented on an outdated diff Aug 25, 2016

doc/source/whatsnew/v0.19.0.txt
@@ -790,6 +791,50 @@ Sparse Changes
These changes allow pandas to handle sparse data with more dtypes, and for work to make a smoother experience with data handling.
+
+``int64`` and ``bool`` support enhancements
+"""""""""""""""""""""""""""""""""""""""""""
+
+Sparse data structure now gained enhanced support of ``int64`` and ``bool`` ``dtype`` (:issue:`667`, :issue:`13849`)
+
+Previously, sparse data have ``float64`` dtype by default, even if all inputs are ``int`` or ``bool``. You had to specify ``dtype`` explicitly to create sparse data with ``int64`` dtype. Also, you must specify ``fill_value`` to actually sparcify the data, becuase ``fill_value`` 's default is ``np.nan`` which doesn't appear in ``int64`` data.
@jreback

jreback Aug 25, 2016

Contributor

have -> were
even if all inputs are -> even if all inputs were int or bool dtype.

Unclear what the Also, you must specify sentence means....can you reword?

@jreback jreback commented on the diff Aug 25, 2016

pandas/sparse/tests/test_array.py
res = s.fillna(-1)
exp = SparseArray([0, 0, 0, 0], fill_value=0)
tm.assert_sp_array_equal(res, exp)
+ # fill_value can be nan if there is no missing hole.
+ # only fill_value will be changed
+ s = SparseArray([0, 0, 0, 0], fill_value=np.nan)
@jreback

jreback Aug 25, 2016

Contributor

I am not sure about this. I think if a float fill_value should force float (though if the data infers as integer, maybe raise/warn)? having the determination as no gaps is too data specific. I know this forces an integer array to then have a specified fill value (as it can't then be np.nan), but I think that's ok.

@jreback

jreback Aug 27, 2016

Contributor

my remaining question was this one

@jorisvandenbossche

jorisvandenbossche Aug 27, 2016 edited

Owner

I agree that it seems logical that the fill_value matches the dtype. So in case of specifically specifying the fill_value, I would take that into account for the actual dtype inference.

Given that nan is no longer the default fill_value, I don't think it is a problem that specifying fill_value=np.nan changes your integer data into a float sparse array.
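
The inference proposed in this thread (an explicit fill_value takes part in deciding the dtype) could look roughly like the sketch below. `infer_sparse_kind` is a hypothetical helper working on Python scalars, and the thread does not confirm that this is the behavior that was finally merged.

```python
import math


def infer_sparse_kind(values, fill_value=None):
    """Hypothetical helper: return 'bool', 'int' or 'float' for the result."""
    def kind(v):
        if isinstance(v, bool):
            return "bool"
        if isinstance(v, int):
            return "int"
        return "float"

    kinds = {kind(v) for v in values}
    if fill_value is not None:
        kinds.add(kind(fill_value))  # an explicit fill value takes part in inference
    for candidate in ("float", "int", "bool"):  # widest kind wins
        if candidate in kinds:
            return candidate
    return "float"  # empty input: fall back to float in this sketch


print(infer_sparse_kind([0, 0, 0, 0]))                       # int
print(infer_sparse_kind([0, 0, 0, 0], fill_value=math.nan))  # float
```

Under this rule the result no longer depends on whether the data happens to contain gaps, which is the objection raised against the test above.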

@jorisvandenbossche

jorisvandenbossche Aug 27, 2016

Owner

@sinhrks The docstring of SparseArray still says that the default fill_value is NaN, which is no longer true I think (it changed to None, to depend on the data type I suppose)

Disclaimer: I never used sparse or am familiar with the implementation (so my excuses if it is a stupid or naive question), but I quickly looked at the PR and have the following question.

Previously, for integer and boolean Series, the 0 or False values were regarded as actual values, not as an indication of 'not a value' in the sparse series. Isn't this a big change? (I don't know how much you could use it before this PR for this to be a problem)
Next to that, having e.g. False for boolean arrays as the default fill_value also seems a bit strange to me. I would expect that somebody who wants a boolean sparse array would want to be able to have both True and False values as actual values? (e.g. something like [True, -, -, False, -, -, True])?
Of course this is currently because boolean Series cannot hold anything other than True or False.

OK, so probably my question should be categorized in the naive category :-)
I see that this is the same as what scipy.sparse does, so seems like a sensible default then.

@jorisvandenbossche jorisvandenbossche commented on the diff Aug 27, 2016

doc/source/sparse.rst
@@ -132,6 +132,61 @@ keeps an arrays of all of the locations where the data are not equal to the
fill value. The ``block`` format tracks only the locations and sizes of blocks
of data.
+.. _sparse.dtype:
+
+Sparse Dtypes
+-------------
+
+Sparse data should have the same dtype as its dense representation. Currently,
+``float64``, ``int64`` and ``bool`` dtypes are supported. Depending on the original
+dtype, ``fill_value`` default changes:
@jorisvandenbossche

jorisvandenbossche Aug 27, 2016

Owner

Can you add a note here somewhere that for int and bool this was only added from 0.19 ?

@jorisvandenbossche jorisvandenbossche commented on an outdated diff Aug 27, 2016

pandas/core/generic.py
Return a boolean same-sized object indicating if the values are null.
See Also
--------
notnull : boolean inverse of isnull
"""
+
+ @Appender(_shared_docs['isnull'] % _shared_doc_kwargs)
@jorisvandenbossche

jorisvandenbossche Aug 27, 2016

Owner

I don't think the % _shared_doc_kwargs are used in this case?

Contributor

jreback commented Aug 27, 2016

joris, your example already works: you can have any values you want as actual values (both True and False); the fill value is the missing-value indicator used when I need to densify (it's the default)

so this is not a conceptual change at all just a change to keep dtype consistency

@jreback I was looking at the to_sparse examples. So the fill_value is also used to convert from dense to sparse. So the output what you see there (eg in case of pd.Series([1, 0, 0]).to_sparse()) has changed (previously that was a block length of 3, now of 1). But no problem, I understand that the actual behaviour you want has not changed.
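
The block-length change mentioned above can be reproduced with a small pure-Python sketch of a block index (runs of consecutive non-fill values). `block_index` is a hypothetical helper for illustration, not the pandas `BlockIndex` class.

```python
import math


def block_index(values, fill_value):
    """Return (block start positions, block lengths) of non-fill runs."""
    def is_fill(v):
        if isinstance(fill_value, float) and math.isnan(fill_value):
            return isinstance(v, float) and math.isnan(v)
        return v == fill_value

    starts, lengths = [], []
    in_block = False
    for i, v in enumerate(values):
        if is_fill(v):
            in_block = False
        elif in_block:
            lengths[-1] += 1  # extend the current run
        else:
            starts.append(i)  # open a new run
            lengths.append(1)
            in_block = True
    return starts, lengths


print(block_index([1, 0, 0], math.nan))  # ([0], [3])
print(block_index([1, 0, 0], 0))         # ([0], [1])
```

With a NaN fill value every element of `[1, 0, 0]` is stored (one block of length 3); with a fill value of 0 only the leading 1 is stored (one block of length 1).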

@jreback This PR for the rest OK to merge for you, Jeff? (it's closing a lot of issues for 0.19.0 :-))

Owner

jorisvandenbossche commented Aug 29, 2016 edited

@sinhrks Can you update the docstrings for SparseDataFrame, SparseSeries and SparseArray? They all still mention the fact that only floats are supported or that nan is the default fill value.

@sinhrks sinhrks ENH: Sparse dtypes
341585a

@jorisvandenbossche jorisvandenbossche merged commit b6d3a81 into pandas-dev:master Aug 31, 2016

3 checks passed

codecov/patch 98.63% of diff hit (target 50.00%)
codecov/project 85.27% (target 82.00%)
continuous-integration/travis-ci/pr The Travis CI build passed

@sinhrks Thanks a lot!

Owner

jorisvandenbossche commented Aug 31, 2016 edited

@sinhrks appveyor started failing (some int dtype issues):

======================================================================
FAIL: test_append_zero (pandas.sparse.tests.test_list.TestSparseList)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Python27_64\envs\pandas\lib\site-packages\pandas\sparse\tests\test_list.py", line 64, in test_append_zero
    tm.assert_sp_array_equal(sparr, SparseArray(arr, fill_value=0))
  File "C:\Python27_64\envs\pandas\lib\site-packages\pandas\util\testing.py", line 1392, in assert_sp_array_equal
    assert_numpy_array_equal(left.sp_values, right.sp_values)
  File "C:\Python27_64\envs\pandas\lib\site-packages\pandas\util\testing.py", line 1083, in assert_numpy_array_equal
    assert_attr_equal('dtype', left, right, obj=obj)
  File "C:\Python27_64\envs\pandas\lib\site-packages\pandas\util\testing.py", line 878, in assert_attr_equal
    left_attr, right_attr)
  File "C:\Python27_64\envs\pandas\lib\site-packages\pandas\util\testing.py", line 1018, in raise_assert_detail
    raise AssertionError(msg)
AssertionError: numpy array are different
Attribute "dtype" are different
[left]:  int64
[right]: int32
Member

sinhrks commented Sep 1, 2016

@jorisvandenbossche thx for pointing out, will fix.

sinhrks deleted the sinhrks:sparse_dtype3 branch Sep 1, 2016
