ENH: Sparse int64 and bool dtype support enhancement #13849

Merged
merged 1 commit into pandas-dev:master from sinhrks:sparse_dtype3 on Aug 31, 2016

Conversation

sinhrks (Member) commented Jul 30, 2016

  • closes #667, closes #8292, closes #13001, closes #8276, closes #13110
    • tests added / passed
    • passes git diff upstream/master | flake8 --diff
    • whatsnew entry

Currently, sparse doesn't actually support int64 and bool dtypes. When int or bool values are passed, they are coerced to float64 if the dtype keyword is not explicitly specified.

on current master

pd.SparseArray([1, 2, 0, 0 ])
# [1.0, 2.0, 0.0, 0.0]
# Fill: nan
# IntIndex
# Indices: array([0, 1, 2, 3], dtype=int32)

pd.SparseArray([True, False, True])
# [1.0, 0.0, 1.0]
# Fill: nan
# IntIndex
# Indices: array([0, 1, 2], dtype=int32)

after this PR

The created data should have the dtype of the passed values (the same as a normal Series).

pd.SparseArray([1, 2, 0, 0 ])
# [1, 2, 0, 0]
# Fill: 0
# IntIndex
# Indices: array([0, 1], dtype=int32)

pd.SparseArray([True, False, True])
# [True, False, True]
# Fill: False
# IntIndex
# Indices: array([0, 2], dtype=int32)

Also, fill_value is automatically specified according to the following rules (because np.nan cannot be represented in int or bool dtypes); a small sketch of these defaults follows the list:

Basic rule: sparse dtype must not be changed when it is converted to dense.

  • If sparse_index is specified and data has a hole (missing values):
    • fill_value is np.nan
    • dtype is float64 or object (which can store both the data and the fill_value)
  • If sparse_index is None (all values are provided via data, so there are no missing values):
    • if fill_value is not explicitly passed, the following default is used depending on the dtype:
      • float: np.nan
      • int: 0
      • bool: False
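
For illustration, a minimal sketch of these defaults as a hypothetical helper (mirroring the rules above, not the actual pandas implementation):

import numpy as np

def default_fill_value(dtype):
    # hypothetical helper mirroring the rules above; not the pandas internals
    dtype = np.dtype(dtype)
    if dtype.kind == 'f':   # float -> np.nan
        return np.nan
    if dtype.kind == 'i':   # int -> 0
        return 0
    if dtype.kind == 'b':   # bool -> False
        return False
    return np.nan           # other dtypes keep the previous nan default

default_fill_value(np.float64)  # nan
default_fill_value(np.int64)    # 0
default_fill_value(np.bool_)    # False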

sinhrks added this to the 0.19.0 milestone Jul 30, 2016

codecov-io commented Jul 30, 2016

Current coverage is 85.27% (diff: 98.63%)

Merging #13849 into master will increase coverage by <.01%

@@             master     #13849   diff @@
==========================================
  Files           139        139          
  Lines         50511      50523    +12   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43071      43083    +12   
  Misses         7440       7440          
  Partials          0          0          

Powered by Codecov. Last update 10bf721...341585a

jreback added a commit that referenced this pull request Aug 4, 2016

ENH: sparse astype now supports int64 and bool
split from #13849

Author: sinhrks <sinhrks@gmail.com>

Closes #13900 from sinhrks/sparse_astype and squashes the following commits:

1c669ad [sinhrks] ENH: sparse astype now supports int64 and bool
jreback (Contributor) commented Aug 7, 2016

@sinhrks getting tons of warnings compiling on Windows... all the same:

pandas\src\sparse.c(63861) : warning C4244: '=' : conversion from 'Py_ssize_t' to '__pyx_t_5numpy_int32_t', possible loss of data
pandas\src\sparse.c(63870) : warning C4244: '=' : conversion from 'Py_ssize_t' to '__pyx_t_5numpy_int32_t', possible loss of data
pandas\src\sparse.c(66180) : warning C4244: '=' : conversion from 'Py_ssize_t' to '__pyx_t_5numpy_int32_t', possible loss of data
pandas\src\sparse.c(66189) : warning C4244: '=' : conversion from 'Py_ssize_t' to '__pyx_t_5numpy_int32_t', possible loss of data
pandas\src\sparse.c(68499) : warning C4244: '=' : conversion from 'Py_ssize_t' to '__pyx_t_5numpy_int32_t', possible loss of data

jreback added a commit that referenced this pull request Aug 9, 2016

sinhrks changed the title from (WIP)ENH: Sparse now supports int64 and bool dtype to ENH: Sparse now supports int64 and bool dtype Aug 16, 2016

sinhrks (Member) commented Aug 16, 2016

Rebased and added the doc; now ready for review.

jreback (Contributor) commented Aug 18, 2016

lgtm. just some doc corrections. ping on green.

Inline review on the following diff context:

else:
    # array-like
    if sparse_index is None:
        values, sparse_index = make_sparse(data, kind=kind,
                                           fill_value=fill_value)
    if dtype is not None:

jreback (Contributor) commented Aug 18, 2016

I think explicitly putting the supported dtypes in here, to have a nice error message for now, would avoid unwanted conversions (e.g. this won't raise on M8); also add a test for non-supported dtypes.

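To illustrate the suggestion, a hypothetical validation helper; the name and placement are assumptions, not the PR's actual code:

import numpy as np

_SUPPORTED_SPARSE_DTYPES = ('float64', 'int64', 'bool')

def _validate_sparse_dtype(dtype):
    # hypothetical sketch of the suggested check; not the actual pandas code
    dtype = np.dtype(dtype)
    if dtype.name not in _SUPPORTED_SPARSE_DTYPES:
        # e.g. datetime64 ("M8") would be rejected with a clear message
        raise TypeError("SparseArray dtype must be one of float64/int64/bool, "
                        "got %r" % dtype.name)
    return dtype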

jreback (Contributor) commented Aug 18, 2016

Probably need to have this handled for .astype as well. You already have something like this for getting the fill value; maybe you can use that to validate (e.g. use the if test, then raise if it's not there).


jreback (Contributor) commented Aug 18, 2016

Actually, maybe this should go in sparse.array, or add a sparse.common module that you can import to do centralized sparse things.


sinhrks (Member) commented Aug 20, 2016

@jreback Thanks for the review. One open point is whether we should prohibit dtypes other than the "relatively well" supported ones (currently float64; after this PR, int64 and bool as well). Maybe my title was misleading, so I changed the title.

There are a few issues which use sparse data with object dtype, like #11633 and #13917. So I feel we should keep the current behavior for other dtypes rather than raise (it may be better to show a warning).

CC: @sstanovnik

sinhrks changed the title from ENH: Sparse now supports int64 and bool dtype to ENH: Sparse int64 and bool dtype support enhancement Aug 20, 2016

jreback (Contributor) commented Aug 20, 2016

Yeah, object dtypes are only partially supported ATM. I think we can leave that as is, though possibly we could warn, @jorisvandenbossche?

sstanovnik (Contributor) commented Aug 20, 2016

Let me add my 2¢, since you went as far as CC-ing me. I can't offer a very informed opinion, since I don't know enough about pandas' internals, and I obviously have an interest (biolab/orange3#1347) in supporting arbitrary types.

My thought is that you should be able to throw the same kind of data into a dense or a sparse DataFrame, so that they are equivalent. An example off the top of my head is a SparseDataFrame holding a recommendation dataset with rows as movies and columns as users, plus additional metadata (string) columns about each movie. I don't know if this is possible, but judging from my time with BlockManager, you could maybe use dense string columns mixed in between an otherwise sparse structure, if supporting sparse string storage is too hard.

As I said, I may be completely off-target here, just some thoughts :)

jorisvandenbossche (Member) commented Aug 21, 2016

Sorry, I'm not familiar with sparse. But: does object dtype work well enough to use it for certain cases? If yes, I would not remove it.

jorisvandenbossche (Member) commented Aug 21, 2016

@sinhrks Does this also close #13110?

sinhrks (Member) commented Aug 21, 2016

I think object dtype can be used in some cases, but I'm not fully sure, as it is not tested well. I won't remove it ATM and will add more tests to clarify (in another PR).

#13110 should be closed. Added whatsnew.

@@ -790,6 +791,50 @@ Sparse Changes
These changes allow pandas to handle sparse data with more dtypes, and work toward a smoother experience with data handling.

jreback (Contributor) commented Aug 25, 2016

Need a sub-section ref.


Inline review on the following test diff context:

res = s.fillna(-1)
exp = SparseArray([0, 0, 0, 0], fill_value=0)
tm.assert_sp_array_equal(res, exp)
# fill_value can be nan if there is no missing hole.
# only fill_value will be changed
s = SparseArray([0, 0, 0, 0], fill_value=np.nan)

jreback (Contributor) commented Aug 25, 2016

I am not sure about this. I think a float fill_value should force float (though if the data infers as integer, maybe raise/warn)? Having the determination depend on there being no gaps is too data-specific. I know this forces an integer array to then have a specified fill_value (as it can't then be np.nan), but I think that's ok.


jreback (Contributor) commented Aug 27, 2016

My remaining question was this one.


jorisvandenbossche (Member) commented Aug 27, 2016

I agree that it seems logical that the fill_value matches the dtype. So in the case of explicitly specifying the fill_value, I would take that into account for the actual dtype inference.

Given that nan is no longer the default fill_value, I don't think it is a problem that specifying fill_value=np.nan changes your integer data into a float sparse array.


jorisvandenbossche (Member) commented Aug 27, 2016

@sinhrks The docstring of SparseArray still says that the default fill_value is NaN, which I think is no longer true (it changed to None, to depend on the data type, I suppose).


jorisvandenbossche (Member) commented Aug 27, 2016

Disclaimer: I have never used sparse and am not familiar with the implementation (so my excuses if this is a stupid or naive question), but I quickly looked at the PR and have the following questions.

Previously, for integer and boolean series, 0 or False values were regarded as actual values, not as an indication of 'not a value' in the sparse series. Isn't this a big change? (I don't know how usable it was before this PR for that to be a problem.)
Next to that, having e.g. False for boolean arrays as the default fill_value also seems a bit strange to me. I would expect that somebody who wants a boolean sparse array would want to be able to have both True and False values as actual values (e.g. something like [True, -, -, False, -, -, True])?
Of course, this is currently because boolean series cannot hold anything other than True or False.

jorisvandenbossche (Member) commented Aug 27, 2016

OK, so probably my question should be categorized in the naive category :-)
I see that this is the same as what scipy.sparse does, so it seems like a sensible default then.

Inline review on the following documentation diff context:

Sparse data should have the same dtype as its dense representation. Currently,
``float64``, ``int64`` and ``bool`` dtypes are supported. Depending on the original
dtype, the ``fill_value`` default changes:

jorisvandenbossche (Member) commented Aug 27, 2016

Can you add a note here somewhere that for int and bool this was only added in 0.19?


jreback (Contributor) commented Aug 27, 2016

Joris, your example already works; you can have any values you want as actual values (both True and False). The fill_value is the missing-value indicator used when I need to densify (it's the default).

So this is not a conceptual change at all, just a change to keep dtype consistency.
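
For illustration, a minimal sketch of the point being made, with outputs taken from the PR description above (not re-run here):

import pandas as pd

sp = pd.SparseArray([True, False, True])
# Both True and False are real values; only entries equal to the
# fill_value (False) are omitted from the stored sp_values.
# sp.sp_values  -> array([ True,  True])
# sp.fill_value -> False
# sp.to_dense() -> array([ True, False,  True])   # bool dtype is preserved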

jorisvandenbossche (Member) commented Aug 27, 2016

@jreback I was looking at the to_sparse examples. So the fill_value is also used to convert from dense to sparse, and the output you see there (e.g. in the case of pd.Series([1, 0, 0]).to_sparse()) has changed (previously that was a block length of 3, now of 1). But no problem, I understand that the actual behaviour you want has not changed.
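
For reference, a short sketch of the output change described here (block lengths as stated above; not re-run):

import pandas as pd

s = pd.Series([1, 0, 0])
sp = s.to_sparse()
# Before this PR: data was coerced to float64 with fill_value=nan, so all
# three values were stored (a block of length 3).
# After this PR: the dtype stays int64 with fill_value=0, so only the single
# non-zero value is stored (a block of length 1).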

jorisvandenbossche (Member) commented Aug 27, 2016

@jreback Is this PR otherwise OK to merge for you, Jeff? (It's closing a lot of issues for 0.19.0 :-))

jorisvandenbossche (Member) commented Aug 29, 2016

@sinhrks Can you update the docstrings for SparseDataFrame, SparseSeries and SparseArray? They all still mention the fact that only floats are supported or that nan is the default fill value.

jorisvandenbossche merged commit b6d3a81 into pandas-dev:master Aug 31, 2016

3 checks passed:

  • codecov/patch: 98.63% of diff hit (target 50.00%)
  • codecov/project: 85.27% (target 82.00%)
  • continuous-integration/travis-ci/pr: The Travis CI build passed
jorisvandenbossche (Member) commented Aug 31, 2016

@sinhrks Thanks a lot!

jorisvandenbossche (Member) commented Aug 31, 2016

@sinhrks AppVeyor started failing (some int dtype issues):

======================================================================
FAIL: test_append_zero (pandas.sparse.tests.test_list.TestSparseList)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Python27_64\envs\pandas\lib\site-packages\pandas\sparse\tests\test_list.py", line 64, in test_append_zero
    tm.assert_sp_array_equal(sparr, SparseArray(arr, fill_value=0))
  File "C:\Python27_64\envs\pandas\lib\site-packages\pandas\util\testing.py", line 1392, in assert_sp_array_equal
    assert_numpy_array_equal(left.sp_values, right.sp_values)
  File "C:\Python27_64\envs\pandas\lib\site-packages\pandas\util\testing.py", line 1083, in assert_numpy_array_equal
    assert_attr_equal('dtype', left, right, obj=obj)
  File "C:\Python27_64\envs\pandas\lib\site-packages\pandas\util\testing.py", line 878, in assert_attr_equal
    left_attr, right_attr)
  File "C:\Python27_64\envs\pandas\lib\site-packages\pandas\util\testing.py", line 1018, in raise_assert_detail
    raise AssertionError(msg)
AssertionError: numpy array are different
Attribute "dtype" are different
[left]:  int64
[right]: int32
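
For context, a likely source of this kind of failure, sketched under the assumption that the test builds its expected array from plain Python ints: NumPy's default integer dtype follows the platform C long, which is 32-bit on Windows even for 64-bit Python, so arrays inferred from int literals come out as int32 there.

import numpy as np

arr = np.array([0, 1, 2])
print(arr.dtype)   # int64 on 64-bit Linux/macOS, int32 on 64-bit Windows

# A common way to make such comparisons platform-independent is to be explicit:
expected = np.array([0, 1, 2], dtype=np.int64)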
sinhrks (Member) commented Sep 1, 2016

@jorisvandenbossche Thanks for pointing that out, will fix.

sinhrks deleted the sinhrks:sparse_dtype3 branch Sep 1, 2016

kawochen referenced this pull request Dec 5, 2016:

BUG: Sparse master issue #10627 (open, 11 of 18 tasks complete)