ENH: add sparse op for int64 dtypes #13848

Closed

Member

sinhrks commented Jul 30, 2016 edited

  • related to #667
  • tests added / passed
  • passes git diff upstream/master | flake8 --diff
  • whatsnew entry

As a first step toward #667, numeric ops can now preserve the int64 dtype. On current master, the dtype is reset to float64 after an op.

# current master
import numpy as np
import pandas as pd

a = pd.SparseArray([1, 2], dtype=np.int64)
a.dtype
# dtype('int64')

(a + a).dtype
# dtype('float64')

NOTE: the int64 SparseSeries.__floordiv__ test is skipped because dense Series also handles nan/inf inconsistently (#13843). For now, it outputs the same result as float64.
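
Why preserving int64 matters can be sketched with plain NumPy. This is only an illustration, assuming the float64 reset on master behaves like converting the values to float64 before the op; the variable names are made up for the example and nothing here is taken from the PR's code.

```python
import numpy as np

# A value just above 2**53 cannot be represented exactly in float64.
big = np.array([2**53 + 1], dtype=np.int64)

# Operating directly on int64 keeps both the dtype and the exact value.
exact = big + big

# Assumption: a float-only code path converts the values to float64 first,
# which drops the lowest bit of this particular value.
via_float = big.astype(np.float64) + big.astype(np.float64)

print(exact.dtype, via_float.dtype)
print(int(exact[0]) == 2 * (2**53 + 1))
print(int(via_float[0]) == 2 * (2**53 + 1))
```

For small integers the two paths agree in value, so the visible difference is usually just the result dtype.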

sinhrks added this to the 0.19.0 milestone Jul 30, 2016

sinhrks referenced this pull request Jul 30, 2016

Merged

ENH: Sparse int64 and bool dtype support enhancement #13849

4 of 4 tasks complete

sinhrks added the Dtypes label Jul 30, 2016

codecov-io commented Jul 30, 2016 edited

Current coverage is 85.28% (diff: 98.00%)

Merging #13848 into master will increase coverage by <.01%

@@             master     #13848   diff @@
==========================================
  Files           139        139          
  Lines         50020      50046    +26   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          42657      42682    +25   
- Misses         7363       7364     +1   
  Partials          0          0          

Powered by Codecov. Last update 97de42a...f101e66

Contributor

TomAugspurger commented Jul 30, 2016

Haven't had a chance to look through the code yet, but what are the rules around alignment and potentially recasting the dtype?

import numpy as np
import pandas as pd

s1 = pd.SparseSeries(np.arange(4), dtype=np.int64, fill_value=0)
s2 = pd.SparseSeries(np.arange(4), index=range(1, 5), dtype=np.int64, fill_value=0)

s1 + s1  # OK
s1 + s2  # error
Traceback (most recent call last):
  File "script.py", line 8, in <module>
    s1 + s2  # error
  File "/Users/tom.augspurger/Envs/py3/lib/python3.5/site-packages/pandas/pandas/sparse/series.py", line 56, in wrapper
    return _sparse_series_op(self, other, op, name)
  File "/Users/tom.augspurger/Envs/py3/lib/python3.5/site-packages/pandas/pandas/sparse/series.py", line 81, in _sparse_series_op
    series=True)
  File "/Users/tom.augspurger/Envs/py3/lib/python3.5/site-packages/pandas/pandas/sparse/array.py", line 119, in _sparse_array_op
    sparse_op = getattr(splib, opname)
AttributeError: module 'pandas._sparse' has no attribute 'sparse_add_float64'
Member

sinhrks commented Jul 30, 2016

@TomAugspurger The latter case seems to work on my branch; the error suggests that sparse.pyx was not recompiled properly.

I'm adding more tests related to alignment:)

Contributor

TomAugspurger commented Jul 30, 2016

My bad, just got to that section of the code. Recompiled and it does indeed work 👍
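
For readers puzzled by the AttributeError above: the traceback shows the op being fetched with `getattr(splib, opname)`, so a compiled module that predates the dtype-suffixed functions fails at lookup time. A minimal sketch of that failure mode follows. The naming scheme is inferred only from `sparse_add_float64` in the traceback, and `lookup_sparse_op`, `stale_lib`, and `fresh_lib` are hypothetical stand-ins, not pandas code.

```python
import types


def lookup_sparse_op(lib, name, dtype_name):
    # Assumed naming scheme, inferred from 'sparse_add_float64' in the traceback.
    opname = 'sparse_{0}_{1}'.format(name, dtype_name)
    return getattr(lib, opname)


# Hypothetical stand-ins for the compiled module before and after a rebuild.
stale_lib = types.SimpleNamespace(sparse_add=lambda x, y: x + y)
fresh_lib = types.SimpleNamespace(
    sparse_add_float64=lambda x, y: x + y,
    sparse_add_int64=lambda x, y: x + y,
)

add_int64 = lookup_sparse_op(fresh_lib, 'add', 'int64')
print(add_int64(1, 2))

try:
    lookup_sparse_op(stale_lib, 'add', 'float64')
except AttributeError as exc:
    # Same class of error as in the report: the attribute simply is not there.
    print('stale build:', exc)
```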

@jreback jreback commented on an outdated diff Aug 1, 2016

doc/source/whatsnew/v0.19.0.txt
@@ -301,6 +301,29 @@ For ``MultiIndex``, values are dropped if any level is missing by default. Speci
``Index.astype()`` now accepts an optional boolean argument ``copy``, which allows optional copying if the requirements on dtype are satisfied (:issue:`13209`)
+.. _whatsnew_0190.sparse:
+
+Sparse changes
+~~~~~~~~~~~~~~
+
+These changes conform sparse data to support more dtypes, and for work to make a smoother experience with data handling.
@jreback

jreback Aug 1, 2016

Contributor

These changes allow pandas to handle sparse data with more dtypes.

@jreback jreback commented on an outdated diff Aug 1, 2016

doc/source/whatsnew/v0.19.0.txt
@@ -301,6 +301,29 @@ For ``MultiIndex``, values are dropped if any level is missing by default. Speci
``Index.astype()`` now accepts an optional boolean argument ``copy``, which allows optional copying if the requirements on dtype are satisfied (:issue:`13209`)
+.. _whatsnew_0190.sparse:
+
+Sparse changes
+~~~~~~~~~~~~~~
+
+These changes conform sparse data to support more dtypes, and for work to make a smoother experience with data handling.
+
+- Sparse data structure now can preserve ``dtype`` after arithmetic op (:issue:`13848`)
+
@jreback

jreback Aug 1, 2016

Contributor

after arithmetic ops

@jreback jreback commented on the diff Aug 1, 2016

pandas/sparse/array.py
@@ -420,7 +459,12 @@ def astype(self, dtype=None):
         dtype = np.dtype(dtype)
         if dtype is not None and dtype not in (np.float_, float):
             raise TypeError('Can only support floating point data for now')
-        return self.copy()
+
+        if self.dtype == dtype:
+            return self.copy()
+        else:
+            return self._simple_new(self.sp_values.astype(dtype),
+                                    self.sp_index, float(self.fill_value))
@jreback

jreback Aug 1, 2016

Contributor

maybe we should coerce fill_value in the constructor based on the type of the values?

@sinhrks

sinhrks Aug 1, 2016

Member

Yeah, it should be covered in #13849 (though I'm thinking of splitting it into smaller PRs...).

Because astype currently only supports coercing to float, this is a temporary workaround so as not to break other code.
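
The suggestion to coerce fill_value from the dtype of the values could look roughly like the sketch below. `coerce_fill_value` is a hypothetical helper written for this illustration, not the constructor logic from this PR or from #13849, and leaving NaN untouched for non-float dtypes is an assumption made here because NaN cannot be cast to an integer.

```python
import numpy as np


def coerce_fill_value(values, fill_value):
    """Hypothetical helper: cast fill_value to the scalar type of values."""
    dtype = np.asarray(values).dtype
    # NaN has no integer or bool representation, so keep it as-is
    # (an assumption of this sketch, not documented pandas behavior).
    if isinstance(fill_value, float) and np.isnan(fill_value) and dtype.kind != 'f':
        return fill_value
    return dtype.type(fill_value)


int_values = np.array([1, 2], dtype=np.int64)
print(type(coerce_fill_value(int_values, 0.0)))
print(coerce_fill_value(int_values, np.nan))
print(type(coerce_fill_value(np.array([1.0, 2.0]), 0)))
```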

Contributor

jreback commented Aug 2, 2016

rebase in light of changes #13787

@sinhrks sinhrks ENH: add sparse op for other dtypes
f101e66
Contributor

jreback commented Aug 3, 2016

thanks!

nice cleanup

jreback closed this in 45d54d0 Aug 3, 2016

sinhrks deleted the sinhrks:sparse_op2 branch Aug 3, 2016

Contributor

jreback commented Aug 4, 2016

FYI: 8ec7406

as we no longer depend on generated; was causing recompilation of algos.pyx every time :<

Contributor

jreback commented Aug 4, 2016

small dtype adj needed on windows

(Pdb) c
E........................................................................................................................................
..............................................................................S.........................S................................
...........................................
======================================================================
ERROR: test_int_array_comparison (pandas.sparse.tests.test_arithmetics.TestSparseArrayArithmetics)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Users\conda\Documents\pandas3.5\pandas\sparse\tests\test_arithmetics.py", line 292, in test_int_array_comparison
    self._check_comparison_ops(a, b, values, rvalues)
  File "C:\Users\conda\Documents\pandas3.5\pandas\sparse\tests\test_arithmetics.py", line 93, in _check_comparison_ops
    self._check_bool_result(a == b_dense)
  File "C:\Users\conda\Documents\pandas3.5\pandas\sparse\array.py", line 54, in wrapper
    return _sparse_array_op(self, other, op, name)
  File "C:\Users\conda\Documents\pandas3.5\pandas\sparse\array.py", line 98, in _sparse_array_op
    dtype = _maybe_match_dtype(left, right)
  File "C:\Users\conda\Documents\pandas3.5\pandas\sparse\array.py", line 75, in _maybe_match_dtype
    raise NotImplementedError('dtypes must be identical')
NotImplementedError: dtypes must be identical

----------------------------------------------------------------------
Ran 331 tests in 49.517s

FAILED (SKIP=2, errors=1)
(pandas3.5) C:\Users\conda\Documents\pandas3.5>nosetests pandas\sparse --pdb
........> c:\users\conda\documents\pandas3.5\pandas\sparse\array.py(75)_maybe_match_dtype()
-> raise NotImplementedError('dtypes must be identical')
(Pdb) u
> c:\users\conda\documents\pandas3.5\pandas\sparse\array.py(98)_sparse_array_op()
-> dtype = _maybe_match_dtype(left, right)
(Pdb) u
> c:\users\conda\documents\pandas3.5\pandas\sparse\array.py(54)wrapper()
-> return _sparse_array_op(self, other, op, name)
(Pdb) d
> c:\users\conda\documents\pandas3.5\pandas\sparse\array.py(98)_sparse_array_op()
-> dtype = _maybe_match_dtype(left, right)
(Pdb) p left
[0, 1, 2, 0, 0, 0, 1, 2, 1, 0]
Fill: nan
IntIndex
Indices: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

(Pdb) p right
[2, 0, 2, 3, 0, 0, 1, 5, 2, 0]
Fill: nan
IntIndex
Indices: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

(Pdb) p left.dtype
dtype('int64')
(Pdb) p right.dtype
dtype('int32')
(Pdb) u
Member

sinhrks commented Aug 4, 2016

Thx, will fix.
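
The pdb session above shows an int64 left operand meeting an int32 right operand. One way to reconcile such a pair is to upcast both to a common dtype with `np.result_type`; another is to build the test arrays with an explicit `dtype=np.int64`. The sketch below shows the first option. `match_dtypes` is a hypothetical helper and not the fix that was applied, and the assumption that the int32 array comes from a platform with a 32-bit default integer is mine.

```python
import numpy as np

left = np.array([0, 1, 2], dtype=np.int64)
# Assumption: a platform whose default integer is 32-bit produces this dtype
# when no dtype is passed explicitly.
right = np.array([2, 0, 2], dtype=np.int32)


def match_dtypes(left, right):
    """Hypothetical helper: upcast both operands to a common dtype."""
    common = np.result_type(left.dtype, right.dtype)
    return left.astype(common), right.astype(common)


left_matched, right_matched = match_dtypes(left, right)
print(left_matched.dtype, right_matched.dtype)
print(left_matched == right_matched)
```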
