BUG: Slicing subclasses of SparseDataFrames. #13787

Closed
wants to merge 12 commits into
from

Conversation

Projects
None yet
4 participants
Contributor

sstanovnik commented Jul 25, 2016 edited

  • 1 test added & passed
  • passes git diff upstream/master | flake8 --diff

This changes SparseDataFrame to use proper subclassing functionality so slicing of subclasses of SparseDataFrame works. Example of a failure that this PR fixes:

import pandas as pd

class DenseSubclassDF(pd.DataFrame):
    _constructor = property(lambda self: DenseSubclassDF)
    _constructor_sliced = property(lambda self: DenseSubclassS)

class DenseSubclassS(pd.Series):
    _constructor = property(lambda self: DenseSubclassS)
    _constructor_expanddim = property(lambda self: DenseSubclassDF)

class SparseSubclassDF(pd.SparseDataFrame):
    _constructor = property(lambda self: SparseSubclassDF)
    _constructor_sliced = property(lambda self: SparseSubclassS)

class SparseSubclassS(pd.SparseSeries):
    _constructor = property(lambda self: SparseSubclassS)
    _constructor_expanddim = property(lambda self: SparseSubclassDF)

ddf = DenseSubclassDF([[1,2,3], [4,5,6], [7,8,9]])
sdf = SparseSubclassDF([[1,2,3], [4,5,6], [7,8,9]])

print(type(ddf.iloc[0]))  # <class '__main__.DenseSubclassS'>
print(type(ddf.iloc[:2]))  # <class '__main__.DenseSubclassDF'>
print(type(ddf[:2]))  # <class '__main__.DenseSubclassDF'>

# sparse doesn't preserve types
print(type(sdf.iloc[0]))  # <class '__main__.SparseSubclassS'>
print(type(sdf.iloc[:2]))  # <class 'pandas.sparse.frame.SparseDataFrame'>
print(type(sdf[:2]))  # <class 'pandas.sparse.frame.SparseDataFrame'>

sstanovnik changed the title from [FIX] Slicing subclasses of SparseDataFrames. to BUG: Slicing subclasses of SparseDataFrames. Jul 25, 2016

@jreback jreback and 1 other commented on an outdated diff Jul 25, 2016

pandas/sparse/frame.py
@@ -373,7 +373,9 @@ def _slice(self, slobj, axis=0, kind=None):
new_index = self.index
new_columns = self.columns[slobj]
- return self.reindex(index=new_index, columns=new_columns)
+ return self._constructor(data=self.reindex(index=new_index,
@jreback

jreback Jul 25, 2016

Contributor

try not to use \, instead:

return self._constructor(
          data=self.reindex(index=new_index, 
                                        columns=new_columns)).__finalize__(self)
@sinhrks

sinhrks Jul 25, 2016

Member

can't we fix .reindex to return subclass?

@jreback jreback and 1 other commented on an outdated diff Jul 25, 2016

pandas/tests/frame/test_subclass.py
@@ -210,3 +210,17 @@ def test_subclass_align_combinations(self):
tm.assert_series_equal(res1, exp2)
tm.assertIsInstance(res2, tm.SubclassedDataFrame)
tm.assert_frame_equal(res2, exp1)
+
+ def test_subclass_sparse_slice(self):
+ ssdf = tm.SubclassedSparseDataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
@jreback

jreback Jul 25, 2016

Contributor

add the issue number here

@sstanovnik

sstanovnik Jul 25, 2016

Contributor

I haven't created an issue for this. Should I?

@jreback

jreback Jul 25, 2016

Contributor

no, but have a look thru existing ones to see if it covers anything, it might

@jreback

jreback Jul 25, 2016

Contributor

just search, maybe sparse (there are not too many open ones)

@jreback jreback and 2 others commented on an outdated diff Jul 25, 2016

pandas/tests/frame/test_subclass.py
@@ -210,3 +210,17 @@ def test_subclass_align_combinations(self):
tm.assert_series_equal(res1, exp2)
tm.assertIsInstance(res2, tm.SubclassedDataFrame)
tm.assert_frame_equal(res2, exp1)
+
+ def test_subclass_sparse_slice(self):
+ ssdf = tm.SubclassedSparseDataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
+ ssdf.testattr = "testattr"
+
+ tm.assertIsInstance(ssdf.loc[:2], tm.SubclassedSparseDataFrame)
@jreback

jreback Jul 25, 2016

Contributor

use tm.assert_sp_frame_equal here (and just use a constructed SubClassedSparseDataFrame as the expected

@sstanovnik

sstanovnik Jul 25, 2016

Contributor

As far as I can tell, tm.assert_sp_frame_equal only asserts both are instances of SparseDataFrame and not that they are the same type. What I meant to fix and test is only preserving the types when slicing. Should I add (not replace with) an additional check with tm.assert_sp_frame_equal anyway?

@jreback

jreback Jul 25, 2016

Contributor

add a check_dtype option in testing function which will assert that not only they are instances of pd.SparseDataFrame , but also that they are the same (e.g. signature could be
check_dtype=True|'same'

@sinhrks

sinhrks Jul 25, 2016

Member

assert_frame_equal and assert_series_equal have check_frame/series_type for the same meaning. Can use same options.

Contributor

jreback commented Jul 25, 2016

is there a specific issue for this?

@sinhrks pls have a look

codecov-io commented Jul 25, 2016 edited

Current coverage is 85.25% (diff: 100%)

Merging #13787 into master will increase coverage by <.01%

@@             master     #13787   diff @@
==========================================
  Files           140        140          
  Lines         50455      50471    +16   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43014      43031    +17   
+ Misses         7441       7440     -1   
  Partials          0          0          

Powered by Codecov. Last update 59f2557...a7fb24f

Contributor

sstanovnik commented Jul 26, 2016

Thank you for your comments, I made changes according to them:

  • .reindex didn't need to be adapted, the problem was _reindex_columns which explicitly instantiated a SparseDataFrame. I changed all such calls to use _constructor.
  • The thing to check in testing.py is that I added obj as a kwarg to panel checks to conform to what the series and frame ones look like. Previously, obj was referenced in them but the one that was used was a leftover from the for loop around line 2600.
  • There is no issue that handles this problem, the closest one is the sparse master issue #10627. Reference that one or create a new one?

@sinhrks sinhrks commented on an outdated diff Jul 26, 2016

pandas/util/testing.py
obj='SparseDataFrame'):
- """
- exact: Series SparseIndex objects must be exactly the same, otherwise just
- compare dense representations
+ """Check that the left and right SparseDataFrame are equal.
+
+ Parameters
+ ----------
+ left : DataFrame
@sinhrks

sinhrks Jul 26, 2016

Member

nice docstring:) SparseDataFrame.

@sinhrks sinhrks commented on an outdated diff Jul 26, 2016

pandas/util/testing.py
@@ -1453,10 +1492,29 @@ def assert_sp_frame_equal(left, right, exact_indices=True,
assert (col in left)
-def assert_sp_panel_equal(left, right, exact_indices=True):
+def assert_sp_panel_equal(left, right, exact_indices=True,
@sinhrks

sinhrks Jul 26, 2016

Member

SparsePanel has been removed, needs rebase (#13778).

kawochen referenced this pull request Jul 26, 2016

Open

BUG: Sparse master issue #10627

11 of 18 tasks complete
Member

sinhrks commented Jul 26, 2016 edited

@sstanovnik thx for update! i linked the PR from #10627, so new issue is not needed unless there is something which is not fixid in this PR.

Contributor

sstanovnik commented Jul 26, 2016

Should I also squash the commits?

Member

sinhrks commented Jul 26, 2016

not mandatory on your side. will be done during merge.

Contributor

sstanovnik commented Jul 26, 2016

Typo fixed, rebased.

Contributor

sstanovnik commented Jul 27, 2016

Two old pickling tests failed: see this build. I didn't see a nice way out, so I made an exception for the deprecated SparseTimeSeries. This may not be what you want and needs review.

Member

sinhrks commented Jul 27, 2016

@sstanovnik yeah SparseTimeSeries is deprecated, you can switch validation option depending on the version, like:

@jreback jreback commented on the diff Jul 27, 2016

pandas/tests/frame/test_subclass.py
@@ -210,3 +210,25 @@ def test_subclass_align_combinations(self):
tm.assert_series_equal(res1, exp2)
tm.assertIsInstance(res2, tm.SubclassedDataFrame)
tm.assert_frame_equal(res2, exp1)
+
+ def test_subclass_sparse_slice(self):
@jreback

jreback Jul 27, 2016

Contributor

we have lots of changes, but want to see if we can have a test for each, can you audit these changes and add tests if needed.

Contributor

sstanovnik commented Jul 28, 2016

I added additional tests to check other subclassing changes I made.

Contributor

jreback commented Jul 29, 2016

lgtm; though I think the read_pickle assertion can be done in a clearer way (use sep function). @sinhrks ?

@sinhrks sinhrks commented on an outdated diff Jul 29, 2016

pandas/io/tests/test_pickle.py
@@ -44,8 +44,15 @@ def compare_element(self, result, expected, typ, version=None):
return
if typ.startswith('sp_'):
+ # SparseTimeSeries deprecated in 0.17.0
+ if (typ == "sp_series" and version and
@sinhrks

sinhrks Jul 29, 2016

Member

test_pickle can use object-specific method if it is defined. Defining compare_sp_series_ts (like other compare_xxx method) should allow test_pickle to use the method for sparse time series. Current impl skips type check of all sparse series.

NOTE: method name is derived from pickle key:

@sinhrks sinhrks commented on an outdated diff Jul 29, 2016

pandas/io/tests/test_pickle.py
@@ -86,6 +86,13 @@ def compare(self, vf, version):
comparator(result, expected, typ, version)
return data
+ def compare_sp_series_ts(self, res, exp, typ, version):
+ # SparseTimeSeries deprecated in 0.17.0
+ if version and LooseVersion(version) < "0.17.0":
@sinhrks

sinhrks Jul 29, 2016

Member

Thx for the update. I think ... <= "0.12.0" is OK based on the error below. After that, SparseTimeSeries has been integrated to SparseSeries.

Contributor

jreback commented Jul 29, 2016

@sstanovnik looks good. @sinhrks anything further?

jreback added this to the 0.19.0 milestone Jul 29, 2016

Member

sinhrks commented Jul 29, 2016

None. thanks for all the effort, @sstanovnik !

Contributor

jreback commented Aug 1, 2016

@sstanovnik lgtm. pls add a whatsnew note (put in API changes). ping when green.

sstanovnik added some commits Jul 25, 2016

@sstanovnik sstanovnik BUG: Fix slicing subclasses of SparseDataFrames.
Use proper subclassing behaviour so subclasses work properly: this fixes
an issue where a multi-element slice of a subclass of SparseDataFrame
returned the SparseDataFrame type instead of the subclass type.
ddc2874
@sstanovnik sstanovnik Formatting changes. 7c10bb5
@sstanovnik sstanovnik Add type checks to sparse object comparisons. cdb7c8c
@sstanovnik sstanovnik Modify the new test to compare complete objects. 140e4ad
@sstanovnik sstanovnik Use _constructor when instantiating SparseDataFrames.
This should fix subclassing issues.
1d4e9da
@sstanovnik sstanovnik Lint corrections. 4e31303
@sstanovnik sstanovnik Further SparseSeries subclass fixes. 0387303
@sstanovnik sstanovnik Proper test parameter override, travis soothing. d73ec68
@sstanovnik sstanovnik Additional sparse subclass tests. 436a06d
@sstanovnik sstanovnik Properer test assertion override. 0b74034
@sstanovnik sstanovnik Changed the override from 0.17.0 to 0.12.0. 8d2ff79
@sstanovnik sstanovnik Add a whatsnew note.
a7fb24f
Contributor

sstanovnik commented Aug 1, 2016

ping

jreback closed this in a7f7e1d Aug 2, 2016

Contributor

jreback commented Aug 2, 2016

thanks @sstanovnik

jreback referenced this pull request Aug 2, 2016

Closed

ENH: add sparse op for int64 dtypes #13848

4 of 4 tasks complete

sstanovnik deleted the sstanovnik:fix-sparse-slice branch Aug 4, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment