isin method buggy on SparseSeries #1412

grsr · 2012-06-06T10:05:35Z

The isin method doesn't seem to work correctly for SparseSeries objects, but does work on an equivalent dense object, example code below:

In [203]: pandas.__version__
Out[203]: '0.7.3'

In [204]: sparse_df = DataFrame({'flag': [1., 0., 1.]}).to_sparse(fill_value=0.)

In [205]: sparse_df[sparse_df.flag == 1.]
Out[205]: 
   flag
0     1
2     1

In [206]: sparse_df[sparse_df.flag.isin([1.])]
Out[206]: 
   flag
0     1
1     0

In [207]: dense_df = DataFrame({'flag': [1., 0., 1.]})

In [208]: dense_df[dense_df.flag == 1.]
Out[208]: 
   flag
0     1
2     1

In [209]: dense_df[dense_df.flag.isin([1.])]
Out[209]: 
   flag
0     1
2     1

grsr · 2012-06-07T08:55:22Z

I've been investigating this a bit further, and it looks like this bug isn't related to using a fill_value other than NaN (unlike the previous issue I found with SparseSeries), as I get a similar result using NaN as the fill_value:

In [7]: sparse_df = DataFrame({'flag': [np.nan, 0, 1]}).to_sparse()

In [8]: sparse_df.flag.isin([1.])
Out[8]: 
0    False
1     True
2    False
Name: flag

I think what's going on is that the logic checking if each value from the array is in the query set is only checking through the reduced array of non-sparse data and ignoring the sparse 'slots'. I am having trouble digging deeper because a lot of the logic appears to be implemented in cython or up in numpy somewhere (and this stuff is all rather new to me), so I can't trace through it with pdb. I wonder if isin (and possibly other similar methods) needs to be overridden in the SparseSeries class so that it can use the sparse index? Any pointers on where to look next would be very welcome. Using sparse data structures is giving me significant savings on memory and disk usage and is making some analyses possible that I couldn't do before, so it would be really good to get them working on a par with their dense counterparts.

grsr · 2012-06-07T11:17:20Z

Here is a straw man implementation of isin for the SparseSeries class which seems to work for my test data at least. The method returns a (dense) boolean Series just like the Series.isin method, but as per the inline comment, I wonder if it should be returning a boolean SparseSeries instead? Any comments welcome. I could submit a proper pull request for this, but I don't feel like I understand the code well enough yet, so I wonder if the developers could include it (or something along these lines) if this seems like a reasonable approach.

def isin(self, values):
    """
    Return boolean Series showing whether each element in the 
    SparseSeries is exactly contained in the passed sequence of 
    values

    Parameters
    ----------
    values : sequence

    Returns
    -------
    isin : Series (boolean dtype)
    """

    value_set = set(values)

    # call the superclass to get the result for the 
    # non-sparse values
    non_sparse_result = super(SparseArray, self).isin(value_set)

    # construct our result array, with default values
    # set according to whether our fill_value is in 
    # values
    result = np.empty(len(self), dtype=np.bool_)
    result.fill(self.fill_value in value_set)

    # and fill in the rest with the results from the
    # non-sparse entries
    int_index = self.sp_index.to_int_index()
    result.put(int_index.indices, non_sparse_result)

    # XXX: should we be returning a boolean SparseSeries 
    # with the fill_value set to 
    # (self.fill_value in value_set)?

    return Series(result, self.index, name=self.name)

changhiskhan · 2012-06-08T03:49:06Z

In [1]: sparse_df = DataFrame({'flag': [1., 0., 1.]}).to_sparse(fill_value=0.)

In [2]: sparse_df[sparse_df.flag.isin([1.])]
Out[2]:
flag
0 1
2 1

changhiskhan · 2012-06-08T03:50:27Z

Thanks for the bug report!

* commit 'v0.8.0b1-137-gf733f10': (119 commits) ENH: make any/all conform to sum/mean interface. Fixed bug in copy keyword in BlockManager.get_numeric_data pandas-dev#1416 TST: mixed case ENH: DataFrame any all pandas-dev#1416 BLD: ujson fix compiler warnings BUG: ujson memory alignment fixes BUG: ujson don't compare pointers outside their allocated blocks ujson check for malloc success Add dependency on pytz ENH: check the periods argument is a number pandas-dev#1438 BUG: fix DatetimeIndex.groupby bug close pandas-dev#1430 ENH: set_xlim using str for irregular index pandas-dev#1427 BUG: isin method buggy on SparseSeries pandas-dev#1412 BUG: pass on limit argument in upsampling, close pandas-dev#1424 ENH: implement comparison methods on Factor, close pandas-dev#1405 BUG: fix numpy 1.6 erroneous type-casting bug causing NumPy 1.7 issues, close pandas-dev#1396 BUG: fix Series .ix bug and test failure described in pandas-dev#1396 DOC: Indicate that to_csv and from_csv methods can handle string file path and file-like objects for io. BUG: fix a number of negative ordinal bugs, check for out-of-range quarters BUG: preserve name in PeriodIndex.to_timestamp BUG: fix asfreq bug on PeriodIndex ...

changhiskhan pushed a commit that referenced this issue Jun 8, 2012

BUG: isin method buggy on SparseSeries #1412

ed7a3d2

changhiskhan closed this as completed Jun 8, 2012

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

isin method buggy on SparseSeries #1412

isin method buggy on SparseSeries #1412

grsr commented Jun 6, 2012

grsr commented Jun 7, 2012

grsr commented Jun 7, 2012

changhiskhan commented Jun 8, 2012

changhiskhan commented Jun 8, 2012

isin method buggy on SparseSeries #1412

isin method buggy on SparseSeries #1412

Comments

grsr commented Jun 6, 2012

grsr commented Jun 7, 2012

grsr commented Jun 7, 2012

changhiskhan commented Jun 8, 2012

changhiskhan commented Jun 8, 2012