Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

isin method buggy on SparseSeries #1412

Closed
grsr opened this issue Jun 6, 2012 · 4 comments
Closed

isin method buggy on SparseSeries #1412

grsr opened this issue Jun 6, 2012 · 4 comments
Labels
Milestone

Comments

@grsr
Copy link

grsr commented Jun 6, 2012

The isin method doesn't seem to work correctly for SparseSeries objects, but does work on an equivalent dense object, example code below:

In [203]: pandas.__version__
Out[203]: '0.7.3'

In [204]: sparse_df = DataFrame({'flag': [1., 0., 1.]}).to_sparse(fill_value=0.)

In [205]: sparse_df[sparse_df.flag == 1.]
Out[205]: 
   flag
0     1
2     1

In [206]: sparse_df[sparse_df.flag.isin([1.])]
Out[206]: 
   flag
0     1
1     0

In [207]: dense_df = DataFrame({'flag': [1., 0., 1.]})

In [208]: dense_df[dense_df.flag == 1.]
Out[208]: 
   flag
0     1
2     1

In [209]: dense_df[dense_df.flag.isin([1.])]
Out[209]: 
   flag
0     1
2     1
@grsr
Copy link
Author

grsr commented Jun 7, 2012

I've been investigating this a bit further, and it looks like this bug isn't related to using a fill_value other than NaN (unlike the previous issue I found with SparseSeries), as I get a similar result using NaN as the fill_value:

In [7]: sparse_df = DataFrame({'flag': [np.nan, 0, 1]}).to_sparse()

In [8]: sparse_df.flag.isin([1.])
Out[8]: 
0    False
1     True
2    False
Name: flag

I think what's going on is that the logic checking if each value from the array is in the query set is only checking through the reduced array of non-sparse data and ignoring the sparse 'slots'. I am having trouble digging deeper because a lot of the logic appears to be implemented in cython or up in numpy somewhere (and this stuff is all rather new to me), so I can't trace through it with pdb. I wonder if isin (and possibly other similar methods) needs to be overridden in the SparseSeries class so that it can use the sparse index? Any pointers on where to look next would be very welcome. Using sparse data structures is giving me significant savings on memory and disk usage and is making some analyses possible that I couldn't do before, so it would be really good to get them working on a par with their dense counterparts.

@grsr
Copy link
Author

grsr commented Jun 7, 2012

Here is a straw man implementation of isin for the SparseSeries class which seems to work for my test data at least. The method returns a (dense) boolean Series just like the Series.isin method, but as per the inline comment, I wonder if it should be returning a boolean SparseSeries instead? Any comments welcome. I could submit a proper pull request for this, but I don't feel like I understand the code well enough yet, so I wonder if the developers could include it (or something along these lines) if this seems like a reasonable approach.

def isin(self, values):
    """
    Return boolean Series showing whether each element in the 
    SparseSeries is exactly contained in the passed sequence of 
    values

    Parameters
    ----------
    values : sequence

    Returns
    -------
    isin : Series (boolean dtype)
    """

    value_set = set(values)

    # call the superclass to get the result for the 
    # non-sparse values
    non_sparse_result = super(SparseArray, self).isin(value_set)

    # construct our result array, with default values
    # set according to whether our fill_value is in 
    # values
    result = np.empty(len(self), dtype=np.bool_)
    result.fill(self.fill_value in value_set)

    # and fill in the rest with the results from the
    # non-sparse entries
    int_index = self.sp_index.to_int_index()
    result.put(int_index.indices, non_sparse_result)

    # XXX: should we be returning a boolean SparseSeries 
    # with the fill_value set to 
    # (self.fill_value in value_set)?

    return Series(result, self.index, name=self.name)

@changhiskhan
Copy link
Contributor

In [1]: sparse_df = DataFrame({'flag': [1., 0., 1.]}).to_sparse(fill_value=0.)

In [2]: sparse_df[sparse_df.flag.isin([1.])]
Out[2]:
flag
0 1
2 1

@changhiskhan
Copy link
Contributor

Thanks for the bug report!

yarikoptic added a commit to neurodebian/pandas that referenced this issue Jun 21, 2012
* commit 'v0.8.0b1-137-gf733f10': (119 commits)
  ENH: make any/all conform to sum/mean interface. Fixed bug in copy keyword in BlockManager.get_numeric_data pandas-dev#1416
  TST: mixed case
  ENH: DataFrame any all pandas-dev#1416
  BLD: ujson fix compiler warnings
  BUG: ujson memory alignment fixes
  BUG: ujson don't compare pointers outside their allocated blocks
  ujson check for malloc success
  Add dependency on pytz
  ENH: check the periods argument is a number pandas-dev#1438
  BUG: fix DatetimeIndex.groupby bug close pandas-dev#1430
  ENH: set_xlim using str for irregular index pandas-dev#1427
  BUG: isin method buggy on SparseSeries pandas-dev#1412
  BUG: pass on limit argument in upsampling, close pandas-dev#1424
  ENH: implement comparison methods on Factor, close pandas-dev#1405
  BUG: fix numpy 1.6 erroneous type-casting bug causing NumPy 1.7 issues, close pandas-dev#1396
  BUG: fix Series .ix bug and test failure described in pandas-dev#1396
  DOC: Indicate that to_csv and from_csv methods can handle string file path and file-like objects for io.
  BUG: fix a number of negative ordinal bugs, check for out-of-range quarters
  BUG: preserve name in PeriodIndex.to_timestamp
  BUG: fix asfreq bug on PeriodIndex
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants