-
-
Notifications
You must be signed in to change notification settings - Fork 17.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
isin method buggy on SparseSeries #1412
Comments
I've been investigating this a bit further, and it looks like this bug isn't related to using a fill_value other than NaN (unlike the previous issue I found with SparseSeries), as I get a similar result using NaN as the fill_value: In [7]: sparse_df = DataFrame({'flag': [np.nan, 0, 1]}).to_sparse()
In [8]: sparse_df.flag.isin([1.])
Out[8]:
0 False
1 True
2 False
Name: flag I think what's going on is that the logic checking if each value from the array is in the query set is only checking through the reduced array of non-sparse data and ignoring the sparse 'slots'. I am having trouble digging deeper because a lot of the logic appears to be implemented in cython or up in numpy somewhere (and this stuff is all rather new to me), so I can't trace through it with pdb. I wonder if isin (and possibly other similar methods) needs to be overridden in the SparseSeries class so that it can use the sparse index? Any pointers on where to look next would be very welcome. Using sparse data structures is giving me significant savings on memory and disk usage and is making some analyses possible that I couldn't do before, so it would be really good to get them working on a par with their dense counterparts. |
Here is a straw man implementation of isin for the SparseSeries class which seems to work for my test data at least. The method returns a (dense) boolean Series just like the Series.isin method, but as per the inline comment, I wonder if it should be returning a boolean SparseSeries instead? Any comments welcome. I could submit a proper pull request for this, but I don't feel like I understand the code well enough yet, so I wonder if the developers could include it (or something along these lines) if this seems like a reasonable approach. def isin(self, values):
"""
Return boolean Series showing whether each element in the
SparseSeries is exactly contained in the passed sequence of
values
Parameters
----------
values : sequence
Returns
-------
isin : Series (boolean dtype)
"""
value_set = set(values)
# call the superclass to get the result for the
# non-sparse values
non_sparse_result = super(SparseArray, self).isin(value_set)
# construct our result array, with default values
# set according to whether our fill_value is in
# values
result = np.empty(len(self), dtype=np.bool_)
result.fill(self.fill_value in value_set)
# and fill in the rest with the results from the
# non-sparse entries
int_index = self.sp_index.to_int_index()
result.put(int_index.indices, non_sparse_result)
# XXX: should we be returning a boolean SparseSeries
# with the fill_value set to
# (self.fill_value in value_set)?
return Series(result, self.index, name=self.name) |
In [1]: sparse_df = DataFrame({'flag': [1., 0., 1.]}).to_sparse(fill_value=0.) In [2]: sparse_df[sparse_df.flag.isin([1.])] |
Thanks for the bug report! |
* commit 'v0.8.0b1-137-gf733f10': (119 commits) ENH: make any/all conform to sum/mean interface. Fixed bug in copy keyword in BlockManager.get_numeric_data pandas-dev#1416 TST: mixed case ENH: DataFrame any all pandas-dev#1416 BLD: ujson fix compiler warnings BUG: ujson memory alignment fixes BUG: ujson don't compare pointers outside their allocated blocks ujson check for malloc success Add dependency on pytz ENH: check the periods argument is a number pandas-dev#1438 BUG: fix DatetimeIndex.groupby bug close pandas-dev#1430 ENH: set_xlim using str for irregular index pandas-dev#1427 BUG: isin method buggy on SparseSeries pandas-dev#1412 BUG: pass on limit argument in upsampling, close pandas-dev#1424 ENH: implement comparison methods on Factor, close pandas-dev#1405 BUG: fix numpy 1.6 erroneous type-casting bug causing NumPy 1.7 issues, close pandas-dev#1396 BUG: fix Series .ix bug and test failure described in pandas-dev#1396 DOC: Indicate that to_csv and from_csv methods can handle string file path and file-like objects for io. BUG: fix a number of negative ordinal bugs, check for out-of-range quarters BUG: preserve name in PeriodIndex.to_timestamp BUG: fix asfreq bug on PeriodIndex ...
The isin method doesn't seem to work correctly for SparseSeries objects, but does work on an equivalent dense object, example code below:
The text was updated successfully, but these errors were encountered: