Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SparseArray is an ExtensionArray #22325

Merged
merged 236 commits into from Oct 13, 2018

Conversation

@TomAugspurger
Copy link
Contributor

commented Aug 13, 2018

Closes #21978
Closes #19506
Closes #22835

High-level summary: SparseArray is an ExtensionArray. It's no longer an ndarray subclass. The actual data model hasn't changed at all, it's still an array and a sparse_index. Only now the sparse values are self.sparse_values, rather than self.

This isn't really close to being ready yet. I'm going to go through and self-review a bunch of things right now, will call out for others' opinions in specific places.

API discussions:

  • Why is the default kind inconsistent between SparseSeries and SparseArray?
    • Possibly get rid of block in a future PR
  • Should .astype(np_dtype) be sparse or dense?
    • sparse
  • What should the inferred type of an empty SparseArray be? SparseArray([]).dtype? NumPy defaults to float (Sparse[float64]), pandas typically uses object (Sparse[object])
    • sparse
  • Policy for warning when converting to dense (e.g. https://github.com/pandas-dev/pandas/pull/22325/files#diff-71caf9627e9687e837e4b1f86ecc6271R390). In the past, pandas would coerce to an ndarray all over the place. Now we at least have a chance of knowing when we're doing a bad thing, and warning about it.
    • warn when it's implicit
@TomAugspurger

This comment has been minimized.

Copy link
Contributor Author

commented Oct 12, 2018

NumPy will do that for us, but it still takes a different path (on older numpy).

I think it's roughly equivalent to

In [2]: a = np.array([0, 1, 2])

In [3]: a[[True, False, True]]
/Users/taugspurger/miniconda3/envs/circle-27-compat/bin/ipython:1: FutureWarning: in the future, boolean array-likes will be handled as a boolean array index
  #!/Users/taugspurger/miniconda3/envs/circle-27-compat/bin/python
Out[3]: array([1, 0, 1])
@TomAugspurger

This comment has been minimized.

Copy link
Contributor Author

commented Oct 12, 2018

The root problem is ndarray.__getitem__(Sparse[bool]). For NumPy 1.9.3, numpy doesn't treat the Sparse[bool] as a boolean indexer.

@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented Oct 12, 2018

Yes, I understand that, but we can still manually convert the sparse boolean to a numpy boolean in places where we use it for indexing a numpy array? (although it would not be needed for newer numpy)

@jreback
Copy link
Contributor

left a comment

some comments. you may want to address some here (as you have to rebase anyhow). or a followup ok.

* Passing a scalar for ``indices`` is no longer allowed.
- The result of concatenating a mix of sparse and dense Series is a Series with sparse values, rather than a ``SparseSeries``.
- ``SparseDataFrame.combine`` and ``DataFrame.combine_first`` no longer supports combining a sparse column with a dense column while preserving the sparse subtype. The result will be an object-dtype SparseArray.
- Setting :attr:`SparseArray.fill_value` to a fill value with a different dtype is now allowed.

This comment has been minimized.

Copy link
@jreback

jreback Oct 12, 2018

Contributor

i agree the fill type should match the dtype but since missing value support is allowed here it is prob ok.

@@ -566,6 +597,7 @@ update the ``ExtensionDtype._metadata`` tuple to match the signature of your
- Added :meth:`pandas.api.types.register_extension_dtype` to register an extension type with pandas (:issue:`22664`)
- Series backed by an ``ExtensionArray`` now work with :func:`util.hash_pandas_object` (:issue:`23066`)
- Updated the ``.type`` attribute for ``PeriodDtype``, ``DatetimeTZDtype``, and ``IntervalDtype`` to be instances of the dtype (``Period``, ``Timestamp``, and ``Interval`` respectively) (:issue:`22938`)
- :func:`ExtensionArray.isna` is allowed to return an ``ExtensionArray`` (:issue:`22325`).

This comment has been minimized.

Copy link
@jreback

jreback Oct 12, 2018

Contributor

really? we allow this. I agree this would be ok, but is reasonably tested?

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Oct 12, 2018

Author Contributor

Define reasonable :)

I'm reasonably sure there are places in pandas where we assume we have an ndarray, but may get an ExtensionArray instead.

The common case of .isna().any() is well tested though, I think.

This comment has been minimized.

Copy link
@jorisvandenbossche

jorisvandenbossche Oct 12, 2018

Member

In any case, if you create the masks in other ways than .isna(), eg with a comparison like df['sparse_column'] == 1, you get exactly the same issue. Which means we basically have to support indexing with boolean sparse masks anyway I think.

Sparse
^^^^^^

- Updating a boolean, datetime, or timedelta column to be Sparse now works (:issue:`22367`)

This comment has been minimized.

Copy link
@jreback

jreback Oct 12, 2018

Contributor

really, we have support for this? again i agree this is a nice feature, but we are decreasing support generally for sparse, so not anxious to advertise this

This should return a 1-D array the same length as 'self'.
* ``na_values._is_boolean`` should be True
* `na_values` should implement :func:`ExtensionArray._reduce`

This comment has been minimized.

Copy link
@jreback

jreback Oct 12, 2018

Contributor

we should probably have an Indexing EA mixin that implementes these as NotImplemented (so once can subclass)

This comment has been minimized.

Copy link
@jorisvandenbossche

jorisvandenbossche Oct 12, 2018

Member

Such an indexing mixing might be useful, but how is this related to the line above?

Show resolved Hide resolved pandas/core/internals/blocks.py
else:
return g, None
try:
g = np.find_common_type(upcast_classes, [])

This comment has been minimized.

Copy link
@jreback

jreback Oct 12, 2018

Contributor

ok,, yeah would be nice to simplify this

Show resolved Hide resolved pandas/core/reshape/reshape.py Outdated
Show resolved Hide resolved pandas/tests/extension/arrow/bool.py Outdated
@@ -67,7 +67,11 @@ def _from_sequence(cls, scalars, dtype=None, copy=False):
return cls.from_scalars(scalars)

def __getitem__(self, item):

This comment has been minimized.

Copy link
@jreback

jreback Oct 12, 2018

Contributor

see my comment above, we should create an Indexing EA mixin and use the interface here

@TomAugspurger

This comment has been minimized.

Copy link
Contributor Author

commented Oct 12, 2018

Yes, I understand that, but we can still manually convert the sparse boolean to a numpy boolean in places where we use it for indexing a numpy array?

Oh, I see what you're suggesting. I only see two of these FutureWarnings from numpy...

Looking at quantile, right now it'd be fine to convert the ExtensionArray mask to an ndarray because we know that values is an ndarray. However, I could imagine a future where we (or numpy, if an array-like implements __array_function__) dispatches np.percentile. Converting the mask to an ndarray would be counter-productive. But we're a long way from that future, so let's just do the conversion for now, with a note that it may have to change.

The other case is in SparseArray.take, where we create the ndarray, so that's safe.

@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented Oct 12, 2018

The other case is in SparseArray.take, where we create the ndarray, so that's safe.

Did you do a change for this? (I only see something in the diff for percentile)

@TomAugspurger

This comment has been minimized.

Copy link
Contributor Author

commented Oct 12, 2018

Did you do a change for this? (I only see something in the diff for percentile)

Sorry forgot to post what I found. The LOC is

return self.take(np.arange(len(key), dtype=np.int32)[key])

We can't hit that with an SparseArray (we go to dense earlier). I see now that we're hitting it with something like

In [4]: arr[[True, False, True]]
pandas/core/sparse/array.py:662: FutureWarning: in the future, boolean array-likes will be handled as a boolean array index
  return self.take(np.arange(len(key), dtype=np.int32)[key])
Out[4]:
[2, 1, 2]
Fill: 0
IntIndex
Indices: array([0, 1, 2], dtype=int32)

again this is only on older numpys. So we can do an asarray on key there.

@TomAugspurger

This comment has been minimized.

Copy link
Contributor Author

commented Oct 12, 2018

FYI, this boolean indexing behavior affects through numpy 1.11.

@TomAugspurger

This comment has been minimized.

Copy link
Contributor Author

commented Oct 12, 2018

All green, if anyone wants to take a look at the dtype raising commit before pushing the green button.

@jorisvandenbossche
Copy link
Member

left a comment

Last changes look good, just a minor doc comment

@@ -173,9 +173,10 @@ def construct_from_string(cls, string):
'Sparse[int, 1]' SparseDtype[np.int64, 0]

This comment has been minimized.

Copy link
@jorisvandenbossche

jorisvandenbossche Oct 13, 2018

Member

I think this one now needs to be updated as the above would raise an error. But make it 'Sparse[int, 0] to show that default fill value is OK?

@TomAugspurger

This comment has been minimized.

Copy link
Contributor Author

commented Oct 13, 2018

Good catch.

Here we go.

@TomAugspurger TomAugspurger merged commit 56d8e78 into pandas-dev:master Oct 13, 2018

6 checks passed

ci/circleci: py27_compat Your tests passed on CircleCI!
Details
ci/circleci: py35_ascii Your tests passed on CircleCI!
Details
ci/circleci: py36_locale Your tests passed on CircleCI!
Details
ci/circleci: py36_locale_slow Your tests passed on CircleCI!
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
pandas-dev.pandas Build #20181013.17 succeeded
Details

@TomAugspurger TomAugspurger deleted the TomAugspurger:ea-sparse-2 branch Oct 13, 2018

@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented Oct 15, 2018

Whoohoo!

@TomAugspurger did you open an issue with your follow-up items list?

@TomAugspurger

This comment has been minimized.

Copy link
Contributor Author

commented Oct 15, 2018

And there was already one for deprecating SparseDataFrame. I'll add SparseSeries to that discussion.

brute4s99 added a commit to brute4s99/pandas that referenced this pull request Nov 19, 2018

[API/REF]: SparseArray is an ExtensionArray (pandas-dev#22325)
Makes SparseArray an ExtensionArray.

* Fixed DataFrame.__setitem__ for updating to sparse.

Closes pandas-dev#22367

* Fixed Series[sparse].to_sparse

Closes pandas-dev#22389

Closes pandas-dev#21978
Closes pandas-dev#19506
Closes pandas-dev#22835
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.