
API: added array #23581

Merged
merged 57 commits into from
Dec 28, 2018

Conversation

TomAugspurger
Contributor

@TomAugspurger TomAugspurger commented Nov 8, 2018

Adds

  • a new top-level pd.array method for creating arrays
  • all our extension dtypes to the top-level API
  • a pd.arrays namespace exposing all our EAs
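For orientation, a minimal sketch of the new entry point as it ended up in released pandas (names and behavior per current pandas, which may differ in detail from the 0.24-era snapshot in this PR):

```python
import pandas as pd

# pd.array picks an ExtensionArray implementation from the data and the
# optional dtype; the nullable "Int64" alias selects an IntegerArray.
arr = pd.array([1, 2, None], dtype="Int64")
print(type(arr).__name__)  # IntegerArray
print(arr.dtype)           # Int64
```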

TODO

  • Add the actual array classes somewhere to the public API (pandas.arrays?)
  • API docs for the rest of the arrays and dtypes.

Closes #22860

supersedes #23532.

@TomAugspurger TomAugspurger added API Design ExtensionArray Extending pandas with custom dtypes or arrays. labels Nov 8, 2018
@pep8speaks

pep8speaks commented Nov 8, 2018

Hello @TomAugspurger! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on December 28, 2018 at 22:13 Hours UTC

@TomAugspurger
Contributor Author

TomAugspurger commented Nov 9, 2018

I'd like to restructure the documentation a bit. I want to collect similar things together, but we have two kinds of similarity:

First, we could group by "topic" (the type of data).

  • Categorical
    • dtype
    • Categorical
    • CategoricalIndex?
  • Period
    • dtype
    • scalar
    • array
    • PeriodIndex(?) or leave that under indexes probably
  • Integer
    • ...
  • Interval
    • ...
  • Sparse
    • ...

and then eventually DatetimeArray & TimedeltaArray.

Alternatively, we could group by "kind" first. So we'd have

  • Dtypes
    • CategoricalDtype
    • PeriodDtype
    • ...
  • Scalars
    • Period
    • ...
  • Arrays
    • Categorical
    • PeriodArray
    • ...

Do people have a preference between "by topic" and "by kind" (cc. @jorisvandenbossche @jreback @jbrockmendel @datapythonista)?

@jorisvandenbossche
Member

I'd like to restructure the documentation a bit.

I suppose you are only talking about the api.rst page?

@TomAugspurger
Contributor Author

Yes, sorry.

@jorisvandenbossche
Member

Not a strong opinion, but for adding new docs I would go by kind.
But that is on the assumption that the above is only for listing the arrays / dtypes, and that there are still separate sections about each topic anyway? (e.g. with all Interval-specific or datetime-like methods/attributes)

@TomAugspurger
Contributor Author

The nice thing about "by topic" is that we can give a high-level summary of what, e.g. "Period" is for. If we have things grouped "by kind" (dtypes, scalars, arrays), then we'd need to repeat that description, or just have it once.

@jorisvandenbossche
Member

The nice thing about "by topic" is that we can give a high-level summary of what, e.g. "Period" is for

But in those places you would also put all the custom attributes and methods?

I actually think we can duplicate somewhat. Even if there are topical sections, I think it is still nice to have a short table of all dtypes and one for all array types.

@TomAugspurger
Contributor Author

Did a bit of inference in fe06de4. There are a few TODOs.

I think we don't handle intervals yet. #23553
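For context, the inference path being built here settled into behavior like the following (a sketch against current pandas; Interval inference was still outstanding at the time, per #23553):

```python
import pandas as pd

# With dtype=None, pd.array inspects the scalars; Period scalars are
# inferred and stored in a PeriodArray rather than an object ndarray.
periods = pd.array([pd.Period("2018-01", freq="M"),
                    pd.Period("2018-02", freq="M")])
print(type(periods).__name__)  # PeriodArray
```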

@TomAugspurger TomAugspurger changed the title from "[WIP] API: added array" to "API: added array" Nov 10, 2018
@codecov

codecov bot commented Nov 10, 2018

Codecov Report

Merging #23581 into master will increase coverage by 0.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master   #23581      +/-   ##
==========================================
+ Coverage   92.31%   92.32%   +0.01%     
==========================================
  Files         165      166       +1     
  Lines       52194    52240      +46     
==========================================
+ Hits        48182    48231      +49     
+ Misses       4012     4009       -3
Flag Coverage Δ
#multiple 90.74% <100%> (+0.01%) ⬆️
#single 43.07% <28.94%> (+0.09%) ⬆️
Impacted Files Coverage Δ
pandas/core/arrays/period.py 98.42% <ø> (ø) ⬆️
pandas/core/arrays/interval.py 93.04% <ø> (ø) ⬆️
pandas/core/dtypes/dtypes.py 95.33% <100%> (ø) ⬆️
pandas/core/arrays/__init__.py 100% <100%> (ø) ⬆️
pandas/core/api.py 100% <100%> (ø) ⬆️
pandas/core/arrays/array_.py 100% <100%> (ø)
pandas/core/arrays/base.py 98.23% <0%> (+0.03%) ⬆️
pandas/core/arrays/sparse.py 92.17% <0%> (+0.06%) ⬆️
pandas/core/arrays/numpy_.py 93.51% <0%> (+0.46%) ⬆️
... and 1 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update c1af4f5...1b9e251.

@TomAugspurger TomAugspurger mentioned this pull request Nov 10, 2018
@jreback jreback added this to the 0.24.0 milestone Nov 11, 2018
pd.IntervalArray.from_tuples([(1, 2), (3, 4)])),
([0, 1], 'Sparse[int64]', pd.SparseArray([0, 1], dtype='int64')),
([1, None], 'Int16', integer_array([1, None], dtype='Int16')),
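Parametrized cases like those above correspond to calls such as the following (a sketch using today's public spellings, e.g. pd.arrays.SparseArray rather than the since-removed pd.SparseArray):

```python
import pandas as pd

# String dtype aliases are looked up in the extension-dtype registry,
# so each alias maps to the matching ExtensionArray subclass.
sparse = pd.array([0, 1], dtype="Sparse[int64]")
ints = pd.array([1, None], dtype="Int16")
print(type(sparse).__name__)  # SparseArray
print(type(ints).__name__)    # IntegerArray
```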

Contributor

can you also send in pd.Series / pd.Index of these types (or is this tested below)?

Contributor

do you test pass thru to ndarray? (e.g. maybe use any_numpy_dtype fixture with the dtype specified)

Contributor Author

It is tested below. Added a couple here as well though.

Not sure how any_numpy_dtype would be used here. We would need data to go with it.

Contributor

@TomAugspurger right, you would need to construct a dummy array and test that it's passing through

Contributor Author

Still not sure what this would look like, or what the test is for. I'm not especially worried about specific numpy types not getting through, and we do test that path.



def test_registered():
Contributor

should this be an extension test? (or maybe in test_dtypes below)

Contributor Author

I'm not really sure how to write a base test for that...

@jreback
Contributor

jreback commented Nov 11, 2018

Looks good @TomAugspurger, mostly some docs & clarification comments.

@TomAugspurger
Contributor Author

From #23581 (comment)


No, we don't because I hadn't really considered what to do about that. Do people have thoughts on how to handle 2-D (or n-d) input inside pd.array? In general, it shouldn't be up to pd.array what is and isn't valid input. That's up to the individual array constructors. But is 2-D special since NumPy handles it but EAs don't?

@TomAugspurger
Contributor Author

Added a doc note about just doing 1-d arrays.

Holding off on NumPy / 2D changes for now, in case we decide we like Series.array always being an EA (will have a PR in an hour or so).
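The Series.array behavior under discussion can be sketched as it eventually shipped (always an ExtensionArray, with plain NumPy dtypes wrapped in a thin NumPy-backed array; this reflects later releases, not this PR's snapshot):

```python
import pandas as pd

# Series.array always returns an ExtensionArray; for a plain int64
# Series the values are wrapped rather than handed back as an ndarray.
s = pd.Series([1, 2, 3])
backing = s.array
print(isinstance(backing, pd.api.extensions.ExtensionArray))  # True
print(backing.dtype)
```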

@TomAugspurger
Contributor Author

#24227 for those following along with the "always return an ExtensionArray" discussion. Let's keep that discussion over there.

But, specific to the pd.array function: if we go with #24227, would we return a NumPyBackedExtensionArray, or raise, for cases like

pd.array([1, 2, 3])

?

@jreback
Contributor

jreback commented Dec 11, 2018

But, specific to the pd.array function: if we go with #24227, would we return a NumPyBackedExtensionArray, or raise, for cases like

pd.array([1, 2, 3])

I guess you have to raise on this until we have numpy backed EA (which I don't think we should try to put into 0.24.0).

Though I am ok with just returning a numpy array for now.

@TomAugspurger TomAugspurger mentioned this pull request Dec 11, 2018
@TomAugspurger
Contributor Author

932e119 has the changes for PandasArray. API-wise, this means that pd.array always returns an ExtensionArray.

This implies that pd.array raises for non-1d input (scalars or 2+ dimensions).

@jreback left a comment

small comments


A new top-level method :func:`array` has been added for creating 1-dimensional arrays (:issue:`22860`).
This can be used to create any :ref:`extension array <extending.extension-types>`, including
extension arrays registered by :ref:`3rd party libraries <ecosystem.extensions>`.
Contributor

don't you now have a ref in basics where this should point?

* doc ref
* use extract_array
* use PandasArray._from_sequence
@jreback
Contributor

jreback commented Dec 28, 2018

@TomAugspurger I am ok with merging this on green. Can follow up on the prior @jorisvandenbossche comment, which was about how we are handling string dtypes, I think? Can you create an issue for that discussion.

@TomAugspurger
Contributor Author

Thanks for the review.

I think that https://github.com/pandas-dev/pandas/pull/23581/files#diff-69ac57923b848af43df327c311b79db4R90 handles @jorisvandenbossche's comments regarding string aliases for dtypes. In a world where we have a StringArray backed by Apache Arrow,

pd.array(['a', 'b'], dtype=None/str)

would return a StringArray. But

pd.array(['a', 'b'], dtype=np.dtype(str))

would continue to return a PandasArray backed by an ndarray with dtype np.dtype("<U").

We should maybe emphasize more that if the underlying memory layout really matters to you, then you shouldn't be using pd.array. But I think this is a relatively rare case, and don't want to bog down users with low-level details...
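The split described above can be sketched with the string dtype that later shipped (note: the extension type here is the dtype="string" StringDtype that eventually landed, not the hypothetical Arrow-backed StringArray from the comment, and the exact wrapping of the NumPy case varies by pandas version):

```python
import numpy as np
import pandas as pd

# The "string" alias opts into the extension dtype, while an explicit
# NumPy dtype keeps a NumPy-backed, fixed-width "<U" layout.
ext = pd.array(["a", "b"], dtype="string")
npy = pd.array(["a", "b"], dtype=np.dtype(str))
print(isinstance(ext, pd.api.extensions.ExtensionArray))  # True
print(np.asarray(npy).dtype)
```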

@jreback
Contributor

jreback commented Dec 28, 2018

#23581 (comment) sounds ok to me.

@jreback jreback mentioned this pull request Dec 28, 2018
@TomAugspurger
Contributor Author

All green now.

@jreback jreback merged commit 77f4b0f into pandas-dev:master Dec 28, 2018
@jreback
Contributor

jreback commented Dec 28, 2018

awesome as always @TomAugspurger

@TomAugspurger TomAugspurger deleted the pd.array branch December 29, 2018 02:40
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019

# this returns None for not-found dtypes.
if isinstance(dtype, compat.string_types):
dtype = registry.find(dtype) or dtype
Member

@TomAugspurger do you remember if there was any particular reason for using this pattern instead of dtype = pandas_dtype(dtype)?

Contributor Author

@TomAugspurger TomAugspurger commented Apr 5, 2023

I don't recall. I wonder if this predates pandas_dtype handling extension dtypes.
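For reference, pandas_dtype does resolve registered extension-dtype strings in current pandas, which is what the question is getting at (a sketch, not a claim about the 2018-era behavior):

```python
import numpy as np
from pandas.api.types import pandas_dtype

# pandas_dtype handles both registered extension aliases and plain
# NumPy dtype strings, subsuming the registry.find(...) fallback.
print(pandas_dtype("Int64"))  # Int64 (extension dtype)
print(pandas_dtype("int64"))  # int64 (NumPy dtype)
```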
