
API: Restructure the dtype struct to be new dtype friendly #25943

Merged (9 commits into numpy:main from seberg:descr-abi-break) on Mar 7, 2024

Conversation

seberg (Member) commented on Mar 5, 2024

Note this PR is an actual ABI-breaking change: no downstream project should be expected to work without recompiling (which also means that our doc builds will simply crash).

This PR makes a few changes; I would like to keep the docs for a follow-up, but am happy either way. Right now I am marking it as draft because:

  • I am seeing test failures in SciPy, and I am not sure yet where they originate (some general issue, something in SciPy, or something here). It's odd; I need to figure this out.
  • I may follow up with more changes (I want to discuss them briefly below), but this is also a good point to start review.
  • The Cython additions should get a simple test and maybe need a closer look.

The actual changes are the following:

  • Flags are now a uint64
  • elsize and alignment are now intp (I don't think it matters for alignment but seemed fine).
  • c_metadata, subarray, fields, and names are now fully gone. (metadata is still there.)

Plus cleanup and moving things between headers, as they are now version dependent; a rough sketch of the resulting struct is below.
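
To make the listed changes concrete, here is a rough sketch of the reshaped descriptor struct they imply. This is not copied from the header: field order, the two-slot reserved array, and any member not named in this PR are assumptions for illustration only.

    /* Illustrative sketch only -- see numpy/_core/include/numpy/ndarraytypes.h
     * for the authoritative definition. */
    typedef struct _SketchArray_Descr {
        PyObject_HEAD
        PyTypeObject *typeobj;   /* scalar type object, as before */
        char kind;               /* kind character, as before */
        char type;               /* unique type character, as before */
        char byteorder;          /* '>', '<', '=', or '|' */
        int type_num;            /* builtin type number */
        npy_uint64 flags;        /* was a char; now 64 bits of dtype flag space */
        npy_intp elsize;         /* was int; itemsize in bytes */
        npy_intp alignment;      /* was int; required alignment */
        PyObject *metadata;      /* kept (see the discussion below) */
        npy_hash_t hash;         /* cached hash value */
        void *reserved_null[2];  /* assumed: NULL slots reserved for future use */
        /* c_metadata, subarray, fields, and names are gone from the public
         * struct; that information now lives behind accessor macros/functions
         * (see the NumPy 2 migration guide). */
    } SketchArray_Descr;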


@ngoldbaum three things I would like your opinion on:

  1. Right now I kept metadata, for the simple reason that it seemed at least somewhat useful and used, and because it made the preparation easier. I am happy to remove it.
  2. I could add one (maybe two) void *reserved slots that we define to be NULL now but can give a meaning in the future. I don't care too much about it, since we have flags and subclasses, but it is possible...
  3. I kept the elsize name, just to keep the diff smaller. But we could rename both, or just the accessor, to ITEMSIZE.

None of them seem like a huge deal, compared to getting rid of the other fields and bumping the size.

charris (Member) commented on Mar 5, 2024

I assume this is a blocker for the 2.0.0 release?

seberg added this to the 2.0.0 release milestone on Mar 5, 2024
ngoldbaum (Member) commented on Mar 5, 2024

Right now I kept metadata, for the simple reason that it seemed at least somewhat useful and used, and because it made the preparation easier. I am happy to remove it.

I did another code search and found one usage in newish code, here to stuff some extra metadata into the descriptor (it looks like it is there to work around a numpy limitation).

Seems like a nice thing to have and no strong reason to remove it, let's keep it.

I could add one (maybe two) void *reserved slots that we define to be NULL now but can give a meaning in the future. I don't care too much about it, since we have flags and subclasses, but it is possible...

One could imagine always providing a per-descriptor allocator like stringdtype does, for example, as a way to deal with thread safety and add nogil support.

So let's do it. My only thought is that having only one extra slot might be limiting; could we define it in such a way that it can be extensible (e.g. using something like the PyType_Slot struct to allow an arbitrary number of extra members defined at runtime)? That's just me spitballing; it might be a bad idea.
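
As a purely illustrative aside, a minimal sketch of that "extensible slot" idea, in the spirit of CPython's PyType_Slot, might look like the following. Every name in it (npy_descr_slot, NPY_DESCR_SLOT_ALLOCATOR, example_alloc) is invented; the PR itself only reserves NULL pointer slots.

    /* Hypothetical sketch only: nothing below is NumPy API. The idea is that a
     * single reserved pointer in the descriptor could reference a NULL-terminated
     * table of (slot id, pointer) pairs, so new per-descriptor hooks could be
     * added later without growing the struct again. */
    #include <stdlib.h>

    typedef struct {
        int slot;     /* invented identifier saying what 'pfunc' means */
        void *pfunc;  /* the corresponding function or data pointer */
    } npy_descr_slot;                   /* hypothetical name */

    #define NPY_DESCR_SLOT_ALLOCATOR 1  /* invented slot id: per-descriptor allocator */

    static void *example_alloc(size_t size) { return malloc(size); }  /* stand-in hook */

    /* A descriptor's reserved pointer could then point at a table like this: */
    static npy_descr_slot example_slots[] = {
        {NPY_DESCR_SLOT_ALLOCATOR, (void *)example_alloc},
        {0, NULL},                      /* terminator */
    };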

I kept the elsize name, just to keep the diff smaller. But we could rename both or just the accessor to ITEMSIZE.

Sure, let's leave it, and in the future we can refactor numpy to use an accessor macro with a better name. There's a ton of downstream code using elsize too so this would cause some significant downstream code churn.

tylerjereddy (Contributor) commented:

I am seeing test failures in SciPy, I am not sure yet where they originate

Let me know if you want help, even if it is just to brute-force dissect out the related part(s) of the code.

seberg (Member, Author) commented on Mar 5, 2024

There's a ton of downstream code using elsize too so this would cause some significant downstream code churn.

Well, we have to force those to churn and use PyDataType_ELSIZE(), although many can use PyArray_ITEMSIZE() instead (this is what I did in SciPy).
I don't think this is too bad; 2.5k finds on GitHub isn't actually that much :) and I quickly get to dead projects ;).
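
For readers hitting this downstream, here is a small sketch of the two accessors mentioned above. The wrapper functions are illustrative only; PyArray_ITEMSIZE and PyDataType_ELSIZE are the real macro names.

    /* Illustrative wrappers; the two macros are the NumPy API being discussed. */
    #include <Python.h>
    #include <numpy/arrayobject.h>

    /* If you already hold an array, PyArray_ITEMSIZE works on both 1.x and 2.x. */
    static npy_intp
    itemsize_from_array(PyArrayObject *arr)
    {
        return PyArray_ITEMSIZE(arr);
    }

    /* If you only hold a descriptor, use PyDataType_ELSIZE; on 1.x headers this
     * macro does not exist and has to come from a shim such as npy_2_compat.h
     * (see the end of this thread). */
    static npy_intp
    itemsize_from_descr(PyArray_Descr *descr)
    {
        return PyDataType_ELSIZE(descr);
    }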

seberg (Member, Author) commented on Mar 5, 2024

OK, the SciPy issues also came down to pybind11 and required an unfortunate little hack pending a pybind11 fixup. Luckily, most places that use pybind11 use the template type version py::array_t to create new arrays, and that isn't affected.

I have updated the docs. We should maybe discuss this briefly. (EDIT: It is unfortunate that the docs don't build. I think we can wait for SciPy to work, but if more fails, we may need to see how to get the changes in.)

I will look into the pybind11 update tomorrow, but I doubt it will be hard. (Unless field sizes turn out to be wrong, further fixups like adding a second void * would not actually break ABI.)

seberg (Member, Author) commented on Mar 5, 2024

Two notes:

  1. I checked, and I could easily add one more reserved field (without adding cruft for Cython); with more than that, the sizes on 32-bit platforms wouldn't match.
  2. Before merging, I will split the docs into their own PR and merge that first. That way the correct docs are in place without having to wait a few days for at least SciPy to get new wheel builds.

This modifies the main dtype/descriptor struct to:
* Use intp for the elsize (and alignment)
* Use a uint64 for flag space
* Actually remove c_metadata and the fields related to structured
  dtypes.

It thus *breaks ABI*, and the unfortunate souls who require access
to `->elsize` or similar fields will have to vendor `npy_2_compat.h`
or do something similar.
(This should not affect very many, though; most can use PyArray_ITEMSIZE.)

The changes also require moving a few other places to be run-time and
fixing allocation of structs.

This commit is not complete: new warnings and errors still need to be
fixed.
(for once shipping f2py makes it easier)
Note that the (rare) aligned struct case will not allow unpickling
NumPy 2.x files with NumPy 1.x.  This could be added if necessary.

Unpickling 1.x files in 2.x is unproblematic.
2 should work (unless I miscounted badly).
seberg (Member, Author) commented on Mar 6, 2024

I think this should be ready, but of course someone should look it over.

@lithomas1 once this is merged, pandas will need some small updates to the JSON C reader (IIRC); it shouldn't be more than the similar ones that arrow needs (@jorisvandenbossche is aware). If the arrow code paths are hit, this might crash pandas, although building the wheel would still be fine.

Right now, I don't think it is looking too bad, but of course I don't know... if it ends up being bad, we might have to keep elsize as an integer to break less code.

mattip (Member) left a comment

Makes sense to me, and it lays the groundwork for dropping the old struct in a few years when all user-defined dtypes move to NumPy 2. I still need a bit of convincing around the reserved space, but it is not a big deal since the whole structure is meant to be opaque.

Edit: the spare fields are needed for subclassing. Makes sense.

Review comments on: numpy/__init__.cython-30.pxd, numpy/__init__.pxd, numpy/_core/include/numpy/ndarraytypes.h
Co-authored-by: Matti Picus <matti.picus@gmail.com>
mattip (Member) commented on Mar 7, 2024

Tests are failing, but not because of this PR; probably due to #21760. I will comment there. The error is (note the extra ·):

    def test_polynomial_str(self):
        res = str(poly.Polynomial([0, 1]))
        tgt = '0.0 + 1.0 x'
>       assert_equal(res, tgt)
E       AssertionError: 
E       Items are not equal:
E        ACTUAL: '0.0 + 1.0·x'
E        DESIRED: '0.0 + 1.0 x'

mattip (Member) commented on Mar 7, 2024

I will merge this to keep the momentum going. Let's see what breaks downstream...

mattip merged commit dbbf235 into numpy:main on Mar 7, 2024 (57 of 63 checks passed)
mattip (Member) commented on Mar 7, 2024

Thanks @seberg

lithomas1 (Collaborator) commented on Mar 7, 2024

Thanks for the heads up.

Can someone from numpy hit the build button on the nightlies now that this is merged?

seberg (Member, Author) commented on Mar 7, 2024

Yesterday it sounded a bit like you wanted to do that, @rgommers, so that you can make sure to do it for SciPy quickly afterwards as well?

charris (Member) commented on Mar 7, 2024

I have already triggered the builds, at least those not on cirrus.

jakevdp (Contributor) commented on Mar 8, 2024

Hi - I found that this PR broke the openxla build, due to this line: https://github.com/openxla/xla/blob/24eaeeab8e7465cef0fc655cd9fd8f6060485a27/xla/python/nb_numpy.h#L55

What would be the best way to access elsize in a way that is compatible with both NumPy 1.x and 2.x?

jorisvandenbossche (Contributor) commented:

@jakevdp see #25946 for the relevant doc changes

The direct equivalent is PyDataType_ELSIZE, but since I think you have an array (and not just a descr), you can use PyArray_ITEMSIZE directly. (For PyDataType_ELSIZE you need to add a small shim to make it compile on 1.x.)

seberg (Member, Author) commented on Mar 8, 2024

Right, the caveat is that it needs import_array now, but you probably already have that.

seberg deleted the descr-abi-break branch on March 8, 2024, 19:48
jakevdp (Contributor) commented on Mar 8, 2024

Thanks. Unfortunately, what the code has access to there is just a descr, not the array, so PyArray_ITEMSIZE is not applicable. Is there any way to get descr->elsize in a cross-version-compatible way, given that descr is a PyArray_Descr? I can't find an answer to that in the docs you linked to.

jakevdp (Contributor) commented on Mar 8, 2024

Maybe I need to use compiler directives to conditionally define PyDataType_ELSIZE when building against older NumPy versions – would that be the recommended approach here?

ngoldbaum (Member) commented on Mar 8, 2024

Maybe I need to use compiler directives to conditionally define PyDataType_ELSIZE

Yes, unfortunately this access pattern needs to be dealt with using e.g. an npy_2_compat.h header. See here and below:
https://github.com/numpy/numpy/blob/main/numpy/_core/include/numpy/npy_2_compat.h#L142

seberg (Member, Author) commented on Mar 8, 2024

would that be the recommended approach here?

Yes, just #if NPY_ABI_VERSION < 0x02000000 and define the macro. You could also copy npy_2_compat.h and include it explicitly; I like having that option, but I wouldn't use it myself unless you need it in a few places.
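
A minimal sketch of that suggestion, assuming the 1.x fallback can simply read the old public elsize field (which is essentially what the vendored npy_2_compat.h does for this macro):

    /* Sketch of the conditional definition suggested above; vendoring the full
     * npy_2_compat.h header is the more complete alternative. */
    #include <Python.h>
    #include <numpy/ndarraytypes.h>

    #if NPY_ABI_VERSION < 0x02000000
        /* Building against NumPy 1.x headers: the accessor does not exist yet,
         * so fall back to the old public struct field. */
        #ifndef PyDataType_ELSIZE
            #define PyDataType_ELSIZE(descr) ((descr)->elsize)
        #endif
    #endif
    /* Against NumPy 2.x headers, PyDataType_ELSIZE is already provided. */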

@jakevdp
Copy link
Contributor

jakevdp commented Mar 8, 2024

Ah, I see now that the PyDataType_ELSIZE solution is mentioned at https://numpy.org/devdocs/numpy_2_0_migration_guide.html#the-pyarray-descr-struct-has-been-changed. Thanks!
