Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Update scalar representations as per NEP 51 #22449

Merged
merged 49 commits into from
Jul 26, 2023

Conversation

seberg
Copy link
Member

@seberg seberg commented Oct 18, 2022

This is a WIP for implementing NEP 51/gh-22261. There are a few things that go beyond just changing the scalar representation:

  1. It makes get_formatter() a more central/public API, to also move fmt=... to it.

    • This introduces the distinction that if the dtype is known, we can e.g. represent a longdouble as '3.1' (with quotes) rather than np.longdouble('3.1').
    • fmt is mainly used for fmt="r" (previous point), I think fmt="s" should be supported, but always be identical to str(arr[0]).
    • There could be an argument for the fmt path to be a ufunc (it should not inspect array values, unlike the default printing).
    • fmt should be used more cleanly for the __format__ work. This should be fine, I think it only means pushing the "option" adjustments lower.
  2. arr.tofile() used repr(arr.item(...)) to print each element. This is weird/problematic:

    • For longdouble, I want that repr to be np.longdouble('3.1').
    • For strings, this means it uses 'string' or b'string'
  3. Record and MA printing has to be adjusted to use the new functionality. This affects record scalars and the printing of the masked array fill_value.

    After some back and forth, I think the default should just be to go back to str() (this was changed silently in BUG 4381 Longdouble from string without precision loss #6199). str(arr.item()) is not ideal for float32 for example.
    This could use some thought (or should change to use the above get_formatter(fmt="s"), but I hope this can be decoupled from NEP 51.


I plan to update the NEP 51 draft to include these points. Happy to discuss here of course (will make pushing NEP 51 quicker). But for now the draft-PR exists primarily to help with the NEP 51 discussion.

TODOS::

  • Refguide check is failing (and may lead to fixups being necessary?)
  • Discuss and accept NEP 51
  • Decide whether this should be hidden behind a flag or if it should interact with the legacy print-mode (i.e. to allow inclusion in the next release, but only switch with NEP 50 – which would likely be a NumPy 2.0 as well)
  • New tests for some paths are needed. Mainly arr.tofile() is not well tested (which is why the change here should not fail tests).

@seberg seberg force-pushed the scalar-repr-change branch 2 times, most recently from 342b701 to ff552fc Compare October 18, 2022 11:59
@tylerjereddy
Copy link
Contributor

@seberg so far so good for latest main of SciPy alongside this feature branch in my hands (Ubuntu Linux; i9-7900X):

  • python dev.py test -j 10
  • 37669 passed, 2222 skipped, 139 xfailed, 9 xpassed in 83.59s (0:01:23)

@seberg
Copy link
Member Author

seberg commented Oct 21, 2022

So, I realized that the empty format string "" already has more or less the meaning of str(), and in that sense I was thinking of using the repr function itself as a singleton for "scalar repr" (rather than "r").

I don't really have an opinion, could even say that str should be used for !s formatting effectively. Or could go back, because why would anyone use {obj:r} for something that is not a repr...

EDIT: Reconsidering this, maybe "" is different, since it would be simpler if that was the same as the default formatting, which is not string. So using str and repr seems actually pretty nice?

The thing to look into now (besides test fixes), is that array float formatting doesn't match scalar float formatting, so I actually need to pass through the formatting information better (should not be hard, but needs a bit organizing/thought.)

@seberg seberg marked this pull request as ready for review October 21, 2022 13:14
@seberg seberg changed the title WIP: Update scalar representations as per NEP 51 ENH: Update scalar representations as per NEP 51 Oct 21, 2022
@seberg seberg added the 56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes label Oct 21, 2022
@seberg seberg marked this pull request as draft October 21, 2022 13:44
@seberg
Copy link
Member Author

seberg commented Oct 24, 2022

This is starting to settle, although I am considering changing np.float64(nan) to add the quotes always (np.float64('nan') similar to Python).

@rossbar do you know why the refguide check does not seem to notice changes from 0.0 to np.float64(0.0)? In a sense, I like that, making that a followup is probably far more manageable.

@seberg seberg marked this pull request as ready for review October 24, 2022 12:56
@seberg seberg force-pushed the scalar-repr-change branch 3 times, most recently from ce4166f to 9a64758 Compare October 24, 2022 15:31
@h-vetinari
Copy link
Contributor

I saw #22261 only now, so I won't complain if this is too late, but I'd really like to get away from long, longlong, etc. I'd much prefer to just use sized integers throughout. It's a much saner model, and now that C99 is available on all platforms, we could do it (modulo the fact that LAPACK and perhaps some other venerable libs still have those types in their API 😑)

@seberg
Copy link
Member Author

seberg commented Nov 1, 2022

I am proposing to print the bit-sized version, even if imprecise sometimes. Yes, it would be nice to change it, but it seems like a big ABI change on the C-side?

@h-vetinari
Copy link
Contributor

I am proposing to print the bit-sized version, even if imprecise sometimes.

Looked more closely, looks fine.

Yes, it would be nice to change it, but it seems like a big ABI change on the C-side?

I'm afraid so, but I still think it'd be worth it eventually (numpy 2.0 material?).

Copy link
Contributor

@h-vetinari h-vetinari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There seems to be a mismatch between the implementation and the NEP w.r.t. whether printing is affected or not.

Based on the implementation, I think this PR has pretty high chance of disrupting services where something gets calculated with numpy, and then passed on in some serialized form to another consumer, where numpy (or even python) might not necessarily be installed.

Is there any fallback we could provide so that people can adapt such interfaces (e.g. .value or .raw or ...)?

doc/source/reference/arrays.classes.rst Show resolved Hide resolved
@ev-br
Copy link
Contributor

ev-br commented Nov 5, 2022

the refguide check does not seem to notice changes from 0.0 to np.float64(0.0)

It won't, as long as np.allclose(eval('np.float64(0.0)'), 0.0)

@charris
Copy link
Member

charris commented Feb 19, 2023

Needs rebase.

@pllim
Copy link
Contributor

pllim commented Jul 3, 2023

@mhvk , how do you think astropy can/should deal with this in our doctest downstream? I can only think of disabling doctest for numpy-dev and then when numpy 2.0 is released, we update the docs and disable doctest for numpy 1.x. However, this comes with the risk that we won't catch any incompatibility if a feature is only tested in the doc examples. Hope you can advise. Thanks! 🙏

Copy link
Contributor

@mhvk mhvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@seberg - I think this all makes sense now. The only slightly larger remaining comment is about repr_format and str_format -- see there.

It may be good to have a follow-up issue about remaining items. It seems to be current array2string is actually fine, but the question seems to be whether there should be some kind of option where all scalars, including void, would round-trip. If we want to support sub-arrays for void, then I think this would imply that having a round-trip version for array would be a simple extension. But it seems to me it would also be fine not to do that - the main goal, of ensuring scalars are recognizable, has been reached here. Anyway, for a separate issue, let's get this one in first!

numpy/core/arrayprint.py Show resolved Hide resolved
@@ -536,7 +544,7 @@ def _array2string(a, options, separator=' ', prefix=""):
summary_insert = ""

# find the right formatting function for the array
format_function = _get_format_function(data, **options)
format_function = _get_format_function(data=data, **options)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might as well avoid unnecessary changes...

@@ -1400,13 +1408,23 @@ def __call__(self, x):
return "({})".format(", ".join(str_fields))


def _void_scalar_repr(x):
def _void_scalar_to_string(x, is_repr=True):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the name change is fine!

normalized_name = np.dtype(f"{dtype.kind}{dtype.itemsize}").type.__name__
assert representation == f"np.{normalized_name}({repr_string})"

np.set_printoptions(legacy="1.25")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine in itself, but context manager safer:

with np.printoptions(legacy="1.25"):
...

assert repr(scalar) == representation

np.set_printoptions(legacy="1.25")
assert repr(scalar) == legacy_repr
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again would use context manager.

@@ -316,7 +316,7 @@ class TestCommaDecimalPointLocale(CommaDecimalPointLocale):

def test_repr_roundtrip_foreign(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And here too.

numpy/core/tests/test_records.py Show resolved Hide resolved
numpy/ma/tests/test_core.py Show resolved Hide resolved
numpy/ma/core.py Show resolved Hide resolved
@@ -232,6 +232,14 @@ found `here <https://github.com/numpy/numpy/pull/22449>`_
Implementation
==============

.. note::
This part has *not* been implemented with the initial changes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might as well give a PR number, for easier reference: gh-22449.

@seberg
Copy link
Member Author

seberg commented Jul 18, 2023

To be honest, I still think we will eventually go back to what I had, although hopefully it's OK to dump the special special fmt states (and improved) (and just have r and s, which void can use, whatever subarray makes of that, I don't care if it checks the printoptions and doesn't round-trip).

Structured void repr should stay integrated with array repr, I think. So any changes there need to be integrated into array2string internals (which are a mess anyway, and in need of reasonable hooks for both user dtypes and format string passing).

@seberg seberg removed 56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes 25 - WIP labels Jul 24, 2023
@seberg
Copy link
Member Author

seberg commented Jul 25, 2023

Can we push this over the finish line? Yes, my doctest changeset isn't quite there yet, but not sure it will be helpful to wait, I will expect that the worst affected projects would use the legacy= option anyway initially?

Copy link
Contributor

@mhvk mhvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@seberg - sorry, lots of other stuff came up. But yes, this is good to go as far as I'm concerned!

@charris charris merged commit de398b0 into numpy:main Jul 26, 2023
54 of 57 checks passed
@charris
Copy link
Member

charris commented Jul 26, 2023

Thanks Sebastian.

charris added a commit to charris/numpy that referenced this pull request Jul 26, 2023
In numpygh-22449 `descr` was renamed `dtype`, which broke the build as
numpygh-24211 decrefed the old variable to fix a reference leak. This
bug slipped in because numpygh-22449 was not retested after the merge.
@seberg seberg deleted the scalar-repr-change branch July 26, 2023 06:44
@larsoner
Copy link
Contributor

larsoner commented Jul 26, 2023

I can see a lot of projects hitting this (EDIT: like us)... I'm trying this workaround in mne/conftest.py in pytest_configure after reading the release notes from here:

    try:
        np.set_printoptions(legacy="1.25")
    except Exception:
        pass  # probably missing kwarg

I'll let people know in 1h if it works!

@larsoner
Copy link
Contributor

So far I had to change it to legacy=125, using a string broke stuff later (not at set_printoptions time) for released versions of NumPy. I'll update in another hour...

@larsoner
Copy link
Contributor

Looks like we're headed to green with legacy="1.25" (set only when the NumPy version is > 1.25), so the suggested workaround is all good (when deployed correctly) 👍

@seberg
Copy link
Member Author

seberg commented Jul 27, 2023

I will note that just setting the legacy mode for the tests will unfortunately hide issues if you use repr(scalar) for serialization/writing to a file (or a dependency does). So it is a good hot-fix for doctests or tests which check repr explicitly, but might cause real issues to be missed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

10 participants