Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: NumericIndex should not support float16 dtype #49536

Merged

Conversation

topper-123
Copy link
Contributor

The whatsnew does not need updating, because NumericIndex is purely internal.

Extracted from #49494

@topper-123 topper-123 force-pushed the convert_float16_index_to_float32 branch from 06333da to 455b2a7 Compare November 5, 2022 07:45
@mroeschke mroeschke added the Index Related to the Index class or subclasses label Nov 8, 2022
@mroeschke
Copy link
Member

Will Index(..., dtype=np.float16) upcast to np.float32 in the future then too?

Makes sense from an indexing utility why float16 should be disallowed, but just as a container it could be?

In [7]: pd.Index([1], dtype=np.float16)
Out[7]: Float64Index([1.0], dtype='float64')  # Would this be Index(..., dtype=float32) in the future?

In [8]: pd.Series([1], dtype=np.float16)
Out[8]:
0    1.0
dtype: float16

cc @jbrockmendel if you have any opinions on consistency with the above example

@mroeschke mroeschke added the Dtype Conversions Unexpected or buggy dtype conversions label Nov 8, 2022
@jbrockmendel
Copy link
Member

Will Index(..., dtype=np.float16) upcast to np.float32 in the future then too?

I would much rather this raise than give you something unexpected.

More generally, it'd be nice if Index supported everything Series did

@topper-123
Copy link
Contributor Author

My reasoning for upcasting was that pandas has a history of being very permissive with inputs and doing a lot to take in everything. This is not a very strong opinion of mine, so I'm also ok with disallowing float16.

float16 is not available in khash, which was needed to make an index in the same style as the other numeric indexes, but yeah I would also preferred to not special case float16.

@mroeschke
Copy link
Member

@topper-123 would it be possible upcast to float32 when needing to dispatching to khash but keep the float16 when storing values (this may be a naive question with my gaps in indexing code).

Agreed with Brock that generally we should be more explicit with disallowing float16 in Index (and I guess Series) or supporting float16 in Index & Series (maybe upcasting internally to float32 in Index when needed?)

@topper-123 topper-123 force-pushed the convert_float16_index_to_float32 branch from 455b2a7 to f087051 Compare November 12, 2022 06:57
@topper-123
Copy link
Contributor Author

Are we sure there won't be lossiness if we use float32 in the backend for float16 indexes?

I'm thinking that e.g. unions of float16 and int8 indexes then need to convert the int8->float32 float16->float32 and then convert the result back down to float16. Could that give wrong results in some circumstances? I'm also not a super fan of the special casing needed in the code and tests to make that happen, because float16 is rarely used.

Can we instead use @jbrockmendel suggestion to disallow float16 indexes? If someone later wants to have float16 index they casn open a PR at that time?

@mroeschke
Copy link
Member

unions of float16 and int8 indexes then need to convert

I'm assuming that find_common_dtype (or the associated dtype resolution path) would just upcast to the appropriate type, but agreed that I may be underestimating edge cases where just-in-time promotions may cause issues.

I'd be okay raising a NotImplementedError for now to disallow float16

@jbrockmendel
Copy link
Member

disallowing float16 would be OK. my preference would be to implement Float16Engine in index.pyx to cast to float32 before passing things to the hashtables.

@topper-123 topper-123 force-pushed the convert_float16_index_to_float32 branch from 87f392a to fd0e959 Compare November 15, 2022 00:08
@topper-123
Copy link
Contributor Author

I've updated to raise NotImplementedError in float16 cases.

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So based on the changed test, looks like float16 was being tested and supported in some operations involving the index? IMO that signals that we should probably try to support if possible

@jbrockmendel
Copy link
Member

cdef class Float16Engine:
    def __init__(self, values):
        values = values.astype(np.float32)
        super().__init__(values)

    def get_loc(self, key):
        if isinstance(key, np.float16):
            key = key.astype(np.float32)
        return super().get_loc(key)

    def get_indexer(self, other):
        other = other.astype(np.float32)
        return super().get_indexer(other)

@topper-123
Copy link
Contributor Author

Yeah could be. I got another way using 16-bit floats in 32-bit IndexEngines in #49560 file pandas/core/indexes/numeric.py, (lines 103 and 116), so that's not a blocker ATM.

We could discuss which is better, though I do like the simplicity of mine :-).

@mroeschke
Copy link
Member

Yeah I think

    def _get_engine_target(self) -> ArrayLike:
        vals = self._values
        # pandas has no Float16Engine, so we use Float32Engine instead
        if vals.dtype == "float16":
            vals = vals.astype("float32")
        return vals

in #49560 looks reasonable (given the test passing)

@topper-123
Copy link
Contributor Author

I've looked into making Float16Engine as suggested by @jbrockmendel in order to support float16 indexes. It may not be practically possible (without a lot of work). The underlying issue is that cnumpy doesn't have a float16_t. Without a float16_t type, we can't give float16 arrays to the relevant c-level functions and will have make workarounds, which will probably be ugly and of marginal value (no one uses float16 arrays anyway).

For further discussion see this and this stackoverflow discussions. From the discussions it looks like that numpy internally converts float16 to float32 and then after the operations converts back to float16. I guess we could do that also, but that may be a project unto itself and outside of scope my current work (i.e. collecting numeric indexes in the base Index).

I'm thinking we should choose between:
1: make instantiating Index with a float16 dtype raise a NotImplementedError.
2: make instantiating Index with a float16 dtype convert to Index with a float32 dtype (maybe raise a warning also?).

It seems from the discussion above that the majority opinion is 1. Do you still hold that opinion given that numpy does not have a float16_t type?

@jbrockmendel
Copy link
Member

i guess go forward with this and ill take a stab at implementing Float16Engine

@jreback
Copy link
Contributor

jreback commented Dec 7, 2022

umm we don't support float16 for nearly anything.

would be -1 on supporting
just raise

@topper-123
Copy link
Contributor Author

topper-123 commented Dec 8, 2022

Something to consider is that calling any numpy ufunc on int8 arrays returns a float16 array e.g.

>>> import numpy as np
>>> arr = np.arange(3, dtype=npint8)
>>> np.exp(arr)
array([1.   , 2.719, 7.39 ], dtype=float16)

If we raise on float16 indexes, calling those ufunc functions on int8 indexes would also raise unless we guard against that in Index__array_ufunc__, e.g.

>>> import pandas as pd
>>> idx = pd.core.api.NumericIndex(arr)
>>> idx
NumericIndex([0, 1, 2], dtype='int8')
>>> np.exp(idx)  # what should this return?

The choices are (if we do not have a float16 index):

  1. raise NotImplentedError
  2. return NumericIndex(np.exp(arr), dtype='float32') by changing changing float 16 arrays to float32 arrays in Index.__array_ufunc__
  3. return NumericIndex(np.exp(arr), dtype='float32') by letting NumericIndex(arr, dtype='float16') return a float32 index (not raising on float16 instantiation)?

Raising on a NumericIndex(..., dtype=float16) input means either choice 1 or 2. Option 1 would be the most stringent but would fail in some cases where users might expect it to not fail, while option 2 is a bit like option 3 but without allowing directly instantiating with dtype=float16 become float32 indexes.

@topper-123 topper-123 force-pushed the convert_float16_index_to_float32 branch 2 times, most recently from 30dcd1d to aa8b577 Compare December 10, 2022 18:20
@topper-123
Copy link
Contributor Author

Any comment, especially on the issue of ufuncs on int8 indexes?

@jbrockmendel
Copy link
Member

Any comment, especially on the issue of ufuncs on int8 indexes?

We could cast float16 ufunc results to float32 before wrapping in an Index. I think we do something similar with FloatingArray.

I still think best-case is to support the same dtypes in Index that we do with Series. I've got a branch going that implements Float16Engine, have some test failures to work out.

@topper-123
Copy link
Contributor Author

We could cast float16 ufunc results to float32 before wrapping in an Index. I think we do something similar with FloatingArray.

That is what I've done in the newest version (option 2 above). If we could pull this in (and #50195) then we'd be ready to pull in #49560 also and I could proceed with removing Int64Index etc?

@topper-123 topper-123 added this to the 2.0 milestone Dec 13, 2022
@topper-123 topper-123 force-pushed the convert_float16_index_to_float32 branch from aa8b577 to f6edfac Compare December 15, 2022 17:26
@topper-123
Copy link
Contributor Author

I just rebased, just in case, and the failure looks unrelated.

Could we merge this and then the work that @jbrockmendel does in #50218 could be rebased to include this PR? This one not being merged is blocking #49560, which is an important step in the work of making the base Index take numeric dtypes.

if dtype == np.float16:
# float16 not supported (no indexing engine)
dtype = np.dtype(np.float32)
if dtype == "float16":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why disallowing the string but not the type?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'll make a new commit to fix this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’ve updated the PR.

@topper-123 topper-123 force-pushed the convert_float16_index_to_float32 branch from f6edfac to 03c5968 Compare December 20, 2022 22:30
@topper-123
Copy link
Contributor Author

The failures look unrelated to this PR.

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be okay with disallowing float16 for now, but I think this needs a whatsnew because Index(..., dtype="float16") now raises an exception when prior this worked?

@topper-123
Copy link
Contributor Author

My intention was to write the whatsnew after getting the the code PRs committed, because it's all related. Would that be ok (I'll include the float16 issue in).

@mroeschke
Copy link
Member

Sure

@topper-123
Copy link
Contributor Author

👍 Can we pull this PR in (after the CI gets working again)?

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM cc @jbrockmendel if you have any other comments

@topper-123 topper-123 force-pushed the convert_float16_index_to_float32 branch from 03c5968 to 785df81 Compare December 23, 2022 00:16
@topper-123
Copy link
Contributor Author

I’ve rebased so it would pass the CI. No other changes were made.

@topper-123
Copy link
Contributor Author

I'd like to merge this, is that ok?

@mroeschke mroeschke merged commit 38b4e96 into pandas-dev:main Dec 27, 2022
@mroeschke
Copy link
Member

Thanks @topper-123

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions Index Related to the Index class or subclasses
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: NumericIndex should not support float16 dtype
4 participants