Pandas sort_values with multiple columns does not work for AffineScalarFunc #186

NelDav · 2024-01-12T12:07:01Z

When sorting a pandas DataFrame by multiple columns where one of the columns contains values of type AffineScalarFunc, the sort fails because AffineScalarFunc has no __hash__ method.

TypeError: unhashable type: 'AffineScalarFunc'

The Variable type implements a __hash__ method. Therefore, sorting works with this type:

>>> import pandas as pd
>>> from uncertainties import ufloat
>>> a = pd.DataFrame([
... [ufloat(2, 0.021), 8],
... [ufloat(3, 0.002), 7],
... [ufloat(1, 0.001), 9]])
>>> a
                 0  1
0    2.000+/-0.021  8
1  3.0000+/-0.0020  7
2  1.0000+/-0.0010  9

>>> a.sort_values(by=[0,1])
                 0  1
2  1.0000+/-0.0010  9
0    2.000+/-0.021  8
1  3.0000+/-0.0020  7

As soon as we start calculating, the DataFrame no longer contains values of type Variable but of type AffineScalarFunc.
Because of that, sorting by multiple columns no longer works:

>>> a[0] = a[0] * ufloat(1, 0.01)
>>> a
               0  1
0  2.000+/-0.029  8
1  3.000+/-0.030  7
2  1.000+/-0.010  9

>>> a.sort_values(by=[0,1])
.
.
.
TypeError: unhashable type: 'AffineScalarFunc'

To enable this functionality, 'AffineScalarFunc' must be hashable.

I think this would be possible by implementing something like this:

def __hash__(self):
    ids = [id(d) for d in self.derivatives.keys()]
    return hash((self._nominal_value, self._linear_part, tuple(ids)))

I think the derivative ids must be part of the hash so that the hash depends on the derivatives.
Additionally, I think the nominal value and the linear part must also be part of the hash, to ensure different hashes when a value with uncertainty is multiplied by a regular float:

k = ufloat(3, 0.0021)
u = k * 2
hash(k) != hash(u)  # this should be the case, right?
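
The contract behind this question can be illustrated with a toy model. The sketch below does not use the uncertainties library: ToyVariable and ToyAffine are hypothetical stand-ins, and the choice to identify variables by id() simply mirrors the proposal above. Python only requires that objects comparing equal share a hash; unequal objects such as k and u will normally get different hashes, but that is not guaranteed.

```python
class ToyVariable:
    """Hypothetical stand-in for an independent variable with uncertainty."""

    def __init__(self, nominal, std_dev):
        self.nominal = nominal
        self.std_dev = std_dev


class ToyAffine:
    """Hypothetical linear expression: a nominal value plus one derivative per variable."""

    def __init__(self, nominal, derivatives):
        self.nominal = nominal
        self.derivatives = dict(derivatives)  # {ToyVariable: coefficient}

    def _key(self):
        # Variables are identified by id(), mirroring the proposal above.
        terms = frozenset((id(var), coeff) for var, coeff in self.derivatives.items())
        return (self.nominal, terms)

    def __eq__(self, other):
        if not isinstance(other, ToyAffine):
            return NotImplemented
        return self._key() == other._key()

    def __hash__(self):
        # Built from the same key as __eq__, so equal objects hash equal.
        return hash(self._key())

    def __mul__(self, factor):
        scaled = {var: coeff * factor for var, coeff in self.derivatives.items()}
        return ToyAffine(self.nominal * factor, scaled)


x = ToyVariable(3.0, 0.0021)
k = ToyAffine(3.0, {x: 1.0})
k_again = ToyAffine(3.0, {x: 1.0})  # same value, same variable
u = k * 2                           # different nominal value and derivative

print(k == k_again, hash(k) == hash(k_again))
print(k == u)
print(len({k, k_again, u}))
```

Deriving __eq__ and __hash__ from the same key is what keeps the two consistent; whether the real class can do the same depends on how its equality is defined.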

wshanks · 2024-04-08T18:04:05Z

I thought #184 might help with this, but after playing around with the ExtensionArray API I came to the conclusion that it cannot. I opened pandas-dev/pandas#58182 regarding that. I feel that if the ExtensionArray subclass knows how to sort its items, it should not be necessary for the items to be hashable. However, the current implementation relies on hashability to track which items are equivalent (sorting by multiple columns only makes sense if some items are equivalent; otherwise you could just sort by a single column). There would need to be an API that sorts with equivalence: like argsort, but assigning the same index to equivalent items instead of choosing an order among them (so the output is not a one-to-one reordering of range(len(array))).
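
To make the idea of "argsort with equivalence" concrete, here is a minimal pure-Python sketch. It is not a pandas API: rank_with_ties and the compare callback are hypothetical names, and single-element lists stand in for unhashable items. Equivalent items receive the same rank using only ordering comparisons, and the ranks then serve as the first key of a multi-column sort.

```python
from functools import cmp_to_key


def rank_with_ties(items, compare):
    """Return one rank per item; equivalent items share a rank.

    `compare(a, b)` returns a negative number, zero, or a positive number.
    Only comparisons are used, so the items do not need to be hashable.
    """
    order = sorted(
        range(len(items)),
        key=cmp_to_key(lambda i, j: compare(items[i], items[j])),
    )
    ranks = [0] * len(items)
    rank = 0
    for pos, idx in enumerate(order):
        # Start a new rank only when this item differs from the previous one.
        if pos > 0 and compare(items[order[pos - 1]], items[idx]) != 0:
            rank += 1
        ranks[idx] = rank
    return ranks


# Unhashable stand-ins for the first column; plain ints for the second.
col0 = [[1.0], [2.0], [1.0]]
col1 = [9, 8, 7]

ranks = rank_with_ties(col0, lambda a, b: (a[0] > b[0]) - (a[0] < b[0]))
row_order = sorted(range(len(col0)), key=lambda i: (ranks[i], col1[i]))

print(ranks)      # [0, 1, 0]
print(row_order)  # [2, 0, 1]
```

Rows 0 and 2 tie on the first column, so the second column decides their order, which is exactly the case where sorting by multiple columns matters.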

NelDav mentioned this issue Jan 22, 2024

Add hash function for AffineScalarFunc class #189

Closed
