Fix issue where quantify doesn't work on DataFrames containing string columns #217

Nick-Hemenway · 2024-04-02T18:18:20Z

I'm hoping someone can help me a bit. I've never contributed to an open source project and am not sure what the protocol is or how to run tests so I appreciate your patience. I'd be happy to modify my request as needed.

The simple change I made enables the quantify method on a DataFrame to work even if the DataFrame contains string columns. Currently this works, but there's a unintended side effect. I've included a minimal example below to show what I mean.

import pandas as pd
import pint_pandas

df = pd.DataFrame(
    {
        'power': pd.Series([1.0,2.0,3.0], dtype='pint[W]'),
        'torque': pd.Series([4.0,5.0,6.0], dtype='pint[N*m]'),
        'fruits': pd.Series(['apple', 'pear', 'kiwi'])
    }
)
df1 = df.pint.dequantify().pint.quantify(level=-1)
print(df1.power.pint.m)

The current code would output:

After the update I made:

As can be seen, the current code converts the underlying data to an object data type when it should remain float64. This occurs because the current code takes the entire DataFrame and converts it to a 2D array before indexing (at which point the entire array is converted to object datatypes. My modification indexes into individual columns so that the data is never cast to a different underlying type.

Closes # (insert issue number)
Executed pre-commit run --all-files with no errors
The change is fully covered by automated unit tests
Documented in docs/ as appropriate
Added an entry to the CHANGES file

andrewgsavage · 2024-04-02T21:36:42Z

Nice fix!
ignore the 3.9 master failure

Add a test to pint_pandas/testsuite/test_issues.py. Make a class/function like

pint-pandas/pint_pandas/testsuite/test_issues.py

Line 55 in f9d3a66

class TestIssue21(BaseExtensionTests):

You can use this PR# as the issue #.

the code in your post is basically the test already, just add an assert to check the two dfs are same

df = pd.DataFrame(
    {
        'power': pd.Series([1.0,2.0,3.0], dtype='pint[W]'),
        'torque': pd.Series([4.0,5.0,6.0], dtype='pint[N*m]'),
        'fruits': pd.Series(['apple', 'pear', 'kiwi'])
    }
)
df1 = df.pint.dequantify().pint.quantify(level=-1)

tm.assert_equal(df1, df)

Nick-Hemenway · 2024-04-02T22:50:56Z

I've added a new TestIssue217 class and verified that it failed on the old code and works on the updated code. Note that it wasn't as simple as using assert_equal(df1, df) because this passed on the original code. This doesn't capture the issue highlighted in this PR because the float64 and object datatypes are hidden by the PintArray datatypes when checking if the DataFrames are equal

andrewgsavage · 2024-04-02T22:58:31Z

Looks good, you'll need to run pre-commit run --all-files to get the linter to pass

andrewgsavage · 2024-04-02T23:03:48Z

I've added a new TestIssue217 class and verified that it failed on the old code and works on the updated code. Note that it wasn't as simple as using assert_equal(df1, df) because this passed on the original code. This doesn't capture the issue highlighted in this PR because the float64 and object datatypes are hidden by the PintArray datatypes when checking if the DataFrames are equal

Out of interest, do you find the lack of visibility of the float64/object dtype making the PintArrays look the same but actually be different an issue?

Nick-Hemenway · 2024-04-03T13:02:50Z

Yes, I think that might be the underlying issue here (for the testing). The two DataFrames should not pass the "isequal" test but the underlying pint arrays show that they are equal even though they really aren't as the data they contain is of a different type.

Example:

df = pd.DataFrame({'a':[1,2,3], 'b':['apple', 'pear', 'kiwi']})

p1 = PintArray(df.to_numpy()[:,0], 'm') #inner numpy array contains data of type 'object'
p2 = PintArray(np.array([1,2,3], dtype=np.float64), 'm')


p1.equals(p2) #returns True when it should return False

andrewgsavage · 2024-04-03T13:15:15Z

Yea I think you're right, that's how pandas EA behaves:

>>> p1.data
<NumpyExtensionArray>
[1, 2, 3]
Length: 3, dtype: object
>>> p2.data
<NumpyExtensionArray>
[1.0, 2.0, 3.0]
Length: 3, dtype: float64

>>> p1.data==p2.data
<NumpyExtensionArray>
[True, True, True]
Length: 3, dtype: bool

>>> p1.data.equals(p2.data)
False

andrewgsavage · 2024-04-03T13:16:13Z

thanks for fixing that

thaeber · 2024-04-23T11:02:50Z

I just stumbled upon the same problem, but for data frames that also contain numeric columns without units. The current fix only seems to fix the unwanted type conversion to dtype='object' for columns with units, while the same happens with numeric columns without units.

In the following example, the "something_else" column has numerical values, but no associated units:

import pandas as pd
import pint_pandas

df = pd.DataFrame(
    {
        'power': pd.Series([1.0, 2.0, 3.0], dtype='pint[W]'),
        'torque': pd.Series([4.0, 5.0, 6.0], dtype='pint[N*m]'),
        'fruits': pd.Series(['apple', 'pear', 'kiwi']),
        'something_else': pd.Series([7.0, 8.0, 9.0]),
    }
)
df1 = df.pint.dequantify().pint.quantify(level=-1)

print(df.dtypes)

print(df1.dtypes)

1st print statement:

2nd print statement:

This could probably be fixed by changing the "else" branch at code from else df.values[:, i] to else df.iloc[:, i], similar to the "then" branch in the current fix.

andrewgsavage · 2024-04-23T22:22:54Z

a PR to fix that is welcomed.

update values to iloc to avoid casting array to obj type

7d0037a

add test for PR217

7e48e5e

andrewgsavage mentioned this pull request Apr 2, 2024

Eqaulity checks when magnitudes have different dtypes #218

Open

run pre-commit tests to pass lint test

838a716

andrewgsavage merged commit 6646cfa into hgrecco:master Apr 3, 2024
26 of 29 checks passed

thaeber mentioned this pull request Apr 24, 2024

Fix issue where quantify changes the type of arrays to obj if the DataFrame containes string columns #225

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix issue where quantify doesn't work on DataFrames containing string columns #217

Fix issue where quantify doesn't work on DataFrames containing string columns #217

Nick-Hemenway commented Apr 2, 2024

andrewgsavage commented Apr 2, 2024

Nick-Hemenway commented Apr 2, 2024

andrewgsavage commented Apr 2, 2024

andrewgsavage commented Apr 2, 2024

Nick-Hemenway commented Apr 3, 2024 •

edited

Loading

andrewgsavage commented Apr 3, 2024

andrewgsavage commented Apr 3, 2024

thaeber commented Apr 23, 2024

andrewgsavage commented Apr 23, 2024

Fix issue where quantify doesn't work on DataFrames containing string columns #217

Fix issue where quantify doesn't work on DataFrames containing string columns #217

Conversation

Nick-Hemenway commented Apr 2, 2024

andrewgsavage commented Apr 2, 2024

Nick-Hemenway commented Apr 2, 2024

andrewgsavage commented Apr 2, 2024

andrewgsavage commented Apr 2, 2024

Nick-Hemenway commented Apr 3, 2024 • edited Loading

andrewgsavage commented Apr 3, 2024

andrewgsavage commented Apr 3, 2024

thaeber commented Apr 23, 2024

andrewgsavage commented Apr 23, 2024

Nick-Hemenway commented Apr 3, 2024 •

edited

Loading