Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix issue where quantify doesn't work on DataFrames containing string columns #217

Merged
merged 3 commits into from
Apr 3, 2024

Conversation

Nick-Hemenway
Copy link
Contributor

I'm hoping someone can help me a bit. I've never contributed to an open source project and am not sure what the protocol is or how to run tests so I appreciate your patience. I'd be happy to modify my request as needed.

The simple change I made enables the quantify method on a DataFrame to work even if the DataFrame contains string columns. Currently this works, but there's a unintended side effect. I've included a minimal example below to show what I mean.

import pandas as pd
import pint_pandas

df = pd.DataFrame(
    {
        'power': pd.Series([1.0,2.0,3.0], dtype='pint[W]'),
        'torque': pd.Series([4.0,5.0,6.0], dtype='pint[N*m]'),
        'fruits': pd.Series(['apple', 'pear', 'kiwi'])
    }
)
df1 = df.pint.dequantify().pint.quantify(level=-1)
print(df1.power.pint.m)

The current code would output:
2024-04-02_13-07-44

After the update I made:
2024-04-02_13-12-15

As can be seen, the current code converts the underlying data to an object data type when it should remain float64. This occurs because the current code takes the entire DataFrame and converts it to a 2D array before indexing (at which point the entire array is converted to object datatypes. My modification indexes into individual columns so that the data is never cast to a different underlying type.

  • Closes # (insert issue number)
  • Executed pre-commit run --all-files with no errors
  • The change is fully covered by automated unit tests
  • Documented in docs/ as appropriate
  • Added an entry to the CHANGES file

@andrewgsavage
Copy link
Collaborator

Nice fix!
ignore the 3.9 master failure

Add a test to pint_pandas/testsuite/test_issues.py. Make a class/function like

class TestIssue21(BaseExtensionTests):

You can use this PR# as the issue #.

the code in your post is basically the test already, just add an assert to check the two dfs are same

df = pd.DataFrame(
    {
        'power': pd.Series([1.0,2.0,3.0], dtype='pint[W]'),
        'torque': pd.Series([4.0,5.0,6.0], dtype='pint[N*m]'),
        'fruits': pd.Series(['apple', 'pear', 'kiwi'])
    }
)
df1 = df.pint.dequantify().pint.quantify(level=-1)

tm.assert_equal(df1, df)

@Nick-Hemenway
Copy link
Contributor Author

I've added a new TestIssue217 class and verified that it failed on the old code and works on the updated code. Note that it wasn't as simple as using assert_equal(df1, df) because this passed on the original code. This doesn't capture the issue highlighted in this PR because the float64 and object datatypes are hidden by the PintArray datatypes when checking if the DataFrames are equal

@andrewgsavage
Copy link
Collaborator

Looks good, you'll need to run pre-commit run --all-files to get the linter to pass

@andrewgsavage
Copy link
Collaborator

I've added a new TestIssue217 class and verified that it failed on the old code and works on the updated code. Note that it wasn't as simple as using assert_equal(df1, df) because this passed on the original code. This doesn't capture the issue highlighted in this PR because the float64 and object datatypes are hidden by the PintArray datatypes when checking if the DataFrames are equal

Out of interest, do you find the lack of visibility of the float64/object dtype making the PintArrays look the same but actually be different an issue?

@Nick-Hemenway
Copy link
Contributor Author

Nick-Hemenway commented Apr 3, 2024

Yes, I think that might be the underlying issue here (for the testing). The two DataFrames should not pass the "isequal" test but the underlying pint arrays show that they are equal even though they really aren't as the data they contain is of a different type.

Example:

df = pd.DataFrame({'a':[1,2,3], 'b':['apple', 'pear', 'kiwi']})

p1 = PintArray(df.to_numpy()[:,0], 'm') #inner numpy array contains data of type 'object'
p2 = PintArray(np.array([1,2,3], dtype=np.float64), 'm')


p1.equals(p2) #returns True when it should return False

@andrewgsavage
Copy link
Collaborator

Yea I think you're right, that's how pandas EA behaves:

>>> p1.data
<NumpyExtensionArray>
[1, 2, 3]
Length: 3, dtype: object
>>> p2.data
<NumpyExtensionArray>
[1.0, 2.0, 3.0]
Length: 3, dtype: float64

>>> p1.data==p2.data
<NumpyExtensionArray>
[True, True, True]
Length: 3, dtype: bool

>>> p1.data.equals(p2.data)
False

@andrewgsavage andrewgsavage merged commit 6646cfa into hgrecco:master Apr 3, 2024
26 of 29 checks passed
@andrewgsavage
Copy link
Collaborator

thanks for fixing that

@thaeber
Copy link
Contributor

thaeber commented Apr 23, 2024

I just stumbled upon the same problem, but for data frames that also contain numeric columns without units. The current fix only seems to fix the unwanted type conversion to dtype='object' for columns with units, while the same happens with numeric columns without units.

In the following example, the "something_else" column has numerical values, but no associated units:

import pandas as pd
import pint_pandas

df = pd.DataFrame(
    {
        'power': pd.Series([1.0, 2.0, 3.0], dtype='pint[W]'),
        'torque': pd.Series([4.0, 5.0, 6.0], dtype='pint[N*m]'),
        'fruits': pd.Series(['apple', 'pear', 'kiwi']),
        'something_else': pd.Series([7.0, 8.0, 9.0]),
    }
)
df1 = df.pint.dequantify().pint.quantify(level=-1)

print(df.dtypes)

print(df1.dtypes)

1st print statement:
grafik

2nd print statement:
grafik

This could probably be fixed by changing the "else" branch at code from else df.values[:, i] to else df.iloc[:, i], similar to the "then" branch in the current fix.

@andrewgsavage
Copy link
Collaborator

a PR to fix that is welcomed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants