Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIX-#4577: Set attribute of Modin dataframe to updated value #4588

Merged
merged 2 commits into from Jun 22, 2022

Conversation

pyrito
Copy link
Collaborator

@pyrito pyrito commented Jun 20, 2022

Signed-off-by: Karthik Velayutham vkarthik@ponder.io

What do these changes do?

This PR addresses issue #4577 which ended up being quite a simple fix but involved a fair amount of soul-searching (thank you to @RehanSD for helping me with this on your day off). Previously, update to DataFrame columns via attribute manipulation would handle things correctly internally, but wouldn't set the attribute properly for the Modin DataFrame. As a result, if a column were to be set to a list, it would return a list when we tried to access the column by attribute (when it really should have been a pandas Series). The test included has a good example of this behavior.

@pyrito pyrito requested a review from a team as a code owner June 20, 2022 20:46
Comment on lines 2502 to 2503
# the type of value can change
value = self.__getitem__(key)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The alternative to this is to just return instead of setting the value. By doing so, __getattr__ will actually get called since no attribute is registered and it will look up our "ghost" attribute for us. This is kind of hacky though, so I thought it would be at least more readable to approach things this was.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mvashishtha tagging you here so you're aware of why I went about this way

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like current approach more, it's explicit

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in the interest of avoiding carrying extra state in the dataframe, it might be better to just return. The extra __getitem__ isn't free, and if the column names change, I wonder if the behavior would still be correct.

import modin.pandas as pd

df = pd.DataFrame([[1]], columns=['col0'])
df.col0 = [3]
df.columns = ["col1"]
print(df.col0)  # this will still work, but shouldn't
print(df.col1)  # this will also work, and it should

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to @devin-petersohn . I think this fix also still suffers from the bug I pointed out here. If we mutate col0 in place in the example above using iloc, I think we won't change df.col0.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank for solving my issue

Copy link
Collaborator

@naren-ponder naren-ponder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pyrito Can you add a release note to this PR?

@codecov
Copy link

codecov bot commented Jun 20, 2022

Codecov Report

Merging #4588 (c360515) into master (c360515) will not change coverage.
The diff coverage is n/a.

❗ Current head c360515 differs from pull request most recent head 1323b30. Consider uploading reports for the commit 1323b30 to get more accurate results

@@           Coverage Diff           @@
##           master    #4588   +/-   ##
=======================================
  Coverage   86.59%   86.59%           
=======================================
  Files         228      228           
  Lines       18420    18420           
=======================================
  Hits        15950    15950           
  Misses       2470     2470           

📣 Codecov can now indicate which changes are the most critical in Pull Requests. Learn more

@vnlitvinov
Copy link
Collaborator

@pyrito I've fixed PR message to explicitly list it resolves the issue you mention in a free form.
GitHub is not smart enough to read your message entirely, but, when some keywords are used, it would understand that this PR is linked to that issue, and it would auto-close the issue when the PR is merged.

Copy link
Collaborator

@vnlitvinov vnlitvinov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks great and simple, thanks @pyrito!

Comment on lines 2502 to 2503
# the type of value can change
value = self.__getitem__(key)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like current approach more, it's explicit

Comment on lines 260 to 291
# Use case from issue #4577
pandas_df = pandas.DataFrame([[1]], columns=["col0"])
modin_df = pd.DataFrame([[1]], columns=["col0"])

pandas_df.col0 = [3]
modin_df.col0 = [3]

pandas_df.col0.ffill()
modin_df.col0.ffill()
df_equals(modin_df, pandas_df)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest turning it into a separate test

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, a new test would be good

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, I kept it as part of this because I felt like it is testing similar code paths, but I can do that as well.

Comment on lines 2502 to 2503
# the type of value can change
value = self.__getitem__(key)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to @devin-petersohn . I think this fix also still suffers from the bug I pointed out here. If we mutate col0 in place in the example above using iloc, I think we won't change df.col0.

modin/pandas/test/dataframe/test_iter.py Outdated Show resolved Hide resolved
@pyrito pyrito force-pushed the fix/FIX-4577 branch 3 times, most recently from 8cdc31d to a99c505 Compare June 21, 2022 15:11
Copy link
Collaborator

@mvashishtha mvashishtha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! I left some style comments. This is a subtle method, so we should document the code and tests well for Modin developers :)

modin/pandas/test/dataframe/test_iter.py Outdated Show resolved Hide resolved
modin/pandas/test/dataframe/test_iter.py Outdated Show resolved Hide resolved
modin/pandas/test/dataframe/test_iter.py Show resolved Hide resolved
modin/pandas/test/dataframe/test_iter.py Show resolved Hide resolved
modin/pandas/test/dataframe/test_iter.py Outdated Show resolved Hide resolved
modin/pandas/test/dataframe/test_iter.py Show resolved Hide resolved
modin/pandas/dataframe.py Outdated Show resolved Hide resolved
modin/pandas/test/dataframe/test_iter.py Outdated Show resolved Hide resolved
@pyrito pyrito force-pushed the fix/FIX-4577 branch 4 times, most recently from aae480c to 69e6e33 Compare June 21, 2022 16:15
@@ -258,6 +259,24 @@ def test___setattr__():
df_equals(modin_df, pandas_df)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# While `new_col` is not a column of the dataframe,
# it should be accessible with __getattr__.
assert modin_df.new_col.equals(pandas_df.new_col)

This is not strictly part of the bug, but we should check that we're setting non-column attributes correctly after your fix.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think equals would work here since these are list types, so I just used the == operator which should work fine for these lists I think.

Copy link
Collaborator

@mvashishtha mvashishtha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have one minor style suggestion.

modin_df.col0 = pd.Series([5])
modin_df.loc[0, "col0"] = 4

assert modin_df.col0.equals(modin_df["col0"])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
assert modin_df.col0.equals(modin_df["col0"])
df_equals(modin_df, pandas_df)
assert modin_df.col0.equals(pandas_df.col0)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did this and a bit more

mvashishtha
mvashishtha previously approved these changes Jun 21, 2022
Copy link
Collaborator

@mvashishtha mvashishtha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for responding to all the comments.

Copy link
Collaborator

@RehanSD RehanSD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM but suggested one edit for the tests!

pandas_df = pandas.DataFrame([[1]], columns=["col0"])
modin_df = pd.DataFrame([[1]], columns=["col0"])

# Replacing a column with a list should mutate the column in place.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should add a test here ensuring that we can do something like ffill on df.col0 after setting it to a list

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's just check that pandas_df.col0.equals(modin_df.col0). That will check that the types are series. That's what we care about checking

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had that before, but got rid of it due to @mvashishtha 's suggestion. You can see it here: #4588 (comment)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense!

RehanSD
RehanSD previously approved these changes Jun 21, 2022
Copy link
Collaborator

@RehanSD RehanSD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@RehanSD
Copy link
Collaborator

RehanSD commented Jun 21, 2022

Looks like CI is failing because of #4589. Once #4590 is merged, can you rebase off of master? CI should be green then and I'll merge!

Karthik Velayutham added 2 commits June 21, 2022 17:25
Copy link
Collaborator

@RehanSD RehanSD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@RehanSD RehanSD merged commit 8efd8f7 into modin-project:master Jun 22, 2022
RehanSD pushed a commit that referenced this pull request Jun 24, 2022
Signed-off-by: Karthik Velayutham <vkarthik@ponder.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pd.Series.ffill() raise the error: AttributeError: 'numpy.ndarray' object has no attribute 'ffill'
7 participants