FIX-#4577: Set attribute of Modin dataframe to updated value #4588
modin/pandas/dataframe.py
Outdated
# the type of value can change
value = self.__getitem__(key)
The alternative to this is to just return instead of setting the value. By doing so, __getattr__ will actually get called since no attribute is registered, and it will look up our "ghost" attribute for us. This is kind of hacky though, so I thought it would be at least more readable to approach things this way.
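For readers unfamiliar with the "just return" alternative, here is a minimal sketch of the mechanism. ToyFrame is a hypothetical stand-in, not Modin's DataFrame, and its tuple-returning __getitem__ merely plays the role of "always hand back the column type". The only thing it relies on is Python's rule that __getattr__ runs when normal attribute lookup fails.

```python
class ToyFrame:
    """Hypothetical stand-in for a dataframe; not Modin's actual class."""

    def __init__(self, columns):
        # Bypass our own __setattr__ so _data becomes a real instance attribute.
        object.__setattr__(self, "_data", dict(columns))

    def __getitem__(self, key):
        # Toy "Series": always hand back a tuple, whatever was stored.
        return tuple(self._data[key])

    def __setitem__(self, key, value):
        self._data[key] = list(value)

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails.
        data = object.__getattribute__(self, "_data")
        if name in data:
            return self[name]
        raise AttributeError(name)

    def __setattr__(self, name, value):
        if name in self._data:
            # Update the column and return without registering an attribute.
            self[name] = value
            return
        object.__setattr__(self, name, value)


df = ToyFrame({"col0": [1]})
df.col0 = [3]
print(type(df.col0).__name__)  # tuple
print("col0" in vars(df))      # False
```

Because nothing named col0 is ever stored on the instance, every df.col0 access goes through __getattr__ and reflects the current column.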
@mvashishtha tagging you here so you're aware of why I went about this way
I like current approach more, it's explicit
I think in the interest of avoiding carrying extra state in the dataframe, it might be better to just return. The extra __getitem__ isn't free, and if the column names change, I wonder if the behavior would still be correct.
import modin.pandas as pd
df = pd.DataFrame([[1]], columns=['col0'])
df.col0 = [3]
df.columns = ["col1"]
print(df.col0) # this will still work, but shouldn't
print(df.col1) # this will also work, and it should
+1 to @devin-petersohn. I think this fix also still suffers from the bug I pointed out here. If we mutate col0 in place in the example above using iloc, I think we won't change df.col0.
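Both concerns (renamed columns and in-place mutation) can be illustrated with a toy that carries the extra state. CachingFrame and its rename helper are hypothetical, not Modin code, and the sketch assumes the stored attribute is a snapshot of the column rather than a live view.

```python
class CachingFrame:
    """Hypothetical toy that snapshots the column as an instance attribute."""

    def __init__(self, columns):
        object.__setattr__(self, "_data", dict(columns))

    def rename(self, old, new):
        # Rename a column in the underlying storage only.
        self._data[new] = self._data.pop(old)

    def __getattr__(self, name):
        # Only reached when no real attribute with this name exists.
        data = object.__getattribute__(self, "_data")
        if name in data:
            return data[name]
        raise AttributeError(name)

    def __setattr__(self, name, value):
        if name in self._data:
            self._data[name] = list(value)
            # Extra state: store a snapshot of the column as a real attribute.
            object.__setattr__(self, name, list(self._data[name]))
            return
        object.__setattr__(self, name, value)


df = CachingFrame({"col0": [1]})
df.col0 = [3]
df._data["col0"][0] = 4         # in-place write, standing in for an iloc update
stale_after_mutation = df.col0  # the snapshot, not the updated column
df.rename("col0", "col1")
stale_after_rename = df.col0    # still resolves although the column is gone
print(stale_after_mutation, stale_after_rename, df.col1)  # [3] [3] [4]
```

Once a real attribute shadows the column, __getattr__ is never consulted for that name again, so the attribute and the column can drift apart.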
Thanks for solving my issue
@pyrito Can you add a release note to this PR?
Codecov Report
@@           Coverage Diff           @@
##           master    #4588   +/-   ##
=======================================
  Coverage   86.59%   86.59%
=======================================
  Files         228      228
  Lines       18420    18420
=======================================
  Hits        15950    15950
  Misses       2470     2470
@pyrito I've edited the PR message so that it explicitly lists the issue it resolves, which you had mentioned only in free form.
Overall looks great and simple, thanks @pyrito!
# Use case from issue #4577
pandas_df = pandas.DataFrame([[1]], columns=["col0"])
modin_df = pd.DataFrame([[1]], columns=["col0"])

pandas_df.col0 = [3]
modin_df.col0 = [3]

pandas_df.col0.ffill()
modin_df.col0.ffill()
df_equals(modin_df, pandas_df)
I would suggest turning it into a separate test
+1, a new test would be good
Sounds good, I kept it as part of this because I felt like it is testing similar code paths, but I can do that as well.
8cdc31d to a99c505
This looks great! I left some style comments. This is a subtle method, so we should document the code and tests well for Modin developers :)
aae480c to 69e6e33
@@ -258,6 +259,24 @@ def test___setattr__():
df_equals(modin_df, pandas_df)
# While `new_col` is not a column of the dataframe,
# it should be accessible with __getattr__.
assert modin_df.new_col.equals(pandas_df.new_col)
This is not strictly part of the bug, but we should check that we're setting non-column attributes correctly after your fix.
I don't think equals would work here since these are list types, so I just used the == operator, which should work fine for these lists, I think.
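A quick sanity check of that point in plain Python, with no pandas or Modin involved (the variable names are made up for illustration): a list has no equals method, while == compares lists element by element.

```python
# Hypothetical values standing in for a non-column attribute set on each frame.
modin_side = [1, 2, 3]
pandas_side = [1, 2, 3]

print(hasattr(modin_side, "equals"))  # False: plain lists have no .equals
print(modin_side == pandas_side)      # True: element-wise list comparison
```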
I have one minor style suggestion.
modin_df.col0 = pd.Series([5])
modin_df.loc[0, "col0"] = 4

assert modin_df.col0.equals(modin_df["col0"])
assert modin_df.col0.equals(modin_df["col0"])
df_equals(modin_df, pandas_df)
assert modin_df.col0.equals(pandas_df.col0)
Did this and a bit more
LGTM! Thanks for responding to all the comments.
LGTM but suggested one edit for the tests!
pandas_df = pandas.DataFrame([[1]], columns=["col0"])
modin_df = pd.DataFrame([[1]], columns=["col0"])

# Replacing a column with a list should mutate the column in place.
I think we should add a test here ensuring that we can do something like ffill on df.col0 after setting it to a list.
Let's just check that pandas_df.col0.equals(modin_df.col0). That will check that the types are series. That's what we care about checking.
I had that before, but got rid of it due to @mvashishtha 's suggestion. You can see it here: #4588 (comment)
Makes sense!
LGTM!
…alue Signed-off-by: Karthik Velayutham <vkarthik@ponder.io>
LGTM!
Signed-off-by: Karthik Velayutham vkarthik@ponder.io
What do these changes do?
This PR addresses issue #4577, which ended up being quite a simple fix but involved a fair amount of soul-searching (thank you to @RehanSD for helping me with this on your day off). Previously, updating a DataFrame column via attribute assignment would handle things correctly internally, but wouldn't set the attribute properly on the Modin DataFrame. As a result, if a column was set to a list, accessing the column by attribute would return a list (when it really should have been a pandas Series). The included test has a good example of this behavior.
flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
git commit -s
docs/development/architecture.rst is up-to-date