FIX-#4577: Set attribute of Modin dataframe to updated value #4588

pyrito · 2022-06-20T20:46:52Z

Signed-off-by: Karthik Velayutham vkarthik@ponder.io

What do these changes do?

This PR addresses issue #4577, which ended up being quite a simple fix but involved a fair amount of soul-searching (thank you to @RehanSD for helping me with this on your day off). Previously, updating a DataFrame column via attribute assignment handled things correctly internally, but didn't set the attribute properly on the Modin DataFrame. As a result, if a column was set to a list, accessing the column by attribute returned a list (when it really should have been a Series). The included test has a good example of this behavior.
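
To make the failure mode concrete, here is a minimal pure-Python sketch of the pattern described above. `ToySeries`, `ToyFrame`, and the `fix_applied` flag are hypothetical stand-ins invented for illustration; they are not Modin's classes, and the real `__setattr__` likely does more than this.

```python
class ToySeries:
    """Hypothetical stand-in for a Series; only what the demo needs."""

    def __init__(self, values):
        self.values = list(values)

    def ffill(self):
        # Replace each None with the most recent non-None value.
        filled, last = [], None
        for v in self.values:
            if v is not None:
                last = v
            filled.append(last)
        return ToySeries(filled)


class ToyFrame:
    """Hypothetical stand-in for a DataFrame (not Modin's implementation)."""

    def __init__(self, data, fix_applied):
        # Use object.__setattr__ so setup does not go through our override.
        columns = {name: ToySeries(vals) for name, vals in data.items()}
        object.__setattr__(self, "_columns", columns)
        object.__setattr__(self, "_fix_applied", fix_applied)

    def __getitem__(self, key):
        return self._columns[key]

    def __setitem__(self, key, value):
        # The internal update always converts the input to a ToySeries.
        self._columns[key] = ToySeries(value)

    def __setattr__(self, key, value):
        if key in self._columns:
            self[key] = value
            if self._fix_applied:
                # The type of value can change, so re-read it.
                value = self[key]
        object.__setattr__(self, key, value)


buggy = ToyFrame({"col0": [1]}, fix_applied=False)
buggy.col0 = [3]
print(type(buggy.col0).__name__)      # list: has no ffill()
print(type(buggy["col0"]).__name__)   # ToySeries: internal data is fine

fixed = ToyFrame({"col0": [1]}, fix_applied=True)
fixed.col0 = [3]
print(fixed.col0.ffill().values)      # [3]
```

Without the re-read, the raw list is what gets stored on the instance, which matches the `AttributeError` on `ffill` reported in the issue.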

commit message follows format outlined here
passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
signed commit with git commit -s
Resolves pd.Series.ffill() raise the error: AttributeError: 'numpy.ndarray' object has no attribute 'ffill' #4577
tests added and passing
module layout described at docs/development/architecture.rst is up-to-date
added (Issue Number: PR title (PR Number)) and github username to release notes for next major release

pyrito · 2022-06-20T20:47:55Z

modin/pandas/dataframe.py

+            # the type of value can change
+            value = self.__getitem__(key)


The alternative to this is to just return instead of setting the value. By doing so, __getattr__ will actually get called since no attribute is registered and it will look up our "ghost" attribute for us. This is kind of hacky though, so I thought it would be at least more readable to approach things this way.
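
A rough sketch of that "just return" alternative, again in plain Python. `GhostFrame` and `ToySeries` are hypothetical names made up for this example and say nothing about how Modin structures its classes; the only point is that `__getattr__` runs when normal attribute lookup fails.

```python
class ToySeries:
    """Hypothetical stand-in for a Series."""

    def __init__(self, values):
        self.values = list(values)


class GhostFrame:
    """Hypothetical sketch of the 'return early' approach (not Modin code)."""

    def __init__(self, data):
        columns = {name: ToySeries(vals) for name, vals in data.items()}
        object.__setattr__(self, "_columns", columns)

    def __getattr__(self, key):
        # Python calls this only when normal attribute lookup fails.
        columns = object.__getattribute__(self, "_columns")
        if key in columns:
            return columns[key]
        raise AttributeError(key)

    def __setattr__(self, key, value):
        if key in self._columns:
            self._columns[key] = ToySeries(value)
            return  # nothing is stored on the instance itself
        object.__setattr__(self, key, value)


df = GhostFrame({"col0": [1]})
df.col0 = [3]
print("col0" in vars(df))         # False: no instance attribute was registered
print(type(df.col0).__name__)     # ToySeries, resolved through __getattr__
print(df.col0.values)             # [3]
```

Because nothing is cached on the instance, every `df.col0` access goes back to the column store.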

@mvashishtha tagging you here so you're aware of why I went about this way

I like current approach more, it's explicit

I think in the interest of avoiding carrying extra state in the dataframe, it might be better to just return. The extra __getitem__ isn't free, and if the column names change, I wonder if the behavior would still be correct.

import modin.pandas as pd
df = pd.DataFrame([[1]], columns=['col0'])
df.col0 = [3]
df.columns = ["col1"]
print(df.col0) # this will still work, but shouldn't
print(df.col1) # this will also work, and it should

+1 to @devin-petersohn . I think this fix also still suffers from the bug I pointed out here. If we mutate col0 in place in the example above using iloc, I think we won't change df.col0.
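
The staleness concern can be illustrated with a toy model. This is only a sketch under stated assumptions: `CachingFrame`, `rename`, and `set_cell` are hypothetical, tuples stand in for the converted column type, and column data is assumed to be immutable so that every update builds a new object. It does not claim to reproduce Modin's actual behavior.

```python
class CachingFrame:
    """Hypothetical toy that caches a column on the instance when it is set
    by attribute (not Modin code)."""

    def __init__(self, data):
        columns = {name: tuple(vals) for name, vals in data.items()}
        object.__setattr__(self, "_columns", columns)

    def __getitem__(self, key):
        return self._columns[key]

    def __getattr__(self, key):
        # Fallback used only when no instance attribute exists.
        columns = object.__getattribute__(self, "_columns")
        if key in columns:
            return columns[key]
        raise AttributeError(key)

    def __setattr__(self, key, value):
        if key in self._columns:
            self._columns[key] = tuple(value)
            value = self._columns[key]  # re-read the converted value
        object.__setattr__(self, key, value)  # cached copy lives on the instance

    def rename(self, old, new):
        # Hypothetical helper standing in for assigning new column labels.
        self._columns[new] = self._columns.pop(old)

    def set_cell(self, column, row, value):
        # Assumption: column data is immutable, so an update builds a new object.
        cells = list(self._columns[column])
        cells[row] = value
        self._columns[column] = tuple(cells)


df = CachingFrame({"col0": [1]})
df.col0 = [3]

df.set_cell("col0", 0, 4)
print(df.col0)      # (3,): the cached attribute is stale
print(df["col0"])   # (4,): the column store has the new value

df.rename("col0", "col1")
print(df.col0)      # (3,): still resolves, although the column is gone
print(df.col1)      # (4,): resolved through __getattr__
```

In this toy, the cached attribute and the column store drift apart as soon as the data changes through any path other than attribute assignment.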

Thanks for solving my issue

naren-ponder

@pyrito Can you add a release note to this PR?

codecov · 2022-06-20T21:07:29Z

Codecov Report

Merging #4588 (c360515) into master (c360515) will not change coverage.
The diff coverage is n/a.

❗ Current head c360515 differs from pull request most recent head 1323b30. Consider uploading reports for the commit 1323b30 to get more accurate results

@@           Coverage Diff           @@
##           master    #4588   +/-   ##
=======================================
  Coverage   86.59%   86.59%           
=======================================
  Files         228      228           
  Lines       18420    18420           
=======================================
  Hits        15950    15950           
  Misses       2470     2470

vnlitvinov · 2022-06-21T07:57:22Z

@pyrito I've edited the PR message to explicitly state that it resolves the issue, which you had only mentioned in free form.
GitHub is not smart enough to parse your whole message, but when certain keywords are used, it understands that the PR is linked to that issue and auto-closes the issue when the PR is merged.

vnlitvinov

Overall looks great and simple, thanks @pyrito!

vnlitvinov · 2022-06-21T07:59:03Z

modin/pandas/dataframe.py

+            # the type of value can change
+            value = self.__getitem__(key)


I like current approach more, it's explicit

vnlitvinov · 2022-06-21T07:59:49Z

modin/pandas/test/dataframe/test_iter.py

+    # Use case from issue #4577
+    pandas_df = pandas.DataFrame([[1]], columns=["col0"])
+    modin_df = pd.DataFrame([[1]], columns=["col0"])
+
+    pandas_df.col0 = [3]
+    modin_df.col0 = [3]
+
+    pandas_df.col0.ffill()
+    modin_df.col0.ffill()
+    df_equals(modin_df, pandas_df)
+


I would suggest turning it into a separate test

+1, a new test would be good

Sounds good, I kept it as part of this because I felt like it is testing similar code paths, but I can do that as well.

mvashishtha · 2022-06-21T13:11:16Z

modin/pandas/dataframe.py

+            # the type of value can change
+            value = self.__getitem__(key)


+1 to @devin-petersohn . I think this fix also still suffers from the bug I pointed out here. If we mutate col0 in place in the example above using iloc, I think we won't change df.col0.

mvashishtha

This looks great! I left some style comments. This is a subtle method, so we should document the code and tests well for Modin developers :)

mvashishtha · 2022-06-21T16:52:45Z

modin/pandas/test/dataframe/test_iter.py

@@ -258,6 +259,24 @@ def test___setattr__():
    df_equals(modin_df, pandas_df)



Suggested change

# While `new_col` is not a column of the dataframe,
# it should be accessible with __getattr__.
assert modin_df.new_col.equals(pandas_df.new_col)

This is not strictly part of the bug, but we should check that we're setting non-column attributes correctly after your fix.

I don't think equals would work here since these are list types, so I just used the == operator which should work fine for these lists I think.
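
A quick check of that point in plain Python. The list values here are hypothetical, since the values used in the actual test are not shown in this thread.

```python
# Hypothetical values; the real test's values are not shown here.
expected_new_col = [1, 2, 3]
actual_new_col = [1, 2, 3]

# Built-in lists have no `equals` method, so calling it would raise
# AttributeError.
print(hasattr(actual_new_col, "equals"))    # False

# `==` on lists compares length and then each element in order.
print(actual_new_col == expected_new_col)   # True
```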

mvashishtha

I have one minor style suggestion.

mvashishtha · 2022-06-21T19:09:34Z

modin/pandas/test/dataframe/test_iter.py

+    modin_df.col0 = pd.Series([5])
+    modin_df.loc[0, "col0"] = 4
+
+    assert modin_df.col0.equals(modin_df["col0"])


Suggested change

assert modin_df.col0.equals(modin_df["col0"])
df_equals(modin_df, pandas_df)
assert modin_df.col0.equals(pandas_df.col0)

Did this and a bit more

mvashishtha

LGTM! Thanks for responding to all the comments.

RehanSD

LGTM but suggested one edit for the tests!

RehanSD · 2022-06-21T20:36:27Z

modin/pandas/test/dataframe/test_iter.py

+    pandas_df = pandas.DataFrame([[1]], columns=["col0"])
+    modin_df = pd.DataFrame([[1]], columns=["col0"])
+
+    # Replacing a column with a list should mutate the column in place.


I think we should add a test here ensuring that we can do something like ffill on df.col0 after setting it to a list

Let's just check that pandas_df.col0.equals(modin_df.col0). That will check that the types are series. That's what we care about checking

I had that before, but got rid of it due to @mvashishtha 's suggestion. You can see it here: #4588 (comment)

Makes sense!

RehanSD

LGTM!

RehanSD · 2022-06-21T21:08:38Z

Looks like CI is failing because of #4589. Once #4590 is merged, can you rebase off of master? CI should be green then and I'll merge!

RehanSD

LGTM!

pyrito requested a review from a team as a code owner June 20, 2022 20:46

pyrito commented Jun 20, 2022

pyrito requested review from mvashishtha and RehanSD June 20, 2022 20:48

naren-ponder reviewed Jun 20, 2022

pyrito force-pushed the fix/FIX-4577 branch from dcbe5b8 to c6cd36d Compare June 20, 2022 21:07

vnlitvinov reviewed Jun 21, 2022

mvashishtha reviewed Jun 21, 2022

pyrito force-pushed the fix/FIX-4577 branch 3 times, most recently from 8cdc31d to a99c505 Compare June 21, 2022 15:11

pyrito requested review from mvashishtha and devin-petersohn June 21, 2022 15:15

devin-petersohn reviewed Jun 21, 2022

pyrito force-pushed the fix/FIX-4577 branch from a99c505 to 94b332b Compare June 21, 2022 15:43

devin-petersohn reviewed Jun 21, 2022

mvashishtha reviewed Jun 21, 2022

pyrito force-pushed the fix/FIX-4577 branch 4 times, most recently from aae480c to 69e6e33 Compare June 21, 2022 16:15

mvashishtha reviewed Jun 21, 2022

pyrito force-pushed the fix/FIX-4577 branch from 69e6e33 to 74bd22d Compare June 21, 2022 17:16

mvashishtha reviewed Jun 21, 2022

pyrito requested a review from devin-petersohn June 21, 2022 19:25

mvashishtha previously approved these changes Jun 21, 2022

RehanSD requested changes Jun 21, 2022

RehanSD previously approved these changes Jun 21, 2022

Karthik Velayutham added 2 commits June 21, 2022 17:25

FIX-modin-project#4577: Set attribute of Modin dataframe to updated v…

9ba47d4

…alue Signed-off-by: Karthik Velayutham <vkarthik@ponder.io>

Added a few more checks in test

1323b30

pyrito dismissed stale reviews from RehanSD and mvashishtha via 1323b30 June 21, 2022 22:25

pyrito force-pushed the fix/FIX-4577 branch from 230cbbe to 1323b30 Compare June 21, 2022 22:25

RehanSD approved these changes Jun 21, 2022

RehanSD merged commit 8efd8f7 into modin-project:master Jun 22, 2022

RehanSD pushed a commit that referenced this pull request Jun 24, 2022

FIX-#4577: Set attribute of Modin dataframe to updated value (#4588)

87efb34

Signed-off-by: Karthik Velayutham <vkarthik@ponder.io>
