BUG: Add temporary struct col in pyspark backend to ensure that UDFs are executed once #2657
Conversation
6d0c322 to 30b05c3
jreback: test for this? only on spark?
icexelloss: @jreback Added tests. The change is for Spark only; the new tests cover both the pandas and Spark backends.
```python
    output_type=dt.Struct(['col1', 'col2'], [dt.double, dt.double]),
)
def add_one_struct_exact_once(v):
    print(v)
```
jreback: extra print
icexelloss: Aha, my bad. Removed.
```python
print(v)
key = v.iloc[0]
path = Path(f"{tempdir}/{key}")
assert not path.exists()
```
jreback: why are you writing things?
icexelloss: This is basically to create a side effect. If this function runs a second time, it hits the side effect and fails.
jreback: ok cool (may want to document this in the future)
```python
from distutils.version import LooseVersion  # import needed for the check below

import pyspark

if LooseVersion(pyspark.__version__) < LooseVersion("3.1.1"):
    ...  # body elided in the diff
```
jreback: hmm, not sure I love this here. Can you create a pytest decorator and use it here, similar to this: https://github.com/pandas-dev/pandas/blob/master/pandas/util/_test_decorators.py#L200 (you can use skipif, but you have to do the import too)
icexelloss: Ha, let me try
icexelloss: Added a `min_spark_version` marker.
jreback: thanks @icexelloss
What change is proposed
Spark sometimes executes the UDF multiple times if the struct column it returns is destructured directly. This change ensures that the struct column is always assigned to the Spark DataFrame before being destructured, so the UDF executes exactly once.
How is this tested
Added a test verifying exactly-once execution in test_vectorized_udf.py.