-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
181 migrate pandas usage to copy on write and remove copy arg from transforms #197
181 migrate pandas usage to copy on write and remove copy arg from transforms #197
Conversation
… now redundant test in base tests
There are some merge conflicts with I had a look through the repo - where is |
Consider hooking into #191? |
…d-remove-copy-arg-from-transforms
…py-arg-from-transforms' of https://github.com/lvgig/tubular into 181-migrate-pandas-usage-to-copy-on-write-and-remove-copy-arg-from-transforms merge remote and local changes fixing conflicts
I've pulled in the latest changes and fixed the merge conflicts with main. I've also included the profiling scripts. The above screenshots are from me pulling down the code and running profiling as it was easier to compare to the current main branch. But I've included it here in order to see what is being tested. I don't know if we want to include these files going forward or not |
…ate-pandas-usage-to-copy-on-write-and-remove-copy-arg-from-transforms
…r init before deprecation
Very nice - any puts us in a good place to tackle #185 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved. Discussion in thread above.
Set
pd.options.mode.copy_on_write = True
. This will become default in pandas 3.0.0 so it is recommended to turn it on now to get the benefits & ensure code is compliant with the upcoming changes (https://pandas.pydata.org/docs/dev/user_guide/copy_on_write.html). CoW stops making copies of the entire df unecessarily, instead just copying columns that change. It also prevents the original df changing when a view of the object is altered (see link for more info) & should limit the incorrect setwithcopy warnings. However it will still alter the originaldf
in the following snippet, as it is not creating a view of the originaldf
df_temp = df
. Therefore, to prevent this I have enforced views to make use of CoW with the following:df = df[df.columns]
As the copy argument is no longer needed, I have added a warning to the copy argument of the base transformer, informing that it will soon be deprecated and updated the tests to drop this argument. I have not updated the example nbs as the notebooks were already generating errors and needed updating with the latest changes - I will open up a separate issue for this.
CoW was only added as optional in pandas 1.5.0 so I have also updated the requirements for tubular enforcing this new lower limit.
Profiling of updated changes (using same approach as in pyarrow tubular benchmarking repo)
Current results:
Proposed results:
There is a very marginal increase in the overall time to transform the df, however making these changes will allow us to implement faster options going forward e.g. polars