Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

181 migrate pandas usage to copy on write and remove copy arg from transforms #197

Conversation

lsumption
Copy link
Contributor

Set pd.options.mode.copy_on_write = True. This will become default in pandas 3.0.0 so it is recommended to turn it on now to get the benefits & ensure code is compliant with the upcoming changes (https://pandas.pydata.org/docs/dev/user_guide/copy_on_write.html). CoW stops making copies of the entire df unecessarily, instead just copying columns that change. It also prevents the original df changing when a view of the object is altered (see link for more info) & should limit the incorrect setwithcopy warnings. However it will still alter the original df in the following snippet, as it is not creating a view of the original df df_temp = df. Therefore, to prevent this I have enforced views to make use of CoW with the following: df = df[df.columns]

As the copy argument is no longer needed, I have added a warning to the copy argument of the base transformer, informing that it will soon be deprecated and updated the tests to drop this argument. I have not updated the example nbs as the notebooks were already generating errors and needed updating with the latest changes - I will open up a separate issue for this.

CoW was only added as optional in pandas 1.5.0 so I have also updated the requirements for tubular enforcing this new lower limit.

Profiling of updated changes (using same approach as in pyarrow tubular benchmarking repo)
Current results:
image
Proposed results:
image
There is a very marginal increase in the overall time to transform the df, however making these changes will allow us to implement faster options going forward e.g. polars

@adamsardar
Copy link
Contributor

There are some merge conflicts with main now - and importantly CI with ruff rather than black/bandit. ruff format . or ruff . --fix should sort most things out.

I had a look through the repo - where is profile_tubular.py? I think it interesting that the profiling script went up - my naive play with a couple of transformers and setting the copy flag to false cut compute time in roughly half and decreased memory allocations by 70% or so. It might be worth kicking more thorough profiling to #182?

@adamsardar
Copy link
Contributor

I have not updated the example nbs as the notebooks were already generating errors and needed updating with the latest changes - I will open up a separate issue for this.

Consider hooking into #191?

tubular/base.py Outdated Show resolved Hide resolved
@lsumption
Copy link
Contributor Author

I've pulled in the latest changes and fixed the merge conflicts with main. I've also included the profiling scripts. The above screenshots are from me pulling down the code and running profiling as it was easier to compare to the current main branch. But I've included it here in order to see what is being tested. I don't know if we want to include these files going forward or not

tubular/base.py Outdated Show resolved Hide resolved
@adamsardar
Copy link
Contributor

Very nice - any puts us in a good place to tackle #185

Copy link
Contributor

@adamsardar adamsardar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved. Discussion in thread above.

@lsumption lsumption merged commit 0f0f744 into main Mar 25, 2024
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Migrate pandas usage to Copy-on-Write and remove copy arg from transforms
2 participants