181 migrate pandas usage to copy on write and remove copy arg from transforms #197

lsumption · 2024-03-12T09:06:20Z

Set pd.options.mode.copy_on_write = True. Copy-on-Write (CoW) will become the default in pandas 3.0.0, so it is recommended to turn it on now to get the benefits and ensure the code is compliant with the upcoming changes (https://pandas.pydata.org/docs/dev/user_guide/copy_on_write.html). CoW stops making unnecessary copies of the entire df and instead copies only the columns that change. It also prevents the original df from changing when a view of the object is altered (see link for more info) and should limit the incorrect SettingWithCopyWarning messages.

However, the original df will still be altered after a plain assignment such as df_temp = df, because this does not create a view of the original df. To prevent this, I have enforced views so that CoW applies, using df = df[df.columns].
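To illustrate the difference described above, here is a minimal sketch (not the code in this PR). The frame contents and the variable names are made up for the example, and it assumes a pandas version that has the copy_on_write option (1.5.0 or later).

```python
import pandas as pd

# Opt in to Copy-on-Write (needs pandas >= 1.5.0; planned default in 3.0.0).
pd.options.mode.copy_on_write = True

# Case 1: plain assignment only binds a second name to the same object,
# so there is no view for CoW to protect.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df_temp = df
df_temp.loc[0, "a"] = 99
alias_changed_original = bool(df.loc[0, "a"] == 99)

# Case 2: selecting all columns gives a new DataFrame object, so a later
# write goes through CoW and the original is left untouched.
original = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df_new = original[original.columns]
df_new.loc[0, "a"] = 99
selection_changed_original = bool(original.loc[0, "a"] == 99)

print(alias_changed_original, selection_changed_original)
```

In the first case the write is visible through both names; in the second case only df_new changes.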

As the copy argument is no longer needed, I have added a warning to the copy argument of the base transformer stating that it will soon be deprecated, and updated the tests to drop this argument. I have not updated the example notebooks, as they were already generating errors and need updating with the latest changes - I will open up a separate issue for this.
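A rough sketch of what such a warning could look like is below. SketchTransformer, the message text and the choice of FutureWarning are all assumptions for illustration; this is not the actual tubular base transformer.

```python
import warnings


class SketchTransformer:
    """Hypothetical stand-in for a transformer base class (not tubular's code)."""

    def __init__(self, columns, copy=None):
        # None means "argument not supplied"; any explicit value,
        # including False, triggers the warning.
        if copy is not None:
            warnings.warn(
                "The copy argument is no longer used and will be deprecated "
                "in a future release.",
                FutureWarning,  # assumed category, chosen so end users see it
                stacklevel=2,  # point the warning at the caller's line
            )
        self.columns = columns
```

Calling SketchTransformer(columns=["a"], copy=False) emits the warning, while SketchTransformer(columns=["a"]) stays silent. Checking against None rather than truthiness means an explicit copy=False is also caught.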

CoW was only added as an option in pandas 1.5.0, so I have also updated tubular's requirements to enforce this new lower limit.

Profiling of updated changes (using same approach as in pyarrow tubular benchmarking repo)
Current results: [screenshot]

Proposed results: [screenshot]

There is a very marginal increase in the overall time to transform the df. However, making these changes will allow us to implement faster options going forward, e.g. polars.
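For anyone wanting to reproduce this kind of comparison locally, a minimal sketch is below. It is not the profiling script used for the results above: the two functions, the frame size and the profile helper are hypothetical, and it only compares an eager .copy() against the df[df.columns] selection on a single column update.

```python
import time
import tracemalloc

import numpy as np
import pandas as pd

pd.options.mode.copy_on_write = True


def with_explicit_copy(df):
    out = df.copy()  # explicit copy before modifying
    out["a"] = out["a"] * 2
    return out


def with_cow_selection(df):
    out = df[df.columns]  # new object, relies on CoW for isolation
    out["a"] = out["a"] * 2
    return out


def profile(func, df, repeats=5):
    """Return (best wall time in seconds, peak traced memory in bytes)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(df)
        best = min(best, time.perf_counter() - start)
    # Measure memory in a separate run so tracing does not skew the timings.
    tracemalloc.start()
    func(df)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return best, peak


rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50_000, 10)), columns=list("abcdefghij"))

results = {
    func.__name__: profile(func, df)
    for func in (with_explicit_copy, with_cow_selection)
}
for name, (seconds, peak_bytes) in results.items():
    print(f"{name}: {seconds:.6f}s, peak {peak_bytes} bytes")
```

The numbers depend heavily on frame size, dtypes and pandas version, so this is only a starting point for a more thorough benchmark.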

adamsardar · 2024-03-12T11:14:58Z

There are some merge conflicts with main now - and, importantly, CI now runs ruff rather than black/bandit. ruff format . or ruff . --fix should sort most things out.

I had a look through the repo - where is profile_tubular.py? I think it's interesting that the time in the profiling script went up - my naive play with a couple of transformers and setting the copy flag to false cut compute time roughly in half and decreased memory allocations by 70% or so. It might be worth kicking more thorough profiling to #182?

adamsardar · 2024-03-12T11:16:13Z

I have not updated the example nbs as the notebooks were already generating errors and needed updating with the latest changes - I will open up a separate issue for this.

Consider hooking into #191?

tubular/base.py

lsumption · 2024-03-12T13:41:54Z

I've pulled in the latest changes and fixed the merge conflicts with main. I've also included the profiling scripts. The above screenshots are from me pulling down the code and running profiling as it was easier to compare to the current main branch. But I've included it here in order to see what is being tested. I don't know if we want to include these files going forward or not

tubular/base.py

tubular/dates.py

tests/comparison/test_EqualityChecker.py

…ate-pandas-usage-to-copy-on-write-and-remove-copy-arg-from-transforms

…r init before deprecation

adamsardar · 2024-03-22T16:23:58Z

Very nice - and puts us in a good place to tackle #185

adamsardar

Approved. Discussion in thread above.

lsumption added 7 commits March 4, 2024 15:13

drop .copy() from dates & imputations

54a62c8

move copy_on_write to base

8a997ff

drop copy line, still need to drop args & update tests

2d36773

set up copy argument in base transformer for future deprecation. Drop…

9a1aba5

… now redundant test in base tests

drop copy from tests & super init calls

c5efe74

tidy up commented code

d4ade8d

update min panads version for CoW

cf39025

lsumption linked an issue Mar 12, 2024 that may be closed by this pull request

Migrate pandas usage to Copy-on-Write and remove copy arg from transforms #181

Closed

adamsardar reviewed Mar 12, 2024

tubular/base.py (outdated, resolved)

lsumption added 6 commits March 12, 2024 12:08

add profiling

fcc1f8a

Merge branch 'main' into 181-migrate-pandas-usage-to-copy-on-write-an…

6736d35

…d-remove-copy-arg-from-transforms

ruff profiling

bc2ff11

Merge branch '181-migrate-pandas-usage-to-copy-on-write-and-remove-co…

eadce38

…py-arg-from-transforms' of https://github.com/lvgig/tubular into 181-migrate-pandas-usage-to-copy-on-write-and-remove-copy-arg-from-transforms merge remote and local changes fixing conflicts

ruff again on all files

42e6191

check copy none rather than int. Drop unnecessary comment

67401d7