Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Way to enforce "fortran" layout #15796

Closed
cosama opened this issue Apr 19, 2024 · 5 comments
Closed

Way to enforce "fortran" layout #15796

cosama opened this issue Apr 19, 2024 · 5 comments
Labels
A-interop-numpy Area: interoperability with NumPy enhancement New feature or an improvement of an existing feature

Comments

@cosama
Copy link

cosama commented Apr 19, 2024

Description

It would be great if there would be a way to convert a DataFrame and maybe even a LazyFrame, so that all columns have fortran memory outline and df.to_numpy(allow_copy=False) will always pass.

If you need to call to_numpy many times, the copy costs can add up and it would be better to convert the arrays upfront, once.

I tried to use df.rechunk() but:

import polars as pl
import numpy as np

data = np.random.default_rng().random((100, 1), dtype=float)
pl.DataFrame(dict(a=data)).rechunk().to_numpy(allow_copy=False)

fails with

RuntimeError: copy not allowed: only numeric data without nulls in Fortran-like order can be converted without copy

That might also be a bug, not sure.

@cosama cosama added the enhancement New feature or an improvement of an existing feature label Apr 19, 2024
@stinodego
Copy link
Member

stinodego commented Apr 21, 2024

Currently, rechunk only makes sure that each individual Series is contiguous in memory.

It would make sense to have a parameter on rechunk that ensures the entire dataframe is contiguous. Thanks for the suggestion!

I think you can hack around this by converting to numpy and then back. The resulting DataFrame will be contiguous in memory.

@stinodego stinodego added accepted Ready for implementation A-interop-numpy Area: interoperability with NumPy labels Apr 21, 2024
@ritchie46
Copy link
Member

ritchie46 commented Apr 21, 2024

I don't think we should want that @stinodego. That requires unsafe allocations and will be super hard to enfore throughout the engine.

And besides all much more costly than simply paying the copy at the end.

Either way there needs to be made a copy. It doesn't matter if we do it internally or when moving out to numpy. I will close this as it will have no benefit.

@stinodego stinodego removed the accepted Ready for implementation label Apr 21, 2024
@stinodego
Copy link
Member

Right, I was thinking that if you do many to_numpy calls, it would be cheaper to first convert the DataFrame to Fortran layout once, and it will save the copy in subsequent calls.

However, any operations you do on the DataFrame inbetween those calls will not guarantee that the Fortran layout is preserved. So better to just not give any guarantees about it.

@cosama
Copy link
Author

cosama commented Apr 21, 2024

Thank you so much for the quick answer. I did assume that the layout is by always fortran internally (except maybe for some weird numpy arrays), but yeah I can see how this could create some complications at other places. Makes sense to not implement it then.

By the way, why does to_numpy need to make a copy if the layout is not fortran? If you support other layouts throughout the engine, couldn't you just move it back to numpy in that format?

@ritchie46
Copy link
Member

Right, I was thinking that if you do many to_numpy calls, it would be cheaper to first convert the DataFrame to Fortran layout once, and it will save the copy in subsequent calls.

In such a case, I think people should cache their numpy array. I don't think our methods should be focussed with caching.

If you support other layouts throughout the engine, couldn't you just move it back to numpy in that format?

If we could, we would. A numpy array is backed by a single contiguous allocation. Polars DataFrames are backed by multiple buffers.

Don't worry about that copy too much. They happen implicitly all the time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-interop-numpy Area: interoperability with NumPy enhancement New feature or an improvement of an existing feature
Projects
Archived in project
Development

No branches or pull requests

3 participants