Allow for custom datasources as plugins #74
Comments
For my thesis I am currently looking at how I can hook an existing backend query service into Polars to use the Lazy DataFrame API. This, however, would need to be passed from the Rust side to the Python side, as the use case is aimed at Data Scientists / ML Engineers working in Python. From what I gathered, it unfortunately seems to be impossible to do so right now, so I want to +1 this issue, as this would in general open up a lot of possibilities for the Polars ecosystem! |
Can anyone suggest how to work around this limitation? That is, how can I "extend polars" to support scanning my custom file formats? I looked at https://github.com/universalmind303/polars-mongo, which seems clean and straightforward, but suffers from the same limitation as in #67. |
You might be able to scan your custom file formats using fsspec. Here's an example: https://csvbase.com/blog/7. |
@hantusk this looks very cool and potentially a good workaround. Thanks for sharing! |
This is something I want to get into too. But it needs to be more than a trait, as we want to get over FFI. On the Rust side there is already |
Hi @ritchie46, I've been using the newly released IO plugins and it works well, thank you. I have a question regarding
Is it before or after the predicate is applied? In this context, what's the meaning of "materialize"? Thanks again for implementing this! |
Wow, you are quick. I am still working on the example. :D |
Here is the working example; https://github.com/pola-rs/pyo3-polars/tree/main/example/io_plugin |
@ritchie46 thank you. I understand from this that
def _read_my_format_impl(path: str, ...) -> pl.DataFrame: ...

def scan_my_format(paths, ...) -> pl.LazyFrame:
    def _read_my_format(with_columns, predicate, n_rows, batch_size):
        for path in paths:
            df = _read_my_format_impl(path, columns=with_columns, n_rows=n_rows)
            if predicate is not None:
                df = df.filter(predicate)
            yield df
            if n_rows is not None:
                n_rows -= df.height  # <-- is this legit?
                if n_rows <= 0:
                    break
    return register_io_source(callable=_read_my_format, schema=...) |
Maybe. You are not allowed to return more than |
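For anyone following the n_rows question above, here is a minimal sketch of the row-budget bookkeeping on its own, in plain Python with no Polars dependency. It assumes the limit is meant as a cap on the total number of rows across all yielded batches; that reading is an assumption, not something confirmed in this thread. Plain lists stand in for DataFrames, and the name read_with_row_budget is hypothetical. It does not settle whether the limit should be counted before or after the predicate is applied.

```python
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def read_with_row_budget(
    batches: Iterable[List[T]], n_rows: Optional[int]
) -> Iterator[List[T]]:
    """Yield batches while never emitting more than n_rows rows in total.

    Assumption: n_rows is a cap on the combined size of all yielded
    batches. None means "no limit".
    """
    remaining = n_rows
    for batch in batches:
        if remaining is not None:
            if remaining <= 0:
                # Budget already spent: stop before reading further sources.
                break
            # Trim the batch so the running total cannot exceed the cap.
            batch = batch[:remaining]
            remaining -= len(batch)
        yield batch


# Three "files" holding 3, 3 and 1 rows; ask for at most 4 rows in total.
files = [[1, 2, 3], [4, 5, 6], [7]]
limited = list(read_with_row_budget(files, n_rows=4))
```

Compared with the snippet in the question, the difference is that the batch is trimmed before it is yielded, so the total stays within the cap even if a reader hands back more rows than requested.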
The current system makes it pretty easy to add new transformations (exprs) as plugins, but there is currently no good way for users to provide custom datasources.
Ideally, custom datasources should be as easy as implementing a trait or macro. There is already the AnonymousScan trait that mostly works for this use case, but it doesn't work via pyo3-polars due to (de)serialization issues (see #67). Maybe we can have an FFI equivalent instead of the in-memory AnonymousScan?
If we loosely base it off of datafusion's TableProvider, it may look something like this
Related issues
#67