
POC of PDEP-9 (I/O plugins) #53005

Closed
wants to merge 9 commits

Conversation

datapythonista
Member

@datapythonista datapythonista commented Apr 29, 2023

Proof of concept of what's being discussed in #51799.

An example of a plugin is implemented in https://github.com/datapythonista/lance-reader

Both together allow this code:

import pandas

pandas.load_io_plugins()

df = pandas.read_lance('sample.lance')
df.to_lance('sample_copy.lance')

if hasattr(io_plugin, "read"):
    globals()[f"read_{dataframe_io_entry_point.name}"] = io_plugin.read
if hasattr(io_plugin, "write"):
    setattr(DataFrame, f"to_{dataframe_io_entry_point.name}", io_plugin.write)
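For context, a minimal self-contained sketch of what load_io_plugins() could look like around that snippet (the "dataframe.io" group name comes from this PR; the dict-style entry_points() call is the Python <= 3.9 API, and setattr on the module stands in for the globals() assignment above):

from importlib.metadata import entry_points

import pandas
from pandas import DataFrame

def load_io_plugins():
    # Iterate over connectors registered under the "dataframe.io"
    # entry-point group and expose them on pandas.
    for dataframe_io_entry_point in entry_points().get("dataframe.io", []):
        io_plugin = dataframe_io_entry_point.load()
        if hasattr(io_plugin, "read"):
            # becomes pandas.read_<name>(...)
            setattr(pandas, f"read_{dataframe_io_entry_point.name}", io_plugin.read)
        if hasattr(io_plugin, "write"):
            # becomes DataFrame.to_<name>(...)
            setattr(DataFrame, f"to_{dataframe_io_entry_point.name}", io_plugin.write)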
Member

check that dataframe_io_entry_point.name isn't overwriting something?
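A sketch of such a guard, with raising as one possible policy (names follow the sketch above, assuming pandas and DataFrame are in scope):

def _check_not_overwriting(name):
    # Refuse to register a connector whose name would shadow an existing
    # pandas reader/writer such as read_csv or DataFrame.to_parquet.
    if hasattr(pandas, f"read_{name}") or hasattr(DataFrame, f"to_{name}"):
        raise RuntimeError(
            f"I/O plugin {name!r} would overwrite an existing pandas "
            "reader/writer and was not registered"
        )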

Member Author

Yes, good point. Probably some other checks could be useful too. So far my goal was more to show what PDEP-9 could imply in terms of code, as I don't understand all the opposition to what in my opinion is a small change with huge benefits. So I guess an MVP implementation can help understand what the implications of the PDEP are. But I fully agree we should raise if two installed packages use the same entrypoint name; iirc it's mentioned in the PDEP.

Member

nice!

to be clear, my only opposition to pdep9 was in renaming the hugely established pd.read_csv (which is also the most visited page in the docs, according to the analytics you sent me)

adding an optional plugin system like this which allows third-party authors to develop readers/writers sounds like a net positive

Member Author

Thanks for clarifying your position @MarcoGorelli.

I think that was great feedback. While it'd be nice to have a better I/O API IMHO, it's probably not worth the change, and in any case that can be discussed separately, as it's independent and adds noise to the discussion about plugins.

I was concerned about adding too much stuff to pandas' already huge namespaces, but on second thought, if we eventually move connectors like SAS or BigQuery to third-party projects, most users will probably end up having fewer connectors than now, not more.

@jbrockmendel
Member

A couple other thoughts:

  1. Do users have a way to tell which things are 3rd party? Many people don't use virtualenvs etc or keep track of what they have installed or why.

  2. When users upgrade a pandas version, can/should we warn or something on existing connectors and tell them to upgrade those too? Or maybe have the connector define a max_(tested|supported)_pandas_version?

@mroeschke
Member

A couple other thoughts:

1. Do users have a way to tell which things are 3rd party?  Many people don't use virtualenvs etc or keep track of what they have installed or why.

2. When users upgrade a pandas version, can/should we warn or something on existing connectors and tell them to upgrade those too?  Or maybe have the connector define a `max_(tested|supported)_pandas_version`?

I guess we don't really do either of these for the matplotlib backends, nor do I think pandas necessarily needs to enforce these.

@mroeschke added the IO (Data IO issues that don't fit into a more specific label) and PDEP (pandas enhancement proposal) labels on May 1, 2023
@datapythonista
Member Author

I think users knowing where things come from is a good point. Ideally we'd like things to be as little magic as possible, even for users without a deep understanding of Python. Users can iterate over entrypoints in the same way as done in this PR to see what connectors are available in their environment. Afaik entrypoints can provide the name of the module, but not the name of the pip/conda package that added them. I'd personally start simple, but depending on the feedback it could be reasonable to have something like pandas.list_connectors() or whatever to make users' lives easier.

For the versioning, I find it out of scope. Compatibility of packages should be handled at the package level (pip/conda dependencies). And if we want to get into the business of recommending that users run newer versions, it'd probably make more sense to start by recommending updates to pandas, or to numpy, when users aren't on the latest versions, not here, don't you think?

@mroeschke I'll remove the PDEP label; iirc our website will believe this is a PDEP and publish it in the list of proposals under discussion (or fail because the title doesn't start with PDEP-x).

@datapythonista removed the PDEP (pandas enhancement proposal) label on May 1, 2023
@datapythonista
Member Author

@pandas-dev/pandas-core I updated this POC to show in practice what PDEP-9 proposes regarding Arrow, which, based on the feedback received, I don't think was clear.

The current approach is that pandas expects connectors to implement pandas_reader and pandas_writer, which exchange data with pandas via a pandas DataFrame.

In case a connector doesn't implement functions that exchange a pandas DataFrame, pandas will fall back to checking whether the connector can exchange data via a PyArrow object. If PyArrow is installed (it will be, since it'll be a dependency of the connector), pandas will read the PyArrow Table from the connector, cast it to a DataFrame, and return it for the pandas.read_whatever(...) call. This design opens the door to supporting nanoarrow, Rust Arrow implementations, and others in the future, if that is useful, with minimal changes required on the pandas side.
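A rough sketch of that fallback (pandas_reader is the name described here; pyarrow_reader and the wrapper itself are illustrative):

def _wrap_reader(io_plugin):
    # Prefer a reader that returns a pandas DataFrame directly.
    if hasattr(io_plugin, "pandas_reader"):
        return io_plugin.pandas_reader
    # Otherwise fall back to a reader returning a PyArrow Table,
    # and cast its result to a pandas DataFrame.
    if hasattr(io_plugin, "pyarrow_reader"):
        def reader(*args, **kwargs):
            table = io_plugin.pyarrow_reader(*args, **kwargs)
            return table.to_pandas()
        return reader
    raise TypeError("connector provides no supported reader")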

This PyArrow fallback option adds value in at least two different ways:

  • It makes the API between connectors and pandas simpler, as an Arrow structure (a PyArrow Table) is more concise than a pandas DataFrame, which can have custom extension arrays, different backend types, and so on.
  • It makes this I/O interface available to projects that don't want to, or can't, depend on pandas, possibly increasing its adoption and bringing more projects/devs to test and help maintain connectors.

@jbrockmendel @simonjayhawkins @jreback does this seem reasonable to you? I think you were the main people who expressed concerns about using Arrow as an interface.

Finally, in this version I added a function pandas.load_io_plugins() that needs to be called to load the I/O connectors. The alternative was to load them at import pandas, which doesn't make it possible to disable loading via options, since those would be executed when connectors are already loaded. I think making this opt-in initially is a good idea, and if in a future version we want to make plugins load by default, the deprecation cycle should be trivial: warn when users call the function so they remove the call.

# the original parameters and not `*args` and `**kwargs`
@functools.wraps(original_reader)
def reader_wrapper(*args, **kwargs):
    result = original_reader(*args, **kwargs)
Member

Would it be worth for a third-party reader/writer to have a way of declaring whether we should "pre-process" a path argument? get_handle currently does a lot of heavy lifting to unify compression, text/binary, str/file-object, and local/remote files across most IO functions.
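For illustration, one hypothetical way a connector could opt in (preprocess_path is an invented flag, and the call to the private pandas.io.common.get_handle is simplified):

from pandas.io.common import get_handle

if getattr(io_plugin, "preprocess_path", False):
    # Let pandas normalize compression, str/pathlib/file objects and
    # local/remote paths before the connector sees the argument.
    def reader(path_or_buf, *args, **kwargs):
        with get_handle(path_or_buf, "rb", is_text=False, compression="infer") as handles:
            return io_plugin.read(handles.handle, *args, **kwargs)
else:
    reader = io_plugin.read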

Member Author

Very good point @twoertwein, thanks for bringing this up. I remember there is a library that does all that in a similar way to what pandas implements. I couldn't find it now, but probably the simplest option is to let connectors use that library and keep this out of the connector API. What do you think?
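(If the library I have in mind is something like fsspec, which is an assumption since I couldn't confirm it, a connector could handle paths and compression itself along these lines:)

import fsspec

# fsspec transparently handles local/remote paths and compression,
# so a connector wouldn't need pandas' internal get_handle machinery.
with fsspec.open("s3://bucket/sample.lance.gz", "rb", compression="gzip") as f:
    data = f.read()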

Member

I have no strong opinion: for end-users, it would be nice to know that every read_/to_ function accepts compression/file-handles/pathlib/str, but it would also be nice to not make this PDEP more complicated :)

@functools.wraps(original_reader)
def reader_wrapper(*args, **kwargs):
    result = original_reader(*args, **kwargs)
    if exchange_format == "pyarrow":
Member

Would it be more generic to convert result if it implements the interchange protocol instead (see the sketch after this list)? Such that

  1. If original_reader uses/returns Arrow objects, original_reader can also customize how the Arrow objects are converted to pd.DataFrame via *args, **kwargs
  2. Maybe a little more future-proof than committing to Arrow as the format?
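Something like this sketch (original_reader is the wrapper's target from the diff above; pd.api.interchange.from_dataframe is the pandas side of the protocol):

import functools

import pandas as pd

@functools.wraps(original_reader)
def reader_wrapper(*args, **kwargs):
    result = original_reader(*args, **kwargs)
    # Anything implementing __dataframe__ (e.g. a pyarrow.Table)
    # can be converted through the interchange protocol.
    if hasattr(result, "__dataframe__"):
        result = pd.api.interchange.from_dataframe(result)
    return result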

Member Author

Yes, that's a very good point @mroeschke. Both PyArrow Table and pandas DataFrame support it, so using the interchange protocol on the pandas side of this connector API seems to solve the problem in a much simpler way.

I don't think there should be performance implications in using it, but maybe it's worth checking that data is not being copied by the current implementations of the interchange protocol.

I updated this PR and the lance-reader repo to use what you suggest, please let me know if this is what you had in mind.

Member

Yes perfect! Thanks

Member

@rhshadrach rhshadrach left a comment

The alternative was to load them at import pandas, which doesn't make it possible to disable loading via options, since those would be executed when connectors are already loaded. I think making this opt-in initially is a good idea, and if in a future version we want to make plugins load by default...

I'm fine with the current implementation, but I do feel strongly against making this the default with the current behavior of raising when there is a conflict. To me, that is an entirely new level of dependency hell where, say, pip installs just fine but I can't use package X because I have packages Y and Z installed.

More generally, I think requiring users to modify their environment to use this feature should be avoided at all costs. Would it be possible to have an option specifying which ones to skip?

@datapythonista
Member Author

I'm fine with the current implementation, but I do feel strongly against making this the default with the current behavior of raising when there is a conflict. To me, that is an entirely new level of dependency hell where, say, pip installs just fine but I can't use package X because I have packages Y and Z installed.

More generally, I think requiring users to modify their environment to use this feature should be avoided at all costs. Would it be possible to have an option specifying which ones to skip?

I agree, but I don't see an easy solution to this. The current version doesn't load plugins by default, so if I understand you correctly, you're talking about a future version (e.g. 2.2 or later) where we remove pandas.load_io_plugins() and call it automatically at import pandas.

I don't think pandas should know anything about third-party connectors (let me know if you disagree with that). If that's the case, we don't have any way on the pandas side to prevent two connectors from using the same name. So, as I see things, the options if two installed packages request to use the name pandas.read_X are:

  1. We raise an exception. This will hopefully frustrate users of the connector that duplicated the name, and its authors will probably change it.
  2. We load one of them by an arbitrary criterion (the first or last to be returned by the Python entrypoints API)
  3. We rename them to something like pandas.read_X_1 and pandas.read_X_2

While none of the options is perfect, 1 is the only reasonable one to me (regardless of whether we load plugins at import or explicitly, which is probably a separate discussion). What do you think @rhshadrach?

@rhshadrach
Member

so if I understand you correctly, you're talking about a future version (e.g. 2.2 or later) where we remove pandas.load_io_plugins() and call it automatically at import pandas.

Correct. I'm +1 as long as we either require calling pandas.load_io_plugins() or find a good way to handle conflicting packages. If we require pandas.load_io_plugins(), can we also have an option to determine which one loads in case of a conflict?

  1. We raise an exception. This will hopefully frustrate users of the connector that duplicated the name, and its authors will probably change it.

As expressed in #53005 (review), I do not find this approach viable. Relying on inducing a bad experience for our users to encourage change in other packages does not seem like a good strategy to me.

  2. We load one of them by an arbitrary criterion (the first or last to be returned by the Python entrypoints API)

While not great, I think this would be okay if users had a way (e.g. a pandas option?) to change which one is loaded. Is this possible?

We rename them to something like pandas.read_X_1 and pandas.read_X_2

This could mean installing a package breaks functioning code; -1 here. Even if you leave one named pandas.read_X, I think this is still not viable. We do have somewhat similar behavior today: e.g. calling DataFrame.to_excel without specifying the engine argument also depends on what packages are installed (e.g. if you only have openpyxl vs only xlsxwriter installed). But at least I can specify engine="openpyxl", which will then guarantee the desired engine (or raise if not installed). I don't think that is possible here.

@datapythonista
Member Author

While not great, I think this would be okay if users had a way (e.g. a pandas option?) to change which one is loaded. Is this possible?

Sounds good in theory, but I fail to see how users could specify which option to use, either when loading the plugins or as an option later. If we only have the name (e.g. two implementations of pandas.read_deltalake), how can I reference one of them individually? We could add an engine value to this API, but I think that creates more problems than it solves:

  • If we can't expect packages to do the right thing and not duplicate names, why should we expect engines to be unique? We'd be in the same situation (with a more complex solution)
  • The current approach assumes all engines of a single format have the same signature, which I think is far from optimal. For example, I'd like to have a CSV reader with a schema parameter, but this is not possible in pandas since read_csv(engine='*', ...) has the same signature for all CSV readers.

The only thing that comes to mind would be to use PyPI names to enforce uniqueness. I'm not sure if the entrypoints API allows me to see the package name, or what to do with conda-forge, so it's probably a bad idea. But that's the only way I can think of.
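(For what it's worth, a sketch of how that lookup could work, assuming importlib.metadata from Python 3.10+, or the importlib_metadata backport, where each entry point carries a dist attribute:)

from importlib.metadata import entry_points

for ep in entry_points(group="dataframe.io"):
    # ep.dist is the installed distribution that registered the entry
    # point; its name should match the PyPI project name.
    print(ep.name, "provided by", ep.dist.name)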

@topper-123
Contributor

topper-123 commented May 27, 2023

A fourth option would be to warn when loading conflicting plugins, and only raise when calling the conflicting name (the name will then not be a real connector but a shim that raises an exception).

There will no doubt be situations where users want to access some plugin but not the conflicting one, so I can see @rhshadrach's point that it will be annoying to get an error in those situations, but warning first and then raising only if you actually access the name could be a way forward?

@datapythonista
Member Author

Good point. A warning sounds good too. I think it's simpler, instead of raising an exception when the function/method is accessed, to simply not create it. What do you think?

@topper-123
Contributor

I think that's also possible. The warning text should be clear about what the problem is.

One thing I like about raising errors, though, is that it gives a nice entry point into the debugger and lets you go back to where the error occurred. I'm sure that is also possible with warnings (I could google it); my point is just that the workflow for debugging errors is more natural than for debugging warnings. Or maybe that's just me, who knows.

I could accept either just a warning, or a warning plus an exception if the name is called.

@datapythonista
Member Author

@rhshadrach @topper-123 thanks a lot for the feedback, it was very useful. I updated the PR, and now it only warns when connectors conflict, and the conflicting ones are ignored (not registered). I think this is surely better, and I could get the name of the pip/conda package, so the warning message should be quite informative about what needs to be done.

I didn't try it yet, as it's not trivial, but it'd be great to know your feedback on this approach before moving forward.
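Roughly, the registration now follows this shape (a simplified sketch; _warn_conflict and loaded_plugins are the names used in the diff below, and the dict-style entry_points() call is the Python <= 3.9 API):

import warnings
from importlib.metadata import entry_points

import pandas as pd

def _warn_conflict(target, format_name, loaded_plugins, io_plugin):
    # Simplified: the real message should name the conflicting packages.
    warnings.warn(f"{target} is provided by more than one installed connector; none was registered")

loaded_plugins = {}
for dataframe_io_entry_point in entry_points().get("dataframe.io", []):
    format_name = dataframe_io_entry_point.name
    func_name = f"read_{format_name}"
    if format_name in loaded_plugins:
        # Conflict: warn and make sure no connector stays registered
        # under this name.
        _warn_conflict(f"pandas.{func_name}", format_name, loaded_plugins,
                       dataframe_io_entry_point)
        if hasattr(pd, func_name):
            delattr(pd, func_name)
        continue
    loaded_plugins[format_name] = dataframe_io_entry_point
    setattr(pd, func_name, dataframe_io_entry_point.load().read)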

Contributor

@topper-123 topper-123 left a comment

A few comments on the latest commit. I haven't read the whole PR, but will read it now.

else:
    setattr(
        pd,
        f"read_{format_name}",
Contributor

f"read_{format_name}" -> func_name.

_warn_conflict(
    f"pandas.{func_name}", format_name, loaded_plugins, io_plugin
)
delattr(pd, func_name)
Contributor

Is there any risk we'll remove something unintentionally?

Member Author

I don't think that should happen. I'll think about whether we can detect the conflicts before registering anything. It's tricky the way it's implemented now, but I think it'd be easier if we used separate entrypoints for readers and writers, which could be a good idea anyway.

If the general concept seems fine, happy to improve the implementation.

from importlib.metadata import entry_points
import importlib_metadata
Contributor

Is this an external package? Then make it an (optional) dependency?

Member Author

It's in the standard library, I think. I'll confirm, in case it was installed in the environment by some other package.

Member

looks like the stdlib one is importlib.metadata, and importlib_metadata is the backport - does it work with the stdlib one? seems it's new in py3.8 https://docs.python.org/3/library/importlib.metadata.html
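(the usual fallback pattern, for reference:)

try:
    from importlib.metadata import entry_points  # stdlib since Python 3.8
except ImportError:
    from importlib_metadata import entry_points  # backport for older Pythons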

loaded_plugins = {}

for dataframe_io_entry_point in entry_points().get("dataframe.io", []):
    format_name = dataframe_io_entry_point.name
Member

Where does this name get defined? Assuming from the name of the library itself? If so maybe worth making this a property of the class so that there is some flexibility for package authors

Member Author

This is the name of the entrypoint. Package authors define it explicitly as the name pandas will use in read_<name>; it's not used for anything else. The only constraint is that the name Dask, Vaex, Polars... will receive, if they ever use this connector API, will be the same. Personally I think that's good, but I'm not sure whether in some case the same connector would want to use different names in different libraries.
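For illustration, this is roughly how a connector declares that name with setuptools (the lance name and lance_reader module follow the example plugin; exact packaging metadata may differ):

# setup.py of the connector package
from setuptools import setup

setup(
    name="lance-reader",
    # The entry point name ("lance") becomes pandas.read_lance /
    # DataFrame.to_lance once pandas.load_io_plugins() runs.
    entry_points={"dataframe.io": ["lance = lance_reader"]},
)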

Member

Is there any validity to one package providing multiple read/write implementations? An example might be excel where one package offers read_xls alongside read_xlsx, etc...

Member Author

There shouldn't be any limitation on that.

@datapythonista
Member Author

Closing PDEP-9 as rejected, so not moving forward with this. See #51799.
