Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Uniform file IO API and consolidated codebase #15008

Open
25 tasks
dhimmel opened this issue Dec 29, 2016 · 9 comments
Open
25 tasks

Uniform file IO API and consolidated codebase #15008

dhimmel opened this issue Dec 29, 2016 · 9 comments
Labels
API - Consistency Internal Consistency of API/Behavior Enhancement IO Data IO issues that don't fit into a more specific label Refactor Internal refactoring of code

Comments

@dhimmel
Copy link
Contributor

dhimmel commented Dec 29, 2016

There are at least three things that many of the IO methods must deal with: reading from URL, reading/writing to a compressed format, and different text encodings. It would be great if all io functions where these factors were relevant could use the same code (consolidated codebase) and expose the same options (uniform API).

In #14576, we consolidated the codebase but more consolidation is possible. In io.common.py, there are three functions that must be sequentially called to get a file-like object: get_filepath_or_buffer, _infer_compression, and _get_handle. This should be consolidated into a single function, which can then delegate to sub functions.

Currently, pandas supports the following io methods. First for reading:

  • read_csv
  • read_excel
  • read_hdf
  • read_feather
  • read_sql
  • read_json
  • read_msgpack (experimental)
  • read_html
  • read_gbq (experimental)
  • read_stata
  • read_sas
  • read_clipboard
  • read_pickle

And then for writing:

  • to_csv
  • to_excel
  • to_hdf
  • to_feather
  • to_sql
  • to_json
  • to_msgpack (experimental)
  • to_html
  • to_gbq (experimental)
  • to_stata
  • to_clipboard
  • to_pickle

Some of these should definitely use the consilidated/uniform API, such as read_csv, read_html, read_pickle, read_excel.

Some functions perhaps should be kept separate, such as read_feather or read_clipboard.

@dhimmel
Copy link
Contributor Author

dhimmel commented Dec 29, 2016

Here are my thoughts on the API.

  • Read methods should support the following compression methods: None, 'infer', 'gzip', 'bz2', 'xz', 'zip'. Xref ENH: add gzip/bz2 compression to read_pickle() (and perhaps other read_*() methods) #11666

  • Write methods should support the following compression methods: None, 'infer', 'gzip', 'bz2', 'xz' (no zip since it's perhaps bad practice).

  • We may want to support both long and short compression names. Currently, you specify gzip not gz, but bz2 not bzip2.

  • Read methods should support reading from a path, buffer, or URL.

  • Write methods should support writing to a path or buffer.

  • Textual payloads should support the encoding argument

  • Iterator interface should be consistent (support chunksize)

Regarding the consolidated codebase:

  • I'd favor greater separation of the code for 2 and 3. This way when pandas becomes 3-only, the entire 2 sections can be deleted.

@jorisvandenbossche jorisvandenbossche added Clean IO Data IO issues that don't fit into a more specific label labels Dec 29, 2016
@jorisvandenbossche
Copy link
Member

That sounds great!
If you would like to work towards this, that would be very welcome.

Regarding the py2/py3 separation, I think we should just do what is most practical here (having a certain separation makes the code more clear, too much separation can make it more complex again. In any case, having a few but scattered if PY2 statements are also rather easy to delete). But if all related code is contained in io/common.py, it should not be too difficult to find a good balance in that one file here.

One more consolidation that would be possible for read_csv is between the python and c engine. I think the c engine still has its own logic for handling compression, while I do not think this is needed to be in the cython/c code (I don't think this is the performance sensitive part?)

@dhimmel
Copy link
Contributor Author

dhimmel commented Dec 29, 2016

If you would like to work towards this, that would be very welcome.

Let's wait for #13317 and any other IO PRs that I don't know about to be merged. I'm hesitant to commit since I know it will cut into my other obligations. But if no one else is interested in implementing, I'll consider.

I think we should just do what is most practical here

Totally agree. There are still a few things I need to understand before I can make that call. One issue is mode in _get_handle, which currently is poorly documented. Presumably this could include t for text or b for bytes, which will have some interactions with Py 2 or 3.

I think the c engine still has its own logic for handling compression, while I do not think this is needed to be in the cython/c code

Agree the c engine implementation should be consolidated, unless there is a major performance issue. But the duplicated functionality with _get_handle appears not to be c optimized (I'm not sure as I don't know cython).

@jreback jreback added the Master Tracker High level tracker for similar issues label Dec 29, 2016
@jreback
Copy link
Contributor

jreback commented Dec 29, 2016

@dhimmel can you annotate the above (or maybe make it a table)

add an x/check if supports pathlib like things / compression / url

@goldenbull
Copy link
Contributor

agree! 👍
I'm now working on #13317 and found _get_handle a bit complex to understand.
_get_handle needs to deal with varies situations:

  • py2 or py3
  • binary (pickle, msgpack) or text (csv)
  • if text, what's the encoding
  • compression
  • memory map
  • open for read or write

It seems to be better to spilt _get_handle into two or more functions to make each single function simpler

@jreback jreback added this to the High Level Issue Tracking milestone Sep 24, 2017
@TomAugspurger TomAugspurger removed the Master Tracker High level tracker for similar issues label Jul 6, 2018
@TomAugspurger TomAugspurger removed this from the High Level Issue Tracking milestone Jul 6, 2018
@jreback jreback added this to the 0.24.0 milestone Jul 8, 2018
@jreback
Copy link
Contributor

jreback commented Jul 8, 2018

@gfyoung can you evaluate this issue, e.g. close, tick boxes, etc.

@gfyoung gfyoung modified the milestones: 0.24.0, Contributions Welcome Jul 8, 2018
@gfyoung
Copy link
Member

gfyoung commented Jul 8, 2018

@jreback : This looks to be a much more substantial refactoring at the moment. The checkboxes were more of an enumeration of methods instead of actual tasks AFAICT.

@VelizarVESSELINOV
Copy link

Request for API consistency between to_sql and to_gbq:
.to_sql(index=True...)
vs
.to_gbq(no option index is ignored all the time)

Desired solution:

  1. To have the same option in both functions
  2. To have the same default value

Do you prefer having a separate ticket?

@dhimmel
Copy link
Contributor Author

dhimmel commented Sep 27, 2022

Do you prefer having a separate ticket?

Yes, the index parameter is outside the scope of this issue, which is focused on specifying the input data location and the corresponding compression.

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API - Consistency Internal Consistency of API/Behavior Enhancement IO Data IO issues that don't fit into a more specific label Refactor Internal refactoring of code
Projects
No open projects
Development

No branches or pull requests

9 participants