Uniform file IO API and consolidated codebase #15008

dhimmel · 2016-12-29T14:26:48Z

dhimmel · 2016-12-29T14:36:13Z

Here are my thoughts on the API.

Read methods should support the following compression methods: None, 'infer', 'gzip', 'bz2', 'xz', 'zip'. Xref ENH: add gzip/bz2 compression to read_pickle() (and perhaps other read_*() methods) #11666
Write methods should support the following compression methods: None, 'infer', 'gzip', 'bz2', 'xz' (no zip since it's perhaps bad practice).
We may want to support both long and short compression names. Currently, you specify gzip not gz, but bz2 not bzip2.
Read methods should support reading from a path, buffer, or URL.
Write methods should support writing to a path or buffer.
Textual payloads should support the encoding argument.
The iterator interface should be consistent (support chunksize).
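To make the 'infer' option concrete, here is a minimal sketch of how a shared helper might resolve the compression argument for every read and write method. The name resolve_compression and the extension table are assumptions made for illustration; this is not the pandas implementation.

```python
import os

# Hypothetical extension table, built from the compression names listed above.
_EXTENSION_TO_COMPRESSION = {
    ".gz": "gzip",
    ".bz2": "bz2",
    ".xz": "xz",
    ".zip": "zip",
}
_VALID_COMPRESSION = {None, "infer", "gzip", "bz2", "xz", "zip"}


def resolve_compression(path_or_buffer, compression="infer"):
    """Return the concrete compression method, or None for no compression."""
    if compression not in _VALID_COMPRESSION:
        raise ValueError("Unrecognized compression: {!r}".format(compression))
    if compression != "infer":
        # An explicit choice (including None) always wins over inference.
        return compression
    if not isinstance(path_or_buffer, (str, os.PathLike)):
        # A buffer has no file name, so there is nothing to infer from.
        return None
    extension = os.path.splitext(os.fspath(path_or_buffer))[1].lower()
    return _EXTENSION_TO_COMPRESSION.get(extension)
```

With a helper like this, a call such as resolve_compression("data.csv.gz") would yield 'gzip', while a path with an unrecognized extension or an in-memory buffer would fall back to no compression.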

Regarding the consolidated codebase:

I'd favor greater separation of the code for Python 2 and 3. This way, when pandas becomes 3-only, the Python 2 sections can be deleted in their entirety.

jorisvandenbossche · 2016-12-29T15:32:36Z

That sounds great!
If you would like to work towards this, that would be very welcome.

Regarding the py2/py3 separation, I think we should just do what is most practical here. Some separation makes the code clearer, but too much separation can make it more complex again, and a few scattered if PY2 statements are also rather easy to delete. If all related code is contained in io/common.py, it should not be too difficult to find a good balance in that one file.

One more consolidation that would be possible for read_csv is between the python and c engines. I think the c engine still has its own logic for handling compression, while I do not think this needs to be in the cython/c code (I don't think this is the performance-sensitive part?).

dhimmel · 2016-12-29T15:46:39Z

> If you would like to work towards this, that would be very welcome.

Let's wait for #13317 and any other IO PRs that I don't know about to be merged. I'm hesitant to commit since I know it will cut into my other obligations. But if no one else is interested in implementing, I'll consider.

> I think we should just do what is most practical here

Totally agree. There are still a few things I need to understand before I can make that call. One issue is mode in _get_handle, which currently is poorly documented. Presumably this could include t for text or b for bytes, which will have some interactions with Py 2 or 3.
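As a small illustration of the t/b question on Python 3 only, the sketch below reads the same gzip payload through the standard library's gzip.open in binary and in text mode. It does not use _get_handle, and the sample payload is made up; on Python 2 the str/bytes distinction differs, so the same mode string would not necessarily behave the same way.

```python
import gzip
import io

payload = "café\n"

# Write a gzip-compressed, UTF-8 encoded payload into an in-memory buffer.
buffer = io.BytesIO()
with gzip.GzipFile(fileobj=buffer, mode="wb") as handle:
    handle.write(payload.encode("utf-8"))

# Mode "rb": the handle yields bytes and the caller must decode them.
buffer.seek(0)
with gzip.open(buffer, mode="rb") as handle:
    raw = handle.read()

# Mode "rt": the handle decodes for us, so an encoding has to be chosen here.
buffer.seek(0)
with gzip.open(buffer, mode="rt", encoding="utf-8") as handle:
    text = handle.read()
```

Here raw is a bytes object and text is a str, which is why the mode and the encoding argument cannot be documented independently of each other.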

> I think the c engine still has its own logic for handling compression, while I do not think this needs to be in the cython/c code

Agree the c engine implementation should be consolidated, unless there is a major performance issue. But the duplicated functionality with _get_handle appears not to be c optimized (I'm not sure as I don't know cython).

jreback · 2016-12-29T17:38:59Z

@dhimmel can you annotate the above (or maybe make it a table)

add an x/check if supports pathlib like things / compression / url

goldenbull · 2016-12-30T06:13:22Z

agree! 👍
I'm now working on #13317 and found _get_handle a bit complex to understand.
_get_handle needs to deal with various situations:

py2 or py3
binary (pickle, msgpack) or text (csv)
if text, what's the encoding
compression
memory map
open for read or write

It seems better to split _get_handle into two or more functions so that each function is simpler.
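One possible shape for such a split, sketched with the standard library only: one function opens a binary handle and deals with compression and read/write mode, and a second one adds the text layer and encoding. The names open_binary_handle and wrap_text_handle are hypothetical, and the py2/py3 and memory-map concerns are left out to keep the sketch short.

```python
import bz2
import gzip
import io
import lzma
import os
import tempfile

# Hypothetical table of openers; each accepts (path, mode) like the builtin open.
_OPENERS = {None: open, "gzip": gzip.open, "bz2": bz2.open, "xz": lzma.open}


def open_binary_handle(path, mode="r", compression=None):
    """Open path for reading ("r") or writing ("w"), always as bytes."""
    if mode not in ("r", "w"):
        raise ValueError("mode must be 'r' or 'w', got {!r}".format(mode))
    if compression not in _OPENERS:
        raise ValueError("Unrecognized compression: {!r}".format(compression))
    return _OPENERS[compression](path, mode + "b")


def wrap_text_handle(binary_handle, encoding="utf-8"):
    """Add a text layer on top of any binary handle."""
    return io.TextIOWrapper(binary_handle, encoding=encoding)


# Usage: binary formats would stop at open_binary_handle, text formats add the wrapper.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.csv.gz")
    with wrap_text_handle(open_binary_handle(path, "w", "gzip")) as handle:
        handle.write("a,b\n1,2\n")
    with wrap_text_handle(open_binary_handle(path, "r", "gzip")) as handle:
        round_trip = handle.read()
```

Closing the text wrapper also closes the underlying binary handle, so callers only need to manage the outermost object.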


jreback · 2018-07-08T13:05:38Z

@gfyoung can you evaluate this issue, e.g. close, tick boxes, etc.

gfyoung · 2018-07-08T20:36:34Z

@jreback : This looks to be a much more substantial refactoring at the moment. The checkboxes were more of an enumeration of methods instead of actual tasks AFAICT.


VelizarVESSELINOV · 2022-09-27T00:51:07Z

Request for API consistency between to_sql and to_gbq:
.to_sql(index=True...)
vs
.to_gbq(...) (no index option; the index is always ignored)

Desired solution:

To have the same option in both functions
To have the same default value

Do you prefer having a separate ticket?

dhimmel · 2022-09-27T01:57:37Z

> Do you prefer having a separate ticket?

Yes, the index parameter is outside the scope of this issue, which is focused on specifying the input data location and the corresponding compression.

dhimmel mentioned this issue Dec 29, 2016

add compression support for 'read_pickle' and 'to_pickle' #13317

Closed

jorisvandenbossche added the Clean and IO Data (IO issues that don't fit into a more specific label) labels Dec 29, 2016

jreback added the Master Tracker (high-level tracker for similar issues) label Dec 29, 2016

This was referenced Aug 15, 2017

Infer compression from non-string paths #17206

Merged

TST: Compression Inference Tests for read_* #17262

Closed

jreback added this to the High Level Issue Tracking milestone Sep 24, 2017

Dobatymo mentioned this issue Oct 18, 2017

ENH: Add 'infer' option to compression in _get_handle() #17900

Merged

4 tasks

TomAugspurger removed the Master Tracker (high-level tracker for similar issues) label Jul 6, 2018

TomAugspurger removed this from the High Level Issue Tracking milestone Jul 6, 2018

jreback added this to the 0.24.0 milestone Jul 8, 2018

jreback pushed a commit that referenced this issue Jul 8, 2018

ENH: support 'infer' compression in _get_handle() (#17900)

6008d75

xref gh-15008 xref gh-17262

gfyoung modified the milestones: 0.24.0, Contributions Welcome Jul 8, 2018

This was referenced Jul 21, 2018

Defaulting to_csv to infer compression #22004

Closed

Default to_* methods to compression='infer' #22011

Merged

Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this issue Oct 1, 2018

ENH: support 'infer' compression in _get_handle() (pandas-dev#17900)

7703867

xref pandas-devgh-15008 xref pandas-devgh-17262

gwaybio mentioned this issue Apr 4, 2019

Upgrade Pandas to 0.24 greenelab/pancancer#100

Open

simonjayhawkins mentioned this issue Sep 11, 2019

File options for IO methods #28377

Open

datapythonista mentioned this issue Sep 12, 2019

DEPR: Move rarely used I/O connectors to third party modules #28409

Closed

simonjayhawkins mentioned this issue Sep 29, 2019

to_html() has no encoding parameter? This causes garbled Chinese characters in the output #28663

Closed

mohitanand001 mentioned this issue Oct 4, 2019

to_string() does not have an "encoding" parameter. #28766

Closed

jbrockmendel added this to Consolidate in IO Method Robustness Dec 20, 2019

janpipek mentioned this issue Dec 26, 2019

Add option encoding to to_html? #30483

Closed

mroeschke added the API Design and Refactor (internal refactoring of code) labels and removed the Clean label May 2, 2020

This was referenced May 2, 2020

CLN: unify unicode file handle processing #13401

Closed

API: formalize the pandas IO API #15862

Closed

CLN: consolidate Iterator interface #9496

Closed

API: Unify compression-kwarg for IO-methods #21640

Closed

jbrockmendel added the API - Consistency (internal consistency of API/behavior) label Sep 20, 2020

mroeschke added Enhancement and removed API Design labels May 2, 2021

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
