custom formatters for to_csv #4668

Open
cpcloud opened this Issue Aug 25, 2013 · 17 comments

8 participants
Member

cpcloud commented Aug 25, 2013

SO question

something like

df.to_csv(format='%10.4f', sep=' ')
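For reference, to_csv already takes a float_format argument with %-style strings. A minimal sketch with made-up data; the padded width and the space separator from the request above are deliberately left out here:

```python
import pandas as pd

# Made-up data, purely for illustration.
df = pd.DataFrame({"x": [1.5, 2.25], "y": [3.0, 0.125]})

# float_format is an existing to_csv parameter that takes a %-style string.
csv_text = df.to_csv(index=False, float_format="%.4f")
print(csv_text)
```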
Contributor

nehalecky commented Aug 25, 2013

Most legacy Fortran 77 based simulators (many of which are still actively used in various scientific communities), such as TOUGH2 (which I've had the pleasure of working with extensively), often have data input subroutines with hardcoded field widths. This makes life interesting, and often as much time is spent on preprocessing as on postprocessing. Scientists I worked with had folders of rigid bash scripts and even Fortran routines to try to manage the creation of these 'input decks'. Oh, the scientist-hours lost. 😢

Input decks can contain information such as mesh geometry and physical properties, system initial conditions (i.e., thermodynamic state of each element and transport states between them), and simulator operation parameters. While it's often the case that pandas is used to analyze measured or resultant data, I certainly could envision using it to manipulate input files (indeed, I wrote an entire Perl library to do this, before discovering Python and pandas). To manage such input decks, fixed-width output is essential.

Clearly there are much better ways to interact with Fortran libraries (f2py / numpy), but I can tell you that (some) scientists are simply interested in getting a simulation up and running. If pandas already has them hooked for data analysis use, there could be a large benefit from such functionality. If I had had access to such a tool, my graduate student life would have been a whole lot more social 😉.
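A minimal sketch of what fixed-width output for such an input deck could look like, done by hand outside of to_csv. The helper name to_fixed_width, the column names, and the field widths are all hypothetical and not taken from TOUGH2 or pandas:

```python
import pandas as pd

def to_fixed_width(df, formats, default="{:>10}"):
    """Render each row as one fixed-width line (hypothetical helper, not a pandas API)."""
    lines = []
    for row in df.itertuples(index=False):
        # Look up a str.format template per column, falling back to the default width.
        cells = [formats.get(col, default).format(val)
                 for col, val in zip(df.columns, row)]
        lines.append("".join(cells))
    return "\n".join(lines)

# Made-up element table; names and widths are illustrative only.
df = pd.DataFrame({
    "elem": ["A1", "A2"],
    "pressure": [101000.0, 99000.0],
    "temp": [20.0, 21.25],
})
text = to_fixed_width(
    df, {"elem": "{:<5}", "pressure": "{:10.4E}", "temp": "{:10.4f}"}
)
print(text)
```

Every line comes out 25 characters wide (5 + 10 + 10), which is the property a hardcoded Fortran read statement depends on.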

Contributor

jreback commented Aug 26, 2013

@nehalecky so you want either to_csv to have a fixed width mode

what kind of an API would you see here?

we have been toying with the idea of passing a style parameter to these output routines which could be a class (pandas would provide a base class) that could be overridden for really custom behavior

but easy to see a FixedWidthWriter

or maybe that's overkill and we just need something straightforward?

Contributor

hayd commented Aug 26, 2013

Perhaps should be float_format to match with options.display.... actually atm that requires a formatter (e.g. '{:10.4f}'.format), maybe should also accept strings like '%10.4f'...
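A rough sketch of accepting both forms, i.e. a callable formatter or a %-style string. The helper name as_formatter is hypothetical and only illustrates the idea; it is not how pandas handles float_format:

```python
def as_formatter(fmt):
    """Turn a callable or a %-style string into a one-argument formatter.

    Hypothetical helper for illustration only.
    """
    if callable(fmt):
        return fmt  # e.g. '{:10.4f}'.format
    return lambda value: fmt % value  # e.g. '%10.4f'

from_callable = as_formatter("{:10.4f}".format)
from_string = as_formatter("%10.4f")
print(repr(from_callable(3.14159)), repr(from_string(3.14159)))
```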

Contributor

patricktokeeffe commented Oct 30, 2013

I think per-column functionality should be added to this list, similar to how read_csv's dtype and na_values accept per-column parameters as a dict.

That would allow users, for example, to apply a different float format to the timestamp than the data columns. Or change the time formatting to military format. (Date formatting was touched in PR #4313 but not time IIUC)

Contributor

jreback commented Oct 30, 2013

this is really just waiting on a nice API that either does what you are suggesting / templates or both

and of course someone to work on this....

it would not be hard to extend float_format/date_format to accept a dict of columns to format

e.g.

date_format={'A' : '%Y%m%d', 'B' : '%y'}
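Until something like that exists, a per-column dict can be approximated by formatting the columns before calling to_csv. A sketch under that assumption; apply_column_formats and the column "x" are hypothetical, and the dict keys follow the example above:

```python
import pandas as pd

def apply_column_formats(df, date_format=None, float_format=None):
    """Return a copy with the listed columns pre-rendered as strings.

    Hypothetical helper: to_csv itself is not given a dict here.
    """
    out = df.copy()
    for col, fmt in (date_format or {}).items():
        out[col] = out[col].dt.strftime(fmt)  # strftime-style pattern per column
    for col, fmt in (float_format or {}).items():
        out[col] = out[col].map(lambda v, f=fmt: f % v)  # %-style pattern per column
    return out

df = pd.DataFrame({
    "A": pd.to_datetime(["2013-08-25"]),
    "B": pd.to_datetime(["2013-08-26"]),
    "x": [1.5],
})
formatted = apply_column_formats(
    df,
    date_format={"A": "%Y%m%d", "B": "%y"},
    float_format={"x": "%.3f"},
)
csv_text = formatted.to_csv(index=False)
print(csv_text)
```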

Contributor

nehalecky commented Feb 15, 2014

A per column template, as suggested by @jreback I think would be grand. For large/complex column arrangements, you could use a series beforehand to prescribe slices across certain columns and generate a dict. :)
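One way to read "use a series beforehand" is sketched below: start from a Series of format strings indexed by column name, overwrite slices of it, then convert to a dict. The column names and formats are made up for illustration:

```python
import pandas as pd

# Hypothetical column layout, purely for illustration.
columns = pd.Index(["time", "p1", "p2", "t1", "t2"])

# Start with one default format for every column...
formats = pd.Series("%10.4f", index=columns)

# ...then override single columns or label slices as needed.
formats["time"] = "%Y-%m-%d %H:%M"
formats.loc["t1":"t2"] = "%8.2f"

# The dict is what a per-column format argument would receive.
format_dict = formats.to_dict()
print(format_dict)
```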

Contributor

cancan101 commented Feb 16, 2014

Currently the date_format argument is a little unclear as to what it does when the value being formatted is a "date" (datetime w/o a time) as opposed to a "datetime" (datetime w/ a time). At present, it treats these alike and uses the same formatter for each. This is different from how a DatetimeIndex is formatted to CSV. In that case, the formatting code detects whether none of the values in the index contain times, in which case it only formats the date component. See:

df = pd.DataFrame({'a':[datetime.datetime(2013,1,1)]}, index=pd.to_datetime([datetime.datetime(2013,1,1)]))
io = StringIO()
df.to_csv(io)

In [12]: print io.getvalue()
,a
2013-01-01,2013-01-01 00:00:00

I would suggest having some way to format datetimes w/o a time differently from datetimes with times.

Contributor

hayd commented Mar 1, 2014

@cancan101 could have a flag to drop the minutes / seconds if 00:00:00 (not sure on good name). Could do with an example of date_format in doc, think it would make it clearer (or use default).

Should these be in options.io ?

Contributor

cancan101 commented Mar 3, 2014

@hayd I assume you mean drop the hours, minutes, and seconds (ie the time component of the datetime)?

I think the option and its name depend on how it will work: should it be an "intelligent" format that only prints the time component if needed (i.e. if any of the datetime values have a time != midnight, see #5701) or should it work as a truncate where datetimes are truncated to just dates.
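The two behaviours can be sketched side by side. This is only an illustration: format_datetimes and its mode argument are hypothetical names, and the two format strings are assumptions rather than anything pandas defines:

```python
import pandas as pd

def format_datetimes(values, mode="smart"):
    """Render a datetime Series as strings (hypothetical helper, not a pandas API).

    mode="smart":    print the time only if some value is not at midnight.
    mode="truncate": always drop the time component.
    """
    present = values.dropna()
    # A value carries a time if it differs from itself floored to midnight.
    has_time = bool((present != present.dt.normalize()).any())
    if mode == "truncate" or (mode == "smart" and not has_time):
        return values.dt.strftime("%Y-%m-%d")
    return values.dt.strftime("%Y-%m-%d %H:%M:%S")

dates_only = pd.Series(pd.to_datetime(["2013-01-01", "2013-01-02"]))
with_times = pd.Series(pd.to_datetime(["2013-01-01 00:00:00", "2013-01-02 12:30:00"]))

print(format_datetimes(dates_only).tolist())
print(format_datetimes(with_times).tolist())
print(format_datetimes(with_times, mode="truncate").tolist())
```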

@jreback jreback modified the milestones: 0.15.0, 0.14.0 Mar 11, 2014

Member

cpcloud commented Mar 13, 2014

Did we ever settle on an API here?

Contributor

jreback commented Mar 13, 2014

I think need to create a Format object

Format(col or cols, format=None, default=None)

so this easily handles date_format and float_format (for back compat)

and handles ability to customize as well
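A very rough sketch of how such an object might behave. Nothing like this exists in pandas; the methods applies_to and render are invented here, and treating default as the text for missing values is purely an assumption, since the proposal above does not say what it means:

```python
import datetime

class Format:
    """Hypothetical sketch of the proposed Format object (not a pandas class)."""

    def __init__(self, cols, format=None, default=None):
        # Accept a single column name or an iterable of names.
        self.cols = [cols] if isinstance(cols, str) else list(cols)
        self.format = format
        self.default = default  # assumption: text used for missing values

    def applies_to(self, col):
        return col in self.cols

    def render(self, value):
        # NaN and NaT are not equal to themselves, so this catches missing values.
        if value is None or value != value:
            return "" if self.default is None else self.default
        if self.format is None:
            return str(value)
        if callable(self.format):
            return self.format(value)
        if hasattr(value, "strftime"):
            return value.strftime(self.format)  # date_format-style pattern
        return self.format % value  # float_format-style pattern

float_fmt = Format(["a", "b"], format="%10.4f")
date_fmt = Format("when", format="%Y%m%d", default="NA")
print(repr(float_fmt.render(1.5)), date_fmt.render(datetime.date(2014, 3, 13)))
```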

@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015

I don't want to be a pain, but this master issue looks like it has been holding the string spacing issue hostage for almost 3 years, and I think that issue is mostly unrelated to more complex issues like float and datetime formatting. Just ran into it in version 17.1. Going to have to format the entire table manually.

Contributor

jreback commented Mar 25, 2016

@drafter250 well we have 1600 issues - which one shall be first? best way to get something in would be to put in a pull request

Suggestion: have a float32_formatter that is different from the float64_formatter. Today, forcing float32 and float64 types to share the same formatter is far from optimal.

drafter250 commented May 23, 2016

@jreback re: extra spaces between columns in DataFrame.to_string. Not long after my posting I dug into the pandas/core/format.py module, where to_string is implemented in the DataFrameFormatter class. I found a few calls to a method, self.adj.adjoin, whose first argument should be an integer giving the number of spaces between columns and whose remaining arguments are the columns themselves. Most of the calls to this method are hardcoded, and the col_space argument of to_string seems to actually go unused.

So I added these lines to DataFrameFormatter.__init__()

    # set col_space to zero if custom formatters are provided and
    # no col_space is provided.
    self.col_space = col_space
    if formatters is not None and col_space is None:
        self.col_space = 0
    elif col_space is None:
        self.col_space = 1

then replaced the hard-coded values to self.adj.adjoin with self.col_space and it seemed to work both with formatters or specifying a number of spaces to the col_space argument
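To make the idea concrete outside of pandas, here is a standalone sketch. Both functions are simplified stand-ins written for this illustration: adjoin only mimics the role of self.adj.adjoin described above, and resolve_col_space restates the defaulting rule from the snippet; neither is the actual pandas code:

```python
def resolve_col_space(col_space, formatters):
    """Pick the gap between columns, following the defaulting rule above."""
    if col_space is not None:
        return col_space
    # Custom formatters are assumed to handle their own padding.
    return 0 if formatters is not None else 1

def adjoin(space, *columns):
    """Join columns of strings side by side with `space` blanks between them.

    Simplified stand-in for illustration, not the pandas implementation.
    """
    padded = []
    for col in columns:
        width = max(len(cell) for cell in col)
        padded.append([cell.ljust(width) for cell in col])
    gap = " " * space
    return "\n".join(gap.join(row) for row in zip(*padded))

names = ["a", "bb"]
values = ["1", "2"]
print(adjoin(resolve_col_space(None, None), names, values))
print(adjoin(resolve_col_space(None, {"x": str}), names, values))
```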

I just got a development environment set up for pandas per the user instructions and noticed that the format module is missing from pandas/core and is now in a subpackage called formatters. This would be my first pull request and I want to do things right and write my tests first.

Q1 Where would I find any tests related to the to_string functionality as I don't see a test folder under the new formatters sub-package?

Q2. Could the extra space issue #4158 be separated out from this bigger issue so it can be referenced from the pull request. (you can wait till I actually submit the request)?

Thanks!

Contributor

jreback commented May 23, 2016

rebase on master and you will see pandas/formats, where things moved in 0.18.0.
tests are in tests/formats/

ideally you DO separate out issues to as narrow as possible

I got hung up trying to find the right spot to add/modify tests, and I think I
remember running into some corner cases that I couldn't quite figure out.
After looking at the formatting code, it seems very intertwined with other
bits and I can see why they wanted to revamp the API.

On Mon, Nov 14, 2016 at 1:46 PM, acosby notifications@github.com wrote:

Was there any progress on this? Or a way to hack to_space to use commas
as sep?


