to_string() not easily reversible for multi-index DataFrames #25570

randolf-scholz · 2019-03-06T14:52:19Z

Code Sample, a copy-pastable example if possible

import pandas
import numpy as np
from itertools import product
df1 = pandas.DataFrame(product(['a', 'b'], range(3)), columns=['idx1', 'idx2'])
df2 = pandas.DataFrame(np.random.rand(6, 2), columns=['col1', 'col2'])
df  = pandas.concat([df1, df2], axis=1)
df.set_index(['idx1','idx2'], inplace=True)
with open('test.dat', 'w') as file:
    file.write(df.to_string())   
df_read = pandas.read_csv('test.dat')
print(df.values)
print(df_read.values)
df.equals(df_read)  # -> False

Problem description

I wanted to save a pandas DataFrame in human readable form, that is, as a text file with nice vertical alignment. The to_string function achieves precisely this, whereas to_csv does not.

I have saved data like this before, and it works just fine when one does not save the index. In this case it can be loaded via pandas.read_csv(file, sep=r'\s+'). I tried using the index_col and header parameters, but nothing seems to work properly.
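For reference, the index-free round trip described above can be sketched as follows. This is a minimal illustration, not part of the original report: the small frame and the use of an in-memory buffer instead of a file are assumptions made for brevity.

```python
import io

import pandas as pd

# Illustrative data (an assumption; the report uses random values).
df = pd.DataFrame({"col1": [0.5, 1.25, 2.0], "col2": [3, 4, 5]})

# Write the vertically aligned text without the index column.
text = df.to_string(index=False)

# Any run of whitespace is treated as a single field separator.
restored = pd.read_csv(io.StringIO(text), sep=r"\s+")
print(restored)
```

This only works because no cell contains whitespace and no header cell is blank, which is exactly what stops holding once a multi-index is written.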

I also made a StackExchange thread.

Expected Output

True

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.8.final.0
python-bits: 64
OS: Linux
OS-release: 4.18.0-15-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.1
pytest: None
pip: 19.0.3
setuptools: 40.8.0
Cython: None
numpy: 1.16.2
scipy: 1.2.1
pyarrow: None
xarray: None
IPython: 7.3.0
sphinx: 1.8.4
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None


TomAugspurger · 2019-03-06T15:07:44Z

What precise changes to the output (or reader?) are you proposing?

FWIW, I don't think that faithful round-tripping is a high priority for to_string. If that's your goal, there are plenty of better options.

randolf-scholz · 2019-03-06T15:14:17Z

@TomAugspurger

My main goal is to save a multi-index DataFrame in human readable form and be able to load it again.

TomAugspurger · 2019-03-06T15:19:19Z

That's a difficult task :) What exact changes are you proposing?

randolf-scholz · 2019-03-06T15:33:22Z

To be honest, I think that a faithful inverse of to_string is probably the best option. The task should not be that difficult, because all the information needed to reconstruct the DataFrame is contained in the string representation. In fact, it seems to me that all one needs to do is:

Compute the column widths by analyzing the header
Extract the columns corresponding to the data and the index separately
Refill the missing entries in the index

Which is essentially what is proposed in this answer: https://stackoverflow.com/a/55024872/9318372
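The three steps listed above could be sketched roughly as follows. This is not the linked answer's code and not a pandas API: `read_to_string` is a hypothetical helper name, and the sketch assumes a named MultiIndex on the rows, single-level column names, no whitespace inside any cell, and index levels written in the default sparsified form. It also derives the column spans from all lines rather than from the header alone, which is a slight simplification of the first step.

```python
import csv
import io

import pandas as pd


def read_to_string(text):
    """Hypothetical helper (not a pandas API): parse DataFrame.to_string() output."""
    lines = text.splitlines()
    width = max(len(line) for line in lines)
    lines = [line.ljust(width) for line in lines]

    # Step 1: a character position belongs to a column if any line has a
    # non-space character there; runs of such positions are the column spans.
    occupied = [any(line[i] != " " for line in lines) for i in range(width)]
    spans, start = [], None
    for i, used in enumerate(occupied + [False]):
        if used and start is None:
            start = i
        elif not used and start is not None:
            spans.append((start, i))
            start = None

    # Step 2: cut every line at the spans. Row 0 holds the data column
    # names, row 1 the index level names, the rest is the table body.
    rows = [[line[a:b].strip() for a, b in spans] for line in lines]
    n_idx = sum(1 for name in rows[1] if name)
    names = rows[1][:n_idx] + rows[0][n_idx:]
    index_names = names[:n_idx]

    # Re-emit the cells as CSV so that read_csv can infer the dtypes.
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(names)
    writer.writerows(rows[2:])
    buf.seek(0)
    frame = pd.read_csv(buf)

    # Step 3: blank cells in the index columns mean "same as the row above".
    frame[index_names] = frame[index_names].ffill()
    return frame.set_index(index_names)


# Illustrative data (an assumption; values chosen to survive 6-digit formatting).
index = pd.MultiIndex.from_product([["a", "b"], range(3)], names=["idx1", "idx2"])
df = pd.DataFrame(
    {"col1": [0.5, 1.25, 2.0, 3.5, 4.75, 5.0], "col2": [1.5, 2.5, 3.5, 4.5, 5.5, 6.5]},
    index=index,
)
restored = read_to_string(df.to_string())
print(restored)
```

Two limitations are worth keeping in mind: floats are only recovered to the precision that to_string printed, and a numeric outer index level would come back as float after the forward fill, since the blanks are first read as missing values.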

WillAyd · 2019-03-06T16:11:38Z

Would #10415 fit your needs instead?

chris-b1 · 2019-03-06T19:19:11Z

There is some previous discussion buried in this thread, including the idea of a read_repr:
#8323 (comment)

randolf-scholz · 2019-03-07T18:26:52Z

Thanks @WillAyd and @chris-b1 for the suggestions. I am using the following script now to read the files:

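import pandas

f = 'test.dat'  # assumed path: the file written with df.to_string() in the report
# The two header rows (column names, then index names) become a two-level column index
df = pandas.read_fwf(f, header=[0, 1])
# Blank header cells are labelled 'Unnamed: ...', so keep only the real names
cols = [x for x, _ in df.columns if 'Unnamed' not in x]
idxs = [y for _, y in df.columns if 'Unnamed' not in y]
df.columns = idxs + cols
# Repeated outer index values are printed as blanks, so forward-fill them
df[idxs] = df[idxs].ffill()
df.set_index(idxs, inplace=True)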

I do strongly believe, though, that the ability to read and write tables in human-readable form should be core functionality of a module like pandas.

WillAyd · 2019-03-08T06:40:43Z

Makes sense. You could also specify index_col=[0, 1] and swap things around thereafter (might be easier).

I would side with @TomAugspurger's assessment of the priority here, as I (perhaps mistakenly) couldn't see this being useful outside of very small DataFrames when compared to the slew of other methods that exist.

With that said, this is open source, so let's see what others think. You are of course always welcome to submit a PR if you see an easy and scalable way to make it work.

jbrockmendel added the MultiIndex label Jul 23, 2019

mroeschke added the Enhancement, Needs Discussion, and Output-Formatting labels Jun 27, 2021
