Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

to_string() not easily reversible for multi-index DataFrames #25570

Open
randolf-scholz opened this issue Mar 6, 2019 · 8 comments
Open

to_string() not easily reversible for multi-index DataFrames #25570

randolf-scholz opened this issue Mar 6, 2019 · 8 comments
Labels
Enhancement MultiIndex Needs Discussion Requires discussion from core team before further action Output-Formatting __repr__ of pandas objects, to_string

Comments

@randolf-scholz
Copy link
Contributor

Code Sample, a copy-pastable example if possible

import pandas
import numpy as np
from itertools import product
df1 = pandas.DataFrame(product(['a', 'b'], range(3)), columns=['idx1', 'idx2'])
df2 = pandas.DataFrame(np.random.rand(6, 2), columns=['col1', 'col2'])
df  = pandas.concat([df1, df2], axis=1)
df.set_index(['idx1','idx2'], inplace=True)
with open('test.dat', 'w') as file:
    file.write(df.to_string())   
df_read = pandas.read_csv('test.dat')
print(df.values)
print(df_read.values)
df.equals(df_read)  # -> False

Problem description

I wanted to save a pandas DataFrame in human readable form, that is, as a text file with nice vertical alignment. The to_string function achieves precisely this, whereas to_csv does not.

I have saved data like this before, and it works just fine when one does not save the index. In this case it can be loaded via pandas.read_csv(file, sep=r'\s+'). I tried using the index_cols and header parameters but nothing seems to work properly.

I also made a StackExchange thread.

Expected Output

True

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.8.final.0
python-bits: 64
OS: Linux
OS-release: 4.18.0-15-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.1
pytest: None
pip: 19.0.3
setuptools: 40.8.0
Cython: None
numpy: 1.16.2
scipy: 1.2.1
pyarrow: None
xarray: None
IPython: 7.3.0
sphinx: 1.8.4
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@TomAugspurger
Copy link
Contributor

What precise changes to the output (or reader?) are you proposing?

FWIW, I don't think that faithful round-tripping is a high priority for to_string. If that's your goal, there are plenty of better options.

@randolf-scholz
Copy link
Contributor Author

@TomAugspurger

My main goal is to save a multi-index DataFrame in human readable form and be able to load it again.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Mar 6, 2019 via email

@randolf-scholz
Copy link
Contributor Author

That's a difficult task :) What exact changes are you proposing?

On Wed, Mar 6, 2019 at 9:14 AM randolf-scholz @.***> wrote: @TomAugspurger https://github.com/TomAugspurger My main goal is to save a multi-index DataFrame in human readable form and be able to load it again. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#25570 (comment)>, or mute the thread https://github.com/notifications/unsubscribe-auth/ABQHIiP0zs5kG1ZDfHfFroX9ovYhoREeks5vT9tTgaJpZM4bhD6_ .

To be honest I think that a faithful reverse to to_string is probably the best option. And the task should not be as difficult because all the information needed to reconstruct the DataFrame is contained in the string representation. In fact to me it seems all that one needs to do is

  1. Compute the column widths by analyzing the header
  2. Extract columns corresponding to the data & the index seperately
  3. Refill missing entries in the index

Which is essentially what is proposed in this answer: https://stackoverflow.com/a/55024872/9318372

@WillAyd
Copy link
Member

WillAyd commented Mar 6, 2019

Would #10415 fit your needs instead?

@chris-b1
Copy link
Contributor

chris-b1 commented Mar 6, 2019

Some previous discussion buried in this thread here - idea of a read_repr
#8323 (comment)

@randolf-scholz
Copy link
Contributor Author

Thanks @WillAyd and @chris-b1 for the suggestions. I am using the following script now to read the files:

df = pandas.read_fwf(f, header=[0,1])
cols = [x for x,_ in df.columns if 'Unnamed' not in x]
idxs = [y for _,y in df.columns if 'Unnamed' not in y]
df.columns = idxs + cols
df[idxs] = df[idxs].ffill()
df.set_index(idxs, inplace=True)

I do believe strongly though that the ability to read and write tables in human readable form should be a core-functionality of a module like pandas.

@WillAyd
Copy link
Member

WillAyd commented Mar 8, 2019

Makes sense. You could also specify index_col=[0, 1] and swap things around thereafter (might be easier).

I would side with @TomAugspurger assessment of the priority here as I (perhaps mistakenly) couldn't see this being useful outside of very small DataFrames when compared to the slew of other methods that exist.

With that said this is open source so let's see what others think. You are of course always welcome to submit a PR if you see an easy and scalable way to make it work

@mroeschke mroeschke added Enhancement Needs Discussion Requires discussion from core team before further action Output-Formatting __repr__ of pandas objects, to_string labels Jun 27, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement MultiIndex Needs Discussion Requires discussion from core team before further action Output-Formatting __repr__ of pandas objects, to_string
Projects
None yet
Development

No branches or pull requests

6 participants