Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

read_excel loses spaces on ods #32207

Closed
joernhees opened this issue Feb 23, 2020 · 8 comments · Fixed by #33233
Closed

read_excel loses spaces on ods #32207

joernhees opened this issue Feb 23, 2020 · 8 comments · Fixed by #33233
Labels
IO Excel read_excel, to_excel
Milestone

Comments

@joernhees
Copy link

Code Sample

Create a new spreadsheet with 1 column "testcol" in LibreOffice / OpenOffice & Excel, save as test.ods/test.xlsx:

testcol
this is great
4    spaces
1 trailing 
 1 leading
2  spaces  multiple  times

For simplicity here as zip: (1 ods, 1 xlsx): spreadsheets.zip

import pandas as pd
for i in pd.read_excel('test.ods', engine='odf')['testcol']:
    print(repr(i))

# output:
'this is great'
'4 spaces'
'1 trailing '
'1 leading'
'2 spaces multiple times'

for i in pd.read_excel('test.xlsx')['testcol']:
    print(repr(i))

# output:
'this is great'
'4    spaces'
'1 trailing '
' 1 leading'
'2  spaces  multiple  times'

Problem description

When reading .ods files (OpenOffice or LibreOffice) multiple spaces are collapsed into one, leading ones are lost, trailing ones preserved.

Expected Output

see excel output above.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.6.final.0
python-bits : 64
OS : Darwin
OS-release : 17.7.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.1
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.2.0
Cython : 0.29.15
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.5.0
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.12.0
pandas_datareader: None
bs4 : 4.8.2
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.5.0
matplotlib : 3.1.3
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : None
numba : 0.48.0

Not sure why that says odfpy: None, pip list tells me it's 1.4.1

@joernhees joernhees changed the title read_excel looses spaces on ods read_excel loses spaces on ods Feb 23, 2020
@WillAyd
Copy link
Member

WillAyd commented Feb 24, 2020

Makes sense. The odf reader is located in pandas.io.excel._odfreader.py - interested in debugging and pushing a PR?

@WillAyd WillAyd added the IO Excel read_excel, to_excel label Feb 24, 2020
@WillAyd WillAyd added this to the Contributions Welcome milestone Feb 24, 2020
@joernhees
Copy link
Author

I actually already tried in eea/odfpy#95, but man did i get lost... i can't even really tell if it's a bug in odfpy or if pandas uses it with(out) some magic option (in that case it would be a pandas bug).

Maybe someone more into odfpy like the original contributers @detrout or @simonjayhawkins could have a brief glimpse at it? Might be really simple for someone into it, but the odfpy docs aren't really "great", so not really beginner-friendly :-/

@detrout
Copy link
Contributor

detrout commented Feb 27, 2020

Huh.

It's losing spaces from the middle of a text string? That seems odd.

The most useful docs for odfpy were actually the OpenDocument specifications

I think I might have some time next week to try looking at this.

@detrout
Copy link
Contributor

detrout commented Feb 27, 2020

What an odd way to represent spaces?

<text:p>4 <text:s text:c="3"/>spaces</text:p>
<text:p>text:s/1 leading</text:p>

I had no idea that they'd have a commend to represent a string of spaces so I didn't attempt to parse it.

@detrout
Copy link
Contributor

detrout commented Feb 27, 2020

Here's the relevant specification on how spaces should be handled. https://docs.oasis-open.org/office/v1.2/os/OpenDocument-v1.2-os-part1.html#__RefHeading__1415200_253892949

I suspect tab and new-line characters probably aren't supported either.

@SurajH1
Copy link
Contributor

SurajH1 commented Mar 28, 2020

@WillAyd should I take this one?

@SurajH1
Copy link
Contributor

SurajH1 commented Apr 1, 2020

@detrout has this been resolved? I would like to look into it if it hasn't been

@detrout detrout mentioned this issue Apr 2, 2020
5 tasks
@detrout
Copy link
Contributor

detrout commented Apr 2, 2020

I made a stab at fixing it, though my test environment had issues so I'm not sure if it introduced any regressions. in #33233

detrout added a commit to detrout/pandas that referenced this issue Apr 3, 2020
The test files were generated by the bug submitter @joernhees
detrout added a commit to detrout/pandas that referenced this issue Apr 3, 2020
The test files were generated by the bug submitter @joernhees
@jreback jreback modified the milestones: Contributions Welcome, 1.1 Apr 6, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO Excel read_excel, to_excel
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants