Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wish: Input / output for Open Document Spreadsheet (ODS) #2311

Closed
lbeltrame opened this issue Nov 21, 2012 · 27 comments · Fixed by #25427
Closed

Wish: Input / output for Open Document Spreadsheet (ODS) #2311

lbeltrame opened this issue Nov 21, 2012 · 27 comments · Fixed by #25427
Labels
Enhancement good first issue IO Data IO issues that don't fit into a more specific label
Milestone

Comments

@lbeltrame
Copy link
Contributor

It would be nice to have, along with XLS and XLSX, also support for ODS (used notably by OpenOffice.org and Libreoffice, but not only that).

The main problem I saw is that there's a plethora of libraries to handle that: however not sure which is the "best" one (with a licensing that would fit with pandas).

@paulproteus
Copy link

It seems to me the best option for this would be:

This would be a rather substantial bit of work.

A good start for the test cases would be to look at pandas/io/tests/test_excel.py and write test cases that mirror what is in that file.

@H0R5E
Copy link
Contributor

H0R5E commented Sep 23, 2014

+1 for this feature.

Working on an open-source data wrangling project and I would like to be able to interact with spreadsheet software that is not Excel.

@jtratner
Copy link
Contributor

@H0R5E for now, you could just convert the resulting excel file to ODS (it's pretty much a simple set of numbers, nothing more): http://stackoverflow.com/questions/15257032/python-convert-excel-file-xls-or-xlsx-to-from-ods or you could use json as an intermediary and use tablib to export to ODS: https://pypi.python.org/pypi/tablib

@robertmuil
Copy link

+1 for this!

@davidovitch
Copy link
Contributor

I want to tune in with some additional notes:

  • odfpy is now living on Github, and seems to get some attention now and then.
  • I bumped into ezodf, which, contrary to simple-odspy, seems to be still under development. It does not depend on odfpy. The docs are here, and some examples here.

Based on some initial looks, ezodf has a straight forward interface. For example, loading an ods spreadsheet into a DataFrame could go as follows:

import pandas as pd
import ezodf
doc = ezodf.opendoc('somefile.ods')

print("Spreadsheet contains %d sheets.\n" % len(doc.sheets))
for sheet in doc.sheets:
    if sheet.name == 'data':
        data = sheet
    print("Sheet name: '%s'" % sheet.name)
    print("Size of Sheet : (rows=%d, cols=%d)" % (sheet.nrows(), sheet.ncols()) )
    print("-"*40)

# a random cell
pos = (row, col)
data.get_cell(pos).display_form

df_dict = {}
for i, col in enumerate(data.columns()):
    # col is a list that contains all the cells (rows of that column)
    # assume the column name is on the first row
    colname = col[0].display_form
    df_dict[colname] = []
    print(colname)
    for j, row in enumerate(col):
        # ignore the header
        if j == 0:
            continue
        else:
            df_dict[col_name].append(row.display_form)
# and convert to a DataFrame
df = pd.DataFrame(df_dict)

I have no idea how odfpy and ezodf compare performance wise, or how their different memory footprints are for very large spreadsheets. I haven't tried to use odfpy because that seems more complicated to get going with. I also have not checked how robust the writing process of ezodf is. I'll try to see if I can get something similar going as for pandas/io/tests/test_excel.py.

I will also mention @sorenroug from odfpy and @T0ha from ezodf just in case someone from upstream is interested in this discussion.

@davidovitch
Copy link
Contributor

After a short exploration, it seems doable to create an ods prototype reader based on the structure of pandas/io/excel.py using the ezodf module. I've forked pandas here and will implement an ods prototype reader in the io-ods branch.

@jtratner
Copy link
Contributor

@davidovitch - one thing to keep in mind if you choose to handle both writing and reading: you can register an ODS writer in the excel module (just implementing the required methods as listed in the code) and then you can just add a simple subclass of the excel writer tests and get a ton of test cases for free. Just a thought.

@davidovitch
Copy link
Contributor

I would like to give an update on the current status. My fork can currently read ODF ODS files, but writing is not implemented yet. The corresponding pull request is still pending review, and it has been a lot more work than I originally anticipated.

Just a few days ago I bumped into yet another library to read/write spreadsheet: pyexcel. What is different about this library is that it aims at creating a single API for all the different read/write libraries, and it builds on all the existing libraries out there (ezodf, xlrd, etc). This is actually kind of similar to what pandas currently has, and to what I am trying to extend in PR #9070. So I am wondering, would it make sense to use a single API library (with optional dependencies to all the relevant readers/writers for the different spreadsheet types) instead of developing something similar, but tailored to pandas use case? I am speculating that a lot of code in io/excel.py could be removed when relying on pyexcel, but it adds yet another dependency. At this stage I am not sure how all different edge cases on data types and other magic happening inside the spreadsheet interpretation differs between the current pandas implementation and pyexcel.

@davidovitch
Copy link
Contributor

The ods reader is not ready for the upcoming 0.17 release. PR #9070 is closed (see the PR for a technical discussion), and a new improved PR will be made by someone at some point in the future. I have a working version in the ezodf_reader branch of my pandas fork in case someone wants to have a look at it. Suggestions and improvements are welcome :-)

@H0R5E
Copy link
Contributor

H0R5E commented Sep 16, 2015

@davidovitch thanks for your efforts! Its a shame that that reading and writing to ODF is not better supported in python, otherwise I'm sure this would be less of an ordeal.

@PBrockmann
Copy link

Working on how to structure data for paleoclimatology community, I would like to promote ODS as the standard format. Then of course, pandas with this possible reading and writing at ODS format would become the natural way to start nice analysis.

+2 and more for this feature that seems so closed to be released.

@naught101
Copy link

You should almost certainly be using netcdf for paleoclimate data. It's more or less industry standard. Checkout the python package 'xray' for a nice way to interface with it.

On 3 November 2015 12:30:20 am AEDT, Patrick Brockmann notifications@github.com wrote:

Working on how to structure data for paleoclimatology community, I
would like to promote ODS as the standard format. Then of course,
pandas with this possible reading and writing at ODS format would
become the natural way for IO and start nice analysis.

+2 and more for this feature that seems so closed to be released.


Reply to this email directly or view it on GitHub:
#2311 (comment)

Sent from my Android device with K-9 Mail. Please excuse my brevity.

@PBrockmann
Copy link

@naught101: I know quite well netCDF format and some conventions like CF (Climate and Forecast) widely used in the Earth System Modelling community but it seems that the PaleoClimate one has also good argument to keep working with others standard. An article that is under discussion describes this motivation and promote a JSON-LD format. Read https://www.authorea.com/users/17200/articles/19163/_show_article and please feel free to contribute it.
In my understanding, the data themselves are not problematic, the difficulty comes more from the metadata that are in many aspect hierachical. Netcdf attributs are in this case not very well designed.
On the other hand, describe those metadata with a simple spreadsheet with a dotted notation in an open source format like ODS (I will do a proposition in this way) could be interesting because it will not radically change the behaviour of PaleoClimate scientists that widelly work with spreadsheets.
All this, to say that if pandas could allow IO from ODS, it could be a good motivation to start their analysis with pandas.

@shoyer
Copy link
Member

shoyer commented Nov 3, 2015

@PBrockmann If your metadata is hierarchical, maybe it would be appropriate to use the hierarchical features of netCDF4/HDF5? Just a thought. In any case, I think we agree regardless that ODS support would be valuable for pandas.

@hnykda
Copy link

hnykda commented Nov 28, 2015

Is there any progress on this? Thanks

@jreback
Copy link
Contributor

jreback commented Nov 29, 2015

@hnykda there is a PR #9070 that is not far away, but needs some work if you'd like to pick it up.

@hnykda
Copy link

hnykda commented Nov 30, 2015

Will take a look at it. Thank you for pointing it out.

@jameserrico
Copy link

On a similar note has anyone considered Google Sheets? There are some limits to the size of a sheet (http://gappstips.com/google-sheets/google-spreadsheet-limitations/) that probably make it only usable for some cases but there are quite a lot of datasets that would fit within this restrictions and the ease of collaboration with Sheets is nice.

@detrout
Copy link
Contributor

detrout commented Jun 8, 2017

@jameserrico There's a library https://github.com/embr/gcat that can read from google drive. I have a fork that I updated to work with python3 https://github.com/detrout/gcat but upstream hasn't merged the pull-request

It knows how to return pandas data frames. I.e. I have some code like this:

    book = gcat.get_file(book_name, fmt='pandas_excel')
    experiments = book.parse('Experiments', header=0)`

@detrout
Copy link
Contributor

detrout commented Jun 16, 2017

I have a standalone reader styled like pandas.io.excel with tests.
https://github.com/detrout/pandasodf

It uses odfpy as that seems like its still maintained. Any comments, or should I try reformatting into a pull request creating pandas/io/odf.py?

I thought the .read_odf method seemed more reasonable, than overloading .read_excel

@naught101
Copy link

I thought the .read_odf method seemed more reasonable, than overloading .read_excel

Makes sense. Perhaps there could be a read_spreadsheet that calls read_csv, read_excel, read_odf etc, depending on the extension?

@detrout
Copy link
Contributor

detrout commented Jun 16, 2017

@naught101 The general reader might have some difficulties since there are differences in what the various formats support. CSV/TSV only have a single table, while Excel & ODF can have multiple tables per file.

Also I'm currently contemplating passing the raw odfpy xml cell node for the converters callable, so those would be quite difficult to write if you didn't know the source file type.

@davidovitch
Copy link
Contributor

davidovitch commented Aug 7, 2017

I thought the .read_odf method seemed more reasonable, than overloading .read_excel

@detrout FYI, maybe things have changed since, but at the time of PR #9070 there was a very clear decision by the pandas devs not to do that and to keep read_excel for both the Excel and Open Document Format families.

@TrigonaMinima
Copy link

@jreback is #9070 the right starting point on this or do you suggest starting with something else?

@jreback
Copy link
Contributor

jreback commented Feb 21, 2019

yes that issue is a good starting point

note that the internal code org has changed a lot since then

@jreback jreback modified the milestones: Someday, 0.25.0 Jul 3, 2019
@H0R5E
Copy link
Contributor

H0R5E commented Jul 3, 2019

Great to see the ODS reader implemented. Well done! I'm not sure this should have been closed though, as writing still isn't supported.

@jreback
Copy link
Contributor

jreback commented Jul 3, 2019

i would open a new issue for write support (if it’s really needed)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement good first issue IO Data IO issues that don't fit into a more specific label
Projects
None yet
Development

Successfully merging a pull request may close this issue.