Allow to save and load all Axis and Group objects of a session in/from HDF, CSV and EXCEL files #578

Closed
alixdamman opened this issue Feb 12, 2018 · 13 comments

@alixdamman
Collaborator

somewhat related to #153

@alixdamman
Collaborator Author

@gdementen Implementing #81 (LFrame) and #6 (multiversion labels) would greatly help to spread the use of LArray among our (potential) current users but in the meantime it is possible to mix Pandas and LArray. Unfortunately, Session.save and Session.load do not handle Pandas objects.

It is really easy to modify Session.save to be able to also save Pandas objects but it is more complicated to do the same for Session.load since we do not know the type of the loaded objects at loading time (we currently assume they are LArray objects).

I suggest adapting the names argument of Session.load so that it also accepts a dict containing 'name: type' pairs. What do you think?

@gdementen
Contributor

gdementen commented Feb 13, 2018

I do not really like this option (but this might be the best option anyway -- this needs more thought). At least I hope it would not be required. Can't we autodetect what we have? At least for .h5, we could store some extra metadata to make this possible. For Excel & .csv, we could return an array when \ is present and a DataFrame when \ is not present. That would be backward incompatible though, and it would break again when we support LFrame.

Also names are ordered while the dict would not be guaranteed to be so on python < 3.7
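
To make the 'name: type' proposal and the ordering concern concrete, here is a minimal sketch. Everything in it is hypothetical (`load_objects`, `READERS` and the two stand-in readers are not part of LArray or Pandas); it only shows how a loader could accept either a plain list of names, all read as arrays as today, or an ordered mapping of name to kind.

```python
from collections import OrderedDict


# Hypothetical stand-ins for the real readers: actual code would build
# LArray or Pandas objects here instead of tagged tuples.
def read_as_array(raw):
    return ("array", raw)


def read_as_frame(raw):
    return ("frame", raw)


READERS = {"array": read_as_array, "frame": read_as_frame}


def load_objects(store, names):
    """Load objects from `store` (a mapping name -> raw data).

    `names` is either a list of names (every object is read as an array)
    or a mapping name -> kind. An OrderedDict keeps the load order
    explicit even on Python versions where dict order is not guaranteed.
    """
    if isinstance(names, dict):
        pairs = list(names.items())
    else:
        pairs = [(name, "array") for name in names]

    loaded = OrderedDict()
    for name, kind in pairs:
        if kind not in READERS:
            raise ValueError("unknown kind {!r} for {!r}".format(kind, name))
        loaded[name] = READERS[kind](store[name])
    return loaded


store = {"pop": [[1, 2], [3, 4]], "births": [[5, 6]]}
session = load_objects(store, OrderedDict([("pop", "array"), ("births", "frame")]))
```

The drawback raised above still applies: the caller has to know the kind of every object in advance, which is what autodetection through metadata or tags would avoid.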

@alixdamman
Collaborator Author

The whole point behind this is to be able to save Axis and Group objects. Maybe a special object to simulate #6 also.
For HDF files, we can indeed use metadata.
For CSV and Excel, we need a tag or something to tell what it is at reading time.

@alixdamman alixdamman changed the title from "Include Pandas objects in Session.save() and Session.load()" to "Allow to save and load Axis and Group objects in/from external files (HDF, CSV, EXCEL)" Feb 14, 2018
@alixdamman alixdamman modified the milestones: 0.28, 0.29 Feb 21, 2018
@alixdamman
Collaborator Author

@gdementen
Why not use group_name@associated_axis_name, list, of, labels and @axis_name, list, of, labels in CSV files?
Or use a more explicit notation: @Group, group_name, associated_axis_name, list, of, labels and @Axis, axis_name, list, of, labels?

I don't see any obvious way to guess if we are dealing with a group or an axis or a 1D array when reading data from a CSV file or Excel sheet.

@gdementen
Contributor

The @ idea is interesting, but I think we are looking at it from the wrong angle. There are potentially several different features involved here.

First, if we are loading one axis (or one group) from a file, the user should know what s/he is loading and could use a specific function to load that. eg read_axis(), read_group(), or similar.

It is probably valuable to have a format to save/load those one at a time from a custom format, and in that case the @ solution you describe seems a good compromise between readability, simplicity and functionality.

FWIW, I prefer to see "realistic" examples to gauge syntaxes. Your above proposals would be:

@sex,M,F
@country,BE,FR,DE,NL,LU
Benelux@country,BE,NL,LU

@axis,sex,M,F
@axis,country,BE,FR,DE,NL,LU
@group,Benelux,country,BE,NL,LU
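
For illustration only, the second (explicit) notation could be parsed along these lines. This is a sketch under assumptions: `AxisSpec`, `GroupSpec` and `parse_tagged_rows` are hypothetical names, not LArray API, and plain dataclasses stand in for real Axis and Group objects.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class AxisSpec:
    # Hypothetical stand-in for an Axis: a name and its labels.
    name: str
    labels: list


@dataclass
class GroupSpec:
    # Hypothetical stand-in for a Group: a name, the axis it belongs to
    # and the subset of labels it selects.
    name: str
    axis: str
    labels: list


def parse_tagged_rows(text):
    """Parse '@axis,name,labels...' and '@group,name,axis,labels...' rows."""
    axes, groups = {}, {}
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        tag = row[0].strip().lower()
        fields = [field.strip() for field in row[1:]]
        if tag == "@axis":
            axes[fields[0]] = AxisSpec(fields[0], fields[1:])
        elif tag == "@group":
            groups[fields[0]] = GroupSpec(fields[0], fields[1], fields[2:])
        else:
            raise ValueError("unknown tag: {!r}".format(row[0]))
    return axes, groups


SAMPLE = """\
@axis,sex,M,F
@axis,country,BE,FR,DE,NL,LU
@group,Benelux,country,BE,NL,LU
"""

axes, groups = parse_tagged_rows(SAMPLE)
```

With the tag in the first cell, the reader never has to guess whether a row is an axis, a group or a 1D array, which is the problem the @ notation is meant to solve.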

But, what I think users need more is

  1. to save and load back an entire session containing arrays and groups in whatever format we like. In this case, I don't think a format for a single Axis or Group would help. We need to define a format to encode multiple axes/groups in a single file. HDF5 seems the easiest to implement, but we could also define a custom format for Excel: something like a special sheet containing all axes and a special sheet containing all groups would probably be the most convenient for users. I thought this issue was about this part.
  2. to make it easy to create axes and groups from arbitrary external sources. The most likely candidates are .csv files and ranges in Excel sheets (see #155, "ensure it is easy to create groups from an Excel or .csv file").

@alixdamman
Collaborator Author

alixdamman commented Mar 22, 2018

HDF5 seems the easiest to implement, but we could also define a custom format for Excel: something like a special sheet containing all axes and a special sheet containing all groups would probably be the most convenient for users

That was my first thought actually, but I can see what users will ask for. I'm pretty sure some of them will want to export arrays with their associated groups in the same sheets and then reload arrays and groups in one operation. Imagine an array called pop and associated groups like teenagers and pensioners. I'm quite sure some users will not like having the teenagers and pensioners groups and the pop array stored in separate CSV files or Excel sheets.

Another thing I'm worried about: do we force Groups and Axes to be stored vertically or horizontally?

@gdementen
Contributor

No, no, no... Users cannot have it both ways. They might complain indeed, but it is IMO setting the bar too high to both save/load these objects exactly the way users want and save/load them all together. These are two different features. The session thing should be seen as an internal format. If the internal format can be used directly by users, that's all the better, but this is not even required. However, we need to make it as easy as possible to define axes and groups from arbitrary .csv and Excel files (i.e., #155/use any format the user likes), but then this is a one-object-at-a-time process.

In a mid to long term future we might want to let users define their own custom format/template for saving or loading many axes at once, but this is a lot less useful than the other two features.

@gdementen
Contributor

Another thing I'm worried about: do we force Groups and Axes to be stored vertically or horizontally?

For 1, whatever is most convenient to implement, so I guess horizontal.
For 2, we need to support both (for both save and load).
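
A minimal sketch of what supporting both layouts could look like for a simple CSV format, with the orientation given explicitly by the caller. `read_label_lists` and `write_label_lists` are hypothetical helpers, not LArray functions. The sketch assumes one object per row (horizontal) or per column (vertical), the name first and the labels after it, and it drops empty cells, so empty labels are not supported.

```python
import csv
import io
from itertools import zip_longest


def read_label_lists(text, orientation="horizontal"):
    """Return {name: [labels]} from CSV text stored row-wise or column-wise."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    if orientation == "vertical":
        # Transpose so that each object becomes one row again; shorter
        # columns are padded with empty cells.
        rows = [list(col) for col in zip_longest(*rows, fillvalue="")]
    elif orientation != "horizontal":
        raise ValueError("orientation must be 'horizontal' or 'vertical'")
    # First cell is the name, remaining non-empty cells are the labels.
    return {row[0]: [cell for cell in row[1:] if cell != ""] for row in rows}


def write_label_lists(label_lists, orientation="horizontal"):
    """Inverse of read_label_lists: serialize {name: [labels]} as CSV text."""
    rows = [[name] + list(labels) for name, labels in label_lists.items()]
    if orientation == "vertical":
        rows = [list(col) for col in zip_longest(*rows, fillvalue="")]
    elif orientation != "horizontal":
        raise ValueError("orientation must be 'horizontal' or 'vertical'")
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


axes = {"sex": ["M", "F"], "country": ["BE", "FR", "DE"]}
horizontal_text = write_label_lists(axes, "horizontal")
vertical_text = write_label_lists(axes, "vertical")
```

Both texts read back to the same mapping as long as the matching orientation is passed to `read_label_lists`.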

@alixdamman
Collaborator Author

  1. So, all Axes in a separate Sheet/CSV file with a specific name and all Groups in another separate Sheet/CSV file with a specific name? Or one file per Axis and Group (this is a bit too much IMO)?
  2. I would store them horizontally (see the files from #155, "ensure it is easy to create groups from an Excel or .csv file"). It is more common.

@gdementen
Contributor

  1. specific sheet for all objects of a kind.
  2. we need both. Eurostat files in #155 ("ensure it is easy to create groups from an Excel or .csv file") are dataframe-like, so it is yet another format to "support".

@alixdamman
Collaborator Author

  1. You mean for axes and groups, not for arrays?
  2. OK, but how do we guess whether objects (Axis/Group) are stored vertically or horizontally?

@gdementen
Contributor

gdementen commented Mar 22, 2018

  1. one special sheet (__groups__?) for all groups + one special sheet (__axes__?) for all axes + one sheet for each array like we have now
  2. we don't guess. User should specify what to load (eg. range of the excel sheet).

@alixdamman alixdamman changed the title from "Allow to save and load Axis and Group objects in/from external files (HDF, CSV, EXCEL)" to "Allow to save and load all Axis and Group objects of a session in/from HDF, CSV and EXCEL files" Mar 22, 2018
@alixdamman
Collaborator Author

The name of the Sheet/CSV file/HDF group for axes and groups could be defined by two additional arguments with default values:
pathaxes=__axes__ and pathgroup=__groups__ (something like that).
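
As a rough sketch of that layout for the CSV case (one file per array, plus one special file for all axes and one for all groups), assuming the `pathaxes` and `pathgroup` argument names suggested above. `dump_session` and `load_session` are hypothetical functions, not the implementation that was eventually committed, and plain lists and tuples stand in for LArray objects. Values come back as strings because CSV carries no type information.

```python
import csv
import tempfile
from pathlib import Path


def dump_session(directory, arrays, axes, groups,
                 pathaxes="__axes__", pathgroup="__groups__"):
    """Write one CSV per array plus one file for all axes and one for all groups."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    # One row per axis: name followed by its labels.
    with open(directory / (pathaxes + ".csv"), "w", newline="") as f:
        csv.writer(f).writerows([name, *labels] for name, labels in axes.items())
    # One row per group: name, associated axis name, then its labels.
    with open(directory / (pathgroup + ".csv"), "w", newline="") as f:
        csv.writer(f).writerows(
            [name, axis, *labels] for name, (axis, labels) in groups.items())
    # Arrays keep their own file, as before.
    for name, rows in arrays.items():
        with open(directory / (name + ".csv"), "w", newline="") as f:
            csv.writer(f).writerows(rows)


def load_session(directory, pathaxes="__axes__", pathgroup="__groups__"):
    """Read back what dump_session wrote, using the file name to pick the kind."""
    arrays, axes, groups = {}, {}, {}
    for path in sorted(Path(directory).glob("*.csv")):
        with open(path, newline="") as f:
            rows = [row for row in csv.reader(f) if row]
        if path.stem == pathaxes:
            axes = {row[0]: row[1:] for row in rows}
        elif path.stem == pathgroup:
            groups = {row[0]: (row[1], row[2:]) for row in rows}
        else:
            arrays[path.stem] = rows
    return arrays, axes, groups


arrays = {"pop": [["sex\\country", "BE", "FR"], ["M", "1", "2"], ["F", "3", "4"]]}
axes = {"sex": ["M", "F"], "country": ["BE", "FR"]}
groups = {"Benelux": ("country", ["BE", "NL", "LU"])}

with tempfile.TemporaryDirectory() as tmp:
    dump_session(tmp, arrays, axes, groups)
    loaded_arrays, loaded_axes, loaded_groups = load_session(tmp)
```

Because the kind of each object is derived from the reserved file names, nothing has to be guessed at load time. The cost is that an array cannot be named `__axes__` or `__groups__` unless the defaults are overridden.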

alixdamman added a commit to alixdamman/larray that referenced this issue Apr 9, 2018
…project#578) :

- added to_hdf method to Axis and Group
- updated read_hdf (inout/hdf.py)
- updated doctests of Session.load and Session.save
- added context manager LHDFStore (utils/misc.py)

refactored package inout: created one module per file extension or external object type like in pandas/io/:

new modules:
  - common.py
  - pandas.py
  - csv.py
  - excel.py
  - hdf.py
  - sas.py
  - misc.py
  - pickle.py

renamed modules:
  - excel.py --> xw_excel.py

deleted modules:
  - array.py
alixdamman added a commit to alixdamman/larray that referenced this issue Apr 9, 2018
…project#578) :

- added to_hdf method to Axis and Group
- updated read_hdf (inout/hdf.py)
- updated documentation of Session's methods
- updated doctests of Session.load and Session.save
- added context manager LHDFStore (utils/misc.py)

refactored package inout: created one module per file extension or external object type like in pandas/io/:

new modules:
  - common.py
  - pandas.py
  - csv.py
  - excel.py
  - hdf.py
  - sas.py
  - misc.py
  - pickle.py

renamed modules:
  - excel.py --> xw_excel.py

deleted modules:
  - array.py
alixdamman added a commit that referenced this issue Apr 9, 2018
alixdamman added a commit to alixdamman/larray that referenced this issue Apr 9, 2018
alixdamman added a commit to alixdamman/larray that referenced this issue Apr 10, 2018
…project#578)

updated FileHandler and its subclasses:
- renamed FileHandler.list as FileHandler.lists which returns 3 lists (axes, groups and arrays)
- updated FileHandler.read_items()
- updated FileHandler.dump_items()
- split _dump() into _dump_array(), _dump_axes() and _dump_groups()
- split _read_item() into _read_array(), _read_axes(), _read_groups()
alixdamman added a commit to alixdamman/larray that referenced this issue Apr 23, 2018
…objects of a session in/from HDF, CSV and EXCEL files
gdementen pushed a commit that referenced this issue Aug 31, 2018
gdementen pushed a commit that referenced this issue Aug 31, 2018