Writing data to disk (Xarray) #8

Open
phockett opened this issue Sep 13, 2019 · 14 comments

Comments

@phockett
Owner

Xarray netCDF functionality: Currently fails with complex data-type issues.

Use HDF5 instead...?
Can also just pickle for now, but not recommended for Xarrays.

@phockett phockett added the bug label Sep 13, 2019
@phockett phockett added this to To do in ePSproc development Sep 13, 2019
@phockett phockett moved this from To do to In progress in ePSproc development Sep 16, 2019
@phockett
Owner Author

Ah, this looks like an engine (installed libraries) issue. Without the netCDF4 libraries the write defaults to netCDF3, which doesn't support complex128 (the NumPy default complex type).

Installing the libraries should fix this, or switching to a supported dtype before saving.

http://xarray.pydata.org/en/stable/generated/xarray.Dataset.to_netcdf.html
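
As a quick way to confirm which case applies, the sketch below checks which backend packages are importable. It assumes the usual import names (netCDF4, h5netcdf, scipy) and only reports presence; it does not ask Xarray which engine it will actually pick.

```python
import importlib.util


def available_netcdf_backends():
    """Report which optional netCDF backend packages can be imported."""
    # Keys are Xarray engine names, values are the usual import names (assumed).
    candidates = {
        "netcdf4": "netCDF4",
        "h5netcdf": "h5netcdf",
        "scipy": "scipy",
    }
    return {
        engine: importlib.util.find_spec(module) is not None
        for engine, module in candidates.items()
    }


backends = available_netcdf_backends()
print(backends)
```

If only scipy shows up as available, writes will be netCDF3 and complex data needs converting first.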

@phockett
Owner Author

phockett commented Sep 16, 2019

For scipy backend, splitting to re + im is one solution:
fooCPDS = xr.Dataset({'Re':fooCP.real, 'Im':fooCP.imag})
fooCPDS.to_netcdf('fooCPDS_write_test_160919-sp.nc')
fooReadTest = xr.open_dataset('fooCPDS_write_test_160919-sp.nc')
# THIS WORKS OK with scipy implementation.
# Reconstruct complex variables
fooCPrecon = fooReadTest.Re + fooReadTest.Im*1j
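
The same split and rebuild can be wrapped in two small helpers. This is a minimal in-memory sketch (no file IO); the names split_complex and join_complex are hypothetical and are not the ePSproc wrapper functions.

```python
import numpy as np
import xarray as xr


def split_complex(da):
    """Return a Dataset holding the real and imaginary parts as separate variables."""
    return xr.Dataset({"Re": da.real, "Im": da.imag})


def join_complex(ds):
    """Rebuild a complex DataArray from a Dataset with 'Re' and 'Im' variables."""
    return ds["Re"] + 1j * ds["Im"]


# Toy complex data standing in for fooCP
foo = xr.DataArray(np.array([1 + 2j, 3 - 4j]), dims="x", coords={"x": [0, 1]})

foo_split = split_complex(foo)       # Real-valued only, so netCDF3-safe
foo_recon = join_complex(foo_split)  # Complex again after reading back
```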

@phockett
Owner Author

Also issues with multi-level indexing here (netCDF3), so need to flatten/unstack before save, and reconstruct upon read. Now implemented in read/write wrapper functions for Xarray.

Haven't tested for netCDF4 yet.
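
For reference, a minimal sketch of the unstack-before-save and restack-after-read idea, done in memory only. The dimension names (l, m, LM) are illustrative, and this assumes the stacked index covers the full product of its levels, otherwise unstacking pads with NaN.

```python
import numpy as np
import xarray as xr

# Toy complex data with two plain dimensions
da = xr.DataArray(
    np.arange(6).reshape(2, 3) * (1 + 1j),
    dims=("l", "m"),
    coords={"l": [0, 1], "m": [-1, 0, 1]},
)

# Stacking creates a MultiIndex dimension, which netCDF cannot store directly
stacked = da.stack(LM=("l", "m"))

# Before writing: unstack back to plain dimensions
flat = stacked.unstack("LM")

# After reading: rebuild the MultiIndex
restacked = flat.stack(LM=("l", "m"))
```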

ePSproc development automation moved this from In progress to Done Sep 20, 2019
@phockett
Owner Author

phockett commented Oct 27, 2019

Now having similar issues (with current write function) on AntonJr runs... need to check installed libraries and address.

@phockett phockett reopened this Oct 27, 2019
ePSproc development automation moved this from Done to In progress Oct 27, 2019
@phockett
Owner Author

phockett commented Nov 8, 2019

Read & write debugged (again)... now OK for default netCDF3 Xarray behaviour. Still untested for other HDF libraries.

@phockett phockett closed this as completed Nov 8, 2019
ePSproc development automation moved this from In progress to Done Nov 8, 2019
@phockett
Owner Author

phockett commented Nov 9, 2019

With newer Xarray versions (0.13+) now get issues with netCDF3 and multidim attribs (previously just gave a warning).
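
One possible workaround, sketched below, is to stringify any attribute that is not a plain string, number, or 1-D real array before writing. The function name flatten_attrs and the attribute keys are hypothetical, not part of ePSproc, and the conversion is lossy: stringified values are not restored automatically on read.

```python
import numpy as np


def flatten_attrs(attrs):
    """Return a copy of attrs with netCDF-unfriendly values converted to strings."""
    safe = {}
    for key, value in attrs.items():
        is_plain = isinstance(value, (str, int, float)) and not isinstance(value, bool)
        is_1d_real = (
            isinstance(value, np.ndarray)
            and value.ndim <= 1
            and value.dtype.kind in "iuf"
        )
        if is_plain or is_1d_real:
            safe[key] = value       # Passed through unchanged
        else:
            safe[key] = str(value)  # dicts, N-D arrays, complex values, etc.
    return safe


# Hypothetical attributes for illustration only
attrs = {
    "jobLabel": "test",
    "E": 1.5,
    "grid": np.arange(3),
    "table": np.ones((2, 2)),
    "meta": {"a": 1},
}
safe = flatten_attrs(attrs)
```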

@phockett phockett reopened this Nov 9, 2019
ePSproc development automation moved this from Done to In progress Nov 9, 2019
@phockett
Owner Author

phockett commented Mar 28, 2021

Also need to wrap this for job and multijob class methods!

For a quick fix, can just push tabulated data to CSV using Pandas:

# Write to CSV file (multijob class)
dataType = 'matE'
for key in data.data:
    data.data[key][dataType].attrs['pd'].to_csv(f"{key}_matE.csv")
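
Note that complex values come back from CSV as strings, so they need converting on read. A minimal round-trip sketch with a toy DataFrame (the column and index names are illustrative), using an in-memory buffer in place of a file:

```python
import io

import pandas as pd

# Toy table standing in for the tabulated matrix elements
df = pd.DataFrame({"matE": [1 + 2j, 0.5 - 1j]}, index=pd.Index([0, 1], name="l"))

buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)

# Read the complex column as text, then parse each entry with complex()
df_in = pd.read_csv(buffer, index_col="l", dtype={"matE": str})
df_in["matE"] = df_in["matE"].map(complex)
```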

@phockett
Owner Author

phockett commented May 5, 2022

Briefly revisited for PEMtk data IO...

Example from the Xarray docs:

# Writing complex valued data
In [8]: da = xr.DataArray([1.0 + 1.0j, 2.0 + 2.0j, 3.0 + 3.0j])

In [9]: da.to_netcdf("complex.nc", engine="h5netcdf", invalid_netcdf=True)

# Reading it back
In [10]: reopened = xr.open_dataarray("complex.nc", engine="h5netcdf")

In [11]: reopened
Out[11]: 
<xarray.DataArray (dim_0: 3)>
array([1.+1.j, 2.+2.j, 3.+3.j])
Dimensions without coordinates: dim_0

@phockett
Owner Author

phockett commented Jun 3, 2022

Improved complex number handling in writeXarray() as of e3988d0

For additional work with other IO formats (Pandas > HDF5, Pickle and others), see PEMtk library, esp. phockett/PEMtk@47b3541 and phockett/PEMtk#6

@phockett
Owner Author

phockett commented Jun 7, 2022

Added more general restack() functionality, see fa8522e.

Since implemented for ep.IO.readXarray().

Updated docs: https://epsproc.readthedocs.io/en/dev/dataStructures/ePSproc_dataStructures_demo_070622.html

@phockett
Owner Author

For more general to/from dict and to/from HDF5 functionality, see pydata/xarray#4073

TODO: wrap some of this into base code and move to h5py as preferred IO method.
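
For context, the to/from dict step on its own looks roughly like this. It is a minimal in-memory sketch with illustrative names (Eke, dataType) and no MultiIndex; writing the resulting dict to HDF5 with h5py is not shown here.

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.array([1 + 1j, 2 + 2j, 3 + 3j]),
    dims="Eke",
    coords={"Eke": [0.1, 0.2, 0.3]},
    attrs={"dataType": "matE"},
    name="demo",
)

# Convert to nested plain-Python containers, then rebuild the DataArray
as_dict = da.to_dict()
restored = xr.DataArray.from_dict(as_dict)
```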

@phockett
Owner Author

phockett commented Jun 29, 2022

Status 29/06/22:

  • IO for general Xarray data, including complex data and MultiIndex data, now OK in general.
  • Main wrappers are ep.IO.writeXarray() and ep.IO.readXarray(), additional backend stuff in ep.ioBackends and also some ep.util functions.
  • See notes in docs for details and examples

TODO:


Recent collected IO dev update

@phockett
Owner Author

phockett commented Jul 22, 2022

NEW BUG: currently failing for HDF5 and AFBLM data:

YET ANOTHER ANNOYING ISSUE IN CHECKDIMS() routine. This needs to be fixed ASAP.

e.g. ep.util.misc.checkDims(AFBLMdict[key]['full'].copy().squeeze(), method='full')

Fails at nonDimStacked line 355. Specifically XR core .to_index() ValueError: IndexVariable objects must be 1-dimensional

Tested in XR 0.19, Pandas 1.2.4 only.


UPDATE:

  • Main issue is due to use of .squeeze() in prior routines. If drop=True is not set, this leaves the squeezed coords as "0-dimensional" (scalar) coords, which throws errors for .to_index() methods later.
  • Also have issues (but maybe XR version specific) for ND coords with N>1.

E.g.

dataTest = data.data[key]['matE'].copy()  # OK
dataTestSq = data.data[key]['matE'].copy().squeeze()  # Fails - zero-dim coords in squeezed dims
dataTestSqDrop = data.data[key]['matE'].copy().squeeze(drop=True)  # OK
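
The difference can be reproduced with toy data. In the sketch below the helper scalar_coords is hypothetical (not part of ePSproc or Xarray) and the dimension names are illustrative; it lists the 0-dimensional coords that squeeze() leaves behind and shows two ways to avoid them.

```python
import numpy as np
import xarray as xr


def scalar_coords(da):
    """List coordinates that are 0-dimensional, e.g. after squeeze() without drop=True."""
    return [name for name, coord in da.coords.items() if coord.ndim == 0]


# Toy data with one singleton dimension
da = xr.DataArray(
    np.ones((1, 3)),
    dims=("Eke", "m"),
    coords={"Eke": [1.0], "m": [-1, 0, 1]},
)

squeezed = da.squeeze()          # 'Eke' survives as a scalar coord
dropped = da.squeeze(drop=True)  # 'Eke' is removed entirely

# Alternative clean-up for data that has already been squeezed
cleaned = squeezed.drop_vars(scalar_coords(squeezed))

print(scalar_coords(squeezed))
print(scalar_coords(dropped))
```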

Options/todo:

  • Fix drop=True option in matEleSelector (currently not passed to .squeeze())
  • Additional checks in checkDims for this case, not sure how best to do this?
    • Updated to add dims for 0-dimensional case in 6fd735f
    • This STILL FAILS for multidimensional non-dim coords, sigh.
    • Now fixed in 704e6b2 hopefully. Worked for AFBLM, both raw and squeezed, in tests.
  • Test further cases and recon.
    • For IO (testing with OCS calcs), ep.writeXarray(data.data[key]['AFBLM'], fileName='h5pyTest2.h5', engine='hdf5') is OK, but testXRin = ep.readXarray(fileName='h5pyTest2.h5', engine='hdf5') reads the dict OK and then fails to restack. Possibly an XR version issue (v0.19 tested only).
    • Fails at xr.DataArray.from_dict(data) with ValueError: Could not convert tuple of form (dims, data[, attrs, encoding]): (('Labels',), [(0.0, 0.0, 0.0)], {}) to Variable. Might also be a new bug from the recent changes above?
    • UPDATE: round-trip OK for 'matE', so this still seems to be something data-type specific, but thought tuple use was fixed/OK in IO previously?

@phockett
Owner Author

Minor IO changes in b00a966 seem to have broken IO tests... or possibly revealed some underlying issues.

E.g. from https://github.com/phockett/ePSproc/actions/runs/4679018663/jobs/8288471741

=========================== short test summary info ============================
FAILED tests/test_IO.py::test_pickle_read - AttributeError: 'dict' object has no attribute 'equals'
FAILED tests/test_IO.py::test_hdf5_read - ValueError: not enough values to unpack (expected 2, got 0)
FAILED tests/test_base_class.py::test_base_scanFiles - AttributeError: 'dict' object has no attribute 'attrs'
ERROR tests/test_base_class.py::test_jobsSummary - AttributeError: 'dict' object has no attribute 'attrs'
ERROR tests/test_base_class.py::test_AFBLM_iso - AttributeError: 'dict' object has no attribute 'attrs'
ERROR tests/test_base_class.py::test_AFBLM_N2_ADMs - AttributeError: 'dict' object has no attribute 'attrs'
============ 3 failed, 4 passed, 2685 warnings, 3 errors in 28.13s =============
Error: Process completed with exit code 1.

TODO: local testing/checks to confirm what is going on. Also check older CI runs to confirm this is a new bug.
