
Improve HTML representation of datasets #1100

Open
wants to merge 22 commits into dev

Conversation


@h-mayorquin commented Apr 19, 2024

Motivation

Improve the display of data in the HTML representation of containers. Note that this PR focuses on datasets that have already been written to disk. For in-memory representations, the question of what to do with data wrapped in an iterator or a DataIO subtype can be addressed in another PR, I think.

How to test the behavior?

HDF5

I have been using this script:

from pynwb.testing.mock.ecephys import mock_ElectricalSeries
from pynwb.testing.mock.file import mock_NWBFile
from hdmf.backends.hdf5.h5_utils import H5DataIO
from pynwb.testing.mock.ophys import mock_ImagingPlane, mock_TwoPhotonSeries

import numpy as np

data = np.random.rand(500_000, 384)
timestamps = np.arange(500_000)
data = H5DataIO(data=data, compression=True, chunks=True)

nwbfile = mock_NWBFile()
electrical_series = mock_ElectricalSeries(data=data, nwbfile=nwbfile, rate=None, timestamps=timestamps)

imaging_plane = mock_ImagingPlane(grid_spacing=[1.0, 1.0], nwbfile=nwbfile)

data = H5DataIO(data=np.random.rand(2, 2, 2), compression=True, chunks=True)
two_photon_series = mock_TwoPhotonSeries(name="TwoPhotonSeries", imaging_plane=imaging_plane, data=data, nwbfile=nwbfile)

# Write to file
from pynwb import NWBHDF5IO

with NWBHDF5IO('ecephys_tutorial.nwb', 'w') as io:
    io.write(nwbfile)

# Read it back; displaying the file triggers the HTML representation
io = NWBHDF5IO('ecephys_tutorial.nwb', 'r')
nwbfile = io.read()
nwbfile

[screenshot: HTML representation of the HDF5-backed NWB file]

Zarr

import numpy as np
from numcodecs import Blosc, Delta

from hdmf_zarr import ZarrDataIO
from hdmf_zarr.nwb import NWBZarrIO
from pynwb.testing.mock.ecephys import mock_ElectricalSeries
from pynwb.testing.mock.file import mock_NWBFile

filters = [Delta(dtype="i4")]

data_with_zarr_data_io = ZarrDataIO(
    data=np.arange(100_000_000, dtype='i4').reshape(10_000, 10_000),
    chunks=(1000, 1000),
    compressor=Blosc(cname='zstd', clevel=3, shuffle=Blosc.SHUFFLE),
    # filters=filters,
)

timestamps = np.arange(10_000)
data = data_with_zarr_data_io

nwbfile = mock_NWBFile()
electrical_series = mock_ElectricalSeries(name="ElectricalSeries", data=data, nwbfile=nwbfile, timestamps=timestamps, rate=None)

# Write to a Zarr store
path = "zarr_test.nwb.zarr"
with NWBZarrIO(path=path, mode="w") as io:
    io.write(nwbfile)

# Read it back; displaying the file triggers the HTML representation
io = NWBZarrIO(path=path, mode="r")
nwbfile = io.read()
nwbfile

[screenshot: HTML representation of the Zarr-backed NWB file]

Checklist

  • Did you update CHANGELOG.md with your changes?
  • Does the PR clearly describe the problem and the solution?
  • Have you reviewed our Contributing Guide?
  • Does the PR use "Fix #XXX" notation to tell GitHub to close the relevant issue numbered XXX when the PR is merged?

@h-mayorquin marked this pull request as ready for review April 23, 2024 15:14

codecov bot commented Apr 23, 2024

Codecov Report

Attention: Patch coverage is 1.92308%, with 51 lines in your changes missing coverage. Please review.

Project coverage is 85.07%. Comparing base (126bdb1) to head (08292c6).

❗ Current head 08292c6 differs from the pull request's most recent head 9cbcf64. Consider uploading reports for commit 9cbcf64 to get more accurate results.

Files                   Patch %   Lines
src/hdmf/container.py   1.92%     48 Missing and 3 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##              dev    #1100      +/-   ##
==========================================
- Coverage   88.70%   85.07%   -3.63%     
==========================================
  Files          45       45              
  Lines        9745     9792      +47     
  Branches     2767     2777      +10     
==========================================
- Hits         8644     8331     -313     
- Misses        779     1117     +338     
- Partials      322      344      +22     

☔ View full report in Codecov by Sentry.

@rly requested a review from stephprince April 24, 2024 01:21
@rly added the "category: enhancement" label (improvements of code or code behavior) Apr 24, 2024
@rly added this to the 3.14.0 milestone Apr 24, 2024
@stephprince (Contributor) left a comment

This looks great! Thanks for the PR.

Could you add tests for the data HTML representation with HDF5 and Zarr? I think we mainly have string-equivalence tests for this kind of thing.

I'm also wondering if it would be nice to have the HDF5 dataset info displayed in a table format similar to the Zarr arrays, to make it more consistent across backends. I think we should be able to replicate this using the HDF5 dataset info as input to a method like this: https://github.com/zarr-developers/zarr-python/blob/9d046ea0d2878af7d15b3de3ec3036fe31661340/zarr/util.py#L402
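For instance, rendering collected dataset info as rows of an HTML table could look roughly like this (a sketch; the function name and keys are illustrative, not the PR's actual code):

def render_dataset_info_as_html_table(info):
    # One table row per (property, value) pair, mirroring zarr's array info layout
    rows = "".join(f"<tr><th style='text-align: left'>{key}</th><td>{value}</td></tr>"
                   for key, value in info.items())
    return f"<table>{rows}</table>"

# e.g., for an h5py.Dataset `dset`:
# render_dataset_info_as_html_table({"Type": "HDF5 Dataset", "Shape": dset.shape,
#                                    "Data type": dset.dtype, "Chunks": dset.chunks,
#                                    "Compression": dset.compression})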

@h-mayorquin (Author)

OK, I added table formatting for HDF5:

[screenshot: HDF5 dataset info displayed as a table]

@h-mayorquin (Author) commented Apr 26, 2024

@stephprince
Concerning the tests: yes, I can add them, but can you help me create a container that contains array data? I just don't have experience with the bare-bones object. This is my attempt:

from hdmf.container import Container
import numpy as np

container = Container(name="Container")
container.__fields__ = {
    "name": "data",
    "description": "test data",
}

test_data = np.array([1, 2, 3, 4, 5])
setattr(container, "data", test_data)
container.fields

But the data is not added as a field. How can I move forward?
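For reference, HDMF containers declare their fields at the class level so the metaclass can turn each entry into a property. A sketch of that pattern (the subclass name is made up, and the expected key should be "doc" rather than "description"):

from hdmf.container import Container
import numpy as np

class ContainerWithData(Container):
    # __fields__ must be a class attribute; setting it on an instance has no effect
    __fields__ = ({"name": "data", "doc": "test data"},)

container = ContainerWithData(name="Container")
container.data = np.array([1, 2, 3, 4, 5])
container.fields  # should now contain {'data': array([1, 2, 3, 4, 5])}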

@h-mayorquin (Author)

Related:

hdmf-dev/hdmf-zarr#186

@h-mayorquin (Author)

I added handling for division by zero. Check out what happens with external files (like video):

[screenshot: HTML representation of a dataset backed by external files]

From this example:

import remfile
import h5py

from pynwb import NWBHDF5IO

# `dandiset` is assumed to have been obtained beforehand via the DANDI API client
asset_path = "sub-CSHL049/sub-CSHL049_ses-c99d53e6-c317-4c53-99ba-070b26673ac4_behavior+ecephys+image.nwb"
recording_asset = dandiset.get_asset_by_path(path=asset_path)
url = recording_asset.get_content_url(follow_redirects=True, strip_query=True)

rfile = remfile.File(url)
file = h5py.File(rfile, 'r')

io = NWBHDF5IO(file=file, mode='r')
nwbfile = io.read()
nwbfile

@stephprince (Contributor)

There are still some failing tests on different Python versions. It looks like one of the reasons is that h5py added the .nbytes attribute in version 3.0, and we still have h5py==2.10 as the minimum version.

>       array_size_in_bytes = array.nbytes
E       AttributeError: 'Dataset' object has no attribute 'nbytes'

I'm not sure if there's another way to access that information or if we would just want to optionally display it if available.

@oruebel (Contributor) commented May 1, 2024

I'm not sure if there's another way to access that information or if we would just want to optionally display it if available.

Checking hasattr(data, "nbytes") and displaying the size only when it is available seems reasonable to me. This way you also avoid custom behavior that depends on library versions.
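A minimal sketch of that guard (the dict and key names are illustrative):

if hasattr(data, "nbytes"):  # numpy arrays, zarr arrays, and h5py >= 3.0 Datasets have this
    backend_info_dict["Array size"] = data.nbytes
# otherwise simply omit the entry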

Comment on lines 793 to 810
if isinstance(array, h5py.Dataset):
    hdf5_dataset = array
    chunks = hdf5_dataset.chunks
    compression = hdf5_dataset.compression
    uncompressed_size = hdf5_dataset.nbytes
    compression_opts = hdf5_dataset.compression_opts
    compressed_size = hdf5_dataset.id.get_storage_size()
    compression_ratio = uncompressed_size / compressed_size if compressed_size != 0 else "undefined"

    head = "HDF5 Dataset"
    hdf5_info_dict = {"chunks": chunks, "compression": compression, "compression_opts": compression_opts,
                      "compression_ratio": compression_ratio}
    backend_info_dict = {**basic_array_info_dict, **hdf5_info_dict}

if hasattr(array, "store") and hasattr(array, "shape"):  # Duck typing for Zarr array
    head = "Zarr Array"
    zarr_info_dict = {k: v for k, v in array.info_items()}
    backend_info_dict = zarr_info_dict
Contributor

It would be nice to avoid having logic that is specific to a particular I/O backend in the Container. The reason is that this inhibits implementing backends in a self-contained manner as stand-alone packages, and requires updating many places throughout HDMF.

The checks for h5py.Dataset and Zarr.array are really only relevant when the data was read from a file. To help disentangle the dependencies, I'm wondering whether we could do the following:

  1. Add a static method generate_dataset_html to HDMFIO that would then need to be implemented by HDF5IO and ZarrIO
  2. In the Container you could then do something like:
read_io = self.get_read_io()  # if the Container was read from file, this will give you the IO object that read it
if read_io is not None:
    html_repr = read_io.generate_dataset_html(my_dataset)
else:
    # The file was not read from disk so the dataset should be numpy array or a list


@h-mayorquin (Author)

It would be nice to avoid having logic that is specific to a particular I/O backend in the Container. The reason is that this inhibits implementing backends in a self-contained manner as stand-alone packages, and requires updating many places throughout HDMF.

I see, yes, it would be nice if we could decouple this. On the other hand, right now, if someone implements their own backend they will just lose the dataset representation, which is not critical.

The checks for h5py.Dataset and Zarr.array are really only relevant when a file was read from file.

Is it? Right now in the tests we are passing an HDF5 dataset as data, without writing the nwbfile, to test the display. Is this not intended?

This proposal seems very good:

read_io = self.get_read_io()  # if the Container was read from file, this will give you the IO object that read it
if read_io is not None:
    html_repr = read_io.generate_dataset_html(my_dataset)
else:
    # The file was not read from disk so the dataset should be numpy array or a list

I see two downsides:

  1. Missing the extensive representation for in-memory files (it is nice to know what you will write!).
  2. Fragmenting the code base for HTML representations.

Is there any other backend in the works right now? If not, maybe we can do it the simpler way and add the complexity once we are closer to needing it?

Contributor

I think it would be worth giving this a try here. If it works for HDF5, we can easily move the Zarr logic to hdmf-zarr. I don't think it should be too hard to make this work right now, but these things tend to get hard to change later on.

@h-mayorquin (Author)

What about handling in-memory objects as well?

@oruebel (Contributor) commented May 2, 2024

In-memory-only objects (i.e., numpy arrays and lists) can be handled here in the Container class.

Contributor

We can then follow your approach in another PR to add backend related information extracted through the DataIO objects.

Could you point me to the PR you are referring to? I'm not sure what role DataIO plays in this PR.

@h-mayorquin (Author)

I mean that we can implement the following strategy in another PR to add backend-specific information:

read_io = self.get_read_io()  # if the Container was read from file, this will give you the IO object that read it
if read_io is not None:
    html_repr = read_io.generate_dataset_html(my_dataset)
else:
    # The file was not read from disk so the dataset should be numpy array or a list

@oruebel (Contributor) commented May 3, 2024

Yes, that looks good 👍. Just to avoid confusion: read_io is an instance of HDMFIO (i.e., HDF5IO or ZarrIO), not DataIO. To implement the logic we would then need to:

  1. Add HDMFIO.generate_dataset_html(dataset) which would implement just a minimalist representation
  2. Implement HDF5IO.generate_dataset_html(h5py.Dataset) to represent h5py.Dataset
  3. In a separate PR on hdmf_zarr implement ZarrIO.generate_dataset_html(Zarr.array)

To simplify this implementation and generate consistent representations, we could make a utility function that takes information about a dataset (e.g., name, shape, data type) as input and generates the HTML representation, such that the individual generate_dataset_html methods on the I/O backends would just collect the information from the dataset and use the utility function to generate the actual HTML.
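A rough sketch of that layering (the utility function name is hypothetical; the class and method names follow the steps above):

def generate_array_html_repr(head, info):
    # Shared utility: render the collected dataset info as a consistent HTML table
    rows = "".join(f"<tr><th>{key}</th><td>{value}</td></tr>" for key, value in info.items())
    return f"<b>{head}</b><table>{rows}</table>"

class HDMFIO:
    @staticmethod
    def generate_dataset_html(dataset):
        # Minimalist default for backends that do not override this method
        info = {"Data type": getattr(dataset, "dtype", type(dataset).__name__),
                "Shape": getattr(dataset, "shape", "unknown")}
        return generate_array_html_repr("Dataset", info)

class HDF5IO(HDMFIO):
    @staticmethod
    def generate_dataset_html(dataset):
        # Collect h5py-specific details, then delegate rendering to the shared utility
        info = {"Data type": dataset.dtype, "Shape": dataset.shape,
                "Chunks": dataset.chunks, "Compression": dataset.compression}
        return generate_array_html_repr("HDF5 Dataset", info)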

@h-mayorquin (Author)

Yes, that looks good 👍. Just to avoid confusion: read_io is an instance of HDMFIO (i.e., HDF5IO or ZarrIO), not DataIO. To implement the logic we would then need to:

Yes, I realized afterwards that I was confusing these two objects.

Contributor

Sounds good. Thanks for your hard work on this PR and the fruitful discussion.

@h-mayorquin (Author)

@stephprince

I'm not sure if there's another way to access that information or if we would just want to optionally display it if available.

It can be estimated from the dtype and the number of elements. I will do that when the attribute does not exist.
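A minimal sketch of that fallback (the helper name is made up):

import math

def dataset_size_in_bytes(dataset):
    if hasattr(dataset, "nbytes"):  # h5py >= 3.0, numpy, zarr
        return dataset.nbytes
    # h5py 2.x Datasets still expose .shape and .dtype, so estimate instead
    return math.prod(dataset.shape) * dataset.dtype.itemsize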

Labels
category: enhancement (improvements of code or code behavior)

Projects
None yet

4 participants