[Bug]: export fails to correctly save units after adding columns #179

Open
alejoe91 opened this issue Mar 27, 2024 · 7 comments

alejoe91 (Collaborator)

What happened?

When using io.export after appending some unit columns, the resulting NWB-Zarr file contains only the newly added columns and is missing the original ones (including spike_times). This makes the exported NWB file invalid.

Steps to Reproduce

from hdmf_zarr.nwb import NWBZarrIO

# nwb_input_path, nwb_output_path, and some_df_with_data are defined beforehand
with NWBZarrIO(str(nwb_input_path), "r") as read_io:
    nwbfile = read_io.read()

    # append one unit column per DataFrame column
    for col in some_df_with_data.columns:
        nwbfile.add_unit_column(
            name=col,
            description=col,
            data=some_df_with_data[col].values
        )

    with NWBZarrIO(str(nwb_output_path), "w") as export_io:
        export_io.export(src_io=read_io, nwbfile=nwbfile)

# loading the exported NWB zarr folder fails
io = NWBZarrIO(str(nwb_output_path), "r")
nwbfile = io.read()

Traceback

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File /opt/conda/lib/python3.9/site-packages/hdmf_zarr/backend.py:729, in ZarrIO.resolve_ref(self, zarr_ref)
    728 try:
--> 729     target_zarr_obj = target_zarr_obj[object_path]
    730 except Exception:

File /opt/conda/lib/python3.9/site-packages/zarr/hierarchy.py:500, in Group.__getitem__(self, item)
    499 else:
--> 500     raise KeyError(item)

KeyError: '/units/spike_times'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
Cell In[71], line 2
      1 io = NWBZarrIO(str(nwb_output_path), "r")
----> 2 nwbfile = io.read()

File /opt/conda/lib/python3.9/site-packages/hdmf/utils.py:668, in docval.<locals>.dec.<locals>.func_call(*args, **kwargs)
    666 def func_call(*args, **kwargs):
    667     pargs = _check_args(args, kwargs)
--> 668     return func(args[0], **pargs)

File /opt/conda/lib/python3.9/site-packages/hdmf/backends/io.py:56, in HDMFIO.read(self, **kwargs)
     53 @docval(returns='the Container object that was read in', rtype=Container)
     54 def read(self, **kwargs):
     55     """Read a container from the IO source."""
---> 56     f_builder = self.read_builder()
     57     if all(len(v) == 0 for v in f_builder.values()):
     58         # TODO also check that the keys are appropriate. print a better error message
     59         raise UnsupportedOperation('Cannot build data. There are no values.')

File /opt/conda/lib/python3.9/site-packages/hdmf/utils.py:668, in docval.<locals>.dec.<locals>.func_call(*args, **kwargs)
    666 def func_call(*args, **kwargs):
    667     pargs = _check_args(args, kwargs)
--> 668     return func(args[0], **pargs)

File /opt/conda/lib/python3.9/site-packages/hdmf_zarr/backend.py:1320, in ZarrIO.read_builder(self)
   1318 @docval(returns='a GroupBuilder representing the NWB Dataset', rtype='GroupBuilder')
   1319 def read_builder(self):
-> 1320     f_builder = self.__read_group(self.__file, ROOT_NAME)
   1321     return f_builder

File /opt/conda/lib/python3.9/site-packages/hdmf_zarr/backend.py:1385, in ZarrIO.__read_group(self, zarr_obj, name)
   1383 # read sub groups
   1384 for sub_name, sub_group in zarr_obj.groups():
-> 1385     sub_builder = self.__read_group(sub_group, sub_name)
   1386     ret.set_group(sub_builder)
   1388 # read sub datasets

File /opt/conda/lib/python3.9/site-packages/hdmf_zarr/backend.py:1394, in ZarrIO.__read_group(self, zarr_obj, name)
   1391     ret.set_dataset(sub_builder)
   1393 # read the links
-> 1394 self.__read_links(zarr_obj=zarr_obj, parent=ret)
   1396 self._written_builders.set_written(ret)  # record that the builder has been written
   1397 self.__set_built(zarr_obj, ret)

File /opt/conda/lib/python3.9/site-packages/hdmf_zarr/backend.py:1418, in ZarrIO.__read_links(self, zarr_obj, parent)
   1416     builder = self.__read_group(target_zarr_obj, target_name)
   1417 else:
-> 1418     builder = self.__read_dataset(target_zarr_obj, target_name)
   1419 link_builder = LinkBuilder(builder=builder, name=link_name, source=self.source)
   1420 link_builder.location = os.path.join(parent.location, parent.name)

File /opt/conda/lib/python3.9/site-packages/hdmf_zarr/backend.py:1439, in ZarrIO.__read_dataset(self, zarr_obj, name)
   1436 else:
   1437     raise ValueError("Dataset missing zarr_dtype: " + str(name) + "   " + str(zarr_obj))
-> 1439 kwargs = {"attributes": self.__read_attrs(zarr_obj),
   1440           "dtype": zarr_dtype,
   1441           "maxshape": zarr_obj.shape,
   1442           "chunks": not (zarr_obj.shape == zarr_obj.chunks),
   1443           "source": self.source}
   1444 dtype = kwargs['dtype']
   1446 # By default, use the zarr.core.Array as data for lazy data load

File /opt/conda/lib/python3.9/site-packages/hdmf_zarr/backend.py:1488, in ZarrIO.__read_attrs(self, zarr_obj)
   1486 if isinstance(v, dict) and 'zarr_dtype' in v:
   1487     if v['zarr_dtype'] == 'object':
-> 1488         target_name, target_zarr_obj = self.resolve_ref(v['value'])
   1489         if isinstance(target_zarr_obj, zarr.hierarchy.Group):
   1490             ret[k] = self.__read_group(target_zarr_obj, target_name)

File /opt/conda/lib/python3.9/site-packages/hdmf_zarr/backend.py:731, in ZarrIO.resolve_ref(self, zarr_ref)
    729         target_zarr_obj = target_zarr_obj[object_path]
    730     except Exception:
--> 731         raise ValueError("Found bad link to object %s in file %s" % (object_path, source_file))
    732 # Return the create path
    733 return target_name, target_zarr_obj

ValueError: Found bad link to object /units/spike_times in file /root/capsule/results/ecephys_713557_2024-03-21_13-32-52_sorted_2024-03-25_23-56-33/nwb/ecephys_713557_2024-03-21_13-32-52_experiment1_recording1.nwb

Operating System

Linux

Python Executable

Python

Python Version

3.9

Package Versions

pynwb==2.6.0
hdmf-zarr==0.6.0

oruebel (Contributor) commented Mar 27, 2024

For a quick workaround, I think setting write_args={'link_data': False} should force export to copy all the data, rather than trying to link data, i.e., do:

export_io.export(src_io=read_io, nwbfile=nwbfile,  write_args={'link_data': False})
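
Applied to the snippet from the original report, a minimal sketch of the workaround would look roughly like this (nwb_input_path, nwb_output_path, and the added columns are as in that snippet):

from hdmf_zarr.nwb import NWBZarrIO

with NWBZarrIO(str(nwb_input_path), "r") as read_io:
    nwbfile = read_io.read()
    # ... add unit columns as in the original report ...
    with NWBZarrIO(str(nwb_output_path), "w") as export_io:
        # copy datasets into the new store instead of linking back to the source file
        export_io.export(src_io=read_io, nwbfile=nwbfile, write_args={'link_data': False})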

oruebel (Contributor) commented Mar 27, 2024

I'm surprised that creating links fails in this way. @mavaylon1 could you take a look at this?

oruebel added the category: bug and priority: medium labels on Mar 27, 2024
oruebel added this to the Next Release milestone on Mar 27, 2024
mavaylon1 self-assigned this on Apr 9, 2024
mavaylon1 (Contributor) commented

Code to more easily reproduce:

from datetime import datetime
from uuid import uuid4

import numpy as np
from dateutil.tz import tzlocal

from pynwb import NWBHDF5IO, NWBFile
from pynwb.ecephys import LFP, ElectricalSeries

from hdmf_zarr.nwb import NWBZarrIO

from hdmf.common import VectorData, DynamicTable


nwbfile = NWBFile(
    session_description="my first synthetic recording",
    identifier=str(uuid4()),
    session_start_time=datetime.now(tzlocal()),
    experimenter=[
        "Baggins, Bilbo",
    ],
    lab="Bag End Laboratory",
    institution="University of Middle Earth at the Shire",
    experiment_description="I went on an adventure to reclaim vast treasures.",
    session_id="LONELYMTN001",
)

nwbfile.add_unit_column(name="quality", description="sorting quality")

firing_rate = 20
n_units = 10
res = 1000
duration = 20
for n_units_per_shank in range(n_units):
    spike_times = np.where(np.random.rand((res * duration)) < (firing_rate / res))[0] / res
    nwbfile.add_unit(spike_times=spike_times, quality="good")

# breakpoint()

with NWBHDF5IO("ecephys_tutorial.nwb", "w") as io:
    io.write(nwbfile)

# ########
# #Convert
# ########
filename = "ecephys_tutorial.nwb"
zarr_filename = "ecephys_tutorial.nwb.zarr"
with NWBHDF5IO(filename, 'r', load_namespaces=False) as read_io:  # Create HDF5 IO object for read
    with NWBZarrIO(zarr_filename, mode='w') as export_io:         # Create Zarr IO object for write
        export_io.export(src_io=read_io, write_args=dict(link_data=False))   # Export from HDF5 to Zarr

col1 = VectorData(
    name='col1',
    description='column #1',
    data=list(range(0,10)),
)
some_df_with_data = DynamicTable(name='foo', description='foo', columns=[col1])

nwb_output_path = "exported_ecephys_tutorial.nwb.zarr"
with NWBZarrIO(zarr_filename, "r") as read_io:
    nwbfile = read_io.read()
    nwbfile.add_unit_column(
        name='col',
        description='col',
        data=col1.data
    )

    with NWBZarrIO(str(nwb_output_path), "w") as export_io:
        export_io.export(src_io=read_io, nwbfile=nwbfile)

# loading the exported NWB zarr folder fails
io = NWBZarrIO(str(nwb_output_path), "r")
nwbfile = io.read()

Working on this.

mavaylon1 (Contributor) commented

Update: I found that during export, the unit columns that already existed are stored as links.

io = NWBZarrIO(str(nwb_output_path), "r")
io.file['units'].attrs['zarr_link']  # the pre-existing columns show up here as link entries

The new column is stored as a zarr array.
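
A minimal sketch to verify this directly on the exported store (assuming the zarr package and the exported_ecephys_tutorial.nwb.zarr store from the reproduction above):

import zarr

# Open the exported store with plain zarr and compare the units group's
# link entries against its actual array members.
root = zarr.open("exported_ecephys_tutorial.nwb.zarr", mode="r")
units = root["units"]
print("linked members:", units.attrs.get("zarr_link", []))  # pre-existing columns, e.g. spike_times
print("array members: ", list(units.array_keys()))          # only the newly added column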

I'll have a fix after talking with the team.

oruebel (Contributor) commented Apr 9, 2024

If you want to make a full copy during export you could set 'write_args={'link_data': False}' to enforce that datasets are copied by default instead of linked to; see https://pynwb.readthedocs.io/en/stable/export.html#my-nwb-file-contains-links-to-datasets-in-other-hdf5-files-how-do-i-create-a-new-nwb-file-with-copies-of-the-datasets

That being said, I believe that creating the links should work. One thing to check may be whether an external link points to a dataset that in turn points to another dataset, and whether the lookup then uses the wrong file. Just a hunch.

rly added the topic: storage label on Apr 11, 2024
mavaylon1 (Contributor) commented

Update: The link for the spike times points to the wrong file. Next week is NWB Developer Days; I will push a fix the week after.

mavaylon1 (Contributor) commented May 14, 2024

The issue is as follows (@oruebel, take a gander at what I found). The biggest detail to note is that when we export, we are never supposed to create links; we are only supposed to preserve existing ones.

Export is not supposed to create links to the prior file; rather, it is only meant to offer the option of preserving existing links. This means that if File A has links to some File C, then when we export File A to File B, File B will also have links to File C.

Problem 1: HDMF-Zarr is missing the logic that HDMF has in write_dataset, namely the conditionals that handle links during export.

Problem 2: When links are created (in the cases where they are supposed to be created), they use absolute paths when they should use relative paths (see the sketch below the problem descriptions). Both can break when files are moved, but absolute paths will always break.

Problem 3: When we create a reference, the source path is shorthanded as "." to represent the file it currently lives in. We need to add logic to resolve_ref to handle this for links.
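
For Problem 2, an illustrative sketch (hypothetical paths, not taken from the issue) of why relative link sources are preferred: they remain valid as long as the two stores are moved together.

import os

exported_store = "/data/results/exported_ecephys_tutorial.nwb.zarr"  # store holding the link
link_target = "/data/results/ecephys_tutorial.nwb.zarr"              # store the link points to

absolute_source = link_target  # breaks as soon as /data/results is renamed or moved
relative_source = os.path.relpath(link_target, start=os.path.dirname(exported_store))
print(absolute_source)  # /data/results/ecephys_tutorial.nwb.zarr
print(relative_source)  # ecephys_tutorial.nwb.zarr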

What to do while this is being fixed:

  1. Always use write_args={'link_data': False}.

I will divide the work into stages:

Stage 1 (PR 1): Add the updated export logic to write_dataset

Stage 2 (PR 2): Add logic to resolve_ref to resolve references in links

Stage 3 (PR 3): Edge-case test suite
