[Bug]: export fails to correctly save units after adding columns #179

Open
alejoe91 opened this issue Mar 27, 2024 · 7 comments

alejoe91 (Collaborator)

What happened?

When using io.export after appending some unit columns, the resulting NWB-Zarr file contains only the newly added columns and is missing the original ones (including spike_times). This makes the exported NWB file invalid.

Steps to Reproduce

from hdmf_zarr.nwb import NWBZarrIO

# nwb_input_path, nwb_output_path, and some_df_with_data are defined beforehand
with NWBZarrIO(str(nwb_input_path), "r") as read_io:
    nwbfile = read_io.read()

    # append one unit column per DataFrame column
    for col in some_df_with_data.columns:
        nwbfile.add_unit_column(
            name=col,
            description=col,
            data=some_df_with_data[col].values
        )

    with NWBZarrIO(str(nwb_output_path), "w") as export_io:
        export_io.export(src_io=read_io, nwbfile=nwbfile)

# loading the exported NWB zarr folder fails
io = NWBZarrIO(str(nwb_output_path), "r")
nwbfile = io.read()

Traceback

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File /opt/conda/lib/python3.9/site-packages/hdmf_zarr/backend.py:729, in ZarrIO.resolve_ref(self, zarr_ref)
    728 try:
--> 729     target_zarr_obj = target_zarr_obj[object_path]
    730 except Exception:

File /opt/conda/lib/python3.9/site-packages/zarr/hierarchy.py:500, in Group.__getitem__(self, item)
    499 else:
--> 500     raise KeyError(item)

KeyError: '/units/spike_times'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
Cell In[71], line 2
      1 io = NWBZarrIO(str(nwb_output_path), "r")
----> 2 nwbfile = io.read()

File /opt/conda/lib/python3.9/site-packages/hdmf/utils.py:668, in docval.<locals>.dec.<locals>.func_call(*args, **kwargs)
    666 def func_call(*args, **kwargs):
    667     pargs = _check_args(args, kwargs)
--> 668     return func(args[0], **pargs)

File /opt/conda/lib/python3.9/site-packages/hdmf/backends/io.py:56, in HDMFIO.read(self, **kwargs)
     53 @docval(returns='the Container object that was read in', rtype=Container)
     54 def read(self, **kwargs):
     55     """Read a container from the IO source."""
---> 56     f_builder = self.read_builder()
     57     if all(len(v) == 0 for v in f_builder.values()):
     58         # TODO also check that the keys are appropriate. print a better error message
     59         raise UnsupportedOperation('Cannot build data. There are no values.')

File /opt/conda/lib/python3.9/site-packages/hdmf/utils.py:668, in docval.<locals>.dec.<locals>.func_call(*args, **kwargs)
    666 def func_call(*args, **kwargs):
    667     pargs = _check_args(args, kwargs)
--> 668     return func(args[0], **pargs)

File /opt/conda/lib/python3.9/site-packages/hdmf_zarr/backend.py:1320, in ZarrIO.read_builder(self)
   1318 @docval(returns='a GroupBuilder representing the NWB Dataset', rtype='GroupBuilder')
   1319 def read_builder(self):
-> 1320     f_builder = self.__read_group(self.__file, ROOT_NAME)
   1321     return f_builder

File /opt/conda/lib/python3.9/site-packages/hdmf_zarr/backend.py:1385, in ZarrIO.__read_group(self, zarr_obj, name)
   1383 # read sub groups
   1384 for sub_name, sub_group in zarr_obj.groups():
-> 1385     sub_builder = self.__read_group(sub_group, sub_name)
   1386     ret.set_group(sub_builder)
   1388 # read sub datasets

File /opt/conda/lib/python3.9/site-packages/hdmf_zarr/backend.py:1394, in ZarrIO.__read_group(self, zarr_obj, name)
   1391     ret.set_dataset(sub_builder)
   1393 # read the links
-> 1394 self.__read_links(zarr_obj=zarr_obj, parent=ret)
   1396 self._written_builders.set_written(ret)  # record that the builder has been written
   1397 self.__set_built(zarr_obj, ret)

File /opt/conda/lib/python3.9/site-packages/hdmf_zarr/backend.py:1418, in ZarrIO.__read_links(self, zarr_obj, parent)
   1416     builder = self.__read_group(target_zarr_obj, target_name)
   1417 else:
-> 1418     builder = self.__read_dataset(target_zarr_obj, target_name)
   1419 link_builder = LinkBuilder(builder=builder, name=link_name, source=self.source)
   1420 link_builder.location = os.path.join(parent.location, parent.name)

File /opt/conda/lib/python3.9/site-packages/hdmf_zarr/backend.py:1439, in ZarrIO.__read_dataset(self, zarr_obj, name)
   1436 else:
   1437     raise ValueError("Dataset missing zarr_dtype: " + str(name) + "   " + str(zarr_obj))
-> 1439 kwargs = {"attributes": self.__read_attrs(zarr_obj),
   1440           "dtype": zarr_dtype,
   1441           "maxshape": zarr_obj.shape,
   1442           "chunks": not (zarr_obj.shape == zarr_obj.chunks),
   1443           "source": self.source}
   1444 dtype = kwargs['dtype']
   1446 # By default, use the zarr.core.Array as data for lazy data load

File /opt/conda/lib/python3.9/site-packages/hdmf_zarr/backend.py:1488, in ZarrIO.__read_attrs(self, zarr_obj)
   1486 if isinstance(v, dict) and 'zarr_dtype' in v:
   1487     if v['zarr_dtype'] == 'object':
-> 1488         target_name, target_zarr_obj = self.resolve_ref(v['value'])
   1489         if isinstance(target_zarr_obj, zarr.hierarchy.Group):
   1490             ret[k] = self.__read_group(target_zarr_obj, target_name)

File /opt/conda/lib/python3.9/site-packages/hdmf_zarr/backend.py:731, in ZarrIO.resolve_ref(self, zarr_ref)
    729         target_zarr_obj = target_zarr_obj[object_path]
    730     except Exception:
--> 731         raise ValueError("Found bad link to object %s in file %s" % (object_path, source_file))
    732 # Return the create path
    733 return target_name, target_zarr_obj

ValueError: Found bad link to object /units/spike_times in file /root/capsule/results/ecephys_713557_2024-03-21_13-32-52_sorted_2024-03-25_23-56-33/nwb/ecephys_713557_2024-03-21_13-32-52_experiment1_recording1.nwb

Operating System

Linux

Python Executable

Python

Python Version

3.9

Package Versions

pynwb==2.6.0
hdmf-zarr==0.6.0

oruebel (Contributor) commented Mar 27, 2024

For a quick workaround, I think setting write_args={'link_data': False} should force export to copy all the data, rather than trying to link data, i.e., do:

export_io.export(src_io=read_io, nwbfile=nwbfile,  write_args={'link_data': False})
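
Applied to the snippet from the original report, a minimal sketch of the workaround would look roughly like this (nwb_input_path, nwb_output_path, and the added columns are as in that snippet):

from hdmf_zarr.nwb import NWBZarrIO

with NWBZarrIO(str(nwb_input_path), "r") as read_io:
    nwbfile = read_io.read()
    # ... add unit columns as in the original report ...
    with NWBZarrIO(str(nwb_output_path), "w") as export_io:
        # copy datasets into the new store instead of linking back to the source file
        export_io.export(src_io=read_io, nwbfile=nwbfile, write_args={'link_data': False})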

oruebel (Contributor) commented Mar 27, 2024

I'm surprised that creating links fails in this way. @mavaylon1 could you take a look at this?

oruebel added the category: bug and priority: medium labels on Mar 27, 2024
oruebel added this to the Next Release milestone on Mar 27, 2024
mavaylon1 self-assigned this on Apr 9, 2024
mavaylon1 (Contributor) commented

Code to more easily reproduce:

from datetime import datetime
from uuid import uuid4

import numpy as np
from dateutil.tz import tzlocal

from pynwb import NWBHDF5IO, NWBFile
from pynwb.ecephys import LFP, ElectricalSeries

from hdmf_zarr.nwb import NWBZarrIO

from hdmf.common import VectorData, DynamicTable


nwbfile = NWBFile(
    session_description="my first synthetic recording",
    identifier=str(uuid4()),
    session_start_time=datetime.now(tzlocal()),
    experimenter=[
        "Baggins, Bilbo",
    ],
    lab="Bag End Laboratory",
    institution="University of Middle Earth at the Shire",
    experiment_description="I went on an adventure to reclaim vast treasures.",
    session_id="LONELYMTN001",
)

nwbfile.add_unit_column(name="quality", description="sorting quality")

firing_rate = 20
n_units = 10
res = 1000
duration = 20
for n_units_per_shank in range(n_units):
    spike_times = np.where(np.random.rand((res * duration)) < (firing_rate / res))[0] / res
    nwbfile.add_unit(spike_times=spike_times, quality="good")

# breakpoint()

with NWBHDF5IO("ecephys_tutorial.nwb", "w") as io:
    io.write(nwbfile)

# ########
# #Convert
# ########
filename = "ecephys_tutorial.nwb"
zarr_filename = "ecephys_tutorial.nwb.zarr"
with NWBHDF5IO(filename, 'r', load_namespaces=False) as read_io:  # Create HDF5 IO object for read
    with NWBZarrIO(zarr_filename, mode='w') as export_io:         # Create Zarr IO object for write
        export_io.export(src_io=read_io, write_args=dict(link_data=False))   # Export from HDF5 to Zarr

col1 = VectorData(
    name='col1',
    description='column #1',
    data=list(range(0,10)),
)
some_df_with_data = DynamicTable(name='foo', description='foo', columns=[col1])

nwb_output_path = "exported_ecephys_tutorial.nwb.zarr"
with NWBZarrIO(zarr_filename, "r") as read_io:
    nwbfile = read_io.read()
    nwbfile.add_unit_column(
        name='col',
        description='col',
        data=col1.data
    )

    with NWBZarrIO(str(nwb_output_path), "w") as export_io:
        export_io.export(src_io=read_io, nwbfile=nwbfile)

# loading the exported NWB zarr folder fails
io = NWBZarrIO(str(nwb_output_path), "r")
nwbfile = io.read()

Working on this.

mavaylon1 (Contributor) commented

Update: I found that during export, the unit columns that already existed are stored as links.

io = NWBZarrIO(str(nwb_output_path), "r")
io.file['units'].attrs['zarr_link']  # the pre-existing columns show up here as link entries

The new column is stored as a zarr array.
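
A minimal sketch to verify this directly on the exported store (assuming the zarr package and the exported_ecephys_tutorial.nwb.zarr store from the reproduction above):

import zarr

# Open the exported store with plain zarr and compare the units group's
# link entries against its actual array members.
root = zarr.open("exported_ecephys_tutorial.nwb.zarr", mode="r")
units = root["units"]
print("linked members:", units.attrs.get("zarr_link", []))  # pre-existing columns, e.g. spike_times
print("array members: ", list(units.array_keys()))          # only the newly added column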

I'll have a fix after talking with the team.

oruebel (Contributor) commented Apr 9, 2024

If you want to make a full copy during export you could set 'write_args={'link_data': False}' to enforce that datasets are copied by default instead of linked to; see https://pynwb.readthedocs.io/en/stable/export.html#my-nwb-file-contains-links-to-datasets-in-other-hdf5-files-how-do-i-create-a-new-nwb-file-with-copies-of-the-datasets

That being said, I believe that creating the links should work. One thing to check may be whether an external link points to a dataset that in turn points to another dataset, and whether the lookup then uses the wrong file. Just a hunch.

rly added the topic: storage label on Apr 11, 2024
mavaylon1 (Contributor) commented

Update: The link for the spike times points to the wrong file. Next week is NWB Developer Days; I will push a fix the week after.

mavaylon1 (Contributor) commented May 14, 2024

The issue is as follows (@oruebel, take a gander at what I found). The biggest detail to note is that when we export, we are never supposed to create links; we are only supposed to preserve existing ones.

Export is not supposed to create links to the prior file; rather, it is only meant to offer the option of preserving existing links. This means that if File A has links to some File C, then when we export File A to File B, File B will also have links to File C.

Problem 1: HDMF-Zarr is missing the logic that HDMF has in write_dataset, namely the conditionals that handle links during export.

Problem 2: When links are created (in the cases where they are supposed to be created), they use absolute paths when they should use relative paths (see the sketch below the problem descriptions). Both can break when files are moved, but absolute paths will always break.

Problem 3: When we create a reference, the source path is shorthanded as "." to represent the file it currently lives in. We need to add logic to resolve_ref to handle this for links.
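
For Problem 2, an illustrative sketch (hypothetical paths, not taken from the issue) of why relative link sources are preferred: they remain valid as long as the two stores are moved together.

import os

exported_store = "/data/results/exported_ecephys_tutorial.nwb.zarr"  # store holding the link
link_target = "/data/results/ecephys_tutorial.nwb.zarr"              # store the link points to

absolute_source = link_target  # breaks as soon as /data/results is renamed or moved
relative_source = os.path.relpath(link_target, start=os.path.dirname(exported_store))
print(absolute_source)  # /data/results/ecephys_tutorial.nwb.zarr
print(relative_source)  # ecephys_tutorial.nwb.zarr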

What to do while this is being fixed:

  1. Always use write_args={'link_data': False}.

I will divide the work into stages:

Stage 1 (PR 1): Add the updated export logic to write_dataset

Stage 2 (PR 2): Add logic to resolve_ref to resolve references in links

Stage 3 (PR 3): Edge-case test suite
