Adds test for reading data with unordered position labels #124

ieivanov · 2023-07-27T00:06:50Z

@henrypinkard I am finding that ndtiff sorts the axis labels (now using SortedSets), but the data is returned in the acquisition order, causing the test here to fail.

I think this is something you alluded to in micro-manager/pycro-manager#575. This issue will also come up if acquiring, for example, channels in the ['RFP', 'GFP'] order - the axes will read ['GFP', 'RFP'], but the channel with index 0 will correspond to RFP.

I think it will be costly to sort the data, so my suggestion would be to also remove the sorting of the axes - i.e. use set rather than SortedSet. This will largely preserve the current behavior and will even speed up the code a little bit. I also think it's right not to sort the data - if a user acquired channels or z slices out of alphabetical / numeric order it would be on them to reorder the data rather than expect the data reader to do it.

Let me know what you think. @carlkesselman is welcome to chime in as well.

ieivanov · 2023-07-27T00:30:33Z

P. S. I mean data returned by dataset.as_array()
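The mismatch described above can be sketched with plain Python lists. This is only an illustration, not the ndtiff API: the variable names are assumptions, and only the ['RFP', 'GFP'] acquisition order comes from the comment.

```python
# Channels in the order they were acquired (the example from the comment above).
acquired = ["RFP", "GFP"]

# Stand-in for the image data: one entry per channel, stored in acquisition order.
data_in_acquisition_order = [f"image of {name}" for name in acquired]

# Axis labels as a sorted set would report them.
sorted_labels = sorted(set(acquired))

# Index 0 of the labels says GFP, but index 0 of the data holds RFP.
label_at_0 = sorted_labels[0]
data_at_0 = data_in_acquisition_order[0]
mismatch = label_at_0 not in data_at_0
```

Either sorting the data or leaving the labels unsorted would remove the mismatch; the sketch only shows why doing one without the other is confusing.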

carlkesselman · 2023-07-27T01:29:17Z

Hi I don't think I have an opinion. My goal in using SortedSets was to preserve the existing behavior at significantly improved performance for large index sets. Carl


henrypinkard · 2023-07-28T15:04:47Z

> This issue will also come up if acquiring, for example, channels in the ['RFP', 'GFP'] order - the axes will read ['GFP', 'RFP'], but the channel with index 0 will correspond to RFP.

Do you mean the index 0 when calling dataset.as_array()?

> I think it will be costly to sort the data, so my suggestion would be to also remove the sorting of the axes - i.e. use set rather than SortedSet.

What situations are you thinking about when you say sorting will be costly?

> I also think it's right not to sort the data - if a user acquired channels or z slices out of alphabetical / numeric order it would be on them to reorder the data rather than expect the data reader to do it.

I think this is reasonable for String-valued axes, but probably not numeric. A concrete example of this is an explore acquisition. You take a z stack going from index 0 to index 9. Then you decide you want to see what's above the sample and image indices -8 to -1. When you call as_array(), you'd expect it to be spatially ordered, and having to do that yourself is kinda confusing.

Thoughts?

carlkesselman · 2023-07-28T15:16:42Z

This is actually the interesting case… you have sequences of ascending values, and you end up sorting. Using the SortedSet minimizes the overhead in this case, but this will still be slightly more expensive than just doing the list append. However, based on my experience with the current implementation, the performance is tolerable, as you are just doing an insert into the sorted structure, not re-sorting every time, which was the prior behavior and really didn't work. So in my mind, this is a question of semantics and not performance. You do have an underlying assumption in your example, which is that numeric indexes are numbered in either spatial or temporal order. Perhaps the documentation should state this explicitly? Carl
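The difference between re-sorting on every addition and inserting into an already sorted structure can be sketched as follows. This uses the standard-library bisect module as a stand-in; it is not how SortedSet is implemented, and both function names are hypothetical.

```python
import bisect

def resort_every_time(values):
    """Append each new index, then re-sort the whole list (the costly pattern)."""
    ordered = []
    for v in values:
        if v not in ordered:
            ordered.append(v)
            ordered.sort()
    return ordered

def insert_in_place(values):
    """Insert each new index at its sorted position, without a full re-sort."""
    ordered = []
    for v in values:
        pos = bisect.bisect_left(ordered, v)
        if pos == len(ordered) or ordered[pos] != v:
            ordered.insert(pos, v)
    return ordered

# z indices from the explore-acquisition example: 0..9 first, then -8..-1.
z_indices = list(range(10)) + list(range(-8, 0))
result = insert_in_place(z_indices)
```

Both functions produce the same ordering; only the amount of work per added index differs, which is why the choice here is about semantics more than performance.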


ieivanov · 2023-07-28T16:17:19Z

> Do you mean the index 0 when calling dataset.as_array()?

Correct

> I think this is reasonable for String-valued axes, but probably not numeric. A concrete example of this is an explore acquisition. You take a z stack going from index 0 to index 9. Then you decide you want to see what's above the sample and image indices -8 to -1. When you call as_array(), you'd expect it to be spatially ordered, and having to do that yourself is kinda confusing.

In 80623e7 I added a test for reading a dataset as you suggested. Here is the code I used to acquire the data:

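import numpy as np
from pycromanager import Acquisition, Dataset

def fun(image, metadata):
    # Image processor: stamp each image with its acquisition index so that
    # the acquisition order can be recovered from the pixel data later.
    if not hasattr(fun, "idx"):
        fun.idx = 0  # counter stored on the function itself

    image[0, 0] = np.uint16(fun.idx)

    fun.idx += 1

    return image, metadata

# First z stack: indices 0 to 9. Second stack, acquired afterwards: indices -10 to -1.
events1 = [{'axes': {'z': z_idx}, 'z': z_idx} for z_idx in range(10)]
events2 = [{'axes': {'z': z_idx}, 'z': z_idx} for z_idx in range(-10, 0)]

# Acquire both stacks into one dataset, so the z axis is acquired out of numeric order.
with Acquisition(directory=r'Q:\Ivan\testing', name='unordered_z', image_process_fn=fun, show_display=False) as acq:
    acq.acquire(events1)
    acq.acquire(events2)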

Currently Dataset sorts the axes, but not the data (even when using integer-valued axes) when calling dataset.as_array(), causing test_unordered_z_axis to fail. By sorting the data we would be introducing a new feature - even if that was the original intention all along. Further, axis sorting, data sorting, and the dataset.axes property are all undocumented (something I've also planned to work on), so the current behavior is all around unexpected, as Carl also points out.

In order to sort the array that dataset.as_array() returns, we'll need to get a sorting key (something like sorting_key = np.argsort(axis['z'])) and then apply it to the data: dataset.as_array()[sorting_key, ...]. That, I think, will add extra computational cost.

My suggestion is to remove the axis sorting, and keep the data fetching as is. We'll then pass the cost of array sorting to the user, if that's something they want to do. In the case of the example above you'd do dataset.as_array()[np.argsort(dataset.axes['z']), ...]
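The argsort-based reordering can be sketched on a toy NumPy stack that mimics the acquisition above, with pixel [0, 0] holding the acquisition index. The array shape and variable names are assumptions for illustration; no ndtiff or pycromanager calls are used.

```python
import numpy as np

# z values in acquisition order, matching events1 then events2 above.
z_acquired = list(range(10)) + list(range(-10, 0))

# Toy stack: one 2x2 image per z, with pixel [0, 0] holding the acquisition index.
stack = np.zeros((len(z_acquired), 2, 2), dtype=np.uint16)
stack[:, 0, 0] = np.arange(len(z_acquired))

# Sorting key: for each position in sorted z order, the acquisition position.
sorting_key = np.argsort(z_acquired)

# Reorder along the first (z) axis so that the planes are in spatial order.
stack_sorted = stack[sorting_key, ...]
z_sorted = np.asarray(z_acquired)[sorting_key]
```

After reordering, the planes acquired second (acquisition indices 10 to 19) come first, since they hold the negative z values.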

ieivanov · 2023-07-28T16:35:28Z

P.S. It looks like there is an even bigger bug here - in the unordered_z_1 dataset data.as_array()[10] returns all zeros - seems like the negative z axis index is causing problems.

henrypinkard · 2023-07-29T14:56:05Z

At least part of the problem is that this line

https://github.com/micro-manager/NDTiffStorage/blob/80623e77d67ea9e4be751cee6b53724752cbe91f/python/ndtiff/nd_tiff_current.py#L613

Should be changed to:

axes = {key: axes_to_stack[key][block_id[i]] for i, key in enumerate(axes_to_stack.keys())}

(BTW I tried to make this change myself, but couldn't. I could pull the changes locally as you suggested before, but not push. I think there is a box that you can check when making PRs that says something like "allow maintainers to make edits" that would make this easier)

How this works is that the axes use the SortedSets to determine the order, and then when as_array is called it makes the stacked array in this order (lazily, not pulling everything into RAM at once). However, because of this bug it wasn't actually using the order defined in dataset.axes.
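A minimal sketch of that lookup, assuming a toy in-memory dictionary of images instead of the real on-disk index. Only the dictionary comprehension inside read_block mirrors the line proposed above; the storage layout, the helper name read_block, and the eager stacking are assumptions (the library builds the stacked array lazily).

```python
import numpy as np

# Sorted axis values, as dataset.axes would report them (toy values).
axes_to_stack = {"z": [-2, -1, 0, 1], "channel": ["GFP", "RFP"]}

# Toy storage: images keyed by their axis values, filled in acquisition order.
images = {}
acq_idx = 0
for z in [0, 1, -2, -1]:
    for channel in ["RFP", "GFP"]:
        images[(z, channel)] = np.full((2, 2), acq_idx, dtype=np.uint16)
        acq_idx += 1

def read_block(block_id):
    """Translate a block position into axis values, then fetch that image."""
    axes = {key: axes_to_stack[key][block_id[i]]
            for i, key in enumerate(axes_to_stack.keys())}
    return images[(axes["z"], axes["channel"])]

# Build the stacked array in the order defined by axes_to_stack.
shape = tuple(len(v) for v in axes_to_stack.values())
stacked = np.stack([
    np.stack([read_block((zi, ci)) for ci in range(shape[1])])
    for zi in range(shape[0])
])
```

Because each block position is translated through the sorted axis values, position (0, 0) of the stacked array holds the image for z = -2 and GFP, regardless of when it was acquired.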

I don't think there will be any appreciable performance penalty for creating the stacked arrays in different orderings, because the data isn't immediately read into memory. The only sorting-related cost should be sorting of the keys, but I don't think this is a huge concern because when loading the data from disk it's a one-time cost, and Carl's changes seem to have made this fast enough for in-progress acquisitions to be repeatedly run in real time. Maybe there are situations on enormous datasets where this could become a problem, but I would guess this is atypical of how most people use this, so if we do find that it exists I'd advocate for making an option to turn off sorting as needed.

There may be differences in speed based on ordering for actually pulling the pixel data into memory, but this is hardware-dependent (e.g. if sequential images read are in nearby places on the physical disk) and I think beyond the scope of how this class should be set up.

Separately, there's the issue of whether String axes should be sorted or kept in acquisition order. I do think the dataset.axes ordering should match the dataset.as_array() ordering, since this was the original intent. So it's a question of whether to change the sorting behavior for string-valued axes so that they remain in acquisition ordering or go to alphabetical automatically. This is a new enough feature that I think it's fine to pick and document a convention now (and the past behavior was buggy and confusing anyway).
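The two candidate conventions for string-valued axes can be compared with a small helper. The function order_axis_values and its sort_strings parameter are hypothetical and not part of ndtiff; the sketch only makes the choice concrete.

```python
def order_axis_values(values, sort_strings=True):
    """Return unique axis values in a documented order.

    Hypothetical helper, not part of ndtiff: numeric values are always
    sorted; string values are sorted only if sort_strings is True,
    otherwise they keep first-seen (acquisition) order.
    """
    unique = list(dict.fromkeys(values))  # de-duplicate, keep first-seen order
    if all(isinstance(v, str) for v in unique) and not sort_strings:
        return unique
    return sorted(unique)

channels = ["RFP", "GFP", "RFP"]
z_values = [0, 1, -2, -1]

alphabetical = order_axis_values(channels)
acquisition = order_axis_values(channels, sort_strings=False)
numeric = order_axis_values(z_values, sort_strings=False)
```

Whichever convention is picked, the same ordering would need to drive both dataset.axes and dataset.as_array() for the two to stay consistent.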

Great that you are working through all these details and making tests. I think this will be very beneficial to the library's usability.

ieivanov · 2023-08-04T01:29:09Z

axes = {key: axes_to_stack[key][block_id[i]] for i, key in enumerate(axes_to_stack.keys())}

That did the trick!

On the issue of sorting vs not sorting - let's take the path of least resistance and continue sorting the axes and now the data. I agree with you that it's critical that the dataset.axes ordering should match the dataset.as_array() ordering and we've achieved that here.

I think this PR is ready to merge. I saw that you updated the documentation (thanks a lot!). I'll read through it and double-check that this behavior is fully explained.

ieivanov added 2 commits July 26, 2023 14:34

add test dataset

058b1a1

add test for reading data with unordered position labels

e3d963b

add test for reading unordered z axis

80623e7

ieivanov mentioned this pull request Jul 28, 2023

Save HCS position label in NDTiff dataset axes czbiohub-sf/shrimPy#70

Merged

ieivanov mentioned this pull request Jul 31, 2023

Add position labels to ndtiff metadata czbiohub-sf/iohub#166

Merged

ieivanov and others added 3 commits August 3, 2023 18:23

sort as_array() output data according to axes[key]

61d33e4

Co-Authored-By: Henry Pinkard <7969470+henrypinkard@users.noreply.github.com>

Merge remote-tracking branch 'upstream/main' into tests/labeled_posit…

e4bcd61

…ions

bump version

eb80810

henrypinkard merged commit 41d212a into micro-manager:main Aug 4, 2023
6 checks passed
