Selectively reading sample components with subset #118

Merged: 3 commits into mhvk:master, Mar 6, 2018

Conversation

@cczhu (Contributor) commented Jan 29, 2018

Implemented the subset property in stream readers, which allows for selectively reading the components (threads, polarisation or channels) of a complete sample. Any indexing object accepted by a numpy.ndarray can be used, except for None. For multi-dimensional indexing, the enclosing structure must be a tuple (i.e. (dim_1_indexer, dim_2_indexer, ...)), so multi-dimensional arrays are not accepted. Advanced indexing that changes the dimensions of the sample shape is also not possible, except for passing single integers since these are transformed into slices within VLBIStreamBase._get_subset_and_sample_shape. Once set, subset is used as-is to slice frame data in all formats except VDIF, where threads and channels are sliced separately to retain the selective decoding of frameset frames. subset is a read-only property because it is currently infeasible to re-read a frame every time subset changes.
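
For illustration, a minimal sketch of the user-facing behaviour described above (the file name and its 8-thread, single-channel layout are placeholders, not part of this PR):

from baseband import vdif

# Placeholder file; assume it holds 8 threads and 1 channel per sample.
with vdif.open('sample.vdif', 'rs', subset=slice(0, 4), squeeze=False) as fh:
    print(fh.sample_shape)   # SampleShape(nthread=4, nchan=1)
    data = fh.read(1000)     # shape (1000, 4, 1)

# Multi-dimensional indexing needs an enclosing tuple:
with vdif.open('sample.vdif', 'rs', subset=([1, 3], 0), squeeze=False) as fh:
    data = fh.read(1000)     # threads 1 and 3, channel 0 -> shape (1000, 2, 1)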

Some notes on design choices:

  • As noted above, if the user passes a single number i for one dimension, subset's setter turns it into slice(i, i+1) (a simplified sketch of this conversion follows this list). This is because a namedtuple with N elements cannot be set with fewer than N objects, and figuring out which dimensions to excise from it (similar to what we do for squeezed shapes in the sample_shape getter) requires essentially the same machinery as for converting integers to slices in subset. The latter, however, retains user control of squeezing.
  • VDIF now reads the first frameset twice, the first time only to obtain the number of threads. This is ugly, but I'm happy to keep it as is if it turns out not to slow down data read-in very much. Alternatively, we could create a find_nthreads method in VDIFFileReader.
  • For consistency with other formats, I'm currently allowing VDIF frames to be returned in the thread ID order the user wishes. I don't think my implementation of this is self-consistent though, since I still use return cls(frames, header0) at the end of VDIFFrameSet.fromfile. Switching that just to return cls(frames) gives me a bunch of time offset errors in the test suite I don't fully understand.
  • I haven't decided if _unsliced_shape should be a tuple or named tuple. Since we use it in a few places (VDIF, GSB) to read new frames, we could initialize _unsliced_shape as a genuine sample shape at the same time as _sample_shape.
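
As noted in the first point, a simplified stand-alone version of the integer-to-slice conversion might look roughly like this (illustrative only; the real machinery lives in VLBIStreamBase._get_subset_and_sample_shape):

import operator

def normalize_subset(subset):
    # Illustrative sketch: wrap lone integers in length-1 slices so the
    # corresponding dimensions are kept rather than dropped.
    if subset is None:
        return None
    if not isinstance(subset, tuple):
        subset = (subset,)
    normalized = []
    for index in subset:
        try:
            i = operator.index(index)      # ints, np.int64, ...
        except TypeError:
            normalized.append(index)       # slices, lists, arrays pass through
        else:
            normalized.append(slice(i, i + 1))
    return tuple(normalized)

assert normalize_subset(3) == (slice(3, 4),)
assert normalize_subset(([1, 3], 0)) == ([1, 3], slice(0, 1))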

@cczhu force-pushed the subset branch 2 times, most recently from 4dced83 to e4ef976 on January 30, 2018 01:36
@cczhu (Contributor, Author) commented Jan 30, 2018

I've just updated the PR to address point 3 from above, by making the stream reader use the first header in the sorted first frameset, rather than the very first header of the file. This prevents the sample VDIF file from being read properly because of its timestamp offset issue, so I fixed the file's timestamps and made a note of it in __init__. I'm willing to roll back these changes, but I feel this makes the code more self-consistent, and saves us from having a whole bunch of notes in the documentation saying that one of our sample files is wrong...

@mhvk (Owner) commented Feb 1, 2018

On the changed file: I'm happy to have the repaired file available, but would like the original one to stay put (with a new name), and with a smaller set of tests that use it. Eventually, we want to have some "repair-on-the-fly" capabilities, in which, e.g., if the seconds in frames are not all the same, all-zero ones are ignored. It will then be good to have examples of problems encountered in the wild.

P.S. Could you separate this out into a different PR? I.e., just make SAMPLE_VDIF the repaired file, add a new SAMPLE_VDIF_..., and just check that reading the streams of both gives consistent start_time, stop_time and data.
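
A rough sketch of that check (SAMPLE_VDIF_ORIGINAL is a placeholder for whatever the unrepaired file ends up being called):

import astropy.units as u
from baseband import vdif
from baseband.data import SAMPLE_VDIF

# Placeholder path for the unrepaired original file.
SAMPLE_VDIF_ORIGINAL = 'sample_original.vdif'


def test_original_and_repaired_samples_consistent():
    with vdif.open(SAMPLE_VDIF, 'rs') as fh, \
            vdif.open(SAMPLE_VDIF_ORIGINAL, 'rs') as fh_orig:
        assert abs(fh.start_time - fh_orig.start_time) < 1. * u.ns
        assert abs(fh.stop_time - fh_orig.stop_time) < 1. * u.ns
        assert (fh.read() == fh_orig.read()).all()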

fh_raw, header0=header, sample_shape=sample_shape,
bps=header.bps, complex_data=False, thread_ids=thread_ids,
fh_raw, header0=header, sample_rate=sample_rate,
unsliced_shape=tuple(self._frame.payload.sample_shape),
@mhvk (Owner):

Don't like this recasting too much, unless it is really necessary.

super(VDIFStreamReader, self)._get_subset_and_sample_shape(subset)
if self.subset is not None:
self._thread_ids = list(np.arange(self._unsliced_shape[0],
dtype='int')[self.subset[0]])
@mhvk (Owner):

why the dtype? Should be int already, no?

@@ -328,7 +344,7 @@ def _last_header(self):
# Find first header with same thread_id going backward.
found = False
# Set maximum as twice number of frames in frameset.
maximum = 2 * self._sample_shape.nthread * self.header0.framesize
maximum = 2 * self._unsliced_shape[0] * self.header0.framesize
@mhvk (Owner):

Use self._framesetsize

data = fh.read()
data = np.array([data, abs(data),
-data, -abs(data)]).transpose(1, 2, 0)
fw = vdif.open(test_file, 'ws',
@mhvk (Owner):

Use with vdif.open(...) as fw

self.samples_per_frame = samples_per_frame
self.sample_rate = sample_rate
self.offset = 0
self._get_subset_and_sample_shape(subset)
@mhvk (Owner):

One might use setters to do this:

self.squeeze = squeeze
self.subset = subset
# evaluate lazyproperty
self.sample_shape

Without setters

self._squeeze = bool(squeeze)
self._subset = subset  # or maybe `self._get_subset(subset)`
self._sample_shape = self._calculate_sample_shape()

I think that by looking at both subset and squeeze in calculate_sample_shape, it should be possible to keep integers in subsets for squeeze=True.
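
A rough sketch of what the setter-free variant above could look like, with the sample shape worked out from both subset and squeeze (class and method names here are illustrative only, not the actual baseband code):

import numpy as np


class StreamBaseSketch:
    """Illustration only, not the actual VLBIStreamBase."""

    def __init__(self, unsliced_shape, subset=None, squeeze=True):
        self._unsliced_shape = unsliced_shape
        self._squeeze = bool(squeeze)
        self._subset = subset
        self._sample_shape = self._calculate_sample_shape()

    def _calculate_sample_shape(self):
        # Index a dummy array to see what shape subset produces; a lone
        # integer simply drops its dimension, which is exactly what we
        # want when squeeze=True.  (For squeeze=False, integers would
        # first be wrapped into length-1 slices, as described above.)
        dummy = np.empty(self._unsliced_shape, dtype=np.int8)
        if self._subset is not None:
            dummy = dummy[self._subset]
        shape = dummy.shape
        if self._squeeze:
            shape = tuple(dim for dim in shape if dim != 1)
        return shape


print(StreamBaseSketch((8, 1), subset=1)._sample_shape)            # ()
print(StreamBaseSketch((8, 1), subset=slice(0, 4))._sample_shape)  # (4,)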

assert sbs._unsqueeze(data).shape[1:] == sample_shape_short
assert sbs._unsqueeze(data).shape[1:] == unsliced_shape_short

@pytest.mark.parametrize(('subset', 'sliced_shape',
@mhvk (Owner):

You can use two @parametrize decorators to get the "product" of different parameter sets (or use itertools.product).
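
For example (a generic sketch, not the actual test parameters):

import numpy as np
import pytest


@pytest.mark.parametrize('squeeze', [True, False])
@pytest.mark.parametrize('subset', [slice(1, 4), [1, 3], 3])
def test_subset_squeeze_product(subset, squeeze):
    # The two stacked decorators run this test over the full 2 x 3
    # product of (squeeze, subset) combinations.
    data = np.zeros((10, 8))
    assert data[:, subset].shape[0] == 10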

@@ -256,25 +256,66 @@ VDIF file's headers are of class::

and so its attributes can be found `here <baseband.vdif.header.VDIFHeader3>`.

Opening Specific Threads/Channels From Files
--------------------------------------------
Opening Specific Components of the Data
@mhvk (Owner):

Note somewhere that it is better to put a slice rather than fancy indices in subset.

@mhvk (Owner):

And put somewhere that squeeze is immutable.

@mhvk (Owner):

Maybe for the title of the section, state "Reading Specific Components ..." or "Opening a File for Reading Only Specific Components".

@cczhu force-pushed the subset branch 5 times, most recently from b076266 to 16bff54 on March 5, 2018 16:32
@cczhu (Contributor, Author) commented Mar 5, 2018

Addressed the PR comments, and made some revisions to vlbi_base. Now:

  • _get_subset both processes subset and calculates subset_wrapints, which converts lone integers to slices. The latter is then used in _get_sample_shape. When squeeze = False, subset is set to subset_wrapints to preserve dimensions of length unity. When squeeze = True, we use subset as is and squeeze any remaining unity dimensions. The resulting sample_shape corresponds to the dimensions of any payload data indexed with subset in either case.
  • _unsliced_shape is now a namedtuple (of the same type as sample_shape). Stream writers need it to create _data and use _unsqueeze() (we can make _data squeezed, but we'd still need to unsqueeze it when writing to payload). Moreover, it's needed for the Mark 5B reader because nchan is not in the header and sample_shape could be subset. This somewhat lessens the impact of @mhvk's recent changes (since we've just replaced calls to _sample_shape with calls to _unsliced_shape).
  • For VDIF, reverted from using np.arange(nthread) to directly obtaining frame numbers from the sorted frameset to set _thread_ids, and moved the machinery to VDIFStreamReader.__init__. This is just in case we encounter VDIF files where the frame number skips values (e.g. [0, 1, 629, 630]) but everything else is fine. Dana's encountered files that skip frame number values, but also have additional problems like invalid frames, and the reader will still break in those cases; this change should, however, make it easier to hack. This "feature" is currently untested and undocumented since it's just an implementation choice.
  • That said, having a try/except block to handle self.subset[0] when setting _thread_ids is kind of ugly right now, which is the price we pay for allowing self.subset[0] to be an int when squeeze is True.

@mhvk (Owner) left a comment:

Almost exclusively nitpicks now.

One more general question: we could insert singleton integers for dimensions that get squeezed in subset, thus avoiding the need for squeezing the payload. (This means subset would no longer be equal to the input for any value of squeeze, but since it is now edited for squeeze=False, perhaps that's OK after all...) One advantage of this is that the __repr__ will then correctly give the subset that would reproduce the input (currently, squeeze is not shown in the repr).

But perhaps this discussion is best left for another PR.
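
In plain numpy, the idea is that an integer index already does the squeezing for that axis (generic illustration only):

import numpy as np

data = np.zeros((100, 8, 1))            # (samples, threads, channels)
# An integer in the index drops that axis outright ...
assert data[:, 0:2, 0].shape == (100, 2)
# ... which matches subsetting with a slice and then squeezing it away.
assert data[:, 0:2, slice(0, 1)].squeeze(-1).shape == (100, 2)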

assert np.all(fh_n.read() == fh_r.read())
assert abs(fh_n.stop_time - fh_n.time) < 1.*u.ns
assert abs(fh_n.stop_time - fh_r.stop_time) < 1.*u.ns
fh_n = gsb.open(sh, 'rs', raw=sp, sample_rate=fh_r.sample_rate,
@mhvk (Owner):

Use with gsb.open(...) as fh_n: to ensure the file gets closed (currently, it does not?)

@cczhu (Contributor, Author):

The enclosing with ... statement opens both the sample rawdump files for reading, and two binary files (sh and sp) for writing. However, it's silly to reset sh and sp every time, so I rewrote this as you suggested.

@mhvk (Owner):

Ah, yes, that makes sense.

sh.seek(0)
sp.seek(0)
fh_r.seek(0)
fh_wns = gsb.open(sh, 'ws', raw=sp, sample_rate=fh_r.sample_rate,
@mhvk (Owner):

Same

sh.seek(0)
sp.seek(0)
fh_r.seek(0)
fh_nns = gsb.open(sh, 'rs', raw=sp, sample_rate=fh_r.sample_rate,
@mhvk (Owner):

And again

fh_raw, header0=header, sample_shape=sample_shape, bps=bps,
complex_data=False, thread_ids=thread_ids,
fh_raw, header0=header, bps=bps, complex_data=False, subset=subset,
unsliced_shape=tuple(self._frame.payload.sample_shape),
@mhvk (Owner):

Remove the tuple - no need.

samples_per_frame=header.payloadsize * 8 // bps // nchan,
sample_rate=sample_rate, squeeze=squeeze)

self._data = np.zeros((self.samples_per_frame,
self._sample_shape.nchan), np.float32)
self._unsliced_shape.nchan), np.float32)
@mhvk (Owner):

Since nchan is an input parameter, it is not entirely illogical to keep it as a private variable, in which case one could do self._nchan. But really not much of a benefit either -- fine to keep as is.

@cczhu (Contributor, Author):

I spent probably an hour yesterday experimenting with which I liked better. I like this because it sets a precedent for formats which don't record the sample shape.

@mhvk (Owner):

OK, let's keep as is then.

By default, ``fh.read()`` returns complete samples, i.e. with all
available threads, polarizations or channels. If we were only interested in
decoding specific components of the complete sample, we can select them by
passing indexing objects the ``subset`` keyword in open. For example, if we
@mhvk (Owner):

objects in the subset


Since ``squeeze=False``, ``subset`` is converted from ``3`` to ``slice(3, 4,
None)`` to retain dimensions of length unity. This behaviour is turned off
when ``squeeze=True`` (see below). Like ``squeeze``, ``subset`` cannot be
@mhvk (Owner):

Again, I feel that we do not have to mention that squeeze is immutable.

SampleShape(nthread=2, nchan=1)
>>> fh.close()

Here, we have also selected 1 and 3, and channel 0. No enclosing `tuple` is
@mhvk (Owner):

I don't understand what the text tries to say here. There is only one channel, so we haven't selected anything. I would just delete the sentence.

use broadcasting to select specific threads from more than one sample shape
dimension; see the Numpy documentation on `integer array indexing.
<https://docs.scipy.org/doc/numpy/reference/arrays.indexing.html#integer-array-indexing>`_

@mhvk (Owner):

This is maybe the place to mention that, for the general case, things will be slightly faster with subset=slice(1, 4, 2) instead of subset=[1, 3] (unfortunately, this is not true for this specific example, though... so maybe just raise an issue to remind us).
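
The underlying reason, in plain numpy: a basic slice returns a view, while a list of indices triggers fancy indexing and copies the data (generic illustration, not a baseband benchmark):

import numpy as np

payload = np.arange(8 * 1000).reshape(1000, 8)   # (samples, threads)

view = payload[:, slice(1, 4, 2)]   # basic slicing: no data copied
fancy = payload[:, [1, 3]]          # fancy indexing: a new array is made

assert np.shares_memory(payload, view)
assert not np.shares_memory(payload, fancy)
assert np.array_equal(view, fancy)  # same selected threads either way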

if self.subset:
data_slice = (data_slice,) + self.subset
out[sample:sample + nsample] = (
self._frame[data_slice].squeeze() if self.squeeze else
@mhvk (Owner):

You have two different ways of applying the subset -- here, you construct a slice and then use it; in some of the other formats, you slice data sequentially. The one here is arguably better, though perhaps less clear.
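
In plain numpy terms, the two styles being compared are roughly the following (frame here is just a stand-in array):

import numpy as np

frame = np.arange(2000 * 8).reshape(2000, 8)     # (samples, threads)
data_slice = slice(100, 200)
subset = (slice(1, 4),)

# Build one index tuple and apply it in a single step (as done here):
combined = frame[(data_slice,) + subset]

# Slice sequentially, first in time and then by component (other formats):
sequential = frame[data_slice][(slice(None),) + subset]

assert np.array_equal(combined, sequential)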

@cczhu (Contributor, Author):

I made the switch to this syntax in the VLBI formats, but had to modify some of the lines for VDIF and Mark 4 (the latter because Mark 4 frame's __getitem__ forbids indexing).

@mhvk (Owner):

Could you raise an issue about Mark4Frame not allowing indexing? That may just be an oversight.

@mhvk (Owner) commented Mar 5, 2018

On my larger comment, see #127 - let's do that after merging this, since this is very close and very large.

@cczhu (Contributor, Author) commented Mar 6, 2018

Addressed all issues. Before we merge I might want to squash the last two commits - I was hoping they'd be independent of one another but I fixed a fatal VDIF bug in the last one that was created in the second-last one.

# Set _thread_ids. If subsetting, decode first frameset again.
if self.subset:
# Squeeze in case subset[0] uses broadcasting.
subset_0 = (self.subset[0].squeeze()
@mhvk (Owner):

I think this can be simplified by just doing

thread_ids = np.array(thread_ids)[subset[0]].squeeze()

I'll do it in a quick separate commit.

@cczhu (Contributor, Author) commented Mar 6, 2018:

Nope, that squeezes np.array([1]) into np.array(1). You can solve that by checking if the array's ndim is larger than 0 before trying to turn it into a list, but that adds more checks to your line and makes things look confusing.
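
The edge case in plain numpy:

import numpy as np

thread_ids = [0, 1, 2, 3, 4, 5, 6, 7]
subset_0 = slice(3, 4)                     # picks out a single thread

selected = np.array(thread_ids)[subset_0].squeeze()
print(selected.ndim)                       # 0: a scalar array
try:
    list(selected)
except TypeError as exc:
    print(exc)                             # iteration over a 0-d array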

cczhu added 3 commits March 6, 2018 11:46
A subset argument can now be passed to stream readers in order to
selectively read (and in the case of VDIF, selectively decode and
read) components of the complete sample.

thread_ids can no longer be passed to stream readers.
Also make squeeze (along with subset) immutable once set by the
initializer. If squeeze=True, subset does not modify lone integers
(this is now compatible with sample shape). Changed _sample_shape to
hold the squeezed shape. _unsliced_shape now a namedtuple so it can be
used by stream writers (and the M5B reader).
@mhvk (Owner) commented Mar 6, 2018

@cczhu - I made some final edits and cleanups, and also rebased (in hindsight, that was a mistake since now it is hard to see what I changed, at least on github). I think we can merge if tests pass.

@mhvk (Owner) left a comment:

All OK now presumably

@mhvk added this to the 1.0 milestone on Mar 6, 2018
@cczhu (Contributor, Author) commented Mar 6, 2018 via email

@mhvk (Owner) commented Mar 6, 2018

@cczhu - I did squash those commits - there are now only 3, your initial 2 (with reworded titles, to be within 72 char), and a final one combining the rest. Since tests passed, I'll merge!

@mhvk merged commit 05dd5a9 into mhvk:master on Mar 6, 2018
@cczhu (Contributor, Author) commented Mar 6, 2018 via email

@mhvk (Owner) commented Mar 6, 2018

Indeed. When you commit, the text in the editor screen tells you what is expected.

@mhvk mentioned this pull request on Mar 7, 2018
@cczhu deleted the subset branch on March 14, 2018 22:02