Selectively reading sample components with subset #118

Merged: 3 commits into mhvk:master, Mar 6, 2018

Conversation

@cczhu (Contributor) commented Jan 29, 2018

Implemented the subset property in stream readers, which allows for selectively reading the components (threads, polarisation or channels) of a complete sample. Any indexing object accepted by a numpy.ndarray can be used, except for None. For multi-dimensional indexing, the enclosing structure must be a tuple (i.e. (dim_1_indexer, dim_2_indexer, ...)), so multi-dimensional arrays are not accepted. Advanced indexing that changes the dimensions of the sample shape is also not possible, except for passing single integers since these are transformed into slices within VLBIStreamBase._get_subset_and_sample_shape. Once set, subset is used as-is to slice frame data in all formats except VDIF, where threads and channels are sliced separately to retain the selective decoding of frameset frames. subset is a read-only property because it is currently infeasible to re-read a frame every time subset changes.
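
For illustration, a minimal sketch of the user-facing behaviour described above (the file name and its 8-thread, single-channel layout are placeholders, not part of this PR):

from baseband import vdif

# Placeholder file; assume it holds 8 threads and 1 channel per sample.
with vdif.open('sample.vdif', 'rs', subset=slice(0, 4), squeeze=False) as fh:
    print(fh.sample_shape)   # SampleShape(nthread=4, nchan=1)
    data = fh.read(1000)     # shape (1000, 4, 1)

# Multi-dimensional indexing needs an enclosing tuple:
with vdif.open('sample.vdif', 'rs', subset=([1, 3], 0), squeeze=False) as fh:
    data = fh.read(1000)     # threads 1 and 3, channel 0 -> shape (1000, 2, 1)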

Some notes on design choices:

  • As noted above, if the user passes a single number i for one dimension, subset's setter turns it into slice(i, i+1) (a simplified sketch of this conversion follows this list). This is because a namedtuple with N elements cannot be set with fewer than N objects, and figuring out which dimensions to excise from it (similar to what we do for squeezed shapes in the sample_shape getter) requires essentially the same machinery as for converting integers to slices in subset. The latter, however, retains user control of squeezing.
  • VDIF now reads the first frameset twice, the first time only to obtain the number of threads. This is ugly, but I'm happy to keep it as is if it turns out not to slow down data read-in very much. Alternatively, we could create a find_nthreads method in VDIFFileReader.
  • For consistency with other formats, I'm currently allowing VDIF frames to be returned in the thread ID order the user wishes. I don't think my implementation of this is self-consistent though, since I still use return cls(frames, header0) at the end of VDIFFrameSet.fromfile. Switching that just to return cls(frames) gives me a bunch of time offset errors in the test suite I don't fully understand.
  • I haven't decided if _unsliced_shape should be a tuple or named tuple. Since we use it in a few places (VDIF, GSB) to read new frames, we could initialize _unsliced_shape as a genuine sample shape at the same time as _sample_shape.
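
As noted in the first point, a simplified stand-alone version of the integer-to-slice conversion might look roughly like this (illustrative only; the real machinery lives in VLBIStreamBase._get_subset_and_sample_shape):

import operator

def normalize_subset(subset):
    # Illustrative sketch: wrap lone integers in length-1 slices so the
    # corresponding dimensions are kept rather than dropped.
    if subset is None:
        return None
    if not isinstance(subset, tuple):
        subset = (subset,)
    normalized = []
    for index in subset:
        try:
            i = operator.index(index)      # ints, np.int64, ...
        except TypeError:
            normalized.append(index)       # slices, lists, arrays pass through
        else:
            normalized.append(slice(i, i + 1))
    return tuple(normalized)

assert normalize_subset(3) == (slice(3, 4),)
assert normalize_subset(([1, 3], 0)) == ([1, 3], slice(0, 1))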

@cczhu force-pushed the subset branch 2 times, most recently from 4dced83 to e4ef976 on January 30, 2018 01:36
@cczhu (Contributor, Author) commented Jan 30, 2018

I've just updated the PR to address point 3 from above, by making the stream reader use the first header in the sorted first frameset, rather than the very first header of the file. This prevents the sample VDIF file from being read properly because of its timestamp offset issue, so I fixed the file's timestamps and made a note of it in __init__. I'm willing to roll back these changes, but I feel this makes the code more self-consistent, and saves us from having a whole bunch of notes in the documentation saying that one of our sample files is wrong...

@mhvk (Owner) commented Feb 1, 2018

On the changed file: I'm happy to have the repaired file available, but would like the original one to stay put (with a new name), and with a smaller set of tests that use it. Eventually, we want to have some "repair-on-the-fly" capabilities, in which, e.g., if the seconds in frames are not all the same, all-zero ones are ignored. It will then be good to have examples of problems encountered in the wild.

P.S. Could you separate this out into a different PR? I.e., just make SAMPLE_VDIF the repaired file, add a new SAMPLE_VDIF_..., and just check that reading the streams of both gives consistent start_time, stop_time and data.
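
A rough sketch of that check (SAMPLE_VDIF_ORIGINAL is a placeholder for whatever the unrepaired file ends up being called):

import astropy.units as u
from baseband import vdif
from baseband.data import SAMPLE_VDIF

# Placeholder path for the unrepaired original file.
SAMPLE_VDIF_ORIGINAL = 'sample_original.vdif'


def test_original_and_repaired_samples_consistent():
    with vdif.open(SAMPLE_VDIF, 'rs') as fh, \
            vdif.open(SAMPLE_VDIF_ORIGINAL, 'rs') as fh_orig:
        assert abs(fh.start_time - fh_orig.start_time) < 1. * u.ns
        assert abs(fh.stop_time - fh_orig.stop_time) < 1. * u.ns
        assert (fh.read() == fh_orig.read()).all()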

fh_raw, header0=header, sample_shape=sample_shape,
bps=header.bps, complex_data=False, thread_ids=thread_ids,
fh_raw, header0=header, sample_rate=sample_rate,
unsliced_shape=tuple(self._frame.payload.sample_shape),
@mhvk (Owner):

Don't like this recasting too much, unless it is really necessary.

super(VDIFStreamReader, self)._get_subset_and_sample_shape(subset)
if self.subset is not None:
self._thread_ids = list(np.arange(self._unsliced_shape[0],
dtype='int')[self.subset[0]])
@mhvk (Owner):

why the dtype? Should be int already, no?

@@ -328,7 +344,7 @@ def _last_header(self):
# Find first header with same thread_id going backward.
found = False
# Set maximum as twice number of frames in frameset.
maximum = 2 * self._sample_shape.nthread * self.header0.framesize
maximum = 2 * self._unsliced_shape[0] * self.header0.framesize
@mhvk (Owner):

Use self._framesetsize

data = fh.read()
data = np.array([data, abs(data),
-data, -abs(data)]).transpose(1, 2, 0)
fw = vdif.open(test_file, 'ws',
@mhvk (Owner):

Use with vdif.open(...) as fw

self.samples_per_frame = samples_per_frame
self.sample_rate = sample_rate
self.offset = 0
self._get_subset_and_sample_shape(subset)
@mhvk (Owner):

One might use setters to do this:

self.squeeze = squeeze
self.subset = subset
# evaluate lazyproperty
self.sample_shape

Without setters

self._squeeze = bool(squeeze)
self._subset = subset  # or maybe `self._get_subset(subset)`
self._sample_shape = self._calculate_sample_shape()

I think that by looking at both subset and squeeze in calculate_sample_shape, it should be possible to keep integers in subsets for squeeze=True.
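
A rough sketch of what the setter-free variant above could look like, with the sample shape worked out from both subset and squeeze (class and method names here are illustrative only, not the actual baseband code):

import numpy as np


class StreamBaseSketch:
    """Illustration only, not the actual VLBIStreamBase."""

    def __init__(self, unsliced_shape, subset=None, squeeze=True):
        self._unsliced_shape = unsliced_shape
        self._squeeze = bool(squeeze)
        self._subset = subset
        self._sample_shape = self._calculate_sample_shape()

    def _calculate_sample_shape(self):
        # Index a dummy array to see what shape subset produces; a lone
        # integer simply drops its dimension, which is exactly what we
        # want when squeeze=True.  (For squeeze=False, integers would
        # first be wrapped into length-1 slices, as described above.)
        dummy = np.empty(self._unsliced_shape, dtype=np.int8)
        if self._subset is not None:
            dummy = dummy[self._subset]
        shape = dummy.shape
        if self._squeeze:
            shape = tuple(dim for dim in shape if dim != 1)
        return shape


print(StreamBaseSketch((8, 1), subset=1)._sample_shape)            # ()
print(StreamBaseSketch((8, 1), subset=slice(0, 4))._sample_shape)  # (4,)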

assert sbs._unsqueeze(data).shape[1:] == sample_shape_short
assert sbs._unsqueeze(data).shape[1:] == unsliced_shape_short

@pytest.mark.parametrize(('subset', 'sliced_shape',
@mhvk (Owner):

You can use two @parametrize decorators to get the "product" of different parameter sets (or use itertools.product).
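
For example (a generic sketch, not the actual test parameters):

import numpy as np
import pytest


@pytest.mark.parametrize('squeeze', [True, False])
@pytest.mark.parametrize('subset', [slice(1, 4), [1, 3], 3])
def test_subset_squeeze_product(subset, squeeze):
    # The two stacked decorators run this test over the full 2 x 3
    # product of (squeeze, subset) combinations.
    data = np.zeros((10, 8))
    assert data[:, subset].shape[0] == 10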

@@ -256,25 +256,66 @@ VDIF file's headers are of class::

and so its attributes can be found `here <baseband.vdif.header.VDIFHeader3>`.

Opening Specific Threads/Channels From Files
--------------------------------------------
Opening Specific Components of the Data
@mhvk (Owner):

Note somewhere that it is better to put a slice rather than fancy indices in subset.

@mhvk (Owner):

And put somewhere that squeeze is immutable.

@mhvk (Owner):

Maybe for the title of the section, state "Reading Specific Components ..." or "Opening a File for Reading Only Specific Components".

@cczhu force-pushed the subset branch 5 times, most recently from b076266 to 16bff54 on March 5, 2018 16:32
@cczhu (Contributor, Author) commented Mar 5, 2018

Addressed the PR comments, and made some revisions to vlbi_base. Now:

  • _get_subset both processes subset and calculates subset_wrapints, which converts lone integers to slices. The latter is then used in _get_sample_shape. When squeeze = False, subset is set to subset_wrapints to preserve dimensions of length unity. When squeeze = True, we use subset as is and squeeze any remaining unity dimensions. The resulting sample_shape corresponds to the dimensions of any payload data indexed with subset in either case.
  • _unsliced_shape is now a namedtuple (of the same type as sample_shape). Stream writers need it to create _data and use _unsqueeze() (we can make _data squeezed, but we'd still need to unsqueeze it when writing to payload). Moreover, it's needed for the Mark 5B reader because nchan is not in the header and sample_shape could be subset. This somewhat lessens the impact of @mhvk's recent changes (since we've just replaced calls to _sample_shape with calls to _unsliced_shape).
  • For VDIF, reverted from using np.arange(nthread) to directly obtaining frame numbers from the sorted frameset to set _thread_ids, and moved the machinery to VDIFStreamReader.__init__. This is just in case we encounter VDIF files where the frame number skips values (e.g. [0, 1, 629, 630]) but everything else is fine. Dana's encountered files that skip frame number values, but also have additional problems like invalid frames, and the reader will still break in those cases; this change should, however, make it easier to hack. This "feature" is currently untested and undocumented since it's just an implementation choice.
  • That said, having a try/except block to handle self.subset[0] when setting _thread_ids is kind of ugly right now, which is the price we pay for allowing self.subset[0] to be an int when squeeze is True.

@mhvk (Owner) left a comment:

Almost exclusively nitpicks now.

One more general question: we could insert singleton integers for dimensions that get squeezed in subset, thus avoiding the need for squeezing the payload. (This means subset would no longer be equal to the input for any value of squeeze, but since it is now edited for squeeze=False, perhaps that's OK after all...) One advantage of this is that the __repr__ will then correctly give the subset that would reproduce the input (currently, squeeze is not shown in the repr).

But perhaps this discussion is best left for another PR.
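
In plain numpy, the idea is that an integer index already does the squeezing for that axis (generic illustration only):

import numpy as np

data = np.zeros((100, 8, 1))            # (samples, threads, channels)
# An integer in the index drops that axis outright ...
assert data[:, 0:2, 0].shape == (100, 2)
# ... which matches subsetting with a slice and then squeezing it away.
assert data[:, 0:2, slice(0, 1)].squeeze(-1).shape == (100, 2)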

assert np.all(fh_n.read() == fh_r.read())
assert abs(fh_n.stop_time - fh_n.time) < 1.*u.ns
assert abs(fh_n.stop_time - fh_r.stop_time) < 1.*u.ns
fh_n = gsb.open(sh, 'rs', raw=sp, sample_rate=fh_r.sample_rate,
@mhvk (Owner):

Use with gsb.open(...) as fh_n: to ensure the file gets closed (currently, it does not?)

@cczhu (Contributor, Author):

The enclosing with ... statement opens both the sample rawdump files for reading, and two binary files (sh and sp) for writing. However, it's silly to reset sh and sp every time, so I rewrote this as you suggested.

@mhvk (Owner):

Ah, yes, that makes sense.

sh.seek(0)
sp.seek(0)
fh_r.seek(0)
fh_wns = gsb.open(sh, 'ws', raw=sp, sample_rate=fh_r.sample_rate,
@mhvk (Owner):

Same

sh.seek(0)
sp.seek(0)
fh_r.seek(0)
fh_nns = gsb.open(sh, 'rs', raw=sp, sample_rate=fh_r.sample_rate,
@mhvk (Owner):

And again

fh_raw, header0=header, sample_shape=sample_shape, bps=bps,
complex_data=False, thread_ids=thread_ids,
fh_raw, header0=header, bps=bps, complex_data=False, subset=subset,
unsliced_shape=tuple(self._frame.payload.sample_shape),
@mhvk (Owner):

Remove the tuple - no need.

samples_per_frame=header.payloadsize * 8 // bps // nchan,
sample_rate=sample_rate, squeeze=squeeze)

self._data = np.zeros((self.samples_per_frame,
self._sample_shape.nchan), np.float32)
self._unsliced_shape.nchan), np.float32)
@mhvk (Owner):

Since nchan is an input parameter, it is not entirely illogical to keep it as a private variable, in which case one could do self._nchan. But really not much of a benefit either -- fine to keep as is.

@cczhu (Contributor, Author):

I spent probably an hour yesterday experimenting with which I liked better. I like this because it sets a precedent for formats which don't record the sample shape.

@mhvk (Owner):

OK, let's keep as is then.

By default, ``fh.read()`` returns complete samples, i.e. with all
available threads, polarizations or channels. If we were only interested in
decoding specific components of the complete sample, we can select them by
passing indexing objects the ``subset`` keyword in open. For example, if we
@mhvk (Owner):

objects in the subset


Since ``squeeze=False``, ``subset`` is converted from ``3`` to ``slice(3, 4,
None)`` to retain dimensions of length unity. This behaviour is turned off
when ``squeeze=True`` (see below). Like ``squeeze``, ``subset`` cannot be
@mhvk (Owner):

Again, I feel that we do not have to mention that squeeze is immutable.

SampleShape(nthread=2, nchan=1)
>>> fh.close()

Here, we have also selected 1 and 3, and channel 0. No enclosing `tuple` is
@mhvk (Owner):

I don't understand what the text tries to say here. There is only one channel, so we haven't selected anything. I would just delete the sentence.

use broadcasting to select specific threads from more than one sample shape
dimension; see the Numpy documentation on `integer array indexing.
<https://docs.scipy.org/doc/numpy/reference/arrays.indexing.html#integer-array-indexing>`_

@mhvk (Owner):

This is maybe the place to mention that, for the general case, things will be slightly faster with subset=slice(1, 4, 2) instead of subset=[1, 3] (unfortunately, this is not true for this specific example, though... so maybe just raise an issue to remind us).
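
The underlying reason, in plain numpy: a basic slice returns a view, while a list of indices triggers fancy indexing and copies the data (generic illustration, not a baseband benchmark):

import numpy as np

payload = np.arange(8 * 1000).reshape(1000, 8)   # (samples, threads)

view = payload[:, slice(1, 4, 2)]   # basic slicing: no data copied
fancy = payload[:, [1, 3]]          # fancy indexing: a new array is made

assert np.shares_memory(payload, view)
assert not np.shares_memory(payload, fancy)
assert np.array_equal(view, fancy)  # same selected threads either way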

if self.subset:
data_slice = (data_slice,) + self.subset
out[sample:sample + nsample] = (
self._frame[data_slice].squeeze() if self.squeeze else
@mhvk (Owner):

You have two different ways of applying the subset -- here, you construct a slice and then use it; in some of the other formats, you slice data sequentially. The one here is arguably better, though perhaps less clear.
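
In plain numpy terms, the two styles being compared are roughly the following (frame here is just a stand-in array):

import numpy as np

frame = np.arange(2000 * 8).reshape(2000, 8)     # (samples, threads)
data_slice = slice(100, 200)
subset = (slice(1, 4),)

# Build one index tuple and apply it in a single step (as done here):
combined = frame[(data_slice,) + subset]

# Slice sequentially, first in time and then by component (other formats):
sequential = frame[data_slice][(slice(None),) + subset]

assert np.array_equal(combined, sequential)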

@cczhu (Contributor, Author):

I made the switch to this syntax in the VLBI formats, but had to modify some of the lines for VDIF and Mark 4 (the latter because Mark 4 frame's __getitem__ forbids indexing).

@mhvk (Owner):

Could you raise an issue about Mark4Frame not allowing indexing? That may just be an oversight.

@mhvk (Owner) commented Mar 5, 2018

On my larger comment, see #127 - let's do that after merging this, since this is very close and very large.

@cczhu (Contributor, Author) commented Mar 6, 2018

Addressed all issues. Before we merge I might want to squash the last two commits - I was hoping they'd be independent of one another but I fixed a fatal VDIF bug in the last one that was created in the second-last one.

# Set _thread_ids. If subsetting, decode first frameset again.
if self.subset:
# Squeeze in case subset[0] uses broadcasting.
subset_0 = (self.subset[0].squeeze()
@mhvk (Owner):

I think this can be simplified by just doing

thread_ids = np.array(thread_ids)[subset[0]].squeeze()

I'll do it in a quick separate commit.

@cczhu (Contributor, Author) commented Mar 6, 2018:

Nope, that squeezes np.array([1]) into np.array(1). You can solve that by checking if the array's ndim is larger than 0 before trying to turn it into a list, but that adds more checks to your line and makes things look confusing.
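
The edge case in plain numpy:

import numpy as np

thread_ids = [0, 1, 2, 3, 4, 5, 6, 7]
subset_0 = slice(3, 4)                     # picks out a single thread

selected = np.array(thread_ids)[subset_0].squeeze()
print(selected.ndim)                       # 0: a scalar array
try:
    list(selected)
except TypeError as exc:
    print(exc)                             # iteration over a 0-d array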

cczhu added 3 commits March 6, 2018 11:46
A subset argument can now be passed to stream readers in order to
selectively read (and in the case of VDIF, selectively decode and
read) components of the complete sample.

thread_ids can no longer be passed to stream readers.
Also make squeeze (along with subset) immutable once set by the
initializer. If squeeze=True, subset does not modify lone integers
(this is now compatible with sample shape). Changed _sample_shape to
hold the squeezed shape. _unsliced_shape now a namedtuple so it can be
used by stream writers (and the M5B reader).
@mhvk (Owner) commented Mar 6, 2018

@cczhu - I made some final edits and cleanups, and also rebased (in hindsight, that was a mistake since now it is hard to see what I changed, at least on github). I think we can merge if tests pass.

@mhvk (Owner) left a comment:

All OK now presumably

@mhvk added this to the 1.0 milestone on Mar 6, 2018
@cczhu (Contributor, Author) commented Mar 6, 2018 via email

@mhvk (Owner) commented Mar 6, 2018

@cczhu - I did squash those commits - there are now only 3, your initial 2 (with reworded titles, to be within 72 char), and a final one combining the rest. Since tests passed, I'll merge!

@mhvk merged commit 05dd5a9 into mhvk:master on Mar 6, 2018
@cczhu (Contributor, Author) commented Mar 6, 2018 via email

@mhvk (Owner) commented Mar 6, 2018

Indeed. When you commit, the text in the editor screen tells you what is expected.

@mhvk mentioned this pull request on Mar 7, 2018
@cczhu deleted the subset branch on March 14, 2018 22:02