-
Notifications
You must be signed in to change notification settings - Fork 28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Selectively reading sample components with subset
#118
Conversation
4dced83
to
e4ef976
Compare
I've just updated the PR to address point 3 from above, by making the stream reader use the first header in the sorted first frameset, rather than the very first header of the file. This prevents the sample VDIF file from being read properly because of its timestamp offset issue, so I fixed the file's timestamps and made a note of it in |
On the changed file: I'm happy to have the repaired file available, but would like the original one to stay put (with a new name), and with a smaller set of tests that use it. Eventually, we want to have some "repair-on-the-fly" capabilities, in which, e.g., if the seconds in frames are not all the same, all-zero ones are ignored. It will then be good to have examples of problems encountered in the wild. p.s. Could you separate this out into a different PR? I.e., just make SAMPLE_VDIF the repaired file, add a new SAMPLE_VDIF_... and just check ones that read the streams of both one gets consistent start_time, stop_time and data. |
baseband/mark4/base.py
Outdated
fh_raw, header0=header, sample_shape=sample_shape, | ||
bps=header.bps, complex_data=False, thread_ids=thread_ids, | ||
fh_raw, header0=header, sample_rate=sample_rate, | ||
unsliced_shape=tuple(self._frame.payload.sample_shape), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't like this recasting too much, unless it is really necessary.
baseband/vdif/base.py
Outdated
super(VDIFStreamReader, self)._get_subset_and_sample_shape(subset) | ||
if self.subset is not None: | ||
self._thread_ids = list(np.arange(self._unsliced_shape[0], | ||
dtype='int')[self.subset[0]]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why the dtype
? Should be int
already, no?
baseband/vdif/base.py
Outdated
@@ -328,7 +344,7 @@ def _last_header(self): | |||
# Find first header with same thread_id going backward. | |||
found = False | |||
# Set maximum as twice number of frames in frameset. | |||
maximum = 2 * self._sample_shape.nthread * self.header0.framesize | |||
maximum = 2 * self._unsliced_shape[0] * self.header0.framesize |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use self._framesetsize
baseband/vdif/tests/test_vdif.py
Outdated
data = fh.read() | ||
data = np.array([data, abs(data), | ||
-data, -abs(data)]).transpose(1, 2, 0) | ||
fw = vdif.open(test_file, 'ws', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use with vdif.open(...) as fw
baseband/vlbi_base/base.py
Outdated
self.samples_per_frame = samples_per_frame | ||
self.sample_rate = sample_rate | ||
self.offset = 0 | ||
self._get_subset_and_sample_shape(subset) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One might use setters to do this:
self.squeeze = squeeze
self.subset = subset
# evalutate lazyproperty
self.sample_shape
Without setters
self._squeeze = bool(squeeze)
self._subset = subset # or maybe `self._get_subset(subset)`
self._sample_shape = self._calculate_sample_shape()
I think that by looking at both subset
and squeeze
in calculate_sample_shape
, it should be possible to keep integers in subsets for squeeze=True
.
assert sbs._unsqueeze(data).shape[1:] == sample_shape_short | ||
assert sbs._unsqueeze(data).shape[1:] == unsliced_shape_short | ||
|
||
@pytest.mark.parametrize(('subset', 'sliced_shape', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use two @parametrize to get the "product" or different ones (or itertools.product
)
@@ -256,25 +256,66 @@ VDIF file's headers are of class:: | |||
|
|||
and so its attributes can be found `here <baseband.vdif.header.VDIFHeader3>`. | |||
|
|||
Opening Specific Threads/Channels From Files | |||
-------------------------------------------- | |||
Opening Specific Components of the Data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note somewhere that it is better to put slice
than fancy indices in subset
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And put somewhere that squeeze
is immutable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe for the title of the section, state Reading Specific Components..
or Opening a File for Reading Only Specific Components
b076266
to
16bff54
Compare
Addressed the PR comments, and made some revisions to
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Almost exclusively nitpicks now.
One more general question: we could insert singleton integers for dimensions that get squeezed in subset
, thus avoiding the need for squeezing the payload. (This means subset
would no longer be equal to the input for any value of squeeze
, but since it is now edited for squeeze=False
, perhaps that's OK after all...) One advantage of this is that the __repr__
will then correctly give the subset that would reproduce the input (currently, squeeze
is not shown in the repr
).
But perhaps this discussion is best left for another PR.
baseband/gsb/tests/test_gsb.py
Outdated
assert np.all(fh_n.read() == fh_r.read()) | ||
assert abs(fh_n.stop_time - fh_n.time) < 1.*u.ns | ||
assert abs(fh_n.stop_time - fh_r.stop_time) < 1.*u.ns | ||
fh_n = gsb.open(sh, 'rs', raw=sp, sample_rate=fh_r.sample_rate, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use with gsb.open(...) as fh_n:
to ensure the file gets closed (currently, it does not?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the enclosing with...
statement opens both the sample rawdump files for reading, and two binary files (sh
and sp
) for writing. However, it's silly to reset sh
and sp
every time, so rewrote this as you suggested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yes, that makes sense.
baseband/gsb/tests/test_gsb.py
Outdated
sh.seek(0) | ||
sp.seek(0) | ||
fh_r.seek(0) | ||
fh_wns = gsb.open(sh, 'ws', raw=sp, sample_rate=fh_r.sample_rate, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same
baseband/gsb/tests/test_gsb.py
Outdated
sh.seek(0) | ||
sp.seek(0) | ||
fh_r.seek(0) | ||
fh_nns = gsb.open(sh, 'rs', raw=sp, sample_rate=fh_r.sample_rate, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And again
baseband/mark5b/base.py
Outdated
fh_raw, header0=header, sample_shape=sample_shape, bps=bps, | ||
complex_data=False, thread_ids=thread_ids, | ||
fh_raw, header0=header, bps=bps, complex_data=False, subset=subset, | ||
unsliced_shape=tuple(self._frame.payload.sample_shape), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove the tuple
- no need.
samples_per_frame=header.payloadsize * 8 // bps // nchan, | ||
sample_rate=sample_rate, squeeze=squeeze) | ||
|
||
self._data = np.zeros((self.samples_per_frame, | ||
self._sample_shape.nchan), np.float32) | ||
self._unsliced_shape.nchan), np.float32) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since nchan
is an input parameter, it is not entirely illogical to keep it as a private variable, in which case one could do self._nchan
. But really not much of a benefit either -- fine to keep as is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I spent probably an hour yesterday experimenting with which I liked better. I like this because it sets a precedent for formats which don't record the sample shape.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, let's keep as is then.
By default, ``fh.read()`` returns complete samples, i.e. with all | ||
available threads, polarizations or channels. If we were only interested in | ||
decoding specific components of the complete sample, we can select them by | ||
passing indexing objects the ``subset`` keyword in open. For example, if we |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
objects in the subset
|
||
Since ``squeeze=False``, ``subset`` is converted from ``3`` to ``slice(3, 4, | ||
None)`` to retain dimensions of length unity. This behaviour is turned off | ||
when ``squeeze=True`` (see below). Like ``squeeze``, ``subset`` cannot be |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, I feel that we do not have to mention that squeeze
is immutable.
SampleShape(nthread=2, nchan=1) | ||
>>> fh.close() | ||
|
||
Here, we have also selected 1 and 3, and channel 0. No enclosing `tuple` is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand what the text tries to say here. There is only one channel, so we haven't selected anything. I would just delete the sentence.
use broadcasting to select specific threads from more than one sample shape | ||
dimension; see the Numpy documentation on `integer array indexing. | ||
<https://docs.scipy.org/doc/numpy/reference/arrays.indexing.html#integer-array-indexing>`_ | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is maybe the place to mention that, for the general case, things will be slightly faster with subset=slice(1, 4, 2)
instead of subset=[1, 3]
(unfortunately, this is not true for this specific example, though... so maybe just raise an issue to remind us).
baseband/dada/base.py
Outdated
if self.subset: | ||
data_slice = (data_slice,) + self.subset | ||
out[sample:sample + nsample] = ( | ||
self._frame[data_slice].squeeze() if self.squeeze else |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You have two different ways of applying the subset
-- here, you construct a slice and then use it; in some of the other formats, you slice data
sequentially. The one here is arguably better, though perhaps less clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made the switch to this syntax in the VLBI formats, but had to modify some of the lines for VDIF and Mark 4 (the latter because Mark 4 frame's __getitem__
forbids indexing).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you raise an issue about Mark4Frame
not allowing indexing? That may just be an oversight.
On my larger comment, see #127 - let's do that after merging this, since this is very close and very large. |
Addressed all issues. Before we merge I might want to squash the last two commits - I was hoping they'd be independent of one another but I fixed a fatal VDIF bug in the last one that was created in the second-last one. |
baseband/vdif/base.py
Outdated
# Set _thread_ids. If subsetting, decode first frameset again. | ||
if self.subset: | ||
# Squeeze in case subset[0] uses broadcasting. | ||
subset_0 = (self.subset[0].squeeze() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can be simplified by just doing
thread_ids = np.array(thread_ids)[subset[0]].squeeze()
I'll do it in a quick separate commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, that squeezes np.array([1])
into np.array(1)
. You can solve that by checking if the array's ndim
is larger than 0 before trying to turn it into a list, but that adds more checks to your line and makes things look confusing.
A subset argument can now be passed to stream readers in order to selectively read (and in the case of VDIF, selectively decode and read) components of the complete sample. thread_ids can no longer be passed to stream readers.
Also make squeeze (along with subset) immutable once set by the initializer. If squeeze=True, subset does not modify lone integers (this is now compatible with sample shape). Changed _sample_shape to hold the squeezed shape. _unsliced_shape now a namedtuple so it can be used by stream writers (and the M5B reader).
Also fixed how VDIF subsets.
@cczhu - I made some final edits and cleanups, and also rebased (in hindsight, that was a mistake since now it is hard to see what I changed, at least on github). I think we can merge if tests pass. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All OK now presumably
Before merging, would you mind squashing the last two of my commits (and
possibly yours as well)? The older commit has a bug that the newer one
fixes.
…On Tue, Mar 6, 2018 at 11:49 AM, Marten van Kerkwijk < ***@***.***> wrote:
***@***.**** approved this pull request.
All OK now presumably
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#118 (review)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAbCC2MaUdpoXlTAcYgci5B2lQ5Grpg2ks5tbr4tgaJpZM4RxiGB>
.
|
@cczhu - I did squash those commits - there are now only 3, your initial 2 (with reworded titles, to be within 72 char), and a final one combining the rest. Since tests passed, I'll merge! |
There are titles for commit messages? Is it like with docstrings where the
first line is 72 char long, but the rest can be in paragraph form?
…On Tue, Mar 6, 2018 at 12:14 PM, Marten van Kerkwijk < ***@***.***> wrote:
Merged #118 <#118>.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#118 (comment)>, or mute the
thread
<https://github.com/notifications/unsubscribe-auth/AAbCC-i4h668Y8h6AI_D5ev0v4CLinYTks5tbsPfgaJpZM4RxiGB>
.
|
Indeed. When you commit, the text in the editor screen tells you what is expected. |
Implemented the
subset
property in stream readers, which allows for selectively reading the components (threads, polarisation or channels) of a complete sample. Any indexing object accepted by anumpy.ndarray
can be used, except forNone
. For multi-dimensional indexing, the enclosing structure must be a tuple (i.e.(dim_1_indexer, dim_2_indexer, ...)
), so multi-dimensional arrays are not accepted. Advanced indexing that changes the dimensions of the sample shape is also not possible, except for passing single integers since these are transformed into slices withinVLBIStreamBase._get_subset_and_sample_shape
. Once set,subset
is used as-is to slice frame data in all formats except VDIF, where threads and channels are sliced separately to retain the selective decoding of frameset frames.subset
is a read-only property because it is currently infeasible to re-read a frame every timesubset
changes.Some notes on design choices:
i
for one dimension, subset's setter turns it intoslice(i, i+1)
. This is because anamedtuple
with N elements cannot be set with fewer than N objects, and figuring out which dimensions to excise from it (similar to what we do for squeezed shapes in thesample_shape
getter) requires essentially the same machinery as for converting integers to slices insubset
. The latter, however, retains user control of squeezing.find_nthreads
method inVDIFFileReader
.return cls(frames, header0)
at the end ofVDIFFrameSet.fromfile
. Switching that just toreturn cls(frames)
gives me a bunch of time offset errors in the test suite I don't fully understand._unsliced_shape
should be a tuple or named tuple. Since we use it in a few places (VDIF, GSB) to read new frames, we could initialize_unsliced_shape
as a genuine sample shape at the same time as_sample_shape
.