Write wideband data with CPHD metadata and add a Python list-of-dicts interface for PVP parameters #351

Closed
wants to merge 37 commits into from

Conversation

@elliot-klein commented Sep 21, 2020:

  • Add cphd::CPHDWriter::writeWideband() to write a NumPy array of complex64 wideband data along with CPHD Metadata and PVPBlock objects
  • Add methods to cphd::PVPBlock to convert to/from a list of Python dictionaries of PVP data (a list of channels, each with a dict of PVP parameter names mapping to NumPy arrays)
  • Make cphd::PVPBlock use std::map instead of std::unordered_map to store added PVP parameters. This was needed to support older versions of SWIG.
  • Add Python tests for these features and move some common test code into util.py. To install util.py along with the test folder, use waf install (not just waf install --target=test_.....py)
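For illustration, the list-of-dicts layout described above might look like the following. The parameter names, channel count, and vector count here are hypothetical placeholders, not taken from this PR:

```python
import numpy as np

num_vectors = 128  # vectors in this channel (hypothetical)

# One dict per channel; each key is a PVP parameter name mapping to a NumPy array
pvp_data = [
    {
        "TxTime": np.zeros(num_vectors, dtype="float64"),      # size-1 parameter
        "TxPos": np.zeros((num_vectors, 3), dtype="float64"),  # size-3 parameter
    },
]
```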

Issues

  • cphd::CPHDWriter::writeWideband() doesn't support compressed data, and nothing currently prevents a caller from attempting (and failing) to write compressed data with it
  • cphd::CPHDWriter::writeWideband() calls cphd::CPHDWriter::write(), which was previously %ignored by SWIG, so the Python CPHDWriter now has both a write() and writeWideband() method, and it may be unclear which one to use.
  • Using PVPBlock list-of-dicts with custom PVP parameters and/or multiple components is not fully supported:
    • Using custom float or complex PVP parameters requires using astype('float32') and astype('complex64'), otherwise the values are slightly inconsistent when reading to/from a list-of-dicts
    • Mixing data types within a single parameter (e.g. X='F8';Y='I8';) is not supported; parameters with multiple components can only have one type. This only affects custom parameters, since none of the default parameters currently require mixed types.
    • Custom PVP parameters with multiple components aren't supported
    • Default PVP parameters that are not of size 1 or 3 are not supported (not currently an issue since there are none)
    • If using a string PVP parameter that is longer than the specified size, NumPy will silently truncate it (e.g. a parameter specified as 'S8' but given '123456789' will store '12345678')
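The silent truncation in the last point is standard NumPy behavior for fixed-width byte strings, easy to reproduce on its own:

```python
import numpy as np

# An 'S8' dtype keeps at most 8 bytes; the ninth character is dropped silently
value = np.array("123456789", dtype="S8")
print(value.item())  # b'12345678'
```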

…mplate CPHDWriter::write() for complex<float>
@dstarinshak self-requested a review September 22, 2020 11:59
* \param cols
* Expected number of cols in widebandArray
*/
void writeWideband(PVPBlock pvpBlock, PyObject* widebandArray, size_t rows, size_t cols)
@dstarinshak (Contributor) commented:
I think I see what you were trying to do here, but I don't think it will work in general. I think a cleaner approach would be to circumvent using CPHDWriter::write() altogether.

The write() routine expects to see a giant concatenated blob of wideband data for all the channels at once. It would be nice if all the per-channel wideband were the same size. Then you could have a single array that's shaped (num_channels, num_rows, num_cols). Getting the underlying array pointer would ensure everything's contiguous in memory.

However, each channel can have its own dimensions. In that case, it seems more natural to have the wideband data expressed as a collection (e.g., list, tuple, etc) of numpy arrays. You can't call the write() routine directly because there's no guarantee that the individual numpy arrays are contiguous in memory with each other. We could make them contiguous by doing a copy, but that can get very expensive very fast. Instead, we can modify this extended function to do the following:

void writeWideband(PVPBlock pvpBlock, PyObject* widebandArrays)
{
    const size_t numChannels =
        static_cast<size_t>(PySequence_Length(widebandArrays));
    ASSERT(numChannels == mMetadata.data.getNumChannels());
    writeMetadata(pvpBlock);
    if (mMetadata.data.getNumSupportArrays() != 0)
    {
        // supportData would be an additional parameter, as in CPHDWriter::write()
        if (supportData == nullptr)
        {
            throw except::Exception(Ctxt("SupportData is not provided"));
        }
        writeSupportData(supportData);
    }
    writePVPData(pvpBlock);

    for (size_t ii = 0; ii < numChannels; ++ii)
    {
        const types::RowCol<size_t> dims(
            mMetadata.data.getNumVectors(ii),
            mMetadata.data.getNumSamples(ii));
        PyObject* widebandArray = PySequence_GetItem(widebandArrays, ii);
        const types::RowCol<size_t> arrayDims =
            numpyutils::getDimensionsRC(widebandArray);
        ASSERT(dims == arrayDims);
        auto widebandPtr =
            numpyutils::getBuffer<std::complex<float>>(widebandArray);
        writeCPHDData<std::complex<float>>(widebandPtr, dims.area(), ii);
    }
}

Although this duplicates a lot of CPHDWriter::write(), it leaves the Python level looking clean.
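As an aside, the copy cost mentioned above is easy to see with made-up channel shapes: per-channel arrays held in a Python list live in unrelated buffers, so producing the single contiguous blob that write() expects forces a full copy of every channel:

```python
import numpy as np

ch0 = np.zeros((100, 50), dtype="complex64")
ch1 = np.zeros((200, 75), dtype="complex64")  # channels may differ in shape

# The arrays are not adjacent in memory, so concatenating copies both
blob = np.concatenate([ch0.ravel(), ch1.ravel()])
```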

I would suggest also adding the following to simplify writing from Python. Hide the CPHDWriter object from the Python API:

%pythoncode
%{
    import numpy as np  # needed for the ndarray check below

    def write_cphd(
        pathname,
        metadata,
        pvp_block,
        wideband_arrays,
        schema_paths=None,
        num_threads=None,
        scratch_space=None,
    ):
        """Docstring"""
        if schema_paths is None:
            schema_paths = []  # Not sure if this is the best default or checking the user's env
        if num_threads is None:
            num_threads = 0
        if scratch_space is None:
            scratch_space = 4 * 1024 * 1024

        # Support writing a single channel with a single ndarray
        if isinstance(wideband_arrays, np.ndarray):
            wideband_arrays = [wideband_arrays]

        writer = CPHDWriter(
            metadata,
            pathname,
            schema_paths,
            num_threads,
            scratch_space,
        )
        writer.writeWideband(pvp_block, wideband_arrays)
%}

@elliot-klein (Author) commented:

Thanks @dstarinshak! I updated writeWideband() and added the write_cphd() method.

How would you recommend handling support arrays? write_cphd() currently does not accept any as a parameter (but writeWideband() does). CPHDWriter has two writeSupportData() methods -- one that writes support arrays individually given their string ID and a data pointer (without including any padding from the metadata), and another that writes all the support arrays given a pointer to contiguous support array data. Two ideas:

  • write_cphd() could receive a sequence of NumPy support arrays, make a copy so that the data is contiguous, and pass that to writeWideband() (no changes needed to writeWideband())
  • write_cphd() could receive a Python dictionary mapping string support array IDs to NumPy support array data, and writeWideband() could individually write each array by ID (maybe modify that writeSupportData() version to include padding)
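A rough sketch of the first option, with hypothetical array shapes (note a real implementation would also need to honor any padding specified in the metadata):

```python
import numpy as np

# Per-ID support arrays as the user might pass them in
support_arrays = [
    np.ones((4, 8), dtype="float32"),
    np.zeros((2, 8), dtype="float32"),
]

# Flatten into one contiguous buffer suitable for the all-at-once
# writeSupportData() overload
contiguous = np.concatenate([arr.ravel() for arr in support_arrays])
```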

Thank you!

@dstarinshak (Contributor) commented:

I think both approaches for writing support arrays sound good. The first one sounds like less work. It would only involve adding a support_arrays=None kwarg to write_cphd and parsing out what to do with it inside. As long as you document what form support_arrays should take in the docstring, I think we're good.

@elliot-klein (Author) commented Jan 7, 2021:

How would you recommend reading the support arrays back in for test_cphd_write_wideband? The first method works well for writing, but I haven't tested it since it doesn't look like there are Python bindings yet for reading support arrays.

CPHDReader loads a SupportBlock, and we can get the support array metadata (string identifiers and dimensions) from Metadata.data.supportArrayMap.

SupportBlock has methods to read support arrays by string ID into a pre-allocated BufferView or a ScopedArray. Based on Wideband.read (which calls (1), (2), (3)), we can cast the __array_interface__ NumPy data pointer from long long to a sys::ubyte* to create a BufferView<sys::ubyte> for SupportBlock::read(). That compiles and runs, but the data in the NumPy array is not equal to what was written. Do those casts and methods sound correct?
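For reference, the pointer being cast comes from NumPy's array interface; on the Python side it is just an integer address paired with a read-only flag:

```python
import numpy as np

arr = np.zeros(16, dtype=np.ubyte)
ptr, readonly = arr.__array_interface__["data"]  # (address, read-only flag)
```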

Thank you!

@dstarinshak (Contributor) commented:

Sorry, I lost track of this thread. I suggest simply leaving support arrays out of the bindings for the time being. Is there a clean way to recognize if a CPHD contains support data in the bindings? If so, we could log a warning to the user that "support array parsing is not implemented" and that we're only reading and returning a subset of their CPHD. I would rather not raise a NotImplementedError, since the remainder of the CPHD should be valid for parsing.

@elliot-klein (Author) commented:

We can get the number of support arrays with metadata.data.getNumSupportArrays(), but for the moment there's not a great place to check that in the bindings, since we aren't extending CPHDReader or SupportBlock at all.

It's possible to get a SupportBlock object with CPHDReader::getSupportBlock(), but then there's no Python interface to get the actual support block data. To write a CPHD, you need to provide the support array data (as well as the SupportBlock from the metadata), so I don't think it would be possible for someone to load a CPHD file that has support arrays and then e.g. write it to another file without realizing that the support arrays are missing.
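If a check point does become available, the warning itself could be as simple as the following sketch, where a plain integer stands in for metadata.data.getNumSupportArrays():

```python
import warnings

def warn_if_support_arrays(num_support_arrays):
    """Warn (rather than raise) when support arrays are present but unread.

    num_support_arrays stands in for metadata.data.getNumSupportArrays().
    """
    if num_support_arrays != 0:
        warnings.warn(
            "support array parsing is not implemented; "
            "returning the CPHD without its support arrays"
        )

warn_if_support_arrays(0)  # no support arrays: silent
```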

@dstarinshak (Contributor) commented:

Consider running pylint on any (non-SWIG-generated) Python files to check style

@elliot-klein (Author) left a comment:

Sorry, I ran out of time to completely wrap this up. I think this is what remains:

  • Fix the 'U1' and 'I1' PVP params not being correctly converted from the list-of-dicts
  • Decide what to do about support array reading (might be just fine as-is for now). I pushed my WIP changes to the wip-support-arrays branch of my fork in case you want to use those as a starting point.
  • I flagged some duplicated code for generating the list of PVP params in from_list_of_dicts(), you might be able to come up with a cleaner way to do that
  • Test string and multi-size custom PVP params
  • Make any final style fixes
  • Update the SWIG-generated files

Thanks again for all your help with this

# Populate PVPBlock object from pvp_data
for channel_index, channel_data in enumerate(pvp_data):
    for vector_index in range(metadata.getNumVectors(channel_index)):
        for name, data in channel_data.items():
@elliot-klein (Author) commented:

It might be good to check that the list of keys in channel_data matches the list of PVP params (the required ones, any optional ones, and any added ones)
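That check could be a small helper along these lines (parameter names here are hypothetical placeholders):

```python
def validate_pvp_keys(channel_data, expected_params):
    """Raise if a channel dict's keys don't match the expected PVP set."""
    missing = expected_params - set(channel_data)
    extra = set(channel_data) - expected_params
    if missing or extra:
        raise ValueError(
            f"PVP parameter mismatch: missing={sorted(missing)}, "
            f"extra={sorted(extra)}"
        )

# Passes: keys match the expected required/optional/added parameters
validate_pvp_keys({"TxTime": [], "TxPos": []}, {"TxTime", "TxPos"})
```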

pvp_info = {}

# Set required and optional PVP parameters
for name in dir(metadata.pvp):
@elliot-klein (Author) commented:

There's a bit of code that's duplicated here from above. Here, we need to get the list of PVPBlock setter methods for all default PVP params (required ones plus any optional parameters that are in use) and ideally build a dict that maps PVP param names to setter methods
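One way to build that name-to-setter map is to scan dir() for setter methods. This is sketched against a fake stand-in class, since the real SWIG-generated PVPBlock isn't available here:

```python
class FakePVPBlock:
    """Stand-in for the SWIG PVPBlock bindings (hypothetical methods)."""
    def setTxTime(self, channel, vector, value):
        pass
    def setTxPos(self, channel, vector, value):
        pass

def build_setter_map(pvp_block):
    """Map PVP parameter names to their PVPBlock setter methods."""
    return {
        name[len("set"):]: getattr(pvp_block, name)
        for name in dir(pvp_block)
        if name.startswith("set") and callable(getattr(pvp_block, name))
    }

setters = build_setter_map(FakePVPBlock())
```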
