Support UBJSON-derived Binary JData (BJData) format #3336
Looks reasonable, but as the owner @nlohmann will need to weigh in. A few suggestions on the code.
After spending the past few days polishing this patch, all tests turned green in the last completed build. The suggestions by @gregmarr have also been addressed and incorporated. The PR is ready for someone to take another look.
I am not sure whether it's a good idea to add this to the library. My concerns are:
- It brings some 4K LOC to an already bloated library. The fact that it's bloated is not your fault, of course, but as long as we cannot make binary format support optional, this goes into every deployment of the library.
- I could not find a "final" specification, but rather a proof of concept or work in progress. Changing the specification then requires an adjustment of this implementation which most likely would be a breaking change. I'd rather not link this library's public API to such an external factor.
- I would like to see benchmarks that show the format is really superior to the existing formats in any aspect. On https://json.nlohmann.me/features/binary_formats/#sizes, you see that CBOR and MessagePack provide the best compression for several examples. Could you please check how BJData performs for these files?
@nlohmann, thanks for the comments. My replies are below.
A total of 3442 of the 4261 added lines are from the test directory. The remaining ~800 additions are double-counted because of the amalgamation, so the net addition to the library itself is only ~400 lines, which is <2% of the json.hpp line count. If the length of the test unit is your concern, I am happy to remove all tests that produce results identical to UBJSON and only keep those related to the new features. I had thought that keeping the two test units similar would make them easier to maintain going forward (and add extra robustness to the tests).
The spec you see in our GitHub repository is intended to be the stable version, and we are now moving on to implementing/disseminating it in various software tools. I totally agree that a stable spec is essential; that also aligns with our goals.
BJData is designed to be an improved version of UBJSON; the new data markers (`u`, `m`, `M`, `h`) fill in the missing unsigned-integer and half-precision types. As you saw from my initial post, this PR contains only the reader for the ND-array container (ND arrays are, appropriately, serialized to a 1D vector), but not the writer part, because I have not yet identified an internal data structure to store the dimension vector. However, my MATLAB JSON reader/writer, JSONLab, contains a full implementation of both the UBJSON and BJData formats (for MATLAB/Octave), so I was able to create the file size comparison using my parser (updated to use minimized .json files as the baseline).
Although this is a different library, I want to focus on the changes from UBJSON to BJData that resulted from the format improvements. As you can see, the behaviors of BJData and UBJSON are quite similar; the size improvement in BJData is most prominent for files dominated by packed ND arrays. Let me know what you think.
My mistake - I added the test suite to the added LOC. Sorry for that.
@nlohmann, I froze Draft 2 of the BJData specification so that we can refer to the implemented version using a stable document.
@nlohmann and @gregmarr, most suggested changes have been committed. @nlohmann: @gregmarr and I wanted to hear your opinion on some of the remaining review comments; see the discussions above. The updated library, if accepted, should be able to read/write most BJData files.
I do want to pick your brains on completing the ndarray optimized-header support. As I mentioned above, a major size-reduction mechanism offered by BJData is the recognition and collapsing of packed ND-array markers - you can see the nearly 20% size reduction over UBJSON in my benchmark above. To fully support this, on the reader side we need to settle on one of two possible representations for the decoded ND array; on the writer side, the corresponding packing logic is needed.
Without this feature, the current patch still meets most of our needs (in combination with our standardized data annotation method), but a full implementation of the spec would certainly be helpful for general cases. Happy to hear your thoughts on this.
Just brainstorming here. These both seem like a bit of magic and might be very confusing for the user, and I'm not even sure that the writer has the ability to support the first one. Could you create an object containing a vector of dimensions and then a 1D vector of values? The writer would then need to recognize this object and write it appropriately.
Another possibility is that the 1D vector first contains the number of dimensions, then the dimension values, and then the actual data. The downsides here are that these could be stored as doubles when they're really ints, and that you have to account for the extra indices at the beginning of the vector when accessing the data. (This seems to be essentially what is happening in the actual data file.)
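To make the two shapes concrete, here is a minimal sketch (the `dims`/`data` key names are placeholders chosen for illustration, not part of any spec or of this PR):

```cpp
#include <nlohmann/json.hpp>
using json = nlohmann::json;

int main()
{
    // Shape 1: an object pairing an explicit dimension vector with a
    // flat, row-major 1D value vector.
    json as_object = {
        {"dims", {2, 3}},
        {"data", {1, 2, 3, 4, 5, 6}}
    };

    // Shape 2: a single flat vector laid out as [ndims, dim1..dimN, values...];
    // consumers must skip the first 1 + ndims entries to reach the payload.
    json as_vector = {2, 2, 3, 1, 2, 3, 4, 5, 6};
}
```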
I suppose the choice of 1 vs 2 for the reader side depends on how you expect users to access the data. Either way, I think the data needs to be accessible to the user, unless you're just using this library to load the data and save it again, and don't expect the user to actually access the stored data. I don't know your expected users, but I think either your proposal 2 or my first proposal of an object would be the easiest for users to work with.
This was actually my initial plan, and the easiest to implement, because it is basically the same approach we took to serialize complex data structures at the data annotation level (defined in our JData spec). In our JData annotation format, ND arrays can be serialized in an object form (before storage in JSON/UBJSON/BJData/MessagePack/HDF5 etc.), as sketched below.
This form is still portable and JSON/UBJSON/BJData compliant. When serving such data to JData-aware programs, the JData annotation tags can be recognized and the object optionally assembled back into the ND array in the native format - as I did in my Python module and my MATLAB toolbox. The same data can also be consumed by non-JData-aware programs, where the packed ND-array data remains accessible via the generic JSON object keys. If both of you are on board with this approach, I will be happy to add this implementation.
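Concretely, for a 2x3 uint8 array, the JData annotated object form (using the `_ArrayType_`/`_ArraySize_`/`_ArrayData_` keys defined in the JData spec; this snippet is an illustration, not code from the PR) looks like:

```cpp
#include <nlohmann/json.hpp>
using json = nlohmann::json;

int main()
{
    // [[1,2,3],[4,5,6]] in annotated form: the element type and shape travel
    // as ordinary JSON keys, so any JSON/UBJSON/BJData parser can read it,
    // while JData-aware tools can rebuild the native ND array.
    json annotated = json::parse(R"({
        "_ArrayType_": "uint8",
        "_ArraySize_": [2, 3],
        "_ArrayData_": [1, 2, 3, 4, 5, 6]
    })");
}
```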
@gregmarr and @nlohmann, I've just pushed a few commits to finally complete this patch. Now json.hpp fully supports all features of the BJData format and is capable of round-trip handling of ndarray objects, including bi-directional translation between bjdata ndarray optimized containers and the JData-annotated JSON representation. Update (03/01/2022): please ignore the quoted question below; I figured it out and added the relevant tests.
Other than that, this patch is pretty much done. It adds ~600 lines of code to json.hpp, but enables reading/writing the BJData format, which inherits almost all the benefits of UBJSON (among which simplicity and quasi-human-readability are the most valuable, IMHO) yet extends it with complete binary type representations and packed ND-array support - which can lead to significant space savings when storing/exchanging scientific/imaging data files. Overall I think it would be a suitable addition to the already comprehensive binary-JSON readers/writers provided in this library. I am committed to supporting all code/tests related to this feature if accepted. Let me know what you think.
Add bjdata test files for unit testing of nlohmann/json#3336
There is now version 3.1.0 of the test data available: https://github.com/nlohmann/json_test_data/releases/tag/v3.1.0
In e6b2b73, I've updated the test data version so it can download the latest files. I also enabled the file-based unit tests for bjdata. It compiles and runs fine.
Can you please update to the latest develop branch?
@nlohmann, the msvc 2017 tests work now after rebasing, but msvc 2019-2022 still fail; the error was caused by missing the new test binary. Can you let me know where these CI files might be located?
The stack overflow can be fixed by adjusting this line https://github.com/nlohmann/json/blob/develop/test/CMakeLists.txt#L80 and adding the new test binary there. The version of the test data is set in only one place: https://github.com/nlohmann/json/blob/develop/cmake/download_test_data.cmake
@nlohmann, I am wondering whether you are considering accepting this PR. I see more PRs related to the binary readers/writers being prepared. I can resolve the current conflict, although I worry that more conflicts will come if we leave this open. Please let me know. Thanks!
I will approve it once I am back from vacation. Sorry for the inconvenience. Thanks for your work and patience.
@fangq I had a chance to think about potential implications for alternate string types some more and do not believe there are any issues.
@falbrechtskirchinger, thanks; let me fix the remaining CI errors resulting from merging fd2eb29.
Looks good to me.
Thanks a lot @fangq for your work and patience! @falbrechtskirchinger Let's fix the remaining issues in separate PRs.
Thank you @nlohmann and everyone who helped review this PR! Super excited that this format is now supported in this extremely popular (and extremely well-crafted) library. Thank you for all the wonderful work!
I am trying to reconstruct the data above. I come to the following vector sizes for canada.json, twitter.json, citm_catalog.json, and jeopardy.json. The numbers were calculated with:

```cpp
const auto ubjson_1_size = json::to_ubjson(j).size();
const auto ubjson_2_size = json::to_ubjson(j, true).size();
const auto ubjson_3_size = json::to_ubjson(j, true, true).size();
```

@fangq Any idea why my numbers differ so much?
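For the BJData side of the same measurement, the analogous calls would be (a sketch assuming `to_bjdata` mirrors `to_ubjson`'s optional `use_size`/`use_type` flags, as this PR proposes):

```cpp
// Hypothetical BJData counterparts of the UBJSON measurements above.
const auto bjdata_1_size = json::to_bjdata(j).size();
const auto bjdata_2_size = json::to_bjdata(j, true).size();
const auto bjdata_3_size = json::to_bjdata(j, true, true).size();
```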
@nlohmann, regarding the file size comparisons: as you can see from the above results, the current BJData output is nearly identical to that of UBJSON. This is expected - although the current reader can read all valid BJData files and the writer writes BJData-compliant files, the writer does not yet take full advantage of the BJData features to produce more compact output. As I mentioned in #3336 (comment), the main size-reduction mechanism in BJData is the ND-array optimized format, which collapses the type markers of all (uniformly typed) data elements in an otherwise nested array construct. To perform such compression automatically, we would need to extend the `binary_writer`.
The files I used to populate the previously mentioned file size comparison are attached below (jeopardy is not included because of its large size). All of these files were produced by my MATLAB/Octave json/ubjson/bjdata/msgpack parser JSONLab - because MATLAB/Octave natively supports ND-arrays, all data encoding and decoding can use the array information to fully compress. I can do a bit more in-depth comparison between the .ubj and .bjd files.
Thanks for the clarification - good to know I did not make any wrong calls. So what needs to be done to have BJData really use its potential in the benchmarks? Just extend the `binary_writer`?
Yes, I believe so. Other than that, an extra parameter may be needed to trigger such "deep compression". Currently, this PR supports round-trip conversion between a BJData ND-array and the JData-annotated object form shown earlier. The reason we had to use structured data on the JSON side is that there is no native C++ data structure that can hold this ND-array without losing the dimensional/type info (unless you are aware of a data structure that is more suitable). If a nested vector is to be recognized and packed as a BJData ND-array stream, then we may need to add a parameter to `to_bjdata`.
So basically, the reader is feature complete as it can read BJData created with this optimization, but the writer currently only creates output without that optimization? |
That is correct, with the exception that, if a BJData ND-array is read by the reader, it stays optimized when written back to disk. The only thing currently missing from the writer is the ability to recognize a packed ND-array in un-optimized input, such as a nested JSON array like [[1,2,3],[4,5,6]].
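A sketch of that asymmetry (assuming the annotated-object representation described above; `sample.bjd` is a hypothetical input file):

```cpp
#include <cstdint>
#include <fstream>
#include <iterator>
#include <vector>
#include <nlohmann/json.hpp>
using json = nlohmann::json;

int main()
{
    // Reading: an optimized ND-array arrives as a JData-annotated object,
    // so its shape/type information survives inside the json value.
    std::ifstream in("sample.bjd", std::ios::binary);
    std::vector<std::uint8_t> bytes((std::istreambuf_iterator<char>(in)),
                                    std::istreambuf_iterator<char>());
    json j = json::from_bjdata(bytes);

    // Writing: the annotated object is re-emitted as an optimized ND-array
    // container, while a plain nested array such as [[1,2,3],[4,5,6]] is
    // still written element by element, without the packed header.
    std::vector<std::uint8_t> out = json::to_bjdata(j, true, true);
}
```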
How do you achieve this? Where do you store this information? |
Is that because ND-arrays are stored as objects which hold the metadata in JData annotation keys?
Oof. I really need to take a more detailed look at the specification. I don't like the way metadata is leaked into the JSON values.
You can find the specification for the annotations here: |
Yes, it is not ideal. In MATLAB/Octave, I could use the native matrix data structure to store the ND-array information (including dimension and type info); the same holds in Python (numpy.ndarray) and JavaScript (numjs/ndarray, a 3rd-party lib). But for C++, at least from my limited reading of the std data structures, I could not find a native ND-array container to hold such information - one could use nested vector/array objects for ND-arrays, like nested `std::vector`s, as sketched below.
The JData spec aims to define a language-neutral approach to represent scientific data structures (ND arrays, complex-valued/sparse arrays, tables, maps, binary, compressed binary etc.), so that it can be used as a data exchange layer between programming environments. It uses JSON tags to serialize the metadata associated with each type of data structure, and can serve as a fallback for specialized data structures that are not supported in a given environment - for example, to support sparse arrays in Python without installing scipy.
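A sketch of the two C++-side options just mentioned (`NDArray` is a hypothetical helper type, not something this library provides):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Nested vectors can hold [[1,2,3],[4,5,6]], but nothing enforces equal row
// lengths, and the element type and shape remain implicit.
std::vector<std::vector<std::int8_t>> nested = {{1, 2, 3}, {4, 5, 6}};

// A flat buffer plus an explicit dimension vector matches what a BJData
// ND-array stores on disk, but it is a custom struct rather than a standard
// container that the json value type could adopt directly.
struct NDArray {
    std::vector<std::size_t> dims;  // e.g. {2, 3}
    std::vector<std::int8_t> data;  // e.g. {1, 2, 3, 4, 5, 6}, row-major
};
```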
Thanks for clarifying. It would be great if you could have a look at #3464 - in particular the documentation files. I'd really like to have the documentation on par with the other binary formats - in particular, covering the limitations of the implementation.
@nlohmann, do you want me to fill in the BJData documentation?
Yes, exactly.
No. #3464 (which I will merge to develop soon) has the current state. Best would be if you made a PR against #3464.
Dear Niels (@nlohmann) and developers, thank you for developing this great library!
I am currently working on a project funded by the US National Institutes of Health (NIH), named NeuroJSON (http://neurojson.org), with the goal of developing standardized data formats for data sharing within the broad neuroimaging community (MRI, fMRI, EEG etc.). For this purpose, we have adopted JSON and a UBJSON-derived binary JSON as our primary formats. We noticed that several prominent neuroimaging packages, for example FreeSurfer, use your library for their JSON support.
I would like to contribute a patch to read/write our UBJSON-derived binary JSON format - Binary JData (BJData). The format specification for BJData can be found here.
The BJData format extends UBJSON to address several limitations (see ubjson/universal-binary-json#109). These extensions include:

- new numeric type markers: `[u]` - uint16, `[m]` - uint32, `[M]` - uint64 and `[h]` - half/float16
- support for an optimized ND-array container via the `#` (count) marker, whose count value can now be a 1D integer array holding the dimensions. For example, a 2x3 2D `int8` array `[[1,2,3],[4,5,6]]` can be stored as `[ $ i # [ $ i # i 2 2 3 1 2 3 4 5 6`, where `# [ $ i # i 2 2 3` stores the dimensional vector (2x3) as the size, or alternatively as `[ $ i # [ i 2 i 3 ] 1 2 3 4 5 6`
- restrictions on which markers may follow the type marker (`$`): `[{SHTFTN` can not follow `$`, but `UiuImLMLhdDC` are allowed

In this patch, I extended json.hpp to read/write BJData-formatted binary files/streams. Two driver functions, `to_bjdata` and `from_bjdata`, were added. I've also added a unit-testing script modified from the UBJSON test file, with additional tests for the new data markers and the optimized N-D array container format (as well as for little-endianness). The test suite runs without any error.

The only item I need some help/feedback on is the storage of packed ND-arrays (i.e. numerical arrays of uniform type). UBJSON only supports 1D packed arrays via its optimized format (collapsing the type marker into the header following `$`), but efficiently storing N-D arrays is essential in the neuroimaging & scientific community.

My updated `binary_reader` can recognize the above N-D array optimized container, but I failed to find a place to store the dimensional vector in the json data structure (and subsequently use it to pack the ND array in the writer). So my current approach is to serialize the N-D array as a 1D vector after parsing. Because of this, the tests involving optimized ND-array constructs are not yet round-trip invariant, so I temporarily disabled those round-trip tests.

Feel free to review my patch; in the meantime, I am wondering if you have any suggestions/comments on how to better parse/handle the N-D packed array construct mentioned above. Other than that, the rest of the code seems to work quite well.
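Spelled out byte by byte, the 2x3 example above can be fed straight to the new reader (a sketch; per the current state of this patch, the ND-array is flattened, so the parse yields the 1D array `[1,2,3,4,5,6]`):

```cpp
#include <cstdint>
#include <vector>
#include <nlohmann/json.hpp>
using json = nlohmann::json;

int main()
{
    std::vector<std::uint8_t> bytes = {
        '[', '$', 'i', '#',          // typed ('i' = int8), counted array
        '[', '$', 'i', '#', 'i', 2,  // the count is itself an optimized
                                     // int8 array of 2 elements ...
        2, 3,                        // ... the dimension vector (2x3)
        1, 2, 3, 4, 5, 6             // row-major payload
    };
    json j = json::from_bjdata(bytes);
}
```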
thanks
Pull request checklist

Read the Contribution Guidelines for detailed information.

- The source code is amalgamated: after changing the sources in the `include/nlohmann` directory, run `make amalgamate` to create the single-header file `single_include/nlohmann/json.hpp`. The whole process is described here.

Please don't … `#ifdef`s or other means.