Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Histogram collections #45

Open
ryanmwhitephd opened this issue Jul 26, 2018 · 7 comments
Open

Histogram collections #45

ryanmwhitephd opened this issue Jul 26, 2018 · 7 comments

Comments

@ryanmwhitephd
Copy link

ryanmwhitephd commented Jul 26, 2018

Create a collection of histograms in a file. Utility function to merge multiple files into single file.
Provides a way to distribute processing and histogram creation, store histograms and merge.

I implemented rudimentary versions with json and google protobuf. Tensorflow stores summary data and histograms in protocol buffers, so I gave that a try. The protobuf have versioning capability, which is just one potential benefit. See my fork and the protobuf tag.

Happy to work on this some more with your input.

One thing that would help is to make 'name' required, such that the Summary protocol buffer map <string, Histogram> takes as key the histogram name. This is important for the merging. Possibly merging can be implemented without converting back into the histogram, so we only deserialize, sum and write out new summary.

@ryanmwhitephd
Copy link
Author

Histograms and protocol buffers in context of R presented in this paper:
https://arxiv.org/abs/1401.7372

@janpipek
Copy link
Owner

@janpipek
Copy link
Owner

Hi Ryan,

your suggestion looks nice. I have to study the protocol buffers to properly understand all the implications and benefits, so sorry for the delay (in the past and also future).

Just to give a short initial feeling (which may change after proper studying):

  • I like the option and I would surely include it.
  • I would probably try to put protobuf support in a separate module, not integral part of the class (but this is negotiable ;-)).
  • We cannot enforce anyone to name their histograms. From the analysis perspective, this is an extra step that makes sense only when storing. If the protocol requires this, we may come up with some auto-naming. Or even throw an exception, no problem, just the name cannot be a requirement to construct histograms.

One additional note:

  • I am (very slowly) rethinking the structure of physt (see branch "v04"). I don't have the time (or mental abilities) to do it right (re-iterating now and then the design before commiting myself to it) so it will not come in a near future, but it's worth expecting. If you were interested in commenting the ideas, you'd be welcome.

Thanks a lot and I am starting with your article :-) I will try to come back with more competent comments asap.

Cheers,
Jan

@ryanmwhitephd
Copy link
Author

Hi,

Thanks for the reply. I implemented really quickly to get some idea how it might work. I, too, would likely need to think about this further. I've been mulling over how to do this somewhat correctly with little progress. For metadata information, the protocol buffers, in general appear to be the way to properly maintain, sustain, and pass the information around. The cross language capability and versioning are the key features that make this an ideal format (for any metadata).

After some further thought, I also have the same conclusions.

  • Separate module which is more of a wrapper to physt which provides the conversion to/from protobuf messages.
  • Naming of histograms is important when creating collections, so again should be separate from physt and not required. With protobuf, we can use:
    map<string, Histogram>
  • If properly designed, underlying changes to how physt works should not change the data model. A histogram is a histogram, bins, frequencies, errors2, overflow, underflow. Everything else would be extensible meta data that should be easy to version with the protocol buffer.

Tensorflow has a histogram proto, but lacks some important information that is contained in physt. Nevertheless, worthwhile as reference:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/framework/summary.proto

Ryan

@janpipek
Copy link
Owner

janpipek commented Sep 21, 2018

Hi Ryan,

finally I got to reading about protobufs and your code. I partly copied lines from your commits, partly wrote code from scratch (in order to fit with restructured physt.io module and to deal with multidimensional histograms). Please have a look (version as of 915a0cb) and let me know if you find the new release to your liking.

I am still in doubt about the Collection - whether to make it a collectively manipulable entity (with shared bininnigs, meta data etc.)... This part of API will probably change (hopefully keeping the old message readable).

Also let me know if you want more credits than a mention in README, you deserve it for the idea and the initial implementation.

Thanks

@ryanmwhitephd
Copy link
Author

ryanmwhitephd commented Oct 10, 2018 via email

@janpipek
Copy link
Owner

Hi, you are welcome :-)

@janpipek janpipek added this to the 0.7 milestone Nov 12, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants