
Change energy dispersion format #67

Open
cdeil opened this issue Oct 4, 2016 · 19 comments

cdeil commented Oct 4, 2016

This is a suggestion for a change that I got via email from Michael Punch and @kosack, after I presented the FITS data formats for H.E.S.S. at the collaboration meeting last week.

There we were using EDISP in the current format (see here), with the y-axis range extending up to y = e_reco / e_true = 2, which clipped the distribution.

To avoid this, they suggested using y = log(e_reco) instead, which is what OGIP RMF does, and using that also for the DL3 responses (where there are extra axes, e.g. for field-of-view offset, and no sparse-matrix zero-suppression scheme).

cdeil commented Oct 4, 2016

I don't have a strong opinion either way. Yes, using y = log(e_reco) is a bit simpler and avoids e.g. the clipping issue. But the clipping can also be avoided by just using a larger range, e.g. up to y = e_reco / e_true = 3 or 5.

Note that EDISP can quickly dominate the size of the DL3 data files that are distributed to end users, so this question of how to store it well is relevant. The question of whether we use any compression to squeeze out zero bytes is related; comments on #46 would be welcome.

For now my recommendation would be to stick with what we have and just extend the range in the HESS HAP exporters to avoid the clipping. It's very hard to find people to work on exporters or contribute to the format spec at all. Once CTA starts to produce EDISP (cc @TarekHC), and someone does a small study of how to do it well and makes a concrete proposal for how CTA EDISP DL3 should be distributed to users, I think we can add or change this. Of course, if people want to change now, that's fine with me as well. My point is that what we have now works for current IACTs, and it's not clear to me that the proposed change really is an improvement.

kosack commented Oct 4, 2016

As for the file size, somebody should look into using FITS tiled image compression on it, which should reduce the size by a large factor (though if it is implemented as a TABLE and not an IMAGE extension, that may not work). That could be a future study, though. Even using gzip/bzip2 on a sparsely populated table should give reasonable results, since most elements in the off-diagonal parts of the matrix will be 0.

https://heasarc.gsfc.nasa.gov/docs/software/fitsio/compression.html
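
For the IMAGE-extension case, a minimal sketch with astropy.io.fits, which implements the tiled image compression convention linked above (the matrix shape and values here are made up for illustration):

```python
import numpy as np
from astropy.io import fits

# Hypothetical sparse migration matrix (rows: e_true bins, cols: e_reco bins)
rng = np.random.default_rng(0)
matrix = np.zeros((60, 100), dtype=np.float32)
matrix[:, 45:55] = rng.random((60, 10)).astype(np.float32)  # near-diagonal band

# GZIP_1 is lossless; RICE_1 would quantize floating-point data
hdu = fits.CompImageHDU(data=matrix, name="EDISP", compression_type="GZIP_1")
fits.HDUList([fits.PrimaryHDU(), hdu]).writeto("edisp_compressed.fits", overwrite=True)
```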

mdpunch commented Oct 4, 2016

Hi Christoph,

Another suggestion I made is to instead store EDISP as yy = e_true / e_reco:

  • For the general case e_true ~= e_reco, so yy = e_true / e_reco ~= 1.
  • For the case near and just below threshold, the system triggers on upward fluctuations of showers, so e_reco is overestimated. Then yy = e_true / e_reco < 1 and goes further towards zero as the energy decreases (whereas the current y = e_reco / e_true increases as you go below threshold and gets clipped easily, as can be seen in the plots).

Energy reconstruction hardly ever gives a systematic underestimate of the energy, so saving e_true / e_reco is naturally lower-bounded by 0, and the upper bound is likely to be closer to 1 than to 2.

So, though I agree with @kosack that it would be best to stick to the RMF standard, a compromise would be to use e_true / e_reco on the y-axis instead.

Good Luck,
Michael.
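
A toy numpy sketch of this argument (all numbers hypothetical): near threshold only upward-fluctuating showers trigger, so y = e_reco / e_true blows past any fixed cap, while yy = e_true / e_reco rarely exceeds values slightly above 1:

```python
import numpy as np

rng = np.random.default_rng(0)
e_true = 10 ** rng.uniform(-1.5, 0.5, 100_000)          # TeV, toy spectrum
e_reco = e_true * rng.lognormal(0.0, 0.3, e_true.size)  # toy reconstruction
trig = e_reco > 0.4                                     # toy trigger threshold

y = e_reco[trig] / e_true[trig]  # current axis: unbounded above near threshold
yy = 1.0 / y                     # proposed axis: lower-bounded by 0
print(f"triggered events with y  > 2: {np.mean(y > 2):.1%}")
print(f"triggered events with yy > 2: {np.mean(yy > 2):.1%}")
```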

cdeil commented Oct 4, 2016

@mdpunch - Thanks for the comment about the yy = e_true / e_reco < 1 proposal and for mentioning its advantage.

One comment concerning RMF: it has a somewhat complex zero-suppression scheme (storing groups of channels above threshold) and a corresponding convention for storing that in FITS. We might not want to use this for DL3: if we zero-suppress, we might want a more generic scheme that we then also use e.g. for the table PSF format, or maybe we just want to let gzip on the FITS files take out the zeros and not do any compression in the format itself. Also, RMF doesn't support extra axes, like the offset axis that we currently have in the EDISP format.
I'm just mentioning this because in my opinion it means that using RMF directly doesn't work for us anyway, so the argument that we should use it because it's an existing standard isn't very strong.

So I think what we need is a small study / evaluation / summary of which of the options suggested here really have advantages (in terms of simplicity / accuracy / storage or processing space):

  1. use y = e_reco / e_true like we do now.
  2. use y = log10(e_reco) like RMF
  3. use y = e_true / e_reco to avoid the large y values at the energy threshold from option 1

My guess is that it doesn't matter much; all would work.
Am I missing something? Is there a big advantage to e.g. option 2 that I'm not aware of?
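
For concreteness, a hedged numpy sketch (binnings and numbers made up) of how one row of an option-1 matrix maps onto the other two axes; per e_true bin, the three options are relabelings of the same variable:

```python
import numpy as np

# One row of a hypothetical option-1 matrix: P(migra) at fixed e_true
migra_edges = np.linspace(0.01, 2.0, 100)
centers = 0.5 * (migra_edges[:-1] + migra_edges[1:])
row = np.exp(-0.5 * ((centers - 1.0) / 0.2) ** 2)  # toy distribution
e_true = 0.4                                       # TeV, this row's bin center

# Option 2: y = log10(e_reco) = log10(migra) + log10(e_true)
log_ereco_edges = np.linspace(-1.5, 0.5, 41)
row_opt2, _ = np.histogram(np.log10(centers) + np.log10(e_true),
                           bins=log_ereco_edges, weights=row)

# Option 3: yy = e_true / e_reco = 1 / migra (same content, axis reversed)
yy_centers = 1.0 / centers
```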

One concern we have in HESS is low MC stats, e.g. at high energies. We will have to start implementing EDISP smoothing methods. Some axis / binning formats might be better suited for this than others; I don't know.

@registerrier, @GernotMaier, all - Thoughts on how you'd like EDISP to be stored?

kosack commented Oct 4, 2016

The only advantage of option 2 is that that is how you use the data: no further "untransformation" is needed to get a probability distribution for a given E_reco or E_true band, you just take that row or column of the matrix. So making a conversion from E_true to E_reco is relatively trivial.

Note that in any case, we should also store some sort of error bars with the matrix (something we learned when doing sensitivity curves). The reason for that is that when doing the transform, one needs to weight by the bin error in order to avoid crazy artifacts coming from sampling noise.

I would not suggest following the RMF format, but rather just the concept (we can always make a converter to export to RMF). You can just store an image of log(E_true) vs log(E_reco), and zero out everything that has too low stats. Gzip will compress that efficiently; I just checked with an example matrix (you get a factor of up to 10 compression). The reasoning may also be that for CTA E_reco ~ E_true, but if this is to be general, that may not be the case; especially for some other analysis like cosmic rays, it could be very different in shape.
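
The compression check is easy to reproduce; a sketch with a toy mostly-zero matrix (real ratios depend on sparsity and dtype):

```python
import gzip
import numpy as np

matrix = np.zeros((100, 100), dtype=np.float32)
idx = np.arange(100)
matrix[idx, idx] = 1.0  # populate the diagonal only

raw = matrix.tobytes()
packed = gzip.compress(raw)
print(f"compression factor: {len(raw) / len(packed):.1f}x")
```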

cdeil commented Oct 4, 2016

Note that in any case, we should also store some sort of error bars with the matrix (something we learned when doing sensitivity curves). The reason for that is that when doing the transform, one needs to weight by the bin error in order to avoid crazy artefacts coming from sampling noise.

Currently the assumption is that DL3 IRF producers create responses like EDISP (but also the others, like AEFF, BKG, PSF) that are noise-free and well-sampled!
Science tools just use the responses; there is currently no plan to teach them about bin errors and sampling noise.

We should make that clearer in the current DL3 specs.

@kosack - If you think this doesn't work for CTA or is not a good way to do things, please open a new separate issue! There have been discussions about IRF errors of course, but as far as I know no one is concretely working on implementing a DL3 format for that or science tool code to handle that input.

I would not suggest following the RMF format, but rather just the concept (we can always make a converter to export to RMF).

That already exists in Gammapy (and probably Gammalib as well); there's an example of how to do it here.

kosack commented Oct 4, 2016

Currently the assumption is that DL3 IRF producers create responses like EDISP (but also the others like AEFF, BKG, PSF) that are noise-free and well-sampled!
Science tools just use the responses, there is currently no plan to teach them about bin errors and sampling noise.
We should make that clearer in the current DL3 specs.

Agreed, that is fine. It just needs to be a clear specification that all IRFs need to be smoothed or well-enough sampled such that noise is not a problem. How do you then deal with edges where the sampling is bad and you really don't know the response?

kosack commented Oct 4, 2016

Having a NaN value in bins that should not be trusted (or something similar) could be one way to avoid border cases.
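
One possible realization of that idea, as a sketch rather than an agreed format rule (the count threshold and matrix here are made up):

```python
import numpy as np

# Hypothetical per-bin MC event counts behind a normalized migration matrix
rng = np.random.default_rng(0)
counts = rng.poisson(5.0, size=(60, 100))
matrix = counts / counts.sum(axis=1, keepdims=True)

MIN_COUNTS = 10  # made-up threshold for "trustworthy" bins
matrix = np.where(counts >= MIN_COUNTS, matrix, np.nan)
```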

cdeil commented Oct 4, 2016

Let's keep this issue focused on EDISP y-axis, and try to have one GH issue per suggestion / topic.
I've opened #68 for "Add note that responses should be smooth and well-sampled".

mdpunch commented Oct 4, 2016

Okay, but does there need to be a long discussion on this? For me it's clear that y = e_reco / e_true is causing information loss, and it seems evident that it needs to be changed:
[image: edisp, energy dispersion plot reconstructed from the stored y = e_reco/e_true values]
Actually, this last plot looks weird, and I think the axis labels are swapped; can you check?
Assuming that, if you have an e_reco of 400 GeV, the clipping of the plot hides the fact that your e_true can have a huge error bar.

cdeil commented Oct 4, 2016

Okay, but does there need to be a long discussion on this? For me it's clear that y = e_reco / e_true is causing information loss, and it seems evident that it needs to be changed

It is significant work to change this. In addition to the spec change, @jknodlseder would have to change ctools, @mimayer the HESS ParisAnalysis exporters, we the HESS HAP exporters and Gammapy, and someone the VERITAS exporters.

If this is a good change that results in a better format for the next decade, we should of course do it. I'm just mentioning this to make it clear that this is not something we should change often, i.e. what we change to now should be what we use for at least the next ~ year (until we learn more and find an even better way to do it).

@jknodlseder @mimayer - Do you agree the suggested change here is a good one and that we should make it now? Would you be willing to implement it in Gammalib / the HESS ParisAnalysis exporters? Do you have a preference for using y=log(E_reco) or for y=e_true/e_reco instead of y=e_reco/e_true like we do now?

Actually, this last plot looks weird, and I think the axis labels are swapped, can you check?

Yes, that was a plotting bug in the axis labels. Thanks for pointing it out!
I fixed it just now and filed a reminder issue for other EnergyDispersion class improvements for Gammapy: gammapy/gammapy#734

Just to have another data point, and to illustrate that the cutoff issue in the HAP EDISP can easily be avoided: the HESS ParisAnalysis exporters currently use much coarser binning in e_true, and a much larger range in migra = e_reco/e_true:
[image: download, HESS ParisAnalysis EDISP matrix with coarser e_true binning and larger migra range]

mdpunch commented Oct 5, 2016

Another "data point", as you say, is this plot from CAT (http://arXiv.org/abs/astro-ph/9804133v1):
[image: cat_energydispersion, 68% confidence intervals of fitted energy vs true energy from the CAT paper]
It is important that the flattening at the lower end is kept, for the definition of spectrometric thresholds; to quote from that paper:

Close to the threshold, however, the fitted energy Ef is overestimated as a consequence of the trigger selection. [...] The bias induced by the trigger selection at different energies is best illustrated by plotting 68% confidence intervals for Ef as a function of the true value Eγ used in the simulation (Fig.). It can be seen that for Ef below 350 GeV for vertical showers, only an upper limit can be safely derived for Eγ. Therefore, spectrum measurement is reliable only above a spectrometric threshold, that is, in the region in which Ef depends linearly on E. It is somewhat higher than the nominal threshold which is relevant for source discovery.

...whereas the usual threshold definition (the maximum of the event rate when convolved with a Crab spectrum) gives:

The energy corresponding to the maximum event rate (“mode energy”) will be considered as the nominal threshold of the telescope; it varies from 250 to 350 GeV between zenith and Z = 30° and increases to ≈ 700 GeV for Z = 45°.

(You can understand my interest in this now, given the comments at the HESS meeting last week concerning threshold definitions.)
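
A hedged numpy sketch of this CAT-style diagnostic (toy event sample, made-up trigger threshold): per e_true bin, compute the median and central 68% interval of e_reco, and look for where the median stops flattening and starts depending linearly on e_true:

```python
import numpy as np

rng = np.random.default_rng(1)
e_true = 10 ** rng.uniform(-1.3, 1.0, 200_000)   # TeV
e_reco = e_true * rng.lognormal(0.0, 0.25, e_true.size)
trig = e_reco > 0.25                             # toy trigger in reco energy
e_true, e_reco = e_true[trig], e_reco[trig]

edges = np.logspace(-1.3, 1.0, 24)
for lo, hi in zip(edges[:-1], edges[1:]):
    sel = (e_true >= lo) & (e_true < hi)
    if sel.sum() < 100:
        continue
    q16, q50, q84 = np.percentile(e_reco[sel], [16, 50, 84])
    print(f"e_true {lo:5.2f}-{hi:5.2f} TeV: "
          f"e_reco median {q50:5.2f}, 68% interval [{q16:5.2f}, {q84:5.2f}]")
```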

jknodlseder commented Jan 13, 2017 via email

jknodlseder commented Jan 13, 2017 via email

mdpunch commented Jan 13, 2017

Hi @jknodlseder ,

Information is lost because of clipping.

At the threshold, only showers with upward fluctuations trigger the detector. These showers are reconstructed to be at the threshold energy, whereas their true energy is lower. So e_reco/e_true for these showers tends to be very large.

You can literally see this effect on the e_reco vs. e_true plot reconstructed from the stored y = e_reco/e_true values; see my comment from 4th October.

In addition to the low-threshold effect, there also seems to be an effect at high energies; I haven't tried to figure out why there are so many events with e_reco/e_true >> 1, but that effect is there too.

So, to avoid information loss, one needs a much larger range for the migration matrix in e_reco/e_true vs e_true.

Storing y* = e_true/e_reco gets around this clipping effect, because it is rare that the reconstructed energy is much below the true energy. It is the simplest solution if we want to store things in linear scale.

Another possibility, if it's really more practical to have e_reco/e_true, is to store log(e_reco/e_true) instead. I think this could nicely take the outliers into account without needing much bigger matrices, and it even seems more natural, since ideally that distribution should be Gaussian, so storing it in log is better.

Michael.
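
A hedged numpy sketch of why the log option needs only a modest axis range (toy log-normal dispersion, width made up):

```python
import numpy as np

rng = np.random.default_rng(2)
migra = rng.lognormal(0.0, 0.3, 100_000)  # toy e_reco/e_true sample

print(f"fraction outside linear range [0, 2]:   {np.mean(migra > 2):.4f}")
log_migra = np.log(migra)
print(f"fraction outside log range [-1.5, 1.5]: {np.mean(np.abs(log_migra) > 1.5):.6f}")
```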

jknodlseder commented Jan 13, 2017 via email

mdpunch commented Jan 13, 2017

Well, either invert (from e_reco/e_true to e_true/e_reco), or store as log... or both!
Maybe log(e_reco/e_true) is simplest to implement right now?
Whatever works...

jknodlseder commented Jan 13, 2017 via email

cdeil commented Apr 5, 2017

@woodmd and I discussed this just now, and he also said that he would try y = log(e_reco/e_true) first, like @mdpunch proposed in a comment above.

Another orthogonal suggestion he made was to try splines, because they:

  • are pretty robust to construct (compared to fitting other shapes)
  • are efficient to store (just the spline parameters)
  • are efficient to evaluate (also derivatives and integrals, not just values)

I completely agree that this should be done. But it only makes sense to continue the discussion when there is someone willing to implement / prototype something, which as far as I can see is not the case at the moment. Hopefully this discussion will be useful to whoever does this work in the future.
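
To make the spline idea concrete, a minimal scipy sketch for one e_true slice (toy data; the actual parametrization and smoothing settings would need the study mentioned above):

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

log_migra = np.linspace(-1.0, 1.0, 50)
pdf = np.exp(-0.5 * (log_migra / 0.3) ** 2)  # toy EDISP slice
pdf += np.random.default_rng(3).normal(0.0, 0.02, pdf.size)  # sampling noise

spline = UnivariateSpline(log_migra, pdf, s=0.05)  # s controls smoothing
print(spline(0.1))                 # cheap evaluation
print(spline.derivative()(0.1))    # derivatives come for free
print(spline.integral(-1.0, 1.0))  # and so do integrals
```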
