Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify what kinds of arrays are allowed for IRFs #145

Open
maxnoe opened this issue Mar 27, 2020 · 16 comments
Open

Clarify what kinds of arrays are allowed for IRFs #145

maxnoe opened this issue Mar 27, 2020 · 16 comments

Comments

@maxnoe
Copy link
Member

maxnoe commented Mar 27, 2020

Currently the standard asks for the storage of two arrays of lower and upper bin edges.

In python:

# 50 bins between 1 GeV and 1 PeV
bin_edges = np.logspace(-3, 3, 51)
energ_lo = bin_edges[:-1]
energ_hi = bin_edges[1:]

same goes for theta

I would propose to change this for the next iteration of the standard to change this to just storing a
single array with bin edges.

@tibaldo
Copy link
Collaborator

tibaldo commented Mar 27, 2020

This seems dangerous to me. Are we guaranteed that the bins will always be contiguous? Is there only one unambiguous way in your proposed format to determine the edges of a bin?
Also, this is inconsistent with well-established conventions, e.g. the EBOUNDS extension in OGIP (link).
What are the needs that motivate your request exactly?

@TarekHC
Copy link
Member

TarekHC commented Mar 27, 2020

Yes, overlapping bins require the two arrays. This should not be changed.

@cboisson
Copy link
Collaborator

cboisson commented Mar 27, 2020 via email

@maxnoe
Copy link
Member Author

maxnoe commented Mar 28, 2020

Ok, I see that it's probably better to be kept as is, however I think we should then add some clarifications to the standard on what is allowed or expected and what not.

Are we guaranteed that the bins will always be contiguous?

We want to enable non-contiguous e.g. effective areas? What are the use cases where this is not the case?
How would the science tools deal with non-contiguous effective areas @adonath @jknodlseder ?

Yes, overlapping bins require the two arrays. This should not be changed.

Again, I'd like to know some use cases here. I think overlapping bins might be usefull for interpolation or getting closer to something like "differential effective area" than "total effective area in bins of energy".

It would be good to collect references for this and make clear what is allowed in the standard and what is not.

In my opinion, we should only add well defined things to these standards. So if we want to give the possibility for non-contiguous effective areas or overlapping bins, this should be mentioned in the respective documents.

A standard that basically allows storing anything as long as it has the correct names and shape is not very helpfull if the semantics of what is stored are not clear.

@maxnoe maxnoe changed the title Remove duplication of bin edges Clarify what kinds of arrays are allowed for IRFs Mar 28, 2020
@jknodlseder
Copy link
Collaborator

jknodlseder commented Mar 28, 2020 via email

@maxnoe
Copy link
Member Author

maxnoe commented Mar 30, 2020

Ok, I assume that also answers what my next question would have been: why it was decided to use binary table extensions with a single row instead of image hdus for the irfs.

@jknodlseder
Copy link
Collaborator

jknodlseder commented Mar 30, 2020 via email

@kosack
Copy link
Collaborator

kosack commented Mar 30, 2020

why it was decided to use binary table extensions with a single row instead of image hdus for the irfs.

I think also it's a question of binning - often you might have non-uniform binning in an IRF (even if right now we do not do so I think), for example to deal with lack of statistics at high energies. The IMAGE extension can't deal with that.

@registerrier
Copy link
Collaborator

Indeed science tools tend to take bin center from the bin edges and perform some type of interpolation.
Yet, if no assumption is to be made on the binning (e.g. varying bin sizes or missing bins), there is no unambiguous determination of what is the bin center (linear, log, sqrt, other?). Depending on bin sizes, this will lead to significant differences in estimated values.

No IRF is differential in true energy/position. Therefore, when storing an IRF, one could consider that specifying one energy/position and the corresponding area/energy dispersion is less ambiguous. In particular, if MC simulations made to extract the IRFs were to be done at a single energy and position.

Standards such as OGIP/Caldb, consider that energy bins have to be small and that linear interpolation is the default:
https://heasarc.gsfc.nasa.gov/docs/heasarc/caldb/docs/memos/cal_gen_92_003/cal_gen_92_003.html#tth_sEc7

There is no unambiguous interpolation scheme. Should one assume the response is flat inside a bin, linearly varying between two bin centers, log-linearly varying, log-log etc. We might need some specifications here as well.

@maxnoe
Copy link
Member Author

maxnoe commented Mar 30, 2020

@registerrier yes, this is exactly what I feared. That currently this standard is too under-specified to be useful on its own.

So at least adding ENERGY and THETA for the central point that should be used for interpolation would make sense, right? bin edges (low, high) could be optional then.

@lmohrmann
Copy link
Collaborator

@maxnoe I think most commenters preferred to keep the low and high bin edges.
@registerrier, you would propose to add some information on the default interpolation scheme to be used for any of the axes?

@tibaldo
Copy link
Collaborator

tibaldo commented Mar 31, 2020

I think Régis has a valid point here. I am under the impression that volume is not an issue here. Would it be sensible and useful to add an extra column with a (recommended) reference energy for each bin, but also leaving the information of the energy boundaries for completeness?

@maxnoe
Copy link
Member Author

maxnoe commented Mar 31, 2020

Yes, this is what I proposed in my last comment, an additional ENERGREF or ENERGY or how we want to call it.

Btw, FITS column names are not bound to the uppercase and eight character limitatio of header keywords.

Is the another technical reason we limit ourselves to eight characters and uppercase?

@adonath
Copy link
Collaborator

adonath commented Mar 31, 2020

In addition to what @registerrier commented:

Given the way we compute the IRFs, we should definitely keep the bounds information of the IRF bins. For me the only "strictly correct" way to interpret the information stored in the MC IRFs, is that the value represents the expected counts in a given parameter interval, which is an integrated quantity (even if divided by the bin width...). To keep the full information we should provide the integration interval.

In Gammapy we have two ways of interpolating the information stored in the IRFs:

  • Bilinear (sometime log-log or semi-log) interpolation of the mean values of the IRFs in the parameter bin, assuming either the linear or log center of the bin.
  • Quadratic interpolation of the cumulative distribution. This is an alternative method I have tried, that is closer to the "strictly correct" interpretation of the information I mentioned above. This methods avoids the need to define a "reference" value (or bin center...). It assumes that the value stored in the IRF corresponds to the integral in the given interval and assigns this value to the upper edge of the bin (assuming a linear distribution within the bin). This method also avoids the problem of extrapolation of the IRFs, because the cumulative distribution is 0 if evaluated outside the lowest bin edge and equivalent the total number of counts if evaluated outside the uppermost bin edge. An illustration of the method is here: https://github.com/adonath/notebooks-public/blob/master/histogram_resampling.ipynb

I think defining a reference value is only meaningful, if we provide an assumption on how the events are distributed within the bin in addition (in fact once this is defined the reference value can be easily computed from this...). In Gammapy we have a meta interp keyword for our MapAxis object which defines the scaling ("log", "linear", "sqrt") of the axis. This is then used to compute bin centers, logarithmic bin centers etc. and used to choose the interpolation method (e.g. linear, log-log, or semi-log) as well. I think providing this information on the "scaling" of the IRF axis in the meta data might be useful. On the other hand, in most cases there is a very plausible choice anyway (but I guess)

In Gammapy we already make assumptions of this kind e.g. background is interpolated in log-log in energy and linearly in spatial axes. The main advantage is a better precision when working with coarser bins. As long as the IRFs are binned fine enough (whatever this means) it should not matter anyway.

@maxnoe I think so far there is no use-case for non-contiguous effective areas (or IRFs in general) and I'm not sure either, whether it will ever arise. To me the argument for the separate min and max bounds columns is simply that if the data is stored in a table, the column length is the equivalent to the column length of the data. If there was a column with edges, it would be of length N + 1 and you would need a fill value for the data...

@maxnoe
Copy link
Member Author

maxnoe commented Mar 31, 2020

@adonath actually, the IRFs are stored in a single rows in the tables.

The columns don't need to have the same shape, e.g. it would be perfectly valid to store:

Energy (51, )
Theta (11, )
Aeff (50, 10)

@adonath
Copy link
Collaborator

adonath commented Mar 31, 2020

@maxnoe Yes, you are right. I forgot the data is stored row-wise. For some reason I had in mind we store it as a flattened ND array in a table column.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

9 participants