Allow pycbc_fit_sngls_over_multiparam to take in multiple files #4670
Conversation
FWIW, I just noticed the indentation of the input checking line

    # If we are using multiple files, we should 'smooth' within each template
    # before smoothing over templates going forward. As the templates will
    # all have identical template_ids and have the same number of values,
This explanation is not quite clear: which part of this calculation is 'taking the mean of means'? The discussion also seems to mix up the implementation aspect (getting entries with identical tids) with the correctness of what is implemented.
What I think this is saying is that, out of the long array of n_templates * n_files values, averaging within each template involves getting the n_files occurrences which have the same tid; that explains the first part, about implementation.
Then you want to argue that the averaging here does not bias the final result. I am not sure if that necessarily requires the same number of per-file entries for each template, but the linearity argument basically takes care of it.
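The linearity point can be checked numerically: when every template contributes the same number of per-file entries, taking per-template means and then averaging them agrees with averaging the concatenated array directly. This is a hedged sketch, not the script's code; `n_templates`, `n_files` and the layout of `template_ids` are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_templates, n_files = 5, 3

# Simulated concatenation over input files: each template_id appears
# once per file, so there are n_templates * n_files values in total.
values = rng.random(n_templates * n_files)
template_ids = np.tile(np.arange(n_templates), n_files)

# Average within each template across files (the n_files occurrences
# sharing a tid) ...
per_template_mean = np.array(
    [values[template_ids == tid].mean() for tid in range(n_templates)]
)

# ... then a linear operation over templates (here a plain mean) gives
# the same answer as operating on all raw values at once, because every
# template contributes the same number of entries.
assert np.isclose(per_template_mean.mean(), values.mean())
```

With unequal per-template counts the two would generally differ, which is why the equal-entries assumption matters for the simple reshape-style implementation.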
See multiple suggestions for rewriting/adding comments. Otherwise, this seems reasonable.
I think the main question on implementation is why we necessarily should forbid trying to average fit results from different ifos and/or different fit settings. Doing so would be messy in terms of recording this (as attrs or whatever), but basically the job of checking that the inputs are sane/consistent should be done outside this code, unless inconsistency would make the calculation impossible or meaningless.
Co-authored-by: Thomas Dent <thomas.dent@usc.es>
Different IFOs should be okay to use, though I'm not sure how attrs would be set in this case, and we would not be able to use this in the later statistic, so I think disallowing this case is reasonable. I think different fit settings would make the calculation meaningless, definitely.

    # It's the first time we encounter this attribute
    attr_dict[k] = fits.attrs[k]
elif k == 'ifo':
    # We don't mind if this attribute is different
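The attribute-consistency logic under discussion might look something like this standalone sketch. It is a hypothetical, simplified version: `combine_attrs`, the attribute names, and the error message are illustrative, not the exact script code.

```python
# Hedged sketch: collect attrs across input fit files, tolerating a
# differing 'ifo' but refusing differing fit settings.
def combine_attrs(file_attrs_list):
    """file_attrs_list stands in for each HDF file's .attrs mapping."""
    attr_dict = {}
    for fits_attrs in file_attrs_list:
        for k, v in fits_attrs.items():
            if k not in attr_dict:
                # It's the first time we encounter this attribute
                attr_dict[k] = v
            elif k == 'ifo':
                # We don't mind if this attribute is different, but note
                # the output will only record one ifo even if several
                # are input
                continue
            elif attr_dict[k] != v:
                # Differing fit settings would make the combined
                # calculation meaningless
                raise RuntimeError(
                    "Attribute %s differs between input files" % k
                )
    return attr_dict
```

Usage: `combine_attrs([{'ifo': 'H1', 'fit_threshold': 6.0}, {'ifo': 'L1', 'fit_threshold': 6.0}])` succeeds with one recorded ifo, while mismatched `fit_threshold` values raise `RuntimeError`.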
Note here also that the output file attrs will only be for one ifo, even if more than one is input?
I've added this into the comment, and also added a warning if files from multiple IFOs are used.
Minor changes on comments / line ordering
Update `pycbc_fit_sngls_over_multiparam` to take in multiple `--template-fit-file` inputs

Why?
This will be useful, e.g. if we generate `pycbc_fit_sngls_by_template` results for part of the data (e.g. a day) and then want to smooth over many days at once; we could do this in a rolling window in the live search.

How?
- Read the datasets `template_id`, `median_sigma`, `count_above_thresh`, `count_in_template` and `fit_coeff` from each file.
- Check that the attributes which should match between files (`ifo`, `sngl_stat`, `fit_threshold`) do. Otherwise, raise a `RuntimeError`.
- For `nabove`, `ntotal` and `invalphan`, smooth identical `template_id`s over files before doing the smoothing over parameters as before. This works as these datasets are linearly added for smoothing.

Nuances
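The over-files step can be sketched as a reshape-and-mean, assuming each file contributes one value per template and the concatenated arrays are ordered so each template's entries are contiguous (names and layout here are illustrative, not the script's code):

```python
import numpy as np

def average_over_files(values, n_files):
    """Average the n_files entries belonging to each template.

    Assumes `values` is the concatenation over files of per-template
    values, ordered so each template's n_files entries are contiguous
    (e.g. after sorting by template_id).
    """
    values = np.asarray(values, dtype=float)
    # Reshape to (n_templates, n_files) and average along the file
    # axis; fully vectorised, so fast even for large template banks.
    return values.reshape(-1, n_files).mean(axis=1)

# Example: two files, three templates, values grouped by template_id
nabove = np.array([10., 12., 3., 5., 7., 9.])
print(average_over_files(nabove, 2))  # [11. 4. 8.]
```

Because these datasets enter the parameter smoothing linearly, averaging them per template first does not change the final smoothed result.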
- We take the mean of `median_sigma` over different files. `median_sigma` is meant to be a representative value for each template in that IFO, and taking the median previously removes any outliers, so taking the mean after this should be safe.
- The `fit_by_template` output group now contains the `fit_by_template` values' mean average over files. Like with `median_sigma`, this group is later used to find particularly noisy, isolated templates, so this should be okay.

Testing
I have run the new code with one input file against the old code, and tested the output files against one another. This was done for all different smoothing methods [see note on n_closest below]. Every dataset and (shared) attribute was compared between new/old outputs and are all identical in this case.
The files differ by more than just numerical noise in a couple of minor ways:
- `sngl_stat` and `fit_function` from the `fit_by_template` attributes
- The `fit_by_template/count_above_thresh` and `fit_by_template/count_in_template` datasets change from int64 to float64 when more than one file is given, to allow the smoothing over files to give non-integer count values.

For taking in multiple files, I have tested providing the same file two / three times; these should result in exactly the same output, e.g. mean(0,1,2,3) = mean(0,1,2,3,0,1,2,3).
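The identical-output expectation holds because the mean is invariant under duplicating the whole sample:

```python
import numpy as np

x = np.array([0., 1., 2., 3.])
# Supplying the same file twice duplicates every per-template value;
# the mean is unchanged: mean(x) == mean(concat(x, x))
assert np.isclose(np.mean(x), np.mean(np.concatenate([x, x])))
print(np.mean(x))  # 1.5
```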
In these tests, all new/old datasets are the same to within a `numpy.isclose` comparison; a more stringent comparison shows that all new/old datasets are identical except for variation in the `median_sigma` dataset. `median_sigma` values change by up to 9e-10 (values are in the thousands, and we take a log of this in the statistic, so I don't think this matters too much).

Efficiency
The smoothing over files uses a similar methodology to the efficient 1D tophat smoothing, so this step takes less than a second for a full O4 offline bank.
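For reference, the efficient 1D tophat idea replaces a per-point Python loop with cumulative sums, making the cost O(n) regardless of window width. This is a hedged, generic sketch of that technique (`tophat_smooth` and its signature are illustrative, not pycbc's implementation):

```python
import numpy as np

def tophat_smooth(values, half_width):
    """Mean of values within +/- half_width samples of each point.

    Uses a cumulative sum so each windowed mean is a difference of two
    prefix sums: O(n) total, independent of the window width.
    """
    n = len(values)
    csum = np.concatenate([[0.0], np.cumsum(values)])
    # Clip window edges at the array boundaries
    lo = np.maximum(np.arange(n) - half_width, 0)
    hi = np.minimum(np.arange(n) + half_width + 1, n)
    return (csum[hi] - csum[lo]) / (hi - lo)

print(tophat_smooth(np.array([1., 2., 3., 4., 5.]), 1))
# [1.5 2.  3.  4.  4.5]
```

The over-files averaging is even simpler (fixed-size groups, so a reshape suffices), which is why it adds negligible runtime.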
Other notes
- `pycbc_fit_sngls_by_template` in the offline search could work over different parts of the bank, but I haven't thought too much about how to implement that or whether it would even be useful.
- A fix to the `n_closest` smoothing method: I did this for both the 'master' copy of the code and the new version. The results after this fix were the same as for other smoothing methods.