
Allow pycbc_fit_sngls_over_multiparam to take in multiple files #4670

Merged

Conversation

GarethCabournDavies
Contributor

Update pycbc_fit_sngls_over_multiparam to take in multiple --template-fit-file inputs

Why?

This will be useful if, for example, we generate pycbc_fit_sngls_by_template output for part of the data (e.g. a day) and then want to smooth over many days at once; in the live search this could be done in a rolling window.

How?

  • First, read in the files and concatenate their datasets together: template_id, median_sigma, count_above_thresh, count_in_template and fit_coeff.
  • While doing this, check that all the attributes which should match (ifo, sngl_stat, fit_threshold) actually do; otherwise, raise a RuntimeError.
  • For each of nabove, ntotal and invalphan, average values with identical template_id over files before doing the smoothing over parameters as before. This works because these datasets are added linearly during smoothing; a sketch of this step follows the list.
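
A minimal sketch of that combination step, assuming the datasets named above sit at the top level of each fit file and that every template_id appears in each file (file names are hypothetical, not the exact PR implementation):

```python
# Sketch only: combine per-day fit files and average within each
# template before the usual parameter-space smoothing.
import h5py
import numpy as np

fit_files = ['fits_day1.hdf', 'fits_day2.hdf', 'fits_day3.hdf']  # hypothetical

tid, nabove, ntotal = [], [], []
for fname in fit_files:
    with h5py.File(fname, 'r') as fits:
        tid.append(fits['template_id'][:])
        nabove.append(fits['count_above_thresh'][:])
        ntotal.append(fits['count_in_template'][:])

tid = np.concatenate(tid)
nabove = np.concatenate(nabove)
ntotal = np.concatenate(ntotal)

# np.bincount sums the weights for each template_id; dividing by the
# number of occurrences gives the per-template mean. Note the result
# is float64 even though the input counts are integers.
occurrences = np.bincount(tid)
nabove_avg = np.bincount(tid, weights=nabove) / occurrences
ntotal_avg = np.bincount(tid, weights=ntotal) / occurrences
```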

Nuances

  • Here we take a mean of median_sigma over the different files. median_sigma is meant to be a representative value for each template in that IFO, and the median taken earlier already removes any outliers, so taking the mean afterwards should be safe.
  • The fit_by_template output group now contains the mean over files of the fit_by_template values. As with median_sigma, this group is later used to find particularly noisy, isolated templates, so this should be okay.

Testing

I have run the new code with one input file against the old code and compared the output files against one another. This was done for all the different smoothing methods [see note on n_closest below]. Every dataset and (shared) attribute was compared between the new and old outputs, and all are identical in this case.

The files differ by more than just numerical noise in a couple of minor ways:

  • The attributes in the output file now include the sngl_stat and fit_function from the fit_by_template attributes
  • The fit_by_template/count_above_thresh and fit_by_template/count_in_template datasets change from int64 to float64 when more than one file is given, to allow the smoothing over files to give non-integer count values.

To test taking in multiple files, I have provided the same file two or three times; this should result in exactly the same output, since e.g. mean(0,1,2,3) = mean(0,1,2,3,0,1,2,3).
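
A quick numpy check of that duplication invariance (generic, not from the test suite):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
# Feeding the same values in twice must leave the mean unchanged
assert np.mean(x) == np.mean(np.concatenate([x, x]))
```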

In these tests, all new/old datasets agree to within numpy.isclose; a more stringent comparison shows that all new/old datasets are identical apart from some variation in the median_sigma dataset. median_sigma values change by up to 9e-10 (the values are in the thousands, and we take a log of this in the statistic, so I don't think this matters too much).
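
For reference, the dataset-by-dataset comparison can be sketched as below; the file names are hypothetical and this is not the exact test script:

```python
import h5py
import numpy as np

# Walk every dataset in the old file and compare against the new one
with h5py.File('old_fits.hdf', 'r') as old, h5py.File('new_fits.hdf', 'r') as new:
    def compare(name, obj):
        if isinstance(obj, h5py.Dataset):
            close = np.allclose(obj[:], new[name][:])
            exact = np.array_equal(obj[:], new[name][:])
            print(f'{name}: allclose={close}, identical={exact}')
    old.visititems(compare)
```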

Efficiency

The smoothing over files uses a methodology similar to the efficient 1D tophat smoothing, so this step takes less than a second for a full O4 offline bank.
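
For context, the cumulative-sum trick behind efficient 1D tophat smoothing looks roughly like this; a generic sketch, not the PyCBC implementation:

```python
import numpy as np

def tophat_smooth(values, half_width):
    """Windowed mean over +/- half_width bins, computed in O(n)."""
    n = len(values)
    # Prepend a zero so csum[i] is the sum of values[:i]
    csum = np.concatenate([[0.0], np.cumsum(values)])
    lo = np.clip(np.arange(n) - half_width, 0, n)
    hi = np.clip(np.arange(n) + half_width + 1, 0, n)
    # Each window sum is a difference of two cumulative sums
    return (csum[hi] - csum[lo]) / (hi - lo)
```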

Other notes

  • I think this change means we could parallelise pycbc_fit_sngls_by_template in the offline search to work over different parts of the bank, but haven't thought too much about how to implement that or whether it would even be useful
  • I have additionally fixed a bug in selecting templates for the n_closest smoothing method. I did this for both the 'master' copy of the code, and the new version. The results after this fix were the same as for other smoothing methods.

@tdent
Contributor

tdent commented Mar 25, 2024

FWIW, I just noticed the indentation of the input checking line err_txt = "--smoothing-keywords must take input (etc) looks off .. 8 spaces not 4?


```python
# If we are using multiple files, we should 'smooth' within each template
# before smoothing over templates going forward. As the templates will
# all have identical template_ids and have the same number of values,
```
Contributor

This explanation is not quite clear: which part of this calculation is 'taking the mean of means'? The discussion also seems to mix up the implementation aspect (getting entries with identical tids) with the correctness of what is implemented.

What I think this is saying is that out of the long array of n_templates * n_files values, averaging within each template involves getting the n_files occurrences which have the same tid, that explains the first part about implementation.

Then you want to argue that the averaging here does not bias the final result. I am not sure that necessarily requires the same number of per-file entries for each template, but the linearity argument basically takes care of it.
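
That linearity argument can be demonstrated in a few lines: for any linear (weighted-sum) smoothing, averaging per-file values first and then smoothing gives the same answer as smoothing each file and then averaging. A generic sketch with toy numbers:

```python
import numpy as np

rng = np.random.default_rng(42)
per_file = rng.random((3, 100))  # toy values: 3 files x 100 templates
weights = rng.random(100)        # linear smoothing weights for one output bin

# Smooth each file separately, then average the smoothed results
a = np.mean(per_file @ weights)
# Average over files first, then smooth once
b = per_file.mean(axis=0) @ weights
assert np.isclose(a, b)
```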

@tdent tdent left a comment

See multiple suggestions for rewriting/adding comments. Otherwise, this seems reasonable.

@tdent
Contributor

tdent commented Mar 25, 2024

I think the main question on implementation is why we necessarily should forbid trying to average fit results from different ifos and/or different fit settings. Doing so would be messy in terms of recording this (as attrs or whatever), but basically the job of checking that the inputs are sane/consistent should be done outside this code unless it would make the calculation impossible or meaningless.

GarethCabournDavies and others added 4 commits March 26, 2024 09:38
Co-authored-by: Thomas Dent <thomas.dent@usc.es>
Co-authored-by: Thomas Dent <thomas.dent@usc.es>
@GarethCabournDavies
Contributor Author

> I think the main question on implementation is why we necessarily should forbid trying to average fit results from different ifos and/or different fit settings. Doing so would be messy in terms of recording this (as attrs or whatever), but basically the job of checking that the inputs are sane/consistent should be done outside this code unless it would make the calculation impossible or meaningless.

Different IFOs should be okay to use, though I'm not sure how attrs would be set in that case, and we would not be able to use the result in the later statistic, so I think disallowing this case is reasonable.

I think different fit settings would make the calculation meaningless, certainly for fit_function. I could be persuaded on stat_threshold, but I think that estimator would be biased, and again the later statistic calculation would not make sense.

```python
if k not in attr_dict:
    # It's the first time we encounter this attribute
    attr_dict[k] = fits.attrs[k]
elif k == 'ifo':
    # We don't mind if this attribute is different
```
Contributor

Note here also that the output file attrs will only be for one ifo even if more than one is input?

Contributor Author

I've added this into the comment, and also added a warning if files from multiple IFOs are used.

@tdent tdent left a comment

Minor changes on comments / line ordering

@GarethCabournDavies GarethCabournDavies merged commit aa8358f into gwastro:master Apr 8, 2024
33 checks passed
@GarethCabournDavies GarethCabournDavies deleted the trigger_fits_in_live branch April 8, 2024 08:30