
Allow pycbc_fit_sngls_over_multiparam to take in multiple files #4670

Merged

Conversation

GarethCabournDavies
Contributor

Update pycbc_fit_sngls_over_multiparam to take in multiple --template-fit-file inputs

Why?

This will be useful if, for example, we generate pycbc_fit_sngls_by_template output for part of the data (e.g. a day) and then want to smooth over many days at once; in the live search this could be done in a rolling window.

How?

  • First, read in the files and concatenate their datasets together: template_id, median_sigma, count_above_thresh, count_in_template and fit_coeff.
  • While doing this, check that all the attributes which should match (ifo, sngl_stat, fit_threshold) actually do; otherwise, raise a RuntimeError.
  • For each of nabove, ntotal and invalphan, average values with identical template_id over files before doing the smoothing over parameters as before. This works because these datasets are added linearly during smoothing; a sketch of this step follows the list.
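
A minimal sketch of that combination step, assuming the datasets named above sit at the top level of each fit file and that every template_id appears in each file (file names are hypothetical, not the exact PR implementation):

```python
# Sketch only: combine per-day fit files and average within each
# template before the usual parameter-space smoothing.
import h5py
import numpy as np

fit_files = ['fits_day1.hdf', 'fits_day2.hdf', 'fits_day3.hdf']  # hypothetical

tid, nabove, ntotal = [], [], []
for fname in fit_files:
    with h5py.File(fname, 'r') as fits:
        tid.append(fits['template_id'][:])
        nabove.append(fits['count_above_thresh'][:])
        ntotal.append(fits['count_in_template'][:])

tid = np.concatenate(tid)
nabove = np.concatenate(nabove)
ntotal = np.concatenate(ntotal)

# np.bincount sums the weights for each template_id; dividing by the
# number of occurrences gives the per-template mean. Note the result
# is float64 even though the input counts are integers.
occurrences = np.bincount(tid)
nabove_avg = np.bincount(tid, weights=nabove) / occurrences
ntotal_avg = np.bincount(tid, weights=ntotal) / occurrences
```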

Nuances

  • Here we take a mean of median_sigma over the different files. median_sigma is meant to be a representative value for each template in that IFO, and the median taken earlier already removes any outliers, so taking the mean afterwards should be safe.
  • The fit_by_template output group now contains the mean over files of the fit_by_template values. As with median_sigma, this group is later used to find particularly noisy, isolated templates, so this should be okay.

Testing

I have run the new code with one input file against the old code and compared the output files against one another. This was done for all the different smoothing methods [see note on n_closest below]. Every dataset and (shared) attribute was compared between the new and old outputs, and all are identical in this case.

The files differ by more than just numerical noise in a couple of minor ways:

  • The attributes in the output file now include the sngl_stat and fit_function from the fit_by_template attributes
  • The fit_by_template/count_above_thresh and fit_by_template/count_in_template datasets change from int64 to float64 when more than one file is given, to allow the smoothing over files to give non-integer count values.

To test taking in multiple files, I have provided the same file two or three times; this should result in exactly the same output, since e.g. mean(0,1,2,3) = mean(0,1,2,3,0,1,2,3).
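
A quick numpy check of that duplication invariance (generic, not from the test suite):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
# Feeding the same values in twice must leave the mean unchanged
assert np.mean(x) == np.mean(np.concatenate([x, x]))
```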

In these tests, all new/old datasets agree to within numpy.isclose; a more stringent comparison shows that all new/old datasets are identical apart from some variation in the median_sigma dataset. median_sigma values change by up to 9e-10 (the values are in the thousands, and we take a log of this in the statistic, so I don't think this matters too much).
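
For reference, the dataset-by-dataset comparison can be sketched as below; the file names are hypothetical and this is not the exact test script:

```python
import h5py
import numpy as np

# Walk every dataset in the old file and compare against the new one
with h5py.File('old_fits.hdf', 'r') as old, h5py.File('new_fits.hdf', 'r') as new:
    def compare(name, obj):
        if isinstance(obj, h5py.Dataset):
            close = np.allclose(obj[:], new[name][:])
            exact = np.array_equal(obj[:], new[name][:])
            print(f'{name}: allclose={close}, identical={exact}')
    old.visititems(compare)
```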

Efficiency

The smoothing over files uses a methodology similar to the efficient 1D tophat smoothing, so this step takes less than a second for a full O4 offline bank.
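
For context, the cumulative-sum trick behind efficient 1D tophat smoothing looks roughly like this; a generic sketch, not the PyCBC implementation:

```python
import numpy as np

def tophat_smooth(values, half_width):
    """Windowed mean over +/- half_width bins, computed in O(n)."""
    n = len(values)
    # Prepend a zero so csum[i] is the sum of values[:i]
    csum = np.concatenate([[0.0], np.cumsum(values)])
    lo = np.clip(np.arange(n) - half_width, 0, n)
    hi = np.clip(np.arange(n) + half_width + 1, 0, n)
    # Each window sum is a difference of two cumulative sums
    return (csum[hi] - csum[lo]) / (hi - lo)
```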

Other notes

  • I think this change means we could parallelise pycbc_fit_sngls_by_template in the offline search to work over different parts of the bank, but haven't thought too much about how to implement that or whether it would even be useful
  • I have additionally fixed a bug in selecting templates for the n_closest smoothing method. I did this for both the 'master' copy of the code, and the new version. The results after this fix were the same as for other smoothing methods.

@tdent
Contributor

tdent commented Mar 25, 2024

FWIW, I just noticed the indentation of the input checking line err_txt = "--smoothing-keywords must take input (etc) looks off .. 8 spaces not 4?


```python
# If we are using multiple files, we should 'smooth' within each template
# before smoothing over templates going forward. As the templates will
# all have identical template_ids and have the same number of values,
```
Contributor

This explanation is not quite clear: which part of this calculation is 'taking the mean of means'? The discussion also seems to mix up the implementation aspect (getting entries with identical tids) with the correctness of what is implemented.

What I think this is saying is that out of the long array of n_templates * n_files values, averaging within each template involves getting the n_files occurrences which have the same tid, that explains the first part about implementation.

Then you want to argue that the averaging here does not bias the final result. I am not sure that necessarily requires the same number of per-file entries for each template, but the linearity argument basically takes care of it.
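
That linearity argument can be demonstrated in a few lines: for any linear (weighted-sum) smoothing, averaging per-file values first and then smoothing gives the same answer as smoothing each file and then averaging. A generic sketch with toy numbers:

```python
import numpy as np

rng = np.random.default_rng(42)
per_file = rng.random((3, 100))  # toy values: 3 files x 100 templates
weights = rng.random(100)        # linear smoothing weights for one output bin

# Smooth each file separately, then average the smoothed results
a = np.mean(per_file @ weights)
# Average over files first, then smooth once
b = per_file.mean(axis=0) @ weights
assert np.isclose(a, b)
```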

@tdent tdent left a comment

See multiple suggestions for rewriting/adding comments. Otherwise, this seems reasonable.

@tdent
Contributor

tdent commented Mar 25, 2024

I think the main question on implementation is why we necessarily should forbid trying to average fit results from different ifos and/or different fit settings. Doing so would be messy in terms of recording this (as attrs or whatever), but basically the job of checking that the inputs are sane/consistent should be done outside this code unless it would make the calculation impossible or meaningless.

GarethCabournDavies and others added 4 commits March 26, 2024 09:38
Co-authored-by: Thomas Dent <thomas.dent@usc.es>
Co-authored-by: Thomas Dent <thomas.dent@usc.es>
@GarethCabournDavies
Contributor Author

> I think the main question on implementation is why we necessarily should forbid trying to average fit results from different ifos and/or different fit settings. Doing so would be messy in terms of recording this (as attrs or whatever), but basically the job of checking that the inputs are sane/consistent should be done outside this code unless it would make the calculation impossible or meaningless.

Different IFOs should be okay to use, though I'm not sure how attrs would be set in that case, and we would not be able to use the result in the later statistic, so I think disallowing this case is reasonable.

I think different fit settings would make the calculation meaningless, certainly for fit_function. I could be persuaded on stat_threshold, but I think that estimator would be biased, and again the later statistic calculation would not make sense.

```python
if k not in attr_dict:
    # It's the first time we encounter this attribute
    attr_dict[k] = fits.attrs[k]
elif k == 'ifo':
    # We don't mind if this attribute is different
```
Contributor

Note here also that the output file attrs will only be for one ifo even if more than one is input?

Contributor Author

I've added this into the comment, and also added a warning if files from multiple IFOs are used.

@tdent tdent left a comment

Minor changes on comments / line ordering

@GarethCabournDavies GarethCabournDavies merged commit aa8358f into gwastro:master Apr 8, 2024
33 checks passed
@GarethCabournDavies GarethCabournDavies deleted the trigger_fits_in_live branch April 8, 2024 08:30