Fix Parkes/Tempo2 parsing confusion #1320

dlakaplan · 2022-06-30T21:33:17Z

If TOA lines are supposed to be Tempo2 format but start with a space and have a "." in spot 41, then they may accidentally be parsed as Parkes format.

This tries to fix that.

dlakaplan · 2022-06-30T21:33:37Z

Addresses #1319

dlakaplan · 2022-06-30T21:34:22Z

This fix is pretty minimal, but I don't know enough about the TOA formats to know if it will interfere with something else. If it doesn't sound problematic I can add appropriate tests.

codecov · 2022-06-30T22:53:17Z

Codecov Report

Merging #1320 (c225f3d) into master (f21a68a) will not change coverage.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master    #1320   +/-   ##
=======================================
  Coverage   61.64%   61.64%           
=======================================
  Files          89       89           
  Lines       20174    20174           
  Branches     3614     3614           
=======================================
  Hits        12436    12436           
  Misses       6961     6961           
  Partials      777      777

Impacted Files	Coverage Δ
src/pint/toa.py	`80.77% <100.00%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f21a68a...c225f3d. Read the comment docs.

aarchiba · 2022-07-01T14:14:05Z

In the discussion on #1231 @scottransom was pretty adamant that just because you saw a FORMAT 1 you couldn't assume that the TOAs in the rest of the file were all TEMPO2. I think that should be exactly what that line means, but I think we need the opposing opinion.

dlakaplan · 2022-07-01T14:16:13Z

Yes, I understand that. But if this file is parsed properly by tempo2 then there should be a way to get PINT to do it. Even if you have to specify the format manually in get_TOAs() or similar.

scottransom · 2022-07-01T15:30:24Z

@aarchiba No, that wasn't what my point was. The FORMAT 1 at the beginning of a file is supposed to mean that that file contains only tempo2-style TOAs. That is completely correct. The problem is that there are many .tim files floating around that have mixed format TOAs in them! And FORMAT 1 lines scattered in and out of the files. So people have started abusing this "standard". My point was that we shouldn't assume that just because FORMAT 1 is set, that all TOAs will be tempo2, unless we want PINT to abort while reading (and I think that both TEMPO and Tempo2 do read those files correctly, although I should double-check that). If this is all the case, then any tweaks that make our identification of TOA formats on a line-by-line basis better, should be a good thing. (IMO)

dlakaplan · 2022-07-01T16:11:38Z

That last task is what I was trying to do here, but it's complicated. There is a mix in terminology between format (tempo2, Parkes, Princeton) and the fmt variable (which also includes values like comment and command). One solution would be to impose a TOA format on the whole file but the machinery to do that isn't there. Another would be to more intelligently fall back between different formats that are ambiguous, but that also needs a bit of an architecture change (since the format identification & toa parsing are handled in the same place). I could solve either of those, but both would be some interface changed.

Perhaps we should discuss this at an upcoming call.

aarchiba · 2022-07-04T08:31:29Z

Is it possible we could obtain some genuine examples of mixed-format TOA files to use as test cases? I haven't seen an example where FORMAT 1 is followed by non-tempo2-format TOAs, but I know my experience with pulsar timing is a little specialized.

If we had a clear specification of what files were supposed to look like, we could write some hypothesis code to really hammer it with potentially ambiguous cases, but if our only specification is "whatever the current parser accepts" then it is impossible for it to have a bug and testing is pointless.

It certainly seems like a bug to me if some valid tempo2-format files that start with FORMAT 1 and are followed by valid tempo2-format TOAs cannot be correctly read by PINT, but that appears to be the current state of affairs.

scottransom · 2022-07-04T19:38:09Z

We can easily make test files with multiple TOA types (and we could do it by simply combining some of our current .tim files!). And I agree that this is a bug and should be fixed, if possible.

The mixed files that I've seen are archival non-tempo2 format files (using Parkes or Princeton formats) where people have simply cut and pasted tempo2-style TOAs on at the end, because those more recent TOAs are what they were sent from other folks. So there is a "FORMAT 1" (possibly, but maybe not always) in the middle of a file. However, I might have seen the opposite behavior as well, where someone has a "FORMAT 1" file with tempo2 TOAs and then copies and pastes Princeton/Parkes format TOAs into them.

According to the intentions of the tempo2 developers, I think that both of these cases would "officially' be user error. But most of the time it is easy to make simple guesses as what types of TOAs are actually there and parse them correctly. So I do think we should try and do that if not too difficult.

aarchiba · 2022-07-05T11:56:51Z

We can easily make test files with multiple TOA types (and we could do it by simply combining some of our current .tim files!). And I agree that this is a bug and should be fixed, if possible.

The mixed files that I've seen are archival non-tempo2 format files (using Parkes or Princeton formats) where people have simply cut and pasted tempo2-style TOAs on at the end, because those more recent TOAs are what they were sent from other folks. So there is a "FORMAT 1" (possibly, but maybe not always) in the middle of a file. However, I might have seen the opposite behavior as well, where someone has a "FORMAT 1" file with tempo2 TOAs and then copies and pastes Princeton/Parkes format TOAs into them.

According to the intentions of the tempo2 developers, I think that both of these cases would "officially' be user error. But most of the time it is easy to make simple guesses as what types of TOAs are actually there and parse them correctly. So I do think we should try and do that if not too difficult.

This is why I specified "genuine": what combinations are actually occurring in the wild? It is certain that there will be valid tempo2 TOAs that also parse as other formats, with varying degrees of sensibleness. The more we can narrow the plausible situations where we permit weird combinations, the more reliably we can parse tempo2 TOAs. Which, as far as I can tell, are the Right Answer going forward. As David and I both independently discovered, perfectly reasonable tempo2 files can fail to be parsed correctly because "but sometimes..."

There is a danger to flexible file parsing: the two clock file formats we parse are very poorly specified, and as a result you can read a file in as the wrong format and receive no error messages, getting clock corrections that are simply absent, vastly too large, or vastly too small. The first two cases we have a way to detect, but clock corrections that are inadvertently zero are very easy to fail to notice. Likewise, I would much rather be informed that my .tim files are malformed than have TOAs with typos simply dropped or misinterpreted.

Here are a few possible situations to improve the user experience:

TOA-reading routines could allow users to specify the format, in which case the parsing could be strict, or leave it at "take a guess" in which case it would do its best to cope with random mishmashes. (PINT allows easy conversion, so if you can read a random mishmash and verify that it worked you can write it out as tempo2.)
TOA-reading routines could assume FORMAT 1 means "everything after this is tempo2"; one could even allow "FORMAT 0" or something to indicate the end of tempo2 format.
Mishmashes could have their values sanity-checked, for example a frequency of 57123.456 MHz is probably an MJD in the wrong field so you probably have the wrong format. We could also have a clearly defined order of preference, so that a valid tempo2 TOA wasn't allowed to look like a Princeton TOA. We'd then have to make sure our output routines never produced ambiguous TOAs.
Comments could be restricted to starting with "C" (we'd have to make sure this didn't clash with tempo2 format "name" fields starting with C) or "#" and then unrecognized lines could produce exceptions rather than being treated as comments.

All of these are sort of hypothetical benefits unless we know what sorts of files our users want to read. My own experience has been generating my own TOAs in tempo2 format, or using such, usually with lots of flags so other formats are hopeless.

dlakaplan · 2022-07-05T14:36:45Z

My proposed fix was very minor and only handled a small range of cases. But I agree in principle with @aarchiba . In particular, being able to say that "I know this is the format, follow it strictly" would be good but I think it would require some rewrites of how the parsing is done (nomenclature for format vs. other syntax), and the rest of it but need even more. Where are the best references to the different formats? are there e.g., particular types of commands allowed in one but not others?

It might be best for me to close this PR (again, it only fixes a particular case) and move this discussion back to the issue?

scottransom · 2022-07-05T15:08:45Z

To my knowledge, this is the best reference on TOA formats: http://tempo.sourceforge.net/ref_man_sections/toa.txt

I have a ping out to Ingrid and David N about old-school flags used with Parkes/Princeton format TOA files. I think people used "-i" flags as info flags sometimes. What I don't know is if they used other things.

I personally don't think we need to get too complicated with this. I'd recommend giving a Warning if we see Tempo2 TOAs added into a file with other format TOAs. And we can check to see, for instance, if the decimal point requirements for Princeton/Parkes/ITOA apply to a string or a number. And maybe any TOA line that has any flags (except "-i"?) gets automatically treated as T2 if it parses correctly.

I don't think I like the idea of super strict parsing by default, at least. There are files in the wild that would break that, and I think we can do a really good job with only a little bit of extra effort (and checking of values).

aarchiba · 2022-07-06T13:08:59Z

That reference actually gives a clear and simple way to distinguish between the four formats:

tempo2-format TOA files must start with FORMAT 1 (presumably this implies everything else in the file is tempo2-format, though it doesn't quite say that)
Otherwise, if columns 1 and 2 are not blank, it is ITOA format
Otherwise, if column 1 is blank, it's Parkes format
Otherwise, if column 2 is blank, it's Princeton format

We certainly don't implement these rules. They don't leave room for comments, for one thing, or JUMPs and the like. But it is an understandable starting point, and as Scott says it is the best reference available.

Looking at the TEMPO2 source code, it looks like it treats files as containing either "FORMAT" and tempo2-format TOAs or a mix of the other three types: https://bitbucket.org/psrsoft/tempo2/src/9f4f29abe564a3f907f8b97cef79385011391a23/readTimfile.C#lines-343

Looking at the TEMPO source code, it seems to assume everything after a FORMAT 1 is tempo2 format: https://sourceforge.net/p/tempo/tempo/ci/master/tree/src/arrtim.f#l217 (also it doesn't follow the rules above from its own documentation because they mention exceptions)

fix Parkes/Tempo2 parsing confusion

303a13f

added test

c225f3d

aarchiba mentioned this pull request Jul 13, 2022

PINT's parsing of tim files needs to be revisited #1341

Open

dlakaplan added the tabled label Apr 24, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix Parkes/Tempo2 parsing confusion #1320

Fix Parkes/Tempo2 parsing confusion #1320

dlakaplan commented Jun 30, 2022

dlakaplan commented Jun 30, 2022

dlakaplan commented Jun 30, 2022

codecov bot commented Jun 30, 2022 •

edited

aarchiba commented Jul 1, 2022

dlakaplan commented Jul 1, 2022

scottransom commented Jul 1, 2022

dlakaplan commented Jul 1, 2022

aarchiba commented Jul 4, 2022

scottransom commented Jul 4, 2022

aarchiba commented Jul 5, 2022

dlakaplan commented Jul 5, 2022

scottransom commented Jul 5, 2022

aarchiba commented Jul 6, 2022 •

edited

Fix Parkes/Tempo2 parsing confusion #1320

Are you sure you want to change the base?

Fix Parkes/Tempo2 parsing confusion #1320

Conversation

dlakaplan commented Jun 30, 2022

dlakaplan commented Jun 30, 2022

dlakaplan commented Jun 30, 2022

codecov bot commented Jun 30, 2022 • edited

Codecov Report

aarchiba commented Jul 1, 2022

dlakaplan commented Jul 1, 2022

scottransom commented Jul 1, 2022

dlakaplan commented Jul 1, 2022

aarchiba commented Jul 4, 2022

scottransom commented Jul 4, 2022

aarchiba commented Jul 5, 2022

dlakaplan commented Jul 5, 2022

scottransom commented Jul 5, 2022

aarchiba commented Jul 6, 2022 • edited

codecov bot commented Jun 30, 2022 •

edited

aarchiba commented Jul 6, 2022 •

edited