CER calculation: divide distance by length of alignment path #21

Open
bertsky opened this issue May 10, 2021 · 14 comments

bertsky commented May 10, 2021

In the current implementation, the denominator of the CER calculation is the reference (first) text's length:

But if you divide by the reference length instead of the maximum length, you can get CER contributions > 1.

AFAIK this is not correct. The Levenshtein edit distance is symmetrical, and so should be the rate calculation based on it.
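For illustration, a minimal sketch (plain Python, assuming the python-Levenshtein package and its Levenshtein.distance function; not the project's Java code) of how the reference-length denominator exceeds 1 while a max-length denominator stays bounded:

```python
# Minimal sketch (assumes the python-Levenshtein package; not the project's code):
# the same edit distance divided by two candidate denominators.
import Levenshtein

gt, ocr = "AB", "ABBBB"
d = Levenshtein.distance(gt, ocr)   # 3 (three inserted characters)
print(d / len(gt))                  # 1.5 -> "CER" above 1 with the reference length
print(d / max(len(gt), len(ocr)))   # 0.6 -> bounded by 1 with the max length
```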


bertsky commented May 10, 2021

The same applies to other call sites, e.g.

EditDistanceType.LEVENSHTEIN) / (double) l1;

and probably further ones as well.

rccarrasco (Member) commented:

In the current implementation, the denominator of the CER calculation is the reference (first) text's length:

But if you divide by the reference length instead of the maximum length, you can get CER contributions > 1.

AFAIK this is not correct. The Levenshtein edit distance is symmetrical, and so should be the rate calculation based on it.

Levenshtein distance is indeed symmetric, but the texts here are not on an equal footing: one is the ground-truth while the other is the output. By normalizing to the max one will get a 0% CER if the reference text is 'AB' and the output is "ABBBBBBB...". Normalizing to the ground-truth makes more sense here (an empty output will be a 100% CER, as expected, instead of infinite).


bertsky commented May 11, 2021

By normalizing to the max one will get a 0% CER if the reference text is 'AB' and the output is "ABBBBBBB...".

In the limit, yes. But that figure is the correct one – or would you rather have 100% then?

EDIT sorry, I actually meant 0% accuracy (as being correct in the limit). So, I must contradict you: normalizing to the max will get a 100% CER in the limit.

Normalizing to the ground-truth makes more sense here (an empty output will be a 100% CER, as expected, instead of infinite).

Empty output will also become 0% for the max-length denominator, not infinite.

EDIT Here I also meant accuracy, not CER. Empty output will also be 100% CER.

And the asymmetrical denominator can yield larger than 100% rates.


kba commented May 11, 2021

By normalizing to the max one will get a 0% CER if the reference text is 'AB' and the output is "ABBBBBBB...".

In the limit, yes. But that figure is the correct one – or would you rather have 100% then?

They might have the first two tokens in common, but since the longer one has lots of additions, my intuition says this should register as 100% incorrect. Such "more errors than tokens" cases, where CER > 100%, I've always interpreted as CER == 100% (but still written them down to investigate).

Normalizing to the ground-truth makes more sense here (an empty output will be a 100% CER, as expected, instead of infinite).

Empty output will also become 0% for the max-length denominator, not infinite.

And the asymmetrical denominator can yield larger than 100% rates.

How about CER = levenshtein(a, b) / max(len(a), len(b))? (disregard, that would again lead to AB vs ABBBBB being less than 100% :( )
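A quick numeric check of that parenthetical, as a hedged sketch assuming the python-Levenshtein package (Levenshtein.distance):

```python
# Quick check: max-length denominator on 'AB' vs 'ABBBBB'.
import Levenshtein

a, b = "AB", "ABBBBB"
print(Levenshtein.distance(a, b) / max(len(a), len(b)))   # 4/6 ≈ 0.67, i.e. below 100%
```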


bertsky commented May 11, 2021

How about CER = levenshtein(a, b) / max(len(a), len(b))

That's what I was arguing for (as being the only correct implementation).

disregard, that would again lead to AB vs ABBBBB being less than 100% :(

What do you mean, 100% accuracy or 100% error rate? (My stance is that the more Bs that output has, the lower CER must become – down to zero in the limit.)


kba commented May 11, 2021

What do you mean, 100% accuracy or 100% error rate?

I mean error rate.

(My stance is that the more Bs that output has, the lower CER must become – down to zero in the limit.)

I think it's counter-intuitive that, however many spurious extra tokens are in the detection, CER will always be less than 100%. Also, it converges towards 1, not 0, doesn't it? Because (n + k) / k -> 1 as k -> Inf?

In fact, what about CER = min(1, distance(a,b) / min(len(a), len(b)))?

cer('AB', 'ABBBB') = 1
cer('AB', 'ABBBBBBBB') = 1

seems reasonable but then again

cer('ABBBB', 'A') = 1
cer('AB', 'A') = 1

perhaps not.
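A throwaway sketch reproducing the values listed above (assumptions: the python-Levenshtein package provides the edit distance, and cer_min_clamped is just a hypothetical name for the min-clamped proposal):

```python
# Throwaway implementation of the min-clamped proposal above.
import Levenshtein

def cer_min_clamped(a, b):
    return min(1.0, Levenshtein.distance(a, b) / min(len(a), len(b)))

for a, b in [("AB", "ABBBB"), ("AB", "ABBBBBBBB"), ("ABBBB", "A"), ("AB", "A")]:
    print(f"cer({a!r}, {b!r}) = {cer_min_clamped(a, b)}")   # all four print 1.0
```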


M3ssman commented May 11, 2021

Actually,
I do calculate the correction_ratio (CER inverted) between groundtruth gt and evaluation candidate like this:

dist = distance(gt,candidate)

if dist >= len(gt) return 0

return (len(gt) - dist) / len(gt)
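A runnable reading of this pseudocode (assumptions: Python, and distance() taken to be the Levenshtein edit distance from the python-Levenshtein package):

```python
# Sketch of the correction_ratio above; distance() is assumed to be Levenshtein.
import Levenshtein

def correction_ratio(gt, candidate):
    dist = Levenshtein.distance(gt, candidate)
    if dist >= len(gt):              # more errors than GT characters: clamp to 0
        return 0.0
    return (len(gt) - dist) / len(gt)

print(correction_ratio("AB", "AB"))      # 1.0 (perfect output)
print(correction_ratio("AB", "ABBBB"))   # 0.0 (dist = 3 >= len(gt) = 2, clamped)
```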


bertsky commented May 11, 2021

(My stance is that the more Bs that output has, the lower CER must become – down to zero in the limit.)

I think it's counter-intuitive,

Yes, sorry, I was too sloppy with my writing; I was really thinking of accuracy, not CER (see the edits above).

So we are actually in agreement :-)

cases where CER > 100% I've always interpreted as CER == 100%

That's not the same, though. You cannot just arbitrarily saturate at 100%. The asymmetrical denominator is already biased; clipping does not unbias it.

In fact, what about CER = min(1, distance(a,b) / min(len(a), len(b)))?

Same as above (biased: this would make the range close to one much more likely than it is).


bertsky commented May 11, 2021

I do calculate the correction_ratio (CER inverted) between groundtruth gt and evaluation candidate like this:

dist = distance(gt,candidate)

if dist >= len(gt) return 0

return (len(gt) - dist) / len(gt)

That's nothing other than accuracy = 1 - CER, though – in the (clipped) asymmetrical-denominator implementation.


bertsky commented May 11, 2021

I am wondering, though, whether the correct denominator really is not the max length but rather the number of alignment positions/operations (i.e. insertions plus deletions plus substitutions plus identities in the minimal alignment).

(This seems to be the implementation chosen in ISRI tools.)
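For concreteness, a minimal sketch of this normalization (assumptions: plain Python, unit costs, and a backtrace that prefers matches/substitutions; this is not the ISRI code, just one way to count the alignment operations):

```python
# Minimal sketch: count insertions, deletions, substitutions and identities
# along one optimal Levenshtein alignment, then normalize by the path length.
def align_counts(gt, ocr):
    n, m = len(gt), len(ocr)
    # full dynamic-programming table of edit distances
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(dp[i - 1][j] + 1,                              # deletion
                           dp[i][j - 1] + 1,                              # insertion
                           dp[i - 1][j - 1] + (gt[i - 1] != ocr[j - 1]))  # sub / match
    # backtrace one optimal path, preferring the diagonal (match/substitution)
    i, j = n, m
    ins = dele = sub = match = 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (gt[i - 1] != ocr[j - 1]):
            sub += gt[i - 1] != ocr[j - 1]
            match += gt[i - 1] == ocr[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            dele += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return ins, dele, sub, match

def cer_path_normalized(gt, ocr):
    i, d, s, c = align_counts(gt, ocr)
    return (i + d + s) / (i + d + s + c)   # numerator is the edit distance

print(cer_path_normalized("AB", "ABBBB"))  # 0.6 (3 insertions, 2 identities)
print(cer_path_normalized("AB", ""))       # 1.0 (empty output)
```

With this denominator the numerator (the edit distance) can never exceed the denominator, so the rate stays within [0,1]; an empty output gives exactly 1.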


mikegerber commented May 12, 2021

Rice's dissertation (https://www.stephenvrice.com/images/rice-dissertation.pdf) argues that the accuracy can be negative. In the same manner, I would argue that the CER can be >1 or even infinite. I would however clamp it to 1 for practical purposes.

Why would the error rate be symmetrical? It's only similar to the symmetrical distance, but the compared texts have clear roles: the GT reference and the compared text.


bertsky commented May 12, 2021

Rice's dissertation (https://www.stephenvrice.com/images/rice-dissertation.pdf) argues that the accuracy can be negative. In the same manner, I would argue that the CER can be >1 or even infinite. I would however clamp it to 1 for practical purposes.

Sure, you can define it that way, but that would not be interpretable and thus not useful anymore. (Clipping does not make it more interpretable, only more biased.)

It now seems clear to me that this definition was simply a misconception (on Rice's part and possibly his followers'). The section does not even discuss the alternatives of using the maximum sequence length or simply the alignment path length. The latter is the natural definition here, as it maps directly onto the [0,1] range.

Why would the error rate be symmetrical? It's only similar to the symmetrical distance, but the compared texts have clear roles: the GT reference and the compared text.

Because that rate is meant to approximate the probability of an average character being misrecognized, and that can (when assuming ergodicity) be expressed in terms of the distinct number of errors – but since the latter is (in the Levenshtein case) strictly symmetric, the former must be, too.


M3ssman commented Jan 30, 2023

@mikegerber Thank you for pointing back to Rice's original dissertation!

His accuracy formula on p. 25 matches exactly what I tried to express above. To me it is completely evident that it should be put this way.

Regarding the possibility of getting negative accuracy: I've seen this in some cases, hence the restriction for the case where one encounters more errors than the GT has characters at all, which is interpreted as

the extreme situation in which the entire correct string can be entered from scratch using fewer keystrokes than are needed to correct the generated string.

Though one must consider that Rice himself, in the OCR system benchmarks (cf. The Fifth Annual Test of OCR Accuracy), removed any results/systems which fell below a certain accuracy limit of 90% (!) and didn't try to make sense of corner cases like the above.


bertsky commented Jan 31, 2023

This is not about corner cases, though. CER is meant/intended as an empirical estimate of the probability of misrecognising a random character, assuming ergodicity. Using a biased denominator makes for a biased estimate, period. You usually aggregate CER as an arithmetical average over many pairs of strings (typically lines), and that means even a single line where the OCR text is longer than the GT text (i.e. a single CER beyond 1 – possibly many times larger) will distort the whole average arbitrarily.
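A tiny, hypothetical illustration of that distortion (assumed numbers: 99 well-recognized lines at 2% CER plus one line with GT "AB" and a much longer OCR output):

```python
# Hypothetical aggregation: the reference-length CER of one pathological line
# ('AB' vs 'AB' * 50 -> edit distance 98, reference length 2) dominates the mean.
line_cers = [0.02] * 99 + [98 / 2]
print(sum(line_cers) / len(line_cers))   # ≈ 0.51 instead of ≈ 0.02
```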

Later formulations rightfully deviate from Rice by defining the denominator as the sum $i + s + d + c$ (where $c$ is the number of correctly recognised characters) – which is equivalent to saying the denominator is the length of the alignment path. (This variant has been called "normalized", too.)

kba mentioned this issue on Feb 8, 2023
bertsky changed the title from "CER calculation: divide distance by max length" to "CER calculation: divide distance by length of alignment path" on Feb 14, 2023