CER calculation: divide distance by length of alignment path #21

Open
bertsky opened this issue May 10, 2021 · 14 comments

bertsky commented May 10, 2021

In the current implementation, the denominator of the CER calculation is the reference (first) text's length:

But if you divide by the reference length instead of the maximum length, you can get CER contributions > 1.

AFAIK this is not correct. The Levenshtein edit distance is symmetrical, and so should be the rate calculation based on it.
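For illustration, a minimal sketch (plain Python, assuming the python-Levenshtein package and its Levenshtein.distance function; not the project's Java code) of how the reference-length denominator exceeds 1 while a max-length denominator stays bounded:

```python
# Minimal sketch (assumes the python-Levenshtein package; not the project's code):
# the same edit distance divided by two candidate denominators.
import Levenshtein

gt, ocr = "AB", "ABBBB"
d = Levenshtein.distance(gt, ocr)   # 3 (three inserted characters)
print(d / len(gt))                  # 1.5 -> "CER" above 1 with the reference length
print(d / max(len(gt), len(ocr)))   # 0.6 -> bounded by 1 with the max length
```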


bertsky commented May 10, 2021

The same applies to other call sites, e.g.

EditDistanceType.LEVENSHTEIN) / (double) l1;

and probably further ones as well.

rccarrasco (Member) commented:

In the current implementation, the denominator of the CER calculation is the reference (first) text's length:

But if you divide by the reference length instead of the maximum length, you can get CER contributions > 1.

AFAIK this is not correct. The Levenshtein edit distance is symmetrical, and so should be the rate calculation based on it.

Levenshtein distance is indeed symmetric, but the texts here are not on an equal footing: one is the ground-truth while the other is the output. By normalizing to the max one will get a 0% CER if the reference text is 'AB' and the output is "ABBBBBBB...". Normalizing to the ground-truth makes more sense here (an empty output will be a 100% CER, as expected, instead of infinite).


bertsky commented May 11, 2021

By normalizing to the max one will get a 0% CER if the reference text is 'AB' and the output is "ABBBBBBB...".

In the limit, yes. But that figure is the correct one – or would you rather have 100% then?

EDIT sorry, I actually meant 0% accuracy (as being correct in the limit). So, I must contradict you: normalizing to the max will get a 100% CER in the limit.

Normalizing to the ground-truth makes more sense here (an empty output will be a 100% CER, as expected, instead of infinite).

Empty output will also become 0% for the max-length denominator, not infinite.

EDIT Here I also meant accuracy, not CER. Empty output will also be 100% CER.

And the asymmetrical denominator can yield larger than 100% rates.


kba commented May 11, 2021

By normalizing to the max one will get a 0% CER if the reference text is 'AB' and the output is "ABBBBBBB...".

In the limit, yes. But that figure is the correct one – or would you rather have 100% then?

They might have the first two tokens in common, but since the longer one has lots of additions, my intuition says this should register as 100% incorrect. Such "more errors than tokens" cases, where CER > 100%, I've always interpreted as CER == 100% (but still written them down to investigate).

Normalizing to the ground-truth makes more sense here (an empty output will be a 100% CER, as expected, instead of infinite).

Empty output will also become 0% for the max-length denominator, not infinite.

And the asymmetrical denominator can yield larger than 100% rates.

How about CER = levenshtein(a, b) / max(len(a), len(b))? (disregard, that would again lead to AB vs ABBBBB being less than 100% :( )
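A quick numeric check of that parenthetical, as a hedged sketch assuming the python-Levenshtein package (Levenshtein.distance):

```python
# Quick check: max-length denominator on 'AB' vs 'ABBBBB'.
import Levenshtein

a, b = "AB", "ABBBBB"
print(Levenshtein.distance(a, b) / max(len(a), len(b)))   # 4/6 ≈ 0.67, i.e. below 100%
```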


bertsky commented May 11, 2021

How about CER = levenshtein(a, b) / max(len(a), len(b))

That's what I was arguing for (as being the only correct implementation).

disregard, that would again lead to AB vs ABBBBB being less than 100% :(

What do you mean, 100% accuracy or 100% error rate? (My stance is that the more Bs that output has, the lower CER must become – down to zero in the limit.)


kba commented May 11, 2021

What do you mean, 100% accuracy or 100% error rate?

I mean error rate.

(My stance is that the more Bs that output has, the lower CER must become – down to zero in the limit.)

I think it's counter-intuitive that, however many spurious extra tokens are in the detection, CER will always be less than 100%. Also, it converges towards 1, not 0, doesn't it? Because (n + k) / k -> 1 as k -> Inf?

In fact, what about CER = min(1, distance(a,b) / min(len(a), len(b)))?

cer('AB', 'ABBBB') = 1
cer('AB', 'ABBBBBBBB') = 1

seems reasonable but then again

cer('ABBBB', 'A') = 1
cer('AB', 'A') = 1

perhaps not.
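A throwaway sketch reproducing the values listed above (assumptions: the python-Levenshtein package provides the edit distance, and cer_min_clamped is just a hypothetical name for the min-clamped proposal):

```python
# Throwaway implementation of the min-clamped proposal above.
import Levenshtein

def cer_min_clamped(a, b):
    return min(1.0, Levenshtein.distance(a, b) / min(len(a), len(b)))

for a, b in [("AB", "ABBBB"), ("AB", "ABBBBBBBB"), ("ABBBB", "A"), ("AB", "A")]:
    print(f"cer({a!r}, {b!r}) = {cer_min_clamped(a, b)}")   # all four print 1.0
```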


M3ssman commented May 11, 2021

Actually,
I do calculate the correction_ratio (CER inverted) between groundtruth gt and evaluation candidate like this:

dist = distance(gt,candidate)

if dist >= len(gt) return 0

return (len(gt) - dist) / len(gt)
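A runnable reading of this pseudocode (assumptions: Python, and distance() taken to be the Levenshtein edit distance from the python-Levenshtein package):

```python
# Sketch of the correction_ratio above; distance() is assumed to be Levenshtein.
import Levenshtein

def correction_ratio(gt, candidate):
    dist = Levenshtein.distance(gt, candidate)
    if dist >= len(gt):              # more errors than GT characters: clamp to 0
        return 0.0
    return (len(gt) - dist) / len(gt)

print(correction_ratio("AB", "AB"))      # 1.0 (perfect output)
print(correction_ratio("AB", "ABBBB"))   # 0.0 (dist = 3 >= len(gt) = 2, clamped)
```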


bertsky commented May 11, 2021

(My stance is that the more Bs that output has, the lower CER must become – down to zero in the limit.)

I think it's counter-intuitive,

Yes, sorry, I was too sloppy with my writing; I was really thinking of accuracy, not CER (see the edits above).

So we are actually in agreement :-)

cases where CER > 100% I've always interpreted as CER == 100%

That's not the same, though. You cannot just arbitrarily saturate at 100%. The asymmetrical denominator is already biased; clipping does not unbias it.

In fact, what about CER = min(1, distance(a,b) / min(len(a), len(b)))?

Same as above (biased: this would make the range close to one much more likely than it is).


bertsky commented May 11, 2021

I do calculate the correction_ratio (CER inverted) between groundtruth gt and evaluation candidate like this:

dist = distance(gt,candidate)

if dist >= len(gt) return 0

return (len(gt) - dist) / len(gt)

That's nothing other than accuracy = 1 - CER, though – in the (clipped) asymmetrical-denominator implementation.


bertsky commented May 11, 2021

I am wondering, though, whether the correct denominator really is not the max length but rather the number of alignment positions/operations (i.e. insertions plus deletions plus substitutions plus identities in the minimal alignment).

(This seems to be the implementation chosen in ISRI tools.)
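For concreteness, a minimal sketch of this normalization (assumptions: plain Python, unit costs, and a backtrace that prefers matches/substitutions; this is not the ISRI code, just one way to count the alignment operations):

```python
# Minimal sketch: count insertions, deletions, substitutions and identities
# along one optimal Levenshtein alignment, then normalize by the path length.
def align_counts(gt, ocr):
    n, m = len(gt), len(ocr)
    # full dynamic-programming table of edit distances
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(dp[i - 1][j] + 1,                              # deletion
                           dp[i][j - 1] + 1,                              # insertion
                           dp[i - 1][j - 1] + (gt[i - 1] != ocr[j - 1]))  # sub / match
    # backtrace one optimal path, preferring the diagonal (match/substitution)
    i, j = n, m
    ins = dele = sub = match = 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (gt[i - 1] != ocr[j - 1]):
            sub += gt[i - 1] != ocr[j - 1]
            match += gt[i - 1] == ocr[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            dele += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return ins, dele, sub, match

def cer_path_normalized(gt, ocr):
    i, d, s, c = align_counts(gt, ocr)
    return (i + d + s) / (i + d + s + c)   # numerator is the edit distance

print(cer_path_normalized("AB", "ABBBB"))  # 0.6 (3 insertions, 2 identities)
print(cer_path_normalized("AB", ""))       # 1.0 (empty output)
```

With this denominator the numerator (the edit distance) can never exceed the denominator, so the rate stays within [0,1]; an empty output gives exactly 1.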


mikegerber commented May 12, 2021

Rice's dissertation (https://www.stephenvrice.com/images/rice-dissertation.pdf) argues that the accuracy can be negative. In the same manner, I would argue that the CER can be >1 or even infinite. I would however clamp it to 1 for practical purposes.

Why would the error rate be symmetrical? It's only similar to the symmetrical distance, but the compared texts have clear roles: the GT reference and the compared text.


bertsky commented May 12, 2021

Rice's dissertation (https://www.stephenvrice.com/images/rice-dissertation.pdf) argues that the accuracy can be negative. In the same manner, I would argue that the CER can be >1 or even infinite. I would however clamp it to 1 for practical purposes.

Sure, you can define it that way, but that would not be interpretable and thus not useful anymore. (Clipping does not make it more interpretable, only more biased.)

It now seems clear to me that this definition was simply a misconception (on Rice's part and possibly his followers'). The section does not even discuss the alternatives of using the maximum sequence length or simply the alignment path length. The latter is the natural definition here, as it maps directly onto the [0,1] range.

Why would the error rate be symmetrical? It's only similar to the symmetrical distance, but the compared texts have clear roles: the GT reference and the compared text.

Because that rate is meant to approximate the probability of an average character being misrecognized, and that can (when assuming ergodicity) be expressed in terms of the distinct number of errors – but since the latter is (in the Levenshtein case) strictly symmetric, the former must be, too.


M3ssman commented Jan 30, 2023

@mikegerber Thank you for pointing back to Rice's original dissertation!

His accuracy formula on p. 25 matches exactly what I tried to express above. To me it is completely evident that it should be put this way.

Regarding the possibility of getting negative accuracy: I've seen this in some cases, hence the restriction for the case where one encounters more errors than the GT has characters at all, which is interpreted as

the extreme situation in which the entire correct string can be entered from scratch using fewer keystrokes than are needed to correct the generated string.

Though one must consider that Rice himself, in the OCR system benchmarks (cf. The Fifth Annual Test of OCR Accuracy), removed any results/systems which fell below a certain accuracy limit of 90% (!) and didn't try to make sense of corner cases like the above.


bertsky commented Jan 31, 2023

This is not about corner cases, though. CER is meant/intended as an empirical estimate of the probability of misrecognising a random character, assuming ergodicity. Using a biased denominator makes for a biased estimate, period. You usually aggregate CER as an arithmetical average over many pairs of strings (typically lines), and that means even a single line where the OCR text is longer than the GT text (i.e. a single CER beyond 1 – possibly many times larger) will distort the whole average arbitrarily.
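A tiny, hypothetical illustration of that distortion (assumed numbers: 99 well-recognized lines at 2% CER plus one line with GT "AB" and a much longer OCR output):

```python
# Hypothetical aggregation: the reference-length CER of one pathological line
# ('AB' vs 'AB' * 50 -> edit distance 98, reference length 2) dominates the mean.
line_cers = [0.02] * 99 + [98 / 2]
print(sum(line_cers) / len(line_cers))   # ≈ 0.51 instead of ≈ 0.02
```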

Later formulations rightfully deviate from Rice by defining the denominator as the sum $i + s + d + c$ (where $c$ is the number of correctly recognised characters) – which is equivalent to saying the denominator is the length of the alignment path. (This variant has been called "normalized", too.)

kba mentioned this issue on Feb 8, 2023
bertsky changed the title from "CER calculation: divide distance by max length" to "CER calculation: divide distance by length of alignment path" on Feb 14, 2023