
Using ochre to evaluate synthetic ocr post processing dataset generation #4

Open
omrishsu opened this issue Feb 24, 2018 · 3 comments


omrishsu commented Feb 24, 2018

Hi,
I'm working on a method for synthetically generating an OCR post-processing dataset.
I think ochre could be a great project for benchmarking different datasets and evaluating which is better.
The evaluation method I have in mind is to create one evaluation dataset and several synthetic datasets, then train ochre's model on each synthetic dataset, use each trained model to correct the (very different) evaluation dataset, and see which corrects it best (based on CER and WER metrics).
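As a rough sketch of that comparison, the standard Levenshtein-based definitions of CER and WER could look like this (this is a generic implementation, not ochre's or ocrevalUAtion's exact one):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance; works on strings or word lists."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def cer(gold, hyp):
    """Character error rate: edit distance over gold-text length."""
    return levenshtein(gold, hyp) / max(len(gold), 1)

def wer(gold, hyp):
    """Word error rate: edit distance over gold word count."""
    gold_words = gold.split()
    return levenshtein(gold_words, hyp.split()) / max(len(gold_words), 1)
```

The model trained on the "best" synthetic dataset would then be the one whose corrections minimize these scores on the held-out evaluation set.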

Here is one of my datasets (random errors based on some texts from Gutenberg):
https://drive.google.com/open?id=1TUd3M7StziFibGGLbpSth_wb1ZfE2DmI
And here is the evaluation dataset (a clean version of the ICDAR 2017 dataset):
https://drive.google.com/open?id=1zyIKlErr_Aho5UQgTXzJukRZCcZX2MiY
(13 files)

My problem is that I'm not much of a Python developer (more of a Java developer), and I'm not familiar with CWL.
I was wondering whether you plan to provide more documentation and a how-to for this project,
and whether you could add this scenario to your examples.

Thanks!
Omri

@jvdzwaan
Collaborator

jvdzwaan commented Feb 26, 2018

Sorry for my late reply; I have been busy with other projects. I think ochre could be useful for you. It uses an existing tool, https://github.com/impactcentre/ocrevalUAtion, to calculate WER and CER, with one small change: the tool's default limit of 10,000 characters is removed (by default, ocrevalUAtion only calculates WER and CER for files containing at most 10,000 characters).

If you just want to try the ocrevalUAtion tool, have a look at https://hub.docker.com/r/nlppln/ocrevaluation-docker/

After installing docker, you can run it with:

docker run -i --rm -v=/path/to/data/:/data/ nlppln/ocrevaluation-docker java -cp /ocrevalUAtion/target/ocrevaluation.jar eu.digitisation.Main -o /data/out.html -gt /data/gs/gs-file.txt -ocr /data/ocr/ocr-file.txt

This assumes that in /path/to/data you have two folders: one called gs containing the gold standard files, and one called ocr containing the OCR files. The result is a new file in /path/to/data called out.html containing (amongst other things) the WER and CER.
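If you'd rather drive the container from a script than type the command by hand, here is a sketch that builds the docker argv for one gold-standard/OCR file pair. The helper name is my own invention; it simply mirrors the command above:

```python
def build_ocreval_command(data_dir, gs_name, ocr_name, out_name="out.html"):
    """Build the docker argv for evaluating one gs/ocr file pair.

    data_dir is the host folder containing the gs/ and ocr/ subfolders;
    it is mounted into the container at /data/.
    """
    return [
        "docker", "run", "-i", "--rm",
        f"-v={data_dir}:/data/",
        "nlppln/ocrevaluation-docker",
        "java", "-cp", "/ocrevalUAtion/target/ocrevaluation.jar",
        "eu.digitisation.Main",
        "-o", f"/data/{out_name}",
        "-gt", f"/data/gs/{gs_name}",
        "-ocr", f"/data/ocr/{ocr_name}",
    ]

# To actually run it (requires docker and the image above):
#   import subprocess
#   subprocess.run(build_ocreval_command("/path/to/data/", "gs-file.txt", "ocr-file.txt"), check=True)
```

Looping this over matching filenames in the gs and ocr folders (with a distinct out_name per pair) gives you one HTML report per file.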

The workflow lets you run the tool on a directory of files, extracts the WER and CER, and puts them in a csv-file. (I see that I haven't committed the ocrevaluation workflow just yet.)
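The final aggregation step of that workflow is simple once the numbers have been extracted from the per-file reports; a minimal sketch, assuming you already have (filename, wer, cer) tuples from whatever extraction step you use:

```python
import csv

def metrics_to_csv(rows, fileobj):
    """Write (filename, wer, cer) tuples as a CSV with a header row."""
    writer = csv.writer(fileobj)
    writer.writerow(["filename", "wer", "cer"])
    for name, wer, cer in rows:
        writer.writerow([name, wer, cer])
```

This is only the "put them in a csv-file" half; the extraction of WER/CER from ocrevalUAtion's HTML output is tool-specific and not shown here.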

I am planning to add the workflow and more documentation, but don't know exactly when I'll have time.

Your project sounds cool and useful for the work I am doing. I'll try to update the documentation soon!

@jvdzwaan
Collaborator

So, I have updated the documentation and added the workflows for calculating performance. You don't need a lot of Python knowledge, just follow the installation instructions and adjust the paths in the cwltool commands.

Let me know if you run into problems!

@omrishsu
Author

omrishsu commented Mar 1, 2018

Thanks a lot!
Now I understand the folder structure, and I've updated my code to work with ochre's structure.
I'll try the workflows as soon as possible and update on my progress.

Thanks again
Omri
