
Using ochre to evaluate synthetic ocr post processing dataset generation #4

Open
omrishsu opened this issue Feb 24, 2018 · 3 comments


omrishsu commented Feb 24, 2018

Hi,
I'm working on a method for synthetically generating an OCR post-processing dataset.
I think ochre could be a great project for benchmarking different datasets and evaluating which is better.
The evaluation method I have in mind is to create one evaluation dataset and several synthetic datasets, then train ochre's model on each synthetic dataset, use each trained model to correct the (very different) evaluation dataset, and see which corrects it best (based on CER and WER metrics).
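As a rough sketch of that comparison, the standard Levenshtein-based definitions of CER and WER could look like this (this is a generic implementation, not ochre's or ocrevalUAtion's exact one):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance; works on strings or word lists."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def cer(gold, hyp):
    """Character error rate: edit distance over gold-text length."""
    return levenshtein(gold, hyp) / max(len(gold), 1)

def wer(gold, hyp):
    """Word error rate: edit distance over gold word count."""
    gold_words = gold.split()
    return levenshtein(gold_words, hyp.split()) / max(len(gold_words), 1)
```

The model trained on the "best" synthetic dataset would then be the one whose corrections minimize these scores on the held-out evaluation set.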

Here is one of my datasets (random errors based on some texts from Gutenberg):
https://drive.google.com/open?id=1TUd3M7StziFibGGLbpSth_wb1ZfE2DmI
And here is the evaluation dataset (a clean version of the ICDAR 2017 dataset):
https://drive.google.com/open?id=1zyIKlErr_Aho5UQgTXzJukRZCcZX2MiY
(13 files)

My problem is that I'm not much of a Python developer (more of a Java developer), and I'm not familiar with CWL.
I was wondering whether you plan to provide more documentation and a how-to for this project,
and whether you could add this scenario to your examples.

Thanks!
Omri

@jvdzwaan
Collaborator

jvdzwaan commented Feb 26, 2018

Sorry for my late reply; I have been busy with other projects. I think ochre could be useful for you. It uses an existing tool, https://github.com/impactcentre/ocrevalUAtion, to calculate WER and CER, with one small change: the tool's default limit of 10,000 characters is removed (by default, ocrevalUAtion only calculates WER and CER for files containing at most 10,000 characters).

If you just want to try the ocrevalUAtion tool, have a look at https://hub.docker.com/r/nlppln/ocrevaluation-docker/

After installing docker, you can run it with:

docker run -i --rm -v=/path/to/data/:/data/ nlppln/ocrevaluation-docker java -cp /ocrevalUAtion/target/ocrevaluation.jar eu.digitisation.Main -o /data/out.html -gt /data/gs/gs-file.txt -ocr /data/ocr/ocr-file.txt

This assumes that in /path/to/data you have two folders: one called gs containing the gold standard files, and one called ocr containing the OCR files. The result is a new file in /path/to/data called out.html containing (amongst other things) the WER and CER.
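If you'd rather drive the container from a script than type the command by hand, here is a sketch that builds the docker argv for one gold-standard/OCR file pair. The helper name is my own invention; it simply mirrors the command above:

```python
def build_ocreval_command(data_dir, gs_name, ocr_name, out_name="out.html"):
    """Build the docker argv for evaluating one gs/ocr file pair.

    data_dir is the host folder containing the gs/ and ocr/ subfolders;
    it is mounted into the container at /data/.
    """
    return [
        "docker", "run", "-i", "--rm",
        f"-v={data_dir}:/data/",
        "nlppln/ocrevaluation-docker",
        "java", "-cp", "/ocrevalUAtion/target/ocrevaluation.jar",
        "eu.digitisation.Main",
        "-o", f"/data/{out_name}",
        "-gt", f"/data/gs/{gs_name}",
        "-ocr", f"/data/ocr/{ocr_name}",
    ]

# To actually run it (requires docker and the image above):
#   import subprocess
#   subprocess.run(build_ocreval_command("/path/to/data/", "gs-file.txt", "ocr-file.txt"), check=True)
```

Looping this over matching filenames in the gs and ocr folders (with a distinct out_name per pair) gives you one HTML report per file.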

The workflow lets you run the tool on a directory of files, extracts the WER and CER, and puts them in a csv-file. (I see that I haven't committed the ocrevaluation workflow just yet.)
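The final aggregation step of that workflow is simple once the numbers have been extracted from the per-file reports; a minimal sketch, assuming you already have (filename, wer, cer) tuples from whatever extraction step you use:

```python
import csv

def metrics_to_csv(rows, fileobj):
    """Write (filename, wer, cer) tuples as a CSV with a header row."""
    writer = csv.writer(fileobj)
    writer.writerow(["filename", "wer", "cer"])
    for name, wer, cer in rows:
        writer.writerow([name, wer, cer])
```

This is only the "put them in a csv-file" half; the extraction of WER/CER from ocrevalUAtion's HTML output is tool-specific and not shown here.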

I am planning to add the workflow and more documentation, but don't know exactly when I'll have time.

Your project sounds cool and useful for the work I am doing. I'll try to update the documentation soon!

@jvdzwaan
Collaborator

So, I have updated the documentation and added the workflows for calculating performance. You don't need a lot of Python knowledge, just follow the installation instructions and adjust the paths in the cwltool commands.

Let me know if you run into problems!

@omrishsu
Author

omrishsu commented Mar 1, 2018

Thanks a lot!
Now I understand the folder structure, and I've updated my code to work with ochre's structure.
I'll try the workflows as soon as possible and update on my progress.

Thanks again
Omri
