Start with a bunch of images, create workspace, run tesseract
wget 'https://archive.org/download/2917685.0001.001.umich.edu/2917685.0001.001.umich.edu_tif.zip'
unzip 2917685.0001.001.umich.edu_tif.zip
mv 2917685.0001.001.umich.edu_tif IMG
ocrd workspace init
ocrd workspace set-id kropotkin-paris
ocrd workspace bulk-add \
  --regex '^.*/(?P<fileGrp>[^/]+)/(?P<prefix>.*)(?P<pageId>\d{4})\.(?P<ext>[^\.]*)$' \
  --page-id 'PHYS_{{ pageId }}' \
  --file-id '{{ fileGrp }}_{{ pageId }}' \
  --url '{{ fileGrp }}/{{ prefix }}{{ pageId }}.{{ ext }}' \
  -G IMG \
  IMG/*.tif
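To see what the bulk-add regex captures, you can try a plain-bash equivalent of it. The path below is hypothetical (bulk-add matches against the full local path of each file), and since bash has no named groups, the four captures appear positionally as fileGrp, prefix, pageId and ext:

```shell
# POSIX-ERE equivalent of the --regex above ([0-9] instead of \d, no named groups)
regex='^.*/([^/]+)/(.*)([0-9]{4})\.([^.]*)$'
# Hypothetical absolute path, as bulk-add would see it
path="/data/IMG/2917685.0001.001.umich.edu_0042.tif"
if [[ $path =~ $regex ]]; then
  echo "fileGrp=${BASH_REMATCH[1]} prefix=${BASH_REMATCH[2]} pageId=${BASH_REMATCH[3]} ext=${BASH_REMATCH[4]}"
fi
```

This prints `fileGrp=IMG prefix=2917685.0001.001.umich.edu_ pageId=0042 ext=tif`, i.e. the values substituted into the `{{ … }}` templates above.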
If that is too daunting to adapt to your own data, an easier-to-read (but slower) alternative is a bash for-loop:
for file in IMG/*.tif; do
  pageid=$(echo "$file" | grep -Po '\d{4}\.tif' | grep -Po '\d+')
  fileGrp=$(dirname "$file")
  ocrd workspace add --file-id "${fileGrp}_${pageid}" --page-id "PHYS_${pageid}" --file-grp "$fileGrp" "$file"
done
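The pageid extraction in the loop can be tried on its own. With a hypothetical filename of the expected shape, the first grep isolates the `NNNN.tif` tail and the second strips the extension:

```shell
# Hypothetical filename of the shape the loop expects
file="IMG/2917685.0001.001.umich.edu_0042.tif"
pageid=$(echo "$file" | grep -Po '\d{4}\.tif' | grep -Po '\d+')
echo "$pageid"   # -> 0042
```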
ocrd process "tesserocr-recognize -P segmentation_level region -P textequiv_level word -P find_tables true -P model deu -I IMG -O OCR-TESS"
The results are in the OCR-TESS file group / folder.
Start with a METS from SBB, run a more complex workflow on it
-> https://content.staatsbibliothek-berlin.de/dc/PPN680203753.mets.xml
ocrd workspace clone https://content.staatsbibliothek-berlin.de/dc/PPN680203753.mets.xml
ocrd workspace find -k fileGrp -k url -k mimetype
ocrd workspace find --file-grp DEFAULT --download
Since our file group is called DEFAULT, OCR-D-IMG is replaced with DEFAULT here.
ocrd process \
  "cis-ocropy-binarize -I DEFAULT -O OCR-D-BIN" \
  "anybaseocr-crop -I OCR-D-BIN -O OCR-D-CROP" \
  "skimage-binarize -I OCR-D-CROP -O OCR-D-BIN2 -P method li" \
  "skimage-denoise -I OCR-D-BIN2 -O OCR-D-BIN-DENOISE -P level-of-operation page" \
  "tesserocr-deskew -I OCR-D-BIN-DENOISE -O OCR-D-BIN-DENOISE-DESKEW -P operation_level page" \
  "cis-ocropy-segment -I OCR-D-BIN-DENOISE-DESKEW -O OCR-D-SEG -P level-of-operation page" \
  "cis-ocropy-dewarp -I OCR-D-SEG -O OCR-D-SEG-LINE-RESEG-DEWARP" \
  "calamari-recognize -I OCR-D-SEG-LINE-RESEG-DEWARP -O OCR-D-OCR -P checkpoint_dir qurator-gt4histocr-1.0"
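Note how the steps chain: each processor's -O group is the next one's -I group. A small sanity check for that convention (plain bash, not part of OCR-D, shown on the first two steps only) might look like:

```shell
# First two steps of the workflow above; each -I must equal the previous -O
steps=(
  "cis-ocropy-binarize -I DEFAULT -O OCR-D-BIN"
  "anybaseocr-crop -I OCR-D-BIN -O OCR-D-CROP"
)
prev=""
for step in "${steps[@]}"; do
  in=$(grep -Po '(?<=-I )\S+' <<<"$step")
  out=$(grep -Po '(?<=-O )\S+' <<<"$step")
  if [[ -n $prev && $in != "$prev" ]]; then
    echo "broken chain at: $step"
  fi
  prev=$out
done
echo "chain ends in $prev"   # -> chain ends in OCR-D-CROP
```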
The results are in the OCR-D-OCR file group / folder.
Same steps as in demo 1 up to and including adding the files, but run Tesseract via ocrd process in Docker.
See https://ocr-d.de/en/models#models-and-docker for a pure Docker-based solution
mkdir -p models
cd models
ocrd resmgr download --location cwd ocrd-tesserocr-recognize '*'
cd ..
docker run --user $(id -u):$(id -g) \
  --volume $PWD/models:/usr/local/share/ocrd-resources/ocrd-tesserocr-recognize \
  --volume $PWD/workspace:/data \
  ocrd/all:maximum \
  ocrd process "tesserocr-recognize -P segmentation_level region -P textequiv_level word -P find_tables true -P model deu -I IMG -O OCR-TESS"
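The `--user $(id -u):$(id -g)` part makes the container run as your host user, so files it writes under /data stay owned by you rather than root; the substitution simply expands to your numeric UID and GID:

```shell
# Expands to something like 1000:1000 (the actual numbers differ per system)
echo "$(id -u):$(id -g)"
```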
Visualize results with browse-ocrd and PRImA PageViewer
Converting the results of demo 2 to TEI with https://github.com/tboenig/page2tei
We'll use an OCR-D-compatible fork of @dariok's Transkribus-oriented XSLT stylesheet:
git clone https://github.com/tboenig/page2tei
We also need to download Saxon, an XSLT engine.
wget https://sourceforge.net/projects/saxon/files/Saxon-HE/9.9/SaxonHE9-9-1-6J.zip
unzip SaxonHE9-9-1-6J.zip saxon9he.jar
NOTE: At the moment, Saxon versions newer than 9.9.1.6 are not compatible with this XSLT due to a change in bracket handling.
Now we can start converting:
java -jar saxon9he.jar -xsl:page2tei/page2tei-0.xsl -s:demo2/mets.xml -o:demo2-tei.xml PAGEprogram=OCRD PAGEXML=OCR-D-OCR
The result is in demo2-tei.xml.