Parse OCR result files for page numbers, tables of contents, etc.
fonts
.gitignore
README
analyze_ocr.php
analyze_ocr.py
color.py
diff_match_patch.py
extract_sorted.py
find_header_footer.py
find_pagenos.py
font.py
iabook.py
interval.py
make_toc.py
rnums.py
tuples.py
visualize.py
windowed_iterator.py

README

Some code for analyzing OCR'd documents.  It's currently fairly
specific to Internet Archive OCR'd books, but it may be generalizable.

Entry point: analyze_ocr.py - run this against an Internet Archive scanned book.

Functionality: find headers/footers, page numbers, tables of contents.
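
To illustrate the page-number part of this, here is a minimal sketch of one
common heuristic. It is not taken from find_pagenos.py; the function name
guess_page_numbers and its input format are hypothetical. It assumes that each
scanned leaf has already been reduced to a list of integers seen in its OCR
text, and that printed page numbers differ from the leaf index by a constant
offset for most of the book.

```python
from collections import Counter


def guess_page_numbers(candidates_per_leaf):
    """Infer printed page numbers from noisy per-leaf integer candidates.

    candidates_per_leaf: list (one entry per scanned leaf, in order) of
    lists of integers found in that leaf's OCR text. Hypothetical input
    format, chosen for illustration only.

    Returns a list with one inferred page number per leaf, or None where
    the inferred number would be less than 1 (e.g. front matter).
    """
    # Vote for the offset (candidate - leaf index) that occurs most often.
    offsets = Counter()
    for leaf, candidates in enumerate(candidates_per_leaf):
        for number in set(candidates):
            offsets[number - leaf] += 1

    if not offsets:
        return [None] * len(candidates_per_leaf)

    best_offset, _votes = offsets.most_common(1)[0]

    # Apply the winning offset to every leaf, including leaves where the
    # OCR missed the printed number entirely.
    pages = []
    for leaf in range(len(candidates_per_leaf)):
        page = leaf + best_offset
        pages.append(page if page >= 1 else None)
    return pages


if __name__ == "__main__":
    # Leaf 1 contains a year, leaf 3 contains a stray "14"; leaf 4 has
    # no readable number at all.
    sample = [[], [1900], [1], [2, 14], [], [4], [5]]
    print(guess_page_numbers(sample))
```

A single global offset is a simplification: real books often restart
numbering or use roman numerals for front matter, so a fuller approach would
look for several runs, each with its own offset.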