Detects columns and connects indented lines in hOCR files
HTML JavaScript
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
example
.gitignore
LICENSE.md
README.md
default-config.json
hocr-detect-columns.gif
index.js
package-lock.json
package.json
visualization.template.html

README.md

hocr-detect-columns

Detects columns and connects indented lines in hOCR files. This Node.js module is used in the NYPL's NYC Space/Time Directory project to extract data from digitized New York City directories.

hOCR column detection

Most OCR tools can produce hOCR files — we are using Tesseract.

But how does it work?

First, hocr-detect-columns uses Cheerio to read all pages in the hOCR file (an hOCR file is just an HTML file with special properties).

Per page, the X positions of the bounding boxes of all OCR lines are clustered, using Simple Statistics. A page with n columns should have many lines with bounding boxes on or around n different X values. If clustering finds n clusters containing most of the OCR lines, we can expect the page has n columns.

To connect indented lines with the previous line they belong to, hocr-detect-columns uses a spatial index and tries to find, for each line which doesn't belong to a column, the closest line in the upper-left direction. The algorithms we need are implemented by RBush (spatial index) and rbush-knn (nearest neighbor search). You can read more about spatial search algorithms for JavaScript on Mapbox's blog.

Installation & Usage

Standalone

npm install -g nypl-spacetime/hocr-detect-columns

hocr-detect-columns can produce the following output formats:

  • Log to stdout (default):
hocr-detect-columns /path/to/file.hocr
  • Output JSON:
hocr-detect-columns --mode json /path/to/file.hocr
  • Output NDJSON:
hocr-detect-columns --mode ndjson /path/to/file.hocr
  • Output HTML visualization:
hocr-detect-columns --mode html /path/to/file.hocr

As a Node.js module

npm install --save nypl-spacetime/hocr-detect-columns
const fs = require('fs')
const detectColumns = require('hocr-detect-columns')

const hocr = fs.readFileSync('/path/to/file.hocr', 'utf8')

const config = {}

const pages = detectColumns(hocr, config)

Configuration

You can configure hocr-detect-columns by supplying a JSON configuration object or file:

{
  "columnCount": 2, // Amount of expected columns
  "characterWidth": 25, // Width of character, in pixels
  "minLinesPerColumn": 50 // Minimum expected lines, per column
}

Example

In the directory example, you can find an hOCR file of page 418 of the 1850 city directory, as well as a JSON file and HTML visualization generated by hocr-detect-columns.

To generate these files yourself, run:

hocr-detect-columns --mode json example/example.hocr

Or:

hocr-detect-columns --mode html example/example.hocr

Data

The format of the resulting JSON pages object is as follows:

{
  "config": {
    …
  },
  "pages": [
    {
      "number": 0,
      "properties": {
        …
      },
      "lines": [
        {
          "properties": {
            "bbox": [
              …
            ],
            …
          }
          "text": "contents of line"
          "columnIndex": 0,
          "completeText": "contents of line, appended with text from indented next lines"
        },
        …
      ]
    },
    {
      "number": 1,
      "properties": {
        …
      },
      "lines": [
        …
      ]
    },
    …
  ]
}