A U-Net-based deep learning model that removes line, box, and spurious-word artifacts from text images. Unsupervised training.

kapitsa2811/DeepErase

 
 

DeepErase

  • DeepErase is a U-net-like TensorFlow semantic segmentation model that removes artifacts (lines, boxes, spurious words) from text images extracted from documents. Cleaning out these artifacts improves OCR performance on the extracted images.

Authors

Abstract

  • We present a method to programmatically generate artificial text images with realistic-looking artifacts, and use them to train the U-net-like model in a totally unsupervised manner.
  • The U-net-like model was trained in two modes:
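
The data-generation idea in the first bullet of this abstract can be illustrated with a minimal sketch. It is not the authors' generator: the function name `add_line_artifacts`, its parameters, and the pixel convention (0 = background, 255 = ink) are assumptions made for illustration, and only straight lines are drawn. The point is that the artifact mask comes for free when the artifacts are drawn programmatically, so no manual labeling is needed.

```python
import numpy as np


def add_line_artifacts(clean, num_lines=2, thickness=2, seed=None):
    """Overlay random straight lines on a clean text image.

    clean: 2-D uint8 array, assumed 0 = background and 255 = ink.
    Returns (dirty, mask): the corrupted image and a 0/1 mask that
    marks every pixel where an artifact was drawn.
    """
    rng = np.random.default_rng(seed)
    height, width = clean.shape
    mask = np.zeros((height, width), dtype=np.uint8)

    for _ in range(num_lines):
        if rng.random() < 0.5:
            # Horizontal line spanning the full width.
            row = int(rng.integers(0, height - thickness + 1))
            mask[row:row + thickness, :] = 1
        else:
            # Vertical line spanning the full height.
            col = int(rng.integers(0, width - thickness + 1))
            mask[:, col:col + thickness] = 1

    # Paint the artifact pixels as ink; everything else is unchanged.
    dirty = clean.copy()
    dirty[mask == 1] = 255
    return dirty, mask
```

Each (dirty, mask) pair can then serve as one input/target example for a segmentation model. A realistic generator would also need rendered text, boxes, and spurious words, which this sketch leaves out.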

Result

  • Validation pixel-level segmentation accuracy was above 95% in both training modes.
  • Downstream recognition performance was evaluated on validation images and on IRS extractions. The IRS extractions were taken from NIST sd02 tax forms and were not used in model training. Word recognition accuracy improved and exceeded that of the naive Hough-based cleaning method implemented with cv2.
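
For readers who want to reproduce this kind of measurement, here is a minimal sketch of the two metrics named above. The exact definitions used by the authors are not given here, so both functions are assumptions: pixel accuracy is taken as the fraction of pixels whose predicted label matches the ground truth, and word accuracy as the fraction of words recognized exactly. The names `pixel_accuracy` and `word_accuracy` are hypothetical.

```python
import numpy as np


def pixel_accuracy(predicted_mask, true_mask):
    """Fraction of pixels whose binary label matches the ground truth."""
    predicted = np.asarray(predicted_mask).astype(bool)
    truth = np.asarray(true_mask).astype(bool)
    if predicted.shape != truth.shape:
        raise ValueError("masks must have the same shape")
    return float((predicted == truth).mean())


def word_accuracy(recognized, expected):
    """Fraction of words that the recognizer reproduced exactly."""
    if len(recognized) != len(expected):
        raise ValueError("word lists must have the same length")
    if not expected:
        return 0.0
    matches = sum(r == e for r, e in zip(recognized, expected))
    return matches / len(expected)
```

Comparing `word_accuracy` on the raw extractions against the same extractions after cleaning gives the kind of before/after comparison described in the bullet above.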

Requirements

  • Python 3.5 or above
  • TensorFlow 1.12.0
  • PyTorch (torch) 0.4.1
  • OpenCV (cv2) 4.0.0

or simply pull a prebuilt Docker image:

  • docker pull wrhuang/default
  • or docker pull jdegange/default
  • a few additional packages then need to be installed with pip
