kayoyin/DirtyDocuments

Scanned Document Denoiser

Implementation of two approaches for cleaning scanned documents to facilitate subsequent OCR: an ensemble model, with a CNN autoencoder and image processing as base learners and a CNN or XGB meta learner, and a CycleGAN.

For further details of the approach, see my Medium article, which explains the full project and results.

To use the ensemble model (CNN or XGB):

  • Put the dirty training images in data/x_train, their clean target images in data/y_train, and the dirty test images in data/x_test.
  • Run data_augmentation.ipynb
  • Run base_learners.ipynb
  • Run stacker_models.ipynb
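
Before running the notebooks, it can help to confirm that the folders are populated as described above. The following is a minimal sketch, not part of the repository: the helper names `list_images` and `check_ensemble_layout`, the list of image suffixes, and the assumption that a dirty image and its clean target share the same file name are all illustrative guesses rather than something the notebooks are known to require.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to match your own data.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}


def list_images(folder):
    """Return the sorted file names of images directly inside `folder`."""
    folder = Path(folder)
    if not folder.is_dir():
        return []
    return sorted(
        p.name
        for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


def check_ensemble_layout(root="data"):
    """Summarise data/x_train, data/y_train and data/x_test.

    Assumption: each dirty image in x_train is matched to its clean
    target in y_train by an identical file name.
    """
    root = Path(root)
    dirty = set(list_images(root / "x_train"))
    clean = set(list_images(root / "y_train"))
    return {
        "paired": sorted(dirty & clean),
        "missing_target": sorted(dirty - clean),
        "missing_input": sorted(clean - dirty),
        "n_test": len(list_images(root / "x_test")),
    }


if __name__ == "__main__":
    report = check_ensemble_layout("data")
    print(f"{len(report['paired'])} paired training images")
    print(f"{len(report['missing_target'])} dirty images without a clean target")
    print(f"{len(report['missing_input'])} clean images without a dirty input")
    print(f"{report['n_test']} test images")
```

If `missing_target` or `missing_input` is non-empty, the training pairs are probably incomplete under the naming assumption above.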

To use the CycleGAN model:

  • Put dirty images for training in gan_data/trainA and for testing in gan_data/testA
  • Put clean images for training in gan_data/trainB and for testing in gan_data/testB
  • The dirty and clean images do not need to be paired
  • Run CycleGAN.ipynb
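
The folder layout above can be created and inspected with a short script. This is a sketch only, not part of the repository: `prepare_gan_layout` is a hypothetical helper, and it counts every file in each folder without checking that it is an image. Because the dirty and clean images do not need to be paired, the counts for the A and B folders may differ.

```python
from pathlib import Path

# Folder names taken from the steps above: A holds dirty images, B holds clean ones.
GAN_SUBDIRS = ("trainA", "trainB", "testA", "testB")


def prepare_gan_layout(root="gan_data"):
    """Create any missing CycleGAN folders and count the files in each.

    Returns a dict mapping each folder name to its number of files.
    """
    root = Path(root)
    counts = {}
    for name in GAN_SUBDIRS:
        folder = root / name
        folder.mkdir(parents=True, exist_ok=True)  # no error if it already exists
        counts[name] = sum(1 for p in folder.iterdir() if p.is_file())
    return counts
```

Calling `prepare_gan_layout("gan_data")` before opening CycleGAN.ipynb gives a quick view of how many files each folder holds.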

References