A Convolutional Autoencoder (CAE) to remove noise from document images and reconstruct them without losing important information. I used TensorFlow, OpenCV, Scikit-Learn, and Python to develop this autoencoder.

Document Denoising Convolutional Autoencoder using TensorFlow

This repository contains the implementation of a Denoising Convolutional Autoencoder (CAE) using TensorFlow, OpenCV, Keras, Scikit-Learn, and Python. The goal of this project is to perform noise reduction in noisy documents, such as scanned documents or images of documents.

The autoencoder architecture used in this project is a Convolutional Neural Network (CNN). It consists of two components:

  1. An encoder that takes a noisy document as input and encodes it into a low-dimensional representation, and
  2. A decoder that takes the low-dimensional representation output by the encoder and reconstructs the original document while discarding the noise.
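The two components above can be sketched as a minimal Keras model. Note that the filter counts, the 256×256 grayscale input size, and the MSE loss below are illustrative assumptions for this sketch, not necessarily the exact configuration used in the notebook:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_denoising_cae(input_shape=(256, 256, 1)):
    """A minimal denoising CAE: the encoder downsamples the noisy input,
    the decoder upsamples it back and reconstructs a clean document."""
    inputs = layers.Input(shape=input_shape)

    # Encoder: compress the noisy document into a low-dimensional representation
    x = layers.Conv2D(32, 3, activation="relu", padding="same")(inputs)
    x = layers.MaxPooling2D(2, padding="same")(x)
    x = layers.Conv2D(16, 3, activation="relu", padding="same")(x)
    encoded = layers.MaxPooling2D(2, padding="same")(x)

    # Decoder: upsample back to the original resolution
    x = layers.Conv2D(16, 3, activation="relu", padding="same")(encoded)
    x = layers.UpSampling2D(2)(x)
    x = layers.Conv2D(32, 3, activation="relu", padding="same")(x)
    x = layers.UpSampling2D(2)(x)
    outputs = layers.Conv2D(1, 3, activation="sigmoid", padding="same")(x)

    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse", metrics=["mae"])
    return model

model = build_denoising_cae()
```

During training, the model is fit with noisy images as inputs and their clean counterparts as targets, so that the bottleneck learns to keep document content and drop the noise.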

Dataset

The denoising-dirty-documents dataset is used in this project for training and testing the models. The dataset provides images of documents containing various styles of text. It has three sets of data:

  1. train data: images of documents used for training the model, to which synthetic noise has been added to simulate real-world, messy artifacts,
  2. train_cleaned data: clean versions of the train images, used as reconstruction targets and for validation during training, and
  3. test data: noisy images of documents used for testing the model.

Usage

You can run this project either 1) in Google Colab, or 2) on your own machine after installing TensorFlow, OpenCV (cv2), and scikit-learn.

  1. To run the project in Google Colab, open the denoising_convolutional_autoencoder.ipynb file from the notebooks directory. The notebook contains all the required code along with explanatory comments. Since the dataset is hosted on Kaggle, detailed instructions on how to download and preprocess it correctly are also included in the notebook.

  2. To run the project on your own machine, use the following commands to install the necessary tools/libraries:

python -m pip install -U pip # upgrade pip
pip install tensorflow
pip install opencv-python
pip install -U scikit-learn

This project also uses two widely used Python libraries: numpy and matplotlib. If they are not already installed in your Python environment, use the following commands to install them:

pip install numpy
python -m pip install -U matplotlib

Once all the necessary dependencies are installed, simply run the convolutional_autoencoder.py file from the convolutional_autoencoder directory.

Note: The dataset files are not uploaded to the repository due to storage limitations. To ensure that the project runs without errors, please create a data/raw/ directory in the root folder and place the unzipped train, test, and train_cleaned directories there. See config.py in the convolutional_autoencoder directory for more details about the configuration.

Evaluation

The reports directory contains visual plots showing the evolution of loss and error during training, as well as the outputs of the denoising operation on the test dataset. The plot below shows how loss and mean absolute error (MAE) change over epochs during training.

Change of loss on train and validation (train_cleaned) datasets over epochs
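A plot like the one above can be produced from the history returned by Keras's `fit()`. The sketch below assumes the history dict contains the keys "loss", "val_loss", "mae", and "val_mae" (which hold if MAE is among the compiled metrics), and the output path is only an example:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, so figures can be saved without a display
import matplotlib.pyplot as plt

def plot_history(history, out_path="reports/loss_mae.png"):
    """Plot training/validation loss and MAE over epochs from a
    Keras History.history dict and save the figure to disk."""
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

    # Left panel: loss on train and validation (train_cleaned) data
    ax1.plot(history["loss"], label="train loss")
    ax1.plot(history["val_loss"], label="val loss")
    ax1.set_xlabel("epoch")
    ax1.set_ylabel("loss")
    ax1.legend()

    # Right panel: mean absolute error on train and validation data
    ax2.plot(history["mae"], label="train MAE")
    ax2.plot(history["val_mae"], label="val MAE")
    ax2.set_xlabel("epoch")
    ax2.set_ylabel("MAE")
    ax2.legend()

    fig.tight_layout()
    fig.savefig(out_path)
    plt.close(fig)
```

Typical usage would be `plot_history(model.fit(...).history)` after training finishes.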

Conclusion

This Denoising Convolutional Autoencoder (CAE) can be used to denoise noisy documents, including scanned documents, photographs of documents, and low-quality PDFs, reducing the training loss and reconstruction error to near zero on this dataset.
