# Project
1.  **Project Title**: CheckMate: Automated Bank Check Processor

2.  **Project Statement and Outcomes:**

<!-- -->

1.  The objective of this project is to develop a Python application
    that automates the extraction of individual bank cheque details from
    scanned PDF pages. The application will identify and isolate each
    cheque and save it as a separate image file. It will then use
    optical character recognition (OCR) to extract pertinent
    information such as payee, amount, name, and other essential details
    from each cheque. The primary aim is to streamline the cheque
    processing workflow by removing manual effort, saving time and
    resources for users who deal with large volumes of cheques.

2.  The outcome of this project will be a software solution capable of
    handling large volumes of cheques. It will automatically extract
    individual cheques from scanned PDF pages and save them as separate
    image files. It will then use OCR to extract relevant information
    from each cheque, such as payee, amount, name, and other essential
    details, and export it to a CSV file or an analytical dashboard.
    This automation is expected to reduce the time and effort required
    for cheque processing and improve workflow efficiency for users.

<!-- -->

3.  **Modules to be Implemented:**

-   PDF Parsing Module

-   Image Extraction Module

-   OCR (Optical Character Recognition) Module

-   Data Parsing and Validation Module

-   Data Storage and Management Module

-   User Management & Interface Module

-   Review, Bug Fixes, Documentation

4.  **Week-wise module implementation and high-level requirements:**

-   Week 1-2:

> Module Implementation - PDF Parsing and Image Extraction
>
> High-level Requirements:

-   Set up the development environment.

-   Install libraries for PDF parsing.

-   Implement PDF parsing to identify cheque images.

-   Develop algorithms for cheque image extraction.

-   Ensure image quality and accuracy.

-   Handle extraction errors effectively.
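
One possible starting point for the cheque image extraction step is a horizontal projection profile: scan the page row by row and treat each run of rows that contains ink as one candidate cheque. The sketch below is an illustration under stated assumptions, not the project's chosen algorithm. The function name `find_cheque_bands`, the white threshold, and the minimum band height are hypothetical, the page is modelled as a plain list of grayscale rows so the example runs without an imaging library, and it assumes cheques are stacked vertically with clean background gaps between them.

```python
def find_cheque_bands(rows, white_threshold=240, min_height=3):
    """Return (top, bottom) row ranges that contain non-background pixels.

    rows: list of lists of grayscale values (0 = black, 255 = white).
    bottom is exclusive, so rows[top:bottom] is one candidate cheque.
    """
    bands = []
    start = None
    for y, row in enumerate(rows):
        has_ink = any(pixel < white_threshold for pixel in row)
        if has_ink and start is None:
            start = y  # first row of a new band
        elif not has_ink and start is not None:
            # Band ended; keep it only if it is tall enough to be a cheque.
            if y - start >= min_height:
                bands.append((start, y))
            start = None
    # Close a band that runs to the bottom edge of the page.
    if start is not None and len(rows) - start >= min_height:
        bands.append((start, len(rows)))
    return bands


# Synthetic 12-row "page": two dark blocks separated by white gaps.
white = [255] * 8
dark = [255, 40, 40, 40, 40, 40, 40, 255]
page = [white] * 2 + [dark] * 4 + [white] * 2 + [dark] * 3 + [white]
bands = find_cheque_bands(page)
print(bands)
```

Real scans are noisy, so the threshold and minimum height would need tuning, and skewed or side-by-side cheques would need a different approach such as contour detection.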

-   Week 3-4:

> Module Implementation - OCR, Data Parsing, Validation and Storage
>
> High-level Requirements:

-   Integrate OCR for text extraction.

-   Parse extracted data (date, account no., cheque no., bank name,
    payee name, amount, etc.).

-   Handle variations in text formatting.

-   Implement validation checks for data accuracy.

-   Design and implement the database.
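
The parsing and validation requirements above could be approached with regular expressions over the raw OCR text, followed by type checks on each match. The sketch below is a minimal illustration, not the project's implementation. The function `parse_cheque_text`, both patterns, the DD/MM/YYYY date assumption, and the currency markers are all hypothetical, and the sample text is made up for the example.

```python
import re
from datetime import datetime
from decimal import Decimal, InvalidOperation

# Hypothetical patterns; real cheque layouts and OCR noise will need tuning.
DATE_RE = re.compile(r"\b(\d{2})[/-](\d{2})[/-](\d{4})\b")
AMOUNT_RE = re.compile(r"(?:Rs\.?|INR|\$)\s*([\d,]+(?:\.\d{1,2})?)")


def parse_cheque_text(text):
    """Pull a date and an amount out of raw OCR text and validate them."""
    fields = {"date": None, "amount": None, "errors": []}

    date_match = DATE_RE.search(text)
    if date_match is None:
        fields["errors"].append("date not found")
    else:
        try:
            # Assumes day-month-year order; strptime rejects impossible dates.
            fields["date"] = datetime.strptime(
                "-".join(date_match.groups()), "%d-%m-%Y"
            ).date()
        except ValueError:
            fields["errors"].append("date is not a valid calendar date")

    amount_match = AMOUNT_RE.search(text)
    if amount_match is None:
        fields["errors"].append("amount not found")
    else:
        try:
            amount = Decimal(amount_match.group(1).replace(",", ""))
        except InvalidOperation:
            fields["errors"].append("amount is not a number")
        else:
            if amount > 0:
                fields["amount"] = amount
            else:
                fields["errors"].append("amount must be positive")

    return fields


# Made-up OCR output used only to exercise the parser.
result = parse_cheque_text("Date: 12/03/2024\nPay: Sample Payee\nRs. 15,000.50\n")
print(result)
```

Collecting errors in a list instead of raising on the first problem lets a later review screen show every field that needs manual correction.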

-   Week 5-6:

> Module Implementation - User Management & Interface Module
>
> High-level Requirements:

-   Develop forms for User Registration and Login.

-   Allow Users to upload PDFs securely and handle file limitations.

-   Design a user-friendly interface for the use cases above.

-   Provide export as CSV or PDF functionality.
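
For the upload requirement, one way to handle file limitations is to check each upload before it reaches the parsing module. The sketch below is an assumption about how that could look, not a specified design. The function `validate_pdf_upload` and the 10 MB cap are hypothetical, and the content check relies on the fact that a well-formed PDF begins with the bytes `%PDF-`.

```python
MAX_UPLOAD_BYTES = 10 * 1024 * 1024  # hypothetical 10 MB cap


def validate_pdf_upload(filename, data, max_bytes=MAX_UPLOAD_BYTES):
    """Return a list of problems with an uploaded file; empty means acceptable."""
    problems = []
    if not filename.lower().endswith(".pdf"):
        problems.append("file name must end in .pdf")
    if len(data) == 0:
        problems.append("file is empty")
    elif len(data) > max_bytes:
        problems.append(f"file exceeds {max_bytes} bytes")
    # Check the content itself, since a file name alone is easy to fake.
    if data and not data.startswith(b"%PDF-"):
        problems.append("content does not look like a PDF")
    return problems


print(validate_pdf_upload("cheques.pdf", b"%PDF-1.7\n"))
print(validate_pdf_upload("cheques.png", b"\x89PNG\r\n"))
```

A header check like this only filters obvious mismatches; a file that passes can still be corrupted, so the extraction step needs its own error handling.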

-   Week 7-8:

> Module Implementation - Review, Bug Fixes, Documentation
>
> High-level Requirements:

-   Conduct a thorough review of the entire system, including
    functionality, security, and user interface.

-   Address any identified bugs or issues and perform necessary fixes.

-   Prepare comprehensive documentation covering system architecture,
    user guides, and technical specifications.

5.  **Diagrams:**

-   Flowchart


6.  **Output:**

> A CSV file with the details extracted from the cheques.
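
As an illustration of this output, the extracted fields could be written with Python's standard `csv` module. The sketch below is a minimal example under assumptions: the column names in `FIELDNAMES`, the helper `write_cheques_csv`, and the sample records are hypothetical and not taken from real cheques.

```python
import csv
import io

# Hypothetical column set; the real schema depends on the parsing module.
FIELDNAMES = ["cheque_no", "date", "payee", "amount", "bank_name"]


def write_cheques_csv(records, stream):
    """Write one CSV row per cheque record to an open text stream."""
    writer = csv.DictWriter(stream, fieldnames=FIELDNAMES, extrasaction="ignore")
    writer.writeheader()
    for record in records:
        # Missing keys are written as empty cells (DictWriter's default restval).
        writer.writerow(record)


# Made-up records used only to show the output shape.
records = [
    {"cheque_no": "000123", "date": "2024-03-12", "payee": "Sample Payee",
     "amount": "15000.50", "bank_name": "Sample Bank"},
    {"cheque_no": "000124", "date": "2024-03-13", "payee": "Doe, Jane",
     "amount": "250.00"},
]
buffer = io.StringIO()
write_cheques_csv(records, buffer)
print(buffer.getvalue())
```

When writing to a file instead of a `StringIO`, open it with `newline=""` so the `csv` module controls the line endings.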



# Resources
+ GitHub Projects:
    + [Cheque Data Extraction using Python and Tesseract OCR](https://github.com/VasudhaSingh22/Image-OCR)
    + [OCR-Checks](https://github.com/ichrakromdhani/OCR-Checks)
    + [What_is_OCR](https://github.com/ZackTanCZ/What_is_OCR)
    + [Bank-Cheque-OCR](https://github.com/naikshubham/Bank-Cheque-OCR)
    + [Cheque-data-extraction](https://github.com/Surbh77/Cheque-data-extraction)
+ Research Paper:
    + [OCR based Cheque Validation using Image Processing](https://www.semanticscholar.org/paper/OCR-based-Cheque-Validation-using-Image-Processing-Kunekar-Vayadande/670f8a3d937b123d51256fef04416cf5775271dd)

In [1]:
#from google.colab import drive
#drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
#!pip install pymupdf

Collecting pymupdf
  Downloading PyMuPDF-1.24.14-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (3.4 kB)
Downloading PyMuPDF-1.24.14-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (19.8 MB)
Installing collected packages: pymupdf
Successfully installed pymupdf-1.24.14


In [5]:
import fitz
import os
from pathlib import Path

def extract_images_from_pdf(pdf_path, output_dir='./Images'):
    """
    Extract images from a PDF file and save them as PNG files.

    Args:
        pdf_path (str): Path to the PDF file
        output_dir (str): Directory where images will be saved

    Returns:
        list: List of paths to saved image files

    Raises:
        FileNotFoundError: If PDF file doesn't exist
        PermissionError: If output directory can't be created
        ValueError: If PDF file is invalid
    """
    # Validate PDF file exists
    if not os.path.exists(pdf_path):
        raise FileNotFoundError(f"PDF file not found: {pdf_path}")

    # Create output directory if it doesn't exist
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)

    saved_images = []
    xrefs_set = set()  # Track image reference numbers
    image_counter = 1

    try:
        # Open PDF file
        pdf_document = fitz.open(pdf_path)

        # Iterate through pages
        for page_num in range(pdf_document.page_count):
            page = pdf_document[page_num]

            # Get images from page
            image_list = page.get_images()

            # Process each image
            for img in image_list:
                xref = img[0]  # Get the image reference number

                # Skip if we've already extracted the image
                if xref in xrefs_set:
                    continue

                # Add image reference number to the set
                xrefs_set.add(xref)

                # Load the image as a pixmap so it can be re-encoded as PNG;
                # the embedded bytes may be JPEG or another format
                pix = fitz.Pixmap(pdf_document, xref)

                # PNG cannot store CMYK data, so convert to RGB first
                if pix.n - pix.alpha > 3:
                    pix = fitz.Pixmap(fitz.csRGB, pix)

                # Generate name and path for image
                image_filename = f"image_{image_counter:03d}.png"
                image_path = output_path / image_filename

                # Save image as a real PNG file
                pix.save(str(image_path))

                saved_images.append(str(image_path))
                image_counter += 1

        return saved_images

    except fitz.FileDataError as exc:
        # Chain the original error so the root cause stays in the traceback
        raise ValueError("Invalid or corrupted PDF file") from exc
    finally:
        if 'pdf_document' in locals():
            pdf_document.close()

In [6]:
# Example usage
pdf_path = "/content/drive/MyDrive/Colab Notebooks/ChequeDetails/cheques.pdf"
output_directory = "/content/Images"

try:
    saved_image_paths = extract_images_from_pdf(pdf_path, output_directory)
    print(f"Successfully extracted {len(saved_image_paths)} images")
    for path in saved_image_paths:
        print(f"Saved: {path}")
except Exception as e:
    print(f"Error: {e}")

Successfully extracted 7 images
Saved: /content/Images/image_001.png
Saved: /content/Images/image_002.png
Saved: /content/Images/image_003.png
Saved: /content/Images/image_004.png
Saved: /content/Images/image_005.png
Saved: /content/Images/image_006.png
Saved: /content/Images/image_007.png


# Flowchart
[Flowchart](https://drive.google.com/file/d/1L2VCVNcjug6RcsTDxKCaQijB16KrJ1uq/view?usp=sharing)
