## TODO: PDF Page Classification Task

Objective:
Implement the `classify_page` function within the given code structure to classify each page of a PDF file into one of three categories based on its content and readability.

Code Structure:
The main function `classify_all_pages` is already implemented, but if necessary you are allowed to change its implementation. Your task is to complete the `classify_page` function.

Input:
- A PDF file path (the function should be able to handle various PDF files)

Output:
- A list of integers, where each integer represents the class of a page in the PDF

Classification Categories:

0: Machine-readable / searchable
   - Pages with text that can be directly extracted and searched within the PDF

1: Non-machine readable but OCR-able
   - Pages containing text that isn't directly extractable but can be recognized through OCR
   - Essentially, these are pages with visible text but stored as images

2: Non-machine readable and not OCR-able
   - Pages without any recognizable text
   - This may include pages with only images, complex graphics, or blank pages

Task:
1. Implement the `classify_page` function:
   - Input: A single page object (PdfReader.PageObject)
   - Output: An integer (0, 1, or 2) representing the page's class

2. The function should analyze the content of the page and determine its class based on the categories described above.

3. Ensure your implementation is robust and can handle various types of PDF content.

Requirements:
1. The function should work with different PDF files, not just a specific one.
2. Implement methods to distinguish between the three categories accurately.
3. Handle potential exceptions or edge cases (e.g., corrupted pages, mixed content types on a single page).
4. Optimize for both accuracy and processing speed, as the function will be called for each page in the PDF.
5. You are allowed to use up to 40GB of GPU VRAM if necessary for your implementation.

Additional Considerations:
- You may use additional libraries if needed, but ensure they are imported properly.
- Provide clear comments in your code to explain the classification logic.

Testing:
- Test your implementation with various types of PDFs to ensure its robustness and generalizability.
- The main script provides a way to test your implementation on a file named "grouped_documents.pdf".


In [1]:
# make sure you have installed all reqired modules:
# run pip install -r requirements.txt to do so 

import fitz  # use fitz aka PyMuPDF for better text extraction capabilities instead of PdfReader
from typing import List

def classify_all_pages(input_pdf: str) -> List[int]: 
    """
    Analyze all pages in the input PDF and determine the class of each page.

    Args:
    input_pdf (str): The file path of the input PDF.

    Returns:
    List[int]: A list of classes for each page. 
            0: machine-readable
            1: non-machine readable but OCR-able
            2: non-machine readable and not OCR-able
    """
    # read PDF file with fitz
    document = fitz.open(input_pdf)
    
    # List to hold the classification results for each page
    classes = []
    
    # Iterate through each page in the document
    for page_number in range(len(document)):
        # Get the current page
        current_page = document.load_page(page_number)
        
        # Classify the page
        page_class = classify_page(current_page)
        
        # Append the result to the list
        classes.append(page_class)
    
    # Close the document
    document.close()
    
    return classes


In [2]:
from pdf2image import convert_from_bytes
from PIL import Image
import io

def classify_page(page: fitz.Page) -> int:
    """
    Determine the class of the PDF page while ignoring header/footer margins.

    Args:
    page (fitz.Page): A single page from a PDF using PyMuPDF.

    Returns:
    int: The page is 
        0: machine-readable
        1: non-machine readable but OCR-able
        2: non-machine readable and not OCR-able
    """
    # Define margin threshold as a percentage of the page size
    margin_threshold = 0.1  # 10%

    # Get the dimensions of the page
    width = page.rect.width
    height = page.rect.height

    # Calculate margins
    top_margin = int(margin_threshold * height)
    bottom_margin = height - top_margin
    left_margin = int(margin_threshold * width)
    right_margin = width - left_margin

    # Define the cropped rectangle
    cropped_rect = fitz.Rect(left_margin, top_margin, right_margin, bottom_margin)

    # Extract text from the cropped region
    text = page.get_text("text", clip=cropped_rect)
    if text.strip():
        # If there is extractable text, it's machine-readable
        return 0

    # If no text is extracted, check for images on the page
    try:
        # Convert PDF page to image
        pdf_writer = io.BytesIO()
        pdf_writer.write(page.get_pixmap().tobytes())
        pdf_writer.seek(0)
        images = convert_from_bytes(pdf_writer.getvalue())

        # Check if images are present and analyze margins
        if images:
            for image in images:
                # Convert image to grayscale for simpler analysis
                gray_image = image.convert("L")

                # Define margins as a percentage of the image size
                width, height = image.size
                top_margin = int(margin_threshold * height)
                bottom_margin = height - top_margin
                left_margin = int(margin_threshold * width)
                right_margin = width - left_margin

                # Crop the image to exclude margins
                cropped_image = gray_image.crop((left_margin, top_margin, right_margin, bottom_margin))

                # Check if cropped image has significant content
                if cropped_image.getbbox() is not None:  # If cropped image has a bounding box, it's not empty
                    return 1

    except Exception as e:
        print(f"Error processing page: {e}")

    # If neither text extraction nor significant images are found, classify as non-OCR-able
    return 2
# Usage
input_pdf: str = "grouped_documents.pdf"
page_classes: List[int] = classify_all_pages(input_pdf)
print(f"Classes for each page: {page_classes}")

Error processing page: Unable to get page count.
Syntax Error: Couldn't find trailer dictionary
Syntax Error: Couldn't find trailer dictionary
Syntax Error: Couldn't read xref table

Classes for each page: [2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
