# **PDF Image Extraction and Optional Color Reversal**

Welcome to this Google Colab notebook designed to assist you in efficiently extracting images from PDF files and handling color reversal issues that might arise during the extraction process.

In this notebook, we provide two Python scripts:

PDF Image Extraction Script: This part leverages the PyMuPDF library to extract images from PDF pages. It employs a robust approach, including primary and fallback methods, to ensure a high success rate in image extraction.

Color Reversal Script: The secondary part focuses on reversing the colors of images that might appear as negatives. This step is crucial for restoring the original visual appearance of extracted images.

# Getting Started
Mount your Google Drive to access input and output folders where the PDFs and extracted images will be stored.

Upload your PDF files to the designated input folder (/content/drive/MyDrive/ImageScrapePDF/Import).

Execute the PDF image extraction script to start the image extraction process. Extracted images will be saved in the output folder (/content/drive/MyDrive/ImageScrapePDF/Output).

*   Mount your Google Drive to access input and output folders where the PDFs and extracted images will be stored.
*   Upload your PDF files to the designated input folder (/content/drive/MyDrive/ImageScrapePDF/Import)
*   Execute the PDF image extraction script to start the image extraction process. Extracted images will be saved in the output folder (/content/drive/MyDrive/ImageScrapePDF/Output).
*   **(Optional)** Run the color reversal script to correct any negatively extracted images. The corrected images will be saved in separate 'flipped' folders within the output directory


Please follow the step-by-step instructions provided in each code cell to perform image extraction and color reversal. Feel free to adapt the scripts to your specific needs and explore other features of the PyMuPDF and Pillow libraries for advanced image manipulation.

Enjoy the streamlined process of working with images extracted from PDF documents!

## Mount Google Drive
Mount Drive and Make Folders

In [1]:
from google.colab import drive
drive.mount('/content/drive')

!mkdir '/content/drive/MyDrive/ImageScrapePDF'  # Create the ImageScrapePDF folder in your Google Drive
!mkdir '/content/drive/MyDrive/ImageScrapePDF/Import'  # Create the Import folder in your Google Drive
!mkdir '/content/drive/MyDrive/ImageScrapePDF/Output'  # Create the Output folder in your Google Drive




Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
mkdir: cannot create directory ‘/content/drive/MyDrive/ImageScrapePDF’: File exists


**Install Dependencies**

In [2]:
!pip install PyMuPDF pillow

Collecting PyMuPDF
  Downloading PyMuPDF-1.22.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (14.1 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m14.1/14.1 MB[0m [31m53.4 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: PyMuPDF
Successfully installed PyMuPDF-1.22.5


# Run the code

## Upload PDFs

Select Multiple PDF's to upload or Manually place them within your import folder located (/content/drive/MyDrive/ImageScrapePDF/Import)

In [3]:
from google.colab import files
import os

# Create the target directory if it doesn't exist
target_dir = '/content/drive/MyDrive/ImageScrapePDF/Import'
os.makedirs(target_dir, exist_ok=True)

# Upload PDF files
uploaded = files.upload()

# Move uploaded PDF files to the target directory
for filename, data in uploaded.items():
    if filename.endswith('.pdf'):
        with open(os.path.join(target_dir, filename), 'wb') as f:
            f.write(data)

print("PDF files uploaded and stored in:", target_dir)

Saving cf_E.pdf to cf_E.pdf
PDF files uploaded and stored in: /content/drive/MyDrive/ImageScrapePDF/Import


## Extract Images

In [4]:
import fitz  # PyMuPDF
from io import BytesIO
from tqdm import tqdm  # For progress bar

def extract_images_from_pdf(pdf_path, output_folder):
    try:
        pdf_document = fitz.open(pdf_path)
        pdf_name = os.path.splitext(os.path.basename(pdf_path))[0]

        # Create a directory with the same name as the PDF file
        pdf_directory = os.path.join(output_folder, pdf_name)
        os.makedirs(pdf_directory, exist_ok=True)

        # Initialize the overall progress bar
        total_pages = pdf_document.page_count
        overall_progress = tqdm(total=total_pages, desc="Extracting images", unit="page")

        try:
            for page_number in range(pdf_document.page_count):
                page = pdf_document[page_number]
                images = page.get_images(full=True)

                for img_index, img in enumerate(images):
                    xref = img[0]

                    # Try primary method: pdf_document.extract_image
                    try:
                        base_image = pdf_document.extract_image(xref)
                        image_data = base_image["image"]
                    except Exception as e:
                        print(f"Primary extraction method failed for page {page_number + 1}, image {img_index + 1}: {e}")
                        base_image = None

                    # If primary method fails, try fallback method: page.get_pixmap
                    if base_image is None:
                        try:
                            base_image = page.get_pixmap()
                            image_data = base_image.samples
                        except Exception as e:
                            print(f"Fallback extraction method failed for page {page_number + 1}, image {img_index + 1}: {e}")
                            continue

                    image_filename = f"{pdf_directory}/{pdf_name}_page{page_number + 1}_img{img_index + 1}.png"

                    # Save the image using Pixmap
                    try:
                        with open(image_filename, "wb") as img_file:
                            img_file.write(image_data)
                    except Exception as e:
                        print(f"Failed to save image {image_filename}: {e}")
                        continue

                    # Update the overall progress bar
                    overall_progress.update(1)

        except KeyboardInterrupt:
            print("\nExtraction interrupted.")
        finally:
            # Close the overall progress bar
            overall_progress.close()

    except Exception as e:
        print(f"PDF processing error for {pdf_path}: {e}")
    finally:
        pdf_document.close()

if __name__ == "__main__":
    input_folder = "/content/drive/My Drive/ImageScrapePDF/Import"
    output_folder = "/content/drive/My Drive/ImageScrapePDF/Output"

    # List all PDF files in the input folder
    pdf_files = [f for f in os.listdir(input_folder) if f.lower().endswith('.pdf')]

    for pdf_file in pdf_files:
        pdf_file_path = os.path.join(input_folder, pdf_file)
        print(f"Processing: {pdf_file}")
        extract_images_from_pdf(pdf_file_path, output_folder)

    print("Image extraction finished.")


Processing: cf_E.pdf


Extracting images:  23%|██▎       | 45/194 [00:00<00:02, 55.19page/s]

Image extraction finished.





# Color Reversal (Optional to fix Negative Images)

Sometimes it will extract images and they will be negative images. Use this to run a batch on your output folder that will reverse the colors


In [5]:
from PIL import Image

input_folder = "/content/drive/My Drive/ImageScrapePDF/Output"

try:
    # List all immediate subfolders in the input folder
    subfolders = [f for f in os.listdir(input_folder) if os.path.isdir(os.path.join(input_folder, f))]

    # Process images in each subfolder
    for subfolder in subfolders:
        try:
            input_subfolder = os.path.join(input_folder, subfolder)
            output_subfolder = os.path.join(input_subfolder, "flipped")

            # Create the flipped output folder inside the subfolder
            os.makedirs(output_subfolder, exist_ok=True)

            # List all image files in the input subfolder
            image_files = [f for f in os.listdir(input_subfolder) if f.lower().endswith(('.png', '.jpg', '.jpeg', '.gif', '.bmp'))]

            # Process each image
            for image_file in image_files:
                try:
                    input_path = os.path.join(input_subfolder, image_file)
                    output_path = os.path.join(output_subfolder, image_file)

                    # Open the image
                    try:
                        image = Image.open(input_path)
                    except Exception as e:
                        print(f"Error opening image {input_path}: {e}")
                        continue

                    # Convert to RGB color mode if CMYK
                    if image.mode == 'CMYK':
                        image = image.convert('RGB')

                    # Invert the colors (reverse the negative)
                    inverted_image = Image.eval(image, lambda pixel: 255 - pixel)

                    # Save the inverted image to the flipped output folder
                    try:
                        inverted_image.save(output_path)
                    except Exception as e:
                        print(f"Error saving image {output_path}: {e}")
                        continue
                except Exception as e:
                    print(f"Error processing image {image_file}: {e}")
        except Exception as e:
            print(f"Error processing subfolder {subfolder}: {e}")
except Exception as e:
    print(f"Error processing input folder: {e}")

print("Images flipped and saved in the 'flipped' folders.")


Images flipped and saved in the 'flipped' folders.


**Increase Saturation of Flipped Images**

Sometimes the images are undersaturated so this can easily increase the saturation.

In [6]:
!pip install opencv-python



In [8]:
import cv2
import numpy as np

def increase_saturation(input_folder, saturation_percent):
    for root, _, files in os.walk(input_folder):
        if "flipped" in root:
            output_subfolder = os.path.join(root, "saturated")
            os.makedirs(output_subfolder, exist_ok=True)

            for filename in files:
                if filename.endswith(".jpg") or filename.endswith(".png"):
                    input_path = os.path.join(root, filename)

                    img = cv2.imread(input_path)
                    img_hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)

                    # Calculate saturation factor based on percentage increase
                    saturation_factor = 1 + saturation_percent / 100.0

                    # Increase saturation
                    img_hsv[:, :, 1] = img_hsv[:, :, 1] * saturation_factor

                    # Ensure saturation values are within the valid range [0, 255]
                    img_hsv[:, :, 1] = np.clip(img_hsv[:, :, 1], 0, 255)

                    img_output = cv2.cvtColor(img_hsv, cv2.COLOR_HSV2BGR)

                    # Append saturation value to the end of the file name
                    saturation_str = f"-Sat{saturation_percent:02}"
                    output_filename = os.path.splitext(filename)[0] + saturation_str + os.path.splitext(filename)[1]
                    output_path = os.path.join(output_subfolder, output_filename)

                    cv2.imwrite(output_path, img_output)

def main():
    base_folder = "/content/drive/MyDrive/ImageScrapePDF/Output"
    saturation_percent = int(input("Enter saturation adjustment (0-100): "))

    increase_saturation(base_folder, saturation_percent)

if __name__ == "__main__":
    main()

Enter saturation adjustment (0-100): 35
