# Mounting Google Drive and Setting Up the Directory


In this section, we mount Google Drive to access files stored in it and specify the directory containing the PDF files we want to convert.



In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Define the directory containing the PDF files
directory = '/content/drive/MyDrive/GenAI/OpenAI/OpenAI Project'

Once Google Drive is mounted, we can read and write its files directly from the Colab notebook. The `directory` variable tells the script where to look for the PDF files that need to be converted.
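Before converting anything, it can help to confirm that the mounted path exists and to see which PDFs it holds. The following is a minimal sketch using only the standard library; the helper name `list_pdfs` is hypothetical and not part of the original notebook.

```python
import os

def list_pdfs(directory):
    """Return the sorted names of PDF files found directly in `directory`."""
    # Fail early with a clear message if the Drive path is wrong or not mounted
    if not os.path.isdir(directory):
        raise FileNotFoundError(f"Directory not found: {directory}")
    # Match the extension case-insensitively so 'Report.PDF' is also included
    return sorted(
        name for name in os.listdir(directory)
        if name.lower().endswith('.pdf')
    )
```

Calling `list_pdfs(directory)` with the path defined above would return the PDF filenames in that folder, or raise an error if the path is misspelled or Drive is not mounted.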



# Installing and Importing Required Libraries

Next, we install the necessary libraries for handling PDF files and image processing.



In [None]:
# Install PyMuPDF and Pillow libraries
!pip install PyMuPDF Pillow

Collecting PyMuPDF
  Downloading PyMuPDF-1.24.12-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (3.4 kB)
Downloading PyMuPDF-1.24.12-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (19.6 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m19.6/19.6 MB[0m [31m13.1 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: PyMuPDF
Successfully installed PyMuPDF-1.24.12


* **PyMuPDF**: Python bindings for MuPDF, used here to open PDF files and render their pages.
* **Pillow**: The maintained fork of the Python Imaging Library (PIL), used here to build and save images.

We import the libraries required for PDF manipulation and image processing.



In [None]:
# Import the fitz module from PyMuPDF for PDF handling
import fitz  # PyMuPDF
# Import the Image module from Pillow for image processing
from PIL import Image
# Import os for operating system dependent functionality
import os

* `fitz` is the import name of PyMuPDF and provides functions to read and manipulate PDF files.
* `Image` from `PIL` allows us to create and modify images.
* `os` helps in interacting with the operating system, such as reading files and directories.

# Defining the PDF to JPG Conversion Function

We define a function `pdf_to_jpg` that converts every page of each PDF file in the specified directory to a JPG image.

In [None]:
def pdf_to_jpg(directory):
    # Iterate over all files in the specified directory
    for filename in os.listdir(directory):
        # Check if the file is a PDF (case-insensitive, so '.PDF' also matches)
        if filename.lower().endswith('.pdf'):
            # Construct the full file path
            pdf_path = os.path.join(directory, filename)
            # Open the PDF document
            pdf_document = fitz.open(pdf_path)
            # Iterate over each page in the PDF
            for page_number in range(len(pdf_document)):
                # Load the page by its index
                page = pdf_document.load_page(page_number)
                # Render the page to a pixmap (an in-memory image)
                pix = page.get_pixmap()

                # Construct the output image file path
                image_path = os.path.join(
                    directory,
                    f"{os.path.splitext(filename)[0]}_page_{page_number + 1}.jpg"
                )
                # Create an image object from the pixmap data
                img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
                # Save the image in JPG format
                img.save(image_path)
            # Close the document to release its file handle
            pdf_document.close()

    # Print a message when all conversions are done
    print("All PDF files have been converted")


In this function:

* We loop through all files in the directory and select those that end with `.pdf`.
* Each PDF is opened using `fitz.open()`.
* We iterate through each page of the PDF.
* Each page is rendered to a pixmap using `page.get_pixmap()`.
* The pixmap is converted to an image using `Image.frombytes()`.
* Each image is saved as a JPG file in the same directory, named `<PDF name>_page_<n>.jpg`, with page numbers starting at 1.
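By default, `page.get_pixmap()` renders at 72 dpi, which can look coarse for scanned or text-heavy pages. As a sketch of one possible variation, the version below passes the `dpi` argument to `get_pixmap()` (supported by recent PyMuPDF releases such as the 1.24 version installed above), closes each document with a `with` block, and writes the images to a separate output folder. The names `build_image_path`, `pdf_to_jpg_hires`, and `output_dir`, as well as the defaults of 150 dpi and JPEG quality 90, are assumptions for illustration and not part of the original notebook.

```python
import os

def build_image_path(output_dir, filename, page_index):
    """Build '<pdf name>_page_<n>.jpg' inside output_dir (pages count from 1)."""
    stem = os.path.splitext(filename)[0]
    return os.path.join(output_dir, f"{stem}_page_{page_index + 1}.jpg")

def pdf_to_jpg_hires(directory, output_dir, dpi=150, quality=90):
    """Convert every page of every PDF in `directory` to a JPG in `output_dir`."""
    # Imported here so build_image_path can be used without these packages
    import fitz  # PyMuPDF
    from PIL import Image

    # Create the output folder if it does not exist yet
    os.makedirs(output_dir, exist_ok=True)
    written = []
    for filename in sorted(os.listdir(directory)):
        if not filename.lower().endswith('.pdf'):
            continue
        # The with-block closes the document even if a page fails to render
        with fitz.open(os.path.join(directory, filename)) as pdf_document:
            for page_index, page in enumerate(pdf_document):
                # Render without an alpha channel so the samples are plain RGB
                pix = page.get_pixmap(dpi=dpi, alpha=False)
                img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
                image_path = build_image_path(output_dir, filename, page_index)
                # Higher quality values give larger files with fewer artifacts
                img.save(image_path, "JPEG", quality=quality)
                written.append(image_path)
    # Return the paths so the caller can check how many pages were converted
    return written
```

For example, `pdf_to_jpg_hires(directory, os.path.join(directory, 'jpg'))` would place the images in a `jpg` subfolder. Keeping the output separate from the source PDFs is a design choice that makes reruns easier to clean up.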