<a href="https://colab.research.google.com/github/olga-terekhova/pdf-utilities/blob/main/SplitPDF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Split PDF

## How to use

To **split** a PDF file into separate files by page or page range:  
1) Prepare the PDF file that you want to split.  
2) Upload the PDF file to the root directory of the Files area, e.g. *input.pdf*. Upload one file only.  
3) In the [Set parameters](#scrollTo=XaMoALpy6JHx&line=1&uniqifier=1) section, set the name used as the prefix for the output PDFs. E.g. *output.pdf* generates *output_1.pdf*, *output_2.pdf*, etc.  
4) In the [Set parameters](#scrollTo=XaMoALpy6JHx&line=1&uniqifier=1) section, set the *split* or *delete* choice to *'split'* (you can use the dropdown field).  
5) In the [Set parameters](#scrollTo=XaMoALpy6JHx&line=1&uniqifier=1) section, specify the pages to put in each output file. Separate entries with a comma (,) and use a dash (-) for ranges; each entry produces one output file. Spaces are allowed but not needed. The start and end of a range are inclusive. Ranges can overlap, so one page can appear in several output files. E.g. *'1,3'* (splits into 2 files), *'1,5-7,9'* (3 files), or *'2, 5,5-10'* (3 files).  
6) Run all cells in the notebook (Runtime - Run all or Ctrl-F9).  
7) Download the output PDFs from the Files area (refresh it to see the newly created files).
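
To check a page list before running the notebook, the naming and range rules above can be sketched as follows. This is a minimal illustration, not part of the notebook: `preview_outputs` is a hypothetical helper that only mirrors the rules described in the steps (one output file per comma-separated entry, inclusive ranges, `_1`, `_2`, ... suffixes).

```python
import os


def preview_outputs(prefix, page_range):
    """Hypothetical helper: list (filename, first_page, last_page) per entry.

    Page numbers are 1-based and inclusive, as typed by the user.
    """
    stem = os.path.splitext(prefix)[0]  # 'output.pdf' -> 'output'
    previews = []
    # Spaces are ignored; each comma-separated entry becomes one output file
    entries = page_range.replace(' ', '').split(',')
    for idx, entry in enumerate(entries, start=1):
        if '-' in entry:
            first, last = map(int, entry.split('-'))
        else:
            first = last = int(entry)
        previews.append((f"{stem}_{idx}.pdf", first, last))
    return previews


for name, first, last in preview_outputs('output.pdf', '2, 5,5-10'):
    print(name, first, last)
```

For *'2, 5,5-10'* this lists three files: *output_1.pdf* (page 2), *output_2.pdf* (page 5), and *output_3.pdf* (pages 5 to 10), so page 5 appears in two of them.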


If you need to split another file, **delete** the current PDF files first. This removes every PDF file in the root directory of the Files area, including the uploaded input file.
To do that:  
1) In the [Set parameters](#scrollTo=XaMoALpy6JHx&line=1&uniqifier=1) section, set the *split* or *delete* choice to *'delete'* (you can use the dropdown field).  
2) Run all cells in the notebook (Runtime - Run all or Ctrl-F9).  

In [15]:
# @title Set parameters

split_pdf_path = 'output.pdf' # @param {type:"string"}
split_or_delete = "split" # @param ["split", "delete"]
page_range = "1,1-3, 3-5" # @param {type:"string"}

print(split_pdf_path)
print(split_or_delete)
print(page_range)

output.pdf
delete
1,1-3, 3-5


## Code (you can collapse this section)

### Install, import, initialize  

In [16]:
!pip install -q PyPDF2

In [17]:
import os
import PyPDF2

### Split the PDF file

In [18]:
def get_file():
  """
  Get the first PDF file in the current directory.
  Return the file name and a message.
  """
  pdf_files = []
  for filename in os.listdir():
      if filename.endswith('.pdf'):
          pdf_files.append(filename)

  if split_pdf_path in pdf_files:
    return "", "File " + split_pdf_path + " already exists. No action taken. Do you want to delete PDF files first?"

  if len(pdf_files) == 0:
    return "", "No PDF files found. No action taken."

  # sort pdf_files in the alphabetical order
  pdf_files.sort()

  # take the first PDF file
  pdf_file = pdf_files[0]
  print(pdf_file)

  return pdf_file, "OK"
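
The selection rule in `get_file` can be tried without touching the file system. The sketch below is illustrative only: `pick_input_pdf` and the sample listing are hypothetical, and the function reproduces just the filter-and-sort step, not the check for an existing output file.

```python
def pick_input_pdf(filenames):
    """Hypothetical helper: return the name get_file would pick, or ''."""
    # Same filter as the notebook: a case-sensitive '.pdf' suffix
    pdf_files = sorted(name for name in filenames if name.endswith('.pdf'))
    return pdf_files[0] if pdf_files else ""


# Made-up directory listing for illustration
listing = ['scan.PDF', 'notes.txt', 'report.pdf', 'Annual.pdf']
print(pick_input_pdf(listing))
```

Two details follow from this rule: a file named *scan.PDF* is ignored because the suffix check is case-sensitive, and the default string sort places uppercase letters before lowercase ones, so *Annual.pdf* is chosen ahead of *report.pdf*. Uploading one file only, as the instructions say, avoids any ambiguity.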

In [19]:
def parse_page_range(page_range):
    """
    Parse a string like '2, 5-7,9' into a list of page intervals.

    :param page_range: String representing the page range. Input by the user.
    Return a list of (start, end) tuples: 0-based, with end exclusive.
    """

    intervals = []

    # Remove all spaces from the input string
    page_range = page_range.replace(' ', '')

    # Split the string by commas
    ranges = page_range.split(',')

    for r in ranges:
        if '-' in r:
            start, end = map(int, r.split('-'))
            intervals.append((start - 1, end))  # Convert to 0-based index for start
        else:
            page = int(r) - 1  # Convert to 0-based index
            intervals.append((page, page + 1))


    return intervals
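
`parse_page_range` does not validate its input: an entry of *'0'* becomes the interval `(-1, 0)`, and a reversed range such as *'5-3'* becomes `(4, 3)`, which covers no pages. A possible check on the returned intervals is sketched below; `validate_intervals` is a hypothetical helper, not part of the notebook, and the message wording is an assumption.

```python
def validate_intervals(intervals):
    """Hypothetical check on (start, end) tuples: 0-based, end exclusive.

    Return a list of problem messages; an empty list means no issue found.
    """
    problems = []
    for idx, (start, end) in enumerate(intervals, start=1):
        if start < 0:
            # A user-facing page 0 (or lower) maps to a negative index
            problems.append(f"Entry {idx}: page numbers start at 1.")
        elif start >= end:
            # A reversed range such as '5-3' yields an empty interval
            problems.append(f"Entry {idx}: range start is after range end.")
    return problems


print(validate_intervals([(1, 2), (4, 7), (8, 9)]))  # intervals for '2, 5-7,9'
print(validate_intervals([(-1, 0), (4, 3)]))         # intervals for '0,5-3'
```

The first call reports no problems, while the second flags both entries. Running such a check before the split would let the notebook stop with a message instead of producing unexpected files.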

In [20]:
def split_selected_pages(output_pdf, page_range):
    """
    Split selected pages in the PDF into separate PDF files.

    :param output_pdf: Path to output PDF. Input by the user.
    :param page_range: Pages to split in string format (e.g., '2,5-7,9'). Input by the user.

    Return a message.
    """

    # Get the input PDF file (the first PDF file found in the root directory)
    input_pdf, input_pdf_response = get_file()
    if input_pdf == "": # no file to process
      print(input_pdf_response)
      return input_pdf_response

    # Parse the page range string into intervals
    intervals = parse_page_range(page_range)

    # Open the PDF file
    with open(input_pdf, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        output_files = []

        # Loop through the intervals and create separate PDF files for each
        for idx, (start, end) in enumerate(intervals):
            writer = PyPDF2.PdfWriter()

            for page_num in range(start, end):
                if 0 <= page_num < len(reader.pages):  # Skip pages outside the document (a negative index would wrap to the end)
                    writer.add_page(reader.pages[page_num])

            # Define the output PDF filename for this interval
            output_filename = f"{os.path.splitext(output_pdf)[0]}_{idx + 1}.pdf"
            output_files.append(output_filename)

            # Write the split pages to a new PDF
            with open(output_filename, 'wb') as output_file:
                writer.write(output_file)


    message = "Split pages saved as:\n" + ", ".join(output_files) + \
              "\n\nRefresh the Files area and locate " + ", ".join(output_files) + "."
    print(message)
    return message
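
The inner loop of `split_selected_pages` skips page numbers past the end of the document rather than raising an error. The sketch below shows the effect using plain integers in place of a PDF; `pages_written` is a hypothetical helper and assumes non-negative start indices.

```python
def pages_written(intervals, total_pages):
    """Hypothetical helper: 0-based page indices kept for each interval.

    Assumes every start index is non-negative.
    """
    return [
        [p for p in range(start, end) if p < total_pages]
        for start, end in intervals
    ]


# Intervals for '1,3-6,9' applied to a 4-page document
print(pages_written([(0, 1), (2, 6), (8, 9)], total_pages=4))
```

For a 4-page document, the range *3-6* is shortened to pages 3 and 4 (indices 2 and 3), and the entry *9* selects no pages at all, so the corresponding output file would contain no pages.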

### Delete all PDF files from the Files area

In [21]:
def delete_pdfs():
  """
  Delete all PDF files in the current directory.
  Return a message.
  """

  # Create a list of all PDF files in the current directory
  pdf_files = []
  for filename in os.listdir():
      if filename.endswith('.pdf'):
          pdf_files.append(filename)

  print(pdf_files)

  if len(pdf_files) == 0:
    return "No PDF files found. No action taken."

  # Delete all files in the pdf_files

  for filename in pdf_files:
      os.remove(filename)

  pdf_files_str = ', \n'.join(pdf_files)

  return "Files deleted:\n" + pdf_files_str + "\n\nRefresh the Files area and check that it has no PDF files."

### Run the chosen option

In [22]:
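# Run the split or delete process depending on the user choice
if split_or_delete == "split":
  result = split_selected_pages(split_pdf_path, page_range)
elif split_or_delete == "delete":
  result = delete_pdfs()
else:
  # Guard so that `result` is always defined for the Result cell
  result = "Unknown option: " + split_or_delete + ". No action taken."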

['output_1.pdf', 'output_2.pdf', 'Medical Forms - Filled.pdf', 'output_3.pdf']


## Result

In [23]:
print(result)

Files deleted:
output_1.pdf, 
output_2.pdf, 
Medical Forms - Filled.pdf, 
output_3.pdf

Refresh the Files area and check that it has no PDF files.
