# Utility for <img src="img/pdf_icon.png" width="24" height="24" /> PDF Files/Pages.
This notebook builds a small application that lets users:
* Merge <img src="img/merge.png" width="24" height="24" />
* Splice <img src="img/splice.png" width="24" height="24" /> 
* Rotate Pages <img src="img/rotate.png" width="24" height="24" /><br>
of <img src="img/pdf_icon.png" width="24" height="24" /> PDF documents.

<b>Note:</b> The core functionality of this notebook comes from https://github.com/metaist/pdfmerge, which was originally written as a <img src="img/cmd.png" width="24" height="24" /> command-line utility. I have only changed how its input is supplied. All code snippets should be credited to the <a href='https://github.com/metaist' target='_blank'>original author</a>.

<b>Running:</b> Python 3.6.7<br>
<b>Using:</b> pip 20.2.3<br>
<b>OS:</b> Windows 10

### Importing/Installing required module(s)
Most of the required modules are part of the Python standard library, so we import them in the next cell:

In [None]:
from glob import glob
import os
import re

To build these features we use `PyPDF2`, a PDF toolkit for Python. The `!` prefix in the next cell passes the `pip` command to the system shell, so the package is installed from within the notebook.

In [2]:
!pip install PyPDF2

Processing c:\users\xuema\appdata\local\pip\cache\wheels\97\28\4b\142b7d8c98eeeb73534b9c5b6558ddd3bab3c2c8192aa7ab30\pypdf2-1.26.0-py3-none-any.whl
Installing collected packages: PyPDF2
Successfully installed PyPDF2-1.26.0


Now we import the two classes we need from the `PyPDF2` package we just installed, `PdfFileWriter` and `PdfFileReader`. The utility functions below are built on them.

In [3]:
from PyPDF2 import PdfFileWriter, PdfFileReader

### Specifications of:
#### 1. Error Message Output
When a check fails, the program reports one of the following messages to tell us what went wrong.

In [4]:
ERROR_PATH = 'ERROR: path not found: {0}'
ERROR_RULE = 'ERROR: invalid rule: {0}'
ERROR_RANGE = 'ERROR: page {0} out of range [1-{1}]'
ERROR_BOUNDS = 'ERROR: missing upper bound on range [{0}..]'
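
Each template is filled in with `str.format`. The sketch below renders one of them and, as an illustration only, wraps it in a hypothetical `check_page` helper (not part of the original code) that raises `ValueError` rather than relying on `assert`, since `assert` statements are skipped when Python runs with `-O`.

```python
# Template copied from the cell above.
ERROR_RANGE = 'ERROR: page {0} out of range [1-{1}]'


def check_page(page, total):
    """Hypothetical helper: validate a 1-based page number."""
    if not 1 <= page <= total:
        raise ValueError(ERROR_RANGE.format(page, total))
    return page


message = ERROR_RANGE.format(7, 5)
print(message)  # ERROR: page 7 out of range [1-5]
```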

#### 2. Required Inputs for Page Range Extraction/Rotation
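The following constants define the rule syntax the program accepts: `..` marks a page range, and `>`, `V` and `<` select a clockwise rotation of 90°, 180° and 270° respectively. A rule with no rotation symbol leaves the page unrotated.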

In [5]:
RULE_RANGE = '..'
RULE_ROTATE = { 
                None: 0, 
                '>': 90, 
                'V': 180, 
                '<': 270 
              }
RULE_DEFAULT = RULE_RANGE

The regular expressions below recognise that syntax. `RE_HAS_RULE` splits an input of the form `path[rules]` into its path and its rules, and `RE_RULE` breaks a single rule into its start page, range marker, end page and rotation symbol.

In [6]:
RE_MATCH_TYPE = type(re.match('', ''))
RE_HAS_RULE = re.compile(r'^(.*)\[(.*)\]$')
RE_RULE = re.compile(r'^(-?\d+)?(\.\.)?(-?\d+)?([>V<])?$')
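
To make the two patterns concrete, here is a small sketch that runs them on sample inputs. The file name `report.pdf` is made up for illustration; the patterns are copied from the cell above.

```python
import re

# Patterns copied from the cell above.
RE_HAS_RULE = re.compile(r'^(.*)\[(.*)\]$')
RE_RULE = re.compile(r'^(-?\d+)?(\.\.)?(-?\d+)?([>V<])?$')

# Split "path[rules]" into the file path and the rule text.
path, rules = RE_HAS_RULE.search('report.pdf[3..5>]').groups()
print(path, rules)  # report.pdf 3..5>

# Break one rule into (start, range marker, end, rotation).
parsed = RE_RULE.search(rules).groups()
print(parsed)  # ('3', '..', '5', '>')

# Parts that are absent come back as None.
rotate_only = RE_RULE.search('<').groups()
print(rotate_only)  # (None, None, None, '<')

# A leading minus sign is accepted, so bounds can count from the end.
from_end = RE_RULE.search('-2..').groups()
print(from_end)  # ('-2', '..', None, None)
```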

#### Define function `rangify`
Description: Expands a rule into the list of page numbers it selects (both bounds inclusive), using the document's page count to resolve open-ended or out-of-range bounds.<br>
<table style='margin-left:0'>
    <thead>
        <tr>
            <th colspan='3' style='text-align:left'>Args</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td style='text-align:left'><pre>rule</pre></td>
            <td style='text-align:left'>(str, obj)</td>
            <td>rule string describing the pages to extract, or a regex match object for that rule</td>
        </tr>
         <tr>
            <td><pre>range_max</pre></td>
            <td>(int)</td>
            <td>maximum number of pages</td>
        </tr>
    </tbody>
</table>

<table style='margin-left:0'>
    <thead>
        <tr>
            <th colspan='3' style='text-align:left'>Output</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td style='text-align:left'>Returns</td>
            <td>(list)</td>
            <td>List of pages to extract.</td>
        </tr>
    </tbody>
</table>

In [7]:
def rangify(rule, range_max=None):
    # pylint: disable=R0912
    result, match = [], None
    if isinstance(rule, str):
        match = RE_RULE.search(rule)
        assert match, ERROR_RULE.format(rule)
    elif isinstance(rule, RE_MATCH_TYPE):
        assert rule is not None, ERROR_RULE.format(rule)
        match = rule

    beg, isrange, end, _ = match.groups()
    isrange = (isrange == RULE_RANGE)

    if not beg and not end:
        assert range_max is not None, ERROR_BOUNDS.format(beg)
        beg, isrange, end = 1, True, range_max

    beg = (beg and int(beg)) or 1
    end = (end and int(end))

    if beg:
        beg = int(beg)
        if range_max and beg < 1:
            beg += range_max + 1
        elif range_max and beg > range_max:
            beg = range_max

    if end:
        end = int(end)
        if range_max and end < 1:
            end += range_max + 1
        elif range_max and end > range_max:
            end = range_max
    elif isrange:
        assert range_max is not None, ERROR_BOUNDS.format(beg)
        end = range_max

    if isrange and end < beg:
        result = sorted(range(end, beg + 1), reverse=True)
    elif isrange:
        result = list(range(beg, end + 1))
    else:
        result.append(beg)

    return result
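
The two less obvious behaviours of `rangify` are that bounds below 1 count back from the last page and that a range whose end precedes its start is returned in descending order. The sketch below isolates that arithmetic in two hypothetical helpers, `normalize_page` and `expand`, which are not part of the original code and only mirror the corresponding branches.

```python
def normalize_page(page, total):
    """Hypothetical helper mirroring how rangify adjusts one bound."""
    if page < 1:
        # Same arithmetic as rangify: -1 + total + 1 == total (the last page).
        page += total + 1
    elif page > total:
        # Bounds past the end are clamped to the last page.
        page = total
    return page


def expand(beg, end):
    """Inclusive list of pages; descending when end comes before beg."""
    if end < beg:
        return list(range(beg, end - 1, -1))
    return list(range(beg, end + 1))


total = 10  # assumed page count for the example

last_three = expand(normalize_page(-3, total), total)
print(last_three)  # [8, 9, 10]

reversed_pages = expand(normalize_page(5, total), normalize_page(3, total))
print(reversed_pages)  # [5, 4, 3]

clamped = expand(normalize_page(9, total), normalize_page(15, total))
print(clamped)  # [9, 10]
```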

#### Define Function `add` 
Description: Adds the pages from one or more paths to a `PdfFileWriter`.<br>

<table style='margin-left:0'>
    <thead>
        <tr>
            <th colspan='3' style='text-align:left'>Args</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td style='text-align:left'><pre>path</pre></td>
            <td style='text-align:left'>(str, list)</td>
            <td>path or list of paths to merge</td>
        </tr>
        <tr>
            <td><pre>password</pre></td>
            <td>(str)</td>
            <td>password for encrypted files</td>
        </tr>
        <tr>
            <td><pre>writer</pre></td>
            <td>(PdfFileWriter)</td>
            <td>output writer to add pdf files</td>
        </tr>
        <tr>
            <td><pre>rules</pre></td>
            <td>(str)</td>
            <td>pages and rotation rules</td>
        </tr>
    </tbody>
</table>

<table style='margin-left:0'>
    <thead>
        <tr>
            <th colspan='3' style='text-align:left'>Output</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td style='text-align:left'>Returns</td>
            <td>(PdfFileWriter)</td>
            <td>Writer holding the merged pages, ready for output.</td>
        </tr>
    </tbody>
</table>

In [8]:
def add(path, password='', writer=None, rules=RULE_DEFAULT):
    if writer is None:
        writer = PdfFileWriter()

    if isinstance(path, list):  # merge all the paths
        for subpath in path:
            writer = add(subpath, password, writer, rules)
    else:
        match = RE_HAS_RULE.search(path)
        if match:
            path, rules = match.groups()
        rules = re.sub(r'\s', '', rules)

        if os.path.isdir(path):
            path = os.path.join(path, '*.pdf')

        if '*' in path:
            writer = add(glob(path), password, writer, rules)
        else:
            assert os.path.isfile(path), ERROR_PATH.format(path)
            reader = PdfFileReader(open(path, 'rb'))
            if reader.isEncrypted:
                reader.decrypt(password)

            for rule in rules.split(','):
                match = RE_RULE.search(rule)
                assert match, ERROR_RULE.format(rule)
                _, _, _, rotate = match.groups()
                for page in rangify(match, reader.getNumPages()):
                    writer.addPage(
                        reader.getPage(page - 1).rotateClockwise(
                            RULE_ROTATE[rotate]
                        )
                    )
    return writer
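
Before any page is read, `add` turns a directory into a `*.pdf` wildcard and a wildcard into a list of files. The sketch below reproduces only that path-expansion step in a hypothetical `expand_path` helper. One deliberate difference, which is an assumption on my part and not what `add` does: the matches are sorted, because `glob` does not guarantee an order.

```python
import os
import tempfile
from glob import glob


def expand_path(path):
    """Hypothetical helper: list the files a path argument refers to."""
    if os.path.isdir(path):
        # A directory means "every PDF directly inside it".
        path = os.path.join(path, '*.pdf')
    if '*' in path:
        # Sorting is an addition here; it gives a stable merge order.
        return sorted(glob(path))
    return [path]


# Demonstrate on a throwaway directory holding empty placeholder files.
with tempfile.TemporaryDirectory() as folder:
    for name in ('b.pdf', 'a.pdf', 'notes.txt'):
        open(os.path.join(folder, name), 'w').close()
    found = [os.path.basename(p) for p in expand_path(folder)]

print(found)  # ['a.pdf', 'b.pdf']
```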

#### Define Function `merge`
Description: Merge the paths into a single PDF.<br>
<table style='margin-left:0'>
    <thead>
        <tr>
            <th colspan='3' style='text-align:left'>Args</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td style='text-align:left'><pre>paths</pre></td>
            <td style='text-align:left'>(str, list)</td>
            <td>path or list of paths to merge</td>
        </tr>
        <tr>
            <td><pre>output</pre></td>
            <td>(str)</td>
            <td>output file name</td>
        </tr>
        <tr>
            <td><pre>password</pre></td>
            <td>(str)</td>
            <td>password for encrypted files (default: '')</td>
        </tr>
    </tbody>
</table>

In [9]:
def merge(paths, output, password=''):
    writer = add(paths, password)
    with open(output, 'wb') as stream:
        writer.write(stream)

#### Input PDF Files for Processing

List the paths of the PDF files to process in the cell below:

In [10]:
pdf_files=[
    'pdf-multi-pg.pdf',
    '1.pdf',
    '2.pdf'
]

In [11]:
from IPython.display import display, HTML

In [12]:
rotation_options_table=''
rotation_options_table+='<table style=\'margin-left:0;border:0.5px solid #000\'>'
rotation_options_table+='<thead>'
rotation_options_table+='<tr>'
rotation_options_table+='<th colspan=\'2\' style=\'text-align:left;border:0.5px solid #000\'>Rotation Options</th>'
rotation_options_table+='</tr>'
rotation_options_table+='<tr>'
rotation_options_table+='<th style=\'text-align:left;border:0.5px solid #000\'>Input</th>'
rotation_options_table+='<th style=\'text-align:left;border:0.5px solid #000\'>Description</th>'
rotation_options_table+='</tr>'
rotation_options_table+='</thead>'
rotation_options_table+='<tbody>'
rotation_options_table+='<tr>'
rotation_options_table+='<td style=\'text-align:center;border:0.5px solid #000\'>(blank)</td>'
rotation_options_table+='<td style=\'text-align:left;border:0.5px solid #000\'>Blank defaults to None</td>'
rotation_options_table+='</tr>'
rotation_options_table+='<tr>'
rotation_options_table+='<td style=\'text-align:center;border:0.5px solid #000\'><pre>&gt;</pre></td>'
rotation_options_table+='<td style=\'text-align:left;border:0.5px solid #000\'>↱ Clockwise right 90°</td>'
rotation_options_table+='</tr>'
rotation_options_table+='<tr>'
rotation_options_table+='<td style=\'text-align:center;border:0.5px solid #000\'><pre>V</pre></td>'
rotation_options_table+='<td style=\'text-align:left;border:0.5px solid #000\'>↴ Inverse downwards 180°</td>'
rotation_options_table+='</tr>'
rotation_options_table+='<tr>'
rotation_options_table+='<td style=\'text-align:center;border:0.5px solid #000\'><pre>&lt;</pre></td>'
rotation_options_table+='<td style=\'text-align:left;border:0.5px solid #000\'>↲ Clockwise downwards 270°</td>'
rotation_options_table+='</tr>'
rotation_options_table+='</tbody>'    
rotation_options_table+='</table>'

In [13]:
to_merge=[]
#[START][..][END][ROTATE]
counter=0
for file in pdf_files:
    counter+=1
    options=file
    # Close the file as soon as the page count has been read.
    with open(file, 'rb') as stream:
        no_of_pages = PdfFileReader(stream).getNumPages()
    display(HTML('<h3>Filename (' + str(counter) + '/' + str(len(pdf_files)) + '): '+ file +'</h3>'))
    display(HTML('<small>Total no. of pages in file: <b>' + str(no_of_pages) + '</b></small>'))
    display(HTML('<h4>Part I: Specify range of PDF file to be extracted</h4>'))
    display(HTML('<small>Note: If either range value is non-numeric or blank, the default is <u>all pages</u></small>'))
    try:
        START=int(input('Input range (start page) of ' + file))
        END = int(input('Input range (end page) of ' + file))
        display(HTML('<hr>'))
    except ValueError:  # blank or non-numeric input: fall back to all pages
        START=''
        END=''
    display(HTML('<h4>Part II: Specify rotation option of PDF file</h4>'))
    display(HTML(rotation_options_table))
    ROTATE = input('Input rotation option: ' + file)
    ROTATE=ROTATE.strip().upper()
    
    page_range=[]
    page_range.append(START)
    page_range.append(END)
    page_range=list(filter(lambda x: x != '', page_range))
    if(len(page_range)==0):
        page_range=''
        if(ROTATE != ''):
            options+='['+ROTATE+']'
    else:
        page_range=str(START)+'..'+str(END)
        if(ROTATE == ''):
            options+='['+page_range+']'
        else:
            options+='['+page_range+ROTATE+']'
    
    print(options)
    to_merge.append(options)
    display(HTML('<hr>'))
    
display(HTML('<h3>Specify output filename</h3>'))
display(HTML('<small>Note: If field is left blank, default shall be <u>output.pdf</u></small>'))     
# Strip first so that whitespace-only input also falls back to the default.
output_filename=input().strip()
if(output_filename == ''):
    output_filename='output.pdf'

Input range (start page) of pdf-multi-pg.pdf 3
Input range (end page) of pdf-multi-pg.pdf 5


Rotation Options,Rotation Options
Input,Description
(blank),Blank defaults to None
>,↱ Clockwise right 90°
V,↴ Inverse downwards 180°
<,↲ Clockwise downwards 270°


Input rotation option: pdf-multi-pg.pdf 


pdf-multi-pg.pdf[3..5]


Input range (start page) of 1.pdf 


Rotation Options,Rotation Options
Input,Description
(blank),Blank defaults to None
>,↱ Clockwise right 90°
V,↴ Inverse downwards 180°
<,↲ Clockwise downwards 270°


Input rotation option: 1.pdf <


1.pdf[<]


Input range (start page) of 2.pdf >


Rotation Options,Rotation Options
Input,Description
(blank),Blank defaults to None
>,↱ Clockwise right 90°
V,↴ Inverse downwards 180°
<,↲ Clockwise downwards 270°


Input rotation option: 2.pdf >


2.pdf[>]


 


In [14]:
display(HTML('<b>Output filename shall be:</b> ' + output_filename))

In [15]:
merge(to_merge, output_filename)
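
The option strings passed to `merge` follow the `path[START..END ROTATE]` pattern assembled in the input cell. As a possible refactoring, that assembly could live in a small function. The sketch below is one way to write it; `build_spec` and its parameters are hypothetical names, not part of the original code.

```python
def build_spec(path, start=None, end=None, rotate=''):
    """Hypothetical helper: build a 'path[START..ENDROTATE]' option string."""
    rule = ''
    if start is not None and end is not None:
        # Both bounds are needed; otherwise every page is selected.
        rule = '{0}..{1}'.format(start, end)
    # Accept lower-case 'v' and stray whitespace, as the input cell does.
    rule += rotate.strip().upper()
    # With no range and no rotation, the bare path is returned.
    return '{0}[{1}]'.format(path, rule) if rule else path


specs = [
    build_spec('pdf-multi-pg.pdf', 3, 5),
    build_spec('1.pdf', rotate='<'),
    build_spec('2.pdf', rotate='>'),
]
print(specs)  # ['pdf-multi-pg.pdf[3..5]', '1.pdf[<]', '2.pdf[>]']
```

The resulting list has the same form as `to_merge` and could be passed to `merge` in the same way.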