This notebook shows how Python can help with unstructured data sources such as PDF (Portable Document Format) files. Once the useful information has been extracted from a PDF, it can easily be used in any Machine Learning or Natural Language Processing model.

Python has several libraries that can be used to handle PDF documents. Here are a few:

#### PDFMiner

Parse, analyze, and convert PDF documents

PDF to HTML conversion

Outline (TOC) extraction

Tagged contents extraction

Reconstruct the original layout by grouping text chunks

#### PyPDF2

extracting document information (title, author, …)

splitting documents page by page

merging documents page by page

encrypting and decrypting PDF files

merging multiple files into a single file

#### Tabula-py

simple Python wrapper of tabula-java

can read tables in a PDF, and convert into pandas' DataFrame

can convert PDF file into CSV, TSV or JSON

#### PDFQuery

light wrapper around PDFMiner, PyQuery and lxml

designed to reliably extract data from sets of PDFs with as little code as possible

==============================

Here we will be learning how PyPDF2 works

### Contents

1. Extracting text

2. Rotating Pages

3. Merging PDFs

4. Encrypting

5. Decrypting

6. Adding Watermark

7. Combining required pages from different PDFs

==============================

##### *The PDF files used here were picked at random from Google and are not related to the topic in any way.

#### *The source of my knowledge for this work is the book 'Automate the Boring Stuff with Python' by Al Sweigart

In [1]:
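# import the library
# note: the camelCase names used below (PdfFileReader, getPage, numPages, ...) are the PyPDF2 1.x API;
# PyPDF2 3.0 and later removed them in favour of new names (PdfReader, reader.pages, ...)
import PyPDF2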

## Extracting Text

In [2]:
# a PDF is a binary file
# open it in read-binary mode ('rb') and keep the file object in a variable
pdf_file = open('GDP_1.pdf', 'rb')

# wrap the file object in a PdfFileReader, which represents the PDF document
pdf_reader = PyPDF2.PdfFileReader(pdf_file)

In [3]:
# print number of pages
print(pdf_reader.numPages)

7


In [4]:
# to extract text from a page, first get a page object (a single page of the PDF) from the PdfFileReader object
# pages are zero-indexed: the first page is page 0 and the second page is page 1
# so here we are extracting the second page
page = pdf_reader.getPage(1)

# extract the string from your page object
print(page.extractText())

GROSS DOMESTIC PRODUCT (GDP)
NATIONAL ACCOUNTS AT A GLANCE  OECD 2009
16
1.Size of GDP

1. Size of GDP
Gross domestic product (GDP) is the standard measur
e
of the value of final goods and services produced b
y a
country during a period. While GDP is the single mo
st
important indicator to capture these economic activ
i-
ties, it is not a good measure of societies' well-b
eing
and only a limited measure of people's material liv
ing
standards. The sections and indicators that follow

better address this and other related issues and th
is is
one of the primary purposes of this publication.

Countries calculate GDP in their own currencies. In

order to compare across countries these estimates

have to be converted into a common currency. Often

the conversion is made using current exchange rates

but these can give a misleading comparison of the

true volumes of final goods and services in GDP. A 
bet-
ter approach is to use purchasing power parities

(PPPs). PPPs are currency converters 

In [5]:
# make sure you close the files you opened.
pdf_file.close()
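The open/close pair above can also be written with a `with` block, which closes the file even if reading fails part-way. This is a minimal sketch, assuming the PyPDF2 1.x API used in this notebook; `extract_page_text` is a hypothetical helper name, not part of PyPDF2:

```python
def extract_page_text(pdf_path, page_index):
    """Return the text of one zero-indexed page of the PDF at pdf_path."""
    # imported inside the function so that defining the helper does not require PyPDF2
    import PyPDF2

    # the with block closes the file automatically, even if an exception is raised
    with open(pdf_path, 'rb') as pdf_file:
        pdf_reader = PyPDF2.PdfFileReader(pdf_file)
        if not 0 <= page_index < pdf_reader.numPages:
            raise IndexError('page index out of range: %d' % page_index)
        return pdf_reader.getPage(page_index).extractText()
```

For example, `extract_page_text('GDP_1.pdf', 1)` should return the same second-page text as the cells above.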

## Rotating

### Single Page

In [6]:
pdf_file = open('GDP_1.pdf', 'rb')

pdf_reader = PyPDF2.PdfFileReader(pdf_file)

# a PdfFileWriter object is only an in-memory representation of a new PDF document; nothing is saved until write() is called
pdf_writer = PyPDF2.PdfFileWriter()

In [7]:
page = pdf_reader.getPage(0)

# rotate the chosen page clockwise by 90 degrees
page.rotateClockwise(90)

# add page to the pdf writer object
pdf_writer.addPage(page)

In [8]:
# assign a name to your new pdf and open that in write binary mode this time as we are going to write new pdf.
# write binary mode is denoted by 'wb' as used below
rotated_page_file = open('rotated_page_GDP_1.pdf', 'wb')

# The write() method takes a regular file object that has been opened in write-binary mode
pdf_writer.write(rotated_page_file)

In [9]:
# close the opened binary files
pdf_file.close()

rotated_page_file.close()

### Multiple Pages

In [10]:
pdf_file = open('GDP_1.pdf', 'rb')

pdf_reader = PyPDF2.PdfFileReader(pdf_file)

pdf_writer = PyPDF2.PdfFileWriter()

In [11]:
# loop over the pages that need to be rotated
for page_num in range(pdf_reader.numPages):
    pdf_writer.addPage(pdf_reader.getPage(page_num).rotateClockwise(90)) 

In [12]:
rotated_pages_file = open('rotated_pages_GDP_1.pdf', 'wb')

pdf_writer.write(rotated_pages_file)

In [13]:
pdf_file.close()

rotated_pages_file.close()
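The loop above rotates every page. To rotate only some pages while copying the rest unchanged, a sketch along these lines could work. `rotate_selected` is a hypothetical helper; it only assumes that `pdf_reader` and `pdf_writer` expose the PyPDF2 1.x methods already used in this notebook (`numPages`, `getPage()`, `rotateClockwise()`, `addPage()`):

```python
def rotate_selected(pdf_reader, pdf_writer, pages_to_rotate, angle=90):
    """Copy all pages to pdf_writer, rotating only the zero-indexed pages_to_rotate."""
    # PDF page rotation works in right angles only
    if angle % 90 != 0:
        raise ValueError('angle must be a multiple of 90')
    pages_to_rotate = set(pages_to_rotate)

    for page_num in range(pdf_reader.numPages):
        page = pdf_reader.getPage(page_num)
        if page_num in pages_to_rotate:
            page.rotateClockwise(angle)  # modifies the page object in place
        pdf_writer.addPage(page)
```

With the objects from In [10], `rotate_selected(pdf_reader, pdf_writer, [0, 2])` would rotate the first and third pages, after which the writer is saved with `write()` as before.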

## Merging

### A small number of files

Use this method when you have only a few PDF files, or when their names do not follow a common pattern.

In [14]:
pdf_file_1 = open('GDP_1.pdf', 'rb')
pdf_file_2 = open('GDP_2.pdf', 'rb')
pdf_file_3 = open('GDP_3.pdf', 'rb')

pdf_reader_1 = PyPDF2.PdfFileReader(pdf_file_1)
pdf_reader_2 = PyPDF2.PdfFileReader(pdf_file_2)
pdf_reader_3 = PyPDF2.PdfFileReader(pdf_file_3)

pdf_writer = PyPDF2.PdfFileWriter()

print('Number of pages in GDP_1 = ',pdf_reader_1.numPages)
print('Number of pages in GDP_2 = ',pdf_reader_2.numPages)
print('Number of pages in GDP_3 = ',pdf_reader_3.numPages)

Number of pages in GDP_1 =  7
Number of pages in GDP_2 =  53
Number of pages in GDP_3 =  9




In [15]:
for page_num in range(pdf_reader_1.numPages):
    page = pdf_reader_1.getPage(page_num)
    pdf_writer.addPage(page)
    
for page_num in range(pdf_reader_2.numPages):
    page = pdf_reader_2.getPage(page_num)
    pdf_writer.addPage(page)

for page_num in range(pdf_reader_3.numPages):
    page = pdf_reader_3.getPage(page_num)
    pdf_writer.addPage(page)

In [16]:
merged_file = open('merged_file.pdf', 'wb')

pdf_writer.write(merged_file)

In [17]:
pdf_file_1.close()
pdf_file_2.close()
pdf_file_3.close()
merged_file.close()

### A large number of files

Use this method when you have a large number of files whose names follow a common pattern.

In [18]:
# import the required libraries
from PyPDF2 import PdfFileMerger, PdfFileReader

In [19]:
# create an object to merge all the files
merged_object = PdfFileMerger()

In [20]:
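# append GDP_1.pdf, GDP_2.pdf and GDP_3.pdf in order
# PdfFileReader accepts a file path directly; no file mode is needed
for file_number in range(1, 4):
    merged_object.append(PdfFileReader('GDP_' + str(file_number) + '.pdf'))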

In [21]:
# write the merged PDF to disk, then release the merger's resources
merged_object.write('merged_large_files.pdf')
merged_object.close()
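When the number of files is not known in advance, the file names can be collected by pattern instead of being built from a counter. Plain alphabetical sorting would place `GDP_10.pdf` before `GDP_2.pdf`, so this sketch sorts by the embedded numbers. `natural_key` and `find_numbered_pdfs` are hypothetical helper names, and the `GDP_*.pdf` pattern is an assumption based on the file names used above:

```python
import glob
import re

def natural_key(file_name):
    """Sort key that compares digit runs as numbers, so 'GDP_2' sorts before 'GDP_10'."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r'(\d+)', file_name)]

def find_numbered_pdfs(pattern='GDP_*.pdf'):
    """Return the file names matching pattern in natural (numeric) order."""
    return sorted(glob.glob(pattern), key=natural_key)
```

Each name returned by `find_numbered_pdfs()` can then be wrapped in `PdfFileReader` and appended in a loop, as in In [20].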

NOTE: addPage() only appends pages to the end of a PdfFileWriter object. To place a page at a specific position, PdfFileWriter also provides insertPage(page, index).

## Encrypting

In [22]:
pdf_file = open('GDP_1_Copy.pdf' , 'rb')

pdf_reader = PyPDF2.PdfFileReader(pdf_file)

pdf_writer = PyPDF2.PdfFileWriter()

In [23]:
for page_num in range(pdf_reader.numPages):
    pdf_writer.addPage(pdf_reader.getPage(page_num))

In [24]:
# Before calling the write() method to save to a file, call the encrypt() method and pass it a password string
pdf_writer.encrypt('encrypted')

encrypt() takes two passwords, user_pwd and owner_pwd (here pwd means password), as its first and second arguments, respectively. The former allows a user to view the PDF; the latter allows a user to set permissions for printing, commenting, extracting text, and other features. If only one argument is passed to encrypt(), it is used for both passwords.

In [25]:
encrypted_file = open('encrypted_file.pdf', 'wb')
pdf_writer.write(encrypted_file)

In [26]:
pdf_file.close()
encrypted_file.close()
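To give viewers and the document owner different passwords, both arguments of `encrypt()` can be supplied. This is a minimal sketch, assuming the PyPDF2 1.x API used in this notebook; `encrypt_pdf` is a hypothetical helper name, and the passwords in the usage line are placeholders:

```python
def encrypt_pdf(source_path, target_path, user_pwd, owner_pwd=None):
    """Write an encrypted copy of source_path to target_path."""
    # imported inside the function so that defining the helper does not require PyPDF2
    import PyPDF2

    with open(source_path, 'rb') as source_file:
        pdf_reader = PyPDF2.PdfFileReader(source_file)
        pdf_writer = PyPDF2.PdfFileWriter()
        for page_num in range(pdf_reader.numPages):
            pdf_writer.addPage(pdf_reader.getPage(page_num))

        # when owner_pwd is None, PyPDF2 reuses user_pwd as the owner password
        pdf_writer.encrypt(user_pwd, owner_pwd)

        # the source file must still be open while the writer saves the pages
        with open(target_path, 'wb') as target_file:
            pdf_writer.write(target_file)
```

A call such as `encrypt_pdf('GDP_1_Copy.pdf', 'encrypted_file.pdf', 'view-only', 'full-access')` would then replace cells In [22] to In [26].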

## Decrypting

In [27]:
pdf_file = open('encrypted_file.pdf', 'rb')

pdf_reader = PyPDF2.PdfFileReader(pdf_file)

# check whether the file is encrypted
pdf_reader.isEncrypted

True

In [28]:
# try to get page out of the reader object
pdf_reader.getPage(0)
# this raises the error 'file has not been decrypted'

PdfReadError: file has not been decrypted

In [29]:
# create a fresh reader object for the encrypted file (PdfFileReader accepts a file path directly)
pdf_reader = PyPDF2.PdfFileReader('encrypted_file.pdf')

# decrypt the file using decrypt(), passing the password string as argument
# this returns 0 if the password is wrong, 1 if it matches the user password and 2 if it matches the owner password
pdf_reader.decrypt('encrypted')

1

In [30]:
pdf_reader.getPage(0)
# this time no error

{'/Type': '/Page',
 '/MediaBox': [0, 0, 595, 842],
 '/Rotate': 0,
 '/Parent': {'/Type': '/Pages',
  '/Count': 7,
  '/Kids': [IndirectObject(3, 0),
   IndirectObject(4, 0),
   IndirectObject(5, 0),
   IndirectObject(6, 0),
   IndirectObject(7, 0),
   IndirectObject(8, 0),
   IndirectObject(9, 0)]},
 '/Resources': {'/ProcSet': ['/PDF', '/ImageC', '/Text'],
  '/ColorSpace': {'/R10': ['/ICCBased', IndirectObject(14, 0)]},
  '/XObject': {'/R11': {'/Subtype': '/Image',
    '/ColorSpace': ['/ICCBased', IndirectObject(14, 0)],
    '/Width': 1152,
    '/Height': 532,
    '/BitsPerComponent': 8,
    '/Filter': '/DCTDecode'}},
  '/Font': {'/R12': {'/BaseFont': '/XQVVHK+Caecilia-Roman',
    '/FontDescriptor': {'/Type': '/FontDescriptor',
     '/FontName': '/XQVVHK+Caecilia-Roman',
     '/FontBBox': [-55, -270, 1072, 788],
     '/Flags': 4,
     '/Ascent': 788,
     '/CapHeight': 788,
     '/Descent': -270,
     '/ItalicAngle': 0,
     '/StemV': 160,
     '/MissingWidth': 278,
     '/CharSet': '/A/

In [31]:
pdf_file.close()

NOTE: decrypt() only decrypts the reader object, not the file itself. The file on disk stays encrypted, so you need to call decrypt() again whenever you create a new reader object for it, before calling getPage().
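The integer returned by `decrypt()` can be turned into a readable message before any pages are requested. This sketch assumes the PyPDF2 1.x return codes (0 for a wrong password, 1 for a user-password match, 2 for an owner-password match); `describe_decrypt_result` is a hypothetical helper name:

```python
def describe_decrypt_result(result):
    """Translate the integer returned by PdfFileReader.decrypt() into a message."""
    messages = {
        0: 'wrong password: the PDF is still locked',
        1: 'decrypted with the user password',
        2: 'decrypted with the owner password',
    }
    # any other value is unexpected, so report it rather than guess
    return messages.get(result, 'unexpected return value: %r' % (result,))
```

It could be used as `print(describe_decrypt_result(pdf_reader.decrypt('encrypted')))`, and checking for a result of 0 before calling `getPage()` avoids the 'file has not been decrypted' error shown earlier.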

## Add Watermark

In [32]:
pdf_file = open('GDP_1.pdf', 'rb')

pdf_reader = PyPDF2.PdfFileReader(pdf_file)

pdf_writer = PyPDF2.PdfFileWriter()

In [33]:
# open the watermark file as read binary mode 'rb'
watermark_file = open('Watermark_.pdf', 'rb')

# save the reader object of the watermark file
watermark_reader = PyPDF2.PdfFileReader(watermark_file)

In [34]:
# get the page of the pdf file you want to watermark
# loop can be made for multiple pages
first_page = pdf_reader.getPage(0)

# mergePage() is used to merge both the pages with watermark page as argument
first_page.mergePage(watermark_reader.getPage(0))

# add the merged page to the writer object
pdf_writer.addPage(first_page)

In [35]:
# add other pages of the pdf file to the watermarked(merged) page in the new pdf file
for page_num in range(1,pdf_reader.numPages):
    pdf_writer.addPage(pdf_reader.getPage(page_num))

In [36]:
watermarked_file = open('watermarked_file.pdf', 'wb')

pdf_writer.write(watermarked_file)

In [37]:
pdf_file.close()
watermark_file.close()
watermarked_file.close()
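The cells above watermark only the first page. To stamp every page, the same `mergePage()` call can be applied inside a loop. This is a sketch, assuming objects with the PyPDF2 1.x methods used above; `watermark_all_pages` is a hypothetical helper name:

```python
def watermark_all_pages(pdf_reader, watermark_page, pdf_writer):
    """Merge watermark_page onto every page of pdf_reader and add the results to pdf_writer."""
    for page_num in range(pdf_reader.numPages):
        page = pdf_reader.getPage(page_num)
        # mergePage() adds the watermark's content to this page in place
        page.mergePage(watermark_page)
        pdf_writer.addPage(page)
```

With the objects from In [32] and In [33], the call would be `watermark_all_pages(pdf_reader, watermark_reader.getPage(0), pdf_writer)`, followed by `write()` as in In [36].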

## Combining selected pages from different PDFs

In [38]:
# list of file names
files = ['GDP_1.pdf','GDP_2.pdf','GDP_3.pdf','GDP_4.pdf']

pdf_writer = PyPDF2.PdfFileWriter()

In [39]:
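# here we have two loops: one over the file names and one over the pages we want to add to the new file
# range(1, numPages) selects every page except the first one of each PDF
# every source file has to stay open until the new PDF is written, so keep the file objects in a list
open_files = []
for file_name in files:
    pdf_file = open(file_name, 'rb')
    open_files.append(pdf_file)
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)
    
    for page_num in range(1, pdf_reader.numPages):
        page = pdf_reader.getPage(page_num)
        pdf_writer.addPage(page)

In [40]:
required_file = open('required_file.pdf', 'wb')
pdf_writer.write(required_file)

In [41]:
# close every source file, not just the last one opened in the loop
for pdf_file in open_files:
    pdf_file.close()
required_file.close()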
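The inner loop above takes every page except the first. When a different selection is needed per file, it can help to describe the pages as a string such as `'1-3,5'`. The helper below is a hypothetical sketch (not part of PyPDF2) that turns such a string of one-based page numbers into the zero-based indices that `getPage()` expects:

```python
def parse_page_ranges(spec):
    """Convert a string like '1-3,5' (one-based, inclusive) into zero-based page indices."""
    indices = []
    for part in spec.split(','):
        part = part.strip()
        if not part:
            continue  # ignore empty pieces such as a trailing comma
        if '-' in part:
            first, last = (int(number) for number in part.split('-', 1))
        else:
            first = last = int(part)
        if first < 1 or last < first:
            raise ValueError('invalid page range: %r' % part)
        # shift from one-based page numbers to zero-based indices
        indices.extend(range(first - 1, last))
    return indices
```

For example, `parse_page_ranges('1-3,5')` returns `[0, 1, 2, 4]`, and each index can be passed to `pdf_reader.getPage()` in place of the fixed `range(1, pdf_reader.numPages)`.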

#### Many websites let us do what we covered above, but they often come with limits, such as the number of files or the time each merge operation takes, and that is where Python helps.