# Interact with PDF Files
PDF files have become a sort of necessary evil these days. Despite their frequent use,
PDFs are some of the most difficult files to work with when it comes to making modifications,
combining files, and especially extracting text.

We are going to install a library called PyPDF2. On a Linux machine, run:

$ pip install PyPDF2

On a Windows machine using Anaconda, open the Anaconda Prompt:

Start -> Anaconda3 -> Anaconda Prompt

![](img/anaconda_prompt.png)

And then in the command prompt, type:


> conda install -c conda-forge pypdf2

![](img/anaconda_prompt2.png)

The PyPDF2 package includes a PdfFileReader and a PdfFileWriter; just like when
performing other types of file input/output, reading and writing are two entirely separate
processes.

In [3]:
import os
from PyPDF2 import PdfFileReader

print('We have now imported modules')

We have now imported modules


First, let's get started by reading in some basic information from a sample PDF file, the
first couple chapters of Jane Austen's Pride and Prejudice via Project Gutenberg:

In [4]:
input_file_name =  os.path.abspath("./pdf/Pride_and_Prejudice.pdf")
input_file = PdfFileReader(open(input_file_name,"rb"))

print("Number of pages:",input_file.getNumPages())
print("Title:",input_file.getDocumentInfo().title)

Number of pages: 234
Title: Pride and Prejudice, by Jane Austen


We created a PdfFileReader object named `input_file` by passing it a file object
returned by `open()`, using `rb` (read binary) mode and the full path of the file. The additional "binary"
part is necessary for reading PDF files because we aren't just reading basic text data.
PDFs include much more complicated information, and saying "rb" here instead of just
"r" tells Python that we might encounter bytes that can't be represented as standard
readable text.
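To make the difference between the two modes concrete, here is a minimal sketch that uses only the standard library. The sample bytes and the file name `sample.bin` are made up for illustration; they imitate the kind of non-text bytes commonly found near the top of a PDF, and no real PDF is involved.

```python
import os
import tempfile

# Hypothetical sample: a PDF-style header followed by bytes that are not valid UTF-8
sample = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.bin")
    with open(path, "wb") as f:
        f.write(sample)

    # Binary mode hands back the raw bytes untouched
    with open(path, "rb") as f:
        raw = f.read()

    # Text mode tries to decode the bytes and fails on this binary data
    try:
        with open(path, "r", encoding="utf-8") as f:
            f.read()
        text_mode_ok = True
    except UnicodeDecodeError:
        text_mode_ok = False

print(type(raw).__name__, len(raw))
print("text mode decoded cleanly:", text_mode_ok)
```

In binary mode Python returns a `bytes` object and makes no attempt to interpret it, which is what a PDF parser needs.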

We also have access to certain attributes through the getDocumentInfo() method; in fact, if we display
the result of simply calling this method, we will see a dictionary with all of the available
document info:

In [None]:
print(input_file.getDocumentInfo())

Formatting standards in PDFs are inconsistent at best, and it's usually necessary to take
a look at the PDF files you want to use on a case-by-case basis. In this instance, the text
that PyPDF2 extracts from each page contains no newline characters; instead, new
lines appear as multiple spaces. We can
use this knowledge to write out a roughly formatted version of the book to a plain text file
(for instance, if we only had the PDF available and wanted to make it readable on a
not-so-smart mobile device):

In [6]:
output_file_name = os.path.abspath("./temp/Pride_and_Prejudice.txt")
title = input_file.getDocumentInfo().title  # get the file title
total_pages = input_file.getNumPages()  # get the total page count
# The with block closes the text file even if an error occurs partway through
with open(output_file_name, "w") as output_file:
    output_file.write(title + "\n")
    output_file.write("Number of pages: {}\n\n".format(total_pages))
    for page_num in range(total_pages):
        text = input_file.getPage(page_num).extractText()
        # Line breaks come through as double spaces in this PDF
        text = text.replace("  ", "\n")
        output_file.write(text)

Since we're writing out basic text, we chose the plain `w` mode and created a file
Pride_and_Prejudice.txt in the "temp" folder. Meanwhile, we still use `rb` mode to read data from the
PDF file since, before we can extract the plain text from each page, we are in fact
reading much more complicated data. We loop over every page number in the PDF file,
extracting the text from that page. Since we know that new lines will show up as
additional spaces, we can approximate better formatting by replacing every instance of
double spaces (`"  "`) with a newline character.
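The replacement step can be tried on its own without a PDF. The sketch below uses a made-up string standing in for extracted page text, on the assumption that line breaks arrive as runs of two spaces. It also shows a regular-expression variant, which is our own suggestion and not part of the original approach, for runs longer than two spaces.

```python
import re

# Hypothetical extracted text: each original line break arrived as two spaces
extracted = (
    "It is a truth universally acknowledged,  that a single man in possession"
    "  of a good fortune,  must be in want of a wife."
)

# The approach described above: every double space becomes a newline
formatted = extracted.replace("  ", "\n")
lines = formatted.split("\n")
print(formatted)

# With three spaces, plain replace leaves a stray leading space on the next line
uneven = "first line   second line"
plain = uneven.replace("  ", "\n")

# Collapsing any run of two or more spaces avoids the stray space
collapsed = re.sub(r" {2,}", "\n", uneven)
```

Single spaces between words are left alone in both variants, so ordinary sentences stay intact.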

Instead of extracting text, we might want to modify the PDF file itself, saving out a new
version of the PDF. We'll see more examples of why and how this might occur in the
next section, but for now we'll create the simplest "modified" file by saving out only a
section of the original file. Here we copy over the first three pages of the PDF (not including the
cover page) into a new PDF file:

In [5]:
import os
from PyPDF2 import PdfFileReader, PdfFileWriter

input_file_name = os.path.abspath("./pdf/Pride_and_Prejudice.pdf")
input_file = PdfFileReader(open(input_file_name, "rb"))
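output_PDF = PdfFileWriter()
# Page indices are zero-based, so range(1, 4) skips the cover page (index 0)
for page_num in range(1, 4):
    output_PDF.addPage(input_file.getPage(page_num))
output_file_name = os.path.abspath("./temp/portion.pdf")
# Write in binary mode; the with block closes the file when done
with open(output_file_name, "wb") as output_file:
    output_PDF.write(output_file)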
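Because `getPage()` counts pages from zero, it is easy to be off by one when choosing a page range. The sketch below uses a hypothetical helper, `page_indices`, which is not part of PyPDF2. It converts a page number as a reader would count it into the zero-based indices used in the loop above, and clips the range at the end of the document.

```python
def page_indices(first_page, count, total_pages):
    """Return zero-based indices for `count` pages starting at 1-based `first_page`.

    Hypothetical helper for illustration; it is not part of PyPDF2.
    The range is clipped so it never runs past the end of the document.
    """
    start = max(first_page - 1, 0)
    stop = min(start + count, total_pages)
    return list(range(start, stop))


# Three pages starting at page 2 (skipping the cover) of a 234-page file
after_cover = page_indices(2, 3, 234)
print(after_cover)

# Asking for more pages than remain is clipped at the last page
tail = page_indices(233, 5, 234)
print(tail)
```

The first call produces the same indices as `range(1, 4)` in the cell above.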