# Working with PDF Files

Sources:
- [How to Work With a PDF in Python](https://realpython.com/pdf-python/)
- [Create and Modify PDF Files in Python](https://realpython.com/creating-modifying-pdf/#creating-a-pdf-file-from-scratch)

The Portable Document Format, or PDF, is a file format that can be used to present and exchange documents reliably across operating systems. While the PDF was originally invented by Adobe, it is now an open standard that is maintained by the International Organization for Standardization (ISO). You can work with a preexisting PDF in Python by using the PyPDF2 package.

PyPDF2 is a pure-Python package that you can use for many different types of PDF operations.

### History of pyPdf, PyPDF2, and PyPDF4

The original pyPdf package was released way back in 2005. The last official release of pyPdf was in 2010. After a lapse of around a year, a company called Phasit sponsored a fork of pyPdf called PyPDF2. The code was written to be backwards compatible with the original and worked quite well for several years, with its last release being in 2016.

There was a brief series of releases of a package called PyPDF3, and then the project was renamed to PyPDF4. All of these projects do pretty much the same thing, but the biggest difference between pyPdf and PyPDF2+ is that the later versions added Python 3 support. There is also a separate Python 3 fork of the original pyPdf, but that one has not been maintained for many years.

While PyPDF2 was recently abandoned, the new PyPDF4 does not have full backwards compatibility with PyPDF2. Most of the examples in this article will work perfectly fine with PyPDF4, but there are some that cannot, which is why PyPDF4 is not featured more heavily in this article. Feel free to swap out the imports for PyPDF2 with PyPDF4 and see how it works for you.

### Installation

You can install PyPDF2 with pip, or with conda if you happen to be using Anaconda instead of regular Python.

Here’s how you would install PyPDF2 with pip:

In [None]:
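# The examples in this notebook use the pre-3.0 PyPDF2 API (PdfFileReader, getPage, ...),
# which was removed in PyPDF2 3.0, so the version is pinned here.
#!pip install "PyPDF2<3.0"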

In [None]:
#!pip install PyPDF4

The install is quite quick as PyPDF2 does not have any dependencies. You will likely spend as much time downloading the package as you will installing it.

Now let’s move on and learn how to extract some information from a PDF.

### Helper Functions for Working With PDF Documents

#### How to Extract Document Information From a PDF in Python

You can use PyPDF2 to extract metadata and some text from a PDF. This can be useful when you’re doing certain types of automation on your preexisting PDF files.

Here are the current types of data that can be extracted:
- Author
- Creator
- Producer
- Subject
- Title
- Number of pages

Let’s write some code that opens a sample PDF (here, data/test_read.pdf) and reads these attributes:

In [None]:
# extract_doc_info.py

from PyPDF2 import PdfFileReader

def extract_information(pdf_path):
    with open(pdf_path, 'rb') as f:
        pdf = PdfFileReader(f)
        information = pdf.getDocumentInfo()
        number_of_pages = pdf.getNumPages()

    txt = f"""
    Information about {pdf_path}: 

    Author: {information.author}
    Creator: {information.creator}
    Producer: {information.producer}
    Subject: {information.subject}
    Title: {information.title}
    Number of pages: {number_of_pages}
    """

    print(txt)
    return information

if __name__ == '__main__':
    path = 'data/test_read.pdf'
    extract_information(path)

Here you import PdfFileReader from the PyPDF2 package. The PdfFileReader is a class with several methods for interacting with PDF files. In this example, you call `.getDocumentInfo()`, which will return an instance of DocumentInformation. This contains most of the information that you’re interested in. You also call `.getNumPages()` on the reader object, which returns the number of pages in the document.

The information variable has several instance attributes that you can use to get the rest of the metadata you want from the document. You print out that information and also return it for potential future use.
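If the metadata needs to feed into other code rather than be printed, one option is to copy the attributes into a plain dictionary. The following is a minimal sketch, not part of PyPDF2: `metadata_to_dict` and `METADATA_FIELDS` are hypothetical names, and the stand-in object only imitates the attributes used above so that the example runs without a PDF file.

```python
from types import SimpleNamespace

# The five metadata attributes read in extract_information() above.
METADATA_FIELDS = ("author", "creator", "producer", "subject", "title")


def metadata_to_dict(information, number_of_pages=None):
    """Copy the known metadata attributes into a plain dict (hypothetical helper)."""
    # getattr() with a default keeps the helper from failing on a missing attribute.
    result = {field: getattr(information, field, None) for field in METADATA_FIELDS}
    if number_of_pages is not None:
        result["number_of_pages"] = number_of_pages
    return result


# Stand-in for the object returned by .getDocumentInfo(); values are made up.
fake_info = SimpleNamespace(
    author="Jane Doe", creator=None, producer="Example Producer",
    subject=None, title="Sample",
)
metadata = metadata_to_dict(fake_info, number_of_pages=2)
print(metadata)
```

In real use you would pass the value returned by `extract_information()` instead of `fake_info`.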

#### Merge PDFs

There are many situations where you will want to take two or more PDFs and merge them together into a single PDF. For example, you might have a standard cover page that needs to go on to many types of reports. You can use Python to help you do that sort of thing.

For this example, you can open up a PDF and print a page out as a separate PDF. Then do that again, but with a different page. That will give you a couple of inputs to use for example purposes.

Let’s go ahead and write some code that you can use to merge PDFs together:

In [None]:
# pdf_merging.py

from PyPDF2 import PdfFileReader, PdfFileWriter

def merge_pdfs(paths, output):
    pdf_writer = PdfFileWriter()

    for path in paths:
        pdf_reader = PdfFileReader(path)
        for page in range(pdf_reader.getNumPages()):
            # Add each page to the writer object
            pdf_writer.addPage(pdf_reader.getPage(page))

    # Write out the merged PDF
    with open(output, 'wb') as out:
        pdf_writer.write(out)

if __name__ == '__main__':
    paths = ['document1.pdf', 'document2.pdf']
    merge_pdfs(paths, output='merged.pdf')

You can use merge_pdfs() when you have a list of PDFs that you want to merge together. You will also need to know where to save the result, so this function takes a list of input paths and an output path.

Then you loop over the inputs and create a PDF reader object for each of them. Next, you iterate over all the pages in each PDF file and use .addPage() to add each of those pages to the writer object.

Once you’re finished iterating over all of the pages of all of the PDFs in your list, you will write out the result at the end.

One item I would like to point out is that you could enhance this script a bit by adding in a range of pages to be added if you didn’t want to merge all the pages of each PDF. If you’d like a challenge, you could also create a command line interface for this function using Python’s argparse module.
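Both enhancements could look roughly like the sketch below. It is one possible design under stated assumptions, not the article's implementation: `page_indices`, `merge_pdf_ranges`, `build_parser`, and the `--first`/`--last` options are hypothetical names, page indices are zero-based with an inclusive end, and the same range is applied to every input file.

```python
import argparse


def page_indices(num_pages, first=0, last=None):
    """Return zero-based page indices from first to last (inclusive), clamped to the document."""
    if last is None or last >= num_pages:
        last = num_pages - 1
    first = max(first, 0)
    return list(range(first, last + 1))


def merge_pdf_ranges(paths, output, first=0, last=None):
    """Merge the selected page range of every input PDF into one output file."""
    # Imported here so the helpers in this sketch can be tried without PyPDF2 installed.
    from PyPDF2 import PdfFileReader, PdfFileWriter

    pdf_writer = PdfFileWriter()
    for path in paths:
        pdf_reader = PdfFileReader(path)
        for index in page_indices(pdf_reader.getNumPages(), first, last):
            pdf_writer.addPage(pdf_reader.getPage(index))

    with open(output, "wb") as out:
        pdf_writer.write(out)


def build_parser():
    """Build a command line parser for the merge function."""
    parser = argparse.ArgumentParser(description="Merge PDF files.")
    parser.add_argument("paths", nargs="+", help="input PDF files, in merge order")
    parser.add_argument("-o", "--output", default="merged.pdf", help="output file")
    parser.add_argument("--first", type=int, default=0, help="first page index (zero-based)")
    parser.add_argument("--last", type=int, default=None, help="last page index (inclusive)")
    return parser


# Parse an explicit argument list here; a real script would call parse_args()
# with no arguments and then pass the result to merge_pdf_ranges().
args = build_parser().parse_args(["document1.pdf", "document2.pdf", "--last", "0"])
print(args.paths, args.output, args.first, args.last)
print(page_indices(5, args.first, args.last))
```

Clamping the range instead of raising an error means that a request for pages beyond the end of a short document simply yields fewer pages; raising an error would be an equally reasonable choice.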

### Extracting Text From a PDF

In this section, you’ll learn how to read a PDF file and extract the text using the PyPDF2 package.

Let’s get started by opening a PDF and reading some information about it. You’ll use the sample.pdf file located in the data/ folder, which matches the path built in the code below.

Open IDLE’s interactive window and import the PdfFileReader class from the PyPDF2 package:



In [None]:
from PyPDF2 import PdfFileReader

To create a new instance of the PdfFileReader class, you’ll need the path to the PDF file that you want to open. Let’s get that now using the pathlib module:

In [None]:
from pathlib import Path
pdf_path = (
    Path.cwd()
    / "data"
    / "sample.pdf"
)

Now create the PdfFileReader instance:

In [None]:
pdf = PdfFileReader(str(pdf_path))

You convert pdf_path to a string because PdfFileReader doesn’t know how to read from a pathlib.Path object.
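The conversion itself is plain pathlib and can be tried without any PDF library. The sketch below wraps it in a small helper; `pdf_path_as_str` is a hypothetical name, and the suffix check is an extra safeguard added for illustration only.

```python
from pathlib import Path


def pdf_path_as_str(*parts):
    """Build a path under the current working directory and return it as a string."""
    path = Path.cwd().joinpath(*parts)
    # Reject anything that does not look like a PDF before handing it to a reader.
    if path.suffix.lower() != ".pdf":
        raise ValueError(f"Not a PDF path: {path}")
    return str(path)


path_string = pdf_path_as_str("data", "sample.pdf")
print(type(path_string).__name__)
print(path_string)
```

The helper only builds the string; it does not check that the file exists.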

Now that you’ve created a PdfFileReader instance, you can use it to gather information about the PDF. For example, .getNumPages() returns the number of pages contained in the PDF file:

In [None]:
pdf.getNumPages()

2

Notice that .getNumPages() is written in mixedCase, not lower_case_with_underscores as recommended in PEP 8. Remember, PEP 8 is a set of guidelines, not rules. As far as Python is concerned, mixedCase is perfectly acceptable.

> Note: PyPDF2 was adapted from the pyPdf package. pyPdf was written in 2005, only four years after PEP 8 was published. At that time, many Python programmers were migrating from languages in which mixedCase was more common.

PDF pages are represented in PyPDF2 with the PageObject class. You use PageObject instances to interact with pages in a PDF file. You don’t need to create your own PageObject instances directly. Instead, you can access them through the PdfFileReader object’s .getPage() method.

> Note: While PyPDF2 provides .extractText() on its page objects, it does not work very well: some PDFs will return text and some will return an empty string. If you need more reliable text extraction, check out the PDFMiner project instead. PDFMiner is much more robust and was specifically designed for extracting text from PDFs.

There are two steps to extracting text from a single PDF page:
1. Get a PageObject with PdfFileReader.getPage().
2. Extract the text as a string with the PageObject instance’s .extractText() method.

You can get the PageObject representing a specific page by passing the page’s index to PdfFileReader.getPage():

In [None]:
first_page = pdf.getPage(0)

In [None]:
type(first_page)

PyPDF2.pdf.PageObject

In [None]:
first_page.extractText()

' A Simple PDF File  This is a small demonstration .pdf file -  just for use in the Virtual Mechanics tutorials. More text. And more  text. And more text. And more text. And more text.  And more text. And more text. And more text. And more text. And more  text. And more text. Boring, zzzzz. And more text. And more text. And  more text. And more text. And more text. And more text. And more text.  And more text. And more text.  And more text. And more text. And more text. And more text. And more  text. And more text. And more text. Even more. Continued on page 2 ...'

Every PdfFileReader object has a .pages attribute that you can use to iterate over all of the pages in the PDF in order.



In [None]:
for page in pdf.pages:
    print(page.extractText())

 A Simple PDF File  This is a small demonstration .pdf file -  just for use in the Virtual Mechanics tutorials. More text. And more  text. And more text. And more text. And more text.  And more text. And more text. And more text. And more text. And more  text. And more text. Boring, zzzzz. And more text. And more text. And  more text. And more text. And more text. And more text. And more text.  And more text. And more text.  And more text. And more text. And more text. And more text. And more  text. And more text. And more text. Even more. Continued on page 2 ...
 Simple PDF File 2  ...continued from page 1. Yet more text. And more text. And more text.  And more text. And more text. And more text. And more text. And more  text. Oh, how boring typing this stuff. But not as boring as watching  paint dry. And more text. And more text. And more text. And more text.  Boring.  More, a little more text. The end, and just as well. 
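The raw output contains irregular whitespace and gives no indication of where one page ends and the next begins. One way to tidy this up is sketched below, assuming only that each page object has an `.extractText()` method. `iter_page_text` is a hypothetical helper, and `FakePage` is a stand-in class so the example runs without a PDF file.

```python
def iter_page_text(pages):
    """Yield (page_number, cleaned_text) pairs, numbering pages from 1."""
    for number, page in enumerate(pages, start=1):
        # Collapse runs of whitespace left over from the PDF layout.
        yield number, " ".join(page.extractText().split())


class FakePage:
    """Stand-in for a page object, used only for this demonstration."""

    def __init__(self, text):
        self._text = text

    def extractText(self):
        return self._text


pages = [
    FakePage(" A Simple PDF File  This is a small demonstration "),
    FakePage(" Simple PDF File 2 "),
]
for number, text in iter_page_text(pages):
    print(f"--- Page {number} ---")
    print(text)
```

With a real document you would pass `pdf.pages` instead of the list of stand-in pages.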


Putting these pieces together, the following function writes the document title, the number of pages, and the text of every page to a plain-text file:

In [None]:
def extract_pdf_text(path, output_path="data/pdf_output.txt"):
    pdf_reader = PdfFileReader(str(path))
    output_file_path = Path.cwd() / output_path
    
    with output_file_path.open(mode="w") as output_file:
        title = pdf_reader.documentInfo.title
        num_pages = pdf_reader.getNumPages()
        output_file.write(f"{title}\nNumber of pages: {num_pages}\n\n")

        for page in pdf_reader.pages:
            text = page.extractText()
            output_file.write(text)

In [None]:
from pathlib import Path
pdf_path = (
    Path.cwd()
    / "data"
    / "sample.pdf"
)

extract_pdf_text(pdf_path)