# Objective
As a good example of extracting information from PDF files, we set out to convert a Jupyter notebook in PDF form to its original `.ipynb` form.

I don't know if this is hard. Let's get started.

In [1]:
from pathlib import Path
from PyPDF2 import PdfFileReader

**N.B.**

- Keep the `n_pages` assignment inside the `with` block; once the file is closed, the call raises `ValueError: seek of closed file`.
  - The same applies to `pdf.getPage()`.
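
A minimal sketch of why this happens, assuming the reader keeps a reference to the file object and reads from it on demand. `LazyReader` is a hypothetical stand-in written for this note, not PyPDF2's actual implementation; it only reproduces the "read after the `with` block has closed the file" situation.

```python
import tempfile
from pathlib import Path


class LazyReader:
    """Hypothetical reader: stores the stream and touches it only when asked."""

    def __init__(self, stream):
        self.stream = stream  # nothing is read at construction time

    def first_bytes(self, n=4):
        self.stream.seek(0)  # raises ValueError once the file is closed
        return self.stream.read(n)


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo.bin"
    path.write_bytes(b"%PDF-1.5 demo")

    with open(path, "rb") as f:
        reader = LazyReader(f)
        inside = reader.first_bytes()  # fine: the file is still open

    try:
        reader.first_bytes()  # the `with` block has already closed the file
        error = None
    except ValueError as exc:
        error = exc

print(inside)
print(repr(error))
```

Anything that needs the stream has to run while the file is open, which is why both calls sit inside the `with` block in the next cell.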

In [2]:
path_pdf = Path("pdf2ipynb.pdf")
with open(path_pdf, "rb") as f:
    pdf = PdfFileReader(f)
    n_pages = pdf.getNumPages()
    page00 = pdf.getPage(0)
    print(page00)
pdf

{'/Resources': IndirectObject(21, 0), '/Type': '/Page', '/Parent': IndirectObject(93, 0), '/Contents': [IndirectObject(20, 0)], '/MediaBox': [0, 0, 612, 792]}


<PyPDF2.pdf.PdfFileReader at 0x7f3149fc10d0>

In [3]:
n_pages

14

In [4]:
info = pdf.getDocumentInfo()
info

{'/Creator': 'LaTeX with hyperref',
 '/Producer': 'xdvipdfmx (20210318)',
 '/CreationDate': "D:20210702212123+07'00'"}

In [5]:
[s for s in dir(pdf) if not s.startswith("_")]

['cacheGetIndirectObject',
 'cacheIndirectObject',
 'decrypt',
 'documentInfo',
 'flattenedPages',
 'getDestinationPageNumber',
 'getDocumentInfo',
 'getFields',
 'getFormTextFields',
 'getIsEncrypted',
 'getNamedDestinations',
 'getNumPages',
 'getObject',
 'getOutlines',
 'getPage',
 'getPageLayout',
 'getPageMode',
 'getPageNumber',
 'getXmpMetadata',
 'isEncrypted',
 'namedDestinations',
 'numPages',
 'outlines',
 'pageLayout',
 'pageMode',
 'pages',
 'read',
 'readNextEndLine',
 'readObjectHeader',
 'resolvedObjects',
 'stream',
 'strict',
 'trailer',
 'xmpMetadata',
 'xref',
 'xrefIndex',
 'xref_objStm']

The first few cells above follow the Real Python tutorial, but only after reading its first few paragraphs did I find out that people seem not to recommend `PyPDF2` for extracting the content of a PDF file; they suggest `pdfminer` (or `pdfminer.six`) instead.

I chose `pdfminer.six` (`pip install pdfminer.six`) because it appears to be an actively maintained fork of `pdfminer`, whereas `pdfminer` itself appears to be unmaintained.
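
Both distributions are imported as `pdfminer`, so `import pdfminer` alone does not say which one is installed. A small sketch for checking this with the standard library's `importlib.metadata` (Python 3.8+); the helper name `installed_version` is made up for this note.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Look up the distribution names, not the import name.
for name in ("pdfminer", "pdfminer.six"):
    print(name, installed_version(name))
```

If only `pdfminer.six` reports a version, the `pdfminer` module imported in the next cell comes from the fork.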

In [6]:
import pdfminer
dir(pdfminer)

['__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '__version__',
 'sys',

In [7]:
from pdfminer.high_level import extract_text
text = extract_text(path_pdf)
text



In [8]:
print(text)

pdf2ipynb

July 2, 2021

1 Objective

As a good example of extracting information from PDF files, we set out to convert a Jupyter
notebook in PDF form to its original .ipynb form.

• The n_pages assignment should be kept in the with context; otherwise, one’d get a

I don’t know if this is hard. Let’s get started.

[1]: from pathlib import Path

from PyPDF2 import PdfFileReader

N.B.

ValueError: seek of closed file

– same as pdf.getPage()

[2]: path_pdf = Path("pdf2ipynb.pdf")

with open(path_pdf, "rb") as f:
pdf = PdfFileReader(f)
n_pages = pdf.getNumPages()
page00 = pdf.getPage(0)
print(page00)

pdf

[3]: n_pages

[3]: 8

[4]: info = pdf.getDocumentInfo()

info

[4]: {'/Creator': 'LaTeX with hyperref',

'/Producer': 'xdvipdfmx (20210318)',
'/CreationDate': "D:20210702211923+07'00'"}

1

{'/Resources': IndirectObject(21, 0), '/Type': '/Page', '/Parent':
IndirectObject(69, 0), '/Contents': [IndirectObject(20, 0)], '/MediaBox': [0, 0,
612, 792]}

[2]: <PyPDF2.pdf.PdfFileReader at 0x7f0
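
The extracted text marks each input cell with a `[n]:` prompt and reuses the same prompt for that cell's output. A rough sketch of how those prompts might be used to rebuild code cells is below. It is a heuristic under several assumptions, not a finished converter: the first occurrence of a prompt number is treated as input and later ones as output, blank lines are dropped, prose that follows a cell is not separated from it, and the indentation lost in extraction is not restored. The names `split_cells` and `to_notebook` are invented for this note, and the notebook dictionary only follows the basic nbformat 4 layout.

```python
import json
import re

PROMPT = re.compile(r"^\[(\d+)\]:\s?(.*)$")


def split_cells(text):
    """Group lines of extracted text by their `[n]:` prompt."""
    cells = {}
    target = None  # list that currently receives lines
    for line in text.splitlines():
        match = PROMPT.match(line)
        if match:
            number, rest = int(match.group(1)), match.group(2)
            if number not in cells:
                # First time this prompt appears: treat it as the input.
                cells[number] = {"source": [], "output": []}
                target = cells[number]["source"]
            else:
                # Same prompt again: treat it as the cell's output.
                target = cells[number]["output"]
            target.append(rest)
        elif target is not None and line.strip():
            target.append(line)
    return cells


def to_notebook(cells):
    """Build a minimal nbformat-4-style dict; recovered outputs are left out."""
    nb_cells = [
        {
            "cell_type": "code",
            "execution_count": number,
            "metadata": {},
            "outputs": [],
            "source": "\n".join(cells[number]["source"]),
        }
        for number in sorted(cells)
    ]
    return {"cells": nb_cells, "metadata": {}, "nbformat": 4, "nbformat_minor": 4}


sample = (
    "[1]: from pathlib import Path\n\n"
    "from PyPDF2 import PdfFileReader\n\n"
    "[3]: n_pages\n\n"
    "[3]: 8\n"
)
cells = split_cells(sample)
notebook = to_notebook(cells)
print(json.dumps(notebook, indent=1))
```

Writing the result with `json.dump` to a `.ipynb` file would give something Jupyter can open, though the cells would still need manual cleanup, most visibly the indentation inside the `with` block.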

## Ref.
- <https://realpython.com/pdf-python/>
- <https://pdfminersix.readthedocs.io/en/latest/tutorial/highlevel.html>