Add add_outlines method for PdfWriter #1048

fancywriter · 2022-07-01T20:56:19Z

I have tried PyPDF2 for some PDF manipulation and it works brilliantly!
However, my output file doesn't keep the table of contents.

Code Example

What I do is not much different from reading a file and writing it back with all pages and metadata, as in the "removing duplication" example here: https://pypdf2.readthedocs.io/en/latest/user/file-size.html#removing-duplication

How would your feature be used?

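from PyPDF2 import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()
# Proposed API: neither name exists in PyPDF2 yet
writer.add_table_of_contents(reader.table_of_contents)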

MartinThoma · 2022-07-05T20:04:11Z

There are a couple of things PyPDF2 knows which could be what you're thinking of:

outlines: reader.outlines
named_destinations: reader.named_destinations and writer.add_named_destination
bookmarks: writer.add_bookmark

I'm uncertain how those three relate to each other / what you're looking for. Can you give it a try with an example document and share your knowledge?

fancywriter · 2022-07-06T10:52:27Z

@MartinThoma thanks for the reply.
I have managed to solve my problem with pdfbox in Java. It lets you read a document from a file, modify it in memory, and later save the same document to another file. All metadata, including the table of contents, is kept.

It would probably still be nice to make this possible with PyPDF.
Off-topic question: would it be possible to have a "document" abstraction that allows manipulations like the following?

document = reader.read("input.pdf")
# modify document, add/resize/remove pages or whatever
writer.write(document, "output.pdf")

?

The example of PDF with table of contents
Sections_and_Chapters.pdf

If I do "read all pages and write them to another file", the table of contents disappears.

Yes, it looks like "outlines" is what I was looking for. Let me try... It's probably worth updating the documentation page if it works that way.

MartinThoma · 2022-07-06T10:57:19Z

Please open a discussion / other issues for other topics - many people might read the issues; we need to keep it focused.

fancywriter · 2022-07-07T13:07:00Z

@MartinThoma OK, please ignore the comment about document, it's different.
I have checked, and outlines are indeed what I was looking for. reader.outlines returns exactly that:

>>> reader.outlines
[{'/Title': 'Unnumbered Section', '/Page': IndirectObject(9, 0), '/Type': '/XYZ', '/Left': 133.768, '/Top': 451.62, '/Zoom': <PyPDF2.generic.NullObject object at 0x7f5057bbaf20>}, {'/Title': 'Unnumbered Section', '/Page': IndirectObject(9, 0), '/Type': '/XYZ', '/Left': 133.768, '/Top': 451.62, '/Zoom': <PyPDF2.generic.NullObject object at 0x7f5057bbaf20>}, {'/Title': 'Second Section', '/Page': IndirectObject(1, 0), '/Type': '/XYZ', '/Left': 133.768, '/Top': 628.33, '/Zoom': <PyPDF2.generic.NullObject object at 0x7f5057bbafe0>}]

However I can't find any addOutlines method for PdfWriter.
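
For anyone inspecting the same structure, here is a minimal sketch of walking the value that reader.outlines returns. It assumes the documented PyPDF2 layout, where the result is a list and a nested list holds the children of the item just before it. The sample data is made up (plain dicts standing in for PyPDF2 objects, and "A Subsection" is a hypothetical entry that is not in the PDF above); iter_outline is an illustrative helper, not a PyPDF2 function.

```python
def iter_outline(outline, depth=0):
    """Yield (depth, title) for every item in a nested outline list."""
    for entry in outline:
        if isinstance(entry, list):
            # A nested list holds the children of the item just before it.
            yield from iter_outline(entry, depth + 1)
        else:
            yield depth, entry["/Title"]


# Stand-in data shaped like the output above (plain dicts, not PyPDF2 objects).
# "A Subsection" is hypothetical and only there to show nesting.
sample = [
    {"/Title": "Unnumbered Section"},
    [{"/Title": "A Subsection"}],
    {"/Title": "Second Section"},
]

flat = list(iter_outline(sample))
for depth, title in flat:
    print("  " * depth + title)
```

With a real file, passing reader.outlines instead of sample should work the same way, since the entries shown above are dict-like and carry a '/Title' key.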

MartinThoma · 2022-07-07T19:17:59Z

You can add bookmarks:

from PyPDF2 import PdfReader, PdfWriter


def add_outlines_as_bookmark(writer, outline):
    if not isinstance(outline, list):
        page_nb = 0  # I don't know how to get the page number
        writer.add_bookmark(outline.title, page_nb, None)
    else:
        for o in outline:
            add_outlines_as_bookmark(writer, o)


if __name__ == "__main__":
    reader = PdfReader("example.pdf")

    writer = PdfWriter()
    for page in reader.pages:
        writer.add_page(page)

    for outlines in reader.outlines:
        add_outlines_as_bookmark(writer, outlines)

    with open("out.pdf", "wb") as fp:
        writer.write(fp)

Are bookmarks different from outlines?

I agree that something like writer.add_outlines(reader.outlines) should be possible; I'm just wondering how things relate to each other and how much work it would be to add this feature.
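
A rough sketch of what such a copy could look like while keeping the nesting. It is written against a callback so it runs without a PDF. copy_outline and add_item are hypothetical names, not PyPDF2 API. The comment at the bottom shows one way it might be wired to PyPDF2 2.x, assuming reader.get_destination_page_number resolves the page index and writer.add_bookmark returns a handle usable as parent; that wiring is untested here.

```python
def copy_outline(outline, add_item, parent=None):
    """Recreate a nested outline by calling add_item(entry, parent) -> handle."""
    last = parent
    for entry in outline:
        if isinstance(entry, list):
            # Children belong to the most recently added item at this level.
            copy_outline(entry, add_item, parent=last)
        else:
            last = add_item(entry, parent)


# Demo with a recorder instead of a real PdfWriter.
created = []


def record(entry, parent):
    created.append((entry["/Title"], parent))
    return entry["/Title"]  # the title doubles as a fake parent handle


copy_outline(
    [{"/Title": "Part"}, [{"/Title": "Chapter"}], {"/Title": "Appendix"}],
    record,
)
print(created)

# Possible PyPDF2 2.x wiring (assumption, not verified against a real file):
#   copy_outline(
#       reader.outlines,
#       lambda e, parent: writer.add_bookmark(
#           e["/Title"], reader.get_destination_page_number(e), parent
#       ),
#   )
```

Threading the parent handle through the recursion is what preserves the hierarchy; the version above that always passes None flattens every entry to the top level.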

mtd91429 · 2022-07-11T21:47:04Z

> Are bookmarks different from outlines?

The PDF Reference uses the term "Outline" but recognizes "Bookmarks" as a synonymous term. From PDF Reference version 1.6 page 554 (section 8.2.2):

> The outline consists of a tree-structured hierarchy of outline items (sometimes called bookmarks), which serve as a visual table of contents to display the document’s structure to the user.

Something that seems to be contributing to the confusion is that within the PdfReader object, they are referred to as outlines; within the PdfWriter object, they are referred to as both bookmarks (add_bookmark, add_bookmark_dict, add_bookmark_destination) and outlines (get_outline_root).

Further contributing to the confusion, most PDF reference software refers to these objects as "Bookmarks". For example, a screenshot from Adobe Acrobat:

I've created a new issue (#1098) concerning this problem.

MartinThoma · 2022-07-29T06:48:15Z

See #1156

fancywriter assigned MartinThoma Jul 1, 2022

MartinThoma changed the title ~~Extract table of contents from PDF~~ Extract table of contents (outline) from PDF Jul 5, 2022

fancywriter changed the title ~~Extract table of contents (outline) from PDF~~ Add add_outlines method for PdfWriter Jul 7, 2022

MartinThoma closed this as completed in 8c532a0 Jul 30, 2022
