Add add_outlines method for PdfWriter #1048

Closed
fancywriter opened this issue Jul 1, 2022 · 7 comments

@fancywriter

fancywriter commented Jul 1, 2022

I have tried PyPDF2 for some PDF manipulation and it works brilliantly!
However, my output file doesn't keep the table of contents.

Code Example

What I do is not much different from reading a file and writing it back with all pages and metadata, as in the duplication-removal example here: https://pypdf2.readthedocs.io/en/latest/user/file-size.html#removing-duplication

How would your feature be used?

from PyPDF2 import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()
# Proposed API, does not exist yet:
writer.add_table_of_contents(reader.table_of_contents)
@MartinThoma
Member

MartinThoma commented Jul 5, 2022

There are a couple of things PyPDF2 knows which could be what you're thinking of:

I'm uncertain how those three relate to each other / what you're looking for. Can you give it a try with an example document and share your knowledge?

MartinThoma changed the title from "Extract table of contents from PDF" to "Extract table of contents (outline) from PDF" on Jul 5, 2022
@fancywriter
Author

fancywriter commented Jul 6, 2022

@MartinThoma thanks for the reply.
I have managed to solve my problem with pdfbox in Java. It lets you read a document from a file, modify it in memory, and later save the same document to another file. All metadata, including the table of contents, is kept.

It would probably still be nice to make this possible with PyPDF.
Off-topic question: would it be possible to have a "document" abstraction that allows manipulations like the following?

document = reader.read("input.pdf")
# modify document, add/resize/remove pages or whatever
writer.write(document, "output.pdf")

?

The example of PDF with table of contents
Sections_and_Chapters.pdf

If I do "read all pages and write them to another file", the table of contents disappears.

Yes, it looks like "outlines" is what I was looking for. Let me try... It's probably worth updating the documentation page if it works that way.

@MartinThoma
Member

MartinThoma commented Jul 6, 2022

Please open a discussion / other issues for other topics - many people might read the issues; we need to keep it focused.

@fancywriter
Author

@MartinThoma OK, please ignore the comment about the document abstraction; it's a different topic.
I have checked, and outlines are indeed what I was looking for: reader.outlines returns the table of contents.

>>> reader.outlines
[{'/Title': 'Unnumbered Section', '/Page': IndirectObject(9, 0), '/Type': '/XYZ', '/Left': 133.768, '/Top': 451.62, '/Zoom': <PyPDF2.generic.NullObject object at 0x7f5057bbaf20>}, {'/Title': 'Unnumbered Section', '/Page': IndirectObject(9, 0), '/Type': '/XYZ', '/Left': 133.768, '/Top': 451.62, '/Zoom': <PyPDF2.generic.NullObject object at 0x7f5057bbaf20>}, {'/Title': 'Second Section', '/Page': IndirectObject(1, 0), '/Type': '/XYZ', '/Left': 133.768, '/Top': 628.33, '/Zoom': <PyPDF2.generic.NullObject object at 0x7f5057bbafe0>}]
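The list above is flat, but for documents with subsections the outline list can contain nested lists holding the children of the preceding item. A minimal sketch of walking such a structure, using plain dicts as stand-ins for the outline entries (the helper name `flatten_outlines` and the "A subsection" entry are hypothetical, added only for illustration):

```python
def flatten_outlines(outlines, depth=0):
    """Yield (depth, title) pairs from a possibly nested outline list."""
    for item in outlines:
        if isinstance(item, list):
            # A nested list holds the children of the item just before it.
            yield from flatten_outlines(item, depth + 1)
        else:
            yield depth, item["/Title"]


# Stand-in data shaped like the reader.outlines output above.
sample = [
    {"/Title": "Unnumbered Section"},
    [{"/Title": "A subsection"}],  # hypothetical nested entry
    {"/Title": "Second Section"},
]

for depth, title in flatten_outlines(sample):
    print("  " * depth + title)
```

Since the real entries are dict-like, the same function should also accept the value of reader.outlines directly, though that has not been verified here.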

However, I can't find any addOutlines method for PdfWriter.

fancywriter changed the title from "Extract table of contents (outline) from PDF" to "Add add_outlines method for PdfWriter" on Jul 7, 2022
@MartinThoma
Member

You can add bookmarks:

from PyPDF2 import PdfReader, PdfWriter


def add_outlines_as_bookmark(writer, outline):
    """Recursively add every outline item to the writer as a bookmark."""
    if not isinstance(outline, list):
        page_nb = 0  # I don't know how to get the page number
        writer.add_bookmark(outline.title, page_nb, None)
    else:
        # A nested list holds the children of an outline item.
        for o in outline:
            add_outlines_as_bookmark(writer, o)


if __name__ == "__main__":
    reader = PdfReader("example.pdf")

    writer = PdfWriter()
    for page in reader.pages:
        writer.add_page(page)

    for outline in reader.outlines:
        add_outlines_as_bookmark(writer, outline)

    with open("out.pdf", "wb") as fp:
        writer.write(fp)

Are bookmarks different from outlines?

I agree that something like writer.add_outlines(reader.outlines) should be possible; I'm just wondering how things relate to each other and how much work it would be to add this feature.
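A minimal sketch of how the page number and the nesting might be carried over, assuming the PyPDF2 2.x API in use in this thread: `PdfReader.get_destination_page_number` to resolve each item's page, and the return value of `PdfWriter.add_bookmark` as the parent for nested items. `copy_outlines` and `main` are hypothetical helper names, not part of PyPDF2, and the sketch has not been checked against every outline layout.

```python
def copy_outlines(reader, writer, outlines, parent=None):
    """Recreate a nested outline list from `reader` as bookmarks on `writer`.

    Hypothetical helper, not part of PyPDF2. Assumes the pages were added
    to the writer in their original order, so page numbers still match.
    """
    last = parent
    for item in outlines:
        if isinstance(item, list):
            # A nested list holds the children of the item just before it.
            copy_outlines(reader, writer, item, parent=last)
        else:
            page_nb = reader.get_destination_page_number(item)
            last = writer.add_bookmark(item.title, page_nb, parent)


def main(src="example.pdf", dst="out.pdf"):
    # Imported here so the helper above can be read and tested on its own.
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(src)
    writer = PdfWriter()
    for page in reader.pages:
        writer.add_page(page)
    copy_outlines(reader, writer, reader.outlines)
    with open(dst, "wb") as fp:
        writer.write(fp)
```

The helper only relies on two methods of the reader and writer, so it can be exercised with simple stand-in objects before being pointed at a real file via `main()`.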

@mtd91429
Contributor

> Are bookmarks different from outlines?

The PDF Reference uses the term "Outline" but recognizes "Bookmarks" as a synonymous term. From PDF Reference version 1.6 page 554 (section 8.2.2):

> The outline consists of a tree-structured hierarchy of outline items (sometimes called bookmarks), which serve as a visual table of contents to display the document’s structure to the user.

Something that seems to be contributing to the confusion is that within the PdfReader object, they are referred to as outlines; within the PdfWriter object, they are referred to as both bookmarks (add_bookmark, add_bookmark_dict, add_bookmark_destination) and outlines (get_outline_root).

Further contributing to the confusion, most PDF reference software refers to these objects as "Bookmarks". For example, a screenshot from Adobe Acrobat:

[image: screenshot from Adobe Acrobat]

I've created a new issue (#1098) concerning this problem.

@MartinThoma
Member

See #1156
