Add support for editing markup in PDFs #6

Open
widdowquinn opened this issue Jun 18, 2020 · 2 comments
Labels
enhancement New feature or request

Comments

@widdowquinn

PDFs also support editing markup such as highlighting, strikethroughs, and text insertion. These could also be captured and would be useful, for example, when providing lists of corrections for large documents such as PhD theses.

(the code below assumes that the offset suggested in #5 is also implemented)

SEVERITY_NAMES = {0: "Minor comments", 1: "Major comments", 2: "Edits"}

# Edits to be reported
EDITS = ("Cross-Out", "Inserted Text")

# Yields (edit type, text) pairs, so the return annotation needs Tuple from typing.
def iter_edit_contents(page: PageObject) -> Iterator[Tuple[str, str]]:
    try:
        edit_indirects = page["/Annots"]
    except KeyError:
        return

    for edit_indirect in edit_indirects:
        edit = edit_indirect.getObject()

        try:
            if edit["/Subj"] in EDITS:
                if edit["/Subj"] == "Cross-Out":
                    yield edit["/Subj"], "-"
                else:
                    yield edit["/Subj"], edit["/Contents"]
        except KeyError:
            continue


def load_comments(filename: str, offset: int) -> SeverityDict:
    res: SeverityDict = defaultdict(list)

    reader = PdfFileReader(filename, STRICT)
    for page_num, page in enumerate(reader.pages, 1):
        for contents in iter_annot_contents(page):
            m_stars = re_stars.match(contents)
            assert m_stars is not None  # should always match

            stars = m_stars["stars"]
            comment = m_stars["comment"]

            # number of stars
            severity = len(stars)

            res[severity].append(f"p{page_num - offset}: {comment}")
        for edit_type, edit in iter_edit_contents(page):
            res[2].append(f"p{page_num - offset}: {edit_type} ({edit})")

    return res
@michaelmhoffman michaelmhoffman added the enhancement New feature or request label Jun 20, 2020
@michaelmhoffman
Contributor

Something like this could work. "Edits" aren't really a severity level though, and certainly shouldn't have higher severity than "Major comments". It would be better to leave them at level 0—if a user wants non-edits to have a different severity level they can add extra asterisks.

For insertions, the comment should be: Insert "text"

I take it that one cannot easily identify the text corresponding to a strikethrough? In a long-forgotten version of pdfcomments I had [ink] as the comment text to remind myself to look there for comments scribbled with a stylus. You could use [deletion] for strikeout annotations.
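
A minimal sketch of the labelling suggested above, assuming the `/Subj` values from the original proposal ("Cross-Out" and "Inserted Text"). `format_edit` and the two constants are hypothetical names for illustration, not part of pdfcomments:

```python
from typing import Optional

# /Subj values assumed from the proposal above; other PDF editors may use different strings.
DELETION_SUBJECT = "Cross-Out"
INSERTION_SUBJECT = "Inserted Text"


def format_edit(subject: str, contents: Optional[str] = None) -> Optional[str]:
    """Return comment text for an editing annotation, or None if it is not an edit."""
    if subject == DELETION_SUBJECT:
        # The struck-out text is not assumed to be recoverable, so use a placeholder.
        return "[deletion]"
    if subject == INSERTION_SUBJECT:
        # Quote the inserted text; fall back to an empty string if none was given.
        return f'Insert "{contents or ""}"'
    return None
```

The returned string could then be appended at level 0 (for example `res[0].append(f"p{page_num - offset}: {text}")`) rather than at a separate "Edits" level.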

A test case for the system added in #7 would be necessary.

@michaelmhoffman
Contributor

PyPDF2 documentation now has extensive examples for how to deal with various kinds of annotations https://pypdf2.readthedocs.io/en/3.0.0/user/reading-pdf-annotations.html
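
For reference, a hedged sketch of filtering on the annotation `/Subtype` rather than `/Subj`, loosely following the pattern in those docs. It assumes strikethroughs are stored as `/StrikeOut` and text insertions as `/Caret`, which may vary by PDF editor. `iter_markup` and `MARKUP_SUBTYPES` are hypothetical names. The function accepts any iterable of page-like mappings, so it can be tried without a PDF:

```python
from typing import Any, Iterable, Iterator, Mapping, Tuple

# Subtypes treated as editing markup here; this selection is an assumption.
MARKUP_SUBTYPES = ("/Highlight", "/StrikeOut", "/Caret")


def iter_markup(pages: Iterable[Mapping[str, Any]]) -> Iterator[Tuple[int, str, str]]:
    """Yield (page number, subtype, contents) for each editing-markup annotation."""
    for page_num, page in enumerate(pages, 1):
        if "/Annots" not in page:
            continue
        for annot in page["/Annots"]:
            # Annotations are usually indirect references; resolve them if so.
            obj = annot.get_object() if hasattr(annot, "get_object") else annot
            subtype = obj["/Subtype"] if "/Subtype" in obj else None
            if subtype in MARKUP_SUBTYPES:
                contents = obj["/Contents"] if "/Contents" in obj else ""
                yield page_num, str(subtype), str(contents)
```

With PyPDF2 3.0 this would be called as `iter_markup(PdfReader("thesis.pdf").pages)`, where the filename is only a placeholder.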
