Splitting a pdf into pages and joining again #193

Open
BjornFJohansson opened this issue Jan 30, 2024 · 0 comments

I made two scripts: one to split PDFs into single pages and one to join PDFs. The join script fails to join the pages that the split script produces.
Tested on the attached PDF, 2019_ReferenceWorkEntry_.pdf.

What am I doing wrong?

  Cell In[23], line 4
    pdf = PDF.loads(pdf_file_handle)

  File ~/miniforge3/envs/bjorn311/lib/python3.11/site-packages/borb/pdf/pdf.py:83 in loads
    document: Document = ReadAnyObjectTransformer().transform(

...

  File ~/miniforge3/envs/bjorn311/lib/python3.11/site-packages/borb/io/read/tokenize/high_level_tokenizer.py:144 in read_indirect_object
    value = self.read_object()

  File ~/miniforge3/envs/bjorn311/lib/python3.11/site-packages/borb/io/read/tokenize/high_level_tokenizer.py:206 in read_object
    return self.read_dictionary()

  File ~/miniforge3/envs/bjorn311/lib/python3.11/site-packages/borb/io/read/tokenize/high_level_tokenizer.py:94 in read_dictionary
    assert token.get_token_type() == TokenType.NAME

AssertionError
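
To narrow down which of the split files triggers the AssertionError above, a small diagnostic loop can help. This is only a sketch: `find_unreadable` is a hypothetical helper name, and the loader is passed in as a parameter so that the function itself does not depend on any particular library. In the context of this issue the loader would be `PDF.loads` from the scripts below.

```python
from pathlib import Path
from typing import BinaryIO, Callable, Iterable, List, Tuple


def find_unreadable(
    paths: Iterable[Path], load: Callable[[BinaryIO], object]
) -> List[Tuple[Path, str]]:
    """Try to load every file and report the ones whose loader raises.

    `load` is any callable that accepts a binary file handle
    (assumption: in this issue it would be borb's PDF.loads).
    """
    failures = []
    for path in paths:
        try:
            with open(path, "rb") as handle:
                load(handle)
        except Exception as exc:  # AssertionError is a subclass of Exception
            failures.append((Path(path), type(exc).__name__))
    return failures
```

Calling `find_unreadable(sorted(Path(".").glob("*.pdf")), PDF.loads)` would then list each file that cannot be read back, together with the exception type, instead of stopping at the first failure.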

split:

#!/home/bjorn/miniforge3/envs/bjorn311/bin/python3
# -*- coding: utf-8 -*-
import sys
from pathlib import Path
from borb.pdf.document.document import Document
from borb.pdf.pdf import PDF
from tqdm import tqdm

script, *cliarg = sys.argv
# Test file: https://github.com/jorisschellekens/borb/files/14094251/2019_ReferenceWorkEntry_.pdf
pdfpaths = [Path(p) for p in cliarg] or sorted(Path(".").glob("*.pdf"))

for pdfpath in tqdm(pdfpaths):

    fn = pdfpath.stem

    with open(pdfpath, "rb") as pdf_file_handle:
        pdf = PDF.loads(pdf_file_handle)

    number_of_pages = int(pdf.get_document_info().get_number_of_pages())

    for i in range(number_of_pages):
        print(i)
        outpdf = Document()
        outpdf.add_page(pdf.get_page(i))
        with open(f"{fn}_{i:03d}.pdf", "wb") as pdf_out_handle:
            PDF.dumps(pdf_out_handle, outpdf)

join:

#!/home/bjorn/miniforge3/envs/bjorn311/bin/python3
# -*- coding: utf-8 -*-
# https://pdfstandalone.com/en/merge-pdf
import sys
from pathlib import Path
from borb.pdf.document.document import Document
from borb.pdf.pdf import PDF
from tqdm import tqdm

script, *cliarg = sys.argv
pdfpaths = [Path(p) for p in cliarg] or sorted(Path(".").glob("*.pdf"))
output_document = Document()
outpath = Path("output.pdf")

try:
    pdfpaths.remove(outpath)
except ValueError:
    pass

for pdfpath in tqdm(pdfpaths):

    with open(pdfpath, "rb") as pdf_file_handle:
        pdf = PDF.loads(pdf_file_handle)
        output_document.add_document(pdf)

with open(outpath, "wb") as pdf_out_handle:
    PDF.dumps(pdf_out_handle, output_document)
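
One thing to watch when both scripts run in the same directory without arguments: the join script globs every `*.pdf` except `output.pdf`, so the original source file would be merged along with its split pages. A minimal sketch for selecting only the page files, assuming the `<stem>_<NNN>.pdf` naming used by the split script (`split_pages` is a hypothetical helper, not part of borb):

```python
import re
from pathlib import Path
from typing import List


def split_pages(directory: Path, stem: str) -> List[Path]:
    """Return files named '<stem>_<digits>.pdf' in numeric page order."""
    pattern = re.compile(re.escape(stem) + r"_(\d+)\.pdf")
    numbered = []
    for path in directory.glob("*.pdf"):
        match = pattern.fullmatch(path.name)
        if match:
            # Sort by the integer page index, not by the file name string.
            numbered.append((int(match.group(1)), path))
    return [path for _, path in sorted(numbered)]
```

Sorting by the parsed integer keeps the page order correct even if a document has more than 999 pages and the three-digit zero padding no longer lines the names up.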