Invalid cross-reference (XRef) table encountered while processing a PDF file #27

Open
yoyicue opened this issue Apr 11, 2023 · 0 comments
yoyicue commented Apr 11, 2023

Parsing this optimized PDF raises an error, although DeepL handles the same file without problems.
https://assets.ctfassets.net/95kuvdv8zn1v/44FqPJmYPZRwiZN2socdOK/14f5eb025d87a452100d80f513567f2a/Cruise_Impact_Report_-_2022-optimized.pdf

Converting PDF to text:   0% 0/10 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/pdfminer/pdfdocument.py", line 722, in __init__
    self.read_xref_from(parser, pos, self.xrefs)
  File "/usr/local/lib/python3.9/dist-packages/pdfminer/pdfdocument.py", line 1000, in read_xref_from
    xref.load(parser)
  File "/usr/local/lib/python3.9/dist-packages/pdfminer/pdfdocument.py", line 282, in load
    raise PDFNoValidXRef("Invalid PDF stream spec.")
pdfminer.pdfdocument.PDFNoValidXRef: Invalid PDF stream spec.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/content/drive/MyDrive/ebook-GPT-translator/text_translation.py", line 347, in <module>
    text = convert_pdf_to_text(filename,startpage,endpage)
  File "/content/drive/MyDrive/ebook-GPT-translator/text_translation.py", line 221, in convert_pdf_to_text
    end_page = get_total_pages(pdf_filename)
  File "/content/drive/MyDrive/ebook-GPT-translator/text_translation.py", line 217, in get_total_pages
    document = PDFDocument(parser)
  File "/usr/local/lib/python3.9/dist-packages/pdfminer/pdfdocument.py", line 727, in __init__
    newxref.load(parser)
  File "/usr/local/lib/python3.9/dist-packages/pdfminer/pdfdocument.py", line 241, in load
    (_, obj) = parser.nextobject()
  File "/usr/local/lib/python3.9/dist-packages/pdfminer/psparser.py", line 609, in nextobject
    (pos, token) = self.nexttoken()
  File "/usr/local/lib/python3.9/dist-packages/pdfminer/psparser.py", line 526, in nexttoken
    self.fillbuf()
  File "/usr/local/lib/python3.9/dist-packages/pdfminer/psparser.py", line 239, in fillbuf
    raise PSEOF("Unexpected EOF")
pdfminer.psparser.PSEOF: Unexpected EOF
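
To narrow down where the XRef lookup goes wrong, it may help to check where the file's trailing `startxref` pointer actually leads. The following is a minimal sketch using only the standard library, not pdfminer's own logic: `inspect_startxref` and the keys of its result are hypothetical names chosen for illustration. It only relies on the PDF file structure, in which the last `startxref` entry gives the byte offset of either a classic `xref` table or an `N G obj` cross-reference stream.

```python
import re


def inspect_startxref(data: bytes) -> dict:
    """Report where the last startxref pointer of a PDF leads (hypothetical helper)."""
    # The startxref keyword sits near the end of the file, so scan only the tail.
    tail = data[-2048:]
    match = None
    for match in re.finditer(rb"startxref\s+(\d+)", tail):
        pass  # keep the last occurrence
    if match is None:
        return {"offset": None, "in_range": False, "kind": "missing"}

    offset = int(match.group(1))
    if offset >= len(data):
        # The pointer leads past the end of the file.
        return {"offset": offset, "in_range": False, "kind": "out_of_range"}

    head = data[offset:offset + 32].lstrip()
    if head.startswith(b"xref"):
        kind = "table"      # classic cross-reference table
    elif re.match(rb"\d+\s+\d+\s+obj", head):
        kind = "stream"     # cross-reference stream object (PDF 1.5+)
    else:
        kind = "unknown"    # pointer does not land on an XRef section
    return {"offset": offset, "in_range": True, "kind": kind}
```

Reading the downloaded file with `open(path, "rb").read()` and passing the bytes to this function shows whether the offset is out of range or lands on something other than an XRef section. Either result would point to the file's XRef data rather than to `text_translation.py`.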