#17 in arch linux #120

Closed
dvtate opened this issue Jan 2, 2024 · 9 comments

Comments


dvtate commented Jan 2, 2024

Same issue as #17 with the same file, on Arch Linux with poppler 23.12.0-1 and pdftotext 2.2.2-4.

I'm guessing the python bindings are broken somehow.

Author

dvtate commented Jan 2, 2024

#17 - link

Author

dvtate commented Jan 2, 2024

Using poppler directly via the pdftotext command-line tool works fine, so I'll just do that for now.
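
A minimal sketch of that workaround, assuming poppler's `pdftotext` command-line tool is installed and on the PATH. The helper names `build_command` and `extract_text` are hypothetical and only for illustration; passing `-` as the output file asks the tool to write the text to stdout.

```python
import subprocess


def build_command(pdf_path):
    # "-" as the output argument makes poppler's pdftotext write to stdout.
    return ["pdftotext", pdf_path, "-"]


def extract_text(pdf_path):
    # Assumes the pdftotext binary is on PATH; raises CalledProcessError
    # if the tool exits with a non-zero status.
    result = subprocess.run(
        build_command(pdf_path), capture_output=True, check=True
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Calling `extract_text("test.pdf")` would return the whole document as one string, rather than the per-page list that the Python bindings expose.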

Owner

jalan commented Jan 2, 2024

Can you attach the file here that you claim doesn't work?

I have nothing to do with arch linux, by the way.

Author

dvtate commented Jan 3, 2024

it's the same one from #17 -- https://arxiv.org/pdf/1004.5293.pdf

No worries, I'd reach out to the package maintainer but I'm not sure how to get their email address

Author

dvtate commented Jan 3, 2024

it actually doesn't work with any pdf file for me

Owner

jalan commented Jan 3, 2024

I don't use arch, so maybe I got something wrong here, but it seems to work:

$ docker run -it archlinux:base
[root@0366535679bd /]# pacman -Sy python-pdftotext
[output cut]
[root@0366535679bd /]# curl --silent --output test.pdf https://arxiv.org/pdf/1004.5293.pdf
[root@0366535679bd /]# python
Python 3.11.6 (main, Nov 14 2023, 09:36:21) [GCC 13.2.1 20230801] on linux
>>> import pdftotext
>>> f = open("test.pdf", "rb")
>>> pdf = pdftotext.PDF(f)
>>> len(pdf)
34
>>> pdf[0][:100]
'EPJ manuscript No.\n(will be inserted by the editor)\n\narXiv:1004.5293v2 [physics.ins-det] 7 Jun 2010\n'
>>> 

Author

dvtate commented Jan 3, 2024

Interesting, it seems to work fine when the file is opened with rb but not with wb+. Regardless, I think I can make this work, thanks!

dvtate closed this as completed Jan 3, 2024
Owner

jalan commented Jan 3, 2024

wb+ truncates the file, so of course that doesn't work
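
The truncation can be seen with the standard library alone. This is a small sketch using a throwaway temporary file with placeholder bytes (not a real PDF); the helper names `size_after_open` and `demo` are hypothetical.

```python
import os
import tempfile


def size_after_open(path, mode):
    # Open and immediately close the file, then report its size on disk.
    with open(path, mode):
        pass
    return os.path.getsize(path)


def demo():
    fd, path = tempfile.mkstemp(suffix=".pdf")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"%PDF-1.4 placeholder bytes")
        before = os.path.getsize(path)
        after_rb = size_after_open(path, "rb")        # read-only: size unchanged
        after_wb_plus = size_after_open(path, "wb+")  # "w" modes truncate to 0 bytes
        return before, after_rb, after_wb_plus
    finally:
        os.remove(path)
```

Opening an existing file with `rb` leaves its contents alone, while `wb+` empties it before anything can be read, so there is nothing left to parse.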

Author

dvtate commented Jan 3, 2024

I started with file.write(contents) which works with pypdf
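
One possible reason for the difference, offered as an assumption rather than something confirmed in this thread: after `file.write(contents)` the file position sits at the end of the data, so a consumer that simply calls `read()` on the object gets nothing back unless the file is rewound first. The sketch below shows that behavior with the standard library only; `read_back` is a hypothetical helper.

```python
import tempfile


def read_back(contents, rewind):
    # Write bytes to a fresh read/write temporary file, then read from it.
    with tempfile.TemporaryFile("w+b") as f:
        f.write(contents)
        if rewind:
            f.seek(0)  # move back to the start before reading
        return f.read()
```

If that is what happened here, calling `file.seek(0)` after writing and before handing the file object to `pdftotext.PDF` would be worth trying.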
