docinfo fails in threads #27

Closed
impredicative opened this issue Feb 20, 2019 · 4 comments

Comments

@impredicative

impredicative commented Feb 20, 2019

Accessing docinfo hangs when it is done in a thread. The following code demonstrates the problem.

from concurrent.futures import ThreadPoolExecutor
from io import BytesIO
from urllib.request import urlopen
import sys
import threading

import pikepdf

print(f'sys.version = {sys.version.replace(chr(10), "")}')
print(f'pikepdf.__version__ = {pikepdf.__version__}')
print(f'pikepdf.libqpdf_version__ = {pikepdf.__libqpdf_version__}')

pdf_bytes = urlopen('https://www.fda.gov/downloads/drugs/guidances/ucm353925.pdf').read()


def get_docinfo(pdf_bytes):
    thread_name = threading.current_thread().name
    pdf = pikepdf.open(BytesIO(pdf_bytes))
    print(f'{thread_name}: got pdf {pdf}')
    docinfo = pdf.docinfo  # GETS STUCK HERE IN THREAD.
    print(f'{thread_name}: got docinfo')
    docinfo = dict(docinfo)
    return docinfo


local_docinfo = get_docinfo(pdf_bytes)

executor = ThreadPoolExecutor(max_workers=1)
threaded_docinfos = list(executor.map(get_docinfo, [pdf_bytes]))

print('Finished.')

The output is:

sys.version = 3.7.2 (default, Dec 25 2018, 03:50:46) [GCC 7.3.0]
pikepdf.__version__ = 1.0.5
pikepdf.libqpdf_version__ = 8.3.0
MainThread: got pdf <pikepdf.Pdf description='<_io.BytesIO object at 0x7ff01c6b38e0>'>
MainThread: got docinfo
ThreadPoolExecutor-0_0: got pdf <pikepdf.Pdf description='<_io.BytesIO object at 0x7ff01c6b38e0>'>

It then gets stuck: the worker thread never prints "got docinfo", and the script never reaches "Finished."
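
For anyone reproducing this, a probe with a timeout can report the hang instead of blocking the script forever. The following is a minimal sketch, not part of the original report: `run_with_timeout` is a hypothetical helper, and it assumes the main thread can still run while the worker is stuck, which this issue does not confirm. It uses a daemon thread so that a stuck worker does not block interpreter exit.

```python
import threading


def run_with_timeout(func, arg, timeout_s=10.0):
    """Run func(arg) in a daemon thread and wait at most timeout_s seconds.

    Returns (finished, result). finished is False if the call was still
    running when the timeout expired; the thread is then left behind,
    because Python cannot forcibly stop a thread.
    """
    box = {}

    def target():
        try:
            box["result"] = func(arg)
        except Exception as exc:  # keep the error so the caller can see it
            box["error"] = exc

    worker = threading.Thread(target=target, name="probe", daemon=True)
    worker.start()
    worker.join(timeout_s)

    if worker.is_alive():
        return False, None
    if "error" in box:
        raise box["error"]
    return True, box["result"]
```

With the reproduction above, `run_with_timeout(get_docinfo, pdf_bytes)` would be expected to return `(False, None)` on an affected build, under the assumption stated in the lead-in.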

@jbarlow83
Member

jbarlow83 commented Feb 20, 2019

It does work with a ProcessPoolExecutor, given a slight modification that avoids marshalling pikepdf objects across process boundaries, which is currently not implemented (and very difficult to implement). I am still looking at the thread issue.

from concurrent.futures import ProcessPoolExecutor
from io import BytesIO
from urllib.request import urlopen
import sys
import threading

import pikepdf

print(f'sys.version = {sys.version.replace(chr(10), "")}')
print(f'pikepdf.__version__ = {pikepdf.__version__}')
print(f'pikepdf.libqpdf_version__ = {pikepdf.__libqpdf_version__}')

pdf_bytes = urlopen('https://www.fda.gov/downloads/drugs/guidances/ucm353925.pdf').read()


def get_docinfo(pdf_bytes):
    thread_name = threading.current_thread().name
    pdf = pikepdf.open(BytesIO(pdf_bytes))
    print(f'{thread_name}: got pdf {pdf}')
    docinfo = pdf.docinfo  # Hangs in a thread; works in a child process.
    print(f'{thread_name}: got docinfo')
    # Convert values to str so the result can be sent back to the parent process.
    docinfo = {k: str(v) for k, v in dict(docinfo).items()}
    print(f'{docinfo}')
    return docinfo


local_docinfo = get_docinfo(pdf_bytes)

executor = ProcessPoolExecutor(max_workers=1)
threaded_docinfos = list(executor.map(get_docinfo, [pdf_bytes]))

print('Finished.')

@jbarlow83
Member

It looks like the issue is in pybind11; it is fixed in master but not yet in a release build.
pybind/pybind11@e2b884c

Essentially, there are problems in pybind11 2.2.4 when a thread tries to acquire the GIL, which is necessary here. If you apply that patch and build pikepdf against a local copy of pybind11, it should resolve the issue. I will wait for a tagged release of pybind11 that contains the fix.

For what you're doing, a ProcessPoolExecutor is probably more performant anyway because it avoids competition for the GIL, so the work can be properly parallelized. As I mentioned above, there is currently a restriction that you can't marshal pikepdf objects across a process boundary, but if you first convert the objects into plain Python values (for example, strings), there is no issue.
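
To illustrate the last point without needing pikepdf installed, here is a small sketch of forcing values into plain Python before they cross a process boundary. `OpaqueValue` is a hypothetical stand-in for an extension-backed object that cannot be pickled, and `to_plain_dict` is an illustrative helper, not a pikepdf API.

```python
import pickle


class OpaqueValue:
    """Hypothetical stand-in for an object that refuses to be pickled."""

    def __init__(self, text):
        self._text = text

    def __str__(self):
        return self._text

    def __reduce__(self):
        raise TypeError("cannot pickle OpaqueValue")


def to_plain_dict(mapping):
    """Copy a mapping into a dict of str keys and str values."""
    return {str(key): str(value) for key, value in mapping.items()}


raw = {"/Title": OpaqueValue("Guidance"), "/Producer": OpaqueValue("Acrobat")}

# Only str objects remain, so pickle (which ProcessPoolExecutor uses to
# return results to the parent) can serialize the dict.
plain = to_plain_dict(raw)
round_tripped = pickle.loads(pickle.dumps(plain))
```

This is the same idea as the `{k: str(v) ...}` comprehension in the ProcessPoolExecutor example above: the conversion happens inside the worker, and only the plain dict is returned.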

@impredicative
Author

I am now using ProcessPoolExecutor, with one caveat: I can never import pikepdf except in a child process. If I import pikepdf even once in the parent process, the problem manifests. This happens even though the above example works with ProcessPoolExecutor.
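
A rough sketch of that arrangement, for reference: pikepdf is imported inside the worker function rather than at module level, so the parent process never loads it. The names `get_docinfo_in_child` and `read_docinfos` are illustrative, and this is only one way to structure the workaround, not necessarily the exact code I am running.

```python
from concurrent.futures import ProcessPoolExecutor
from io import BytesIO


def get_docinfo_in_child(pdf_bytes):
    """Runs in the worker process and returns docinfo as a plain dict."""
    import pikepdf  # deferred import: the parent process never loads pikepdf

    pdf = pikepdf.open(BytesIO(pdf_bytes))
    # Convert to str so the result can be pickled back to the parent.
    return {str(k): str(v) for k, v in dict(pdf.docinfo).items()}


def read_docinfos(pdf_blobs, max_workers=1):
    """Parent-side helper: bytes go in, plain dicts come back."""
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(get_docinfo_in_child, pdf_blobs))
```

On platforms that start workers with spawn rather than fork, the call to `read_docinfos` would need to sit under an `if __name__ == '__main__':` guard.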

jbarlow83 pushed a commit that referenced this issue Mar 2, 2019
@jbarlow83
Member

Fixed for v1.3.1
