Segmentation fault when using pipes #889

Closed
Fulguritus opened this issue Jan 7, 2022 · 20 comments
Labels
third party issue Problem with a third party dependency

Comments

@Fulguritus

Fulguritus commented Jan 7, 2022

Describe the bug
When running ocrmypdf through podman/docker I sometimes (#864) experience segmentation faults and the container hangs indefinitely. The output file is empty.

To Reproduce
The following command reproduces the failure. Because the behavior of ocrmypdf is non-deterministic here, it might take a while, or even several runs of the loop, to trigger.

for i in $(seq 0 100); do
    podman run --rm -i ocrmypdf --verbose -rcd --jbig2-lossy -l deu - - <tmp.pdf >out.pdf
done

The issue is still reproducible with all of these options omitted. The resulting log is:

ocrmypdf 12.6.0.post6+g42713b77.d20211012
Running: ['tesseract', '--list-langs']
stdout/stderr = List of available languages (7):
chi_sim
deu
eng
fra
osd
por
spa

Running: ['unpaper', '--version']
Found unpaper 6.1
Running: ['tesseract', '--version']
Found tesseract 4.1.1
Running: ['gs', '--version']
Found gs 9.53.3
reading file from standard input
os.symlink(/tmp/ocrmypdf.io.yzr1_6f6/stdin, /tmp/ocrmypdf.io.yzr1_6f6/origin.pdf)
Using Tesseract OpenMP thread limit 3
    1 Running: ['gs', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-sDEVICE=jpeggray', '-dFirstPage=1', '-dLastPage=1', '-r150.000000x150.000000', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', '/tmp/ocrmypdf.io.yzr1_6f6/origin.pdf']
    1 Rotating output by 0
    1 Running: ['tesseract', '-l', 'osd', '--psm', '0', '/tmp/ocrmypdf.io.yzr1_6f6/000001_rasterize_preview.jpg', 'stdout']
    1 page is facing ⇧, confidence 7.23 - no change
    1 Rasterize with pnggray, rotation 0
    1 Running: ['gs', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-sDEVICE=pnggray', '-dFirstPage=1', '-dLastPage=1', '-r150.000000x150.000000', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', '/tmp/ocrmypdf.io.yzr1_6f6/origin.pdf']
    1 Rotating output by 0
    1 Running: ['unpaper', '-v', '--dpi', '150.0', '--layout', 'none', '--mask-scan-size', '100', '--no-border-align', '--no-mask-center', '--no-grayfilter', '--no-blackfilter', '--no-deskew', '/tmp/tmpmqv67lqw/input.pnm', '/tmp/tmpmqv67lqw/output.pgm']
    1 stdout/stderr = [image2 @ 0x55a80053afc0] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
[image2 @ 0x55a80053afc0] Encoder did not produce proper pts, making some up.
unpaper 6.1
License GPLv2: GNU GPL version 2.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

-------------------------------------------------------------------------------
Processing sheet #1: /tmp/tmpmqv67lqw/input.pnm -> /tmp/tmpmqv67lqw/output.pgm
input-file for sheet 1: /tmp/tmpmqv67lqw/input.pnm
output-file for sheet 1: /tmp/tmpmqv67lqw/output.pgm
sheet size: 1232x1718
...
noise-filter ... deleted 47 clusters.
blur-filter... deleted 0 pixels.
writing output.

    1 resolution (150.01239999999999, 150.01239999999999)
    1 convert
    1 PIL format = PNG
    1 imgformat = PNG
    1 input dpi = 150 x 150
    1 rotation = 0°
    1 input colorspace = L
    1 width x height = 1232px x 1718px
    1 read_images() embeds a PNG
    1 convert done
    1 Running: ['tesseract', '-l', 'deu', '-c', 'textonly_pdf=1', '/tmp/ocrmypdf.io.yzr1_6f6/000001_ocr.png', '/tmp/ocrmypdf.io.yzr1_6f6/000001_ocr_tess', 'pdf', 'txt']
    1 Emplacement update
    1 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0
    1 Grafting
    1 Page rotation: (content, auto) -> page = (0, 0) -> 0
Postprocessing...
os.symlink(/tmp/ocrmypdf.io.yzr1_6f6/graft_layers.pdf, /tmp/ocrmypdf.io.yzr1_6f6/fix_docinfo.pdf)
Running: ['gs', '-dBATCH', '-dNOPAUSE', '-dSAFER', '-dCompatibilityLevel=1.6', '-sDEVICE=pdfwrite', '-dAutoRotatePages=/None', '-sColorConversionStrategy=LeaveColorUnchanged', '-dAutoFilterColorImages=true', '-dAutoFilterGrayImages=true', '-dJPEGQ=95', '-dPDFA=2', '-dPDFACompatibilityPolicy=1', '-o', '-', '-sstdout=%stderr', '/tmp/ocrmypdf.io.yzr1_6f6/fix_docinfo.pdf', '/tmp/ocrmypdf.io.yzr1_6f6/pdfa.ps']
GPL Ghostscript 9.53.3 (2020-10-01)
Copyright (C) 2020 Artifex Software, Inc.  All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
Processing pages 1 through 1.
Page 1
Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.
The following metadata fields were not copied: {'{http://ns.adobe.com/xap/1.0/}MetadataDate'}
Treating 18 as an optimization candidate
XrefExt(xref=18, ext='.png')
Optimizable images: JPEGs: 0 PNGs: 1
Treating 18 as an optimization candidate
Optimizable images: JBIG2 groups: (0,)
Optimize ratio: 1.00 savings: 0.0%
os.symlink(/tmp/ocrmypdf.io.yzr1_6f6/optimize.opt.pdf, /tmp/ocrmypdf.io.yzr1_6f6/optimize.pdf)
/tmp/ocrmypdf.io.yzr1_6f6/optimize.pdf -> -
Output sent to stdout

dmesg yields:

[21719.464718] conmon[91767]: segfault at 111d000 ip 00007fcf434cf980 sp 00007ffc7f66d4e8 error 4 in libc.so.6[7fcf43380000+176000]
[21719.464741] Code: d7 c1 85 c0 75 a4 48 81 ea 80 00 00 00 0f 86 07 01 00 00 48 ff c7 89 f9 48 83 cf 7f 83 e1 7f 48 01 ca 0f 1f 84 00 00 00 00 00 <c5> fd 74 4f 01 c5 fd 74 57 21 c5 fd 74 5f 41 c5 fd 74 67 61 c5 ed

(Always the same location in libc)

After exchanging >out.pdf with tee out.pdf, I could at one point see strange characters being emitted after %%EOF (?). Most of the time, however, it hangs before that.
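Since a failed run leaves an empty or truncated out.pdf, the reproduction loop can stop at the first bad file rather than always running every iteration. The following is a minimal sketch and not part of OCRmyPDF: the helper names pdf_looks_complete and repro_until_bad and the 1024-byte tail window are assumptions chosen for illustration. The check is only a heuristic, and it would not notice stray bytes written after %%EOF.

```shell
# Heuristic check (an assumption for illustration, not an OCRmyPDF feature):
# a usable PDF is non-empty, begins with "%PDF-", and has "%%EOF" near its end.
pdf_looks_complete() {
    [ -s "$1" ] || return 1
    head -c 5 "$1" | grep -aq '^%PDF-' || return 1
    tail -c 1024 "$1" | grep -aq '%%EOF'
}

# Repeat the reproduction command until the output looks broken.
# Defined only; call it manually where podman and tmp.pdf are available.
repro_until_bad() {
    for i in $(seq 0 100); do
        podman run --rm -i ocrmypdf -l deu - - <tmp.pdf >out.pdf
        if ! pdf_looks_complete out.pdf; then
            echo "iteration $i produced an empty or truncated out.pdf"
            return 1
        fi
    done
}
```

If the container hangs, this loop hangs with it. Prefixing the podman call with the coreutils timeout command is one possible way to bound each iteration.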

Example file
The example file is attached in encrypted form. tmp.pdf.gpg.zip

Expected behavior
The output file should be correct and the tool should not hang.

System

  • OS: Fedora 35
  • OCRmyPDF Version: 12.6.0.post6+g42713b77.d20211012, but reproducible just as well with jbarlow83/ocrmypdf:v13.2.0, jbarlow83/ocrmypdf:v13.1.1 and jbarlow83/ocrmypdf:v13.1.0
  • How did you install ocrmypdf? podman pull jbarlow83/ocrmypdf
@Fulguritus
Author

Output with fewer options, same file:

for i in $(seq 0 100); do podman run --rm -i ocrmypdf --verbose -l deu - - <tmp.pdf >out.pdf; done
ocrmypdf 12.6.0.post6+g42713b77.d20211012
Running: ['tesseract', '--list-langs']
stdout/stderr = List of available languages (7):
chi_sim
deu
eng
fra
osd
por
spa

Running: ['tesseract', '--version']
Found tesseract 4.1.1
Running: ['gs', '--version']
Found gs 9.53.3
reading file from standard input
os.symlink(/tmp/ocrmypdf.io.6rngvn7y/stdin, /tmp/ocrmypdf.io.6rngvn7y/origin.pdf)
Using Tesseract OpenMP thread limit 3
    1 Rasterize with pnggray, rotation 0
    1 Running: ['gs', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-sDEVICE=pnggray', '-dFirstPage=1', '-dLastPage=1', '-r150.000000x150.000000', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', '/tmp/ocrmypdf.io.6rngvn7y/origin.pdf']
    1 Rotating output by 0
    1 resolution (150.01239999999999, 150.01239999999999)
    1 Running: ['tesseract', '-l', 'deu', '-c', 'textonly_pdf=1', '/tmp/ocrmypdf.io.6rngvn7y/000001_ocr.png', '/tmp/ocrmypdf.io.6rngvn7y/000001_ocr_tess', 'pdf', 'txt']
    1 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0
    1 Grafting
    1 Page rotation: (content, auto) -> page = (0, 0) -> 0
Postprocessing...
os.symlink(/tmp/ocrmypdf.io.6rngvn7y/graft_layers.pdf, /tmp/ocrmypdf.io.6rngvn7y/fix_docinfo.pdf)
Running: ['gs', '-dBATCH', '-dNOPAUSE', '-dSAFER', '-dCompatibilityLevel=1.6', '-sDEVICE=pdfwrite', '-dAutoRotatePages=/None', '-sColorConversionStrategy=LeaveColorUnchanged', '-dAutoFilterColorImages=true', '-dAutoFilterGrayImages=true', '-dJPEGQ=95', '-dPDFA=2', '-dPDFACompatibilityPolicy=1', '-o', '-', '-sstdout=%stderr', '/tmp/ocrmypdf.io.6rngvn7y/fix_docinfo.pdf', '/tmp/ocrmypdf.io.6rngvn7y/pdfa.ps']
GPL Ghostscript 9.53.3 (2020-10-01)
Copyright (C) 2020 Artifex Software, Inc.  All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
Processing pages 1 through 1.
Page 1
Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.
The following metadata fields were not copied: {'{http://ns.adobe.com/xap/1.0/}MetadataDate'}
Treating 18 as an optimization candidate
While extracting image xref 18, an error occurred
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/ocrmypdf/optimize.py", line 269, in extract_images
    result = extract_fn(
  File "/usr/local/lib/python3.9/dist-packages/ocrmypdf/optimize.py", line 204, in extract_image_generic
    pim.as_pil_image().save(png_name(root, xref))
  File "/usr/local/lib/python3.9/dist-packages/pikepdf/models/image.py", line 719, in as_pil_image
    im = self._extract_transcoded()
  File "/usr/local/lib/python3.9/dist-packages/pikepdf/models/image.py", line 527, in _extract_transcoded
    if self.mode in {'DeviceN', 'Separation'}:
  File "/usr/local/lib/python3.9/dist-packages/pikepdf/models/image.py", line 271, in mode
    raise NotImplementedError(
NotImplementedError: Not sure how to handle PDF image of this type
Optimizable images: JPEGs: 0 PNGs: 0
Treating 18 as an optimization candidate
Optimizable images: JBIG2 groups: (0,)
Optimize ratio: 1.00 savings: 0.0%
os.symlink(/tmp/ocrmypdf.io.6rngvn7y/optimize.opt.pdf, /tmp/ocrmypdf.io.6rngvn7y/optimize.pdf)
/tmp/ocrmypdf.io.6rngvn7y/optimize.pdf -> -
Output sent to stdout

dmesg:

$ dmesg
[23454.630421] conmon[96614]: segfault at fe0000 ip 00007f36c4798980 sp 00007fff96201d18 error 4 in libc.so.6[7f36c4649000+176000]
[23454.630439] Code: d7 c1 85 c0 75 a4 48 81 ea 80 00 00 00 0f 86 07 01 00 00 48 ff c7 89 f9 48 83 cf 7f 83 e1 7f 48 01 ca 0f 1f 84 00 00 00 00 00 <c5> fd 74 4f 01 c5 fd 74 57 21 c5 fd 74 5f 41 c5 fd 74 67 61 c5 ed

@Fulguritus
Author

...and here with just --verbose 1

ocrmypdf 12.6.0.post6+g42713b77.d20211012
Running: ['tesseract', '--list-langs']
stdout/stderr = List of available languages (7):
chi_sim
deu
eng
fra
osd
por
spa

Running: ['tesseract', '--version']
Found tesseract 4.1.1
Running: ['gs', '--version']
Found gs 9.53.3
reading file from standard input
os.symlink(/tmp/ocrmypdf.io.tw7da05g/stdin, /tmp/ocrmypdf.io.tw7da05g/origin.pdf)
Using Tesseract OpenMP thread limit 3
    1 Rasterize with pnggray, rotation 0
    1 Running: ['gs', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-sDEVICE=pnggray', '-dFirstPage=1', '-dLastPage=1', '-r150.000000x150.000000', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', '/tmp/ocrmypdf.io.tw7da05g/origin.pdf']
    1 Rotating output by 0
    1 resolution (150.01239999999999, 150.01239999999999)
    1 Running: ['tesseract', '-l', 'eng', '-c', 'textonly_pdf=1', '/tmp/ocrmypdf.io.tw7da05g/000001_ocr.png', '/tmp/ocrmypdf.io.tw7da05g/000001_ocr_tess', 'pdf', 'txt']
    1 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0
    1 Grafting
    1 Page rotation: (content, auto) -> page = (0, 0) -> 0
Postprocessing...
os.symlink(/tmp/ocrmypdf.io.tw7da05g/graft_layers.pdf, /tmp/ocrmypdf.io.tw7da05g/fix_docinfo.pdf)
Running: ['gs', '-dBATCH', '-dNOPAUSE', '-dSAFER', '-dCompatibilityLevel=1.6', '-sDEVICE=pdfwrite', '-dAutoRotatePages=/None', '-sColorConversionStrategy=LeaveColorUnchanged', '-dAutoFilterColorImages=true', '-dAutoFilterGrayImages=true', '-dJPEGQ=95', '-dPDFA=2', '-dPDFACompatibilityPolicy=1', '-o', '-', '-sstdout=%stderr', '/tmp/ocrmypdf.io.tw7da05g/fix_docinfo.pdf', '/tmp/ocrmypdf.io.tw7da05g/pdfa.ps']
GPL Ghostscript 9.53.3 (2020-10-01)
Copyright (C) 2020 Artifex Software, Inc.  All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
Processing pages 1 through 1.
Page 1
Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.
The following metadata fields were not copied: {'{http://ns.adobe.com/xap/1.0/}MetadataDate'}
Treating 18 as an optimization candidate
While extracting image xref 18, an error occurred
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/ocrmypdf/optimize.py", line 269, in extract_images
    result = extract_fn(
  File "/usr/local/lib/python3.9/dist-packages/ocrmypdf/optimize.py", line 204, in extract_image_generic
    pim.as_pil_image().save(png_name(root, xref))
  File "/usr/local/lib/python3.9/dist-packages/pikepdf/models/image.py", line 719, in as_pil_image
    im = self._extract_transcoded()
  File "/usr/local/lib/python3.9/dist-packages/pikepdf/models/image.py", line 527, in _extract_transcoded
    if self.mode in {'DeviceN', 'Separation'}:
  File "/usr/local/lib/python3.9/dist-packages/pikepdf/models/image.py", line 271, in mode
    raise NotImplementedError(
NotImplementedError: Not sure how to handle PDF image of this type
Optimizable images: JPEGs: 0 PNGs: 0
Treating 18 as an optimization candidate
Optimizable images: JBIG2 groups: (0,)
Optimize ratio: 1.00 savings: 0.0%
os.symlink(/tmp/ocrmypdf.io.tw7da05g/optimize.opt.pdf, /tmp/ocrmypdf.io.tw7da05g/optimize.pdf)
/tmp/ocrmypdf.io.tw7da05g/optimize.pdf -> -
Output sent to stdout

@jbarlow83
Collaborator

Looks like you're using ocrmypdf 12.6. Can you try again with a more recent version? v13 introduced some improvements to concurrency that reduce hanging.

I won't be able to do much with the segfault. Perhaps you can identify the responsible process?

@Fulguritus
Author

Fulguritus commented Jan 7, 2022

Hi @jbarlow83 I am using the latest docker container, but also explicitly tried out jbarlow83/ocrmypdf:v13.2.0, jbarlow83/ocrmypdf:v13.1.1 and jbarlow83/ocrmypdf:v13.1.0

Just tried with an additional -j 1 - same issue (jbarlow83/ocrmypdf:v13.2.0).

@jbarlow83
Collaborator

The message ocrmypdf 12.6.0.post6+g42713b77.d20211012 indicates it's an old version.

The latest should read:

$ docker run -t jbarlow83/ocrmypdf --version
13.2.0.post3+g7966192d.d20220104

@Fulguritus
Author

I won't be able to do much with the segfault. Perhaps you can identify the responsible process?

Could you give me a pointer on how to do that with docker/podman?

@Fulguritus
Author

Fulguritus commented Jan 7, 2022

The message ocrmypdf 12.6.0.post6+g42713b77.d20211012 indicates it's an old version.

The latest should read:

$ docker run -t jbarlow83/ocrmypdf --version
13.2.0.post3+g7966192d.d20220104

Oh, sorry - my bad. It seems I mixed up versions when filing this bug report... Here is the result with 13.2.0:

for i in $(seq 0 100); do podman run --rm -i jbarlow83/ocrmypdf:v13.2.0 --verbose -l deu - - <tmp.pdf >out.pdf; done
ocrmypdf 13.2.0.post0+g298bdb86.d20211219
Running: ['tesseract', '--list-langs']
stdout/stderr = List of available languages (7):
chi_sim
deu
eng
fra
osd
por
spa

Running: ['tesseract', '--version']
Found tesseract 4.1.1
Running: ['gs', '--version']
Found gs 9.53.3
reading file from standard input
os.symlink(/tmp/ocrmypdf.io.eriggexu/stdin, /tmp/ocrmypdf.io.eriggexu/origin.pdf)
Using Tesseract OpenMP thread limit 3
    1 Rasterize with pnggray, rotation 0
    1 Running: ['gs', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-dInterpolateControl=-1', '-sDEVICE=pnggray', '-dFirstPage=1', '-dLastPage=1', '-r150.000000x150.000000', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', '/tmp/ocrmypdf.io.eriggexu/origin.pdf']
    1 Rotating output by 0
    1 resolution (150.01239999999999, 150.01239999999999)
    1 Running: ['tesseract', '-l', 'deu', '-c', 'textonly_pdf=1', '/tmp/ocrmypdf.io.eriggexu/000001_ocr.png', '/tmp/ocrmypdf.io.eriggexu/000001_ocr_tess', 'pdf', 'txt']
    1 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0
    1 Grafting
    1 Page rotation: (content, auto) -> page = (0, 0) -> 0
Postprocessing...
os.symlink(/tmp/ocrmypdf.io.eriggexu/graft_layers.pdf, /tmp/ocrmypdf.io.eriggexu/fix_docinfo.pdf)
Running: ['gs', '-dBATCH', '-dNOPAUSE', '-dSAFER', '-dCompatibilityLevel=1.6', '-sDEVICE=pdfwrite', '-dAutoRotatePages=/None', '-sColorConversionStrategy=LeaveColorUnchanged', '-dAutoFilterColorImages=true', '-dAutoFilterGrayImages=true', '-dJPEGQ=95', '-dPDFA=2', '-dPDFACompatibilityPolicy=1', '-o', '-', '-sstdout=%stderr', '/tmp/ocrmypdf.io.eriggexu/fix_docinfo.pdf', '/tmp/ocrmypdf.io.eriggexu/pdfa.ps']
GPL Ghostscript 9.53.3 (2020-10-01)
Copyright (C) 2020 Artifex Software, Inc.  All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
Processing pages 1 through 1.
Page 1
Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.
The following metadata fields were not copied: {'{http://ns.adobe.com/xap/1.0/}MetadataDate'}
Treating 18 as an optimization candidate
While extracting image xref 18, an error occurred
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/ocrmypdf/optimize.py", line 268, in extract_images
    result = extract_fn(
  File "/usr/local/lib/python3.9/dist-packages/ocrmypdf/optimize.py", line 203, in extract_image_generic
    pim.as_pil_image().save(png_name(root, xref))
  File "/usr/local/lib/python3.9/dist-packages/pikepdf/models/image.py", line 719, in as_pil_image
    im = self._extract_transcoded()
  File "/usr/local/lib/python3.9/dist-packages/pikepdf/models/image.py", line 527, in _extract_transcoded
    if self.mode in {'DeviceN', 'Separation'}:
  File "/usr/local/lib/python3.9/dist-packages/pikepdf/models/image.py", line 271, in mode
    raise NotImplementedError(
NotImplementedError: Not sure how to handle PDF image of this type
Optimizable images: JPEGs: 0 PNGs: 0
Treating 18 as an optimization candidate
Optimizable images: JBIG2 groups: (0,)
Optimize ratio: 1.00 savings: 0.0%
os.symlink(/tmp/ocrmypdf.io.eriggexu/optimize.opt.pdf, /tmp/ocrmypdf.io.eriggexu/optimize.pdf)
/tmp/ocrmypdf.io.eriggexu/optimize.pdf -> -
Output sent to stdout

dmesg

$ dmesg

[24408.182063] conmon[100696]: segfault at a8a000 ip 00007f025e26d9b5 sp 00007ffcc9e773e8 error 4 in libc.so.6[7f025e11e000+176000]
[24408.182091] Code: fd 74 5f 41 c5 fd 74 67 61 c5 ed eb e9 c5 dd eb f3 c5 cd eb ed c5 fd d7 cd 85 c9 75 48 48 83 ef 80 48 81 ea 80 00 00 00 77 cb <c5> fd 74 4f 01 c5 fd d7 c1 66 90 85 c0 75 5c 83 c2 40 0f 8f c3 00

@Fulguritus
Author

Taking that last segfault:

[24408.182063] conmon[100696]: segfault at a8a000 ip 00007f025e26d9b5 sp 00007ffcc9e773e8 error 4 in libc.so.6[7f025e11e000+176000]
[24408.182091] Code: fd 74 5f 41 c5 fd 74 67 61 c5 ed eb e9 c5 dd eb f3 c5 cd eb ed c5 fd d7 cd 85 c9 75 48 48 83 ef 80 48 81 ea 80 00 00 00 77 cb <c5> fd 74 4f 01 c5 fd d7 c1 66 90 85 c0 75 5c 83 c2 40 0f 8f c3 00

Address within libc: 0x00007f025e26d9b5 - 0x7f025e11e000 = 0x14F9B5

$ addr2line -e /usr/lib/libc.so.6 -fCi 0x14F9B5
__GI_netname2host
:?

Does this help?
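The subtraction above can be scripted so that any segfault line from dmesg is reduced to an offset relative to the start of the libc mapping. This is a minimal sketch that assumes the kernel line layout shown in this thread (an "ip" field and a trailing "[base+size]" mapping). The function name segfault_offset is hypothetical.

```shell
# Print the faulting instruction's offset inside the mapped library.
# $1: one dmesg segfault line in the format quoted in this thread.
segfault_offset() {
    # Instruction pointer: the hex value after " ip ".
    ip=$(printf '%s\n' "$1" | sed -n 's/.* ip \([0-9a-f]*\) .*/\1/p')
    # Mapping base: the hex value before "+" in the last "[base+size]".
    base=$(printf '%s\n' "$1" | sed -n 's/.*\[\([0-9a-f]*\)+[0-9a-f]*\].*/\1/p')
    printf '0x%X\n' $(( 0x$ip - 0x$base ))
}

segfault_offset '[24408.182063] conmon[100696]: segfault at a8a000 ip 00007f025e26d9b5 sp 00007ffcc9e773e8 error 4 in libc.so.6[7f025e11e000+176000]'
```

Applied to the five distinct segfault lines quoted in this thread, it gives 0x14F980 for three of them and 0x14F9B5 for the other two, so the crashes cluster at two nearby offsets in libc.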

@jbarlow83
Collaborator

Re addr2line: That's just showing that the container manager crashed but doesn't give any insight into the container process responsible (if any).

You could also try 1) --use-threads and 2) --jobs 1 as input arguments. These options make ocrmypdf use threads instead of processes and disable multiprocessing, respectively.

@Fulguritus
Author

for i in $(seq 0 100); do podman run --rm -i jbarlow83/ocrmypdf:v13.2.0  --use-threads --verbose -l deu - - <tmp.pdf >out.pdf; done
ocrmypdf 13.2.0.post0+g298bdb86.d20211219
Running: ['tesseract', '--list-langs']
stdout/stderr = List of available languages (7):
chi_sim
deu
eng
fra
osd
por
spa

Running: ['tesseract', '--version']
Found tesseract 4.1.1
Running: ['gs', '--version']
Found gs 9.53.3
reading file from standard input
os.symlink(/tmp/ocrmypdf.io.jkqrutct/stdin, /tmp/ocrmypdf.io.jkqrutct/origin.pdf)
Using Tesseract OpenMP thread limit 3
    1 Rasterize with pnggray, rotation 0
    1 Running: ['gs', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-dInterpolateControl=-1', '-sDEVICE=pnggray', '-dFirstPage=1', '-dLastPage=1', '-r150.000000x150.000000', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', '/tmp/ocrmypdf.io.jkqrutct/origin.pdf']
    1 Rotating output by 0
    1 resolution (150.01239999999999, 150.01239999999999)
    1 Running: ['tesseract', '-l', 'deu', '-c', 'textonly_pdf=1', '/tmp/ocrmypdf.io.jkqrutct/000001_ocr.png', '/tmp/ocrmypdf.io.jkqrutct/000001_ocr_tess', 'pdf', 'txt']
    1 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0
    1 Grafting
    1 Page rotation: (content, auto) -> page = (0, 0) -> 0
Postprocessing...
os.symlink(/tmp/ocrmypdf.io.jkqrutct/graft_layers.pdf, /tmp/ocrmypdf.io.jkqrutct/fix_docinfo.pdf)
Running: ['gs', '-dBATCH', '-dNOPAUSE', '-dSAFER', '-dCompatibilityLevel=1.6', '-sDEVICE=pdfwrite', '-dAutoRotatePages=/None', '-sColorConversionStrategy=LeaveColorUnchanged', '-dAutoFilterColorImages=true', '-dAutoFilterGrayImages=true', '-dJPEGQ=95', '-dPDFA=2', '-dPDFACompatibilityPolicy=1', '-o', '-', '-sstdout=%stderr', '/tmp/ocrmypdf.io.jkqrutct/fix_docinfo.pdf', '/tmp/ocrmypdf.io.jkqrutct/pdfa.ps']
GPL Ghostscript 9.53.3 (2020-10-01)
Copyright (C) 2020 Artifex Software, Inc.  All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
Processing pages 1 through 1.
Page 1
Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.
The following metadata fields were not copied: {'{http://ns.adobe.com/xap/1.0/}MetadataDate'}
Treating 18 as an optimization candidate
While extracting image xref 18, an error occurred
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/ocrmypdf/optimize.py", line 268, in extract_images
    result = extract_fn(
  File "/usr/local/lib/python3.9/dist-packages/ocrmypdf/optimize.py", line 203, in extract_image_generic
    pim.as_pil_image().save(png_name(root, xref))
  File "/usr/local/lib/python3.9/dist-packages/pikepdf/models/image.py", line 719, in as_pil_image
    im = self._extract_transcoded()
  File "/usr/local/lib/python3.9/dist-packages/pikepdf/models/image.py", line 527, in _extract_transcoded
    if self.mode in {'DeviceN', 'Separation'}:
  File "/usr/local/lib/python3.9/dist-packages/pikepdf/models/image.py", line 271, in mode
    raise NotImplementedError(
NotImplementedError: Not sure how to handle PDF image of this type
Optimizable images: JPEGs: 0 PNGs: 0
Treating 18 as an optimization candidate
Optimizable images: JBIG2 groups: (0,)
Optimize ratio: 1.00 savings: 0.0%
os.symlink(/tmp/ocrmypdf.io.jkqrutct/optimize.opt.pdf, /tmp/ocrmypdf.io.jkqrutct/optimize.pdf)
/tmp/ocrmypdf.io.jkqrutct/optimize.pdf -> -
Output sent to stdout
$ dmesg
[26263.186556] conmon[104410]: segfault at d80000 ip 00007f3b09241980 sp 00007fff9e38a568 error 4 in libc.so.6[7f3b090f2000+176000]
[26263.186590] Code: d7 c1 85 c0 75 a4 48 81 ea 80 00 00 00 0f 86 07 01 00 00 48 ff c7 89 f9 48 83 cf 7f 83 e1 7f 48 01 ca 0f 1f 84 00 00 00 00 00 <c5> fd 74 4f 01 c5 fd 74 57 21 c5 fd 74 5f 41 c5 fd 74 67 61 c5 ed

@Fulguritus
Author

for i in $(seq 0 100); do podman run --rm -i jbarlow83/ocrmypdf:v13.2.0  --jobs 1 --verbose -l deu - - <tmp.pdf >out.pdf; done
ocrmypdf 13.2.0.post0+g298bdb86.d20211219
Running: ['tesseract', '--list-langs']
stdout/stderr = List of available languages (7):
chi_sim
deu
eng
fra
osd
por
spa

Running: ['tesseract', '--version']
Found tesseract 4.1.1
Running: ['gs', '--version']
Found gs 9.53.3
reading file from standard input
os.symlink(/tmp/ocrmypdf.io.0biy4pp4/stdin, /tmp/ocrmypdf.io.0biy4pp4/origin.pdf)
Using Tesseract OpenMP thread limit 1
    1 Rasterize with pnggray, rotation 0
    1 Running: ['gs', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-dInterpolateControl=-1', '-sDEVICE=pnggray', '-dFirstPage=1', '-dLastPage=1', '-r150.000000x150.000000', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', '/tmp/ocrmypdf.io.0biy4pp4/origin.pdf']
    1 Rotating output by 0
    1 resolution (150.01239999999999, 150.01239999999999)
    1 Running: ['tesseract', '-l', 'deu', '-c', 'textonly_pdf=1', '/tmp/ocrmypdf.io.0biy4pp4/000001_ocr.png', '/tmp/ocrmypdf.io.0biy4pp4/000001_ocr_tess', 'pdf', 'txt']
    1 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0
    1 Grafting
    1 Page rotation: (content, auto) -> page = (0, 0) -> 0
Postprocessing...
os.symlink(/tmp/ocrmypdf.io.0biy4pp4/graft_layers.pdf, /tmp/ocrmypdf.io.0biy4pp4/fix_docinfo.pdf)
Running: ['gs', '-dBATCH', '-dNOPAUSE', '-dSAFER', '-dCompatibilityLevel=1.6', '-sDEVICE=pdfwrite', '-dAutoRotatePages=/None', '-sColorConversionStrategy=LeaveColorUnchanged', '-dAutoFilterColorImages=true', '-dAutoFilterGrayImages=true', '-dJPEGQ=95', '-dPDFA=2', '-dPDFACompatibilityPolicy=1', '-o', '-', '-sstdout=%stderr', '/tmp/ocrmypdf.io.0biy4pp4/fix_docinfo.pdf', '/tmp/ocrmypdf.io.0biy4pp4/pdfa.ps']
GPL Ghostscript 9.53.3 (2020-10-01)
Copyright (C) 2020 Artifex Software, Inc.  All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
Processing pages 1 through 1.
Page 1
Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.
The following metadata fields were not copied: {'{http://ns.adobe.com/xap/1.0/}MetadataDate'}
Treating 18 as an optimization candidate
While extracting image xref 18, an error occurred
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/ocrmypdf/optimize.py", line 268, in extract_images
    result = extract_fn(
  File "/usr/local/lib/python3.9/dist-packages/ocrmypdf/optimize.py", line 203, in extract_image_generic
    pim.as_pil_image().save(png_name(root, xref))
  File "/usr/local/lib/python3.9/dist-packages/pikepdf/models/image.py", line 719, in as_pil_image
    im = self._extract_transcoded()
  File "/usr/local/lib/python3.9/dist-packages/pikepdf/models/image.py", line 527, in _extract_transcoded
    if self.mode in {'DeviceN', 'Separation'}:
  File "/usr/local/lib/python3.9/dist-packages/pikepdf/models/image.py", line 271, in mode
    raise NotImplementedError(
NotImplementedError: Not sure how to handle PDF image of this type
Optimizable images: JPEGs: 0 PNGs: 0
Treating 18 as an optimization candidate
Optimizable images: JBIG2 groups: (0,)
Optimize ratio: 1.00 savings: 0.0%
os.symlink(/tmp/ocrmypdf.io.0biy4pp4/optimize.opt.pdf, /tmp/ocrmypdf.io.0biy4pp4/optimize.pdf)
/tmp/ocrmypdf.io.0biy4pp4/optimize.pdf -> -
Output sent to stdout
$ dmesg
[26416.316781] conmon[104723]: segfault at 8f3000 ip 00007fb6f256e9b5 sp 00007fffdcf4afa8 error 4 in libc.so.6[7fb6f241f000+176000]
[26416.316799] Code: fd 74 5f 41 c5 fd 74 67 61 c5 ed eb e9 c5 dd eb f3 c5 cd eb ed c5 fd d7 cd 85 c9 75 48 48 83 ef 80 48 81 ea 80 00 00 00 77 cb <c5> fd 74 4f 01 c5 fd d7 c1 66 90 85 c0 75 5c 83 c2 40 0f 8f c3 00

@Fulguritus
Author

So, I assume nothing to do with concurrency :(

@Fulguritus
Author

Fulguritus commented Jan 7, 2022

podman logs (stderr) for the last two crashes yields the following (both approximately the same):

/tmp/ocrmypdf.io.0biy4pp4/optimize.pdf -> -
��-CI>���
F��D���CM�rԗ�\��C<��on'}:
1����JU�e NotImplementedError(
A@�`� 18 as an optimization candidate@0�����JU���8�l
�+�&)��

��й�mypdf.io.0biy4pp4/optimize.pdf -> -
.op1@���0biy4pp4/optimizpP<< /Subtype /XML /Type /Metadata /Length 1323 >>
�����S��n��!��� ������JU�cription xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/" rdf:about="" pdfaid:part="2" pdfaid:conformance="B"/><rdf:Description rdf:about=""><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/"><rdf:Seq><rdf:li/></rdf:Seq></dc:creator></rdf:Description><rdf:Description xmlns:xmp="http://ns.adobe.com/xap/1.0/" rdf:about="" xmp:MetadataOutput sent to stdout

Full logs here:
log_err_1.txt
log_err_2.txt

The stdout logs contain a corrupt PDF that includes some debug messages (see the gs options at the far right of the snippet below):

12 0 obj
<< /Filter /FlateDecode /Length 7767 >>
stream
x\9C\A5][\93U\B7\B1\DE\CF\FC\8A9U\A9Sv\95\D9Y\BA,-鼜\82`c\C7q.\86\C4U!y��\86!\9E�{.Pş\F5K~C\9E\F2p\A4nI\FD\B5\D6\DAÐ�\CA�\E3n-]\FA\F2\F5E\9A\9F\8F\A6\BD9\9Aʯ\FA\FB\8B\8B{?\DF3\F47G\F5\B7��G�\9F\DE\FB\F5\F7\F1\C8\F8\A3\A7\A7\F7\CC~\F2\FCߏ\\8A{�\B4D\BB_\EC\D1Ӌ{\EE\E8\E9\E5\BD\CFv\DF\EF\BE\DC\FDf\F7\F5\EE\E9\EE\E8\F3\A7\FF\B87\EDg�\E7Xy\BC�\8A\E7\B3\DD�3Y�\B7\D3\CCq�\A3"z\92G\FBr\F7\E7\FC\CF\F7\BB߷qc2\DEV\9E\D9\F8\FD��\CF7LGSwe\EA\D3>\A5\E8*\83\89i\9F\E9}\DC�\CF\F4\DF\ED\DE\EENv/w\97\BB7\BB׻\AB\FCO\F9\F3M\FE\B7W\F9ߎ\F3\EF?\EE\FE\9D\FF\FF$\FF\97W\F5\EF\FB�ܴ\9F\A6\C9\F3g\A2	\CBrt\BF\FC\95\9F\A2\8B\F3�\FD1\E5\FF\E5\BDۗ\DF\F3\FA|\9E\AAI\E9h1S\9F\F3wy\EC\EB\FC\EBe\FE\FD\BC�\FB\A7\DDi\FE\B7׻�\BB3\FAoo\FA�\E4o\BA\F6\A1i2\DB�rƖ\BF=Z\A6\B4w\8B\E3/\95\95^\F5qB>\CBT�ryR��Z\F2\DF\C6@�\E5u\F1@�\A1m\BA\A0\ED\BA\A6?_ml\BC\99\97vR\997�\ED(D\B7w\81\C7x\90\97Z�W\A7�|0\A1\92'\9F?5�E~B�\EBs\9FC#6f\DE�=\F4\BB|\AA\97\F9\F4n\F2\B4\AE\FA�\B6݄\BDLӔ\9Ad[Kb\84\E3<\A4\E5]җ\9F\E7\91~\CC\E3^\E1id�\9E\ACm3q&\EC\9D�\E1"KO\99�~\D3\F5u\BA\B0\8CS/\9F\B8ȿnH ?T\81\BC\EE̾�S\D5�\B3\D7�{O\B4'\F2\B1r�\89�c\F1\A1\EFX\F0\FBr\DE\F9w\BF\F7\91y_V5h�<;\D76&zE\B9>\FBʃ\9F0v\DE/A\F1\BD'�\FFe\F7E?\F1\E4b\FB\86\99\A7\BD\F5�=�\B8\99d\EC�\8A\DA#\ED��\EC\FB\BA[\AF\9B\92f3�bl&\C6N\A1\98�\BD\EA\AE�>O\A7\8B\82Y\F6\C9*\CA\D7t\F0}\EE\D9
u3d\9D\DD\EBaŪ\DC\EF\C9Nv\DC�g\F6s5�[\E6\E7"\8B\F0/\F9\CF\C7ym�(B\CB�;=/{\EF\D5X\97]\E4a'R\88\B6\EFv\B4\DB\C0SL݇\FCe4wht\AC|\D1fs^��\B8o\88\A7\CD.\CAN\F6\A5\9AXά.\F5X)~\90\A1cq�H\AB\842\CDF�f\B2\B4�\A0}\94\A7qI#\9F\E6E\9C\F5\8D�q\EB\D32\C1\90x�\FCR\98\85v\C9J\AAHow�|j\F7\C1�N\D5C\C4qS\A6l\B9dO\F8\90\DB&&kFk(\E4E\D2/\F3Y��u\D1e}n�f\8Ag\8AȠ\AC\E7\D2\E9\B2̦\80t\DFu\B5n\83.]в#\F0H[ĥ\8D8/"[d\90\80\AEl\CE\D9\EE_\D5\F2\EE\C5\\88fZ>�\E1\F9\FD #I\84\CF\81\98\CAl]\81�Y\E3\E6d\BB\B3�\B2\FB�\D8\F57k-R\CD\F4\ABҹ�\94Vӽ\00\00\00\00\00\00\00F\97\BE\00\00\00\00\00\00\00\00\00\B1�\00\00\00\00\00\00\DEҍ\00\00\00\00\00\A1\B2\F4\8F�JU\EE ['gs', '-dBATCH', '-dNOPAUSE', '-dSAFER', '-dCompatibilityLevel=1.6', '-sDEVICE=pdfwrite', '-dAutoRotatePages=/None', '-sColorConversionStrategy=LeaveColorUnchanged', '-dAutoFilterColorImages=true', '-dAutoFilterGrayImages=true', '-dJPEGQ=95', '-dPDFA=2', '-dPDFACompatibilityPolicy=1', '-o', '-', '-s

Full log (stdout):
log.txt.gpg.zip

@Fulguritus
Author

Fulguritus commented Jan 8, 2022

OK, removing --verbose 1 gets rid of the debug printout in stdout (the PDF is still corrupt). stderr seems to contain some stream data:

reading file from standard input
Postprocessing...
Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.
While extracting image xref 18, an error occurred
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/ocrmypdf/optimize.py", line 268, in extract_images
    result = extract_fn(
  File "/usr/local/lib/python3.9/dist-packages/ocrmypdf/optimize.py", line 203, in extract_image_generic
    pim.as_pil_image().save(png_name(root, xref))
  File "/usr/local/lib/python3.9/dist-packages/pikepdf/models/image.py", line 719, in as_pil_image
    im = self._extract_transcoded()
  File "/usr/local/lib/python3.9/dist-packages/pikepdf/models/image.py", line 527, in _extract_transcoded
    if self.mode in {'DeviceN', 'Separation'}:
  File "/usr/local/lib/python3.9/dist-packages/pikepdf/models/image.py", line 271, in mode
    raise NotImplementedError(
NotImplementedError: Not sure how to handle PDF image of this type
Optimize ratio: 1.00 savings: 0.0%
[binary PDF stream data written to stderr omitted]
cription xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/" rdf:about="" pdfaid:part="2" pdfaid:conformance="B"/><rdf:Description rdf:about=""><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/"><rdf:Seq><rdf:li/></rdf:Seq></dc:creator></rdf:Description><rdf:Description xmlns:xmp="http://ns.adobe.com/xap/1.0/" rdf:about="" xmp:MetadataDate="2022-0
Output sent to stdout

BTW: the error regarding xref 18 also shows up when I get the correct output with valid (and correctly OCR'ed) PDFs.

@jbarlow83
Collaborator

Can you try to reproduce this issue using docker rather than podman as the container host?

@Fulguritus
Author

Fulguritus commented Jan 8, 2022

It seems that --output-type pdf solves the issue. At least, I have run several thousand iterations without an issue so far.
This would suggest that the issue is located in the PDF/A conversion code.
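
A minimal sketch of how this workaround could be soak-tested: it reruns the reproduction loop and counts the runs whose output is empty or does not start with a `%PDF-` header. The helper names `looks_like_pdf` and `soak_test` and the `OCR_CMD` variable are assumptions made for illustration; they are not part of ocrmypdf or podman.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the file is non-empty and starts
# with the "%PDF-" magic bytes. It does not validate the PDF structure.
looks_like_pdf() {
    [ -s "$1" ] || return 1
    [ "$(head -c 5 "$1" 2>/dev/null)" = "%PDF-" ]
}

# Hypothetical driver: run the command stored in OCR_CMD "$1" times with
# tmp.pdf on stdin and report how many runs produced a bad out.pdf.
soak_test() {
    runs=$1
    failures=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        # OCR_CMD is an assumption, for example:
        #   OCR_CMD="podman run --rm -i ocrmypdf --output-type pdf - -"
        # It is left unquoted on purpose so that it splits into words.
        $OCR_CMD <tmp.pdf >out.pdf
        looks_like_pdf out.pdf || failures=$((failures + 1))
        i=$((i + 1))
    done
    echo "$failures of $runs runs produced an invalid output"
}
```

With `OCR_CMD` set as in the comment, `soak_test 1000` would run a thousand iterations. A run that hangs, as described in the original report, would stall this loop, so a real test would probably also need a timeout around the command.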

I'll set up docker and test it there tomorrow.

However, there are a few oddities that may require individual tickets/bug reports and investigation:

  1. --verbose should not log to stdout when output_pdf is -.
  2. It is weird that, when the crash is observed, some PDF content ends up on stderr.
  3. I am wondering about the exception referring to xref 18 (15 in the original PDF) from pikepdf. I ran the document through mutool and qpdf, and these tools did not find any issue. It is a direct scan from gnome-simple-scan.
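
One way to look into items 1 and 2 is to capture stdout and stderr in separate files and count the bytes in the stderr capture that are neither printable ASCII nor whitespace. The sketch below assumes that approach. The helper names `count_binary_bytes` and `stream_is_text` are hypothetical, and a log containing legitimate UTF-8 text would also be flagged, since only ASCII counts as text here.

```shell
#!/bin/sh
# Hypothetical helper: print the number of bytes in file "$1" that are
# neither printable ASCII nor whitespace (LC_ALL=C forces byte semantics).
count_binary_bytes() {
    n=$(LC_ALL=C tr -d '[:print:][:space:]' <"$1" | wc -c)
    # Arithmetic expansion strips any padding that wc adds to the count.
    echo "$((n))"
}

# Hypothetical helper: succeed if the captured stream looks like plain text.
stream_is_text() {
    [ "$(count_binary_bytes "$1")" -eq 0 ]
}

# Example capture (assumed invocation, adapted from the report):
#   podman run --rm -i ocrmypdf - - <tmp.pdf >out.pdf 2>err.log
#   stream_is_text err.log || echo "stderr contains binary data"
```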

@jbarlow83
Copy link
Collaborator

ocrmypdf does not log to stdout. The test suite covers checking that stdout is clean.

I believe that podman is responsible for the behavior you're seeing. When I run ocrmypdf in native Linux or in Docker on your file, a valid PDF/A is produced (confirmed by checking PDF/A compliance with verapdf and a PDF viewer).

The error regarding xref 18 is just indicating that an unusual image (specifically 2-bit grayscale) could not be optimized.

@jbarlow83 jbarlow83 added the third party issue Problem with a third party dependency label Jan 9, 2022
@Fulguritus
Author

After a lot of testing I can confirm that this is working with docker and the stand-alone version. Even if I start the container, replace the entry point, and then exec ocrmypdf in a loop I cannot reproduce the issue. It certainly looks like a podman issue.

For now, I will live with the --output-type pdf workaround, although I do not understand why it works without an issue. During my research I saw some issues with abruptly ending scripts in podman. Would you know whether ocrmypdf exits differently for PDF/A and PDF output?

@Fulguritus
Author

Cross-posted in containers/conmon#315

@jbarlow83
Collaborator

ocrmypdf does not behave much differently between the two. Frankly, if the error shown here is actually the case,
https://issueexplorer.com/issue/containers/conmon/251
there are two unchecked pointer dereferences, so pretty much anything is possible, including getting the host to execute arbitrary code produced by the container. I wouldn't use podman for anything until this is fixed.
