test_bad_utf8 fails when run in parallel #456

QuLogic · 2019-11-24T08:33:50Z

Describe the issue
I'm not sure what's going with it, but when I run tests in parallel (-n auto), test_bad_utf8 fails. When everything is run in serial, everything works fine. I guess there is some sort of global state somewhere that is modified by a parallel test that breaks things. Also, this seemed to work fine on 9.0.5, but I don't see much in the diff between it and 9.1.1 that would be indicative of anything.

To Reproduce
pytest -n auto

_____________________________ test_bad_utf8[hocr] ______________________________
[gw1] linux -- Python 3.8.0 /usr/bin/python3
spoof_tess_bad_utf8 = {'CONFIG_SITE': 'NONE', 'HISTCONTROL': 'ignoredups', 'HISTSIZE': '1000', 'HOME': '/builddir', ...}
renderer = 'hocr'
resources = PosixPath('/builddir/build/BUILD/ocrmypdf-9.1.1/tests/resources')
no_outpdf = PosixPath('/tmp/pytest-of-mockbuild/pytest-0/popen-gw1/test_bad_utf8_hocr_0/no_output.pdf')
    @pytest.mark.parametrize('renderer', ['hocr', 'sandwich'])
    def test_bad_utf8(spoof_tess_bad_utf8, renderer, resources, no_outpdf):
        p, out, err = run_ocrmypdf(       
            resources / 'ccitt.pdf',
            no_outpdf,                        
            '--pdf-renderer',
            renderer,                     
            env=spoof_tess_bad_utf8,                             
        )                                                        
    
        assert out == '', "stdout not clean"
        assert p.returncode != 0
>       assert 'not utf-8' in err, "should whine about utf-8"
E       AssertionError: should whine about utf-8
E       assert 'not utf-8' in '   INFO - Using Tesseract OpenMP thread limit 3\n  ERROR - SubprocessOutputError\n'
tests/test_stdio.py:128: AssertionError
___________________________ test_bad_utf8[sandwich] ____________________________
[gw1] linux -- Python 3.8.0 /usr/bin/python3
spoof_tess_bad_utf8 = {'CONFIG_SITE': 'NONE', 'HISTCONTROL': 'ignoredups', 'HISTSIZE': '1000', 'HOME': '/builddir', ...}
renderer = 'sandwich'
resources = PosixPath('/builddir/build/BUILD/ocrmypdf-9.1.1/tests/resources')
no_outpdf = PosixPath('/tmp/pytest-of-mockbuild/pytest-0/popen-gw1/test_bad_utf8_sandwich_0/no_output.pdf')
    @pytest.mark.parametrize('renderer', ['hocr', 'sandwich'])
    def test_bad_utf8(spoof_tess_bad_utf8, renderer, resources, no_outpdf):
        p, out, err = run_ocrmypdf(
            resources / 'ccitt.pdf',
            no_outpdf,
            '--pdf-renderer',
            renderer,
            env=spoof_tess_bad_utf8,
        )
    
        assert out == '', "stdout not clean"
        assert p.returncode != 0
>       assert 'not utf-8' in err, "should whine about utf-8"
E       AssertionError: should whine about utf-8
E       assert 'not utf-8' in '   INFO - Using Tesseract OpenMP thread limit 3\n  ERROR - SubprocessOutputError\n'
tests/test_stdio.py:128: AssertionError

System:

OS: Fedora Rawhide
OCRmyPDF Version: 9.1.1

The text was updated successfully, but these errors were encountered:

jbarlow83 · 2019-11-24T09:40:29Z

Can't reproduce locally.

I think it may have to do with the logging state though. Would you mind trying this change?

Probably introduced by commit 979b0bc.

diff --git a/src/ocrmypdf/_jobcontext.py b/src/ocrmypdf/_jobcontext.py
index e50aa75..bfcc345 100644
--- a/src/ocrmypdf/_jobcontext.py
+++ b/src/ocrmypdf/_jobcontext.py
@@ -116,9 +116,9 @@ class LogNamePageAdapter(logging.LoggerAdapter):
 def make_logger(options=None, prefix='ocrmypdf', filename=None, page=None):
     log = logging.getLogger(prefix)
     if filename and page:
-        adapter = LogNamePageAdapter(log, dict(filename=filename, page=page))
+        adapter = LogNamePageAdapter(log, dict(input_filename=filename, page=page))
     elif filename:
-        adapter = LogNameAdapter(log, dict(filename=filename))
+        adapter = LogNameAdapter(log, dict(input_filename=filename))
     else:
         adapter = log
     return adapter
diff --git a/src/ocrmypdf/exec/tesseract.py b/src/ocrmypdf/exec/tesseract.py
index afa0e4a..0a9afa9 100644
--- a/src/ocrmypdf/exec/tesseract.py
+++ b/src/ocrmypdf/exec/tesseract.py
@@ -55,6 +55,7 @@ HOCR_TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
 
 class TesseractLoggerAdapter(logging.LoggerAdapter):
     def process(self, msg, kwargs):
+        kwargs['extra'] = self.extra
         return '[tesseract] %s' % (msg), kwargs

QuLogic · 2019-11-25T03:50:00Z

No, it doesn't seem to be that. If I revert db914d4, then it works, which I guess is a similar change since it's about logging too. It's a bit flaky and gets stuck sometimes, so I wonder if there's some underlying pytest or pytest-xdist problem. What versions are you running? What I have is:

platform linux -- Python 3.8.0, pytest-4.6.6, py-1.8.0, pluggy-0.12.0
rootdir: /builddir/build/BUILD/ocrmypdf-9.1.1, inifile: setup.cfg, testpaths: tests
plugins: helpers-namespace-2019.1.8, forked-1.1.1, cov-2.8.1, xdist-1.30.0

QuLogic · 2019-11-25T04:47:26Z

It works on Fedora 31 with:

platform linux -- Python 3.7.5, pytest-4.6.6, py-1.8.0, pluggy-0.12.0
rootdir: /builddir/build/BUILD/ocrmypdf-9.1.1, inifile: setup.cfg, testpaths: tests
plugins: helpers-namespace-2019.1.8, xdist-1.29.0, cov-2.7.1, forked-1.0.2

so it's probably one of the other changes, Python 3.8 or xdist 1.30.0.

jbarlow83 · 2019-11-25T09:17:31Z

Main system

platform darwin -- Python 3.7.5, pytest-5.2.2, py-1.8.0, pluggy-0.13.0
rootdir: /Users/jb/src/ocrmypdf, inifile: setup.cfg, testpaths: tests
plugins: xdist-1.30.0, helpers-namespace-2019.1.8, forked-1.1.3, cov-2.8.1

If you do pytest -nauto tests/test_stdio.py or pytest -nauto tests/test_stdio.py::test_bad_utf8, does it fail?

QuLogic · 2019-11-25T09:33:27Z

No, it works fine if just running those by themselves.

jbarlow83 · 2019-11-25T09:46:03Z

I am trying now it on a Docker image of Fedora Rawhide. 4 cores assigned.ocrmypdf 9.0.5 package installed. Used pip to upgrade to latest. Test suite gives:

platform linux -- Python 3.8.0b4, pytest-4.6.6, py-1.8.0, pluggy-0.13.1
rootdir: /root/OCRmyPDF, inifile: setup.cfg, testpaths: tests
plugins: cov-2.8.1, forked-1.1.3, helpers-namespace-2019.1.8, xdist-1.30.0

Still waiting for results. Docker can be slow.

I suppose, since it's in intermittent and likely an artifact of testing, I'm pretty confident you could safely disable the test for Fedora. The test ensures that we had a rare error condition (Tesseract's data files don't match the executable) correctly. This condition caused Tesseract to dump some binary data to stdout which in turn violated Python 3's UTF-8 assumptions. It's likely that an error in Tesseract and it's probably been fixed because they've done a lot of memory fixes lately.

I am still interested in learning what's causing this.

jbarlow83 · 2019-11-25T10:09:23Z

Finally reproduced the failure on Fedora:

If you don't have tesseract-osd installed, that may be related in some way. ocrmypdf depends on it. I got a few failures related to this.

Is there an equivalent to https://tracker.debian.org/pkg/ocrmypdf where I can see the Fedora build logs?

jbarlow83 · 2019-11-25T10:10:28Z

Not an exact reproduction though:

___________________________ test_bad_utf8[sandwich] ____________________________
[gw1] linux -- Python 3.8.0 /usr/bin/python3

spoof_tess_bad_utf8 = {'DISTTAG': 'f32container', 'FBR': 'f32', 'FGC': 'f32', 'HOME': '/root', ...}
renderer = 'sandwich', resources = PosixPath('/root/OCRmyPDF/tests/resources')
no_outpdf = PosixPath('/tmp/pytest-of-root/pytest-33/popen-gw1/test_bad_utf8_sandwich_0/no_output.pdf')

    @pytest.mark.parametrize('renderer', ['hocr', 'sandwich'])
    def test_bad_utf8(spoof_tess_bad_utf8, renderer, resources, no_outpdf):
        p, out, err = run_ocrmypdf(
            resources / 'ccitt.pdf',
            no_outpdf,
            '--pdf-renderer',
            renderer,
            env=spoof_tess_bad_utf8,
        )

        assert out == '', "stdout not clean"
        assert p.returncode != 0
>       assert 'not utf-8' in err, "should whine about utf-8"
E       AssertionError: should whine about utf-8
E       assert 'not utf-8' in 'Traceback (most recent call last):\n  File "/usr/lib64/python3.8/runpy.py", line 183, in _run_module_as_main\n    mod...error: symbol \'setMsgSeverity\' not found in library \'<None>\': /usr/bin/python3: undefined symbol: setMsgSeverity\n'

tests/test_stdio.py:128: AssertionError

QuLogic · 2019-11-27T08:11:37Z

Sure, Fedora builds are in koji. The Rawhide one got stuck though.

jbarlow83 · 2019-12-09T07:32:10Z

Although I haven't made it an official release yet, I am removing this test as of commit 9db01c7 along with the functionality it tested, if you want to cherry pick it. I consider the test no longer valid as I don't believe the Tesseract executable behaves in that way any more.

I also believe the issue you reported is an actually an issue with pytest-xdist on Python 3.8, since I found similar problems, again with Python 3.8 but nor previously. I reported that here in hopes of getting some ideals about how to diagnose it better:
pytest-dev/pytest-xdist#492

jbarlow83 · 2020-01-06T02:03:25Z

Decided test was not longer valid so removed in v9.2.0.

I think the parallelism issue could have been related to the use of @pytest.fixture(scope="session") in some instances. These are removed in the master branch as well.

QuLogic · 2020-01-06T07:13:15Z

Somehow, I managed to build 9.3.0 even though it doesn't work locally. I'm not sure what exactly is wrong, but I guess I won't have to worry about it later.

jbarlow83 closed this as completed Jan 6, 2020

jvavrek mentioned this issue Apr 5, 2022

Skip NNDC/XCOM tests if databases are down lbl-anp/becquerel#333

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test_bad_utf8 fails when run in parallel #456

test_bad_utf8 fails when run in parallel #456

QuLogic commented Nov 24, 2019

jbarlow83 commented Nov 24, 2019

QuLogic commented Nov 25, 2019

QuLogic commented Nov 25, 2019

jbarlow83 commented Nov 25, 2019

QuLogic commented Nov 25, 2019

jbarlow83 commented Nov 25, 2019

jbarlow83 commented Nov 25, 2019

jbarlow83 commented Nov 25, 2019

QuLogic commented Nov 27, 2019

jbarlow83 commented Dec 9, 2019

jbarlow83 commented Jan 6, 2020

QuLogic commented Jan 6, 2020

test_bad_utf8 fails when run in parallel #456

test_bad_utf8 fails when run in parallel #456

Comments

QuLogic commented Nov 24, 2019

jbarlow83 commented Nov 24, 2019

QuLogic commented Nov 25, 2019

QuLogic commented Nov 25, 2019

jbarlow83 commented Nov 25, 2019

QuLogic commented Nov 25, 2019

jbarlow83 commented Nov 25, 2019

jbarlow83 commented Nov 25, 2019

jbarlow83 commented Nov 25, 2019

QuLogic commented Nov 27, 2019

jbarlow83 commented Dec 9, 2019

jbarlow83 commented Jan 6, 2020

QuLogic commented Jan 6, 2020