Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test_bad_utf8 fails when run in parallel #456

Closed
QuLogic opened this issue Nov 24, 2019 · 12 comments
Closed

test_bad_utf8 fails when run in parallel #456

QuLogic opened this issue Nov 24, 2019 · 12 comments

Comments

@QuLogic
Copy link
Contributor

QuLogic commented Nov 24, 2019

Describe the issue
I'm not sure what's going with it, but when I run tests in parallel (-n auto), test_bad_utf8 fails. When everything is run in serial, everything works fine. I guess there is some sort of global state somewhere that is modified by a parallel test that breaks things. Also, this seemed to work fine on 9.0.5, but I don't see much in the diff between it and 9.1.1 that would be indicative of anything.

To Reproduce
pytest -n auto

_____________________________ test_bad_utf8[hocr] ______________________________
[gw1] linux -- Python 3.8.0 /usr/bin/python3
spoof_tess_bad_utf8 = {'CONFIG_SITE': 'NONE', 'HISTCONTROL': 'ignoredups', 'HISTSIZE': '1000', 'HOME': '/builddir', ...}
renderer = 'hocr'
resources = PosixPath('/builddir/build/BUILD/ocrmypdf-9.1.1/tests/resources')
no_outpdf = PosixPath('/tmp/pytest-of-mockbuild/pytest-0/popen-gw1/test_bad_utf8_hocr_0/no_output.pdf')
    @pytest.mark.parametrize('renderer', ['hocr', 'sandwich'])
    def test_bad_utf8(spoof_tess_bad_utf8, renderer, resources, no_outpdf):
        p, out, err = run_ocrmypdf(       
            resources / 'ccitt.pdf',
            no_outpdf,                        
            '--pdf-renderer',
            renderer,                     
            env=spoof_tess_bad_utf8,                             
        )                                                        
    
        assert out == '', "stdout not clean"
        assert p.returncode != 0
>       assert 'not utf-8' in err, "should whine about utf-8"
E       AssertionError: should whine about utf-8
E       assert 'not utf-8' in '   INFO - Using Tesseract OpenMP thread limit 3\n  ERROR - SubprocessOutputError\n'
tests/test_stdio.py:128: AssertionError
___________________________ test_bad_utf8[sandwich] ____________________________
[gw1] linux -- Python 3.8.0 /usr/bin/python3
spoof_tess_bad_utf8 = {'CONFIG_SITE': 'NONE', 'HISTCONTROL': 'ignoredups', 'HISTSIZE': '1000', 'HOME': '/builddir', ...}
renderer = 'sandwich'
resources = PosixPath('/builddir/build/BUILD/ocrmypdf-9.1.1/tests/resources')
no_outpdf = PosixPath('/tmp/pytest-of-mockbuild/pytest-0/popen-gw1/test_bad_utf8_sandwich_0/no_output.pdf')
    @pytest.mark.parametrize('renderer', ['hocr', 'sandwich'])
    def test_bad_utf8(spoof_tess_bad_utf8, renderer, resources, no_outpdf):
        p, out, err = run_ocrmypdf(
            resources / 'ccitt.pdf',
            no_outpdf,
            '--pdf-renderer',
            renderer,
            env=spoof_tess_bad_utf8,
        )
    
        assert out == '', "stdout not clean"
        assert p.returncode != 0
>       assert 'not utf-8' in err, "should whine about utf-8"
E       AssertionError: should whine about utf-8
E       assert 'not utf-8' in '   INFO - Using Tesseract OpenMP thread limit 3\n  ERROR - SubprocessOutputError\n'
tests/test_stdio.py:128: AssertionError

System:

  • OS: Fedora Rawhide
  • OCRmyPDF Version: 9.1.1
@jbarlow83
Copy link
Collaborator

Can't reproduce locally.

I think it may have to do with the logging state though. Would you mind trying this change?

Probably introduced by commit 979b0bc.

diff --git a/src/ocrmypdf/_jobcontext.py b/src/ocrmypdf/_jobcontext.py
index e50aa75..bfcc345 100644
--- a/src/ocrmypdf/_jobcontext.py
+++ b/src/ocrmypdf/_jobcontext.py
@@ -116,9 +116,9 @@ class LogNamePageAdapter(logging.LoggerAdapter):
 def make_logger(options=None, prefix='ocrmypdf', filename=None, page=None):
     log = logging.getLogger(prefix)
     if filename and page:
-        adapter = LogNamePageAdapter(log, dict(filename=filename, page=page))
+        adapter = LogNamePageAdapter(log, dict(input_filename=filename, page=page))
     elif filename:
-        adapter = LogNameAdapter(log, dict(filename=filename))
+        adapter = LogNameAdapter(log, dict(input_filename=filename))
     else:
         adapter = log
     return adapter
diff --git a/src/ocrmypdf/exec/tesseract.py b/src/ocrmypdf/exec/tesseract.py
index afa0e4a..0a9afa9 100644
--- a/src/ocrmypdf/exec/tesseract.py
+++ b/src/ocrmypdf/exec/tesseract.py
@@ -55,6 +55,7 @@ HOCR_TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
 
 class TesseractLoggerAdapter(logging.LoggerAdapter):
     def process(self, msg, kwargs):
+        kwargs['extra'] = self.extra
         return '[tesseract] %s' % (msg), kwargs

@QuLogic
Copy link
Contributor Author

QuLogic commented Nov 25, 2019

No, it doesn't seem to be that. If I revert db914d4, then it works, which I guess is a similar change since it's about logging too. It's a bit flaky and gets stuck sometimes, so I wonder if there's some underlying pytest or pytest-xdist problem. What versions are you running? What I have is:

platform linux -- Python 3.8.0, pytest-4.6.6, py-1.8.0, pluggy-0.12.0
rootdir: /builddir/build/BUILD/ocrmypdf-9.1.1, inifile: setup.cfg, testpaths: tests
plugins: helpers-namespace-2019.1.8, forked-1.1.1, cov-2.8.1, xdist-1.30.0

@QuLogic
Copy link
Contributor Author

QuLogic commented Nov 25, 2019

It works on Fedora 31 with:

platform linux -- Python 3.7.5, pytest-4.6.6, py-1.8.0, pluggy-0.12.0
rootdir: /builddir/build/BUILD/ocrmypdf-9.1.1, inifile: setup.cfg, testpaths: tests
plugins: helpers-namespace-2019.1.8, xdist-1.29.0, cov-2.7.1, forked-1.0.2

so it's probably one of the other changes, Python 3.8 or xdist 1.30.0.

@jbarlow83
Copy link
Collaborator

Main system

platform darwin -- Python 3.7.5, pytest-5.2.2, py-1.8.0, pluggy-0.13.0
rootdir: /Users/jb/src/ocrmypdf, inifile: setup.cfg, testpaths: tests
plugins: xdist-1.30.0, helpers-namespace-2019.1.8, forked-1.1.3, cov-2.8.1

If you do pytest -nauto tests/test_stdio.py or pytest -nauto tests/test_stdio.py::test_bad_utf8, does it fail?

@QuLogic
Copy link
Contributor Author

QuLogic commented Nov 25, 2019

No, it works fine if just running those by themselves.

@jbarlow83
Copy link
Collaborator

I am trying now it on a Docker image of Fedora Rawhide. 4 cores assigned.ocrmypdf 9.0.5 package installed. Used pip to upgrade to latest. Test suite gives:

platform linux -- Python 3.8.0b4, pytest-4.6.6, py-1.8.0, pluggy-0.13.1
rootdir: /root/OCRmyPDF, inifile: setup.cfg, testpaths: tests
plugins: cov-2.8.1, forked-1.1.3, helpers-namespace-2019.1.8, xdist-1.30.0

Still waiting for results. Docker can be slow.

I suppose, since it's in intermittent and likely an artifact of testing, I'm pretty confident you could safely disable the test for Fedora. The test ensures that we had a rare error condition (Tesseract's data files don't match the executable) correctly. This condition caused Tesseract to dump some binary data to stdout which in turn violated Python 3's UTF-8 assumptions. It's likely that an error in Tesseract and it's probably been fixed because they've done a lot of memory fixes lately.

I am still interested in learning what's causing this.

@jbarlow83
Copy link
Collaborator

Finally reproduced the failure on Fedora:

If you don't have tesseract-osd installed, that may be related in some way. ocrmypdf depends on it. I got a few failures related to this.

Is there an equivalent to https://tracker.debian.org/pkg/ocrmypdf where I can see the Fedora build logs?

@jbarlow83
Copy link
Collaborator

Not an exact reproduction though:

___________________________ test_bad_utf8[sandwich] ____________________________
[gw1] linux -- Python 3.8.0 /usr/bin/python3

spoof_tess_bad_utf8 = {'DISTTAG': 'f32container', 'FBR': 'f32', 'FGC': 'f32', 'HOME': '/root', ...}
renderer = 'sandwich', resources = PosixPath('/root/OCRmyPDF/tests/resources')
no_outpdf = PosixPath('/tmp/pytest-of-root/pytest-33/popen-gw1/test_bad_utf8_sandwich_0/no_output.pdf')

    @pytest.mark.parametrize('renderer', ['hocr', 'sandwich'])
    def test_bad_utf8(spoof_tess_bad_utf8, renderer, resources, no_outpdf):
        p, out, err = run_ocrmypdf(
            resources / 'ccitt.pdf',
            no_outpdf,
            '--pdf-renderer',
            renderer,
            env=spoof_tess_bad_utf8,
        )

        assert out == '', "stdout not clean"
        assert p.returncode != 0
>       assert 'not utf-8' in err, "should whine about utf-8"
E       AssertionError: should whine about utf-8
E       assert 'not utf-8' in 'Traceback (most recent call last):\n  File "/usr/lib64/python3.8/runpy.py", line 183, in _run_module_as_main\n    mod...error: symbol \'setMsgSeverity\' not found in library \'<None>\': /usr/bin/python3: undefined symbol: setMsgSeverity\n'

tests/test_stdio.py:128: AssertionError

@QuLogic
Copy link
Contributor Author

QuLogic commented Nov 27, 2019

Sure, Fedora builds are in koji. The Rawhide one got stuck though.

@jbarlow83
Copy link
Collaborator

Although I haven't made it an official release yet, I am removing this test as of commit 9db01c7 along with the functionality it tested, if you want to cherry pick it. I consider the test no longer valid as I don't believe the Tesseract executable behaves in that way any more.

I also believe the issue you reported is an actually an issue with pytest-xdist on Python 3.8, since I found similar problems, again with Python 3.8 but nor previously. I reported that here in hopes of getting some ideals about how to diagnose it better:
pytest-dev/pytest-xdist#492

@jbarlow83
Copy link
Collaborator

Decided test was not longer valid so removed in v9.2.0.

I think the parallelism issue could have been related to the use of @pytest.fixture(scope="session") in some instances. These are removed in the master branch as well.

@QuLogic
Copy link
Contributor Author

QuLogic commented Jan 6, 2020

Somehow, I managed to build 9.3.0 even though it doesn't work locally. I'm not sure what exactly is wrong, but I guess I won't have to worry about it later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants