Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

python test fails on little-endian 32bit archs: realloc(): invalid next size #1247

Closed
jonassmedegaard opened this issue Sep 7, 2021 · 23 comments

Comments

@jonassmedegaard
Copy link

Building link-grammar 5.10.0 on Debian fails on architectures armel, armhf, i386, mipsel, hurd-i386, kfreebsd-i386, and x32 - i.e. all little-endian 32-bit architectures if I am not mistaken.
Overview here: https://buildd.debian.org/status/package.php?p=link-grammar (but please ignore kfreebsd-amd64 failing for a different Debian-specific reason).
Excerpt from build log of armhf build:

Running by: /usr/bin/python3
Running ./tests.py in: /<<PKGBUILDDIR>>/bindings/python-examples
PYTHONPATH=./../python:../python:../python/.libs
srcdir=.
LINK_GRAMMAR_DATA=./../../data
Compiled with: gcc __VERSION__="10.3.0"  
OS: linux-gnueabihf __unix__ 
Standards: __STDC_VERSION__=201112L _POSIX_C_SOURCE=200809L 
Configuration (source code):
	CPPFLAGS=-Wdate-time -D_FORTIFY_SOURCE=2
	CFLAGS=-D_DEFAULT_SOURCE -std=c11 -D_BSD_SOURCE -D_SVID_SOURCE -D_GNU_SOURCE -D_ISOC11_SOURCE -g -ffile-prefix-map=/<<PKGBUILDDIR>>=. -fstack-protector-strong -Wformat -Werror=format-security -O3 -fvisibility=hidden
Configuration (features):
	DICTIONARY_DIR=/usr/share/link-grammar
	-DPACKAGE_NAME="link-grammar" -DPACKAGE_TARNAME="link-grammar" -DPACKAGE_VERSION="5.10.0" -DPACKAGE_STRING="link-grammar 5.10.0" -DPACKAGE_BUGREPORT="https://github.com/opencog/link-grammar" -DPACKAGE_URL="https://www.abisource.com/projects/link-grammar" -DPACKAGE="link-grammar" -DVERSION="5.10.0" -DYYTEXT_POINTER=1 -DHAVE_STDIO_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_STRINGS_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_UNISTD_H=1 -DHAVE_THREADS_H=1 -DSTDC_HEADERS=1 -DHAVE_DLFCN_H=1 -DLT_OBJDIR=".libs/" -DHAVE_STRNDUP=1 -DHAVE_STRTOK_R=1 -DHAVE_SIGACTION=1 -DHAVE_ALIGNED_ALLOC=1 -DHAVE_POSIX_MEMALIGN=1 -DHAVE_ALLOCA_H=1 -DHAVE_ALLOCA=1 -DHAVE_FORK=1 -DHAVE_VFORK=1 -DHAVE_WORKING_VFORK=1 -DHAVE_WORKING_FORK=1 -DHAVE_PRCTL=1 -D__STDC_FORMAT_MACROS=1 -D__STDC_LIMIT_MACROS=1 -DTLS=_Thread_local -DHAVE_PTHREAD_PRIO_INHERIT=1 -DHAVE_PTHREAD=1 -DHAVE_VISIBILITY=1 -DHAVE_LOCALE_T_IN_LOCALE_H=1 -DHAVE_STDATOMIC_H=1 -DUSE_WORDGRAPH_DISPLAY=1 -DHAVE_SQLITE3=1 -DHAVE_HUNSPELL=1 -DHUNSPELL_DICT_DIR="/usr/share/hunspell" -DHAVE_EDITLINE=1 -DHAVE_WIDECHAR_EDITLINE=1 -DHAVE_PCRE2_H=1 -DHAVE_MAYBE_UNINITIALIZED=1 -DHAVE_DECL_STRERROR_R=1 -DHAVE_STRERROR_R=1 -DSTRERROR_R_CHAR_P=1
Using <module 'linkgrammar' from '/<<PKGBUILDDIR>>/bindings/python/linkgrammar.py'>
Using <module 'clinkgrammar' from '/<<PKGBUILDDIR>>/bindings/python/clinkgrammar.py'>
Using <module '_clinkgrammar' from '/<<PKGBUILDDIR>>/bindings/python/.libs/_clinkgrammar.so'>
Using <module 'lg_testutils' from '/<<PKGBUILDDIR>>/bindings/python-examples/lg_testutils.py'>
............................................s...............s.s.......................s.s..realloc(): invalid next size
FAIL tests.py (exit status: 134)

Final message is same for mentioned architectures except i386-based ones.
On i386, hurd-i386, and x32 it is malloc(): invalid size (unsorted).
On kfreebsd-i386 it is *** Error in /usr/bin/python3': malloc(): memory corruption: 0x09361120 ***`.

@linas
Copy link
Member

linas commented Sep 7, 2021

Huh. I can't imagine. One problem is that we run the python tests before running the plain-C tests; I would like to know if the plain-C tests pass or not. If they passed, I would then point my finger at something in python and/or swig, which is generating the python bindings.

I guess that the way to test this is to build a container with a 32-bit distro in it.

BTW, just before spotting this issue, I spun a brand new version 5.10.1 to fix a break when the perl bindings are enabled (they should probably remain disabled, since most users should probably be using the Lingua::LinkParser bindings instead.)

@ampli
Copy link
Member

ampli commented Sep 7, 2021

I will check that ASAP.
(In the last time a similar thing happened on 32-bit CPUs, it was due to an uninitialized struct field that happened to contain zero on 64-bit due to different field sizes and the use of unions).

@jonassmedegaard
Copy link
Author

jonassmedegaard commented Sep 14, 2021

@linas not sure why you mention the 5.10.1 release in relation to this issue (seems unrelated to me), but in case it is helpful, that newer release seems to fails just the same. Details are at https://buildd.debian.org/status/package.php?p=link-grammar

@jonassmedegaard
Copy link
Author

Please do tell if you guys think this is an issue with the testsuite rather than with the link-grammar code itself - then I can simply ignore the failures on 32bit architectures for now.

@linas
Copy link
Member

linas commented Sep 14, 2021

OK, So ... I checked. Indeed, I see exactly the same crash. It is occurring in a test of some brand-new code. Stubbing out that test: YGenerationTestCase lines 1099 to 1111, here:

class YGenerationTestCase(unittest.TestCase):
"""
Generation mode tests.
"""
def test_getting_linkages_file_dict(self):
ParseOptions(test='generate')
linkages = Sentence((clg.WILDCARD_WORD + ' ') * 5, Dictionary(lang='lt'), ParseOptions()).parse()
self.assertTrue(len(linkages) > 0, "No linkages")
def test_getting_linkages_sql_dict(self):
ParseOptions(test='generate')
linkages = Sentence((clg.WILDCARD_WORD + ' ') * 4, Dictionary(lang='demo-sql'), ParseOptions()).parse()
self.assertTrue(len(linkages) > 0, "No linkages")
allows the unit test to pass. Basically, link-grammar works fine on 32-bit, except for something strange in some brand new code that no one except me uses.

The actual malloc crash is quite bizzare, as the arguments passed to it appear to be just fine. @ampli I'm stumped as to what the actual bug is.

@jonassmedegaard
Copy link
Author

jonassmedegaard commented Sep 14, 2021 via email

@linas
Copy link
Member

linas commented Sep 14, 2021

I will then simply skip those two tests when targeting 32bit archs.

Yes, please.

Meanwhile: @ampli, I unleashed valgrind on this. I used the suppressions file here: https://github.com/opencog/atomspace/blob/master/lib/valgrind.python.suppressions and then I did this:

PYTHONPATH=/usr/local/./lib/python3.9/site-packages/linkgrammar valgrind --suppressions=valgrind.python.suppressions python3 tests.py

I get the following output:

valgrind: m_mallocfree.c:303 (get_bszB_as_is): Assertion 'bszB_lo == bszB_hi' failed.
valgrind: Heap block lo/hi size mismatch: lo = 32, hi = 0.
This is probably caused by your program erroneously writing past the
end of a heap block and corrupting heap metadata.  If you fix any
invalid writes reported by Memcheck, this assertion failure will
probably go away.  Please try that before reporting this as a bug.


Thread 1: status = VgTs_Runnable (lwpid 16346)
==16346==    at 0x4038C47: realloc (vg_replace_malloc.c:834)
==16346==    by 0x52F633F: eliminate_duplicate_disjuncts (disjunct-utils.c:449)
==16346==    by 0x53011C2: prepare_to_parse (preparation.c:207)
==16346==    by 0x5300088: classic_parse (parse.c:373)
==16346==    by 0x52DB82C: sentence_parse (api.c:711)
==16346==    by 0x52B9F77: _wrap_sentence_parse (lg_python_wrap.cc:5855)
==16346==    by 0x81791C9: ??? (in /usr/bin/python3.9)
==16346==    by 0x8159E2E: _PyObject_MakeTpCall (in /usr/bin/python3.9)
==16346==    by 0x8153CF0: _PyEval_EvalFrameDefault (in /usr/bin/python3.9)
==16346==    by 0x8163FCF: _PyFunction_Vectorcall (in /usr/bin/python3.9)
==16346==    by 0x8171AFD: ??? (in /usr/bin/python3.9)
==16346==    by 0x8159D2B: _PyObject_MakeTpCall (in /usr/bin/python3.9)

Note the line number 449 is off, because I added some print statements: it should be line 444.

... and I found the bug. Dohhh. It's easy,

linas added a commit that referenced this issue Sep 14, 2021
Only triggers during generation. Fixes issue #1247
@linas
Copy link
Member

linas commented Sep 14, 2021

Fixed in commit d57493e

I will be spinning a version 5.10.2 if the Apple Homebrew folks verifiy the python install paths. (issue #1246) This new version will be very thin, just fixing this bug and the homebrew bug. Otherwise, shipping version 5.10.0 or 5.10.1 is just fine.

Closing, cause I think this is now resolved.

@linas linas closed this as completed Sep 14, 2021
@ampli
Copy link
Member

ampli commented Sep 15, 2021

Sorry for not responding in a timely manner.
This fix seems good.
Meanwhile, by looking at the code around the fix I found ways to make it slightly faster (added to my TODO list).

@jonassmedegaard
Copy link
Author

Sorry, but release 5.10.1 with commits ca8c0e0 and d57493e succeeds on i386 but still fails on armel and armhf (other archs still uncertain): https://buildd.debian.org/status/package.php?p=link-grammar

@jonassmedegaard
Copy link
Author

Compared to original report, code now succeeds on ia32 arches (i386, hurd-i386, x32 - unknown about kfreebsd-i386), but continues to fail on armel, armhf and mipsel.

@linas
Copy link
Member

linas commented Sep 15, 2021

When you say "succeeds" I assume you mean "succeeds when the python test is disabled" ... The armel, armhf and mipsel builds fail for exactly the same reason as the ia32 builds. (Any 32-bit system will fail the same way.)

I can spin a version 5.10.2 with the required fix later today, maybe that is the easiest path?

@jonassmedegaard
Copy link
Author

By "succeeds" I mean building release 5.10.1 with commits ca8c0e0 and d57493e completes without failing its full testsuite.
Here is the full build log for x32: https://buildd.debian.org/status/fetch.php?pkg=link-grammar&arch=x32&ver=5.10.1%7Edfsg-3&stamp=1631714789&raw=0

By "fails" I mean building release 5.10.1 with commits ca8c0e0 and d57493e does not complete without failing its testsuite.
Here is the full build log for armhf: https://buildd.debian.org/status/fetch.php?pkg=link-grammar&arch=mipsel&ver=5.10.1%7Edfsg-3&stamp=1631714920&raw=0

(Any 32-bit system will fail the same way.)

No: armel, armhf and mipsel seemingly fails the same as they did with 5.10.0 and with 5.10.1 with no additional commits applied, and same as ia32 failed with 5.10.0 and 5.10.1 with no additional commits applied - but not the same as identical build for ia32, because ia32 builds now succeeds.

@linas linas reopened this Sep 15, 2021
@linas
Copy link
Member

linas commented Sep 15, 2021

OK, so that is .. unexpected, and I cannot guess at the moment. Could you build with the following patch applied:

--- a/link-grammar/disjunct-utils.c
+++ b/link-grammar/disjunct-utils.c
@@ -441,6 +441,12 @@ Disjunct *eliminate_duplicate_disjuncts(Disjunct *dw, bool multi_string)
                                if (dx->num_categories == dx->num_categories_alloced - 1)
                                {
                                        dx->num_categories_alloced *= 2;
+printf("duude addr=%p struct-sz=%zu ncat = %u newsize=%zu\n",
+dx->category,
+sizeof(*(dx->category)),
+dx->num_categories_alloced,
+sizeof(*(dx->category)) * dx->num_categories_alloced);
+fflush(stdout);
                                        dx->category = realloc(dx->category,
                                           sizeof(*(dx->category)) * dx->num_categories_alloced);
                                }

Another possibility would be this; it assumes that sizeof() does not work as expected on these arches.

--- a/link-grammar/disjunct-utils.c
+++ b/link-grammar/disjunct-utils.c
@@ -441,8 +441,14 @@ Disjunct *eliminate_duplicate_disjuncts(Disjunct *dw, bool multi_string)
                                if (dx->num_categories == dx->num_categories_alloced - 1)
                                {
                                        dx->num_categories_alloced *= 2;
+printf("duude addr=%p struct-sz=%zu ncat = %u newsize=%zu\n",
+dx->category,
+sizeof(Category_cost),
+dx->num_categories_alloced,
+sizeof(Category_cost) * dx->num_categories_alloced);
+fflush(stdout);
                                        dx->category = realloc(dx->category,
-                                          sizeof(*(dx->category)) * dx->num_categories_alloced);
+                                          sizeof(Category_cost) * dx->num_categories_alloced);
                                }
                                dassert((d->category[0].num > 0) && (d->category[0].num < 64*1024),
                                        "Insane category %u", d->category[0].num);

@linas
Copy link
Member

linas commented Sep 15, 2021

Oops, I forget the fflush(stdout) it will fail to print without that. Editing above post now.

ampli added a commit to ampli/link-grammar that referenced this issue Sep 16, 2021
@ampli
Copy link
Member

ampli commented Sep 16, 2021

It was also a problem with the initial size of dx->category (for the same reason).
I tested it on armv7l virtual machine (armhf architecture).

linas added a commit that referenced this issue Sep 16, 2021
A additional fix for issue  #1247 ++
@linas
Copy link
Member

linas commented Sep 16, 2021

... And is fixed in #1250 .. I will spin a 5.10.2 shortly.

@linas linas closed this as completed Sep 16, 2021
@linas
Copy link
Member

linas commented Sep 16, 2021

... and version 5.10.2 is now available.

@jonassmedegaard
Copy link
Author

I can confirm that this issue no longer occurs: https://buildd.debian.org/status/logs.php?pkg=link-grammar&ver=5.10.2~dfsg-1

(another different but possibly related error occurs on powerpc - the failures on kfreebsd-amd64 and sh4 are Debian-specific).

@linas
Copy link
Member

linas commented Sep 16, 2021

I looked at the powerpc build log: https://buildd.debian.org/status/fetch.php?pkg=link-grammar&arch=powerpc&ver=5.10.2%7Edfsg-1&stamp=1631821070&raw=0 it says the multi-dict unit test failed. This is not python... it tests the use of multiple dictionaries in multiple threads. No clue what the problem may be.

I no longer have a powerpc login on which to test. One of the universities - OSU-USL - Oklahoma State University Unix Systems Labs - use to provide free logins for porting software. @ampli you have worked magic on stuff like this .. are you using QEMU to do that?

@linas linas reopened this Sep 16, 2021
@jonassmedegaard
Copy link
Author

Debian has hardware and I am happy to endorse requests to grant access to some of you as needed. Details of the procedure is at https://dsa.debian.org/doc/guest-account/ and general info on using the hardware is at https://wiki.debian.org/PorterBoxHowToUse

@ampli
Copy link
Member

ampli commented Sep 16, 2021

are you using QEMU to do that?

Yes. For the armhf architecture I used a Ubuntu cloud image with a virt-install command line.

@linas
Copy link
Member

linas commented Apr 25, 2024

Closing. I retested on i386 Debian 11 aka bullseye and all unit tests pass.

@linas linas closed this as completed Apr 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants