
visual test -j randomly running out of memory (linux) #3339

Closed
lightmare opened this issue Mar 2, 2016 · 16 comments

@lightmare
Contributor

This has happened to me a few times already: the test kept allocating memory until it was killed. So today I decided to stress it.

First, a little function I use to kill the process when it exceeds 2 GB -- because if I let it run wild, it starts swapping and freezes the whole system.

# Poll the RSS (in KiB) of the matching process every second and send SIGQUIT
# (which produces a core dump) once it grows past ~2 GB.
watchrun () {
    while sleep 1; do
        ps --no-headers -o pid,rss "$@"
    done |
    while read pid rss; do
        test $rss -lt 2000000 && continue   # rss is reported in KiB
        echo "killing pid=$pid because rss=$rss"
        kill -QUIT $pid
    done
}

watchrun -C visual-test-run

In another terminal

# don't want to accidentally kill another process named 'run'
ln -s test/visual/run visual-test-run
# want core dump when killed
ulimit -c unlimited
while ! test -f core; do
    ./visual-test-run --jobs=20 --agg $(for x in {1..20}; do echo tiff-alpha-gdal; done)
done

Edit: changed the command to run 20 jobs with the same style; it fails more reliably that way.

@lightmare
Contributor Author

Roughly half of the threads in the core dump were in:

Thread 6 (Thread 0x7f9ea57fa700 (LWP 24097)):
#0  __memcpy_sse2 () at ../sysdeps/x86_64/multiarch/../memcpy.S:443
#1  0x00007f9ebe9328ac in _int_realloc (av=0x7f9e7c000020, oldp=0x7f9e7c898640, 
    oldsize=609760, nb=609776) at malloc.c:4316
#2  0x00007f9ebe933fc9 in __GI___libc_realloc (oldmem=0x7f9e7c898650, bytes=609768)
    at malloc.c:3029
#3  0x00007f9eb9eb4ddc in ?? () from /usr/lib/libgdal.so.1
#4  0x00007f9eb9ebbbb4 in ?? () from /usr/lib/libgdal.so.1
#5  0x00007f9eb9f1f945 in ?? () from /usr/lib/libgdal.so.1
#6  0x00007f9eb9ebacb9 in ?? () from /usr/lib/libgdal.so.1
#7  0x00007f9eb9ec1803 in ?? () from /usr/lib/libgdal.so.1
#8  0x00007f9eb9ee3c78 in ?? () from /usr/lib/libgdal.so.1
#9  0x00007f9eb9ef0964 in ?? () from /usr/lib/libgdal.so.1
#10 0x00007f9eb9d34ead in ?? () from /usr/lib/libgdal.so.1
#11 0x00007f9eb9f43868 in ?? () from /usr/lib/libgdal.so.1
#12 0x00007f9eb9f43a98 in ?? () from /usr/lib/libgdal.so.1
#13 0x00007f9eae54c22c in gdal_datasource::gdal_datasource (this=0x7f9e7c009850, params=...)
    at plugins/input/gdal/gdal_datasource.cpp:87
#14 0x00007f9eae54b752 in create (params=...) at plugins/input/gdal/gdal_datasource.cpp:38
#15 0x00007f9ec220fe12 in mapnik::datasource_cache::create (
    this=0x7f9ec34693c0 <mapnik::CreateStatic<mapnik::datasource_cache>::create()::static_memory>, params=...) at src/datasource_cache.cpp:126
#16 0x00007f9ec238e3d2 in mapnik::map_parser::parse_layer (this=0x7f9ea57f8df0, map=..., 
    node=...) at src/load_map.cpp:768

@lightmare
Contributor Author

I actually have a branch from two weeks ago called simpleton, where I replaced the atrocious singleton<> template base with a static instance. I'll test with that later.
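
Just to sketch the idea (names hypothetical, this is not the actual simpleton branch): instead of deriving from the singleton<> template, the single instance can be exposed through a function-local static, which C++11 guarantees is initialized exactly once even under concurrent first use.

class plugin_cache
{
public:
    static plugin_cache & instance()
    {
        static plugin_cache cache;   // thread-safe one-time initialization ("magic statics")
        return cache;
    }

    plugin_cache(plugin_cache const &) = delete;
    plugin_cache & operator=(plugin_cache const &) = delete;

private:
    plugin_cache() = default;
};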

@springmeyer
Member

This might be related to the conclusion at #3093 that we still need to fix: we are not using the GDAL register function correctly (it needs to be called once, rather than once per datasource).
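
Something along these lines, presumably -- just a sketch, assuming only that GDALAllRegister() from gdal.h is the registration entry point (the wrapper name here is made up):

#include <gdal.h>
#include <mutex>

// Register all GDAL drivers exactly once, no matter how many datasources
// are created concurrently.
void ensure_gdal_registered()
{
    static std::once_flag flag;
    std::call_once(flag, []() { GDALAllRegister(); });
}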

@lightmare
Contributor Author

I see, so it's not in the cache singleton creation (I misinterpreted the backtrace), but further down, in datasource creation.

@lightmare
Contributor Author

Update: it's not (only) GDALAllRegister(). After guarding it with

static const bool once = (GDALAllRegister(), true);

I'm still able to reproduce. There doesn't seem to be an Ubuntu/Debian package with GDAL debug symbols, so I'm somewhat lost once it gets into GDALOpen() (most threads are there, called from plugins/input/gdal/gdal_datasource.cpp:93 and ending up in __GI___libc_realloc). One thread was in __GI___libc_malloc, called from mapnik::detail::buffer::buffer, and another was in boost::gil::detail::homogeneous_color_base, called from mapnik::filter::apply_filter.

I'm not even sure the issue lies within the GDAL plugin. Because memory allocation needs to be synchronized, if something higher up the stack goes wrong and starts creating datasources like crazy, it's to be expected that when I kill the process, the threads will be waiting on a memory lock. But the original problem may be elsewhere.

@lightmare
Contributor Author

(gdb) info threads
  Id   Target Id         Frame 
  18   Thread 0x7f07e0ff9700 (LWP 9806) _int_realloc (av=0x7f0798000020, oldp=0x7f0798c6dd90, oldsize=856672, nb=856688) at malloc.c:4268
  17   Thread 0x7f07e2ffd700 (LWP 9802) sysmalloc (av=0x7f07a4000020, nb=207664) at malloc.c:2715
  16   Thread 0x7f07d9c2a700 (LWP 9810) 0x00007f07fcb34a27 in mprotect () at ../sysdeps/unix/syscall-template.S:81
  15   Thread 0x7f07da42b700 (LWP 9809) 0x00007f07f804a209 in ?? () from /usr/lib/libgdal.so.1
  14   Thread 0x7f07db7fe700 (LWP 9808) sysmalloc (av=0x7f0790000020, nb=225856) at malloc.c:2715
  13   Thread 0x7f07d9429700 (LWP 9811) _int_realloc (av=0x7f077c000020, oldp=0x7f077ce2c450, oldsize=969632, nb=969664) at malloc.c:4268
  12   Thread 0x7f07e8c28700 (LWP 9799) _int_realloc (av=0x7f07c0000020, oldp=0x7f07c1029df0, oldsize=1098224, nb=1098256) at malloc.c:4268
  11   Thread 0x7f07ebc2e700 (LWP 9792) sysmalloc (av=0x7f07dc000020, nb=1081840) at malloc.c:2715
  10   Thread 0x7f07e3fff700 (LWP 9800) 0x00007f07f804a478 in ?? () from /usr/lib/libgdal.so.1
  9    Thread 0x7f07e27fc700 (LWP 9803) __memcpy_sse2 () at ../sysdeps/x86_64/multiarch/../memcpy.S:272
  8    Thread 0x7f07eac2c700 (LWP 9795) _int_realloc (av=0x7f07c4000020, oldp=0x7f07c5143070, oldsize=1171328, nb=1171360) at malloc.c:4268
  7    Thread 0x7f07e17fa700 (LWP 9805) _int_realloc (av=0x7f0794000020, oldp=0x7f079502a260, oldsize=1109392, nb=1109408) at malloc.c:4268
  6    Thread 0x7f07e1ffb700 (LWP 9804) _int_realloc (av=0x7f07a0000020, oldp=0x7f07a11442e0, oldsize=1187072, nb=1187104) at malloc.c:4268
  5    Thread 0x7f07eb42d700 (LWP 9793) _int_realloc (av=0x7f07e4000020, oldp=0x7f07e4c6ea80, oldsize=877936, nb=877952) at malloc.c:4268
  4    Thread 0x7f07dac2c700 (LWP 9794) 
    0x00007f0800cc5314 in boost::gil::detail::homogeneous_color_base<unsigned char, boost::gil::layout<boost::mpl::vector4<boost::gil::red_t, boost::gil::green_t, boost::gil::blue_t, boost::gil::alpha_t>, boost::mpl::range_c<int, 0, 4> >, 4>::at (this=0x7f07cc0a11b8) at /usr/include/boost/gil/color_base.hpp:204
  3    Thread 0x7f07ea42b700 (LWP 9796) __memcpy_sse2 () at ../sysdeps/x86_64/multiarch/../memcpy.S:272
  2    Thread 0x7f07dbfff700 (LWP 9807) _int_realloc (av=0x7f078c000020, oldp=0x7f078d143170, oldsize=1224320, nb=1224352) at malloc.c:4268
* 1    Thread 0x7f08017de900 (LWP 9791) 0x00007f07fce0e66b in pthread_join (threadid=139671996917504, thread_return=0x0) at pthread_join.c:92

@lightmare
Contributor Author

(gdb) thread apply 2 bt

Thread 2 (Thread 0x7f07dbfff700 (LWP 9807)):
#0  _int_realloc (av=0x7f078c000020, oldp=0x7f078d143170, oldsize=1224320, nb=1224352) at malloc.c:4268
#1  0x00007f07fcac2fc9 in __GI___libc_realloc (oldmem=0x7f078d143180, bytes=1224336) at malloc.c:3029
#2  0x00007f07f8043ddc in ?? () from /usr/lib/libgdal.so.1
#3  0x00007f07f804abb4 in ?? () from /usr/lib/libgdal.so.1
#4  0x00007f07f80ae945 in ?? () from /usr/lib/libgdal.so.1
#5  0x00007f07f8049cb9 in ?? () from /usr/lib/libgdal.so.1
#6  0x00007f07f8050803 in ?? () from /usr/lib/libgdal.so.1
#7  0x00007f07f8072c78 in ?? () from /usr/lib/libgdal.so.1
#8  0x00007f07f807f964 in ?? () from /usr/lib/libgdal.so.1
#9  0x00007f07f7ec3ead in ?? () from /usr/lib/libgdal.so.1
#10 0x00007f07f80d2868 in ?? () from /usr/lib/libgdal.so.1
#11 0x00007f07f80d2a98 in ?? () from /usr/lib/libgdal.so.1
#12 0x00007f07ec6db332 in gdal_datasource::gdal_datasource (this=0x7f078c0015d0, params=...) at plugins/input/gdal/gdal_datasource.cpp:93
#13 0x00007f07ec6da852 in create (params=...) at plugins/input/gdal/gdal_datasource.cpp:38
#14 0x00007f080039ee12 in mapnik::datasource_cache::create (this=0x7f08015f83c0 <mapnik::CreateStatic<mapnik::datasource_cache>::create()::static_memory>, params=...)
    at src/datasource_cache.cpp:126
#15 0x00007f080051d3d2 in mapnik::map_parser::parse_layer (this=0x7f07dbffddf0, map=..., node=...) at src/load_map.cpp:768
#16 0x00007f0800518933 in mapnik::map_parser::parse_map_include (this=0x7f07dbffddf0, map=..., node=...) at src/load_map.cpp:352
Python Exception <class 'gdb.error'> There is no member named _M_dataplus.: 
#17 0x00007f080051814e in mapnik::map_parser::parse_map (this=0x7f07dbffddf0, map=..., node=..., base_path=) at src/load_map.cpp:322
Python Exception <class 'gdb.error'> There is no member named _M_dataplus.: 
#18 0x00007f08005164e0 in mapnik::load_map (map=..., filename=, strict=true, base_path=<incomplete type>) at src/load_map.cpp:154
#19 0x000000000040f7ad in visual_tests::runner::test_one (this=0x7fff2fe1b688, style_path=..., report=..., fail_count=...) at test/visual/runner.cpp:260
#20 0x000000000040f330 in visual_tests::runner::test_range (this=0x7fff2fe1b688, begin=..., end=..., report=..., fail_count=...) at test/visual/runner.cpp:226

@lightmare
Contributor Author

Now this is weird:

#14 in mapnik::datasource_cache::create ... at src/datasource_cache.cpp:126
#15 in mapnik::map_parser::parse_layer ... at src/load_map.cpp:768

load_map.cpp:768 is the final closing brace of parse_layer, whereas datasource_cache::create is called at L745 😖

@lightmare
Contributor Author

There's a race condition in GDAL's libgeotiff function XTIFFInitialize.

Funny that after I found it, I googled the function and found we're not the first to get hit: https://sourceforge.net/p/freeimage/discussion/36109/thread/9e5ba9b7/
If only I had known some magic word that would have led to that page earlier 🀄
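
To illustrate the kind of bug this is -- the sketch below is not the actual libgeotiff source, just the classic unsynchronized "check a flag, then initialize" pattern, plus the sort of caller-side guard that would serialize the first use:

#include <mutex>

// Stand-in for the real registration work done by XTIFFInitialize.
void register_tag_extensions() {}

static bool initialized = false;

void lazy_initialize_unsafe()
{
    if (initialized) return;    // two threads can both read 'false' here...
    initialized = true;
    register_tag_extensions();  // ...and both end up registering concurrently
}

// Caller-side guard that serializes the very first use.
void lazy_initialize_safe()
{
    static std::once_flag flag;
    std::call_once(flag, register_tag_extensions);
}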

@lightmare
Contributor Author

The question is, how should I call XTIFFInitialize?

$ rgrep -F XTIFFInitialize **/*.h
frmts/gtiff/libgeotiff/gdal_libgeotiff_symbol_rename.h +92 #define XTIFFInitialize gdal_XTIFFInitialize
frmts/gtiff/libgeotiff/xtiffio.h +68 extern void CPL_DLL XTIFFInitialize();

$ nm -CD /usr/lib/libgdal.so.1|grep XTIFF
00000000004aca10 T gdal_XTIFFClientOpen
00000000004aca90 T gdal_XTIFFClose
00000000004ac9e0 T gdal_XTIFFFdOpen
00000000004ac970 T gdal_XTIFFInitialize
00000000004ac9b0 T gdal_XTIFFOpen

libgdal exports it, but xtiffio.h is not installed with it, so I'd have to declare the function myself. Will that work with an external libgeotiff?
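
For the record, a workaround on the mapnik side could look roughly like the sketch below. The declaration is exactly the open question: an external libgeotiff exports XTIFFInitialize, while GDAL's bundled copy renames it to gdal_XTIFFInitialize (per the rename header above), so this only shows the shape, not a drop-in fix.

#include <mutex>

// Declared by hand because xtiffio.h is not installed; whether this name or
// gdal_XTIFFInitialize is the symbol actually exported depends on the build.
extern "C" void XTIFFInitialize();

void ensure_geotiff_initialized()
{
    static std::once_flag flag;
    std::call_once(flag, []() { XTIFFInitialize(); });
}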

@lightmare
Contributor Author

The issue is fixed in GDAL 2.0.2.

@springmeyer
Member

@lightmare - can you try testing again after #3395, just to confirm things still work? As for the race condition - how about we just refer people to GDAL 2.0.2 when/if they hit this?

@springmeyer springmeyer added this to the Mapnik 3.0.12 milestone Apr 6, 2016
@lightmare
Contributor Author

585de59 is still running out of memory with the above loop. This has nothing to do with GDALAllRegister; the race condition is in libgeotiff initialization, which happens later, when the first GeoTIFF is loaded.

Better to refer to 2.0.2 preventively - i.e. using the gdal plugin in a multithreaded program requires libgdal 2.0.2+. That's quite new; the upcoming Ubuntu LTS has 1.11.3 at the moment: http://packages.ubuntu.com/xenial/libgdal-dev
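
If we go the preventive route, a build-time check could be as simple as this sketch (assuming GDAL_VERSION_NUM and GDAL_COMPUTE_VERSION from gdal_version.h, which should be available since GDAL 1.10):

#include <gdal_version.h>

// Purely advisory: warn when building the gdal plugin against a libgdal that
// still has the XTIFFInitialize race; single-threaded use remains fine.
#if GDAL_VERSION_NUM < GDAL_COMPUTE_VERSION(2, 0, 2)
#warning "GDAL < 2.0.2: concurrent GeoTIFF opens may race in XTIFFInitialize"
#endif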

@springmeyer
Member

Confirming: running

while ! test -f core; do ./test/visual/run --jobs=20 --agg $(for x in {1..20}; do echo tiff-alpha-gdal; done); done

I don't see a crash with GDAL 2.0.2, but I do see one occasionally with GDAL 1.11.2.

@artemp
Member

artemp commented Apr 6, 2016

I'm on 2.1.0, per gdal-config --version.

@springmeyer
Member

@lightmare per your comment, I added the GDAL minimum recommended version in 2a46e45. Not seeing any other actions needed here. Closing. Thanks, everyone!
