NumPy wrapper: sliced reading from file crashes #161

Closed
n01r opened this issue Feb 14, 2018 · 6 comments · Fixed by #162

n01r commented Feb 14, 2018

Hey there,
I encountered a segmentation fault when reading a slice from a one-dimensional AdiosVar array whenever the slice contains a certain index. The index differs from file to file (I checked several).
Each value can be read on its own, but reading more than one element across that index crashes.

The ADIOS 1.13.0 installation I'm using for post-processing was built with the same flags as the one I used to create the data, just without MPI. I built the NumPy wrapper from source using python setup.py.
I am using blosc with zstd for compression, full parameters: threshold=2048,shuffle=bit,lvl=1,threads=10,compressor=zstd.

$ adios_config -s
DIR=/users/<USER>/lib/adios-1.13.0_nompi
CFLAGS=-I/users/<USER>/lib/adios-1.13.0_nompi/include -D_NOMPI -DZLIB -I/users/<USER>/lib/zlib-1.2.11/include -DBLOSC -I/users/<USER>/lib/blosc-1.12.1/include -I/users/<USER>/lib/blosc-1.12.1/include
LDFLAGS=-L/users/<USER>/lib/adios-1.13.0_nompi/lib -ladios_nompi -L/users/<USER>/lib/zlib-1.2.11/lib64 -L/users/<USER>/lib/blosc-1.12.1/lib -libverbs -lz -lblosc
Available write methods (in XML <method> element or in adios_select_method()):
    "POSIX"
Available read methods (constants after #include "adios_read.h"):
    ADIOS_READ_METHOD_BP (=0)
Available data transformation methods (in XML transform tags in <var> elements):
    "none"	: No data transform
    "identity"	: Identity transform
    "zlib"	: zlib compression
    "zfp"	: zfp compression
    "blosc"	: blosc compression
Available query methods (in adios_query_set_method()):
    ADIOS_QUERY_METHOD_MINMAX (=0)

This is what I observe:

In [1]: import numpy as np
In [2]: import adios as ad
In [3]: path = "014_0060gpus2DCopper30nmLeadingEdge1E-3/simOutput/bp/simData_87040.bp"
In [4]: f = ad.File(path)
In [5]: px = f['/data/87040/particles/H_all/momentum/x']
In [6]: px[3740]
Out[6]: 0.04298079386353493
In [7]: px[3741]
Out[7]: 0.08291889727115631
In [8]: px[3742]
Out[8]: -0.13517844676971436
In [9]: px[3740:3741]
Out[9]: array([ 0.04298079], dtype=float32)
In [10]: px[3741:3742]
Out[10]: array([ 0.0829189], dtype=float32)
In [11]: px[3740:3742]
Segmentation fault (core dumped)

A run with gdb shows

(gdb) run test.py 
Starting program: /users/<USER>/lib/anaconda3/envs/analyzePIConGPU/bin/python test.py
Missing separate debuginfos, use: zypper install glibc-debuginfo-2.22-61.3.x86_64
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
0x00002aaab0977f54 in adios_transform_blosc_pg_reqgroup_completed () from /users/<USER>/lib/anaconda3/envs/analyzePIConGPU/lib/python3.6/site-packages/adios/adios.cpython-36m-x86_64-linux-gnu.so
Missing separate debuginfos, use: zypper install libibverbs1-debuginfo-1.2.0-17.1.x86_64 libnl3-200-debuginfo-3.2.23-2.21.x86_64
(gdb) backtrace
#0  0x00002aaab0977f54 in adios_transform_blosc_pg_reqgroup_completed ()
   from /users/<USER>/lib/anaconda3/envs/analyzePIConGPU/lib/python3.6/site-packages/adios/adios.cpython-36m-x86_64-linux-gnu.so
#1  0x00002aaab09732b7 in adios_transform_pg_reqgroup_completed () from /users/<USER>/lib/anaconda3/envs/analyzePIConGPU/lib/python3.6/site-packages/adios/adios.cpython-36m-x86_64-linux-gnu.so
#2  0x00002aaab0972aba in adios_transform_process_all_reads () from /users/<USER>/lib/anaconda3/envs/analyzePIConGPU/lib/python3.6/site-packages/adios/adios.cpython-36m-x86_64-linux-gnu.so
#3  0x00002aaab0944b24 in common_read_perform_reads () from /users/<USER>/lib/anaconda3/envs/analyzePIConGPU/lib/python3.6/site-packages/adios/adios.cpython-36m-x86_64-linux-gnu.so
#4  0x00002aaab0937157 in adios_perform_reads () from /users/<USER>/lib/anaconda3/envs/analyzePIConGPU/lib/python3.6/site-packages/adios/adios.cpython-36m-x86_64-linux-gnu.so
#5  0x00002aaab08b321d in __pyx_f_5adios_3var_read (__pyx_v_self=0x2aaab0bf9688, __pyx_skip_dispatch=<optimized out>, __pyx_optional_args=<optimized out>) at adios.cpp:24203
#6  0x00002aaab089ac1e in __pyx_pf_5adios_3var_12read (__pyx_v_step_scalar=<optimized out>, __pyx_v_fill=<optimized out>, __pyx_v_nsteps=<optimized out>, __pyx_v_from_steps=<optimized out>, 
    __pyx_v_scalar=<optimized out>, __pyx_v_count=<optimized out>, __pyx_v_offset=<optimized out>, __pyx_v_self=0x2aaab0bf9688) at adios.cpp:24495
#7  __pyx_pw_5adios_3var_13read (__pyx_v_self=0x2aaab0bf9688, __pyx_args=<optimized out>, __pyx_kwds=0x2aaaabc60288) at adios.cpp:24461
#8  0x0000555555660364 in _PyCFunction_FastCallDict ()
#9  0x000055555568ef30 in _PyCFunction_FastCallKeywords ()
#10 0x00005555556f2ebc in call_function ()
#11 0x00005555557153e7 in _PyEval_EvalFrameDefault ()
#12 0x00005555556ed8d9 in PyEval_EvalCodeEx ()
#13 0x00005555556ee67c in PyEval_EvalCode ()
#14 0x0000555555768ce4 in run_mod ()
#15 0x00005555557690e1 in PyRun_FileExFlags ()
#16 0x00005555557692e4 in PyRun_SimpleFileExFlags ()
#17 0x000055555576cdaf in Py_Main ()
#18 0x00005555556338be in main ()

What could I specifically look into?

Contributor

ax3l commented Feb 14, 2018

The data set that we read is a 1D array written by several process groups.
At the offset of concern, a process group wrote zero entries. This is an issue we encountered (& fixed) before, e.g. with zlib transforms.

Author

n01r commented Feb 14, 2018

The numpy wrapper version is:

import adios as ad
ad.__version__
'1.13.0'

The blockinfo from the file shows

In[9]: px.blockinfo
Out[9]:
[[AdiosBlockinfo (process_id=0, time_index=1, start=(0,), count=(19,)),
  AdiosBlockinfo (process_id=1, time_index=1, start=(19,), count=(45,)),
  AdiosBlockinfo (process_id=2, time_index=1, start=(64,), count=(1477,)),
  AdiosBlockinfo (process_id=3, time_index=1, start=(1541,), count=(61,)),
  AdiosBlockinfo (process_id=4, time_index=1, start=(1602,), count=(82,)),
  AdiosBlockinfo (process_id=5, time_index=1, start=(1684,), count=(1154,)),
  AdiosBlockinfo (process_id=6, time_index=1, start=(2838,), count=(22,)),
  AdiosBlockinfo (process_id=7, time_index=1, start=(2860,), count=(46,)),
  AdiosBlockinfo (process_id=8, time_index=1, start=(2906,), count=(570,)),
  AdiosBlockinfo (process_id=9, time_index=1, start=(3476,), count=(18,)),
  AdiosBlockinfo (process_id=10, time_index=1, start=(3494,), count=(18,)),
  AdiosBlockinfo (process_id=11, time_index=1, start=(3512,), count=(198,)),
  AdiosBlockinfo (process_id=12, time_index=1, start=(3710,), count=(16,)),
  AdiosBlockinfo (process_id=13, time_index=1, start=(3726,), count=(4,)),
  AdiosBlockinfo (process_id=14, time_index=1, start=(3730,), count=(2,)),
  AdiosBlockinfo (process_id=15, time_index=1, start=(3732,), count=(4,)),
  AdiosBlockinfo (process_id=16, time_index=1, start=(3736,), count=(5,)),
  AdiosBlockinfo (process_id=17, time_index=1, start=(3741,), count=(0,)),
  AdiosBlockinfo (process_id=18, time_index=1, start=(3741,), count=(2,)),
  AdiosBlockinfo (process_id=19, time_index=1, start=(3743,), count=(1,)),
  AdiosBlockinfo (process_id=20, time_index=1, start=(3744,), count=(1,)),
  AdiosBlockinfo (process_id=21, time_index=1, start=(3745,), count=(0,)),
  AdiosBlockinfo (process_id=22, time_index=1, start=(3745,), count=(2,)),
  AdiosBlockinfo (process_id=23, time_index=1, start=(3747,), count=(2,)),
  AdiosBlockinfo (process_id=24, time_index=1, start=(3749,), count=(1,)),
  AdiosBlockinfo (process_id=25, time_index=1, start=(3750,), count=(1,)),
  AdiosBlockinfo (process_id=26, time_index=1, start=(3751,), count=(2,)),
  AdiosBlockinfo (process_id=27, time_index=1, start=(3753,), count=(0,)),
  AdiosBlockinfo (process_id=28, time_index=1, start=(3753,), count=(0,)),
  AdiosBlockinfo (process_id=29, time_index=1, start=(3753,), count=(2,)),
  AdiosBlockinfo (process_id=30, time_index=1, start=(3755,), count=(0,)),
  AdiosBlockinfo (process_id=31, time_index=1, start=(3755,), count=(1,)),
  AdiosBlockinfo (process_id=32, time_index=1, start=(3756,), count=(1,)),
  AdiosBlockinfo (process_id=33, time_index=1, start=(3757,), count=(0,)),
  AdiosBlockinfo (process_id=34, time_index=1, start=(3757,), count=(2,)),
  AdiosBlockinfo (process_id=35, time_index=1, start=(3759,), count=(1,)),
  AdiosBlockinfo (process_id=36, time_index=1, start=(3760,), count=(2,)),
  AdiosBlockinfo (process_id=37, time_index=1, start=(3762,), count=(6,)),
  AdiosBlockinfo (process_id=38, time_index=1, start=(3768,), count=(3,)),
  AdiosBlockinfo (process_id=39, time_index=1, start=(3771,), count=(3,)),
  AdiosBlockinfo (process_id=40, time_index=1, start=(3774,), count=(3,)),
  AdiosBlockinfo (process_id=41, time_index=1, start=(3777,), count=(0,)),
  AdiosBlockinfo (process_id=42, time_index=1, start=(3777,), count=(7,)),
  AdiosBlockinfo (process_id=43, time_index=1, start=(3784,), count=(3,)),
  AdiosBlockinfo (process_id=44, time_index=1, start=(3787,), count=(0,)),
  AdiosBlockinfo (process_id=45, time_index=1, start=(3787,), count=(10,)),
  AdiosBlockinfo (process_id=46, time_index=1, start=(3797,), count=(15,)),
  AdiosBlockinfo (process_id=47, time_index=1, start=(3812,), count=(0,)),
  AdiosBlockinfo (process_id=48, time_index=1, start=(3812,), count=(7,)),
  AdiosBlockinfo (process_id=49, time_index=1, start=(3819,), count=(21,)),
  AdiosBlockinfo (process_id=50, time_index=1, start=(3840,), count=(171,)),
  AdiosBlockinfo (process_id=51, time_index=1, start=(4011,), count=(7,)),
  AdiosBlockinfo (process_id=52, time_index=1, start=(4018,), count=(50,)),
  AdiosBlockinfo (process_id=53, time_index=1, start=(4068,), count=(593,)),
  AdiosBlockinfo (process_id=54, time_index=1, start=(4661,), count=(34,)),
  AdiosBlockinfo (process_id=55, time_index=1, start=(4695,), count=(48,)),
  AdiosBlockinfo (process_id=56, time_index=1, start=(4743,), count=(1264,)),
  AdiosBlockinfo (process_id=57, time_index=1, start=(6007,), count=(16,)),
  AdiosBlockinfo (process_id=58, time_index=1, start=(6023,), count=(28,)),
  AdiosBlockinfo (process_id=59, time_index=1, start=(6051,), count=(1537,))]]
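The listing above can be checked against the zero-entry explanation with a few lines of Python. This is only a rough sketch: `Block` is a hypothetical stand-in for the wrapper's `AdiosBlockinfo` (the repr suggests `start` and `count` are available as tuples, but that is an assumption), and the "strictly inside the range" rule is just a guess that happens to match the three reads shown in the report, not ADIOS's actual selection logic.

```python
from collections import namedtuple

# Hypothetical stand-in for AdiosBlockinfo; only the fields used here.
Block = namedtuple("Block", ["process_id", "start", "count"])


def empty_blocks(blocks):
    """Return the blocks whose process group contributed zero elements."""
    return [b for b in blocks if b.count[0] == 0]


def empty_blocks_inside(blocks, lo, hi):
    """Return empty blocks whose offset lies strictly inside the
    half-open index range [lo, hi), i.e. lo < start < hi.

    Assumption: this rule is inferred from the observed reads only.
    """
    return [b for b in empty_blocks(blocks) if lo < b.start[0] < hi]


# Excerpt of the listing above (process groups 16 to 19).
blocks = [
    Block(16, (3736,), (5,)),
    Block(17, (3741,), (0,)),
    Block(18, (3741,), (2,)),
    Block(19, (3743,), (1,)),
]

# Process groups that wrote nothing.
print([b.process_id for b in empty_blocks(blocks)])
# Empty blocks that the crashing slice px[3740:3742] reaches across.
print([b.process_id for b in empty_blocks_inside(blocks, 3740, 3742)])
# The two single-element slices that worked.
print([b.process_id for b in empty_blocks_inside(blocks, 3740, 3741)])
print([b.process_id for b in empty_blocks_inside(blocks, 3741, 3742)])
```

With the real file, something like `px.blockinfo[0]` (the listing is nested one level, presumably per step) could probably be passed in place of `blocks`, provided the entries expose `start` and `count` as attributes.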

Contributor

pnorbert commented Feb 14, 2018 via email

@pnorbert
Contributor

First question: I cannot even write a zero-length block with zlib transformation into the output because the write segfaults. How do you produce the file? Do you turn off zlib for the zero blocks?

Contributor

ax3l commented Feb 14, 2018

Hi @pnorbert,

@psychocoderHPC just found the root of the issue and will provide a fix in a few minutes. Affects about half of the transforms: blosc, zlib, bzip2, lz4.

Writing zero-length blocks with transformations has been possible for a long time (I think we fixed that together in 1.10 or so) and is an important use case for unstructured, domain-decomposed data. We were writing with blosc (not zlib), where we skip compression on zero-size input in the write transform. Maybe the zlib transform has a bug if it does not do the same, but I seem to remember it worked for us in the past.

Or maybe it's just a misunderstanding of what we do: our overall variable is not zero-sized, it's just individual process groups that contribute zero in parallel writes.

It's just a missing meta-data check on read that is causing the crash right now: see #162
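To illustrate the idea in this comment (skip compression for zero-size input on write, and check the recorded size on read before touching the payload), here is a small Python analogy using the standard `zlib` module. It is not the ADIOS C code and not the patch in #162; the dictionary layout and the names `write_transform` and `read_transform` are made up for this sketch.

```python
import zlib


def write_transform(raw: bytes) -> dict:
    """Compress one block; store no payload at all for zero-size input."""
    if len(raw) == 0:
        # Nothing to compress: record only the metadata.
        return {"orig_size": 0, "payload": None}
    return {"orig_size": len(raw), "payload": zlib.compress(raw)}


def read_transform(block: dict) -> bytes:
    """Decompress one block, checking the metadata first.

    Without the size check, zlib.decompress(None) would raise a
    TypeError, the Python counterpart of dereferencing a null pointer.
    """
    if block["orig_size"] == 0:
        return b""
    return zlib.decompress(block["payload"])


# Three "process groups", the middle one contributing nothing.
blocks = [
    write_transform(b"abc" * 10),
    write_transform(b""),
    write_transform(b"xyz"),
]

# Reading across all blocks now works even though one has no payload.
data = b"".join(read_transform(b) for b in blocks)
print(len(data))
```

The point of the sketch is only that the size check has to come before any use of the payload; how the real transforms store and test that metadata is not shown in this thread.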

psychocoderHPC added a commit to psychocoderHPC/ADIOS that referenced this issue Feb 14, 2018
fix ornladios#161

Avoid null pointer dereferencing during the read of transformed data of size zero
  - bzip2
  - lz4
  - szip
  - zlib
@psychocoderHPC
Contributor

@pnorbert The zero length write with zlib is fixed in #165
