Segmentation fault (core dumped) #124
Hi, can you bpls the content of the file? How about adios 1.11? |
Hi. If "bpls" works ok with your file, then, can you try to use gdb and share the results?
Save your command as a script (e.g, test_adios.py) and run the following command:
$ gdb --args python test_adios.py
In gdb, type "run":
(gdb) run
If you can share your backtraces, it will be helpful to identify the location.
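After the crash, the "bt" command prints the backtrace ("bt full" also includes local variables):
(gdb) bt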
Also, which version of Adios and of the Adios python wrapper did you use? The following will give the Adios python version:
>>> import adios as ad
>>> ad.__version__
Thanks,
Jong
|
bpls:
$ bpls -latv record.bp
Segmentation fault (core dumped)
The bp file is produced by a code which uses adios 1.3.1, and my python code is using version 1.11.0. Do you think the difference between the versions has an effect on this as well? It's difficult to recompile my advisor's code with adios 1.11.0, while adios 1.3.1 does not support python... |
We have some old record.bp and pixie3d.bp files from 2010/11 and bpls works fine with them. Can you guys somehow upload your record.bp to ORNL so we can see what has changed?
|
Well, not all the record.bp files fail, only some of them. And I still can't figure out when. I am working with a LANL server called Grizzly, it is a new machine so maybe that's the reason. I don't know how to upload the file to ORNL, and I am trying to upload it to Dropbox. It's just 4GB, so I don't think it will take long. |
Is that a big endian or little endian machine?
|
I would say it's big. I usually use 4096 processors to run the code. |
I meant the byte order of the machine, big endian or little endian. What is the CPU?
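One quick way to check, for example, from Python:
$ python -c "import sys; print(sys.byteorder)"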
Anyway, if some files created on Grizzly work and some don't, then this does not matter; it would only matter if all files from Grizzly caused segfaults.
|
https://www.dropbox.com/s/yfbzhden7hlj98a/record.bp?dl=0
This is the dropbox link of the record.bp file. Thanks for your help here~! |
Can you please check that you give a big enough size in the adios_group_size() call? It should be greater than or equal to the number of bytes you actually write between adios_open() ... adios_close() in a process.
If that's the case, please tell me what the total_size value is (this is returned by adios_group_size()).
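For reference, a minimal sketch of that write sequence with the adios 1.x Fortran API; the group name, variable names, and sizes here are illustrative, not pixie2d's actual ones:
call adios_open (handle, "record", "record.bp", "w", comm, err)
! groupsize must count every byte written below, e.g. one 4-byte
! integer plus one double-precision nx*ny*nz array:
groupsize = 4 + 8*nx*ny*nz
call adios_group_size (handle, groupsize, totalsize, err)
call adios_write (handle, "nx", nx, err)
call adios_write (handle, "v1", v1, err)
call adios_close (handle, err)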
|
Thanks. Since reading the original file causes a segmentation fault, I copied and restarted this run. I asked for 32 processors to run. The relevant output is:
rank= 0 ADIOS group_size= 206164
rank= 0 ADIOS total size= 211574
ADIOS file write...
r0 offset=( 0, 0, 0)
r0 size=( 33, 65, 3)
r0 i= 0: 32, 0: 64, 0: 2
ADIOS file close...
rank= 1 ADIOS group_size= 199736
rank= 1 ADIOS total size= 205146
r1 offset=( 33, 0, 0)
r1 size=( 32, 65, 3)
r1 i= 1: 32, 0: 64, 0: 2
rank= 2 ADIOS group_size= 199736
rank= 2 ADIOS total size= 205146
r2 offset=( 65, 0, 0)
r2 size=( 32, 65, 3)
r2 i= 1: 32, 0: 64, 0: 2
(and some repetitive outputs)
And this is the relevant code in pixie2d:
call adios_group_size(handle,groupsize,totalsize,err)
if (err /= 0) then
  write (*,*) 'Problem in writeRecordFile'
  write (*,*) 'rank=',my_rank,' ERROR in "adios_group_size"'
  stop
endif
if (adios_debug) then
  write (*,*) 'rank=',my_rank,' ADIOS group_size=',groupsize
  write (*,*) 'rank=',my_rank,' ADIOS total size=',totalsize
endif
|
The index in the file you gave me is corrupt. I could figure out that you had 372736 blocks of data; that's 4096 processors times 91 output steps. But the pg_index that just enumerates those blocks becomes corrupt already at the 237th block.
I don't know whether the data itself is corrupt. 1.11's bprecover utility probably does not work exactly right on this old-version BP file, so I cannot rely on its report.
So the question is why your output file becomes corrupted on this machine.
Can you try using the POSIX method instead of MPI to produce one file per process? I don't even remember how that works in adios 1.3.1...
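In adios 1.x the method is selected in the XML config; a sketch, assuming the <method> element of the newer 1.x releases (the 1.3-era format used a different element name for this):
<adios-config host-language="Fortran">
  <adios-group name="record">
    <!-- var declarations as in the existing config -->
  </adios-group>
  <!-- switch from method="MPI" to method="POSIX" -->
  <method group="record" method="POSIX"/>
  <buffer max-size-MB="100" allocate-time="now"/>
</adios-config>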
BTW, could you read the data from your last rerun with 32 processes?
|
Can you send me the output of bprecover run on your 32-process run? I am curious what numbers we see there.
|
I did not save the bp file for the 32-process run. But I made a new run with the same input file, with 4096 processors and 30 minutes. There's not much data there yet, and the file can be read correctly. Here's the link: https://www.dropbox.com/s/z21y5s9worh6t56/record2.bp?dl=0 |
Although I can still read data with bpls from this file, the 1.3.1 version of bpdump already dies on it. The 1.11.0 version of bprecover tells me the same strange sizes I saw with the big one.
So what is the adios_group_size() report on this run? Is the total size 27050 for all processes, or is it 8054 for the first process and 7386 for the rest?
|
The latter:
|
So what are the implications of all this? How can I fix the problem? |
I think the only implication so far is that bprecover from 1.11 is not working at all on this old file (it was not intended to do that). So we have no idea what goes wrong during the writes, and therefore we/you cannot fix it.
We suggested to Luis that he come to ORNL so we can work together to add adios 1.11 to Pixie3D.
Do you see these segfaults depending on the size of the run (maybe the per-process variables are too small and cause some problem in the index, although they shouldn't, of course), or depending on the number of timesteps?
Maybe you could do one more run using the POSIX method instead of the MPI method and see whether you can read the global data, or at least the individual files, or whether you get the same segfaults.
|
BTW, I could not even compile 1.3.1 with recent compilers on my Mac without fixing a line here and there. As far as I know, Luis has been using a modified version of 1.3.1 where he fixed some bugs himself, but I don't know more details about those bugs. I wonder how that version works on new machines with new compilers.
|
The segmentation fault depends on the number of time steps, as far as I observed. But since you mentioned that the bp file is broken even for the small run, I don't know now... I'll contact Luis about this issue and see whether he can update the version of ADIOS. |
Hi! Luis has re-compiled pixie2d with ADIOS version 1.10. However, I still see the segmentation fault error. The following link is a record.bp file produced by the new code; please check whether you can find what caused these errors. Thanks! https://www.dropbox.com/s/qpk669c1lkh4f6j/record_Mar6.bp?dl=0 |
Hi, do you have time to take a look? Thanks! |
I was traveling, but now I have looked at it. bprecover took a long time to process this file, but it found that in the last step the blocks written by the processes (PGs, or Process Groups) were corrupt in the file. So something went wrong at that output step. Did you get any adios or system error messages in your application logs?
116 steps could be recovered, although it took a very long time to do that. The metadata is 460MB at that point, which is far from being a problem for adios unless some compute nodes run out of memory. Did your application die at this output step? What was the reason for it?
Look for a Process Group (PG) at offset 1386173555
PG reported size: 2892
Check if it looks like a PG...
Fortran flag char should be 'y' or 'n': 'y'
Group name length expected to be less than 64 characters: 6
Group name: "record"
PG 475280 found at offset 1386173555
Group Name: record
Host Language Fortran?: Y
Coordination Var Member ID: 0
Time Name: tidx
Time: 117
Methods used in output: 1
Method ID: 0
Method Parameters:
Vars Count: 14
Var Name (ID): nvar (5)
Var Name (ID): nxd+2 (7)
Var Name (ID): nyd+2 (9)
Var Name (ID): nzd+2 (11)
Var Name (ID): xoffset (13)
Var Name (ID): yoffset (14)
Var Name (ID): zoffset (15)
Var Name (ID): xsize (16)
Var Name (ID): ysize (17)
Var Name (ID): zsize (18)
Var Name (ID): v1 (21)
Var Name (ID): v2 (22)
Var Name (ID): v3 (23)
Var Name (ID): v4 (24)
Attributes Count: 0
Actual size of group by processing: 2892 bytes
========================================================
Look for a Process Group (PG) at offset 1386176447
PG reported size: 1258272568115204
=== Offset + PG reported size >> file size. This is not a (good) PG.
========================================================
Found 475280 PGs to be processable
Index metadata size is 460579051 bytes
Ready to write index to file offset 1386176447. Size is 460579051 bytes
Wrote index to file offset 1386176447. Size is 460579051 bytes
Truncate file to size 1846755498
|
Thank you very much! Luis has solved this issue. |
Hi,
I am really frustrated by the Segmentation fault (core dumped) error. It shows up when I try to read the .bp files produced by some large runs. Actually the file is not very large, just ~4GB, and I can successfully read some larger files, so I don't think it is a memory limitation. This is the code that I use (in python):
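A minimal sketch of this kind of read, assuming the adios 1.x Python bindings; the variable name "v1" is a placeholder, not necessarily one present in record.bp:
import adios as ad

# Opening the file parses the footer index; with the corrupt files the
# crash happens around this step or at the first read.
f = ad.file("record.bp")
print(f.var.keys())

v = f.var["v1"]   # placeholder variable name
data = v.read()   # read the whole variable into a numpy array
print(data.shape)

f.close()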