Segmentation fault (core dumped) #124
Hi, can you bpls the content of the file? How about adios 1.11? |
Hi. If "bpls" works ok with your file, then, can you try to use gdb and share the results?
Save your command as a script (e.g, test_adios.py) and run the following command:
$ gdb --args python test_adios.py
In gdb, type "run":
(gdb) run
If you can share your backtraces, it will be helpful to identify the location.
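After the crash, the "bt" command prints the backtrace ("bt full" also includes local variables):
(gdb) bt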
Also, which version of Adios and of the Adios python wrapper did you use? The following will give the Adios python version:
>>> import adios as ad
>>> ad.__version__
Thanks,
Jong
|
bpls:
$ bpls -latv record.bp
Segmentation fault (core dumped)
The bp file is produced by a code which uses adios 1.3.1, and my python code is using version 1.11.0. Do you think the difference between the versions has an effect on this as well? It's difficult to recompile my advisor's code with adios 1.11.0, while adios 1.3.1 does not support python... |
We have some old record.bp and pixie3d.bp files from 2010/11 and bpls works fine with them. Can you guys somehow upload your record.bp to ORNL so we can see what has changed?
|
Well, not all the record.bp files fail, only some of them. And I still can't figure out when. I am working with a LANL server called Grizzly, it is a new machine so maybe that's the reason. I don't know how to upload the file to ORNL, and I am trying to upload it to Dropbox. It's just 4GB, so I don't think it will take long. |
Is that a big endian or little endian machine?
|
I would say it's big. I usually use 4096 processors to run the code. |
I meant the byte order of the machine, big endian or little endian. What is the CPU?
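One quick way to check, for example, from Python:
$ python -c "import sys; print(sys.byteorder)"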
Anyway, if some files created on Grizzly work and some don't, then this does not matter; it would only matter if all files from Grizzly caused segfaults.
|
https://www.dropbox.com/s/yfbzhden7hlj98a/record.bp?dl=0
This is the dropbox link of the record.bp file. Thanks for your help here~! |
Can you please check that you give a big enough size in the adios_group_size() call? It should be greater than or equal to the number of bytes you actually write between adios_open() ... adios_close() in a process.
If that's the case, please tell me what the total_size value is (this is returned by adios_group_size()).
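For reference, a minimal sketch of that write sequence with the adios 1.x Fortran API; the group name, variable names, and sizes here are illustrative, not pixie2d's actual ones:
call adios_open (handle, "record", "record.bp", "w", comm, err)
! groupsize must count every byte written below, e.g. one 4-byte
! integer plus one double-precision nx*ny*nz array:
groupsize = 4 + 8*nx*ny*nz
call adios_group_size (handle, groupsize, totalsize, err)
call adios_write (handle, "nx", nx, err)
call adios_write (handle, "v1", v1, err)
call adios_close (handle, err)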
|
Thanks. Since reading the original file causes a segmentation fault, I copied and restarted this run. I asked for 32 processors to run. The relevant output is:
rank= 0 ADIOS group_size= 206164
rank= 0 ADIOS total size= 211574
ADIOS file write...
r0 offset=( 0, 0, 0)
r0 size=( 33, 65, 3)
r0 i= 0: 32, 0: 64, 0: 2
ADIOS file close...
rank= 1 ADIOS group_size= 199736
rank= 1 ADIOS total size= 205146
r1 offset=( 33, 0, 0)
r1 size=( 32, 65, 3)
r1 i= 1: 32, 0: 64, 0: 2
rank= 2 ADIOS group_size= 199736
rank= 2 ADIOS total size= 205146
r2 offset=( 65, 0, 0)
r2 size=( 32, 65, 3)
r2 i= 1: 32, 0: 64, 0: 2
(and some repetitive outputs)
And this is the relevant code in pixie2d:
call adios_group_size(handle,groupsize,totalsize,err)
if (err /= 0) then
  write (*,*) 'Problem in writeRecordFile'
  write (*,*) 'rank=',my_rank,' ERROR in "adios_group_size"'
  stop
endif
if (adios_debug) then
  write (*,*) 'rank=',my_rank,' ADIOS group_size=',groupsize
  write (*,*) 'rank=',my_rank,' ADIOS total size=',totalsize
endif
|
The index in the file you gave me is corrupt. I could figure out that you had 372736 blocks of data; that's 4096 processors times 91 output steps. But the pg_index that just enumerates those blocks becomes corrupt already at the 237th block.
I don't know whether the data itself is corrupt. 1.11's bprecover utility probably does not work exactly right on this old-version BP file, so I cannot rely on its report.
So the question is why your output file becomes corrupted on this machine.
Can you try using the POSIX method instead of MPI to produce one file per process? I don't even remember how that works in adios 1.3.1...
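In adios 1.x the method is selected in the XML config; a sketch, assuming the <method> element of the newer 1.x releases (the 1.3-era format used a different element name for this):
<adios-config host-language="Fortran">
  <adios-group name="record">
    <!-- var declarations as in the existing config -->
  </adios-group>
  <!-- switch from method="MPI" to method="POSIX" -->
  <method group="record" method="POSIX"/>
  <buffer max-size-MB="100" allocate-time="now"/>
</adios-config>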
BTW, could you read the data from your last rerun with 32 processes?
|
Can you send me the output of bprecover run on your 32-process run? I am curious what numbers we see there.
|
I did not save the bp file for the 32-process run. But I made a new run with the same input file, with 4096 processors and 30 minutes. There's not much data there yet, and the file can be read correctly. Here's the link: https://www.dropbox.com/s/z21y5s9worh6t56/record2.bp?dl=0 |
Although I can still read data with bpls from this file, the 1.3.1 version of bpdump already dies on it. The 1.11.0 version of bprecover tells me the same strange sizes I saw with the big one.
So what is the adios_group_size() report on this run? Is the total size 27050 for all processes, or is it 8054 for the first process and 7386 for the rest?
|
The latter:
|
So what are the implications of all this? How can I fix the problem? |
I think the only implication so far is that bprecover from 1.11 is not working at all on this old file (it was not intended to do that). So we have no idea what goes wrong during the writes, and therefore we/you cannot fix it.
We suggested to Luis that he come to ORNL so we can work together to add adios 1.11 to Pixie3D.
Do you see these segfaults depending on the size of the run (maybe the per-process variables are too small and cause some problem in the index, although they shouldn't, of course), or depending on the number of timesteps?
Maybe you could do one more run using the POSIX method instead of the MPI method and see whether you can read the global data, or at least the individual files, or whether you get the same segfaults.
|
BTW, I could not even compile 1.3.1 with recent compilers on my Mac without fixing a line here and there. As far as I know, Luis has been using a modified version of 1.3.1 where he fixed some bugs himself, but I don't know more details about those bugs. I wonder how that version works on new machines with new compilers.
|
The segmentation fault depends on the number of time steps, as far as I observed. But since you mentioned that the bp file is broken even for the small run, I don't know now... I'll contact Luis about this issue and see whether he can update the version of ADIOS. |
Hi! Luis has re-compiled pixie2d with ADIOS version 1.10. However, I still see the segmentation fault error. The following link is a record.bp file produced by the new code; please check whether you can find what caused these errors. Thanks! https://www.dropbox.com/s/qpk669c1lkh4f6j/record_Mar6.bp?dl=0 |
Hi, do you have time to take a look? Thanks! |
I was traveling, but now I have looked at it. bprecover took a long time to process this file, but it found that in the last step the blocks written by the processes (PGs, or Process Groups) were corrupt in the file. So something went wrong at that output step. Did you get any adios or system error messages in your application logs?
116 steps could be recovered, although it took a very long time to do that. The metadata is 460MB at that point, which is far from being a problem for adios unless some compute nodes run out of memory. Did your application die at this output step? What was the reason for it?
Look for a Process Group (PG) at offset 1386173555
PG reported size: 2892
Check if it looks like a PG...
Fortran flag char should be 'y' or 'n': 'y'
Group name length expected to be less than 64 characters: 6
Group name: "record"
PG 475280 found at offset 1386173555
Group Name: record
Host Language Fortran?: Y
Coordination Var Member ID: 0
Time Name: tidx
Time: 117
Methods used in output: 1
Method ID: 0
Method Parameters:
Vars Count: 14
Var Name (ID): nvar (5)
Var Name (ID): nxd+2 (7)
Var Name (ID): nyd+2 (9)
Var Name (ID): nzd+2 (11)
Var Name (ID): xoffset (13)
Var Name (ID): yoffset (14)
Var Name (ID): zoffset (15)
Var Name (ID): xsize (16)
Var Name (ID): ysize (17)
Var Name (ID): zsize (18)
Var Name (ID): v1 (21)
Var Name (ID): v2 (22)
Var Name (ID): v3 (23)
Var Name (ID): v4 (24)
Attributes Count: 0
Actual size of group by processing: 2892 bytes
========================================================
Look for a Process Group (PG) at offset 1386176447
PG reported size: 1258272568115204
=== Offset + PG reported size >> file size. This is not a (good) PG.
========================================================
Found 475280 PGs to be processable
Index metadata size is 460579051 bytes
Ready to write index to file offset 1386176447. Size is 460579051 bytes
Wrote index to file offset 1386176447. Size is 460579051 bytes
Truncate file to size 1846755498
|
Thank you very much! Luis has solved this issue. |
Hi,
I am really frustrated by the Segmentation fault (core dumped) error. It shows up when I try to read the .bp files produced by some large runs. Actually the file is not very large, just ~4GB, and I can successfully read some larger files, so I don't think it is a memory limitation. This is the code that I use (in python):
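A minimal sketch of this kind of read, assuming the adios 1.x Python bindings; the variable name "v1" is a placeholder, not necessarily one present in record.bp:
import adios as ad

# Opening the file parses the footer index; with the corrupt files the
# crash happens around this step or at the first read.
f = ad.file("record.bp")
print(f.var.keys())

v = f.var["v1"]   # placeholder variable name
data = v.read()   # read the whole variable into a numpy array
print(data.shape)

f.close()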