VR crashes when computing black hole subgrid masses #15

Closed
EvgeniiChaikin opened this issue Aug 18, 2020 · 11 comments

@EvgeniiChaikin

EvgeniiChaikin commented Aug 18, 2020

Hi all, recently I have been running VR on SWIFT output containing black hole particles with subgrid properties. VR crashes while it is computing the BH subgrid properties. The VR output with the crash looks as follows:

Opening group PartType0: Data set Coordinates
Opening group PartType1: Data set Coordinates
Opening group PartType4: Data set Coordinates
Opening group PartType5: Data set Coordinates
Opening group PartType0: Data set Velocities
Opening group PartType1: Data set Velocities
Opening group PartType4: Data set Velocities
Opening group PartType5: Data set Velocities
Opening group PartType0: Data set ParticleIDs
Opening group PartType1: Data set ParticleIDs
Opening group PartType4: Data set ParticleIDs
Opening group PartType5: Data set ParticleIDs
Opening group PartType0: Data set Masses
Opening group PartType1: Data set Masses
Opening group PartType4: Data set Masses
Opening group PartType5: Data set DynamicalMasses
Opening group PartType0: Data set InternalEnergies
Opening group PartType0: Data set StarFormationRates
Opening group PartType0: Data set MetalMassFractions
Opening group PartType4: Data set MetalMassFractions
Opening group PartType4: Data set BirthScaleFactors
Opening group PartType0: Data set ElementMassFractions
Opening group PartType0: Data set SpeciesFractions
Opening group PartType0: Data set SpeciesFractions
Opening group PartType0: Data set SpeciesFractions
Opening group PartType5: Data set SubgridMasses
HDF5-DIAG: Error detected in HDF5 (1.10.3) MPI-process 0:
  #000: H5S.c line 921 in H5Sget_simple_extent_ndims(): not a dataspace
    major: Invalid arguments to routine
    minor: Inappropriate type
terminate called after throwing an instance of 'std::length_error'
  what():  vector::_M_default_append
vel_rap_exact.sh: line 123: 185245 Aborted                 VR_ICRAR/VELOCIraptor-STF/build/stf -C vrconfig_3dfof_subhalos_SO_hydro.cfg -i /snap7/scratch/dp004/dc-chai1/my_cosmological_box/AGN_L006N188_00/colibre_2729 -o /snap7/scratch/dp004/dc-chai1/my_cosmological_box/AGN_L006N188_00/halo_2729 -I 2
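One plausible reading of the last lines of this log (an assumption, not something confirmed in the thread) is that the failed rank query returned a negative value and that value was then used as a vector size. `H5Sget_simple_extent_ndims()` reports failure with a negative return, and `vector::_M_default_append` is the libstdc++ routine behind `std::vector::resize`. The sketch below uses a hypothetical stand-in, `failed_rank_query()`, instead of the real HDF5 call, and shows how an unchecked negative rank turns into `std::length_error`:

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for an HDF5 rank query that failed. The real
// H5Sget_simple_extent_ndims() signals failure with a negative value.
int failed_rank_query() { return -1; }

// A rank is only usable as a size if the query succeeded.
bool rank_is_valid(int rank) { return rank >= 0; }

// Returns true if sizing a vector from the unchecked rank throws
// std::length_error, mirroring the abort seen in the log above.
bool unchecked_rank_throws() {
    std::vector<unsigned long long> dims;
    const int rank = failed_rank_query();
    assert(!rank_is_valid(rank)); // this demonstration relies on the failure path
    try {
        // -1 converts to the largest std::size_t, far beyond max_size().
        dims.resize(static_cast<std::size_t>(rank));
    } catch (const std::length_error &) {
        return true;
    }
    return false;
}
```

Calling `rank_is_valid()` before the resize would turn the abort into an error that can be reported, although it would not fix the underlying cause (an invalid dataspace ID).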

If I comment out the lines

BH_internal_property_names=SubgridMasses,SubgridMasses,SubgridMasses,SubgridMasses,
BH_internal_property_input_output_unit_conversion_factors=1.0e10,1.0e10,1.0e10,1.0e10,
BH_internal_property_calculation_type=max,min,average,aperture_total,
BH_internal_property_output_units=solar_mass,solar_mass,solar_mass,solar_mass,

in the VR config file, the bug disappears and VR runs smoothly.

  • I am using the latest (18.08.20) version of VR from the Master branch of ICRAR
  • I compiled VR as: cmake -DVR_USE_HYDRO=ON
  • On cosma7, I loaded the following modules gsl/2.4, intel_mpi/2018, cmake/3.18.1, intel_comp/2018, parallel_hdf5/1.10.3
  • On cosma, the VR config file I am using can be accessed via /cosma7/data/dp004/dc-chai1/vrconfig_3dfof_subhalos_SO_hydro.cfg
  • The SWIFT snapshot file I am running VR on resides at: /snap7/scratch/dp004/dc-chai1/my_cosmological_box/AGN_L006N188_00_iso/colibre_2729.hdf5
  • I double-checked that the snapshot contains the SubgridMasses field.
@rtobar

rtobar commented Aug 19, 2020

Hi,

This is just to confirm that we can also reproduce the error. We were on our way to double-check that the changes in #14 were correct by running VR offline against SWIFT and hit the exact same issue. This seems to be completely unrelated to #13 though.

@MatthieuSchaller

The problem seems to be that the dynamic list of open datasets is not used correctly when reading. Specifically, I think iextraoffset is not set correctly in the loop starting on line 2516 of hdfio.cxx. The values are not set the same way as in the first loop (line 2342), so the code does not read the correct dataset.
It "magically" works for gas extra properties as they are all at the start of the array.
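To illustrate the kind of mismatch described here, the following is a minimal sketch and not the actual hdfio.cxx code. It assumes a flat list of slots ordered gas, star, BH. All names (`ExtraFields`, `Kind`, `slot_index`, `open_all`, `read_slot`) are hypothetical. The point is that the "open" pass and the "read" pass take their index from one shared helper, so they cannot drift apart. It also shows why gas fields would keep working under a wrong offset: their slot index does not depend on any offset at all.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical container for the extra fields requested per particle type.
struct ExtraFields {
    std::vector<std::string> gas, star, bh;
};

enum class Kind { Gas, Star, BH };

std::size_t total_slots(const ExtraFields &f) {
    return f.gas.size() + f.star.size() + f.bh.size();
}

// Single source of truth for where field i of a given kind lives.
std::size_t slot_index(const ExtraFields &f, Kind kind, std::size_t i) {
    std::size_t index = i; // gas fields sit at the start: no offset needed
    if (kind == Kind::Star) index = f.gas.size() + i;
    if (kind == Kind::BH) index = f.gas.size() + f.star.size() + i;
    assert(index < total_slots(f)); // never hand out a slot past the end
    return index;
}

// "Open" pass: record one entry per requested field.
std::vector<std::string> open_all(const ExtraFields &f) {
    std::vector<std::string> slots(total_slots(f));
    for (std::size_t i = 0; i < f.gas.size(); ++i)
        slots[slot_index(f, Kind::Gas, i)] = "PartType0/" + f.gas[i];
    for (std::size_t i = 0; i < f.star.size(); ++i)
        slots[slot_index(f, Kind::Star, i)] = "PartType4/" + f.star[i];
    for (std::size_t i = 0; i < f.bh.size(); ++i)
        slots[slot_index(f, Kind::BH, i)] = "PartType5/" + f.bh[i];
    return slots;
}

// "Read" pass: look the entry up through the same helper.
const std::string &read_slot(const std::vector<std::string> &slots,
                             const ExtraFields &f, Kind kind, std::size_t i) {
    return slots.at(slot_index(f, kind, i));
}
```

With two gas fields and one BH field, `read_slot(slots, f, Kind::BH, 0)` resolves to the third slot, which is the entry the open pass stored for that BH field.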

A temporary very very dirty fix for the case of EAGLE/COLIBRE in SWIFT is to do this:

patch.txt

(don't do this at home!)

It forces the code to read the correct dataset. It will of course only work for the specific configuration file used by XL and friends.

@rtobar

rtobar commented Sep 1, 2020

@MatthieuSchaller just before leaving for our break we had also been looking around these lines with @cdplagos, trying to understand where the indexing goes wrong. Of course the patch above is fragile, so a more robust solution (which we'll try to get done this week) is needed.

@MatthieuSchaller

Glad we all think the issue is around these lines. The patch is definitely not for public consumption but I listed it here for our colleagues in EAGLE using exactly the same configuration file.

If it helps with the debugging, I noticed that you also run into trouble if you add extra star properties. Only the gas seems to be OK.

@rtobar

rtobar commented Sep 11, 2020

An update on this: I have identified the problem that is causing this crash. As we had guessed, it was an indexing issue, although at a different point in the code. In #15 (comment) @MatthieuSchaller correctly indicates that the indexing into the arrays is different between the first loop (which opens datasets and dataspaces) and the second one (where they are actually read). However, the first loop indexes past the size of the arrays (writing outside the vector's allocated memory), while the second indexes within the array boundaries. This suggests that the correct indexing scheme is the second one, so I fixed the first rather than adjusting the second.

After this, VR runs almost to completion, until a new access into these arrays in a different part of the code again produces a crash (when using -O2 or higher, maybe also with -O1). I'm currently looking at this and I hope I'll have something working soon.
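Writing past the end of a `std::vector` with `operator[]` is undefined behaviour, which fits a crash that only shows up at some optimisation levels. As a debugging aid, and only as a sketch with hypothetical names (`store_id_checked`, `count_rejected_writes`), bounds-checked access via `std::vector::at()` turns such a write into a `std::out_of_range` exception that can be counted or logged:

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Stores an ID at the given slot with bounds checking. Unlike operator[],
// vector::at() throws std::out_of_range instead of writing past the end.
bool store_id_checked(std::vector<long> &ids, std::size_t index, long id) {
    try {
        ids.at(index) = id;
        return true;
    } catch (const std::out_of_range &) {
        return false;
    }
}

// Fills a vector sized for exactly `count` IDs while adding `extra_offset`
// to every index, and returns how many writes fell outside the vector.
std::size_t count_rejected_writes(std::size_t count, std::size_t extra_offset) {
    std::vector<long> ids(count);
    std::size_t rejected = 0;
    for (std::size_t i = 0; i < count; ++i) {
        const long id = static_cast<long>(100 + i); // placeholder ID value
        if (!store_id_checked(ids, i + extra_offset, id))
            ++rejected;
    }
    assert(rejected <= count); // at most one rejection per attempted write
    return rejected;
}
```

With a zero offset every write lands inside the vector. With a spurious offset of 3 on a vector of 4 entries, the indices become 3, 4, 5 and 6, so three of the four writes are rejected rather than corrupting memory.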

@MatthieuSchaller

Thanks for the update.

Maybe adding extra star properties to the config file can also help identify problems.

rtobar added a commit that referenced this issue Sep 14, 2020
When HDF5 datasets and dataspaces for the extra properties indicated by
the user (gas/bh/star) are opened, their IDs are stored in two
corresponding vectors that are previously sized to hold exactly the
number of extra properties that are required. However, the code
incorrectly added some offsets to the indices used when storing the IDs
in the vectors, writing past the vectors' end.

This commit fixes the calculation of the indices, removing the
additional offsets added when certain conditions are not met, yielding
index values within the vectors' size boundaries.

Note that when these IDs were accessed later on, the indices used for
that access operation were correctly calculated. This gave a wrong
initial impression that *those* indices were wrong, while in reality it
was the "opening" indices that were wrong.

This should address the problem reported in #15.

Signed-off-by: Rodrigo Tobar <rtobar@icrar.org>
@rtobar

rtobar commented Sep 14, 2020

This problem has been fixed now, and the fix is on the latest master.

However, note that, as outlined in #15 (comment), a further problem now appears down the road that prevents VR from finishing successfully. I could at least confirm that this newly found problem is not caused by the fixes introduced for this issue, and as such I'll be opening a new issue to keep track of it.

@rtobar

rtobar commented Sep 14, 2020

The new issue described in #18 (and first mentioned in #15 (comment)) is now also fixed (via #19).

@MatthieuSchaller

Oh excellent news! Thanks for tackling these.

@MatthieuSchaller

I can confirm that things work correctly.

@rtobar

rtobar commented Sep 17, 2020

Great, I'm closing this issue then.

rtobar closed this as completed Sep 17, 2020