# Data Engineering

The data used to train the neural network were taken from the Cosmology and Astrophysics with MachinE Learning Simulations (CAMELS) project. CAMELS contains 6325 cosmological simulations organized in different suites:

- magnetohydrodynamic:
    - IllustrisTNG
    - SIMBA
    - Astrid
- N-body

Each suite is split into four different simulation sets:

- LH (Latin hypercube): 1000 simulations in which the values of the cosmological and astrophysical parameters are arranged in a Latin hypercube. The initial conditions also differ from simulation to simulation.
- 1P (1-parameter at a time): 61 simulations in which the cosmological and astrophysical parameters are varied one at a time. The initial conditions are the same for every simulation.
- CV (Cosmic Variance): 27 simulations in which the cosmological and astrophysical parameters are fixed but the initial conditions vary.
- EX (Extreme): 4 simulations with fixed cosmological parameters but different astrophysical parameters.

CAMELS varies two cosmological and four astrophysical parameters:

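- cosmological:
    - $\Omega_m$: matter density parameter, the fraction of the present-day energy density of the universe in matter. Range of variation: $0.1 \leq \Omega_m \leq 0.5$
    - $\sigma_8$: amplitude of the linear matter density fluctuations on a scale of $8\,h^{-1}\,\mathrm{Mpc}$. Range of variation: $0.6 \leq \sigma_8 \leq 1.0$

- astrophysical:
    - $A_{SN1}$: supernova feedback parameter (energy of the galactic winds). Range of variation: $0.25 \leq A_{SN1} \leq 4.0$
    - $A_{SN2}$: supernova feedback parameter (speed of the galactic winds). Range of variation: $0.50 \leq A_{SN2} \leq 2.0$
    - $A_{AGN1}$: AGN feedback parameter (energy of the black hole feedback). Range of variation: $0.25 \leq A_{AGN1} \leq 4.0$
    - $A_{AGN2}$: AGN feedback parameter (ejection speed or burstiness of the black hole feedback). Range of variation: $0.50 \leq A_{AGN2} \leq 2.0$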
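
To illustrate what "arranged in a Latin hypercube" means for the LH set, the following is a minimal NumPy sketch that draws 1000 points over the six parameter ranges listed above. The function name `latin_hypercube`, the dictionary keys, the seed, and the linear rescaling of each parameter are assumptions made for this illustration; the sampling scheme actually used by CAMELS may differ.

```python
import numpy as np

# Parameter ranges quoted above: (lower, upper) for each of the six parameters.
PARAM_RANGES = {
    "Omega_m": (0.1, 0.5),
    "sigma_8": (0.6, 1.0),
    "A_SN1": (0.25, 4.0),
    "A_SN2": (0.5, 2.0),
    "A_AGN1": (0.25, 4.0),
    "A_AGN2": (0.5, 2.0),
}

def latin_hypercube(n_samples, ranges, seed=0):
    """Draw n_samples points so that, in every dimension, each of the
    n_samples equal-width strata contains exactly one point."""
    rng = np.random.default_rng(seed)
    lows = np.array([lo for lo, _ in ranges.values()])
    highs = np.array([hi for _, hi in ranges.values()])
    n_dim = len(ranges)

    # One jittered point per stratum of the unit interval, for every dimension
    unit = (np.arange(n_samples)[:, None] + rng.random((n_samples, n_dim))) / n_samples
    # Shuffle the strata independently in each dimension
    for d in range(n_dim):
        unit[:, d] = rng.permutation(unit[:, d])

    # Linear rescaling to the physical ranges (an assumption of this sketch)
    return lows + unit * (highs - lows)

samples = latin_hypercube(1000, PARAM_RANGES)
print(samples.shape)  # (1000, 6)
```

Compared with independent uniform draws, this construction guarantees that the full range of every parameter is covered evenly, even with a modest number of simulations.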

Each hydrodynamic simulation evolves 256³ dark matter particles and 256³ gas resolution elements within a periodic comoving volume of (25 h⁻¹ Mpc)³ from redshift z = 127 to z = 0. During the evolution, snapshots of the universe are saved at a series of redshifts. In this work only data from the final snapshot (z = 0), which corresponds to the present-day universe, are used. To keep the computation time reasonable, we also restrict the data to the IllustrisTNG CV simulations. TODO: add details on the FoF and SUBFIND halos.
Further details on the project and its goals are explained in [source: simulation_data.pdf][website].
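
As a quick sanity check on the numbers quoted above, the short calculation below derives the mean comoving particle separation and the scale factor at the starting redshift. It assumes a cubic box with a side of 25 h⁻¹ Mpc and the usual convention a = 1/(1 + z) with a = 1 today; the variable names are illustrative only.

```python
# Numbers quoted in the text
box_length = 25.0   # comoving box side in h^-1 Mpc (cubic box assumed)
n_per_side = 256    # particles per dimension (256^3 in total)
z_initial = 127     # starting redshift

# Mean comoving separation between neighbouring dark matter particles
mean_separation = box_length / n_per_side

# Scale factor a = 1 / (1 + z), normalized to a = 1 at z = 0
a_initial = 1.0 / (1.0 + z_initial)

print(f"particles per species: {n_per_side**3}")           # 16777216
print(f"mean separation: {mean_separation:.4f} h^-1 Mpc")   # 0.0977
print(f"initial scale factor: {a_initial}")                 # 0.0078125
```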

The data come as HDF5 files (read here with h5py) containing a large amount of information (source: https://www.tng-project.org/data/docs/specifications/#sec2) about the halos and subhalos (galaxies) in the simulated universe. Since we are interested in inferring the mass of a halo from a few distinct properties of its subhalos, we must extract the desired features:
- halo features:
    - Halo mass
    - normalized 3D position
    - normalized 3D velocity
- subhalo features:
    - Mass of star and wind particles (TODO: justify this choice)
    - normalized 3D position
    - normalized 3D velocity


In [12]:
import h5py
import numpy as np
from constants import *

#--- FEATURE CHOICES ---#

use_hmR = 1     # 1 for using the half-mass radius as feature
use_vel = 1     # 1 for using subhalo velocity as feature
only_positions = 0  # 1 for using only positions as features
galcen_frame = 0    # 1 for writing positions and velocities in the central galaxy rest frame (otherwise it uses the total center of mass)

#--- NORMALIZATION ---#

Nstar_th = 10   # Minimum number of stellar particles required to consider a galaxy
radnorm = 8.    # Ad hoc normalization for half-mass radius
velnorm = 100.  # Ad hoc normalization for velocity. Use velnorm=1. for galcen_frame=1

def general_tab(path):

    # Read hdf5 file
    f = h5py.File(path, 'r')

    # Load subhalo features
    # Particle types: (gas, dark matter, unused, tracers, stars & wind particles, black holes);
    # index 4 selects the stars & wind particles.
    # Source: https://www.tng-project.org/data/docs/specifications/#sec2
    SubhaloPos = f["Subhalo/SubhaloPos"][:]/boxsize
    SubhaloMassType = f["Subhalo/SubhaloMassType"][:,4]
    SubhaloVel = f["Subhalo/SubhaloVel"][:]/velnorm

    # Load halo features
    HaloMass = f["Group/Group_M_Crit200"][:]
    GroupPos = f["Group/GroupPos"][:]/boxsize
    GroupVel = f["Group/GroupVel"][:]/velnorm

    # restriction features
    HaloID = np.array(f["Subhalo/SubhaloGrNr"][:], dtype=np.int32)
    SubhaloLenType = f["Subhalo/SubhaloLenType"][:,4]
    SubhaloHalfmassRadType = f["Subhalo/SubhaloHalfmassRadType"][:,4]/radnorm
    
    # Create general table with subhalo properties
    # Host halo ID, 3D position, stellar mass, number of stellar particles, stellar half-mass radius, 3D velocity
    tab = np.column_stack((HaloID, SubhaloPos, SubhaloMassType, SubhaloLenType, SubhaloHalfmassRadType, SubhaloVel))

    # Restrictions:
    indexes = np.argwhere(HaloMass>0.).reshape(-1)  # Neglect halos with zero mass
    tab = tab[tab[:,4]>0.]                          # restrict to subhalos with mass (stars)
    tab = tab[tab[:,5]>Nstar_th]                    # restrict to subhalos with a minimum of star/wind particles
    tab[:,4] = np.log10(tab[:,4])                   # take the log of the stellar mass

    # Once restricted to a minimum number of stellar particles, remove this feature since it is not observable
    tab = np.delete(tab, 5, 1)

    if not use_hmR:
        tab = np.delete(tab, 5, 1)  # remove SubhaloHalfmassRadType if not required

    if only_positions:
        tab = np.column_stack((tab[:,0],tab[:,1],tab[:,2],tab[:,3]))

    f.close()
    return tab, HaloMass, GroupPos, GroupVel, indexes
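
The column bookkeeping in `general_tab` is easy to get wrong, so it can help to replay the restriction steps on a small synthetic table. The sketch below uses four made-up subhalos (all values are hypothetical and carry no physical meaning) with the same column order as `tab`, and applies the same cuts.

```python
import numpy as np

Nstar_th = 10  # same threshold as in the cell above

# Hypothetical subhalos. Columns:
# 0: host halo ID, 1-3: position, 4: stellar mass, 5: number of stellar particles,
# 6: stellar half-mass radius, 7-9: velocity
tab = np.array([
    [0, 0.10, 0.20, 0.30, 5.0, 120, 0.4, 1.0, 0.0, -1.0],  # kept
    [0, 0.11, 0.21, 0.31, 0.0,   0, 0.0, 0.5, 0.5,  0.5],  # no stars: dropped
    [1, 0.80, 0.70, 0.60, 0.2,   8, 0.1, 0.0, 1.0,  0.0],  # too few star particles: dropped
    [1, 0.81, 0.71, 0.61, 2.0,  50, 0.3, 0.2, 0.1,  0.0],  # kept
])

tab = tab[tab[:, 4] > 0.0]        # keep subhalos with non-zero stellar mass
tab = tab[tab[:, 5] > Nstar_th]   # keep subhalos with more than Nstar_th star particles
tab[:, 4] = np.log10(tab[:, 4])   # take the log of the stellar mass
tab = np.delete(tab, 5, axis=1)   # drop the particle count, which is not observable

print(tab.shape)   # (2, 9)
print(tab[:, 0])   # host halo IDs of the surviving subhalos
```

After the particle count is removed, the half-mass radius moves to column 5, which is why `general_tab` deletes column 5 a second time when `use_hmR` is disabled.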

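Because the simulation box has periodic boundary conditions, a halo close to a boundary can have subhalos at the other extreme of the box. The following function corrects this artifact by adding or subtracting one box length: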

In [4]:
# Correct periodic boundary effects
# Some halos close to a boundary could have subhalos at the other extreme of the box, due to periodic boundary conditions
# Just add or subtract one box length in such cases to correct this artifact
def correct_boundary(pos, boxlength=1.):

    for i, pos_i in enumerate(pos):
        for j, coord in enumerate(pos_i):
            if coord > boxlength/2.:
                pos[i,j] -= boxlength
            elif -coord > boxlength/2.:
                pos[i,j] += boxlength

    return pos
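
The same correction can be written without explicit loops. The following is a sketch of a vectorized variant, not the version used above: the name `correct_boundary_vectorized` is hypothetical, and unlike `correct_boundary` it returns a wrapped copy instead of modifying its argument in place. The example positions are made up and assume coordinates normalized by the box size, given relative to the host halo.

```python
import numpy as np

def correct_boundary_vectorized(pos, boxlength=1.0):
    """Return a copy of `pos` in which coordinates outside
    [-boxlength/2, boxlength/2] are shifted by one box length."""
    pos = np.array(pos, dtype=float)         # copy, so the input is left untouched
    pos[pos > boxlength / 2.0] -= boxlength
    pos[pos < -boxlength / 2.0] += boxlength
    return pos

# Hypothetical example: a halo near the box edge with one subhalo on the far side
halo_pos = np.array([0.98, 0.50, 0.02])
subhalo_pos = np.array([[0.99, 0.51, 0.03],
                        [0.01, 0.49, 0.97]])  # wrapped around in x and z

relative = correct_boundary_vectorized(subhalo_pos - halo_pos)
print(relative)
```

Before the correction, the second subhalo appears to lie almost a full box length away from its host in x and z; afterwards all relative coordinates are a few hundredths of the box size.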

To validate the neural network, the data must be split into training, validation, and test sets:

In [6]:
from torch_geometric.data import Data
from torch_geometric.loader import DataLoader  # DataLoader lives in torch_geometric.loader since PyG 2.0
import random

def split_datasets(dataset):

    random.shuffle(dataset)

    # Total number of samples; valid_size and test_size are fractions of this total
    num_samples = len(dataset)
    split_valid = int(np.floor(valid_size * num_samples))
    split_test = split_valid + int(np.floor(test_size * num_samples))

    train_dataset = dataset[split_test:]
    valid_dataset = dataset[:split_valid]
    test_dataset = dataset[split_valid:split_test]

    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    valid_loader = DataLoader(valid_dataset, batch_size=batch_size, shuffle=True)
    test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=True)

    return train_loader, valid_loader, test_loader
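
The index arithmetic of the split can be checked in isolation, without PyTorch Geometric. The sketch below reproduces it on plain integer indices; the helper `split_indices`, the seed, and the fractions of 0.15 are assumptions for illustration, since the actual `valid_size` and `test_size` are defined in `constants`.

```python
import random

def split_indices(n_samples, valid_size, test_size, seed=0):
    """Return shuffled, disjoint index lists (train, valid, test)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    # Same slicing scheme as split_datasets: validation first, then test, rest is training
    split_valid = int(valid_size * n_samples)
    split_test = split_valid + int(test_size * n_samples)

    valid_idx = indices[:split_valid]
    test_idx = indices[split_valid:split_test]
    train_idx = indices[split_test:]
    return train_idx, valid_idx, test_idx

# valid_size and test_size come from constants in the notebook; 0.15 is assumed here
train_idx, valid_idx, test_idx = split_indices(1000, valid_size=0.15, test_size=0.15)
print(len(train_idx), len(valid_idx), len(test_idx))  # 700 150 150
```

Using a seeded `random.Random` instance keeps the split reproducible between runs, which `random.shuffle` on its own does not guarantee unless the global seed is set.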

In [None]:
from torch_geometric.data import Data
from torch_geometric.loader import DataLoader
from Source.constants import *
from Source.plotting import *