# Justice League Stellar Merger History

## Charlotte Christensen, June 11 2025

This Jupyter notebook runs Anna Wright's code (/home/christenc/Code/python/AnnaWrite_startrace/RomZoomSHAnalysisScripts) to identify which halo each star particle formed in.

From Anna's Email (June 4, 2025)

I managed to get through the pipeline and remember what everything did well enough to comment it today, but didn't have time to test it on a new halo. However, if you'd like to try it in the next 24 hours or so, I've attached a zip file with steps 1-7 of my pipeline (there are 8 files, but TrackDownStars_rz.ipynb and FixHostIDs_rz.py are really two halves of a single step). Step 0 is creating a tangos db for the simulation you'll be working with and my pipeline uses that and the simulation itself to create an hdf5 file with relevant data for each star particle. The most important bits are a unique host ID for each star particle so that stars that formed in the same halo can be grouped together, even if that halo doesn't exist at z=0, and the orbital circularity of each star particle, which I use to identify members of the stellar halo.
I will be testing it tomorrow on one of the newer Romulus Zooms, so there's a good chance I'll be sending you an updated version very soon with any bug fixes :) 
I apologize for how many pieces the pipeline is in - it really isn't all that complicated, but it was adapted from what I did for the FOGGIE sims and I split the steps of that pipeline up so that I could do as much as possible locally (rather than on Pleiades) and so that I could sanity check as often as possible. Please let me know if you have any issues, questions, or suggestions! I'd definitely be eager to hear what Juan does differently!

In [1]:
import os
os.environ['TANGOS_SIMULATION_FOLDER'] = '/home/selvani/MAP/Sims/cptmarvel.cosmo25cmb/cptmarvel.cosmo25cmb.4096g5HbwK1BH/'
os.environ['TANGOS_DB_CONNECTION'] = '/home/selvani/MAP/Data/Marvel_BN_N10.db'
os.chdir('/home/selvani/MAP/pynbody/AnnaWright_startrace/')

import tangos
tangos.all_simulations()

OperationalError: (sqlite3.OperationalError) unable to open database file
(Background on this error at: https://sqlalche.me/e/20/e3q8)

In [2]:
# tangos.get_simulation("cptmarvel.4096g5HbwK1BH_bn").timesteps

In [3]:
import pynbody as pb
import numpy as np
import pandas as pd
import glob
import h5py
import tangos

# For importing modules
import importlib.util
import sys
from pathlib import Path

# Import the module
base_path = '/home/selvani/MAP/pynbody/AnnaWright_startrace'

for root, dirs, files in os.walk(base_path):
    if root not in sys.path:
        print("Adding to sys.path:", root)
        sys.path.append(root)

Adding to sys.path: /home/selvani/MAP/pynbody/AnnaWright_startrace
Adding to sys.path: /home/selvani/MAP/pynbody/AnnaWright_startrace/Extra
Adding to sys.path: /home/selvani/MAP/pynbody/AnnaWright_startrace/RomZoomSHAnalysisScripts
Adding to sys.path: /home/selvani/MAP/pynbody/AnnaWright_startrace/RomZoomSHAnalysisScripts/__pycache__


In [4]:
# Simulation name and path

# Elena
# simname = 'h329' # not needed?
# res = '100'  # The Near Mint runs not needed?
#res = 'Mint'  # The Mint
simpath = '/home/selvani/MAP/Sims/cptmarvel.cosmo25cmb/cptmarvel.cosmo25cmb.4096g5HbwK1BH/'
basename = 'cptmarvel.cosmo25cmb.4096g5HbwK1BH'

ss_dir = 'cptmarvel.4096g5HbwK1BH_bn'#'snapshots_200crit_h329' # same as db_sim?
sim_base = simpath + ss_dir + '/'
ss_z0 = sim_base + basename + '.004096'

outfile_dir = "/home/selvani/MAP/pynbody/stellarhalo_trace_aw/"

### 1) GrabTF_rz.py

Step 1 of stellar halo pipeline

**What it does:**
- Loads the final snapshot (z=0) of the simulation
- Extracts formation time (`tform`) and particle IDs (`iord`) for all star particles that have `tform > 0` (i.e., actual stars, not wind particles)
- Converts formation times to Gyr units
- Creates a 2D array with particle IDs in the first row and formation times in the second row
- Saves this data as a `.npy` file named `<simulation_name>_tf.npy`
- Prints the total number of star particles found as a sanity check

**Input:** Simulation snapshot (specifically the final snapshot `.004096`)

**Output:** `<sim>_tf.npy` - NumPy file containing:
- Row 0: Star particle IDs (`iord`)  
- Row 1: Formation times in Gyr (`tform`)

**Purpose:** This creates the base dataset that subsequent steps use to trace each star particle back to the halo it formed in.

* Usage:   `python GrabTF_rz.py <simpath> <output_directory>`
* Example: `python GrabTF_rz.py /path/to/sim/ /path/to/output/`
* Runtime:  ~1 min
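
As a rough illustration of the array layout described above, the sketch below builds the two-row array from synthetic values instead of a real snapshot. The particle IDs, formation times, and the file name `example_tf.npy` are made up for illustration; the real script reads `iord` and `tform` from the z=0 snapshot, and its internals may differ from this sketch.

```python
import os
import tempfile

import numpy as np

# Synthetic stand-ins for the star particle IDs and formation times (Gyr).
# Entries with tform <= 0 are not actual stars and get dropped.
iord = np.array([101, 102, 103, 104, 105])
tform = np.array([0.8, -1.0, 3.2, 13.1, -1.0])

# Keep only actual stars (tform > 0).
is_star = tform > 0

# Row 0: particle IDs, row 1: formation times in Gyr.
# Stacking integers with floats stores the IDs as floats.
tf = np.vstack([iord[is_star], tform[is_star]])
print(tf.shape[1], "stars found!")

# Round-trip through a .npy file, the way a later step would read it back.
outfile = os.path.join(tempfile.mkdtemp(), "example_tf.npy")
np.save(outfile, tf)
loaded = np.load(outfile)
star_ids = loaded[0].astype(int)
star_tform = loaded[1]
```

Because both rows share one array, the IDs come back as floats and need the cast to `int` before they are used for matching.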

In [5]:
# Import Anna's code, even if not along my path
file_path = '/home/selvani/MAP/pynbody/AnnaWright_startrace/RomZoomSHAnalysisScripts/GrabTF_rz.py'
module_name = 'GrabTF_rz'

spec = importlib.util.spec_from_file_location(module_name, file_path)
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)

In [6]:
module.main(simpath, outfile_dir)

414596 stars found!
Save to  /home/selvani/MAP/pynbody/stellarhalo_trace_aw/cptmarvel.cosmo25cmb.4096g5HbwK1BH_tf.npy


### 2) LocAtCreation_pool_rz.py

<!-- Step 2 of stellar halo pipeline -->
<!-- Identifies the host of each star particle in \<sim\>_tf.npy at the  -->
<!-- time it was formed.  -->
**Note that what is stored is NOT the amiga.grp 
ID, but the index of that halo in the tangos database. The amiga.grp
ID can be backed out via tangos with sim[stepnum][halonum].finder_id.** 
(CC, I believe that I edited this so now the halo_)

<!-- Output: <sim>_stardata_<snapshot>.h5
        where <snapshot> is the first snapshot that a given process
        analyzed. There will be <nproc> of these files generated
        and processes will not necessarily analyze adjacent snapshots -->

<!-- * Usage:   python LocAtCreation_pool_rz.py <sim> optional:<nproc>
* Example: python LocAtCreation_pool_rz.py r634 2 -->

<!-- Includes an optional argument to specify number of processes to run
with; default is 4. Note that this will get reduced if you've specified
more processes than you have snapshots to process. -->

<!-- Note that this has the name of the snapshots directory hardcoded into FindHaloStars.py (L63)
-- Will need to be adjusted
The 

CC: When I did my edits, I moved code into FindHaloStars so that it can be imported for multiprocessing -->


Step 2 of stellar halo pipeline

**What it does:**
- Loads the `<sim>_tf.npy` file created in step 1 (containing star particle IDs and formation times)
- Queries the Tangos database to get all available simulation snapshots and their cosmic times
- Determines which snapshots contain newly formed stars by binning star formation times
- For each relevant snapshot, identifies which halo each star particle belonged to at the time it formed
- Extracts additional data for each star: formation position, formation time, and host halo ID
- Converts Amiga halo group IDs to Tangos database indices for consistency
- Handles unbound particles (those not in any halo) by assigning them host ID = -1

**Detailed Process:**
1. **Data Loading**: Loads the `_tf.npy` file containing star particle IDs and formation times
2. **Snapshot Analysis**: Gets all simulation timesteps from Tangos database and sorts by cosmic time
3. **Star Distribution**: Creates histogram of star formation times to identify which snapshots contain new stars
4. **Chunk Creation**: Divides snapshots among multiple processes for parallel processing
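
The "Star Distribution" step can be pictured with a small histogram example. This is only a sketch under assumed inputs: the snapshot labels, snapshot times, and formation times are invented, and the real script gets its timestep list from the Tangos database.

```python
import numpy as np

# Hypothetical snapshot labels and their cosmic times in Gyr, sorted by time.
snap_names = ["000384", "000818", "001920", "004096"]
snap_times = np.array([1.3, 2.7, 6.5, 13.8])

# Hypothetical star formation times in Gyr.
tform = np.array([0.9, 1.1, 2.9, 3.5, 6.0, 13.0])

# Each bin spans the interval between the previous snapshot and this one,
# so bin i counts the stars that first appear in snapshot i.
edges = np.concatenate([[0.0], snap_times])
counts, _ = np.histogram(tform, bins=edges)

# Only snapshots with at least one newly formed star need to be processed.
steps_with_stars = [name for name, n in zip(snap_names, counts) if n > 0]
print(counts, steps_with_stars)
```

Snapshots whose bin is empty can be skipped entirely, which is presumably why fewer steps are processed than exist in the simulation.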
<!-- 5. **Multiprocessing Execution**: Each process handles a subset of snapshots independently -->

**FindHaloStars Function (called by each process):**
- **Time Matching**: For each snapshot, finds stars that formed between the previous snapshot and current one
- **Particle Matching**: Uses `iord` (particle IDs) to match stars from step 1 with their counterparts in historical snapshots
- **Host Identification**: Determines which halo (`amiga.grp`) each star was in when it formed
- **Database Indexing**: Converts halo IDs to Tangos database indices using a lookup dictionary
- **Position/Time Extraction**: Records formation positions, times, and snapshot locations
- **Data Writing**: Periodically saves data to HDF5 files to manage memory usage
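
The particle matching and database indexing bullets can be sketched as follows. All arrays and the `finder_to_db_index` dictionary are hypothetical stand-ins; in the pipeline the group IDs come from `amiga.grp` and the lookup is built from the Tangos database, and the actual implementation in `FindHaloStars` may differ.

```python
import numpy as np

# Stars (from step 1) that formed since the previous snapshot.
new_star_iords = np.array([205, 201, 209])

# Hypothetical star particles in the current snapshot: IDs sorted ascending,
# with the halo finder group ID of each particle (-1 = not in any halo).
snap_iord = np.array([198, 201, 203, 205, 209])
snap_grp = np.array([1, 1, -1, 7, 12])

# Locate each new star in the snapshot by its ID.
idx = np.searchsorted(snap_iord, new_star_iords)
assert np.array_equal(snap_iord[idx], new_star_iords)  # every star was found
grp_at_formation = snap_grp[idx]

# Hypothetical lookup from halo finder ID to database index for this snapshot.
finder_to_db_index = {1: 1, 7: 2}

# Group IDs without a database entry (including -1) are recorded as -1.
hosts = np.array([finder_to_db_index.get(int(g), -1) for g in grp_at_formation])
print(hosts)
```

`np.searchsorted` only gives correct matches when `snap_iord` is sorted, hence the equality check right after the lookup.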

<!-- **Key Technical Details:**
<!-- - Each process loads the same `_tf.npy` data independently to avoid sharing conflicts -->
<!-- - Uses `np.searchsorted()` for efficient particle ID matching between snapshots -->
<!-- - Creates `fid` dictionary to map Amiga group IDs to Tangos database indices
- Handles missing particles gracefully (assigns host ID = -1 for unbound stars -->

**Input:** 
- `<sim>_tf.npy` from step 1
- Simulation snapshots (all timesteps)
- Tangos database connection

**Output:** `<sim>_stardata_<snapshot>.h5` files (one per process) containing:
- `particle_IDs`: Star particle IDs (`iord`) of stars formed between snapshot and previous snapshot
- `particle_positions`: 3D positions at formation time (Mpc)
- `particle_creation_times`: Formation times (Gyr)
- `timestep_location`: Snapshot number where star was first found
- `particle_hosts`: Host halo index in Tangos database (-1 for unbound stars)

<!-- **Performance Features:**
- **Multiprocessing**: Uses all available CPU cores (up to 72 logical cores) for parallel processing
- **Load Balancing**: Randomly shuffles snapshot order to distribute work evenly
- **Memory Management**: Periodically writes data to disk to prevent memory overflow
- **Process Isolation**: Each process works independently to avoid conflicts -->

**Important Notes:**
<!-- - **Host IDs are Tangos database indices, NOT Amiga group IDs** -->
- Multiple output files are created (one per process) and are combined in later steps. To get one file per snapshot, request more processes than there are snapshots.
<!-- - Uses multiprocessing for significant speed improvement on multi-core systems -->
<!-- - Automatically handles load balancing by shuffling snapshot order -->
<!-- - Each process creates its own output file named by the first snapshot it processes -->

**Purpose:** This step creates the detailed formation history for each star particle, linking it to its birth halo and enabling stellar halo analysis. The multiprocessing approach significantly reduces computation time for large simulations.
* Usage: `python LocAtCreation_pool_rz.py <simpath> <db_sim_name> <output_dir> [n_processes]`
* Example: `python LocAtCreation_pool_rz.py /path/to/sim/ cptmarvel.4096g5HbwK1BH_bn /output/ 36`
* Runtime: ~1 hour sequential
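
How snapshots might be divided among processes can be sketched with the standard library alone. `make_chunks` is a hypothetical helper written for this note, not a function from the pipeline; it only illustrates the idea that the process count is capped at the number of snapshots and that the snapshot order is shuffled before splitting.

```python
import random


def make_chunks(snapshots, n_processes, seed=None):
    """Shuffle snapshots and deal them out to at most n_processes chunks."""
    # Never use more chunks than there are snapshots.
    n = max(1, min(n_processes, len(snapshots)))
    shuffled = list(snapshots)
    random.Random(seed).shuffle(shuffled)
    # Deal round-robin so chunk sizes differ by at most one.
    return [shuffled[i::n] for i in range(n)]


snaps = ["000384", "000482", "000640", "000818", "001162"]
chunks_few = make_chunks(snaps, n_processes=2, seed=0)
chunks_many = make_chunks(snaps, n_processes=72, seed=0)
print(len(chunks_few), len(chunks_many))
```

With more processes requested than snapshots, every chunk holds exactly one snapshot, which matches the one-file-per-snapshot behaviour seen in the run below.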

In [5]:
import LocAtCreation_pool_rz

In [6]:
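import psutil
n_cpus = psutil.cpu_count(logical=True)  # logical core count on quirm (informational; not passed below)
# Request 72 processes: more than the number of snapshots, so each snapshot gets its own output file
LocAtCreation_pool_rz.main(simpath, ss_dir, outfile_dir, 72, overwrite=False)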

Stars from 42 steps left to deal with
Initializing  42
Shuffled chunks: [['cptmarvel.cosmo25cmb.4096g5HbwK1BH.001920']
 ['cptmarvel.cosmo25cmb.4096g5HbwK1BH.001280']
 ['cptmarvel.cosmo25cmb.4096g5HbwK1BH.003200']
 ['cptmarvel.cosmo25cmb.4096g5HbwK1BH.003456']
 ['cptmarvel.cosmo25cmb.4096g5HbwK1BH.001543']
 ['cptmarvel.cosmo25cmb.4096g5HbwK1BH.000640']
 ['cptmarvel.cosmo25cmb.4096g5HbwK1BH.000384']
 ['cptmarvel.cosmo25cmb.4096g5HbwK1BH.003328']
 ['cptmarvel.cosmo25cmb.4096g5HbwK1BH.001792']
 ['cptmarvel.cosmo25cmb.4096g5HbwK1BH.001408']
 ['cptmarvel.cosmo25cmb.4096g5HbwK1BH.002176']
 ['cptmarvel.cosmo25cmb.4096g5HbwK1BH.002432']
 ['cptmarvel.cosmo25cmb.4096g5HbwK1BH.003072']
 ['cptmarvel.cosmo25cmb.4096g5HbwK1BH.000818']
 ['cptmarvel.cosmo25cmb.4096g5HbwK1BH.002304']
 ['cptmarvel.cosmo25cmb.4096g5HbwK1BH.003968']
 ['cptmarvel.cosmo25cmb.4096g5HbwK1BH.000482']
 ['cptmarvel.cosmo25cmb.4096g5HbwK1BH.002048']
 ['cptmarvel.cosmo25cmb.4096g5HbwK1BH.001162']
 ['cptmarvel.cosmo25cmb.4096g5HbwK1

Processing:   0%|          | 0/42 [00:00<?, ?chunks/s]

Processing chunk 1/42: ['cptmarvel.cosmo25cmb.4096g5HbwK1BH.001920']
MyFirstStep:  001920
/home/selvani/MAP/Sims/cptmarvel.cosmo25cmb/cptmarvel.cosmo25cmb.4096g5HbwK1BH/
File already exists: /home/selvani/MAP/pynbody/stellarhalo_trace_aw/cptmarvel.cosmo25cmb.4096g5HbwK1BH_stardata_001920.h5
  Completed: 001920

Processing chunk 2/42: ['cptmarvel.cosmo25cmb.4096g5HbwK1BH.001280']
MyFirstStep:  001280
/home/selvani/MAP/Sims/cptmarvel.cosmo25cmb/cptmarvel.cosmo25cmb.4096g5HbwK1BH/
File already exists: /home/selvani/MAP/pynbody/stellarhalo_trace_aw/cptmarvel.cosmo25cmb.4096g5HbwK1BH_stardata_001280.h5
  Completed: 001280

Processing chunk 3/42: ['cptmarvel.cosmo25cmb.4096g5HbwK1BH.003200']
MyFirstStep:  003200
/home/selvani/MAP/Sims/cptmarvel.cosmo25cmb/cptmarvel.cosmo25cmb.4096g5HbwK1BH/
File already exists: /home/selvani/MAP/pynbody/stellarhalo_trace_aw/cptmarvel.cosmo25cmb.4096g5HbwK1BH_stardata_003200.h5
  Completed: 003200

Processing chunk 4/42: ['cptmarvel.cosmo25cmb.4096g5HbwK1BH.0

### 3) writeouthosts_rz.py

<!-- Step 3 of stellar halo pipeline                                                                                   
For each snapshot, writes out a list of halos that formed stars                                                   
between the last snapshot and this one and the number of stars formed;                                            
as in step 2, note that the IDs of these halos will be their index in                                             
the tangos database, not necessarily their amiga.grp ID. This is used                                             
to construct a unique ID for each star-forming halo in the next step.                                             
                                                                                                                   -->
<!-- Output: <sim>_halostarhosts.txt                                                                                   
                                                                                                                  
Usage:   python writeouthosts_rz.py <sim>                                                                         
Example: python writeouthosts_rz.py r634                                                                          
                                                                                                                   -->
Note that this is currently set up for MMs, but should be easily adapted                                          
by e.g., changing the paths or adding a path CL argument.    

Step 3 of stellar halo pipeline

**What it does:**
- Reads all the `<sim>_stardata_*.h5` files created in Step 2 
- For each simulation snapshot, identifies which halos formed new stars between that snapshot and the previous one
- Counts how many stars formed in each halo during each time interval
- Creates a comprehensive timeline of star formation activity across all halos
- Outputs a summary text file listing star-forming halos and their activity over snapshots

**Detailed Process:**
1. **File Discovery**: Locates all `*_stardata_*.h5` files from Step 2 using glob pattern matching
2. **Data Extraction**: For each HDF5 file, extracts:
   - `particle_hosts`: Host halo indices 
   - `timestep_location`: Snapshot numbers where each star particle first appears
3. **Temporal Binning**: Groups star particles by the snapshot where they formed
4. **Halo Counting**: For each snapshot, counts how many stars formed in each unique halo
5. **Output Formatting**: Creates chronologically ordered summary with format:
   ```
   <snapshot_number>    <halo_id_1>,<star_count_1>    <halo_id_2>,<star_count_2>    ...
   ```

<!-- **Key Technical Details:**
- **Vectorized Operations**: Uses `np.unique(return_counts=True)` for efficient halo counting instead of slow loops
- **Memory Efficient**: Processes one HDF5 file at a time to minimize memory usage
- **Chronological Ordering**: Sorts output by snapshot number for temporal analysis
- **Duplicate Handling**: Aggregates data from multiple stardata files that may have overlapping snapshots -->

**Input Files:**
- Multiple `<sim>_stardata_<snapshot>.h5` files from Step 2
- Each file contains star formation data for a subset of simulation snapshots

**Output File:** `<sim>_halostarhosts.txt`
- Text file with tab-separated values
- Each line represents one simulation snapshot
- Format: `<timestep>\t<halo_id>,<count>\t<halo_id>,<count>\t...`
- Example line: `3840    -1,234    1,156    3,89`
  - At snapshot 3840: halo -1 formed 234 unbound stars, halo 1 formed 156 stars, halo 3 formed 89 stars
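
A minimal sketch of the counting and formatting described above, using only the standard library. The input lists are invented, and details such as the ordering of halos within a line or zero-padding of the snapshot number are assumptions rather than the script's confirmed behaviour.

```python
from collections import Counter, defaultdict

# Hypothetical (timestep_location, particle_hosts) pairs, one per star particle.
timesteps = [384, 384, 384, 818, 818, 384]
hosts = [1, 1, -1, 3, 1, 1]

# Count stars per host halo within each snapshot.
per_step = defaultdict(Counter)
for step, host in zip(timesteps, hosts):
    per_step[step][host] += 1

# One tab-separated line per snapshot, in chronological order.
lines = []
for step in sorted(per_step):
    fields = [f"{host},{count}" for host, count in sorted(per_step[step].items())]
    lines.append("\t".join([str(step)] + fields))

print("\n".join(lines))
```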

<!-- **Data Flow:**
```
Step 2 Output: Multiple *_stardata_*.h5 files
                    ↓
Step 3: Aggregate and summarize by snapshot/halo
                    ↓
Step 3 Output: Single *_halostarhosts.txt file
``` -->

<!-- **Performance Optimizations:**
- **Batch Processing**: Handles large datasets efficiently using vectorized NumPy operations
- **String Building**: Uses `list.join()` instead of repeated string concatenation for speed
- **Efficient I/O**: Single-pass reading of HDF5 files with minimal memory footprint -->

**Purpose:** This step creates a compact summary of star formation activity that enables Step 4 to efficiently track halo merger histories and assign unique IDs to star-forming halos. 
<!-- The chronological format makes it easy to identify:
- Which halos were actively forming stars at each epoch
- How star formation rates varied over cosmic time  
- Which halos contributed most significantly to stellar mass assembly -->
<!-- 
**Example Usage:**
- Input: 40+ `cptmarvel.cosmo25cmb.4096g5HbwK1BH_stardata_*.h5` files
- Output: `cptmarvel.cosmo25cmb.4096g5HbwK1BH_halostarhosts.txt`
- Result: Timeline of ~100 snapshots showing star formation in ~1000s of halos -->

* Usage: `python writeouthosts_rz.py <sim> <output_dir>`
* Example: `python writeouthosts_rz.py cptmarvel.cosmo25cmb.4096g5HbwK1BH /output/path/`
* Runtime: instant

**Important Notes:**
<!-- - **Halo IDs are Tangos database indices**, not Amiga group IDs (consistent with Step 2)
- Output file size is typically much smaller than input HDF5 files (text vs binary format)
- This summary enables efficient processing in Step 4 without re-reading large HDF5 files -->
- Unbound stars (host ID = -1) are included in the summary for completeness

In [7]:
import writeouthosts_rz

In [8]:
writeouthosts_rz.main(basename, odir=outfile_dir)

Found 42 stardata files
Output file: /home/selvani/MAP/pynbody/stellarhalo_trace_aw/cptmarvel.cosmo25cmb.4096g5HbwK1BH_halostarhosts.txt


#### Extra stuff Charlotte added

In [None]:
print(np.unique(tslist))
print(len(tslist))
print(len(hostlist))

In [None]:
t = 456
tstr = str(int(t))
tmask = np.where(tslist==t)[0]
np.size(tmask)

### 4) IDUniqueHost_rz.py

Step 4 of stellar halo pipeline

Creates a unique ID for each host that forms a star. The format of this ID is `<last snapshot where host was IDed>_<index at this snapshot>`. So, if a host was halo 5 at snapshot 3552 and then merged with halo 1 before the next snapshot, its unique ID will be 3552_5. Stars that form in its main progenitors will also be associated with this ID. These IDs are written out to a file with a similar format to `<sim>_halostarhosts.txt`.

Output: `<sim>_uniquehalostarhosts.txt`

* Usage:   `python IDUniqueHost_rz.py <sim>`
* Example: `python IDUniqueHost_rz.py r634`

Note that this is currently set up for MMs, but should be easily adapted by e.g., changing the paths or adding a path CL argument. It is also designed to accommodate the phantoms that rockstar generates when it temporarily loses track of a halo, which slows it down quite a bit. If you're only ever going to be using it with other types of merger trees, it can be simplified.

Step 4 of stellar halo pipeline

**What it does:**
- Reads the `<sim>_halostarhosts.txt` file created in Step 3 (which lists halos that formed stars at each snapshot)
- For each star-forming halo, traces its merger history forward in time using Tangos database merger trees
- Creates a unique, persistent ID for each halo that accounts for mergers and halo evolution
- Assigns the same unique ID to stars formed in progenitor halos that later merge
- Handles "phantom" halos (temporary tracking losses in halo finders) for robust merger tree following
- Outputs a mapping file that connects local halo IDs to persistent unique IDs

**Detailed Process:**
1. **Input Parsing**: Reads the timeline of star-forming halos from `_halostarhosts.txt`
2. **Merger Tree Traversal**: For each halo, uses Tangos database to trace descendants forward in time
3. **Self-Consistency Checking**: Verifies that merger tree connections are bidirectional (descendant→progenitor matches)
4. **Unique ID Assignment**: Creates IDs in format `<last_snapshot>_<halo_index>` where the halo was last independently identified
5. **Progenitor Chain Building**: Links all progenitors in a merger chain to the same unique ID
6. **Phantom Handling**: Accommodates temporary halo finder failures using robust tree traversal algorithms

**Key Technical Details:**
- **Unique ID Format**: `SSSS_H` where `SSSS` = 4-digit snapshot number, `H` = halo index at that snapshot
- **Forward Tracking**: Uses `trackforward()` function to find the last snapshot where a halo exists independently
- **Merger Tree Validation**: Employs `checkmatch_p()` and `checkmatch_d()` to verify progenitor/descendant relationships
- **Phantom Accommodation**: Filters out phantom halos (type > 0) but maintains merger tree integrity
- **Caching System**: Uses dictionary `d` to store previously computed unique IDs for efficiency

**Example Unique ID Creation:**
```
Halo 5 at snapshot 3552 merges with halo 1 before the next snapshot
→ Unique ID: "3552_5"
→ All stars formed in this halo's progenitors get ID "3552_5"
→ Stars formed in halo 1 (after merger) get a different unique ID
```

**Algorithm Workflow:**
1. **For each timestep** in the simulation:
   - **For each halo** that formed stars at that timestep:
     - Check if unique ID already computed (use cached result)
     - If not cached: trace forward to find last independent existence
     - Create unique ID based on final independent snapshot
     - Trace backward through progenitor chain
     - Assign same unique ID to all progenitors in the chain
     - Cache results for future lookups
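
The caching idea in this workflow can be illustrated with a toy merger tree. This sketch does not use Tangos and is not the script's `trackforward()`; the `main_descendant` dictionary, the `cache` dictionary, and the `unique_id` function are hypothetical, and the sketch only labels halos along the forward chain instead of performing the full backward progenitor trace.

```python
# Toy merger tree: (snapshot, halo) -> (next snapshot, halo it is the main
# progenitor of). A halo with no entry is last identified at its own snapshot.
main_descendant = {
    (3360, 8): (3456, 6),
    (3456, 6): (3552, 5),
    # (3552, 5) merges into halo 1 before the next snapshot, so its chain ends.
    (3456, 1): (3552, 1),
}

cache = {}  # (snapshot, halo) -> unique ID


def unique_id(snapshot, halo):
    """Return '<last snapshot>_<halo index there>' for a halo, with caching."""
    chain = []
    node = (snapshot, halo)
    # Walk forward until we reach a cached halo or the end of the chain.
    while node not in cache and node in main_descendant:
        chain.append(node)
        node = main_descendant[node]
    if node not in cache:
        cache[node] = f"{node[0]:04d}_{node[1]}"
    # Every halo visited on the way shares the same unique ID.
    for visited in chain:
        cache[visited] = cache[node]
    return cache[node]


print(unique_id(3360, 8), unique_id(3456, 6), unique_id(3456, 1))
```

The second call returns immediately from the cache because halo 6 at snapshot 3456 was already labelled while resolving its progenitor.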

**Merger Tree Functions:**
- **`trackforward(step, halo)`**: Traces halo forward to find last independent snapshot
- **`checkmatch_p(step, halo, hid, disp)`**: Verifies progenitor relationship
- **`checkmatch_d(step, halo, hid, disp)`**: Verifies descendant relationship

**Input Files:**
- `<sim>_halostarhosts.txt` from Step 3 (timeline of star-forming halos)
- Tangos database with merger tree information

**Output File:** `<sim>_uniquehalostarhosts.txt`
- Text file with same format as input but with unique IDs replacing local halo indices
- Format: `<timestep>\t<unique_id>,<local_halo_id>[,<star_count>]\t...`
- Example line: `3456    3552_5,15,234    3721_42,42,156`
  - At snapshot 3456: the halo with local ID 15 formed 234 stars and carries unique ID "3552_5" (it is last identified as halo 5 at snapshot 3552)
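
Assuming the tab-separated layout documented above, a file in this format could be read back into a lookup table roughly as follows. `parse_unique_hosts` is a hypothetical helper, and the example lines are illustrative rather than taken from a real output file.

```python
def parse_unique_hosts(lines):
    """Map (timestep, local halo ID) -> unique ID from tab-separated lines."""
    mapping = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if not fields or not fields[0]:
            continue  # skip blank lines
        timestep = int(fields[0])
        for entry in fields[1:]:
            parts = entry.split(",")
            unique_id, local_id = parts[0], int(parts[1])
            # A trailing star count, if present, is ignored here.
            mapping[(timestep, local_id)] = unique_id
    return mapping


# Illustrative lines only; not taken from a real output file.
example = ["3456\t3552_5,15,234\t4096_1,1,980", "3552\t3552_5,5,12"]
host_map = parse_unique_hosts(example)
print(host_map[(3456, 15)])
```

Keying on the pair `(timestep, local halo ID)` matters because the same local index can refer to different halos at different snapshots.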

<!-- **Performance Considerations:**
- **Caching**: Avoids recomputing unique IDs for halos already processed
- **Phantom Handling**: Designed for Rockstar halo finder but slows processing
- **Database Queries**: Intensive use of Tangos merger tree calculations
- **Memory Usage**: Stores merger tree data and caching dictionary -->

**Data Flow:**
```
Step 3 Output: *_halostarhosts.txt (local halo IDs)
                    ↓
Step 4: Merger tree analysis + unique ID assignment
                    ↓
Step 4 Output: *_uniquehalostarhosts.txt (persistent unique IDs)
```

**Special Cases Handled:**
- **Unbound Stars**: Particles with host ID = -1 get unique ID `<snapshot>_0`
- **Phantom Halos**: Temporary tracking losses in halo finder are filtered but accounted for
- **Merger Events**: Multiple local halos can map to the same unique ID if they're part of the same merger tree
- **Isolated Halos**: Halos that never merge retain their snapshot-specific unique ID

**Purpose:** This step solves the fundamental problem that halo IDs change over time due to mergers, which makes it hard to track stellar populations using per-snapshot halo IDs alone. By creating persistent unique IDs, we can:
- Group stars by their true formation halo, even after mergers
- Trace stellar populations through cosmic time
- Identify which stars belong to the main galaxy vs. accreted satellites
- Enable stellar halo analysis based on formation environment

* Usage: `python IDUniqueHost_rz.py <sim> <tangos_simulation> <input_file> <output_file>`
* Example: `python IDUniqueHost_rz.py cptmarvel.cosmo25cmb.4096g5HbwK1BH sim_object halostarhosts.txt uniquehalostarhosts.txt`
* Runtime: ~2 minutes 

<!-- **Important Notes:**
- **Computationally Intensive**: Merger tree queries can be slow for large simulations
- **Rockstar Optimized**: Phantom handling is specific to Rockstar halo finder behavior
- **Database Dependent**: Requires properly constructed Tangos merger trees
- **Bidirectional Verification**: Ensures merger tree consistency through forward/backward checking
- **Memory Scaling**: Caching dictionary grows with number of unique halos processed -->

**Output Validation:**
The output file enables Step 5 to create a comprehensive database where every star particle is assigned to a persistent halo ID, regardless of when it formed or how many mergers occurred afterward.


In [9]:
import IDUniqueHost_rz

In [11]:
sim = tangos.get_simulation(ss_dir)
print(f"Simulation: {sim}")
import collections
d = collections.defaultdict(list)
print(d)

hsfile = os.path.join(outfile_dir, f"{basename}_halostarhosts.txt")
ofile = os.path.join(outfile_dir, f"{basename}_uniquehalostarhosts.txt")

IDUniqueHost_rz.main(sim, d, hsfile, ofile)

Simulation: <Simulation("cptmarvel.4096g5HbwK1BH_bn")>
defaultdict(<class 'list'>, {})
------ 291
Current: 0291_1


OperationalError: (sqlite3.OperationalError) unable to open database file
(Background on this error at: https://sqlalche.me/e/20/e3q8)

In [23]:
fname = sim_base + basename+ '.000818'
print(fname)
s = pb.load(fname)
unique_gp = np.unique(s.s['amiga.grp'])
print(unique_gp)

/home/selvani/MAP/Sims/cptmarvel.cosmo25cmb/cptmarvel.cosmo25cmb.4096g5HbwK1BH/cptmarvel.4096g5HbwK1BH_bn/cptmarvel.cosmo25cmb.4096g5HbwK1BH.000818
[ -1   1   2   3   4   5   6   7   8   9  10  11  12  15  16  27  28  30
  47  49 642]


In [7]:
fname = sim_base + basename+ '.002162'
print(fname)
s = pb.load(fname)
unique_gp = np.unique(s.s['amiga.grp'])
print(unique_gp)

/home/selvani/MAP/Sims/cptmarvel.cosmo25cmb/cptmarvel.cosmo25cmb.4096g5HbwK1BH/cptmarvel.4096g5HbwK1BH_bn/cptmarvel.cosmo25cmb.4096g5HbwK1BH.002162
[  -1    1    2    3    4    5    7    9   10   11   13   14   16   27
   48  126  181 6155]


### 5) StoreUniqueHostID_rz.py

Step 5 of stellar halo pipeline

Stores the unique ID of each star particle's host at formation time. Creates an hdf5 file that contains this in addition to all of the data from the `<sim>_stardata_<snapshot>.h5` files. Note that all star particles that don't have a host in the snapshot after they formed will be assigned a unique ID of `<snapshot_index>_0` and a particle host (i.e., host at formation time) of -1. It is recommended that you use the TrackDownStars Jupyter notebook to try to manually identify hosts for these stars and then use `FixHostIDs_rz.py` to amend `<sim>_allhalostardata.h5`.

Output: `<sim>_allhalostardata.h5`

* Usage:   `python StoreUniqueHostID_rz.py <sim>`
* Example: `python StoreUniqueHostID_rz.py r634`

Note that this is currently set up for MMs, but should be easily adapted by e.g., changing the paths or adding a path CL argument.



Step 5 of stellar halo pipeline

**What it does:**
- Combines all `<sim>_stardata_*.h5` files from Step 2 into a single HDF5 file
- Maps each star particle's local host ID to its unique host ID from Step 4
- Creates final dataset linking every star particle to its persistent formation halo
- Handles unbound stars (host ID = -1) by assigning unique IDs like `3840_0`

**Input:** 
- Multiple `<sim>_stardata_*.h5` files from Step 2
- `<sim>_uniquehalostarhosts.txt` from Step 4 (halo ID mapping)

**Output:** `<sim>_allhalostardata.h5` containing:
- `particle_IDs`: Star particle IDs (`iord`)
- `particle_positions`: Formation positions (Mpc)
- `particle_creation_times`: Formation times (Gyr)
- `timestep_location`: Formation snapshot numbers
- `particle_hosts`: Local halo IDs at formation
- `host_IDs`: **Unique persistent halo IDs** (e.g., "3552_5")

**Key Process:**
1. Load unique ID mapping from Step 4: `"timestep,hostid" → "unique_id"`
2. For each star particle: lookup `(timestep, local_halo_id)` → assign unique ID
3. Unbound particles get default IDs: `f"{timestep:04d}_0"`
4. Combine all data into single compressed HDF5 file
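
The lookup in steps 2 and 3 of this process can be sketched with plain Python lists. The mapping and the per-particle values are invented, and the sketch skips the HDF5 reading and writing entirely. A host that is neither -1 nor present in the mapping would raise a `KeyError` here; how the real script treats that case is not shown in this notebook.

```python
# Hypothetical mapping from step 4: (timestep, local halo ID) -> unique ID.
host_map = {(3456, 15): "3552_5", (3456, 1): "4096_1"}

# Hypothetical per-particle data from the step 2 files.
timestep_location = [3456, 3456, 3456, 3552]
particle_hosts = [15, -1, 1, -1]

host_IDs = []
for step, host in zip(timestep_location, particle_hosts):
    if host == -1:
        # Unbound at formation: default ID of the form <timestep>_0.
        host_IDs.append(f"{step:04d}_0")
    else:
        host_IDs.append(host_map[(step, host)])

print(host_IDs)
```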

**Purpose:** Creates analysis-ready dataset enabling stellar population studies with persistent halo tracking across mergers and cosmic time.

* Usage: `python StoreUniqueHostID_rz.py <sim> <output_directory>`
* Example: `python StoreUniqueHostID_rz.py cptmarvel.cosmo25cmb.4096g5HbwK1BH /output/path/`
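The lookup in the Key Process list can be illustrated with a minimal sketch on toy data. This is not Anna's implementation: the function name `assign_unique_ids`, the in-memory `mapping` dict, and the fallback to the default ID when a `(timestep, host)` pair is missing from the mapping are all assumptions made for illustration.

```python
import numpy as np

def assign_unique_ids(timesteps, local_hosts, mapping):
    """Map (timestep, local host) pairs to unique host IDs.

    `mapping` is a hypothetical dict keyed by "timestep,hostid" strings,
    mimicking the Step 4 text file once it has been read into memory.
    Stars with host -1 get the default "<timestep>_0" ID. Falling back to
    the same default for pairs missing from the mapping is an assumption.
    """
    unique_ids = []
    for ts, host in zip(timesteps, local_hosts):
        default = f"{int(ts):04d}_0"
        if host == -1:
            unique_ids.append(default)
        else:
            unique_ids.append(mapping.get(f"{int(ts)},{int(host)}", default))
    return np.array(unique_ids)

# Toy data: halo 2 at timestep 3840 is the same object as halo 5 at 3552
mapping = {"3552,5": "3552_5", "3840,2": "3552_5"}
timesteps = np.array([3552, 3840, 3840])
local_hosts = np.array([5, 2, -1])

host_ids = assign_unique_ids(timesteps, local_hosts, mapping)
print(host_ids)
```

The first two stars end up grouped under the same persistent ID even though their local halo IDs differ, which is the point of the unique ID.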

In [4]:
import StoreUniqueHostID_rz

StoreUniqueHostID_rz.main(basename, outfile_dir)

/home/selvani/MAP/pynbody/stellarhalo_trace_aw/cptmarvel.cosmo25cmb.4096g5HbwK1BH_stardata_000482.h5 <KeysViewHDF5 ['particle_IDs', 'particle_creation_times', 'particle_hosts', 'particle_positions', 'timestep_location']>
/home/selvani/MAP/pynbody/stellarhalo_trace_aw/cptmarvel.cosmo25cmb.4096g5HbwK1BH_stardata_003840.h5 <KeysViewHDF5 ['particle_IDs', 'particle_creation_times', 'particle_hosts', 'particle_positions', 'timestep_location']>
/home/selvani/MAP/pynbody/stellarhalo_trace_aw/cptmarvel.cosmo25cmb.4096g5HbwK1BH_stardata_002304.h5 <KeysViewHDF5 ['particle_IDs', 'particle_creation_times', 'particle_hosts', 'particle_positions', 'timestep_location']>
/home/selvani/MAP/pynbody/stellarhalo_trace_aw/cptmarvel.cosmo25cmb.4096g5HbwK1BH_stardata_003968.h5 <KeysViewHDF5 ['particle_IDs', 'particle_creation_times', 'particle_hosts', 'particle_positions', 'timestep_location']>
/home/selvani/MAP/pynbody/stellarhalo_trace_aw/cptmarvel.cosmo25cmb.4096g5HbwK1BH_stardata_002432.h5 <KeysViewHDF5 [

### 6a) TrackDownStars_rz.ipynb

### 6b) FixHostIDs_rz

Optional: Step 6b of stellar halo pipeline

Updates the host_ID values stored in the allhalostardata hdf5 file based
on user input. This is designed as a follow-up to TrackDownStars and can 
be used in a couple of ways. If ffile=True, this script will look for 
numpy files with names <sim>_new_ID_?.npy and will assign the particles
with the iords in a given file to new_ID. If ffile=False, it will assign
all particles with a host_ID in the old_ID list to new_ID.

Output: <sim>_allhalostardata_upd.h5

Usage:   python FixHostIDs_rz.py <sim>
Example: python FixHostIDs_rz.py r634 

The script will print out all host_IDs for which the number of assigned particles 
changed and how many particles each gained/lost. If the output looks correct,
the user should manually rename <sim>_allhalostardata_upd.h5 to <sim>_allhalostardata.h5.
It's often necessary to go back and forth between this and TrackDownStars, in which 
case I usually move the *.npy files that have already been processed to a subfolder. 
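The two reassignment modes and the change report described above can be sketched on in-memory arrays. This is only an illustration under assumptions: the function names are hypothetical, the real script reads and writes the hdf5 and `.npy` files, and the toy arrays stand in for `particle_IDs` and `host_IDs`.

```python
import numpy as np

def reassign_by_iord(particle_ids, host_ids, iords, new_id):
    """ffile=True style: particles whose iord is in `iords` get `new_id`."""
    updated = host_ids.copy()
    updated[np.isin(particle_ids, iords)] = new_id
    return updated

def reassign_by_old_id(host_ids, old_ids, new_id):
    """ffile=False style: particles whose host_ID is in `old_ids` get `new_id`."""
    updated = host_ids.copy()
    updated[np.isin(host_ids, old_ids)] = new_id
    return updated

def report_changes(before, after):
    """Return {host_ID: change in particle count} for IDs whose count changed."""
    changes = {}
    for hid in np.union1d(before, after):
        delta = int(np.sum(after == hid)) - int(np.sum(before == hid))
        if delta != 0:
            changes[str(hid)] = delta
    return changes

# Toy data; a wide fixed-width string dtype avoids truncating longer IDs
particle_ids = np.array([101, 102, 103, 104])
host_ids = np.array(["3840_0", "3840_0", "3552_5", "2304_7"], dtype="U16")

step1 = reassign_by_iord(particle_ids, host_ids, iords=[101, 102], new_id="3552_5")
step2 = reassign_by_old_id(step1, old_ids=["2304_7"], new_id="3552_5")
changes = report_changes(host_ids, step2)
print(changes)
```

Working on copies mirrors the script's habit of writing a separate `_upd` file, so the original assignments stay available until the report has been checked.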


### 6c) CompTwoHalos_rz

Optional: Step 6c of stellar halo pipeline

Compares two halos to see how likely it is that one is
the main progenitor of the other based on how many particles
they have in common. This is particularly useful when you've
used a merger tree constructor that doesn't use phantoms or
some equivalent and may therefore fail to connect a halo at
snapshot1 to the same halo at snapshot3 if it lost track of it
at snapshot2. This information can then be used with FixHostIDs_rz
to merge two unique IDs and/or to create a new link in the 
relevant tangos db.

Usage: python CompTwoHalos_rz.py <sim> <halo1> <halo2>
Example: python CompTwoHalos_rz.py r718 0136_4 0192_3

Output: prints out the fraction of <halo1>'s DM particles
that are in <halo2> and vice-versa.

It takes three arguments: the simulation you're working with
and the tangos IDs of the two halos you want to compare, which
are formatted as <snapshot>_<IDatsnapshot>. Note that this ID
is assumed to be the tangos ID, not necessarily the amiga.grp
ID.
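The comparison itself reduces to counting shared particle IDs. Below is a minimal sketch, assuming the DM particle `iord` arrays of the two halos have already been loaded (via tangos or pynbody, not shown); `parse_halo_id` and `shared_fractions` are hypothetical helper names, not functions from CompTwoHalos_rz.py.

```python
import numpy as np

def parse_halo_id(halo_id):
    """Split '<snapshot>_<IDatsnapshot>' into (snapshot string, integer halo ID)."""
    snapshot, number = halo_id.split("_")
    return snapshot, int(number)

def shared_fractions(iords1, iords2):
    """Fraction of halo1's particles found in halo2, and vice versa.

    Assumes each array holds unique particle IDs and is non-empty.
    """
    n_common = np.intersect1d(iords1, iords2).size
    return n_common / len(iords1), n_common / len(iords2)

# Toy DM particle IDs for two halos at different snapshots
halo1_dm = np.array([1, 2, 3, 4, 5, 6, 7, 8])
halo2_dm = np.array([5, 6, 7, 8, 9, 10, 11, 12, 13, 14])

frac_1_in_2, frac_2_in_1 = shared_fractions(halo1_dm, halo2_dm)
print(parse_halo_id("0136_4"), frac_1_in_2, frac_2_in_1)
```

Reporting both directions matters because a small halo can be almost entirely contained in a much larger one without being its main progenitor.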


In [9]:
# Inspect the final snapshot (004096): boxsize and h here, mass and velocity units in the next cell

import pynbody

print(os.listdir('/home/selvani/MAP/Sims/cptmarvel.cosmo25cmb/cptmarvel.cosmo25cmb.4096g5HbwK1BH/cptmarvel.4096g5HbwK1BH_bn/'))

s = pynbody.load('/home/selvani/MAP/Sims/cptmarvel.cosmo25cmb/cptmarvel.cosmo25cmb.4096g5HbwK1BH/cptmarvel.4096g5HbwK1BH_bn/cptmarvel.cosmo25cmb.4096g5HbwK1BH.004096')
s.properties

['cptmarvel.cosmo25cmb.4096g5HbwK1BH.000482.amiga.stat', 'cptmarvel.cosmo25cmb.4096g5HbwK1BH.001792.igasorder', 'cptmarvel.cosmo25cmb.4096g5HbwK1BH.000768', 'cptmarvel.cosmo25cmb.4096g5HbwK1BH.002162', 'cptmarvel.cosmo25cmb.4096g5HbwK1BH.001664.amiga.stat', 'cptmarvel.cosmo25cmb.4096g5HbwK1BH.003456.FeMassFrac', 'cptmarvel.cosmo25cmb.4096g5HbwK1BH.000291.amiga.stat', 'cptmarvel.cosmo25cmb.4096g5HbwK1BH.003968.igasorder', 'cptmarvel.cosmo25cmb.4096g5HbwK1BH.002176.z0.741.AHF_profiles', 'cptmarvel.cosmo25cmb.4096g5HbwK1BH.000512', 'cptmarvel.cosmo25cmb.4096g5HbwK1BH.003245.amiga.grp', 'cptmarvel.cosmo25cmb.4096g5HbwK1BH.001025.z1.999.AHF_halos', 'cptmarvel.cosmo25cmb.4096g5HbwK1BH.003328.OxMassFrac', 'cptmarvel.cosmo25cmb.4096g5HbwK1BH.002816.amiga.stat', 'cptmarvel.cosmo25cmb.4096g5HbwK1BH.000672.amiga.gtp', 'cptmarvel.cosmo25cmb.4096g5HbwK1BH.001792.FeMassFrac', 'cptmarvel.cosmo25cmb.4096g5HbwK1BH.000640.FeMassFrac', 'cptmarvel.cosmo25cmb.4096g5HbwK1BH.003584.HI', 'cptmarvel.cosmo25cmb

{'omegaM0': 0.24,
 'omegaL0': 0.76,
 'h': 0.7299490542599526,
 'boxsize': Unit("2.50e+04 kpc a"),
 'a': 1.0000000000142635,
 'time': Unit("1.40e+01 s kpc km**-1")}

In [11]:
print(s['mass'].units)
print(s['vel'].units)

2.31e+15 Msol
6.30e+02 km a s**-1


In [14]:
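# Extract the six-digit snapshot number from a stardata filename:
# drop the '.h5' extension, then keep the last six characters of what precedes it
result = 'cosmo25cmb.4096g5HbwK1BH_stardata_002688.h5'
result.split('.')[-2][-6:]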

'002688'
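The slicing above relies on the snapshot number being exactly the last six characters before the extension. A slightly more defensive variant, sketched here with a hypothetical helper name and assuming the `<sim>_stardata_<snapshot>.h5` naming pattern with six-digit snapshot numbers, uses a regular expression and fails loudly on unexpected filenames.

```python
import re

def snapshot_from_filename(filename):
    """Return the six-digit snapshot string from '<sim>_stardata_<snapshot>.h5'."""
    match = re.search(r"_stardata_(\d{6})\.h5$", filename)
    if match is None:
        raise ValueError(f"Unexpected stardata filename: {filename}")
    return match.group(1)

snapshot = snapshot_from_filename('cosmo25cmb.4096g5HbwK1BH_stardata_002688.h5')
print(snapshot)
```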