<img align="left" src = "images/linea.png" width=120 style="padding: 20px"> 
<img align="left" src = "images/rubin.png" width=140 style="padding: 30px"> 

# PZ Compute - E2E Notebook 
### Photo-zs for LSST Object catalog

## Data Release: DP0.2 




In [None]:
release = 'lsst_dp02' 

In [None]:
! conda env list

Notebook contributors: Julia Gschwend, Luigi Silva, Heloisa Mengisztki <br>
Contact: [julia@linea.org.br](mailto:julia@linea.org.br) <br>
Last verified run: **2024-Nov-08** <br>

## README - Disclaimer
This notebook is an alternative front-end for the Photo-z Compute pipeline, originally developed for command-line execution in LIneA's HPC environment. It is meant to be used by the "photo-z experts" in charge of the production tasks related to the Brazilian in-kind contribution to LSST. It should **not** be considered a source of [documentation or a user guide](https://github.com/linea-it/pz-compute/tree/main/doc/manpages). 

After each complete execution, this notebook must be exported and saved as an HTML file to serve as an execution report for future provenance tracking. Additional process metadata and provenance information are available in the attached `provenance_info.yaml` file. 

## Notebook contents 

1. Pre-processing: data preparation, photo-z training and validation 
2. Photo-z Compute 
3. Post-processing: analyze results and performance  


Each of these steps was carefully explored in separate notebooks. This notebook contains only the final decisions regarding sample selection and configuration choices.   

--- 

Setup:

In [None]:
import os 
#import ... 

In [None]:
# PZ Server
from pzserver import PzServer
with open('.token.txt', 'r') as file:
    token = file.read().strip()  # strip the trailing newline, if any
pz_server = PzServer(token=token, host="pz-dev") # "pz-dev" is the temporary host for test phase  

--- 

# 1. Pre-processing 

## 1.1 Create Skinny tables 

Skinny tables are a subset of the [LSST Object catalog](https://sdm-schemas.lsst.io/dp02.html#Object) that includes only the columns of interest for photo-z algorithms, with ready-to-use data, i.e., fluxes converted into dereddened magnitudes.  
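
As an illustration of the conversion mentioned above, here is a minimal sketch (not the pipeline's own code) that turns fluxes in nJy into dereddened AB magnitudes and magnitude errors. The function name `flux_to_mag` and the inputs `ebv` (reddening E(B-V)) and `r_band` (band extinction coefficient) are assumptions introduced for this example; the actual coefficients used by the pipeline are not stated in this notebook.

```python
import numpy as np

# AB zero point for fluxes expressed in nanojansky (nJy)
AB_ZEROPOINT_NJY = 31.4


def flux_to_mag(flux_njy, flux_err_njy, ebv=0.0, r_band=0.0):
    """Convert fluxes (nJy) to dereddened AB magnitudes and magnitude errors.

    ebv and r_band are illustrative inputs: the extinction in the band is
    taken as A_band = r_band * ebv and subtracted from the magnitude.
    """
    flux = np.asarray(flux_njy, dtype=float)
    err = np.asarray(flux_err_njy, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        mag = -2.5 * np.log10(flux) + AB_ZEROPOINT_NJY
        mag_err = (2.5 / np.log(10)) * err / flux
    # Non-positive fluxes have no defined magnitude: flag them as NaN
    bad = ~(flux > 0)
    mag = np.where(bad, np.nan, mag - r_band * ebv)
    mag_err = np.where(bad, np.nan, mag_err)
    return mag, mag_err
```

For example, `flux_to_mag([100.0], [1.0])` gives a magnitude of 26.4 with an error of about 0.011.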

### Input data

The very first input of this end-to-end sequence is the original LSST Object catalog for DP0.2, stored in the Lustre file system at: 

`/lustre/t1/cl/lsst/dp02/primary/catalogs/object/` 

Filename pattern: `objectTable_tract_xxxx_DC2_2_2i_runs_DP0_2_v23_0_1_PREOPS-905_step3_x_2022xxxxTxxxxxxZ.parq`
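
If the tract number is needed later (for example, to label outputs), it can be recovered from the file name. The sketch below assumes only the `objectTable_tract_<number>_` prefix of the pattern above; the helper name `tract_from_filename` is hypothetical.

```python
import re

# Matches the tract number that follows the "objectTable_tract_" prefix
TRACT_PATTERN = re.compile(r"objectTable_tract_(\d+)_")


def tract_from_filename(path):
    """Return the tract number embedded in an Object catalog file name, or None."""
    match = TRACT_PATTERN.search(path)
    return int(match.group(1)) if match else None
```

With a made-up file name such as `objectTable_tract_1234_DC2_2_2i_runs_DP0_2_v23_0_1_PREOPS-905_step3_1_20220101T000000Z.parq`, the function returns `1234`.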

In [None]:
# IMPORTS
import os
import glob
import getpass
import numpy as np
import pandas as pd
import dask.dataframe as dd
import pyarrow.parquet as pq  
import matplotlib.pyplot as plt
from IPython.display import Markdown, display

File size summary.

In [None]:
# Paths for the catalog files
catalog_path = '/lustre/t1/cl/lsst/dp02/primary/catalogs/object'
catalog_files = '*.parq'
catalog_files_paths = glob.glob(os.path.join(catalog_path, catalog_files))

# If the IDs are in the index column, reset the index and add a column corresponding to the IDs.
ids_are_in_the_index = True

# Defining if you want to save the file size distribution, file size histogram and summarize pixels.
save_input_catalog_info = False

# If you choose True above, select the path to save the info.
if save_input_catalog_info:
    user = getpass.getuser()
    path_to_save_input_catalog_info = f'/lustre/t0/scratch/users/{user}/pz-compute/ondemand/data'
    os.makedirs(path_to_save_input_catalog_info, exist_ok=True)

In [None]:
# Reading the catalog and getting the columns
catalog_ddf = dd.read_parquet(catalog_files_paths)
if ids_are_in_the_index:
    catalog_ddf = catalog_ddf.reset_index()
catalog_columns = catalog_ddf.columns
catalog_columns_list = catalog_columns.to_list()

# Getting general information about the catalog files
file_info_list = []
for file_path in catalog_files_paths:
    try:
        # File size
        file_size = os.stat(file_path).st_size  # Size in bytes
        file_size_gb = file_size / (1024 ** 3)  # Converting to gigabytes
        
        # Counting the rows using parquet metadata
        parquet_file = pq.ParquetFile(file_path)  # Loading parquet metadata
        num_rows = parquet_file.metadata.num_rows  # Number of rows in the file
        
        # Adding information to the dictionary
        file_info_list.append({
            "file": file_path,
            "size_on_disk": file_size,
            "gbs": file_size_gb,
            "rows": num_rows
        })
    except FileNotFoundError:
        print(f"File not found: {file_path}")
    except Exception as e:
        print(f"Error processing file {file_path}: {e}")

# Creating a dataframe with the files info
input_info_frame = pd.DataFrame(file_info_list)

# Calculating statistics
num_partitions = len(input_info_frame)
min_size_on_disk = input_info_frame["gbs"].min() if not input_info_frame.empty else 0
max_size_on_disk = input_info_frame["gbs"].max() if not input_info_frame.empty else 0
total_size_on_disk = input_info_frame["gbs"].sum()

# Getting the rows count corresponding to the min and max size files
min_size_file = input_info_frame[input_info_frame["gbs"] == min_size_on_disk]
max_size_file = input_info_frame[input_info_frame["gbs"] == max_size_on_disk]
min_rows = min_size_file["rows"].iloc[0] if not min_size_file.empty else 0
max_rows = max_size_file["rows"].iloc[0] if not max_size_file.empty else 0

total_rows = input_info_frame["rows"].sum() if not input_info_frame.empty else 0
avg_file_size = total_size_on_disk / num_partitions if num_partitions > 0 else 0
avg_rows_per_file = total_rows / num_partitions if num_partitions > 0 else 0

# Preparing the table in Markdown format
markdown_table = f"""
| Metric               | Value                       |
|----------------------|-----------------------------|
| Number of files      | {len(catalog_files_paths)}  |
| Number of columns    | {len(catalog_columns_list)} |
| Min file size        | {min_size_on_disk:.2f} GB; {min_rows} rows |
| Max file size        | {max_size_on_disk:.2f} GB; {max_rows} rows |
| Average file size    | {avg_file_size:.2f} GB; {avg_rows_per_file:.0f} rows |
| Total size on disk   | {total_size_on_disk:.2f} GB; {total_rows} rows |
"""

# Display the Markdown table in a cell
display(Markdown(markdown_table))

# Saving the information if required
if save_input_catalog_info:
    # Ensure the directory exists
    os.makedirs(path_to_save_input_catalog_info, exist_ok=True)
    
    # Save the DataFrame as CSV
    input_info_frame_path = os.path.join(path_to_save_input_catalog_info, "input_info_frame.csv")
    input_info_frame.to_csv(input_info_frame_path, index=False)
    
    # Save the Markdown table as a text file
    markdown_table_path = os.path.join(path_to_save_input_catalog_info, "input_info_summary.txt")
    with open(markdown_table_path, "w") as f:
        f.write(markdown_table)
    
    # Save the provenance information
    provenance_path = os.path.join(path_to_save_input_catalog_info, "input_info_provenance.txt")
    with open(provenance_path, "w") as f:
        f.write(f"catalog_path: {catalog_path}\n")
        f.write(f"catalog_files: {catalog_files}\n")
    
    print(f"Information saved to {path_to_save_input_catalog_info}")

File size distribution and file size histogram.

In [None]:
def compute_and_save_histogram_info(info_frame, bins, labels, type_of_files, logs_dir, save=False, show=False):
    """
    Computes, optionally saves, and optionally shows histogram information (counts and percentages) for given bins and labels.

    Args:
        info_frame (DataFrame): DataFrame containing the file size information.
        bins (list): Bin edges for the histogram.
        labels (list): Labels corresponding to the bins.
        type_of_files (str): Identifier for the type of files (used in filenames).
        logs_dir (str): Directory to save the output files.
        save (bool): Whether to save the histogram information.
        show (bool): Whether to display the histogram information as a Markdown table.
    """
    # Compute histogram counts and percentages
    hist = np.histogram(info_frame["gbs"], bins=bins)[0]
    pcts = hist / len(info_frame) if len(info_frame) > 0 else [0] * len(labels)

    # Save histogram information
    if save and logs_dir:
        os.makedirs(logs_dir, exist_ok=True)
        output_path = os.path.join(logs_dir, f"{type_of_files}_file_size_distribution.txt")
        with open(output_path, "w") as file:
            file.write(f"Bins: {bins} GB\n")
            file.write(f"Labels: {labels} \n \n")
            for i, label in enumerate(labels):
                file.write(f"{label} \t: {hist[i]} \t({pcts[i]*100:.1f} %)\n")
        print(f"Histogram information saved to {output_path}")
        
    # Show histogram information as a Markdown table
    if show:
        # Prepare the Markdown table
        markdown_table = "| Label        | Count | Percentage (%) |\n"
        markdown_table += "|--------------|-------|----------------|\n"
        for i, label in enumerate(labels):
            markdown_table += f"| {label:<12} | {hist[i]:<5} | {pcts[i]*100:.1f}       |\n"

        # Display bins and labels in text, then the Markdown table
        bins_labels_md = f"**Bins**: {bins}  GB \n\n**Labels**: {labels}  \n\n"
        display(Markdown(bins_labels_md + markdown_table))


def plot_and_save_histogram(info_frame, bins, type_of_files, logs_dir, save=False, show=False):
    """
    Plots and optionally saves the histogram of file sizes and displays the bins as Markdown.

    Args:
        info_frame (DataFrame): DataFrame containing the file size information.
        bins (list or None): Bin edges for the histogram. If None, bins will be generated automatically.
        type_of_files (str): Identifier for the type of files (used in filenames).
        logs_dir (str): Directory to save the output files.
        save (bool): Whether to save the histogram plot.
        show (bool): Whether to display the plot and bins information.
    """
    # Plot the histogram
    n, bins_generated, _ = plt.hist(info_frame["gbs"], bins=bins, edgecolor='black')
    plt.xlabel("File size (GB)")
    plt.ylabel("Number of files")
    
    bins_used = bins_generated if bins is None else bins
    bins_used_rounded = [round(float(bin_edge), 3) for bin_edge in bins_used]
    
    # Save the plot
    if save and logs_dir:
        os.makedirs(logs_dir, exist_ok=True)
        output_path_plot = os.path.join(logs_dir, f"{type_of_files}_file_size_histogram.png")
        plt.savefig(output_path_plot)
        print(f"Histogram plot saved to {output_path_plot}")
        
        output_path_bins = os.path.join(logs_dir, f"{type_of_files}_file_size_histogram.txt")
        with open(output_path_bins, "w") as file:
            file.write(f"Bins: {bins_used} GB\n")

    # Show the plot
    if show:
        # Display bins in Markdown format
        bins_md = f"**Bins**: {bins_used_rounded}  GB\n"
        display(Markdown(bins_md))
        
        plt.show()

In [None]:
# Parameters
type_of_files = 'input'

bins_lincc_categories = [0, 0.5, 1, 2, 100]
labels_lincc_categories = ["small-ish", "sweet-spot", "big-ish", "too-big"]

bins_plotted_histogram = None

logs_dir = path_to_save_input_catalog_info if save_input_catalog_info else None

compute_and_save_histogram_info(
    info_frame=input_info_frame, 
    bins=bins_lincc_categories, 
    labels=labels_lincc_categories, 
    type_of_files=type_of_files, 
    logs_dir=logs_dir, 
    save=save_input_catalog_info, 
    show=True
)

print("\n \n")

plot_and_save_histogram(
    info_frame=input_info_frame, 
    bins=bins_plotted_histogram,  
    type_of_files=type_of_files, 
    logs_dir=logs_dir, 
    save=save_input_catalog_info, 
    show=True
)

### How to run commands in the conda env corresponding to the kernel directly in the cell

In [None]:
import os
import subprocess
import sys

def run_command(command_text):
    # Get the path to the current Python executable (used by the active kernel)
    current_python = sys.executable

    # Resolve the path to the Conda environment directory (<env>/bin/python -> <env>)
    conda_env_path = os.path.dirname(os.path.dirname(current_python))

    # Display the active Conda environment path (optional, for confirmation)
    print(f"Active Conda environment path: {conda_env_path}")

    # Locate the base Conda installation (assumes the env lives in <base>/envs/<name>)
    base_conda_path = os.path.dirname(os.path.dirname(conda_env_path))

    # Use the Conda activation script to activate the environment and run the command.
    # `source` is a bash builtin, so run under bash explicitly rather than /bin/sh.
    command = f"source {base_conda_path}/etc/profile.d/conda.sh && conda activate {conda_env_path} && {command_text}"
    subprocess.run(command, shell=True, executable="/bin/bash", check=False)

run_command('conda list')

### Column selection  

Columns included in the skinny tables: 

| column name | data type | description |
| --- | --- | --- |
| objectId | int | Unique identifier |
| coord_ra | float64 | Fiducial ICRS Right Ascension of centroid (degrees) |
| coord_dec | float64 | Fiducial ICRS Declination of centroid (degrees) |
| detect_isPrimary | boolean | True if source has no children and is in the inner region of a coadd patch and is in the inner region of a coadd tract and is not a sky source |
| mag_{u, g, r, i, z, y} | float64 | {u, g, r, i, z, y}-band magnitude converted from final cmodel fit flux measurements |
| magerr_{u, g, r, i, z, y} | float64 | {u, g, r, i, z, y}-band magnitude errors converted from final cmodel fit flux error measurements |
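
The `{u, g, r, i, z, y}` shorthand in the table expands to one column per band. A minimal sketch of the resulting column list follows; the names `BANDS` and `SKINNY_COLUMNS` and the column order are assumptions made for this example.

```python
BANDS = ["u", "g", "r", "i", "z", "y"]

# Identifier, coordinates and flag, followed by one magnitude and one
# magnitude-error column per band
SKINNY_COLUMNS = (
    ["objectId", "coord_ra", "coord_dec", "detect_isPrimary"]
    + [f"mag_{band}" for band in BANDS]
    + [f"magerr_{band}" for band in BANDS]
)
```

A list like this can be passed to `pd.read_parquet(path, columns=SKINNY_COLUMNS)` to load only these columns from a skinny table.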

### Object selection  

Data cleaning to reduce the number of rows in the catalog is recommended for tests or when using the pipeline to create value-added catalogs for science cases. Survey condition maps can be used to remove entire regions at once, based on a given quality threshold.   


**WARNING: Data cleaning must not be applied to the production runs that generate the official data products to be delivered as part of the in-kind contribution. Every detected object from the Object catalog must receive a photo-z estimate regardless of its nature or photometry quality.**      
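
For test runs only, a simple catalog-level cut could look like the sketch below. It is not the survey-condition-map approach and not part of the pipeline: the function name `select_test_objects`, the choice of the i band, and the magnitude limit of 25.0 are all hypothetical values picked for illustration.

```python
import pandas as pd


def select_test_objects(catalog, mag_limit=25.0, band="i"):
    """Keep primary detections brighter than a magnitude limit (test runs only)."""
    is_primary = catalog["detect_isPrimary"].eq(True)
    # Comparisons with NaN magnitudes evaluate to False, so those rows are dropped
    is_bright = catalog[f"mag_{band}"] < mag_limit
    return catalog[is_primary & is_bright].reset_index(drop=True)
```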

### Configuration parameters 

Parameters defined inside `$SCRATCH/bin/rail-slurm-preprocess.batch`: 

```python 
SRUN = 'srun'
SRUN_ARGS = [SRUN, '-n1', '-N1']
LOG = 'log'
PROG = 'rail-slurm-preprocess'
PREPROCESS = 'rail-preprocess-parquet'
PREPROCESS_ARGS = ['--rows=130000', '--apply-dered=sfd', '--apply-detect-flag=True', 
                   '--round-mags=4', '--output-template={fname}-part{idx}.parquet']
```
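
To get a feeling for how many output files to expect, the sketch below estimates the output names for one input file. It rests on assumptions that are not confirmed by this notebook: that `--rows` is the maximum number of rows per output file, that the template is expanded like a Python format string, and that `idx` starts at zero. The helper `expected_outputs` is hypothetical.

```python
import math

ROWS_PER_FILE = 130000  # value of --rows in PREPROCESS_ARGS
OUTPUT_TEMPLATE = "{fname}-part{idx}.parquet"  # value of --output-template


def expected_outputs(fname, num_rows, rows_per_file=ROWS_PER_FILE):
    """List the output names expected for one input file (assumed behavior)."""
    # Assumption: at least one part is written, and parts are numbered from 0
    num_parts = max(1, math.ceil(num_rows / rows_per_file))
    return [OUTPUT_TEMPLATE.format(fname=fname, idx=idx) for idx in range(num_parts)]
```

Under these assumptions, an input file with 300000 rows would be split into three parts.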

### Execute `rail-preprocess-parquet` on Apollo

WARNING: The current working directory must contain a directory named `log`. 
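
A minimal way to satisfy this requirement before submitting the job is to create the directory if it is missing. The helper name `ensure_log_dir` is hypothetical.

```python
from pathlib import Path


def ensure_log_dir(workdir="."):
    """Create the `log` directory inside workdir if it does not exist yet."""
    log_dir = Path(workdir) / "log"
    log_dir.mkdir(parents=True, exist_ok=True)
    return log_dir
```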


In [None]:
# TODO (Helo): make the command-line execution work

In [None]:
#! sbatch -N 2 -n 20 rail-slurm-preprocess.batch  input/ output/ input/objectTable_tract_*.parq

Monitor the Slurm queue:

In [None]:
! squeue

### Output data 

#### Basic QA of skinny tables 

## 1.2 Create Training and Test Sets 


### Representative spectroscopic sample 

A true-z sample randomly selected from the DC2 simulation to mimic a representative spectroscopic sample regarding the color-magnitude-redshift space. 

Split the sample into two random subsets, with 70% of the galaxies designated for training and 30% for tests by adding an extra column `test`: 
* `test=0`: galaxies included in the **training** procedure
* `test=1`: galaxies included in the **test** procedure, mandatorily excluded from the training procedure 
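
The 70/30 split described above can be sketched as follows. This is an illustration, not the Training Set Maker itself; the function name `add_test_flag` and the fixed `seed` are assumptions made so that the example is reproducible.

```python
import numpy as np
import pandas as pd


def add_test_flag(sample, test_fraction=0.3, seed=42):
    """Return a copy of `sample` with a `test` column (0 = training, 1 = test)."""
    rng = np.random.default_rng(seed)
    n_test = int(round(test_fraction * len(sample)))
    flags = np.zeros(len(sample), dtype=int)
    # Draw the test galaxies at random, without replacement
    flags[rng.choice(len(sample), size=n_test, replace=False)] = 1
    out = sample.copy()
    out["test"] = flags
    return out
```

The training subset is then `df[df["test"] == 0]` and the test subset is `df[df["test"] == 1]`.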

In [None]:
# TODO: run the Training Set Maker via the pz-server library


#### Basic QA of the representative training set 

### Realistic spectroscopic sample (TBD)

A true-z sample arbitrarily selected from the DC2 simulation to mimic a realistic spectroscopic sample regarding the color-magnitude-redshift space, based on current spectroscopic data available from the literature. 




#### Basic QA of the realistic training set 

## 1.3 Train the photo-z algorithm  

Train the photo-z algorithm with RAIL (`rail_inform`). Available options: BPZ, FlexZBoost, GPz, LePHARE, and TPZ.  


## 1.4 Photo-z Validation    

### PZ estimates for the Test Set

Run the `rail_estimate` module to produce the photo-z estimates (PDFs) for the Test Set. 

### PZ validation results

#### Metrics and plots 

Run the `rail_evaluate` module to compute PDF metrics. 
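
As an example of what a PDF metric looks like, the sketch below computes the probability integral transform (PIT), i.e. the CDF of each photo-z PDF evaluated at the true redshift. It is a standalone NumPy illustration, not the `rail_evaluate` implementation; the function name `pit_values` and the assumption that all PDFs are tabulated on a shared redshift grid are choices made for this example.

```python
import numpy as np


def pit_values(z_grid, pdfs, z_true):
    """PIT for gridded PDFs: CDF of each PDF evaluated at its true redshift.

    z_grid: 1-D increasing array of redshifts shared by all PDFs.
    pdfs:   2-D array with shape (n_objects, len(z_grid)).
    z_true: 1-D array with n_objects true redshifts.
    """
    z_grid = np.asarray(z_grid, dtype=float)
    pdfs = np.asarray(pdfs, dtype=float)
    # Cumulative trapezoidal integral of each PDF along the grid
    steps = 0.5 * (pdfs[:, 1:] + pdfs[:, :-1]) * np.diff(z_grid)
    cdfs = np.concatenate(
        [np.zeros((pdfs.shape[0], 1)), np.cumsum(steps, axis=1)], axis=1
    )
    cdfs /= cdfs[:, -1:]  # normalize so each CDF ends at 1
    return np.array([np.interp(zt, z_grid, cdf) for zt, cdf in zip(z_true, cdfs)])
```

For well-calibrated PDFs, the PIT values of a large sample are expected to be uniformly distributed between 0 and 1.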

#### PZ Validation conclusions 

Quality assessment, comparison with science requirements. 

# 2. Photo-z Compute 

## Submit pipeline to Apollo cluster 

## Real-time monitoring  

# 3. Post-processing

## Performance evaluation 

## PZ Estimates - QA of final results 

# Export to HTML 
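
One possible way to produce the HTML report is `jupyter nbconvert`. The sketch below only wraps that command; the helper names and the notebook file name in the commented example are hypothetical.

```python
import subprocess


def build_export_command(notebook_path, output_dir="."):
    """Build the nbconvert command that exports a notebook to HTML."""
    return [
        "jupyter", "nbconvert", "--to", "html",
        "--output-dir", output_dir, notebook_path,
    ]


def export_notebook(notebook_path, output_dir="."):
    """Run the export; raises CalledProcessError if nbconvert fails."""
    subprocess.run(build_export_command(notebook_path, output_dir), check=True)


# Example (hypothetical file name):
# export_notebook("pz-compute-e2e.ipynb", output_dir="reports")
```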