<img align="left" src = "images/linea.png" width=120 style="padding: 20px"> 
<img align="left" src = "images/rubin.png" width=140 style="padding: 30px"> 

# [Under construction] PZ Compute - E2E Notebook
### Photo-zs for LSST Object catalog

<br><br>

Notebook contributors: Julia Gschwend, Luigi Silva, Heloisa Mengisztki <br>
Contact: [julia@linea.org.br](mailto:julia@linea.org.br) <br>
Last verified run: **2024-Nov-08** <br>

## README - Disclaimer
This notebook is an alternative front-end for the pipeline Photo-z Compute, originally developed for command line execution on LIneA's HPC environment. It is meant to be used by the "photo-z experts" in charge of the production tasks related to the Brazilian in-kind contribution to LSST. It should **not** be considered as a source of [documentation or user guide](https://github.com/linea-it/pz-compute/tree/main/doc/manpages). 

After each complete execution, this notebook must be exported and saved as an HTML file to serve as an execution report for future provenance tracking. Additional process metadata and provenance information are available in the attached `provenance_info.yaml` file. 

## Notebook contents 

1. Pre-processing: data preparation, photo-z training and validation 
2. Photo-z Compute 
3. Post-processing: analyze results and performance  


Each of these steps was carefully explored in separate notebooks. This notebook contains only the final decisions regarding sample selection and configuration choices.   

--- 

## Imports and auxiliary functions

In [None]:
import tables_io
import getpass
import pyarrow
import yaml
import time
import glob
import h5py
import sys
import os
import re
import qp 

import numpy as np
import pandas as pd
import dask.dataframe as dd
import pyarrow.parquet as pq  
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

from pathlib import Path
from IPython.display import Markdown, display
from IPython.core.magic import register_cell_magic
from IPython import get_ipython

pyarrow.__version__

In [None]:
# Rail modules
from rail.core.data import TableHandle, QPHandle
from rail.core.stage import RailStage
from rail.core.utils import find_rail_file

from rail.estimation.algos.naive_stack import NaiveStackSummarizer
from rail.estimation.algos.point_est_hist import PointEstHistSummarizer
from rail.evaluation.point_to_point_evaluator import PointToPointEvaluator
from rail.evaluation.metrics.cdeloss import *

from qp.metrics.pit import PIT
 
%matplotlib inline
%reload_ext autoreload
%autoreload 2

DS = RailStage.data_store
DS.__class__.allow_overwrite = True

In [None]:
import ondemand_utils as utils
from pzserver import PzServer

utils.run_command("pwd")

In [None]:
cwd = os.getcwd()

If `ondemand_utils` does not import correctly, run the following command in your terminal: 

>```python 
> pip install -e $SCRATCH/pz-compute/ondemand/ondemand_utils/. 
>```

Alternatively, uncomment the following cell, run it, and then restart your kernel.

In [None]:
#! pip install -e $SCRATCH/pz-compute/ondemand/ondemand_utils/. 

In [None]:
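@register_cell_magic
def skip_if(line, cell):
    """Cell magic: skip the cell when the expression in `line` evaluates to True."""
    # `line` is passed to eval(), so use only trusted expressions (e.g. `not will_train`)
    if eval(line):
        return
    get_ipython().run_cell(cell)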

## Data Release: DP0.2 




--- 

# 1. Pre-processing 

## 1.1 Create Skinny tables 

Skinny tables are a subset of the [LSST Object catalog](https://sdm-schemas.lsst.io/dp02.html#Object) that includes only the columns of interest for photo-z algorithms, with ready-to-use data, i.e., fluxes converted into dereddened magnitudes.  
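
As an illustration of the flux-to-magnitude conversion, the sketch below assumes fluxes expressed in nanojansky, for which the AB zero point is 31.4 mag. The function names and the per-band extinction argument `a_band` are hypothetical, and this is not the code used by `rail-preprocess-parquet`.

```python
import numpy as np

AB_ZEROPOINT_NJY = 31.4  # AB zero point for fluxes expressed in nJy


def flux_to_mag(flux_njy, a_band=0.0):
    """Convert flux (nJy) to AB magnitude and subtract the extinction a_band (mag)."""
    flux = np.asarray(flux_njy, dtype=float)
    # Non-positive fluxes have no defined magnitude: map them to NaN
    safe_flux = np.where(flux > 0, flux, np.nan)
    return -2.5 * np.log10(safe_flux) + AB_ZEROPOINT_NJY - a_band


def fluxerr_to_magerr(flux_njy, fluxerr_njy):
    """First-order propagation of the flux error into a magnitude error."""
    flux = np.asarray(flux_njy, dtype=float)
    err = np.asarray(fluxerr_njy, dtype=float)
    safe_flux = np.where(flux > 0, flux, np.nan)
    return 2.5 / np.log(10) * err / safe_flux


# Toy values: two valid fluxes and one negative flux
mags = flux_to_mag([100.0, 1.0, -5.0], a_band=0.1)
magerrs = fluxerr_to_magerr([100.0, 1.0, -5.0], [1.0, 0.5, 1.0])
print(mags, magerrs)
```

Returning NaN for non-positive fluxes keeps every row in the table, which matters for production runs where no object may be dropped.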

### Input data

The very first input data of this end-to-end sequence is the original LSST Object catalog for DP0.2, stored on the Lustre file system at: 

`/lustre/t1/cl/lsst/dp02/primary/catalogs/object/` 

Filename pattern: `objectTable_tract_xxxx_DC2_2_2i_runs_DP0_2_v23_0_1_PREOPS-905_step3_x_2022xxxxTxxxxxxZ.parq`

File size summary.

In [None]:
# Paths for the catalog files
catalog_path = '/lustre/t1/cl/lsst/dp02/primary/catalogs/object'
catalog_files = '*.parq'
catalog_files_paths = [f for f in glob.glob(os.path.join(catalog_path, catalog_files))]

# If the IDs are in the index column, reset the index and add a column corresponding to the IDs.
ids_are_in_the_index = True

# Defining if you want to save the file size distribution, file size histogram and summarize pixels.
save_input_catalog_info = False

# If you choose True above, select the path to save the info.
if save_input_catalog_info:
    user = getpass.getuser()
    path_to_save_input_catalog_info = f'/lustre/t0/scratch/users/{user}/pz-compute/ondemand/data'
    os.makedirs(path_to_save_input_catalog_info, exist_ok=True)

In [None]:
# Reading the catalog and getting the columns
catalog_ddf = dd.read_parquet(catalog_files_paths)
if ids_are_in_the_index:
    catalog_ddf = catalog_ddf.reset_index()
catalog_columns = catalog_ddf.columns
catalog_columns_list = catalog_columns.to_list()

# Getting general information about the catalog files
file_info_list = []
for file_path in catalog_files_paths:
    try:
        # File size
        file_size = os.stat(file_path).st_size  # Size in bytes
        file_size_gb = file_size / (1024 ** 3)  # Converting to gigabytes
        
        # Counting the rows using parquet metadata
        parquet_file = pq.ParquetFile(file_path)  # Loading parquet metadata
        num_rows = parquet_file.metadata.num_rows  # Number of rows in the file
        
        # Adding information to the dictionary
        file_info_list.append({
            "file": file_path,
            "size_on_disk": file_size,
            "gbs": file_size_gb,
            "rows": num_rows
        })
    except FileNotFoundError:
        print(f"File not found: {file_path}")
    except Exception as e:
        print(f"Error processing file {file_path}: {e}")

# Creating a dataframe with the files info
input_info_frame = pd.DataFrame(file_info_list)

# Calculating statistics
num_partitions = len(input_info_frame)
min_size_on_disk = input_info_frame["gbs"].min() if not input_info_frame.empty else 0
max_size_on_disk = input_info_frame["gbs"].max() if not input_info_frame.empty else 0
total_size_on_disk = input_info_frame["gbs"].sum()

# Getting the rows count corresponding to the min and max size files
min_size_file = input_info_frame[input_info_frame["gbs"] == min_size_on_disk]
max_size_file = input_info_frame[input_info_frame["gbs"] == max_size_on_disk]
min_rows = min_size_file["rows"].iloc[0] if not min_size_file.empty else 0
max_rows = max_size_file["rows"].iloc[0] if not max_size_file.empty else 0

total_rows = input_info_frame["rows"].sum() if not input_info_frame.empty else 0
avg_file_size = total_size_on_disk / num_partitions if num_partitions > 0 else 0
avg_rows_per_file = total_rows / num_partitions if num_partitions > 0 else 0

# Preparing the table in Markdown format
markdown_table = f"""
| Metric               | Value                       |
|----------------------|-----------------------------|
| Number of files      | {len(catalog_files_paths)}  |
| Number of columns    | {len(catalog_columns_list)} |
| Min file size        | {min_size_on_disk:.2f} GB; {min_rows} rows |
| Max file size        | {max_size_on_disk:.2f} GB; {max_rows} rows |
| Average file size    | {avg_file_size:.2f} GB; {avg_rows_per_file:.0f} rows |
| Total size on disk   | {total_size_on_disk:.2f} GB; {total_rows} rows |
"""

# Display the Markdown table in a cell
display(Markdown(markdown_table))

# Saving the information if required
if save_input_catalog_info:
    # Ensure the directory exists
    os.makedirs(path_to_save_input_catalog_info, exist_ok=True)
    
    # Save the DataFrame as CSV
    input_info_frame_path = os.path.join(path_to_save_input_catalog_info, "input_info_frame.csv")
    input_info_frame.to_csv(input_info_frame_path, index=False)
    
    # Save the Markdown table as a text file
    markdown_table_path = os.path.join(path_to_save_input_catalog_info, "input_info_summary.txt")
    with open(markdown_table_path, "w") as f:
        f.write(markdown_table)
    
    # Save the provenance information
    provenance_path = os.path.join(path_to_save_input_catalog_info, "input_info_provenance.txt")
    with open(provenance_path, "w") as f:
        f.write(f"catalog_path: {catalog_path}\n")
        f.write(f"catalog_files: {catalog_files}\n")
    
    print(f"Information saved to {path_to_save_input_catalog_info}")

File size distribution and file size histogram.

In [None]:
def compute_and_save_histogram_info(info_frame, bins, labels, type_of_files, logs_dir, save=False, show=False):
    """
    Computes, optionally saves, and optionally shows histogram information (counts and percentages) for given bins and labels.

    Args:
        info_frame (DataFrame): DataFrame containing the file size information.
        bins (list): Bin edges for the histogram.
        labels (list): Labels corresponding to the bins.
        type_of_files (str): Identifier for the type of files (used in filenames).
        logs_dir (str): Directory to save the output files.
        save (bool): Whether to save the histogram information.
        show (bool): Whether to display the histogram information as a Markdown table.
    """
    # Compute histogram counts and percentages
    hist = np.histogram(info_frame["gbs"], bins=bins)[0]
    pcts = hist / len(info_frame) if len(info_frame) > 0 else [0] * len(labels)

    # Save histogram information
    if save and logs_dir:
        os.makedirs(logs_dir, exist_ok=True)
        output_path = os.path.join(logs_dir, f"{type_of_files}_file_size_distribution.txt")
        with open(output_path, "w") as file:
            file.write(f"Bins: {bins} GB\n")
            file.write(f"Labels: {labels} \n \n")
            for i, label in enumerate(labels):
                file.write(f"{label} \t: {hist[i]} \t({pcts[i]*100:.1f} %)\n")
        print(f"Histogram information saved to {output_path}")
        
    # Show histogram information as a Markdown table
    if show:
        # Prepare the Markdown table
        markdown_table = "| Label        | Count | Percentage (%) |\n"
        markdown_table += "|--------------|-------|----------------|\n"
        for i, label in enumerate(labels):
            markdown_table += f"| {label:<12} | {hist[i]:<5} | {pcts[i]*100:.1f}       |\n"

        # Display bins and labels in text, then the Markdown table
        bins_labels_md = f"**Bins**: {bins}  GB \n\n**Labels**: {labels}  \n\n"
        display(Markdown(bins_labels_md + markdown_table))


def plot_and_save_histogram(info_frame, bins, type_of_files, logs_dir, save=False, show=False):
    """
    Plots and optionally saves the histogram of file sizes and displays the bins as Markdown.

    Args:
        info_frame (DataFrame): DataFrame containing the file size information.
        bins (list or None): Bin edges for the histogram. If None, bins will be generated automatically.
        type_of_files (str): Identifier for the type of files (used in filenames).
        logs_dir (str): Directory to save the output files.
        save (bool): Whether to save the histogram plot.
        show (bool): Whether to display the plot and bins information.
    """
    # Plot the histogram
    n, bins_generated, _ = plt.hist(info_frame["gbs"], bins=bins, edgecolor='black')
    plt.xlabel("File size (GB)")
    plt.ylabel("Number of files")
    
    bins_used = bins_generated if bins is None else bins
    bins_used_rounded = [round(float(bin_edge), 3) for bin_edge in bins_used]
    
    # Save the plot
    if save and logs_dir:
        os.makedirs(logs_dir, exist_ok=True)
        output_path_plot = os.path.join(logs_dir, f"{type_of_files}_file_size_histogram.png")
        plt.savefig(output_path_plot)
        print(f"Histogram plot saved to {output_path_plot}")
        
        output_path_bins = os.path.join(logs_dir, f"{type_of_files}_file_size_histogram.txt")
        with open(output_path_bins, "w") as file:
            file.write(f"Bins: {bins_used} GB\n")

    # Show the plot
    if show:
        # Display bins in Markdown format
        bins_md = f"**Bins**: {bins_used_rounded}  GB\n"
        display(Markdown(bins_md))
        
        plt.show()

In [None]:
# Parameters
type_of_files = 'input'

bins_lincc_categories = [0, 0.5, 1, 2, 100]
labels_lincc_categories = ["small-ish", "sweet-spot", "big-ish", "too-big"]

bins_plotted_histogram = None

logs_dir = path_to_save_input_catalog_info if save_input_catalog_info else None

compute_and_save_histogram_info(
    info_frame=input_info_frame, 
    bins=bins_lincc_categories, 
    labels=labels_lincc_categories, 
    type_of_files=type_of_files, 
    logs_dir=logs_dir, 
    save=save_input_catalog_info, 
    show=True
)

print("\n \n")

plot_and_save_histogram(
    info_frame=input_info_frame, 
    bins=bins_plotted_histogram,  
    type_of_files=type_of_files, 
    logs_dir=logs_dir, 
    save=save_input_catalog_info, 
    show=True
)

### Column selection  

Columns included in the skinny tables: 

| column name | data type | description |
| ---         | ---       | ---         |
| objectId    | int       | Unique identifier |
| coord_ra    | float64   | Fiducial ICRS Right Ascension of centroid (degrees) |
| coord_dec   | float64   | Fiducial ICRS Declination of centroid (degrees) |
| detect_isPrimary | boolean | True if source has no children and is in the inner region of a coadd patch and is in the inner region of a coadd tract and is not a sky source |
| mag_{u, g, r, i, z, y} | float64 | {u, g, r, i, z, y}-band magnitude converted from final cmodel fit flux measurements |
| magerr_{u, g, r, i, z, y} | float64 | {u, g, r, i, z, y}-band magnitude errors converted from final cmodel fit flux error measurements |

### Object selection  

Data cleaning to reduce the number of rows in the catalog is recommended for tests or when using the pipeline to create value-added catalogs for science cases. Survey condition maps can be used to remove entire regions at once, based on a given quality threshold.   


**WARNING: Data cleaning must not be applied on the production runs to generate the official data products to be delivered as part of the in-kind contribution. Every detected object from the Object catalog must receive a photo-z estimate regardless of its nature or photometry quality.**      

### Configuration parameters 

Parameters defined inside `$SCRATCH/bin/rail-slurm-preprocess.batch`: 

```python 
SRUN = 'srun'
SRUN_ARGS = [SRUN, '-n1', '-N1']
LOG = 'log'
PROG = 'rail-slurm-preprocess'
PREPROCESS = 'rail-preprocess-parquet'
PREPROCESS_ARGS = ['--rows=130000', '--apply-dered=sfd', '--apply-detect-flag=True', 
                   '--round-mags=4', '--output-template={fname}-part{idx}.parquet']
```
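
A rough sketch of what `--rows=130000` and the output template imply, assuming each input file is split into consecutive chunks of at most `rows` rows and that `{fname}` and `{idx}` are filled with the input file stem and a zero-based chunk index. Both assumptions, the helper name `expected_outputs`, and the example file name are illustrative only; the manpages remain the reference for the actual behavior.

```python
import math
from pathlib import Path


def expected_outputs(input_file, n_rows, rows_per_part=130000,
                     template="{fname}-part{idx}.parquet"):
    """List hypothetical output names for one input file split by row count."""
    n_parts = max(1, math.ceil(n_rows / rows_per_part))
    fname = Path(input_file).stem  # file name without directory and extension
    return [template.format(fname=fname, idx=idx) for idx in range(n_parts)]


# Hypothetical input file with 300,000 rows
parts = expected_outputs("objectTable_tract_0001.parq", n_rows=300000)
print(parts)
```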

### Execute `rail-preprocess-parquet` on Apollo

WARNING: The current working directory must contain a directory named `log`. 


In [None]:
# TODO (Helo): make the command-line execution work

In [None]:
#! sbatch -N 2 -n 20 rail-slurm-preprocess.batch  input/ output/ input/objectTable_tract_*.parq

Monitor the Slurm queue:

In [None]:
! squeue

### Output data 

#### Basic QA of skinny tables 

## 1.2 Create Training and Test Sets 

In [None]:
will_train = utils.load_config('process_info.yaml')['will_train']
will_train

## 1.2.1 Creating a training and testing set

### Representative spectroscopic sample 

A true-z sample randomly selected from the DC2 simulation to mimic a representative spectroscopic sample regarding the color-magnitude-redshift space. 

### Download from PZ-Server

In [None]:
%%skip_if not will_train

token = 'a22d527544fb1215105a40cea9d36279ed8220ab'
# with open('.token.txt', 'r') as file:
#     token = file.read()
pz_server = PzServer(token=token, host="pz-dev") # "pz-dev" is the temporary host for test phase  

In [None]:
%%skip_if not will_train

pz_server.download_product('72_pzcompute_results_for_qa_validation')

In [None]:
%%skip_if not will_train

! unzip {cwd}/72_pzcompute_results_for_qa_validation_*.zip

In [None]:
%%skip_if not will_train

! unzip {cwd}/validation_set.zip

### Creating a training and testing set

In [None]:
%%skip_if not will_train

file_path = f'{cwd}/validation_set.hdf5'

In [None]:
%%skip_if not will_train

full_file = find_rail_file(file_path)
full_data = DS.read_file('full_set_tests', TableHandle, full_file)

truth = tables_io.convertObj(full_data(), tables_io.types.PD_DATAFRAME)
truth

Split the sample into two random subsets, with 70% of the galaxies designated for training and 30% for tests by adding an extra column `test`: 
* `test=0`: galaxies included in the **training** procedure
* `test=1`: galaxies included in the **test** procedure, mandatorily excluded from the training procedure 
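
The 70/30 split described above can be sketched as follows. The helper name `add_test_flag`, the fixed random seed, and the toy catalog are assumptions made for illustration; they are not taken from the pipeline.

```python
import numpy as np
import pandas as pd


def add_test_flag(df, test_fraction=0.3, seed=42):
    """Return a copy of df with a `test` column: 1 for test, 0 for training."""
    rng = np.random.default_rng(seed)
    n_test = int(round(test_fraction * len(df)))
    flags = np.zeros(len(df), dtype=int)
    # Draw the test rows without replacement so no galaxy is picked twice
    flags[rng.choice(len(df), size=n_test, replace=False)] = 1
    out = df.copy()
    out["test"] = flags
    return out


# Toy catalog standing in for the true-z sample
toy = pd.DataFrame({"redshift": np.linspace(0.0, 3.0, 1000)})
flagged = add_test_flag(toy)
training_set = flagged[flagged["test"] == 0]
test_set = flagged[flagged["test"] == 1]
print(len(training_set), len(test_set))
```

Keeping both subsets in one table with a flag column makes it easy to verify that no test galaxy leaks into the training procedure.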

In [None]:
%%skip_if not will_train

train_file_hdf5 = 'training_set.hdf5'
test_file_hdf5 = 'validation_set.hdf5'

#### Basic QA of the representative training set 

In [None]:
%%skip_if not will_train

## todo implement here basic qa

### Realistic spectroscopic sample (TBD)

A true-z sample arbitrarily selected from the DC2 simulation to mimic a realistic spectroscopic sample regarding the color-magnitude-redshift space, based on current spectroscopic data available from the literature. 




In [None]:
%%skip_if not will_train

#### Basic QA of the realistic training set 

In [None]:
%%skip_if not will_train

## 1.2.2  Train the photo-z algorithm  

Train the photo-z algorithm with RAIL (`rail_inform`). Available options: BPZ, FlexZBoost, GPz, LePHARE, and TPZ.  

In [None]:
%%skip_if not will_train

pz_train_configs = utils.load_config("pz-train.yaml")
pz_train_configs

In [None]:
%%skip_if not will_train

train_configs = utils.load_config(pz_train_configs['param_file'])
train_configs

In [None]:
%%skip_if not will_train
# utils.run_pz_train()

### SUBMIT TRAINING JOB

Once these files contain all the configurations you want, run one of the following commands in the terminal:

>```python 
>pz-train-dev or pz-train
>```


In [None]:
%%skip_if not will_train

pz_train_job_id = utils.get_last_job_id()
utils.monitor_job(pz_train_job_id)

## 1.2.3  Photo-z Validation    

### PZ estimates for the Test Set

For details on the fields of the YAML files, see: https://github.com/linea-it/pz-compute/tree/main/doc/manpages

Run `rail_estimate` module to produce the photo-z estimates (PDFs) for the Test Set. 

In [None]:
%%skip_if not will_train

utils.load_config('process_info.yaml')

In [None]:
%%skip_if not will_train

!mkdir -p input_test
!mkdir -p output_test

In [None]:
%%skip_if not will_train

!mv {test_file_hdf5} input_test

In [None]:
%%skip_if not will_train

pz_compute_configs = utils.load_config('pz-compute.yaml')
pz_compute_configs['inputdir'] = 'input_test'
pz_compute_configs['outputdir'] = 'output_test'
pz_compute_configs

In [None]:
%%skip_if not will_train
with open('pz-compute.yaml', 'w') as outfile:
    yaml.dump(pz_compute_configs, outfile)

utils.load_config('pz-compute.yaml')

In [None]:
%%skip_if not will_train

utils.load_config(pz_compute_configs['param_file'])

In [None]:
%%skip_if not will_train

# utils.run_pz_compute()

### Run test pz-compute

Once these files contain all the configurations you want, run one of the following commands in the terminal:

>```python 
>pz-compute-dev or pz-compute
>```

### Monitor run test pz-compute

In [None]:
%%skip_if not will_train

pz_compute_job_id = utils.get_last_job_id()
utils.monitor_job(pz_compute_job_id)

### PZ validation results

For a complete explanation of the metrics, see this notebook: https://github.com/linea-it/pz-compute/blob/main/doc/output-validation.ipynb

#### Metrics and plots 

Run `rail_evaluate` module to compute PDF metrics. 

In [None]:
%%skip_if not will_train

test_set_output_path = f'{cwd}/output_test/{test_file_hdf5}'
pdfs_file_output = find_rail_file(test_set_output_path)
table = tables_io.read(pdfs_file_output, fmt='hdf5')
table

### Adding Zmode to the output

Add the mode of each object's PDF to the output file. Each `zmode` value is used as the point photo-z estimate for that object.
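
For PDFs stored on a common redshift grid, the mode is the grid value where each PDF peaks. The snippet below is a minimal sketch with toy Gaussian PDFs; `pdf_modes` is a hypothetical helper and does not reproduce the implementation of `utils.add_zmodes`.

```python
import numpy as np


def pdf_modes(x_vals, y_vals):
    """Return the grid redshift at which each gridded PDF reaches its maximum."""
    x = np.asarray(x_vals, dtype=float).ravel()
    y = np.asarray(y_vals, dtype=float)
    return x[np.argmax(y, axis=1)]


# Toy PDFs: two Gaussians centered at z = 0.5 and z = 1.2
x_grid = np.linspace(0.0, 3.0, 301)
centers = np.array([0.5, 1.2])
pdfs = np.exp(-0.5 * ((x_grid[None, :] - centers[:, None]) / 0.1) ** 2)

zmode = pdf_modes(x_grid, pdfs)
print(zmode)
```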

In [None]:
%%skip_if not will_train

utils.add_zmodes(table, pdfs_file_output)

In [None]:
%%skip_if not will_train

output_pdfs = DS.read_file(pdfs_file_output, QPHandle, pdfs_file_output)
output_pdfs().build_tables()

#### Reading the Truth table

In [None]:
%%skip_if not will_train

ztrue_file = find_rail_file(f'{cwd}/input_test/{test_file_hdf5}')
ztrue_data = DS.read_file('ztrue_data', TableHandle, ztrue_file)

truth = tables_io.convertObj(ztrue_data(), tables_io.types.PD_DATAFRAME)
truth.head()

In [None]:
%%skip_if not will_train
x_vals = output_pdfs().metadata()['xvals'] #the photoz bins
y_vals = output_pdfs().build_tables()['data']['yvals'] #the pdfs

print(f"ztrue with {len(truth)} objects")
print(f"pdfs output with {len(y_vals)} objects")

#### Point-to-point metrics - Summary statistics

1. point_stats_iqr: 'Interquartile range from 0.25 to 0.75', i.e., the middle 50% of the distribution of point_stats_ez, robust to outliers
2. point_bias: Median of point_stats_ez
3. point_outlier_rate: Catastrophic outlier rate, defined in the Science Book as the fraction of galaxies with ez larger than max(0.06, 3sigma). The 0.06 floor keeps the fraction reasonable when sigma is very small.
4. point_stats_sigma_mad: Sigma of the median absolute deviation
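
The four statistics listed above can be approximated with plain NumPy as a cross-check. This sketch assumes ez = (zphot - ztrue) / (1 + ztrue), an IQR rescaled by 1.349 to a Gaussian-equivalent sigma, a MAD rescaled by 1.4826, and outliers counted on |ez|. These conventions and the function name `point_metrics` are assumptions and may differ in detail from the RAIL evaluator.

```python
import numpy as np


def point_metrics(zphot, ztrue):
    """Approximate point-estimate metrics from photo-z and true redshifts."""
    zphot = np.asarray(zphot, dtype=float)
    ztrue = np.asarray(ztrue, dtype=float)
    ez = (zphot - ztrue) / (1.0 + ztrue)
    q25, q75 = np.percentile(ez, [25, 75])
    sigma_iqr = (q75 - q25) / 1.349  # IQR rescaled to a Gaussian sigma
    bias = np.median(ez)
    sigma_mad = 1.4826 * np.median(np.abs(ez - bias))
    threshold = max(0.06, 3.0 * sigma_iqr)  # floor at 0.06 for small sigma
    outlier_rate = np.mean(np.abs(ez) > threshold)
    return {"sigma_iqr": sigma_iqr, "bias": bias,
            "sigma_mad": sigma_mad, "outlier_rate": outlier_rate}


# Toy sample: Gaussian scatter of 0.02 in ez plus 1% catastrophic outliers
rng = np.random.default_rng(0)
ztrue = rng.uniform(0.0, 3.0, size=5000)
zphot = ztrue + rng.normal(0.0, 0.02, size=5000) * (1.0 + ztrue)
zphot[:50] = ztrue[:50] + 1.0 * (1.0 + ztrue[:50])

metrics = point_metrics(zphot, ztrue)
print(metrics)
```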

In [None]:
%%skip_if not will_train

stage_dict = dict(
    metrics=['point_stats_ez', 'point_stats_iqr', 'point_bias', 'point_outlier_rate', 'point_stats_sigma_mad'],
    _random_state=None,
    hdf5_groupname='photometry',
    point_estimate_key='zmode',
    chunk_size=10000,
    metric_config={
        'point_stats_iqr':{'tdigest_compression': 100},
    }
)
ptp_stage = PointToPointEvaluator.make_stage(name='point_to_point', **stage_dict)
ptp_results = ptp_stage.evaluate(output_pdfs, ztrue_data)
ptp_results

In [None]:
%%skip_if not will_train

results_df = tables_io.convertObj(ptp_results['summary'](), tables_io.types.PD_DATAFRAME)
results_df

### Distribution-to-point metrics - CDF-based metrics

1. cdeloss: [Conditional Density Estimation](https://vitaliset.github.io/conditional-density-estimation/) loss
2. pit: [Probability Integral Transform](https://en.wikipedia.org/wiki/Probability_integral_transform)
3. cvm: [Cramer-von Mises](https://en.wikipedia.org/wiki/Cram%C3%A9r%E2%80%93von_Mises_criterion)
4. ks: [Kolmogorov-Smirnov](https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test)
5. ad: [Anderson-Darling](https://en.wikipedia.org/wiki/Anderson%E2%80%93Darling_test)
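
To make the PIT-based metrics concrete, the sketch below computes PIT values (the CDF of each PDF evaluated at the true redshift) for gridded PDFs and the KS distance between their distribution and a uniform one. The helper names and the toy data are hypothetical; the notebook itself relies on the `qp` `PIT` class for these quantities.

```python
import numpy as np


def pit_values(x_grid, pdfs, ztrue):
    """PIT = CDF of each gridded PDF evaluated at the true redshift."""
    x = np.asarray(x_grid, dtype=float).ravel()
    pdfs = np.asarray(pdfs, dtype=float)
    # Cumulative trapezoidal integration along the redshift grid
    areas = 0.5 * (pdfs[:, 1:] + pdfs[:, :-1]) * np.diff(x)
    cdfs = np.concatenate([np.zeros((len(pdfs), 1)), np.cumsum(areas, axis=1)], axis=1)
    cdfs /= cdfs[:, -1:]  # force each CDF to end at 1
    return np.array([np.interp(z, x, cdf) for z, cdf in zip(ztrue, cdfs)])


def ks_statistic_uniform(pit):
    """KS distance between the empirical PIT distribution and U(0, 1)."""
    pit = np.sort(np.asarray(pit, dtype=float))
    n = len(pit)
    d_plus = np.max(np.arange(1, n + 1) / n - pit)
    d_minus = np.max(pit - np.arange(0, n) / n)
    return max(d_plus, d_minus)


# Toy example: Gaussian PDFs whose true redshifts are drawn from the same Gaussians
rng = np.random.default_rng(0)
x_grid = np.linspace(0.0, 3.0, 601)
centers = rng.uniform(0.5, 2.5, size=2000)
ztrue = centers + rng.normal(0.0, 0.1, size=2000)
pdfs = np.exp(-0.5 * ((x_grid[None, :] - centers[:, None]) / 0.1) ** 2)

pit = pit_values(x_grid, pdfs, ztrue)
ks = ks_statistic_uniform(pit)
print(f"KS statistic: {ks:.4f}")
```

Because the toy PDFs are well calibrated by construction, the PIT values should be close to uniform and the KS statistic small; miscalibrated PDFs would push it up.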

In [None]:
%%skip_if not will_train

cdelossobj = CDELoss(output_pdfs.data, x_vals.ravel(), ztrue_data()['redshift'])
cde_stat_and_pval = cdelossobj.evaluate()

pitobj = PIT(output_pdfs(), ztrue_data()['redshift'])
pit_out_rate = pitobj.evaluate_PIT_outlier_rate()
ks_stat_and_pval = pitobj.evaluate_PIT_KS()
cvm_stat_and_pval = pitobj.evaluate_PIT_CvM()
ad_stat_crit_sig = pitobj.evaluate_PIT_anderson_ksamp()
ad_stat_crit_sig_cut = pitobj.evaluate_PIT_anderson_ksamp(pit_min=0.01, pit_max=0.99)

In [None]:
%%skip_if not will_train

STD_DEV = results_df['point_stats_iqr'][0]
BIAS = results_df['point_bias'][0]
OUTRATE = results_df['point_outlier_rate'][0]  # point-estimate outlier rate
CDE_LOSS = cde_stat_and_pval.statistic
PIT_OUT_RATE = pit_out_rate  # named to avoid shadowing the imported PIT class
AD = ad_stat_crit_sig.statistic
CVM = cvm_stat_and_pval.statistic
KS = ks_stat_and_pval.statistic

AD_P = ad_stat_crit_sig.pvalue
CVM_P = cvm_stat_and_pval.pvalue
KS_P = ks_stat_and_pval.pvalue

In [None]:
%%skip_if not will_train

markdown_table = f"""
| metric   | classification | values    | limits                   | reference |
| :--------| :-------------:| :-------: | :----------------------: | :-------: |
| STD DEV  | POINT Metric   | {STD_DEV:.4f} | < 0.05(1 + zphot)       |  [Schmidt et al., 2020](http://doi.org/10.1093/mnras/staa2799)| 
| BIAS     | POINT Metric   | {BIAS:.4f}    | < 0.003                 | [Schmidt et al., 2020](http://doi.org/10.1093/mnras/staa2799) | 
| OUTRATE  | POINT Metric   | {OUTRATE:.4f} | < 10%                   | [Schmidt et al., 2020](http://doi.org/10.1093/mnras/staa2799) | 
| CDE loss | CDE metric     | {CDE_LOSS:.4f} | lower the better        | [Izbicki & Lee, 2017](https://arxiv.org/abs/1704.08095)       |
| PIT      | PIT Metric     | {PIT_OUT_RATE:.4f} (outlier rate) | approx. 1               | [Polsterer et al., 2016](https://arxiv.org/abs/1608.08016)    |
| AD       | PIT Metrics    | {AD:.4f}    p-value {AD_P:.4f}  | lower for a uniform PIT | [Schmidt et al., 2020](http://doi.org/10.1093/mnras/staa2799) |
| CVM      | PIT Metrics    | {CVM:.4f}   p-value {CVM_P:.4f}  | lower for a uniform PIT | [Schmidt et al., 2020](http://doi.org/10.1093/mnras/staa2799) |
| KS       | PIT Metrics    | {KS:.4f}    p-value {KS_P:.4f}  | lower for a uniform PIT | [Schmidt et al., 2020](http://doi.org/10.1093/mnras/staa2799) |
"""

display(Markdown(markdown_table))

### Redshift stacked distribution of the pdfs
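
The two summaries built in the cells below can be pictured as follows: a naive stack sums the individual PDFs on the common grid and normalizes the result, while the point-estimate histogram bins one value (here the mode) per object. This is a simplified NumPy sketch on toy data with hypothetical helper names, not the RAIL summarizer code.

```python
import numpy as np


def trapezoid(y, x):
    """Trapezoidal integral of y over x."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))


def naive_stack(x_grid, pdfs):
    """Sum gridded PDFs and normalize the stack to unit area."""
    x = np.asarray(x_grid, dtype=float).ravel()
    stacked = np.asarray(pdfs, dtype=float).sum(axis=0)
    return stacked / trapezoid(stacked, x)


def mode_histogram(x_grid, pdfs, bins):
    """Normalized histogram of the per-object PDF modes."""
    x = np.asarray(x_grid, dtype=float).ravel()
    modes = x[np.argmax(pdfs, axis=1)]
    return np.histogram(modes, bins=bins, density=True)


# Toy PDFs: 500 Gaussians with random centers
rng = np.random.default_rng(1)
x_grid = np.linspace(0.0, 3.0, 301)
centers = rng.uniform(0.3, 2.7, size=500)
pdfs = np.exp(-0.5 * ((x_grid[None, :] - centers[:, None]) / 0.1) ** 2)

nz_stack = naive_stack(x_grid, pdfs)
nz_hist, bin_edges = mode_histogram(x_grid, pdfs, bins=np.linspace(0.0, 3.0, 31))
```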

In [None]:
%%skip_if not will_train

point_estimate_test = PointEstHistSummarizer.make_stage(name="point_estimate_test")
naive_stack_test = NaiveStackSummarizer.make_stage(name="naive_stack_test")

point_estimate_ens = point_estimate_test.summarize(output_pdfs)
naive_stack_ens = naive_stack_test.summarize(output_pdfs)

In [None]:
%%skip_if not will_train

naive_stack_ens.data.plot_native(xlim=(0, 3)) #pdfs
plt.show()

In [None]:
%%skip_if not will_train

point_estimate_ens.data.plot_native(xlim=(0, 3)) #zmode
plt.show()

### Zphot vs. Ztrue

In [None]:
%%skip_if not will_train

utils.photoz_specz_plot(ztrue_data, output_pdfs)

In [None]:
%%skip_if not will_train

utils.plot_pit_qq(output_pdfs.data.objdata()['yvals'], x_vals.ravel(), ztrue_data()['redshift'], 
                  title="PIT-QQ plot", code="FlexZBoost", pit_out_rate=pit_out_rate, savefig=False)

In [None]:
plt.show()

In [None]:
%%skip_if not will_train

utils.ks_plot(pitobj)
plt.show()

#### PZ Validation conclusions 

Quality assessment, comparison with science requirements. 

Write your conclusions here.

---
# 2. Photo-z Compute 

In [None]:
utils.load_config('process_info.yaml')

## 2.1 Check Pipeline and Algorithm Configs

In [None]:
pz_compute_configs = utils.load_config('pz-compute.yaml')
pz_compute_configs

# TODO: complete this check (the configuration key is still undefined)
# if pz_compute_configs['']

In [None]:
with open('pz-compute.yaml', 'w') as outfile:
    yaml.dump(pz_compute_configs, outfile)

utils.load_config('pz-compute.yaml')

If you want to specify which Apollo machines to use, modify the `sbatch_args` parameter, for example:  

```yaml 
sbatch_args: -N3 -n 216 -w apl08,apl09,apl16
```

In [None]:
utils.load_config(pz_compute_configs['param_file'])

## 2.2 Submit pipeline to Apollo cluster 

Once these files contain all the configurations you want, run one of the following commands in the terminal:

>```python 
>pz-compute-dev or pz-compute
>```

In [None]:
# utils.run_pz_compute()

## 2.3 Real-time monitoring  

In [None]:
! squeue

In [None]:
pz_compute_job_id = utils.get_last_job_id()
pz_compute_job_id

In [None]:
utils.monitor_job(pz_compute_job_id)

# 3. Post-processing

In [None]:
process_info = utils.load_config('process_info.yaml' )
process_info

## 3.1 Performance evaluation 

In [None]:
utils.run_post_process_evaluation()

In [None]:
process_info = utils.load_config('process_info.yaml' )
process_info

In [None]:
img = mpimg.imread('processes_time_profiler.png')

plt.imshow(img)
plt.axis('off')
plt.show()

## PZ Estimates - QA of final results 

In [None]:
img = mpimg.imread('stack_nz.png')

plt.imshow(img)
plt.axis('off')
plt.show()

# Export to HTML 

In [None]:
os.system(f'jupyter nbconvert --to html {cwd.split("/")[-1]}.ipynb')