<img align="left" src = https://linea.org.br/wp-content/themes/LIneA/imagens/logo-header.jpg width=100 style="padding: 20px"> 

<img align="left" src = https://project.lsst.org/sites/default/files/Rubin-O-Logo_0.png width=160 style="padding: 20px">  

# RAIL - Fundamentals

**Contact author**: Heloisa da Silva Mengisztki ([heloisasmengisztki@gmail.com](mailto:heloisasmengisztki@gmail.com)) 

**Last verified run**: 2023-02-01 <br><br>



***
**DISCLAIMER**: This notebook is not part of RAIL's official documentation. It is an exercise to explore and learn about RAIL's fundamentals, as part of the education program for undergraduate students ("Iniciação Científica") offered by LIneA.  
The official documentation is available on [RAIL's Read The Docs page](https://lsstdescrail.readthedocs.io/en/stable/). 
*** 


RAIL is LSST-DESC software for running the different algorithms used to estimate photometric redshifts. Its main goal is to minimize the impact that different infrastructures can have on different algorithms. To do that, it brings them together in a modular code base that supports the different inputs each algorithm needs and standardizes the outputs, so that their results can be compared more fairly.

RAIL uses four main libraries at its core: <br>
* _tables_io_: for handling data in formats such as HDF5, FITS, etc. <br>
* _qp_: used to parameterize PDFs for metric calculations. <br>
* _ceci_: builds pipelines and produces a .yaml file containing the steps and their configurations. <br>
* _pzflow_: creates a (normalizing) flow used for data creation. <br>

RAIL's code base is organized into four subpackages: 

#### Core.
This is where the main functions that manage the data and files created by the program live. It works almost like the behavioral [chain of responsibility pattern](https://refactoring.guru/pt-br/design-patterns/chain-of-responsibility): you set up a flow in the code in which a request is processed by a handler class that decides whether or not to pass it forward, according to what has been defined. For instance, to run the BPZ algorithm, one creates a request class (e.g. Inform_BPZ_lite) that holds all the inputs/configurations and is handled by its handler class (BPZ_lite).

#### Creation.
Contains all the support for data creation, such as degraders, flow creation, column remapping, etc. It creates .hdf5 files with the data being manipulated.

#### Estimation.
This is where the codes are defined and executed. Estimation is separated into two steps:  <br>

* **inform:** this is the first step of running a photo-z code, where the priors for template fitting are informed and the machine learning codes are trained. <br> 

* **estimate:** where the algorithm is executed through the .estimate() function. The code is wrapped as a RAIL stage so that it can be run in a controlled way. Estimation code can be stored in a yaml file to be run as a ceci module.
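
The inform/estimate split can be pictured with a toy example that does not use RAIL at all. The names `toy_inform` and `toy_estimate` are hypothetical, and the "model" is just the mean redshift per magnitude bin; it is a sketch of the idea, not how any RAIL algorithm works internally.

```python
from statistics import mean

def toy_inform(train_mags, train_redshifts, bin_width=1.0):
    """'Inform' step: learn the mean redshift in each magnitude bin."""
    bins = {}
    for mag, z in zip(train_mags, train_redshifts):
        bins.setdefault(int(mag // bin_width), []).append(z)
    # The returned dict plays the role of the trained "model".
    return {"bin_width": bin_width,
            "mean_z": {b: mean(zs) for b, zs in bins.items()}}

def toy_estimate(model, mags):
    """'Estimate' step: apply the stored model to new magnitudes."""
    # Magnitudes that fall in an unseen bin get the average of the known bins.
    fallback = mean(model["mean_z"].values())
    return [model["mean_z"].get(int(m // model["bin_width"]), fallback)
            for m in mags]

model = toy_inform([20.1, 20.7, 22.3, 22.9], [0.2, 0.4, 0.9, 1.1])
estimates = toy_estimate(model, [20.5, 22.5, 25.0])
print(estimates)
```

The point of the split is that the model is produced once and can then be reused on any number of new samples.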


#### Evaluation.
This step contains the metrics used to assess the performance of the estimation codes. It depends on the availability of a truth table for comparison. For data created by the RAIL Creation subpackage, the true redshifts and true posteriors are available by definition.  

<br>
------
For installation instructions check the official [documentation](https://lsstdescrail.readthedocs.io/en/latest/source/installation.html). <br>

It's important to point out that, as RAIL is still being developed, it may be necessary to update the installed packages once in a while.<br> 
Commands to update: <br>
`pip install pz-rail-bpz --upgrade` <br>
`pip install pz-rail --upgrade`

For RAIL versions, check the [releases page](https://github.com/LSSTDESC/RAIL/releases).

[Adding the kernel to Jupyter](https://lsstdescrail.readthedocs.io/en/latest/source/installation.html#adding-your-kernel-to-jupyter): <br>
 `conda install ipykernel` <br>
 `python -m ipykernel install --user --name [nametocallnewkernel]`

## Imports, setup and some sample data

In [None]:
import os
import sys
import numpy as np
import pandas as pd
import qp
import tables_io
import matplotlib.pyplot as plt

import rail
from rail.core.utils import RAILDIR
from rail.core.data import TableHandle, PqHandle, ModelHandle, QPHandle
from rail.core.stage import RailStage
from rail.core.utilStages import ColumnMapper, TableConverter

from rail.estimation.algos.bpz_lite import Inform_BPZ_lite, BPZ_lite
from rail.evaluation.evaluator import Evaluator

from rail.estimation.algos.knnpz import Inform_KNearNeighPDF

In [None]:
CURR_DIR = os.getcwd()
CURR_DIR, RAILDIR

### Reading a sample

In [None]:
data_columns = ["coadd_objects_id","ra","dec","mag_g","magerr_g","mag_i","magerr_i","mag_r","magerr_r","mag_u","magerr_u","mag_y","magerr_y","mag_z","magerr_z","z_true"]

file_path = os.path.join(CURR_DIR, '../../../../DATA/dp0_train_random.csv')
full_data = pd.read_csv(file_path, usecols=data_columns)
full_data.head()

#### Splitting into train and test data

In [None]:
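size = len(full_data)//2

# Keep the original index while sampling so the sampled rows can be dropped from full_data;
# with ignore_index=True the index is reset and drop() would remove the wrong rows.
train_sample = full_data.sample(n=size)
test_sample = full_data.drop(train_sample.index).reset_index(drop=True)
train_sample = train_sample.reset_index(drop=True)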
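
A train/test split is only useful if the two sets share no rows. As a minimal sketch of that requirement, written with the standard library instead of pandas, the hypothetical helper `split_indices` below samples row positions for training and keeps the complement for testing.

```python
import random

def split_indices(n_rows, train_fraction=0.5, seed=42):
    """Return two disjoint, sorted lists of row positions: (train, test)."""
    rng = random.Random(seed)
    train = set(rng.sample(range(n_rows), int(n_rows * train_fraction)))
    # Everything that was not sampled for training goes to the test set.
    test = [i for i in range(n_rows) if i not in train]
    return sorted(train), test

train_idx, test_idx = split_indices(10)
print(train_idx, test_idx)
```

The same check (no common index between the two samples) can be applied to the DataFrames used in this notebook.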

In [None]:
train_sample.head()

In [None]:
test_sample.head()

---

##  RAIL 

RAIL has a lot of classes and relies on Object Oriented Programming (OOP), so things can get complicated very quickly. For now we are going to focus on understanding a little about the three basic ones: **RailStage, DataStore and DataHandle**

**Image:** This diagram represents some of the classes and their hierarchy.

![title](RAILclasses.png)


## DataStore

The DataStore class stores all the data being processed, with each item associated with a key. For example, for a file named 'test_sample.hdf5' containing the sample that we are going to use to test an algorithm, we add it to the data store under the key 'test_sample' and specify which class (DataHandle) is going to be used to read it, in this case TableHandle -> Hdf5Handle. 
<br>

Another important thing to know is that the DataStore class acts as a [singleton class](https://refactoring.guru/design-patterns/singleton), which is basically a class that has only one instance in the application. This matters because RAIL keeps all the data and handles as it runs, so that each stage can access and read what earlier stages stored. Because of that, trying to create another instance does not give us a second DataStore; the call serves as a DataStore factory rather than as the DataStore class itself.  
<br>
We can access the data store through the attribute data_store. By default it does not allow overwriting the data being stored, so if we want to change the value of a key we have to manually set the property allow_overwrite to True.
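
To make the overwrite protection concrete, here is a minimal sketch of a key/value store with a class-level `allow_overwrite` flag. `ToyDataStore` is a hypothetical stand-in written for this notebook; it only imitates the behavior described above and is not RAIL's implementation.

```python
class ToyDataStore(dict):
    """Hypothetical stand-in for a key -> data store; not RAIL's DataStore."""
    allow_overwrite = False  # class-level flag shared by every instance

    def add_data(self, key, data):
        # Refuse to replace an existing key unless overwriting was enabled.
        if key in self and not type(self).allow_overwrite:
            raise KeyError(f"key '{key}' already exists and overwriting is disabled")
        self[key] = data
        return data

store = ToyDataStore()
store.add_data("input", [1, 2, 3])
try:
    store.add_data("input", [4, 5, 6])
    overwrite_blocked = False
except KeyError:
    overwrite_blocked = True

# Setting the flag on the class changes the behavior for all instances.
ToyDataStore.allow_overwrite = True
store.add_data("input", [4, 5, 6])
print(overwrite_blocked, store["input"])
```

Because the flag lives on the class, it is set through the class rather than on one particular instance.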

In [None]:
DS = RailStage.data_store
DS.__class__.allow_overwrite = True


To help us understand better, we are going to keep monitoring how the DataStore stores the data and how memory usage changes along the way.

In [None]:
def print_data_store():
    print(DS)
    print(sys.getsizeof(DS), 'bytes')

print_data_store()

We can manually add data to the data store with add_data, or pass a file and store it in the DS with read_file.

In [None]:
DS.add_data(key="input", data=train_sample, handle_class=PqHandle)
## DS.read_file(key="name", path=file_path, handle_class=Handler)
print_data_store()

Here is how we can access the data; the read uses the handle class that we passed.

In [None]:
DS.read("input").head()

#### Memory x Files

As we can see, as soon as we added the data to the DS, the size reported by `sys.getsizeof` increased by 200 bytes. In RAIL we can hold data in memory as a table-like object (a pandas DataFrame, an OrderedDict, etc.), or leave it in files. Several steps/configurations differ depending on whether or not the data is brought into memory.
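
One caveat when reading these numbers: `sys.getsizeof` reports only the size of the container object itself, not of the objects it references. The sketch below illustrates this with plain dictionaries; `deep_size` is a hypothetical helper that walks nested dicts, lists, tuples and sets, and it is only a rough estimate.

```python
import sys

def deep_size(obj, seen=None):
    """Rough recursive size of nested dicts/lists/tuples/sets, in bytes."""
    seen = set() if seen is None else seen
    if id(obj) in seen:  # do not count the same object twice
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_size(k, seen) + deep_size(v, seen) for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set)):
        size += sum(deep_size(item, seen) for item in obj)
    return size

small_store = {"input": list(range(10))}
big_store = {"input": list(range(100_000))}

# Shallow sizes match because both dicts hold exactly one key.
print(sys.getsizeof(small_store), sys.getsizeof(big_store))
# Deep sizes differ because the referenced lists differ.
print(deep_size(small_store), deep_size(big_store))
```

So a small change in the number printed by `print_data_store()` says little about how much data the store actually references.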

## DataHandle

All the stages inherit from RailStage, as can be seen in the class diagram above, and a RailStage is itself a 
[CeciStage](https://github.com/LSSTDESC/ceci/blob/d1d5686aefab18bc53e3d4d8a05af42d19e28a91/ceci/stage.py#L24). When we declare a stage we use the method _make_stage(**args)_, which returns the object itself as a stage configured with the given parameters. This resembles the [Delegate Pattern](https://en.wikipedia.org/wiki/Delegation_pattern). 

To understand how the returned object works, we can borrow the explanation of [delegates](https://docs.microsoft.com/pt-br/dotnet/csharp/programming-guide/delegates/using-delegates) from the C# language. Basically, we can think of a delegate as a reference to a class and to a method of that class that is going to be executed; in other words, an object that behaves like a method and can be executed. In Python this behavior is declared with `__call__`: the returned object can be executed as obj(), which runs the method defined there. For RailStages, this runs the algorithm.
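
A minimal sketch of this idea in plain Python follows. `ToyStage` is a hypothetical class written for illustration: it has a `make_stage` factory and a `__call__` method, but it does not reproduce how RailStage or ceci implement them.

```python
class ToyStage:
    """Hypothetical stage: configured once, then executed like a function."""

    def __init__(self, name, **config):
        self.name = name
        self.config = config

    @classmethod
    def make_stage(cls, **kwargs):
        # Factory-style constructor: returns a configured instance.
        return cls(**kwargs)

    def run(self, data):
        # The "algorithm": scale every value by the configured factor.
        return [value * self.config.get("factor", 1) for value in data]

    def __call__(self, data):
        # Calling the object delegates to run().
        return self.run(data)

doubler = ToyStage.make_stage(name="doubler", factor=2)
print(doubler([1, 2, 3]))
```

Here `doubler` is an ordinary object, yet it can be used with function-call syntax because the class defines `__call__`.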

**Image:** basic flow of inputs and outputs. 

![title](SimpleRailBPZflow.png)


## PRIOR - Preparing Data
Here we start preparing our data. First we set up the ColumnMapper stage, which remaps all the data columns to the names expected by the BPZ algorithm.

In [None]:
##help(ColumnMapper)

In [None]:
columns_map = {
"coadd_objects_id": "id",
"ra": "coord_ra",
"dec": "coord_dec",
"mag_g": "mag_g_lsst",
"magerr_g": "mag_err_g_lsst",
"mag_r": "mag_r_lsst",
"magerr_r": "mag_err_r_lsst",
"mag_i": "mag_i_lsst",
"magerr_i": "mag_err_i_lsst",
"mag_u": "mag_u_lsst",
"magerr_u": "mag_err_u_lsst",
"mag_y": "mag_y_lsst",
"magerr_y": "mag_err_y_lsst",
"mag_z": "mag_z_lsst",
"magerr_z": "mag_err_z_lsst",
"z_true": "redshift"
}

col_remapper = ColumnMapper.make_stage(name='col_remapper', columns=columns_map)
print(f"Returned class: {type(col_remapper)}")
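
Conceptually, remapping columns is just a rename driven by a dictionary. The sketch below shows that idea on a small dict-of-columns table; `remap_columns`, `toy_table` and `toy_map` are hypothetical names, and this is not the ColumnMapper code.

```python
def remap_columns(table, columns_map):
    """Return a new dict-of-columns table with keys renamed via columns_map."""
    # Columns that are not in the map keep their original name.
    return {columns_map.get(name, name): values for name, values in table.items()}

toy_table = {"mag_g": [24.1, 23.7], "magerr_g": [0.05, 0.04], "z_true": [0.6, 1.1]}
toy_map = {"mag_g": "mag_g_lsst", "magerr_g": "mag_err_g_lsst", "z_true": "redshift"}

remapped = remap_columns(toy_table, toy_map)
print(list(remapped))
```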

In [None]:
print_data_store()

We can see the configuration of the returned object with `returned_obj.config.to_dict()`

In [None]:
col_remapper.config.to_dict()

Basically, in the notebook we can execute it in two ways:
1. With `col_remapper.run()`, when the data has been added manually to the DS.
2. By passing the data as a parameter and invoking the stage as `col_remapper(dataAsTableLike)`.

In this case we are going to call the run method.
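
The difference between the two ways of running a stage can be sketched with a toy store and a toy stage. Everything below (`toy_store`, `ToyUpper`) is hypothetical and only mimics the key-based lookup described in this notebook.

```python
toy_store = {}  # hypothetical stand-in for the shared data store

class ToyUpper:
    """Hypothetical stage that upper-cases a list of strings."""

    def __init__(self, name):
        self.name = name

    def run(self):
        # Way 1: read from the store's 'input' key, write to 'output_<name>'.
        toy_store[f"output_{self.name}"] = [s.upper() for s in toy_store["input"]]

    def __call__(self, data):
        # Way 2: place the data under 'input', run, and hand back the result.
        toy_store["input"] = data
        self.run()
        return toy_store[f"output_{self.name}"]

stage = ToyUpper("upper")

toy_store["input"] = ["a", "b"]  # data added manually, then run()
stage.run()
via_run = toy_store["output_upper"]

via_call = stage(["c", "d"])     # data passed directly to the stage
print(via_run, via_call)
```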

In [None]:
## remembering what data we are storing
DS.read("input").head()

In [None]:
col_remapper.run()
print(f"\nRunning in parallel -> {col_remapper.is_parallel()}")

We can see that it stores the output in the DS under a key made of `output_` followed by the stage name. Let's check the output.

In [None]:
col_remapper.get_data("output").head() 
## or through the DS as 
##DS.read("output_col_remapper")
##DS["output_col_remapper"].data

#tables_io.convertObj(DS.read("output_estimate_bpz").build_tables()['ancil'], tables_io.types.PD_DATAFRAME)

In [None]:
print_data_store()

____
**OBSERVATION**

Passing the input key in make_stage does not work:
`ColumnMapper.make_stage(name='col_remapper_train_2', columns=columns_map, input='test')`

When we call stage.run(), it searches the DS for the key named 'input', so to run it that way we would have to keep updating the value of the 'input' key. To fix that, the stage would need to call the method responsible for setting the key name, so that when it runs it searches for the right key.
<br>
Setting the key passed in make_stage(input="key"):
<br>`self.set_data(self.config.input, data)`

How the algorithm searches in the .run() method:
<br>`data = self.get_data('input', allow_missing=True)` <br>


**OBS:** when we call the stage as _stage(data)_ it is not necessary to set the input key; it is already handled as in the suggestion above.

In [None]:
DS.add_data(key="input", data=col_remapper.get_data("output"), handle_class=PqHandle)
print(DS)
DS.read("input").head()

## PRIOR - Inform BPZ

Basically, all the algorithms expect a TableHandle as input.<br>
`inputs = [('input', <class 'rail.core.data.TableHandle'>)]`<br>
As the output of the ColumnMapper stage is already a TableHandle we don't need to specify it, but if the data is not already in the correct form, it may be helpful to use the TableConverter class. 

Eg:

     table_conv_train = TableConverter.make_stage(name='table_conv_train', output_format='numpyDict')
     table_conv_train.run()


and the output is a ModelHandle<br>
`outputs = [('model', <class 'rail.core.data.ModelHandle'>)]`

Here we set the data under the 'input' key in the DS, so that when inform runs it gets the remapped data stored with the key 'input' in the DataStore.  

In [None]:
DS.read("input").head()

Running Inform_BPZ_lite to define the priors for the BPZ algorithm. Here we can configure all the parameters, such as zmin, zmax, etc.

According to the [documentation BPZ_lite](https://github.com/LSSTDESC/rail_bpz/blob/main/src/rail/estimation/algos/bpz_lite.py) Inform:

> Inform stage for BPZ_lite, this stage *assumes* that you have a set of
    SED templates and that the training data has already been assigned a
    'best fit broad type' (that is, something like ellliptical, spiral,
    irregular, or starburst, similar to how the six SEDs in the CWW/SB set
    of Benitez (2000) are assigned 3 broad types).  This informer will then
    fit parameters for the evolving type fraction as a function of apparent
    magnitude in a reference band, P(T|m), as well as the redshift prior
    of finding a galaxy of the broad type at a particular redshift, p(z|m, T)
    where z is redshift, m is apparent magnitude in the reference band, and T
    is the 'broad type'.  We will use the same forms for these functions as
    parameterized in Benitez (2000).  For p(T|m) we have
    p(T|m) = exp(-kt(m-m0))
    where m0 is a constant and we fit for values of kt
    For p(z|T,m) we have
    P(z|T,m) = f_x*z0_x^a *exp(-(z/zm_x)^a)
    where zm_x = z0_x*(km_x-m0)
    where f_x is the type fraction from p(T|m), and we fit for values of
    z0, km, and a for each type.  These parameters are then fed to the BPZ
    prior for use in the estimation stage.
    
   I don't fully understand all of the science here yet, but we can move on for now.
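
To get a feel for the shape of these priors, the sketch below evaluates the functional forms as published in Benitez (2000): $p(T|m) = f_t\,e^{-k_t (m - m_0)}$ and $p(z|T,m) \propto z^{\alpha} \exp[-(z/z_m)^{\alpha}]$ with $z_m = z_0 + k_m (m - m_0)$. The notation in the docstring above is slightly different, the rail_bpz code is not reproduced here, and all parameter values are illustrative assumptions that were not fitted to any data.

```python
import math

def type_fraction(m, f_t, k_t, m0=20.0):
    """p(T|m) = f_t * exp(-k_t * (m - m0)), the form given in Benitez (2000)."""
    return f_t * math.exp(-k_t * (m - m0))

def redshift_prior(z_grid, m, z0, km, alpha, m0=20.0):
    """Normalized p(z|T,m) proportional to z**alpha * exp(-(z/zm)**alpha)."""
    zm = z0 + km * (m - m0)
    raw = [z ** alpha * math.exp(-((z / zm) ** alpha)) for z in z_grid]
    dz = z_grid[1] - z_grid[0]
    norm = sum(raw) * dz  # simple Riemann-sum normalization on a uniform grid
    return [r / norm for r in raw], zm

# Illustrative parameter values only; they are not fitted to any data.
z_grid = [0.005 * i for i in range(1, 1001)]  # 0.005 ... 5.0
prior, zm = redshift_prior(z_grid, m=24.0, z0=0.4, km=0.1, alpha=2.0)
z_peak = z_grid[prior.index(max(prior))]
print(round(zm, 3), round(z_peak, 3))
print(round(type_fraction(24.0, f_t=0.35, k_t=0.15), 3))
```

With this form the prior peaks at $z = z_m$, so fainter galaxies (larger $m$) are given a prior that peaks at higher redshift whenever $k_m > 0$.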

In [None]:
#help(Inform_BPZ_lite)
bpz_columns_file = os.path.join(CURR_DIR, '../configs/bpz.columns')
bpz_columns_file

In [None]:
inform_bpz = Inform_BPZ_lite.make_stage(
    name='inform_bpzlite', 
    #input="test_nome",
    model='trained_BPZ_output.pkl', 
    hdf5_groupname='', 
    columns_file=bpz_columns_file,
    prior_band="mag_i_lsst"
)
#inform_bpz.connect_input(col_remapper_train) # If we do not want to keep updating the input key in DS, we can connect it like this
inform_bpz.config.to_dict()

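Running Inform to compute the best-fit prior parameters. What are fo and kt? Judging by the docstring above, kt is the parameter fitted in p(T|m), and fo presumably corresponds to the type fraction written there as f_x.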

In [None]:
%%time
inform_bpz.run()
## or inform_bpz.inform(data)

In [None]:
print_data_store()

In [None]:
inform_bpz.config.to_dict()

Here it is interesting to look at the aliases key and to see that, once the stage has run, the model and the output are set.

In [None]:
DS['model_inform_bpzlite'].data

___

## BPZ Estimate -  Preparing Data


For the posteriors (estimate stage):

     inputs = [('model', <class 'rail.core.data.ModelHandle'>)]
     outputs = [('output', <class 'rail.core.data.QPHandle'>)]

Adding the data to the 'input' key in the DS:

In [None]:
test_sample.head()

In [None]:
DS.add_data(key="input", data=test_sample, handle_class=PqHandle)  # test_sample is a pandas DataFrame, as in the train case
print_data_store()
DS.read("input").head()

In [None]:
col_remapper.run()

The estimate stage requires a different type of input data: either a dictionary with all the input data or a TableHandle providing access to the data.
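
As a rough picture of what "a dictionary of all input data" means, the sketch below turns a list of rows into a dictionary keyed by column name. It assumes the converted table is column-oriented (which matches how `['id']` is read from it later in this notebook) and uses plain lists instead of numpy arrays; `rows_to_column_dict` is a hypothetical helper, not part of tables_io.

```python
def rows_to_column_dict(rows):
    """Convert a list of row dicts into a dict mapping column name -> list of values."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

rows = [{"id": 1, "mag_i_lsst": 23.4}, {"id": 2, "mag_i_lsst": 24.8}]
table = rows_to_column_dict(rows)
print(table["id"], table["mag_i_lsst"])
```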

In [None]:
DS.add_data(key="input", data=col_remapper.get_data("output"), handle_class=PqHandle)
print_data_store()
DS.read("input").head()

table_conv = TableConverter.make_stage(name='table_conv', output_format='numpyDict')
## or set the input as 
# table_conv.connect_input(col_remapper_test)
table_conv.run()

Adding the result to the 'input' key in the DS:

In [None]:
DS.add_data(key="input", data=table_conv.get_data("output"), handle_class=TableHandle)  # the converted data is a dict of arrays
print_data_store()
DS.read("input")['id']

## BPZ Estimate - run

Now that we have our test sample under the 'input' key in the DS, we define and run the algorithm. Here we can change configuration values such as zmin, zmax, dx, bins, etc., via parameters for each run that we want.

In [None]:
estimate_bpz = BPZ_lite.make_stage(
    name='estimate_bpz', 
    hdf5_groupname='', 
    columns_file=bpz_columns_file, 
    #input="inprogress_output_table_conv_train.hdf5", 
    model=inform_bpz.get_handle('model'),
    zmax=1.5
)
#estimate_bpz.connect_input(table_conv)
estimate_bpz.config.to_dict()

In [None]:
estimate_bpz.run() ## 'input' is read from the DS
#estimate(data) ## 'input' is the data passed in

In [None]:
print_data_store()
estimate_bpz.get_data("output").build_tables()

## Evaluator

In [None]:
DS.read("input")

In [None]:
ztrue = DS.read("input")
DS.add_data(key="truth", data=ztrue, handle_class=TableHandle)
print_data_store()
DS.read("truth")

In [None]:
ensemble = DS.read("output_estimate_bpz")
DS.add_data(key="input", data=ensemble, handle_class=QPHandle)
print_data_store(), type(ensemble)

In [None]:
evaluator = Evaluator.make_stage(name='bpz_eval', truth=ztrue)
# evaluator.connect_input(estimate_bpz)
evaluator.run()
#evaluator.evaluate(ensemble, ztrue)

In [None]:
print_data_store()

In [None]:
tables_io.convertObj(DS.read("output_bpz_eval"), tables_io.types.PD_DATAFRAME)
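
One of the metrics commonly used for photo-z PDFs is the probability integral transform (PIT): the CDF of each object's estimated PDF evaluated at its true redshift. The sketch below computes it for Gaussian PDFs only, with made-up numbers; the function name and the values are assumptions for illustration and this is not the Evaluator's code.

```python
import math

def gaussian_pit(z_true, mu, sigma):
    """PIT for a Gaussian photo-z PDF: its CDF evaluated at the true redshift."""
    return 0.5 * (1.0 + math.erf((z_true - mu) / (sigma * math.sqrt(2.0))))

# Hypothetical (true redshift, PDF mean, PDF width) triples.
objects = [(0.50, 0.50, 0.05), (0.80, 0.70, 0.05), (0.30, 0.45, 0.05)]
pits = [gaussian_pit(z, mu, sigma) for z, mu, sigma in objects]
print([round(p, 3) for p in pits])
```

For well-calibrated PDFs the PIT values of a large sample are expected to be uniformly distributed between 0 and 1; values piling up near 0 or 1 indicate true redshifts falling in the tails of the PDFs.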

____

## Plot pz x zspec

In [None]:
results_tables = tables_io.convertObj(DS.read("output_estimate_bpz").build_tables()['ancil'], tables_io.types.PD_DATAFRAME)
zmode = results_tables['zmode']
len(zmode)

In [None]:
plt.figure(figsize=(8,8))
plt.scatter(test_sample['z_true'],zmode,s=1,c='k',label='simple bpz mode')  # the estimates were made for test_sample
plt.plot([0,3],[0,3],'r--');
plt.xlabel("true redshift")
plt.ylabel("bpz photo-z")
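
Besides looking at the scatter plot, the agreement between point estimates and true redshifts can be summarized with a few numbers. The sketch below uses one common set of conventions (median bias, normalized median absolute deviation, and an outlier cut of 0.15 on the scaled residual); the function name, the cut and the sample values are assumptions for illustration.

```python
from statistics import median

def point_metrics(z_true, z_phot, outlier_cut=0.15):
    """Bias, sigma_NMAD and outlier fraction of dz = (z_phot - z_true) / (1 + z_true)."""
    dz = [(zp - zt) / (1.0 + zt) for zt, zp in zip(z_true, z_phot)]
    bias = median(dz)
    sigma_nmad = 1.4826 * median(abs(d - bias) for d in dz)
    outlier_fraction = sum(abs(d) > outlier_cut for d in dz) / len(dz)
    return bias, sigma_nmad, outlier_fraction

# Hypothetical values, only to exercise the function.
z_true = [0.20, 0.50, 0.80, 1.10, 1.40]
z_phot = [0.22, 0.48, 0.85, 1.05, 0.40]
bias, sigma_nmad, outlier_fraction = point_metrics(z_true, z_phot)
print(round(bias, 4), round(sigma_nmad, 4), outlier_fraction)
```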

## CONCLUSION

When this notebook was last run, some PITs were not calculated and the points in the plot were highly scattered. The two are probably related: the PIT values may have been cut because of an error in the sample configuration, such as a switched column or another input error that went undetected. For this notebook, however, the important thing is to understand how we can run RAIL and how its stages work together, using the notebook as the place to run them. We can also use ceci to run the same stages that were configured here, but that is going to be explored in another notebook.