<img align='left' src = '../images/linea.png' width=150 style='padding: 20px'> 

# PZ Compute - Input Data QA Notebook - DP0.2

Collection of objects made available by Rubin Observatory Legacy Survey of Space and Time (LSST), release Data Preview 0.2 (DP0.2).

<br>
<br>

Contact: Luigi Silva ([luigi.lcsilva@linea.org.br](mailto:luigi.lcsilva@linea.org.br))

Last check: September 2, 2024.

#### Acknowledgements

'_This notebook used computational resources from the Associação Laboratório Interinstitucional de e-Astronomia (LIneA) with the financial support of INCT do e-Universo (Process no. 465376/2014-2)._'

#### Instructions

Before running the notebook, check the instructions in the ```pz-compute/doc/notebooks/DP02_QA_notebook_input.md``` file.

# About the survey

The data set used for DP0.2 is the $300 \text{ deg}^2$ of simulated, LSST-like images and catalogs generated by the Dark Energy Science Collaboration (DESC) for their Data Challenge 2 (DC2) ([DP0.2 Docs](https://dp0-2.lsst.io/)).

|seq.|Survey name <br> (link to the website)| Number of objects in <br>the original sample | Reference <br> (link to the paper) |
|---|---|:-:|---|
|1|[DP0.2](https://dp0-2.lsst.io/)|~200 million astronomical objects|[LSST Dark Energy Science Collaboration (LSST DESC)](https://ui.adsabs.harvard.edu/abs/2021ApJS..253...31L/abstract)| 

The complete table schema for DP0.2 can be found [here](https://dm.lsst.org/sdm_schemas/browser/dp02.html). In this notebook, we will use the skinny tables generated using the LIneA Apollo Cluster. 

The skinny tables have the following columns: coord_ra; coord_dec; mag_u; mag_g; mag_r; mag_i; mag_z; mag_y; magerr_u; magerr_g; magerr_r; magerr_i; magerr_z; magerr_y; objectId. These columns correspond, respectively, to the R.A. and DEC coordinates, the magnitude measurements and the magnitude errors for each band, and the unique object identifiers.
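As a quick cross-check of the column list above, the names can be rebuilt from the six band letters. This is only an illustrative sketch: the variable names `bands` and `skinny_columns` are assumptions and do not come from the original scripts.

```python
# Illustrative only: rebuilding the skinny-table column names listed above.
bands = ["u", "g", "r", "i", "z", "y"]

skinny_columns = (
    ["coord_ra", "coord_dec"]
    + [f"mag_{band}" for band in bands]
    + [f"magerr_{band}" for band in bands]
    + ["objectId"]
)

print(len(skinny_columns))
print(skinny_columns)
```

A list like this can be handy for checking that an input file contains every expected column before computing statistics on it.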

# Inputs and configs

Requirements for this notebook:

* **General libraries**: os, sys, math, numpy, pandas.
* **Visualization libraries**: bokeh, holoviews, geoviews, cartopy.

It is also necessary to install the **fastparquet** library for reading the parquet files with pandas.

In [None]:
# General
import os
import sys
import math
import numpy as np
import pandas as pd

# Bokeh
import bokeh
from bokeh.io import output_notebook

# Holoviews
import holoviews as hv
from holoviews import opts

# Geoviews
import geoviews as gv
import geoviews.feature as gf
from geoviews.operation import project
import cartopy.crs as ccrs

In [None]:
pd.set_option('display.max_rows', 10)

In [None]:
hv.extension('bokeh')

In [None]:
gv.extension('bokeh')

In [None]:
output_notebook()

In [None]:
%matplotlib inline

# Basic Statistics

In this section, we will show the basic statistics of the data. In both tables, we will find, for each column, the number of non-NA/null observations, the maximum value, minimum value, mean, standard deviation and the 25th, 50th and 75th percentiles.

The first table was obtained by applying the ```describe``` method, from the pandas library, to the entire input dataset, without any filtering.

The second table was obtained by applying the same method, but this time after filtering out all the NaN and inf values in the dataset.
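The difference between the two tables can be illustrated on a toy dataframe. The sketch below is not the cluster script: the values are made up, and mapping inf to NaN before calling `describe` is only one possible way of doing the filtering, assumed here for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a skinny table (values are made up for illustration).
df = pd.DataFrame({
    "mag_u": [24.1, np.nan, 25.3, np.inf, 23.8],
    "mag_g": [23.5, 24.0, np.nan, 22.9, 24.4],
})

# Statistics over the raw columns: describe() skips NaN, but inf is still counted.
raw_stats = df.describe()

# Assumed filtering: map +/-inf to NaN so that describe() ignores both, column by column.
clean_stats = df.replace([np.inf, -np.inf], np.nan).describe()

print(raw_stats)
print(clean_stats)
```

In the raw table, a single inf is enough to make the maximum and the mean of a column infinite, which is why the filtered table is usually the more informative one.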

### How to get the data

The files containing the data of the basic statistics were obtained using High Performance Computing (HPC) in the LIneA Apollo Cluster. There is a Python script that reads all the input Parquet files and computes the basic statistics. There is also an sbatch file used to submit the job to the cluster. They can be found in ```pz-compute/doc/dp02_qa_scripts/```.

Files:
1. Detailed instructions to run the scripts: ```DP02_QA_slurm_scripts_instructions.md```
2. Python script: ```DP02_QA_basic_stats.py```
3. Sbatch script: ```DP02_QA_basic_stats.sbatch```
4. Resulting Parquet files containing the data: ```basic_stats.parquet``` and ```basic_stats_no_nan_no_inf.parquet```

For running the .py script in the cluster, you must follow the steps in the corresponding ```.md``` file.

For this Jupyter notebook, the output files ```basic_stats.parquet``` and ```basic_stats_no_nan_no_inf.parquet``` must be in the ```output``` folder, and this ```output``` folder must be in the same directory as the notebook itself.
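Before reading the files, it can help to confirm that they are where the notebook expects them. The helper below is a hypothetical convenience (the name `missing_outputs` is not part of the repository) that lists which expected files are absent from the ```output``` folder.

```python
from pathlib import Path

def missing_outputs(filenames, folder="output"):
    """Return the names in `filenames` that are not regular files inside `folder`."""
    base = Path(folder)
    return [name for name in filenames if not (base / name).is_file()]

# File names taken from the list above; the folder is relative to the notebook.
expected = ["basic_stats.parquet", "basic_stats_no_nan_no_inf.parquet"]
print(missing_outputs(expected))
```

An empty list means that both files were found.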

## Showing the basic statistics

Below, we can see the first table, containing the basic statistics for the entire input dataset.

In [None]:
### Reading the Parquet file and showing the dataframe.
df_basic_stats = pd.read_parquet('output/basic_stats.parquet', engine='fastparquet')
df_basic_stats 

The second table is shown below, containing the basic statistics ignoring all NaN and inf values in all columns.

In [None]:
### Reading the Parquet file and showing the dataframe.
df_basic_stats_no_nan_no_inf = pd.read_parquet('output/basic_stats_no_nan_no_inf.parquet', engine='fastparquet')
df_basic_stats_no_nan_no_inf

# Spatial distribution

In the following subsections, we will read the data of the 2D histogram corresponding to the **spatial distribution of objects**, considering their distribution in the sky according to their Right Ascension (R.A.) and Declination (DEC) coordinates, and we will make two plots.

The first plot uses the **equidistant cylindrical projection (Plate Carrée projection)**, in which the lines corresponding to R.A. are equally spaced vertical straight lines, and the lines corresponding to DEC are equally spaced horizontal straight lines. This projection distorts areas and shapes, especially at high declinations.

The second plot uses the **Mollweide projection**, an equal-area, pseudocylindrical map projection. The Mollweide projection preserves area; however, it distorts shapes, especially near the edges of the sky map. The central meridian and the celestial equator are straight lines, while other lines of R.A. and DEC are represented as curves.
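To make the area distortion of the Plate Carrée projection concrete, the sketch below computes the true solid angle of a bin bounded by lines of constant R.A. and DEC, using $\Omega = \Delta\alpha \, (\sin\delta_2 - \sin\delta_1)$. The function name `bin_solid_angle_deg2` is an assumption made for this example.

```python
import numpy as np

def bin_solid_angle_deg2(ra_lo, ra_hi, dec_lo, dec_hi):
    """Solid angle, in square degrees, of a bin bounded by constant R.A. and DEC lines."""
    d_ra = np.radians(ra_hi - ra_lo)
    d_sin = np.sin(np.radians(dec_hi)) - np.sin(np.radians(dec_lo))
    steradians = d_ra * d_sin
    # Converting steradians to square degrees.
    return steradians * (180.0 / np.pi) ** 2

# Two bins that look identical (1 deg x 1 deg) on a Plate Carrée map.
equator_bin = bin_solid_angle_deg2(0.0, 1.0, -0.5, 0.5)
high_dec_bin = bin_solid_angle_deg2(0.0, 1.0, 59.5, 60.5)

print(f"Bin centered at DEC = 0 deg:  {equator_bin:.4f} deg^2")
print(f"Bin centered at DEC = 60 deg: {high_dec_bin:.4f} deg^2")
```

Both bins occupy the same area on a Plate Carrée map, but the one centered at DEC = 60° covers only about half the sky area, so equal counts per bin do not imply equal surface density. An equal-area projection such as Mollweide avoids this visual bias.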

Both graphs will also have a **colorbar** corresponding to the counts of objects per R.A. and DEC bin of the 2D histogram.

### How to get the data

The file containing the data of the 2D histogram was obtained using High Performance Computing (HPC) in the LIneA Apollo Cluster. There is a Python script that reads all the input Parquet files and computes the 2D histogram. In this Python file, you can personalize the R.A. and DEC bin edges. There is also an sbatch file used to submit the job to the cluster. They can be found in ```pz-compute/doc/dp02_qa_scripts/```.

Files:
1. Detailed instructions to run the scripts: ```DP02_QA_slurm_scripts_instructions.md```
2. Python script: ```DP02_QA_histo_2d_ra_dec.py```
3. Sbatch script: ```DP02_QA_histo_2d_ra_dec.sbatch```
4. Resulting Parquet file containing the data: ```histo_2d_ra_dec.parquet```

For running the .py script in the cluster, you must follow the steps in the corresponding ```.md``` file.

For this Jupyter notebook, the output file ```histo_2d_ra_dec.parquet``` must be in the ```output``` folder, and this ```output``` folder must be in the same directory as the notebook itself.

**For the spatial distribution of objects, we do not compute one 2D histogram for each input Parquet file, just the total histogram considering all files together.**
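A total histogram over many files can be built by accumulating per-file counts on a fixed grid, without ever holding all the coordinates in memory at once. The sketch below illustrates the idea with random data standing in for the input files. It is not the cluster script, and the bin edges and coordinate ranges are assumptions chosen for the example.

```python
import numpy as np

# Hypothetical 1-degree bin edges covering the whole sky.
ra_edges = np.arange(0.0, 361.0, 1.0)
dec_edges = np.arange(-90.0, 91.0, 1.0)

# Accumulator with one cell per (R.A. bin, DEC bin).
total = np.zeros((ra_edges.size - 1, dec_edges.size - 1), dtype=np.int64)

rng = np.random.default_rng(0)
for chunk in range(3):
    # Random coordinates standing in for the contents of one input file.
    ra = rng.uniform(50.0, 75.0, size=1000)
    dec = rng.uniform(-45.0, -27.0, size=1000)
    counts, _, _ = np.histogram2d(ra, dec, bins=[ra_edges, dec_edges])
    total += counts.astype(np.int64)

print(total.shape, total.sum())
```

Note that `np.histogram2d` returns an array whose first axis follows the first coordinate (here R.A.), which is why the counts are transposed later before plotting.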

## Reading the data

Below we can see the structure of the output Parquet file. Row 0 (type "histogram_ra_dec") contains the counts of the 2D histogram in the "values" column, and row 1 (type "bins_ra_dec") contains the R.A. bin edges ("ra_bins") and DEC bin edges ("dec_bins") in the "values" column.

In [None]:
### Reading the Parquet file and showing the dataframe.
df_spatial_dist = pd.read_parquet('output/histo_2d_ra_dec.parquet', engine='fastparquet')
df_spatial_dist

We have to convert the lists to numpy arrays.

In [None]:
### Extracting the counts, the R.A. bin edges and the DEC bin edges from the original dataframe, and converting
### them to numpy arrays.
histogram_ra_dec = np.array(df_spatial_dist['values'][0])
bins_ra = np.array(df_spatial_dist['values'][1]['ra_bins'])
bins_dec = np.array(df_spatial_dist['values'][1]['dec_bins'])

Now, we show some information about the R.A. bins, DEC bins and the 2D histogram counts.

In [None]:
### Printing the information.
print("INFO - R.A. BINS")
print(f"Min. edge: {bins_ra.min():.2f} | Max. edge: {bins_ra.max():.2f} | Step: {bins_ra[1]-bins_ra[0]:.2f} | Shape: {bins_ra.shape} \n")
print("INFO - DEC BINS")
print(f"Min. edge: {bins_dec.min():.2f} | Max. edge: {bins_dec.max():.2f} | Step: {bins_dec[1]-bins_dec[0]:.2f} | Shape: {bins_dec.shape} \n") 
print("INFO - 2D HISTOGRAM COUNTS")
print(f"Min. count: {histogram_ra_dec.min()} | Max. count: {histogram_ra_dec.max()} | Shape: {histogram_ra_dec.shape}")

## Spatial distribution plot

Before making the plots, we must perform some tasks:

1. Change the 0 values in the 2D histogram counts array to NaN values, so that they appear white in the plot.
2. Compute the centers of the bins.
3. For the Plate Carrée projection, change the R.A. coordinates so that they belong to the range $[-180^{\circ},180^{\circ})$. This is necessary for inverting the x-axis in the plot, a widely used convention. We must also reorder the 2D histogram counts accordingly, so that they agree with the new R.A. range.
4. For the Mollweide projection, invert the R.A. coordinates by computing $360^{\circ} - x$ for every R.A. value $x$. This is just a computational device needed to invert the x-axis in the plot. In the final plot, the R.A. and DEC ticks are still shown correctly in the original range, $[0^{\circ},360^{\circ})$. Again, we must also reorder the 2D histogram counts accordingly.
5. Transpose the 2D histogram counts arrays so that they become compatible with HoloViews/GeoViews.
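The R.A. remapping in the steps above is easier to follow on a tiny example. The sketch below uses four hypothetical 90-degree R.A. bins and a single DEC bin, so the reordering of the counts can be checked by eye.

```python
import numpy as np

# Four 90-degree R.A. bins and a single DEC bin (toy values).
edges = np.array([0.0, 90.0, 180.0, 270.0, 360.0])
centers = (edges[1:] + edges[:-1]) / 2            # [ 45. 135. 225. 315.]
counts = np.array([[1.0], [2.0], [3.0], [4.0]])   # shape: (R.A. bins, DEC bins)

# Plate Carrée: wrap R.A. into [-180, 180) and reorder the rows to match.
wrapped = np.where(centers >= 180, centers - 360, centers)
order_wrapped = np.argsort(wrapped)
wrapped_sorted = wrapped[order_wrapped]           # [-135.  -45.   45.  135.]
counts_wrapped = counts[order_wrapped, :]         # rows are now 3, 4, 1, 2

# Mollweide: mirror R.A. (360 - x) and reorder the rows to match.
mirrored = 360 - centers
order_mirrored = np.argsort(mirrored)
mirrored_sorted = mirrored[order_mirrored]        # [ 45. 135. 225. 315.]
counts_mirrored = counts[order_mirrored, :]       # rows are now 4, 3, 2, 1

print(wrapped_sorted, counts_wrapped.ravel())
print(mirrored_sorted, counts_mirrored.ravel())
```

In both cases the same index array sorts the bin centers and the rows of the histogram, which keeps each count attached to its original sky position.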

In [None]:
### Changing the 0 values to NaN values.
histogram_ra_dec_NaN = histogram_ra_dec.astype(float)
histogram_ra_dec_NaN[histogram_ra_dec_NaN == 0] = np.nan

### Getting the bins centers.
bins_ra_centers = (bins_ra[1:] + bins_ra[:-1])/2
bins_dec_centers = (bins_dec[1:] + bins_dec[:-1])/2

### Plate Carrée projection - Changing the R.A. coordinates to the range [-180,180), and changing the 2d histogram counts accordingly.
bins_ra_centers_180_range = np.where(bins_ra_centers >= 180, bins_ra_centers - 360, bins_ra_centers)
sorted_indices_180_range = np.argsort(bins_ra_centers_180_range)
histogram_ra_dec_180_range = histogram_ra_dec_NaN[sorted_indices_180_range, :]
bins_ra_centers_180_range = bins_ra_centers_180_range[sorted_indices_180_range]

### Mollweide projection - Inverting the R.A. values (360 - values), and changing the 2d histogram counts accordingly.
bins_ra_centers_inverted = np.where(bins_ra_centers <= 360, 360 - bins_ra_centers, bins_ra_centers)
sorted_indices_inverted = np.argsort(bins_ra_centers_inverted)
histogram_ra_dec_inverted = histogram_ra_dec_NaN[sorted_indices_inverted, :]
bins_ra_centers_inverted = bins_ra_centers_inverted[sorted_indices_inverted]

### Transposing the histogram arrays for the holoviews plots.
histogram_ra_dec_180_range_transpose = histogram_ra_dec_180_range.T
histogram_ra_dec_inverted_transpose = histogram_ra_dec_inverted.T

After these tasks, we are ready to make the spatial distribution plots.

First, the Plate Carrée projection plot. In this plot, **the R.A. values are in the $[-180^{\circ},180^{\circ})$ range**, where the negative values correspond to values greater than $180^{\circ}$ in the original range, $[0^{\circ}, 360^{\circ})$.

In [None]:
### Creating the image using holoviews.
hv_image_ra_dec = hv.Image((bins_ra_centers_180_range, bins_dec_centers, histogram_ra_dec_180_range_transpose), [f'R.A.', f'DEC'], f'Counts')

### Adjusting the image options.
hv_image_ra_dec = hv_image_ra_dec.opts(
    opts.Image(cmap='viridis', cnorm='linear', colorbar=True, width=1000, height=500,
               xlim=(180, -180), ylim=(-90, 90), tools=['hover'], clim=(10, np.nanmax(histogram_ra_dec_180_range_transpose)),
               title=f'Spatial Distribution of Objects - Plate Carrée projection', show_grid=True)
)

# Showing the graph.
hv_image_ra_dec

Second, the Mollweide projection plot. In this plot, **the R.A. values are in the original $[0^{\circ},360^{\circ})$ range**. Unfortunately, the bokeh 'hover' tool does not work with this projection.

In [None]:
### Generating the R.A. and DEC ticks
longitudes = np.arange(30, 360, 30)
latitudes = np.arange(-75, 76, 15)

lon_labels = [f"{lon}°" for lon in longitudes]
lat_labels = [f"{lat}°" for lat in latitudes]

labels_data = {
    "lon": list(np.flip(longitudes)) + [-180] * len(latitudes),
    "lat": [0] * len(longitudes) + list(latitudes),
    "label": lon_labels + lat_labels,
}

df_labels = pd.DataFrame(labels_data)

labels_plot = gv.Labels(df_labels, kdims=["lon", "lat"], vdims=["label"]).opts(
    text_font_size="12pt",
    text_color="black",
    text_align='right',
    text_baseline='bottom',
    projection=ccrs.Mollweide()
)

### Creating the image using holoviews.
gv_image_ra_dec = gv.Image((bins_ra_centers_inverted, bins_dec_centers, histogram_ra_dec_inverted_transpose), [f'R.A.', f'DEC'], f'Counts')

### Doing the Mollweide projection.
gv_image_ra_dec_projected = gv.operation.project(gv_image_ra_dec, projection=ccrs.Mollweide())

### Generating the grid lines.
grid = gf.grid().opts(
    opts.Feature(projection=ccrs.Mollweide(), scale='110m', color='black')
)

### Adjusting the image options.
gv_image_ra_dec_projected = gv_image_ra_dec_projected.opts(cmap='viridis', cnorm='linear', colorbar=True, width=1000, height=500, 
                                                           clim=(10, np.nanmax(histogram_ra_dec_inverted_transpose)), 
                                                           title='Spatial Distribution of Objects - Mollweide projection', 
                                                           projection=ccrs.Mollweide(),  global_extent=True)

### Showing the plot.
combined_plot = gv_image_ra_dec_projected * grid * labels_plot
combined_plot

# Magnitude distributions

In the following subsections, we will read the data of the 1D histograms corresponding to the **magnitude distributions** and we will make plots for each band of the survey.

### How to get the data

The file containing the data of the 1D histogram was obtained using High Performance Computing (HPC) in the LIneA Apollo Cluster. There is a Python script that reads all the input Parquet files and computes the 1D histogram. In this Python file, you can personalize the magnitude bin edges for each band. There is also an sbatch file used to submit the job to the cluster. They can be found in ```pz-compute/doc/dp02_qa_scripts/```.

Files:
1. Detailed instructions to run the scripts: ```DP02_QA_slurm_scripts_instructions.md```
2. Python script: ```DP02_QA_histo_1d_mag.py```
3. Sbatch script: ```DP02_QA_histo_1d_mag.sbatch```
4. Resulting Parquet file containing the data: ```histo_1d_mag_all_bands.parquet```

For running the .py script in the cluster, you must follow the steps in the corresponding ```.md``` file.

For this Jupyter notebook, the output file ```histo_1d_mag_all_bands.parquet``` must be in the ```output``` folder, and this ```output``` folder must be in the same directory as the notebook itself.

**For the magnitude distributions, we compute one 1D histogram for each input Parquet file, for each band. Given a certain band, we use the same bin edges for each input file.**
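Using the same bin edges for every file is what makes the per-file histograms additive: summing them bin by bin gives exactly the histogram of all the data together. The sketch below checks this with random magnitudes. The bin edges and the three simulated "files" are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical magnitude bin edges shared by every file.
edges = np.arange(14.0, 32.5, 0.5)

# Random magnitudes standing in for the contents of three input files.
files = [rng.normal(25.0, 1.5, size=500) for _ in range(3)]

# One histogram per file, all computed with the same edges.
per_file = [np.histogram(mags, bins=edges)[0] for mags in files]

# Summing bin by bin versus histogramming all the data at once.
summed = np.sum(per_file, axis=0)
combined = np.histogram(np.concatenate(files), bins=edges)[0]

print(np.array_equal(summed, combined))
```

If the edges differed from file to file, the counts could not be added without rebinning.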

## Reading the data

Below we can see the Parquet file structure. When the value of the 'filename' column is a path, the values of the 'counts' column are the 1D histogram counts for the file corresponding to this path. When the value of the 'filename' column is 'bins', the values of the 'counts' column are the bin edges used for each band.

In the code cells below, we also split the original dataframe (df_mag) into a dictionary of dataframes, one per band, containing only the 1D histogram counts (mag_histograms), and a dataframe containing only the bin edges (mag_bins), and we convert the values to numpy arrays.

In [None]:
### Reading the Parquet file.
df_mag = pd.read_parquet('output/histo_1d_mag_all_bands.parquet', engine='fastparquet')

In [None]:
### Getting the bands names and the number of bands.
bands_mag = df_mag['band'].unique()
num_of_bands_mag = len(bands_mag)

### Separating the histograms of each band and the bin edges, and converting the lists to numpy arrays.
mag_histograms = {}
for band in bands_mag:
    mag_histograms[band] = df_mag[df_mag['band'] == band]
    mag_histograms[band] = mag_histograms[band][mag_histograms[band]['filename'] != 'bins']
    mag_histograms[band] = mag_histograms[band].reset_index(drop=True)
    mag_histograms[band]['counts'] = mag_histograms[band]['counts'].apply(np.array)

mag_bins = df_mag[df_mag['filename'] == 'bins']
mag_bins = mag_bins.reset_index(drop=True)
mag_bins['counts'] = mag_bins['counts'].apply(np.array)

### Getting the input files names and the number of input files.
filenames_paths_mag = df_mag['filename'].unique()
filenames_paths_mag = filenames_paths_mag[filenames_paths_mag!='bins']
filenames_paths_mag = filenames_paths_mag.astype(str)
filenames_mag = np.char.replace(filenames_paths_mag, '/lustre/t1/cl/lsst/dp02/secondary/catalogs/skinny/hdf5/', '')
filenames_len_mag = len(filenames_mag)

In [None]:
### PRINTING THE INFORMATION
### GENERAL INFORMATION
print(f"MAG. DIST. - NUMBER OF BANDS: {num_of_bands_mag}")
print(f"MAG. DIST. - BAND NAMES: {bands_mag}")
print(f"MAG. DIST. - NUMBER OF INPUT FILES: {filenames_len_mag}")
print(f"EACH BAND HAS A TOTAL OF {filenames_len_mag} LINES, CORRESPONDING TO EACH INPUT FILE, PLUS ONE EXTRA LINE, CORRESPONDING TO THE BIN EDGES.\n")

### INDIVIDUAL HISTOGRAMS PREVIEW
printing_lines_for_each_band = 2
print(f"\nPRINTING {printing_lines_for_each_band} LINES FOR EACH BAND, CONTAINING THE 1D HISTOGRAM COUNTS.")
for band in bands_mag:
    print(mag_histograms[band].head(printing_lines_for_each_band))
    print('\n')

### BIN EDGES PREVIEW
print(f"\nPRINTING THE LINES CONTAINING THE BIN EDGES (THEY CAN BE FOUND AT THE END OF THE PARQUET FILE)")
print(mag_bins.head())

### BIN EDGES INFORMATION
print("\n\nINFORMATION ABOUT THE BINS")
for band in bands_mag:
    bin_list = mag_bins[mag_bins['band']==band]['counts'].reset_index(drop=True)[0]
    bin_min = bin_list.min()
    bin_max = bin_list.max()
    step = bin_list[1] - bin_list[0]
    shape = bin_list.shape
    print(f"Min. bin edge - band {band}: {bin_min}")
    print(f"Max. bin edge - band {band}: {bin_max}")
    print(f"Step - band {band}: {step}")
    print(f"Shape - band {band}: {shape}\n")

Now, we will compute the total 1D histogram for each band. This is done by summing the values in each bin for all the individual histograms. Below, we show some information about the total 1D histogram.
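The bin-by-bin sum can also be written compactly with numpy. The sketch below is an alternative formulation, not the notebook's own code: it assumes that every row of the "counts" column holds an array of the same length, and it uses a made-up three-file dataframe.

```python
import numpy as np
import pandas as pd

# Toy per-file histograms for one band (made-up counts, three bins each).
per_file = pd.DataFrame({
    "filename": ["file_a", "file_b", "file_c"],
    "counts": [np.array([1, 0, 2]), np.array([3, 1, 0]), np.array([0, 4, 5])],
})

# Stacking the arrays into a (files, bins) matrix and summing over the files.
stacked = np.stack(per_file["counts"].tolist())
total_counts = stacked.sum(axis=0)

print(total_counts)  # [4 5 7]
```

Stacking first also gives direct access to other per-bin statistics over the files, such as the mean or the standard deviation.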

In [None]:
### Computing the total 1D histogram.
total_mag_histograms = {}
for band in bands_mag:
    arrays = {}
    num_of_rows = len(mag_histograms[band].index)
    for j in np.arange(0, num_of_rows, 1):
        name = 'array'+str(j)
        arrays[name] = mag_histograms[band]['counts'][j]

    somas = []

    for elementos in zip(*arrays.values()):
        soma = sum(elementos) 
        somas.append(soma)

    data = {'counts': somas,
            'bin_edges': mag_bins.loc[mag_bins['band'] == band, 'counts'].values[0][:-1]}

    total_mag_histograms[band] = pd.DataFrame(data)
    
### Printing the information.
print("INFORMATION ABOUT THE TOTAL 1D HISTOGRAM COUNTS")
for band in bands_mag:
    print(f"Min. count - band {band}: {total_mag_histograms[band]['counts'].min()}")
    print(f"Max. count - band {band}: {total_mag_histograms[band]['counts'].max()}\n") 

## Magnitude distributions plots

### Total 1D histograms

Below, we have the plots of the total 1D histograms for each band, meaning we are considering all input parquet files together.

In [None]:
### General settings
height = 400
width = 400

def setup_plot_data(band, histograms):
    """Sets the data to the plot for the specified band."""
    counts = histograms[band]['counts']
    bin_edges = np.array(histograms[band]['bin_edges'])
    bin_size = bin_edges[1] - bin_edges[0]
    bin_edges = np.append(bin_edges, bin_edges[-1] + bin_size)
    
    max_count_index = histograms[band]['counts'].idxmax()
    max_value = histograms[band].loc[max_count_index, 'bin_edges']
    xlim = (max_value - 5, max_value + 5)

    return counts, bin_edges, xlim

def create_histogram(band, counts, bin_edges, xlim):
    """Creates the histogram using holoviews."""
    title = f'Distribution of Magnitudes - Band {band}'
    label_name = f'mag {band}'
    magnitudes = hv.Dimension(band, label=label_name)
    mag_freqs = hv.Dimension(f'{band}_freqs', label=f'{label_name} freqs')
    
    return hv.Histogram((counts, bin_edges), kdims=magnitudes, vdims=mag_freqs).opts(
        title=title, xlabel=label_name, ylabel='frequencies', height=height, width=width, xlim=xlim)

### Preparing the histograms data.
mag_distribution_histo = {}
for band in bands_mag:
    # Using the band names exactly as stored, since all dictionaries are keyed by them.
    counts, bin_edges, xlim = setup_plot_data(band, total_mag_histograms)
    mag_distribution_histo[band] = create_histogram(band, counts, bin_edges, xlim)

### Composing the plots in a hv.Layout and showing.
mag_distribution = hv.Layout([mag_distribution_histo[band] for band in bands_mag]).cols(2)
mag_distribution

### Individual 1D histograms for each input file

Now, in the plots below, for a given band, each light-colored curve is the 1D histogram of one individual input Parquet file. To keep the plots readable, only one in every 50 files is drawn. The red curve, in turn, represents the mean value of counts in each bin, considering all input Parquet files.

Note that if the histograms from each individual input file are very similar, meaning each input file has enough data so that statistical deviations are very small, **you may need to zoom in on the plot to distinguish the light curves, as they will all be very close to the average curve (red curve)**.
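How close the individual curves are to the mean curve can also be quantified numerically. The sketch below computes the mean counts per bin and the largest fractional deviation from that mean among the files. This measure is an illustrative addition, not something computed by the notebook, and the counts are made up.

```python
import numpy as np

# Toy per-file histograms: one row per input file, one column per bin.
per_file_counts = np.array([
    [10, 50, 40],
    [12, 48, 40],
    [8, 52, 40],
])

# Mean counts per bin over the files (what the red curve shows).
mean_counts = per_file_counts.mean(axis=0)

# Largest fractional deviation from the mean in each bin (bins with zero mean would need masking).
max_rel_dev = np.max(np.abs(per_file_counts - mean_counts) / mean_counts, axis=0)

print(mean_counts)   # [10. 50. 40.]
print(max_rel_dev)   # [0.2  0.04 0.  ]
```

Small values in every well-populated bin indicate that the individual curves will be hard to tell apart from the mean curve without zooming in.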

In [None]:
# Function that creates the plots for a given dataframe and band
def create_plot(df, bins, title, step=10):
    curves = []
    all_counts = []
    # Bin centers, computed once so that they are always defined for the mean curve.
    centers = (bins[:-1] + bins[1:]) / 2

    for i, row in df.iterrows():
        counts = np.array(row['counts'])
        all_counts.append(counts)
        # Drawing only one curve in every `step` files.
        if i % step == 0:
            curve = hv.Curve((centers, counts), 'Magnitude', 'Counts').opts(line_width=2, alpha=0.3)
            curves.append(curve)

    mean_counts = np.mean(all_counts, axis=0)
    mean_curve = hv.Curve((centers, mean_counts), 'Magnitude', 'Counts').opts(line_width=4, color='red').relabel('Mean Counts')

    overlay = hv.Overlay(curves + [mean_curve]).opts(
        hv.opts.Overlay(title=title, width=400, height=400, legend_position='top_left', show_legend=True, legend_limit=100),
    )
    return overlay

# Creating the plots for all bands
plots = []
for band in bands_mag:
    band_df = mag_histograms[band]
    bins = np.array(mag_bins.loc[mag_bins['band'] == band, 'counts'].values[0])
    plot = create_plot(band_df, bins, f"Individual Files - Mag. Distrib. - Band {band}", 50)
    plots.append(plot)

# Arranging all the plots in a layout
layout = hv.Layout(plots).opts(opts.Layout(shared_axes=False)).cols(2)

layout

# Magnitude errors distributions

In the following subsections, we will read the data of the 1D histograms corresponding to the **magnitude errors distributions** and we will make plots for each band of the survey.

### How to get the data

The file containing the data of the 1D histogram was obtained using High Performance Computing (HPC) in the LIneA Apollo Cluster. There is a Python script that reads all the input Parquet files and computes the 1D histogram. In this Python file, you can personalize the magnitude errors bin edges for each band. There is also an sbatch file used to submit the job to the cluster. They can be found in ```pz-compute/doc/dp02_qa_scripts/```.

Files:
1. Detailed instructions to run the scripts: ```DP02_QA_slurm_scripts_instructions.md```
2. Python script: ```DP02_QA_histo_1d_magerr.py```
3. Sbatch script: ```DP02_QA_histo_1d_magerr.sbatch```
4. Resulting Parquet file containing the data: ```histo_1d_magerr_all_bands.parquet```

For running the .py script in the cluster, you must follow the steps in the corresponding ```.md``` file.

For this Jupyter notebook, the output file ```histo_1d_magerr_all_bands.parquet``` must be in the ```output``` folder, and this ```output``` folder must be in the same directory as the notebook itself.

**For the magnitude errors distributions, we compute one 1D histogram for each input Parquet file, for each band. Given a certain band, we use the same bin edges for each input file.**

## Reading the data

Below we can see the Parquet file structure. When the value of the 'filename' column is a path, the values of the 'counts' column are the 1D histogram counts for the file corresponding to this path. When the value of the 'filename' column is 'bins', the values of the 'counts' column are the bin edges used for each band.

In the code cells below, we also split the original dataframe (df_magerr) into a dictionary of dataframes, one per band, containing only the 1D histogram counts (magerr_histograms), and a dataframe containing only the bin edges (magerr_bins), and we convert the values to numpy arrays.

In [None]:
### Reading the Parquet file.
df_magerr = pd.read_parquet('output/histo_1d_magerr_all_bands.parquet', engine='fastparquet')

In [None]:
### Getting the bands names and the number of bands.
bands_magerr = df_magerr['band'].unique()
num_of_bands_magerr = len(bands_magerr)

### Separating the histograms of each band and the bin edges, and converting the lists to numpy arrays.
magerr_histograms = {}
for band in bands_magerr:
    magerr_histograms[band] = df_magerr[df_magerr['band'] == band]
    magerr_histograms[band] = magerr_histograms[band][magerr_histograms[band]['filename'] != 'bins']
    magerr_histograms[band] = magerr_histograms[band].reset_index(drop=True)
    magerr_histograms[band]['counts'] = magerr_histograms[band]['counts'].apply(np.array)

magerr_bins = df_magerr[df_magerr['filename'] == 'bins']
magerr_bins = magerr_bins.reset_index(drop=True)
magerr_bins['counts'] = magerr_bins['counts'].apply(np.array)

### Getting the input files names and the number of input files.
filenames_paths_magerr = df_magerr['filename'].unique()
filenames_paths_magerr = filenames_paths_magerr[filenames_paths_magerr!='bins']
filenames_paths_magerr = filenames_paths_magerr.astype(str)
filenames_magerr = np.char.replace(filenames_paths_magerr, '/lustre/t1/cl/lsst/dp02/secondary/catalogs/skinny/hdf5/', '')
filenames_len_magerr = len(filenames_magerr)

In [None]:
### PRINTING THE INFORMATION
### GENERAL INFORMATION
print(f"MAG. ERR. DIST. - NUMBER OF BANDS: {num_of_bands_magerr}")
print(f"MAG. ERR. DIST. - BAND NAMES: {bands_magerr}")
print(f"MAG. ERR. DIST. - NUMBER OF INPUT FILES: {filenames_len_magerr}")
print(f"EACH BAND HAS A TOTAL OF {filenames_len_magerr} LINES, CORRESPONDING TO EACH INPUT FILE, PLUS ONE EXTRA LINE, CORRESPONDING TO THE BIN EDGES.\n")

### INDIVIDUAL HISTOGRAMS PREVIEW
printing_lines_for_each_band = 2
print(f"\nPRINTING {printing_lines_for_each_band} LINES FOR EACH BAND, CONTAINING THE 1D HISTOGRAM COUNTS.")
for band in bands_magerr:
    print(magerr_histograms[band].head(printing_lines_for_each_band))
    print('\n')

### BIN EDGES PREVIEW
print(f"\nPRINTING THE LINES CONTAINING THE BIN EDGES (THEY CAN BE FOUND AT THE END OF THE PARQUET FILE)")
print(magerr_bins.head())

### BIN EDGES INFORMATION
print("\n\nINFORMATION ABOUT THE BINS")
for band in bands_magerr:
    bin_list = magerr_bins[magerr_bins['band']==band]['counts'].reset_index(drop=True)[0]
    bin_min = bin_list.min()
    bin_max = bin_list.max()
    step = bin_list[1] - bin_list[0]
    shape = bin_list.shape
    print(f"Min. bin edge - band {band}: {bin_min}")
    print(f"Max. bin edge - band {band}: {bin_max}")
    print(f"Step - band {band}: {step}")
    print(f"Shape - band {band}: {shape}\n")

Now, we will compute the total 1D histogram for each band. This is done by summing the values in each bin for all the individual histograms. Below, we show some information about the total 1D histogram.

In [None]:
### Computing the total 1D histogram.
total_magerr_histograms = {}
for band in bands_magerr:
    arrays = {}
    num_of_rows = len(magerr_histograms[band].index)
    for j in np.arange(0, num_of_rows, 1):
        name = 'array'+str(j)
        arrays[name] = magerr_histograms[band]['counts'][j]

    somas = []

    for elementos in zip(*arrays.values()):
        soma = sum(elementos) 
        somas.append(soma)

    data = {'counts': somas,
            'bin_edges': magerr_bins.loc[magerr_bins['band'] == band, 'counts'].values[0][:-1]}

    total_magerr_histograms[band] = pd.DataFrame(data)
    
### Printing the information.
print("INFORMATION ABOUT THE TOTAL 1D HISTOGRAM COUNTS")
for band in bands_magerr:
    print(f"Min. count - band {band}: {total_magerr_histograms[band]['counts'].min()}")
    print(f"Max. count - band {band}: {total_magerr_histograms[band]['counts'].max()}\n") 

## Magnitude errors distributions plots

### Total 1D histograms

Below, we have the plots of the total 1D histograms for each band, meaning we are considering all input parquet files together.

In [None]:
### General settings
height = 400
width = 400

def setup_plot_data(band, histograms):
    """Sets the data to the plot for the specified band."""
    counts = histograms[band]['counts']
    bin_edges = np.array(histograms[band]['bin_edges'])
    bin_size = bin_edges[1] - bin_edges[0]
    bin_edges = np.append(bin_edges, bin_edges[-1] + bin_size)
    
    max_count_index = histograms[band]['counts'].idxmax()
    max_value = histograms[band].loc[max_count_index, 'bin_edges']
    xlim = (0, max_value + 0.5)

    return counts, bin_edges, xlim

def create_histogram(band, counts, bin_edges, xlim):
    """Creates the histogram using holoviews."""
    title = f'Distribution of Magnitude Errors - Band {band}'
    label_name = f'magerr {band}'
    magnitude_errors = hv.Dimension(band, label=label_name)
    magerr_freqs = hv.Dimension(f'{band}_freqs', label=f'{label_name} freqs')
    
    return hv.Histogram((counts, bin_edges), kdims=magnitude_errors, vdims=magerr_freqs).opts(
        title=title, xlabel=label_name, ylabel='frequencies', height=height, width=width, xlim=xlim)

### Preparing the histograms data.
magerr_distribution_histo = {}
for band in bands_magerr:
    # Using the band names exactly as stored, since all dictionaries are keyed by them.
    counts, bin_edges, xlim = setup_plot_data(band, total_magerr_histograms)
    magerr_distribution_histo[band] = create_histogram(band, counts, bin_edges, xlim)

### Composing the plots in a hv.Layout and showing.
magerr_distribution = hv.Layout([magerr_distribution_histo[band] for band in bands_magerr]).cols(2)
magerr_distribution

### Individual 1D histograms for each input file

Now, in the plots below, for a given band, each light-colored curve is the 1D histogram of one individual input Parquet file. To keep the plots readable, only one in every 50 files is drawn. The red curve, in turn, represents the mean value of counts in each bin, considering all input Parquet files.

Note that if the histograms from each individual input file are very similar, meaning each input file has enough data so that statistical deviations are very small, **you may need to zoom in on the plot to distinguish the light curves, as they will all be very close to the average curve (red curve)**.

In [None]:
# Function that creates the plots for a given dataframe and band
def create_plot(df, bins, title, step=10):
    curves = []
    all_counts = []
    # Bin centers, computed once so that they are always defined for the mean curve.
    centers = (bins[:-1] + bins[1:]) / 2

    for i, row in df.iterrows():
        counts = np.array(row['counts'])
        all_counts.append(counts)
        # Drawing only one curve in every `step` files.
        if i % step == 0:
            curve = hv.Curve((centers, counts), 'Magnitude Errors', 'Counts').opts(line_width=2, alpha=0.3)
            curves.append(curve)

    mean_counts = np.mean(all_counts, axis=0)
    mean_curve = hv.Curve((centers, mean_counts), 'Magnitude Errors', 'Counts').opts(line_width=4, color='red').relabel('Mean Counts')

    overlay = hv.Overlay(curves + [mean_curve]).opts(
        hv.opts.Overlay(title=title, width=400, height=400, legend_position='top_left', show_legend=True, legend_limit=100),
    )
    return overlay

# Creating the plots for all bands
plots = []
for band in bands_magerr:
    band_df = magerr_histograms[band]
    bins = np.array(magerr_bins.loc[magerr_bins['band'] == band, 'counts'].values[0])
    plot = create_plot(band_df, bins, f"Individual Files - Mag. Err. Distrib. - Band {band}", 50)
    plots.append(plot)

# Arranging all the plots in a layout
layout = hv.Layout(plots).opts(opts.Layout(shared_axes=False)).cols(2)

layout

# Magnitude x Magnitude Error

In the following subsections, we will read the data of the 2D histogram corresponding to the **Magnitude x Magnitude Error** plots, considering the magnitudes and magnitude errors for each band.

All graphs will also have a **colorbar** corresponding to the counts of objects per magnitude and magnitude error bin of the 2D histogram.

### How to get the data

The file containing the data of the 2D histogram was obtained using High Performance Computing (HPC) on the LIneA Apollo Cluster. A Python script reads all the input Parquet files and computes the 2D histogram. In this script, you can customize the magnitude and magnitude error bin edges. There is also an sbatch file used to submit the job to the cluster. Both can be found in ```pz-compute/doc/dp02_qa_scripts/```.

Files:
1. Detailed instructions to run the scripts: ```DP02_QA_slurm_scripts_instructions.md```
2. Python script: ```DP02_QA_histo_2d_mag_magerr.py```
3. Sbatch script: ```DP02_QA_histo_2d_mag_magerr.sbatch```
4. Resulting Parquet file containing the data: ```histo_2d_mag_magerr_all_bands.parquet```

To run the .py script on the cluster, follow the steps in the corresponding ```.md``` file.

For this Jupyter notebook, the output file ```histo_2d_mag_magerr_all_bands.parquet``` must be in the ```output``` folder, which must be in the same directory as the notebook.

**For the Magnitude x Magnitude Error plots, we do not compute one 2D histogram for each input Parquet file, just the total histogram considering all files together, for each band.**
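A total histogram can be accumulated file by file, because histograms that share the same bin edges are additive. The sketch below shows the idea with random stand-in data. The bin edges, the number of "files", and all variable names are assumptions made for illustration, not the contents of the actual cluster script.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical bin edges; the real script defines its own.
mag_edges = np.linspace(20, 27, 8)
magerr_edges = np.linspace(0, 0.2, 5)

# Running total, one cell per (mag bin, magerr bin).
total = np.zeros((len(mag_edges) - 1, len(magerr_edges) - 1))
chunks = []

for file_index in range(3):  # stand-in for looping over the input Parquet files
    mag = rng.uniform(20, 27, size=1000)
    magerr = rng.uniform(0, 0.2, size=1000)
    chunks.append((mag, magerr))
    counts, _, _ = np.histogram2d(mag, magerr, bins=[mag_edges, magerr_edges])
    total += counts

# Check: histogramming all the data at once gives the same result.
all_mag = np.concatenate([chunk[0] for chunk in chunks])
all_magerr = np.concatenate([chunk[1] for chunk in chunks])
direct, _, _ = np.histogram2d(all_mag, all_magerr, bins=[mag_edges, magerr_edges])
print(np.array_equal(total, direct))
```

Accumulating this way means only one file needs to be in memory at a time, which is one plausible reason to compute the total on the cluster rather than in the notebook.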

## Reading the data

Below, we can see the structure of the output Parquet file and information about the bins. For each band, there are two rows: one for the 2D histogram counts and the other for the bins (mag_bins and magerr_bins).

In the code below, we also split the original dataframe (df_mag_magerr) into a dictionary of dataframes, keyed by band, containing only the 2D histogram counts (mag_magerr_histograms), and a dataframe containing only the bins (mag_magerr_bins).
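To make the two-rows-per-band layout concrete, here is a minimal sketch on a toy dataframe. The column names `band`, `type`, and `values` and the `'bins'` label follow the notebook code, while the `'histogram'` label and all the numbers are assumptions made up for this example.

```python
import pandas as pd

# Toy dataframe mimicking the assumed layout: two rows per band.
df = pd.DataFrame({
    "band": ["g", "g", "r", "r"],
    "type": ["histogram", "bins", "histogram", "bins"],  # 'histogram' is a made-up label
    "values": [
        [[1, 0], [2, 3]],
        {"mag_bins": [20.0, 21.0, 22.0], "magerr_bins": [0.0, 0.1, 0.2]},
        [[4, 5], [0, 6]],
        {"mag_bins": [20.0, 21.0, 22.0], "magerr_bins": [0.0, 0.1, 0.2]},
    ],
})

# One dataframe with the bins rows only.
is_bins = df["type"] == "bins"
bins_df = df[is_bins].reset_index(drop=True)

# One dataframe of counts per band, stored in a dictionary keyed by band.
histograms = {
    band: group.reset_index(drop=True)
    for band, group in df[~is_bins].groupby("band")
}

print(sorted(histograms))
print(bins_df["band"].tolist())
```

The `groupby` call is only one way to do the split; the notebook cell uses explicit boolean filters per band, which gives the same result.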

In [None]:
### Reading the data.
df_mag_magerr = pd.read_parquet('output/histo_2d_mag_magerr_all_bands.parquet', engine='fastparquet')

### Getting the bands names and the number of bands.
bands_mag_magerr = df_mag_magerr['band'].unique()
num_of_bands_mag_magerr = len(bands_mag_magerr)

### Separating the data of the counts into a dictionary of dataframes, one per band.
mag_magerr_histograms = {}
for band in bands_mag_magerr:
    mag_magerr_histograms[band] = df_mag_magerr[df_mag_magerr['band'] == band]
    mag_magerr_histograms[band] = mag_magerr_histograms[band][mag_magerr_histograms[band]['type'] != 'bins']
    mag_magerr_histograms[band] = mag_magerr_histograms[band].reset_index(drop=True)

### Separating the data of the bins into a different dataframe.
mag_magerr_bins = df_mag_magerr[df_mag_magerr['type'] == 'bins']
mag_magerr_bins = mag_magerr_bins.reset_index(drop=True)

In [None]:
df_mag_magerr

In [None]:
### BIN EDGES INFORMATION
print("\n INFORMATION ABOUT THE BINS \n")
for band in bands_mag_magerr:
    bin_list = mag_magerr_bins[mag_magerr_bins['band']==band]['values'].reset_index(drop=True)[0]
    
    mag_bins = np.array(bin_list['mag_bins'])
    bin_mag_min = mag_bins.min()
    bin_mag_max = mag_bins.max()
    step_mag = mag_bins[1] - mag_bins[0]
    shape_mag = mag_bins.shape
    
    magerr_bins = np.array(bin_list['magerr_bins'])
    bin_magerr_min = magerr_bins.min()
    bin_magerr_max = magerr_bins.max()
    step_magerr = magerr_bins[1] - magerr_bins[0]
    shape_magerr = magerr_bins.shape
    
    print(f"BAND {band}")
    print(f"Min. mag bin edge: {bin_mag_min:.3f}")
    print(f"Max. mag bin edge: {bin_mag_max:.3f}")
    print(f"Step mag: {step_mag:.3f}")
    print(f"Shape mag: {shape_mag}\n")
    print(f"Min. magerr bin edge: {bin_magerr_min:.4f}")
    print(f"Max. magerr bin edge: {bin_magerr_max:.4f}")
    print(f"Step magerr: {step_magerr:.4f}")
    print(f"Shape magerr: {shape_magerr}\n")

## Magnitude x Magnitude Error plots

Before making the plots, we must perform some tasks:

1. Change the 0 values in the 2D histogram counts array to NaN values, so that they appear white in the plot.
2. Compute the centers of the bins.
3. Transpose the 2D histogram counts arrays so that they become compatible with HoloViews/GeoViews.
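The three steps above can be sketched with NumPy on a tiny made-up array. The sketch assumes the counts have their first axis along x, as `np.histogram2d` returns them. All names and numbers are illustrative.

```python
import numpy as np

# Made-up counts with shape (n_x_bins, n_y_bins) = (2, 3).
counts = np.array([[0, 5, 2],
                   [3, 0, 7]], dtype=float)
x_bin_edges = np.array([20.0, 21.0, 22.0])
y_bin_edges = np.array([0.0, 0.1, 0.2, 0.3])

# 1. Replace zeros with NaN so that empty bins are not colored.
masked = np.where(counts == 0, np.nan, counts)

# 2. Bin centers are the midpoints of consecutive edges.
x_centers = (x_bin_edges[:-1] + x_bin_edges[1:]) / 2
y_centers = (y_bin_edges[:-1] + y_bin_edges[1:]) / 2

# 3. Transpose so that rows run along y and columns along x.
image = masked.T

# With NaNs present, np.max returns NaN; np.nanmax ignores them.
print(image.shape, np.max(image), np.nanmax(image))
```

The last line matters when choosing color limits: once zeros have been replaced by NaN, the largest count has to be taken with `np.nanmax`.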

In [None]:
### Converting the zero values to NaN in the 2D lists.
def convert_zeros_to_nan(lst):
    return [[np.nan if x == 0 else x for x in sublist] for sublist in lst]

for band in mag_magerr_histograms:
    mag_magerr_histograms[band]['values'] = mag_magerr_histograms[band]['values'].apply(convert_zeros_to_nan)

### Computing the centers of the bins (x_edges and y_edges hold bin centers, despite their names).
x_edges = {}
y_edges = {}

for band in bands_mag_magerr:
    mag_bins_vals = np.array(mag_magerr_bins.loc[mag_magerr_bins['band'] == band, 'values'].values[0]['mag_bins'])
    magerr_bins_vals = np.array(mag_magerr_bins.loc[mag_magerr_bins['band'] == band, 'values'].values[0]['magerr_bins'])
    
    x_edges[band] = (mag_bins_vals[1:] + mag_bins_vals[:-1])/2
    y_edges[band] = (magerr_bins_vals[1:] +  magerr_bins_vals[:-1])/2
    
### Transposing the histograms.
histo = {}
for band in bands_mag_magerr:
    histo[band] = np.vstack(mag_magerr_histograms[band]['values'])   
    histo[band] = histo[band].T

Finally, we have the Magnitude x Magnitude Error plots:

In [None]:
%%time
plots = []

for band in bands_mag_magerr:
    hv_image = hv.Image((x_edges[band], y_edges[band], histo[band]), [f'mag_{band}', f'magerr_{band}'], f'counts_{band}')
    
    hv_image = hv_image.opts(
        opts.Image(cmap='viridis', cnorm='log', colorbar=True, width=600, height=400,
                   xlim=(20, 27), ylim=(0, 0.2), tools=['hover'], clim=(10, np.nanmax(histo[band])),
                   title=f'Magnitude x Magnitude Error - Band {band}')
    )
    
    plots.append(hv_image)

# Arranging the plots in two columns
layout = hv.Layout(plots).cols(2)
layout

# Magnitude x Color

In the following subsections, we will read the data of the 2D histogram corresponding to the **Magnitude x Color** plots.

All graphs will also have a **colorbar** corresponding to the counts of objects per magnitude and color bin of the 2D histogram.

### How to get the data

The file containing the data of the 2D histogram was obtained using High Performance Computing (HPC) on the LIneA Apollo Cluster. A Python script reads all the input Parquet files and computes the 2D histogram. In this script, you can customize which magnitude x color graphs are produced and their bin edges. There is also an sbatch file used to submit the job to the cluster. Both can be found in ```pz-compute/doc/dp02_qa_scripts/```.

Files:
1. Detailed instructions to run the scripts: ```DP02_QA_slurm_scripts_instructions.md```
2. Python script: ```DP02_QA_histo_2d_mag_color.py```
3. Sbatch script: ```DP02_QA_histo_2d_mag_color.sbatch```
4. Resulting Parquet file containing the data: ```histo_2d_mag_color_all_graphs.parquet```

To run the .py script on the cluster, follow the steps in the corresponding ```.md``` file.

For this Jupyter notebook, the output file ```histo_2d_mag_color_all_graphs.parquet``` must be in the ```output``` folder, which must be in the same directory as the notebook.

**For the Magnitude x Color plots, we do not compute one 2D histogram for each input Parquet file, just the total histogram considering all files together, for each graph.**
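For orientation, a color is the difference between the magnitudes in two bands. The sketch below builds one magnitude x color histogram from a few made-up values. It assumes the script computes colors this way; the graph key `r_x_g-r`, the bin edges, and the magnitudes are all hypothetical. Only the `_x_` separator comes from the plotting code in this notebook.

```python
import numpy as np

# Made-up magnitudes for three objects.
mag_g = np.array([24.1, 23.5, 25.0])
mag_r = np.array([23.6, 23.4, 24.2])

# Color g-r: difference between the g and r magnitudes.
color_gr = mag_g - mag_r

# Hypothetical bin edges for the magnitude and color axes.
mag_edges = np.linspace(20, 27, 8)
color_edges = np.linspace(-2, 2, 9)

# 2D histogram of r magnitude versus g-r color.
counts, _, _ = np.histogram2d(mag_r, color_gr, bins=[mag_edges, color_edges])

# Hypothetical graph key; the notebook splits keys on '_x_'.
graph = "r_x_g-r"
band, color = graph.split("_x_")

print(band, color, counts.sum())
```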

## Reading the data

Below, we can see the structure of the output Parquet file and information about the bins. For each graph, there are two rows: one for the 2D histogram counts and the other for the bins (mag_bins and color_bins).

In the code below, we also split the original dataframe (df_mag_color) into a dictionary of dataframes, keyed by graph, containing only the 2D histogram counts (mag_color_histograms), and a dataframe containing only the bins (mag_color_bins).

In [None]:
### Reading the data.
df_mag_color = pd.read_parquet('output/histo_2d_mag_color_all_graphs.parquet', engine='fastparquet')

graphs = df_mag_color["graph"].unique()

### Separating the data of the counts into a dictionary of dataframes, one per graph.
mag_color_histograms = {}
for graph in graphs:
    mag_color_histograms[graph] = df_mag_color[df_mag_color['graph'] == graph]
    mag_color_histograms[graph] = mag_color_histograms[graph][mag_color_histograms[graph]['type'] != 'bins']
    mag_color_histograms[graph] = mag_color_histograms[graph].reset_index(drop=True)

### Separating the data of the bins into a different dataframe.
mag_color_bins = df_mag_color[df_mag_color['type'] == 'bins']
mag_color_bins = mag_color_bins.reset_index(drop=True)

In [None]:
df_mag_color

In [None]:
### BIN EDGES INFORMATION
print("\n INFORMATION ABOUT THE BINS \n")
for graph in graphs:
    bin_list = mag_color_bins[mag_color_bins['graph']==graph]['values'].reset_index(drop=True)[0]
    
    mag_bins = np.array(bin_list['mag_bins'])
    bin_mag_min = mag_bins.min()
    bin_mag_max = mag_bins.max()
    step_mag = mag_bins[1] - mag_bins[0]
    shape_mag = mag_bins.shape
    
    color_bins = np.array(bin_list['color_bins'])
    bin_color_min = color_bins.min()
    bin_color_max = color_bins.max()
    step_color = color_bins[1] - color_bins[0]
    shape_color = color_bins.shape
    
    print(f"GRAPH {graph}")
    print(f"Min. mag bin edge: {bin_mag_min:.3f}")
    print(f"Max. mag bin edge: {bin_mag_max:.3f}")
    print(f"Step mag: {step_mag:.3f}")
    print(f"Shape mag: {shape_mag}\n")
    print(f"Min. color bin edge: {bin_color_min:.4f}")
    print(f"Max. color bin edge: {bin_color_max:.4f}")
    print(f"Step color: {step_color:.4f}")
    print(f"Shape color: {shape_color}\n")

## Magnitude x Color plots

Before making the plots, we must perform some tasks:

1. Change the 0 values in the 2D histogram counts array to NaN values, so that they appear white in the plot.
2. Compute the centers of the bins.
3. Transpose the 2D histogram counts arrays so that they become compatible with HoloViews/GeoViews.

In [None]:
### Converting the zero values to NaN in the 2D lists.
def convert_zeros_to_nan(lst):
    return [[np.nan if x == 0 else x for x in sublist] for sublist in lst]

for graph in mag_color_histograms:
    mag_color_histograms[graph]['values'] = mag_color_histograms[graph]['values'].apply(convert_zeros_to_nan)

### Computing the centers of the bins (x_edges and y_edges hold bin centers, despite their names).
x_edges = {}
y_edges = {}

for graph in graphs:
    mag_bins_vals = np.array(mag_color_bins.loc[mag_color_bins['graph'] == graph, 'values'].values[0]['mag_bins'])
    color_bins_vals = np.array(mag_color_bins.loc[mag_color_bins['graph'] == graph, 'values'].values[0]['color_bins'])
    
    x_edges[graph] = (mag_bins_vals[1:] + mag_bins_vals[:-1])/2
    y_edges[graph] = (color_bins_vals[1:] +  color_bins_vals[:-1])/2
    
### Transposing the histograms.
histo = {}
for graph in graphs:
    histo[graph] = np.vstack(mag_color_histograms[graph]['values'])   
    histo[graph] = histo[graph].T

Finally, we have the Magnitude x Color plots:

In [None]:
%%time
plots = []

# The index i prefixes the dimension names so that each plot has its own axes.
for i, graph in enumerate(graphs):
    band, color = graph.split('_x_')
    
    hv_image = hv.Image((x_edges[graph], y_edges[graph], histo[graph]), [f'{i}_mag_{band}', f'{i}_color_{color}'], f'{i}_counts_{graph}')
    
    hv_image = hv_image.opts(
        opts.Image(cmap='viridis', cnorm='log', colorbar=True, width=600, height=400,
                   xlim=(15, 28), ylim=(-2, 2), tools=['hover'], clim=(10, np.nanmax(histo[graph])),
                   title=f'Magnitude x Color - Band {band} vs Color {color}')
    )
    
    plots.append(hv_image)

# Arranging the plots in two columns
layout = hv.Layout(plots).cols(2)
layout

# Color x Color

In the following subsections, we will read the data of the 2D histogram corresponding to the **Color x Color** plots.

All graphs will also have a **colorbar** corresponding to the counts of objects in each (color 1, color 2) bin of the 2D histogram.

### How to get the data

The file containing the data of the 2D histogram was obtained using High Performance Computing (HPC) on the LIneA Apollo Cluster. A Python script reads all the input Parquet files and computes the 2D histogram. In this script, you can customize which color x color graphs are produced and their bin edges. There is also an sbatch file used to submit the job to the cluster. Both can be found in ```pz-compute/doc/dp02_qa_scripts/```.

Files:
1. Detailed instructions to run the scripts: ```DP02_QA_slurm_scripts_instructions.md```
2. Python script: ```DP02_QA_histo_2d_color_color.py```
3. Sbatch script: ```DP02_QA_histo_2d_color_color.sbatch```
4. Resulting Parquet file containing the data: ```histo_2d_color_color_all_graphs.parquet```

To run the .py script on the cluster, follow the steps in the corresponding ```.md``` file.

For this Jupyter notebook, the output file ```histo_2d_color_color_all_graphs.parquet``` must be in the ```output``` folder, which must be in the same directory as the notebook.

**For the Color x Color plots, we do not compute one 2D histogram for each input Parquet file, just the total histogram considering all files together, for each graph.**

## Reading the data

Below, we can see the structure of the output Parquet file and information about the bins. For each graph, there are two rows: one for the 2D histogram counts and the other for the bins (color1_bins and color2_bins).

In the code below, we also split the original dataframe (df_color_color) into a dictionary of dataframes, keyed by graph, containing only the 2D histogram counts (color_color_histograms), and a dataframe containing only the bins (color_color_bins).

In [None]:
### Reading the data.
df_color_color = pd.read_parquet('output/histo_2d_color_color_all_graphs.parquet', engine='fastparquet')

graphs_color_color = df_color_color["graph"].unique()

### Separating the data of the counts into a dictionary of dataframes, one per graph.
color_color_histograms = {}
for graph in graphs_color_color:
    color_color_histograms[graph] = df_color_color[df_color_color['graph'] == graph]
    color_color_histograms[graph] = color_color_histograms[graph][color_color_histograms[graph]['type'] != 'bins']
    color_color_histograms[graph] = color_color_histograms[graph].reset_index(drop=True)

### Separating the data of the bins into a different dataframe.
color_color_bins = df_color_color[df_color_color['type'] == 'bins']
color_color_bins = color_color_bins.reset_index(drop=True)

In [None]:
df_color_color

In [None]:
### BIN EDGES INFORMATION
print("\n INFORMATION ABOUT THE BINS \n")
for graph in graphs_color_color:
    bin_list = color_color_bins[color_color_bins['graph']==graph]['values'].reset_index(drop=True)[0]
    
    color1_bins = np.array(bin_list['color1_bins'])
    bin_color1_min = color1_bins.min()
    bin_color1_max = color1_bins.max()
    step_color1 = color1_bins[1] - color1_bins[0]
    shape_color1 = color1_bins.shape
    
    color2_bins = np.array(bin_list['color2_bins'])
    bin_color2_min = color2_bins.min()
    bin_color2_max = color2_bins.max()
    step_color2 = color2_bins[1] - color2_bins[0]
    shape_color2 = color2_bins.shape
    
    print(f"GRAPH {graph}")
    print(f"Min. color1 bin edge: {bin_color1_min:.3f}")
    print(f"Max. color1 bin edge: {bin_color1_max:.3f}")
    print(f"Step color1: {step_color1:.3f}")
    print(f"Shape color1: {shape_color1}\n")
    print(f"Min. color2 bin edge: {bin_color2_min:.4f}")
    print(f"Max. color2 bin edge: {bin_color2_max:.4f}")
    print(f"Step color2: {step_color2:.4f}")
    print(f"Shape color2: {shape_color2}\n")

## Color x Color plots

Before making the plots, we must perform some tasks:

1. Change the 0 values in the 2D histogram counts array to NaN values, so that they appear white in the plot.
2. Compute the centers of the bins.
3. Transpose the 2D histogram counts arrays so that they become compatible with HoloViews/GeoViews.

In [None]:
### Converting the zero values to NaN in the 2D lists.
def convert_zeros_to_nan(lst):
    return [[np.nan if x == 0 else x for x in sublist] for sublist in lst]

for graph in color_color_histograms:
    color_color_histograms[graph]['values'] = color_color_histograms[graph]['values'].apply(convert_zeros_to_nan)

### Computing the centers of the bins (x_edges and y_edges hold bin centers, despite their names).
x_edges = {}
y_edges = {}

for graph in graphs_color_color:
    color1_bins_vals = np.array(color_color_bins.loc[color_color_bins['graph'] == graph, 'values'].values[0]['color1_bins'])
    color2_bins_vals = np.array(color_color_bins.loc[color_color_bins['graph'] == graph, 'values'].values[0]['color2_bins'])
    
    x_edges[graph] = (color1_bins_vals[1:] + color1_bins_vals[:-1])/2
    y_edges[graph] = (color2_bins_vals[1:] +  color2_bins_vals[:-1])/2
    
### Transposing the histograms.
histo = {}
for graph in graphs_color_color:
    histo[graph] = np.vstack(color_color_histograms[graph]['values'])   
    histo[graph] = histo[graph].T

Finally, we have the Color x Color plots:

In [None]:
%%time
plots = []

# The index i prefixes the dimension names so that each plot has its own axes.
for i, graph in enumerate(graphs_color_color):
    color1, color2 = graph.split('_x_')
    
    hv_image = hv.Image((x_edges[graph], y_edges[graph], histo[graph]), [f'{i}_color_{color1}', f'{i}_color_{color2}'], f'{i}_counts_{graph}')
    
    hv_image = hv_image.opts(
        opts.Image(cmap='viridis', cnorm='log', colorbar=True, width=600, height=400,
                   xlim=(-6, 6), ylim=(-6, 6), tools=['hover'], clim=(10, np.nanmax(histo[graph])),
                   title=f'Color x Color - {color1} vs {color2}')
    )
    
    plots.append(hv_image)

# Arranging the plots in two columns
layout = hv.Layout(plots).cols(2)
layout