<img align="left" src = https://linea.org.br/wp-content/themes/LIneA/imagens/logo-header.jpg width=130 style="padding: 20px"> 

# Photo-z Compute Scalability Tests
## Optimizing software infrastructure to compute photo-zs at the LSST scale: preparing for LSST DR1.

<br><br>

--- 
Main notebook: [PZ_Compute_Tests.ipynp](./PZ_Compute_Tests.ipynp)

Control spreadsheet: [PZ Compute Runs](https://docs.google.com/spreadsheets/d/1GKlDhLx7oXTjwBXoj8pzfrqnE7X-4nUW2sYDuY-tx94/edit?usp=sharing)

Project members: Julia Gschwend, Heloisa Mengisztki, Cristiano Singulani, Henrique Dante

Last verified run: 27/07/2023

--- 



# Test 3: Test variation with SED template library size

Science question: 

_"How does the total runtime depend on the template set size for a template-fitting-based method (BPZ)? Does it scale linearly with the number of templates?"_

Apollo nodes: apl01, apl03, apl05, apl07, apl09, apl11, apl13, apl15 

Input data: DP0.2 Full (1,935 pre-processed Parquet files, 278,318,452 objects, 33 GB) 

Tested for 2 cases: 
- default templates: 8 SEDs from Coleman, Wu & Weedman (1980) 
- 2 identical copies of the same default templates (16 SEDs in total)

More SEDs mean a longer run, as expected, but there is not enough data to draw strong conclusions. Moreover, the results are biased by the use of the slow machines apl13 and apl15.  
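
With only two library sizes, the linearity question reduces to comparing the runtime ratio with the library-size ratio. The sketch below assumes a power-law model $t \propto N^{\alpha}$ and computes the exponent implied by two points. The function name `scaling_exponent` and the runtimes in the usage lines are illustrative placeholders, not measured results.

```python
import math

def scaling_exponent(n_a, t_a, n_b, t_b):
    """Exponent alpha of t ~ N**alpha implied by two (library size, runtime) points."""
    return math.log(t_b / t_a) / math.log(n_b / n_a)

# Placeholder runtimes in seconds, NOT measured values.
alpha = scaling_exponent(8, 1000.0, 16, 2000.0)
print(f"alpha = {alpha:.2f}")
```

A value of alpha close to 1 would be consistent with linear scaling. A value well below 1 would suggest that costs independent of the template count (e.g., I/O) take a large share of the runtime. Two points cannot distinguish between these models, so this is only a consistency check.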

We also tried other SED lists, e.g., COSMOS_MOD.list (32 SEDs used in Ilbert et al., 2009) and CFHTLS_MOD.list (66 SEDs used in Ilbert et al., 2006), but the runs failed: the version of bpz_lite wrapped in RAIL does not readily accept external SED libraries, and supporting them would require changes in the code. 


Test inconclusive → future work: implement the flexibility to use different SED template libraries (or open an issue in RAIL's repository requesting it).  



In [None]:
import numpy as np
import pandas as pd
import tables_io
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
from datetime import datetime 
import time 

%matplotlib inline

In [None]:
apollo_dict = {'10.148.0.11' : 'apl01', 
                '10.148.0.12' : 'apl02', 
                '10.148.0.13' : 'apl03', 
                '10.148.0.14' : 'apl04', 
                '10.148.0.15' : 'apl05', 
                '10.148.0.16' : 'apl06', 
                '10.148.0.17' : 'apl07', 
                '10.148.0.18' : 'apl08', 
                '10.148.0.19' : 'apl09',                
                '10.148.0.27' : 'apl10', 
                '10.148.0.28' : 'apl11', 
                '10.148.0.29' : 'apl12', 
                '10.148.0.30' : 'apl13', 
                '10.148.0.31' : 'apl14', 
                '10.148.0.32' : 'apl15',
                '10.148.0.26' : 'apl16'} 

Read results collected from htcondor log files and stored in CSV summary files: 

In [None]:
# BPZ
bpz_seds_default  = pd.read_csv('results/tests/bpz_seds_default.csv') 
bpz_seds_2x_default  = pd.read_csv('results/tests/bpz_seds_2x_default.csv')

In [None]:
bpz_seds_2x_default.host.unique()

In [None]:
# Collect the IPs of the nodes to be excluded from the analysis
bad_nodes = ("apl13", "apl14", "apl15")
bad_hosts = []
for host, name in apollo_dict.items(): 
    if name in bad_nodes: 
        bad_hosts.append(host) 
        print(host)

Data cleaning: remove the results generated by the faulty machines (host IPs listed above) to minimize bias. 

In [None]:
query = f'host != "{bad_hosts[0]}" & host != "{bad_hosts[1]}" & host != "{bad_hosts[2]}" '  
bpz_seds_default.query(query, inplace=True)  
bpz_seds_2x_default.query(query, inplace=True)  
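
The same exclusion can be written without indexing into `bad_hosts` element by element, which keeps working if the list changes length. A minimal sketch using `Series.isin` on a toy table (the rows and `time_diff` values are invented for illustration):

```python
import pandas as pd

# Toy stand-in for a results table; values are illustrative only.
toy = pd.DataFrame({
    "host": ["10.148.0.11", "10.148.0.30", "10.148.0.13", "10.148.0.32"],
    "time_diff": [10.0, 25.0, 11.0, 27.0],
})
bad_hosts = ["10.148.0.30", "10.148.0.31", "10.148.0.32"]

# Keep only the rows whose host is not in the exclusion list.
clean = toy[~toy["host"].isin(bad_hosts)].reset_index(drop=True)
print(clean)
```

Unlike `query(..., inplace=True)`, this returns a new dataframe and leaves the original untouched, which makes it easier to compare results with and without the excluded nodes.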

In [None]:
bpz_seds_2x_default.host.unique()

Organize dataframes from the tests results used in the analysis: 

In [None]:
bpz_runs = {
    'bpz_seds_default'   : bpz_seds_default ,
    'bpz_seds_2x_default': bpz_seds_2x_default
}

In [None]:
for test, df in bpz_runs.items():
    print(f'{test} ran on {len(df.host.unique())} nodes: ')
    print(np.sort([apollo_dict[host] for host in df.host.unique()]))
    print('---')

Compute the inverse speed (speed$^{-1}$, the processing time per object in milliseconds) and add it to each results dataframe as the `speed` column: 

In [None]:
for results_df in bpz_runs.values():
    results_df['speed'] = (results_df['time_diff']/results_df['chunks'])*1000.
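
To make the units explicit, here is a small worked example on a toy table. It assumes, as the cell above suggests, that `time_diff` is the processing time of a file in seconds and `chunks` is the number of objects in that file. The numbers are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    "time_diff": [30.0, 45.0],       # seconds per file (assumed unit)
    "chunks": [150_000, 150_000],    # objects per file (assumed meaning)
})

# Time per object in milliseconds: lower is better.
toy["ms_per_object"] = toy["time_diff"] / toy["chunks"] * 1000.0
# The reciprocal quantity, objects per second: higher is better.
toy["objects_per_second"] = toy["chunks"] / toy["time_diff"]
print(toy)
```

Because the `speed` column holds a time per object, a slower node shows a larger value, not a smaller one.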

In [None]:
bpz_runs['bpz_seds_default'].head()

Build a dataframe with process summary info:

Function to recalculate the effective runtime, taking into account only the files processed by the good nodes: 

In [None]:
def calc_runtime(pz_results_dict, test_name):   
    """Return (begin, end, runtime in seconds) for the given test."""
    # Earliest start and latest end among the files of this test
    str_begin = pz_results_dict[test_name]['time_begin'].min()
    str_end = pz_results_dict[test_name]['time_end'].max()
    t_begin = datetime.strptime(str_begin,'%Y-%m-%d %H:%M:%S')
    t_end = datetime.strptime(str_end,'%Y-%m-%d %H:%M:%S')
    dt = (t_end - t_begin)
    runtime = dt.total_seconds()
    return str_begin, str_end, runtime 
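
`calc_runtime` takes the minimum and maximum of the timestamp strings before parsing them. This works because the `%Y-%m-%d %H:%M:%S` format sorts lexicographically in chronological order. A variant that parses first and does not rely on that property could look like the sketch below; `calc_runtime_pd` is a hypothetical name and the timestamps are invented.

```python
import pandas as pd

def calc_runtime_pd(results_df):
    """Wall-clock span in seconds between the earliest start and the latest end."""
    fmt = "%Y-%m-%d %H:%M:%S"
    t_begin = pd.to_datetime(results_df["time_begin"], format=fmt).min()
    t_end = pd.to_datetime(results_df["time_end"], format=fmt).max()
    return (t_end - t_begin).total_seconds()

# Toy table with invented timestamps.
toy = pd.DataFrame({
    "time_begin": ["2023-07-27 10:00:00", "2023-07-27 10:00:05"],
    "time_end":   ["2023-07-27 10:20:00", "2023-07-27 10:30:05"],
})
runtime = calc_runtime_pd(toy)
print(f"runtime = {runtime} s")
```

Either way, the result is the span from the first file started to the last file finished, so idle gaps between jobs are counted as runtime.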

Example:

In [None]:
test = 'bpz_seds_default'
begin, end, runtime  = calc_runtime(bpz_runs, test)
print(f'test {test} started at {begin}, finished at {end}, and took ~{round(runtime/60.)} minutes')

In [None]:
# One empty summary entry per test
bpz_info = {key: {} for key in bpz_runs}
for test_name, results_df in bpz_runs.items():
    hosts = [] 
    for host, name in apollo_dict.items():
        if host in results_df['host'].unique():
            hosts.append(name)
    bpz_info[test_name]['hosts'] = hosts
    bpz_info[test_name]['n_cores'] = len(hosts) * 56 
    bpz_info[test_name]['n_obj'] = np.sum(results_df['chunks'])
    begin, end, runtime  = calc_runtime(bpz_runs, test_name)
    bpz_info[test_name]['time_begin'] = begin
    bpz_info[test_name]['time_end'] = end
    bpz_info[test_name]['runtime'] = runtime
    bpz_info[test_name]['n_files'] = len(results_df['host'])
    bpz_info[test_name]['avg_speed'] = np.average(results_df['speed'])   
    bpz_info[test_name]['std_speed'] = np.std(results_df['speed']) 

bpz_info = pd.DataFrame(bpz_info).T
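
Since slow nodes can bias the per-test averages, a per-host breakdown of the `speed` column may help spot them before deciding what to exclude. A minimal sketch with `groupby` on a toy table (hosts and values are invented for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "host": ["10.148.0.11", "10.148.0.11", "10.148.0.30", "10.148.0.30"],
    "speed": [0.20, 0.22, 0.50, 0.54],   # ms per object, invented values
})

# Number of files, mean and standard deviation of the time per object, by host.
per_host = toy.groupby("host")["speed"].agg(["count", "mean", "std"])
print(per_host)
```

Note that pandas `std` uses the sample standard deviation (`ddof=1`) by default, whereas `np.std`, used for `std_speed` above, defaults to `ddof=0`, so the two are not directly comparable for small samples.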

In [None]:
bpz_info.index

In [None]:
bpz_info

--- 
## Linear fit

In [None]:
bpz_info.runtime

In [None]:
x = np.array([8, 16]) # SED library size 
y_estimate = np.array(bpz_info.runtime) 

In [None]:
plt.figure(dpi=100)
plt.grid(True)
plt.plot(x, y_estimate/60., 'o-', label="pz estimate (BPZ)")
plt.xlabel('SED library size')
plt.ylabel('total runtime (min)')
plt.xticks([8, 16])
plt.xlim(0, 20)
plt.ylim(0, 40)
plt.legend(loc="upper right")
plt.tight_layout()
plt.savefig('test_SED_lib_size.png')
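
For reference, a straight-line fit to these two library sizes can be sketched with `np.polyfit`. The runtimes below are placeholders, not the measured values in `bpz_info`.

```python
import numpy as np

x = np.array([8, 16])             # SED library sizes used in this test
y = np.array([1000.0, 1600.0])    # placeholder runtimes in seconds, NOT measured

# Degree-1 fit: y = slope * x + intercept
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)
print(f"slope = {slope:.1f} s per SED, intercept = {intercept:.1f} s")
```

Under a linear model, the slope would read as the cost per additional template and the intercept as the overhead independent of the library size. With two points the line passes through both and the residuals are zero, so the fit cannot test linearity; at least a third library size would be needed for that.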