<img align="left" src = https://linea.org.br/wp-content/themes/LIneA/imagens/logo-header.jpg width=130 style="padding: 20px"> 

# Photo-z Compute Scalability Tests
## Optimizing software infrastructure to compute photo-zs at LSST scale: preparing for LSST DR1.

<br><br>

--- 
Main notebook: [PZ_Compute_Tests.ipynb](./PZ_Compute_Tests.ipynb)

Control spreadsheet: [PZ Compute Runs](https://docs.google.com/spreadsheets/d/1GKlDhLx7oXTjwBXoj8pzfrqnE7X-4nUW2sYDuY-tx94/edit?usp=sharing)

Project members: Julia Gschwend, Heloisa Mengisztki, Cristiano Singulani, Henrique Dante

Last verified run: 27/07/2023

--- 



# Test 2: Test variation with training set size 

Science question: 

_"How does the total runtime depend on the training set size for a machine-learning-based method (FlexZBoost)? Would a larger training set add any extra complexity that propagates to the estimation stage?"_

Apollo nodes: apl08, apl10, apl12, apl14 

Input data: DP0.2 Full (1935 pre-processed parquet files = 278,318,452 objects = 33GB) 

Six training set sizes were tested, all drawn as random samples: 10k, 50k, 100k, 500k, 1M, and 2.4M objects. The range was chosen arbitrarily, from 10k (the order of magnitude of the training set used in the DES science verification paper) up to 2.4M, which is the order of magnitude of the public spec-z sample currently available and, coincidentally, the approximate size of one tract in DP0.2.



In [None]:
import numpy as np
import pandas as pd
import tables_io
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
from datetime import datetime 
import time 

%matplotlib inline

In [None]:
apollo_dict = {'10.148.0.11' : 'apl01', 
                '10.148.0.12' : 'apl02', 
                '10.148.0.13' : 'apl03', 
                '10.148.0.14' : 'apl04', 
                '10.148.0.15' : 'apl05', 
                '10.148.0.16' : 'apl06', 
                '10.148.0.17' : 'apl07', 
                '10.148.0.18' : 'apl08', 
                '10.148.0.19' : 'apl09',                
                '10.148.0.27' : 'apl10', 
                '10.148.0.28' : 'apl11', 
                '10.148.0.29' : 'apl12', 
                '10.148.0.30' : 'apl13', 
                '10.148.0.31' : 'apl14', 
                '10.148.0.32' : 'apl15',
                '10.148.0.26' : 'apl16'} 

Read the results collected from the HTCondor log files and stored in CSV summary files:

In [None]:
# FlexZBoost
test_train_10k  = pd.read_csv('results/tests/test_train_10k.csv') 
test_train_50k  = pd.read_csv('results/tests/test_train_50k.csv')
test_train_100k = pd.read_csv('results/tests/test_train_100k.csv')
test_train_500k = pd.read_csv('results/tests/test_train_500k.csv')
test_train_1M   = pd.read_csv('results/tests/test_train_1M.csv')
test_train_2M   = pd.read_csv('results/tests/test_train_2.4M.csv')

In [None]:
test_train_100k.host.unique()

In [None]:
# Collect the IP addresses of the nodes to be excluded from the analysis
bad_hosts = []
for host, name in apollo_dict.items():
    if name in ("apl13", "apl14", "apl15"):
        bad_hosts.append(host)
        print(host)

Data cleaning: remove results generated by faulty machines (IP hosts above) to minimize bias. 

In [None]:
# Keep only the rows whose host is not in the exclusion list
query = 'host not in @bad_hosts'
test_train_10k.query(query, inplace=True)
test_train_50k.query(query, inplace=True)
test_train_100k.query(query, inplace=True)
test_train_500k.query(query, inplace=True)
test_train_1M.query(query, inplace=True)
test_train_2M.query(query, inplace=True)

In [None]:
test_train_10k.host.unique()
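
The host filtering step above can also be written with a boolean mask. The sketch below is only an illustration on a small made-up table: the `toy` dataframe, its values, and `toy_bad_hosts` are hypothetical and do not come from the real result files.

```python
import pandas as pd

# Hypothetical toy results table; the real CSV files have more columns.
toy = pd.DataFrame({
    "host": ["10.148.0.18", "10.148.0.30", "10.148.0.27", "10.148.0.31"],
    "chunks": [1000, 2000, 1500, 500],
})

# Hypothetical exclusion list (IP addresses of the nodes to drop).
toy_bad_hosts = ["10.148.0.30", "10.148.0.31", "10.148.0.32"]

# Keep only the rows whose host is NOT in the exclusion list.
clean = toy[~toy["host"].isin(toy_bad_hosts)].reset_index(drop=True)
print(clean)
```

Unlike `query(..., inplace=True)`, this form returns a new dataframe and leaves the original untouched, which can make it easier to compare the data before and after cleaning.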

Organize the test-result dataframes used in the analysis:

In [None]:
fzboost_runs = {
    'test_train_10k' : test_train_10k ,
    'test_train_50k' : test_train_50k ,
    'test_train_100k': test_train_100k,
    'test_train_500k': test_train_500k,
    'test_train_1M'  : test_train_1M,
    'test_train_2M'  : test_train_2M
}

In [None]:
for test, df in fzboost_runs.items():
    print(f'{test} ran on {len(df.host.unique())} nodes: ')
    print([apollo_dict[host] for host in df.host.unique()])
    print('---')

Compute the inverse speed (processing time per object, in milliseconds) and add it to each results dataframe as the `speed` column:

In [None]:
for results_df in fzboost_runs.values():
    results_df['speed'] = (results_df['time_diff']/results_df['chunks'])*1000.

In [None]:
fzboost_runs['test_train_10k'].head()
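
As a worked example of the unit conversion, the sketch below applies the same formula to two made-up files. It assumes, as the factor of 1000 above suggests, that `time_diff` is in seconds and `chunks` is the number of objects in the file; the numbers and the `ms_per_object` and `objects_per_second` column names are hypothetical.

```python
import pandas as pd

# Hypothetical per-file timings, for illustration only.
toy = pd.DataFrame({
    "time_diff": [12.0, 30.0],      # seconds spent on each file (assumed unit)
    "chunks": [100_000, 150_000],   # number of objects in each file (assumed meaning)
})

# Time per object in milliseconds: the quantity stored as 'speed' in the notebook.
toy["ms_per_object"] = toy["time_diff"] / toy["chunks"] * 1000.0

# The reciprocal view: throughput in objects per second.
toy["objects_per_second"] = toy["chunks"] / toy["time_diff"]

print(toy)
```

Lower values of time per object mean faster processing, so the two columns move in opposite directions.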

Build a dataframe with process summary info:

Function to recalculate the effective runtime, taking into account only the files processed by the good nodes:

In [None]:
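def calc_runtime(pz_results_dict, test_name):
    """Return (begin, end, runtime in seconds) for one test.

    The runtime spans from the earliest 'time_begin' to the latest
    'time_end' among the rows kept in the results dataframe.
    """
    str_begin = pz_results_dict[test_name]['time_begin'].min()
    str_end = pz_results_dict[test_name]['time_end'].max()
    t_begin = datetime.strptime(str_begin, '%Y-%m-%d %H:%M:%S')
    t_end = datetime.strptime(str_end, '%Y-%m-%d %H:%M:%S')
    dt = (t_end - t_begin)
    runtime = dt.total_seconds()
    return str_begin, str_end, runtime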

Example:

In [None]:
test = 'test_train_10k'
begin, end, runtime  = calc_runtime(fzboost_runs, test)
print(f'test {test} started at {begin}, finished at {end}, and took ~{round(runtime/60.)} minutes')

In [None]:
fzboost_info = {key: {} for key in fzboost_runs}
for test_name, results_df in fzboost_runs.items():
    # Names of the nodes that processed at least one file in this test
    used_hosts = results_df['host'].unique()
    hosts = [name for host, name in apollo_dict.items() if host in used_hosts]
    fzboost_info[test_name]['hosts'] = hosts
    fzboost_info[test_name]['n_cores'] = len(hosts) * 56  # 56 cores per node
    fzboost_info[test_name]['n_obj'] = np.sum(results_df['chunks'])
    begin, end, runtime = calc_runtime(fzboost_runs, test_name)
    fzboost_info[test_name]['time_begin'] = begin
    fzboost_info[test_name]['time_end'] = end
    fzboost_info[test_name]['runtime'] = runtime
    fzboost_info[test_name]['n_files'] = len(results_df)  # one row per file
    fzboost_info[test_name]['avg_speed'] = np.average(results_df['speed'])
    fzboost_info[test_name]['std_speed'] = np.std(results_df['speed'])
# One row per test, one column per summary quantity
fzboost_info = pd.DataFrame(fzboost_info).T

In [None]:
fzboost_info.index

In [None]:
fzboost_info

--- 
## Linear fit

In [None]:
test_results_table = pd.read_csv('results/PZ Compute Runs - Tests.csv')
# Rows 17 to 22 of the summary table are used as the training durations of the six runs
train_times = np.array(test_results_table.loc[17:22, 'Duration'])
train_times

In [None]:
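def calc_runtime_from_summary_table(list_of_string_times):
    """Convert 'HH:MM:SS' duration strings into an array of seconds.

    Only valid for durations shorter than 24 hours, because '%H' accepts 00-23.
    """
    times = []
    for t in list_of_string_times:
        # strptime fills in the date 1900-01-01, so subtracting it leaves the duration
        delta = datetime.strptime(t, '%H:%M:%S') - datetime(1900, 1, 1, 0, 0, 0)
        times.append(delta.total_seconds())
    return np.array(times)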
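
If any duration in the summary table reached 24 hours or more, the `strptime` approach would fail. A possible alternative, sketched here under the assumption that the `Duration` column always holds `HH:MM:SS` strings, is to let pandas parse the durations directly. The function name `durations_to_seconds` and the sample strings are hypothetical.

```python
import numpy as np
import pandas as pd

def durations_to_seconds(durations):
    """Convert 'HH:MM:SS' duration strings to seconds (hypothetical helper)."""
    # pd.to_timedelta parses the strings as time spans, not as clock times
    deltas = pd.to_timedelta(list(durations))
    return np.asarray(deltas.total_seconds(), dtype=float)

# Made-up durations for illustration; the last one is longer than a day.
example_seconds = durations_to_seconds(["00:01:30", "01:00:00", "25:10:05"])
print(example_seconds)
```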

In [None]:
len(fzboost_info)

In [None]:
fzboost_info.runtime

In [None]:
x = np.array([10_000, 50_000, 100_000, 500_000, 1_000_000, 2_437_615]) # train_set_size 
y_train =  calc_runtime_from_summary_table(train_times) 
y_estimate = np.array(fzboost_info.runtime) 

In [None]:
y_train

In [None]:
plt.figure(dpi=100)
plt.grid(True)
# Training set size in thousands of objects, runtimes in minutes
plt.plot(x/1000., y_train/60., 's-', label="pz inform (train)")
plt.plot(x/1000., y_estimate/60., 'o-', label="pz estimate")
plt.xlabel('training size (thousand objects)', fontsize=16)
plt.ylabel('total runtime (min)', fontsize=16)
plt.legend()
plt.tight_layout()
plt.savefig('test_train_size.png')
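
The cells above plot the runtimes but do not fit them. One way a linear fit of runtime against training set size could be done is sketched below with `np.polyfit`. The `sizes` and `runtimes` arrays are synthetic, generated from an exactly linear relation for illustration; they are not measurements from these tests. With the real data, `x` and `y_train` (or `y_estimate`) would take their place.

```python
import numpy as np

# Synthetic, exactly linear data (NOT measured values), for illustration only.
sizes = np.array([10_000, 50_000, 100_000, 500_000, 1_000_000], dtype=float)
runtimes = 120.0 + 0.002 * sizes  # seconds

# Least-squares fit of a straight line: runtime = slope * size + intercept
slope, intercept = np.polyfit(sizes, runtimes, deg=1)

# Coefficient of determination as a simple goodness-of-fit measure
predicted = slope * sizes + intercept
ss_res = np.sum((runtimes - predicted) ** 2)
ss_tot = np.sum((runtimes - runtimes.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print(f"slope = {slope:.6f} s/object, intercept = {intercept:.2f} s, R^2 = {r_squared:.4f}")
```

The slope gives the extra runtime per additional training object, and the intercept estimates the fixed overhead that does not depend on the training set size.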