<img align="left" src = https://linea.org.br/wp-content/themes/LIneA/imagens/logo-header.jpg width=130 style="padding: 20px"> 

# Photo-z Compute Scalability Tests
## Optimizing software infrastructure to compute photo-zs at the LSST scale: preparing for LSST DR1.

<br><br>

--- 
Main notebook: [PZ_Compute_Tests.ipynp](./PZ_Compute_Tests.ipynp)

Control spreadsheet: [PZ Compute Runs](https://docs.google.com/spreadsheets/d/1GKlDhLx7oXTjwBXoj8pzfrqnE7X-4nUW2sYDuY-tx94/edit?usp=sharing)

Project members: Julia Gschwend, Heloisa Mengisztki, Cristiano Singulani, Henrique Dante

Last verified run: 27/07/2023

--- 



# Test 5: Impact of removing unnecessary decimal places from the input data

Science question: 

_"Test 5: What is the impact on the pipeline execution speed of cleaning the input data by rounding the decimal places in magnitudes and their errors? Comparison between 2 cases: original data (15 decimal places) versus rounded data (4 decimal places). Does it change the photo-z results?"_

Apollo nodes: apl02, apl04, apl06, apl08, apl10, apl12, apl14

Input data: DP0.2 Full (1935 pre-processed parquet files = 278,318,452 objects = 33GB) 

Attempted for 4 cases: 

| Algorithm | process_id | Description | 
| --- | --- | --- | 
| FlexZBoost | fzboost_all_dec_cases_chunk_150k | Original data (15 decimal places) |
| FlexZBoost | fzboost_trunc4_chunk_150k        | Rounded magnitudes and errors (4 decimal places) |
| BPZ        | bpz_all_dec_cases_chunk_150k     | Original data (15 decimal places) |
| BPZ        | bpz_trunc4_chunk_150k            | Rounded magnitudes and errors (4 decimal places) |
 

Then repeated for FlexZBoost with 10 copies of the data:

| Algorithm | process_id | Description | 
| --- | --- | --- | 
| FlexZBoost | fzboost_10x_all_dec_cases_chunk_150k | Original data (15 decimal places) |
| FlexZBoost | fzboost_10x_trunc4_chunk_150k | Rounded magnitudes and errors (4 decimal places) |
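
The rounding step compared in this test can be sketched as follows. This is only an illustration: the column names (`mag_i`, `magerr_i`) and the values are hypothetical, and the actual pre-processing code is not part of this notebook.

```python
import pandas as pd

# Hypothetical catalog fragment; column names and values are illustrative only.
catalog = pd.DataFrame({
    "mag_i": [24.123456789012345, 22.987654321098765],
    "magerr_i": [0.012345678901234, 0.098765432109876],
})

# Round every magnitude and error column to 4 decimal places.
cols = ["mag_i", "magerr_i"]
rounded = catalog.copy()
rounded[cols] = rounded[cols].round(4)

print(rounded)
```

The original dataframe is left untouched, so both versions can be written out and processed separately.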




In [None]:
import numpy as np
import pandas as pd
import tables_io
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
from datetime import datetime 
import time 

%matplotlib inline

In [None]:
apollo_dict = {'10.148.0.11' : 'apl01', 
                '10.148.0.12' : 'apl02', 
                '10.148.0.13' : 'apl03', 
                '10.148.0.14' : 'apl04', 
                '10.148.0.15' : 'apl05', 
                '10.148.0.16' : 'apl06', 
                '10.148.0.17' : 'apl07', 
                '10.148.0.18' : 'apl08', 
                '10.148.0.19' : 'apl09',                
                '10.148.0.27' : 'apl10', 
                '10.148.0.28' : 'apl11', 
                '10.148.0.29' : 'apl12', 
                '10.148.0.30' : 'apl13', 
                '10.148.0.31' : 'apl14', 
                '10.148.0.32' : 'apl15',
                '10.148.0.26' : 'apl16'} 

Read results collected from htcondor log files and stored in CSV summary files: 

In [None]:
fzboost_all     = pd.read_csv('results/tests/test_hardware_t1.csv') 
fzboost_4       = pd.read_csv('results/tests/fzboost_trunc4_chunk_150k.csv')     
fzboost_all_10x = pd.read_csv('results/tests/fzboost_10x_all_dec_cases_chunk_150k.csv') 
fzboost_4_10x   = pd.read_csv('results/tests/fzboost_10x_trunc4_chunk_150k.csv')     
bpz_all = pd.read_csv('results/tests/bpz_all_dec_cases_chunk_150k.csv')      
bpz_4   = pd.read_csv('results/tests/bpz_trunc4_chunk_150k.csv')             

In [None]:
fzboost_all.host.unique()

In [None]:
bad_hosts = []
for host, name in apollo_dict.items(): 
    if name in ("apl13", "apl14", "apl15"): 
        bad_hosts.append(host) 
        print(host)

Data cleaning: remove results generated by faulty machines (IP hosts above) to minimize bias. 

In [None]:
query = 'host not in @bad_hosts'  # drop every row produced by a faulty host
fzboost_all.query(query, inplace=True) 
fzboost_4.query(query, inplace=True)  
fzboost_all_10x.query(query, inplace=True) 
fzboost_4_10x.query(query, inplace=True)   
bpz_all.query(query, inplace=True)     
bpz_4.query(query, inplace=True)  

In [None]:
fzboost_all.host.unique()

Organize the dataframes with the test results used in the analysis: 

In [None]:
fzboost_runs = {'fzboost_all': fzboost_all, 'fzboost_4': fzboost_4, 
                'fzboost_all_10x': fzboost_all_10x, 'fzboost_4_10x': fzboost_4_10x}
bpz_runs = {'bpz_all': bpz_all, 'bpz_4': bpz_4}

In [None]:
for test, df in fzboost_runs.items():
    print(f'{test} run in {len(df.host.unique())} nodes: ')
    print(np.sort([apollo_dict[host] for host in df.host.unique()]))
    print('---')

In [None]:
for test, df in bpz_runs.items():
    print(f'{test} run in {len(df.host.unique())} nodes: ')
    print(np.sort([apollo_dict[host] for host in df.host.unique()]))
    print('---')

Compute speed$^{-1}$ in milliseconds per object and add to each results dataframe: 

In [None]:
for results_df in fzboost_runs.values():
    results_df['speed'] = (results_df['time_diff']/results_df['chunks'])*1000.

In [None]:
for results_df in bpz_runs.values():
    results_df['speed'] = (results_df['time_diff']/results_df['chunks'])*1000.
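
As a worked example of the conversion above, here is a toy dataframe with made-up values. It assumes `time_diff` is the processing time of one file in seconds and `chunks` is the number of objects in that file, which is consistent with the later use of `chunks` as an object count.

```python
import pandas as pd

# Made-up rows: time_diff in seconds, chunks = number of objects in the file.
toy = pd.DataFrame({"time_diff": [300.0, 450.0], "chunks": [150_000, 150_000]})

# Seconds per object, converted to milliseconds per object.
toy["speed"] = toy["time_diff"] / toy["chunks"] * 1000.0

print(toy)
```

Under these assumptions, a lower value means faster processing, since the quantity is a time per object rather than a rate.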

In [None]:
fzboost_runs['fzboost_all'].head()

Build a dataframe with process summary info:

Function to recalculate the effective runtime, taking into account only the files processed by the good nodes: 

In [None]:
def calc_runtime(pz_results_dict, test_name):   
    str_begin = pz_results_dict[test_name]['time_begin'].min()
    str_end = pz_results_dict[test_name]['time_end'].max()
    t_begin = datetime.strptime(str_begin,'%Y-%m-%d %H:%M:%S')
    t_end = datetime.strptime(str_end,'%Y-%m-%d %H:%M:%S')
    dt = (t_end - t_begin)
    runtime = dt.total_seconds()
    return str_begin, str_end, runtime 

Example:

In [None]:
test = 'fzboost_all'
begin, end, runtime  = calc_runtime(fzboost_runs, test)
print(f'test {test} started at {begin}, finished at {end}, and took ~{round(runtime/60.)} minutes')

In [None]:
fzboost_info = {}
bpz_info = {}

for key in fzboost_runs.keys():
    fzboost_info[key] = {}
for key in bpz_runs.keys():
    bpz_info[key] = {}
    
for test_name, results_df in fzboost_runs.items():
    hosts = [] 
    for host, name in apollo_dict.items():
        if host in results_df['host'].unique():
            hosts.append(name)
    fzboost_info[test_name]['hosts'] = hosts
    fzboost_info[test_name]['n_cores'] = len(hosts) * 56 
    fzboost_info[test_name]['n_obj'] = np.sum(results_df['chunks'])
    begin, end, runtime  = calc_runtime(fzboost_runs, test_name)
    fzboost_info[test_name]['time_begin'] = begin
    fzboost_info[test_name]['time_end'] = end
    fzboost_info[test_name]['runtime'] = runtime
    fzboost_info[test_name]['n_files'] = len(results_df['host'])
    fzboost_info[test_name]['avg_speed'] = np.average(results_df['speed'])   
    fzboost_info[test_name]['std_speed'] = np.std(results_df['speed'])   
for test_name, results_df in bpz_runs.items():
    hosts = [] 
    for host, name in apollo_dict.items():
        if host in results_df['host'].unique():
            hosts.append(name)
    bpz_info[test_name]['hosts'] = hosts
    bpz_info[test_name]['n_cores'] = len(hosts) * 56 
    bpz_info[test_name]['n_obj'] = np.sum(results_df['chunks'])
    begin, end, runtime  = calc_runtime(bpz_runs, test_name)
    bpz_info[test_name]['time_begin'] = begin
    bpz_info[test_name]['time_end'] = end
    bpz_info[test_name]['runtime'] = runtime
    bpz_info[test_name]['n_files'] = len(results_df['host'])
    bpz_info[test_name]['avg_speed'] = np.average(results_df['speed'])   
    bpz_info[test_name]['std_speed'] = np.std(results_df['speed']) 
fzboost_info = pd.DataFrame(fzboost_info).T
bpz_info = pd.DataFrame(bpz_info).T

In [None]:
fzboost_info

--- 
## Speed distributions

In [None]:
fzboost_info.index

In [None]:
mean_fzb_all = np.mean(fzboost_all['speed'])
mean_fzb_4 = np.mean(fzboost_4['speed'])
mean_fzb_all_10x = np.mean(fzboost_all_10x['speed'])
mean_fzb_4_10x = np.mean(fzboost_4_10x['speed'])
mean_bpz_all = np.mean(bpz_all['speed'])
mean_bpz_4 = np.mean(bpz_4['speed'])

std_fzb_all     = np.std(fzboost_all['speed'])
std_fzb_4       = np.std(fzboost_4['speed'])
std_fzb_all_10x = np.std(fzboost_all_10x['speed'])
std_fzb_4_10x   = np.std(fzboost_4_10x['speed'])
std_bpz_all     = np.std(bpz_all['speed'])
std_bpz_4       = np.std(bpz_4['speed'])

In [None]:
plt.figure()#dpi=300)
plt.grid(True)
n0, bins0, patches0 = plt.hist(fzboost_all['speed'], histtype='step', lw=2, bins=30, label='15 dec cases', color='blue')
n1, bins1, patches1 = plt.hist(fzboost_4['speed'], histtype='step', lw=2, bins=30, label='4 dec cases', color='orange')
plt.vlines(mean_fzb_all, ymin=0, ymax=1.1*np.max(n0), lw=3, color="blue", label='mean')
plt.vlines(mean_fzb_4, ymin=0, ymax=1.1*np.max(n1), lw=3, color="orange", ls="--", label='mean')
ymax = np.max([1.1*np.max(n0),1.1*np.max(n1)])
plt.vlines(mean_fzb_all-std_fzb_all, ymin=0, ymax=ymax, lw=1, color="blue", label='std')
plt.vlines(mean_fzb_all+std_fzb_all, ymin=0, ymax=ymax, lw=1, color="blue")
plt.vlines(mean_fzb_4-std_fzb_4, ymin=0, ymax=ymax, lw=1, color="orange", label='std')
plt.vlines(mean_fzb_4+std_fzb_4, ymin=0, ymax=ymax, lw=1, color="orange")
plt.ylim(0,ymax)
plt.legend()
plt.xlabel('speed$^{-1}$ (ms/obj)')
plt.ylabel('frequency')
plt.tight_layout()
#plt.savefig('fzboost_trunc4.png')

In [None]:
plt.figure(dpi=300)
plt.grid(True)
plt.title('BPZ - test round decimal cases')
n0, bins0, patches0 = plt.hist(bpz_all['speed'], histtype='step', lw=2, bins=30, label='15  dec cases', color='blue')
n1, bins1, patches1 = plt.hist(bpz_4['speed'], histtype='step', lw=2, bins=30, label='4  dec cases', color='orange')
ymax = np.max([1.1*np.max(n0),1.1*np.max(n1)])
plt.vlines(mean_bpz_all, ymin=0, ymax=ymax, lw=3, color="blue", 
           label=f'mean: {round(mean_bpz_all, 3)}')
plt.vlines(mean_bpz_4, ymin=0, ymax=ymax, lw=3, color="orange", ls="--", 
            label=f'mean: {round(mean_bpz_4, 3)}')
plt.vlines(mean_bpz_all-std_bpz_all, ymin=0, ymax=ymax, lw=1, color="blue", 
           label=f'std: {round(std_bpz_all, 3)}')
plt.vlines(mean_bpz_all+std_bpz_all, ymin=0, ymax=ymax, lw=1, color="blue")
plt.vlines(mean_bpz_4-std_bpz_4, ymin=0, ymax=ymax, lw=1, color="orange", 
           label=f'std: {round(std_bpz_4, 3)}')
plt.vlines(mean_bpz_4+std_bpz_4, ymin=0, ymax=ymax, lw=1, color="orange")
plt.ylim(0,ymax)
plt.legend()
plt.xlabel('speed$^{-1}$ (ms/obj)')
plt.ylabel('frequency')
plt.tight_layout()
plt.savefig('bpz_trunc4.png')

In [None]:
plt.figure(dpi=300)
plt.grid(True)
plt.title('FlexZBoost - test round decimal cases')
n0, bins0, patches0 = plt.hist(fzboost_all_10x['speed'], histtype='step', lw=2, bins=30, label='15 dec cases', color='blue')
n1, bins1, patches1 = plt.hist(fzboost_4_10x['speed'], histtype='step', lw=2, bins=30, label='4  dec cases', color='orange')
ymax = np.max([1.1*np.max(n0),1.1*np.max(n1)])
plt.vlines(mean_fzb_all_10x, ymin=0, ymax=ymax, lw=3, color="blue", 
           label=f'mean: {round(mean_fzb_all_10x, 3)}')
plt.vlines(mean_fzb_4_10x, ymin=0, ymax=ymax, lw=3, color="orange", ls="--", 
            label=f'mean: {round(mean_fzb_4_10x, 3)}')
plt.vlines(mean_fzb_all_10x-std_fzb_all_10x, ymin=0, ymax=ymax, lw=1, color="blue", 
           label=f'std: {round(std_fzb_all_10x, 3)}')
plt.vlines(mean_fzb_all_10x+std_fzb_all_10x, ymin=0, ymax=ymax, lw=1, color="blue")
plt.vlines(mean_fzb_4_10x-std_fzb_4_10x, ymin=0, ymax=ymax, lw=1, color="orange",
                      label=f'std: {round(std_fzb_4_10x, 3)}')
plt.vlines(mean_fzb_4_10x+std_fzb_4_10x, ymin=0, ymax=ymax, lw=1, color="orange")
plt.ylim(0,ymax)
plt.legend()
plt.xlabel('speed$^{-1}$ (ms/obj)')
plt.ylabel('frequency')
plt.tight_layout()
plt.savefig('fzboost_trunc4_10x.png')

In [None]:
fzboost_info.n_cores

In [None]:
TOTAL_CORES = max(fzboost_info.n_cores)
TOTAL_CORES

In [None]:
#weight_fzb = 
np.array(fzboost_info.n_cores) // np.ones(len(fzboost_info))*float(TOTAL_CORES)
#weight_fzb

In [None]:
# weight_fzb = np.ones(len(fzboost_info))*float(TOTAL_CORES) / np.array(fzboost_info.n_cores)
# x_fzb = np.array(fzboost_info.n_obj)
# y_fzb = np.array(fzboost_info.runtime) * weight_fzb

In [None]:
res_all_10x = stats.bootstrap([fzboost_all_10x['speed']], np.mean, confidence_level=0.95)
res_4_10x = stats.bootstrap([fzboost_4_10x['speed']], np.mean, confidence_level=0.95)

In [None]:
res_all_10x# .confidence_interval
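
To make explicit what the bootstrap above estimates, here is a hand-written sketch on synthetic data. It uses the simple percentile interval, whereas `scipy.stats.bootstrap` defaults to the BCa method, so the numbers would not match exactly. The sample below is randomly generated and is not a real measurement.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for a 'speed' column (ms/obj); not real measurements.
speed = rng.normal(loc=2.2, scale=0.3, size=500)

n_resamples = 9999  # same count as the scipy.stats.bootstrap default

# Draw resamples with replacement and record the mean of each one.
idx = rng.integers(0, speed.size, size=(n_resamples, speed.size))
boot_means = speed[idx].mean(axis=1)

# Percentile 95% confidence interval for the mean.
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(ci_low, ci_high)
```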

As the two distributions are approximately normal, let's use a t-test to check whether their means are significantly different. 

- Null hypothesis ($H_0$): the two distributions have the same mean
- Alternative hypothesis ($H_1$): the two distributions have different means

Assumptions:
- The samples are independent
- The data follow a normal distribution
- The samples have similar variances (homogeneity assumption)
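
The normality assumption listed above could be checked with a Shapiro-Wilk test, sketched here on synthetic data. This check is an addition for illustration and is not one of the steps run in this notebook.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-in for a 'speed' column; not real measurements.
sample = rng.normal(loc=2.2, scale=0.3, size=300)

# H0 of the Shapiro-Wilk test: the sample was drawn from a normal distribution.
statistic, pvalue = stats.shapiro(sample)
print(statistic, pvalue)
```

A p-value above the chosen significance level means normality cannot be rejected. For very large samples the test tends to flag even small deviations, so a histogram or Q-Q plot is a useful complement.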

Check the homogeneity assumption:

In [None]:
np.var(fzboost_all_10x['speed']), np.var(fzboost_4_10x['speed'])

_"If the ratio of the larger data groups to the small data group is less than 4:1 then we can consider that the given data groups have equal variance."_ 

In [None]:
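var_all, var_4 = np.var(fzboost_all_10x['speed']), np.var(fzboost_4_10x['speed'])
max(var_all, var_4) / min(var_all, var_4)  # larger variance over the smaller one, as in the rule above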

ok!

or, use [Levene test](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.levene.html):

In [None]:
stats.levene(fzboost_all_10x['speed'], fzboost_4_10x['speed'])

Levene test's p-value > 0.05 $\Rightarrow$ the variances are not significantly different. 

In [None]:
# Perform the two sample t-test with equal variances
test_result = stats.ttest_ind(a=fzboost_all_10x['speed'], b=fzboost_4_10x['speed'])
test_result

In [None]:
test_result.pvalue
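
A minimal sketch of how the p-value is read against a significance level, using two synthetic samples drawn from the same distribution. The data and the variable names `alpha` and `verdict` are illustrative only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Two synthetic samples with the same mean and variance; not real measurements.
a = rng.normal(2.2, 0.3, 400)
b = rng.normal(2.2, 0.3, 400)

alpha = 0.05
result = stats.ttest_ind(a, b)  # equal_var=True by default (Student's t-test)

# A small p-value is evidence against H0 (equal means).
if result.pvalue < alpha:
    verdict = "reject H0"
else:
    verdict = "fail to reject H0"
print(result.pvalue, verdict)
```

Failing to reject $H_0$ does not prove the means are equal. It only indicates that the data give no significant evidence of a difference.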

In [None]:
res_all_10x.confidence_interval[0]

In [None]:
plt.figure(figsize=[8,4], dpi=300)
#fig, ax = plt.subplots()
plt.hist(res_all_10x.bootstrap_distribution, bins=25, label='15 dec cases', color='#1f77b4', alpha=0.8)#, histtype='step')
plt.hist(res_4_10x.bootstrap_distribution, bins=25, label='4 dec cases', color='orange', alpha=0.5)
plt.vlines(np.mean(res_all_10x.bootstrap_distribution), ymin=0, ymax=1800, color='#1f77b4', lw=2,
           label=f'mean: {round(np.mean(res_all_10x.bootstrap_distribution), 4)}')
plt.vlines(np.mean(res_4_10x.bootstrap_distribution), ymin=0, ymax=1800, color='orange', lw=2, 
           label=f'mean: {round(np.mean(res_4_10x.bootstrap_distribution), 4)}')
plt.vlines(res_all_10x.confidence_interval[0], ymin=0, ymax=2000, ls='--', color='#1f77b4', label='95% confidence \n intervals')
plt.vlines(res_all_10x.confidence_interval[1], ymin=0, ymax=2000, ls='--', color='#1f77b4')
plt.vlines(res_4_10x.confidence_interval[0], ymin=0, ymax=2000, ls='--', color='orange', label=' ')
plt.vlines(res_4_10x.confidence_interval[1], ymin=0, ymax=2000, ls='--', color='orange')
plt.title('FlexZBoost - Bootstrap resampling 9999x')
plt.plot([-1], [-1], ',', label=f' t-test \n p-value={round(test_result.pvalue, 8)}')#, fontsize=12)
plt.xlabel('speed mean (ms/obj)')
plt.ylabel('frequency')
plt.xlim(1.9, 2.6)
plt.ylim(0,1800)
plt.legend()
plt.tight_layout()
plt.savefig('trunc4_10x_bootstrap.png')

---


## BPZ

In [None]:
res_all_bpz = stats.bootstrap([bpz_all['speed']], np.mean, confidence_level=0.95)
res_4_bpz = stats.bootstrap([bpz_4['speed']], np.mean, confidence_level=0.95)

In [None]:
# Perform the two sample t-test with equal variances
test_result_bpz = stats.ttest_ind(a=bpz_all['speed'], b=bpz_4['speed'])
test_result_bpz

In [None]:
plt.figure(figsize=[8,4], dpi=300)
#fig, ax = plt.subplots()
plt.hist(res_all_bpz.bootstrap_distribution, bins=25, label='15 dec cases', color='#1f77b4', alpha=0.8)#, histtype='step')
plt.hist(res_4_bpz.bootstrap_distribution, bins=25, label='4 dec cases', color='orange', alpha=0.5)
plt.vlines(np.mean(res_all_bpz.bootstrap_distribution), ymin=0, ymax=1800, color='#1f77b4', lw=2,
           label=f'mean: {round(np.mean(res_all_bpz.bootstrap_distribution), 4)}')
plt.vlines(np.mean(res_4_bpz.bootstrap_distribution), ymin=0, ymax=1800, color='orange', lw=2, 
           label=f'mean: {round(np.mean(res_4_bpz.bootstrap_distribution), 4)}')
plt.vlines(res_all_bpz.confidence_interval[0], ymin=0, ymax=1800, ls='--', color='#1f77b4', 
           label='95% confidence \n intervals')
plt.vlines(res_all_bpz.confidence_interval[1], ymin=0, ymax=1800, ls='--', color='#1f77b4')
plt.vlines(res_4_bpz.confidence_interval[0], ymin=0, ymax=1800, ls='--', color='orange', label=' ')
plt.vlines(res_4_bpz.confidence_interval[1], ymin=0, ymax=1800, ls='--', color='orange')
plt.title('BPZ - Bootstrap resampling 9999x')
#plt.text(2.69, 200, f' t-test \n p-value={round(test_result.pvalue, 6)}', fontsize=12)
plt.plot([-1], [-1], ',', label=f' t-test \n p-value={round(test_result_bpz.pvalue, 8)}')#, fontsize=12)

plt.xlabel('speed mean (ms/obj)')
plt.ylabel('frequency')
plt.xlim(1.55, 1.61)
plt.ylim(0,1200)
plt.legend()
plt.tight_layout()
plt.savefig('trunc4_bpz_bootstrap.png')

Here, since the p-value (~0.89) is greater than $\alpha$ = 0.05, we cannot reject the null hypothesis of the test. We do not have sufficient evidence to say that the mean speed with the rounded data (4 decimal places) differs from the mean speed with the original data (15 decimal places). 


In [None]:
np.mean(res_all_10x.bootstrap_distribution)

In [None]:
np.mean(res_4_10x.bootstrap_distribution)

In [None]:
(np.mean(res_all_10x.bootstrap_distribution) - np.mean(res_4_10x.bootstrap_distribution))/np.mean(res_all_10x.bootstrap_distribution)

In [None]:
(np.mean(res_all_bpz.bootstrap_distribution) - np.mean(res_4_bpz.bootstrap_distribution))/np.mean(res_all_bpz.bootstrap_distribution)
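
The last two cells compute the relative change in mean speed$^{-1}$ between the two cases. Written out, with $\bar{s}_{15}$ and $\bar{s}_{4}$ as notation introduced here for the bootstrap means of the original and rounded runs:

$$\delta = \frac{\bar{s}_{15} - \bar{s}_{4}}{\bar{s}_{15}}$$

Because the quantity is measured in ms per object, a positive $\delta$ means the run on rounded data took less time per object.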