<img align="left" src = https://linea.org.br/wp-content/themes/LIneA/imagens/logo-header.jpg width=130 style="padding: 20px"> 

# Photo-z Compute Scalability Tests
## Optimizing software infrastructure to compute photo-zs at LSST scale: preparing for LSST DR1.

<br><br>

--- 
Main notebook: [PZ_Compute_Tests.ipynp](./PZ_Compute_Tests.ipynp)

Control spreadsheet: [PZ Compute Runs](https://docs.google.com/spreadsheets/d/1GKlDhLx7oXTjwBXoj8pzfrqnE7X-4nUW2sYDuY-tx94/edit?usp=sharing)

Project members: Julia Gschwend, Heloisa Mengisztki, Cristiano Singulani, Henrique Dante

Last verified run: 27/07/2023

--- 



# Test 1: Test linearity of the relationship between the total time and the data size

To verify that the total time depends linearly on the dataset size, as expected, we estimated photo-zs for different subsets of DP0.2 and for 2 copies of the full dataset, all on the same infrastructure. For this test, we used the odd-numbered nodes of the Apollo cluster (apl01, apl03, apl05, apl07, apl09, apl11, apl13, apl15).  

Afterwards, we combined these with the results of similar tests done for other purposes. They are valid for this analysis because they were performed under similar conditions, although with different samples. We applied weights to compensate for differences in the hardware used, so that the results can be compared.   

Summary of runs planned for the tests: 

|Sample description | Pre-processed files | Number of rows | FlexZBoost runtime (h:mm:ss) | BPZ runtime (h:mm:ss) | 
|:--|:-:|:-:|:-:|:-:|
|first 50 original files  |  620  (11 GB) |  88,895,872 | 0:11:07 | 0:07:52 |  
|first 100 original files | 1244  (21 GB) | 178,386,176 | 0:24:47 | 0:14:48 |  
|first 150 original files | 1855  (31 GB) | 266,835,897 | 0:30:26 | 0:26:24 |  
|2x original files (314 files) | 3870  (66 GB) | 556,636,904 | 0:57:25 | 0:42:52 |  
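
As a quick sanity check on the table above, the sketch below converts each runtime to seconds and derives the throughput in rows per second. The numbers are copied from the table; the helper `to_seconds` and the `runs` dictionary are illustrative names, not part of the notebook's own code.

```python
# Values copied from the summary table: (rows, FlexZBoost runtime, BPZ runtime)
runs = {
    "first 50 files": (88_895_872, "0:11:07", "0:07:52"),
    "first 100 files": (178_386_176, "0:24:47", "0:14:48"),
    "first 150 files": (266_835_897, "0:30:26", "0:26:24"),
    "2x original files": (556_636_904, "0:57:25", "0:42:52"),
}


def to_seconds(hms):
    """Convert an 'H:MM:SS' string to seconds (hypothetical helper)."""
    h, m, s = (int(part) for part in hms.split(":"))
    return 3600 * h + 60 * m + s


throughput = {}
for label, (rows, fzb, bpz) in runs.items():
    # Rows processed per second of wall-clock time, for each algorithm
    throughput[label] = (rows / to_seconds(fzb), rows / to_seconds(bpz))
    fzb_rate, bpz_rate = throughput[label]
    print(f"{label}: FlexZBoost {fzb_rate:,.0f} rows/s, BPZ {bpz_rate:,.0f} rows/s")
```

If the runtime grows roughly linearly with the dataset size and the fixed overhead is small, these throughputs should stay approximately constant from one row to the next.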




In [None]:
import numpy as np
import pandas as pd
import tables_io
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
from datetime import datetime 
import time 

%matplotlib inline

In [None]:
apollo_dict = {'10.148.0.11' : 'apl01', 
                '10.148.0.12' : 'apl02', 
                '10.148.0.13' : 'apl03', 
                '10.148.0.14' : 'apl04', 
                '10.148.0.15' : 'apl05', 
                '10.148.0.16' : 'apl06', 
                '10.148.0.17' : 'apl07', 
                '10.148.0.18' : 'apl08', 
                '10.148.0.19' : 'apl09',                
                '10.148.0.27' : 'apl10', 
                '10.148.0.28' : 'apl11', 
                '10.148.0.29' : 'apl12', 
                '10.148.0.30' : 'apl13', 
                '10.148.0.31' : 'apl14', 
                '10.148.0.32' : 'apl15',
                '10.148.0.26' : 'apl16'} 

Read results collected from htcondor log files and stored in CSV summary files: 

In [None]:
# FlexZBoost
df_fzb_50 = pd.read_csv('results/tests/test_fzboost_50_files.csv') # 620 pre-processed files 
df_fzb_100 = pd.read_csv('results/tests/test_fzboost_100_files.csv') # 1244 pre-processed files
df_fzb_150 = pd.read_csv('results/tests/test_fzboost_150_files.csv') # 1855 pre-processed files
df_fzb_1x = pd.read_csv('results/tests/test_hardware_t1.csv') # 1x DP0.2 = 1935 pre-processed files 
df_fzb_2x = pd.read_csv('results/tests/test_fzboost_2x_files.csv') #314 original files = 3870 pre-processed files
df_fzb_10x = pd.read_csv('results/tests/fzboost_10x_all_dec_cases_chunk_150k.csv') # 10x DP0.2 = 19350 pre-processed files
# 10x failed (incomplete), apl01 and apl02 died 
#df_fzb_10B = pd.read_csv('results/tests/henrique-log-10B-flexzboost.csv') # Henrique's results w/ huge variance (bug?)
df_fzb_10B = pd.read_csv('results/tests/henrique-log-10B-flexzboost-2.csv') # another set of Henrique's results
# BPZ
df_bpz_50 = pd.read_csv('results/tests/test_bpz_50_files.csv')
df_bpz_100 = pd.read_csv('results/tests/test_bpz_100_files.csv')
df_bpz_150 = pd.read_csv('results/tests/test_bpz_150_files.csv')
df_bpz_1x = pd.read_csv('results/tests/bpz_all_dec_cases_chunk_150k.csv')
df_bpz_2x = pd.read_csv('results/tests/test_bpz_2x_files.csv')
df_bpz_10B = pd.read_csv('results/tests/henrique-log-10B-bpz.csv') # Henrique's results

In [None]:
df_bpz_10B.host.unique()

In [None]:
# Collect the IPs of the nodes flagged as faulty (apl13, apl14, apl15)
bad_hosts = []
for host, name in apollo_dict.items(): 
    if name in ("apl13", "apl14", "apl15"): 
        bad_hosts.append(host) 
        print(host)

Data cleaning: remove results generated by faulty machines (IP hosts above) to minimize bias. 

In [None]:
query = f'host != "{bad_hosts[0]}" & host != "{bad_hosts[1]}" & host != "{bad_hosts[2]}" '  
# FlexZBoost
df_fzb_50.query(query, inplace=True)            
df_fzb_100.query(query, inplace=True)            
df_fzb_150.query(query, inplace=True)            
df_fzb_1x.query(query, inplace=True)            
df_fzb_2x.query(query, inplace=True)            
df_fzb_10x.query(query, inplace=True)            
#df_fzb_10B.query(query, inplace=True)            
# BPZ
df_bpz_50.query(query, inplace=True)              
df_bpz_100.query(query, inplace=True)              
df_bpz_150.query(query, inplace=True)              
df_bpz_1x.query(query, inplace=True)              
df_bpz_2x.query(query, inplace=True)  
#df_bpz_10B.query(query, inplace=True)  
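
The same cleaning step can also be written with a boolean mask instead of a query string. The sketch below is only an illustration on a toy dataframe: the `time_diff` values are made up, and only the `host` column name and the three faulty IPs follow the notebook.

```python
import pandas as pd

# Toy results table; the time_diff values are invented for illustration.
df = pd.DataFrame({
    "host": ["10.148.0.11", "10.148.0.30", "10.148.0.31", "10.148.0.13"],
    "time_diff": [12.0, 40.0, 38.0, 11.5],
})
bad_hosts = ["10.148.0.30", "10.148.0.31", "10.148.0.32"]

# Keep only the rows whose host is NOT in the list of faulty machines.
clean = df[~df["host"].isin(bad_hosts)].reset_index(drop=True)
print(clean)
```

Using `isin` avoids indexing `bad_hosts` by position, so the filter keeps working if the list of faulty nodes grows or shrinks.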

In [None]:
df_bpz_50.host.unique()

Organize dataframes from the tests results used in the analysis: 

In [None]:
fzboost_runs = {'fzboost 50 files' : df_fzb_50, 
                'fzboost 100 files': df_fzb_100, 
                'fzboost 150 files': df_fzb_150, 
                'fzboost 1x files' : df_fzb_1x, # double check (are configs compatible?)  
                'fzboost 2x files' : df_fzb_2x,
                #'fzboost 10x files': df_fzb_10x, # 10x failed, apl01 and apl02 died 
                'fzboost 10B obj'  : df_fzb_10B} 
bpz_runs     = {'bpz 50 files' : df_bpz_50, 
                'bpz 100 files': df_bpz_100, 
                'bpz 150 files': df_bpz_150, 
                'bpz 1x files' : df_bpz_1x, # double check (are configs compatible?) 
                'bpz 2x files' : df_bpz_2x,
                'bpz 10B obj'  : df_bpz_10B} 

In [None]:
for test, df in fzboost_runs.items():
    print(f'{test} run in {len(df.host.unique())} nodes: ')
    print([apollo_dict[host] for host in np.sort(df.host.unique())])
    print('---')

In [None]:
for test, df in bpz_runs.items():
    print(f'{test} run in {len(df.host.unique())} nodes: ')
    print([apollo_dict[host] for host in np.sort(df.host.unique())])
    print('---')

Compute speed$^{-1}$ in milliseconds per object and add to each results dataframe: 

In [None]:
for results_df in fzboost_runs.values():
    results_df['speed'] = (results_df['time_diff']/results_df['chunks'])*1000.
for results_df in bpz_runs.values():
    results_df['speed'] = (results_df['time_diff']/results_df['chunks'])*1000.
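
A small worked example of this conversion, assuming (as the rest of the notebook suggests) that `time_diff` is in seconds and `chunks` is the number of objects in each processed file. The toy values are made up.

```python
import pandas as pd

# Toy data: two files of 150k objects each, processed in 15 s and 30 s.
toy = pd.DataFrame({
    "time_diff": [15.0, 30.0],     # seconds (assumed unit)
    "chunks": [150_000, 150_000],  # objects per file (assumed meaning)
})

# Seconds per object, times 1000, gives milliseconds per object.
toy["speed"] = toy["time_diff"] / toy["chunks"] * 1000.0
print(toy)
```

Here the two files come out at 0.1 and 0.2 ms per object. Since the quantity is time per object, lower values mean faster processing.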

In [None]:
bpz_runs['bpz 2x files'].head()

Build a dataframe with process summary info:

Function to recalculate the effective runtime, taking into account only the files processed by the good nodes: 

In [None]:
def calc_runtime(pz_results_dict, test_name):   
    str_begin = pz_results_dict[test_name]['time_begin'].min()
    str_end = pz_results_dict[test_name]['time_end'].max()
    t_begin = datetime.strptime(str_begin,'%Y-%m-%d %H:%M:%S')
    t_end = datetime.strptime(str_end,'%Y-%m-%d %H:%M:%S')
    dt = (t_end - t_begin)
    runtime = dt.total_seconds()
    return str_begin, str_end, runtime 

Example:

In [None]:
test = 'fzboost 2x files'
begin, end, runtime  = calc_runtime(fzboost_runs, test)
print(f'test {test} started at {begin}, finished at {end}, and took ~{round(runtime/60.)} minutes')

In [None]:
fzboost_info = {}
bpz_info = {}

for key in fzboost_runs.keys():
    fzboost_info[key] = {}
for key in bpz_runs.keys():
    bpz_info[key] = {}
    
for test_name, results_df in fzboost_runs.items():
    hosts = [] 
    for host, name in apollo_dict.items():
        if host in results_df['host'].unique():
            hosts.append(name)
    fzboost_info[test_name]['hosts'] = hosts
    fzboost_info[test_name]['n_cores'] = len(hosts) * 56 
    fzboost_info[test_name]['n_obj'] = np.sum(results_df['chunks'])
    begin, end, runtime  = calc_runtime(fzboost_runs, test_name)
    fzboost_info[test_name]['time_begin'] = begin
    fzboost_info[test_name]['time_end'] = end
    fzboost_info[test_name]['runtime'] = runtime
    fzboost_info[test_name]['n_files'] = len(results_df['host'])
    fzboost_info[test_name]['avg_speed'] = np.average(results_df['speed'])   
    fzboost_info[test_name]['std_speed'] = np.std(results_df['speed'])   
for test_name, results_df in bpz_runs.items():
    hosts = [] 
    for host, name in apollo_dict.items():
        if host in results_df['host'].unique():
            hosts.append(name)
    bpz_info[test_name]['hosts'] = hosts
    bpz_info[test_name]['n_cores'] = len(hosts) * 56 
    bpz_info[test_name]['n_obj'] = np.sum(results_df['chunks'])
    begin, end, runtime  = calc_runtime(bpz_runs, test_name)
    bpz_info[test_name]['time_begin'] = begin
    bpz_info[test_name]['time_end'] = end
    bpz_info[test_name]['runtime'] = runtime
    bpz_info[test_name]['n_files'] = len(results_df['host'])
    bpz_info[test_name]['avg_speed'] = np.average(results_df['speed'])   
    bpz_info[test_name]['std_speed'] = np.std(results_df['speed']) 
fzboost_info = pd.DataFrame(fzboost_info).T
bpz_info = pd.DataFrame(bpz_info).T

In [None]:
fzboost_info.index

In [None]:
fzboost_info

In [None]:
bpz_info

In [None]:
TOTAL_CORES = 16*56
TOTAL_CORES

--- 
## Linear fit

In [None]:
len(fzboost_info)

In [None]:
weight_fzb = np.ones(len(fzboost_info))*float(TOTAL_CORES) / np.array(fzboost_info.n_cores)
x_fzb = np.array(fzboost_info.n_obj)
y_fzb = np.array(fzboost_info.runtime) * weight_fzb

In [None]:
weight_bpz = np.ones(len(bpz_info))*float(TOTAL_CORES) / np.array(bpz_info.n_cores)
x_bpz = np.array(bpz_info.n_obj)
y_bpz = np.array(bpz_info.runtime) * weight_bpz

In [None]:
plt.figure(dpi=100)
plt.plot(x_fzb/1000000., y_fzb/60., 'o', label="FlexZBoost")
plt.plot(x_bpz/1000000., y_bpz/60., '^', label="BPZ")
plt.xlabel('dataset size (million objects)')
plt.ylabel('total runtime (min)')
plt.legend(loc = "upper left")
plt.tight_layout()

In [None]:
fzboost_runtime_error = []
bpz_runtime_error = []
for test in fzboost_runs.keys():
    fzboost_runtime_error.append(np.std(fzboost_runs[test]['time_diff'])/60.)
    #fzboost_runtime_error.append(stats.bootstrap([fzboost_runs[test]['time_diff']], np.std).standard_error)
    #print(stats.bootstrap([fzboost_fast_runs[test]['time_diff']], np.std))
for test in bpz_runs.keys():
    bpz_runtime_error.append(np.std(bpz_runs[test]['time_diff'])/60.)
    #bpz_runtime_error.append(stats.bootstrap([bpz_runs[test]['time_diff']], np.std).standard_error)
    

In [None]:
fzboost_runtime_error, bpz_runtime_error

In [None]:
plt.figure(dpi=100)
plt.errorbar(x_fzb/1000000., y_fzb/60., yerr=fzboost_runtime_error, marker='o', ls='', label="FlexZBoost")
plt.errorbar(x_bpz/1000000., y_bpz/60., yerr=bpz_runtime_error, marker='^', ls='', label="BPZ")
plt.xlabel('dataset size (million objects)')
plt.ylabel('total runtime (min)')
plt.legend()
plt.tight_layout()

In [None]:
plt.figure(dpi=100)
plt.grid(True)
x = list(x_fzb/1000000.)
y = list(y_fzb/60.)
coef = np.polyfit(x,y,1)
a, b = coef
plt.text(1000,110,f'$y = {round(a,2)}x+{round(b,2)}$', color='#1f77b4', fontsize=12)
poly1d_fn = np.poly1d(coef) 
# poly1d_fn is now a function which takes in x and returns an estimate for y
plt.plot(x,y, 'o', color='#1f77b4', label='FlexZBoost') # measured runtimes (circle markers)
plt.plot(x, poly1d_fn(x), '-', color='#1f77b4') # fitted line
x = list(x_bpz/1000000.)
y = list(y_bpz/60.)
coef = np.polyfit(x,y,1)
a, b = coef
plt.text(1000,50,f'$y = {round(a,2)}x+{round(b,2)}$', color='orange', fontsize=12)
poly1d_fn = np.poly1d(coef) 
plt.plot(x,y, '^', color='orange', label='BPZ')
plt.plot(x, poly1d_fn(x), '-', color='orange')
plt.xlabel('dataset size (million objects)')
plt.ylabel('total runtime (min)')
plt.xlim(0,)
plt.ylim(0,)
plt.legend()
plt.tight_layout()
plt.savefig('linear_fit.png')
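
This notebook fits the same straight line twice: with `np.polyfit` here and with `scipy.stats.linregress` in the correlation test further down. Both are ordinary least-squares fits, so they should return the same slope and intercept. The sketch below checks this on synthetic data; the sizes, slope, and noise level are invented for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic data: runtime (min) vs. dataset size (million objects); made up.
rng = np.random.default_rng(0)
size = np.array([90.0, 180.0, 270.0, 280.0, 560.0])
runtime = 0.1 * size + 2.0 + rng.normal(0.0, 0.5, size.size)

# Degree-1 polynomial fit: coefficients come back highest order first.
slope_pf, intercept_pf = np.polyfit(size, runtime, 1)

# Same least-squares fit, which also returns r, the p-value, and standard errors.
fit = stats.linregress(size, runtime)

print(f"polyfit:    y = {slope_pf:.4f} x + {intercept_pf:.4f}")
print(f"linregress: y = {fit.slope:.4f} x + {fit.intercept:.4f}")
```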

In [None]:
plt.figure(figsize=[18, 4],dpi=100)
x = list(x_fzb)
y = list(y_fzb)
coef = np.polyfit(x,y,1)
a, b = coef
poly1d_fn = np.poly1d(coef) 
# poly1d_fn is now a function which takes in x and returns an estimate for y
plt.plot(x,y, 'o', color='#1f77b4', label='FlexZBoost') # measured runtimes (circle markers)
plt.plot(x+[40000000000], poly1d_fn(x+[40000000000]), '-', color='#1f77b4') # fitted line extended to 40B objects
plt.plot(40000000000, poly1d_fn(40000000000), 'sk') # extrapolated point (black square)
x = list(x_bpz)
y = list(y_bpz)
coef = np.polyfit(x,y,1)
a, b = coef
poly1d_fn = np.poly1d(coef) 
plt.plot(x,y, '^', color='orange', label='BPZ')
plt.plot(x, poly1d_fn(x), '-', color='orange')
plt.xlabel('dataset size (objects)')
plt.ylabel('total runtime (s)')
plt.xlim(0,)
plt.ylim(0,)
plt.legend()
plt.tight_layout()

In [None]:
x = list(x_fzb)
y = list(y_fzb)
a, b = np.polyfit(x,y,1)
poly1d_fn = np.poly1d((a,b)) 
dr11 = poly1d_fn(40000000000)
print(f'FlexZBoost: 40B objects in {round(dr11/3600., 1)} hours') 

In [None]:
bpz_fn = np.poly1d(np.polyfit(list(x_bpz), list(y_bpz), 1))  # runtime_predict() is only defined further down
print(f'BPZ: 40B objects in {round(bpz_fn(40000000000)/3600., 1)} hours') 

# Correlation Test


$$ X = \text{dataset size} $$  
$$ Y = \text{total time} $$

In [None]:
x_fzb, y_fzb

In [None]:
result_fzboost = stats.linregress(x_fzb.astype('float64'), y_fzb.astype('float64'))
result_bpz = stats.linregress(x_bpz.astype('float64'), y_bpz.astype('float64'))

In [None]:
result_fzboost

In [None]:
result_bpz

Interpretation of the correlation coefficient (r)

The correlation coefficient (r) ranges from -1 to 1:

- $r = -1$ indicates a perfect negative linear relationship.
- $r = 0$ indicates no linear relationship.
- $r = 1$ indicates a perfect positive linear relationship.

We calculate the sample correlation coefficient ($r$) as an estimate of the population correlation ($\rho$).  

In [None]:
print(f'FlexZBoost: r = {round(result_fzboost.rvalue,4)}')
print(f'BPZ: r = {round(result_bpz.rvalue,4)}') 

$R^2$ (the coefficient of determination) is the fraction of the variance of Y around its mean that is explained by the linear relationship with X. 

In [None]:
print(f'FlexZBoost: R^2 = {round((result_fzboost.rvalue)**2,4)}')
print(f'BPZ: R^2 = {round((result_bpz.rvalue)**2,4)}') 
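
To make the meaning of $R^2$ concrete, the sketch below computes it from the residuals of the fit, as $1 - SS_{res}/SS_{tot}$, and compares it with the square of the `rvalue` returned by `stats.linregress`. The data points are made up for illustration.

```python
import numpy as np
from scipy import stats

# Toy data, roughly linear (invented values).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

fit = stats.linregress(x, y)

# Residual sum of squares: scatter left over after the linear fit.
residuals = y - (fit.slope * x + fit.intercept)
ss_res = np.sum(residuals**2)

# Total sum of squares: scatter of y around its own mean.
ss_tot = np.sum((y - y.mean()) ** 2)

r_squared = 1.0 - ss_res / ss_tot
print(f"1 - SS_res/SS_tot = {r_squared:.6f}")
print(f"rvalue**2         = {fit.rvalue**2:.6f}")
```

For a simple linear regression with an intercept, the two numbers agree up to floating-point error.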

ref: https://youtu.be/nk2CQITm_eo 

Assess statistical significance: 

To test whether the correlation coefficient is significantly different from zero (i.e., whether there is a statistically significant linear relationship), you can conduct a hypothesis test. The most common approach is to perform a t-test on the correlation coefficient.

- Null hypothesis ($H_0$): there is no linear relationship between X and Y (i.e., the population correlation $\rho$ is zero).
- Alternative hypothesis ($H_1$): there is a linear relationship (i.e., $\rho$ is not zero).

Determine the p-value: with the t-statistic, you can determine the p-value using the t-distribution with $n-2$ degrees of freedom. The p-value represents the probability of obtaining a correlation at least as extreme as the one observed if the null hypothesis is true.

Make a decision: Compare the obtained p-value with the significance level (alpha) you have chosen (e.g., 0.05). If the p-value is less than alpha, you reject the null hypothesis, indicating that there is a statistically significant linear relationship between X and Y. Otherwise, you fail to reject the null hypothesis, suggesting that there is no statistically significant linear relationship.
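
The test described above can be sketched by hand. For a sample of $n$ points with correlation $r$, the statistic is $t = r\sqrt{(n-2)/(1-r^2)}$, and the two-sided p-value is the probability mass of a t-distribution with $n-2$ degrees of freedom beyond $\pm|t|$. The example below uses made-up data and compares the manual result with the `pvalue` reported by `stats.linregress`, which performs a two-sided test by default.

```python
import math

import numpy as np
from scipy import stats

# Toy data with visible scatter (invented values).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.3, 2.9, 4.8, 4.1, 6.5, 6.0, 8.2, 7.4])

fit = stats.linregress(x, y)
r = fit.rvalue
n = x.size

# t-statistic for H0: rho = 0
t_stat = r * math.sqrt((n - 2) / (1.0 - r**2))

# Two-sided p-value from the t-distribution with n - 2 degrees of freedom
p_manual = 2.0 * stats.t.sf(abs(t_stat), df=n - 2)

alpha = 0.05  # significance level chosen for this example
print(f"t = {t_stat:.3f}, p (manual) = {p_manual:.3e}, p (linregress) = {fit.pvalue:.3e}")
print("reject H0" if p_manual < alpha else "fail to reject H0")
```

With only a handful of runs, as in this notebook, a small p-value says the linear trend is unlikely to be chance, but it does not by itself validate extrapolating far beyond the measured range.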


In [None]:
print(f'FlexZBoost: p-value = {result_fzboost.pvalue}')
print(f'BPZ: p-value = {result_bpz.pvalue}') 
print('Both cases show a statistically significant linear relationship (p-value << 0.05).')

Prediction for DR1:

$$y = a x + b$$ 


In [None]:
print(f'total time = {result_fzboost.slope} * size + {result_fzboost.intercept} ') 

In [None]:
def runtime_predict(dataset_size, algo=None):
    if algo == 'fzboost': 
        return result_fzboost.slope * dataset_size + result_fzboost.intercept
    if algo == 'bpz': 
        return result_bpz.slope * dataset_size + result_bpz.intercept

In [None]:
print(f'FlexZBoost: 40B objects in {round(runtime_predict(40000000000, "fzboost")/3600., 1)} hours') 
print(f'BPZ: 40B objects in {round(runtime_predict(40000000000, "bpz")/3600., 1)} hours') 

In [None]:
top = 10_000_000_000
xline = [0, top/1_000_000]  # million objects, to match the plot's x axis
yline_fzb = [0, runtime_predict(top, algo='fzboost')/60.]
yline_bpz = [0, runtime_predict(top, algo='bpz')/60.]           

In [None]:
plt.figure(dpi=100)
# plt.plot(xline, yline_fzb, '-', color='#1f77b4')
# plt.plot(xline, yline_bpz, '-', color='orange')
plt.errorbar(x_fzb/1000000., y_fzb/60., yerr=fzboost_runtime_error, marker='o', ls='', label="FlexZBoost")
plt.errorbar(x_bpz/1000000., y_bpz/60., yerr=bpz_runtime_error, marker='^', ls='', label="BPZ")
plt.xlabel('dataset size (million objects)')
plt.ylabel('total runtime (min)')
#plt.xlim(0, top)#max(x_fzb/1000000.))#41000)
#plt.ylim(0, max(y_fzb/60.))#41000)
plt.legend()
plt.tight_layout()

In [None]:
coef