<img align="left" src = https://linea.org.br/wp-content/themes/LIneA/imagens/logo-header.jpg width=130 style="padding: 20px"> 

# Photo-z Compute Scalability Tests
## Optimizing software infrastructure to compute photo-zs at the LSST scale: preparing for LSST DR1.

<br><br>

--- 
Main notebook: [PZ_Compute_Tests.ipynp](./PZ_Compute_Tests.ipynp)

Control spreadsheet: [PZ Compute Runs](https://docs.google.com/spreadsheets/d/1GKlDhLx7oXTjwBXoj8pzfrqnE7X-4nUW2sYDuY-tx94/edit?usp=sharing)

Project members: Julia Gschwend, Heloisa Mengisztki, Cristiano Singulani, Henrique Dante

Last verified run: 27/07/2023

--- 



# Test 4: Test variation with storage system (hardware)

Science question: 

_"Test 4: What is the impact on the pipeline execution speed of using different hardware infrastructures to read and write data during the processes? Comparison between Lustre T0 (SSD), Lustre T1 (HD), and MS04 (mass storage)."_

Apollo nodes: apl02, apl04, apl06, apl08, apl10, apl12, apl14

Input data: DP0.2 Full (1935 pre-processed parquet files = 278,318,452 objects = 33GB) 

Attempted for three cases: 
- Read and write from T0
- Read and write from T1
- Read and write from MS04 (cancelled)

In [None]:
import numpy as np
import pandas as pd
import tables_io
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
from datetime import datetime 
import time 

%matplotlib inline

In [None]:
apollo_dict = {'10.148.0.11' : 'apl01', 
                '10.148.0.12' : 'apl02', 
                '10.148.0.13' : 'apl03', 
                '10.148.0.14' : 'apl04', 
                '10.148.0.15' : 'apl05', 
                '10.148.0.16' : 'apl06', 
                '10.148.0.17' : 'apl07', 
                '10.148.0.18' : 'apl08', 
                '10.148.0.19' : 'apl09',                
                '10.148.0.27' : 'apl10', 
                '10.148.0.28' : 'apl11', 
                '10.148.0.29' : 'apl12', 
                '10.148.0.30' : 'apl13', 
                '10.148.0.31' : 'apl14', 
                '10.148.0.32' : 'apl15',
                '10.148.0.26' : 'apl16'} 

Read the results collected from the HTCondor log files and stored in CSV summary files: 

In [None]:
# BPZ
test_hardware_t0  = pd.read_csv('results/tests/test_hardware_t0.csv') 
test_hardware_t1  = pd.read_csv('results/tests/test_hardware_t1.csv')

In [None]:
test_hardware_t0.host.unique()

In [None]:
bad_hosts = []
for host, name in apollo_dict.items(): 
    # Collect the IPs of the nodes flagged as faulty
    if name in ("apl13", "apl14", "apl15"): 
        bad_hosts.append(host) 
        print(host)

Data cleaning: remove results generated by faulty machines (IP hosts above) to minimize bias. 

In [None]:
# Drop every row whose host IP is in the bad_hosts list
query = 'host not in @bad_hosts'
test_hardware_t0.query(query, inplace=True)  
test_hardware_t1.query(query, inplace=True)  

In [None]:
test_hardware_t0.host.unique()
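
The cleaning step above can be sketched on a small table. This is only an illustration: the `toy` dataframe, its timing values, and the `toy_bad_hosts` list are hypothetical, and the filter uses `isin` as an alternative to the `query` call in the notebook.

```python
import pandas as pd

# Hypothetical toy results table; the IPs follow the apollo_dict mapping,
# the timing values are made up for illustration.
toy = pd.DataFrame({
    "host": ["10.148.0.12", "10.148.0.30", "10.148.0.31", "10.148.0.14"],
    "time_diff": [350.0, 900.0, 870.0, 360.0],
})

# IPs of two nodes treated as faulty in this sketch (apl13 and apl14).
toy_bad_hosts = ["10.148.0.30", "10.148.0.31"]

# Keep only the rows whose host is not in the exclusion list.
toy_clean = toy[~toy["host"].isin(toy_bad_hosts)].reset_index(drop=True)
print(toy_clean)
```

Using a list-based filter keeps the cell valid if the number of excluded nodes changes.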

Organize the test-result dataframes used in the analysis: 

In [None]:
fzboost_runs = {
    'test_hardware_t0': test_hardware_t0 ,
    'test_hardware_t1': test_hardware_t1
}

In [None]:
for test, df in fzboost_runs.items():
    print(f'{test} run in {len(df.host.unique())} nodes: ')
    print(np.sort([apollo_dict[host] for host in df.host.unique()]))
    print('---')

Compute the inverse speed (speed$^{-1}$, the processing time per object in milliseconds) and add it to each results dataframe as the `speed` column: 

In [None]:
for results_df in fzboost_runs.values():
    results_df['speed'] = (results_df['time_diff']/results_df['chunks'])*1000.

In [None]:
fzboost_runs['test_hardware_t0'].head()
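
As a worked example of this conversion, the sketch below uses two hypothetical rows. It assumes, as the cell above implies, that `time_diff` is in seconds and that `chunks` holds the number of objects in each file; the `obj_per_s` column is an extra added here only to show the reciprocal quantity.

```python
import pandas as pd

# Hypothetical rows: 'time_diff' in seconds, 'chunks' = number of objects in the file.
toy = pd.DataFrame({"time_diff": [400.0, 300.0], "chunks": [150_000, 120_000]})

# Processing time per object, in milliseconds (same formula as in the notebook).
toy["speed"] = toy["time_diff"] / toy["chunks"] * 1000.0

# Reciprocal quantity: objects processed per second (illustrative extra column).
toy["obj_per_s"] = 1000.0 / toy["speed"]
print(toy)
```

Since the `speed` column stores time per object, lower values mean faster processing.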

Build a dataframe with process summary info:

Function to recalculate the effective runtime, taking into account only the files processed by the good nodes: 

In [None]:
def calc_runtime(pz_results_dict, test_name):   
    str_begin = pz_results_dict[test_name]['time_begin'].min()
    str_end = pz_results_dict[test_name]['time_end'].max()
    t_begin = datetime.strptime(str_begin,'%Y-%m-%d %H:%M:%S')
    t_end = datetime.strptime(str_end,'%Y-%m-%d %H:%M:%S')
    dt = (t_end - t_begin)
    runtime = dt.total_seconds()
    return str_begin, str_end, runtime 

Example:

In [None]:
test = 'test_hardware_t0'
begin, end, runtime  = calc_runtime(fzboost_runs, test)
print(f'test {test} started at {begin}, finished at {end}, and took ~{round(runtime/60.)} minutes')
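
The same runtime calculation can be sketched on a self-contained table. The timestamps below are hypothetical, and the sketch parses the columns with `pd.to_datetime` before taking the minimum and maximum, whereas `calc_runtime` compares the raw strings (which gives the same ordering for this zero-padded `YYYY-MM-DD HH:MM:SS` format).

```python
import pandas as pd

# Hypothetical per-file timestamps in the same format as the CSV summaries.
toy = pd.DataFrame({
    "time_begin": ["2023-07-27 10:00:00", "2023-07-27 10:00:05", "2023-07-27 10:06:00"],
    "time_end":   ["2023-07-27 10:05:30", "2023-07-27 10:06:10", "2023-07-27 10:12:00"],
})

fmt = "%Y-%m-%d %H:%M:%S"
# Earliest start and latest end over all files.
t_begin = pd.to_datetime(toy["time_begin"], format=fmt).min()
t_end = pd.to_datetime(toy["time_end"], format=fmt).max()

# Effective wall-clock runtime in seconds.
toy_runtime = (t_end - t_begin).total_seconds()
print(f"runtime: {toy_runtime} s (~{round(toy_runtime / 60.)} minutes)")
```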

In [None]:
fzboost_info = {}
for key in fzboost_runs.keys():
    fzboost_info[key] = {}
for test_name, results_df in fzboost_runs.items():
    hosts = [] 
    for host, name in apollo_dict.items():
        if host in results_df['host'].unique():
            hosts.append(name)
    fzboost_info[test_name]['hosts'] = hosts
    fzboost_info[test_name]['n_cores'] = len(hosts) * 56 
    fzboost_info[test_name]['n_obj'] = np.sum(results_df['chunks'])
    begin, end, runtime  = calc_runtime(fzboost_runs, test_name)
    fzboost_info[test_name]['time_begin'] = begin
    fzboost_info[test_name]['time_end'] = end
    fzboost_info[test_name]['runtime'] = runtime
    fzboost_info[test_name]['n_files'] = len(results_df['host'])
    fzboost_info[test_name]['avg_speed'] = np.average(results_df['speed'])   
    fzboost_info[test_name]['std_speed'] = np.std(results_df['speed'])   
fzboost_info = pd.DataFrame(fzboost_info).T

In [None]:
fzboost_info.index

In [None]:
fzboost_info

--- 
## Speed distributions

In [None]:
mean_t0 = np.mean(test_hardware_t0['speed'])
mean_t1 = np.mean(test_hardware_t1['speed'])
median_t0 = np.median(test_hardware_t0['speed'])
median_t1 = np.median(test_hardware_t1['speed'])
std_t0 = np.std(test_hardware_t0['speed'])
std_t1 = np.std(test_hardware_t1['speed'])

In [None]:
plt.figure(dpi=300)
plt.grid(True)
n0, bins0, patches0 = plt.hist(test_hardware_t0['speed'], histtype='step', lw=2, bins=30, label='T0', color='blue')
n1, bins1, patches1 = plt.hist(test_hardware_t1['speed'], histtype='step', lw=2, bins=30, label='T1', color='orange')
plt.vlines(mean_t0, ymin=0, ymax=1.1*np.max(n0), lw=3, color="blue", label='T0 mean')
plt.vlines(mean_t1, ymin=0, ymax=1.1*np.max(n1), lw=3, color="orange", ls="--", label='T1 mean')
ymax = np.max([1.1*np.max(n0),1.1*np.max(n1)])
# Thin lines mark one standard deviation around each mean
plt.vlines(mean_t0-std_t0, ymin=0, ymax=ymax, lw=1, color="blue", label='T0 mean ± std')
plt.vlines(mean_t0+std_t0, ymin=0, ymax=ymax, lw=1, color="blue")
plt.vlines(mean_t1-std_t1, ymin=0, ymax=ymax, lw=1, color="orange", label='T1 mean ± std')
plt.vlines(mean_t1+std_t1, ymin=0, ymax=ymax, lw=1, color="orange")
plt.ylim(0,ymax)
plt.legend()
plt.xlabel('speed (ms/obj)')
plt.ylabel('frequency')
plt.tight_layout()
plt.savefig('hardware_t0_t1.png')

In [None]:
mean_t0, mean_t1

In [None]:
#help(stats.bootstrap)

In [None]:
res_t0 = stats.bootstrap([test_hardware_t0['speed']], np.mean, confidence_level=0.95)
res_t1 = stats.bootstrap([test_hardware_t1['speed']], np.mean, confidence_level=0.95)

In [None]:
res_t0# .confidence_interval
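
To illustrate what bootstrap resampling of the mean does, here is a hand-rolled sketch on synthetic data. It uses the simple percentile method, which is not the same as the default `BCa` method of `stats.bootstrap`, and the sample itself (normal, centred on 2.66 ms/obj) is an assumption made only for the example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for a 'speed' column (ms/obj); not real measurements.
sample = rng.normal(loc=2.66, scale=0.3, size=300)

n_resamples = 9999  # same count as the scipy default
# Draw n_resamples index sets with replacement, each the size of the sample.
idx = rng.integers(0, sample.size, size=(n_resamples, sample.size))
boot_means = sample[idx].mean(axis=1)

# Percentile 95% confidence interval for the mean.
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {sample.mean():.4f}, 95% CI = [{ci_low:.4f}, {ci_high:.4f}]")
```

The spread of `boot_means` estimates the sampling uncertainty of the mean, not the spread of the individual measurements.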

As the two distributions are approximately normal, we use a two-sample t-test to check whether their means differ significantly. 

- Null hypothesis ($H_0$): the two distributions have equal means
- Alternative hypothesis ($H_1$): the two distributions have different means

Assumptions:
- The samples are independent
- The data follow a normal distribution
- The samples have similar variances (homogeneity assumption)

Check the homogeneity assumption:

In [None]:
np.var(test_hardware_t0['speed']), np.var(test_hardware_t1['speed'])

_"If the ratio of the larger data groups to the small data group is less than 4:1 then we can consider that the given data groups have equal variance."_ 

In [None]:
np.var(test_hardware_t1['speed']) / np.var(test_hardware_t0['speed'])

ok!

Alternatively, use the [Levene test](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.levene.html):

In [None]:
stats.levene(test_hardware_t0['speed'], test_hardware_t1['speed'])

Levene test's p-value > 0.05 $=>$ the variances are not significantly different. 
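
Both homogeneity checks can be sketched together on synthetic data. The two samples below are assumptions (normal draws with similar spreads), and the ratio is written as larger over smaller variance so that the 4:1 rule of thumb applies regardless of which sample has the larger variance.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-ins for the two 'speed' samples; not real measurements.
a = rng.normal(2.66, 0.30, size=800)
b = rng.normal(2.66, 0.32, size=800)

# Sample variances (ddof=1); np.var defaults to the population variance (ddof=0).
var_a, var_b = np.var(a, ddof=1), np.var(b, ddof=1)

# Rule of thumb: larger variance divided by smaller variance should be below 4.
ratio = max(var_a, var_b) / min(var_a, var_b)

# Levene test: the null hypothesis is that the variances are equal.
levene_stat, levene_p = stats.levene(a, b)
print(f"variance ratio = {ratio:.3f}, Levene p-value = {levene_p:.3f}")
```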

In [None]:
# Perform the two sample t-test with equal variances
test_result = stats.ttest_ind(a=test_hardware_t0['speed'], b=test_hardware_t1['speed'])
test_result

In [None]:
test_result.pvalue
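
If the equal-variance assumption were in doubt, one option would be Welch's t-test, available through the `equal_var=False` argument of `stats.ttest_ind`. The sketch below compares both variants on synthetic samples; the data are assumptions and do not reproduce the results of this notebook.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic stand-ins for the two 'speed' samples; not real measurements.
a = rng.normal(2.66, 0.30, size=800)
b = rng.normal(2.66, 0.30, size=800)

# Student's t-test (pooled variance), as used in the notebook.
student = stats.ttest_ind(a, b)

# Welch's t-test, which does not assume equal variances.
welch = stats.ttest_ind(a, b, equal_var=False)

print(f"Student: t = {student.statistic:.3f}, p = {student.pvalue:.3f}")
print(f"Welch:   t = {welch.statistic:.3f}, p = {welch.pvalue:.3f}")
```

With equal sample sizes, as here, the two variants give the same t statistic and differ only in the degrees of freedom used for the p-value.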

In [None]:
res_t0.confidence_interval[0]

In [None]:
plt.figure(dpi=300)
#fig, ax = plt.subplots()
plt.hist(res_t0.bootstrap_distribution, bins=25, label='t0', color='#1f77b4', alpha=0.8)#, histtype='step')
plt.hist(res_t1.bootstrap_distribution, bins=25, label='t1', color='orange', alpha=0.5)
plt.vlines(np.mean(res_t0.bootstrap_distribution), ymin=0, ymax=1800, color='#1f77b4', lw=2,
           label=f'mean: {round(np.mean(res_t0.bootstrap_distribution), 4)}')
plt.vlines(np.mean(res_t1.bootstrap_distribution), ymin=0, ymax=1800, color='orange', lw=2, 
           label=f'mean: {round(np.mean(res_t1.bootstrap_distribution), 4)}')
plt.vlines(res_t0.confidence_interval[0], ymin=0, ymax=1800, ls='--', color='#1f77b4', label='95% confidence \n intervals')
plt.vlines(res_t0.confidence_interval[1], ymin=0, ymax=1800, ls='--', color='#1f77b4')
plt.vlines(res_t1.confidence_interval[0], ymin=0, ymax=1800, ls='--', color='orange', label=' ')
plt.vlines(res_t1.confidence_interval[1], ymin=0, ymax=1800, ls='--', color='orange')
plt.title('Bootstrap resampling 9999x')
plt.text(2.69, 200, f' t-test \n p-value={round(test_result.pvalue, 2)}', fontsize=12)
plt.xlabel('speed mean (ms/obj)')
plt.ylabel('frequency')
plt.xlim(2.6,2.73)
plt.ylim(0,1500)
plt.legend()
plt.tight_layout()
plt.savefig('hardware_t0_t1_bootstrap.png')

Since the p-value (~0.89) is greater than $\alpha = 0.05$, we cannot reject the null hypothesis. We do not have sufficient evidence to say that the mean processing time per object differs between T0 and T1, i.e., that either storage tier is faster than the other. 
