# Find the average predictions from a particular ML branch
This helper notebook gathers predictions and processes them into an intermediate result that is stored for later plotting. The SuperLearner workflow filters out some points during the PCA stage (which measures how different each point is from the others for ICON-ModEx iterations), so the final files in `./output_data` contain fewer points than the initial prediction set. We do not want to drop those points; we want to use every prediction made at the WHONDRS sites. This notebook therefore reads `./ml_models/sl_?/sl_predictions.csv` for each SuperLearner ensemble member, averages the values across members, and saves the result to a file for plotting later.

In [1]:
import pandas as pd

## Grab data

In [2]:
repo_prefix = '~/tmp/'
repo_name = 'sl-archive-whondrs'
repo_url = 'https://github.com/parallelworks/'+repo_name
branch = 'S19S-SSS-log10-extrap-r01'

# Grab the data and get onto the branch if not already there
! mkdir -p {repo_prefix}
! cd {repo_prefix}; git clone {repo_url}
! cd {repo_prefix}/{repo_name}; git checkout {branch}

fatal: destination path 'sl-archive-whondrs' already exists and is not an empty directory.
Checking out files: 100% (253/253), done.
Switched to branch 'S19S-SSS-log10-extrap-r01'
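
The `fatal: destination path ... already exists` message above appears because `git clone` is rerun against an existing checkout; the subsequent `git checkout` still succeeds, so it is harmless here. As a minimal sketch of a rerun-safe alternative, assuming `git` is available on the PATH, the clone could be skipped when the repository is already present. The helper name `ensure_checkout` and its `run` parameter are hypothetical and are not part of the original notebook.

```python
import os
import subprocess


def ensure_checkout(prefix, name, url, branch, run=subprocess.run):
    """Clone `url` into prefix/name only if it is missing, then check out `branch`.

    `run` defaults to subprocess.run; it is a parameter only so the helper
    can be exercised without touching the network (hypothetical design choice).
    """
    prefix = os.path.expanduser(prefix)      # resolve a leading '~'
    os.makedirs(prefix, exist_ok=True)       # same effect as `mkdir -p`
    repo_dir = os.path.join(prefix, name)

    # Clone only when there is no existing git checkout at the target path
    if not os.path.isdir(os.path.join(repo_dir, '.git')):
        run(['git', 'clone', url, repo_dir], check=True)

    # Switch to the requested branch in either case
    run(['git', '-C', repo_dir, 'checkout', branch], check=True)
    return repo_dir


# Possible usage with the variables defined in the cell above:
# ensure_checkout(repo_prefix, repo_name, repo_url, branch)
```

Because the helper expands `~` itself, the returned path can also be used wherever an absolute path is needed.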


## Load data from each SuperLearner ensemble member and then take the average

In [3]:
num_sl = 10  # number of SuperLearner ensemble members (sl_0 ... sl_9)

# Read the predictions from every ensemble member
ml_output_df_list = []
for ll in range(num_sl):
    ml_output_df_list.append(pd.read_csv(f'{repo_prefix}{repo_name}/ml_models/sl_{ll}/sl_predictions.csv'))

# Stack all members, then compute per-site statistics across members
ml_output_all_df = pd.concat(ml_output_df_list)
by_id = ml_output_all_df.groupby('Sample_ID')
predict_avg = by_id.mean()
predict_std = by_id.std()
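
The stack-then-group pattern above can be checked on toy data. The sketch below uses three hypothetical ensemble members with a made-up `prediction` column (the real column names in `sl_predictions.csv` are not shown in this notebook) and also counts how many members contributed to each `Sample_ID`, which is one way to spot sites missing from some members.

```python
import pandas as pd

# Hypothetical stand-ins for three sl_predictions.csv files; the
# 'prediction' column name is an assumption, not taken from the repository.
members = [
    pd.DataFrame({'Sample_ID': ['A', 'B'], 'prediction': [1.0, 4.0]}),
    pd.DataFrame({'Sample_ID': ['A', 'B'], 'prediction': [2.0, 5.0]}),
    pd.DataFrame({'Sample_ID': ['A', 'B'], 'prediction': [3.0, 6.0]}),
]

# Stack the members row-wise, then group rows that share a Sample_ID
stacked = pd.concat(members, ignore_index=True)
grouped = stacked.groupby('Sample_ID')

toy_avg = grouped.mean()    # per-site mean across members
toy_std = grouped.std()     # per-site sample standard deviation (ddof=1)
toy_count = grouped.size()  # number of members contributing to each site
```

Note that pandas computes the sample standard deviation (`ddof=1`) by default, so a site that appears in only one member gets a `NaN` spread.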

In [4]:
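# Save the ensemble mean, indexed by Sample_ID, to the current working directory.
# predict_std is computed above but is not written out here.
predict_avg.to_csv('WHONDRS_'+branch+'_predictions.csv')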