## Driver code for comparing the output of different preprocessing pipelines
- Note: currently uses pipeline output after atlas-based (ROI-level) grouping
- Atlases used: aparc (FreeSurfer) and DKT-31 Mindboggle (ANTs: https://mindboggle.readthedocs.io/en/latest/labels.html)

### Steps
- import data CSVs
- visualize data distributions
- correlate features across pipelines
- compare performance of a machine-learning model (scikit-learn)
- compare performance of statistical models (statsmodels: OLS or logit)

In [1]:
import sys
import numpy as np
import pandas as pd
import itertools
from sklearn import svm

sys.path.append('../')
from lib.data_handling import *
from lib.data_stats import *

### Data paths

In [2]:
proj_dir = '/Users/nikhil/code/git_repos/compare-surf-tools/'
data_dir = proj_dir + 'data/'
fs60_dir = data_dir + 'fs60_group_stats/'
demograph_file = 'ABIDE_Phenotype.csv'
ants_file = 'ABIDE_ants_thickness_data.csv' # uses modified (Mindboggle) DKT atlas with 31 ROIs
fs53_file = 'ABIDE_fs5.3_thickness.csv'
fs51_file = 'cortical_fs5.1_measuresenigma_thickavg.csv'
fs60_lh_file = 'lh.aparc.thickness.table.test1' # alternative: 'aparc_lh_thickness_table.txt'
fs60_rh_file = 'rh.aparc.thickness.table.test1' # alternative: 'aparc_rh_thickness_table.txt'


### Global Vars

In [3]:
subject_ID_col = 'SubjID'

### Load data

In [4]:
# Demographics and Dx
demograph = pd.read_csv(data_dir + demograph_file)
demograph = demograph.rename(columns={'Subject_ID':subject_ID_col})

# ANTs
ants_data = pd.read_csv(data_dir + ants_file, header=2)
print('shape of ants data {}'.format(ants_data.shape))
ants_data_std = standardize_ants_data(ants_data, subject_ID_col)
print('shape of stdized ants data {}'.format(ants_data_std.shape))
print('')

# FS
fs53_data = pd.read_csv(data_dir + fs53_file)
print('shape of fs53 data {}'.format(fs53_data.shape))
fs53_data_std = standardize_fs_data(fs53_data, subject_ID_col)
print('shape of stdized fs53 data {}'.format(fs53_data_std.shape))
print('')

fs51_data = pd.read_csv(data_dir + fs51_file)
print('shape of fs51 data {}'.format(fs51_data.shape))
fs51_data_std = standardize_fs_data(fs51_data, subject_ID_col)
print('shape of stdized fs51 data {}'.format(fs51_data_std.shape))
print('')

# sep=r'\s+' parses whitespace-delimited tables (delim_whitespace is deprecated in recent pandas)
fs60_lh_data = pd.read_csv(data_dir + fs60_lh_file, sep=r'\s+')
fs60_rh_data = pd.read_csv(data_dir + fs60_rh_file, sep=r'\s+')
print('shape of fs60 data l: {}, r: {}'.format(fs60_lh_data.shape,fs60_rh_data.shape))

fs60_data_std = standardize_fs60_data(fs60_lh_data, fs60_rh_data, subject_ID_col)
print('shape of stdized fs60 data {}'.format(fs60_data_std.shape))

shape of ants data (1101, 99)
shape of stdized ants data (1101, 90)

shape of fs53 data (976, 74)
shape of stdized fs53 data (976, 74)

shape of fs51 data (1112, 74)
shape of stdized fs51 data (1112, 74)

shape of fs60 data l: (1047, 36), r: (1047, 36)
shape of left and right merge fs6.0 df (1047, 71)
shape of stdized fs60 data (1047, 71)
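
The FS6.0 shapes above are consistent with a key-based merge of the two hemisphere tables: 36 + 36 columns minus one shared subject ID column gives 71. The following is a minimal sketch of such a merge, not the actual `standardize_fs60_data` implementation. The helper name `merge_hemispheres`, the toy column names, and the assumption that the first column of each table holds the subject ID are all illustrative.

```python
import pandas as pd

def merge_hemispheres(lh_df, rh_df, subject_id_col="SubjID"):
    """Hypothetical helper: rename each table's first column to a shared
    subject ID column, then inner-join the two hemispheres on it."""
    lh = lh_df.rename(columns={lh_df.columns[0]: subject_id_col})
    rh = rh_df.rename(columns={rh_df.columns[0]: subject_id_col})
    return pd.merge(lh, rh, on=subject_id_col, how="inner")

# Toy tables: 3 subjects, 1 ID column + 2 ROI columns per hemisphere
lh = pd.DataFrame({"lh.aparc.thickness": ["s1", "s2", "s3"],
                   "lh_bankssts_thickness": [2.1, 2.3, 2.2],
                   "lh_cuneus_thickness": [1.8, 1.9, 1.7]})
rh = pd.DataFrame({"rh.aparc.thickness": ["s1", "s2", "s3"],
                   "rh_bankssts_thickness": [2.0, 2.4, 2.2],
                   "rh_cuneus_thickness": [1.9, 1.8, 1.6]})

merged = merge_hemispheres(lh, rh)
print(merged.shape)  # (3, 5): 3 + 3 columns minus the shared ID column
```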


### Create master dataframe

In [17]:
data_dict = {'ants' : ants_data_std,
            'fs60' : fs60_data_std,
            'fs53' : fs53_data_std,
            'fs51' : fs51_data_std}

na_action = 'drop' # options: 'ignore', 'drop'; with any other value the dataframe is excluded from the analysis
master_df, common_subs, common_roi_cols = combine_processed_data(data_dict, subject_ID_col, na_action)

print('Number of common ROIs {}'.format(len(common_roi_cols)))

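# Add demographic columns to the master_df
useful_demograph = demograph[[subject_ID_col,'SEX','AGE_AT_SCAN','DX_GROUP']].copy()
# Keep only the part of the subject ID after the first underscore
useful_demograph[subject_ID_col] = useful_demograph[subject_ID_col].str.split('_', n=1).str[1]
# Note: the ML and statsmodels cells below use DX_GROUP, SEX and AGE_AT_SCAN from master_df,
# so this merge needs to have been run once before them.
# master_df = pd.merge(master_df, useful_demograph, how='left', on=subject_ID_col)
# print('\nmaster df shape after adding demographic info {}'.format(master_df.shape))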

Number of datasets: 4
Finding common subject and columns
Number of common subjects and columns: 593, 63

checking ants dataframe
Shape of the dataframe based on common cols and subs (593, 63)
Basic data check passed
Shape of the concat dataframe (593, 64)

checking fs60 dataframe
Shape of the dataframe based on common cols and subs (593, 63)
Basic data check passed
Shape of the concat dataframe (1186, 64)

checking fs53 dataframe
Shape of the dataframe based on common cols and subs (593, 63)
Basic data check passed
Shape of the concat dataframe (1779, 64)

checking fs51 dataframe
Shape of the dataframe based on common cols and subs (593, 63)
Basic data check passed
Shape of the concat dataframe (2372, 64)
Number of common ROIs 62
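
The log above suggests the following logic: restrict every pipeline to the subjects and columns shared by all of them (593 subjects, 63 columns including the subject ID), tag each block with its pipeline name, and stack the blocks (593 rows added per pipeline, 64 columns). Below is a minimal sketch of that idea on toy data. It is not the actual `combine_processed_data` from `lib.data_handling`; the function name `combine_pipelines` is hypothetical, and NaN handling (`na_action`) is left out.

```python
import pandas as pd

def combine_pipelines(data_dict, subject_id_col="SubjID"):
    """Hypothetical sketch: stack per-pipeline tables restricted to
    the subjects and columns that every pipeline has in common."""
    frames = list(data_dict.values())
    # Subjects present in every pipeline
    common_subs = set.intersection(*(set(df[subject_id_col]) for df in frames))
    # Columns present in every pipeline, kept in the first table's order
    common_cols = [c for c in frames[0].columns
                   if all(c in df.columns for df in frames)]
    stacked = []
    for name, df in data_dict.items():
        sub_df = df.loc[df[subject_id_col].isin(common_subs), common_cols].copy()
        sub_df["pipeline"] = name  # tag rows with their source pipeline
        stacked.append(sub_df)
    master = pd.concat(stacked, ignore_index=True)
    roi_cols = [c for c in common_cols if c != subject_id_col]
    return master, sorted(common_subs), roi_cols

toy = {
    "a": pd.DataFrame({"SubjID": [1, 2, 3],
                       "roi1": [2.0, 2.1, 2.2],
                       "roi2": [1.0, 1.1, 1.2]}),
    "b": pd.DataFrame({"SubjID": [2, 3, 4],
                       "roi1": [2.5, 2.6, 2.7],
                       "roi3": [3.0, 3.1, 3.2]}),
}
master, subs, rois = combine_pipelines(toy)
print(master.shape, subs, rois)
```

In this toy case two subjects and one ROI are shared, so each pipeline contributes two rows and the stacked table has three columns (subject ID, ROI, pipeline).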


### Correlation between pipelines

In [7]:
possible_pairs = list(itertools.combinations(data_dict.keys(), 2))

for pair in possible_pairs:
    pipe1 = pair[0]
    pipe2 = pair[1]
    df1 = master_df[master_df['pipeline']==pipe1][[subject_ID_col]+common_roi_cols]
    df2 = master_df[master_df['pipeline']==pipe2][[subject_ID_col]+common_roi_cols]

    xcorr = cross_correlations(df1,df2,subject_ID_col)
    print('Avg cross correlation between {} & {} = {:4.2f}\n'.format(pipe1,pipe2,np.mean(xcorr)))

Avg cross correlation between ants & fs60 = 0.44

Avg cross correlation between ants & fs53 = 0.48

Avg cross correlation between ants & fs51 = 0.44

Avg cross correlation between fs60 & fs53 = 0.91

Avg cross correlation between fs60 & fs51 = 0.88

Avg cross correlation between fs53 & fs51 = 0.91
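
One plausible way to compute such a per-ROI agreement measure is to align both pipelines on subject ID and take the Pearson correlation of each ROI column across subjects. The sketch below assumes exactly that; the real `cross_correlations` in `lib.data_stats` may differ, and the function name `roi_correlations` and the synthetic data are illustrative only.

```python
import numpy as np
import pandas as pd

def roi_correlations(df1, df2, subject_id_col="SubjID"):
    """Hypothetical sketch: Pearson r per shared ROI column,
    computed across subjects after aligning rows on subject ID."""
    a = df1.set_index(subject_id_col).sort_index()
    b = df2.set_index(subject_id_col).sort_index()
    shared_subs = a.index.intersection(b.index)
    shared_rois = [c for c in a.columns if c in b.columns]
    a = a.loc[shared_subs, shared_rois]
    b = b.loc[shared_subs, shared_rois]
    # corrwith pairs columns by name and correlates them row-wise aligned
    return a.corrwith(b)

# Synthetic data: pipeline 2 is pipeline 1 plus small noise, rows shuffled
rng = np.random.default_rng(0)
rois = ["roi1", "roi2", "roi3"]
base = rng.normal(2.5, 0.3, size=(50, 3))
p1 = pd.DataFrame(base, columns=rois).assign(SubjID=np.arange(50))
p2 = pd.DataFrame(base + rng.normal(0, 0.05, size=base.shape), columns=rois)
p2 = p2.assign(SubjID=np.arange(50)).sample(frac=1, random_state=0)

xcorr = roi_correlations(p1, p2)
print('Avg cross correlation = {:4.2f}'.format(np.mean(xcorr)))
```

Aligning on subject ID before correlating matters here: the rows of the second table are deliberately shuffled, and a purely positional comparison would report a much lower agreement.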



### Compare ML performance 

In [11]:
input_cols = common_roi_cols
outcome_col = 'DX_GROUP'
clf = svm.SVC(kernel='linear')
ml_perf = getClassiferPerf(master_df,input_cols,outcome_col,clf)

Running ML classifer on 4 pipelines
Pipeline ants,  Accuracy mean:0.592, sd:0.063
Pipeline fs60,  Accuracy mean:0.561, sd:0.051
Pipeline fs53,  Accuracy mean:0.549, sd:0.057
Pipeline fs51,  Accuracy mean:0.526, sd:0.061
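
The accuracy mean and sd per pipeline are the kind of summary that cross-validation produces. As a rough sketch of how such numbers could be obtained with scikit-learn, the example below runs `cross_val_score` separately on each pipeline's rows. This is an assumption about what `getClassiferPerf` does, not its actual code; the function name `classifier_perf`, the fold count, and the synthetic data are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn import svm
from sklearn.model_selection import cross_val_score

def classifier_perf(master_df, input_cols, outcome_col, clf, n_folds=10):
    """Hypothetical sketch: cross-validated accuracy (mean, sd) per pipeline."""
    perf = {}
    for pipe, pipe_df in master_df.groupby("pipeline", sort=False):
        X = pipe_df[input_cols].to_numpy()
        y = pipe_df[outcome_col].to_numpy()
        # For classifiers, an integer cv uses stratified k-fold splits
        scores = cross_val_score(clf, X, y, cv=n_folds, scoring="accuracy")
        perf[pipe] = (scores.mean(), scores.std())
    return perf

# Synthetic data: two pipelines, 60 subjects each, weakly separable classes
rng = np.random.default_rng(0)
roi_names = ["roi1", "roi2", "roi3", "roi4"]
frames = []
for pipe in ["pipe_a", "pipe_b"]:
    y = np.repeat([1, 2], 30)
    X = rng.normal(size=(60, 4)) + 0.5 * (y[:, None] - 1.5)
    df = pd.DataFrame(X, columns=roi_names)
    df["DX_GROUP"] = y
    df["pipeline"] = pipe
    frames.append(df)
toy_master = pd.concat(frames, ignore_index=True)

perf = classifier_perf(toy_master, roi_names, "DX_GROUP",
                       svm.SVC(kernel="linear"), n_folds=5)
for pipe, (mean_acc, sd_acc) in perf.items():
    print("Pipeline {},  Accuracy mean:{:.3f}, sd:{:.3f}".format(pipe, mean_acc, sd_acc))
```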



### Compare statsmodels performance 

In [12]:
roi_cols = common_roi_cols
covar_cols = ['SEX','AGE_AT_SCAN']
outcome_col = 'DX_GROUP'
stat_model = 'logit'
sm_perf = getStatModelPerf(master_df,roi_cols,covar_cols,outcome_col,stat_model)
print('Shape of the stats_models results df {}'.format(sm_perf.shape))

Running 62 mass-univariate logit statsmodels on 4 pipelines
Shape of the stats_models results df (248, 4)
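
The 248 result rows correspond to 62 ROIs × 4 pipelines, i.e. one model per ROI per pipeline. A mass-univariate logit of this kind could look like the sketch below, which fits `outcome ~ roi + covariates` with the statsmodels formula API. It is an illustration under assumptions, not the actual `getStatModelPerf`: the function name `mass_univariate_logit`, the four result columns, and the synthetic data are hypothetical, and the outcome is assumed to be coded 0/1, which logit requires.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def mass_univariate_logit(master_df, roi_cols, covar_cols, outcome_col):
    """Hypothetical sketch: one logit model per (pipeline, ROI),
    recording the ROI coefficient and its p-value."""
    rows = []
    for pipe, pipe_df in master_df.groupby("pipeline", sort=False):
        for roi in roi_cols:
            formula = "{} ~ {}".format(outcome_col, " + ".join([roi] + covar_cols))
            res = smf.logit(formula, data=pipe_df).fit(disp=0)
            rows.append({"pipeline": pipe, "roi": roi,
                         "coef": res.params[roi], "p_val": res.pvalues[roi]})
    return pd.DataFrame(rows)

# Synthetic data: two pipelines, 200 subjects each, outcome coded 0/1
rng = np.random.default_rng(0)
roi_names = ["roi1", "roi2", "roi3"]
frames = []
for pipe in ["pipe_a", "pipe_b"]:
    n = 200
    df = pd.DataFrame(rng.normal(2.5, 0.3, size=(n, 3)), columns=roi_names)
    df["SEX"] = rng.integers(1, 3, size=n)
    df["AGE_AT_SCAN"] = rng.uniform(7, 60, size=n)
    df["DX_GROUP"] = rng.integers(0, 2, size=n)
    df["pipeline"] = pipe
    frames.append(df)
toy_master = pd.concat(frames, ignore_index=True)

sm_toy = mass_univariate_logit(toy_master, roi_names, ["SEX", "AGE_AT_SCAN"], "DX_GROUP")
print("Shape of the results df {}".format(sm_toy.shape))  # 3 ROIs x 2 pipelines = 6 rows
```

With this many tests per pipeline, the p-values would normally be corrected for multiple comparisons before any ROI is interpreted.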
