<a href="https://colab.research.google.com/github/jadecci/partialcorr_factors/blob/main/Partial_Correlation_Analysis_on_Influencing_Factors.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%%capture
!pip install pingouin

import numpy as np
import pandas as pd
# importing pingouin also registers the DataFrame.partial_corr method used below
from pingouin import partial_corr

pd.set_option('display.width', 150)

# Partial correlation analyses on influencing factors in Table 4

Five influencing factors are identified in Table 4, for all studies with relevant extracted data:
- sample size of training set for internal validation
- amount of data for studies recruiting subjects
- participant source
- whether studies controlled for confounds
- validation type for studies that involved both internal and external validation

## 1. Factors influencing internal validation

As the first 4 factors were only tested for their influence on internal validation accuracy, we do not consider their interaction with validation type here. First, we extract these four factors and look at their inter-correlations.

In [None]:
# training sample size, participant source, confound control
data = pd.read_excel('Supplementary Table 3.xlsx', sheet_name='IV training set sample size', header=1, usecols=[0, 1, 6, 7, 8])
data = data.merge(pd.read_excel('Supplementary Table 3.xlsx', sheet_name='participant source', header=1, usecols=[0, 7],
                                names=['Paper #', 'participant source']), how='left', on='Paper #')
data = data.merge(pd.read_excel('Supplementary Table 3.xlsx', sheet_name='confound', header=1, usecols=[0, 7],
                                names=['Paper #', 'confound control']), how='left', on='Paper #')

# pool amount of data for recruited subjects and open datasets
data_amount = pd.read_excel('Supplementary Table 3.xlsx', sheet_name='amount of data', header=1, usecols=[0, 7], names=['Paper #', 'amount of data'])
data_amount_open = pd.read_excel('Supplementary Table 3.xlsx', sheet_name='amount of data', header=1, usecols=[0, 8], names=['Paper #', 'amount of data'])
data_amount.update(data_amount_open)

# for studies using multiple datasets, the one corresponding to the higher prediction accuracy is the matching one
data_amount.drop([1, 34, 39, 41, 55], inplace=True)
data_amount.loc[2, 'Paper #'] = 3
data_amount.loc[35, 'Paper #'] = 65
data_amount.loc[40, 'Paper #'] = 71
data_amount.loc[42, 'Paper #'] = 74
data_amount.loc[56, 'Paper #'] = 105
data = data.merge(data_amount, how='left', on='Paper #')

data[['training set sample size', 'participant source', 'confound control', 'amount of data']].dropna().corr()

| | training set sample size | participant source | confound control | amount of data |
|---|---|---|---|---|
| training set sample size | 1.0 | 0.309509 | -0.173893 | 0.046111 |
| participant source | 0.309509 | 1.0 | 0.092555 | 0.289737 |
| confound control | -0.173893 | 0.092555 | 1.0 | -0.104911 |
| amount of data | 0.046111 | 0.289737 | -0.104911 | 1.0 |


Since amount of data is only applicable to studies using functional data, and not to studies using structural data, we also look at the inter-correlations of the other 3 factors excluding amount of data.

In [None]:
data[['training set sample size', 'participant source', 'confound control']].dropna().corr()

| | training set sample size | participant source | confound control |
|---|---|---|---|
| training set sample size | 1.0 | 0.350423 | -0.201256 |
| participant source | 0.350423 | 1.0 | -0.088293 |
| confound control | -0.201256 | -0.088293 | 1.0 |
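
The shared entries differ between the two tables above (for example 0.310 versus 0.350 for training set sample size and participant source) because `dropna()` keeps only the rows that are complete on the selected columns, so including `amount of data` shrinks the sample. The following is a minimal, dependency-free sketch of that complete-case filtering. The rows and the helper `complete_cases` are hypothetical and are not taken from Supplementary Table 3.

```python
# Hypothetical study records; None marks a missing value
# (e.g. a structural-data study has no 'amount of data').
rows = [
    {'training set sample size': 40, 'participant source': 1, 'amount of data': 300},
    {'training set sample size': 120, 'participant source': 2, 'amount of data': None},
    {'training set sample size': 65, 'participant source': 1, 'amount of data': 180},
    {'training set sample size': 900, 'participant source': 2, 'amount of data': 1200},
    {'training set sample size': 30, 'participant source': 1, 'amount of data': None},
]


def complete_cases(records, columns):
    """Keep only records with no missing value in any of the given columns."""
    return [r for r in records if all(r[c] is not None for c in columns)]


base_columns = ['training set sample size', 'participant source']
with_amount = base_columns + ['amount of data']

# The same records give different sample sizes depending on the column subset.
n_without_amount = len(complete_cases(rows, base_columns))
n_with_amount = len(complete_cases(rows, with_amount))
print(n_without_amount, n_with_amount)
```

Any correlation computed on the smaller subset is therefore based on different studies, not just fewer of them.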


Training set sample size is moderately correlated with participant source (r = 0.35; i.e., open datasets tend to have larger sample sizes) and weakly correlated with confound control (r = -0.20; i.e., studies controlling for confounds also tend to use larger training sets), but not with the amount of data (r = 0.05).

The amount of data and participant source are also moderately correlated (r = 0.29; i.e., open datasets tend to record a higher number of volumes).
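
Participant source is a two-level code (1 = self-recruited, 2 = open or shared dataset, per the code comment in Section 3), so its Pearson correlation with a continuous factor is a point-biserial correlation: the sign gives the direction of the difference in group means. Here is a small sketch of that reading, using hypothetical numbers rather than the study data; the helper `pearson` is written out only to keep the example dependency-free.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    ss_a = sum((u - mean_a) ** 2 for u in a)
    ss_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(ss_a * ss_b)


# Hypothetical values: 1 = self-recruited, 2 = open dataset
source = [1, 1, 1, 1, 2, 2, 2]
sample_size = [30, 45, 60, 80, 150, 400, 900]

mean_recruited = sum(s for s, g in zip(sample_size, source) if g == 1) / source.count(1)
mean_open = sum(s for s, g in zip(sample_size, source) if g == 2) / source.count(2)
r = pearson(source, sample_size)

# A positive r means the group coded 2 has the larger mean sample size.
print(r > 0, mean_open > mean_recruited)
```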

## 2. Training sample size, participant source & confound control

For these 3 factors, partial correlation can be computed for each factor while controlling for the other 2 factors.

The amount of data can also be included as a covariate, but this restricts the analysis to studies using functional data (n = 53 instead of 81). Note that amount of data is pooled across studies using recruited subjects and studies using open datasets.

Overall, none of the 3 factors remains significant once both of the other factors are controlled for. The exceptions arise when controlling for confound control alone: training sample size (p = 0.045, close to 0.05) and participant source (p = 0.017) both remain significant.
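
To make the reported quantity concrete, here is a minimal sketch of a first-order partial correlation and its semi-partial counterpart, using the textbook closed-form expressions for a single covariate. According to pingouin's documentation, `covar` removes the covariates from both variables, whereas `y_covar` (the argument used in the cells below) removes them from `y` only, which is a semi-partial correlation. The toy numbers and helper names are hypothetical, and rows with several covariates need a regression on all of them rather than this closed form.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    ss_a = sum((u - mean_a) ** 2 for u in a)
    ss_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(ss_a * ss_b)


def partial_r(x, y, z):
    """Correlation of x and y with the covariate z removed from both."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


def semipartial_r(x, y, z):
    """Correlation of x and y with the covariate z removed from y only."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt(1 - ryz ** 2)


# Hypothetical toy data (not from Supplementary Table 3)
sample_size = [20, 35, 50, 80, 120, 200, 400, 800]
source = [1, 1, 1, 2, 1, 2, 2, 2]
accuracy = [0.92, 0.88, 0.85, 0.80, 0.83, 0.74, 0.72, 0.70]

r_raw = pearson(sample_size, accuracy)
r_partial = partial_r(sample_size, accuracy, source)
r_semi = semipartial_r(sample_size, accuracy, source)
print(r_raw, r_partial, r_semi)
```

The two expressions share a numerator and differ only in the denominator, so the semi-partial value is never larger in magnitude than the partial one.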

In [None]:
print('Factor: training sample size')
print('Partial correlation controlling for:')
# one partial_corr call per covariate set, stacked into a single table
covar_sets = {
    'participant source': ['participant source'],
    'confound control': ['confound control'],
    'participant source & confound control': ['participant source', 'confound control'],
    'participant source, confound control & amount of data': ['participant source', 'confound control', 'amount of data'],
}
corr_stats = pd.concat([data.partial_corr(x='training set sample size', y='prediction accuracy', method='pearson', y_covar=covars)
                        for covars in covar_sets.values()])
corr_stats.index = list(covar_sets)
corr_stats

Factor: training sample size
Partial correlation controlling for:


| | n | r | CI95% | p-val |
|---|---|---|---|---|
| participant source | 81 | -0.174491 | [-0.38, 0.05] | 0.121614 |
| confound control | 81 | -0.224938 | [-0.42, -0.01] | 0.044853 |
| participant source & confound control | 81 | -0.140999 | [-0.35, 0.08] | 0.215179 |
| participant source, confound control & amount of data | 53 | -0.111347 | [-0.38, 0.17] | 0.441402 |


In [None]:
print('Factor: participant source')
print('Partial correlation controlling for:')
# one partial_corr call per covariate set, stacked into a single table
covar_sets = {
    'training set sample size': ['training set sample size'],
    'confound control': ['confound control'],
    'training set sample size & confound control': ['training set sample size', 'confound control'],
    'training set sample size, confound control & amount of data': ['training set sample size', 'confound control', 'amount of data'],
}
corr_stats = pd.concat([data.partial_corr(x='participant source', y='prediction accuracy', method='pearson', y_covar=covars)
                        for covars in covar_sets.values()])
corr_stats.index = list(covar_sets)
corr_stats

Factor: participant source
Partial correlation controlling for:


| | n | r | CI95% | p-val |
|---|---|---|---|---|
| training set sample size | 81 | -0.192741 | [-0.4, 0.03] | 0.086732 |
| confound control | 81 | -0.265592 | [-0.46, -0.05] | 0.017262 |
| training set sample size & confound control | 81 | -0.192758 | [-0.4, 0.03] | 0.088762 |
| training set sample size, confound control & amount of data | 53 | -0.262584 | [-0.5, 0.02] | 0.065436 |


In [None]:
print('Factor: confound control')
print('Partial correlation controlling for:')
# one partial_corr call per covariate set, stacked into a single table
covar_sets = {
    'training set sample size': ['training set sample size'],
    'participant source': ['participant source'],
    'training set sample size & participant source': ['training set sample size', 'participant source'],
    'training set sample size, participant source & amount of data': ['training set sample size', 'participant source', 'amount of data'],
}
corr_stats = pd.concat([data.partial_corr(x='confound control', y='prediction accuracy', method='pearson', y_covar=covars)
                        for covars in covar_sets.values()])
corr_stats.index = list(covar_sets)
corr_stats

Factor: confound control
Partial correlation controlling for:


| | n | r | CI95% | p-val |
|---|---|---|---|---|
| training set sample size | 81 | 0.18367 | [-0.04, 0.39] | 0.102922 |
| participant source | 81 | 0.214361 | [-0.01, 0.41] | 0.056212 |
| training set sample size & participant source | 81 | 0.183698 | [-0.04, 0.39] | 0.105116 |
| training set sample size, participant source & amount of data | 53 | 0.095815 | [-0.19, 0.36] | 0.508035 |


## 3. Amount of data (recruited subjects)

As this factor only concerns studies using recruited subjects, we compute partial correlation controlling for the other factors except participant source.

The amount of data factor remains significant after controlling for training sample size and confound control.

In [None]:
# select studies by participant source (1 = self-recruited, 2 = open/shared dataset)
data_recruited = data.loc[data['participant source'] == 1]

print('Partial correlation controlling for:')
# one partial_corr call per covariate set, stacked into a single table
covar_sets = {
    'training set sample size': ['training set sample size'],
    'confound control': ['confound control'],
    'training set sample size & confound control': ['training set sample size', 'confound control'],
}
corr_stats = pd.concat([data_recruited.partial_corr(x='amount of data', y='prediction accuracy', method='pearson', y_covar=covars)
                        for covars in covar_sets.values()])
corr_stats.index = list(covar_sets)
corr_stats

Partial correlation controlling for:


| | n | r | CI95% | p-val |
|---|---|---|---|---|
| training set sample size | 26 | 0.606002 | [0.28, 0.81] | 0.001324 |
| confound control | 26 | 0.657352 | [0.35, 0.84] | 0.000356 |
| training set sample size & confound control | 26 | 0.602651 | [0.26, 0.81] | 0.001829 |
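
The confidence intervals in the table above can be reproduced by hand. The sketch below assumes they are Fisher z intervals with standard error 1/sqrt(n - k - 3) for k covariates; this is an assumption about how the interval is built, and it reproduces the tabulated bounds for the first row. The function name `fisher_ci` is hypothetical.

```python
import math
from statistics import NormalDist


def fisher_ci(r, n, n_covariates=0, confidence=0.95):
    """Confidence interval for a (partial) correlation via the Fisher z-transform."""
    z = math.atanh(r)                                   # map r to the z scale
    se = 1.0 / math.sqrt(n - n_covariates - 3)          # assumed standard error
    crit = NormalDist().inv_cdf(0.5 + confidence / 2)   # two-sided critical value
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


# First row of the table above: r = 0.606002, n = 26, one covariate
lower, upper = fisher_ci(0.606002, n=26, n_covariates=1)
print(round(lower, 2), round(upper, 2))
```

With only 26 studies the interval is wide, so the point estimate should be read together with its lower bound.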


## 4. Validation type

Sample size and confound control are considered as covariates here. For sample size, both the training set and external test set sample sizes are included.

Since we did not record the participant source and amount of data for the internal validation set and the external test set separately, it may not be appropriate to control for them here.

Note that only the 13 studies with both internal and external validation are included for this factor (N=26 because internal and external accuracies are counted separately). For paper #7, only the highest external validation accuracy was used.

The validation type factor is not significant after controlling for sample size and confound control.
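
The N=26 layout can be pictured as a long-format table in which every study contributes one row per validation type (1 = internal, 2 = external, matching the codes assigned in the cell below). The following is a minimal sketch with three hypothetical studies; the paper labels and accuracies are invented for illustration only.

```python
# Hypothetical accuracies for three studies (not from Supplementary Table 3)
studies = [
    {'Paper #': 'A', 'internal': 0.85, 'external': 0.78},
    {'Paper #': 'B', 'internal': 0.91, 'external': 0.80},
    {'Paper #': 'C', 'internal': 0.74, 'external': 0.76},
]

# Stack internal rows first, then external rows, as the notebook cell does.
long_rows = []
for code, key in ((1, 'internal'), (2, 'external')):
    for study in studies:
        long_rows.append({
            'Paper #': study['Paper #'],
            'validation type': code,
            'prediction accuracy': study[key],
        })

print(len(long_rows))
```

Each study therefore appears twice, once per validation type, and the correlation treats the two rows as separate observations.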

In [None]:
# Extract a new dataframe for the 13 studies
data_val_both = pd.read_excel('Supplementary Table 3.xlsx', sheet_name='validation type', header=1, usecols=[0, 7, 8]).dropna()
data_val = data.merge(data_val_both['Paper #'], how='right', on='Paper #')
data_val['validation type'] = 1

test_size = pd.read_excel('Supplementary Table 3.xlsx', sheet_name='EV test set sample size', header=1, usecols=[0, 7])
test_size.drop([1], inplace=True)
test_size.loc[2, 'Paper #'] = 7
data_val = data_val.merge(test_size, how='left', on='Paper #')

data_val_ext = data_val.copy()
data_val_ext['prediction accuracy'] = data_val_both['external validation prediction accuracy'].reset_index(drop=True)
data_val_ext['validation type'] = 2
data_val = pd.concat([data_val, data_val_ext])

print('Partial correlation controlling for:')
# one partial_corr call per covariate set, stacked into a single table
covar_sets = {
    'sample size': ['training set sample size', 'test set sample size'],
    'confound control': ['confound control'],
    'sample size & confound control': ['training set sample size', 'test set sample size', 'confound control'],
}
corr_stats = pd.concat([data_val.partial_corr(x='validation type', y='prediction accuracy', method='pearson', y_covar=covars)
                        for covars in covar_sets.values()])
corr_stats.index = list(covar_sets)
corr_stats

Partial correlation controlling for:


| | n | r | CI95% | p-val |
|---|---|---|---|---|
| sample size | 26 | -0.305511 | [-0.63, 0.11] | 0.146566 |
| confound control | 26 | -0.274834 | [-0.6, 0.13] | 0.183653 |
| sample size & confound control | 26 | -0.305865 | [-0.64, 0.12] | 0.155794 |
