## Jackknife Bias Estimation for Morphometric Connectivity

This notebook describes the steps we used for calculating structural covariance and graph measures from CIVET-based cortical thickness:

1. Install & Import Packages
2. Read CIVET Files
3. DKT Parcellation
4. Covariate Regression
5. Structural Covariance
6. Jackknife Bias Estimation
7. Graph Theory

MATLAB code is available [here](https://github.com/katielavigne/civetsurf).

### 1. Install & Import Packages 

This step installs the required packages and then imports them.

In [None]:
# Install packages needed for all steps of the analyses
!pip install bctpy
!pip install statsmodels
!pip install pandas
!pip install numpy

In [None]:
# Import packages
import os, glob
import pandas as pd
import numpy as np 
import bct
import statsmodels.formula.api as sm

### 2. Read CIVET Files

This step reads [CIVET](https://www.bic.mni.mcgill.ca/ServicesSoftware/CIVET-2-1-0-Table-of-Contents)-based cortical surface text files from a specified directory. For each hemisphere, CIVET outputs a text file containing the value of a given structural metric (cortical thickness, surface area, volume) at each vertex. Using cortical thickness as an example with data from the [UK Biobank](https://www.ukbiobank.ac.uk/), this script loads the files for both hemispheres and merges them into a dataframe with one row per participant and one column per vertex. A mean metric value is then calculated for each participant by averaging across all vertices, and a total metric value is calculated by summing across all vertices. Depending on the cortical metric, one would use either the mean (thickness) or the total (surface area) in later steps.

Inputs are:
- cortical thickness files for each hemisphere of N participants

Outputs are:
- a dataframe with the dimensions N x 81,924 vertices
- a dataframe with mean thickness values
- a dataframe with total thickness values 
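
As a minimal sketch of the layout described above, the following builds the participants x vertices dataframe from small synthetic arrays instead of CIVET files. The participant IDs, the values, and the `toy_*` names are made up for illustration; only the shape logic mirrors this step.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 3 participants, 4 vertices per hemisphere
# (the real data have 81,924 vertices across both hemispheres).
toy_ids = ["1001", "1002", "1003"]
rng = np.random.default_rng(0)
left = pd.DataFrame(rng.uniform(1.5, 4.0, size=(3, 4)))
right = pd.DataFrame(rng.uniform(1.5, 4.0, size=(3, 4)))

# Place the right hemisphere after the left one: participants x all vertices
toy_df = pd.concat([left, right], axis=1, ignore_index=True)
toy_df.index = pd.Index(toy_ids, name="eid")

# One mean and one total value per participant (row-wise over vertices)
toy_mean = toy_df.mean(axis=1).rename("mean_thickness")
toy_total = toy_df.sum(axis=1).rename("total_thickness")

print(toy_df.shape)  # (3, 8)
```

Because every participant has the same number of vertices, the total is simply the mean multiplied by the vertex count, so the two summaries differ only in scale.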

___

For high performance computing, see:
- readfiles.py
- readfiles.sh

In [None]:
# Define measures and paths for Step 2. 
measure = "thickness"
civpath = "/project/def-mlepage/ukbb/civet/" + measure + "/"
outdir = "/scratch/katie/ukbb/data/"
dname = 'ukbb_' + measure + '_vertexdata.pkl'
idvar = "eid"

In [None]:
# Find files & IDs
Lfiles = glob.glob(civpath + '*left*')
Lfiles.sort()
Rfiles = glob.glob(civpath + '*right*')
Rfiles.sort()
subjIDs = [ids.split('/')[-1].split('_')[1] for ids in Lfiles]

# Make dataframe & save as pickle
Ldf = pd.concat((pd.read_csv(Lf, dtype=float, header=None).T for Lf in Lfiles))
Rdf = pd.concat((pd.read_csv(Rf, dtype=float, header=None).T for Rf in Rfiles))
df = pd.concat([Ldf,Rdf], axis=1)
df.index = subjIDs
df.index.names = [idvar]
df.to_pickle(outdir + dname)

# Calculate and save mean anatomical measure (mean thickness)
mean_measure = pd.DataFrame(pd.to_numeric(df.mean(axis=1)), columns=["mean_" + measure])
mean_measure.to_csv(outdir + 'mean_' + measure + '.csv')

# Calculate and save total anatomical measure (total thickness)
tot_measure = pd.DataFrame(pd.to_numeric(df.sum(axis=1)), columns=["total_" + measure])
tot_measure.to_csv(outdir + 'total_' + measure + '.csv')

### 3. DKT Parcellation

This step parcellates the 81,924 vertices into 62 DKT regions ([Klein & Tourville, 2012](https://doi.org/10.3389/fnins.2012.00171)) by averaging, for each participant, the values of all vertices that belong to the same region.

Inputs are:
- the previously created dataframe with all 81,924 vertices 
- a DKT-file indicating the DKT regions (see the separate file "DKT.csv" in the repository; from [MATLAB code](https://github.com/katielavigne/civetsurf) and the [CIVET Manual](https://www.bic.mni.mcgill.ca/ServicesSoftware/CIVET-2-1-0-Table-of-Contents))
- a DKT-file indicating which vertex belongs to which DKT-region (see the separate file "CIVET_2.0_DKT.txt" in the repository; from [MATLAB code](https://github.com/katielavigne/civetsurf) and the [CIVET Manual](https://www.bic.mni.mcgill.ca/ServicesSoftware/CIVET-2-1-0-Table-of-Contents))

Outputs are:
- a dataframe with the parcellated data of the dimensions N x 62 
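
The averaging behind the parcellation can be illustrated on a toy example. This is only a sketch: the region labels `A`, `B`, `C`, the participant IDs, and the values are hypothetical and stand in for the DKT label files and the vertex dataframe.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 2 participants x 6 vertices
toy_vertices = pd.DataFrame(
    [[1.0, 3.0, 2.0, 4.0, 5.0, 7.0],
     [2.0, 2.0, 1.0, 3.0, 6.0, 6.0]],
    index=pd.Index(["1001", "1002"], name="eid"),
)

# One region label per vertex, in vertex order (made-up labels)
toy_labels = pd.Series(["A", "A", "B", "B", "C", "C"])

# Average the vertices that share a label, separately for each participant
toy_parc = pd.DataFrame(
    {
        region: toy_vertices.iloc[
            :, np.flatnonzero(toy_labels.to_numpy() == region)
        ].mean(axis=1)
        for region in toy_labels.unique()
    }
)
print(toy_parc)
```

For participant `1001`, region `A` averages the first two vertices (1.0 and 3.0), giving 2.0; the result has one column per region, just as the real output has 62.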

___

For high performance computing see:
- parcellation.py
- parcellation.sh

In [None]:
# Define measures and paths for Step 3. 
dpath = '/scratch/katie/ukbb/data/'
ppath = '/project/def-mlepage/ukbb/civet/'
dktvert_file = 'CIVET_2.0_DKT.txt'
dktinfo_file = 'DKT.csv'
roivar = "roi"

In [None]:
# Parcellation
dktvert = pd.read_csv(ppath + dktvert_file, dtype=str, names= [roivar], header=None)
dktinfo = pd.read_csv(ppath + dktinfo_file, dtype=str)

parc = pd.DataFrame(index= df.index.copy())
for r in range(len(dktinfo)):
    roi = dktinfo.label_number[r]
    abr = dktinfo.abbreviation[r]
    means = pd.DataFrame(df.iloc[:,dktvert.index[dktvert.roi == roi]].mean(axis=1),columns=[abr], index= df.index.copy())
    parc = pd.concat([parc,means], axis = 1)

parc.to_csv(dpath + 'dkt_parcellation_' + measure + '.csv') # parcellated data


### 4. Covariate Regression

This (optional) step regresses out the influence of other variables on the cortical thickness values, resulting in residual values that can be used in following steps.

Inputs are:
- the parcellated data (N x 62)
- a glimfile including other variables (e.g., age for this example)
- the mean thickness values computed above

Outputs are:
- a dataframe with the residuals of the dimensions N x 62
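
To make "regressing out" concrete, here is a minimal sketch for a single region on synthetic data. It uses `numpy.linalg.lstsq` as a stand-in for the statsmodels OLS fit used in this notebook; the covariate values and coefficients are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50

# Hypothetical covariates and one regional thickness value per participant
age = rng.uniform(40, 70, size=n)
mean_thick = rng.normal(2.5, 0.1, size=n)
region = 3.0 - 0.01 * age + 0.8 * mean_thick + rng.normal(0, 0.05, size=n)

# Design matrix with an intercept column, as an OLS formula would add
X = np.column_stack([np.ones(n), age, mean_thick])

# Least-squares fit, then residuals = observed - fitted
beta, *_ = np.linalg.lstsq(X, region, rcond=None)
toy_resid = region - X @ beta

# OLS residuals are orthogonal to every column of the design matrix
print(np.allclose(X.T @ toy_resid, 0, atol=1e-6))
```

The orthogonality check is what makes the residuals useful downstream: they carry no linear association with age or mean thickness, so correlations between regions are not driven by those covariates.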

___

For high performance computing see:
- regression.py
- regression.sh

In [None]:
# Define measures and paths for Step 4.
gfile = 'ukbb_controls.csv'
glim = pd.read_csv(os.path.join(dpath, gfile), index_col=[idvar]) # read glimfile 
glim = glim.join(mean_measure) # merge glim & mean thickness
glim_parc = glim.join(parc) # join glim (with mean thickness) & parcellated data
covar1 = glim_parc.age_assess_t2
covar2 = glim_parc.mean_thickness

In [None]:
# Regression
resid=[]
for i in parc:
    reg = sm.ols('parc.loc[:,i] ~ covar1 + covar2', data=glim_parc).fit()
    residuals = reg.resid
    resid.append(residuals)

res = pd.DataFrame(resid).T
res.columns=parc.columns
res.to_csv('residuals.csv')

### 5. Structural Covariance 

This step computes Pearson correlations between every pair of DKT regions across all participants, resulting in a structural covariance matrix of the full sample ([Alexander-Bloch et al., 2013a](https://doi.org/10.1038/nrn3465); [Alexander-Bloch et al., 2013b](https://doi.org/10.1523/JNEUROSCI.3554-12.2013); [Evans, 2013](https://doi.org/10.1016/j.neuroimage.2013.05.054)).

Inputs are:
- the residual data of the DKT regions

Outputs are:
- a structural covariance matrix of the full sample with the dimensions 62 x 62
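
A small sketch of what the full-sample matrix looks like, using random numbers in place of the residuals (the column names `r1` to `r4` are hypothetical):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# Hypothetical residuals: 30 participants x 4 regions
toy_res = pd.DataFrame(rng.normal(size=(30, 4)), columns=["r1", "r2", "r3", "r4"])

# Correlate columns (regions) with each other across rows (participants)
toy_corr = toy_res.corr(method="pearson")

# The result is regions x regions, symmetric, with ones on the diagonal
print(toy_corr.shape)
print(np.allclose(toy_corr.to_numpy(), toy_corr.to_numpy().T))
print(np.allclose(np.diag(toy_corr.to_numpy()), 1.0))
```

Note that the matrix size depends only on the number of regions, not on the number of participants, which is why the real output is 62 x 62 regardless of N.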

___

For high performance computing see:
- strucov.py
- strucov.sh

In [None]:
# Full sample correlation matrix
corrmtrix = res.corr(method="pearson")

### 6. Jackknife Bias Estimation

This step calculates the contribution of each participant to the structural covariance matrix of the full sample ([Ajnakina et al., 2021](https://doi.org/10.1093/schbul/sbab035); [Das et al., 2018](https://doi.org/10.1001/jamapsychiatry.2018.0391)). The structural covariance matrix is recalculated N times, each time leaving out one participant, so that every iteration is based on N-1 participants. Each leave-one-out matrix, weighted by N-1, is then subtracted from the full-sample matrix, weighted by N; this weighted difference estimates the contribution of the left-out participant to the full-sample structural covariance. Absolute values are taken, and the end result is a 3D matrix with the dimensions 62 x 62 x N.

Inputs are:
- the residual data of the DKT regions
- the structural covariance matrix of the full sample

Outputs are:
- an absolute value 3D matrix of the dimensions 62 x 62 x N (stored by the Python code as an array of shape N x 62 x 62)
    - each of the N matrices represents the absolute contribution of each participant to the full-sample structural covariance matrix
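
The leave-one-out procedure can be sketched on a toy sample. The data below are random numbers standing in for the residuals, and the `toy_*` names are hypothetical; the weighted difference `n * full - (n - 1) * leave_one_out` follows the calculation used in this step.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Hypothetical residuals: 20 participants x 3 regions
toy_res = pd.DataFrame(rng.normal(size=(20, 3)), columns=["r1", "r2", "r3"])

n = len(toy_res)
full_corr = toy_res.corr(method="pearson")

contrib = []
for pid in toy_res.index:
    # Correlation matrix without participant `pid` (n - 1 rows)
    loo_corr = toy_res.drop(index=pid).corr(method="pearson")
    # Weighted difference between full-sample and leave-one-out matrices
    w = n * full_corr - (n - 1) * loo_corr
    contrib.append(np.abs(w.to_numpy()))

# Stack into one array: participants x regions x regions
toy_jk = np.stack(contrib)
print(toy_jk.shape)  # (20, 3, 3)
```

Every diagonal entry equals `n * 1 - (n - 1) * 1 = 1`, because a region correlates perfectly with itself in both matrices. This is why the diagonals are set to 0 before graph measures are computed.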

*NOTE.* For analyses with multiple groups (e.g., patient & control), jackknife should be performed on each group separately.
___

For high performance computing see:
- jackknife.py
- jackknife.sh

In [None]:
# Jackknife correlation
jack = []
n = len(res)  # full sample size (rows of the residual dataframe)
for i in res.index:  # i = eid, so the loop visits each participant once
    LOO = res.drop(i)  # leave participant i out: (N-1) x 62
    corrLOO = LOO.corr(method="pearson")
    # Weighted difference between full-sample and leave-one-out matrices
    W = (n * corrmtrix) - ((n - 1) * corrLOO)

    # Absolute values
    normW = abs(W)
    jack.append(normW)

jk = np.array(jack)  # shape: N x 62 x 62 (participants first)
np.save('jackknife_output.npy', jk)

### 7. Graph Theory 

This step calculates two graph measures, local strength and global efficiency, using the [Brain Connectivity Toolbox](http://www.brain-connectivity-toolbox.net/) in Python ([bctpy](https://pypi.org/project/bctpy/); [Rubinov & Sporns, 2010](https://doi.org/10.1016/j.neuroimage.2009.10.003)).

Inputs are:
- the 3D jackknife connectivity matrix 
- the glimfile

Outputs are:
- local strengths
- global efficiency 
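
For intuition about the two measures, here is an independent numpy sketch on a 3-node toy graph. It follows the common definitions (strength as the sum of a node's edge weights; weighted global efficiency as the mean inverse shortest-path length, with edge length taken as 1 / weight). It is not bctpy's implementation, the function names are hypothetical, and edge-case handling may differ from `bct.efficiency_wei`.

```python
import numpy as np

def node_strength(w):
    """Sum of edge weights attached to each node of an undirected graph."""
    return w.sum(axis=1)

def global_efficiency_weighted(w):
    """Mean inverse shortest-path length, with edge length = 1 / weight."""
    n = w.shape[0]
    with np.errstate(divide="ignore"):
        dist = np.where(w > 0, 1.0 / w, np.inf)  # absent edges are unreachable
    np.fill_diagonal(dist, 0.0)
    # Floyd-Warshall: allow paths through each intermediate node k in turn
    for k in range(n):
        dist = np.minimum(dist, dist[:, [k]] + dist[[k], :])
    with np.errstate(divide="ignore"):
        inv = 1.0 / dist  # 1/inf = 0 for disconnected pairs
    np.fill_diagonal(inv, 0.0)  # ignore self-paths
    return inv.sum() / (n * n - n)

# Hypothetical 3-node graph with made-up weights and a zero diagonal
toy_w = np.array([[0.0, 0.5, 0.25],
                  [0.5, 0.0, 1.0],
                  [0.25, 1.0, 0.0]])

print(node_strength(toy_w))
print(global_efficiency_weighted(toy_w))
```

In this toy graph the direct edge between nodes 0 and 2 has length 4, but the route through node 1 has length 2 + 1 = 3, so the shorter indirect path is the one that counts toward efficiency.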

___

For high performance computing see:
- graph.py
- graph.sh

In [None]:
# Set the diagonal of each participant's matrix to 0 (no self-connections)
for i in jk:
    np.fill_diagonal(i, 0)

# Strengths
def strengths_und(jk):
    return np.sum(jk, axis=1) 

strengths = strengths_und(jk)
s = pd.DataFrame(data = strengths, index = glim.index.copy())
s.columns=["Strength_"+str(i) for i in range(1, s.shape[1] + 1)]
data_conn = glim.join(s) 

# Global Efficiency
globeff = []
for i in jk:
    # One global efficiency value per participant matrix
    globeff.append(bct.efficiency_wei(i))
    
e = pd.DataFrame(data = globeff, index = glim.index.copy())
e.columns=["Global Efficiency"]
data_conn = data_conn.join(e)  

# Save dataframe
data_conn.to_csv('ukbb_data_crystal_jackknife_graph_theory.csv')