# Structural MRI Analysis using Jupyter Notebooks and Python3 on brainlife.io

This example notebook guides the user through grabbing data for their project from the secondary warehouse, compiling data across the entire project, and analyzing and visualizing the compiled results on brainlife.io via the 'Analysis' tab. The example is written for Python 3 and uses one of the Python3 notebook types.

Within this notebook, I will guide the user through analyses of derivatives generated from one of the three main datatypes available on brainlife.io, neuro/anat/t1w. Specifically, I will guide the user through:
        
    1. loading sample data for parcellation statistics
    2. doing some simple data manipulations
    3. generating simple visualizations
    
All of the functions used here are provided in a self-contained Python package called "pybrainlife" [https://pypi.org/project/pybrainlife/].

The first thing we'll do is load our Python modules, specifically pandas and pybrainlife!

### Import pandas and pybrainlife modules

In [5]:
import os,sys
import pandas as pd
import pybrainlife as pbl
from pybrainlife.data.collect import collect_data
from pybrainlife.data.collect import collect_subject_data
import pybrainlife.data.manipulate as pybldm
import pybrainlife.vis.plots as pyblvp
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import seaborn as sns
import numpy as np
import random
from scipy.stats import ttest_ind

Next, we'll define some useful functions. Specifically, we will define a few functions to perform bootstrapping analyses on the data and to plot histograms of the results.

### Define some useful functions for this notebook

In [6]:
# performs a bootstrapping analysis comparing data from two different groups
def bootstrap_analysis_groups(df,group_1,group_2,measures,iterations=1000,sample_size=10,compare_measure='corr'):

    compare = 'corr'
    if compare_measure == 'ttest':
        compare = 'p-value'
        
    correlation = {}
    for meas in measures:
        correlation[meas] = []
        for i in range(0,iterations):
            group_1_df = df.loc[df['classID'] == group_1].sample(sample_size).reset_index(drop=True)
            group_2_df = df.loc[df['classID'] == group_2].sample(sample_size).reset_index(drop=True)

            if compare_measure == 'ttest':
                corr = ttest_ind(group_1_df[meas], group_2_df[meas],equal_var=False)[1]
            else:
                corr = np.corrcoef(group_1_df[meas].values.tolist(),group_2_df[meas].values.tolist())[0][1]
            correlation[meas].append(corr)

    corrs = pd.DataFrame()

    for meas in measures:
        corrs[meas+'_'+compare] = correlation[meas]
    
    return corrs

# performs a bootstrapping analysis within two individual groups comparing between different measures
def bootstrap_analysis_within_groups(df,group_1,group_2,measures,iterations=1000,sample_size=10,compare_measure='corr'):

    compare = 'corr'
    if compare_measure == 'ttest':
        compare = 'p-value'

    group_1_corrs = {}
    group_2_corrs = {}
    for meas in range(len(measures)):
        for meas_2 in range(len(measures)):
            if measures[meas] != measures[meas_2]:
                measures_name = measures[meas]+'_'+measures[meas_2]
                inv_measures_name = measures[meas_2]+'_'+measures[meas]
                if measures_name not in list(group_1_corrs.keys()):
                    if inv_measures_name not in list(group_1_corrs.keys()):
                        group_1_corrs[measures_name] = []
                        group_2_corrs[measures_name] = []
                        for i in range(0,iterations):
                            group_1_df = df.loc[df['classID'] == group_1].sample(sample_size).reset_index(drop=True)
                            group_2_df = df.loc[df['classID'] == group_2].sample(sample_size).reset_index(drop=True)

                            if compare_measure == 'ttest':
                                corr_group_1 = ttest_ind(group_1_df[measures[meas]],group_1_df[measures[meas_2]],equal_var=False)[1]
                                corr_group_2 = ttest_ind(group_2_df[measures[meas]],group_2_df[measures[meas_2]],equal_var=False)[1]
                            else:
                                corr_group_1 = np.corrcoef(group_1_df[measures[meas]].values.tolist(),group_1_df[measures[meas_2]].values.tolist())[0][1]
                                corr_group_2 = np.corrcoef(group_2_df[measures[meas]].values.tolist(),group_2_df[measures[meas_2]].values.tolist())[0][1]
                            group_1_corrs[measures_name].append(corr_group_1)
                            group_2_corrs[measures_name].append(corr_group_2)

    corrs = pd.DataFrame()
    for meas in list(group_1_corrs.keys()):
        corrs[meas+'_'+compare] = group_1_corrs[meas] + group_2_corrs[meas]
        corrs['classID'] = [ group_1 for f in range(len(group_1_corrs[meas])) ] + [ group_2 for f in range(len(group_2_corrs[meas])) ]

    return corrs

# plots overall data
def plot_histogram(df,plot_measure,compare_measure,ax=''):
    
    if ax == '':
        sns.histplot(x=plot_measure,data=df,alpha=0.5)
        ax = plt.gca()
    else:
        sns.histplot(x=plot_measure,data=df,alpha=0.5,ax=ax)

    # use the most recently drawn set of bars; a single histplot on a fresh axis creates only one container
    bar_heights = ax.containers[-1].datavalues
    ax.vlines(x=df[plot_measure].mean(),ymin=0,ymax=bar_heights.max(),linewidth=2,color='r')
    ax.text(x=df[plot_measure].max() * .4,y=bar_heights.max() *.75,s='average '+compare_measure+': %s' %(str(df[plot_measure].mean())))
    
# plots individual group data
def plot_histogram_groups(df,plot_measure,palette='',ax=''):

    if ax == '':
        if palette != '':
            sns.histplot(x=plot_measure,hue='classID',data=df,palette=palette,alpha=0.25)
        else:
            sns.histplot(x=plot_measure,hue='classID',data=df,alpha=0.25)
        ax = plt.gca()
    else:
        if palette != '':
            sns.histplot(x=plot_measure,hue='classID',data=df,palette=palette,alpha=0.25,ax=ax)
        else:
            sns.histplot(x=plot_measure,hue='classID',data=df,alpha=0.25,ax=ax)

    if palette:
        # NOTE: group_1 and group_2 are read from the notebook's global scope, so define them before calling this function
        # select the measure column before averaging so that non-numeric columns are never passed to mean()
        ax.vlines(x=df.loc[df['classID'] == group_1, plot_measure].mean(),ymin=0,ymax=ax.containers[1].datavalues.max(),color=palette[group_1])
        ax.vlines(x=df.loc[df['classID'] == group_2, plot_measure].mean(),ymin=0,ymax=ax.containers[0].datavalues.max(),color=palette[group_2])
    else:
        ax.vlines(x=df.loc[df['classID'] == group_1, plot_measure].mean(),ymin=0,ymax=ax.containers[1].datavalues.max(),color='r')
        ax.vlines(x=df.loc[df['classID'] == group_2, plot_measure].mean(),ymin=0,ymax=ax.containers[0].datavalues.max(),color='g')
    
#     ax.text(x=df.loc[df['classID'] == group_1][plot_measure].max() * .4,y=ax.containers[0].datavalues.max() *.75,s='average '+group_1+' '+plot_measure.split('_')[-1]+': %s' %(str(df.loc[df['classID'] == group_1][plot_measure].mean())))
#     ax.text(x=df.loc[df['classID'] == group_2][plot_measure].max() * .4,y=ax.containers[1].datavalues.max() *.75,s='average '+group_2+' '+plot_measure.split('_')[-1]+': %s' %(str(df.loc[df['classID'] == group_2][plot_measure].mean())))

### Load sample subjects data

Now that we have imported our modules, we can load our sample datasets!

First, we will load our subjects dataframe using pybrainlife and pandas! We will also build a column containing a color for each group.

In [7]:
### Load the subjects dataframe
## First, let's load the subjects dataframe using the collect_subject_data function in pybrainlife
## collect_subject_data():
## inputs = path where we would like to save the dataframe. if we don't want to save, just leave blank

# load the subjects data
subjects_data = collect_subject_data()

# remove the index column just to keep things clean

# rename the subject column to 'subjectID'

# rename the diagnosis column to 'classID'

# make sure subjectID column is string

### Create a color for each group
## first, let's define a color dictionary for each group
# make a list of all the unique groups in the dataframe

# generate a random color for each group


# create a dictionary mapping a color with each group

# map the groups_colors dictionary to the classID column to create a new column called colors 

### Visualize the dataframe 
## now let's visualize the dataframe to inspect
# print out a random sample of 10 rows
subjects_data.sample(10).head(10)

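| Unnamed: 0 | index | subject | diagnosis | age | gender | bart | bht | dwi | pamenc | pamret | ... | crt_ne2 | crt_time1 | crt_err1 | crt_pr1 | crt_pr2 | crt_err2 | crt_nm1 | crt_time2 | crt_nm2 | crt_index |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 127 | 127 | 11143 | CONTROL | 28 | M | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 0 | 19 | 0 | 0 | 0 | 0 | 1 | 40 | 0 | 1.11 |
| 159 | 159 | 50052 | SCHZ | 45 | M | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 0 | 43 | 0 | 0 | 0 | 0 | 0 | 91 | 0 | 1.12 |
| 65 | 65 | 10680 | CONTROL | 22 | F | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 0 | 20 | 0 | 0 | 0 | 0 | 0 | 39 | 0 | 0.95 |
| 114 | 114 | 11090 | CONTROL | 24 | M | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 0 | 23 | 0 | 0 | 0 | 0 | 1 | 48 | 0 | 1.09 |
| 200 | 200 | 60042 | BIPOLAR | 34 | M | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 0 | 37 | 0 | 0 | 0 | 0 | 0 | 66 | 0 | 0.78 |
| 230 | 230 | 70002 | ADHD | 44 | F | 1.0 |  | 1.0 |  |  | ... | 0 | 32 | 0 | 0 | 0 | 1 | 1 | 65 | 0 | 1.03 |
| 201 | 201 | 60043 | BIPOLAR | 40 | F | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 0 | 41 | 0 | 0 | 2 | 0 | 0 | 119 | 1 | 1.9 |
| 67 | 67 | 10692 | CONTROL | 28 | F | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 0 | 26 | 0 | 0 | 0 | 0 | 0 | 46 | 0 | 0.77 |
| 242 | 242 | 70034 | ADHD | 21 | F | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 0 | 25 | 0 | 0 | 0 | 0 | 0 | 50 | 0 | 1.0 |
| 233 | 233 | 70010 | ADHD | 48 | M | 1.0 | 1.0 | 1.0 |  |  | ... | 1 | 29 | 0 | 0 | 0 | 0 | 1 | 61 | 0 | 1.1 |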
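
The commented steps in the cell above are left blank for the reader. Here is a minimal sketch of one way they could be filled in. It uses a small made-up dataframe in place of the output of collect_subject_data(), so the values are not real data, and the helper name `random_hex_color` is hypothetical. The column names 'Unnamed: 0', 'index', 'subject' and 'diagnosis' are taken from the output shown above.

```python
import random
import pandas as pd

# toy stand-in for the dataframe returned by collect_subject_data() (values are made up)
subjects_data = pd.DataFrame({
    'Unnamed: 0': [0, 1, 2, 3],
    'index': [0, 1, 2, 3],
    'subject': [10001, 10002, 50001, 50002],
    'diagnosis': ['CONTROL', 'CONTROL', 'SCHZ', 'SCHZ'],
    'age': [25, 31, 40, 28],
})

# remove the leftover index columns just to keep things clean
subjects_data = subjects_data.drop(columns=['Unnamed: 0', 'index'])

# rename the subject column to 'subjectID' and the diagnosis column to 'classID'
subjects_data = subjects_data.rename(columns={'subject': 'subjectID', 'diagnosis': 'classID'})

# make sure the subjectID column is string
subjects_data['subjectID'] = subjects_data['subjectID'].astype(str)

# hypothetical helper: build a random hex color string such as '#3fa2c1'
def random_hex_color(rng):
    return '#%06x' % rng.randint(0, 0xFFFFFF)

# make a list of the unique groups and map a random color to each one
rng = random.Random(0)
groups = subjects_data['classID'].unique().tolist()
groups_colors = {group: random_hex_color(rng) for group in groups}

# map the groups_colors dictionary to the classID column to create the colors column
subjects_data['colors'] = subjects_data['classID'].map(groups_colors)
```

The `groups_colors` dictionary built this way has the same shape as the palette passed to the plotting calls later in the notebook: one color per classID value.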


### Cortical analysis
Now that we have our subjects information loaded, let's load some actual data!

For this notebook, we will use the "collect_data" function from pybrainlife to load the parc-stats datatype, which contains all of the morphometric information for the cortical parcellations.

In [11]:
## load parc-stats data
# function: collect_data()
# inputs: datatype = name of the datatype (example: 'parc-stats')
#         datatype_tags = list of datatype tags to search for (can leave blank for this analysis)
#         tags = list of object tags to search for (can leave blank for this analysis)
#         filename = name of the file within the datatype (you can search for this by using the File Viewer on the parc-stats datatype)
#         outPath = filepath where to save the concatenated data (can be left blank if you don't want to save the file)
#         duplicates = True or False; True = keep duplicates, False = remove duplicates

# set the output directory and output filename
data_directory = 'data'
output_filename = 'cortical-statistics.csv'
output_filepath = data_directory+'/'+output_filename

# make the output directory if not already made
if not os.path.isdir(data_directory):
    os.mkdir(data_directory)

# if the output file doesn't exist, comb through the warehouse to find the data for the project and concatenate to single dataframe
# if it does exist, just load the dataframe
if not os.path.isfile(output_filepath):
    # collate the data
    
    # make sure to set subjectID as string
    
else:
    # read the dataframe 
    
    # make sure to set subjectID as string
    
# merge the subjects dataframe with the cortical statistics

# visualize the dataframe
cortex_df.sample(10).head(10)

IndentationError: expected an indented block (4221812274.py, line 26)
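
The cell above fails because the bodies of the `if` and `else` branches are still empty. Here is a minimal sketch of the save-or-load pattern those branches describe, followed by the merge with the subjects dataframe. Because the real collect_data() call needs access to a brainlife.io project, the sketch swaps in a hypothetical stand-in, `fetch_cortical_stats()`, that returns made-up values; `load_cortical_stats` is also a hypothetical helper name, and a temporary directory replaces the 'data' directory.

```python
import os
import tempfile
import pandas as pd

# hypothetical stand-in for the pybrainlife collect_data() call (returns made-up values)
def fetch_cortical_stats():
    return pd.DataFrame({
        'subjectID': [10001, 10001, 50001, 50001],
        'structureID': ['L_V1_ROI', 'R_V1_ROI', 'L_V1_ROI', 'R_V1_ROI'],
        'thickness': [1.9, 2.0, 1.8, 1.7],
    })

# toy subjects dataframe (made-up values)
subjects_data = pd.DataFrame({'subjectID': ['10001', '50001'],
                              'classID': ['CONTROL', 'SCHZ']})

# a temporary directory stands in for the notebook's 'data' directory
data_directory = tempfile.mkdtemp()
output_filepath = os.path.join(data_directory, 'cortical-statistics.csv')

def load_cortical_stats(filepath):
    if not os.path.isfile(filepath):
        # first run: collate the data and save it for next time
        df = fetch_cortical_stats()
        df.to_csv(filepath, index=False)
    else:
        # later runs: just read the saved dataframe
        df = pd.read_csv(filepath)
    # in both branches, make sure subjectID is a string so the merge keys match
    df['subjectID'] = df['subjectID'].astype(str)
    return df

first_load = load_cortical_stats(output_filepath)   # collates and saves
second_load = load_cortical_stats(output_filepath)  # reads the saved file

# merge the subjects dataframe with the cortical statistics
cortex_df = pd.merge(subjects_data, second_load, on='subjectID')
```

Casting subjectID to string in both branches matters because read_csv would otherwise parse the IDs as integers, and the merge with the string IDs in the subjects dataframe would then fail or match nothing.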

For this example, let's just look at cortical data from the hcp-mmp-b atlas (i.e. Glasser atlas).

To do this, we can just use some simple pandas .loc functionality to identify only those data that correspond to the hcp-mmp-b atlas

In [9]:
## subsample data for only parcels in hcp-mmp-b

# visualize the dataframe
cortex_df_glasser.sample(10).head(10)

NameError: name 'cortex_df_glasser' is not defined
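
The NameError above appears because the subsampling step has not been written yet. Here is a minimal sketch of the .loc filter on a made-up dataframe. The column name 'parcellationID' is an assumption about how the atlas is labeled in the parc-stats file, so check the column names of your own dataframe first; the 'aparc' rows are only there as something to filter out.

```python
import pandas as pd

# toy cortical dataframe (made-up values); 'parcellationID' is an assumed column name
cortex_df = pd.DataFrame({
    'subjectID': ['10001'] * 4,
    'parcellationID': ['hcp-mmp-b', 'hcp-mmp-b', 'aparc', 'aparc'],
    'structureID': ['L_V1_ROI', 'R_V1_ROI', 'lh_cuneus', 'rh_cuneus'],
    'thickness': [1.9, 2.0, 2.1, 2.2],
})

# keep only the rows whose atlas label is hcp-mmp-b
is_glasser = cortex_df['parcellationID'] == 'hcp-mmp-b'
cortex_df_glasser = cortex_df.loc[is_glasser].reset_index(drop=True)
```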

### Describe the dataframe

#### compute mean, min-max, and quantiles of each column using pandas function describe()

In [802]:
cortex_df_glasser.describe()

#### compute meta data on the dataframe using pandas function info()

In [803]:
cortex_df_glasser.info()

#### count the number of parcels for each subject. ideal value == 360

In [804]:
cortex_df_glasser.groupby('subjectID').count()
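
Rather than reading the counts by eye, we can also flag the subjects whose parcel count differs from the ideal value. The following is a sketch on made-up data; `expected_parcels` is set to 3 only to keep the toy example small and would be 360 for the full atlas.

```python
import pandas as pd

# 360 for the full hcp-mmp-b atlas; 3 keeps this toy example small
expected_parcels = 3

# toy dataframe (made-up): subject 50001 is missing one parcel
cortex_df_glasser = pd.DataFrame({
    'subjectID': ['10001'] * 3 + ['50001'] * 2,
    'structureID': ['L_V1_ROI', 'R_V1_ROI', 'L_V2_ROI', 'L_V1_ROI', 'R_V1_ROI'],
})

# count the unique parcels for each subject
parcel_counts = cortex_df_glasser.groupby('subjectID')['structureID'].nunique()

# list the subjects that do not have the expected number of parcels
incomplete_subjects = parcel_counts[parcel_counts != expected_parcels].index.tolist()
```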

#### count the number of unique subjects per group

In [805]:
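# numeric_only=True keeps mean() from failing on string columns such as structureID in recent pandas versions
cortex_df_glasser.groupby(['subjectID','classID']).mean(numeric_only=True).reset_index().groupby(['classID']).count()['subjectID']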

#### subsample to only the primary visual cortex (V1)

In [806]:
# grab the v1's from the dataframe
regions = ['L_V1_ROI','R_V1_ROI']


v1.sample(10).head(10)
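
The line that actually builds `v1` is left blank above. One way to write it is with the pandas isin() method, sketched here on a made-up dataframe:

```python
import pandas as pd

# toy dataframe (made-up values)
cortex_df_glasser = pd.DataFrame({
    'subjectID': ['10001'] * 3 + ['50001'] * 3,
    'structureID': ['L_V1_ROI', 'R_V1_ROI', 'L_V2_ROI'] * 2,
    'thickness': [1.9, 2.0, 2.3, 1.8, 1.7, 2.2],
})

# grab the v1's from the dataframe
regions = ['L_V1_ROI', 'R_V1_ROI']
v1 = cortex_df_glasser.loc[cortex_df_glasser['structureID'].isin(regions)]
```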

#### compute mean across hemispheres

In [807]:

v1_mean
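
The step that computes `v1_mean` is also left blank. Here is a sketch of one way to average the left and right V1 for each subject, using made-up values. Grouping on both subjectID and classID keeps the group label in the result.

```python
import pandas as pd

# toy V1 dataframe (made-up values): one row per hemisphere per subject
v1 = pd.DataFrame({
    'subjectID': ['10001', '10001', '50001', '50001'],
    'classID': ['CONTROL', 'CONTROL', 'SCHZ', 'SCHZ'],
    'structureID': ['L_V1_ROI', 'R_V1_ROI', 'L_V1_ROI', 'R_V1_ROI'],
    'gray_matter_volume_mm^3': [1000.0, 1200.0, 900.0, 1100.0],
    'thickness': [1.5, 2.5, 1.0, 2.0],
})

# average the numeric columns across hemispheres within each subject;
# numeric_only=True drops string columns such as structureID
v1_mean = v1.groupby(['subjectID', 'classID']).mean(numeric_only=True).reset_index()
```

In the full notebook, `v1` also carries the behavioral columns from the subjects dataframe. Those values are identical in both hemisphere rows of a subject, so averaging leaves them unchanged.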

### Data visualizations

Now, let's generate some visualizations to examine differences between groups across multiple brain and behavioral measures.

#### v1 morphometrical measures

In [810]:
## compute categorical scatter plot for the left and right V1s for gray matter volume and cortical thickness
# create a subplot figure with 1 row and 2 columns
fig, axes = plt.subplots(1,2,figsize=(10,5),sharex=True)

# create a strip plot for each measure
sns.stripplot(x='structureID',y='gray_matter_volume_mm^3',data=v1,hue='classID',palette=groups_colors,ax=axes[0],size=5,legend=False)
sns.stripplot(x='structureID',y='thickness',data=v1,hue='classID',palette=groups_colors,ax=axes[1],size=5)

#### behavioral measures

In [811]:
## compute categorical scatter plots for the behavioral measures
# create a subplot figure with 1 row and 2 columns
fig, axes = plt.subplots(1,2,figsize=(10,5),sharex=True)

# create a strip plot for each measure


#### group scatter plots

In [812]:
## compute categorical scatter plot for group across behavioral and cortical measures
# create a subplot figure with 1 row and 3 columns
fig, axes = plt.subplots(1,3,figsize=(15,5),sharex=True)

# create a strip plot for each measure


It looks like there's a difference in the behavioral measures between our groups! Let's investigate a bit more!

In [815]:
# generate a histogram for each group for the vr2dr_totalraw measure
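
Before drawing the histogram, it can help to look at the numbers a grouped histogram is built from. The sketch below is not the plotting call itself: it computes the per-group bin counts on shared bin edges with NumPy, using randomly generated scores in place of the real vr2dr_totalraw values.

```python
import numpy as np
import pandas as pd

# made-up scores for two groups of unequal size
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'classID': ['CONTROL'] * 30 + ['SCHZ'] * 20,
    'vr2dr_totalraw': np.concatenate([rng.normal(35, 5, 30), rng.normal(28, 6, 20)]),
})

# compute one set of bin edges from all of the data so the groups are comparable
bin_edges = np.histogram_bin_edges(df['vr2dr_totalraw'].to_numpy(), bins=10)

# count how many subjects from each group fall in each bin
group_counts = {}
for group, group_df in df.groupby('classID'):
    counts, _ = np.histogram(group_df['vr2dr_totalraw'].to_numpy(), bins=bin_edges)
    group_counts[group] = counts

# the group means are where the vertical lines would be drawn
group_means = df.groupby('classID')['vr2dr_totalraw'].mean()
```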


Let's quantify the differences in our measures of interest. Because we have unequal sample sizes, we can use a bootstrapping procedure and compute t-tests from random samples of data from each group. We will use our pre-defined functions from above, specifically 'bootstrap_analysis_groups' to perform the bootstrapping analysis and 'plot_histogram' to plot the results.

In [814]:
# define our input variables
df = v1_mean
group_1 = 'CONTROL'
group_2 = 'SCHZ'
measures = ['gray_matter_volume_mm^3','thickness','vr2dr_totalraw']
compare_measure = 'ttest'
iterations = 1000

# perform our bootstrapping analysis

# visualize the results


As we can see from the data, there appears to be a statistically significant difference between our groups in the vr2dr_totalraw measure!

Next, let's see if we can identify any relationships between measures within each group

In [813]:
# define our input variables
df = v1_mean
group_1 = 'CONTROL'
group_2 = 'SCHZ'
measures = measures  # reuse the measures list defined in the previous cell
compare_measure = 'corr'
iterations = 1000

# perform our bootstrapping analysis

# visualize the results
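
The within-group version follows the same pattern, except that each iteration correlates two measures inside one group. Here is a sketch for one pair of measures, again on randomly generated data, so the correlations are not real results. The output column name mirrors the `measure_measure_corr` naming used by 'bootstrap_analysis_within_groups'.

```python
import numpy as np
import pandas as pd

# made-up data: two measures for two groups
rng = np.random.default_rng(0)
n_control, n_schz = 60, 25
v1_mean = pd.DataFrame({
    'classID': ['CONTROL'] * n_control + ['SCHZ'] * n_schz,
    'thickness': rng.normal(2.0, 0.2, n_control + n_schz),
    'vr2dr_totalraw': rng.normal(30, 6, n_control + n_schz),
})

iterations = 200
sample_size = 10
column_name = 'thickness_vr2dr_totalraw_corr'

rows = []
for group in ['CONTROL', 'SCHZ']:
    group_df = v1_mean.loc[v1_mean['classID'] == group]
    for i in range(iterations):
        # random subsample of this group (seeded so the run is repeatable)
        sample = group_df.sample(sample_size, random_state=i)
        # Pearson correlation between the two measures within the subsample
        r = np.corrcoef(sample['thickness'].to_numpy(),
                        sample['vr2dr_totalraw'].to_numpy())[0, 1]
        rows.append({'classID': group, column_name: r})

corrs = pd.DataFrame(rows)

# average bootstrapped correlation for each group
mean_corr_per_group = corrs.groupby('classID')[column_name].mean()
```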


#### let's look at the actual scatterplots

In [816]:
# create a linear regression plot for the overall data between variables of interest


In [817]:
# create a linear regression plot for each group's data between variables of interest
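
A regression plot draws a fitted straight line through the scatter. As a numerical companion to the plots, the sketch below fits that line with np.polyfit, once for all of the data and once per group. This is a swapped-in technique, not the plotting call. The choice of thickness and vr2dr_totalraw as the variables of interest is an assumption, `fit_line` is a hypothetical helper, and the scores are constructed to lie exactly on a line so the fitted values are easy to check.

```python
import numpy as np
import pandas as pd

# made-up data: scores lie exactly on a different line for each group
v1_mean = pd.DataFrame({
    'classID': ['CONTROL'] * 3 + ['SCHZ'] * 3,
    'thickness': [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
})
v1_mean['vr2dr_totalraw'] = np.where(v1_mean['classID'] == 'CONTROL',
                                     10 * v1_mean['thickness'] + 5,
                                     4 * v1_mean['thickness'] + 2)

# hypothetical helper: least-squares straight line, returned as (slope, intercept)
def fit_line(df, x, y):
    slope, intercept = np.polyfit(df[x].to_numpy(), df[y].to_numpy(), deg=1)
    return slope, intercept

# fit across all subjects
overall_fit = fit_line(v1_mean, 'thickness', 'vr2dr_totalraw')

# fit separately within each group
group_fits = {group: fit_line(group_df, 'thickness', 'vr2dr_totalraw')
              for group, group_df in v1_mean.groupby('classID')}
```

Comparing `overall_fit` with `group_fits` shows why the per-group plot is worth making: a single line through pooled data can hide slopes that differ between groups.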


## You've now completed your first set of analyses on brainlife.io using the jupyter notebooks!