# Data exploration for MSc5_research_project 

This Jupyter notebook explores the data for my research project in the MSc05 course at the Neurocognitive Psychology lab of Goethe University Frankfurt, which is part of the psychology master's degree program.

The main idea of the project is to use machine learning to classify a person as either a healthy control or a patient with a psychotic disorder based on different MRI metrics, and to see which metric yields the highest classification accuracy.





## 1. Data access

The data I am going to use for the project is available on [figshare](https://figshare.com/articles/dataset/Data_for_Functional_MRI_connectivity_accurately_distinguishes_cases_with_psychotic_disorders_from_healthy_controls_based_on_cortical_features_associated_with_brain_network_development_/12361550) and originates from the paper "Functional MRI connectivity accurately distinguishes cases with psychotic disorders from healthy controls, based on cortical features associated with brain network development" by [Young et al. (2020)](https://doi.org/10.1101/19009894). The GitHub repository for the study can be accessed [here](https://github.com/jmyoung36/fMRI_connectivity_accurately_distinguishes_cases).




## 2. What does the data contain? General overview

The data are already pre-processed and contain different metrics for three sites: Dublin, Maastricht and Cobre.

For the **macro-structural** data, ***cortical thickness (CT)*** was estimated for 308 cortical regions according to a derived version of the Desikan-Killiany atlas [(Desikan et al., 2006)](https://www.sciencedirect.com/science/article/abs/pii/S1053811906000437?via%3Dihub). The files for the derived and adjusted atlas can be found in this [GitHub repository](https://github.com/RafaelRomeroGarcia/subParcellation), the respective paper [here](https://doi.org/10.1016/j.neuroimage.2011.10.086).

The **micro-structural** data contains diffusion weighted images (DWI) from which regional cortical measures such as ***mean diffusivity (MD)*** and ***fractional anisotropy (FA)*** were estimated. 

There are further metrics such as ***functional magnetic resonance imaging (fMRI) data, fMRI connectivity and network data***, ***structural connectivity*** and ***DWI tractography***. For the purposes of this project, only the **macro- and micro-structural** data will be used. In addition, only the Dublin data are explored, since they provide the best image quality (see Table 1 in [Young et al., 2020](https://doi.org/10.1101/19009894)) and not every modality was measured for the Cobre dataset.

Demographic data is also provided.

## 3. Demographic data

First of all, I am going to explore the demographic data to get a better understanding of the sample. In the Dublin dataset, the participants with **DWI** data are a different subset from those with **CT** data. Both subsets are loaded and compared with regard to basic demographic variables.

In [3]:
#import module to read the CT and DWI data

import pandas as pd

#store CT data in variable "CT_Dublin"

CT_Dublin = pd.read_csv('/Users/mello/Desktop/Dataset/PARC_500.aparc_thickness_Dublin.csv', delimiter = ',')

In [4]:
#get columns of pandas dataframe

CT_Dublin.columns

Index(['Subject ID', 'Age', 'Sex', 'Group', 'lh_bankssts_part1_thickness',
       'lh_bankssts_part2_thickness',
       'lh_caudalanteriorcingulate_part1_thickness',
       'lh_caudalmiddlefrontal_part1_thickness',
       'lh_caudalmiddlefrontal_part2_thickness',
       'lh_caudalmiddlefrontal_part3_thickness',
       ...
       'rh_supramarginal_part5_thickness', 'rh_supramarginal_part6_thickness',
       'rh_supramarginal_part7_thickness', 'rh_frontalpole_part1_thickness',
       'rh_temporalpole_part1_thickness',
       'rh_transversetemporal_part1_thickness', 'rh_insula_part1_thickness',
       'rh_insula_part2_thickness', 'rh_insula_part3_thickness',
       'rh_insula_part4_thickness'],
      dtype='object', length=312)

As we can see, the 312 columns consist of four demographic columns (`Subject ID`, `Age`, `Sex`, `Group`) and 308 cortical thickness columns, one per brain region. For now, only the demographic columns are of interest.

In [5]:
#select a subset of the CT data

demographic_CT = CT_Dublin[["Subject ID", "Age", "Sex", "Group"]]
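
The split between demographic and regional columns can also be checked programmatically. The following is a minimal sketch on a made-up toy table (the subject IDs, values and the reduced number of regions are illustrative assumptions, not the real data); it only relies on the `_thickness` suffix visible in the column names above.

```python
import pandas as pd

# Toy stand-in for the CT table: 4 demographic columns plus 3 made-up region columns
toy_ct = pd.DataFrame({
    "Subject ID": ["S1", "S2"],
    "Age": [21, 28],
    "Sex": [2, 1],
    "Group": [1, 2],
    "lh_bankssts_part1_thickness": [2.4, 2.5],
    "lh_bankssts_part2_thickness": [2.6, 2.3],
    "rh_insula_part1_thickness": [3.0, 2.9],
})

demographic_cols = ["Subject ID", "Age", "Sex", "Group"]

# Region columns are the ones ending in "_thickness"
region_cols = [c for c in toy_ct.columns if c.endswith("_thickness")]

# Every column should be either demographic or regional
assert len(demographic_cols) + len(region_cols) == toy_ct.shape[1]
print(len(region_cols))
```

Applied to `CT_Dublin`, the same check would be expected to count 308 region columns, since 312 - 4 = 308.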

In [6]:
#view demographic data of CT data

demographic_CT

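| | Subject ID | Age | Sex | Group |
|---|---|---|---|---|
| 0 | CON9225 | 21 | 2 | 1 |
| 1 | CON9229 | 28 | 2 | 1 |
| 2 | CON9231 | 29 | 2 | 1 |
| 3 | GASP3037 | 61 | 1 | 2 |
| 4 | GASP3040 | 47 | 1 | 2 |
| ... | ... | ... | ... | ... |
| 103 | RPG9019 | 31 | 1 | 2 |
| 104 | RPG9102 | 42 | 2 | 2 |
| 105 | RPG9119 | 41 | 1 | 2 |
| 106 | RPG9121 | 51 | 1 | 2 |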


As can be seen, the **CT** data comprise a total of N=108 participants.

The demographic information for the **DWI** data is stored in a separate Excel file, which is loaded next.

In [54]:
#read an Excel file as pandas data frame

demographic_DWI = pd.read_excel(r'/Users/mello/data/MSc5_research_project/data/DTI_demographics_Dublin.xls')
demographic_DWI

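| | Subject ID | Age | Sex | Group |
|---|---|---|---|---|
| 0 | con11 | 34 | 1 | 1 |
| 1 | con12 | 26 | 2 | 1 |
| 2 | con18 | 27 | 2 | 1 |
| 3 | con19 | 36 | 1 | 1 |
| 4 | con20 | 23 | 1 | 1 |
| ... | ... | ... | ... | ... |
| 118 | pat92 | 26 | 1 | 2 |
| 119 | pat94 | 27 | 1 | 2 |
| 120 | pat96 | 23 | 2 | 2 |
| 121 | pat98 | 32 | 2 | 2 |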


For the **DWI** data there is a total of N=123 participants.

<font color='red'>**NOTE:**</font> **In my case, the Excel file had to be in "xls" format to be read into a pandas data frame. The formats "xlsm" and "xlsx" did not work for me. If you get an import error saying `Missing optional dependency 'xlrd'`, run the code below in bash. For further help, click [here](https://datatofish.com/read_excel/).**

In [8]:
#use ! to run command in bash
!pip install xlrd
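
One possible explanation for the format problem is that pandas reads "xls" files with the `xlrd` package, whereas "xlsx" and "xlsm" files are read with `openpyxl` (recent `xlrd` versions only support "xls"). The sketch below picks the engine name from the file extension; the helper `excel_engine` is a hypothetical name introduced here for illustration and is not part of pandas.

```python
from pathlib import Path

# Engine pandas uses for each Excel format
EXCEL_ENGINES = {".xls": "xlrd", ".xlsx": "openpyxl", ".xlsm": "openpyxl"}


def excel_engine(path):
    """Return the pandas engine name that matches the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in EXCEL_ENGINES:
        raise ValueError(f"Unsupported Excel extension: {suffix!r}")
    return EXCEL_ENGINES[suffix]


print(excel_engine("DTI_demographics_Dublin.xls"))

# Usage with pandas (the matching package has to be installed):
# demographic_DWI = pd.read_excel(path, engine=excel_engine(path))
```

If an "xlsx" file is needed, installing `openpyxl` with `!pip install openpyxl` may therefore be enough.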



Now, having demographic data subsets of both **CT** and **DWI** data, they can be explored. The documentation on [figshare](https://figshare.com/articles/dataset/Data_for_Functional_MRI_connectivity_accurately_distinguishes_cases_with_psychotic_disorders_from_healthy_controls_based_on_cortical_features_associated_with_brain_network_development_/12361550) provides label information for Sex (1=male, 2=female) and Group (1=control, 2=case).
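
To make the counts easier to read, the numeric codes can be replaced by their labels. This is a minimal sketch using the coding given above; the three toy rows are made up and only stand in for `demographic_CT` or `demographic_DWI`.

```python
import pandas as pd

# Codes as documented on figshare
SEX_LABELS = {1: "male", 2: "female"}
GROUP_LABELS = {1: "control", 2: "case"}

# Toy rows, not real participants
toy = pd.DataFrame({"Sex": [2, 1, 1], "Group": [1, 2, 2]})

# Replace the numeric codes by readable labels in a new data frame
labelled = toy.assign(
    Sex=toy["Sex"].map(SEX_LABELS),
    Group=toy["Group"].map(GROUP_LABELS),
)
print(labelled["Group"].value_counts())
```

Because `assign` returns a new data frame, the original numeric codes stay available for later modelling.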

In [28]:
#n of control and patients
print(demographic_CT['Group'].value_counts()) 

1    80
2    28
Name: Group, dtype: int64


In [29]:
#n of males and females
print(demographic_CT['Sex'].value_counts()) 

1    60
2    48
Name: Sex, dtype: int64


In [66]:
#modules for visualization
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import seaborn as sns

In [38]:
#get age information
demographic_CT['Age'].describe()

count    108.000000
mean      31.231481
std       10.911373
min       18.000000
25%       22.000000
50%       29.000000
75%       38.250000
max       64.000000
Name: Age, dtype: float64

To sum up, the **CT** data of the Dublin sample comprise n=80 controls and n=28 patients, a total of N=108 participants, of whom n=60 are male and n=48 female. The mean age is M=31.23 years with a standard deviation of SD=10.91 years. The age ranges from min=18 years to max=64 years.

In [56]:
#n of control and patients
print(demographic_DWI['Group'].value_counts()) 

2    64
1    59
Name: Group, dtype: int64


In [52]:
#n of males and females
print(demographic_DWI['Sex'].value_counts()) 

1    66
2    57
Name: Sex, dtype: int64


In [53]:
#get age information
demographic_DWI['Age'].describe()

count    123.000000
mean      28.455285
std        8.518304
min       17.000000
25%       22.000000
50%       27.000000
75%       32.000000
max       50.000000
Name: Age, dtype: float64

The **DWI** data subset of the Dublin sample consists of a total of N=123 participants, with n=59 controls and n=64 patients. There are n=66 males and n=57 females. The mean age is M=28.46 years with a standard deviation of SD=8.52 years. The age ranges from min=17 years to max=50 years.
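
The two subsets can also be compared side by side in a single table. The following is a sketch under the assumption that both data frames have the columns `Age`, `Sex` and `Group` with the coding described above; the helper `summarise` and the toy values are hypothetical.

```python
import pandas as pd


def summarise(demo):
    """Collect sample size, group and sex counts, and age statistics."""
    return pd.Series({
        "N": len(demo),
        "controls": int((demo["Group"] == 1).sum()),
        "cases": int((demo["Group"] == 2).sum()),
        "males": int((demo["Sex"] == 1).sum()),
        "females": int((demo["Sex"] == 2).sum()),
        "age_mean": round(demo["Age"].mean(), 2),
        "age_sd": round(demo["Age"].std(), 2),
    })


# Toy stand-ins for demographic_CT and demographic_DWI (made-up values)
toy_ct = pd.DataFrame({"Age": [21, 28, 61], "Sex": [2, 2, 1], "Group": [1, 1, 2]})
toy_dwi = pd.DataFrame({"Age": [34, 26, 27, 23], "Sex": [1, 2, 2, 1], "Group": [1, 1, 2, 2]})

# One column per subset, one row per summary statistic
comparison = pd.DataFrame({"CT": summarise(toy_ct), "DWI": summarise(toy_dwi)})
print(comparison)
```

With the real data, `summarise(demographic_CT)` and `summarise(demographic_DWI)` would replace the toy frames.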

### 3. Exploring the different modalities

### 3.3 sMRI: cortical thickness

For the macro-structural data, T1-weighted images were used and the cortical surface was parcellated into 308 regions according to a template derived from the Desikan-Killiany atlas. Cortical thickness was estimated for each of those regions. Before taking a closer look at these data, the atlas is loaded first.
The atlas can be accessed via the following GitHub repository: https://github.com/RafaelRomeroGarcia/subParcellation. The files used for visualization are located in the folder "500mm parcellation (308 regions)".

The atlas files are loaded with the nibabel module and visualized with the nilearn module.

In [None]:
import nibabel as nb

In [None]:
f_one = nb.load('/Users/mello/data/MSc5_research_project/data/atlas/500.aparc_cortical_consecutive.nii')
f_two = nb.load('/Users/mello/data/MSc5_research_project/data/atlas/500.aparc.nii')

In [None]:
type(f_one)

In [None]:
#get_fdata() replaces get_data(), which was removed in recent nibabel versions
test = f_one.get_fdata()

In [None]:
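#shape of the atlas volume (x, y, z)
test.shape

#list all region labels that occur in the atlas
import numpy as np
np.unique(test)

#voxel indices that belong to region 4
np.where(test == 4)

#For isolating regions (copy first so that "test" is not overwritten)
#region_1 = test.copy()
#region_1[region_1 != 1] = 0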
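
Isolating a single region, as hinted at in the commented lines above, can be sketched on a small toy volume. The array below is made up and only stands in for the atlas array; the region labels 1 and 4 are illustrative.

```python
import numpy as np

# Toy label volume standing in for the atlas array (0 = background)
labels = np.zeros((4, 4, 4), dtype=int)
labels[0:2, 0:2, 0:2] = 1   # made-up region 1
labels[2:4, 2:4, 2:4] = 4   # made-up region 4

# Keep only region 4 and set everything else to 0, without changing "labels"
region_4 = np.where(labels == 4, labels, 0)

print(np.unique(region_4))        # labels left after isolating the region
print(int((labels == 4).sum()))   # number of voxels in region 4
```

To plot such an isolated region, the array could presumably be wrapped into a new image with `nb.Nifti1Image(region_4, f_one.affine)` and passed to `plotting.plot_roi`.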

In [None]:
from nilearn import plotting

In [None]:
plotting.plot_roi(f_one, title="Desikan-Killiany atlas")

In [None]:
plotting.plot_roi(f_two, title="Desikan-Killiany atlas")

In [None]:
plotting.plot_img_on_surf(f_one,
                          views=['lateral', 'medial'],
                          hemispheres=['left', 'right'],
                          colorbar=True)
plotting.show()

In [None]:
plotting.plot_img_on_surf(f_two,
                          views=['lateral', 'medial'],
                          hemispheres=['left', 'right'],
                          colorbar=True)
plotting.show()

In [None]:
plotting.plot_glass_brain(f_one, display_mode='r', plot_abs=False,
                          title='Glass brain', threshold=2.)

plotting.plot_stat_map(f_one, display_mode='x', threshold=1.,
                       cut_coords=range(0, 51, 10), title='Slices')

### 3.1 DTI Networks

In this section, the DTI network data are explored first, followed by the regional MD and FA values and lastly the CT data.

The DTI networks are available as MATLAB files. The following cells show how these files can be downloaded and read.

####  3.1.1 Download the .mat files

If you click on the name of a MATLAB data file on figshare, only a preview of the file is shown, and the link in the address bar points to that preview. To get the link to the file itself, right-click on the data file name and copy the link.
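
The difference between the preview link and the file link can be sketched in code. The download pattern `https://figshare.com/ndownloader/files/<file ID>` is inferred from the download URL used in this notebook and is an assumption, not documented figshare behaviour; both helper functions and the example preview URL are hypothetical.

```python
from urllib.parse import parse_qs, urlparse


def file_id_from_preview(preview_url):
    """Read the file ID from the "?file=" part of a preview link."""
    query = parse_qs(urlparse(preview_url).query)
    return int(query["file"][0])


def figshare_download_url(file_id):
    """Build the direct download link for a file ID (assumed URL pattern)."""
    return f"https://figshare.com/ndownloader/files/{file_id}"


# Shortened, made-up preview link with the same structure as on figshare
preview = "https://figshare.com/articles/dataset/example/12361550?file=22782440"
print(figshare_download_url(file_id_from_preview(preview)))
```

The resulting link can then be passed to `urllib.request.urlretrieve` together with a local target path.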

In [None]:
import urllib.request

In [None]:
print('Beginning file download with urllib...')

url = "https://figshare.com/ndownloader/files/22782440"
urllib.request.urlretrieve(url, '/Users/mello/data/Msc5_research_project/data/DTI_Dublin.mat')

In [None]:
print('Beginning file download with urllib...')

#direct download link (not the preview link) for file 22782443
url = "https://figshare.com/ndownloader/files/22782443"
urllib.request.urlretrieve(url, '/Users/mello/data/Msc5_research_project/data/DTI_Maastricht.mat')

#### 3.1.2 Read the .mat files

In [None]:
import scipy.io

**Dublin**

In [None]:
DTI_Dublin = scipy.io.loadmat('/Users/mello/data/Msc5_research_project/data/DTI_Dublin.mat')

In [None]:
DTI_Dublin.keys()

In [None]:
DTI_Dublin['nostreamlines_new'].shape

123 participants with DTI matrices.
What do the DTI matrices look like?

In [None]:
DTI_Dublin['nostreamlines_new'][0].shape

In [None]:
DTI_Dublin['nostreamlines_new'][0][0].shape

The matrix contains values between the 308 regions based on the DTI data (see course website, read through again!!)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10,7))
sns.heatmap(DTI_Dublin['nostreamlines_new'][0][0], xticklabels=False, cmap='rocket')
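
Besides the heatmap, a few basic properties of such a region-by-region matrix can be checked numerically. This is a sketch on a made-up 5 x 5 matrix; whether the real 308 x 308 matrices are symmetric with an empty diagonal is an assumption that would have to be verified on `DTI_Dublin['nostreamlines_new'][0][0]`.

```python
import numpy as np

rng = np.random.default_rng(0)
n_regions = 5  # the real matrices cover 308 regions

# Toy symmetric matrix with an empty diagonal (made-up values)
upper = np.triu(rng.integers(0, 10, size=(n_regions, n_regions)), k=1)
toy_matrix = upper + upper.T

print(toy_matrix.shape)                           # regions x regions
print(np.array_equal(toy_matrix, toy_matrix.T))   # is the matrix symmetric?
print(np.count_nonzero(np.diag(toy_matrix)))      # number of self-connections
```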

**Maastricht**

In [None]:
DTI_Maastricht = scipy.io.loadmat('/Users/mello/Desktop/Dataset/DTI_Maastricht.mat')

In [None]:
DTI_Maastricht.keys()

### 3.2 Regional mean diffusivity (MD) and fractional anisotropy (FA) values

In [None]:
import pandas as pd

In [None]:
MD_Dublin = pd.read_csv('/Users/mello/Desktop/Dataset/PARC_500.aparc_MD_cortexAv_mean_Dublin.csv', delimiter = ',')
MD_Maastricht = pd.read_csv('/Users/mello/Desktop/Dataset/PARC_500.aparc_MD_cortexAv_mean_Maastricht.csv', delimiter = ',')

In [None]:
FA_Dublin = pd.read_csv('/Users/mello/Desktop/Dataset/PARC_500.aparc_FA_cortexAv_mean_Dublin.csv', delimiter = ',')
FA_Maastricht = pd.read_csv('/Users/mello/Desktop/Dataset/PARC_500.aparc_FA_cortexAv_mean_Maastricht.csv', delimiter = ',')

In [None]:
MD_Dublin

In [None]:
FA_Dublin

In [None]:
CT_Dublin = pd.read_csv('/Users/mello/Desktop/Dataset/PARC_500.aparc_thickness_Dublin.csv', delimiter = ',')
CT_Maastricht = pd.read_csv('/Users/mello/Desktop/Dataset/PARC_500.aparc_thickness_Maast.csv', delimiter = ',')
CT_Cobre = pd.read_csv('/Users/mello/Desktop/Dataset/PARC_500.aparc_thickness_Cobre.csv', delimiter = ',')

In [None]:
CT_Dublin

In [None]:
CT_Maastricht