# Judi's Python Project
This project uses publicly available data from the Human Connectome Project (HCP).

Data available here: https://db.humanconnectome.org
Data dictionary available here: https://wiki.humanconnectome.org/display/PublicData/HCP+Data+Dictionary+Public-+500+Subject+Release?src=contextnavpagetreemode

The aim of this analysis is to test whether performance differences in a line orientation task are explained by:  
*) gender 
*) characteristics of hippocampus (volume) 

For this: 
*) extract the appropriate data and save it in a pandas data frame
*) plot the distribution of spatial orientation performance for males and females
*) create a combined score from left and right hippocampal volume for each person
*) normalise hippocampal volume for each person (by subcortical grey matter volume)
*) plot all pairwise combinations of the selected variables
*) fit a linear regression

NECESSARY:

    load source data from a file [X]
    plot at least one histogram of the data, with title and labelled axes [X]
    create at least one plot of analysis results, with title and labelled axes [X]
    use at least one numpy array [X]
    use short but descriptive variable names in your code [X]
    document your code: use markdown in your .ipynb and/or directly comment your python code with # or ''' or """ [X]

MINIMUM OF SIX:

    use an if-elif-else clause
    use a for loop [X]
    use a while loop
    write at least one function, include a docstring
    print out some results in at least one nicely formatted string, using string operator % or .format() method
    use at least one vectorized math operation on an array [X]
    use at least one matrix operation on a 2D array
    create a figure with multiple axes (i.e., use plt.subplots(nrows, ncols)) [X]
    do a statistical test - show that the test assumptions hold for your data 
    manipulate and analyze data in a pandas series or dataframe [X]
    use an image processing algorithm
    use a clustering algorithm
    use some other non-trivial algorithm: e.g. regression, curve fitting, signal analysis… [X]
    version control your code using git: create a local repository and make at least 5 commits while developing your code [X]


In [None]:
#Load packages required
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
%matplotlib inline


In [None]:
#Load data from source file
# (substitute with path to the raw csv data & manipulate and analyze data in a pandas series or dataframe)
HCP_data = pd.read_csv('C:/Users/Judi/Documents/Project_HCP/unrestricted_jhuber_6_23_2017_7_25_10.csv') 
HCP_data.shape

In [None]:
#group data by gender 
gender_grouping = HCP_data.groupby('Gender')

In [None]:
#Plotting distribution of performance in a Line Orientation task ("VSPLOT_TC")
#(plot at least one histogram of the data, with title and labelled axes)
gender_hist = gender_grouping.VSPLOT_TC.hist(bins=15)
plt.xlabel("Variable Short Penn Line Orientation: Total Number Correct (VSPLOT_TC)")
plt.ylabel("frequency")
plt.title("Line Orientation task performance by gender")
plt.legend(('female', 'male'))


### There seems to be a performance difference across genders. Could this be due to differences in hippocampal volume? 
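Before turning to hippocampal volume, the apparent gender difference could be checked with a statistical test whose assumptions are verified first. The following is a minimal sketch, not part of the original analysis: `compare_groups` is a hypothetical helper, the data frame `toy` contains made-up numbers standing in for `HCP_data`, and the 0.05 threshold is an assumed convention. It checks normality (Shapiro-Wilk) and equal variances (Levene), then picks a test with an if-elif-else clause.

```python
import numpy as np
import pandas as pd
import scipy.stats as stats


def compare_groups(scores_a, scores_b, alpha=0.05):
    """Compare two independent samples, choosing the test from assumption checks.

    Returns a dict with the name of the test used, its statistic and p-value.
    """
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)

    # Shapiro-Wilk: the null hypothesis is that the sample is normally distributed
    normal = stats.shapiro(a)[1] > alpha and stats.shapiro(b)[1] > alpha
    # Levene: the null hypothesis is that both samples have equal variance
    equal_var = stats.levene(a, b)[1] > alpha

    if normal and equal_var:
        name = "Student t-test"
        stat, p = stats.ttest_ind(a, b)
    elif normal:
        name = "Welch t-test"
        stat, p = stats.ttest_ind(a, b, equal_var=False)
    else:
        name = "Mann-Whitney U"
        stat, p = stats.mannwhitneyu(a, b, alternative="two-sided")
    return {"test": name, "statistic": float(stat), "p_value": float(p)}


# Synthetic stand-in for HCP_data (made-up numbers, for illustration only)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Gender": ["F"] * 60 + ["M"] * 60,
    "VSPLOT_TC": np.concatenate([rng.normal(14, 4, 60), rng.normal(16, 4, 60)]),
})
result = compare_groups(toy.loc[toy.Gender == "F", "VSPLOT_TC"],
                        toy.loc[toy.Gender == "M", "VSPLOT_TC"])
print("{test}: statistic = {statistic:.2f}, p = {p_value:.4f}".format(**result))
```

With the real data, the two inputs would be the `VSPLOT_TC` values of each gender group. Because `VSPLOT_TC` is a bounded count of correct answers, the non-parametric branch may well be the one that applies.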

In [None]:
#select appropriate columns (i.e. gender, age, task performance and hippocampal volume information)
#first create filter
#(use a for loop)
keep_prefixes = ('FS_L_Hipp', 'FS_R_Hipp', 'Age', 'VSPLOT_TC', 'Gender', 'FS_SubCort_GM') #prefixes of the columns to keep
filter_col = [col for col in HCP_data.columns if col.startswith(keep_prefixes)] #startswith accepts a tuple of prefixes
filter_col



In [None]:
#use filter to select appropriate columns in data 
HCP_data_f = HCP_data[filter_col] #use filter in data frame to select specified columns
HCP_data_f.shape #check shape

In [None]:
#exclude rows with missing values in the columns used below (otherwise get error message later)
required_cols = ['FS_L_Hippo_Vol', 'FS_R_Hippo_Vol', 'FS_SubCort_GM_Vol', 'VSPLOT_TC']
HCP_data_f = HCP_data_f.dropna(subset=required_cols).copy() #copy so that new columns can be added without a SettingWithCopyWarning
HCP_data_f.shape #check shape as expected 




In [None]:
#create composite scores to determine 
# *) total hippocampal volume 
# *) normalised hippocampal volume 
# (use at least one numpy array & use at least one vectorized math operation on an array)

#normalise hippocampal volume by dividing by total subcortical grey matter volume
FS_L_Hippo_Vol_norm = np.array(HCP_data_f.FS_L_Hippo_Vol / HCP_data_f.FS_SubCort_GM_Vol) #left hippocampus normalised volume
FS_R_Hippo_Vol_norm = np.array(HCP_data_f.FS_R_Hippo_Vol / HCP_data_f.FS_SubCort_GM_Vol) #right hippocampus normalised volume
FS_R_Hippo_Vol_norm.size

#calculate total hippocampal volume
FS_HPCvol_sum = np.array(HCP_data_f.FS_R_Hippo_Vol + HCP_data_f.FS_L_Hippo_Vol)
#calculate total normalised hippocampal volume
FS_HPCvol_sum_norm = np.array(FS_HPCvol_sum / HCP_data_f.FS_SubCort_GM_Vol)

#include arrays in data frame
HCP_data_f['FS_L_Hippo_Vol_norm'] = FS_L_Hippo_Vol_norm 
HCP_data_f['FS_R_Hippo_Vol_norm'] = FS_R_Hippo_Vol_norm
HCP_data_f['FS_HPCvol_sum'] = FS_HPCvol_sum
HCP_data_f['FS_HPCvol_sum_norm'] = FS_HPCvol_sum_norm #use the array computed above (from the filtered data frame)

#check
HCP_data_f.shape 
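The composite scores above could also be wrapped in a function with a docstring, which makes the normalisation reusable and easy to check. This is a sketch under stated assumptions: `add_hippocampal_scores` is a hypothetical helper name, and the three-row data frame holds made-up volumes rather than HCP values. The column names are the ones used in the cell above.

```python
import numpy as np
import pandas as pd


def add_hippocampal_scores(df):
    """Return a copy of df with total and normalised hippocampal volumes added.

    Expects the columns FS_L_Hippo_Vol, FS_R_Hippo_Vol and FS_SubCort_GM_Vol.
    Normalised volumes are fractions of subcortical grey matter volume.
    """
    out = df.copy()
    gm = out["FS_SubCort_GM_Vol"]
    out["FS_L_Hippo_Vol_norm"] = out["FS_L_Hippo_Vol"] / gm
    out["FS_R_Hippo_Vol_norm"] = out["FS_R_Hippo_Vol"] / gm
    out["FS_HPCvol_sum"] = out["FS_L_Hippo_Vol"] + out["FS_R_Hippo_Vol"]
    out["FS_HPCvol_sum_norm"] = out["FS_HPCvol_sum"] / gm
    return out


# Made-up volumes for three hypothetical subjects (not HCP values)
toy = pd.DataFrame({
    "FS_L_Hippo_Vol": [4000.0, 4200.0, 3900.0],
    "FS_R_Hippo_Vol": [4100.0, 4300.0, 4000.0],
    "FS_SubCort_GM_Vol": [60000.0, 62000.0, 58000.0],
})
scored = add_hippocampal_scores(toy)

# The normalised left and right parts should add up to the normalised total
assert np.allclose(scored.FS_L_Hippo_Vol_norm + scored.FS_R_Hippo_Vol_norm,
                   scored.FS_HPCvol_sum_norm)
print("mean normalised total volume: {:.4f}".format(scored.FS_HPCvol_sum_norm.mean()))
```

Returning a copy leaves the input frame untouched, so the function can be rerun without accumulating columns on the original data.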



In [None]:
#create a figure with multiple axes
sns.set(style="ticks")
all_corr = sns.pairplot(HCP_data_f,dropna=True, hue = "Gender", markers=['o','x'])
all_corr = all_corr.map_offdiag(plt.scatter,s=35,alpha=0.3)
plt.subplots_adjust(top=0.9)
all_corr.fig.suptitle('Correlations between Line Orientation & Hippocampal Volume', fontsize = 34)


In [None]:
# Calculate the effects of gender and total hippocampal volume on performance in the line orientation task
# using linear regression 
from statsmodels.formula.api import ols
model1 = ols("VSPLOT_TC ~  FS_HPCvol_sum + Gender", HCP_data_f).fit()  
print(model1.summary()) #print summary of regression model


### The second warning suggests multicollinearity, so the explanatory variables might not be the most appropriate: hippocampal volume may itself correlate with gender. It may therefore be more appropriate to use normalised hippocampal volume, which adjusts for differences in head size. 
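One way to probe this is to look at the condition number of the design matrix directly. The sketch below assumes that the warning in question is the statsmodels note about a large condition number (the summary output is not reproduced here, so this is an assumption). It uses made-up data, and `design_condition_number` is a hypothetical helper. It illustrates that a large condition number can arise from a predictor measured on a much larger scale than the other columns, even when the predictors are only moderately correlated.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Made-up data: a 0/1 gender dummy and a volume in mm^3 that is somewhat larger for group 1
gender = rng.integers(0, 2, size=n).astype(float)
volume = 8000.0 + 600.0 * gender + rng.normal(0.0, 700.0, size=n)


def design_condition_number(predictor, dummy):
    """Condition number of the design matrix [intercept, predictor, dummy]."""
    X = np.column_stack([np.ones_like(predictor), predictor, dummy])
    return np.linalg.cond(X)


cond_raw = design_condition_number(volume, gender)
# z-score the volume: same information, but on a scale comparable to the other columns
volume_z = (volume - volume.mean()) / volume.std()
cond_std = design_condition_number(volume_z, gender)

print("condition number, raw volume:          {:.1f}".format(cond_raw))
print("condition number, standardised volume: {:.1f}".format(cond_std))
print("correlation between volume and gender: {:.2f}".format(np.corrcoef(volume, gender)[0, 1]))
```

In this toy setting the correlation between the two predictors is identical in both cases, and only the scale of the volume column changes. Running the same comparison on the real columns would help decide whether the warning reflects genuine collinearity or mainly the units of the volume measure.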

In [None]:
# Calculate the effects of gender and normalised total hippocampal volume on performance 
#  use some other non-trivial algorithm: e.g. regression
from statsmodels.formula.api import ols
model2 = ols("VSPLOT_TC ~  FS_HPCvol_sum_norm + Gender", HCP_data_f).fit()
print(model2.summary())



### The second warning disappeared. And indeed, hippocampal volume does not have an effect after normalisation. So while there is a gender effect on Line Orientation task performance, it cannot be explained by volume differences in the hippocampus.
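This conclusion rests on the usual linear regression assumptions, including approximately normal residuals. The sketch below shows one way such a check could look. It is not the original analysis: the data are made up, the fit uses `np.linalg.lstsq` on a 2D design matrix instead of statsmodels, and the coefficient labels simply mirror the formula in `model2`.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(2)
n = 150

# Made-up predictors and outcome (illustration only, not HCP data)
gender = rng.integers(0, 2, size=n).astype(float)
vol_norm = rng.normal(0.135, 0.01, size=n)
score = 13.0 + 2.0 * gender + rng.normal(0.0, 3.0, size=n)

# Ordinary least squares via a matrix operation on the 2D design matrix
X = np.column_stack([np.ones(n), vol_norm, gender])
coef, _, _, _ = np.linalg.lstsq(X, score, rcond=None)
residuals = score - X @ coef

for name, value in zip(["Intercept", "FS_HPCvol_sum_norm", "Gender"], coef):
    print("{:<20s} {:>10.3f}".format(name, value))

# Shapiro-Wilk on the residuals: a small p-value would argue against normal errors
w_stat, p_value = stats.shapiro(residuals)
print("Shapiro-Wilk on residuals: W = {:.3f}, p = {:.3f}".format(w_stat, p_value))
```

For the statsmodels fit, the residuals are available as `model2.resid` and could be passed to `stats.shapiro` in the same way. A histogram or Q-Q plot of the residuals would be a useful visual complement.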

In [None]:
#Plot the relationship between task performance and raw / normalised hippocampal volume, split by gender
#(create at least one plot of analysis results, with title and labelled axes)
corr_res = sns.pairplot(HCP_data_f, vars=['VSPLOT_TC', 'FS_HPCvol_sum_norm', 'FS_HPCvol_sum'], kind='reg', hue = 'Gender')  
plt.subplots_adjust(top=0.9)
corr_res.fig.suptitle('Relationship between the orientation test, gender and hippocampal volume', fontsize = 12)