# Exploring the "clean" ADNI dataset using Pandas

The "clean" ADNI dataset is the dataset for year 5 (i.e. 5 years after baseline) cleaned by Xiao (Gavin) Gao in the following way:
1. Only particular ADNI datasets were used: ADNI-1, ADNI-GO, ADNI-2.
    - Different versions of ADNI used different versions of FreeSurfer, which in turn affected how intracranial volume (ICV) is calculated.
    - We use the average ICV over the first 3 years.
    - We also use the average volume per region, calculated over the first 3 years.
2. Images that didn't pass ADNI's quality control (QC) were not considered (see `ADNI_123_V4.mat`).
3. If the volume of a region increases by more than 10% between 2 visits, the volume is replaced by the upper limit (1.10x the average) calculated before.
4. Not yet done here, but performed during Gavin's calculations: if more than 10 regions in the brain go over the threshold (volume increased by more than 10%), the data is discarded.
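
Rules 3 and 4 above can be sketched with NumPy as follows. This is only an illustration under assumptions, not Gavin's actual code: the function name, the array names and the toy values are hypothetical, and "over the threshold" is read here as "above 1.10x the per-region average".

```python
import numpy as np

def cap_region_volumes(volumes, region_average, max_regions_over=10):
    """Cap volumes at 1.10x the regional average and flag visits to keep.

    volumes:        array of shape (n_visits, n_regions)
    region_average: array of shape (n_regions,), e.g. the first-3-years average
    """
    volumes = np.asarray(volumes, dtype=float)
    upper_limit = 1.10 * np.asarray(region_average, dtype=float)
    over = volumes > upper_limit                   # broadcast over visits
    capped = np.where(over, upper_limit, volumes)  # rule 3: replace by the upper limit
    n_over = over.sum(axis=1)                      # regions over the limit, per visit
    keep = n_over <= max_regions_over              # rule 4: discard if more than 10
    return capped, n_over, keep

# Toy data (hypothetical): 2 visits, 3 regions
region_average = np.array([100.0, 200.0, 50.0])
volumes = np.array([[105.0, 230.0, 50.0],
                    [100.0, 199.0, 56.0]])
capped, n_over, keep = cap_region_volumes(volumes, region_average)
```

Under this reading, `n_over` plays the same role as the `num_regions_over_upper_limit-*` columns that appear in the dataframe later on.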

Let's start by importing the libraries we need.

In [1]:
from scipy.io import loadmat
import os
import numpy as np
from pathlib import Path
import pandas as pd

## Loading the data into a dataframe

Let's load the data into a pandas dataframe.
First, let's find the file.

In [2]:
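def path_to_file(filename):
    '''
    Returns the path to the file `filename`.
    Assumes the file is in the relative path '../data/'.
    '''
    # __file__ is not defined in a notebook; the quoted string '__file__' is
    # resolved relative to the current working directory, so here_dir is that directory.
    here_dir = os.path.dirname(os.path.realpath('__file__'))
    par_dir = os.path.abspath(os.path.join(here_dir, os.pardir))
    dataset_path = os.path.join(par_dir, 'data', str(filename))
    return dataset_path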

In [3]:
adni_5y = loadmat(path_to_file('vec_a2b_5y_clean.mat'))
# keep only the data matrix from the .mat file, ignoring the metadata entries
adni_5y = adni_5y['vec_a2b_5y']
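
To see why the extra indexing step is needed, here is a minimal sketch of what `loadmat` returns. It writes a toy matrix to an in-memory buffer rather than touching the real file; the variable name `toy_matrix` is hypothetical.

```python
import io

import numpy as np
from scipy.io import loadmat, savemat

# Write a small .mat "file" into memory (toy data, not the ADNI matrix)
buffer = io.BytesIO()
savemat(buffer, {'toy_matrix': np.arange(6.0).reshape(2, 3)})
buffer.seek(0)

# loadmat returns a dict: metadata keys start with '__', the rest are MATLAB variables
contents = loadmat(buffer)
variable_names = [key for key in contents if not key.startswith('__')]
toy_matrix = contents['toy_matrix']
```

Listing the non-metadata keys this way is a quick check of which variable name to index with when the contents of a `.mat` file are unknown.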

In order to make a dataframe, we'll need the name of each column in the `adni_5y` matrix.

We have created a dictionary (`../data/dictionary_ADNI.csv`) associating the column names with the column numbers they refer to. Let's use it to make a dataframe.
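
As a rough illustration of the dictionary format, the sketch below parses two made-up rows. It assumes, based on the parsing code in this notebook, that each row holds a column name followed by either a single column number or a `begin:end` range, and that these numbers are used directly as Python slice bounds. The row contents and the helper name `parse_column_spec` are hypothetical.

```python
import csv
import io

def parse_column_spec(spec):
    """Turn 'begin:end' or a single number 'n' into a slice."""
    numbers = [int(n) for n in spec.split(':')]
    if len(numbers) > 1:
        return slice(numbers[0], numbers[1])
    return slice(numbers[0], numbers[0] + 1)   # single column n -> slice(n, n + 1)

# Two made-up dictionary rows: one single column, one range of columns
toy_dictionary = io.StringIO("ID,0\nregion_volumes,1:4\n")
name_to_slice = {name: parse_column_spec(spec)
                 for name, spec in csv.reader(toy_dictionary)}
```

Storing slices rather than lists of indices means a single-number entry and a range entry can be applied to a matrix row in exactly the same way.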

In [58]:
def make_dataframe(adni_matrix, dictionary_file):
    """Makes a pandas dataframe from adni_matrix, naming its columns via dictionary_file.

    Args:
        adni_matrix (arr): numpy array with ADNI entries
        dictionary_file (str): file with column names (in the ../data/ directory).

    Returns:
        dataframe (pd dataframe)

    """

    def read_column_names_from_csv(dictionary_file):
        """Returns a dictionary of shape {column name: slice}.

        Args:
            dictionary_file (str): file with column names (in the ../data/ directory).

        Returns:
            name_dict (dict): dictionary {column name: slice(column_begin, column_end, None)}

        """

        import csv
        path = path_to_file(dictionary_file)
        name_dict = {}
        # each row is "legend,column_numbers", where column_numbers is "n" or "begin:end"
        with open(path, 'r') as csv_file:
            for row in csv.reader(csv_file):
                legend, column_numbers = row
                column_numbers_list = list(map(int, column_numbers.split(':')))
                if len(column_numbers_list) > 1:
                    column_slice = slice(column_numbers_list[0], column_numbers_list[1], None)
                else:
                    # a single column number n becomes slice(n, n + 1)
                    column_slice = slice(column_numbers_list[0], column_numbers_list[0] + 1, None)
                name_dict[legend] = column_slice
        return name_dict

    names_dict = read_column_names_from_csv(dictionary_file)

    # one entry per row of adni_matrix, holding the values in that column slice
    dataframe = pd.DataFrame({name: [row[column_slice] for row in adni_matrix]
                              for name, column_slice in names_dict.items()})

    # remove brackets from dataframe: unwrap columns whose entries hold a single value
    for c in dataframe.columns:
        if len(dataframe[c][0]) == 1:
            dataframe[c] = dataframe[c].str[0]

    return dataframe

Now we can use these column names to create a pandas dataframe:

In [59]:
df = make_dataframe(adni_5y, 'dictionary_ADNI.csv')
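
To make the resulting layout concrete, here is a small sketch of the same idea on a toy matrix: single-column entries end up as plain numbers, while multi-column entries stay as arrays inside one dataframe column. The matrix, the column names and the slices are hypothetical, and the unwrapping step uses `Series.map` rather than the `.str[0]` accessor.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the ADNI matrix and the parsed dictionary
toy_matrix = np.array([[42.0, 1.0, 2.0, 3.0],
                       [51.0, 4.0, 5.0, 6.0]])
toy_slices = {'ID': slice(0, 1), 'region_volumes': slice(1, 4)}

# One list entry per matrix row, each holding the values in that column slice
toy_df = pd.DataFrame({name: [row[cols] for row in toy_matrix]
                       for name, cols in toy_slices.items()})

# Unwrap columns whose entries are length-1 arrays into plain scalars
for column in toy_df.columns:
    if len(toy_df[column].iloc[0]) == 1:
        toy_df[column] = toy_df[column].map(lambda values: values[0])
```

After this, `toy_df['ID']` is an ordinary numeric column, whereas each cell of `toy_df['region_volumes']` still holds a 3-element array.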

In [61]:
# df

In [48]:
df

| | ID | BaselineDx | 1yDx | End-of-studyDx | Baselineatrophy | Futureatrophy | GeneticInfo(APOE4) | blAge | gender | educationyear | ... | Future_Age | sum_volume_all_regions-baseline | sum_volume_all_regions-future | baseline_ICV / sum_volume_all_regions-baseline | baseline_ICV / sum_volume_all_regions-future | average_ICSV_first_3_visits | num_regions_over_upper_limit-baseline | num_regions_over_upper_limit-future | end-of-study-conversion | num_months_from_baseline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 42.0 | 3.0 | 3.0 | 3.0 | 0.386044 | 0.676818 | 0.0 | -0.311726 | 1.0 | 0.570933 | ... | 78.85339 | 494503.0 | 458625.0 | 3.075512 | 3.282311 | 1.511358e+06 | 3.0 | 1.0 | 0.0 | 60.0 |
| 1 | 42.0 | 3.0 | 3.0 | 3.0 | 0.435094 | 0.612401 | 0.0 | 0.028592 | 1.0 | 0.570933 | ... | 80.83012 | 484518.0 | 458582.0 | 3.115674 | 3.294111 | 1.511358e+06 | 1.0 | 4.0 | 0.0 | 60.0 |
| 2 | 51.0 | 2.0 | 3.0 | 3.0 | 0.789532 | 0.846892 | 2.0 | -1.180313 | 1.0 | 0.570933 | ... | 73.51437 | 576915.0 | 530504.0 | 2.829082 | 3.148487 | 1.628523e+06 | 2.0 | 1.0 | 1.0 | 60.0 |
| 3 | 51.0 | 2.0 | 3.0 | 3.0 | 0.819118 | 0.893798 | 2.0 | -1.015772 | 1.0 | 0.570933 | ... | 74.46715 | 581089.0 | 496010.0 | 2.818725 | 3.344741 | 1.628523e+06 | 5.0 | 0.0 | 1.0 | 60.0 |
| 4 | 56.0 | 1.0 | 1.0 | 2.0 | 0.344070 | 0.337576 | 0.0 | -1.000813 | 2.0 | -1.255780 | ... | 74.56646 | 493528.0 | 486438.0 | 2.679990 | 2.694752 | 1.325962e+06 | 1.0 | 2.0 | 0.0 | 60.0 |
| 5 | 56.0 | 1.0 | 1.0 | 2.0 | 0.331854 | 0.402586 | 0.0 | -0.841219 | 2.0 | -1.255780 | ... | 75.58220 | 496912.0 | 493722.0 | 2.679287 | 2.656049 | 1.325962e+06 | 1.0 | 1.0 | 0.0 | 60.0 |
| 6 | 59.0 | 1.0 | 1.0 | 2.0 | 0.439460 | 0.678659 | 0.0 | -0.787350 | 2.0 | -1.255780 | ... | 75.86646 | 485096.0 | 447729.0 | 2.530798 | 2.789500 | 1.237642e+06 | 2.0 | 2.0 | 0.0 | 60.0 |
| 7 | 59.0 | 1.0 | 1.0 | 2.0 | 0.645579 | 0.592909 | 0.0 | -0.627756 | 2.0 | -1.255780 | ... | 76.88220 | 485136.0 | 484212.0 | 2.555016 | 2.592026 | 1.237642e+06 | 4.0 | 5.0 | 0.0 | 60.0 |
| 8 | 61.0 | 1.0 | 3.0 | 3.0 | 0.659642 | 0.702290 | 0.0 | 0.708802 | 2.0 | -0.525095 | ... | 85.00274 | 508358.0 | 490977.5 | 2.900063 | 3.004007 | 1.473486e+06 | 3.0 | 0.0 | 1.0 | 60.0 |
| 9 | 68.0 | 1.0 | 1.0 | 1.0 | 0.495193 | 0.583687 | 0.0 | -0.212642 | 2.0 | -1.621123 | ... | 79.47324 | 541620.0 | 505269.0 | 2.536335 | 2.718169 | 1.375186e+06 | 8.0 | 1.0 | 0.0 | 60.0 |
