# Exploring the "clean" ADNI dataset using Pandas

The "clean" ADNI dataset is the year 5 dataset (i.e. 5 years after baseline), cleaned by Xiao (Gavin) Gao in the following way:
1. Only particular ADNI datasets were used: ADNI-1, ADNI-GO, ADNI-2
    - Different versions of ADNI used different versions of FreeSurfer which, in turn, affected how intracranial volume (ICV) is calculated.
    - We use the average ICV for the first 3 years
    - We also use the average volume per region calculated from the first 3 years
2. Images that didn't pass ADNI's quality control (QC) were not considered (see `ADNI_123_V4.mat`).
3. If the volume of a region increases by more than 10% between 2 visits, the volume is replaced by the upper limit (1.10x average) calculated before.
4. Not yet done here but performed during Gavin's calculations: if more than 10 regions in the brain go over the threshold (volume increased by more than 10%), the data is discarded.
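
Rules 3 and 4 above can be sketched with NumPy. This is only an illustration under stated assumptions, not Gavin's actual code: the function name `cap_region_volumes`, its arguments, and the simplification that "over the threshold" means a volume above 1.10x the per-region average are all hypothetical.

```python
import numpy as np

def cap_region_volumes(volumes, average_volumes, max_regions_over=10, factor=1.10):
    """Hypothetical helper: cap each region at factor * average.

    Returns (capped_volumes, num_over). capped_volumes is None when more
    than max_regions_over regions exceed the upper limit (rule 4).
    """
    volumes = np.asarray(volumes, dtype=float)
    upper_limit = factor * np.asarray(average_volumes, dtype=float)
    num_over = int((volumes > upper_limit).sum())
    if num_over > max_regions_over:
        return None, num_over  # too many suspicious regions: discard
    # rule 3: replace anything above the upper limit by the upper limit
    return np.minimum(volumes, upper_limit), num_over

# toy data: 12 regions with an average volume of 100 each
avg = np.full(12, 100.0)
vols = avg.copy()
vols[0] = 120.0   # above the upper limit, gets capped
vols[1] = 105.0   # below the upper limit, kept as-is
capped, n_over = cap_region_volumes(vols, avg)
discarded, n_discarded = cap_region_volumes(np.full(12, 150.0), avg)
```

In the toy call, only the first region is capped; in the second call all 12 regions exceed the limit, so the data is flagged for discarding.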

Let's start by importing the libraries we need:

In [1]:
from scipy.io import loadmat
import os
import numpy as np
from pathlib import Path
import pandas as pd

## Loading data into a dataframe

Let's load the data into a pandas dataframe.
First, let's find the file.

In [2]:
def path_to_file(filename):
    '''
    Returns path for file `filename`.
    Assumes file to be in the relative path: '../data/'
    '''
    # Note: '__file__' is a plain string here (notebooks do not define __file__),
    # so realpath resolves it against the current working directory.
    here_dir = os.path.dirname(os.path.realpath('__file__'))
    par_dir = os.path.abspath(os.path.join(here_dir, os.pardir))
    file_path = os.path.join(par_dir, 'data', str(filename))
    return file_path
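
Since `Path` is imported above, the same lookup could also be written with `pathlib`. The sketch below is an alternative, not the function used in this notebook; the name `path_to_file_pathlib` and the optional `base_dir` argument are hypothetical and exist only to make the example testable.

```python
from pathlib import Path

def path_to_file_pathlib(filename, base_dir=None):
    """Hypothetical pathlib variant: returns <parent of base_dir>/data/<filename>.

    base_dir defaults to the current working directory.
    """
    base_dir = Path.cwd() if base_dir is None else Path(base_dir)
    return base_dir.resolve().parent / 'data' / str(filename)

example_path = path_to_file_pathlib('vec_a2b_5y_clean.mat')
```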

In [3]:
adni_5y = loadmat(path_to_file('vec_a2b_5y_clean.mat'))
#get the proper matrix from the .mat file, ignoring metadata
adni_5y = adni_5y['vec_a2b_5y']
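
`loadmat` returns a dictionary that mixes metadata entries (keys starting with `__`) with the stored variables, which is why the matrix is picked out by key above. A minimal sketch on a throwaway file shows this; the file name `toy.mat` and the variable name `toy_matrix` are made up for the example.

```python
import os
import tempfile

import numpy as np
from scipy.io import loadmat, savemat

# write a small .mat file to a temporary directory and read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, 'toy.mat')
    savemat(toy_path, {'toy_matrix': np.arange(6.0).reshape(2, 3)})
    toy = loadmat(toy_path)

# metadata keys are wrapped in double underscores; the rest are variables
metadata_keys = sorted(k for k in toy if k.startswith('__'))
data_keys = sorted(k for k in toy if not k.startswith('__'))
toy_matrix = toy['toy_matrix']
```

Listing the non-metadata keys this way is a quick check of which variable names a `.mat` file actually contains.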

In order to make a dataframe, we'll need the name of each column in the `adni_5y` matrix.

We have created a dictionary (`../data/dictionary_ADNI.csv`) associating the column names with the column numbers they refer to. Let's use it to make a dataframe.
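
To illustrate the idea, here is a small sketch of how a row of such a dictionary can be turned into a Python `slice`. The two rows below are invented (the real file's legends and column numbers are not shown here), and `parse_columns` is a hypothetical helper.

```python
import csv
import io

# invented dictionary content: "legend,column" or "legend,begin:end"
toy_dictionary = io.StringIO("toy_id,0\ntoy_vector,1:4\n")

def parse_columns(column_numbers):
    """Turn 'n' into slice(n, n + 1) and 'a:b' into slice(a, b)."""
    numbers = [int(n) for n in column_numbers.split(':')]
    begin = numbers[0]
    end = numbers[1] if len(numbers) > 1 else begin + 1
    return slice(begin, end)

name_to_slice = {legend: parse_columns(cols)
                 for legend, cols in csv.reader(toy_dictionary)}
```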

In [4]:
def make_dataframe(adni_matrix, dictionary_file):
    """Makes a pandas dataframe from adni_matrix, naming columns from dictionary_file.

    Args:
       adni_matrix (arr): numpy array with adni entries
       dictionary_file (str): file with column names (in ../data/ directory).

    Returns:
        dataframe (pd dataframe)

    """
   
    def read_column_names_from_csv(dictionary_file):
        """Returns a dictionary of shape {column name: slice}.

        Args:
            dictionary_file (str): file with column names (in ../data/ directory).

        Returns:
            name_dict (dict): dictionary {column names: slice(column_begin, column_end, None)}

        """
      
        import csv
        path = path_to_file(dictionary_file)
        name_dict = {}
        # each row holds a legend and either 'begin:end' or a single column number
        with open(path, 'r') as csv_file:
            for row in csv.reader(csv_file):
                legend, column_numbers = row
                column_numbers_list = list(map(int, column_numbers.split(':')))
                begin = column_numbers_list[0]
                # a single column number n becomes slice(n, n + 1)
                end = column_numbers_list[1] if len(column_numbers_list) > 1 else begin + 1
                name_dict[legend] = slice(begin, end, None)
        return name_dict
    
    names_dict = read_column_names_from_csv(dictionary_file)
    
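    # one column per legend; each cell holds that row's slice of the matrix
    # (column_slice avoids shadowing the built-in slice)
    dataframe = pd.DataFrame({k: [adni_matrix[i][column_slice] for i in range(len(adni_matrix))]
                              for k, column_slice in names_dict.items()})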
    
    # remove brackets from dataframe: unwrap columns whose cells hold a single value
    for c in dataframe.columns:
        if len(dataframe[c][0]) == 1:
            dataframe[c] = dataframe[c].str[0]

    return dataframe

Now we can use these column names to create a pandas dataframe:

In [5]:
df = make_dataframe(adni_5y, 'dictionary_ADNI.csv')
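
The construction inside `make_dataframe` can be seen on a toy matrix. This is a sketch with invented values and invented column names (`toy_id`, `toy_vector`); it uses `Series.map` to unwrap single-value cells, which is a different spelling from the `.str[0]` used above.

```python
import numpy as np
import pandas as pd

# two rows, four columns: column 0 is a scalar field, columns 1..3 a vector field
toy_matrix = np.array([[1.0, 10.0, 11.0, 12.0],
                       [2.0, 20.0, 21.0, 22.0]])
toy_slices = {'toy_id': slice(0, 1), 'toy_vector': slice(1, 4)}

# each cell starts out as a small array (the row restricted to the slice)
toy_df = pd.DataFrame({name: [row[s] for row in toy_matrix]
                       for name, s in toy_slices.items()})

# unwrap columns whose cells hold exactly one value
for c in toy_df.columns:
    if len(toy_df[c][0]) == 1:
        toy_df[c] = toy_df[c].map(lambda cell: cell[0])
```

After this, `toy_id` holds plain numbers while `toy_vector` still holds one array per row, which mirrors how a scalar column and a multi-column field end up in the same dataframe.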

In [6]:
# df

In [7]:
df.columns

Index(['ID', 'BaselineDx', '1yDx', 'End-of-studyDx', 'Baselineatrophy',
       'Futureatrophy', 'GeneticInfo(APOE4)', 'blAge', 'gender',
       'educationyear', 'marriage', 'blADAS11', 'blADAS13',
       'blRAVLT_immediate', 'blRAVLT_learning', 'blRAVLT_forgetting',
       'blRAVLT_perc_forgetting', 'blFAQ', 'blCDR', 'blMMSE', 'ftAge',
       'ftADAS11', 'ftADAS13', 'ftRAVLT_immediate', 'ftRAVLT_learning',
       'ftRAVLT_forgetting', 'ftRAVLT_perc_forgetting', 'ftFAQ', 'ftCDR',
       'ftMMSE', 'blIntracranialVolume', 'ftIntracranialVolume',
       'Baseline_Age', 'Future_Age', 'sum_volume_all_regions-baseline',
       'sum_volume_all_regions-future',
       'baseline_ICV / sum_volume_all_regions-baseline',
       'baseline_ICV / sum_volume_all_regions-future',
       'average_ICSV_first_3_visits', 'num_regions_over_upper_limit-baseline',
       'num_regions_over_upper_limit-future', 'end-of-study-conversion',
       'num_months_from_baseline'],
      dtype='object')

Let's count the number of unique patients:

In [8]:
df['ID'].nunique()

96

Let's now group patients by their ID, keep the first occurrence for each of them, and check the shape of the baseline atrophy entry for patient 42:

In [18]:
df[['ID','Baselineatrophy']].groupby(df['ID']).aggregate('first')['Baselineatrophy'][42.0].shape

(86,)
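
The counting and grouping steps above can be tried on a tiny frame. The data and the column name `visit_value` are invented for the illustration.

```python
import pandas as pd

# patient 42 has two rows, patient 7 has one
toy_visits = pd.DataFrame({
    'ID': [42.0, 42.0, 7.0],
    'visit_value': [0.1, 0.2, 0.3],
})

num_patients = toy_visits['ID'].nunique()

# keep the first row seen for each ID; the result is indexed by ID
first_per_patient = toy_visits.groupby('ID').aggregate('first')
first_value_42 = first_per_patient['visit_value'][42.0]
```

Because the grouped result is indexed by `ID`, a patient can be looked up directly by ID value, as in the `[42.0]` lookup used above.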