# Exploring the ADNI dataset using Pandas

Dataset from http://adni.loni.usc.edu

First, we look at how to load the matrices into a Pandas data frame and start exploring them. The code (module) below was put together by Pablo and is the start of a package, called 'load_ADNI', that we can build as a group to work with the ADNI data in Python.

We need to load some packages (and modules). With the standard Anaconda distribution I think most of the packages will already be included. However, if you are ever missing anything, the easiest way to get it is to 'pip install ...' on the command line. 
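
A quick way to see whether anything is missing before running the imports is sketched below. The helper name `missing_packages` is made up for this example; it only uses `importlib.util.find_spec` from the standard library, which returns `None` when a top-level package cannot be found.

```python
import importlib.util

def missing_packages(names):
    """Return the names in `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Top-level third-party packages imported in this notebook.
needed = ["scipy", "numpy", "pandas"]
print(missing_packages(needed))  # anything listed here needs a 'pip install'
```

An empty list means everything is available.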

In [1]:
from scipy.io import loadmat #scipy for dealing with scientific specific things -- eg. matrices from matlab
import os #dealing with features of the operating system you are running on.
import numpy as np #for Python's array format -- numpy
from pathlib import Path #this is another way of dealing with paths which was chosen for convenience.
import pandas as pd #pandas dataframes are essentially excel for python, and dealing with this is the focus here.

As practice in writing functions, and to clean up load_ADNI, the code that finds the file path is placed in its own function. It is very specific and not flexible; a more generally useful way of doing this would be better in future, and the function can be amended to reflect that.

In [14]:
def find_data_path():
    '''
    Returns path of the data
    '''
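    # Note: '__file__' is a quoted string here (notebooks do not define __file__),
    # so this resolves to the current working directory.
    here_dir = os.path.dirname(os.path.realpath('__file__'))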
    par_dir = os.path.abspath(os.path.join(here_dir, os.pardir))
    dataset_dir = os.path.join(par_dir, 'data')
    return dataset_dir

In [15]:
find_data_path()

'/Users/Megan/RajLab/ADNI/data'
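
Since `Path` is already imported, the same lookup could also be written with pathlib. This is only a sketch: the name `find_data_path_pathlib` and its optional `start` argument are hypothetical, and it assumes, like the function above, that the data folder sits one level above the working directory.

```python
from pathlib import Path

def find_data_path_pathlib(start=None):
    """
    Return the 'data' folder that sits next to `start` (hypothetical variant).
    `start` defaults to the current working directory.
    """
    here_dir = Path(start) if start is not None else Path.cwd()
    return here_dir.resolve().parent / "data"

print(find_data_path_pathlib())
```

Passing `start` explicitly makes the function easier to reuse from a different folder.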

The next function is called load_baselines_into_dataframes(). Let's break down what it is doing step by step and explore the data in the process. 

In [19]:
dataset_dir = find_data_path()
print(dataset_dir)
adni = [dataset_dir + '/vec_a2b_1y_atr_factor.mat',
            dataset_dir + '/vec_a2b_3y_atr_factor.mat',
            dataset_dir + '/vec_a2b_5y_atr_factor.mat']

/Users/Megan/RajLab/ADNI/data


In [20]:
adni 

['/Users/Megan/RajLab/ADNI/data/vec_a2b_1y_atr_factor.mat',
 '/Users/Megan/RajLab/ADNI/data/vec_a2b_3y_atr_factor.mat',
 '/Users/Megan/RajLab/ADNI/data/vec_a2b_5y_atr_factor.mat']
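
The three file names differ only in the year, so the list could also be generated rather than typed out. A minimal sketch, assuming the naming pattern shown above holds for every year; `build_mat_paths` is a hypothetical helper, not part of load_ADNI.

```python
import os

def build_mat_paths(dataset_dir, years=(1, 3, 5)):
    """Build one .mat path per follow-up year (hypothetical helper)."""
    return [
        os.path.join(dataset_dir, "vec_a2b_{}y_atr_factor.mat".format(year))
        for year in years
    ]

paths = build_mat_paths("data")
print(paths)
```

Using `os.path.join` instead of string concatenation also avoids hard-coding the '/' separator.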

In [21]:
adni = [loadmat(adni[0]),loadmat(adni[1]),loadmat(adni[2])]

Let's take a look at what this data looks like. It is a list of dictionaries. 

In [28]:
adni

[{'__globals__': [],
  '__header__': b'MATLAB 5.0 MAT-file, Platform: MACI64, Created on: Wed Oct 17 13:12:32 2018',
  '__version__': '1.0',
  'vec_a2b_1y_atr_factor': array([[3.00000e+00, 3.00000e+00, 3.00000e+00, ..., 1.90382e+06,
          0.00000e+00, 1.20000e+01],
         [3.00000e+00, 3.00000e+00, 3.00000e+00, ..., 1.90342e+06,
          0.00000e+00, 1.20000e+01],
         [4.00000e+00, 2.00000e+00, 2.00000e+00, ..., 1.65674e+06,
          0.00000e+00, 1.20000e+01],
         ...,
         [5.14200e+03, 1.00000e+00, 1.00000e+00, ..., 1.45484e+06,
          0.00000e+00, 1.20000e+01],
         [5.14200e+03, 1.00000e+00, 1.00000e+00, ..., 1.42988e+06,
          0.00000e+00, 1.20000e+01],
         [5.14700e+03, 1.00000e+00, 1.00000e+00, ..., 1.54621e+06,
          0.00000e+00, 1.20000e+01]])},
 {'__globals__': [],
  '__header__': b'MATLAB 5.0 MAT-file, Platform: MACI64, Created on: Wed Oct 17 13:13:14 2018',
  '__version__': '1.0',
  'vec_a2b_3y_atr_factor': array([[4.00000e+00, 2.00
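
Printing the whole list is hard to read. Looking only at the dictionary keys shows the structure more quickly: each dictionary holds a few metadata entries whose names start and end with double underscores, plus one entry with the actual matrix. The sketch below uses a toy dictionary with shortened values rather than the real files, and `data_keys` is a name invented for this example.

```python
def data_keys(mat_dict):
    """Return the keys that hold data, skipping the '__...__' metadata keys."""
    return [key for key in mat_dict if not key.startswith("__")]

# Toy stand-in with the same key layout as the output above (values shortened).
toy = {
    "__globals__": [],
    "__header__": b"MATLAB 5.0 MAT-file",
    "__version__": "1.0",
    "vec_a2b_1y_atr_factor": [[3.0, 3.0, 3.0]],
}
print(data_keys(toy))
```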

Extract the year 1 data, addressing the actual data in the dictionary using the key 'vec_a2b_1y_atr_factor'

In [31]:
adni[0]['vec_a2b_1y_atr_factor'].shape

(1865, 207)

Get rid of the metadata and rearrange into a list of matrices.

In [32]:
adni = [adni[0]['vec_a2b_1y_atr_factor'],adni[1]['vec_a2b_3y_atr_factor'],adni[2]['vec_a2b_5y_atr_factor']]

In [34]:
adni[0].shape

(1865, 207)

In [35]:
len(adni[0])

1865

So we can address each year by the first index. Each row has 207 attributes, including 86-element atrophy vectors at baseline and at the follow-up time point, plus all sorts of other patient information, for instance the number of months since the baseline measurement was taken.
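
As a sanity check on those numbers, the column blocks used in the dataframe-building cell of this notebook can be written down and checked to cover all 207 columns. This is just an illustrative sketch; the block labels and the helper `check_layout` are made up here.

```python
# Column blocks, using the index ranges from the dataframe-building cell.
layout = [
    ("ID and diagnoses", 0, 4),        # ID, baseline Dx, 1y Dx, end-of-study Dx
    ("Baseline atrophy", 4, 90),       # 86 values
    ("Future atrophy", 90, 176),       # 86 values
    ("Other patient info", 176, 207),  # 31 scalar attributes
]

def check_layout(blocks, n_columns):
    """Return True if the blocks tile columns 0..n_columns with no gap or overlap."""
    expected_start = 0
    for _name, start, stop in blocks:
        if start != expected_start or stop <= start:
            return False
        expected_start = stop
    return expected_start == n_columns

for name, start, stop in layout:
    print(name, stop - start)
print(check_layout(layout, 207))
```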

For example, make a list of the patient IDs for the year 1 data.

In [60]:
IDS = [adni[0][i][0] for i in range(len(adni[0]))]

In [61]:
IDS

[3.0,
 3.0,
 4.0,
 4.0,
 5.0,
 6.0,
 6.0,
 10.0,
 14.0,
 14.0,
 14.0,
 15.0,
 16.0,
 16.0,
 16.0,
 19.0,
 21.0,
 21.0,
 21.0,
 21.0,
 21.0,
 23.0,
 23.0,
 23.0,
 23.0,
 23.0,
 23.0,
 23.0,
 23.0,
 31.0,
 31.0,
 31.0,
 31.0,
 38.0,
 40.0,
 40.0,
 40.0,
 42.0,
 42.0,
 42.0,
 42.0,
 43.0,
 47.0,
 51.0,
 51.0,
 51.0,
 51.0,
 51.0,
 54.0,
 54.0,
 56.0,
 56.0,
 56.0,
 59.0,
 59.0,
 59.0,
 60.0,
 60.0,
 61.0,
 61.0,
 61.0,
 66.0,
 67.0,
 67.0,
 69.0,
 69.0,
 70.0,
 72.0,
 72.0,
 72.0,
 72.0,
 76.0,
 77.0,
 80.0,
 81.0,
 81.0,
 83.0,
 83.0,
 84.0,
 87.0,
 87.0,
 89.0,
 89.0,
 90.0,
 90.0,
 90.0,
 91.0,
 91.0,
 94.0,
 94.0,
 95.0,
 96.0,
 97.0,
 97.0,
 98.0,
 106.0,
 107.0,
 108.0,
 108.0,
 109.0,
 109.0,
 110.0,
 111.0,
 112.0,
 112.0,
 112.0,
 112.0,
 116.0,
 116.0,
 116.0,
 120.0,
 120.0,
 120.0,
 120.0,
 120.0,
 120.0,
 120.0,
 123.0,
 125.0,
 125.0,
 125.0,
 125.0,
 126.0,
 126.0,
 126.0,
 126.0,
 126.0,
 126.0,
 126.0,
 126.0,
 127.0,
 127.0,
 127.0,
 127.0,
 127.0,
 128.0,
 128.0,
 128.0
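
The same column can also be pulled out with NumPy slicing instead of a Python loop. The sketch below uses a small toy matrix rather than the real data, so the values are only illustrative.

```python
import numpy as np

# Toy matrix: rows are records, column 0 plays the role of the patient ID.
toy = np.array([[3.0, 3.0, 12.0],
                [3.0, 3.0, 12.0],
                [4.0, 2.0, 12.0]])

ids_loop = [toy[i][0] for i in range(len(toy))]  # same pattern as the cell above
ids_slice = toy[:, 0].tolist()                   # column slice, then a plain list

print(ids_slice)
```

Both give the same values; the slice is shorter and avoids indexing row by row.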

There are some repetitions, which we will explore once the data is in dataframes.
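
Even before that, the repetitions can be counted with `collections.Counter` from the standard library. A small sketch using only the first few IDs from the output above:

```python
from collections import Counter

# First few IDs copied from the output above.
ids = [3.0, 3.0, 4.0, 4.0, 5.0, 6.0, 6.0, 10.0, 14.0, 14.0, 14.0]

counts = Counter(ids)
repeated = {patient: n for patient, n in counts.items() if n > 1}
print(repeated)
```

Here IDs 3, 4 and 6 each appear twice and ID 14 appears three times, so a patient can have more than one row.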

Next comes the bit of the code that builds the dataframes. It loops over an index 'j', which addresses each of the three years, and for each feature it loops over every row of that year's matrix. This results in a list of three dataframes, one for each year.

In [62]:
adni_df = [pd.DataFrame({
                        'ID'                        : [int(adni[j][i][0]) for i in range(len(adni[j]))],
                        'Baseline Dx'               : [int(adni[j][i][1]) for i in range(len(adni[j]))],
                        '1y Dx'                     : [int(adni[j][i][2]) for i in range(len(adni[j]))],
                        'End-of-study Dx'           : [int(adni[j][i][3]) for i in range(len(adni[j]))],
                        'Baseline atrophy'          : [adni[j][i][4:90] for i in range(len(adni[j]))],
                        'Future atrophy'            : [adni[j][i][90:176] for i in range(len(adni[j]))],
                        'Genetic Info (APOE4)'      : [adni[j][i][176] for i in range(len(adni[j]))],
                        'bl Age'                    : [adni[j][i][177] for i in range(len(adni[j]))],
                        'gender'                    : [adni[j][i][178] for i in range(len(adni[j]))],
                        'education year'            : [adni[j][i][179] for i in range(len(adni[j]))],
                        'marriage'                  : [adni[j][i][180] for i in range(len(adni[j]))],
                        'bl ADAS11'                 : [adni[j][i][181] for i in range(len(adni[j]))],
                        'bl ADAS13'                 : [adni[j][i][182] for i in range(len(adni[j]))],
                        'bl RAVLT_immediate'        : [adni[j][i][183] for i in range(len(adni[j]))],
                        'bl RAVLT_learning'         : [adni[j][i][184] for i in range(len(adni[j]))],
                        'bl RAVLT_forgetting'       : [adni[j][i][185] for i in range(len(adni[j]))],
                        'bl RAVLT_perc_forgetting'  : [adni[j][i][186] for i in range(len(adni[j]))],
                        'bl FAQ'                    : [adni[j][i][187] for i in range(len(adni[j]))],
                        'bl CDR'                    : [adni[j][i][188] for i in range(len(adni[j]))],
                        'bl MMSE'                   : [adni[j][i][189] for i in range(len(adni[j]))],
                        'ft Age'                    : [adni[j][i][190] for i in range(len(adni[j]))],
                        'gender'                    : [adni[j][i][191] for i in range(len(adni[j]))],
                        'education year'            : [adni[j][i][192] for i in range(len(adni[j]))],
                        'marriage'                  : [adni[j][i][193] for i in range(len(adni[j]))],
                        'ft ADAS11'                 : [adni[j][i][194] for i in range(len(adni[j]))],
                        'ft ADAS13'                 : [adni[j][i][195] for i in range(len(adni[j]))],
                        'ft RAVLT_immediate'        : [adni[j][i][196] for i in range(len(adni[j]))],
                        'ft RAVLT_learning'         : [adni[j][i][197] for i in range(len(adni[j]))],
                        'ft RAVLT_forgetting'       : [adni[j][i][198] for i in range(len(adni[j]))],
                        'ft RAVLT_perc_forgetting'  : [adni[j][i][199] for i in range(len(adni[j]))],
                        'ft FAQ'                    : [adni[j][i][200] for i in range(len(adni[j]))],
                        'ft CDR'                    : [adni[j][i][201] for i in range(len(adni[j]))],
                        'ft MMSE'                   : [adni[j][i][202] for i in range(len(adni[j]))],
                        'bl Intracranial Volume'    : [adni[j][i][203] for i in range(len(adni[j]))],
                        'ft Intracranial Volume'    : [adni[j][i][204] for i in range(len(adni[j]))],
                        'end-of-study is_convert'   : [adni[j][i][205] for i in range(len(adni[j]))],
                        'number of months from bl'  : [adni[j][i][206] for i in range(len(adni[j]))]
                         }) for j in range(3)]
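
One thing to be aware of in the cell above: the keys 'gender', 'education year' and 'marriage' each appear twice (indices 178 to 180 and again 191 to 193). In a Python dictionary literal a repeated key keeps only the last value, so only one copy of each survives, which fits the 34 columns (rather than 37) reported further down. The sketch below shows that behaviour on a toy dictionary, plus a made-up helper `duplicated` for spotting repeated names.

```python
from collections import Counter

# A dictionary literal with a repeated key keeps only the last value.
toy = {"gender": "baseline", "bl Age": 70, "gender": "follow-up"}
print(toy)

def duplicated(names):
    """Return the names that occur more than once, in first-seen order."""
    counts = Counter(names)
    return [name for name in counts if counts[name] > 1]

# Some of the column names from the cell above, in the order they are written.
names = ["bl Age", "gender", "education year", "marriage",
         "ft Age", "gender", "education year", "marriage"]
print(duplicated(names))
```

If both copies are wanted, giving them distinct names would keep them apart. Something like 'bl gender' and 'ft gender' is one option, assuming from their position that the two groups are the baseline and follow-up values.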

Now we can start using Pandas functionality (I suppose we ought to know how to do this).

In [67]:
adni_df[0].head() #just display the first few entries.

| | ID | Baseline Dx | 1y Dx | End-of-study Dx | Baseline atrophy | Future atrophy | Genetic Info (APOE4) | bl Age | gender | education year | ... | ft RAVLT_learning | ft RAVLT_forgetting | ft RAVLT_perc_forgetting | ft FAQ | ft CDR | ft MMSE | bl Intracranial Volume | ft Intracranial Volume | end-of-study is_convert | number of months from bl |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 3 | 3 | 3 | [0.42899300267558316, 0.6101146248911582, 0.82... | [0.49130654601460605, 0.5657404677305382, 0.76... | 1.0 | 0.920354 | 1.0 | 0.570933 | ... | -1.6344 | 1.302757 | 2.279192 | 21.999117 | 12.774359 | -9.680302 | 1920690.0 | 1903820.0 | 0.0 | 12.0 |
| 1 | 3 | 3 | 3 | 3 | [0.49130654601460605, 0.5657404677305382, 0.76... | [0.4797399004600559, 0.6013764392125581, 0.791... | 1.0 | 1.084444 | 1.0 | 0.570933 | ... | -2.054917 | 0.57334 | 2.279192 | 18.069977 | 29.549822 | -8.068754 | 1903820.0 | 1903420.0 | 0.0 | 12.0 |
| 2 | 4 | 2 | 2 | 2 | [0.38735705827325556, 0.16893355463538812, 0.6... | [0.49331091696947693, 0.373494804762419, 0.685... | 0.0 | -1.345638 | 1.0 | -2.351808 | ... | -0.793366 | -0.885494 | -0.64628 | -0.26601 | 3.454657 | -2.428334 | 1679440.0 | 1656740.0 | 0.0 | 12.0 |
| 3 | 4 | 2 | 2 | 2 | [0.3269262100101499, 0.27174938386729525, 0.68... | [0.540077225624267, 0.18117533851644893, 0.625... | 0.0 | -1.266965 | 1.0 | -2.351808 | ... | 0.888701 | -0.520786 | -0.50697 | 1.043703 | 3.454657 | -1.622559 | 1661130.0 | 1648510.0 | 0.0 | 12.0 |
| 4 | 5 | 1 | 1 | 1 | [0.5883427925210888, 0.7011389219041197, 0.664... | [0.4331046075873159, 0.642863560168384, 0.6006... | 0.0 | 0.000596 | 1.0 | -0.159752 | ... | 0.047667 | 0.938048 | 0.759468 | -0.26601 | 3.454657 | 0.794764 | 1634180.0 | 1632880.0 | 0.0 | 12.0 |


In [69]:
adni_df[0].describe() #not necessarily useful here, but a generally useful function.

| | ID | Baseline Dx | 1y Dx | End-of-study Dx | Genetic Info (APOE4) | bl Age | gender | education year | marriage | bl ADAS11 | ... | ft RAVLT_learning | ft RAVLT_forgetting | ft RAVLT_perc_forgetting | ft FAQ | ft CDR | ft MMSE | bl Intracranial Volume | ft Intracranial Volume | end-of-study is_convert | number of months from bl |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1865.0 | 1865.0 | 1865.0 | 1865.0 | 1865.0 | 1865.0 | 1865.0 | 1865.0 | 1865.0 | 1865.0 | ... | 1865.0 | 1865.0 | 1865.0 | 1865.0 | 1865.0 | 1865.0 | 1865.0 | 1865.0 | 1865.0 | 1865.0 |
| mean | 1991.796247 | 1.812869 | 1.87882 | 2.06756 | 0.557105 | -0.315146 | 1.453619 | -0.145452 | 2.286863 | 1.585068 | ... | -0.655937 | 0.293209 | 0.862821 | 6.139998 | 6.665832 | -1.683046 | 1529524.0 | 1528636.0 | 0.08311 | 12.0 |
| std | 1760.258624 | 0.665558 | 0.728266 | 0.814135 | 0.666556 | 1.120308 | 0.497978 | 1.013903 | 0.701595 | 2.242841 | ... | 1.179793 | 0.942934 | 1.280223 | 9.428528 | 8.647229 | 2.908585 | 159874.4 | 160560.1 | 0.276122 | 0.0 |
| min | 3.0 | 1.0 | 1.0 | 1.0 | 0.0 | -3.398167 | 1.0 | -3.813178 | 0.0 | -1.937757 | ... | -3.316467 | -5.626705 | -14.437788 | -0.26601 | -0.273224 | -16.932271 | 716133.0 | 716133.0 | 0.0 | 12.0 |
| 25% | 522.0 | 1.0 | 1.0 | 1.0 | 0.0 | -1.049737 | 1.0 | -0.890438 | 2.0 | -0.099709 | ... | -1.6344 | -0.520786 | -0.228355 | -0.26601 | -0.273224 | -2.428334 | 1419560.0 | 1418240.0 | 0.0 | 12.0 |
| 50% | 1155.0 | 2.0 | 2.0 | 2.0 | 0.0 | -0.261903 | 1.0 | -0.159752 | 2.0 | 1.00312 | ... | -0.793366 | 0.208631 | 0.846309 | 1.043703 | 3.454657 | -0.816785 | 1519400.0 | 1518680.0 | 0.0 | 12.0 |
| 75% | 4173.0 | 2.0 | 2.0 | 3.0 | 1.0 | 0.444167 | 2.0 | 0.570933 | 2.0 | 2.719856 | ... | 0.047667 | 0.938048 | 2.279192 | 10.211697 | 10.910419 | 0.794764 | 1632775.0 | 1632270.0 | 0.0 | 12.0 |
| max | 5147.0 | 3.0 | 3.0 | 3.0 | 2.0 | 2.777301 | 2.0 | 1.301618 | 4.0 | 12.031405 | ... | 3.411802 | 4.220425 | 2.279192 | 39.025391 | 51.917107 | 0.794764 | 2110290.0 | 2108790.0 | 1.0 | 12.0 |


In [70]:
adni_df[0].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1865 entries, 0 to 1864
Data columns (total 34 columns):
ID                          1865 non-null int64
Baseline Dx                 1865 non-null int64
1y Dx                       1865 non-null int64
End-of-study Dx             1865 non-null int64
Baseline atrophy            1865 non-null object
Future atrophy              1865 non-null object
Genetic Info (APOE4)        1865 non-null float64
bl Age                      1865 non-null float64
gender                      1865 non-null float64
education year              1865 non-null float64
marriage                    1865 non-null float64
bl ADAS11                   1865 non-null float64
bl ADAS13                   1865 non-null float64
bl RAVLT_immediate          1865 non-null float64
bl RAVLT_learning           1865 non-null float64
bl RAVLT_forgetting         1865 non-null float64
bl RAVLT_perc_forgetting    1865 non-null float64
bl FAQ                      1865 non-null float64
bl 

So every column has 1865 non-null entries, meaning there are no missing values, which is good!
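
A more direct way to check for missing values is to count them per column with `isnull().sum()`. The sketch below runs on a small toy dataframe with one value deliberately left out; on the real data the same two lines would be applied to `adni_df[0]`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for adni_df[0]; one value is deliberately missing.
toy = pd.DataFrame({"ID": [3, 4, 5], "bl Age": [0.92, np.nan, 0.0006]})

missing_per_column = toy.isnull().sum()   # number of missing values in each column
print(missing_per_column)
print(int(missing_per_column.sum()))      # total number of missing values
```

A total of zero confirms what `info()` shows above.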