# Biomarker Raw Data

The purpose of this notebook is to explore, filter, and clean the raw biomarker data available on the ADNI database. ADNI contains a diverse set of biomarker data. From this broader data, we focused on three biomarker data sets:

- Apolipoprotein E (ApoE) patient genotypes
- Protein measurements from patient cerebrospinal fluid (CSF)
- Chemical screenings of patient blood and urine

**Import Dependencies**

Before getting started, let's import our Python dependencies and the data files of interest.

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

# import custom dependencies
import sys
sys.path.append('../ADNI_')
from ADNI_utilities import define_terms, describe_meta_data, append_meta_cols

**Initialize Data Structures**

In [2]:
# import adni dictionary for defining terms
apo_dict = pd.read_csv("../data/Biomarker Data/APOERES_DICT.csv")

# import data set specific dictionaries
csf_dict = pd.read_csv("../data/Biomarker Data/UPENNBIOMK_MASTER_DICT.csv")
lab_dict = pd.read_csv("../data/Biomarker Data/LABDATA_DICT.csv")
adni_dict_df = pd.read_csv("../data/study info/DATADIC.csv")

In [3]:
# define dataframes from the biomarker dataset
apo_df = pd.read_csv("../data/Biomarker Data/APOERES.csv")
csf_df = pd.read_csv("../data/Biomarker Data/UPENNBIOMK_MASTER.csv")
lab_df =  pd.read_csv("../data/Biomarker Data/LABDATA.csv")

## ApoE Genotypes

The APOERES table contains information about patient alleles for the ApoE gene, which has been linked to Alzheimer's disease. More specifically, ApoE is thought to promote proteolytic degradation of amyloid-beta, and build-up of amyloid-beta is thought to be part of the pathogenesis of Alzheimer's disease. The prevalence of Alzheimer's disease is much higher in patients with one copy of the E4 allele of ApoE and is higher still in patients with two copies of E4.

Let's start by characterizing some general features of the data.

In [4]:
# describe data structure
describe_meta_data(apo_df)

Phases:	 ['ADNI1' 'ADNIGO2']
Num Entries: 2067
Num Columns: 16
Num Patients: 2067
Records per Patient: 1-1
Phases spanned per patient: 1-1
Patients w/ Duplicates: 0


It looks like there is a single entry for each patient across ADNI1, ADNIGO, and ADNI2, which should cover most of the patients in the study. With a total patient count of 2,067, nearly every patient in the study should have a measurement. We can take a look at the features to see what we might be interested in.

In [7]:
# define and print terms from apo table
term_defs = define_terms(apo_df, adni_dict_df, "APOERES")
term_defs

Unnamed: 0,FLDNAME,TYPE,TBLNAME,TEXT,CODE
0,,,,,
1,ID,N,APOERES,Record ID,"""crfname"",""ApoE Genotyping - Results"",""indexes..."
2,RID,N,APOERES,Participant roster ID,
3,SITEID,N,APOERES,Site ID,
4,VISCODE,T,APOERES,Visit code,
5,USERDATE,S,APOERES,Date record created,
6,USERDATE2,S,APOERES,Date record last updated,
7,APTESTDT,D,APOERES,Date Test Performed,
8,APGEN1,N,APOERES,Genotype - Allele 1,2..4
9,APGEN2,N,APOERES,Genotype - Allele 2,2..4


The two columns of interest here are categorical variables describing the two alleles of ApoE for each patient. The rest of the data here is metadata about the patient visit or the sample.

In [5]:
# record the columns of interest
apo_cols = ["APGEN1","APGEN2"]

The alleles should be categorical variables stored as integers (`int64`). We can see from the term codes that these should take integer values from 2 to 4, corresponding to the ApoE alleles E2, E3, and E4.

In [9]:
apo_df[apo_cols].apply(lambda x: x.unique())

Unnamed: 0,APGEN1,APGEN2
0,3,3
1,2,4
2,4,2


In [9]:
# check to ensure data types are int for categorical data
apo_df[apo_cols].dtypes

APGEN1    int64
APGEN2    int64
dtype: object

We can see from the above list of unique values that we have genotype information for both alleles for every patient in the data set.
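As a minimal sketch of how these two columns might be used later, the snippet below counts the number of E4 copies per patient and checks for missing value indicators. The toy dataframe and the `E4_COPIES` column name are hypothetical and are not part of the ADNI data; only `APGEN1`, `APGEN2`, and the 2 to 4 coding come from the table above.

```python
import pandas as pd

# hypothetical genotypes for four patients (not ADNI data)
toy_apo = pd.DataFrame({
    "RID": [1, 2, 3, 4],
    "APGEN1": [3, 3, 4, 2],
    "APGEN2": [3, 4, 4, 3],
})
allele_cols = ["APGEN1", "APGEN2"]

# count how many of the two alleles are E4 (coded as 4)
toy_apo["E4_COPIES"] = (toy_apo[allele_cols] == 4).sum(axis=1)

# flag any entry that uses a missing value indicator (-1, -4, or NaN)
missing_mask = toy_apo[allele_cols].isin([-1, -4]) | toy_apo[allele_cols].isna()
has_missing = bool(missing_mask.any().any())

print(toy_apo)
print("Any missing genotypes:", has_missing)
```

A single count column like this is one possible encoding of the dose effect described at the start of this section; whether it is preferable to keeping both allele columns is a modeling choice.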

## Cerebrospinal Fluid Measurements

The UPENN Biomarker master table contains Amyloid-beta, Tau, and pTau protein measurements taken from patient CSF. Altered levels of these proteins in the CSF have all previously been linked to Alzheimer's disease (typically lower Amyloid-beta alongside higher Tau and pTau). We can repeat the same process as above for this table.

In [10]:
# describe the meta data
describe_meta_data(csf_df)

No phases listed
Num Entries: 5876
Num Columns: 14
Num Patients: 1249
Records per Patient: 2-26
Phases spanned per patient: 0-0
Patients w/ Duplicates: 1249


The summary description above shows that we have anywhere from 2 to 26 records per patient, suggesting that these protein levels were measured periodically throughout the study.
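The per-patient counts in a summary like this can be reproduced with a simple group-by on the roster ID. The sketch below is an assumption about how such a summary could be computed, not the actual implementation of `describe_meta_data`, and the toy table is hypothetical.

```python
import pandas as pd

# hypothetical long-format table: one row per patient visit (not ADNI data)
toy_csf = pd.DataFrame({
    "RID": [10, 10, 10, 11, 11, 12],
    "VISCODE": ["bl", "m12", "m24", "bl", "m12", "bl"],
})

# number of rows recorded for each patient
records_per_patient = toy_csf.groupby("RID").size()

n_patients = int(toy_csf["RID"].nunique())
min_records = int(records_per_patient.min())
max_records = int(records_per_patient.max())
n_with_repeats = int((records_per_patient > 1).sum())

print(f"Num Patients: {n_patients}")
print(f"Records per Patient: {min_records}-{max_records}")
print(f"Patients w/ Duplicates: {n_with_repeats}")
```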

In [11]:
# define and print terms from CSF biomarker master table
term_defs = define_terms(csf_df, csf_dict, "UPENNBIOMK_MASTER")
term_defs

Unnamed: 0,FLDNAME,TYPE,TBLNAME,TEXT,CODE
0,RID,-4.0,UPENNBIOMK_MASTER,Participant roster ID,
1,VISCODE,-4.0,UPENNBIOMK_MASTER,Visit code,
2,BATCH,-4.0,UPENNBIOMK_MASTER,"Name of LONI table, corresponding to analytica...",
3,KIT,-4.0,UPENNBIOMK_MASTER,Reagents lot number,
4,STDS,-4.0,UPENNBIOMK_MASTER,Calibrators and Quality Controls lot number,
5,,,,,
6,RUNDATE,-4.0,UPENNBIOMK_MASTER,Date of analytical run,
7,ABETA,-4.0,UPENNBIOMK_MASTER,Result rescaled to UPENNBIOMK,
8,TAU,-4.0,UPENNBIOMK_MASTER,Result rescaled to UPENNBIOMK,
9,PTAU,-4.0,UPENNBIOMK_MASTER,Result rescaled to UPENNBIOMK,


There is no phase information in this table, so we won't be able to group the data by phase. Taking a quick glance at the features, the most interesting ones look like the re-scaled measurements of `ABETA`, `TAU`, and `PTAU`. The protocols for these measurements changed between ADNI phases, meaning that the re-scaled measurements are what we will want to keep if we want to make comparisons across phases.

In [16]:
# record columns for later use
csf_cols = ["ABETA","TAU","PTAU"]

ADNI uses 3 different missing value indicators: `-1`, `-4`, and `NaN`. We will want to ensure that all missing values are marked with a single indicator: `-1`. If we want to store our categorical variables as an `int` data type, we cannot use `NaN` as our indicator, since `NaN` is a floating-point value and forces the column to a `float` data type.

In [18]:
# ensure standardized missing value is compatible with int
csf_df.replace({np.nan:-1, -4:-1}, inplace=True)
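To illustrate why a single integer-compatible indicator is useful, here is a small sketch on a hypothetical column (the `CODE` column and its values are made up for illustration). It applies the same `replace` mapping as the cell above and then casts the result to `int64`.

```python
import numpy as np
import pandas as pd

# hypothetical column using all three ADNI missing value indicators
toy = pd.DataFrame({"CODE": [2.0, -4.0, np.nan, 3.0, -1.0]})

# NaN is a float, so the column is stored as float64 while NaN is present
dtype_before = toy["CODE"].dtype

# map NaN and -4 onto the single indicator -1, then cast to int
toy_clean = toy.replace({np.nan: -1, -4: -1})
toy_clean["CODE"] = toy_clean["CODE"].astype("int64")

print(dtype_before, "->", toy_clean["CODE"].dtype)
print(toy_clean["CODE"].tolist())
```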

In [19]:
# check to ensure data type is float for continuous variable
csf_df[csf_cols].dtypes

ABETA    float64
TAU      float64
PTAU     float64
dtype: object

## Laboratory Chemical Screenings

The lab master data set contains lab results from a variety of chemical tests performed on patient blood and urine. This is a very large and diverse data set that is likely to contain many predictors that we have no prior reason to expect are strongly linked to Alzheimer's disease.

In [14]:
# describe the meta data of the lab data master table
describe_meta_data(lab_df)

Phases:	 ['ADNI1' 'ADNIGO' 'ADNI2']
Num Entries: 2463
Num Columns: 131
Num Patients: 2285
Records per Patient: 1-3
Phases spanned per patient: 1-1
Patients w/ Duplicates: 171


As another example of the disorganisation and non-standardization of ADNI data, the ADNI dictionary does not contain proper definitions of the lab tests. A separate dictionary of lab codes is provided, but it is not formatted like the other ADNI dictionaries, so we define another lookup function for the lab codes.

In [15]:
# create a function to extract lab codes from the lab dict (has a different structure from other dictionaries)
def define_labcodes(df, dict_df):
    """Return the test code and description for each column of df found in dict_df."""

    keys = ["Test Code", "Test Description"]
    term_dicts = []
    for col in df.columns:

        term_dict = dict.fromkeys(keys)
        loc = (dict_df["Test Code"] == col)

        # skip columns that are not listed as lab test codes
        if any(loc):
            tmp = dict_df.loc[loc][keys]

            # keep the first unique entry for each key
            for key in keys:
                if tmp[key].unique().shape[0]:
                    term_dict[key] = tmp[key].unique()[0]
                else:
                    term_dict[key] = float('nan')

            term_dicts.append(term_dict)

    data_dict = pd.DataFrame.from_dict(term_dicts).reindex(columns=keys)
    return data_dict

In [16]:
# extract lab codes and descriptions of each test
lab_codes = define_labcodes(lab_df, lab_dict)
lab_codes.head(10)

Unnamed: 0,Test Code,Test Description
0,AXT117,Thyroid Stim. Hormone-QT
1,BAT126,Vitamin B12
2,CMT1,Color-QT
3,CMT10,Urine Nitrite-QT
4,CMT11,Leukocyte Esterase-QT
5,CMT2,Specific Gravity-QT
6,CMT3,pH-QT
7,CMT43,Blood (+)-QT
8,CMT49,Urine Protein (3+)-QT
9,CMT5,Urine Glucose-QT


From the short list above we can see that the lab results contain a diverse set of analyses of patient blood and urine, including hormone levels, protein measurements, blood sugar, and the presence of vitamins and minerals. Since we have no a priori hypothesis about which of these lab measurements might be more interesting than others, let's keep measurements from all the tests.

In [17]:
# record all lab codes and use them to extract lab code data from lab_df
lab_cols = lab_codes["Test Code"]

Let's move on to missing data for this data set.

From the ADNI website:

>**Laboratory Data**: Screening clinical lab results (i.e. urine, chemistry panel).
Data contains some character coding (i.e. SCC09: No specimen received ), and
they can be treated as missing data. (LABDATA.csv)

Keeping this in mind, we can define a function that identifies alpha-numeric entries so that they can be replaced with a missing value (`NaN` for now, so that the columns can be stored as `float`). All the data is in a string format by default, so we need to determine which strings are numeric and which are alpha-numeric.

In [18]:
# determine if a string represents a number
def is_number(string: str):

    # define valid numeric characters
    # (including decimal and negative sign)
    valid_chars = set('0123456789.-')
    string = str(string).strip()
    if set(string) - valid_chars:
        return False

    # reject strings such as '', '-', or '1.2.3' that pass the character check
    try:
        float(string)
    except ValueError:
        return False
    return True

In [19]:
# find columns of lab df with strings
str_cols = lab_df[lab_cols].dtypes == object
str_cols = lab_cols[str_cols.values]

# flag which entries of the string columns are numeric
str_isnumber = lab_df[str_cols].apply(lambda x: x.apply(is_number))

# mark non-numeric entries with the missing val (-1)
str_vals = lab_df[str_cols].values
str_vals[~str_isnumber.values] = '-1'
num_vals = str_vals.astype(float)

# store new numeric values in dataframe
lab_df[str_cols] = num_vals

# convert missing values to nan
lab_df = lab_df.replace(to_replace=-1, value=np.nan)

# look for columns where all values are missing
# and remove them from the list of columns (preserving column order)
all_missing_cols = set(str_cols[(num_vals==-1).all(0)])
lab_cols = [col for col in lab_cols if col not in all_missing_cols]

In [20]:
# check to make sure all of our lab test columns are numeric
lab_df[lab_cols].dtypes.unique()

array([dtype('float64')], dtype=object)
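As a point of comparison, pandas provides `pd.to_numeric` with `errors="coerce"`, which performs the string check and the conversion to `NaN` in one step. The sketch below uses a hypothetical column rather than the ADNI lab data. It is also slightly more permissive than `is_number`, since it accepts any string that parses as a number (for example scientific notation).

```python
import pandas as pd

# hypothetical lab column mixing numeric strings and character codes
toy_col = pd.Series(["5.2", "SCC09", "-1.5", "12"])

# errors="coerce" turns anything that cannot be parsed as a number into NaN
toy_num = pd.to_numeric(toy_col, errors="coerce")

print(toy_num)
print("Missing entries:", int(toy_num.isna().sum()))
```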

## Save cleaned biomarker data to file

Before moving on to additional analysis, we can save the dataframes with the missing values updated and the columns restricted to our columns of interest.

In [21]:
# initialize the lists of dataframes, columns of interest, and output names
all_dfs = [apo_df, csf_df, lab_df]
all_df_cols = [apo_cols, csf_cols, lab_cols]
df_names = ["apoe","csf","lab"]

# iterate over dataframes
for df, cols, name in zip(all_dfs, all_df_cols, df_names):
    
    # ensure standardized missing value
    df.replace({np.nan:-1, -4:-1}, inplace=True)
    
    # ensure RID is in column list for indexing
    cols = append_meta_cols(df.columns, cols)
    
    # write data to csv
    to_write = df[cols]
    to_write.to_csv("../data/Cleaned/" + name + "_clean.csv")
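When these files are read back in for analysis, the `-1` indicator will likely need to be converted back to `NaN` before computing statistics on continuous columns. The round trip below is a sketch under that assumption, using a hypothetical table and an in-memory buffer instead of the cleaned files. Note that `to_csv` writes the row index as an unnamed first column by default, which `index_col=0` reads back as the index.

```python
import io

import numpy as np
import pandas as pd

# hypothetical cleaned table using -1 as the missing value indicator
toy_clean = pd.DataFrame({"RID": [1, 2, 3], "ABETA": [192.0, -1.0, 140.5]})

# write to an in-memory buffer in place of a file on disk
buffer = io.StringIO()
toy_clean.to_csv(buffer)
buffer.seek(0)

# read the index back with index_col=0, then restore NaN for analysis
reloaded = pd.read_csv(buffer, index_col=0)
reloaded = reloaded.replace(-1, np.nan)

print(reloaded)
```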