# Temporal Ordering of Alzheimer’s Biomarkers: Evidence from the TADPOLE Longitudinal Cohort
## Do amyloid, tau, and neurodegenerative biomarkers appear in a predictable sequence preceding cognitive decline?

This analysis focuses on amyloid (A), tau (T), and neurodegeneration (N) biomarkers, following the A/T/N research framework proposed by the National Institute on Aging and Alzheimer’s Association (NIA-AA; Jack et al., 2018).

These biomarkers are hypothesized to follow the Alzheimer’s disease cascade:
Amyloid deposition → Tau pathology → Neurodegeneration → Cognitive decline.
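
As a rough illustration of how the cascade stages might be attached to column names, the sketch below tags a few names by keyword. The `ATN_KEYWORDS` mapping and the `atn_stage` helper are hypothetical and exist only for this example; the grouping of keywords into stages is an assumption, not a definitive assignment from the framework.

```python
# Hypothetical keyword-to-stage mapping (an assumption for illustration only).
ATN_KEYWORDS = {
    "A (amyloid)": ["ABETA", "PIB", "AV45"],
    "T (tau)": ["PTAU"],
    "N (neurodegeneration)": ["FDG", "HIPPOCAMPUS", "VENTRICLES"],
    "Cognition": ["MMSE", "ADAS"],
}


def atn_stage(column_name):
    """Return the first stage whose keyword occurs in the column name, else None."""
    upper = column_name.upper()
    for stage, keywords in ATN_KEYWORDS.items():
        if any(key in upper for key in keywords):
            return stage
    return None


# A handful of column names taken from the dataset preview and listings.
example_columns = [
    "ABETA_UPENNBIOMK9_04_19_17",
    "PTAU_UPENNBIOMK9_04_19_17",
    "Hippocampus",
    "MMSE",
    "RID",
]

stages = {col: atn_stage(col) for col in example_columns}
for col, stage in stages.items():
    print(f"{col:30s} -> {stage}")
```

Columns that match no keyword (such as `RID`) come back as `None`, which makes it easy to separate identifiers and metadata from candidate biomarkers.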

More details in the final report...

In [1]:
import pandas as pd
import numpy as np
import re #supports regular expressions (search, match, replace, etc.)

#### Create a datapath to make life easy

In [2]:
DATA_PATH = r"C:/Users/jdaly/OneDrive/Desktop/ISM645-Pred Analytics/TADPOLE/tadpole_challenge_201911210/tadpole_challenge/TADPOLE_D1_D2.csv"

#### Load the TADPOLE data, look at size/shape

In [3]:
df = pd.read_csv(DATA_PATH, low_memory=False)  # Large file: read it in one pass so dtype inference stays consistent within each column
print("Shape of data: ",df.shape)

Shape of data:  (12741, 1907)


#### Preview the data

In [4]:
print("\n--- Preview of the first 5 rows ---")
display(df.head())


--- Preview of the first 5 rows ---


Unnamed: 0,RID,PTID,VISCODE,SITE,D1,D2,COLPROT,ORIGPROT,EXAMDATE,DX_bl,...,PHASE_UPENNBIOMK9_04_19_17,BATCH_UPENNBIOMK9_04_19_17,KIT_UPENNBIOMK9_04_19_17,STDS_UPENNBIOMK9_04_19_17,RUNDATE_UPENNBIOMK9_04_19_17,ABETA_UPENNBIOMK9_04_19_17,TAU_UPENNBIOMK9_04_19_17,PTAU_UPENNBIOMK9_04_19_17,COMMENT_UPENNBIOMK9_04_19_17,update_stamp_UPENNBIOMK9_04_19_17
0,2,011_S_0002,bl,11,1,1,ADNI1,ADNI1,2005-09-08,CN,...,,,,,,,,,,
1,3,011_S_0003,bl,11,1,0,ADNI1,ADNI1,2005-09-12,AD,...,ADNI1,UPENNBIOMK9,P06-MP02-MP01,P06-MP02-MP01/2,2016-12-14,741.5,239.7,22.83,,2017-04-20 14:39:54.0
2,3,011_S_0003,m06,11,1,0,ADNI1,ADNI1,2006-03-13,AD,...,,,,,,,,,,
3,3,011_S_0003,m12,11,1,0,ADNI1,ADNI1,2006-09-12,AD,...,ADNI1,UPENNBIOMK9,P06-MP02-MP01,P06-MP02-MP01/2,2016-12-14,601.4,251.7,24.18,,2017-04-20 14:39:54.0
4,3,011_S_0003,m24,11,1,0,ADNI1,ADNI1,2007-09-12,AD,...,,,,,,,,,,


In [5]:
print("\n--- Dataset Information ---")
df.info()


--- Dataset Information ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12741 entries, 0 to 12740
Columns: 1907 entries, RID to update_stamp_UPENNBIOMK9_04_19_17
dtypes: float64(72), int64(8), object(1827)
memory usage: 185.4+ MB


In [6]:
print("\n--- Data Types Summary ---")
print(df.dtypes.value_counts())


--- Data Types Summary ---
object     1827
float64      72
int64         8
dtype: int64


In [10]:
# Fraction of missing values per column, highest first
missing_ratio = df.isna().mean().sort_values(ascending=False)
print("\n--- Top 10 Columns with Most Missing Data ---")
print(missing_ratio.head(10))


--- Top 10 Columns with Most Missing Data ---
PIB_bl           0.988541
PIB              0.982497
AV45             0.833765
FDG              0.736912
EcogSPOrgan      0.618711
EcogPtOrgan      0.614551
EcogSPDivatt     0.611333
MOCA             0.610156
EcogPtVisspat    0.608822
EcogPtDivatt     0.608665
dtype: float64


In [14]:
# Calculate missing percentage for each column
missing_percent = df.isna().mean() * 100

#Converting to dataframe and sort descending
missing_df = missing_percent.reset_index()
missing_df.columns = ['Column', 'MissingPercent']
missing_df = missing_df.sort_values(by='MissingPercent', ascending=False)
print(f"Total columns: {len(df.columns)}")
print("\n--- Top 50 columns with Most Missing Data ---")
display(missing_df.head(50))

print("\n--- Summary Stats ---")
print(missing_df.describe())

Total columns: 1907

--- Top 50 columns with Most Missing Data ---


Unnamed: 0,Column,MissingPercent
90,PIB_bl,98.854093
19,PIB,98.249745
20,AV45,83.376501
18,FDG,73.691233
42,EcogSPOrgan,61.871125
35,EcogPtOrgan,61.455145
43,EcogSPDivatt,61.133349
30,MOCA,61.015619
33,EcogPtVisspat,60.882191
36,EcogPtDivatt,60.866494



--- Summary Stats ---
       MissingPercent
count     1907.000000
mean         4.028187
std         10.206953
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max         98.854093
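
The summary shows at least three quarters of the columns at 0% missing, while 1827 of the 1907 columns are typed `object`. One possible explanation, not verified here, is that some object columns encode missing values as empty or whitespace-only strings, which `isna()` does not count. The sketch below shows the effect on a small made-up frame; the `toy` data and the regex-based cleanup are assumptions for illustration, not a confirmed property of the TADPOLE file.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for a few columns; the blank entries are invented.
toy = pd.DataFrame({
    "ABETA": ["741.5", " ", "601.4", ""],
    "TAU": ["239.7", None, "251.7", " "],
    "MMSE": [20.0, 19.0, np.nan, 18.0],
})

# Plain missingness: only None/NaN are counted.
plain_missing = toy.isna().mean() * 100

# Treat empty or whitespace-only strings as missing before counting.
cleaned = toy.replace(r"^\s*$", np.nan, regex=True)
adjusted_missing = cleaned.isna().mean() * 100

comparison = pd.DataFrame({
    "PlainPercent": plain_missing,
    "AdjustedPercent": adjusted_missing,
})
print(comparison)
```

If the two columns of `comparison` differ on the real data, the missingness report above would understate how sparse those columns are.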


In [16]:
#So many columns...moving this to a new CSV file for a full missing report.
missing_df.to_csv("missing_values_overview.csv", index=False)

#### Count biomarker candidate columns among the variables written to the CSV file

In [22]:
# Keep any column whose upper-cased name contains one of the biomarker keywords
biomarker_keywords = ['ABETA', 'TAU', 'PTAU', 'FDG', 'HIPP',
                      'MMSE', 'ADAS', 'VENT', 'PIB', 'AV45',
                      'VOLUME']
biomarker_candidate_cols = [col for col in df.columns
                            if any(key in col.upper() for key in biomarker_keywords)]
print(f"Biomarker Candidates: {len(biomarker_candidate_cols)}")
biomarker_candidate_cols[:15]



Biomarker Candidates: 293


['FDG',
 'PIB',
 'AV45',
 'ADAS11',
 'ADAS13',
 'MMSE',
 'Ventricles',
 'Hippocampus',
 'ADAS11_bl',
 'ADAS13_bl',
 'MMSE_bl',
 'Ventricles_bl',
 'Hippocampus_bl',
 'FDG_bl',
 'PIB_bl']
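
Because the filter relies on substring matching, it can be useful to audit which keyword pulled each column in. The sketch below does this with the `re` module on a few column names from the listings above. The `first_keyword` helper is hypothetical, and `EVENT_DATE` is an invented column name included only to show how a short keyword such as `VENT` could match an unrelated column.

```python
import re
from collections import Counter

KEYWORDS = ['ABETA', 'TAU', 'PTAU', 'FDG', 'HIPP', 'MMSE',
            'ADAS', 'VENT', 'PIB', 'AV45', 'VOLUME']

# One case-insensitive pattern that matches any keyword as a substring.
pattern = re.compile("|".join(re.escape(k) for k in KEYWORDS), re.IGNORECASE)


def first_keyword(column_name):
    """Return the leftmost keyword found in the column name, or None."""
    match = pattern.search(column_name)
    return match.group(0).upper() if match else None


example_columns = [
    'FDG', 'PIB_bl', 'Ventricles', 'Hippocampus_bl',
    'TAU_UPENNBIOMK9_04_19_17', 'PTAU_UPENNBIOMK9_04_19_17',
    'EXAMDATE',
    'EVENT_DATE',  # hypothetical name, not taken from the dataset
]

hits = {col: first_keyword(col) for col in example_columns}
keyword_counts = Counter(k for k in hits.values() if k is not None)

for col, key in hits.items():
    print(f"{col:30s} -> {key}")
print(keyword_counts)
```

Applied to the full candidate list, a per-keyword count like `keyword_counts` would show how the 293 columns split across keywords and flag matches worth checking by hand.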

In [25]:
# Calculate missingness for each biomarker candidate and save to a CSV file for reference
biomarker_summary = pd.DataFrame(biomarker_candidate_cols, columns=['BiomarkerColumn'])
biomarker_summary['MissingPercent'] = biomarker_summary['BiomarkerColumn'].apply(lambda col: df[col].isna().mean() * 100)
biomarker_summary = biomarker_summary.sort_values(by='MissingPercent', ascending=False)
biomarker_summary.to_csv('biomarker_candidates_with_missingness.csv', index=False)