# Data preparation

In this notebook, each file from the `/data/raw` folder will be processed using the following steps:

1. Dataframe general overview.
2. Column processing: drop, rename, order.
3. Missing data treatment.

Then, if possible, datasets will be merged and saved in the `/data/processed` folder.  

Lastly, a dataset card with all relevant information will be created for each dataset.
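
The merge step itself is not shown in this notebook. As a minimal sketch, assuming two cleaned frames whose column names have already been harmonized, the shared columns could be stacked with a label recording where each row came from. The toy frames `df_a` and `df_b` and the `source` column are illustrative assumptions, not part of the original data.

```python
import pandas as pd

# Toy stand-ins for two cleaned datasets (values are illustrative only)
df_a = pd.DataFrame({'Age': [71, 61], 'Gender': [1, 0], 'MMSE': [28, 30], 'BMI': [22.7, 22.1]})
df_b = pd.DataFrame({'Age': [64.8, 83.4], 'Gender': [0, 1], 'MMSE': [28, 27]})

# Keep only the columns both frames share, preserving df_a's column order
shared = [col for col in df_a.columns if col in df_b.columns]

# Stack the rows and record which dataset each row came from
merged = pd.concat(
    [df_a[shared].assign(source='a'), df_b[shared].assign(source='b')],
    ignore_index=True,
)
print(merged)
```

Columns that exist in only one dataset (here `BMI`) are dropped by this approach; whether that is acceptable depends on the analysis.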

## 1. Data cleaning

In [1]:
import pandas as pd
import re
pd.options.display.max_columns = None

### Dataset 1: alzheimers_disease_data

In [117]:
# General overview

df_ad = pd.read_csv('../data/raw/alzheimers_disease_data.csv')
df_ad.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2149 entries, 0 to 2148
Data columns (total 35 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   PatientID                  2149 non-null   int64  
 1   Age                        2149 non-null   int64  
 2   Gender                     2149 non-null   int64  
 3   Ethnicity                  2149 non-null   int64  
 4   EducationLevel             2149 non-null   int64  
 5   BMI                        2149 non-null   float64
 6   Smoking                    2149 non-null   int64  
 7   AlcoholConsumption         2149 non-null   float64
 8   PhysicalActivity           2149 non-null   float64
 9   DietQuality                2149 non-null   float64
 10  SleepQuality               2149 non-null   float64
 11  FamilyHistoryAlzheimers    2149 non-null   int64  
 12  CardiovascularDisease      2149 non-null   int64  
 13  Diabetes                   2149 non-null   int64

In [118]:
# Column processing

# Column selection
df_ad.drop(columns = ['DoctorInCharge'], inplace = True)

In [121]:
# Translate numeric category codes to string labels
ethnicity = ['Caucasian', 'African American', 'Asian', 'Other']
education = ['None', 'High School', 'Bachelors', 'Higher']

df_ad['Ethnicity'] = df_ad['Ethnicity'].apply(lambda x: ethnicity[x])
df_ad['EducationLevel'] = df_ad['EducationLevel'].apply(lambda x: education[x])
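
Indexing a list inside `apply` raises an `IndexError` for an out-of-range code and silently accepts negative codes. A possible alternative, sketched here on toy data, is `Series.map` with a dictionary, which leaves unknown codes as `NaN` so they can be inspected. The dictionary name `ethnicity_labels` and the sample codes are illustrative; the labels follow the order of the list above.

```python
import pandas as pd

# Code-to-label mapping, in the same order as the `ethnicity` list
ethnicity_labels = {0: 'Caucasian', 1: 'African American', 2: 'Asian', 3: 'Other'}

codes = pd.Series([0, 3, 1, 7])  # 7 is a deliberately invalid code

# Codes missing from the dictionary become NaN instead of raising an error
labels = codes.map(ethnicity_labels)
print(labels)
print('Unmapped codes:', labels.isna().sum())
```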

In [122]:
df_ad

Unnamed: 0,PatientID,Age,Gender,Ethnicity,EducationLevel,BMI,Smoking,AlcoholConsumption,PhysicalActivity,DietQuality,SleepQuality,FamilyHistoryAlzheimers,CardiovascularDisease,Diabetes,Depression,HeadInjury,Hypertension,SystolicBP,DiastolicBP,CholesterolTotal,CholesterolLDL,CholesterolHDL,CholesterolTriglycerides,MMSE,FunctionalAssessment,MemoryComplaints,BehavioralProblems,ADL,Confusion,Disorientation,PersonalityChanges,DifficultyCompletingTasks,Forgetfulness,Diagnosis
0,4751,73,0,Caucasian,Bachelors,22.927749,0,13.297218,6.327112,1.347214,9.025679,0,0,1,1,0,0,142,72,242.366840,56.150897,33.682563,162.189143,21.463532,6.518877,0,0,1.725883,0,0,0,1,0,0
1,4752,89,0,Caucasian,,26.827681,0,4.542524,7.619885,0.518767,7.151293,0,0,0,0,0,0,115,64,231.162595,193.407996,79.028477,294.630909,20.613267,7.118696,0,0,2.592424,0,0,0,0,1,0
2,4753,73,0,Other,High School,17.795882,0,19.555085,7.844988,1.826335,9.673574,1,0,0,0,0,0,99,116,284.181858,153.322762,69.772292,83.638324,7.356249,5.895077,0,0,7.119548,0,1,0,1,0,0
3,4754,74,1,Caucasian,High School,33.800817,1,12.209266,8.428001,7.435604,8.392554,0,0,0,0,0,0,118,115,159.582240,65.366637,68.457491,277.577358,13.991127,8.965106,0,1,6.481226,0,0,0,0,0,0
4,4755,89,0,Caucasian,,20.716974,0,18.454356,6.310461,0.795498,5.597238,0,0,0,0,0,0,94,117,237.602184,92.869700,56.874305,291.198780,13.517609,6.045039,0,0,0.014691,0,0,1,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2144,6895,61,0,Caucasian,High School,39.121757,0,1.561126,4.049964,6.555306,7.535540,0,0,0,0,0,0,122,101,280.476824,94.870490,60.943092,234.520123,1.201190,0.238667,0,0,4.492838,1,0,0,0,0,1
2145,6896,75,0,Caucasian,Bachelors,17.857903,0,18.767261,1.360667,2.904662,8.555256,0,0,0,0,0,0,152,106,186.384436,95.410700,93.649735,367.986877,6.458060,8.687480,0,1,9.204952,0,0,0,0,0,1
2146,6897,77,0,Caucasian,High School,15.476479,0,4.594670,9.886002,8.120025,5.769464,0,0,0,0,0,0,115,118,237.024558,156.267294,99.678209,294.802338,17.011003,1.972137,0,0,5.036334,0,0,0,0,0,1
2147,6898,78,1,Other,High School,15.299911,0,8.674505,6.354282,1.263427,8.322874,0,1,0,0,0,0,103,96,242.197192,52.482961,81.281111,145.253746,4.030491,5.173891,0,0,3.785399,0,0,0,0,1,1


In [5]:
# Column rename
add_ = lambda s: re.sub(r'(?<=[a-z])([A-Z])', r'_\1', s)
df_ad.columns = [add_(col) for col in df_ad.columns]

new_col_ad = {
    'Diagnosis': 'DX',
    'Cardiovascular_Disease': 'CVD'
}

# df_ad.rename(columns = new_col_ad, inplace = True)
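
To make the effect of the renaming regex easier to see, the sketch below applies the same pattern to a few column names from this dataset. The function name `add_underscores` is a hypothetical stand-in for the `add_` lambda.

```python
import re

def add_underscores(name):
    """Insert '_' before each capital letter that directly follows a lowercase letter."""
    return re.sub(r'(?<=[a-z])([A-Z])', r'_\1', name)

# Acronyms such as BMI or MMSE stay intact because no lowercase letter precedes their capitals
for col in ['PatientID', 'CardiovascularDisease', 'SystolicBP', 'BMI', 'MMSE']:
    print(col, '->', add_underscores(col))
```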



### Dataset 2: blood_marker

In [123]:
# General overview

df_biomarker = pd.read_excel('../data/raw/blood_marker.xlsx')
df_biomarker.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 113 entries, 0 to 112
Data columns (total 17 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Name number                  113 non-null    object 
 1   sex                          113 non-null    object 
 2   age                          113 non-null    int64  
 3   Height/meter                 113 non-null    float64
 4   weight                       113 non-null    float64
 5   BMI                          113 non-null    float64
 6   education years              113 non-null    int64  
 7   smoking 1=yes 0=no           113 non-null    int64  
 8   Drinking 1=yes 0=no          113 non-null    int64  
 9   Hypertension 1=yes 0=no      113 non-null    int64  
 10  Coronary disease 1=yes 0=no  113 non-null    int64  
 11  Diabetes 1=yes 0=no          113 non-null    int64  
 12  MMSE Score                   113 non-null    int64  
 13  MoCA Score          

In [124]:
# Column processing

# Column selection
df_biomarker.drop(columns = ['Height/meter', 'weight'], inplace = True)

# Change 'sex' column to binary
df_biomarker['sex'] = df_biomarker['sex'].apply(lambda x: 1 if x == 'female' else 0)

# Insert column for main diagnosis: the patient ID with its digits removed (e.g. 'CU1' -> 'CU')
dx = df_biomarker['Name number'].apply(lambda x: re.sub('[0-9]', '', x).strip())
df_biomarker.insert(1, 'DX', dx)

# Column rename
new_col_biomarker = {
    'Name number': 'Patient_ID',
    'sex': 'Gender',
    'age': 'Age',
    'education years': 'Education_yrs',
    'smoking 1=yes 0=no': 'Smoking',
    'Drinking 1=yes 0=no': 'Drinking',
    'Hypertension 1=yes 0=no': 'Hypertension',
    'Coronary disease 1=yes 0=no': 'CVD',
    'Diabetes 1=yes 0=no': 'Diabetes',
    'MMSE Score': 'MMSE',
    'MoCA Score': 'MOCA',
    'Plasma GFAP': 'Plasma_GFAP',
    'Plasma NfL': 'Plasma_NfL',
    'Plasma p-tau181': 'Plasma_ptau181'
}

df_biomarker.rename(columns = new_col_biomarker, inplace = True)
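
The `apply` calls in this cell could also be written with vectorized string methods. The sketch below uses toy values; the ID `'MCI 12'` with an inner space and the mixed-case `sex` entries are assumptions made to show how stripping and lowercasing guard against inconsistent input.

```python
import pandas as pd

ids = pd.Series(['CU1', 'MCI 12', 'AD14'])
sex = pd.Series(['female', 'Male', ' Female '])

# Remove every digit from the ID, then trim surrounding whitespace
dx = ids.str.replace(r'\d+', '', regex=True).str.strip()

# Normalize case and whitespace before comparing, then encode female as 1
gender = sex.str.strip().str.lower().eq('female').astype(int)

print(dx.tolist())
print(gender.tolist())
```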


In [125]:
df_biomarker

Unnamed: 0,Patient_ID,DX,Gender,Age,BMI,Education_yrs,Smoking,Drinking,Hypertension,CVD,Diabetes,MMSE,MOCA,Plasma_GFAP,Plasma_NfL,Plasma_ptau181
0,CU1,CU,1,71,22.656250,10,0,0,1,0,1,28,23,187.788983,44.382631,3.530901
1,CU2,CU,0,61,22.093170,14,1,1,1,0,0,30,30,129.526091,13.127498,2.684318
2,CU3,CU,0,55,25.734393,10,1,1,0,0,0,29,27,57.363792,10.554058,2.670783
3,CU4,CU,1,53,19.879103,10,0,0,0,0,0,30,28,88.835118,16.894295,1.310089
4,CU5,CU,0,74,25.711662,7,1,0,1,0,1,30,27,160.402572,25.697172,3.562334
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
108,AD11,AD,1,77,19.531250,8,0,0,0,0,1,12,10,158.630000,58.310000,2.830000
109,AD12,AD,1,75,24.444444,12,0,0,0,0,0,24,20,180.050000,23.010000,2.210000
110,AD13,AD,0,81,22.491349,16,1,1,0,0,0,22,15,295.310000,69.180000,3.490000
111,AD14,AD,0,90,20.399714,16,1,1,1,0,1,22,18,377.460000,54.290000,3.040000


In [126]:
# Save clean dataset

# df_biomarker.to_csv('../data/processed/biomarker_data.csv', index = False)
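
When the save step is enabled, `to_csv` fails if the target folder does not exist. A minimal sketch of a safer save, using a temporary folder as a stand-in for `../data/processed` and a toy frame in place of `df_biomarker`:

```python
import tempfile
from pathlib import Path

import pandas as pd

df_clean = pd.DataFrame({'Patient_ID': ['CU1', 'AD14'], 'Age': [71, 90]})  # toy data

# A temporary folder stands in for ../data/processed in this sketch
out_dir = Path(tempfile.mkdtemp()) / 'processed'
out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing

out_path = out_dir / 'biomarker_data.csv'
df_clean.to_csv(out_path, index=False)

# Read the file back to check that rows and columns survived the round trip
reloaded = pd.read_csv(out_path)
print(reloaded.shape, list(reloaded.columns))
```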

### Dataset 3: baseline_data

In [152]:
df_bl = pd.read_csv('../data/raw/baseline_data.csv', index_col = 0)
df_bl['RID'].value_counts()
df_bl.loc[df_bl['RID'] == 2002]

Unnamed: 0,RID,VISCODE,time,BLPLASMA,DXCHANGE,GROUP_abeta,DX,AGE,PTGENDER,PTEDUCAT,APOE4,GROUP,DX_bl,DXX,EXAMDATE.key,PLASMAPTAU181,FDG,AV45,ABETA42,TAU,PTAU,TAUAB,PTAUAB,MMSE,MOCA,ADNI_MEM,ADNI_EF,ADNI_LAN,ADNI_VS,TOTAL_HIPPO,TOTAL_WMH,TOTAL_BRAIN,INFARCT,CVD,Cancer,Current.Depression,DM2,Hypertension,lipid,smoking,stroke,BLFDG,BLAV45,BLABETA42,BLTAU,BLPTAU,BLTAUAB,BLPTAUAB,BLMMSE,BLMOCA,BLMEM,BLEF,BLLAN,BLVS,BLHIPPO,BLWMH,BLBRAIN
1,2002,bl,0,6.777,2,N,3,64.8,0,16,0,E,EMCI,MCI,2010/6/16,6.777,1.20619,0.9784,1504.0,170.1,14.73,0.113098,0.009794,28,28.0,1.558,1.452,1.922,0.706,6.9168,3.3276,1397.85,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.20619,0.9784,1504.0,170.1,14.73,0.113098,0.009794,28,28.0,1.558,1.452,1.922,0.706,6.9168,3.3276,1397.85


In [107]:
def edu_clasif(years):
    """Map years of education to the education labels used in Dataset 1."""
    if years > 18:
        return 'Higher'
    if years >= 16:
        return 'Bachelors'
    if years >= 12:
        return 'High School'
    return 'None'

# Share of each education level, in percent
df_bl['PTEDUCAT'].apply(edu_clasif).value_counts(normalize=True) * 100

PTEDUCAT
Bachelors      47.066493
High School    30.638853
Higher         21.121252
None            1.173403
Name: proportion, dtype: float64
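
The same thresholds can be applied to the whole column at once with `numpy.select`, which checks the conditions in order and takes the first match, mirroring the `if`/`elif` chain. This is a sketch on toy values, not the notebook's own method.

```python
import numpy as np
import pandas as pd

years = pd.Series([8, 12, 15, 16, 18, 19, 20])  # toy years of education

# Conditions are evaluated in order; the first one that holds decides the label
conditions = [years > 18, years >= 16, years >= 12]
choices = ['Higher', 'Bachelors', 'High School']
levels = pd.Series(np.select(conditions, choices, default='None'), index=years.index)

print(levels.tolist())
```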

In [153]:
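# Column selection
# Columns kept for the analysis
to_keep = ['RID', 'AGE', 'PTGENDER', 'PTEDUCAT', 'PLASMAPTAU181', 'DX_bl', 'DXX', 'INFARCT', 'CVD', 'Cancer', 'Current.Depression', 'DM2', 'Hypertension', 'smoking', 'stroke']
# Columns still under consideration; they are not kept for now
doubt = ['GROUP_abeta', 'APOE4', 'lipid', 'ADNI_MEM', 'ADNI_EF', 'ADNI_LAN', 'ADNI_VS']

# .copy() avoids SettingWithCopyWarning on later in-place edits
df_bl = df_bl[to_keep].copy()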

# # Column rename
# df_bl.columns = [col.lower().capitalize() if col not in ['CVD', 'MMSE', 'MOCA'] else col for col in df_bl.columns]

# new_col_bl = {
#     'Rid': 'Patient_ID',
#     'Ptgender': 'Gender',
#     'Pteducat': 'Education_yrs',
#     'Plasmaptau181': 'Plasma_ptau181',
#     'Dm2': 'Diabetes',
# }

# df_bl.rename(columns = new_col_bl, inplace = True)


In [137]:
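# Check whether ADNI_MEM duplicates its baseline copy BLMEM.
# Run this on the full dataframe, before the column selection drops both columns.
df_bl['ADNI_MEM'].equals(df_bl['BLMEM'])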

True
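
The single comparison above can be extended to every pair of columns. The sketch below runs on a toy frame; the values are made up, and only the column names echo the baseline dataset.

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame({
    'MMSE': [28, 27, 30],
    'BLMMSE': [28, 27, 30],
    'FDG': [1.2, None, 1.1],
    'BLFDG': [1.2, None, 1.1],
    'AGE': [64.8, 83.4, 62.9],
})

# Series.equals ignores the column name and treats NaNs in the same position as equal
duplicates = [(a, b) for a, b in combinations(df.columns, 2) if df[a].equals(df[b])]
print(duplicates)
```

Each reported pair is a candidate for dropping one of the two columns.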

In [157]:
df_bl

Unnamed: 0,RID,AGE,PTGENDER,PTEDUCAT,PLASMAPTAU181,DX_bl,DXX,INFARCT,CVD,Cancer,Current.Depression,DM2,Hypertension,smoking,stroke
1,2002,64.8,0,16,6.777,EMCI,MCI,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0
7,2007,83.4,1,20,37.897,EMCI,MCI,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0
12,2010,62.9,1,20,23.263,EMCI,MCI,0.0,,,,,,,
17,2018,76.4,1,18,10.252,EMCI,MCI,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0
21,2022,66.0,0,18,16.576,EMCI,MCI,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3116,5289,59.7,1,16,8.672,SMC,CN,,,,,,,,
3120,5290,67.0,1,12,18.583,SMC,CN,,1.0,1.0,1.0,0.0,1.0,0.0,0.0
3123,5292,74.3,1,13,17.408,SMC,CN,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0
3127,5295,75.5,1,15,10.932,SMC,CN,,,,,,,,
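
Several rows above have no medical history values at all. Before choosing a missing data treatment, it may help to quantify the gaps. The sketch below uses a toy frame with a subset of the columns; the values are illustrative and the choice of `flags` is an assumption.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'RID': [2002, 2010, 5289, 5290],
    'INFARCT': [1.0, 0.0, np.nan, np.nan],
    'CVD': [1.0, np.nan, np.nan, 1.0],
    'stroke': [0.0, np.nan, np.nan, 0.0],
})

# Share of missing values per column, in percent
missing_pct = df.isna().mean().mul(100).round(1)
print(missing_pct)

# Rows where every medical history flag is missing
flags = ['INFARCT', 'CVD', 'stroke']
all_missing = df[flags].isna().all(axis=1)
print(df.loc[all_missing, 'RID'].tolist())
```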


### Dataset 4: ADNIMERGE

In [12]:
df_adni = pd.read_csv('../data/raw/ADNIMERGE.csv', low_memory=False)
df_adni['DX'].value_counts()
# cond = (df_adni['RID'] == 2002) & (df_adni['VISCODE'] == 'bl')
# df_adni.loc[cond]

DX
MCI         4614
CN          3570
Dementia    2315
Name: count, dtype: int64
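
The commented-out filter on `VISCODE == 'bl'` suggests that ADNIMERGE holds several visits per participant, in which case the counts above would describe visits rather than participants. A minimal sketch of restricting to baseline rows, on toy data (the follow-up visit codes and diagnoses are illustrative):

```python
import pandas as pd

visits = pd.DataFrame({
    'RID': [2002, 2002, 2007, 2007],
    'VISCODE': ['bl', 'm06', 'bl', 'm12'],
    'DX': ['MCI', 'MCI', 'CN', 'MCI'],
})

# Keep only the baseline visit of each participant
baseline = visits.loc[visits['VISCODE'] == 'bl']

# Look up a single participant at baseline
one_patient = visits.loc[(visits['RID'] == 2002) & (visits['VISCODE'] == 'bl')]

print(baseline['DX'].value_counts())
```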

## 2. Dataset Cards

#### 1. Dataset ad_data

* **Dataset**: ad_data.csv
* **Description**: This dataset contains health information for 2,149 patients. It includes demographic details, lifestyle factors, medical history, clinical measurements, cognitive and functional assessments, symptoms, and a diagnosis of Alzheimer's Disease.
* **Time frame**:  
* **Source**: [Alzheimer's Disease Dataset](https://www.kaggle.com/datasets/rabieelkharoua/alzheimers-disease-dataset)

<br>

| Column | Description | Variable type | Relevance | Notes |
|--------|-------------|---------------|-----------|-------|
| | | | | |
| | | | | |
| | | | | |
| | | | | |

#### 2. Dataset biomarker_data

* **Dataset**: biomarker_data.csv
* **Description**: This dataset contains information for 113 patients. It includes the main diagnosis, demographic details, lifestyle factors, cognitive assessments, and blood marker measurements.
* **Time frame**: unknown / 2024
* **Source**: [Blood marker data](https://figshare.com/articles/dataset/Blood_marker_data_XLSX/26316985?file=47733910)  

<br>

| Column         | Description                                | Variable type | Relevance | Notes |
|----------------|--------------------------------------------|---------------|-----------|-------|
| Patient_ID     | ID, mix of letters and numbers             | -             | -         | - |
| DX             | Main neurological diagnosis of the patient | Categoric     | 0         | CU = Cognitively Unimpaired, MCI = Mild Cognitive Impairment, AD = Alzheimer's Disease |
| Gender         | Gender of the patient                      | Binary        | 3         | 0 = male, 1 = female |
| Age            | Age of the patient                         | Numeric       | 1         | - |
| BMI            | Body Mass Index of the patient             | Numeric       | 1         | - |
| Education_yrs  | Number of years of formal education        | Numeric       | 3         | - |
| Smoking        | Smoking status                             | Binary        | 3         | 0 = no, 1 = yes |
| Drinking       | Alcohol consumption status                 | Binary        | 3         | 0 = no, 1 = yes |
| Hypertension   | Diagnosed hypertension                     | Binary        | 2         | 0 = no, 1 = yes |
| CVD            | Diagnosed cardiovascular (coronary) disease| Binary        | 2         | 0 = no, 1 = yes |
| Diabetes       | Diagnosed diabetes                         | Binary        | 2         | 0 = no, 1 = yes |
| MMSE           | Mini-Mental State Examination Score        | Numeric       | 0         | Range 0-30 |
| MOCA           | Montreal Cognitive Assessment Score        | Numeric       | 0         | Range 0-30 |
| Plasma_GFAP    | GFAP plasma quantification (pg/mL)         | Numeric       | 1         | - |
| Plasma_NfL     | NfL plasma quantification (pg/mL)          | Numeric       | 1         | - |
| Plasma_ptau181 | P-tau181 plasma quantification (pg/mL)     | Numeric       | 0         | - |
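
The empty card tables in this section could be pre-filled from a dataframe. The helper below is a hypothetical sketch, not part of the notebook: it emits one row per column and guesses the variable type, leaving description, relevance, and notes to be completed by hand.

```python
import pandas as pd

def card_skeleton(df):
    """Return a markdown table with one row per column of df (hypothetical helper)."""
    lines = [
        '| Column | Description | Variable type | Relevance | Notes |',
        '|--------|-------------|---------------|-----------|-------|',
    ]
    for col in df.columns:
        values = df[col].dropna()
        if len(values) > 0 and values.isin([0, 1]).all():
            kind = 'Binary'      # only 0/1 values observed
        elif pd.api.types.is_numeric_dtype(df[col]):
            kind = 'Numeric'
        else:
            kind = 'Categoric'
        lines.append(f'| {col} | | {kind} | | |')
    return '\n'.join(lines)

toy = pd.DataFrame({'DX': ['CU', 'AD'], 'Gender': [1, 0], 'Age': [71, 90]})
print(card_skeleton(toy))
```

The guessed type is only a starting point; a numeric column that happens to contain only 0 and 1 would be labeled Binary.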

#### 3. Dataset adni_data

* **Dataset**: adni_data.csv
* **Description**: This dataset contains baseline data from the ADNI-based study cited below. After column selection it includes the participant ID (`RID`), demographic details (age, gender, years of education), the plasma p-tau181 level, diagnosis labels, and medical history flags such as cardiovascular disease, type 2 diabetes, hypertension, smoking, and stroke.
* **Time frame**:  
* **Source**: [Plasma p-tau181 Level Predicts Neurodegeneration and Progression to Alzheimer's Dementia: A Longitudinal Study](https://figshare.com/articles/dataset/Data_Sheet_1_Plasma_p-tau181_Level_Predicts_Neurodegeneration_and_Progression_to_Alzheimer_s_Dementia_A_Longitudinal_Study_ZIP/16576709?file=30681404)

<br>

| Column | Description | Variable type | Relevance | Notes |
|--------|-------------|---------------|-----------|-------|
| | | | | |
| | | | | |
| | | | | |
| | | | | |