# New PPMI patients

We will use this notebook to get more patients from the PPMI dataset. We currently have 48 patients per group. **The goal is to add as many new patients as we can, up to 72 patients per group.**

In [1]:
import os
import pandas as pd
import numpy as np

In [3]:
def print_variable_stats(df):
    """Print sex, age, H&Y stage, and size summaries per group (1 = progressive, 0 = stable)."""
    prog = df[df["group"] == 1]
    stable = df[df["group"] == 0]

    # gen is coded 1 = male, 2 = female
    print(f"Progressive group - M/F gender: {(prog['gen'] == 1).sum()}/{(prog['gen'] == 2).sum()}")
    print(f"Stable group - M/F gender: {(stable['gen'] == 1).sum()}/{(stable['gen'] == 2).sum()}")

    print(f"Progressive group - Age: {prog['age'].mean()} +- {prog['age'].std()}")
    print(f"Stable group - Age: {stable['age'].mean()} +- {stable['age'].std()}")

    # H&Y stage counts are taken over the whole dataframe, not per group
    hy_stage1_bl = len(df[df["initialHY"] == 1])
    hy_stage2_bl = len(df[df["initialHY"] == 2])
    hy_stage1_3y = len(df[df["followUpHY"] == 1])
    hy_stage2_3y = len(df[df["followUpHY"] == 2])

    print(f"H&Y baseline stage 1: {hy_stage1_bl}")
    print(f"H&Y baseline stage 2: {hy_stage2_bl}")
    print(f"H&Y 3Y follow-up stage 1: {hy_stage1_3y}")
    print(f"H&Y 3Y follow-up stage 2: {hy_stage2_3y}")

    print(f"Stable dataset size: {len(stable)}")
    print(f"Progressive dataset size: {len(prog)}")

The following dataframe is loaded from a PPMI data file containing data on all scans. We will narrow it down using the following filters:

- BL & V08 H&Y stage available
- Verio scanner used

In [4]:
baselineData = "../data/ppmi-data/HY_Baseline_Stage.csv"
df = pd.read_csv(baselineData)
df = df[["PATNO", "EVENT_ID", "hy_on"]]

# Remove null and missing data
df = df.dropna(subset=['hy_on'])
df = df[df['hy_on'].apply(lambda x: isinstance(x, float))]

# Keep BL and V08 patients
df = df[(df["EVENT_ID"] == "BL") | (df["EVENT_ID"] == "V08")]

# Remove current patients
CURRENT_PATIENT_ID = [4037, 3168, 3131, 4024, 4001, 3373, 4081, 3127, 3380, 3107, 3834,
       3134, 3825, 4035, 3307, 3567, 3124, 3181, 3577, 3778, 3775, 3818,
       4029, 3559, 4020, 3387, 3771, 4012, 3111, 3154, 3835, 3366, 3831,
       3591, 3826, 4027, 4135, 4026, 3556, 3371, 3323, 3752, 3175, 3308,
       3374, 3166, 3829, 3787, 3309, 3770, 3332, 3182, 3762, 4033, 3174,
       4083, 3173, 3176, 3322, 4034, 3113, 3125, 3377, 3119, 4005, 3102,
       3587, 4022, 3552, 4082, 3824, 3802, 3150, 3118, 3757, 3585, 3185,
       3777, 3120, 3190, 3814, 3305, 3808, 3311, 3838, 3364, 3378, 3832,
       3588, 3383, 3866, 3386, 3116, 3780, 3828, 3130, 3815, 3819, 3781,
       4038, 3178, 3108, 3557, 3128, 3365, 3800, 3593, 3564, 4021, 3328,
       3830, 3367, 3575, 3126, 3758, 3823, 3760, 3325, 3132, 3385, 3822,
       3372, 3589, 3586, 3321]

df = df.loc[~df['PATNO'].isin(CURRENT_PATIENT_ID)]

# Drop subjects that do not have exactly two rows (expected: one BL and one V08)
listOfSubjectID = df["PATNO"].unique()
for subId in listOfSubjectID:
    if len(df[df["PATNO"] == subId]) != 2:
        indexNames = df[df["PATNO"] == subId].index
        df.drop(indexNames, inplace=True)
    
#     if 0 in df[df["PATNO"] == subId]["hy_on"].values:
#         indexNames = df[ df['PATNO'] == subId ].index
#         df.drop(indexNames , inplace=True)
      
df = df.rename(columns={"PATNO":"subjectId"})

print(f"There are {len(df['subjectId'].unique())} patients available.")

There are 386 patients available.
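
The per-subject loop above can also be written without explicit iteration. The following is a minimal sketch on a toy frame with the same `PATNO`, `EVENT_ID`, and `hy_on` columns. The values are made up for illustration, and `keep_complete_subjects` is a hypothetical helper name, not part of the notebook. It is slightly stricter than the loop: it requires one row for each of the two visits, not just two rows in total.

```python
import pandas as pd


def keep_complete_subjects(visits, required=("BL", "V08")):
    """Keep subjects that have exactly one row for each required visit."""
    visits = visits[visits["EVENT_ID"].isin(required)]
    per_subject = visits.groupby("PATNO")["EVENT_ID"]
    n_rows = per_subject.transform("count")        # rows per subject
    n_visits = per_subject.transform("nunique")    # distinct visits per subject
    return visits[(n_rows == len(required)) & (n_visits == len(required))]


# Toy data: subject 1 is complete, subject 2 lacks V08, subject 3 has a duplicate BL.
toy = pd.DataFrame({
    "PATNO": [1, 1, 2, 3, 3, 3],
    "EVENT_ID": ["BL", "V08", "BL", "BL", "BL", "V08"],
    "hy_on": [1.0, 2.0, 1.0, 2.0, 2.0, 2.0],
})
complete = keep_complete_subjects(toy)
print(complete)
```

Only subject 1 survives the filter in this toy example.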


The next step is to extract the list of subject IDs and pass it to PPMI. The following dataframe represents the potential candidates. **There are 75 patients in total that can be added with SIEMENS scanners. However, only 13 of them were scanned using a Verio scanner.**

In [5]:
newPPMIPatients = "../data/ppmi-data/newPatientsFound.csv"
ppmiDF = pd.read_csv(newPPMIPatients)
ppmiDF = ppmiDF.rename(columns={"Subject ID":"subjectId", "Sex": "gen", "Age": "age"})
ppmiDF = ppmiDF[["subjectId", "gen", "age"]]
# Encode sex as 1 = male, 2 = female
ppmiDF['gen'] = ppmiDF['gen'].map({'M': 1, 'F': 2})
ppmiDF

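| | subjectId | gen | age |
|---|---|---|---|
| 0 | 3105 | 1 | 68.5 |
| 1 | 3106 | 2 | 70.2 |
| 2 | 3112 | 2 | 62.7 |
| 3 | 3114 | 2 | 63.9 |
| 4 | 3115 | 1 | 61.3 |
| ... | ... | ... | ... |
| 70 | 4019 | 2 | 54.9 |
| 71 | 4025 | 2 | 49.2 |
| 72 | 4032 | 1 | 68.0 |
| 73 | 4067 | 1 | 61.6 |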


**Let's add the initial and follow-up H&Y stages for each patient, as well as the group.**

In [6]:
# Init H&Y list
baselineStageList = []
followUpStageList = []

for subId in ppmiDF["subjectId"].values:
    baselineStage = df.loc[(df['subjectId'] == subId) & (df['EVENT_ID'] == "BL")]["hy_on"].values[0]
    followUpStage = df.loc[(df['subjectId'] == subId) & (df['EVENT_ID'] == "V08")]["hy_on"].values[0]
    baselineStageList.append(baselineStage)
    followUpStageList.append(followUpStage)
    
ppmiDF["initialHY"] = baselineStageList
ppmiDF["followUpHY"] = followUpStageList
ppmiDF["group"] = (ppmiDF["initialHY"] < ppmiDF["followUpHY"]).astype(int)
print_variable_stats(ppmiDF)
ppmiDF

Progressive group - M/F gender: 6/2
Stable group - M/F gender: 44/23
Progressive group - Age: 52.425000000000004 +- 12.081124592875096
Stable group - Age: 60.17164179104478 +- 10.781828230451195
H&Y baseline stage 1: 7
H&Y baseline stage 2: 6
H&Y 3Y follow-up stage 1: 1
H&Y 3Y follow-up stage 2: 11
Stable dataset size: 67
Progressive dataset size: 8


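| | subjectId | gen | age | initialHY | followUpHY | group |
|---|---|---|---|---|---|---|
| 0 | 3105 | 1 | 68.5 | 2.0 | 2.0 | 0 |
| 1 | 3106 | 2 | 70.2 | 0.0 | 0.0 | 0 |
| 2 | 3112 | 2 | 62.7 | 0.0 | 0.0 | 0 |
| 3 | 3114 | 2 | 63.9 | 0.0 | 0.0 | 0 |
| 4 | 3115 | 1 | 61.3 | 0.0 | 0.0 | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 70 | 4019 | 2 | 54.9 | 1.0 | 2.0 | 1 |
| 71 | 4025 | 2 | 49.2 | 2.0 | 2.0 | 0 |
| 72 | 4032 | 1 | 68.0 | 0.0 | 0.0 | 0 |
| 73 | 4067 | 1 | 61.6 | 0.0 | 0.0 | 0 |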
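
The lookup loop in the cell above can be replaced by a pivot and a merge. The following is a minimal sketch, assuming each subject has exactly one `BL` and one `V08` row, as the earlier filtering step intends. `attach_hy_stages` is a hypothetical helper name and the toy values are made up for illustration.

```python
import pandas as pd


def attach_hy_stages(patients, visits):
    """Add initialHY, followUpHY and group columns to `patients`
    from the long-format `visits` table (one row per subject and visit)."""
    # Long to wide: one row per subject, one column per visit.
    wide = (
        visits.pivot(index="subjectId", columns="EVENT_ID", values="hy_on")
        .rename(columns={"BL": "initialHY", "V08": "followUpHY"})
        .reset_index()
    )
    out = patients.merge(
        wide[["subjectId", "initialHY", "followUpHY"]],
        on="subjectId",
        how="left",
    )
    # Progressive (1) if the H&Y stage increased between baseline and follow-up.
    out["group"] = (out["initialHY"] < out["followUpHY"]).astype(int)
    return out


patients = pd.DataFrame({"subjectId": [10, 11], "gen": [1, 2], "age": [60.0, 55.5]})
visits = pd.DataFrame({
    "subjectId": [10, 10, 11, 11],
    "EVENT_ID": ["BL", "V08", "BL", "V08"],
    "hy_on": [1.0, 2.0, 2.0, 2.0],
})
result = attach_hy_stages(patients, visits)
print(result)
```

With `how="left"`, a subject missing from `visits` would end up with `NaN` stages and `group` 0 instead of raising an error as the loop does, so it is worth checking for missing stages before relying on the group label.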


Load the data for the patients we currently have.

In [7]:
currentData = "../data/volume-data/preMatchVolumes.csv"
currentDf = pd.read_csv(currentData)
# Pass the column by keyword; the positional axis argument is deprecated in recent pandas
currentDf = currentDf.drop(columns="Unnamed: 0")
currentDf = currentDf[["subjectId", "gen", "age", "initialHY", "followUpHY", "group"]]
print_variable_stats(currentDf)
currentDf

Progressive group - M/F gender: 33/15
Stable group - M/F gender: 49/28
Progressive group - Age: 60.54356644710416 +- 10.246004545004064
Stable group - Age: 61.396429567870136 +- 9.26828414251294
H&Y baseline stage 1: 59
H&Y baseline stage 2: 66
H&Y 3Y follow-up stage 1: 23
H&Y 3Y follow-up stage 2: 94
Stable dataset size: 77
Progressive dataset size: 48



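| | subjectId | gen | age | initialHY | followUpHY | group |
|---|---|---|---|---|---|---|
| 0 | 4037 | 1 | 52.831492 | 1.0 | 2.0 | 1 |
| 1 | 3168 | 2 | 63.094798 | 2.0 | 3.0 | 1 |
| 2 | 3131 | 1 | 71.205479 | 2.0 | 2.0 | 0 |
| 3 | 4024 | 1 | 72.292350 | 2.0 | 2.0 | 0 |
| 4 | 4001 | 1 | 49.893151 | 2.0 | 2.0 | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 120 | 3822 | 1 | 55.994536 | 1.0 | 2.0 | 1 |
| 121 | 3372 | 1 | 71.390710 | 1.0 | 2.0 | 1 |
| 122 | 3589 | 2 | 74.909589 | 2.0 | 2.0 | 0 |
| 123 | 3586 | 1 | 62.927210 | 1.0 | 2.0 | 1 |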


In [8]:
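final_df = pd.concat([currentDf, ppmiDF])
# Note: this overwrites the file loaded into currentDf above, so re-running
# this notebook would append the new patients a second time.
final_df.to_csv("../data/volume-data/preMatchVolumes.csv")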
print_variable_stats(final_df)

Progressive group - M/F gender: 39/17
Stable group - M/F gender: 93/51
Progressive group - Age: 59.383771240375005 +- 10.793706586411549
Stable group - Age: 60.82656303281945 +- 9.98410708403054
H&Y baseline stage 1: 66
H&Y baseline stage 2: 72
H&Y 3Y follow-up stage 1: 24
H&Y 3Y follow-up stage 2: 105
Stable dataset size: 144
Progressive dataset size: 56
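
Because the combined table is written back to the same file it was read from, running the cell twice would add the new patients twice. The following is a minimal sketch of a guard against that, using a hypothetical helper `append_new_patients` and made-up toy values. It skips any subject whose `subjectId` is already present.

```python
import pandas as pd


def append_new_patients(current, new, key="subjectId"):
    """Concatenate `new` onto `current`, skipping subjects already in `current`."""
    fresh = new[~new[key].isin(current[key])]
    return pd.concat([current, fresh], ignore_index=True)


current = pd.DataFrame({"subjectId": [1, 2], "group": [0, 1]})
new = pd.DataFrame({"subjectId": [2, 3], "group": [1, 0]})

once = append_new_patients(current, new)
twice = append_new_patients(once, new)  # a second run adds nothing
print(len(once), len(twice))
```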


### Let's call our R script and perform cohort matching with our experimental larger dataset.

In [9]:
os.system("Rscript match-data.R")


Call:
matchit(formula = group ~ age + gen + initialHY, data = data, 
    method = "nearest", distance = "glm", replacement = F)

Summary of Balance for All Data:
          Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
distance         0.2859        0.2777          0.2334     0.5791    0.0754
age             59.3838       60.8266         -0.1337     1.1688    0.0541
gen              1.3036        1.3542         -0.1091     0.9346    0.0253
initialHY        1.1071        1.0278          0.2543     0.1089    0.2606
          eCDF Max
distance    0.2480
age         0.1438
gen         0.0506
initialHY   0.4306


Summary of Balance for Matched Data:
          Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
distance         0.2859        0.2855          0.0115     0.9715    0.0095
age             59.3838       59.7019         -0.0295     1.3591    0.0463
gen              1.3036        1.3393         -0.0770     0.9431    0.0179
initialHY        1.1071     

0
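
`os.system` only returns the exit status (the `0` above), so a failing script is easy to miss. The following is a minimal sketch of an alternative using `subprocess.run`, which raises an error on a non-zero exit and returns the captured output. `run_script` is a hypothetical helper name. To keep the sketch runnable without R installed, it calls the Python interpreter as a stand-in; in this notebook the command would be `["Rscript", "match-data.R"]`.

```python
import subprocess
import sys


def run_script(command):
    """Run `command` (a list of arguments), raise if it exits non-zero,
    and return its standard output as text."""
    completed = subprocess.run(command, capture_output=True, text=True, check=True)
    return completed.stdout


# Stand-in command so the sketch runs without R installed.
output = run_script([sys.executable, "-c", "print('matching done')"])
print(output)
```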

In [10]:
matchedVolumesDf = pd.read_csv("../data/volume-data/matchedVolumes.csv")
# Pass the columns by keyword; the positional axis argument is deprecated in recent pandas
matchedVolumesDf = matchedVolumesDf.drop(columns=["Unnamed: 0", "X", "distance", "weights", "subclass"])
print_variable_stats(matchedVolumesDf)

Progressive group - M/F gender: 39/17
Stable group - M/F gender: 37/19
Progressive group - Age: 59.383771240375005 +- 10.793706586411549
Stable group - Age: 59.70187340260718 +- 9.258640636077349
H&Y baseline stage 1: 59
H&Y baseline stage 2: 35
H&Y 3Y follow-up stage 1: 11
H&Y 3Y follow-up stage 2: 74
Stable dataset size: 56
Progressive dataset size: 56
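
The "Std. Mean Diff." values in the R output are consistent with dividing the difference in group means by the standard deviation of the progressive (treated) group: for age on the full data, (59.3838 - 60.8266) / 10.7937 ≈ -0.1337. The following is a minimal sketch of that calculation in pandas, assuming this denominator and using made-up toy ages. `standardized_mean_diff` is a hypothetical helper name.

```python
import pandas as pd


def standardized_mean_diff(df, covariate, group_col="group"):
    """(mean of group 1 - mean of group 0) / sample std of group 1."""
    treated = df.loc[df[group_col] == 1, covariate]
    control = df.loc[df[group_col] == 0, covariate]
    return (treated.mean() - control.mean()) / treated.std()


# Toy data: treated ages have mean 60 and std 10, control ages have mean 55.
toy = pd.DataFrame({
    "group": [1, 1, 1, 0, 0],
    "age": [50.0, 60.0, 70.0, 50.0, 60.0],
})
smd_age = standardized_mean_diff(toy, "age")
print(smd_age)
```

Applied to `matchedVolumesDf`, a value close to zero for each covariate would indicate that the matched groups are well balanced on that covariate.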


  
