# PPMI data preparation script

2018-09-10 Hirotaka Iwaki

Before starting the analysis, download the data from the PPMI LONI site.

**steps**
1. Create a one-to-one matching table of EVENT_ID and DATE for each participant.     
(DATEs in the original data are given by month only; the code below parses them as MM/YYYY. Each DATE is converted to the number of days from 1970/01/01 to the first day of that month.)
2. Pull PD diagnosis information    
in the form CurrentEnrollmentStatus_LatestDiagnosis_ChangeInDiagnosis_ImagingConfirmation_Comments,    
e.g. In__CBD_convsn_YsImg_PPMI or In__iPD_stable_YsImg_LRRK2+
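
The date conversion in step 1 can be sketched as follows. This is a minimal illustration, assuming month-resolution dates arrive as `MM/YYYY` strings (which is how the Step 1 code parses `INFODT`); the values in `infodt` are made up for the example.

```r
# Hypothetical month-resolution dates in MM/YYYY form (not real PPMI records)
infodt <- c("01/1970", "09/2018")

# Pad with day 01, parse, and count days since 1970-01-01
days <- as.numeric(as.Date(paste("01", infodt, sep = "/"), format = "%d/%m/%Y"))
print(days)  # 0 17775

# The numeric value converts back to a calendar date when needed
print(as.Date(days, origin = "1970-01-01"))
```

Storing dates as plain numbers makes differences between visits simple subtractions, at the cost of having to convert back for display.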

In [15]:
# Setting
library(data.table)
library(dplyr)
FOLDER = c("download181018")
OUTPUT = c("out181018")
options("width"=110)
dir.create(OUTPUT)

"'out181018' already exists"

## Step 1 PATNO-EVENT-DATE matching table

In [4]:
cat("Populate DATE and EVENT_ID from all the available files.\n
Skip 'Signature_Form.csv' because it contains many EVENT_ID which are not seen in other files.
Additionally, the following files will be skipped because they don't contain PATNO in the file;\n ")
FILES = dir(path = FOLDER, full.names = F, recursive = T)
TEMP=c()
for (i in seq_along(FILES)){ # seq_along() also handles an empty file list
  if(FILES[i]=="Signature_Form.csv"){next} 
  CHECK = fread(paste(FOLDER, FILES[i], sep="/"))
  if(length(grep("^PATNO$", names(CHECK)))==0){cat(FILES[i], ", ");next}
  TEMP1 = fread(paste(FOLDER, FILES[i], sep="/"), na.strings = '', header = T, quote="\"", 
                colClasses = c("PATNO"="character")) 
  if(length(grep("^EVENT_ID$|^INFODT$", names(TEMP1)))==2){
    TEMP2 = TEMP1 %>% select(PATNO, EVENT_ID, INFODT) %>% 
      mutate(DATE = as.numeric(as.Date(paste("01", INFODT, sep="/"), format="%d/%m/%Y"))) %>% # pad with day 01, then count days since 1970-01-01
      arrange(DATE) %>% 
      distinct(PATNO, EVENT_ID, .keep_all = T) %>%
      .[complete.cases(.),]%>% 
      select(PATNO, EVENT_ID, DATE) %>% mutate(FILE = FILES[i]) %>% data.frame
    if(!exists("TEMP")){TEMP=TEMP2}else{
      TEMP = bind_rows(TEMP, TEMP2)
    }
  }else{next}
}


cat("\n \n \n Check how consistent the EVENT_ID-DATE pairs are. 
Calculate P: the proportion of records for a participant's event that report a given DATE\n")
temp = TEMP %>% arrange(DATE) %>% 
  group_by(PATNO, EVENT_ID) %>% mutate(NUM_E = n()) %>% data.frame %>%  
  group_by(PATNO, EVENT_ID, DATE) %>% mutate(NUM_E_D = n()) %>% data.frame %>% 
  mutate(P = NUM_E_D / NUM_E) %>% 
  distinct(PATNO, EVENT_ID, NUM_E_D, .keep_all = T) %>%
  arrange(PATNO, DATE) 

cat("\n\nList of files containing PATNO-EVENT_ID pairs that appear only once across all files. Numbers are the counts of such obs in each file.")
temp %>%  filter(NUM_E == 1) %>% with(table(FILE, NUM_E==1))
temp %>% filter(P<1) %>% nrow %>% cat("\n", ., "observations have different DATEs for the same EVENT_ID.
For these events, take the most frequent DATE")
temp2 = temp %>% arrange(PATNO, EVENT_ID, desc(P)) %>% distinct(PATNO, EVENT_ID, .keep_all = T) 
temp2 %>% filter(P == 0.5) %>% nrow %>% cat("\nFor", ., "observations, the P of the most frequent DATE of the EVENT_ID is 0.5,
DATEs of those EVENT_IDs will be determined by the order of the filenames (because we cannot determine which is true).\n
Next, check the chronological order of EVENT_ID. 
EVENT_ID of scheduled site visits and telephone visits are used to detect inconsistency.\n")

# Code to see the names of EVENT_IDs
# unique(temp$EVENT_ID) %>% .[-grep("^T", .)] %>% paste(., collapse = '","')

Visits = c("SC", "BL", sprintf("%s%02d", "V", 1:20))
TVisits= c("TSC", "TBL", sprintf("%s%02d", "T", 1:99))
temp3.1 = temp2 %>% 
  filter(EVENT_ID %in% Visits) %>% 
  mutate(V = factor(EVENT_ID, levels = Visits)) %>% 
  arrange(PATNO, V) %>% 
  group_by(PATNO) %>% 
  mutate(DATEbef = lag(DATE, default = first(DATE)),
         DATEaft = lead(DATE, default = last(DATE))) %>% 
  data.frame %>% filter(DATE < DATEbef | DATE > DATEaft ) %>% select(-"V")
temp3.2 = temp2 %>% 
  filter(EVENT_ID %in% TVisits) %>% 
  mutate(V = factor(EVENT_ID, levels = TVisits)) %>% 
  arrange(PATNO, V) %>% 
  group_by(PATNO) %>% 
  mutate(DATEbef = lag(DATE, default = first(DATE)),
         DATEaft = lead(DATE, default = last(DATE))) %>% 
  data.frame %>% filter(DATE < DATEbef | DATE > DATEaft ) %>% select(-"V")
temp3 = bind_rows(temp3.1, temp3.2) # events that are not chronologically consistent
nrow(temp2) %>% cat("\nAmong", ., "obs, ")
nrow(temp3) %>% cat(., " will be deleted because chronological order is inconsistent")
temp4 = anti_join(temp2, temp3, by=c("PATNO", "EVENT_ID"))
cat("\nAfter excluding these obs, check chronological orders again.")
temp4 %>% 
  filter(EVENT_ID %in% Visits) %>% 
  mutate(V = factor(EVENT_ID, levels = Visits)) %>% 
  arrange(PATNO, V) %>% 
  group_by(PATNO) %>% 
  mutate(DATEdiff = DATE - lag(DATE, default = first(DATE))) %>% 
  data.frame %>% filter(DATEdiff<0) %>% select(-"V") %>% nrow %>%
  cat("\n",., ": Number of obs still problematic. 0 is expected." )
temp4 %>% 
  filter(EVENT_ID %in% TVisits) %>% 
  mutate(V = factor(EVENT_ID, levels = TVisits)) %>% 
  arrange(PATNO, V) %>% 
  group_by(PATNO) %>% 
  mutate(DATEdiff = DATE - lag(DATE, default = first(DATE))) %>% 
  data.frame %>% filter(DATEdiff<0) %>% select(-"V")%>% nrow %>%
  cat("\n",., ": Number of obs still problematic. 0 is expected." )
cat("\nSave PATNO_EVENT_DATE matching file")
EDtable = temp4 %>% select(PATNO, EVENT_ID, DATE)
write.csv(EDtable, paste(OUTPUT, "PATNO_EVENTID_DATE.csv", sep="/"), row.names=F)

Populate DATE and EVENT_ID from all the available files.

Skip 'Signature_Form.csv' because it contains many EVENT_ID which are not seen in other files.
Additionally, the following files will be skipped because they don't contain PATNO in the file;
 AV-133_SBR_Results.csv , Code_List.csv , Data_Dictionary.csv , Derived_Variable_Definitions_and_Score_Calculations.csv , FOUND_RFQ_Alcohol.csv , FOUND_RFQ_Anti-Inflammatory_Meds.csv , FOUND_RFQ_Caffeine.csv , FOUND_RFQ_Calcium_Channel_Blockers.csv , FOUND_RFQ_Female_Reproductive_Health.csv , FOUND_RFQ_Head_Injury.csv , FOUND_RFQ_Height___Weight.csv , FOUND_RFQ_Occupation.csv , FOUND_RFQ_Pesticides_Non-Work.csv , FOUND_RFQ_Pesticides_at_Work.csv , FOUND_RFQ_Physical_Activity.csv , FOUND_RFQ_Residential_History.csv , FOUND_RFQ_Smoking_History.csv , FOUND_RFQ_Toxicant_History.csv , IUSM_BIOSPECIMEN_CELL_CATALOG.csv , IUSM_CATALOG.csv , MRI_Imaging_Data_Transfer_Information_Source_Document.csv , Olfactory_UPSIT.csv , Page_Descriptions.csv , SPE

                                       
FILE                                    TRUE
  AV-133_Imaging.csv                       3
  Conclusion_of_Study_Participation.csv  372
  Contact_Information_Brain_Bank.csv     290
  Contact_Information_FOUND.csv           12
  DNA_Sample_Collection.csv                2
  DaTscan_Imaging.csv                     42
  Features_of_REM_Behavior_Disorder.csv    5
  Florbetaben_Eligibility.csv             14
  Gait_Data___Arm_swing.csv                1
  General_Medical_History.csv              5
  Genetic_Testing_Results.csv           1131
  Lumbar_Puncture_Sample_Collection.csv    5
  Magnetic_Resonance_Imaging.csv           3
  Primary_Diagnosis.csv                    1
  Research_Advance_Directive.csv           1
  Surgery_for_Parkinson_Disease.csv        7
  TAP-PD_Conclusion.csv                   12
  TAP-PD_OPDM_Assessment.csv               1
  TAP-PD_Subject_Eligibility.csv           3
  Use_of_PD_Medication.csv                 2
  Vital_Signs.c


 1558 observations have different DATEs for the same EVENT_ID.
For these events, take the most frequent DATE
For 4 observations, the P of the most frequent DATE of the EVENT_ID is 0.5,
DATEs of those EVENT_IDs will be determined by the order of the filenames (because we cannot determine which is true).

Next, check the chronological order of EVENT_ID. 
EVENT_ID of scheduled site visits and telephone visits are used to detect inconsistency.

Among 15393 obs, 2  will be deleted because chronological order is inconsistent
After excluding these obs, check chronological orders again.
 0 : Number of obs still problematic. 0 is expected.
 0 : Number of obs still problematic. 0 is expected.
Save PATNO_EVENT_DATE matching file
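
The two checks in Step 1 (taking the most frequent DATE per event, then flagging events that break visit order) can be tried on a toy table. This is a minimal sketch under stated assumptions: the records in `toy` are invented, the visit list is shortened to four events, and `consensus` and `flagged` are names chosen for illustration only.

```r
library(dplyr)

# Hypothetical records: three files report BL for participant "1" and two of
# them agree on the date; the agreed BL date is later than the V01 date.
toy <- data.frame(
  PATNO    = "1",
  EVENT_ID = c("SC", "BL", "BL", "BL", "V01", "V02"),
  DATE     = c(100, 400, 400, 150, 200, 300),
  stringsAsFactors = FALSE
)

# Step A: keep the most frequent DATE for each PATNO-EVENT_ID pair
consensus <- toy %>%
  group_by(PATNO, EVENT_ID) %>% mutate(NUM_E = n()) %>%          # records per event
  group_by(PATNO, EVENT_ID, DATE) %>% mutate(NUM_E_D = n()) %>%  # records per event and date
  ungroup() %>%
  mutate(P = NUM_E_D / NUM_E) %>%                                # share of records with this date
  arrange(PATNO, EVENT_ID, desc(P)) %>%
  distinct(PATNO, EVENT_ID, .keep_all = TRUE)

# Step B: flag events dated before the previous visit or after the next one
visits <- c("SC", "BL", sprintf("V%02d", 1:2))
flagged <- consensus %>%
  mutate(V = factor(EVENT_ID, levels = visits)) %>%
  arrange(PATNO, V) %>%
  group_by(PATNO) %>%
  mutate(DATEbef = lag(DATE, default = first(DATE)),
         DATEaft = lead(DATE, default = last(DATE))) %>%
  ungroup() %>%
  filter(DATE < DATEbef | DATE > DATEaft)

print(as.data.frame(flagged[, c("EVENT_ID", "DATE", "P")]))
```

In this toy case both BL and V01 are flagged, because the rule compares each event with its neighbours and cannot tell which side of an out-of-order pair is wrong.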

## Step 2 PD diagnosis

In [5]:
cat("Pull information from 'Patient_Status.csv'
Note: ENROLL_CAT is only given for those who enrolled. For those who did not, use RECRUITMENT_CAT instead. The resulting variable is named RECRUIT.\n\n")
temp = fread(paste(FOLDER, 'Patient_Status.csv', sep='/'), 
                 colClasses = c("PATNO"="character")) %>%
  mutate(RECRUIT = ifelse(is.na(ENROLL_CAT) | ENROLL_CAT=="", RECRUITMENT_CAT, ENROLL_CAT),
         IMG = IMAGING_CAT,
         SUBCAT = DESCRP_CAT) %>% 
  select(PATNO, RECRUIT, IMG, SUBCAT, ENROLL_DATE, ENROLL_STATUS)
cat("Check ENROLL_STATUS by contingency table. 
Note: dup > 1 indicates duplicated IDs in the file, which need to be checked.\n")
temp %>% group_by(PATNO) %>% mutate(dup = n()) %>% with(table(dup, ENROLL_STATUS, useNA = "always"))
cat("\nContingency table for cohorts and image results. RECRUIT=PD includes some with a SWEDD image. Most people in GEN have an image, but most in REG do not")
temp %>% with(table(RECRUIT, IMG, useNA = "always"))
cat("\nContingency table for RECRUIT cohort and Subcategories. RECRUIT=PRODROMA has HYP and RBD. RECRUIT=GEN/REG has genetic information
Sub-category is blank for PD/HC/SWEDD")
temp %>% with(table(RECRUIT, SUBCAT, useNA = "always"))

Pull information from 'Patient_Status.csv'
Note: ENROLL_CAT is only given for those who enrolled. For those who did not, use RECRUITMENT_CAT instead. The resulting variable is named RECRUIT.

Check ENROLL_STATUS by contingency table. 
Note: dup > 1 indicates duplicated IDs in the file, which need to be checked.


      ENROLL_STATUS
dup    Declined Enrolled Excluded Pending Withdrew <NA>
  1         111     1685      201      19      131    0
  <NA>        0        0        0       0        0    0


Contingency table for cohorts and image results. RECRUIT=PD includes some with a SWEDD image. Most people in GEN have an image, but most in REG do not

          IMG
RECRUIT    GENPD GENUN  HC  PD PRODROMA REGPD REGUN SWEDD no image <NA>
  GENPD      210     0   0   0        0     0     0     0       61    0
  GENUN        0   342   0   0        0     0     0     0       25    0
  HC           0     0 241   0        0     0     0     0        0    0
  PD           0     0   0 450        0     0     0    17       39    0
  PRODROMA     0     0   0   0      194     0     0     0       22    0
  REGPD        0     0   0   0        0     4     0     0      209    0
  REGUN        0     0   0   0        0     0     1     0      268    0
  SWEDD        0     0   0   0        0     0     0    62        2    0
  <NA>         0     0   0   0        0     0     0     0        0    0


Contingency table for RECRUIT cohort and Subcategories. RECRUIT=PRODROMA has HYP and RBD. RECRUIT=GEN/REG has genetic information
Sub-category is blank for PD/HC/SWEDD

          SUBCAT
RECRUIT        GBA+ GBA- HYP LRRK2+ LRRK2- RBD SNCA+ SNCA- <NA>
  GENPD      0   83    0   0    165      0   0    23     0    0
  GENUN      0  151    1   0    206      2   0     7     0    0
  HC       241    0    0   0      0      0   0     0     0    0
  PD       506    0    0   0      0      0   0     0     0    0
  PRODROMA   1    0    0 119      0      0  96     0     0    0
  REGPD      0   52    0   0    156      0   0     5     0    0
  REGUN      0  102    2   0    144     12   0     7     2    0
  SWEDD     64    0    0   0      0      0   0     0     0    0
  <NA>       0    0    0   0      0      0   0     0     0    0
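
The RECRUIT fallback used in this cell can be checked on a few invented rows. This is a small sketch in base R; the `status` table and its values are hypothetical, and only the `ifelse()` rule is taken from the code above.

```r
# Hypothetical rows: the second and third participants have no enrollment category
status <- data.frame(
  ENROLL_CAT      = c("PD", "", NA),
  RECRUITMENT_CAT = c("PD", "HC", "GENUN"),
  stringsAsFactors = FALSE
)

# Use ENROLL_CAT when present, otherwise fall back to RECRUITMENT_CAT
status$RECRUIT <- ifelse(is.na(status$ENROLL_CAT) | status$ENROLL_CAT == "",
                         status$RECRUITMENT_CAT, status$ENROLL_CAT)
print(status$RECRUIT)  # "PD" "HC" "GENUN"
```

Testing both `NA` and the empty string matters because a missing field may be read as either, depending on how the CSV is parsed.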

In [7]:
cat("Obtain the initial diagnosis, and the latest diagnosis from 'Primary_Diagnosis.csv'.
Further pull the initial and the last DX from 'Prodromal_Diagnostic_Questionnaire.csv'.
And 'Genetic_and_Registry_Diagnostic_Questionnaire.csv'\n\n")
dxcode=data.frame(
    CODE = c(1:18, 23, 24, 97), 
    DX = c("iPD", "Alz", "FTD_Chr17", "CBD", "DLB", "Dysto_DopaR", "ET", "Hemi_PKS", "ARPD", "MNDwPKS",
            "MSA", "DrugPKS", "NPH", "PSP", "Psychogenic", "VasPKS", "NoDisease", "SCA", "PrdrNM", "PrdrM","OtherD"))
diagTMP1=fread(paste(FOLDER, 'Primary_Diagnosis.csv', sep='/'), 
            colClasses = c("PATNO"="character")) %>%
  filter(!(is.na(PATNO))) %>% 
  mutate(DXCODE = as.numeric(PRIMDIAG)) %>% 
  select(PATNO, EVENT_ID, DXCODE) %>% 
  inner_join(., EDtable, by = c("PATNO", "EVENT_ID"))
diagTMP2=fread(paste(FOLDER, 'Prodromal_Diagnostic_Questionnaire.csv', sep='/'), 
            colClasses = c("PATNO"="character")) %>%
  filter(!(is.na(PATNO))) %>% 
  mutate(DXCODE = as.numeric(PRIMDIAG)) %>% 
  select(PATNO, EVENT_ID, DXCODE) %>% 
  inner_join(., EDtable, by = c("PATNO", "EVENT_ID"))
diagTMP3=fread(paste(FOLDER, 'Genetic_and_Registry_Diagnostic_Questionnaire.csv', sep='/'), 
            colClasses = c("PATNO"="character")) %>%
  filter(!(is.na(PATNO))) %>% 
  mutate(DXCODE = as.numeric(PRIMDIAG)) %>% 
  select(PATNO, EVENT_ID, DXCODE) %>% 
  inner_join(., EDtable, by = c("PATNO", "EVENT_ID"))
diagTMP = bind_rows(diagTMP1, diagTMP2, diagTMP3)
diag1 = diagTMP %>% 
  arrange(desc(DATE)) %>% 
  distinct(PATNO, .keep_all = T) %>%
  left_join(., dxcode, by = c("DXCODE"="CODE")) %>% 
  rename(DATE_LASTDX = DATE, DX_LAST = DX, EVENT_LAST=EVENT_ID)
diag2 = diagTMP %>% 
  arrange(DATE) %>% 
  distinct(PATNO, .keep_all = T) %>%
  left_join(., dxcode, by = c("DXCODE"="CODE")) %>% 
  rename(DATE_INITDX = DATE, DX_INIT = DX, EVENT_INIT=EVENT_ID) 
diag = full_join(diag1, diag2, by = "PATNO") %>% select(-starts_with("DXCODE"))
cat("Contingency table for DX_LAST x RECRUIT cohort. The file only covers primary PPMI cohorts (PD/HC/SWEDD)
NOTE: Most SWEDD participants are given the diagnosis of 'iPD'.")
left_join(temp, diag, by = "PATNO") %>% mutate(DX_LAST = as.character(DX_LAST)) %>% with(table(DX_LAST, RECRUIT, useNA = "always"))
cat("EVENT_ID at which the participant's initial diagnosis was provided. If not BL or SC, exclude from the cohort")
diag %>% with(table(EVENT_INIT, useNA = 'always'))
diag = diag %>% filter(EVENT_INIT == "SC" | EVENT_INIT == "BL") %>% mutate(DAYS_DXDIFF = DATE_LASTDX - DATE_INITDX)
cat("Duplicated PATNO. If any exist, they will be shown below. ")
diag %>% group_by(PATNO) %>% mutate(dup = n()) %>% filter(dup>1) %>% select(-dup)

Obtain the initial diagnosis, and the latest diagnosis from 'Primary_Diagnosis.csv'.
Further pull the initial and the last DX from 'Prodromal_Diagnostic_Questionnaire.csv'.
And 'Genetic_and_Registry_Diagnostic_Questionnaire.csv'

Contingency table for DX_LAST x RECRUIT cohort. The file only covers primary PPMI cohorts (PD/HC/SWEDD)
NOTE: Most SWEDD participants are given the diagnosis of 'iPD'.

             RECRUIT
DX_LAST       GENPD GENUN  HC  PD PRODROMA REGPD REGUN SWEDD <NA>
  CBD             0     0   1   1        0     0     0     0    0
  DLB             0     0   0   3        2     1     0     0    0
  DrugPKS         0     1   0   0        0     0     1     0    0
  ET              1    10   1   2        1     0     2     2    0
  Hemi_PKS        0     1   0   0        0     0     0     0    0
  MSA             0     0   0   5        2     0     0     0    0
  NoDisease       2   312 223   1       56     0   242     1    0
  OtherD          0     2   6   0        5     0     8     6    0
  PrdrM           0    21   0   0        9     0     5     0    0
  PrdrNM          0     4   0   0      100     0     1     0    0
  Psychogenic     1     0   0   1        0     0     0     1    0
  iPD           254    11   3 477       13   205     3    54    0
  <NA>           13     5   7  16       28     7     7     0    0

EVENT_ID at which the participant's initial diagnosis was provided. If not BL or SC, exclude from the cohort

EVENT_INIT
  BL   SC  U01 <NA> 
 465 1606    1    0 

Duplicated PATNO. If any exist, they will be shown below. 

PATNO,EVENT_LAST,DATE_LASTDX,DX_LAST,EVENT_INIT,DATE_INITDX,DX_INIT,DAYS_DXDIFF
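
The way this cell picks the initial and latest diagnosis per participant (sort by DATE, then keep the first row per PATNO) can be sketched on invented data. The `dx` table below is hypothetical; the column names follow the cell above.

```r
library(dplyr)

# Hypothetical diagnosis records; DATE is days since 1970-01-01
dx <- data.frame(
  PATNO    = c("1", "1", "1", "2"),
  EVENT_ID = c("SC", "V04", "V08", "SC"),
  DATE     = c(15000, 15400, 15800, 15100),
  DX       = c("iPD", "iPD", "MSA", "NoDisease"),
  stringsAsFactors = FALSE
)

# Latest record per participant: sort newest first, keep the first row
last_dx <- dx %>% arrange(desc(DATE)) %>% distinct(PATNO, .keep_all = TRUE) %>%
  rename(DATE_LASTDX = DATE, DX_LAST = DX, EVENT_LAST = EVENT_ID)

# Initial record per participant: sort oldest first, keep the first row
init_dx <- dx %>% arrange(DATE) %>% distinct(PATNO, .keep_all = TRUE) %>%
  rename(DATE_INITDX = DATE, DX_INIT = DX, EVENT_INIT = EVENT_ID)

both <- full_join(last_dx, init_dx, by = "PATNO") %>%
  mutate(DAYS_DXDIFF = DATE_LASTDX - DATE_INITDX) %>%
  arrange(PATNO)
print(both)
```

A participant with a single record, such as "2" here, gets the same row as both initial and latest diagnosis, so DAYS_DXDIFF is 0.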


In [9]:
cat("\nCompare the initial diagnosis and the last one. Note this contingency table includes people who only had the SC visit")
diag %>% mutate_at(vars(starts_with("DX")), as.character) %>% 
  with(table(DX_LAST, DX_INIT, useNA = "always"))
cat("\nCompare the initial diagnosis and the last one. Exclude people who only had the SC visit")
diag %>% mutate_at(vars(starts_with("DX")), as.character) %>% 
  filter(DAYS_DXDIFF>0) %>% 
  with(table(DX_LAST, DX_INIT, useNA = "always"))


Compare the initial diagnosis and the last one. Note this contingency table includes people who only had the SC visit

             DX_INIT
DX_LAST       DLB  ET Hemi_PKS NoDisease OtherD PrdrM PrdrNM Psychogenic iPD <NA>
  CBD           0   0        0         1      0     0      0           0   1    0
  DLB           2   0        0         0      0     0      2           0   2    0
  DrugPKS       0   0        0         2      0     0      0           0   0    0
  ET            0  12        0         4      0     0      0           0   3    0
  Hemi_PKS      0   0        1         0      0     0      0           0   0    0
  MSA           0   0        0         0      1     0      1           0   5    0
  NoDisease     0   2        0       832      1     1      1           0   3    0
  OtherD        0   0        0         5     16     0      1           0   5    0
  PrdrM         0   0        0        22      0     5      8           0   0    0
  PrdrNM        0   0        0         2      0     0    103           0   0    0
  Psychogenic   0   0        0         0      0     0      0           1   2 


Compare the initial diagnosis and the last one. Exclude people who only had the SC visit

             DX_INIT
DX_LAST        ET NoDisease OtherD PrdrM PrdrNM Psychogenic iPD <NA>
  CBD           0         1      0     0      0           0   1    0
  DLB           0         0      0     0      2           0   2    0
  DrugPKS       0         2      0     0      0           0   0    0
  ET            3         4      0     0      0           0   3    0
  MSA           0         0      1     0      1           0   5    0
  NoDisease     2       506      1     1      1           0   3    0
  OtherD        0         5      3     0      1           0   5    0
  PrdrM         0        22      0     1      8           0   0    0
  PrdrNM        0         2      0     0     36           0   0    0
  Psychogenic   0         0      0     0      0           0   2    0
  iPD           0         8      4     2      9           1 709    0
  <NA>          0         0      0     0      0           0   0    0

In [10]:
temp = left_join(temp, diag, by = "PATNO")
cat("All people enrolled should have a diagnosis. \n TOP : recruitment category and initial diagnosis \n BOTTOM : the same but with latest diagnosis.")
temp %>% filter(ENROLL_STATUS == "Enrolled") %>% mutate(DX_INIT = as.character(DX_INIT)) %>% with(table(DX_INIT, RECRUIT, useNA = "always"))
temp %>% filter(ENROLL_STATUS == "Enrolled") %>% mutate(DX_LAST = as.character(DX_LAST)) %>% with(table(DX_LAST, RECRUIT, useNA = "always"))

All people enrolled should have a diagnosis. 
 TOP : recruitment category and initial diagnosis 
 BOTTOM : the same but with latest diagnosis.

             RECRUIT
DX_INIT       GENPD GENUN  HC  PD PRODROMA REGPD REGUN SWEDD <NA>
  ET              0     7   1   0        0     0     2     0    0
  NoDisease       0   325 166   0        2     1   240     0    0
  OtherD          0     2   2   0        2     3     7     0    0
  PrdrM           1     2   0   0        1     0     3     0    0
  PrdrNM          0     1   0   0       55     0     1     0    0
  Psychogenic     1     0   0   0        0     0     0     0    0
  iPD           241     7   0 358        0   194     3    55    0
  <NA>            1     1   0   0        0     0     0     0    0

             RECRUIT
DX_LAST       GENPD GENUN  HC  PD PRODROMA REGPD REGUN SWEDD <NA>
  CBD             0     0   1   1        0     0     0     0    0
  DLB             0     0   0   2        2     0     0     0    0
  DrugPKS         0     1   0   0        0     0     1     0    0
  ET              1     9   1   0        0     0     2     2    0
  MSA             0     0   0   4        2     0     0     0    0
  NoDisease       2   299 159   0        2     0   236     0    0
  OtherD          0     2   5   0        1     0     8     4    0
  PrdrM           0    20   0   0        7     0     5     0    0
  PrdrNM          0     3   0   0       34     0     1     0    0
  Psychogenic     1     0   0   0        0     0     0     0    0
  iPD           239    10   3 351       12   198     3    49    0
  <NA>            1     1   0   0        0     0     0     0    0

In [11]:
cat("Create new indicator. Enrollment status/Latest Diagnosis/Image. Recruitment Category is not included.\n")
temp1= temp %>% filter(!is.na(DX_INIT)) %>%
  mutate(DIAG = paste(
  ifelse(ENROLL_STATUS=="Enrolled", "In_", "Out"), # enrollment status
  case_when( # LATEST DIAGNOSIS
    is.na(DX_LAST) ~ "XXX",
    DX_LAST == "iPD" ~ "iPD",
    DX_LAST == "NoDisease" ~ "CTR",
    DX_LAST == "PrdrM" | DX_LAST == "PrdrNM" ~ "PRD",
    DX_LAST == "OtherD" ~ "DFR",
    DX_LAST == "ET" ~ "EST",
    DX_LAST == "DrugPKS" ~ "DRG",
    DX_LAST == "Psychogenic" ~ "PSY",
    TRUE ~ as.character(DX_LAST)),
  ifelse(DX_LAST == DX_INIT, "stable", "convsn"), # DX stable or conversion.
  case_when(
    IMG == "SWEDD" ~ "SWEDD",
    IMG == "no image" ~ "NoImg",
    TRUE ~ "YsImg"),
  ifelse(SUBCAT=="" & I(RECRUIT %in% c("PD", "HC", "SWEDD")) , "PPMI", SUBCAT),
  sep = "_"
))
cat("\nSTATUS of PPMI(PD/HC/SWEDD) cohort")
temp1 %>% .[grep("PPMI", .$DIAG), ] %>% 
  with(table(DIAG, RECRUIT, useNA = "always"))
cat("\nSTATUS of Other cohorts")
temp1 %>% .[!grepl("PPMI", .$DIAG), ] %>% # grepl() stays correct even when nothing matches
  with(table(DIAG, RECRUIT, useNA = "always"))

Create new indicator. Enrollment status/Latest Diagnosis/Image. Recruitment Category is not included.

STATUS of PPMI(PD/HC/SWEDD) cohort

                           RECRUIT
DIAG                         HC  PD SWEDD <NA>
  In__CBD_convsn_YsImg_PPMI   1   1     0    0
  In__CTR_convsn_YsImg_PPMI   1   0     0    0
  In__CTR_stable_YsImg_PPMI 158   0     0    0
  In__DFR_convsn_SWEDD_PPMI   0   0     4    0
  In__DFR_convsn_YsImg_PPMI   3   0     0    0
  In__DFR_stable_YsImg_PPMI   2   0     0    0
  In__DLB_convsn_YsImg_PPMI   0   2     0    0
  In__EST_convsn_SWEDD_PPMI   0   0     2    0
  In__EST_convsn_YsImg_PPMI   1   0     0    0
  In__MSA_convsn_YsImg_PPMI   0   4     0    0
  In__iPD_convsn_YsImg_PPMI   3   0     0    0
  In__iPD_stable_NoImg_PPMI   0   3     2    0
  In__iPD_stable_SWEDD_PPMI   0   0    47    0
  In__iPD_stable_YsImg_PPMI   0 348     0    0
  Out_CTR_convsn_SWEDD_PPMI   0   0     1    0
  Out_CTR_stable_SWEDD_PPMI   0   1     0    0
  Out_CTR_stable_YsImg_PPMI  64   0     0    0
  Out_DFR_convsn_SWEDD_PPMI   0   0     1    0
  Out_DFR_stable_SWEDD_PPMI   0   0     1    0
  Out_DFR_stable_YsImg_PP


STATUS of Other cohorts

                                RECRUIT
DIAG                             GENPD GENUN PRODROMA REGPD REGUN <NA>
  In__CTR_convsn_YsImg_GBA+          0     1        0     0     0    0
  In__CTR_convsn_YsImg_LRRK2+        2     2        0     0     0    0
  In__CTR_convsn_YsImg_RBD           0     0        1     0     0    0
  In__CTR_stable_NoImg_GBA+          0     5        0     0    89    0
  In__CTR_stable_NoImg_GBA-          0     0        0     0     1    0
  In__CTR_stable_NoImg_LRRK2+        0     7        0     0   133    0
  In__CTR_stable_NoImg_LRRK2-        0     1        0     0    10    0
  In__CTR_stable_NoImg_SNCA+         0     0        0     0     3    0
  In__CTR_stable_YsImg_GBA+          0   127        0     0     0    0
  In__CTR_stable_YsImg_HYP           0     0        1     0     0    0
  In__CTR_stable_YsImg_LRRK2+        0   150        0     0     0    0
  In__CTR_stable_YsImg_SNCA+         0     6        0     0     0    0
  In__DFR_convsn_NoImg_LRRK2+        
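
The DIAG labels can be built and split back into their parts as sketched below. `parse_diag()` is a hypothetical helper that is not part of the original script, and it assumes the diagnosis and sub-category parts contain no underscore; codes such as `Hemi_PKS`, which the `case_when()` above passes through unchanged, would need separate handling.

```r
# Build a label the way the cell above does: "In_" joined with sep = "_" yields "In__"
label <- paste("In_", "iPD", "stable", "YsImg", "LRRK2+", sep = "_")
print(label)  # "In__iPD_stable_YsImg_LRRK2+"

# Hypothetical helper: split a DIAG label into its five components.
# Assumes no component after the status prefix contains an underscore.
parse_diag <- function(x) {
  status <- ifelse(substr(x, 1, 3) == "In_", "Enrolled", "NotEnrolled")
  rest   <- strsplit(substring(x, 5), "_", fixed = TRUE)  # drop the 3-char status and its separator
  parts  <- do.call(rbind, rest)
  data.frame(STATUS = status, DX = parts[, 1], CHANGE = parts[, 2],
             IMAGE = parts[, 3], SUBCAT = parts[, 4], stringsAsFactors = FALSE)
}

parsed <- parse_diag(c(label, "Out_CTR_stable_YsImg_PPMI"))
print(parsed)
```

Because the status prefix is always three characters ("In_" or "Out"), slicing by position is safer than splitting the whole string on underscores.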

In [12]:
cat("Add sex, birthday, racial information from 'Screening___Demographics.csv',
the date of initial diagnosis from 'PD_Features.csv',
Years of education from 'Socio-Economics.csv',
Family history from 'Family_History__PD_.csv'\n")
scr = fread(paste(FOLDER, 'Screening___Demographics.csv', sep='/'), 
             colClasses = c("PATNO"="character")) %>%
  mutate(FEMALE = if_else(GENDER=='0'|GENDER=='1', 1, 0), # GENDER 0/1: female (child-bearing / non-child-bearing potential); 2: male
         BIRTHDT = as.Date(paste("01",'06', BIRTHDT, sep = "/"), format='%d/%m/%Y') %>% as.numeric, # original has the year only; set to 1 June of that year
         EUROPEAN = if_else(RAWHITE=='1', 1, 0)) %>% 
  select(PATNO, FEMALE, BIRTHDT, EUROPEAN) %>% 
  distinct(PATNO, .keep_all = T)
features = fread(paste(FOLDER, "PD_Features.csv", sep='/'), 
            colClasses = c("PATNO"="character")) %>%
  mutate(DIAGDATE = as.Date(paste("01",PDDXDT, sep = "/"), format='%d/%m/%Y') %>% as.numeric) %>% 
  select(PATNO, DIAGDATE) %>% 
  distinct(PATNO, .keep_all = T)
SES=fread(paste(FOLDER, 'Socio-Economics.csv', sep='/'),
          colClasses = c("PATNO"="character")) %>% 
  mutate(YEARSEDUC = as.numeric(EDUCYRS)) %>% 
  select(PATNO, YEARSEDUC) %>% 
  distinct(PATNO, .keep_all = T)
FH=fread(paste(FOLDER, 'Family_History__PD_.csv', sep='/'),
         colClasses = c("PATNO"="character")) %>%
  data.frame %>% 
  mutate_at(vars(names(.)[grep("PD$", names(.))]), as.numeric) %>% 
  mutate(FH = rowSums(.[,grep("BIODADPD|BIOMOMPD|FULSIBPD", names(.))], na.rm = T)) %>% 
  mutate(FAMILY_HISTORY=ifelse(FH>0, 1, 0)) %>% 
  select(PATNO, FAMILY_HISTORY) %>% 
  distinct(PATNO, .keep_all = T)
temp2 = left_join(scr, features, by ="PATNO") %>% left_join(., SES, by ="PATNO") %>% 
  left_join(., FH, by="PATNO") %>% left_join(., temp1, by ="PATNO")
summary(temp2)

Add sex, birthday, racial information from 'Screening___Demographics.csv',
the date of initial diagnosis from 'PD_Features.csv',
Years of education from 'Socio-Economics.csv',
Family history from 'Family_History__PD_.csv'


    PATNO               FEMALE          BIRTHDT          EUROPEAN         DIAGDATE       YEARSEDUC  FAMILY_HISTORY  
 Length:2159        Min.   :0.0000   Min.   :-17746   Min.   :0.0000   Min.   : 5479   Min.   : 0   Min.   :0.0000  
 Class :character   1st Qu.:0.0000   1st Qu.: -9345   1st Qu.:1.0000   1st Qu.:14791   1st Qu.:14   1st Qu.:0.0000  
 Mode  :character   Median :0.0000   Median : -6789   Median :1.0000   Median :15294   Median :16   Median :0.0000  
                    Mean   :0.4605   Mean   : -6342   Mean   :0.9623   Mean   :14964   Mean   :16   Mean   :0.3987  
                    3rd Qu.:1.0000   3rd Qu.: -3867   3rd Qu.:1.0000   3rd Qu.:15614   3rd Qu.:18   3rd Qu.:1.0000  
                    Max.   :1.0000   Max.   : 10013   Max.   :1.0000   Max.   :17622   Max.   :30   Max.   :1.0000  
                    NA's   :5        NA's   :11       NA's   :8        NA's   :1137    NA's   :78   NA's   :120     
   RECRUIT              IMG               SUBCAT          ENROLL
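
The family-history flag in this cell is derived with `rowSums(..., na.rm = T)`, which can be illustrated on a toy table. This is a sketch only: the `fh` rows are invented, and the coding (1 = relative has PD, 0 = does not) is an assumption made for the example.

```r
# Hypothetical responses: 1 = relative has PD, 0 = does not, NA = not answered
fh <- data.frame(
  PATNO    = c("1", "2", "3"),
  BIODADPD = c(1, 0, NA),
  BIOMOMPD = c(0, 0, NA),
  FULSIBPD = c(NA, 0, NA),
  stringsAsFactors = FALSE
)

# Sum the first-degree relative columns, ignoring missing answers
cols <- grep("BIODADPD|BIOMOMPD|FULSIBPD", names(fh))
fh$FH <- rowSums(fh[, cols], na.rm = TRUE)
fh$FAMILY_HISTORY <- ifelse(fh$FH > 0, 1, 0)
print(fh[, c("PATNO", "FAMILY_HISTORY")])
```

Note that participant "3", who answered none of the questions, ends up with FAMILY_HISTORY = 0 rather than NA, because `na.rm = TRUE` turns an all-missing row into a sum of 0.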

In [13]:
cat("For those recruited as PD, check the distribution of the time (in days) from the original diagnosis date (DIAGDATE) to DATE_INITDX (the diagnosis date at cohort entry)
The table shows that GENPD and REGPD participants entered the cohort relatively long after their original diagnosis. ")
temp2 %>% filter(RECRUIT %in% c("GENPD", "REGPD", "PD", "SWEDD")) %>%
  group_by(RECRUIT) %>% 
  mutate(DIFF = DATE_INITDX - DIAGDATE) %>%
  summarize(n = n(), 
            mean = mean(DIFF, na.rm = T), sd = sd(DIFF, na.rm = T),
            min = min(DIFF, na.rm = T), median = median(DIFF, na.rm = T), max = max(DIFF, na.rm = T))

For those recruited as PD, check the distribution of the time (in days) from the original diagnosis date (DIAGDATE) to DATE_INITDX (the diagnosis date at cohort entry)
The table shows that GENPD and REGPD participants entered the cohort relatively long after their original diagnosis. 

RECRUIT,n,mean,sd,min,median,max
GENPD,258,1147.7868,784.9005,0,1034.5,3167
PD,490,168.8569,197.6531,0,92.0,1247
REGPD,206,3404.8488,2364.5461,0,3502.0,10773
SWEDD,64,130.7188,160.9197,0,61.0,640
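
Since the differences above are in days, converting the means to years makes the cohorts easier to compare. The sketch below copies the mean values from the table; dividing by 365.25 days per year is an approximation chosen for this example.

```r
# Mean days from original diagnosis to the first study diagnosis (from the table above)
mean_days <- c(GENPD = 1147.7868, PD = 168.8569, REGPD = 3404.8488, SWEDD = 130.7188)

# 365.25 days per year is an approximation used only for this illustration
mean_years <- round(mean_days / 365.25, 1)
print(mean_years)
```

On this scale the PD and SWEDD groups were on average diagnosed within about half a year of entry, while GENPD and REGPD averaged roughly 3 and 9 years.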


In [16]:
temp2 %>% filter(! (RECRUIT %in% c("GENPD", "REGPD", "PD", "SWEDD"))) %>% filter(!is.na(DIAGDATE)) %>% nrow %>%
  cat("Among", temp2 %>% nrow, "people,",.,"in non-PD categories have a DIAGDATE. They will be excluded from the analysis.\n")
cat("Check if there are PD patients diagnosed after SC. People with TRUE or <NA> in the following table will be excluded")
temp2 %>% filter(RECRUIT %in% c("GENPD", "REGPD", "PD", "SWEDD")) %>%
  mutate(DIAGNOSED_AFTER_SCREEN = I(DATE_INITDX - DIAGDATE) < 0 ) %>%
  with(table(DIAGNOSED_AFTER_SCREEN, useNA = "always"))
temp3 = temp2 %>% mutate(FILTER = case_when(
    RECRUIT %in% c("GENPD", "REGPD", "PD", "SWEDD") & is.na(DIAGDATE) ~ TRUE, # DIAGDATE==NA in PD category
    RECRUIT %in% c("GENPD", "REGPD", "PD", "SWEDD") ~ I(DATE_INITDX - DIAGDATE) < 0, # DIAGDATE > DATE_INITDX
    TRUE~!is.na(DIAGDATE))) %>% # Non-PD category with DIAGDATE!=NA
  filter(!FILTER)
temp3 %>% filter(is.na(DX_LAST) | is.na(FEMALE)) %>% nrow %>%
  cat("\nWe only keep people with a known diagnosis (such as PD, HC, etc.) and known sex. Excluded:", ., 
      "\nThe final summary of the cohort is:")
temp4 = temp3 %>% filter(!is.na(DX_LAST) & !is.na(FEMALE)) %>% 
  select(PATNO, RECRUIT, EUROPEAN, FEMALE, BIRTHDT, YEARSEDUC, FAMILY_HISTORY, DIAGDATE, DX_INIT, DATE_INITDX, DATE_LASTDX, DIAG)
temp4 %>% summary
write.csv(temp4, paste(OUTPUT, "DEMOG_DIAG.csv", sep="/"), row.names=F)

Among 2159 people, 6 in non-PD categories have a DIAGDATE. They will be excluded from the analysis.
Check if there are PD patients diagnosed after SC. People with TRUE or <NA> in the following table will be excluded

DIAGNOSED_AFTER_SCREEN
FALSE  <NA> 
 1016     2 


We only keep people with a known diagnosis (such as PD, HC, etc.) and known sex. Excluded: 94 
The final summary of the cohort is:

    PATNO             RECRUIT             EUROPEAN         FEMALE          BIRTHDT         YEARSEDUC 
 Length:2057        Length:2057        Min.   :0.000   Min.   :0.0000   Min.   :-17746   Min.   : 0  
 Class :character   Class :character   1st Qu.:1.000   1st Qu.:0.0000   1st Qu.: -9345   1st Qu.:14  
 Mode  :character   Mode  :character   Median :1.000   Median :0.0000   Median : -6789   Median :16  
                                       Mean   :0.966   Mean   :0.4623   Mean   : -6276   Mean   :16  
                                       3rd Qu.:1.000   3rd Qu.:1.0000   3rd Qu.: -3501   3rd Qu.:18  
                                       Max.   :1.000   Max.   :1.0000   Max.   : 10013   Max.   :30  
                                                                                                     
 FAMILY_HISTORY      DIAGDATE          DX_INIT      DATE_INITDX     DATE_LASTDX        DIAG          
 Min.   :0.0000   Min.   : 5479   iPD      :1015   Min.   :14761   Min.   :14761  