# Notebook to organize my ABCD data analysis

## Variables of interest
### General 
* pds_ht2_y = Would you say that your growth in height:
* pds_skin2_y = Have you noticed any skin changes, especially pimples?
* pds_bdyhair_y = And how about the growth of your body hair? ("Body hair" means hair any place other than your head, such as under your arms) Would you say that your body hair growth:
### Female specific
* pds_f4_2_y = Have you noticed that your breasts have begun to grow?
* pds_f5_y = Have you begun to menstruate (started to have your period)?
### Male specific
* pds_m4_y = Have you noticed a deepening of your voice?
* pds_m5_y = Have you begun to grow hair on your face?

### Scoring Algorithms: 

For Items 1 through 4 on the girls’ version and all items on the boys’ version, response options were: not yet started (1 point); barely started (2 points); definitely started (3 points); seems complete (4 points); I don’t know (missing). Yes on the menstruation item = 4 points; no = 1 point. Point values are averaged for all items to give a Pubertal Development Scale (PDS) score. 
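
The averaging step can be sketched as below. This is a minimal illustration, not the notebook's own code: `pds_mean_score` and the `toy` responses are made up, and skipping missing items when averaging (rather than dropping the child) is an assumption, since the text only says that "I don't know" is treated as missing.

```r
# Hypothetical helper (not part of the original analysis): mean PDS score per child.
# `items` holds item scores coded 1-4, with NA for "I don't know".
pds_mean_score <- function(items) {
  items <- as.matrix(items)
  score <- rowMeans(items, na.rm = TRUE)   # average over the answered items only
  score[is.nan(score)] <- NA               # all items missing -> NA rather than NaN
  score
}

# Made-up responses for three children (illustration only)
toy <- data.frame(
  height   = c(2, 3, NA),
  skin     = c(1, 3, NA),
  bodyhair = c(1, 2, NA),
  item4    = c(2, 4, NA),
  item5    = c(1, 4, NA)
)
pds_mean_score(toy)   # 1.4, 3.2, NA
```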

Puberty Category Scores are computed using the criteria of Crockett (1988, unpublished) by totaling the scale values given above. 

### To compute Puberty Category Scores for boys use body hair growth, voice change, and facial hair growth as follows:
* Prepubertal = 3
* Early Pubertal = 4 or 5 (no 3-point responses)
* Midpubertal = 6, 7, or 8 (no 4-points)
* Late pubertal = 9-11
* Postpubertal = 12
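
A sketch of the boys' rule as a lookup on the three-item sum. `pcs_boys` is a hypothetical helper and the example scores are made up. It uses only the sum and does not enforce the parenthetical conditions on individual responses, which is a simplification.

```r
# Hypothetical helper: map the boys' three-item sum (range 3-12) to a category.
# Only the sum is used; the "no 3-point" / "no 4-point" conditions are NOT checked.
pcs_boys <- function(bodyhair, voice, facehair) {
  total <- bodyhair + voice + facehair    # NA if any item is missing
  cut(total,
      breaks = c(2, 3, 5, 8, 11, 12),     # (2,3], (3,5], (5,8], (8,11], (11,12]
      labels = c("prepubertal", "earlypubertal", "midpubertal",
                 "latepubertal", "postpubertal"))
}

# Made-up item scores; the sums are 3, 4, 7, 10, 12, NA
as.character(pcs_boys(bodyhair = c(1, 2, 2, 4, 4, NA),
                      voice    = c(1, 1, 3, 3, 4, 2),
                      facehair = c(1, 1, 2, 3, 4, 1)))
```

Because `cut()` returns `NA` for a missing sum, boys with any unanswered item stay unclassified.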


### To compute Puberty Category Scores for girls, sum body hair growth and breast development (range 2-8) and combine the sum with menarche status as follows:
* Prepubertal = 2 and no menarche
* Early pubertal = 3 and no menarche
* Midpubertal = > 3 and no menarche
* Late pubertal = <= 7 and menarche
* Postpubertal = 8 and menarche
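
The girls' rule can be sketched as follows. `pcs_girls` is a hypothetical helper with made-up inputs. It assumes `menarche` has already been recoded to `TRUE`/`FALSE`, so raw questionnaire codes would need converting first, and it sums only body hair and breast development, with menarche as a separate criterion.

```r
# Hypothetical helper: girls' category from the two-item sum plus menarche status.
# Assumes `menarche` is logical (TRUE = has begun to menstruate).
pcs_girls <- function(bodyhair, breast, menarche) {
  total  <- bodyhair + breast                  # two-item sum, range 2-8
  ok     <- !is.na(total) & !is.na(menarche)   # classify complete cases only
  no_men <- ok & !menarche
  men    <- ok & menarche
  out <- rep(NA_character_, length(total))
  out[no_men & total == 2] <- "prepubertal"
  out[no_men & total == 3] <- "earlypubertal"
  out[no_men & total >  3] <- "midpubertal"
  out[men    & total <= 7] <- "latepubertal"
  out[men    & total == 8] <- "postpubertal"
  out
}

# Made-up scores; the two-item sums are 2, 3, 5, 5, 8, 2
pcs_girls(bodyhair = c(1, 1, 2, 2, 4, 1),
          breast   = c(1, 2, 3, 3, 4, 1),
          menarche = c(FALSE, FALSE, FALSE, TRUE, TRUE, NA))
```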

In [1]:
library(psych)
library(reshape)
library(ggplot2)
library(plyr)
library(dplyr)
#install.packages("AGD")
library(AGD)


Attaching package: ‘ggplot2’

The following objects are masked from ‘package:psych’:

    %+%, alpha


Attaching package: ‘plyr’

The following objects are masked from ‘package:reshape’:

    rename, round_any


Attaching package: ‘dplyr’

The following objects are masked from ‘package:plyr’:

    arrange, count, desc, failwith, id, mutate, rename, summarise,
    summarize

The following object is masked from ‘package:reshape’:

    rename

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

“package ‘AGD’ was built under R version 3.4.4”

In [2]:
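# BMI in kg/m^2 from height in inches and weight in pounds
BMI_cal<-function(height, weight){
  weight_kg<-(weight/2.20462)     # pounds -> kilograms
  height_m<-(height*2.54)/100     # inches -> centimetres -> metres
  BMI<-weight_kg/(height_m)^2
  return(BMI)
}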

In [3]:
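# Convert BMI to an age- and sex-specific z-score against the CDC BMI reference
# (AGD::y2z with ref=cdc.bmi), then to a percentile (0-100).
# Note: interview_age is overwritten, converted from months to years.
BMItiler<-function(data){
    # boys
    datM<-subset(data, data$gender == "M")
    datM$interview_age<-(datM$interview_age)/12   # months -> years
    datM$zTest <- y2z(y=datM$BMI, x=datM$interview_age, sex="M", ref=cdc.bmi)
    datM$BMItile<-100*pnorm(datM$zTest)           # z-score -> percentile
    
    # girls
    datF<-subset(data, data$gender == "F")
    datF$interview_age<-(datF$interview_age)/12   # months -> years
    datF$zTest <- y2z(y=datF$BMI, x=datF$interview_age, sex="F", ref=cdc.bmi)
    datF$BMItile<-100*pnorm(datF$zTest)           # z-score -> percentile
    
    # rows with a missing gender are dropped by subset() and do not reappear here
    data<-rbind.fill(datF, datM)
    return(data)
}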

In [4]:
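# Weight-status label from the BMI percentile, with cutoffs at 5, 85, and 95.
# Rows with a missing percentile stay NA.
OVOB_kid<-function(BMItile, data){
  data$ov_ob[BMItile<=5]<-"Underweight"
  data$ov_ob[BMItile>5 & BMItile<=85]<-"Normalweight"
  data$ov_ob[BMItile>85 & BMItile<=95]<-"Overweight"
  data$ov_ob[BMItile>95]<-"Obese"
  return(data$ov_ob)
}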

In [5]:
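PCS<-function(x){
    # Boys: sum voice change, facial hair, and body hair (range 3-12)
    earlyM <-subset(x, x$gender == "M")
    earlyM$sum <- earlyM$pds_m4_y  + earlyM$pds_m5_y + earlyM$pds_bdyhair_y

    earlyM$PCS[earlyM$sum <= 3]<-"prepubertal"
    earlyM$PCS[earlyM$sum > 3 & earlyM$sum <=5]<-"earlypubertal"
    earlyM$PCS[earlyM$sum > 5 & earlyM$sum <=8]<-"midpubertal"
    earlyM$PCS[earlyM$sum > 8 & earlyM$sum <=11]<-"latepubertal"
    # a sum of 12 (postpubertal in the scoring rules) is folded into latepubertal
    earlyM$PCS[ earlyM$sum > 11]<-"latepubertal"

    # Girls without menarche. subset() drops rows where pds_f5_y is NA,
    # which presumably includes all boys.
    early <-subset(x, x$pds_f5_y <= 1)

    # NOTE: this sum includes the menarche item (no = 1), so it cannot be below 3
    # and the sum == 2 rule never matches. The scoring rules above sum only
    # body hair and breast development.
    early$sum <- early$pds_f4_2_y + early$pds_f5_y + early$pds_bdyhair_y
    early$PCS[early$sum == 2] <- "prepubertal"
    early$PCS[early$sum == 3] <- "earlypubertal"
    early$PCS[early$sum > 3] <- "midpubertal"

    # Girls with menarche. Sums below 7 are left unclassified (NA),
    # and postpubertal is not assigned.
    late <-subset(x, x$pds_f5_y > 1)

    late$sum <- late$pds_f4_2_y + late$pds_f5_y + late$pds_bdyhair_y
    late$PCS[late$sum >= 7] <- "latepubertal"
#    late$PCS[late$sum > 7] <- "postpubertal"

    data<-rbind.fill(earlyM, early, late)
    return(data)
}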

In [6]:
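puberty<-read.table("~/Google Drive/ABCD/important_txt/abcd_ypdms01.csv", sep=",", header=TRUE)  # pubertal development scale (pds_* items)
anthro<-read.table("~/Google Drive/ABCD/important_txt/abcd_ant01.csv", sep=",", header=TRUE)     # anthropometrics (height, weight)

In [7]:
x<-join(puberty, anthro)   # plyr left join on the shared columns
dim(x)
x[x == 999] <- NA          # recode 999 as missing
dim(x)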

Joining by: src_subject_id, interview_date, interview_age, gender, eventname


In [8]:
x$BMI<-BMI_cal(height=x$anthroheightcalc, weight=x$anthroweightcalc)   # BMI_cal is vectorized, so mapply is not needed

In [9]:
y<-BMItiler(data = x)

In [10]:
data<-PCS(y)

In [11]:
data$OVOB<-OVOB_kid(BMItile = data$BMItile, data = data)

In [12]:
dataMissing<-data[is.na(data$PCS),]

In [13]:
dim(dataMissing)

In [14]:
summary(dataMissing)

             src_subject_id interview_date interview_age    gender 
 sub-NDARINVANUGPBR4:  2    3/25/17:  6    Min.   : 8.917   F:378  
 sub-NDARINVHBZ05H20:  2    6/19/17:  6    1st Qu.: 9.250   M:  9  
 sub-NDARINV03XVEBPM:  1    5/13/17:  5    Median : 9.750          
 sub-NDARINV042UJKFB:  1    6/24/17:  5    Mean   : 9.812          
 sub-NDARINV0C471G23:  1    6/26/17:  5    3rd Qu.:10.333          
 sub-NDARINV0GND16RW:  1    6/28/17:  5    Max.   :11.000          
 (Other)            :379    (Other):355                            
                 eventname     pds_sex_y   pds_ht2_y      pds_skin2_y   
 baseline_year_1_arm_1:387   Min.   :2   Min.   :1.000   Min.   :1.000  
                             1st Qu.:2   1st Qu.:2.000   1st Qu.:1.000  
                             Median :2   Median :2.000   Median :1.000  
                             Mean   :2   Mean   :2.301   Mean   :1.509  
                             3rd Qu.:2   3rd Qu.:3.000   3rd Qu.:2.000  
                  

In [15]:
mytable <- xtabs(~PCS+OVOB+gender, data=data)
ftable(mytable) # print table 
summary(mytable) # chi-square test of independence

                           gender   F   M
PCS           OVOB                       
earlypubertal Normalweight        348 729
              Obese                33 184
              Overweight           32 158
              Underweight          40  44
latepubertal  Normalweight         24  24
              Obese                18  10
              Overweight           13   9
              Underweight           0   0
midpubertal   Normalweight        824 321
              Obese               187  92
              Overweight          193  83
              Underweight          47  12
prepubertal   Normalweight          0 485
              Obese                 0 105
              Overweight            0  84
              Underweight           0  25

Call: xtabs(formula = ~PCS + OVOB + gender, data = data)
Number of cases in table: 4124 
Number of factors: 3 
Test for independence of all factors:
	Chisq = 1325.1, df = 24, p-value = 4.912e-265
	Chi-squared approximation may be incorrect

In [16]:
3*3*2

In [17]:
DF<-subset(data, data$PCS != "prepubertal")
DF<-subset(DF, DF$OVOB !="Underweight")

In [18]:
mytable <- xtabs(~PCS+OVOB+gender, data=DF)
ftable(mytable) # print table 
summary(mytable)

                           gender   F   M
PCS           OVOB                       
earlypubertal Normalweight        348 729
              Obese                33 184
              Overweight           32 158
latepubertal  Normalweight         24  24
              Obese                18  10
              Overweight           13   9
midpubertal   Normalweight        824 321
              Obese               187  92
              Overweight          193  83

Call: xtabs(formula = ~PCS + OVOB + gender, data = DF)
Number of cases in table: 3282 
Number of factors: 3 
Test for independence of all factors:
	Chisq = 640.3, df = 12, p-value = 2.577e-129

The limiting factor is the late pubertal group: its smallest cell (overweight males) has only 9 subjects.
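
One way to find the limiting cell without reading the table by eye is sketched below. The counts are copied from the table printed above; the `counts` data frame and the "largest balanced design" reading (smallest cell times number of cells) are illustrative assumptions, not part of the original analysis.

```r
# Cell counts copied from the ftable output above (OVOB varies fastest).
counts <- expand.grid(
  OVOB   = c("Normalweight", "Obese", "Overweight"),
  PCS    = c("earlypubertal", "latepubertal", "midpubertal"),
  gender = c("F", "M"),
  stringsAsFactors = FALSE
)
counts$Freq <- c(348,  33,  32,  24, 18, 13, 824, 187, 193,   # F
                 729, 184, 158,  24, 10,  9, 321,  92,  83)   # M

# Smallest cell: this is the one that caps a balanced design
smallest <- counts[which.min(counts$Freq), ]
smallest

# Largest fully balanced sample: smallest cell size times the number of cells
smallest$Freq * nrow(counts)
```

With the real data, the same idea works on `as.data.frame(mytable)`, which returns one row per cell with a `Freq` column.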

In [19]:
avail_scans <- read.table("~/Google Drive/ABCD/important_txt/scan_list.txt")

In [20]:
avail_scans$have_data <- rep(1,nrow(avail_scans))

In [21]:
names(avail_scans)<-c("src_subject_id", "have_data")

In [22]:
dim(avail_scans)

In [23]:
data0<-join(DF, avail_scans)

Joining by: src_subject_id


In [24]:
dim(data0)

In [25]:
data<-data0[!is.na(data0$have_data),]

In [26]:
dim(data)

In [27]:
mytable <- xtabs(~gender+PCS+OVOB, data=data)
ftable(mytable) # print table 
summary(mytable) # chi-square test of independence

                     OVOB Normalweight Obese Overweight
gender PCS                                             
F      earlypubertal               130     8          2
       latepubertal                 20    11         11
       midpubertal                 271    68         60
M      earlypubertal               240    56         55
       latepubertal                 17     8          8
       midpubertal                 122    34         28

Call: xtabs(formula = ~gender + PCS + OVOB, data = data)
Number of cases in table: 1149 
Number of factors: 3 
Test for independence of all factors:
	Chisq = 209.12, df = 12, p-value = 4.248e-38

## Get a list of subjects needed for analysis

In [28]:
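# Draw 8 rows at random (without replacement) from one cell.
# sample() raises an error if the cell has fewer than 8 rows.
randomly<-function(y){
    picked<-y[sample(nrow(y), 8), ]
    return(picked)
}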
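
Because the draw is random, rerunning the notebook gives a different subject list each time. A minimal sketch of a reproducible variant follows. `sample_rows`, its `seed` argument, the seed value, and the toy IDs are all hypothetical and not part of the original analysis.

```r
# Hypothetical variant of the sampler: optional seed plus an explicit size check.
sample_rows <- function(df, n = 8, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)      # fix the RNG state so the draw repeats
  if (nrow(df) < n) {
    stop("cell has only ", nrow(df), " rows; cannot sample ", n)
  }
  df[sample(nrow(df), n), , drop = FALSE]
}

# Made-up subject IDs for illustration
toy <- data.frame(id = sprintf("sub-%02d", 1:20), stringsAsFactors = FALSE)
a <- sample_rows(toy, n = 8, seed = 42)
b <- sample_rows(toy, n = 8, seed = 42)
identical(a, b)   # TRUE: the same seed gives the same draw
```

When sampling several cells, setting the seed once before the whole loop (rather than inside each call) avoids drawing the same row positions in every cell.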

In [29]:
#randomly(data)

In [30]:
get_subs<-function(x){
    # One data frame per gender x PCS x OVOB cell; `=` names the list elements
    big_list<-list(
        ob_late_F=subset(x, x$OVOB == "Obese" & x$PCS == "latepubertal" & x$gender == "F"),
        ob_mid_F=subset(x, x$OVOB == "Obese" & x$PCS == "midpubertal" & x$gender == "F"),
        ob_early_F=subset(x, x$OVOB == "Obese" & x$PCS == "earlypubertal" & x$gender == "F"),
    
        ov_late_F=subset(x, x$OVOB == "Overweight" & x$PCS == "latepubertal" & x$gender == "F"),
        ov_mid_F=subset(x, x$OVOB == "Overweight" & x$PCS == "midpubertal" & x$gender == "F"),
        # excluded: only 2 subjects with scans in this cell, too few to sample 8
#         ov_early_F=subset(x, x$OVOB == "Overweight" & x$PCS == "earlypubertal" & x$gender == "F"),
    
        no_late_F=subset(x, x$OVOB == "Normalweight" & x$PCS == "latepubertal" & x$gender == "F"),
        no_mid_F=subset(x, x$OVOB == "Normalweight" & x$PCS == "midpubertal" & x$gender == "F"),
        no_early_F=subset(x, x$OVOB == "Normalweight" & x$PCS == "earlypubertal" & x$gender == "F"),
    ################################################################################################
        ob_late_M=subset(x, x$OVOB == "Obese" & x$PCS == "latepubertal" & x$gender == "M"),
        
        ob_mid_M=subset(x, x$OVOB == "Obese" & x$PCS == "midpubertal" & x$gender == "M"),
        ob_early_M=subset(x, x$OVOB == "Obese" & x$PCS == "earlypubertal" & x$gender == "M"),

        ov_late_M=subset(x, x$OVOB == "Overweight" & x$PCS == "latepubertal" & x$gender == "M"),
        
        ov_mid_M=subset(x, x$OVOB == "Overweight" & x$PCS == "midpubertal" & x$gender == "M"),
        ov_early_M=subset(x, x$OVOB == "Overweight" & x$PCS == "earlypubertal" & x$gender == "M"),

        no_late_M=subset(x, x$OVOB == "Normalweight" & x$PCS == "latepubertal" & x$gender == "M"),
        
        no_mid_M=subset(x, x$OVOB == "Normalweight" & x$PCS == "midpubertal" & x$gender == "M"),
        no_early_M=subset(x, x$OVOB == "Normalweight" & x$PCS == "earlypubertal" & x$gender == "M")
    )
#     check_dims<-lapply(big_list, dim)
    # sample 8 subjects from every cell
    final_list<-lapply(big_list, randomly)
    return(final_list)
}

In [31]:
data2<-get_subs(data)

In [32]:
length(data2)

In [33]:
x<-ldply(data2, rbind)

In [34]:
names(x)

In [35]:
myvars<-c('src_subject_id','gender','PCS', 'OVOB')
y<-x[myvars]

In [36]:
write.table(y, "~/Google Drive/ABCD/important_txt/data4analysis.txt", sep=" ", col.names=F, row.names = F)

## Check missing data

In [37]:
missing<-data0[is.na(data0$have_data),]

In [38]:
dim(missing)

In [39]:
mytable <- xtabs(~gender+PCS+OVOB, data=missing)
ftable(mytable) # print table 
summary(mytable) # chi-square test of independence

                     OVOB Normalweight Obese Overweight
gender PCS                                             
F      earlypubertal               218    25         30
       latepubertal                  4     7          2
       midpubertal                 553   119        133
M      earlypubertal               489   128        103
       latepubertal                  7     2          1
       midpubertal                 199    58         55

Call: xtabs(formula = ~gender + PCS + OVOB, data = missing)
Number of cases in table: 2133 
Number of factors: 3 
Test for independence of all factors:
	Chisq = 450.5, df = 12, p-value = 7.232e-89
	Chi-squared approximation may be incorrect

The cell sizes are very uneven. More scan data will be needed, particularly for females.
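
To make the unevenness concrete, the fraction of each cell that already has a scan can be computed from the two tables above (subjects with scans and subjects without). This is only a sketch: the counts are copied from the printed outputs, and the `cells` data frame and `frac_scanned` column are names made up for illustration.

```r
# Counts copied from the two ftable outputs above (OVOB varies fastest).
cells <- expand.grid(
  OVOB   = c("Normalweight", "Obese", "Overweight"),
  PCS    = c("earlypubertal", "latepubertal", "midpubertal"),
  gender = c("F", "M"),
  stringsAsFactors = FALSE
)
cells$scanned <- c(130,   8,   2, 20, 11, 11, 271,  68,  60,   # F, with scans
                   240,  56,  55, 17,  8,  8, 122,  34,  28)   # M, with scans
cells$missing <- c(218,  25,  30,  4,  7,  2, 553, 119, 133,   # F, without scans
                   489, 128, 103,  7,  2,  1, 199,  58,  55)   # M, without scans

# Share of each cell that already has a scan
cells$frac_scanned <- cells$scanned / (cells$scanned + cells$missing)

# Cells with the lowest coverage first
head(cells[order(cells$frac_scanned), ], 3)
```

The lowest-coverage cell in these counts is early pubertal overweight females, with 2 of 32 subjects scanned.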

In [40]:
dim(missing[missing$OVOB == "Obese",])

In [41]:
ov_early<-subset(missing, missing$OVOB == "Overweight" & missing$PCS == "earlypubertal" & missing$gender == "F")
dim(ov_early)

In [42]:
ob_late<-subset(missing, missing$OVOB == "Obese" & missing$PCS == "latepubertal")
dim(ob_late)

In [43]:
ob_post<-subset(missing, missing$OVOB == "Obese" & missing$PCS == "postpubertal")   # empty: PCS() never assigns "postpubertal"
dim(ob_post)

In [44]:
ov_post<-subset(missing, missing$OVOB == "Overweight" & missing$PCS == "postpubertal")   # empty: PCS() never assigns "postpubertal"
dim(ov_post)

In [45]:
ov_late<-subset(missing, missing$OVOB == "Overweight" & missing$PCS == "latepubertal")
dim(ov_late)

In [46]:
no_late<-subset(missing, missing$OVOB == "Normalweight" & missing$PCS == "latepubertal")
dim(no_late)

In [47]:
no_post<-subset(missing, missing$OVOB == "Normalweight" & missing$PCS == "postpubertal")   # empty: PCS() never assigns "postpubertal"
dim(no_post)

In [48]:
gather<-rbind.fill(ob_late, ob_post, ov_post, ov_late, no_late, no_post)
dim(gather)

In [49]:
mytable <- xtabs(~gender+PCS+OVOB, data=gather)
ftable(mytable) # print table 
summary(mytable)

                    OVOB Normalweight Obese Overweight
gender PCS                                            
F      latepubertal                 4     7          2
M      latepubertal                 7     2          1

Call: xtabs(formula = ~gender + PCS + OVOB, data = gather)
Number of cases in table: 23 
Number of factors: 3 
Test for independence of all factors:
	Chisq = 3.599, df = 2, p-value = 0.1654
	Chi-squared approximation may be incorrect

In [50]:
#write.table(gather$src_subject_id, "~/Google Drive/ABCD/important_txt/missing_grab.txt", sep="\t", row.names=F)

In [51]:
#write.table(ov_early$src_subject_id, "~/Google Drive/ABCD/important_txt/missing_grab2.txt", sep="\t", row.names=F)