# Overall DNAm data
**S3 locations**
* s3://rti-cannabis/shared_data/raw_data/alspac/dnam_450_g0m_g1/version1/samplesheet/data.Robj
* s3://rti-cannabis/shared_data/raw_data/alspac/B3176_datasetids.csv

The `B3176_datasetids.csv` file is a linking file that maps IDs across the different data types. To put the DNAm and phenotype data on a consistent ID, we convert everything to the cidB3176 format.

The samplesheet R object (`samplesheet/data.Robj` on S3, referred to here as `samplesheet.Robj`) contains the information needed to convert each Sample_Name to an ALN. We then convert the ALN to cidB3176 using the linking file, so the linking chain is:

`Sample_Name --> ALN == dnam_450_g0m_g1 --> cidB3176`

The ALN from the samplesheet corresponds to the `dnam_450_g0m_g1` column in the linking file.
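
As a minimal sketch of this chain, the two lookups can be written with `match()`. The two linking rows are copied from the head of `B3176_datasetids.csv` shown later in this notebook; the `Sample_Name` values and the `*_toy` object names are made up for illustration and are not real ALSPAC samples.

```r
# Toy linking table: two rows from B3176_datasetids.csv (only the two columns used here)
linking_toy <- data.frame(
  dnam_450_g0m_g1 = c(35599, 52612),
  cidB3176        = c(18110, 7438)
)

# Toy samplesheet: Sample_Name values are hypothetical
samplesheet_toy <- data.frame(
  Sample_Name = c("SLIDE_X_R01C01", "SLIDE_Y_R02C01"),
  ALN         = c(52612, 35599),
  stringsAsFactors = FALSE
)

# Step 1: Sample_Name -> ALN is already a column of the samplesheet
# Step 2: ALN (== dnam_450_g0m_g1) -> cidB3176 via the linking table
idx <- match(samplesheet_toy$ALN, linking_toy$dnam_450_g0m_g1)
samplesheet_toy$cidB3176 <- linking_toy$cidB3176[idx]

print(samplesheet_toy)
```

`match()` returns `NA` for any ALN that is absent from the linking table, so a quick `anyNA(idx)` check after the lookup is worthwhile.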

The issue is that multiple Sample_Name entries in `samplesheet.Robj` map to the same ALN. For example, the Sample_Names SLIDE210_R04C02, SLIDE281_R01C01, SLIDE78_R06C02, SLIDE184_R03C02, and SLIDE290_R05C01 all map to ALN 45554, even though they are all distinct samples.
```
Sample_Name ALN QLET Slide sentrix_row sentrix_col time_code time_point Sex BCD_plate sample_type additive age duplicate.rm genotypeQCkids genotypeQCmums
SLIDE210_R04C02 45554 A SLIDE210 04 02 F17 15up M PLATE110 whitecells EDTA 17.9166666666667 NA Y NA
SLIDE281_R01C01 45554 A SLIDE281 01 01 F7 F7 M PLATE115 whitecells EDTA 7.41666666666667 NA Y NA
SLIDE78_R06C02 45554 M SLIDE78 06 02 TF1-3 FOM F PLATE87 PBL CPDA 40.55578371 NA NA Y
SLIDE184_R03C02 45554 M SLIDE184 03 02 antenatal antenatal F PLATE70 wholeblood EDTA 25 NA NA Y
SLIDE290_R05C01 45554 A SLIDE290 05 01 cord cord M PLATE109 whitecells heparin NA NA Y NA
```

We can see the difference in these samples when we look at the time_code:
```
time_code
F17
F7
TF1-3
antenatal
cord
```

For our EWAS analyses, we are interested in the mothers at TF1-3 (~42.9 yrs) and FOM (~47.8 yrs), and the children at 15 yrs (TF3) and 17 yrs (F17).

So for the above samples, we would split the samplesheet into mothers and children and keep only the time points of interest:

**Mothers**
```
SLIDE78_R06C02 45554 M SLIDE78 06 02 TF1-3 FOM F PLATE87 PBL CPDA 40.55578371 NA NA Y
```

**Children**
```
SLIDE210_R04C02 45554 A SLIDE210 04 02 F17 15up M PLATE110 whitecells EDTA 17.9166666666667 NA Y NA
```
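
To see why ALN alone is not a unique key, here is a small sketch that rebuilds the five example rows (only four of the columns, with values copied from the table above) and pulls out the time points of interest. The object names are illustrative only.

```r
example_rows <- data.frame(
  Sample_Name = c("SLIDE210_R04C02", "SLIDE281_R01C01", "SLIDE78_R06C02",
                  "SLIDE184_R03C02", "SLIDE290_R05C01"),
  ALN         = rep(45554, 5),
  QLET        = c("A", "A", "M", "M", "A"),
  time_code   = c("F17", "F7", "TF1-3", "antenatal", "cord"),
  stringsAsFactors = FALSE
)

# All five samples share one ALN, so four of them are flagged as duplicates
n_dup <- sum(duplicated(example_rows$ALN))
n_dup

# Mothers: QLET "M" at TF1-3 or FOM
mothers_example <- example_rows[example_rows$QLET == "M" &
                                example_rows$time_code %in% c("TF1-3", "FOM"), ]

# Children: QLET other than "M" at TF3 or F17
children_example <- example_rows[example_rows$QLET != "M" &
                                 example_rows$time_code %in% c("TF3", "F17"), ]

mothers_example$Sample_Name
children_example$Sample_Name
```

After filtering on both QLET and time_code, each group holds a single sample for this ALN.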

In [None]:
head(samplesheet)
#     Sample_Name   ALN QLET    Slide sentrix_row sentrix_col time_code time_point Sex BCD_plate sample_type additive age duplicate.rm genotypeQCkids genotypeQCmums
#4874 SLIDE210_R04C02 45554    A SLIDE210          04          02       F17   15up   M  PLATE110  whitecells     EDTA 17.91667         <NA>              Y           <NA>
#5266 SLIDE380_R02C01 36465    M SLIDE380          02          01     TF1-3  FOM   F   PLATE87         PBL     CPDA 43.92608         <NA>          <NA>              Y

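# first lines of the linking file (shell equivalent: head B3176_datasetids.csv)
writeLines(readLines("B3176_datasetids.csv", n = 3))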
#gwa_660_g0m,gwa_550_g1,gi_1000g_g0m_g1,gi_hrc_g0m_g1,dnam_450_g0m_g1,cidB3176
#54406,30734,36464,42623,35599,18110
#42102,38724,51506,41542,52612,7438

In [None]:
# load in DNAm data sample sheet
load("samplesheet.Robj")

# Remove duplicate and population stratification samples
sample.rm <- which(samplesheet$duplicate.rm == "Remove" |
                   samplesheet$genotypeQCkids == "ETHNICITY" |
                   samplesheet$genotypeQCkids == "HZT;ETHNICITY" |
                   samplesheet$genotypeQCmums == "/strat")

#261 samples are removed
length(sample.rm)
# 261

samplesheet_filtered<-samplesheet[-sample.rm,]
dim(samplesheet_filtered) # 4593 x 16
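
The filter above relies on `which()` because the QC columns are mostly `NA`, and comparing `NA` to a string gives `NA` rather than `FALSE`. The sketch below shows that behaviour on a toy table (sample IDs and the `*_toy` names are hypothetical). It also shows one way to guard against the case where nothing is flagged, since `x[-integer(0), ]` returns zero rows.

```r
# Toy QC columns with the same mix of NA and flag values as the samplesheet
qc_toy <- data.frame(
  Sample_Name    = paste0("S", 1:5),   # hypothetical IDs
  duplicate.rm   = c(NA, "Remove", NA, NA, NA),
  genotypeQCkids = c("Y", NA, "ETHNICITY", NA, "Y"),
  genotypeQCmums = c(NA, NA, NA, "/strat", NA),
  stringsAsFactors = FALSE
)

flagged <- qc_toy$duplicate.rm == "Remove" |
  qc_toy$genotypeQCkids == "ETHNICITY" |
  qc_toy$genotypeQCkids == "HZT;ETHNICITY" |
  qc_toy$genotypeQCmums == "/strat"

# The raw comparison is NA wherever no flag is set
flagged

# which() keeps only the positions that are TRUE and drops the NA entries
to_remove <- which(flagged)

# setdiff() on row positions still works when to_remove is empty
keep_rows   <- setdiff(seq_len(nrow(qc_toy)), to_remove)
qc_filtered <- qc_toy[keep_rows, ]
qc_filtered$Sample_Name
```

In this notebook 261 samples are flagged, so the negative index is safe here; the guard only matters if the same code is reused on a subset with no flagged samples.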


In [None]:
# linking IDs
linking <- read.csv("B3176_datasetids.csv")
head(linking)

In [None]:
# derived surrogate variables using control probes
svs <- read.table("alspac_control_probe_sva.txt",header=T,stringsAsFactors=F)
head(svs)

In [None]:
# derived cell type proportions using Houseman method
cellCounts <- read.table("cellCounts.data.txt",header=T,stringsAsFactors=F)
head(cellCounts)

# DNAm data for mothers

In [None]:
# making subset of mothers from only two time points: TF1-3 and FOM
mothers <- samplesheet_filtered[which(samplesheet_filtered$QLET=="M" & 
                         (samplesheet_filtered$time_code=="TF1-3" | samplesheet_filtered$time_code=="FOM")),]
dim(mothers)
# [1] 947  16
sum(duplicated(mothers$ALN))
# [1] 0


In [None]:
# Get SVs for mother samples
mothers_svs <- merge(svs,mothers,by="Sample_Name")
head(mothers_svs)
sum(is.na(mothers_svs$ALN))
# [1] 0
sum(duplicated(mothers_svs$ALN))
# [1] 0


In [None]:
# Get cell counts for mother samples
mothers_svs_cellCounts <- merge(mothers_svs,cellCounts,by="Sample_Name")
dim(mothers_svs_cellCounts)
sum(duplicated(mothers_svs_cellCounts$ALN))
mothers_svs_cellCounts$cidB3176 <- linking$cidB3176[match(mothers_svs_cellCounts$ALN,linking$dnam_450_g0m_g1)]
dim(mothers_svs_cellCounts)
# [1] 947  29


In [None]:
write.table(mothers_svs_cellCounts,file="alspac_mothers_samplesheet_svs_cellCounts.txt",quote=F,row.names=F)

# DNAm data for kids (TF3 and F17)

In [None]:
# making subset of teenage kids from only two time points: TF3 and F17
kids <- samplesheet_filtered[which(samplesheet_filtered$QLET!="M" & 
                         (samplesheet_filtered$time_code=="TF3" | samplesheet_filtered$time_code=="F17")),]
sum(duplicated(kids$ALN))
# [1] 2
table(kids$QLET)
#   A   B   M
# 923   2   0
# need to take care of the two pairs of twins (same ALN, QLET A and B)

In [None]:
# Get SVs for kid samples
kids_svs <- merge(svs,kids,by="Sample_Name")
head(kids_svs)
sum(is.na(kids_svs$ALN))
# [1] 0
sum(duplicated(kids_svs$ALN))
# [1] 2


In [None]:
# Get cell counts for kid samples
kids_svs_cellCounts <- merge(kids_svs,cellCounts,by="Sample_Name")
dim(kids_svs_cellCounts)
sum(duplicated(kids_svs_cellCounts$ALN))
# [1] 2
kids_svs_cellCounts$cidB3176 <- linking$cidB3176[match(kids_svs_cellCounts$ALN,linking$dnam_450_g0m_g1)]
dim(kids_svs_cellCounts)
# [1] 925  29
table(kids_svs_cellCounts$QLET)

#   A   B   M
# 923   2   0
# Remember to check QLET when merging with phenotype data
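
Because the two twin pairs share an ALN, merging with phenotype data on ALN alone would cross-match the twins. A minimal sketch of the problem and one possible fix is below. All values are hypothetical, and the phenotype column name `qlet` is an assumption about how the phenotype file labels it.

```r
# Toy kids samplesheet: one twin pair sharing an ALN, plus a singleton
kids_toy <- data.frame(
  Sample_Name = c("S1", "S2", "S3"),
  ALN         = c(1001, 1001, 1002),
  QLET        = c("A", "B", "A"),
  stringsAsFactors = FALSE
)

# Toy phenotype table keyed by ALN and qlet (column name `qlet` is assumed)
pheno_toy <- data.frame(
  ALN  = c(1001, 1001, 1002),
  qlet = c("A", "B", "A"),
  age  = c(17.1, 17.1, 15.4),
  stringsAsFactors = FALSE
)

# Merging on ALN alone pairs each twin with both phenotype rows
by_aln <- merge(kids_toy, pheno_toy, by = "ALN")
nrow(by_aln)

# Merging on ALN plus QLET keeps one row per sample
by_aln_qlet <- merge(kids_toy, pheno_toy,
                     by.x = c("ALN", "QLET"), by.y = c("ALN", "qlet"))
nrow(by_aln_qlet)
```

Comparing `nrow()` before and after the merge is a cheap way to catch this kind of row inflation.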

In [None]:
write.table(kids_svs_cellCounts,file="alspac_kids_samplesheet_svs_cellCounts.txt",quote=F,row.names=F)

# cidB3176 to Sample_Name
I think it will be simpler to convert the cidB3176 IDs in the phenotype files to the Sample_Name format. The DNAm data, the cell type proportions, and the SVs from the SVA are all already keyed by Sample_Name, so this is the most straightforward route and leaves less chance of breaking the preprocessing.

Note that there appear to be some legitimate duplicates in the data. For example, there are 16 instances of duplicated mothers' data. I will keep one sample from each pair of duplicates: the one that is not marked `Remove` in `duplicate.rm`.

In [None]:
### R
# update the phenotype file sample IDs from cidB3176 to Sample_Name
mothers_pheno <- read.csv("pheno_mothers_combined_FOM_TF1_3_n977_ewas.txt", sep = " ")
linking_file <- read.csv("/home/ec2-user/rti-cannabis/shared_data/raw_data/alspac/B3176_datasetids.csv")
load("~/rti-cannabis/shared_data/raw_data/alspac/dnam_450_g0m_g1/version1/samplesheet/data.Robj")

dim(mothers_pheno) # 977 x 7
dim(linking_file) # 21506 x 6
dim(samplesheet) # 4854 x 16

# keep only mothers in TF1-3 and FOM
samplesheet_mothers <-  samplesheet[which(samplesheet$time_code == "TF1-3" | 
                                          samplesheet$time_code == "FOM"),]
dim(samplesheet_mothers) # 992 x 16

# Remove duplicate and population stratification samples 
sample.rm <- which(samplesheet_mothers$duplicate.rm=="Remove" | 
                   samplesheet_mothers$genotypeQCkids == "ETHNICITY" | 
                   samplesheet_mothers$genotypeQCkids=="HZT;ETHNICITY"|
                   samplesheet_mothers$genotypeQCmums=="/strat" )

samplesheet_mothers <- samplesheet_mothers[-sample.rm, ]                                     

dim(samplesheet_mothers) # 947 x 16

# remove duplicated cidB3176 IDs from phenotype file (keeps the first occurrence;
# logical indexing is also safe when there are no duplicates)
mothers_pheno <- mothers_pheno[!duplicated(mothers_pheno$cidB3176), ]

dim(mothers_pheno) # 975 x 7

# add ALN IDs to the phenotype file from the linking file (ALN == dnam_450_g0m_g1)
mothers_pheno$ALN <- linking_file$dnam_450_g0m_g1[match(mothers_pheno$cidB3176, linking_file$cidB3176)]

# NOTE: there are no duplicated ALNs this time. If there were, we would have to address 
# them using the "qlet" to get the unique Sample_Name
which(duplicated(mothers_pheno$ALN)) # integer(0)

# remove any mothers from the phenotype file that were not in the curated samplesheet
keep <- mothers_pheno$ALN %in% samplesheet_mothers$ALN
mothers_pheno <- mothers_pheno[keep,] # 946 x 8

# add Sample_Name to the phenotype file
mothers_pheno$Sample_Name <- samplesheet_mothers$Sample_Name[match(mothers_pheno$ALN, samplesheet_mothers$ALN)]

dim(mothers_pheno) # 946 x 9
                            
outname <- "/home/ec2-user/rti-cannabis/shared_data/post_qc/alspac/phenotypes/pheno_mothers_combined_FOM_TF1_3_n946_with_sample_name_ids.txt"
write.table(mothers_pheno, outname, quote = F, row.names = F)

## Add cell-type proportions and SVs to phenotype files

In [None]:
### R 
pheno_mothers <- read.csv("/home/ec2-user/rti-cannabis/shared_data/post_qc/alspac/phenotypes/pheno_mothers_combined_FOM_TF1_3_n946_with_sample_name_ids.txt",
                          sep = " ")
cellcounts <- read.csv("/home/ec2-user/rti-cannabis/shared_data/raw_data/alspac/dnam_450_g0m_g1/version1/derived/cellcounts/houseman/data.txt",
                      sep = "\t")
sv_data <- read.csv("/home/ec2-user/rti-cannabis/shared_data/post_qc/alspac/phenotypes/alspac_control_probe_sva.txt", sep = " ")

dim(cellcounts) # 4854 x 7
dim(pheno_mothers) # 946 x 9
dim(sv_data) # 4854 x 7

# add cellcounts
# combine or merge? look at how Fang did it in Gulf study
add_cellcounts <- which(cellcounts$Sample_Name %in% pheno_mothers$Sample_Name)
cellcounts[add_cellcounts, 1]

# attach cell type estimations and SVs to pheno
pheno_mothers_cellcounts <- merge(pheno_mothers, cellcounts, by = "Sample_Name")
pheno_mothers_cellcounts_svs <- merge(pheno_mothers_cellcounts, sv_data, by = "Sample_Name")

# reorder
pheno_mothers_cellcounts_svs <- pheno_mothers_cellcounts_svs[, c(1,2,9,3,4,5,6,7,8,10,11,12,13,14,15,16,17,18,19,20,21)]
dim(pheno_mothers_cellcounts_svs) # 946 x 21

outfile <- "/home/ec2-user/rti-cannabis/shared_data/post_qc/alspac/phenotypes/pheno_mothers_combined_FOM_TF1_3_n946_ewas_final.txt"
write.table(pheno_mothers_cellcounts_svs, file=outfile, quote=F, row.names=F)
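
On the "combine or merge?" question noted in the cell above, the sketch below compares `merge()` with a `match()`-based column bind on toy data. All IDs, values, and object names here are hypothetical. The main differences are row order and how unmatched samples are handled.

```r
# Toy inputs; all IDs and values are hypothetical
pheno_toy <- data.frame(Sample_Name = c("S3", "S1", "S2"),
                        pheno_value = c(0, 1, 0),
                        stringsAsFactors = FALSE)
cell_toy  <- data.frame(Sample_Name = c("S1", "S2", "S3", "S4"),
                        CD8T        = c(0.10, 0.12, 0.08, 0.11),
                        stringsAsFactors = FALSE)

# Option 1: merge() keeps only shared Sample_Names and sorts the result by the key
merged <- merge(pheno_toy, cell_toy, by = "Sample_Name")

# Option 2: match() keeps the phenotype row order and gives NA for unmatched samples
idx      <- match(pheno_toy$Sample_Name, cell_toy$Sample_Name)
combined <- cbind(pheno_toy, CD8T = cell_toy$CD8T[idx])

# Either way, confirm that no phenotype rows were lost or duplicated
stopifnot(nrow(merged) == nrow(pheno_toy),
          !anyNA(idx),
          !any(duplicated(merged$Sample_Name)))

merged$Sample_Name    # sorted by key
combined$Sample_Name  # original phenotype order
```

Since the final table here has the same 946 rows as the phenotype input, `merge()` did not drop any samples, but the row order may differ from the input file.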

In [None]:
# upload to s3
cd ~/rti-cannabis/shared_data/post_qc/alspac/phenotypes
aws s3 cp pheno_mothers_combined_FOM_TF1_3_n946_ewas_final.txt \
    s3://rti-cannabis/shared_data/post_qc/alspac/phenotypes/
