# Link Phenotype to DNAm & Genotype Data

Email from Fang:

```css
I have done the phenotype summarization for the ALSPAC data. Also, I realized that it is pretty complicated to deliver all the information in the task assignment on Teams, so along with that, I am writing down more details in this email for you to check.
 
The DNA methylation data includes three waves for Mother and four waves for Child:
```

**Methylation**:
Mother & Child

| Group  | Wave                  | n   |
|--------|-----------------------|-----|
| Mother | Antenatal (~28.7 yrs) | 987 |
| Mother | TF1-3 (~42.9 yrs)     | 182 |
| Mother | FOM (~47.8 yrs)       | 810 |
| Child  | Cord                  | 914 |
| Child  | 7 yrs (F7)            | 980 |
| Child  | 15 yrs (TF3)          | 254 |
| Child  | 17 yrs (F17)          | 727 |

```css
I have created phenotype files separately for each wave with cannabis use information (mothers_antenatal, mothers_TF1_3, mothers_FOM, kids_F15, kids_F17) at: s3://rti-cannabis/shared_data/post_qc/alspac/phenotypes/.
 
To link the phenotype files, DNA methylation and genetic data, you will need the linking id file: s3://rti-cannabis/shared_data/raw_data/alspac/B3176_datasetids.csv
 
The raw DNA methylation data is at s3://rti-cannabis/shared_data/raw_data/alspac/dnam_450_g0m_g1/version1/betas/data.Robj
 
The genetic data are at s3://rti-cannabis/shared_data/raw_data/alspac/gwa_550_g1/ and s3://rti-cannabis/shared_data/raw_data/alspac/gwa_660_g0m/.
 
So I think the first thing is to extract DNA methylation data and genetic data matching the phenotype data, giving a demographic summary for each wave. 
```
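
The linking step described in the email amounts to a lookup from `cidB3176` to the DNAm dataset ID. The notebook does this in R with `match()`; the following is only a shell sketch of the same idea on made-up toy files (the ID values and the two-column layout are assumptions, the real linking file has more columns).

```shell
workdir="$(mktemp -d)"

# Toy linking file: cidB3176 -> DNAm dataset ID (hypothetical values)
cat > "$workdir/link.csv" <<'EOF'
cidB3176,dnam_450_g0m_g1
7,30001
92,30002
97,30003
EOF

# Toy phenotype file keyed by cidB3176
cat > "$workdir/pheno.csv" <<'EOF'
cidB3176,cannabisUse
7,2
97,1
500,2
EOF

# First pass stores the ID map; second pass appends the mapped ID (NA when absent)
linked="$(awk -F, -v OFS=, '
    NR == FNR { map[$1] = $2; next }
    { print $0, (($1 in map) ? map[$1] : "NA") }
' "$workdir/link.csv" "$workdir/pheno.csv")"

echo "$linked"
rm -rf "$workdir"
```

Rows whose ID is missing from the linking file are kept with `NA` rather than dropped, which makes unmatched samples easy to spot before filtering.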
 

## Download data
Use the GOBOT sandbox (https://github.rti.org/RTI/Bio-research-machine) because it has a lot of storage and memory.

Note that the GOBOT sandbox server has ARM architecture and therefore PLINK will not work as of 10/26/2021. To resolve this issue, we will have to use a different server when we use PLINK.
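
A quick way to confirm which kind of server you are on before attempting a PLINK step is to inspect `uname -m`. This is a minimal sketch; `is_arm` is a hypothetical helper name, and the list of architecture strings it matches is an assumption that may not cover every ARM variant.

```shell
# Hypothetical helper: succeeds when the given architecture string looks like ARM
is_arm() {
    case "$1" in
        aarch64|arm64|arm*) return 0 ;;
        *) return 1 ;;
    esac
}

arch="$(uname -m)"
if is_arm "$arch"; then
    echo "ARM host ($arch): run the PLINK steps on a different server"
else
    echo "Non-ARM host ($arch)"
fi
```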

In [None]:
## on a large memory node (r5.2xlarge with 64GB)

sudo yum update -y
aws configure # enter credentials and configurations

# install R
#sudo amazon-linux-extras install R4 -y

# install Docker
sudo amazon-linux-extras install docker
sudo service docker start
sudo usermod -a -G docker ec2-user
# then logout and back in


mkdir -p rti-cannabis/shared_data/raw_data/alspac/dnam_450_g0m_g1/version1/betas/ 
mkdir -p rti-cannabis/shared_data/post_qc/alspac/dna_methylation/
mkdir -p rti-cannabis/shared_scripts/DNAm_xreactive/

# cross-reactive probes
aws s3 cp s3://rti-cannabis/shared_scripts/DNAm_xreactive/cross.csv \
    ~/rti-cannabis/shared_scripts/DNAm_xreactive/

mkdir -p rti-cannabis/shared_data/post_qc/alspac/phenotypes
mkdir -p rti-cannabis/shared_data/raw_data/alspac/gwa_550_g1/
mkdir -p rti-cannabis/shared_data/raw_data/alspac/gwa_660_g0m/

# DNAm file
cd rti-cannabis/shared_data/raw_data/alspac/dnam_450_g0m_g1/version1/betas/
aws s3 cp s3://rti-cannabis/shared_data/raw_data/alspac/dnam_450_g0m_g1/version1/betas/data.Robj .

# phenotype files
cd ~/rti-cannabis/shared_data/post_qc/alspac/phenotypes/
aws s3 sync s3://rti-cannabis/shared_data/post_qc/alspac/phenotypes/ .

# genetic data
cd ~/rti-cannabis/shared_data/raw_data/alspac/gwa_550_g1/
aws s3 sync s3://rti-cannabis/shared_data/raw_data/alspac/gwa_550_g1/version1/ .

cd ~/rti-cannabis/shared_data/raw_data/alspac/gwa_660_g0m/
aws s3 sync s3://rti-cannabis/shared_data/raw_data/alspac/gwa_660_g0m/version1/ .

# linking ID file
cd ~/rti-cannabis/shared_data/raw_data/alspac
aws s3 cp s3://rti-cannabis/shared_data/raw_data/alspac/B3176_datasetids.csv .

# DNAm Data
Use the EWAS Docker image that Fang Fang created.

*Note*: an updated ewas image is available at `rtibiocloud/ewas:v0.0.2_99db04b`

In [None]:
cd rti-cannabis/shared_data/raw_data/alspac/dnam_450_g0m_g1/version1/betas/

## start interactive session
docker run \
    --rm \
    -v $PWD:$PWD \
    -it ffang8/ewas:v041221 \
    /bin/bash

cd /home/ec2-user/rti-cannabis/shared_data/raw_data/alspac/dnam_450_g0m_g1/version1/betas 

## Load libraries

In [None]:
## Start R
R

##############################
##LOAD LIBRARIES
##############################

library(minfi)
library(ggplot2)
    theme_set(theme_bw(base_size=18) + 
        theme(panel.grid.major = element_blank(),
         panel.grid.minor = element_blank(),
         plot.title = element_text(hjust = 0.5),
         legend.position="none"))
library(ggrepel)
library(jaffelab)
library(gridExtra)
library(grid)
library(dplyr)
library(RColorBrewer)
library(FlowSorted.Blood.EPIC)
library(data.table) # to process results
library(MASS) # rlm function for robust linear regression
library(sandwich) # Huber's estimation of the standard error
library(lmtest) # to use coeftest
library(parallel) # to use multicore approach - part of base R - can be omitted if lapply() used instead of mclapply()
# library(IlluminaHumanMethylationEPICanno.ilm10b4.hg19) #This is the newest version, LIBD processed data using older version#
library(IlluminaHumanMethylation450kanno.ilmn12.hg19)
library(ENmix)


```
We strongly recommend to drop the probes that contain either a SNP at the CpG interrogation or at the single nucleotide extension. The function dropLociWithSnps allows to drop the corresponding probes.
```
-- https://www.bioconductor.org/help/course-materials/2015/BioC2015/methylation450k.html

```
This step is to filter out probes with SNPs, since SNPs will affect the probe hybridization and then the estimation of the methylation results will be inaccurate.
```
-- Fang Fang

```
Hybridization probe
In molecular biology, a hybridization probe is a fragment of DNA or RNA of variable length which can be radioactively or fluorescently labeled. It can then be used in DNA or RNA samples to detect the presence of nucleotide substances that are complementary to the sequence in the probe.
```

## Apply probe filters

In [None]:
### R 
load("data.Robj", verbose=TRUE)
# betas

##############################
##PROBE FILTERING
##############################
#In order to use minfi for additional annotation, we need to create a GenomicRatioSet
#beta_grs<-makeGenomicRatioSetFromMatrix(beta,array = "IlluminaHumanMethylationEPIC", annotation = "ilm10b4.hg19") 
beta_grs <- makeGenomicRatioSetFromMatrix(betas,
                                          array = "IlluminaHumanMethylation450k",
                                          annotation = "ilmn12.hg19")
dim(beta_grs) #482855 x 4854

#DROP PROBES WITH SNP IN SBE OR CPG
beta_grs <- addSnpInfo(beta_grs)
beta_grs_filter <- dropLociWithSnps(beta_grs, snps=c("SBE","CpG"), maf=0.01) 

dim(beta_grs_filter) #465873 x 4854

save(beta_grs_filter,file="~/rti-cannabis/shared_data/post_qc/alspac/dna_methylation/beta_grs_filter.Rdata")

## Close the R session to free up memory

In [None]:
cd ~/rti-cannabis/shared_data/post_qc/alspac/dna_methylation/

## Start R session with new Rdata
library(IlluminaHumanMethylation450kanno.ilmn12.hg19)

load("beta_grs_filter.Rdata", verbose=TRUE)

head(rownames(beta_grs_filter))
#[1] "cg13869341" "cg14008030" "cg12045430" "cg20826792" "cg00381604"
#[6] "cg20253340"

# betas is not loaded in this session, so read sample names from the filtered object
head(colnames(beta_grs_filter))
#[1] "SLIDE210_R04C02" "SLIDE380_R02C01" "SLIDE306_R02C02" "SLIDE16_R06C01"
#[5] "SLIDE189_R04C02" "SLIDE132_R03C01"

#DROP PROBES WITH CROSSREACTIVE BEHAVIOR
xReactiveProbes<-read.csv(file="~/rti-cannabis/shared_scripts/DNAm_xreactive/cross.csv", stringsAsFactors=FALSE)
keep <- !(featureNames(beta_grs_filter) %in% xReactiveProbes[,1])
table(keep) # 26,769 to remove
beta_grs_filter <- beta_grs_filter[keep,] 

dim(beta_grs_filter) # 439104   4854

#DROP PROBES THAT MAP TO THE CHRY, but keep CHRX
annEPIC = getAnnotation(IlluminaHumanMethylation450kanno.ilmn12.hg19)
table(annEPIC$chr)
#  chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19  chr2 chr20
# 46857 24388 28794 24539 12285 15078 15259 21969 27879  5922 25521 34810 10379
# chr21 chr22  chr3  chr4  chr5  chr6  chr7  chr8  chr9  chrX  chrY
# 4243  8552 25159 20464 24327 36611 30017 20950  9861 11232   416


for(chr in names(table(annEPIC$chr))){
    keep_chr <- featureNames(beta_grs_filter) %in% annEPIC$Name[annEPIC$chr==chr]
    beta_chr <- beta_grs_filter[keep_chr,]
    bVals_chr <- getBeta(beta_chr)
    save(bVals_chr, file = paste0('alspac_dnameth_betas_',chr,'.rda'))
}

keep <- !(featureNames(beta_grs_filter) %in% annEPIC$Name[annEPIC$chr %in% c("chrY")])
table(keep) #303 to remove
beta_grs_filter <- beta_grs_filter[keep,]
dim(beta_grs_filter) # 438801   4854
   
setwd("~/rti-cannabis/shared_data/post_qc/alspac/dna_methylation/")
save(beta_grs_filter, file = 'alspac_dnameth_withchrX_beta_grs_filter.rda')


######################################
##CONVERT BACK TO BETA MATRIX
######################################

bVals <- getBeta(beta_grs_filter)
dim(bVals) # 438801   4854

##########################################
##SAVE METH data
###########################################
save(bVals, file = 'alspac_dnameth_withchrX_bVals.rda')

## Upload DNAm to S3
Upload to S3 after removing cross-reactive and SNP-associated probes.

In [None]:
cd ~/rti-cannabis/shared_data/post_qc/alspac/dna_methylation/

aws s3 cp alspac_dnameth_withchrX_beta_grs_filter.rda s3://rti-cannabis/shared_data/post_qc/alspac/dna_methylation/
aws s3 cp alspac_dnameth_withchrX_bVals.rda s3://rti-cannabis/shared_data/post_qc/alspac/dna_methylation/

for chr in {1..22} X Y; do
    aws s3 cp alspac_dnameth_betas_chr$chr.rda s3://rti-cannabis/shared_data/post_qc/alspac/dna_methylation/
done

# Phenotype

## SVA and celltype proportions
Surrogate variables (SVs) are derived from the non-negative control probes.

We do not have the `rgSet`, but Fang calculated the SVs from the control probe summaries provided with the raw data. Add these to the phenotype file.

```css
[10/29/21 12:35 PM] Fang, Fang
Hey, Jesse, I finally made the sva files based on the summary control probes they provided in the raw data, using customized code mimicking the core code in ENmix. The file contains 6 SVs that should be included in our EWAS. 

s3://rti-cannabis/shared_data/post_qc/alspac/phenotypes/alspac_control_probe_sva.txt

The codes are recorded in s3://rti-cannabis/shared_data/post_qc/alspac/phenotypes/ALSPAC surrogate variables based on control probes.ipynb 

I would like to try another way using the R package meffil later on.
```

### Issue

- ALSPAC does not provide raw data (IDAT files) or an `rgSet`, only beta values and QC objects produced with meffil.

`Min, J. L., Hemani, G., Davey Smith, G., Relton, C., & Suderman, M. (2018). Meffil: efficient normalization and analysis of very large DNA methylation datasets. Bioinformatics, 34(23), 3983-3989.` 

- We cannot generate surrogate variables as we usually do with ENmix.

`sv <- ctrlsva(rgSet,percvar=0.95,npc=6,flag=1)`

### Solution 1 - mimic the method in ENmix (ctrlsva)
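
With `flag=1`, the selection rule keeps the smallest number of principal components whose cumulative share of variance reaches `percvar`. The following is a minimal sketch of only that rule, using made-up eigenvalues; the PCA itself is done in R, and the numbers here are not from the ALSPAC data.

```shell
# Made-up eigenvalues (variances of successive principal components)
eigenvalues="50 30 10 6 3 1"
percvar=0.95

npc="$(echo "$eigenvalues" | awk -v percvar="$percvar" '{
    total = 0
    for (i = 1; i <= NF; i++) total += $i
    cum = 0
    for (i = 1; i <= NF; i++) {
        cum += $i
        # stop at the first component where the cumulative proportion reaches percvar
        if (cum / total >= percvar) { print i; exit }
    }
}')"

echo "components needed: $npc"
```

With these toy values the cumulative proportions are 0.50, 0.80, 0.90 and 0.96, so four components are kept.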

In [None]:
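# core code of ENmix::ctrlsva
# ctrl_nneg is not defined here; in ENmix it holds the non-negative
# control probe intensities built from the rgSet
percvar=0.95
npc=1
flag=1
pca <- prcomp(t(ctrl_nneg))
eigenvalue = pca$sdev^2
perc = eigenvalue/sum(eigenvalue)
if (flag == 1) {
    # smallest number of PCs whose cumulative variance reaches percvar
    npc=1
    while(sum(perc[1:npc]) < percvar){
        npc <- npc + 1
    }
    ctrlsva = pca$x[,1:npc]
} else {
    ctrlsva = pca$x[,1:npc]
}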
cat(npc," surrogate variables explain ",sum(perc[1:npc])*100,"% of \n data variation\n")

In [None]:
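mkdir -p ~/rti-cannabis/shared_data/raw_data/alspac/dnam_450_g0m_g1/version1/control_matrix/
cd ~/rti-cannabis/shared_data/raw_data/alspac/dnam_450_g0m_g1/version1/control_matrix/
aws s3 cp s3://rti-cannabis/shared_data/raw_data/alspac/dnam_450_g0m_g1/version1/control_matrix/data.txt .
mv data.txt control.data.txt # the R code below reads the file under this name

### R
# start from control.data.txt, which holds the control probe intensities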
control <- read.table("control.data.txt",header=T,stringsAsFactors=F)

percvar=0.95
npc=1
flag=1
pca <- prcomp(control[,2:43])
eigenvalue = pca$sdev^2
perc = eigenvalue/sum(eigenvalue)
if (flag == 1) {
    npc=1
    while(sum(perc[1:npc]) < percvar){
        npc <- npc + 1
    }
    npc
    ctrlsva = pca$x[,1:npc]
}else {
    ctrlsva = pca$x[,1:npc]
}

cat(npc," surrogate variables explain ",sum(perc[1:npc])*100,"% of \n data variation\n")
# 6  surrogate variables explain  95.68676 % of
#  data variation



sva <- data.frame(Sample_Name=control$Sample_Name,
                  sv1=ctrlsva[,1],sv2=ctrlsva[,2],sv3=ctrlsva[,3],
                  sv4=ctrlsva[,4],sv5=ctrlsva[,5],sv6=ctrlsva[,6])

write.table(sva,file="alspac_control_probe_sva.txt",quote=F,row.names=F)

In [None]:
aws s3 cp alspac_control_probe_sva.txt s3://rti-cannabis/shared_data/post_qc/alspac/phenotypes/

### Cell type estimation

ALSPAC precomputed the cell-type proportions for us at `s3://rti-cannabis/shared_data/raw_data/alspac/dnam_450_g0m_g1/version1/derived/cellcounts/houseman/data.txt`

In [None]:
mkdir -p ~/rti-cannabis/shared_data/raw_data/alspac/dnam_450_g0m_g1/version1/derived/cellcounts/houseman/
cd ~/rti-cannabis/shared_data/raw_data/alspac/dnam_450_g0m_g1/version1/derived/cellcounts/houseman/
aws s3 cp s3://rti-cannabis/shared_data/raw_data/alspac/dnam_450_g0m_g1/version1/derived/cellcounts/houseman/data.txt .

## Mothers phenotype files

### Combine TF1-3 and FOM

In [None]:
cd /home/ec2-user/rti-cannabis/shared_data/post_qc/alspac/phenotypes

### R 
tf13 <- read.csv("pheno_mothers_TF1_3_ewas.csv")
head(tf13)
#  cidB3176 qlet age_at_DNAm cannabisUse smoking drinking
#1        7    A    51.63860           2   Never  Current
#2       92    A    34.71321           2 Current     Ever
#3       97    A    43.97810           2 Current  Current

fom <- read.csv("pheno_mothers_FOM_ewas.csv")
dim(fom) # 795 x 7
head(fom)
#   cidB3176 qlet age_at_DNAm cannabisUse smoking drinking      BMI
#1       21    A          46           2   Never  Current 21.03090
#2       84    A          57           2    Ever  Current 37.19008
#3      113    A          52           2    Ever  Current 24.00127

# combine datasets
tf13[,"BMI"] <- NA
dim(tf13) # 182 x 7
combined_data <- rbind(tf13, fom)
dim(combined_data)# 977 x 7
head(combined_data)
#  cidB3176 qlet age_at_DNAm cannabisUse smoking drinking BMI
#1        7    A    51.63860           2   Never  Current  NA

write.table(combined_data,file="pheno_mothers_combined_FOM_TF1_3_n977_ewas.txt",quote=F,row.names=F)

### Remove outliers

In [None]:
### R
# update the phenotype file sample IDs from cidB3176 to Sample_Name
mothers_pheno <- read.csv("pheno_mothers_combined_FOM_TF1_3_n977_ewas.txt", sep = " ")
linking_file <- read.csv("/home/ec2-user/rti-cannabis/shared_data/raw_data/alspac/B3176_datasetids.csv")
load("~/rti-cannabis/shared_data/raw_data/alspac/dnam_450_g0m_g1/version1/samplesheet/data.Robj")

dim(mothers_pheno) # 977 x 7
dim(linking_file) # 21506 x 6
dim(samplesheet) # 4854 x 16

# keep only mothers in TF1-3 and FOM
samplesheet_mothers <-  samplesheet[which(samplesheet$time_code == "TF1-3" | 
                                          samplesheet$time_code == "FOM"),]
dim(samplesheet_mothers) # 992 x 16

# Remove duplicate and population stratification samples 
sample.rm <- which(samplesheet_mothers$duplicate.rm=="Remove" | 
                   samplesheet_mothers$genotypeQCkids == "ETHNICITY" | 
                   samplesheet_mothers$genotypeQCkids=="HZT;ETHNICITY"|
                   samplesheet_mothers$genotypeQCmums=="/strat" )

# guard: indexing with an empty negative vector would drop every row
if (length(sample.rm) > 0) samplesheet_mothers <- samplesheet_mothers[-sample.rm, ]

dim(samplesheet_mothers) # 947 x 16

# remove duplicated cidB3176 IDs from phenotype file
# (logical indexing stays correct even when there are no duplicates)
mothers_pheno <- mothers_pheno[!duplicated(mothers_pheno$cidB3176), ]

dim(mothers_pheno) # 975 x 7

# add ALN IDs to the phenotype file from the linking file (ALN == dnam_450_g0m_g1)
mothers_pheno$ALN <- linking_file$dnam_450_g0m_g1[match(mothers_pheno$cidB3176, linking_file$cidB3176)]

# NOTE: there are no duplicated ALNs this time. If there were, we would have to address 
# them using the "qlet" to get the unique Sample_Name
which(duplicated(mothers_pheno$ALN)) # integer(0)

# remove any mothers from the phenotype file that were not in the curated samplesheet
keep <- mothers_pheno$ALN %in% samplesheet_mothers$ALN
mothers_pheno <- mothers_pheno[keep,] # 946 x 8

# add Sample_Name to the phenotype file
mothers_pheno$Sample_Name <- samplesheet_mothers$Sample_Name[match(mothers_pheno$ALN, samplesheet_mothers$ALN)]

dim(mothers_pheno) # 946 x 9
                            
outname = "/home/ec2-user/rti-cannabis/shared_data/post_qc/alspac/phenotypes/pheno_mothers_combined_FOM_TF1_3_n946_with_sample_name_ids.txt"
write.table(mothers_pheno, outname, quote = F, row.names = F)

### Combine SVs and cell-type proportions
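
The two `merge()` calls in this step act as inner joins on `Sample_Name`, so only samples present in all three inputs survive. A shell sketch of the same idea with `sort` and `join`, on toy tables with one value column each (sample names and values are made up):

```shell
workdir="$(mktemp -d)"
printf 'S1 2\nS2 1\n' > "$workdir/pheno.txt"                  # Sample_Name cannabisUse
printf 'S1 0.10\nS2 0.12\nS3 0.08\n' > "$workdir/cells.txt"   # Sample_Name Bcell
printf 'S3 1.5\nS2 -0.4\nS1 0.9\n' > "$workdir/svs.txt"       # Sample_Name sv1

# join requires every input to be sorted on the key column
sort -k1,1 "$workdir/pheno.txt" > "$workdir/pheno.sorted"
sort -k1,1 "$workdir/cells.txt" > "$workdir/cells.sorted"
sort -k1,1 "$workdir/svs.txt"   > "$workdir/svs.sorted"

# inner join of all three tables on the first column
merged="$(join "$workdir/pheno.sorted" "$workdir/cells.sorted" | join - "$workdir/svs.sorted")"

echo "$merged"
rm -rf "$workdir"
```

Sample `S3` has cell counts and an SV but no phenotype row, so it is absent from the result, mirroring how the 4854-row inputs reduce to the phenotype sample set.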

In [None]:
### R 
pheno_mothers <- read.csv("/home/ec2-user/rti-cannabis/shared_data/post_qc/alspac/phenotypes/pheno_mothers_combined_FOM_TF1_3_n946_with_sample_name_ids.txt",
                          sep = " ")
cellcounts <- read.csv("/home/ec2-user/rti-cannabis/shared_data/raw_data/alspac/dnam_450_g0m_g1/version1/derived/cellcounts/houseman/data.txt",
                      sep = "\t")
sv_data <- read.csv("/home/ec2-user/rti-cannabis/shared_data/post_qc/alspac/phenotypes/alspac_control_probe_sva.txt",
                    sep = " ")

dim(cellcounts) # 4854 x 7
dim(pheno_mothers) # 946 x 9
dim(sv_data) # 4854 x 7

# add cellcounts
add_cellcounts <- which(cellcounts$Sample_Name %in% pheno_mothers$Sample_Name)
cellcounts[add_cellcounts, 1]

# attach cell type estimations and SVs to pheno
pheno_mothers_cellcounts <- merge(pheno_mothers,cellcounts,by.x="Sample_Name",by.y="Sample_Name")
pheno_mothers_cellcounts_svs <- merge(pheno_mothers_cellcounts,sv_data,by.x="Sample_Name",by.y="Sample_Name")

# reorder
pheno_mothers_cellcounts_svs <- pheno_mothers_cellcounts_svs[, c(1,2,9,3,4,5,6,7,8,10,11,12,13,14,15,16,17,18,19,20,21)]
dim(pheno_mothers_cellcounts_svs) # 946 x 21

outfile <- "/home/ec2-user/rti-cannabis/shared_data/post_qc/alspac/phenotypes/pheno_mothers_combined_FOM_TF1_3_n946_ewas_final.txt"
write.table(pheno_mothers_cellcounts_svs, file=outfile, quote=F, row.names=F)

## Children's phenotype files

### Combine 15yrs and 17yrs

In [None]:
cd /home/ec2-user/rti-cannabis/shared_data/post_qc/alspac/phenotypes

### R 
setwd("/home/ec2-user/rti-cannabis/shared_data/post_qc/alspac/phenotypes")
f15 <- read.csv("pheno_kids_F15_ewas.csv")
f17 <- read.csv("pheno_kids_F17_ewas.csv")
linking_file <- read.csv("/home/ec2-user/rti-cannabis/shared_data/raw_data/alspac/B3176_datasetids.csv")
load("/home/ec2-user/rti-cannabis/shared_data/raw_data/alspac/dnam_450_g0m_g1/version1/samplesheet/data.Robj")

head(f15); head(f17)
dim(f15) # 253 x 6
dim(f17) # 716 x 6

# combine phenotype datasets
combined_pheno <- rbind(f15, f17)
dim(combined_pheno) # 969 x 6
head(combined_pheno)

# see which samples have duplicate cidB3176 IDs. This indicates twins. These samples will have
# separate Sample_Name IDs in the samplesheet. Take into account the qlet to determine the Sample_Name
dup_ids <- which(duplicated(combined_pheno$cidB3176)) # 295 441
combined_pheno[c(dup_ids-1,dup_ids), ]
#    cidB3176 qlet age_at_DNAm cannabisUse smoking drinking
#294     1525    A    18.00000           2    Ever  Current
#440     5646    A    17.58333           2   Never     Ever
#295     1525    B    18.00000           2    Ever     Ever
#441     5646    B    17.58333           2   Never     Ever


# keep only kids in F17 (17yr old) and TF3 (15yr old) in the samplesheet
samplesheet_kids <-  samplesheet[which(samplesheet$time_code == "F17" | 
                                          samplesheet$time_code == "TF3"),]
dim(samplesheet_kids) # 981 x 16

# Remove duplicate and population stratification samples 
sample.rm <- which(samplesheet_kids$duplicate.rm=="Remove" | 
                   samplesheet_kids$genotypeQCkids == "ETHNICITY" | 
                   samplesheet_kids$genotypeQCkids=="HZT;ETHNICITY"|
                   samplesheet_kids$genotypeQCmums=="/strat" )

# guard: indexing with an empty negative vector would drop every row
if (length(sample.rm) > 0) samplesheet_kids <- samplesheet_kids[-sample.rm, ]

dim(samplesheet_kids) # 925 x 16

samplesheet_kids[which(duplicated(samplesheet_kids$ALN)), c(1,2,3,7,9)]
#        Sample_Name   ALN QLET time_code Sex
#4357 SLIDE84_R06C01 37785    B       F17   M
#2347 SLIDE78_R03C02 36044    A       F17   F


# add ALN IDs to the phenotype file from the linking file (ALN == dnam_450_g0m_g1)
combined_pheno$ALN <- linking_file$dnam_450_g0m_g1[match(combined_pheno$cidB3176, linking_file$cidB3176)]

# remove kids samples from phenotype file that are not in the curated samplesheet
keep <- combined_pheno$ALN %in% samplesheet_kids$ALN
combined_pheno <- combined_pheno[keep,]
dim(combined_pheno) # 924 x 7

# add Sample_Name to the phenotype file
combined_pheno$Sample_Name <- samplesheet_kids$Sample_Name[match(combined_pheno$ALN, samplesheet_kids$ALN)]

# add Sex to the phenotype file
combined_pheno$sex <- samplesheet_kids$Sex[match(combined_pheno$ALN, samplesheet_kids$ALN)]

dim(combined_pheno) # 924 x 9

# NOTE: These below duplicates indicate twins. We will separate them out using
# the qlet. This will enable us to get the unique Sample_Name
dup_ids <- which(duplicated(combined_pheno$Sample_Name)) # 278, 420
combined_pheno[c(dup_ids-1,dup_ids), ] # 277:278 and 419:420
#    cidB3176 qlet age_at_DNAm cannabisUse smoking drinking   ALN Sample_Name    sex
#294     1525    A    18.00000           2    Ever  Current 36044 SLIDE159_R04C01    F
#295     1525    B    18.00000           2    Ever     Ever 36044 SLIDE159_R04C01    M
#440     5646    A    17.58333           2   Never     Ever 37785 SLIDE345_R06C02    F
#441     5646    B    17.58333           2   Never     Ever 37785 SLIDE345_R06C02    M

samplesheet_kids[which(samplesheet_kids$ALN == 36044 | samplesheet_kids$ALN == 37785), 1:4]
#         Sample_Name   ALN QLET    Slide
#2350 SLIDE159_R04C01 36044    B SLIDE159
#2347  SLIDE78_R03C02 36044    A  SLIDE78
#4355 SLIDE345_R06C02 37785    A SLIDE345
#4357  SLIDE84_R06C01 37785    B  SLIDE84

# manually assign the Sample_Name to the twins
combined_pheno[277, "Sample_Name"] <- "SLIDE78_R03C02"
combined_pheno[420, "Sample_Name"] <- "SLIDE84_R06C01"

# twins share an ALN, so matching on ALN gives both twins the same samplesheet row;
# re-derive sex from the corrected, unique Sample_Name
combined_pheno$sex <- samplesheet_kids$Sex[match(combined_pheno$Sample_Name, samplesheet_kids$Sample_Name)]

which(duplicated(combined_pheno$Sample_Name)) # integer(0)

dim(combined_pheno) # 924 x 9

# reorder columns
combined_pheno <- combined_pheno[, c(8,1,7,2,3,4,5,6,9)]

# convert sex coding to PLINK format (M=1, F=2)
combined_pheno[which(combined_pheno$sex == "F"), "sex"] <- 2
combined_pheno[which(combined_pheno$sex == "M"), "sex"] <- 1

outfile <- "/home/ec2-user/rti-cannabis/shared_data/post_qc/alspac/phenotypes/pheno_kids_combined_f15_f17_n924_ewas.txt"
write.table(combined_pheno,file=outfile,quote=F,row.names=F)

### Add SVs and cell-type proportions

In [None]:
cellcounts <- read.csv("/home/ec2-user/rti-cannabis/shared_data/raw_data/alspac/dnam_450_g0m_g1/version1/derived/cellcounts/houseman/data.txt",
                      sep = "\t")
sv_data <- read.csv("/home/ec2-user/rti-cannabis/shared_data/post_qc/alspac/phenotypes/alspac_control_probe_sva.txt", 
                    sep = " ")
kids_pheno <- read.csv("/home/ec2-user/rti-cannabis/shared_data/post_qc/alspac/phenotypes/pheno_kids_combined_f15_f17_n924_ewas.txt",
                      sep = " ")

dim(cellcounts) # 4854 x 7
dim(sv_data) # 4854 x 7
dim(kids_pheno) # 924 x 9

# attach cell type estimations and SVs to pheno
pheno_kids_cellcounts <- merge(kids_pheno,cellcounts,by.x="Sample_Name",by.y="Sample_Name")
pheno_kids_cellcounts_svs <- merge(pheno_kids_cellcounts,sv_data,by.x="Sample_Name",by.y="Sample_Name")
dim(pheno_kids_cellcounts_svs) # 924 x 21

outfile <- "/home/ec2-user/rti-cannabis/shared_data/post_qc/alspac/phenotypes/pheno_kids_combined_f15_f17_n924_ewas_final.txt"
write.table(pheno_kids_cellcounts_svs, file=outfile, quote=F, row.names=F)

### S3 upload

In [None]:
# upload to S3
cd /home/ec2-user/rti-cannabis/shared_data/post_qc/alspac/phenotypes

aws s3 cp pheno_kids_combined_f15_f17_n924_ewas_final.txt \
    s3://rti-cannabis/shared_data/post_qc/alspac/phenotypes/

aws s3 cp pheno_mothers_combined_FOM_TF1_3_n946_ewas_final.txt \
    s3://rti-cannabis/shared_data/post_qc/alspac/phenotypes/

## Children's Cannabis Use in Never Smokers
| Never Smokers | cannabisUse 1 | cannabisUse 2 |
|---------------|---------------|---------------|
| 218           | 20            | 198           |

According to Fang Fang, `cannabisUse` is coded 1 = ever and 2 = never.
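
The table can be produced in a single pass by counting the `cannabisUse` codes among never smokers. This is a sketch on a made-up toy file that only mimics the column order of the real phenotype file (`cannabisUse` in column 6, `smoking` in column 7); the rows are not real data.

```shell
# Toy phenotype file (made-up rows) with cannabisUse in column 6 and smoking in column 7
tmpfile="$(mktemp)"
cat > "$tmpfile" <<'EOF'
Sample_Name cidB3176 ALN qlet age_at_DNAm cannabisUse smoking drinking sex
S1 1 101 A 17.5 2 Never Ever 1
S2 2 102 A 17.6 1 Never Current 2
S3 3 103 A 15.2 2 Ever Current 2
S4 4 104 A 17.9 2 Never Ever 1
EOF

# Skip the header, keep never smokers, and tally cannabisUse (1 = ever, 2 = never)
never_counts="$(awk 'NR > 1 && $7 == "Never" { n[$6]++ }
    END { printf "ever=%d never=%d\n", n[1], n[2] }' "$tmpfile")"

echo "$never_counts"
rm -f "$tmpfile"
```

Skipping the header with `NR > 1` keeps the header row out of any count.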

In [None]:
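mkdir -p /home/ec2-user/rti-cannabis/shared_data/post_qc/alspac/phenotypes/children_cannabis_in_never_smokers/
cd /home/ec2-user/rti-cannabis/shared_data/post_qc/alspac/phenotypes/children_cannabis_in_never_smokers/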

aws s3 cp s3://rti-cannabis/ewas/alspac/data/pheno_kids_combined_f15_f17_n924_ewas_final.txt .

wc -l pheno_kids_combined_f15_f17_n924_ewas_final.txt
#925 pheno_kids_combined_f15_f17_n924_ewas_final.txt
#Sample_Name cidB3176 ALN qlet age_at_DNAm cannabisUse smoking drinking sex Bcell CD4T CD8T Gran Mono NK sv1 sv2 sv3 sv4 sv5 sv6

cat pheno_kids_combined_f15_f17_n924_ewas_final.txt  | awk '$7=="Never" {print $7}'|wc -l
#218
cat pheno_kids_combined_f15_f17_n924_ewas_final.txt  | awk '$7!="Never" {print $7}'| wc -l
#707 (includes the header line, so 706 samples)


head -1 pheno_kids_combined_f15_f17_n924_ewas_final.txt >\
    pheno_kids_combined_f15_f17_n218_never_smokers_ewas_final.txt

cat pheno_kids_combined_f15_f17_n924_ewas_final.txt  |\
    awk '$7=="Never" {print $0}' >>\
    pheno_kids_combined_f15_f17_n218_never_smokers_ewas_final.txt

awk '$6==2' pheno_kids_combined_f15_f17_n218_never_smokers_ewas_final.txt  | wc -l
#198
awk '$6==1' pheno_kids_combined_f15_f17_n218_never_smokers_ewas_final.txt  | wc -l
#20

In [None]:
## upload to S3
aws s3 cp pheno_kids_combined_f15_f17_n218_never_smokers_ewas_final.txt \
    s3://rti-cannabis/ewas/alspac/data/

aws s3 cp pheno_kids_combined_f15_f17_n218_never_smokers_ewas_final.txt \
    s3://rti-cannabis/shared_data/post_qc/alspac/phenotypes/
