# Issue

- ALSPAC does not provide raw data (IDAT files) or an `rgSet`; it provides only beta values and the `qc.objects` generated with meffil.

`Min, J. L., Hemani, G., Davey Smith, G., Relton, C., & Suderman, M. (2018). Meffil: efficient normalization and analysis of very large DNA methylation datasets. Bioinformatics, 34(23), 3983-3989.` 

- Without an `rgSet`, we cannot generate surrogate variables the way we usually do with ENmix:

`sv <- ctrlsva(rgSet,percvar=0.95,npc=6,flag=1)`

# Solution 1 - mimic the method in ENmix (ctrlsva)
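
The approach reproduced in the cells below comes down to running a PCA on the control probe intensities (samples in rows) and keeping the fewest leading components whose cumulative share of variance reaches `percvar`. A minimal sketch of that selection step follows, assuming a samples x probes numeric matrix. The helper name `select_ctrl_pcs` and the simulated `toy` matrix are hypothetical and not part of ENmix or meffil; `cumsum` is used in place of the `while` loop and should select the same number of components.

```r
# Hypothetical helper (not part of ENmix or meffil): PCA on a samples x probes
# matrix, keeping the fewest PCs whose cumulative variance reaches percvar.
select_ctrl_pcs <- function(ctrl, percvar = 0.95) {
    pca <- prcomp(ctrl)                       # columns are centered by default, not scaled
    perc <- pca$sdev^2 / sum(pca$sdev^2)      # proportion of variance per PC
    npc <- which(cumsum(perc) >= percvar)[1]  # smallest PC count reaching the threshold
    list(npc = npc,
         explained = sum(perc[1:npc]),
         sv = pca$x[, 1:npc, drop = FALSE])   # drop = FALSE keeps a matrix even if npc is 1
}

# Simulated data for illustration only: 50 samples x 42 control probe measures
set.seed(1)
toy <- matrix(rnorm(50 * 42), nrow = 50, ncol = 42)
res <- select_ctrl_pcs(toy)
dim(res$sv)  # one row per sample, one column per retained PC
```

Because `prcomp` treats rows as observations, a probes x samples matrix (as inside `ctrlsva`) has to be transposed first, while a table with one row per sample can be passed directly.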

In [None]:
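# Core code of ctrlsva (ENmix), shown for reference only:
# ctrl_nneg is derived from the rgSet, so this cell cannot be run on the ALSPAC data.
percvar=0.95
npc=1
flag=1
pca <- prcomp(t(ctrl_nneg))
eigenvalue = pca$sdev^2
perc = eigenvalue/sum(eigenvalue)
if (flag == 1) {
    # add PCs until they explain at least percvar of the variance
    npc=1
    while(sum(perc[1:npc]) < percvar){
        npc <- npc + 1
    }
    ctrlsva = pca$x[,1:npc]
} else {
    # use the user-supplied number of PCs
    ctrlsva = pca$x[,1:npc]
}
cat(npc," surrogate variables explain ",sum(perc[1:npc])*100,"% of \n data variation\n")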

In [None]:
# Start from the data.txt file containing the control probe intensities, renamed to "control.data.txt"
control <- read.table("control.data.txt",header=T,stringsAsFactors=F)

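percvar=0.95
npc=1
flag=1
# columns 2:43 hold the control probe intensities; rows are samples, so no transpose is needed
pca <- prcomp(control[,2:43])
eigenvalue = pca$sdev^2
perc = eigenvalue/sum(eigenvalue)
if (flag == 1) {
    # add PCs until they explain at least percvar of the variance
    npc=1
    while(sum(perc[1:npc]) < percvar){
        npc <- npc + 1
    }
    ctrlsva = pca$x[,1:npc,drop=FALSE]
} else {
    # use the user-supplied number of PCs
    ctrlsva = pca$x[,1:npc,drop=FALSE]
}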

cat(npc," surrogate variables explain ",sum(perc[1:npc])*100,"% of \n data variation\n")
# 6  surrogate variables explain  95.68676 % of
#  data variation



# Name the columns sv1..svN for however many PCs were kept, instead of hardcoding six
sv <- as.matrix(ctrlsva)
colnames(sv) <- paste0("sv", seq_len(ncol(sv)))
sva <- data.frame(Sample_Name=control$Sample_Name, sv)

write.table(sva,file="alspac_control_probe_sva.txt",quote=F,row.names=F)

In [None]:
aws s3 cp alspac_control_probe_sva.txt s3://rti-cannabis/shared_data/post_qc/alspac/phenotypes/