# General QC process

From S4 study

## Steps
1. Converting report to plink binary
2. Genotyping quality check (Keep GenTrain Score >= 0.7)
3. Sample call rate check (Keep F_MISS <= 0.05)
4. Sex check
5. Keep SNPs with MAF > 0.05, missing call rate < 0.05 and HWE p-value > 1e-6
6. Extreme heterozygosity (Keep within +/- 3 SD)
7. Ancestry filtering (Keep Europeans)
8. Filtering out related individuals (Keep relatedness < 0.125)


### 1. Convert report to plink binary
Create the .map, .lgen, and .fam files, then build the plink binary fileset.

In [None]:
%%bash
awk 'BEGIN{OFS="\t"}NR>1{print $3,$2,"0",$4}' genotyped/SNP_Map.txt > cleaning/RawData.map
awk 'BEGIN{OFS="\t"}NR>10{print $1,$1,$2,$3,$4}' genotyped/P100318_Kuldip_S4_092418_FinalReport.txt > cleaning/RawData.lgen
awk 'BEGIN{OFS="\t"}NR>1{print $2,$2,"0","0",$3,"-9"}' genotyped/Sample_Map.txt | sed -e 's/Female/2/g' -e 's/Male/1/g' > cleaning/RawData.fam
module load plink
plink --noweb --lfile cleaning/RawData --missing-genotype - --make-bed --out cleaning/RawData_Binary

### 2. Genotyping quality check
List the loci with a GenTrain score < 0.7 and exclude them.

In [None]:
%%bash
awk 'BEGIN{OFS="\t"}NR>1 && $5<0.7{print $2,$5}' genotyped/SNP_Map.txt > cleaning/LowGenTrainSnpsToExclude.txt
module load plink
plink --noweb --bfile cleaning/RawData_Binary --exclude cleaning/LowGenTrainSnpsToExclude.txt --make-bed --out cleaning/Gc070_Consented
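
A minimal sketch of the GenTrain filter on toy data. It assumes that `SNP_Map.txt` is tab-separated with the columns Index, Name, Chromosome, Position, GenTrain Score (as the awk column numbers above imply); the SNP names and scores are made up. A SNP with a score of exactly 0.70 is kept.

```shell
# Toy SNP map (hypothetical SNPs and scores)
tmpdir=$(mktemp -d)
printf 'Index\tName\tChromosome\tPosition\tGenTrain Score\n' > "$tmpdir/SNP_Map.txt"
printf '1\trs1\t1\t1000\t0.85\n2\trs2\t1\t2000\t0.55\n3\trs3\tX\t3000\t0.70\n' >> "$tmpdir/SNP_Map.txt"
# Skip the header and list SNP name and score when GenTrain < 0.7
awk 'BEGIN{OFS="\t"} NR>1 && $5<0.7 {print $2, $5}' "$tmpdir/SNP_Map.txt" > "$tmpdir/lowGenTrain.txt"
cat "$tmpdir/lowGenTrain.txt"
```

Only `rs2` is listed for exclusion here.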

### 3. Sample call rate check
Create the .imiss file, list the individuals with a low call rate (F_MISS > 0.05), and exclude them.

In [None]:
%%bash
module load plink
plink --noweb --bfile cleaning/Gc070_Consented --missing --out cleaning/chip_MISSINGNESS
awk '$6>0.05{print}' cleaning/chip_MISSINGNESS.imiss > cleaning/LowCallSamplesToRemove.txt
plink --noweb --bfile cleaning/Gc070_Consented --remove cleaning/LowCallSamplesToRemove.txt --make-bed --out cleaning/CallRate95
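
A minimal sketch of the same filter on a toy `.imiss` file, assuming the standard six-column layout (FID, IID, MISS_PHENO, N_MISS, N_GENO, F_MISS); the sample IDs and values are made up. In the cell above the header row also passes the filter, because awk compares the string `F_MISS` with `0.05` as text, which is why the summary later counts with `tail -n +2`. The sketch skips the header with `NR>1` instead.

```shell
# Toy .imiss file (hypothetical samples)
tmpdir=$(mktemp -d)
cat > "$tmpdir/toy.imiss" <<'EOF'
FID IID MISS_PHENO N_MISS N_GENO F_MISS
S1 S1 N 100 10000 0.01
S2 S2 N 900 10000 0.09
S3 S3 N 500 10000 0.05
EOF
# Skip the header and print FID and IID of samples with F_MISS > 0.05
awk 'NR>1 && $6>0.05 {print $1, $2}' "$tmpdir/toy.imiss" > "$tmpdir/toRemove.txt"
cat "$tmpdir/toRemove.txt"
```

Only `S2` is listed; `S3`, with F_MISS exactly 0.05, is kept.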

### 4. Sex check & 5. Filtering SNPs (MAF > 0.05, missing call rate < 0.05 and HWE p-value > 1e-6)
A .sexcheck file is created and analyzed. Individuals who fail the sex check are removed.    
(NeuroX doesn't have a GWAS backbone, so use an F cut-off of 0.5 instead of the conventional 0.25/0.75.)


In [None]:
%%bash
module load plink
plink --noweb --bfile cleaning/CallRate95 --check-sex --maf 0.1 --geno 0.05 --out cleaning/CallRate95-SEXCHECK
echo '
data <- read.table("cleaning/CallRate95-SEXCHECK.sexcheck",header = T, 
    colClasses=c("character", "character", "numeric", "factor", "character", "numeric"))
data$PEDSEX_STATUS <- paste(data$PEDSEX,data$STATUS,sep = "_")
data$PEDSEX_STATUS = as.factor(data$PEDSEX_STATUS)
summary(data)
library(ggplot2)
plot.temp <- ggplot(data,aes(x = F, colour = PEDSEX_STATUS, group = PEDSEX_STATUS))
plot.sex <- plot.temp + geom_density(fill = NA)
ggsave("cleaning/SexCheck.jpeg", plot = plot.sex, width = 8, height = 3, units = "in")
data$TrueFailMale <- ifelse(data$PEDSEX_STATUS == "1_PROBLEM" & data$F < 0.50,1,0)
data$TrueFailFemale <- ifelse(data$PEDSEX_STATUS == "2_PROBLEM" & data$F > 0.50,1,0)
dat <- subset(data, TrueFailMale == 1 | TrueFailFemale == 1)
write.table(paste(dat$FID, dat$IID, sep = " "),"cleaning/SexCheckFailedSamplesToRemove.txt",quote = F,col.names= F,row.names = F)
' > cleaning/SexCheck.R
module load R
Rscript --vanilla cleaning/SexCheck.R
plink --noweb --bfile cleaning/CallRate95 --remove cleaning/SexCheckFailedSamplesToRemove.txt\
 --geno 0.05 --maf 0.05 --hwe 0.000001 --make-bed --out cleaning/Gc070geno05maf05hwe6_ConsentedCallRate95SexConcordant
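
The removal rule in `SexCheck.R` can be sketched with awk on a toy `.sexcheck` file (columns FID, IID, PEDSEX, SNPSEX, STATUS, F). The samples are made up. A sample is removed only when plink flags it as PROBLEM and its F falls on the wrong side of 0.5 for the recorded sex.

```shell
# Toy .sexcheck file (hypothetical samples)
tmpdir=$(mktemp -d)
cat > "$tmpdir/toy.sexcheck" <<'EOF'
FID IID PEDSEX SNPSEX STATUS F
S1 S1 1 1 OK 0.98
S2 S2 1 0 PROBLEM 0.60
S3 S3 1 2 PROBLEM 0.10
S4 S4 2 1 PROBLEM 0.95
S5 S5 2 0 PROBLEM 0.30
S6 S6 2 2 OK 0.02
EOF
# Males (PEDSEX 1) fail when F < 0.5, females (PEDSEX 2) fail when F > 0.5
awk 'NR>1 && $5=="PROBLEM" && (($3==1 && $6<0.5) || ($3==2 && $6>0.5)) {print $1, $2}' \
  "$tmpdir/toy.sexcheck" > "$tmpdir/sexFail.txt"
cat "$tmpdir/sexFail.txt"
```

`S3` and `S4` are removed. `S2` and `S5` are flagged by plink but stay, because their F is on the expected side of the 0.5 cut-off.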

SexCheck.jpeg![image.png](fig/SexCheck.jpeg)
Legend prefixes: 1_ = Male, 2_ = Female

### 6. Check extreme heterozygosity
+/- 3 SD: not meaningful for this sample size. **Don't use this filter for this study.**

In [None]:
%%bash
module load plink
plink --noweb --bfile cleaning/Gc070geno05maf05hwe6_ConsentedCallRate95SexConcordant\
 --geno 0.01 --maf 0.05 --indep-pairwise 50 5 0.5 --out cleaning/pruningForCheckHet
plink --noweb --bfile cleaning/Gc070geno05maf05hwe6_ConsentedCallRate95SexConcordant\
 --extract cleaning/pruningForCheckHet.prune.in --het --make-bed --out cleaning/CheckHet
echo '
data <- read.table("cleaning/CheckHet.het",header = T, 
    colClasses=c("character", "character", "numeric", "numeric", "numeric", "numeric"))
summary(data)
library(ggplot2)
plot.temp <- ggplot(data,aes(x = F))
plot.het <- plot.temp + geom_density(fill = NA)
ggsave("cleaning/HetCheck.jpeg", plot = plot.het, width = 8, height = 3, units = "in")
LowHet <- mean(data$F) - 3*sd(data$F) # -0.15
HiHet <- mean(data$F) + 3*sd(data$F) # 0.15
cat("mean of F", mean(data$F), "\n")
cat("sd of F", sd(data$F), "\n")
data$HetOutlier <- ifelse(data$F < LowHet | data$F > HiHet,1,0)
dat <- subset(data, HetOutlier == 1 )
cat("N of extreme heterozygosity (+/- 3 SD), ", length(dat$FID), "\n")
write.table(paste(dat$FID, dat$IID, sep = " "),"cleaning/HetOutliersToRemove.txt",quote = F,col.names= F,row.names = F)
' > cleaning/HetCheck.R
module load R
Rscript --vanilla cleaning/HetCheck.R
# plink --noweb --bfile cleaning/Gc070geno05maf05hwe6_ConsentedCallRate95SexConcordant --remove cleaning/HetOutliersToRemove.txt\
#  --make-bed --out cleaning/Gc070geno05maf05hwe6_ConsentedCallRate95SexConcordantHetLess3sd
plink --noweb --bfile cleaning/Gc070geno05maf05hwe6_ConsentedCallRate95SexConcordant\
 --make-bed --out cleaning/Gc070geno05maf05hwe6_ConsentedCallRate95SexConcordantHetLess3sd
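
A minimal sketch of the +/- 3 SD rule from `HetCheck.R`, written as a two-pass awk script over a toy `.het` file (F in column 6). The values are made up: 19 samples with F = 0 and one with F = 0.5. The SD uses the n - 1 denominator, as R's `sd()` does.

```shell
tmpdir=$(mktemp -d)
# Toy .het file (hypothetical values)
awk 'BEGIN { print "FID IID O(HOM) E(HOM) N(NM) F"
             for (i = 1; i <= 19; i++) print "S" i, "S" i, 0, 0, 0, 0
             print "S20 S20 0 0 0 0.5" }' > "$tmpdir/toy.het"
# Pass 1 accumulates sums; pass 2 derives mean and SD, then flags outliers
awk 'NR==FNR { if (FNR>1) { n++; s+=$6; ss+=$6*$6 }
               next }
     FNR==1  { m=s/n; sd=sqrt((ss-n*m*m)/(n-1)); lo=m-3*sd; hi=m+3*sd; next }
     $6<lo || $6>hi { print $1, $2 }' "$tmpdir/toy.het" "$tmpdir/toy.het" > "$tmpdir/hetOutliers.txt"
cat "$tmpdir/hetOutliers.txt"
```

Only `S20` is flagged. With the n - 1 denominator a single sample can never lie more than (n - 1)/sqrt(n) SDs from the mean, so at least 11 samples are needed before anything can exceed 3 SD.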

HetCheck.jpeg![image.png](fig/HetCheck.jpeg)

### 7. Ancestry filtering
Use HapMap 3 as the reference for the ancestry check:
1. Get the list of palindromic SNPs (to exclude)
2. Prune the data
3. Merge with the HapMap 3 binary
4. Run PCA and assign ancestry from PC1 and PC2

In [None]:
%%bash
#1 palindromes
echo '
library(data.table)
data <- fread("cleaning/Gc070geno05maf05hwe6_ConsentedCallRate95SexConcordantHetLess3sd.bim")
data$alleles <- paste(data$V5, data$V6, sep = "_")
dat <- subset(data, alleles == "A_T" | alleles == "T_A" | alleles == "G_C" | alleles == "C_G")
write.table(dat$V2, "cleaning/palindromes.txt", quote = F, row.names = F, col.names = F)
cat("N of excluded SNPs =", nrow(dat), "among", nrow(data), "\n")
' > cleaning/palindromes.R
module load R
Rscript --vanilla cleaning/palindromes.R
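
The same palindrome selection can be sketched in awk on a toy `.bim` file (alleles in columns 5 and 6). The SNPs are made up. A/T and C/G pairs read the same on both strands, so a strand flip cannot be detected for them.

```shell
# Toy .bim file (hypothetical SNPs)
tmpdir=$(mktemp -d)
cat > "$tmpdir/toy.bim" <<'EOF'
1 rs1 0 1000 A G
1 rs2 0 2000 A T
2 rs3 0 3000 C G
2 rs4 0 4000 T C
3 rs5 0 5000 T A
EOF
# Print the IDs of SNPs whose allele pair is A/T, T/A, C/G or G/C
awk '($5=="A" && $6=="T") || ($5=="T" && $6=="A") ||
     ($5=="C" && $6=="G") || ($5=="G" && $6=="C") { print $2 }' \
  "$tmpdir/toy.bim" > "$tmpdir/palindromes.txt"
cat "$tmpdir/palindromes.txt"
```

`rs2`, `rs3` and `rs5` are listed.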

In [None]:
%%bash
module load plink
#2 prune data
plink --noweb --bfile cleaning/Gc070geno05maf05hwe6_ConsentedCallRate95SexConcordantHetLess3sd\
 --exclude cleaning/palindromes.txt --geno 0.01 --maf 0.05 --indep-pairwise 50 5 0.5 --out cleaning/pruning
plink --bfile cleaning/Gc070geno05maf05hwe6_ConsentedCallRate95SexConcordantHetLess3sd --extract cleaning/pruning.prune.in\
 --make-bed --out cleaning/pruned

In [None]:
%%bash
#3 merge with hapmap
# prune Hapmap3
module load plink
plink --file ../tool/hapmap/hapmap3_r3_b36_fwd.consensus.qc.poly --make-bed --out cleaning/hapmap3
plink --bfile cleaning/hapmap3 --extract cleaning/pruning.prune.in --make-bed --out cleaning/hapmap3_pruned
# Attempt the merge to get the list of SNPs whose alleles mismatch between the cohort and hapmap3 (.missnp)
plink --bfile cleaning/pruned --bmerge cleaning/hapmap3_pruned --out cleaning/hapmap3_bin_snplis --make-bed
# Flip the strand of the mismatching SNPs in the cohort
plink --bfile cleaning/pruned --flip cleaning/hapmap3_bin_snplis-merge.missnp --make-bed --out cleaning/pruned_flipped
# Merge the cohort with hapmap3
plink --bfile cleaning/pruned_flipped --bmerge cleaning/hapmap3_pruned --geno 0.05 --out cleaning/hapmap3_merged  --make-bed

In [None]:
%%bash
module load plink
plink --noweb --bfile cleaning/hapmap3_merged --out cleaning/pca --pca header
# awk '{print $1"\t"$2"\t"$6"\t""STUDY"}' cleaning/pruned.fam > cleaning/pruned_pop.txt
awk '{$6=2 ; print $1"\t"$2"\t"$6"\t""STUDY"}' cleaning/pruned.fam  > cleaning/pruned_pop.txt # $6 == 2 (All cases)
cat ../tool/hapmap/hapmap3_popsAnotated.txt  cleaning/pruned_pop.txt > cleaning/hapmap3_merged_pop.txt
echo '
library("ggplot2")
pcs <- read.table("cleaning/pca.eigenvec", header = T)
pops <- read.table("cleaning/hapmap3_merged_pop.txt", header = T, stringsAsFactors = F)
pops$Population = ifelse(pops$Population=="1", "STUDY_CTRL", pops$Population)
pops$Population = ifelse(pops$Population=="2", "STUDY_CASE", pops$Population)
pops$Population = ifelse(pops$Population=="-9", "STUDY_UNKNOWN", pops$Population)
pops$Population <- factor(pops$Population, levels = c("ASW", "CEU", "CHB", "CHD", "GIH", "JPT", "LWK", "MEX", "MKK", "TSI", "YRI",
 "STUDY_CTRL", "STUDY_CASE", "STUDY_UNKNOWN"))
pcs$index <- paste(pcs$FID, pcs$IID, sep = "_")
pops$index <- paste(pops$FID, pops$IID, sep = "_")
data <- merge(pcs, pops, by = "index")
data$FID = data$FID.x
data$IID = data$IID.x
#### now build reference ranges
asia <- subset(data, Continent == "Asia")
africa <- subset(data, Continent == "Africa")
europe <- subset(data, Continent == "Europe")
asia.mean.pc1 <- mean(asia$PC1)
asia.mean.pc2 <- mean(asia$PC2)
asia.sd.pc1 <- sd(asia$PC1)
asia.sd.pc2 <- sd(asia$PC2)
asia.low.pc1 <- asia.mean.pc1 - (6*asia.sd.pc1)
asia.low.pc2 <- asia.mean.pc2 - (6*asia.sd.pc2)
asia.hi.pc1 <- asia.mean.pc1 + (6*asia.sd.pc1)
asia.hi.pc2 <- asia.mean.pc2 + (6*asia.sd.pc2)
africa.mean.pc1 <- mean(africa$PC1)
africa.mean.pc2 <- mean(africa$PC2)
africa.sd.pc1 <- sd(africa$PC1)
africa.sd.pc2 <- sd(africa$PC2)
africa.low.pc1 <- africa.mean.pc1 - (6*africa.sd.pc1)
africa.low.pc2 <- africa.mean.pc2 - (6*africa.sd.pc2)
africa.hi.pc1 <- africa.mean.pc1 + (6*africa.sd.pc1)
africa.hi.pc2 <- africa.mean.pc2 + (6*africa.sd.pc2)
europe.mean.pc1 <- mean(europe$PC1)
europe.mean.pc2 <- mean(europe$PC2)
europe.sd.pc1 <- sd(europe$PC1)
europe.sd.pc2 <- sd(europe$PC2)
europe.low.pc1 <- europe.mean.pc1 - (6*europe.sd.pc1)
europe.low.pc2 <- europe.mean.pc2 - (6*europe.sd.pc2)
europe.hi.pc1 <- europe.mean.pc1 + (6*europe.sd.pc1)
europe.hi.pc2 <- europe.mean.pc2 + (6*europe.sd.pc2)
data$Ancestry <- "Admixed"
data$Ancestry[data$PC1 >= europe.low.pc1 & data$PC2 >= europe.low.pc2 & data$PC1 <= europe.hi.pc1 & data$PC2 <= europe.hi.pc2] <- "European"
data$Ancestry[data$PC1 >= africa.low.pc1 & data$PC2 >= africa.low.pc2 & data$PC1 <= africa.hi.pc1 & data$PC2 <= africa.hi.pc2] <- "African"
data$Ancestry[data$PC1 >= asia.low.pc1 & data$PC2 >= asia.low.pc2 & data$PC1 <= asia.hi.pc1 & data$PC2 <= asia.hi.pc2] <- "Asian"
### export your data by ancestry
cohort <- subset(data, Continent == "STUDY")
cohort.europe <- subset(cohort, Ancestry == "European")
cohort.african <- subset(cohort, Ancestry == "African")
cohort.asian <- subset(cohort, Ancestry == "Asian")
cohort.admixed <- subset(cohort, Ancestry == "Admixed")
cohort.europe.ids <- cohort.europe[,c("FID","IID")]
cohort.african.ids <- cohort.african[,c("FID","IID")]
cohort.asian.ids <- cohort.asian[,c("FID","IID")]
cohort.admixed.ids <- cohort.admixed[,c("FID","IID")]
write.table(cohort.europe.ids, "cleaning/cohort.europe.txt", quote = F, sep = "\t", row.names = F)
write.table(cohort.african.ids, "cleaning/cohort.african.txt", quote = F, sep = "\t", row.names = F)
write.table(cohort.asian.ids, "cleaning/cohort.asian.txt", quote = F, sep = "\t", row.names = F)
write.table(cohort.admixed.ids, "cleaning/cohort.admixed.txt", quote = F, sep = "\t", row.names = F)
#### report
plotTemp1 <- ggplot(data, aes(PC1, PC2, color = Population, shape = Continent)) + geom_point() +
    geom_point(data = cohort) + theme_bw()
plotTemp2 <- plotTemp1 + 
 geom_rect(aes(xmin = europe.low.pc1, xmax = europe.hi.pc1, ymin = europe.low.pc2, ymax = europe.hi.pc2), fill = NA, color = "grey", linetype = 2) +
 geom_rect(aes(xmin = africa.low.pc1, xmax = africa.hi.pc1, ymin = africa.low.pc2, ymax = africa.hi.pc2), fill = NA, color = "grey", linetype = 2) + 
 geom_rect(aes(xmin = asia.low.pc1, xmax = asia.hi.pc1, ymin = asia.low.pc2, ymax = asia.hi.pc2), fill = NA, color = "grey", linetype = 2)
ggsave(filename = "cleaning/pcaPlot.jpeg", plot = plotTemp1)
ggsave(filename = "cleaning/pcaPlotPlusBoxes.jpeg", plot = plotTemp2)' > cleaning/hapmap3_merged_PCA.R
module load R
Rscript --vanilla cleaning/hapmap3_merged_PCA.R
plink --bfile cleaning/Gc070geno05maf05hwe6_ConsentedCallRate95SexConcordantHetLess3sd\
 --keep cleaning/cohort.europe.txt --make-bed --out cleaning/Gc070geno05maf05hwe6_ConsentedCallRate95SexConcordantHetLess3sdEUR
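
A minimal sketch of the reference-range rule used in `hapmap3_merged_PCA.R`, reduced to the European box only. The input layout (ID, Continent, PC1, PC2) and all values are made up for illustration. A study sample is called European when both PC1 and PC2 lie within mean +/- 6 SD of the European reference samples.

```shell
# Toy PCA table (hypothetical values)
tmpdir=$(mktemp -d)
cat > "$tmpdir/toy_pcs.txt" <<'EOF'
ID Continent PC1 PC2
R1 Europe 0.010 0.020
R2 Europe 0.020 0.010
R3 Europe 0.015 0.015
R4 Europe 0.005 0.025
P1 STUDY 0.012 0.018
P2 STUDY 0.100 -0.050
EOF
# Pass 1: sums over the European reference. Pass 2: build the box and classify study samples.
awk 'NR==FNR { if ($2=="Europe") { n++; s1+=$3; q1+=$3*$3; s2+=$4; q2+=$4*$4 }
               next }
     FNR==1  { m1=s1/n; sd1=sqrt((q1-n*m1*m1)/(n-1))
               m2=s2/n; sd2=sqrt((q2-n*m2*m2)/(n-1)) }
     $2=="STUDY" { ok = ($3>=m1-6*sd1 && $3<=m1+6*sd1 && $4>=m2-6*sd2 && $4<=m2+6*sd2)
                   print $1, (ok ? "European" : "Outside") }' \
  "$tmpdir/toy_pcs.txt" "$tmpdir/toy_pcs.txt" > "$tmpdir/ancestry.txt"
cat "$tmpdir/ancestry.txt"
```

`P1` falls inside the box and `P2` outside. The R script applies the same test for the African and Asian boxes and labels everything outside all three as "Admixed".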

pcaPlotPlusBoxes.jpeg![image.png](fig/pcaPlotPlusBoxes.jpeg)

### 8. Filter out related individuals
Remove one individual from each pair with estimated relatedness (GRM) above the cutoff of 0.125.

In [None]:
%%bash
mkdir -p cleaned
module load plink GCTA
plink --bfile cleaning/Gc070geno05maf05hwe6_ConsentedCallRate95SexConcordantHetLess3sdEUR\
 --geno 0.01 --maf 0.05 --hwe 0.0001 --indep-pairwise 50 5 0.5 --out cleaning/pruningAgain
plink --bfile cleaning/Gc070geno05maf05hwe6_ConsentedCallRate95SexConcordantHetLess3sdEUR\
 --extract cleaning/pruningAgain.prune.in --make-bed --out cleaning/forRelatedCheck
gcta64 --bfile cleaning/forRelatedCheck --make-grm --out cleaning/GRM_matrix --autosome --maf 0.05
gcta64 --grm-cutoff 0.125 --grm cleaning/GRM_matrix --out cleaning/GRM_matrix_0125 --make-grm
plink --bfile cleaning/Gc070geno05maf05hwe6_ConsentedCallRate95SexConcordantHetLess3sdEUR\
 --keep cleaning/GRM_matrix_0125.grm.id --make-bed --out cleaned/S4_QCed
## Code only using plink 
# plink --bfile cleaning/Gc070geno05maf05hwe6_ConsentedCallRate95SexConcordantHetLess3sdEUR --rel-cutoff 0.125\
#  --out cleaning/checkingRelatives0125
# plink --noweb --bfile cleaning/Gc070geno05maf05hwe6_ConsentedCallRate95SexConcordantHetLess3sdEUR\
#  --keep cleaning/checkingRelatives0125.rel.id --make-bed --out cleaned/S4_QCed

## Summary of general QC process
Please see the following output.

In [1]:
%%bash
echo "1. Input file
Number of people in the cohort"
cat cleaning/RawData_Binary.fam | wc -l
echo 'Number of variants'
tail -n +2 genotyped/SNP_Map.txt | wc -l
echo '
2. GenTrain > 0.7
Number of variants excluded by this step'
cat cleaning/LowGenTrainSnpsToExclude.txt | wc -l
echo "
3. Sample call rate check
Number of individuals excluded by this step"
tail -n +2 cleaning/LowCallSamplesToRemove.txt | wc -l
echo "
4. Sex check
Number of individuals excluded by this step"
cat -n cleaning/SexCheckFailedSamplesToRemove.txt | wc -l
echo '
5. SNP filtering
(SNP with maf>0.05, missing call rate <0.05 and HWD > 1e-6)
Number of variants LEFT'
wc -l cleaning/Gc070geno05maf05hwe6_ConsentedCallRate95SexConcordant.bim | cut -d' ' -f1
echo '
6. Extreme heterozygosity
Number of individuals excluded by this step'
# cat cleaning/HetOutliersToRemove.txt | wc -l

echo '
7. Ancestry check
(Only keep europeans in the next step)'
echo 'Number of Europeans'
tail -n +2 cleaning/cohort.europe.txt | wc -l
echo 'Number of Africans'
tail -n +2 cleaning/cohort.african.txt | wc -l
echo 'Number of Asians'
tail -n +2 cleaning/cohort.asian.txt | wc -l
echo 'Number of Admixed'
tail -n +2 cleaning/cohort.admixed.txt | wc -l
echo '
8. Relatedness check
Number of samples IN the final data'
cat cleaning/GRM_matrix_0125.grm.id | wc -l

1. Input file
Number of people in the cohort
81
Number of variants
487374

2. GenTrain > 0.7
Number of variants excluded by this step
190674

3. Sample call rate check
Number of individuals excluded by this step
0

4. Sex check
Number of individuals excluded by this step
1

5. SNP filtering
(SNP with maf>0.05, missing call rate <0.05 and HWD > 1e-6)
Number of variants LEFT
234375

6. Extreme heterozygosity
Number of individuals excluded by this step

7. Ancestry check
(Only keep europeans in the next step)
Number of Europeans
76
Number of Africans
1
Number of Asians
0
Number of Admixed
3

8. Relatedness check
Number of samples IN the final data
76


### Supplement. PCs for QCed samples

In [None]:
%%bash
module load plink
plink --noweb --bfile cleaned/S4_QCed --geno 0.01 --maf 0.05 --hwe 0.000001 --indep 50 5 2 --out cleaning/QCed_pruning
plink --noweb --bfile cleaned/S4_QCed --extract cleaning/QCed_pruning.prune.in --make-bed --out cleaning/QCed_pruned
plink --bfile cleaning/QCed_pruned --pca header --out cleaned/S4_QCed

# Genetic risk score
Calculated from the latest GWAS
https://doi.org/10.1101/388165

In [None]:
%%bash
rm -f grs/_toSearch.txt
awk '{print $48, $1,toupper($26),$29}' grs/Meta5.tab |  sed 's/ /\t/g' | sed 's/chr//g' > grs/_Meta5reduced.txt
awk '{print $1,$3,$4}' grs/_Meta5reduced.txt  > grs/_Meta5score.txt
cut -f1 grs/_Meta5reduced.txt | tail -n +2 >> grs/_toSearch.txt
cut -f2 grs/_Meta5reduced.txt | tail -n +2 >> grs/_toSearch.txt
sed -i 's/$/ /' grs/_toSearch.txt
grep -f grs/_toSearch.txt interest/forSearch.txt > grs/_Meta5inChip.txt

In [13]:
%%bash
module load plink
plink --bfile cleaning/CallRate95 --score grs/_Meta5reduced.txt 1 3 4 header  --out grs/grs

PLINK v1.90b4.4 64-bit (21 May 2017)           www.cog-genomics.org/plink/1.9/
(C) 2005-2017 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to grs/grs.log.
Options in effect:
  --bfile cleaning/CallRate95
  --out grs/grs
  --score grs/_Meta5reduced.txt 1 3 4 header

257653 MB RAM detected; reserving 128826 MB for main workspace.
296700 variants loaded from .bim file.
81 people (51 males, 30 females) loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 81 founders and 0 nonfounders present.
Calculating allele frequencies... [progress output truncated]

[+] Loading plink  1.9.0-beta4.4  on cn3267 
treat these as missing.
treat these as missing.
to allele code mismatch); see grs/grs.nopred for details.
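
A minimal sketch of the arithmetic behind an allele-weighted risk score, on made-up data for one person. It is not plink's implementation: missing genotypes are simply skipped here, which corresponds to the `no-mean-imputation` modifier of `--score`, whereas plink 1.9 by default imputes them from allele frequencies. The variant IDs, betas and allele counts are hypothetical.

```shell
tmpdir=$(mktemp -d)
# Score file: variant ID, effect allele, beta (hypothetical)
cat > "$tmpdir/score.txt" <<'EOF'
SNP A1 BETA
rs1 A 0.20
rs2 G -0.10
rs3 T 0.30
EOF
# One person's effect-allele counts (0, 1 or 2; NA = missing genotype)
cat > "$tmpdir/dosage.txt" <<'EOF'
rs1 2
rs2 1
rs3 NA
EOF
# Sum beta * allele count over non-missing variants; average over non-missing alleles
grs=$(awk 'NR==FNR { if (FNR>1) beta[$1]=$3
                     next }
           ($1 in beta) && $2!="NA" { sum += beta[$1]*$2; cnt += 2 }
           END { printf "SUM=%.2f CNT=%d AVG=%.3f", sum, cnt, sum/cnt }' \
      "$tmpdir/score.txt" "$tmpdir/dosage.txt")
echo "$grs"
```

This prints `SUM=0.30 CNT=4 AVG=0.075`: two copies of the `rs1` allele and one of the `rs2` allele give 0.40 - 0.10, averaged over the four non-missing alleles.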
