# Imputed Data Processing

This notebook records the steps for preparing the imputation input and for post-imputation processing.

## 1. Imputation input preparation

### 1.1 Divide the 168,206 individuals into 7 batches of at most 25,000

The TOPMed imputation server accepts at most 25,000 IDs per file, so the 168,206 individuals are split into 7 batches (six of 25,000 and a final one of 18,206).
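The batching arithmetic can be checked with a minimal Python sketch. The IDs below are hypothetical placeholders, not the real sample IDs, and `split_into_batches` is an illustrative helper rather than part of the pipeline.

```python
import math

def split_into_batches(ids, batch_size=25_000):
    """Split ids into consecutive batches of at most batch_size, preserving order."""
    n_batches = math.ceil(len(ids) / batch_size)
    return [ids[k * batch_size:(k + 1) * batch_size] for k in range(n_batches)]

# Hypothetical IDs standing in for the 168,206 sample IDs
sample_ids = [f"ID{n}" for n in range(168_206)]
batches = split_into_batches(sample_ids)
batch_sizes = [len(b) for b in batches]
print(len(batches), batch_sizes)
```

Only the last batch is smaller than the limit, which matches the truncation `idx[c(1:168206)]` used in the R cell.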

In [1]:
library(dplyr)
library(data.table)

setwd("~/project/imputation-rvtest/analysis/imputation_aggregated_analysis")

idx <- lapply(c(1:7), function(x) rep(x, 25000)) %>% unlist()
idx <- idx[c(1:168206)]

iid_lst <- fread("168206ind.sample.txt") %>% mutate(idx = idx) %>% group_by(idx) %>% group_split() %>% lapply(function(x) x$V1)
for(i in c(1:7)){
    write.table(iid_lst[[i]], sprintf("./imputation_input/168206ind_sample_batch%d.txt", i), col.names = FALSE, row.names = FALSE, quote = FALSE)
}




### 1.2 Create per chromosome .vcf.gz file 

In [2]:
for i in list((1,2,11)):
    for j in range(1,8):
        script='''#!/bin/sh
#$ -l h_rt=24:00:00
#$ -l h_vmem=10G
#$ -N make_impute_input_chr%i_batch%i
#$ -o ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/imputation_input/scripts/make_impute_input_chr%i_batch%i_$JOB_ID.out
#$ -e ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/imputation_input/scripts/make_impute_input_chr%i_batch%i_$JOB_ID.err
#$ -j y
#$ -q csg.q
#$ -S /bin/bash
export PATH=$HOME/miniconda3/bin:$PATH
module load Plink/1.9.10

plink \
    --bfile /mnt/mfs/statgen/UKBiobank/QCed_Plink_autosomal_files_hg38/QCed_White_EU_460649ind_10212022_hg38_sorted \
    --chr %i \
    --keep-fam ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/imputation_input/168206ind_sample_batch%i.txt \
    --make-bed \
    --output-chr chrM \
    --recode vcf bgz \
    --out ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/imputation_input/imputation_input_hg38_sorted_unrelated_white_eur_extracted_168206ind_chr%i_batch%i

'''%(i,j,i,j,i,j,i,j,i,j)
        f=open("/home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/imputation_input/scripts/make_impute_input_chr"+str(i)+"_batch"+str(j)+".sh", 'w')
        f.write(script)
        f.close()
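The positional tuple passed to `%` in the cell above has to match the order of the `%i` placeholders exactly. As a hedged alternative, the sketch below uses named placeholders with a mapping, so each value is written once. The template is a shortened, hypothetical stand-in for the real job script, and `render_job` is an illustrative name.

```python
# Shortened, hypothetical job template; "$" and "{}" need no escaping with %-formatting
JOB_TEMPLATE = """#!/bin/sh
#$ -N make_impute_input_chr%(chrom)i_batch%(batch)i
#$ -o scripts/make_impute_input_chr%(chrom)i_batch%(batch)i_$JOB_ID.out
plink --chr %(chrom)i --keep-fam batch%(batch)i.txt --out chr%(chrom)i_batch%(batch)i
"""

def render_job(chrom, batch):
    """Fill the template from a mapping so argument order cannot drift."""
    return JOB_TEMPLATE % {"chrom": chrom, "batch": batch}

# One script per chromosome/batch combination, as in the loop above
scripts = {(c, b): render_job(c, b) for c in (1, 2, 11) for b in range(1, 8)}
print(len(scripts))
print(scripts[(11, 7)].splitlines()[1])
```

A literal percent sign would still need to be written as `%%`, as in the later cells that embed `bcftools query` format strings.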

In [3]:
for i in 1 2 11; do
    for j in {1..7}; do
        qsub ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/imputation_input/scripts/make_impute_input_chr${i}_batch${j}.sh
    done;
done

Your job 8646158 ("make_impute_input_chr1_batch1") has been submitted
Your job 8646159 ("make_impute_input_chr1_batch2") has been submitted
Your job 8646160 ("make_impute_input_chr1_batch3") has been submitted
Your job 8646161 ("make_impute_input_chr1_batch4") has been submitted
Your job 8646162 ("make_impute_input_chr1_batch5") has been submitted
Your job 8646163 ("make_impute_input_chr1_batch6") has been submitted
Your job 8646164 ("make_impute_input_chr1_batch7") has been submitted
Your job 8646165 ("make_impute_input_chr2_batch1") has been submitted
Your job 8646166 ("make_impute_input_chr2_batch2") has been submitted
Your job 8646167 ("make_impute_input_chr2_batch3") has been submitted
Your job 8646168 ("make_impute_input_chr2_batch4") has been submitted
Your job 8646169 ("make_impute_input_chr2_batch5") has been submitted
Your job 8646170 ("make_impute_input_chr2_batch6") has been submitted
Your job 8646171 ("make_impute_input_chr2_batch7") has been submitted
Your job 8646172 ("m

## 2. Upload files to the imputation servers and download the results

[TOPMed Imputation Server - TOPMed](https://imputation.biodatacatalyst.nhlbi.nih.gov/#!)

[Michigan Imputation Server - HRC](https://imputationserver.sph.umich.edu/index.html#!)

## 3. Imputation file post-processing

### 3.1 TOPMed r3

Because of the sample size limit of the imputation servers, the imputed batches must be merged into a single file before the overall $R^2$ of each variant can be calculated.
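To illustrate why per-batch values cannot simply be averaged, here is a minimal sketch. It assumes a Minimac-style estimator, namely the empirical variance of the haploid dosages divided by $p(1-p)$; that is an assumption about the definition, so check the server documentation for the exact formula. The dosages are made up.

```python
def estimated_r2(hap_dosages):
    """Assumed estimator: variance of haploid dosages divided by p * (1 - p)."""
    n = len(hap_dosages)
    p = sum(hap_dosages) / n
    if p <= 0.0 or p >= 1.0:
        return float("nan")  # undefined for a monomorphic site
    variance = sum((d - p) ** 2 for d in hap_dosages) / n
    return variance / (p * (1.0 - p))

# Made-up haploid dosages for one variant in two batches
batch_a = [0.9, 0.1, 0.0, 0.0]
batch_b = [0.0, 0.0, 0.0, 0.2]

r2_a = estimated_r2(batch_a)
r2_b = estimated_r2(batch_b)
r2_pooled = estimated_r2(batch_a + batch_b)
print(r2_a, r2_b, r2_pooled)
```

Under this assumed estimator the pooled value depends on the allele frequency across all samples, so it differs from the mean of the per-batch values.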

The TOPMed team has provided [hds-util](https://github.com/statgen/hds-util), a post-processing tool for Minimac4 and Michigan Imputation Server (MIS). It can generate FORMAT fields from HDS, convert from the SAV file format to BCF or VCF, and paste together sample groups that were split due to MIS sample size limit.

**Problem:**

The TOPMed-r3 panel has changed the way it handles allele flipping, which results in mismatched variants that cannot be pasted together directly with the provided `hds-util`. The problem occurs because, when the imputation input is split into batches of 25,000 IDs, `Plink 1.9` codes the reference and alternative alleles as major and minor based on their frequencies in that batch. As a result, the same variant can have a different major and minor allele in different batches.
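The effect can be illustrated with a toy example. The allele counts and the position are invented, and `major_minor` is a hypothetical helper. The point is only that frequency-based coding can disagree between batches, not how any particular tool orders the alleles.

```python
from collections import Counter

def major_minor(alleles):
    """Return (major, minor) for a biallelic site from the observed alleles."""
    counts = Counter(alleles).most_common()
    return counts[0][0], counts[1][0]

# Invented allele observations for the same site in two batches
batch1_alleles = ["A"] * 6 + ["G"] * 4
batch2_alleles = ["A"] * 4 + ["G"] * 6

coding1 = major_minor(batch1_alleles)
coding2 = major_minor(batch2_alleles)

# Allele-based IDs for a hypothetical position chr1:12345
id1 = "chr1:12345:%s:%s" % coding1
id2 = "chr1:12345:%s:%s" % coding2
print(id1, id2, id1 == id2)
```

Because the two IDs differ, an ID-based comparison across batches treats the site as two different variants.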

**Provided solution:**

The TOPMed imputation team has suggested two solutions: (1) reprocess the imputation input and re-run the imputation; or (2) remove the mismatched variants and paste the batches together.


**Our solution:**

We prefer the second solution. Before removing the mismatched variants, however, we need to check how many variants would be removed and whether their removal would substantially affect the results of our rare-variant aggregate analysis.

#### 3.11 Recode downloaded vcf files

In [4]:
for i in list((1,2,11)):
    for j in list((1,2,3,4,5,6,7)):
        script='''#!/bin/sh
#$ -l h_rt=24:00:00
#$ -l h_vmem=10G
#$ -N recode_dosage_chr%i_batch%i
#$ -o ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed_v3/scripts/recode_dosage_chr%i_batch%i_$JOB_ID.out
#$ -e ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed_v3/scripts/recode_dosage_chr%i_batch%i_$JOB_ID.err
#$ -cwd
#$ -S /bin/bash
#$ -q csg.q

module load Plink/2.00a
cd ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed_v3/topmed_batch%i

chr=%i

plink2 \
    --vcf chr${chr}.dose.vcf.gz dosage=HDS \
    --make-bpgen \
    --sort-vars \
    --set-all-var-ids chr@:#:\$r:\$a \
    --new-id-max-allele-len 200 \
    --keep-allele-order \
    --recode vcf bgz vcf-dosage=HDS-force \
    --out chr${chr}.dose.recoded

'''%(i,j,i,j,i,j,j,i)
        f=open("/home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed_v3/scripts/recode_dosage_chr"+str(i)+"_batch"+str(j)+".sh", 'w')
        f.write(script)
        f.close()

#### 3.12 Check mismatched alleles

In [5]:
library(dplyr)
library(data.table)

setwd("/mnt/vast/hpc/csg/tl3031/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed_v3")

for(chr in c(1,2,11)){
    batch1 <- fread(sprintf("./topmed_batch1/chr%i.dose.recoded.bim", chr))
    batch2 <- fread(sprintf("./topmed_batch2/chr%i.dose.recoded.bim", chr))
    batch3 <- fread(sprintf("./topmed_batch3/chr%i.dose.recoded.bim", chr))
    batch4 <- fread(sprintf("./topmed_batch4/chr%i.dose.recoded.bim", chr))
    batch5 <- fread(sprintf("./topmed_batch5/chr%i.dose.recoded.bim", chr))
    batch6 <- fread(sprintf("./topmed_batch6/chr%i.dose.recoded.bim", chr))
    batch7 <- fread(sprintf("./topmed_batch7/chr%i.dose.recoded.bim", chr))
    
    all_lists <- list(batch1$V2, batch2$V2, batch3$V2, batch4$V2, batch5$V2, batch6$V2, batch7$V2) 
    common_elements <- Reduce(intersect, all_lists)
    
    batch1 %>% filter(!V2 %in% common_elements) %>% fwrite(sprintf("./topmed_batch1/chr%i.dose.mismatch.bim", chr), col.names = FALSE, sep = "\t")
    batch2 %>% filter(!V2 %in% common_elements) %>% fwrite(sprintf("./topmed_batch2/chr%i.dose.mismatch.bim", chr), col.names = FALSE, sep = "\t")
    batch3 %>% filter(!V2 %in% common_elements) %>% fwrite(sprintf("./topmed_batch3/chr%i.dose.mismatch.bim", chr), col.names = FALSE, sep = "\t")
    batch4 %>% filter(!V2 %in% common_elements) %>% fwrite(sprintf("./topmed_batch4/chr%i.dose.mismatch.bim", chr), col.names = FALSE, sep = "\t")
    batch5 %>% filter(!V2 %in% common_elements) %>% fwrite(sprintf("./topmed_batch5/chr%i.dose.mismatch.bim", chr), col.names = FALSE, sep = "\t")
    batch6 %>% filter(!V2 %in% common_elements) %>% fwrite(sprintf("./topmed_batch6/chr%i.dose.mismatch.bim", chr), col.names = FALSE, sep = "\t")
    batch7 %>% filter(!V2 %in% common_elements) %>% fwrite(sprintf("./topmed_batch7/chr%i.dose.mismatch.bim", chr), col.names = FALSE, sep = "\t")
}

for(chr in c(1,2,11)){
    mis_batch1 <- fread(sprintf("./topmed_batch1/chr%i.dose.mismatch.bim", chr))
    mis_batch2 <- fread(sprintf("./topmed_batch2/chr%i.dose.mismatch.bim", chr))
    mis_batch3 <- fread(sprintf("./topmed_batch3/chr%i.dose.mismatch.bim", chr))
    mis_batch4 <- fread(sprintf("./topmed_batch4/chr%i.dose.mismatch.bim", chr))
    mis_batch5 <- fread(sprintf("./topmed_batch5/chr%i.dose.mismatch.bim", chr))
    mis_batch6 <- fread(sprintf("./topmed_batch6/chr%i.dose.mismatch.bim", chr))
    mis_batch7 <- fread(sprintf("./topmed_batch7/chr%i.dose.mismatch.bim", chr))
    
    all_mis <- rbind(mis_batch1, mis_batch2, mis_batch3, mis_batch4, mis_batch5, mis_batch6, mis_batch7)
    unique(all_mis) %>% fwrite(sprintf("./chr%i.mismatch.all.bim", chr), col.names = FALSE, sep = "\t")
}
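The logic of the two loops above can be sketched in Python. The variant IDs are invented for illustration. A variant counts as mismatched in a batch when its ID is not shared by every batch, and the per-batch lists are then pooled and de-duplicated.

```python
from functools import reduce

# Invented variant IDs for three batches (the notebook uses seven)
batch_ids = {
    1: ["chr1:100:A:G", "chr1:200:C:T", "chr1:300:G:A"],
    2: ["chr1:100:A:G", "chr1:200:T:C", "chr1:300:G:A"],
    3: ["chr1:100:A:G", "chr1:200:C:T", "chr1:300:G:A", "chr1:400:T:C"],
}

# IDs present in every batch (the counterpart of Reduce(intersect, all_lists))
common = reduce(set.intersection, (set(ids) for ids in batch_ids.values()))

# Per-batch mismatches: IDs in that batch that are not shared by all batches
mismatch = {b: [v for v in ids if v not in common] for b, ids in batch_ids.items()}

# Pooled, de-duplicated mismatches (the counterpart of unique(rbind(...)))
all_mismatch = sorted(set().union(*mismatch.values()))
print(sorted(common))
print(all_mismatch)
```

Note that a variant flipped in only one batch appears under two different IDs in the pooled list, one per allele order.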

#### 3.13 Annotate mismatches

In [6]:
for i in list((1,2,11)):
        script='''#!/bin/sh
#$ -l h_rt=24:00:00
#$ -l h_vmem=30G
#$ -N annotate_topmed_v3_mismatch_chr%i
#$ -o /mnt/vast/hpc/csg/tl3031/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed_v3/scripts/annotate_mismatch_chr%i_$JOB_ID.out
#$ -e /mnt/vast/hpc/csg/tl3031/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed_v3/scripts/annotate_mismatch_chr%i_$JOB_ID.err
#$ -j y
#$ -q csg.q

source ~/mamba_activate.sh
module load Singularity

sos run /mnt/vast/hpc/csg/tl3031/imputation-rvtest/analysis/imputation_aggregated_analysis/notebooks/annovar.ipynb annovar \
    --build 'hg38' \
    --cwd /mnt/vast/hpc/csg/tl3031/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed_v3 \
    --bim_name /mnt/vast/hpc/csg/tl3031/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed_v3/chr%i.mismatch.all.bim \
    --humandb /mnt/vast/hpc/csg/isabelle/REF/humandb  \
    --job_size 1 \
    --name_prefix topmed_v3_chr%i_mismatch \
    --container_annovar /mnt/mfs/statgen/containers/gatk4-annovar.sif

'''%(i,i,i,i,i)
        f=open("/mnt/vast/hpc/csg/tl3031/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed_v3/scripts/annotate_mismatch_topmed_chr"+str(i)+".sh", 'w')
        f.write(script)
        f.close()

In [1]:
setwd('/mnt/vast/hpc/csg/tl3031/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed_v3')

library(data.table)
library(dplyr)



In [2]:
## Check mismatched annotations

chr1_mis_annot <- fread("./chr1.mismatch.all.hg38.hg38_multianno.csv")
chr2_mis_annot <- fread("./chr2.mismatch.all.hg38.hg38_multianno.csv")
chr11_mis_annot <- fread("./chr11.mismatch.all.hg38.hg38_multianno.csv")

genome_name <- c("AF_genome", "AF_raw_genome", "AF_male_genome", "AF_female_genome", "AF_afr_genome", 
                 "AF_ami_genome", "AF_amr_genome", "AF_asj_genome", "AF_eas_genome", "AF_fin_genome", 
                 "AF_nfe_genome", "AF_oth_genome", "AF_sas_genome")
exome_name <- c("AF_exome", "AF_popmax_exome", "AF_male_exome", "AF_female_exome", "AF_raw_exome",
                "AF_afr_exome", "AF_sas_exome", "AF_amr_exome", "AF_eas_exome", "AF_nfe_exome",
                "AF_fin_exome", "AF_asj_exome", "AF_oth_exome")

colnames(chr1_mis_annot)[29:41] <- colnames(chr2_mis_annot)[29:41] <- colnames(chr11_mis_annot)[29:41] <- genome_name
colnames(chr1_mis_annot)[42:54] <- colnames(chr2_mis_annot)[42:54] <- colnames(chr11_mis_annot)[42:54] <- exome_name

dim(chr1_mis_annot)
chr1_mis_annot %>% filter(Func.refGene %in% c("exonic", "splicing", "exonic;splicing"))

dim(chr2_mis_annot)
chr2_mis_annot %>% filter(Func.refGene %in% c("exonic", "splicing", "exonic;splicing"))

dim(chr11_mis_annot)
chr11_mis_annot %>% filter(Func.refGene %in% c("exonic", "splicing", "exonic;splicing"))

Chr,Start,End,Ref,Alt,Func.refGene,Gene.refGene,GeneDetail.refGene,ExonicFunc.refGene,AAChange.refGene,⋯,CLNDISDB,CLNREVSTAT,CLNSIG,DN ID,Patient ID,Phenotype,Platform,Study,Pubmed ID,Otherinfo1
<int>,<int>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,23520972,23520972,A,C,exonic,E2F2,.,synonymous SNV,E2F2:NM_004091:exon4:c.G678G:p.Q226Q,⋯,.,.,.,.,.,.,.,.,.,chr1:23520972:A:C
1,41513154,41513154,G,C,exonic,HIVEP3,.,synonymous SNV,"HIVEP3:NM_001127714:exon7:c.G6067G:p.A2023A,HIVEP3:NM_024503:exon8:c.G6067G:p.A2023A",⋯,.,.,.,.,.,.,.,.,.,chr1:41513154:G:C
1,64647885,64647885,C,T,exonic,CACHD1,.,synonymous SNV,"CACHD1:NM_001293274:exon8:c.T353T:p.M118M,CACHD1:NM_020925:exon9:c.T1241T:p.M414M",⋯,.,.,.,.,.,.,.,.,.,chr1:64647885:C:T
1,66822362,66822362,T,C,exonic;splicing,DNAI4;DNAI4,.,synonymous SNV,DNAI4:NM_024763:exon16:c.G2495G:p.R832R,⋯,.,.,.,.,.,.,.,.,.,chr1:66822362:T:C
1,99851033,99851033,G,A,splicing,AGL,NM_000644:exon2:UTR5,.,.,⋯,.,.,.,.,.,.,.,.,.,chr1:99851033:G:A
1,175077653,175077653,G,A,exonic,TNN,.,synonymous SNV,TNN:NM_022093:exon2:c.A235A:p.R79R,⋯,.,.,.,.,.,.,.,.,.,chr1:175077653:G:A
1,202335740,202335740,C,T,exonic,UBE2T,.,synonymous SNV,UBE2T:NM_014176:exon2:c.A15A:p.S5S,⋯,.,.,.,.,.,.,.,.,.,chr1:202335740:C:T
1,207090484,207090484,G,A,splicing,C4BPB,NM_001017366:exon3:c.229+3G>A;NM_001017365:exon3:c.232+3G>A;NM_000716:exon2:c.232+3G>A;NM_001017364:exon2:c.229+3G>A;NM_001017367:exon3:c.232+3G>A,.,.,⋯,.,.,.,.,.,.,.,.,.,chr1:207090484:G:A
1,197101771,197101771,G,A,exonic,ASPM,.,synonymous SNV,ASPM:NM_018136:exon18:c.T7480T:p.Y2494Y,⋯,.,.,.,.,.,.,.,.,.,chr1:197101771:G:A
1,201386394,201386394,C,T,exonic,LAD1,.,synonymous SNV,LAD1:NM_005558:exon3:c.A967A:p.K323K,⋯,.,.,.,.,.,.,.,.,.,chr1:201386394:C:T


Chr,Start,End,Ref,Alt,Func.refGene,Gene.refGene,GeneDetail.refGene,ExonicFunc.refGene,AAChange.refGene,⋯,CLNDISDB,CLNREVSTAT,CLNSIG,DN ID,Patient ID,Phenotype,Platform,Study,Pubmed ID,Otherinfo1
<int>,<int>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
2,37951263,37951263,G,A,exonic,RMDN2,.,synonymous SNV,RMDN2:NM_144713:exon2:c.A48A:p.R16R,⋯,.,.,.,.,.,.,.,.,.,chr2:37951263:G:A
2,86206117,86206117,T,C,exonic;splicing,MRPL35;MRPL35,.,synonymous SNV,"MRPL35:NM_001363782:exon2:c.C55C:p.P19P,MRPL35:NM_016622:exon2:c.C55C:p.P19P,MRPL35:NM_145644:exon2:c.C55C:p.P19P",⋯,.,.,.,.,.,.,.,.,.,chr2:86206117:T:C
2,219172671,219172671,G,A,exonic,CNPPD1,.,synonymous SNV,"CNPPD1:NM_015680:exon8:c.T1148T:p.L383L,CNPPD1:NM_001321389:exon9:c.T1148T:p.L383L,CNPPD1:NM_001321390:exon9:c.T1148T:p.L383L,CNPPD1:NM_001321391:exon9:c.T1148T:p.L383L",⋯,.,.,.,.,.,.,.,.,.,chr2:219172671:G:A
2,105308053,105308053,G,A,exonic,TGFBRAP1,.,synonymous SNV,"TGFBRAP1:NM_001142621:exon2:c.T249T:p.R83R,TGFBRAP1:NM_001328646:exon2:c.T249T:p.R83R,TGFBRAP1:NM_004257:exon2:c.T249T:p.R83R",⋯,.,.,.,.,.,.,.,.,.,chr2:105308053:G:A
2,227270915,227270915,T,C,exonic,COL4A3,.,synonymous SNV,COL4A3:NM_000091:exon25:c.C1721C:p.P574P,⋯,.,.,.,.,.,.,.,.,.,chr2:227270915:T:C
2,241897212,241897212,C,G,exonic,FAM240C,.,synonymous SNV,"FAM240C:NM_001382368:exon2:c.C135C:p.I45I,FAM240C:NM_001382369:exon2:c.C120C:p.I40I,FAM240C:NM_001382370:exon2:c.C120C:p.I40I",⋯,.,.,.,.,.,.,.,.,.,chr2:241897212:C:G
2,227089883,227089883,A,G,exonic,COL4A4,.,synonymous SNV,COL4A4:NM_000092:exon21:c.C1444C:p.P482P,⋯,.,.,.,.,.,.,.,.,.,chr2:227089883:A:G
2,127564195,127564195,A,G,exonic,MYO7B,.,synonymous SNV,MYO7B:NM_001080527:exon3:c.G61G:p.G21G,⋯,.,.,.,.,.,.,.,.,.,chr2:127564195:A:G


Chr,Start,End,Ref,Alt,Func.refGene,Gene.refGene,GeneDetail.refGene,ExonicFunc.refGene,AAChange.refGene,⋯,CLNDISDB,CLNREVSTAT,CLNSIG,DN ID,Patient ID,Phenotype,Platform,Study,Pubmed ID,Otherinfo1
<int>,<int>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
11,1107741,1107741,C,T,exonic,MUC2,.,unknown,UNKNOWN,⋯,.,.,.,.,.,.,.,.,.,chr11:1107741:C:T
11,44309959,44309959,G,C,exonic,ALX4,.,synonymous SNV,ALX4:NM_021926:exon1:c.G104G:p.R35R,⋯,.,.,.,.,.,.,.,.,.,chr11:44309959:G:C
11,47309565,47309565,C,T,exonic,MADD,.,synonymous SNV,"MADD:NM_001376663:exon23:c.T3268T:p.L1090L,MADD:NM_001376611:exon24:c.T3832T:p.L1278L,MADD:NM_001376613:exon24:c.T3820T:p.L1274L,MADD:NM_001376614:exon24:c.T3820T:p.L1274L,MADD:NM_001376650:exon24:c.T3691T:p.L1231L,MADD:NM_001376661:exon24:c.T3625T:p.L1209L,MADD:NM_001376662:exon24:c.T3478T:p.L1160L,MADD:NM_001135943:exon25:c.T3802T:p.L1268L,MADD:NM_001135944:exon25:c.T3793T:p.L1265L,MADD:NM_001376584:exon25:c.T3934T:p.L1312L,MADD:NM_001376593:exon25:c.T3922T:p.L1308L,MADD:NM_001376594:exon25:c.T3922T:p.L1308L,MADD:NM_001376612:exon25:c.T3934T:p.L1312L,MADD:NM_001376615:exon25:c.T3805T:p.L1269L,MADD:NM_001376616:exon25:c.T3805T:p.L1269L,MADD:NM_001376618:exon25:c.T3793T:p.L1265L,MADD:NM_001376619:exon25:c.T3793T:p.L1265L,MADD:NM_001376620:exon25:c.T3793T:p.L1265L,MADD:NM_001376621:exon25:c.T3793T:p.L1265L,MADD:NM_001376654:exon25:c.T3661T:p.L1221L,MADD:NM_130470:exon25:c.T3934T:p.L1312L,MADD:NM_130472:exon25:c.T3805T:p.L1269L,MADD:NM_130474:exon25:c.T3805T:p.L1269L,MADD:NM_130476:exon25:c.T3931T:p.L1311L,MADD:NM_001376579:exon26:c.T3985T:p.L1329L,MADD:NM_001376580:exon26:c.T3985T:p.L1329L,MADD:NM_001376583:exon26:c.T3937T:p.L1313L,MADD:NM_001376603:exon26:c.T3868T:p.L1290L,MADD:NM_001376604:exon26:c.T3856T:p.L1286L,MADD:NM_001376607:exon26:c.T3853T:p.L1285L,MADD:NM_001376622:exon26:c.T3985T:p.L1329L,MADD:NM_001376623:exon26:c.T3985T:p.L1329L,MADD:NM_001376631:exon26:c.T3958T:p.L1320L,MADD:NM_001376635:exon26:c.T3724T:p.L1242L,MADD:NM_001376644:exon26:c.T3715T:p.L1239L,MADD:NM_001376646:exon26:c.T3712T:p.L1238L,MADD:NM_001376647:exon26:c.T3706T:p.L1236L,MADD:NM_001376651:exon26:c.T3868T:p.L1290L,MADD:NM_001376652:exon26:c.T3868T:p.L1290L,MADD:NM_001376653:exon26:c.T3865T:p.L1289L,MADD:NM_001376656:exon26:c.T3856T:p.L1286L,MADD:NM_001376657:exon26:c.T3835T:p.L1279L,MADD:NM_001376658:exon26:c.T3808T:p.L1270L,MADD:NM_001376660:exon26:c.T3706T:p.L1236L,MADD:NM_130471:exon26:c.T3865T:p.L1289L,MADD:NM_130473:exon26:c.T
3994T:p.L1332L,MADD:NM_001376574:exon27:c.T4057T:p.L1353L,MADD:NM_001376575:exon27:c.T4051T:p.L1351L,MADD:NM_001376576:exon27:c.T4039T:p.L1347L,MADD:NM_001376577:exon27:c.T4039T:p.L1347L,MADD:NM_001376578:exon27:c.T4012T:p.L1338L,MADD:NM_001376585:exon27:c.T3919T:p.L1307L,MADD:NM_001376586:exon27:c.T3916T:p.L1306L,MADD:NM_001376595:exon27:c.T3910T:p.L1304L,MADD:NM_001376596:exon27:c.T4117T:p.L1373L,MADD:NM_001376597:exon27:c.T3910T:p.L1304L,MADD:NM_001376598:exon27:c.T3910T:p.L1304L,MADD:NM_001376602:exon27:c.T3892T:p.L1298L,MADD:NM_001376605:exon27:c.T4057T:p.L1353L,MADD:NM_001376606:exon27:c.T4054T:p.L1352L,MADD:NM_001376608:exon27:c.T4048T:p.L1350L,MADD:NM_001376609:exon27:c.T4045T:p.L1349L,MADD:NM_001376610:exon27:c.T4039T:p.L1347L,MADD:NM_001376617:exon27:c.T3916T:p.L1306L,MADD:NM_001376626:exon27:c.T3895T:p.L1299L,MADD:NM_001376627:exon27:c.T3766T:p.L1256L,MADD:NM_001376633:exon27:c.T4054T:p.L1352L,MADD:NM_001376634:exon27:c.T4045T:p.L1349L,MADD:NM_001376636:exon27:c.T3928T:p.L1310L,MADD:NM_001376637:exon27:c.T3928T:p.L1310L,MADD:NM_001376638:exon27:c.T3925T:p.L1309L,MADD:NM_001376639:exon27:c.T3925T:p.L1309L,MADD:NM_001376640:exon27:c.T3922T:p.L1308L,MADD:NM_001376641:exon27:c.T3919T:p.L1307L,MADD:NM_001376642:exon27:c.T3916T:p.L1306L,MADD:NM_001376643:exon27:c.T3916T:p.L1306L,MADD:NM_001376645:exon27:c.T3910T:p.L1304L,MADD:NM_001376648:exon27:c.T3895T:p.L1299L,MADD:NM_001376649:exon27:c.T3895T:p.L1299L,MADD:NM_001376655:exon27:c.T3856T:p.L1286L,MADD:NM_001376659:exon27:c.T3766T:p.L1256L,MADD:NM_001376571:exon28:c.T4111T:p.L1371L,MADD:NM_001376572:exon28:c.T4099T:p.L1367L,MADD:NM_001376573:exon28:c.T4099T:p.L1367L,MADD:NM_001376581:exon28:c.T3970T:p.L1324L,MADD:NM_001376582:exon28:c.T3970T:p.L1324L,MADD:NM_001376599:exon28:c.T4099T:p.L1367L,MADD:NM_001376600:exon28:c.T4099T:p.L1367L,MADD:NM_001376601:exon28:c.T4099T:p.L1367L,MADD:NM_001376624:exon28:c.T3982T:p.L1328L,MADD:NM_001376625:exon28:c.T3982T:p.L1328L,MADD:NM_001376628:exon28:c.T3979T:p.L1327L,MADD:NM
_001376629:exon28:c.T3970T:p.L1324L,MADD:NM_001376630:exon28:c.T3970T:p.L1324L,MADD:NM_001376632:exon28:c.T3943T:p.L1315L,MADD:NM_003682:exon28:c.T4111T:p.L1371L,MADD:NM_130475:exon28:c.T4111T:p.L1371L",⋯,.,.,.,.,.,.,.,.,.,chr11:47309565:C:T
11,67665333,67665333,C,T,exonic,ALDH3B2,.,synonymous SNV,"ALDH3B2:NM_001031615:exon7:c.A658A:p.S220S,ALDH3B2:NM_001354345:exon8:c.A658A:p.S220S",⋯,.,.,.,.,.,.,.,.,.,chr11:67665333:C:T
11,113323446,113323446,C,A,exonic;splicing,TTC12;TTC12,.,synonymous SNV,"TTC12:NM_001352037:exon2:c.A142A:p.M48M,TTC12:NM_001378063:exon2:c.A142A:p.M48M,TTC12:NM_001318533:exon3:c.A217A:p.M73M,TTC12:NM_001378064:exon3:c.A217A:p.M73M,TTC12:NM_001378065:exon3:c.A217A:p.M73M,TTC12:NM_017868:exon3:c.A217A:p.M73M",⋯,.,.,.,.,.,.,.,.,.,chr11:113323446:C:A
11,5046754,5046754,G,A,exonic,OR52J3,.,synonymous SNV,OR52J3:NM_001001916:exon1:c.A229A:p.T77T,⋯,.,.,.,.,.,.,.,.,.,chr11:5046754:G:A
11,5046947,5046947,T,A,exonic,OR52J3,.,synonymous SNV,OR52J3:NM_001001916:exon1:c.A422A:p.Q141Q,⋯,.,.,.,.,.,.,.,.,.,chr11:5046947:T:A
11,5058838,5058838,A,G,exonic,OR52E2,.,synonymous SNV,OR52E2:NM_001005164:exon1:c.C790C:p.R264R,⋯,.,.,.,.,.,.,.,.,.,chr11:5058838:A:G
11,20090946,20090946,A,G,exonic,NAV2,.,synonymous SNV,"NAV2:NM_001111019:exon17:c.G2772G:p.T924T,NAV2:NM_145117:exon27:c.G5580G:p.T1860T,NAV2:NM_182964:exon27:c.G5589G:p.T1863T,NAV2:NM_001111018:exon28:c.G5388G:p.T1796T,NAV2:NM_001244963:exon29:c.G5757G:p.T1919T",⋯,.,.,.,.,.,.,.,.,.,chr11:20090946:A:G
11,119182117,119182117,A,C,exonic,NLRX1,.,synonymous SNV,"NLRX1:NM_001282143:exon9:c.C2378C:p.A793A,NLRX1:NM_001282144:exon9:c.C2378C:p.A793A,NLRX1:NM_001282358:exon9:c.C2378C:p.A793A,NLRX1:NM_024618:exon9:c.C2378C:p.A793A",⋯,.,.,.,.,.,.,.,.,.,chr11:119182117:A:C


Since none of the mismatched variants has a functional annotation that we are interested in, it is safe to simply remove all of them.

#### 3.14 Remove mis-matched variants

In [3]:
for i in list((1,2,11)):
    for j in list((1,2,3,4,5,6,7)):
        script='''#!/bin/sh
#$ -l h_rt=24:00:00
#$ -l h_vmem=10G
#$ -N remove_mismatch_chr%i_batch%i
#$ -o ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed_v3/scripts/remove_mismatch_chr%i_batch%i_$JOB_ID.out
#$ -e ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed_v3/scripts/remove_mismatch_chr%i_batch%i_$JOB_ID.err
#$ -cwd
#$ -S /bin/bash
#$ -q csg.q

module load Plink/2.00a
cd ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed_v3/topmed_batch%i

chr=%i

plink2 \
    --vcf chr${chr}.dose.recoded.vcf.gz dosage=HDS \
    --exclude chr${chr}.dose.mismatch.bim \
    --make-bpgen \
    --export vcf bgz vcf-dosage=HDS-force \
    --out chr${chr}.dose.nomismatch

'''%(i,j,i,j,i,j,j,i)
        f=open("/home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed_v3/scripts/remove_mismatch_chr"+str(i)+"_batch"+str(j)+".sh", 'w')
        f.write(script)
        f.close()

#### 3.15 Split variants into 10 groups

Since pasting whole-chromosome VCFs together takes a long time, we split the variants into 10 groups and, for each group, paste the 7 batches together to obtain the $R^2$ values.
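The grouping rule is contiguous blocks of `ceiling(n / 10)` variants, with a shorter last block. A small Python sketch shows it; `assign_groups` is an illustrative name and the variant count of 95 is made up.

```python
import math
from collections import Counter

def assign_groups(n_rows, n_groups=10):
    """Give each row a 1-based group index in contiguous blocks of ceil(n_rows / n_groups)."""
    rows_per_group = math.ceil(n_rows / n_groups)
    return [row // rows_per_group + 1 for row in range(n_rows)]

# Made-up example: 95 variants split into 10 groups
groups = assign_groups(95)
counts = Counter(groups)
group_sizes = [counts[g] for g in range(1, 11)]
print(group_sizes)
```

Keeping the blocks contiguous preserves the variant order within and across groups, which matters when the groups are later concatenated back into one file per chromosome.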

In [5]:
for(i in c(1,2,11)){
    bim <- fread(sprintf("./topmed_batch1/chr%i.dose.recoded.bim", i))
    mismatch <- fread(sprintf("./topmed_batch1/chr%i.dose.mismatch.bim", i))
    no_mismatch <- anti_join(bim, mismatch)
    
    rows_per_df <- ceiling(nrow(no_mismatch)/10)
    group_indices <- rep(1:10, each = rows_per_df, length.out = nrow(no_mismatch))
    split_df <- split(no_mismatch, group_indices)
    for(g in c(1:10)){
        split_df[[g]] %>% fwrite(sprintf("./topmed_groups/chr%i.dose.nomismatch.group%i.bim", i, g), col.names = FALSE, sep = "\t")
    }
}

[1m[22mJoining with `by = join_by(V1, V2, V3, V4, V5, V6)`
[1m[22mJoining with `by = join_by(V1, V2, V3, V4, V5, V6)`
[1m[22mJoining with `by = join_by(V1, V2, V3, V4, V5, V6)`


In [6]:
for i in list((1,2,11)):
    for j in list(range(1,8)):
        for k in list(range(1,11)):
            script='''#!/bin/sh
#$ -l h_rt=24:00:00
#$ -l h_vmem=10G
#$ -N create_groups_chr%i_batch%i_group%i
#$ -o ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed_v3/scripts/create_groups_chr%i_batch%i_group%i_$JOB_ID.out
#$ -e ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed_v3/scripts/create_groups_chr%i_batch%i_group%i_$JOB_ID.err
#$ -cwd
#$ -S /bin/bash
#$ -q csg.q

module load Plink/2.00a
cd ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed_v3/

chr=%i
batch=%i
grp=%i

plink2 \
    --vcf ./topmed_batch${batch}/chr${chr}.dose.recoded.vcf.gz dosage=HDS \
    --extract ./topmed_groups/chr${chr}.dose.nomismatch.group${grp}.bim \
    --make-bpgen \
    --export vcf bgz vcf-dosage=HDS-force \
    --out ./topmed_groups/chr${chr}.batch${batch}.group${grp}

'''%(i,j,k,i,j,k,i,j,k,i,j,k)
            f=open("/home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed_v3/scripts/create_groups_chr"+str(i)+"_batch"+str(j)+"_group"+str(k)+".sh", 'w')
            f.write(script)
            f.close()

#### 3.16 Paste each group for all 7 batches

We still need the provided `hds-util` tool to paste the batches together.
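Before pasting, it can be worth confirming that every batch lists the same variants in the same order. This is a sketch of such a check under the assumption that sample-wise pasting expects identical variant lists, which is consistent with the mismatch problem described earlier. The IDs and the helper name `first_difference` are hypothetical.

```python
def first_difference(id_lists):
    """Return None if all lists are identical, else (list_index, position) of the first difference."""
    reference = id_lists[0]
    for list_index, ids in enumerate(id_lists[1:], start=1):
        if ids == reference:
            continue
        for position, (a, b) in enumerate(zip(reference, ids)):
            if a != b:
                return list_index, position
        # No differing pair found, so one list is a prefix of the other
        return list_index, min(len(reference), len(ids))
    return None

# Hypothetical variant IDs from the same group in three batches
ok_batches = [["chr1:100:A:G", "chr1:300:G:A"]] * 3
bad_batches = [
    ["chr1:100:A:G", "chr1:300:G:A"],
    ["chr1:100:A:G", "chr1:300:G:A"],
    ["chr1:100:A:G", "chr1:300:A:G"],
]
print(first_difference(ok_batches), first_difference(bad_batches))
```

In practice the ID lists would come from the per-batch group files created in the previous step.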

In [7]:
## Software installation
# mkdir ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/tools/
# cd ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/tools/
# git clone https://github.com/statgen/hds-util

# cd hds-util

# pip3 install --user cget
# cget install -f ./requirements.txt
# mkdir build; cd build
# cmake -DCMAKE_TOOLCHAIN_FILE=../cget/cget/cget.cmake -DCMAKE_BUILD_TYPE=Release ..
# make
# make install

# export PATH="$HOME/project/imputation-rvtest/analysis/imputation_aggregated_analysis/tools/hds-util/build:$PATH"

# hds-util --help

In [8]:
## paste for each group
## without min-r2 flag to retain all variants

for i in list((1,2,11)):
    for j in list(range(1,11)):
        script='''#!/bin/bash
#SBATCH --mem=50G
#SBATCH --time=240:00:00
#SBATCH --job-name=paste_chr%i_group%i
#SBATCH --output=/home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed_v3/scripts/paste_chr%i_group%i_nominr2_%%j.out
#SBATCH --error=/home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed_v3/scripts/paste_chr%i_group%i_nominr2_%%j.err
#SBATCH -p CSG
#SBATCH --mail-type=FAIL
#SBATCH --mail-user tl3031@cumc.columbia.edu

cd ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed_v3/topmed_groups
export PATH="$HOME/project/imputation-rvtest/analysis/imputation_aggregated_analysis/tools/hds-util/build:$PATH"

chr=%i
grp=%i

hds-util -f GT,DS,HDS -O vcf.gz \
    ./chr${chr}.batch1.group${grp}.vcf.gz \
    ./chr${chr}.batch2.group${grp}.vcf.gz \
    ./chr${chr}.batch3.group${grp}.vcf.gz \
    ./chr${chr}.batch4.group${grp}.vcf.gz \
    ./chr${chr}.batch5.group${grp}.vcf.gz \
    ./chr${chr}.batch6.group${grp}.vcf.gz \
    ./chr${chr}.batch7.group${grp}.vcf.gz  > ./topmed_chr${chr}_merged_group${grp}_nominr2.vcf.gz
'''%(i,j,i,j,i,j,i,j)
        f=open("/home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed_v3/scripts/new_paste_chr"+str(i)+"_group"+str(j)+"_nominr2.sh", 'w')
        f.write(script)
        f.close()

In [9]:
## obtain for each pasted file their variant info
for i in list((1,2,11)):
    for j in list(range(1,11)):
        script='''#!/bin/bash
#SBATCH --mem=10G
#SBATCH --time=72:00:00
#SBATCH --job-name=obtain_info_chr%i_group%i
#SBATCH --output=/home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed_v3/scripts/obtain_info_chr%i_group%i_nominr2_%%j.out
#SBATCH --error=/home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed_v3/scripts/obtain_info_chr%i_group%i_nominr2_%%j.err
#SBATCH -p CSG
#SBATCH --mail-type=FAIL
#SBATCH --mail-user tl3031@cumc.columbia.edu

module load BCFTOOLS
cd ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed_v3/topmed_groups

chr=%i
grp=%i

bcftools query -f '%%CHROM\\t%%POS\\t%%ID\\t%%REF\\t%%ALT\\t%%R2\\t%%MAF\\n' ./topmed_chr${chr}_merged_group${grp}_nominr2.vcf.gz -o ./topmed_chr${chr}_merged_group${grp}_nominr2_info.txt

'''%(i,j,i,j,i,j,i,j)
        f=open("/home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed_v3/scripts/obtain_info_chr"+str(i)+"_group"+str(j)+"_nominr2.sh", 'w')
        f.write(script)
        f.close()

In [10]:
## merge info file for each chromosome
setwd("~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed_v3/topmed_groups")

library(dplyr)
library(data.table)

for(chr in c(1,2,11)){
    info_df <- data.frame(matrix(ncol = 7, nrow = 0))
    colnames(info_df) <- c("CHR", "POS", "ID", "REF", "ALT", "R2", "MAF")
    
    for(i in c(1:10)){
        df <- fread(sprintf("./topmed_chr%i_merged_group%i_nominr2_info.txt", chr, i))
        colnames(df) <- c("CHR", "POS", "ID", "REF", "ALT", "R2", "MAF")
        info_df <- rbind(info_df, df)
    }
    
    info_df <- info_df %>% mutate(R2 = as.numeric(R2))
    
    info_df %>% fwrite(sprintf("../topmed_168206ids_chr%i_all_rsq.txt", chr))
    
    info_df %>% filter(R2 > 0.3) %>% fwrite(sprintf("../topmed_168206ids_chr%i_rsq03_rsq.txt", chr))
    info_df %>% filter(R2 > 0.3) %>% select(ID) %>% fwrite(sprintf("../topmed_168206ids_chr%i_rsq03_snplist.txt", chr), col.names = FALSE)
    info_df %>% filter(R2 > 0.8) %>% fwrite(sprintf("../topmed_168206ids_chr%i_rsq08_rsq.txt", chr))
    info_df %>% filter(R2 > 0.8) %>% select(ID) %>% fwrite(sprintf("../topmed_168206ids_chr%i_rsq08_snplist.txt", chr), col.names = FALSE)
}

[1m[22m[36mℹ[39m In argument: `R2 = as.numeric(R2)`.
[33m![39m NAs introduced by coercion”
[1m[22m[36mℹ[39m In argument: `R2 = as.numeric(R2)`.
[33m![39m NAs introduced by coercion”
[1m[22m[36mℹ[39m In argument: `R2 = as.numeric(R2)`.
[33m![39m NAs introduced by coercion”


#### 3.17 Concatenate files together for each chromosome

In [11]:
## index for each pasted file
for i in list((1,2,11)):
    for j in list(range(1,11)):
        script='''#!/bin/bash
#SBATCH --mem=20G
#SBATCH --time=24:00:00
#SBATCH --job-name=index_vcf_chr%i_group%i
#SBATCH --output=/home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed_v3/scripts/index_vcf_chr%i_group%i_nominr2_%%j.out
#SBATCH --error=/home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed_v3/scripts/index_vcf_chr%i_group%i_nominr2_%%j.err
#SBATCH -p CSG
#SBATCH --mail-type=FAIL
#SBATCH --mail-user tl3031@cumc.columbia.edu

module load BCFTOOLS
cd ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed_v3/topmed_groups

chr=%i
group=%i

bcftools index --threads 15 topmed_chr${chr}_merged_group${group}_nominr2.vcf.gz; 

'''%(i,j,i,j,i,j,i,j)
        f=open("/home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed_v3/scripts/index_vcf_chr"+str(i)+"_group"+str(j)+"_nominr2.sh", 'w')
        f.write(script)
        f.close()

In [12]:
## concatenate the pasted group files for each chromosome
for i in list((1,2,11)):
    script='''#!/bin/bash
#SBATCH --mem=20G
#SBATCH --time=240:00:00
#SBATCH --job-name=concat_vcf_chr%i
#SBATCH --output=/home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed_v3/scripts/concat_vcf_chr%i_nominr2_%%j.out
#SBATCH --error=/home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed_v3/scripts/concat_vcf_chr%i_nominr2_%%j.err
#SBATCH -p CSG
#SBATCH --mail-type=FAIL
#SBATCH --mail-user tl3031@cumc.columbia.edu

module load BCFTOOLS
cd ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed_v3/topmed_groups

chr=%i
bcftools concat --threads 15 \
    topmed_chr${chr}_merged_group1_nominr2.vcf.gz \
    topmed_chr${chr}_merged_group2_nominr2.vcf.gz \
    topmed_chr${chr}_merged_group3_nominr2.vcf.gz \
    topmed_chr${chr}_merged_group4_nominr2.vcf.gz \
    topmed_chr${chr}_merged_group5_nominr2.vcf.gz \
    topmed_chr${chr}_merged_group6_nominr2.vcf.gz \
    topmed_chr${chr}_merged_group7_nominr2.vcf.gz \
    topmed_chr${chr}_merged_group8_nominr2.vcf.gz \
    topmed_chr${chr}_merged_group9_nominr2.vcf.gz \
    topmed_chr${chr}_merged_group10_nominr2.vcf.gz \
    -Oz -o ../topmed_chr${chr}_merged_168206ids_dose.vcf.gz

'''%(i,i,i,i)
    f=open("/home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed_v3/scripts/concat_vcf_chr"+str(i)+"_nominr2.sh", 'w')
    f.write(script)
    f.close()
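
`bcftools concat` joins its inputs in the order given, so the ten group files need to be listed in positional order. A small sketch that builds the same command from a group count is shown below. The function name `concat_command` is hypothetical, and it assumes the groups are numbered 1..10 in genomic order, as in the hand-written list above.

```python
def concat_command(chrom, n_groups=10):
    """Return a bcftools concat command listing groups 1..n_groups in order."""
    inputs = [
        f"topmed_chr{chrom}_merged_group{g}_nominr2.vcf.gz"
        for g in range(1, n_groups + 1)
    ]
    out = f"../topmed_chr{chrom}_merged_168206ids_dose.vcf.gz"
    return "bcftools concat --threads 15 " + " ".join(inputs) + f" -Oz -o {out}"


if __name__ == "__main__":
    for chrom in (1, 2, 11):
        print(concat_command(chrom))
```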

In [25]:
## recode vcf to pgen format for each group
for i in list((1,2,11)):
    for j in list((1,2,3,4,5,6,7,8,9,10)):
        script='''#!/bin/bash
#SBATCH --mem=300G
#SBATCH --time=100:00:00
#SBATCH --job-name=recode_vcf_chr%i_group%i
#SBATCH --output=/home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed_v3/scripts/recode_vcf_chr%i_grp%i_%%j.out
#SBATCH --error=/home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed_v3/scripts/recode_vcf_chr%i_grp%i_%%j.err
#SBATCH -p CSG
#SBATCH --mail-type=FAIL
#SBATCH --mail-user tl3031@cumc.columbia.edu

module load Plink/2.00a
cd ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed_v3/

chr=%i
grp=%i

plink2 \
    --vcf ./topmed_groups/topmed_chr${chr}_merged_group${grp}_nominr2.vcf.gz dosage=DS \
    --make-bpgen --sort-vars --threads 5 --memory 150000 \
    --out ./topmed_groups/topmed_chr${chr}_merged_group${grp}_nominr2

'''%(i,j,i,j,i,j,i,j)
        f=open("/home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed_v3/scripts/recode_vcf_chr"+str(i)+"_grp"+str(j)+".sh", 'w')
        f.write(script)
        f.close()

In [None]:
## use plink to merge them together
cd ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed_v3/topmed_groups

plink2 --bpfile topmed_chr1_merged_group1_nominr2 \
    --pmerge-list chr1_mergelist bpfile \
    --make-bpgen \
    --out ../topmed_chr1_merged_168206ids_dose_plink

plink2 --bpfile topmed_chr2_merged_group1_nominr2 \
    --pmerge-list chr2_mergelist bpfile \
    --make-bpgen \
    --out ../topmed_chr2_merged_168206ids_dose_plink

plink2 --bpfile topmed_chr11_merged_group1_nominr2 \
    --pmerge-list chr11_mergelist bpfile \
    --make-bpgen \
    --out ../topmed_chr11_merged_168206ids_dose_plink

## extract for rsq03 and rsq08
plink2 --bpfile ../topmed_chr1_merged_168206ids_dose_plink \
    --extract ../topmed_168206ids_chr1_rsq03_snplist.txt \
    --make-bpgen \
    --out ../topmed_chr1_merged_168206ids_rsq03_dose

plink2 --bpfile ../topmed_chr1_merged_168206ids_dose_plink \
    --extract ../topmed_168206ids_chr1_rsq08_snplist.txt \
    --make-bpgen \
    --out ../topmed_chr1_merged_168206ids_rsq08_dose

plink2 --bpfile ../topmed_chr2_merged_168206ids_dose_plink \
    --extract ../topmed_168206ids_chr2_rsq03_snplist.txt \
    --make-bpgen \
    --out ../topmed_chr2_merged_168206ids_rsq03_dose

plink2 --bpfile ../topmed_chr2_merged_168206ids_dose_plink \
    --extract ../topmed_168206ids_chr2_rsq08_snplist.txt \
    --make-bpgen \
    --out ../topmed_chr2_merged_168206ids_rsq08_dose

plink2 --bpfile ../topmed_chr11_merged_168206ids_dose_plink \
    --extract ../topmed_168206ids_chr11_rsq03_snplist.txt \
    --make-bpgen \
    --out ../topmed_chr11_merged_168206ids_rsq03_dose

plink2 --bpfile ../topmed_chr11_merged_168206ids_dose_plink \
    --extract ../topmed_168206ids_chr11_rsq08_snplist.txt \
    --make-bpgen \
    --out ../topmed_chr11_merged_168206ids_rsq08_dose
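
The `chr*_mergelist` files passed to `--pmerge-list` are not created anywhere in this notebook. The sketch below shows one way they could be written, assuming one fileset prefix per line. The helper `write_mergelist` is hypothetical, and which groups belong in the list (in particular whether group 1, already given through `--bpfile`, should be repeated) is left as a parameter because it should be checked against the plink2 documentation.

```python
import tempfile
from pathlib import Path


def write_mergelist(chrom, groups, out_dir="."):
    """Write one fileset prefix per line and return the path of the list."""
    path = Path(out_dir) / f"chr{chrom}_mergelist"
    lines = [f"topmed_chr{chrom}_merged_group{g}_nominr2" for g in groups]
    path.write_text("\n".join(lines) + "\n")
    return path


if __name__ == "__main__":
    # Demo in a temporary directory; groups 2..10 is an assumption.
    demo_dir = tempfile.mkdtemp()
    print(write_mergelist(1, range(2, 11), demo_dir).read_text())
```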

#### 3.18 Summary

| Chr | After removing mismatches | After removing monomorphic variants | $R^2$ > 0.3 | $R^2$ > 0.8 |
|:---:|:-------------------------:|:-----------------------------------:|:------------------:|:-----------------:|
|  1  |        33,930,494         |             32,512,988              | 12,818,589 (0.689) | 4,680,641 (0.903) |
|  2  |        37,160,046         |             35,569,645              | 13,814,401 (0.691) | 5,126,997 (0.904) |
|  11 |        20,994,846         |             20,080,669              | 7,791,663 (0.690)  | 2,878,270 (0.904) |

### 3.2 HRC

#### 3.21 Paste batches together
The HRC output has no allele-flipping problem, so the batches can be pasted together directly.

In [13]:
## write out script hrc
for i in list((1,2,11)):
    for j in list((0,3,8)):
        script='''#!/bin/sh
#$ -l h_rt=700:00:00
#$ -l h_vmem=5G
#$ -N paste_hrc_chr%i_rsq0%i
#$ -o ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc/scripts/paste_hrc_chr%i_rsq0%i_$JOB_ID.out
#$ -e ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc/scripts/paste_hrc_chr%i_rsq0%i_$JOB_ID.err
#$ -cwd
#$ -S /bin/bash
#$ -q csg.q

cd ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc
export PATH="$HOME/project/imputation-rvtest/analysis/imputation_aggregated_analysis/tools/hds-util/build:$PATH"

chr=%i

hds-util -f GT,DS,HDS --min-r2 0.%i -O vcf.gz \
    ./hrc_batch1/chr${chr}.dose.vcf.gz \
    ./hrc_batch2/chr${chr}.dose.vcf.gz \
    ./hrc_batch3/chr${chr}.dose.vcf.gz \
    ./hrc_batch4/chr${chr}.dose.vcf.gz \
    ./hrc_batch5/chr${chr}.dose.vcf.gz \
    ./hrc_batch6/chr${chr}.dose.vcf.gz \
    ./hrc_batch7/chr${chr}.dose.vcf.gz > ./hrc_chr${chr}_merged_168206ids_rsq0%i_dose.vcf.gz
'''%(i,j,i,j,i,j,i,j,j)
        f=open("/home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc/scripts/pasting_hrc_chr"+str(i)+"_rsq0"+str(j)+".sh", 'w')
        f.write(script)
        f.close()

#### 3.22 Liftover HRC (hg19->hg38)

In [14]:
# Recode VCF into .pgen format, recode variant name
cd ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc
module load Plink/2.00a

plink2 \
    --vcf hrc_chr1_merged_168206ids_rsq03_dose.vcf.gz \
    --freq counts \
    --make-bpgen --sort-vars \
    --set-all-var-ids chr@:#:\$r:\$a \
    --out hrc_chr1_merged_168206ids_rsq03_dose
    
plink2 \
    --vcf hrc_chr1_merged_168206ids_rsq08_dose.vcf.gz \
    --freq counts \
    --make-bpgen --sort-vars \
    --set-all-var-ids chr@:#:\$r:\$a \
    --out hrc_chr1_merged_168206ids_rsq08_dose
    
plink2 \
    --vcf hrc_chr2_merged_168206ids_rsq03_dose.vcf.gz \
    --freq counts \
    --make-bpgen --sort-vars \
    --set-all-var-ids chr@:#:\$r:\$a \
    --out hrc_chr2_merged_168206ids_rsq03_dose
    
plink2 \
    --vcf hrc_chr2_merged_168206ids_rsq08_dose.vcf.gz \
    --freq counts \
    --make-bpgen --sort-vars \
    --set-all-var-ids chr@:#:\$r:\$a \
    --out hrc_chr2_merged_168206ids_rsq08_dose
    
plink2 \
    --vcf hrc_chr11_merged_168206ids_rsq03_dose.vcf.gz \
    --freq counts \
    --make-bpgen --sort-vars \
    --set-all-var-ids chr@:#:\$r:\$a \
    --out hrc_chr11_merged_168206ids_rsq03_dose
    
plink2 \
    --vcf hrc_chr11_merged_168206ids_rsq08_dose.vcf.gz \
    --freq counts \
    --make-bpgen --sort-vars \
    --set-all-var-ids chr@:#:\$r:\$a \
    --out hrc_chr11_merged_168206ids_rsq08_dose
    
    
# Check for monomorphic variants
awk 'BEGIN {FS=" "; OFS=" "} {if(NR==1 || $5==0 || $6==0)print $2}' hrc_chr1_merged_168206ids_rsq03_dose.acount > monomprphic_chr1_rsq03_SNPs
awk 'BEGIN {FS=" "; OFS=" "} {if(NR==1 || $5==0 || $6==0)print $2}' hrc_chr1_merged_168206ids_rsq08_dose.acount > monomprphic_chr1_rsq08_SNPs
awk 'BEGIN {FS=" "; OFS=" "} {if(NR==1 || $5==0 || $6==0)print $2}' hrc_chr2_merged_168206ids_rsq03_dose.acount > monomprphic_chr2_rsq03_SNPs
awk 'BEGIN {FS=" "; OFS=" "} {if(NR==1 || $5==0 || $6==0)print $2}' hrc_chr2_merged_168206ids_rsq08_dose.acount > monomprphic_chr2_rsq08_SNPs
awk 'BEGIN {FS=" "; OFS=" "} {if(NR==1 || $5==0 || $6==0)print $2}' hrc_chr11_merged_168206ids_rsq03_dose.acount > monomprphic_chr11_rsq03_SNPs
awk 'BEGIN {FS=" "; OFS=" "} {if(NR==1 || $5==0 || $6==0)print $2}' hrc_chr11_merged_168206ids_rsq08_dose.acount > monomprphic_chr11_rsq08_SNPs
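
The `awk` lines above flag a variant when column 5 or column 6 is zero. Assuming the `.acount` columns are `#CHROM ID REF ALT ALT_CTS OBS_CT` (an assumption that should be checked against the file header), a variant fixed for the ALT allele, where `ALT_CTS` equals `OBS_CT`, would not be flagged. The sketch below is a hypothetical Python variant of the check that covers both cases. Unlike the `awk` command it does not emit the header line.

```python
def monomorphic_ids(acount_lines):
    """Return IDs whose ALT count is 0 or equals the observed allele count.

    Assumes whitespace-separated columns: #CHROM ID REF ALT ALT_CTS OBS_CT.
    """
    ids = []
    for line in acount_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip the header and blank lines
        fields = line.split()
        var_id = fields[1]
        alt_ct = float(fields[4])  # dosage-based counts can be fractional
        obs_ct = float(fields[5])
        if alt_ct == 0 or alt_ct == obs_ct:
            ids.append(var_id)
    return ids


if __name__ == "__main__":
    demo = [
        "#CHROM\tID\tREF\tALT\tALT_CTS\tOBS_CT",
        "1\tchr1:100:A:G\tA\tG\t0\t200",
        "1\tchr1:200:C:T\tC\tT\t200\t200",
        "1\tchr1:300:G:A\tG\tA\t12.5\t200",
    ]
    print(monomorphic_ids(demo))
```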

In [15]:
## Running our in-house liftover pipeline

sos run /home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/liftover.ipynb \
    --cwd /home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc/ \
    --input_file ./hrc_168206_chr1.bim \
    --output_file ./hrc_168206_chr1_hg38.bim

sos run /home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/liftover.ipynb \
    --cwd /home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc_annot_168206ids \
    --input_file ./hrc_168206_chr2.bim \
    --output_file ./hrc_168206_chr2_hg38.bim

sos run /home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/liftover.ipynb \
    --cwd /home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc_annot_168206ids \
    --input_file ./hrc_168206_chr11.bim \
    --output_file ./hrc_168206_chr11_hg38.bim

## 4. HRC Data Processing

In the HRC annotation, `ID_hg19` is the original variant ID, `ID_hg38` is the lifted-over ID in hg38, and `ID` is the ID built by pasting the annotation fields together as `chr<Chr>:<Start>:<Ref>:<Alt>`.
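
As a small illustration of the `ID` column, the sketch below mirrors the `paste(Chr, Start, Ref, Alt, sep = ":")` and `paste0("chr", ID)` steps used in section 4.2. The function name `make_variant_id` is hypothetical, and it assumes the chromosome value carries no `chr` prefix of its own.

```python
def make_variant_id(chrom, start, ref, alt):
    """Build an ID of the form chr<Chr>:<Start>:<Ref>:<Alt>."""
    return "chr" + ":".join(str(x) for x in (chrom, start, ref, alt))


if __name__ == "__main__":
    print(make_variant_id(1, 12345, "A", "G"))
```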

### 4.1 Recode VCF

In [16]:
## writing out script
for i in list((1,2,11)):
    for j in list((3,8)):
        script='''#!/bin/sh
#$ -l h_rt=24:00:00
#$ -l h_vmem=30G
#$ -N recode_vcf_hrc_chr%i_rsq0%i
#$ -o /home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc/scripts/recode_vcf_hrc_chr%i_rsq%i_$JOB_ID.out
#$ -e /home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc/scripts/recode_vcf_hrc_chr%i_rsq%i-$JOB_ID.err
#$ -j y
#$ -q csg.q
#$ -S /bin/bash
cd ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc
module load Plink/2.00a

plink2 \
    --vcf hrc_chr%i_merged_168206ids_rsq0%i_dose.vcf.gz dosage=DS \
    --freq counts \
    --make-bpgen --sort-vars \
    --set-all-var-ids chr@:#:\$r:\$a \
    --new-id-max-allele-len 200 \
    --out hrc_chr%i_merged_168206ids_rsq0%i_dose

awk 'BEGIN {FS=" "; OFS=" "} {if(NR==1 || $5==0 || $6==0)print $2}' hrc_chr%i_merged_168206ids_rsq0%i_dose.acount > monomprphic_chr%i_rsq0%i_SNPs

'''%(i,j,i,j,i,j,i,j,i,j,i,j,i,j)
        f=open("/home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc/scripts/recode_vcf_hrc_chr"+str(i)+"_rsq0"+str(j)+".sh", 'w')
        f.write(script)
        f.close()

In [17]:
cd /home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc/scripts/
for i in recode_vcf_hrc_chr*_rsq0*.sh; do qsub $i; done

Your job 8658534 ("recode_vcf_hrc_chr11_rsq03") has been submitted
Your job 8658535 ("recode_vcf_hrc_chr11_rsq08") has been submitted
Your job 8658536 ("recode_vcf_hrc_chr1_rsq03") has been submitted
Your job 8658537 ("recode_vcf_hrc_chr1_rsq08") has been submitted
Your job 8658538 ("recode_vcf_hrc_chr2_rsq03") has been submitted
Your job 8658539 ("recode_vcf_hrc_chr2_rsq08") has been submitted


### 4.2 Annotate HRC

In [18]:
## writing out script
for i in list((1,2,11)):
    for j in list((3,8)):
        script='''#!/bin/sh
#$ -l h_rt=24:00:00
#$ -l h_vmem=30G
#$ -N annotate_hrc_chr%i_rsq0%i
#$ -o /home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc/scripts/annotate_hrc_chr%i_rsq%i_$JOB_ID.out
#$ -e /home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc/scripts/annotate_hrc_chr%i_rsq%i-$JOB_ID.err
#$ -j y
#$ -q csg.q
#$ -S /bin/bash
export PATH=$HOME/miniconda3/bin:$PATH
module load Singularity

sos run ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/notebooks/annovar.ipynb annovar \
    --build 'hg38' \
    --cwd ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc \
    --bim_name /home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc/hrc_chr%i_merged_168206ids_rsq0%i_dose_hg38.bim \
    --humandb /mnt/mfs/statgen/isabelle/REF/humandb  \
    --job_size 1 \
    --name_prefix hrc_chr%i_merged_168206ids_rsq0%i_dose \
    --container_annovar /mnt/mfs/statgen/containers/gatk4-annovar.sif

'''%(i,j,i,j,i,j,i,j,i,j)
        f=open("/home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc/scripts/annotate_hrc_chr"+str(i)+"_rsq0"+str(j)+".sh", 'w')
        f.write(script)
        f.close()

In [18]:
## rename columns
library(dplyr)
library(data.table)

setwd("~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc")

for(chr in c(1,2,11)){
    for (rsq in c(3,8)){
        annot <- data.table::fread(sprintf("hrc_chr%d_merged_168206ids_rsq0%d_dose_hg38.hg38.hg38_multianno.csv", chr, rsq))
        bim <- data.table::fread(sprintf("hrc_chr%d_merged_168206ids_rsq0%d_dose.bim", chr, rsq))
        colnames(annot)[29:41] <- c("AF_genome",
                                    "AF_raw_genome",
                                    "AF_male_genome",
                                    "AF_female_genome",
                                    "AF_afr_genome",
                                    "AF_ami_genome",
                                    "AF_amr_genome",
                                    "AF_asj_genome",
                                    "AF_eas_genome",
                                    "AF_fin_genome",
                                    "AF_nfe_genome",
                                    "AF_oth_genome",
                                    "AF_sas_genome")
        colnames(annot)[42:54] <- c("AF_exome",
                                    "AF_popmax_exome",
                                    "AF_male_exome",
                                    "AF_female_exome",
                                    "AF_raw_exome",
                                    "AF_afr_exome",
                                    "AF_sas_exome",
                                    "AF_amr_exome",
                                    "AF_eas_exome",
                                    "AF_nfe_exome",
                                    "AF_fin_exome",
                                    "AF_asj_exome",
                                    "AF_oth_exome")
        annot <- annot %>% 
            mutate(AF_nfe_exome = as.numeric(AF_nfe_exome)) %>% 
            mutate(MAF_nfe_exome = ifelse(AF_nfe_exome > 0.5, 1 - AF_nfe_exome, AF_nfe_exome)) %>% 
            rename("ID_hg38" = "Otherinfo1") %>%
            mutate(ID = paste(Chr, Start, Ref, Alt, sep = ":"), ID_hg19 = bim$V2) %>%
            mutate(ID = paste0("chr", ID)) %>%
            select(Chr, Start, End, Ref, Alt, 
                   Func.refGene, Gene.refGene, ExonicFunc.refGene, 
                   MAF_nfe_exome, REVEL_score,
                   ID_hg38, ID_hg19, ID, CADD_phred)
        # data.table::fwrite(annot, sprintf("hrc_chr%d_rsq0%d_hg19_hg38_sel_col_annot.csv.gz", chr, rsq))
    }
}

ℹ In argument: `AF_nfe_exome = as.numeric(AF_nfe_exome)`.
! NAs introduced by coercion
ℹ In argument: `AF_nfe_exome = as.numeric(AF_nfe_exome)`.
! NAs introduced by coercion
ℹ In argument: `AF_nfe_exome = as.numeric(AF_nfe_exome)`.
! NAs introduced by coercion
ℹ In argument: `AF_nfe_exome = as.numeric(AF_nfe_exome)`.
! NAs introduced by coercion
ℹ In argument: `AF_nfe_exome = as.numeric(AF_nfe_exome)`.
! NAs introduced by coercion
ℹ In argument: `AF_nfe_exome = as.numeric(AF_nfe_exome)`.
! NAs introduced by coercion


### 4.3 Filter HRC

**MAF (0.01, 0.005, 0.001) $\times$ R2 (0.3, 0.8)**
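
The grid above gives 3 chromosomes × 2 $R^2$ thresholds × 3 MAF cutoffs, and the output file names encode the MAF by dropping the decimal point (the `gsub("\\.", "", ...)` step in the R code). The sketch below enumerates the grid and the resulting annotation file names. `maf_tag` and `annot_filename` are hypothetical helper names, and the MAF values are kept as strings to avoid float formatting surprises.

```python
from itertools import product

CHROMS = (1, 2, 11)
RSQ_TENTHS = (3, 8)  # 3 means R2 > 0.3, 8 means R2 > 0.8
MAFS = ("0.01", "0.005", "0.001")


def maf_tag(maf):
    """Drop the decimal point from a MAF string."""
    return maf.replace(".", "")


def annot_filename(chrom, rsq, maf):
    """File name of the gene-filtered annotation for one grid cell."""
    return (
        f"hrc_chr{chrom}_rsq0{rsq}_hg19_hg38_"
        f"maf{maf_tag(maf)}_LOF_missense_annot.csv.gz"
    )


combos = list(product(CHROMS, RSQ_TENTHS, MAFS))

if __name__ == "__main__":
    for chrom, rsq, maf in combos:
        print(annot_filename(chrom, rsq, maf))
```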

In [19]:
library(dplyr)
library(data.table)

setwd("~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc")

In [20]:
filter_df <- data.frame()  # empty accumulator; rows are appended with rbind() below

for(chr in c(1, 2, 11)){
    for(rsq in c(3, 8)){
        for(maf in c(0.01, 0.005, 0.001)){
            maf_c <- gsub("\\.", "", as.character(maf))
            annot <- fread(sprintf("hrc_chr%d_rsq0%d_hg19_hg38_sel_col_annot.csv.gz", chr, rsq)) %>% select(-CADD_phred)
            mono_list <- fread(sprintf("monomprphic_chr%d_rsq0%d_SNPs", chr, rsq))$ID # no monomorphic variants

            annot <-  annot %>% filter(Chr == chr)
            annot_maf <-  annot %>% 
                filter(!ID_hg19 %in% mono_list) %>%
                filter(is.na(MAF_nfe_exome) | MAF_nfe_exome < maf)

            annot_func <- annot_maf %>% 
                filter(Func.refGene %in% c("exonic", "splicing", "exonic;splicing")) %>%
                filter(ExonicFunc.refGene != 'unknown') %>% 
                filter(ExonicFunc.refGene != 'synonymous SNV' & ExonicFunc.refGene != 'nonframeshift substitution') %>%
                mutate(Function = ifelse(ExonicFunc.refGene == "nonsynonymous SNV", "missense", "")) %>%
                mutate(Function = ifelse(grepl("splicing", Func.refGene), "splicing", Function)) %>%
                mutate(Function = ifelse(ExonicFunc.refGene %in% c("stopgain", "stoploss", "startloss", "frameshift substitution"), "LoF", Function))
        
            annot_func <- annot_func %>% 
                tidyr::separate(Gene.refGene, c("Gene.refGene", "discard_1", "discard_2"), sep = ";") %>% 
                select(-discard_1, -discard_2)

            gene_list <- annot_func %>% pull(Gene.refGene) %>% table() %>% as.data.frame() %>% filter(Freq > 1) %>% pull(1)
            annot_final <- annot_func %>% filter(Gene.refGene %in% gene_list)
            
            data.table::fwrite(annot_func, sprintf("hrc_chr%d_rsq0%d_hg19_hg38_maf%s_LOF_missense_all_annot.csv.gz", chr, rsq, maf_c))
            data.table::fwrite(annot_func %>% select(ID_hg19), sprintf("hrc_chr%d_rsq0%d_hg19_hg38_maf%s_LOF_missense_all_snplist", chr, rsq, maf_c), 
                               sep = " ", col.names = FALSE)
            
            data.table::fwrite(annot_final, sprintf("hrc_chr%d_rsq0%d_hg19_hg38_maf%s_LOF_missense_annot.csv.gz", chr, rsq, maf_c))
            data.table::fwrite(annot_final %>% select(ID_hg19), sprintf("hrc_chr%d_rsq0%d_hg19_hg38_maf%s_LOF_missense_snplist", chr, rsq, maf_c), 
                               sep = " ", col.names = FALSE)

            sub_df <- data.frame(data = "hrc", chromosome = chr, maf = maf, rsq = rsq/10,
                                 total_num_var = nrow(annot), maf_filtering_var = nrow(annot_maf),
                                 function_filtering_var = sprintf("%d (%d)", nrow(annot_func), length(unique(annot_func$Gene.refGene))),
                                 gene_filtering_var = sprintf("%d (%d)", nrow(annot_final), length(gene_list)))
            filter_df <- rbind(filter_df, sub_df)
        }
    }
}

Expected 3 pieces. Additional pieces discarded in 13 rows [6556, 6557, 6561,
8110, 8112, 8120, 9478, 12706, 12714, 15013, 15015, 15027, 15031].
Expected 3 pieces. Missing pieces filled with `NA` in 20633 rows [1, 2, 3, 4,
5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
Expected 3 pieces. Additional pieces discarded in 13 rows [6275, 6276, 6280,
7778, 7779, 7787, 9095, 12212, 12220, 14429, 14431, 14443, 14447].
Expected 3 pieces. Missing pieces filled with `NA` in 19838 rows [1, 2, 3, 4,
5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
Expected 3 pieces. Additional pieces discarded in 9 rows [5220, 5221, 5223,
6503, 6504, 10249, 10254, 12131, 12133].
Expected 3 pieces. Missing pieces filled with `NA` in 16730 rows [1, 2, 3, 4,
5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
Expected 3 pieces. Additional pieces discarded in 4 rows [3255, 6248, 6250,
6257].

In [21]:
filter_df

| data | chromosome | maf | rsq | total_num_var | maf_filtering_var | function_filtering_var | gene_filtering_var |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| hrc | 1 | 0.01 | 0.3 | 2658041 | 2646976 | 20656 (1904) | 20551 (1799) |
| hrc | 1 | 0.005 | 0.3 | 2658041 | 2644855 | 19861 (1902) | 19743 (1784) |
| hrc | 1 | 0.001 | 0.3 | 2658041 | 2637260 | 16746 (1883) | 16591 (1728) |
| hrc | 1 | 0.01 | 0.8 | 1322393 | 1311707 | 8709 (1694) | 8396 (1381) |
| hrc | 1 | 0.005 | 0.8 | 1322393 | 1309907 | 7996 (1658) | 7666 (1328) |
| hrc | 1 | 0.001 | 0.8 | 1322393 | 1304988 | 5655 (1510) | 5243 (1098) |
| hrc | 2 | 0.01 | 0.3 | 2950782 | 2942882 | 14733 (1173) | 14683 (1123) |
| hrc | 2 | 0.005 | 0.3 | 2950782 | 2941416 | 14184 (1172) | 14133 (1121) |
| hrc | 2 | 0.001 | 0.3 | 2950782 | 2935812 | 11928 (1163) | 11856 (1091) |
| hrc | 2 | 0.01 | 0.8 | 1489256 | 1481568 | 6402 (1054) | 6238 (890) |
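
The `function_filtering_var` and `gene_filtering_var` columns come from two rules in the filtering cell: each variant is labelled missense, splicing, or LoF (later rules override earlier ones), and then only genes with more than one remaining variant are kept. A minimal Python sketch of both rules is given below. The names `classify` and `genes_with_multiple_variants` are hypothetical, and the category strings are copied from the R code above.

```python
from collections import Counter

# ExonicFunc.refGene values treated as loss of function in the R code.
LOF = {"stopgain", "stoploss", "startloss", "frameshift substitution"}


def classify(func, exonic_func):
    """Label a variant; splicing overrides missense, LoF overrides both."""
    label = "missense" if exonic_func == "nonsynonymous SNV" else ""
    if "splicing" in func:
        label = "splicing"
    if exonic_func in LOF:
        label = "LoF"
    return label


def genes_with_multiple_variants(genes):
    """Return the set of genes that appear more than once."""
    counts = Counter(genes)
    return {gene for gene, n in counts.items() if n > 1}


if __name__ == "__main__":
    print(classify("exonic;splicing", "stopgain"))
    print(genes_with_multiple_variants(["A", "A", "B"]))
```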


### 4.4 Subsetting for CADD score

CADD scores were retrieved manually from the CADD [website](https://cadd.gs.washington.edu/) by uploading VCF files.
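
Once the PHRED scores are joined to the annotation, the R code later in this section keeps every LoF variant regardless of score and keeps other variants only when PHRED is at least 20. The sketch below restates that rule. The function name `passes_cadd_filter` is hypothetical, and a missing score is treated as a failure, which matches how `filter()` drops `NA` comparisons.

```python
def passes_cadd_filter(function, phred, threshold=20.0):
    """Keep LoF variants always; keep others only if PHRED >= threshold."""
    if function == "LoF":
        return True
    if phred is None:
        return False  # no score available, so the comparison cannot pass
    return float(phred) >= threshold


if __name__ == "__main__":
    demo = [("LoF", None), ("missense", 25.3), ("missense", 19.9)]
    for function, phred in demo:
        print(function, phred, passes_cadd_filter(function, phred))
```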

In [23]:
for i in list((1,2,11)):
        script='''#!/bin/sh
#$ -l h_rt=48:00:00
#$ -l h_vmem=64G
#$ -N extract_filtered_maf001_chr%i
#$ -o /home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc/scripts/make_5col_vcf_maf001_chr%i_$JOB_ID.out
#$ -e /home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc/scripts/make_5col_vcf_maf001_chr%i_$JOB_ID.err
#$ -q csg.q
#$ -S /bin/bash

export PATH=$HOME/miniconda3/bin:$PATH
module load HTSLIB/1.17
module load Plink/2.00a
cd ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc

plink2 \
    --bpfile hrc_chr%i_merged_168206ids_rsq03_dose \
    --extract hrc_chr%i_rsq03_hg19_hg38_maf001_LOF_missense_all_snplist \
    --make-bpgen --sort-vars \
    --export vcf-4.2 vcf-dosage=DS bgz \
    --out hrc_chr%i_rsq03_maf001_LOF_missense_all_extracted

zcat hrc_chr%i_rsq03_maf001_LOF_missense_all_extracted.vcf.gz | cut -f-5 > hrc_chr%i_rsq03_maf001_LOF_missense_all_extracted_5col.vcf
bgzip hrc_chr%i_rsq03_maf001_LOF_missense_all_extracted_5col.vcf

'''%(i,i,i,i,i,i,i,i,i)
        f=open("/home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc/scripts/make_5col_vcf_maf001_chr"+str(i)+".sh", 'w')
        f.write(script)
        f.close()
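
The `zcat ... | cut -f-5` step keeps the first five tab-separated columns (CHROM, POS, ID, REF, ALT) and passes `##` meta lines through unchanged because they contain no tabs. A small Python equivalent, useful for checking the expected output on a few lines, is sketched below. The function name `first_five_columns` is hypothetical.

```python
def first_five_columns(lines):
    """Yield each line truncated to its first five tab-separated fields."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        yield "\t".join(fields[:5])


if __name__ == "__main__":
    demo = [
        "##fileformat=VCFv4.2\n",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
        "1\t100\tchr1:100:A:G\tA\tG\t.\t.\t.\n",
    ]
    for row in first_five_columns(demo):
        print(row)
```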

In [24]:
cd ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc/scripts

qsub make_5col_vcf_maf001_chr11.sh
qsub make_5col_vcf_maf001_chr1.sh
qsub make_5col_vcf_maf001_chr2.sh

sos run /home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/liftover.ipynb \
    --cwd /home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc_annot_168206ids \
    --input_file ./hrc_chr1_rsq03_maf001_LOF_missense_all_extracted.vcf.gz \
    --output_file ./hrc_chr1_rsq03_maf001_LOF_missense_all_extracted_hg38.vcf.gz

sos run /home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/liftover.ipynb \
    --cwd /home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc_annot_168206ids \
    --input_file ./hrc_chr2_rsq03_maf001_LOF_missense_all_extracted.vcf.gz \
    --output_file ./hrc_chr2_rsq03_maf001_LOF_missense_all_extracted_hg38.vcf.gz

sos run /home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/liftover.ipynb \
    --cwd /home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc_annot_168206ids \
    --input_file ./hrc_chr11_rsq03_maf001_LOF_missense_all_extracted.vcf.gz \
    --output_file ./hrc_chr11_rsq03_maf001_LOF_missense_all_extracted_hg38.vcf.gz

Your job 8646185 ("extract_filtered_maf001_chr11") has been submitted
Your job 8646186 ("extract_filtered_maf001_chr1") has been submitted
Your job 8646187 ("extract_filtered_maf001_chr2") has been submitted


In [22]:
library(data.table)
library(dplyr)

setwd("~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc")

for(chr in c(1, 2, 11)){
    for(rsq in c(3, 8)){
        for(maf in c(0.01, 0.005, 0.001)){
            maf_c <- gsub("\\.", "", as.character(maf))
            annot <- fread(sprintf("hrc_chr%i_rsq0%s_hg19_hg38_maf001_LOF_missense_all_annot.csv.gz", chr, rsq))

            cadd_all <- fread(sprintf("GRCh37-v1.6_chr%i.tsv", chr), header = TRUE) %>% arrange(Pos) 
            colnames(cadd_all) <- c("Chr", "Start", "Ref", "Alt", "RawScore", "PHRED")
            cadd_all <- cadd_all %>% mutate(ID_hg19 = paste(Chr, Start, Ref, Alt, sep = ":")) %>% mutate(ID_hg19 = paste0("chr", ID_hg19))

            annot_all <- left_join(annot, cadd_all %>% select(ID_hg19, RawScore, PHRED)) %>% filter(is.na(MAF_nfe_exome) | MAF_nfe_exome < maf)
            annot_all_lof <- annot_all %>% filter(Function == "LoF")
            annot_all_cadd <- annot_all %>% filter(Function != "LoF") %>% filter(as.numeric(PHRED) >= 20)

            gene_list <- annot_all %>% pull(Gene.refGene) %>% table() %>% as.data.frame() %>% filter(Freq > 1) %>% pull(1)
            annot_final <- annot_all %>% filter(Gene.refGene %in% gene_list)
            annot_final_lof <- annot_final %>% filter(Function == "LoF")
            annot_final_cadd <- annot_final %>% filter(Function != "LoF") %>% filter(as.numeric(PHRED) >= 20)

            # >= 1 variant
            fwrite(annot_all, 
                   sprintf("hrc_chr%d_rsq0%d_hg19_hg38_maf%s_LOF_missense_all_annot.csv.gz", chr, rsq, maf_c), 
                   quote = FALSE)

            write.table(annot_all$ID_hg19, 
                        sprintf("hrc_chr%d_rsq0%d_hg19_hg38_maf%s_LOF_missense_all_snplist", chr, rsq, maf_c), 
                        col.names = FALSE, row.name = FALSE, quote = FALSE)

            ## >= 1 variant + CADD filtering
            fwrite(rbind(annot_all_lof, annot_all_cadd), 
                   sprintf("hrc_chr%d_rsq0%d_hg19_hg38_maf%s_LOF_missense_all_cadd_annot.csv.gz", chr, rsq, maf_c),  
                   quote = FALSE)

            write.table(rbind(annot_all_lof, annot_all_cadd) %>% pull(ID_hg19), 
                        sprintf("hrc_chr%d_rsq0%d_hg19_hg38_maf%s_LOF_missense_all_cadd_snplist", chr, rsq, maf_c), 
                        col.names = FALSE, row.name = FALSE, quote = FALSE)

            ## >= 2 variant
            fwrite(annot_final, 
                   sprintf("hrc_chr%d_rsq0%d_hg19_hg38_maf%s_LOF_missense_annot.csv.gz", chr, rsq, maf_c), 
                   quote = FALSE)

            write.table(annot_final$ID_hg19, 
                        sprintf("hrc_chr%d_rsq0%d_hg19_hg38_maf%s_LOF_missense_snplist", chr, rsq, maf_c), 
                        col.names = FALSE, row.name = FALSE, quote = FALSE)

            ## >= 2 variant + CADD filtering
            fwrite(rbind(annot_final_lof, annot_final_cadd), 
                   sprintf("hrc_chr%d_rsq0%d_hg19_hg38_maf%s_LOF_missense_cadd_annot.csv.gz", chr, rsq, maf_c), 
                   quote = FALSE)

            write.table(rbind(annot_final_lof, annot_final_cadd) %>% pull(ID_hg19), 
                        sprintf("hrc_chr%d_rsq0%d_hg19_hg38_maf%s_LOF_missense_cadd_snplist", chr, rsq, maf_c),
                        col.names = FALSE, row.name = FALSE, quote = FALSE)
        }
    }   
}


Joining with `by = join_by(ID_hg19, RawScore, PHRED)`
Joining with `by = join_by(ID_hg19, RawScore, PHRED)`
Joining with `by = join_by(ID_hg19, RawScore, PHRED)`
Joining with `by = join_by(ID_hg19, RawScore, PHRED)`
Joining with `by = join_by(ID_hg19, RawScore, PHRED)`
Joining with `by = join_by(ID_hg19, RawScore, PHRED)`
Joining with `by = join_by(ID_hg19, RawScore, PHRED)`
Joining with `by = join_by(ID_hg19, RawScore, PHRED)`
Joining with `by = join_by(ID_hg19, RawScore, PHRED)`
Joining with `by = join_by(ID_hg19, RawScore, PHRED)`
Joining with `by = join_by(ID_hg19, RawScore, PHRED)`
Joining with `by = join_by(ID_hg19, RawScore, PHRED)`
Joining with `by = join_by(ID_hg19, RawScore, PHRED)`
Joining with `by = join_by(ID_hg19, RawScore, PHRED)`
Joining with `by = join_by(ID_hg19, RawScore, PHRED)`
Joining with `by = join_by(ID_hg19, RawScore, 

In [23]:
## extracting
for i in list((1,2,11)):
    for j in list((3,8)):
        for k in list((("1", "05", "01"))):
            for c in list(("", "_cadd")):
                script='''#!/bin/sh
#$ -l h_rt=48:00:00
#$ -l h_vmem=64G
#$ -N extract_filtered_chr%i_rsq0%i_maf00%s%s
#$ -o /home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc/scripts/extract_filtered_chr%i_rsq0%i_maf00%s%s_$JOB_ID.out
#$ -e /home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc/scripts/extract_filtered_chr%i_rsq0%i_maf00%s%s_$JOB_ID.err
#$ -q csg.q
#$ -S /bin/bash
export PATH=$HOME/miniconda3/bin:$PATH
module load Plink/2.00a

cd ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc
plink2 \
    --bpfile hrc_chr%i_merged_168206ids_rsq0%i_dose \
    --extract hrc_chr%i_rsq0%i_hg19_hg38_maf00%s_LOF_missense%s_snplist \
    --make-bpgen --sort-vars \
    --out hrc_chr%i_rsq0%i_maf00%s_LOF_missense%s_extracted

'''%(i,j,k,c,i,j,k,c,i,j,k,c,i,j,i,j,k,c,i,j,k,c)
                f=open("/home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc/scripts/extract_filtered_chr"+str(i)+"_rsq0"+str(j)+"_maf00"+k+c+".sh", 'w')
                f.write(script)
                f.close()

## 5. TOPMed Data Processing

### 5.1 Recode VCF

In [24]:
## writing out script
for i in list((1,2,11)):
    for j in list((3,8)):
        script='''#!/bin/bash
#SBATCH --mem=50G
#SBATCH --time=100:00:00
#SBATCH --job-name=recode_vcf_chr%i_rsq0%i
#SBATCH --output=/home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed_v3/scripts/recode_vcf_chr%i_rsq0%i_%%j.out
#SBATCH --error=/home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed_v3/scripts/recode_vcf_chr%i_rsq0%i_%%j.err
#SBATCH -p CSG
#SBATCH --mail-type=FAIL
#SBATCH --mail-user tl3031@cumc.columbia.edu

module load Plink/2.00a
cd ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed_v3/

plink2 \
    --vcf topmed_chr%i_merged_168206ids_dose.vcf.gz dosage=DS \
    --make-bpgen --sort-vars --threads 50 \
    --set-all-var-ids chr@:#:\$r:\$a \
    --new-id-max-allele-len 300 \
    --extract topmed_168206ids_chr%i_rsq0%i_snplist.txt \
    --out topmed_chr%i_merged_168206ids_rsq0%i_dose

'''%(i,j,i,j,i,j,i,i,j,i,j)
        f=open("/home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed_v3/scripts/recode_vcf_topmed_chr"+str(i)+"_rsq0"+str(j)+".sh", 'w')
        f.write(script)
        f.close()

### 5.2 Annotate TOPMed

In [25]:
## annotate topmed
for i in list((1,2,11)):
        script='''#!/bin/sh
#$ -l h_rt=24:00:00
#$ -l h_vmem=80G
#$ -N annotate_topmed_v3_all_chr%i
#$ -o /mnt/vast/hpc/csg/tl3031/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed_v3/scripts/annotate_all_chr%i_$JOB_ID.out
#$ -e /mnt/vast/hpc/csg/tl3031/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed_v3/scripts/annotate_all_chr%i_$JOB_ID.err
#$ -j y
#$ -q csg.q

source ~/mamba_activate.sh
module load Singularity

sos run /mnt/vast/hpc/csg/tl3031/imputation-rvtest/analysis/imputation_aggregated_analysis/notebooks/annovar.ipynb annovar \
    --build 'hg38' \
    --cwd /mnt/vast/hpc/csg/tl3031/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed_v3 \
    --bim_name /mnt/vast/hpc/csg/tl3031/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed_v3/topmed_batch1/chr%i.dose.nomismatch.bim \
    --humandb /mnt/vast/hpc/csg/isabelle/REF/humandb  \
    --job_size 1 \
    --name_prefix topmed_chr%i \
    --container_annovar /mnt/mfs/statgen/containers/gatk4-annovar.sif

'''%(i,i,i,i,i)
        f=open("/mnt/vast/hpc/csg/tl3031/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed_v3/scripts/annotate_all_chr"+str(i)+".sh", 'w')
        f.write(script)
        f.close()

In [26]:
cd ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed_v3/scripts
for i in annotate*sh; do qsub ${i}; done

Your job 8658540 ("annotate_topmed_v3_all_chr11") has been submitted
Your job 8658541 ("annotate_topmed_v3_all_chr1") has been submitted
Your job 8658542 ("annotate_topmed_v3_all_chr2") has been submitted
Your job 8658543 ("annotate_topmed_v3_mismatch_chr11") has been submitted
Your job 8658544 ("annotate_topmed_v3_mismatch_chr1") has been submitted
Your job 8658545 ("annotate_topmed_v3_mismatch_chr2") has been submitted


In [1]:
## rename columns
library(dplyr)
library(data.table)

setwd("~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed_v3")

for(chr in c(1,2,11)){
    annot <- fread(sprintf("chr%i.dose.nomismatch.hg38.hg38_multianno.csv", chr))
    
    colnames(annot)[29:41] <- c("AF_genome",
                                "AF_raw_genome",
                                "AF_male_genome",
                                "AF_female_genome",
                                "AF_afr_genome",
                                "AF_ami_genome",
                                "AF_amr_genome",
                                "AF_asj_genome",
                                "AF_eas_genome",
                                "AF_fin_genome",
                                "AF_nfe_genome",
                                "AF_oth_genome",
                                "AF_sas_genome")
    colnames(annot)[42:54] <- c("AF_exome",
                                "AF_popmax_exome",
                                "AF_male_exome",
                                "AF_female_exome",
                                "AF_raw_exome",
                                "AF_afr_exome",
                                "AF_sas_exome",
                                "AF_amr_exome",
                                "AF_eas_exome",
                                "AF_nfe_exome",
                                "AF_fin_exome",
                                "AF_asj_exome",
                                "AF_oth_exome")

    for (rsq in c(3,8)){
        snplist <- fread(sprintf("topmed_168206ids_chr%i_rsq0%i_snplist.txt", chr, rsq), header=FALSE) %>% pull(V1)
        
        annot_subset <- annot %>% 
            mutate(AF_nfe_exome = as.numeric(AF_nfe_exome)) %>% 
            mutate(MAF_nfe_exome = ifelse(AF_nfe_exome > 0.5, 1 - AF_nfe_exome, AF_nfe_exome)) %>% 
            rename("ID_hg38" = "Otherinfo1") %>%
            mutate(ID = paste(Chr, Start, Ref, Alt, sep = ":")) %>%
            mutate(ID = paste0("chr", ID)) %>%
            filter(ID_hg38 %in% snplist) %>%
            select(Chr, Start, End, Ref, Alt, 
                   Func.refGene, Gene.refGene, ExonicFunc.refGene, 
                   MAF_nfe_exome, REVEL_score,
                   ID_hg38, ID, CADD_phred)
        print(sprintf("chromosome %i, rsq %i, snplist length %i, nrow annot %i", chr, rsq, length(snplist), nrow(annot_subset)))
        fwrite(annot_subset, sprintf("topmed_chr%d_rsq0%d_hg38_hg38_sel_col_annot.csv.gz", chr, rsq))
    }
}
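
The `MAF_nfe_exome` column built above folds the gnomAD non-Finnish European exome allele frequency into a minor allele frequency. A minimal Python sketch of that folding, for illustration only: `fold_to_maf` is a hypothetical helper that is not part of the notebook, and missing values are assumed to arrive as `None` or NaN.

```python
import math


def fold_to_maf(af):
    """Fold an allele frequency into a minor allele frequency.

    Missing input (None or NaN) is returned as None, which plays the role
    of NA in the R cell.
    """
    if af is None or math.isnan(af):
        return None
    # Frequencies above 0.5 describe the major allele, so take the complement.
    return 1.0 - af if af > 0.5 else af


for af in (0.25, 0.75, None):
    print(af, "->", fold_to_maf(af))
```

Variants with a missing frequency are kept by the later `is.na(MAF_nfe_exome) | MAF_nfe_exome < maf` filter, so a missing value is effectively treated as rare.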

### 5.3 Filter TOPMed

In [2]:
library(dplyr)
library(data.table)
library(stringr)

setwd("/mnt/vast/hpc/csg/tl3031/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed_v3")



In [3]:
# Start from an empty data frame; one summary row is appended per chr/rsq/maf combination
filter_df <- data.frame()

for(chr in c(1, 2, 11)){
    for(rsq in c(3, 8)){
        for(maf in c(0.01, 0.005, 0.001)){
            maf_c <- gsub("\\.", "", as.character(maf))
            annot <- fread(sprintf("topmed_chr%d_rsq0%d_hg38_hg38_sel_col_annot.csv.gz", chr, rsq)) %>% select(-CADD_phred)
            
            annot <-  annot %>% filter(Chr == chr)
            annot_maf <- annot %>% 
                filter(is.na(MAF_nfe_exome) | MAF_nfe_exome < maf)

            annot_func <- annot_maf %>% 
                filter(Func.refGene %in% c("exonic", "splicing", "exonic;splicing")) %>%
                filter(ExonicFunc.refGene != 'unknown') %>% 
                filter(ExonicFunc.refGene != 'synonymous SNV' & ExonicFunc.refGene != 'nonframeshift substitution') %>%
                mutate(Function = ifelse(ExonicFunc.refGene == "nonsynonymous SNV", "missense", "")) %>%
                mutate(Function = ifelse(grepl("splicing", Func.refGene), "splicing", Function)) %>%
                mutate(Function = ifelse(ExonicFunc.refGene %in% c("stopgain", "stoploss", "startloss", "frameshift substitution"), "LoF", Function))
        
            annot_func <- annot_func %>% 
                mutate(cat = if_else(grepl(";", Gene.refGene) & Function == "splicing", 2, 1)) %>%
                tidyr::separate(Gene.refGene, c("Gene.refGene", "discard_1", "discard_2"), sep = ";") %>% 
                mutate(Gene.refGene = if_else(cat == 1, Gene.refGene, discard_1)) %>%
                select(-discard_1, -discard_2)
            
            gene_list <- annot_func %>% pull(Gene.refGene) %>% table() %>% as.data.frame() %>% filter(Freq > 1) %>% pull(1)
            annot_final <- annot_func %>% filter(Gene.refGene %in% gene_list)
            
            data.table::fwrite(annot_func, sprintf("topmed_chr%d_rsq0%d_hg38_hg38_maf%s_LOF_missense_all_annot.csv.gz", chr, rsq, maf_c))
            data.table::fwrite(annot_func %>% select(ID_hg38), sprintf("topmed_chr%d_rsq0%d_hg38_hg38_maf%s_LOF_missense_all_snplist", chr, rsq, maf_c), 
                               sep = " ", col.names = FALSE)
            
            data.table::fwrite(annot_final, sprintf("topmed_chr%d_rsq0%d_hg38_hg38_maf%s_LOF_missense_annot.csv.gz", chr, rsq, maf_c))
            data.table::fwrite(annot_final %>% select(ID_hg38), sprintf("topmed_chr%d_rsq0%d_hg38_hg38_maf%s_LOF_missense_snplist", chr, rsq, maf_c), 
                               sep = " ", col.names = FALSE)

            sub_df <- data.frame(data = "topmed", chromosome = chr, maf = maf, rsq = rsq/10,
                                 total_num_var = nrow(annot), maf_filtering_var = nrow(annot_maf),
                                 function_filtering_var = sprintf("%d (%d)", nrow(annot_func), length(unique(annot_func$Gene.refGene))),
                                 gene_filtering_var = sprintf("%d (%d)", nrow(annot_final), length(gene_list)))
            filter_df <- rbind(filter_df, sub_df)
        }
    }
}

“Expected 3 pieces. Additional pieces discarded in 85 rows [11932, 11933, 11936,
11941, 15485, 22040, 22043, 47720, 47721, 47731, 47732, 47745, 47746, 47752,
47756, 58581, 58583, 58592, 58595, 58598, ...].”
“Expected 3 pieces. Missing pieces filled with `NA` in 141824 rows [1, 2, 3, 4,
5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].”
“Expected 3 pieces. Additional pieces discarded in 85 rows [11859, 11860, 11863,
11868, 15387, 21912, 21915, 47435, 47436, 47446, 47447, 47460, 47461, 47467,
47471, 58236, 58238, 58247, 58249, 58252, ...].”
“Expected 3 pieces. Missing pieces filled with `NA` in 140998 rows [1, 2, 3, 4,
5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].”
“Expected 3 pieces. Additional pieces discarded in 82 rows [11557, 11558, 11561,
11566, 15002, 21390, 21393, 46355, 46356, 46364, 46365, 46377, 46378, 46384,
46388, 56923, 56925, 56934, 56936, 56939, ...].”
“Expected 3 pieces. Missing piece

In [4]:
filter_df

| data | chromosome | maf | rsq | total_num_var | maf_filtering_var | function_filtering_var | gene_filtering_var |
|---|---|---|---|---|---|---|---|
| `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<int>` | `<int>` | `<chr>` | `<chr>` |
| topmed | 1 | 0.01 | 0.3 | 12818589 | 12807304 | 141990 (1995) | 141967 (1972) |
| topmed | 1 | 0.005 | 0.3 | 12818589 | 12805019 | 141164 (1995) | 141141 (1972) |
| topmed | 1 | 0.001 | 0.3 | 12818589 | 12796823 | 137951 (1995) | 137927 (1971) |
| topmed | 1 | 0.01 | 0.8 | 4680641 | 4669659 | 45698 (1935) | 45669 (1906) |
| topmed | 1 | 0.005 | 0.8 | 4680641 | 4667541 | 44917 (1934) | 44886 (1903) |
| topmed | 1 | 0.001 | 0.8 | 4680641 | 4660905 | 42166 (1933) | 42131 (1898) |
| topmed | 2 | 0.01 | 0.3 | 13814401 | 13806217 | 100205 (1222) | 100197 (1214) |
| topmed | 2 | 0.005 | 0.3 | 13814401 | 13804635 | 99644 (1222) | 99636 (1214) |
| topmed | 2 | 0.001 | 0.3 | 13814401 | 13798576 | 97313 (1222) | 97305 (1214) |
| topmed | 2 | 0.01 | 0.8 | 5126997 | 5118996 | 33877 (1208) | 33865 (1196) |
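
The functional filter in the R cell above assigns each retained variant a `Function` label and then keeps only genes that carry more than one variant. A minimal Python sketch of the same decision order, for illustration only: `classify` and `genes_with_multiple_variants` are hypothetical names, and the inputs are assumed to be the ANNOVAR `Func.refGene` and `ExonicFunc.refGene` strings.

```python
from collections import Counter

KEPT_REGIONS = {"exonic", "splicing", "exonic;splicing"}
DROPPED_EXONIC = {"unknown", "synonymous SNV", "nonframeshift substitution"}
LOF_CLASSES = {"stopgain", "stoploss", "startloss", "frameshift substitution"}


def classify(func_refgene, exonic_func):
    """Return the Function label, or None when the variant is filtered out."""
    if func_refgene not in KEPT_REGIONS or exonic_func in DROPPED_EXONIC:
        return None
    # Same override order as the R cell: missense, then splicing, then LoF.
    label = "missense" if exonic_func == "nonsynonymous SNV" else ""
    if "splicing" in func_refgene:
        label = "splicing"
    if exonic_func in LOF_CLASSES:
        label = "LoF"
    return label


def genes_with_multiple_variants(genes):
    """Return the set of genes that appear more than once."""
    counts = Counter(genes)
    return {gene for gene, n in counts.items() if n > 1}


print(classify("exonic", "nonsynonymous SNV"))
print(classify("exonic;splicing", "stopgain"))
print(genes_with_multiple_variants(["APOC3", "APOC3", "PCSK9"]))
```

As in the R code, a retained variant that matches none of the three classes gets an empty label.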


### 5.4 Subsetting for CADD score

In [5]:
## make 5 column vcf file to upload for CADD website
setwd("~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed_v3")
library(dplyr)
library(data.table)

for(i in c(1,2,11)){
    annot <- fread(sprintf("topmed_chr%i_rsq03_hg38_hg38_maf001_LOF_missense_all_annot.csv.gz", i))

    annot_vcf <- annot %>% select(Chr, Start, ID_hg38, Ref, Alt)
    colnames(annot_vcf) <- c("#CHROM", "POS", "ID", "REF", "ALT")
    # annot_vcf %>% fwrite(sprintf("topmed_chr%i_rsq03_maf001_LOF_missense_all_5col.vcf", i), sep = "\t")
}

In [6]:
# cd ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed_v3
# module load HTSLIB/1.17

# bgzip topmed_chr1_rsq03_maf001_LOF_missense_all_5col.vcf
# bgzip topmed_chr2_rsq03_maf001_LOF_missense_all_5col.vcf
# bgzip topmed_chr11_rsq03_maf001_LOF_missense_all_5col.vcf

In [7]:
library(data.table)
library(dplyr)

setwd("~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed_v3")

for(chr in c(1, 2, 11)){
    for(rsq in c(3, 8)){
        for(maf in c(0.01, 0.005, 0.001)){
            maf_c <- gsub("\\.", "", as.character(maf))
            annot <- fread(sprintf("topmed_chr%i_rsq0%s_hg38_hg38_maf001_LOF_missense_all_annot.csv.gz", chr, rsq))

            cadd_all <- fread(sprintf("GRCh38-v1.6_chr%i.tsv.gz", chr), header = TRUE) %>% arrange(Pos) 
            colnames(cadd_all) <- c("Chr", "Start", "Ref", "Alt", "RawScore", "PHRED")
            cadd_all <- cadd_all %>% mutate(ID_hg38 = paste(Chr, Start, Ref, Alt, sep = ":")) %>% mutate(ID_hg38 = paste0("chr", ID_hg38))

            annot_all <- left_join(annot, cadd_all %>% select(ID_hg38, RawScore, PHRED)) %>% filter(is.na(MAF_nfe_exome) | MAF_nfe_exome < maf)
            annot_all_lof <- annot_all %>% filter(Function == "LoF")
            annot_all_cadd <- annot_all %>% filter(Function != "LoF") %>% filter(as.numeric(PHRED) >= 20)

            gene_list <- annot_all %>% pull(Gene.refGene) %>% table() %>% as.data.frame() %>% filter(Freq > 1) %>% pull(1)
            annot_final <- annot_all %>% filter(Gene.refGene %in% gene_list)
            annot_final_lof <- annot_final %>% filter(Function == "LoF")
            annot_final_cadd <- annot_final %>% filter(Function != "LoF") %>% filter(as.numeric(PHRED) >= 20)

            # >= 1 variant
            fwrite(annot_all, 
                   sprintf("topmed_chr%d_rsq0%d_hg38_hg38_maf%s_LOF_missense_all_annot.csv.gz", chr, rsq, maf_c), 
                   quote = FALSE)

            write.table(annot_all$ID_hg38, 
                        sprintf("topmed_chr%d_rsq0%d_hg38_hg38_maf%s_LOF_missense_all_snplist", chr, rsq, maf_c), 
                        col.names = FALSE, row.names = FALSE, quote = FALSE)

            ## >= 1 variant + CADD filtering
            fwrite(rbind(annot_all_lof, annot_all_cadd), 
                   sprintf("topmed_chr%d_rsq0%d_hg38_hg38_maf%s_LOF_missense_all_cadd_annot.csv.gz", chr, rsq, maf_c),  
                   quote = FALSE)

            write.table(rbind(annot_all_lof, annot_all_cadd) %>% pull(ID_hg38), 
                        sprintf("topmed_chr%d_rsq0%d_hg38_hg38_maf%s_LOF_missense_all_cadd_snplist", chr, rsq, maf_c), 
                        col.names = FALSE, row.names = FALSE, quote = FALSE)

            ## >= 2 variant
            fwrite(annot_final, 
                   sprintf("topmed_chr%d_rsq0%d_hg38_hg38_maf%s_LOF_missense_annot.csv.gz", chr, rsq, maf_c), 
                   quote = FALSE)

            write.table(annot_final$ID_hg38, 
                        sprintf("topmed_chr%d_rsq0%d_hg38_hg38_maf%s_LOF_missense_snplist", chr, rsq, maf_c), 
                        col.names = FALSE, row.names = FALSE, quote = FALSE)

            ## >= 2 variant + CADD filtering
            fwrite(rbind(annot_final_lof, annot_final_cadd), 
                   sprintf("topmed_chr%d_rsq0%d_hg38_hg38_maf%s_LOF_missense_cadd_annot.csv.gz", chr, rsq, maf_c), 
                   quote = FALSE)

            # snplist must match the >= 2 variant + CADD annotation written just above
            write.table(rbind(annot_final_lof, annot_final_cadd) %>% pull(ID_hg38), 
                        sprintf("topmed_chr%d_rsq0%d_hg38_hg38_maf%s_LOF_missense_cadd_snplist", chr, rsq, maf_c),
                        col.names = FALSE, row.names = FALSE, quote = FALSE)
        }
    }   
}


Joining with `by = join_by(ID_hg38, RawScore, PHRED)`
Joining with `by = join_by(ID_hg38, RawScore, PHRED)`
Joining with `by = join_by(ID_hg38, RawScore, PHRED)`
Joining with `by = join_by(ID_hg38, RawScore, PHRED)`
Joining with `by = join_by(ID_hg38, RawScore, PHRED)`
Joining with `by = join_by(ID_hg38, RawScore, PHRED)`
Joining with `by = join_by(ID_hg38, RawScore, PHRED)`
Joining with `by = join_by(ID_hg38, RawScore, PHRED)`
Joining with `by = join_by(ID_hg38, RawScore, PHRED)`
Joining with `by = join_by(ID_hg38, RawScore, PHRED)`
Joining with `by = join_by(ID_hg38, RawScore, PHRED)`
Joining with `by = join_by(ID_hg38, RawScore, PHRED)`
Joining with `by = join_by(ID_hg38, RawScore, PHRED)`
Joining with `by = join_by(ID_hg38, RawScore, PHRED)`
Joining with `by = join_by(ID_hg38, RawScore, PHRED)`
Joining with `by = join_by(ID_hg38, RawScore, 

In [61]:
## extracting
for i in list((1,2,11)):
    for j in list((3,8)):
        for k in list((("1", "05", "01"))):
            for c in list(("", "_cadd")):
                script='''#!/bin/bash
#SBATCH --mem=50G
#SBATCH --time=100:00:00
#SBATCH --job-name=extract_filtered_chr%i_rsq0%i_maf00%s%s
#SBATCH --output=/home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed_v3/scripts/extract_filtered_chr%i_rsq0%i_maf00%s%s_%%j.out
#SBATCH --error=/home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed_v3/scripts/extract_filtered_chr%i_rsq0%i_maf00%s%s_%%j.err
#SBATCH -p CSG
#SBATCH --mail-type=FAIL
#SBATCH --mail-user tl3031@cumc.columbia.edu

export PATH=$HOME/miniconda3/bin:$PATH
module load Plink/2.00a

cd ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed_v3
plink2 \
    --bpfile topmed_chr%i_merged_168206ids_rsq0%i_dose \
    --extract topmed_chr%i_rsq0%i_hg38_hg38_maf00%s_LOF_missense%s_snplist \
    --make-bpgen --sort-vars \
    --out topmed_chr%i_rsq0%i_maf00%s_LOF_missense%s_extracted

'''%(i,j,k,c,i,j,k,c,i,j,k,c,i,j,i,j,k,c,i,j,k,c)
                f=open("/home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed_v3/scripts/extract_filtered_chr"+str(i)+"_rsq0"+str(j)+"_maf00"+k+c+".sh", 'w')
                f.write(script)
                f.close()
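
The file names consumed by these scripts depend on the R cells' `maf_c` token (the MAF threshold with its decimal point removed) matching the `"maf00" + k` pieces used in the Python loop above. A small sketch to check that correspondence; `maf_token` is a hypothetical helper and not part of the notebook.

```python
def maf_token(maf):
    """Return the MAF threshold with its decimal point removed."""
    # Assumes str() gives plain decimal notation, which holds for the three
    # thresholds used in this notebook (it would not for values like 1e-05).
    return str(maf).replace(".", "")


r_style = ["maf" + maf_token(m) for m in (0.01, 0.005, 0.001)]
py_style = ["maf00" + k for k in ("1", "05", "01")]

print(r_style)
print(py_style)
print(r_style == py_style)
```

The two lists should be identical, which is what makes the `--extract` snplist names resolve.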

## 6. Make Merged Dataset

We create two merged datasets, HRC_TOPMed and ES_HRC_TOPMed. For the HRC_TOPMed dataset, we compare the R-square of each variant between the two imputation panels and keep the version with the higher R-square. For ES_HRC_TOPMed, we prioritize exome-sequenced variants, then fall back to whichever imputed version has the higher R-square.
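
The selection rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the notebook's implementation: `choose_source` is a hypothetical helper, a missing R-square is treated as 0, and ties go to HRC, which is how the R cells below resolve them. The R cells mark exome rows with a sentinel `R2_exome = 999`; here exome membership is a boolean.

```python
def choose_source(r2_hrc=None, r2_topmed=None, in_exome=False):
    """Return (source, r2) for one variant.

    Exome-sequenced variants take priority; otherwise the imputation panel
    with the strictly higher R-square wins, and ties go to HRC.
    """
    if in_exome:
        return ("exome", None)
    hrc = 0.0 if r2_hrc is None else r2_hrc
    topmed = 0.0 if r2_topmed is None else r2_topmed
    if topmed > hrc:
        return ("topmed", topmed)
    return ("hrc", hrc)


print(choose_source(r2_hrc=0.9, r2_topmed=0.8))
print(choose_source(r2_hrc=None, r2_topmed=0.4))
print(choose_source(r2_hrc=0.9, r2_topmed=0.95, in_exome=True))
```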

### 6.1 HRC + TOPMed

#### 6.11 Obtain R2

Since we need to compare the R2 of each variant, we need to query it from the original HRC imputed VCF file. The R2 has already been retrieved for TOPMed.
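
A minimal sketch of what querying R2 from a VCF line amounts to, assuming the imputed VCF stores it as an `R2=` entry in the INFO column (column 8). `parse_info_r2` is a hypothetical helper and the example line is made up for illustration.

```python
def parse_info_r2(vcf_line):
    """Return (variant_id, r2) from one tab-separated VCF data line.

    r2 is None when the INFO column has no R2 entry.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    variant_id, info = fields[2], fields[7]
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        if key == "R2":
            return variant_id, float(value)
    return variant_id, None


# Made-up example line holding only the first eight VCF columns.
line = "1\t12345\t1:12345:A:G\tA\tG\t.\tPASS\tAF=0.001;MAF=0.001;R2=0.85\n"
print(parse_info_r2(line))
```

Only the first eight columns are needed for this, which is why the job script below trims the VCF with `cut -f-8` before any parsing.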

In [8]:
for i in list((1,2,11)):
    for j in list((3,8)):
        script='''#!/bin/sh
#$ -l h_rt=24:00:00
#$ -l h_vmem=10G
#$ -N extract_hrc_rsq-%i-%i
#$ -o /home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc_topmed/scripts/extract_hrc_chr%i_rsq0%i_$JOB_ID.out
#$ -e /home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc_topmed/scripts/extract_hrc_chr%i_rsq0%i_$JOB_ID.err
#$ -q csg.q
#$ -S /bin/bash
export PATH=$HOME/miniconda3/bin:$PATH

cd ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc

zcat hrc_chr%i_merged_168206ids_rsq0%i_dose.vcf.gz | cut -f-8 >> ../hrc_topmed/hrc_168206ids_chr%i_rsq0%i_rsq.txt
'''%(i,j,i,j,i,j,i,j,i,j)
        f=open("/home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc_topmed/scripts/extract_hrc_chr"+str(i)+"_rsq0"+str(j)+".sh", 'w')
        f.write(script)
        f.close()

In [9]:
for i in 1 2 11;do
    sed -i '1,19d' hrc_168206ids_chr${i}_rsq03_rsq.txt
    sed -i '1,19d' hrc_168206ids_chr${i}_rsq08_rsq.txt
done
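
The `sed -i '1,19d'` step removes a fixed number of header lines. As a hedged alternative that does not depend on the header length, the `##` meta lines can be dropped by prefix. This sketch keeps the `#CHROM` column header, which may differ from what the fixed 19-line deletion does; `strip_vcf_meta` is a hypothetical helper.

```python
def strip_vcf_meta(lines):
    """Yield only the lines that are not '##' meta-information lines."""
    for line in lines:
        if not line.startswith("##"):
            yield line


# Made-up miniature input for illustration.
example = [
    "##fileformat=VCFv4.1\n",
    "##INFO=<ID=R2,Number=1,Type=Float,Description=\"Imputation R-square\">\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t12345\t1:12345:A:G\tA\tG\t.\tPASS\tR2=0.85\n",
]
for kept in strip_vcf_meta(example):
    print(kept, end="")
```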

#### 6.12 Make Merged Annotation

We merge chromosomes 1 and 2 into one file and process chromosome 11 separately. This is because variants from chromosomes 1 and 2 are needed for the follow-up simulation studies, while chromosome 11 is only needed for the APOC3 analysis.

##### Chromosome 1 + 2

In [13]:
library(dplyr)
library(data.table)

setwd("~/project/imputation-rvtest/analysis/imputation_aggregated_analysis")

In [14]:
# add HRC rsq information to HRC annotation
for(chr in c(1,2,11)){
    rsq_df <- fread(sprintf("./hrc_topmed/hrc_168206ids_chr%d_rsq03_rsq_formatted.txt", chr))
    
    for(rsq in c(3,8)){
        for(maf in c(0.01, 0.005, 0.001)){
            maf_c <- gsub("\\.", "", as.character(maf))
            fname_out <- sprintf("./hrc_topmed/hrc_168206ids_chr%d_rsq0%d_maf%s_annot.csv.gz", chr, rsq, maf_c)
            annot <- fread(sprintf("./hrc/hrc_chr%d_rsq0%d_hg19_hg38_maf%s_LOF_missense_annot.csv.gz", chr, rsq, maf_c))
            rsq_maf <- rsq_df %>% filter(ID %in% annot$ID_hg19) %>% select(ID, R2)
            annot_rsq <- left_join(annot, rsq_maf, by = c("ID_hg19" = "ID"))
            print(sprintf("annot %i; rsq_maf %i; annot_rsq %i", nrow(annot), nrow(rsq_maf), nrow(annot_rsq)))
            
            fwrite(annot_rsq, fname_out)
        }
    } 
}

[1] "annot 20551; rsq_maf 20551; annot_rsq 20551"
[1] "annot 19743; rsq_maf 19743; annot_rsq 19743"
[1] "annot 16591; rsq_maf 16591; annot_rsq 16591"
[1] "annot 8396; rsq_maf 8396; annot_rsq 8396"
[1] "annot 7666; rsq_maf 7666; annot_rsq 7666"
[1] "annot 5243; rsq_maf 5243; annot_rsq 5243"
[1] "annot 14683; rsq_maf 14683; annot_rsq 14683"
[1] "annot 14133; rsq_maf 14133; annot_rsq 14133"
[1] "annot 11856; rsq_maf 11856; annot_rsq 11856"
[1] "annot 6238; rsq_maf 6238; annot_rsq 6238"
[1] "annot 5723; rsq_maf 5723; annot_rsq 5723"
[1] "annot 3927; rsq_maf 3927; annot_rsq 3927"
[1] "annot 13392; rsq_maf 13392; annot_rsq 13392"
[1] "annot 12867; rsq_maf 12867; annot_rsq 12867"
[1] "annot 10872; rsq_maf 10872; annot_rsq 10872"
[1] "annot 5811; rsq_maf 5811; annot_rsq 5811"
[1] "annot 5315; rsq_maf 5315; annot_rsq 5315"
[1] "annot 3788; rsq_maf 3788; annot_rsq 3788"


In [15]:
# add TOPMed rsq information to TOPMed annotation
for(chr in c(1,2,11)){
    rsq_df <- fread(sprintf("./topmed_v3/topmed_168206ids_chr%i_rsq03_rsq.txt", chr))
    
    for(rsq in c(3,8)){
        for(maf in c(0.01, 0.005, 0.001)){
            maf_c <- gsub("\\.", "", as.character(maf))
            fname_out <- sprintf("./hrc_topmed/topmed_v3_168206ids_chr%d_rsq0%d_maf%s_annot.csv.gz", chr, rsq, maf_c)
            annot <- fread(sprintf("./topmed_v3/topmed_chr%d_rsq0%d_hg38_hg38_maf%s_LOF_missense_annot.csv.gz", chr, rsq, maf_c))
            rsq_maf <- rsq_df %>% filter(ID %in% annot$ID_hg38) %>% select(ID, R2)
            annot_rsq <- left_join(annot, rsq_maf, by = c("ID_hg38" = "ID"))
            print(sprintf("annot %i; rsq_maf %i; annot_rsq %i", nrow(annot), nrow(rsq_maf), nrow(annot_rsq)))

            fwrite(annot_rsq, fname_out)
        }
    } 
}

[1] "annot 141967; rsq_maf 141967; annot_rsq 141967"
[1] "annot 141141; rsq_maf 141141; annot_rsq 141141"
[1] "annot 137927; rsq_maf 137927; annot_rsq 137927"
[1] "annot 45669; rsq_maf 45669; annot_rsq 45669"
[1] "annot 44886; rsq_maf 44886; annot_rsq 44886"
[1] "annot 42131; rsq_maf 42131; annot_rsq 42131"
[1] "annot 100197; rsq_maf 100197; annot_rsq 100197"
[1] "annot 99636; rsq_maf 99636; annot_rsq 99636"
[1] "annot 97305; rsq_maf 97305; annot_rsq 97305"
[1] "annot 33865; rsq_maf 33865; annot_rsq 33865"
[1] "annot 33324; rsq_maf 33324; annot_rsq 33324"
[1] "annot 31283; rsq_maf 31283; annot_rsq 31283"
[1] "annot 88634; rsq_maf 88634; annot_rsq 88634"
[1] "annot 88110; rsq_maf 88110; annot_rsq 88110"
[1] "annot 86113; rsq_maf 86113; annot_rsq 86113"
[1] "annot 29328; rsq_maf 29328; annot_rsq 29328"
[1] "annot 28822; rsq_maf 28822; annot_rsq 28822"
[1] "annot 27096; rsq_maf 27096; annot_rsq 27096"


In [3]:
setwd("~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc_topmed")

library(dplyr)
library(data.table)

for(rsq in c(3, 8)){
    for(maf in c(0.01, 0.005, 0.001)){
        maf_c <- gsub("\\.", "", as.character(maf))
        hrc_chr1 <- read.csv(sprintf("./hrc_168206ids_chr1_rsq0%d_maf%s_annot.csv.gz", rsq, maf_c)) %>% rename("R2_hrc" = R2) %>% mutate(R2_hrc = as.numeric(R2_hrc))
        hrc_chr2 <- read.csv(sprintf("./hrc_168206ids_chr2_rsq0%d_maf%s_annot.csv.gz", rsq, maf_c)) %>% rename("R2_hrc" = R2) %>% mutate(R2_hrc = as.numeric(R2_hrc))

        topmed_chr1 <- read.csv(sprintf("./topmed_v3_168206ids_chr1_rsq0%d_maf%s_annot.csv.gz", rsq, maf_c)) %>% rename("R2_topmed" = R2) %>% mutate(R2_topmed = as.numeric(R2_topmed))
        topmed_chr2 <- read.csv(sprintf("./topmed_v3_168206ids_chr2_rsq0%d_maf%s_annot.csv.gz", rsq, maf_c)) %>% rename("R2_topmed" = R2) %>% mutate(R2_topmed = as.numeric(R2_topmed))

        hrc <- rbind(hrc_chr1, hrc_chr2)
        topmed <- rbind(topmed_chr1, topmed_chr2) %>% mutate(ID_hg19 = NA)

        topmed_hrc <- full_join(hrc, topmed, 
                                by = c("Chr", "Start", "End", "Ref", "Alt", 
                                       "Func.refGene", "Gene.refGene", "ExonicFunc.refGene", 
                                       "MAF_nfe_exome", "REVEL_score", "Function", 
                                       "ID")) %>%
                    select(-ID_hg19.y) %>% 
                    rename(ID_hg19 = ID_hg19.x)%>%
                    mutate(REVEL_score = as.numeric(REVEL_score),
                           R2_hrc = tidyr::replace_na(as.numeric(R2_hrc), 0),
                           R2_topmed = tidyr::replace_na(R2_topmed, 0)) %>%
                    mutate(R2 = ifelse(R2_topmed >= R2_hrc, R2_topmed, R2_hrc),
                           source = ifelse(R2_topmed > R2_hrc, "topmed", "hrc")) %>%
                    mutate(RawScore = ifelse(source == "hrc", RawScore.x, RawScore.y),
                           PHRED = ifelse(source == "hrc", PHRED.x, PHRED.y)) %>%
                    select(-c(RawScore.x, RawScore.y, PHRED.x, PHRED.y))
        
        frq <- topmed_hrc %>% pull(ID) %>% table() %>% as.data.frame() %>% filter(Freq >= 2)
        
        sub_annot_1 <- topmed_hrc %>% filter(!ID %in% frq[,1]) %>%
            mutate(ID_hg38 = ifelse(ID_hg38.x %in% c("", NA), ID_hg38.y, ID_hg38.x)) %>%
            select(-c(ID_hg38.y, ID_hg38.x, cat))
        
        sub_annot_2 <- topmed_hrc %>% 
            filter(ID %in% frq[,1]) %>%
            group_by(ID) %>%
            mutate(max_R2 = pmax(R2_hrc, R2_topmed, na.rm = TRUE),
                   max_R2_hrc = max(R2_hrc, na.rm = TRUE),
                   max_R2_topmed = max(R2_topmed, na.rm = TRUE)) %>%
            tidyr::fill(ID_hg19, .direction = "downup") %>%
            filter(max_R2 == max(max_R2)) %>%
            mutate(R2 = max(max_R2_hrc, max_R2_topmed, na.rm = TRUE),
                   R2_hrc = max_R2_hrc, 
                   R2_topmed = max_R2_topmed, 
                   ID_hg38 = ifelse(ID_hg38.x %in% c("", NA), ID_hg38.y, ID_hg38.x)) %>%
            ungroup() %>%
            select(-c(ID_hg38.x, ID_hg38.y, max_R2, max_R2_hrc, max_R2_topmed, cat))

        full_annot <- rbind(sub_annot_1, sub_annot_2)
        print(dim(full_annot))
        
        annot_lof <- full_annot %>% filter(Function == "LoF")
        annot_mis <- full_annot %>% filter(Function != "LoF" & as.numeric(PHRED) >= 20)

        full_annot %>% fwrite(sprintf("./hrc_topmed_v3_168206ids_rsq0%d_maf%s_annot.csv.gz", rsq, maf_c))
        rbind(annot_lof, annot_mis) %>% fwrite(sprintf("./hrc_topmed_v3_168206ids_rsq0%d_maf%s_cadd_annot.csv.gz", rsq, maf_c))
    }
}

ℹ In argument: `REVEL_score = as.numeric(REVEL_score)`.
! NAs introduced by coercion”


[1] 245608     20


ℹ In argument: `REVEL_score = as.numeric(REVEL_score)`.
! NAs introduced by coercion”


[1] 244170     20


ℹ In argument: `REVEL_score = as.numeric(REVEL_score)`.
! NAs introduced by coercion”


[1] 238436     20


ℹ In argument: `REVEL_score = as.numeric(REVEL_score)`.
! NAs introduced by coercion”


[1] 83562    20


ℹ In argument: `REVEL_score = as.numeric(REVEL_score)`.
! NAs introduced by coercion”


[1] 82175    20


ℹ In argument: `REVEL_score = as.numeric(REVEL_score)`.
! NAs introduced by coercion”


[1] 76961    20
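
The merge cell above handles variant IDs that appear in more than one row after the full join, presumably because the HRC and TOPMed annotations differ in one of the join columns. Within each such ID it keeps the row or rows whose own best R2 equals the group's best, and carries the group-wide maximum `R2_hrc` and `R2_topmed` onto the kept rows. A minimal Python sketch of that logic over plain dictionaries; `resolve_duplicates` is a hypothetical name and the example rows are made up.

```python
from collections import defaultdict


def resolve_duplicates(rows):
    """Keep, per ID, the row(s) with the highest per-row max(R2_hrc, R2_topmed)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["ID"]].append(row)

    resolved = []
    for group in groups.values():
        best_hrc = max(r["R2_hrc"] for r in group)
        best_topmed = max(r["R2_topmed"] for r in group)
        best = max(max(r["R2_hrc"], r["R2_topmed"]) for r in group)
        for r in group:
            if max(r["R2_hrc"], r["R2_topmed"]) == best:
                # Carry the group-wide maxima onto the surviving row.
                resolved.append({**r,
                                 "R2_hrc": best_hrc,
                                 "R2_topmed": best_topmed,
                                 "R2": max(best_hrc, best_topmed)})
    return resolved


rows = [
    {"ID": "chr1:1:A:G", "R2_hrc": 0.9, "R2_topmed": 0.0},
    {"ID": "chr1:1:A:G", "R2_hrc": 0.0, "R2_topmed": 0.95},
    {"ID": "chr1:2:C:T", "R2_hrc": 0.5, "R2_topmed": 0.0},
]
for row in resolve_duplicates(rows):
    print(row)
```

Rows that tie on the best R2 are all kept, as they would be by the `filter(max_R2 == max(max_R2))` step in the R code.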


##### Chromosome 11

In [2]:
setwd("~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc_topmed")

for(rsq in c(3, 8)){
    for(maf in c(0.01, 0.005, 0.001)){
        maf_c <- gsub("\\.", "", as.character(maf))
        hrc <- read.csv(sprintf("./hrc_168206ids_chr11_rsq0%d_maf%s_annot.csv.gz", rsq, maf_c)) %>% rename("R2_hrc" = R2) %>% mutate(R2_hrc = as.numeric(R2_hrc))
        topmed <- read.csv(sprintf("./topmed_v3_168206ids_chr11_rsq0%d_maf%s_annot.csv.gz", rsq, maf_c)) %>% rename("R2_topmed" = R2) %>% mutate(R2_topmed = as.numeric(R2_topmed), ID_hg19 = NA)
        
        topmed_hrc <- full_join(hrc, topmed, 
                                by = c("Chr", "Start", "End", "Ref", "Alt", 
                                       "Func.refGene", "Gene.refGene", "ExonicFunc.refGene", 
                                       "MAF_nfe_exome", "REVEL_score", "Function", "ID")) %>%
                    select(-ID_hg19.y) %>% 
                    rename(ID_hg19 = ID_hg19.x) %>%
                    mutate(REVEL_score = as.numeric(REVEL_score),
                           R2_hrc = tidyr::replace_na(as.numeric(R2_hrc), 0),
                           R2_topmed = tidyr::replace_na(R2_topmed, 0)) %>%
                    mutate(R2 = ifelse(R2_topmed >= R2_hrc, R2_topmed, R2_hrc),
                           source = ifelse(R2_topmed > R2_hrc, "topmed", "hrc")) %>%
                    mutate(RawScore = ifelse(source == "hrc", RawScore.x, RawScore.y),
                           PHRED = ifelse(source == "hrc", PHRED.x, PHRED.y)) %>%
                    select(-c(RawScore.x, RawScore.y, PHRED.x, PHRED.y))
        
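        # Recompute duplicated IDs for chromosome 11 instead of reusing `frq` from the chromosome 1 + 2 cell
        frq <- topmed_hrc %>% pull(ID) %>% table() %>% as.data.frame() %>% filter(Freq >= 2)

        sub_annot_1 <- topmed_hrc %>% filter(!ID %in% frq[,1]) %>%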
            mutate(ID_hg38 = ifelse(ID_hg38.x %in% c("", NA), ID_hg38.y, ID_hg38.x)) %>%
            select(-c(ID_hg38.y, ID_hg38.x, cat))
        
        sub_annot_2 <- topmed_hrc %>% 
            filter(ID %in% frq[,1]) %>%
            group_by(ID) %>%
            mutate(max_R2 = pmax(R2_hrc, R2_topmed, na.rm = TRUE),
                   max_R2_hrc = max(R2_hrc, na.rm = TRUE),
                   max_R2_topmed = max(R2_topmed, na.rm = TRUE)) %>%
            filter(max_R2 == max(max_R2)) %>%
            mutate(R2 = max(max_R2_hrc, max_R2_topmed, na.rm = TRUE),
                   R2_hrc = max_R2_hrc, 
                   R2_topmed = max_R2_topmed, 
                   ID_hg38 = ifelse(ID_hg38.x %in% c("", NA), ID_hg38.y, ID_hg38.x)) %>%
            ungroup() %>%
            select(-c(ID_hg38.x, ID_hg38.y, max_R2, max_R2_hrc, max_R2_topmed, cat))

        full_annot <- rbind(sub_annot_1, sub_annot_2)

        topmed_hrc_lof <- full_annot %>% filter(Function == "LoF")
        topmed_hrc_mis <- full_annot %>% filter(Function != "LoF" & as.numeric(PHRED) >= 20)

        full_annot %>% fwrite(sprintf("./hrc_topmed_v3_168206ids_chr11_rsq0%d_maf%s_annot.csv.gz", rsq, maf_c))
        rbind(topmed_hrc_lof, topmed_hrc_mis) %>% fwrite(sprintf("./hrc_topmed_v3_168206ids_chr11_rsq0%d_maf%s_cadd_annot.csv.gz", rsq, maf_c))
    }
}

ℹ In argument: `REVEL_score = as.numeric(REVEL_score)`.
! NAs introduced by coercion”
ℹ In argument: `max_R2_hrc = max(R2_hrc, na.rm = TRUE)`.
! no non-missing arguments to max; returning -Inf
ℹ In argument: `max_R2 == max(max_R2)`.
! no non-missing arguments to max; returning -Inf”
ℹ In argument: `R2 = max(max_R2_hrc, max_R2_topmed, na.rm = TRUE)`.
! no non-missing arguments to max; returning -Inf”
ℹ In argument: `REVEL_score = as.numeric(REVEL_score)`.
! NAs introduced by coercion”
ℹ In argument: `max_R2_hrc = max(R2_hrc, na.rm = TRUE)`.
! no non-missing arguments to max; returning -Inf
ℹ In argument: `max_R2 == max(max_R2)`.
! no non-missing arguments to max; returning -Inf”
ℹ In argument: `R2 = max(max_R2_hrc, max_R2_topmed, na.rm = TRUE)`.
! no non-miss

### 6.2 Exome + HRC + TOPMed

#### 6.21 Make merged annotation

#### Chromosome 1 + 2

In [71]:
fill_na_within_group <- function(id_hg19, r2_hrc) {
  non_na_value <- id_hg19[which(!is.na(id_hg19) & r2_hrc != 0)]
  if(length(non_na_value) > 0) {
    id_hg19[is.na(id_hg19)] <- non_na_value[1]
  }
  return(id_hg19)
}
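
For readers more familiar with Python, the helper above can be mirrored as follows. This is an illustrative equivalent rather than part of the pipeline; `None` stands in for R's `NA`, and a missing R2 is assumed never to qualify a row as a donor, matching how `which()` drops `NA` conditions.

```python
def fill_na_within_group(id_hg19, r2_hrc):
    """Fill missing hg19 IDs with the first ID from a row that has non-zero HRC R2."""
    donors = [
        value
        for value, r2 in zip(id_hg19, r2_hrc)
        if value is not None and r2 is not None and r2 != 0
    ]
    if not donors:
        # No row in the group was imputed by HRC, so nothing can be filled.
        return list(id_hg19)
    return [donors[0] if value is None else value for value in id_hg19]


print(fill_na_within_group([None, "1:100:A:G"], [0.0, 0.8]))
print(fill_na_within_group([None, "1:100:A:G"], [0.5, 0.0]))
```

In the first call the missing ID is filled from the HRC-imputed row. In the second, the only row with an ID has zero HRC R2, so the gap is left as is.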

In [75]:
setwd("~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/")

library(dplyr)
library(data.table)

for(maf in c(0.01, 0.005, 0.001)){
    maf_c <- gsub("\\.", "", as.character(maf))
    
    exome_chr1 <- read.csv(sprintf("./exome/ukb23156_c1.merged.filtered.hg38.hg38_multianno_formatted_sel_col_maf%s_LOF_missense.csv.gz", maf_c))
    exome_chr2 <- read.csv(sprintf("./exome/ukb23156_c2.merged.filtered.hg38.hg38_multianno_formatted_sel_col_maf%s_LOF_missense.csv.gz", maf_c))
    exome <- rbind(exome_chr1, exome_chr2) %>% mutate(REVEL_score = as.numeric(REVEL_score))
    
    for(rsq in c(3, 8)){
        topmed_hrc <- read.csv(sprintf("./hrc_topmed/hrc_topmed_v3_168206ids_rsq0%d_maf%s_annot.csv.gz", rsq, maf_c)) %>% mutate(REVEL_score = as.numeric(REVEL_score))
        full_annot <- 
            full_join(topmed_hrc %>% rename("source_hrc_topmed" = source), 
              exome %>% mutate(R2_exome = 999), by = c("Chr", "Start", "End", "Ref", "Alt", 
                                                       "Func.refGene", "Gene.refGene", "ExonicFunc.refGene", 
                                                       "MAF_nfe_exome", "REVEL_score", "Function","ID")) %>%
                mutate(source = ifelse(is.na(R2_exome), source_hrc_topmed, "exome")) %>% 
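                # Exome rows carry their CADD scores in the .y columns; HRC/TOPMed rows carry them in the .x columns
                mutate(RawScore = ifelse(source == "exome", RawScore.y, RawScore.x),
                       PHRED = ifelse(source == "exome", PHRED.y, PHRED.x)) %>%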
                select(-c(RawScore.x, RawScore.y, PHRED.x, PHRED.y))
                
        frq <- full_annot %>% pull(ID) %>% table() %>% as.data.frame() %>% filter(Freq >= 2)
        
        sub_annot_1 <- full_annot %>% filter(!ID %in% frq[,1]) %>%
            mutate(ID_hg38 = ifelse(ID_hg38.x %in% c("", NA), ID_hg38.y, ID_hg38.x)) %>%
            select(-c(ID_hg38.y, ID_hg38.x, source_hrc_topmed))
        
        sub_annot_2 <- full_annot %>% 
            filter(ID %in% frq[,1]) %>%
            group_by(ID) %>%
            arrange(ID) %>%
            mutate(ID_hg19 = na_if(ID_hg19, ""),
                   max_R2 = pmax(R2, R2_exome, na.rm = TRUE),
                   max_R2_exome = max(R2_exome, na.rm = TRUE),
                   max_R2_hrc = max(R2_hrc, na.rm = TRUE),
                   max_R2_topmed = max(R2_topmed, na.rm = TRUE)) %>% 
            tidyr::fill(source_hrc_topmed, .direction = "downup") %>%
            mutate(R2 = max(max_R2_hrc, max_R2_topmed, na.rm = TRUE),
                   R2_hrc = max_R2_hrc, 
                   R2_topmed = max_R2_topmed, 
                   R2_exome = max_R2_exome,
                   ID_hg38 = ifelse(ID_hg38.x %in% c("", NA), ID_hg38.y, ID_hg38.x),
                   ID_hg19 = fill_na_within_group(ID_hg19, R2_hrc)) %>%
            filter(max_R2 == max(max_R2)) %>%
            ungroup() %>%
            select(-c(ID_hg38.x, ID_hg38.y, max_R2, max_R2_hrc, max_R2_topmed, max_R2_exome, source_hrc_topmed))
        
        full_annot <- rbind(sub_annot_1, sub_annot_2)
        
        # Keep every LoF variant, plus non-LoF variants with PHRED >= 20
        annot_lof <- full_annot %>% filter(Function == "LoF")
        annot_mis <- full_annot %>% filter(Function != "LoF" & as.numeric(PHRED) >= 20)
        
        rbind(annot_lof, annot_mis) %>% fwrite(sprintf("./hrc_topmed_exome/hrc_topmed_v3_exome_168206ids_rsq0%d_maf%s_cadd_annot.csv.gz", rsq, maf_c))
        full_annot %>% fwrite(sprintf("./hrc_topmed_exome/hrc_topmed_v3_exome_168206ids_rsq0%d_maf%s_annot.csv.gz", rsq, maf_c))
    } 
}

ℹ In argument: `REVEL_score = as.numeric(REVEL_score)`.
! NAs introduced by coercion
ℹ In argument: `REVEL_score = as.numeric(REVEL_score)`.
! NAs introduced by coercion
ℹ In argument: `REVEL_score = as.numeric(REVEL_score)`.
! NAs introduced by coercion


#### Chromosome 11
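
The chromosome 11 cell repeats the merge from the previous cell with chromosome 11 input files. As a rough illustration of how the `R2_exome = 999` sentinel tags each row's `source` after the `full_join`, here is a minimal sketch on toy data. The table contents and the single-column join key are assumptions made for brevity (the actual cell joins on twelve annotation columns).

```r
library(dplyr)

# Toy stand-ins for the imputed (HRC/TOPMed) and exome annotation tables.
# All values here are made up for illustration.
imputed <- data.frame(ID = c("v1", "v2"),
                      source = c("hrc", "topmed"),
                      R2 = c(0.9, 0.4))
exome <- data.frame(ID = c("v2", "v3"))

merged <- full_join(imputed %>% rename(source_hrc_topmed = source),
                    exome %>% mutate(R2_exome = 999),  # sentinel: row is present in the exome table
                    by = "ID") %>%
    # Rows without the sentinel come only from imputation and keep their original source
    mutate(source = ifelse(is.na(R2_exome), source_hrc_topmed, "exome"))

print(merged[, c("ID", "source")])
```

In this toy case `v1` keeps its imputation source, while `v2` (present in both tables) and `v3` (exome only) are labelled `"exome"`.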

In [76]:
for(maf in c(0.01, 0.005, 0.001)){
    maf_c <- gsub("\\.", "", as.character(maf))

    exome <- read.csv(sprintf("./exome/ukb23156_c11.merged.filtered.hg38.hg38_multianno_formatted_sel_col_maf%s_LOF_missense_cadd.csv.gz", maf_c))
    exome <- exome %>% mutate(REVEL_score = as.numeric(REVEL_score))
    
    for(rsq in c(3, 8)){
        topmed_hrc <- read.csv(sprintf("./hrc_topmed/hrc_topmed_168206ids_chr11_rsq0%d_maf%s_annot.csv.gz", rsq, maf_c)) %>% mutate(REVEL_score = as.numeric(REVEL_score))
    
        full_annot <- 
            full_join(topmed_hrc %>% rename("source_hrc_topmed" = source), 
              exome %>% mutate(R2_exome = 999), by = c("Chr", "Start", "End", "Ref", "Alt", 
                                                       "Func.refGene", "Gene.refGene", "ExonicFunc.refGene", 
                                                       "MAF_nfe_exome", "REVEL_score", "Function","ID")) %>%
                mutate(source = ifelse(is.na(R2_exome), source_hrc_topmed, "exome")) %>% 
                mutate(RawScore = ifelse(source == "hrc", RawScore.x, RawScore.y),
                       PHRED = ifelse(source == "hrc", PHRED.x, PHRED.y)) %>%
                select(-c(RawScore.x, RawScore.y, PHRED.x, PHRED.y))  # source_hrc_topmed is kept; it is needed below
                
        # IDs that appear on more than one row after the join
        frq <- full_annot %>% pull(ID) %>% table() %>% as.data.frame() %>% filter(Freq >= 2)
        
        sub_annot_1 <- full_annot %>% filter(!ID %in% frq[,1]) %>%
            mutate(ID_hg38 = ifelse(ID_hg38.x %in% c("", NA), ID_hg38.y, ID_hg38.x)) %>%
            select(-c(ID_hg38.y, ID_hg38.x))
        
        sub_annot_2 <- full_annot %>% 
            filter(ID %in% frq[,1]) %>%
            group_by(ID) %>%
            arrange(ID) %>%
            mutate(ID_hg19 = na_if(ID_hg19, ""),
                   max_R2 = pmax(R2, R2_exome, na.rm = TRUE),
                   max_R2_exome = max(R2_exome, na.rm = TRUE),
                   max_R2_hrc = max(R2_hrc, na.rm = TRUE),
                   max_R2_topmed = max(R2_topmed, na.rm = TRUE)) %>% 
            tidyr::fill(source_hrc_topmed, .direction = "downup") %>%
            mutate(R2 = max(max_R2_hrc, max_R2_topmed, na.rm = TRUE),
                   R2_hrc = max_R2_hrc, 
                   R2_topmed = max_R2_topmed, 
                   R2_exome = max_R2_exome,
                   ID_hg38 = ifelse(ID_hg38.x %in% c("", NA), ID_hg38.y, ID_hg38.x),
                   ID_hg19 = fill_na_within_group(ID_hg19, R2_hrc)) %>%
            filter(max_R2 == max(max_R2)) %>%
            ungroup() %>%
            select(-c(ID_hg38.x, ID_hg38.y, max_R2, max_R2_hrc, max_R2_topmed, max_R2_exome, source_hrc_topmed))
        
        full_annot <- rbind(sub_annot_1, sub_annot_2)
        
        # Keep every LoF variant, plus non-LoF variants with PHRED >= 20
        annot_lof <- full_annot %>% filter(Function == "LoF")
        annot_mis <- full_annot %>% filter(Function != "LoF" & as.numeric(PHRED) >= 20)
        
        print(dim(full_annot))
        rbind(annot_lof, annot_mis) %>% fwrite(sprintf("./hrc_topmed_exome/hrc_topmed_exome_168206ids_chr11_rsq0%d_maf%s_cadd_annot.csv.gz", rsq, maf_c))
        full_annot %>% fwrite(sprintf("./hrc_topmed_exome/hrc_topmed_exome_168206ids_chr11_rsq0%d_maf%s_annot.csv.gz", rsq, maf_c))
    } 
}

ℹ In argument: `REVEL_score = as.numeric(REVEL_score)`.
! NAs introduced by coercion
ℹ In argument: `max_R2_exome = max(R2_exome, na.rm = TRUE)`.
! no non-missing arguments to max; returning -Inf
ℹ In argument: `R2 = max(max_R2_hrc, max_R2_topmed, na.rm = TRUE)`.
! no non-missing arguments to max; returning -Inf
ℹ In argument: `max_R2 == max(max_R2)`.
! no non-missing arguments to max; returning -Inf

[1] 215249     22

ℹ In argument: `max_R2_exome = max(R2_exome, na.rm = TRUE)`.
! no non-missing arguments to max; returning -Inf
ℹ In argument: `R2 = max(max_R2_hrc, max_R2_topmed, na.rm = TRUE)`.
! no non-missing arguments to max; returning -Inf
ℹ In argument: `max_R2 == max(max_R2)`.
! no non-missing arguments to max; returning -Inf

[1] 186585     22

ℹ In argument: `REVEL_score = as.numeric(REVEL_score)`.
! NAs introduced by coercion
ℹ In argument: `max_R2_exome = max(R2_exome, na.rm = TRUE)`.
! no non-missing arguments to max; returning -Inf
ℹ In argument: `R2 = max(max_R2_hrc, max_R2_topmed, na.rm = TRUE)`.
! no non-missing arguments to max; returning -Inf
ℹ In argument: `max_R2 == max(max_R2)`.
! no non-missing arguments to max; returning -Inf

[1] 214701     22

ℹ In argument: `max_R2_exome = max(R2_exome, na.rm = TRUE)`.
! no non-missing arguments to max; returning -Inf
ℹ In argument: `R2 = max(max_R2_hrc, max_R2_topmed, na.rm = TRUE)`.
! no non-missing arguments to max; returning -Inf
ℹ In argument: `max_R2 == max(max_R2)`.
! no non-missing arguments to max; returning -Inf

[1] 186048     22

ℹ In argument: `REVEL_score = as.numeric(REVEL_score)`.
! NAs introduced by coercion
ℹ In argument: `max_R2_exome = max(R2_exome, na.rm = TRUE)`.
! no non-missing arguments to max; returning -Inf
ℹ In argument: `R2 = max(max_R2_hrc, max_R2_topmed, na.rm = TRUE)`.
! no non-missing arguments to max; returning -Inf
ℹ In argument: `max_R2 == max(max_R2)`.
! no non-missing arguments to max; returning -Inf

[1] 212593     22

ℹ In argument: `max_R2_exome = max(R2_exome, na.rm = TRUE)`.
! no non-missing arguments to max; returning -Inf
ℹ In argument: `R2 = max(max_R2_hrc, max_R2_topmed, na.rm = TRUE)`.
! no non-missing arguments to max; returning -Inf
ℹ In argument: `max_R2 == max(max_R2)`.
! no non-missing arguments to max; returning -Inf


[1] 184077     22
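
The two warning families in the output above can be reproduced in base R, which may help when deciding whether they are benign. The sketch below assumes that `REVEL_score` uses `"."` as a missing-value placeholder (not confirmed by this notebook), and `safe_max()` is a hypothetical helper, not part of the pipeline.

```r
# 1. as.numeric() on character input with a non-numeric placeholder
revel <- c("0.812", ".", "0.051")                 # "." is an assumed placeholder
revel_num <- suppressWarnings(as.numeric(revel))  # would warn: NAs introduced by coercion
print(revel_num)                                  # the placeholder becomes NA

# 2. max(..., na.rm = TRUE) when nothing is left after dropping NAs
r2_exome <- c(NA_real_, NA_real_)
all_na_max <- suppressWarnings(max(r2_exome, na.rm = TRUE))  # would warn and return -Inf
print(all_na_max)

# Hypothetical helper: return NA instead of -Inf for empty or all-NA input
safe_max <- function(x) {
    x <- x[!is.na(x)]
    if (length(x) == 0) NA_real_ else max(x)
}
print(safe_max(r2_exome))
print(safe_max(c(0.3, NA, 0.8)))
```

Swapping a helper like this into the cell would not be a drop-in change: `filter(max_R2 == max(max_R2))` keeps rows where both sides are `-Inf` but drops rows where the comparison is `NA`, so the effect on row counts would need to be checked first.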
