# Imputed Raw Data Processing

This notebook records the steps used to prepare the input for imputation and to process the imputed output.

## Imputation input preparation

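### Divide the 168,206 individuals into 7 batches of at most 25,000

The TOPMed Imputation Server accepts at most 25,000 sample IDs per file, so the 168,206 individuals are split into six batches of 25,000 and a seventh batch of 18,206.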

In [3]:
library(dplyr)
library(data.table)

setwd("~/project/imputation-rvtest/analysis/imputation_aggregated_analysis")

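# Assign a batch index (1-7) to each sample: 25,000 per batch, truncated to 168,206 samples
idx <- lapply(c(1:7), function(x) rep(x, 25000)) %>% unlist()
idx <- idx[c(1:168206)]

# Read the sample IDs (first column, V1) and split them into one vector per batch
iid_lst <- fread("168206ind.sample.txt") %>%
    mutate(idx = idx) %>%
    group_by(idx) %>%
    group_split() %>%
    lapply(function(x) x$V1)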
for(i in c(1:7)){
    write.table(iid_lst[[i]], sprintf("./imputation_input/168206ind_sample_batch%d.txt", i), col.names = FALSE, row.names = FALSE, quote = FALSE)
}
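
As a quick sanity check on the batching above, the following Python sketch reproduces the batch sizes implied by a 25,000-sample cap. The helper name `batch_sizes` is hypothetical and is not part of the original workflow.

```python
import math
from typing import List


def batch_sizes(n_samples: int, max_per_batch: int) -> List[int]:
    """Return the size of each batch when samples are assigned in order."""
    n_batches = math.ceil(n_samples / max_per_batch)
    # All batches are full except possibly the last one.
    sizes = [max_per_batch] * (n_batches - 1)
    sizes.append(n_samples - max_per_batch * (n_batches - 1))
    return sizes


sizes = batch_sizes(168206, 25000)
print(len(sizes), sizes[-1])  # number of batches and size of the last batch
```

The last batch is smaller than the others, which is why the R code truncates the index vector to 168,206 entries before splitting.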

### Create per chromosome .vcf.gz file 

In [2]:
for i in list((1,2,11)):
    for j in range(1,8):
        script='''#!/bin/sh
#$ -l h_rt=24:00:00
#$ -l h_vmem=10G
#$ -N make_impute_input_chr%i_batch%i
#$ -o ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/imputation_input/scripts/make_impute_input_chr%i_batch%i_$JOB_ID.out
#$ -e ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/imputation_input/scripts/make_impute_input_chr%i_batch%i_$JOB_ID.err
#$ -j y
#$ -q csg.q
#$ -S /bin/bash
export PATH=$HOME/miniconda3/bin:$PATH
module load Plink/1.9.10

plink \
    --bfile /mnt/mfs/statgen/UKBiobank/QCed_Plink_autosomal_files_hg38/QCed_White_EU_460649ind_10212022_hg38_sorted \
    --chr %i \
    --keep-fam ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/imputation_input/168206ind_sample_batch%i.txt \
    --make-bed \
    --output-chr chrM \
    --recode vcf bgz \
    --out ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/imputation_input/imputation_input_hg38_sorted_unrelated_white_eur_extracted_168206ind_chr%i_batch%i

'''%(i,j,i,j,i,j,i,j,i,j)
        f=open("/home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/imputation_input/scripts/make_impute_input_chr"+str(i)+"_batch"+str(j)+".sh", 'w')
        f.write(script)
        f.close()
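
The template above relies on a positional tuple (`i,j,i,j,...`) whose length and order must match every `%i` in the script. One possible alternative, sketched here with a shortened and hypothetical template, is Python's named `%(name)d` placeholders, which leave `$JOB_ID` and other shell syntax untouched. The `render` helper and the abbreviated template are illustrative assumptions, not the script used in this analysis.

```python
# Shortened, hypothetical template: only a few directives of the full job script are shown.
TEMPLATE = '''#!/bin/sh
#$ -N make_impute_input_chr%(chrom)d_batch%(batch)d
#$ -j y

plink \\
    --chr %(chrom)d \\
    --keep-fam 168206ind_sample_batch%(batch)d.txt \\
    --out imputation_input_chr%(chrom)d_batch%(batch)d
'''


def render(chrom: int, batch: int) -> str:
    """Fill the template from named values instead of a positional tuple."""
    return TEMPLATE % {"chrom": chrom, "batch": batch}


# One script per chromosome/batch pair, keyed by (chromosome, batch).
scripts = {(c, b): render(c, b) for c in (1, 2, 11) for b in range(1, 8)}
print(len(scripts))
```

With named placeholders, adding or removing a line in the template does not require recounting the arguments.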

In [None]:
for i in 1 2 11; do
    for j in {1..7}; do
        qsub ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/imputation_input/scripts/make_impute_input_chr${i}_batch${j}.sh
    done;
done

## Upload files to the imputation servers and download the results

[TOPMed Imputation Server - TOPMed](https://imputation.biodatacatalyst.nhlbi.nih.gov/#!)

[Michigan Imputation Server - HRC](https://imputationserver.sph.umich.edu/index.html#!)

## Imputation file post-processing

**Summary**

Total number of imputed variants retained after $R^2$ filtering

| Dataset | $R^2$ threshold | Chromosome | Number of variants |
| --------|-----------|:---------:| :---------: |
| `HRC`   |    0.3    |   Chr 1   |  2,658,810  |
|         |           |   Chr 2   |  2,950,986  |
|         |           |   Chr 11  | 1,696,916   |
| `HRC`   |    0.8    |   Chr 1   | 1,322,459   |
|         |           |   Chr 2   | 1,489,325   |
|         |           |   Chr 11  |   881,445   |
| `TOPMed`|    0.3    |   Chr 1   |  11,100,489 |
|         |           |   Chr 2   |  11,896,025 |
|         |           |   Chr 11  |  6,742,803  |
| `TOPMed`|    0.8    |   Chr 1   |  4,753,189  |
|         |           |   Chr 2   |  5,166,353  |
|         |           |   Chr 11  |  2,909,711  |
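
To make the table easier to compare across panels, the sketch below computes, for each dataset and chromosome, the fraction of variants passing the 0.3 threshold that also pass the 0.8 threshold. The counts are copied from the table; the dictionary layout and the `retained_fraction` helper are illustrative assumptions.

```python
# Variant counts copied from the summary table:
# (dataset, chromosome) -> {R^2 threshold: number of variants}
counts = {
    ("HRC", 1): {0.3: 2658810, 0.8: 1322459},
    ("HRC", 2): {0.3: 2950986, 0.8: 1489325},
    ("HRC", 11): {0.3: 1696916, 0.8: 881445},
    ("TOPMed", 1): {0.3: 11100489, 0.8: 4753189},
    ("TOPMed", 2): {0.3: 11896025, 0.8: 5166353},
    ("TOPMed", 11): {0.3: 6742803, 0.8: 2909711},
}


def retained_fraction(entry: dict) -> float:
    """Fraction of variants at the 0.3 threshold that remain at 0.8."""
    return entry[0.8] / entry[0.3]


for (dataset, chrom), entry in counts.items():
    print(dataset, chrom, round(retained_fraction(entry), 3))
```

With these counts, roughly half of the HRC variants and about 43% of the TOPMed variants remain at the stricter threshold.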

### Combining imputed files

Because each imputation server limits the number of samples per job, the data were imputed in batches. The per-batch files must be merged across samples into a single file before an overall $R^2$ can be calculated for each variant.

The TOPMed team has provided [hds-util](https://github.com/statgen/hds-util), a post-processing tool for Minimac4 and Michigan Imputation Server (MIS). It can generate FORMAT fields from HDS, convert from the SAV file format to BCF or VCF, and paste together sample groups that were split due to MIS sample size limit.
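
The sketch below illustrates why the batches have to be pooled before $R^2$ is computed rather than averaging per-batch values. It assumes a Minimac-style estimate, namely the observed variance of the haploid dosages divided by the expected binomial variance $p(1-p)$. This formula and the toy dosages are assumptions made for illustration and do not describe the internals of `hds-util`.

```python
from typing import Sequence


def estimated_r2(hap_dosages: Sequence[float]) -> float:
    """Observed dosage variance over expected variance p(1-p) (assumed formula)."""
    n = len(hap_dosages)
    p = sum(hap_dosages) / n
    expected = p * (1 - p)
    if expected == 0:
        return 0.0  # monomorphic dosages: ratio undefined, reported as 0 in this sketch
    observed = sum((d - p) ** 2 for d in hap_dosages) / n
    return observed / expected


# Toy haploid dosages for one variant in two hypothetical batches.
batch_a = [0.0, 0.0, 0.1, 0.1]
batch_b = [0.9, 1.0, 1.0, 0.9]

pooled = estimated_r2(batch_a + batch_b)
per_batch_mean = (estimated_r2(batch_a) + estimated_r2(batch_b)) / 2
print(pooled > per_batch_mean)
```

In this toy case each batch on its own shows little dosage variation, while the pooled samples do, so the pooled estimate differs substantially from the average of the per-batch estimates.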


In [4]:
## software installation
mkdir -p ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/tools/
cd ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/tools/
git clone https://github.com/statgen/hds-util

cd hds-util

pip3 install --user cget
cget install -f ./requirements.txt
mkdir build; cd build
cmake -DCMAKE_TOOLCHAIN_FILE=../cget/cget/cget.cmake -DCMAKE_BUILD_TYPE=Release ..
make
make install

# Use $HOME rather than "~": a tilde inside double quotes is not expanded by the shell
export PATH="$HOME/project/imputation-rvtest/analysis/imputation_aggregated_analysis/tools/hds-util/build:$PATH"

hds-util --help

Cloning into 'hds-util'...
remote: Enumerating objects: 72, done.
remote: Counting objects: 100% (72/72), done.
remote: Compressing objects: 100% (50/50), done.
remote: Total 72 (delta 32), reused 57 (delta 17), pack-reused 0
Unpacking objects: 100% (72/72), done.
Downloading https://github.com/statgen/savvy/archive/78f94e2d703ed952da5026f9444f053518ed2485.tar.gz
  [######################################################################]  100%
Extracting archive /mnt/vast/hpc/csg/tl3031/imputation-rvtest/analysis/imputation_aggregated_analysis/tools/hds-util/cget/cget/build/tmp-a1631e0215714c9d91279253328fa48a/78f94e2d703ed952da5026f9444f053518ed2485.tar.gz ...
Downloading https://github.com/jonathonl/shrinkwrap/archive/v1.2.0.tar.gz
  [######################################################################]  100%
Extracting archive /mnt/vast/hpc/csg/tl3031/imputation-rvtest/analysis/imputation_aggregated_analysis/tools/hds-util/cget/cget/build/tmp-c60

In [8]:
## write out script topmed
for i in list((1,2,11)):
    for j in list((0,3,8)):
        script='''#!/bin/sh
#$ -l h_rt=700:00:00
#$ -l h_vmem=5G
#$ -N combine_vcf_chr%i_rsq0%i
#$ -o ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed/scripts/combine_vcf_chr%i_rsq0%i_$JOB_ID.out
#$ -e ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed/scripts/combine_vcf_chr%i_rsq0%i_$JOB_ID.err
#$ -cwd
#$ -S /bin/bash
#$ -q csg.q

cd ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed
export PATH="$HOME/project/imputation-rvtest/analysis/imputation_aggregated_analysis/tools/hds-util/build:$PATH"

chr=%i

hds-util -f GT,DS,HDS --min-r2 0.%i -O vcf.gz \
    ./topmed_batch1/chr${chr}.dose.vcf.gz \
    ./topmed_batch2/chr${chr}.dose.vcf.gz \
    ./topmed_batch3/chr${chr}.dose.vcf.gz \
    ./topmed_batch4/chr${chr}.dose.vcf.gz \
    ./topmed_batch5/chr${chr}.renamed.hg38.dose.vcf.gz \
    ./topmed_batch6/chr${chr}.renamed.hg38.dose.vcf.gz \
    ./topmed_batch7/chr${chr}.renamed.hg38.dose.vcf.gz > ./topmed_chr${chr}_merged_168206ids_rsq0%i_dose.vcf.gz
'''%(i,j,i,j,i,j,i,j,j)
        f=open("/home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed/scripts/combine_vcf_chr"+str(i)+"_rsq0"+str(j)+".sh", 'w')
        f.write(script)
        f.close()

In [10]:
## write out script hrc
for i in list((1,2,11)):
    for j in list((0,3,8)):
        script='''#!/bin/sh
#$ -l h_rt=700:00:00
#$ -l h_vmem=5G
#$ -N paste_hrc_chr%i_rsq0%i
#$ -o ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc/scripts/paste_hrc_chr%i_rsq0%i_$JOB_ID.out
#$ -e ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc/scripts/paste_hrc_chr%i_rsq0%i_$JOB_ID.err
#$ -cwd
#$ -S /bin/bash
#$ -q csg.q

cd ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc
export PATH="$HOME/project/imputation-rvtest/analysis/imputation_aggregated_analysis/tools/hds-util/build:$PATH"

chr=%i

hds-util -f GT,DS,HDS --min-r2 0.%i -O vcf.gz \
    ./hrc_batch1/chr${chr}.dose.vcf.gz \
    ./hrc_batch2/chr${chr}.dose.vcf.gz \
    ./hrc_batch3/chr${chr}.dose.vcf.gz \
    ./hrc_batch4/chr${chr}.dose.vcf.gz \
    ./hrc_batch5/chr${chr}.dose.vcf.gz \
    ./hrc_batch6/chr${chr}.dose.vcf.gz \
    ./hrc_batch7/chr${chr}.dose.vcf.gz > ./hrc_chr${chr}_merged_168206ids_rsq0%i_dose.vcf.gz
'''%(i,j,i,j,i,j,i,j,j)
        f=open("/home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc/scripts/pasting_hrc_chr"+str(i)+"_rsq0"+str(j)+".sh", 'w')
        f.write(script)
        f.close()

### Liftover HRC (hg19->hg38)

#### Recode VCF

In [11]:
# Recode VCF into .pgen format, recode variant name
cd ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc
module load Plink/2.00a

for chr in 1 2 11; do
    for rsq in 03 08; do
        prefix=hrc_chr${chr}_merged_168206ids_rsq${rsq}_dose

        plink2 \
            --vcf ${prefix}.vcf.gz \
            --freq counts \
            --make-bpgen --sort-vars \
            --set-all-var-ids chr@:#:\$r:\$a \
            --out ${prefix}

        # Check for monomorphic variants: keep the header line, then print the ID
        # of every variant with a zero in column 5 or 6 of the .acount file
        awk 'BEGIN {FS=" "; OFS=" "} {if(NR==1 || $5==0 || $6==0)print $2}' ${prefix}.acount > monomprphic_chr${chr}_rsq${rsq}_SNPs
    done
done
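
The monomorphic check above can also be written as a small Python function. This is a sketch that assumes the `.acount` file has the columns `#CHROM`, `ID`, `REF`, `ALT`, `ALT_CTS`, `OBS_CT` in that order, which is an assumption about the plink2 output used here. The `include_fixed_alt` option is a hypothetical extension that also flags variants where every observed allele is the alternate allele, a case the `awk` test does not cover.

```python
from typing import Iterable, List


def monomorphic_ids(lines: Iterable[str], include_fixed_alt: bool = False) -> List[str]:
    """Return IDs whose ALT count or observation count is zero."""
    ids = []
    for n, line in enumerate(lines):
        fields = line.split()
        if n == 0 or not fields:
            continue  # skip the header (the awk command keeps it) and blank lines
        alt_ct, obs_ct = float(fields[4]), float(fields[5])
        if alt_ct == 0 or obs_ct == 0 or (include_fixed_alt and alt_ct == obs_ct):
            ids.append(fields[1])
    return ids


# Toy input with the assumed column layout.
example_lines = [
    "#CHROM\tID\tREF\tALT\tALT_CTS\tOBS_CT",
    "1\tchr1:100:A:G\tA\tG\t0\t200",
    "1\tchr1:200:C:T\tC\tT\t12.5\t200",
    "1\tchr1:300:T:C\tT\tC\t200\t200",
]
print(monomorphic_ids(example_lines))
```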

#### Liftover

In [12]:
## Running our in-house liftover pipeline

sos run /home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/liftover.ipynb \
    --cwd /home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc/ \
    --input_file ./hrc_168206_chr1.bim \
    --output_file ./hrc_168206_chr1_hg38.bim

sos run /home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/liftover.ipynb \
    --cwd /home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc_annot_168206ids \
    --input_file ./hrc_168206_chr2.bim \
    --output_file ./hrc_168206_chr2_hg38.bim

sos run /home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/liftover.ipynb \
    --cwd /home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc_annot_168206ids \
    --input_file ./hrc_168206_chr11.bim \
    --output_file ./hrc_168206_chr11_hg38.bim

# HRC Data Processing

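In the HRC annotation files, `ID_hg19` is the original variant ID in hg19 coordinates, `ID_hg38` is the lifted-over ID in hg38 coordinates, and `ID` is built by pasting the annotated `Chr`, `Start`, `Ref` and `Alt` fields together with a `chr` prefix.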
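
The pasted `ID` follows the same `chr@:#:$r:$a` pattern that is passed to plink2 `--set-all-var-ids` in this notebook. A minimal sketch of that naming scheme, using a hypothetical helper `make_variant_id`:

```python
def make_variant_id(chrom, pos, ref: str, alt: str) -> str:
    """Build an ID of the form chr<chrom>:<pos>:<ref>:<alt>."""
    chrom = str(chrom)
    if chrom.startswith("chr"):
        chrom = chrom[3:]  # avoid a doubled "chr" prefix
    return f"chr{chrom}:{pos}:{ref}:{alt}"


print(make_variant_id(1, 12345, "A", "G"))
```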

## Recode VCF

In [13]:
## writing out script
for i in list((1,2,11)):
    for j in list((3,8)):
        script='''#!/bin/sh
#$ -l h_rt=24:00:00
#$ -l h_vmem=30G
#$ -N recode_vcf_hrc_chr%i_rsq0%i
#$ -o /home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc/scripts/recode_vcf_hrc_chr%i_rsq%i_$JOB_ID.out
#$ -e /home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc/scripts/recode_vcf_hrc_chr%i_rsq%i-$JOB_ID.err
#$ -j y
#$ -q csg.q
#$ -S /bin/bash
cd ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc
module load Plink/2.00a

plink2 \
    --vcf hrc_chr%i_merged_168206ids_rsq0%i_dose.vcf.gz dosage=DS \
    --freq counts \
    --make-bpgen --sort-vars \
    --set-all-var-ids chr@:#:\$r:\$a \
    --new-id-max-allele-len 200 \
    --out hrc_chr%i_merged_168206ids_rsq0%i_dose

awk 'BEGIN {FS=" "; OFS=" "} {if(NR==1 || $5==0 || $6==0)print $2}' hrc_chr%i_merged_168206ids_rsq0%i_dose.acount > monomprphic_chr%i_rsq0%i_SNPs

'''%(i,j,i,j,i,j,i,j,i,j,i,j,i,j)
        f=open("/home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc/scripts/recode_vcf_hrc_chr"+str(i)+"_rsq0"+str(j)+".sh", 'w')
        f.write(script)
        f.close()

In [14]:
cd /home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc/scripts/
for i in recode_vcf_hrc_chr*_rsq0*.sh; do qsub $i; done

Your job 8117847 ("recode_vcf_hrc_chr11_rsq03") has been submitted
Your job 8117848 ("recode_vcf_hrc_chr11_rsq08") has been submitted
Your job 8117849 ("recode_vcf_hrc_chr1_rsq03") has been submitted
Your job 8117850 ("recode_vcf_hrc_chr1_rsq08") has been submitted
Your job 8117851 ("recode_vcf_hrc_chr2_rsq03") has been submitted
Your job 8117852 ("recode_vcf_hrc_chr2_rsq08") has been submitted


## Annotate HRC

In [15]:
## writing out script
for i in list((1,2,11)):
    for j in list((3,8)):
        script='''#!/bin/sh
#$ -l h_rt=24:00:00
#$ -l h_vmem=30G
#$ -N annotate_hrc_chr%i_rsq0%i
#$ -o /home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc/scripts/annotate_hrc_chr%i_rsq%i_$JOB_ID.out
#$ -e /home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc/scripts/annotate_hrc_chr%i_rsq%i-$JOB_ID.err
#$ -j y
#$ -q csg.q
#$ -S /bin/bash
export PATH=$HOME/miniconda3/bin:$PATH
module load Singularity

sos run ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/notebooks/annovar.ipynb annovar \
    --build 'hg38' \
    --cwd ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc \
    --bim_name /home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc/hrc_chr%i_merged_168206ids_rsq0%i_dose_hg38.bim \
    --humandb /mnt/mfs/statgen/isabelle/REF/humandb  \
    --job_size 1 \
    --name_prefix hrc_chr%i_merged_168206ids_rsq0%i_dose \
    --container_annovar /mnt/mfs/statgen/containers/gatk4-annovar.sif

'''%(i,j,i,j,i,j,i,j,i,j)
        f=open("/home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc/scripts/annotate_hrc_chr"+str(i)+"_rsq0"+str(j)+".sh", 'w')
        f.write(script)
        f.close()

In [16]:
## rename columns
library(dplyr)
library(data.table)

setwd("~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc")

for(chr in c(1,2,11)){
    for (rsq in c(3,8)){
        annot <- data.table::fread(sprintf("hrc_chr%d_merged_168206ids_rsq0%d_dose_hg38.hg38.hg38_multianno.csv", chr, rsq))
        bim <- data.table::fread(sprintf("hrc_chr%d_merged_168206ids_rsq0%d_dose.bim", chr, rsq))
        colnames(annot)[29:41] <- c("AF_genome",
                                    "AF_raw_genome",
                                    "AF_male_genome",
                                    "AF_female_genome",
                                    "AF_afr_genome",
                                    "AF_ami_genome",
                                    "AF_amr_genome",
                                    "AF_asj_genome",
                                    "AF_eas_genome",
                                    "AF_fin_genome",
                                    "AF_nfe_genome",
                                    "AF_oth_genome",
                                    "AF_sas_genome")
        colnames(annot)[42:54] <- c("AF_exome",
                                    "AF_popmax_exome",
                                    "AF_male_exome",
                                    "AF_female_exome",
                                    "AF_raw_exome",
                                    "AF_afr_exome",
                                    "AF_sas_exome",
                                    "AF_amr_exome",
                                    "AF_eas_exome",
                                    "AF_nfe_exome",
                                    "AF_fin_exome",
                                    "AF_asj_exome",
                                    "AF_oth_exome")
        annot <- annot %>% 
            mutate(AF_nfe_exome = as.numeric(AF_nfe_exome)) %>% 
            mutate(MAF_nfe_exome = ifelse(AF_nfe_exome > 0.5, 1 - AF_nfe_exome, AF_nfe_exome)) %>% 
            rename("ID_hg38" = "Otherinfo1") %>%
            mutate(ID = paste(Chr, Start, Ref, Alt, sep = ":"), ID_hg19 = bim$V2) %>%
            mutate(ID = paste0("chr", ID)) %>%
            select(Chr, Start, End, Ref, Alt, 
                   Func.refGene, Gene.refGene, ExonicFunc.refGene, 
                   MAF_nfe_exome, REVEL_score,
                   ID_hg38, ID_hg19, ID, CADD_phred)
        data.table::fwrite(annot, sprintf("hrc_chr%d_rsq0%d_hg19_hg38_sel_col_annot.csv.gz", chr, rsq))
    }
}
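
The `MAF_nfe_exome` column above is derived by folding the allele frequency around 0.5. The same rule as a short Python sketch, where `fold_to_maf` is a hypothetical name and missing frequencies are passed through as `None`:

```python
from typing import Optional


def fold_to_maf(af: Optional[float]) -> Optional[float]:
    """Convert an allele frequency into a minor allele frequency."""
    if af is None:
        return None  # missing frequency stays missing
    return 1 - af if af > 0.5 else af


print(fold_to_maf(0.2), fold_to_maf(None))
```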

## Filter HRC

**MAF (0.01, 0.005, 0.001) $\times$ $R^2$ (0.3, 0.8)**
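
The filtering grid can be enumerated as follows. This sketch mirrors the loop variables used in the R code (chromosomes 1, 2, 11; $R^2$ codes 3 and 8; three MAF cutoffs) and the way the MAF is turned into a file-name tag by dropping the decimal point. The names `maf_tag` and `grid` are hypothetical.

```python
from itertools import product

chromosomes = (1, 2, 11)
rsq_codes = (3, 8)  # written as rsq03 and rsq08 in file names
mafs = (0.01, 0.005, 0.001)


def maf_tag(maf: float) -> str:
    """Drop the decimal point, as gsub does in the R code."""
    return str(maf).replace(".", "")


# One entry per chromosome / R^2 / MAF combination.
grid = [
    (chrom, rsq / 10, maf, f"chr{chrom}_rsq0{rsq}_maf{maf_tag(maf)}")
    for chrom, rsq, maf in product(chromosomes, rsq_codes, mafs)
]
print(len(grid), grid[0][3])
```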

In [17]:
library(dplyr)
library(data.table)

setwd("~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc")

In [19]:
filter_df <- data.frame()  # empty accumulator; one summary row is appended per chromosome/rsq/maf combination

for(chr in c(1, 2, 11)){
    for(rsq in c(3, 8)){
        for(maf in c(0.01, 0.005, 0.001)){
            maf_c <- gsub("\\.", "", as.character(maf))
            annot <- fread(sprintf("hrc_chr%d_rsq0%d_hg19_hg38_sel_col_annot.csv.gz", chr, rsq)) %>% select(-CADD_phred)
            mono_list <- fread(sprintf("monomprphic_chr%d_rsq0%d_SNPs", chr, rsq))$ID # hg19

            annot <-  annot %>% filter(Chr == chr)
            annot_maf <-  annot %>% 
                filter(!ID_hg19 %in% mono_list) %>%
                filter(is.na(MAF_nfe_exome) | MAF_nfe_exome < maf)

            annot_func <- annot_maf %>% 
                filter(Func.refGene %in% c("exonic", "splicing", "exonic;splicing")) %>%
                filter(ExonicFunc.refGene != 'unknown') %>% 
                filter(ExonicFunc.refGene != 'synonymous SNV' & ExonicFunc.refGene != 'nonframeshift substitution') %>%
                mutate(Function = ifelse(ExonicFunc.refGene == "nonsynonymous SNV", "missense", "")) %>%
                mutate(Function = ifelse(grepl("splicing", Func.refGene), "splicing", Function)) %>%
                mutate(Function = ifelse(ExonicFunc.refGene %in% c("stopgain", "stoploss", "startloss", "frameshift substitution"), "LoF", Function))
        
            annot_func <- annot_func %>% 
                tidyr::separate(Gene.refGene, c("Gene.refGene", "discard_1", "discard_2"), sep = ";") %>% 
                select(-discard_1, -discard_2)

            gene_list <- annot_func %>% pull(Gene.refGene) %>% table() %>% as.data.frame() %>% filter(Freq > 1) %>% pull(1)
            annot_final <- annot_func %>% filter(Gene.refGene %in% gene_list)
            
            data.table::fwrite(annot_func, sprintf("hrc_chr%d_rsq0%d_hg19_hg38_maf%s_LOF_missense_all_annot.csv.gz", chr, rsq, maf_c))
            data.table::fwrite(annot_func %>% select(ID_hg19), sprintf("hrc_chr%d_rsq0%d_hg19_hg38_maf%s_LOF_missense_all_snplist", chr, rsq, maf_c), 
                               sep = " ", col.names = FALSE)
            
            data.table::fwrite(annot_final, sprintf("hrc_chr%d_rsq0%d_hg19_hg38_maf%s_LOF_missense_annot.csv.gz", chr, rsq, maf_c))
            data.table::fwrite(annot_final %>% select(ID_hg19), sprintf("hrc_chr%d_rsq0%d_hg19_hg38_maf%s_LOF_missense_snplist", chr, rsq, maf_c), 
                               sep = " ", col.names = FALSE)

            sub_df <- data.frame(data = "hrc", chromosome = chr, maf = maf, rsq = rsq/10,
                                 total_num_var = nrow(annot), maf_filtering_var = nrow(annot_maf),
                                 function_filtering_var = sprintf("%d (%d)", nrow(annot_func), length(unique(annot_func$Gene.refGene))),
                                 gene_filtering_var = sprintf("%d (%d)", nrow(annot_final), length(gene_list)))
            filter_df <- rbind(filter_df, sub_df)
        }
    }
}

“Expected 3 pieces. Additional pieces discarded in 13 rows [6556, 6557, 6561,
8110, 8112, 8120, 9478, 12706, 12714, 15013, 15015, 15027, 15031].”
“Expected 3 pieces. Missing pieces filled with `NA` in 20633 rows [1, 2, 3, 4,
5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].”
“Expected 3 pieces. Additional pieces discarded in 13 rows [6275, 6276, 6280,
7778, 7779, 7787, 9095, 12212, 12220, 14429, 14431, 14443, 14447].”
“Expected 3 pieces. Missing pieces filled with `NA` in 19838 rows [1, 2, 3, 4,
5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].”
“Expected 3 pieces. Additional pieces discarded in 9 rows [5220, 5221, 5223,
6503, 6504, 10249, 10254, 12131, 12133].”
“Expected 3 pieces. Missing pieces filled with `NA` in 16730 rows [1, 2, 3, 4,
5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].”
“Expected 3 pieces. Additional pieces discarded in 4 rows [3255, 6248, 6250,
6257].”

In [20]:
filter_df

| data | chromosome | maf | rsq | total_num_var | maf_filtering_var | function_filtering_var | gene_filtering_var |
|------|:----------:|:---:|:---:|:-------------:|:-----------------:|:----------------------:|:------------------:|
| hrc | 1 | 0.01 | 0.3 | 2658041 | 2646976 | 20656 (1904) | 20551 (1799) |
| hrc | 1 | 0.005 | 0.3 | 2658041 | 2644855 | 19861 (1902) | 19743 (1784) |
| hrc | 1 | 0.001 | 0.3 | 2658041 | 2637260 | 16746 (1883) | 16591 (1728) |
| hrc | 1 | 0.01 | 0.8 | 1322393 | 1311707 | 8709 (1694) | 8396 (1381) |
| hrc | 1 | 0.005 | 0.8 | 1322393 | 1309907 | 7996 (1658) | 7666 (1328) |
| hrc | 1 | 0.001 | 0.8 | 1322393 | 1304988 | 5655 (1510) | 5243 (1098) |
| hrc | 2 | 0.01 | 0.3 | 2950782 | 2942882 | 14733 (1173) | 14683 (1123) |
| hrc | 2 | 0.005 | 0.3 | 2950782 | 2941416 | 14184 (1172) | 14133 (1121) |
| hrc | 2 | 0.001 | 0.3 | 2950782 | 2935812 | 11928 (1163) | 11856 (1091) |
| hrc | 2 | 0.01 | 0.8 | 1489256 | 1481568 | 6402 (1054) | 6238 (890) |
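
The `Function` label assigned in the filtering code is a cascade in which later rules override earlier ones. The sketch below restates that cascade in Python for readability. The function name `classify` is hypothetical, and the category strings are those used in the R code.

```python
LOF_CLASSES = {"stopgain", "stoploss", "startloss", "frameshift substitution"}


def classify(func_refgene: str, exonic_func: str) -> str:
    """Mirror the three ifelse() steps: missense, then splicing, then LoF."""
    label = "missense" if exonic_func == "nonsynonymous SNV" else ""
    if "splicing" in func_refgene:
        label = "splicing"  # overrides missense for exonic;splicing variants
    if exonic_func in LOF_CLASSES:
        label = "LoF"  # takes precedence over both earlier labels
    return label


print(classify("exonic", "nonsynonymous SNV"))
print(classify("exonic;splicing", "stopgain"))
```

Because the LoF rule is applied last, a stop-gain variant annotated as `exonic;splicing` is labelled `LoF` rather than `splicing`.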


## Subsetting for CADD score

In [21]:
for i in list((1,2,11)):
        script='''#!/bin/sh
#$ -l h_rt=48:00:00
#$ -l h_vmem=64G
#$ -N extract_filtered_maf001_chr%i
#$ -o /home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc/scripts/make_5col_vcf_maf001_chr%i_$JOB_ID.out
#$ -e /home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc/scripts/make_5col_vcf_maf001_chr%i_$JOB_ID.err
#$ -q csg.q
#$ -S /bin/bash

export PATH=$HOME/miniconda3/bin:$PATH
module load HTSLIB/1.17
module load Plink/2.00a
cd ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc

plink2 \
    --bpfile hrc_chr%i_merged_168206ids_rsq03_dose \
    --extract hrc_chr%i_rsq03_hg19_hg38_maf001_LOF_missense_all_snplist \
    --make-bpgen --sort-vars \
    --export vcf-4.2 vcf-dosage=DS bgz \
    --out hrc_chr%i_rsq03_maf001_LOF_missense_all_extracted

zcat hrc_chr%i_rsq03_maf001_LOF_missense_all_extracted.vcf.gz | cut -f-5 > hrc_chr%i_rsq03_maf001_LOF_missense_all_extracted_5col.vcf
bgzip hrc_chr%i_rsq03_maf001_LOF_missense_all_extracted_5col.vcf


'''%(i,i,i,i,i,i,i,i,i)
        f=open("/home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc/scripts/make_5col_vcf_maf001_chr"+str(i)+".sh", 'w')
        f.write(script)
        f.close()
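
The job script above trims the exported VCF to its first five columns (`CHROM`, `POS`, `ID`, `REF`, `ALT`) with `cut -f-5`, presumably because variant-level scoring does not need the genotype columns. A minimal Python equivalent of that trimming step, with `first_five_columns` as a hypothetical helper name:

```python
def first_five_columns(vcf_line: str) -> str:
    """Keep the first five tab-separated fields of one VCF line."""
    # Lines without tabs (such as "##" meta lines) come back unchanged,
    # which matches how cut treats lines that contain no delimiter.
    return "\t".join(vcf_line.rstrip("\n").split("\t")[:5])


header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1"
print(first_five_columns(header))
```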

In [22]:
cd ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc/scripts

qsub make_5col_vcf_maf001_chr11.sh
qsub make_5col_vcf_maf001_chr1.sh
qsub make_5col_vcf_maf001_chr2.sh

sos run /home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/liftover.ipynb \
    --cwd /home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc_annot_168206ids \
    --input_file ./hrc_chr1_rsq03_maf001_LOF_missense_all_extracted.vcf.gz \
    --output_file ./hrc_chr1_rsq03_maf001_LOF_missense_all_extracted_hg38.vcf.gz

sos run /home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/liftover.ipynb \
    --cwd /home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc_annot_168206ids \
    --input_file ./hrc_chr2_rsq03_maf001_LOF_missense_all_extracted.vcf.gz \
    --output_file ./hrc_chr2_rsq03_maf001_LOF_missense_all_extracted_hg38.vcf.gz

sos run /home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/liftover.ipynb \
    --cwd /home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc_annot_168206ids \
    --input_file ./hrc_chr11_rsq03_maf001_LOF_missense_all_extracted.vcf.gz \
    --output_file ./hrc_chr11_rsq03_maf001_LOF_missense_all_extracted_hg38.vcf.gz

In [23]:
library(data.table)
library(dplyr)

setwd("~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc")

for(chr in c(1, 2, 11)){
    for(rsq in c(3, 8)){
        for(maf in c(0.01, 0.005, 0.001)){
            maf_c <- gsub("\\.", "", as.character(maf))
            annot <- fread(sprintf("hrc_chr%i_rsq0%s_hg19_hg38_maf001_LOF_missense_all_annot.csv.gz", chr, rsq))

            cadd_all <- fread(sprintf("GRCh37-v1.6_chr%i.tsv", chr), header = TRUE) %>% arrange(Pos) 
            colnames(cadd_all) <- c("Chr", "Start", "Ref", "Alt", "RawScore", "PHRED")
            cadd_all <- cadd_all %>% mutate(ID_hg19 = paste(Chr, Start, Ref, Alt, sep = ":")) %>% mutate(ID_hg19 = paste0("chr", ID_hg19))

            annot_all <- left_join(annot, cadd_all %>% select(ID_hg19, RawScore, PHRED)) %>% filter(is.na(MAF_nfe_exome) | MAF_nfe_exome < maf)
            annot_all_lof <- annot_all %>% filter(Function == "LoF")
            annot_all_cadd <- annot_all %>% filter(Function != "LoF") %>% filter(as.numeric(PHRED) >= 20)

            gene_list <- annot_all %>% pull(Gene.refGene) %>% table() %>% as.data.frame() %>% filter(Freq > 1) %>% pull(1)
            annot_final <- annot_all %>% filter(Gene.refGene %in% gene_list)
            annot_final_lof <- annot_final %>% filter(Function == "LoF")
            annot_final_cadd <- annot_final %>% filter(Function != "LoF") %>% filter(as.numeric(PHRED) >= 20)

            ## >= 1 variant
            fwrite(annot_all, 
                   sprintf("hrc_chr%d_rsq0%d_hg19_hg38_maf%s_LOF_missense_all_annot.csv.gz", chr, rsq, maf_c), 
                   quote = FALSE)

            write.table(annot_all$ID_hg19, 
                        sprintf("hrc_chr%d_rsq0%d_hg19_hg38_maf%s_LOF_missense_all_snplist", chr, rsq, maf_c), 
                        col.names = FALSE, row.names = FALSE, quote = FALSE)

            ## >= 1 variant + CADD filtering
            fwrite(rbind(annot_all_lof, annot_all_cadd), 
                   sprintf("hrc_chr%d_rsq0%d_hg19_hg38_maf%s_LOF_missense_all_cadd_annot.csv.gz", chr, rsq, maf_c),  
                   quote = FALSE)

            write.table(rbind(annot_all_lof, annot_all_cadd) %>% pull(ID_hg19), 
                        sprintf("hrc_chr%d_rsq0%d_hg19_hg38_maf%s_LOF_missense_all_cadd_snplist", chr, rsq, maf_c), 
                        col.names = FALSE, row.names = FALSE, quote = FALSE)

            ## >= 2 variant
            fwrite(annot_final, 
                   sprintf("hrc_chr%d_rsq0%d_hg19_hg38_maf%s_LOF_missense_annot.csv.gz", chr, rsq, maf_c), 
                   quote = FALSE)

            write.table(annot_final$ID_hg19, 
                        sprintf("hrc_chr%d_rsq0%d_hg19_hg38_maf%s_LOF_missense_snplist", chr, rsq, maf_c), 
                        col.names = FALSE, row.names = FALSE, quote = FALSE)

            ## >= 2 variant + CADD filtering
            fwrite(rbind(annot_final_lof, annot_final_cadd), 
                   sprintf("hrc_chr%d_rsq0%d_hg19_hg38_maf%s_LOF_missense_cadd_annot.csv.gz", chr, rsq, maf_c), 
                   quote = FALSE)

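            # Use the gene-filtered (>= 2 variants) sets so the SNP list matches the annotation file written just above
            write.table(rbind(annot_final_lof, annot_final_cadd) %>% pull(ID_hg19), 
                        sprintf("hrc_chr%d_rsq0%d_hg19_hg38_maf%s_LOF_missense_cadd_snplist", chr, rsq, maf_c),
                        col.names = FALSE, row.names = FALSE, quote = FALSE)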
        }
    }   
}


Joining with `by = join_by(ID_hg19, RawScore, PHRED)`
Joining with `by = join_by(ID_hg19, RawScore, PHRED)`
Joining with `by = join_by(ID_hg19, RawScore, PHRED)`
Joining with `by = join_by(ID_hg19, RawScore, PHRED)`
Joining with `by = join_by(ID_hg19, RawScore, PHRED)`
Joining with `by = join_by(ID_hg19, RawScore, PHRED)`
Joining with `by = join_by(ID_hg19, RawScore, PHRED)`
Joining with `by = join_by(ID_hg19, RawScore, PHRED)`
Joining with `by = join_by(ID_hg19, RawScore, PHRED)`
Joining with `by = join_by(ID_hg19, RawScore, PHRED)`
Joining with `by = join_by(ID_hg19, RawScore, PHRED)`
Joining with `by = join_by(ID_hg19, RawScore, PHRED)`
Joining with `by = join_by(ID_hg19, RawScore, PHRED)`
Joining with `by = join_by(ID_hg19, RawScore, PHRED)`
Joining with `by = join_by(ID_hg19, RawScore, PHRED)`
Joining with `by = join_by(ID_hg19, RawScore, 

In [24]:
for i in list((1,2,11)):
    for j in list((3,8)):
        for k in list((("1", "05", "01"))):
            for c in list(("", "_cadd")):
                script='''#!/bin/sh
#$ -l h_rt=48:00:00
#$ -l h_vmem=64G
#$ -N extract_filtered_chr%i_rsq0%i_maf00%s%s
#$ -o /home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc/scripts/extract_filtered_chr%i_rsq0%i_maf00%s%s_$JOB_ID.out
#$ -e /home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc/scripts/extract_filtered_chr%i_rsq0%i_maf00%s%s_$JOB_ID.err
#$ -q csg.q
#$ -S /bin/bash
export PATH=$HOME/miniconda3/bin:$PATH
module load Plink/2.00a

cd ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc
plink2 \
    --bpfile hrc_chr%i_merged_168206ids_rsq0%i_dose \
    --extract hrc_chr%i_rsq0%i_hg19_hg38_maf00%s_LOF_missense%s_snplist \
    --make-bpgen --sort-vars \
    --out hrc_chr%i_rsq0%i_maf00%s_LOF_missense%s_extracted

'''%(i,j,k,c,i,j,k,c,i,j,k,c,i,j,i,j,k,c,i,j,k,c)
                f=open("/home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc/scripts/extract_filtered_chr"+str(i)+"_rsq0"+str(j)+"_maf00"+k+c+".sh", 'w')
                f.write(script)
                f.close()
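
The `_cadd` SNP lists extracted by these jobs come from the selection rule in the R cell of this section: LoF variants are always kept, and all other variants are kept only if their CADD PHRED score is at least 20. A short Python restatement of that rule, where `passes_cadd_rule` is a hypothetical name and a missing score is represented as `None`:

```python
from typing import Optional


def passes_cadd_rule(function: str, phred: Optional[float], threshold: float = 20.0) -> bool:
    """LoF variants always pass; other variants need PHRED >= threshold."""
    if function == "LoF":
        return True
    # A missing score fails the comparison, as NA does inside dplyr::filter().
    return phred is not None and phred >= threshold


variants = [("LoF", None), ("missense", 25.1), ("missense", 12.0), ("splicing", None)]
kept = [v for v in variants if passes_cadd_rule(*v)]
print(len(kept))
```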

# TOPMed Data Processing

## Recode VCF

In [25]:
for i in list((1,2,11)):
    for j in list((3,8)):
        script='''#!/bin/sh
#$ -l h_rt=24:00:00
#$ -l h_vmem=30G
#$ -N recode_topmed_hrc_chr%i_rsq0%i
#$ -o /home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed/scripts/recode_vcf_topmed_chr%i_rsq%i_$JOB_ID.out
#$ -e /home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed/scripts/recode_vcf_topmed_chr%i_rsq%i-$JOB_ID.err
#$ -j y
#$ -q csg.q
#$ -S /bin/bash
cd ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed
module load Plink/2.00a

plink2 \
    --vcf topmed_chr%i_merged_168206ids_rsq0%i_dose.vcf.gz dosage=DS \
    --freq counts \
    --make-bpgen --sort-vars \
    --set-all-var-ids chr@:#:\$r:\$a \
    --new-id-max-allele-len 200 \
    --out topmed_chr%i_merged_168206ids_rsq0%i_dose

awk 'BEGIN {FS=" "; OFS=" "} {if(NR==1 || $5==0 || $6==0)print $2}' topmed_chr%i_merged_168206ids_rsq0%i_dose.acount > monomprphic_chr%i_rsq0%i_SNPs

'''%(i,j,i,j,i,j,i,j,i,j,i,j,i,j)
        f=open("/home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed/scripts/recode_vcf_topmed_chr"+str(i)+"_rsq0"+str(j)+".sh", 'w')
        f.write(script)
        f.close()

In [27]:
for i in /home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed/scripts/recode_vcf_topmed_chr*sh; 
    do qsub $i; 
done

Your job 8117996 ("recode_topmed_hrc_chr11_rsq03") has been submitted
Your job 8117997 ("recode_topmed_hrc_chr11_rsq08") has been submitted
Your job 8117998 ("recode_topmed_hrc_chr1_rsq03") has been submitted
Your job 8117999 ("recode_topmed_hrc_chr1_rsq08") has been submitted
Your job 8118000 ("recode_topmed_hrc_chr2_rsq03") has been submitted
Your job 8118001 ("recode_topmed_hrc_chr2_rsq08") has been submitted


## Annotate TOPMed

In [29]:
# annotate topmed
for i in list((1,2,11)):
    for j in list((3,8)):
        script='''#!/bin/sh
#$ -l h_rt=24:00:00
#$ -l h_vmem=30G
#$ -N annotate_hrc_chr%i_rsq0%i
#$ -o /home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed/scripts/annotate_topmed_chr%i_rsq%i_$JOB_ID.out
#$ -e /home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed/scripts/annotate_topmed_chr%i_rsq%i-$JOB_ID.err
#$ -j y
#$ -q csg.q
#$ -S /bin/bash
export PATH=$HOME/miniconda3/bin:$PATH
module load Singularity

sos run ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/notebooks/annovar.ipynb annovar \
    --build 'hg38' \
    --cwd ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed \
    --bim_name /home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed/topmed_chr%i_merged_168206ids_rsq0%i_dose.bim \
    --humandb /mnt/mfs/statgen/isabelle/REF/humandb  \
    --job_size 1 \
    --name_prefix topmed_chr%i_merged_168206ids_rsq0%i_dose_hg38 \
    --container_annovar /mnt/mfs/statgen/containers/gatk4-annovar.sif

'''%(i,j,i,j,i,j,i,j,i,j)
        f=open("/home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed/scripts/annotate_topmed_chr"+str(i)+"_rsq0"+str(j)+".sh", 'w')
        f.write(script)
        f.close()

In [30]:
cd ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed/scripts

qsub annotate_topmed_chr1_rsq03.sh
qsub annotate_topmed_chr1_rsq08.sh
qsub annotate_topmed_chr2_rsq03.sh
qsub annotate_topmed_chr2_rsq08.sh
qsub annotate_topmed_chr11_rsq03.sh
qsub annotate_topmed_chr11_rsq08.sh

Your job 8118002 ("annotate_hrc_chr1_rsq03") has been submitted
Your job 8118003 ("annotate_hrc_chr1_rsq08") has been submitted
Your job 8118004 ("annotate_hrc_chr2_rsq03") has been submitted
Your job 8118005 ("annotate_hrc_chr2_rsq08") has been submitted
Your job 8118006 ("annotate_hrc_chr11_rsq03") has been submitted
Your job 8118007 ("annotate_hrc_chr11_rsq08") has been submitted


In [31]:
## rename columns
library(dplyr)
library(data.table)

setwd("~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed")

for(chr in c(1,2,11)){
    for (rsq in c(3,8)){
        annot <- data.table::fread(sprintf("topmed_chr%d_merged_168206ids_rsq0%d_dose.hg38.hg38_multianno.csv", chr, rsq))
        bim <- data.table::fread(sprintf("topmed_chr%d_merged_168206ids_rsq0%d_dose.bim", chr, rsq))
        colnames(annot)[29:41] <- c("AF_genome",
                                    "AF_raw_genome",
                                    "AF_male_genome",
                                    "AF_female_genome",
                                    "AF_afr_genome",
                                    "AF_ami_genome",
                                    "AF_amr_genome",
                                    "AF_asj_genome",
                                    "AF_eas_genome",
                                    "AF_fin_genome",
                                    "AF_nfe_genome",
                                    "AF_oth_genome",
                                    "AF_sas_genome")
        colnames(annot)[42:54] <- c("AF_exome",
                                    "AF_popmax_exome",
                                    "AF_male_exome",
                                    "AF_female_exome",
                                    "AF_raw_exome",
                                    "AF_afr_exome",
                                    "AF_sas_exome",
                                    "AF_amr_exome",
                                    "AF_eas_exome",
                                    "AF_nfe_exome",
                                    "AF_fin_exome",
                                    "AF_asj_exome",
                                    "AF_oth_exome")
        annot <- annot %>% 
            mutate(AF_nfe_exome = as.numeric(AF_nfe_exome)) %>% 
            mutate(MAF_nfe_exome = ifelse(AF_nfe_exome > 0.5, 1 - AF_nfe_exome, AF_nfe_exome)) %>% 
            rename("ID_hg38" = "Otherinfo1") %>%
            mutate(ID = paste(Chr, Start, Ref, Alt, sep = ":")) %>%
            mutate(ID = paste0("chr", ID)) %>%
            select(Chr, Start, End, Ref, Alt, 
                   Func.refGene, Gene.refGene, ExonicFunc.refGene, 
                   MAF_nfe_exome, REVEL_score,
                   ID_hg38, ID, CADD_phred)
        data.table::fwrite(annot, sprintf("topmed_chr%d_rsq0%d_hg19_hg38_sel_col_annot.csv.gz", chr, rsq))
        }
}
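
The two derived columns in the cell above, the folded minor allele frequency and the `chr:pos:ref:alt` ID, can be illustrated with a short Python sketch. It assumes the chromosome value carries no `chr` prefix (as the `paste0("chr", ID)` call implies); the function names are hypothetical.

```python
def fold_to_maf(af):
    """Fold an allele frequency onto [0, 0.5]; a missing value stays missing."""
    if af is None:
        return None
    return 1 - af if af > 0.5 else af


def make_variant_id(chrom, start, ref, alt):
    """Build the chr<chrom>:<start>:<ref>:<alt> identifier."""
    return "chr" + ":".join([str(chrom), str(start), ref, alt])


# Made-up values for illustration only.
maf_high = fold_to_maf(0.75)
maf_low = fold_to_maf(0.2)
variant_id = make_variant_id(1, 12345, "A", "G")
```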

## Filter TOPMed

In [38]:
library(dplyr)
library(data.table)

setwd("~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed")

In [39]:
filter_df <- data.frame()  # summary rows are appended for each chr/rsq/maf combination below

for(chr in c(1, 2, 11)){
    for(rsq in c(3, 8)){
        for(maf in c(0.01, 0.005, 0.001)){
            maf_c <- gsub("\\.", "", as.character(maf))
            annot <- fread(sprintf("topmed_chr%d_rsq0%d_hg19_hg38_sel_col_annot.csv.gz", chr, rsq)) %>% select(-CADD_phred)
            mono_list <- fread(sprintf("monomprphic_chr%d_rsq0%d_SNPs", chr, rsq))$ID

            annot <-  annot %>% filter(Chr == chr)
            annot_maf <-  annot %>% 
                filter(!ID_hg38 %in% mono_list) %>%
                filter(is.na(MAF_nfe_exome) | MAF_nfe_exome < maf)

            annot_func <- annot_maf %>% 
                filter(Func.refGene %in% c("exonic", "splicing", "exonic;splicing")) %>%
                filter(ExonicFunc.refGene != 'unknown') %>% 
                filter(ExonicFunc.refGene != 'synonymous SNV' & ExonicFunc.refGene != 'nonframeshift substitution') %>%
                mutate(Function = ifelse(ExonicFunc.refGene == "nonsynonymous SNV", "missense", "")) %>%
                mutate(Function = ifelse(grepl("splicing", Func.refGene), "splicing", Function)) %>%
                mutate(Function = ifelse(ExonicFunc.refGene %in% c("stopgain", "stoploss", "startloss", "frameshift substitution"), "LoF", Function))

            annot_func <- annot_func %>% 
                mutate(cat = if_else(grepl(";", Gene.refGene) & Function == "splicing", 2, 1)) %>%
                tidyr::separate(Gene.refGene, c("Gene.refGene", "discard_1", "discard_2"), sep = ";") %>% 
                mutate(Gene.refGene = if_else(cat == 1, Gene.refGene, discard_1)) %>%
                select(-discard_1, -discard_2)

            gene_list <- annot_func %>% pull(Gene.refGene) %>% table() %>% as.data.frame() %>% filter(Freq > 1) %>% pull(1)
            annot_final <- annot_func %>% filter(Gene.refGene %in% gene_list)
            
#             data.table::fwrite(annot_func, sprintf("topmed_chr%d_rsq0%d_hg19_hg38_maf%s_LOF_missense_all_annot.csv.gz", chr, rsq, maf_c))
#             data.table::fwrite(annot_func %>% select(ID_hg38), sprintf("topmed_chr%d_rsq0%d_hg19_hg38_maf%s_LOF_missense_all_snplist", chr, rsq, maf_c), 
#                                sep = " ", col.names = FALSE)
            
#             data.table::fwrite(annot_final, sprintf("topmed_chr%d_rsq0%d_hg19_hg38_maf%s_LOF_missense_annot.csv.gz", chr, rsq, maf_c))
#             data.table::fwrite(annot_final %>% select(ID_hg38), sprintf("topmed_chr%d_rsq0%d_hg19_hg38_maf%s_LOF_missense_snplist", chr, rsq, maf_c), 
#                                sep = " ", col.names = FALSE)

            sub_df <- data.frame(data = "topmed", chromosome = chr, maf = maf, rsq = rsq/10,
                                 total_num_var = nrow(annot), maf_filtering_var = nrow(annot_maf),
                                 function_filtering_var = sprintf("%d (%d)", nrow(annot_func), length(unique(annot_func$Gene.refGene))),
                                 gene_filtering_var = sprintf("%d (%d)", nrow(annot_final), length(gene_list)))
            filter_df <- rbind(filter_df, sub_df)
        }
    }
}

“Expected 3 pieces. Additional pieces discarded in 69 rows [10636, 10641, 10644,
19423, 41961, 41966, 41968, 51279, 51282, 51285, 51286, 51291, 51292, 51293,
51294, 51296, 51314, 51315, 51324, 51328, ...].”
“Expected 3 pieces. Missing pieces filled with `NA` in 123569 rows [1, 2, 3, 4,
5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].”
“Expected 3 pieces. Additional pieces discarded in 69 rows [10560, 10565, 10568,
19291, 41669, 41674, 41676, 50929, 50931, 50934, 50935, 50940, 50941, 50942,
50943, 50945, 50963, 50964, 50973, 50977, ...].”
“Expected 3 pieces. Missing pieces filled with `NA` in 122730 rows [1, 2, 3, 4,
5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].”
“Expected 3 pieces. Additional pieces discarded in 66 rows [10252, 10257, 10260,
18754, 40558, 40563, 40565, 49589, 49591, 49594, 49595, 49599, 49600, 49601,
49602, 49604, 49620, 49621, 49630, 49634, ...].”
“Expected 3 pieces. Missing piece

ERROR: Error in .shallow(x, cols = cols, retain.key = TRUE): attempt to set index 0/0 in SET_VECTOR_ELT


In [40]:
filter_df

| data | chromosome | maf | rsq | total_num_var | maf_filtering_var | function_filtering_var | gene_filtering_var |
|---|---|---|---|---|---|---|---|
| `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<int>` | `<int>` | `<chr>` | `<chr>` |
| topmed | 1 | 0.01 | 0.3 | 11100489 | 11088608 | 123710 (2018) | 123687 (1995) |
| topmed | 1 | 0.005 | 0.3 | 11100489 | 11086291 | 122871 (2018) | 122848 (1995) |
| topmed | 1 | 0.001 | 0.3 | 11100489 | 11078008 | 119623 (2017) | 119600 (1994) |
| topmed | 1 | 0.01 | 0.8 | 4753189 | 4741493 | 44978 (1950) | 44940 (1912) |
| topmed | 1 | 0.005 | 0.8 | 4753189 | 4739284 | 44164 (1948) | 44126 (1910) |
| topmed | 1 | 0.001 | 0.8 | 4753189 | 4732089 | 41295 (1945) | 41257 (1907) |
| topmed | 2 | 0.01 | 0.3 | 11896025 | 11887469 | 86951 (1229) | 86940 (1218) |
| topmed | 2 | 0.005 | 0.3 | 11896025 | 11885858 | 86379 (1229) | 86368 (1218) |
| topmed | 2 | 0.001 | 0.3 | 11896025 | 11879793 | 84030 (1229) | 84018 (1217) |
| topmed | 2 | 0.01 | 0.8 | 5166353 | 5157890 | 33210 (1212) | 33194 (1196) |
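
The functional filter in the cell above labels each retained variant as missense, splicing or LoF, and the gene filter then keeps only genes that have more than one retained variant. A minimal Python sketch of those two rules follows, using the same ANNOVAR category strings as the R code. The function names and the gene names in the example are hypothetical.

```python
from collections import Counter

KEPT_REGIONS = {"exonic", "splicing", "exonic;splicing"}
DROPPED_EXONIC = {"unknown", "synonymous SNV", "nonframeshift substitution"}
LOF_EXONIC = {"stopgain", "stoploss", "startloss", "frameshift substitution"}


def classify(func, exonic_func):
    """Return 'missense', 'splicing', 'LoF' or '' for a retained variant,
    or None when the variant is filtered out."""
    if func not in KEPT_REGIONS or exonic_func in DROPPED_EXONIC:
        return None
    label = "missense" if exonic_func == "nonsynonymous SNV" else ""
    if "splicing" in func:
        label = "splicing"  # a splicing region overrides the missense label
    if exonic_func in LOF_EXONIC:
        label = "LoF"  # LoF takes precedence over both
    return label


def genes_with_multiple_variants(genes):
    """Return the set of genes that appear more than once."""
    counts = Counter(genes)
    return {gene for gene, n in counts.items() if n > 1}


# Hypothetical gene names, one entry per retained variant.
multi = genes_with_multiple_variants(["GENE_A", "GENE_A", "GENE_B"])
```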


## Subsetting for CADD score
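
The CADD filter used in this section keeps every LoF variant and keeps any other variant only when its CADD PHRED score is at least 20; variants with a missing score are dropped unless they are LoF. A minimal Python sketch of that rule, with a hypothetical function name and made-up example records:

```python
def passes_cadd_filter(function, phred, threshold=20.0):
    """LoF variants always pass; others need a PHRED score >= threshold."""
    if function == "LoF":
        return True
    return phred is not None and phred >= threshold


# Made-up (Function, PHRED) pairs for illustration only.
records = [("LoF", None), ("missense", 25.3), ("missense", 12.0), ("splicing", None)]
kept = [rec for rec in records if passes_cadd_filter(*rec)]
```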

In [41]:
for i in list((1,2,11)):
        script='''#!/bin/sh
#$ -l h_rt=48:00:00
#$ -l h_vmem=64G
#$ -N extract_filtered_maf001_chr%i
#$ -o /home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed/scripts/make_5col_vcf_maf001_chr%i_$JOB_ID.out
#$ -e /home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed/scripts/make_5col_vcf_maf001_chr%i_$JOB_ID.err
#$ -q csg.q
#$ -S /bin/bash

export PATH=$HOME/miniconda3/bin:$PATH
module load HTSLIB/1.17
module load Plink/2.00a
cd ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed

plink2 \
    --bpfile topmed_chr%i_merged_168206ids_rsq03_dose \
    --extract topmed_chr%i_rsq03_hg19_hg38_maf001_LOF_missense_all_snplist \
    --make-bpgen --sort-vars \
    --export vcf-4.2 vcf-dosage=DS bgz \
    --out topmed_chr%i_rsq03_maf001_LOF_missense_all_extracted

zcat topmed_chr%i_rsq03_maf001_LOF_missense_all_extracted.vcf.gz | cut -f-5 > topmed_chr%i_rsq03_maf001_LOF_missense_all_extracted_5col.vcf
bgzip topmed_chr%i_rsq03_maf001_LOF_missense_all_extracted_5col.vcf

'''%(i,i,i,i,i,i,i,i,i)
        f=open("/home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed/scripts/make_5col_vcf_maf001_chr"+str(i)+".sh", 'w')
        f.write(script)
        f.close()

In [1]:
cd /home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed/scripts/

qsub make_5col_vcf_maf001_chr11.sh
qsub make_5col_vcf_maf001_chr1.sh
qsub make_5col_vcf_maf001_chr2.sh

In [3]:
library(data.table)
library(dplyr)

setwd("~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed")

for(chr in c(1, 2, 11)){
    for(rsq in c(3, 8)){
        for(maf in c(0.01, 0.005, 0.001)){
            maf_c <- gsub("\\.", "", as.character(maf))
            annot <- fread(sprintf("topmed_chr%i_rsq0%s_hg19_hg38_maf001_LOF_missense_all_annot.csv.gz", chr, rsq))

            cadd_all <- fread(sprintf("GRCh38-v1.6_chr%i.tsv", chr), header = TRUE) %>% arrange(Pos) 
            colnames(cadd_all) <- c("Chr", "Start", "Ref", "Alt", "RawScore", "PHRED")

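            # Join on the variant key only; drop CADD columns left over from a previous run so they are not used as join keys
            annot_all <- annot %>% select(-any_of(c("RawScore", "PHRED"))) %>%
                left_join(cadd_all, by = c("Chr", "Start", "Ref", "Alt")) %>%
                filter(is.na(MAF_nfe_exome) | MAF_nfe_exome < maf)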
            annot_all_mis <- annot_all %>% filter(Function == "LoF")
            annot_all_cadd <- annot_all %>% filter(Function != "LoF") %>% filter(as.numeric(PHRED) >= 20)

            gene_list <- annot_all %>% pull(Gene.refGene) %>% table() %>% as.data.frame() %>% filter(Freq > 1) %>% pull(1)
            annot_final <- annot_all %>% filter(Gene.refGene %in% gene_list)
            annot_final_mis <- annot_final %>% filter(Function == "LoF")
            annot_final_cadd <- annot_final %>% filter(Function != "LoF" & as.numeric(PHRED) >= 20)

            ## >= 1 variant
            fwrite(annot_all, 
                   sprintf("topmed_chr%d_rsq0%d_hg19_hg38_maf%s_LOF_missense_all_annot.csv.gz", chr, rsq, maf_c), 
                   quote = FALSE)

            write.table(annot_all$ID_hg38, 
                        sprintf("topmed_chr%d_rsq0%d_hg19_hg38_maf%s_LOF_missense_all_snplist", chr, rsq, maf_c), 
                        col.names = FALSE, row.name = FALSE, quote = FALSE)

            ## >= 1 variant + CADD filtering
            fwrite(rbind(annot_all_mis, annot_all_cadd), 
                   sprintf("topmed_chr%d_rsq0%d_hg19_hg38_maf%s_LOF_missense_all_cadd_annot.csv.gz", chr, rsq, maf_c),  
                   quote = FALSE)

            write.table(rbind(annot_all_mis, annot_all_cadd) %>% pull(ID_hg38), 
                        sprintf("topmed_chr%d_rsq0%d_hg19_hg38_maf%s_LOF_missense_all_cadd_snplist", chr, rsq, maf_c), 
                        col.names = FALSE, row.name = FALSE, quote = FALSE)

            ## >= 2 variant
            fwrite(annot_final, 
                   sprintf("topmed_chr%d_rsq0%d_hg19_hg38_maf%s_LOF_missense_annot.csv.gz", chr, rsq, maf_c), 
                   quote = FALSE)

            write.table(annot_final$ID_hg38, 
                        sprintf("topmed_chr%d_rsq0%d_hg19_hg38_maf%s_LOF_missense_snplist", chr, rsq, maf_c), 
                        col.names = FALSE, row.name = FALSE, quote = FALSE)

            ## >= 2 variant + CADD filtering
            fwrite(rbind(annot_final_mis, annot_final_cadd), 
                   sprintf("topmed_chr%d_rsq0%d_hg19_hg38_maf%s_LOF_missense_cadd_annot.csv.gz", chr, rsq, maf_c), 
                   quote = FALSE)

            write.table(rbind(annot_final_mis, annot_final_cadd) %>% pull(ID_hg38), 
                        sprintf("topmed_chr%d_rsq0%d_hg19_hg38_maf%s_LOF_missense_cadd_snplist", chr, rsq, maf_c),
                        col.names = FALSE, row.name = FALSE, quote = FALSE)
        }
    }   
}


Joining with `by = join_by(Chr, Start, Ref, Alt, RawScore, PHRED)`
Joining with `by = join_by(Chr, Start, Ref, Alt, RawScore, PHRED)`
Joining with `by = join_by(Chr, Start, Ref, Alt, RawScore, PHRED)`
Joining with `by = join_by(Chr, Start, Ref, Alt, RawScore, PHRED)`
Joining with `by = join_by(Chr, Start, Ref, Alt, RawScore, PHRED)`
Joining with `by = join_by(Chr, Start, Ref, Alt, RawScore, PHRED)`
Joining with `by = join_by(Chr, Start, Ref, Alt, RawScore, PHRED)`
Joining with `by = join_by(Chr, Start, Ref, Alt, RawScore, PHRED)`
Joining with `by = join_by(Chr, Start, Ref, Alt, RawScore, PHRED)`
Joining with `by = join_by(Chr, Start, Ref, Alt, RawScore, PHRED)`
Joining with `by = join_by(Chr, Start, Ref, Alt, RawScore, PHRED)`
Joining with `by = join_by(Chr, Start, Ref, Alt, RawScore, PHRED)`
Joining with `by = join_by(Chr, Start, Ref, Alt, RawScore, PHRED)`
Joi

In [4]:
for i in list((1,2,11)):
    for j in list((3,8)):
        for k in list((("1", "05", "01"))):
            for c in list(("", "_cadd")):
                script='''#!/bin/sh
#$ -l h_rt=48:00:00
#$ -l h_vmem=64G
#$ -N extract_filtered_chr%i_rsq0%i_maf00%s%s
#$ -o /home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed/scripts/extract_filtered_chr%i_rsq0%i_maf00%s%s_$JOB_ID.out
#$ -e /home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed/scripts/extract_filtered_chr%i_rsq0%i_maf00%s%s_$JOB_ID.err
#$ -q csg.q
#$ -S /bin/bash
export PATH=$HOME/miniconda3/bin:$PATH
module load Plink/2.00a

cd ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed
plink2 \
    --bpfile topmed_chr%i_merged_168206ids_rsq0%i_dose \
    --extract topmed_chr%i_rsq0%i_hg19_hg38_maf00%s_LOF_missense%s_snplist \
    --make-bpgen --sort-vars \
    --out topmed_chr%i_rsq0%i_maf00%s_LOF_missense%s_extracted

'''%(i,j,k,c,i,j,k,c,i,j,k,c,i,j,i,j,k,c,i,j,k,c)
                f=open("/home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed/scripts/extract_filtered_chr"+str(i)+"_rsq0"+str(j)+"_maf00"+k+c+".sh", 'w')
                f.write(script)
                f.close()
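
The long positional `%` tuples used by the script generators are easy to miscount when a placeholder is added or removed. One alternative, sketched below under the assumption that only the four loop variables vary, is Python's named `%(name)s` formatting. The template is abbreviated (the scheduler directives other than `-N` are left out), the scripts are collected in a dictionary rather than written to disk, and the names `TEMPLATE`, `render` and `scripts` are hypothetical.

```python
# Abbreviated job template; each placeholder is filled by name, not position.
TEMPLATE = """#!/bin/sh
#$ -N extract_filtered_chr%(chrom)i_rsq0%(rsq)i_maf00%(maf)s%(cadd)s
plink2 \\
    --bpfile topmed_chr%(chrom)i_merged_168206ids_rsq0%(rsq)i_dose \\
    --extract topmed_chr%(chrom)i_rsq0%(rsq)i_hg19_hg38_maf00%(maf)s_LOF_missense%(cadd)s_snplist \\
    --make-bpgen --sort-vars \\
    --out topmed_chr%(chrom)i_rsq0%(rsq)i_maf00%(maf)s_LOF_missense%(cadd)s_extracted
"""


def render(chrom, rsq, maf, cadd):
    """Fill the template from a dictionary of named values."""
    return TEMPLATE % {"chrom": chrom, "rsq": rsq, "maf": maf, "cadd": cadd}


scripts = {}
for chrom in (1, 2, 11):
    for rsq in (3, 8):
        for maf in ("1", "05", "01"):
            for cadd in ("", "_cadd"):
                name = "extract_filtered_chr%i_rsq0%i_maf00%s%s.sh" % (chrom, rsq, maf, cadd)
                scripts[name] = render(chrom, rsq, maf, cadd)
```

Each value is supplied once per script, so adding another line that mentions the chromosome does not require extending an argument tuple.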

In [5]:
cd /home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed/scripts/
for i in extract_filtered_chr*sh; do qsub $i; done

Your job 8118150 ("extract_filtered_chr11_rsq03_maf0001_cadd") has been submitted
Your job 8118151 ("extract_filtered_chr11_rsq03_maf0001") has been submitted
Your job 8118152 ("extract_filtered_chr11_rsq03_maf0005_cadd") has been submitted
Your job 8118153 ("extract_filtered_chr11_rsq03_maf0005") has been submitted
Your job 8118154 ("extract_filtered_chr11_rsq03_maf001_cadd") has been submitted
Your job 8118155 ("extract_filtered_chr11_rsq03_maf001") has been submitted
Your job 8118156 ("extract_filtered_chr11_rsq08_maf0001_cadd") has been submitted
Your job 8118157 ("extract_filtered_chr11_rsq08_maf0001") has been submitted
Your job 8118158 ("extract_filtered_chr11_rsq08_maf0005_cadd") has been submitted
Your job 8118159 ("extract_filtered_chr11_rsq08_maf0005") has been submitted
Your job 8118160 ("extract_filtered_chr11_rsq08_maf001_cadd") has been submitted
Your job 8118161 ("extract_filtered_chr11_rsq08_maf001") has been submitted
Your job 8118162 ("extract_filtered_chr1_rsq03_maf

# Make Merged Dataset

We create two merged datasets, HRC_TOPMed and ES_HRC_TOPMed. For HRC_TOPMed, we compare the R-square of each variant between the two imputation panels and keep the copy with the higher R-square. For ES_HRC_TOPMed, we prioritize exome-sequenced variants, then fall back to whichever imputed variant has the higher R-square.
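
The HRC_TOPMed selection rule can be sketched in Python as follows. The sketch mirrors the `ifelse` logic used in the merge cells of this notebook: a missing R-square is treated as 0, and a tie is assigned to HRC. The function name `pick_source` and the example values are hypothetical.

```python
def pick_source(r2_hrc, r2_topmed):
    """Return (R2, source) for one variant seen in HRC and/or TOPMed."""
    hrc = 0.0 if r2_hrc is None else r2_hrc
    topmed = 0.0 if r2_topmed is None else r2_topmed
    if topmed > hrc:
        return topmed, "topmed"
    return hrc, "hrc"  # ties and HRC-only variants fall here


# Made-up R-square values for illustration only.
both = pick_source(0.4, 0.9)
hrc_only = pick_source(0.7, None)
tie = pick_source(0.5, 0.5)
```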

## HRC + TOPMed

### Obtain R2

Since we need to compare the individual R-square of each variant, we query it from the original HRC and TOPMed imputed VCF files.

In [6]:
for i in list((1,2,11)):
    for j in list((3,8)):
        script='''#!/bin/sh
#$ -l h_rt=24:00:00
#$ -l h_vmem=10G
#$ -N extract_topmed_rsq-%i-%i
#$ -o /home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc_topmed/scripts/extract_topmed_chr%i_rsq0%i_$JOB_ID.out
#$ -e /home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc_topmed/scripts/extract_topmed_chr%i_rsq0%i_$JOB_ID.err
#$ -q csg.q
#$ -S /bin/bash
export PATH=$HOME/miniconda3/bin:$PATH

cd ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/topmed

zcat topmed_chr%i_merged_168206ids_rsq0%i_dose.vcf.gz | cut -f-8 >> ../hrc_topmed/topmed_168206ids_chr%i_rsq0%i_rsq.txt
'''%(i,j,i,j,i,j,i,j,i,j)
        f=open("/home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc_topmed/scripts/extract_topmed_chr"+str(i)+"_rsq0"+str(j)+".sh", 'w')
        f.write(script)
        f.close()

In [7]:
for i in list((1,2,11)):
    for j in list((3,8)):
        script='''#!/bin/sh
#$ -l h_rt=24:00:00
#$ -l h_vmem=10G
#$ -N extract_hrc_rsq-%i-%i
#$ -o /home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc_topmed/scripts/extract_hrc_chr%i_rsq0%i_$JOB_ID.out
#$ -e /home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc_topmed/scripts/extract_hrc_chr%i_rsq0%i_$JOB_ID.err
#$ -q csg.q
#$ -S /bin/bash
export PATH=$HOME/miniconda3/bin:$PATH

cd ~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc

zcat hrc_chr%i_merged_168206ids_rsq0%i_dose.vcf.gz | cut -f-8 >> ../hrc_topmed/hrc_168206ids_chr%i_rsq0%i_rsq.txt
'''%(i,j,i,j,i,j,i,j,i,j)
        f=open("/home/tl3031/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc_topmed/scripts/extract_hrc_chr"+str(i)+"_rsq0"+str(j)+".sh", 'w')
        f.write(script)
        f.close()

In [8]:
for i in 1 2 11;do
    sed -i '1,19d' hrc_168206ids_chr${i}_rsq03_rsq.txt
    sed -i '1,19d' hrc_168206ids_chr${i}_rsq08_rsq.txt
    sed -i '1,19d' topmed_168206ids_chr${i}_rsq03_rsq.txt
    sed -i '1,19d' topmed_168206ids_chr${i}_rsq08_rsq.txt
done
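
The later cells read `*_rsq_formatted.txt` files that have `ID` and `R2` columns, but the step that produces them from the `*_rsq.txt` files is not shown in this notebook. The Python sketch below is one hypothetical way to do it. It assumes tab-separated VCF body lines with the variant ID in column 3 and an `R2=` entry in the INFO field (column 8); both function names and the example line are made up.

```python
def parse_r2(info):
    """Return the R2 value from a VCF INFO string, or None when absent."""
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        if key == "R2" and sep:
            return float(value)
    return None


def format_row(line):
    """Return (ID, R2) from one tab-separated VCF body line (first 8 columns)."""
    fields = line.rstrip("\n").split("\t")
    return fields[2], parse_r2(fields[7])


# Made-up line for illustration only.
row = format_row("chr1\t100\tchr1:100:A:G\tA\tG\t.\tPASS\tAF=0.01;MAF=0.01;R2=0.4;IMPUTED\n")
```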

### Make Merged Annotation

We merge chromosomes 1 and 2 into one file and process chromosome 11 separately, because variants from chromosomes 1 and 2 are needed for the follow-up simulation studies, whereas chromosome 11 is needed only for the APOC3 analysis.

#### Chromosome 1 + 2

In [9]:
library(dplyr)
library(data.table)

setwd("~/project/imputation-rvtest/analysis/imputation_aggregated_analysis")

In [10]:
for(chr in c(1,2,11)){
    rsq_df <- fread(sprintf("./hrc_topmed/hrc_168206ids_chr%d_rsq03_rsq_formatted.txt", chr))
    
    for(rsq in c(3,8)){
        for(maf in c(0.01, 0.005, 0.001)){
            maf_c <- gsub("\\.", "", as.character(maf))
            fname_out <- sprintf("./hrc_topmed/hrc_168206ids_chr%d_rsq0%d_maf%s_annot.csv.gz", chr, rsq, maf_c)
            annot <- fread(sprintf("./hrc/hrc_chr%d_rsq0%d_hg19_hg38_maf%s_LOF_missense_annot.csv.gz", chr, rsq, maf_c))
            rsq_maf <- rsq_df %>% filter(ID %in% annot$ID_hg19) %>% select(ID, R2)
            annot_rsq <- left_join(annot, rsq_maf, by = c("ID_hg19" = "ID"))
            fwrite(annot_rsq, fname_out)
        }
    } 
}

for(chr in c(1,2,11)){
    rsq_df <- fread(sprintf("./hrc_topmed/topmed_168206ids_chr%d_rsq03_rsq_formatted.txt", chr))
    
    for(rsq in c(3,8)){    
        for(maf in c(0.01, 0.005, 0.001)){
            maf_c <- gsub("\\.", "", as.character(maf))
            fname_out <- sprintf("./hrc_topmed/topmed_168206ids_chr%d_rsq0%d_maf%s_annot.csv.gz", chr, rsq, maf_c)
            annot <- fread(sprintf("./topmed/topmed_chr%d_rsq0%d_hg19_hg38_maf%s_LOF_missense_annot.csv.gz", chr, rsq, maf_c))
            rsq_maf <- rsq_df %>% filter(ID %in% annot$ID_hg38) %>% select(ID, R2)
            annot_rsq <- left_join(annot, rsq_maf, by = c("ID_hg38" = "ID"))
            fwrite(annot_rsq, fname_out)
        }
    } 
}

In [12]:
setwd("~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc_topmed")

for(rsq in c(3, 8)){
    for(maf in c(0.01, 0.005, 0.001)){
        maf_c <- gsub("\\.", "", as.character(maf))
        hrc_chr1 <- fread(sprintf("./hrc_168206ids_chr1_rsq0%d_maf%s_annot.csv.gz", rsq, maf_c)) %>% rename("R2_hrc" = R2) %>% mutate(R2_hrc = as.numeric(R2_hrc))
        hrc_chr2 <- fread(sprintf("./hrc_168206ids_chr2_rsq0%d_maf%s_annot.csv.gz", rsq, maf_c)) %>% rename("R2_hrc" = R2) %>% mutate(R2_hrc = as.numeric(R2_hrc))

        topmed_chr1 <- fread(sprintf("./topmed_168206ids_chr1_rsq0%d_maf%s_annot.csv.gz", rsq, maf_c)) %>% rename("R2_topmed" = R2) %>% mutate(R2_topmed = as.numeric(R2_topmed))
        topmed_chr2 <- fread(sprintf("./topmed_168206ids_chr2_rsq0%d_maf%s_annot.csv.gz", rsq, maf_c)) %>% rename("R2_topmed" = R2) %>% mutate(R2_topmed = as.numeric(R2_topmed))

        hrc <- rbind(hrc_chr1, hrc_chr2)
        topmed <- rbind(topmed_chr1, topmed_chr2)

        topmed_hrc <- full_join(hrc, topmed, 
                                by = c("Chr", "Start", "End", "Ref", "Alt", 
                                       "Func.refGene", "Gene.refGene", "ExonicFunc.refGene", 
                                       "MAF_nfe_exome", "REVEL_score", "Function", 
                                       "ID_hg38", "ID")) %>%
                    select(-ID_hg19.y) %>% 
                    rename(ID_hg19 = ID_hg19.x)%>%
                    mutate(REVEL_score = as.numeric(REVEL_score),
                           R2_hrc = tidyr::replace_na(as.numeric(R2_hrc), 0),
                           R2_topmed = tidyr::replace_na(R2_topmed, 0)) %>%
                    mutate(R2 = ifelse(R2_topmed >= R2_hrc, R2_topmed, R2_hrc),
                           source = ifelse(R2_topmed > R2_hrc, "topmed", "hrc")) %>%
                    mutate(RawScore = ifelse(source == "hrc", RawScore.x, RawScore.y),
                           PHRED = ifelse(source == "hrc", PHRED.x, PHRED.y)) %>%
                    select(-c(RawScore.x, RawScore.y, PHRED.x, PHRED.y))

        topmed_hrc_lof <- topmed_hrc %>% filter(Function == "LoF")
        topmed_hrc_mis <- topmed_hrc %>% filter(Function != "LoF" & as.numeric(PHRED) >= 20)

        topmed_hrc %>% fwrite(sprintf("./hrc_topmed_168206ids_rsq0%d_maf%s_annot.csv.gz", rsq, maf_c))
        rbind(topmed_hrc_lof, topmed_hrc_mis) %>% fwrite(sprintf("./hrc_topmed_168206ids_rsq0%d_maf%s_cadd_annot.csv.gz", rsq, maf_c))
    }
}

ℹ In argument: `REVEL_score = as.numeric(REVEL_score)`.
! NAs introduced by coercion
ℹ In argument: `REVEL_score = as.numeric(REVEL_score)`.
! NAs introduced by coercion
ℹ In argument: `REVEL_score = as.numeric(REVEL_score)`.
! NAs introduced by coercion
ℹ In argument: `REVEL_score = as.numeric(REVEL_score)`.
! NAs introduced by coercion
ℹ In argument: `REVEL_score = as.numeric(REVEL_score)`.
! NAs introduced by coercion
ℹ In argument: `REVEL_score = as.numeric(REVEL_score)`.
! NAs introduced by coercion


In [13]:
setwd("~/project/imputation-rvtest/analysis/imputation_aggregated_analysis")

for(chr in c(1,2,11)){
    rsq_df <- fread(sprintf("./hrc_topmed/hrc_168206ids_chr%d_rsq03_rsq_formatted.txt", chr))
    
    for(rsq in c(3,8)){
        for(maf in c(0.01, 0.005, 0.001)){
            maf_c <- gsub("\\.", "", as.character(maf))
            fname_out <- sprintf("./hrc_topmed/hrc_168206ids_chr%d_rsq0%d_maf%s_all_annot.csv.gz", chr, rsq, maf_c)
            annot <- fread(sprintf("./hrc/hrc_chr%d_rsq0%d_hg19_hg38_maf%s_LOF_missense_all_annot.csv.gz", chr, rsq, maf_c))
            rsq_maf <- rsq_df %>% filter(ID %in% annot$ID_hg19) %>% select(ID, R2)
            annot_rsq <- left_join(annot, rsq_maf, by = c("ID_hg19" = "ID"))
            print(sprintf("annot %i; rsq_maf %i; annot_rsq %i", nrow(annot), nrow(rsq_maf), nrow(annot_rsq)))
            fwrite(annot_rsq, fname_out)
        }
    } 
}

for(chr in c(1,2,11)){
    rsq_df <- fread(sprintf("./hrc_topmed/topmed_168206ids_chr%d_rsq03_rsq_formatted.txt", chr))
    
    for(rsq in c(3,8)){    
        for(maf in c(0.01, 0.005, 0.001)){
            maf_c <- gsub("\\.", "", as.character(maf))
            fname_out <- sprintf("./hrc_topmed/topmed_168206ids_chr%d_rsq0%d_maf%s_all_annot.csv.gz", chr, rsq, maf_c)
            annot <- fread(sprintf("./topmed/topmed_chr%d_rsq0%d_hg19_hg38_maf%s_LOF_missense_all_annot.csv.gz", chr, rsq, maf_c))
            rsq_maf <- rsq_df %>% filter(ID %in% annot$ID_hg38) %>% select(ID, R2)
            annot_rsq <- left_join(annot, rsq_maf, by = c("ID_hg38" = "ID"))
            print(sprintf("annot %i; rsq_maf %i; annot_rsq %i", nrow(annot), nrow(rsq_maf), nrow(annot_rsq)))
            fwrite(annot_rsq, fname_out)
        }
    } 
}

[1] "annot 20656; rsq_maf 20656; annot_rsq 20656"
[1] "annot 19861; rsq_maf 19861; annot_rsq 19861"
[1] "annot 16746; rsq_maf 16746; annot_rsq 16746"
[1] "annot 8709; rsq_maf 8709; annot_rsq 8709"
[1] "annot 7996; rsq_maf 7996; annot_rsq 7996"
[1] "annot 5655; rsq_maf 5655; annot_rsq 5655"
[1] "annot 14733; rsq_maf 14733; annot_rsq 14733"
[1] "annot 14184; rsq_maf 14184; annot_rsq 14184"
[1] "annot 11928; rsq_maf 11928; annot_rsq 11928"
[1] "annot 6402; rsq_maf 6402; annot_rsq 6402"
[1] "annot 5904; rsq_maf 5904; annot_rsq 5904"
[1] "annot 4172; rsq_maf 4172; annot_rsq 4172"
[1] "annot 13447; rsq_maf 13447; annot_rsq 13447"
[1] "annot 12927; rsq_maf 12927; annot_rsq 12927"
[1] "annot 10962; rsq_maf 10962; annot_rsq 10962"
[1] "annot 5992; rsq_maf 5992; annot_rsq 5992"
[1] "annot 5517; rsq_maf 5517; annot_rsq 5517"
[1] "annot 4029; rsq_maf 4029; annot_rsq 4029"
[1] "annot 123710; rsq_maf 123710; annot_rsq 123710"
[1] "annot 122871; rsq_maf 122871; annot_rsq 122871"
[1] "annot 119623; rs

In [14]:
setwd("~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc_topmed")
for(rsq in c(3, 8)){
    for(maf in c(0.01, 0.005, 0.001)){
        maf_c <- gsub("\\.", "", as.character(maf))
        hrc_chr1 <- fread(sprintf("./hrc_168206ids_chr1_rsq0%d_maf%s_all_annot.csv.gz", rsq, maf_c)) %>% rename("R2_hrc" = R2) %>% mutate(R2_hrc = as.numeric(R2_hrc))
        hrc_chr2 <- fread(sprintf("./hrc_168206ids_chr2_rsq0%d_maf%s_all_annot.csv.gz", rsq, maf_c)) %>% rename("R2_hrc" = R2) %>% mutate(R2_hrc = as.numeric(R2_hrc))

        topmed_chr1 <- fread(sprintf("./topmed_168206ids_chr1_rsq0%d_maf%s_all_annot.csv.gz", rsq, maf_c)) %>% rename("R2_topmed" = R2) %>% mutate(R2_topmed = as.numeric(R2_topmed))
        topmed_chr2 <- fread(sprintf("./topmed_168206ids_chr2_rsq0%d_maf%s_all_annot.csv.gz", rsq, maf_c)) %>% rename("R2_topmed" = R2) %>% mutate(R2_topmed = as.numeric(R2_topmed))

        hrc <- rbind(hrc_chr1, hrc_chr2)
        topmed <- rbind(topmed_chr1, topmed_chr2)

        topmed_hrc <- full_join(hrc, topmed, 
                                by = c("Chr", "Start", "End", "Ref", "Alt", 
                                       "Func.refGene", "Gene.refGene", "ExonicFunc.refGene", 
                                       "MAF_nfe_exome", "REVEL_score", "Function", 
                                       "ID_hg38", "ID")) %>%
                    select(-ID_hg19.y) %>% 
                    rename(ID_hg19 = ID_hg19.x)%>%
                    mutate(REVEL_score = as.numeric(REVEL_score),
                           R2_hrc = tidyr::replace_na(as.numeric(R2_hrc), 0),
                           R2_topmed = tidyr::replace_na(R2_topmed, 0)) %>%
                    mutate(R2 = ifelse(R2_topmed >= R2_hrc, R2_topmed, R2_hrc),
                           source = ifelse(R2_topmed > R2_hrc, "topmed", "hrc")) %>%
                    mutate(RawScore = ifelse(source == "hrc", RawScore.x, RawScore.y),
                           PHRED = ifelse(source == "hrc", PHRED.x, PHRED.y)) %>%
                    select(-c(RawScore.x, RawScore.y, PHRED.x, PHRED.y))

        topmed_hrc_lof <- topmed_hrc %>% filter(Function == "LoF")
        topmed_hrc_mis <- topmed_hrc %>% filter(Function != "LoF" & as.numeric(PHRED) >= 20)

        topmed_hrc %>% fwrite(sprintf("./hrc_topmed_168206ids_rsq0%d_maf%s_all_annot.csv.gz", rsq, maf_c))
        rbind(topmed_hrc_lof, topmed_hrc_mis) %>% fwrite(sprintf("./hrc_topmed_168206ids_rsq0%d_maf%s_cadd_all_annot.csv.gz", rsq, maf_c))
    }
}

ℹ In argument: `REVEL_score = as.numeric(REVEL_score)`.
! NAs introduced by coercion
ℹ In argument: `REVEL_score = as.numeric(REVEL_score)`.
! NAs introduced by coercion
ℹ In argument: `REVEL_score = as.numeric(REVEL_score)`.
! NAs introduced by coercion
ℹ In argument: `REVEL_score = as.numeric(REVEL_score)`.
! NAs introduced by coercion
ℹ In argument: `REVEL_score = as.numeric(REVEL_score)`.
! NAs introduced by coercion
ℹ In argument: `REVEL_score = as.numeric(REVEL_score)`.
! NAs introduced by coercion


#### Chromosome 11

In [15]:
setwd("~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/hrc_topmed")

for(rsq in c(3, 8)){
    for(maf in c(0.01, 0.005, 0.001)){
        maf_c <- gsub("\\.", "", as.character(maf))
        hrc <- fread(sprintf("./hrc_168206ids_chr11_rsq0%d_maf%s_annot.csv.gz", rsq, maf_c)) %>% rename("R2_hrc" = R2) %>% mutate(R2_hrc = as.numeric(R2_hrc))
        topmed <- fread(sprintf("./topmed_168206ids_chr11_rsq0%d_maf%s_annot.csv.gz", rsq, maf_c)) %>% rename("R2_topmed" = R2) %>% mutate(R2_topmed = as.numeric(R2_topmed))
        
        topmed_hrc <- full_join(hrc, topmed, 
                                by = c("Chr", "Start", "End", "Ref", "Alt", 
                                       "Func.refGene", "Gene.refGene", "ExonicFunc.refGene", 
                                       "MAF_nfe_exome", "REVEL_score", "Function", 
                                       "ID_hg38", "ID")) %>%
                    select(-ID_hg19.y) %>% 
                    rename(ID_hg19 = ID_hg19.x) %>%
                    mutate(REVEL_score = as.numeric(REVEL_score),
                           R2_hrc = tidyr::replace_na(as.numeric(R2_hrc), 0),
                           R2_topmed = tidyr::replace_na(R2_topmed, 0)) %>%
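                    # Keep the larger R2; on a tie the value is identical and the label falls to "hrc".
                    mutate(R2 = ifelse(R2_topmed >= R2_hrc, R2_topmed, R2_hrc),
                           source = ifelse(R2_topmed > R2_hrc, "topmed", "hrc")) %>%
                    # .x columns come from the HRC table, .y columns from the TOPMed table.
                    mutate(RawScore = ifelse(source == "hrc", RawScore.x, RawScore.y),
                           PHRED = ifelse(source == "hrc", PHRED.x, PHRED.y)) %>%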
                    select(-c(RawScore.x, RawScore.y, PHRED.x, PHRED.y))

        topmed_hrc_lof <- topmed_hrc %>% filter(Function == "LoF")
        topmed_hrc_mis <- topmed_hrc %>% filter(Function != "LoF" & as.numeric(PHRED) >= 20)

        topmed_hrc %>% fwrite(sprintf("./hrc_topmed_168206ids_chr11_rsq0%d_maf%s_annot.csv.gz", rsq, maf_c))
        rbind(topmed_hrc_lof, topmed_hrc_mis) %>% fwrite(sprintf("./hrc_topmed_168206ids_chr11_rsq0%d_maf%s_cadd_annot.csv.gz", rsq, maf_c))
    }
}

ℹ In argument: `REVEL_score = as.numeric(REVEL_score)`.
! NAs introduced by coercion


## Exome + HRC + TOPMed

### Make merged annotation
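
The cells in this section give exome calls priority: a variant present in the exome table is labelled `exome`, and only variants absent from it keep their HRC/TOPMed label. The following is a minimal sketch of that labelling rule; the `imputed` and `exome_toy` tables are hypothetical and carry only the columns needed to show the idea.

```r
library(dplyr)

# Hypothetical toy tables, not taken from the real annotation files.
imputed <- data.frame(ID = c("v1", "v2"),
                      R2 = c(0.90, 0.85),
                      source = c("hrc", "topmed"),
                      stringsAsFactors = FALSE)
exome_toy <- data.frame(ID = c("v2", "v3"), stringsAsFactors = FALSE)

labelled <- full_join(imputed %>% rename(source_hrc_topmed = source),
                      # R2_exome = 999 is a sentinel meaning "present in the exome data".
                      exome_toy %>% mutate(R2_exome = 999),
                      by = "ID") %>%
    # Exome wins whenever the variant was sequenced; otherwise keep the imputation label.
    mutate(source = ifelse(is.na(R2_exome), source_hrc_topmed, "exome")) %>%
    arrange(ID)

print(labelled)
```

Here `v2` appears in both tables and is relabelled `exome`, `v3` is exome-only, and `v1` keeps its `hrc` label.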

#### Chromosome 1 + 2

In [16]:
library(dplyr)
library(data.table)

setwd("~/project/imputation-rvtest/analysis/imputation_aggregated_analysis/")

In [23]:
for(maf in c(0.01, 0.005, 0.001)){
    maf_c <- gsub("\\.", "", as.character(maf))
    
    exome_chr1 <- fread(sprintf("./exome/ukb23156_c1.merged.filtered.hg38.hg38_multianno_formatted_sel_col_maf%s_LOF_missense_cadd.csv.gz", maf_c))
    exome_chr2 <- fread(sprintf("./exome/ukb23156_c2.merged.filtered.hg38.hg38_multianno_formatted_sel_col_maf%s_LOF_missense_cadd.csv.gz", maf_c))
    exome <- rbind(exome_chr1, exome_chr2) %>% mutate(REVEL_score = as.numeric(REVEL_score))
    
    for(rsq in c(3, 8)){
        topmed_hrc <- fread(sprintf("./hrc_topmed/hrc_topmed_168206ids_rsq0%d_maf%s_annot.csv.gz", rsq, maf_c)) %>% mutate(REVEL_score = as.numeric(REVEL_score))
        
        full_annot <- 
            full_join(topmed_hrc %>% select(-ID_hg38) %>% rename("source_hrc_topmed" = source), 
              exome %>% mutate(R2_exome = 999), by = c("Chr", "Start", "End", "Ref", "Alt", 
                                                       "Func.refGene", "Gene.refGene", "ExonicFunc.refGene", 
                                                       "MAF_nfe_exome", "REVEL_score", "Function", 
                                                       "ID")) %>%
                mutate(source = ifelse(is.na(R2_exome), source_hrc_topmed, "exome")) %>% 
                select(Chr, Start, End, Ref, Alt, Func.refGene, Gene.refGene, ExonicFunc.refGene, Function,
                       MAF_nfe_exome, REVEL_score, R2, R2_hrc, R2_topmed, R2_exome, ID, ID_hg38, ID_hg19, source, 
                       RawScore.x, RawScore.y, PHRED.x, PHRED.y)  %>%
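                # After this join, .x columns come from topmed_hrc and .y columns from exome,
                # so only exome-sourced variants should read the .y scores.
                mutate(RawScore = ifelse(source == "exome", RawScore.y, RawScore.x),
                       PHRED = ifelse(source == "exome", PHRED.y, PHRED.x)) %>%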
                select(-c(RawScore.x, RawScore.y, PHRED.x, PHRED.y))
        
        annot_lof <- full_annot %>% filter(Function == "LoF")
        annot_mis <- full_annot %>% filter(Function != "LoF" & as.numeric(PHRED) >= 20)
        
        rbind(annot_lof, annot_mis) %>% fwrite(sprintf("./hrc_topmed_exome/hrc_topmed_exome_168206ids_rsq0%d_maf%s_cadd_annot.csv.gz", rsq, maf_c))
        full_annot %>% fwrite(sprintf("./hrc_topmed_exome/hrc_topmed_exome_168206ids_rsq0%d_maf%s_annot.csv.gz", rsq, maf_c))
    } 
}

ℹ In argument: `REVEL_score = as.numeric(REVEL_score)`.
! NAs introduced by coercion


In [24]:
for(maf in c(0.01, 0.005, 0.001)){
    maf_c <- gsub("\\.", "", as.character(maf))
    
    exome_chr1 <- fread(sprintf("./exome/ukb23156_c1.merged.filtered.hg38.hg38_multianno_formatted_sel_col_maf%s_LOF_missense_all_cadd.csv.gz", maf_c))
    exome_chr2 <- fread(sprintf("./exome/ukb23156_c2.merged.filtered.hg38.hg38_multianno_formatted_sel_col_maf%s_LOF_missense_all_cadd.csv.gz", maf_c))
    exome <- rbind(exome_chr1, exome_chr2) %>% mutate(REVEL_score = as.numeric(REVEL_score))

    for(rsq in c(3, 8)){
        topmed_hrc <- fread(sprintf("./hrc_topmed/hrc_topmed_168206ids_rsq0%d_maf%s_all_annot.csv.gz", rsq, maf_c)) %>% mutate(REVEL_score = as.numeric(REVEL_score))

        full_annot <- 
            full_join(topmed_hrc %>% select(-ID_hg38) %>% rename("source_hrc_topmed" = source), 
              exome %>% mutate(R2_exome = 999), by = c("Chr", "Start", "End", "Ref", "Alt", 
                                                       "Func.refGene", "Gene.refGene", "ExonicFunc.refGene", 
                                                       "MAF_nfe_exome", "REVEL_score", "Function", 
                                                       "ID")) %>%
                mutate(source = ifelse(is.na(R2_exome), source_hrc_topmed, "exome")) %>% 
                select(Chr, Start, End, Ref, Alt, Func.refGene, Gene.refGene, ExonicFunc.refGene, Function,
                       MAF_nfe_exome, REVEL_score, R2, R2_hrc, R2_topmed, R2_exome, ID, ID_hg38, ID_hg19, source, 
                       RawScore.x, RawScore.y, PHRED.x, PHRED.y)  %>%
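                # After this join, .x columns come from topmed_hrc and .y columns from exome,
                # so only exome-sourced variants should read the .y scores.
                mutate(RawScore = ifelse(source == "exome", RawScore.y, RawScore.x),
                       PHRED = ifelse(source == "exome", PHRED.y, PHRED.x)) %>%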
                select(-c(RawScore.x, RawScore.y, PHRED.x, PHRED.y))
        
        annot_lof <- full_annot %>% filter(Function == "LoF")
        annot_mis <- full_annot %>% filter(Function != "LoF" & as.numeric(PHRED) >= 20)
        
        rbind(annot_lof, annot_mis) %>% fwrite(sprintf("./hrc_topmed_exome/hrc_topmed_exome_168206ids_rsq0%d_maf%s_cadd_all_annot.csv.gz", rsq, maf_c))
        full_annot %>% fwrite(sprintf("./hrc_topmed_exome/hrc_topmed_exome_168206ids_rsq0%d_maf%s_all_annot.csv.gz", rsq, maf_c))
    } 
}

ℹ In argument: `REVEL_score = as.numeric(REVEL_score)`.
! NAs introduced by coercion


#### Chromosome 11
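
The `NAs introduced by coercion` warnings printed by these cells come from `as.numeric(REVEL_score)` meeting entries that are not numbers. The following sketch shows one way to make the conversion explicit. It assumes the non-numeric entries are a `.` placeholder, which is an assumption about the annotation files rather than something the notebook states; the `annot_toy` table is hypothetical.

```r
library(dplyr)

# Hypothetical toy column; "." as the missing-value placeholder is an assumption.
annot_toy <- data.frame(ID = c("v1", "v2", "v3"),
                        REVEL_score = c("0.812", ".", "0.105"),
                        stringsAsFactors = FALSE)

clean <- annot_toy %>%
    # Turn the placeholder into a real NA first, so as.numeric() has nothing to warn about.
    mutate(REVEL_score = as.numeric(na_if(REVEL_score, ".")))

print(clean)
```

Converting the placeholder explicitly leaves the warning free to flag genuinely unexpected values instead of firing on every run.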

In [26]:
for(maf in c(0.01, 0.005, 0.001)){
    maf_c <- gsub("\\.", "", as.character(maf))

    exome <- fread(sprintf("./exome/ukb23156_c11.merged.filtered.hg38.hg38_multianno_formatted_sel_col_maf%s_LOF_missense_cadd.csv.gz", maf_c))
    exome <- exome %>% mutate(REVEL_score = as.numeric(REVEL_score))
    
    for(rsq in c(3, 8)){
        topmed_hrc <- fread(sprintf("./hrc_topmed/hrc_topmed_168206ids_chr11_rsq0%d_maf%s_annot.csv.gz", rsq, maf_c)) %>% mutate(REVEL_score = as.numeric(REVEL_score))
    
        full_annot <- 
            full_join(topmed_hrc %>% select(-ID_hg38) %>% rename("source_hrc_topmed" = source), 
              exome %>% mutate(R2_exome = 999), by = c("Chr", "Start", "End", "Ref", "Alt", 
                                                       "Func.refGene", "Gene.refGene", "ExonicFunc.refGene", 
                                                       "MAF_nfe_exome", "REVEL_score", "Function", 
                                                       "ID")) %>%
                mutate(source = ifelse(is.na(R2_exome), source_hrc_topmed, "exome")) %>% 
                select(Chr, Start, End, Ref, Alt, Func.refGene, Gene.refGene, ExonicFunc.refGene, Function,
                       MAF_nfe_exome, REVEL_score, R2, R2_hrc, R2_topmed, R2_exome, ID, ID_hg38, ID_hg19, source, 
                       RawScore.x, RawScore.y, PHRED.x, PHRED.y)  %>%
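                # After this join, .x columns come from topmed_hrc and .y columns from exome,
                # so only exome-sourced variants should read the .y scores.
                mutate(RawScore = ifelse(source == "exome", RawScore.y, RawScore.x),
                       PHRED = ifelse(source == "exome", PHRED.y, PHRED.x)) %>%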
                select(-c(RawScore.x, RawScore.y, PHRED.x, PHRED.y))
        
        annot_lof <- full_annot %>% filter(Function == "LoF")
        annot_mis <- full_annot %>% filter(Function != "LoF" & as.numeric(PHRED) >= 20)
        
        rbind(annot_lof, annot_mis) %>% fwrite(sprintf("./hrc_topmed_exome/hrc_topmed_exome_168206ids_chr11_rsq0%d_maf%s_cadd_annot.csv.gz", rsq, maf_c))
        full_annot %>% fwrite(sprintf("./hrc_topmed_exome/hrc_topmed_exome_168206ids_chr11_rsq0%d_maf%s_annot.csv.gz", rsq, maf_c))
    } 
}

ℹ In argument: `REVEL_score = as.numeric(REVEL_score)`.
! NAs introduced by coercion
