# Imputation with QUILT

## Step 1: Prepare reference panel

Note: this step illustrates how to prepare the reference panel for QUILT; we will not execute this code.

From your QC'd reference panel (e.g. the 1000 Genomes dataset), make haplotype and legend files:

In [None]:
bcftools convert --haplegendsample reference_panel reference_panel.vcf.gz

Next, create the reference files with QUILT. You also need a genetic map for this. See: https://github.com/rwdavies/QUILT/blob/master/README_QUILT1.md

In [None]:
./QUILT_prepare_reference.R \
--outputdir=/path_to_store_reference/ref_${chr} \
--tempdir=./temp_QUILT \
--chr=${chr} --regionStart=${begin} --regionEnd=${last} \
--buffer=250000 --nGen=100 \
--reference_haplotype_file=/path_to_haplo/${chr}_highcov_1KG.hap.gz \
--reference_legend_file=/path_to_legend/${chr}_highcov_1KG.legend.gz \
--genetic_map_file=/path_to_genetic_map/${chr}_genetic_map.map.gz
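To make the role of `--buffer` concrete: according to the QUILT documentation, imputation is run from `regionStart` minus the buffer to `regionEnd` plus the buffer, while results are reported only for `regionStart` to `regionEnd`. The sketch below works through that arithmetic for a made-up window. The function name `buffered_region`, the example coordinates and the clamping of the start at position 1 are assumptions for illustration, not QUILT's own code.

```shell
#!/bin/bash
# Illustrative only: interval covered when a buffer is added on both sides
# of a window. Clamping the start at 1 is an assumption of this sketch.
buffered_region() {
    local start=$1 end=$2 buffer=$3
    local buffered_start=$(( start - buffer ))
    local buffered_end=$(( end + buffer ))
    if (( buffered_start < 1 )); then
        buffered_start=1
    fi
    echo "${buffered_start} ${buffered_end}"
}

# Hypothetical window 5,000,001-10,000,000 with the 250,000 bp buffer used above
read -r run_start run_end <<< "$(buffered_region 5000001 10000000 250000)"
echo "reported: 5000001-10000000, run over: ${run_start}-${run_end}"
```

Neighbouring windows therefore share flanking data during imputation, while each position is reported by only one window.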

## Step 2: Imputation

The job script below performs the imputation across 21 windows, each spanning 5 Mb. After the script, we go through the code step by step.

In [None]:
#!/bin/bash
#SBATCH -J QUILT
#SBATCH --time=24:00:00
#SBATCH --array 1-21%5
#SBATCH --ntasks=5
#SBATCH --mem=10g
#SBATCH -A ealloc_e7679_project1-tk-echo
#SBATCH --nodes=1

your_user="" #fill in your user name here

export PATH="/gpfs/helios/home/etais/${your_user}/miniconda3/bin:${PATH}"
source activate quilt2

mkdir -p QUILT_output
mkdir -p temp_files_QUILT

windows_file="/gpfs/helios/projects/echo_workshops/project.1.tk/data/QUILT_files/windows_to_be_imputed"
index=${SLURM_ARRAY_TASK_ID}
read chr start end window <<< $(head -n ${index} ${windows_file} | tail -n 1)

bamlist="/gpfs/helios/projects/echo_workshops/project.1.tk/data/QUILT_files/bam_list"
output_dir="/gpfs/helios/home/etais/${your_user}/QUILT_output"
ref_dir="/gpfs/helios/projects/echo_workshops/project.1.tk/data/QUILT_files/ref"
pos_dir="/gpfs/helios/projects/echo_workshops/project.1.tk/data/QUILT_files/posfiles"

./QUILT/QUILT.R --outputdir=${ref_dir}/ref_${chr} \
--output_filename=${output_dir}/${chr}.${window}.${start}.${end}.vcf.gz \
--tempdir=/gpfs/helios/home/etais/${your_user}/temp_files_QUILT/temp.${chr} \
--chr=${chr} \
--regionStart=${start} \
--regionEnd=${end} \
--buffer=250000 \
--bamlist=${bamlist} \
--posfile=${pos_dir}/posfile_${chr}

If you haven't already, install the conda environment for this session. You can find the environment file on our GitHub page: https://github.com/lm-ut/Workshop_25/ or in this folder: /gpfs/helios/projects/echo_workshops/project.1.tk/conda_env

Download the quilt2.yml file. If this is the first time you are installing conda on this server, first download and install miniconda:

In [None]:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

Download the QUILT github repository:

In [None]:
git clone --recursive https://github.com/rwdavies/QUILT.git

Finally, create the quilt2 conda environment:

In [None]:
conda env create -f quilt2.yml

Let's also create some folders to store the temporary files, and the QUILT output:

In [None]:
mkdir -p QUILT_output
mkdir -p temp_files_QUILT

After installing the conda environment, you can activate it in the Slurm script as follows:

In [None]:
export PATH="/gpfs/helios/home/etais/${your_user}/miniconda3/bin:${PATH}"
source activate quilt2
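Because `your_user` starts out empty in the job script, every path built from it silently points to the wrong place if you forget to fill it in. The sketch below shows one way to catch this early with bash's `${var:?message}` expansion. The helper `check_user` and the placeholder name `alice` are hypothetical and not part of the workshop script.

```shell
#!/bin/bash
# Illustrative guard (not part of the original job script)
check_user() {
    local your_user="$1"
    # ${var:?msg} fails when var is unset or empty; the subshell keeps that
    # failure from ending the calling shell in this demo
    ( : "${your_user:?your_user is empty, fill in your user name}" )
}

if check_user "alice"; then            # "alice" is a placeholder user name
    echo "user name set"
fi
if ! check_user "" 2>/dev/null; then   # an empty name is rejected
    echo "empty user name detected"
fi
```

In a real job script, the line `: "${your_user:?your_user is empty}"` could be placed directly after the assignment, so that the job stops before QUILT is started.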

We prepared a windows file to perform the imputation in 5 Mb chunks. The next lines define this windows file and use the Slurm array task ID (SLURM_ARRAY_TASK_ID) to select one line of it per array task, so that each task imputes one window.

In [None]:
windows_file="/gpfs/helios/projects/echo_workshops/project.1.tk/data/QUILT_files/windows_to_be_imputed"
index=${SLURM_ARRAY_TASK_ID}
read chr start end window <<< $(head -n ${index} ${windows_file} | tail -n 1)
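To see what this selection does, the sketch below builds a small stand-in windows file and picks one line from it, the way one array task would. The column order (chr, start, end, window) follows the `read` command above; the chromosome length, the `window_N` naming and the values themselves are made up for illustration and are not the contents of the workshop's windows file.

```shell
#!/bin/bash
# Hypothetical stand-in for the windows file: columns are chr, start, end, window
windows_file=$(mktemp)
chr="chr20"
chr_length=12000000   # made-up length, for illustration only
step=5000000          # 5 Mb windows
i=1
for (( start = 1; start <= chr_length; start += step )); do
    end=$(( start + step - 1 ))
    if (( end > chr_length )); then
        end=${chr_length}   # the last window stops at the chromosome end
    fi
    echo "${chr} ${start} ${end} window_${i}" >> "${windows_file}"
    i=$(( i + 1 ))
done

index=2   # stands in for ${SLURM_ARRAY_TASK_ID}
# sed -n "Np" prints only line N, equivalent to head -n N | tail -n 1
read -r chr start end window <<< "$(sed -n "${index}p" "${windows_file}")"
echo "task ${index}: ${chr} ${start}-${end} (${window})"
rm -f "${windows_file}"
```

With `--array 1-21%5`, Slurm runs this selection once for each index from 1 to 21, with at most five tasks running at the same time.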

Next, we define the locations of the BAM file list, your output directory, the reference files and the posfiles:

In [None]:
bamlist="/gpfs/helios/projects/echo_workshops/project.1.tk/data/QUILT_files/bam_list"
output_dir="/gpfs/helios/home/etais/${your_user}/QUILT_output"
ref_dir="/gpfs/helios/projects/echo_workshops/project.1.tk/data/QUILT_files/ref"
pos_dir="/gpfs/helios/projects/echo_workshops/project.1.tk/data/QUILT_files/posfiles"

Then, run QUILT for each window:

In [None]:
./QUILT/QUILT.R --outputdir=${ref_dir}/ref_${chr} \
--output_filename=${output_dir}/${chr}.${window}.${start}.${end}.vcf.gz \
--tempdir=/gpfs/helios/home/etais/${your_user}/temp_files_QUILT/temp.${chr} \
--chr=${chr} \
--regionStart=${start} \
--regionEnd=${end} \
--buffer=250000 \
--bamlist=${bamlist} \
--posfile=${pos_dir}/posfile_${chr}

This step will take a few hours to run. We will continue with the output of this step tomorrow.
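Once the array job has finished, it can be useful to check that every window produced an output file before moving on. The sketch below rebuilds the expected file names from a windows file, using the same naming pattern as `--output_filename` above, and lists the array task numbers whose output is missing or empty. It runs in a temporary directory with made-up file contents; the two example windows are assumptions for illustration.

```shell
#!/bin/bash
# Self-contained demo in a temporary directory; all file contents are made up
demo_dir=$(mktemp -d)
windows_file="${demo_dir}/windows"
output_dir="${demo_dir}/QUILT_output"
mkdir -p "${output_dir}"
printf '%s\n' \
    "chr20 1 5000000 window_1" \
    "chr20 5000001 10000000 window_2" > "${windows_file}"
# Pretend that only the first window finished
echo "placeholder" > "${output_dir}/chr20.window_1.1.5000000.vcf.gz"

missing=()
task=0
while read -r chr start end window; do
    task=$(( task + 1 ))   # line number = array task ID
    vcf="${output_dir}/${chr}.${window}.${start}.${end}.vcf.gz"
    # -s is true only when the file exists and is not empty
    if [[ ! -s "${vcf}" ]]; then
        missing+=("${task}")
    fi
done < "${windows_file}"

echo "array tasks to rerun: ${missing[*]:-none}"
rm -rf "${demo_dir}"
```

Any task numbers reported this way could be resubmitted on their own by passing them to the `--array` option of `sbatch`.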

## Step 3: Concatenation of the windows and chromosomes + Step 4: GP-to-GT correction

Below, you can find the script to perform the concatenation of (1) the windows and (2) the chromosomes. This is followed by GP-to-GT correction. The consequence of the GP-to-GT correction is loss of phasing.

In [None]:
#!/bin/bash
#SBATCH -J concat_GP_to_GT
#SBATCH --time=24:00:00
#SBATCH --ntasks=5
#SBATCH -A ealloc_e7679_project1-tk-echo
#SBATCH --nodes=1

module load bcftools/1.19

your_user="" #fill in your user name here

output_dir="/gpfs/helios/projects/echo_workshops/project.1.tk/data/QUILT_output"
own_dir="/gpfs/helios/home/etais/${your_user}"

## Concatenation of the windows

### Feel free to do this for chr1-19 and 21 and use your own imputed data for chr20 and chr22. Just make sure that everything is placed in the same folder in the end.

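# Concatenate the windows of each chromosome in parallel; wait blocks until all background jobs have finished
for chr in {1..22}
do
# Comma-separated list of this chromosome's window numbers in numeric order (e.g. 1,2,3)
n=$(ls ${output_dir}/chr${chr}/| grep gz$ | cut -d'_' -f2 | cut -d'.' -f1 | sort -n | tr '\n' ' ' | sed 's/ /,/g' | rev | cut -c2- | rev)
# Build the command as a string; eval then brace-expands window_{1,2,3} so the files are passed in window order
e=$(echo bcftools concat ${output_dir}/chr${chr}/chr${chr}.window_{${n}}.*.vcf.gz --threads 5 -Oz -o ${own_dir}/concat_chr${chr}.vcf.gz)
eval $e &
done &&
wait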

## Concatenation of the chromosomes

bcftools concat ${own_dir}/concat_chr{1..22}.vcf.gz -Oz -o ${own_dir}/raw_imputed_data.vcf.gz &&
tabix -p vcf ${own_dir}/raw_imputed_data.vcf.gz

## GP-to-GT correction

bcftools +tag2tag ${own_dir}/raw_imputed_data.vcf.gz -Oz -o ${own_dir}/GT_corrected_imputed_data.vcf.gz -- -t 1 --gp-to-gt
tabix -p vcf ${own_dir}/GT_corrected_imputed_data.vcf.gz
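The pipeline that builds the window list `n` in the window-concatenation loop of the script above is compact, so here is a small worked example of what it produces. The sketch creates empty stand-in files in a temporary directory and runs the same pipeline on them. The file names follow the `chr.window_N.start.end.vcf.gz` pattern used for the QUILT output but are otherwise made up, and the `paste`-based variant is an alternative suggested here, not part of the original script.

```shell
#!/bin/bash
demo_dir=$(mktemp -d)
# Made-up, empty stand-ins for QUILT output files (including one index file)
touch "${demo_dir}/chr20.window_1.1.5000000.vcf.gz" \
      "${demo_dir}/chr20.window_1.1.5000000.vcf.gz.tbi" \
      "${demo_dir}/chr20.window_2.5000001.10000000.vcf.gz" \
      "${demo_dir}/chr20.window_10.45000001.50000000.vcf.gz"

# Same pipeline as in the script: keep the *.gz files, extract the window
# number, sort numerically, then join with commas and drop the trailing comma
n=$(ls "${demo_dir}"/ | grep 'gz$' | cut -d'_' -f2 | cut -d'.' -f1 | sort -n | tr '\n' ' ' | sed 's/ /,/g' | rev | cut -c2- | rev)
echo "window list: ${n}"

# Shorter alternative for the joining step, using paste
n_alt=$(ls "${demo_dir}"/ | grep 'gz$' | cut -d'_' -f2 | cut -d'.' -f1 | sort -n | paste -sd, -)

# eval makes bash brace-expand window_{1,2,10}, which lists the files in window order
ordered=$(cd "${demo_dir}" && eval "printf '%s\n' chr20.window_{${n}}.*.vcf.gz")
echo "${ordered}"
rm -rf "${demo_dir}"
```

The numeric sort matters because `bcftools concat` takes the files in the order given: a plain alphabetical listing would put window 10 before window 2. Note that bash only brace-expands lists with at least one comma, so a chromosome with a single window would need separate handling.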