# Post-Imputation Data Processing

Authors: Rafaella Ormond and Jose Jaime Martinez-Magana

***Description:***
This script processes imputed data from the TOPMed Imputation Server: it converts the per-chromosome VCF files to PLINK format and applies quality-control filters.

**INPUT:** Imputation output files from TOPMed server (chromosome-specific VCFs).
**OUTPUT:** Quality-controlled PLINK dataset files, filtered and ready for downstream analyses.

### Requirements:
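- PLINK 2.0 ([plink 2.0](https://www.cog-genomics.org/plink/2.0/)); the commands below use PLINK 2-only options such as `--make-pgen` and `--pmerge-list`. PLINK 1.9 ([plink 1.9](https://www.cog-genomics.org/plink/1.9/)) is optional.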
- BCFtools ([BCFtools Download](http://samtools.github.io/bcftools/))
- TOPMed imputation results (VCF files per chromosome, QC report, summary stats)

### Recommended Workflow:
1. Unzip imputation output files
2. PLINK: create and merge chromosome-specific files
3. BCFtools: QC, convert to PLINK, merge, and standardize variants

## 1. Unzip the imputation files
***Description:***
Unzip the output files from imputation. Some servers encrypt the files with a password.

In [None]:
# Set path to imputation files
input_path="/path_to_your_data/imputation_results"
password='YOUR_PASSWORD'

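# Extract every archive into the same folder; -o overwrites existing files without prompting
for zip_file in "${input_path}"/*.zip; do
    unzip -P "${password}" -o "${zip_file}" -d "${input_path}"
done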
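
Before converting, it can help to confirm that unzipping produced a dose VCF for every autosome. The sketch below is a minimal check, assuming the `chr${chr}.dose.vcf.gz` naming used in the next section. `missing_chromosomes` is a hypothetical helper, and the demonstration runs on empty placeholder files in a scratch directory, not on real imputation output.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the autosomes (1-22) whose dose VCF is missing.
# Assumes files are named chr<N>.dose.vcf.gz inside the given directory.
missing_chromosomes() {
    local dir="$1" chr
    for chr in {1..22}; do
        [ -f "${dir}/chr${chr}.dose.vcf.gz" ] || echo "chr${chr}"
    done
}

# Toy demonstration: create placeholders for chr1-chr20 only
demo_dir="$(mktemp -d)"
for chr in {1..20}; do
    touch "${demo_dir}/chr${chr}.dose.vcf.gz"
done

missing="$(missing_chromosomes "${demo_dir}" | tr '\n' ' ')"
echo "Missing: ${missing}"
```

On real data, `missing_chromosomes "${input_path}"` printing nothing would suggest all 22 files are in place.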

## 2. PLINK
Convert the `.dose.vcf.gz` files into PLINK format and merge all chromosomes.

In [None]:
# User configuration
plink2="/path_to_your_data/plink2"
splitted_path="/path_to_your_data/splitted_files"
merged_path="/path_to_your_data/merged_files"
ldpruned_path="/path_to_your_data/merged_files_ldpruned"
input_vcf_path="/path_to_your_data/imputation_results"

In [None]:
# Transform VCF dose files to PLINK format by chromosome
mkdir -p "${splitted_path}"   # make sure the output folder exists
for chr in {1..22}; do
    "${plink2}" --vcf "${input_vcf_path}/chr${chr}.dose.vcf.gz" \
        --make-pgen --id-delim '_' --rm-dup 'force-first' --snps-only \
        --out "${splitted_path}/chr${chr}.dose.plink"
done
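
`--id-delim '_'` asks PLINK 2 to split each VCF sample ID on underscores into family and individual IDs, so IDs with no underscore, or with more than one, may be parsed differently than intended. The sketch below flags such IDs. It is illustrative only: `flag_ambiguous_ids` is a hypothetical helper, the sample IDs are made up, and in practice the list could come from `bcftools query -l` on one of the dose VCFs.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read sample IDs on stdin (one per line) and print
# those that do not split into exactly two fields on '_'.
flag_ambiguous_ids() {
    awk -F'_' 'NF != 2 { print $0 }'
}

# Made-up sample IDs for illustration
ambiguous="$(
    printf '%s\n' 'FAM1_IND1' 'FAM2_IND2' 'FAM_3_IND3' 'IND4' | flag_ambiguous_ids
)"
echo "IDs to review:"
echo "${ambiguous}"
```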

### Merge PLINK files and standardize variant IDs

In [None]:
cd "${splitted_path}"
mkdir -p "${merged_path}"
# Create list of chromosome files (exclude chr1, which is the base fileset).
# Redirecting the whole loop with '>' overwrites the list, so re-running the cell does not append duplicates.
for chr in {2..22}; do
    echo "chr${chr}.dose.plink"
done > plink_files.list

# Merge all chromosomes into single dataset
"${plink2}" --pfile chr1.dose.plink --pmerge-list plink_files.list pfile --multiallelics-already-joined --out "${merged_path}/merged.chr.dose.plink"

# Standardize variant IDs to chr:bp:allele1:allele2 and keep only SNPs.
# Note: $1 and $2 are the two alleles in ASCII-sorted order; the template would need $r:$a for REF:ALT order.
"${plink2}" --pfile "${merged_path}/merged.chr.dose.plink" --set-all-var-ids 'chr@:#:$1:$2' --snps-only 'just-acgt' --make-pgen --out "${merged_path}/merged.chr.dose.plink.snpsonly"
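
A merge list is only useful if every fileset it names is complete. The sketch below writes the list in an idempotent way and skips any chromosome that lacks its `.pgen`, `.pvar`, or `.psam` file. `write_merge_list` is a hypothetical helper, the prefixes follow the `chr${chr}.dose.plink` naming used in this section, and the demonstration uses empty placeholder files rather than real PLINK output.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write one fileset prefix per line for chr2-chr22,
# keeping only prefixes whose .pgen/.pvar/.psam files all exist.
write_merge_list() {
    local dir="$1" list="$2" chr prefix
    : > "${list}"   # truncate first so re-runs do not append duplicates
    for chr in {2..22}; do
        prefix="${dir}/chr${chr}.dose.plink"
        if [ -f "${prefix}.pgen" ] && [ -f "${prefix}.pvar" ] && [ -f "${prefix}.psam" ]; then
            echo "${prefix}" >> "${list}"
        else
            echo "warning: incomplete fileset ${prefix}" >&2
        fi
    done
}

# Toy demonstration with empty placeholder files
demo_dir="$(mktemp -d)"
for chr in {2..22}; do
    for ext in pgen pvar psam; do
        touch "${demo_dir}/chr${chr}.dose.plink.${ext}"
    done
done
rm "${demo_dir}/chr5.dose.plink.psam"   # simulate one incomplete conversion

write_merge_list "${demo_dir}" "${demo_dir}/plink_files.list"
write_merge_list "${demo_dir}" "${demo_dir}/plink_files.list"   # second run gives the same list
n_listed="$(wc -l < "${demo_dir}/plink_files.list" | tr -d ' ')"
echo "Filesets listed: ${n_listed}"
```

Skipping an incomplete fileset silently would hide a failed conversion, which is why the helper prints a warning to stderr instead.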

## 3. BCFtools
***Description:***
Filter the per-chromosome dose VCFs on imputation quality (R2) and minor allele frequency with BCFtools, then convert the filtered files to PLINK format, merge them, and standardize variant IDs.

In [None]:
# User configuration
input='/path_to_your_data/00impfiles'
filtered_path='/path_to_your_data/02impfilesfiltered'
plink_out='/path_to_your_data/plink_per_chr'
merged_path='/path_to_your_data/merged_plink'
plink2='/path_to_your_data/plink2'

In [None]:
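# Quality control and filtering using BCFtools
mkdir -p "${filtered_path}"   # make sure the output folder exists
for chr in {1..22}; do
    # Keep variants with imputation quality R2 > 0.8 and minor allele frequency > 0.001
    bcftools view -i 'R2>0.8 & MAF>0.001' --threads 30 -Oz -o "${filtered_path}/cohort.chr${chr}.dose.filtered.vcf.gz" "${input}/chr${chr}.dose.vcf.gz"
    # Recompute the AC and AN tags and write an indexed, compressed BCF
    bcftools +fill-tags --threads 30 "${filtered_path}/cohort.chr${chr}.dose.filtered.vcf.gz" --write-index -Ob -o "${filtered_path}/cohort.chr${chr}.dose.filtered.tag.bcf.gz" -- -t AC,AN
done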
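
The `-i 'R2>0.8 & MAF>0.001'` expression keeps a site only when both conditions hold. To make that logic concrete, the sketch below applies the same two thresholds to a few made-up VCF records with `awk`. It assumes the INFO column carries `R2` and `MAF` keys, as the filter expression implies. `passes_filter` and the records are hypothetical, and this is not a substitute for `bcftools view`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read tab-delimited VCF records on stdin and print the
# ID (column 3) of each record whose INFO R2 and MAF exceed the thresholds.
passes_filter() {
    awk -F'\t' -v r2_min="$1" -v maf_min="$2" '
        /^#/ { next }                       # skip header lines
        {
            r2 = ""; maf = ""
            n = split($8, kv, ";")          # INFO is column 8
            for (i = 1; i <= n; i++) {
                if (kv[i] ~ /^R2=/)  r2  = substr(kv[i], 4)
                if (kv[i] ~ /^MAF=/) maf = substr(kv[i], 5)
            }
            if (r2 != "" && maf != "" && r2 + 0 > r2_min + 0 && maf + 0 > maf_min + 0) print $3
        }'
}

# Made-up records: one good, one low R2, one too rare, one good with an extra ER2 key
kept="$(
    {
        printf 'chr1\t100\tvar_keep\tA\tG\t.\tPASS\tAF=0.2;MAF=0.2;R2=0.95\n'
        printf 'chr1\t200\tvar_lowr2\tC\tT\t.\tPASS\tAF=0.3;MAF=0.3;R2=0.5\n'
        printf 'chr1\t300\tvar_rare\tG\tA\t.\tPASS\tAF=0.0005;MAF=0.0005;R2=0.99\n'
        printf 'chr1\t400\tvar_keep2\tT\tC\t.\tPASS\tAF=0.9;MAF=0.1;R2=0.81;ER2=0.4\n'
    } | passes_filter 0.8 0.001
)"
echo "${kept}"
```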

In [None]:
# Convert filtered BCF files to PLINK format
mkdir -p "${plink_out}"   # make sure the output folder exists
for chr in {1..22}; do
    "${plink2}" \
      --bcf "${filtered_path}/cohort.chr${chr}.dose.filtered.tag.bcf.gz" \
      --max-alleles 2 \
      --set-all-var-ids 'chr@:#:$1:$2' \
      --new-id-max-allele-len 100 \
      --make-pgen \
      --out "${plink_out}/cohort_chr${chr}.plink"
done
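
The conversion above keeps only variants with at most two alleles (`--max-alleles 2`) and allows alleles of up to 100 characters when building IDs (`--new-id-max-allele-len 100`). The sketch below shows one way to preview which records those two settings would affect, using made-up tab-delimited records with CHROM, POS, ID, REF, and ALT columns. `flag_problem_variants` is a hypothetical helper, and how PLINK 2 treats over-long alleles depends on the mode given to `--new-id-max-allele-len`, so this is only a rough preview.

```shell
#!/usr/bin/env bash
# Hypothetical helper: flag records with more than one ALT allele, or with a
# REF/ALT allele longer than the given limit.
flag_problem_variants() {
    awk -F'\t' -v max_len="$1" '
        /^#/ { next }
        {
            n_alt = split($5, alts, ",")    # ALT alleles are comma-separated
            if (n_alt > 1) { print $1 ":" $2 "\tmultiallelic"; next }
            if (length($4) > max_len || length($5) > max_len) print $1 ":" $2 "\tlong_allele"
        }'
}

# Made-up records: a plain SNP, a multiallelic site, and a 101-base insertion
long_allele="$(printf 'A%.0s' {1..101})"
problems="$(
    {
        printf 'chr1\t100\t.\tA\tG\n'
        printf 'chr1\t200\t.\tC\tT,G\n'
        printf 'chr1\t300\t.\tA\t%s\n' "${long_allele}"
    } | flag_problem_variants 100
)"
echo "${problems}"
```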

In [None]:
# Merge chromosome-specific PLINK files
mkdir -p "${merged_path}"

# Remove every variant whose ID is duplicated, one chromosome at a time
for chr in {1..22}; do
    "${plink2}" --pfile "${plink_out}/cohort_chr${chr}.plink" --rm-dup 'exclude-all' --make-pgen --out "${merged_path}/cohort_nodup_chr${chr}"
done

# Create merge list of the deduplicated filesets (exclude chr1, which is the base).
# The list must point at the cohort_nodup_* files, and '>' overwrites it on re-runs.
for chr in {2..22}; do
    echo "${merged_path}/cohort_nodup_chr${chr}"
done > "${merged_path}/cohort_merge_list.txt"

# Merge all non-duplicate chromosome files (chr1 as base)
cd "${merged_path}"
"${plink2}" --pfile cohort_nodup_chr1 --pmerge-list cohort_merge_list.txt 'pfile' --make-pgen --out cohort_nodup_merged
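
`--rm-dup 'exclude-all'` drops every variant whose ID appears more than once, so it can be worth listing those IDs before they are removed. The sketch below does this for a plain-text, tab-delimited `.pvar` file, assuming the variant ID is in column 3 and header lines start with `#`. `duplicated_ids` is a hypothetical helper and the `.pvar` content is made up.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each variant ID that occurs more than once
# in a tab-delimited .pvar file (ID assumed to be in column 3).
duplicated_ids() {
    awk -F'\t' '!/^#/ { print $3 }' "$1" | LC_ALL=C sort | uniq -d
}

# Made-up .pvar: two records at position 200 share an allele-sorted ID
demo_pvar="$(mktemp)"
{
    printf '#CHROM\tPOS\tID\tREF\tALT\n'
    printf '1\t100\tchr1:100:A:G\tA\tG\n'
    printf '1\t200\tchr1:200:C:T\tC\tT\n'
    printf '1\t200\tchr1:200:C:T\tT\tC\n'
    printf '1\t300\tchr1:300:A:G\tG\tA\n'
} > "${demo_pvar}"

dups="$(duplicated_ids "${demo_pvar}")"
echo "Duplicated IDs: ${dups}"
```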

In [None]:
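# Standardize variant IDs and keep only SNPs with A/C/G/T alleles
# ($1 and $2 in the template are the two alleles in ASCII-sorted order, not REF and ALT)
"${plink2}" \
  --pfile "${merged_path}/cohort_nodup_merged" \
  --set-all-var-ids 'chr@:#:$1:$2' \
  --snps-only 'just-acgt' \
  --make-pgen \
  --out "${merged_path}/cohort_nodup_merged.snpsonly"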
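
In the `--set-all-var-ids` template, `@` stands for the chromosome code, `#` for the base-pair position, and `$1`/`$2` for the two alleles in ASCII-sorted order (PLINK 2 uses `$r`/`$a` for REF/ALT order). The sketch below imitates both conventions in plain bash to show how the resulting IDs differ. The function names and the example variant are hypothetical, and the chromosome code is assumed to be a bare number.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build an ID like the 'chr@:#:$1:$2' template,
# with the two alleles ordered bytewise (ASCII sort).
sorted_allele_id() {
    local chrom="$1" pos="$2" ref="$3" alt="$4" alleles
    alleles="$(printf '%s\n' "${ref}" "${alt}" | LC_ALL=C sort | paste -sd: -)"
    echo "chr${chrom}:${pos}:${alleles}"
}

# Hypothetical helper: build an ID in REF:ALT order instead.
ref_alt_id() {
    echo "chr$1:$2:$3:$4"
}

# Made-up variant: chromosome 1, position 12345, REF=T, ALT=C
id_sorted="$(sorted_allele_id 1 12345 T C)"
id_refalt="$(ref_alt_id 1 12345 T C)"
echo "sorted alleles: ${id_sorted}"
echo "ref:alt order:  ${id_refalt}"
```

Allele-sorted IDs stay the same when REF and ALT are swapped between datasets, which can make matching easier, but they no longer tell you which allele is the reference.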