# Post-Imputation Data Processing

Authors: Rafaella Ormond and Jose Jaime Martinez-Magana <br>
***Description:***<br>
This script processes imputed data from the TOPMed Imputation Server, converting VCF files to PLINK format and performing quality-control filtering.

**INPUT:** Imputation output files from TOPMed server (e.g., chromosome-specific VCFs or dosage files).<br>
**OUTPUT:** Quality-controlled PLINK dataset files, filtered and ready for downstream analyses.


### ***Requirements:***
### Download PLINK
You can download PLINK version 1.9 and version 2.0 by following the steps on their websites.<br>
To install plink2, [access here](https://www.cog-genomics.org/plink/2.0/).<br>
To install plink1.9, [access here](https://www.cog-genomics.org/plink/1.9/).

### Download BCFtools
**BCFtools** is another popular tool for handling VCF/BCF files:  
- [BCFtools Download](http://samtools.github.io/bcftools/)  
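
Before running the steps in this document, it may help to confirm that the tools can be found from your shell. The following is a minimal sketch: `check_tool` is a hypothetical helper name, and the executable names `plink2`, `bcftools`, and `unzip` are assumptions (later cells call PLINK through a full path instead, so adjust as needed).

```shell
# check_tool is a hypothetical helper (not part of PLINK or BCFtools):
# it reports whether a command can be found on the PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
    fi
}

# assumed executable names; edit to match your installation
for tool in plink2 bcftools unzip; do
    check_tool "$tool"
done
```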

### TOPMed Imputation Results
This script assumes you have already completed imputation using the TOPMed Imputation Server ([link here](https://imputation.biodatacatalyst.nhlbi.nih.gov/#!)) and downloaded the results.<br>
After imputation is complete, you will need to download:

- The **Quality Control Report** (e.g., `qcreport.html`)
- The **QC summary statistics** (e.g., `chunks-excluded.txt`, `snps-excluded.txt`, `typed-only.txt`)
- The **imputed VCF files**, one per chromosome (e.g., `chr1.dose.vcf.gz`, `chr2.dose.vcf.gz`, ...)

These files are available via direct links from the server interface after your job completes. Make sure to save them in a consistent folder structure for downstream analysis.
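
One way to check that the download is complete is to look for one imputed VCF per autosome. This is a sketch under stated assumptions: `missing_chromosomes` is a hypothetical helper, and the file naming `chrN.dose.vcf.gz` follows the examples listed above.

```shell
# missing_chromosomes is a hypothetical helper: it prints the autosomes
# (chr1..chr22) whose imputed VCF is not present in the given directory.
missing_chromosomes() {
    for chr in $(seq 1 22); do
        [ -f "$1/chr${chr}.dose.vcf.gz" ] || echo "chr${chr}"
    done
}

# falls back to the current directory when input_path is not set yet
missing_chromosomes "${input_path:-.}"
```

An empty result means all 22 files were found.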

## QC Software Options

You can perform post-imputation quality control using one of the following:

1) PLINK
2) BCFtools

| Tool      | Pros ✅ | Cons ❌ |
|-----------|--------|---------|
| **PLINK** | - Widely used in GWAS pipelines.<br>- Very fast for large datasets.<br>- Many built-in genetic QC functions (MAF, HWE, missingness, etc.).<br>- Can output multiple formats. | - Requires conversion from VCF to PLINK format first.<br>- Limited manipulation of complex VCF structures compared to BCFtools. |
| **BCFtools** | - Works directly with compressed VCF/BCF files.<br>- Very efficient for filtering and subsetting large variant files.<br>- Flexible for script automation.<br>- Supports streaming and piping with other tools. | - Requires more manual scripting for GWAS-specific QC steps.<br>- Less "turnkey" for genetic association workflows compared to PLINK. |

---

💡 **Tip:**  
- If you plan to move quickly into GWAS analysis, **PLINK** is usually the easiest choice.  
- If you want maximum flexibility and work a lot with raw VCF files, **BCFtools** is a powerful option.

Analysis Steps:
1. **Unzip the imputation files**<br>
2. **PLINK**<br>
   2.1 User Configuration<br>
   2.2 Create plink files and merge all chromosomes<br>
   2.3 Merge PLINK files and standardize variant IDs<br>
   2.4 Perform LD pruning on merged PLINK dataset<br>
3. **BCFtools**<br>
   3.1 User Configuration<br>
   3.2 Quality Control<br>
   3.3 Merge Chromosome-specific PLINK Files<br>
   3.4 Standardize Variant IDs and Keep Only SNPs



### 1. Unzip the imputation files
***Description:***<br>
Unzip the output files from imputation.

> **Note:** Some imputation servers, such as TOPMed, may encrypt the output files with a password for security purposes.  
> The password is usually provided in your job results page or in the confirmation email.  
> Make sure to replace the example password below with the one specific to your dataset.

Adjust the path and file names to match your data.

In [None]:
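# input_path is defined in section 2.1 (User Configuration); set it before running this cell
cd "${input_path}"

# unzip all files
# replace XXX with the password for your job
password='XXX'
for zip_file in *.zip
do
    # log the archive name only, so the password is not written to the output
    echo "unzipping ${zip_file}"
    unzip -P "${password}" -o "${zip_file}"
done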

### 2. PLINK


### 2.1 User Configuration

***Description:***<br>
Set the parameters and adjust the paths and file names according to your analysis.

In [None]:
# input path
input_path="/path_to_your_data/imputation_results"
# set directory for files split by chromosome
splitted_path="/path_to_your_data/splitted_files"
# set the plink2 file
plink2="/path_to_your_data/plink2"
# set directory for files with the chromosomes merged
merged_path="/path_to_your_data/merged_files"
# set ld pruned path
ldpruned_path="/path_to_your_data/merged_files_ldpruned"
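
The output directories need to exist before PLINK writes to them. A minimal sketch of creating them follows; `work_dir` is a hypothetical variable introduced only so the example runs anywhere, and the three path variables are the ones set in the configuration cell.

```shell
# work_dir is a hypothetical variable used so the sketch runs anywhere;
# in practice the three paths come from the configuration cell.
work_dir="${work_dir:-$(mktemp -d)}"
splitted_path="${splitted_path:-${work_dir}/splitted_files}"
merged_path="${merged_path:-${work_dir}/merged_files}"
ldpruned_path="${ldpruned_path:-${work_dir}/merged_files_ldpruned}"

# mkdir -p creates missing parents and is a no-op if the directory exists
mkdir -p "${splitted_path}" "${merged_path}" "${ldpruned_path}"
```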


### 2.2 Create plink files and merge all chromosomes
***Description:***<br>
Convert each chromosome's `.dose.vcf.gz` file into PLINK 2 format (`.pgen`/`.pvar`/`.psam`). The per-chromosome files are merged into a single dataset in step 2.3.

Adjust the path and file names to match your data.

In [4]:
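# convert each chromosome's dose VCF to PLINK 2 format (.pgen/.pvar/.psam)
# note: without a dosage= modifier, --vcf imports hard-call genotypes only
for chr in {1..22}
do
    "${plink2}" --vcf "${input_path}/chr${chr}.dose.vcf.gz" --make-pgen --id-delim '_' --rm-dup 'force-first' --snps-only --out "${splitted_path}/chr${chr}.dose.plink"
done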
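
After the conversion loop, a quick check that every chromosome produced a complete fileset can catch failed runs early. This is a sketch, not part of the original workflow: `has_pfile` is a hypothetical helper, and it assumes uncompressed `.pvar` output as produced by the command above.

```shell
# has_pfile is a hypothetical helper: succeeds only if the .pgen, .pvar
# and .psam files all exist for the given fileset prefix.
has_pfile() {
    [ -f "$1.pgen" ] && [ -f "$1.pvar" ] && [ -f "$1.psam" ]
}

# report any chromosome whose conversion output is incomplete
for chr in $(seq 1 22); do
    has_pfile "${splitted_path:-.}/chr${chr}.dose.plink" || echo "incomplete: chr${chr}"
done
```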

### 2.3 Merge PLINK files and standardize variant IDs

***Description:*** <br>
This script moves to the directory containing PLINK files split by chromosome, creates a list of those files, merges them into a single dataset, and standardizes variant IDs to the format chr:bp:ref:alt for consistency in downstream analyses.

In [None]:
# move to the directory with the per-chromosome files
cd ${splitted_path}

# create list of plink files if needed (chr1 is the base fileset, so the list starts at chr2)
# for chr in {2..22};do echo chr${chr}.dose.plink;done > plink_files.list

# path to the plink2 executable
plink2="/path_to_your_data/plink2"

# merge the chr2-chr22 filesets into the chr1 fileset
${plink2} --pfile chr1.dose.plink --pmerge-list plink_files.list pfile --multiallelics-already-joined --out ${merged_path}/merged.chr.dose.plink

# change all variant IDs to chr:bp:ref:alt and keep only A/C/G/T SNPs
${plink2} --pfile ${merged_path}/merged.chr.dose.plink --set-all-var-ids '@:#:$r:$a' --snps-only 'just-acgt' --make-pgen --out ${merged_path}/merged.chr.dose.plink.norsids.snpsonly
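
Once IDs follow the chr:bp:ref:alt pattern, it can be useful to confirm that no ID occurs twice. The sketch below counts repeated IDs in a `.pvar` file using a toy example; `count_dup_ids` is a hypothetical helper, the toy IDs are made up for illustration, and it assumes the variant ID is the third column, as in the default `.pvar` layout.

```shell
# count_dup_ids is a hypothetical helper: it counts variant IDs that occur
# more than once in a .pvar file (ID is column 3; header lines start with '#').
count_dup_ids() {
    awk '!/^#/ { print $3 }' "$1" | sort | uniq -d | awk 'END { print NR }'
}

# toy .pvar with one repeated ID, for illustration only
toy_pvar=$(mktemp)
printf '#CHROM\tPOS\tID\tREF\tALT\n'    >  "${toy_pvar}"
printf '1\t100\t1:100:A:G\tA\tG\n'      >> "${toy_pvar}"
printf '1\t100\t1:100:A:G\tA\tG\n'      >> "${toy_pvar}"
printf '1\t200\t1:200:C:T\tC\tT\n'      >> "${toy_pvar}"

count_dup_ids "${toy_pvar}"   # prints 1
```

On real data, pass the path of the merged `.pvar` file instead of the toy file.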

### 2.4 Perform LD pruning on merged PLINK dataset

**Description:**  
This script moves to the output directory and performs linkage disequilibrium (LD) pruning on the merged PLINK dataset.  
Before pruning, variants are filtered on imputation quality (R2 > 0.4), minor allele frequency (MAF ≥ 0.05), Hardy-Weinberg equilibrium (HWE, p ≥ 1e-10), and genotype missingness (≤ 1%). Pruning then uses a 50-variant window, a step of 5 variants, and an r² threshold of 0.2 to remove correlated variants.  
A pruned PLINK dataset is generated for downstream analyses.

In [None]:
# move to output path (create it first if it does not exist)
mkdir -p ${ldpruned_path}
cd ${ldpruned_path}

# path to the plink2 executable
plink2="/path_to_your_data/plink2"

# apply the QC filters and compute the LD-pruned variant list (writes .prune.in and .prune.out)
${plink2} --pfile ${merged_path}/merged.chr.dose.plink.norsids.snpsonly --rm-dup 'force-first' --extract-if-info 'R2 > 0.4' --snps-only 'just-acgt' --maf 0.05 --hwe 1e-10 --geno 0.01 --indep-pairwise 50 5 0.2 --out merged.chr.dose.plink.prune_vars

# keep only the variants retained by LD pruning (the .prune.in file written by the previous command)
${plink2} --pfile ${merged_path}/merged.chr.dose.plink.norsids.snpsonly --extract merged.chr.dose.plink.prune_vars.prune.in --make-pgen --out merged.chr.dose.plink.norsids.snpsonly.prunned
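
To see how many variants the pruning step kept and removed, the two lists written by `--indep-pairwise` can be counted. This is a minimal sketch: `count_lines` is a hypothetical helper, and the prefix assumes the `--out` value `merged.chr.dose.plink.prune_vars` and that you are in the output directory.

```shell
# count_lines is a hypothetical helper: number of lines in a file.
count_lines() {
    awk 'END { print NR }' "$1"
}

# assumed prefix: the --out value of the --indep-pairwise run
prefix="merged.chr.dose.plink.prune_vars"
if [ -f "${prefix}.prune.in" ] && [ -f "${prefix}.prune.out" ]; then
    echo "variants kept by pruning:    $(count_lines "${prefix}.prune.in")"
    echo "variants removed by pruning: $(count_lines "${prefix}.prune.out")"
else
    echo "prune lists not found for ${prefix}"
fi
```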

### 3. BCFtools

### 3.1 User Configuration

***Description:***<br>
Set the parameters and adjust the paths and file names according to your analysis.

In [None]:
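# directory with the imputed VCF files downloaded from the server
input='/path_to_your_data/00impfiles'
# directory for the filtered VCF/BCF files
filtered_path='/path_to_your_data/02impfilesfiltered'
# path to the plink2 executable
plink_path="/path_to_your_data/plink2"
# directory for the per-chromosome PLINK files
plink_out="/path_to_your_data/plink_per_chr"
# directory for the merged PLINK dataset
merged_path="/path_to_your_data/merged_plink"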

### 3.1 Quality Control

**Description:**  
This script sequentially processes chromosome-specific imputed variant files by applying quality filters using `bcftools`. Variants are kept only if they pass both the imputation quality (R2 > 0.8) and minor allele frequency (MAF > 0.001) thresholds. The allele count (AC) and allele number (AN) tags are then recomputed with the `fill-tags` plugin, and the result is written as a compressed, indexed BCF file for downstream genetic analyses.


In [None]:
# filter BHRC imputed variants
# make sure the output directory exists
mkdir -p ${filtered_path}

# loop over each chromosome sequentially
for chr in {1..22}; do
    # keep variants with imputation R2 > 0.8 and MAF > 0.001
    bcftools view -i 'R2>0.8 & MAF>0.001' --threads 30 -Oz -o ${filtered_path}/bhrc.chr${chr}.dose.filtered.vcf.gz ${input}/chr${chr}.dose.vcf.gz
    # recompute the AC and AN tags, then write a compressed BCF plus its index
    bcftools +fill-tags --threads 30 ${filtered_path}/bhrc.chr${chr}.dose.filtered.vcf.gz --write-index -Ob -o ${filtered_path}/bhrc.chr${chr}.dose.filtered.tag.bcf.gz -- -t AC,AN
done
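
To make the filter expression concrete, the sketch below applies the same two thresholds to a toy VCF body with `awk`. It only illustrates the logic and is not a replacement for `bcftools`. `filter_info` is a hypothetical helper, the toy records are made up, and the sketch assumes `R2` and `MAF` appear as `key=value` entries in the INFO column (column 8).

```shell
# filter_info is a hypothetical helper: it reads VCF records on stdin and
# prints the ID of each record whose INFO has R2 > $1 and MAF > $2.
filter_info() {
    awk -F'\t' -v min_r2="$1" -v min_maf="$2" '
        /^#/ { next }                      # skip header lines
        {
            r2 = ""; maf = ""
            n = split($8, kv, ";")         # INFO is column 8
            for (i = 1; i <= n; i++) {
                if (kv[i] ~ /^R2=/)  r2  = substr(kv[i], 4)
                if (kv[i] ~ /^MAF=/) maf = substr(kv[i], 5)
            }
            if (r2 != "" && maf != "" && r2 + 0 > min_r2 && maf + 0 > min_maf)
                print $3
        }'
}

# toy records: v1 passes, v2 fails on MAF, v3 fails on R2
printf 'chr1\t100\tv1\tA\tG\t.\tPASS\tAF=0.1;MAF=0.1;R2=0.95\n'          >  toy_records.tsv
printf 'chr1\t200\tv2\tC\tT\t.\tPASS\tAF=0.0005;MAF=0.0005;R2=0.99\n'    >> toy_records.tsv
printf 'chr1\t300\tv3\tG\tA\t.\tPASS\tAF=0.2;MAF=0.2;R2=0.5\n'           >> toy_records.tsv

filter_info 0.8 0.001 < toy_records.tsv   # prints v1
```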

### 3.2 Convert Filtered BCF to PLINK by Chromosome
***Description:***<br> Converts filtered BCF files to .pgen format using PLINK.

In [None]:
mkdir -p ${plink_out}

for chr in {1..22}; do
    ${plink_path} \
        --bcf ${filtered_path}/bhrc.chr${chr}.dose.filtered.tag.bcf.gz \
        --make-pgen \
        --out ${plink_out}/chr${chr}.plink
done

### 3.3 Merge Chromosome-specific PLINK Files
***Description:***<br>
Merges the individual chromosome `.pgen` files into a single PLINK dataset.

In [None]:
mkdir -p ${merged_path}

# Create the list of files to merge (excludes chromosome 1, which is the base file)
# The list is overwritten on each run so that re-running the cell does not duplicate entries
for chr in {2..22}; do
    echo "${plink_out}/chr${chr}.plink"
done > ${merged_path}/plink_merge_list.txt

# Merge
${plink_path} \
    --pfile ${plink_out}/chr1.plink \
    --pmerge-list ${merged_path}/plink_merge_list.txt pfile \
    --out ${merged_path}/bhrc_merged_allchr

### 3.4 Standardize Variant IDs and Keep Only SNPs
***Description:*** <br>
Updates variant IDs to the chr:bp:ref:alt format (PLINK template `@:#:$r:$a`), keeps only SNPs with A/C/G/T alleles, and writes a new `.pgen` fileset.

In [None]:

${plink_path} \
    --pfile ${merged_path}/bhrc_merged_allchr \
    --set-all-var-ids '@:#:$r:$a' \
    --snps-only just-acgt \
    --make-pgen \
    --out ${merged_path}/bhrc_merged_allchr.snpsonly
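
A quick summary of the final dataset can be read from the text files of the fileset. The sketch below counts variants and samples; `count_records` is a hypothetical helper, and it assumes uncompressed `.pvar` and `.psam` files whose header lines start with `#`.

```shell
# count_records is a hypothetical helper: number of non-header lines
# (lines not starting with '#') in a .pvar or .psam file.
count_records() {
    awk '!/^#/ { n++ } END { print n + 0 }' "$1"
}

# assumed prefix: the --out value of the final plink2 command
prefix="${merged_path:-.}/bhrc_merged_allchr.snpsonly"
if [ -f "${prefix}.pvar" ] && [ -f "${prefix}.psam" ]; then
    echo "variants: $(count_records "${prefix}.pvar")"
    echo "samples:  $(count_records "${prefix}.psam")"
fi
```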