# Array Quality Control
Authors: Rafaella Ormond and Jose Jaime Martinez-Magana <br>
***Description:***<br>
This script prepares genotype array files for imputation.

**INPUT:** PLINK files (.bed/.bim/.fam)<br>
**OUTPUT:** 
1) A quality-controlled, reference-aligned VCF file (`.vcf.gz`).
2) One sorted VCF file per autosome (chromosomes 1 to 22), ready for upload to the TOPMed Imputation Server.

### ***Requirements:***
### Download PLINK
PLINK version 1.9 and version 2.0 can be downloaded by following the steps on their websites.<br>
To install plink2, [access here](https://www.cog-genomics.org/plink/2.0/)<br>
To install plink1.9, [access here](https://www.cog-genomics.org/plink/1.9/) <br>

### Download BCFtools:
To install bcftools, [access here](http://samtools.github.io/bcftools/) <br>
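
Before running the steps below, it can help to confirm that the tools are reachable from the shell. The following is a minimal sketch and not part of the original script; `check_tools` is a hypothetical helper name, and the list of tools assumes only `plink2` and `bcftools` are called, as in the code cells of this notebook.

```shell
# Report every required command-line tool that is not on the PATH.
# Returns 0 when all tools are found, 1 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "Missing tool: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Tools called in the code cells of this notebook
check_tools plink2 bcftools || echo "Install the missing tools before continuing." >&2
```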

### Reference Genome: 
For the first steps, the reference genome depends on the build of the array: use hg19 or hg38 to match it.<br>
To download the hg19 reference genome, use this file: https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz<br> 
To download the hg38 reference genome, use this file: https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
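
The UCSC files are compressed with plain gzip, while FASTA indexing (as done by bcftools) expects an uncompressed or bgzip-compressed file, so the download should be decompressed before use. Below is a minimal sketch of the download step, assuming `curl` and `gunzip` are installed; `ref_url` and `fetch_reference` are hypothetical helper names, not part of the original script.

```shell
# Build the UCSC download URL for a genome build (hg19 or hg38).
ref_url() {
    case "$1" in
        hg19|hg38)
            echo "https://hgdownload.soe.ucsc.edu/goldenPath/$1/bigZips/$1.fa.gz"
            ;;
        *)
            echo "Unsupported build: $1" >&2
            return 1
            ;;
    esac
}

# Download the reference for a build into a directory and decompress it.
# Usage: fetch_reference <build> <output directory>
fetch_reference() {
    build="$1"
    outDir="$2"
    url=$(ref_url "$build") || return 1
    mkdir -p "$outDir" || return 1
    curl -L -o "$outDir/$build.fa.gz" "$url" && gunzip -f "$outDir/$build.fa.gz"
}
```

For example, `fetch_reference hg19 /path/to/reference` would leave `/path/to/reference/hg19.fa`, which matches the `refFile` placeholder used in the configuration step.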

### Imputation on TOPMed
After running this script, perform the imputation on the TOPMed Imputation Server [link here](https://imputation.biodatacatalyst.nhlbi.nih.gov/#!)<br>

**Note:** After imputation, all files should be aligned to the **hg38** reference genome for consistency.

### Analysis Steps:
1. **User Configuration** – Define paths, reference genome, and software parameters.  
2. **Convert PLINK to VCF** – Convert `.bed/.bim/.fam` files to `.vcf` format and index them.  
3. **Alignment** – Align genotypes to the reference genome (strand checking).  
4. **Split by Chromosome** – Generate one VCF file per chromosome (`chr1.vcf.gz`, `chr2.vcf.gz`, ...).
5. **Instructions for TOPMed imputation** – Generate an imputed, phased file in hg38.

### 1) User Configuration
***Description:***<br>
Set the parameters and adjust the paths and file names according to your analysis.

In [1]:
# USER CONFIGURATION - MODIFY ACCORDING TO YOUR ANALYSIS

# PLINK files prefix (.bed, .bim, .fam)
inPath="/path_to_your_data/plink_prefix"

# Your cohort name
cohortName="cohort_name"

# Directory for VCF files
clvcfPath="path_vcf_files"

# Directory for VCFs by chromosome
clvcfbychrPath="path_vcf_chr"

# Reference genome file (hg19 or hg38)
refFile="/path/to/reference/hg19.fa"
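
A quick check of the configuration can catch path mistakes before the longer steps run. The sketch below is an assumption-laden example rather than part of the original script: `validate_config` is a hypothetical helper that verifies the three PLINK files exist for a prefix and creates any output directories passed after it.

```shell
# Check that <prefix>.bed/.bim/.fam exist, then create the given output directories.
# Usage: validate_config <plink prefix> <output dir> [<output dir> ...]
validate_config() {
    prefix="$1"
    shift
    for ext in bed bim fam; do
        if [ ! -f "$prefix.$ext" ]; then
            echo "Missing input file: $prefix.$ext" >&2
            return 1
        fi
    done
    for dir in "$@"; do
        mkdir -p "$dir" || return 1
    done
    return 0
}
```

With the variables defined above, the call would be `validate_config "$inPath" "$clvcfPath" "$clvcfbychrPath"`.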

### 2) Convert PLINK files to VCF, perform quality control, and index the files
***Description:***<br>
Convert PLINK binary files to compressed VCF format and create index files for efficient access.  
Quality control filters are applied during conversion to exclude variants and samples based on minor allele frequency (MAF), Hardy-Weinberg equilibrium (HWE), genotype missingness per variant and per individual. 

Quality control filters: <br>
--maf 0.01  → exclude variants with minor allele frequency < 1%<br>
--hwe 1e-6  → exclude variants with Hardy-Weinberg p-value < 1e-6<br>
--geno 0.05 → exclude variants with missing genotype rate > 5%<br>
--mind 0.05 → exclude samples with missing genotype rate > 5%<br>

Please adjust the input paths and filenames accordingly.

In [None]:
# Convert PLINK binary files (.bed/.bim/.fam) to a compressed VCF (.vcf.gz) and apply the quality control filters
# - 'bgz' outputs a block-gzipped (bgzip) VCF file compatible with bcftools/tabix
# - 'vcf-iid' writes only the Individual ID (IID) as the sample ID, dropping the Family ID (FID)
plink2 --bfile ${inPath} --chr 1-22 --recode vcf-iid bgz --maf 0.01 --hwe 1e-6 --geno 0.05 --mind 0.05 --out ${clvcfPath}/${cohortName}

# Index the resulting VCF file using bcftools
# This creates a .csi index file, needed for fast access and downstream analyses
bcftools index ${clvcfPath}/${cohortName}.vcf.gz
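
To see how many variants and samples the filters removed, the counts in the input PLINK files can be compared with those in the output VCF. This is a minimal sketch using only standard shell tools; the helper names are hypothetical. It relies on the `.bim` file holding one variant per line, the `.fam` file holding one sample per line, and bgzip output being readable with `gzip -dc`.

```shell
# Count lines in a PLINK .bim (variants) or .fam (samples) file.
count_lines() {
    wc -l < "$1" | tr -d ' '
}

# Count variant records (non-header lines) in a compressed VCF.
count_vcf_variants() {
    gzip -dc "$1" | grep -vc '^#'
}

# Count samples in a compressed VCF: columns after the 9 fixed VCF columns.
count_vcf_samples() {
    gzip -dc "$1" | awk -F'\t' '/^#CHROM/ { print NF - 9; exit }'
}
```

For example, `count_lines "${inPath}.bim"` and `count_vcf_variants "${clvcfPath}/${cohortName}.vcf.gz"` give the variant counts before and after filtering. Note that the difference also reflects the `--chr 1-22` restriction, not only the quality control thresholds.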

### 3) Alignment
***Description:***<br>
Align the VCF alleles with the reference genome.<br>
Please adjust the number of threads accordingly.

In [None]:
# Convert the VCF file to BCF format using 25 threads for faster processing (adjust according to your computational power)
bcftools convert ${clvcfPath}/${cohortName}.vcf.gz -Ob -o ${clvcfPath}/${cohortName}.bcf.gz --threads 25

# Run the BCFtools fixref plugin without a fixing mode
# In its default mode (stats), this run only reports how well the alleles match the reference genome; it does not modify the data
bcftools +fixref ${clvcfPath}/${cohortName}.bcf.gz -- -f ${refFile}

# Run BCFtools fixref again, this time outputting a new VCF file
# The -d option discards sites that could not be resolved against the reference
# The -m flip option flips alleles to match the reference when possible
# The final result is a VCF file corrected for reference mismatches
# Note: this overwrites the VCF file created in step 2
bcftools +fixref ${clvcfPath}/${cohortName}.bcf.gz -Oz -o ${clvcfPath}/${cohortName}.vcf.gz -- -d -m flip -f ${refFile}
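
The alignment step looks up each site in the reference by chromosome name, so a naming mismatch (for example `1` in the VCF versus `chr1` in the UCSC FASTA files) can keep sites from being resolved. The sketch below shows one way to compare the two naming styles with standard shell tools. The helper names are hypothetical, and it inspects only the first sequence header and the first variant record.

```shell
# Print "chr" if the first FASTA sequence name starts with "chr", else "plain".
fasta_contig_style() {
    awk '/^>/ { if ($0 ~ /^>chr/) print "chr"; else print "plain"; exit }' "$1"
}

# Print "chr" if the first variant record in a compressed VCF uses a
# chr-prefixed chromosome name, else "plain".
vcf_contig_style() {
    gzip -dc "$1" | awk '!/^#/ { if ($1 ~ /^chr/) print "chr"; else print "plain"; exit }'
}

# Return 0 when the VCF and the FASTA use the same naming style.
# Usage: same_contig_style <file.vcf.gz> <reference.fa>
same_contig_style() {
    [ "$(vcf_contig_style "$1")" = "$(fasta_contig_style "$2")" ]
}
```

If `same_contig_style "${clvcfPath}/${cohortName}.vcf.gz" "${refFile}"` fails, the chromosome names in the VCF can be renamed, for instance with `bcftools annotate --rename-chrs` and a two-column mapping file, before running fixref.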

### 4) Split VCF file by chromosome
***Description:***<br>
Split the VCF file by chromosome for imputation, then sort each per-chromosome VCF file to ensure proper ordering before downstream analyses.

In [None]:
# Write one VCF file per autosome
for chr_num in {1..22}
do
    plink2 --vcf ${clvcfPath}/${cohortName}.vcf.gz --chr ${chr_num} --recode vcf-4.2 bgz --out ${clvcfbychrPath}/${cohortName}_forimputation_chr${chr_num}
done

# Sort each per-chromosome VCF file by position
for chr_num in {1..22}
do
    bcftools sort ${clvcfbychrPath}/${cohortName}_forimputation_chr${chr_num}.vcf.gz -Oz -o ${clvcfbychrPath}/${cohortName}_forimputation_chr${chr_num}_sorted.vcf.gz
done
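
Before uploading, it is worth confirming that all 22 sorted files were written and are not empty. The following is a minimal sketch that assumes the file naming pattern used in the loops above; `missing_chr_files` is a hypothetical helper name.

```shell
# Print the path of every expected sorted per-chromosome VCF that is
# missing or empty. Prints nothing when all 22 files are present.
# Usage: missing_chr_files <directory> <cohort name>
missing_chr_files() {
    dir="$1"
    name="$2"
    chr=1
    while [ "$chr" -le 22 ]; do
        f="$dir/${name}_forimputation_chr${chr}_sorted.vcf.gz"
        [ -s "$f" ] || echo "$f"
        chr=$((chr + 1))
    done
}
```

Calling `missing_chr_files "$clvcfbychrPath" "$cohortName"` should produce no output when the split and sort steps completed for every chromosome.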

### 5) Instructions for Imputation on TOPMed

Please access the [TOPMed Imputation Server](https://imputation.biodatacatalyst.nhlbi.nih.gov).

Log in to your account and access the dashboard.  
At the top of the page, click **“Run” → “Genotype Imputation (Minimac4)”**.

<p align="center">
  <img src="topmed1.png" alt="Example of TOPMed interface" width="600">
</p>

Upload your VCF files and configure the following options:

- **Array Build:** select *hg19* or *hg38* (according to your file build).  
- **Rsq Filter:** *off*  
- **Phasing:** *Eagle v2.4 (phased output)*  
- **Population:** select *All populations* (currently “vs. TOPMed Panel”)  
- **Mode:** *Quality Control & Imputation*  
- Enable the checkbox **“Generate Meta-imputation file”**

After submitting the job, you will receive an email when the results are ready.  
You will also receive a **password by email** to access the output folders.  
Download all the result folders to your working directory.
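
Once the results are downloaded, the emailed password is needed to unpack them. The sketch below assumes the server delivers the imputed data as password-protected `.zip` archives and that Info-ZIP `unzip` is installed. Both are assumptions to verify against your own download, and `extract_results` is a hypothetical helper name.

```shell
# Extract every .zip archive found in a directory using the given password.
# Usage: extract_results <directory with downloaded archives> <password>
extract_results() {
    dir="$1"
    password="$2"
    for zipFile in "$dir"/*.zip; do
        # Skip the unexpanded pattern when the directory holds no archives
        [ -e "$zipFile" ] || continue
        unzip -o -P "$password" "$zipFile" -d "$dir" || return 1
    done
    return 0
}
```

A call would look like `extract_results /path/to/results "password_from_email"`. If `unzip` cannot handle the archive encryption, another extraction tool such as 7-Zip may be needed.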

<p align="center">
  <img src="topmed2.png" alt="Example of download page" width="600">
</p>
