# Array Quality Control
Authors: Rafaella Ormond and Jose Jaime Martinez-Magana <br>
***Description:***<br>
This script prepares genotype array files for imputation.

**INPUT:** PLINK files (.bed/.bim/.fam)<br>
**OUTPUT:** VCF files separated by chromosome

### ***Requirements:***
### Download Plink
PLINK version 1.9 and version 2.0 can be downloaded by following the steps on their websites.<br>
To install plink2, follow the instructions [in this link](https://www.cog-genomics.org/plink/2.0/)<br>
To install plink1.9, follow the instructions [in this link](https://www.cog-genomics.org/plink/1.9/) <br>

### Download BCFtools:
To install bcftools, follow the instructions [in this link](http://samtools.github.io/bcftools/) <br>
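
Before running the cells below, it can help to confirm that the tools are reachable from the shell. This is a minimal sketch; it assumes the executables are named `plink`, `plink2`, and `bcftools`, as they are called later in this notebook. The helper `check_tool` is a hypothetical name introduced here for illustration.

```shell
# Report whether a tool is available on PATH.
# Returns 0 if the tool is found, 1 otherwise.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
        return 0
    else
        echo "missing: $1"
        return 1
    fi
}

# Check every tool used in this notebook; keep going even if one is missing
for tool in plink plink2 bcftools; do
    check_tool "$tool" || true
done
```

Any tool reported as missing needs to be installed, or its directory added to `PATH`, before continuing.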

### Reference Genome: 
The first steps depend on the reference genome build of the array, which can be hg19 or hg38.<br>
To download the reference genome hg19 follow the link to this file: https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz<br> 
To download the reference genome hg38 follow the link to this file: https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
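
The files above are gzip-compressed, while the configuration in this notebook points to an uncompressed `.fa` file. A minimal sketch of the decompression step follows; `decompress_reference` is a hypothetical helper name, and the commented download command assumes `curl` is available.

```shell
# Decompress a gzip-compressed FASTA next to the original, keeping the .gz file.
# Usage: decompress_reference /path/to/reference/hg19.fa.gz
decompress_reference() {
    gz_file="$1"
    fa_file="${gz_file%.gz}"   # strip the trailing .gz to get the output name
    if [ ! -f "$gz_file" ]; then
        echo "not found: $gz_file" >&2
        return 1
    fi
    if [ -f "$fa_file" ]; then
        echo "already present: $fa_file"
        return 0
    fi
    gzip -dc "$gz_file" > "$fa_file" && echo "wrote: $fa_file"
}

# Example (not run here): download first, then decompress.
# curl -O https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
# decompress_reference hg19.fa.gz
```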

### Imputation on TOPMED
After running this script, perform the imputation on the TOPMed Imputation Server ([link here](https://imputation.biodatacatalyst.nhlbi.nih.gov/#!)).<br>

**Note:** After imputation, all files should be aligned to the **hg38** reference genome for consistency.

### Analysis Steps:
1. **User Configuration** – Define paths, reference genome, and software parameters.  
2. **Convert PLINK to VCF** – Convert `.bed/.bim/.fam` files to `.vcf` format and index them.  
3. **Alignment** – Align genotypes to the reference genome (strand checking).  
4. **Split by Chromosome** – Generate one VCF file per chromosome (`chr1.vcf.gz`, `chr2.vcf.gz`, ...).  

### 1) User Configuration
***Description:***<br>
Set the parameters and adjust the paths and file names according to your analysis.

In [None]:
# USER CONFIGURATION - MODIFY ACCORDING TO YOUR ANALYSIS

# Reference genome file
refFile="/path/to/reference/hg19.fa"

# PLINK files prefix (.bed, .bim, .fam) -- TODO: describe in more detail what is happening and what each part does
inPath="/path_to_your_data/plink_prefix"
# Your cohort name
cohort_name="cohort_name"

# Directory for VCF files
clvcfPath="path_vcf_files"
# Directory for VCFs by chromosome
clvcfbychrPath="path_vcf_chr"
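
A wrong prefix or a missing output directory only shows up later as a PLINK or bcftools error, so a quick check of the configuration can save time. The sketch below is an illustration under the assumption that the variables above are set; `validate_config` is a hypothetical helper name.

```shell
# Check that all three PLINK files exist for a prefix, then create the output directories.
# Usage: validate_config "$inPath" "$clvcfPath" "$clvcfbychrPath"
validate_config() {
    prefix="$1"
    status=0
    for ext in bed bim fam; do
        if [ ! -f "${prefix}.${ext}" ]; then
            echo "missing input: ${prefix}.${ext}" >&2
            status=1
        fi
    done
    # Create the output directories only when the inputs are complete
    if [ "$status" -eq 0 ]; then
        mkdir -p "$2" "$3" || status=1
    fi
    return "$status"
}
```

It could be called as `validate_config "${inPath}" "${clvcfPath}" "${clvcfbychrPath}"` right after the configuration cell; a non-zero exit status means at least one input file was not found.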

### 2) Convert PLINK files to VCF, perform quality control, and index the files
***Description:***<br>
Convert PLINK binary files to compressed VCF format and create index files for efficient access.  
Quality control filters are applied during conversion: variants are excluded based on minor allele frequency (MAF < 0.01), Hardy-Weinberg equilibrium (HWE p-value < 1e-6), and per-variant genotype missingness (> 5%), and samples are excluded based on per-individual genotype missingness (> 5%).

Please adjust the input paths and filenames accordingly.

In [None]:
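# Convert the PLINK files to a compressed VCF while applying the QC filters
# 'vcf-iid' writes only the IID as the VCF sample ID (the FID is dropped); 'bgz' bgzip-compresses the output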
plink --bfile ${inPath} --recode vcf-iid bgz --maf 0.01 --hwe 1e-6 --geno 0.05 --mind 0.05 --out ${clvcfPath}/${cohort_name}

# Index the compressed VCF
bcftools index ${clvcfPath}/${cohort_name}.vcf.gz
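
To see how many variants the QC filters removed, the variant count in the input `.bim` file can be compared with the number of records in the new VCF. This is a minimal sketch using only standard shell tools; it relies on bgzip output being readable by `gzip`, and the two function names are hypothetical.

```shell
# Count variants in a PLINK .bim file (one variant per line).
count_bim_variants() {
    wc -l < "$1" | tr -d ' '
}

# Count variant records in a (b)gzip-compressed VCF: every line not starting with '#'.
# '|| true' keeps the exit status at 0 when the count is zero.
count_vcf_records() {
    gzip -dc "$1" | grep -vc '^#' || true
}

# Example (not run here):
# before=$(count_bim_variants "${inPath}.bim")
# after=$(count_vcf_records "${clvcfPath}/${cohort_name}.vcf.gz")
# echo "variants before QC: $before, after QC: $after"
```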

### 3) Alignment
***Description:***<br>
Align VCF alleles with the reference genome.<br>
Please adjust the number of threads accordingly.

In [1]:
# Convert VCF file to BCF format using 25 threads for faster processing
bcftools convert ${clvcfPath}/${cohort_name}.vcf.gz -Ob -o ${clvcfPath}/${cohort_name}.bcf.gz --threads 25

# Run the BCFtools fixref plugin without a mode to check reference/alternative allele inconsistencies
# In its default mode this step only reports statistics against the reference genome; no file is modified
bcftools +fixref ${clvcfPath}/${cohort_name}.bcf.gz -- -f ${refFile}

# Run BCFtools fixref again, this time writing a corrected, compressed VCF file
# The -d option discards sites that could not be resolved against the reference
# The -m flip option flips non-ambiguous SNPs to match the reference when possible
# Note: the output path is the same as the VCF written in step 2, so that file is overwritten
bcftools +fixref ${clvcfPath}/${cohort_name}.bcf.gz -Oz -o ${clvcfPath}/${cohort_name}.vcf.gz -- -d -m flip -f ${refFile}
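
One thing worth checking before this step is whether the reference FASTA and the VCF name chromosomes the same way: the UCSC FASTA files use names such as `chr1`, while PLINK 1.9 writes numeric chromosome codes such as `1` by default. If the two differ, the reference lookup may fail or leave sites unresolved. The sketch below only inspects the files with standard shell tools; the function names are hypothetical and nothing is modified.

```shell
# Print the first sequence name in a FASTA file (text after '>' up to the first space).
first_fasta_contig() {
    awk '/^>/ { sub(/^>/, ""); print $1; exit }' "$1"
}

# Print the CHROM value of the first variant record in a (b)gzip-compressed VCF.
first_vcf_contig() {
    gzip -dc "$1" | awk '!/^#/ { print $1; exit }'
}

# Classify a contig name as "chr-prefixed" or "unprefixed".
contig_style() {
    case "$1" in
        chr*) echo "chr-prefixed" ;;
        *)    echo "unprefixed" ;;
    esac
}

# Example (not run here):
# contig_style "$(first_fasta_contig "${refFile}")"
# contig_style "$(first_vcf_contig "${clvcfPath}/${cohort_name}.vcf.gz")"
```

If the two styles differ, the chromosome names need to be harmonized before running fixref.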


### 4) Split VCF file by chromosome
***Description:***<br>
Split the VCF file by chromosome for imputation, then sort each per-chromosome VCF file to ensure proper ordering before downstream analyses.

In [None]:
# Iterate over the autosomes and write one compressed VCF per chromosome
for chr_num in {1..22}
do
    plink2 --vcf ${clvcfPath}/${cohort_name}.vcf.gz --chr ${chr_num} --recode vcf-4.2 bgz --out ${clvcfbychrPath}/${cohort_name}_forimputation_chr${chr_num}
done

# Sort each per-chromosome VCF
# Iterate over chromosome numbers
for chr_num in {1..22}
do
    bcftools sort ${clvcfbychrPath}/${cohort_name}_forimputation_chr${chr_num}.vcf.gz -Oz -o ${clvcfbychrPath}/${cohort_name}_forimputation_chr${chr_num}_sorted.vcf.gz
done
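
Before uploading for imputation, it is worth confirming that all 22 sorted files were actually written, since a chromosome with no remaining variants or a failed command would leave a gap. A minimal sketch, assuming the file naming used in the loops above; `missing_sorted_vcfs` is a hypothetical helper name.

```shell
# Print the chromosome numbers (1-22) whose sorted VCF is absent or empty in a directory.
# Usage: missing_sorted_vcfs "$clvcfbychrPath" "$cohort_name"
missing_sorted_vcfs() {
    for chr_num in $(seq 1 22); do
        f="${1}/${2}_forimputation_chr${chr_num}_sorted.vcf.gz"
        # -s is true only for files that exist and are not empty
        [ -s "$f" ] || echo "$chr_num"
    done
}

# Example (not run here):
# missing_sorted_vcfs "${clvcfbychrPath}" "${cohort_name}"
```

No output means every expected file is present; otherwise each printed number is a chromosome to re-run.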