# Array QC
Authors: Rafaella Ormond and Jose Jaime Martinez-Magana <br>
Description:<br>
This script prepares the files needed to perform the GWAS of smoking traits in the LAGC cohorts.

### ***Requirements:***
### Download PLINK
PLINK versions 1.9 and 2.0 can be downloaded by following the steps on their websites.<br>
To install plink2, [access here](https://www.cog-genomics.org/plink/2.0/)<br>
To install plink1.9, [access here](https://www.cog-genomics.org/plink/1.9/) <br>

### Download BCFtools:
To install bcftools, [access here](https://www.htslib.org/download/) <br>

### Reference Genome: 
For the first steps, the reference genome depends on the array; either hg19 or hg38 can be used.<br>
To get the hg19 reference genome, download the file: https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz<br> 
To get the hg38 reference genome, download the file: https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
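
The parameters further down point to a reference file named `hg19.fa.nochr.fa`, which suggests that the `chr` prefix was removed from the UCSC contig names so that they match the chromosome codes PLINK writes (`1`, `2`, ...). One possible way to build such a file is sketched below. The helper name `strip_chr_prefix` is hypothetical, and the exact command used to create the original file is not stated in this notebook.

```shell
# Hypothetical helper: remove the "chr" prefix from FASTA header lines only.
# Sequence lines are left untouched because they never start with ">".
strip_chr_prefix() {
  sed 's/^>chr/>/'
}

# Example usage (commented out because it downloads a large file):
# curl -O https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
# gunzip -c hg19.fa.gz | strip_chr_prefix > hg19.fa.nochr.fa
```

With this approach the mitochondrial contig `chrM` becomes `M`, which may differ from the code used in the VCF. This should not matter for the autosomes (1 to 22) that are split out for imputation later on.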

### Imputation on TOPMed
After running this script, imputation needs to be performed on the TOPMed Imputation Server [link here](https://imputation.biodatacatalyst.nhlbi.nih.gov/#!)<br>
After imputation, all files need to be converted to hg38

Ask for computational resources if needed

In [None]:
# request computational resources if needed
salloc -t 40:00 --mem=8G --partition=day

Activate conda environment if needed

In [None]:
# activate environment
module load miniconda
conda activate step1_array_qc

In the next step, please change the paths and file names to match your own

In [None]:
## set parameters
# input path (PLINK bfile prefix, without the .bed/.bim/.fam extension)
inPath='input_path'
# cohort path for VCF files
clvcfPath='cohort_name_vcfs'
# cohort path for VCF files split by chromosome
clvcfbychrPath='cohort_name_vcfs_bychr'
# reference genome file used for allele alignment
refFile='hg19.fa.nochr.fa'
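
PLINK and bcftools write into the directories named above but may not create them, so it can help to create the output directories and confirm that the tools are on the `PATH` before running the remaining cells. A minimal sketch, assuming the placeholder directory names from the cell above:

```shell
# Placeholder output directories (same names as in the parameters cell)
clvcfPath='cohort_name_vcfs'
clvcfbychrPath='cohort_name_vcfs_bychr'

# Create both directories; -p makes this a no-op if they already exist
mkdir -p "${clvcfPath}" "${clvcfbychrPath}"

# Warn (without stopping) about any required tool that is not on the PATH
for tool in plink plink2 bcftools; do
  if ! command -v "${tool}" >/dev/null 2>&1; then
    echo "warning: ${tool} not found on PATH" >&2
  fi
done
```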

### Data for imputation

Convert PLINK files to VCF and index the output<br>
Please adjust the paths and file names

In [None]:
# Create a VCF from the PLINK files, applying the QC filters
# 'vcf-iid' writes only the IID as the sample ID (the FID is dropped); 'bgz' block-gzips the output
plink --bfile ${inPath} --recode vcf-iid bgz --maf 0.01 --hwe 1e-6 --geno 0.05 --mind 0.05 --out ${clvcfPath}/cohort_name

# index the compressed VCF
bcftools index ${clvcfPath}/cohort_name.vcf.gz
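
Because the PLINK command above filters both variants and samples, it can be worth checking how many of each remain in the exported VCF. The sketch below uses a hypothetical helper, `summarize_vcf`, and assumes the VCF has already been indexed as in the cell above.

```shell
# Hypothetical helper: print sample and variant counts for an indexed VCF.
summarize_vcf() {
  vcf="$1"
  if ! command -v bcftools >/dev/null 2>&1; then
    echo "bcftools not found on PATH" >&2
    return 1
  fi
  if [ ! -f "${vcf}" ]; then
    echo "file not found: ${vcf}" >&2
    return 1
  fi
  # bcftools query -l lists one sample ID per line
  echo "samples:  $(bcftools query -l "${vcf}" | wc -l)"
  # bcftools index -n reads the number of records from the index
  echo "variants: $(bcftools index -n "${vcf}")"
}

# Example usage (path follows the placeholders used in this notebook):
# summarize_vcf "${clvcfPath}/cohort_name.vcf.gz"
```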

Align VCF alleles with the reference genome

In [1]:
# Convert VCF file to BCF format using 25 threads for faster processing
bcftools convert ${clvcfPath}/cohort_name.vcf.gz -Ob -o ${clvcfPath}/cohort_name.bcf.gz --threads 25

# Run BCFtools fixref plugin to fix reference/alternative allele inconsistencies
# This step checks and updates alleles based on the reference genome
bcftools +fixref ${clvcfPath}/cohort_name.bcf.gz -- -f ${refFile}

# Run BCFtools fixref again, this time writing a corrected VCF file
# The -d option discards sites that could not be resolved against the reference
# The -m flip option swaps or flips REF/ALT alleles to match the reference when possible
# Note: the output overwrites the VCF created in the previous step
bcftools +fixref ${clvcfPath}/cohort_name.bcf.gz -Oz -o ${clvcfPath}/cohort_name.vcf.gz -- -d -m flip -f ${refFile}
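
The `+fixref` plugin looks up each site in the FASTA by chromosome name, so the alignment is expected to work only when the contig names in the reference match those in the VCF. A small check of the FASTA headers, sketched below with a hypothetical helper name, can catch a `chr1` versus `1` mismatch before the alignment is run.

```shell
# Hypothetical helper: succeed if the first FASTA header starts with ">chr".
fasta_has_chr_prefix() {
  # Take only the first header line; "|| true" keeps going if none is found
  first_header=$(grep -m 1 '^>' "$1" || true)
  case "${first_header}" in
    ">chr"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example usage (refFile is the placeholder set in the parameters cell):
# if fasta_has_chr_prefix "${refFile}"; then
#   echo "reference uses chr-prefixed names; check that the VCF does too" >&2
# fi
```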


Split VCF file by chromosome for imputation

In [None]:
# Iterate over chromosome numbers
for chr_num in {1..22}
do
    plink2 --vcf ${clvcfPath}/cohort_name.vcf.gz --chr ${chr_num} --recode vcf-4.2 bgz --out ${clvcfbychrPath}/cohort_name_forimputation_chr${chr_num}
done

Sort each per-chromosome VCF file to ensure proper ordering before downstream analyses

In [None]:
# sorting
# Iterate over chromosome numbers
for chr_num in {1..22}
do
    bcftools sort ${clvcfbychrPath}/cohort_name_forimputation_chr${chr_num}.vcf.gz -Oz -o ${clvcfbychrPath}/cohort_name_forimputation_chr${chr_num}_sorted.vcf.gz
done
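
Before uploading for imputation, it may be useful to confirm that all 22 sorted files were written. The sketch below lists any chromosome whose sorted VCF is missing or empty. The helper name `missing_chr_files` is hypothetical, and the file naming follows the `_sorted.vcf.gz` pattern used in the loop above.

```shell
# Hypothetical helper: print chromosomes (1-22) whose sorted VCF is missing or empty.
# $1 = directory, $2 = file prefix (the text before the chromosome number)
missing_chr_files() {
  dir="$1"
  prefix="$2"
  chr_num=1
  while [ "${chr_num}" -le 22 ]; do
    f="${dir}/${prefix}${chr_num}_sorted.vcf.gz"
    # -s is true only when the file exists and is not empty
    [ -s "${f}" ] || echo "${chr_num}"
    chr_num=$((chr_num + 1))
  done
}

# Example usage (placeholders as used in this notebook):
# missing_chr_files "${clvcfbychrPath}" "cohort_name_forimputation_chr"
```

An empty result means that every chromosome has a non-empty sorted file.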