##### **Title:** Preparing genotype variants and covariates for QTL analysis
##### **Author:** Marliette Matos
##### **Date:** 09/26/2024
##### **Description:** Using genotype files QCed by Sam Gathan and filtered for a MAF>5%

Please note: The QC of these genotypes is exactly the same as the one performed for the eQTL analysis, except that here we remove two samples that are not present in the caQTL data.

In [None]:
#Environment variables
VCF="/gchm/cd4_QTL_analysis/01_genotype_snps_covar/02_genotype_covariates/analysis/001_merging_all_chr_vcf_MAF5/CD4_all_chr_ashkenazi.407.AF5.Q5.BA.vcf.gz"
OUTDIR="/gchm/cd4_caQTL_analysis/variant_to_peak_QTL/run_012625_qc_aware_qsmooth_CPM_MAF5_FDR5_1MB/results/004_genotypes/plink"
OUTLIERS_IID='/gchm/cd4_QTL_analysis/01_genotype_snps_covar/02_genotype_covariates/scripts/sam_MAF5/v4_removing_PC_outliers/CD4.allsamples.allchrs.snps.geno.mind.af1.hwe.kingcutoff.0884.king.cutoff.out.id'
#these outlier samples are the related samples plus 4 additional outliers
INVERSION_SITES="/gchm/cd4_aging_genotypes/wgs_QC/scripts/inversion.txt"

#samples common to WGS and ATAC-seq
#contains two fewer samples than the eQTL analysis
COMMON_SAMPLES="/gchm/cd4_caQTL_analysis/variant_to_peak_QTL/run_101424_cpm_tmm_maf5_fdr5_50kb/results/001_peaks/001_common_samples_atac_wgs.in.tsv"

In [None]:
#convert the VCF to PLINK binary format and remove related samples
!plink2 --vcf $VCF \
--const-fid 0 \
--memory 120000 \
--remove $OUTLIERS_IID \
--make-bed --out $OUTDIR/CD4_all_chr_ashkenazi.384.AF5.Q5.BA.king
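
A quick sanity check after the conversion is to count the samples left in the `.fam` file, which has one line per sample. The sketch below is a minimal stdlib helper; the function name `count_samples` is hypothetical, and the expectation of 384 samples is only inferred from the output file name, not confirmed by the notebook.

```python
def count_samples(fam_path):
    """Count non-empty lines in a PLINK .fam file (one line per sample)."""
    with open(fam_path) as handle:
        return sum(1 for line in handle if line.strip())


# Hypothetical usage; the path mirrors the --out prefix used above.
# n_samples = count_samples("CD4_all_chr_ashkenazi.384.AF5.Q5.BA.king.fam")
# print(n_samples)
```

The same helper can be pointed at the `.fam` written by the later `--keep` step to confirm that only the common samples remain.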

In [None]:
# Keeping only common samples  
!plink2 --bfile $OUTDIR/CD4_all_chr_ashkenazi.384.AF5.Q5.BA.king \
--memory 12000 \
--keep $COMMON_SAMPLES \
--make-bed --out $OUTDIR/CD4_all_chr_ashkenazi.362.AF1.QC.BA.king2

In [None]:
# Remove variants that deviate from Hardy-Weinberg equilibrium (p < 1e-6)
!plink2 --bfile $OUTDIR/CD4_all_chr_ashkenazi.362.AF1.QC.BA.king2 \
--memory 12000 \
--hwe 1e-6 \
--make-bed --out $OUTDIR/CD4_all_chr_ashkenazi.362.AF1.QC.BA.king2.hwe

In [None]:
# LD pruning: drop highly correlated variants to speed up the PCA and GRM computation
!plink2 --bfile $OUTDIR/CD4_all_chr_ashkenazi.362.AF1.QC.BA.king2.hwe \
--memory 12000 \
--exclude $INVERSION_SITES \
--indep-pairwise 50 5 0.2 \
--out $OUTDIR/CD4_all_chr_ashkenazi.362.AF1.QC.BA.king2.hwe.indepSNP
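
`--indep-pairwise` writes two lists of variant IDs, `<prefix>.prune.in` and `<prefix>.prune.out`. A minimal sketch for summarising how many variants survive pruning is shown below; the helper names are hypothetical, and the totals cover only the variants listed in those two files.

```python
def read_ids(path):
    """Read one variant ID per line, skipping blank lines."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def pruning_summary(prefix):
    """Summarise <prefix>.prune.in and <prefix>.prune.out from --indep-pairwise."""
    kept = read_ids(prefix + ".prune.in")
    removed = read_ids(prefix + ".prune.out")
    total = len(kept) + len(removed)
    return {
        "kept": len(kept),
        "removed": len(removed),
        "fraction_kept": len(kept) / total if total else 0.0,
    }


# Hypothetical usage with the --out prefix from the cell above:
# print(pruning_summary("CD4_all_chr_ashkenazi.362.AF1.QC.BA.king2.hwe.indepSNP"))
```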

In [None]:
#Calculate PCs
!plink2 --bfile $OUTDIR/CD4_all_chr_ashkenazi.362.AF1.QC.BA.king2.hwe \
--extract $OUTDIR/CD4_all_chr_ashkenazi.362.AF1.QC.BA.king2.hwe.indepSNP.prune.in \
--pca 50 \
--out $OUTDIR/CD4_all_chr_ashkenazi.362.AF1.QC.BA.king2.hwe.ld
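
To help decide how many genotype PCs to carry forward as covariates, the eigenvalues in the `.eigenval` file (one value per line) can be turned into relative shares. This is a sketch with a hypothetical helper name. Because only the requested 50 eigenvalues are written, the shares are relative to the sum of those 50, not to the total genetic variance.

```python
def eigenvalue_shares(eigenval_path):
    """Return each eigenvalue divided by the sum of all eigenvalues in the file."""
    with open(eigenval_path) as handle:
        values = [float(line) for line in handle if line.strip()]
    total = sum(values)
    return [value / total for value in values]


def cumulative(shares):
    """Running sum of the shares, useful for picking a cutoff."""
    running, out = 0.0, []
    for share in shares:
        running += share
        out.append(running)
    return out


# Hypothetical usage with the --out prefix from the cell above:
# shares = eigenvalue_shares("CD4_all_chr_ashkenazi.362.AF1.QC.BA.king2.hwe.ld.eigenval")
# print(cumulative(shares)[:10])
```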

In [None]:
#Genetic relatedness matrix (GRM)
!plink2 --bfile $OUTDIR/CD4_all_chr_ashkenazi.362.AF1.QC.BA.king2.hwe \
--extract $OUTDIR/CD4_all_chr_ashkenazi.362.AF1.QC.BA.king2.hwe.indepSNP.prune.in \
--make-rel 'square' \
--out $OUTDIR/CD4_all_chr_ashkenazi.362.AF1.QC.BA.king2.hwe.grm 
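
With `--make-rel square`, plink2 writes a text matrix (`.rel`) and the matching sample IDs (`.rel.id`). Before using the matrix downstream, it is worth confirming that it is square, symmetric, and the same size as the ID list. The sketch below assumes whitespace-delimited values, a `#`-prefixed header in the ID file, and that the IID is the last column (no SID column). The function names are hypothetical.

```python
def load_rel(prefix):
    """Load <prefix>.rel (square text matrix) and <prefix>.rel.id (sample IDs)."""
    with open(prefix + ".rel") as handle:
        matrix = [[float(x) for x in line.split()] for line in handle if line.strip()]
    with open(prefix + ".rel.id") as handle:
        # Skip the header line starting with '#'; assume IID is the last column.
        ids = [
            line.split()[-1]
            for line in handle
            if line.strip() and not line.startswith("#")
        ]
    return ids, matrix


def check_rel(ids, matrix, tol=1e-6):
    """Raise ValueError if the matrix is not n x n and symmetric; return n."""
    n = len(ids)
    if len(matrix) != n or any(len(row) != n for row in matrix):
        raise ValueError("matrix dimensions do not match the number of sample IDs")
    for i in range(n):
        for j in range(i):
            if abs(matrix[i][j] - matrix[j][i]) > tol:
                raise ValueError(f"matrix is not symmetric at ({i}, {j})")
    return n


# Hypothetical usage with the --out prefix from the cell above:
# ids, grm = load_rel("CD4_all_chr_ashkenazi.362.AF1.QC.BA.king2.hwe.grm")
# print(check_rel(ids, grm))
```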

#### Before passing the resulting files to tensorQTL, the variants should be renamed by chr/pos/ref/alt -> 004_change_var_names.sh
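
The contents of `004_change_var_names.sh` are not shown here, so the following is only a sketch of what such a renaming could look like when applied to a `.bim` file. It assumes the six standard `.bim` columns, that column 5 holds ALT and column 6 holds REF (the plink2 default when importing a VCF), and an ID layout of `chr<chrom>_<pos>_<ref>_<alt>`. The separator and the function name `rename_bim_ids` are illustrative choices, not the script's actual convention.

```python
def rename_bim_ids(bim_in, bim_out, sep="_"):
    """Rewrite the variant ID column of a .bim file as chr/pos/ref/alt."""
    with open(bim_in) as src, open(bim_out, "w") as dst:
        for line in src:
            if not line.strip():
                continue
            fields = line.split()
            if len(fields) != 6:
                raise ValueError(f"expected 6 columns in .bim line: {line!r}")
            # Assumption: column 5 is ALT and column 6 is REF.
            chrom, _old_id, cm, pos, alt, ref = fields
            label = chrom if chrom.startswith("chr") else "chr" + chrom
            new_id = sep.join([label, pos, ref, alt])
            dst.write("\t".join([chrom, new_id, cm, pos, alt, ref]) + "\n")


# Hypothetical usage:
# rename_bim_ids("genotypes.bim", "genotypes.renamed.bim")
```

plink2 can also assign IDs from a template with `--set-all-var-ids` (for example `@:#:$r:$a`), which avoids editing the `.bim` by hand.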

Splitting the pre-pruned genotypes by chromosome for cis-eQTL calling -> 005_splitting_genotyped_by_chr.sh
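
The splitting script itself is not reproduced here. As a rough illustration, the per-chromosome plink2 commands could be generated as below. This sketch assumes autosomes 1 to 22 and an output naming scheme of `<prefix>.chr<N>`; both are assumptions, as is the function name `split_commands`.

```python
def split_commands(bfile_prefix, chromosomes=range(1, 23)):
    """Build one plink2 argument list per chromosome (autosomes by default)."""
    return [
        [
            "plink2",
            "--bfile", bfile_prefix,
            "--chr", str(chrom),
            "--make-bed",
            "--out", f"{bfile_prefix}.chr{chrom}",
        ]
        for chrom in chromosomes
    ]


# Hypothetical usage: print the commands, or pass each list to subprocess.run.
# for cmd in split_commands("CD4_all_chr_ashkenazi.362.AF1.QC.BA.king2.hwe"):
#     print(" ".join(cmd))
```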