# GWAS Analysis on LAGC Cohorts Using Regenie
Authors: Rafaella Ormond and Jose Jaime Martinez-Magana <br>

***Description:***<br>
In this analysis, a GWAS was performed using **Regenie** for one cohort in the LAGC.

**INPUT:**
1) Genotype data in PLINK 2 pgen/pvar/psam format (.pgen, .pvar, .psam)
2) Phenotype and covariate files corresponding to the cohort
3) Covariates for adjustment in the GWAS

**OUTPUT:** Regenie Step 2 GWAS association results (`.regenie` files with summary statistics)


### ***Requirements:***

**Regenie** is designed for use with cohorts that have a sample size greater than 300.

### Download Plink
PLINK 1.9 and PLINK 2.0 can be downloaded by following the steps on their websites.<br>
To install PLINK 2.0, [access here](https://www.cog-genomics.org/plink/2.0/)<br>
To install PLINK 1.9, [access here](https://www.cog-genomics.org/plink/1.9/) <br>

### Download Regenie
To download **Regenie**, follow this link: [Access here](https://github.com/rgcgithub/regenie)

For more details, please refer to the analysis plan: [Access here](https://docs.google.com/document/d/1RzD5kBlj9rfiomda1G3NfxYDXLdmIUO7VX0cSNj70Kk/edit?usp=sharing)
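
Before starting the analysis steps, it can help to confirm that both tools are available on the `PATH`. The following is a minimal sketch: the helper name `require_cmd` is hypothetical, and the executable names `plink2` and `regenie` are assumptions that may differ on your system (for example, a versioned Regenie binary).

```shell
# Hypothetical helper: report whether a command is available on the PATH.
require_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1" >&2
        return 1
    fi
}

# Executable names are assumptions; adjust them to your installation.
for tool in plink2 regenie; do
    require_cmd "$tool" || echo "Please install $tool before continuing."
done
```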

### Analysis Steps:
1) Recode pfile to bfile<br>
2) Recode pgen to bgen<br>
3) Regenie step 1 (whole-genome regression model fitting)<br>
4) Regenie step 2 (Association testing)

### 1) Recode pfile to bfile
**Description:**<br>
Recode PLINK 2 files (`.pgen`/`.pvar`/`.psam`) to PLINK 1 binary files (`.bed`/`.bim`/`.fam`), if necessary.<br>
Please adjust the input paths and filenames accordingly.

In [None]:
## Convert pgen to bfile for Step 1
# This step converts the PLINK 2 files (.pgen/.pvar/.psam) to PLINK 1 binary files (.bed/.bim/.fam) for use in Step 1
# Make sure to adjust the file names and suffixes according to your PLINK files

plink2 \
    --pfile cohort_name \
    --maf 0.01 --mac 100 --geno 0.1 --hwe 1e-10 --mind 0.1 \
    --write-snplist --write-samples --no-id-header --make-bed --snps-only 'just-acgt' --max-alleles 2 \
    --out cohort_name
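
After the conversion, it can be useful to check how many variants and samples passed the filters, given the sample-size requirement stated above. The sketch below counts lines in the `.bim` and `.fam` files. The helper name `qc_summary` is hypothetical, and the toy files are synthetic and only illustrate the call.

```shell
# Hypothetical helper: count variants (.bim) and samples (.fam) for a bfile prefix
# and warn when the sample size is not greater than a minimum (default 300).
qc_summary() (
    prefix=$1
    min_n=${2:-300}
    n_var=$(wc -l < "${prefix}.bim" | tr -d ' ')
    n_sam=$(wc -l < "${prefix}.fam" | tr -d ' ')
    echo "variants=${n_var} samples=${n_sam}"
    if [ "$n_sam" -le "$min_n" ]; then
        echo "warning: sample size is not greater than ${min_n}" >&2
    fi
)

# Toy demonstration on synthetic files (not real data).
demo_dir=$(mktemp -d)
printf '1 rs1 0 100 A G\n1 rs2 0 200 C T\n' > "${demo_dir}/toy.bim"
printf '0 id1 0 0 0 -9\n0 id2 0 0 0 -9\n0 id3 0 0 0 -9\n' > "${demo_dir}/toy.fam"
qc_summary "${demo_dir}/toy"
```

With real data, the call would be `qc_summary cohort_name`, using the same prefix passed to `--out`.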

### 2) Recode pgen to bgen
**Description:**<br>
This step recodes PLINK2 format files (`.pgen`) to (`.bgen`) format for use in step 2.

In [None]:
## Recode pgen to bgen
# Reading the pgen files directly gave an error related to the .psam file, so the data are exported to bgen for Step 2
# Make sure to adjust the file names and suffixes according to your PLINK files

plink2 --pfile cohort_name --export bgen-1.1 --snps-only 'just-acgt' --max-alleles 2 --out cohort_name --threads 30
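
Because Step 1 reads the `.fam` file and Step 2 reads the `.sample` file written alongside the `.bgen`, a quick comparison of the sample IDs in both can catch mismatches early. This is a minimal sketch, assuming the IID is in column 2 of both files and that the `.sample` file has the usual two header lines. The helper name `shared_samples` is hypothetical and the toy files are synthetic.

```shell
# Hypothetical helper: count IIDs present in both a .fam file (column 2)
# and an Oxford .sample file (column 2, after its two header lines).
shared_samples() (
    fam=$1
    sample=$2
    awk 'NR == FNR { seen[$2] = 1; next }   # first file: remember the .fam IIDs
         FNR > 2 && ($2 in seen) { n++ }    # second file: skip the 2 header lines
         END { print n + 0 }' "$fam" "$sample"
)

# Toy demonstration on synthetic files (not real data).
demo_dir=$(mktemp -d)
printf '0 id1 0 0 0 -9\n0 id2 0 0 0 -9\n0 id3 0 0 0 -9\n' > "${demo_dir}/toy.fam"
printf 'ID_1 ID_2 missing sex\n0 0 0 D\n0 id1 0 NA\n0 id2 0 NA\n0 id4 0 NA\n' > "${demo_dir}/toy.sample"
shared_samples "${demo_dir}/toy.fam" "${demo_dir}/toy.sample"
```

A count lower than the number of samples in the `.fam` file is expected to some extent, since the sample filters from the bfile conversion are not applied in the bgen export. A count of zero usually points to an ID formatting difference.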

# Regenie

For steps 1 and 2, phenotype and covariate files will be used, both separated by sex.  

**Phenotype files (`inph`)** must include the following columns:  
`FID`, `IID`, `Pheno1`, `Pheno2`  

There should be two separate files:  
- `inph_bt`: for **binary traits**  
- `inph_qt`: for **quantitative traits**  

The phenotypes must be selected and adjusted based on those available in your cohort and according to the analysis plan:  
[Access the analysis plan here](https://docs.google.com/document/d/1RzD5kBlj9rfiomda1G3NfxYDXLdmIUO7VX0cSNj70Kk/edit?usp=sharing)  

**Covariate file (`inco`)** must include the following columns:  
`FID`, `IID`, `age`, and `PC1` to `PC10`  

The 10 PCs (Principal Components) must be generated using **PC-AiR**.  
Make sure the individuals in the phenotype and covariate files match.
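
One way to check that the individuals match is to compare the `FID`/`IID` pairs of the two files. The sketch below assumes whitespace-delimited files with a header line, as described above. The helper name `ids_only_in_first` is hypothetical, and the toy files are synthetic (only `PC1` is included to keep them short).

```shell
# Hypothetical helper: list FID/IID pairs found in the first file but not in the second.
ids_only_in_first() (
    awk 'FNR == 1 { next }                      # skip the header line of each file
         { key = $1 " " $2 }                    # FID and IID identify an individual
         NR == FNR { seen[key] = 1; next }      # first file read: the reference set
         !(key in seen) { print key }' "$2" "$1"
)

# Toy demonstration on synthetic files (not real data).
demo_dir=$(mktemp -d)
printf 'FID IID Pheno1 Pheno2\n0 id1 1 0\n0 id2 0 1\n0 id3 1 1\n' > "${demo_dir}/pheno.txt"
printf 'FID IID age PC1\n0 id1 34 0.01\n0 id2 51 -0.02\n0 id4 47 0.03\n' > "${demo_dir}/covar.txt"

echo "Only in the phenotype file:"
ids_only_in_first "${demo_dir}/pheno.txt" "${demo_dir}/covar.txt"
echo "Only in the covariate file:"
ids_only_in_first "${demo_dir}/covar.txt" "${demo_dir}/pheno.txt"
```

Both calls should print nothing when the two files contain the same individuals.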

> **Note:** !!! Warning: we will use HapMap SNPs for Step 1 !!!<br>
> For HapMap, we randomly selected 250k SNPs. Please download this file from GitHub: [LINK here](https://github.com/ormondr/Smoking_GWAS_LAGC/blob/main/English/02GWAS/00Regenie/w_hm3_hg38_random250K.snplist)

### 3) Regenie Step 1

**Description:**<br>
**Step 1 – Whole-genome regression model fitting**<br>
This step fits a ridge regression model across the genome using genotype data.  
The goal is to estimate individual trait predictions (leave-one-chromosome-out, LOCO) while accounting for population structure and relatedness.  
Input: genotype data in PLINK format and phenotype/covariate files.  
Output: LOCO predictions used as offsets in Step 2.
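
Since Step 1 is restricted to the HapMap SNP list through `--extract`, which matches on variant ID, it may be worth checking how many IDs in the list are present in the `.bim` file before fitting the model. The following is a minimal sketch. The helper name `count_overlap` is hypothetical, and the toy files are synthetic.

```shell
# Hypothetical helper: count variant IDs from a snplist that appear in a .bim file.
count_overlap() (
    snplist=$1
    bim=$2
    awk 'NR == FNR { keep[$1] = 1; next }   # first file: variant IDs to extract
         ($2 in keep) { n++ }               # second file: .bim, variant ID in column 2
         END { print n + 0 }' "$snplist" "$bim"
)

# Toy demonstration on synthetic files (not real data).
demo_dir=$(mktemp -d)
printf 'rs1\nrs3\nrs9\n' > "${demo_dir}/toy.snplist"
printf '1 rs1 0 100 A G\n1 rs2 0 200 C T\n1 rs3 0 300 G A\n' > "${demo_dir}/toy.bim"
count_overlap "${demo_dir}/toy.snplist" "${demo_dir}/toy.bim"
```

A very low count with real data would suggest that the variant IDs in the genotype files follow a different naming scheme from the list.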

> **Note:** <br>
> The `"inph"` files contain the phenotype information, with `"bt"` for binary traits and `"qt"` for quantitative traits<br>
> The `"inco"` files contain the covariate information<br>
> **The analysis needs to be separated by sex**

In [None]:
## Set parameters
# Set genotype data
# Adjust according to your data
# "inge" must be the prefix of the PLINK files (path and file name without the extension)
inge="path/cohort_name"

# The "inph" files are the phenotype information, "bt" for binary traits and "qt" for quantitative traits, separated by sex
# Set phenotype and covariates files for females
female_inph_bt="path/cohort_female_pheno_bt_forregenie.txt"
female_inph_qt="path/cohort_female_pheno_qt_forregenie.txt"
female_inco="path/cohort_female_covar_forregenie.txt"

# Set phenotype and covariates files for males
male_inph_bt="path/cohort_male_pheno_bt_forregenie.txt"
male_inph_qt="path/cohort_male_pheno_qt_forregenie.txt"
male_inco="path/cohort_male_covar_forregenie.txt"

# Set output for step 1
out_step1_qt_female="path/cohort/out_step1_qt_female"
out_step1_bt_female="path/cohort/out_step1_bt_female"
out_step1_qt_male="path/cohort/out_step1_qt_male"
out_step1_bt_male="path/cohort/out_step1_bt_male"

# Set output for step 2
out_step2_qt_female="path/cohort/out_step2_qt_female"
out_step2_bt_female="path/cohort/out_step2_bt_female"
out_step2_qt_male="path/cohort/out_step2_qt_male"
out_step2_bt_male="path/cohort/out_step2_bt_male"

## If Regenie reports errors related to the FID column, prepend an FID column of zeros. For example, in bash:
# awk 'BEGIN {OFS="\t"} NR==1 {print "FID", $0; next} {print "0", $0}' "${female_inph_qt}" > tmpfile && mv tmpfile "${female_inph_qt}"
## !!! Warning: we will use HapMap SNPs for Step 1 !!!
# For HapMap, we randomly selected 250k SNPs; please download this file from GitHub and adjust the path below
hapmap="/vast/palmer/scratch/montalvo-ortiz/jjm262/02lagc_smoking_gwas/11references/w_hm3_hg38_random250K.snplist"

## Running Step1
# Running for females for quantitative traits
regenie\
    --step 1\
    --bed ${inge}\
    --covarFile ${female_inco}\
    --phenoFile ${female_inph_qt}\
    --bsize 400\
    --qt\
    --extract ${hapmap}\
    --force-step1\
    --out ${out_step1_qt_female}

# Running for females for binary traits
regenie\
    --step 1\
    --bed ${inge}\
    --covarFile ${female_inco}\
    --phenoFile ${female_inph_bt}\
    --bsize 400\
    --iid-only\
    --bt\
    --extract ${hapmap}\
    --force-step1\
    --out ${out_step1_bt_female}

# Running for males for quantitative traits
regenie\
    --step 1\
    --bed ${inge}\
    --covarFile ${male_inco}\
    --phenoFile ${male_inph_qt}\
    --bsize 400\
    --iid-only\
    --qt\
    --extract ${hapmap}\
    --force-step1\
    --out ${out_step1_qt_male}

# Running for males for binary traits
regenie\
    --step 1\
    --bed ${inge}\
    --covarFile ${male_inco}\
    --phenoFile ${male_inph_bt}\
    --bsize 400\
    --iid-only\
    --bt\
    --extract ${hapmap}\
    --force-step1\
    --out ${out_step1_bt_male}
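
The four Step 1 runs differ only in sex and trait type, so the commands can also be generated in a loop. The sketch below only prints the commands (a dry run) rather than executing them. The helper name `print_step1_cmd` is hypothetical, the paths are placeholders that follow the naming pattern of the parameters above, and only the flags shared by all four runs are included.

```shell
# Placeholder paths, following the naming pattern used in the parameters above.
inge="path/cohort_name"
hapmap="path/w_hm3_hg38_random250K.snplist"

# Hypothetical helper: print (not run) the Step 1 command for one sex and trait type.
print_step1_cmd() (
    sex=$1      # "female" or "male"
    trait=$2    # "qt" or "bt"
    echo "regenie --step 1 --bed ${inge}" \
         "--covarFile path/cohort_${sex}_covar_forregenie.txt" \
         "--phenoFile path/cohort_${sex}_pheno_${trait}_forregenie.txt" \
         "--bsize 400 --${trait} --extract ${hapmap} --force-step1" \
         "--out path/cohort/out_step1_${trait}_${sex}"
)

# Print one command per sex and trait type.
for sex in female male; do
    for trait in qt bt; do
        print_step1_cmd "$sex" "$trait"
    done
done
```

Reviewing the printed commands before running them makes it easier to spot a mismatched phenotype file or output prefix.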

### 4) Regenie Step 2

**Description:**<br>
**Step 2 – Association testing:** <br>
This step performs the GWAS using the LOCO predictions from Step 1 as offsets in a linear regression model (quantitative traits) or a logistic regression model with Firth correction (binary traits).  
It tests each variant for association with the trait of interest.  
Input: same phenotype/covariate files, LOCO predictions, and full genotype data.  
Output: GWAS results (effect sizes, p-values, etc.).
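
Step 2 reads the LOCO predictions through the `_pred.list` file written by Step 1, which, to our understanding, lists one phenotype name and the path to its `.loco` file per line. The sketch below checks that every listed file exists before Step 2 is launched. The helper name `check_pred_list` is hypothetical, and the toy files are synthetic.

```shell
# Hypothetical helper: verify that each .loco file listed in a _pred.list file exists.
check_pred_list() (
    status=0
    while read -r pheno loco; do
        [ -n "$pheno" ] || continue   # skip blank lines
        if [ -f "$loco" ]; then
            echo "ok: $pheno"
        else
            echo "missing: $pheno ($loco)"
            status=1
        fi
    done < "$1"
    return "$status"
)

# Toy demonstration: the first .loco file exists, the second does not.
demo_dir=$(mktemp -d)
: > "${demo_dir}/out_step1_qt_female_1.loco"
printf 'Pheno1 %s\nPheno2 %s\n' \
    "${demo_dir}/out_step1_qt_female_1.loco" \
    "${demo_dir}/out_step1_qt_female_2.loco" > "${demo_dir}/out_step1_qt_female_pred.list"
check_pred_list "${demo_dir}/out_step1_qt_female_pred.list" || echo "Some LOCO files are missing."
```

Missing files often mean that Step 1 output was moved after it was written, since the list stores the paths as they were at run time.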

> **Note:** <br>
> The `"inph"` files contain the phenotype information, with `"bt"` for binary traits and `"qt"` for quantitative traits<br>
> The `"inco"` files contain the covariate information<br>
> **The analysis needs to be separated by sex**

In [None]:
## Set parameters
# Set genotype data
# Adjust according to your data
# "inge" must be the prefix of the PLINK/bgen files (path and file name without the extension)
inge="path/cohort_name"

# The "inph" files are the phenotype information, "bt" for binary traits and "qt" for quantitative traits, separated by sex
# Set phenotype and covariates files for females
female_inph_bt="path/cohort_female_pheno_bt_forregenie.txt"
female_inph_qt="path/cohort_female_pheno_qt_forregenie.txt"
female_inco="path/cohort_female_covar_forregenie.txt"

# Set phenotype and covariates files for males
male_inph_bt="path/cohort_male_pheno_bt_forregenie.txt"
male_inph_qt="path/cohort_male_pheno_qt_forregenie.txt"
male_inco="path/cohort_male_covar_forregenie.txt"

# Set output for step 1
out_step1_qt_female="path/cohort/out_step1_qt_female"
out_step1_bt_female="path/cohort/out_step1_bt_female"
out_step1_qt_male="path/cohort/out_step1_qt_male"
out_step1_bt_male="path/cohort/out_step1_bt_male"

# Set output for step 2
out_step2_qt_female="path/cohort/out_step2_qt_female"
out_step2_bt_female="path/cohort/out_step2_bt_female"
out_step2_qt_male="path/cohort/out_step2_qt_male"
out_step2_bt_male="path/cohort/out_step2_bt_male"

# Running for females for quantitative traits
regenie \
  --step 2 \
  --bgen ${inge}.bgen \
  --ref-first \
  --sample ${inge}.sample \
  --phenoFile ${female_inph_qt} \
  --covarFile ${female_inco} \
  --qt \
  --pred ${out_step1_qt_female}_pred.list \
  --bsize 400 \
  --out ${out_step2_qt_female}

# Running for females for binary traits
regenie \
  --step 2 \
  --bgen ${inge}.bgen \
  --ref-first \
  --sample ${inge}.sample \
  --phenoFile ${female_inph_bt} \
  --covarFile ${female_inco} \
  --bt \
  --firth --approx --pThresh 0.01 \
  --pred ${out_step1_bt_female}_pred.list \
  --bsize 400 \
  --out ${out_step2_bt_female}

# Running for males for quantitative traits
regenie \
  --step 2 \
  --bgen ${inge}.bgen \
  --ref-first \
  --sample ${inge}.sample \
  --phenoFile ${male_inph_qt} \
  --covarFile ${male_inco} \
  --qt \
  --pred ${out_step1_qt_male}_pred.list \
  --bsize 400 \
  --out ${out_step2_qt_male}

# Running for males for binary traits
regenie \
  --step 2 \
  --bgen ${inge}.bgen \
  --ref-first \
  --sample ${inge}.sample \
  --phenoFile ${male_inph_bt} \
  --covarFile ${male_inco} \
  --bt \
  --firth --approx --pThresh 0.01 \
  --pred ${out_step1_bt_male}_pred.list \
  --bsize 400 \
  --out ${out_step2_bt_male}
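
Once Step 2 finishes, the `.regenie` result files can be screened for genome-wide significant variants. The sketch below filters on the `LOG10P` column, which is located by its header name, using a default threshold of 7.30103 (p < 5e-8). The helper name `top_hits` is hypothetical, and the toy file contains synthetic values that only mimic the layout of Regenie output.

```shell
# Hypothetical helper: print the header and the variants whose LOG10P exceeds a threshold.
top_hits() (
    file=$1
    thresh=${2:-7.30103}   # -log10(5e-8)
    awk -v t="$thresh" '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "LOG10P") col = i; print; next }
        col && $col != "NA" && $col + 0 > t + 0 { print }' "$file"
)

# Toy demonstration on a synthetic results file (not real data).
demo_dir=$(mktemp -d)
{
    echo "CHROM GENPOS ID ALLELE0 ALLELE1 A1FREQ INFO N TEST BETA SE CHISQ LOG10P EXTRA"
    echo "1 100 rs1 A G 0.2 1 500 ADD 0.5 0.08 39.06 9.4 NA"
    echo "1 200 rs2 C T 0.3 1 500 ADD 0.01 0.08 0.02 0.05 NA"
} > "${demo_dir}/toy_Pheno1.regenie"
top_hits "${demo_dir}/toy_Pheno1.regenie"
```

With real data, the input would be one of the files written under the Step 2 output prefix, for example a file named after `${out_step2_qt_female}` and the phenotype.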