In [None]:
## First, we will perform LD pruning with plink2
# Replace with the path and file name prefix of your plink2 files
in="/path_to_your_data/plink_prefix"
# First pass: list the variants to keep after LD pruning
plink2 --pfile "${in}" \
    --indep-pairwise 50 5 0.2 --geno 0.01 --mind 0.01 --maf 0.05 \
    --out "${in}_indep_snps"
# Second pass: keep only the pruned-in variants and write bed/bim/fam files
plink2 --pfile "${in}" \
    --extract "${in}_indep_snps.prune.in" --make-bed \
    --out "${in}_forpcair"
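
As a quick sanity check after pruning, it can help to count how many variants were kept and removed. The sketch below assumes plink2 wrote `${in}_indep_snps.prune.in` and `${in}_indep_snps.prune.out`, each with one variant ID per line; the helper name `count_variants` is hypothetical and not part of the pipeline.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the number of lines (variant IDs) in a file.
count_variants() {
    # Reading from stdin keeps wc from printing the file name;
    # tr strips the padding that some wc implementations add.
    wc -l < "$1" | tr -d '[:space:]'
}

# Example usage, guarded so it only runs when the pruning outputs exist.
in="/path_to_your_data/plink_prefix"
if [ -f "${in}_indep_snps.prune.in" ] && [ -f "${in}_indep_snps.prune.out" ]; then
    echo "kept:    $(count_variants "${in}_indep_snps.prune.in")"
    echo "removed: $(count_variants "${in}_indep_snps.prune.out")"
fi
```

A very small "kept" count may indicate that the `--geno`, `--mind`, or `--maf` filters are too strict for your data.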

### 2. Creating GDS files

***Description:***

This script creates the GDS files used for the principal component calculation in the next step.

### 2.1 Creating the script that builds the GDS files 
Script name: **create_gds.Rscript**

You can download this script from GitHub: [create_gds.Rscript](https://github.com/ormondr/Smoking_GWAS_LAGC/blob/main/English/02PCair/create_gds.Rscript)

In [None]:
####################################################################################
# This script uses a plink file to create a GDS file for input to PCAiR
####################################################################################
# set parameters
# optparse is used to define the command-line arguments of the script
library(optparse) 
option_list = list(
    make_option(c("--plinkfile"), type="character", default=NULL,
                help="path and prefix of the plink files, without the .bed/.bim/.fam extensions", metavar="character"),
    make_option(c("--out"), type="character", default=NULL,
                help="output file name for GDS file, use complete path, example /data/analysis/analysis.gds", metavar="character")
);

opt_parser = OptionParser(option_list=option_list);
opt = parse_args(opt_parser);

# both arguments are required: the plink prefix and the output GDS file name
if (is.null(opt$plinkfile) || is.null(opt$out)){
    print_help(opt_parser)
    stop("Both --plinkfile and --out must be supplied", call.=FALSE)
}

###################################################################################
# load libraries
library(GENESIS)
library(GWASTools)
library(SNPRelate)

# build the plink file names from the prefix (paste0 takes no sep argument)
bed_f=paste0(opt$plinkfile,".bed")
bim_f=paste0(opt$plinkfile,".bim")
fam_f=paste0(opt$plinkfile,".fam")

# create gds output files
snpgdsBED2GDS(bed.fn=bed_f,
              bim.fn=bim_f,
              fam.fn=fam_f,
              family=TRUE,
              out.gdsfn=opt$out)
# end of script

### 2.2 Create the GDS files 
***Description:***
This step runs the script "create_gds.Rscript" created in the previous step. <br>
Note: modify the parameters to match your own files.

In [None]:
# creating GDS files
# set parameters
script='/path_to_your_data/create_gds.Rscript'
inpath='/path_to_your_data/01plinkldpruned'
oupath='/path_to_your_data/02gdsprunned'

# run the script (replace cohort_name with the name of your cohort)
Rscript "${script}" --plinkfile="${inpath}/cohort_name_hapmap_metalprs_allchrs" --out="${oupath}/cohort_name_hapmap_metalprs_allchrs.gds"
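
Because `create_gds.Rscript` expects a prefix that resolves to three files (`.bed`, `.bim`, `.fam`), a small pre-flight check can save a failed run. This is a minimal sketch; the helper name `check_plink_prefix` is hypothetical and not part of the pipeline.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if <prefix>.bed, <prefix>.bim and
# <prefix>.fam all exist; report the first missing file on stderr.
check_plink_prefix() {
    local prefix="$1" ext
    for ext in bed bim fam; do
        if [ ! -f "${prefix}.${ext}" ]; then
            echo "missing: ${prefix}.${ext}" >&2
            return 1
        fi
    done
}
```

For example, `check_plink_prefix "${inpath}/cohort_name_hapmap_metalprs_allchrs"` can be run before the `Rscript` call, so the job stops early if the prefix is wrong.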

### 3. Principal Components Calculation Using PC-AiR

**Description:**

This script performs Principal Component Analysis (PCA); the resulting principal components (PCs) will be used as covariates in the GWAS.  
The PCA is conducted with the PC-AiR method, implemented in the `pcair` function of the **GENESIS** package.

- To learn more about PC-AiR and PC-Relate, see this vignette: [GENESIS PCA Guide](https://bioconductor.org/packages/devel/bioc/vignettes/GENESIS/inst/doc/pcair.html)  
  - See **Section 3** for PC-AiR and **Section 4** for PC-Relate.
  
> **Note:**  
> PCA requires prior LD pruning, which is performed in the first step of this pipeline.  
> The file containing the top 10 PCs will be used for all GWAS methods (e.g., Regenie, GMMAT, SAIGE).  
> The GRM file is required only for GMMAT, but we recommend generating it anyway, so that you won’t need to repeat this step if you decide to use GMMAT later.




### 3.1. Developing the script for PCA and GRM with PCAiR and PCRelate
*Script written in R.*
This script does not need to be modified; the paths and file names are supplied in the next step.<br>
Script name: **create_grm.Rscript**

You can download this script from GitHub: [create_grm.Rscript](https://github.com/ormondr/Smoking_GWAS_LAGC/blob/main/English/02PCair/create_grm.Rscript)


In [None]:
####################################################################################
# R script for creating GRM and PCs
# day: 25 July 2025
# author: Rafaella Ormond and Jose Jaime Martinez-Magana
####################################################################################
# This script uses a gds file to estimate PCs and GRM using GENESIS
####################################################################################
# set parameters
# optparse is used to define the command-line arguments of the script
library(optparse) 
option_list = list(
    make_option(c("--gdsfile"), type="character", default=NULL,
                help="path to gds file", metavar="character"),
    make_option(c("--out"), type="character", default=NULL,
                help="output path for PCAiR objects, use a complete directory path and a potential name of files. Example: /home/user/out_pca", metavar="character")
);

opt_parser = OptionParser(option_list=option_list);
opt = parse_args(opt_parser);

# both arguments are required: the input GDS file and the output prefix
if (is.null(opt$gdsfile) || is.null(opt$out)){
  print_help(opt_parser)
  stop("Both --gdsfile and --out must be supplied", call.=FALSE)
}

###################################################################################
# load libraries
library(GENESIS)
library(GWASTools)
library(SNPRelate)
library(SeqArray)
library(parallel)
library(BiocParallel)
library(stringi);

# setting a seed
set.seed(1000)
# setting the number of cores
cores=detectCores()

# create function for LD pruning
ld_prun=function(gds){
    snpset=snpgdsLDpruning(gds,
                           method="corr",
                           slide.max.bp=10e6,
                           ld.threshold=sqrt(0.1),
                           maf=0.01,
                           missing.rate=0.01,
                           verbose=TRUE, num.thread=cores);
    pruned=unlist(snpset,
                  use.names=FALSE);
    return(pruned)
}

# create function for KING-robust analysis
king_mat=function(gds){
    samp.id=read.gdsn(index.gdsn(gds, "sample.id"))
    ibd.robust=snpgdsIBDKING(gds, sample.id=samp.id, family.id=NULL,
                             maf=0.01, missing.rate=0.01, num.thread=cores)
    return(ibd.robust)
}


# create function for PCAiR
pcair_r=function(gds_geno, pruned, KINGmat){
    pcair=pcair(gds_geno, snp.include=pruned,
                kinobj=KINGmat, divobj=KINGmat)
    return(pcair)
}

# create function for PCAiR with sample filter
pcair_r_fil=function(gds_geno, pruned, KINGmat, sample_list){
    pcair=pcair(gds_geno, snp.include=pruned,
                sample.include=sample_list,
                kinobj=KINGmat, divobj=KINGmat)
    return(pcair)
}


# create function to match the sample ID in the GDS with the sample IDs of the supplied sampleID filter
filter_samples=function(gds_samples, incl_samples){
    matches=unique(grep(paste(incl_samples$SampleID,collapse="|"),
                        gds_samples, value=TRUE))
    return(matches)
}


# run analysis for pcs with relationships
# open the gds object
gds=snpgdsOpen(opt$gdsfile);
# LD pruning
pruned=ld_prun(gds)
# build KING matrix
KINGmat=king_mat(gds)
# extract the kinship matrix from the KING result
KINGmat_m=KINGmat$kinship
# add sample IDs to column names and row names
colnames(KINGmat_m)=KINGmat$sample.id
rownames(KINGmat_m)=KINGmat$sample.id
# get samples in gds
gds_samples=read.gdsn(index.gdsn(gds, "sample.id"))
# close the gds object
snpgdsClose(gds)

# read the gds object
gds=GdsGenotypeReader(filename=opt$gdsfile)

# create a GenotypeData class object
gds_geno=GenotypeData(gds)

# run PCAiR
PCair=pcair_r(gds_geno, pruned, KINGmat_m)
PCairpart=pcairPartition(kinobj = KINGmat_m, divobj = KINGmat_m)

# PC relate calculation 
gdsData=GenotypeBlockIterator(gds_geno, snpInclude=pruned)
PCrelate=pcrelate(gdsData,
                  pcs=PCair$vectors[,1:2],
                  training.set=PCairpart$unrels,
                  BPPARAM=BiocParallel::SerialParam())

# making GRM
grm=pcrelateToMatrix(PCrelate)
# set entries below 0.05 to zero (this also changes the grm object saved below)
grm[grm<0.05]=0
# store as a sparse matrix; as.matrix() ignores sparse=TRUE and returns a dense matrix
grm_sparse=Matrix::Matrix(as.matrix(grm), sparse=TRUE)

# save file
out=list()
out$PCair=PCair
out$PCairpart=PCairpart
out$KINGmat=KINGmat_m
out$pruned=pruned
out$PCrelate=PCrelate
out$grm=grm
out$grm_sparse=grm_sparse
saveRDS(out, file=paste0(opt$out,".rds"))

### 3.2. Generate the PCA and GRM files
***Description:***

This step runs the "create_grm.Rscript" created in the previous step.
You will need to adjust it according to your own input files.
The output is an .rds file containing both the PCA information and the Genetic Relationship Matrix (GRM).

In [None]:
# run R script for creating the PCA and GRM files
# add the path of the create_grm.Rscript from the previous step
script="create_grm.Rscript"
# replace gdsfile_prunned with the path to the GDS file created in step 2.2
Rscript "${script}" \
    --gdsfile=gdsfile_prunned \
    --out=cohort_name.gds_prunned_grm_pca

### 3.3. Save the PCA file

***Description:***
This script extracts the top 10 principal components from the .rds file generated in the previous step and saves them, together with the sample IDs, in a separate file to be used as covariates in Regenie.

In [None]:
# Load the PCA/GRM result file
pca_grm <- readRDS("cohort_name.gds_prunned_grm_pca.rds")

# Extract sample IDs (IIDs)
iids <- rownames(pca_grm$PCair$vectors)

# Extract top 10 PCs
pcs <- as.data.frame(pca_grm$PCair$vectors[, 1:10])
colnames(pcs) <- paste0("PC", 1:10)

# add sample ID
pcs$SampleID <- iids

covariate_table <- pcs

# Save to file
write.table(covariate_table,
            file = "cohort_10pcs_forregenie.txt",
            quote = FALSE,
            row.names = FALSE,
            col.names = TRUE,
            sep = "\t")
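
Before passing the covariate file to a GWAS tool, it may be worth confirming that every row has the same number of tab-separated columns. The sketch below is one way to do that from the shell; the helper name `check_columns` is hypothetical. For the file written above (PC1 to PC10 plus SampleID) the reported count should be 11.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the number of tab-separated columns if every
# row has the same count; exit with status 1 if rows differ or the file is empty.
check_columns() {
    awk -F '\t' '
        NR == 1 { n = NF }       # remember the header width
        NF != n { bad = 1 }      # flag any row with a different width
        END { if (bad || NR == 0) exit 1; print n }
    ' "$1"
}

# Example usage, guarded so it only runs when the covariate file exists.
if [ -f "cohort_10pcs_forregenie.txt" ]; then
    check_columns "cohort_10pcs_forregenie.txt"
fi
```

Note that this only checks the table shape; whether the column layout matches what a given tool expects for its covariate file should be verified against that tool's documentation.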
