# Principal Components Inference

This script performs Principal Component Analysis (PCA) to generate covariates for the GWAS of smoking traits in the LAGC cohorts.  
The following steps will be carried out:

1) LD pruning using PLINK  
2) Principal Components calculation using PC-AiR

### 1) LD pruning using PLINK

Description: This code performs LD pruning with PLINK 2 ahead of the PCs analysis.

You can also run this step on bfiles, but if the files are split by chromosome, pfiles are faster to merge.

PLINK version 1.9 and version 2.0 can be downloaded by following the steps on their websites.<br>
To install plink2, [access here](https://www.cog-genomics.org/plink/2.0/)<br>
To install plink1.9, [access here](https://www.cog-genomics.org/plink/1.9/)
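
If the genotype data are split by chromosome, the per-chromosome pfiles can be merged into one fileset before pruning. The sketch below is one possible way to do this, not the pipeline's own method. It assumes a hypothetical naming pattern (`cohort_chr1.pgen`, `cohort_chr2.pgen`, ...) and a PLINK 2 build that supports `--pmerge-list`; the helper `build_merge_list` and the file name `pmerge_list.txt` are illustrative. Check the PLINK 2 merge documentation for the exact output naming in your build.

```shell
# Hypothetical helper: write one pfile prefix per line for every
# autosome (1-22) whose .pgen file exists.
build_merge_list() {
    prefix="$1"   # e.g. "cohort_chr" (assumed naming pattern)
    out="$2"      # text file that will hold the list of prefixes
    : > "${out}"
    for chr in $(seq 1 22); do
        if [ -f "${prefix}${chr}.pgen" ]; then
            echo "${prefix}${chr}" >> "${out}"
        fi
    done
    return 0
}

# Assumed values: adjust to your own file names
prefix="cohort_chr"
merge_list="pmerge_list.txt"
build_merge_list "${prefix}" "${merge_list}"

# Merge only if plink2 is on the PATH and at least one fileset was found
if command -v plink2 >/dev/null 2>&1 && [ -s "${merge_list}" ]; then
    plink2 --pmerge-list "${merge_list}" --make-pgen --out "${prefix}_merged"
fi
```

The merged prefix (here `cohort_chr_merged`) would then be the value assigned to `in` in the pruning step.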

In [None]:
## First, we will perform LD pruning with plink2
# Replace with the path and file name prefix of the plink2 files
in="path_and_file"
# Perform first pass of LD pruning
# MxGDAR-Fz1
plink2 --pfile "${in}" \
    --indep-pairwise 50 5 0.2 --geno 0.01 --mind 0.01 --maf 0.05 \
    --out "${in}_indep_snps"
# Keep only the pruned-in SNPs and write bed/bim/fam files for PC-AiR
plink2 --pfile "${in}" \
    --extract "${in}_indep_snps.prune.in" --make-bed \
    --out "${in}_forpcair"
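
Before moving on to PC-AiR, it can help to confirm that the pruning outputs exist and to see how many variants and samples they contain. The following is a minimal sketch that assumes the same `in` prefix as the commands above; `count_lines` is a hypothetical helper, not part of PLINK.

```shell
# Assumption: same prefix as used in the pruning commands
in="path_and_file"

# Hypothetical helper: print the number of lines in a file, or 0 if it is missing
count_lines() {
    if [ -f "$1" ]; then
        wc -l < "$1" | tr -d ' '
    else
        echo 0
    fi
}

echo "Variants kept by pruning:     $(count_lines "${in}_indep_snps.prune.in")"
echo "Variants removed by pruning:  $(count_lines "${in}_indep_snps.prune.out")"
echo "Variants in PC-AiR input:     $(count_lines "${in}_forpcair.bim")"
echo "Samples in PC-AiR input:      $(count_lines "${in}_forpcair.fam")"
```

The first and third counts should normally match. Because `--mind` appears only in the first command, the `.fam` count is expected to reflect all samples in the input fileset.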

### 2) Principal Components Calculation Using PC-AiR

**Description:**

This script computes the principal components (PCs) that will be used as covariates in the GWAS.  
The PCA is performed with PC-AiR, using the `pcair` function from the **GENESIS** package.

To download **GENESIS**, follow this link: [Access here](https://www.bioconductor.org/packages/release/bioc/html/GENESIS.html)

> **Note:** Principal Component Analysis requires prior LD pruning, which is performed in the first step of this file.

*Script written in R.*


In [None]:
# load libraries
library(GENESIS)
library(GWASTools)
library(SNPRelate)
library(SeqArray)
library(parallel)
library(BiocParallel)

############################################################################################
## !!! This is the only section that needs adjustment based on your data !!! ##
## Set parameters
# Change directory to the directory where the PLINK LD pruned files are stored
wkdir="path_plinkfiles/"
setwd(wkdir)
# Build the PLINK file names used as input for PCAiR
filename="cohortname_forpcair"
bed_f=paste0(filename,".bed")
bim_f=paste0(filename,".bim")
fam_f=paste0(filename,".fam")

# Create output files for a GDS file, keep the GDS extension
gds_output="cohortname_merged_forpcair.gds"

# Create a file name to store the PCA output, keep the CSV extension
outfile_name="cohortname_pcs32.csv"

# create gds output files
snpgdsBED2GDS(bed.fn=bed_f,
              bim.fn=bim_f,
              fam.fn=fam_f,
              family=TRUE,
              out.gdsfn=gds_output)
####################################################################################
# set a seed for reproducibility
set.seed(1000)
# detect the number of available cores
cores=detectCores()
####################################################################################
# Create functions
# Create a function for LD pruning
ld_prun = function(gds){
    snpset=snpgdsLDpruning(gds,
                           method="corr",
                           slide.max.bp=10e6,
                           ld.threshold=sqrt(0.1),
                           maf=0.01,
                           missing.rate=0.01,
                           verbose=TRUE,
                           num.thread=cores)
    # flatten the per-chromosome list into one vector of SNP IDs
    pruned=unlist(snpset, use.names=FALSE)
    return(pruned)
}

# Create a function for KING-robust analysis
king_mat = function(gds){
    samp.id=read.gdsn(index.gdsn(gds, "sample.id"))
    ibd.robust=snpgdsIBDKING(gds,
                             sample.id=samp.id,
                             family.id=NULL,
                             maf=0.01,
                             missing.rate=0.01,
                             num.thread=cores)
    return(ibd.robust)
}

# Create a function for PCAiR
# The KING matrix is passed as both the kinship (kinobj) and divergence (divobj) input
pcair_r = function(gds_geno, pruned, KINGmat){
    pca=pcair(gds_geno,
              snp.include=pruned,
              kinobj=KINGmat,
              divobj=KINGmat)
    return(pca)
}
####################################################################################
# run analysis for PCs with relationships
# open the GDS file created above
gds=snpgdsOpen(gds_output)
# LD pruning
pruned=ld_prun(gds)
# build KING matrix
KINGmat=king_mat(gds)
# extract the kinship matrix from the KING output
KINGmat_m=KINGmat$kinship
# add sample IDs as column and row names
colnames(KINGmat_m)=KINGmat$sample.id
rownames(KINGmat_m)=KINGmat$sample.id
# get samples in gds
gds_samples=read.gdsn(index.gdsn(gds, "sample.id"))
# close the gds object
snpgdsClose(gds)
####################################################################################
# Warning: review this step!!!!!
# read the GDS object
gds=GdsGenotypeReader(filename=gds_output)

# create a GenotypeData class object
gds_geno=GenotypeData(gds)

# run PCAiR
PCair=pcair_r(gds_geno, pruned, KINGmat_m)

# Create a file of PCs
PCair_df=as.data.frame(PCair$vectors)
# assign colnames based on the number of PCs returned, instead of hard-coding 32
colnames(PCair_df)=paste0("PC",seq_len(ncol(PCair_df)))
# add IID
PCair_df$IID=row.names(PCair_df)

# save file
write.csv(PCair_df,
          file=outfile_name,
          quote=FALSE,
          row.names=FALSE)
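
Once the CSV has been written, a quick check of its dimensions can catch problems before the file is used as GWAS covariates. This is a minimal sketch run from the shell, assuming the file name set in `outfile_name` above; `csv_dims` is a hypothetical helper.

```shell
# Assumption: same file name as outfile_name in the R script
pcs_file="cohortname_pcs32.csv"

# Hypothetical helper: print "<data rows> <columns>" for a
# comma-separated file that starts with a header line
csv_dims() {
    awk -F',' 'NR == 1 { ncol = NF } END { printf "%d %d\n", NR - 1, ncol }' "$1"
}

if [ -f "${pcs_file}" ]; then
    echo "data rows and columns: $(csv_dims "${pcs_file}")"
fi
```

The number of data rows should match the number of samples in the `.fam` file, and the number of columns should equal the number of PCs plus one for `IID`.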