<a href="https://colab.research.google.com/github/hmgu-itg/VolosSummerSchool/blob/master/VSS_2023/6_Workshop_Polygenic_Scores/6_Workshop_PGS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<center><h1>Exercise on Polygenic Scoring</h1>
<b>Human Genetics of complex traits</b><br><br>
<i><small>Ana Arruda (ana.arruda@helmholtz-muenchen.de) - Ozvan Bocher (ozvan.bocher@helmholtz-muenchen.de)</small></i>
</center>

# Summary
In this exercise, we will apply two polygenic scores to samples from the 1000 Genomes project: **a polygenic risk score for Coronary Artery Disease (CAD)** and a **polygenic score for levels of the MEP1B protein**. We will perform the following steps:
- Compute the two scores by hand in R and using Plink
- See how well these two scores predict the traits in question
- Study the influence of ethnicity, and adjust for it
- Examine the polygenicity of these two traits through a genome-wide association.

During this practical, we will mostly use R, but we will also use Plink, a very common tool for a wide range of genetic analyses.
There are some optional exercises in the practical that you might want to do if you progress quickly.
Several questions are asked throughout this notebook. The answers are provided in hidden cells, but please try to answer the questions by yourself to get the most out of this session.

Use `gc()` from time to time to free up memory.

# Downloading the data and installing libraries
We need the following data:


*   Polygenic risk score (PGS000337) for CAD
*   Polygenic score for MEP1B protein levels
*   Genetic data


All data is in the following location: https://www.dropbox.com/scl/fo/h0gwms5yhgd44od5fm11s/h?dl=0&rlkey=7ksbrzt4xyeoxkoq2kwrc4omu. We will download the data in the corresponding exercise and unzip it when necessary.

We will start by installing the R packages that we need: `R.utils` (this step takes a few minutes to run), `manqq` and `qqman`.
We will use the classical R functions to import and export files. However, be aware that genetic files can be large, which can take a lot of memory and computation time in R. The `data.table` library can help with this issue, and code showing how to use it is provided throughout this notebook.

In [None]:
install.packages('R.utils')
devtools::install_github("hmgu-itg/man_qq_annotate")
install.packages('qqman')

Now, let's install the `plink` software:

In [None]:
cat(system("wget https://s3.amazonaws.com/plink1-assets/plink_linux_x86_64_20210606.zip", intern=T), sep="\n")
cat(system("unzip plink_linux_x86_64_20210606.zip", intern=T), sep="\n")
cat(system("rm prettify toy.* LICENSE", intern=T), sep="\n")
cat(system('./plink --help', intern=T), sep="\n")

# **PGS**
## Importing scores
As previously mentioned, we will study two PGS, one for CAD and the other for the MEP1B protein levels. These scores were computed on large studies and are publicly available. We will start by importing the summary statistics for the variants used to construct the two scores.


**Exercise 1:**
- Import the data from the dropbox: https://www.dropbox.com/s/4083718iyo46nw2/liftedover.CAD.score?dl=1 for CAD and https://www.dropbox.com/s/o71gg622t2vnjqu/MEP1B.gilly.prs.txt?dl=1 for MEP1B. Use `read.table()` to do so.
- If needed, create an `id` column corresponding to chr`chromosome_number`:`position`
- For both traits, export a file containing the columns `id`, `effect_allele` and `effect_weight` (rename the columns of the file if necessary). These files should be named `CAD.score` and `MEP1B.score`.

**Question 1:** How many variants compose the two scores ?

In [None]:
#@title Scores importation
cadscore = read.table("https://www.dropbox.com/s/4083718iyo46nw2/liftedover.CAD.score?dl=1")
mepscore=read.table("https://www.dropbox.com/s/o71gg622t2vnjqu/MEP1B.gilly.prs.txt?dl=1", header = TRUE)
head(cadscore)
head(mepscore)

In [None]:
# use this tab and try to solve the task by yourself

In [None]:
#@title Get id column
colnames(cadscore) <- c("chr_name", "chr_position", "effect_allele", "effect_weight", "id_alleles", "id")
head(cadscore)
mepscore$id <- paste0("chr", mepscore$chr, ":", mepscore$pos)
head(mepscore)

In [None]:
# use this tab and try to solve the task by yourself

In [None]:
#@title Export the files
write.table(cadscore[,c("id", "effect_allele", "effect_weight")], "CAD.score", sep="\t", col.names=F, quote=F, row.names = F)
write.table(mepscore[,c("id", "A1", "effect")], "MEP1B.score", sep="\t", col.names=F, quote=F, row.names = F)

In [None]:
# use this tab and try to solve the task by yourself

In [None]:
#@title Answer question 1
cat("CAD PGS is composed of", nrow(cadscore), "variants\n")
cat("MEP1B PGS is composed of", nrow(mepscore), "variants\n")

In [None]:
# use this tab and try to solve the task by yourself

In [None]:
#@title **Optional** Alternative using the data.table library
library(data.table)  # provides fread() and fwrite()
cadscore = fread("https://www.dropbox.com/s/4083718iyo46nw2/liftedover.CAD.score?dl=1")
cadscore[,id:=paste0("chr", V1, ":", V2)]
cadscore = cadscore[, .(chr_name=V1, chr_position=V2, effect_allele=V3, effect_weight=V4, id)]
fwrite(cadscore[,c("id", "effect_allele", "effect_weight")], "CAD.score", sep="\t", col.names=F, quote=F)

#With MEP1B
mepscore=fread("https://www.dropbox.com/s/o71gg622t2vnjqu/MEP1B.gilly.prs.txt?dl=1")
mepscore[,id:=paste0("chr", chr, ":", pos)]
head(mepscore)
fwrite(mepscore[,c("id", "A1", "effect")], "MEP1B.score", sep="\t", col.names=F, quote=F)

## Applying scores
We will now apply the scores to 2504 individuals of the 1000 Genomes Project. From the lecture, you will remember that we need genotype data for these individuals as well as the weights contained in the score files above. We will first use R to manually compute these PGS and then Plink, a popular genetic toolbox installed in this environment.

# Method 1: Manually applying scores (R)

**Exercise 2**: We will start by loading the genetic file `autosomal.forPRS.mx.traw`.
- Get the data from https://www.dropbox.com/s/civmjfv89ou72cc/PRS.course.forscore.tar.gz using `system` and `wget`
- Untar the data
- Import the genotypes (file `autosomal.forPRS.mx.traw`) and the individual information (file `autosomal.forPRS.fam`) in R using `read.table`. The matrix of genotypes has positions as rows and samples as columns. The first 6 columns describe the chromosome position and alleles.
- Save the names of the samples present in the genotype matrix

**Question 2**: How many variants are present in the genotype file ? How many individuals are in the genotype matrix and in the fam file ?

In [None]:
#@title Download and untar data
cat(system('wget https://www.dropbox.com/s/civmjfv89ou72cc/PRS.course.forscore.tar.gz', intern=T), sep="\n")
cat(system('tar -xf PRS.course.forscore.tar.gz', intern=T), sep="\n")

In [None]:
# use this tab and try to solve the task by yourself

In [None]:
#@title Import genotypes and fam files
genos=read.table("autosomal.forPRS.mx.traw", header = TRUE)
famfile=read.table("autosomal.forPRS.fam")
head(genos)
head(famfile)

In [None]:
# use this tab and try to solve the task by yourself

In [None]:
#@title Save the names of the samples
spnames=colnames(genos)[7:ncol(genos)]
length(spnames)

In [None]:
# use this tab and try to solve the task by yourself

In [None]:
#@title Answer question 2
cat("There are", nrow(genos), "variants in the genotype matrix\n")
cat("There are", ncol(genos)-6, "individuals in the genotype matrix\n")
cat("There are", nrow(famfile), "individuals in the fam file\n")

In [None]:
# use this tab and try to solve the task by yourself

In [None]:
#@title **Optional** Alternative using the data.table library
genos=data.table::fread("autosomal.forPRS.mx.traw")
famfile=data.table::fread("autosomal.forPRS.fam")

**Exercise 3:** Apply the MEP1B score by:
- creating a genotype matrix restricted to the variants included in the MEP1B score
- applying an element-wise multiplication column by column, and summing the weighted genotypes:  $score_j=\sum_{i \in SNPs} w_i \times g_{ij}$  where  $i$  denotes SNPs,  $j$  denotes individuals and  $g_{ij}$  is the genotype of individual  $j$  at SNP  $i$. Hint: use the functions `colSums()` and `apply()`

Look at the distribution of the score with `hist`.

**Question 3:** What can you say about this distribution ?
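
The weighted sum in Exercise 3 can be sketched on a toy example. The genotype matrix, variant names and weights below are hypothetical and are not part of the workshop data; the sketch only assumes that variants are in rows, individuals in columns, and that the weights are in the same order as the rows.

```r
# Toy data (hypothetical): 3 variants (rows) x 4 individuals (columns),
# coded as the number of copies of the counted allele (0, 1 or 2)
toy.genos <- matrix(c(0, 1, 2, 1,
                      2, 0, 1, 1,
                      1, 1, 0, 2),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(c("chr1:100", "chr1:200", "chr2:300"),
                                    c("ind1", "ind2", "ind3", "ind4")))

# Hypothetical weights, in the same order as the rows of toy.genos
toy.weights <- c(0.5, -0.2, 0.1)

# Multiply each column (individual) by the weights, then sum over variants
toy.score <- colSums(apply(toy.genos, 2, function(g) g * toy.weights))

# The same computation written as a matrix product
toy.score.mx <- drop(toy.weights %*% toy.genos)

toy.score
```

Both versions give one score per individual; the matrix product is usually faster on large matrices, while the `apply()` version mirrors the formula more directly.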

In [None]:
#@title Apply the MEP1B score
mepg = subset(genos, SNP %in% mepscore$id)
head(mepg)
head(mepscore)

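# Align the weights with the rows of mepg before multiplying (assumes each id is unique in mepscore)
mep.weights <- mepscore$effect[match(mepg$SNP, mepscore$id)]
mepg.score <- colSums(apply(mepg[, spnames], 2, function(x) x*mep.weights))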

In [None]:
# use this tab and try to solve the task by yourself

In [None]:
#@title Distribution of MEP1B score
hist(mepg.score)
hist(mepg.score, breaks = 40)
#The distribution of the MEP1B score deviates markedly from a normal distribution.

In [None]:
# use this tab and try to solve the task by yourself

**Exercise 4:** CAD score
- Combine the information of the genotype matrix and the CAD scores using the `merge()` function in R

**Question 4:**
- Are all of the positions used in the CAD score present in the genotype matrix ?
- Is there a perfect match between the effect alleles of the scores and the ones in the genotype matrix ?

In [None]:
#@title Combine genotypes and CAD scores information
head(cadscore)
head(genos)
cadg=merge(cadscore, genos, by.x="id", by.y="SNP", sort = F)
nrow(cadg)

In [None]:
# use this tab and try to solve the task by yourself

In [None]:
#@title Answer question 4
dim(cadscore)
dim(cadg)
#Not all of the variants used for the CAD score are present in the genotype matrix

all(cadg$effect_allele == cadg$COUNTED)
head(cadg[,c("id", "effect_allele", "effect_weight", "COUNTED", "ALT")])
#Even when the positions are present, the alleles can differ. For example, the effect allele for position 1:1236037 in the score is T while it is C in the genotype data
#We will have to solve this issue before applying the CAD score

In [None]:
# use this tab and try to solve the task by yourself

**Exercise 5:** Apply the CAD score:
- If the effect alleles are not the same for the score and the genotype matrix, you need to flip the effect (= change sign of effect)
- Compute CAD score for each individual by applying an element-wise multiplication column by column as before
- Create a data frame `cadg.mepg.scores` with the following columns: `id`, `CAD.manual` and `MEP1B.manual`. It will be used later to compare the results with the ones from the `plink` software.

**Question 5:** What do you think about the distribution of the CAD score ?
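
To see why flipping the sign of the weight is a reasonable way to handle a mismatched effect allele, here is a minimal sketch on hypothetical values. It assumes genotypes are coded as the number of copies of the `COUNTED` allele, so that the dosage of the other allele is `2 - g`.

```r
# Hypothetical toy data: dosages of the COUNTED allele for 3 individuals
g <- c(0, 1, 2)
w <- 0.3                 # weight given for the ALT allele (hypothetical value)

exact   <- w * (2 - g)   # weight applied to the dosage of the ALT allele
flipped <- -w * g        # sign-flipped weight applied to the COUNTED allele

# The two versions differ by the same constant (2 * w) for every individual
exact - flipped
cor(exact, flipped)
```

In the absence of missing genotypes, the offset is identical for all individuals, so the ranking of individuals and any correlation with other scores are unaffected by using the flipped weight.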

In [None]:
#@title Apply the CAD score
# If the effect allele is the ALT allele, flip the sign of the effect
flip <- which(cadg$ALT == cadg$effect_allele)
cadg[flip, "effect_weight"] <- -cadg[flip, "effect_weight"]
head(cadg)
#Compute the score
cadg.score <- colSums(apply(cadg[,spnames], 2, function(x) x*cadg$effect_weight), na.rm = TRUE)

In [None]:
# use this tab and try to solve the task by yourself

In [None]:
#@title Combine the results
cadg.mepg.scores <- data.frame(id=names(cadg.score), CAD.manual=cadg.score)
cadg.mepg.scores$MEP1B.manual <- data.frame(id=names(mepg.score), MEP1B.manual=mepg.score)[cadg.mepg.scores$id, "MEP1B.manual"]
head(cadg.mepg.scores)


In [None]:
# use this tab and try to solve the task by yourself

In [None]:
#@title Answer question 5
hist(cadg.mepg.scores$CAD.manual, breaks = 40)
#The CAD score is approximately normally distributed, which is expected from a PGS of a highly polygenic trait

In [None]:
# use this tab and try to solve the task by yourself

In [None]:
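#@title **Optional** Alternative using the data.table library
# Assumes genos, cadscore and mepscore were imported with fread (i.e. are data.tables)
library(data.table)
#MEP1B
mepg=genos[SNP %in% mepscore$id]
mepg
mepg.score=colSums(mepg[,lapply(.SD, function(x) x*mepscore$effect), .SDcols=spnames])
#CAD
cadg=merge(cadscore, genos, by.x="id", by.y="SNP", sort=FALSE)
cadg[,eff:=effect_weight]
cadg[effect_allele==ALT,eff:=-1*eff]
# Keep only the variants whose effect allele matches one of the two alleles
cadg=cadg[effect_allele==ALT | effect_allele == COUNTED]
cadg.score=unlist(cadg[,lapply(.SD, function(x) sum(x*eff, na.rm=T)), .SDcols=spnames])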

In [None]:
rm(genos)  # to free up memory space
gc()  # to free up memory space

# Method 2: Applying scores using Plink
We will now use the plink software to compute the scores. To have more information on how to proceed, use:
`cat(system('./plink --help', intern=T), sep="\n")`

**Exercise 6:**
- Apply the scores to `autosomal.forPRS` using `CAD.score` and `MEP1B.score`
- Import the results `CAD.profile` and `MEP1B.profile` to `R`
- Merge all the scores with the previous dataframe
- Compute the correlations between the scores computed in `R` or in `plink` for CAD and MEP1B

**Question 6:** You should observe some differences for CAD, where do you think they come from ?

In [None]:
#@title Apply the scores using plink
cat(system('./plink --bfile autosomal.forPRS --score CAD.score --out CAD', intern=T), sep="\n")
cat(system('./plink --bfile autosomal.forPRS --score MEP1B.score --out MEP1B', intern=T), sep="\n")

In [None]:
# use this tab and try to solve the task by yourself

In [None]:
#@title Import the computed scores and combine with R-made scores
cadsc = read.table("CAD.profile", header = TRUE)
mepsc = read.table("MEP1B.profile", header = TRUE)
head(cadsc) ; head(mepsc)
cadsc$id <- paste0(cadsc$FID, "_", cadsc$IID)
mepsc$id <- paste0(mepsc$FID, "_", mepsc$IID)
allscores=merge(cadg.mepg.scores, cadsc[,c("id", "SCORE")], by="id", sort = FALSE)
colnames(allscores)[4] <- "CAD.Plink"
allscores=merge(allscores, mepsc[,c("id", "SCORE")], by="id")
colnames(allscores)[5] <- "MEP1B.Plink"
dim(allscores)
head(allscores)

In [None]:
# use this tab and try to solve the task by yourself

In [None]:
#@title Correlation between the scores
cor(allscores[,-1])

In [None]:
# use this tab and try to solve the task by yourself

In [None]:
#@title Question 6

#We see a slight discrepancy between the manual and Plink-generated CAD scores.
#This is likely due to our imperfect allele matching. PLINK is more conservative, as it only keeps variants whose effect alleles match.

In [None]:
# use this tab and try to solve the task by yourself

# Predictive accuracy of MEP1B levels and CAD events
We will now look at the predictive accuracy of the two scores by looking at the actual phenotypes of individuals for those two traits. This information is available for a cohort of 1000 individuals representing a subset of the 1000 Genomes project.

**Exercise 7**:
- Load the CAD (https://www.dropbox.com/s/xs7wsgij95w2uau/CAD.phenotype?dl=1) and MEP1B phenotypes (https://www.dropbox.com/s/5rmhjmv0d6oqxpr/MEP1B.phenotype?dl=1).
- Import the data in `R` and add the phenotype to the files gathering all scores
- Plot the plink scores against the true phenotypes. Hint: think about which kind of plot to use according to the data type you have (i.e. continuous or discrete)
- Compute the Pearson's correlation for MEP1B
- Get the top and bottom deciles of the distribution for CAD using the function `quantile()`.
- Compute the odds ratio for people in the top vs bottom deciles of the distribution for CAD. Remember that $OR=Odds_{Q2}/Odds_{Q1}$ with $Odds_{Q1}=P(Case \mid Q1)/P(Control \mid Q1)$, where $Q1$ is the bottom decile and $Q2$ is the top decile

**Question 7:** What can you say about the predictive accuracy of both scores?
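
As a worked illustration of the odds ratio formula, the sketch below uses hypothetical case and control counts (they are not taken from the workshop data) for the bottom and top deciles of a score.

```r
# Hypothetical counts, for illustration only (not the workshop data)
bottom <- c(control = 90, case = 10)   # bottom decile of the score (Q1)
top    <- c(control = 60, case = 40)   # top decile of the score (Q2)

# Odds of being a case within each decile
odds.bottom <- bottom["case"] / bottom["control"]
odds.top    <- top["case"] / top["control"]

# Odds ratio of the top versus the bottom decile
toy.OR <- unname(odds.top / odds.bottom)
toy.OR
```

With these counts the odds are 1/9 in the bottom decile and 2/3 in the top decile, so the odds ratio is 6.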

In [None]:
#@title Download phenotypes and import in R
rCAD=read.table("https://www.dropbox.com/s/xs7wsgij95w2uau/CAD.phenotype?dl=1")
rMEP=read.table("https://www.dropbox.com/s/5rmhjmv0d6oqxpr/MEP1B.phenotype?dl=1")
colnames(rCAD) <- c("id", "CADpheno")
colnames(rMEP) <- c("id", "MEP1Bpheno")
allscores.pheno=merge(allscores, rCAD, by="id")
allscores.pheno=merge(allscores.pheno, rMEP, by="id")
head(allscores.pheno)
dim(allscores.pheno)

In [None]:
# use this tab and try to solve the task by yourself

In [None]:
#@title Graphical representations
plot(allscores.pheno$MEP1Bpheno, allscores.pheno$MEP1B.Plink, col="cornflowerblue", pch=20, main = "MEP1B")
boxplot(allscores.pheno$CAD.Plink ~ allscores.pheno$CADpheno, col="cornflowerblue", pch=20, main = "CAD")

In [None]:
# use this tab and try to solve the task by yourself

In [None]:
#@title Correlation for MEP1B score
cor(allscores.pheno$MEP1Bpheno, allscores.pheno$MEP1B.Plink)

In [None]:
# use this tab and try to solve the task by yourself

In [None]:
#@title Calculate the top and bottom deciles of the distribution for CAD
cadq=quantile(allscores.pheno$CAD.Plink, c(0.1, 0.9))
cadq

In [None]:
# use this tab and try to solve the task by yourself

In [None]:
#@title OR CAD
#Compute the Odds for bottom decile
Q1=table(allscores.pheno[which(allscores.pheno$CAD.Plink<cadq[1]),]$CADpheno)
Q1
OddsQ1 = Q1[2]/Q1[1]

#Compute the Odds for top decile
Q2=table(allscores.pheno[which(allscores.pheno$CAD.Plink>cadq[2]),]$CADpheno)
Q2
OddsQ2 = Q2[2]/Q2[1]

#Compute the odds ratio
OR=OddsQ2/OddsQ1
OR

In [None]:
# use this tab and try to solve the task by yourself

In [None]:
#@title Question 7
# The correlation for MEP1B is fairly low, which indicates that this score has only modest predictive value.
# The CAD OR is high, which indicates that the corresponding score has good predictive value.

In [None]:
# use this tab and try to solve the task by yourself

# **PRS and Polygenicity**
Until now, we have applied two genetic risk scores and examined how well they predict actual phenotypes. We will now examine what these scores can tell us about the genetic architecture of these traits.

**Question 8**: How many variants are present in each score (see `Question 1`)? Does that correspond to what you know of the genetic architecture of both traits?

In [None]:
#@title Question 8
nrow(cadscore)
nrow(mepscore)
#Yes. MEP1B is a protein trait, so it is expected to be much less polygenic than CAD, which is a complex trait.

In [None]:
# use this tab and try to solve the task by yourself

**Exercise 9**: Perform a genome-wide association, using the CAD score profile as a phenotype.
- Create a properly formatted `.pheno` file for each score containing the columns `FID`, `IID`, `CAD.Plink`/`MEP1B.Plink` and write it to `CAD.pheno`/`MEP1B.pheno`
- Download and untar the genotype data (https://www.dropbox.com/s/yjqt5bl5xgrssyj/PRS.course.geno.tar.gz)
- Use PLINK's `--assoc` flag to run the association analysis for CAD. The genotype data is stored in `PRS.course.testset`
- Import the results in R, remove the variants with NA p-values and use `manqq::fastqq(P)` to visualise the results.

**Question 9:** What can you say about these association results? What can be the cause ?
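
Inflation of association results is often summarised by the genomic inflation factor lambda, commonly defined as the median of the observed 1-df chi-square statistics divided by the median expected under the null. The sketch below is a minimal illustration on hypothetical p-values; the helper `lambda.gc()` is defined here for the example and does not come from `manqq` or PLINK.

```r
# Genomic inflation factor from a vector of p-values (1-df chi-square tests)
# lambda.gc is a hypothetical helper written for this illustration
lambda.gc <- function(p) {
  chisq <- qchisq(p, df = 1, lower.tail = FALSE)
  median(chisq) / qchisq(0.5, df = 1)
}

# Hypothetical p-values evenly spread on (0, 1), as expected under the null
p.null <- ppoints(1000)
lambda.null <- lambda.gc(p.null)

# Hypothetical inflated test: every chi-square statistic is doubled
p.inflated <- pchisq(2 * qchisq(p.null, df = 1, lower.tail = FALSE),
                     df = 1, lower.tail = FALSE)
lambda.inflated <- lambda.gc(p.inflated)

c(null = lambda.null, inflated = lambda.inflated)
```

A value close to 1 is what we expect from a well-specified model, whereas values well above 1 suggest confounding such as population structure, or a very polygenic signal.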

In [None]:
#@title Create the pheno file for CAD and MEP1B
pheno.CAD = allscores[,c("id", "CAD.Plink")]
split.ID <- do.call(rbind,strsplit(pheno.CAD$id, split = "_"))
pheno.CAD$FID <- split.ID[,1]
pheno.CAD$IID <- split.ID[,2]
write.table(pheno.CAD[,c("FID", "IID", "CAD.Plink")], "CAD.pheno", sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)

#MEP1B
pheno.MEP1B = allscores[,c("id", "MEP1B.Plink")]
pheno.MEP1B$FID <- split.ID[,1]
pheno.MEP1B$IID <- split.ID[,2]
write.table(pheno.MEP1B[,c("FID", "IID", "MEP1B.Plink")], "MEP1B.pheno", sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)

In [None]:
# use this tab and try to solve the task by yourself

In [None]:
#@title Download and untar data
cat(system('wget https://www.dropbox.com/s/yjqt5bl5xgrssyj/PRS.course.geno.tar.gz', intern=T), sep="\n")
cat(system('tar -xf PRS.course.geno.tar.gz', intern=T), sep="\n")


In [None]:
# use this tab and try to solve the task by yourself

In [None]:
#@title Run the association analyses
cat(system('./plink --bfile PRS.course.testset --pheno CAD.pheno --allow-no-sex --assoc --out CAD 2>&1', intern=T), sep="\n")

In [None]:
# use this tab and try to solve the task by yourself

In [None]:
#@title Import and visualise the results
library(manqq)
outcad=read.table("CAD.qassoc", header = TRUE)
head(outcad)
outcad=subset(outcad, !is.na(outcad$P))
fastqq(outcad$P)

In [None]:
# use this tab and try to solve the task by yourself

In [None]:
#@title Question 9
#There is a large inflation of the association test.
#We can see it both with the QQ-plot and with the lambda value that is largely above 1.
#There is therefore a misspecification in our model.
#Here we work on the data from the 1000Genomes project.
#This cohort is multiethnic, so there is a strong chance that some score variants will be correlated with ethnicity

In [None]:
# use this tab and try to solve the task by yourself

In [None]:
#@title **Optional** Alternative using the data.table library
library(data.table)
# Convert to a data.table so that := can be used below
pheno=as.data.table(allscores)[,c("id", "CAD.Plink")]
pheno[,c("FID", "IID"):=tstrsplit(id, "_")]
pheno[,id:=NULL]
setcolorder(pheno, c(2,3,1))
fwrite(pheno, "CAD.pheno", sep="\t", quote=F, col.names=F)
#After running the association test
outcad=fread("CAD.qassoc")
outcad=outcad[!is.na(outcad$P)]
fastqq(outcad$P)

# **Population Structure**
We will check in the following steps if we have indeed a population structure in our data. If so, we will correct the association test for it and have another look at the results.

**NOTE** Exercise 10 is **optional**; you can move directly to Exercise 13 if you run out of time.

**Exercise 10:** Check for population structure in the data using principal component analysis (PCA). We first need to prune variants that are in linkage disequilibrium (LD pruning, loosely called clumping here) before running this PCA.
- Perform the pruning using PLINK's `--indep-pairwise` flag with windows of 200 variants, shifting the window by 50 variants at each step and using an $r^2$ threshold of 0.25
- Perform a PCA using the `--pca` flag on the pruned data by computing **10** principal components (use `--extract` to select SNPs in the data)


In [None]:
#@title Clumping
cat(system('./plink --bfile PRS.course.testset --indep-pairwise 200 50 0.25 --out PRS.course.testset.clumping', intern=T), sep="\n")

In [None]:
# use this tab and try to solve the task by yourself

In [None]:
#@title PCA
cat(system('./plink --bfile PRS.course.testset --extract PRS.course.testset.clumping.prune.in --pca 10 --out PRS.course.testset.pca', intern = T), sep = "\n")

In [None]:
# use this tab and try to solve the task by yourself

**Exercise 11:** PCA plot.
- Import the principal components (PCs) in R (file `.eigenvec`).

**Question 11:** Plot PC1 and PC2. Do you see any population structure ?

In [None]:
#@title Import PCs
PCs <- read.table("PRS.course.testset.pca.eigenvec")
# The first two columns of a PLINK .eigenvec file are FID and IID, in that order
colnames(PCs) <- c("FID", "IID", paste0("PC", 1:10))
head(PCs)

In [None]:
# use this tab and try to solve the task by yourself

In [None]:
#@title Question 11
#Plot PC1 and PC2
plot(PCs$PC1, PCs$PC2, pch = 20)

#We can see that there is a clear structure in the data as we do not have one homogeneous group of individuals.
#This is probably the cause of the inflation of our association test.

In [None]:
# use this tab and try to solve the task by yourself

**Exercise 12 (OPTIONAL):** Plot the ancestry of the individuals

- Download the file containing information about the individuals using `wget` (https://www.dropbox.com/s/ump64d155fqt8uz/igsr_samples_popinfo.txt) and import it in R
- Merge the data with the PCs
- Redo the plot with color depending on Superpopulation

**Question 12:** What do you observe ?

In [None]:
#@title Ancestry of individuals
cat(system('wget https://www.dropbox.com/s/ump64d155fqt8uz/igsr_samples_popinfo.txt', intern=T), sep="\n")
samples.info <- read.table("igsr_samples_popinfo.txt", header = T)
PCs.ancestry <- merge(PCs, samples.info, by.x = "IID", by.y = "Sample")
head(PCs.ancestry)

In [None]:
# use this tab and try to solve the task by yourself

In [None]:
#@title Colored PCA plot
PCs.ancestry$Superpopulation <- factor(PCs.ancestry$Superpopulation)
plot(PCs.ancestry$PC1, PCs.ancestry$PC2, pch = 20, col = PCs.ancestry$Superpopulation)
legend("bottomleft", legend = levels(PCs.ancestry$Superpopulation), pch = 20, col = 1:nlevels(PCs.ancestry$Superpopulation))

In [None]:
# use this tab and try to solve the task by yourself

**Exercise 13:**
- If you have skipped the previous exercises, download the pre-computed results of the principal component analysis (PCA) here: https://www.dropbox.com/s/letu9kh93ammlhh/PRS.course.testset.pca.eigenvec
- Run a GWAS adjusted for the principal components for the CAD and MEP1B scores using the `.eigenvec` file with PLINK's `--covar` flag (takes a few minutes to run)
- Import and check the validity of the results in `R`

**Question 13:** What do you conclude from these results ?

In [None]:
#@title Adjusted GWAS
cat(system('wget https://www.dropbox.com/s/letu9kh93ammlhh/PRS.course.testset.pca.eigenvec', intern=T), sep="\n")
cat(system('./plink --pheno CAD.pheno --bfile PRS.course.testset --linear hide-covar --covar PRS.course.testset.pca.eigenvec --out CAD.wPC --allow-no-sex 2>&1', intern=T), sep="\n")
cat(system('./plink --pheno MEP1B.pheno --bfile PRS.course.testset --linear hide-covar --covar PRS.course.testset.pca.eigenvec --out MEP1B.wPC --allow-no-sex 2>&1', intern=T), sep="\n")

In [None]:
# use this tab and try to solve the task by yourself

In [None]:
#@title Import and visualise the results
outcad=read.table("CAD.wPC.assoc.linear", header = TRUE)
head(outcad)
outcad=subset(outcad, !is.na(outcad$P))
fastqq(outcad$P)
outmep=read.table("MEP1B.wPC.assoc.linear", header = TRUE)
head(outmep)
outmep=subset(outmep, !is.na(outmep$P) & outmep$P>0)
fastqq(outmep$P)

In [None]:
# use this tab and try to solve the task by yourself

**Exercise 14:** Build the manhattan plot of the association tests using `manhattan` from `qqman`

**Question 14**: What can you deduce from the two manhattan plots about the architecture of these two traits?

In [None]:
#@title Manhattan plots
library(qqman)
manhattan(outcad, chr = "CHR", bp = "BP", snp="SNP", p="P", main = "CAD")
manhattan(outmep, chr = "CHR", bp = "BP", snp="SNP", p="P", main = "MEP1B")

In [None]:
# use this tab and try to solve the task by yourself

In [None]:
#@title Question 14
#We can see that MEP1B is driven by a very strong signal that is detectable at the current sample size (1000 individuals),
#whereas CAD is a much more complex trait for which no signal actually reaches the significance level.
#This hints at GWAS power: proteome association studies can detect effects even with hundreds of samples, whereas complex trait associations require large meta-analyses of hundreds of thousands of samples.

In [None]:
# use this tab and try to solve the task by yourself

In [None]:
#@title **Optional** Alternative using the data.table library
library(data.table)
library(manqq)
#CAD
cadpc=fread("CAD.wPC.assoc.linear")
cadpc=cadpc[!is.na(cadpc$P)]
fastqq(cadpc$P)
cadpc=cadpc[,c("CHR", "BP", "P")]
setnames(cadpc, c("chr", "pos", "p"))
forgetme=fastmanh(cadpc)

#MEP1B
meppc=fread("MEP1B.wPC.assoc.linear")
meppc=meppc[!is.na(meppc$P) & meppc$P>0]
fastqq(meppc$P)
meppc=meppc[,c("CHR", "BP", "P")]
setnames(meppc, c("chr", "pos", "p"))
forgetme=fastmanh(meppc, no_annot=T)

# Bonus exercise: lifting over the polygenic score
In this workshop, we have worked with a CAD score that had already been lifted over. The original CAD score was downloaded from the publicly available PGS Catalog. Variants in that score are identified by chromosome:position on build 37 (also called hg19) of the human reference genome. Our genetic data is on build 38, so we must first map these coordinates onto that build before applying the PGS to the genetic data. That process is called a liftover. We provided the score on the right build for the previous exercises, but in this optional exercise you can perform the liftover yourself.
There are R libraries that can do this, but they come from the Bioconductor project and can be cumbersome to use. We will instead use an external program called CrossMap.


**Step 1**:
Download the dictionary of positions from build 37 to 38 from the UCSC Liftover FTP website: https://hgdownload.cse.ucsc.edu/goldenpath/hg19/liftOver/hg19ToHg38.over.chain.gz

In [None]:
#@title Downloading the data
cat(system('
(rm *.chain.gz || echo downloading) && wget https://hgdownload.cse.ucsc.edu/goldenpath/hg19/liftOver/hg19ToHg38.over.chain.gz 2>&1
', intern=T), sep="\n")

In [None]:
# use this tab and try to solve the task by yourself

**Step 2:** Install CrossMap using `pip3 install` and look at the help page

In [None]:
#@title Install CrossMap
cat(system('pip3 install CrossMap', intern=T), sep="\n")
cat(system('CrossMap.py --help 2>&1', intern=T), sep="\n")

In [None]:
# use this tab and try to solve the task by yourself

As you can see, CrossMap supports multiple input formats. We are going to use `bed` as it is the easiest to use. This format is composed of 3 mandatory, tab-separated columns, and an arbitrary number of columns afterwards. The 3 columns are `chr`, `pos-1`, `pos`. We say that the second column is '0-based' while the third one is '1-based'.
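
The BED layout described above can be illustrated on a toy table. The variants below are hypothetical; the sketch only assumes that positions are given as 1-based coordinates of single-nucleotide variants.

```r
# Hypothetical variants with 1-based positions
toy <- data.frame(chr_name = c(1, 2),
                  chr_position = c(1000, 2500),
                  effect_allele = c("A", "T"))

# BED layout: chromosome, 0-based start (= position - 1), end (= position),
# followed by any extra columns we want to carry along
toy.bed <- data.frame(chr = toy$chr_name,
                      start = toy$chr_position - 1,
                      end = toy$chr_position,
                      effect_allele = toy$effect_allele)
toy.bed
```

Each single-nucleotide variant therefore spans exactly one base in BED coordinates, and the third column still holds the original position.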

**Step 3**:
- Download the CAD score from https://ftp.ebi.ac.uk/pub/databases/spot/pgs/scores/PGS000337/ScoringFiles/PGS000337.txt.gz
- Read it in R
- Make it compatible with the BED format.
- Export the file to `cad.bed` (without the column names)

In [None]:
#@title Download the score and make a bed file
cat(system('wget https://ftp.ebi.ac.uk/pub/databases/spot/pgs/scores/PGS000337/ScoringFiles/PGS000337.txt.gz', intern=T), sep="\n")
cadscore = read.table(gzfile("PGS000337.txt.gz"), header = TRUE)
cadscore$start <- cadscore$chr_position-1
cadscore <- cadscore[,c("chr_name", "start", "chr_position", "effect_allele", "effect_weight", "variant_description")]
write.table(cadscore, "cad.bed", col.names = F, row.names = F, quote = F, sep = "\t")

head(cadscore)

In [None]:
# use this tab and try to solve the task by yourself

**Step 4**: Use CrossMap bed to convert the positions in the CAD score from build 37 to build 38.

In [None]:
#@title Convert the positions
cat(system('CrossMap.py bed hg19ToHg38.over.chain.gz cad.bed cad.38.bed 2>&1', intern=T), sep="\n")

In [None]:
# use this tab and try to solve the task by yourself

**Step 5**:
- Read in the lifted over file
- Remove the second column (pos - 1),
- Remove any position that maps outside of the autosomes (chr1-chr22),
- Check that no variant maps to several positions on build 38.

In [None]:
#@title Import the lifted positions and keep only autosomes
liftedover = read.table("cad.38.bed")
colnames(liftedover) <- colnames(cadscore)
liftedover = liftedover[,-which(colnames(liftedover) == "start")]
dim(liftedover)
#Remove the positions not mapping to chromosomes 1 to 22
liftedover <- subset(liftedover, chr_name %in% 1:22)
dim(liftedover)

In [None]:
# use this tab and try to solve the task by yourself

In [None]:
#@title Do variants map to several positions ?
nb.positions <- table(liftedover$variant_description)
head(nb.positions)
table(nb.positions)
#All of the variants map only to one position --> ok

In [None]:
# use this tab and try to solve the task by yourself

**Step 6**: Create a tab-separated, headerless score file for your lifted over CAD score, with the following columns :

*   `id` which has the form chr1:1234
*   `effect_allele`
*   `effect_weight`

Export the file to `CAD.score`. This file corresponds to what we used in the first part of this workshop.

In [None]:
#@title Create the final file
cadscore=liftedover
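# Build ids of the form chr1:1234, as used in the genotype data
cadscore$id = paste0("chr", cadscore$chr_name, ":", cadscore$chr_position)
write.table(cadscore[,c("id", "effect_allele", "effect_weight")], "CAD.score", sep ="\t", col.names = F, row.names = F, quote = F)
# Write the full table to a separate file so that CAD.score is not overwritten
# (assumed to correspond to the liftedover.CAD.score file imported in Exercise 1)
write.table(cadscore, "liftedover.CAD.score", sep ="\t", col.names = F, row.names = F, quote = F)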

In [None]:
# use this tab and try to solve the task by yourself