# Task 6 Gene-wise statistics (MAGMA)

In gene analysis, genetic marker data is aggregated to the level of whole genes, testing the joint association of all markers in the gene with the phenotype. Similarly, in gene-set analysis individual genes are aggregated to groups of genes sharing certain biological, functional or other characteristics.

Both analyses are run with MAGMA [1]. The gene-set analysis is divided into two distinct and largely independent parts. In the first part, a gene analysis is performed to quantify the degree of association each gene has with the phenotype. In addition, the correlations between genes are estimated. These correlations reflect the LD between genes and are needed to compensate for the dependencies between genes during the gene-set analysis. The gene p-values and gene correlation matrix are then used in the second part to perform the actual gene-set analysis.


[1] de Leeuw CA, Mooij JM, Heskes T, Posthuma D. MAGMA: generalized gene-set analysis of GWAS data. PLoS computational biology. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4401657/. Published April 17, 2015. Accessed August 18, 2020.

In [1]:
%load_ext rpy2.ipython

In [2]:
import os

# Create the output directory for this task (no error if it already exists)
path = "/mnt/data/GWAS/output/build37/task6_genewise"
os.makedirs(path, exist_ok=True)

In [3]:
%env path=/mnt/data/GWAS/output/build37/task6_genewise
%env task3path=/mnt/data/GWAS/output/build37/task3_imputation/imputed_files
%env task4path=/mnt/data/GWAS/output/build37/task4_assoc

env: path=/mnt/data/GWAS/output/build37/task6_genewise
env: task3path=/mnt/data/GWAS/output/build37/task3_imputation/imputed_files
env: task4path=/mnt/data/GWAS/output/build37/task4_assoc


**Filter out low frequency SNPs (MAF<0.01) for MAGMA**

In [4]:
%%bash
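# Keep the header line (NR==1) and SNPs with 0.01 < FRQ < 0.99, then convert spaces to tabs.
# awk must read the annotated association file itself; piping an input-less awk into sed would skip the filter.
awk 'NR==1 || ($6>0.01 && $6<0.99)' $task4path/dataset.b37.imputed.assoc.dosage.clean.rs.200kb.annot | sed 's/ /\t/g' > $path/dataset.b37.imputed.assoc.dosage.clean.rs.200kb.annot.maf.0.01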
wc $path/dataset.b37.imputed.assoc.dosage.clean.rs.200kb.annot.maf.0.01
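
For readers who prefer to check the filter outside the shell, here is a minimal Python sketch of the same rule. It assumes whitespace-separated columns with the allele frequency (`FRQ`) in the sixth column, as in the file shown in the next cell. The function name `filter_maf` and its parameters are hypothetical, and unlike `sed 's/ /\t/g'` it collapses runs of whitespace into a single tab.

```python
def filter_maf(lines, frq_col=5, maf=0.01):
    """Yield the header plus rows whose allele frequency lies strictly
    between maf and 1 - maf, re-joined with tabs."""
    header_seen = False
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if not header_seen:
            header_seen = True
            yield "\t".join(fields)  # always keep the header row
            continue
        frq = float(fields[frq_col])
        if maf < frq < 1 - maf:
            yield "\t".join(fields)


example = [
    "CHR BP SNP A1 A2 FRQ",
    "10 100000625 10:100000625:A:G A G 0.4496",
    "10 100001867 10:100001867:C:T C T 0.0105",
    "10 1 hypothetical:rare A G 0.005",
    "10 2 hypothetical:common A G 0.995",
]
kept = list(filter_maf(example))
```

In this toy input, the two hypothetical rows with frequencies 0.005 and 0.995 are dropped and the header is preserved.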


In [5]:
%%bash
head $path/dataset.b37.imputed.assoc.dosage.clean.rs.200kb.annot.maf.0.01

CHR	BP	SNP	A1	A2	FRQ	INFO	OR	SE	P	RS	ANNOT
10	100000625	10:100000625:A:G	A	G	0.4496	0.9979	1.2232	0.2799	0.4717	rs7899632	HPS1(-175.3kb)|LOXL4(-6.817kb)|MIR1287(-154.3kb)|MIR4685(-190.4kb)|PYROXD2(-142.7kb)|R3HCC1L(0)
10	100000645	10:100000645:A:C	A	C	0.2238	0.9554	0.6617	0.3298	0.2105	rs61875309	HPS1(-175.3kb)|LOXL4(-6.797kb)|MIR1287(-154.3kb)|MIR4685(-190.4kb)|PYROXD2(-142.7kb)|R3HCC1L(0)
10	100001867	10:100001867:C:T	C	T	0.0105	0.9292	1.2906	1.0324	0.8048	rs150203744	HPS1(-174.1kb)|LOXL4(-5.575kb)|MIR1287(-153.1kb)|MIR4685(-189.2kb)|PYROXD2(-141.5kb)|R3HCC1L(0)
10	100002464	10:100002464:T:C	T	C	0.0112	0.9812	0.7513	1.0186	0.7789	rs111551711	HPS1(-173.5kb)|LOXL4(-4.978kb)|MIR1287(-152.5kb)|MIR4685(-188.6kb)|PYROXD2(-140.9kb)|R3HCC1L(0)
10	100003242	10:100003242:T:G	T	G	0.1401	1.0713	1.9542	0.3874	0.08371	rs12258651	HPS1(-172.7kb)|LOXL4(-4.2kb)|MIR1287(-151.7kb)|MIR4685(-187.8kb)|PYROXD2(-140.1kb)|R3HCC1L(0)
10	100003304	10:100003304:A:G	A	G	0.0362	0.9602	2.6659	0.7874	0.213	rs7282846

In [6]:
%%bash
# Generate the *.SNP.LOC (SNP, CHR, BP) and *.SNP.PVAL (SNP, P) input files for MAGMA

awk 'BEGIN{OFS="\t";print "SNP","CHR","BP"};{OFS="\t"; if(NR>1)  print $11,$1,$2}' $path/dataset.b37.imputed.assoc.dosage.clean.rs.200kb.annot.maf.0.01 > $path/dataset.b37.imputed.dosage.maf.0.01.SNP.LOC
awk 'BEGIN{OFS="\t";print "SNP","P"};{OFS="\t"; if(NR>1) print $11,$10}' $path/dataset.b37.imputed.assoc.dosage.clean.rs.200kb.annot.maf.0.01 | sed 's/e/E/g' > $path/dataset.b37.imputed.dosage.maf.0.01.SNP.PVAL

head -3 $path/dataset.b37.imputed.dosage.maf.0.01.SNP.LOC 
head -3  $path/dataset.b37.imputed.dosage.maf.0.01.SNP.PVAL



SNP	CHR	BP
rs7899632	10	100000625
rs61875309	10	100000645
SNP	P
rs7899632	0.4717
rs61875309	0.2105
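
The column selection done by the two `awk` commands can be sketched in Python as follows. This is only an illustration under the assumption that each data row has the 12 columns shown in the `head` output above (CHR, BP, ..., P in column 10, RS in column 11). The function name `split_loc_pval` is hypothetical.

```python
def split_loc_pval(rows):
    """Build the SNP-location and SNP-p-value tables from annotated rows.

    Each row is a list of column values without the header line."""
    loc = [("SNP", "CHR", "BP")]
    pval = [("SNP", "P")]
    for row in rows:
        chrom, bp, p, rsid = row[0], row[1], row[9], row[10]
        loc.append((rsid, chrom, bp))
        # Mirror the sed step: write exponents with an upper-case E
        pval.append((rsid, p.replace("e", "E")))
    return loc, pval


rows = [
    "10 100000625 10:100000625:A:G A G 0.4496 0.9979 1.2232 0.2799 0.4717 rs7899632 R3HCC1L(0)".split(),
    "10 100000645 10:100000645:A:C A C 0.2238 0.9554 0.6617 0.3298 0.2105 rs61875309 R3HCC1L(0)".split(),
]
loc, pval = split_loc_pval(rows)
```

Here `loc[1]` is `("rs7899632", "10", "100000625")` and `pval[1]` is `("rs7899632", "0.4717")`, matching the first data lines printed above.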


**The window changes from 200 kb to 50 kb here. Is this OK?**

In [7]:
%%bash
# Annotate
/usr/lib/magma/magma --annotate window=50,50 --snp-loc $path/dataset.b37.imputed.dosage.maf.0.01.SNP.LOC --gene-loc /mnt/data/GWAS/ref_files/NCBI37.3.gene.loc --out $path/dataset.b37.imputed.dosage.maf.0.01.LOC.50kb
head -3 $path/dataset.b37.imputed.dosage.maf.0.01.LOC.50kb.genes.annot

Welcome to MAGMA v1.06 (linux)
Using flags:
	--annotate window=50,50
	--snp-loc /mnt/data/GWAS/output/build37/task6_genewise/dataset.b37.imputed.dosage.maf.0.01.SNP.LOC
	--gene-loc /mnt/data/GWAS/ref_files/NCBI37.3.gene.loc
	--out /mnt/data/GWAS/output/build37/task6_genewise/dataset.b37.imputed.dosage.maf.0.01.LOC.50kb

Start time is 11:49:16, Sunday 14 Mar 2021

Starting annotation...
Reading gene locations from file /mnt/data/GWAS/ref_files/NCBI37.3.gene.loc... 
	adding window: 50000bp
	19427 gene locations read from file
	chromosome  1: 2016 genes
	chromosome  2: 1226 genes
	chromosome  3: 1050 genes
	chromosome  4: 745 genes
	chromosome  5: 856 genes
	chromosome  6: 1016 genes
	chromosome  7: 906 genes
	chromosome  8: 669 genes
	chromosome  9: 775 genes
	chromosome 10: 723 genes
	chromosome 11: 1275 genes
	chromosome 12: 1009 genes
	chromosome 13: 320 genes
	chromosome 14: 595 genes
	chromosome 15: 586 genes
	chromosome 16: 817 genes
	chromosome 17: 1147 genes
	chromosome 18: 271 g
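
To make the effect of `window=50,50` concrete, the following Python sketch assigns SNPs to genes using a symmetric window. It is a simplified illustration, not MAGMA's implementation: the function name `annotate_genes` is hypothetical, the boundaries are assumed to be inclusive, and the scan is a naive loop over all SNPs for each gene. The assumed window arithmetic is consistent with the merged table later in this notebook, where the START and STOP reported by the gene analysis for AKAP13 differ from the reference coordinates by exactly 50,000 bp.

```python
def annotate_genes(snps, genes, window_kb=50):
    """Map each gene ID to the SNPs that fall inside the gene plus a window.

    snps:  iterable of (snp_id, chrom, bp)
    genes: iterable of (gene_id, chrom, start, stop)
    """
    window = window_kb * 1000
    snps = list(snps)
    annot = {}
    for gene_id, chrom, start, stop in genes:
        lo, hi = start - window, stop + window
        annot[gene_id] = [
            snp_id for snp_id, snp_chrom, bp in snps
            if snp_chrom == chrom and lo <= bp <= hi  # inclusive bounds (assumption)
        ]
    return annot


# AKAP13 (Entrez ID 11214) reference coordinates from NCBI37.3.gene.loc
genes = [("11214", 15, 85923818, 86292589)]
snps = [
    ("snp_at_lower_edge", 15, 85873818),   # start - 50 kb
    ("snp_just_outside", 15, 85873817),
    ("snp_at_upper_edge", 15, 86342589),   # stop + 50 kb
    ("snp_other_chrom", 1, 86000000),
]
annot = annotate_genes(snps, genes)
```

With these hypothetical SNPs, only the two edge SNPs on chromosome 15 are assigned to the gene.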

In [8]:
%%bash
# Gene-level analysis with MAGMA, which computes gene-wise statistics taking into account physical distance and linkage disequilibrium (LD) between markers (de Leeuw et al. 2015).
# All SNPs that passed the MAF > 1% filter above are used, with a 50 kb window around each gene (windows typically range from 0 to 200 kb).
# N: number of individuals
for i in {1..22}
do
nohup /usr/lib/magma/magma  --batch $i chr --big-data --seed 1234 --genes-only --bfile /mnt/data/GWAS/output/build37/task2_QC/dataset.b37.QCed --gene-annot $path/dataset.b37.imputed.dosage.maf.0.01.LOC.50kb.genes.annot  --pval $path/dataset.b37.imputed.dosage.maf.0.01.SNP.PVAL  N=496 --gene-model multi --out $path/dataset.b37.imputed.dosage.maf.0.01.LOC.50kb.genes.annot.magma &
done

Welcome to MAGMA v1.06 (linux)
Using flags:
	--batch 1 chr
	--big-data
	--seed 1234
	--genes-only
	--bfile /mnt/data/GWAS/output/build37/task2_QC/dataset.b37.QCed
	--gene-annot /mnt/data/GWAS/output/build37/task6_genewise/dataset.b37.imputed.dosage.maf.0.01.LOC.50kb.genes.annot
	--pval /mnt/data/GWAS/output/build37/task6_genewise/dataset.b37.imputed.dosage.maf.0.01.SNP.PVAL N=496
	--gene-model multi
	--out /mnt/data/GWAS/output/build37/task6_genewise/dataset.b37.imputed.dosage.maf.0.01.LOC.50kb.genes.annot.magma

Start time is 11:49:29, Sunday 14 Mar 2021

Loading PLINK-format data...
Reading file /mnt/data/GWAS/output/build37/task2_QC/dataset.b37.QCed.fam... 496 individuals read
Reading file /mnt/data/GWAS/output/build37/task2_QC/dataset.b37.QCed.bim (chromosome 1 only)... Welcome to MAGMA v1.06 (linux)
Using flags:
	--batch 2 chr
	--big-data
	--seed 1234
	--genes-only
	--bfile /mnt/data/GWAS/output/build37/task2_QC/dataset.b37.QCed
	--gene-annot /mnt/data/GWAS/output/build37/task6_ge
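
Because the loop above starts all 22 batches in the background, the merge step should only run once every batch has finished. A hedged Python sketch of the same launch pattern with an explicit wait is shown below. It uses only the MAGMA flags that appear in the cell above. The helper names `gene_analysis_cmd` and `run_all` are hypothetical, and nothing is executed until `run_all` is called.

```python
import subprocess

MAGMA = "/usr/lib/magma/magma"  # path used in the cells above


def gene_analysis_cmd(chrom, bfile, gene_annot, pval, n, out):
    """Build the argument list for one per-chromosome gene-analysis batch."""
    return [
        MAGMA, "--batch", str(chrom), "chr",
        "--big-data", "--seed", "1234", "--genes-only",
        "--bfile", bfile,
        "--gene-annot", gene_annot,
        "--pval", pval, f"N={n}",
        "--gene-model", "multi",
        "--out", out,
    ]


def run_all(cmds):
    """Start every command, then block until all of them have exited.

    Returns the list of exit codes in the same order as cmds."""
    procs = [subprocess.Popen(cmd) for cmd in cmds]
    return [proc.wait() for proc in procs]


out_dir = "/mnt/data/GWAS/output/build37/task6_genewise"
prefix = out_dir + "/dataset.b37.imputed.dosage.maf.0.01"
cmds = [
    gene_analysis_cmd(
        chrom,
        bfile="/mnt/data/GWAS/output/build37/task2_QC/dataset.b37.QCed",
        gene_annot=prefix + ".LOC.50kb.genes.annot",
        pval=prefix + ".SNP.PVAL",
        n=496,
        out=prefix + ".LOC.50kb.genes.annot.magma",
    )
    for chrom in range(1, 23)
]
# exit_codes = run_all(cmds)  # uncomment to launch the 22 batches and wait
```

Checking that every exit code is zero before merging gives an early signal if a chromosome failed.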

In [9]:
%%bash
# Merge the per-chromosome batch results into a single set of gene-level results
magma --merge $path/dataset.b37.imputed.dosage.maf.0.01.LOC.50kb.genes.annot.magma --out $path/dataset.b37.imputed.dosage.maf.0.01.LOC.50kb.genes.annot.magma
head $path/dataset.b37.imputed.dosage.maf.0.01.LOC.50kb.genes.annot.magma.genes.out


Welcome to MAGMA v1.06 (linux)
Using flags:
	--merge /mnt/data/GWAS/output/build37/task6_genewise/dataset.b37.imputed.dosage.maf.0.01.LOC.50kb.genes.annot.magma
	--out /mnt/data/GWAS/output/build37/task6_genewise/dataset.b37.imputed.dosage.maf.0.01.LOC.50kb.genes.annot.magma

Start time is 12:23:27, Sunday 14 Mar 2021

Merging gene results files with prefix '/mnt/data/GWAS/output/build37/task6_genewise/dataset.b37.imputed.dosage.maf.0.01.LOC.50kb.genes.annot.magma'... 
Reading file /mnt/data/GWAS/output/build37/task6_genewise/dataset.b37.imputed.dosage.maf.0.01.LOC.50kb.genes.annot.magma.batch1_chr.genes.out... 
	2002 genes read from file
Reading file /mnt/data/GWAS/output/build37/task6_genewise/dataset.b37.imputed.dosage.maf.0.01.LOC.50kb.genes.annot.magma.batch2_chr.genes.out... 
	1220 genes read from file
Reading file /mnt/data/GWAS/output/build37/task6_genewise/dataset.b37.imputed.dosage.maf.0.01.LOC.50kb.genes.annot.magma.batch3_chr.genes.out... 
	1050 genes read from file
Reading

In [10]:
%%bash
sort -gk10,10 $path/dataset.b37.imputed.dosage.maf.0.01.LOC.50kb.genes.annot.magma.genes.out >$path/dataset.b37.imputed.dosage.maf.0.01.LOC.50kb.genes.annot.magma.genes.out.sorted
head $path/dataset.b37.imputed.dosage.maf.0.01.LOC.50kb.genes.annot.magma.genes.out.sorted


GENE       CHR      START       STOP  NSNPS  NPARAM    N        ZSTAT      P_JOINT  P_SNPWISE_MEAN  P_SNPWISE_TOP1
11214       15   85873818   86342589   1212      32  496       4.1297   1.8165e-05      9.2671e-06      0.00077629
64410       15   86252557   86388189    508      38  496       4.1298   1.8151e-05      2.5588e-05      0.00024736
343413       1  159720282  159836047    235      41  496       3.7093   0.00010392      6.7899e-05       0.0034542
134111       5    6387460    6546834    386      47  496       3.4649   0.00026519      6.8813e-05        0.012499
1053        14   23536515   23638820    147      28  496       3.4645   0.00026563      0.00010603       0.0079492
56833        1  159746440  159857282    237      37  496       3.7065   0.00010508      0.00010668       0.0035125
284677       1  159754264  159875544    260      33  496       3.5985   0.00016002      0.00012396       0.0035435
8698        19    3128250    3230335    395      50  496       2.8686    0.00206

In [11]:
%%R
# Annotate genes using the reference file NCBI37.3.gene.loc
magma<-read.table("/mnt/data/GWAS/output/build37/task6_genewise/dataset.b37.imputed.dosage.maf.0.01.LOC.50kb.genes.annot.magma.genes.out.sorted", header=TRUE)
genes<-read.table("/mnt/data/GWAS/ref_files/NCBI37.3.gene.loc")
colnames(genes) <- c("GENE","CHR","START","STOP","STRAND","HUGO")

magma_merged<-merge(magma,genes, by="GENE")
magma_merged <- magma_merged[order(magma_merged$P_SNPWISE_MEAN), ]
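# Rank genes by the SNP-wise mean p-value (column 10 of the merged table); ties share the lowest rank
magma_rank <- rank(magma_merged$P_SNPWISE_MEAN, na.last = "keep", ties.method = "min")
magma_ranked <- cbind(magma_rank, magma_merged)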

write.table(magma_ranked, "/mnt/data/GWAS/output/build37/task6_genewise/dataset.b37.imputed.dosage.maf.0.01.LOC.50kb.genes.annot.magma.genes.out.sorted.annot", quote=FALSE, sep="\t", row.names = FALSE)
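
The ranking rule used in the R cell above (`ties.method = "min"`, `na.last = "keep"`) can be sketched in plain Python as follows. The function name `min_rank` is hypothetical, and missing values are represented by `None` as an assumption of this sketch.

```python
def min_rank(values):
    """Rank values in ascending order; tied values share the smallest rank
    and missing values (None) stay missing."""
    ordered = sorted(v for v in values if v is not None)
    first_position = {}
    for position, value in enumerate(ordered, start=1):
        # Record only the first (lowest) position at which each value occurs
        first_position.setdefault(value, position)
    return [None if v is None else first_position[v] for v in values]


p_values = [0.5, 0.1, 0.1, None, 0.3]
ranks = min_rank(p_values)
```

The two tied p-values both receive rank 1 and the next distinct value receives rank 3, which is how ties appear in the `magma_rank` column.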


In [12]:
%%R
magma<-read.table("/mnt/data/GWAS/output/build37/task6_genewise/dataset.b37.imputed.dosage.maf.0.01.LOC.50kb.genes.annot.magma.genes.out.sorted", header=TRUE)
head(magma)

    GENE CHR     START      STOP NSNPS NPARAM   N  ZSTAT    P_JOINT
1  11214  15  85873818  86342589  1212     32 496 4.1297 1.8165e-05
2  64410  15  86252557  86388189   508     38 496 4.1298 1.8151e-05
3 343413   1 159720282 159836047   235     41 496 3.7093 1.0392e-04
4 134111   5   6387460   6546834   386     47 496 3.4649 2.6519e-04
5   1053  14  23536515  23638820   147     28 496 3.4645 2.6563e-04
6  56833   1 159746440 159857282   237     37 496 3.7065 1.0508e-04
  P_SNPWISE_MEAN P_SNPWISE_TOP1
1     9.2671e-06     0.00077629
2     2.5588e-05     0.00024736
3     6.7899e-05     0.00345420
4     6.8813e-05     0.01249900
5     1.0603e-04     0.00794920
6     1.0668e-04     0.00351250


In [13]:
%%R
genes<-read.table("/mnt/data/GWAS/ref_files/NCBI37.3.gene.loc")
colnames(genes) <- c("GENE","CHR","START","STOP","STRAND","HUGO")
head(genes)

       GENE CHR  START   STOP STRAND         HUGO
1     79501   1  69091  70008      +        OR4F5
2 100996442   1 142447 174392      - LOC100996442
3    729759   1 367659 368597      +       OR4F29
4     81399   1 621096 622034      -       OR4F16
5    148398   1 859993 879961      +       SAMD11
6     26155   1 879583 894679      -        NOC2L


In [14]:
%%bash
head /mnt/data/GWAS/output/build37/task6_genewise/dataset.b37.imputed.dosage.maf.0.01.LOC.50kb.genes.annot.magma.genes.out.sorted.annot
tail /mnt/data/GWAS/output/build37/task6_genewise/dataset.b37.imputed.dosage.maf.0.01.LOC.50kb.genes.annot.magma.genes.out.sorted.annot

magma_rank	GENE	CHR.x	START.x	STOP.x	NSNPS	NPARAM	N	ZSTAT	P_JOINT	P_SNPWISE_MEAN	P_SNPWISE_TOP1	CHR.y	START.y	STOP.y	STRAND	HUGO
1	11214	15	85873818	86342589	1212	32	496	4.1297	1.8165e-05	9.2671e-06	0.00077629	15	85923818	86292589	+	AKAP13
2	64410	15	86252557	86388189	508	38	496	4.1298	1.8151e-05	2.5588e-05	0.00024736	15	86302557	86338189	-	KLHL25
3	343413	1	159720282	159836047	235	41	496	3.7093	0.00010392	6.7899e-05	0.0034542	1	159770282	159786047	+	FCRL6
4	134111	5	6387460	6546834	386	47	496	3.4649	0.00026519	6.8813e-05	0.012499	5	6437460	6496834	+	UBE2QL1
5	1053	14	23536515	23638820	147	28	496	3.4645	0.00026563	0.00010603	0.0079492	14	23586515	23588820	-	CEBPE
6	56833	1	159746440	159857282	237	37	496	3.7065	0.00010508	0.00010668	0.0035125	1	159796440	159807282	+	SLAMF8
7	284677	1	159754264	159875544	260	33	496	3.5985	0.00016002	0.00012396	0.0035435	1	159804264	159825544	-	C1orf204
8	8698	19	3128250	3230335	395	50	496	2.8686	0.0020613	0.00017936	0.063752	19	3178250	3180335	+	S1PR4
9	