# Task 6 Gene-wise statistics (MAGMA)

In gene analysis, genetic marker data is aggregated to the level of whole genes, testing the joint association of all markers in the gene with the phenotype. Similarly, in gene-set analysis individual genes are aggregated to groups of genes sharing certain biological, functional or other characteristics.

This is done using MAGMA[1]. The gene-set analysis is divided into two distinct and largely independent parts. In the first part a gene analysis is performed to quantify the degree of association each gene has with the phenotype. In addition the correlations between genes are estimated. These correlations reflect the LD between genes, and are needed in order to compensate for the dependencies between genes during the gene-set analysis. The gene p-values and gene correlation matrix are then used in the second part to perform the actual gene-set analysis.


[1] de Leeuw CA, Mooij JM, Heskes T, Posthuma D. MAGMA: generalized gene-set analysis of GWAS data. PLoS computational biology. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4401657/. Published April 17, 2015. Accessed August 18, 2020.

In [19]:
%load_ext rpy2.ipython

The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython


In [9]:
import os

# Create directories for the output files
path="/mnt/data/GWAS/output/build38/task6_genewise"
if not os.path.exists(path):
    os.makedirs(path)

In [13]:
%env path=/mnt/data/GWAS/output/build38/task6_genewise
%env task4path= /mnt/data/GWAS/output/build38/task4_assoc

env: path=/mnt/data/GWAS/output/build38/task6_genewise
env: task4path=/mnt/data/GWAS/output/build38/task4_assoc


**Filter out low frequency SNPs (MAF<0.01 in this example) for MAGMA**

In [14]:
%%bash
awk '{OFS="\t"; if($6>0.01 && $6<0.99) print $0}' | sed 's/ /\t/g' $task4path/dataset.b38.imputed.assoc.dosage.clean.rs.200kb.annot > $path/dataset.b38.imputed.assoc.dosage.clean.rs.200kb.annot.maf.0.01
wc $path/dataset.b38.imputed.assoc.dosage.clean.rs.200kb.annot.maf.0.01


In [15]:
%%bash
wc $path/dataset.b38.imputed.assoc.dosage.clean.rs.200kb.annot.maf.0.01
wc $task4path/dataset.b38.imputed.assoc.dosage.clean.rs.200kb.annot

   8976887  107722644 1254477752 /mnt/data/GWAS/output/build38/task6_genewise/dataset.b38.imputed.assoc.dosage.clean.rs.200kb.annot.maf.0.01
   8976887  107722644 1254477752 /mnt/data/GWAS/output/build38/task4_assoc/dataset.b38.imputed.assoc.dosage.clean.rs.200kb.annot


In [16]:
%%bash
head $path/dataset.b38.imputed.assoc.dosage.clean.rs.200kb.annot.maf.0.01

CHR	BP	SNP	A1	A2	FRQ	INFO	OR	SE	P	RS	ANNOT
10	100000235	10:100000235:C:T	C	T	0.2982	0.7959	0.6679	0.3084	0.1906	rs11596870	ABCC2(+148kb)|CHUK(-188.1kb)|CPN1(-42.07kb)|DNMBP(0)|DNMBP-AS1(+41.24kb)|ERLIN1(-149.9kb)
10	100000943	10:100000943:G:A	G	A	0.0929	0.7945	1.2681	0.5058	0.6387	rs11190359	ABCC2(+148.8kb)|CHUK(-187.4kb)|CPN1(-41.36kb)|DNMBP(0)|DNMBP-AS1(+41.94kb)|ERLIN1(-149.1kb)
10	100000979	10:100000979:T:C	T	C	0.0504	0.7881	1.7337	0.7499	0.4631	rs11190360	ABCC2(+148.8kb)|CHUK(-187.4kb)|CPN1(-41.33kb)|DNMBP(0)|DNMBP-AS1(+41.98kb)|ERLIN1(-149.1kb)
10	100002012	10:100002012:T:C	T	C	0.0374	0.8215	1.0097	0.6786	0.9886	rs11190362	ABCC2(+149.8kb)|CHUK(-186.4kb)|CPN1(-40.3kb)|DNMBP(0)|DNMBP-AS1(+43.01kb)|ERLIN1(-148.1kb)
10	100002038	10:100002038:G:A	G	A	0.0102	0.8861	3.051	1.3529	0.4097	rs192480913	ABCC2(+149.8kb)|CHUK(-186.3kb)|CPN1(-40.27kb)|DNMBP(0)|DNMBP-AS1(+43.04kb)|ERLIN1(-148.1kb)
10	100002300	10:100002300:GA:G	GA	G	0.0374	0.8216	1.0082	0.6788	0.9904	rs111354488	ABCC2(+150.1kb)|C

In [17]:
%%bash
# Generate  *.SNP.LOC (SNP, CHR,BP) and *.SNP.VAL files

awk 'BEGIN{OFS="\t";print "SNP","CHR","BP"};{OFS="\t"; if(NR>1)  print $11,$1,$2}' $path/dataset.b38.imputed.assoc.dosage.clean.rs.200kb.annot.maf.0.01 > $path/dataset.b38.imputed.dosage.maf.0.01.SNP.LOC
awk 'BEGIN{OFS="\t";print "SNP","P"};{OFS="\t"; if(NR>1) print $11,$10}' $path/dataset.b38.imputed.assoc.dosage.clean.rs.200kb.annot.maf.0.01 | sed 's/e/E/g' > $path/dataset.b38.imputed.dosage.maf.0.01.SNP.PVAL

head -3 $path/dataset.b38.imputed.dosage.maf.0.01.SNP.LOC 
head -3  $path/dataset.b38.imputed.dosage.maf.0.01.SNP.PVAL



SNP	CHR	BP
rs11596870	10	100000235
rs11190359	10	100000943
SNP	P
rs11596870	0.1906
rs11190359	0.6387


In [21]:
%%bash
# Annotate SNPS to genes in 50Kb; this parameter is not fixed, usually ranging from 0 to 500Kb. You cn modify this distance with the window parameter
/usr/lib/magma/magma --annotate window=50,50 --snp-loc $path/dataset.b38.imputed.dosage.maf.0.01.SNP.LOC --gene-loc /mnt/data/GWAS/ref_files/build38/NCBI38.gene.loc --out $path/dataset.b38.imputed.dosage.maf.0.01.LOC.50kb
head -3 $path/dataset.b38.imputed.dosage.maf.0.01.LOC.50kb.genes.annot

Welcome to MAGMA v1.06 (linux)
Using flags:
	--annotate window=50,50
	--snp-loc /mnt/data/GWAS/output/build38/task6_genewise/dataset.b38.imputed.dosage.maf.0.01.SNP.LOC
	--gene-loc /mnt/data/GWAS/ref_files/build38/NCBI38.gene.loc
	--out /mnt/data/GWAS/output/build38/task6_genewise/dataset.b38.imputed.dosage.maf.0.01.LOC.50kb

Start time is 14:44:20, Saturday 06 Mar 2021

Starting annotation...
Reading gene locations from file /mnt/data/GWAS/ref_files/build38/NCBI38.gene.loc... 
	adding window: 50000bp
	20137 gene locations read from file
	chromosome  1: 2097 genes
	chromosome  2: 1285 genes
	chromosome  3: 1078 genes
	chromosome  4: 765 genes
	chromosome  5: 886 genes
	chromosome  6: 1053 genes
	chromosome  7: 942 genes
	chromosome  8: 691 genes
	chromosome  9: 803 genes
	chromosome 10: 748 genes
	chromosome 11: 1299 genes
	chromosome 12: 1032 genes
	chromosome 13: 341 genes
	chromosome 14: 613 genes
	chromosome 15: 616 genes
	chromosome 16: 865 genes
	chromosome 17: 1197 genes
	chromo

In [38]:
%%bash
#Gene level analysis is performed using MAGMA, which compute gene-wise statistics taking into account physical distance and linkage disequilibrium (LD) between markers (de Leeuw et al. 2015). 
# All SNPs with MAF above 1% are used in these analyses, setting a distance threshold of 50kb. N is the number of individuals
for i in {1..22}
do
nohup /usr/lib/magma/magma  --batch $i chr --big-data --seed 1234 --genes-only --bfile /mnt/data/GWAS/output/build38/task2.2_stratification/intermediate_datasets/dataset.b38.QCed --gene-annot $path/dataset.b38.imputed.dosage.maf.0.01.LOC.50kb.genes.annot  --pval $path/dataset.b38.imputed.dosage.maf.0.01.SNP.PVAL  N=495 --gene-model multi --out $path/dataset.b38.imputed.dosage.maf.0.01.LOC.50kb.genes.annot.magma &
done

Welcome to MAGMA v1.06 (linux)
Using flags:
	--batch 8 chr
	--big-data
	--seed 1234
	--genes-only
	--bfile /mnt/data/GWAS/output/build38/task2.2_stratification/intermediate_datasets/dataset.b38.QCed
	--gene-annot /mnt/data/GWAS/output/build38/task6_genewise/dataset.b38.imputed.dosage.maf.0.01.LOC.50kb.genes.annot
	--pval /mnt/data/GWAS/output/build38/task6_genewise/dataset.b38.imputed.dosage.maf.0.01.SNP.PVAL N=495
	--gene-model multi
	--out /mnt/data/GWAS/output/build38/task6_genewise/dataset.b38.imputed.dosage.maf.0.01.LOC.50kb.genes.annot.magma

Start time is 15:03:10, Saturday 06 Mar 2021

Welcome to MAGMA v1.06 (linux)
Using flags:
	--batch 1 chr
	--big-data
	--seed 1234
	--genes-only
	--bfile /mnt/data/GWAS/output/build38/task2.2_stratification/intermediate_datasets/dataset.b38.QCed
	--gene-annot /mnt/data/GWAS/output/build38/task6_genewise/dataset.b38.imputed.dosage.maf.0.01.LOC.50kb.genes.annot
	--pval /mnt/data/GWAS/output/build38/task6_genewise/dataset.b38.imputed.dosage.maf.

In [39]:
%%bash
#merge batches
#/usr/lib/magma/magma
magma --merge $path/dataset.b38.imputed.dosage.maf.0.01.LOC.50kb.genes.annot.magma --out $path/dataset.b38.imputed.dosage.maf.0.01.LOC.50kb.genes.annot.magma
head $path/dataset.b38.imputed.dosage.maf.0.01.LOC.50kb.genes.annot.magma.genes.out


Welcome to MAGMA v1.06 (linux)
Using flags:
	--merge /mnt/data/GWAS/output/build38/task6_genewise/dataset.b38.imputed.dosage.maf.0.01.LOC.50kb.genes.annot.magma
	--out /mnt/data/GWAS/output/build38/task6_genewise/dataset.b38.imputed.dosage.maf.0.01.LOC.50kb.genes.annot.magma

Start time is 15:04:15, Saturday 06 Mar 2021

Merging gene results files with prefix '/mnt/data/GWAS/output/build38/task6_genewise/dataset.b38.imputed.dosage.maf.0.01.LOC.50kb.genes.annot.magma'... 
Reading file /mnt/data/GWAS/output/build38/task6_genewise/dataset.b38.imputed.dosage.maf.0.01.LOC.50kb.genes.annot.magma.batch1_chr.genes.out... 
	1974 genes read from file
Reading file /mnt/data/GWAS/output/build38/task6_genewise/dataset.b38.imputed.dosage.maf.0.01.LOC.50kb.genes.annot.magma.batch2_chr.genes.out... 
	1235 genes read from file
Reading file /mnt/data/GWAS/output/build38/task6_genewise/dataset.b38.imputed.dosage.maf.0.01.LOC.50kb.genes.annot.magma.batch3_chr.genes.out... 
	1061 genes read from file
Readi

In [40]:
%%bash
sort -gk10,10 $path/dataset.b38.imputed.dosage.maf.0.01.LOC.50kb.genes.annot.magma.genes.out >$path/dataset.b38.imputed.dosage.maf.0.01.LOC.50kb.genes.annot.magma.genes.out.sorted
head $path/dataset.b38.imputed.dosage.maf.0.01.LOC.50kb.genes.annot.magma.genes.out.sorted


GENE       CHR      START       STOP  NSNPS  NPARAM    N        ZSTAT      P_JOINT  P_SNPWISE_MEAN  P_SNPWISE_TOP1
11214       15   85330616   85799358     20      13  495        3.813   6.8633e-05      1.8683e-05       0.0013226
134111       5    6387347    6546721     17      14  495       3.7768   7.9438e-05      5.6238e-05       0.0018005
26074       20   20002514   20410714     25      19  495       3.6031   0.00015722      7.9538e-05       0.0047978
57117        4  105632627  105758729      5       4  495       2.9638    0.0015192       0.0001333        0.037451
2197        11   65070627   65172200      1       1  495       3.5637    0.0001828       0.0001828       0.0001828
741         11   65066403   65167738      1       1  495       3.5637    0.0001828       0.0001828       0.0001828
30811       21   31823315   32054064     18      15  495       2.9599    0.0015389       0.0002843        0.049541
132884       4    5484397    5759548     26      21  495        3.242   0.000593

In [41]:
%%R
# Add HUGO gene name using the refernece file NCBI38.gene.loc linking oficial gene symbols  and GenBank Ids
magma<-read.table("/mnt/data/GWAS/output/build38/task6_genewise/dataset.b38.imputed.dosage.maf.0.01.LOC.50kb.genes.annot.magma.genes.out.sorted", header=TRUE)
genes<-read.table("/mnt/data/GWAS/ref_files/build38/NCBI38.gene.loc")
genes<-genes[-c(2:4)]
head(genes)
colnames(genes) <- c("GENE","STRAND","HUGO")

magma_merged<-merge(magma,genes, by="GENE")
magma_merged <- magma_merged[order(magma_merged$P_SNPWISE_MEAN), ]
magma_rank<-rank(magma_merged[,10],na.last = "keep", ties.method = "min")
magma_ranked=cbind(magma_rank, magma_merged)

write.table(magma_ranked, "/mnt/data/GWAS/output/build38/task6_genewise/dataset.b38.imputed.dosage.maf.0.01.LOC.50kb.genes.annot.magma.genes.out.sorted.annot", quote=FALSE, sep="\t", row.names = FALSE)


In [42]:
%%bash
head /mnt/data/GWAS/output/build38/task6_genewise/dataset.b38.imputed.dosage.maf.0.01.LOC.50kb.genes.annot.magma.genes.out.sorted.annot

magma_rank	GENE	CHR	START	STOP	NSNPS	NPARAM	N	ZSTAT	P_JOINT	P_SNPWISE_MEAN	P_SNPWISE_TOP1	STRAND	HUGO
1	11214	15	85330616	85799358	20	13	495	3.813	6.8633e-05	1.8683e-05	0.0013226	+	AKAP13
2	134111	5	6387347	6546721	17	14	495	3.7768	7.9438e-05	5.6238e-05	0.0018005	+	UBE2QL1
3	26074	20	20002514	20410714	25	19	495	3.6031	0.00015722	7.9538e-05	0.0047978	+	CFAP61
4	57117	4	105632627	105758729	5	4	495	2.9638	0.0015192	0.0001333	0.037451	-	INTS12
5	741	11	65066403	65167738	1	1	495	3.5637	0.0001828	0.0001828	0.0001828	-	ZNHIT2
5	2197	11	65070627	65172200	1	1	495	3.5637	0.0001828	0.0001828	0.0001828	-	FAU
7	30811	21	31823315	32054064	18	15	495	2.9599	0.0015389	0.0002843	0.049541	+	HUNK
8	132884	4	5484397	5759548	26	21	495	3.242	0.00059339	0.00031645	0.0069702	-	EVC2
9	51250	6	106978172	107101586	8	7	495	3.0124	0.0012961	0.00042762	0.012777	+	C6orf203
