# Task 4 Association analysis

Association analysis tests each genetic variant (SNP) for a statistical association with a trait.

To complete this task you need the genotype dosage files 'chr<i>.dose.rsq.0.3.DS.vcf.gz' (one per chromosome, as used in the PLINK command below), the fam file 'chr22.dose.for.assoc.fam' updated with phenotype and sex, and the covariates file 'covar_mds.txt' from Task 2.2 Stratification Analysis.
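
Before starting, it can help to confirm that these inputs are in place. The following is a minimal sketch, assuming the file names and directories used in the cells of this notebook; `missing_inputs` is a hypothetical helper, not part of the pipeline.

```python
from pathlib import Path


def missing_inputs(task3_dir, covar_file, chromosomes=range(1, 23)):
    """Return the expected input files that are not present on disk."""
    task3_dir = Path(task3_dir)
    # One dosage VCF per chromosome (file name pattern taken from the PLINK cell)
    expected = [task3_dir / f"chr{i}.dose.rsq.0.3.DS.vcf.gz" for i in chromosomes]
    # The single fam file shared by all chromosomes
    expected.append(task3_dir / "chr22.dose.for.assoc.fam")
    # MDS covariates from Task 2.2
    expected.append(Path(covar_file))
    return [str(p) for p in expected if not p.exists()]


if __name__ == "__main__":
    missing = missing_inputs(
        "/mnt/data/GWAS/output/build38/task3_imputation/Imputed_files",
        "/mnt/data/GWAS/output/build38/task2.2_stratification/covar_mds.txt",
    )
    print("All inputs found" if not missing else "Missing:\n" + "\n".join(missing))
```

An empty list means every expected file was found; otherwise the list names the files to regenerate in the earlier tasks.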

In [19]:
%load_ext rpy2.ipython

The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython


In [2]:
import os

# Create the output directory for this task (no error if it already exists)
path = "/mnt/data/GWAS/output/build38/task4_assoc"
os.makedirs(path, exist_ok=True)

In [10]:
# Set environment variables that hold the paths to the output directory and to the imputed files from Task 3
# It is recommended to send the output to the data volume (so that you don't fill up the home directory). You will be able to access it from your host machine
%env path= /mnt/data/GWAS/output/build38/task4_assoc
%env task3path= /mnt/data/GWAS/output/build38/task3_imputation/Imputed_files

env: path=/mnt/data/GWAS/output/build38/task4_assoc
env: task3path=/mnt/data/GWAS/output/build38/task3_imputation/Imputed_files


## Association analysis

In [11]:
%%bash
# Perform the association analysis with PLINK (Purcell et al. 2007).
# Genotype dosages are tested for association with AD case-control status using a regression model
# adjusted for sex and the 10 MDS dimensions as covariates.
for i in {1..22}
do
    plink --fam "$task3path/chr22.dose.for.assoc.fam" \
        --out "$path/chr$i.imputed.dosage" \
        --ci 0.95 \
        --covar /mnt/data/GWAS/output/build38/task2.2_stratification/covar_mds.txt --hide-covar \
        --dosage "$task3path/chr$i.dose.rsq.0.3.DS.vcf.gz" format=1 noheader
done

PLINK v1.90b3.45 64-bit (13 Jan 2017)      https://www.cog-genomics.org/plink2
(C) 2005-2017 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /mnt/data/GWAS/output/build38/task4_assoc/chr1.imputed.dosage.log.
Options in effect:
  --ci 0.95
  --covar /mnt/data/GWAS/output/task2.2_stratification/covar_mds.txt
  --dosage /mnt/data/GWAS/output/build38/task3_imputation/Imputed_files/chr1.dose.rsq.0.3.DS.vcf.gz format=1 noheader
  --fam /mnt/data/GWAS/output/build38/task3_imputation/Imputed_files/chr22.dose.for.assoc.fam
  --hide-covar
  --out /mnt/data/GWAS/output/build38/task4_assoc/chr1.imputed.dosage

Note: --hide-covar flag deprecated.  Use e.g. '--linear hide-covar'.
257659 MB RAM detected; reserving 128829 MB for main workspace.
495 people (237 males, 258 females) loaded from .fam.
495 phenotype values loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
--covar: 10 covariates loaded.
495 people pass filters and QC.
Among remaining pheno

In [12]:
%%bash
# Merge the per-chromosome results into a single file.
# 1. Strip the header line from chromosomes 2-22
for i in {2..22}
do
    awk 'NR>1' "$path/chr$i.imputed.dosage.assoc.dosage" > "$path/chr$i.imputed.dosage.assoc.dosage.nh"
done
# 2. Concatenate the header-less files
for i in {2..22}
do
    cat "$path/chr$i.imputed.dosage.assoc.dosage.nh"
done > "$path/chr2-22.imputed.dosage.assoc.dosage.nh"
# 3. Put chromosome 1 (which keeps its header) in front
cat "$path/chr1.imputed.dosage.assoc.dosage" "$path/chr2-22.imputed.dosage.assoc.dosage.nh" > "$path/dataset.b38.imputed.dosage.full.assoc.dosage"
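
The same merge (one header, then the data rows of every chromosome in order) can be sketched in Python without intermediate `.nh` files. This is an illustrative alternative under the assumption that every report starts with exactly one header line; `merge_reports` is a hypothetical helper.

```python
def merge_reports(paths, out_path):
    """Concatenate reports, keeping the header of the first file only.

    Returns the number of data rows written.
    """
    rows = 0
    with open(out_path, "w") as out:
        for n, path in enumerate(paths):
            with open(path) as fh:
                header = fh.readline()
                if n == 0:
                    out.write(header)  # header from the first file only
                for line in fh:
                    out.write(line)
                    rows += 1
    return rows


if __name__ == "__main__":
    base = "/mnt/data/GWAS/output/build38/task4_assoc"
    inputs = [f"{base}/chr{i}.imputed.dosage.assoc.dosage" for i in range(1, 23)]
    n = merge_reports(inputs, f"{base}/dataset.b38.imputed.dosage.full.assoc.dosage")
    print(n, "data rows written")
```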

In [33]:
%%bash
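# Clean the merged report and remove variants without a p-value:
#   - strip leading spaces, convert runs of spaces to tabs and drop the "chr" prefix
#   - keep only rows whose P column (field 8) is not NA
# The line counts before and after show how many variants were removed.
wc -l $path/dataset.b38.imputed.dosage.full.assoc.dosage
awk '{OFS="\t"; gsub("^ +","",$0); gsub(" +","\t",$0); gsub("chr","",$0); print $0}' $path/dataset.b38.imputed.dosage.full.assoc.dosage | awk '{if ($8!="NA") print $0}' > $path/dataset.b38.imputed.dosage.full.assoc.dosage.clean
wc -l $path/dataset.b38.imputed.dosage.full.assoc.dosage.clean
head $path/dataset.b38.imputed.dosage.full.assoc.dosage.clean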

20854453 /mnt/data/GWAS/output/build38/task4_assoc/dataset.b38.imputed.dosage.full.assoc.dosage
8908105 /mnt/data/GWAS/output/build38/task4_assoc/dataset.b38.imputed.dosage.full.assoc.dosage.clean
SNP	A1	A2	FRQ	INFO	OR	SE	P
1:710225:T:A	T	A	0.0558	0.6204	1.6517	0.7604	0.5093
1:714325:TAGA:T	TAGA	T	0.0129	0.4120	0.4997	2.2725	0.7601
1:716799:ATTT:A	ATTT	A	0.0156	0.2991	45.0793	1.7215	0.02695
1:722408:G:C	G	C	0.6782	0.3486	1.2899	0.4709	0.5887
1:722700:G:A	G	A	0.0222	0.5887	1.8617	1.1152	0.5774
1:727233:G:A	G	A	0.0162	0.6254	0.7898	1.4863	0.8738
1:727242:G:A	G	A	0.0858	0.4554	1.4763	0.7089	0.5827
1:727717:G:C	G	C	0.6976	0.3587	0.9375	0.5058	0.8985
1:729272:G:A	G	A	0.0181	0.4581	7.8951	1.1374	0.06928
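
The OR, SE and P columns shown above appear to be linked by a Wald test on the log-odds scale: dividing log(OR) by SE gives a z statistic whose two-sided normal p-value matches the reported P. This is an observation from the printed rows, not a statement about PLINK internals. The sketch below, using only the standard library, applies that assumption and also derives a 95% confidence interval, which the report shown does not list; `wald_from_or` is a hypothetical helper.

```python
import math


def wald_from_or(odds_ratio, se, z_crit=1.959964):
    """Return (z, two-sided p, CI lower, CI upper) for an odds ratio.

    Assumes `se` is the standard error of log(odds_ratio).
    z_crit is the normal quantile for a 95% interval.
    """
    beta = math.log(odds_ratio)
    z = beta / se
    # Two-sided p-value under the standard normal distribution
    p = math.erfc(abs(z) / math.sqrt(2))
    lower = math.exp(beta - z_crit * se)
    upper = math.exp(beta + z_crit * se)
    return z, p, lower, upper


if __name__ == "__main__":
    # First data row above: 1:710225:T:A, OR=1.6517, SE=0.7604, reported P=0.5093
    z, p, lower, upper = wald_from_or(1.6517, 0.7604)
    print(f"z={z:.3f} p={p:.4f} CI95=({lower:.3f}, {upper:.3f})")
```

Under this assumption the first row (OR=1.6517, SE=0.7604) gives a p-value that agrees with the reported 0.5093 to about three decimals, and the third row (OR=45.0793, SE=1.7215) agrees with the reported 0.02695.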


## Add rs ID

In [48]:
%%R
## Add rs ID
path <- "/mnt/data/GWAS/output/build38/task4_assoc"
# Load the cleaned association results (needed by the merge below)
assoc_results <- read.table(file.path(path, "dataset.b38.imputed.dosage.full.assoc.dosage.clean"), sep="\t", header=TRUE)
# Annotation table: V1 = chr:pos:ref:alt variant ID, V2 = rs ID
annot <- read.table("/mnt/data/GWAS/ref_files/build38/TopMed.snps.0.001.rs.txt", sep="\t", header=FALSE)
# Left join: keep every association result; variants without an rs ID get NA
merged <- merge(assoc_results, annot, by.x="SNP", by.y="V1", all.x=TRUE)
write.table(merged, file.path(path, "dataset.b38.imputed.dosage.full.assoc.dosage.clean.rs"), sep="\t", row.names=FALSE, quote=FALSE)



x
/mnt/data/GWAS/output/build38/task4_assoc/dataset.b38.imputed.dosage.full.assoc.dosage.clean.rs


In [59]:
%%R
head(merged)
write.table (merged,"/mnt/data/GWAS/output/build38/task4_assoc/dataset.b38.imputed.dosage.full.assoc.dosage.clean.rs", sep="\t", row.names=FALSE , quote=FALSE)
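
The `merge(..., all.x=TRUE)` call is a left join on the variant ID: every association row is kept, and rows without a match in the annotation table get `NA`. A minimal Python sketch of the same idea with a dictionary lookup is shown here; `add_rs_ids` is a hypothetical helper, and unlike R's `merge` it keeps a single rs ID per variant if the annotation lists several.

```python
def add_rs_ids(assoc_rows, rs_lookup, missing="NA"):
    """Append the rs ID for each row's variant ID (first column), or `missing`."""
    return [row + [rs_lookup.get(row[0], missing)] for row in assoc_rows]


if __name__ == "__main__":
    # Variant IDs and rs ID taken from the output shown in this notebook
    rs_lookup = {"10:100000235:C:T": "rs11596870"}
    rows = [
        ["10:100000235:C:T", "C", "T", "0.2982", "0.7959", "0.6679", "0.3084", "0.1906"],
        ["10:100002330:AAAAG:A", "AAAAG", "A", "0.0647", "0.8095", "1.3697", "0.6481", "0.6274"],
    ]
    for row in add_rs_ids(rows, rs_lookup):
        print("\t".join(row))
```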


In [60]:
%%bash
head $path/dataset.b38.imputed.dosage.full.assoc.dosage.clean.rs
# Split the chr:pos:ref:alt ID on ":" to obtain separate CHR and BP columns,
# rebuild the full SNP ID from its four parts, and rename the rs ID column (V2) to RS
sed 's/:/\t/g' $path/dataset.b38.imputed.dosage.full.assoc.dosage.clean.rs | awk 'BEGIN{OFS="\t"; print "CHR","BP","SNP","A1","A2","FRQ","INFO","OR","SE","P","RS"};{OFS="\t"; if (NR>1) print $1,$2,$1":"$2":"$3":"$4,$5,$6,$7,$8,$9,$10,$11,$12}' > $path/dataset.b38.imputed.dosage.full.assoc.dosage.clean.rs.chr.bp
head $path/dataset.b38.imputed.dosage.full.assoc.dosage.clean.rs.chr.bp

SNP	A1	A2	FRQ	INFO	OR	SE	P	V2
10:100000235:C:T	C	T	0.2982	0.7959	0.6679	0.3084	0.1906	rs11596870
10:100000943:G:A	G	A	0.0929	0.7945	1.2681	0.5058	0.6387	rs11190359
10:100000979:T:C	T	C	0.0504	0.7881	1.7337	0.7499	0.4631	rs11190360
10:100002012:T:C	T	C	0.0374	0.8215	1.0097	0.6786	0.9886	rs11190362
10:100002038:G:A	G	A	0.0102	0.8861	3.051	1.3529	0.4097	rs192480913
10:100002300:GA:G	GA	G	0.0374	0.8216	1.0082	0.6788	0.9904	rs111354488
10:100002330:AAAAG:A	AAAAG	A	0.0647	0.8095	1.3697	0.6481	0.6274	NA
10:100002628:A:C	A	C	0.4467	0.8605	0.7322	0.2868	0.2771	rs11190363
10:100002875:A:G	A	G	0.0648	0.8118	1.3222	0.6435	0.6642	rs7894103
CHR	BP	SNP	A1	A2	FRQ	INFO	OR	SE	P	RS
10	100000235	10:100000235:C:T	C	T	0.2982	0.7959	0.6679	0.3084	0.1906	rs11596870
10	100000943	10:100000943:G:A	G	A	0.0929	0.7945	1.2681	0.5058	0.6387	rs11190359
10	100000979	10:100000979:T:C	T	C	0.0504	0.7881	1.7337	0.7499	0.4631	rs11190360
10	100002012	10:100002012:T:C	T	C	0.0374	0.8215	1.0097	0.6786	0.9886	rs11190362
10	10000
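
The `sed`/`awk` step above derives the CHR and BP columns from the `chr:pos:ref:alt` variant ID. As a small illustration of that transformation on one row, assuming IDs always have exactly four colon-separated parts, a hypothetical `add_chr_bp` helper could look like this:

```python
def add_chr_bp(row):
    """Prefix a row with CHR and BP parsed from its chr:pos:ref:alt ID (first column)."""
    chrom, pos, _ref, _alt = row[0].split(":")
    return [chrom, pos] + row


if __name__ == "__main__":
    # Row taken from the output shown above
    row = ["10:100000235:C:T", "C", "T", "0.2982", "0.7959",
           "0.6679", "0.3084", "0.1906", "rs11596870"]
    print("\t".join(add_chr_bp(row)))
```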

## Add nearest genes

In [61]:
%%bash
# Annotate each variant with the genes in glist-hg38 that contain it or lie within 200 kb of it (--border is given in kb)
plink --annotate $path/dataset.b38.imputed.dosage.full.assoc.dosage.clean.rs.chr.bp ranges=/mnt/data/GWAS/ref_files/build38/glist-hg38 --border 200 --out $path/dataset.b38.imputed.dosage.full.assoc.dosage.clean.rs.200kb.annotated


PLINK v1.90b3.45 64-bit (13 Jan 2017)      https://www.cog-genomics.org/plink2
(C) 2005-2017 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /mnt/data/GWAS/output/build38/task4_assoc/dataset.b38.imputed.dosage.full.assoc.dosage.clean.rs.200kb.annotated.log.
Options in effect:
  --annotate /mnt/data/GWAS/output/build38/task4_assoc/dataset.b38.imputed.dosage.full.assoc.dosage.clean.rs.chr.bp ranges=/mnt/data/GWAS/ref_files/build38/glist-hg38
  --border 200
  --out /mnt/data/GWAS/output/build38/task4_assoc/dataset.b38.imputed.dosage.full.assoc.dosage.clean.rs.200kb.annotated

257659 MB RAM detected; reserving 128829 MB for main workspace.
--annotate ranges: 25804 annotations loaded from
/mnt/data/GWAS/ref_files/build38/glist-hg38 (counting multi-chromosome
annotations once per spanned chromosome).
--annotate: 7615467 out of 8976886 rows annotated; new report written to
/mnt/data/GWAS/output/build38/task4_assoc/dataset.b38.imputed.dosage.full.assoc.dosage.clean.
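
To make the annotation labels easier to read, the sketch below imitates the `GENE(0)` and `GENE(+62.72kb)` style seen in the annotated output: `0` for a variant inside a gene range, otherwise the distance to the nearest gene boundary within the border. It is a rough illustration only. The sign convention (`+` when the variant lies past the gene end, `-` when it lies before the gene start) is an assumption, and the gene names and coordinates are made up.

```python
def annotate(pos, genes, border_kb=200):
    """Build a label for one variant position.

    genes: list of (name, start, end) on the same chromosome as `pos`.
    Sign convention is an assumption, not taken from the PLINK documentation.
    """
    border = border_kb * 1000
    labels = []
    for name, start, end in sorted(genes):
        if start <= pos <= end:
            labels.append(f"{name}(0)")
        elif end < pos <= end + border:
            labels.append(f"{name}(+{(pos - end) / 1000:.4g}kb)")
        elif start - border <= pos < start:
            labels.append(f"{name}(-{(start - pos) / 1000:.4g}kb)")
    return "|".join(labels)


if __name__ == "__main__":
    # Hypothetical gene ranges, for illustration only
    genes = [("GENE_A", 1_000_000, 1_050_000),
             ("GENE_B", 1_100_000, 1_200_000),
             ("GENE_C", 2_000_000, 2_100_000)]
    print(annotate(1_060_000, genes))
    print(annotate(1_150_000, genes))
```

A variant with no gene inside the border yields an empty string here, which corresponds to the rows that PLINK reports as not annotated.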

In [62]:
%%bash
# Keep variants with P < 1e-4, sorted by P (column 10), and turn the space before the ANNOT column into a tab
sort -gk 10,10 $path/dataset.b38.imputed.dosage.full.assoc.dosage.clean.rs.200kb.annotated.annot | awk '{OFS="\t"; if($10<0.0001) print $0}' | sed 's/ /\t/g' > $path/temp
# Add the header; columns 9 and 10 are SE and P, in that order
awk 'BEGIN {OFS="\t"; print "CHR","BP","SNP","A1","A2","FRQ","INFO","OR","SE","P","RS","ANNOT"};{OFS="\t"; print $0}' $path/temp > $path/dataset.b38.imputed.dosage.assoc.200kb.annot.tops
wc $path/dataset.b38.imputed.dosage.assoc.200kb.annot.tops
head $path/dataset.b38.imputed.dosage.assoc.200kb.annot.tops
rm $path/temp


  555  6660 93550 /mnt/data/GWAS/output/build38/task4_assoc/dataset.b38.imputed.dosage.assoc.200kb.annot.tops
CHR	BP	SNP	A1	A2	FRQ	INFO	OR	SE	P	RS	ANNOT
2	62287452	2:62287452:C:G	C	G	0.4079	0.9443	5.2055	0.3446	1.691e-06	rs12614194	B3GNT2(+62.72kb)|COMMD1(+151.4kb)|MIR5192(+81.54kb)
10	77350874	10:77350874:C:T	C	T	0.03	0.876	0.0062	1.0741	2.268e-06	rs144195574	KCNMA1(0)
10	77360086	10:77360086:G:C	G	C	0.0295	0.8607	0.0056	1.107	2.765e-06	rs118093470	KCNMA1(0)
1	161424710	1:161424710:T:C	T	C	0.3682	0.9168	5.6367	0.3802	5.412e-06	rs3922744	C1orf192(+56.83kb)|FCGR2A(-80.7kb)|FCGR2C(-156.6kb)|FCGR3A(-117kb)|FCGR3B(-198.5kb)|HSPA6(-99.54kb)|HSPA7(-181.3kb)|MIR5187(+197.4kb)|MPZ(+114.7kb)|NR1I3(+186.5kb)|PCP4L1(+139.3kb)|SDHC(+59.96kb)|TOMM40L(+194kb)
1	161421832	1:161421832:G:T	G	T	0.3681	0.9169	5.6362	0.3802	5.413e-06	rs4657021	APOA2(+198.2kb)|C1orf192(+53.95kb)|FCGR2A(-83.58kb)|FCGR2C(-159.5kb)|FCGR3A(-119.9kb)|HSPA6(-102.4kb)|HSPA7(-184.2kb)|MIR5187(+194.6kb)|MPZ(+111.9kb)|NR1I3(+183.6kb
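
The top-hit selection above keeps rows with P below 1e-4 and sorts them by P, where P is the tenth column of the annotated file (the `-k 10,10` sort key). A minimal Python sketch of the same filter is given here; `top_hits` is a hypothetical helper that skips rows whose P value is not numeric, such as a header line.

```python
def top_hits(rows, p_index=9, threshold=1e-4):
    """Return rows with P < threshold, sorted by ascending P.

    p_index is the zero-based position of the P column (tenth column).
    """
    hits = []
    for row in rows:
        try:
            p = float(row[p_index])
        except ValueError:
            continue  # header or non-numeric P
        if p < threshold:
            hits.append((p, row))
    return [row for p, row in sorted(hits, key=lambda item: item[0])]


if __name__ == "__main__":
    # Values taken from rows shown in this notebook (annotation column omitted)
    rows = [
        ["CHR", "BP", "SNP", "A1", "A2", "FRQ", "INFO", "OR", "SE", "P", "RS"],
        ["10", "77350874", "10:77350874:C:T", "C", "T", "0.03", "0.876", "0.0062", "1.0741", "2.268e-06", "rs144195574"],
        ["10", "100000235", "10:100000235:C:T", "C", "T", "0.2982", "0.7959", "0.6679", "0.3084", "0.1906", "rs11596870"],
        ["2", "62287452", "2:62287452:C:G", "C", "G", "0.4079", "0.9443", "5.2055", "0.3446", "1.691e-06", "rs12614194"],
    ]
    for row in top_hits(rows):
        print("\t".join(row))
```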

In [63]:
%%bash
cp $path/dataset.b38.imputed.dosage.full.assoc.dosage.clean.rs.200kb.annotated.annot $path/dataset.b38.imputed.assoc.dosage.clean.rs.200kb.annot


In [64]:
%%bash
head $path/dataset.b38.imputed.assoc.dosage.clean.rs.200kb.annot

CHR	BP	SNP	A1	A2	FRQ	INFO	OR	SE	P	RS ANNOT
10	100000235	10:100000235:C:T	C	T	0.2982	0.7959	0.6679	0.3084	0.1906	rs11596870 ABCC2(+148kb)|CHUK(-188.1kb)|CPN1(-42.07kb)|DNMBP(0)|DNMBP-AS1(+41.24kb)|ERLIN1(-149.9kb)
10	100000943	10:100000943:G:A	G	A	0.0929	0.7945	1.2681	0.5058	0.6387	rs11190359 ABCC2(+148.8kb)|CHUK(-187.4kb)|CPN1(-41.36kb)|DNMBP(0)|DNMBP-AS1(+41.94kb)|ERLIN1(-149.1kb)
10	100000979	10:100000979:T:C	T	C	0.0504	0.7881	1.7337	0.7499	0.4631	rs11190360 ABCC2(+148.8kb)|CHUK(-187.4kb)|CPN1(-41.33kb)|DNMBP(0)|DNMBP-AS1(+41.98kb)|ERLIN1(-149.1kb)
10	100002012	10:100002012:T:C	T	C	0.0374	0.8215	1.0097	0.6786	0.9886	rs11190362 ABCC2(+149.8kb)|CHUK(-186.4kb)|CPN1(-40.3kb)|DNMBP(0)|DNMBP-AS1(+43.01kb)|ERLIN1(-148.1kb)
10	100002038	10:100002038:G:A	G	A	0.0102	0.8861	3.051	1.3529	0.4097	rs192480913 ABCC2(+149.8kb)|CHUK(-186.3kb)|CPN1(-40.27kb)|DNMBP(0)|DNMBP-AS1(+43.04kb)|ERLIN1(-148.1kb)
10	100002300	10:100002300:GA:G	GA	G	0.0374	0.8216	1.0082	0.6788	0.9904	rs111354488 ABCC2(+150.1kb)|C

**For the next step you need the following file:**
- dataset.b38.imputed.assoc.dosage.clean.rs.200kb.annot