**Purpose:** Update https://www.synapse.org/#!Synapse:syn12514826 with new AD risk genes as identified by GWAS. 

This notebook ingests an Excel file of identified GWAS genes (https://adsp.niagads.org/index.php/gvc-top-hits-list/), queries Ensembl BioMart for the Ensembl IDs of these genes, and writes the result to a CSV file.

In [1]:
import pandas as pd # Requires install of package "openpyxl" for read_excel
import requests
from io import StringIO

**The list of AD risk genes identified in GWAS studies** was downloaded as an Excel file from here: https://adsp.niagads.org/index.php/gvc-top-hits-list/

The file contains 2 sheets:

    table1 = Table 1: List of AD Loci with Genetic Evidence Compiled by ADSP Gene Verification Committee
    table2 = Table 2: AD risk/protective causal genes
    
We want the genes from both tables. 

In [2]:
gwas = pd.read_excel("../input/gwas_gvc_compiled_list.xlsx", sheet_name=[0,1], skiprows=1)
print(gwas[0].shape)
print(gwas[1].shape)

(76, 5)
(20, 4)


Concatenate the tables into one data frame.

In [3]:
gwas[0] = gwas[0].rename(columns={"Reported Gene/ Closest gene": "Gene"})
gwas_df = pd.concat(gwas, axis = 0)
print(gwas_df.shape)
gwas_df.head()

(96, 6)


Unnamed: 0,Unnamed: 1,Number,Chr,Location (hg38),SNV,Gene,Source
0,0,1,1.0,109345810,rs141749679,SORT1,
0,1,2,1.0,207577223,rs679515,CR1,
0,2,3,2.0,9558882,rs72777026,ADAM17,
0,3,4,2.0,37304796,rs17020490,PRKD3,
0,4,5,2.0,105749599,rs143080277,NCK2,
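
Because `pd.read_excel` with a list of sheets returns a dict keyed by sheet index, `pd.concat` uses those keys as an outer index level, which is why the preview shows two unnamed index columns. A minimal sketch of that behavior, using made-up placeholder frames rather than the GWAS data, and one way to flatten the index if a single running index is preferred:

```python
import pandas as pd

# Toy stand-ins for the two sheets; all values are placeholders for illustration only
sheets = {
    0: pd.DataFrame({"Gene": ["GENE_A", "GENE_B"], "Chr": [1, 2]}),
    1: pd.DataFrame({"Gene": ["GENE_C"], "Source": ["example"]}),
}

# Concatenating a dict stacks the frames and uses the dict keys as an outer index level
stacked = pd.concat(sheets, axis=0)
print(stacked.index.nlevels)  # 2

# Optional: discard both index levels to get a plain 0..n-1 index
flat = stacked.reset_index(drop=True)
```

Columns that exist in only one sheet (here `Chr` and `Source`) are filled with NaN for rows from the other sheet, which matches the empty `Source` values in the preview.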


Query Ensembl BioMart for the Ensembl IDs that match the gene symbols in this table. Normally I like to use the pybiomart library for queries, but a bug in the library prevents filtering on external_gene_name, so we build the request manually. See http://uswest.ensembl.org/info/data/biomart/biomart_restful.html

In [4]:
attributes = ['ensembl_gene_id', 'external_gene_name']
filters = {'external_gene_name': set(gwas_df['Gene'])}

# Build the BioMart XML query by hand
query = '<Query  virtualSchemaName = "default" formatter = "TSV" header = "1" uniqueRows = "0" count = "" datasetConfigVersion = "0.6" >'
query += '<Dataset name = "hsapiens_gene_ensembl" interface = "default" >'

# One <Filter> per filter name; its values are passed as a comma-separated list
for filter_name, values in filters.items():
    query += '<Filter name = "' + filter_name + '" value = "' + ",".join(values) + '"/>'

# One <Attribute> per requested output column
for attr in attributes:
    query += '<Attribute name = "' + attr + '" />'

query += '</Dataset>'
query += '</Query>'

response = requests.get(url = 'http://www.ensembl.org/biomart/martservice', params = {'query': query})
response.raise_for_status()  # Fail early on an HTTP error instead of parsing an error page

# BioMart returns TSV with display-name headers; rename them to the column names used downstream
result = pd.read_csv(StringIO(response.text), sep = "\t")
result = result.rename(columns = {'Gene stable ID': 'ensembl_gene_id', 'Gene name': 'hgnc_symbol'})
result

Unnamed: 0,ensembl_gene_id,hgnc_symbol
0,ENSG00000277751,LILRB2
1,ENSG00000277641,WNT3
2,ENSG00000275463,LILRB2
3,ENSG00000274513,LILRB2
4,ENSG00000276021,WDR81
...,...,...
96,ENSG00000138442,WDR12
97,ENSG00000005379,TSPOAP1
98,ENSG00000151694,ADAM17
99,ENSG00000134243,SORT1
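
The hand-built XML can also be assembled with the standard library's `xml.etree.ElementTree`, which takes care of attribute quoting and escaping. This is only a sketch: `build_biomart_query` is a hypothetical helper, not part of any BioMart client, and it simply mirrors the attribute values used in the query cell.

```python
import xml.etree.ElementTree as ET

def build_biomart_query(dataset, filters, attributes):
    """Return a BioMart XML query string (hypothetical helper for illustration)."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "1",
        "uniqueRows": "0",
        "count": "",
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    # One <Filter> per filter name, values joined into a comma-separated list
    for name, values in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": ",".join(sorted(values))})
    # One <Attribute> per requested output column
    for attr in attributes:
        ET.SubElement(ds, "Attribute", {"name": attr})
    return ET.tostring(query, encoding="unicode")

query = build_biomart_query(
    "hsapiens_gene_ensembl",
    {"external_gene_name": {"SORT1", "CR1"}},
    ["ensembl_gene_id", "external_gene_name"],
)
```

The resulting string can be passed as `params = {'query': query}` to the same `requests.get` call.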


Check: The output should contain every gene in the GWAS input. 

In [5]:
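gwas_genes = set(gwas_df['Gene'])
found_genes = set(result['hgnc_symbol'])
print(len(gwas_genes))                # Number of unique input gene symbols
print(len(gwas_genes & found_genes))  # Number of input symbols returned by BioMart
print(gwas_genes <= found_genes)      # True if every input symbol was returned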

86
86
True
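
If this check ever prints `False`, it helps to see which symbols were not returned. A small sketch using a set difference, with toy symbol sets standing in for `set(gwas_df['Gene'])` and `set(result['hgnc_symbol'])`:

```python
# Toy stand-ins for set(gwas_df['Gene']) and set(result['hgnc_symbol'])
input_symbols = {"SORT1", "CR1", "NOT_A_GENE"}  # "NOT_A_GENE" is a made-up symbol
returned_symbols = {"SORT1", "CR1", "ADAM17"}

# Symbols that were requested but are absent from the BioMart result
missing = sorted(input_symbols - returned_symbols)
print(missing)  # ['NOT_A_GENE']
```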


Write the result to file. Note: some gene symbols map to multiple Ensembl IDs, and that's okay; all mappings are kept in the output.

In [6]:
result.to_csv('../output/igap_genetic_association_genes_2022.csv', index = False, header = True)
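
To see which symbols map to more than one Ensembl ID, the result can be grouped by symbol. A sketch using a few rows copied from the query output shown earlier; on the real data, `demo` would be replaced by `result`:

```python
import pandas as pd

# A few rows taken from the BioMart output shown earlier in this notebook
demo = pd.DataFrame({
    "ensembl_gene_id": [
        "ENSG00000277751", "ENSG00000277641", "ENSG00000275463",
        "ENSG00000274513", "ENSG00000134243",
    ],
    "hgnc_symbol": ["LILRB2", "WNT3", "LILRB2", "LILRB2", "SORT1"],
})

# Count distinct Ensembl IDs per symbol and keep the symbols with more than one
id_counts = demo.groupby("hgnc_symbol")["ensembl_gene_id"].nunique()
multi_mapped = id_counts[id_counts > 1]
print(multi_mapped.to_dict())  # {'LILRB2': 3}
```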