## Orange Team CQ#1.7

### Query: 
What genes show high phenotypic similarity to the 11 Fanconi Anemia core complex genes (set FA-core)?

### Services:
BioLink API (Monarch) - https://api.monarchinitiative.org/api/

Owlsim - http://owlsim3.monarchinitiative.org/api/docs/

### Approach:
Use owlsim3 to find the human genes whose phenotype profiles are most similar to those of the FA genes, then rank the matches by summed similarity score.

### Authors
Kent Shefchek and Greg Stupp

In [None]:
# autogenerate biolink_client
# curl --insecure -X POST -H "content-type:application/json" -d '{"swaggerUrl":"https://api.monarchinitiative.org/api/swagger.json"}' https://generator.swagger.io/api/gen/clients/python
# and rename it to biolink_client

In [1]:
import os, sys
# change this path
sys.path.insert(0, "/home/gstupp/projects/NCATS-Tangerine/biolink_client")

In [46]:
import biolink_client
from biolink_client.api_client import ApiClient
from biolink_client.rest import ApiException
import requests
from itertools import chain
import pandas as pd
from pprint import pprint
from tqdm import tqdm, tqdm_notebook
from collections import defaultdict

pd.options.display.max_rows = 999
pd.options.display.max_columns = 12
pd.set_option('display.width', 1000)

MONARCH_API = "https://api.monarchinitiative.org/api"
OWLSIM_API = "http://owlsim3.monarchinitiative.org/api"

gene_list = "https://raw.githubusercontent.com/NCATS-Tangerine/cq-notebooks/master/FA_gene_sets/FA_4_all_genes.txt"

client = ApiClient(host=MONARCH_API)
client.set_default_header('Content-Type', 'text/plain')
api_instance = biolink_client.BioentityApi(client)

# Get the gene list from github
dataframe = pd.read_csv(gene_list, sep='\t', names=['gene_id', 'symbol'])
df = dataframe.set_index('symbol')
human_genes = set(df.gene_id)
symbol_id = dict(zip(df.index, df.gene_id))
id_symbol = {v:k for k,v in symbol_id.items()}

In [10]:
gene_hpo_map = dict()
for gene_id in tqdm_notebook(set(df.gene_id)):
    api_response = api_instance.get_gene_phenotype_associations(gene_id, rows=500)
    # TODO add facet_counts to AssociationResults model
    # TODO use facet_counts to check the gene does not have >500 phenotypes
    # TODO or better, add pagination
    gene_hpo_map[gene_id] = api_response.objects

ApiException: (502)
Reason: Bad Gateway
HTTP response headers: HTTPHeaderDict({'Connection': 'keep-alive', 'Content-Length': '182', 'Date': 'Mon, 13 Nov 2017 20:04:04 GMT', 'Server': 'nginx/1.10.3 (Ubuntu)', 'Content-Type': 'text/html'})
HTTP response body: <html>
<head><title>502 Bad Gateway</title></head>
<body bgcolor="white">
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx/1.10.3 (Ubuntu)</center>
</body>
</html>
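
The TODOs in the cell above mention that a gene with more than 500 phenotypes would be silently truncated. A minimal sketch of an offset-based pagination loop follows. `fetch_page(start, rows)` is a hypothetical callable, not part of `biolink_client`; this notebook does not confirm whether `get_gene_phenotype_associations` accepts an offset parameter, so the API call is replaced by a stand-in.

```python
def fetch_all(fetch_page, page_size=500):
    """Collect every item from a paged endpoint.

    fetch_page(start, rows) is a hypothetical callable that returns a list
    of at most `rows` items beginning at offset `start`.
    """
    items = []
    start = 0
    while True:
        page = fetch_page(start, page_size)
        items.extend(page)
        # A short (or empty) page means there is nothing left to fetch.
        if len(page) < page_size:
            break
        start += page_size
    return items


# Stand-in for a real API call: 1234 made-up phenotype IDs served in slices.
fake_phenotypes = ["HP:{:07d}".format(i) for i in range(1234)]


def fake_fetch_page(start, rows):
    return fake_phenotypes[start:start + rows]


all_phenotypes = fetch_all(fake_fetch_page, page_size=500)
print(len(all_phenotypes))
```

With a real client, `fake_fetch_page` would be swapped for a thin wrapper around the API call, assuming the service exposes an offset alongside `rows`.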



In [12]:
# Get the first five phenotypes for FANCA
pprint(gene_hpo_map[df.at['FANCA', 'gene_id']][0:5])

['EFO:0003924', 'EFO:0003963', 'HP:0000010', 'HP:0000027', 'HP:0000028']


In [35]:
# Search for top human genes
# TODO implement prefix or taxon+type filters in owlsim
# TODO fix cutoff filter

# Note that this notebook takes a few minutes to run

# Use phenodigm algorithm
matcher = 'phenodigm'
results = []

for ncbi_id, phenotypes in tqdm_notebook(gene_hpo_map.items()):
    params = { 'id': phenotypes }
    url = "{}/match/{}".format(OWLSIM_API, matcher)
    req = requests.get(url, params=params)
    req.raise_for_status()  # stop on HTTP errors instead of parsing an error page
    owlsim_results = req.json()
    for match in owlsim_results['matches']:
        results.append([ncbi_id, id_symbol[ncbi_id], match['matchId'], match['matchLabel'], match['rawScore']])

KeyboardInterrupt: 
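
For reference, passing a list as the `id` value makes `requests` repeat the `id` key once per phenotype in the query string. The sketch below reproduces that URL shape with the standard library only; `build_match_url` is a hypothetical helper written for this illustration, and the two phenotype IDs are taken from the FANCA output above.

```python
from urllib.parse import urlencode

OWLSIM_API = "http://owlsim3.monarchinitiative.org/api"


def build_match_url(phenotypes, matcher="phenodigm"):
    """Return the match URL with one `id` parameter per phenotype (hypothetical helper)."""
    # doseq=True expands the list into repeated id=... pairs.
    query = urlencode({"id": phenotypes}, doseq=True)
    return "{}/match/{}?{}".format(OWLSIM_API, matcher, query)


url = build_match_url(["HP:0000010", "HP:0000027"])
print(url)
```

Since every phenotype adds another `id=...` pair, a gene with several hundred phenotypes produces a long URL; whether the service limits URL length is not established in this notebook.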

In [36]:
results[0]

['NCBIGene:2072', 'ERCC4', 'NCBIGene:2072', 'ERCC4', 99.49491789772425]

In [97]:
# Create a table of query gene, matched gene, and sim score
column_names = ['query_gene', 'query_symbol', 'match_gene', 'match_symbol', 'sim_score']
df = pd.DataFrame(data=results, columns=column_names)
# pd.np was removed in pandas 1.0, and reindex() without arguments is a no-op
df = df.replace('NaN', float('nan')).dropna()

In [98]:
# Get sim scores for ERCC4
df_ercc4 = df.query("query_symbol == 'ERCC4'")
print(df_ercc4.head(40))

       query_gene query_symbol       match_gene                                     match_symbol  sim_score
0   NCBIGene:2072        ERCC4    NCBIGene:2072                                            ERCC4  99.494918
1   NCBIGene:2072        ERCC4   NCBIGene:10459                                           MAD2L2  80.791164
2   NCBIGene:2072        ERCC4   NCBIGene:57697                                            FANCM  80.791164
3   NCBIGene:2072        ERCC4    NCBIGene:2188                                            FANCF  80.791164
4   NCBIGene:2072        ERCC4    NCBIGene:2189                                            FANCG  80.791164
5   NCBIGene:2072        ERCC4    NCBIGene:7516                                            XRCC2  80.791164
6   NCBIGene:2072        ERCC4       DOID:13636                                   Fanconi anemia  80.791164
7   NCBIGene:2072        ERCC4    NCBIGene:5888                                            RAD51  80.692858
8   NCBIGene:2072        ERC

In [99]:
# Filter out Non-Genes
df = df[df.match_gene.str.startswith("NCBIGene")]

In [100]:
# Get sim scores for ERCC4
df_ercc4 = df.query("query_symbol == 'ERCC4'")
print(df_ercc4.head(40))

       query_gene query_symbol      match_gene match_symbol  sim_score
0   NCBIGene:2072        ERCC4   NCBIGene:2072        ERCC4  99.494918
1   NCBIGene:2072        ERCC4  NCBIGene:10459       MAD2L2  80.791164
2   NCBIGene:2072        ERCC4  NCBIGene:57697        FANCM  80.791164
3   NCBIGene:2072        ERCC4   NCBIGene:2188        FANCF  80.791164
4   NCBIGene:2072        ERCC4   NCBIGene:2189        FANCG  80.791164
5   NCBIGene:2072        ERCC4   NCBIGene:7516        XRCC2  80.791164
7   NCBIGene:2072        ERCC4   NCBIGene:5888        RAD51  80.692858
8   NCBIGene:2072        ERCC4  NCBIGene:29089        UBE2T  80.692858
9   NCBIGene:2072        ERCC4  NCBIGene:55215        FANCI  80.497801
10  NCBIGene:2072        ERCC4  NCBIGene:55120        FANCL  80.458300
11  NCBIGene:2072        ERCC4   NCBIGene:5889       RAD51C  80.443244
12  NCBIGene:2072        ERCC4  NCBIGene:84464         SLX4  80.347636
13  NCBIGene:2072        ERCC4   NCBIGene:2187        FANCB  80.263305
14  NC

In [101]:
# remove self matches
df = df[df.query_gene != df.match_gene]
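
The gene filter above relies on the CURIE prefix of the match ID. A small sketch of the same idea in plain Python, comparing the exact prefix rather than a leading substring, is shown here; `curie_prefix` is a hypothetical helper, and the three rows are copied from the ERCC4 output earlier in the notebook.

```python
def curie_prefix(curie):
    """Return the part of a CURIE before the first colon (hypothetical helper)."""
    return curie.split(":", 1)[0]


# (match_id, match_label, score) rows copied from the ERCC4 results above
matches = [
    ("NCBIGene:7516", "XRCC2", 80.791164),
    ("DOID:13636", "Fanconi anemia", 80.791164),
    ("NCBIGene:5888", "RAD51", 80.692858),
]

# Keep only matches whose identifier is in the NCBIGene namespace.
gene_matches = [m for m in matches if curie_prefix(m[0]) == "NCBIGene"]
print(gene_matches)
```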

In [132]:
# sum scores for each matched gene
sim_score = df.groupby("match_symbol").agg({"sim_score": "sum"}).sim_score
sim_score = sim_score.sort_values(ascending=False)
sim_score[:20]

match_symbol
FANCD2    645.872798
FANCA     643.888836
FANCI     640.039973
XRCC2     638.543443
FANCM     638.543443
MAD2L2    638.543443
FANCF     638.543443
RAD51     637.482770
RAD51C    633.733749
FANCL     631.168424
FANCB     630.888143
PALB2     630.403085
BRIP1     551.687171
BRCA2     551.024568
FANCE     546.233946
FANCC     546.233946
UBE2T     544.324993
FANCG     538.543443
SLX4      535.817756
ERCC4     494.228326
Name: sim_score, dtype: float64

In [133]:
## Sanity check. Only show the FA genes
sim_score[sim_score.index.isin(symbol_id)]

match_symbol
FANCD2    645.872798
FANCA     643.888836
FANCI     640.039973
XRCC2     638.543443
FANCM     638.543443
MAD2L2    638.543443
FANCF     638.543443
RAD51     637.482770
RAD51C    633.733749
FANCL     631.168424
FANCB     630.888143
PALB2     630.403085
BRIP1     551.687171
BRCA2     551.024568
FANCE     546.233946
FANCC     546.233946
UBE2T     544.324993
FANCG     538.543443
SLX4      535.817756
ERCC4     494.228326
BRCA1     291.560420
Name: sim_score, dtype: float64

In [135]:
# Filter out all genes from the input set (FA)
sim_score_nofa = sim_score[~sim_score.index.isin(symbol_id)]
sim_score_nofa[:20]

match_symbol
ERCC5     388.214663
ERCC3     384.741851
ERCC2     380.767802
DDB2      378.441790
XPA       377.492515
XPC       377.005496
GATA1     375.973420
RPS19     374.372384
FLI1      371.583715
ATM       365.580848
BLM       363.770373
RPL35A    361.500421
RPS26     361.051479
RECQL4    360.921232
RPL5      360.579315
CEP57     360.460385
RPL11     360.276507
RPL26     359.769532
BUB1      358.562051
BUB1B     358.562051
Name: sim_score, dtype: float64

In [137]:
# which genes matched to ERCC5?
df.query("match_symbol == 'ERCC5'")

          query_gene  query_symbol     match_gene  match_symbol  sim_score
20     NCBIGene:2072         ERCC4  NCBIGene:2073         ERCC5  77.349963
451    NCBIGene:2176         FANCC  NCBIGene:2073         ERCC5  51.424090
654   NCBIGene:83990         BRIP1  NCBIGene:2073         ERCC5  45.360208
1035   NCBIGene:2189         FANCG  NCBIGene:2073         ERCC5  56.306953
1451   NCBIGene:2178         FANCE  NCBIGene:2073         ERCC5  51.424090
1638  NCBIGene:84464          SLX4  NCBIGene:2073         ERCC5  55.943945
1842  NCBIGene:29089         UBE2T  NCBIGene:2073         ERCC5  50.405415


In [143]:
# Across the list of gene pairs, which genes show up the most?
df['match_symbol'].value_counts()[:100]

FANCD2                                                    7
WIPF1                                                     7
BUB3                                                      7
PTPN11                                                    7
PEX14                                                     7
PEX26                                                     7
MAD2L2                                                    7
ELN                                                       7
RPL35A                                                    7
RPL11                                                     7
LIMK1                                                     7
XPC                                                       7
RAD51                                                     7
MKS1                                                      7
BUB1B                                                     7
BLM                                                       7
MPL                                     

In [149]:
## Run same summation, but removing all scores lower than 70 beforehand
sim_score = df.query("sim_score>70").groupby("match_symbol").agg({"sim_score": "sum"}).sim_score
sim_score = sim_score.sort_values(ascending=False)
sim_score = sim_score[~sim_score.index.isin(symbol_id)]
sim_score[:20]

match_symbol
ERCC5    77.349963
ERCC3    77.015192
ERCC2    75.396065
DDB2     74.143082
XPC      73.742835
XPA      73.613378
Name: sim_score, dtype: float64
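
To see what the cutoff does to a single gene, the summation can be replayed by hand on the seven ERCC5 rows printed earlier. This is a sketch in plain Python rather than the pandas code used in the notebook; `sum_scores` is a hypothetical helper, and the scores are copied from the ERCC5 query output above.

```python
from collections import defaultdict

# (query_symbol, match_symbol, sim_score) rows copied from the ERCC5 output
ercc5_rows = [
    ("ERCC4", "ERCC5", 77.349963),
    ("FANCC", "ERCC5", 51.42409),
    ("BRIP1", "ERCC5", 45.360208),
    ("FANCG", "ERCC5", 56.306953),
    ("FANCE", "ERCC5", 51.42409),
    ("SLX4", "ERCC5", 55.943945),
    ("UBE2T", "ERCC5", 50.405415),
]


def sum_scores(rows, min_score=None):
    """Sum scores per matched gene, optionally keeping only scores above min_score."""
    totals = defaultdict(float)
    for _query, match, score in rows:
        if min_score is None or score > min_score:
            totals[match] += score
    return dict(totals)


all_total = sum_scores(ercc5_rows)["ERCC5"]
high_total = sum_scores(ercc5_rows, min_score=70)["ERCC5"]
print(round(all_total, 4), round(high_total, 4))
```

The unfiltered total agrees with the ERCC5 entry in the earlier ranking (about 388.21), while the filtered total equals the single ERCC4 to ERCC5 score. With a cutoff of 70, each value in the table above therefore comes from one query gene rather than from a sum over several.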

### Next steps
1. Run on model organisms
2. Improvements to owlsim service layer: https://github.com/monarch-initiative/owlsim-v3/issues/87
3. Add pagination to owlsim services

We may be missing some gene pairs, because the similarity scores are pulled across all match types (diseases, model genes) rather than for human genes only.
