## Orange Team CQ#1.7

### Query: 
What genes show high phenotypic similarity to the 11 Fanconi Anemia core complex genes (set FA-core)?

### Services:
BioLink API (Monarch) - https://api.monarchinitiative.org/api/

Owlsim - http://owlsim3.monarchinitiative.org/api/docs/

### Approach:
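Retrieve the phenotypes associated with each input gene from the Monarch BioLink API, then use the owlsim3 phenotypic-similarity service to find the closest-matching human genes.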

### Authors
Kent Shefchek and Greg Stupp

In [151]:
# autogenerate biolink_client
# curl --insecure -X POST -H "content-type:application/json" -d '{"swaggerUrl":"https://api.monarchinitiative.org/api/swagger.json"}' https://generator.swagger.io/api/gen/clients/python
# and rename it to biolink_client

In [152]:
import os, sys
# change this path
sys.path.insert(0, "/home/gstupp/projects/NCATS-Tangerine/biolink_client")

In [153]:
import biolink_client
from biolink_client.api_client import ApiClient
from biolink_client.rest import ApiException
import requests
from itertools import chain
import pandas as pd
from pprint import pprint
from tqdm import tqdm, tqdm_notebook
from collections import defaultdict

pd.options.display.max_rows = 999
pd.options.display.max_columns = 12
pd.set_option('display.width', 1000)

MONARCH_API = "https://api.monarchinitiative.org/api"
OWLSIM_API = "http://owlsim3.monarchinitiative.org/api"

gene_list = "https://raw.githubusercontent.com/NCATS-Tangerine/cq-notebooks/master/FA_gene_sets/FA_4_all_genes.txt"

client = ApiClient(host=MONARCH_API)
client.set_default_header('Content-Type', 'text/plain')
api_instance = biolink_client.BioentityApi(client)

# Get the gene list from github
dataframe = pd.read_csv(gene_list, sep='\t', names=['gene_id', 'symbol'])
df = dataframe.set_index('symbol')
human_genes = set(df.gene_id)
symbol_id = dict(zip(df.index, df.gene_id))
id_symbol = {v:k for k,v in symbol_id.items()}

In [154]:
gene_hpo_map = dict()
for gene_id in tqdm_notebook(set(df.gene_id)):
    api_response = api_instance.get_gene_phenotype_associations(gene_id, rows=500)
    # TODO add facet_counts to AssociationResults model
    # TODO use facet_counts to check the gene does not have >500 phenotypes
    # TODO or better, add pagination
    gene_hpo_map[gene_id] = api_response.objects

In [155]:
# Get the first five phenotypes for FANCA
pprint(gene_hpo_map[df.at['FANCA', 'gene_id']][0:5])

['EFO:0003924', 'EFO:0003963', 'HP:0000010', 'HP:0000027', 'HP:0000028']


In [159]:
# Search for top human genes
# TODO implement prefix or taxon+type filters in owlsim
# TODO fix cutoff filter

# Note that this notebook takes a few minutes to run

# Use phenodigm algorithm
matcher = 'phenodigm'
results = []

for ncbi_id, phenotypes in tqdm_notebook(gene_hpo_map.items()):
    params = { 'id': phenotypes }
    url = "{}/match/{}".format(OWLSIM_API, matcher)
    req = requests.get(url, params=params)
    owlsim_results = req.json()
    if "matches" not in owlsim_results:
        print(ncbi_id, owlsim_results)
        continue
    for match in owlsim_results['matches']:
        results.append([ncbi_id, id_symbol[ncbi_id], match['matchId'], match['matchLabel'], match['rawScore']])

NCBIGene:5889 {'message': 'There was an error processing your request. It has been logged (ID def490adc1579464).'}
NCBIGene:675 {'message': 'There was an error processing your request. It has been logged (ID 9346cce7faf3a2b2).'}
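
Two queries (NCBIGene:5889 and NCBIGene:675) returned an error payload and were skipped, so those genes contribute no rows to the results. One way to keep track of such gaps is to separate response parsing from the request loop and record the failed IDs. The sketch below uses canned payloads rather than live requests; `rows_from_owlsim` is a hypothetical helper name, and only the `matches`, `matchId`, `matchLabel` and `rawScore` keys seen in the cell above are assumed.

```python
def rows_from_owlsim(query_id, query_symbol, payload):
    """Return result rows for one query gene, or None if the payload has no matches."""
    if "matches" not in payload:
        return None
    return [
        [query_id, query_symbol, m["matchId"], m["matchLabel"], m["rawScore"]]
        for m in payload["matches"]
    ]


# Canned payloads standing in for req.json(); the score is a toy value.
ok_payload = {
    "matches": [
        {"matchId": "NCBIGene:2072", "matchLabel": "ERCC4", "rawScore": 99.5}
    ]
}
err_payload = {"message": "There was an error processing your request."}

results, failed = [], []
for gene_id, symbol, payload in [
    ("NCBIGene:2072", "ERCC4", ok_payload),
    ("NCBIGene:675", "BRCA2", err_payload),
]:
    rows = rows_from_owlsim(gene_id, symbol, payload)
    if rows is None:
        failed.append(gene_id)  # keep the ID so the query can be retried later
    else:
        results.extend(rows)

print(failed)
```

Keeping the `failed` list makes it explicit which query genes are absent from the downstream sums.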



In [160]:
results[0]

['NCBIGene:2072', 'ERCC4', 'NCBIGene:2072', 'ERCC4', 99.49491789772425]

In [161]:
# Create a table of query gene, matched gene, and sim score
column_names = ['query_gene', 'query_symbol', 'match_gene', 'match_symbol', 'sim_score']
df = pd.DataFrame(data=results, columns=column_names)
# Convert the string 'NaN' to a real missing value, then drop incomplete rows
df = df.replace('NaN', float('nan')).dropna()

In [162]:
# Get sim scores for ERCC4
df_ercc4 = df.query("query_symbol == 'ERCC4'")
print(df_ercc4.head(40))

       query_gene query_symbol       match_gene                                     match_symbol  sim_score
0   NCBIGene:2072        ERCC4    NCBIGene:2072                                            ERCC4  99.494918
1   NCBIGene:2072        ERCC4   NCBIGene:10459                                           MAD2L2  80.791164
2   NCBIGene:2072        ERCC4   NCBIGene:57697                                            FANCM  80.791164
3   NCBIGene:2072        ERCC4    NCBIGene:2188                                            FANCF  80.791164
4   NCBIGene:2072        ERCC4    NCBIGene:2189                                            FANCG  80.791164
5   NCBIGene:2072        ERCC4    NCBIGene:7516                                            XRCC2  80.791164
6   NCBIGene:2072        ERCC4       DOID:13636                                   Fanconi anemia  80.791164
7   NCBIGene:2072        ERCC4    NCBIGene:5888                                            RAD51  80.692858
8   NCBIGene:2072        ERC

In [163]:
# Filter out Non-Genes
df = df[df.match_gene.str.startswith("NCBIGene")]

In [164]:
# Get sim scores for ERCC4
df_ercc4 = df.query("query_symbol == 'ERCC4'")
print(df_ercc4.head(40))

       query_gene query_symbol      match_gene match_symbol  sim_score
0   NCBIGene:2072        ERCC4   NCBIGene:2072        ERCC4  99.494918
1   NCBIGene:2072        ERCC4  NCBIGene:10459       MAD2L2  80.791164
2   NCBIGene:2072        ERCC4  NCBIGene:57697        FANCM  80.791164
3   NCBIGene:2072        ERCC4   NCBIGene:2188        FANCF  80.791164
4   NCBIGene:2072        ERCC4   NCBIGene:2189        FANCG  80.791164
5   NCBIGene:2072        ERCC4   NCBIGene:7516        XRCC2  80.791164
7   NCBIGene:2072        ERCC4   NCBIGene:5888        RAD51  80.692858
8   NCBIGene:2072        ERCC4  NCBIGene:29089        UBE2T  80.692858
9   NCBIGene:2072        ERCC4  NCBIGene:55215        FANCI  80.497801
10  NCBIGene:2072        ERCC4  NCBIGene:55120        FANCL  80.458300
11  NCBIGene:2072        ERCC4   NCBIGene:5889       RAD51C  80.443244
12  NCBIGene:2072        ERCC4  NCBIGene:84464         SLX4  80.347636
13  NCBIGene:2072        ERCC4   NCBIGene:2187        FANCB  80.263305
14  NC

In [165]:
# remove self matches
df = df[df.query_gene != df.match_gene]

In [167]:
# sum scores for each matched gene
sim_score = df.groupby("match_symbol").agg({"sim_score": sum}).sim_score
sim_score = sim_score.sort_values(ascending=False)
sim_score[:20]

match_symbol
RAD51C    1694.115818
FANCD2    1627.693050
FANCA     1624.479407
FANCC     1620.373007
FANCE     1620.373007
BRIP1     1616.937313
FANCI     1616.563142
PALB2     1605.086634
FANCB     1605.079036
SLX4      1596.326291
FANCL     1590.386976
XRCC2     1582.069088
UBE2T     1571.582898
MAD2L2    1567.853780
FANCG     1567.597860
FANCF     1567.597860
FANCM     1567.597860
RAD51     1565.194425
BRCA2     1485.466534
ERCC4     1404.844667
Name: sim_score, dtype: float64
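
The self-match removal and per-gene summation above can be traced on a toy table. This is an illustrative sketch with made-up symbols and scores (`Q1`, `Q2`, `A`, `B` are hypothetical), not data from owlsim.

```python
import pandas as pd

# Toy pairs: two query genes, each matching itself and some other genes.
toy = pd.DataFrame(
    [
        ["Q1", "Q1", 100.0],
        ["Q1", "A", 80.0],
        ["Q1", "B", 60.0],
        ["Q2", "Q2", 100.0],
        ["Q2", "A", 70.0],
    ],
    columns=["query_symbol", "match_symbol", "sim_score"],
)

# Drop self matches, then sum the remaining scores per matched gene.
toy = toy[toy.query_symbol != toy.match_symbol]
summed = (
    toy.groupby("match_symbol")["sim_score"]
    .sum()
    .sort_values(ascending=False)
)
print(summed)
```

Because the scores are summed rather than averaged, a gene that matches many query genes moderately well can outrank a gene that matches a few of them very well.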

In [168]:
## Sanity check. Only show the FA genes
sim_score[sim_score.index.isin(symbol_id)]

match_symbol
RAD51C    1694.115818
FANCD2    1627.693050
FANCA     1624.479407
FANCC     1620.373007
FANCE     1620.373007
BRIP1     1616.937313
FANCI     1616.563142
PALB2     1605.086634
FANCB     1605.079036
SLX4      1596.326291
FANCL     1590.386976
XRCC2     1582.069088
UBE2T     1571.582898
MAD2L2    1567.853780
FANCG     1567.597860
FANCF     1567.597860
FANCM     1567.597860
RAD51     1565.194425
BRCA2     1485.466534
ERCC4     1404.844667
BRCA1      636.373464
Name: sim_score, dtype: float64

In [169]:
# Filter out all genes from the input set (FA)
sim_score_nofa = sim_score[~sim_score.index.isin(symbol_id)]
sim_score_nofa[:20]

match_symbol
ERCC3     993.385510
XPC       977.266561
RPS19     971.663947
GATA1     969.407885
FLI1      964.228144
NRAS      960.537220
ERCC5     960.248320
ERCC2     942.011987
ATM       941.945292
RPL35A    938.912265
MYC       938.727405
BLM       938.108438
DDB2      937.711570
RPS26     937.213249
RPL5      936.197777
XPA       935.980590
CEP57     934.706647
RPL11     933.713539
RPL26     931.878519
RPS28     925.210044
Name: sim_score, dtype: float64

In [172]:
# which genes matched to ERCC5?
df.query("match_symbol == 'ERCC5'")

          query_gene query_symbol     match_gene match_symbol  sim_score
20     NCBIGene:2072        ERCC4  NCBIGene:2073        ERCC5  77.349963
451    NCBIGene:2176        FANCC  NCBIGene:2073        ERCC5  51.424090
654   NCBIGene:83990        BRIP1  NCBIGene:2073        ERCC5  45.360208
1035   NCBIGene:2189        FANCG  NCBIGene:2073        ERCC5  56.306953
1451   NCBIGene:2178        FANCE  NCBIGene:2073        ERCC5  51.424090
1638  NCBIGene:84464         SLX4  NCBIGene:2073        ERCC5  55.943945
1842  NCBIGene:29089        UBE2T  NCBIGene:2073        ERCC5  50.405415
2041  NCBIGene:55215        FANCI  NCBIGene:2073        ERCC5  55.020788
2435  NCBIGene:57697        FANCM  NCBIGene:2073        ERCC5  56.306953
2648   NCBIGene:2177       FANCD2  NCBIGene:2073        ERCC5  51.626397


In [173]:
# Across the list of gene pairs, which genes show up the most?
df['match_symbol'].value_counts()[:100]

XPC         19
NRAS        19
BRCA2       19
RAD51C      19
MYC         19
ERCC3       19
NDN         18
NSDHL       18
BUB1B       18
RAF1        18
MPL         18
GATA1       18
BAZ1B       18
PEX3        18
GDF6        18
RPS26       18
MKS1        18
RPL15       18
FANCE       18
BRIP1       18
SLX4        18
CLIP2       18
PEX14       18
PEX26       18
CHD7        18
FANCB       18
FERMT1      18
DHCR7       18
CEP290      18
DDB2        18
FANCA       18
PEX11B      18
TSR2        18
ATM         18
PALB2       18
TMEM67      18
RPL5        18
PTPN11      18
XPA         18
RPL35A      18
RPL11       18
FANCL       18
CEP57       18
FANCD2      18
PEX19       18
PEX2        18
ERCC2       18
JAK2        18
ELN         18
BUB1        18
FLI1        18
RPS19       18
SNRPN       18
LIMK1       18
BRAF        18
PEX6        18
RPS28       18
PEX16       18
GTF2IRD1    18
RPL26       18
SH2B3       18
APOE        18
C11orf65    18
FANCC       18
RECQL4      18
PEX5        18
FANCI     

In [174]:
## Run same summation, but removing all scores lower than 70 beforehand
sim_score = df.query("sim_score>70").groupby("match_symbol").agg({"sim_score": sum}).sim_score
sim_score = sim_score.sort_values(ascending=False)
sim_score = sim_score[~sim_score.index.isin(symbol_id)]
sim_score[:20]

match_symbol
ERCC5     77.349963
ERCC3     77.015192
ERCC2     75.396065
DDB2      74.143082
XPC       73.742835
RNASEL    73.653639
XPA       73.613378
HOXB13    73.368494
MSMB      73.368494
EPHB2     73.187186
ELAC2     72.571169
MSR1      70.818853
Name: sim_score, dtype: float64
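
Every total in the thresholded output lies between 70 and 78, so each one must come from a single query gene (two scores above 70 would sum to more than 140). Reporting the number of contributing query genes next to the sum makes this visible directly. The sketch below uses rounded toy values, and `GENE_X` is a hypothetical match added only to show a count greater than one.

```python
import pandas as pd

pairs = pd.DataFrame(
    {
        "query_symbol": ["ERCC4", "FANCC", "FANCG", "ERCC4", "FANCC"],
        "match_symbol": ["ERCC5", "ERCC5", "ERCC5", "GENE_X", "GENE_X"],
        "sim_score": [77.3, 51.4, 56.3, 72.0, 71.0],
    }
)

# Keep only strong matches, then report both the total and how many
# query genes contributed to it.
strong = pairs[pairs.sim_score > 70]
summary = (
    strong.groupby("match_symbol")["sim_score"]
    .agg(["sum", "count"])
    .sort_values("sum", ascending=False)
)
print(summary)
```

A `count` of 1 signals that the ranking for that gene rests on a single query gene rather than on the input set as a whole.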

### Next steps
1. Run on model organisms
2. Improvements to owlsim service layer: https://github.com/monarch-initiative/owlsim-v3/issues/87
3. Add pagination to owlsim services

It is possible that we are missing gene pairs, because the similarity scores are pulled across all match types (diseases, model-organism genes) rather than human genes only.
