Q1.5 
This query aims to expand the FA-ensemblage gene set based on upstream TF binding site motif patterns.

Data  
 - FA gene sets  'https://github.com/NCATS-Tangerine/cq-notebooks/tree/master/FA_gene_sets'
 - motif similarity datastore 'jaspar.nt'


In [1]:
import csv
import yaml
import requests
import json
import re

Until the hackathon Blazegraph instance is available,  
we are running a local instance [described here](http://localhost:8888/notebooks/LocalBlazeGraph.ipynb).

In [2]:
bg_host = 'http://localhost:9999'
bg = bg_host + '/blazegraph/sparql?format=json&query=' 

which offers:  

    rdf:Description rdf:nodeID="service"
        rdf:type 
            rdf:resource="http://www.w3.org/ns/sparql-service-description#Service"
        endpoint 
            rdf:resource="http://localhost:9999/blazegraph/sparql"
        endpoint 
            rdf:resource="http://localhost:9999/bigdata/LBS/sparql"
            
Trying the blazegraph/sparql endpoint yields

In [3]:
x = requests.get(bg + 'SELECT ?g1 WHERE{?g1 <http://purl.obolibrary.org/obo/SO_adjacent_to> ?o} LIMIT 1')
print(x.text)

{
  "head" : {
    "vars" : [ "g1" ]
  },
  "results" : {
    "bindings" : [ {
      "g1" : {
        "type" : "uri",
        "value" : "http://www.ncbi.nlm.nih.gov/gene/6309"
      }
    } ]
  }
}
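
The request above appends the raw SPARQL text to the URL and leaves the escaping to `requests`. As a minimal sketch, the query can be percent-encoded explicitly with the standard library. `build_sparql_url` is a hypothetical helper name, not part of the notebook, and the endpoint is the local one assumed throughout.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

def build_sparql_url(endpoint, query, fmt='json'):
    """Return a GET URL with the SPARQL query percent-encoded."""
    return endpoint + '?' + urlencode({'format': fmt, 'query': query})

# Assumed local endpoint, as configured in this notebook.
endpoint = 'http://localhost:9999/blazegraph/sparql'
query = ('SELECT ?g1 WHERE{?g1 '
         '<http://purl.obolibrary.org/obo/SO_adjacent_to> ?o} LIMIT 1')
url = build_sparql_url(endpoint, query)

# Decoding the query string gives back the original text unchanged.
decoded = parse_qs(urlsplit(url).query)
print(decoded['query'][0] == query)
```

The resulting `url` could then be passed to `requests.get(url)` in place of the concatenated string.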


Blazegraph is working locally,  
and there is a process in place to refresh it when I change the RDF.  
The rest of this notebook uses the RDF stored in Blazegraph  
to address the FA TF-binding motif question.

In [4]:
# Want to use existing translation tables when constructing SPARQL queries
yamlurl='https://raw.githubusercontent.com/TomConlin/Jaspar_FA/master/translation_tables/curie_map.yaml'
rsponse = requests.get(yamlurl)
PREFIX = yaml.safe_load(rsponse.text)
# print(PREFIX)

yamlurl='https://raw.githubusercontent.com/TomConlin/Jaspar_FA/master/translation_tables/translation_table.yaml'
rsponse = requests.get(yamlurl)
TT = yaml.safe_load(rsponse.text)
# print(TT)

# Redecorate the CURIE base-IRI map
# to look like a SPARQL prefix namespace map,
# except for the bnode skolem IRI entry (filtered out by the length test below),
# which makes the engine throw a Java
#     "org.openrdf.query.MalformedQueryException"
# for no good reason.

prefixns = ""
for p in PREFIX:
    if len(p) > 1:
        prefixns += 'PREFIX ' + p + ': <' + PREFIX[p] + '>\n'

print(prefixns)

PREFIX RO: <http://purl.obolibrary.org/obo/RO_>
PREFIX SO: <http://purl.obolibrary.org/obo/SO_>
PREFIX JASPAR: <http://fantom.gsc.riken.jp/5/sstar/JASPAR_motif:>
PREFIX OIO: <http://www.geneontology.org/formats/oboInOwl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX NCBIGene: <http://www.ncbi.nlm.nih.gov/gene/>
PREFIX GENO: <http://purl.obolibrary.org/obo/GENO_>
PREFIX SWO: <http://www.ebi.ac.uk/efo/swo/SWO_>



The Fanconi Anemia genes come as symbols/aliases in three sets [here](https://docs.google.com/spreadsheets/d/1yX-5sfrC3vrahf4_k7-5rl4Oqzm853ollIMmUo1PTc0)  
I converted them to NCBI gene identifiers and current symbols [here](https://github.com/NCATS-Tangerine/cq-notebooks/tree/master/FA_gene_sets)

In [5]:
fagene=[]

fa1 = '../FA_gene_sets/FA_1_core_complex.txt'
core_complex = {}
with open(fa1, 'r') as tabfile:
    filereader = csv.reader(tabfile, delimiter='\t')
    for row in filereader:   
        (fa_gene, fa_symbol) = row
        fagene.append(fa_gene)
        core_complex[fa_gene]=fa_symbol
        
fa2 = '../FA_gene_sets/FA_2_effector_proteins.txt'
effector_proteins = {}
with open(fa2, 'r') as tabfile:
    filereader = csv.reader(tabfile, delimiter='\t')
    for row in filereader:   
        (fa_gene, fa_symbol) = row
        fagene.append(fa_gene)
        effector_proteins[fa_gene]=fa_symbol
        
fa3 = '../FA_gene_sets/FA_3_associated_proteins.txt'
associated_proteins = {} 
with open(fa3, 'r') as tabfile:
    filereader = csv.reader(tabfile, delimiter='\t')
    for row in filereader:   
        (fa_gene, fa_symbol) = row
        fagene.append(fa_gene)
        associated_proteins[fa_gene]=fa_symbol
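
The three blocks above differ only in the file they read. A sketch of a shared loader follows. `load_gene_set` is a hypothetical helper, and the two-column, tab-separated layout (`NCBIGene:<id>`, then symbol) is an assumption based on how the rows are unpacked and later rewritten. The demonstration writes a throwaway file so it runs without the repository checkout.

```python
import csv
import os
import tempfile

def load_gene_set(path):
    """Read a two-column, tab-separated file into {gene_id: symbol}."""
    gene_set = {}
    with open(path, 'r', newline='') as tabfile:
        for row in csv.reader(tabfile, delimiter='\t'):
            if len(row) != 2:
                continue  # skip blank or malformed lines
            gene_id, symbol = row
            gene_set[gene_id] = symbol
    return gene_set

# Demonstration on a temporary file; the row layout is assumed.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('NCBIGene:29089\tUBE2T\n\n')
    demo_path = tmp.name

demo_set = load_gene_set(demo_path)
os.remove(demo_path)
print(demo_set)
```

With this in place, `core_complex = load_gene_set(fa1)` would stand in for the first block, and likewise for the other two sets.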

We can get all the triple patterns available in 'jaspar.nt'   
from this [GraphViz dot file](https://github.com/TomConlin/Jaspar_FA/blob/master/jaspar_target_model.gv) since we used it to guide   
generating the RDF data in the first place.  

![](https://github.com/TomConlin/Jaspar_FA/blob/master/jaspar_target_model.png?raw=true)



Composing the query by atomizing the GraphViz elements   
and selectively translating them with the same tables   
the data was generated with means   
the SPARQL query remains free of semantically obsolete clauses.

I did this manually, but in general, the rules were roughly:
  
-    remove angle brackets
-    add trailing dot 
-    replace BNODE: with __?__
-    if the element (predicate) is a curie then wrap it in a TT lookup
-    if the object is a LITERAL wrap it in quotes
-    give the items of interest specific __?names__
-    format to suit sensibilities
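
Although the translation was done by hand, the rules listed here are mechanical enough to sketch in code. The function below is a hypothetical illustration, not the method used to build the queries in this notebook. `TT_SAMPLE` holds only the three translation-table entries visible in the printed query, and the `names` argument stands in for the manual step of choosing `?names`.

```python
import re

# Subset of the translation table, copied from the printed query in this notebook.
TT_SAMPLE = {
    'RO:member of': 'RO:0002350',
    'SWO:Similarity score': 'SWO:0000425',
    'GENO:has_extent': 'GENO:0000678',
}

def gv_to_triple_pattern(line, tt, names=None):
    """Turn one '<s><p><o>' GraphViz-style element into a SPARQL triple pattern."""
    names = names or {}
    elements = re.findall(r'<([^<>]*)>', line)   # remove angle brackets
    if len(elements) != 3:
        raise ValueError('expected three <...> elements: ' + line)
    subj, pred, obj = elements

    def term(t, is_object=False):
        if t in names:                    # item of interest gets a chosen ?name
            return names[t]
        if t.startswith('BNODE:'):        # blank nodes become variables
            return '?' + t[len('BNODE:'):]
        if is_object and ':' not in t:    # a non-CURIE object is a literal
            return "'" + t + "'"
        return t

    # translate the predicate if the table knows it, then add the trailing dot
    return ' '.join([term(subj), tt.get(pred, pred), term(obj, True), '.'])

print(gv_to_triple_pattern(
    '<NCBIGene:fagene><SO:adjacent_to><BNODE:gene1_upstream_region>',
    TT_SAMPLE, {'NCBIGene:fagene': '?fagene'}))
print(gv_to_triple_pattern(
    '<BNODE:pairwise_similarity><SWO:Similarity score><0.73>', TT_SAMPLE))
```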

In particular we want: 
    candidate genes which share motif similarity with FA genes
    
    <NCBIGene:fagene><SO:adjacent_to><BNODE:gene1_upstream_region>
    <BNODE:gene1_upstream_region><RO:member of><BNODE:pairwise_similarity>
    <BNODE:gene2_upstream_region><RO:member of><BNODE:pairwise_similarity>
    <NCBIGene:xyz><SO:adjacent_to><BNODE:gene2_upstream_region> 
    
    when <NCBIGene:xyz> is not <NCBIGene:fagene> 
    
we may also be interested in limiting by region extent size   
or Jaccard similarity score

    <BNODE:pairwise_similarity><SWO:Similarity score><0.73>
    <BNODE:gene1_upstream_region><GENO:has_extent><1k>
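
For readers unfamiliar with the Jaccard score, here is a small worked example. It assumes the score is taken over the sets of motifs found in two upstream regions. The motif labels are hypothetical placeholders, not values from 'jaspar.nt'.

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical motif labels, for illustration only.
region1 = {'m1', 'm2', 'm3'}
region2 = {'m2', 'm3', 'm4', 'm5'}

score = jaccard(region1, region2)   # 2 shared motifs out of 5 distinct ones
print(score)
```

A score of 1 means both regions carry exactly the same motifs, and each extra motif that only one region has pushes the score down.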
    
    

Much of the effort to this point has been developing and maintaining a context  
in which we are able to __write a readable query__.  

Here, given a (fa) gene, we are looking for other genes   
with optimal matches (similarity=1) within their 1k start regions.

In [6]:
selectstr = ' '.join([
    'SELECT ?fagene ?candidate\n',
    'WHERE{\n',
        '?fagene', 'SO:adjacent_to', '?gene1_upstream_region .\n', 
        '?gene1_upstream_region',  TT['RO:member of'], '?pairwise_similarity .\n',
        '?gene2_upstream_region',  TT['RO:member of'], '?pairwise_similarity .\n',
        'FILTER(?gene1_upstream_region != ?gene2_upstream_region) \n', 
        '?candidate', 'SO:adjacent_to', '?gene2_upstream_region .\n',
        '?pairwise_similarity', TT['SWO:Similarity score'], "'1' .\n", 
        '?gene1_upstream_region', TT['GENO:has_extent'], "'1k' .\n",
    'FILTER(?fagene != ?candidate)\n}'
    ]) 

# note the absence of the opaque identifiers the query engine actually uses

query = prefixns + "\n" + selectstr
print(query)

PREFIX RO: <http://purl.obolibrary.org/obo/RO_>
PREFIX SO: <http://purl.obolibrary.org/obo/SO_>
PREFIX JASPAR: <http://fantom.gsc.riken.jp/5/sstar/JASPAR_motif:>
PREFIX OIO: <http://www.geneontology.org/formats/oboInOwl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX NCBIGene: <http://www.ncbi.nlm.nih.gov/gene/>
PREFIX GENO: <http://purl.obolibrary.org/obo/GENO_>
PREFIX SWO: <http://www.ebi.ac.uk/efo/swo/SWO_>

SELECT ?fagene ?candidate
 WHERE{
 ?fagene SO:adjacent_to ?gene1_upstream_region .
 ?gene1_upstream_region RO:0002350 ?pairwise_similarity .
 ?gene2_upstream_region RO:0002350 ?pairwise_similarity .
 FILTER(?gene1_upstream_region != ?gene2_upstream_region) 
 ?candidate SO:adjacent_to ?gene2_upstream_region .
 ?pairwise_similarity SWO:0000425 '1' .
 ?gene1_upstream_region GENO:0000678 '1k' .
 FILTER(?fagene != ?candidate)
}


In [7]:
# apply our query to each gene in the FA Core complex
candidate_set = {}
for fagene in core_complex:
    payload = {
        'format' : 'json', 
        # '$fagene':  fagene, 
        # BG is not accepting the curie. Wants IRI or LITERAL
        '$fagene': '<' + re.sub('NCBIGene:',PREFIX['NCBIGene'],fagene) + '>',
        'query': query
    }
    response = requests.post(bg_host + '/blazegraph/sparql', data=payload)
 
    resp = json.loads(response.text)
    if resp['results']['bindings'] != []:
        print(core_complex[fagene], '\t', 
              re.sub('NCBIGene:', PREFIX['NCBIGene'],fagene))
        candidate_set[core_complex[fagene]]=[]
        for hit in resp['results']['bindings']:
            candidate_set[core_complex[fagene]].append(hit['candidate']['value'])
            print('\t', hit['candidate']['value']) 

UBE2T 	 http://www.ncbi.nlm.nih.gov/gene/29089
	 http://www.ncbi.nlm.nih.gov/gene/3608
	 http://www.ncbi.nlm.nih.gov/gene/171392
	 http://www.ncbi.nlm.nih.gov/gene/340252
	 http://www.ncbi.nlm.nih.gov/gene/8125
	 http://www.ncbi.nlm.nih.gov/gene/54958
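
The `$fagene` binding in the loop is built by swapping the `NCBIGene:` prefix for its base IRI with `re.sub`. A slightly stricter sketch is shown here. `expand_curie` is a hypothetical helper, and the one-entry prefix map is copied from the prefix listing printed earlier.

```python
def expand_curie(curie, prefix_map):
    """Expand 'PREFIX:local' to a full IRI using a prefix map."""
    prefix, sep, local = curie.partition(':')
    if not sep or prefix not in prefix_map:
        raise KeyError('unknown prefix in ' + repr(curie))
    return prefix_map[prefix] + local

# One entry copied from the prefix map printed in this notebook.
prefix_sample = {'NCBIGene': 'http://www.ncbi.nlm.nih.gov/gene/'}

iri = expand_curie('NCBIGene:29089', prefix_sample)
binding = '<' + iri + '>'   # the IRI form Blazegraph accepts as a binding
print(binding)
```

Unlike a bare `re.sub`, this fails loudly when an identifier carries a prefix the map does not know.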


UBE2T is the only gene in the Fanconi Anemia core complex  
with optimal associations with other genes.  
Briefly, those associated genes are:   

- ILF2 http://www.ncbi.nlm.nih.gov/gene/3608  
    a transcription factor required for T-cell expression of the interleukin 2 gene

- ZNF675 https://www.ncbi.nlm.nih.gov/gene/171392  
    the novel zinc finger protein TIZ may play a role during osteoclast differentiation by modulating TRAF6 signaling activity.

- ZNF680 http://www.ncbi.nlm.nih.gov/gene/340252  
    observed to be expressed in thyroid but fairly uncharacterized

- ANP32A http://www.ncbi.nlm.nih.gov/gene/8125  
    expressed in lymph nodes & bone marrow  
    The tumor suppressor acidic nuclear phosphoprotein 32 family, member A 

- TMEM160 http://www.ncbi.nlm.nih.gov/gene/54958  
    Not much to see here

----
Where to go next

- We can look for ideal similarity in larger upstream regions, but counterintuitively larger regions average fewer associations, because the number of distinct motifs between the larger regions grows faster than the number of motifs the regions have in common (and I dropped any that dipped below the negotiable threshold of one part in five).
- We can look for less ideal similarity in the same 1k region 
- We can look for less ideal similarity in larger regions  

For now I am going with looking for less similarity within the same 1k start regions  

In [8]:
# small changes to expose the similarity score
selectstr = ' '.join([
    'SELECT ?fagene ?candidate ?score\n',
    'WHERE{\n',
        '?fagene', 'SO:adjacent_to', '?gene1_upstream_region .\n', 
        '?gene1_upstream_region',  TT['RO:member of'], '?pairwise_similarity .\n',
        '?gene2_upstream_region',  TT['RO:member of'], '?pairwise_similarity .\n',
        'FILTER(?gene1_upstream_region != ?gene2_upstream_region) \n', 
        '?candidate', 'SO:adjacent_to', '?gene2_upstream_region .\n',
        '?pairwise_similarity', TT['SWO:Similarity score'], '?score .\n', 
        '?gene1_upstream_region', TT['GENO:has_extent'], "'1k' .\n",
    'FILTER(?fagene != ?candidate)\n',
    #'FILTER(xsd:float(?score) >= 0.5)\n',
    '}\n',
    'ORDER by DESC(?score)\n'
    ])
query = prefixns + "\n" + selectstr
print(selectstr)

SELECT ?fagene ?candidate ?score
 WHERE{
 ?fagene SO:adjacent_to ?gene1_upstream_region .
 ?gene1_upstream_region RO:0002350 ?pairwise_similarity .
 ?gene2_upstream_region RO:0002350 ?pairwise_similarity .
 FILTER(?gene1_upstream_region != ?gene2_upstream_region) 
 ?candidate SO:adjacent_to ?gene2_upstream_region .
 ?pairwise_similarity SWO:0000425 ?score .
 ?gene1_upstream_region GENO:0000678 '1k' .
 FILTER(?fagene != ?candidate)
 }
 ORDER by DESC(?score)



In [9]:
# apply our query to each gene in the FA Core complex
# lowering the association threshold to 1 part in 5 similar 
# keeping the score for consideration in later processing
candidate_core_complex = {}
for fagene in core_complex:
    payload = {
        'format' : 'json', 
        '$fagene': '<' + re.sub('NCBIGene:',PREFIX['NCBIGene'],fagene) + '>',
        'query'  : query
    }
    response = requests.post(bg_host + '/blazegraph/sparql', data=payload)
    # print(response)
    resp = json.loads(response.text)
    # print(response.text)
    
    if resp['results']['bindings'] != []:
        # print(core_complex[fagene], '\t', 
              # re.sub('NCBIGene:', PREFIX['NCBIGene'], fagene))
        candidate_core_complex[core_complex[fagene]]={}
        for hit in resp['results']['bindings']:
            candidate_core_complex[core_complex[fagene]][
                hit['candidate']['value']] = hit['score']['value']
            # print('\t', hit['candidate']['value'], hit['score']['value'])

In [10]:
# print(candidate_core_complex)

In [11]:
# the next FA geneset is effector_proteins
candidate_effector_proteins = {}
for fagene in effector_proteins:
    payload = {
        'format' : 'json', 
        '$fagene': '<' + re.sub('NCBIGene:',PREFIX['NCBIGene'],fagene) + '>',
        'query'  : query
    }
    response = requests.post(bg_host + '/blazegraph/sparql', data=payload)
    # print(response)
    resp = json.loads(response.text)
    # print(response.text)
    
    if resp['results']['bindings'] != []:
        # print(effector_proteins[fagene], '\t', 
              # re.sub('NCBIGene:', PREFIX['NCBIGene'],fagene))
        candidate_effector_proteins[effector_proteins[fagene]]={}
        for hit in resp['results']['bindings']:
            candidate_effector_proteins[effector_proteins[fagene]][
                    hit['candidate']['value']] = hit['score']['value']
            # print('\t', hit['candidate']['value'], hit['score']['value'])

In [12]:
# and finally associated_proteins
candidate_associated_proteins = {}
for fagene in associated_proteins:
    payload = {
        'format' : 'json', 
        '$fagene': '<' + re.sub('NCBIGene:',PREFIX['NCBIGene'],fagene) + '>',
        'query'  : query
    }
    response = requests.post(bg_host + '/blazegraph/sparql', data=payload)
    # print(response)
    resp = json.loads(response.text)
    # print(response.text)
    
    if resp['results']['bindings'] != []:
        # print(associated_proteins[fagene], '\t', 
              # re.sub('NCBIGene:', PREFIX['NCBIGene'],fagene))
        candidate_associated_proteins[associated_proteins[fagene]]={}
        for hit in resp['results']['bindings']:
            candidate_associated_proteins[
                associated_proteins[fagene]][
                   hit['candidate']['value']] = hit['score']['value']
            # print('\t', hit['candidate']['value'], hit['score']['value'])
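
The three loops above handle the response in the same way each time. A sketch of that shared step follows. `bindings_to_scores` is a hypothetical helper, and the sample is hand-written in the SPARQL JSON results layout shown earlier, with a made-up score.

```python
import json

def bindings_to_scores(resp):
    """Map each candidate IRI to its score from a SPARQL JSON result."""
    return {hit['candidate']['value']: hit['score']['value']
            for hit in resp['results']['bindings']}

# Hand-written sample response; the score value is invented for illustration.
sample = json.loads('''
{
  "head": {"vars": ["fagene", "candidate", "score"]},
  "results": {"bindings": [
    {"candidate": {"type": "uri",
                   "value": "http://www.ncbi.nlm.nih.gov/gene/3608"},
     "score": {"type": "literal", "value": "0.5"}}
  ]}
}
''')

scores = bindings_to_scores(sample)
print(scores)
```

An empty `bindings` list gives an empty dict, so the caller can test `if scores:` before recording a gene as having hits.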

In [46]:
# Researchers would like to see gene symbols. 
# Here we are calling HGNC for only the freshest in gene fashion wear.
# http://rest.genenames.org/search/entrez_id/673
def entrez_symbol(ncbigene_uri):
    api_uri = re.sub(
        'http://www.ncbi.nlm.nih.gov/gene/',
        'http://rest.genenames.org/search/entrez_id/', ncbigene_uri)
    response = requests.get(api_uri, headers={'Accept': 'application/json'})
    if response.status_code == 200:
        hgnc = json.loads(response.text)
        if hgnc['response']['numFound'] > 0:
            symbol = hgnc['response']['docs'][0]['symbol']
        else:
            symbol = ""  # none found    
    else: # it will be ugly so someone will notice
        symbol = 'ERROR ' + api_uri + ' ' + str(response)
    return symbol

# quick tests
entrez_symbol('http://www.ncbi.nlm.nih.gov/gene/672')

entrez_symbol('http://www.ncbi.nlm.nih.gov/gene/-1')

In [50]:
# what have we got in the candidate sets?

# homoiconic coders, avert your eyes
fa_sets={
    'candidate_core_complex' : candidate_core_complex,
    'candidate_effector_proteins' : candidate_effector_proteins,
    'candidate_associated_proteins' : candidate_associated_proteins
}
# keep a stash of gene symbols
ncbigene_symbol = {}

for fas in fa_sets:
    print(fas +' set has', '(' + str(len(fa_sets[fas])) +') genes with hits')
    for fa in fa_sets[fas]:
        print('  ', fa, '(' + str(len(fa_sets[fas][fa])) +') associations')
        hits = []
        for c in fa_sets[fas][fa]:
            # try to be nice to our friends at HGNC
            if c not in ncbigene_symbol:
                ncbigene_symbol[c] = entrez_symbol(c)
            symbol = ncbigene_symbol[c]
            #print(symbol,'\t', c, fa_sets[fas][fa][c])
            hits.append([fa_sets[fas][fa][c], symbol, c])
        # reorder the set  
        hits.sort()
        # output the set

candidate_effector_proteins set has (4) genes with hits
   SLX4 (67) associations
EIF3E 	 http://www.ncbi.nlm.nih.gov/gene/3646 0.25
AREL1 	 http://www.ncbi.nlm.nih.gov/gene/9870 0.25
WDR46 	 http://www.ncbi.nlm.nih.gov/gene/9277 0.2
ZNF329 	 http://www.ncbi.nlm.nih.gov/gene/79673 0.2
MEA1 	 http://www.ncbi.nlm.nih.gov/gene/4201 0.2
N6AMT1 	 http://www.ncbi.nlm.nih.gov/gene/29104 0.25
MIS18A 	 http://www.ncbi.nlm.nih.gov/gene/54069 0.2
HEATR5B 	 http://www.ncbi.nlm.nih.gov/gene/54497 0.2
IKZF2 	 http://www.ncbi.nlm.nih.gov/gene/22807 0.25
PTGES3L 	 http://www.ncbi.nlm.nih.gov/gene/100885848 0.25
SUGP2 	 http://www.ncbi.nlm.nih.gov/gene/10147 0.2
PSPC1 	 http://www.ncbi.nlm.nih.gov/gene/55269 0.25
TMEM161B 	 http://www.ncbi.nlm.nih.gov/gene/153396 0.2
RAB14 	 http://www.ncbi.nlm.nih.gov/gene/51552 0.25
CCDC127 	 http://www.ncbi.nlm.nih.gov/gene/133957 0.25
RBFOX2 	 http://www.ncbi.nlm.nih.gov/gene/23543 0.25
SOS2 	 http://www.ncbi.nlm.nih.gov/gene/6655 0.2
SMG6 	 http://www.ncbi.nlm.nih

RGL2 	 http://www.ncbi.nlm.nih.gov/gene/5863 0.25
FBXW2 	 http://www.ncbi.nlm.nih.gov/gene/26190 0.333333
MED19 	 http://www.ncbi.nlm.nih.gov/gene/219541 0.2
C7orf49 	 http://www.ncbi.nlm.nih.gov/gene/78996 0.2
PALB2 	 http://www.ncbi.nlm.nih.gov/gene/79728 0.25
KIAA0825 	 http://www.ncbi.nlm.nih.gov/gene/285600 0.25
RPS6 	 http://www.ncbi.nlm.nih.gov/gene/6194 0.2
C2orf68 	 http://www.ncbi.nlm.nih.gov/gene/388969 0.333333
EMC7 	 http://www.ncbi.nlm.nih.gov/gene/56851 0.25
MCAT 	 http://www.ncbi.nlm.nih.gov/gene/27349 0.25
CAGE1 	 http://www.ncbi.nlm.nih.gov/gene/285782 0.2
KNOP1 	 http://www.ncbi.nlm.nih.gov/gene/400506 0.2
FBXW9 	 http://www.ncbi.nlm.nih.gov/gene/84261 0.5
   FAAP20 (30) associations
MYO1A 	 http://www.ncbi.nlm.nih.gov/gene/4640 0.25
ANKRD18A 	 http://www.ncbi.nlm.nih.gov/gene/253650 0.5
C8orf59 	 http://www.ncbi.nlm.nih.gov/gene/401466 0.2
ZNF414 	 http://www.ncbi.nlm.nih.gov/gene/84330 0.25
CEACAM1 	 http://www.ncbi.nlm.nih.gov/gene/634 0.25
DBNDD1 	 http://www.ncb

well they went in ordered by score... 
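
The scores come back as strings, and the listing above is not in score order. One way to present the hits best-first is to sort on the numeric value. This is a sketch only. `sort_hits` is a hypothetical helper, and the rows are a handful copied from the output above in the `[score, symbol, iri]` layout that `hits` uses.

```python
def sort_hits(hits):
    """Order [score, symbol, iri] rows by numeric score, best first."""
    return sorted(hits, key=lambda h: float(h[0]), reverse=True)

# A few rows copied from the output above, in [score, symbol, iri] order.
hits = [
    ['0.25', 'EIF3E', 'http://www.ncbi.nlm.nih.gov/gene/3646'],
    ['0.2', 'WDR46', 'http://www.ncbi.nlm.nih.gov/gene/9277'],
    ['0.5', 'FBXW9', 'http://www.ncbi.nlm.nih.gov/gene/84261'],
    ['0.333333', 'FBXW2', 'http://www.ncbi.nlm.nih.gov/gene/26190'],
]

for score, symbol, iri in sort_hits(hits):
    print(symbol, '\t', iri, score)
```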