## Normalization of Node Identifiers 

This notebook demonstrates the normalization of the node identifiers in a set of networks. A set of example networks is cloned to your account and then each network is updated in turn. The normalization of the identifiers is performed using the mygene.info resource. The example networks are copies of selected NCI-PID networks, made in 2017.

The tutorial demonstrates
* Using Network Sets
* Cloning networks
* Using the mygene.info resource
* Updating networks
* Using the @context aspect of a network

#### Updates to Each Node:

Identifiers are formatted using standard prefixes and the namespaces used are defined in the @context aspect of the network. 
* Node "name" = HGNC gene symbol (without prefix), e.g. **TP53**
* Node "represents" = NCBI gene id, e.g. **ncbigene:7157**
* The *former* node name is added to the values of the node attribute "alias" (i.e. the aliases)
* The HGNC gene symbol *with* prefix is added to the aliases
* UniProt identifiers in the aliases are updated to use the standard "uniprot" prefix
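
The rules listed above can be illustrated on a plain dictionary. This is only a sketch: `normalize_node` and the dictionary layout are hypothetical stand-ins for illustration, whereas the notebook itself operates on NiceCX node objects, and the symbol and gene id are passed in by hand rather than looked up.

```python
def normalize_node(node, symbol, gene_id):
    """Return a copy of `node` (a plain dict) updated per the rules above."""
    updated = dict(node)
    aliases = []
    for alias in node.get("alias", []):
        # Split on the last colon: "UniProt:P04637" -> ("UniProt", ":", "P04637")
        prefix, _, local_id = alias.rpartition(":")
        if "uniprot" in prefix.lower():
            # Rewrite any UniProt-style prefix to the standard "uniprot" prefix
            aliases.append("uniprot:" + local_id)
        else:
            aliases.append(alias)
    aliases.append(node["name"])              # former node name becomes an alias
    aliases.append("hgnc.symbol:" + symbol)   # prefixed gene symbol becomes an alias
    updated["name"] = symbol                  # unprefixed HGNC symbol
    updated["represents"] = "ncbigene:" + str(gene_id)
    updated["alias"] = aliases
    return updated


example = {"name": "P53_HUMAN", "alias": ["UniProt:P04637"]}
normalized = normalize_node(example, "TP53", 7157)
print(normalized)
```

The input dictionary is left unchanged, so the before and after states can be compared side by side.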


### Import Packages

In [1]:
import ndex2
import json
import requests
from os.path import isfile, expanduser
import ndex2.client as nc
from datetime import datetime

### NDEx Credentials

Get the username and password to access your account from ndex_tutorial_config.json in your home directory. This file should have the following structure:

    {
      "username" : "<my_username>",
      "password" : "<my_password>"
    }
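
Reading a file with this structure could be wrapped in a small helper, sketched below. `load_credentials` is a hypothetical name, not part of ndex2, and the demonstration writes a throwaway file in a temporary directory rather than touching the real config in your home directory.

```python
import json
import os
import tempfile


def load_credentials(path):
    """Return (username, password) from a JSON config file, or (None, None)."""
    if not os.path.isfile(path):
        return None, None
    with open(path, "r") as handle:
        data = json.load(handle)
    username = data.get("username")
    password = data.get("password")
    # Both values must be present and non-empty
    if username and password:
        return username, password
    return None, None


# Demonstrate with a temporary file that mirrors the structure shown above
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "ndex_tutorial_config.json")
    with open(demo_path, "w") as handle:
        json.dump({"username": "<my_username>", "password": "<my_password>"}, handle)
    demo_credentials = load_credentials(demo_path)
    missing_credentials = load_credentials(os.path.join(tmp_dir, "absent.json"))

print(demo_credentials)
print(missing_credentials)
```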


In [2]:
config_file = expanduser("~/ndex_tutorial_config.json")
my_username = None
my_password = None
my_server = 'public.ndexbio.org'
save_tutorial_networks_to_my_account = True

if isfile(config_file):
    # The context manager closes the file even if json.load raises
    with open(config_file, "r") as config_handle:
        data = json.load(config_handle)
    if data.get("password") and data.get("username"):
        my_username = data.get("username")
        my_password = data.get("password")
    else:
        print("Error: " + config_file + " does not define username and password")
else:
    print("Error: " + config_file + " was not found")

### Get the Example Network Set by UUID

In [3]:
#my_network_set = '71cde621-deb7-11e7-adc1-0ac135e8bacf'
my_network_set = 'a70f9299-d6f0-11e7-adc1-0ac135e8bacf'

cravat_username = 'cravat2017'
cravat_password = 'cravat2017'

ndex2_client = nc.Ndex2(host=my_server, username=my_username, password=my_password, debug=True)
ndex2_client_cravat = nc.Ndex2(host=my_server, username=cravat_username, password=cravat_password, debug=True)

set_response = ndex2_client_cravat.get_network_set(my_network_set)
uuids = set_response.get('networks')  # UUIDs of the networks in the set

print('Number of networks: ' + str(len(uuids)))

GET route: http://public.ndexbio.org/v2/networkset/a70f9299-d6f0-11e7-adc1-0ac135e8bacf
status code: 200
Number of networks: 212


### Functions to Access mygene.info and Update Nodes

In [4]:
def query_mygene_x(q, tax_id='9606', entrezonly=True):
    # Single-term GET query; returns the top hit, or False if there is none.
    # Passing params lets requests URL-encode the query term.
    params = {'q': q, 'species': tax_id}
    if entrezonly:
        params['entrezonly'] = 'true'
    r = requests.get('http://mygene.info/v3/query', params=params)
    result = r.json()
    hits = result.get("hits")
    if hits and len(hits) > 0:
        return hits[0]
    return False


def query_batch(query_string, tax_id='9606', scopes="symbol, entrezgene, alias, uniprot", fields="symbol, entrezgene, uniprot"):
    data = {'species': tax_id,
            'scopes': scopes,
            'fields': fields,
            'q': query_string}
    r = requests.post('http://mygene.info/v3/query', data)

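    # Named "result" so the imported json module is not shadowed
    result = r.json()

    # A dict (rather than a list of hits) without 'success' signals a failed query
    if isinstance(result, dict) and not result.get('success'):
        return []
    else:
        return result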


def query_mygene(q):
    # Return (symbol, Entrez gene id) for the first usable hit, or None
    hits = query_batch(q)
    for hit in hits:
        symbol = hit.get('symbol')
        gene_id = hit.get('entrezgene')
        if symbol and gene_id:
            return (symbol, gene_id)
    return None

# per node update method
def update_node(node, nicecx):
    #print("\nnode %s" % node.get_name())
    original_node_name = node.get_name().replace('_HUMAN', '')
    aliases = nicecx.get_node_attribute(node, "alias")
    #print("aliases: %s" % aliases)
    query_string = ''

    alias_tmp = []
    non_uniprot_aliases = []

    if aliases is not None:
        for alias in aliases:
            # assume uniprot --> uniprot:1234
            split_alias = alias.split(':')
            id = split_alias[-1]
            #print(alias)
            if 'uniprot' not in split_alias[0].lower():
                non_uniprot_aliases.append(alias)
            else:
                query_string += str(id) + ' '
            
        #print(query_string)
        #print(non_uniprot_aliases)
        #print(alias_tmp)
        hits = query_batch(query_string)
        #print(hits)
    else:
        hits = []
        
    alias_tmp = []
    for hit in hits:
        symbol = hit.get('symbol')
        id = hit.get('entrezgene')
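        # In batch results, 'query' echoes the submitted term (here the UniProt
        # accession); '_id' is mygene.info's gene id, not a UniProt identifier
        uniprot = hit.get('query')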
        
        if symbol and id:
            if len(alias_tmp) > 0: 
                #print(hits)
                # IF WE REACH THIS POINT WE ASSUME THIS
                # IS A PROTEIN FAMILY. RESET THE NODE
                # BACK TO THE PROTEIN FAMILY NAME
                node.set_node_name(original_node_name)
                nicecx.set_node_attribute(node, 'type', 'Protein_Family')

            else:
                node.set_node_name(symbol)
                node.set_node_represents('ncbigene:' + str(id))

            alias_tmp.append('hgnc.symbol:' + symbol)
            alias_tmp.append('ncbigene:' + str(id))

            if uniprot is not None:
                alias_tmp.append('uniprot:' + uniprot)
            else:
                raise Exception('uniprot not found ' + symbol)
        
    if len(alias_tmp) > 0:
        alias_tmp = list(set(alias_tmp).union(set(non_uniprot_aliases)))
        nicecx.set_node_attribute(node, "alias", alias_tmp)
    else:
        # No gene hit: keep the original name (minus any _HUMAN suffix)
        node.set_node_name(original_node_name)
    
    '''
    hit = query_mygene(node.get_name())
    if hit:
        succeed = True
        if (len(hit) > 0):
            node.set_node_name(hit[0])
        if (len(hit) > 1):
            node.set_node_represents('ncbigene:' + str(hit[1]))
        print("hit: %s" % json.dumps(hit, indent=4))
    else:
        succeed = False
        for alias in aliases:
            # assume uniprot
            id = alias.split(':')[-1]
            hit = query_mygene(id)
            if hit:
                print("hit: %s" % json.dumps(hit, indent=4))
                succeed = True
                if(len(hit) > 0):
                    node.set_node_name(hit[0])
                if(len(hit) > 1):
                    node.set_node_represents('ncbigene:' + str(hit[1]))

                break
        if not succeed:
            print("no gene hit for node %s " % node.get_name())
    '''
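
The alias handling in `update_node` can be tried offline, without calling mygene.info. The sketch below separates UniProt accessions from other aliases, builds the batch query string, and applies the same "more than one gene hit means protein family" rule. `partition_aliases` and `count_gene_hits` are hypothetical helper names, and `mock_hits` is hand-written stand-in data, not captured service output.

```python
def partition_aliases(aliases):
    """Split aliases into UniProt accessions (to query) and all other aliases."""
    uniprot_ids = []
    other_aliases = []
    for alias in aliases or []:
        parts = alias.split(":")
        if "uniprot" in parts[0].lower():
            uniprot_ids.append(parts[-1])   # keep only the accession
        else:
            other_aliases.append(alias)     # keep the alias unchanged
    return uniprot_ids, other_aliases


def count_gene_hits(hits):
    """Count hits that carry both a gene symbol and an Entrez gene id."""
    return sum(1 for hit in hits if hit.get("symbol") and hit.get("entrezgene"))


aliases = ["UniProt:P04637", "cas:50-00-0", "uniprot:Q00987"]
uniprot_ids, other_aliases = partition_aliases(aliases)
query_string = " ".join(uniprot_ids)

# Hand-written stand-ins for batch query hits (assumed shape, for illustration only)
mock_hits = [
    {"query": "P04637", "symbol": "TP53", "entrezgene": 7157},
    {"query": "Q00987", "symbol": "MDM2", "entrezgene": 4193},
    {"query": "XXXXXX", "notfound": True},
]

# Two or more resolved genes for one node is treated as a protein family
is_protein_family = count_gene_hits(mock_hits) > 1

print(query_string)
print(other_aliases)
print(is_protein_family)
```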

In [5]:
# Create the output network set, then normalize networks and add them to it
# HUGO example: hgnc.symbol:TP53 --> non-prefixed HUGO symbol
# Entrez NCBI example: ncbigene:7157 --> represents
# Aliases with prefixes
# iteration over networks

net_set_uuid = ndex2_client.create_networkset('Normalized Networks' + str(datetime.now()), 'Normalized Networks')
net_set_uuid = net_set_uuid.split('/')[-1]
print('Network set uuid:')
print(net_set_uuid)
add_these_networks = []
count = 0
for network_uuid in uuids:
    # load network in NiceCX
    ncx = ndex2.create_nice_cx_from_server(server=my_server, uuid=network_uuid, 
                                           username=cravat_username, password=cravat_password)
    context = [{'ncbigene': 'http://identifiers.org/ncbigene/',
                'hgnc.symbol': 'http://identifiers.org/hgnc.symbol/',
                'uniprot': 'http://identifiers.org/uniprot/',
                'cas': 'http://identifiers.org/cas/',
                'entrez': 'http://www.ncbi.nlm.nih.gov/gene/'}]
    ncx.set_context(context)
    for id, node in ncx.get_nodes():
        update_node(node, ncx)
    
    #ncx.set_name('Normalized Nodes ' + str(count))
    if save_tutorial_networks_to_my_account:
        upload_message = ncx.upload_to(my_server, my_username, my_password)
        net_uuid = upload_message.split('/')[-1]
        print(net_uuid)
        add_these_networks.append(net_uuid)

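    count += 1
    print(count)
    # Stop after the first network to keep the demonstration short;
    # remove this check to process every network in the set
    if count > 0:
        break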

print('Adding to network set')
ndex2_client.add_networks_to_networkset(net_set_uuid, add_these_networks)        
print('Done adding to network set')

print('------------------------ DONE ------------------------')
#print(ncx.to_cx())

    

POST route: http://public.ndexbio.org/v2/networkset
POST json: {"name": "Normalized Networks2018-01-09 10:27:15.906207", "description": "Normalized Networks"}
status code: 201
response text: http://public.ndexbio.org/v2/networkset/b7f451b3-f56a-11e7-adc1-0ac135e8bacf
Network set uuid:
b7f451b3-f56a-11e7-adc1-0ac135e8bacf
b94da024-f56a-11e7-adc1-0ac135e8bacf
1
Adding to network set
POST route: http://public.ndexbio.org/v2/networkset/b7f451b3-f56a-11e7-adc1-0ac135e8bacf/members
POST json: ["b94da024-f56a-11e7-adc1-0ac135e8bacf"]
status code: 201
response text: http://public.ndexbio.org/v2/networkset/b7f451b3-f56a-11e7-adc1-0ac135e8bacf/members
Done adding to network set
------------------------ DONE ------------------------
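
The `@context` set on each network maps identifier prefixes to base URLs, so a prefixed identifier such as `ncbigene:7157` can be expanded into a full URL. A minimal sketch of that expansion, assuming simple concatenation of base URL and local id; `expand_identifier` is a hypothetical helper, not an ndex2 function.

```python
# Subset of the context used in the loop above
context = {
    'ncbigene': 'http://identifiers.org/ncbigene/',
    'hgnc.symbol': 'http://identifiers.org/hgnc.symbol/',
    'uniprot': 'http://identifiers.org/uniprot/',
}


def expand_identifier(identifier, context):
    """Expand 'prefix:id' to a URL; return None if the prefix is not in the context."""
    prefix, separator, local_id = identifier.partition(":")
    if not separator or prefix not in context:
        return None
    return context[prefix] + local_id


print(expand_identifier("ncbigene:7157", context))
print(expand_identifier("hgnc.symbol:TP53", context))
print(expand_identifier("unknown:123", context))
```

Returning `None` for unknown prefixes makes it easy to spot identifiers whose namespace is missing from the context.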
