In [1]:
from sys import path
path.append('../src')
from NLEval import graph, valsplit, label, model
from sklearn.metrics import roc_auc_score as auroc
import numpy as np
import pandas as pd

This function takes some of the inputs a user might be able to choose from on the website.

In [23]:
def example(input_genes, model_type, network, GSC, numbers_of_top_genes):
    # load graph and labelset collection
    data_path = '../data/' # path to data
    # load graph
    if network == 'STRING-EXP': # options will eventually be STRING, STRING-EXP, BioGRID, GIANT-TN
        g = graph.DenseGraph.DenseGraph.from_edglst(data_path \
            + 'networks/String_experiments.edg', weighted=True, directed=False)
    elif network == 'BioGRID':
        g = graph.DenseGraph.DenseGraph.from_edglst(data_path \
            + 'networks/BioGRID_3.4.136.edg', weighted=True, directed=False)
    else:
        # fail early instead of hitting an undefined `g` further down
        raise ValueError("Unsupported network: %r" % network)
    # load label (gene) set collection
    if GSC == 'KEEG': # 'KEEG' selects the KEGG collection; options will eventually be KEGG, DisGeNet and GO
        lsc = label.LabelsetCollection.SplitLSC.from_gmt(data_path + 'labels/c2.cp.kegg.v6.1.entrez.BP.gsea-min10-max200-ovlppt7-jacpt5.nonred.gmt')
    elif GSC == 'DisGeNet':
        lsc = label.LabelsetCollection.SplitLSC.from_gmt(data_path + 'labels/disgenet_disease-genes_prop.gsea-min10-max600-ovlppt7-jacpt5.nonred.gmt')
    else:
        # fail early instead of hitting an undefined `lsc` further down
        raise ValueError("Unsupported gene set collection: %r" % GSC)
    # initialize model
    if model_type == 'SLA': # options will eventually be SLA, SLI or SLE
        SL_A = model.SupervisedLearning.LogReg(g, penalty='l2', solver='lbfgs')
    else:
        # only SLA is implemented so far
        raise ValueError("Unsupported model type: %r" % model_type)

    # register the user's genes as a new labelset named 'New'
    lsc.add_labelset(input_genes, 'New')
    # train and get genome-wide prediction scores
    score_dict = SL_A.predict(lsc.get_labelset('New'), lsc.get_negative('New'))

    
    # print the top-ranked genes and their intersection with the known ones
    top_list = sorted(score_dict, key=score_dict.get, reverse=True)[:numbers_of_top_genes]
    intersection = list(set(top_list) & lsc.get_labelset('New'))
    print("Top %d genes: %s" % (numbers_of_top_genes, repr(top_list)))
    print("Known genes in top %d: %s" % (numbers_of_top_genes, repr(intersection)))
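
The ranking step at the end of `example` can be tried out without loading any network. Below is a minimal sketch of that step on its own, assuming only that the scores come back as a dict mapping gene ID to score. The helper name `top_k_with_known` and the toy scores are made up for illustration and are not part of NLEval.

```python
def top_k_with_known(score_dict, known_genes, k):
    """Return the k highest-scoring genes and the ones among them that are already known."""
    # Sort gene IDs by descending score and keep the first k
    top_list = sorted(score_dict, key=score_dict.get, reverse=True)[:k]
    known = set(known_genes)
    # Keep ranking order for the known genes (a plain set intersection would lose it)
    hits = [gene for gene in top_list if gene in known]
    return top_list, hits


# Toy scores, made up for illustration only
toy_scores = {'g1': 0.9, 'g2': 0.1, 'g3': 0.7, 'g4': 0.4}
top_list, hits = top_k_with_known(toy_scores, ['g3', 'g2'], k=2)
print(top_list)  # ['g1', 'g3']
print(hits)      # ['g3']
```

Unlike the `set` intersection used in `example`, this version keeps the known genes in rank order, which may be easier to read on the website.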

The gene list below comes from a known pathway picked at random; I have already forgotten its name.

In [20]:
input_genes = ['6457', '7037', '57403', '3134', '50807', '93343', '11311', '8766', '5584', '137492', '998', 
               '30011', '5337', '3312', '155', '10015', '55738', '57132', '153', '116986', '163', '11267', 
               '1950', '3559', '6714', '84249', '2066', '29924', '1213', '30846', '84612', '440073', '2060', 
               '3303', '3561', '9101', '51160', '56904', '3304', '23527', '5878', '3560', '7189', '3949', 
               '92421', '26286', '5979', '9922', '11031', '116983', '2261', '9230', '5867', '64145', 
               '867', '57154', '84313', '3577', '116987', '10617', '1436', '200576', '83737', '23396', '3310', '5590', '3133', '382', '6456', 
               '30845', '868', '2264', '5868', '84440', '116984', '5869', '23624', '22841', '161', 
               '23096', '5338', '652614', '84552', '51028', '55616', '9829', '3815', '29082', '9135', '23362', '9146', '128866', '156', 
               '8218', '89853', '154', '64744', '9525', '84364', '9727', '23550', '8853', '1956', '8395', '6455', '64411', 
               '5156', '51100', '8027', '408', '3305', '51534', '2868', '9744', '3106', '51652', '3265', '27243', '10938', 
               '60682', '157', '26056', '10059', '2321', '80230', '1173', '1175', '160', '3306', '3135', '1234', '2149', 
               '8411', '3791', '51510', '23327', '409', '11059', '3579', '27183', '8396', '1601', '1211', '3480', 
               '9815', '26119', '64750', '26052', '4914', '25978', '8394', '1212', '30844', '131890', '79720', 
               '7251', '50855', '116985', '5662', '2870', '10193', '1785', '155382', '652799', '22905', '3105', 
               '55048', '10254', '55040', '7852', '1759', '4193', '2869', '2065', '6011', '4734', '28964', 
               '4233', '80223', '79643', '3107', '2263', '56288']

## Below are a few examples of how the results can be generated

Only the SL_A method is implemented right now (I think; I still have to read the code more). This example uses the STRING-EXP network as the features in the machine learning model and the KEGG label (gene) set collection (passed as 'KEEG') to select the negatives.

In [21]:
example(input_genes,'SLA','STRING-EXP','KEEG',50)

Top 50 genes: ['836', '6885', '1398', '4690', '4792', '8453', '6923', '7205', '1020', '1459', '409', '10713', '10772', '6389', '2147', '51588', '5887', '6434', '7046', '1111', '7535', '2771', '29924', '19', '29896', '7508', '56259', '3916', '356', '6184', '5292', '1978', '7099', '51176', '6470', '3732', '5356', '4728', '839', '9156', '5881', '5436', '4318', '2903', '5214', '23530', '4363', '11267', '440', '1154']
Known genes in top 50: ['409', '29924', '11267']
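
To make "using the gene set collection to select the negatives" concrete, here is a rough sketch of the simplest scheme I can think of: every gene annotated somewhere in the collection that is not in the positive set counts as a negative. This is an assumption for illustration only; I have not checked that `lsc.get_negative` works this way, and `simple_negatives` and the toy collection are hypothetical.

```python
def simple_negatives(labelsets, positives):
    """Genes annotated somewhere in the collection but absent from the positive set."""
    # Union of all gene sets in the collection
    annotated = set().union(*labelsets.values())
    return annotated - set(positives)


# Toy collection, made up for illustration only
toy_collection = {
    'setA': {'1', '2', '3'},
    'setB': {'3', '4'},
    'setC': {'5'},
}
negatives = simple_negatives(toy_collection, positives={'1', '2'})
print(sorted(negatives))  # ['3', '4', '5']
```

Under this scheme, switching the collection changes the pool of candidate negatives, which is one reason the results can differ between collections.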


Here only the network has been changed, to BioGRID.

In [24]:
example(input_genes,'SLA','BioGRID','KEEG',50)

Top 50 genes: ['1398', '8453', '409', '7205', '836', '29896', '6923', '7046', '1459', '2771', '117178', '6470', '4792', '6885', '5887', '7124', '4690', '6184', '84612', '5436', '483', '10772', '51588', '356', '57448', '1111', '56259', '6389', '8697', '5214', '10713', '224', '6434', '8795', '230', '1969', '4363', '6301', '1020', '10059', '9047', '5698', '6726', '440', '5356', '11267', '29924', '873', '2539', '4728']
Known genes in top 50: ['409', '29924', '11267', '84612', '10059']


Here the negatives are selected based on the DisGeNet label (gene) set collection instead.

In [25]:
example(input_genes,'SLA','BioGRID','DisGeNet',50)

Top 50 genes: ['26270', '8453', '1398', '4088', '10014', '8878', '10013', '6885', '29896', '4792', '3084', '22919', '4690', '3799', '1111', '5887', '6168', '51574', '2078', '8431', '10772', '5932', '2740', '6281', '23530', '356', '1978', '1630', '659', '1020', '1969', '5211', '9341', '26258', '8412', '10979', '5159', '29894', '7508', '4728', '23186', '51176', '144811', '29082', '4625', '8411', '5978', '2176', '27086', '3732']
Known genes in top 50: ['29082', '8411']
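
One quick way to see how much the three runs agree is to compare the known genes each one recovered. The sketch below uses the "Known genes in top 50" lists printed above and computes a pairwise Jaccard index; choosing Jaccard as the measure is my own assumption, not something the code base does.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two gene lists: |intersection| / |union|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Known genes recovered in the top 50, copied from the three outputs above
known_hits = {
    'STRING-EXP / KEEG': ['409', '29924', '11267'],
    'BioGRID / KEEG': ['409', '29924', '11267', '84612', '10059'],
    'BioGRID / DisGeNet': ['29082', '8411'],
}

overlaps = {
    (x, y): jaccard(known_hits[x], known_hits[y])
    for x, y in combinations(known_hits, 2)
}
for (x, y), value in overlaps.items():
    print("%s vs %s: %.2f" % (x, y, value))
```

The two KEEG runs share three of five recovered genes, while the DisGeNet run shares none with either, so the choice of collection for the negatives seems to matter more here than the choice of network.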


A few notes

1. All the data in here is small for now, so I will push it with the repo. Once we get the big networks, we will have to decide the best way to handle them.
2. In the function I list some options that could be added for each input. I'll keep working on that over the next week.
3. Right now the negatives are selected by the GSC function argument. This might change to neutral negatives, in which case this argument may go away, but for a while it should be good.
4. A big feature that is missing is comparing the model weights from the ML model trained on input_genes to the model weights of known gene sets. This requires training and storing the other models and adding code. I should have this done next week.
5. I think it is even possible to use this to start implementing the webserver output that displays the results on the network. The network is there, there is a list of input_genes, and we can select the top genes to add to this set. I'm a little confused about which coding you do for this and which I do, so we can talk about that at some point.
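
For note 4, here is a rough sketch of what the weight comparison could look like, assuming each trained model can be reduced to a plain list of weights over the same ordered set of network features. Cosine similarity is just one possible measure, and the names `cosine_similarity`, `rank_similar_models`, and the toy weights are all hypothetical; none of this exists in the code yet.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat an all-zero model as similar to nothing
    return dot / (norm_u * norm_v)


def rank_similar_models(query_weights, stored_weights):
    """Rank stored models by similarity to the query model, most similar first."""
    sims = {
        name: cosine_similarity(query_weights, weights)
        for name, weights in stored_weights.items()
    }
    return sorted(sims.items(), key=lambda item: item[1], reverse=True)


# Toy weights, made up for illustration only
query = [1.0, 0.0, 1.0]
stored = {
    'pathway_A': [2.0, 0.0, 2.0],
    'pathway_B': [0.0, 1.0, 0.0],
}
ranking = rank_similar_models(query, stored)
for name, sim in ranking:
    print("%s: %.3f" % (name, sim))
```

The stored models would have to be trained on the same network as the query model so that the weight vectors line up feature by feature.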