# Putting the gene annotation together for the Disrupt/Biograph project

After extensive discussion with Sebastian, we have decided that everything in our database should reflect a "graph-like" structure. The primary structure is the gene, keyed by gene_id, which needs to be annotated. Patients and patient-specific graphs follow.

For a specific gene, we will have the following:
    
   - meta_information
   - node (strictly calculable)
   - edges (empty)
   
"Strictly calculable" means that the value is a number and can be fed directly into machine learning and kernel methods. The open question at the moment is how to combine the different data sources.

The driver annotation will be pivotal to the database. I have put it together from various sources, and it is already in the clinical reporting database.

Next come drug interactions and target information. These could go into meta_information. However, for categorical data I would prefer to store one-hot vectors in the node field. (Then do it.)
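As a rough illustration of the one-hot idea, here is a minimal sketch in plain Python. The helper name `one_hot`, the `prefix` argument and the flattened key naming are assumptions for illustration; the category list is taken from the `driver_type` values in the Gene Module schema.

```python
# Categories taken from the driver_type field of the schema (None is treated as "no category").
DRIVER_TYPES = ["TSG", "Oncogene", "TSG/Oncogene", "Unknown"]


def one_hot(value, categories, prefix):
    """Map a categorical value to {prefix_category: 0 or 1} entries.

    A value that is not in `categories` (for example None) yields all zeros.
    """
    return {"{}_{}".format(prefix, c): int(value == c) for c in categories}


# Hypothetical node for an oncogene: every entry stays numeric.
node = {"is_driver": True}
node.update(one_hot("Oncogene", DRIVER_TYPES, prefix="driver_type"))
print(node)
```

Flattening the vector into separate numeric fields keeps every node value "strictly calculable"; storing it as a single list would work just as well if the consumer expects arrays.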




driver_consensus_score
    

# Gene Module
## gene_id: mongodb
### meta_information
    
   * node_type (string) ["Gene",...]
   * species (string) ["hsa",...]
   * cellular_process (string) ["Genome Maintenance",...]
   * core_pathway (string) ["DNA Damage Control",...]
   * gene_symbol (string) ["ABL1",...]
   * entrez_id (string)
   * gene_family_id (string)
   * ensembl_id (string)
   * uniprot_id (string)
   * cancer
     * driver_type [None, TSG, Oncogene, Unknown, TSG/Oncogene]
     * driver_pmid [None]
     * driver_source [None]
   
### Node

   * is_driver (bool) [True,False]
   * driver_OncodriveROLE_prob (float) [None, 0...1]
   * driver_consensus_score (int) [None,#]
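Since the Node section is meant to hold only "strictly calculable" values, a small check can catch categorical data that slipped in. This is a minimal sketch: the function names and the choice to accept None as "missing" are assumptions, and the example node simply reuses the field names listed above plus one deliberately misplaced string.

```python
import math
from numbers import Real


def is_strictly_calculable(value, allow_missing=True):
    """True for finite real numbers (bools count as 0/1); None only if allowed."""
    if value is None:
        return allow_missing
    return isinstance(value, Real) and math.isfinite(value)


def non_calculable_fields(node):
    """Return the sorted keys of a node dict whose values are not plain numbers."""
    return sorted(k for k, v in node.items() if not is_strictly_calculable(v))


# Hypothetical gene node; "driver_type" is a string and belongs in meta_information.
node = {
    "is_driver": True,
    "driver_OncodriveROLE_prob": None,
    "driver_consensus_score": 4,
    "driver_type": "TSG",
}
print(non_calculable_fields(node))
```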
   
    

# Patient Module
### meta_information
   * patient_id (string)
   * disease_type (string)["Cancer",...]
   * disease_sub_type (string)["HCC",...]
   * vep_polyphen_category (string) [None, benign,...]
   * vep_sift_category (string) [None, tolerated,...]
   * vep_impact (string) [None, MODERATE,...]
   * vep_consequence (string) [None, missense_variant]

### Node   
   * gene_id
       * tumor_af (float) [None, 0...1]
       * vep_LOFtool (float) [None, 0...1]
       * vep_polyphen (float) [None, 0...1]
       * vep_sift (float) [None, 0...1]
       
   * gene_id   
       * ...

In [None]:
import pandas as pd

# Import Driver gene tables

In [3]:
df_scored = pd.read_pickle("/Users/Heisenberg/pythonProjects/biograph_seeding/data/driver_genes_DataFrame.pkl")

In [6]:
df_scored.head()

| Unnamed: 0 | Core pathway | OncodriveROLE_prob | Process | driver_type | gene_symbol | pmid | source_name | score |
|---|---|---|---|---|---|---|---|---|
| 0 | | | | TSG | ABI1 | 14993899 | Cosmic | 1 |
| 1 | | | | Oncogene | ABL1 | 14993899 | Cosmic | 4 |
| 616 | Cell Cycle/Apoptosis | | Cell Survival | Oncogene | ABL1 | 23539594 | Vogelstein | 4 |
| 2135 | | | | Oncogene | ABL1 | 14681372 | Uniprot | 4 |
| 2367 | | | | Oncogene | ABL1 | 25759023 | Rubio-Perez | 4 |


# prototype the driver insert

In [4]:
gene_symbol = "B2M"
# all driver rows for this gene
gene_rows = df_scored.loc[df_scored["gene_symbol"] == gene_symbol]
driver_score = gene_rows["score"].mean()
# first non-null OncodriveROLE probability (a float, as in the Node schema), else None
role_probs = gene_rows["OncodriveROLE_prob"].dropna()
driver_oncodriveROLE = float(role_probs.iloc[0]) if not role_probs.empty else None

driver_source = []
base_cols = ["driver_type", "source_name", "pmid"]

for _, row in gene_rows.iterrows():
    # only the Vogelstein table carries pathway and process annotations
    if row["source_name"] == "Vogelstein":
        driver_source.append(row[base_cols + ["Core pathway", "Process"]].to_dict())
    else:
        driver_source.append(row[base_cols].to_dict())
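To make the target shape of the driver record explicit, here is a minimal sketch in plain Python that mirrors the prototype cell without pandas. The function name `summarise_driver` and the list-of-dicts input are assumptions for illustration; the example rows are the ones shown in the `df_scored.head()` output.

```python
# Example rows copied from the df_scored.head() output (empty cells omitted).
rows = [
    {"gene_symbol": "ABI1", "driver_type": "TSG", "pmid": 14993899,
     "source_name": "Cosmic", "score": 1},
    {"gene_symbol": "ABL1", "driver_type": "Oncogene", "pmid": 14993899,
     "source_name": "Cosmic", "score": 4},
    {"gene_symbol": "ABL1", "driver_type": "Oncogene", "pmid": 23539594,
     "source_name": "Vogelstein", "score": 4,
     "Core pathway": "Cell Cycle/Apoptosis", "Process": "Cell Survival"},
    {"gene_symbol": "ABL1", "driver_type": "Oncogene", "pmid": 14681372,
     "source_name": "Uniprot", "score": 4},
    {"gene_symbol": "ABL1", "driver_type": "Oncogene", "pmid": 25759023,
     "source_name": "Rubio-Perez", "score": 4},
]

BASE_KEYS = ("driver_type", "source_name", "pmid")
VOGELSTEIN_KEYS = BASE_KEYS + ("Core pathway", "Process")


def summarise_driver(rows, gene_symbol):
    """Collapse all driver rows of one gene into a node/meta summary."""
    hits = [r for r in rows if r["gene_symbol"] == gene_symbol]
    if not hits:
        return {"is_driver": False}
    sources = []
    for r in hits:
        # only Vogelstein rows carry pathway and process annotations
        keys = VOGELSTEIN_KEYS if r["source_name"] == "Vogelstein" else BASE_KEYS
        sources.append({k: r.get(k) for k in keys})
    return {
        "is_driver": True,
        "driver_score": sum(r["score"] for r in hits) / float(len(hits)),
        "driver_source": sources,
    }


summary = summarise_driver(rows, "ABL1")
print(summary["driver_score"], len(summary["driver_source"]))
```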

In [5]:
if not df_scored.loc[df_scored["gene_symbol"]=="EGFR"].empty:
    print("EGFR is in the driver table")

EGFR is in the driver table


In [15]:
# look for symbols stored as the literal string "NaN"; real missing values would need .isna()
df_scored.loc[df_scored["gene_symbol"]=="NaN","gene_symbol"]


Series([], Name: gene_symbol, dtype: object)

# Set up the imports and the database connection

In [5]:
# start with hsa_copy collection
import pymongo
from pymongo import MongoClient
import numpy as np
from bson.objectid import ObjectId
from bson.dbref import DBRef
from bson.json_util import loads

In [6]:
client = MongoClient()
client.list_database_names()

[u'__py_biograph_id_mapping__', u'admin', u'drivers', u'local', u'test']

In [7]:
db = client["__py_biograph_id_mapping__"]

In [8]:
# biograph gene structure
biograph_gene_dict = {"meta_information":{},
    "nodes":{},
    "edges":{}}
drivers_gene_dict = {"meta_information":{},
    "nodes":{},
    "edges":{}}
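Because the same three-part skeleton is needed for every gene and patient, one option is to copy it from a single template instead of retyping it. This is a minimal sketch; `DOCUMENT_TEMPLATE` and `new_document` are hypothetical names, not part of the existing code.

```python
import copy

# Skeleton shared by all biograph documents.
DOCUMENT_TEMPLATE = {"meta_information": {}, "nodes": {}, "edges": {}}


def new_document(template=DOCUMENT_TEMPLATE):
    """Return an independent deep copy so nested dicts are never shared."""
    return copy.deepcopy(template)


gene_doc = new_document()
driver_doc = new_document()
gene_doc["nodes"]["is_driver"] = True

# The second document and the template are unaffected by the change above.
print(driver_doc["nodes"], DOCUMENT_TEMPLATE["nodes"])
```

A deep copy matters here: a shallow `dict(template)` would share the inner `nodes` dict between documents, which is the kind of duplicate-entry problem a per-iteration reset is meant to avoid.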

# Create the biograph_gene collection

In [16]:
# iterate through the Charlotta my_drug collection, re-structuring and adding driver information where necessary
search_filter = {}
for charlotta_pointer in db.charlotta.find(search_filter,projection={"_id":0,"cancer":0}):
    # reset the dictionary to avoid duplicate entries
    biograph_gene_dict = {"meta_information":{},
                          "nodes":{},
                          "edges":{}}
    # reconfigure charlotta to biograph model
    biograph_gene_dict["meta_information"] = charlotta_pointer
    # change the key from "gene_symbol" to "symbol"
    biograph_gene_dict["meta_information"]["symbol"] = biograph_gene_dict["meta_information"].pop("gene_symbol")
    gene_symbol = biograph_gene_dict["meta_information"]["symbol"]
    # insert species and node type for all genes
    biograph_gene_dict["meta_information"]["species"] = "hsa"
    biograph_gene_dict["meta_information"]["node_type"] = "gene"
    
    # check whether the gene is in the driver list and annotate accordingly
    gene_rows = df_scored.loc[df_scored["gene_symbol"] == gene_symbol]
    if not gene_rows.empty:
        biograph_gene_dict["nodes"]["is_driver"] = True
        biograph_gene_dict["nodes"]["driver_score"] = float(gene_rows["score"].mean())
        # first non-null OncodriveROLE probability (a float, as in the Node schema), else None
        role_probs = gene_rows["OncodriveROLE_prob"].dropna()
        biograph_gene_dict["nodes"]["driver_oncodriveROLE"] = float(role_probs.iloc[0]) if not role_probs.empty else None

        driver_source = []
        base_cols = ["driver_type", "source_name", "pmid"]
        for _, row in gene_rows.iterrows():
            # only the Vogelstein table carries pathway and process annotations
            if row["source_name"] == "Vogelstein":
                driver_source.append(row[base_cols + ["Core pathway", "Process"]].to_dict())
            else:
                driver_source.append(row[base_cols].to_dict())

        biograph_gene_dict["meta_information"]["driver_information"] = driver_source
    else:
        biograph_gene_dict["nodes"]["is_driver"] = False

    # create unique _id for each entry
    biograph_gene_dict["_id"] = ObjectId()
    # inject into database
    db.biograph_genes.insert_one(biograph_gene_dict)
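With tens of thousands of genes, one `insert_one` round trip per document can be slow. A possible alternative is to collect the documents and send them in chunks with `insert_many`, which pymongo collections also provide. The sketch below is an assumption about how that could look: `insert_in_batches` and `FakeCollection` are hypothetical names, and the fake collection only exists so the example runs without a MongoDB server.

```python
def insert_in_batches(collection, documents, batch_size=1000):
    """Send documents to collection.insert_many in chunks; return the count sent."""
    batch, total = [], 0
    for doc in documents:
        batch.append(doc)
        if len(batch) == batch_size:
            collection.insert_many(batch)
            total += len(batch)
            batch = []
    if batch:
        # flush the final, possibly smaller, chunk
        collection.insert_many(batch)
        total += len(batch)
    return total


class FakeCollection(object):
    """Stand-in for a pymongo collection that records the batch sizes it receives."""

    def __init__(self):
        self.calls = []

    def insert_many(self, docs):
        self.calls.append(len(docs))


fake = FakeCollection()
sent = insert_in_batches(fake, ({"n": i} for i in range(2500)), batch_size=1000)
print(sent, fake.calls)
```

In the real loop this would mean appending each `biograph_gene_dict` to a list, or yielding it from a generator, and passing `db.biograph_genes` as the collection.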


# upload the TCGA patients to the database

In [14]:
# create dictionary structure for uploading to database
biograph_TCGA_patient_dict = {"meta_information":{},
    "nodes":{},
    "edges":{}}
# import the pandas dataframe containing the patient annotated VCFs 
save_path = "/Users/Heisenberg/pythonProjects/disrupt/data/graphs/TCGA_LIHC_VCFs_vep_to_DB.pkl"
VCFs = pd.read_pickle(save_path)
# iterate through dataframe on a patient basis, creating a json for each variant
patient_ids = VCFs["patient_id"].unique()
node_keys = ['LoFtool','POLYPHEN_score','SIFT_score']

In [8]:
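for patient_id in patient_ids:
    patient_df = VCFs.loc[VCFs["patient_id"] == patient_id]
    # start from a fresh document for each patient
    biograph_TCGA_patient_dict = {"meta_information": {"patient_id": patient_id},
                                  "nodes": {},
                                  "edges": {}}
    # sketch: one node per mutated gene with its numeric VEP scores,
    # keyed by SYMBOL until the variants are matched to database _ids
    for _, variant in patient_df.iterrows():
        biograph_TCGA_patient_dict["nodes"][variant["SYMBOL"]] = {n: variant[n] for n in node_keys}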
    
# match this variant to the database and retrieve db _id for later gene matching

# insert patient as document
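The VEP columns arrive as a mix of strings, empty strings and NaN, so they need converting before they count as "strictly calculable" node values. Here is a minimal sketch of that conversion; `to_score` and `variant_node` are hypothetical helper names, the default keys are the `node_keys` defined earlier, and the example values are taken from the first variant record in this notebook.

```python
import math


def to_score(raw):
    """Convert a raw VEP value to float, or None when it is empty, NaN or not numeric."""
    if raw is None or raw == "":
        return None
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return None
    return None if math.isnan(value) else value


def variant_node(variant, keys=("LoFtool", "POLYPHEN_score", "SIFT_score")):
    """Build the numeric node entry for one variant record (a dict-like row)."""
    return {k: to_score(variant.get(k)) for k in keys}


# Values as they appear in the first variant of the annotated VCF table.
example = {"LoFtool": "0.405",
           "POLYPHEN_score": float("nan"),
           "SIFT_score": float("nan")}
print(variant_node(example))
```

Storing None rather than NaN keeps the documents valid JSON and matches the `[None, 0...1]` ranges given in the schema.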

In [11]:
VCFs.iloc[0].to_dict()

{'Allele': 'A',
 'Amino_acids': 'L',
 'BIOTYPE': 'protein_coding',
 'CANONICAL': 'YES',
 'CDS_position': '684',
 'CLIN_SIG': '',
 'Codons': 'ttG/ttA',
 'Consequence': 'synonymous_variant',
 'DISTANCE': '',
 'DOMAINS': 'Pfam_domain:PF14954&hmmpanther:PTHR31139',
 'ENSP': 'ENSP00000358314',
 'EXON': '4/6',
 'Existing_variation': 'COSM4937220',
 'FLAGS': '',
 'Feature': 'ENST00000369308',
 'Feature_type': 'Transcript',
 'GENE_PHENO': '',
 'Gene': 'ENSG00000152022',
 'HGNC_ID': '28715',
 'HGVS_OFFSET': '',
 'HGVSc': 'ENST00000369308.3:c.684G>A',
 'HGVSp': 'ENSP00000358314.3:p.Leu228%3D',
 'IMPACT': 'LOW',
 'INTRON': '',
 'LoF': '',
 'LoF_filter': '',
 'LoF_flags': '',
 'LoF_info': '',
 'LoFtool': '0.405',
 'PHENO': '1',
 'PICK': '1',
 'POLYPHEN_outcome': 'unknown',
 'POLYPHEN_score': nan,
 'PUBMED': '',
 'PolyPhen': 'unknown',
 'Protein_position': '228',
 'SIFT': 'unknown',
 'SIFT_outcome': 'unknown',
 'SIFT_score': nan,
 'SOMATIC': '1',
 'STRAND': '1',
 'SWISSPROT': 'Q8IVB5',
 'SYMBOL': '

In [21]:
# can a list of genes be sent to the hsa_copy collection, receiving the matching _ids?
patients_genes = VCFs.loc[VCFs["patient_id"]==patient_ids[0],"SYMBOL"].values
search_filter = {"$and":[{"symbol":{"$ne":""}},{"symbol":{"$exists":True}},{"symbol":{"$in":list(patients_genes)}}]}
G = db.hsa_copy.find(search_filter,projection={"_id":1})
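The filter can probably be simplified: an `$in` against a list of non-empty strings already excludes documents whose `symbol` is missing or empty, so the extra `$ne` and `$exists` clauses add nothing once the list itself is cleaned. A minimal sketch follows; `build_symbol_filter` is a hypothetical helper, and sorting the symbols is only there to make the result reproducible.

```python
def build_symbol_filter(symbols):
    """Return a MongoDB filter matching documents whose symbol is one of `symbols`.

    Empty strings, None and non-string values (such as NaN) are dropped,
    and duplicates are removed.
    """
    wanted = sorted({s for s in symbols if isinstance(s, str) and s})
    return {"symbol": {"$in": wanted}}


search_filter = build_symbol_filter(["TTC4", "", "LIX1L", "TTC4", None])
print(search_filter)
```

The result can be passed to `find` exactly like the hand-written filter, for example `db.hsa_copy.find(search_filter, projection={"_id": 1})`.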

In [20]:
patients_genes = VCFs.loc[VCFs["patient_id"]==patient_ids[0],"SYMBOL"].values
patients_genes

array(['LIX1L', 'TTC4', 'CHD5', 'ITGA10', 'CNKSR1', 'CCDC28B', 'COL16A1',
       'TGFB2', 'UBR4', 'HNRNPCL1', 'SPEN', 'SLC9C2', 'ENO1', 'EPHA10',
       'NOTCH2', 'FAM89A', 'PTPRU', 'CTSK', 'TCTEX1D4', 'COL11A1', 'KPNA6',
       'ABCB10', 'ADORA3', 'LCE1F', 'AK2', 'ARID1A', 'LYST', 'NEBL',
       'RTKN2', 'TACC2', 'FBXW4', 'PNLIP', 'TRIM22', 'TRIM66', 'RCE1',
       'ATM', 'RP11-399J13.3', 'WNK1', 'FGF6', 'KRT4', 'ABCC4', 'PDX1',
       'SPTLC2', 'SYNE2', 'RIN3', 'NYNRIN', 'FAM71D', 'ZFYVE1', 'KIAA0586',
       'ARHGEF40', 'PLCG2', 'CCDC135', 'ITGAM', 'RBL2', 'OGFOD3', 'KIF19',
       'MYH4', 'KRTAP4-11', 'MTMR4', 'MRC2', 'AANAT', 'QRICH2', 'ACE',
       'AFMID', 'ALOX12', 'ALPK2', 'SOCS6', 'TXNL1', 'ZNF112', 'USHBP1',
       'MUC16', 'MBOAT7', 'NFKBIB', 'CPT1C', 'RESP18', 'LRP2', 'NPHP1',
       'XPO1', 'HTRA2', 'NMUR1', 'ST6GAL2', 'VWA3B', 'USP40', 'NEU2',
       'MYO3B', 'PUS10', 'TMEM131', 'WDFY1', 'SCN3A', 'PIKFYVE', 'HDLBP',
       'E2F6', 'GPR148', 'CYP27C1', 'CPS1', 'COBLL1', '

In [26]:
#patients_genes
for g in G:
    print(g[u'_id'])

In [29]:
oi = g[u'_id']

In [32]:
ObjectId.is_valid(oi)

True

In [41]:
print('"' + '","'.join(patients_genes)+'"')

"LIX1L","TTC4","CHD5","ITGA10","CNKSR1","CCDC28B","COL16A1","TGFB2","UBR4","HNRNPCL1","SPEN","SLC9C2","ENO1","EPHA10","NOTCH2","FAM89A","PTPRU","CTSK","TCTEX1D4","COL11A1","KPNA6","ABCB10","ADORA3","LCE1F","AK2","ARID1A","LYST","NEBL","RTKN2","TACC2","FBXW4","PNLIP","TRIM22","TRIM66","RCE1","ATM","RP11-399J13.3","WNK1","FGF6","KRT4","ABCC4","PDX1","SPTLC2","SYNE2","RIN3","NYNRIN","FAM71D","ZFYVE1","KIAA0586","ARHGEF40","PLCG2","CCDC135","ITGAM","RBL2","OGFOD3","KIF19","MYH4","KRTAP4-11","MTMR4","MRC2","AANAT","QRICH2","ACE","AFMID","ALOX12","ALPK2","SOCS6","TXNL1","ZNF112","USHBP1","MUC16","MBOAT7","NFKBIB","CPT1C","RESP18","LRP2","NPHP1","XPO1","HTRA2","NMUR1","ST6GAL2","VWA3B","USP40","NEU2","MYO3B","PUS10","TMEM131","WDFY1","SCN3A","PIKFYVE","HDLBP","E2F6","GPR148","CYP27C1","CPS1","COBLL1","HK2","ECEL1","ADAM17","FIGN","ASAP2","ARHGAP25","CLASP1","MOCS3","NCOA6","RP4-785G19.5","RPRD1B","TRIOBP","MN1","PANX2","FNDC3B","RAB6B","TUSC2","PPP2R3A","LAMB2","DPPA2","DBR1","PLXNA1","COL6A6

# Miscellaneous code so I don't forget what I know

In [106]:
db.biograph_genes.delete_many({}).raw_result

{u'n': 37208, u'ok': 1.0}

In [94]:
db.biograph_driver_genes.delete_many({}).raw_result

{u'n': 3527, u'ok': 1.0}

In [5]:
#make a copy of hsa
#pipeline = [ {"$match": {}}, 
#             {"$out": "hsa_copy"},
#]
#db.hsa.aggregate(pipeline)

In [6]:
pipeline = [{"$match":{"ensembl_gene_id":"ENSG00000243989"}},
            {"$project":{"cancer.source_name":1,
                         "cancer.driver_type":1}},
            {"$addFields":{"node_type":"gene",
                           "species":"hsa"}}
           ]

In [8]:
r = db.charlotta.aggregate(pipeline)

In [9]:
pipeline = [{"$unwind":"$nodes"},
            {"$lookup":{"from":"drivers",
                        "localField":"ensembl_gene_id",
                        "foreignField":"ensembl",
                        "as":"combined"}},
            {"$out":"combined"}
           ]