# Re-Organize the Candidates

Building on the [previous notebook](1.data-loader.ipynb), this notebook stratifies the candidates into the appropriate categories (training, development, test). Since the hard work (data insertion) is already done, this step reduces to relabeling the split column of the Candidate table. The split column is used throughout the rest of this pipeline.

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

#Imports
import csv
import os
import random

import numpy as np
import pandas as pd
import tqdm

In [2]:
#Set up the environment
username = "danich1"
password = "snorkel"
dbname = "pubmeddb"

#Path subject to change for different os
database_str = "postgresql+psycopg2://{}:{}@/{}?host=/var/run/postgresql".format(username, password, dbname)
os.environ['SNORKELDB'] = database_str

from snorkel import SnorkelSession
session = SnorkelSession()

In [3]:
from snorkel.models import candidate_subclass

In [4]:
#This specifies the type of candidates to extract
DiseaseGene = candidate_subclass('DiseaseGene', ['Disease', 'Gene'])

# Make Stratified File

In [5]:
disease_ontology_df = pd.read_csv('https://raw.githubusercontent.com/dhimmel/disease-ontology/052ffcc960f5897a0575f5feff904ca84b7d2c1d/data/xrefs-prop-slim.tsv', sep="\t")
disease_ontology_df = disease_ontology_df.drop_duplicates(["doid_code", "doid_name"])

In [6]:
gene_entrez_df = pd.read_csv('https://raw.githubusercontent.com/dhimmel/entrez-gene/a7362748a34211e5df6f2d185bb3246279760546/data/genes-human.tsv', sep="\t")
gene_entrez_df = gene_entrez_df[["GeneID", "Symbol"]]

## Map Each Disease to Each Gene

In [7]:
# A shared constant key turns the merge into a cross join of genes and diseases
gene_entrez_df['dummy_key'] = 0
disease_ontology_df['dummy_key'] = 0
dg_map_df = gene_entrez_df.merge(disease_ontology_df[["doid_code", "doid_name", "dummy_key"]], on='dummy_key')
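
To make the dummy-key trick above concrete, here is a minimal sketch on toy data. The two small frames are illustrative stand-ins for `gene_entrez_df` and `disease_ontology_df`, not the real tables.

```python
import pandas as pd

# Toy stand-ins for gene_entrez_df and disease_ontology_df (illustrative rows only)
genes = pd.DataFrame({"GeneID": [1, 2], "Symbol": ["A1BG", "A2M"]})
diseases = pd.DataFrame({
    "doid_code": ["DOID:2531", "DOID:1319", "DOID:1324"],
    "doid_name": ["hematologic cancer", "brain cancer", "lung cancer"],
})

# Every gene row pairs with every disease row because all rows share dummy_key == 0
pairs = (
    genes.assign(dummy_key=0)
    .merge(diseases.assign(dummy_key=0), on="dummy_key")
    .drop(columns="dummy_key")
)

print(pairs.shape)  # (6, 4): 2 genes x 3 diseases
```

On pandas 1.2 or later, `genes.merge(diseases, how="cross")` should give the same pairing without the helper column.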

## Label All Pairs Whether or Not They are in Hetnets

In [8]:
%%time
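hetnet_kb_df = pd.read_csv("hetnet_dg_kb.csv")
# Store each knowledge-base row as a tuple so membership tests are fast
hetnet_set = set(map(tuple, hetnet_kb_df.values))
# Default every pair to -1 (not in hetnets); matches are switched to 1 below
hetnet_labels = np.ones(dg_map_df.shape[0]) * -1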

for index, row in tqdm.tqdm(dg_map_df.iterrows()):
    if (row["doid_code"], row["GeneID"]) in hetnet_set:
        hetnet_labels[index] = 1 
    
dg_map_df["hetnet"] = hetnet_labels

7663872it [08:07, 15735.60it/s]

CPU times: user 8min 26s, sys: 44.1 s, total: 9min 10s
Wall time: 8min 7s
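
The row-by-row loop above takes about eight minutes. A vectorized membership test is one possible alternative. The sketch below uses toy frames with hypothetical identifiers, and it assumes the knowledge base has `doid_code` and `GeneID` columns, which the source does not confirm.

```python
import numpy as np
import pandas as pd

# Toy stand-ins: all candidate pairs, and the pairs known to the knowledge base
pairs = pd.DataFrame({
    "doid_code": ["DOID:1", "DOID:1", "DOID:2", "DOID:2"],
    "GeneID": [10, 20, 10, 20],
})
known = pd.DataFrame({"doid_code": ["DOID:1", "DOID:2"], "GeneID": [20, 10]})

# MultiIndex.isin tests every (doid_code, GeneID) tuple against the known pairs at once
pair_index = pd.MultiIndex.from_frame(pairs[["doid_code", "GeneID"]])
known_index = pd.MultiIndex.from_frame(known[["doid_code", "GeneID"]])
in_known = pair_index.isin(known_index)

# Same labeling convention as the notebook: 1 if present, -1 otherwise
pairs["hetnet"] = np.where(in_known, 1, -1)
print(pairs)
```

This keeps the 1 / -1 convention of the loop and avoids `iterrows`. How much time it saves on the full table has not been measured here.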





## See if D-G Pair is in Pubmed

In [10]:
%%time
pubmed_dg_pairs = set()
cands = []
# Use an integer so limit/offset receive whole numbers
chunk_size = 100000
offset = 0

while True:
    cands = session.query(DiseaseGene).limit(chunk_size).offset(offset).all()
    
    if not cands:
        break
        
    for candidate in tqdm.tqdm(cands):
        pubmed_dg_pairs.add((candidate.Disease_cid, candidate.Gene_cid))
    
    offset = offset + chunk_size

100%|██████████| 100000/100000 [00:00<00:00, 600430.89it/s]
100%|██████████| 100000/100000 [00:00<00:00, 573072.20it/s]
100%|██████████| 100000/100000 [00:00<00:00, 582798.01it/s]
100%|██████████| 100000/100000 [00:00<00:00, 577477.27it/s]
100%|██████████| 100000/100000 [00:00<00:00, 549028.60it/s]
100%|██████████| 100000/100000 [00:00<00:00, 551130.83it/s]
100%|██████████| 100000/100000 [00:00<00:00, 553903.12it/s]
100%|██████████| 100000/100000 [00:00<00:00, 537302.13it/s]
100%|██████████| 100000/100000 [00:00<00:00, 474774.91it/s]
100%|██████████| 100000/100000 [00:00<00:00, 522155.00it/s]
100%|██████████| 100000/100000 [00:00<00:00, 476559.93it/s]
100%|██████████| 100000/100000 [00:00<00:00, 467372.54it/s]
100%|██████████| 100000/100000 [00:00<00:00, 505586.36it/s]
100%|██████████| 100000/100000 [00:00<00:00, 518605.38it/s]
100%|██████████| 100000/100000 [00:00<00:00, 515817.63it/s]
100%|██████████| 100000/100000 [00:00<00:00, 462519.66it/s]
100%|██████████| 100000/100000 [00:00<00

CPU times: user 1min 13s, sys: 1.85 s, total: 1min 15s
Wall time: 5min 5s
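
The chunking strategy used above (fetch a page with limit/offset, stop when a page comes back empty) can be sketched in isolation. This example uses the standard-library `sqlite3` module and a hypothetical `candidate` table instead of the Snorkel session, so it only illustrates the paging pattern, not the notebook's actual schema.

```python
import sqlite3

# In-memory database with a hypothetical candidate table (not the Snorkel schema)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE candidate (id INTEGER PRIMARY KEY, disease TEXT, gene TEXT)")
conn.executemany(
    "INSERT INTO candidate (disease, gene) VALUES (?, ?)",
    [("DOID:%d" % (i % 3), str(i % 4)) for i in range(10)],
)

def iter_chunks(conn, chunk_size):
    """Yield successive pages of rows until the table is exhausted."""
    offset = 0
    while True:
        rows = conn.execute(
            # ORDER BY keeps page boundaries stable between queries
            "SELECT disease, gene FROM candidate ORDER BY id LIMIT ? OFFSET ?",
            (chunk_size, offset),
        ).fetchall()
        if not rows:
            break
        yield rows
        offset += chunk_size

pairs = set()
chunk_lengths = []
for chunk in iter_chunks(conn, chunk_size=4):
    chunk_lengths.append(len(chunk))
    pairs.update(chunk)

print(chunk_lengths, len(pairs))
```

Ordering by a unique key is a design choice worth considering for the real query as well, since a database gives no guaranteed row order without it.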


In [11]:
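pubmed_labels = np.ones(dg_map_df.shape[0]) * -1

for index, row in tqdm.tqdm(dg_map_df.iterrows()):
    # Look the pair up in the PubMed candidate pairs collected above.
    # Snorkel stores cids as strings, so GeneID is cast before comparing
    # (this assumes Gene_cid holds the bare Entrez gene ID).
    if (row["doid_code"], str(row["GeneID"])) in pubmed_dg_pairs:
        pubmed_labels[index] = 1

dg_map_df["pubmed"] = pubmed_labels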

7663872it [08:05, 15786.79it/s]


In [18]:
# Pass the mapping via columns= so it applies to column names rather than index labels
dg_map_df = dg_map_df.rename(columns={"GeneID": "gene_id", "doid_code": "disease_ontology", "doid_name": "disease_name", "Symbol": "gene_name"})
dg_map_df["hetnet"] = dg_map_df["hetnet"].astype(int)
dg_map_df["pubmed"] = dg_map_df["pubmed"].astype(int)
dg_map_df.drop("dummy_key", axis=1).to_csv("dg_map.csv", index=False)

In [19]:
dg_map_df

| | gene_id | gene_name | dummy_key | disease_ontology | disease_name | hetnet | pubmed |
|---|---|---|---|---|---|---|---|
| 0 | 1 | A1BG | 0 | DOID:2531 | hematologic cancer | -1 | -1 |
| 1 | 1 | A1BG | 0 | DOID:1319 | brain cancer | -1 | -1 |
| 2 | 1 | A1BG | 0 | DOID:1324 | lung cancer | -1 | -1 |
| 3 | 1 | A1BG | 0 | DOID:263 | kidney cancer | -1 | -1 |
| 4 | 1 | A1BG | 0 | DOID:1793 | pancreatic cancer | -1 | -1 |
| 5 | 1 | A1BG | 0 | DOID:4159 | skin cancer | -1 | -1 |
| 6 | 1 | A1BG | 0 | DOID:184 | bone cancer | -1 | -1 |
| 7 | 1 | A1BG | 0 | DOID:0060119 | pharynx cancer | -1 | -1 |
| 8 | 1 | A1BG | 0 | DOID:2394 | ovarian cancer | -1 | -1 |
| 9 | 1 | A1BG | 0 | DOID:1612 | breast cancer | -1 | -1 |


## Modify the Candidate split

The code below changes the split column of the Candidate table, as mentioned above. Using SQLAlchemy and the same chunking strategy, every candidate with the disease entity ID DOID:3393 is assigned split 2, which represents the test set used in the rest of the notebooks. Every other candidate is randomly assigned split 0 (training) with probability 0.8 or split 1 (development) with probability 0.2.

In [None]:
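np.random.seed(100)
cands = []
chunk_size = 100000
offset = 0

while True:
    # Order by primary key so paging stays stable while rows are being updated
    cands = session.query(DiseaseGene).order_by(DiseaseGene.id).limit(chunk_size).offset(offset).all()
    
    if not cands:
        break
        
    for candidate in tqdm.tqdm(cands):
        if candidate.Disease_cid == "DOID:3393":
            # Hold this disease out as the test set
            candidate.split = 2
        else:
            # Random 80/20 draw between split 0 and split 1
            split = np.random.choice([0, 1], 1, p=[0.8, 0.2])
            # Cast to a plain int; database drivers may not accept NumPy integer types
            candidate.split = int(split[0])
        
        session.add(candidate)
    
    offset = offset + chunk_size

# Persist the changes into the database
session.commit()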
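
The split rule itself can be checked without a database. The sketch below applies it to a toy list of disease IDs. The helper name `assign_splits` and the use of `np.random.default_rng` are assumptions made for illustration, and the draws will not match those produced by `np.random.seed(100)` in the cell above.

```python
import numpy as np

TEST_DISEASE = "DOID:3393"  # disease held out as the test set in this notebook

def assign_splits(disease_ids, rng, p_train=0.8):
    """Hypothetical helper: return one split label (0, 1 or 2) per disease ID."""
    disease_ids = np.asarray(disease_ids)
    # Draw 0 (training) or 1 (development) for every candidate
    splits = rng.choice([0, 1], size=len(disease_ids), p=[p_train, 1 - p_train])
    # Override the draw with 2 (test) wherever the held-out disease appears
    splits[disease_ids == TEST_DISEASE] = 2
    return splits

rng = np.random.default_rng(100)
disease_ids = ["DOID:3393", "DOID:1612", "DOID:3393", "DOID:1324", "DOID:2394"]
splits = assign_splits(disease_ids, rng)
print(splits)
```

Holding out a whole disease, rather than sampling the test set at random, means that no candidate for that disease appears in the training or development splits.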