# Re-Organize the Candidates

Building on the [previous notebook](1.data-loader.ipynb), this notebook stratifies the candidates into the appropriate categories (training, development, test). Since the hard work (data insertion) is already done, this step reduces to relabeling the split column of the Candidate table. The split column is used throughout the rest of this pipeline.

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

#Imports
import csv
import os
import random

import numpy as np
import pandas as pd
import tqdm

In [2]:
#Set up the environment
username = "danich1"
password = "snorkel"
dbname = "pubmeddb"

# Path may need to change depending on the operating system
database_str = "postgresql+psycopg2://{}:{}@/{}?host=/var/run/postgresql".format(username, password, dbname)
os.environ['SNORKELDB'] = database_str

from snorkel import SnorkelSession
session = SnorkelSession()

In [3]:
from snorkel.models import candidate_subclass

In [4]:
#This specifies the type of candidates to extract
DiseaseGene = candidate_subclass('DiseaseGene', ['Disease', 'Gene'])

## Modify the Candidate split

The code below changes the split column of the Candidate table as mentioned above. Using SQLAlchemy and a chunking strategy (querying 100,000 candidates at a time), every candidate with the disease entity ID DOID:3393 is assigned split 2, which represents the test set used in the rest of the notebooks. Every other candidate is randomly assigned split 0 (training) with probability 0.8 or split 1 (development) with probability 0.2.
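The relabeling strategy can be tried out without a Snorkel database. The following is a minimal sketch, not the notebook's implementation: it swaps in an in-memory SQLite table and the standard library's `random` module in place of the Snorkel session and NumPy. The table layout, the helper names `assign_split` and `relabel_in_chunks`, the tiny chunk size, and every disease ID other than DOID:3393 are assumptions made for illustration.

```python
import random
import sqlite3
from collections import Counter

TEST_DISEASE = "DOID:3393"  # disease held out as the test set (from the notebook)
CHUNK_SIZE = 4              # tiny on purpose; the notebook uses 100,000


def assign_split(disease_cid, rng):
    """Return 2 (test) for the held-out disease, else 0 or 1 at 80/20."""
    if disease_cid == TEST_DISEASE:
        return 2
    return rng.choices([0, 1], weights=[0.8, 0.2])[0]


def relabel_in_chunks(conn, rng, chunk_size=CHUNK_SIZE):
    """Page through the toy table with LIMIT/OFFSET and rewrite split."""
    offset = 0
    while True:
        # ORDER BY id keeps the pages stable from one query to the next
        rows = conn.execute(
            "SELECT id, disease_cid FROM candidate "
            "ORDER BY id LIMIT ? OFFSET ?",
            (chunk_size, offset),
        ).fetchall()
        if not rows:
            break
        updates = [(assign_split(cid, rng), row_id) for row_id, cid in rows]
        conn.executemany("UPDATE candidate SET split = ? WHERE id = ?", updates)
        offset += chunk_size
    # persist all changes once every chunk has been processed
    conn.commit()


# Hypothetical toy data: 10 candidates, 3 of which mention the held-out disease.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE candidate "
    "(id INTEGER PRIMARY KEY, disease_cid TEXT, split INTEGER)"
)
diseases = [
    "DOID:3393", "DOID:1612", "DOID:9352", "DOID:3393", "DOID:1612",
    "DOID:9352", "DOID:1612", "DOID:3393", "DOID:9352", "DOID:1612",
]
conn.executemany(
    "INSERT INTO candidate (disease_cid, split) VALUES (?, 0)",
    [(d,) for d in diseases],
)

relabel_in_chunks(conn, random.Random(100))
split_counts = Counter(s for (s,) in conn.execute("SELECT split FROM candidate"))
print(split_counts)
```

Counting rows per split after the commit, as the last two lines do, is a cheap sanity check: the number of rows with split 2 should equal the number of candidates that mention DOID:3393.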

In [9]:
np.random.seed(100)
cands = []
chunk_size = int(1e5)  # 1e5 is a float; LIMIT and OFFSET expect integers
offset = 0

while True:
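    # order by primary key so the LIMIT/OFFSET pages stay stable between queries
    cands = session.query(DiseaseGene).order_by(DiseaseGene.id).limit(chunk_size).offset(offset).all()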
    
    if not cands:
        break
        
    for candidate in tqdm.tqdm(cands):
        if candidate.Disease_cid == "DOID:3393":
            candidate.split = 2
        else:
            split = np.random.choice([0, 1], 1, p=[0.8, 0.2])
            # cast the NumPy integer to a plain int before storing it
            candidate.split = int(split[0])
        
        session.add(candidate)
    
    offset = offset + chunk_size
# persist the changes into the database
session.commit()

100%|██████████| 100000/100000 [00:05<00:00, 17583.91it/s]
100%|██████████| 100000/100000 [00:05<00:00, 17931.66it/s]
100%|██████████| 100000/100000 [00:05<00:00, 18641.76it/s]
100%|██████████| 100000/100000 [00:05<00:00, 17645.14it/s]
100%|██████████| 100000/100000 [00:05<00:00, 17575.49it/s]
100%|██████████| 100000/100000 [00:05<00:00, 17513.28it/s]
100%|██████████| 100000/100000 [00:05<00:00, 17383.84it/s]
100%|██████████| 100000/100000 [00:05<00:00, 17504.64it/s]
100%|██████████| 100000/100000 [00:05<00:00, 17650.04it/s]
100%|██████████| 100000/100000 [00:05<00:00, 17850.91it/s]
100%|██████████| 100000/100000 [00:05<00:00, 17439.88it/s]
100%|██████████| 100000/100000 [00:05<00:00, 18542.10it/s]
100%|██████████| 100000/100000 [00:05<00:00, 18116.84it/s]
100%|██████████| 100000/100000 [00:05<00:00, 18353.84it/s]
100%|██████████| 100000/100000 [00:05<00:00, 17032.68it/s]
100%|██████████| 100000/100000 [00:05<00:00, 17993.90it/s]
100%|██████████| 100000/100000 [00:05<00:00, 17632.32it/