# Label The Candidates! Extract The Features!

This notebook labels and generates features for each candidate extracted in the [previous notebook](1.data-loader.ipynb).

## MUST RUN AT THE START OF EVERYTHING

Load all the imports and set up the database connection. Then set the candidate type this notebook is going to work with.

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

from collections import defaultdict
import csv
import os
import re


import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tqdm

In [2]:
#Set up the environment
username = "danich1"
password = "snorkel"
dbname = "pubmeddb"

#Path subject to change for different os
database_str = "postgresql+psycopg2://{}:{}@/{}?host=/var/run/postgresql".format(username, password, dbname)
os.environ['SNORKELDB'] = database_str

from snorkel import SnorkelSession
session = SnorkelSession()

In [3]:
from snorkel.annotations import FeatureAnnotator, LabelAnnotator
from snorkel.features import get_span_feats
from snorkel.models import candidate_subclass
from snorkel.models import Candidate, GoldLabel
from snorkel.viewer import SentenceNgramViewer

In [4]:
edge_type = "dg"
debug = False

In [5]:
if edge_type == "dg":
    DiseaseGene = candidate_subclass('DiseaseGene', ['Disease', 'Gene'])
    edge = "disease_gene"
elif edge_type == "gg":
    GeneGene = candidate_subclass('GeneGene', ['Gene1', 'Gene2'])
    edge = "gene_gene"
elif edge_type == "cg":
    CompoundGene = candidate_subclass('CompoundGene', ['Compound', 'Gene'])
    edge = "compound_gene"
elif edge_type == "cd":
    CompoundDisease = candidate_subclass('CompoundDisease', ['Compound', 'Disease'])
    edge = "compound_disease"
else:
    print("Please pick a valid edge type")

# Look at potential Candidates

Use this to look at loaded candidates from a given set. The constants are the split indices used to retrieve the appropriate set. Ideally, this is where one looks at a subset of the candidates and develops label functions for candidate labeling.

In [None]:
#dev_set = pd.read_csv("vanilla_lstm/lstm_disease_gene_holdout/train_candidates_to_ids.csv")
#dev_set.head(3)

In [None]:
#TRAIN = 0
#DEV = 1

In [None]:
#candidates = session.query(DiseaseGene).filter(DiseaseGene.split==DEV).offset(300).limit(100)
#candidates = session.query(DiseaseGene).filter(DiseaseGene.id.in_(dev_set["id"])).offset(400).limit(100)
#sv = SentenceNgramViewer(candidates, session)

In [None]:
#sv

# Label Functions

Here is one of the fundamental parts of this project. Below are the label functions that give each candidate a label of 1, 0 or -1, corresponding to a correct label, an unknown label and an incorrect label, respectively. The goal is to develop functions that accurately label as many candidates as possible. This idea comes from the [data programming paradigm](https://papers.nips.cc/paper/6523-data-programming-creating-large-training-sets-quickly), where the goal is to create labels that machine learning algorithms can use for accurate classification.
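To make the 1 / 0 / -1 convention concrete, here is a minimal sketch of two label functions. Everything in it is hypothetical: the function names, the regular expressions and the plain-string sentences are stand-ins, while the project's real label functions live in the `utils` modules imported below and operate on Snorkel candidates.

```python
import re

# Hypothetical vote values following the convention described above
POSITIVE, ABSTAIN, NEGATIVE = 1, 0, -1

def lf_association_keyword(sentence):
    """Vote positive when an association phrase appears, otherwise abstain."""
    pattern = r"\b(associated with|mutations? in)\b"
    return POSITIVE if re.search(pattern, sentence, flags=re.I) else ABSTAIN

def lf_negation(sentence):
    """Vote negative when the sentence negates an association, otherwise abstain."""
    pattern = r"\b(not|no) (associated|association)\b"
    return NEGATIVE if re.search(pattern, sentence, flags=re.I) else ABSTAIN

# Hypothetical registry, shaped like a name -> function dictionary
example_lfs = {
    "LF_ASSOCIATION_KEYWORD": lf_association_keyword,
    "LF_NEGATION": lf_negation,
}

example_sentences = [
    "BRCA1 is associated with breast cancer.",
    "TP53 was not associated with asthma.",
    "We sequenced the gene in 40 patients.",
]

# One list of votes per sentence, in the order the functions were registered
example_votes = [[lf(s) for lf in example_lfs.values()] for s in example_sentences]
print(example_votes)
```

The second sentence receives both a positive and a negative vote, which is the kind of conflict that the label function statistics later in this notebook are meant to surface.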

In [11]:
if edge_type == "dg":
    from utils.disease_gene_lf import LFS
elif edge_type == "gg":
    from utils.gene_gene_lf import *
elif edge_type == "cg":
    from utils.compound_gene_lf import *
elif edge_type == "cd":
    from utils.compound_disease_lf import *
else:
    print("Please pick a valid edge type")

In [None]:
c = session.query(DiseaseGene).filter(DiseaseGene.id.in_(target_cids)).all()
c

In [None]:
for i, cand in enumerate(c):
    try:
        print(LFS['LF_CHECK_DISEASE_TAG'](cand))
    except Exception as e:
        # Report the index of the first candidate that breaks the label function
        print(i, e)
        break

In [None]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

disease_desc = pd.read_table("https://raw.githubusercontent.com/dhimmel/disease-ontology/052ffcc960f5897a0575f5feff904ca84b7d2c1d/data/xrefs-prop-slim.tsv")
disease_normalization_df = pd.read_table("https://raw.githubusercontent.com/dhimmel/disease-ontology/052ffcc960f5897a0575f5feff904ca84b7d2c1d/data/slim-terms-prop.tsv")
wordnet_lemmatizer = WordNetLemmatizer()

# Raw string so that "\)" is read as a regex escape, not a string escape
disease_name = re.sub(r"\) ?", "", c[152][0].get_span())
disease_name = [wordnet_lemmatizer.lemmatize(word) for word in disease_name.split(" ")]
nltk.pos_tag(disease_name)

# Label The Candidates

Label each candidate using the label functions loaded above. This code runs with relative ease, but optimization will definitely be needed as the number of label functions grows.
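As a rough picture of what the labeling step produces, the sketch below builds a small label matrix by hand. It is not how `LabelAnnotator` is implemented; the candidates are plain strings and the two label functions are hypothetical stand-ins.

```python
import numpy as np

# Hypothetical stand-ins: candidates are plain strings, label functions are callables
toy_candidates = [
    "gene causes disease",
    "gene unrelated to disease",
    "gene and disease",
]
toy_lfs = {
    "LF_CAUSES": lambda text: 1 if "causes" in text else 0,
    "LF_UNRELATED": lambda text: -1 if "unrelated" in text else 0,
}

def apply_lfs(candidates, lfs):
    """Return a (n_candidates, n_lfs) matrix holding one vote per cell."""
    matrix = np.zeros((len(candidates), len(lfs)), dtype=int)
    for row, cand in enumerate(candidates):
        for col, lf in enumerate(lfs.values()):
            matrix[row, col] = lf(cand)
    return matrix

toy_matrix = apply_lfs(toy_candidates, toy_lfs)
print(toy_matrix.shape)
print(toy_matrix)
```

Every candidate is passed through every label function, so the work is the number of candidates times the number of label functions, which is why adding label functions makes this step slower.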

In [6]:
from snorkel.annotations import load_gold_labels
L_gold_train = load_gold_labels(session, annotator_name='danich1', split=0)
annotated_cands_train_ids = list(map(lambda x: L_gold_train.row_index[x], L_gold_train.nonzero()[0]))

L_gold_dev = load_gold_labels(session, annotator_name='danich1', split=1)
annotated_cands_dev_ids = list(map(lambda x: L_gold_dev.row_index[x], L_gold_dev.nonzero()[0]))

In [7]:
sql = '''
SELECT id from candidate
WHERE split = 0 and type='disease_gene'
ORDER BY RANDOM()
LIMIT 50000;
'''
target_cids = [x[0] for x in session.execute(sql)]

In [8]:
target_cids

[9951794,
 904609,
 5192262,
 14552559,
 16277239,
 7513663,
 26709637,
 18498661,
 31276326,
 7508019,
 21182051,
 8718860,
 29420557,
 4576271,
 16265780,
 14579509,
 28546887,
 903561,
 2554892,
 11357489,
 6320225,
 10340995,
 1234711,
 27189858,
 23931639,
 27643583,
 8312730,
 24407247,
 12368020,
 17597476,
 20275693,
 23479706,
 28127286,
 34028805,
 33602727,
 25305237,
 16280703,
 33582365,
 2508141,
 34965519,
 5925834,
 26266918,
 24408282,
 17606676,
 9530917,
 35862276,
 28546709,
 4565185,
 7910524,
 23028505,
 18512660,
 27199812,
 29002515,
 19823705,
 35883032,
 1236055,
 34493648,
 33118337,
 4842917,
 32245680,
 31265266,
 33580470,
 32240426,
 23947258,
 13230693,
 22092561,
 22583261,
 7508178,
 6331841,
 33571610,
 35826537,
 9121725,
 4589207,
 27213249,
 33562081,
 2868939,
 31282135,
 9554292,
 25760465,
 16758106,
 34031511,
 24852520,
 4845785,
 7901313,
 27162063,
 30814233,
 31307332,
 28997960,
 20748470,
 8705830,
 29878788,
 7509402,
 17162052,
 1160875

In [9]:
sql = '''
SELECT candidate_id FROM gold_label
'''
gold_cids = [x[0] for x in session.execute(sql)]
gold_cids

[24700,
 24703,
 22411,
 22426,
 22904,
 22908,
 22925,
 23375,
 81233,
 82753,
 81692,
 82372,
 81167,
 83221,
 83225,
 83229,
 83233,
 83237,
 83241,
 83422,
 81044,
 82920,
 80914,
 82846,
 127048,
 127034,
 130192,
 130195,
 130019,
 129088,
 128867,
 128592,
 128597,
 128602,
 128713,
 127106,
 130542,
 130543,
 130544,
 128005,
 129942,
 129160,
 129322,
 129328,
 127393,
 130082,
 127597,
 128377,
 130033,
 130054,
 130049,
 129984,
 126689,
 126695,
 128727,
 130815,
 130819,
 130925,
 130545,
 130927,
 130640,
 178000,
 174278,
 177603,
 175363,
 177711,
 178469,
 176072,
 178158,
 241000,
 244451,
 235144,
 231382,
 232720,
 231376,
 237268,
 236283,
 235411,
 235417,
 235510,
 235515,
 235518,
 235523,
 235529,
 235534,
 229709,
 233867,
 233873,
 231095,
 231250,
 232716,
 232731,
 232735,
 226358,
 239079,
 243774,
 243174,
 243181,
 228017,
 245558,
 480361,
 480366,
 476332,
 481835,
 481842,
 483470,
 481850,
 487622,
 478730,
 477467,
 477471,
 477477,
 477480,
 23442,

In [12]:
from sqlalchemy.sql.expression import func
labeler = LabelAnnotator(lfs=list(LFS.values()))

cids = session.query(DiseaseGene.id).filter(DiseaseGene.id.in_(target_cids))
%time L_train = labeler.apply(split=0, cids_query=cids, parallelism=5)

cids = session.query(Candidate.id).filter(Candidate.id.in_(gold_cids))
%time L_dev = labeler.apply_existing(cids_query=cids, parallelism=5, clear=False)

#cids = session.query(Candidate.id).filter(Candidate.split==2)
#%time L_test = labeler.apply_existing(split=2, cids_query=cids, parallelism=5, clear=False)

Clearing existing...
Running UDF...
CPU times: user 1min 22s, sys: 4.7 s, total: 1min 27s
Wall time: 4min 19s
Running UDF...
CPU times: user 10.7 s, sys: 336 ms, total: 11 s
Wall time: 20.8 s


In [None]:
L_dev.shape

In [13]:
np.savetxt('data/labeled_candidates.txt', target_cids)

In [None]:
L_train.lf_stats(session)

# DO NOT RUN BELOW

# Generate Candidate Features

In conjunction with the candidate labels, generate candidate features that will be used by some machine learning algorithms (notebook 4). This step is broken because the database insert takes an **incredibly** long time to run, so a roundabout way of loading the features was needed. **Do not run this block**; refer to the workaround code block below instead. This part still needs to be debugged when time permits.

In [None]:
%%time
featurizer = FeatureAnnotator()
featurizer.apply(split=0, clear=False)

F_dev = featurizer.apply_existing(split=1, parallelism=5, clear=False)
F_test = featurizer.apply_existing(split=2, parallelism=5, clear=False)

# Work Around for above code

As mentioned above, this code is the workaround for the broken featurizer. The idea behind this section is to write all the generated features to a SQL text file. By exploiting psql's COPY command, the time taken to insert features drops to ~30 minutes (compared to 1 week+).
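The file layout this workaround relies on can be sketched on a tiny scale: one `COPY ... from stdin` header line followed by tab-delimited rows. The rows below are made-up values, and the script is written to an in-memory buffer instead of `feature.sql` so the result can be inspected directly.

```python
import csv
import io

# Hypothetical (value, candidate_id, key_id) rows standing in for real features
toy_rows = [(1.0, 101, 0), (1.0, 101, 1), (1.0, 102, 0)]

buffer = io.StringIO()

# Header line: tells psql that the lines that follow are tab-delimited CSV data
buffer.write(
    "COPY feature(value, candidate_id, key_id) from stdin "
    "with CSV DELIMITER '\t' QUOTE '\"';\n"
)

# Data lines: numbers are left unquoted, any string field would be quoted
writer = csv.writer(
    buffer, delimiter="\t", quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n"
)
writer.writerows(toy_rows)

copy_script = buffer.getvalue()
print(copy_script)
```

A file with this layout could then be loaded in one pass, for example with something like `psql -d pubmeddb -f feature.sql`; the exact invocation depends on the local database setup and is an assumption here.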

In [None]:
%%time

group = 0
chunksize = int(1e5)  # integer so it can be used safely with limit() and offset()
seen = set()
feature_key_hash = defaultdict(int)
feat_counter = 0

# Text mode with newline='' is what the csv module expects on Python 3
with open('feature_key.sql', 'w', newline='') as f:
    with open('feature.sql', 'w', newline='') as g:
        # Write the headers
        f.write("COPY feature_key(\"group\", name, id) from stdin with CSV DELIMITER '	' QUOTE '\"';\n")
        g.write("COPY feature(value, candidate_id, key_id) from stdin with CSV DELIMITER '	' QUOTE '\"';\n")
        
        # Set up the writers
        feature_key_writer = csv.writer(f, delimiter='\t',  quoting=csv.QUOTE_NONNUMERIC)
        feature_writer = csv.writer(g, delimiter='\t', quoting=csv.QUOTE_NONNUMERIC)
        
        # For each split get and generate features
        for split in [0,1,2]:
    
            #reset pointer to cycle through database again
            pointer = 0
            
            print(split)
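            # Order by id so that offset-based chunks neither skip nor repeat candidates
            candidate_query = session.query(Candidate).filter(Candidate.split==split).order_by(Candidate.id).limit(chunksize)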
            
            while True:
                candidates = candidate_query.offset(pointer).all()
                
                if not candidates:
                    break

                for c in tqdm.tqdm(candidates):
                    try:
                        for name, value in get_span_feats(c):

                            # If the training set, set the feature hash
                            if split == 0:
                                if name not in feature_key_hash:
                                    feature_key_hash[name] = feat_counter
                                    feat_counter = feat_counter + 1
                                    feature_key_writer.writerow([group, name, feature_key_hash[name]])

                            if name in feature_key_hash:
                                # prevent duplicates from being written to the file
                                if (c.id, name) not in seen:
                                    feature_writer.writerow([value, c.id, feature_key_hash[name]])
                                    seen.add((c.id, name))

                        #To prevent memory overload
                        seen = set()
                    
                    except Exception as e:
                        print(e)
                        print(c)
                        print(c.get_parent().text)

                # update pointer for database
                pointer = pointer + chunksize

# Generate Coverage Stats

Before passing the labels to a machine learning algorithm, take a look at some quick stats. The code below shows the coverage and conflicts of each label function. Furthermore, it shows the dimensions of each label matrix.
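For intuition about what coverage, overlaps and conflicts measure, the sketch below computes them by hand on a small dense matrix. The matrix is made up, and the definitions follow the usual data programming reading of these terms; they are not guaranteed to match the internals of `lf_stats`.

```python
import numpy as np

# Hypothetical label matrix: rows are candidates, columns are label functions
toy_L = np.array([
    [1,  0, 0],
    [1, -1, 0],
    [0,  0, 0],
    [0,  1, 1],
])

labeled = toy_L != 0

# Coverage: fraction of candidates each label function votes on
coverage = labeled.mean(axis=0)

# Overlaps: fraction of candidates a label function votes on
# where at least one other label function also votes
votes_per_candidate = labeled.sum(axis=1, keepdims=True)
overlaps = (labeled & (votes_per_candidate > 1)).mean(axis=0)

# Conflicts: fraction of candidates a label function votes on
# where the candidate received both a positive and a negative vote
has_positive = (toy_L == 1).any(axis=1, keepdims=True)
has_negative = (toy_L == -1).any(axis=1, keepdims=True)
conflicts = (labeled & has_positive & has_negative).mean(axis=0)

print(coverage, overlaps, conflicts)
```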

In [None]:
print(L_train.lf_stats(session))

In [None]:
print(L_dev.lf_stats(session))