# Embed Compound Treats Disease Sentences

This notebook is designed to embed compound treats disease (CtD) sentences. After word vectors have been trained, we embed sentences using the following steps:

1. Load the full vocabulary generated from the trained word vectors.
2. Cycle through each sentence.
3. For each word in the sentence, determine whether the word is in the vocabulary.
4. If yes, assign the word's index; if no, assign the index of the unknown token.
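
The steps above can be sketched in a few lines. This is a minimal illustration, not the notebook's actual implementation (which lives in `generate_embedded_df`): the toy vocabulary and the `embed_tokens` helper are hypothetical, and the only detail taken from this notebook is that unknown words receive index 1.

```python
# Unknown words map to index 1, as described later in this notebook.
UNKNOWN_INDEX = 1


def embed_tokens(tokens, word_dict, unknown_index=UNKNOWN_INDEX):
    """Map each token to its vocabulary index, falling back to the unknown index."""
    return [word_dict.get(token, unknown_index) for token in tokens]


# Hypothetical toy vocabulary; the real one is loaded from the trained word vectors.
toy_word_dict = {"aspirin": 2, "treats": 3, "headache": 4}
sentence = "aspirin treats severe headache".split()

embedded = embed_tokens(sentence, toy_word_dict)
print(embedded)  # "severe" is not in the toy vocabulary, so it maps to 1
```

Using `dict.get` with a default keeps the lookup to a single pass over the sentence and avoids a separate membership check per word.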

# Set Up Environment

In [2]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

from collections import defaultdict
import os
import pickle
import sys

sys.path.append(os.path.abspath('../../../modules'))

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from tqdm import tqdm_notebook

from gensim.models import FastText
from gensim.models import KeyedVectors

from utils.notebook_utils.dataframe_helper import load_candidate_dataframes, generate_embedded_df



In [3]:
#Set up the environment
username = "danich1"
password = "snorkel"
dbname = "pubmeddb"

#Path subject to change for different os
database_str = "postgresql+psycopg2://{}:{}@/{}?host=/var/run/postgresql".format(username, password, dbname)
os.environ['SNORKELDB'] = database_str

from snorkel import SnorkelSession
session = SnorkelSession()

In [4]:
from snorkel.learning.pytorch.rnn.rnn_base import mark_sentence
from snorkel.learning.pytorch.rnn.utils import candidate_to_tokens
from snorkel.models import Candidate, candidate_subclass

In [5]:
CompoundDisease = candidate_subclass('CompoundDisease', ['Compound', 'Disease'])

# Compound Treats Disease

This section loads the dataframe that contains all compound treats disease candidate sentences and their respective dataset assignments.

In [9]:
cutoff = 300
total_candidates_df = (
    pd
    .read_table("../dataset_statistics/results/all_ctd_map.tsv.xz")
    .query("sen_length < @cutoff")
)
total_candidates_df.head(2)



Unnamed: 0,doid_id,doid_name,resource,resource_id,drugbank_id,drug_name,disease,sources,hetionet,n_sentences,has_sentence,partition_rank,split,compound_mention_count,disease_mention_count,gene_mention_count,sentence_id,text,sen_length,candidate_id
0,DOID:2531,hematologic cancer,CSP,2004-1600,DB00007,Leuprolide,,,0,6,1,0.58699,9,1.0,1.0,0.0,77006168,Follicular mucinosis and mycosis-fungoides-lik...,16,26220260
1,DOID:2531,hematologic cancer,CSP,2004-1600,DB00007,Leuprolide,,,0,6,1,0.58699,9,1.0,1.0,0.0,77006178,We report an unusual case of disseminated urti...,35,26208290


# Embed All Compound Treats Disease Sentences

This section embeds all candidate sentences. For each sentence, we place tags around each mention, tokenize the sentence, and then match each token to its corresponding word index. Any word missing from our vocab receives an index of 1. Lastly, the embedded sentences are exported as a sparse dataframe.
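
To make the tagging and lookup concrete, here is a small sketch of the idea. Everything in it is an assumption made for illustration: the marker tokens (`<C>`, `</C>`, `<D>`, `</D>`), the helper names, the toy vocabulary, and the use of 0 as a padding index are hypothetical and are not taken from snorkel's `mark_sentence` or from `generate_embedded_df`.

```python
def mark_mentions(tokens, spans):
    """Wrap each (start, end, open_tag, close_tag) token span in marker tokens.

    `end` is exclusive and spans are assumed not to overlap.
    """
    marked = list(tokens)
    # Work right to left so earlier offsets stay valid after each insertion.
    for start, end, open_tag, close_tag in sorted(spans, reverse=True):
        marked.insert(end, close_tag)
        marked.insert(start, open_tag)
    return marked


def embed_and_pad(tokens, word_dict, max_length, unknown_index=1, pad_index=0):
    """Map tokens to indices (unknown words -> 1) and right-pad to max_length."""
    indices = [word_dict.get(token, unknown_index) for token in tokens][:max_length]
    return indices + [pad_index] * (max_length - len(indices))


# Hypothetical vocabulary that also contains the marker tokens.
toy_word_dict = {
    "<C>": 2, "</C>": 3, "<D>": 4, "</D>": 5,
    "aspirin": 6, "treats": 7, "headache": 8,
}
tokens = ["aspirin", "treats", "severe", "headache"]
spans = [(0, 1, "<C>", "</C>"), (3, 4, "<D>", "</D>")]

marked = mark_mentions(tokens, spans)
row = embed_and_pad(marked, toy_word_dict, max_length=10)
print(marked)
print(row)
```

Padding every row to the same length is what would make the result fit in a rectangular dataframe; the many padding entries are also one plausible reason a sparse representation is attractive.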

In [17]:
word_dict_df = pd.read_table("results/compound_treats_disease_word_dict.tsv")
word_dict = {word[0]:word[1] for word in word_dict_df.values.tolist()}

In [18]:
limit = 1000000
total_candidate_count = total_candidates_df.shape[0]

# These values do not change between batches, so compute them once
candidate_ids = total_candidates_df.candidate_id.astype(int).tolist()
max_length = total_candidates_df.sen_length.max()
output_file = "results/all_embedded_cd_sentences.tsv"

for offset in range(0, total_candidate_count, limit):
    candidates = (
        session
        .query(CompoundDisease)
        .filter(CompoundDisease.id.in_(candidate_ids))
        # A stable ordering keeps offset/limit batches from overlapping
        .order_by(CompoundDisease.id)
        .offset(offset)
        .limit(limit)
        .all()
    )

    # Create the file (with header) on the first batch; append afterwards
    first_batch = offset == 0
    (
        generate_embedded_df(candidates, word_dict, max_length=max_length)
        .to_sparse()
        .to_csv(
            output_file,
            index=False,
            sep="\t",
            mode="w" if first_batch else "a",
            header=first_batch
        )
    )
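
The batched export above follows a common pattern: write the header once with the first batch, then append the remaining batches without it. The sketch below shows that pattern in isolation on a toy dataframe; the `write_in_chunks` helper, the toy data, and the temporary file path are hypothetical and stand in for the database query and `generate_embedded_df`.

```python
import os
import tempfile

import pandas as pd


def write_in_chunks(df, path, chunk_size):
    """Write df to a TSV file in batches, emitting the header only once."""
    for offset in range(0, len(df), chunk_size):
        first_batch = offset == 0
        df.iloc[offset:offset + chunk_size].to_csv(
            path,
            sep="\t",
            index=False,
            mode="w" if first_batch else "a",  # overwrite first, then append
            header=first_batch,
        )


# Toy data standing in for the embedded sentences.
toy_df = pd.DataFrame({"a": range(5), "b": range(5, 10)})
out_path = os.path.join(tempfile.mkdtemp(), "toy_embedded.tsv")

write_in_chunks(toy_df, out_path, chunk_size=2)
round_trip = pd.read_csv(out_path, sep="\t")
print(round_trip.shape)
```

Reading the file back and comparing it with the source dataframe is a cheap way to confirm that no batch was dropped and that the header was not repeated.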

In [None]:
os.system("cd results; xz all_embedded_cd_sentences.tsv")
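
If the `xz` command-line tool is not available, the same compression can be approximated with Python's standard-library `lzma` module. This is an alternative sketch, not what the notebook runs: the `xz_compress` helper and the toy file are hypothetical, and deleting the original file is included only to mirror the default behavior of the `xz` CLI.

```python
import lzma
import os
import shutil
import tempfile


def xz_compress(path, remove_original=True):
    """Compress `path` to `path + '.xz'` and optionally delete the original."""
    compressed_path = path + ".xz"
    with open(path, "rb") as source, lzma.open(compressed_path, "wb") as target:
        # Stream the file in chunks instead of loading it all into memory.
        shutil.copyfileobj(source, target)
    if remove_original:
        os.remove(path)
    return compressed_path


# Toy file in a temporary directory (not the notebook's real output file).
toy_path = os.path.join(tempfile.mkdtemp(), "toy_sentences.tsv")
with open(toy_path, "w") as handle:
    handle.write("0\t1\t2\n")

compressed = xz_compress(toy_path)
with lzma.open(compressed, "rt") as handle:
    restored = handle.read()
print(compressed.endswith(".xz"), repr(restored))
```

Streaming with `shutil.copyfileobj` matters for a file as large as the embedded sentence table, since it avoids holding the whole file in memory.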