# Generate Word Vectors For Compound Treats Disease Sentences

This notebook generates word vectors for compound treats disease (CtD) sentences. Using Facebook's fastText (through gensim), we trained word vectors on all sentences that contain both a compound mention and a disease mention. The model was trained with the following parameters:

| Parameter | Value |
| --- | --- |
| size | 300 |
| alpha | 0.005 |
| window | 2 |
| epochs (`iter`) | 50 |
| negative | 10 |
| sg | 1 (skip-gram) |
| seed | 100 |

# Set Up Environment

In [2]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

from collections import defaultdict
import os
import pickle
import sys

sys.path.append(os.path.abspath('../../../modules'))

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from tqdm import tqdm_notebook

from gensim.models import FastText
from gensim.models import KeyedVectors

from utils.notebook_utils.dataframe_helper import load_candidate_dataframes, generate_embedded_df


In [3]:
#Set up the environment
username = "danich1"
password = "snorkel"
dbname = "pubmeddb"

#Path subject to change for different os
database_str = "postgresql+psycopg2://{}:{}@/{}?host=/var/run/postgresql".format(username, password, dbname)
os.environ['SNORKELDB'] = database_str

from snorkel import SnorkelSession
session = SnorkelSession()

In [4]:
from snorkel.learning.pytorch.rnn.rnn_base import mark_sentence
from snorkel.learning.pytorch.rnn.utils import candidate_to_tokens
from snorkel.models import Candidate, candidate_subclass

In [5]:
CompoundDisease = candidate_subclass('CompoundDisease', ['Compound', 'Disease'])

# Compound Treats Disease

This section loads the dataframe that contains all compound treats disease candidate sentences and their respective dataset assignments.

In [9]:
cutoff = 300  # maximum sentence length (sen_length) to keep
total_candidates_df = (
    pd
    .read_table("../dataset_statistics/output/all_ctd_map.tsv.xz")
    .query("sen_length < @cutoff")
)
total_candidates_df.head(2)

Unnamed: 0,doid_id,doid_name,resource,resource_id,drugbank_id,drug_name,disease,sources,hetionet,n_sentences,has_sentence,partition_rank,split,compound_mention_count,disease_mention_count,gene_mention_count,sentence_id,text,sen_length,candidate_id
0,DOID:2531,hematologic cancer,CSP,2004-1600,DB00007,Leuprolide,,,0,6,1,0.58699,9,1.0,1.0,0.0,77006168,Follicular mucinosis and mycosis-fungoides-lik...,16,26220260
1,DOID:2531,hematologic cancer,CSP,2004-1600,DB00007,Leuprolide,,,0,6,1,0.58699,9,1.0,1.0,0.0,77006178,We report an unusual case of disseminated urti...,35,26208290


# Train Word Vectors

This section trains the word vectors using the specifications described above.

In [10]:
words_to_embed = []
candidates = (
    session
    .query(CompoundDisease)
    .filter(
        CompoundDisease.id.in_(
            total_candidates_df
            .candidate_id
            .astype(int)
            .tolist()
        )
    )
    .all()
)

In [11]:
for cand in tqdm_notebook(candidates):
    args = [
        (cand[0].get_word_start(), cand[0].get_word_end(), 1),
        (cand[1].get_word_start(), cand[1].get_word_end(), 2)
    ]
    words_to_embed.append(mark_sentence(candidate_to_tokens(cand), args))
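
For readers without Snorkel at hand, the marking step can be illustrated in plain Python. The sketch below is not Snorkel's `mark_sentence`; `mark_tokens` is a hypothetical stand-in. It assumes that word-end indices are inclusive and that entity `i` is wrapped in `~~[[i` and `i]]~~` tokens. Only the opening marker `~~[[1` is visible in this notebook's vocabulary output; the closing form is an assumption.

```python
def mark_tokens(tokens, spans):
    """Return a copy of tokens with marker tokens around each span.

    spans: iterable of (start, end, idx) with an inclusive end index.
    The marker strings are assumed, not taken from Snorkel's source.
    """
    insertions = []
    for start, end, idx in spans:
        insertions.append((start, "~~[[{}".format(idx)))
        insertions.append((end + 1, "{}]]~~".format(idx)))
    marked = list(tokens)
    # Insert from right to left so earlier positions stay valid.
    for position, marker in sorted(insertions, reverse=True):
        marked.insert(position, marker)
    return marked


# Toy sentence: span 1 is the compound, span 2 is the disease.
tokens = ["metformin", "is", "used", "to", "treat", "type", "2", "diabetes"]
marked = mark_tokens(tokens, [(0, 0, 1), (5, 7, 2)])
print(marked)
```

Because the markers are ordinary tokens, they receive their own embeddings during training, which is why a marker shows up in the vocabulary.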




In [12]:
model = FastText(
    words_to_embed, 
    window=2, 
    negative=10, 
    iter=50, 
    sg=1, 
    workers=4, 
    alpha=0.005, 
    size=300,
    seed=100
)

In [13]:
(
    model
    .wv
    .save_word2vec_format(
        # binary=False writes plain text, despite the ".bin" extension
        "output/compound_treats_disease_word_vectors.bin",
        fvocab="output/compound_treats_disease_word_vocab.txt",
        binary=False
    )
)
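
Because `binary=False`, the saved file is plain text: a header line with the vocabulary size and vector dimension, followed by one line per word with its space-separated components. In practice gensim's `KeyedVectors.load_word2vec_format(path, binary=False)` reads such a file back. The sketch below only illustrates the layout on a made-up three-dimensional example; `parse_word2vec_text` is a hypothetical helper, not part of gensim.

```python
import io


def parse_word2vec_text(handle):
    """Parse word2vec text format into a {word: [floats]} dict."""
    vocab_size, dim = (int(x) for x in handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip("\n").split(" ")
        word, values = parts[0], [float(v) for v in parts[1:] if v]
        if len(values) != dim:
            raise ValueError("unexpected vector length for {!r}".format(word))
        vectors[word] = values
    if len(vectors) != vocab_size:
        raise ValueError("header vocabulary size does not match body")
    return vectors


# Toy data, not real output from this notebook.
example = io.StringIO(
    "2 3\n"
    "diabetes 0.1 0.2 0.3\n"
    "~~[[1 -0.5 0.0 0.25\n"
)
vectors = parse_word2vec_text(example)
print(vectors["~~[[1"])
```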

In [14]:
model.wv.most_similar("diabetes")


[('diabete', 0.9016001224517822),
 ('diabetes.all', 0.8938153982162476),
 ('diabetes/obesity', 0.8613971471786499),
 ('obesity/diabetes', 0.8333636522293091),
 ('prediabetes', 0.8226915597915649),
 ('mellitus', 0.7994284629821777),
 ('pre-diabetes', 0.7542805075645447),
 ('diabetes/metabolic', 0.7506469488143921),
 ('diabetes-prone', 0.7498353719711304),
 ('diabetes-free', 0.7486210465431213)]
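
The neighbours above are mostly surface variants of "diabetes", which is what one would expect from fastText: a word's vector is built from the vectors of its character n-grams, so words that share many n-grams tend to land close together. The sketch below lists n-grams with fastText's `<` and `>` boundary markers and compares two words by n-gram overlap. The 3 to 6 length range matches gensim's default `min_n` and `max_n`, which this notebook does not override. `char_ngrams` and `ngram_overlap` are hypothetical helpers, and Jaccard overlap is only a rough proxy, not the cosine similarity that `most_similar` reports.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the set of character n-grams of '<word>'."""
    padded = "<{}>".format(word)
    return {
        padded[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(padded) - n + 1)
    }


def ngram_overlap(a, b):
    """Jaccard overlap between the n-gram sets of two words."""
    grams_a, grams_b = char_ngrams(a), char_ngrams(b)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


for other in ["prediabetes", "mellitus"]:
    print(other, round(ngram_overlap("diabetes", other), 3))
```

Under this scheme "mellitus" shares no n-grams with "diabetes", so its high similarity presumably comes from co-occurrence context rather than spelling.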

In [15]:
word_dict = {word: index for index, word in enumerate(model.wv.vocab.keys())}
word_dict_df = (
    pd
    .DataFrame
    .from_dict(word_dict, orient="index")
    .reset_index()
    .rename({"index":"word", 0:"index"}, axis=1)
)
word_dict_df.to_csv("output/compound_treats_disease_word_dict.tsv", sep="\t", index=False)
word_dict_df.head(2)

Unnamed: 0,word,index
0,~~[[1,0
1,dexamethasone,1
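
A word-to-index mapping like this is typically used to turn marked sentences into integer sequences for a downstream model. The sketch below assumes that use; `encode_tokens`, the toy `word_to_index` dict, and the choice of `-1` for out-of-vocabulary tokens are illustrative and not taken from this project.

```python
def encode_tokens(tokens, word_to_index, unknown_index=-1):
    """Map each token to its integer id, or unknown_index if absent."""
    return [word_to_index.get(token, unknown_index) for token in tokens]


# Toy mapping mirroring the first two rows shown above.
word_to_index = {"~~[[1": 0, "dexamethasone": 1}
encoded = encode_tokens(["~~[[1", "dexamethasone", "1]]~~"], word_to_index)
print(encoded)  # [0, 1, -1]
```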
