# Generate Word Vectors For Gene Interacts Gene Sentences

This notebook generates word vectors for gene interacts gene (GiG) sentences. Using Facebook's fastText, we trained word vectors on a random sample of 2.5 million candidate sentences, each containing a pair of gene mentions. The model was trained using the following specifications:

| Parameter | Value |
| --- | --- |
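| size (vector dimensions) | 300 |
| alpha (initial learning rate) | 0.005 |
| window | 2 |
| epochs (`iter`) | 50 |
| seed | 100 |
| negative (negative samples) | 10 |
| sg (skip-gram) | 1 |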

# Set Up Environment

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

from collections import defaultdict
import os
import pickle
import sys

sys.path.append(os.path.abspath('../../../modules'))

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from tqdm import tqdm_notebook

from gensim.models import FastText
from gensim.models import KeyedVectors

from utils.notebook_utils.dataframe_helper import load_candidate_dataframes, generate_embedded_df

In [2]:
#Set up the environment
username = "danich1"
password = "snorkel"
dbname = "pubmeddb"

#Path subject to change for different os
database_str = "postgresql+psycopg2://{}:{}@/{}?host=/var/run/postgresql".format(username, password, dbname)
os.environ['SNORKELDB'] = database_str

from snorkel import SnorkelSession
session = SnorkelSession()

In [3]:
from snorkel.learning.pytorch.rnn.rnn_base import mark_sentence
from snorkel.learning.pytorch.rnn.utils import candidate_to_tokens
from snorkel.models import Candidate, candidate_subclass

In [4]:
GeneGene = candidate_subclass('GeneGene', ['Gene1', 'Gene2'])

# Gene Interacts Gene

This section loads the dataframe that contains all gene interacts gene candidate sentences and their respective dataset assignments.

In [5]:
total_candidates_df = (
    pd.read_table("../dataset_statistics/results/all_gig_candidates.tsv.xz")
    .query("sen_length < 300")
)
total_candidates_df.head(2)

| | gene1_id | gene1_name | gene2_id | gene2_name | sources | n_sentences | hetionet | has_sentence | split | partition_rank | compound_mention_count | disease_mention_count | gene_mention_count | sentence_id | text | sen_length | candidate_id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | A1BG | 10321 | CRISP3 | II_literature\|hetio-dag | 5 | 1 | 1 | 3 | 0.436432 | 0.0 | 0.0 | 2 | 65570963 | Human CRISP-3 binds serum alpha(1)B-glycoprote... | 11 | 20992573 |
| 1 | 1 | A1BG | 10321 | CRISP3 | II_literature\|hetio-dag | 5 | 1 | 1 | 3 | 0.436432 | 0.0 | 0.0 | 3 | 65570968 | BACKGROUND: CRISP-3 was previously shown to be... | 21 | 20930188 |


# Train Word Vectors

This section trains the word vectors using the specifications described above.

In [15]:
words_to_embed = []
candidates = (
    session
    .query(GeneGene)
    .filter(
        GeneGene.id.in_(
            total_candidates_df
            .sample(2500000, random_state=100)
            .candidate_id
            .astype(int)
            .tolist()
        )
    )
    .all()
)

In [None]:
for cand in tqdm_notebook(candidates):
    # (start, end, index) token offsets for each of the two gene mentions
    args = [
        (cand[0].get_word_start(), cand[0].get_word_end(), 1),
        (cand[1].get_word_start(), cand[1].get_word_end(), 2)
    ]
    words_to_embed.append(mark_sentence(candidate_to_tokens(cand), args))
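
To make the tagging step concrete, here is a minimal sketch of what wrapping two mentions in a token list might look like. `tag_mentions` and its marker strings are hypothetical stand-ins, not Snorkel's `mark_sentence`, and the markers Snorkel actually emits may differ. Spans are `(start, end, index)` with an inclusive end, mirroring the `args` list above; the token list is made up for illustration.

```python
def tag_mentions(tokens, spans, open_fmt="[[{}", close_fmt="{}]]"):
    """Return a copy of tokens with marker tokens around each mention span.

    spans: iterable of (start, end, idx) token offsets, end inclusive.
    open_fmt / close_fmt: hypothetical marker formats (an assumption).
    """
    marks = []
    for start, end, idx in spans:
        marks.append((start, open_fmt.format(idx)))
        marks.append((end + 1, close_fmt.format(idx)))

    tagged = list(tokens)
    # Insert from the right so that earlier offsets stay valid
    for position, marker in sorted(marks, reverse=True):
        tagged.insert(position, marker)
    return tagged


# Made-up token list with mentions at positions 1 and 4
tokens = ["Human", "CRISP-3", "binds", "serum", "A1BG"]
tagged = tag_mentions(tokens, [(1, 1, 1), (4, 4, 2)])
print(tagged)
```

Inserting the markers as separate tokens means they receive their own word vectors during training, so the position of each mention is visible to any downstream model.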

In [None]:
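model = FastText(
    words_to_embed,
    window=2,       # context words on each side of the target
    negative=10,    # negative samples per positive example
    iter=50,        # training epochs (renamed `epochs` in gensim 4)
    sg=1,           # skip-gram rather than CBOW
    workers=4,
    alpha=0.005,    # initial learning rate
    size=300,       # vector dimensions (renamed `vector_size` in gensim 4)
    seed=100
)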

In [19]:
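# binary=False writes plain text, so the ".bin" extension is misleading;
# the file name is left unchanged because other code may depend on it
(
    model
    .wv
    .save_word2vec_format(
        "results/gene_interacts_gene_word_vectors.bin",
        fvocab="results/gene_interacts_gene_word_vocab.txt",
        binary=False
    )
)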
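
Because `binary=False`, the file above is written in the plain-text word2vec format despite its `.bin` extension: a header line with the vocabulary size and vector dimension, then one line per word with its space-separated components. The sketch below parses that layout with the standard library only. `parse_word2vec_text` is a hypothetical helper, and the tiny three-dimensional vectors are made up for illustration.

```python
import io


def parse_word2vec_text(handle):
    """Parse text-format word2vec data into a {word: [floats]} dict."""
    vocab_size, dim = (int(x) for x in handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip("\n").split(" ")
        if len(parts) < dim + 1:
            continue  # skip blank or malformed lines
        word, values = parts[0], parts[1:dim + 1]
        vectors[word] = [float(v) for v in values]
    if len(vectors) != vocab_size:
        raise ValueError(
            "header promised {} words, found {}".format(vocab_size, len(vectors))
        )
    return vectors


# Made-up example data: 2 words, 3 dimensions
example = io.StringIO("2 3\nglucose 0.1 0.2 0.3\nand -0.4 0.5 0.6\n")
vectors = parse_word2vec_text(example)
print(vectors["glucose"])
```

In practice, `KeyedVectors.load_word2vec_format(path, binary=False)` would be the usual way to read the file back into gensim.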

In [20]:
model.wv.most_similar("diabetes")

[('gck-diabetes', 0.8755556344985962),
 ('prediabetes', 0.8549448251724243),
 ('pre-diabetes', 0.8382469415664673),
 ('mellitus', 0.8098444938659668),
 ('non-diabetes', 0.7960482239723206),
 ('hnf1a-diabetes', 0.7894294857978821),
 ('hyperglycemia/diabetes', 0.7500520348548889),
 ('diabetology', 0.7296439409255981),
 ('antidiabetes', 0.7268327474594116),
 ('diabetes-related', 0.718899667263031)]
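
Most of these neighbours share the substring "diabetes", which is what one would expect from fastText: a word's vector is built from the vectors of its character n-grams, so words with overlapping n-grams tend to end up close together. The sketch below lists the n-grams fastText would consider, assuming the library defaults of 3 to 6 characters with `<` and `>` as word-boundary markers (the training cell does not override `min_n` or `max_n`). `char_ngrams` and the Jaccard overlap are illustrative only; the overlap is not the similarity score that gensim reports.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the set of character n-grams of a word with boundary markers."""
    wrapped = "<{}>".format(word)
    return {
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    }


def jaccard(a, b):
    """Share of n-grams two words have in common (illustrative measure)."""
    return len(a & b) / len(a | b)


query = char_ngrams("diabetes")
for neighbour in ["prediabetes", "mellitus"]:
    overlap = jaccard(query, char_ngrams(neighbour))
    print(neighbour, round(overlap, 3))
```

A neighbour such as "mellitus" shares no n-grams with the query, so its high similarity has to come from co-occurrence in similar contexts rather than from shared subwords.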

In [21]:
word_dict = {word: index for index, word in enumerate(model.wv.vocab.keys())}
word_dict_df = (
    pd
    .DataFrame
    .from_dict(word_dict, orient="index")
    .reset_index()
    .rename({"index":"word", 0:"index"}, axis=1)
)
word_dict_df.to_csv("results/gene_interacts_gene_word_dict.tsv", sep="\t", index=False)
word_dict_df.head(2)

| | word | index |
| --- | --- | --- |
| 0 | glucose | 0 |
| 1 | and | 1 |


# Embed All Gene Interacts Gene Sentences

**Note**: Must run this section separately because the kernel cannot handle both training the word vectors and then embedding each GiG sentence.

This section embeds all candidate sentences. For each sentence, we place tags around each gene mention, tokenize the sentence, and match each token to its corresponding word index. Any word missing from our vocabulary receives an index of 1. Lastly, the embedded sentences are exported as a sparse dataframe.
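
The lookup-and-pad step can be sketched as follows. This is not the implementation of `generate_embedded_df`, whose internals are not shown in this notebook. `embed_tokens` is a hypothetical helper, the toy vocabulary and marker tokens are made up, and padding with 0 up to `max_length` is an assumption. Only the rule that unknown words receive index 1 comes from the description above.

```python
def embed_tokens(tokens, word_dict, max_length, unknown_index=1, pad_index=0):
    """Map tokens to integer indices and right-pad to a fixed length.

    unknown_index=1 follows the text above; pad_index=0 is an assumption.
    """
    indices = [word_dict.get(token, unknown_index) for token in tokens[:max_length]]
    return indices + [pad_index] * (max_length - len(indices))


# Toy vocabulary (hypothetical); indices start at 2 so they cannot collide
# with the padding and unknown-word indices used in this sketch
toy_word_dict = {"binds": 2, "serum": 3, "[[1": 4, "1]]": 5}

# "CRISP-3" is absent from the toy vocabulary, so it maps to unknown_index
row = embed_tokens(
    ["[[1", "CRISP-3", "1]]", "binds", "serum"], toy_word_dict, max_length=8
)
print(row)
```

Padding every sentence to the same length gives each row the same number of columns, which is what allows the batches to be appended to a single tab-separated file.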

In [6]:
word_dict_df = pd.read_table("results/gene_interacts_gene_word_dict.tsv")
word_dict = {word: index for word, index in word_dict_df.values.tolist()}

In [None]:
limit = 1000000
total_candidate_count = total_candidates_df.shape[0]

# Compute these once instead of on every iteration
candidate_ids = total_candidates_df.candidate_id.astype(int).tolist()
max_length = total_candidates_df.sen_length.max()

for offset in range(0, total_candidate_count, limit):
    candidates = (
        session
        .query(GeneGene)
        .filter(GeneGene.id.in_(candidate_ids))
        # a stable order keeps offset/limit pages from overlapping or skipping rows
        .order_by(GeneGene.id)
        .offset(offset)
        .limit(limit)
        .all()
    )

    # The first batch creates the file with a header;
    # later batches append without repeating the header
    first_batch = offset == 0
    (
        generate_embedded_df(candidates, word_dict, max_length=max_length)
        .to_csv(
            "results/all_embedded_gg_sentences.tsv",
            index=False,
            sep="\t",
            mode="w" if first_batch else "a",
            header=first_batch
        )
    )
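
The write-then-append pattern used in the loop can be tried in isolation with the standard library. The sketch below writes made-up rows in batches to a temporary file, emitting the header only with the first batch. `write_in_batches` and the column names are hypothetical and stand in for the much larger embedded dataframe.

```python
import csv
import os
import tempfile


def write_in_batches(path, header, rows, batch_size):
    """Write rows to a TSV in batches, adding the header only once."""
    for offset in range(0, len(rows), batch_size):
        first_batch = offset == 0
        # "w" truncates any stale file on the first batch; "a" appends afterwards
        with open(path, "w" if first_batch else "a", newline="") as handle:
            writer = csv.writer(handle, delimiter="\t")
            if first_batch:
                writer.writerow(header)
            writer.writerows(rows[offset:offset + batch_size])


# Made-up data: five rows written in batches of two
rows = [[candidate_id, candidate_id * 10] for candidate_id in range(5)]
path = os.path.join(tempfile.mkdtemp(), "embedded_demo.tsv")
write_in_batches(path, ["candidate_id", "value"], rows, batch_size=2)

with open(path, newline="") as handle:
    written = list(csv.reader(handle, delimiter="\t"))
print(len(written))
```

Opening the file in `"w"` mode for the first batch matters when the cell is re-run: it discards the previous output instead of appending a second copy.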




In [None]:
os.system("cd results; xz all_embedded_gg_sentences.tsv")
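
If shelling out is not an option (for example, when the `xz` binary is not installed), Python's standard `lzma` module can produce the same `.xz` container. The sketch below is an alternative under that assumption, not what the notebook ran. `compress_to_xz` is a hypothetical helper and the demo file is made up. Unlike the `xz` command, it leaves the original file in place unless asked to remove it.

```python
import lzma
import os
import shutil
import tempfile


def compress_to_xz(path, remove_original=False):
    """Compress path to path + '.xz' and return the new file name."""
    target = path + ".xz"
    with open(path, "rb") as source, lzma.open(target, "wb") as compressed:
        shutil.copyfileobj(source, compressed)  # streams the file in chunks
    if remove_original:
        os.remove(path)
    return target


# Made-up demo file
demo_text = "candidate_id\tvalue\n1\t10\n"
demo_path = os.path.join(tempfile.mkdtemp(), "demo.tsv")
with open(demo_path, "w") as handle:
    handle.write(demo_text)

xz_path = compress_to_xz(demo_path)
with lzma.open(xz_path, "rt") as handle:
    restored = handle.read()
print(restored == demo_text)
```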