### Can machine learning approaches learn relationships between concepts that are in ontologies?
This notebook evaluates to what extent a set of NLP and machine learning methods can learn the relationships that domain experts encode in biological ontologies (here PO and PATO) when they create the terms and relationships in those graphs. For example, specifying that one term is the child of another encodes the knowledge that the two terms are closely related (the particular relationship is given by the edge connecting them) and that the two concepts are likely to overlap significantly in meaning. If an NLP or machine learning approach for evaluating text similarity is useful for that domain, it should be able to learn (without accounting for the ontology directly) that the words related to those concepts are indeed similar. Therefore, this notebook uses the labels (names) of the ontology terms as the dataset of texts input to each NLP or machine learning method, and evaluates the distance values for pairs of labels against the expected similarity given the ontology structure.

In [1]:
import datetime
import nltk
import seaborn as sns
import pandas as pd
import numpy as np
import time
import sys
import gensim
import os
import warnings
import torch
import itertools
import pronto
from collections import Counter, defaultdict
from scipy import spatial, stats
from nltk.tokenize import word_tokenize
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM
from gensim.parsing.preprocessing import strip_non_alphanum, stem_text, preprocess_string, remove_stopwords
from gensim.utils import simple_preprocess
from scipy.spatial.distance import cosine
from scipy.spatial.distance import jaccard
from scipy.stats import spearmanr

sys.path.append("../../oats")
from oats.utils.utils import save_to_pickle, load_from_pickle, merge_list_dicts, flatten, to_hms
from oats.datasets.dataset import Dataset
from oats.annotation.ontology import Ontology
from oats.datasets.string import String
from oats.annotation.annotation import annotate_using_noble_coder
from oats.graphs import pairwise as pw
from oats.graphs.indexed import IndexedGraph
from oats.utils.utils import function_wrapper_with_duration
from oats.nlp.preprocess import concatenate_with_bar_delim
from _utils import Method

warnings.simplefilter('ignore')
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
nltk.download('punkt', quiet=True)
nltk.download('brown', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)

True

### Saving a dataset of parent-child and sibling term label pairs

In [2]:
# The relationship dictionary is used later to identify the relationship between pairs of terms in the dataset.
relationship_dict = defaultdict(dict)
ontologies = {"PATO":pronto.Ontology("../ontologies/pato.obo"), "PO":pronto.Ontology("../ontologies/po.obo")}
tuples = []
for ont_name,ont in ontologies.items():
    delim = "[DELIM]"
    sibling_pairs = set()
    for term in ont:
        for parent in term.parents.id:
            tuples.append((ont_name,"parent_child",term.name,ont[parent].name))   
            relationship_dict[term.id][ont[parent].id] = "parent_child"
            relationship_dict[ont[parent].id][term.id] = "parent_child"
        sorted_id_pairs = [sorted(pair) for pair in list(itertools.combinations(term.children.id, 2))]
        for sorted_id_pair in sorted_id_pairs:
            relationship_dict[sorted_id_pair[0]][sorted_id_pair[1]] = "sibling"
            relationship_dict[sorted_id_pair[1]][sorted_id_pair[0]] = "sibling"
        sorted_pairs = ["{}{}{}".format(ont[pair[0]].name, delim, ont[pair[1]].name) for pair in sorted_id_pairs]
        sibling_pairs.update(sorted_pairs)
    for pair in list(sibling_pairs):
        pair = pair.split(delim)
        tuples.append((ont_name,"sibling",pair[0],pair[1]))        
pairs_df = pd.DataFrame(tuples, columns=["ontology","relationship","label_1","label_2"])
pairs_df.head(10)

| | ontology | relationship | label_1 | label_2 |
|---|---|---|---|---|
| 0 | PATO | parent_child | mobility | physical quality |
| 1 | PATO | parent_child | speed | movement quality |
| 2 | PATO | parent_child | age | time |
| 3 | PATO | parent_child | color | optical quality |
| 4 | PATO | parent_child | color hue | chromatic property |
| 5 | PATO | parent_child | color brightness | optical quality |
| 6 | PATO | parent_child | color saturation | chromatic property |
| 7 | PATO | parent_child | fluorescence | luminous flux |
| 8 | PATO | parent_child | color pattern | spatial pattern |
| 9 | PATO | parent_child | compatibility | behavioral quality |
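
The pair-building logic in the cell above can be illustrated on a toy hierarchy without loading an ontology file. This is only a sketch: the `children` dictionary is a hypothetical stand-in for what `pronto` provides, with a few labels borrowed from the output above, and it works on labels rather than term IDs.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy hierarchy (parent label -> child labels), not read from PATO.
children = {
    "optical quality": ["color", "color brightness"],
    "chromatic property": ["color hue", "color saturation"],
}

relationship = defaultdict(dict)  # symmetric lookup: relationship[a][b]
pairs = []                        # rows of (relationship, label_1, label_2)

for parent, kids in children.items():
    # Every child forms a parent-child pair with its parent.
    for child in kids:
        pairs.append(("parent_child", child, parent))
        relationship[child][parent] = "parent_child"
        relationship[parent][child] = "parent_child"
    # Every unordered pair of children of the same parent is a sibling pair.
    for a, b in combinations(sorted(kids), 2):
        pairs.append(("sibling", a, b))
        relationship[a][b] = "sibling"
        relationship[b][a] = "sibling"

for row in pairs:
    print(row)
```

Storing each relationship in both directions means a later lookup does not need to know which term of a pair comes first.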


### Saving a dataset of all possible label pairs and their Jaccard distances

In [None]:
# Generate a dataset of all possible label pairs and their Jaccard distance based on the ontology structure.
ontologies = {"PATO":pronto.Ontology("../ontologies/pato.obo"), "PO":pronto.Ontology("../ontologies/po.obo")}
ontology = Ontology("../ontologies/mo.obo")
edgelists = []
for ont_name,ont in ontologies.items():
    annotations = {}
    id_to_term_label = {}
    id_to_term_id = {}
    for i,term in enumerate(ont):
        if "obsolete" not in term.name:
            id_to_term_label[i] = term.name
            id_to_term_id[i] = term.id
            annotations[i] = [term.id]
    edgelist = pw.pairwise_square_annotations(annotations, ontology, "jaccard").edgelist
    edgelist["term_1"] = edgelist["from"].map(lambda x: id_to_term_id[x]) 
    edgelist["term_2"] = edgelist["to"].map(lambda x: id_to_term_id[x])
    edgelist["label_1"] = edgelist["from"].map(lambda x: id_to_term_label[x]) 
    edgelist["label_2"] = edgelist["to"].map(lambda x: id_to_term_label[x])
    edgelist["ontology"] = ont_name
    edgelist.rename(columns={"value":"distance"}, inplace=True)
    edgelist = edgelist[["ontology","term_1","term_2","label_1","label_2","distance"]]
    edgelists.append(edgelist)
all_pairs_df = pd.concat(edgelists, ignore_index=True)
all_pairs_df.head(10)
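
As a rough illustration of what a structure-based Jaccard distance between two terms can look like, the sketch below represents each term by the set containing itself and all of its ancestors, then compares those sets. This representation is an assumption made for illustration; the exact computation inside `pw.pairwise_square_annotations` is not shown in this notebook, and the `parents` dictionary is a hypothetical toy hierarchy.

```python
def ancestors_inclusive(term, parents):
    """Return the set containing the term and every term reachable via parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def jaccard_distance(a, b):
    """Jaccard distance between two sets: 1 - |intersection| / |union|."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Hypothetical toy hierarchy (child label -> parent labels), for illustration only.
parents = {
    "color hue": ["chromatic property"],
    "color saturation": ["chromatic property"],
    "chromatic property": ["color"],
    "color": ["optical quality"],
    "color brightness": ["optical quality"],
}

hue = ancestors_inclusive("color hue", parents)
saturation = ancestors_inclusive("color saturation", parents)
brightness = ancestors_inclusive("color brightness", parents)

sibling_distance = jaccard_distance(hue, saturation)
distant_distance = jaccard_distance(hue, brightness)
print(sibling_distance, distant_distance)
```

Under this assumption, siblings share all of their ancestors and so end up closer than terms that only share a root-level ancestor.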

### Evaluating all methods for recapturing relationships encoded in the ontologies
This section is similar to the main analysis notebook that generates the pairwise distance matrices for all of the different NLP methods, and many of those cells have been copied here. The main difference is that the dataset of text descriptions consists of term labels from the ontologies, so the Jaccard distance between each possible pair of terms is treated as ground truth for evaluating how well each method captures these relationships. This section uses only one ontology at a time because we are only interested in pairs of terms that come from the same ontology and therefore have a meaningful distance measure between them.

In [None]:
# Generate dictionaries in the shape expected for running all the NLP methods for just one particular ontology.
ontology_name = "PATO"
ontology = Ontology("../ontologies/pato.obo")
ont = pronto.Ontology("../ontologies/pato.obo")
annotations = {}
descriptions = {}
id_to_term_id = {}
for i,term in enumerate(ont):
    if "obsolete" not in term.name:
        descriptions[i] = term.name
        annotations[i] = [term.id]
        id_to_term_id[i] = term.id

### Sections borrowed from the main analysis notebook
If the dictionary mapping IDs to term labels is stored as 'descriptions', then the cells from the main analysis notebook can be borrowed directly.

In [None]:
# The summarizing output dictionary has the shape TABLE[method][metric] --> value.
TOPIC = "ontology structure"
DATA = ontology_name
TABLE = defaultdict(dict)
OUTPUT_DIR = os.path.join("../outputs",datetime.datetime.now().strftime('%m_%d_%Y_h%Hm%Ms%S'))
os.mkdir(OUTPUT_DIR)

# Paths
dataset_filename = "../data/pickles/text_plus_annotations_dataset.pickle"        # The full dataset pickle.
groupings_filename = "../data/pickles/lloyd_subsets.pickle"                      # The groupings pickle.
background_corpus_filename = "../data/corpus_related_files/untagged_text_corpora/background.txt"       # Text file with background content.
phenotypes_corpus_filename = "../data/corpus_related_files/untagged_text_corpora/phenotypes_small.txt" # Text file with specific content.
doc2vec_pubmed_filename = "../gensim/pubmed_dbow/doc2vec_2.bin"                  # File holding saved Doc2Vec model.
doc2vec_wikipedia_filename = "../gensim/enwiki_dbow/doc2vec.bin"                 # File holding saved Doc2Vec model.
word2vec_model_filename = "../gensim/wiki_sg/word2vec.bin"                       # File holding saved Word2Vec model.
ontology_filename = "../ontologies/mo.obo"                                       # Ontology file in OBO format.
noblecoder_jarfile_path = "../lib/NobleCoder-1.0.jar"                            # Jar for NOBLE Coder tool.
biobert_pmc_path = "../gensim/biobert_v1.0_pmc/pytorch_model"                    # Path for PyTorch BioBERT model.
biobert_pubmed_path = "../gensim/biobert_v1.0_pubmed/pytorch_model"              # Path for PyTorch BioBERT model.
biobert_pubmed_pmc_path = "../gensim/biobert_v1.0_pubmed_pmc/pytorch_model"      # Path for PyTorch BioBERT model.

# Files and models related to the machine learning text embedding methods.
doc2vec_wiki_model = gensim.models.Doc2Vec.load(doc2vec_wikipedia_filename)
doc2vec_pubmed_model = gensim.models.Doc2Vec.load(doc2vec_pubmed_filename)
word2vec_model = gensim.models.Word2Vec.load(word2vec_model_filename)
bert_tokenizer_base = BertTokenizer.from_pretrained('bert-base-uncased')
bert_tokenizer_pmc = BertTokenizer.from_pretrained(biobert_pmc_path)
bert_tokenizer_pubmed = BertTokenizer.from_pretrained(biobert_pubmed_path)
bert_tokenizer_pubmed_pmc = BertTokenizer.from_pretrained(biobert_pubmed_pmc_path)
bert_model_base = BertModel.from_pretrained('bert-base-uncased')
bert_model_pmc = BertModel.from_pretrained(biobert_pmc_path)
bert_model_pubmed = BertModel.from_pretrained(biobert_pubmed_path)
bert_model_pubmed_pmc = BertModel.from_pretrained(biobert_pubmed_pmc_path)

# Preprocessing of the text descriptions. Different methods are necessary for different approaches.
descriptions_full_preprocessing = {i:" ".join(preprocess_string(d)) for i,d in descriptions.items()}
descriptions_simple_preprocessing = {i:" ".join(simple_preprocess(d)) for i,d in descriptions.items()}
descriptions_no_stopwords = {i:remove_stopwords(d) for i,d in descriptions.items()}
get_pos_tokens = lambda text,pos: " ".join([t[0] for t in nltk.pos_tag(word_tokenize(text)) if t[1].lower()==pos.lower()])
descriptions_noun_only =  {i:get_pos_tokens(d,"NN") for i,d in descriptions.items()}
descriptions_noun_only_full_preprocessing = {i:" ".join(preprocess_string(d)) for i,d in descriptions_noun_only.items()}
descriptions_noun_only_simple_preprocessing = {i:" ".join(simple_preprocess(d)) for i,d in descriptions_noun_only.items()}
descriptions_adj_only =  {i:get_pos_tokens(d,"JJ") for i,d in descriptions.items()}
descriptions_adj_only_full_preprocessing = {i:" ".join(preprocess_string(d)) for i,d in descriptions_adj_only.items()}
descriptions_adj_only_simple_preprocessing = {i:" ".join(simple_preprocess(d)) for i,d in descriptions_adj_only.items()}

In [None]:
# Define a list of different methods for calculating distance between text descriptions using the Method object
# defined in the utilities for this notebook. The constructor takes a string for the method name, a string defining
# the hyperparameter choices for that method, a function to be called to run this method, a dictionary of arguments
# by keyword that should be passed to that function, and a distance metric from scipy.spatial.distance to associate
# with this method.

methods = [
    # Methods that use neural networks to generate embeddings.
    Method("Doc2Vec Wikipedia", "Size=300", pw.pairwise_square_doc2vec, {"model":doc2vec_wiki_model, "ids_to_texts":descriptions, "metric":"cosine"}, spatial.distance.cosine),
    Method("Doc2Vec PubMed", "Size=100", pw.pairwise_square_doc2vec, {"model":doc2vec_pubmed_model, "ids_to_texts":descriptions, "metric":"cosine"}, spatial.distance.cosine),
    Method("Word2Vec Wikipedia", "Size=300,Mean", pw.pairwise_square_word2vec, {"model":word2vec_model, "ids_to_texts":descriptions, "metric":"cosine", "method":"mean"}, spatial.distance.cosine),
    Method("Word2Vec Wikipedia", "Size=300,Max", pw.pairwise_square_word2vec, {"model":word2vec_model, "ids_to_texts":descriptions, "metric":"cosine", "method":"max"}, spatial.distance.cosine),
    Method("BERT", "Base:Layers=2,Concatenated", pw.pairwise_square_bert, {"model":bert_model_base, "tokenizer":bert_tokenizer_base, "ids_to_texts":descriptions, "metric":"cosine", "method":"concat", "layers":2}, spatial.distance.cosine),
    #Method("BERT", " Base:Layers=3,Concatenated", pw.pairwise_square_bert, {"model":bert_model_base, "tokenizer":bert_tokenizer_base, "ids_to_texts":descriptions, "metric":"cosine", "method":"concat", "layers":3}, spatial.distance.cosine),
    #Method("BERT", " Base:Layers=4,Concatenated", pw.pairwise_square_bert, {"model":bert_model_base, "tokenizer":bert_tokenizer_base, "ids_to_texts":descriptions, "metric":"cosine", "method":"concat", "layers":4}, spatial.distance.cosine),
    #Method("BERT", " Base:Layers=2,Summed", pw.pairwise_square_bert, {"model":bert_model_base, "tokenizer":bert_tokenizer_base, "ids_to_texts":descriptions, "metric":"cosine", "method":"sum", "layers":2}, spatial.distance.cosine),
    #Method("BERT", " Base:Layers=3,Summed", pw.pairwise_square_bert, {"model":bert_model_base, "tokenizer":bert_tokenizer_base, "ids_to_texts":descriptions, "metric":"cosine", "method":"sum", "layers":3}, spatial.distance.cosine),
    #Method("BERT", " Base:Layers=4,Summed", pw.pairwise_square_bert, {"model":bert_model_base, "tokenizer":bert_tokenizer_base, "ids_to_texts":descriptions, "metric":"cosine", "method":"sum", "layers":4}, spatial.distance.cosine),
    #Method("BioBERT", "PMC,Layers=2,Concatenated", pw.pairwise_square_bert, {"model":bert_model_pmc, "tokenizer":bert_tokenizer_pmc, "ids_to_texts":descriptions, "metric":"cosine", "method":"concat", "layers":2}, spatial.distance.cosine),
    #Method("BioBERT", "PMC,Layers=3,Concatenated", pw.pairwise_square_bert, {"model":bert_model_pmc, "tokenizer":bert_tokenizer_pmc, "ids_to_texts":descriptions, "metric":"cosine", "method":"concat", "layers":3}, spatial.distance.cosine),
    #Method("BioBERT", "PMC,Layers=4,Concatenated", pw.pairwise_square_bert, {"model":bert_model_pmc, "tokenizer":bert_tokenizer_pmc, "ids_to_texts":descriptions, "metric":"cosine", "method":"concat", "layers":4}, spatial.distance.cosine),
    #Method("BioBERT", "PubMed,Layers=4,Concatenated", pw.pairwise_square_bert, {"model":bert_model_pubmed, "tokenizer":bert_tokenizer_pubmed, "ids_to_texts":descriptions, "metric":"cosine", "method":"concat", "layers":4}, spatial.distance.cosine),
    Method("BioBERT", "PubMed,PMC,Layers=4,Concatenated", pw.pairwise_square_bert, {"model":bert_model_pubmed_pmc, "tokenizer":bert_tokenizer_pubmed_pmc, "ids_to_texts":descriptions, "metric":"cosine", "method":"concat", "layers":4}, spatial.distance.cosine),
        
    # Methods that use variations on the n-grams approach with full preprocessing (includes stemming).
    Method("N-Grams", "Full,Words,1-grams,2-grams", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_full_preprocessing, "metric":"cosine", "binary":False, "analyzer":"word", "ngram_range":(1,2),"max_features":10000, "tfidf":False}, spatial.distance.cosine),
    Method("N-Grams", "Full,Words,1-grams,2-grams,Binary", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_full_preprocessing, "metric":"jaccard", "binary":True, "analyzer":"word", "ngram_range":(1,2), "max_features":10000, "tfidf":False}, spatial.distance.jaccard),
    Method("N-Grams", "Full,Words,1-grams", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_full_preprocessing, "metric":"cosine", "binary":False, "analyzer":"word", "ngram_range":(1,1), "max_features":10000, "tfidf":False}, spatial.distance.cosine),
    Method("N-Grams", "Full,Words,1-grams,Binary", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_full_preprocessing, "metric":"jaccard", "binary":True, "analyzer":"word", "ngram_range":(1,1), "max_features":10000, "tfidf":False}, spatial.distance.jaccard),
    Method("N-Grams", "Full,Words,1-grams,2-grams,TFIDF", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_full_preprocessing, "metric":"cosine", "binary":False, "analyzer":"word", "ngram_range":(1,2),"max_features":10000, "tfidf":True}, spatial.distance.cosine),
    Method("N-Grams", "Full,Words,1-grams,2-grams,Binary,TFIDF", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_full_preprocessing, "metric":"cosine", "binary":True, "analyzer":"word", "ngram_range":(1,2), "max_features":10000, "tfidf":True}, spatial.distance.cosine),
    Method("N-Grams", "Full,Words,1-grams,TFIDF", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_full_preprocessing, "metric":"cosine", "binary":False, "analyzer":"word", "ngram_range":(1,1), "max_features":10000, "tfidf":True}, spatial.distance.cosine),
    Method("N-Grams", "Full,Words,1-grams,Binary,TFIDF", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_full_preprocessing, "metric":"cosine", "binary":True, "analyzer":"word", "ngram_range":(1,1), "max_features":10000, "tfidf":True}, spatial.distance.cosine),
    
    # Methods that use variations on the n-grams approach with simple preprocessing (no stemming).
    Method("N-Grams", "Simple,Words,1-grams,2-grams", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_simple_preprocessing, "metric":"cosine", "binary":False, "analyzer":"word", "ngram_range":(1,2),"max_features":10000, "tfidf":False}, spatial.distance.cosine),
    Method("N-Grams", "Simple,Words,1-grams,2-grams,Binary", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_simple_preprocessing, "metric":"jaccard", "binary":True, "analyzer":"word", "ngram_range":(1,2), "max_features":10000, "tfidf":False}, spatial.distance.jaccard),
    Method("N-Grams", "Simple,Words,1-grams", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_simple_preprocessing, "metric":"cosine", "binary":False, "analyzer":"word", "ngram_range":(1,1), "max_features":10000, "tfidf":False}, spatial.distance.cosine),
    Method("N-Grams", "Simple,Words,1-grams,Binary", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_simple_preprocessing, "metric":"jaccard", "binary":True, "analyzer":"word", "ngram_range":(1,1), "max_features":10000, "tfidf":False}, spatial.distance.jaccard),
    Method("N-Grams", "Simple,Words,1-grams,2-grams,TFIDF", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_simple_preprocessing, "metric":"cosine", "binary":False, "analyzer":"word", "ngram_range":(1,2),"max_features":10000, "tfidf":True}, spatial.distance.cosine),
    Method("N-Grams", "Simple,Words,1-grams,2-grams,Binary,TFIDF", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_simple_preprocessing, "metric":"cosine", "binary":True, "analyzer":"word", "ngram_range":(1,2), "max_features":10000, "tfidf":True}, spatial.distance.cosine),
    Method("N-Grams", "Simple,Words,1-grams,TFIDF", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_simple_preprocessing, "metric":"cosine", "binary":False, "analyzer":"word", "ngram_range":(1,1), "max_features":10000, "tfidf":True}, spatial.distance.cosine),
    Method("N-Grams", "Simple,Words,1-grams,Binary,TFIDF", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_simple_preprocessing, "metric":"cosine", "binary":True, "analyzer":"word", "ngram_range":(1,1), "max_features":10000, "tfidf":True}, spatial.distance.cosine),
    
    # Methods that use variations on the n-grams approach selecting for specific parts-of-speech.
    Method("N-Grams", "Full,Nouns,1-grams", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_noun_only_full_preprocessing, "metric":"cosine", "binary":False, "analyzer":"word", "ngram_range":(1,1), "max_features":10000, "tfidf":False}, spatial.distance.cosine),
    Method("N-Grams", "Full,Nouns,1-grams,Binary", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_noun_only_full_preprocessing, "metric":"jaccard", "binary":True, "analyzer":"word", "ngram_range":(1,1), "max_features":10000, "tfidf":False}, spatial.distance.jaccard),
    Method("N-Grams", "Full,Nouns,1-grams,TFIDF", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_noun_only_full_preprocessing, "metric":"cosine", "binary":False, "analyzer":"word", "ngram_range":(1,1), "max_features":10000, "tfidf":True}, spatial.distance.cosine),
    Method("N-Grams", "Full,Nouns,1-grams,Binary,TFIDF", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_noun_only_full_preprocessing, "metric":"cosine", "binary":True, "analyzer":"word", "ngram_range":(1,1), "max_features":10000, "tfidf":True}, spatial.distance.cosine),
    Method("N-Grams", "Full,Adjectives,1-grams", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_adj_only_full_preprocessing, "metric":"cosine", "binary":False, "analyzer":"word", "ngram_range":(1,1), "max_features":10000, "tfidf":False}, spatial.distance.cosine),
    Method("N-Grams", "Full,Adjectives,1-grams,Binary", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_adj_only_full_preprocessing, "metric":"jaccard", "binary":True, "analyzer":"word", "ngram_range":(1,1), "max_features":10000, "tfidf":False}, spatial.distance.jaccard),
    Method("N-Grams", "Full,Adjectives,1-grams,TFIDF", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_adj_only_full_preprocessing, "metric":"cosine", "binary":False, "analyzer":"word", "ngram_range":(1,1), "max_features":10000, "tfidf":True}, spatial.distance.cosine),
    Method("N-Grams", "Full,Adjectives,1-grams,Binary,TFIDF", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_adj_only_full_preprocessing, "metric":"cosine", "binary":True, "analyzer":"word", "ngram_range":(1,1), "max_features":10000, "tfidf":True}, spatial.distance.cosine),
    
    # Methods that use terms inferred from automated annotation of the text.
    #Method("NOBLE Coder", "Precise", pw.pairwise_square_annotations, {"ids_to_annotations":annotations_noblecoder_precise, "ontology":ontology, "binary":True, "metric":"jaccard", "tfidf":False}, spatial.distance.jaccard),
    #Method("NOBLE Coder", "Partial", pw.pairwise_square_annotations, {"ids_to_annotations":annotations_noblecoder_partial, "ontology":ontology, "binary":True, "metric":"jaccard", "tfidf":False}, spatial.distance.jaccard),
    #Method("NOBLE Coder", "Precise,TFIDF", pw.pairwise_square_annotations, {"ids_to_annotations":annotations_noblecoder_precise, "ontology":ontology, "binary":True, "metric":"cosine", "tfidf":True}, spatial.distance.cosine),
    #Method("NOBLE Coder", "Partial,TFIDF", pw.pairwise_square_annotations, {"ids_to_annotations":annotations_noblecoder_partial, "ontology":ontology, "binary":True, "metric":"cosine", "tfidf":True}, spatial.distance.cosine),
    
    Method("Jaccard", "Default", pw.pairwise_square_annotations, {"ids_to_annotations":annotations, "ontology":ontology, "metric":"jaccard", "binary":True, "analyzer":"word", "ngram_range":(1,1), "tfidf":False}, spatial.distance.jaccard),
]
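
The real `Method` class lives in `_utils` and is not shown here. Based only on the constructor description in the comment above and on the attributes used later (`function`, `kwargs`, `metric`, `name_with_hyperparameters`), a minimal stand-in might look like the sketch below. The class name `MethodSketch`, the field names `name` and `hyperparameters`, and the toy function are all hypothetical; the `"name:hyperparameters"` format is inferred from column names such as `"N-Grams:Full,Words,1-grams,Binary"` used in later cells.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class MethodSketch:
    """Hypothetical stand-in for _utils.Method, for illustration only."""
    name: str                       # e.g. "N-Grams"
    hyperparameters: str            # e.g. "Full,Words,1-grams,Binary"
    function: Callable[..., Any]    # function that builds the pairwise distances
    kwargs: Dict[str, Any]          # keyword arguments passed to that function
    metric: Callable[..., float]    # distance metric associated with the method

    @property
    def name_with_hyperparameters(self) -> str:
        # Assumed format: method name and hyperparameter string joined by a colon.
        return "{}:{}".format(self.name, self.hyperparameters)


def count_texts(ids_to_texts):
    """Toy function standing in for a pairwise distance routine."""
    return len(ids_to_texts)


example = MethodSketch(
    name="N-Grams",
    hyperparameters="Full,Words,1-grams,Binary",
    function=count_texts,
    kwargs={"ids_to_texts": {0: "color hue", 1: "color saturation"}},
    metric=lambda u, v: 0.0,  # placeholder metric
)
print(example.name_with_hyperparameters, example.function(**example.kwargs))
```

Bundling the function with its keyword arguments lets the next cell run every method with the same loop.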

In [None]:
# Generate all the pairwise distance matrices (not in parallel).
graphs = {}
names = []
durations = []
for method in methods:
    graph,duration = function_wrapper_with_duration(function=method.function, args=method.kwargs)
    graphs[method.name_with_hyperparameters] = graph
    names.append(method.name_with_hyperparameters)
    durations.append(to_hms(duration))
    print("{:50} {}".format(method.name_with_hyperparameters,to_hms(duration)))
durations_df = pd.DataFrame({"method":names,"duration":durations})
durations_df.to_csv(os.path.join(OUTPUT_DIR,"{}_durations.csv".format(ontology_name.lower())), index=False)

# Merging all of the edgelist dataframes together.
metric_dict = {method.name_with_hyperparameters:method.metric for method in methods}
methods = list(graphs.keys())
edgelists = {k:v.edgelist for k,v in graphs.items()}
df = pw.merge_edgelists(edgelists, default_value=0.000)
df = pw.remove_self_loops(df)
df.tail(10)
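
The timing helpers used in the loop above (`function_wrapper_with_duration` and `to_hms`) come from `oats.utils.utils` and are not shown in this notebook. A minimal sketch of the same idea, with the hypothetical names `run_with_duration` and `format_hms` and an assumed `HH:MM:SS` output format, could be:

```python
import time


def run_with_duration(function, kwargs):
    """Call function(**kwargs) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = function(**kwargs)
    return result, time.perf_counter() - start


def format_hms(seconds):
    """Format a duration in seconds as HH:MM:SS (assumed format)."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return "{:02d}:{:02d}:{:02d}".format(hours, minutes, secs)


def toy_pairwise(ids_to_texts):
    """Toy stand-in for a pairwise method: count the ID pairs it would compare."""
    n = len(ids_to_texts)
    return n * (n - 1) // 2


result, duration = run_with_duration(toy_pairwise, {"ids_to_texts": {0: "age", 1: "time", 2: "speed"}})
print("{:50} {}".format("Toy:Default", format_hms(duration)))
```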

### Part 1: Spearman rank-order correlation coefficient and p-value for each method
The purpose of this section is to see how well the distance values that each method generates between the labels of all term pairs in the ontology correspond to the Jaccard distances between the terms themselves, which ignore the labels and account only for the hierarchical graph structure of the ontology. Spearman's ρ is used to evaluate the correlation between these distributions of distance values, and the results are output to a table. The distributions are also subset to include only the pairs whose labels do not have one or more words in common, and the correlation coefficient is recalculated.

In [None]:
df_no_shared_words = df[df["N-Grams:Full,Words,1-grams,Binary"]==1]
for method in methods:
    sp = spearmanr(df["Jaccard:Default"].values, df[method].values)
    sp_no_shared = spearmanr(df_no_shared_words["Jaccard:Default"].values, df_no_shared_words[method].values)
    TABLE[method].update({"rho_all":sp.correlation,"p_all":sp.pvalue})
    TABLE[method].update({"rho_unshared":sp_no_shared.correlation,"p_unshared":sp_no_shared.pvalue})
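
The cell above relies on `scipy.stats.spearmanr`. To make the calculation explicit, the sketch below computes Spearman's ρ in plain Python as the Pearson correlation of average ranks, first over all pairs and then over only the pairs whose labels share no words (word-overlap distance equal to 1). The rows are made-up toy values, not results from this notebook.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman_rho(x, y):
    return pearson(average_ranks(x), average_ranks(y))


# Toy rows: (ontology distance, method distance, binary word-overlap distance).
rows = [
    (0.2, 0.10, 0.5),
    (0.4, 0.30, 1.0),
    (0.6, 0.90, 1.0),
    (0.8, 0.70, 1.0),
    (0.9, 0.95, 0.5),
]
no_shared_words = [r for r in rows if r[2] == 1.0]

rho_all = spearman_rho([r[0] for r in rows], [r[1] for r in rows])
rho_unshared = spearman_rho([r[0] for r in no_shared_words], [r[1] for r in no_shared_words])
print(rho_all, rho_unshared)
```

Restricting to pairs without shared words checks whether a method captures relatedness beyond simple lexical overlap between labels.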

### Part 2: Look at distance distributions for specific term relationships
The purpose of this section is to use the specific relationships between sibling terms or parent-child term pairs and their labels to see how each method compares in capturing the relationships between these closely related terms. The distance values found by each method are converted to percentiles so that the distributions of scores are comparable between methods. The dataframe is then subset to include only the edges between term pairs that are siblings or parent-child pairs, and written to file.

In [None]:
# Convert all of the distance values to percentiles against the background distribution of distance values.
df[methods] = df[methods].rank(pct=True)

# Also generate columns that aggregate among all hyperparameter choices for a method.
groups = list(set([method.split(":")[0] for method in methods]))
methods.extend(groups)
for group in groups:
    cols = [c for c in df.columns if c.split(":")[0]==group]
    df[group] = df[cols].mean(axis=1)
    
# Use the relationships dictionary saved above to find edges that correspond to specific relationships.
df["term_1"] = df["from"].map(lambda x: id_to_term_id[x])
df["term_2"] = df["to"].map(lambda x: id_to_term_id[x])
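# otypes=[object] keeps NumPy from inferring a fixed-width string dtype from the first result and truncating later values.
df["relationship"] = np.vectorize(lambda t1,t2: relationship_dict[t1].get(t2), otypes=[object])(df["term_1"], df["term_2"])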
df["ontology"] = ontology_name
df_sub = df[(df["relationship"]=="sibling") | (df["relationship"]=="parent_child")]

# Update the table of results to include statistics about each distribution of distances.
for method in methods:
    TABLE[method].update({"sibling_mean":df_sub[df_sub["relationship"]=="sibling"][method].mean()})
    TABLE[method].update({"sibling_median":df_sub[df_sub["relationship"]=="sibling"][method].median()})
    TABLE[method].update({"parent_mean":df_sub[df_sub["relationship"]=="parent_child"][method].mean()})
    TABLE[method].update({"parent_median":df_sub[df_sub["relationship"]=="parent_child"][method].median()})
df_sub[flatten(["term_1","term_2","ontology","relationship",methods])].to_csv(os.path.join(OUTPUT_DIR,"{}_distance_percentiles_all.csv".format(ontology_name.lower())), index=False)
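
To see why converting to percentiles makes methods comparable, the sketch below reproduces the idea behind `rank(pct=True)` in plain Python: each value is replaced by its average rank divided by the number of values. The two columns are made-up toy distances on very different scales.

```python
def percentile_ranks(values):
    """Replace each value by (average rank among all values) / (number of values)."""
    n = len(values)
    result = []
    for v in values:
        lower = sum(1 for other in values if other < v)
        equal = sum(1 for other in values if other == v)
        # Tied values share the average of the ranks they occupy.
        average_rank = lower + (equal + 1) / 2
        result.append(average_rank / n)
    return result


# Toy distances for the same four term pairs from two hypothetical methods.
cosine_like = [0.02, 0.05, 0.03, 0.90]   # mostly tiny values
jaccard_like = [0.50, 1.00, 0.75, 1.00]  # coarse values with ties

cosine_pct = percentile_ranks(cosine_like)
jaccard_pct = percentile_ranks(jaccard_like)
print(cosine_pct)
print(jaccard_pct)
```

After the conversion, a value of 0.25 means the same thing for both methods: the pair is among the closest quarter of all pairs according to that method.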

### Part 3: Saving a smaller table of just the distance percentiles for hand-picked term pairs
The purpose of this section is to compare how similar each method considers a set of hand-picked term pairs that are either parent-child pairs or sibling pairs. These are read from a file and are selected to highlight interesting differences between the embedding techniques, training methods, or hyperparameter choices for the methods.

In [None]:
pairs = pd.read_csv("../data/corpus_related_files/ontology_knowledge/phrase_pairs.csv")
term_id_tuples = [(term_id_1,term_id_2) for term_id_1,term_id_2 in zip(pairs["Term 1"].values,pairs["Term 2"].values)]
handpicked_dict = defaultdict(dict)
for ids in term_id_tuples:
    handpicked_dict[ids[0]][ids[1]] = True
    handpicked_dict[ids[1]][ids[0]] = True
df["handpicked"] = np.vectorize(lambda t1,t2: handpicked_dict[t1].get(t2))(df["term_1"], df["term_2"])
df_sub = df[df["handpicked"]==True]
df_sub[flatten(["term_1","term_2",methods])].to_csv(os.path.join(OUTPUT_DIR,"{}_distance_percentiles_handpicked.csv".format(ontology_name.lower())), index=False)

### Summarizing the results for this notebook
Write a large table of results to an output file. Columns are generally metrics and rows are generally methods.

In [None]:
results = pd.DataFrame(TABLE).transpose()
columns = flatten(["Hyperparams","Group","Order","Topic","Data",results.columns])
results["Hyperparams"] = ""
results["Group"] = ""
results["Order"] = np.arange(results.shape[0])
results["Topic"] = TOPIC
results["Data"] = DATA
results = results[columns]
results.reset_index(inplace=True)
results = results.rename({"index":"Method"}, axis="columns")
hyperparam_sep = ":"
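# Split only on the first separator because some hyperparameter strings (e.g. "Base:Layers=2,Concatenated") contain it too.
results["Hyperparams"] = results["Method"].map(lambda x: x.split(hyperparam_sep, 1)[1] if hyperparam_sep in x else "-")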
results["Method"] = results["Method"].map(lambda x: x.split(hyperparam_sep)[0])
results.to_csv(os.path.join(OUTPUT_DIR,"{}_full_table.csv".format(ontology_name.lower())), index=False)
results
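
The row labels of the results table combine a method name and a hyperparameter string, and some hyperparameter strings contain the separator themselves. A small sketch of separating the two parts, using a hypothetical helper `split_method` that splits only at the first separator, is:

```python
def split_method(label, sep=":"):
    """Split 'name:hyperparameters' at the first separator; use '-' when there are none."""
    name, _, hyperparameters = label.partition(sep)
    return name, (hyperparameters if hyperparameters else "-")


labels = [
    "BERT:Base:Layers=2,Concatenated",    # hyperparameter string contains ':'
    "N-Grams:Full,Words,1-grams,Binary",
    "N-Grams",                            # aggregated group column, no hyperparameters
]
for label in labels:
    print(split_method(label))
```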