### Can machine learning approaches learn relationships between concepts that are in ontologies?
If the neural network document encoding models (Doc2Vec) are trained successfully, they should recapture some of the domain-specific information encoded in the relationships of biological ontologies. Specifically, two concepts that have a parent-child relationship in PATO or PO can be considered highly similar in this context. We compare the distances between the labels of such term pairs as inferred by the general Doc2Vec model trained on the English Wikipedia corpus and by our own models trained on PubMed abstracts specific to plant phenotypes. Here we generate figures comparing the results for a set of handpicked phrase or term pairs, and a second figure over all pairs parsed from the hierarchy of each ontology, to check whether the result generalizes to the ontologies as a whole.
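
The cells below express each distance as a percentile of a background distribution of distances, so that values from different models can be compared on one scale. The following is a minimal, dependency-free sketch of that idea. The function name `percentile_of` and the toy numbers are illustrative assumptions, and the tie handling is simpler than the `kind="rank"` option of `scipy.stats.percentileofscore` used in the notebook.

```python
from bisect import bisect_right

def percentile_of(background, value):
    """Return the fraction of background values that are <= value (0.0 to 1.0)."""
    ordered = sorted(background)
    return bisect_right(ordered, value) / len(ordered)

# Hypothetical background distances between arbitrary pairs of texts.
background_distances = [0.2, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.9, 1.0, 1.1]

# A pair with a small distance lands in a low percentile of the background.
close_pair = percentile_of(background_distances, 0.25)
# A pair with a large distance lands in a high percentile.
far_pair = percentile_of(background_distances, 0.95)
print(close_pair, far_pair)
```

A low percentile means the model places the pair closer together than most background pairs, which is what we expect for parent-child term labels if the model has learned the relationship.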

In [None]:
"""
# Convert the distance values in the dataframe to be percentiles.
df_pct = df.copy()
df_pct[methods] = df[methods].rank(pct=True)

# Looking through the distances that are low for PubMed and high for Wikipedia to see if this is a valuable approach.
of_interest = df_pct[(df_pct["doc2vec_pubmed"]<0.1) & (df_pct["doc2vec_wiki"]>0.3)]
for row in of_interest.itertuples():
    print("{} and {}".format(round(row[4],4),round(row[3],4)))
    sentence1 = descriptions[row[1]]
    sentence2 = descriptions[row[2]]
    print("1: {}\n2: {}\n\n".format(sentence1, sentence2))
    
# Looking at the distance values of sentence variations of interest found in the previous step. 
sentences = {
    0:"Susceptible to bacterial infection",
    1:"Resistant to bacterial infection",
    2:"Resistant to powdery mildew",
    3:"susceptible to powdery mildew",
    4:"Some random control sentence"
}
wikipedia_results = pw.pairwise_doc2vec_onegroup(doc2vec_wiki_model,sentences,"cosine").edgelist
wikipedia_results["value"] = wikipedia_results["value"].map(lambda x: stats.percentileofscore(df["doc2vec_wiki"].values, x, kind="rank")/100)
wikipedia_results = pw.remove_self_loops(wikipedia_results)
pubmed_results = pw.pairwise_doc2vec_onegroup(doc2vec_pubmed_model,sentences,"cosine").edgelist
pubmed_results["value"] = pubmed_results["value"].map(lambda x: stats.percentileofscore(df["doc2vec_pubmed"].values, x, kind="rank")/100)
pubmed_results = pw.remove_self_loops(pubmed_results)
results = pw.merge_edgelists({"wikipedia":wikipedia_results,"pubmed":pubmed_results})
results
"""

In [None]:
from scipy.spatial.distance import cosine
from scipy.spatial.distance import jaccard

# Loading a file of handpicked phrase pairs from ontology term labels. 
pairs = pd.read_csv("../data/corpus_related_files/ontology_knowledge/phrase_pairs.csv")
# Adding the distance values found with each method to the dataframe.
# Another valid background distribution for each method could be the distance between all pairs of term labels.
pairs["Wikipedia"] = pw.elemwise_doc2vec_twogroup(doc2vec_wiki_model, pairs["Label 1"].values, pairs["Label 2"].values, cosine)
pairs["PubMed"] = pw.elemwise_doc2vec_twogroup(doc2vec_pubmed_model, pairs["Label 1"].values, pairs["Label 2"].values, cosine)
pairs["Jaccard"] = pw.elemwise_ngrams_twogroup(pairs["Label 1"].values, pairs["Label 2"].values, jaccard)
pairs["Wikipedia"] = pairs["Wikipedia"].map(lambda x: stats.percentileofscore(df["Doc2Vec Wikipedia:Size=300"].values, x, kind="rank")/100)
pairs["PubMed"] = pairs["PubMed"].map(lambda x: stats.percentileofscore(df["Doc2Vec PubMed:Size=100"].values, x, kind="rank")/100)
pairs["Pair"] = pairs["Label 1"].values+","+pairs["Label 2"].values
pairs.to_csv("../data/scratch/phrase_pair_handpicked_results.csv",index=False)
pairs

In [None]:
import itertools
import pronto

# Define the ontologies to be used for this section, using the pronto library to read them.
ontologies = {"PATO":pronto.Ontology("../ontologies/pato.obo"), "PO":pronto.Ontology("../ontologies/po.obo")}
tuples = []

# Iterate through the ontologies and all parent/child and sibling term label pairs.
for ont_name,ont in ontologies.items():
    delim = "[DELIM]"
    sibling_pairs = set()
    for term in ont:
        for parent in term.parents.id:
            tuples.append((ont_name,"parent_child",term.name,ont[parent].name))     
        sorted_id_pairs = [sorted(pair) for pair in list(itertools.combinations(term.children.id, 2))]
        sorted_pairs = ["{}{}{}".format(ont[pair[0]].name, delim, ont[pair[1]].name) for pair in sorted_id_pairs]
        sibling_pairs.update(sorted_pairs)
    for pair in list(sibling_pairs):
        pair = pair.split(delim)
        tuples.append((ont_name,"sibling",pair[0],pair[1]))  

In [None]:
# Using that dataframe to see how the generalized distance percentile distributions compare between models.
pairs = pd.DataFrame(tuples, columns=["Ontology", "Relationship", "Label 1", "Label 2"])

# Adding the distance values found with each method to the dataframe.
# Another valid background distribution for each method could be the distance between all pairs of term labels.
pairs["Wikipedia"] = pw.elemwise_doc2vec_twogroup(doc2vec_wiki_model, pairs["Label 1"].values, pairs["Label 2"].values, cosine)
pairs["PubMed"] = pw.elemwise_doc2vec_twogroup(doc2vec_pubmed_model, pairs["Label 1"].values, pairs["Label 2"].values, cosine)
pairs["Jaccard"] = pw.elemwise_ngrams_twogroup(pairs["Label 1"].values, pairs["Label 2"].values, jaccard)
pairs["Wikipedia"] = pairs["Wikipedia"].map(lambda x: stats.percentileofscore(df["Wikipedia"].values, x, kind="rank")/100)
pairs["PubMed"] = pairs["PubMed"].map(lambda x: stats.percentileofscore(df["PubMed"].values, x, kind="rank")/100)
pairs["Pair"] = pairs["Label 1"].values+","+pairs["Label 2"].values
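# Note: this path is the same one used for the handpicked pairs above, so this call overwrites that file.
pairs.to_csv("../data/scratch/phrase_pair_handpicked_results.csv",index=False)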
pairs

### Can we use this dataset to identify sentences from abstracts that contain phenotypic information?
The question we want to answer is whether these machine learning approaches, in combination with the gathered dataset of phenotypic descriptions, provide a useful method for identifying which sentences in abstracts are likely to contain information related to phenotypes, because such an approach could be valuable in a curation pipeline. We use a partially hands-on approach, evaluating only the predicted matches rather than creating a full dataset of annotated abstracts. The result is therefore evaluated by counting how many of the returned sentences are primarily about a phenotype, partially about a phenotype, or not about a phenotype (three different categories). We cannot compute an F1 score for this approach, because we do not know how many positives were missed or were available in the dataset from which the sentences were drawn. Possible classes are:
1. Specifically mentioning a phenotype (e.g. "*Plants treated with chemical X exhibited phenotype Y*.")
2. Only generally discussing phenotypes as a topic (e.g. "*We organized a dataset of Z phenotypes from Arabidopsis thaliana studies*.")
3. Not talking about phenotypes (e.g. "*Arabidopsis thaliana is one of the most widely studied plant species.*")

Note that the false positives for this analysis likely include sentences where phenotypes are specifically mentioned, but not in the context of observing those phenotypes in a particular plant. Observations of that kind are what we are interested in here, because they are the type of description that belongs in a database or dataset. This should be taken into account when evaluating the scores, and further research on distinguishing between these types of descriptions is needed.
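
A minimal sketch of the ranking step, assuming toy two-dimensional vectors in place of Doc2Vec embeddings: each candidate sentence is scored by its minimum cosine distance to any known phenotype description, and the top-ranked sentences are then inspected by hand. All names and vectors here are hypothetical and are not part of the notebook's `pw` module.

```python
import math

def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def rank_by_min_distance(sentence_vectors, description_vectors):
    """Return sentence IDs sorted by their minimum distance to any description."""
    scores = {
        sid: min(cosine_distance(vec, d) for d in description_vectors.values())
        for sid, vec in sentence_vectors.items()
    }
    return sorted(scores, key=scores.get)

# Hypothetical embeddings of known phenotype descriptions.
description_vectors = {"d1": [1.0, 0.0], "d2": [0.9, 0.1]}
# Hypothetical embeddings of candidate sentences drawn from abstracts.
sentence_vectors = {"s1": [0.0, 1.0], "s2": [1.0, 0.05], "s3": [0.5, 0.5]}

# Sentences closest to any phenotype description come first.
ranking = rank_by_min_distance(sentence_vectors, description_vectors)
print(ranking)
```

Scoring by the minimum rather than the mean distance lets a sentence rank highly when it resembles even one known phenotype description.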

In [None]:
descriptions = dataset.get_description_dictionary()


# Reading in the tagged dataset file of sentences that do or do not describe phenotypes.
tagged_dataset = pd.read_csv("../data/corpus_related_files/brat_annotations_zma_corpus/untagged_dataset.csv")
tagged_dataset = tagged_dataset[(tagged_dataset["tag"]==0) | (tagged_dataset["tag"]==1)]

#print(tagged_dataset)

sentence_dict = {i:sentence for i,sentence in enumerate(tagged_dataset["sentence"].values)}
tags_dict = {i:tag for i,tag in enumerate(tagged_dataset["tag"].values)}
wikipedia_results = pw.pairwise_doc2vec_twogroup(doc2vec_wiki_model,sentence_dict,descriptions,"cosine").edgelist
#wikipedia_results = pw.pairwise_ngrams_twogroup(sentence_dict,descriptions,"cosine").edgelist


# Evaluating each sentence by its minimum distance to any of the phenotype descriptions.
results = pd.DataFrame(wikipedia_results.groupby(["from"])["value"].min())
results = results.reset_index(inplace=False)
results = results.sort_values(by=["value"])
results["sentence"] = results["from"].map(lambda x: sentence_dict[x])

results.sort_values(by=["value"], inplace=True, ascending=True)

print(results.head())


# Generating the lists of true values and predictions and metrics.
y_true = [tags_dict[i] for i in results["from"].values]
# Converting distances to scores so that smaller distances rank higher.
y_prob = [1.000-v for v in results["value"].values]
n_pos, n_neg = Counter(y_true)[1], Counter(y_true)[0]
precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
# The baseline is the fraction of positives, the expected precision of a random ranking.
baseline = n_pos/len(y_true)
area = auc(recall, precision)
auc_to_baseline_auc_ratio = area/baseline
print(area)
print(baseline)



In [None]:
results.head(200).to_csv("~/Desktop/a.csv")

In [None]:
# Find the maximum Fβ score for different values of β.
f_beta = lambda pr,re,beta: [((1+beta**2)*p*r)/((((beta**2)*p)+r)) for p,r in zip(pr,re)]
print(np.nanmax(f_beta(precision,recall,1)))
print(np.nanmax(f_beta(precision,recall,0.5)))
print(np.nanmax(f_beta(precision,recall,2)))
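
The Fβ score used above weights recall β times as much as precision, so β=0.5 favors precision and β=2 favors recall. The following is a small sketch of the same formula as a plain function. It assumes that returning 0.0 is acceptable when precision and recall are both zero, whereas the lambda above relies on `np.nanmax` to skip undefined values. The precision and recall pairs are hypothetical.

```python
def f_beta(precision, recall, beta):
    """F-beta score; beta > 1 favors recall, beta < 1 favors precision."""
    denominator = (beta ** 2) * precision + recall
    if denominator == 0:
        return 0.0  # Assumption: treat the undefined 0/0 case as a score of zero.
    return (1 + beta ** 2) * precision * recall / denominator

# Hypothetical precision/recall pairs at three decision thresholds.
points = [(1.0, 0.2), (0.8, 0.5), (0.5, 0.9)]

# Report the best achievable score over the thresholds for each beta.
for beta in (1, 0.5, 2):
    best = max(f_beta(p, r, beta) for p, r in points)
    print(beta, round(best, 4))
```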