# Table of Contents

- [Introduction](#introduction)

- [Links of Interest](#links)

- [Part 1. Loading and Filtering Data](#paths)
    - [Setting input and output paths](#paths)
    - [Reading in a dataset of text descriptions](#read_this_data)
    - [Reading in a dataset of groups or categories](#read_other_data)
    - [Relating the datasets to one another](#relating)
    - [Filtering the datasets](#filtering)
    
- [Part 2. NLP Models](#word2vec_doc2vec)
    - [Word2Vec and Doc2Vec](#word2vec_doc2vec)
    - [BERT and BioBERT](#bert_biobert)
    - [Loading models](#load_models)

- [Part 3. NLP Choices](#part8)
    - [Preprocessing the phenotype descriptions](#preprocessing)
    - [POS Tagging](#pos_tagging)
    - [Reducing the size of the vocabulary](#vocab)
    - [Annotating descriptions using biological ontologies](#annotation)
    - [Splitting phenotype descriptions into phene descriptions](#phenes)
    
- [Part 4. Generating Vector Representations and Distance Matrices](#matrix)
    - [Defining a list of methods to use](#methods)
    - [Running each method](#running)
    - [Merging distances into a single dataframe](#merging)
    - [Adding additional information](#merging)

- [Part 5. Cluster Analysis]()
    - [Topic modeling](#topic_modeling)
    - [Agglomerative clustering](#clustering)
    - [Phenologs for OMIM disease phenotypes](#phenologs)
    
- [Part 6. Supervised Tasks](#supervised)
    - [Combining methods with ensemble approaches](#ensemble)
    - [Comparing distributions of distance values between methods](#ks)
    - [Comparing the within-group distance values across gene groups and methods](#within)
    - [Comparing the AUC for predicting shared pathways, gene groups, or interactions between methods](#auc)
    - [Comparing querying for similar genes using distance matrices for each method](#y)
    - [Comparing the AUC for predicting the specific pathway or group of a gene](#mean)
    - [Generating a table of resulting metrics for each method used](#output)

<a id="introduction"></a>
### Introduction: Text Mining Analysis of Phenotype Descriptions in Plants
The purpose of this notebook is to evaluate what can be learned from a natural language processing approach to analyzing free-text phenotype descriptions of plants. The approach is to generate pairwise distance matrices between a set of plant phenotype descriptions across different species, sourced from academic papers and online model organism databases. These pairwise distance matrices can be constructed using any vectorization method that can be applied to natural language. In this notebook, we specifically evaluate the use of n-gram and bag-of-words techniques, word and document embedding using Word2Vec and Doc2Vec, context-dependent word embeddings using BERT and BioBERT, and ontology term annotations with automated annotation tools such as NOBLE Coder.

Loading, manipulation, and filtering of the dataset of phenotype descriptions associated with genes across different plant species is largely handled through a Python package created for this purpose called OATS (Ontology Annotation and Text Similarity), which is available [here](https://github.com/irbraun/oats). Preprocessing of the descriptions and mapping of the dataset to additional resources such as protein-protein interaction databases and biochemical pathway databases are handled in this notebook using that package as well. In the evaluation of each of these natural language processing approaches to analyzing this dataset of descriptions, we compare performance against a dataset generated through manual annotation of a similar dataset in Oellrich, Walls et al. (2015) and against manual annotations with experimentally determined terms from the Gene Ontology (GO) and the Plant Ontology (PO).

<a id="links"></a>
### Relevant links of interest:
- Paper describing comparison of NLP and ontology annotation approaches to curation: [Braun, Lawrence-Dill (2019)](https://doi.org/10.3389/fpls.2019.01629)
- Paper describing results of manual phenotype description curation: [Oellrich, Walls et al. (2015)](https://plantmethods.biomedcentral.com/articles/10.1186/s13007-015-0053-y)
- Plant databases with phenotype description text data available: [TAIR](https://www.arabidopsis.org/), [SGN](https://solgenomics.net/), [MaizeGDB](https://www.maizegdb.org/)
- Python package for working with phenotype descriptions: [OATS](https://github.com/irbraun/oats)
- Python package used for general NLP functions: [NLTK](https://www.nltk.org/), [Gensim](https://radimrehurek.com/gensim/auto_examples/index.html)
- Python package used for working with biological ontologies: [Pronto](https://pronto.readthedocs.io/en/latest/)
- Python package for loading pretrained BERT models: [PyTorch Pretrained BERT](https://pypi.org/project/pytorch-pretrained-bert/)
- For BERT Models pretrained on PubMed and PMC: [BioBERT Paper](https://arxiv.org/abs/1901.08746), [BioBERT Models](https://github.com/naver/biobert-pretrained)

In [1]:
import datetime
import nltk
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
import pandas as pd
import numpy as np
import time
import math
import sys
import gensim
import os
import warnings
import torch
import itertools
import multiprocessing as mp
from collections import Counter, defaultdict
from inspect import signature
from scipy.stats import ks_2samp, hypergeom
from sklearn.metrics import precision_recall_curve, f1_score, auc
from sklearn.model_selection import train_test_split, KFold
from scipy import spatial, stats
from statsmodels.sandbox.stats.multicomp import multipletests
from nltk.corpus import brown
from nltk.tokenize import word_tokenize, sent_tokenize
from sklearn.neighbors import KNeighborsClassifier
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM
from gensim.parsing.preprocessing import strip_non_alphanum, stem_text, preprocess_string, remove_stopwords
from gensim.utils import simple_preprocess
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.decomposition import LatentDirichletAllocation as LDA
from sklearn.cluster import AgglomerativeClustering

sys.path.append("../../oats")
from oats.utils.utils import save_to_pickle, load_from_pickle, merge_list_dicts, flatten, to_hms
from oats.datasets.dataset import Dataset
from oats.datasets.groupings import Groupings
from oats.annotation.ontology import Ontology
from oats.datasets.string import String
from oats.datasets.edges import Edges
from oats.annotation.annotation import annotate_using_noble_coder
from oats.graphs import pairwise as pw
from oats.graphs.indexed import IndexedGraph
from oats.graphs.weighting import train_logistic_regression_model, apply_logistic_regression_model
from oats.graphs.weighting import train_random_forest_model, apply_random_forest_model
from oats.nlp.vocabulary import get_overrepresented_tokens, get_vocabulary_from_tokens
from oats.nlp.vocabulary import reduce_vocabulary_connected_components, reduce_vocabulary_linares_pontes
from oats.utils.utils import function_wrapper_with_duration
from oats.nlp.preprocess import concatenate_with_bar_delim

from _utils import Method

mpl.rcParams["figure.dpi"] = 400
warnings.simplefilter('ignore')
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
nltk.download('punkt', quiet=True)
nltk.download('brown', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)

True

# Part 1. Loading and Filtering Data
<a id="paths"></a>
### Setting up the input and output paths and summarizing output table
This section defines constants used to create a uniquely named directory that contains all the outputs from running this instance of the notebook. The naming scheme is based on the time at which the notebook is run. The other constants record, in the output table, what the topic was for this run of the notebook, such as KEGG biochemical pathways, STRING protein-protein interaction data, or some other type of gene function grouping or hierarchy. These values are arbitrary and only serve to keep better notes about what the output of the notebook corresponds to. All the input and output file paths for loading datasets or models are also defined here, so that if anything is moved, the directories and file names only have to be changed at this point and nowhere else further into the notebook. If additional files are used in later notebook cells, their paths should be added here as well.

In [2]:
# The summarizing output dictionary has the shape TABLE[method][metric] --> value.
TOPIC = "Biochemical Pathways"
DATA = "Filtered"
TABLE = defaultdict(dict)
OUTPUT_DIR = os.path.join("../outputs",datetime.datetime.now().strftime('%m_%d_%Y_h%Hm%Ms%S'))
os.mkdir(OUTPUT_DIR)

In [3]:
dataset_filename = "../data/pickles/text_plus_annotations_dataset.pickle"                            # The full dataset pickle.
groupings_filename = "../data/pickles/pmn_pathways.pickle"                                           # The groupings pickle.
background_corpus_filename = "../data/corpus_related_files/untagged_text_corpora/background.txt"     # Text file with background content.
phenotypes_corpus_filename = "../data/corpus_related_files/untagged_text_corpora/phenotypes_all.txt" # Text file with specific content.
doc2vec_pubmed_filename = "../gensim/pubmed_dbow/doc2vec_2.bin"                                      # File holding saved Doc2Vec model.
doc2vec_wikipedia_filename = "../gensim/enwiki_dbow/doc2vec.bin"                                     # File holding saved Doc2Vec model.
word2vec_model_filename = "../gensim/wiki_sg/word2vec.bin"                                           # File holding saved Word2Vec model.
ontology_filename = "../ontologies/mo.obo"                                                           # Ontology file in OBO format.
noblecoder_jarfile_path = "../lib/NobleCoder-1.0.jar"                                                # Jar for NOBLE Coder tool.
biobert_pmc_path = "../gensim/biobert_v1.0_pmc/pytorch_model"                                        # Path for PyTorch BioBERT model.
biobert_pubmed_path = "../gensim/biobert_v1.0_pubmed/pytorch_model"                                  # Path for PyTorch BioBERT model.
biobert_pubmed_pmc_path = "../gensim/biobert_v1.0_pubmed_pmc/pytorch_model"                          # Path for PyTorch BioBERT model.
panther_to_omim_filename = "../data/orthology_related_files/pantherdb_omim_df.csv"                   # File with mappings to orthologs and disease phenotypes.

<a id="read_this_data"></a>
### Reading in the dataset of genes and their associated phenotype descriptions and annotations

In [4]:
dataset = load_from_pickle(dataset_filename)
dataset.describe()
dataset.filter_by_species("ath")
dataset.filter_has_description()
dataset.filter_has_annotation()
dataset.describe()
dataset.filter_has_annotation("GO")
dataset.filter_has_annotation("PO")
dataset.describe()
dataset.to_pandas().head(10)

Number of rows in the dataframe: 30169
Number of unique IDs:            30169
Number of unique descriptions:   4566
Number of unique gene name sets: 30169
Number of species represented:   6
Number of rows in the dataframe: 5615
Number of unique IDs:            5615
Number of unique descriptions:   3378
Number of unique gene name sets: 5615
Number of species represented:   1
Number of rows in the dataframe: 3480
Number of unique IDs:            3480
Number of unique descriptions:   2884
Number of unique gene name sets: 3480
Number of species represented:   1


Unnamed: 0,id,species,gene_names,description,term_ids
0,0,ath,At3g49600|UBP26|AT3G49600|SUP32|ATUBP26|ubiqui...,50% defective seeds. Low penetrance of endospe...,GO:0005730|GO:0048316|PO:0000013|PO:0000037|PO...
1,1,ath,AT1G74380|XXT5|xyloglucan xylosyltransferase 5...,Abnormal roothairs. Reduction in xyloglucan le...,GO:0005794|GO:0048767|GO:0005515|GO:0000139|GO...
2,2,ath,AT1G74450|AT1G74450.1|F1M20.13|F1M20_13,No visible phenotype.,GO:0003674|GO:0008150|PO:0000013|PO:0000037|PO...
3,3,ath,AT1G74560|AT2G03440|NRP1|NAP1-related protein ...,mutants did not show any phenotype under in vi...,GO:0005634|GO:0005829|GO:0046686|GO:0003682|GO...
4,4,ath,AT1G74660|MIF1|mini zinc finger 1|F1M20.34|F1M...,Constitutive overexpression of MIF1 caused dra...,GO:0048509|GO:0045892|GO:0009640|GO:0003677|GO...
5,5,ath,AT1G74730|RIQ2|F25A4.30|F25A4_30,"Reduced NPQ, affected organization of light-ha...",GO:0009535|GO:0009534|GO:0003674|GO:0009507|GO...
6,6,ath,AT1G74740|CPK30|CDPK1A|ATCPK30|calcium-depende...,Embryo lethality of cpk10 cpk30 double mutant ...,GO:0005515|GO:0005886|PO:0000013|PO:0000037|PO...
7,7,ath,AT1G74910|KJC1|KONJAC 1|F25A4.12|F25A4_12|AT1G...,"Reduced levels of GDP-Man. Severe dwarf, small...",GO:0005829|GO:0005777|GO:0046686|PO:0000013|PO...
8,8,ath,AT1G75080|BZR1|BRASSINAZOLE-RESISTANT 1|F9E10....,"Insensitive to brassinazole (BRZ), an inhibito...",GO:0045892|GO:0048481|GO:0003700|GO:0005515|GO...
9,9,ath,AT1G75520|SRS5|SHI-related sequence 5|F1B16.17,18-25% of flowers have homeotic conversion pet...,GO:0048467|PO:0000037|PO:0009009|PO:0009010|PO...


<a id="read_other_data"></a>
### Reading in the dataset of groupings, pathways, or any other type of categorization

In [5]:
groups = load_from_pickle(groupings_filename)
id_to_group_ids = groups.get_id_to_group_ids_dict(dataset.get_gene_dictionary())
group_id_to_ids = groups.get_group_id_to_ids_dict(dataset.get_gene_dictionary())
group_mapped_ids = [k for (k,v) in id_to_group_ids.items() if len(v)>0]
groups.describe()
groups.to_csv(os.path.join(OUTPUT_DIR,"part_1_groupings.csv"))
groups.to_pandas().head(10)

Number of groups present for each species
  ath: 627
  zma: 565
  mtr: 520
  osa: 569
  gmx: 618
  sly: 524
Number of genes names mapped to any group for each species
  ath: 9959
  zma: 14319
  mtr: 14100
  osa: 12156
  gmx: 20677
  sly: 13232


Unnamed: 0,species,pathway_id,pathway_name,gene_names,ec_number
0,ath,PWY-5272,abscisic acid degradation by glucosylation,at1g52400-monomer|abscisic acid glucose ester ...,EC-3.2.1.175
1,ath,PWY-5272,abscisic acid degradation by glucosylation,at4g15550-monomer|abscisate &beta;-glucosyltra...,EC-2.4.1.263
2,ath,PWY-5272,abscisic acid degradation by glucosylation,at4g15260-monomer|abscisate &beta;-glucosyltra...,EC-2.4.1.263
3,ath,PWY-5272,abscisic acid degradation by glucosylation,at3g21790-monomer|abscisate &beta;-glucosyltra...,EC-2.4.1.263
4,ath,PWY-5272,abscisic acid degradation by glucosylation,at3g21760-monomer|abscisate &beta;-glucosyltra...,EC-2.4.1.263
5,ath,PWY-5272,abscisic acid degradation by glucosylation,at2g23210-monomer|abscisate &beta;-glucosyltra...,EC-2.4.1.263
6,ath,PWY-5272,abscisic acid degradation by glucosylation,at1g05530-monomer|abscisic acid glucosyltransf...,EC-2.4.1.263
7,ath,PWY-5272,abscisic acid degradation by glucosylation,at1g05560-monomer|abscisic acid glucosyltransf...,EC-2.4.1.263
8,ath,PWY-5272,abscisic acid degradation by glucosylation,at4g34138-monomer|abscisic acid glucosyltransf...,EC-2.4.1.263
9,ath,PWY-5272,abscisic acid degradation by glucosylation,at2g23250-monomer|abscisic acid glucosyltransf...,EC-2.4.1.263


<a id="relating"></a>
### Relating the dataset of genes to the dataset of groupings or categories
This section generates tables that indicate how the genes present in the dataset were mapped to the defined pathways or groups. This includes a summary table that indicates how many genes from each species were successfully mapped to at least one pathway or group, as well as a more detailed table describing how many genes from each species were mapped to each particular pathway or group.

In [6]:
# Generate a table describing how many of the genes input from each species map to at least one group.
summary = defaultdict(dict)
species_dict = dataset.get_species_dictionary()
for species in dataset.get_species():
    summary[species]["input"] = len([x for x in dataset.get_ids() if species_dict[x]==species])
    summary[species]["mapped"] = len([x for x in group_mapped_ids if species_dict[x]==species])
table = pd.DataFrame(summary).transpose()
table.loc["total"]= table.sum()
table["fraction"] = table.apply(lambda row: "{:0.4f}".format(row["mapped"]/row["input"]), axis=1)
table = table.reset_index(inplace=False)
table = table.rename({"index":"species"}, axis="columns")
table.to_csv(os.path.join(OUTPUT_DIR,"part_1_mappings_summary.csv"), index=False)

# Generate a table describing how many genes from each species map to which particular group.
summary = defaultdict(dict)
for group_id,ids in group_id_to_ids.items():
    summary[group_id].update({species:len([x for x in ids if species_dict[x]==species]) for species in dataset.get_species()})
    summary[group_id]["total"] = len([x for x in ids])
table = pd.DataFrame(summary).transpose()
table = table.sort_values(by="total", ascending=False)
table = table.reset_index(inplace=False)
table = table.rename({"index":"pathway_id"}, axis="columns")
table["pathway_name"] = table["pathway_id"].map(groups.get_long_name)
table.loc["total"] = table.sum()
table.loc["total","pathway_id"] = "total"
table.loc["total","pathway_name"] = "total"
table = table[table.columns.tolist()[-1:] + table.columns.tolist()[:-1]]
table.to_csv(os.path.join(OUTPUT_DIR,"part_1_mappings_by_group.csv"), index=False)

<a id="filtering"></a>
### Option 1: Filtering the dataset based on presence in the curated Oellrich, Walls et al. (2015) dataset

In [7]:
# Filter the dataset based on whether or not the genes were in the curated dataset.
# This is similar to filtering based on protein interaction data because the dataset is a list of edge values.
pppn_edgelist_path = "../data/supplemental_files_oellrich_walls/13007_2015_53_MOESM9_ESM.txt"
pppn_edgelist = Edges(dataset.get_name_to_id_dictionary(), pppn_edgelist_path)
dataset.filter_with_ids(pppn_edgelist.ids)
dataset.describe()

Number of rows in the dataframe: 1899
Number of unique IDs:            1899
Number of unique descriptions:   1692
Number of unique gene name sets: 1899
Number of species represented:   1


### Option 2: Filtering the dataset based on protein-protein interactions
This is done to only include genes (and the corresponding phenotype descriptions and annotations) which are useful for the current analysis. In this case we want to only retain genes that are mentioned at least once in the STRING database for a given species. If a gene is not mentioned at all in STRING, there is no information available about whether or not it interacts with any other proteins in the dataset, so we choose not to include it in the analysis. Only genes that have at least one true positive are included, because these are the only ones for which the missing information (negatives) is meaningful. Run this cell instead of the subsequent one, or the other way around, depending on whether or not protein-protein interactions are the prediction goal for the current analysis.

In [None]:
# Filter the dataset based on whether or not the genes were successfully mapped to an interaction.
# Reduce the size of the dataset by removing genes not mentioned in STRING.
naming_file = "../data/group_related_files/string/all_organisms.name_2_string.tsv"
interaction_files = [
    "../data/group_related_files/string/3702.protein.links.detailed.v11.0.txt", # Arabidopsis thaliana
    "../data/group_related_files/string/4577.protein.links.detailed.v11.0.txt", # maize
    "../data/group_related_files/string/4530.protein.links.detailed.v11.0.txt", # rice
    "../data/group_related_files/string/4081.protein.links.detailed.v11.0.txt", # tomato
    "../data/group_related_files/string/3880.protein.links.detailed.v11.0.txt", # medicago
    "../data/group_related_files/string/3847.protein.links.detailed.v11.0.txt", # soybean
]
genes = dataset.get_gene_dictionary()
string_data = String(genes, naming_file, *interaction_files)
dataset.filter_with_ids(string_data.ids)
dataset.describe()
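
The filtering rule described for Option 2 can be illustrated without the STRING files. The following is a minimal sketch on made-up IDs and edges; `keep_ids_with_edges` is a hypothetical helper written for this illustration and is not part of OATS.

```python
def keep_ids_with_edges(ids, edges):
    """Return the IDs that are mentioned in at least one known interaction."""
    mentioned = set()
    for a, b in edges:
        mentioned.add(a)
        mentioned.add(b)
    return [i for i in ids if i in mentioned]

# Toy data (assumed): gene 3 never appears in an edge, so nothing is known
# about its interactions and it is dropped.
toy_ids = [0, 1, 2, 3]
toy_edges = [(0, 1), (1, 2)]
kept = keep_ids_with_edges(toy_ids, toy_edges)
print(kept)
```

Dropping genes with no edges means that every remaining gene has at least one positive, so the absence of an edge between two retained genes can more reasonably be treated as a negative.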

### Option 3: Filtering the dataset based on membership in pathways or phenotype category
This is done to only include genes (and the corresponding phenotype descriptions and annotations) which are useful for the current analysis. In this case we want to only retain genes that are mapped to at least one pathway in whichever source of pathway membership we are using (KEGG, Plant Metabolic Network, etc.). For genes that are not mapped to any pathway, it is impossible to correctly predict pathway membership, and we have no evidence that they do or do not belong in particular pathways, so they cannot be identified as true or false negatives in any case.

In [8]:
# Filter based on successful mappings to groups or pathways.
dataset.filter_with_ids(group_mapped_ids)
dataset.describe()
# Get the mappings in each direction again now that the dataset has been subset.
id_to_group_ids = groups.get_id_to_group_ids_dict(dataset.get_gene_dictionary())
group_id_to_ids = groups.get_group_id_to_ids_dict(dataset.get_gene_dictionary())

Number of rows in the dataframe: 460
Number of unique IDs:            460
Number of unique descriptions:   433
Number of unique gene name sets: 460
Number of species represented:   1


# Part 2. NLP Models


<a id="word2vec_doc2vec"></a>
### Word2Vec and Doc2Vec
Word2Vec is a word embedding technique using a neural network trained on a so-called *false task*, namely either predicting a missing word from within a sequence of context words drawn from a sentence or phrase, or predicting which context words surround some given input word drawn from a sentence or phrase. Each of these tasks is supervised (the correct answer is fixed and known), but the training examples can be generated automatically from unlabelled text data such as a collection of books or Wikipedia articles, which enables the creation of enormous training sets. The internal representations of particular words learned during the training process contain semantically informative features related to those words, and can therefore be used as embeddings for downstream tasks such as finding similarity between words, or as input to additional models. Doc2Vec is an extension of this technique that determines vector embeddings for entire documents (strings containing multiple words, which could be sentences, paragraphs, or documents).
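
To make the idea of automatically generated training data concrete, here is a minimal sketch of how (input word, context word) pairs for the skip-gram variant of the task could be produced from raw text. The function name, the window size, and the example sentence are assumptions for illustration; this is not how Gensim implements training internally.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every token and its neighbors within the window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

# No labels are needed: the "correct answers" come from the text itself.
sentence = "leaves are small and pale".split()
pairs = skipgram_pairs(sentence, window=1)
print(pairs)
```

Every pair is a supervised training example, yet none of them required manual annotation, which is why very large corpora can be used.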


<a id="bert_biobert"></a>
### BERT and BioBERT
BERT ('Bidirectional Encoder Representations from Transformers') is another neural network-based model trained on two different false tasks, namely predicting the subsequent sentence given some input sentence, and predicting the identity of a set of words masked from an input sentence. Like Word2Vec, this architecture can be used to generate vector embeddings for a particular input word by extracting values from a subset of the encoder layers that correspond to that input word. Practically, a major difference is that because the input word is provided in the context of its surrounding sentence, the embedding reflects the meaning of a particular word in a particular context (such as the difference in the meaning of *root* in the phrases *plant root* and *root of the problem*). BioBERT refers to a set of BERT models which have been further pretrained on the PubMed and PMC corpora. See the list of relevant links for the publications and pages associated with these models.
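
The step of "extracting values from a subset of the encoder layers" can be sketched with plain NumPy. The array below is random data standing in for the per-layer hidden states that a BERT model would return. The choice of the last four layers, the two combination strategies, and averaging over tokens are all assumptions made for illustration, not a statement of what this notebook or OATS does.

```python
import numpy as np

def embed_from_layers(hidden_states, num_layers=4, combine="sum"):
    """
    hidden_states: array of shape (layers, tokens, hidden_size).
    Combine the last `num_layers` layers for each token, then average over tokens
    to obtain a single vector for the whole input.
    """
    selected = hidden_states[-num_layers:]                   # (num_layers, tokens, hidden)
    if combine == "sum":
        per_token = selected.sum(axis=0)                     # (tokens, hidden)
    elif combine == "concat":
        per_token = np.concatenate(list(selected), axis=1)   # (tokens, num_layers * hidden)
    else:
        raise ValueError("combine must be 'sum' or 'concat'")
    return per_token.mean(axis=0)

# Fake hidden states: 12 layers, 5 tokens, hidden size 8 (assumed sizes).
rng = np.random.default_rng(0)
fake_states = rng.normal(size=(12, 5, 8))
vec_sum = embed_from_layers(fake_states, 4, "sum")
vec_cat = embed_from_layers(fake_states, 4, "concat")
print(vec_sum.shape, vec_cat.shape)
```

Summing keeps the vector at the hidden size, while concatenating multiplies its length by the number of layers used, so the choice affects both memory use and which distance metrics behave well downstream.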

<a id="load_models"></a>
### Loading trained and saved models
Versions of the architectures discussed above which have been saved as trained models are loaded here. Some of these models are loaded as pretrained models from the work of other groups, and some were trained on data specific to this notebook and loaded here.

In [9]:
# Files and models related to the machine learning text embedding methods used here.
doc2vec_wiki_model = gensim.models.Doc2Vec.load(doc2vec_wikipedia_filename)
doc2vec_pubmed_model = gensim.models.Doc2Vec.load(doc2vec_pubmed_filename)
word2vec_model = gensim.models.Word2Vec.load(word2vec_model_filename)
bert_tokenizer_base = BertTokenizer.from_pretrained('bert-base-uncased')
bert_tokenizer_pmc = BertTokenizer.from_pretrained(biobert_pmc_path)
bert_tokenizer_pubmed = BertTokenizer.from_pretrained(biobert_pubmed_path)
bert_tokenizer_pubmed_pmc = BertTokenizer.from_pretrained(biobert_pubmed_pmc_path)
bert_model_base = BertModel.from_pretrained('bert-base-uncased')
bert_model_pmc = BertModel.from_pretrained(biobert_pmc_path)
bert_model_pubmed = BertModel.from_pretrained(biobert_pubmed_path)
bert_model_pubmed_pmc = BertModel.from_pretrained(biobert_pubmed_pmc_path)

# Part 3. NLP Choices

<a id="preprocessing"></a>
### Preprocessing text descriptions
The preprocessing methods applied to the phenotype descriptions are a choice which impacts the subsequent vectorization and similarity methods that construct the pairwise distance matrix from these descriptions. The preprocessing methods that make sense are also highly dependent on the vectorization or embedding method that is to be applied. For example, stemming (which is part of the full preprocessing done below using the Gensim preprocessing function) is useful for the n-grams and bag-of-words methods but not for the document embedding methods, which need each token to be in the vocabulary that was constructed and used when the model was trained. For this reason, embedding methods with pretrained models where the vocabulary is fixed should use a lighter degree of preprocessing that does not involve stemming or lemmatization but does involve things like removal of non-alphanumerics and case normalization.
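
The reason stemming conflicts with a fixed model vocabulary can be shown on a toy example. This is a minimal sketch using only the standard library: `light_preprocess`, `in_vocabulary_fraction`, and the four-word vocabulary are hypothetical, and the stemmed string is a hand-written stand-in for stemmer output rather than the result of calling Gensim.

```python
import re

def light_preprocess(text):
    """Lowercase and replace runs of non-alphanumeric characters with spaces; no stemming."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())

def in_vocabulary_fraction(text, vocabulary):
    """Fraction of whitespace-delimited tokens that a fixed vocabulary contains."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(t in vocabulary for t in tokens) / len(tokens)

# Stand-in for the vocabulary of a pretrained embedding model (assumed).
toy_vocabulary = {"reduced", "leaves", "small", "roots"}

raw = "Reduced leaves; SMALL roots."
cleaned = light_preprocess(raw)
stemmed_like = "reduc leav small root"  # hand-written stand-in for stemmed tokens

print(cleaned)
print(in_vocabulary_fraction(cleaned, toy_vocabulary))
print(in_vocabulary_fraction(stemmed_like, toy_vocabulary))
```

Tokens that fall outside the model vocabulary contribute nothing to the embedding, so heavier preprocessing can silently discard most of a description.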

In [10]:
# Obtain a mapping between IDs and the raw text descriptions associated with that ID from the dataset.
descriptions = dataset.get_description_dictionary()

# Preprocessing of the text descriptions. Different methods are necessary for different approaches.
descriptions_full_preprocessing = {i:" ".join(preprocess_string(d)) for i,d in descriptions.items()}
descriptions_simple_preprocessing = {i:" ".join(simple_preprocess(d)) for i,d in descriptions.items()}
descriptions_no_stopwords = {i:remove_stopwords(d) for i,d in descriptions.items()}

<a id="pos_tagging"></a>
### POS tagging the phenotype descriptions for nouns and adjectives
Note that preprocessing of the descriptions should be done after part-of-speech tagging, because tokens that are removed during preprocessing before n-gram analysis contain information that the tagger needs to accurately call parts of speech. This step should be done on the raw descriptions, and the resulting bags of words can then be subset using additional preprocessing steps before being input to one of the vectorization methods.

In [11]:
get_pos_tokens = lambda text,pos: " ".join([t[0] for t in nltk.pos_tag(word_tokenize(text)) if t[1].lower()==pos.lower()])
descriptions_noun_only =  {i:get_pos_tokens(d,"NN") for i,d in descriptions.items()}
descriptions_noun_only_full_preprocessing = {i:" ".join(preprocess_string(d)) for i,d in descriptions_noun_only.items()}
descriptions_noun_only_simple_preprocessing = {i:" ".join(simple_preprocess(d)) for i,d in descriptions_noun_only.items()}
descriptions_adj_only =  {i:get_pos_tokens(d,"JJ") for i,d in descriptions.items()}
descriptions_adj_only_full_preprocessing = {i:" ".join(preprocess_string(d)) for i,d in descriptions_adj_only.items()}
descriptions_adj_only_simple_preprocessing = {i:" ".join(simple_preprocess(d)) for i,d in descriptions_adj_only.items()}
descriptions_noun_adj = {i:"{} {}".format(descriptions_noun_only[i],descriptions_adj_only[i]) for i in descriptions.keys()}
descriptions_noun_adj_full_preprocessing = {i:"{} {}".format(descriptions_noun_only_full_preprocessing[i],descriptions_adj_only_full_preprocessing[i]) for i in descriptions.keys()}
descriptions_noun_adj_simple_preprocessing = {i:"{} {}".format(descriptions_noun_only_simple_preprocessing[i],descriptions_adj_only_simple_preprocessing[i]) for i in descriptions.keys()}
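
One detail of the cell above is worth illustrating: `get_pos_tokens` compares tags by exact equality, so passing `"NN"` keeps singular common nouns but not tokens tagged `NNS` or `NNP` in the Penn Treebank scheme. Whether that is the intended behavior is not stated here. The sketch below shows the difference between exact and prefix matching on a hand-tagged list standing in for `nltk.pos_tag` output; `tokens_with_tag` is a hypothetical helper.

```python
def tokens_with_tag(tagged_tokens, tag, match_prefix=False):
    """Select tokens whose tag equals `tag`, or starts with it when match_prefix is True."""
    tag = tag.upper()
    if match_prefix:
        return [tok for tok, t in tagged_tokens if t.upper().startswith(tag)]
    return [tok for tok, t in tagged_tokens if t.upper() == tag]

# Hand-tagged example (assumed tags), in the (token, tag) format that nltk.pos_tag returns.
tagged = [("short", "JJ"), ("roots", "NNS"), ("and", "CC"), ("pale", "JJ"), ("leaf", "NN")]

exact = tokens_with_tag(tagged, "NN")
prefix = tokens_with_tag(tagged, "NN", match_prefix=True)
print(exact)
print(prefix)
```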

<a id="vocab"></a>
### Reducing the vocabulary size using a word distance matrix
These approaches for reducing the vocabulary size of the dataset work by replacing multiple words that occur throughout the dataset of descriptions with an identical word that is representative of this larger group of words. The total number of unique words across all descriptions is therefore reduced, and when observing n-gram overlaps between vector representations of these descriptions, overlaps will now occur between descriptions that included different but similar words. These methods work by actually generating versions of these descriptions that have the word replacements present. The returned objects for these methods are the revised description dictionary, a dictionary mapping tokens in the full vocabulary to tokens in the reduced vocabulary, and a dictionary mapping tokens in the reduced vocabulary to a list of tokens in the full vocabulary.

In [12]:
# Reducing the size of the vocabulary for descriptions treated with simple preprocessing.
tokens = list(set([w for w in flatten(d.split() for d in descriptions_simple_preprocessing.values())]))
tokens_dict = {i:w for i,w in enumerate(tokens)}
graph = pw.pairwise_square_word2vec(word2vec_model, tokens_dict, "cosine")

# Make sure that the tokens list is in the same order as the indices representing each word in the distance matrix.
# This is only trivial here because the IDs used are ordered integers 0 to n, but this might not always be the case.
distance_matrix = graph.array
tokens = [tokens_dict[graph.row_index_to_id[index]] for index in np.arange(distance_matrix.shape[0])]
n = 3
threshold = 0.2
descriptions_linares_pontes, reduce_lp, unreduce_lp = reduce_vocabulary_linares_pontes(descriptions_simple_preprocessing, tokens, distance_matrix, n)
descriptions_connected_components, reduce_cc, unreduce_cc = reduce_vocabulary_connected_components(descriptions_simple_preprocessing, tokens, distance_matrix, threshold)
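
The connected-components idea described above can be sketched independently of OATS. In this illustration, two words are linked when their distance is at or below a threshold, and every word is replaced by one representative of its connected component. The use of `<=`, the choice of the lowest-index word as the representative, and the toy distances are assumptions; `reduce_vocabulary_connected_components` may differ in these details.

```python
import numpy as np

def reduce_by_connected_components(tokens, distances, threshold):
    """Map every token to one representative of its connected component."""
    parent = list(range(len(tokens)))

    def find(i):
        # Follow parent links to the root, shortening the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    n = len(tokens)
    for i in range(n):
        for j in range(i + 1, n):
            if distances[i, j] <= threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    # Keep the lowest index as the root so the representative is deterministic.
                    parent[max(ri, rj)] = min(ri, rj)
    return {tokens[i]: tokens[find(i)] for i in range(n)}

toy_tokens = ["small", "tiny", "little", "root"]
toy_distances = np.array([
    [0.0, 0.1, 0.5, 0.9],
    [0.1, 0.0, 0.15, 0.9],
    [0.5, 0.15, 0.0, 0.9],
    [0.9, 0.9, 0.9, 0.0],
])
mapping = reduce_by_connected_components(toy_tokens, toy_distances, threshold=0.2)
reduced = " ".join(mapping[w] for w in "tiny root little".split())
print(mapping)
print(reduced)
```

Note that "small" and "little" end up in the same group even though their direct distance is above the threshold, because both are close to "tiny". This chaining effect is a property of connected components and is one reason the threshold has to be chosen conservatively.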

### Reducing vocabulary size based on identifying important words
These approaches for reducing the vocabulary size of the dataset work by identifying which words in the descriptions are likely to be the most important for identifying differences between the phenotypes and the meaning of the descriptions. One approach is to determine which words occur at a higher rate in text of interest, such as articles about plant phenotypes, than in more general texts such as a corpus of news articles. These approaches do not create modified versions of the descriptions but rather provide vocabulary objects that can be passed to the sklearn vectorizer constructors.

In [13]:
# Constructing a vocabulary by looking at what words are overrepresented in domain specific text.
with open(background_corpus_filename, "r") as f:
    background_corpus = f.read()
with open(phenotypes_corpus_filename, "r") as f:
    phenotypes_corpus = f.read()
tokens = get_overrepresented_tokens(phenotypes_corpus, background_corpus, max_features=5000)
vocabulary_from_text = get_vocabulary_from_tokens(tokens)

# Constructing a vocabulary by assuming all words present in a given ontology are important.
ontology = Ontology(ontology_filename)
vocabulary_from_ontology = get_vocabulary_from_tokens(ontology.get_tokens())
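
The notion of a word being overrepresented in domain-specific text can be sketched as a ratio of smoothed relative frequencies. The scoring used inside `get_overrepresented_tokens` is not shown in this notebook, so the ratio, the add-one smoothing, and the toy corpora below are assumptions chosen only to illustrate the idea.

```python
from collections import Counter

def overrepresented(domain_text, background_text, max_features=3, smoothing=1.0):
    """Rank domain tokens by the ratio of smoothed relative frequencies in the two texts."""
    domain = Counter(domain_text.lower().split())
    background = Counter(background_text.lower().split())
    domain_total = sum(domain.values())
    background_total = sum(background.values())
    vocab_size = len(set(domain) | set(background))

    def ratio(token):
        p_domain = (domain[token] + smoothing) / (domain_total + smoothing * vocab_size)
        p_background = (background[token] + smoothing) / (background_total + smoothing * vocab_size)
        return p_domain / p_background

    # Sort by descending ratio, breaking ties alphabetically for reproducibility.
    ranked = sorted(domain, key=lambda t: (-ratio(t), t))
    return ranked[:max_features]

toy_domain = "the leaves are pale and the roots are short and the leaves are small"
toy_background = "the market and the weather are the news and the news are the same"
top = overrepresented(toy_domain, toy_background, max_features=2)
print(top)
```

Common function words such as "the" score low because they are frequent in both texts, while words that appear mostly in the domain text rise to the top. A list like this could then be turned into a fixed vocabulary for a vectorizer.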

<a id="annotation"></a>
### Annotating descriptions with ontology terms
This section generates dictionaries that map gene IDs from the dataset to lists of strings, where those strings are ontology term IDs. How the term IDs are found for each gene entry with its corresponding phenotype description depends on the cell below. Firstly, the terms are found by using the NOBLE Coder annotation tool through these wrapper functions to identify the terms by looking for instances of the term's label or synonyms in the actual text of the phenotype descriptions. Secondly, the next cell just draws the terms directly from the dataset itself. In this case, these are high-confidence annotations done by curators for a comparison against what can be accomplished through computational analysis of the text.

In [14]:
# Run the ontology term annotators over the raw input text descriptions. NOBLE-Coder handles simple issues like case
# normalization so preprocessed descriptions are not used for this step.
ontology = Ontology(ontology_filename)
annotations_noblecoder_precise = annotate_using_noble_coder(descriptions, noblecoder_jarfile_path, "mo", precise=1)
annotations_noblecoder_partial = annotate_using_noble_coder(descriptions, noblecoder_jarfile_path, "mo", precise=0)

In [15]:
# Get the ID to term list annotation dictionaries for each ontology in the dataset.
annotations = dataset.get_annotations_dictionary()
go_annotations = {k:[term for term in v if term[0:2]=="GO"] for k,v in annotations.items()}
po_annotations = {k:[term for term in v if term[0:2]=="PO"] for k,v in annotations.items()}
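
To make the label-matching idea concrete, here is a naive sketch of mapping descriptions to term IDs by searching for labels and synonyms. It is not how NOBLE Coder works internally. The term IDs, labels, and the function `annotate_by_label_sketch` are placeholders invented for this example.

```python
import re

# Placeholder ontology: term ID -> list of label and synonym strings (not real ontology terms).
toy_terms = {
    "PO:X1": ["root", "roots"],
    "PO:X2": ["leaf", "leaves"],
    "GO:X1": ["response to water deprivation", "drought response"],
}

def annotate_by_label_sketch(ids_to_texts, term_labels):
    """Return a dictionary mapping each ID to the term IDs whose labels appear in its text."""
    annotations = {}
    for identifier, text in ids_to_texts.items():
        lowered = text.lower()
        found = []
        for term_id, labels in term_labels.items():
            # Whole-word (or whole-phrase) match of any label or synonym.
            if any(re.search(r"\b" + re.escape(label) + r"\b", lowered) for label in labels):
                found.append(term_id)
        annotations[identifier] = found
    return annotations

toy_descriptions = {1: "Short roots and curled leaves.", 2: "Altered drought response."}
toy_annotations = annotate_by_label_sketch(toy_descriptions, toy_terms)

# Splitting by ontology prefix, in the same way as the cell above.
toy_po = {k: [term for term in v if term[0:2] == "PO"] for k, v in toy_annotations.items()}
print(toy_annotations)
print(toy_po)
```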

<a id="phenes"></a>
### Splitting the descriptions into individual phenes
As a preprocessing step, each phenotype description is split into sentences (phenes), producing a new, larger set of descriptions. Phenes that are identical are retained as separate entries in the dataset. This makes the distance matrix calculation needlessly expensive, because vectors are computed for the same string more than once, but it simplifies converting the edgelist back to IDs that reference the genes (full phenotypes) instead of the smaller phenes. If anything, that problem should be addressed in the pairwise functions in the package, not here when creating input data for those methods.

In [16]:
# Create a dictionary of phene descriptions and a dictionary to convert back to the phenotype/gene IDs.
phenes = {}
phene_id_to_id = {}
phene_id = 0
for i,phene_list in {i:sent_tokenize(d) for i,d in descriptions.items()}.items():
    for phene in phene_list:
        phenes[phene_id] = phene
        phene_id_to_id[phene_id] = i
        phene_id = phene_id+1

<a id="matrix"></a>
# Part 4. Generating vector representations and pairwise distances matrices
This section uses the text descriptions, preprocessed text descriptions, or ontology term annotations created or read in the previous sections to generate a vector representation for each gene and build a pairwise distance matrix for the whole dataset. Each method specified is a unique combination of a method of vectorization (bag-of-words, n-grams, document embedding model, etc) and distance metric (Euclidean, Jaccard, cosine, etc) applied to those vectors in constructing the pairwise matrix. The method of vectorization here is equivalent to feature selection, so the task is to figure out which type of vectors will encode features that are useful (n-grams, full words, only words from a certain vocabulary, etc).
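
As a small illustration of how a vectorization choice and a distance metric combine into one method, the sketch below builds bag-of-words vectors for three toy descriptions and compares them with cosine and Jaccard distance. It uses only the standard library and is not the `pw` package implementation. The helper names and toy texts are assumptions.

```python
from collections import Counter
from itertools import combinations
import math

def bag_of_words(text):
    """Count vector over whitespace-separated, lowercased tokens."""
    return Counter(text.lower().split())

def cosine_distance(a, b):
    """One minus the cosine similarity of two count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    return 1.0 - dot / (norm_a * norm_b)

def jaccard_distance(a, b):
    """One minus the Jaccard similarity of the token sets (binary presence, counts ignored)."""
    tokens_a, tokens_b = set(a), set(b)
    return 1.0 - len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

toy_descriptions = {1: "short roots short leaves", 2: "short roots", 3: "pale flowers"}
vectors = {i: bag_of_words(text) for i, text in toy_descriptions.items()}

# One distance per unordered pair of IDs, analogous to an edgelist.
pairwise_cosine = {(i, j): cosine_distance(vectors[i], vectors[j]) for i, j in combinations(sorted(vectors), 2)}
pairwise_jaccard = {(i, j): jaccard_distance(vectors[i], vectors[j]) for i, j in combinations(sorted(vectors), 2)}
print(pairwise_cosine)
print(pairwise_jaccard)
```

The same pair of descriptions receives different distances under the two metrics, because cosine distance uses the token counts while Jaccard distance only uses presence or absence.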

<a id="methods"></a>
### Specifying a list of NLP methods to use
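Each method below pairs a vectorization approach and its hyperparameters with the distance metric used to compare the resulting vectors.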

In [17]:
# Define a list of different methods for calculating distance between text descriptions using the Methods object 
# defined in the utilities for this notebook. The constructor takes a string for the method name, a string defining
# the hyperparameter choices for that method, a function to be called to run this method, a dictionary of arguments
# by keyword that should be passed to that function, and a distance metric from scipy.spatial.distance to associate
# with this method.

methods = [

    
    # Methods that use neural networks to generate embeddings.
    Method("Doc2Vec Wikipedia", "Size=300", pw.pairwise_square_doc2vec, {"model":doc2vec_wiki_model, "ids_to_texts":descriptions, "metric":"cosine"}, spatial.distance.cosine),
    Method("Doc2Vec PubMed", "Size=100", pw.pairwise_square_doc2vec, {"model":doc2vec_pubmed_model, "ids_to_texts":descriptions, "metric":"cosine"}, spatial.distance.cosine),
    Method("Word2Vec Wikipedia", "Size=300,Mean", pw.pairwise_square_word2vec, {"model":word2vec_model, "ids_to_texts":descriptions, "metric":"cosine", "method":"mean"}, spatial.distance.cosine),
    Method("Word2Vec Wikipedia", "Size=300,Max", pw.pairwise_square_word2vec, {"model":word2vec_model, "ids_to_texts":descriptions, "metric":"cosine", "method":"max"}, spatial.distance.cosine),
    #Method("BERT", "Base:Layers=2,Concatenated", pw.pairwise_square_bert, {"model":bert_model_base, "tokenizer":bert_tokenizer_base, "ids_to_texts":descriptions, "metric":"cosine", "method":"concat", "layers":2}, spatial.distance.cosine),
    #Method("BERT", "Base:Layers=3,Concatenated", pw.pairwise_square_bert, {"model":bert_model_base, "tokenizer":bert_tokenizer_base, "ids_to_texts":descriptions, "metric":"cosine", "method":"concat", "layers":3}, spatial.distance.cosine),
    #Method("BERT", "Base:Layers=4,Concatenated", pw.pairwise_square_bert, {"model":bert_model_base, "tokenizer":bert_tokenizer_base, "ids_to_texts":descriptions, "metric":"cosine", "method":"concat", "layers":4}, spatial.distance.cosine),
    #Method("BERT", "Base:Layers=2,Summed", pw.pairwise_square_bert, {"model":bert_model_base, "tokenizer":bert_tokenizer_base, "ids_to_texts":descriptions, "metric":"cosine", "method":"sum", "layers":2}, spatial.distance.cosine),
    #Method("BERT", "Base:Layers=3,Summed", pw.pairwise_square_bert, {"model":bert_model_base, "tokenizer":bert_tokenizer_base, "ids_to_texts":descriptions, "metric":"cosine", "method":"sum", "layers":3}, spatial.distance.cosine),
    #Method("BERT", "Base:Layers=4,Summed", pw.pairwise_square_bert, {"model":bert_model_base, "tokenizer":bert_tokenizer_base, "ids_to_texts":descriptions, "metric":"cosine", "method":"sum", "layers":4}, spatial.distance.cosine),
    #Method("BioBERT", "PMC,Layers=2,Concatenated", pw.pairwise_square_bert, {"model":bert_model_pmc, "tokenizer":bert_tokenizer_pmc, "ids_to_texts":descriptions, "metric":"cosine", "method":"concat", "layers":2}, spatial.distance.cosine),
    #Method("BioBERT", "PMC,Layers=3,Concatenated", pw.pairwise_square_bert, {"model":bert_model_pmc, "tokenizer":bert_tokenizer_pmc, "ids_to_texts":descriptions, "metric":"cosine", "method":"concat", "layers":3}, spatial.distance.cosine),
    #Method("BioBERT", "PMC,Layers=4,Concatenated", pw.pairwise_square_bert, {"model":bert_model_pmc, "tokenizer":bert_tokenizer_pmc, "ids_to_texts":descriptions, "metric":"cosine", "method":"concat", "layers":4}, spatial.distance.cosine),
    #Method("BioBERT", "PubMed,Layers=4,Concatenated", pw.pairwise_square_bert, {"model":bert_model_pubmed, "tokenizer":bert_tokenizer_pubmed, "ids_to_texts":descriptions, "metric":"cosine", "method":"concat", "layers":4}, spatial.distance.cosine),
    #Method("BioBERT", "PubMed,PMC,Layers=4,Concatenated", pw.pairwise_square_bert, {"model":bert_model_pubmed_pmc, "tokenizer":bert_tokenizer_pubmed_pmc, "ids_to_texts":descriptions, "metric":"cosine", "method":"concat", "layers":4}, spatial.distance.cosine),
    
    # Methods that use variations on the n-grams approach with full preprocessing (includes stemming).
    Method("N-Grams", "Full,Words,1-grams,2-grams", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_full_preprocessing, "metric":"cosine", "binary":False, "analyzer":"word", "ngram_range":(1,2),"max_features":10000, "tfidf":False}, spatial.distance.cosine),
    Method("N-Grams", "Full,Words,1-grams,2-grams,Binary", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_full_preprocessing, "metric":"jaccard", "binary":True, "analyzer":"word", "ngram_range":(1,2), "max_features":10000, "tfidf":False}, spatial.distance.jaccard),
    Method("N-Grams", "Full,Words,1-grams", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_full_preprocessing, "metric":"cosine", "binary":False, "analyzer":"word", "ngram_range":(1,1), "max_features":10000, "tfidf":False}, spatial.distance.cosine),
    Method("N-Grams", "Full,Words,1-grams,Binary", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_full_preprocessing, "metric":"jaccard", "binary":True, "analyzer":"word", "ngram_range":(1,1), "max_features":10000, "tfidf":False}, spatial.distance.jaccard),
    Method("N-Grams", "Full,Words,1-grams,2-grams,TFIDF", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_full_preprocessing, "metric":"cosine", "binary":False, "analyzer":"word", "ngram_range":(1,2),"max_features":10000, "tfidf":True}, spatial.distance.cosine),
    Method("N-Grams", "Full,Words,1-grams,2-grams,Binary,TFIDF", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_full_preprocessing, "metric":"cosine", "binary":True, "analyzer":"word", "ngram_range":(1,2), "max_features":10000, "tfidf":True}, spatial.distance.cosine),
    Method("N-Grams", "Full,Words,1-grams,TFIDF", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_full_preprocessing, "metric":"cosine", "binary":False, "analyzer":"word", "ngram_range":(1,1), "max_features":10000, "tfidf":True}, spatial.distance.cosine),
    Method("N-Grams", "Full,Words,1-grams,Binary,TFIDF", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_full_preprocessing, "metric":"cosine", "binary":True, "analyzer":"word", "ngram_range":(1,1), "max_features":10000, "tfidf":True}, spatial.distance.cosine),
    
    # Methods that use variations on the n-grams approach with simple preprocessing (no stemming).
    Method("N-Grams", "Simple,Words,1-grams,2-grams", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_simple_preprocessing, "metric":"cosine", "binary":False, "analyzer":"word", "ngram_range":(1,2),"max_features":10000, "tfidf":False}, spatial.distance.cosine),
    Method("N-Grams", "Simple,Words,1-grams,2-grams,Binary", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_simple_preprocessing, "metric":"jaccard", "binary":True, "analyzer":"word", "ngram_range":(1,2), "max_features":10000, "tfidf":False}, spatial.distance.jaccard),
    Method("N-Grams", "Simple,Words,1-grams", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_simple_preprocessing, "metric":"cosine", "binary":False, "analyzer":"word", "ngram_range":(1,1), "max_features":10000, "tfidf":False}, spatial.distance.cosine),
    Method("N-Grams", "Simple,Words,1-grams,Binary", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_simple_preprocessing, "metric":"jaccard", "binary":True, "analyzer":"word", "ngram_range":(1,1), "max_features":10000, "tfidf":False}, spatial.distance.jaccard),
    Method("N-Grams", "Simple,Words,1-grams,2-grams,TFIDF", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_simple_preprocessing, "metric":"cosine", "binary":False, "analyzer":"word", "ngram_range":(1,2),"max_features":10000, "tfidf":True}, spatial.distance.cosine),
    Method("N-Grams", "Simple,Words,1-grams,2-grams,Binary,TFIDF", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_simple_preprocessing, "metric":"cosine", "binary":True, "analyzer":"word", "ngram_range":(1,2), "max_features":10000, "tfidf":True}, spatial.distance.cosine),
    Method("N-Grams", "Simple,Words,1-grams,TFIDF", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_simple_preprocessing, "metric":"cosine", "binary":False, "analyzer":"word", "ngram_range":(1,1), "max_features":10000, "tfidf":True}, spatial.distance.cosine),
    Method("N-Grams", "Simple,Words,1-grams,Binary,TFIDF", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_simple_preprocessing, "metric":"cosine", "binary":True, "analyzer":"word", "ngram_range":(1,1), "max_features":10000, "tfidf":True}, spatial.distance.cosine),
    
    # Methods that use variations on the n-grams approach selecting for specific parts-of-speech.
    Method("N-Grams", "Full,Nouns,1-grams", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_noun_only_full_preprocessing, "metric":"cosine", "binary":False, "analyzer":"word", "ngram_range":(1,1), "max_features":10000, "tfidf":False}, spatial.distance.cosine),
    Method("N-Grams", "Full,Nouns,1-grams,Binary", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_noun_only_full_preprocessing, "metric":"jaccard", "binary":True, "analyzer":"word", "ngram_range":(1,1), "max_features":10000, "tfidf":False}, spatial.distance.jaccard),
    Method("N-Grams", "Full,Nouns,1-grams,TFIDF", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_noun_only_full_preprocessing, "metric":"cosine", "binary":False, "analyzer":"word", "ngram_range":(1,1), "max_features":10000, "tfidf":True}, spatial.distance.cosine),
    Method("N-Grams", "Full,Nouns,1-grams,Binary,TFIDF", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_noun_only_full_preprocessing, "metric":"cosine", "binary":True, "analyzer":"word", "ngram_range":(1,1), "max_features":10000, "tfidf":True}, spatial.distance.cosine),
    Method("N-Grams", "Full,Adjectives,1-grams", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_adj_only_full_preprocessing, "metric":"cosine", "binary":False, "analyzer":"word", "ngram_range":(1,1), "max_features":10000, "tfidf":False}, spatial.distance.cosine),
    Method("N-Grams", "Full,Adjectives,1-grams,Binary", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_adj_only_full_preprocessing, "metric":"jaccard", "binary":True, "analyzer":"word", "ngram_range":(1,1), "max_features":10000, "tfidf":False}, spatial.distance.jaccard),
    Method("N-Grams", "Full,Adjectives,1-grams,TFIDF", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_adj_only_full_preprocessing, "metric":"cosine", "binary":False, "analyzer":"word", "ngram_range":(1,1), "max_features":10000, "tfidf":True}, spatial.distance.cosine),
    Method("N-Grams", "Full,Adjectives,1-grams,Binary,TFIDF", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_adj_only_full_preprocessing, "metric":"cosine", "binary":True, "analyzer":"word", "ngram_range":(1,1), "max_features":10000, "tfidf":True}, spatial.distance.cosine),
    
    # Methods that use terms inferred from automated annotation of the text.
    Method("NOBLE Coder", "Precise", pw.pairwise_square_annotations, {"ids_to_annotations":annotations_noblecoder_precise, "ontology":ontology, "binary":True, "metric":"jaccard", "tfidf":False}, spatial.distance.jaccard),
    Method("NOBLE Coder", "Partial", pw.pairwise_square_annotations, {"ids_to_annotations":annotations_noblecoder_partial, "ontology":ontology, "binary":True, "metric":"jaccard", "tfidf":False}, spatial.distance.jaccard),
    Method("NOBLE Coder", "Precise,TFIDF", pw.pairwise_square_annotations, {"ids_to_annotations":annotations_noblecoder_precise, "ontology":ontology, "binary":True, "metric":"cosine", "tfidf":True}, spatial.distance.cosine),
    Method("NOBLE Coder", "Partial,TFIDF", pw.pairwise_square_annotations, {"ids_to_annotations":annotations_noblecoder_partial, "ontology":ontology, "binary":True, "metric":"cosine", "tfidf":True}, spatial.distance.cosine),
    
    # Methods that use terms assigned by humans that are present in the dataset.
    Method("GO", "Default", pw.pairwise_square_annotations, {"ids_to_annotations":go_annotations, "ontology":ontology, "metric":"jaccard", "binary":True, "analyzer":"word", "ngram_range":(1,1), "tfidf":False}, spatial.distance.jaccard),
    Method("PO", "Default", pw.pairwise_square_annotations, {"ids_to_annotations":po_annotations, "ontology":ontology, "metric":"jaccard", "binary":True, "analyzer":"word", "ngram_range":(1,1), "tfidf":False}, spatial.distance.jaccard),
    

    
    
    
    # Approaches where the phenotype descriptions were split into individual phenes first (computationally expensive).
    Method("Phenes Doc2Vec Wikipedia", "Size=300", pw.pairwise_square_doc2vec, {"model":doc2vec_wiki_model, "ids_to_texts":phenes, "metric":"cosine"}, spatial.distance.cosine, tag="phenes"),
    Method("Phenes Doc2Vec PubMed", "Size=100", pw.pairwise_square_doc2vec, {"model":doc2vec_pubmed_model, "ids_to_texts":phenes, "metric":"cosine"}, spatial.distance.cosine, tag="phenes"),
    Method("Phenes Word2Vec Wikipedia", "Size=300,Mean", pw.pairwise_square_word2vec, {"model":word2vec_model, "ids_to_texts":phenes, "metric":"cosine", "method":"mean"}, spatial.distance.cosine, tag="phenes"),
    Method("Phenes Word2Vec Wikipedia", "Size=300,Max", pw.pairwise_square_word2vec, {"model":word2vec_model, "ids_to_texts":phenes, "metric":"cosine", "method":"max"}, spatial.distance.cosine, tag="phenes"),
    

]
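
The `Method` object itself is defined in the utilities for this notebook and is not shown here. A minimal sketch of a container with the same constructor arguments might look like the following. The class name `MethodSketch`, the empty-string default for `tag`, and the stand-in function are assumptions. The `name:hyperparameters` format is inferred from the method names printed later in the notebook.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

@dataclass
class MethodSketch:
    """Hypothetical stand-in for the notebook's Method class."""
    name: str                       # method name, e.g. "Doc2Vec Wikipedia"
    hyperparameters: str            # string describing the hyperparameter choices
    function: Callable[..., Any]    # function called to build the pairwise distances
    kwargs: Dict[str, Any]          # keyword arguments passed to that function
    metric: Callable[..., float]    # distance metric associated with this method
    tag: str = ""                   # optional label such as "phenes" (default is an assumption)

    @property
    def name_with_hyperparameters(self) -> str:
        return "{}:{}".format(self.name, self.hyperparameters)

def toy_pairwise(ids_to_texts, metric):
    """Stand-in for a pairwise function; it only reports what it was asked to do."""
    return "{} texts compared with {}".format(len(ids_to_texts), metric)

toy_method = MethodSketch(
    "Doc2Vec Wikipedia", "Size=300", toy_pairwise,
    {"ids_to_texts": {1: "short roots", 2: "pale leaves"}, "metric": "cosine"},
    lambda u, v: 0.0,
)
toy_result = toy_method.function(**toy_method.kwargs)
print(toy_method.name_with_hyperparameters, "->", toy_result)
```

Defaulting `tag` to an empty string keeps membership checks such as `"phene" in method.tag` valid for methods created without a tag.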

<a id="running"></a>
### Running all of the methods to generate distance matrices
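Each method's function is called with its keyword arguments to build a pairwise distance matrix, and the time taken by each call is recorded.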

In [None]:
#Generate all of the pairwise distance matrices in parallel.
#start_time_mp = time.perf_counter()
#pool = mp.Pool(mp.cpu_count())
#results = [pool.apply_async(function_wrapper_with_duration, args=(method.function, method.kwargs)) for method in methods]
#results = [result.get() for result in results]
#graphs = {method.name_with_hyperparameters:result[0] for method,result in zip(methods,results)}
#metric_dict = {method.name_with_hyperparameters:method.metric for method in methods}
#durations = {method.name_with_hyperparameters:result[1] for method,result in zip(methods,results)}
#pool.close()
#pool.join()    
#total_time_mp = time.perf_counter()-start_time_mp

# Reporting how long each matrix took to build and how much time parallel processing saved.
#print("Durations of generating each pairwise similarity matrix (hh:mm:ss)")
#print("-----------------------------------------------------------------")
#savings = total_time_mp/sum(durations.values())
#for (name,duration) in durations.items():
#    print("{:50} {}".format(name, to_hms(duration)))
#print("-----------------------------------------------------------------")
#print("{:15} {}".format("total", to_hms(sum(durations.values()))))
#print("{:15} {} ({:.2%} of single thread time)".format("multiprocess", to_hms(total_time_mp), savings))

In [18]:
# Generate all the pairwise distance matrices (not in parallel).
graphs = {}
names = []
durations = []
for method in methods:
    graph,duration = function_wrapper_with_duration(function=method.function, args=method.kwargs)
    graphs[method.name_with_hyperparameters] = graph
    names.append(method.name_with_hyperparameters)
    durations.append(to_hms(duration))
    print("{:50} {}".format(method.name_with_hyperparameters,to_hms(duration)))
durations_df = pd.DataFrame({"method":names,"duration":durations})
durations_df.to_csv(os.path.join(OUTPUT_DIR,"part_4_durations.csv"), index=False)

Doc2Vec Wikipedia:Size=300                         00:00:01
Doc2Vec PubMed:Size=100                            00:00:00
Word2Vec Wikipedia:Size=300,Mean                   00:00:00
Word2Vec Wikipedia:Size=300,Max                    00:00:00
N-Grams:Full,Words,1-grams,2-grams                 00:00:01
N-Grams:Full,Words,1-grams,2-grams,Binary          00:00:00
N-Grams:Full,Words,1-grams                         00:00:00
N-Grams:Full,Words,1-grams,Binary                  00:00:00
N-Grams:Full,Words,1-grams,2-grams,TFIDF           00:00:01
N-Grams:Full,Words,1-grams,2-grams,Binary,TFIDF    00:00:01
N-Grams:Full,Words,1-grams,TFIDF                   00:00:00
N-Grams:Full,Words,1-grams,Binary,TFIDF            00:00:00
N-Grams:Simple,Words,1-grams,2-grams               00:00:01
N-Grams:Simple,Words,1-grams,2-grams,Binary        00:00:00
N-Grams:Simple,Words,1-grams                       00:00:00
N-Grams:Simple,Words,1-grams,Binary                00:00:00
N-Grams:Simple,Words,1-grams,2-grams,TFI
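
The helpers `function_wrapper_with_duration` and `to_hms` come from the notebook utilities and are not shown. A rough sketch of what such helpers could look like, under the assumption that the wrapper unpacks the keyword arguments and that durations are truncated to whole seconds, is given below. The `_sketch` names are hypothetical.

```python
import time

def function_wrapper_with_duration_sketch(function, args):
    """Call function(**args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = function(**args)
    duration = time.perf_counter() - start
    return result, duration

def to_hms_sketch(seconds):
    """Format a duration in seconds as hh:mm:ss, truncating fractions of a second."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return "{:02d}:{:02d}:{:02d}".format(hours, minutes, secs)

def toy_add(a, b):
    """Stand-in for a pairwise distance function."""
    return a + b

result, duration = function_wrapper_with_duration_sketch(function=toy_add, args={"a": 2, "b": 3})
print("{:50} {}".format("toy_add", to_hms_sketch(duration)))
```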

<a id="merging"></a>
### Merging all of the distance matrices into a single dataframe specifying edges
This section also handles methods whose IDs reference individual phenes that are part of a larger phenotype, replacing those IDs with IDs referencing the full phenotypes (there is a one-to-one relationship between phenotypes and genes). In this case, the minimum distance found between any two phenes from a pair of phenotypes represents the distance between that pair of phenotypes.

In [19]:
# Merging all the edgelists together.
metric_dict = {method.name_with_hyperparameters:method.metric for method in methods}
tags_dict = {method.name_with_hyperparameters:method.tag for method in methods}
names = list(graphs.keys())
edgelists = {k:v.edgelist for k,v in graphs.items()}

# Modify the edgelists for the methods that were using a phene split.
for name,edgelist in edgelists.items():

    # Converting phene IDs back to phenotype (gene) IDs where applicable.
    if "phene" in tags_dict[name]:
        edgelist["from"] = edgelist["from"].map(lambda x: phene_id_to_id[x])
        edgelist["to"] = edgelist["to"].map(lambda x: phene_id_to_id[x])
        edgelist = edgelist.groupby(["from","to"], as_index=False).min()

    # Making sure the edges are listed with the nodes sorted consistently.
    cond = edgelist["from"] > edgelist["to"]
    edgelist.loc[cond, ['from', 'to']] = edgelist.loc[cond, ['to', 'from']].values
    edgelists[name] = edgelist

# Do the merge step and remove self edges from the full dataframe.
df = pw.merge_edgelists(edgelists, default_value=1.000)
df = pw.remove_self_loops(df)
df["from"] = df["from"].astype("int64")
df["to"] = df["to"].astype("int64")
df.head(20)

Unnamed: 0,from,to,Doc2Vec Wikipedia:Size=300,Doc2Vec PubMed:Size=100,"Word2Vec Wikipedia:Size=300,Mean","Word2Vec Wikipedia:Size=300,Max","N-Grams:Full,Words,1-grams,2-grams","N-Grams:Full,Words,1-grams,2-grams,Binary","N-Grams:Full,Words,1-grams","N-Grams:Full,Words,1-grams,Binary","N-Grams:Full,Words,1-grams,2-grams,TFIDF","N-Grams:Full,Words,1-grams,2-grams,Binary,TFIDF","N-Grams:Full,Words,1-grams,TFIDF","N-Grams:Full,Words,1-grams,Binary,TFIDF","N-Grams:Simple,Words,1-grams,2-grams","N-Grams:Simple,Words,1-grams,2-grams,Binary","N-Grams:Simple,Words,1-grams","N-Grams:Simple,Words,1-grams,Binary","N-Grams:Simple,Words,1-grams,2-grams,TFIDF","N-Grams:Simple,Words,1-grams,2-grams,Binary,TFIDF","N-Grams:Simple,Words,1-grams,TFIDF","N-Grams:Simple,Words,1-grams,Binary,TFIDF","N-Grams:Full,Nouns,1-grams","N-Grams:Full,Nouns,1-grams,Binary","N-Grams:Full,Nouns,1-grams,TFIDF","N-Grams:Full,Nouns,1-grams,Binary,TFIDF","N-Grams:Full,Adjectives,1-grams","N-Grams:Full,Adjectives,1-grams,Binary","N-Grams:Full,Adjectives,1-grams,TFIDF","N-Grams:Full,Adjectives,1-grams,Binary,TFIDF",NOBLE Coder:Precise,NOBLE Coder:Partial,"NOBLE Coder:Precise,TFIDF","NOBLE Coder:Partial,TFIDF",GO:Default,PO:Default,Phenes Doc2Vec Wikipedia:Size=300,Phenes Doc2Vec PubMed:Size=100,"Phenes Word2Vec Wikipedia:Size=300,Mean","Phenes Word2Vec Wikipedia:Size=300,Max"
1,436,449,0.440429,0.48031,0.093846,0.06014,0.862773,0.965602,0.804568,0.926554,0.942081,0.975838,0.898835,0.938549,0.617404,0.942913,0.510145,0.897674,0.883654,0.957547,0.809063,0.922304,0.841055,0.931507,0.879842,0.914691,1.0,1.0,1.0,1.0,0.767677,0.737643,0.741439,0.781159,0.9375,0.311594,0.243629,0.27274,0.118913,0.084956
2,436,451,0.448706,0.83035,0.144937,0.080291,0.90307,0.97482,0.867954,0.956204,0.95714,0.974107,0.932699,0.958877,0.73613,0.970899,0.656009,0.948864,0.941671,0.979164,0.899777,0.973427,0.889145,0.980392,0.968725,0.988432,1.0,1.0,1.0,1.0,0.78022,0.680233,0.763636,0.753923,0.941176,0.37037,0.279423,0.159204,0.154586,0.113171
3,436,452,0.5564,0.377964,0.19359,0.097068,0.976664,0.992832,0.964543,0.984496,0.995502,0.9963,0.99101,0.989247,0.777657,0.97832,0.690152,0.957317,0.962876,0.991755,0.925586,0.979721,0.974076,0.979167,0.986603,0.976018,1.0,1.0,1.0,1.0,0.836735,0.723958,0.862402,0.826559,0.941176,0.518797,0.336436,0.168364,0.185599,0.125401
4,436,453,0.508218,0.378605,0.170496,0.101583,0.844926,0.960784,0.818021,0.932203,0.956981,0.965118,0.935118,0.922609,0.716976,0.947059,0.652161,0.929032,0.912731,0.942427,0.876575,0.924853,0.820395,0.93617,0.920817,0.919669,0.52371,0.888889,0.815876,0.890583,0.922078,0.703704,0.94655,0.776442,0.933333,0.253333,0.066869,0.218946,0.0,0.0
5,436,454,0.581453,0.310584,0.155409,0.084839,0.926707,0.964286,0.913579,0.934783,0.979253,0.977879,0.967027,0.945618,0.684539,0.95122,0.591172,0.916667,0.935327,0.972594,0.881867,0.94137,0.924551,0.962264,0.967047,0.974057,0.804471,0.878788,0.921611,0.901661,0.855422,0.742424,0.867605,0.848427,0.928571,0.492958,0.090775,0.162308,0.0,0.0
6,436,455,0.458854,0.376883,0.073533,0.0523,0.869882,0.958656,0.827304,0.923497,0.962644,0.978284,0.936123,0.954789,0.456489,0.943074,0.344353,0.905738,0.854125,0.971174,0.757046,0.950967,0.925201,0.958904,0.961622,0.970323,0.908253,0.961538,0.968077,0.973008,0.787879,0.697368,0.819515,0.781505,0.941176,0.264516,0.201464,0.163906,0.097725,0.087744
7,436,465,0.427251,0.266459,0.160076,0.149803,0.878522,0.987395,0.796503,0.972973,0.928492,0.966779,0.844651,0.915064,0.819211,0.983819,0.757494,0.971631,0.934452,0.951292,0.891421,0.915231,0.926676,0.97619,0.947854,0.926807,0.764298,0.925926,0.748064,0.793501,0.766667,0.86,0.618308,0.728861,0.9375,0.676471,0.132484,0.21304,0.157447,0.175157
8,436,467,0.304022,0.436854,0.048605,0.054396,0.730182,0.861314,0.636714,0.766129,0.855822,0.818548,0.774088,0.679914,0.638961,0.869318,0.530892,0.779874,0.857142,0.828731,0.766311,0.724091,0.775054,0.784314,0.841257,0.689945,0.739421,0.828571,0.835437,0.773912,0.571429,0.449541,0.464786,0.471749,0.90625,0.022059,0.041267,0.174219,0.0,0.0
9,436,472,0.378091,0.357082,0.049062,0.051974,0.783986,0.894231,0.703857,0.838926,0.879991,0.86835,0.798777,0.784072,0.632617,0.885366,0.51423,0.813472,0.867904,0.869251,0.767799,0.795549,0.634703,0.785714,0.762621,0.720551,0.849384,0.9,0.830283,0.827512,0.722222,0.616279,0.567129,0.604425,0.892857,0.05,0.040721,0.17488,0.0,0.0
10,436,474,0.314291,0.536701,0.074955,0.051165,0.686499,0.850299,0.56664,0.735714,0.779869,0.80102,0.642262,0.642151,0.51899,0.841379,0.386013,0.736559,0.776646,0.792694,0.636662,0.659487,0.662821,0.719298,0.701525,0.619057,0.601473,0.837838,0.582568,0.757573,0.425,0.651786,0.485715,0.613898,0.909091,0.0,0.128363,0.149892,0.118671,0.073701
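
The rule of taking the minimum phene-to-phene distance as the phenotype-to-phenotype distance can be shown on a toy example. This sketch uses plain dictionaries rather than the pandas edgelists above. The function name `collapse_phene_edges` and the toy values are assumptions.

```python
def collapse_phene_edges(phene_edges, phene_id_to_id):
    """Map phene-level edges to phenotype-level edges, keeping the minimum distance per pair."""
    collapsed = {}
    for (phene_a, phene_b), distance in phene_edges.items():
        # Sort the phenotype IDs so each unordered pair has a single key.
        gene_a, gene_b = sorted((phene_id_to_id[phene_a], phene_id_to_id[phene_b]))
        if gene_a == gene_b:
            # Two phenes from the same phenotype would form a self edge, so skip them.
            continue
        key = (gene_a, gene_b)
        collapsed[key] = min(distance, collapsed.get(key, float("inf")))
    return collapsed

# Phenes 0 and 1 belong to phenotype 10, phenes 2 and 3 belong to phenotype 20.
toy_phene_id_to_id = {0: 10, 1: 10, 2: 20, 3: 20}
toy_phene_edges = {(0, 1): 0.2, (0, 2): 0.8, (0, 3): 0.3, (1, 2): 0.5, (1, 3): 0.9}
toy_collapsed = collapse_phene_edges(toy_phene_edges, toy_phene_id_to_id)
print(toy_collapsed)
```

Of the four cross-phenotype distances, only the smallest is kept, so two phenotypes are considered close if any one pair of their phenes is close.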


In [40]:
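# Updating the arrays and id_to_row_index and id_to_col_index values for the graphs for the phene methods.
# What about the get_nearest_neighbor() methods, they'll still be returning phene IDs? Is that good or bad?
for name in names:
    if "phene" in tags_dict[name]:
        print(name)
        col_idx = list(df.columns).index(name)+1

        # Build a square array indexed by phenotype (gene) ID instead of phene ID.
        n = len(descriptions)
        ids = list(descriptions.keys())
        arr = np.ones((n, n))
        id_to_idx = {i:idx for idx,i in enumerate(ids)}
        idx_to_id = {idx:i for i,idx in id_to_idx.items()}
        for row in df.itertuples():
            # Each edge is stored once in df, so fill both halves to keep the array symmetric.
            arr[id_to_idx[row[1]], id_to_idx[row[2]]] = row[col_idx]
            arr[id_to_idx[row[2]], id_to_idx[row[1]]] = row[col_idx]
        print(arr.shape)

        # Doing the updates, assuming rows and columns share the same ID ordering.
        graphs[name].array = arr
        graphs[name].id_to_row_index = id_to_idx
        graphs[name].id_to_col_index = id_to_idx
        graphs[name].row_index_to_id = idx_to_id
        graphs[name].col_index_to_id = idx_to_id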








Phenes Doc2Vec Wikipedia:Size=300
(460, 460)
Phenes Doc2Vec PubMed:Size=100
(460, 460)
Phenes Word2Vec Wikipedia:Size=300,Mean
(460, 460)
Phenes Word2Vec Wikipedia:Size=300,Max
(460, 460)


<a id="cluster_analysis"></a>
# Part 5. Cluster Analysis
The purpose of this section is to look at different ways that the embeddings obtained for the dataset of phenotype descriptions can be used to cluster or organize the genes to which those phenotypes are mapped into subgroups or representations. These approaches include generating topic models from the data, and doing agglomerative clustering to find clusters to which each gene belongs.

<a id="topic_modeling"></a>
### Approach 1: Topic modeling based on n-grams with a reduced vocabulary
Topic modeling learns a set of word probability distributions from the dataset of text descriptions, each representing a distinct topic present in the dataset. Each text description can then be represented as a discrete probability distribution over the learned topics, based on the probability that the text belongs to each topic. This is a form of dimensionality reduction, because a high-dimensional bag-of-words can be represented as a vector of *k* probabilities, where *k* is the number of topics. The main advantage of topic modeling over clustering is that it provides soft classifications that can be further interpreted, rather than hard classifications into a single cluster. Topic models are also explainable, because the word probability distribution for a topic can be used to determine which words are most representative of that topic. One problem with topic modeling is that it uses n-gram representations, so semantic similarity between different words is not accounted for. To help alleviate this, this section uses implementations of some existing algorithms to compress the vocabulary as a preprocessing step, based on word distance matrices generated using word embeddings.
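
The two properties described above, explainable topics and soft assignments, can be sketched with plain Python lists standing in for a fitted model. The weights and vocabulary are invented for illustration, and both helper functions are hypothetical. Rescaling the weights to sum to one is an assumption made here so they can be read as a distribution. NMF weights are nonnegative but are not normalized by the model itself.

```python
def top_words_per_topic(topic_word_weights, feature_names, n_top=3):
    """Return the highest-weighted words for each topic."""
    top_words = []
    for weights in topic_word_weights:
        ranked = sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)
        top_words.append([feature_names[j] for j in ranked[:n_top]])
    return top_words

def to_distribution(topic_weights):
    """Rescale nonnegative topic weights so they sum to one."""
    total = sum(topic_weights)
    return [weight / total for weight in topic_weights] if total > 0 else list(topic_weights)

# Invented topic-by-word weights, shaped like the components of a fitted model.
toy_feature_names = ["root", "leaf", "flower", "dwarf"]
toy_topic_word_weights = [
    [0.9, 0.1, 0.0, 0.5],
    [0.0, 0.8, 0.7, 0.1],
]
toy_top_words = top_words_per_topic(toy_topic_word_weights, toy_feature_names, n_top=2)

# Invented topic weights for one description, rescaled into a soft assignment.
toy_soft_assignment = to_distribution([3.0, 1.0])
print(toy_top_words)
print(toy_soft_assignment)
```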

In [20]:
# Get a list of texts to create a topic model from, from one of the processed description dictionaries above. 
texts = [description for i,description in descriptions_linares_pontes.items()]

# Creating and fitting the topic model, either NMF or LDA.
number_of_topics = 20
seed = 0
vectorizer = TfidfVectorizer(max_features=10000, stop_words="english", max_df=0.95, min_df=2, lowercase=False)
features = vectorizer.fit_transform(texts)
cls = NMF(n_components=number_of_topics, random_state=seed)
cls.fit(features)

# Function for retrieving the topic vectors for a list of text descriptions.
def get_topic_embeddings(texts, model, vectorizer):
    ngrams_vectors = vectorizer.transform(texts).toarray()
    topic_vectors = model.transform(ngrams_vectors)
    return(topic_vectors)
    
# Create the dataframe containing the average score assigned to each topic for the genes from each subset.
group_to_topic_vector = {}
for group_id,ids in group_id_to_ids.items():
    texts = [descriptions_linares_pontes[i] for i in ids]
    topic_vectors = get_topic_embeddings(texts, cls, vectorizer)
    mean_topic_vector = np.mean(topic_vectors, axis=0)
    group_to_topic_vector[group_id] = mean_topic_vector
    
tm_df = pd.DataFrame(group_to_topic_vector)

# Changing the order of the Lloyd, Meinke phenotype subsets to match other figures for consistency.
#filename = "../data/group_related_files/lloyd/lloyd_function_hierarchy_irb_cleaned.csv"
#lmtm_df = pd.read_csv(filename)
#tm_df = tm_df[lmtm_df["Subset Symbol"].values]

# Reordering so consistency with the curated subsets can be checked by looking at the diagonal.
tm_df["idxmax"] = tm_df.idxmax(axis = 1)
tm_df["idxmax"] = tm_df["idxmax"].apply(lambda x: tm_df.columns.get_loc(x))
tm_df = tm_df.sort_values(by="idxmax")
tm_df.drop(columns=["idxmax"], inplace=True)
tm_df = tm_df.reset_index(drop=False).rename({"index":"topic"},axis=1).reset_index(drop=False).rename({"index":"order"},axis=1)
tm_df.to_csv(os.path.join(OUTPUT_DIR,"part_5_topic_modeling.csv"), index=False)
tm_df

Unnamed: 0,order,topic,PWY-6406,PWY-5837,PWY-5791,PWY-7270,ETHYL-PWY,PWY-6546,PWY-1081,NONOXIPENT-PWY,CALVIN-PWY,PWY-5723,PWY-6730,PWY-6842,PWY-6736,PWY-6007,PWYQT-4476,PWYQT-4477,PWY-6008,PWY-6443,PWY-5868,PWY-6064,PWY-7186,PWY-6199,PWY-6266,PWY-2181,PWY-5168,PWY-5391,PWY-1121,PWY-361,CAMALEXIN-SYN,LIPAS-PWY,PWY-5080,PWY-7036,PWY-695,PWY-3181,PWY-6446,PWY-6444,PWY-5945,PWY1F-823,PWY-6787,PWY-5152,PWY1F-FLAVSYN,PWY-5060,PWY-3101,PWY-6902,PWY-3982,PWY-5704,PWY-7226,PWY-5034,PWY-5032,GLYOXYLATE-BYPASS,GLYOXDEG-PWY,PWY-699,PWY-6544,PWY-5137,PWY-735,PWY-5136,PWY-6837,PWY-5138,PWY-1042,PWY66-399,SUCSYN-PWY,PWY-5484,GLUCONEO-PWY,GLYCOLYSIS,PWY-2,PWY-6137,PWY-6959,PWY-2261,PWY-6724,PWY-5980,PWY-7238,PWY-1422,PWY-7436,PWY-882,PWY4FS-13,PWY4FS-12,PWY-922,THIOREDOX-PWY,ARGSYNBSUB-PWY,ARGSYN-PWY,PWY-5686,CITRULBIO-PWY,PWY-7060,PWY-4984,GLUTAMINDEG-PWY,PWY0-1319,PWY-5667,PWYQT-4482,TRIGLSYN-PWY,PWY-581,PWY-2902,PWY-7199,PWY-7193,PWY-6556,PWY-1061,PWY-5097,LEUSYN-PWY,PWY-6352,PWY-381,PWY-6549,PWY-6963,GLNSYN-PWY,PWY-6964,PWY-7061,PWY-3301,HISTSYN-PWY,PWY0-1264,PWY-7388,PWY-3385,PWY4FS-6,PWY-6163,PWY-3781,PWY-5083,PWY-4302,LYSINE-DEG2-PWY,PWY-2541,OXIDATIVEPENT-PWY,PWY0-1507,CHLOROPHYLL-SYN,FASYN-ELONG-PWY,PWY-5971,PWY-6039,PWY-6040,PWY-6466,GLUT-REDOX-PWY,PWY-4081,PWY-43,PWY-3801,PWY-5992,GLYSYN2-PWY,PWY-7416,PWY-6803,SERSYN-PWY,ALANINE-DEG3-PWY,ALANINE-SYN2-PWY,ALACAT2-PWY,PWY-6806,PWYQT-4450,PWY-1186,PWY-4361,LEU-DEG2-PWY,PWY-801,PWY-6936,PWY-702,PWY-5041,PWY-7528,METHIONINE-DEG1-PWY,SAM-PWY,PWY-5441,PROSYN-PWY,PWY-3341,PWY-6922,ARGININE-SYN4-PWY,PWY-5366,PWY-5142,PWY-7417,PWY-622,PWY-6545,PWY-7184,PWY-7227,PWY0-166,PWY-6707,PYRUVDEHYD-PWY,PWY-5147,PWY-6663,TRESYN-PWY,PWY-5350,PWY-6477,PWY-321,PWY-5143,PWY-6733,PWY-5989,PWY-282,PWY-5884,PWY-2821,PWY-601,PWY-5079,PWY-5886,PWY-7432,PWYDQC-4,TYRFUMCAT-PWY,PWY-5765,PWY-6369,PWY-1881,PWY-6475,PWY-40,PWY-6305,ARGDEG-V-PWY,ARGASEDEG-PWY,ARG-PRO-PWY,PWY-7101,PANTO-PWY,PWY-7197,PWY-7187,PYRIDNUCSYN-PWY,PWY-2301,PWY-6363,PWY-4702,PWY-67
99,PWY-4381,PWY-6596,PWY-6122,PWY-6121,PWY-3841,PWY-7909,PWY-6613,PWY-3742,PWY-1722,PWY-101,PWY-6614,PWY-2161,PWY-181,GLYSYN-PWY,PWY-5871,PWY-5285,PWY-6364,NONMEVIPP-PWY,PWY-7560,PWY-6804,PWY-5800,PWY-5175,PWY-5946,CAROTENOID-PWY,PWY-7120,RIBOSYN2-PWY,PWY-782,PWY-5995,PWY-762,PWY-4341,PWY-5934,PWY-1001,PWYQT-4475,PWYQT-4473,PWYQT-4474,PWYQT-4472,PWYQT-4471,PWY-1187,PWY-5267,PWY-5947,PWY-5659,MANNCAT-PWY,PWY-3881,PWY-3261,PWY-5997,PWY-7590,MANNOSYL-CHITO-DOLICHOL-BIOSYNTHESIS,PWY-63,PWY-6317,PWY-3821,PWY-7344,PWY-6527,PWY-5114,PWY4FS-2,PWY4FS-4,PWY4FS-3,PWY-6295,PWY-84,PWY-7219,PWY-2724,PWY-66,PWY-2582,PWY-6745,CYSTSYN-PWY,PWY-5670,PWY1F-467,PWY-4041,PWYQT-4432,GLYSYN-ALA-PWY,PWY-5381,PWY-2602,PWY-6424,BSUBPOLYAMSYN-PWY,PWY0-461,ARGSPECAT-PWY,PWY-6535,PWY-6473,PWY-4321,PWY-5910,PWY-5120,PWY-5121,PWY-5863,DETOX1-PWY,DETOX1-PWY-1,PWY-7039,PWY-5188,PWY-4841,UDPNACETYLGALSYN-PWY,PWY-4,PWY-82,PWYQT-4466,PWY-7343,PWYQT-4481,PWY-561,PWY-5690,PWY-5661,PWY-4101,GLUCOSE1PMETAB-PWY,PWY-621,PWY0-1182,TRPSYN-PWY,PWY-6890,PWYQT-4470,GLUTATHIONESYN-PWY,PWY-1581,PWY-6910,PWY-7356,PWY-6908,PWY-5986,PWY-5027,PWY-6118,PWY-4261,PWY-6952,PWY-7208,PWY-7196,PWY-7183,PWYQT-4445,THISYNARA-PWY,PWY-6909,PWY-7625,PHOSLIPSYN2-PWY,PWY-6351,PWY-5973,PWY-5486,PWY66-21,ETOH-ACETYLCOA-ANA-PWY,PWY-6333,PWY-1801,PWY-5070,PWY-5035,PWY-5036,PWY0-1313,PWY-5390,HOMOSER-THRESYN-PWY,PWY-7640,PWY-5271,PWY-6012,PWY-6287,PWY-7047,PWY-7048,MALATE-ASPARTATE-SHUTTLE-PWY,PWY-6348,LIPASYN-PWY,POLYAMINSYN3-PWY,PWY0-501,PWY-5337,PWY-5342,PWY-5687,PWY-7205,PWY-7176,PWY-7221,PWY-7224,PWY-3221,PWY-4861,PWY-5466,PWY-3561,PWYQT-4427,PWY1F-353,PWY-401,PLPSAL-PWY,PWY-7204,SULFMETII-PWY,PWY-5340,PWY-4203,PWY-2161B-PMN,PWY-5410,HEME-BIOSYNTHESIS-II,PWY-6809,GLUGLNSYN-PWY,PWY-5936,GLUTSYNIII-PWY,GLUTAMATE-SYN2-PWY,GLUTAMATE-DEG1-PWY,PWY-5129,PWY-6441,PWY-6932,PWY-6132,PWY-6668,PWY-5107,PWY-6619,ASPSYNII-PWY,P401-PWY,PWY-6066,PWY-1822,PWY-6235,VALDEG-PWY,PWY-6233,PWY-6220,PWY0A-6303,PWY-6303,PWY-6607,PWY-7185,PWY-6606,PWY-6927,P
WY-7170,PWY-641,PWY-6035,ASPASN-ARA-PWY,ASPARTATESYN-PWY,ASPARTATE-DEG1-PWY,PWY-3001,THRESYN-PWY,PWY-5064,PWY-5068,PWY-5086,PWY-6786,PWY-5453,PWY-5963,PWY-5669,PWY-6754,PWY-6756,PWY-6605,PWY-5098,PWY-6019,PWY4FS-7,PWY4FS-8,PWY-5269,PWY-6845,PWY-4983,PWY-6773,PWY0-1021,PWY-7250,PWY-6823,PWY-6115
0,0,17,0.012457,0.006228,0.006228,0.0,0.0,0.0,0.013784,0.0,0.0,0.036701,0.0,0.002876,0.0,0.008204,0.013069,0.018704,0.026138,0.002166,0.00193,0.0,0.003061,0.0,0.0,0.010821,0.012589,0.009755,0.010722,0.018546,0.011135,0.010182,0.002973,0.010382,0.235671,0.23998,0.337158,0.337158,0.073657,0.02485,0.013976,0.017718,0.016772,0.035436,0.013006,0.024868,0.005223,0.0,0.0,0.16739,0.169039,0.015486,0.0,0.002254,0.0,0.033341,0.019729,0.03377,0.039901,0.022227,0.044766,0.005613,0.0,0.036701,0.007217,0.032113,0.071401,0.005742,0.0,0.0,0.0,0.0,0.0,0.003217,0.0,0.016411,0.016024,0.016024,0.012549,0.0,0.0,0.0,0.0,0.003629,0.0,0.0,0.004634,0.0,0.0,0.0,0.015285,0.004877,0.0,0.0,0.0,0.0,0.006473,0.0035,0.011073,0.0,0.036296,0.0,0.014224,0.0,0.0,0.0,0.006761,0.00466,0.026369,0.018834,0.013427,0.0,0.0,0.045757,0.035108,0.03813,0.018152,0.00709,0.0,0.0,0.028732,0.022598,0.013983,0.002573,0.002573,0.010812,0.0,0.036465,0.003477,0.0,0.0,0.069634,0.035007,0.021885,0.070015,0.038204,0.038204,0.022923,0.039547,0.0,0.014479,0.043438,0.021719,0.0,0.0,0.010996,0.004302,0.0,0.006453,0.0,0.043983,0.001925,0.001925,0.001283,0.001283,0.021988,0.021988,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.014859,0.0,0.014862,0.138865,0.208977,0.008757,0.015324,0.004378,0.012259,0.0,0.01393,0.005517,0.005517,0.0,0.0,0.0,0.0,0.0,0.0,0.001252,0.0,0.023119,0.003477,0.002318,0.006954,0.0,0.0,0.039316,0.041525,0.0,0.0,0.018824,0.009022,0.071321,0.009022,0.0,0.0,0.052256,0.027041,0.027041,0.017126,0.0,0.0,0.002368,0.0,0.038457,0.0,0.020677,0.042928,0.034251,0.018822,0.0,0.117027,0.004655,0.003879,0.03307,0.00291,0.0,0.0,0.0,0.050326,0.002034,0.0,0.0,0.0,0.0,0.002407,0.005401,0.005635,0.005635,0.005635,0.005635,0.005635,0.005635,0.005123,0.010536,0.016163,0.000479,0.000479,0.01678,0.0,0.0,0.022174,0.0,0.0,0.0,0.0,0.0,0.0,0.02014,0.02014,0.02014,0.0,0.0,0.000658,0.0,0.004465,0.002817,0.057769,0.0,0.0,0.027912,0.114613,0.114613,0.114613,0.083119,0.0,0.004839,0.0,0.0,0.001595,0.000225,0.000225,0.00015,0.002267,0.002267,0.002267,0
.002267,0.037443,0.073419,0.016636,0.012703,0.0,0.0,0.0,0.0,0.0,0.0,0.059489,0.059489,0.059489,0.0,0.0,0.0,0.0,0.0,0.008084,0.0,0.022821,0.022821,0.0,0.0,0.0,0.0,0.0,0.0,0.067918,0.067918,0.033959,0.0,0.0,0.0,0.052255,0.0,0.0,0.0,0.000759,0.0,0.0,0.0,0.006175,0.0,0.0,0.0,0.160464,0.160464,0.160464,0.009262,0.001791,0.015044,0.303052,0.303052,0.0,0.0,0.02323,0.02323,0.016885,0.057025,0.012367,0.00319,0.011939,0.029465,0.029465,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.018551,0.008985,0.0,0.003122,0.0,0.0,0.0,0.0,0.012623,0.007103,0.134373,0.0,0.212413,0.028447,0.019165,0.0,0.0,0.0,0.012589,0.0,0.0,0.104208,0.005286,0.03047,0.002633,0.0,0.0,0.081691,0.07104,0.141374,0.11693,0.040666,0.040666,0.01083,0.01083,0.104512,0.104512,0.104512,0.0,0.0,0.044833,0.044833,0.004197,0.004197,0.004197,0.004197,0.004197,0.0,0.0,0.0,0.268745,0.268745,0.286364,0.001519,0.0,0.0,0.04509,0.0,0.0,0.0,0.0,0.0,0.031822,0.031822,0.0,0.0,0.0,0.0,0.0
1,1,9,0.029083,0.014541,0.014541,0.023417,0.023417,0.0,0.0,0.007568,0.005045,0.00632,0.0,0.023807,0.0,0.004769,0.0,0.008286,0.0,0.0,0.018501,0.0,0.024669,0.0,0.0,0.026557,0.002832,0.0,0.021363,0.021769,0.018281,0.000406,0.011468,0.026165,0.011104,0.014282,0.028565,0.028565,0.021953,0.0,0.000762,0.0,0.000915,0.0,0.0,0.001561,0.0,0.0,0.097662,0.173997,0.030103,0.0,0.0,0.083844,0.153506,0.090268,0.040107,0.031173,0.062346,0.060179,0.005828,0.012746,0.033584,0.001995,0.016388,0.014339,0.0,0.008697,0.0,0.0,0.015199,0.02699,0.058049,0.045442,0.075736,0.012084,0.024597,0.024597,0.065891,0.050474,0.378099,0.378099,0.378099,0.08996,0.326164,0.189049,0.128514,0.025731,0.043558,0.068106,0.04614,0.012757,0.0,0.0,0.0,0.0,0.034887,0.025048,0.019836,0.173311,0.111147,0.091874,0.111147,0.222295,0.116404,0.116404,0.03973,0.028227,0.0,0.050546,0.0,0.030262,0.0,0.002687,0.013918,0.002942,0.0,0.050478,0.0,0.077205,0.022502,0.202184,0.089953,0.024668,0.024668,0.044297,0.0,0.0,0.002649,0.045093,0.0,0.058982,0.043716,0.031393,0.087432,0.016772,0.016772,0.010063,0.030943,0.008844,0.032698,0.062719,0.03136,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.012551,0.012551,0.071729,0.051424,0.062285,0.062285,0.093428,0.062285,0.0,0.0,0.03292,0.013101,0.0,0.0,0.020653,0.014418,0.025232,0.022193,0.067507,0.0,0.027325,0.041938,0.041938,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.048263,0.0,0.0,0.0,0.0,0.0,0.0,0.046414,0.0,0.0,0.0,0.0,0.0,0.017347,0.0,0.0,0.0,0.01868,0.005946,0.005946,0.089008,0.0,0.0,0.0,0.0,0.033629,0.0,0.089008,0.102913,0.178017,0.0,0.0,0.024685,0.015916,0.04258,0.0,0.034729,0.0,0.0,0.0,0.003495,0.113699,0.02035,0.02035,0.018914,0.010513,0.019223,0.032906,0.032895,0.032895,0.032895,0.032895,0.032895,0.032895,0.0,0.003616,0.032148,0.0,0.0,0.007429,0.0,0.0,0.02508,0.004809,0.004809,0.004809,0.007214,0.004809,0.029306,0.0,0.0,0.0,0.02281,0.02281,0.019586,0.0,0.010877,0.066428,0.060867,0.084042,0.054745,0.017338,0.050316,0.050316,0.050316,0.055535,0.046732,0.098879,0.030778,0.030778,0.0476,0.05
9242,0.059242,0.039495,0.037601,0.037601,0.037601,0.037601,0.017217,0.008609,0.035923,0.0,0.0,0.0,0.0,0.011364,0.0,0.0,0.0,0.0,0.0,0.074886,0.112328,0.074886,0.093685,0.112328,0.016004,0.128394,0.002985,0.002985,0.090705,0.142514,0.142514,0.142514,0.052204,0.0,0.0,0.0,0.022016,0.012393,0.012393,0.012393,0.024879,0.136731,0.136731,0.0,0.0,0.0,0.236607,0.143161,0.04772,0.143161,0.143161,0.143161,0.049444,0.049444,0.049444,0.0,0.0,0.006535,0.0,0.0,0.065853,0.0,0.0,0.0,0.0,0.0,0.0,0.064424,0.0,0.041681,0.041681,0.0,0.0,0.0,0.0,0.0,0.0,0.102797,0.02833,0.0,0.0,0.0,0.000329,0.013851,0.013851,0.051743,0.051743,0.0,0.0,0.01825,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002832,0.0,0.093043,0.0,0.0,0.0,0.0,0.01703,0.01703,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.037361,0.037361,0.037361,0.082141,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.028831,0.0,0.0365,0.0365,0.027552,0.0,0.0,0.0,0.027777,0.048147,0.01044,0.152857,0.152857,0.152857,0.023288,0.023288,0.007815,0.231614,0.231614,0.231614,0.026203
2,2,5,0.265329,0.132664,0.132664,0.001662,0.001662,0.0,0.0,0.0,0.0,0.003011,0.187114,0.06523,0.187114,0.081499,0.093521,0.067165,0.13433,0.162197,0.121152,0.259527,0.104812,0.259527,0.259527,0.095219,0.130908,0.110501,0.042079,0.072414,0.112939,0.045416,1.3e-05,0.001793,0.008644,0.006723,0.013447,0.013447,0.000235,0.0,0.066016,0.106808,0.042897,0.0,0.060826,0.0,0.0,0.0,0.00064,0.0,0.0,0.0,0.0,0.003132,0.0,0.004663,0.021163,0.019377,0.008602,0.005341,0.003512,0.002342,0.0,0.003011,0.003011,0.002634,0.0,0.002232,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.032335,0.151594,0.151594,0.078194,0.007376,0.0,0.0,0.0,0.0,0.003336,0.0,0.0,0.0,0.004624,0.003682,0.007007,0.072502,0.342003,0.342003,0.342003,0.342003,0.0,0.0,5.7e-05,0.002531,0.026453,0.091661,0.005003,0.010007,0.005004,0.005004,0.001037,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00054,0.0,0.0,0.001503,0.0,0.016709,0.075026,0.075026,0.0,0.0,0.0,0.0,0.0,0.078227,0.000298,0.0,0.01566,0.0,0.00053,0.00053,0.00209,0.0,0.0,0.0,0.0,0.0002,0.044893,0.044893,0.065226,0.043825,0.0,0.062987,0.0,0.0,0.0,0.0,0.0,0.002953,0.006984,0.006984,0.009248,0.001202,0.0,0.0,0.0,0.0,0.0,0.0,0.045079,0.0,0.003454,0.139439,0.005209,0.015494,0.02349,0.006712,0.020051,0.0,0.0,0.000213,0.000213,0.000399,0.000399,0.0002,0.000399,0.0002,0.0,0.13454,0.0,0.011159,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.006369,0.007863,0.006369,0.0,0.0,0.008457,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000298,0.0,0.0,0.0,0.004831,0.0,0.023026,0.0,0.0,0.0,0.0,0.0,0.008246,0.019826,0.0,0.0,0.0,0.0,0.005515,0.009499,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.012402,0.0,0.0,0.0,0.0,0.0,0.035878,0.0,0.0,0.0,0.0,0.0,0.012113,0.0,0.0,0.0,0.0,0.0,0.010648,0.0,0.0,0.003915,0.004884,0.000153,0.0,0.001383,0.001192,0.001192,0.001192,0.020639,0.0,0.0,0.0,0.0,0.0,0.015719,0.015719,0.01139,0.0,0.0,0.0,0.0,0.040242,0.020121,0.009438,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.041427,0.000175,0.063359,0.063359,0.006981,0.0,0.0,0.0,0.02697,0.0,0.0,0.0,0.001196,0.0,0.0,0.0,0.008457,0.002
09,0.00209,0.0,0.0,0.0,0.006296,0.0,0.040445,0.0,0.0,0.0,0.006799,0.006799,0.006799,0.060668,0.091239,0.126727,0.003155,0.003155,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002758,0.002758,0.0,0.0,0.0,0.0,0.0,0.0,0.048451,0.0,0.0,0.0,0.0,0.0,0.001607,0.001607,0.004806,0.004806,0.120017,0.0,0.004908,0.006845,0.005275,0.0,0.0,0.00443,0.00443,0.00443,0.001144,0.058676,0.0,0.0,0.006908,0.0,0.0,0.000459,0.000459,0.063828,0.028189,0.0,0.033701,0.057937,0.057937,0.0,0.0,0.016914,0.016914,0.016914,0.077212,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.011063,0.0,0.009816,0.009816,0.04981,0.0,0.0,0.0,0.037544,0.115817,0.0,0.011047,0.011047,0.011047,0.042229,0.042229,0.0,0.0,0.0,0.0,0.0
3,3,10,0.0,0.0,0.0,0.000164,0.000164,0.003859,0.014341,0.016338,0.010892,0.009336,0.0,0.027593,0.0,0.041589,0.007101,0.007101,0.014202,0.00571,0.001144,0.0,0.001525,0.0,0.0,0.001525,0.0,0.000755,0.005194,0.004472,0.0,0.0,0.0,0.000559,0.0,0.014322,0.0,0.0,0.023568,0.001511,0.001415,0.001511,0.001699,0.003022,0.001007,0.0,0.011853,0.0,0.034308,0.001063,0.0,0.0,0.0,0.005589,0.020832,0.002517,0.001007,0.000839,0.001678,0.001678,0.003165,0.000904,0.0,0.001162,0.0,0.001017,0.014322,0.0,0.0,0.0,0.014975,0.154293,0.002381,0.003647,0.003027,0.012412,0.00926,0.00926,0.001705,0.0,0.0,0.0,0.0,0.002034,0.0,0.0,0.001727,0.0,0.008532,0.0,0.006399,0.003064,0.0,0.0,0.0,0.0,0.280799,0.0,0.005006,0.004888,0.0,0.00403,0.0,0.0,0.006045,0.006045,0.0,0.004594,0.000829,0.002655,0.04552,0.0,0.0,0.017786,0.002095,0.001396,0.0,0.012222,0.009361,0.001203,0.0,0.008962,0.00935,0.001525,0.001525,0.0,0.0,0.0,0.0,0.002804,0.0,0.002297,0.0,0.001414,0.0,0.004617,0.004617,0.00277,0.0,0.006462,0.0,0.0,0.004596,0.0,0.0,0.0,0.010019,0.0,0.015029,0.0,0.0,0.004127,0.004127,0.002752,0.002752,0.0,0.0,0.017065,0.003476,0.002565,0.002565,0.003848,0.002565,0.0,0.0,0.001414,0.020305,0.02026,0.0,0.0,0.016284,0.002121,0.01072,0.009429,0.0,0.007899,0.0,0.0,0.009192,0.009192,0.004596,0.009192,0.004596,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.003789,0.0,0.0,0.0,0.0,0.006085,0.0,0.0,0.0,0.0,0.0,0.0,0.004045,0.0,0.0,0.0,0.0,0.002353,0.0,0.004045,0.002297,0.00809,0.006901,0.0,0.015211,0.0,0.0,0.0,0.042614,0.0,0.0,0.0,0.0,0.035119,0.0,0.0,0.0,0.01209,0.0,0.055084,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002173,0.015525,0.029736,0.029736,0.045472,0.0,0.0,0.013979,0.043572,0.029931,0.02722,0.040829,0.02722,0.062407,0.06828,0.06828,0.06828,0.013925,0.013925,0.002825,0.012448,0.083934,0.001778,0.015085,0.000329,0.0,0.021243,0.001096,0.001096,0.001096,0.014253,0.001045,0.0,0.025974,0.025974,0.056053,0.003561,0.003561,0.002374,0.018189,0.018189,0.018189,0.018189,0.000314,0.000157,0.0,0.0,0.0,0.0,0.0,0.004828,0.004067,0.004067,0.0,0.0,0.
0,0.003849,0.001706,0.003849,0.000853,0.001706,0.0,0.003473,0.003298,0.003298,0.0,0.0,0.0,0.0,0.030131,0.015476,0.0,0.0,0.045334,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03866,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.005285,0.005285,0.046204,0.0,0.0,0.0,0.0,0.0,0.0,0.086133,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.118914,0.004467,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000703,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.019034,0.0,0.0,0.0,0.0,0.005014,0.000987,0.000987,0.0,0.0,0.0,0.005124,0.002652,0.002652,0.019336,0.019336,0.0,0.0,0.0,0.037366,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05605,0.0,0.0,0.0,0.0,0.0,0.0,0.007214,0.0,0.0,0.0,0.02282
4,4,1,0.0,0.0,0.0,0.0,0.0,0.005124,0.00825,0.107308,0.11474,0.095391,6.5e-05,0.032574,6.5e-05,0.0,0.0,0.0,0.0,0.0,0.051182,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04345,0.0,0.0,0.062948,0.062948,0.00247,0.000836,0.001671,0.001671,0.057578,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.035757,0.0,0.0,0.00082,0.0,0.004672,0.0,0.0,0.003426,0.002855,0.0,0.0,0.082859,0.029152,0.086387,0.071203,0.037381,0.062302,0.0,0.009094,0.0,0.0,0.000274,0.0,0.002435,0.011577,0.0,0.026357,0.01318,0.01318,0.088021,0.00366,0.0,0.0,0.0,0.062581,0.0,0.0,0.053686,0.224145,0.082382,0.150242,0.07116,0.0,0.0,0.0,0.0,0.0,0.000524,0.0,0.0,0.000684,0.0,0.0,0.005237,0.0,0.0,0.0,0.0,0.120926,0.207047,0.142199,0.027854,0.084193,0.319428,0.046127,0.057659,0.040937,0.176041,0.086887,0.281787,0.15494,0.105437,0.068508,0.01142,0.068243,0.068243,0.048389,0.174245,0.058093,0.214687,0.04241,0.0,0.009707,0.0,0.0,0.0,0.0,0.0,0.003664,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.021904,0.0,0.032856,0.0,0.0,0.125003,0.125003,0.083335,0.089441,0.183819,0.183819,0.156663,0.001801,0.033226,0.033226,0.049839,0.033226,0.408258,0.21253,0.018147,0.095238,0.089952,0.12895,0.0,0.002874,0.0,0.000134,0.0,0.0,0.0,0.00948,0.00948,0.0,0.0,0.023848,0.0,0.023848,0.0,0.0,0.0,0.102905,7.1e-05,4.8e-05,0.000142,0.0,0.0,0.0,0.045178,0.023865,0.023865,0.026712,0.0,0.007224,0.0,0.0,0.0,0.039868,0.004146,0.004146,0.018005,0.0,0.0,0.018583,0.0,0.005404,0.055749,0.018005,0.015636,0.03601,0.244927,0.290172,0.018061,0.04751,0.039592,0.0,0.001885,0.0,0.0,0.0,0.0,0.018052,0.0,0.0,0.0,0.0,0.0,0.014784,0.001326,0.001326,0.001326,0.001326,0.001326,0.001326,0.144492,0.0,0.034258,0.002293,0.002293,0.019158,0.0,0.0,0.0,0.0,0.000234,0.0,0.0,0.0,0.0,0.0138,0.0138,0.0138,0.0,0.0,0.031206,0.0,0.0,0.005841,0.048828,0.0,0.0,0.0,0.0,0.0,0.0,0.039639,0.0,0.046407,0.0,0.0,0.0,0.0,0.0,0.006106,0.0,0.0,0.0,0.0,0.0,0.0,0.008398,0.0,0.0,0.0,0.0,0.0,0.000351,0.000351,0.000492,0.000492,0.000492,0.005681,0.008171,0.005681,0.004085,0.008171,0.0,0.010403,0.059816,0.059816,0.015898,0.0095
05,0.009505,0.009505,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.012054,0.012054,0.0,0.0,0.0,0.0,0.004529,0.00151,0.004529,0.004529,0.004529,0.003971,0.003971,0.003971,0.0,0.0,0.0,0.0,0.0,0.063207,0.154357,0.00123,0.00123,0.00082,0.004977,0.002852,0.0,0.0,0.018755,0.018755,0.04773,0.04773,0.04773,0.04773,0.04773,0.0,0.0,0.0,0.004278,0.0,0.0,0.047496,0.010161,0.010161,0.0,0.0,0.0,0.0,0.006585,0.003139,0.0,0.010463,0.0,0.009159,0.009159,0.009159,0.108672,0.000242,0.0,0.0,0.005657,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.003681,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002124,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000421,0.0,0.01317,0.01317,0.002655,0.0,0.0,0.0,0.0,0.003186,0.002002,0.002435,0.002435,0.002435,0.01828,0.01828,0.074141,0.0,0.0,0.0,0.007047
5,5,13,0.007392,0.003696,0.003696,0.000857,0.000857,0.027937,0.05289,0.017574,0.011716,0.015182,0.0,0.00683,0.0,0.0,0.005924,0.008502,0.011846,0.001906,0.016739,0.0,0.022319,0.0,0.0,0.022319,0.0,0.000194,0.058888,0.036375,0.009786,0.00643,0.094155,0.101982,0.006751,0.038854,0.005833,0.005833,0.024655,0.0,0.001408,0.0,0.001689,0.0,0.000259,0.00767,0.0,0.0,0.028085,0.135037,0.119529,0.0,0.0,0.076871,0.135314,0.064698,0.02976,0.021566,0.043132,0.043132,0.01499,0.007473,0.010426,0.00514,0.009608,0.008407,0.035937,0.0,0.017462,0.017462,0.013898,0.054929,0.024788,0.007152,0.005837,0.003766,0.0,0.0,0.010471,0.016947,0.0,0.0,0.0,0.001005,0.0,0.0,0.001274,0.048063,0.054645,0.052356,0.045856,0.007761,0.0,0.0,0.0,0.0,0.026378,0.0,0.003326,0.011777,0.0,0.008136,0.0,0.0,0.0,0.0,0.0,0.0,0.014516,0.011191,0.017521,0.058449,0.0,0.018183,0.022102,0.016329,0.009913,0.061318,0.0,0.002149,0.009496,0.015729,0.082956,0.022319,0.022319,0.002972,0.0,0.0,0.0,0.026968,0.0,0.025552,0.060401,0.083263,0.024428,0.009335,0.009335,0.014478,0.008603,0.0,0.0,0.0,0.011642,0.0,0.0,0.006714,0.007009,0.0,0.0,0.0,0.0,0.000566,0.000566,0.000378,0.015173,0.0,0.0,0.034238,0.026743,0.002321,0.002321,0.003482,0.002321,0.0,0.0,0.063225,0.0447,0.013227,0.0,0.040228,0.077941,0.094693,0.069348,0.096402,0.349133,0.057907,0.017586,0.017586,0.023283,0.023283,0.011642,0.023283,0.011642,0.0,0.000758,0.019647,0.0,0.0,0.0,0.0,0.0,0.0,0.012905,0.007989,0.0,0.0,0.00599,0.000963,0.004516,0.000963,0.0,0.0,0.010326,0.001253,0.001253,0.044825,0.0,0.0,0.0,0.0,0.034235,0.0,0.044825,0.049642,0.089651,0.013427,0.001276,0.0,0.0,0.0,0.0,0.003535,0.0,0.0,0.0,0.286648,0.0,0.01201,0.013185,0.006848,0.0,0.088393,0.055408,0.002579,0.002579,0.002579,0.002579,0.002579,0.002579,0.0,0.0,0.011143,0.001264,0.001264,0.000407,0.001024,0.001024,0.063067,0.023968,0.021262,0.021262,0.031893,0.021262,0.018957,0.0,0.0,0.0,0.000816,0.000816,0.017122,0.026082,0.038917,0.062261,0.010246,0.006284,0.0,0.05818,0.0,0.0,0.0,0.007987,0.159435,0.109713,0.0295
36,0.029536,0.042673,0.023387,0.023387,0.023742,0.167996,0.167996,0.167996,0.167996,0.0,0.0,0.009036,0.0,0.0,0.0,0.0,0.008873,0.0,0.0,0.000502,0.000502,0.000502,0.047414,0.071121,0.047414,0.04338,0.071121,0.051687,0.0,0.004431,0.004431,0.02803,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.010327,0.0,0.0,0.0,0.025381,0.0,0.103234,0.049683,0.016561,0.049683,0.049683,0.049683,0.163401,0.163401,0.163401,0.0,0.000388,0.002915,0.0,0.0,0.026138,0.0,0.0,0.0,0.005113,0.002352,0.034896,0.05581,0.005245,0.002959,0.002959,0.0,0.0,0.0,0.0,0.0,0.0,0.003924,0.003307,0.004157,0.00648,0.0,0.024225,0.0,0.0,0.077123,0.077123,0.0,0.0,0.015541,0.007929,0.0,0.0,0.02582,0.022194,0.022194,0.022194,0.0,0.049075,0.05209,0.031056,0.0,0.004015,0.006233,0.005143,0.005143,0.0,0.0,0.0,0.0,0.017507,0.017507,0.0,0.0,0.020653,0.020653,0.020653,0.000648,0.014221,0.0,0.0,0.015341,0.015341,0.015341,0.015341,0.015341,0.000201,0.016734,0.000201,0.031082,0.031082,0.069346,0.050761,0.00226,0.00226,0.0,0.000973,0.0,0.060943,0.060943,0.060943,0.01737,0.01737,0.0,0.0,0.0,0.0,0.074001
6,6,0,0.042195,0.021098,0.021098,0.0,0.0,0.029009,0.013459,0.022825,0.015217,0.02143,0.0,0.020963,0.0,0.047627,0.010745,0.026381,0.021368,0.0231,0.019172,0.0,0.035448,0.0,0.0,0.025088,0.01722,0.014525,0.035938,0.028629,0.030532,0.049842,0.002142,0.002362,0.073988,0.049158,0.080802,0.080802,0.045555,0.015855,0.013067,0.004049,0.007664,0.008097,0.019367,0.010147,0.048809,0.0,0.053213,0.0,0.00695,0.059939,0.077301,0.055679,0.048393,0.000424,0.013078,0.014119,0.011237,0.001538,0.019762,0.019232,0.003956,0.008387,0.024727,0.008822,0.008757,0.022031,0.0,0.0,0.03549,0.02846,0.009061,0.048071,0.052286,0.039855,0.069734,0.069734,0.024647,0.035009,0.0,0.0,0.0,0.020997,0.0,0.0,0.025782,0.0,0.004345,0.003507,0.004298,0.05698,0.0,0.0,0.0,0.0,0.042699,0.0,0.023277,0.017193,0.017493,0.0,0.006502,0.0,0.0,0.0,0.03107,0.001903,0.011332,0.012827,0.036753,0.054841,0.016403,0.016948,0.018593,7.3e-05,0.012286,0.014287,0.0,0.038503,0.031581,0.028643,0.045939,0.025563,0.025563,0.060159,0.0,0.014792,0.0,0.041209,0.0657,0.066591,0.006692,0.043397,0.0,0.087219,0.087219,0.054378,0.122769,0.002307,0.003722,0.00194,0.056281,0.0,0.0,0.0,0.000747,0.0,0.00112,0.0,0.0,0.014748,0.014748,0.009833,0.013243,0.07113,0.07113,0.002376,0.024321,0.036256,0.036256,0.054385,0.036256,0.032804,0.007914,0.053283,0.051499,0.042849,0.0,0.072025,0.045201,0.06175,0.017862,0.0494,0.0,0.000614,0.050751,0.050751,0.110622,0.110622,0.055311,0.110622,0.055311,0.0,0.036219,0.0,0.012978,0.0,0.0,0.0,0.0,0.0,0.129495,0.045089,0.026143,0.026143,0.01736,0.0,0.005998,0.0,0.0,0.0,0.00341,0.000919,0.000919,0.031136,0.0,0.0,0.017367,0.0,0.049473,0.0521,0.031136,0.050538,0.062271,0.038895,0.0,0.011584,0.071317,0.059431,0.005119,0.017368,0.054866,0.054866,0.054866,0.0,0.051658,7.3e-05,0.032547,0.01631,0.0,0.006352,0.066389,0.050135,0.050135,0.050135,0.050135,0.050135,0.050135,0.03204,0.052567,0.044222,0.042418,0.042418,0.012298,0.000145,0.000145,0.030237,0.027073,0.0152,0.0152,0.0228,0.0152,0.025544,0.00029,0.00029,0.00029,0.0,0.0,0.0
24032,0.0,0.029901,0.057501,0.083448,0.0,0.039342,0.035628,0.117668,0.117668,0.117668,0.023541,0.016339,0.023478,0.0,0.0,0.047677,0.033599,0.033599,0.02581,0.004025,0.004025,0.004025,0.004025,0.019391,0.031883,0.032363,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.020504,0.020504,0.020504,0.013373,0.020059,0.013373,0.018629,0.020059,0.021056,0.0,0.043514,0.043514,0.011947,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.010603,0.010603,0.010603,0.008712,0.0,0.0,0.0,0.034069,0.0,0.0,0.03102,0.026626,0.03102,0.03102,0.03102,0.032944,0.032944,0.032944,0.024429,0.025002,0.011399,0.060332,0.060332,0.03389,0.0,0.051259,0.051259,0.037181,0.038048,0.034054,0.095354,0.065759,0.065988,0.065988,0.052285,0.052285,0.052285,0.052285,0.052285,0.0,0.020959,0.024395,0.044388,0.0,0.0,0.002984,0.03319,0.03319,0.040127,0.040127,0.0,0.0,0.052028,0.009711,0.035981,0.013005,0.034444,0.005116,0.005116,0.005116,0.018161,0.025211,0.041188,0.0,0.033732,0.044247,0.096129,0.0,0.0,0.014978,0.008029,0.0,0.0,0.075227,0.075227,0.083148,0.083148,0.006819,0.006819,0.006819,0.033898,0.001303,0.0,0.0,0.009024,0.009024,0.009024,0.009024,0.009024,0.0,0.011811,0.0,0.076933,0.076933,0.030947,0.068139,0.0,0.0,0.048794,0.050844,0.100502,0.01052,0.01052,0.01052,0.016755,0.016755,0.015439,0.053601,0.053601,0.053601,0.089523
7,7,15,0.003227,0.001613,0.001613,0.000942,0.000942,0.0,0.000782,0.104796,0.075354,0.059883,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.055859,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.026192,0.0,0.0,0.014165,0.014165,2.1e-05,0.0,0.0,0.0,0.002856,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.8e-05,0.0,0.0,0.0,0.0,0.0,0.0,0.000329,0.000659,0.0,0.00549,0.00386,0.011199,0.004706,0.004796,0.004196,0.0,0.0,0.000251,0.000251,0.0,0.0,0.000137,0.0,0.0,0.057626,0.0,0.0,0.003787,0.001144,0.0,0.0,0.0,0.002806,0.0,0.0,0.004008,0.006135,0.002483,0.00409,0.001862,0.0,0.0,0.0,0.0,0.0,5e-06,0.209591,0.0,0.000145,0.001632,0.0,0.002535,0.0,0.0,0.0,0.0,0.09501,0.016187,0.116934,0.002584,0.056146,0.005322,0.002252,0.002656,0.001877,0.007573,0.004651,0.0,0.011523,0.001212,0.006173,0.001088,0.074479,0.074479,0.005811,0.265123,0.088368,0.0,0.003978,0.0,0.0,0.00024,8e-05,0.000481,0.0,0.0,0.000208,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.010949,0.0,0.016424,0.0,0.0,0.209591,0.209591,0.0,0.000346,0.0,0.0,0.0,0.00022,0.139727,0.139727,0.0,0.139727,0.0,0.101956,0.0,0.0,0.0,0.0,0.0,0.000178,0.0,0.0,7.3e-05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.011667,0.0,0.0,0.0,0.0,0.139727,0.0,0.1177,0.209591,0.209591,0.216542,0.199231,0.079692,0.199231,0.419182,0.419182,0.135654,0.209591,0.209591,0.209591,0.419182,0.419182,0.234099,0.419182,0.001047,0.283115,0.0,0.0,0.0,0.005002,0.0,0.0,0.005188,0.004324,0.0,0.001539,0.0,0.0,0.0,0.0,0.00551,0.0,0.0,0.0,0.0,0.0,0.001029,0.0,0.0,0.0,0.0,0.0,0.0,0.026074,0.0,0.001113,0.002471,0.002471,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.001497,0.001497,0.001497,0.0,0.0,0.004901,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.003086,0.0,0.0,0.0,0.0,0.0,0.000248,0.000248,0.000512,0.0,0.0,0.0,0.0,0.0,0.0,0.001212,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000128,0.000128,0.000128,0.000109,0.000164,0.000109,0.000239,0.000164,0.0,0.0,0.002548,0.002548,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000859,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000365,0.0,0.0,0.0,0.0,0.0,0.003325,0.003325,0.003325,0.0,0.0,0.0,0.0,0.0,0.0035
63,0.014839,0.0,0.0,0.000177,0.000391,0.001212,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.001818,0.000267,0.0,0.021284,0.004762,0.004762,0.0,0.0,0.0,0.0,5.6e-05,0.007114,0.0,0.005076,0.0,0.000519,0.000519,0.000519,0.019625,0.002073,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00195,0.0,0.0,0.006345,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000179,0.0,0.0,0.000532,0.000532,0.000532,0.000532,0.000532,0.0,0.001717,0.0,0.000111,0.000111,0.002266,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.006978,0.0,0.0,0.0,0.0
8,8,11,0.006634,0.003317,0.003317,0.0,0.0,0.0,0.0,0.00246,0.00164,0.001405,0.0,0.0,0.0,0.035299,0.012228,0.080193,0.006492,0.003192,0.00639,0.0,0.008521,0.0,0.0,0.01007,0.002324,0.002219,0.061526,0.003357,0.051298,0.000627,0.027496,0.031606,0.008849,0.002045,0.002967,0.002967,0.02341,0.004438,0.001479,0.004438,0.001775,0.008876,0.002959,0.0,0.001688,0.0,0.010846,0.005225,0.0,0.0,0.0,0.037171,0.029529,0.013692,0.014649,0.014052,0.009128,0.012329,0.003429,0.0,0.0,0.0,0.0,0.0,0.000562,0.003202,0.0,0.0,0.000866,0.021126,0.000742,0.038847,0.032808,0.012982,0.02778,0.02778,0.014763,0.0,0.0,0.0,0.0,0.002399,0.00031,0.0,0.000839,0.118578,0.041833,0.079688,0.071166,0.005988,0.0,0.0,0.0,0.0,0.001073,0.008267,0.172691,0.005449,0.000465,0.002267,0.000475,0.000929,0.003401,0.003401,0.003935,0.003524,0.011028,0.013754,0.035521,0.0,0.0,0.008889,0.011111,0.007763,0.0,0.009484,0.0,0.0,0.0,0.032961,0.011104,0.00852,0.00852,0.0,0.0,0.0,0.0,0.000259,0.008749,0.050366,7e-05,0.005634,0.000139,0.005796,0.005796,0.003478,0.001802,0.277227,0.191372,0.078689,0.039344,0.0,0.0,0.00108,0.010408,0.0,0.015613,0.0,0.004319,0.009059,0.009059,0.00604,0.006039,0.074602,0.074602,0.002699,0.01365,0.004323,0.004323,0.006485,0.004323,0.0,0.008018,0.030478,0.015335,0.001924,0.0,0.0,0.024524,0.008416,0.056859,0.006732,0.000332,0.129017,0.051298,0.051298,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.013517,0.0,0.0,0.002218,0.001185,0.00786,0.001185,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.003287,0.0,0.0,0.000807,0.0,0.013815,0.0,0.018467,0.005028,0.009114,0.332744,0.002073,0.0,0.0,0.0,0.0,0.0,0.213759,0.17001,0.247187,0.005873,0.0,0.001045,0.076947,0.076947,0.076947,0.076947,0.076947,0.076947,0.143835,0.013696,0.012145,0.02176,0.02176,0.014755,0.24347,0.24347,0.04231,0.011854,0.0,0.0,0.0,0.0,0.008891,0.053282,0.053282,0.053282,0.014802,0.014802,0.01702,0.035005,0.0,0.039082,0.0,0.0,0.0,0.00861,0.0,0.0,0.0,0.007839,0.002433,0.042086,0.019645,0.019645,0.009822,0.026134,0.026134,0.017423,0.0,0
.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002151,0.0,0.0,0.027127,0.027127,0.027127,0.0,0.0,0.0,0.0,0.0,0.026017,0.006678,0.0,0.0,0.0,0.0,0.0,0.0,0.037467,0.019576,0.0,0.0,0.000495,0.0,0.0,0.0,0.0,0.026033,0.026033,0.0,0.0,0.0,0.0,0.0,0.041762,0.0,0.0,0.0,0.0,0.0,0.0,0.062644,0.0,0.000133,0.0,0.0,0.003158,0.0,0.0,0.0,0.002962,0.0,0.0,0.0,0.0,0.000789,0.000789,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.011722,0.0,0.001356,0.0,0.0,0.0,0.0,0.034137,0.034137,0.0,0.0,0.025063,0.007552,0.0,2.2e-05,0.007328,0.0,0.0,0.0,0.002324,0.026063,0.0,0.0,0.003847,0.013897,0.003518,0.0,0.0,0.156292,0.053939,0.0,0.111106,0.0047,0.0047,0.015962,0.015962,0.0,0.0,0.0,0.010208,0.0,0.002393,0.002393,0.008885,0.008885,0.008885,0.008885,0.008885,0.0,0.0,0.0,0.050126,0.050126,0.015083,0.0,0.0,0.0,0.0,0.015313,0.0,0.001906,0.001906,0.001906,0.0,0.0,0.007159,0.0,0.0,0.0,0.002569
9,9,16,0.00488,0.00244,0.00244,0.000189,0.000189,0.0,0.263761,0.0,0.0,0.0,0.0,0.03291,0.0,0.001294,0.042274,0.040999,0.081998,0.007306,0.00109,0.0,0.0,0.0,0.0,0.0,0.0,0.002662,0.100659,0.013491,6e-06,0.004014,0.021398,0.020805,0.0,0.0,0.0,0.0,0.007123,0.004178,0.010556,0.004178,0.012667,0.008356,0.003549,0.011235,0.0,0.0,0.022084,0.017051,0.0,0.006105,0.0,0.01474,0.026626,0.007363,0.022289,0.041944,0.083413,0.005384,0.001081,0.003101,0.003197,0.0,0.003987,0.001199,0.0,0.006088,0.0,0.0,0.01455,0.013516,0.00137,0.00176,0.0,0.009608,0.0,0.0,0.017726,0.004898,0.0,0.0,0.0,0.007328,0.0,0.0,0.006992,0.0,0.077241,0.0,0.060245,0.003068,0.0,0.0,0.0,0.0,0.016511,0.0,0.001272,0.003368,0.0,0.003027,0.0,0.0,0.00454,0.00454,0.0,0.003217,0.00232,0.003358,0.056133,0.000819,0.000501,0.009036,0.004585,0.007472,0.0,0.008953,0.0,0.000583,0.000413,0.008794,0.072838,0.001453,0.001453,0.013288,0.0,0.0,0.0,0.00869,0.0,0.00546,0.0,0.07071,0.0,0.002218,0.002218,0.001331,0.0,0.000563,0.0,0.0,0.003327,0.0,0.0,0.00929,0.012698,0.0,0.000467,0.0,0.0,0.012169,0.012169,0.008112,0.008113,0.00144,0.00144,0.03869,0.006826,0.006096,0.006096,0.009144,0.006096,0.001002,0.001197,0.07071,0.009563,0.015968,0.0,0.0,0.131263,0.106065,0.175468,0.085646,0.000238,0.0,0.003724,0.003724,0.006655,0.006655,0.003327,0.006655,0.003327,0.0,0.000473,0.013259,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.023605,0.0,0.0,0.001155,0.0,0.004243,0.0,0.0,0.0,0.001129,0.004524,0.004524,0.01092,0.0,0.0,0.053404,0.0,0.0,0.160213,0.01092,0.008658,0.021841,0.027027,0.0,0.010607,0.0,0.00059,0.0,0.007787,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00908,0.005496,0.058963,0.0,0.0,0.0,0.0,0.0,0.0,0.000376,0.001435,0.015369,0.004667,0.004667,0.003929,0.0,0.0,0.002929,0.00314,0.0,0.0,0.0,0.0,0.002355,0.0842,0.0842,0.0842,0.411007,0.411007,0.0,0.448431,0.0,0.011769,0.032315,0.0,0.0,0.033237,0.0,0.0,0.0,0.00154,0.001048,0.021095,0.015747,0.015747,0.007873,0.078558,0.078558,0.052372,0.012944,0.012944,0.012944,0.012944,0.0,0.0,0.00636,0.0,0.0,0.0,0.0,0.000953,0.0,0.0,
0.010658,0.010658,0.010658,0.0,0.0,0.0,0.002398,0.0,0.004058,0.00067,0.005998,0.005998,0.002504,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00832,0.0,0.003971,0.0,0.002225,0.0,0.0,0.0,0.0,0.0,0.0,0.003337,0.001146,0.001346,0.0,0.0,0.039134,0.0,0.009157,0.009157,0.006105,0.112356,0.000187,0.0,0.000127,0.003615,0.003615,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.003273,0.000281,0.0,0.0,0.0,0.0,0.0,0.021078,0.021078,0.289003,0.0,0.001642,0.012933,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.008646,0.042155,0.0,0.008931,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.001078,0.001078,0.0,0.0,0.0,0.003087,0.000318,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.005303,0.0,0.0,0.0,0.0,0.016641,0.0,0.0,0.012245,0.004631,0.0,0.0,0.0,0.0,0.0,0.0,0.104332,0.0,0.0,0.0,0.011259


In [21]:
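# Print the most representative tokens for each topic in the model.
# unreduce_lp maps each reduced token back to the words it stands for,
# so a topic can list more than num_top_words words.
num_top_words = 2
# Note: scikit-learn 1.2 removed get_feature_names(); use get_feature_names_out() there.
feature_names = vectorizer.get_feature_names()
for i, topic_vec in enumerate(cls.components_):
    print(i, end=": ")
    # Indices of the num_top_words largest weights, in descending order.
    for fid in topic_vec.argsort()[-1:-num_top_words-1:-1]:
        word = feature_names[fid]
        word = " ".join(unreduce_lp[word])
        print(word, end=" ")
    print()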

0: albino mutants mutated mutant types type 
1: embryogenesis embryos fertilization embryonic zygote embryo zygotic defect defective 
2: male sex female gametophytic meristem gametophyte 
3: amylopectin maltodextrin pectin dextrin sugar starch level higher levels 
4: root roots long short 
5: pathogens kanamycin bacterial syringae antimicrobial toxin avirulent pseudomonas bacteria yeast pathogenesis flagellin agrobacterium coli resistance susceptible resistant 
6: coat creamy yellowish pale paler spots greener green 
7: lethal lethality seedling soil harvested germinated seedlings 
8: sensitive hypersensitive sensitivity insensitive relaxation strain stress stresses 
9: lacunae shoots leaves hair curled fleshy 
10: wall walls inside cells cellular cell 
11: acetic butyric salicylic acids acid mutatn composition composed quartet chamber 
12: fructose sugars invertase xylose maltose sorbitol sucrose glucose exogenous exogenously 
13: reduced diminished habit stems stem erect 
14: tenfold

<a id="clustering"></a>
### Approach 2: Agglomerative clustering and comparison to predefined groups
This approach uses agglomerative clustering to group the genes into a fixed number of clusters based on the distances between their embedding representations, combining all of the above methods. Fixing the number of clusters makes it possible to produce roughly as many groups as are present in some existing grouping of the data, such as phenotype categories or biochemical pathways, and then to check whether the clusters obtained resemble those existing groupings.
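
As a rough illustration of the idea, the sketch below averages percentile-ranked distance matrices and then cuts a complete-linkage tree into a fixed number of clusters. It is a toy example under stated assumptions, not the notebook's implementation: it swaps in SciPy's `linkage` and `fcluster` for scikit-learn's `AgglomerativeClustering`, and the helper names `mean_percentile_matrix` and `cluster_fixed_k` and the toy matrices are hypothetical. Averaging only works when every matrix covers the same genes in the same order, so the helper checks shapes first.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from scipy.stats import rankdata


def mean_percentile_matrix(matrices):
    """Average square distance matrices after converting each to percentile ranks."""
    shapes = {m.shape for m in matrices}
    if len(shapes) != 1:
        raise ValueError(f"all matrices must share one shape, got {sorted(shapes)}")
    # rankdata flattens its input and averages tied ranks; dividing by the
    # number of entries turns ranks into percentiles in (0, 1].
    pct = [rankdata(m.ravel()).reshape(m.shape) / m.size for m in matrices]
    return np.mean(pct, axis=0)


def cluster_fixed_k(distances, n_clusters):
    """Complete-linkage clustering of a precomputed distance matrix into n_clusters groups."""
    sym = (distances + distances.T) / 2.0  # guard against small asymmetries
    np.fill_diagonal(sym, 0.0)             # self-distance is zero by definition
    condensed = squareform(sym, checks=False)
    tree = linkage(condensed, method="complete")
    return fcluster(tree, t=n_clusters, criterion="maxclust")


# Hypothetical distances for four genes from two methods on different scales.
method_a = np.array([
    [0.00, 0.10, 0.90, 0.80],
    [0.10, 0.00, 0.85, 0.90],
    [0.90, 0.85, 0.00, 0.20],
    [0.80, 0.90, 0.20, 0.00],
])
method_b = method_a * 10.0

combined = mean_percentile_matrix([method_a, method_b])
labels = cluster_fixed_k(combined, n_clusters=2)
print(labels)
```

Converting to percentiles before averaging keeps a method with a wide distance range from dominating the combined matrix.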

In [25]:
names

['Doc2Vec Wikipedia:Size=300',
 'Doc2Vec PubMed:Size=100',
 'Word2Vec Wikipedia:Size=300,Mean',
 'Word2Vec Wikipedia:Size=300,Max',
 'N-Grams:Full,Words,1-grams,2-grams',
 'N-Grams:Full,Words,1-grams,2-grams,Binary',
 'N-Grams:Full,Words,1-grams',
 'N-Grams:Full,Words,1-grams,Binary',
 'N-Grams:Full,Words,1-grams,2-grams,TFIDF',
 'N-Grams:Full,Words,1-grams,2-grams,Binary,TFIDF',
 'N-Grams:Full,Words,1-grams,TFIDF',
 'N-Grams:Full,Words,1-grams,Binary,TFIDF',
 'N-Grams:Simple,Words,1-grams,2-grams',
 'N-Grams:Simple,Words,1-grams,2-grams,Binary',
 'N-Grams:Simple,Words,1-grams',
 'N-Grams:Simple,Words,1-grams,Binary',
 'N-Grams:Simple,Words,1-grams,2-grams,TFIDF',
 'N-Grams:Simple,Words,1-grams,2-grams,Binary,TFIDF',
 'N-Grams:Simple,Words,1-grams,TFIDF',
 'N-Grams:Simple,Words,1-grams,Binary,TFIDF',
 'N-Grams:Full,Nouns,1-grams',
 'N-Grams:Full,Nouns,1-grams,Binary',
 'N-Grams:Full,Nouns,1-grams,TFIDF',
 'N-Grams:Full,Nouns,1-grams,Binary,TFIDF',
 'N-Grams:Full,Adjectives,1-grams',
 '

In [26]:
# Generate a numpy array where values are mean distance percentiles between all the methods.
to_pct = lambda arr: np.array(pd.Series(arr.flatten()).rank(pct=True)).reshape(-1,arr.shape[0])
all_pct_arrays = np.array([to_pct(np.nan_to_num(graphs[name].array, nan=1)) for name in names])
mean_pct_array = np.mean(all_pct_arrays,axis=0)

# Do agglomerative clustering based on that distance matrix.
number_of_clusters = 50
to_id = graphs[names[0]].row_index_to_id
ac = AgglomerativeClustering(n_clusters=number_of_clusters, linkage="complete", affinity="precomputed")
clustering = ac.fit(mean_pct_array)
id_to_cluster = {}
cluster_to_ids = defaultdict(list)
for idx,c in enumerate(clustering.labels_):
    id_to_cluster[to_id[idx]] = c
    cluster_to_ids[c].append(to_id[idx])

ValueError: operands could not be broadcast together with shapes (460,460) (3392,3392) 

In [None]:
for i in cluster_to_ids[36]:
    print(descriptions[i], "\n\n")

In [None]:
# Create the dataframe containing the fraction of genes from each subset that fall into each cluster.
group_to_cluster_vector = {}
for group_id,ids in group_id_to_ids.items():
    
    mean_cluster_vector = np.zeros(number_of_clusters)
    for i in ids:
        cluster = id_to_cluster[i]
        mean_cluster_vector[cluster] = mean_cluster_vector[cluster]+1
    mean_cluster_vector = mean_cluster_vector/mean_cluster_vector.sum(axis=0,keepdims=1)
    group_to_cluster_vector[group_id] = mean_cluster_vector
    
ac_df = pd.DataFrame(group_to_cluster_vector)

# Changing the order of the Lloyd, Meinke phenotype subsets to match other figures for consistency.
#filename = "../data/group_related_files/lloyd/lloyd_function_hierarchy_irb_cleaned.csv"
#lmac_df = pd.read_csv(filename)
#ac_df = ac_df[lmac_df["Subset Symbol"].values]

# Reordering so consistency with the curated subsets can be checked by looking at the diagonal.
ac_df["idxmax"] = ac_df.idxmax(axis = 1)
ac_df["idxmax"] = ac_df["idxmax"].apply(lambda x: ac_df.columns.get_loc(x))
ac_df = ac_df.sort_values(by="idxmax")
ac_df.drop(columns=["idxmax"], inplace=True)
ac_df = ac_df.reset_index(drop=False).rename({"index":"topic"},axis=1).reset_index(drop=False).rename({"index":"order"},axis=1)
ac_df.to_csv(os.path.join(OUTPUT_DIR,"part_5_agglomerative_clustering.csv"), index=False)
ac_df

<a id="phenologs"></a>
### Approach 3: Looking for phenolog relationships between clusters and OMIM disease phenotypes
This section produces a table of values that scores each pairing of a cluster found in this dataset of plant genes with a disease phenotype. Currently the value indicates the fraction of the plant genes in that cluster that have orthologs associated with that disease phenotype. This should be replaced or supplemented with a p-value for evaluating the significance of this value given the distribution of genes and their mappings to all of the disease phenotypes. All rows of the input dataframe containing the PantherDB and OMIM information where the ID from this dataset is not known, or where the mapping to a phenotype was unsuccessful, are removed at this step. Change this if the metric for evaluating cluster-to-phenotype phenolog mappings needs that information.
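
One way to obtain such a p-value is a hypergeometric test for enrichment. The sketch below uses only the standard library and is offered as an illustration rather than the notebook's method. The function name `hypergeom_upper_tail` and the toy numbers are hypothetical; the argument order follows the `M`, `n`, `N` convention of `scipy.stats.hypergeom`.

```python
from math import comb


def hypergeom_upper_tail(x, M, n, N):
    """P(X >= x) when N genes are drawn without replacement from M genes,
    n of which map to the phenotype of interest."""
    largest_possible = min(n, N)
    total = comb(M, N)
    favourable = sum(
        comb(n, k) * comb(M - n, N - k) for k in range(x, largest_possible + 1)
    )
    return favourable / total


# Toy numbers: 10 genes in the dataset, 3 map to the phenotype,
# and a cluster of 4 genes contains 2 of those 3.
p_value = hypergeom_upper_tail(x=2, M=10, n=3, N=4)
print(round(p_value, 4))
```

A small upper-tail probability means the cluster holds more phenotype-associated genes than random sampling of a cluster that size would usually produce.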

In [None]:
# Read in the dataframe mapping plant genes --> human orthologs --> disease phenotypes.
omim_df = pd.read_csv(panther_to_omim_filename)
# Add a column that indicates which ID in the dataset those plant genes refer to, for mapping to phenotypes.
name_to_id = dataset.get_name_to_id_dictionary()
omim_df["id"] = omim_df["gene_identifier"].map(lambda x: name_to_id.get(x,None))
omim_df = omim_df.dropna(subset=["id","phenotype_mim_name"], inplace=False)
omim_df["phenotype_mim_name"] = omim_df["phenotype_mim_name"].astype(str)
omim_df["compressed_phenotype_mim_name"] = omim_df["phenotype_mim_name"].map(lambda x: x.split(",")[0])
omim_df["id"] = omim_df["id"].astype("Int64")
# Generate mappings between the IDs in this dataset and disease phenotypes or orthologous genes.
id_to_mim_phenotype_names = defaultdict(list)
for i,p in zip(omim_df["id"].values,omim_df["compressed_phenotype_mim_name"].values):
    id_to_mim_phenotype_names[i].append(p)
id_to_human_gene_symbols = defaultdict(list)
for i,s in zip(omim_df["id"].values,omim_df["human_ortholog_gene_symbol"].values):
    id_to_human_gene_symbols[i].append(s)
omim_df.head(5)

In [None]:
phenolog_x_dict = defaultdict(dict)
phenolog_p_dict = defaultdict(dict)
candidate_genes_dict = defaultdict(dict)
phenotypes = pd.unique(omim_df["compressed_phenotype_mim_name"].values)
clusters = list(cluster_to_ids.keys())
for cluster,phenotype in itertools.product(clusters,phenotypes):

    #ids = cluster_to_ids[cluster]
    #x = list(set(flatten([id_to_mim_phenotype_names.get(i,[]) for i in ids]))).count(phenotype) 
    #phenotypes_in_cluster = flatten([id_to_mim_phenotype_names.get(i,[]) for i in ids])
    #phenotype_occurences_in_cluster = phenotypes_in_cluster.count(phenotype)
    #phenolog_dict[cluster][phenotype] = 0.000
    #if phenotype_occurences_in_cluster > 0:
    #    phenolog_dict[cluster][phenotype] = phenotype_occurences_in_cluster / len(ids)
    
    # What are the candidate genes predicted if this phenolog pairing is real?
    ids = cluster_to_ids[cluster]
    candidate_genes_dict[cluster][phenotype] = list(set(flatten([id_to_human_gene_symbols[i] for i in ids if phenotype not in id_to_mim_phenotype_names.get(i,[])])))

    # What is the p-value for this phenolog pairing?
    # The size of the population (genes in the dataset).
    M = len(id_to_cluster.keys())
    # The number of elements we draw without replacement (genes in the cluster).
    N = len(cluster_to_ids[cluster])     
    # The number of available successes in the population (genes that map to orthologs that map to this phenotype).
    n = len([i for i in id_to_cluster.keys() if phenotype in id_to_mim_phenotype_names.get(i,[])])
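    # The number of successes drawn (genes in this cluster that map to orthologs that map to this phenotype).
    # Counting genes rather than de-duplicated phenotype names, so that x can exceed 1.
    x = len([i for i in ids if phenotype in id_to_mim_phenotype_names.get(i,[])])
    # Upper tail P(X >= x): the chance of drawing at least x such genes in a cluster of this size.
    prob = hypergeom.sf(x-1, M, n, N) # Equivalent to prob = sum([hypergeom.pmf(x_i, M, n, N) for x_i in range(x,N+1)])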
    phenolog_x_dict[cluster][phenotype] = x
    phenolog_p_dict[cluster][phenotype] = prob
    

# Convert the dictionary to a table of values with cluster and phenotype as the rows and columns.
phenolog_matrix = pd.DataFrame(phenolog_x_dict)        
phenolog_matrix.head(5)

In [None]:
# Produce a melted version of the phenolog matrix sorted by value and including predicted candidate genes.
phenolog_matrix_reset = phenolog_matrix.reset_index(drop=False).rename({"index":"omim_phenotype_name"}, axis="columns")
# All cluster columns are melted; slicing with columns[1:] would silently drop the first cluster.
phenolog_df = pd.melt(phenolog_matrix_reset, id_vars=["omim_phenotype_name"], value_vars=phenolog_matrix.columns, var_name="cluster", value_name="x")
# What other information should be present in this melted phenologs matrix?
phenolog_df["size"] = phenolog_df["cluster"].map(lambda x: len(cluster_to_ids[x]))
phenolog_df["candidate_gene_symbols"] = np.vectorize(lambda x,y: concatenate_with_bar_delim(*candidate_genes_dict[x][y]))(phenolog_df["cluster"], phenolog_df["omim_phenotype_name"])
phenolog_df["p_value"] = np.vectorize(lambda x,y: phenolog_p_dict[x][y])(phenolog_df["cluster"], phenolog_df["omim_phenotype_name"])
phenolog_df["p_adjusted"] = multipletests(phenolog_df["p_value"].values, method='bonferroni')[1]
#phenolog_df.sort_values(by=["x"], inplace=True, ascending=False)
phenolog_df.sort_values(by=["p_value"], inplace=True, ascending=True)
phenolog_df = phenolog_df[["omim_phenotype_name", "cluster", "size", "x", "p_value", "p_adjusted", "candidate_gene_symbols"]]
phenolog_df.to_csv(os.path.join(OUTPUT_DIR,"part_5_phenologs.csv"), index=False)
phenolog_df.head(30)

### Approach 4: Agglomerative clustering and silhouette scores for each NLP method

In [None]:
from sklearn.metrics.cluster import silhouette_score
# Note that homogeneity scores don't fit for evaluating how close the clustering is to pathway membership, etc.
# This is because genes can be assigned to more than one pathway, metric would have to be changed to account for this.
# So all this section does is determines which values of n_clusters provide good clustering results for each matrix.
n_clusters_silhouette_scores = defaultdict(dict)
min_n_clusters = 10
max_n_clusters = 400
step_size = 4
number_of_clusters = np.arange(min_n_clusters, max_n_clusters, step_size)
for n in number_of_clusters:
    for method in methods:
        distance_matrix = np.nan_to_num(graphs[method].array, nan=1)
        to_id = graphs[method].row_index_to_id
        ac = AgglomerativeClustering(n_clusters=n, linkage="complete", affinity="precomputed")
        clustering = ac.fit(distance_matrix)
        sil_score = silhouette_score(distance_matrix, clustering.labels_, metric="precomputed")
        n_clusters_silhouette_scores[method][n] = sil_score
sil_df = pd.DataFrame(n_clusters_silhouette_scores).reset_index(drop=False).rename({"index":"n"},axis="columns")
sil_df.to_csv(os.path.join(OUTPUT_DIR,"part_5_silhouette_scores_by_n.csv"), index=False)

# Part 6. Supervised Tasks

<a id="merging"></a>
### Option 1: Merging in the previously curated similarity values from the Oellrich, Walls et al. (2015) dataset
This section reads in a file containing the previously calculated distance values from the Oellrich, Walls et al. (2015) dataset and merges it with the values obtained here for each applicable natural language processing or machine learning method, so that the graphs specified by these sets of distance values can be evaluated side by side in the subsequent sections.

In [None]:
# Add a column that indicates the distance estimated using curated EQ statements.
df = df.merge(right=pppn_edgelist.df, how="left", on=["from","to"])
df.fillna(value=0.000,inplace=True)
df.rename(columns={"value":"EQs"}, inplace=True)
df["EQs"] = 1-df["EQs"]
methods.append("EQs")
df.head(10)

### Option 2: Merging with information about shared biochemical pathways or groups.
The relevant information for each edge includes whether the two genes that the edge connects share a group or biochemical pathway, and whether those genes are from the same species. This information can later be used as the target values for predictive models, or for filtering the graphs represented by these edge lists. Either the grouping information or the protein-protein interaction information should be used.
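
A minimal sketch of the shared-group flag, assuming a dictionary that maps each gene ID to the list of groups it belongs to. The toy IDs, group names, and the helper `shares_group` are hypothetical.

```python
def shares_group(id_to_group_ids, gene_a, gene_b):
    """Return 1 if the two genes have at least one group in common, else 0."""
    groups_a = set(id_to_group_ids.get(gene_a, []))
    groups_b = set(id_to_group_ids.get(gene_b, []))
    return int(bool(groups_a & groups_b))


# Hypothetical gene IDs mapped to pathway identifiers.
toy_id_to_group_ids = {
    1: ["PWY-A", "PWY-B"],
    2: ["PWY-B"],
    3: ["PWY-C"],
}
toy_edges = [(1, 2), (1, 3), (2, 3)]
shared_flags = [shares_group(toy_id_to_group_ids, a, b) for a, b in toy_edges]
print(shared_flags)
```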

In [None]:
# Column indicating whether or not the two genes share this feature (e.g., pathway in common, same group).
df["shared"] = df[["from","to"]].apply(lambda x: len(set(id_to_group_ids[x["from"]]).intersection(set(id_to_group_ids[x["to"]])))>0, axis=1)*1
# Column indicating whether the two genes are from the same species.
species_dict = dataset.get_species_dictionary()
df["same"] = df[["from","to"]].apply(lambda x: species_dict[x["from"]]==species_dict[x["to"]],axis=1)*1
print(Counter(df["shared"].values))
print(Counter(df["same"].values))

### Option 3: Merging with information about protein-protein interactions.

In [None]:
# Merging information from the protein-protein interaction database with this dataset.
df = df.merge(right=string_data.df, how="left", on=["from","to"])
df.fillna(value=0,inplace=True)
df["shared"] = (df["combined_score"] != 0.00)*1
df.tail(12)

<a id="ensemble"></a>
### Combining multiple distances measurements into summarizing distance values
The purpose of this section is to iteratively train models on subsections of the dataset, using simple regression or machine learning approaches, to predict a value from zero to one indicating how likely it is that two genes share at least one of the specified groups. The inputs to these models are the distance scores provided by each method in some subset of the methods used in this notebook. The goal is to see whether a function of these scores, trained specifically for predicting common groupings, is better able to use the distance information to report a score for this task.
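
The out-of-fold pattern described above can be sketched as follows. This is an assumption-laden illustration: the notebook's own `train_logistic_regression_model` and `apply_logistic_regression_model` helpers are not shown in this section, so scikit-learn's `LogisticRegression` stands in for them, and the function `out_of_fold_probabilities` and the toy distances are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold


def out_of_fold_probabilities(X, y, n_splits=3, seed=0):
    """Predict P(shared = 1) for every row using a model that never saw that row."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    oof = np.full(len(y), np.nan)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in kf.split(X):
        model = LogisticRegression()
        model.fit(X[train_idx], y[train_idx])
        # Column 1 of predict_proba corresponds to the positive class (label 1).
        oof[test_idx] = model.predict_proba(X[test_idx])[:, 1]
    return oof


# Toy distances from two methods; pairs that share a group tend to be closer.
X_toy = [
    [0.10, 0.20], [0.20, 0.10], [0.15, 0.25], [0.30, 0.20], [0.25, 0.30], [0.20, 0.20],
    [0.80, 0.70], [0.70, 0.90], [0.90, 0.80], [0.75, 0.85], [0.85, 0.70], [0.80, 0.80],
]
y_toy = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

probabilities = out_of_fold_probabilities(X_toy, y_toy)
ensemble_distance = 1.0 - probabilities  # back on a distance scale, as in the notebook
print(np.round(ensemble_distance, 3))
```

Writing predictions into a preallocated array indexed by the test fold avoids assigning through a chained selection on the dataframe.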

In [None]:
# Get the average distance percentile as a means of combining multiple scores.
method = "Mean"
df[method] = df[methods].rank(pct=True).mean(axis=1)
methods.append(method)

In [None]:
# Iteratively create models for combining output values from multiple semantic similarity methods.
# Problem with this method in that the predictors are going to be highly correlated.
method = "Logistic Regression"
splits = 12
kf = KFold(n_splits=splits, random_state=14271, shuffle=True)
df[method] = np.nan
method_col = df.columns.get_loc(method)
for train,test in kf.split(df):
    lr_model = train_logistic_regression_model(df=df.iloc[train], predictor_columns=methods, target_column="shared")
    # Assign through a single .iloc call so the values land in df itself rather than in a temporary copy.
    df.iloc[test, method_col] = apply_logistic_regression_model(df=df.iloc[test], predictor_columns=methods, model=lr_model)
df[method] = 1-df[method]
methods.append(method)

In [None]:
# Iteratively create models for combining output values from multiple semantic similarity methods.
# Problem with overfitting if the duplicates between descriptions are not removed between the training and testing.
method = "Random Forest"
splits = 2
kf = KFold(n_splits=splits, random_state=14271, shuffle=True)
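df[method] = np.nan
method_col = df.columns.get_loc(method)
for train,test in kf.split(df):
    rf_model = train_random_forest_model(df=df.iloc[train], predictor_columns=methods, target_column="shared")
    # Assign through a single .iloc call so the values land in df itself rather than in a temporary copy.
    df.iloc[test, method_col] = apply_random_forest_model(df=df.iloc[test],predictor_columns=methods, model=rf_model)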
df[method] = 1-df[method]
methods.append(method)

<a id="ks"></a>
### Do the edges joining genes that share a group, pathway, or interaction come from a different distribution?
The purpose of this section is to visualize kernel density estimates for the distributions of distance or similarity scores generated by each of the methods tested for measuring semantic similarity or generating vector representations of the phenotype descriptions. Ideally, better methods should show better separation between the distribution of distances for gene pairs that share a specified group and the distribution for pairs that do not. Additionally, a statistical test is used to check whether these two distributions differ significantly. This is a less informative measure than the tests used in subsequent sections, because it does not address how useful the differences between the distributions actually are for making predictions about group membership.
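
For intuition about what the two-sample Kolmogorov-Smirnov statistic measures, the sketch below computes it directly as the largest gap between the two empirical cumulative distribution functions. It is a standard-library illustration only; the notebook itself relies on `ks_2samp`, and the function `ks_statistic` and the toy samples here are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest absolute difference between the empirical CDFs of two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    points = sorted(set(a) | set(b))
    # bisect_right counts how many sorted values are <= the query point.
    return max(
        abs(bisect_right(a, v) / len(a) - bisect_right(b, v) / len(b))
        for v in points
    )


# Toy distances for gene pairs that do and do not share a group.
shared_distances = [0.1, 0.2, 0.3, 0.6]
unshared_distances = [0.4, 0.5, 0.7, 0.8, 0.9]
statistic = ks_statistic(shared_distances, unshared_distances)
print(statistic)
```

A statistic near 1 means the two distributions barely overlap, while a value near 0 means they are nearly indistinguishable.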

In [None]:
# Use Kolmogorov-Smirnov test to see if edges between genes that share a group come from a distinct distribution.
ppi_pos_dict = {name:(df[df["shared"] > 0.00][name].values) for name in methods}
ppi_neg_dict = {name:(df[df["shared"] == 0.00][name].values) for name in methods}
for name in methods:
    stat,p = ks_2samp(ppi_pos_dict[name],ppi_neg_dict[name])
    pos_mean = np.average(ppi_pos_dict[name])
    neg_mean = np.average(ppi_neg_dict[name])
    pos_n = len(ppi_pos_dict[name])
    neg_n = len(ppi_neg_dict[name])
    TABLE[name].update({"mean_1":pos_mean, "mean_0":neg_mean, "n_1":pos_n, "n_0":neg_n})
    TABLE[name].update({"ks":stat, "ks_pval":p})
    
    
# Show the kernel estimates for each distribution of weights for each method.
num_plots, plots_per_row, row_width, row_height = (len(methods), 4, 14, 3)
fig,axs = plt.subplots(math.ceil(num_plots/plots_per_row), plots_per_row, squeeze=False)
for name,ax in zip(methods,axs.flatten()):
    ax.set_title(name)
    ax.set_xlabel("value")
    ax.set_ylabel("density")
    sns.kdeplot(ppi_pos_dict[name], color="black", shade=False, alpha=1.0, ax=ax)
    sns.kdeplot(ppi_neg_dict[name], color="black", shade=True, alpha=0.1, ax=ax) 
fig.set_size_inches(row_width, row_height*math.ceil(num_plots/plots_per_row))
fig.tight_layout()
fig.savefig(os.path.join(OUTPUT_DIR,"part_6_kernel_density.png"),dpi=400)
plt.close()

<a id="within"></a>
### Looking at within-group or within-pathway distances in each graph
The purpose of this section is to determine which methods generate graphs that tightly group genes sharing common pathways or group membership. To compare across methods whose distance distributions differ, the mean distance for each group under each method is converted to a percentile score. A lower percentile indicates that the average distance between any two genes in that group is lower than most of the distance values in the entire distribution for that method.
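
The percentile conversion can be illustrated with a small standard-library sketch. It assumes a dictionary of pairwise distances keyed by unordered gene pairs, and it uses a simple "share of values at or below the mean" definition, which may treat ties differently from the `kind="rank"` option of `stats.percentileofscore` used in the notebook. The names `within_group_percentile` and `toy_distances` are hypothetical.

```python
from itertools import combinations
from statistics import mean


def within_group_percentile(distances, group_members):
    """Percentile of a group's mean within-group distance among all pairwise distances."""
    within = [distances[frozenset(pair)] for pair in combinations(group_members, 2)]
    mean_within = mean(within)
    all_values = list(distances.values())
    at_or_below = sum(1 for value in all_values if value <= mean_within)
    return 100.0 * at_or_below / len(all_values)


# Toy distances between four genes, keyed by unordered pair.
toy_distances = {
    frozenset((1, 2)): 0.1,
    frozenset((1, 3)): 0.8,
    frozenset((1, 4)): 0.9,
    frozenset((2, 3)): 0.7,
    frozenset((2, 4)): 0.6,
    frozenset((3, 4)): 0.5,
}
tight_group = within_group_percentile(toy_distances, [1, 2])
loose_group = within_group_percentile(toy_distances, [1, 3, 4])
print(round(tight_group, 2), round(loose_group, 2))
```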

In [None]:
# Get all the average within-pathway phenotype distance values for each method for each particular pathway.
group_id_to_ids = groups.get_group_id_to_ids_dict(dataset.get_gene_dictionary())
group_ids = list(group_id_to_ids.keys())
graph = IndexedGraph(df)
within_weights_dict = defaultdict(lambda: defaultdict(list))
within_percentiles_dict = defaultdict(lambda: defaultdict(list))
all_weights_dict = {}
for method in methods:
    all_weights_dict[method] = df[method].values
    for group in group_ids:
        within_ids = group_id_to_ids[group]
        within_pairs = [(i,j) for i,j in itertools.permutations(within_ids,2)]
        mean_weight = np.mean((graph.get_values(within_pairs, kind=method)))
        within_weights_dict[method][group] = mean_weight
        within_percentiles_dict[method][group] = stats.percentileofscore(df[method].values, mean_weight, kind="rank")

# Generating a dataframe of percentiles of the mean in-group distance scores.
within_dist_data = pd.DataFrame(within_percentiles_dict)
within_dist_data = within_dist_data.dropna(axis=0, inplace=False)
within_dist_data = within_dist_data.round(4)

# Adding relevant information to this dataframe and saving.
within_dist_data["mean_rank"] = within_dist_data.rank().mean(axis=1)
within_dist_data["mean_percentile"] = within_dist_data.mean(axis=1)
within_dist_data.sort_values(by="mean_percentile", inplace=True)
within_dist_data.reset_index(inplace=True)
within_dist_data["group_id"] = within_dist_data["index"]
within_dist_data["full_name"] = within_dist_data["group_id"].apply(lambda x: groups.get_long_name(x))
within_dist_data["n"] = within_dist_data["group_id"].apply(lambda x: len(group_id_to_ids[x]))
within_dist_data = within_dist_data[flatten(["group_id","full_name","n","mean_percentile","mean_rank",methods])]
within_dist_data.to_csv(os.path.join(OUTPUT_DIR,"part_6_within_distances.csv"), index=False)
within_dist_data.head(5)

<a id="auc"></a>
### Predicting whether two genes belong to the same group, pathway, or share an interaction
The purpose of this section is to see whether two genes sharing at least one common pathway can be predicted from the distance scores assigned by analysis of text similarity. Predictability is evaluated by reporting a precision-recall curve for each method, and by recording the area under the curve and the ratio between that area and the baseline (the expected area when guessing randomly).
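
A compact sketch of the metric, using the same scikit-learn functions as the cell below on toy data. The helper name `pr_auc_and_ratio` and the toy values are hypothetical. The baseline is taken to be the fraction of positive pairs, which is the precision a random ranking would be expected to achieve.

```python
from sklearn.metrics import auc, precision_recall_curve


def pr_auc_and_ratio(y_true, distances):
    """Area under the precision-recall curve, the random baseline, and their ratio."""
    scores = [1.0 - d for d in distances]  # smaller distance means a more confident positive
    precision, recall, _ = precision_recall_curve(y_true, scores)
    area = auc(recall, precision)
    baseline = sum(y_true) / len(y_true)
    return area, baseline, area / baseline


# Toy edges: the two pairs that share a pathway also have the smallest distances.
toy_shared = [1, 1, 0, 0]
toy_edge_distances = [0.1, 0.2, 0.8, 0.9]
area, baseline, ratio = pr_auc_and_ratio(toy_shared, toy_edge_distances)
print(area, baseline, ratio)
```

Reporting the ratio alongside the raw area makes results comparable across datasets where the proportion of positive pairs differs.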

In [None]:
y_true_dict = {name:df["shared"] for name in methods}
y_prob_dict = {name:(1 - df[name].values) for name in methods}
num_plots, plots_per_row, row_width, row_height = (len(methods), 4, 14, 3)
fig,axs = plt.subplots(math.ceil(num_plots/plots_per_row), plots_per_row, squeeze=False)
for method,ax in zip(methods, axs.flatten()):
    
    # Obtaining the values and metrics.
    y_true, y_prob = y_true_dict[method], y_prob_dict[method]
    n_pos, n_neg = Counter(y_true)[1], Counter(y_true)[0]
    precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
    baseline = Counter(y_true)[1]/len(y_true) 
    area = auc(recall, precision)
    auc_to_baseline_auc_ratio = area/baseline
    TABLE[method].update({"auc":area, "baseline":baseline, "ratio":auc_to_baseline_auc_ratio})

    # Producing the precision recall curve.
    step_kwargs = ({'step': 'post'} if 'step' in signature(plt.fill_between).parameters else {})
    ax.step(recall, precision, color='black', alpha=0.2, where='post')
    ax.fill_between(recall, precision, alpha=0.7, color='black', **step_kwargs)
    ax.axhline(baseline, linestyle="--", color="lightgray")
    ax.set_xlabel('Recall')
    ax.set_ylabel('Precision')
    ax.set_ylim([0.0, 1.05])
    ax.set_xlim([0.0, 1.0])
    ax.set_title("PR {0} (Baseline={1:0.3f})".format(method, baseline))
    
fig.set_size_inches(row_width, row_height*math.ceil(num_plots/plots_per_row))
fig.tight_layout()
fig.savefig(os.path.join(OUTPUT_DIR,"part_6_prcurve_shared.png"),dpi=400)
plt.close()

<a id="y"></a>
### Are genes in the same group or pathway ranked higher with respect to individual nodes?
This section tests whether, for some value k, the graph ranks more edges from a particular gene to its true protein-protein interaction partners at rank k or better than would be expected by chance. Framing the problem this way is less ambiguous than the previous methods because it reflects how the network would actually be used. In other words, missing much of the true information matters less as long as the network still surfaces some new, useful information, so the question becomes what is happening at the very top of the results. These results should be comparable to strictly thresholding the network and treating the remaining edges as predicted interactions. This is similar to looking at the far left-hand side of the precision-recall curves, but quantifies it slightly differently.
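
The top-k comparison against random selection can be sketched with the standard library alone. This is a simplified illustration under assumptions: each gene's neighbors are given as `(distance, shared_flag)` tuples, ties are broken by list order rather than by average rank, and the names `count_top_k_hits`, `simulate_random_hits`, and `neighbors_by_gene` are hypothetical.

```python
import random


def count_top_k_hits(neighbors_by_gene, k):
    """Count true partners among the k closest neighbors of every gene."""
    total = 0
    for neighbors in neighbors_by_gene.values():
        ranked = sorted(neighbors, key=lambda pair: pair[0])
        total += sum(flag for _, flag in ranked[:k])
    return total


def simulate_random_hits(neighbors_by_gene, k, n_iter, seed=0):
    """Monte Carlo counts obtained by picking k neighbors at random for every gene."""
    rng = random.Random(seed)
    sims = []
    for _ in range(n_iter):
        total = 0
        for neighbors in neighbors_by_gene.values():
            picked = rng.sample(neighbors, min(k, len(neighbors)))
            total += sum(flag for _, flag in picked)
        sims.append(total)
    return sims


# Toy neighbor lists: (distance, 1 if the pair truly interacts else 0).
neighbors_by_gene = {
    "g1": [(0.1, 1), (0.2, 0), (0.9, 0)],
    "g2": [(0.3, 1), (0.4, 1), (0.8, 0)],
    "g3": [(0.2, 0), (0.5, 0), (0.7, 1)],
}
k = 1
observed = count_top_k_hits(neighbors_by_gene, k)
y_max = sum(min(sum(flag for _, flag in nb), k) for nb in neighbors_by_gene.values())
sims = simulate_random_hits(neighbors_by_gene, k, n_iter=200)
p_value = sum(1 for s in sims if s >= observed) / len(sims)
print(observed, y_max, p_value)
```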

In [None]:
# When the edgelist is generated above, only the lower triangle of the pairwise matrix is retained for edges in the 
# graph. This means that in terms of the indices of each node, only the (i,j) node is listed in the edge list where
# i is less than j. This makes sense because the graph that's specified is assumed to already be undirected. However
# in order to be able to easily subset the edgelist by a single column to obtain rows that correspond to all edges
# connected to a particular node, this method will double the number of rows to include both (i,j) and (j,i) edges.
df = pw.make_undirected(df)

# What's the number of functional partners ranked k or higher in terms of phenotypic description similarity for 
# each gene? Also figure out the maximum possible number of functional partners that could be theoretically
# recovered in this dataset if recovered means being ranked as k or higher here.
k = 10      # The threshold of interest for gene ranks.
n = 100     # Number of Monte Carlo simulation iterations to complete.
df[list(methods)] = df.groupby("from")[list(methods)].rank()
ys = df[df["shared"]==1][list(methods)].apply(lambda s: len([x for x in s if x<=k]))
ymax = sum(df.groupby("from")["shared"].apply(lambda s: min(len([x for x in s if x==1]),k)))

# Monte Carlo simulation to see what the probability is of achieving each y-value by just randomly pulling k 
# edges for each gene rather than taking the top k ones that the similarity methods specifies when ranking.
ysims = [sum(df.groupby("from")["shared"].apply(lambda s: len([x for x in s.sample(k) if x>0.00]))) for i in range(n)]
for method in methods:
    pvalue = len([ysim for ysim in ysims if ysim>=ys[method]])/float(n)
    TABLE[method].update({"y":ys[method], "y_max":ymax, "y_ratio":ys[method]/ymax, "y_pval":pvalue})

<a id="mean"></a>
### Predicting biochemical pathway or group membership based on mean vectors
This section looks at how well the biochemical pathways that a particular gene belongs to can be predicted from the similarity between the vector representation of that gene's phenotype descriptions and the average of the vector representations of phenotypes associated with genes in that pathway. When calculating the average vector for a given biochemical pathway, the vector for the gene currently being classified is left out, to avoid overestimating performance by including information about the ground truth during classification. This leads to missing information for biochemical pathways that have only one member. This can be handled either by limiting the dataset to genes that belong to pathways with at least two mapped genes, and including only those pathways, or by removing the missing values before calculating the performance metrics below.
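
The leave-one-out mean vector can be illustrated with a small NumPy sketch. Cosine similarity is used here purely as an assumption for the example; the notebook selects a metric per method through `metric_dict`. The function `leave_one_out_group_similarity` and the toy vectors are hypothetical.

```python
import numpy as np


def leave_one_out_group_similarity(vectors, group_ids, identifier):
    """Cosine similarity between one gene's vector and the mean vector of a group,
    with the gene itself excluded from the mean when it belongs to the group."""
    others = [vectors[i] for i in group_ids if i != identifier]
    if not others:
        return float("nan")  # single-member group: nothing left to average
    mean_vector = np.mean(others, axis=0)
    this_vector = np.asarray(vectors[identifier], dtype=float)
    denominator = np.linalg.norm(mean_vector) * np.linalg.norm(this_vector)
    return float(np.dot(mean_vector, this_vector) / denominator)


# Toy two-dimensional embeddings for four genes in two pathways.
toy_vectors = {
    1: np.array([1.0, 0.0]),
    2: np.array([0.9, 0.1]),
    3: np.array([0.0, 1.0]),
    4: np.array([0.1, 0.9]),
}
pathway_a, pathway_b = [1, 2], [3, 4]
own_pathway = leave_one_out_group_similarity(toy_vectors, pathway_a, 1)
other_pathway = leave_one_out_group_similarity(toy_vectors, pathway_b, 1)
print(round(own_pathway, 3), round(other_pathway, 3))
```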

In [None]:
# Get the list of methods to look at, and a mapping between each method and the correct similarity metric to apply.
vector_dicts = {k:v.vector_dictionary for k,v in graphs.items()}
methods = list(vector_dicts.keys())
group_id_to_ids = groups.get_group_id_to_ids_dict(dataset.get_gene_dictionary())
valid_group_ids = [group for group,id_list in group_id_to_ids.items() if len(id_list)>1]
valid_ids = [i for i in dataset.get_ids() if len(set(valid_group_ids).intersection(set(id_to_group_ids[i])))>0]
pred_dict = defaultdict(lambda: defaultdict(dict))
true_dict = defaultdict(lambda: defaultdict(dict))
for method in methods:
    for group in valid_group_ids:
        ids = group_id_to_ids[group]
        for identifier in valid_ids:
            # What's the mean vector of this group, without this particular one that we're trying to classify.
            vectors = np.array([vector_dicts[method][some_id] for some_id in ids if not some_id==identifier])
            mean_vector = vectors.mean(axis=0)
            this_vector = vector_dicts[method][identifier]
            pred_dict[method][identifier][group] = 1-metric_dict[method](mean_vector, this_vector)
            true_dict[method][identifier][group] = (identifier in group_id_to_ids[group])*1                

In [None]:
num_plots, plots_per_row, row_width, row_height = (len(methods), 4, 14, 3)
fig,axs = plt.subplots(math.ceil(num_plots/plots_per_row), plots_per_row, squeeze=False)
for method,ax in zip(methods, axs.flatten()):
    
    # Obtaining the values and metrics.
    # DataFrame.as_matrix() was removed in pandas 1.0; .values returns the same array.
    y_true = pd.DataFrame(true_dict[method]).values.flatten()
    y_prob = pd.DataFrame(pred_dict[method]).values.flatten()
    n_pos, n_neg = Counter(y_true)[1], Counter(y_true)[0]
    precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
    baseline = Counter(y_true)[1]/len(y_true) 
    area = auc(recall, precision)
    auc_to_baseline_auc_ratio = area/baseline
    TABLE[method].update({"mean_auc":area, "mean_baseline":baseline, "mean_ratio":auc_to_baseline_auc_ratio})

    # Producing the precision recall curve.
    step_kwargs = ({'step': 'post'} if 'step' in signature(plt.fill_between).parameters else {})
    ax.step(recall, precision, color='black', alpha=0.2, where='post')
    ax.fill_between(recall, precision, alpha=0.7, color='black', **step_kwargs)
    ax.axhline(baseline, linestyle="--", color="lightgray")
    ax.set_xlabel('Recall')
    ax.set_ylabel('Precision')
    ax.set_ylim([0.0, 1.05])
    ax.set_xlim([0.0, 1.0])
    ax.set_title("PR {0} (Baseline={1:0.3f})".format(method[:10], baseline))
    
fig.set_size_inches(row_width, row_height*math.ceil(num_plots/plots_per_row))
fig.tight_layout()
fig.savefig(os.path.join(OUTPUT_DIR,"part_6_prcurve_mean_classifier.png"),dpi=400)
plt.close()

### Predicting biochemical pathway membership based on mean similarity values
This section looks at how well the biochemical pathways that a particular gene belongs to can be predicted from the average similarity between the vector representation of that gene's phenotype descriptions and each of the vector representations for other phenotypes associated with genes in that pathway. When calculating the average similarity to other genes from a given biochemical pathway, the gene currently being classified is left out, to avoid overestimating performance by including information about the ground truth during classification. This leads to missing information for biochemical pathways that have only one member. This can be handled either by limiting the dataset to genes that belong to pathways with at least two mapped genes, and including only those pathways, or by removing the missing values before calculating the performance metrics below.
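
Since no cell accompanies this section, the following standard-library sketch shows one possible reading of it: score a gene against a pathway by averaging its similarity to every other member, one member at a time. Cosine similarity, the function names, and the toy vectors are all assumptions made for illustration.

```python
from math import sqrt
from statistics import mean


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length sequences of numbers."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def mean_similarity_to_group(vectors, group_ids, identifier):
    """Average similarity between one gene and every other member of a group."""
    sims = [
        cosine_similarity(vectors[identifier], vectors[i])
        for i in group_ids
        if i != identifier
    ]
    return mean(sims) if sims else float("nan")


# Toy embeddings: genes 1-3 form one pathway, genes 4-5 another.
toy_embeddings = {
    1: [1.0, 0.0],
    2: [0.9, 0.1],
    3: [0.8, 0.2],
    4: [0.0, 1.0],
    5: [0.1, 0.9],
}
score_own = mean_similarity_to_group(toy_embeddings, [1, 2, 3], 1)
score_other = mean_similarity_to_group(toy_embeddings, [4, 5], 1)
print(round(score_own, 3), round(score_other, 3))
```

Unlike a comparison against a single mean vector, this score reflects how close the gene is to each member individually, so a pathway made up of several distinct sub-groups is not collapsed into one centroid.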

### Predicting biochemical pathway or group membership with KNN classifier
This section looks at how well the group(s) or biochemical pathway(s) that a particular gene belongs to can be predicted by a KNN classifier built from every other gene. Only the groups or pathways that contain more than one gene, and the genes mapped to them, are of interest here. For a gene that is the sole member of a group, the target vector says the gene belongs to that group, but the KNN classifier could never predict this: once that gene is held out, no remaining gene can appear among the K nearest neighbors and vote for the group.
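
No cell accompanies this section either, so the sketch below shows one way such a leave-one-out KNN vote could work when genes can belong to several groups. Each of the K nearest other genes votes for all of its groups, and a group's score is the fraction of neighbors that voted for it. The function `knn_group_scores` and the toy data are hypothetical.

```python
from collections import Counter


def knn_group_scores(distances, id_to_group_ids, identifier, k):
    """Score each group by the fraction of the k nearest other genes that belong to it."""
    others = [i for i in id_to_group_ids if i != identifier]
    neighbors = sorted(others, key=lambda i: distances[frozenset((identifier, i))])[:k]
    votes = Counter(group for n in neighbors for group in id_to_group_ids[n])
    return {group: count / len(neighbors) for group, count in votes.items()}


# Toy data: five genes, two pathways, gene "c" belongs to both.
toy_groups = {"a": ["P1"], "b": ["P1"], "c": ["P1", "P2"], "d": ["P2"], "e": ["P2"]}
toy_pair_distances = {
    frozenset(("a", "b")): 0.1, frozenset(("a", "c")): 0.2,
    frozenset(("a", "d")): 0.8, frozenset(("a", "e")): 0.9,
    frozenset(("b", "c")): 0.3, frozenset(("b", "d")): 0.7,
    frozenset(("b", "e")): 0.8, frozenset(("c", "d")): 0.4,
    frozenset(("c", "e")): 0.5, frozenset(("d", "e")): 0.1,
}
scores_for_a = knn_group_scores(toy_pair_distances, toy_groups, "a", k=2)
print(scores_for_a)
```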

<a id="output"></a>
### Summarizing the results for this notebook
Write a large table of results to an output file. Columns are generally metrics and rows are generally methods.

In [None]:
results = pd.DataFrame(TABLE).transpose()
columns = flatten(["Hyperparams","Group","Order","Topic","Data",results.columns])
results["Hyperparams"] = ""
results["Group"] = ""
results["Order"] = np.arange(results.shape[0])
results["Topic"] = TOPIC
results["Data"] = DATA
results = results[columns]
results.reset_index(inplace=True)
results = results.rename({"index":"Method"}, axis="columns")
hyperparam_sep = ":"
results["Hyperparams"] = results["Method"].map(lambda x: x.split(hyperparam_sep)[1] if hyperparam_sep in x else "None")
results["Method"] = results["Method"].map(lambda x: x.split(hyperparam_sep)[0])
results.to_csv(os.path.join(OUTPUT_DIR,"part_6_full_table.csv"), index=False)
results