## Table of Contents

- [Introduction](#introduction)

- [Links of Interest](#links)

- [Loading Data](#paths)
    - [Setting input and output paths](#paths)
    - [Reading in a dataset of text descriptions](#read_this_data_)
    - [Reading in a dataset of groups or categories](#read_other_data)
    - [Relating the datasets to one another](#relating)
    - [Filtering the datasets](#filtering)
    
- [NLP Models](#word2vec_doc2vec)
    - [Word2Vec and Doc2Vec](#word2vec_doc2vec)
    - [BERT and BioBERT](#bert_biobert)
    - [Loading models](#load_models)

- [NLP Choices](#part8)
    - [Preprocessing the phenotype descriptions](#preprocessing)
    - [POS Tagging](#pos_tagging)
    - [Reducing the size of the vocabulary](#vocab)
    - [Annotating descriptions using biological ontologies](#annotation)
    
- [Building a Distance Matrix](#matrix)
    - [Defining a list of methods to use](#methods)
    - [Running each method](#run)

- [Analysis]()
    - [Topic modeling](#topic_modeling)
    - [Agglomerative clustering](#clustering)
    - [Adding additional information](#merging)
    - [Combining methods with ensemble approaches](#ensemble)
    - [Comparing distributions of distance values between methods](#ks)
    - [Comparing the within-group distance values across gene groups and methods](#within)
    - [Comparing the AUC for predicting shared pathways, gene groups, or interactions between methods](#auc)
    - [Comparing querying for similar genes using distance matrices for each method](#y)
    - [Comparing the AUC for predicting the specific pathway or group of a gene](#mean)
    - [Generating a table of resulting metrics for each method used](#output)

<a id="introduction"></a>
### Introduction: Text Mining Analysis of Phenotype Descriptions in Plants
The purpose of this notebook is to evaluate what can be learned from a natural language processing approach to analyzing free-text descriptions of phenotype descriptions of plants. The approach is to generate pairwise distances matrices between a set of plant phenotype descriptions across different species, sourced from academic papers and online model organism databases. These pairwise distance matrices can be constructed using any vectorization method that can be applied to natural language. In this notebook, we specifically evaluate the use of n-gram and bag-of-words techniques, word and document embedding using Word2Vec and Doc2Vec, context-dependent word-embeddings using BERT and BioBERT, and ontology term annotations with automated annotation tools such as NOBLE Coder.

Loading, manipulation, and filtering of the dataset of phenotype descriptions associated with genes across different plant species is largely handled through a Python package created for this purpose called OATS (Ontology Annotation and Text Similarity) which is available [here](https://github.com/irbraun/oats). Preprocessing of the descriptions, mapping the dataset to additional resources such as protein-protein interaction databases and biochemical pathway databases are handled in this notebook using that package as well. In the evaluation of each of these natural language processing approaches to analyzing this dataset of descriptions, we compare performance against a dataset generated through manual annotation of a similar dataset in Oellrich Walls et al. (2015) and against manual annotations with experimentally determined terms from the Gene Ontology (PO) and the Plant Ontology (PO).

<a id="links"></a>
### Relevant links of interest:
- Paper describing comparison of NLP and ontology annotation approaches to curation: [Braun, Lawrence-Dill (2019)](https://doi.org/10.3389/fpls.2019.01629)
- Paper describing results of manual phenotype description curation: [Oellrich, Walls et al. (2015](https://plantmethods.biomedcentral.com/articles/10.1186/s13007-015-0053-y)
- Plant databases with phenotype description text data available: [TAIR](https://www.arabidopsis.org/), [SGN](https://solgenomics.net/), [MaizeGDB](https://www.maizegdb.org/)
- Python package for working with phenotype descriptions: [OATS](https://github.com/irbraun/oats)
- Python package used for general NLP functions: [NLTK](https://www.nltk.org/), [Gensim](https://radimrehurek.com/gensim/auto_examples/index.html)
- Python package used for working with biological ontologies: [Pronto](https://pronto.readthedocs.io/en/latest/)
- Python package for loading pretrained BERT models: [PyTorch Pretrained BERT](https://pypi.org/project/pytorch-pretrained-bert/)
- For BERT Models pretrained on PubMed and PMC: [BioBERT Paper](https://arxiv.org/abs/1901.08746), [BioBERT Models](https://github.com/naver/biobert-pretrained)

In [1]:
import datetime
import nltk
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
import pandas as pd
import numpy as np
import time
import math
import sys
import gensim
import os
import warnings
import torch
import itertools
import multiprocessing as mp
from collections import Counter, defaultdict
from inspect import signature
from scipy.stats import ks_2samp
from sklearn.metrics import precision_recall_curve, f1_score, auc
from sklearn.model_selection import train_test_split, KFold
from scipy import spatial, stats
from nltk.corpus import brown
from nltk.tokenize import word_tokenize
from sklearn.neighbors import KNeighborsClassifier
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM
from gensim.parsing.preprocessing import strip_non_alphanum, stem_text, preprocess_string, remove_stopwords
from gensim.utils import simple_preprocess
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.decomposition import LatentDirichletAllocation as LDA

sys.path.append("../../oats")
from oats.utils.utils import save_to_pickle, load_from_pickle, merge_list_dicts, flatten, to_hms
from oats.datasets.dataset import Dataset
from oats.datasets.groupings import Groupings
from oats.annotation.ontology import Ontology
from oats.datasets.string import String
from oats.datasets.edges import Edges
from oats.annotation.annotation import annotate_using_noble_coder
from oats.graphs import pairwise as pw
from oats.graphs.indexed import IndexedGraph
from oats.graphs.weighting import train_logistic_regression_model, apply_logistic_regression_model
from oats.graphs.weighting import train_random_forest_model, apply_random_forest_model
from oats.nlp.vocabulary import get_overrepresented_tokens, get_vocabulary_from_tokens
from oats.nlp.vocabulary import reduce_vocabulary_connected_components, reduce_vocabulary_linares_pontes
from oats.utils.utils import function_wrapper_with_duration
from oats.nlp.preprocess import concatenate_with_bar_delim

from _utils import Method

mpl.rcParams["figure.dpi"] = 400
warnings.simplefilter('ignore')
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
nltk.download('punkt', quiet=True)
nltk.download('brown', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)

True

<a id="paths"></a>
### Setting up the output table and input and output filepaths
This section defines some constants which are used for creating a uniquely named directory to contain all the outputs from running this instance of this notebook. The naming scheme is based on the time that the notebook is run. The other constants are used for specifying information in the output table about what the topic was for this notebook when it was run, such as looking at KEGG biochemical pathways or STRING protein-protein interaction data some other type of gene function grouping or hierarchy. These values are arbitrary and are just for keeping better notes about what the output of the notebook corresponds to. All the input and output file paths for loading datasets or models are also contained within this cell, so that if anything is moved the directories and file names should only have to be changed at this point and nowhere else further into the notebook. If additional files are added to the notebook cells they should be put here as well.

In [2]:
# The summarizing output dictionary has the shape TABLE[method][metric] --> value.
TOPIC = "Biochemical Pathways"
DATA = "Filtered"
TABLE = defaultdict(dict)
OUTPUT_DIR = os.path.join("../outputs",datetime.datetime.now().strftime('%m_%d_%Y_h%Hm%Ms%S'))
os.mkdir(OUTPUT_DIR)

In [3]:
dataset_filename = "../data/pickles/text_plus_annotations_dataset.pickle"        # The full dataset pickle.
groupings_filename = "../data/pickles/pmn_pathways.pickle"                       # The groupings pickle.
background_corpus_filename = "../data/corpus_related_files/untagged_text_corpora/background.txt"       # Text file with background content.
phenotypes_corpus_filename = "../data/corpus_related_files/untagged_text_corpora/phenotypes_all.txt" # Text file with specific content.
doc2vec_pubmed_filename = "../gensim/pubmed_dbow/doc2vec_2.bin"                  # File holding saved Doc2Vec model.
doc2vec_wikipedia_filename = "../gensim/enwiki_dbow/doc2vec.bin"                 # File holding saved Doc2Vec model.
word2vec_model_filename = "../gensim/wiki_sg/word2vec.bin"                       # File holding saved Word2Vec model.
ontology_filename = "../ontologies/mo.obo"                                       # Ontology file in OBO format.
noblecoder_jarfile_path = "../lib/NobleCoder-1.0.jar"                            # Jar for NOBLE Coder tool.
biobert_pmc_path = "../gensim/biobert_v1.0_pmc/pytorch_model"                    # Path for PyTorch BioBERT model.
biobert_pubmed_path = "../gensim/biobert_v1.0_pubmed/pytorch_model"              # Path for PyTorch BioBERT model.
biobert_pubmed_pmc_path = "../gensim/biobert_v1.0_pubmed_pmc/pytorch_model"      # Path for PyTorch BioBERT model.

<a id="read_this_data"></a>
### Reading in the dataset of genes and their associated phenotype descriptions and annotations

In [4]:
dataset = load_from_pickle(dataset_filename)
dataset.describe()
dataset.filter_by_species("ath")
dataset.filter_has_description()
dataset.filter_has_annotation()
dataset.describe()
dataset.filter_has_annotation("GO")
dataset.filter_has_annotation("PO")
dataset.describe()
dataset.to_pandas().head(10)

Number of rows in the dataframe: 30169
Number of unique IDs:            30169
Number of unique descriptions:   4566
Number of unique gene name sets: 30169
Number of species represented:   6
Number of rows in the dataframe: 5615
Number of unique IDs:            5615
Number of unique descriptions:   3378
Number of unique gene name sets: 5615
Number of species represented:   1
Number of rows in the dataframe: 3480
Number of unique IDs:            3480
Number of unique descriptions:   2884
Number of unique gene name sets: 3480
Number of species represented:   1


Unnamed: 0,id,species,gene_names,description,term_ids
0,0,ath,At3g49600|UBP26|AT3G49600|SUP32|ATUBP26|ubiqui...,50% defective seeds. Low penetrance of endospe...,GO:0005730|GO:0048316|PO:0000013|PO:0000037|PO...
1,1,ath,AT1G74380|XXT5|xyloglucan xylosyltransferase 5...,Abnormal roothairs. Reduction in xyloglucan le...,GO:0005794|GO:0048767|GO:0005515|GO:0000139|GO...
2,2,ath,AT1G74450|AT1G74450.1|F1M20.13|F1M20_13,No visible phenotype.,GO:0003674|GO:0008150|PO:0000013|PO:0000037|PO...
3,3,ath,AT1G74560|AT2G03440|NRP1|NAP1-related protein ...,mutants did not show any phenotype under in vi...,GO:0005634|GO:0005829|GO:0046686|GO:0003682|GO...
4,4,ath,AT1G74660|MIF1|mini zinc finger 1|F1M20.34|F1M...,Constitutive overexpression of MIF1 caused dra...,GO:0048509|GO:0045892|GO:0009640|GO:0003677|GO...
5,5,ath,AT1G74730|RIQ2|F25A4.30|F25A4_30,"Reduced NPQ, affected organization of light-ha...",GO:0009535|GO:0009534|GO:0003674|GO:0009507|GO...
6,6,ath,AT1G74740|CPK30|CDPK1A|ATCPK30|calcium-depende...,Embryo lethality of cpk10 cpk30 double mutant ...,GO:0005515|GO:0005886|PO:0000013|PO:0000037|PO...
7,7,ath,AT1G74910|KJC1|KONJAC 1|F25A4.12|F25A4_12|AT1G...,"Reduced levels of GDP-Man. Severe dwarf, small...",GO:0005829|GO:0005777|GO:0046686|PO:0000013|PO...
8,8,ath,AT1G75080|BZR1|BRASSINAZOLE-RESISTANT 1|F9E10....,"Insensitive to brassinazole (BRZ), an inhibito...",GO:0045892|GO:0048481|GO:0003700|GO:0005515|GO...
9,9,ath,AT1G75520|SRS5|SHI-related sequence 5|F1B16.17,18-25% of flowers have homeotic conversion pet...,GO:0048467|PO:0000037|PO:0009009|PO:0009010|PO...


<a id="read_other_data"></a>
### Reading in the dataset of groupings, pathways, or some other kind of categorization

In [5]:
groups = load_from_pickle(groupings_filename)
id_to_group_ids = groups.get_id_to_group_ids_dict(dataset.get_gene_dictionary())
group_id_to_ids = groups.get_group_id_to_ids_dict(dataset.get_gene_dictionary())
group_mapped_ids = [k for (k,v) in id_to_group_ids.items() if len(v)>0]
groups.describe()
groups.to_csv(os.path.join(OUTPUT_DIR,"groupings.csv"))
groups.to_pandas().head(10)

Number of groups present for each species
  ath: 627
  zma: 565
  mtr: 520
  osa: 569
  gmx: 618
  sly: 524
Number of genes names mapped to any group for each species
  ath: 9959
  zma: 14319
  mtr: 14100
  osa: 12156
  gmx: 20677
  sly: 13232


Unnamed: 0,species,pathway_id,pathway_name,gene_names,ec_number
0,ath,PWY-5272,abscisic acid degradation by glucosylation,at1g52400-monomer|abscisic acid glucose ester ...,EC-3.2.1.175
1,ath,PWY-5272,abscisic acid degradation by glucosylation,at4g15550-monomer|abscisate &beta;-glucosyltra...,EC-2.4.1.263
2,ath,PWY-5272,abscisic acid degradation by glucosylation,at4g15260-monomer|abscisate &beta;-glucosyltra...,EC-2.4.1.263
3,ath,PWY-5272,abscisic acid degradation by glucosylation,at3g21790-monomer|abscisate &beta;-glucosyltra...,EC-2.4.1.263
4,ath,PWY-5272,abscisic acid degradation by glucosylation,at3g21760-monomer|abscisate &beta;-glucosyltra...,EC-2.4.1.263
5,ath,PWY-5272,abscisic acid degradation by glucosylation,at2g23210-monomer|abscisate &beta;-glucosyltra...,EC-2.4.1.263
6,ath,PWY-5272,abscisic acid degradation by glucosylation,at1g05530-monomer|abscisic acid glucosyltransf...,EC-2.4.1.263
7,ath,PWY-5272,abscisic acid degradation by glucosylation,at1g05560-monomer|abscisic acid glucosyltransf...,EC-2.4.1.263
8,ath,PWY-5272,abscisic acid degradation by glucosylation,at4g34138-monomer|abscisic acid glucosyltransf...,EC-2.4.1.263
9,ath,PWY-5272,abscisic acid degradation by glucosylation,at2g23250-monomer|abscisic acid glucosyltransf...,EC-2.4.1.263


<a id="relating"></a>
### Relating the dataset of genes to the dataset of groupings or categories
This section generates tables that indicate how the genes present in the dataset were mapped to the defined pathways or groups. This includes a summary table that indicates how many genes by species were succcessfully mapped to atleast one pathway or group, as well as a more detailed table describing how many genes from each species were mapped to each particular pathway or group.

In [6]:
# Generate a table describing how many of the genes input from each species map to atleast one group.
summary = defaultdict(dict)
species_dict = dataset.get_species_dictionary()
for species in dataset.get_species():
    summary[species]["input"] = len([x for x in dataset.get_ids() if species_dict[x]==species])
    summary[species]["mapped"] = len([x for x in group_mapped_ids if species_dict[x]==species])
table = pd.DataFrame(summary).transpose()
table.loc["total"]= table.sum()
table["fraction"] = table.apply(lambda row: "{:0.4f}".format(row["mapped"]/row["input"]), axis=1)
table = table.reset_index(inplace=False)
table = table.rename({"index":"species"}, axis="columns")
table.to_csv(os.path.join(OUTPUT_DIR,"mappings_summary.csv"), index=False)

# Generate a table describing how many genes from each species map to which particular group.
summary = defaultdict(dict)
for group_id,ids in group_id_to_ids.items():
    summary[group_id].update({species:len([x for x in ids if species_dict[x]==species]) for species in dataset.get_species()})
    summary[group_id]["total"] = len([x for x in ids])
table = pd.DataFrame(summary).transpose()
table = table.sort_values(by="total", ascending=False)
table = table.reset_index(inplace=False)
table = table.rename({"index":"pathway_id"}, axis="columns")
table["pathway_name"] = table["pathway_id"].map(groups.get_long_name)
table.loc["total"] = table.sum()
table.loc["total","pathway_id"] = "total"
table.loc["total","pathway_name"] = "total"
table = table[table.columns.tolist()[-1:] + table.columns.tolist()[:-1]]
table.to_csv(os.path.join(OUTPUT_DIR,"mappings_by_group.csv"), index=False)

<a id="filtering"></a>
### Option 1: Filtering the dataset based on presence in the curated Oellrich, Walls et al. (2015) dataset

In [7]:
# Filter the dataset based on whether or not the genes were in the curated dataset.
# This is similar to filtering based on protein interaction data because the dataset is a list of edge values.
pppn_edgelist_path = "../data/supplemental_files_oellrich_walls/13007_2015_53_MOESM9_ESM.txt"
pppn_edgelist = Edges(dataset.get_name_to_id_dictionary(), pppn_edgelist_path)
dataset.filter_with_ids(pppn_edgelist.ids)
dataset.describe()

Number of rows in the dataframe: 1899
Number of unique IDs:            1899
Number of unique descriptions:   1692
Number of unique gene name sets: 1899
Number of species represented:   1


### Option 2: Filtering the dataset based on protein-protein interactions
This is done to only include genes (and the corresponding phenotype descriptions and annotations) which are useful for the current analysis. In this case we want to only retain genes that are mentioned atleast one time in the STRING database for a given species. If a gene is not mentioned at all in STRING, there is no information available for whether or not it interacts with any other proteins in the dataset so choose to not include it in the analysis. Only genes that have atleast one true positive are included because these are the only ones for which the missing information (negatives) is meaningful. This should be run instead of the subsequent cell, or the other way around, based on whether or not protein-protein interactions is the prediction goal for the current analysis.

In [None]:
# Filter the dataset based on whether or not the genes were successfully mapped to an interaction.
# Reduce size of the dataset by removing genes not mentioned in the STRING.
naming_file = "../data/group_related_files/string/all_organisms.name_2_string.tsv"
interaction_files = [
    "../data/group_related_files/string/3702.protein.links.detailed.v11.0.txt", # Arabidopsis thaliana
    "../data/group_related_files/string/4577.protein.links.detailed.v11.0.txt", # maize
    "../data/group_related_files/string/4530.protein.links.detailed.v11.0.txt", # tomato 
    "../data/group_related_files/string/4081.protein.links.detailed.v11.0.txt", # medicago
    "../data/group_related_files/string/3880.protein.links.detailed.v11.0.txt", # rice 
    "../data/group_related_files/string/3847.protein.links.detailed.v11.0.txt", # soybean
]
genes = dataset.get_gene_dictionary()
string_data = String(genes, naming_file, *interaction_files)
dataset.filter_with_ids(string_data.ids)
dataset.describe()

### Option 3: Filtering the dataset based on membership in pathways or phenotype category
This is done to only include genes (and the corresponding phenotype descriptions and annotations) which are useful for the current analysis. In this case we want to only retain genes that are mapped to atleast one pathway in whatever the source of pathway membership we are using is (KEGG, Plant Metabolic Network, etc). This is because for these genes, it will be impossible to correctly predict their pathway membership, and we have no evidence that they belong or do not belong in certain pathways so they can not be identified as being true or false negatives in any case.

In [8]:
# Filter based on succcessful mappings to groups or pathways.
dataset.filter_with_ids(group_mapped_ids)
dataset.describe()
# Get the mappings in each direction again now that the dataset has been subset.
id_to_group_ids = groups.get_id_to_group_ids_dict(dataset.get_gene_dictionary())
group_id_to_ids = groups.get_group_id_to_ids_dict(dataset.get_gene_dictionary())

Number of rows in the dataframe: 460
Number of unique IDs:            460
Number of unique descriptions:   433
Number of unique gene name sets: 460
Number of species represented:   1


<a id="word2vec_doc2vec"></a>
### Word2Vec and Doc2Vec
Word2Vec is a word embedding technique using a neural network trained on a so-called *false task*, namely either predicting a missing word from within a sequence of context words drawn from a sentence or phrase, or predicting which contexts words surround some given input word drawn from a sentence or phrase. Each of these tasks are supervised (the correct answer is fixed and known), but can be generated from unlabelled text data such as a collection of books or wikipedia articles, meaning that even though the task itself is supervised the training data can be generated automatically, enabling the creation of enormous training sets. The internal representation for particular words learned during the training process contain semantically informative features related to that given word, and can therefore be used as embeddings used downstream for tasks such as finding similarity between words or as input into additional models. Doc2Vec is an extension of this technique that determines vector embeddings for entire documents (strings containing multiple words, could be sentences, paragraphs, or documents).


<a id="bert_biobert"></a>
### BERT and BioBERT
BERT ('Bidirectional Encoder Representations from Transformers') is another neueral network-based model trained on two different false tasks, namely predicting the subsequent sentences given some input sentence, or predicting the identity of a set of words masked from an input sentence. Like Word2Vec, this architecture can be used to generate vector embeddings for a particular input word by extracting values from a subset of the encoder layer based that correspond to that input word. Practically, a major difference is that because the input word is input in the context of its surrounding sentence, the embedding reflects the meaning of a particular word in a particular context (such as the difference in the meaning of *root* in the phrases *plant root* and *root of the problem*. BioBERT refers to a set of BERT models which have been finetuned on the PubMed and PMC corpora. See the list of relevant links for the publications and pages associated with these models.

<a id="load_models"></a>
### Loading trained and saved models
Versions of the architectures discussed above which have been saved as trained models are loaded here. Some of these models are loaded as pretrained models from the work of other groups, and some were trained on data specific to this notebook and loaded here.

In [52]:
# Files and models related to the machine learning text embedding methods used here.
doc2vec_wiki_model = gensim.models.Doc2Vec.load(doc2vec_wikipedia_filename)
doc2vec_pubmed_model = gensim.models.Doc2Vec.load(doc2vec_pubmed_filename)
word2vec_model = gensim.models.Word2Vec.load(word2vec_model_filename)
bert_tokenizer_base = BertTokenizer.from_pretrained('bert-base-uncased')
bert_tokenizer_pmc = BertTokenizer.from_pretrained(biobert_pmc_path)
bert_tokenizer_pubmed = BertTokenizer.from_pretrained(biobert_pubmed_path)
bert_tokenizer_pubmed_pmc = BertTokenizer.from_pretrained(biobert_pubmed_pmc_path)
bert_model_base = BertModel.from_pretrained('bert-base-uncased')
bert_model_pmc = BertModel.from_pretrained(biobert_pmc_path)
bert_model_pubmed = BertModel.from_pretrained(biobert_pubmed_path)
bert_model_pubmed_pmc = BertModel.from_pretrained(biobert_pubmed_pmc_path)

<a id="preprocessing"></a>
### Preprocessing text descriptions
The preprocessing methods applied to the phenotype descriptions are a choice which impacts the subsequent vectorization and similarity methods which construct the pairwise distance matrix from each of these descriptions. The preprocessing methods that make sense are also highly dependent on the vectorization method or embedding method that is to be applied. For example, stemming (which is part of the full proprocessing done below using the `gensim.preprocess_string` function) is useful for the n-grams and bag-of-words methods but not for the document embeddings methods which need each token to be in the vocabulary that was constructed and used when the model was trained. For this reason, embedding methods with pretrained models where the vocabulary is fixed should have a lighter degree of preprocessing not involving stemming or lemmatization but should involve things like removal of non-alphanumerics and normalizing case. 

In [10]:
# Obtain a mapping between IDs and the raw text descriptions associated with that ID from the dataset.
descriptions = dataset.get_description_dictionary()

# Preprocessing of the text descriptions. Different methods are necessary for different approaches.
descriptions_full_preprocessing = {i:" ".join(preprocess_string(d)) for i,d in descriptions.items()}
descriptions_simple_preprocessing = {i:" ".join(simple_preprocess(d)) for i,d in descriptions.items()}
descriptions_no_stopwords = {i:remove_stopwords(d) for i,d in descriptions.items()}

<a id="pos_tagging"></a>
### POS tagging the phenotype descriptions for nouns and adjectives
Note that preprocessing of the descriptions should be done after part-of-speech tagging, because tokens that are removed during preprocessing before n-gram analysis contain information that the parser needs to accurately call parts-of-speech. This step should be done on the raw descriptions and then the resulting bags of words can be subset using additional preprocesssing steps before input in one of the vectorization methods.

In [11]:
get_pos_tokens = lambda text,pos: " ".join([t[0] for t in nltk.pos_tag(word_tokenize(text)) if t[1].lower()==pos.lower()])
descriptions_noun_only =  {i:get_pos_tokens(d,"NN") for i,d in descriptions.items()}
descriptions_noun_only_full_preprocessing = {i:" ".join(preprocess_string(d)) for i,d in descriptions_noun_only.items()}
descriptions_noun_only_simple_preprocessing = {i:" ".join(simple_preprocess(d)) for i,d in descriptions_noun_only.items()}
descriptions_adj_only =  {i:get_pos_tokens(d,"JJ") for i,d in descriptions.items()}
descriptions_adj_only_full_preprocessing = {i:" ".join(preprocess_string(d)) for i,d in descriptions_adj_only.items()}
descriptions_adj_only_simple_preprocessing = {i:" ".join(simple_preprocess(d)) for i,d in descriptions_adj_only.items()}
descriptions_noun_adj = {i:"{} {}".format(descriptions_noun_only[i],descriptions_adj_only[i]) for i in descriptions.keys()}
descriptions_noun_adj_full_preprocessing = {i:"{} {}".format(descriptions_noun_only_full_preprocessing[i],descriptions_adj_only_full_preprocessing[i]) for i in descriptions.keys()}
descriptions_noun_adj_simple_preprocessing = {i:"{} {}".format(descriptions_noun_only_simple_preprocessing[i],descriptions_adj_only_simple_preprocessing[i]) for i in descriptions.keys()}

<a id="vocab"></a>
### Reducing the vocabulary size using a word distance matrix
These approaches for reducing the vocabulary size of the dataset work by replacing multiple words that occur throughout the dataset of descriptions with an identical word that is representative of this larger group of words. The total number of unique words across all descriptions is therefore reduced, and when observing n-gram overlaps between vector representations of these descriptions, overlaps will now occur between descriptions that included different but similar words. These methods work by actually generating versions of these descriptions that have the word replacements present. The returned objects for these methods are the revised description dictionary, a dictionary mapping tokens in the full vocabulary to tokens in the reduced vocabulary, and a dictionary mapping tokens in the reduced vocabulary to a list of tokens in the full vocabulary.

In [36]:
# Reducing the size of the vocabulary for descriptions treated with simple preprocessing.
tokens = list(set([w for w in flatten(d.split() for d in descriptions_simple_preprocessing.values())]))
tokens_dict = {i:w for i,w in enumerate(tokens)}
graph = pw.pairwise_square_word2vec(word2vec_model, tokens_dict, "cosine")

# Make sure that the tokens list is in the same order as the indices representing each word in the distance matrix.
# This is only trivial here because the IDs used are ordered integers 0 to n, but this might not always be the case.
distance_matrix = graph.array
tokens = [tokens_dict[graph.row_index_to_id[index]] for index in np.arange(distance_matrix.shape[0])]
n = 3
threshold = 0.2
descriptions_linares_pontes, reduce_lp, unreduce_lp = reduce_vocabulary_linares_pontes(descriptions_simple_preprocessing, tokens, distance_matrix, n)
descriptions_connected_components, reduce_cc, unreduce_cc = reduce_vocabulary_connected_components(descriptions_simple_preprocessing, tokens, distance_matrix, threshold)

### Reducing vocabulary size based on identifying important words
These approcahes for reducing the vocabulary size of the dataset work by identifying which words in the descriptions are likely to be the most important for identifying differences between the phenotypes and meaning of the descriptions. One approach is to determine which words occur at a higher rate in text of interest such as articles about plant phenotypes as compared to their rates in more general texts such as a corpus of news articles. These approaches do not create modified versions of the descriptions but rather provide vocabulary objects that can be passed to the sklearn vectorizer or constructors.

In [21]:
# Constructing a vocabulary by looking at what words are overrepresented in domain specific text.
background_corpus = open(background_corpus_filename,"r").read()
phenotypes_corpus = open(phenotypes_corpus_filename,"r").read()
tokens = get_overrepresented_tokens(phenotypes_corpus, background_corpus, max_features=5000)
vocabulary_from_text = get_vocabulary_from_tokens(tokens)

# Constructing a vocabulary by assuming all words present in a given ontology are important.
ontology = Ontology(ontology_filename)
vocabulary_from_ontology = get_vocabulary_from_tokens(ontology.get_tokens())

<a id="annotation"></a>
### Annotating descriptions with ontology terms
This section generates dictionaries that map gene IDs from the dataset to lists of strings, where those strings are ontology term IDs. How the term IDs are found for each gene entry with its corresponding phenotype description depends on the cell below. Firstly, the terms are found by using the NOBLE Coder annotation tool through these wrapper functions to identify the terms by looking for instances of the term's label or synonyms in the actual text of the phenotype descriptions. Secondly, the next cell just draws the terms directly from the dataset itself. In this case, these are high-confidence annotations done by curators for a comparison against what can be accomplished through computational analysis of the text.

In [22]:
# Run the ontology term annotators over the raw input text descriptions. NOBLE-Coder handles simple issues like case
# normalization so preprocessed descriptions are not used for this step.
ontology = Ontology(ontology_filename)
annotations_noblecoder_precise = annotate_using_noble_coder(descriptions, noblecoder_jarfile_path, "mo", precise=1)
annotations_noblecoder_partial = annotate_using_noble_coder(descriptions, noblecoder_jarfile_path, "mo", precise=0)

In [23]:
# Get the ID to term list annotation dictionaries for each ontology in the dataset.
annotations = dataset.get_annotations_dictionary()
go_annotations = {k:[term for term in v if term[0:2]=="GO"] for k,v in annotations.items()}
po_annotations = {k:[term for term in v if term[0:2]=="PO"] for k,v in annotations.items()}

<a id="matrix"></a>
<a id="methods"></a>
<a id="run"></a>
### Generating vector representations and pairwise distances matrices
This section uses the text descriptions, preprocessed text descriptions, or ontology term annotations created or read in the previous sections to generate a vector representation for each gene and build a pairwise distance matrix for the whole dataset. Each method specified is a unique combination of a method of vectorization (bag-of-words, n-grams, document embedding model, etc) and distance metric (Euclidean, Jaccard, cosine, etc) applied to those vectors in constructing the pairwise matrix. The method of vectorization here is equivalent to feature selection, so the task is to figure out which type of vectors will encode features that are useful (n-grams, full words, only words from a certain vocabulary, etc).

In [24]:
# Define a list of different methods for calculating distance between text descriptions using the Methods object 
# defined in the utilities for this notebook. The constructor takes a string for the method name, a string defining
# the hyperparameter choices for that method, a function to be called to run this method, a dictionary of arguments
# by keyword that should be passed to that function, and a distance metric from scipy.spatial.distance to associate
# with this method.

methods = [

    
    # Methods that use neural networks to generate embeddings.
    Method("Doc2Vec Wikipedia", "Size=300", pw.pairwise_square_doc2vec, {"model":doc2vec_wiki_model, "ids_to_texts":descriptions, "metric":"cosine"}, spatial.distance.cosine),
    Method("Doc2Vec PubMed", "Size=100", pw.pairwise_square_doc2vec, {"model":doc2vec_pubmed_model, "ids_to_texts":descriptions, "metric":"cosine"}, spatial.distance.cosine),
    Method("Word2Vec Wikipedia", "Size=300,Mean", pw.pairwise_square_word2vec, {"model":word2vec_model, "ids_to_texts":descriptions, "metric":"cosine", "method":"mean"}, spatial.distance.cosine),
    Method("Word2Vec Wikipedia", "Size=300,Max", pw.pairwise_square_word2vec, {"model":word2vec_model, "ids_to_texts":descriptions, "metric":"cosine", "method":"max"}, spatial.distance.cosine),
    #Method("BERT", "Base:Layers=2,Concatenated", pw.pairwise_square_bert, {"model":bert_model_base, "tokenizer":bert_tokenizer_base, "ids_to_texts":descriptions, "metric":"cosine", "method":"concat", "layers":2}, spatial.distance.cosine),
    #Method("BERT", " Base:Layers=3,Concatenated", pw.pairwise_square_bert, {"model":bert_model_base, "tokenizer":bert_tokenizer_base, "ids_to_texts":descriptions, "metric":"cosine", "method":"concat", "layers":3}, spatial.distance.cosine),
    #Method("BERT", " Base:Layers=4,Concatenated", pw.pairwise_square_bert, {"model":bert_model_base, "tokenizer":bert_tokenizer_base, "ids_to_texts":descriptions, "metric":"cosine", "method":"concat", "layers":4}, spatial.distance.cosine),
    #Method("BERT", " Base:Layers=2,Summed", pw.pairwise_square_bert, {"model":bert_model_base, "tokenizer":bert_tokenizer_base, "ids_to_texts":descriptions, "metric":"cosine", "method":"sum", "layers":2}, spatial.distance.cosine),
    #Method("BERT", " Base:Layers=3,Summed", pw.pairwise_square_bert, {"model":bert_model_base, "tokenizer":bert_tokenizer_base, "ids_to_texts":descriptions, "metric":"cosine", "method":"sum", "layers":3}, spatial.distance.cosine),
    #Method("BERT", " Base:Layers=4,Summed", pw.pairwise_square_bert, {"model":bert_model_base, "tokenizer":bert_tokenizer_base, "ids_to_texts":descriptions, "metric":"cosine", "method":"sum", "layers":4}, spatial.distance.cosine),
    #Method("BioBERT", "PMC,Layers=2,Concatenated", pw.pairwise_square_bert, {"model":bert_model_pmc, "tokenizer":bert_tokenizer_pmc, "ids_to_texts":descriptions, "metric":"cosine", "method":"concat", "layers":2}, spatial.distance.cosine),
    #Method("BioBERT", "PMC,Layers=3,Concatenated", pw.pairwise_square_bert, {"model":bert_model_pmc, "tokenizer":bert_tokenizer_pmc, "ids_to_texts":descriptions, "metric":"cosine", "method":"concat", "layers":3}, spatial.distance.cosine),
    #Method("BioBERT", "PMC,Layers=4,Concatenated", pw.pairwise_square_bert, {"model":bert_model_pmc, "tokenizer":bert_tokenizer_pmc, "ids_to_texts":descriptions, "metric":"cosine", "method":"concat", "layers":4}, spatial.distance.cosine),
    #Method("BioBERT", "PubMed,Layers=4,Concatenated", pw.pairwise_square_bert, {"model":bert_model_pubmed, "tokenizer":bert_tokenizer_pubmed, "ids_to_texts":descriptions, "metric":"cosine", "method":"concat", "layers":4}, spatial.distance.cosine),
    #Method("BioBERT", "PubMed,PMC,Layers=4,Concatenated", pw.pairwise_square_bert, {"model":bert_model_pubmed_pmc, "tokenizer":bert_tokenizer_pubmed_pmc, "ids_to_texts":descriptions, "metric":"cosine", "method":"concat", "layers":4}, spatial.distance.cosine),
        
    # Methods that use variations on the n-grams approach with full preprocessing (includes stemming).
    Method("N-Grams", "Full,Words,1-grams,2-grams", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_full_preprocessing, "metric":"cosine", "binary":False, "analyzer":"word", "ngram_range":(1,2),"max_features":10000, "tfidf":False}, spatial.distance.cosine),
    Method("N-Grams", "Full,Words,1-grams,2-grams,Binary", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_full_preprocessing, "metric":"jaccard", "binary":True, "analyzer":"word", "ngram_range":(1,2), "max_features":10000, "tfidf":False}, spatial.distance.jaccard),
    Method("N-Grams", "Full,Words,1-grams", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_full_preprocessing, "metric":"cosine", "binary":False, "analyzer":"word", "ngram_range":(1,1), "max_features":10000, "tfidf":False}, spatial.distance.cosine),
    Method("N-Grams", "Full,Words,1-grams,Binary", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_full_preprocessing, "metric":"jaccard", "binary":True, "analyzer":"word", "ngram_range":(1,1), "max_features":10000, "tfidf":False}, spatial.distance.jaccard),
    Method("N-Grams", "Full,Words,1-grams,2-grams,TFIDF", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_full_preprocessing, "metric":"cosine", "binary":False, "analyzer":"word", "ngram_range":(1,2),"max_features":10000, "tfidf":True}, spatial.distance.cosine),
    Method("N-Grams", "Full,Words,1-grams,2-grams,Binary,TFIDF", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_full_preprocessing, "metric":"cosine", "binary":True, "analyzer":"word", "ngram_range":(1,2), "max_features":10000, "tfidf":True}, spatial.distance.cosine),
    Method("N-Grams", "Full,Words,1-grams,TFIDF", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_full_preprocessing, "metric":"cosine", "binary":False, "analyzer":"word", "ngram_range":(1,1), "max_features":10000, "tfidf":True}, spatial.distance.cosine),
    Method("N-Grams", "Full,Words,1-grams,Binary,TFIDF", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_full_preprocessing, "metric":"cosine", "binary":True, "analyzer":"word", "ngram_range":(1,1), "max_features":10000, "tfidf":True}, spatial.distance.cosine),
    
    # Methods that use variations on the n-grams approach with simple preprocessing (no stemming).
    Method("N-Grams", "Simple,Words,1-grams,2-grams", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_simple_preprocessing, "metric":"cosine", "binary":False, "analyzer":"word", "ngram_range":(1,2),"max_features":10000, "tfidf":False}, spatial.distance.cosine),
    Method("N-Grams", "Simple,Words,1-grams,2-grams,Binary", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_simple_preprocessing, "metric":"jaccard", "binary":True, "analyzer":"word", "ngram_range":(1,2), "max_features":10000, "tfidf":False}, spatial.distance.jaccard),
    Method("N-Grams", "Simple,Words,1-grams", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_simple_preprocessing, "metric":"cosine", "binary":False, "analyzer":"word", "ngram_range":(1,1), "max_features":10000, "tfidf":False}, spatial.distance.cosine),
    Method("N-Grams", "Simple,Words,1-grams,Binary", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_simple_preprocessing, "metric":"jaccard", "binary":True, "analyzer":"word", "ngram_range":(1,1), "max_features":10000, "tfidf":False}, spatial.distance.jaccard),
    Method("N-Grams", "Simple,Words,1-grams,2-grams,TFIDF", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_simple_preprocessing, "metric":"cosine", "binary":False, "analyzer":"word", "ngram_range":(1,2),"max_features":10000, "tfidf":True}, spatial.distance.cosine),
    Method("N-Grams", "Simple,Words,1-grams,2-grams,Binary,TFIDF", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_simple_preprocessing, "metric":"cosine", "binary":True, "analyzer":"word", "ngram_range":(1,2), "max_features":10000, "tfidf":True}, spatial.distance.cosine),
    Method("N-Grams", "Simple,Words,1-grams,TFIDF", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_simple_preprocessing, "metric":"cosine", "binary":False, "analyzer":"word", "ngram_range":(1,1), "max_features":10000, "tfidf":True}, spatial.distance.cosine),
    Method("N-Grams", "Simple,Words,1-grams,Binary,TFIDF", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_simple_preprocessing, "metric":"cosine", "binary":True, "analyzer":"word", "ngram_range":(1,1), "max_features":10000, "tfidf":True}, spatial.distance.cosine),
    
    # Methods that use variations on the n-grams approach selecting for specific parts-of-speech.
    Method("N-Grams", "Full,Nouns,1-grams", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_noun_only_full_preprocessing, "metric":"cosine", "binary":False, "analyzer":"word", "ngram_range":(1,1), "max_features":10000, "tfidf":False}, spatial.distance.cosine),
    Method("N-Grams", "Full,Nouns,1-grams,Binary", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_noun_only_full_preprocessing, "metric":"jaccard", "binary":True, "analyzer":"word", "ngram_range":(1,1), "max_features":10000, "tfidf":False}, spatial.distance.jaccard),
    Method("N-Grams", "Full,Nouns,1-grams,TFIDF", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_noun_only_full_preprocessing, "metric":"cosine", "binary":False, "analyzer":"word", "ngram_range":(1,1), "max_features":10000, "tfidf":True}, spatial.distance.cosine),
    Method("N-Grams", "Full,Nouns,1-grams,Binary,TFIDF", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_noun_only_full_preprocessing, "metric":"cosine", "binary":True, "analyzer":"word", "ngram_range":(1,1), "max_features":10000, "tfidf":True}, spatial.distance.cosine),
    Method("N-Grams", "Full,Adjectives,1-grams", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_adj_only_full_preprocessing, "metric":"cosine", "binary":False, "analyzer":"word", "ngram_range":(1,1), "max_features":10000, "tfidf":False}, spatial.distance.cosine),
    Method("N-Grams", "Full,Adjectives,1-grams,Binary", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_adj_only_full_preprocessing, "metric":"jaccard", "binary":True, "analyzer":"word", "ngram_range":(1,1), "max_features":10000, "tfidf":False}, spatial.distance.jaccard),
    Method("N-Grams", "Full,Adjectives,1-grams,TFIDF", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_adj_only_full_preprocessing, "metric":"cosine", "binary":False, "analyzer":"word", "ngram_range":(1,1), "max_features":10000, "tfidf":True}, spatial.distance.cosine),
    Method("N-Grams", "Full,Adjectives,1-grams,Binary,TFIDF", pw.pairwise_square_ngrams, {"ids_to_texts":descriptions_adj_only_full_preprocessing, "metric":"cosine", "binary":True, "analyzer":"word", "ngram_range":(1,1), "max_features":10000, "tfidf":True}, spatial.distance.cosine),
    
    
    # Methods that use terms inferred from automated annotation of the text.
    Method("NOBLE Coder", "Precise", pw.pairwise_square_annotations, {"ids_to_annotations":annotations_noblecoder_precise, "ontology":ontology, "binary":True, "metric":"jaccard", "tfidf":False}, spatial.distance.jaccard),
    Method("NOBLE Coder", "Partial", pw.pairwise_square_annotations, {"ids_to_annotations":annotations_noblecoder_partial, "ontology":ontology, "binary":True, "metric":"jaccard", "tfidf":False}, spatial.distance.jaccard),
    Method("NOBLE Coder", "Precise,TFIDF", pw.pairwise_square_annotations, {"ids_to_annotations":annotations_noblecoder_precise, "ontology":ontology, "binary":True, "metric":"cosine", "tfidf":True}, spatial.distance.cosine),
    Method("NOBLE Coder", "Partial,TFIDF", pw.pairwise_square_annotations, {"ids_to_annotations":annotations_noblecoder_partial, "ontology":ontology, "binary":True, "metric":"cosine", "tfidf":True}, spatial.distance.cosine),
    
    # Methods that use terms assigned by humans that are present in the dataset.
    Method("GO", "Default", pw.pairwise_square_annotations, {"ids_to_annotations":go_annotations, "ontology":ontology, "metric":"jaccard", "binary":True, "analyzer":"word", "ngram_range":(1,1), "tfidf":False}, spatial.distance.jaccard),
    Method("PO", "Default", pw.pairwise_square_annotations, {"ids_to_annotations":po_annotations, "ontology":ontology, "metric":"jaccard", "binary":True, "analyzer":"word", "ngram_range":(1,1), "tfidf":False}, spatial.distance.jaccard),


]

In [None]:
# Generate all of the pairwise distance matrices in parallel.
#start_time_mp = time.perf_counter()
#pool = mp.Pool(mp.cpu_count())
#results = [pool.apply_async(function_wrapper_with_duration, args=(method.function, method.kwargs)) for method in methods]
#results = [result.get() for result in results]
#graphs = {method.name_with_hyperparameters:result[0] for method,result in zip(methods,results)}
#metric_dict = {method.name_with_hyperparameters:method.metric for tup in methods}
#durations = {method.name_with_hyperparameters:result[1] for method,result in zip(methods,results)}
#pool.close()
#pool.join()    
#total_time_mp = time.perf_counter()-start_time_mp

# Reporting how long each matrix took to build and how much time parallel processing saved.
#print("Durations of generating each pairwise similarity matrix (hh:mm:ss)")
#print("-----------------------------------------------------------------")
#savings = total_time_mp/sum(durations.values())
#for (name,duration) in durations.items():
#    print("{:50} {}".format(name, to_hms(duration)))
#print("-----------------------------------------------------------------")
#print("{:15} {}".format("total", to_hms(sum(durations.values()))))
#print("{:15} {} ({:.2%} of single thread time)".format("multiprocess", to_hms(total_time_mp), savings))

In [25]:
# Generate all the pairwise distance matrices (not in parallel).
graphs = {}
names = []
durations = []
for method in methods:
    graph,duration = function_wrapper_with_duration(function=method.function, args=method.kwargs)
    graphs[method.name_with_hyperparameters] = graph
    names.append(method.name_with_hyperparameters)
    durations.append(to_hms(duration))
    print("{:50} {}".format(method.name_with_hyperparameters,to_hms(duration)))
durations_df = pd.DataFrame({"method":names,"duration":durations})
durations_df.to_csv(os.path.join(OUTPUT_DIR,"durations.csv"), index=False)

Doc2Vec Wikipedia:Size=300                         00:00:01
Doc2Vec PubMed:Size=100                            00:00:00
Word2Vec Wikipedia:Size=300,Mean                   00:00:00
Word2Vec Wikipedia:Size=300,Max                    00:00:00
N-Grams:Full,Words,1-grams,2-grams                 00:00:01
N-Grams:Full,Words,1-grams,2-grams,Binary          00:00:00
N-Grams:Full,Words,1-grams                         00:00:00
N-Grams:Full,Words,1-grams,Binary                  00:00:00
N-Grams:Full,Words,1-grams,2-grams,TFIDF           00:00:01
N-Grams:Full,Words,1-grams,2-grams,Binary,TFIDF    00:00:01
N-Grams:Full,Words,1-grams,TFIDF                   00:00:00
N-Grams:Full,Words,1-grams,Binary,TFIDF            00:00:00
N-Grams:Simple,Words,1-grams,2-grams               00:00:01
N-Grams:Simple,Words,1-grams,2-grams,Binary        00:00:00
N-Grams:Simple,Words,1-grams                       00:00:00
N-Grams:Simple,Words,1-grams,Binary                00:00:00
N-Grams:Simple,Words,1-grams,2-grams,TFI

In [26]:
# Merging all of the edgelist dataframes together.
metric_dict = {method.name_with_hyperparameters:method.metric for method in methods}
methods = list(graphs.keys())
edgelists = {k:v.edgelist for k,v in graphs.items()}
df = pw.merge_edgelists(edgelists, default_value=1.000)
df = pw.remove_self_loops(df)
df.head(10)

Unnamed: 0,from,to,Doc2Vec Wikipedia:Size=300,Doc2Vec PubMed:Size=100,"Word2Vec Wikipedia:Size=300,Mean","Word2Vec Wikipedia:Size=300,Max","N-Grams:Full,Words,1-grams,2-grams","N-Grams:Full,Words,1-grams,2-grams,Binary","N-Grams:Full,Words,1-grams","N-Grams:Full,Words,1-grams,Binary","N-Grams:Full,Words,1-grams,2-grams,TFIDF","N-Grams:Full,Words,1-grams,2-grams,Binary,TFIDF","N-Grams:Full,Words,1-grams,TFIDF","N-Grams:Full,Words,1-grams,Binary,TFIDF","N-Grams:Simple,Words,1-grams,2-grams","N-Grams:Simple,Words,1-grams,2-grams,Binary","N-Grams:Simple,Words,1-grams","N-Grams:Simple,Words,1-grams,Binary","N-Grams:Simple,Words,1-grams,2-grams,TFIDF","N-Grams:Simple,Words,1-grams,2-grams,Binary,TFIDF","N-Grams:Simple,Words,1-grams,TFIDF","N-Grams:Simple,Words,1-grams,Binary,TFIDF","N-Grams:Full,Nouns,1-grams","N-Grams:Full,Nouns,1-grams,Binary","N-Grams:Full,Nouns,1-grams,TFIDF","N-Grams:Full,Nouns,1-grams,Binary,TFIDF","N-Grams:Full,Adjectives,1-grams","N-Grams:Full,Adjectives,1-grams,Binary","N-Grams:Full,Adjectives,1-grams,TFIDF","N-Grams:Full,Adjectives,1-grams,Binary,TFIDF",NOBLE Coder:Precise,NOBLE Coder:Partial,"NOBLE Coder:Precise,TFIDF","NOBLE Coder:Partial,TFIDF",GO:Default,PO:Default
1,436,449,0.440386,0.494182,0.093846,0.06014,0.862773,0.965602,0.804568,0.926554,0.942081,0.975838,0.898835,0.938549,0.617404,0.942913,0.510145,0.897674,0.883654,0.957547,0.809063,0.922304,0.841055,0.931507,0.879842,0.914691,1.0,1.0,1.0,1.0,0.767677,0.737643,0.741439,0.781159,0.9375,0.311594
2,436,451,0.448607,0.690981,0.144937,0.080291,0.90307,0.97482,0.867954,0.956204,0.95714,0.974107,0.932699,0.958877,0.73613,0.970899,0.656009,0.948864,0.941671,0.979164,0.899777,0.973427,0.889145,0.980392,0.968725,0.988432,1.0,1.0,1.0,1.0,0.78022,0.680233,0.763636,0.753923,0.941176,0.37037
3,436,452,0.556068,0.363119,0.19359,0.097068,0.976664,0.992832,0.964543,0.984496,0.995502,0.9963,0.99101,0.989247,0.777657,0.97832,0.690152,0.957317,0.962876,0.991755,0.925586,0.979721,0.974076,0.979167,0.986603,0.976018,1.0,1.0,1.0,1.0,0.836735,0.723958,0.862402,0.826559,0.941176,0.518797
4,436,453,0.507984,0.403257,0.170496,0.101583,0.844926,0.960784,0.818021,0.932203,0.956981,0.965118,0.935118,0.922609,0.716976,0.947059,0.652161,0.929032,0.912731,0.942427,0.876575,0.924853,0.820395,0.93617,0.920817,0.919669,0.52371,0.888889,0.815876,0.890583,0.922078,0.703704,0.94655,0.776442,0.933333,0.253333
5,436,454,0.581529,0.318814,0.155409,0.084839,0.926707,0.964286,0.913579,0.934783,0.979253,0.977879,0.967027,0.945618,0.684539,0.95122,0.591172,0.916667,0.935327,0.972594,0.881867,0.94137,0.924551,0.962264,0.967047,0.974057,0.804471,0.878788,0.921611,0.901661,0.855422,0.742424,0.867605,0.848427,0.928571,0.492958
6,436,455,0.458948,0.392024,0.073533,0.0523,0.869882,0.958656,0.827304,0.923497,0.962644,0.978284,0.936123,0.954789,0.456489,0.943074,0.344353,0.905738,0.854125,0.971174,0.757046,0.950967,0.925201,0.958904,0.961622,0.970323,0.908253,0.961538,0.968077,0.973008,0.787879,0.697368,0.819515,0.781505,0.941176,0.264516
7,436,465,0.427002,0.347636,0.160076,0.149803,0.878522,0.987395,0.796503,0.972973,0.928492,0.966779,0.844651,0.915064,0.819211,0.983819,0.757494,0.971631,0.934452,0.951292,0.891421,0.915231,0.926676,0.97619,0.947854,0.926807,0.764298,0.925926,0.748064,0.793501,0.766667,0.86,0.618308,0.728861,0.9375,0.676471
8,436,467,0.30361,0.443597,0.048605,0.054396,0.730182,0.861314,0.636714,0.766129,0.855822,0.818548,0.774088,0.679914,0.638961,0.869318,0.530892,0.779874,0.857142,0.828731,0.766311,0.724091,0.775054,0.784314,0.841257,0.689945,0.739421,0.828571,0.835437,0.773912,0.571429,0.449541,0.464786,0.471749,0.90625,0.022059
9,436,472,0.377933,0.339621,0.049062,0.051974,0.783986,0.894231,0.703857,0.838926,0.879991,0.86835,0.798777,0.784072,0.632617,0.885366,0.51423,0.813472,0.867904,0.869251,0.767799,0.795549,0.634703,0.785714,0.762621,0.720551,0.849384,0.9,0.830283,0.827512,0.722222,0.616279,0.567129,0.604425,0.892857,0.05
10,436,474,0.314378,0.565077,0.074955,0.051165,0.686499,0.850299,0.56664,0.735714,0.779869,0.80102,0.642262,0.642151,0.51899,0.841379,0.386013,0.736559,0.776646,0.792694,0.636662,0.659487,0.662821,0.719298,0.701525,0.619057,0.601473,0.837838,0.582568,0.757573,0.425,0.651786,0.485715,0.613898,0.909091,0.0


<a id="topic_modeling"></a>
### Approach 1: Topic modeling based on n-grams with a reduced vocabulary

In [37]:
# Get a list of texts to create a topic model from, from one of the processed description dictionaries above. 
texts = [description for i,description in descriptions_linares_pontes.items()]

# Creating and fitting the topic model, either NFM or LDA.
number_of_topics = 50
seed = 0
vectorizer = TfidfVectorizer(max_features=10000, stop_words="english", max_df=0.95, min_df=2, lowercase=False)
features = vectorizer.fit_transform(texts)
cls = NMF(n_components=number_of_topics, random_state=seed)
cls.fit(features)

# Function for retrieving the topic vectors for a list of text descriptions.
def get_topic_embeddings(texts, model, vectorizer):
    ngrams_vectors = vectorizer.transform(texts).toarray()
    topic_vectors = model.transform(ngrams_vectors)
    return(topic_vectors)
    
# Create the dataframe containing the average score assigned to each topic for the genes from each subset.
group_to_topic_vector = {}
for group_id,ids in group_id_to_ids.items():
    texts = [descriptions_linares_pontes[i] for i in ids]
    topic_vectors = get_topic_embeddings(texts, cls, vectorizer)
    mean_topic_vector = np.mean(topic_vectors, axis=0)
    group_to_topic_vector[group_id] = mean_topic_vector
    
tm_df = pd.DataFrame(group_to_topic_vector)

# Changing the order of the Lloyd, Meinke phenotype subsets to match other figures for consistency.
#filename = "../data/group_related_files/lloyd/lloyd_function_hierarchy_irb_cleaned.csv"
#lmtm_df = pd.read_csv(filename)
#tm_df = tm_df[lmtm_df["Subset Symbol"].values]

# Reordering so consistency with the curated subsets can be checked by looking at the diagonal.
tm_df["idxmax"] = tm_df.idxmax(axis = 1)
tm_df["idxmax"] = tm_df["idxmax"].apply(lambda x: tm_df.columns.get_loc(x))
tm_df = tm_df.sort_values(by="idxmax")
tm_df.drop(columns=["idxmax"], inplace=True)
tm_df = tm_df.reset_index(drop=False).rename({"index":"topic"},axis=1).reset_index(drop=False).rename({"index":"order"},axis=1)
tm_df.to_csv(os.path.join(OUTPUT_DIR,"topic_modeling.csv"), index=False)
tm_df

Unnamed: 0,order,topic,PWY-6406,PWY-5837,PWY-5791,PWY-7270,ETHYL-PWY,PWY-6546,PWY-1081,NONOXIPENT-PWY,CALVIN-PWY,PWY-5723,PWY-6730,PWY-6842,PWY-6736,PWY-6007,PWYQT-4476,PWYQT-4477,PWY-6008,PWY-6443,PWY-5868,PWY-6064,PWY-7186,PWY-6199,PWY-6266,PWY-2181,PWY-5168,PWY-5391,PWY-1121,PWY-361,CAMALEXIN-SYN,LIPAS-PWY,PWY-5080,PWY-7036,PWY-695,PWY-3181,PWY-6446,PWY-6444,PWY-5945,PWY1F-823,PWY-6787,PWY-5152,PWY1F-FLAVSYN,PWY-5060,PWY-3101,PWY-6902,PWY-3982,PWY-5704,PWY-7226,PWY-5034,PWY-5032,GLYOXYLATE-BYPASS,GLYOXDEG-PWY,PWY-699,PWY-6544,PWY-5137,PWY-735,PWY-5136,PWY-6837,PWY-5138,PWY-1042,PWY66-399,SUCSYN-PWY,PWY-5484,GLUCONEO-PWY,GLYCOLYSIS,PWY-2,PWY-6137,PWY-6959,PWY-2261,PWY-6724,PWY-5980,PWY-7238,PWY-1422,PWY-7436,PWY-882,PWY4FS-13,PWY4FS-12,PWY-922,THIOREDOX-PWY,ARGSYNBSUB-PWY,ARGSYN-PWY,PWY-5686,CITRULBIO-PWY,PWY-7060,PWY-4984,GLUTAMINDEG-PWY,PWY0-1319,PWY-5667,PWYQT-4482,TRIGLSYN-PWY,PWY-581,PWY-2902,PWY-7199,PWY-7193,PWY-6556,PWY-1061,PWY-5097,LEUSYN-PWY,PWY-6352,PWY-381,PWY-6549,PWY-6963,GLNSYN-PWY,PWY-6964,PWY-7061,PWY-3301,HISTSYN-PWY,PWY0-1264,PWY-7388,PWY-3385,PWY4FS-6,PWY-6163,PWY-3781,PWY-5083,PWY-4302,LYSINE-DEG2-PWY,PWY-2541,OXIDATIVEPENT-PWY,PWY0-1507,CHLOROPHYLL-SYN,FASYN-ELONG-PWY,PWY-5971,PWY-6039,PWY-6040,PWY-6466,GLUT-REDOX-PWY,PWY-4081,PWY-43,PWY-3801,PWY-5992,GLYSYN2-PWY,PWY-7416,PWY-6803,SERSYN-PWY,ALANINE-DEG3-PWY,ALANINE-SYN2-PWY,ALACAT2-PWY,PWY-6806,PWYQT-4450,PWY-1186,PWY-4361,LEU-DEG2-PWY,PWY-801,PWY-6936,PWY-702,PWY-5041,PWY-7528,METHIONINE-DEG1-PWY,SAM-PWY,PWY-5441,PROSYN-PWY,PWY-3341,PWY-6922,ARGININE-SYN4-PWY,PWY-5366,PWY-5142,PWY-7417,PWY-622,PWY-6545,PWY-7184,PWY-7227,PWY0-166,PWY-6707,PYRUVDEHYD-PWY,PWY-5147,PWY-6663,TRESYN-PWY,PWY-5350,PWY-6477,PWY-321,PWY-5143,PWY-6733,PWY-5989,PWY-282,PWY-5884,PWY-2821,PWY-601,PWY-5079,PWY-5886,PWY-7432,PWYDQC-4,TYRFUMCAT-PWY,PWY-5765,PWY-6369,PWY-1881,PWY-6475,PWY-40,PWY-6305,ARGDEG-V-PWY,ARGASEDEG-PWY,ARG-PRO-PWY,PWY-7101,PANTO-PWY,PWY-7197,PWY-7187,PYRIDNUCSYN-PWY,PWY-2301,PWY-6363,PWY-4702,PWY-6799,PWY-4381,PWY-6596,PWY-6122,PWY-6121,PWY-3841,PWY-7909,PWY-6613,PWY-3742,PWY-1722,PWY-101,PWY-6614,PWY-2161,PWY-181,GLYSYN-PWY,PWY-5871,PWY-5285,PWY-6364,NONMEVIPP-PWY,PWY-7560,PWY-6804,PWY-5800,PWY-5175,PWY-5946,CAROTENOID-PWY,PWY-7120,RIBOSYN2-PWY,PWY-782,PWY-5995,PWY-762,PWY-4341,PWY-5934,PWY-1001,PWYQT-4475,PWYQT-4473,PWYQT-4474,PWYQT-4472,PWYQT-4471,PWY-1187,PWY-5267,PWY-5947,PWY-5659,MANNCAT-PWY,PWY-3881,PWY-3261,PWY-5997,PWY-7590,MANNOSYL-CHITO-DOLICHOL-BIOSYNTHESIS,PWY-63,PWY-6317,PWY-3821,PWY-7344,PWY-6527,PWY-5114,PWY4FS-2,PWY4FS-4,PWY4FS-3,PWY-6295,PWY-84,PWY-7219,PWY-2724,PWY-66,PWY-2582,PWY-6745,CYSTSYN-PWY,PWY-5670,PWY1F-467,PWY-4041,PWYQT-4432,GLYSYN-ALA-PWY,PWY-5381,PWY-2602,PWY-6424,BSUBPOLYAMSYN-PWY,PWY0-461,ARGSPECAT-PWY,PWY-6535,PWY-6473,PWY-4321,PWY-5910,PWY-5120,PWY-5121,PWY-5863,DETOX1-PWY,DETOX1-PWY-1,PWY-7039,PWY-5188,PWY-4841,UDPNACETYLGALSYN-PWY,PWY-4,PWY-82,PWYQT-4466,PWY-7343,PWYQT-4481,PWY-561,PWY-5690,PWY-5661,PWY-4101,GLUCOSE1PMETAB-PWY,PWY-621,PWY0-1182,TRPSYN-PWY,PWY-6890,PWYQT-4470,GLUTATHIONESYN-PWY,PWY-1581,PWY-6910,PWY-7356,PWY-6908,PWY-5986,PWY-5027,PWY-6118,PWY-4261,PWY-6952,PWY-7208,PWY-7196,PWY-7183,PWYQT-4445,THISYNARA-PWY,PWY-6909,PWY-7625,PHOSLIPSYN2-PWY,PWY-6351,PWY-5973,PWY-5486,PWY66-21,ETOH-ACETYLCOA-ANA-PWY,PWY-6333,PWY-1801,PWY-5070,PWY-5035,PWY-5036,PWY0-1313,PWY-5390,HOMOSER-THRESYN-PWY,PWY-7640,PWY-5271,PWY-6012,PWY-6287,PWY-7047,PWY-7048,MALATE-ASPARTATE-SHUTTLE-PWY,PWY-6348,LIPASYN-PWY,POLYAMINSYN3-PWY,PWY0-501,PWY-5337,PWY-5342,PWY-5687,PWY-7205,PWY-7176,PWY-7221,PWY-7224,PWY-3221,PWY-4861,PWY-5466,PWY-3561,PWYQT-4427,PWY1F-353,PWY-401,PLPSAL-PWY,PWY-7204,SULFMETII-PWY,PWY-5340,PWY-4203,PWY-2161B-PMN,PWY-5410,HEME-BIOSYNTHESIS-II,PWY-6809,GLUGLNSYN-PWY,PWY-5936,GLUTSYNIII-PWY,GLUTAMATE-SYN2-PWY,GLUTAMATE-DEG1-PWY,PWY-5129,PWY-6441,PWY-6932,PWY-6132,PWY-6668,PWY-5107,PWY-6619,ASPSYNII-PWY,P401-PWY,PWY-6066,PWY-1822,PWY-6235,VALDEG-PWY,PWY-6233,PWY-6220,PWY0A-6303,PWY-6303,PWY-6607,PWY-7185,PWY-6606,PWY-6927,PWY-7170,PWY-641,PWY-6035,ASPASN-ARA-PWY,ASPARTATESYN-PWY,ASPARTATE-DEG1-PWY,PWY-3001,THRESYN-PWY,PWY-5064,PWY-5068,PWY-5086,PWY-6786,PWY-5453,PWY-5963,PWY-5669,PWY-6754,PWY-6756,PWY-6605,PWY-5098,PWY-6019,PWY4FS-7,PWY4FS-8,PWY-5269,PWY-6845,PWY-4983,PWY-6773,PWY0-1021,PWY-7250,PWY-6823,PWY-6115
0,0,44,0.000173,8.7e-05,8.7e-05,0.001293,0.001293,0.0,0.253288,0.0,0.0,0.0,0.0,0.007383,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.003031,0.002981,0.002437,0.001604,0.008053,0.006271,0.01371,0.025064,0.050127,0.050127,0.026837,0.005549,0.003096,0.0,0.003715,0.0,0.0,0.014039,0.0,0.0,0.015434,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.013453,0.026137,0.0,0.019664,0.001,0.003,0.0,0.001286,0.001125,0.01724,0.001663,0.034788,0.034788,0.001194,0.011934,0.002199,0.0,0.0,0.010004,0.0,0.0,0.0,0.009582,0.0,0.0,0.0,0.004818,0.0,0.0,0.006883,0.0,0.010539,0.0,0.010221,0.000784,0.0,0.0,0.0,0.0,0.002912,0.0,0.000336,0.023317,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025624,0.012812,0.0,0.002821,0.0,0.001443,0.0,0.0,0.0,0.030671,0.0,0.036727,0.006031,0.0,0.0191,0.0,0.0,0.016059,0.0,0.001986,0.0,0.020848,0.0,0.011117,0.0,0.000858,0.0,0.000784,0.000784,0.001506,0.0,0.0,0.0,0.0,0.001176,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.001725,0.000111,0.000111,0.021078,0.01749,0.00348,0.00348,0.00522,0.00348,0.0,0.0,0.000895,0.0,0.017508,0.0,0.028892,0.004827,0.001288,0.013396,0.022919,0.015093,0.001015,0.0,0.0,0.002353,0.002353,0.001176,0.002353,0.001176,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002329,0.002329,0.046395,0.0,0.002955,0.0,0.0,0.0,0.012276,0.0,0.0,0.0,0.0,0.0,0.019001,0.0,0.036158,0.057003,0.0,0.024278,0.0,0.060468,0.0,0.0,0.0,0.0,0.0,0.004346,0.0007,0.0007,0.0007,0.02849,0.0,0.0,0.0,0.0,0.0,0.001547,0.00562,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.015806,0.0,0.0,0.00138,0.0,0.0,0.034032,0.0,0.0,0.0,0.0,0.0,0.035308,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.011074,0.0,0.032227,0.022131,0.0,0.0,0.0,0.06186,0.0,0.055173,0.0,0.0,0.005332,0.0,0.0,0.0,0.002034,0.002034,0.002034,0.002034,0.000284,0.003121,0.003275,0.0,0.0,0.0,0.0,0.000808,0.0,0.0,0.0,0.0,0.0,0.002131,0.003196,0.002131,0.006751,0.003196,0.016145,0.008321,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.007389,0.0,0.0,0.0,0.041785,0.0,0.109446,0.0,0.00077,0.0,0.0,0.0,0.017216,0.017216,0.017216,0.001155,0.0,0.0,0.005803,0.005803,0.051896,0.0,0.0,0.0,0.0,0.0,0.003275,0.010664,0.004811,0.0,0.0,0.004659,0.004659,0.004659,0.004659,0.004659,0.0,0.141232,0.0,0.004913,0.001549,0.03448,0.00345,0.0,0.0,0.044523,0.044523,0.0,0.0,0.00918,0.0,0.0,0.0,0.0,0.002588,0.002588,0.002588,0.0,1.8e-05,0.030826,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.014777,0.014777,0.014777,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.009432,0.0,0.018361,0.018361,0.025531,0.083571,0.008426,0.008426,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,17,0.002475,0.001238,0.001238,0.0,0.0,0.0,0.000104,0.001027,0.000685,0.022192,0.0,0.0,0.0,0.003635,0.0,0.001961,0.0,0.000492,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000779,0.00829,0.002006,0.0,0.000606,0.000606,0.111103,0.194192,0.22158,0.22158,0.044067,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002033,0.0,0.0,0.0,0.000112,0.0,0.0,0.026071,0.0,0.0,0.021605,0.0,0.018905,0.083403,0.0,0.0,0.0,0.000647,0.0,0.000412,1.4e-05,2.4e-05,0.000168,0.0,0.0,0.001772,0.005136,0.0,0.0,0.0,0.000639,0.00154,0.000791,0.000461,0.0,0.0,0.0,0.000784,0.000721,0.0,0.0,0.0,0.0,0.0,0.0,0.004226,0.000324,0.039395,0.00154,0.00231,0.00462,0.00231,0.00231,0.000356,0.0,0.000313,0.001291,0.005006,0.0,0.0,0.0,0.024924,0.0,0.0,0.000519,0.0,0.0,0.028855,0.004537,0.001925,0.0,0.0,0.001075,0.0,0.021884,0.0,0.0,0.0,0.028767,0.0,0.001169,0.0,0.03106,0.03106,0.018636,0.023754,0.0,0.009139,0.027417,0.01479,0.0,0.0,0.000605,0.0,0.0,0.0,0.0,0.00242,0.0,0.0,0.001055,0.0,0.0,0.0,0.0,0.000154,0.0,0.0,0.0,0.0,0.0,0.000601,0.001169,0.0,0.00293,0.075114,0.0,0.001391,0.001753,0.000785,0.001402,0.0,0.0,0.001308,0.001308,0.002162,0.002162,0.001081,0.002162,0.001081,0.0,0.0,0.0,0.024871,0.000791,0.001055,0.0,0.001582,0.001054,0.033776,0.0,0.0,0.0,0.001078,0.000266,0.057045,0.000266,0.0,0.0,0.056486,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.029337,0.0,0.0,0.022754,0.0,0.001159,0.0,0.08586,0.000126,0.000135,0.00041,0.0,0.0,0.0,0.0,0.021948,0.013036,0.0,0.0,0.0,0.0,0.012192,0.002005,0.001961,0.001961,0.001961,0.001961,0.001961,0.001961,0.004974,0.001094,0.000129,0.0,0.0,0.000126,0.0,0.0,0.036424,0.0,0.0,0.0,0.0,0.0,0.000102,0.007509,0.007509,0.007509,0.0,0.0,0.0,0.0,0.0,0.002542,0.045509,0.0,0.0,0.0,0.091017,0.091017,0.091017,0.012027,0.0,0.0,0.0,0.0,0.0,0.00434,0.00434,0.002893,0.0,0.0,0.0,0.0,0.002983,0.034317,0.012912,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000392,0.000392,0.000392,0.0,0.0,0.0,0.0,0.0,0.000308,0.0,0.0,0.0,3.5e-05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.056486,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000592,0.0,0.0,0.0,0.008544,0.008544,0.008544,0.000889,0.0,0.001394,0.151913,0.151913,0.0,0.0,0.0,0.0,0.002092,0.0,0.012912,0.0,0.0,0.005356,0.005356,0.0,0.0,0.0,0.0,0.0,0.0,0.000408,0.0,0.019367,0.000268,0.0,0.002819,0.0,0.0,0.000616,0.000616,0.0,0.0,0.056018,0.0,0.182225,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.005861,0.007921,0.0,0.0,0.0,0.101117,0.108447,0.204095,0.148406,0.015579,0.015579,0.000696,0.000696,0.112971,0.112971,0.112971,0.000371,0.0,0.000987,0.000987,0.006277,0.006277,0.006277,0.006277,0.006277,0.001249,0.008329,0.001249,0.112036,0.112036,0.184924,0.0,0.0,0.0,0.005227,0.000557,0.0,0.0,0.0,0.0,0.002874,0.002874,0.0,0.0,0.0,0.0,0.0
2,2,19,0.0,0.0,0.0,0.0,0.0,0.0,0.005262,0.0,0.004698,0.004291,0.0,0.074992,0.0,0.0,0.001412,0.002272,0.0,0.0,0.000193,0.0,0.000257,0.0,0.0,0.059905,0.08947282,0.068697,0.005227,0.024587,0.004022,0.0,0.001162,0.004972,0.015989,0.0,0.0,0.0,0.005538,0.245757,0.128789,0.126031,0.154547,0.252061,0.091596,0.253004,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000909,0.0,0.017145,0.007184,0.006135,0.01143,0.01143,0.009704,0.003169,0.009506,0.008318,0.004074,0.00732,0.0,0.0,0.0,0.0,0.000125,0.000189,4.7e-05,0.005664,0.003501,0.001582,0.0,0.0,0.001223,0.001701,0.0,0.0,0.0,0.000581,0.0,0.0,0.0,0.0,0.0,0.0,0.007474,0.000941,0.0,0.0,0.0,0.0,0.0,0.0,0.00104,0.000874,0.0,0.0,0.002707,0.0,0.0,0.0,0.003521,0.0,0.018979,0.012092,0.0,0.009412,0.0,0.0,0.003311,0.002459,0.000445,0.000681,0.0,0.013915,0.004346,0.010409,0.006184,0.000257,0.000257,0.0,0.0,0.00151,0.002698,0.004441,0.0,0.025823,0.010591,0.00798,0.021182,0.0,0.0,0.0,0.007677,0.0,0.0,0.0,0.0,0.0,0.0,0.012788,0.03345,0.0,0.050175,0.0,0.051152,0.002905,0.002905,0.001937,0.001937,0.022593,0.022593,0.0,0.003532,0.0,0.0,0.0,0.0,0.0,0.0,0.00445,0.003309,0.007895,0.0,0.01171,0.003814,0.006674,0.013628,0.005339,0.0,0.020681,0.005093,0.005093,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00649,0.002698,0.001799,0.005396,0.0,0.0,0.0,0.057602,0.0,0.0,0.008928,0.001063,0.002473,0.001063,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.004535,0.0,0.008009,0.013606,0.0,0.009656,0.0,0.012112,0.0,0.005119,0.017217,0.015225,0.013563,0.005057,0.0,0.0,0.0,0.0,0.004088,0.0,0.008393,0.004196,0.0,0.0,0.005637,0.004067,0.004067,0.004067,0.004067,0.004067,0.004067,0.006926,0.000924,0.001856,0.0,0.0,0.024807,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025551,0.025551,0.011059,0.0,0.0,0.001136,0.019963,0.0,0.000554,0.019576,0.0,0.0,0.0,0.035548,0.0,0.000382,0.0,0.0,0.0,0.0,0.0,0.0,0.01061,0.01061,0.01061,0.01061,0.001357,0.002943,0.004282,0.008978,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.3e-05,0.0,0.000726,0.035755,0.00955,0.00955,0.0,0.012114,0.012114,0.012114,0.00221,0.0,0.0,0.0,0.005166,0.0,0.0,0.0,0.0,0.019267,0.019267,0.0,0.007295,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002785,0.002785,0.002785,0.0,0.011364,0.0,0.017051,0.017051,0.0,0.009735,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.027928,0.0,0.0,0.0,0.0,0.0,0.0,0.002516,0.0,0.0,0.005414,0.0,0.0,0.0,0.0,0.08947278,0.0,0.0,0.0,0.0,0.010737,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000457,0.01339,0.146809,0.146809,0.0,0.0,0.0,0.0,0.0,0.042213,0.023658,0.042213,0.005033,0.005033,0.003341,0.01459,0.0,0.0,0.0,0.000686,0.0,0.0,0.0,0.0,0.003276,0.003276,0.0,0.0,0.0,0.0,0.000154
3,3,42,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.019162,0.012775,0.012263,0.0,0.018912,0.0,0.0,0.002418,0.012067,0.004837,0.001163,0.01431,0.0,0.01908,0.0,0.0,0.01908,0.0,0.0,0.014051,0.01841,0.006432,0.0,0.007194,0.007194,0.002242,0.0,0.0,0.0,0.004719,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.186788,0.158037,0.0,0.0,0.07355,0.130004,0.0,0.0,0.008063,0.016126,0.0,0.001532,0.001021,0.0,0.001313,0.001313,0.001149,0.0,0.001196,0.0,0.0,0.005268,0.009538,0.043737,0.002797,0.0,0.000268,0.0,0.0,0.036653,0.027796,0.0,0.0,0.0,0.00144,0.0,0.0,0.002057,0.047074,0.027669,0.051996,0.020752,0.0,0.0,0.0,0.0,0.0,0.017086,0.0,0.000629,0.0,0.0,0.070478,0.0,0.0,0.005225,0.005225,0.0,0.0,0.0,0.009454,0.0,0.026327,0.0,0.0,0.015805,0.0,0.0,0.083881,0.0,0.0607,0.042034,0.037815,0.029583,0.01908,0.01908,0.001316,0.0,0.0,0.004246,0.032578,0.0,0.00805,0.003245,0.001834,0.006489,0.011699,0.011699,0.008023,0.014469,0.0,0.0,0.0,0.0,0.0,0.0,0.000585,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.001672,0.0,0.0,0.023955,0.000375,0.0,0.0,0.0,0.0,0.0,0.0,0.009253,0.0,0.0,0.0,0.015085,0.000645,0.001128,0.000637,0.027936,0.0,7.1e-05,0.008377,0.008377,0.0,0.0,0.0,0.0,0.0,0.0,9e-06,0.0,0.0,0.004246,0.002831,0.008492,0.0,0.0,0.021704,0.002162,0.0,0.0,0.0,0.0,0.008595,0.0,0.0,0.0,0.018948,0.006249,0.006249,0.001127,0.0,0.0,0.0,0.0,0.008019,0.0,0.001127,0.036855,0.002254,0.002928,0.0,0.002541,1e-05,0.019882,0.0,0.031383,0.0,0.0,0.0,0.03149,0.023307,0.004491,0.004491,0.002245,0.010451,0.021978,0.040865,0.011283,0.011283,0.011283,0.011283,0.011283,0.011283,0.0,0.0,0.000375,0.000938,0.000938,8.8e-05,0.0,0.0,0.031147,0.000884,0.000884,0.000884,0.001326,0.000884,0.000663,0.0,0.0,0.0,0.002202,0.002202,0.016738,0.0,0.010248,0.059436,0.042072,0.080686,0.042642,0.064706,0.029947,0.029947,0.029947,0.026112,0.102603,0.031045,0.117609,0.117609,0.058805,0.003982,0.003982,0.004327,0.04088,0.04088,0.04088,0.04088,0.02024,0.01012,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.086874,0.130312,0.086874,0.065156,0.130312,0.000455,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.070414,0.070414,0.070414,0.054155,0.0,0.0,0.0,0.000595,0.0,0.135166,0.05407,0.018023,0.05407,0.05407,0.05407,0.130948,0.130948,0.130948,0.0,0.0,0.001171,0.0,0.0,0.002806,0.0,0.0,0.0,0.0,0.00252,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.5e-05,0.0,0.0,0.0,0.0,0.0,0.0,0.00091,0.00091,0.0,0.0,0.004245,0.0,0.000215,0.0,0.0,0.002509,0.002509,0.002509,0.0,0.000258,0.0,0.004212,0.0,0.0,0.002546,0.0,0.0,0.000129,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.037895,0.037895,0.037895,0.000757,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002562,0.0,0.00849,0.00849,0.0,0.00119,0.0,0.0,0.001699,0.0,0.032061,0.061839,0.061839,0.061839,0.019468,0.019468,0.058017,0.182099,0.182099,0.182099,0.0
4,4,31,0.0,0.0,0.0,0.0,0.0,0.0,0.014782,0.0,0.0,0.014352,0.0,0.002783,0.0,0.003269,0.008618,0.01118,0.017235,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002189,0.005345,0.013307,0.001708,0.019731,0.00201,0.007935,0.097598,0.032291,0.064583,0.064583,0.027436,0.0,0.00146,0.0,0.0,0.0,0.002919,0.0,0.0,0.0,0.001033,0.153228,0.195599,0.04471095,0.003451,0.00683,0.025764,0.025053,0.022935,0.03716,0.036127,0.018418,0.019129,0.01452,0.0,0.014352,0.018669,0.012558,0.0,0.009005,0.0,0.0,0.001109,0.0,0.0,0.003818,0.0,0.014407,0.0,0.0,0.0,0.003238,0.0,0.0,0.0,0.002297,0.0,0.0,0.002663,0.0,0.0,0.0,0.011611,0.000395,0.0,0.0,0.0,0.0,0.005315,0.0,0.003979,0.000438,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.007752,0.023603,0.011801,0.010077,0.0,0.0,0.051358,0.019651,0.042101,0.013205,0.003598,0.0,0.0,0.00885,0.0,0.008017,0.0,0.0,0.006213,0.0,0.023765,0.0,0.0,0.0,0.025516,0.007671,0.010574,0.015343,0.000395,0.000395,0.000237,0.002914,0.0,0.0,0.0,0.000593,0.0,0.0,0.004402,0.0,0.0,0.0,0.0,0.017607,0.002165,0.002165,0.001443,0.001443,0.018944,0.018944,0.0,0.003304,0.001033,0.001033,0.00155,0.001033,0.0,0.0,0.010691,0.0,0.010282,0.091579,0.172398,0.006872,0.012026,0.003748,0.009621,0.0,0.001201,0.009781,0.009781,0.001185,0.001185,0.000593,0.001185,0.000593,0.0,0.0,0.0,0.004118,0.0,0.0,0.0,0.0,0.0,0.0,0.045238,0.0,0.0,0.008815,0.0,0.010646,0.0,0.0,0.0,0.0,0.029921,0.029921,0.028595,0.0,0.0,0.002237,0.0,0.00868,0.006711,0.028595,0.021087,0.057189,0.006712,0.0,0.026614,0.00474,0.00395,0.007805,0.000913,0.0,0.0,0.0,0.044918,0.011683,0.0,0.0,0.0,0.0,0.0,0.003062,0.002562,0.002562,0.002562,0.002562,0.002562,0.002562,0.0,0.000479,0.02017,0.005846,0.005846,0.008171,0.0,0.0,0.0,0.000826,0.000826,0.000826,0.001239,0.000826,0.000619,0.015115,0.015115,0.015115,0.0,0.0,0.0,0.0,0.0,0.002097,0.004174,0.0,0.0,0.0,0.0,0.0,0.0,0.050507,0.0,0.019925,0.0,0.0,0.005093,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.030133,0.050714,0.014779,0.0,0.0,0.0,0.0,0.002581,0.0,0.0,0.076585,0.076585,0.076585,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.036143,0.036143,0.003229,0.0,0.0,0.0,0.0,0.0,0.102607,0.102607,0.051304,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002514,0.0,0.0,0.01617,0.012749,0.01617,0.01617,0.01617,0.146536,0.146536,0.146536,0.011039,0.004379,0.0,0.055298,0.055298,0.0,0.0,0.065341,0.065341,0.04514,0.067633,0.008098,0.010185,0.008758,0.006829,0.006829,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.012147,0.0,0.0,0.003982,0.0,0.0,0.002304,0.002304,0.0,0.0,0.062565,0.0,0.012993,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.004608,0.017349,0.0,0.003622,0.0,0.0,0.0,0.001561,0.0,0.0,0.0,0.005531,0.005531,0.0,0.0,0.0,0.0,0.0,0.002543,0.0,0.017313,0.017313,0.004739,0.004739,0.004739,0.004739,0.004739,0.010447,0.010081,0.010447,0.125129,0.125129,0.096125,0.005029,0.0,0.0,0.0,0.003814,0.0,0.0,0.0,0.0,0.01527,0.01527,0.0,0.0,0.0,0.0,0.0
5,5,9,0.0,0.0,0.0,0.000371,0.000371,0.000222,0.00388,0.023872,0.015914,0.013641,0.0,0.0,0.0,0.0,0.000603,0.000103,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000822,0.000164,0.00038,0.000201,0.002121,0.000875,0.000284,0.000253,0.000752,0.001504,0.001504,0.000303,0.001645,0.000548,0.001645,0.000658,0.003289,0.001097,0.0,0.0,0.00043,0.0,0.0,0.0,0.01310974,0.039328,0.000269,0.001143,0.0,0.000732,4.2e-05,8.5e-05,0.0,0.000198,0.037057,0.038751,0.029136,0.016608,0.040026,0.0,0.001061,0.010778,0.010778,0.147719,0.006237,0.070015,0.0,0.0,0.000158,0.0,0.0,0.0,7.7e-05,0.0,0.0,0.0,0.001098,0.0,0.002746,0.0,0.000129,0.005124,8.6e-05,0.004163,0.000402,0.0,0.0,0.0,0.0,0.031106,0.0,0.000717,0.000228,0.0,0.0,0.000782,0.0,0.0,0.0,0.00412,0.0,0.001383,0.000692,0.0,0.001863,0.0,0.0,0.0,0.0,0.0,0.000243,0.128437,0.0,0.000452,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.055573,0.005234,0.000175,0.0,0.0,0.0,0.000233,0.000233,0.000629,0.0,0.001255,0.000854,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.003662,0.000816,0.0,0.0,0.010162,0.128197,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000515,0.005493,0.0,0.0,0.0,0.0,0.0,0.0,0.000218,0.0,6.8e-05,6.8e-05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002746,0.003662,0.0,0.005492,0.003661,0.0,0.0,0.0,0.0,0.000529,0.000552,0.000221,0.000552,0.0,0.0,0.0,0.001496,0.001496,0.0,0.0,0.0,0.0,0.0,0.000756,0.0,0.0,0.002731,0.0,0.0,0.0,0.0,0.00032,0.000266,0.0,0.0,0.0,0.0,0.0,0.0,0.089214,0.000429,0.0,0.001392,0.0,0.002245,0.00043,0.000103,0.000103,0.000103,0.000103,0.000103,0.000103,0.0,0.0,0.023251,0.0,0.0,0.0,0.000858,0.000858,0.0,0.022312,0.067984,0.0,0.0,0.0,0.031136,0.0,0.0,0.0,0.0,0.0,0.000505,0.0,0.0,5.1e-05,0.000349,0.0,0.0,0.001476,0.0007,0.0007,0.0007,0.00015,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000816,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05256,0.101976,0.101976,0.003068,0.003068,0.003068,0.070639,0.003982,0.070639,0.031054,0.003982,0.0,0.000785,0.0,0.0,0.0,0.00157,0.00157,0.00157,0.0,0.0,0.0,0.0,0.015581,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.001781,0.001781,0.001781,0.0,0.0,0.0,0.001783,0.001783,0.0,0.0,0.0,0.0,0.001504,0.004437,0.0,0.0,0.006365,0.000553,0.000553,0.0,0.0,0.0,0.0,0.0,0.0,0.057608,0.0,0.0,0.002623,0.0,0.00101,0.0,0.0,0.066967,0.066967,0.001669,0.0,0.001831,0.004354,0.0,0.001564,0.0,0.001224,0.001224,0.001224,0.0,0.00033,0.133933,0.0,0.0,0.0,0.001341,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.004514,0.004514,0.004514,0.004514,0.004514,0.0,0.000116,0.0,0.0,0.0,0.008163,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,6,5,0.114665,0.057333,0.057333,0.0,0.0,0.0,0.0,0.0001,6.6e-05,0.00055,0.131596,0.043865,0.131596,0.0,0.016176,0.013266,0.026531,0.042031,0.055831,0.099899,0.036735,0.099899,0.099899,0.036135,0.04994947,0.050561,0.015513,0.034639,0.051266,0.0,0.0,0.000171,0.001752,0.000738,0.001475,0.001475,0.0,0.0,0.017125,0.000206,8.2e-05,0.0,0.034114,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000341,0.005095,0.003866,0.000858,0.000513,0.000575,0.000383,0.0,0.000493,0.000493,0.000431,0.0,0.000286,0.0,0.0,0.00013,0.0,0.0,0.0,0.0,0.014391,0.075295,0.075295,0.040889,0.00094,0.0,0.0,0.0,0.0,0.000372,0.0,0.0,0.0,6.3e-05,0.000895,0.000564,0.024302,0.195824,0.195824,0.195824,0.195824,0.0,0.0,0.0,0.0,0.006892,0.051134,0.000558,0.001117,0.000558,0.000558,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000149,0.0,0.0,0.0,0.0,0.005202,0.041141,0.041141,0.0,0.0,0.0,0.0,0.0,0.04109,0.000843,0.000128,0.005245,0.0,0.001123,0.001123,0.001443,0.0,0.0,0.0,0.0,0.0,0.00566,0.00566,0.005158,0.019374,0.0,0.029061,0.0,0.0,0.0,0.0,0.0,0.001281,0.001501,0.001501,0.000125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.005702,0.0,0.000533,0.084243,0.001084,0.004459,0.007803,0.002256,0.006242,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.039267,0.0,0.00748,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.001546,0.0,0.0,0.0,0.003865,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000843,0.0,0.0,0.0,0.0,8.6e-05,0.006433,0.0,0.0,0.0,0.0,0.0,0.000686,0.0,0.0,0.0,0.0,0.0,0.0,0.001208,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.005089,0.0,0.0,0.0,0.0,0.0,0.008305,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000183,0.000183,0.001104,0.0,0.0,0.0,0.001685,0.0,0.0,0.0,0.003371,0.003371,0.003371,0.002529,0.0,0.0,0.0,0.0,0.0,0.00024,0.00024,0.000474,0.0,0.0,0.0,0.0,0.01304,0.00652,0.00138,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.003176,0.0,0.032995,0.032995,0.004728,0.0,0.0,0.0,0.001164,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.003865,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.007616,0.0,0.0,0.0,0.000986,0.000986,0.000986,0.011424,0.051172,0.009359,0.000441,0.000441,0.0,0.0,0.0,0.0,0.0,0.0,8.6e-05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.001895,0.001895,0.0,0.0,0.0,0.0,0.0,0.0,0.000715,0.0,0.0,0.001922,0.001922,0.001922,0.0,0.034047,0.0,0.0,0.001065,0.0,0.0,0.0,0.0,0.006361,0.003276,0.0,0.004984,0.014939,0.014939,0.0,0.0,0.007729,0.007729,0.007729,0.01649,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00141,0.0,0.0,0.0,0.024359,0.0,0.0,0.0,0.00405,0.024736,0.0,0.002685,0.002685,0.002685,0.012935,0.012935,0.0,0.0,0.0,0.0,0.0
7,7,3,0.0,0.0,0.0,0.0,0.0,0.0,0.006307,0.002463,0.001642,0.001408,0.0,0.008938,0.0,0.012521,0.001582,0.001582,0.003164,0.00139,0.000162,0.0,0.0,0.0,0.0,0.0,0.0,0.000556,0.001847,0.001209,2.7e-05,0.0,0.000124,0.000305,0.000179,0.004308,0.0,0.0,0.008,0.000737,0.000636,0.000737,0.000763,0.001474,0.000742,0.0,0.004796,0.0,0.009243,0.0,0.0,0.0,0.0,0.001614,0.005775,0.000814,0.000326,0.000271,0.000543,0.000543,0.00163,0.0,0.0,0.0,0.0,0.0,0.004308,0.000937,0.0,0.0,0.003861,0.049945,0.000259,0.001897,0.000782,0.001148,0.000732,0.000732,0.000507,0.0,0.0,0.0,0.0,0.000763,0.0,0.000226,0.000637,0.0,0.002606,0.0,0.002601,0.000425,0.0,0.0,0.0,0.0,0.100993,0.002196,0.000507,0.002387,0.000102,0.001487,0.000104,0.0,0.002231,0.002231,0.000649,0.001643,0.0,0.0,0.017459,0.0,0.0,0.004998,0.0,0.0,0.0,0.002713,0.0,0.00032,0.000215,0.0,0.002226,0.000216,0.000216,0.0,0.0,0.0,0.0,0.000435,0.000351,0.001216,0.0,0.000369,0.0,0.001456,0.001456,0.000874,0.0,0.000568,0.0,0.0,0.000637,0.0,0.0,0.0,0.003406,0.0,0.005109,0.0,0.0,0.001132,0.001132,0.001056,0.000755,0.0,0.0,0.005211,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000369,0.002106,0.00676,0.0,0.0,0.006082,0.000553,0.002891,0.002672,0.0,0.001664,0.0,0.0,0.001274,0.001274,0.000637,0.001274,0.000637,0.0,0.0,0.0,0.0,0.000226,0.000301,0.0,0.000452,0.000302,0.0,0.00139,0.0,0.0,0.000252,0.0,0.001803,0.0,0.0,0.0,0.0,0.0,0.0,0.000883,0.0,0.0,0.0,0.0,0.001026,0.0,0.000883,0.001216,0.001766,0.002938,0.0,0.004507,0.000265,0.000221,0.007646,0.014816,0.0,0.0,0.0,0.0,0.002591,0.000205,0.000205,0.000103,0.004462,0.0,0.015118,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00091,0.001461,0.0,0.0,0.017846,0.0,0.0,0.0,0.014727,0.00933,0.00933,0.013995,0.00933,0.021733,0.026188,0.026188,0.026188,0.004002,0.004002,0.001231,0.004468,0.025857,0.000574,0.005604,0.0,0.0,0.004272,0.003097,0.003097,0.003097,0.005604,0.0,0.0,0.006807,0.006807,0.017235,0.001643,0.001643,0.001095,0.001094,0.001094,0.001094,0.001094,0.0,0.0,0.0,0.001788,0.0,0.0,0.0,0.0,0.0,0.0,0.000289,0.000289,0.000289,0.0,0.0,0.0,0.0,0.0,0.0,0.000117,0.001058,0.001058,0.0,0.0,0.0,0.0,0.011709,0.0,0.0,0.0,0.011994,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.011145,0.0,0.000833,0.0,0.0,0.0,0.0,0.0,0.0,0.001249,0.000376,0.0,0.00485,0.00485,0.009255,0.0,0.0,0.0,0.0,0.0,0.0,0.027662,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.042753,0.0,0.0,0.000368,0.0,0.0,0.0,0.0,0.0,0.0,0.000231,0.0,0.0,0.0,0.0,0.000209,0.0,0.0,0.0,0.0,0.0,0.00711,0.0,0.0,0.0,0.0,0.002613,0.0,0.0,0.00064,0.0,0.0,0.001705,0.001881,0.001881,0.00435,0.00435,0.0,0.0,0.0,0.007446,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.011169,0.0,0.0,0.0,0.0,0.0,0.0,0.002606,0.0,0.0,0.0,0.000468
8,8,26,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.008707,0.005805,0.005041,0.0,0.001463,0.0,0.009579,0.0,0.00705,0.0,0.000406,0.0,0.0,0.013165,0.0,0.0,0.0,0.0,0.017291,0.001673,0.000913,0.005251,0.0,0.008663,0.003094,0.009821,0.003532,0.0,0.0,0.005861,0.003018,0.011527,0.003018,0.001908,0.006035,0.023054,0.0,0.058076,0.0,0.001535,0.0,0.0,0.0,0.0,0.00282,0.0,0.007583,0.003132,0.005037,0.005056,0.005056,0.000721,5.1e-05,0.0,6.5e-05,6.5e-05,5.7e-05,0.003532,0.011194,0.038417,0.038417,0.020349,0.006952,0.015339,0.000124,0.000206,0.027913,0.0,0.0,0.00051,0.007752,0.0,0.0,0.0,0.015716,0.0,0.02282,0.0,0.0,0.0,0.0,0.000794,0.005931,0.0,0.0,0.0,0.0,0.0,0.165598,0.036395,0.001518,0.00698,0.0,0.024025,0.0,0.0,0.0,0.034363,0.0,0.001176,0.000588,0.013593,0.000792,0.0,0.0,0.0,0.011584,0.007584,0.000788,0.0,0.004257,0.0,0.0,0.002509,0.0,0.0,0.0,0.0,0.0,0.043435,0.001392,0.07255,0.09152,0.0,0.002509,0.0,0.09472,0.09472,0.059576,0.003364,0.057187,0.0,0.0,0.008071,0.084806,0.084806,0.021202,0.016677,0.0,0.025015,0.0,0.0,0.032941,0.032941,0.052388,0.026534,9e-06,9e-06,0.0,0.007049,0.0,0.0,0.0,0.0,0.0,0.004311,0.002509,0.004075,0.006359,0.0,0.0,0.004911,0.003763,0.001195,0.003011,0.0,0.0,0.016627,0.016627,0.016142,0.016142,0.008071,0.016142,0.008071,0.0,0.0,0.0,0.0,0.066255,0.059384,0.08687,0.04564,0.030427,0.0026,0.0,0.0,0.0,0.025633,0.0,0.000488,0.0,0.0,0.0,0.001221,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.035201,0.0,0.0,0.0,0.0,0.000136,0.000113,0.002158,0.0,0.0,0.0,0.0,0.0,0.007165,0.004096,0.004096,0.004836,0.0,0.0,0.007871,0.016754,0.016754,0.016754,0.016754,0.016754,0.016754,0.0,0.0,0.039079,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.020389,0.020389,0.020389,0.000839,0.000839,0.001398,0.0,0.0,0.003525,0.022551,0.0,0.044992,0.0,0.0451,0.0451,0.0451,0.013305,0.0,0.0,0.0,0.0,0.0,0.005765,0.005765,0.008417,0.0,0.0,0.0,0.0,0.020409,0.010205,0.0,0.056119,0.0,0.0,0.0,0.000803,0.0,0.0,0.013109,0.013109,0.013109,0.003712,0.005569,0.003712,0.0035,0.005569,0.148546,0.011981,0.00822,0.00822,0.0,0.0,0.0,0.0,0.144516,0.0,0.0,0.0,0.008162,0.0,0.0,0.0,0.00122,0.0,0.0,0.0,0.0,0.0,0.0,0.006584,0.007212,0.006584,0.006584,0.006584,0.006231,0.006231,0.006231,0.007526,0.031564,0.028879,0.000134,0.000134,7.4e-05,0.0,0.0,0.0,0.018669,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.044247,0.0,0.001034,0.0,0.000265,0.007088,0.007088,0.024024,0.024024,0.0,0.0,0.037295,0.015518,0.0,0.04805,0.0,0.006859,0.006859,0.006859,0.0,0.0,0.0,0.0,0.001939,0.0,0.00559,0.0,0.0,0.0,0.090944,0.022832,0.0,0.0,0.0,0.00018,0.00018,0.002441,0.002441,0.002441,0.0,0.000624,0.0,0.0,0.056007,0.056007,0.056007,0.056007,0.056007,0.0,0.011628,0.0,0.058928,0.058928,0.016287,0.0,0.0,0.0,0.015132,0.0,0.023121,0.0,0.0,0.0,0.006772,0.006772,0.0,0.01277,0.01277,0.01277,0.0
9,9,30,0.015806,0.007903,0.007903,0.002306,0.002306,0.0,0.0,0.001341,0.000894,0.01443,0.0,0.001843,0.0,0.0,0.0,0.000121,0.0,0.0,0.002916,0.0,0.000177,0.0,0.0,0.001717,0.002310691,0.002564,0.004029,0.002389,8e-05,0.0,0.002579,0.001627,0.003659,0.008292,0.016585,0.016585,0.002569,0.006004,0.007506,0.005128,0.009007,0.010256,0.003418,0.017763,0.0,0.001388,0.022666,0.049768,0.0,0.0,0.0,0.035458,0.10132,0.007243,0.005067,0.002414,0.004829,0.004829,0.016544,0.016919,0.018876,0.013664,0.021754,0.019034,0.0,0.007645,0.0,0.0,0.0,0.003978,0.009109,0.026696,0.044494,0.007376,0.0,0.0,0.053361,0.0,0.205474,0.205474,0.205474,0.041397,0.136983,0.102736,0.059139,0.0,0.013989,0.021506,0.017073,0.0,0.0,0.0,0.0,0.0,0.005674,0.0,0.01207,0.216839,0.0,0.0,0.0,0.0,0.0,0.0,0.027184,0.009016,0.0,0.015501,0.0,0.005383,0.0,0.0,0.004351,0.0,0.0,0.003459,0.0,0.03736,0.002657,0.062006,0.052334,0.003888,0.003888,0.001009,0.0,0.0,0.0,0.009531,0.0,0.014091,0.016212,0.01136,0.032425,0.006242,0.006242,0.003745,0.008497,0.005159,0.018097,0.033653,0.016826,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.012377,0.012377,0.027978,0.025244,0.020607,0.020607,0.030911,0.020607,0.0,0.000118,0.010852,0.002693,0.0,0.006587,0.0,0.005105,0.008933,0.008388,0.050399,0.0,0.013926,0.010765,0.010765,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02305,0.0,0.0,0.0,0.0,0.0,0.0,0.012746,0.0,0.0,0.0,0.000115,0.0,0.008241,0.0,0.0,0.0,0.007936,0.0,0.0,0.01882,0.0,0.0,0.0,0.0,0.01189,0.0,0.01882,0.021504,0.037641,0.000563,0.0,0.012665,0.005518,0.043492,0.0,0.03394,0.0,0.0,0.0,0.0,0.023231,0.009637,0.009637,0.010026,0.0,0.0,0.002822,0.011135,0.011135,0.011135,0.011135,0.011135,0.011135,0.0,0.004724,0.021653,0.0,0.0,0.0,0.0,0.0,0.00566,0.012943,0.012943,0.012943,0.019415,0.012943,0.025877,0.0,0.0,0.0,0.012883,0.012883,0.0,0.0,0.00346,0.018992,0.011971,0.027849,0.019292,0.0,0.018723,0.018723,0.018723,0.065216,0.0,0.0,0.0,0.0,0.00068,0.001581,0.001581,0.001054,0.0,0.0,0.0,0.0,0.0,0.0,0.01755,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002378,0.003567,0.002378,0.024516,0.003567,0.000648,0.003331,0.0,0.0,0.035465,0.006662,0.006662,0.006662,0.004517,0.0,0.0,0.0,0.064847,0.0,0.0,0.0,0.007937,0.000857,0.000857,0.0,0.0,0.0,0.216265,0.022803,0.007601,0.022803,0.022803,0.022803,4.2e-05,4.2e-05,4.2e-05,0.0,0.0,0.0,0.0,0.0,0.038248,0.0,0.0,0.0,0.0,0.0,0.0,0.00136,0.0,0.038471,0.038471,0.0,0.0,0.0,0.0,0.0,0.0,0.064678,0.025706,0.0,0.0,0.0,0.0,0.057203,0.057203,0.025181,0.025181,0.0,0.0,0.002682,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002310715,0.000613,0.050361,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.015873,0.015873,0.015873,0.024858,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.005364,0.005364,0.0,0.0,0.0,0.0,0.022049,0.0,0.0,0.064517,0.064517,0.064517,0.000117,0.000117,0.0,0.112081,0.112081,0.112081,0.005386


In [39]:
# Describing what the most representative tokens for each topic in the model are.
num_top_words = 4
feature_names = vectorizer.get_feature_names()
for i,topic_vec in enumerate(cls.components_):
    print(i,end=" ")
    for fid in topic_vec.argsort()[-1:-num_top_words-1:-1]:
        word = feature_names[fid]
        word = " ".join(unreduce_lp[word])
        print(word, end=" ")
    print()

0 wild types type comparison comparable comparing compared level levels higher 
1 embryo embryogenesis embryos zygote embryonic fertilization zygotic defective defect transition meristem gametophytic gametophyte 
2 meristem gametophytic gametophyte sex female male defective defect complete full partial incomplete 
3 wall walls inside cellular cell cells abnormal abnormalities normal malformed aberrant abnormally xylan 
4 roots root lateral longitudinal laterally number numbers long short 
5 antimicrobial bacterial yeast kanamycin bacteria coli toxin syringae agrobacterium avirulent pathogenesis flagellin pathogens pseudomonas untreated infections inoculation infected wound infection hiv uninfected disease susceptible resistance resistant biotrophic fungal oomycete phytophthora hyphal fungus fungi endophytic irregulare 
6 green greener creamy coat yellowish paler pale spots seeds seed germinated harvested seedling soil hyperresponsive seedlings 
7 lethality lethal germinated harvested s

<a id="clustering"></a>
### Approach 2: Agglomerative clustering and comparison to predefined groups

In [40]:
# Generate a numpy array where values are mean distance percentiles between all the methods.
to_pct = lambda arr: np.array(pd.Series(arr.flatten()).rank(pct=True)).reshape(-1,arr.shape[0])
all_pct_arrays = np.array([to_pct(np.nan_to_num(graphs[method].array, nan=1)) for method in methods])
mean_pct_array = np.mean(all_pct_arrays,axis=0)


# Do agglomerative clustering based on that distance matrix.
from sklearn.cluster import AgglomerativeClustering
number_of_clusters = 50
to_id = graphs[methods[0]].row_index_to_id
ac = AgglomerativeClustering(n_clusters=number_of_clusters, linkage="complete", affinity="precomputed")
clustering = ac.fit(mean_pct_array)
id_to_cluster = {}
for idx,c in enumerate(clustering.labels_):
    id_to_cluster[to_id[idx]] = c

In [41]:
# Create the dataframe containing the average score assigned to each topic for the genes from each subset.
group_to_cluster_vector = {}
for group_id,ids in group_id_to_ids.items():
    
    mean_cluster_vector = np.zeros(number_of_clusters)
    for i in ids:
        cluster = id_to_cluster[i]
        mean_cluster_vector[cluster] = mean_cluster_vector[cluster]+1
    mean_cluster_vector = mean_cluster_vector/mean_cluster_vector.sum(axis=0,keepdims=1)
    group_to_cluster_vector[group_id] = mean_cluster_vector
    
ac_df = pd.DataFrame(group_to_cluster_vector)

# Changing the order of the Lloyd, Meinke phenotype subsets to match other figures for consistency.
#filename = "../data/group_related_files/lloyd/lloyd_function_hierarchy_irb_cleaned.csv"
#lmac_df = pd.read_csv(filename)
#ac_df = ac_df[lmac_df["Subset Symbol"].values]

# Reordering so consistency with the curated subsets can be checked by looking at the diagonal.
ac_df["idxmax"] = ac_df.idxmax(axis = 1)
ac_df["idxmax"] = ac_df["idxmax"].apply(lambda x: ac_df.columns.get_loc(x))
ac_df = ac_df.sort_values(by="idxmax")
ac_df.drop(columns=["idxmax"], inplace=True)
ac_df = ac_df.reset_index(drop=False).rename({"index":"topic"},axis=1).reset_index(drop=False).rename({"index":"order"},axis=1)
ac_df.to_csv(os.path.join(OUTPUT_DIR,"agglomerative_clustering.csv"), index=False)
ac_df

Unnamed: 0,order,topic,PWY-6406,PWY-5837,PWY-5791,PWY-7270,ETHYL-PWY,PWY-6546,PWY-1081,NONOXIPENT-PWY,CALVIN-PWY,PWY-5723,PWY-6730,PWY-6842,PWY-6736,PWY-6007,PWYQT-4476,PWYQT-4477,PWY-6008,PWY-6443,PWY-5868,PWY-6064,PWY-7186,PWY-6199,PWY-6266,PWY-2181,PWY-5168,PWY-5391,PWY-1121,PWY-361,CAMALEXIN-SYN,LIPAS-PWY,PWY-5080,PWY-7036,PWY-695,PWY-3181,PWY-6446,PWY-6444,PWY-5945,PWY1F-823,PWY-6787,PWY-5152,PWY1F-FLAVSYN,PWY-5060,PWY-3101,PWY-6902,PWY-3982,PWY-5704,PWY-7226,PWY-5034,PWY-5032,GLYOXYLATE-BYPASS,GLYOXDEG-PWY,PWY-699,PWY-6544,PWY-5137,PWY-735,PWY-5136,PWY-6837,PWY-5138,PWY-1042,PWY66-399,SUCSYN-PWY,PWY-5484,GLUCONEO-PWY,GLYCOLYSIS,PWY-2,PWY-6137,PWY-6959,PWY-2261,PWY-6724,PWY-5980,PWY-7238,PWY-1422,PWY-7436,PWY-882,PWY4FS-13,PWY4FS-12,PWY-922,THIOREDOX-PWY,ARGSYNBSUB-PWY,ARGSYN-PWY,PWY-5686,CITRULBIO-PWY,PWY-7060,PWY-4984,GLUTAMINDEG-PWY,PWY0-1319,PWY-5667,PWYQT-4482,TRIGLSYN-PWY,PWY-581,PWY-2902,PWY-7199,PWY-7193,PWY-6556,PWY-1061,PWY-5097,LEUSYN-PWY,PWY-6352,PWY-381,PWY-6549,PWY-6963,GLNSYN-PWY,PWY-6964,PWY-7061,PWY-3301,HISTSYN-PWY,PWY0-1264,PWY-7388,PWY-3385,PWY4FS-6,PWY-6163,PWY-3781,PWY-5083,PWY-4302,LYSINE-DEG2-PWY,PWY-2541,OXIDATIVEPENT-PWY,PWY0-1507,CHLOROPHYLL-SYN,FASYN-ELONG-PWY,PWY-5971,PWY-6039,PWY-6040,PWY-6466,GLUT-REDOX-PWY,PWY-4081,PWY-43,PWY-3801,PWY-5992,GLYSYN2-PWY,PWY-7416,PWY-6803,SERSYN-PWY,ALANINE-DEG3-PWY,ALANINE-SYN2-PWY,ALACAT2-PWY,PWY-6806,PWYQT-4450,PWY-1186,PWY-4361,LEU-DEG2-PWY,PWY-801,PWY-6936,PWY-702,PWY-5041,PWY-7528,METHIONINE-DEG1-PWY,SAM-PWY,PWY-5441,PROSYN-PWY,PWY-3341,PWY-6922,ARGININE-SYN4-PWY,PWY-5366,PWY-5142,PWY-7417,PWY-622,PWY-6545,PWY-7184,PWY-7227,PWY0-166,PWY-6707,PYRUVDEHYD-PWY,PWY-5147,PWY-6663,TRESYN-PWY,PWY-5350,PWY-6477,PWY-321,PWY-5143,PWY-6733,PWY-5989,PWY-282,PWY-5884,PWY-2821,PWY-601,PWY-5079,PWY-5886,PWY-7432,PWYDQC-4,TYRFUMCAT-PWY,PWY-5765,PWY-6369,PWY-1881,PWY-6475,PWY-40,PWY-6305,ARGDEG-V-PWY,ARGASEDEG-PWY,ARG-PRO-PWY,PWY-7101,PANTO-PWY,PWY-7197,PWY-7187,PYRIDNUCSYN-PWY,PWY-2301,PWY-6363,PWY-4702,PWY-6799,PWY-4381,PWY-6596,PWY-6122,PWY-6121,PWY-3841,PWY-7909,PWY-6613,PWY-3742,PWY-1722,PWY-101,PWY-6614,PWY-2161,PWY-181,GLYSYN-PWY,PWY-5871,PWY-5285,PWY-6364,NONMEVIPP-PWY,PWY-7560,PWY-6804,PWY-5800,PWY-5175,PWY-5946,CAROTENOID-PWY,PWY-7120,RIBOSYN2-PWY,PWY-782,PWY-5995,PWY-762,PWY-4341,PWY-5934,PWY-1001,PWYQT-4475,PWYQT-4473,PWYQT-4474,PWYQT-4472,PWYQT-4471,PWY-1187,PWY-5267,PWY-5947,PWY-5659,MANNCAT-PWY,PWY-3881,PWY-3261,PWY-5997,PWY-7590,MANNOSYL-CHITO-DOLICHOL-BIOSYNTHESIS,PWY-63,PWY-6317,PWY-3821,PWY-7344,PWY-6527,PWY-5114,PWY4FS-2,PWY4FS-4,PWY4FS-3,PWY-6295,PWY-84,PWY-7219,PWY-2724,PWY-66,PWY-2582,PWY-6745,CYSTSYN-PWY,PWY-5670,PWY1F-467,PWY-4041,PWYQT-4432,GLYSYN-ALA-PWY,PWY-5381,PWY-2602,PWY-6424,BSUBPOLYAMSYN-PWY,PWY0-461,ARGSPECAT-PWY,PWY-6535,PWY-6473,PWY-4321,PWY-5910,PWY-5120,PWY-5121,PWY-5863,DETOX1-PWY,DETOX1-PWY-1,PWY-7039,PWY-5188,PWY-4841,UDPNACETYLGALSYN-PWY,PWY-4,PWY-82,PWYQT-4466,PWY-7343,PWYQT-4481,PWY-561,PWY-5690,PWY-5661,PWY-4101,GLUCOSE1PMETAB-PWY,PWY-621,PWY0-1182,TRPSYN-PWY,PWY-6890,PWYQT-4470,GLUTATHIONESYN-PWY,PWY-1581,PWY-6910,PWY-7356,PWY-6908,PWY-5986,PWY-5027,PWY-6118,PWY-4261,PWY-6952,PWY-7208,PWY-7196,PWY-7183,PWYQT-4445,THISYNARA-PWY,PWY-6909,PWY-7625,PHOSLIPSYN2-PWY,PWY-6351,PWY-5973,PWY-5486,PWY66-21,ETOH-ACETYLCOA-ANA-PWY,PWY-6333,PWY-1801,PWY-5070,PWY-5035,PWY-5036,PWY0-1313,PWY-5390,HOMOSER-THRESYN-PWY,PWY-7640,PWY-5271,PWY-6012,PWY-6287,PWY-7047,PWY-7048,MALATE-ASPARTATE-SHUTTLE-PWY,PWY-6348,LIPASYN-PWY,POLYAMINSYN3-PWY,PWY0-501,PWY-5337,PWY-5342,PWY-5687,PWY-7205,PWY-7176,PWY-7221,PWY-7224,PWY-3221,PWY-4861,PWY-5466,PWY-3561,PWYQT-4427,PWY1F-353,PWY-401,PLPSAL-PWY,PWY-7204,SULFMETII-PWY,PWY-5340,PWY-4203,PWY-2161B-PMN,PWY-5410,HEME-BIOSYNTHESIS-II,PWY-6809,GLUGLNSYN-PWY,PWY-5936,GLUTSYNIII-PWY,GLUTAMATE-SYN2-PWY,GLUTAMATE-DEG1-PWY,PWY-5129,PWY-6441,PWY-6932,PWY-6132,PWY-6668,PWY-5107,PWY-6619,ASPSYNII-PWY,P401-PWY,PWY-6066,PWY-1822,PWY-6235,VALDEG-PWY,PWY-6233,PWY-6220,PWY0A-6303,PWY-6303,PWY-6607,PWY-7185,PWY-6606,PWY-6927,PWY-7170,PWY-641,PWY-6035,ASPASN-ARA-PWY,ASPARTATESYN-PWY,ASPARTATE-DEG1-PWY,PWY-3001,THRESYN-PWY,PWY-5064,PWY-5068,PWY-5086,PWY-6786,PWY-5453,PWY-5963,PWY-5669,PWY-6754,PWY-6756,PWY-6605,PWY-5098,PWY-6019,PWY4FS-7,PWY4FS-8,PWY-5269,PWY-6845,PWY-4983,PWY-6773,PWY0-1021,PWY-7250,PWY-6823,PWY-6115
0,0,37,1.0,0.5,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,0,0.0,0.5,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.2,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2,31,0.0,0.0,0.0,0.833333,0.833333,1.0,0.0,0.25,0.166667,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.333333,1.0,0.2,0.0,0.0,0.0,0.166667,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.125,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.333333,0.2,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.125,0.333333,0.333333,0.5,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.666667,0.666667,1.0,1.0,0.5,1.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.5,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.333333,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.5,0.0,0.0,0.0,0.0,0.0,0.5,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,3,45,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,4,40,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.333333,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,5,34,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,1.0,0.5,0.5,1.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.166667,0.0,0.0,0.166667,0.111111,0.0,0.142857,0.142857,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.166667,0.0,0.333333,0.333333,0.2,0.666667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,1.0,0.142857,0.25,0.071429,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.25,0.5,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,1.0,1.0,1.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,6,30,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.5,1.0,0.333333,1.0,1.0,0.333333,0.5,0.25,0.111111,0.222222,0.333333,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,7,23,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.833333,0.5,1.0,1.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.142857,0.6,0.666667,0.285714,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.125,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.333333,0.5,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.2,0.0,0.5,0.25,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.5,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.5,0.333333,0.25,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.666667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.5,0.5,0.0,0.0,0.0,0.0,0.0
8,8,20,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.5,0.25,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.5,0.5,0.6,1.0,0.333333,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,9,47,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.5,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Approach 3: Agglomerative clustering and sillhouette scores for each NLP method

In [42]:
from sklearn.metrics.cluster import silhouette_score
# Note that homogeneity scores don't fit for evaluating how close the clustering is to pathway membership, etc.
# This is becuase genes can be assigned to more than one pathway, metric would have to be changed to account for this.
# So all this section does is determines which values of n_clusters provide good clustering results for each matrix.
n_clusters_silhouette_scores = defaultdict(dict)
number_of_clusters = np.arange(10,400,4)
for n in number_of_clusters:
    for method in methods:
        distance_matrix = np.nan_to_num(graphs[method].array, nan=1)
        to_id = graphs[method].row_index_to_id
        ac = AgglomerativeClustering(n_clusters=n, linkage="complete", affinity="precomputed")
        clustering = ac.fit(distance_matrix)
        sil_score = silhouette_score(distance_matrix, clustering.labels_, metric="precomputed")
        n_clusters_silhouette_scores[method][n] = sil_score
sil_df = pd.DataFrame(n_clusters_silhouette_scores).reset_index(drop=False).rename({"index":"n"},axis="columns")
sil_df.to_csv(os.path.join(OUTPUT_DIR,"silhouette_scores.csv"), index=False)

<a id="merging"></a>
### Option 1: Merging in the previously curated similarity values from the Oellrich, Walls et al. (2015) dataset
This section reads in a file that contains the previously calculated distance values from the Oellrich, Walls et al. (2015) dataset, and merges it with the values which are obtained here for all of the applicable natural language processing or machine learning methods used, so that the graphs which are specified by these sets of distances values can be evaluated side by side in the subsequent sections.

In [43]:
# Add a column that indicates the distance estimated using curated EQ statements.
df = df.merge(right=pppn_edgelist.df, how="left", on=["from","to"])
df.fillna(value=0.000,inplace=True)
df.rename(columns={"value":"EQs"}, inplace=True)
df["EQs"] = 1-df["EQs"]
methods.append("EQs")
df.head(10)

Unnamed: 0,from,to,Doc2Vec Wikipedia:Size=300,Doc2Vec PubMed:Size=100,"Word2Vec Wikipedia:Size=300,Mean","Word2Vec Wikipedia:Size=300,Max","N-Grams:Full,Words,1-grams,2-grams","N-Grams:Full,Words,1-grams,2-grams,Binary","N-Grams:Full,Words,1-grams","N-Grams:Full,Words,1-grams,Binary","N-Grams:Full,Words,1-grams,2-grams,TFIDF","N-Grams:Full,Words,1-grams,2-grams,Binary,TFIDF","N-Grams:Full,Words,1-grams,TFIDF","N-Grams:Full,Words,1-grams,Binary,TFIDF","N-Grams:Simple,Words,1-grams,2-grams","N-Grams:Simple,Words,1-grams,2-grams,Binary","N-Grams:Simple,Words,1-grams","N-Grams:Simple,Words,1-grams,Binary","N-Grams:Simple,Words,1-grams,2-grams,TFIDF","N-Grams:Simple,Words,1-grams,2-grams,Binary,TFIDF","N-Grams:Simple,Words,1-grams,TFIDF","N-Grams:Simple,Words,1-grams,Binary,TFIDF","N-Grams:Full,Nouns,1-grams","N-Grams:Full,Nouns,1-grams,Binary","N-Grams:Full,Nouns,1-grams,TFIDF","N-Grams:Full,Nouns,1-grams,Binary,TFIDF","N-Grams:Full,Adjectives,1-grams","N-Grams:Full,Adjectives,1-grams,Binary","N-Grams:Full,Adjectives,1-grams,TFIDF","N-Grams:Full,Adjectives,1-grams,Binary,TFIDF",NOBLE Coder:Precise,NOBLE Coder:Partial,"NOBLE Coder:Precise,TFIDF","NOBLE Coder:Partial,TFIDF",GO:Default,PO:Default,EQs
106193,3444,3456,0.429865,0.410482,0.197785,0.13696,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.878704,0.948718,0.835917,0.923077,0.97705,0.985221,0.962636,0.974086,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.764706,0.552448,0.783348,0.620223,0.8,0.608696,1.0
106194,3444,3464,0.325759,0.338864,0.22879,0.132152,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.950481,0.988372,0.923806,0.977778,0.992536,0.997197,0.984144,0.992305,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.542169,0.678322,0.590104,0.787873,0.916667,0.241379,1.0
106195,3444,3469,0.278411,0.408278,0.330119,0.177175,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.920635,0.740385,0.956351,0.782949,0.866667,0.19708,1.0
106196,3444,3478,0.340357,0.225442,0.153962,0.105329,0.974148,0.984615,0.963666,0.972973,0.990146,0.989656,0.983402,0.974979,0.817614,0.95,0.760621,0.924528,0.96087,0.978731,0.935391,0.957633,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.376623,0.535484,0.416747,0.570573,0.928571,0.19708,1.0
106197,3456,3464,0.349974,0.136647,0.235639,0.104354,0.761333,0.9,0.675,0.84375,0.914752,0.942426,0.833081,0.88155,0.802915,0.931034,0.721382,0.888889,0.930451,0.960519,0.860541,0.910536,0.850929,0.9,0.927238,0.911297,0.622783,0.923077,0.808539,0.944382,0.69863,0.626667,0.708223,0.744976,0.9,0.682759,0.777778
106198,3456,3469,0.433688,0.436261,0.432078,0.180774,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.8,0.766667,0.89057,0.788039,0.928571,0.664234,0.952381
106199,3456,3478,0.391264,0.375514,0.210761,0.10216,0.950949,0.957143,0.92964,0.921053,0.978006,0.968144,0.956784,0.913238,0.886873,0.942857,0.847163,0.910714,0.97565,0.978123,0.9558,0.952663,1.0,1.0,1.0,1.0,0.916376,0.888889,0.944561,0.900988,0.7625,0.44586,0.807703,0.471332,0.916667,0.664234,0.956522
106200,3464,3469,0.249483,0.462069,0.210232,0.125712,0.904052,0.924528,0.88233,0.896552,0.938315,0.912869,0.913111,0.876735,0.956231,0.971014,0.926578,0.944444,0.974436,0.96787,0.931772,0.904403,1.0,1.0,1.0,1.0,0.852558,0.9,0.870375,0.819007,0.902778,0.773585,0.950425,0.854435,0.952381,0.055944,0.85
106201,3464,3478,0.352245,0.299576,0.166063,0.085506,0.902443,0.922078,0.887424,0.909091,0.954943,0.925319,0.94452,0.907283,0.899935,0.925926,0.875799,0.898305,0.958734,0.93659,0.941961,0.907391,1.0,1.0,1.0,1.0,0.942169,0.923077,0.931709,0.863573,0.488889,0.604938,0.56686,0.732438,0.888889,0.055944,0.863636
106202,3469,3478,0.354691,0.306102,0.257876,0.157193,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.907895,0.782609,0.951908,0.7991,0.956522,0.0,0.875


### Option 2: Merging with information about shared biochemical pathways or groups.
The relevant information for each edge includes questions like whether or not the two genes that edge connects share a group or biochemical pathway in common, or if those genes are from the same species. This information can then later be used as the target values for predictive models, or for filtering the graphs represented by these edge lists. Either the grouping information or the protein-protein interaction information should be used.

In [45]:
# Column indicating whether or not the two genes share this features (e.g., pathway in common, same group).
df["shared"] = df[["from","to"]].apply(lambda x: len(set(id_to_group_ids[x["from"]]).intersection(set(id_to_group_ids[x["to"]])))>0, axis=1)*1
# Column indicating whether the two genes are from the same species.
species_dict = dataset.get_species_dictionary()
df["same"] = df[["from","to"]].apply(lambda x: species_dict[x["from"]]==species_dict[x["to"]],axis=1)*1
print(Counter(df["shared"].values))
print(Counter(df["same"].values))

Counter({0: 105060, 1: 1143})
Counter({1: 106203})


### Option 3: Merging with information about protein-protein interactions.

In [None]:
# Merging information from the protein-protein interaction database with this dataset.
df = df.merge(right=string_data.df, how="left", on=["from","to"])
df.fillna(value=0,inplace=True)
df["shared"] = (df["combined_score"] != 0.00)*1
df.tail(12)

<a id="ensemble"></a>
### Combining multiple distances measurements into summarizing distance values
The purpose of this section is to iteratively train models on subsections of the dataset using simple regression or machine learning approaches to predict a value from zero to one indicating indicating how likely is it that two genes share atleast one of the specified groups in common. The information input to these models is the distance scores provided by each method in some set of all the methods used in this notebook. The purpose is to see whether or not a function of these similarity scores specifically trained to the task of predicting common groupings is better able to used the distance metric information to report a score for this task.

In [46]:
# Get the average distance percentile as a means of combining multiple scores.
method = "Mean"
df[method] = df[methods].rank(pct=True).mean(axis=1)
methods.append(method)

In [None]:
# Iteratively create models for combining output values from multiple semantic similarity methods.
# Problem with this method in that the predictors are going to be highly correlated.
method = "Logistic Regression"
splits = 12
kf = KFold(n_splits=splits, random_state=14271, shuffle=True)
df[method] = pd.Series()
for train,test in kf.split(df):
    lr_model = train_logistic_regression_model(df=df.iloc[train], predictor_columns=methods, target_column="shared")
    df[method].iloc[test] = apply_logistic_regression_model(df=df.iloc[test], predictor_columns=methods, model=lr_model)
df[method] = 1-df[method]
methods.append(method)

In [None]:
# Iteratively create models for combining output values from multiple semantic similarity methods.
# Problem with overfitting if the duplicates between descriptions are not removed between the training and testing.
method = "Random Forest"
splits = 2
kf = KFold(n_splits=splits, random_state=14271, shuffle=True)
df[method] = pd.Series()
for train,test in kf.split(df):
    rf_model = train_random_forest_model(df=df.iloc[train], predictor_columns=methods, target_column="shared")
    df[method].iloc[test] = apply_random_forest_model(df=df.iloc[test],predictor_columns=methods, model=rf_model)
df[method] = 1-df[method]
methods.append(method)

<a id="ks"></a>
### Do the edges joining genes that share a group, pathway, or interaction come from a different distribution?
The purpose of this section is to visualize kernel estimates for the distributions of distance or similarity scores generated by each of the methods tested for measuring semantic similarity or generating vector representations of the phenotype descriptions. Ideally, better methods should show better separation betwene the distributions for distance values between two genes involved in a common specified group or two genes that are not. Additionally, a statistical test is used to check whether these two distributions are significantly different from each other or not, although this is a less informative measure than the other tests used in subsequent sections, because it does not address how useful these differences in the distributions actually are for making predictions about group membership.

In [48]:
# Use Kolmogorov-Smirnov test to see if edges between genes that share a group come from a distinct distribution.
ppi_pos_dict = {name:(df[df["shared"] > 0.00][name].values) for name in methods}
ppi_neg_dict = {name:(df[df["shared"] == 0.00][name].values) for name in methods}
for name in methods:
    stat,p = ks_2samp(ppi_pos_dict[name],ppi_neg_dict[name])
    pos_mean = np.average(ppi_pos_dict[name])
    neg_mean = np.average(ppi_neg_dict[name])
    pos_n = len(ppi_pos_dict[name])
    neg_n = len(ppi_neg_dict[name])
    TABLE[name].update({"mean_1":pos_mean, "mean_0":neg_mean, "n_1":pos_n, "n_0":neg_n})
    TABLE[name].update({"ks":stat, "ks_pval":p})
    
    
# Show the kernel estimates for each distribution of weights for each method.
num_plots, plots_per_row, row_width, row_height = (len(methods), 4, 14, 3)
fig,axs = plt.subplots(math.ceil(num_plots/plots_per_row), plots_per_row, squeeze=False)
for name,ax in zip(methods,axs.flatten()):
    ax.set_title(name)
    ax.set_xlabel("value")
    ax.set_ylabel("density")
    sns.kdeplot(ppi_pos_dict[name], color="black", shade=False, alpha=1.0, ax=ax)
    sns.kdeplot(ppi_neg_dict[name], color="black", shade=True, alpha=0.1, ax=ax) 
fig.set_size_inches(row_width, row_height*math.ceil(num_plots/plots_per_row))
fig.tight_layout()
fig.savefig(os.path.join(OUTPUT_DIR,"kernel_density.png"),dpi=400)
plt.close()

<a id="within"></a>
### Looking at within-group or within-pathway distances in each graph
The purpose of this section is to determine which methods generated graphs which tightly group genes which share common pathways or group membership with one another. In order to compare across different methods where the distance value distributions are different, the mean distance values for each group for each method are convereted to percentile scores. Lower percentile scores indicate that the average distance value between any two genes that belong to that group is lower than most of the distance values in the entire distribution for that method.

In [49]:
# Get all the average within-pathway phenotype distance values for each method for each particular pathway.
group_id_to_ids = groups.get_group_id_to_ids_dict(dataset.get_gene_dictionary())
group_ids = list(group_id_to_ids.keys())
graph = IndexedGraph(df)
within_weights_dict = defaultdict(lambda: defaultdict(list))
within_percentiles_dict = defaultdict(lambda: defaultdict(list))
all_weights_dict = {}
for method in methods:
    all_weights_dict[method] = df[method].values
    for group in group_ids:
        within_ids = group_id_to_ids[group]
        within_pairs = [(i,j) for i,j in itertools.permutations(within_ids,2)]
        mean_weight = np.mean((graph.get_values(within_pairs, kind=method)))
        within_weights_dict[method][group] = mean_weight
        within_percentiles_dict[method][group] = stats.percentileofscore(df[method].values, mean_weight, kind="rank")

# Generating a dataframe of percentiles of the mean in-group distance scores.
within_dist_data = pd.DataFrame(within_percentiles_dict)
within_dist_data = within_dist_data.dropna(axis=0, inplace=False)
within_dist_data = within_dist_data.round(4)

# Adding relevant information to this dataframe and saving.
within_dist_data["mean_rank"] = within_dist_data.rank().mean(axis=1)
within_dist_data["mean_percentile"] = within_dist_data.mean(axis=1)
within_dist_data.sort_values(by="mean_percentile", inplace=True)
within_dist_data.reset_index(inplace=True)
within_dist_data["group_id"] = within_dist_data["index"]
within_dist_data["full_name"] = within_dist_data["group_id"].apply(lambda x: groups.get_long_name(x))
within_dist_data["n"] = within_dist_data["group_id"].apply(lambda x: len(group_id_to_ids[x]))
within_dist_data = within_dist_data[flatten(["group_id","full_name","n","mean_percentile","mean_rank",methods])]
within_dist_data.to_csv(os.path.join(OUTPUT_DIR,"within_distances.csv"), index=False)
within_dist_data.head(5)

Unnamed: 0,group_id,full_name,n,mean_percentile,mean_rank,Doc2Vec Wikipedia:Size=300,Doc2Vec PubMed:Size=100,"Word2Vec Wikipedia:Size=300,Mean","Word2Vec Wikipedia:Size=300,Max","N-Grams:Full,Words,1-grams,2-grams","N-Grams:Full,Words,1-grams,2-grams,Binary","N-Grams:Full,Words,1-grams","N-Grams:Full,Words,1-grams,Binary","N-Grams:Full,Words,1-grams,2-grams,TFIDF","N-Grams:Full,Words,1-grams,2-grams,Binary,TFIDF","N-Grams:Full,Words,1-grams,TFIDF","N-Grams:Full,Words,1-grams,Binary,TFIDF","N-Grams:Simple,Words,1-grams,2-grams","N-Grams:Simple,Words,1-grams,2-grams,Binary","N-Grams:Simple,Words,1-grams","N-Grams:Simple,Words,1-grams,Binary","N-Grams:Simple,Words,1-grams,2-grams,TFIDF","N-Grams:Simple,Words,1-grams,2-grams,Binary,TFIDF","N-Grams:Simple,Words,1-grams,TFIDF","N-Grams:Simple,Words,1-grams,Binary,TFIDF","N-Grams:Full,Nouns,1-grams","N-Grams:Full,Nouns,1-grams,Binary","N-Grams:Full,Nouns,1-grams,TFIDF","N-Grams:Full,Nouns,1-grams,Binary,TFIDF","N-Grams:Full,Adjectives,1-grams","N-Grams:Full,Adjectives,1-grams,Binary","N-Grams:Full,Adjectives,1-grams,TFIDF","N-Grams:Full,Adjectives,1-grams,Binary,TFIDF",NOBLE Coder:Precise,NOBLE Coder:Partial,"NOBLE Coder:Precise,TFIDF","NOBLE Coder:Partial,TFIDF",GO:Default,PO:Default,EQs,Mean
0,PWY-7047,malate-oxaloacetate shuttle I,2,1.259041,5.875,0.0009,18.5579,0.1563,0.072,0.0508,0.0508,0.0574,0.0325,0.0508,0.0508,0.0574,0.0325,0.0518,0.0527,0.0574,0.064,0.0518,0.0527,0.0574,0.064,0.065,0.1587,0.065,0.0395,0.1441,0.2509,0.1441,0.073,0.2114,0.1003,0.1549,0.0956,0.0636,18.7264,0.7895,0.0056
1,PWY-7048,malate-oxaloacetate shuttle II,2,1.259041,5.875,0.0009,18.5579,0.1563,0.072,0.0508,0.0508,0.0574,0.0325,0.0508,0.0508,0.0574,0.0325,0.0518,0.0527,0.0574,0.064,0.0518,0.0527,0.0574,0.064,0.065,0.1587,0.065,0.0395,0.1441,0.2509,0.1441,0.073,0.2114,0.1003,0.1549,0.0956,0.0636,18.7264,0.7895,0.0056
2,PWY-5992,thalianol and derivatives biosynthesis,2,2.254777,13.958333,0.0923,55.9108,0.178,0.1789,0.1271,0.0829,0.097,0.0716,0.1394,0.0763,0.161,0.08,0.1139,0.0697,0.1017,0.0791,0.097,0.065,0.1318,0.0772,0.145,0.331,0.1412,0.1008,1.7443,0.9858,2.8078,1.0207,0.4571,0.3145,0.3211,0.3051,0.1139,1.943,0.7895,0.0169
3,ARGASEDEG-PWY,L-arginine degradation I (arginase pathway),2,2.549923,8.819444,0.0019,2.8342,0.1361,0.072,0.0122,0.0254,0.0231,0.0325,0.0254,0.0254,0.0287,0.0325,0.0113,0.0264,0.0231,0.032,0.0259,0.0264,0.0287,0.032,0.0292,0.1587,0.0325,0.0395,0.0457,0.2509,0.0706,0.073,0.2114,0.1003,0.1549,0.0956,0.8794,79.1131,0.7895,0.0282
4,PWY-6066,IAA biosynthesis VII,2,3.275841,31.972222,0.6827,26.6433,1.2448,2.5188,0.5876,0.8456,0.7015,1.1765,0.5169,0.4755,0.5951,0.6488,1.2749,0.8973,1.8559,1.4854,0.5461,0.6017,0.6648,0.9162,0.3879,3.4552,0.3484,2.1996,4.199,1.4227,2.7805,0.7448,9.6753,5.4113,2.3417,1.0565,0.8794,8.6335,0.7895,0.0292


<a id="auc"></a>
### Predicting whether two genes belong to the same group, pathway, or share an interaction
The purpose of this section is to see if whether or not two genes share atleast one common pathway can be predicted from the distance scores assigned using analysis of text similarity. The evaluation of predictability is done by reporting a precision and recall curve for each method, as well as remembering the area under the curve, and ratio between the area under the curve and the baseline (expected area when guessing randomly) for each method.

In [50]:
y_true_dict = {name:df["shared"] for name in methods}
y_prob_dict = {name:(1 - df[name].values) for name in methods}
num_plots, plots_per_row, row_width, row_height = (len(methods), 4, 14, 3)
fig,axs = plt.subplots(math.ceil(num_plots/plots_per_row), plots_per_row, squeeze=False)
for method,ax in zip(methods, axs.flatten()):
    
    # Obtaining the values and metrics.
    y_true, y_prob = y_true_dict[method], y_prob_dict[method]
    n_pos, n_neg = Counter(y_true)[1], Counter(y_true)[0]
    precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
    baseline = Counter(y_true)[1]/len(y_true) 
    area = auc(recall, precision)
    auc_to_baseline_auc_ratio = area/baseline
    TABLE[method].update({"auc":area, "baseline":baseline, "ratio":auc_to_baseline_auc_ratio})

    # Producing the precision recall curve.
    step_kwargs = ({'step': 'post'} if 'step' in signature(plt.fill_between).parameters else {})
    ax.step(recall, precision, color='black', alpha=0.2, where='post')
    ax.fill_between(recall, precision, alpha=0.7, color='black', **step_kwargs)
    ax.axhline(baseline, linestyle="--", color="lightgray")
    ax.set_xlabel('Recall')
    ax.set_ylabel('Precision')
    ax.set_ylim([0.0, 1.05])
    ax.set_xlim([0.0, 1.0])
    ax.set_title("PR {0} (Baseline={1:0.3f})".format(method, baseline))
    
fig.set_size_inches(row_width, row_height*math.ceil(num_plots/plots_per_row))
fig.tight_layout()
fig.savefig(os.path.join(OUTPUT_DIR,"prcurve_shared.png"),dpi=400)
plt.close()

<a id="y"></a>
### Are genes in the same group or pathway ranked higher with respect to individual nodes?
This is a way of statistically seeing if for some value k, the graph ranks more edges from some particular gene to any other gene that it has a true protein-protein interaction with higher or equal to rank k, than we would expect due to random chance. This way of looking at the problem helps to be less ambiguous than the previous methods, because it gets at the core of how this would actually be used. In other words, we don't really care how much true information we're missing as long as we're still able to pick up some new useful information by building these networks, so even though we could be missing a lot, what's going on at the very top of the results? These results should be comparable to very strictly thresholding the network and saying that the remaining edges are our guesses at interactions. This is comparable to just looking at the far left-hand side of the precision recall curves, but just quantifies it slightly differently.

In [None]:
# When the edgelist is generated above, only the lower triangle of the pairwise matrix is retained for edges in the 
# graph. This means that in terms of the indices of each node, only the (i,j) node is listed in the edge list where
# i is less than j. This makes sense because the graph that's specified is assumed to already be undirected. However
# in order to be able to easily subset the edgelist by a single column to obtain rows that correspond to all edges
# connected to a particular node, this method will double the number of rows to include both (i,j) and (j,i) edges.
df = pw.make_undirected(df)

# What's the number of functional partners ranked k or higher in terms of phenotypic description similarity for 
# each gene? Also figure out the maximum possible number of functional partners that could be theoretically
# recovered in this dataset if recovered means being ranked as k or higher here.
k = 10      # The threshold of interest for gene ranks.
n = 100     # Number of Monte Carlo simulation iterations to complete.
df[list(methods)] = df.groupby("from")[list(methods)].rank()
ys = df[df["shared"]==1][list(methods)].apply(lambda s: len([x for x in s if x<=k]))
ymax = sum(df.groupby("from")["shared"].apply(lambda s: min(len([x for x in s if x==1]),k)))

# Monte Carlo simulation to see what the probability is of achieving each y-value by just randomly pulling k 
# edges for each gene rather than taking the top k ones that the similarity methods specifies when ranking.
ysims = [sum(df.groupby("from")["shared"].apply(lambda s: len([x for x in s.sample(k) if x>0.00]))) for i in range(n)]
for method in methods:
    pvalue = len([ysim for ysim in ysims if ysim>=ys[method]])/float(n)
    TABLE[method].update({"y":ys[method], "y_max":ymax, "y_ratio":ys[method]/ymax, "y_pval":pvalue})

<a id="mean"></a>
### Predicting biochemical pathway or group membership based on mean vectors
This section looks at how well the biochemical pathways that a particular gene is a member of can be predicted based on the similarity between the vector representation of the phenotype descriptions for that gene and the average vector for all the vector representations of phenotypes asociated with genes that belong to that particular pathway. In calculating the average vector for a given biochemical pathway, the vector corresponding to the gene that is currently being classified is not accounted for, to avoid overestimating the performance by including information about the ground truth during classification. This leads to missing information in the case of biochemical pathways that have only one member. This can be accounted for by only limiting the overall dataset to only include genes that belong to pathways that have atleast two genes mapped to them, and only including those pathways, or by removing the missing values before calculating the performance metrics below.

In [51]:
# Get the list of methods to look at, and a mapping between each method and the correct similarity metric to apply.
vector_dicts = {k:v.vector_dictionary for k,v in graphs.items()}
methods = list(vector_dicts.keys())
group_id_to_ids = groups.get_group_id_to_ids_dict(dataset.get_gene_dictionary())
valid_group_ids = [group for group,id_list in group_id_to_ids.items() if len(id_list)>1]
valid_ids = [i for i in dataset.get_ids() if len(set(valid_group_ids).intersection(set(id_to_group_ids[i])))>0]
pred_dict = defaultdict(lambda: defaultdict(dict))
true_dict = defaultdict(lambda: defaultdict(dict))
for method in methods:
    for group in valid_group_ids:
        ids = group_id_to_ids[group]
        for identifier in valid_ids:
            # What's the mean vector of this group, without this particular one that we're trying to classify.
            vectors = np.array([vector_dicts[method][some_id] for some_id in ids if not some_id==identifier])
            mean_vector = vectors.mean(axis=0)
            this_vector = vector_dicts[method][identifier]
            pred_dict[method][identifier][group] = 1-metric_dict[method](mean_vector, this_vector)
            true_dict[method][identifier][group] = (identifier in group_id_to_ids[group])*1                

In [None]:
num_plots, plots_per_row, row_width, row_height = (len(methods), 4, 14, 3)
fig,axs = plt.subplots(math.ceil(num_plots/plots_per_row), plots_per_row, squeeze=False)
for method,ax in zip(methods, axs.flatten()):
    
    # Obtaining the values and metrics.
    y_true = pd.DataFrame(true_dict[method]).as_matrix().flatten()
    y_prob = pd.DataFrame(pred_dict[method]).as_matrix().flatten()
    n_pos, n_neg = Counter(y_true)[1], Counter(y_true)[0]
    precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
    baseline = Counter(y_true)[1]/len(y_true) 
    area = auc(recall, precision)
    auc_to_baseline_auc_ratio = area/baseline
    TABLE[method].update({"mean_auc":area, "mean_baseline":baseline, "mean_ratio":auc_to_baseline_auc_ratio})

    # Producing the precision recall curve.
    step_kwargs = ({'step': 'post'} if 'step' in signature(plt.fill_between).parameters else {})
    ax.step(recall, precision, color='black', alpha=0.2, where='post')
    ax.fill_between(recall, precision, alpha=0.7, color='black', **step_kwargs)
    ax.axhline(baseline, linestyle="--", color="lightgray")
    ax.set_xlabel('Recall')
    ax.set_ylabel('Precision')
    ax.set_ylim([0.0, 1.05])
    ax.set_xlim([0.0, 1.0])
    ax.set_title("PR {0} (Baseline={1:0.3f})".format(method[:10], baseline))
    
fig.set_size_inches(row_width, row_height*math.ceil(num_plots/plots_per_row))
fig.tight_layout()
fig.savefig(os.path.join(OUTPUT_DIR,"prcurve_mean_classifier.png"),dpi=400)
plt.close()

### Predicting biochemical pathway membership based on mean similarity values
This section looks at how well the biochemical pathways that a particular gene is a member of can be predicted based on the average similarity between the vector representationt of the phenotype descriptions for that gene and each of the vector representations for other phenotypes associated with genes that belong to that particular pathway. In calculating the average similarity to other genes from a given biochemical pathway, the gene that is currently being classified is not accounted for, to avoid overestimating the performance by including information about the ground truth during classification. This leads to missing information in the case of biochemical pathways that have only one member. This can be accounted for by only limiting the overall dataset to only include genes that belong to pathways that have atleast two genes mapped to them, and only including those pathways, or by removing the missing values before calculating the performance metrics below.

### Predicting biochemical pathway or group membership with KNN classifier
This section looks at how well the group(s) or biochemical pathway(s) that a particular gene belongs to can be predicted based on a KNN classifier generated using every other gene. For this section, only the groups or pathways which contain more than one gene, and the genes mapped to those groups or pathways, are of interest. This is because for other genes, if we consider them then it will be true that that gene belongs to that group in the target vector, but the KNN classifier could never predict this because when that gene is held out, nothing could provide a vote for that group, because there are zero genes available to be members of the K nearest neighbors.

<a id="output"></a>
### Summarizing the results for this notebook
Write a large table of results to an output file. Columns are generally metrics and rows are generally methods.

In [None]:
results = pd.DataFrame(TABLE).transpose()
columns = flatten(["Hyperparams","Group","Order","Topic","Data",results.columns])
results["Hyperparams"] = ""
results["Group"] = ""
results["Order"] = np.arange(results.shape[0])
results["Topic"] = TOPIC
results["Data"] = DATA
results = results[columns]
results.reset_index(inplace=True)
results = results.rename({"index":"Method"}, axis="columns")
hyperparam_sep = ":"
results["Hyperparams"] = results["Method"].map(lambda x: x.split(hyperparam_sep)[1] if hyperparam_sep in x else "None")
results["Method"] = results["Method"].map(lambda x: x.split(hyperparam_sep)[0])
results.to_csv(os.path.join(OUTPUT_DIR,"full_table.csv"), index=False)
results