# Looking at the Dataset
The purpose of this notebook is to look more closely at the dataset of genes, natural language descriptions, and ontology term annotations used in this work. As described in the preprocessing notebooks, these data are drawn either from publication supplements such as Oellrich, Walls et al. (2015) or from model species databases such as TAIR, MaizeGDB, and SGN. The datasets are already loaded and merged using the `Dataset` class available through the oats Python package.

In [1]:
import datetime
import nltk
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
import pandas as pd
import numpy as np
import time
import math
import sys
import gensim
import os
import warnings
import torch
import itertools
import multiprocessing as mp
from collections import Counter, defaultdict
from inspect import signature
from scipy.stats import ks_2samp, hypergeom
from sklearn.metrics import precision_recall_curve, f1_score, auc
from sklearn.model_selection import train_test_split, KFold
from scipy import spatial, stats
from statsmodels.sandbox.stats.multicomp import multipletests
from nltk.corpus import brown
from nltk.tokenize import word_tokenize, sent_tokenize
from sklearn.neighbors import KNeighborsClassifier
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM
from gensim.parsing.preprocessing import strip_non_alphanum, stem_text, preprocess_string, remove_stopwords
from gensim.utils import simple_preprocess
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.decomposition import LatentDirichletAllocation as LDA
from sklearn.cluster import AgglomerativeClustering

sys.path.append("../../oats")
from oats.utils.utils import save_to_pickle, load_from_pickle, merge_list_dicts, flatten, to_hms
from oats.datasets.dataset import Dataset
from oats.datasets.groupings import Groupings
from oats.annotation.ontology import Ontology
from oats.datasets.string import String
from oats.datasets.edges import Edges
from oats.annotation.annotation import annotate_using_noble_coder
from oats.graphs import pairwise as pw
from oats.graphs.editing import merge_edgelists, make_undirected, remove_self_loops, subset_edgelist_with_ids
from oats.graphs.indexed import IndexedGraph
from oats.graphs.weighting import train_logistic_regression_model, apply_logistic_regression_model
from oats.graphs.weighting import train_random_forest_model, apply_random_forest_model
from oats.nlp.vocabulary import get_overrepresented_tokens, get_vocabulary_from_tokens
from oats.nlp.vocabulary import reduce_vocabulary_connected_components, reduce_vocabulary_linares_pontes
from oats.utils.utils import function_wrapper_with_duration
from oats.nlp.preprocess import concatenate_with_bar_delim

mpl.rcParams["figure.dpi"] = 400
warnings.simplefilter('ignore')
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
nltk.download('punkt', quiet=True)
nltk.download('brown', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)

True

In [2]:
data = load_from_pickle("../data/pickles/gene_phenotype_dataset_all_text_and_annotations_unmerged.pickle")
data.to_pandas().head()
data.describe()

Unnamed: 0,species,num_genes,unique_descriptions
0,ath,623507,9110
1,gmx,93,48
2,mtr,194,154
3,osa,434,388
4,sly,522,313
5,zma,6163,998
6,total,630913,11011


In [3]:
data = load_from_pickle("../data/pickles/gene_phenotype_dataset_all_text_and_annotations.pickle")
data.filter_has_description()
data.to_pandas().head()
data.describe()

Unnamed: 0,species,num_genes,unique_descriptions
0,ath,6364,3813
1,gmx,30,24
2,mtr,37,36
3,osa,92,85
4,sly,70,70
5,zma,1406,811
6,total,7999,4839


### What's there for each species?
The dataset loaded above contains all of the genes across six plant species that have natural language descriptions of phenotype(s) related to that gene. Each gene can have multiple descriptions annotated to it, which were concatenated when the datasets from multiple sources were merged to create the pickled datasets. Arabidopsis has the highest number of genes that satisfy this criterion, followed by maize; the other four species have relatively few such genes, at least given the sources used for this work. Note that the number of unique descriptions never exceeds the number of genes, and is lower for every species except sly, because multiple genes can have the same phenotype description associated with them.
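
The relationship between gene counts and unique description counts can be illustrated with a minimal sketch. It does not reproduce how `data.describe()` is implemented, which is not shown in this notebook. The toy records and the `gene_id` column name are hypothetical; only `species` and `description` are column names used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical toy records standing in for the output of data.to_pandas().
# The "gene_id" column name is an assumption made for illustration.
toy = pd.DataFrame({
    "species": ["ath", "ath", "ath", "zma", "zma"],
    "gene_id": [1, 2, 3, 4, 5],
    "description": [
        "dwarf plants",
        "dwarf plants",   # same text as the gene above
        "pale leaves",
        "yellow kernel",
        "white kernel",
    ],
})

# Count genes and distinct description strings for each species.
summary = toy.groupby("species").agg(
    num_genes=("gene_id", "count"),
    unique_descriptions=("description", "nunique"),
).reset_index()

print(summary)
```

In this toy table, two of the three `ath` genes share a description, so the unique description count falls below the gene count, while the two counts match for `zma`.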

In [4]:
tokens = {}
vocabs = {}
# Convert the dataset once instead of on every loop iteration.
full_df = data.to_pandas()
for species in data.get_species():
    subset = full_df[full_df["species"]==species]
    # Lowercase and tokenize each description, then pool the tokens for this species.
    tokenized = [simple_preprocess(d) for d in subset["description"].values]
    tokens[species] = flatten(tokenized)
    # The vocabulary is the set of distinct tokens.
    vocabs[species] = set(tokens[species])
    
# Drop the final "total" row, then add vocabulary size and token count per species.
df = data.describe()[:-1]
df["vocab_size"] = df["species"].map(lambda x: len(vocabs[x]))
df["total_tokens"] = df["species"].map(lambda x: len(tokens[x]))
df

Unnamed: 0,species,num_genes,unique_descriptions,vocab_size,total_tokens
0,ath,6364,3813,7085,264189
1,gmx,30,24,81,233
2,mtr,37,36,718,2672
3,osa,92,85,826,3887
4,sly,70,70,577,1810
5,zma,1406,811,1846,50029


### How do the vocabularies used for different species compare?
One of our goals is to recover phenotype similarity between different species in order to identify phenologs (phenotypes in different species that share some underlying genetic cause). How feasible this is with a dataset like this one depends on how much the vocabularies used to describe phenotypes differ between species. Because the Arabidopsis and maize datasets are the largest, we compare the vocabularies used to describe the phenotypes associated with genes from these two species.
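
One simple way to summarize how much two vocabularies overlap is the Jaccard index, the size of the intersection divided by the size of the union. This measure is not used in this notebook; the sketch below is only an illustration, and the `jaccard` helper and the toy vocabularies are hypothetical.

```python
def jaccard(vocab_a, vocab_b):
    """Return |A & B| / |A | B| for two collections of tokens."""
    a, b = set(vocab_a), set(vocab_b)
    union = a | b
    # Define the overlap of two empty vocabularies as 0.0 to avoid dividing by zero.
    return len(a & b) / len(union) if union else 0.0

# Hypothetical toy vocabularies, not taken from the dataset.
toy_ath = {"leaf", "root", "embryo"}
toy_zma = {"leaf", "root", "kernel", "endosperm"}

overlap = jaccard(toy_ath, toy_zma)
print(overlap)
```

With the objects defined in this notebook, the same helper could be called as `jaccard(vocabs["ath"], vocabs["zma"])`.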

In [6]:
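from nltk.corpus import stopwords
from nltk.probability import FreqDist
# The stopword list is not among the NLTK resources downloaded in the first cell.
nltk.download('stopwords', quiet=True)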
fdist_zma = FreqDist(tokens["zma"])
fdist_ath = FreqDist(tokens["ath"])
union_vocab = vocabs["zma"].union(vocabs["ath"])
table = pd.DataFrame({"token":list(union_vocab)})
stops = set(stopwords.words('english'))
table = table[~table.token.isin(stops)]
table["part_of_speech"] = table["token"].map(lambda x: nltk.pos_tag([x])[0][1][:2])
table["ath_freq"] = table["token"].map(lambda x: fdist_ath[x])
table["ath_rate"] = table["ath_freq"]*100/len(tokens["ath"])
table["zma_freq"] = table["token"].map(lambda x: fdist_zma[x])
table["zma_rate"] = table["zma_freq"]*100/len(tokens["zma"])
table["diff"] = table["ath_rate"]-table["zma_rate"]
table.head()

Unnamed: 0,token,part_of_speech,ath_freq,ath_rate,zma_freq,zma_rate,diff
0,phenotypein,NN,1,0.000379,0,0.0,0.000379
1,beneficial,JJ,1,0.000379,0,0.0,0.000379
2,diminished,VB,11,0.004164,0,0.0,0.004164
3,edr,NN,9,0.003407,0,0.0,0.003407
4,thiamine,NN,60,0.022711,0,0.0,0.022711
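
The rate columns above express each token's count as a percentage of all tokens for that species, and `diff` is the Arabidopsis rate minus the maize rate. The sketch below restates that calculation with `collections.Counter` on hypothetical toy token lists. The `rate_differences` helper is an illustration rather than part of oats, and it assumes both token lists are non-empty.

```python
from collections import Counter

def rate_differences(tokens_a, tokens_b):
    """Map each token to (rate_a, rate_b, rate_a - rate_b), with rates in percent."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    result = {}
    for token in set(counts_a) | set(counts_b):
        # A Counter returns 0 for tokens it has not seen.
        rate_a = 100 * counts_a[token] / len(tokens_a)
        rate_b = 100 * counts_b[token] / len(tokens_b)
        result[token] = (rate_a, rate_b, rate_a - rate_b)
    return result

# Hypothetical toy token lists, not taken from the dataset.
toy_a = ["embryo", "embryo", "leaf", "root"]
toy_b = ["kernel", "leaf", "leaf", "leaf"]
rates = rate_differences(toy_a, toy_b)
print(rates["embryo"])

# Arithmetic for a single token, using counts reported in this notebook:
# "embryo" occurs 4410 times among the 264189 Arabidopsis tokens.
embryo_ath_rate = 100 * 4410 / 264189
print(round(embryo_ath_rate, 3))
```

A positive difference means the token is relatively more common in the first list, and a negative difference means it is relatively more common in the second.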


In [7]:
# What are the tokens more frequently used for Arabidopsis than maize?
table.sort_values(by="diff", ascending=False, inplace=True)
table.head(30)

Unnamed: 0,token,part_of_speech,ath_freq,ath_rate,zma_freq,zma_rate,diff
2517,embryo,NN,4410,1.66926,146,0.291831,1.377429
6950,mutant,NN,3504,1.326323,43,0.08595,1.240373
7164,phenotype,NN,3316,1.255162,53,0.105939,1.149223
3149,wild,NN,2456,0.929637,7,0.013992,0.915646
3104,type,NN,2490,0.942507,14,0.027984,0.914523
662,defective,JJ,3341,1.264625,285,0.56967,0.694955
5320,reduced,VB,2860,1.082558,216,0.43175,0.650809
2723,root,NN,1862,0.704798,42,0.083951,0.620847
245,plants,NN,2293,0.867939,146,0.291831,0.576109
133,growth,NN,1830,0.692686,79,0.157908,0.534778


In [8]:
# What are the tokens more frequently used for maize than Arabidopsis?
table.sort_values(by="diff", ascending=True, inplace=True)
table.head(30)

Unnamed: 0,token,part_of_speech,ath_freq,ath_rate,zma_freq,zma_rate,diff
5344,endosperm,NN,124,0.046936,1078,2.15475,-2.107814
5783,seedling,VB,636,0.240737,925,1.848928,-1.608191
7546,yellow,NN,304,0.115069,775,1.549102,-1.434032
6100,kernel,NN,0,0.0,689,1.377201,-1.377201
178,leaf,NN,1258,0.476174,922,1.842931,-1.366757
1930,green,JJ,884,0.334609,779,1.557097,-1.222488
6871,white,JJ,375,0.141944,642,1.283256,-1.141312
4572,plant,NN,412,0.155949,449,0.897479,-0.741531
7536,albino,NN,222,0.084031,396,0.791541,-0.70751
556,usually,RB,46,0.017412,353,0.705591,-0.688179


In [9]:
# Does the mean absolute rate difference vary between parts of speech?
table["abs_diff"] = table["diff"].abs()
# Average only the numeric abs_diff column; recent pandas versions raise an error
# when mean() is applied to a frame that still contains the string token column.
pos_table = table.groupby("part_of_speech")[["abs_diff"]].mean()
pos_table.sort_values(by="abs_diff", inplace=True, ascending=False)
pos_table.reset_index()

Unnamed: 0,part_of_speech,abs_diff
0,MD,0.058763
1,CD,0.027431
2,IN,0.026202
3,JJ,0.023842
4,DT,0.019077
5,RB,0.016582
6,NN,0.013349
7,VB,0.011641
8,CC,0.007617
9,WP,0.00162
