## Reading in data available through MaizeGDB (Maize Genetics and Genomics Database)
The purpose of this notebook is to read in and run a preliminary analysis of the text-description data available through MaizeGDB. The data was provided as an input file on request by the MaizeGDB curators, rather than obtained from a file already available in the database. The data needs to be organized and restructured into a standard format so that it can be easily combined with datasets from other resources. This notebook takes the input files listed below, which were obtained from MaizeGDB, and produces a set of files with standard columns: species name, gene names, gene synonyms, text descriptions, and ontology term annotations. The gene names column includes unique gene accessions, names, symbols, or identifiers. The gene synonyms column holds strings that are not necessarily unique identifiers for a particular gene but still refer to that gene or describe its function.

### Files read
```
phenologs-with-oats/data/gene_related_files/maizegdb/pheno_genes.txt
phenologs-with-oats/data/gene_related_files/maizegdb/maize_v3.gold.gaf
```


### Files created
```
phenologs-with-oats/data/reshaped_files/zma_phenotypes.csv
phenologs-with-oats/data/reshaped_files/zma_high_confidence_go_annotations.csv
```

In [1]:
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
import sys
import os
import warnings
import pandas as pd
import numpy as np
import itertools
import re
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize

sys.path.append("../../oats")
from oats.utils.constants import NCBI_TAG, UNIPROT_TAG
from oats.utils.constants import EVIDENCE_CODES
from oats.utils.utils import to_abbreviation
from oats.nlp.preprocess import concatenate_with_bar_delim
from oats.nlp.preprocess import other_delim_to_bar_delim
from oats.nlp.preprocess import remove_punctuation
from oats.nlp.preprocess import remove_enclosing_brackets
from oats.nlp.preprocess import concatenate_descriptions
from oats.nlp.preprocess import add_prefix

mpl.rcParams["figure.dpi"] = 200
warnings.simplefilter('ignore')
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

In [2]:
OUTPUT_DIR = "../data/reshaped_files"

### File with genes and phenotype descriptions (pheno_genes.txt)
Note that `fillna` is used here to replace missing values with an empty string. This is done so that missing values are counted when checking the number of occurrences of unique values in different columns (see the analysis below). It is not necessary as a preprocessing step, however, because any missing values or empty strings are handled later, when the data is read in and appended to a dataset object.

In [3]:
filename = "../data/gene_related_files/maizegdb/pheno_genes.txt"
usecols = ["phenotype_name", "phenotype_description", "locus_name", "alleles", "locus_synonyms", "v3_gene_model", "v4_gene_model", "uniprot_id", "ncbi_gene"]
df = pd.read_table(filename, usecols=usecols)
df.fillna("", inplace=True)
print(df[["phenotype_name","phenotype_description"]].head(10))
print(df.shape)

        phenotype_name                              phenotype_description
0     2-seeded kernels                                                   
1   A1 null transcript  No color in the aleurone, specifically no colo...
2    aberrant seedling  first leaf is small, round and flat; second le...
3    aberrant seedling  first leaf is small, round and flat; second le...
4  abnormal root hairs                        root hairs fail to elongate
5  abnormal root hairs                        root hairs fail to elongate
6  abnormal root hairs                        root hairs fail to elongate
7  abnormal root hairs                        root hairs fail to elongate
8  abnormal root hairs                        root hairs fail to elongate
9  abnormal root hairs                        root hairs fail to elongate
(3616, 9)


In [4]:
df

Unnamed: 0,phenotype_name,phenotype_description,locus_name,alleles,locus_synonyms,v3_gene_model,v4_gene_model,uniprot_id,ncbi_gene
0,2-seeded kernels,,wcr1,Wcr1-reference,wandering carpel1|wcr1,,,,
1,A1 null transcript,"No color in the aleurone, specifically no colo...",a1,a1-m2-8004::dSpm,a1|anthocyaninless1|bnl(a1)|DFR|FNR|gsy38(A1)|...,GRMZM2G026930,Zm00001d044122,P51108,100286107
2,aberrant seedling,"first leaf is small, round and flat; second le...",pve1,pve1-M2|pve1-R,AC211276.4_FG008|cl12053_1|cl12053_1(383)|inve...,AC211276.4_FG008,Zm00001d013672,,103626037
3,aberrant seedling,"first leaf is small, round and flat; second le...",ubl1,ubl1-1,si945031h05(438)|si945031h05a|ubl1,GRMZM2G156575,Zm00001d017432,,100275088
4,abnormal root hairs,root hairs fail to elongate,bhlh10,bhlh10-1|bhlh10-2,Lotus japonicus roothairless1-like|lrl5|transc...,GRMZM2G067654,Zm00001d045107,,103637976
...,...,...,...,...,...,...,...,...,...
3611,zebra necrotic leaf,Necrotic tissue appears between veins in regul...,zb1,zb1,zb1|zebra crossbands1,,,,
3612,zebra necrotic leaf,Necrotic tissue appears between veins in regul...,zn1,zn1|zn1-N25,zebra necrotic1|zn1,,,,
3613,zebra necrotic leaf,Necrotic tissue appears between veins in regul...,zn2,zn2|zn2-4-6(4461)|zn2-56-3012-10|zn2-94-234|zn...,zebra necrotic2|zn2,,,,
3614,zebra necrotic leaf,Necrotic tissue appears between veins in regul...,zn*-N571D,zn*-N571D,zebra necroticN571D|zn*-571D|zn*-N571D,,,,


Text information about the phenotypes is contained in both the phenotype name and the phenotype description for these data. These can be concatenated and retained together in a new description column that contains all of this information, or just the phenotype description could be retained, depending on which data should be used downstream for making similarity comparisons. This differs from most of the other text sources used. The next cell looks at how many unique values there are in this data for each column.

In [5]:
# Finding out how many unique values there are for each column.
unique_values = {col:len(pd.unique(df[col].values)) for col in df.columns}
for k,v in unique_values.items():
    print("{:24}{:8}".format(k,v))

phenotype_name               648
phenotype_description        379
locus_name                  1410
alleles                     2088
locus_synonyms              1408
v3_gene_model                482
v4_gene_model                469
uniprot_id                   140
ncbi_gene                    503


There is a fairly small number of distinct phenotype descriptions (379) compared to the number of rows in the complete dataset (3,616). This means that the same descriptions occur many times. The next cell shows which descriptions occur most often.

In [6]:
# Get a list sorted by number of occurrences for each phenotype description.
description_counts = df["phenotype_description"].value_counts().to_dict()
sorted_tuples = sorted(description_counts.items(), key = lambda x: x[1], reverse=True)
for t in sorted_tuples[0:10]:
    print("{:6}    {:20}".format(t[1],t[0][:70]))

   892                        
    90    Anthers shriveled, not usually exerted from glum. Pollen absent or abo
    87    seedling white, yellow or palegreen, becomes green, often normal
    81    Two classes of albino seedlings: Class I characterized by white or pal
    81    Seedling leaves are white.
    77    endosperm is opaque and firm, not chalky and not waxy
    73    a general term describing an improperly developed endosperm that appea
    73    small kernel is the consistent characteristic but variable in other as
    71    endosperm has a soft, chalk-like texture, usually a reduced yellow col
    65    lighter green seedlings or plants; less yellow than yellow green


The only description that occurs far more often than the rest is the empty string, where this information is missing entirely. The next cell looks at how many phrases and words are included in the phenotype description values. Most have a single phrase; some have multiple. These appear to be separated mainly by semicolons.

In [7]:
# Plotting distributions of the number of phrases and words in each description.
fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.set_title("Phenotype Descriptions")
ax2.set_title("Phenotype Descriptions")
ax1.set_xlabel("Number of phrases")
ax2.set_xlabel("Number of words")
x1 = [len(sent_tokenize(x)) for x in df["phenotype_description"].values]
x2 = [len(word_tokenize(x)) for x in df["phenotype_description"].values]
ax1.hist(x1, bins=15, range=(0,15), density=False, alpha=0.8, histtype='stepfilled', color="black", edgecolor='none')
ax2.hist(x2, bins=30, range=(0,150), density=False, alpha=0.8, histtype='stepfilled', color="black", edgecolor='none')
fig.set_size_inches(15,4)
fig.tight_layout()
fig.show()
plt.close()
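
The left-hand histogram counts sentences with `sent_tokenize`, which does not treat semicolons as boundaries. As a complementary, minimal sketch, assuming that semicolons and periods are the only phrase delimiters, the phrases could be counted with a regular expression instead. The function name `count_phrases` is hypothetical and is not part of the oats package.

```python
import re

def count_phrases(description):
    """Count non-empty phrases, splitting on semicolons and periods.

    Assumption: ';' and '.' are the only phrase delimiters. Periods inside
    numbers or abbreviations would be over-split by this simple rule.
    """
    pieces = re.split(r"[;.]", description)
    return sum(1 for piece in pieces if piece.strip())

# Example strings taken from the description counts printed earlier.
examples = [
    "",
    "Seedling leaves are white.",
    "lighter green seedlings or plants; less yellow than yellow green",
]
for text in examples:
    print(count_phrases(text), repr(text))
```

An empty description yields zero phrases under this rule, so rows with missing descriptions fall into the first bin of a histogram built from these counts.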

In [8]:
# Restructuring the dataset to include all the expected column names.
df["description"] = np.vectorize(concatenate_descriptions)(df["phenotype_name"], df["phenotype_description"])
df["uniprot_id"] = df["uniprot_id"].apply(add_prefix, prefix=UNIPROT_TAG)
df["ncbi_gene"] = df["ncbi_gene"].apply(add_prefix, prefix=NCBI_TAG)
df["gene_names"] = np.vectorize(concatenate_with_bar_delim)(df["locus_name"], df["v3_gene_model"], df["v4_gene_model"], df["uniprot_id"], df["ncbi_gene"])
df["gene_synonyms"] = np.vectorize(concatenate_with_bar_delim)(df["alleles"], df["locus_synonyms"])
df["species"] = "zma"
df["term_ids"] = ""
df["sources"] = "MaizeGDB"
df = df[["species", "gene_names", "gene_synonyms", "description", "term_ids", "sources"]]
df.head()

Unnamed: 0,species,gene_names,gene_synonyms,description,term_ids,sources
0,zma,wcr1,Wcr1-reference|wandering carpel1|wcr1,2-seeded kernels.,,MaizeGDB
1,zma,a1|GRMZM2G026930|Zm00001d044122|uniprot=P51108...,a1-m2-8004::dSpm|a1|anthocyaninless1|bnl(a1)|D...,"A1 null transcript. No color in the aleurone, ...",,MaizeGDB
2,zma,pve1|AC211276.4_FG008|Zm00001d013672|ncbi=1036...,pve1-M2|pve1-R|AC211276.4_FG008|cl12053_1|cl12...,"aberrant seedling. first leaf is small, round ...",,MaizeGDB
3,zma,ubl1|GRMZM2G156575|Zm00001d017432|ncbi=100275088,ubl1-1|si945031h05(438)|si945031h05a|ubl1,"aberrant seedling. first leaf is small, round ...",,MaizeGDB
4,zma,bhlh10|GRMZM2G067654|Zm00001d045107|ncbi=10363...,bhlh10-1|bhlh10-2|Lotus japonicus roothairless...,abnormal root hairs. root hairs fail to elongate.,,MaizeGDB
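
The helpers used in the cell above (`add_prefix`, `concatenate_with_bar_delim`, `concatenate_descriptions`) come from the oats package, and their implementations are not shown in this notebook. The following is only a rough sketch of behavior consistent with the output above, not the oats code itself. The `_sketch` function names are hypothetical, the prefix strings are inferred from the printed `ncbi=` and `uniprot=` values, and dropping repeated tokens is an assumption.

```python
# Prefix strings inferred from the printed output; the real constants
# live in oats.utils.constants and may differ.
NCBI_PREFIX = "ncbi="
UNIPROT_PREFIX = "uniprot="

def add_prefix_sketch(value, prefix):
    """Return prefix + value, or an empty string when the value is missing."""
    value = str(value).strip()
    return prefix + value if value else ""

def join_with_bars_sketch(*fields):
    """Merge bar-delimited fields into one bar-delimited string.

    Empty tokens are skipped. Repeated tokens are kept only once, which is
    an assumption about the desired behavior.
    """
    seen = []
    for field in fields:
        for token in str(field).split("|"):
            token = token.strip()
            if token and token not in seen:
                seen.append(token)
    return "|".join(seen)

def join_descriptions_sketch(*texts):
    """Join non-empty texts as sentences, each ending in a single period."""
    sentences = [t.strip().rstrip(".") + "." for t in texts if t.strip()]
    return " ".join(sentences)

# Values for the ubl1 row shown earlier (the UniProt field is empty).
gene_names = join_with_bars_sketch(
    "ubl1",
    "GRMZM2G156575",
    "Zm00001d017432",
    add_prefix_sketch("", UNIPROT_PREFIX),
    add_prefix_sketch("100275088", NCBI_PREFIX),
)
description = join_descriptions_sketch("abnormal root hairs", "root hairs fail to elongate")
print(gene_names)
print(description)
```

Skipping empty fields before joining is what keeps rows without a UniProt or NCBI identifier from ending up with a bare prefix or a trailing bar.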


In [9]:
# Outputting the dataset of descriptions to a csv file.
path = os.path.join(OUTPUT_DIR,"zma_phenotypes.csv")
df.to_csv(path, index=False)

### File with high confidence gene ontology annotations (maize_v3.gold.gaf)
This file was generated as part of the [Maize GAMER](https://onlinelibrary.wiley.com/doi/full/10.1002/pld3.52) publication (Wimalanathan et al., 2018). The annotations include all of the associations between maize genes and GO terms where the terms have been experimentally confirmed to represent correct functional annotations for those genes.

In [10]:
filename = "../data/gene_related_files/maizegdb/maize_v3.gold.gaf"
df = pd.read_table(filename, skiprows=1)
df.fillna("", inplace=True)
df.head()

Unnamed: 0,!db,db_object_id,db_object_symbol,qualifier,term_accession,db_reference,evidence_code,with,aspect,db_object_name,db_object_synonym,db_object_type,taxon,date,assigned_by,annotation_extension,gene_product_form_id
0,MaizeGDB,bx1,GRMZM2G085381,,GO:0000162,PMID:9235894,IMP,,P,benzoxazinless1,MaizeGDB:GRMZM2G085381,protein,taxon:4577,20151010,MaizeGDB,,
1,MaizeGDB,mpk2,GRMZM2G020216,,GO:0000302,PMID:20693409,IDA,,P,MAP kinase2,MaizeGDB:GRMZM2G020216,protein,taxon:4577,20151010,MaizeGDB,,
2,MaizeGDB,crs1,GRMZM2G078412,,GO:0000373,PMID:15598799,IDA,,P,chloroplast RNA splicing1,MaizeGDB:GRMZM2G078412,protein,taxon:4577,20151010,MaizeGDB,,
3,MaizeGDB,crs2,GRMZM2G132021,,GO:0000373,PMID:11179231,IDA,,P,chloroplast RNA splicing2,MaizeGDB:GRMZM2G132021,protein,taxon:4577,20151010,MaizeGDB,,
4,MaizeGDB,gams1,ga-ms1,,GO:0000775,MaizeGDB:123623,IMP,,C,gametophytic male sterile1,MaizeGDB:ga-ms1,protein,taxon:4577,20151010,MaizeGDB,,
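
For readers without pandas at hand, a GAF file can also be read with the standard library. The sketch below assumes a tab-delimited file in which every line starting with `!` is a comment or header. The column names are copied from the output above, the `!gaf-version` line in the sample is an assumed first line, and `read_gaf` is a hypothetical helper. The set of experimental evidence codes follows the Gene Ontology evidence code definitions and is used here only to check that the sampled annotations are experimentally supported.

```python
import csv
import io

# Column names as they appear in the header row printed above.
GAF_COLUMNS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "term_accession",
    "db_reference", "evidence_code", "with", "aspect", "db_object_name",
    "db_object_synonym", "db_object_type", "taxon", "date", "assigned_by",
    "annotation_extension", "gene_product_form_id",
]

# Experimental evidence codes as defined by the Gene Ontology Consortium.
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}

def read_gaf(handle):
    """Yield one dict per annotation line, skipping lines that start with '!'."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if not fields or fields[0].startswith("!"):
            continue
        # Pad short lines so that every column name receives a value.
        fields = fields + [""] * (len(GAF_COLUMNS) - len(fields))
        yield dict(zip(GAF_COLUMNS, fields))

# Two annotation lines copied from the output above; the first line of the
# sample is an assumed version comment.
sample = (
    "!gaf-version: 2.0\n"
    "!db\tdb_object_id\tdb_object_symbol\n"
    "MaizeGDB\tbx1\tGRMZM2G085381\t\tGO:0000162\tPMID:9235894\tIMP\t\tP\t"
    "benzoxazinless1\tMaizeGDB:GRMZM2G085381\tprotein\ttaxon:4577\t20151010\tMaizeGDB\t\t\n"
    "MaizeGDB\tmpk2\tGRMZM2G020216\t\tGO:0000302\tPMID:20693409\tIDA\t\tP\t"
    "MAP kinase2\tMaizeGDB:GRMZM2G020216\tprotein\ttaxon:4577\t20151010\tMaizeGDB\t\t\n"
)

rows = list(read_gaf(io.StringIO(sample)))
experimental = [r for r in rows if r["evidence_code"] in EXPERIMENTAL_CODES]
for r in experimental:
    print(r["db_object_id"], r["term_accession"], r["evidence_code"])
```

Quoting is disabled with `csv.QUOTE_NONE` so that any quote characters inside gene names are read literally rather than being interpreted as field delimiters.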


In [11]:
# Restructuring the dataset to include all the expected column names.
df["description"] = ""
df["gene_names"] = np.vectorize(concatenate_with_bar_delim)(df["db_object_id"], df["db_object_symbol"])
df["gene_synonyms"] = df["db_object_synonym"]
df["species"] = "zma"
df["term_ids"] = df["term_accession"]
df["sources"] = "MaizeGDB"
df = df[["species", "gene_names", "gene_synonyms", "description", "term_ids", "sources"]]
df.head()

Unnamed: 0,species,gene_names,gene_synonyms,description,term_ids,sources
0,zma,bx1|GRMZM2G085381,MaizeGDB:GRMZM2G085381,,GO:0000162,MaizeGDB
1,zma,mpk2|GRMZM2G020216,MaizeGDB:GRMZM2G020216,,GO:0000302,MaizeGDB
2,zma,crs1|GRMZM2G078412,MaizeGDB:GRMZM2G078412,,GO:0000373,MaizeGDB
3,zma,crs2|GRMZM2G132021,MaizeGDB:GRMZM2G132021,,GO:0000373,MaizeGDB
4,zma,gams1|ga-ms1,MaizeGDB:ga-ms1,,GO:0000775,MaizeGDB


In [12]:
# Outputting the dataset of annotations to a csv file.
path = os.path.join(OUTPUT_DIR,"zma_high_confidence_go_annotations.csv")
df.to_csv(path, index=False)