## Reading in data available through MaizeGDB (Maize Genetics and Genomics Database)
The purpose of this notebook is to read in and run a preliminary analysis of the text description data available through MaizeGDB. The input file was provided by MaizeGDB curators on request, rather than downloaded as an existing file from the database. The data needs to be organized and restructured into a standard format so that it can be easily combined with datasets from other resources.

### Files read
```
phenologs-with-oats/data/gene_related_files/maizegdb/pheno_genes.txt
phenologs-with-oats/data/gene_related_files/maizegdb/maize_v3.gold.gaf
```


### Files created
```
phenologs-with-oats/data/reshaped_files/zma_phenotypes.csv
phenologs-with-oats/data/reshaped_files/zma_high_confidence_go_annotations.csv
```

In [None]:
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
import sys
import os
import warnings
import pandas as pd
import numpy as np
import itertools
import re
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize

sys.path.append("../../oats")
from oats.utils.constants import NCBI_TAG, UNIPROT_TAG
from oats.utils.constants import EVIDENCE_CODES
from oats.utils.utils import to_abbreviation
from oats.nlp.preprocess import concatenate_with_bar_delim
from oats.nlp.preprocess import other_delim_to_bar_delim
from oats.nlp.preprocess import remove_punctuation
from oats.nlp.preprocess import remove_enclosing_brackets
from oats.nlp.preprocess import concatenate_descriptions
from oats.nlp.preprocess import add_prefix

mpl.rcParams["figure.dpi"] = 200
warnings.simplefilter('ignore')
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

In [None]:
OUTPUT_DIR = "../data/reshaped_files"

### File with genes and phenotype descriptions (pheno_genes.txt)
Note that `fillna` is used here to replace missing values with an empty string. This is done so that missing values are counted when checking the number of occurrences of unique values in each column (see the analysis below). It is not required as a preprocessing step, because any missing values or empty strings are handled later, when the data is read in and appended to a dataset object.
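
As a minimal sketch of why this matters, the toy column below (made-up values, not taken from `pheno_genes.txt`) shows that `value_counts` drops missing values by default, whereas an empty string is counted like any other value.

```python
import pandas as pd

# Toy column with two missing values (illustrative data only).
toy = pd.Series(["dwarf plant", None, "dwarf plant", None, "pale leaves"])

# By default, value_counts() leaves missing values out of the result.
counts_with_nan = toy.value_counts()

# After fillna(""), the empty string is counted like any other value.
counts_filled = toy.fillna("").value_counts()

print(counts_with_nan.to_dict())
print(counts_filled.to_dict())
```

Passing `dropna=False` to `value_counts` would be an alternative way to keep the missing values visible without modifying the dataframe.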

In [None]:
filename = "../data/gene_related_files/maizegdb/pheno_genes.txt"
usecols = ["phenotype_name", "phenotype_description", "locus_name", "alleles", "locus_synonyms", "v3_gene_model", "v4_gene_model", "uniprot_id", "ncbi_gene"]
df = pd.read_table(filename, usecols=usecols)
df.fillna("", inplace=True)
print(df[["phenotype_name","phenotype_description"]].head(10))
print(df.shape)

Text information about the phenotypes is contained in both the phenotype name and the phenotype description for these data. These can be concatenated and retained together in a new description column that contains all of this information, or only the phenotype description could be retained, depending on which data should be used downstream for making similarity comparisons. This differs from most of the other sources of text used. The next cell looks at how many unique values there are in this data for each column.
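
The concatenation option can be sketched as follows. The helper `join_name_and_description` is hypothetical and written only for illustration; it is not the `concatenate_descriptions` function from `oats`, whose exact behavior (separator, handling of duplicates) may differ.

```python
def join_name_and_description(name, description, sep=". "):
    """Join two text fields, skipping empty ones and exact repeats.

    Hypothetical helper for illustration only.
    """
    parts = []
    for text in (name, description):
        text = text.strip()
        # Keep a field only if it has content and was not already added.
        if text and text not in parts:
            parts.append(text)
    return sep.join(parts)


print(join_name_and_description("dwarf", "short internodes"))
print(join_name_and_description("dwarf", ""))
```

Skipping empty fields keeps rows with a missing description from ending in a dangling separator.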

In [None]:
# Finding out how many unique values there are for each column.
unique_values = {col:len(pd.unique(df[col].values)) for col in df.columns}
for k,v in unique_values.items():
    print("{:24}{:8}".format(k,v))

There is a fairly small number of distinct phenotype descriptions (379) compared to the number of lines in the complete dataset (3,616). This means that the same descriptions occur many times. The next cell looks at which descriptions occur most often.

In [None]:
# Get a list sorted by number of occurrences for each phenotype description.
description_counts = df["phenotype_description"].value_counts().to_dict()
sorted_tuples = sorted(description_counts.items(), key = lambda x: x[1], reverse=True)
for t in sorted_tuples[0:10]:
    print("{:6}    {:20}".format(t[1],t[0][:70]))

The only value that occurs far more often than the next is the empty string, where the description is missing entirely. The next cell looks at how many phrases and words are included in the phenotype description values. Most have a single phrase, and some have several. These appear to be separated mainly by semicolons.
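
Because `sent_tokenize` splits on sentence boundaries rather than on semicolons, a direct count of semicolon-separated phrases can be a useful cross-check. The helper `count_phrases` below is a hypothetical sketch that assumes semicolons are the only phrase delimiter.

```python
def count_phrases(description, delim=";"):
    """Count non-empty phrases separated by a delimiter.

    Hypothetical helper; assumes semicolons are the only phrase separator.
    """
    phrases = [phrase.strip() for phrase in description.split(delim)]
    # Ignore empty pieces, such as the one left by a trailing delimiter.
    return len([phrase for phrase in phrases if phrase])


print(count_phrases("short plant; narrow leaves"))
print(count_phrases(""))
```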

In [None]:
# Plotting distributions of the number of phrases and words in each description.
fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.set_title("Phenotype Descriptions")
ax2.set_title("Phenotype Descriptions")
ax1.set_xlabel("Number of phrases")
ax2.set_xlabel("Number of words")
x1 = [len(sent_tokenize(x)) for x in df["phenotype_description"].values]
x2 = [len(word_tokenize(x)) for x in df["phenotype_description"].values]
ax1.hist(x1, bins=15, range=(0,15), density=False, alpha=0.8, histtype='stepfilled', color="black", edgecolor='none')
ax2.hist(x2, bins=30, range=(0,150), density=False, alpha=0.8, histtype='stepfilled', color="black", edgecolor='none')
fig.set_size_inches(15,4)
fig.tight_layout()
fig.show()
plt.close()

In [None]:
# Restructuring the dataset to include all the expected column names.
df["description"] = np.vectorize(concatenate_descriptions)(df["phenotype_name"], df["phenotype_description"])
df["uniprot_id"] = df["uniprot_id"].apply(add_prefix, prefix=UNIPROT_TAG)
df["ncbi_gene"] = df["ncbi_gene"].apply(add_prefix, prefix=NCBI_TAG)
df["gene_names"] = np.vectorize(concatenate_with_bar_delim)(df["locus_name"], df["alleles"], df["locus_synonyms"], df["v3_gene_model"], df["v4_gene_model"], df["uniprot_id"], df["ncbi_gene"])
df["species"] = "zma"
df["term_ids"] = ""
df = df[["species", "gene_names", "description", "term_ids"]]
print(df[["species","gene_names"]].head(10))
print(df[["species","gene_names"]].sample(10))
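
The `gene_names` column above combines several identifier columns into one bar-delimited string. A minimal sketch of that idea follows, assuming `|` is the delimiter and that empty and repeated names should be dropped. The helper `join_with_bars` and the example identifiers are hypothetical and do not reproduce the `oats` implementation of `concatenate_with_bar_delim`.

```python
def join_with_bars(*fields, delim="|"):
    """Join identifier fields with a bar, dropping empties and repeats.

    Hypothetical helper for illustration only.
    """
    names = []
    for field in fields:
        # A field may itself already hold several bar-delimited names.
        for name in str(field).split(delim):
            name = name.strip()
            if name and name not in names:
                names.append(name)
    return delim.join(names)


# Made-up identifiers, used only to show the shape of the output.
print(join_with_bars("d1", "", "d1|dwarf1", "GRMZM2G000001"))
```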

In [None]:
# Outputting the dataset of descriptions to a csv file.
path = os.path.join(OUTPUT_DIR,"zma_phenotypes.csv")
df.to_csv(path, index=False)

### File with high confidence gene ontology annotations (maize_v3.gold.gaf)
This file was generated as part of the [Maize GAMER](https://onlinelibrary.wiley.com/doi/full/10.1002/pld3.52) publication (Wimalanathan et al., 2018). The annotations include all associations between maize genes and GO terms where the term has been experimentally confirmed as a correct functional annotation for that gene.

In [None]:
filename = "../data/gene_related_files/maizegdb/maize_v3.gold.gaf"
df = pd.read_table(filename, skiprows=1)
df.fillna("", inplace=True)
df.head()

In [None]:
# Restructuring the dataset to include all the expected column names.
df["description"] = ""
df["gene_names"] = np.vectorize(concatenate_with_bar_delim)(df["db_object_id"], df["db_object_symbol"])
df["species"] = "zma"
df["term_ids"] = df["term_accession"]
df = df[["species", "gene_names", "description", "term_ids"]]
df.head()
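
The read and reshape steps for this file can be checked on a small in-memory table. The layout below (a version line followed by a header row) and the `evidence_code` column are assumptions about the file. The column names `db_object_id`, `db_object_symbol`, and `term_accession` come from the cells above, and the gene identifiers and GO terms are placeholders.

```python
import io

import pandas as pd

# Assumed layout: one version line, then a tab-separated header row.
toy_gaf = io.StringIO(
    "!gaf-version: 2.0\n"
    "db\tdb_object_id\tdb_object_symbol\tterm_accession\tevidence_code\n"
    "MaizeGDB\tGRMZM2G000001\tgene1\tGO:0000001\tIMP\n"
    "MaizeGDB\tGRMZM2G000002\t\tGO:0000002\tIDA\n"
)

# skiprows=1 drops the version line so the next line becomes the header.
toy = pd.read_csv(toy_gaf, sep="\t", skiprows=1).fillna("")

# Build the four expected columns; empty symbols are left out of gene_names.
reshaped = pd.DataFrame({
    "species": "zma",
    "gene_names": [
        "|".join(name for name in (gene_id, symbol) if name)
        for gene_id, symbol in zip(toy["db_object_id"], toy["db_object_symbol"])
    ],
    "description": "",
    "term_ids": toy["term_accession"],
})
print(reshaped)
```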

In [None]:
# Outputting the dataset of annotations to a csv file.
path = os.path.join(OUTPUT_DIR,"zma_high_confidence_go_annotations.csv")
df.to_csv(path, index=False)