## Reading in data available through Oryzabase (Integrated Rice Science Database)
The purpose of this notebook is to read in and do a preliminary analysis of the text descriptions and ontology term annotations that are available through Oryzabase. The data needs to be organized and restructured into a standard format so that it can be easily combined with datasets from other resources. This notebook takes the input listed below, which was obtained from Oryzabase, and produces output with the standard columns that are listed and described below.

### Files read
```
plant-data/databases/oryzabase/OryzabaseGeneListEn_20190826010113.txt
```

### Files created
```
plant-data/reshaped_data/oryzabase_phenotype_descriptions_and_annotations.csv
```

### Columns in the created files
* **species**: A string indicating what species the gene is in, currently using the 3-letter codes from the KEGG database.
* **unique_gene_identifiers**: Pipe delimited list of gene identifiers, names, models, etc. which must uniquely refer to this gene.
* **other_gene_identifiers**: Pipe delimited list of other identifiers, names, aliases, and synonyms for the gene, which may but do not have to uniquely refer to it.
* **gene_models**: Pipe delimited list of gene model names that map to this gene.
* **descriptions**: A free text field for any descriptions of phenotypes associated with this gene.
* **annotations**: Pipe delimited list of ontology term identifiers (for example GO, PO, and TO terms).
* **sources**: Pipe delimited list of strings that indicate where this data comes from such as database names.
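
As a minimal sketch of how a row in this format might be consumed downstream, the pipe-delimited fields can be split back into lists. The helper name `split_pipe_delimited` and the example row are illustrative assumptions; the two identifiers are taken from the sample output printed later in this notebook, but which column each belongs in is a guess.

```python
def split_pipe_delimited(value):
    """Split a pipe-delimited field into a list, dropping empty entries."""
    return [item.strip() for item in value.split("|") if item.strip()]


# Hypothetical row using the standard column names (values are illustrative).
example_row = {
    "species": "osa",
    "unique_gene_identifiers": "Os03g0215900|LOC_Os03g11670.1",
    "other_gene_identifiers": "",
    "gene_models": "LOC_Os03g11670.1",
    "descriptions": "",
    "annotations": "",
    "sources": "Oryzabase",
}

list_columns = ["unique_gene_identifiers", "other_gene_identifiers",
                "gene_models", "annotations", "sources"]
parsed = {col: split_pipe_delimited(example_row[col]) for col in list_columns}
print(parsed["unique_gene_identifiers"])  # ['Os03g0215900', 'LOC_Os03g11670.1']
print(parsed["other_gene_identifiers"])   # []
```

Treating an empty string as an empty list keeps missing values from turning into a spurious identifier when rows from different resources are merged.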

In [6]:
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
import sys
import os
import warnings
import pandas as pd
import numpy as np
import itertools
import re
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize

sys.path.append("../utils")
from constants import NCBI_TAG, EVIDENCE_CODES

sys.path.append("../../oats")
from oats.nlp.small import add_prefix_safely
from oats.nlp.small import get_ontology_ids, remove_punctuation, remove_enclosing_brackets
from oats.nlp.preprocess import concatenate_texts, concatenate_with_delim, replace_delimiter

OUTPUT_DIR = "../reshaped_data"
mpl.rcParams["figure.dpi"] = 200
warnings.simplefilter('ignore')
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

In [7]:
a = ["a","b","c"]
b = ["ee", "eleeese", "eee|eee"]
d = pd.DataFrame({"c1":a, "c2":b})


combine_columns = lambda row, col_names: concatenate_with_delim("|", [row[col_name] for col_name in col_names])

d["thing"] = d.apply(lambda x: combine_columns(x, ["c1","c2"]), axis=1)
d

  c1       c2      thing
0  a       ee       a|ee
1  b  eleeese  b|eleeese
2  c  eee|eee      c|eee


In [8]:
# Columns that should be in the final reshaped files.
reshaped_columns = ["species", 
 "unique_gene_identifiers", 
 "other_gene_identifiers", 
 "gene_models", 
 "descriptions", 
 "annotations", 
 "sources"]

# Creating and testing a lambda for finding gene model strings.
gene_model_pattern_1 = re.compile("grmzm.+")
gene_model_pattern_2 = re.compile("zm[0-9]+d[0-9]+")
is_gene_model = lambda s: bool(gene_model_pattern_1.match(s.lower()) or gene_model_pattern_2.match(s.lower()))

### Part 1: Phenotypic Text Data
There are several columns that contain information about gene names and accessions. We need to know what type of information each contains in order to decide which should be retained or parsed to form the desired cleaned dataset. We are interested both in gene names that should map to a specific accession (like cms-54257) and in gene names that are enzyme descriptions (like Ubiquitin-Specific Protease) that could map to more than one gene in a particular species. Each type of information is valuable, but the two need to be differentiated so that two rows specifying the same gene are not confused with two rows specifying different genes that have the same function.

The gene symbols in this dataset are typically, but not always, surrounded by square brackets. If a second symbol for the same gene is mentioned in the same column, the second symbol might be enclosed in parentheses. The synonyms for the gene symbols are similarly sometimes enclosed in square brackets, and are typically separated by commas when more than one is mentioned in this column. Also note that an underscore is used to represent missing data, so this has to be handled so that the character is not treated as a gene name that appears many times.

In [9]:
filename = "../databases/oryzabase/OryzabaseGeneListEn_20190826010113.txt"
usecols = ["CGSNL Gene Symbol", "Gene symbol synonym(s)", "CGSNL Gene Name", "Gene name synonym(s)", "Protein Name", 
           "Allele", "Explanation", "Trait Class", "RAP ID", "MUS ID", "Gramene ID", "Gene Ontology", 
           "Trait Ontology", "Plant Ontology"]
df = pd.read_table(filename, usecols=usecols, sep="\t")
df.fillna("", inplace=True)
unique_values = {col:len(pd.unique(df[col].values)) for col in df.columns}
print(df.shape)
for k,v in unique_values.items():
    print("{:25}{:8}".format(k,v))

(17674, 14)
CGSNL Gene Symbol           10041
Gene symbol synonym(s)      15948
CGSNL Gene Name              9845
Gene name synonym(s)        15216
Protein Name                 5588
Allele                       1151
Explanation                 13841
Trait Class                   726
RAP ID                      11531
MUS ID                      11642
Gramene ID                   1656
Gene Ontology                7624
Trait Ontology               2476
Plant Ontology                797


In [10]:
print(df[["CGSNL Gene Symbol", "Gene symbol synonym(s)"]].head(5))
print(df[["CGSNL Gene Symbol", "Gene symbol synonym(s)"]].sample(5))

    CGSNL Gene Symbol                  Gene symbol synonym(s)
0         [CMS-54257]               [cms-54257]*, [cms-54257]
1  [CMS-AK]([CMS-JP])  [cms-ak]([cms-jp]), [cms-jp], [cms-ak]
2           [CMS-ARC]   [cms-ARC]*, [cms-ARC] [mt], [cms-ARC]
3            [CMS-BO]                                [cms-bo]
4            [CMS-CW]                                [cms-CW]
      CGSNL Gene Symbol Gene symbol synonym(s)
13546           RLCK172              OsRLCK172
14146                 _               OsSTA219
10718                 _                       
14694                 _       OsProCP4, PROCP4
2926               NDK3                   Ndk3



In [11]:
print(df[["CGSNL Gene Name"]].head(5))
print(df[["CGSNL Gene Name"]].sample(5))

                    CGSNL Gene Name
0  CYTOPLASMIC MALE STERILITY 54257
1     CYTOPLASMIC MALE STERILITY AK
2    CYTOPLASMIC MALE STERILITY ARC
3     CYTOPLASMIC MALE STERILITY BO
4     CYTOPLASMIC MALE STERILITY CW
                      CGSNL Gene Name
16487                               _
16120                               _
17383           ACYL-COA THIOESTERASE
7692                    MICRORNA1884B
3127   WALL-ASSOCIATED KINASE GENE 95


The gene name column has strings representing the full name of each gene rather than just the symbol. Note that an underscore is used to denote missing values in this column as well.

In [12]:
print(df[["Gene name synonym(s)"]].head(5))
print(df[["Gene name synonym(s)"]].sample(5))

                                Gene name synonym(s)
0  Cytoplasmic mutant induced by somaclonal varia...
1            Akebono' cytoplasm, 'Akebono' cytoplasm
2     ARC13829-16 cytoplasm, `ARC13829-26' cytoplasm
3  Chinsurah boro II' cytoplasm, `Chinsurah boro ...
4                        Chinese wild rice cytoplasm
                    Gene name synonym(s)
7445         MICRORNA169o, osa-miRNA169o
11063  transposon CACTG element Rim2-M22
17438                           PINOID b
16648        Hypothetical conserved gene
6975                copper transporter 2


The values in the gene name synonym(s) column can be comma delimited lists if more than one synonym for the gene name was known. Sometimes quotes are used. Empty strings and possibly underscores can be used to denote missing information.

In [13]:
print(df[["Protein Name","Allele"]].sample(5))

                                 Protein Name Allele
11621                                               
11771                                               
5138   LATE EMBRYOGENESIS ABUNDANT PROTEIN 24       
10503                    dye1, dye1-1, dye1-2       
16394                                               


Both the protein name and allele columns are sparse within the dataset. Either can be a single value or a comma delimited list of values. These may not need to be retained for finding references to genes in other resources because we already have more standardized representations of the genes in other columns.

In [14]:
print(df[["RAP ID","MUS ID"]].sample(5))

             RAP ID                              MUS ID
9146   Os03g0215900                    LOC_Os03g11670.1
15101  Os08g0192200  LOC_Os08g09300.2, LOC_Os08g09300.1
7496                                                   
14953  Os03g0853200                    LOC_Os03g63620.1
917                                                    


Both the RAP ID and the MUS ID columns can contain multiple values for a given gene, which are included as members of a comma delimited list. These values can also be missing, using the same scheme for missing information as in the rest of the dataset.

The following functions were written based on how the gene symbols, names, synonyms, and accessions are formatted in this dataset, as described above. They are not guaranteed to parse the information in this dataset perfectly, but they are meant to approximate what is required based on going through the dataset by hand. The functions are meant to be applied only to specific columns within the dataset, and to make the code that later cleans the columns more readable by compressing multiple cleaning steps into a single function. Some of them rely on other very specific functions that are within the text preprocessing module and are not shown here.

In [15]:
def handle_synonym_in_parentheses(text, min_length):
    # Looks at a string that is suspected to be in a format like "name (othername)". If
    # that is the case then a list of strings is returned that looks like [name, othername].
    # This is useful when a column is specifying something like a gene name but a synonym
    # might be mentioned in the same column in parentheses, so the whole string in that 
    # column is not useful for searching against as whole. Does not consider text in
    # parentheses shorter than min_length to be a real synonym, but rather part of the 
    # name, such as gene_name(t) for example. 
    names = []
    pattern = r"\(.*?\)"
    results = re.findall(pattern, text)
    for result in results:
        enclosed_string = result[1:-1]
        if len(enclosed_string)>=min_length:
            text = text.replace(result, "")
            names.append(enclosed_string)
    names.append(text)
    names = [name.strip() for name in names]
    return(names)


def clean_oryzabase_symbol(string):
    # Should be applied to the gene symbol column in the dataset.
    # Returns a single string representing a bar delimited list of gene symbols.
    string = string.replace("*","")
    names = handle_synonym_in_parentheses(string, min_length=4)
    names = [remove_enclosing_brackets(name) for name in names]
    names = [name for name in names if len(name)>=2] # Retain only names that are at least two characters.
    names_string = concatenate_with_delim("|", names)
    return(names_string)

def clean_oryzabase_symbol_synonyms(string):
    # Should be applied to the gene symbol synonym(s) column in the dataset.
    # Returns a single string representing a bar delimited list of gene symbols.
    string = string.replace("*","")
    names = string.split(",")
    names = [name.strip() for name in names]
    names = [remove_enclosing_brackets(name) for name in names]
    names_string = concatenate_with_delim("|", names)
    return(names_string)
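
The helpers `remove_enclosing_brackets` and `concatenate_with_delim` are imported from the `oats` package and their source is not shown in this notebook. The following is a minimal sketch of stand-ins whose behavior is inferred only from the outputs printed in this notebook (for example, `"c"` combined with `"eee|eee"` yielding `"c|eee"`). The `_sketch` names are hypothetical, and the real implementations may differ.

```python
def remove_enclosing_brackets_sketch(text):
    """Strip one pair of square brackets if the string starts and ends with them."""
    text = text.strip()
    if text.startswith("[") and text.endswith("]"):
        return text[1:-1]
    return text


def concatenate_with_delim_sketch(delim, elements):
    """Join strings with delim, splitting inputs that already contain delim and
    dropping empty strings and duplicates while keeping first-seen order."""
    seen = []
    for element in elements:
        for token in element.split(delim):
            token = token.strip()
            if token and token not in seen:
                seen.append(token)
    return delim.join(seen)


print(remove_enclosing_brackets_sketch("[CMS-54257]"))       # CMS-54257
print(remove_enclosing_brackets_sketch("RLCK172"))           # RLCK172
print(concatenate_with_delim_sketch("|", ["c", "eee|eee"]))  # c|eee
print(concatenate_with_delim_sketch("|", ["cms-54257", "", "cms-54257"]))  # cms-54257
```

Removing only the outermost characters means a value such as `[cms-ARC] [mt]` keeps its inner brackets, which matches the `cms-ARC] [mt` token visible in the combined gene names printed near the end of this notebook.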

### Part 2: Ontology Term Annotations
Multiple columns within the dataset specify ontology term annotations that have been applied to the gene mentioned on that particular line. Ontology term annotations are separated into different columns based on which ontology the terms belong to, and both the term ID of each annotation and the accompanying label for that term are explicitly given. Columns for terms from the Gene Ontology (GO), Plant Ontology (PO), and Plant Trait Ontology (TO) are all included. There is no information about what evidence codes these annotations are associated with in this dataset.

In [16]:
print(df[["Gene Ontology"]].head(5))
print(df[["Plant Ontology"]].head(5))
print(df[["Trait Ontology"]].head(5))

                                       Gene Ontology
0  GO:0000001 - mitochondrion inheritance, GO:000...
1  GO:0000001 - mitochondrion inheritance, GO:000...
2  GO:0007275 - multicellular organismal development
3  GO:0007275 - multicellular organismal developm...
4  GO:0007275 - multicellular organismal development
                                      Plant Ontology
0                               PO:0009066 - anther 
1                               PO:0009066 - anther 
2  PO:0009082 - spikelet floret , PO:0020048 - mi...
3  PO:0009082 - spikelet floret , PO:0020048 - mi...
4  PO:0009082 - spikelet floret , PO:0020048 - mi...
                                      Trait Ontology
0  TO:0000232 - cytoplasmic male sterility (sensu...
1  TO:0000232 - cytoplasmic male sterility (sensu...
2  TO:0000232 - cytoplasmic male sterility (sensu...
3  TO:0000232 - cytoplasmic male sterility (sensu...
4  TO:0000232 - cytoplasmic male sterility (sensu...


Both the term IDs and labels for each annotation are given. Multiple annotations from the same ontology for a given line are separated by commas. We want to parse out just the ontology term IDs for the cleaned dataset so that they can be referenced later; the other information is not needed. The imported `get_ontology_ids` function returns a list of the ontology term IDs present in a longer string of text.
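
The source of `get_ontology_ids` is not shown here, so the following is only a sketch of one way such an extraction could work. It assumes that every ID is a GO, PO, or TO prefix followed by a colon and seven digits, as in the printed samples above. The function name `extract_ontology_ids` and the input string are illustrative.

```python
import re

# Assumed ID shape: ontology prefix (GO, PO, or TO), a colon, then seven digits.
ONTOLOGY_ID_PATTERN = re.compile(r"\b(?:GO|PO|TO):\d{7}\b")


def extract_ontology_ids(text):
    """Return all ontology term IDs found in text, in order of appearance."""
    return ONTOLOGY_ID_PATTERN.findall(text)


# Illustrative value in the same "ID - label , ID - label" layout as the columns above.
value = "PO:0009082 - spikelet floret , PO:0009066 - anther "
ids = extract_ontology_ids(value)
print(ids)            # ['PO:0009082', 'PO:0009066']
print("|".join(ids))  # PO:0009082|PO:0009066
```

Matching on the ID shape rather than splitting on commas avoids problems with term labels that themselves contain commas.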

### Part 3: Handling other text description and keyword information in this data
This dataset does not contain any columns that consistently contain a natural language description of a phenotype associated with a given gene, but some text-based information is still present. The trait class column contains a value from a limited set of keyword descriptors for the trait a particular gene is associated with. The size of the vocabulary used is obtained here. Also see the specific description of this keyword vocabulary here (https://shigen.nig.ac.jp/rice/oryzabase/traitclass/). The explanation column also occasionally contains text information about a corresponding phenotype.

In [18]:
# Get a list sorted by number of occurrences for each trait class.
description_counts = df["Trait Class"].value_counts().to_dict()
sorted_tuples = sorted(description_counts.items(), key = lambda x: x[1], reverse=True)
for t in sorted_tuples[0:10]:
    print("{:6}  {:20}".format(t[1],t[0][:70]))

  4121                      
  3621   Biochemical character
  2566   Other              
   875   Tolerance and resistance - Stress tolerance
   862   Biochemical character,  Tolerance and resistance - Stress tolerance
   527   Tolerance and resistance - Disease resistance
   293   Reproductive organ - Spikelet, flower, glume, awn
   244   Biochemical character,  Tolerance and resistance - Disease resistance
   222   Character as QTL - Yield and productivity
   204   Tolerance and resistance - Stress tolerance,  Other


The most common value in the trait class column is whitespace or an empty string, indicating missing data. Another very common value is 'Other', which has 2,566 occurrences out of the 17,674 total instances. This needs to be handled if this information is used as text descriptions, because it carries no semantics relevant to the phenotype (two phenotypes with trait classes of 'Other' should not be considered similar).
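
One possible way to handle this is sketched below. It assumes, based only on the printed counts above, that multiple trait classes in one value are separated by a comma followed by at least two spaces, while commas inside a single class label (such as "Spikelet, flower, glume, awn") are followed by one space. Both that delimiter rule and the function name `informative_trait_classes` are assumptions rather than part of the notebook.

```python
import re

# Labels that carry no phenotype semantics (compared in lower case).
UNINFORMATIVE_TRAIT_CLASSES = {"", "other"}


def informative_trait_classes(value):
    """Split a trait class value into labels and drop empty or 'Other' entries."""
    # Assumption: labels are separated by a comma plus two or more spaces.
    labels = [label.strip() for label in re.split(r",\s{2,}", value)]
    return [label for label in labels
            if label.lower() not in UNINFORMATIVE_TRAIT_CLASSES]


print(informative_trait_classes(" Tolerance and resistance - Stress tolerance,  Other"))
# ['Tolerance and resistance - Stress tolerance']
print(informative_trait_classes(" Other"))  # []
```

Returning an empty list for 'Other' lets downstream code treat those rows the same way as rows with a missing trait class.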

In [19]:
print(df[["Explanation"]].sample(30))

                                             Explanation
8018   miRBASE accession: MI0010708. osa-MIR2120 in m...
15182                  Q75LR2. AB122057. LOC_Os03g27230.
12548  class III PUB protein (U-box + GKL-box). LOC_O...
4424   NBS-LRR protein. AB604636, AB604637, AB604638,...
5799                                                    
6223                                     LOC_Os08g38270.
9365                           BN000589. LOC_Os04g59160.
14301  OSCA1 homologue. KJ920371. LOC_Os01g35050. Ory...
10180                                                   
1140   QTL which controls days to heading, located be...
1075   Pollen semi-sterility found in the hybrid betw...
2537   AF484682. AY661468. X76064. U37687. UBQ10 in J...
13194                            LOC_Os07g01090. Q69LA1.
8288   LOC_Os03g61480. homologous to the tomato fruit...
15109                                                   
6203                                     LOC_Os05g12580.
14761  LOC_Os04g02980. Subtilis

The explanation column potentially holds information about the phenotype, but it also sometimes contains redundant information about the gene names or identifiers, and sometimes the ontology term annotations as well. Methods are sometimes mentioned too. Some of this could be handled by parsing out the redundant information that already appears in another column for the same line, but these values should be considered irregular text annotations or descriptions for the purposes of downstream analyses. The following cell contains a preliminary attempt at a function that cleans values in this column by removing some redundant information from other columns.

In [20]:
def clean_oryzabase_explainations(string):
    # Should be applied to the explanation column in the dataset.
    # Returns a version of the value in that column without some redundant information.
    ontology_ids = get_ontology_ids(string)
    for ontology_id in ontology_ids:
        string = string.replace(ontology_id,"")
        string = remove_punctuation(string)
    return(string)
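
The function above removes only ontology term IDs. As a hedged sketch of how the redundant gene identifiers could also be stripped, the snippet below removes `LOC_Os...` identifiers of the kind that appear in the MUS ID column. The regular expression and the name `remove_locus_ids` are assumptions based on the printed samples, and this step is not part of the notebook's pipeline.

```python
import re

# Assumed identifier shape, taken from printed samples such as "LOC_Os08g38270.":
# "LOC_Os", two digits, "g", five digits, an optional ".N" suffix, and an
# optional trailing period followed by whitespace.
LOCUS_ID_PATTERN = re.compile(r"LOC_Os\d{2}g\d{5}(?:\.\d+)?\.?\s*")


def remove_locus_ids(text):
    """Remove locus identifiers from text and trim leftover whitespace."""
    return LOCUS_ID_PATTERN.sub("", text).strip()


print(remove_locus_ids("BN000589. LOC_Os04g59160."))  # BN000589.
print(repr(remove_locus_ids("LOC_Os08g38270.")))      # ''
```

Values that consist only of an identifier become empty strings, which makes it easier to tell which rows carry no descriptive text at all.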

In [22]:
# Restructuring and combining columns that have gene name information.
df["CGSNL Gene Symbol"] = df["CGSNL Gene Symbol"].apply(clean_oryzabase_symbol)
df["Gene symbol synonym(s)"] = df["Gene symbol synonym(s)"].apply(clean_oryzabase_symbol_synonyms)
df["CGSNL Gene Name"] = df["CGSNL Gene Name"].apply(lambda x: x.replace("_","").strip())
df["Gene name synonym(s)"] = df["Gene name synonym(s)"].apply(lambda x: replace_delimiter(text=x, old_delim=",", new_delim="|"))
combine_columns = lambda row, columns: concatenate_with_delim("|", [row[column] for column in columns])

df["gene_names"] = df.apply(lambda x: combine_columns(x, ["RAP ID","MUS ID","CGSNL Gene Symbol", "Gene symbol synonym(s)", "CGSNL Gene Name", "Gene name synonym(s)"]), axis=1)
#df["gene_names"] = np.vectorize(concatenate_with_delim)(df["RAP ID"], df["MUS ID"], df["CGSNL Gene Symbol"], df["Gene symbol synonym(s)"], df["CGSNL Gene Name"], df["Gene name synonym(s)"])


# Restructuring and combining columns that have ontology term annotations.
df["Gene Ontology"] = df["Gene Ontology"].apply(lambda x: concatenate_with_delim("|", get_ontology_ids(x)))
df["Trait Ontology"] = df["Trait Ontology"].apply(lambda x: concatenate_with_delim("|", get_ontology_ids(x))) 
df["Plant Ontology"] = df["Plant Ontology"].apply(lambda x: concatenate_with_delim("|", get_ontology_ids(x))) 




df["term_ids"] = df.apply(lambda x: combine_columns(x, ["Gene Ontology","Trait Ontology","Plant Ontology"]), axis=1)


#df["term_ids"] = np.vectorize(concatenate_with_bar_delim)(df["Gene Ontology"], df["Trait Ontology"], df["Plant Ontology"])

# Adding other expected columns and subsetting the dataset.
df["species"] = "osa"
df["description"] = df["Explanation"].apply(clean_oryzabase_explainations)
df["pmid"] = ""
df = df[["species", "gene_names", "description", "term_ids"]]
print(df[["species","gene_names"]].head(10))
print(df[["species","gene_names"]].sample(10))

  species                                         gene_names
0     osa  CMS-54257|cms-54257|CYTOPLASMIC MALE STERILITY...
1     osa  CMS-JP|CMS-AK|[cms-ak]([cms-jp])|cms-jp|cms-ak...
2     osa  CMS-ARC|cms-ARC|cms-ARC] [mt|CYTOPLASMIC MALE ...
3     osa  CMS-BO|cms-bo|CYTOPLASMIC MALE STERILITY BO|Ch...
4     osa  CMS-CW|cms-CW|CYTOPLASMIC MALE STERILITY CW|Ch...
5     osa  CMS-GAM|cms-GAM|CYTOPLASMIC MALE STERILITY GAM...
6     osa  CMS-HL|cms-HL|CYTOPLASMIC MALE STERILITY HL|HL...
7     osa  CMS-IR66707A|cms-IR66707A|CYTOPLASMIC MALE STE...
8     osa  CMS-KALINGA-I|cms-Kalinga-I|CYTOPLASMIC MALE S...
9     osa  CMS-KHIABORO|cms-Khiaboro|CYTOPLASMIC MALE STE...
      species                                         gene_names
3555      osa  Os07g0606600|LOC_Os07g41580.1|HAP3F|OsHAP3F|NF...
10620     osa  Os11g0127000|LOC_Os11g03310.1|NAC123|ONAC123|N...
3146      osa  Os11g0555600|LOC_Os11g35220.1|WAK117|OsWAK117|...
7928      osa  MIR5072|miR5072|osa-miR5072|osa-MIR5072|MICROR...
1388

In [23]:
# Outputting the dataset of descriptions to a csv file.
path = os.path.join(OUTPUT_DIR,"oryzabase_phenotype_descriptions_and_annotations.csv")
df.to_csv(path, index=False)