## Data from Oryzabase (Integrated Rice Science Database)
This notebook reads in and performs a preliminary analysis of the text descriptions and ontology term annotations available through Oryzabase. The data is organized and restructured into a standard format so that it can be easily combined with datasets from other resources.

In [82]:
import sys
import os
import pandas as pd
import numpy as np
import itertools
import re
import matplotlib.pyplot as plt
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize

sys.path.append("../../oats")
from oats.utils.utils import to_abbreviation
from oats.utils.constants import NCBI_TAG
from oats.utils.constants import EVIDENCE_CODES
from oats.nlp.preprocess import add_prefix
from oats.nlp.preprocess import concatenate_with_bar_delim
from oats.nlp.preprocess import other_delim_to_bar_delim
from oats.nlp.preprocess import get_ontology_ids
from oats.nlp.preprocess import concatenate_descriptions
from oats.nlp.preprocess import remove_character
from oats.nlp.preprocess import remove_punctuation
from oats.nlp.preprocess import handle_synonym_in_parentheses
from oats.nlp.preprocess import remove_enclosing_brackets
from oats.nlp.preprocess import remove_short_tokens

OUTPUT_DIR = "../data/reshaped"

### Reading in the dataset of annotations and some other descriptions

In [83]:
filename = "../data/sources/oryzabase/OryzabaseGeneListEn_20190826010113.txt"
usecols = ["CGSNL Gene Symbol", "Gene symbol synonym(s)", "CGSNL Gene Name", "Gene name synonym(s)", "Protein Name", "Allele", 
    "Explanation", "Trait Class", "RAP ID", "MUS ID", "Gramene ID", "Gene Ontology", "Trait Ontology", "Plant Ontology"]
df = pd.read_table(filename, usecols=usecols, sep="\t")
df.fillna("", inplace=True)
print(df.shape)

(17674, 14)


### Describing the gene names and accessions in the data
Several columns contain information about gene names and accessions. We need to know what type of information each one holds in order to decide which columns should be retained or parsed to form the cleaned dataset. We are interested both in gene names that should map to a specific accession (like cms-54257) and in gene names that are enzyme descriptions (like Ubiquitin-Specific Protease), which could map to more than one gene in a particular species. Each type of information is valuable, but the two need to be differentiated so that two rows specifying the same gene are not confused with two rows specifying different genes that share the same function.

In [84]:
print(df[["CGSNL Gene Symbol", "Gene symbol synonym(s)"]].head(5))
print(df[["CGSNL Gene Symbol", "Gene symbol synonym(s)"]].sample(5))

    CGSNL Gene Symbol                  Gene symbol synonym(s)
0         [CMS-54257]               [cms-54257]*, [cms-54257]
1  [CMS-AK]([CMS-JP])  [cms-ak]([cms-jp]), [cms-jp], [cms-ak]
2           [CMS-ARC]   [cms-ARC]*, [cms-ARC] [mt], [cms-ARC]
3            [CMS-BO]                                [cms-bo]
4            [CMS-CW]                                [cms-CW]
      CGSNL Gene Symbol                   Gene symbol synonym(s)
16456           CYP71Y5                                OsCYP71Y5
15574                 _                                         
6320              ITPK1  OsITPK1, OsITP5/6K-1, ITP5/6K-1, OsITL2
8232               ISC5                                   OsISC5
7170             MADS87                                 OsMADS87


The gene symbols in this dataset are typically, but not always, surrounded by square brackets. If a second symbol for the same gene is mentioned in the same column, the second symbol might be enclosed in parentheses. The synonyms for the gene symbols are similarly sometimes enclosed in square brackets, and are typically separated by commas when more than one is mentioned in this column. Also note that an underscore is used to represent missing data, so this has to be handled so that the character is not treated as a gene name that appears many times.
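The conventions described above can be illustrated with a small standalone sketch. It does not use the `oats` preprocessing functions that the notebook applies later, and the name `split_symbols` and the choice to treat every bracket, parenthesis, asterisk, and comma as a separator are assumptions made only for illustration.

```python
import re

# Oryzabase uses an underscore to mark a missing value in these columns.
MISSING_MARKER = "_"

def split_symbols(raw):
    """Return the distinct symbols found in a raw gene-symbol cell (hypothetical helper)."""
    raw = raw.strip()
    if raw in ("", MISSING_MARKER):
        return []
    # Treat brackets, parentheses, asterisks, and commas as separators.
    tokens = re.split(r"[\[\]\(\)\*,]+", raw)
    symbols = []
    for token in tokens:
        token = token.strip()
        # Keep the first occurrence of each non-empty symbol, preserving order.
        if token and token not in symbols:
            symbols.append(token)
    return symbols

print(split_symbols("[CMS-AK]([CMS-JP])"))         # ['CMS-AK', 'CMS-JP']
print(split_symbols("[cms-54257]*, [cms-54257]"))  # ['cms-54257']
print(split_symbols("_"))                          # []
```

Splitting on every delimiter is simpler than the column-specific functions used later in the notebook, but it would also break apart any symbol that legitimately contains one of these characters.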

In [85]:
print(df[["CGSNL Gene Name"]].head(5))
print(df[["CGSNL Gene Name"]].sample(5))

                    CGSNL Gene Name
0  CYTOPLASMIC MALE STERILITY 54257
1     CYTOPLASMIC MALE STERILITY AK
2    CYTOPLASMIC MALE STERILITY ARC
3     CYTOPLASMIC MALE STERILITY BO
4     CYTOPLASMIC MALE STERILITY CW
                          CGSNL Gene Name
13393  RECEPTOR-LIKE CYTOPLASMIC KINASE 1
10736                                   _
13309                                   _
14486                                   _
3526                     HOMEOBOX GENE 19


The gene name column has strings representing the full name of each gene rather than just the symbol. Note that an underscore is used to denote missing values in this column as well.

In [86]:
print(df[["Gene name synonym(s)"]].head(5))
print(df[["Gene name synonym(s)"]].sample(5))

                                Gene name synonym(s)
0  Cytoplasmic mutant induced by somaclonal varia...
1            Akebono' cytoplasm, 'Akebono' cytoplasm
2     ARC13829-16 cytoplasm, `ARC13829-26' cytoplasm
3  Chinsurah boro II' cytoplasm, `Chinsurah boro ...
4                        Chinese wild rice cytoplasm
                                    Gene name synonym(s)
15115                                                   
15032                             retrotransposon gene 1
4310   jasmonyl-L-isoleucine synthase 1, JASMONATE RE...
1070                                      Notched kernel
2685                                                    


The values in the gene name synonym(s) column can be comma-delimited lists if more than one synonym for the gene name is known. Quotes are sometimes used around part of a synonym. Missing information is denoted by an empty string, and possibly by an underscore.

In [87]:
print(df[["Protein Name","Allele"]].sample(5))

      Protein Name              Allele
15458                                 
13261               osgme1-1, osgme1-2
3210                                  
9187                                  
4055                                  


Both the protein name and allele columns are sparse within the dataset. Either can hold a single value or a comma-delimited list of values. These may not need to be retained for finding references to genes in other resources, because other columns already provide more standardized representations of the genes.

In [88]:
print(df[["RAP ID","MUS ID"]].sample(5))

              RAP ID                                             MUS ID
7132                                                                   
14614  Os03g0186100                                    LOC_Os03g08730.1
6166    Os06g0153900  LOC_Os06g06040.7, LOC_Os06g06040.6, LOC_Os06g0...
13774  Os05g0272800                  LOC_Os05g19030.2, LOC_Os05g19030.1
9040    Os11g0702200                                   LOC_Os11g47610.1


Both the RAP ID and the MUS ID columns can contain multiple values for a given gene, which are included as members of a comma-delimited list. These values can also be missing, using the same scheme for missing information as the rest of the dataset.
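As a minimal sketch of how such a comma-delimited cell could be converted to the bar-delimited form used for the cleaned dataset, the helper below splits on commas, drops the missing-value markers, and removes repeats. The name `comma_list_to_bar_delim` is hypothetical and is not part of the `oats` package; the notebook itself relies on `other_delim_to_bar_delim` and `concatenate_with_bar_delim`, whose exact behavior is not shown here.

```python
def comma_list_to_bar_delim(raw, missing_markers=("", "_")):
    """Convert a comma-delimited cell into a bar-delimited string (hypothetical helper)."""
    items = [item.strip() for item in raw.split(",")]
    kept = []
    for item in items:
        # Skip missing-value markers and values that were already seen.
        if item in missing_markers or item in kept:
            continue
        kept.append(item)
    return "|".join(kept)

print(comma_list_to_bar_delim("LOC_Os05g19030.2, LOC_Os05g19030.1"))
# LOC_Os05g19030.2|LOC_Os05g19030.1
```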

### Parsing the needed information about gene names from the dataset
The following functions were written to handle the gene symbols, names, synonyms, and accessions as described above. They are not guaranteed to parse the information in this dataset with perfect accuracy, but they are meant to approximate what is required, based on going through the dataset by hand. Each function is meant to be applied only to a specific column, and compresses multiple cleaning steps into a single function so that the code that later cleans the columns is more readable. Some of them rely on other very specific functions from the text preprocessing module that are not shown here.

In [89]:
def clean_oryzabase_symbol(string):
    # Should be applied to the gene symbol column in the dataset.
    # Returns a single string representing a bar delimited list of gene symbols.
    string = remove_character(string, "*")
    names = handle_synonym_in_parentheses(string, min_length=4)
    names = [remove_enclosing_brackets(name) for name in names]
    names = remove_short_tokens(names, min_length=2)
    names_string = concatenate_with_bar_delim(*names)
    return names_string

def clean_oryzabase_symbol_synonyms(string):
    # Should be applied to the gene symbol synonym(s) column in the dataset.
    # Returns a single string representing a bar delimited list of gene symbols.
    string = remove_character(string, "*")
    names = string.split(",")
    names = [name.strip() for name in names]
    names = [remove_enclosing_brackets(name) for name in names]
    names_string = concatenate_with_bar_delim(*names)
    return names_string

### Handling the ontology term annotations in the data
Multiple columns within the dataset specify ontology term annotations that have been applied to the gene mentioned on that particular line. Ontology term annotations are separated into different columns based on which ontology the terms belong to, and both the term ID of each annotation and the accompanying label for that term are explicitly given. Columns for terms from the Gene Ontology (GO), Plant Ontology (PO), and Plant Trait Ontology (TO) are all included. This dataset contains no information about which evidence codes these annotations are associated with.

In [90]:
print(df[["Gene Ontology"]].head(5))
print(df[["Plant Ontology"]].head(5))
print(df[["Trait Ontology"]].head(5))

                                       Gene Ontology
0  GO:0000001 - mitochondrion inheritance, GO:000...
1  GO:0000001 - mitochondrion inheritance, GO:000...
2  GO:0007275 - multicellular organismal development
3  GO:0007275 - multicellular organismal developm...
4  GO:0007275 - multicellular organismal development
                                      Plant Ontology
0                               PO:0009066 - anther 
1                               PO:0009066 - anther 
2  PO:0009082 - spikelet floret , PO:0020048 - mi...
3  PO:0009082 - spikelet floret , PO:0020048 - mi...
4  PO:0009082 - spikelet floret , PO:0020048 - mi...
                                      Trait Ontology
0  TO:0000232 - cytoplasmic male sterility (sensu...
1  TO:0000232 - cytoplasmic male sterility (sensu...
2  TO:0000232 - cytoplasmic male sterility (sensu...
3  TO:0000232 - cytoplasmic male sterility (sensu...
4  TO:0000232 - cytoplasmic male sterility (sensu...


Both the term IDs and the labels for each annotation are given. Multiple annotations from the same ontology on a given line are separated by commas. We want to parse out just the ontology term IDs for the cleaned dataset so that they can be referenced later; none of the other information is needed. The preprocessing module includes a function that returns a list of the ontology term IDs present in a longer string of text.
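The implementation of `get_ontology_ids` is not shown in this notebook, so the following is only a sketch of one way such a function could work. It assumes that every term ID has the shape seen in the output above (a `GO`, `PO`, or `TO` prefix, a colon, and seven digits); the names `ONTOLOGY_ID_PATTERN` and `extract_ontology_ids` are hypothetical.

```python
import re

# Assumed ID shape: a GO, PO, or TO prefix, a colon, then exactly seven digits.
ONTOLOGY_ID_PATTERN = re.compile(r"\b(?:GO|PO|TO):\d{7}\b")

def extract_ontology_ids(text):
    """Return every ontology term ID found in the text, in order of appearance."""
    return ONTOLOGY_ID_PATTERN.findall(text)

# Illustrative string built from two complete values shown in the output above.
cell = "GO:0007275 - multicellular organismal development, PO:0009066 - anther"
ids = extract_ontology_ids(cell)
print(ids)            # ['GO:0007275', 'PO:0009066']
print("|".join(ids))  # GO:0007275|PO:0009066
```

Matching on the ID pattern rather than splitting on commas avoids problems with term labels that themselves contain commas.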

### Handling text description and keyword information in this data
This dataset does not contain any columns that consistently contain a natural language description of a phenotype associated with a given gene, but some text-based information is still present. The trait class column contains a value from a limited set of keyword descriptors for the trait a particular gene is associated with. The most common values in this column are counted below. A description of this keyword vocabulary is available at https://shigen.nig.ac.jp/rice/oryzabase/traitclass/. The explanation column also occasionally contains text information about a corresponding phenotype.

In [91]:
# Get a list sorted by number of occurrences for each trait class.
description_counts = df["Trait Class"].value_counts().to_dict()
sorted_tuples = sorted(description_counts.items(), key = lambda x: x[1], reverse=True)
for t in sorted_tuples[0:10]:
    print("{:6}  {:20}".format(t[1],t[0][:70]))

  4121                      
  3621   Biochemical character
  2566   Other              
   875   Tolerance and resistance - Stress tolerance
   862   Biochemical character,  Tolerance and resistance - Stress tolerance
   527   Tolerance and resistance - Disease resistance
   293   Reproductive organ - Spikelet, flower, glume, awn
   244   Biochemical character,  Tolerance and resistance - Disease resistance
   222   Character as QTL - Yield and productivity
   204   Tolerance and resistance - Stress tolerance,  Other


The most common value in the trait class column is whitespace or an empty string, indicating missing data. Another very common value is 'Other', which accounts for 2,566 of the 17,674 rows. This needs to be handled if the column is used as a text description, because 'Other' carries no semantics relevant to the phenotype (two phenotypes with trait classes of 'Other' should not be considered similar).
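One possible way to handle this is sketched below: comma-separated pieces that are exactly 'Other' are dropped and the remaining text is rejoined. The helper name `drop_uninformative_classes` is hypothetical. Note that some trait class labels contain commas themselves (for example 'Reproductive organ - Spikelet, flower, glume, awn'), so the pieces produced by the split should not be interpreted as individual trait classes; they are only used to locate and remove 'Other'.

```python
def drop_uninformative_classes(raw, uninformative=("Other",)):
    """Remove uninformative keywords from a trait class string (hypothetical helper)."""
    parts = [part.strip() for part in raw.split(",")]
    # Drop empty pieces and pieces that exactly match an uninformative keyword.
    kept = [part for part in parts if part and part not in uninformative]
    return ", ".join(kept)

print(repr(drop_uninformative_classes(" Other")))
# ''
print(drop_uninformative_classes(" Tolerance and resistance - Stress tolerance,  Other"))
# Tolerance and resistance - Stress tolerance
print(drop_uninformative_classes("Reproductive organ - Spikelet, flower, glume, awn"))
# Reproductive organ - Spikelet, flower, glume, awn
```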

In [92]:
print(df[["Explanation"]].sample(30))

                                             Explanation
2615   LOC_Os01g60540. BK005023. WRKY40 in Zhang and ...
12001       LOC_Os10g03620. Os_F0127 in Hua et al. 2011.
2671   LOC_Os09g25060. BK005079. AY323479. AF467736. ...
14559                                                   
16102  LOC_Os01g11810. GO:0071554: cell wall organiza...
10647                                          AB086076.
16884       closely related to Arabidopsis BON1. copine.
16596                                           Q2QLY4. 
6488   homologous recombination (HR)-related gene. GO...
3800                  LOC_Os07g04240. Q6ZDY8. EC=1.3.5.1
9010                                           AB332051.
772    Responsible for the production of the pigment ...
2566   TS was located between S13528 and S10581 on th...
12950                                    LOC_Os06g04670.
5232                        BK004860. Q6AWY4. AJ566409. 
12705                                    LOC_Os07g28140.
10471                          

The explanation column potentially holds information about the phenotype, but it also sometimes contains redundant gene names or identifiers, ontology term annotations, and occasionally mentions of methods. Some of this could be handled by parsing out information that already appears in another column for the same line, but for the purposes of downstream analyses the contents should be considered irregular text annotations or descriptions. The following cell contains a preliminary attempt at a function that cleans values in this column by removing some of the information that is redundant with other columns.

In [93]:
def clean_oryzabase_explainations(string):
    # Should be applied to the explanation column in the dataset.
    # Returns a version of the value in that column without some redundant information.
    # Note: punctuation is removed inside the loop, so a string that contains
    # no ontology IDs is returned unchanged.
    ontology_ids = get_ontology_ids(string)
    for ontology_id in ontology_ids:
        string = string.replace(ontology_id,"")
        string = remove_punctuation(string)
    return string
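The function above only targets ontology term IDs. As a hedged sketch of how the same idea could be extended to the locus identifiers that also appear in this column, the standalone example below removes both kinds of identifier with regular expressions. The patterns are assumptions based on the values visible in the sample output (for example `LOC_Os06g04670`), the name `strip_redundant_identifiers` is hypothetical, and other accessions such as `BK005023` are deliberately left alone because their format is not described here.

```python
import re

# Assumed identifier shapes, based only on the values visible in the sample output.
REDUNDANT_PATTERNS = [
    re.compile(r"\b(?:GO|PO|TO):\d{7}:?"),       # ontology term ID, optional trailing colon
    re.compile(r"LOC_Os\d{2}g\d{5}(?:\.\d+)?"),  # MSU-style locus ID, optional version suffix
]

def strip_redundant_identifiers(text):
    """Remove identifiers that duplicate other columns (hypothetical helper)."""
    for pattern in REDUNDANT_PATTERNS:
        text = pattern.sub(" ", text)
    # Collapse whitespace runs, then drop separators left dangling at the start.
    text = re.sub(r"\s+", " ", text)
    return text.lstrip(" .:;,").strip()

print(repr(strip_redundant_identifiers("LOC_Os06g04670.")))
# ''
print(strip_redundant_identifiers("LOC_Os07g04240. Q6ZDY8. EC=1.3.5.1"))
# Q6ZDY8. EC=1.3.5.1
print(strip_redundant_identifiers("closely related to Arabidopsis BON1. copine."))
# closely related to Arabidopsis BON1. copine.
```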

### Restructuring the data and saving to a csv file

In [94]:
# Restructuring and combining columns that have gene name information.
df["CGSNL Gene Symbol"] = df["CGSNL Gene Symbol"].apply(clean_oryzabase_symbol)
df["Gene symbol synonym(s)"] = df["Gene symbol synonym(s)"].apply(clean_oryzabase_symbol_synonyms)
df["CGSNL Gene Name"] = df["CGSNL Gene Name"].apply(lambda x: x.replace("_","").strip())
df["Gene name synonym(s)"] = df["Gene name synonym(s)"].apply(lambda x: other_delim_to_bar_delim(string=x, delim=","))
df["gene_names"] = np.vectorize(concatenate_with_bar_delim)(df["RAP ID"], df["MUS ID"], df["CGSNL Gene Symbol"], df["Gene symbol synonym(s)"], df["CGSNL Gene Name"], df["Gene name synonym(s)"])

# Restructuring and combining columns that have ontology term annotations.
df["Gene Ontology"] = df["Gene Ontology"].apply(lambda x: concatenate_with_bar_delim(*get_ontology_ids(x))) 
df["Trait Ontology"] = df["Trait Ontology"].apply(lambda x: concatenate_with_bar_delim(*get_ontology_ids(x))) 
df["Plant Ontology"] = df["Plant Ontology"].apply(lambda x: concatenate_with_bar_delim(*get_ontology_ids(x))) 
df["term_ids"] = np.vectorize(concatenate_with_bar_delim)(df["Gene Ontology"], df["Trait Ontology"], df["Plant Ontology"])

# Adding other expected columns and subsetting the dataset.
df["species"] = "osa"
df["description"] = df["Explanation"].apply(clean_oryzabase_explainations)
df["pmid"] = ""
df = df[["species", "gene_names", "description", "term_ids" ,"pmid"]]
print(df[["species","gene_names"]].head(10))
print(df[["species","gene_names"]].sample(10))

  species                                         gene_names
0     osa  CMS-54257|cms-54257|CYTOPLASMIC MALE STERILITY...
1     osa  CMS-JP|CMS-AK|[cms-ak]([cms-jp])|cms-jp|cms-ak...
2     osa  CMS-ARC|cms-ARC|cms-ARC] [mt|CYTOPLASMIC MALE ...
3     osa  CMS-BO|cms-bo|CYTOPLASMIC MALE STERILITY BO|Ch...
4     osa  CMS-CW|cms-CW|CYTOPLASMIC MALE STERILITY CW|Ch...
5     osa  CMS-GAM|cms-GAM|CYTOPLASMIC MALE STERILITY GAM...
6     osa  CMS-HL|cms-HL|CYTOPLASMIC MALE STERILITY HL|HL...
7     osa  CMS-IR66707A|cms-IR66707A|CYTOPLASMIC MALE STE...
8     osa  CMS-KALINGA-I|cms-Kalinga-I|CYTOPLASMIC MALE S...
9     osa  CMS-KHIABORO|cms-Khiaboro|CYTOPLASMIC MALE STE...
      species                                         gene_names
7351      osa  Os12g0628100|LOC_Os12g43340.1|ADF11|OsADF11|AC...
13565     osa  Os06g0151700|LOC_Os06g05830.1|RLCK196|OsRLCK19...
6374      osa  Os09g0497900|LOC_Os09g32260.2, LOC_Os09g32260....
13420     osa  Os01g0296000|LOC_Os01g19160.2, LOC_Os01g19160....
2758

In [95]:
# Outputting the dataset of descriptions to a csv file.
path = os.path.join(OUTPUT_DIR,"oryzabase_dataset_descriptions.csv")
df.to_csv(path, index=False)
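
A downstream consumer of this file needs to split the bar-delimited columns back into lists. The sketch below shows one way to do that with the standard library `csv` module. The function name `read_reshaped_rows` is hypothetical, and the sample row is an illustrative stand-in that follows the column layout written above rather than an actual line of the output file.

```python
import csv
import io

def read_reshaped_rows(handle):
    """Read the reshaped CSV and split its bar-delimited columns into lists."""
    rows = []
    for row in csv.DictReader(handle):
        # Empty cells yield empty lists rather than lists holding one empty string.
        row["gene_names"] = [name for name in row["gene_names"].split("|") if name]
        row["term_ids"] = [term for term in row["term_ids"].split("|") if term]
        rows.append(row)
    return rows

# Illustrative sample with the same header as the file written above.
sample = io.StringIO(
    "species,gene_names,description,term_ids,pmid\n"
    "osa,CMS-BO|cms-bo,,TO:0000232|PO:0009066,\n"
)
rows = read_reshaped_rows(sample)
print(rows[0]["gene_names"])  # ['CMS-BO', 'cms-bo']
print(rows[0]["term_ids"])    # ['TO:0000232', 'PO:0009066']
```

With a real file, the `io.StringIO` object would be replaced by a handle from `open(path, newline="")`.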