## Description
This notebook generates tables and corresponding files that map plant gene identifiers from the dataset of interest to the gene symbols of orthologous human genes and, where known, to a description of the disease phenotype associated with each human gene symbol. The mappings between plant gene identifiers and human orthologs are identified using [PantherDB](http://www.pantherdb.org/), and the associations between human gene symbols and disease phenotypes are identified using [OMIM](https://omim.org/). The purpose of this mapping is to test whether any disease phenotypes are phenologs of, or are highly associated with, groups of plant genes based on their phenotype descriptions.

In [1]:
from collections import defaultdict
import pandas as pd
import urllib.request
import json
import time
import sys

sys.path.append("../../oats")
from oats.utils.utils import flatten
from _hidden import omim_api_key

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

In [2]:
# Script needs this information in the shape dict[organism name]-->[gene1, gene2, gene3, ...]
input_data = pd.read_table("/Users/irbraun/Desktop/Locus_Germplasm_Phenotype_20180702.txt")
gene_names = pd.unique(input_data["LOCUS_NAME"].values)

# Subset the gene names to not query PantherDB for the entire dataset at once, and name file accordingly.
lower_index = 4800
upper_index = 6000
gene_names = gene_names[lower_index:upper_index]
input_data = {"ARATH":gene_names}
mim2gene_filename = "../data/gene_related_files/omim/mim2gene_irb_cleaned.txt"
temp_df_filename = "../data/scratch/arabidopsis_orthologs_df_{}_to_{}.csv".format(lower_index,upper_index)
full_df_filename = "../data/orthology_related_files/arabidopsis_panther_omim_irb_{}_to_{}.csv".format(lower_index,upper_index)

### Querying PantherDB for Human Orthologs of Plant Genes

This text is copied from [this page](http://www.pantherdb.org/help/PANTHERhelp.jsp#V.A.).

Search via a URL

The following parameters are required.
1. type - This refers to the search type and should be specified as "matchingOrtholog"
2. inputOrganism - The organism/genome being queried by the search term(s)
3. targetOrganism - Target organism for orthologs from search. Multiple organisms can be specified, separated by commas
4. orthologType - LDO or all - LDO will return only the least diverged ortholog for each gene (single "best" ortholog), and all will return all orthologs if more than one
5. searchTerm - query terms; these should optimally be UniProt or MOD gene identifiers, but other identifiers such as gene symbols are supported. A maximum of 10 query terms can be submitted, separated by commas.

For the inputOrganism and targetOrganism, the 5-letter Uniprot code is used, see the Short Name field on the [summaryStats](http://www.pantherdb.org/panther/summaryStats.jsp) page for a list of available organisms and the associated codes.
Example - http://www.pantherdb.org/webservices/ortholog.jsp?type=matchingOrtholog&inputOrganism=MOUSE&targetOrganism=HUMAN&orthologType=all&searchTerm=101816
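
As a minimal sketch of the parameter rules above, the following builds a query URL from the five required parameters and splits a longer list of search terms into groups of at most ten. The helper names `batch_terms` and `build_ortholog_query` are hypothetical and are not part of PantherDB or of this notebook; only the endpoint and parameter names come from the help text.

```python
from urllib.parse import urlencode

PANTHER_ORTHOLOG_URL = "http://www.pantherdb.org/webservices/ortholog.jsp"
MAX_TERMS_PER_QUERY = 10  # limit stated in the help text above


def batch_terms(terms, size=MAX_TERMS_PER_QUERY):
    """Yield consecutive sublists holding at most `size` search terms (hypothetical helper)."""
    for start in range(0, len(terms), size):
        yield terms[start:start + size]


def build_ortholog_query(input_organism, target_organism, terms, ortholog_type="all"):
    """Return a query URL carrying the five required parameters (hypothetical helper)."""
    if len(terms) > MAX_TERMS_PER_QUERY:
        raise ValueError("at most {} search terms per query".format(MAX_TERMS_PER_QUERY))
    params = {
        "type": "matchingOrtholog",
        "inputOrganism": input_organism,
        "targetOrganism": target_organism,
        "orthologType": ortholog_type,
        "searchTerm": ",".join(terms),
    }
    # safe="," keeps the comma separators readable instead of percent-encoding them.
    return PANTHER_ORTHOLOG_URL + "?" + urlencode(params, safe=",")


# Reproduces the example URL given in the help text.
example_url = build_ortholog_query("MOUSE", "HUMAN", ["101816"])
print(example_url)
```

Percent-encoding the terms through `urlencode` guards against identifiers that contain characters with special meaning in a URL, which plain string formatting would pass through unchanged.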

For each match, the following data are returned:

If search term has no match in the input organism:
<searchTerm>    "Search term not found in input organism <inputOrganism>"
If gene is found, but it has no ortholog in the database or orthologType specified is not found in the database:
<searchTerm>     <matchedGene>    "No ortholog found in target organism <targetOrganism>"

If one or more hit, example results, one line per match (each field is tab-separated):
101816 MOUSE|MGI=MGI=101816|UniProtKB=P43247 HUMAN|ENSEMBL=ENSG00000095002|UniProtKB=P43246 MSH2 LDO 

1. searchTerm that was matched

2. Information about the matched gene in the input organism: organism|gene_id(database=id)|protein_id(database=id)

3. Information about the ortholog in the target organism: organism|gene_id(database=id)|protein_id(database=id)

4. gene symbol of the matched gene in the target organism

5. Type of ortholog, LDO (least diverged) or O (other ortholog, if more than one) or P for paralog or X for horizontal gene transfer or LDX for least diverged horizontal gene transfer
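
The five fields listed above can be pulled apart as in the sketch below, which also splits the pipe-delimited gene information into its three parts. The names `OrthologMatch`, `parse_gene_info`, and `parse_match_line` are hypothetical, and the sketch assumes that each gene-information field contains exactly two `|` separators, as in the example line.

```python
from typing import Dict, NamedTuple


class OrthologMatch(NamedTuple):
    """One parsed match line (hypothetical container for illustration)."""
    search_term: str
    input_gene: Dict[str, str]
    target_gene: Dict[str, str]
    target_symbol: str
    ortholog_type: str


def parse_gene_info(info):
    """Split 'ORGANISM|db=id|db=id' into a dict of its three parts."""
    # Assumes exactly three pipe-separated parts, as in the example line.
    organism, gene_id, protein_id = info.split("|")
    return {"organism": organism, "gene_id": gene_id, "protein_id": protein_id}


def parse_match_line(line):
    """Parse a five-field, tab-separated match line; return None for any other line."""
    fields = [field.strip() for field in line.split("\t")]
    if len(fields) != 5:
        # "Not found" messages have fewer fields and are skipped.
        return None
    term, input_info, target_info, symbol, ortholog_type = fields
    return OrthologMatch(
        term, parse_gene_info(input_info), parse_gene_info(target_info), symbol, ortholog_type
    )


example_line = (
    "101816\tMOUSE|MGI=MGI=101816|UniProtKB=P43247"
    "\tHUMAN|ENSEMBL=ENSG00000095002|UniProtKB=P43246\tMSH2\tLDO"
)
match = parse_match_line(example_line)
print(match.target_symbol, match.ortholog_type, match.target_gene["protein_id"])
# prints: MSH2 LDO UniProtKB=P43246
```

Keeping the ortholog type as a separate field makes it straightforward to keep only least diverged orthologs afterwards, for example by filtering on `match.ortholog_type == "LDO"`.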

In [3]:
# Constructing a query for orthologs; the search terms should be gene symbols or UniProt IDs.
query_format = "http://pantherdb.org/webservices/ortholog.jsp?type=matchingOrtholog&inputOrganism={}&targetOrganism={}&orthologType=all&searchTerm={}"
target_organism = "HUMAN"
query_term_limit = 10
wait_time_s = 2


# A class for a single line of the PantherDB query results; not all of this information is currently used.
class Result:
    def __init__(self, result_str):
        fields = result_str.split("\t")
        self.search_term = fields[0].strip()
        self.input_info = fields[1].strip()
        self.target_info = fields[2].strip()
        self.target_gene_symbol = fields[3].strip()
        self.ortholog_type = fields[4].strip()


# Given an input dict[organism]-->[g1, g2, g3..], make all the queries with a maximum of ten terms at a time
# to adhere to the PantherDB guidelines listed above.
tuples = []
for input_organism,gene_terms in input_data.items():
    for lower_idx in range(0,len(gene_terms),query_term_limit):
        upper_idx = min(lower_idx+query_term_limit-1,len(gene_terms)-1)
        gene_terms_sublist = gene_terms[lower_idx:upper_idx+1]
        query = query_format.format(input_organism,target_organism,",".join(gene_terms_sublist))
        # Decode the response bytes so that tabs and newlines are real characters rather than escaped text.
        all_results = urllib.request.urlopen(query).read().decode("utf-8")
        # Only lines with exactly five tab-separated fields describe an ortholog match.
        results = [Result(r) for r in all_results.splitlines() if len(r.split("\t"))==5]
        print("completed queries for {}/{} {} genes".format(upper_idx+1, len(gene_terms), input_organism.lower()))
        tuples.extend([(input_organism, result.search_term, result.target_gene_symbol) for result in results])
        time.sleep(wait_time_s)
print("done with all queries")

completed queries for 10/503 arath genes
completed queries for 20/503 arath genes
completed queries for 30/503 arath genes
completed queries for 40/503 arath genes
completed queries for 50/503 arath genes
completed queries for 60/503 arath genes
completed queries for 70/503 arath genes
completed queries for 80/503 arath genes
completed queries for 90/503 arath genes
completed queries for 100/503 arath genes
completed queries for 110/503 arath genes
completed queries for 120/503 arath genes
completed queries for 130/503 arath genes
completed queries for 140/503 arath genes
completed queries for 150/503 arath genes
completed queries for 160/503 arath genes
completed queries for 170/503 arath genes
completed queries for 180/503 arath genes
completed queries for 190/503 arath genes
completed queries for 200/503 arath genes
completed queries for 210/503 arath genes
completed queries for 220/503 arath genes
completed queries for 230/503 arath genes
completed queries for 240/503 arath genes

In [4]:
# Creating a dataframe containing all of the PantherDB query results.
ortholog_df = pd.DataFrame(tuples, columns=["species", "gene_identifier", "human_ortholog_gene_symbol"])
ortholog_df.head()
ortholog_df.to_csv(temp_df_filename, index=False)

### Querying OMIM for the Associated Disease Phenotypes
See [this page](https://www.omim.org/help/api) for structuring queries to the OMIM API.
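
The query cell further down walks a nested JSON response to reach the phenotype entries. The sketch below shows that traversal on a hand-written dictionary rather than a real API response. The function name `extract_phenotypes` is hypothetical, the key names are the ones used in this notebook's query code, and the sample values are placeholders, not OMIM data. Unlike the notebook's loop, which skips a gene entirely on a missing key, this sketch assumes some entries may lack a phenotype MIM number and records `None` for them.

```python
def extract_phenotypes(response):
    """Return (phenotype MIM number, phenotype name) pairs from a geneMap-style dict."""
    try:
        gene_map = response["omim"]["listResponse"]["geneMapList"][0]["geneMap"]
    except (KeyError, IndexError):
        # No gene map entry was returned for this MIM number.
        return []
    pairs = []
    for item in gene_map.get("phenotypeMapList", []):
        phenotype = item.get("phenotypeMap", {})
        # .get() yields None for a missing key instead of raising KeyError.
        pairs.append((phenotype.get("phenotypeMimNumber"), phenotype.get("phenotype")))
    return pairs


# Hand-written stand-in for a response; all values are placeholders.
sample_response = {
    "omim": {
        "listResponse": {
            "geneMapList": [
                {
                    "geneMap": {
                        "phenotypeMapList": [
                            {"phenotypeMap": {"phenotypeMimNumber": 100000,
                                              "phenotype": "Example phenotype A"}},
                            {"phenotypeMap": {"phenotype": "Example phenotype B"}},
                        ]
                    }
                }
            ]
        }
    }
}

for mim_number, name in extract_phenotypes(sample_response):
    print(mim_number, name)
```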

In [5]:
# Reading in the table provided by OMIM for mapping between gene symbols and MIM numbers.
mim2gene_df = pd.read_table(mim2gene_filename)
gene2mim_dict = {gene_symbol:mim_number for gene_symbol,mim_number in zip(mim2gene_df["Approved Gene Symbol (HGNC)"].values, mim2gene_df["MIM Number"].values)}
# What are the lists of human gene symbols and corresponding MIM numbers?
gene_symbols = pd.unique(ortholog_df["human_ortholog_gene_symbol"].values)
gene_mim_numbers = [gene2mim_dict.get(symbol,None) for symbol in gene_symbols]

In [6]:
# Querying OMIM for the phenotypes that are associated with the gene MIM numbers of these orthologs.
query_format = "https://api.omim.org/api/geneMap?mimNumber={}&format=json&phenotypeExists=true&start=0&limit=10&apiKey={}"
api_key = omim_api_key
wait_time_s = 1
tuples = []
for gene_symbol, gene_mim_number in zip(gene_symbols, gene_mim_numbers):
    if gene_mim_number is not None:
        try:
            query = query_format.format(gene_mim_number, api_key)
            result_dict = json.loads(urllib.request.urlopen(query).read())
            for phenotype_map_list_item in result_dict["omim"]["listResponse"]["geneMapList"][0]["geneMap"]["phenotypeMapList"]:
                phenotype_mim_number = phenotype_map_list_item["phenotypeMap"]["phenotypeMimNumber"]
                phenotype_mim_name = phenotype_map_list_item["phenotypeMap"]["phenotype"]
                tuples.append((gene_symbol, gene_mim_number, phenotype_mim_number, phenotype_mim_name))  
        except KeyError:
            continue
        except IndexError:
            continue
        time.sleep(wait_time_s)   
        
gene_mim_df = pd.DataFrame(tuples, columns=["human_ortholog_gene_symbol","gene_mim_number","phenotype_mim_number","phenotype_mim_name"])
gene_mim_df.head()

| | human_ortholog_gene_symbol | gene_mim_number | phenotype_mim_number | phenotype_mim_name |
|---|---|---|---|---|


In [7]:
# Producing a dataframe and CSV file that contain the merged information from both dataframes created above.
full_df = pd.merge(left=ortholog_df, right=gene_mim_df, how="left")
full_df["gene_mim_number"] = full_df["gene_mim_number"].astype("Int64")
full_df["phenotype_mim_number"] = full_df["phenotype_mim_number"].astype("Int64")
full_df.to_csv(full_df_filename, index=False)
full_df.head()

| | species | gene_identifier | human_ortholog_gene_symbol | gene_mim_number | phenotype_mim_number | phenotype_mim_name |
|---|---|---|---|---|---|---|
| 0 | ARATH | GR2 | GLYR1 | | | |
| 1 | ARATH | PUP3 | - | | | |
| 2 | ARATH | PUP3 | MTG1 | | | |
| 3 | ARATH | RE | - | | | |
| 4 | ARATH | RE | MTG1 | | | |
