# Setup for Conservation-Based Synthetic Lethal Discovery Pipeline
```
Title:   Saving Yeast-Human Orthology Information into BigQuery Tables
Author:  Taek-Kyun Kim
Created: 02-08-2022
Purpose: Download the yeast-human ortholog mapping information and save it in BigQuery tables
Notes:   Runs in MyBinder
```

This notebook outlines our approach to pre-processing publicly available datasets to facilitate the discovery of synthetic lethal interactions conserved between human and yeast species. The wrangled data are uploaded to BigQuery to facilitate SQL-based querying (see notebook conservedSL.ipynb in the Pipelines folder).

In [None]:
# If you are running this notebook in MyBinder, please don't run this code block.
# This code block installs the dependencies. Uncomment the commands and run it only once, the first time you run this notebook on your computer.

#!pip3 install pandas
#!pip3 install google-cloud-bigquery
#!pip3 install pyarrow

In [None]:
# load required modules
import pandas as pd

### Reference Dataset for Ortholog Mapping

Our source for human-to-yeast ortholog mapping is the [Alliance of Genome Resources (AGR)](https://www.alliancegenome.org), Release 3.0.1.
The full download can be found by navigating from the home page to Data -> Downloads -> Orthology. We use the 'Alliance combined orthology data' file from the [downloads](https://www.alliancegenome.org/downloads#orthology) page.

#### Option 1: pre-download combined orthology data

In [None]:
# local file path to download location
# AGR_ORTHOLOGS = 'ORTHOLOGY-ALLIANCE_COMBINED_4.tsv' 

#### Option 2: web-based download of combined orthology data

In [None]:
# url to file download
AGR_ORTHOLOGS = 'http://download.alliancegenome.org/3.0.1/ORTHOLOGY-ALLIANCE/COMBINED/ORTHOLOGY-ALLIANCE_COMBINED_4.tsv'

In [None]:
ortholog_table = pd.read_csv(AGR_ORTHOLOGS, sep='\t', comment='#')
ortholog_table.head()

#### Data Cleaning

In [None]:
# data cleaning - remove the source prefix from each identifier field
# e.g. HGNC:28697 --> 28697
headers = ortholog_table.columns.values
cols_to_clean = headers[['ID' in s for s in headers]]

In [None]:
def remove_column_annotation(gene_info):
    """ Clean up identifier columns
    Remove the exact source info and keep true identifiers 
    e.g. remove 'MGI:' and keep mouse gene identifier
    """
    return gene_info.str.split(':').str[-1]

In [None]:
df = ortholog_table.apply(lambda x: remove_column_annotation(x) if x.name in cols_to_clean else x)
df
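
To make the effect of the cleaning step concrete, the following minimal sketch applies the same split-on-colon logic to a toy table. The column names mirror the AGR file, but the rows and the helper name `strip_prefix` are made up for illustration.

```python
import pandas as pd

def strip_prefix(column: pd.Series) -> pd.Series:
    """Keep only the text after the last ':' in each identifier."""
    return column.str.split(':').str[-1]

# Toy rows for illustration only; these are not real AGR records.
toy = pd.DataFrame({
    'Gene1ID': ['HGNC:28697', 'HGNC:100'],
    'Gene1Symbol': ['GENE_A', 'GENE_B'],
    'Gene2ID': ['SGD:S000000001', 'SGD:S000000002'],
})

# Select the identifier columns the same way the notebook does: by name.
id_cols = [c for c in toy.columns if 'ID' in c]

cleaned = toy.copy()
cleaned[id_cols] = cleaned[id_cols].apply(strip_prefix)
print(cleaned)
```

Because the selection matches on the substring 'ID', any column whose name happens to contain it is stripped as well, so it is worth inspecting `cols_to_clean` before applying the function.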

In [None]:
# add column: AlgorithmsMatchPerc - the proportion (0-1) of algorithms that agree on the ortholog mapping
df["AlgorithmsMatchPerc"] = df["AlgorithmsMatch"]/df["OutOfAlgorithms"] 
df
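
The sketch below computes the same ratio on two invented rows, so the expected values are easy to check by hand. Only the column names come from the AGR file; the counts are assumptions chosen for illustration.

```python
import pandas as pd

# Invented counts for illustration only.
toy = pd.DataFrame({
    'AlgorithmsMatch': [9, 3],
    'OutOfAlgorithms': [12, 12],
})

# Element-wise division: matching algorithms over algorithms that could match.
toy['AlgorithmsMatchPerc'] = toy['AlgorithmsMatch'] / toy['OutOfAlgorithms']
print(toy)
```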

### Data Filtering
Keep ortholog mapping information only for the species relevant to our purposes (human and yeast).

In [None]:
# create human to yeast mapping table
human2yeast = df.loc[(df['Gene1SpeciesName'] == 'Homo sapiens') & (df['Gene2SpeciesName'] == 'Saccharomyces cerevisiae')]
human2yeast = human2yeast.rename(columns={'Gene1ID': 'HumanID', 'Gene1Symbol': 'HumanSymbol',
                                'Gene2ID': 'YeastID', 'Gene2Symbol': 'YeastSymbol',})

human2yeast = human2yeast.filter(items=['HumanID', 'HumanSymbol', 'YeastID', 'YeastSymbol',
                'Algorithms', 'AlgorithmsMatch', 'OutOfAlgorithms', 'AlgorithmsMatchPerc',
                          'IsBestScore', 'IsBestRevScore',])
human2yeast.head()
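
The filter-then-rename pattern above can be tried on a small frame first. This is a minimal sketch with invented rows (the gene IDs and symbols are placeholders, not real records); it shows that only pairs with a human Gene1 and a yeast Gene2 survive the mask.

```python
import pandas as pd

# Invented rows for illustration only; column names follow the AGR file.
toy = pd.DataFrame({
    'Gene1ID': ['1', 'S01', '2'],
    'Gene1Symbol': ['GENE_A', 'YGENE1', 'GENE_B'],
    'Gene1SpeciesName': ['Homo sapiens', 'Saccharomyces cerevisiae', 'Homo sapiens'],
    'Gene2ID': ['S01', '1', 'M7'],
    'Gene2Symbol': ['YGENE1', 'GENE_A', 'Mgene7'],
    'Gene2SpeciesName': ['Saccharomyces cerevisiae', 'Homo sapiens', 'Mus musculus'],
})

# Boolean mask: Gene1 must be human and Gene2 must be yeast.
is_human_to_yeast = (
    (toy['Gene1SpeciesName'] == 'Homo sapiens')
    & (toy['Gene2SpeciesName'] == 'Saccharomyces cerevisiae')
)

toy_h2y = (
    toy.loc[is_human_to_yeast]
    .rename(columns={'Gene1ID': 'HumanID', 'Gene1Symbol': 'HumanSymbol',
                     'Gene2ID': 'YeastID', 'Gene2Symbol': 'YeastSymbol'})
    .filter(items=['HumanID', 'HumanSymbol', 'YeastID', 'YeastSymbol'])
)
print(toy_h2y)
```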

In [None]:
# infer_objects() returns a new DataFrame, so assign the result back
human2yeast = human2yeast.infer_objects()
human2yeast['HumanID'] = human2yeast['HumanID'].astype(str).astype(int)

In [None]:
human2yeast.dtypes

In [None]:
# create yeast to human mapping table
yeast2human = df.loc[(df['Gene1SpeciesName'] == 'Saccharomyces cerevisiae') & (df['Gene2SpeciesName'] == 'Homo sapiens')]
# Gene1 is the yeast gene and Gene2 is the human gene in this direction
yeast2human = yeast2human.rename(columns={'Gene1ID': 'YeastID', 'Gene1Symbol': 'YeastSymbol',
                                'Gene2ID': 'HumanID', 'Gene2Symbol': 'HumanSymbol',})
yeast2human = yeast2human.filter(items= ['YeastID', 'YeastSymbol','HumanID', 'HumanSymbol',
                         'Algorithms', 'AlgorithmsMatch', 'OutOfAlgorithms', 'AlgorithmsMatchPerc',
                            'IsBestScore', 'IsBestRevScore'])
yeast2human.head()

In [None]:
yeast2human['HumanID']=yeast2human['HumanID'].astype(str).astype(int)

In [None]:
# write ortholog mapping tables to file
yeast2human.to_csv(path_or_buf='yeast2human_alliance_v4.csv', index=False)
human2yeast.to_csv(path_or_buf='human2yeast_alliance_v4.csv', index=False)

### Create BigQuery Dataset and Upload Data


#### Google Authentication
The first step is to authorize access to BigQuery and the Google Cloud. For more information, see the ['Quick Start Guide to ISB-CGC'](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/HowToGetStartedonISB-CGC.html). Alternative authentication methods can be found [here](https://googleapis.dev/python/google-api-core/latest/auth.html).

You also need to [create a Google Cloud project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#console) to be able to run BigQuery queries.

In [None]:
# If you are running this notebook in MyBinder, please don't run this code block
# google cloud authentication
!gcloud auth application-default login

In [None]:
# pip install pyarrow
from google.cloud import bigquery

In [None]:
# configure project info and bigquery client
project = 'syntheticlethality'  # Replace with your project ID

# construct a BigQuery client object.
client = bigquery.Client(project)

#### Create a bigquery dataset within your project 


In [None]:
# create the dataset if it does not already exist
from google.api_core.exceptions import NotFound

dataset_name = 'Orthology_backup'
dataset_id = "{}.{}".format(project, dataset_name)
try:
    client.get_dataset(dataset_id)
    print("Dataset: {} already exists".format(dataset_id))
except NotFound:
    # construct a full Dataset object to send to the API.
    dataset = bigquery.Dataset(dataset_id)

    # send the dataset to the API for creation.
    dataset = client.create_dataset(dataset)  # Make an API request.
    print("Created dataset: {}".format(dataset_id))
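
An alternative to the try/except pattern is to let the client library handle the already-exists case. The sketch below assumes a recent `google-cloud-bigquery` release, in which `Client.create_dataset` accepts a dataset ID string and an `exists_ok` flag. The helper name `ensure_dataset`, the project name `my-project`, and the stub client are hypothetical; the stub only stands in for a real `bigquery.Client` so the example runs without credentials.

```python
class StubClient:
    """Hypothetical stand-in for bigquery.Client, used only for a dry run."""

    def __init__(self):
        self.datasets = set()

    def create_dataset(self, dataset, exists_ok=False):
        # Roughly mimic the real client: re-creating a dataset is an error
        # unless exists_ok is True (the real client raises a Conflict error).
        if dataset in self.datasets and not exists_ok:
            raise RuntimeError("Already exists: {}".format(dataset))
        self.datasets.add(dataset)
        return dataset


def ensure_dataset(client, project, dataset_name):
    """Create project.dataset_name if needed and return its full ID."""
    dataset_id = "{}.{}".format(project, dataset_name)
    # exists_ok=True means an existing dataset does not raise an error.
    client.create_dataset(dataset_id, exists_ok=True)
    return dataset_id


stub = StubClient()
first = ensure_dataset(stub, 'my-project', 'Orthology_backup')
second = ensure_dataset(stub, 'my-project', 'Orthology_backup')  # no error
print(first)
```

With a real client, the call would be `ensure_dataset(client, project, dataset_name)`.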

#### Configure upload job - Yeast2Human Table


In [None]:
yeast2human.columns.values

In [None]:
print(yeast2human.dtypes) 
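
Since the load job configured below declares `HumanID` and the algorithm counts as INTEGER and the match proportion as FLOAT, it can help to confirm that the pandas dtypes agree before uploading. The following is a minimal sketch with a hypothetical helper, `find_dtype_mismatches`, run against an invented two-row frame; the `EXPECTED_KINDS` mapping is an assumption that mirrors the schema used in this notebook.

```python
import pandas as pd
from pandas.api import types as ptypes

# Assumed mapping of column name to the dtype check the upload expects.
EXPECTED_KINDS = {
    'HumanID': ptypes.is_integer_dtype,
    'AlgorithmsMatch': ptypes.is_integer_dtype,
    'OutOfAlgorithms': ptypes.is_integer_dtype,
    'AlgorithmsMatchPerc': ptypes.is_float_dtype,
}

def find_dtype_mismatches(frame: pd.DataFrame) -> list:
    """Return the columns that are missing or fail their dtype check."""
    return [
        name for name, check in EXPECTED_KINDS.items()
        if name not in frame.columns or not check(frame[name])
    ]

# Invented rows; HumanID is left as strings to show a failing check.
toy = pd.DataFrame({
    'HumanID': ['28697', '100'],
    'AlgorithmsMatch': [9, 3],
    'OutOfAlgorithms': [12, 12],
    'AlgorithmsMatchPerc': [0.75, 0.25],
})
before = find_dtype_mismatches(toy)
print(before)

# Cast the identifiers the same way the notebook does, then re-check.
toy['HumanID'] = toy['HumanID'].astype(str).astype(int)
after = find_dtype_mismatches(toy)
print(after)
```

Running the same helper on `yeast2human` and `human2yeast` before the load step would flag a column whose cast was skipped.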

In [None]:
table_description = '''
Mapping of conserved genes from yeast (Saccharomyces cerevisiae) to human (Homo sapiens), derived from integrated orthology inferences created using the Drosophila RNAi Screening Center (DRSC) Integrative Ortholog Prediction Tool (DIOPT)
at Harvard Medical School. See more information at: https://www.flyrnai.org/diopt.

Table downloaded from the Alliance of Genome Resources, Release 3.0.1:
https://www.alliancegenome.org/downloads#orthology
'''

In [None]:
job_config = bigquery.LoadJobConfig(
    destination_table_description=table_description,
    
    # specify table schema
    schema=[
        bigquery.SchemaField(name="YeastID", field_type=bigquery.enums.SqlTypeNames.STRING, 
                             description='Saccharomyces Genome Database (SGD) gene identifier'),
        bigquery.SchemaField(name="YeastSymbol", field_type=bigquery.enums.SqlTypeNames.STRING, 
                             description="Yeast official gene symbol"),
        bigquery.SchemaField(name="HumanID", field_type=bigquery.enums.SqlTypeNames.INTEGER, 
                             description='HGNC gene identifier'),
        bigquery.SchemaField(name="HumanSymbol", field_type=bigquery.enums.SqlTypeNames.STRING, 
                             description='HGNC gene symbol'),
        bigquery.SchemaField(name="Algorithms", field_type=bigquery.enums.SqlTypeNames.STRING, 
                             description='The orthology methods that make the orthology association for the species'),
        bigquery.SchemaField(name="AlgorithmsMatch", field_type=bigquery.enums.SqlTypeNames.INTEGER, 
                             description='The number of orthology methods that make the orthology association for the species'),
        bigquery.SchemaField(name="OutOfAlgorithms", field_type=bigquery.enums.SqlTypeNames.INTEGER, 
                             description='The total number of orthology methods that could make the orthology association for the species'),
        bigquery.SchemaField(name="AlgorithmsMatchPerc", field_type=bigquery.enums.SqlTypeNames.FLOAT, 
                             description='The proportion of orthology methods that make the orthology association for the species'),
        bigquery.SchemaField(name="IsBestScore", field_type=bigquery.enums.SqlTypeNames.STRING, 
                             description='Within the species, whether this gene is called the ortholog of the input gene by the highest number of algorithms'),
        bigquery.SchemaField(name="IsBestRevScore", field_type=bigquery.enums.SqlTypeNames.STRING, 
                             description='Within the species of the input gene, whether the input gene is called the ortholog of the gene by the highest number of algorithms'), 
    ],
    
    write_disposition="WRITE_TRUNCATE" #replaces the table with the loaded data
)


In [None]:
# create table name
table_name = 'YEAST2HUMAN'
table_id = "{}.{}.{}".format(project,  dataset_name, table_name)

#### Load tables to bigquery

In [None]:
job = client.load_table_from_dataframe(yeast2human, table_id, job_config=job_config)

In [None]:
job.result()  # Wait for the job to complete.


In [None]:
table = client.get_table(table_id)  # Make an API request.
print(
    "Loaded {} rows and {} columns to {}".format(
        table.num_rows, len(table.schema), table_id
    )
)

#### Configure upload job - Human2Yeast Table

In [None]:
table_description = '''
Mapping of conserved genes from human (Homo sapiens) to yeast (Saccharomyces cerevisiae), derived from integrated orthology inferences created using the Drosophila RNAi Screening Center (DRSC) Integrative Ortholog Prediction Tool (DIOPT)
at Harvard Medical School. See more information at: https://www.flyrnai.org/diopt.

Table downloaded from the Alliance of Genome Resources, Release 3.0.1:
https://www.alliancegenome.org/downloads#orthology
'''

In [None]:
human2yeast.columns.values

In [None]:
# configure upload
job_config = bigquery.LoadJobConfig(
    destination_table_description=table_description,
    
    # specify table schema
    schema=[
        bigquery.SchemaField(name="HumanID", field_type=bigquery.enums.SqlTypeNames.INTEGER, 
                             description='HGNC gene identifier'),
        bigquery.SchemaField(name="HumanSymbol", field_type=bigquery.enums.SqlTypeNames.STRING, 
                             description='HGNC gene symbol'),
        bigquery.SchemaField(name="YeastID", field_type=bigquery.enums.SqlTypeNames.STRING, 
                             description='Saccharomyces Genome Database (SGD) gene identifier'),
        bigquery.SchemaField(name="YeastSymbol", field_type=bigquery.enums.SqlTypeNames.STRING, 
                             description="Yeast official gene symbol"),
        bigquery.SchemaField(name="Algorithms", field_type=bigquery.enums.SqlTypeNames.STRING, 
                             description='The orthology methods that make the orthology association for the species'),
        bigquery.SchemaField(name="AlgorithmsMatch", field_type=bigquery.enums.SqlTypeNames.INTEGER, 
                             description='The number of orthology methods that make the orthology association for the species'),
        bigquery.SchemaField(name="OutOfAlgorithms", field_type=bigquery.enums.SqlTypeNames.INTEGER, 
                             description='The total number of orthology methods that could make the orthology association for the species'),
        bigquery.SchemaField(name="AlgorithmsMatchPerc", field_type=bigquery.enums.SqlTypeNames.FLOAT, 
                             description='The proportion of orthology methods that make the orthology association for the species'),
        bigquery.SchemaField(name="IsBestScore", field_type=bigquery.enums.SqlTypeNames.STRING, 
                             description='Within the species, whether this gene is called the ortholog of the input gene by the highest number of algorithms'),
        bigquery.SchemaField(name="IsBestRevScore", field_type=bigquery.enums.SqlTypeNames.STRING, 
                             description='Within the species of the input gene, whether the input gene is called the ortholog of the gene by the highest number of algorithms'), 
    ],
    
    write_disposition="WRITE_TRUNCATE" #replaces the table with the loaded data
)


In [None]:
# create table name
table_name = 'HUMAN2YEAST'
table_id = "{}.{}.{}".format(project,  dataset_name, table_name)

In [None]:
job = client.load_table_from_dataframe(human2yeast, table_id, job_config=job_config)

In [None]:
job.result()

In [None]:
table = client.get_table(table_id)  # Make an API request.
print(
    "Loaded {} rows and {} columns to {}".format(
        table.num_rows, len(table.schema), table_id
    )
)