## Saving DEPMAP20Q3 datasets in BigQuery Tables 
```
Title:   How to save DEPMAP20Q3 datasets in BigQuery Tables
Author:  Bahar Tercan
Created: 02-08-2022
Purpose: To download data from DEPMAP20Q3 project into BigQuery Tables 
Notes:   MyBinder may restart the kernel because of this notebook's long runtime; we recommend running it locally only.
```

This notebook provides the code used to save the DepMap_public_20Q3 data in BigQuery tables.
Users do not need to run this pipeline to get the data; it only documents how we saved the data in BigQuery tables.

Please contact Bahar Tercan, btercan@systemsbiology.org, if you have further questions about this notebook.


Installing and importing the required libraries 

In [None]:
# Please don't run this code block if you are running the notebook in MyBinder.
# This code block installs the dependencies. Uncomment and run it only once, the first time you run this notebook on your computer.
# (If you have already run this block for the shRNA_save_data pipeline, you do not need to run it again.)
#!pip3 install numpy
#!pip3 install pandas
#!pip3 install google-cloud-bigquery
#!pip3 install pandas_gbq
#!pip3 install openpyxl  # pandas needs this to read the .xlsx annotation file
# importlib is part of the Python standard library and does not need to be installed

In [None]:
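import sys
import importlib

import numpy as np
import pandas as pd  # pd is used below; imported explicitly rather than relying on the star imports

# make the project's helper modules importable
sys.path.append('../../Scripts/')

# reload the helper modules so that local edits are picked up without restarting the kernel
import BIGQUERY_operations
importlib.reload(BIGQUERY_operations)
from BIGQUERY_operations import *
import DEPMAP_data_preprocessing
importlib.reload(DEPMAP_data_preprocessing)
from DEPMAP_data_preprocessing import *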


## Google Authentication
The first step is to authorize access to BigQuery and Google Cloud. For more information, see the ['Quick Start Guide to ISB-CGC'](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/HowToGetStartedonISB-CGC.html); alternative authentication methods are described [here](https://googleapis.dev/python/google-api-core/latest/auth.html).

You also need to [create a Google Cloud project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#console) to be able to run BigQuery queries.

In [None]:
#Please don't run this code block if you are running the notebook in MyBinder
!gcloud auth application-default login

In [None]:
from google.cloud import bigquery

In [None]:
# configure project info and bigquery client
# please replace syntheticlethality with your own project_id


project_id='syntheticlethality'

# construct a BigQuery client object.
client = bigquery.Client(project_id)

Defining the dataset name and description, and creating the BigQuery dataset in the Google Cloud project

In [None]:
dataset_name='DepMap_public_20Q3_backup'
dataset_description="""  
This DepMap release contains data from CRISPR knockout 
screens from project Achilles, as well as genomic characterization data from the CCLE project.

References:
Dempster, J.M., Rossen, J., Kazachkova, M., Pan, J., Kugener, G., Root, D.E., and Tsherniak, A. (2019). Extracting Biological Insights from the Project Achilles Genome-Scale CRISPR Screens in Cancer Cell Lines.

Meyers, R.M., Bryan, J.G., McFarland, J.M., Weir, B.A., Sizemore, A.E., Xu, H., Dharia, N.V., Montgomery, P.G., Cowley, G.S., Pantel, S., et al. (2017). Computational correction of copy number effect improves specificity of CRISPR-Cas9 essentiality screens in cancer cells. Nat. Genet. 49, 1779–1784.

Ghandi, M., Huang, F.W., Jané-Valbuena, J., Kryukov, G.V., Lo, C.C., McDonald, E.R., 3rd, Barretina, J., Gelfand, E.T., Bielski, C.M., Li, H., et al. (2019). Next-generation characterization of the Cancer Cell Line Encyclopedia. Nature 569, 503–508

"""

CreateDataSet(client, dataset_name, project_id, dataset_description)
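
`CreateDataSet` is defined in `BIGQUERY_operations` and is not shown in this notebook. As a rough sketch only, a helper of this kind could be written with the `google-cloud-bigquery` client as below. The names `full_dataset_id` and `create_dataset_sketch` and the `location` default are assumptions made for illustration, not the project's actual implementation.

```python
import re


def full_dataset_id(project_id: str, dataset_name: str) -> str:
    """Build the 'project.dataset' identifier, rejecting invalid dataset names."""
    # BigQuery dataset names may contain only letters, digits and underscores.
    if not re.fullmatch(r"[A-Za-z0-9_]{1,1024}", dataset_name):
        raise ValueError(f"invalid BigQuery dataset name: {dataset_name!r}")
    return f"{project_id}.{dataset_name}"


def create_dataset_sketch(client, project_id, dataset_name, description, location="US"):
    """Hypothetical stand-in for CreateDataSet; returns the created or existing dataset."""
    # Imported lazily so that full_dataset_id works without the client library.
    from google.cloud import bigquery

    dataset = bigquery.Dataset(full_dataset_id(project_id, dataset_name))
    dataset.description = description
    dataset.location = location  # assumed default; pick the region you need
    # exists_ok=True keeps the call from failing when the cell is re-run.
    return client.create_dataset(dataset, exists_ok=True)
```

Passing `exists_ok=True` is a design choice for notebooks: re-running the cell returns the existing dataset instead of raising a conflict error.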

Download the DEPMAP20Q3 datasets from the DepMap portal

In [None]:
# CCLE_mutations.csv (read as tab-separated)
mutation_data=pd.read_csv("https://ndownloader.figshare.com/files/24613355", sep="\t")

# sample_info.csv
sample_info=pd.read_csv("https://ndownloader.figshare.com/files/24613394", sep=",")

# WES_SNP_CN_data.csv (first column used as the row index)
cn_data=pd.read_csv("https://ndownloader.figshare.com/files/24613352", index_col=0)

# CCLE_expression.csv (first column used as the row index)
gene_exp_data=pd.read_csv("https://ndownloader.figshare.com/files/24613325", index_col=0)

# Achilles_gene_effect.csv (first column used as the row index)
achilles_gene_effect=pd.read_csv("https://ndownloader.figshare.com/files/24613292", index_col=0)
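
Because the notebook has a long runtime, it can help to keep a local copy of each download so that a restarted kernel does not fetch the files again. The following is a minimal sketch under that assumption. `cache_path`, `read_csv_cached` and the `depmap_cache` directory are hypothetical names, and the download step has not been verified against the figshare server.

```python
import urllib.request
from pathlib import Path
from urllib.parse import urlparse

import pandas as pd


def cache_path(url: str, cache_dir: str = "depmap_cache") -> Path:
    """Map a download URL to a local file named after the last URL segment."""
    file_id = Path(urlparse(url).path).name
    return Path(cache_dir) / f"{file_id}.csv"


def read_csv_cached(url: str, cache_dir: str = "depmap_cache", **read_csv_kwargs):
    """Download the file once, then read it from the local copy on later runs."""
    path = cache_path(url, cache_dir)
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, path)
    # Extra keyword arguments (sep, index_col, ...) go straight to pandas.
    return pd.read_csv(path, **read_csv_kwargs)
```

With this helper, a call such as `read_csv_cached("https://ndownloader.figshare.com/files/24613352", index_col=0)` would take the place of the corresponding `pd.read_csv` call.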


Read the manually created annotations for the table columns

In [None]:
# get the annotations from the Excel file (one sheet per table)
depmap_annotations=pd.ExcelFile("../Depmap20Q3_annotation.xlsx")

In [None]:
mutation_annotation=depmap_annotations.parse('CCLE_Mutations')
sample_annotation=depmap_annotations.parse('Sample_Info')
achilles_gene_effect_annotation=depmap_annotations.parse('Achilles_Gene_Effect')
CCLE_expression_annotation=depmap_annotations.parse('CCLE_Gene_Expression')
cnv_annotations=depmap_annotations.parse('CCLE_Copy_Number')
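
Later cells convert each annotation sheet with `to_dict('records')`, which yields one dictionary per row keyed by the sheet's column headers. The sketch below shows that conversion on a small stand-in table. The headers `name`, `type` and `description`, the example rows, and the `to_schema` mapping are assumptions for illustration; the real layout of `Depmap20Q3_annotation.xlsx` and the way `CreateTable` consumes the records are not shown in this notebook.

```python
import pandas as pd

# Hypothetical annotation sheet: one row per table column.
annotation = pd.DataFrame(
    {
        "name": ["DepMap_ID", "Hugo_Symbol"],
        "type": ["STRING", "STRING"],
        "description": ["DepMap cell line identifier", "HUGO gene symbol"],
    }
)

# One dict per row, keyed by the sheet's column headers.
records = annotation.to_dict("records")


def to_schema(records):
    """Turn annotation records into BigQuery schema fields (assumed mapping)."""
    from google.cloud import bigquery  # lazy import; needs google-cloud-bigquery

    return [
        bigquery.SchemaField(r["name"], r["type"], description=r["description"])
        for r in records
    ]
```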

Save mutation data into BigQuery table 

In [None]:
mutation_table_name='CCLE_mutation'
mutation_table_desc='''Pipeline: Mutations MAF of gene mutations. Original file: CCLE_mutations.csv
Download link: https://ndownloader.figshare.com/files/24613355.'''

mutation_dict=mutation_annotation.to_dict('records')
mutation_table=CreateTable(client, mutation_data, dataset_name, mutation_table_name, project_id,  mutation_table_desc, mutation_dict)

Save sample info data into BigQuery table 

In [None]:
sample_info_table_name="sample_info"
sample_info_table_desc='''Cell line information definitions. Original file: sample_info.csv
Download link: https://ndownloader.figshare.com/files/24613394'''

sample_dict=sample_annotation.to_dict('records')
CreateTable(client, sample_info, dataset_name, sample_info_table_name, project_id, sample_info_table_desc, sample_dict)

Save copy number data into BigQuery table 

In [None]:
cn_table_desc='''Pipeline: Copy number Gene level copy number data, log2 transformed with a pseudo count of 1.
This is generated by mapping genes onto the segment level calls. 
Original file: WES_SNP_CN_data.csv Download link: https://ndownloader.figshare.com/files/24613352.'''

cnv_long_format=CRISPRPreprocess(cn_data, 'CNA')
cnv_long_format['Entrez_ID']=pd.to_numeric(cnv_long_format['Entrez_ID'])
cnv_table_name="CCLE_gene_cn"
cnv_dict=cnv_annotations.to_dict('records')
CreateTable(client, cnv_long_format, dataset_name, cnv_table_name, project_id, cn_table_desc, cnv_dict)
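
`CRISPRPreprocess` comes from `DEPMAP_data_preprocessing` and is not shown here; the cell above only tells us that its output has an `Entrez_ID` column that can be converted to numeric. Assuming the wide matrix has cell lines as its index and columns labelled `SYMBOL (ENTREZ_ID)`, a wide-to-long reshape could look like the sketch below. `wide_to_long` and the output column names `DepMap_ID` and `Hugo_Symbol` are hypothetical.

```python
import pandas as pd


def wide_to_long(wide: pd.DataFrame, value_name: str) -> pd.DataFrame:
    """Reshape a cell-line x gene matrix into one row per (cell line, gene)."""
    long = (
        wide.rename_axis("DepMap_ID")  # name the index so it becomes a column
        .reset_index()
        .melt(id_vars="DepMap_ID", var_name="gene", value_name=value_name)
    )
    # Split labels such as "A1BG (1)" into a symbol and an Entrez ID.
    parts = long["gene"].str.extract(r"^(?P<Hugo_Symbol>.+?) \((?P<Entrez_ID>\d+)\)$")
    long = pd.concat([long[["DepMap_ID"]], parts, long[[value_name]]], axis=1)
    long["Entrez_ID"] = pd.to_numeric(long["Entrez_ID"])
    return long


# Tiny made-up matrix, only to show the shape of the result.
wide = pd.DataFrame(
    [[0.1, 0.2], [0.3, 0.4]],
    index=["ACH-000001", "ACH-000002"],
    columns=["A1BG (1)", "A2M (2)"],
)
long = wide_to_long(wide, "CNA")
```

The long layout gives every table the same small set of columns regardless of how many genes the source file contains, which is the form the gene-level tables in this notebook are uploaded in.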

Save gene expression data into BigQuery table 

In [None]:
CCLE_expression_table_desc='''
Pipeline: Expression RNAseq TPM gene expression data for just protein coding genes using RSEM. 
Log2 transformed, using a pseudo-count of 1. Original file: CCLE_expression.csv 
Download link: https://ndownloader.figshare.com/files/24613325'''

CCLE_expression_long_format=CRISPRPreprocess(gene_exp_data, 'TPM')
CCLE_expression_long_format['Entrez_ID']=pd.to_numeric(CCLE_expression_long_format['Entrez_ID'])
CCLE_expression_table_name="CCLE_gene_expression"
CCLE_expression_dict=CCLE_expression_annotation.to_dict('records')
CreateTable(client, CCLE_expression_long_format, dataset_name, CCLE_expression_table_name, project_id, CCLE_expression_table_desc, CCLE_expression_dict)




Save gene effect data into BigQuery table 

In [None]:
achilles_gene_effect_table_desc='''Pipeline: Achilles_Post-CERES_ CERES data with principal components strongly related to known batch effects removed, then shifted and scaled per cell line so the median nonessential KO effect is 0 and the median essential KO effect is -1.
Original file: Achilles_gene_effect.csv 
Download link: https://ndownloader.figshare.com/files/24613292
'''
achilles_gene_effect_long_format=CRISPRPreprocess(achilles_gene_effect, 'Gene_Effect')
achilles_gene_effect_long_format['Entrez_ID']=pd.to_numeric(achilles_gene_effect_long_format['Entrez_ID'])
achilles_gene_effect_table_name="Achilles_gene_effect"
achilles_gene_effect_dict=achilles_gene_effect_annotation.to_dict('records')
CreateTable(client, achilles_gene_effect_long_format, dataset_name, achilles_gene_effect_table_name, project_id, achilles_gene_effect_table_desc, achilles_gene_effect_dict)
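
After the uploads finish, a quick row count is one way to confirm that a table was written. The sketch below only builds the query text; `row_count_query` is a hypothetical helper, and the commented lines assume the authenticated `client` created earlier in this notebook.

```python
def row_count_query(project_id: str, dataset_name: str, table_name: str) -> str:
    """Return a standard-SQL query that counts the rows of one table."""
    return (
        "SELECT COUNT(*) AS n_rows "
        f"FROM `{project_id}.{dataset_name}.{table_name}`"
    )


sql = row_count_query(
    "syntheticlethality", "DepMap_public_20Q3_backup", "Achilles_gene_effect"
)

# With an authenticated client (requires network access and credentials):
# row = next(iter(client.query(sql).result()))
# print(row.n_rows)
```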

