## Saving DEMETER2v6 datasets in BigQuery Tables 
```
Title:   How to save DEMETER2v6 datasets in BigQuery Tables
Author:  Bahar Tercan
Created: 02-08-2022
Purpose: To download data from DEMETER2v6 project into BigQuery Tables 
Notes:   MyBinder may restart the kernel because of this notebook's long runtime; we recommend running it locally only.
```

This notebook provides the code used to save DEMETER2 version 6 data in BigQuery tables.
Users do not need to run this pipeline to get the data; it documents how we saved the data in BigQuery tables.

Please contact Bahar Tercan, btercan@systemsbiology.org, if you have further questions about this notebook.

Installing and importing the required libraries 

In [None]:
#Please don't run this code block if you are running the notebook in MyBinder
#This code block installs the dependencies. Uncomment the commands and run it only once, the first time you run this notebook on your computer
#(If you have already run this block for the CRISPR_save_data pipeline, you do not need to run)
#!pip3 install numpy
#!pip3 install pandas
#!pip3 install google-cloud-bigquery
#!pip3 install pandas_gbq
#importlib is part of the Python standard library and does not need to be installed
#!pip3 install openpyxl

In [None]:
import pandas as pd
from google.cloud import bigquery
import sys
sys.path.append('../../Scripts/')
import importlib
import BIGQUERY_operations
importlib.reload(BIGQUERY_operations)
from BIGQUERY_operations import *
import DEPMAP_data_preprocessing
importlib.reload(DEPMAP_data_preprocessing)
from DEPMAP_data_preprocessing import *

## Google Authentication
The first step is to authorize access to BigQuery and Google Cloud. For more information, see the ['Quick Start Guide to ISB-CGC'](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/HowToGetStartedonISB-CGC.html). Alternative authentication methods are described [here](https://googleapis.dev/python/google-api-core/latest/auth.html).

You also need to [create a Google Cloud project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#console) to be able to run BigQuery queries.

In [None]:
#Please don't run this code block if you are running the notebook in MyBinder
!gcloud auth application-default login

In [None]:
# configure project info and bigquery client
# please replace syntheticlethality with your own project_id

project_id='syntheticlethality'

# construct a BigQuery client object.
client = bigquery.Client(project_id)

Defining dataset name and dataset description

In [None]:
dataset_name='DEMETER2_V6_backup'
dataset_description=""" Cancer cell line genetic dependencies estimated using the DEMETER2 model. 
    DEMETER2 is applied to three large-scale RNAi screening datasets: 
    the Broad Institute Project Achilles, Novartis Project DRIVE, and the Marcotte et al. breast cell line dataset.
    The model is also applied to generate a combined dataset of gene dependencies covering a total of 712 unique cancer cell lines.
    For version history, see the description in the figshare
    (https://figshare.com/articles/DEMETER2_data/6025238/6).
    For more information visit https://depmap.org/R2-D2/.
    Reference: McFarland, J.M., Ho, Z.V., Kugener, G., Dempster, J.M., Montgomery, P.G., Bryan, J.G., Krill-Burger, J.M., Green, T.M., Vazquez, F., Boehm, J.S., et al. (2018).
    Improved estimation of cancer dependencies from large-scale RNAi screens using model-based normalization and data integration. Nat. Commun. 9, 4610 """

Download the DEMETER2v6 datasets from the DepMap portal

In [None]:
#CCLE_mutation_data.csv file
mutation_data=pd.read_csv("https://ndownloader.figshare.com/files/13110674") 

#sample_info.csv file
sample_info=pd.read_csv("https://ndownloader.figshare.com/files/11489717")

#WES_SNP_CN_data.csv file
cn_data=pd.read_csv("https://ndownloader.figshare.com/files/11489726", index_col=0)

#RNAseq_lRPKM_data.csv file 
gene_exp_data=pd.read_csv("https://ndownloader.figshare.com/files/13110677", index_col=0)

#D2_combined_gene_dep_scores.csv
combined_gene_dep_scores=pd.read_csv("https://ndownloader.figshare.com/files/13515395", index_col=0)
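
The three matrix-style files above are read with `index_col=0`, while the mutation and sample tables are not. A minimal sketch of the difference is given below, using a tiny made-up CSV (the labels `GENE1`, `CELL_A`, and so on are hypothetical and do not come from the DEMETER2 files). It assumes the first column of a matrix file holds row labels rather than data.

```python
import io
import pandas as pd

# Hypothetical miniature of a matrix-style file: the first column holds row labels.
csv_text = "gene,CELL_A,CELL_B\nGENE1,0.1,-0.4\nGENE2,-1.2,0.3\n"

# Without index_col, the label column is kept as an ordinary data column.
as_table = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the labels become the row index and only the value columns remain.
as_matrix = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(as_table.shape)         # (2, 3)
print(as_matrix.shape)        # (2, 2)
print(list(as_matrix.index))  # ['GENE1', 'GENE2']
```

Keeping the labels in the index leaves a purely numeric matrix, which is convenient when the matrix is later reshaped into long format.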

Read the manually created annotations for columns of tables

In [None]:
# get annotations from the excel file 
demeter6_annotations=pd.ExcelFile("../DEMETER2_Data_V6_annotation.xlsx")

In [None]:
mutation_annotation=demeter6_annotations.parse('CCLE_Mutation')
sample_annotation=demeter6_annotations.parse('Sample_Info')
gene_dep_scores_annotation=demeter6_annotations.parse('D2_combined_gene_dep_scores')
RNAseq_IRPKM_annotation=demeter6_annotations.parse('RNAseq_IRPKM_data')
cnv_annotations=demeter6_annotations.parse('WES_SNP_CN_data')

Create the BigQuery dataset in the Google Cloud project

In [None]:
dataset_name='DEMETER2_V6_backup'
dataset_description=''' Cancer cell line genetic dependencies estimated using the DEMETER2 model. 
    DEMETER2 is applied to three large-scale RNAi screening datasets: 
    the Broad Institute Project Achilles, Novartis Project DRIVE, and the Marcotte et al. breast cell line dataset.
    The model is also applied to generate a combined dataset of gene dependencies covering a total of 712 unique cancer cell lines.
    For version history, see the description in the figshare
    (https://figshare.com/articles/DEMETER2_data/6025238/6).
    For more information visit https://depmap.org/R2-D2/.
    Reference: McFarland, J.M., Ho, Z.V., Kugener, G., Dempster, J.M., Montgomery, P.G., Bryan, J.G.,
    Krill-Burger, J.M., Green, T.M., Vazquez, F., Boehm, J.S., et al. (2018). Improved estimation of 
    cancer dependencies from large-scale RNAi screens using model-based normalization and data integration.
    Nat. Commun. 9, 4610
    '''
   
CreateDataSet(client, dataset_name, project_id, dataset_description)
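
`CreateDataSet` is defined in the `BIGQUERY_operations` script and is not shown in this notebook. Independently of that helper, it can be useful to check a dataset name before sending it to BigQuery, since dataset names are limited to letters, digits, and underscores (up to 1024 characters) and are case-sensitive by default. The sketch below is only an illustration; `qualified_dataset_id` is a hypothetical helper, not part of the project's scripts.

```python
import re

# Dataset names may contain letters, digits and underscores, up to 1024 characters.
_DATASET_NAME_PATTERN = re.compile(r"[A-Za-z0-9_]{1,1024}")


def qualified_dataset_id(project_id: str, dataset_name: str) -> str:
    """Return 'project.dataset' after checking the dataset name (hypothetical helper)."""
    if not _DATASET_NAME_PATTERN.fullmatch(dataset_name):
        raise ValueError(f"Invalid BigQuery dataset name: {dataset_name!r}")
    return f"{project_id}.{dataset_name}"


print(qualified_dataset_id("syntheticlethality", "DEMETER2_V6_backup"))

# Names that differ only in letter case are different strings.
print("DEMETER2_v6_backup" == "DEMETER2_V6_backup")  # False
```

Because of the case sensitivity, it is worth using exactly one spelling of the dataset name throughout the notebook.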

Save mutation data into BigQuery table 

In [None]:
# Save mutation data into bigquery table
mutation_table_name='CCLE_mutation'
mutation_table_desc='''Mutation data taken from the file CCLE_DepMap_18Q1_maf_20180207.txt. 
Original file: CCLE_mutation_data.csv
Download link: https://ndownloader.figshare.com/files/13110674'''

mutation_dict=mutation_annotation.to_dict('records')
mutation_table=CreateTable(client, mutation_data, dataset_name, mutation_table_name, project_id, mutation_table_desc, mutation_dict)
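
Each annotation sheet is converted with `to_dict('records')` before it is passed to `CreateTable`. The sketch below shows what that call produces, using a small made-up annotation frame. The column headers `name`, `type`, and `description` and the two example rows are assumptions for illustration only; the actual headers in `DEMETER2_Data_V6_annotation.xlsx` may differ.

```python
import pandas as pd

# Hypothetical annotation sheet; the real column headers in the Excel file may differ.
annotation = pd.DataFrame(
    {
        "name": ["Hugo_Symbol", "Chromosome"],
        "type": ["STRING", "STRING"],
        "description": ["Gene symbol", "Chromosome name"],
    }
)

# to_dict('records') yields one dictionary per row, keyed by column header.
records = annotation.to_dict("records")

print(len(records))  # 2
print(records[0])    # {'name': 'Hugo_Symbol', 'type': 'STRING', 'description': 'Gene symbol'}
```

A list of per-column dictionaries like this is a convenient shape for describing a table schema, with one entry per column.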

Save sample info data into BigQuery table 

In [None]:
sample_info_table_name="sample_info"
sample_info_table_desc='''Table of metadata per cell line. Original file: sample_info.csv
Download link: https://ndownloader.figshare.com/files/11489717'''

sample_dict=sample_annotation.to_dict('records')
CreateTable(client, sample_info, dataset_name, sample_info_table_name, project_id, sample_info_table_desc, sample_dict)

Save copy number data into BigQuery table 

In [None]:
cn_table_desc='''Gene-level copy number data per cell line, derived from CCLE 
whole-exome sequencing data, along with CCLE SNP array data. Used for feature-dependency association analysis presented in the DEMETER2 manuscript.
Original file: WES_SNP_CN_data.csv 
Download link: https://ndownloader.figshare.com/files/11489726'''

cnv_long_format=shRNAPreprocess(cn_data, 'CNA')
cnv_table_name="WES_snp_cn"
cnv_dict=cnv_annotations.to_dict('records')
CreateTable(client, cnv_long_format, dataset_name, cnv_table_name, project_id, cn_table_desc, cnv_dict)
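
`shRNAPreprocess` comes from the `DEPMAP_data_preprocessing` script and its implementation is not shown here. Judging by the variable name `cnv_long_format`, it reshapes the wide matrix into long format. A minimal sketch of such a reshaping with `pandas.melt` follows; it is a stand-in under stated assumptions, not the author's function. The helper name `wide_to_long`, the output column names `row_label` and `column_label`, and the choice to drop missing values are all hypothetical.

```python
import pandas as pd


def wide_to_long(wide: pd.DataFrame, value_name: str) -> pd.DataFrame:
    """Reshape a labelled matrix into one row per (row label, column label) pair.

    Hypothetical stand-in; the output column names are illustrative only.
    """
    return (
        wide.rename_axis("row_label")  # name the index so it survives reset_index
        .reset_index()
        .melt(id_vars="row_label", var_name="column_label", value_name=value_name)
        .dropna(subset=[value_name])   # assumption: cells with missing values are skipped
        .reset_index(drop=True)
    )


# Made-up 2 x 2 matrix with one missing value.
wide = pd.DataFrame(
    {"CELL_A": [0.1, -1.2], "CELL_B": [-0.4, None]},
    index=["GENE1", "GENE2"],
)
long = wide_to_long(wide, "CNA")

print(list(long.columns))  # ['row_label', 'column_label', 'CNA']
print(len(long))           # 3
```

A long table has a fixed, small number of columns regardless of how many genes or cell lines the matrix contains, which tends to suit a column-oriented store such as BigQuery better than a table with hundreds of columns.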

Save gene expression data into BigQuery table 

In [None]:
RNAseq_IRPKM_table_desc='''log10(RPKM + 0.001) for protein-coding genes, 
derived from the file CCLE_DepMap_18Q1_RNAseq_RPKM_20180214.gct. 
Original file: RNAseq_lRPKM_data.csv Download link: https://ndownloader.figshare.com/files/13110677
'''

RNAseq_IRPKM_long_format=shRNAPreprocess(gene_exp_data, 'RPKM')
RNAseq_IRPKM_table_name="RNAseq_IRPKM"
RNAseq_IRPKM_dict=RNAseq_IRPKM_annotation.to_dict('records')
CreateTable(client, RNAseq_IRPKM_long_format, dataset_name, RNAseq_IRPKM_table_name, project_id, RNAseq_IRPKM_table_desc, RNAseq_IRPKM_dict)

Save gene dependency scores data into BigQuery table 

In [None]:
gene_dep_scores_table_desc='''Estimated gene dependency for each cell line and gene
(posterior mean estimates). Original file: D2_combined_gene_dep_scores.csv 
Download link: https://ndownloader.figshare.com/files/13515395
'''
gene_dep_scores_long_format=shRNAPreprocess(combined_gene_dep_scores, 'Combined_Gene_Dep_Score')
gene_dep_scores_table_name="D2_combined_gene_dep_score"
gene_dep_scores_dict=gene_dep_scores_annotation.to_dict('records')
CreateTable(client, gene_dep_scores_long_format, dataset_name, gene_dep_scores_table_name, project_id, gene_dep_scores_table_desc, gene_dep_scores_dict)
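
After the tables are written, a quick sanity check is to compare the row counts reported by BigQuery with the lengths of the DataFrames that were uploaded. The sketch below assumes only that the client exposes `get_table`, which returns an object with a `num_rows` attribute, as `bigquery.Client` does. `table_row_counts` is a hypothetical helper, and `_StubClient` is an offline stand-in used here so the example runs without credentials.

```python
from types import SimpleNamespace


def table_row_counts(client, project_id, dataset_name, table_names):
    """Return {table_name: row count} via client.get_table (hypothetical helper)."""
    counts = {}
    for name in table_names:
        table = client.get_table(f"{project_id}.{dataset_name}.{name}")
        counts[name] = table.num_rows
    return counts


class _StubClient:
    """Offline stand-in for bigquery.Client, used only to exercise the helper."""

    def __init__(self, rows_by_table):
        self._rows_by_table = rows_by_table

    def get_table(self, table_id):
        # Look up the made-up row count by the last component of 'project.dataset.table'.
        return SimpleNamespace(num_rows=self._rows_by_table[table_id.split(".")[-1]])


stub = _StubClient({"CCLE_mutation": 10, "sample_info": 3})
counts = table_row_counts(stub, "my-project", "DEMETER2_V6_backup", ["CCLE_mutation", "sample_info"])
print(counts)  # {'CCLE_mutation': 10, 'sample_info': 3}
```

In the notebook itself, the same helper could be called with the real `client`, `project_id`, and `dataset_name`, and the result compared against, for example, `len(mutation_data)`.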