NextGenETL: ETL Scripts (2019)

A set of scripts for pulling data out of GDC.

Support scripts

These are in ./scripts:

setEnvVars.sh: Needs to be edited to match your set-up. The running scripts pull configuration files out of Google Cloud Storage. These two variables say where in GCS to find these files.
setup_vm.sh: Almost all these scripts must be run on a Google cloud VM. This script will configure the VM. Chicken-and-egg: though this script is in GitHub, you need to run it on the VM to get GitHub support and download of this repo on the VM. Copy and paste the raw code.
reload-from-github.sh: If you make code changes to the repo, run this script on the VM to pull changes in. Dumb script just deletes the tree and clones the repo.

Installing GDC Release Data into BQ tables

When there is a new GDC data release, we pull data about the release out of GDC using their API and build BQ tables holding these data in the isb-cgc.GDC_metadata data set. These steps are currently (10/2019) performed on an ISB internal server and uploaded to BQ. Tables created (e.g. * = 16):

rel*_aliquot2caseIDmap
rel*_caseData
rel*_fileData_active
rel*_fileData_legacy
rel*_slide2caseIDmap

Extracting DCF Manifest Data into BQ Tables

For each GDC release, the DCF issues manifests (active, legacy) of all the files they are hosting in TSV files, including the mapping of each file UUID to a GCS gs:// bucket path. We import these file into BQ, and then create simple file mapping tables (e.g. * = 16):

DCF_DR*_active_manifest
DCF_DR*_legacy_manifest
dr*_active_file_map
dr*_legacy_file_map

This script, that does this operation, is configured via using a YAML file:

run-manifest.sh
- Sample config file: ./ConfigFiles.ManifestBQBuild.yaml

These sample files should be customized for parameters like project, dataset, table names, etc., and then uploaded into the Google bucket specified in your personalized ~/setEnvVars.sh file on the VM.

ETL Scripts to Build BigQuery Data Tables

Once the above tables have been created for a release, you can then use these scripts to build BQ data tables for the release, if any of the relevant data has been updated.

run-gexp.sh Used to build BQ tables like isb-cgc.TCGA_hg38_data_v0.RNAseq_Gene_Expression.
- Sample config file: ./ConfigFiles.RnaSeqGexpBQBuild.yaml
run-mirna-expr.sh: Used to build BQ tables like isb-cgc.TCGA_hg38_data_v0.miRNAseq_Expression.
- Sample config file: ./ConfigFiles.MirnaExprBQBuild.yaml
run-mirna-isoforms.sh Used to build BQ tables like isb-cgc.TCGA_hg38_data_v0.miRNAseq_Isoform_Expression
- Sample config file: ./ConfigFiles.MirnaIsoformExprBQBuild.yaml

ETL Script to Build BigQuery Metadata Used by Web App

The script is still under development. However, a version that recreates the existing table from archival data is complete. However, BQ table schemas have changed, so this script is in the process of being updated:

run-meta-archive-test.sh
- Sample config file: ./ConfigFiles.ArchivalMetadataTest.yaml

Name		Name	Last commit message	Last commit date
Latest commit History 7,099 Commits
BQ_Table_Building		BQ_Table_Building
ConfigFiles		ConfigFiles
DCF-Manifest-Pulls		DCF-Manifest-Pulls
GDC-Metadata-Processing		GDC-Metadata-Processing
Reactome		Reactome
SchemaFiles		SchemaFiles
TP53		TP53
Targetome		Targetome
cda_bq_etl		cda_bq_etl
common_etl		common_etl
gdc_clinical_resources		gdc_clinical_resources
labels_description_name		labels_description_name
packages		packages
scripts		scripts
temp		temp
tests/common_etl		tests/common_etl
README.md		README.md

isb-cgc/NextGenETL

Folders and files

Latest commit

History

Repository files navigation

NextGenETL: ETL Scripts (2019)

Support scripts

Installing GDC Release Data into BQ tables

Extracting DCF Manifest Data into BQ Tables

ETL Scripts to Build BigQuery Data Tables

ETL Script to Build BigQuery Metadata Used by Web App

About

Resources

Stars

Watchers

Forks

Languages