Create BigQuery-ETL.rst
DeenaBleich committed Jan 19, 2021
1 parent 7bbe5ee commit d12b039
Showing 1 changed file with 34 additions and 0 deletions.
34 changes: 34 additions & 0 deletions docs/source/sections/BigQuery/BigQuery-ETL.rst
@@ -0,0 +1,34 @@
How ISB-CGC BigQuery Tables are Created
=======================================

The ISB-CGC team extracts, transforms, and loads data from cancer data repositories into Google BigQuery tables so that the data are easier to access and analyze.

Extract, Transform and Load (ETL) process overview
---------------------------------------------------

The process differs slightly depending on the source of the data (GDC, PDC, GENCODE, etc.) and the data type (RNA sequencing, Somatic Mutation, etc.).
In general, data are gathered either through an Application Programming Interface (API) that the source provides for this purpose or from files that the source provides.
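
The two extraction routes described above can be sketched roughly as follows. This is a minimal illustration, not the ISB-CGC implementation: the base URL ``https://api.example.org``, the ``cases`` endpoint, the query parameters, and all function names are hypothetical placeholders, and only the Python standard library is used.

```python
import csv
import io
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_api_url(base_url, endpoint, params):
    """Build a GET URL for one page of results from a source API."""
    return f"{base_url.rstrip('/')}/{endpoint}?{urlencode(params)}"


def fetch_from_api(url):
    """Fetch and decode one JSON response (requires network access)."""
    with urlopen(url) as response:
        return json.load(response)


def read_from_file(text, delimiter="\t"):
    """Parse delimited text provided by a source into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


# Hypothetical usage: one API page request and one tab-separated file.
url = build_api_url("https://api.example.org/", "cases", {"size": 100, "from": 0})
rows = read_from_file("case_id\tproject\nC1\tP1\n")
```

Either route ends with a list of records in memory, so the later transform and load steps do not need to know which route produced them.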

Data for each program are consolidated by data type (e.g., Clinical, DNA Methylation, RNAseq, Somatic Mutation) and transformed into ISB-CGC Google BigQuery tables.
Because the data are consolidated into curated tables, our users can quickly analyze information from thousands of patients with BigQuery queries.
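
As a rough sketch of the consolidation step, the example below groups extracted records by data type and serializes each group as newline-delimited JSON, one of the formats that BigQuery load jobs accept. The record layout, the ``data_type`` field name, and the function names are assumptions made for illustration; the actual ISB-CGC transformations are not shown in this document.

```python
import json
from collections import defaultdict


def consolidate_by_data_type(records):
    """Group records into one list per data type."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["data_type"]].append(record)
    return dict(grouped)


def to_ndjson(rows):
    """Serialize rows as newline-delimited JSON, one object per line."""
    return "\n".join(json.dumps(row, sort_keys=True) for row in rows)


# Hypothetical records from a single program.
records = [
    {"data_type": "Clinical", "case_id": "C1"},
    {"data_type": "RNAseq", "case_id": "C1"},
    {"data_type": "Clinical", "case_id": "C2"},
]
grouped = consolidate_by_data_type(records)
clinical_ndjson = to_ndjson(grouped["Clinical"])
```

Each group would then correspond to one table per program and data type.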

ISB-CGC Workflow Components
+++++++++++++++++++++++++++

Each workflow is made up of the following files:

- YAML file (configuration file)
- Python files (extracts, transforms, and loads data)
- Shell script file (runs the workflow)
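
One way these three components could fit together is sketched below. Everything here is hypothetical: the ``CONFIG`` dictionary stands in for values that would be parsed from the YAML file, the step names and table name are invented, and the ``extract`` and ``load`` bodies are placeholders rather than real API or BigQuery calls.

```python
# Stand-in for the parsed YAML configuration file (hypothetical keys and values).
CONFIG = {
    "steps": ["extract", "transform", "load"],
    "target_table": "example_dataset.example_table",
}


def extract(config, data):
    """Placeholder extract step: return raw rows from the source."""
    return [{"case_id": "C1"}, {"case_id": "C1"}, {"case_id": "C2"}]


def transform(config, data):
    """Example transform step: drop duplicate rows, keeping first occurrences."""
    seen = set()
    unique_rows = []
    for row in data:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique_rows.append(row)
    return unique_rows


def load(config, data):
    """Placeholder load step: report how many rows would be written."""
    print(f"Would load {len(data)} rows into {config['target_table']}")
    return len(data)


STEP_FUNCTIONS = {"extract": extract, "transform": transform, "load": load}


def run_workflow(config):
    """Run the configured steps in order, passing each result to the next step."""
    data = None
    for step in config["steps"]:
        data = STEP_FUNCTIONS[step](config, data)
    return data


rows_loaded = run_workflow(CONFIG)
```

In this arrangement the shell script would only need to invoke the Python entry point with the path to the configuration file, and editing the list of steps in the configuration would let a single stage be rerun on its own.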

GDC Workflows
-------------



PDC Workflows
-------------


Other Workflows
---------------
