Merge pull request #494 from isb-cgc/staging
Staging
Showing 195 changed files with 3,922 additions and 1,824 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,110 @@ | ||
================== | ||
Running CWL RNAseq | ||
================== | ||
|
||
|
||
This workflow maps paired-end reads to a reference genome and produces the assembled transcripts. | ||
|
||
|
||
Requirements: | ||
============= | ||
|
||
|
||
- CWLtool | ||
- Docker | ||
|
||
Download this tutorial: | ||
:: | ||
|
||
   $ sudo add-apt-repository universe | ||
   $ sudo apt update | ||
   $ sudo apt install subversion | ||
|
||
   # Clone this tutorial | ||
   $ svn checkout https://github.com/isb-cgc/RunningWorkflows-on-the-GoogleCloud/trunk/CWL-RNAseq | ||
|
||
To install Docker and cwltool, see our **Cheatsheet**, linked below: | ||
|
||
- `Cheatsheet <https://isb-cancer-genomics-cloud.readthedocs.io/en/kyle-staging/sections/gcp-info/Cheatsheet.html>`_ | ||
|
||
The starting folder **CWL-RNAseq** should look like this: | ||
|
||
|
||
:: | ||
|
||
   . | ||
   └── CWL-RNAseq | ||
       ├── create_bam.cwl | ||
       ├── create_transcript.cwl | ||
       ├── CWL-RNAseq.cwl | ||
       ├── CWL-RNAseq.yml | ||
       ├── data | ||
       │   ├── sample_1.fq | ||
       │   ├── sample_2.fq | ||
       │   ├── sample.fa | ||
       │   └── sample.gtf | ||
       ├── hisat2_align.cwl | ||
       └── index_build.cwl | ||
|
||
Let's run the workflow from inside the **CWL-RNAseq** folder: | ||
|
||
:: | ||
|
||
   $ cwltool CWL-RNAseq.cwl CWL-RNAseq.yml | ||
|
||
If you receive this error: "docker: Got permission denied while trying to connect to the Docker daemon socket at unix" | ||
|
||
Try: | ||
|
||
:: | ||
|
||
   $ sudo groupadd docker | ||
   $ sudo usermod -aG docker ${USER} | ||
   # Close and reopen the VM session, then run the workflow again | ||
|
||
|
||
|
||
The workflow consists of the following files: | ||
|
||
- **CWL-RNAseq.cwl** is the main CWL file that connects all the other CWL tools together | ||
- **CWL-RNAseq.yml** contains all the inputs needed to run the pipeline | ||
- **index_build.cwl** builds index files from a FASTA file, using HISAT2-build | ||
- **hisat2_align.cwl** builds a SAM file from the forward and reverse reads and the indices built in the previous step, using HISAT2 | ||
- **create_bam.cwl** builds a BAM file from the newly built SAM file, using SAMtools | ||
- **create_transcript.cwl** creates the transcripts from the BAM file built in the previous step, using StringTie | ||
|
||
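The four steps above can be sketched as the command lines each tool would roughly correspond to. This is a minimal sketch under stated assumptions, not a copy of what the CWL files run: the file paths follow the folder listings in this tutorial, while the exact flags, the use of ``samtools sort``, and the helper name ``build_pipeline_commands`` are assumptions chosen for illustration.

```python
import shlex


def build_pipeline_commands(
    fasta="data/sample.fa",
    reads_1="data/sample_1.fq",
    reads_2="data/sample_2.fq",
    annotation="data/sample.gtf",
    index_prefix="sample/index",
):
    """Return one argument list per pipeline step (hypothetical helper).

    The flags are common invocations of each tool; the CWL files in this
    tutorial may pass different options.
    """
    sam = "sample.sam"
    bam = "sample.bam"
    return [
        # index_build.cwl: build the HISAT2 index from the FASTA file
        ["hisat2-build", fasta, index_prefix],
        # hisat2_align.cwl: align forward and reverse reads, write a SAM file
        ["hisat2", "-x", index_prefix, "-1", reads_1, "-2", reads_2, "-S", sam],
        # create_bam.cwl: convert the SAM file to a sorted BAM file (assumed step)
        ["samtools", "sort", "-o", bam, sam],
        # create_transcript.cwl: assemble transcripts guided by the annotation
        ["stringtie", bam, "-G", annotation, "-o", "final_transcript.gtf"],
    ]


# Print the commands instead of running them, so the sketch has no side effects.
for command in build_pipeline_commands():
    print(shlex.join(command))
```

Printing the commands rather than executing them keeps the sketch safe to run on a machine without the tools installed; cwltool itself runs each step inside its Docker container.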
|
||
Let's take a look at the folder after cwltool finishes (new files and folders are shown in brackets): | ||
|
||
:: | ||
|
||
   . | ||
   └── CWL-RNAseq | ||
       ├── create_bam.cwl | ||
       ├── create_transcript.cwl | ||
       ├── CWL-RNAseq.cwl | ||
       ├── CWL-RNAseq.yml | ||
       ├── data | ||
       │   ├── sample_1.fq | ||
       │   ├── sample_2.fq | ||
       │   ├── sample.fa | ||
       │   └── sample.gtf | ||
       ├── [final_ref.gtf] | ||
       ├── [final_transcript.gtf] | ||
       ├── [final.tsv] | ||
       ├── hisat2_align.cwl | ||
       ├── [hisat2_align_out] | ||
       │   ├── [hisat2_align_out.log] | ||
       │   └── [sample.sam] | ||
       ├── [hisat2_build.log] | ||
       ├── index_build.cwl | ||
       ├── [sample] | ||
       │   ├── [index.1.ht2] | ||
       │   ├── [index.2.ht2] | ||
       │   ├── [index.3.ht2] | ||
       │   ├── [index.4.ht2] | ||
       │   ├── [index.5.ht2] | ||
       │   ├── [index.6.ht2] | ||
       │   ├── [index.7.ht2] | ||
       │   └── [index.8.ht2] | ||
       └── [sample.bam] |
File renamed without changes.
Binary file modified (BIN, -8.8 KB, 93%): docs/source/sections/BigQuery/BigQueryTableSearch-UI-homepage.png
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,52 @@ | ||
************************* | ||
Variant Call Format (VCF) | ||
************************* | ||
|
||
Variant Call Format (VCF) is the standard file format for storing sequence variants (such as SNPs, indels and structural variants) identified from next-generation sequencing data. More information on the VCF file format and its specification can be found here: https://samtools.github.io/hts-specs/ | ||
|
||
As variant data sets grow in both number and size, researchers face the challenge of analyzing them at scale. Downloading individual VCF files to compute on local machines is no longer practical. To address this, ISB-CGC has created a VCF extract, transform and load (ETL) pipeline that produces Google BigQuery tables, each serving as a central repository for the VCF files of a given cancer program (e.g. TCGA, TARGET). For example, the ETL process takes all VCF files from TARGET and transforms them into a single BigQuery table. With the variant data in one central BigQuery table, users can query and explore it without downloading anything, instead of analyzing one VCF file at a time. In addition, the ETL process preserves the column composition of the VCF file format that researchers are familiar with. | ||
|
||
|
||
VCF BigQuery Table | ||
=================== | ||
|
||
Because VCF files at the GDC contain sensitive patient information which cannot be displayed to the public, they are deemed controlled-access, meaning only authorized users can access the data. For the purposes of demonstration, we have generated a random VCF file that emulates a typical TCGA VCF file. The BigQuery table in the image below was generated using the randomized VCF file and mimics a controlled access VCF BigQuery table. | ||
|
||
.. note:: The actual BigQuery variant data tables are not randomized and are controlled access. | ||
|
||
The first 11 columns, shown in the image, match those of a VCF file. In addition to keeping a similar structure, the new table splits VCF columns such as NORMAL and TUMOR into their own individual columns. The goal of this flattened layout is to make the data easy to use, both for users who have worked with VCF files before and for those who are new to this area of research. | ||
|
||
.. figure:: BigQuery_VCF_Flattened.png | ||
   :scale: 50 | ||
   :align: center | ||
|
||
.. note:: The tables found in our repository are clustered on CHROM, ID, analysis_workflow_type, and project_short_name. Filtering on these columns makes queries faster and reduces their cost. | ||
|
||
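To illustrate how a query can take advantage of the clustered columns, here is a minimal sketch that builds a parameterized BigQuery SQL string filtering on CHROM and project_short_name. The helper name ``build_vcf_query``, the table name in the usage example, the example values, and the POS, REF and ALT columns (standard VCF fields assumed to be present in the table) are assumptions, not confirmed names from the ISB-CGC tables.

```python
def build_vcf_query(table, chrom, project_short_name):
    """Build a query that filters on two of the clustered columns.

    Hypothetical helper: returns the SQL text and a dict of values for the
    named query parameters (@chrom, @project_short_name).
    """
    # Table names cannot be passed as query parameters, so the table is
    # interpolated into the SQL text; the filter values stay parameterized.
    sql = (
        "SELECT CHROM, POS, ID, REF, ALT\n"
        f"FROM `{table}`\n"
        "WHERE CHROM = @chrom\n"
        "  AND project_short_name = @project_short_name"
    )
    params = {"chrom": chrom, "project_short_name": project_short_name}
    return sql, params


# Example values are placeholders; substitute a table you are authorized to read.
sql, params = build_vcf_query(
    "isb-cgc-cbq.TARGET.example_vcf_table",  # hypothetical table name
    "chr1",
    "TARGET-AML",
)
print(sql)
print(params)
```

The SQL text and parameter values can then be passed to whichever BigQuery client you use. Keeping the filter values as named parameters avoids building them into the query string by hand.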
|
||
Accessing Controlled Variant Data | ||
================================= | ||
Some ISB-CGC BigQuery tables contain sensitive information about patients. These types of files are known as controlled-access files. To obtain access to this controlled data, please follow the steps on the `Accessing Controlled Data <https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/Gaining-Access-To-Controlled-Access-Data.html>`_ page. | ||
|
||
Controlled-access VCF BigQuery tables can be found in the project **isb-cgc-cbq**. All VCF tables on BigQuery are stored under their parent program. For instance, the GDC release 22 TARGET VCF BigQuery table can be found in the data sets "TARGET" and "TARGET_versioned" in the project isb-cgc-cbq. | ||
|
||
ISB-CGC BigQuery Table Search | ||
----------------------------- | ||
To see the available VCF BigQuery tables hosted on our Google Cloud projects, visit our `ISB-CGC BigQuery Table Search <https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/BigQueryTableSearchUI.html>`_ and select **VCF** for the filter **Data Type**. A preview of the data is not available because the data is controlled access. | ||
|
||
|
||
VCF Programs Available | ||
^^^^^^^^^^^^^^^^^^^^^^ | ||
* TARGET | ||
|
||
VCF Programs Coming Soon | ||
^^^^^^^^^^^^^^^^^^^^^^^^ | ||
* TCGA | ||
* FM | ||
* HCMI | ||
* VAREPOP | ||
* ORGANOID | ||
* CPTAC | ||
* BEATAML 1.0 | ||
|
||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,62 @@ | ||
************************* | ||
ISB-CGC BigQuery Projects | ||
************************* | ||
|
||
ISB-CGC has two open-access Google BigQuery projects. To quickly access the ISB-CGC tables from your project on the Google BigQuery Console, you'll need to link to these projects. This process, known as "pinning a project", is described `here <../progapi/bigqueryGUI/LinkingBigQueryToIsb-cgcProject.html>`_. | ||
|
||
- **isb-cgc** - This project has been in use since ISB-CGC's inception. | ||
- **isb-cgc-bq** - This is a new project as of July 2020. It will hold all new ISB-CGC tables, and many of the tables in the isb-cgc project will be migrated here over time. | ||
|
||
.. figure:: ISBCGC-BQ-projects.png | ||
   :align: right | ||
   :figwidth: 300px | ||
|
||
|
||
isb-cgc project | ||
=============== | ||
|
||
The isb-cgc project contains all of the ISB-CGC BigQuery tables created before July 2020. | ||
|
||
Tables in isb-cgc will be retired and labeled as deprecated as we copy them over to the new project. Their table descriptions will include the new table location. Eventually they will be converted into views (with no preview ability) so that existing references continue to work correctly. Many older tables with light usage may remain in isb-cgc and not be copied over; tables with no logged recent usage may be deleted. When using the `BigQuery Table Search UI <https://isb-cgc.appspot.com/bq_meta_search/>`_ to find these retired tables, select a Status of **Deprecated**. | ||
|
||
Many tables will continue to have the status of **Current**, at least for the time being, until they are copied to the new project. In addition, there are tables with the status of **Archived** in the isb-cgc project and more may become archived. **Archived** indicates that the table contains an older version of data; a newer version of the same data exists in another table. | ||
|
||
isb-cgc-bq project | ||
=================== | ||
|
||
The isb-cgc-bq project contains all new ISB-CGC BigQuery tables created after July 1, 2020 as well as tables that have been migrated from project isb-cgc. It features a more intuitive data set and table organization, as well as consistent table naming both within and across cancer research programs. | ||
|
||
This new project is a work in progress. Existing tables will be migrated from the isb-cgc project gradually rather than all at once. | ||
**All new tables** will be created in this project. | ||
|
||
isb-cgc-bq Data Set and Table Organization | ||
------------------------------------------ | ||
|
||
Each Program has two data sets, one containing the most current data that ISB-CGC has, and one containing versioned tables, which serves as an archive of previously released tables. | ||
|
||
As new data releases occur, the data in the "_current" tables will be replaced with this new data. If you want the most up-to-date data, use these tables in your queries. | ||
However, if you want to ensure that your queries create a reproducible result, use a table from the "_versioned" data set. The most current data is also in this data set; however, the name of the table will end with the release number or year and not "current". | ||
|
||
See below for more details. | ||
|
||
.. list-table:: | ||
   :header-rows: 1 | ||
|
||
   * - Data Set Name | ||
     - Data Set Contents | ||
     - Table Name Format | ||
     - Table Status | ||
   * - <Program> | ||
     - Latest tables for each data type (ex. miRNA Expression, File Metadata) that ISB-CGC has, per Program | ||
     - Data Type, Reference Genome, Source, Current. Ex. ``TARGET.miRNAseq_hg38_gdc_current`` | ||
     - When using the `BigQuery Table Search UI <https://isb-cgc.appspot.com/bq_meta_search/>`_ to find these tables, select Status of **Current**. | ||
   * - <Program>_versioned | ||
     - Previously released tables, as well as the most current table | ||
     - Data Type, Reference Genome, Source, Release Number or Year. Ex. ``TARGET_versioned.miRNAseq_hg38_gdc_r22``. Here, the name of the most current table will end with the release number or year and not "current". | ||
     - Previously released tables have status of **Archived**. The most current table has the status of **Current**. | ||
|
||
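The naming scheme in the table above can be expressed as a small helper that assembles a full table ID from its parts. This is only a sketch of the convention as described here; the function name ``table_id`` and its parameters are hypothetical, and the two examples reproduce the table names given above.

```python
def table_id(program, data_type, genome, source, release=None, project="isb-cgc-bq"):
    """Assemble a table ID following the isb-cgc-bq naming convention.

    Hypothetical helper: with no release, the table lives in the <Program>
    data set and ends in "current"; with a release number or year, it lives
    in <Program>_versioned and ends in that release.
    """
    base = f"{data_type}_{genome}_{source}"
    if release is None:
        return f"{project}.{program}.{base}_current"
    return f"{project}.{program}_versioned.{base}_{release}"


# Most up-to-date data (contents change with each new release)
print(table_id("TARGET", "miRNAseq", "hg38", "gdc"))

# Fixed release, for reproducible queries
print(table_id("TARGET", "miRNAseq", "hg38", "gdc", release="r22"))
```

Pinning the release in a query trades freshness for reproducibility, which matches the guidance above on when to use the versioned data set.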
See below for a snapshot of the isb-cgc-bq data set and table organization in the Google BigQuery Console. | ||
|
||
.. image:: ISBCGC-BQ-tables.png | ||
   :align: center | ||
|