Merge pull request #494 from isb-cgc/staging
Staging
boaguilar committed Nov 23, 2020
2 parents 1531656 + 7f09f64 commit 2e8e871
Showing 195 changed files with 3,922 additions and 1,824 deletions.
110 changes: 110 additions & 0 deletions CWL-RNAseq.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,110 @@
==================
Running CWL RNAseq
==================


This workflow maps read pairs to a reference genome and produces a transcript file.


Requirements:
=============


- CWLtool
- Docker

Download this tutorial:
::

    $ sudo add-apt-repository universe
    $ sudo apt update
    $ sudo apt install subversion

    # Clone this tutorial
    $ svn checkout https://github.com/isb-cgc/RunningWorkflows-on-the-GoogleCloud/trunk/CWL-RNAseq

To install Docker and CWL, you can visit our **Cheatsheet** listed below:

- `Cheatsheet <https://isb-cancer-genomics-cloud.readthedocs.io/en/kyle-staging/sections/gcp-info/Cheatsheet.html>`_

Starting folder **CWL-RNAseq** should look like this:


::

    .
    └── CWL-RNAseq
        ├── create_bam.cwl
        ├── create_transcript.cwl
        ├── CWL-RNAseq.cwl
        ├── CWL-RNAseq.yml
        ├── data
        │   ├── sample_1.fq
        │   ├── sample_2.fq
        │   ├── sample.fa
        │   └── sample.gtf
        ├── hisat2_align.cwl
        └── index_build.cwl

Let's run it by using:

::

    $ cwltool CWL-RNAseq.cwl CWL-RNAseq.yml

If you receive this error: "docker: Got permission denied while trying to connect to the Docker daemon socket at unix"

Try:

::

    $ sudo groupadd docker
    $ sudo usermod -aG docker ${USER}
    # Close and reopen the VM session, then run the workflow again



The files in the **CWL-RNAseq** folder do the following:

- **CWL-RNAseq.cwl** is the main CWL file that connects all the other CWL tools together
- **CWL-RNAseq.yml** contains all the inputs needed to run the pipeline
- **index_build.cwl** builds index files from a FASTA file, using hisat2-build
- **hisat2_align.cwl** builds a SAM file from the forward and reverse reads and the indices built in the previous step, using HISAT2
- **create_bam.cwl** builds a BAM file from the newly built SAM file, using Samtools
- **create_transcript.cwl** creates a transcript file from the BAM file built in the previous step, using StringTie
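
As a rough illustration of what the four tool steps do, the sketch below builds the equivalent command lines in Python without running anything. The flags are typical HISAT2, Samtools and StringTie invocations and are an assumption; the ``.cwl`` files in this tutorial may pass different options. The function name ``build_rnaseq_commands`` is hypothetical, and the index base name ``sample/index`` is inferred from the output folder listing.

```python
from pathlib import PurePosixPath


def build_rnaseq_commands(data_dir="data", index_base="sample/index"):
    """Return command lines that roughly correspond to the four workflow steps.

    Assumption: these are common invocations of each tool, not the exact
    arguments used inside the tutorial's .cwl files.
    """
    data = PurePosixPath(data_dir)
    fasta = str(data / "sample.fa")
    reads_1 = str(data / "sample_1.fq")
    reads_2 = str(data / "sample_2.fq")
    annotation = str(data / "sample.gtf")

    return [
        # index_build.cwl: build HISAT2 index files from the FASTA reference
        ["hisat2-build", fasta, index_base],
        # hisat2_align.cwl: align the read pairs and write a SAM file
        ["hisat2", "-x", index_base, "-1", reads_1, "-2", reads_2, "-S", "sample.sam"],
        # create_bam.cwl: sort the SAM file and write it as a BAM file
        ["samtools", "sort", "-o", "sample.bam", "sample.sam"],
        # create_transcript.cwl: assemble transcripts, guided by the annotation
        ["stringtie", "sample.bam", "-G", annotation, "-o", "final_transcript.gtf"],
    ]


if __name__ == "__main__":
    for command in build_rnaseq_commands():
        print(" ".join(command))
```

Printing the commands rather than executing them keeps the sketch safe to run on a machine without the tools installed; cwltool remains the supported way to run the pipeline.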


Let's take a look at the folder after cwltool finishes (new files and folders are shown in brackets):

::

    .
    └── CWL-RNAseq
        ├── create_bam.cwl
        ├── create_transcript.cwl
        ├── CWL-RNAseq.cwl
        ├── CWL-RNAseq.yml
        ├── data
        │   ├── sample_1.fq
        │   ├── sample_2.fq
        │   ├── sample.fa
        │   └── sample.gtf
        ├── [final_ref.gtf]
        ├── [final_transcript.gtf]
        ├── [final.tsv]
        ├── hisat2_align.cwl
        ├── [hisat2_align_out]
        │   ├── [hisat2_align_out.log]
        │   └── [sample.sam]
        ├── [hisat2_build.log]
        ├── index_build.cwl
        ├── [sample]
        │   ├── [index.1.ht2]
        │   ├── [index.2.ht2]
        │   ├── [index.3.ht2]
        │   ├── [index.4.ht2]
        │   ├── [index.5.ht2]
        │   ├── [index.6.ht2]
        │   ├── [index.7.ht2]
        │   └── [index.8.ht2]
        └── [sample.bam]
20 changes: 10 additions & 10 deletions docs/source/conf.py
Original file line number Diff line number Diff line change
Expand Up @@ -12,7 +12,7 @@
# All configuration values have a default; values that are commented out
# serve to show the default.

import sys
import sys; sys.setrecursionlimit(1500)
import os
import sphinx.environment
import shlex
Expand All @@ -33,7 +33,7 @@
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
'sphinx.ext.todo',
'sphinx.ext.todo'
]

# Add any paths that contain templates here, relative to this directory.
Expand All @@ -51,7 +51,7 @@
master_doc = 'index'

# General information about the project.
project = u'ISB Cancer Genomics Cloud'
project = u'ISB Cancer Gateway in the Cloud'
copyright = u'2015-2020, the ISB-CGC team'
author = u'the ISB-CGC team'

Expand All @@ -60,9 +60,9 @@
# built documents.
#
# The short X.Y version.
version = '1.0'
version = '2.0'
# The full version, including alpha/beta/rc tags.
release = '1.0.0'
release = '2.0.0'

# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
Expand Down Expand Up @@ -246,7 +246,7 @@
#html_search_scorer = 'scorer.js'

# Output file base name for HTML help builder.
htmlhelp_basename = 'ISBCancerGenomicsClouddoc'
htmlhelp_basename = 'ISBCancerGatewayClouddoc'

# -- Options for LaTeX output ---------------------------------------------

Expand All @@ -268,7 +268,7 @@
# (source start file, target name, title,
# author, documentclass [howto, manual, or own class]).
latex_documents = [
(master_doc, 'ISBCancerGenomicsCloud.tex', u'ISB Cancer Genomics Cloud Documentation',
(master_doc, 'ISBCancerGatewayCloud.tex', u'ISB Cancer Gateway in the Cloud Documentation',
u'the ISB-CGC team', 'manual'),
]

Expand Down Expand Up @@ -298,7 +298,7 @@
# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [
(master_doc, 'isbcancergenomicscloud', u'ISB Cancer Genomics Cloud Documentation',
(master_doc, 'isbcancergatewaycloud', u'ISB Cancer Gateway in the Cloud Documentation',
[author], 1)
]

Expand All @@ -312,8 +312,8 @@
# (source start file, target name, title, author,
# dir menu entry, description, category)
texinfo_documents = [
(master_doc, 'ISBCancerGenomicsCloud', u'ISB Cancer Genomics Cloud Documentation',
author, 'ISBCancerGenomicsCloud', 'One line description of project.',
(master_doc, 'ISBCancerGatewayCloud', u'ISB Cancer Gateway in the Cloud Documentation',
author, 'ISBCancerGatewayCloud', 'One line description of project.',
'Miscellaneous'),
]

Expand Down
15 changes: 8 additions & 7 deletions docs/source/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -36,23 +36,24 @@ The `ISB-CGC <https://isb-cgc.org>`_ aims to serve the needs of a broad range of
:maxdepth: 1
:caption: USER GUIDE

sections/data/Mitelman_about
sections/Hosted-Data
sections/data/TCGA_Data_Security
sections/Gaining-Access-To-Controlled-Access-Data
sections/Web-UI
sections/BigQuery
sections/BigQueryTableSearchUI
sections/DataBrowser
sections/data/Mitelman_about
sections/DataExplorer
sections/Web-UI
sections/progapi/progAPI-v4/Programmatic-Demo
sections/ProgrammaticAccess
sections/HowTos
sections/RegulomeExplorerNotebooks
sections/data/TCGA_Data_Security
sections/Gaining-Access-To-Controlled-Access-Data

.. toctree::
:hidden:
:maxdepth: 1
:caption: MORE INFORMATION

sections/HowTos
sections/RegulomeExplorerNotebooks
sections/TutorialsAndHow-ToGuides
sections/Releases
sections/Quick-links-updated
Expand Down
9 changes: 6 additions & 3 deletions docs/source/sections/About-ISB-CGC.rst
Original file line number Diff line number Diff line change
Expand Up @@ -2,22 +2,25 @@
About the ISB-CGC Platform
**************************

The ISB Cancer Genomics Cloud (ISB-CGC) is one of three `National Cancer Institute (NCI) Cloud Resources <https://datascience.cancer.gov/data-commons/cloud-resources>`_ tasked with bringing cancer data and computation power together through cloud platforms. It is a collaboration between the `Institute for Systems Biology <https://isbscience.org/>`_ (ISB) and `General Dynamics Information Technology Inc. <https://www.gdit.com/>`_ (GDIT). Since starting in 2014 as part of NCI’s Cloud Pilot Resource initiative, ISB-CGC has provided access to increasing amounts of cancer data in the cloud.
The ISB Cancer Gateway in the Cloud (ISB-CGC) is one of three `National Cancer Institute (NCI) Cloud Resources <https://datascience.cancer.gov/data-commons/cloud-resources>`_ tasked with bringing cancer data and computation power together through cloud platforms. It is a collaboration between the `Institute for Systems Biology <https://isbscience.org/>`_ (ISB) and `General Dynamics Information Technology Inc. <https://www.gdit.com/>`_ (GDIT). Since starting in 2014 as part of NCI’s Cloud Pilot Resource initiative, ISB-CGC has provided access to increasing amounts of cancer data in the cloud.

-------------------------
Exploring Cancer Data
-------------------------

The ISB-CGC Platform enables a wide range of users to bring their analysis tools to the data in the cloud, eliminating the need to download and store large data sets. Built with the Google Cloud Platform, it provides several entry points for exploring and analyzing cancer data:

* The **ISB-CGC Web Application** allows users to interactively create and explore cohorts of interest.
* The **ISB-CGC Web Application** allows users to interactively create and explore cohorts of interest. It includes the functionality of the Cancer Data File Browser and the Cohort Builder/Data Explorer as well as other tools.

- The **Cancer Data File Browser** allows users to explore a comprehensive selection of cancer related data files in Google Cloud Storage Buckets, such as raw sequencing, cancer nucleotide variation, pathology or radiology images.
- The **Cohort Builder/Data Explorer** is a web interface which builds cohorts based on clinical demographics and molecular filters. Compare patient cohorts with various exploration tools including IGV viewer, image viewers, and analytical visualization.
* The **ISB-CGC API** gives users the ability to programmatically work with data such as cases, samples, cohorts, files and cloud projects.
* The **ISB-CGC BigQuery Table Search** is a discovery tool that allows the user to explore and search for ISB-CGC Google BigQuery tables.
* On the **Google Cloud Platform BigQuery Console**, ISB-CGC tables can be viewed and queried directly.
* **Python and R** can interface with the ISB-CGC tables, retrieving and analyzing data.
* Using **Google Compute Engines and VMs**, workflows can be run to perform data analysis.

Please see the USER GUIDE section to learn more about each of these tools and the MORE INFORMATION section to see examples, tutorials, `Jupyter and R Notebooks <https://github.com/isb-cgc/Community-Notebooks>`_, Frequently Asked Questions and more.
Please see the USER GUIDE section to learn more about each of these tools and to see `Jupyter and R Notebook <https://github.com/isb-cgc/Community-Notebooks>`_ examples. See the MORE INFORMATION section for tutorials, release notes, Frequently Asked Questions and more.

.. image:: ToolsForISBCGC.png
:align: center
Expand Down
File renamed without changes.
2 changes: 1 addition & 1 deletion docs/source/sections/BestPractices.rst
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,7 @@ Don't Download the Data
===========================


The ISB-CGC platform is one of NCI’s `Cancer Genomics Cloud Resources <https://datascience.cancer.gov/data-commons/cloud-resources>`_ and our mission is to host cancer data (such as TCGA and TARGET data) in the cloud so that researchers around the world may work with data without needing to download and store the data at their own local institutions.
The ISB-CGC platform is one of NCI’s `Cancer Cloud Resources <https://datascience.cancer.gov/data-commons/cloud-resources>`_ and our mission is to host cancer data (such as TCGA and TARGET data) in the cloud so that researchers around the world may work with data without needing to download and store the data at their own local institutions.

Remember those times when you had to wait weeks to download the data - you don’t need to do that any more! The data is already on the cloud, so you can collaborate with other researchers much more easily.
Be mindful that if you download data, you’ll incur egress charges.
Expand Down
3 changes: 2 additions & 1 deletion docs/source/sections/BigQuery.rst
Original file line number Diff line number Diff line change
Expand Up @@ -22,9 +22,10 @@ Note that dbGaP authorization is not required to access these tables.
:maxdepth: 1

progapi/bigqueryGUI/HowToAccessBigQueryFromTheGoogleCloudPlatform
BigQuery/ISBCGC-BQ-Projects
progapi/bigqueryGUI/LinkingBigQueryToIsb-cgcProject
BigQuery/data_in_BQ
progapi/bigqueryGUI/GettingStartedWithGoogleBigQuery
BigQuery/VariantDataInBigQuery
PanCancer-Atlas-Mirror
BigQuery/BigQueryUsageCosts

Expand Down
Binary file added docs/source/sections/BigQuery/Access-filter.png
Binary file modified docs/source/sections/BigQuery/BigQueryTableSearchUI.png
52 changes: 52 additions & 0 deletions docs/source/sections/BigQuery/ControlledAccessVCF.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,52 @@
*************************
Variant Call Format (VCF)
*************************

Variant Call Format (VCF) is the standard file format for storing variants (such as SNPs, indels and structural variants) identified from next-generation sequencing data. More information on the VCF file format and its specification can be found here: https://samtools.github.io/hts-specs/

As variant data continues to grow both in the amount of data generated as well as in size, researchers face the challenge of having to identify ways to analyze large variant data sets. The traditional way of downloading individual VCF files to compute on local machines is untenable and prohibitive. As a solution to this problem, ISB-CGC has created a VCF extract, transform and load (ETL) pipeline that produces Google BigQuery tables that serve as central repositories for VCF files for a given cancer program (e.g. TCGA, TARGET). For example, the ETL process takes all VCF files from TARGET and transforms them into a single BigQuery table. Instead of analyzing variant data one VCF file at a time, with the variant data all in one central BigQuery table users will be able to query and interrogate the data without the need to download. In addition, we have ensured that the ETL process maintains the column composition of the VCF file format that researchers are familiar with.


VCF BigQuery Table
===================

Because VCF files at the GDC contain sensitive patient information which cannot be displayed to the public, they are deemed controlled-access, meaning only authorized users can access the data. For the purposes of demonstration, we have generated a random VCF file that emulates a typical TCGA VCF file. The BigQuery table in the image below was generated using the randomized VCF file and mimics a controlled access VCF BigQuery table.

.. note:: The actual BigQuery variant data tables are not randomized and are controlled access.

The first 11 columns, seen in the image, begin just as a VCF file does. In addition to keeping a similar structure, the new table splits VCF columns such as NORMAL and TUMOR into their own individual columns. The flattened layout is meant to be easy to read both for users who have worked with VCF files in the past and for those who are brand new to this area of research.
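
To make the idea of flattening concrete, here is a minimal sketch that splits the per-sample NORMAL and TUMOR fields of one VCF data line into individual columns. It is not the ISB-CGC ETL pipeline: the function name ``flatten_vcf_record`` and the ``<SAMPLE>_<KEY>`` column naming are hypothetical, and the example record is made up.

```python
def flatten_vcf_record(line, sample_names=("NORMAL", "TUMOR")):
    """Flatten one tab-separated VCF data line into a single-level dict.

    The eight fixed VCF columns are kept as they are. Each FORMAT key of
    each sample becomes its own column named <SAMPLE>_<KEY>, which is a
    hypothetical naming scheme chosen for this example.
    """
    fields = line.rstrip("\n").split("\t")
    fixed_names = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]
    row = dict(zip(fixed_names, fields[:8]))

    # Column 9 (FORMAT) names the colon-separated values in each sample column.
    format_keys = fields[8].split(":")
    for sample_name, sample_field in zip(sample_names, fields[9:]):
        for key, value in zip(format_keys, sample_field.split(":")):
            row[f"{sample_name}_{key}"] = value
    return row


# A made-up record with FORMAT keys GT, AD and DP for two samples.
example = "chr1\t12345\t.\tA\tG\t.\tPASS\tDP=50\tGT:AD:DP\t0/0:25,0:25\t0/1:15,10:25"

if __name__ == "__main__":
    for column, value in flatten_vcf_record(example).items():
        print(column, value)
```

Each sample value ends up in its own named column, which is what lets a query filter on a single value without parsing the packed sample string.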

.. figure:: BigQuery_VCF_Flattened.png
:scale: 50
:align: center

.. note:: The tables found in our repository are clustered based on CHROM, ID, analysis_workflow_type, and project_short_name. This will help with faster queries and reducing costs.
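
Because the tables are clustered, queries that filter on the clustered columns generally scan less data. The sketch below builds such a query as a string, using BigQuery named parameters (``@chrom``, ``@project_short_name``). The table ID is a hypothetical placeholder, and the selected columns ``POS``, ``REF`` and ``ALT`` are assumed from the standard VCF layout rather than confirmed against the actual table schema.

```python
def build_variant_query(table):
    """Return a Standard SQL string that filters on clustered columns.

    Assumption: the table has the standard VCF columns POS, REF and ALT in
    addition to the clustered columns named in this document.
    """
    return (
        "SELECT CHROM, POS, ID, REF, ALT\n"
        f"FROM `{table}`\n"
        "WHERE CHROM = @chrom\n"
        "  AND project_short_name = @project_short_name"
    )


# Hypothetical table ID, used only to show the shape of the query.
query = build_variant_query("isb-cgc-cbq.TARGET.example_vcf_table")

if __name__ == "__main__":
    print(query)
```

The parameter values would be supplied when the query is submitted, for example through the BigQuery Console or a client library.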


Accessing Controlled Variant Data
=================================
Some ISB-CGC BigQuery tables contain sensitive information about patients. These types of files are known as controlled access files. To obtain access to this controlled data, please follow the steps in the `Accessing Controlled Data <https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/Gaining-Access-To-Controlled-Access-Data.html>`_ page.

Controlled access VCF BigQuery tables can be found in the project **isb-cgc-cbq**. All VCF tables on BigQuery are stored under their parent program. For instance, the GDC release 22 TARGET VCF BigQuery table will be found in the data sets known as "TARGET" and "TARGET_versioned" in the project isb-cgc-cbq.

ISB-CGC BigQuery Table Search
-----------------------------
To see the available VCF BigQuery tables hosted on our Google Cloud projects, visit our `ISB-CGC BigQuery Table Search <https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/BigQueryTableSearchUI.html>`_ and select **VCF** for the Filter **Data Type**. Preview of the data won't be available because the data is controlled access.


VCF Programs Available
^^^^^^^^^^^^^^^^^^^^^^
* TARGET

VCF Programs Coming Soon
^^^^^^^^^^^^^^^^^^^^^^^^
* TCGA
* FM
* HCMI
* VAREPOP
* ORGANOID
* CPTAC
* BEATAML 1.0



62 changes: 62 additions & 0 deletions docs/source/sections/BigQuery/ISBCGC-BQ-Projects.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,62 @@
*************************
ISB-CGC BigQuery Projects
*************************

ISB-CGC has two open-access Google BigQuery projects. To quickly access the ISB-CGC tables from your project on the Google BigQuery Console, you'll need to link to these projects. This process, known as "pinning a project", is described `here <../progapi/bigqueryGUI/LinkingBigQueryToIsb-cgcProject.html>`_.

- **isb-cgc** - This project has been in use since ISB-CGC's inception.
- **isb-cgc-bq** - This is a new project as of July 2020. It will hold all new ISB-CGC tables, and many of the tables in the isb-cgc project will be migrated here over time.

.. figure:: ISBCGC-BQ-projects.png
:align: right
:figwidth: 300px


isb-cgc project
===============

The isb-cgc project contains all of the ISB-CGC BigQuery tables created before July 2020.

Tables in isb-cgc will be retired and labeled as deprecated as we copy them over to the new project. Table descriptions will include the new table location. Eventually they will be turned into only views (with no preview ability) to ensure that existing references will continue to work correctly. Many older tables with light usage may remain in isb-cgc and not be copied over; tables with no logged recent usage may be deleted. When using the `BigQuery Table Search UI <https://isb-cgc.appspot.com/bq_meta_search/>`_ to find these retired tables, select Status of **Deprecated**.

Many tables will continue to have the status of **Current**, at least for the time being, until they are copied to the new project. In addition, there are tables with the status of **Archived** in the isb-cgc project and more may become archived. **Archived** indicates that the table contains an older version of data; a newer version of the same data exists in another table.

isb-cgc-bq project
===================

The isb-cgc-bq project contains all new ISB-CGC BigQuery tables created after July 1, 2020 as well as tables that have been migrated from project isb-cgc. It features a more intuitive data set and table organization, as well as consistent table naming both within and across cancer research programs.

This new project is a work in progress. The migration of existing tables from the isb-cgc project will be occurring over time, and will not be all at once.
**All new tables** will be created in this project.

isb-cgc-bq Data Set and Table Organization
------------------------------------------

Each Program has two data sets, one containing the most current data that ISB-CGC has, and one containing versioned tables, which serves as an archive of previously released tables.

As new data releases occur, the data in the "_current" tables will be replaced with this new data. If you want the most up-to-date data, use these tables in your queries.
However, if you want to ensure that your queries create a reproducible result, use a table from the "_versioned" data set. The most current data is also in this data set; however, the name of the table will end with the release number or year and not "current".

See below for more details.

.. list-table::
:header-rows: 1

* - Data Set Name
- Data Set Contents
- Table Name Format
- Table Status
* - <Program>
- Latest tables for each data type (ex. miRNA Expression, File Metadata) that ISB-CGC has, per Program
- Data Type, Reference Genome, Source, Current. Ex. ``TARGET.miRNAseq_hg38_gdc_current``
- When using the `BigQuery Table Search UI <https://isb-cgc.appspot.com/bq_meta_search/>`_ to find these tables, select Status of **Current**.
* - <Program>_versioned
- Previously released tables, as well as the most current table
- Data Type, Reference Genome, Source, Release Number or Year. Ex. ``TARGET_versioned.miRNAseq_hg38_gdc_r22``. Here, the name of the most current table will end with the release number or year and not "current".
- Previously released tables have status of **Archived**. The most current table has the status of **Current**.
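
The naming convention in the table above can be sketched as a small helper. The function ``table_id`` and its parameter names are hypothetical; the two printed results correspond to the examples given in the table, prefixed with the isb-cgc-bq project name.

```python
def table_id(program, data_type, genome, source, release=None, project="isb-cgc-bq"):
    """Build a fully qualified table ID following the isb-cgc-bq convention.

    With no release, the table lives in the <Program> data set and ends in
    "current". With a release (for example "r22"), it lives in the
    <Program>_versioned data set and ends in that release label.
    """
    if release is None:
        dataset, suffix = program, "current"
    else:
        dataset, suffix = f"{program}_versioned", release
    return f"{project}.{dataset}.{data_type}_{genome}_{source}_{suffix}"


if __name__ == "__main__":
    # Latest data, replaced at each new release.
    print(table_id("TARGET", "miRNAseq", "hg38", "gdc"))
    # Fixed release, suitable for reproducible queries.
    print(table_id("TARGET", "miRNAseq", "hg38", "gdc", release="r22"))
```

Passing an explicit release makes it clear in the calling code that a query is meant to be reproducible.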

See below for a snapshot of the isb-cgc-bq data set and table organization in the Google BigQuery Console.

.. image:: ISBCGC-BQ-tables.png
:align: center
