Scripts and clients for GA4GH Federated Analysis Systems Project
[TOC]
Script summary
- Settings file
- The examples directory contains a template settings file with a number of parameters for the FASP scripts. Place a copy of this file in your file system and set the environment variable FASP_SETTINGS to point to it. Edit the settings as appropriate.
- Python 3
- See the code for the modules required
- A folder in your home directory called .keys containing keys for various services. Not all keys are required for all scripts.
- bdc_credentials.json - api_key file obtained from BioDataCatalyst
- crdc_credentials.json - api_key file obtained from Cancer Research Data Commons
- anvil_credentials.json - api_key file obtained from Anvil
- sevenbridges_keys.json - keys for CGC and/or Cavatica
- The following modules are used by different scripts. Because no single user is likely to need every script, these modules are not installed with the fasp package. Please install those needed for the scripts you will run.
- Google Life Sciences API enabled for your GCP account
- BigQuery python libraries - for scripts that use BigQuery
- pyega3 - EGA client libraries for download. See also EGA documentation for client API.
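As a minimal sketch of how a script might pick up the settings file and key files described above: the function names `load_fasp_settings` and `key_file` are hypothetical, and the sketch assumes the settings file is JSON, which may not match the template in the examples directory.

```python
import json
import os
from pathlib import Path


def load_fasp_settings(env_var="FASP_SETTINGS"):
    """Load the settings file named by the FASP_SETTINGS environment variable.

    Assumption: the settings file is JSON. Swap the parser if yours differs.
    """
    path = os.environ.get(env_var)
    if not path:
        raise RuntimeError(f"Set {env_var} to the path of your settings file")
    with open(Path(path).expanduser(), encoding="utf-8") as handle:
        return json.load(handle)


def key_file(name, keys_dir="~/.keys"):
    """Return the expanded path of a credentials file in the .keys folder."""
    return Path(keys_dir).expanduser() / name
```

For example, `key_file("bdc_credentials.json")` gives the path a BioDataCatalyst client would read its api_key file from.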
This script queries Thousand Genomes data on subjects and specimens which was exported from BioDataCatalyst and loaded into BigQuery.
-
FASPScript4 uses the following GA4GH APIs to perform each step
- Discovery Search Server (DNA Stack) - Presto on BigQuery
- DRS server (BioDataCatalyst)
- WES Server (DNA Stack)
The other two scripts (fasp-scripts/blob/master/fasp/scripts) were proofs of concept using direct APIs from different stacks.
-
FASPScript1 uses BigQuery for the query and directly submits a GCP Life Sciences pipeline for the compute. This compute uses samtools stats.
-
FASPScript3 is the same as FASPScript1 but substitutes the public DNAStack Discovery Search server for the search step.
-
Possible to do's
- Troubleshoot samtools stats workflow on DNAStack WES server
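The three-step flow that FASPScript4 follows (search, then DRS, then WES) can be sketched with stand-in clients. Every class and method name below is hypothetical and chosen for illustration; the real clients in this package have their own interfaces and call live services.

```python
class StubSearchClient:
    """Stand-in for a Discovery Search client; returns canned rows."""

    def run_query(self, query):
        # Each row pairs a sample name with a DRS id (made-up values).
        return [("sample1", "drs-id-1"), ("sample2", "drs-id-2")]


class StubDRSClient:
    """Stand-in for a DRS client; turns a DRS id into a fake URL."""

    def get_access_url(self, drs_id):
        return f"https://storage.example.org/{drs_id}"


class StubWESClient:
    """Stand-in for a WES client; records submissions instead of running them."""

    def __init__(self):
        self.submitted = []

    def run_workflow(self, file_url, output_name):
        self.submitted.append((file_url, output_name))
        return f"run-{len(self.submitted)}"


def run_pipeline(search, drs, wes, query):
    """Search for files, resolve each DRS id to a URL, submit one run per file."""
    run_ids = []
    for sample, drs_id in search.run_query(query):
        url = drs.get_access_url(drs_id)
        run_ids.append(wes.run_workflow(url, f"{sample}.out"))
    return run_ids
```

Because each step only depends on a small interface, swapping BigQuery for Discovery Search (as FASPScript3 does) or GCP Life Sciences for WES only changes which client object is passed in.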
Script: FASPScriptGWAS.py
- Queries Discovery Search for Thousand Genomes non-annotated recalibrated vcf file for Chromosome 21, obtaining prefixed DRS ids for the file.
- Resolves which DRS server needs to be called to obtain a URL to access the file.
- Submits the GWAS WDL workflow to the DNAStack WES Server using the URL provided by DRS.
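The submission step can be illustrated by building the form fields that the GA4GH WES `RunWorkflow` request expects. This is a sketch only: the helper name, the workflow URL, and the input name `gwas.vcf` are hypothetical, and the DNAStack server may require authentication or extra fields not shown here.

```python
import json


def build_wes_run_request(workflow_url, inputs,
                          workflow_type="WDL", workflow_type_version="1.0"):
    """Return the form fields for a WES RunWorkflow request.

    The field names follow the GA4GH WES specification; workflow_params
    is sent as a JSON-encoded string.
    """
    return {
        "workflow_url": workflow_url,
        "workflow_type": workflow_type,
        "workflow_type_version": workflow_type_version,
        "workflow_params": json.dumps(inputs),
    }


# Hypothetical values: the URL returned by DRS becomes a workflow input.
form = build_wes_run_request(
    "https://example.org/workflows/gwas.wdl",
    {"gwas.vcf": "https://storage.example.org/chr21.vcf.gz"},
)
```

In the WES specification these fields are posted as multipart form data to the server's `/ga4gh/wes/v1/runs` endpoint.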
Script: FASPScript2.py
-
Query COPDGene data in BigQuery, exported from BioDataCatalyst via PFB.
-
Query TCGA data in the ISB-CGC tables in BigQuery.
-
Both queries use an appropriate prefix to identify which DRS server should be called to obtain a URL to the file.
Currently submits directly to a GCP Life Sciences pipeline. This will be substituted by a submission to a WES Server.
Both datasets are controlled access data. Access is controlled by the respective Fence access tokens on the CRDC and BioDataCatalyst DRS servers. The COPD data in BigQuery is under GCP IAM access control.
-
Possible to do's
- Add additional dbGaP datasets.
- Move query to Discovery Search - requires access control on Discovery Search.
Script: FASPScript8.py
- Queries TCGA data via BigQuery to obtain DRS ids
- Uses DRS to identify files for these cases that are on both Google Cloud and AWS
- Runs samtools stats on Google Cloud and Seven Bridges (AWS)
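One way to check where a file is available is to inspect the `access_methods` list of a DRS object, where each entry carries a `type` such as `gs` or `s3`. The sketch below works on an already-fetched object; the helper names and the sample object are illustrative and are not taken from FASPScript8.

```python
def access_types(drs_object):
    """Return the set of access method types listed on a DRS object."""
    return {m.get("type") for m in drs_object.get("access_methods", [])}


def on_both_clouds(drs_object):
    """True if the object has both a Google Cloud (gs) and an AWS (s3) copy."""
    return {"gs", "s3"} <= access_types(drs_object)


def pick_access_method(drs_object, access_type):
    """Return the first access method of the given type, or None."""
    for method in drs_object.get("access_methods", []):
        if method.get("type") == access_type:
            return method
    return None


# Made-up DRS object, trimmed to the fields used here.
EXAMPLE_OBJECT = {
    "id": "example-id",
    "access_methods": [
        {"type": "gs", "access_id": "gs-us"},
        {"type": "s3", "access_id": "s3-east"},
    ],
}
```

The `access_id` of the chosen method is what would then be passed to the DRS access URL call for the cloud where the compute will run.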
Script: FASPScript6.py
-
This script demonstrates that the DNAStack WES Server can perform a compute on the URLs returned by the SRA Data Locator (SDL). The SDL is a placeholder for the NCBI DRS service.
-
A checksum was computed by the DNAStack WES implementation on the SRA format file for which a URL could be obtained. Though GetObject showed there are BAM files with an access_id of gs.us, URLs to these could not be obtained.
-
Possible to do's
- Substitute in SRA DRS server
- Identify why BAM file URLs are not returned by SDL.
Script: FASPScript7.py
-
Uses the ISB-CGC BigQuery tables to query for subjects from TCGA with variants in the JMJD1C gene. This is the gene in the example shared by Anne Deslattes Mays. This illustrates the kind of query that could be used for the workflows Anne wants to perform.
-
Possible to do's
-
Substitute in SRA DRS server
-
Identify other GA4GH data sources that might contain relevant data for this disease.
Script: DRSMetaResolver.py
-
Simulates how compact identifier prefixing can be used to redirect DRS GetObject calls to the relevant DRS Server.
-
Possible to do's
- Do trial registration with n2t/identifiers of DRS server prefixes
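The prefix-based redirection can be sketched as a lookup from compact identifier prefix to DRS server, followed by building the standard DRS GetObject path. The class name, prefixes, and hosts below are hypothetical and do not reflect the prefixes DRSMetaResolver.py actually registers.

```python
class MetaResolver:
    """Route 'prefix:local_id' compact identifiers to a DRS server."""

    def __init__(self, servers):
        # Mapping of prefix -> DRS server base URL (illustrative values only).
        self.servers = servers

    def split(self, curie):
        """Split a compact identifier into (prefix, local_id)."""
        prefix, sep, local_id = curie.partition(":")
        if not sep or not prefix or not local_id:
            raise ValueError(f"Not a compact identifier: {curie!r}")
        return prefix, local_id

    def object_url(self, curie):
        """Return the GetObject URL on the server registered for the prefix."""
        prefix, local_id = self.split(curie)
        if prefix not in self.servers:
            raise KeyError(f"No DRS server registered for prefix {prefix!r}")
        base = self.servers[prefix].rstrip("/")
        return f"{base}/ga4gh/drs/v1/objects/{local_id}"


resolver = MetaResolver({
    "bdc": "https://drs.bdc.example.org/",
    "crdc": "https://drs.crdc.example.org",
})
```

With this mapping, `resolver.object_url("bdc:abc123")` points at the first server, while an identifier with an unregistered prefix raises `KeyError` rather than being sent to the wrong server.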
Wrapper to call the DNAStack DiscoveryClient and return results of a query.
Perform searches via BigQuery
Superclass for DRS Clients
This is a python wrapper for the two DRS functions. It also handles Gen3 authentication using Fence. This is necessary until RAS/Passport support is in place.
There are two clients for specific Gen3 DRS servers
- crdcDRSClient - client for Cancer Research Data Commons DRS server
- bdcDRSClient - client for BioDataCatalyst DRS server
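The superclass and subclass arrangement described above might look roughly like the following. This is a sketch under assumptions: the class names, the default `access_id`, and the host are hypothetical, token retrieval from Fence is left out, and no request is actually sent. Only the two endpoint paths come from the DRS specification.

```python
from urllib.parse import quote


class DRSClientSketch:
    """Builds requests for the two DRS calls: GetObject and GetAccessURL."""

    def __init__(self, base_url, access_token=None):
        self.base_url = base_url.rstrip("/")
        self.access_token = access_token  # obtained elsewhere, e.g. via Fence

    def headers(self):
        """Bearer token header when a token is set, otherwise no headers."""
        if self.access_token:
            return {"Authorization": f"Bearer {self.access_token}"}
        return {}

    def object_url(self, object_id):
        """URL for GetObject."""
        return f"{self.base_url}/ga4gh/drs/v1/objects/{quote(object_id, safe='')}"

    def access_url_endpoint(self, object_id, access_id):
        """URL for GetAccessURL."""
        return f"{self.object_url(object_id)}/access/{quote(access_id, safe='')}"


class ExampleGen3Client(DRSClientSketch):
    """Server-specific subclass: fixes the host and a default access_id."""

    def __init__(self, access_token=None, access_id="gs"):
        super().__init__("https://gen3.example.org", access_token)
        self.access_id = access_id

    def default_access_endpoint(self, object_id):
        return self.access_url_endpoint(object_id, self.access_id)
```

Keeping URL construction and authentication in the superclass means a server-specific client only has to supply its host, its credentials file, and its preferred access method.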
A DRS client for Seven Bridges DRS services. Handles SB specific authentication. Two specific classes are provided.
- sbcgcDRSClient - client for Seven Bridges Cancer Genomics Cloud DRS server
- cavaticaDRSClient - client for Cavatica DRS Server
A DRS-like wrapper around the SRA Data Locator. Uses standard dbGaP/SRA authentication (.ngc file)
Wrapper to make a WES call. Currently it computes an MD5 checksum; the plan is to use samtools and eventually a GWAS workflow.
Wrapper to prepare a job to run samtools as a GCP Life Sciences pipeline.
Submits a samtools stats task via Seven Bridges API
Superclass for WES clients
checksum.wdl - a simple workflow for testing WES submission - calculates a checksum
More to be added
testSearchPagination.py - demonstrates how Discovery Search query results are returned over several pages
examples - examples of using individual APIs used in the main examples
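To illustrate the paging behaviour that testSearchPagination.py demonstrates, here is a minimal sketch of following pages until none remain. It assumes each response carries its rows under `data` and the link to the next page under `pagination.next_page_url`, as in the GA4GH search response format; `iterate_rows` and `fetch_page` are hypothetical names, and `fetch_page` stands in for whatever performs the HTTP GET and JSON decoding.

```python
def iterate_rows(first_url, fetch_page):
    """Yield every row of a paged result set.

    fetch_page(url) must return the decoded JSON body of one page.
    """
    url = first_url
    while url:
        page = fetch_page(url)
        # Some pages may carry no rows yet still link to a next page.
        yield from page.get("data", [])
        url = (page.get("pagination") or {}).get("next_page_url")


# Fake two-page result used instead of a live server.
FAKE_PAGES = {
    "page1": {"data": [{"id": 1}], "pagination": {"next_page_url": "page2"}},
    "page2": {"data": [{"id": 2}, {"id": 3}], "pagination": None},
}

rows = list(iterate_rows("page1", FAKE_PAGES.__getitem__))
```

Passing the fetch function in as an argument keeps the paging logic testable without network access.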