Collection of scripts for conveniently downloading data from the archs4 resource

jhawe/archs4_loader

Overview

This small collection of scripts builds on the ARCHS4 resource (What is ARCHS4?). The scripts provide an easy, if rather naive, way to define which samples to extract; alternatively, a sample definition can be obtained from the ARCHS4 data portal. The download script is adapted from the scripts provided on the ARCHS4 website and adds optional normalization and batch effect removal. Simple diagnostic plots are also created for the extracted data.

Snakemake workflow

Data download and processing

Lazy matching

We use a Snakemake-based approach to fully automate retrieving and normalizing ARCHS4 data. To download the data and perform batch effect correction, run the following command in the root directory of the project:

snakemake results/downloads/{your_keywords}/design.tsv

This obtains all samples from ARCHS4 whose tissue meta-data field contains every one of the specified keywords (keywords are separated by "_" in the path and combined with a logical AND). Currently, the downloaded expression data are automatically normalized using ComBat, and both the raw gene counts and the normalized data are saved.
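The keyword handling described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used by the workflow: the function names `parse_keywords` and `lazy_match`, the sample dictionary, and the case-insensitive comparison are all assumptions made for the example.

```python
from typing import Dict, List


def parse_keywords(wildcard: str) -> List[str]:
    """Split a path component such as 'whole_blood' into ['whole', 'blood']."""
    return [k.lower() for k in wildcard.split("_") if k]


def lazy_match(tissue: str, keywords: List[str]) -> bool:
    """Return True only if every keyword occurs somewhere in the tissue text.

    Substring search mimics a 'grep'; requiring all keywords implements the AND.
    Lower-casing both sides is an assumption of this sketch.
    """
    text = tissue.lower()
    return all(k in text for k in keywords)


# Hypothetical sample annotations (sample id -> tissue field), for illustration only.
samples: Dict[str, str] = {
    "S1": "Whole blood",
    "S2": "blood plasma",
    "S3": "liver",
}

keywords = parse_keywords("whole_blood")
selected = [sid for sid, tissue in samples.items() if lazy_match(tissue, keywords)]
print(selected)  # ['S1']
```

Because the match is a plain substring test, a keyword such as "liver" would also select a sample annotated as "liver cancer", which is the limitation that exact matching addresses.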

Exact matching

In addition to the lazy (fuzzy) matching, which simply 'greps' for the individual keywords in the 'source' meta-data column of the respective sample design, exact matching of keywords is supported (e.g. 'liver' can be matched without also including 'liver cancer'). At the moment, however, this requires modifying the workflow; this will be amended in the future.

For example, to get all (manually curated) liver samples, you might define the following in the Snakemake file (exact keywords separated by '|'):

tissue_to_keyword = {"liver": "liver|human liver|liver tissue"}

Then you could obtain the matched samples using the following line of code:

snakemake results/data/exact/liver/design.tsv

NOTE: In the above case, Snakemake will look for the key 'liver' in the 'tissue_to_keyword' dictionary.
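As a rough illustration of how exact matching differs from the grep-style approach, the sketch below looks up the '|'-separated keyword string for a tissue and accepts a sample only if its annotation equals one of the alternatives. The helper `exact_match`, the example annotations, and the whitespace and case normalization are assumptions of this sketch; the actual workflow may compare values differently.

```python
# Dictionary in the form shown above: tissue key -> '|'-separated exact keywords.
tissue_to_keyword = {"liver": "liver|human liver|liver tissue"}


def exact_match(source: str, tissue: str) -> bool:
    """Return True if the annotation equals one of the exact keywords for `tissue`.

    Stripping whitespace and lower-casing are assumptions of this sketch.
    """
    allowed = {k.strip().lower() for k in tissue_to_keyword[tissue].split("|")}
    return source.strip().lower() in allowed


# Hypothetical 'source' annotations, for illustration only.
sources = ["liver", "Human liver", "liver cancer", "liver tissue"]
matched = [s for s in sources if exact_match(s, "liver")]
print(matched)  # ['liver', 'Human liver', 'liver tissue']
```

Here "liver cancer" is rejected because it is not one of the listed alternatives, even though it contains the word "liver".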

Data exploration

After downloading and processing the data, we can go one step further and create a basic overview. This is implemented as an R Markdown document and is integrated into the Snakemake workflow:

snakemake results/downloads/{your_keywords}/summary.html

Note

Snakemake version 5.2.2 or greater is required for the R Markdown document to render properly. This version contains a bugfix that is crucial for successfully loading the data.
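To check this requirement before running the workflow, a version string such as the one printed by `snakemake --version` can be compared against the minimum. The following is a small stand-alone sketch; the helper names `parse_version` and `meets_minimum` are hypothetical and not part of this repository. Snakemake also provides a `min_version` helper in `snakemake.utils` that can be called at the top of a Snakefile for the same purpose.

```python
import re
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Extract the leading numeric components, e.g. '5.2.2' -> (5, 2, 2)."""
    match = re.match(r"\s*(\d+(?:\.\d+)*)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def meets_minimum(found: str, required: str = "5.2.2") -> bool:
    """Compare versions component-wise as integer tuples, not as strings."""
    return parse_version(found) >= parse_version(required)


print(meets_minimum("5.2.2"))   # True
print(meets_minimum("5.10.0"))  # True
print(meets_minimum("5.2.1"))   # False
```

Comparing integer tuples rather than raw strings matters here: as strings, "5.10.0" would sort before "5.2.2".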

TODOs

  • allow parametrization (e.g. type of normalization, saving of raw data, etc.)
  • allow exact keywords to be defined via a config file rather than via workflow modifications
