njpipeorgan/L1000-bayesian

L1000 peak deconvolution based on Bayesian analysis

Overview

This project aims to generate high-quality perturbagen signatures from the LINCS L1000 assay. We built a pipeline, in parallel with that of the L1000 group, to process raw fluorescence intensity data into z-scores that serve as perturbagen signatures. Pre-computed datasets covering the majority of LINCS L1000 Phase I and Phase II are available in Downloads and on Zenodo.

Our pipeline differs from the L1000 pipeline mainly in the peak deconvolution algorithm. We implement our algorithm in both C++ and CUDA, so it can be used from various languages. We give two examples: using these functions natively in C++, and calling them from Wolfram Mathematica.

Also, we have prepared a small batch of real data and the relevant code so that you can test our pipeline at a very small scale. You may follow the instructions, run the pipeline, and check the results.

Datasets

Summary

LINCS L1000 Phase I (GSE92742) and Phase II (GSE70138) datasets generated by our pipeline are currently available. The datasets cover three levels: our Level 4 and Level 5 data are equivalent to the Level 4 and Level 5 data provided by L1000; the marginal distribution data of peak locations (GSE92742 small-molecule treatments only, and GSE70138) are similar to L1000 Level 2 data, except that they are probability distributions rather than single values for the peak locations.

Unless you are interested in managing z-score inference and combination yourself, we encourage you to use the z-scores combined across bio-replicates (Level 5 data).

Downloads

| Description | Download |
| --- | --- |
| Marginal distributions of peak locations | Bayesian_GSE70138_Level2_DPEAK.zip |
| | Bayesian_GSE92742_Level2_DPEAK.zip |
| Plate control z-scores | Bayesian_GSE70138_Level4_ZSPC_n335465x978.h5 |
| | Bayesian_GSE92742_Level4_ZSPC_n1093191x978.h5 |
| Combined z-scores by bio-replicates | Bayesian_GSE70138_Level5_COMPZ_n116218x978.h5 |
| | Bayesian_GSE92742_Level5_COMPZ_n361481x978.h5 |
| Checksum | Bayesian_L1000_sha512sum.txt |

The metadata are available from the L1000 group's publications: GSE70138 and GSE92742. They include the perturbagen and cell line information associated with the signature and instance IDs in the datasets.
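To check a downloaded file against the checksum file listed above, something like the following sketch may help. It assumes the checksum file uses the usual `sha512sum` layout (one `<hex digest>  <file name>` pair per line), which is not confirmed here; the function names are illustrative and not part of this project.

```python
import hashlib
from pathlib import Path


def sha512_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-512 digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_checksum_lines(text):
    """Parse sha512sum-style lines into a {file name: digest} dict.

    Assumption: each line looks like "<hex digest>  <file name>".
    """
    expected = {}
    for line in text.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) == 2:
            digest, name = parts
            # sha512sum marks binary mode with a leading '*' on the name.
            expected[name.lstrip("*").strip()] = digest.lower()
    return expected


def verify_download(data_path, checksum_path):
    """Return True if the file's digest matches its checksum entry."""
    expected = parse_checksum_lines(Path(checksum_path).read_text())
    name = Path(data_path).name
    if name not in expected:
        raise KeyError(f"no checksum entry for {name}")
    return sha512_of_file(data_path) == expected[name]
```

For example, `verify_download("Bayesian_GSE70138_Level5_COMPZ_n116218x978.h5", "Bayesian_L1000_sha512sum.txt")` would return `True` for an intact download, provided the layout assumption holds.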

Data structures

The z-score results (as HDF5) are compatible with those published by the L1000 group. Each of them contains three datasets as follows:

  • /colid are the signature IDs (Level 5) or instance IDs (Level 4);

  • /rowid are the names of landmark genes;

  • /data are the z-scores as a matrix.
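As an illustration of the layout above, the sketch below pulls the z-scores for a single signature (or instance) out of an opened file and pairs them with the landmark gene names. It assumes the third-party `h5py` package for file access, and it does not assume which axis of `/data` corresponds to `/colid`: it compares the matrix shape with the ID list lengths and picks the matching axis. The function names are hypothetical.

```python
def decode_ids(raw_ids):
    """HDF5 strings often come back as bytes; return plain str values."""
    return [x.decode("utf-8") if isinstance(x, bytes) else str(x)
            for x in raw_ids]


def signature_profile(h5file, column_id):
    """Return {gene name: z-score} for one signature or instance ID.

    `h5file` is an open h5py.File, or any mapping from dataset name to an
    array-like object with `.shape` and slicing support.
    """
    col_ids = decode_ids(h5file["/colid"][...])
    row_ids = decode_ids(h5file["/rowid"][...])
    j = col_ids.index(column_id)  # raises ValueError if the ID is absent

    data = h5file["/data"]
    shape = tuple(data.shape)
    # Assumption: the storage orientation is not stated, so infer it from
    # the shape. If both axes have the same length, the first branch wins.
    if shape == (len(col_ids), len(row_ids)):
        values = data[j, :]
    elif shape == (len(row_ids), len(col_ids)):
        values = data[:, j]
    else:
        raise ValueError(f"unexpected /data shape {shape}")
    return {gene: float(v) for gene, v in zip(row_ids, values)}


def open_zscore_file(path):
    """Open a z-score file read-only (requires the h5py package)."""
    import h5py  # imported lazily so the helpers above work without it
    return h5py.File(path, "r")
```

Reading only one row or column of `/data` avoids loading the full matrix, which matters for the larger Level 4 files. A typical call would be `with open_zscore_file(path) as f: profile = signature_profile(f, some_id)`, where `some_id` is one of the entries in `/colid`.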

Each marginal distribution file contains the peak location information for one plate. It contains four datasets as follows:

  • /colid are the instance IDs;

  • /rowid are the names of landmark genes;

  • /peakloc are the peak locations used for calculating the likelihood function;

  • /data are encoded log-likelihoods as a rank-3 array of 16-bit unsigned integers. To retrieve the log-likelihoods, the values should be multiplied by a factor of -0.001. Note that they are not normalized.
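The decoding step described above can be sketched as follows for a single one-dimensional slice of `/data`. The scale factor of -0.001 comes from the description; turning the decoded values into weights that sum to one over the `/peakloc` grid is an assumption about how one might use them, and which axis of the rank-3 array indexes peak locations is not stated here, so the slicing is left to the caller. The function names are illustrative.

```python
import math


def decode_log_likelihoods(encoded):
    """Convert stored 16-bit unsigned integers to log-likelihoods."""
    return [-0.001 * value for value in encoded]


def normalize(log_likelihoods):
    """Turn unnormalized log-likelihoods into weights that sum to 1.

    Subtracting the maximum before exponentiating (the log-sum-exp
    trick) avoids underflow when all values are strongly negative.
    """
    peak = max(log_likelihoods)
    weights = [math.exp(x - peak) for x in log_likelihoods]
    total = sum(weights)
    return [w / total for w in weights]


def mean_location(peak_locations, probabilities):
    """Probability-weighted mean of the grid of peak locations."""
    return sum(loc * p for loc, p in zip(peak_locations, probabilities))
```

Here `normalize(decode_log_likelihoods(row))` gives discrete weights over the grid points in `/peakloc`. Treating them as probabilities assumes a flat prior over those grid points.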

Citation

Qiu, Yue, et al., 2020, Bioinformatics, 36(9), 2787, https://doi.org/10.1093/bioinformatics/btaa064