Skip to content

jgolob/pycorncob

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

21 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

CORNCOB: Beta-binomial regression of count data

Introduction

This is a python implementation of the R corncob package. This is a beta-binomal model-based approach for hypothesis testing of count data such as that generated by microbiome 16S rRNA or RNAseq experiments.

The method was first developed by Dr. Willis and her mentee Bryan D Martin, and is fully described in Modeling microbial abundances and dysbiosis with beta-binomial regression.

CORNCOB stands for: COunt RegressioN for Correlated Observations with the Beta-binomial

Why you should use this technique is detailed below.

Installation

Installation can be done via pip: pip install corncob

Usage

corncob -C counts.csv -VA covariates_abund.csv -VD covariates_disp.csv -O output.csv

counts.csv format

element,specimen_label_01,specimen_label_02,....
total,1244,1344,....
sv_name,20,0,.....

The header row should provide specimen (nee observation) labels.

The first data row should be for the total count across all elements for each specimen (e.g. the total count of reads for all ASVs in a 16S rRNA experiment.)

Each row after should be for one element (ESV, OTU, gene, etc). The first column should be an unique identifier for that element. The remainder of columns should be the counts for that element.

Zeros are fine. Unlike other count-based metrics (i.e. DESEQ2) you do not need to zero-inflate and in fact should not. Nulls or blank cells will be replaced with zeros.

covariates.csv format

cov1,cov2,cov3,...
331,0.12,0000.1,...
....

This is an exogenous matrix for the covariates. Each column corresponds to one covariate to be fitted. An intercept column is always added. Any empty cells will be replaced with zeros. If not provided, only an intercept will be used for regression.

One can provide distinct covariate tables for abundance (-VA) and dispersion (-VD)

Output

A CSV file with a header. The header has the various attributes, in the rough format of element_id,converged,abd__cov_name__Estimate,abd__cov_name__se,abd__cov_name__t,abd__cov_name__p,...

The first block indicates if this is for abundance (abd) or dispersion (disp).

The second block in the same of the covariate, taken directly from the input files.

The third block is

  • Estimate: The fitted estimated coefficent for that covariate for abundance or dispersion
  • se: The standard error of the estimate
  • t: The t-value, derived via the Wald statistic
  • p: The p-value (derived from the t-value). Two tailed.

Why use beta-binomial regression?

Thanks to next-generation sequencing, biological scientists have an opportunity to make discoveries with count data. Whether RNAseq, ATAC-seq, microbiome 16S rRNA amplicons, or shotgun metagenomes, often the data we wish to compare to our outcomes are ultimately a big table of number of reads assigned in each specimen to a gene, locus, microbe, etc. These read-counts pose an analytic challenge.

While superficially appearing like any other continuous measure (particularly when normalized to relative abundance), these counts-of-reads are tricky. It is typical that the total number of reads to vary a lot from specimen-to-specimen. Even modest noise in the most predominant features (e.g. OTU) can wreak havoc on the precision of true featues further into the tail. Each read-count is not really independent; thre are likely all sorts of latent covariance we are barely beginning to understand. Compounding all this is that most of these read-count matricies are sparse; most of the values are zeros.

An ideal regression method for read-count data would:

  • Be fine with zeros, even a lot of zeros, and not neet to do something like "add one to each count" to work.
  • Recognize read-counts are likely correlated with one another in ways that we do not fully understand
  • Take advantage of the variable total read depth (total reads) per specimen.
  • Consider how our covariates (outcomes) relate not just to the relative abundance but also the variability (dispersion) of a given feature's read count.

The adaptation of beta-binomial models by Martin et al fullfils these criteria.

About

beta-binomial based testing of count data

Resources

License

Stars

Watchers

Forks

Packages

No packages published