Note: ceftools is no longer being maintained. Check out Loom, our high-performance file format for very large-scale omics datasets!
Tools for manipulating cell expression format (CEF) files.
You can download the latest version from the releases page.
ceftools is a set of tools designed to manipulate large-scale gene expression data. It aims to be for gene expression data what samtools is for sequence data. It was designed to simplify the exchange and manipulation of very large-scale transcriptomics data, particularly from single-cell RNA-seq. However, it can process any omics-style dataset that can be represented as an annotated matrix of numbers.
ceftools is implemented as a command-line utility called cef, which operates on an annotated matrix of gene-expression values. The annotation consists of headers (name-value pairs) as well as row and column attributes. It supports useful operations such as filtering, sorting, splitting, joining and transposing the input. Multiple commands can be chained to perform more complex operations.
ceftools processes files in the text-based CEF format ('cell expression format', .cef). CEF files are human-readable, tab-delimited text files that can be easily parsed and generated from any scripting language.
cef help        - print help for the cef command
cef info        - overview of file contents
cef view        - interactively navigate the matrix
cef transpose   - transpose the file
cef sort        - sort by row attribute or by specific column
cef sort --spin - sort rows by the SPIN algorithm
cef select      - select rows that match given criteria
cef join        - join two datasets by given attributes
cef add         - add attribute or header with constant value
cef drop        - drop attribute(s) or header(s)
cef rename      - rename attribute
cef rescale     - rescale rows (rpkm, tpm or log-transformed)
cef aggregate   - calculate aggregate statistics for every row
cef import      - import from STRT
The examples below make use of the oligos.cef sample dataset, which you can download from the BackSPIN release page. If you have installed cef and have oligos.cef in your current working directory, you should be able to run all the examples below without modification.
Commands operate on rows by default. For example, drop can be used to remove row attributes, but not column attributes. Use the global --bycol flag to operate on columns instead. For example, to remove column attribute Gene and then sort on column attribute Length:
< infile.cef cef --bycol drop --attrs Gene | cef --bycol sort --by "Length" > outfile.cef
Note that since --bycol is a global flag, it must always be positioned before the command:
cef --bycol <command>
Show a summary of the contents of a CEF file.
< oligos.cef cef info
Columns: 820
Rows: 19972
Flags: 0
Headers:
    Genome = mm10
    Citation = http://www.sciencemag.org/content/347/6226/1138.abstract
Column attributes: Tissue, Group, Total_mRNA, Well, Sex, Age, Diameter, CellID, Class, Subclass
Row attributes: GeneType, Gene, GeneGroup
Interactively view the contents of a CEF file (in the terminal window).
< oligos.cef cef view
The yellow toolbar at the bottom shows the available commands for navigating the file.
Use the 'wasd' keys to scroll the matrix; hold down shift to scroll by a whole screen at a time. Press 'h' to jump to the top-left corner, and 'z' to jump to the bottom-right.
Use the arrow keys to scroll the entire view (including the row and column attributes).
To sort by an attribute or column, use the arrow keys to scroll the view. Position the column you want to sort by at the left of your screen, then press 'o'. To reverse the sort order, press 'o' again.
To transpose rows and columns, press 't'.
Press 'q' to exit the viewer.
Transpose rows and columns.
< oligos.cef cef transpose | cef info
Columns: 19972
Rows: 820
Flags: 0
Headers:
    Genome = mm10
    Citation = http://www.sciencemag.org/content/347/6226/1138.abstract
Column attributes: GeneType, Gene, GeneGroup
Row attributes: Tissue, Group, Total_mRNA, Well, Sex, Age, Diameter, CellID, Class, Subclass
Compare this output to the example given above (Info command).
Sort the file based on a row attribute or the values in a specific column.
cef sort --by "attr=X"    Sort by the column where column attribute 'attr' has value 'X'
cef sort --by "attr"      Sort by the row attribute 'attr'

Options:
    --numerical    When sorting by row attribute, sort numerically (default: sort alphabetically)
    --reverse      Sort in reverse order
< oligos.cef cef sort --by "CellID=1772067057_G07" --reverse > oligos_sorted.cef
The output file is sorted by the first column, which has CellID '1772067057_G07', in descending order. You can verify this by running:
< oligos_sorted.cef cef view
Note: when sorting by column, the "attr=X" clause must be put in double quotes (as above), or bash will interpret the equals sign as a variable assignment.
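The difference between the default alphabetical order and --numerical can be illustrated with a short Python sketch. This is not how cef is implemented; the function name sort_by_attribute is hypothetical, and attribute values are assumed to be plain strings that parse as numbers when sorted numerically.

```python
def sort_by_attribute(values, numerical=False, reverse=False):
    """Sort attribute values as text (default) or as numbers.

    Hypothetical helper that mimics the documented --numerical and
    --reverse options; it is not part of ceftools.
    """
    key = float if numerical else str
    return sorted(values, key=key, reverse=reverse)


lengths = ["10", "9", "100"]
alphabetical = sort_by_attribute(lengths)               # text order
numeric = sort_by_attribute(lengths, numerical=True)    # numeric order
```

Alphabetical sorting places "100" before "9" because strings are compared character by character, which is why numeric attributes usually call for --numerical.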
Sort rows based on correlations using the SPIN algorithm.
cef sort --spin    Sort rows by SPIN

Options:
    --corrfile "FILE"    Write the correlation matrix to the given file
< oligos.cef cef sort --spin --corrfile oligos_spin_corrs.tab > oligos_spin_sorted.cef
The output file is sorted by correlation similarity, and the row correlation matrix is written to the indicated file in tab-delimited text format.
Select rows that match a given criterion.
cef select --where "attr=X"    Select rows where the given attribute has value 'X'
cef select --range "10:20"     Select rows 10 through 20
cef select --range ":20"       Select all rows up to and including row 20
cef select --range "100:"      Select all rows starting with row 100 and to the last row

Options:
    --except    Invert the selection
< oligos.cef cef select --where "Gene=Actb" | cef view
The result is a file with a single row, containing the data for Actb.
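The --range syntax can be sketched in Python as follows. This only illustrates the documented semantics (inclusive bounds, open-ended ranges, and --except). It assumes rows are numbered from 1, which the text above does not state explicitly, and the function names are hypothetical.

```python
def parse_range(spec):
    """Split a range such as "10:20", ":20" or "100:" into (start, end).

    A missing bound is returned as None. Hypothetical helper.
    """
    start_text, sep, end_text = spec.partition(":")
    if not sep:
        raise ValueError("range must contain ':'")
    start = int(start_text) if start_text else None
    end = int(end_text) if end_text else None
    return start, end


def select_range(rows, spec, invert=False):
    """Keep rows whose position falls inside the inclusive range.

    Assumption: row positions are 1-based. invert=True mimics --except.
    """
    start, end = parse_range(spec)
    first = 1 if start is None else start
    last = len(rows) if end is None else end
    return [row for position, row in enumerate(rows, start=1)
            if (first <= position <= last) != invert]
```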
Join two datasets by matching up rows that have the same value for an attribute.
cef join --with <other.cef> --on "attr1=attr2"
This command joins two CEF files, one from standard input (STDIN) and one given by the --with <other.cef> option. The result is a new CEF file which has been extended on the right with all the data from 'other.cef' that matches with the existing data in the input CEF file.
Column attributes of the same name are merged. For example, if both of the input files have column attribute Age, the resulting file will have a single column attribute Age. But if one file has Sex and the other has Gender, the output will have two attributes (Sex and Gender), and missing values will be blank.
Row attributes are merged if they contain identical values for all the retained rows. For example, if both input files have row attribute Gene and all retained rows have the same values in the same order, then the output will have only a single row attribute Gene. But if there are any differences, then the output will have two row attributes, both called Gene.
Rows that do not match are silently dropped.
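The row-matching behaviour described above resembles an inner join. A minimal Python sketch of that idea, assuming each dataset is reduced to a dictionary from the join attribute's value to that row's matrix values (a simplification that ignores attribute merging and duplicate keys; inner_join is a hypothetical name):

```python
def inner_join(left, right):
    """Extend each left row with the matching right row's values.

    Rows whose key is missing from `right` are dropped, mirroring the
    "silently dropped" behaviour described in the text.
    """
    return {key: values + right[key]
            for key, values in left.items()
            if key in right}


left = {"Actb": [1.0, 2.0], "Gapdh": [3.0, 4.0]}
right = {"Actb": [5.0], "Sox10": [6.0]}
joined = inner_join(left, right)   # only "Actb" occurs in both inputs
```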
Add a header or a constant attribute.
cef add --attr "attr=X"     Add a row attribute 'attr' with constant value 'X' in every row
cef add --header "hdr=X"    Add a new header 'hdr' with value 'X'
< oligos.cef cef --bycol add --attr "File=oligos.cef" | cef view
The resulting file will have a new column attribute (since we specified --bycol; note that this option goes before the command), with the same value for every column.
This can be useful when joining two datasets, to keep track of which columns came from which file originally.
Remove an attribute or header.
cef drop --attrs "att1,att2"      Remove attributes 'att1' and 'att2' (comma-separated, case-sensitive list)
cef drop --headers "hdr1,hdr2"    Remove headers 'hdr1' and 'hdr2' (comma-separated, case-sensitive list)
< oligos.cef cef drop --headers "Genome,Citation" | cef info
Columns: 820
Rows: 19972
Flags: 0
Headers:
Column attributes: Tissue, Group, Total_mRNA, Well, Sex, Age, Diameter, CellID, Class, Subclass
Row attributes: GeneType, Gene, GeneGroup
The two headers were dropped.
Rename an attribute.
cef rename --attr "old=new" Rename the attribute 'old' to 'new'
< oligos.cef cef rename --attr "Gene=Symbol" | cef info
Columns: 820
Rows: 19972
Flags: 0
Headers:
    Genome = mm10
    Citation = http://www.sciencemag.org/content/347/6226/1138.abstract
Column attributes: Tissue, Group, Total_mRNA, Well, Sex, Age, Diameter, CellID, Class, Subclass
Row attributes: GeneType, Symbol, GeneGroup
Notice the renamed row attribute.
Scale or normalize the main matrix using one of several common methods.
cef rescale --method [log|tpm|rpkm]    Rescale using the given method

Options:
    --length <attr>    Name of the row attribute that gives the gene length (for rpkm)
< oligos.cef cef rescale --method log | cef view
Notice that the values have been rescaled from X to log(X+1).
The 'tpm' option normalizes each row by dividing by the row sum and multiplying by 1000000.
The 'rpkm' option normalizes each row by dividing by the row sum and by the length, and multiplying by 1000. The length must be given as a row attribute, indicated using the --length option. The length is normally given in basepairs.
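The three rescaling methods, as described above, can be written out as a short Python sketch. It follows the formulas in the text literally and is not taken from the cef source. The function names are hypothetical, the natural logarithm is an assumption (the text does not state the base), and each row is assumed to have a nonzero sum.

```python
import math


def rescale_log(row):
    """X -> log(X + 1). Assumption: natural logarithm."""
    return [math.log1p(x) for x in row]


def rescale_tpm(row):
    """Divide by the row sum and multiply by 1000000."""
    total = sum(row)  # assumed nonzero
    return [x / total * 1_000_000 for x in row]


def rescale_rpkm(row, length):
    """Divide by the row sum and by the length, then multiply by 1000."""
    total = sum(row)  # assumed nonzero
    return [x / total / length * 1000 for x in row]


row = [1.0, 3.0]
tpm = rescale_tpm(row)            # each value as a share of one million
rpkm = rescale_rpkm(row, 500.0)   # 500.0 stands in for a gene length in basepairs
```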
Compute aggregated values for each row.
cef aggregate [options]

Options:
    --mean              Compute mean
    --stdev             Compute standard deviation
    --cv                Compute CV (standard deviation divided by the mean)
    --max               Compute max value
    --min               Compute minimum value
    --noise <method>    Compute noise as offset from CV-vs-mean fit
< oligos.cef cef aggregate --mean --stdev | cef view
Two new row attributes are added, named "Mean" and "Stdev". This makes it possible to sort by mean and standard deviation in the viewer.
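For reference, the per-row statistics can be sketched in Python. The attribute names "Mean" and "Stdev" come from the text above; "CV", "Max" and "Min" are assumed names. Using the sample standard deviation (n - 1 denominator) is also an assumption, since the text does not say which variant cef computes.

```python
import statistics


def aggregate_row(row):
    """Compute summary statistics for one row of the matrix.

    Assumptions: sample standard deviation, and CV reported as NaN
    when the mean is zero.
    """
    mean = statistics.fmean(row)
    stdev = statistics.stdev(row)
    cv = stdev / mean if mean != 0 else float("nan")
    return {"Mean": mean, "Stdev": stdev, "CV": cv,
            "Max": max(row), "Min": min(row)}


stats = aggregate_row([1.0, 2.0, 3.0])
```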
To calculate noise using the standard CV-vs-mean fit, you must pass a method parameter:
cef aggregate --noise std
The following values are supported:
| std | Fit log(CV) = log(mean^k0 + k1) using least absolute deviation |
| bands | Fit log(CV) = log(mean^0.52 + k1) by bisection |
The most reliable and well-tested method is std, but bands can sometimes give a better fit. In both cases, the fitting algorithms have been tested only on data represented as absolute mRNA molecule counts per cell. If you have RPKM or TPM data, the fits may or may not converge and may be completely off (let us know what you find!).
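One way to read "noise as offset from CV-vs-mean fit" is as the difference between a row's observed log(CV) and the value predicted by the fitted curve. The Python sketch below shows that reading only. It assumes the fitted parameters k0 and k1 are already known, uses the natural logarithm, and does not perform the fit itself; whether cef computes exactly this quantity is not confirmed by the text.

```python
import math


def expected_log_cv(mean, k0, k1):
    """Value of the fitted curve log(mean^k0 + k1) at the given mean."""
    return math.log(mean ** k0 + k1)


def noise_offset(mean, cv, k0, k1):
    """Observed log(CV) minus the fitted value (assumed definition)."""
    return math.log(cv) - expected_log_cv(mean, k0, k1)


# With illustrative parameters k0 = -0.5 and k1 = 0.5, a row with mean 4
# has an expected CV of 4**-0.5 + 0.5 = 1.0, so an observed CV of 1.0
# lies exactly on the curve.
on_curve = noise_offset(4.0, 1.0, k0=-0.5, k1=0.5)
```

A positive offset would mean the row is noisier than rows of similar mean expression, and a negative offset would mean it is less noisy.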
Import from other file formats to CEF.
cef import --format "format" Import a file expected to be in 'format'
Currently, this command allows a single format "strt", which can be used to import a Linnarsson lab legacy file format ("_expression.tab").
CEF file format
CEF files are tab-delimited text files in UTF-8 encoding, no BOM. The first four characters are 'CEF\t' (that's a single tab character at the end), equivalent to the hexadecimal 4-byte number 0x09464543 in little-endian order. CEF files are guaranteed to always begin with these four bytes, which can be used to identify the file format in the absence of a file name extension.
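As a sketch of how a reader might use this signature, the following Python checks the first four bytes and shows their little-endian integer value. The function name looks_like_cef is hypothetical.

```python
import struct

CEF_MAGIC = b"CEF\t"  # the bytes 0x43 0x45 0x46 0x09


def looks_like_cef(data):
    """Return True if the byte string starts with the CEF signature."""
    return data[:4] == CEF_MAGIC


# Interpret the same four bytes as one little-endian unsigned 32-bit integer.
magic_as_int = struct.unpack("<I", CEF_MAGIC)[0]
print(format(magic_as_int, "#010x"))  # 0x09464543
```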
Each row has the same number of tab-separated fields, equal to max(7, column count + row attribute count + 1). In other words, the entire file is a rectangular tab-delimited matrix, with at least seven columns. CEF file readers should accept CEF files that have fewer than the required number of fields in any row, and the missing fields should be interpreted as empty strings (but empty strings should not be interpreted as zeros; thus zeros must always be explicitly represented as '0'). CEF file writers should always generate a rectangular tab-delimited matrix.
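The field-count rule and the padding behaviour expected of readers can be sketched in Python as follows. Both function names are hypothetical.

```python
def expected_field_count(n_columns, n_row_attrs):
    """Number of tab-separated fields every row should have."""
    return max(7, n_columns + n_row_attrs + 1)


def pad_fields(line, n_fields):
    """Split a line on tabs and pad short rows with empty strings.

    Missing fields become "" (not "0"), as the format requires.
    """
    fields = line.split("\t")
    fields += [""] * (n_fields - len(fields))
    return fields


width = expected_field_count(n_columns=123, n_row_attrs=4)  # 128
```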
Each line is terminated by a single newline character: \n. CEF file writers should always generate lines ending in a single \n, but readers should silently ignore any number of adjacent newline \n and carriage return \r characters. This makes it a little easier to generate CEF files manually, e.g. in Excel.
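A tolerant reader could split lines along these lines. This is a minimal Python sketch with a hypothetical function name; it treats any run of \n and \r characters as a single line break.

```python
import re


def split_cef_lines(text):
    """Split text into lines, ignoring runs of \\n and \\r characters."""
    return [line for line in re.split(r"[\r\n]+", text) if line]


lines = split_cef_lines("a\tb\r\nc\td\n\n\re\n")
```

Here the Windows-style \r\n, the blank line and the stray \r are all absorbed, leaving three lines.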
The first row of values defines the file structure. It begins with 'CEF', followed by the header count, row attribute count, column attribute count, row count, column count, and the Flags value.
This is followed by header lines, which are name-value pairs, with the name in the first column and the value in the second. There are no restrictions on either the names or the values. The order of headers is not necessarily preserved when CEF files are read and written. There can be multiple headers with the same name.
Next, the column attributes are given, each in a single row with an offset of (row attribute count). Finally, the rows are given, starting with row attributes, and followed by the values of the main matrix. Values are represented in text as decimal floating-point numbers in scientific notation, with optional exponent (e.g. -1.4203939E+2; the regex is [-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?). Values must fit in a 32-bit IEEE-754 floating point number.
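A validator for matrix values might combine the regex with a range check. The Python sketch below uses a hypothetical function name and interprets "fit in a 32-bit float" as "magnitude no larger than the largest finite float32", which is an assumption about what the requirement means.

```python
import re

VALUE_RE = re.compile(r"[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?")
FLOAT32_MAX = 3.4028234663852886e38  # largest finite IEEE-754 single


def is_valid_cef_value(text):
    """Check the textual form and that the magnitude fits in float32."""
    if VALUE_RE.fullmatch(text) is None:
        return False
    return abs(float(text)) <= FLOAT32_MAX


ok = is_valid_cef_value("-1.4203939E+2")   # the example from the text
```

Note that an empty string does not match the regex, which is consistent with the rule that empty fields are not zeros.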
The order of headers, row and column attributes is not significant and need not be preserved. Of course, the order of the values of row and column attributes is significant, and must be preserved.
Example of a file with 1 header, 4 Row Attributes, 2 Column Attributes, 345 Rows, 123 Columns. The last number (0) in the first row is the Flags value, currently unused.
|Header name||Header value|
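For the example above, the first row would contain the fields CEF, 1, 4, 2, 345, 123 and 0. A Python sketch of reading that row follows; the class and function names are hypothetical, and only the first seven fields are inspected because rows may be padded to the full matrix width.

```python
from typing import NamedTuple


class CefStructure(NamedTuple):
    headers: int
    row_attrs: int
    col_attrs: int
    rows: int
    columns: int
    flags: int


def parse_structure_line(line):
    """Parse the first row of a CEF file into its six counts."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 7 or fields[0] != "CEF":
        raise ValueError("not a CEF structure line")
    return CefStructure(*(int(f) for f in fields[1:7]))


structure = parse_structure_line("CEF\t1\t4\t2\t345\t123\t0\n")
```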
Note that a CEF file can have zero row attributes, zero column attributes, and even zero rows or columns (in any combination). A CEF file without data, but with only row attributes, can be a useful way of storing annotations. Such a file can be joined to a data file to add the annotation to the data file.
Speed up CEF writer
Tutorials for common tasks
Rescale by given column attribute (mean centered)
Import simple tables
Aggregate maxcor, mincorr
Left, right joins
Select by regex
Select by < and >
Parsers and generators for R, Python, MATLAB, Mathematica, Java
Test suite for parsers and generators
Validator for CEF files
Repository