Developer Notes for CSI
I'll try to keep notes relevant to the translation of CSI to Python in this document.
Data from the DREAM challenge
Test data for CSI from the DREAM project is in
It contains two header rows: the first defines the 5 "Treatments",
while the second gives the 21 evenly spaced X values for the time series.
The first column contains the 10 gene names and the remaining 10 by
(21*5) cells contain the data values.
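A rough sketch of reading this layout in Python. It assumes the file is comma-separated and that the treatment name is repeated above each of its columns; neither has been checked against the real file. parse_dream_table is a made-up helper name, and the demo text is toy data, not the DREAM data.

```python
import csv
import io


def parse_dream_table(text):
    """Parse a table with two header rows followed by one row per gene.

    Assumed layout (comma-separated):
      row 0: empty cell, then one treatment label per data column
      row 1: empty cell, then one time value per data column
      rows 2+: gene name, then the data values
    """
    rows = list(csv.reader(io.StringIO(text)))
    treatments = rows[0][1:]                      # skip the gene-name column
    times = [float(t) for t in rows[1][1:]]
    genes = [row[0] for row in rows[2:]]
    values = [[float(v) for v in row[1:]] for row in rows[2:]]
    return treatments, times, genes, values


# Toy input: 2 treatments x 2 time points, 2 genes.
demo = """,T1,T1,T2,T2
,0,1,0,1
g1,0.1,0.2,0.3,0.4
g2,1.0,2.0,3.0,4.0
"""
treatments, times, genes, values = parse_dream_table(demo)
```

For the real file this would give 10 gene names and a 10 by (21*5) list of values.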
Calculation of parental sets should probably be done outside of the "main" CSI code, maybe in the user-friendly calling code. The use case is to optimise hyperparameters using a smaller set (e.g. truncated at two parents), then run the full algorithm with a larger set (e.g. three parents).
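A minimal sketch of generating the parental sets outside the main code, so the caller can choose the truncation. It assumes every gene in the list is a candidate parent and that the empty set counts as a parental set; parental_sets and the gene names are hypothetical.

```python
from itertools import combinations


def parental_sets(candidates, max_parents):
    """Return every subset of `candidates` with at most `max_parents` members."""
    candidates = list(candidates)
    sets = []
    for size in range(max_parents + 1):       # size 0 is the empty parental set
        sets.extend(combinations(candidates, size))
    return sets


genes = ["G%d" % i for i in range(1, 11)]     # placeholder names for 10 genes
small = parental_sets(genes, 2)               # cheaper set for tuning hyperparameters
large = parental_sets(genes, 3)               # larger set for the full run
```

With 10 candidates this gives 56 sets when truncating at two parents and 176 when truncating at three, which is why the smaller set is attractive for the hyperparameter search.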
Displaying Results Interactively
Graph/network with genes as nodes and edges weighted by marginal likelihood over parental sets. Hovering over a node displays its name; edge line weight is proportional to the marginals.
Plots of raw data; selected node in black, genes having a causal effect on the selected gene in one colour and genes our node has a causal effect on in another colour. Line weight proportional to marginal likelihood.
A table of genes: name, number of child and parent genes at current cutoff
Table of parental sets for a selected gene, highlighted by cutoff
Slider for cutoff on marginal likelihoods of parental sets.
Menubar for setting options, exporting data
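The cutoff slider and the gene table could share one small piece of logic, sketched below. This assumes the marginals have already been summarised per (parent, child) edge, which is a simplification of the per-parental-set marginals above; the function names and the numbers are made up.

```python
from collections import Counter


def edges_above_cutoff(marginals, cutoff):
    """Keep the (parent, child) edges whose marginal is at least `cutoff`."""
    return {edge: w for edge, w in marginals.items() if w >= cutoff}


def gene_table(genes, edges):
    """Rows of (name, number of children, number of parents) for each gene."""
    n_children = Counter(parent for parent, _child in edges)
    n_parents = Counter(child for _parent, child in edges)
    return [(g, n_children[g], n_parents[g]) for g in genes]


# Toy marginals, keyed by (parent, child).
marginals = {("A", "B"): 0.9, ("A", "C"): 0.4, ("B", "C"): 0.7}
kept = edges_above_cutoff(marginals, 0.5)
table = gene_table(["A", "B", "C"], kept)
```

Moving the slider would then just mean calling edges_above_cutoff again with the new value and redrawing the graph and table from the result.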
I want to…
get a general idea of the network structure while easily cutting the graph at different marginals (uses: graph and cutoff)
make sure that CSI is "doing the right thing" by looking at expression profiles and comparing them to their parental sets (uses graph and plots)
look at my genes of interest and explore the network near them (uses graph and something else)
use Cytoscape to look at the network cut as appropriate. I want the graph in a .SIF file and the edges in a .EDA file.
get citation information
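For the Cytoscape export, a hedged sketch of producing the two files as strings. The layout follows my understanding of the Cytoscape 2.x formats (SIF lines of "source relation target"; an edge attribute file with the attribute name on the first line, then "source (relation) target = value") and should be checked against the Cytoscape documentation. The relation label "pd" and the attribute name "marginal" are arbitrary choices, not something CSI defines.

```python
def to_sif(edges, relation="pd"):
    """One tab-separated 'parent relation child' line per edge."""
    return "".join("%s\t%s\t%s\n" % (p, relation, c) for p, c in sorted(edges))


def to_eda(edges, attribute="marginal", relation="pd"):
    """Attribute name on the first line, then one 'p (rel) c = value' line per edge."""
    lines = [attribute]
    for p, c in sorted(edges):
        lines.append("%s (%s) %s = %g" % (p, relation, c, edges[(p, c)]))
    return "\n".join(lines) + "\n"


# Toy edges, keyed by (parent, child), already cut at some marginal.
edges = {("A", "B"): 0.9, ("B", "C"): 0.7}
sif_text = to_sif(edges)
eda_text = to_eda(edges)
```

Writing sif_text and eda_text to network.sif and network.eda (hypothetical file names) would then be a plain open(...).write(...).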
csi is the entry point of the program; it starts the GUI and responds
to button presses.
panelfuncs contains a list of panels to display, and
all subdirectories are added to Matlab's search path. The data is
loaded and parsed, genes and transcription factors are selected, and
finally a structure is passed to
run_CSI. This structure contains:
data: a double matrix, as per demo data file
genenames: a cell array of gene names
genedesc: a cell array of gene descriptions
replicatenames: a cell array of replicate names
startingCols: matrix describing which columns of the data refer to which treatment
time_values: double vector, as per demo data
toplot: scalar numeric value
tf: double vector, indices of genes to treat as transcription factors
params: has members
Pr: priors for hyperparameters
inference: 1 for EM or 2 for MCMC (only for CSI)
sparse: optimisation (only for EM)
N: number of MC steps (only for MCMC or Hierarchical)
temp: temp for Hierarchical
fixtemp: boolean for Hierarchical?
indegree: where to truncate the in-degree
parEnv: not sure, empty matrix!
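One possible Python counterpart of this structure, as a sketch only. The field names are copied from the list above; the Python types, the defaults, and the placement of indegree inside params are my assumptions. toplot and parEnv are left out because their purpose is unclear.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Params:
    Pr: Sequence[float]              # priors for hyperparameters
    inference: int                   # 1 for EM, 2 for MCMC
    sparse: bool = False             # only meaningful for EM
    N: Optional[int] = None          # number of MC steps (MCMC or Hierarchical)
    temp: Optional[float] = None     # Hierarchical only
    fixtemp: Optional[bool] = None   # Hierarchical only
    indegree: int = 2                # ASSUMPTION: member of params, default of 2

    def __post_init__(self):
        if self.inference not in (1, 2):
            raise ValueError("inference must be 1 (EM) or 2 (MCMC)")


@dataclass
class CsiInput:
    data: List[List[float]]          # genes x columns
    genenames: List[str]
    genedesc: List[str]
    replicatenames: List[str]
    startingCols: List[List[int]]
    time_values: List[float]
    tf: List[int]                    # indices of transcription factor genes
    params: Params


params = Params(Pr=[1.0, 1.0], inference=1)
```

Matlab indices are 1-based, so tf and startingCols would need shifting by one if the values are carried over directly.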
run_CSI is still within the GUI; more investigation needed! It also
documents the contents of
data, which appear to be consistent with
what I inferred. The first stage prints out a description of the
options chosen along with a summary of the data size. Next the data,
D, is normalised (within each gene; mean=0, sd=1) and split into two
Yin , each
nreps long. The data for gene
ijk and replicate
i is extracted,
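The within-gene normalisation (mean=0, sd=1) could be sketched in Python as below. This assumes the sample standard deviation (dividing by n - 1), which is what Matlab's std uses by default; I have not checked which form run_CSI actually uses. normalise_gene is a hypothetical name and the input is toy data.

```python
from statistics import mean, stdev


def normalise_gene(values):
    """Shift and scale one gene's values to mean 0 and (sample) sd 1."""
    m = mean(values)
    s = stdev(values)    # sample sd (n - 1); an assumption, see note above
    return [(v - m) / s for v in values]


z = normalise_gene([1.0, 2.0, 3.0, 4.0, 5.0])
```

A gene with constant expression has sd 0 and would raise ZeroDivisionError here, so real code needs to decide how to handle that case.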
run_CSI appears to draw a big distinction between CSI and HCSI: file
handles are set up for preprocessing, running for a single gene, and
postprocessing. Focusing on the simple vanilla EM CSI seems like a
good starting point; I will come back to other (more complicated?)
The preprocessing for CSI takes all the matrices from the nice cell
array and concatenates them back into a matrix! It leaves a structure
with two elements
.Y defined. There is a choice of whether
a serial or parallel run will be attempted, but both loop through
every gene, with the parallel version passing each gene off to a
separate task. It could be interesting to measure how much time is
spent distributing data around the cluster; probably not much, as the
data should be "small".
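In Python the serial and parallel runs could share one loop by swapping the mapping function, as in this sketch. fit_all_genes and fake_fit are stand-ins I made up; the real per-gene fit would go where fake_fit is. Threads are used only to keep the example simple; for CPU-bound fitting a ProcessPoolExecutor (same map interface) would be the more likely choice.

```python
from concurrent.futures import ThreadPoolExecutor


def fit_all_genes(genes, fit_one, executor=None):
    """Apply `fit_one` to every gene, serially or through an executor's map."""
    mapper = map if executor is None else executor.map
    return dict(zip(genes, mapper(fit_one, genes)))


def fake_fit(gene):
    # Stand-in for the per-gene CSI fit; just returns the length of the name.
    return len(gene)


genes = ["G1", "G10", "G100"]
serial = fit_all_genes(genes, fake_fit)
with ThreadPoolExecutor(max_workers=2) as pool:
    parallel = fit_all_genes(genes, fake_fit, executor=pool)
```

Both calls return results keyed by gene in the same order, so timing the two (for example with time.perf_counter) would give a first estimate of the distribution overhead.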
The meat of CSI is in
run_CSI.m), with a
big divide depending on whether sparse processing has been selected.
Starting with the non-sparse implementation, we calculate
Pa (parental set?) and pass off to
CSI_EM_v2 (WTF is the
_v2? Why isn't there just a "current" CSI EM algorithm
distributed? Chris says it's because of a lack of version control).
CSI_EM_v2 is in
Functions/General; it gets called from the GUI
with four parameters. The first half of the code seems to be decoding
parameters, probably for compatibility with different callers.
Parameters are then "randomly" initialised and
optimised to minimise
CSIEngine_v2. The results are then summarised
and returned to the caller.
CSIEngine_v2 is also in
Functions/General, and seems to be about
as far as we need to go. There are various hacks to deal with varargs
that appear unnecessary, followed by lots of calls to the GPML library
to get marginal likelihoods. There appear to be
_v3 versions of lots
of functions; after speaking with Chris, these are likely the versions
that came from Ahmed Shifaz and are only partially incorporated into