(backup fork since google code is going down)
Latest commit cbe31e1 Mar 23, 2013 timo.erkkila@gmail.com Small tweaks

README

#summary Manual pages.

*The manual pages have been written on the basis of RF-ACE version 0.5.5*

= Description =

RF-ACE is an efficient C++ implementation of a robust machine learning algorithm for uncovering multivariate associations from large and diverse data sets. RF-ACE natively handles numerical and categorical data with missing values, and potentially large quantities of noninformative features are handled gracefully utilizing artificial contrast features, bootstrapping, and p-value estimation.

= Installation =

Download the latest stable release from the [http://code.google.com/p/rf-ace/downloads/list download page], or check out the latest development version (to directory rf-ace/) by typing
{{{
svn checkout http://rf-ace.googlecode.com/svn/trunk/ rf-ace
}}}

Build files for Linux (`Makefile`) and for Visual Studio on Windows (`make.bat`) are provided in the package. On Linux, you can compile the program by typing
{{{
make
}}}
or
{{{
make rf_ace
}}}

On Windows with Visual Studio, first open the Visual Studio terminal and execute `make.bat` by typing
{{{
make
}}}
Simple as that! If you feel lucky, check for compiled binaries at the [http://code.google.com/p/rf-ace/downloads/list download page]. 

= Supported data formats =
RF-ACE currently supports two file formats, Annotated Feature Matrix (AFM) and Attribute-Relation File Format (ARFF).

== Annotated Feature Matrix (AFM) ==

An Annotated Feature Matrix represents the data as a tab-delimited table in which both rows and columns carry headers describing the samples and features. Based on the headers, the AFM reader is able to discern the orientation of the matrix (features as rows or as columns). Specifically, AFM feature headers must encode whether the feature is (`N`)umerical, (`C`)ategorical, (`O`)rdinal, or (`B`)inary, followed by a colon and the actual name of the feature, as follows:

 * `B:is_alive`
 * `N:age`
 * `C:tumor_grade`
 * `O:anatomic_organ_subdivision`

In fact any string, even including colons, spaces, and other special characters, encodes a valid feature name as long as it starts with the preamble `N:`/`C:`/`O:`/`B:`. Thus, the following is a valid feature header:

 * `N:GEXP:TP53:chr17:123:456`

Sample headers are not constrained, except that they must not contain the preambles `N:`/`C:`/`O:`/`B:`, which are reserved for the feature headers.
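
The header rules above can be illustrated with a short Python sketch. It is not taken from the RF-ACE sources: the function names `parse_feature_header` and `guess_orientation` are hypothetical, and the orientation check is only one plausible way a reader could use the preambles.

```python
# Type codes as listed in the AFM description above.
FEATURE_TYPES = {"N": "numerical", "C": "categorical", "O": "ordinal", "B": "binary"}

def parse_feature_header(header):
    """Return (type_name, feature_name), or None if there is no feature preamble."""
    # Split on the first colon only, because the name itself may contain colons.
    code, sep, name = header.partition(":")
    if sep and code in FEATURE_TYPES:
        return FEATURE_TYPES[code], name
    return None

def guess_orientation(row_headers, column_headers):
    """Hypothetical helper: decide which axis carries the feature headers."""
    if row_headers and all(parse_feature_header(h) for h in row_headers):
        return "features as rows"
    if column_headers and all(parse_feature_header(h) for h in column_headers):
        return "features as columns"
    return "unknown"

for h in ["B:is_alive", "N:GEXP:TP53:chr17:123:456", "sample_01"]:
    print(h, "->", parse_feature_header(h))

print(guess_orientation(["N:age", "B:is_alive"], ["sample_01", "sample_02"]))
```

Note that this sketch only looks at the start of a header; it does not enforce the rule that sample headers must avoid the preambles.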

== Attribute-Relation File Format (ARFF) ==

[http://www.cs.waikato.ac.nz/~ml/weka/arff.html ARFF specification].      

= Usage =
The following examples follow Linux syntax. Type 
{{{
bin/rf_ace --help
}}}
or 
{{{
bin/rf_ace -h
}}}
to bring up help:
{{{
REQUIRED ARGUMENTS:
 -I / --input        input feature file (AFM or ARFF)
 -i / --target       target, specified as integer or string that is to be matched with the content of input
 -O / --output       output association file

OPTIONAL ARGUMENTS:
 -n / --ntrees       number of trees per RF (default nsamples/nrealsamples)
 -m / --mtry         number of randomly drawn features per node split (default sqrt(nfeatures))
 -s / --nodesize     minimum number of train samples per node, affects tree depth (default max{5,nsamples/20})
 -p / --nperms       number of Random Forests (default 50)
 -t / --pthreshold   p-value threshold below which associations are listed (default 0.1)
 -g / --gbt          Enable (1 == YES) Gradient Boosting Trees, a subsequent filtering procedure (default 0 == NO)
}}} 

So all that is required is an input file (`-I/--input`), either of type `.arff` or `.afm`, a target (`-i/--target`) to build the RF-ACE model upon, and an output file (`-O/--output`) for the associations. The target corresponds to a feature in the input file, and it can be identified either by an index giving its order of appearance in the file or by its name. Thus, if the target is `N:age` (we would be looking for features associated with age), located on row `123` (0-based and omitting the header row), one can execute RF-ACE by typing
{{{
bin/rf_ace --input featurematrix.afm --target 123 --output associations.tsv 
}}}
or with the short-hand notation equivalently as
{{{
bin/rf_ace -I featurematrix.afm -i 123 -O associations.tsv 
}}}
or by using the header "N:age" instead of the index by typing
{{{
bin/rf_ace -I featurematrix.afm -i N:age -O associations.tsv
}}}
If a provided (sub)string identifies multiple target candidates, RF-ACE will be executed serially for all target candidates, and the results are concatenated in the specified output file.
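
The two ways of naming a target can be sketched in Python as follows. This is an illustration only: `resolve_targets` is a hypothetical helper, and the exact matching rules of RF-ACE (for example case sensitivity) are not documented here, so plain case-sensitive substring matching is an assumption.

```python
def resolve_targets(target, headers):
    """Return the 0-based indices of the features selected by `target`.

    `headers` is the list of feature headers in file order (header row omitted).
    """
    # A target that parses as an integer is treated as a 0-based index.
    try:
        index = int(target)
    except ValueError:
        index = None
    if index is not None:
        if not 0 <= index < len(headers):
            raise IndexError(f"target index {index} is out of range")
        return [index]
    # Otherwise it is treated as a (sub)string; every match becomes a target.
    return [i for i, header in enumerate(headers) if target in header]

# Made-up feature headers for illustration.
headers = ["B:is_alive", "N:age", "N:age_at_diagnosis"]
print(resolve_targets("1", headers))         # one target, selected by index
print(resolve_targets("is_alive", headers))  # one target, selected by substring
print(resolve_targets("age", headers))       # two candidates, to be run serially
```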

The above will execute RF-ACE with the default parameters; as the help documentation points out, most of the parameters are estimated dynamically based on the data dimensions and content, so running RF-ACE with no information about the algorithm itself is possible.
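
When RF-ACE is driven from a script, the defaults can be overridden by appending the optional arguments from the help text. The sketch below assembles such a command line in Python; `build_rf_ace_command` is a hypothetical helper, and only the long option names shown in the help output are used.

```python
import shlex

# Long names of the optional arguments listed in the help text.
OPTIONAL_ARGUMENTS = {"ntrees", "mtry", "nodesize", "nperms", "pthreshold", "gbt"}

def build_rf_ace_command(input_path, target, output_path, binary="bin/rf_ace", **options):
    """Return the argument list for one RF-ACE run."""
    command = [binary, "--input", input_path, "--target", str(target), "--output", output_path]
    for name, value in sorted(options.items()):
        if name not in OPTIONAL_ARGUMENTS:
            raise ValueError(f"unknown RF-ACE option: {name}")
        command += ["--" + name, str(value)]
    return command

command = build_rf_ace_command("featurematrix.afm", "N:age", "associations.tsv",
                               nperms=20, pthreshold=0.05)
print(shlex.join(command))
# To actually launch the program: subprocess.run(command, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems with feature names that contain spaces or other special characters.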

= Output = 
The following call (assuming now the substring `age` uniquely identifies just one feature, `N:age`)
{{{
bin/rf_ace -I featurematrix.afm -i age -O associations.tsv
}}}
produces the output
{{{


 ---------------------------------------------------------------
| RF-ACE -- efficient feature selection with heterogeneous data |
|                                                               |
|  Version:      RF-ACE v0.5.5, July 4th, 2011                  |
|  Project page: http://code.google.com/p/rf-ace                |
|  Contact:      timo.p.erkkila@tut.fi                          |
|                kari.torkkola@gmail.com                        |
|                                                               |
|              DEVELOPMENT VERSION, BUGS EXIST!                 |
 ---------------------------------------------------------------

Reading file 'featurematrix.afm'
File type is unknown -- defaulting to Annotated Feature Matrix (AFM)
AFM orientation: features as rows

RF-ACE parameter configuration:
  --input      = featurematrix.afm
  --nsamples   = 223 / 282 (20.922% missing)
  --nfeatures  = 48912
  --targetidx  = 123, header 'N:age'
  --ntrees     = 356
  --mtry       = 221
  --nodesize   = 12
  --nperms     = 50
  --pthresold  = 0.1
  --output     = associations.tsv

Growing 50 Random Forests (RFs), please wait...
  RF 1: 4880 nodes (avg. 13.7079 nodes / tree)
  RF 2: 4810 nodes (avg. 13.5112 nodes / tree)
  RF 3: 4856 nodes (avg. 13.6404 nodes / tree)
  RF 4: 4994 nodes (avg. 14.0281 nodes / tree)
  RF 5: 5036 nodes (avg. 14.1461 nodes / tree)
  RF 6: 5016 nodes (avg. 14.0899 nodes / tree)
  RF 7: 5132 nodes (avg. 14.4157 nodes / tree)
...
  RF 47: 4736 nodes (avg. 13.3034 nodes / tree)
  RF 48: 5234 nodes (avg. 14.7022 nodes / tree)
  RF 49: 4582 nodes (avg. 12.8708 nodes / tree)
  RF 50: 5210 nodes (avg. 14.6348 nodes / tree)
50 RFs, 17800 trees, and 247516 nodes generated in 102.91 seconds (2405.17 nodes per second)
Gradient Boosting Trees *DISABLED*

Association file created. Format:
TARGET   PREDICTOR   P-VALUE   IMPORTANCE   CORRELATION

Done.
}}}
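
Several numbers in the log above can be cross-checked against the defaults listed in the help text. The Python sketch below redoes the arithmetic; the rounding directions (square root rounded down, `nsamples/20` rounded up and computed from the non-missing samples) are assumptions that happen to reproduce the logged values, not documented behaviour.

```python
import math

# Values copied from the run log above.
nsamples_real, nsamples_total = 223, 282
nfeatures = 48912
ntrees, nperms = 356, 50
total_nodes, seconds = 247516, 102.91
rf1_nodes = 4880

missing_pct = 100 * (nsamples_total - nsamples_real) / nsamples_total
mtry = math.isqrt(nfeatures)                      # assumed: sqrt(nfeatures) rounded down
nodesize = max(5, math.ceil(nsamples_real / 20))  # assumed: real samples, rounded up
total_trees = nperms * ntrees
avg_nodes_rf1 = rf1_nodes / ntrees
nodes_per_second = total_nodes / seconds

print(f"missing          = {missing_pct:.3f}%")       # log: 20.922%
print(f"mtry             = {mtry}")                   # log: 221
print(f"nodesize         = {nodesize}")               # log: 12
print(f"total trees      = {total_trees}")            # log: 17800
print(f"avg nodes, RF 1  = {avg_nodes_rf1:.4f}")      # log: 13.7079
print(f"nodes per second = {nodes_per_second:.2f}")   # log: 2405.17
```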

If no associations are found, the program ends as follows:
{{{
No significant associations found, quitting...
}}}
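
The association file can be post-processed with a few lines of Python. The sketch below assumes the file is tab-delimited with the five columns named in the log (`TARGET`, `PREDICTOR`, `P-VALUE`, `IMPORTANCE`, `CORRELATION`); the reader function and the sample rows are made up for illustration and are not real RF-ACE results.

```python
import csv
import io
from typing import NamedTuple

class Association(NamedTuple):
    target: str
    predictor: str
    p_value: float
    importance: float
    correlation: float

def read_associations(handle):
    """Parse tab-delimited association rows, skipping lines that do not fit."""
    associations = []
    for fields in csv.reader(handle, delimiter="\t"):
        if len(fields) != 5:
            continue
        target, predictor, p_value, importance, correlation = fields
        try:
            associations.append(Association(target, predictor, float(p_value),
                                            float(importance), float(correlation)))
        except ValueError:
            continue  # e.g. a header line, if one is present
    return associations

# Made-up rows standing in for the contents of associations.tsv.
sample = ("N:age\tB:is_alive\t0.05\t3.0\t-0.3\n"
          "N:age\tN:GEXP:TP53:chr17:123:456\t0.001\t12.5\t0.42\n")

rows = read_associations(io.StringIO(sample))
for row in sorted(rows, key=lambda r: r.p_value):
    print(row.predictor, row.p_value)
```

For a real run, replace the `io.StringIO` object with `open("associations.tsv", newline="")`.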

= RF-ACE configuration =

Information will be added in the future.