Skip to content
This repository

HTTPS clone URL

Subversion checkout URL

You can clone with HTTPS or Subversion.

Download ZIP

Suffix Trees in R via the libstree Clibrary

tree: 905e2ad2da

Fetching latest commit…

Cannot retrieve the latest commit at this time

README
This is a quickly written interface to a suffix tree library to
explore different aspects of external data and the use of suffix trees
in R for text manipulation.  Currently, we use the libstree
(http://www.cl.cam.ac.uk/~cpk25/libstree/, download from
http://www.cl.cam.ac.uk/~cpk25/downloads/libstree-0.4.0.tar.gz
or more recently from http://www.icir.org/christian/libstree/) source
code by Christian Kreibich. Installation of that library is relatively
straighforward as it is stand-alone, not depending on other libraries.
Unfortunately, to use it easily with another program, the libstree
code should be installed.  To put it someplace for which you do not
need special permissions, use

  cd libstree-0.4-0      # or whatever the relevant directory is.
  ./configure --prefix=$HOME/local
  make install 

Then use

   R CMD INSTALL --configure-args=--with-libstree=$HOME/local Rlibstree

and that should find the relevant header and libaries files.  
Specifically, the argument for --prefix when building and installing libstree
should be the same as the value for --with-libstree in the R CMD INSTALL.


The package provides access to StringSet and SuffixTree classes
which are external pointers, i.e. references to the C-level data structures.
The package provides an interface to the getLongestSubstring() facilities in 
libstree for finding the longest common and repeated substring of a given length.
These are the algorithms that are currently available in the libstree code.



In this R package, one can iterate over the elements in a StringSet
using lapply/sapply.  The operation can be given as either an R
function or a C routine (an object of class "NativeSymbol"). The C
routine can be obtained using getNativeSymbolInfo(symbolName).  We
want the address of that, but methods for lapply will coerce a
NativeSymbolInfo appropriately.  The ability to use a C routine is
intended to illustrate this facility in the R-C interface and also
for efficiency.  We can add the same for traversing the tree.




The longest substring algorithms are suboptimal in this library.  So
why are we using it?  Primarily because we hope the interface design
from R to the library will carry over to different implementations.
We have used this to provide an illustration of different aspects of
S4 classes, external pointers, using C routines in lapply.
We will probably move to a different implementation of suffix trees
if there is sufficient motivation.
Other possible libraries include
 Shlomo Yona's at  http://yeda.cs.technion.ac.il/~yona/suffix_tree/
and
 Stefan Kurtz's which is part of MUMmer from TIGR (The Institute for Genomic Research) available
 at sourceforge.net.
  (See the file  src/kurtz/streesrc/streeproto.h for the potentially relevant C routines)
 or in a different form 
  the wotdsrc.new from 
     http://bibiserv.techfak.uni-bielefeld.de/download/tools/wotd.html
  associated with the paper
     http://www.zbh.uni-hamburg.de/staff/kurtz/papers/GieKurSto2003.pdf

I have no experience with any of these yet, so take the pointers as merely
places to explore.



Something went wrong with that request. Please try again.