-
Notifications
You must be signed in to change notification settings - Fork 0
Automatically exported from code.google.com/p/lzbg
License
hdbn/lzbg
Folders and files
| Name | Name | Last commit message | Last commit date | |
|---|---|---|---|---|
Repository files navigation
-------------------------------------------------------------
This repository contains implementations of a new, simple,
and fast approach for LZ77 factorization.
-------------------------------------------------------------
Simple explanation of approach
-------------------------------------------------------------
Our approach basically computes the PSV and NSV values
based on the suffix array SA, where
PSV[i] = max { j < i | SA[j] < SA[i] }
NSV[i] = min { j > i | SA[j] < SA[i] }
which can be computed by a process called peak elimination.
Then,
LPF[i] = max { lcp(T[j:N], T[SA[i]:N]) | j < SA[i] }
= max { lcp(T[SA[PSV[i]]:N], T[SA[i]:N]),
lcp(T[SA[NSV[i]]:N], T[SA[i]:N]) }.
Where, by definition, an LZ77 factor starting at position
SA[i] in the text will have length LPF[i].
(above description is in lex order, but text order is
considered depending on the variant).
Most previous algorithms compute LPFs for all 1 \leq i \leq N.
However, our algorithm avoids computing all of these values,
but rather computes the LZ77 parse directly from
PSV and NSV, making the algorithm simpler and faster.
There are 6 variants: BGS, BGL, BGT and iBGS, iBGL, and iBGT.
All run in linear time, given the suffix array of the string.
Working memory is as follows:
BGS, iBGS: 17*N Bytes + stack
BGL, iBGL: 17*N Bytes
BGT, iBGT: 13*N Bytes
-------------------------------------------------------------
Files
-------------------------------------------------------------
Each of the above variants are coded in the following files:
bgCommon.hpp: header file for common utility functions
bgCommon.cpp: implementation of common utility functions
bgsMain.cpp: peak elimination in lex order using stack
ibgsMain.cpp: peak elimination in lex order using stack, interleaving PSV,NSV
bglMain.cpp: peak elimination in lex order
ibglMain.cpp: peak elimination in lex order, interleaving PSV,NSV
bgtMain.cpp: peak elimination in text order with the help of \Phi
ibgtMain.cpp: peak elimination in text order with the help of \Phi, interleaving PSV,NSV
where interleaving PSV, NSV means that they are stored in a
single array PNSV of length 2*N and PNSV[2*i] = PSV[i] and PNSV[2*i+1] = NSV[i].
Computational experiments so far have shown that iBGS seems to be
fastest except for extremely repetitive data, where iBGT seems to be fastest.
We also provide implementation for the algorithm shown in:
E. Ohlebusch & S. Gog, "Lempel-Ziv Factorization Revisited",
In Proc. CPM 2011, LNCS 6661:15-26, 2011.
(our implementation is based on the pseudo-code in the paper,
and the performance is basically equal to the implementation
that Dr. Simon Gog shared with us.)
These are implemented in:
ogMain.cpp: LZ_OG
iogMain.cpp: LZ_OG with interleaving LPS and PrevOcc
LZ_OG requires 13*N Bytes of working memory.
The files:
divsufsort.h
divsufsort.c
are from libdivsufsort-lite by Yuta Mori, for computing the
suffix array of a string.
-------------------------------------------------------------
Usage:
-------------------------------------------------------------
The programs can be compiled via the SCons build tool:
http://scons.org
just type 'scons' in the directory.
The SConstruct file should be edited if you need to modify
options passed to the compiler.
The following executables will be produced:
lzBGL
lzBGS
lzBGT
lzOG
lziBGL
lziBGS
lziBGT
lziOG
All usage is the same for all the programs:
Usage : ./lzBGL [options]
Options:
-f iFile : file to process
-x : use iFile + '.sa' for suffix array cache
-g : check if resulting factorization produces input string
if -x is specified, the program will also look for a file with
extension '.sa' appended to the input filename, and use it as the
suffix array if it exists, or create/overwrite the file if it does
not exist or seems to be out of date.
The LZ factorization is returned in:
std::vector<std::pair<int,int> > lz;
which is a sequence of
(length of factor, previous occurrence) if LPF > 0, or (0, T[p]).
Currently the programs output only simple statistics based on lz.
-------------------------------------------------------------
Authors:
Hideo Bannai
Keisuke Goto
-------------------------------------------------------------
About
Automatically exported from code.google.com/p/lzbg
Resources
License
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published