digest

This is a program which builds suffix array with LCP and BWT for large collection of files, where each file consists of short lines

Each suffix in each line is indexed

The main feature is that program uses as little memory as possible and stores inputs, intermediate data and outputs on disk

This makes it suitable for very large text collections

In particular, it is designed to handle short reads of DNA - sequences up to 100 characters over alphabet {a,c,g,t}, but it can easily be modified to handle any alphabet.

The software runs on any operating system.

However, the first step is preprocessing of files into numbered cleaned collection is unix-only (GNU), as it uses library <fts.h>. This header file is missing on some platforms: AIX 5.1, HP-UX 11, IRIX 6.5, OSF/1 5.1, Solaris 11 2011-11, mingw, MSVC 9, BeOS.

Compile: make

How to use

Preprocess your files placed in a particular folder "folder_path/folder_name". Create folder "inputs" and run: ./prepareonunix folder_path/folder_name This creates a collection of files in folder "inputs" which are cleaned and prepared for further processing

Software consists of two major steps: Step 1: Sortig suffixes of each line in each separate file and writing them on disk, together with the binary representation of each suffix. Each file with sorted suffixes, file numbers and BWT chars is a run. Step 2: External memory multiway merge sort for runs produces a single output file, which contains all suffixes of the collection in lexicographic order, with attached file numbers and BWT chars.

Single machine processing

The steps can be run separately: ./buildsaallfiles inputs/input_ <num_files> ./mergesasinglefile inputs/input_ 'num_files' 'memory_bytes_for_all_input_buffers' 'elements_in_output_buffer'

or as a single program ./buildandmerge inputs/input_ 'num_files' 'memory_bytes_for_all_input_buffers' 'elements_in_output_buffer'

two last parameters define memory usage during the merge

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
LICENCE.txt		LICENCE.txt
Makefile		Makefile
README.md		README.md
SuffixArray.h		SuffixArray.h
bit_operations.c		bit_operations.c
divsufsort.c		divsufsort.c
divsufsort.h		divsufsort.h
main.c		main.c
merge_external.c		merge_external.c
merge_sa_lcp_filenames.c		merge_sa_lcp_filenames.c
merge_sa_main.c		merge_sa_main.c
prefix_compare.c		prefix_compare.c
prepare_inputs.c		prepare_inputs.c
sa_all_main.c		sa_all_main.c
sa_mori.c		sa_mori.c
sort_sa_main.c		sort_sa_main.c
suffix_array_all_files.c		suffix_array_all_files.c
suffix_array_single_file.c		suffix_array_single_file.c

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

digest

About

Releases

Packages

Languages

License

mgbarsky/digest

Folders and files

Latest commit

History

Repository files navigation

digest

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages