Skip to content

LINGLI97/CPM

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

52 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Contextual Pattern Mining and Counting

Overview

This repository contains the implementation of several algorithms proposed in our paper Contextual Pattern Mining and Counting, including IM Method, EM Method, CPC Index without optimization, CPRI Method, and CPC Index Method. The build process is managed using GNU Make, with modular rules defined in the Makefile for compiling and cleaning.


Directory Structure

.
├── EM-SparsePhi-0.2.0/           # lib for external LCP construction of EM
├── include/                      # Header files directory
├── psascan/                      # lib for external SA construction of EM
├── stxxl/                        # STXXL library includes and binaries
├── baseline_mining.cpp           # Source file for IM method
├── ExternalMining.cpp         	  # Source file for EM method
├── baseline_reporting.cpp        # Source file for CPR Index method
├── counting.cpp                  # Source file for CPC index without optimization method
├── count_stabbed_char_rtree.cpp  # Source file for CPC index method
├── ...                           # other source files
├── Makefile.32-bit.gcc           # Makefile for 32-bit compilation
├── Makefile.64-bit.gcc           # Makefile for 64-bit compilation

Dependencies

Required Tools

  • GCC Compiler (supporting C++17)
  • GNU Make (build system)

Required Libraries

  • SDSL Library:

    • Includes libsdsl, libdivsufsort, and libdivsufsort64.
    • please install sdsl.
    • ./pre-install.sh
      
  • STXXL Library:

    • please install STXXL.
    • Required for external memory operations.
  • Boost:

    • please install Boost.
    • Required for building range counting data structure.
  • EM-SparsePhi-0.2.0:

  • psascan:

    • please install psascan.
    • Required for constructing SA externally.
  • kkp:

    • please install kkp.
    • Required for computing the LZ77 factorization used in CPC index.


Ensure the paths for these libraries are correctly configured in the Makefile.


Build Instructions

  1. Navigate to the project directory:

    cd /path/to/project
  2. Use the appropriate Makefile depending on your system architecture:

    • For 64-bit systems:
      make -f Makefile.64-bit.gcc
    • For 32-bit systems:
      make -f Makefile.32-bit.gcc
  3. To build specific components, use the corresponding target:

    • IM Method:

      make -f Makefile.64-bit.gcc run_IM
    • EM Method:

      make -f Makefile.64-bit.gcc run_EM
    • CPC Index without optimization (CI-):

      make -f Makefile.64-bit.gcc run_CI_minus
    • CPRI Method:

      make -f Makefile.64-bit.gcc run_CPRI
    • CPC Index Method:

      make -f Makefile.64-bit.gcc run_CPC
  4. R Index competitor

    • Please refer to here for the implementation of R Index.

Usage


CPM Problem

1. EM Method

Run EM Method with the following command:

./run_EM -f input.txt -x 6 -y 6 -m 3 -t 1000
  • Explanation:
    • -f input.txt: Specifies the input text file to process.
    • -x 6: Specifies the context size in one dimension (e.g., rows).
    • -y 6: Specifies the context size in another dimension (e.g., columns).
    • -m 3: Specifies the minimum pattern length for mining.
    • -t 1000: Specifies the frequency threshold for patterns.

2. IM Method

Run IM Method with the following command:

./run_IM -f input.txt -x 6 -y 6 -m 3 -t 1000
  • Explanation:
    • Similar to run_EM, but uses the IM method.

CPC Problem

1.R-Index method:

  • Build the R-Index:
./ri-build input.txt
  • Locate patterns:
./ri-locate -c input.txt input.txt.ri pattern.txt_r_index  6 6
  • Explanation:
    • -c input.txt: The input text file.
    • input.txt.ri: The R-Index file generated using ri-build.
    • pattern.txt_r_index: The file containing patterns for R-Index querying.
    • 6 6: Context sizes for the query.

2.Run baseline reporting:

./run_CPRI -f input.txt -p pattern.txt

3.CPC Index without optimization (CI-):

./run_CI_minus -f input.txt -p pattern.txt
  • Explanation:
    • -f input.txt: Specifies the input text file to process.
    • -p pattern.txt: Specifies the file containing patterns to be searched.

4.CPC Index:

./run_CPC -f input.txt -p pattern.txt
  • Explanation:
    • Similar to run_CI_minus, but uses the CPC Index method for pattern counting.

Datasets

The preprocessed datasets are available at Google Drive.


Notes

  • Make sure the required libraries (SDSL, STXXL and so on) are properly installed and accessible from the paths specified in the Makefile.
  • Modify the Makefile variables (CFLAGS, LFLAGS, etc.) if your library paths or compiler configurations differ.

License

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors