This repository contains the implementation of several algorithms proposed in our paper Contextual Pattern Mining and Counting, including IM Method, EM Method, CPC Index without optimization, CPRI Method, and CPC Index Method. The build process is managed using GNU Make, with modular rules defined in the Makefile for compiling and cleaning.
.
├── EM-SparsePhi-0.2.0/ # lib for external LCP construction of EM
├── include/ # Header files directory
├── psascan/ # lib for external SA construction of EM
├── stxxl/ # STXXL library includes and binaries
├── baseline_mining.cpp # Source file for IM method
├── ExternalMining.cpp # Source file for EM method
├── baseline_reporting.cpp # Source file for CPR Index method
├── counting.cpp # Source file for CPC index without optimization method
├── count_stabbed_char_rtree.cpp # Source file for CPC index method
├── ... # other source files
├── Makefile.32-bit.gcc # Makefile for 32-bit compilation
├── Makefile.64-bit.gcc # Makefile for 64-bit compilation
- GCC Compiler (supporting C++17)
- GNU Make (build system)
-
SDSL Library:
- Includes
libsdsl,libdivsufsort, andlibdivsufsort64. - please install sdsl.
-
./pre-install.sh
- Includes
-
STXXL Library:
- please install STXXL.
- Required for external memory operations.
-
Boost:
- please install Boost.
- Required for building range counting data structure.
-
EM-SparsePhi-0.2.0:
- please install EM-SparsePhi-0.2.0.
- Required for constructing LCP externally.
-
psascan:
- please install psascan.
- Required for constructing SA externally.
-
kkp:
- please install kkp.
- Required for computing the LZ77 factorization used in CPC index.
Ensure the paths for these libraries are correctly configured in the Makefile.
-
Navigate to the project directory:
cd /path/to/project -
Use the appropriate Makefile depending on your system architecture:
- For 64-bit systems:
make -f Makefile.64-bit.gcc
- For 32-bit systems:
make -f Makefile.32-bit.gcc
- For 64-bit systems:
-
To build specific components, use the corresponding target:
-
IM Method:
make -f Makefile.64-bit.gcc run_IM
-
EM Method:
make -f Makefile.64-bit.gcc run_EM
-
CPC Index without optimization (CI-):
make -f Makefile.64-bit.gcc run_CI_minus
-
CPRI Method:
make -f Makefile.64-bit.gcc run_CPRI
-
CPC Index Method:
make -f Makefile.64-bit.gcc run_CPC
-
-
R Index competitor
- Please refer to here for the implementation of R Index.
Run EM Method with the following command:
./run_EM -f input.txt -x 6 -y 6 -m 3 -t 1000- Explanation:
-f input.txt: Specifies the input text file to process.-x 6: Specifies the context size in one dimension (e.g., rows).-y 6: Specifies the context size in another dimension (e.g., columns).-m 3: Specifies the minimum pattern length for mining.-t 1000: Specifies the frequency threshold for patterns.
Run IM Method with the following command:
./run_IM -f input.txt -x 6 -y 6 -m 3 -t 1000- Explanation:
- Similar to
run_EM, but uses the IM method.
- Similar to
1.R-Index method:
- Build the R-Index:
./ri-build input.txt- Locate patterns:
./ri-locate -c input.txt input.txt.ri pattern.txt_r_index 6 6- Explanation:
-c input.txt: The input text file.input.txt.ri: The R-Index file generated usingri-build.pattern.txt_r_index: The file containing patterns for R-Index querying.6 6: Context sizes for the query.
2.Run baseline reporting:
./run_CPRI -f input.txt -p pattern.txt3.CPC Index without optimization (CI-):
./run_CI_minus -f input.txt -p pattern.txt- Explanation:
-f input.txt: Specifies the input text file to process.-p pattern.txt: Specifies the file containing patterns to be searched.
4.CPC Index:
./run_CPC -f input.txt -p pattern.txt- Explanation:
- Similar to
run_CI_minus, but uses the CPC Index method for pattern counting.
- Similar to
The preprocessed datasets are available at Google Drive.
- Make sure the required libraries (SDSL, STXXL and so on) are properly installed and accessible from the paths specified in the Makefile.
- Modify the
Makefilevariables (CFLAGS,LFLAGS, etc.) if your library paths or compiler configurations differ.
License
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.