CGraphIndex [1] is a novel compressed self-index for labeled property multidigraphs. It for the first time achieves the kth-order entropy space bound for multidigraph properties (the dominant term in practice) and the 1st-order graph entropy for multidigraph structures. A self-index actually encodes the original input and thus there is no need to store the input separately.
CGraphIndex supports fundamental and navigational operations on the structures and on the properties in constant time, supports fast property extraction on vertices and edges, and efficiently supports typical interactive complex and business intelligence queries [2,3], as well as BFS and PageRank [4,5]. Experimental results on the tested large LDBC SNB data [6] demonstrate that substantial improvements in practice.
- Ubuntu 20.04
- GCC 10.0 or higher with C++20 standard support
- CMake3.15 or higher
- ninja-build
- oneTBB [7] and OpenMP [8] : for parallel primitive and to support multithreading
- BBHash [9] : for vertex ID remapping
- pSAscan [10] : for building suffix arrays in external memory
- divsufsort [11] : for building suffix arrays in memory
BBHash, pSAscan and divsufsort have been placed in the CGraphIndex/deps directory in source code form. OpenMP and oneTBB need to be installed by the user and make CMake be able to find the corresponding configuration files. OpenMP is typically installed with common compiler toolchains such as GCC. For oneTBB, you can install it via a package manager or an installer script available at onetbb-install .
To enable debug messages, use the following commands:
1 cd CGraphIndex
2 cmake --build "$(pwd)/build" --config Debug --target all -j $(expr $(nproc) / 2) --To compile in optimized mode, use the following commands:
1 cd CGraphIndex
2 cmake --build "$(pwd)/build" --config Release --target all -j $(expr $(nproc) / 2) --The above commands use half of the logical threads of the current machine to build, and when the build is completed, 4 executable files are generated in the CGraphIndex/build directory:
- create_init : used to preprocess LDBC datasets and split them into graph structure information, vertex property text, and edge property text.
- create_adjlist : used to construct the compressed adjacency list structure gStruct/gInStruct, the primary input of which is the TmpAdjTable file generated in the preprocessing.
- create_index : used to build the compressed self-indexes.
- run_benchmark : for batch testing.
It is recommended to use a Docker image of ldbc/datagen-standalone for dataset preparation, after pulling the image, run the following commands to generate a dataset of the specified size:
mkdir -p "sf${SF}"
docker run \
--mount type=bind,source="$(pwd)/sf${SF}",target=/out \
ldbc/datagen-standalone --parallelism 1 -- --format csv --scale-factor ${SF} –mode interactiveBefore running the above command, you need to set the variable SF in the shell, given the specified scale, for example, to generate a dataset of SF30, you need to set SF=30, and the generated files will be placed in the ./sf30 directory.
Use the create_init program to preprocess LDBC raw data by:
1 create_init <data_dir> <dst_dir> <cut_size>- data_dir is the directory where the LDBC raw datasets are stored.
- dst_dir indicates the directory where the preprocessed files are stored.
- cut_size represents the cut size of the vertex/edge property text in gigabytes.
Since the peak memory for building the GIndex index is about 18 times the size of the input text, it is not recommended to set the cut_size greater than 50GB on machines with 1TB of memory.
After the create_init operation is completed, several files are generated in the specified directory:
- Vertex_x/Edge_x : the text that holds the vertex/edge properties, which is the input to GIndex to generate VIndex_x / RIndex_x.
- Vrows.myg/Erows.myg : Holds the number of lines (i.e., the number) of the vertex/edge properties contained in each Vertex_x/Edge_x text. When you search for vertex/edge property data based on vertex Vid or edge Eid, you will determine on which line on which index file you need to search for based on the two files.
- Vkinds.myg : Hold information about vertex labels; details see the code comments on the Vdetail class.
- Ekinds.myg : Store edge-related information; details see the code comments on the Edetail class.
- Vmap.myg : Store all MPhash values.
- Emap.myg : Hold the fixed-length code for each edge relation property.
- Etype.myg : Store the edge relationship properties in increasing order of Eid, and uses a packed integer array to store.
- EHasdes.myg : The succinct bit array that indicates if an edge contains the residual properties, and supports the rank operation.
- TmpAdjTable : A temporary adjacency list file that is used to build a compressed adjacency list later.
Use create_adjlist to build a compressed adjacency list structure gStruct/gInStruct in the following ways:
1 create_adjlist <tmp_adj_table_dir> <dst_dir>- tmp_adj_table_dir : The directory where the TmpAdjTable is.
- dst_dir : The directory where the gStruct/gInStruct is generated after the build is complete.
After the build is complete, the file is stored in the dst_dir directory and the file name is AdjList.myg.
Use create_index to compress the text in the dataset using GIndex. How to use:
1 create_index <block_size> <sample_size> <dst_dir> <src_file1> [src_files...]- block_size : The chunk size of GIndex, the recommended setting is 128.
- sample_size : The sample size of the suffix array in GIndex, the recommended setting is 128.
- dst_dir : The directory for storing the generated GIndex compressed index.
- src_file1 ... : Text files that need to be processed, which are Vertex_x/Edge_x files resulting from the preprocessing.
After the build is complete, the file is stored in the dst_dir directory with the suffix .geindex.
run_benchmark is used to test the query time performance of CGraphIndex, and output the time for the test in millisecond (ms), as follows:
1 run_benchmark <data_dir> <query> <input> <output> [max_num]- data_dir is the directory where CGraphIndex's data files are located.
- query is the name of the query to be made, data type string, and its value must be one of the following table:
- input is the name of the input file, and each line of the input file represents an input, and each parameter in the input is separated by a space.
- output is the name of the resulting output file to which the test results will be written.
- max_num is an optional parameter that indicates the maximum number of tests, and the final number of tests will not exceed the value of this parameter. If this parameter is not provided, every piece of data in the input file will be tested.
| query | descriptions |
|---|---|
| bfs | BFS |
| page_rank | PageRank |
| khops | k-hops friends query |
| ic1 | LDBC SNB-interactive-complex-1 query |
| ic5 | LDBC SNB-interactive-complex-5 query |
| ic10 | LDBC SNB-interactive-complex-10 query |
| ic13 | LDBC SNB-interactive-complex-13 query |
| bi2 | LDBC SNB-business-intelligence-2 query |
| bi3 | LDBC SNB-business-intelligence-3 query |
| bi10 | LDBC SNB-business-intelligence-10 query |
| bi12 | LDBC SNB-business-intelligence-12 query |
| bi16 | LDBC SNB-business-intelligence-16 query |
| bi19 | LDBC SNB-business-intelligence-19 query |
Input File Format Description: Different from the substitution_parameter generated in LDBC that uses | The format of the split parameters, run_benchmark requires the input parameters to be separated by spaces, and a Python3 script convert_sep.py is provided in the CGraphIndex/scripts directory to complete the process, and the script usage is:
1 python3 convert_sep.py -p <filename>where filename is the input file to be processed.
[1] Hongwei Huo, Yongze Yu, Zongtao He, and Jeffrey Scott Vitter, Indexing Labeled Property Multidigraphs in Entropy Space, With Applications, November 2024. https://github.com/Hongweihuo-Lab/CGraphIndex
[2] Orri Erling et al., The LDBC Social Network Benchmark: Interactive Workload, In SIGMOD, pages 619–630, 2015. https://doi.org/10.1145/2723372.2742786
[3] Szárnyas et al., The LDBC Social Network Benchmark: Business Intelligence Workload. PVLDB 16(4): 877–890, 2022. https://doi.org/10.14778/3574245.3574270
[4] Lawrence Page and Sergey Brin and Rajeev Motwani and Terry Winograd, The PageRank Citation Ranking: Bringing Order to the Web, The Web Conference, 1999. https://api.semanticscholar.org/CorpusID:1508503
[5] Iosup et al., The LDBC Graphalytics Benchmark, 2023. https://arxiv.org/abs/2011.15028
[6] The LDBC Social Network Benchmark, 2024. https://arxiv.org/abs/2001.02299
[7] oneTBB, https://github.com/oneapi-src/oneTBB
[8] OpenMP, https://github.com/OpenMP
[9] Antoine Limasset, Guillaume Rizk, Rayan Chikhi, and Pierre Peterlongo, Fast and Scalable Minimal Perfect Hashing for Massive Key Sets, In SEA, pages 25:1–25:16, 2017. https://doi.org/10.4230/LIPIcs.SEA.2017.25
[10] Juha Kärkkäinen, Dominik Kempa, and Simon J. Puglisi, Parallel External Memory Suffix Sorting, In CPM, pages 329--342, 2015. https://doi.org/10.1007/978-3-319-19929-0_28
[11] Yuta Mori, A lightweight suffix-sorting library, 2008. https://github.com/y-256/libdivsufsort/