This project provides the experiment code and scripts of our paper: Reservoir Sampling over Joins.
- RSJoin: Code and scripts of our algorithms RSJoin and RSJoin_opt.
- SJoinMod: Modifications to the baseline SJoin in order to support more experiments.
The queries used in our experiments are included in Queries.
# build RSJoin/RSJoin_opt
cd RSJoin
mkdir release
cd release
cmake -DCMAKE_BUILD_TYPE=Release ..
cmake --build .
# clone SJoin
git clone git@github.com:InitialDLab/SJoin.git
# apply the modifications
cp -rf SJoinMod/* ./SJoin
# build SJoin
cd SJoin/sjoin
mkdir release
cd release
CXXFLAGS=-O2 CPPFLAGS=-DNDEBUG ../configure
make
# build TPC-DS data preprocess in SJoin
cd ../../tpcds_data_proc
make
- Download the Epinions dataset
curl -O https://snap.stanford.edu/data/soc-Epinions1.txt.gz > /dev/null 2>&1
gzip -d soc-Epinions1.txt.gz
tail -n +5 soc-Epinions1.txt > epinions.txt
rm -f soc-Epinions1.txt
sed -i "s/\t/,/g" epinions.txt
- Modify the inputFile and outputDir in Utils/GraphDataPreprocess.scala. Then run
scala -J-Xmx200g -J-Xms200g ./Utils/GraphDataPreprocess.scala
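The Epinions cleanup steps above (drop the header lines, then replace tabs with commas) can be tried on a small stand-in file before touching the real download. This is only a sketch: the file names sample.txt and sample_edges.txt are hypothetical, the four-line header is an assumption taken from the tail -n +5 command, and tr is swapped in for sed -i because not every sed implementation interprets \t.

```shell
# Work in a scratch directory so nothing in the repository is touched.
workdir=$(mktemp -d)

# Stand-in for soc-Epinions1.txt: 4 comment lines (assumed header length),
# followed by tab-separated edges.
printf '# header 1\n# header 2\n# header 3\n# FromNodeId\tToNodeId\n0\t4\n0\t5\n' \
    > "$workdir/sample.txt"

# Same two transformations as the README steps:
# skip the first 4 lines, then turn every tab into a comma.
tail -n +5 "$workdir/sample.txt" | tr '\t' ',' > "$workdir/sample_edges.txt"

# Show the cleaned edge list.
cat "$workdir/sample_edges.txt"
```

If the cleaned file contains only comma-separated node pairs and no lines starting with #, the same commands should be safe to apply to the full dataset.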
- Download the TPC-DS tool from the website and unzip.
- Run make under the $tpc-ds-tool/tools/ folder. ($tpc-ds-tool is the extracted folder)
- Run $tpc-ds-tool/tools/dsdgen [your options] to generate TPC-DS data
- Run create_qx_data $tpc-ds-tool/tools/ qx_sf10.dat in SJoin/tpcds_data_proc/
- Run create_qy_data and create_qz_data
- Clone LDBC SNB Datagen from git@github.com:ldbc/ldbc_snb_datagen_spark.git
- Build the Datagen following the instructions
- Run the following
export PLATFORM_VERSION=$(sbt -batch -error 'print platformVersion')
export DATAGEN_VERSION=$(sbt -batch -error 'print version')
export LDBC_SNB_DATAGEN_JAR=$(sbt -batch -error 'print assembly / assemblyOutputPath')
./tools/run.py --parallelism 1 --memory 180g -- --format csv --scale-factor 1 --mode raw
- Modify the inputDir and outputFile in Utils/SNBDataPreprocess.scala. Then run
scala -J-Xmx200g -J-Xms200g ./Utils/SNBDataPreprocess.scala
- Modify the $data_path in RSJoin/scripts/run_*.sh
- Run RSJoin/scripts/run_rsjoin.sh and RSJoin/scripts/run_rsjoin_opt.sh
- Run RSJoin/scripts/run_line3_input_size.sh
- Run RSJoin/scripts/run_line3_sample_size.sh
- Run RSJoin/scripts/run_line4_update_time.sh
- Run RSJoin/scripts/run_qz_scale_factor.sh
- Run RSJoin/scripts/run_qz_fk_scale_factor.sh
- Modify the $data_path in SJoin/scripts/run_*.sh
- Run SJoin/scripts/run_sjoin.sh and SJoin/scripts/run_sjoin_opt.sh
- Run SJoin/scripts/run_line3_input_size.sh
- Run SJoin/scripts/run_line3_sample_size.sh
- Run SJoin/scripts/run_line4_update_time.sh
- Run SJoin/scripts/run_qz_fk_scale_factor.sh
- Modify the $data_path in RSJoin/scripts/run_predicate*.sh
- Run RSJoin/scripts/run_predicate_density.sh and RSJoin/scripts/run_predicate_input_size.sh
Collect the results from the run_*.out files in the same folder.
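Before collecting results, it may help to check that every run produced a non-empty output file. The helper below is a minimal sketch, not part of the repository: check_outputs is a hypothetical name, and it only assumes that results are written as run_*.out files in the folder passed as its argument.

```shell
# Hypothetical helper: list the run_*.out files in a folder and flag empty ones.
check_outputs() {
    dir="$1"
    found=0
    for f in "$dir"/run_*.out; do
        # If the glob matched nothing, the pattern itself is returned; skip it.
        [ -e "$f" ] || continue
        found=$((found + 1))
        if [ -s "$f" ]; then
            # Non-empty file: report its line count (spaces stripped for portability).
            echo "ok     $f ($(wc -l < "$f" | tr -d ' ') lines)"
        else
            echo "EMPTY  $f"
        fi
    done
    echo "$found result file(s) found in $dir"
}
```

For example, check_outputs RSJoin/scripts prints one line per result file followed by a total, so an experiment that wrote nothing to its output file is easy to spot.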
Line-k joins and Star-k joins are supported for any k > 1. See line_joins and star_joins.
You can implement acyclic joins using the JoinTreeTemplate. See Q10Algorithm as an example.