
LDBC SNB Datagen (Spark-based)

Datagen is part of the LDBC project.

📜 If you wish to cite the LDBC SNB, please refer to the documentation repository.

⚠️ There are two different versions of the Datagen:

  • The Hadoop-based Datagen generates the Interactive SF1-1000 data sets
  • For the BI workload, use the Spark-based Datagen (in this repository).
  • For the Interactive workload's larger data sets, there is no out-of-the-box solution (see this issue).

The LDBC SNB Data Generator (Datagen) is responsible for providing the data sets used by all the LDBC benchmarks. It is designed to produce directed labelled graphs that mimic the characteristics of real-world graphs. A detailed description of the schema produced by Datagen, as well as the format of the output files, can be found in the latest version of the official LDBC SNB specification document.

Generated small data sets are deployed by the CI.

Quick start

Build the JAR

You can build the JAR with both Maven and SBT.

  • To assemble the JAR file with Maven, run:

    tools/build.sh
  • For faster builds during development, consider using SBT. To assemble the JAR file with SBT, run:

    sbt assembly

    ⚠️ When using SBT, change the path of the JAR file in the instructions provided in the README (target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}.jar -> ./target/scala-2.11/ldbc_snb_datagen-assembly-${DATAGEN_VERSION}.jar).
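
    For example, with the SBT-built JAR the run command used later in this README becomes (a sketch; the version value is the one used elsewhere in this README, adjust it to your build):

    export DATAGEN_VERSION=0.4.0-SNAPSHOT
    tools/run.py ./target/scala-2.11/ldbc_snb_datagen-assembly-${DATAGEN_VERSION}.jar <runtime configuration arguments> -- <generator configuration arguments>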

Install tools

Some of the build utilities are written in Python. To use them, you have to create a Python virtual environment and install the dependencies.

E.g. with pyenv and pyenv-virtualenv:

pyenv install 3.7.7
pyenv virtualenv 3.7.7 ldbc_datagen_tools
pyenv local ldbc_datagen_tools
pip install -U pip 
pip install ./tools
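
Alternatively, a plain virtual environment created with Python's built-in venv module works as well (a minimal sketch, assuming a Python 3.7 interpreter as above):

python3 -m venv .venv
source .venv/bin/activate
pip install -U pip
pip install ./tools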

Running locally

The tools/run.py script is intended for local runs. To use it, download and extract Spark as follows.

Spark 3.1.x

Spark 3.1.x is the recommended runtime. The rest of the instructions assume Spark 3.1.x.

curl https://downloads.apache.org/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz | sudo tar -xz -C /opt/
export SPARK_HOME="/opt/spark-3.1.2-bin-hadoop3.2"
export PATH="$SPARK_HOME/bin":"$PATH"

Both Java 8 and Java 11 work.
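
To verify which Spark and Java versions are picked up from the PATH, you can run:

spark-submit --version
java -version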

To build, run

tools/build.sh

Run the script with:

export PLATFORM_VERSION=2.12_spark3.1
export DATAGEN_VERSION=0.4.0-SNAPSHOT
tools/run.py ./target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}.jar <runtime configuration arguments> -- <generator configuration arguments>

Older Spark versions

Spark 2.4.x

Spark 2.4.x with Hadoop 2.7 (Scala 2.11 / JVM 8) is supported, but it is recommended to switch to Spark 3.

curl https://archive.apache.org/dist/spark/spark-2.4.8/spark-2.4.8-bin-hadoop2.7.tgz | sudo tar -xz -C /opt/
export SPARK_HOME="/opt/spark-2.4.8-bin-hadoop2.7"
export PATH="$SPARK_HOME/bin":"$PATH"

Make sure you use Java 8.
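
If several JDKs are installed, point JAVA_HOME at a Java 8 installation before building and running (the path below is only an example; adjust it to your system):

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64   # example path, adjust to your JDK 8 installation
export PATH="$JAVA_HOME/bin:$PATH"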

To build, run

tools/build.sh -Pspark2.4

Run the script with:

export PLATFORM_VERSION=2.11_spark2.4
export DATAGEN_VERSION=0.4.0-SNAPSHOT

tools/run.py ./target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}.jar <runtime configuration arguments> -- <generator configuration arguments>

Runtime configuration arguments

The runtime configuration arguments determine the amount of memory, the number of threads, and the degree of parallelism. For a list of arguments, see:

tools/run.py --help

To generate a single part-*.csv file, reduce the parallelism (number of Spark partitions) to 1.

./tools/run.py ./target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}.jar --parallelism 1 -- --format csv --scale-factor 0.003 --mode interactive

Generator configuration arguments

The generator configuration arguments control the output directory, output format, layout, etc.

To get a complete list of the arguments, pass --help to the JAR file:

./tools/run.py ./target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}.jar -- --help
  • Passing params.ini files:

    ./tools/run.py ./target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}.jar -- --format csv --param-file params.ini
  • Generating CsvBasic files in Interactive mode:

    ./tools/run.py ./target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}.jar -- --format csv --scale-factor 0.003 --explode-edges --explode-attrs --mode interactive
  • Generating CsvCompositeMergeForeign files in BI mode resulting in compressed .csv.gz files:

    ./tools/run.py ./target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}.jar -- --format csv --scale-factor 0.003 --mode bi --format-options compression=gzip
  • Generating CSVs in raw mode:

    ./tools/run.py ./target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}.jar -- --format csv --scale-factor 0.003 --mode raw --output-dir sf0.003-raw
  • For the interactive and bi modes, the --format-options argument allows passing formatting options such as timestamp/date formats, the presence/absence of headers (see the Spark formatting options for details), and whether to quote the fields in the CSV:

    ./tools/run.py ./target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}.jar -- --format csv --scale-factor 0.003 --mode interactive --format-options timestampFormat=MM/dd/YYYY\ HH:mm:ss,dateFormat=MM/dd/YYYY,header=false,quoteAll=true

To change the Spark configuration directory, adjust the SPARK_CONF_DIR environment variable.

A complex example:

export SPARK_CONF_DIR=./conf
./tools/run.py ./target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}.jar --parallelism 4 --memory 8G -- --format csv --format-options timestampFormat=MM/dd/YYYY\ HH:mm:ss,dateFormat=MM/dd/YYYY --explode-edges --explode-attrs --mode interactive --scale-factor 0.003
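
The ./conf directory referenced through SPARK_CONF_DIR may contain a spark-defaults.conf with additional Spark settings. A minimal sketch with illustrative values (these are standard Spark properties, not Datagen-specific defaults):

# ./conf/spark-defaults.conf (values are illustrative)
spark.local.dir                  /mnt/scratch/spark
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.driver.maxResultSize       4g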

Docker image

The Docker image can be built with the provided Dockerfile. To build, execute the following command from the repository directory:

tools/docker-build.sh

See Build the JAR to build the library. Then, run the following:

tools/docker-run.sh

Elastic MapReduce

We provide scripts to run Datagen on AWS EMR. See the README in the tools/emr directory for details.

Parameter generation

The parameter generator is currently being reworked (see relevant issue) and no parameters are generated by default. However, the legacy parameter generator is still available. To use it, run the following commands:

mkdir substitution_parameters
# for Interactive
paramgenerator/generateparams.py out/build/ substitution_parameters
# for BI
paramgenerator/generateparamsbi.py out/build/ substitution_parameters

Larger scale factors

The SF3k+ scale factors are currently being fine-tuned, both in terms of the generator's performance and the data distributions.

Graph schema

The graph schema is as follows (schema figure omitted; see the LDBC SNB specification document for a detailed description):

Troubleshooting

  • When running the tests, they might throw a java.net.UnknownHostException: your_hostname: your_hostname: Name or service not known coming from org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal. The solution is to add an entry of your machine's hostname to the /etc/hosts file: 127.0.1.1 your_hostname.
  • If you are using Docker and Spark runs out of space, make sure that Docker has enough space to store its containers. To move the location of the Docker containers to a larger disk, stop Docker, edit (or create) the /etc/docker/daemon.json file and add { "data-root": "/path/to/new/docker/data/dir" }, then sync the old folder if needed, and restart Docker. (See more detailed instructions).
  • If you are using a local Spark installation and run out of space in /tmp, set the SPARK_LOCAL_DIRS environment variable to point to a directory with enough free space (see the sketch after this list).
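
For the last point, a minimal sketch (the directory is illustrative; pick any location on a disk with enough free space):

export SPARK_LOCAL_DIRS=/path/to/dir/with/enough/space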